Replace the Java-based HTML indexer with a Python-based one.
The Java-based HTML indexer was based on stemming, which
can reduce the size of the index notably. However, stemming
produces poor search behaviour and is language-specific.
For example, the word 'documentation' has a stem of 'document'.
When a user searches for 'docu', 'document' is not found. Also,
all punctuation was stripped, making searches for IP addresses
or depot paths impossible.
The Python-based indexer collects all whitespace-separated
tokens. The resulting index is around 10% larger than the
stemmed index, but permits much more reasonable results,
particularly while the user is still typing in search terms.
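As a rough illustration of the token-based approach (not the indexer itself), the sketch below builds an index from whitespace-separated tokens and looks tokens up by prefix. It assumes the page text has already been extracted from the HTML; the names build_index and prefix_matches, the lowercasing step, and the sample pages are all hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each whitespace-separated token to {page: occurrence count}."""
    index = defaultdict(dict)
    for page, text in pages.items():
        # Lowercasing is an assumption; punctuation is deliberately kept,
        # so IP addresses and depot paths survive as whole tokens.
        for token in text.lower().split():
            index[token][page] = index[token].get(page, 0) + 1
    return dict(index)


def prefix_matches(index, prefix):
    """Return the indexed tokens that start with what the user has typed."""
    prefix = prefix.lower()
    return sorted(token for token in index if token.startswith(prefix))


# Hypothetical sample pages, for illustration only.
pages = {
    "intro.html": "Documentation for depot //depot/main/... on 10.0.0.1",
    "admin.html": "Server document root",
}
index = build_index(pages)
print(prefix_matches(index, "docu"))
print("10.0.0.1" in index)
```

Because nothing is stemmed, a partial term such as 'docu' can match both 'document' and 'documentation', at the cost of a somewhat larger index.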
The Python-based indexer should be cross-platform, but has so
far been tested only on Mac OS X. It should work as-is on
Linux, but further work may be required on Windows or other
platforms. Also, there is room for optimization, particularly when
a large number of HTML documents are to be indexed.
Searching multiple terms is possible, and each HTML page must
match all entered terms. The search terms are handled simply:
there are no conjunctions, phrase searching, etc.
Depending on user feedback, we may need to add more
sophistication in the future.
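The "match all entered terms" rule can be sketched as follows. This is an illustration in Python, not the client-side code; the function name search, the index layout (token mapped to per-page counts), the sample data, and the use of prefix matching are all assumptions.

```python
def search(index, query):
    """Return {page: match count} for pages that match every query term."""
    result = None
    for term in query.lower().split():
        hits = {}
        for token, page_counts in index.items():
            if token.startswith(term):  # prefix matching is an assumption
                for page, count in page_counts.items():
                    hits[page] = hits.get(page, 0) + count
        if result is None:
            result = hits
        else:
            # Keep only the pages that also matched all earlier terms.
            result = {p: result[p] + hits[p] for p in result if p in hits}
    return result or {}


# Hypothetical index, for illustration only.
index = {
    "depot": {"a.html": 2, "b.html": 1},
    "path": {"a.html": 1},
    "paths": {"b.html": 1, "c.html": 3},
}
print(search(index, "depot path"))
print(search(index, "depot missing"))
```

Each additional term can only narrow the set of pages, which is the whole of the logic: there is no OR, no NOT, and no phrase matching.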
Included in this change:
- the Python indexer
- removed use of Java-based indexer from common build.xml
  and applied the Python indexer
- added additional image src filtering
- better build commentary in certain targets
- rewrote the client-side searching
- fixed the content of the HTML <title> tags to be
  '<chapter> // <book>', or just '<book>' when necessary.
  Search results strip off the ' // <book>' to be concise.
- identify results as 'pages' rather than results, as each page
  may contain multiple matches.
- results now indicate how many matches to expect on each
  page.
- re-enabled indexing/search for CmdRef, P4API, P4Dist, and
  P4Guide.
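The title convention and the per-page result labels listed above can be sketched like this. It is an illustration only: the function names, the sample titles, and the exact label wording ("N matches") are hypothetical and not taken from the change itself.

```python
SEPARATOR = " // "


def page_title(book, chapter=None):
    """Build '<chapter> // <book>', or just '<book>' when there is no chapter."""
    return f"{chapter}{SEPARATOR}{book}" if chapter else book


def result_label(title, match_count):
    """Strip the trailing ' // <book>' and report the page's match count."""
    # rsplit leaves a book-only title untouched, since it has no separator.
    short_title = title.rsplit(SEPARATOR, 1)[0]
    noun = "match" if match_count == 1 else "matches"
    return f"{short_title} ({match_count} {noun})"


# Hypothetical titles, for illustration only.
print(page_title("Example Book", "Getting Started"))
print(result_label("Getting Started // Example Book", 3))
print(result_label("Example Book", 1))
```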