P4Search Copyright (c) 2014, Perforce Software
---

P4Search is built on the following technology stack:

    Apache Solr - combines Apache Lucene (fast indexing) and Apache Tika
        (file content/metadata parsing) to form the base of the index
        service

    P4Search - Java code that
        * optionally scans an existing Perforce Server for content on
          startup
        * indexes change commit content via Solr as it happens (via a
          server trigger)
        * optionally indexes all file revisions (except deletes) and
          tracks which is the most recent
        * provides an API for searching the Solr index via GET or POST
          requests
        * provides a minimal web UI for searching via a web browser

    jetty/Tomcat/etc. web application server - hosts both the Solr
        indexer application and the P4Search application

Operational theory
---

Initialization: P4Search walks through any depots it can reach, builds
lists of files to index, places them on a queue, and pulls items off
the queue for indexing. The service can be configured to scan all
revisions or just the head revision, skip certain file types, index
only metadata, and so on. The time this takes depends on the size of
your installation; the service can already be used to search whatever
it has indexed while the scan is still running. Once scanning is
complete, a key is set on the Perforce Server to prevent future scans
from being attempted, or to resume interrupted scans.
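
For example, once fileScannerTokenKey and fileScannerTokenValue are
configured (see Configuration below), you can check whether the initial
scan has finished by inspecting that key on the Perforce Server. The
key name below is only a placeholder; use whichever name you configured:

    # Placeholder key name -- substitute your fileScannerTokenKey value.
    # Prints the configured fileScannerTokenValue once the scan is done,
    # or 0 if the key has not been set yet.
    p4 key p4search.scan.complete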

Changelist trigger: When the search-queue.sh script is installed as a
trigger on the Perforce Server, it sends a curl request containing the
changelist number to P4Search. The data is written to a temporary
location for later processing. Changelist processing is largely the
same as the initialization phase: turn the changelist into a list of
files and queue them up for later processing.
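
Conceptually the trigger does something like the sketch below. The
endpoint path, port, and payload format are placeholders, not the real
ones; scripts/search-queue.sh defines what is actually sent, and the
token must match the searchEngineToken configurable:

    # Illustrative only -- copy the real URL and payload format from
    # scripts/search-queue.sh.
    CHANGE=12345                    # the %change% passed by the trigger
    TOKEN=some-shared-secret        # must match searchEngineToken
    curl -s --data "$TOKEN;$CHANGE" "http://localhost:8088/queue"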

Indexing: A p4 print of the file is performed and the output is sent to
the Apache Solr indexing service, which pulls out whatever data it can
and indexes it. The P4Search-specific metadata that is indexed is as
follows (see schema.xml):

    <field name="depotrevision" type="string" indexed="false" stored="true"/>
    <field name="filename" type="text_en" indexed="true" stored="true"/>
    <!-- path broken into parts, so you can efficiently search on e.g. a valued folder name -->
    <field name="depotpath" type="string" indexed="true" stored="true" multiValued="true" />
    <!-- for indexing all revisions, if this doc is the head rev -->
    <field name="headrevision" type="boolean" indexed="true" stored="true" multiValued="false" />
    <field name="modifiedby" type="string" indexed="true" stored="true" />
    <field name="modifiedtime" type="date" indexed="true" stored="true" />
    <field name="filesize" type="long" indexed="true" stored="true" />
    <!-- if you are configured to index attributes, digest comes along for free -->
    <field name="digest" type="string" indexed="true" stored="true" />
    <!-- file attributes go here, p4attr_ + the raw name to avoid potential conflicts -->
    <dynamicField name="p4attr_*" type="text_general" indexed="true" stored="true" />
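
Because these fields are stored as well as (mostly) indexed, the Solr
core can be queried directly with its standard select handler when
debugging. For example, assuming the default Solr port and a collection
named "p4collection" (a placeholder for your collectionName setting):

    # Find head revisions whose indexed filename matches "readme";
    # "p4collection" is a placeholder for the configured collectionName.
    curl "http://localhost:8983/solr/p4collection/select?q=filename:readme+AND+headrevision:true&wt=json"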

Search: P4Search first queries Apache Solr for results, then runs
p4 files as the supplied user to limit the results to what that user
would normally be able to see in the Perforce Server via any other
Perforce client application.
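
You can reproduce that permission check by hand. For a hypothetical hit
on //depot/secret/plan.txt#3, the result is only returned to user
"alice" if the equivalent command run as alice lists the file:

    # If the protections table hides the file from alice, this prints
    # "no such file(s)" and P4Search filters the hit out as well.
    p4 -u alice files //depot/secret/plan.txt#3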

Build and install
---

JDK 1.7 is required to build the project, and gradle (www.gradle.org)
must be in the $PATH. JDK 1.7, jetty, and Apache Solr are required to
run the service, and the p4 command line client must be in the path
when using install.sh. The only platform this process has been tested
on is Ubuntu Linux.

    $ROOT/search/build.sh - builds and tests the project (war file and
                            tgz package)

Once the project is successfully built, the $ROOT/search/tmp directory
contains both a build output directory ($OUTDIR) and a tgz containing
all of its contents. At the root of the output directory is the install
script (./install.sh), which will
    * prompt for some basic installation information
    * download jetty (8.1) and solr (4.5.1) if the required tarballs
      are not in the directory
    * unpack them and configure both for use by P4Search
    * create a start.sh and stop.sh for starting and stopping the
      services

The installation location is $OUTDIR/install by default. Run
$OUTDIR/install/start.sh to start both services and stop.sh to stop
them. To remove the installation, simply run "rm -rf $OUTDIR/install".
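
Putting those steps together, a first-time bootstrap looks roughly like
this ($ROOT and $OUTDIR are as described above):

    # Build the war file and tgz package.
    $ROOT/search/build.sh

    # Run the installer from the root of the build output directory;
    # it prompts for the installation details described above.
    cd $OUTDIR && ./install.sh

    # Start (and later stop) the jetty-hosted Solr and P4Search services.
    $OUTDIR/install/start.sh
    $OUTDIR/install/stop.sh

    # Remove the installation entirely.
    rm -rf $OUTDIR/install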

The P4Search tool uses a Perforce Server trigger (see
scripts/search-queue.sh or install/search-queue.sh) to receive new
changelists, so the trigger must be installed on the Perforce Server or
no indexing past the initial scan will take place.
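
A trigger definition added via "p4 triggers" might look something like
the line below. The trigger name, script location, and arguments are
placeholders only; see the comments in search-queue.sh for the
arguments it actually expects:

    # Hypothetical entry in the p4 triggers table (edit it by running
    # "p4 triggers" as a superuser); the arguments are illustrative.
    search.queue change-commit //... "/p4/common/search-queue.sh %change%"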

Configuration
---

The install script creates a search.config properties file in the
$OUTDIR/install/jetty-.../resources directory. Configuration changes
currently require the service to be restarted. Except for the server
configuration (serverProtocol, serverHost, serverPort, indexerUser,
indexerPassword), reasonable defaults are assumed for the other
properties (a sample file is sketched after the list below):

com.perforce.search...

    (general Solr configuration)
    searchEngine: URL of the Solr installation, e.g. http://localhost:8983
    searchEngineToken: magic token matching the Perforce Server trigger
    collectionName: Solr collection that contains the indexed data

    (general processing configuration)
    queueLocation: location of queued changelist files to be indexed
    maxSearchResults: maximum number of results returned by the service
    maxStreamFileSize: largest file size for which content indexing is
        attempted
    ignoredExtensions: file containing a CRLF-separated list of
        extensions to skip for content processing
    neverProcessList: file containing a CRLF-separated list of
        extensions to never index
    indexMethod: ALL_REVISIONS | HEAD_REVISIONS; HEAD_REVISIONS means
        only keep the index up to date with the latest revision
    blackFstatRegex: for Perforce attributes, which p4 attributes to
        skip (empty means do not index fstat data)
    whiteFstatRegex: for Perforce attributes, which p4 attributes to
        include (empty means anything not in the blacklist)
    changelistCatchupKey: key name where the latest processed
        changelist is stored; on startup the service will try to
        "catch up" based on this value

    (file scanner configuration)
    fileScannerTokenKey: key name indicating that initialization is
        complete; empty implies "do not scan"
    fileScannerTokenValue: key value indicating that initialization is
        complete; empty implies "do not scan"
    scanPaths: comma-separated paths in the Perforce Server to scan
    fileScannerThreads: number of threads handling the processing
    fileScannerSleep: used to throttle the scanner back
    fileScannerDepth: how many revisions down to go when scanning; 0
        implies all revisions, 1 is head only, etc.
    maxScanQueueSize: used to throttle the number of files in the scan
        queue

    (GUI configuration)
    commonsURL, swarmURL, p4webURL: URLs of services that can show the
        files via links in the web browser. The Swarm and P4Web settings
        are mutually exclusive, with preference given to the Swarm URL.
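
As a sketch only (the values below are made up, and the exact property
keys should be confirmed against the search.config that install.sh
generates), a minimal configuration might look like:

    # Perforce Server connection -- no useful defaults, must be set.
    com.perforce.search.serverProtocol=p4java
    com.perforce.search.serverHost=perforce.example.com
    com.perforce.search.serverPort=1666
    com.perforce.search.indexerUser=p4search
    com.perforce.search.indexerPassword=secret

    # Solr location and indexing behaviour.
    com.perforce.search.searchEngine=http://localhost:8983
    com.perforce.search.collectionName=p4collection
    com.perforce.search.searchEngineToken=some-shared-secret
    com.perforce.search.indexMethod=HEAD_REVISIONS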

API
---

See the API documentation included in the installation for how to gain
programmatic access to search results and how to send specific queries
to the search service. The P4Search web UI exercises the underlying
HTTP API using JavaScript XHR.
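
As a purely illustrative sketch, a query from the command line might
look like the request below; the endpoint path and parameter name are
placeholders, so substitute the ones given in the bundled API
documentation:

    # Placeholder path and parameter -- consult the installed API docs
    # for the real names before using this.
    curl "http://localhost:8088/search/api?query=main.c"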

Notes
---

* If you want to restrict Apache Solr access to certain IP addresses,
  you must add a handler to the etc/jetty.xml file in the Solr
  installation, e.g. org.eclipse.jetty.server.handler.IPAccessHandler.
  See the install.sh script for a hint on how to do this.

* If you suspect that the service has missed some changelists, one easy
  way to fix it is to script sending a number of curl requests to the
  server to force it to re-index the files in those changelists (see
  the sketch after these notes). Re-indexing files will not corrupt the
  integrity of the index; at worst it simply duplicates work.

* While Apache Solr has the ability to parse many types of files, you
  may find it useful to look through the Solr logs and determine
  whether you need additional extractors, e.g. xmpcore.jar for media
  files.

* This version installs with Apache Solr 4.5.1 by default. When testing
  against mp3 files I found that some mp3 files cause the Apache Tika
  parser in this version to hang the CPU. If you have similar problems
  or expect to index a lot of media files, you might consider an
  earlier version of Solr (Solr 4.3.1 used Apache Tika 1.3, which
  worked for me), replacing the Tika jar with version 1.5 (unreleased),
  or using the ...ignoredExtensions configurable to exclude problematic
  files.

* If you're curious about how the initial scan is doing, or about other
  things the server is doing, the easiest way to check what is
  happening is to tail the log file, e.g. run "tail -f start.log".