p4-search
Copyright (c) 2013, Perforce Software
---

p4-search is built on the following technology stack:

Apache Solr - combines Apache Lucene (fast indexing) and Apache Tika (file
content/metadata parsing) to form the base of the index service

p4-search - Java code that
* optionally scans an existing Perforce Server for content on startup
* indexes changelist content in Solr as commits happen (via a server
  trigger)
* optionally indexes all file revisions (except deletes) and tracks which
  is the most recent
* provides an API for searching the Solr index via GET or POST requests
* provides a minimal web UI for searching via a web browser

jetty/Tomcat/etc. web application server - hosts both the Solr indexer
application and the p4-search application

Operational theory
---

Initialization: p4-search walks through any depots it can access, produces
lists of files to index, places them on a queue, and pulls entries off the
queue for indexing. The service can be configured to scan all revisions or
just the head revision, skip certain file types, index only metadata, and
so on. How long this takes depends on the size of your installation; the
service can already be searched for whatever it has indexed while the scan
is running. Once scanning is complete, a key is set on the Perforce Server
to prevent future scans from being attempted and to allow interrupted
scans to be resumed.

Changelist trigger: when the search-queue.sh script is installed as a
trigger on the Perforce Server, it sends a curl request with the
changelist number to p4-search (a sketch of the wiring appears at the end
of this section). The data is written to a temporary location for later
processing. Changelist processing is largely the same as the
initialization phase: turn the changelist into a list of files and queue
them for later processing.

Indexing: a p4 print of each file is performed and the content is sent to
the Apache Solr indexing service, which extracts whatever data it can and
indexes it. The p4-search specific metadata that is indexed is described
in schema.xml.

Search: p4-search first queries Apache Solr for results, then runs p4
files as the requesting user to limit the results to what that user would
normally be able to see in the Perforce Server via any other Perforce
client application.
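For concreteness, a hedged sketch of the trigger wiring follows. The
trigger name, script path, token, URL, and parameter names are
illustrative assumptions, not shipped defaults; scripts/search-queue.sh
defines the real request.

    # Hypothetical trigger table entry (p4 triggers) firing the shipped
    # script on every submitted change:
    #   search.queue change-commit //... "/p4/scripts/search-queue.sh %change%"
    #
    # Roughly what the script then does: hand the changelist number to
    # p4-search over HTTP (path, parameter, and token are assumptions):
    curl --silent --data "token=SEARCH_TOKEN&change=12345" \
        http://localhost:8080/queue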
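The indexing step can also be pictured with standard tools. Solr's
ExtractingRequestHandler is a stock Solr feature; the collection name,
revision, and document id below are illustrative only, and the actual
p4-search code path differs in detail.

    # Sketch: print one revision and post its content to Solr's extract
    # handler ("collection1" and the literal.id value are assumptions):
    p4 print -q "//depot/project/README#3" > /tmp/rev.txt
    curl "http://localhost:8983/solr/collection1/update/extract?literal.id=%2F%2Fdepot%2Fproject%2FREADME%233&commit=true" \
        -F "myfile=@/tmp/rev.txt"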
Build and install
---

JDK 1.7 is required to build the project, and gradle (www.gradle.org) must
be in the $PATH. JDK 1.7, jetty, and Apache Solr are required to run the
service, and the p4 command line client must be in the path when using
install.sh. The only platform this process has been tested on is Ubuntu
Linux.

$ROOT/search/build.sh - builds and tests the project (war file and tgz
package)

Once the project is successfully built, the $ROOT/search/tmp directory
contains both a build output directory ($OUTDIR) and a tgz containing all
of its contents. At the root of the output directory is the install script
(./install.sh), which will
* prompt for some basic installation information
* download jetty (8.1) and solr (4.5.1) if the required tarballs are not
  in the directory
* unpack them and configure both for use by p4-search
* create a start.sh and stop.sh for starting and stopping the services

The installation location is $OUTDIR/install by default; to remove the
installation, simply run "rm -rf $OUTDIR/install". Run
$OUTDIR/install/start.sh to start both services and stop.sh to stop them.

The p4-search tool uses a Perforce Server trigger (see
scripts/search-queue.sh) to receive new changelists, so the trigger must
be installed on the Perforce Server or no indexing past the initial scan
will take place (a sketch of such a trigger entry appears at the end of
the Operational theory section above).

Configuration
---

The install script creates a search.config properties file in the
$OUTDIR/install/jetty-.../resources directory. Configuration changes
currently require the service to be restarted. Except for the server
configuration (serverProtocol, serverHost, serverPort, indexerUser,
indexerPassword), reasonable defaults are assumed for the other
properties:

com.perforce.search...

(general Solr configuration)
searchEngine: URL of the Solr installation, e.g. http://localhost:8983
searchEngineToken: magic token matching the Perforce Server trigger
collectionName: Solr collection that contains the indexed data

(general processing configuration)
queueLocation: location of queued changelist files awaiting indexing
maxSearchResults: maximum number of results returned by the service
maxStreamFileSize: largest file size for which content indexing is
    attempted
ignoredExtensions: file with a CRLF-separated list of extensions to skip
    for content processing
neverProcessList: file with a CRLF-separated list of extensions to never
    index
indexMethod: ALL_REVISIONS | HEAD_REVISIONS; HEAD_REVISIONS means only
    keep the index up to date with the latest revision
blackFstatRegex: for Perforce attributes, which p4 attr values to skip
    (empty means do not index fstat data)
whiteFstatRegex: for Perforce attributes, which p4 attr values to include
    (empty means anything not in the blacklist)
changelistCatchupKey: key name where the latest processed changelist is
    stored; on startup the service will try to "catch up" from this value

(file scanner configuration)
fileScannerTokenKey: key name indicating that initialization is complete;
    empty implies "do not scan"
fileScannerTokenValue: key value indicating that initialization is
    complete; empty implies "do not scan"
scanPaths: CSV list of paths in the Perforce server to scan
fileScannerThreads: number of threads handling the processing
fileScannerSleep: used to throttle the scanner
fileScannerDepth: how many revisions deep to scan; 0 implies all
    revisions, 1 is head only, etc.
maxScanQueueSize: used to throttle the number of files in the scan queue

(GUI configuration)
commonsURL, swarmURL, p4webURL: URLs of services that can show the files
via links in the web browser. The Swarm and P4Web settings are mutually
exclusive, with preference given to the Swarm URL.

API
---

See the API documentation included in the installation for how to gain
programmatic access to search results and how to send specific queries to
the search service. The p4-search web UI exercises the same underlying
HTTP API using JavaScript XHR.
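That documentation is authoritative; purely as an illustration of the GET
form, a command-line query might look like the following, where the path
and parameter names are hypothetical stand-ins rather than the real API.

    # Hypothetical request only; the real path and parameters are in the
    # bundled API documentation. The user determines result visibility.
    curl "http://localhost:8080/search?query=main.c&user=bruno&max=50"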
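As a point of reference for the Configuration section above, a trimmed
search.config might look something like this; the property names come from
the list above, but the prefixes and every value shown are assumptions,
not shipped defaults.

    # Illustrative sketch of search.config; values and prefixes are
    # assumptions, adjust for your installation:
    serverProtocol=p4java
    serverHost=perforce.example.com
    serverPort=1666
    indexerUser=indexer
    indexerPassword=secret
    com.perforce.search.searchEngine=http://localhost:8983
    com.perforce.search.collectionName=collection1
    com.perforce.search.indexMethod=HEAD_REVISIONS
    com.perforce.search.maxSearchResults=500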
Notes
---

* If you want to restrict Apache Solr access to certain IP addresses, you
  must add a handler to the etc/jetty.xml file in the Solr installation,
  e.g. org.eclipse.jetty.server.handler.IPAccessHandler. See the
  install.sh script for a hint on how to do this.

* If you suspect that the service has missed some changelists, one easy
  fix is to script a series of curl requests to the server to force it to
  re-index the files in those changelists (see the sketch after these
  notes). Re-indexing files will not corrupt the integrity of the index;
  at worst it simply duplicates work.

* While Apache Solr can parse many types of files, you may find it useful
  to look through the Solr logs to determine whether you need additional
  extractors, e.g. xmpcore.jar for media files.

* This version installs with Apache Solr 4.5.1 by default. When testing
  against mp3 files I found that some mp3 files cause the Apache Tika
  parser in this version to hang the CPU. If you have similar problems or
  expect to index a lot of media files, you might consider an earlier
  version of Solr (Solr 4.3.1 used Apache Tika 1.3, which worked for me),
  replacing the Tika jar with version 1.5 (unreleased), or using the
  ...ignoredExtensions configurable to exclude problematic files.

* If you are curious how the initial scan is doing, or about anything else
  the server is doing, the easiest way to check is to tail the log file,
  e.g. run "tail -f start.log".
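To make the re-indexing note concrete, a loop like the following would
re-queue a range of changelists. The URL, path, token, and parameter names
are the same hypothetical ones used in the trigger sketch under
Operational theory; take the real ones from scripts/search-queue.sh.

    # Hypothetical sketch: re-queue changelists 1000 through 1010 so the
    # service re-indexes their files (duplicated work at worst).
    for change in $(seq 1000 1010); do
        curl --silent --data "token=SEARCH_TOKEN&change=$change" \
            http://localhost:8080/queue
    done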