Near Real Time Search With Solr-RA
By Nagendra Nagarajayya http://solr-ra.tgels.com
Summary
Solr is a popular open source search platform that uses Lucene as its underlying search library. Solr-RA is Solr with RankingAlgorithm as the underlying search library. RankingAlgorithm uses Lucene indexing to read and write documents but does its own scoring and ranking. Solr-RA lets you add documents to the index while searches run concurrently, with results ranked in near real time. Near real time means very close to real time, but without a guarantee of real-time visibility. With Solr-RA, adding a document needs no commit, the IndexSearchers are not closed, and the cache is not cleared. Because there is no commit, indexing is fast and searches can proceed concurrently. Lock-free, concurrent, time-managed access eliminates locking between the IndexWriter and the IndexSearchers. An index write rate of 262 TPS has been observed on a dual-core Intel system with a 2 GB heap while searches ran in parallel.
Steps to enable RT
Add the following to solrconfig.xml:
<realtime>true</realtime>
<library>rankingalgorithm</library>
Adding documents
Documents are added the same way as before, except that you do not need to call commit after adding a document. A commit is needed only when the index is empty, to create the first document. After that, no commits are needed.
Example:
curl "http://localhost:8983/solr/twitter/update/csv?stream.file=${2}&fieldnames=name,desc,id,userid&encapsulator=%1f";
(Add the commit parameter only for the first document, when indexing starts with an empty index. In this example, ${2} is a shell positional parameter that supplies the file path.)
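The same update request can be assembled programmatically. The following is a minimal Java sketch, not part of Solr-RA: the class name CsvUpdateUrl, the method build, and the path /tmp/tweets.txt are hypothetical, and it assumes the commit parameter takes the usual Solr form commit=true. It only builds the URL string; sending the request is left to whatever HTTP client you use.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Solr-RA): builds the CSV update URL
// used in the curl example.
final class CsvUpdateUrl {

    // firstDocument = true appends commit=true, which the text says is needed
    // only when indexing starts with an empty index.
    static String build(String coreUrl, String csvPath, String fieldNames,
                        boolean firstDocument) {
        StringBuilder url = new StringBuilder(coreUrl)
                .append("/update/csv?stream.file=").append(encode(csvPath))
                .append("&fieldnames=").append(encode(fieldNames))
                // Already percent-encoded, exactly as in the curl example.
                .append("&encapsulator=%1f");
        if (firstDocument) {
            // Assumption: the standard Solr commit request parameter.
            url.append("&commit=true");
        }
        return url.toString();
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String core = "http://localhost:8983/solr/twitter";
        String fields = "name,desc,id,userid";
        // First request against an empty index: include the commit.
        System.out.println(build(core, "/tmp/tweets.txt", fields, true));
        // Every later request: no commit.
        System.out.println(build(core, "/tmp/tweets.txt", fields, false));
    }
}
```

The path and field names are percent-encoded here, which Solr decodes back to the values shown in the curl example.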
Search concurrently while the indexing is going on
Searching works as before, with no changes:
http://localhost:8983/solr/twitter/select/?q=airfare+deals&fl=score
Performance
Indexing:
Indexing about 3900 tweet messages with curl, timed with the time command:
real 0m14.454s
user 0m0.007s
sys 0m0.013s
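To relate this timing to a throughput figure, the arithmetic can be written out as below. This is only a worked calculation on the numbers quoted above; the class and method names are hypothetical, and because the document count is "about 3900" the results are approximate.

```java
// Arithmetic check on the timing above; inputs are the figures quoted in the text.
final class IndexingThroughput {

    static double docsPerSecond(int documents, double wallClockSeconds) {
        return documents / wallClockSeconds;
    }

    static double millisPerDocument(int documents, double wallClockSeconds) {
        return wallClockSeconds * 1000.0 / documents;
    }

    public static void main(String[] args) {
        int documents = 3900;     // "about 3900" tweets
        double seconds = 14.454;  // the "real" (wall-clock) time reported for curl
        System.out.printf("%.1f docs/s%n", docsPerSecond(documents, seconds));
        System.out.printf("%.2f ms/doc%n", millisPerDocument(documents, seconds));
    }
}
```

With these inputs the rate comes out at roughly 270 documents per second, or about 3.7 ms per document, which is in the same range as the 262 TPS figure quoted in the summary.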
Concurrent search during load:
http://192.168.1.126:8983/solr/twitter/select/?fl=score&q=airline+tickets (3rd line in the tweets.txt file)
Implementation
Near real time search is implemented by retrieving an IndexReader from IndexWriter.getReader() after a document has been added to the index. The addDoc method in DirectUpdateHandler2.java has been modified so that the retrieved IndexReader is stored in a HashMap in SolrCore.java. To avoid locking, non-locking, concurrent, time-managed access makes this IndexReader available to the SolrIndexSearchers. The SolrIndexSearchers use this IndexReader instead of the SolrIndexReader and pass it as a parameter to RankingAlgorithm for the search. RankingAlgorithm uses the reader to access the index and returns results that are near real time, because it reads from the updated IndexReader.
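The general idea of handing a fresh reader from the indexing thread to search threads without locks can be sketched as follows. This is not the Solr-RA code: it swaps in java.util.concurrent.atomic.AtomicReference in place of the HashMap in SolrCore.java, it does not attempt to reproduce the time-managed part, and the class LatestReaderHolder and its methods are hypothetical. The type parameter R stands in for Lucene's IndexReader so the sketch has no dependencies.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: a lock-free slot holding the most recently published reader.
// R is a stand-in for Lucene's IndexReader.
final class LatestReaderHolder<R> {

    private final AtomicReference<R> latest = new AtomicReference<>();

    // Called by the indexing thread after a document has been added.
    // A single atomic write; searchers are never blocked.
    void publish(R newReader) {
        latest.set(newReader);
    }

    // Called by search threads. Returns the newest published reader,
    // or null if nothing has been published yet.
    R current() {
        return latest.get();
    }

    public static void main(String[] args) throws InterruptedException {
        LatestReaderHolder<String> holder = new LatestReaderHolder<>();

        // Stand-in for the indexing thread: publishes a new "reader" per document.
        Thread writer = new Thread(() -> {
            for (int i = 1; i <= 3; i++) {
                holder.publish("reader-after-doc-" + i);
            }
        });
        writer.start();
        writer.join();

        // A searcher picking up the reader after indexing has finished.
        System.out.println(holder.current());
    }
}
```

A real implementation also has to manage the lifetime of superseded readers, for example by reference counting before closing them; that is left out here.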
The NRT implementation supports faceting, filter queries, and so on. The facet count can be seen changing as documents are added in the screenshots Fig 1 and Fig 2 below. Fig 1 shows a facet query for “john” against the mbartists index (from the book Solr 1.4 Enterprise Search Server). Fig 2 shows the same query, with the browser cache cleared in Firefox 4.0, after adding a new artist to the index as below:
curl "http://localhost:8990/solr/mbartists/update/csv?stream.file=/tmp/x.csv&encapsulator=%1f"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">163</int></lst>
</response>

cat /tmp/x:
id,type,a_name,a_name_sort,a_alias,a_type,a_begin_date,a_end_date,a_member_name,a_member_id,a_release_date_latest,a_spell,a_spellPhrase,r_name,r_name_sort,r_name_facetLetter,r_a_name,r_a_id,r_attributes,r_type,r_official,r_lang,r_tracks,r_event_country,r_event_date,r_event_date_earliest,l_name,l_name_sort,l_type,l_begin_date,l_end_date,t_name,t_duration,t_a_id,t_a_name,t_num,t_r_id,t_r_name,t_r_attributes,t_r_tracks,t_trm_lookups,word,includes
Artist:3991866,Artist,John Ab Davis,John Ab Davis,,person,1942-12-29T00:00:00Z,1999-12-10T00:00:00Z,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Fig 1 shows numFound as 3256 and the facet count for “john” as 3256. Fig 2, taken after adding a document with curl, shows numFound as 3257 and the facet count for “john” as 3257. The Solr query is as below:
http://192.168.1.126:8990/solr/mbartists/select/?q=john&facet=on&facet.field=a_name&facet.field=a_type&fl=score
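The before-and-after check described above can also be done in code rather than by reading screenshots. The sketch below parses numFound and one facet count out of a Solr XML response using the JDK's DOM parser. The class FacetResponseCheck and its methods are hypothetical, and sampleResponse is a hand-written stand-in that mirrors only the parts of Solr's XML response layout used here; a real response carries more elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads numFound and a facet count from a Solr XML response.
final class FacetResponseCheck {

    // Hand-written stand-in for a response to the facet query; not captured output.
    static String sampleResponse(long count) {
        return "<response>"
             + "<result name=\"response\" numFound=\"" + count + "\" start=\"0\"/>"
             + "<lst name=\"facet_counts\"><lst name=\"facet_fields\">"
             + "<lst name=\"a_name\"><int name=\"john\">" + count + "</int></lst>"
             + "</lst></lst>"
             + "</response>";
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse response XML", e);
        }
    }

    // Reads the numFound attribute of the first <result> element.
    static long numFound(String xml) {
        Element result = (Element) parse(xml).getElementsByTagName("result").item(0);
        return Long.parseLong(result.getAttribute("numFound"));
    }

    // Returns the count for term under facet_fields/field, or -1 if absent.
    static long facetCount(String xml, String field, String term) {
        NodeList lists = parse(xml).getElementsByTagName("lst");
        for (int i = 0; i < lists.getLength(); i++) {
            Element lst = (Element) lists.item(i);
            Node parent = lst.getParentNode();
            boolean underFacetFields = parent instanceof Element
                    && "facet_fields".equals(((Element) parent).getAttribute("name"));
            if (!underFacetFields || !field.equals(lst.getAttribute("name"))) {
                continue;
            }
            NodeList counts = lst.getElementsByTagName("int");
            for (int j = 0; j < counts.getLength(); j++) {
                Element c = (Element) counts.item(j);
                if (term.equals(c.getAttribute("name"))) {
                    return Long.parseLong(c.getTextContent().trim());
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String before = sampleResponse(3256); // values reported for Fig 1
        String after = sampleResponse(3257);  // values reported for Fig 2
        System.out.println(numFound(before) + " -> " + numFound(after));
        System.out.println(facetCount(before, "a_name", "john")
                + " -> " + facetCount(after, "a_name", "john"));
    }
}
```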
Caveat
Performance is limited by how quickly IndexWriter.getReader() returns. This call seems to take the most time, about 70-80 ms on average. The faster it returns, the faster indexing becomes.
Download
Download Solr-RA, including the tweet file, and try it out yourself.
You can download Solr-RA from here:
http://solr-ra.tgels.com
The tweets.txt from here:
http://solr-ra.tgels.com/docs/tweets.txt
(The tweets are real tweets sent out to Twitter from @eneedsonline and @tgels.)
schema.xml and solrconfig.xml from here:
http://solr-ra.tgels.com/docs/schema.xml
http://solr-ra.tgels.com/docs/solrconfig.xml
Conclusion
Near real time search in Solr-RA works well. It allows searches to run concurrently with indexing, without closing the IndexSearchers or clearing the cache. The indexing rate observed on a dual-core Intel system running Fedora Linux 12 is about 262 TPS (new document adds). This could improve substantially (from about 14 seconds to about 2 seconds for indexing about 3900 documents) if the performance of IndexWriter.getReader() improves; at the moment, it takes about 70-90 ms to get an IndexReader.
Note: Solr and Lucene are registered trademarks of the Apache Software Foundation. Twitter is a trademark of Twitter, Inc.