Near Real Time Search With Solr-RA ver 3.2
By Nagendra Nagarajayya http://solr-ra.tgels.com
Summary
Solr is a very popular open source search platform that uses Lucene as its underlying search library. Solr-RA is Solr with RankingAlgorithm as the underlying search library. RankingAlgorithm uses Lucene indexing to read and write documents but scores and ranks them on its own. Solr-RA lets you add documents to the index while searches run concurrently and are ranked in near real time. Near real time means that a newly added document becomes searchable very shortly after it is added, but the delay is not guaranteed. With Solr-RA, adding a document needs no commit, the IndexSearchers are not closed, and the cache is not cleared. Because there is no commit, indexing is fast and searches can proceed concurrently. Lock-free, time-managed concurrent access is used to eliminate locking between the IndexWriter and the IndexSearchers. An index write rate of about 1500 TPS has been observed on a dual-core Intel system with a 2GB heap, with searches running in parallel.
Steps to enable RT
Add the following two elements to solrconfig.xml (the visible attribute is a time in milliseconds):
<realtime visible="150">true</realtime>
<library>rankingalgorithm</library>
Adding documents
Adding documents works as before, except that you do not need to call commit after adding a document. A commit is needed only when the index is empty, to create the first document. After that, no commits are needed.
Example:
curl "http://localhost:8983/solr/twitter/update/csv?stream.file=${2}&fieldnames=name,desc,id,userid&encapsulator=%1f";
(You need to add the commit parameter, commit=true, only for the first document, when indexing starts with an empty index.)
Search concurrently while the indexing is going on
Searching works as before; no changes are needed.
http://localhost:8983/solr/twitter/select/?q=airfare+deals&fl=score
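When queries are built programmatically rather than typed into a browser, the query terms need URL encoding (a space becomes +, as in the URL above). A minimal sketch in Python, assuming the host, port, and core name used in this article; select_url is a hypothetical helper name, not part of Solr or Solr-RA:

```python
from urllib.parse import urlencode


def select_url(base, core, query, **params):
    """Build a Solr select URL with the query string properly encoded.

    Hypothetical helper for illustration; not part of Solr or Solr-RA.
    """
    # urlencode applies quote_plus, so "airfare deals" becomes "airfare+deals".
    query_string = urlencode({"q": query, **params})
    return f"{base}/{core}/select/?{query_string}"


url = select_url("http://localhost:8983/solr", "twitter", "airfare deals", fl="score")
print(url)  # http://localhost:8983/solr/twitter/select/?q=airfare+deals&fl=score
```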
Performance
Indexing:
Indexing about 10000 mbartist entries with curl, timed with time:
real 0m49.356s
user 0m9.383s
sys 0m25.852s
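The wall-clock figure above can be turned into a rough throughput number. The sketch below, in Python, parses the real/user/sys values reported by time and divides the document count by the elapsed time. The function name is hypothetical, and because the elapsed time includes the curl client itself, the result is only a rough indication for this particular run.

```python
import re


def parse_time_output(text):
    """Convert `time` output such as 'real 0m49.356s' into seconds per label."""
    seconds = {}
    for label, minutes, secs in re.findall(r"(real|user|sys)\s+(\d+)m([\d.]+)s", text):
        seconds[label] = int(minutes) * 60 + float(secs)
    return seconds


timings = parse_time_output("real 0m49.356s user 0m9.383s sys 0m25.852s")
docs = 10000  # "about 10000" entries, as reported above

docs_per_second = docs / timings["real"]
ms_per_doc = timings["real"] / docs * 1000

print(f"{docs_per_second:.1f} docs/s, {ms_per_doc:.2f} ms per document")
# 202.6 docs/s, 4.94 ms per document
```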
Concurrent search during load:
http://192.168.1.126:8983/solr/twitter/select/?fl=score&q=john+ab180027
Implementation
Near real time search is implemented by retrieving an IndexReader from the IndexWriter.getReader() method after a document has been added to the index. The addDoc function in DirectUpdateHandler2.java has been modified so that the retrieved IndexReader is stored in a HashMap in SolrCore.java. To avoid locking, non-locking, time-managed concurrent access is used to make the IndexReader available to the SolrIndexSearchers. The SolrIndexSearchers use this IndexReader instead of the SolrIndexReader and pass it as a parameter to RankingAlgorithm for the search. RankingAlgorithm uses the reader to access the index and returns results that are near real time, because it is working from the updated IndexReader.
The NRT implementation supports faceting, filter queries, etc. The facet counts can be seen changing as documents are added in the screenshots Fig 1 and Fig 2. Fig 1 shows a facet query for “john” against the mbartists index (from the book Solr-14-Enterprise-Search-Server). Fig 2 shows the same query, with the browser cache cleared in Firefox 4.0, after adding a new artist to the index as below:
curl "http://localhost:8990/solr/mbartists/update/csv?stream.file=/tmp/x.csv&encapsulator=%1f"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">163</int></lst>
</response>
cat /tmp/x.csv:
id,type,a_name,a_name_sort,a_alias,a_type,a_begin_date,a_end_date,a_member_name,a_member_id,a_release_date_latest,a_spell,a_spellPhrase,r_name,r_name_sort,r_name_facetLetter,r_a_name,r_a_id,r_attributes,r_type,r_official,r_lang,r_tracks,r_event_country,r_event_date,r_event_date_earliest,l_name,l_name_sort,l_type,l_begin_date,l_end_date,t_name,t_duration,t_a_id,t_a_name,t_num,t_r_id,t_r_name,t_r_attributes,t_r_tracks,t_trm_lookups,word,includes
Artist:3991866,Artist,John Ab Davis,John Ab Davis,,person,1942-12-29T00:00:00Z,1999-12-10T00:00:00Z,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Fig 1 shows numFound as 3256 and the facet count for “john” as 3256. Fig 2, taken after adding a document with curl, shows numFound as 3257 and the facet count for “john” as 3257. The Solr query is as below:
http://192.168.1.126:8990/solr/mbartists/select/?q=john&facet=on&facet.field=a_name&facet.field=a_type&fl=score
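The counts in Fig 1 and Fig 2 can also be checked without a browser. The sketch below, in Python, assumes the query is sent with wt=json added (Solr's JSON response writer), in which case each facet field comes back by default as a flat list of alternating terms and counts. The sample response is hand-written to mirror the numbers reported for Fig 2, not captured output, and facet_counts is a hypothetical helper name.

```python
import json


def facet_counts(response, field):
    """Return {term: count} for one facet field of a Solr JSON response."""
    flat = response["facet_counts"]["facet_fields"][field]
    # Solr's default JSON layout is [term1, count1, term2, count2, ...].
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample shaped like a Solr JSON response (not captured output).
sample = json.loads("""
{
  "response": {"numFound": 3257, "start": 0, "docs": []},
  "facet_counts": {
    "facet_fields": {"a_name": ["john", 3257]}
  }
}
""")

print(sample["response"]["numFound"])           # 3257
print(facet_counts(sample, "a_name")["john"])   # 3257
```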
Caveat
Performance is limited by how fast IndexWriter.getReader() returns. This call seems to take the most time, between 2ms and 70ms on average. The faster it returns, the faster the indexing.
Download
Download Solr-RA, including the tweets file, and try it out yourself.
You can download Solr-RA from here:
http://solr-ra.tgels.com
The tweets.txt from here:
http://solr-ra.tgels.com/docs/tweets.txt
(The tweets are real tweets sent out to Twitter from @eneedsonline and @tgels.)
schema.xml and solrconfig.xml from here:
http://solr-ra.tgels.com/docs/schema.xml
http://solr-ra.tgels.com/docs/solrconfig.xml
Conclusion
Near real time search in Solr-RA works well. It allows searches to run concurrently with indexing, without closing the IndexSearchers or clearing the cache, which makes it possible to offer search results in near real time. The indexing performance observed on a 2-core Intel system with Fedora Linux 12 is about 1500 TPS (new document adds) with visible set to 200ms.
Note: Solr and Lucene are registered trademarks of the Apache Software Foundation. Twitter is a trademark of Twitter, Inc.