org.tgels.search.rankingalgorithm
Class RankingQuery

java.lang.Object
  extended by org.tgels.search.rankingalgorithm.RankingQuery

public class RankingQuery
extends java.lang.Object

RankingAlgorithm is a search library that uses a new scoring algorithm to rank results accurately and relevantly. RankingAlgorithm is easy to use: it reads an Apache Lucene index but does its own ranking and scoring.

Three algorithms are available: SIMPLE, SIMPLE1 and COMPLEX. SIMPLE is a very fast algorithm and can return queries in <50ms on a 10m wikipedia index (complete index). It can also scale to 100m docs or possibly more. SIMPLE1 is the fastest algorithm and may be used for autocomplete-style processing. COMPLEX is a more complex algorithm, so it is a little slower than SIMPLE, but it can still return queries in <50ms on a 10m wikipedia index (complete index). COMPLEX is more accurate and should give the best rankings compared to SIMPLE.

RankingAlgorithm can be used in two modes, Document mode (default) and Product mode. The scoring changes with the mode. In Document mode, documents are matched for relevancy, while in Product mode, documents are matched for term occurrence. Document mode is useful for matching text, HTML, rich text (PDF/Word), books, FAQs, forum discussions, etc. Product mode is useful for short text, as in retail/e-commerce product matches.

 
 Programmatic:
  rq.setMode(RankingQuery.MODE_DOCUMENT);
  
  Property:
  To change MODE, start application with -Dmode=document, 
  for product, -Dmode=product
 
You can also set the scan attribute to fast, medium or full. Fast is the default and the fastest, while a full scan is the most accurate but also slow and memory-intensive.

SIMPLE is also very good and may be better suited than COMPLEX for some types of queries.

 Programmatic:
  rq.setAlgorithm(RankingQuery.ALGORITHM_COMPLEX);
   
  Property: 
  To enable SIMPLE, start application with -Dalgorithm=SIMPLE,
  for SIMPLE1,  -Dalgorithm=SIMPLE1
  for COMPLEX, -Dalgorithm=COMPLEX	 
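
The property-based selection above could be resolved roughly as follows. This is only a sketch of how a -Dalgorithm value might be mapped to a constant: the class AlgorithmProperty, its methods and the numeric constant values are hypothetical and are not taken from RankingAlgorithm. The fallback to SIMPLE follows the documented default.

```java
import java.util.Locale;

/** Hypothetical sketch of resolving -Dalgorithm; not RankingAlgorithm's own code. */
final class AlgorithmProperty {
    // Placeholder values: the real RankingQuery.ALGORITHM_* values are not documented here.
    static final int ALGORITHM_SIMPLE = 0;
    static final int ALGORITHM_SIMPLE1 = 1;
    static final int ALGORITHM_COMPLEX = 2;

    /** Maps a value such as "COMPLEX" to a constant; null or unknown falls back to SIMPLE. */
    static int resolve(String value) {
        if (value == null) {
            return ALGORITHM_SIMPLE;
        }
        switch (value.trim().toUpperCase(Locale.ROOT)) {
            case "SIMPLE1": return ALGORITHM_SIMPLE1;
            case "COMPLEX": return ALGORITHM_COMPLEX;
            default:        return ALGORITHM_SIMPLE;
        }
    }

    /** Reads the JVM system property set with -Dalgorithm=... */
    static int fromSystemProperty() {
        return resolve(System.getProperty("algorithm"));
    }
}
```

The result of fromSystemProperty() would then be passed to a setter such as rq.setAlgorithm(...), using the library's real constants in place of the placeholders.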
 
 
You will need Apache Lucene 3.x on the classpath. RankingQuery needs a Lucene IndexSearcher or IndexReader object, since it uses the IndexReader to read documents from the index; pass it to a constructor or to search, as in the examples below.

(Note: Lucene is a trademark of the Apache Software Foundation)
 Example 1:
 		RankingQuery rq = new RankingQuery(); 
 		IndexSearcher is = new IndexSearcher(FSDirectory.open(new File(index)));
 		StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		QueryParser parser = new QueryParser(Version.LUCENE_30, field, analyzer);
		Query query = parser.parse(searchterms);
 		RankingHits rh = rq.search(query, is); // is = Lucene IndexSearcher object
		System.out.println("num hits=" + rh.getNumHits() + "--no docs=" + is.maxDoc()); 
		for (int i=0; i<rh.getNumHits() && i<10; i++) {
			System.out.println("i=" + i + "--" + rh.score(i) + "--docid=" + rh.docid(i) + "--doc=" + rh.doc(i).get(title) );
		}
 
 Example 2: 
 		IndexReader reader = IndexReader.open(FSDirectory.open(new File(index))); 
		RankingQuery rq = new RankingQuery();			
		StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		QueryParser parser = new QueryParser(Version.LUCENE_30, field, analyzer);
		Query query = parser.parse(searchterms);
		TopScoreDocCollector tdc = TopScoreDocCollector.create(1000, true);
		rq.search(query, null, reader, tdc); // reader = Lucene IndexReader object
		int hits = tdc.getTotalHits();
		ScoreDoc sda[] = null;
		if (hits > 0) {
			sda = tdc.topDocs().scoreDocs;
		}
		System.out.println("num hits=" + hits + "--no docs=" + reader.maxDoc()); 
		for (int i=0; i<hits && i<10; i++) {
			ScoreDoc sd = sda[i];
			System.out.println("i=" + i + "--" + sd.score + "--docid=" + sd.doc + "--doc=" + reader.document(sd.doc).get(title) );
		}
		reader.close();
  
 

Author:
Nagendra Nagarajayya
See Also:
RankingHits, RankingScore, TopScoreDocCollector

Field Summary
static int ALGORITHM_COMPLEX
           
static int ALGORITHM_SIMPLE
           
static int ALGORITHM_SIMPLE1
           
static int AND
           
static int AND_OR
           
static boolean debug
           
static int MODE_DOCUMENT
           
static int MODE_PRODUCT
           
static int OR
           
static int SCAN_FAST
           
static int SCAN_FULL
           
static int SCAN_MEDIUM
           
 
Constructor Summary
RankingQuery()
           
RankingQuery(org.apache.lucene.index.IndexReader reader)
          Constructor to create a RankingQuery object.
RankingQuery(org.apache.lucene.search.IndexSearcher is)
          Constructor to create a RankingQuery object.
RankingQuery(java.lang.String indexPath)
          Constructor to create a RankingQuery object.
 
Method Summary
 void addToLowerBoostSet(java.lang.String keywords)
          Experimental, can change
 void close()
          Closes the IndexReader objects opened.
 org.apache.lucene.document.Document doc(int docid)
          Similar to IndexSearcher doc(id), returns a Lucene Document object
 int getAlgorithm()
           
 int getAndOr()
           
 int getMode()
           
 int getScan()
           
static void log(java.lang.String s)
           
 RankingHits search(org.apache.lucene.search.Query query)
          Search a Lucene index for terms in the query.
 int search(org.apache.lucene.search.Query query, org.apache.lucene.search.Filter filter, org.apache.lucene.index.IndexReader ir, org.apache.lucene.search.Collector collector)
          Search a Lucene index for terms in the query.
 RankingHits search(org.apache.lucene.search.Query query, org.apache.lucene.search.Filter filter, org.apache.lucene.index.IndexReader ir, int docs)
          Search a Lucene index for terms in the query.
 RankingHits search(org.apache.lucene.search.Query query, org.apache.lucene.index.IndexReader r)
          Search a Lucene index for terms in the query.
 RankingHits search(org.apache.lucene.search.Query query, org.apache.lucene.search.IndexSearcher is)
          Search a Lucene index for terms in the query.
 RankingHits search(java.lang.String field, java.lang.String searchTerms)
          Search a Lucene index for terms in the query.
 int search(org.apache.lucene.search.Weight weight, org.apache.lucene.search.Filter filter, org.apache.lucene.index.IndexReader ir, org.apache.lucene.search.Collector collector)
          Similar to Lucene search.
 int search(org.apache.lucene.search.Weight weight, org.apache.lucene.search.Filter filter, org.apache.lucene.index.IndexReader ir, org.apache.lucene.search.Collector collector, org.tgels.search.rankingalgorithm.Parameters parms)
          Similar to Lucene search.
 void setAlgorithm(int type)
          Set algorithm, SIMPLE, SIMPLE1 or COMPLEX.
 void setAndOr(int type)
          Set AND, OR or AND_OR matching to control which results are returned.
 void setMode(int type)
          Set mode, Document or Product mode.
 void setScan(int scan)
          Used along with mode on how to scan a document.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

debug

public static boolean debug

ALGORITHM_COMPLEX

public static final int ALGORITHM_COMPLEX
See Also:
Constant Field Values

ALGORITHM_SIMPLE

public static final int ALGORITHM_SIMPLE
See Also:
Constant Field Values

ALGORITHM_SIMPLE1

public static final int ALGORITHM_SIMPLE1
See Also:
Constant Field Values

MODE_PRODUCT

public static final int MODE_PRODUCT
See Also:
Constant Field Values

MODE_DOCUMENT

public static final int MODE_DOCUMENT
See Also:
Constant Field Values

SCAN_FAST

public static final int SCAN_FAST
See Also:
Constant Field Values

SCAN_MEDIUM

public static final int SCAN_MEDIUM
See Also:
Constant Field Values

SCAN_FULL

public static final int SCAN_FULL
See Also:
Constant Field Values

AND_OR

public static final int AND_OR
See Also:
Constant Field Values

AND

public static final int AND
See Also:
Constant Field Values

OR

public static final int OR
See Also:
Constant Field Values
Constructor Detail

RankingQuery

public RankingQuery(org.apache.lucene.index.IndexReader reader)
Constructor to create a RankingQuery object.

Parameters:
reader - Lucene IndexReader object.

RankingQuery

public RankingQuery(org.apache.lucene.search.IndexSearcher is)
Constructor to create a RankingQuery object.

Parameters:
is - Lucene IndexSearcher object.

RankingQuery

public RankingQuery(java.lang.String indexPath)
             throws java.lang.Throwable
Constructor to create a RankingQuery object. A Lucene IndexReader object is created to read the index at indexPath.

Parameters:
indexPath - path to a Lucene index.
Throws:
java.lang.Throwable

RankingQuery

public RankingQuery()
Method Detail

setScan

public void setScan(int scan)
Used along with mode to control how a document is scanned. Scan speed can be FAST, MEDIUM or FULL. Default is FAST.

Fast is the default and the fastest while a full scan is the most accurate but also slow, and takes lots of memory. Medium is in between.
 
 Programmatic:
  rq.setScan(RankingQuery.SCAN_FAST);
  
  Property:
  To change SCAN, start application with -Dscan=fast 
    (the scan levels are fast, medium and full)
 

Parameters:
scan - Valid values are RankingQuery.SCAN_FAST, RankingQuery.SCAN_MEDIUM, RankingQuery.SCAN_FULL

getScan

public int getScan()
Returns:
RankingQuery.SCAN_FAST, RankingQuery.SCAN_MEDIUM, RankingQuery.SCAN_FULL

setMode

public void setMode(int type)
Set mode: Document or Product. Default is Document mode.

RankingAlgorithm can be used in two modes, Document mode (default) and Product mode. The scoring changes with the mode. In Document mode, documents are matched for relevancy, while in Product mode, documents are matched for term occurrence. Document mode is useful for matching text, HTML, rich text (PDF/Word), books, FAQs, forum discussions, etc. Product mode is useful for short text, as in retail/e-commerce product matches.

 
 Programmatic:
  rq.setMode(RankingQuery.MODE_DOCUMENT);
  
  Property:
  To change MODE, start application with -Dmode=document, 
    for product, -Dmode=product
 

Parameters:
type - Valid values are RankingQuery.MODE_DOCUMENT or RankingQuery.MODE_PRODUCT
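
To make "matched for term occurrence" more concrete, here is a minimal sketch that counts how many distinct query terms occur in a short product title. It is an illustration only and is not RankingAlgorithm's scoring: the class TermOccurrence, its methods and the tokenization rule (lower-case, split on non-alphanumeric characters) are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of term-occurrence matching; not RankingAlgorithm's scoring. */
final class TermOccurrence {
    /** Lower-cases the text and splits it on runs of non-letter, non-digit characters. */
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** Number of distinct query terms that occur in the text. */
    static int matchedTerms(String query, String text) {
        Set<String> matched = terms(query);
        matched.retainAll(terms(text)); // keep only the terms present in both
        return matched.size();
    }
}
```

For short product titles, a count like this says more than it would for long documents, where most query terms tend to appear somewhere in the text.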

getMode

public int getMode()
Returns:
RankingQuery.MODE_DOCUMENT or RankingQuery.MODE_PRODUCT

setAlgorithm

public void setAlgorithm(int type)
Set algorithm: SIMPLE, SIMPLE1 or COMPLEX. Default is SIMPLE.

Three algorithms are available: SIMPLE, SIMPLE1 and COMPLEX. SIMPLE is a very fast algorithm and returns queries in <100ms on a 10m wikipedia index (complete index). It can also scale to 100m docs or possibly more. SIMPLE1 is faster than SIMPLE and is better suited to autocomplete-style processing. COMPLEX is a more complex algorithm, so it is a little slower than SIMPLE, but it can still return queries in <200ms on a 10m wikipedia index (complete index). COMPLEX is more accurate and should give the best rankings compared to SIMPLE.

You can also set the scan attribute to fast, medium or full. Fast is the default and the fastest, while a full scan is the most accurate but also slow and memory-intensive.

SIMPLE is also very good and may be better suited than COMPLEX for some types of queries.

 Programmatic:
  rq.setAlgorithm(RankingQuery.ALGORITHM_COMPLEX);
   
  Property: 
  To enable SIMPLE, start application with -Dalgorithm=SIMPLE,
  for SIMPLE1, -Dalgorithm=SIMPLE1 
  for COMPLEX, -Dalgorithm=COMPLEX	 
 
 

Parameters:
type - Valid values are RankingQuery.ALGORITHM_SIMPLE or RankingQuery.ALGORITHM_SIMPLE1 or RankingQuery.ALGORITHM_COMPLEX

getAlgorithm

public int getAlgorithm()
Returns:
RankingQuery.ALGORITHM_SIMPLE or RankingQuery.ALGORITHM_SIMPLE1 or RankingQuery.ALGORITHM_COMPLEX

setAndOr

public void setAndOr(int type)
Set AND, OR or AND_OR matching to control which results are returned. AND is 100%, OR is 0% and AND_OR is 50% relevancy, similar to the mm parameter in Solr. Default is OR.

Parameters:
type - Valid values are RankingQuery.AND or RankingQuery.AND_OR or RankingQuery.OR. One can also set this to any value between 0 and 100 as needed.
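
One way to read the percentages above is as a minimum share of query terms that must match, in the spirit of Solr's mm. The sketch below works under that reading. The class MatchThreshold, the rounding rule (round up, always require at least one term) and the assumption that the constants equal their percentages are illustrative guesses, not documented behavior of RankingQuery.

```java
/** Hypothetical illustration of a percentage match threshold; not the library's implementation. */
final class MatchThreshold {
    // Percentages from the setAndOr description; the real constant values are assumed.
    static final int AND = 100;
    static final int AND_OR = 50;
    static final int OR = 0;

    /**
     * Number of query terms a document must contain for the given percentage.
     * Assumption: round up, and always require at least one matching term.
     */
    static int requiredTerms(int queryTerms, int percent) {
        if (queryTerms < 1) {
            throw new IllegalArgumentException("queryTerms must be >= 1");
        }
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        int required = (queryTerms * percent + 99) / 100; // integer ceiling
        return Math.max(1, required);
    }
}
```

Under these assumptions a four-term query needs all four terms with AND, two with AND_OR and one with OR.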

getAndOr

public int getAndOr()
Returns:
a value between 0 and 100: 0 = RankingQuery.OR, 50 = RankingQuery.AND_OR, 100 = RankingQuery.AND

close

public void close()
           throws java.lang.Throwable
Closes the IndexReader objects opened.

Note: 1. The IndexReader object is closed only if the constructor RankingQuery(indexPath) was used. If the IndexSearcher or IndexReader is passed explicitly, as in RankingQuery(IndexSearcher is) or RankingQuery(IndexReader ir), it is not closed. 2. Make sure close is called last, since RankingHits holds a reference to the IndexReader and can no longer use it once the IndexReader is closed. See SimpleExample.java for usage.

Throws:
java.lang.Throwable
See Also:
RankingQuery(String)
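
The ownership rule in the note (close only what you opened yourself) can be sketched as below. OwnedReader and its factory methods are hypothetical names for illustration. A Runnable stands in for closing the reader so that the example needs no Lucene classes. It does not reproduce RankingQuery's source.

```java
/** Hypothetical wrapper showing "close only what you opened"; not RankingQuery's source. */
final class OwnedReader {
    private final Runnable closeAction; // stands in for IndexReader.close()
    private final boolean owned;        // true only when this wrapper opened the reader itself

    private OwnedReader(Runnable closeAction, boolean owned) {
        this.closeAction = closeAction;
        this.owned = owned;
    }

    /** Analogous to RankingQuery(String indexPath): the wrapper opened it, so it closes it. */
    static OwnedReader opened(Runnable closeAction) {
        return new OwnedReader(closeAction, true);
    }

    /** Analogous to RankingQuery(IndexReader): the caller supplied it and must close it. */
    static OwnedReader borrowed(Runnable closeAction) {
        return new OwnedReader(closeAction, false);
    }

    /** Closes the underlying reader only if this wrapper owns it. */
    void close() {
        if (owned) {
            closeAction.run();
        }
    }
}
```

With this pattern a caller who passes in a reader keeps control of its lifetime and can share it across several queries.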

doc

public org.apache.lucene.document.Document doc(int docid)
                                        throws java.lang.Throwable
Similar to IndexSearcher doc(id), returns a Lucene Document object

Parameters:
docid - Lucene document id
Returns:
Document object
Throws:
java.lang.Throwable

search

public RankingHits search(org.apache.lucene.search.Query query)
                   throws java.lang.Throwable
Search a Lucene index for terms in the query. RankingQuery needs to have been instantiated with Lucene IndexReader or IndexSearcher objects.

Parameters:
query - A Lucene query object
Returns:
A list of hits matching the search terms
Throws:
java.lang.Throwable
See Also:
RankingHits

search

public RankingHits search(java.lang.String field,
                          java.lang.String searchTerms)
                   throws java.lang.Throwable
Search a Lucene index for terms in the query. RankingQuery needs to have been instantiated with the path to a Lucene index.
 Example:
 		RankingQuery rq = new RankingQuery("/lucene/index/perl"); 
 		RankingHits rh = rq.search("search_field", "text"); 
		System.out.println("num hits=" + rh.getNumHits()); 
		for (int i=0; i<rh.getNumHits() && i<10; i++) {
			System.out.println("i=" + i + "--" + rh.score(i) + "--docid=" + rh.docid(i) + "--doc=" + rh.doc(i).get(title) );
		}
		rq.close(); // closes the IndexReader opened by RankingQuery(indexPath); call it last
  
 

Parameters:
field - field to search
searchTerms - search terms
Returns:
RankingHits object, a list of documents matching the search terms
Throws:
java.lang.Throwable
See Also:
RankingHits

search

public RankingHits search(org.apache.lucene.search.Query query,
                          org.apache.lucene.search.IndexSearcher is)
                   throws java.lang.Throwable
Search a Lucene index for terms in the query. Needs a Lucene Query object and an IndexSearcher object to access the index.
 Example 1:
 		RankingQuery rq = new RankingQuery(); 
 		RankingHits rh = rq.search(query, is); //is = Lucene IndexSearcher object
		System.out.println("num hits=" + rh.getNumHits() + "--no docs=" + is.maxDoc()); 
		for (int i=0; i<rh.getNumHits() && i<10; i++) {
			System.out.println("i=" + i + "--" + rh.score(i) + "--docid=" + rh.docid(i) + "--doc=" + rh.doc(i).get(title) );
		}
 
 Example 2: 
 		RankingQuery rq = new RankingQuery();
 		StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		QueryParser parser = new QueryParser(Version.LUCENE_30, field, analyzer);
		Query query = parser.parse(searchterms);
		TopScoreDocCollector tdc = TopScoreDocCollector.create(1000, true);
 		rq.search(query, null, indexreader, tdc); // indexreader = Lucene IndexReader object
 		int hits = tdc.getTotalHits();
		System.out.println("num hits=" + hits + "--no docs=" + indexreader.maxDoc()); 
		ScoreDoc[] sda = tdc.topDocs().scoreDocs; // fetch once; topDocs() must not be called repeatedly
		for (int i=0; i<sda.length && i<10; i++) {
			ScoreDoc sd = sda[i];
			System.out.println("i=" + i + "--" + sd.score + "--docid=" + sd.doc + "--doc=" + indexreader.document(sd.doc).get(title) );
		}
    
 

Parameters:
query - Lucene query object
is - is a Lucene IndexSearcher object.
Returns:
RankingHits object, a list of documents matching the search terms
Throws:
java.lang.Throwable
See Also:
RankingHits, RankingScore

search

public RankingHits search(org.apache.lucene.search.Query query,
                          org.apache.lucene.index.IndexReader r)
                   throws java.lang.Throwable
Search a Lucene index for terms in the query. Needs a Query object and an IndexReader to access the index
 Example:
 		RankingQuery rq = new RankingQuery(); 
 		RankingHits rh = rq.search(query, reader); // reader = Lucene IndexReader object
		System.out.println("num hits=" + rh.getNumHits() + "--no docs=" + reader.maxDoc()); 
		for (int i=0; i<rh.getNumHits() && i<10; i++) {
			System.out.println("i=" + i + "--" + rh.score(i) + "--docid=" + rh.docid(i) + "--doc=" + rh.doc(i).get(title) );
		}
  
 

Parameters:
query - Lucene query object
r - Lucene IndexReader object
Returns:
RankingHits object, a list of documents matching the search terms
Throws:
java.lang.Throwable
See Also:
RankingHits

search

public int search(org.apache.lucene.search.Query query,
                  org.apache.lucene.search.Filter filter,
                  org.apache.lucene.index.IndexReader ir,
                  org.apache.lucene.search.Collector collector)
           throws java.lang.Throwable
Search a Lucene index for terms in the query. Uses the Query object and IndexReader to access the index

A Lucene Filter can be used to limit returned results, while a Lucene Collector (like TopScoreDocCollector) can be used to collect the relevant docs.
 Example: 
 		IndexReader reader = IndexReader.open(FSDirectory.open(new File(index))); 
		RankingQuery rq = new RankingQuery();			
		StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		QueryParser parser = new QueryParser(Version.LUCENE_30, field, analyzer);
		Query query = parser.parse(searchterms);
		TopScoreDocCollector tdc = TopScoreDocCollector.create(1000, true);
		rq.search(query, null, reader, tdc); // reader = Lucene IndexReader object
		int hits = tdc.getTotalHits();
		ScoreDoc[] sda = new ScoreDoc[0]; // stays empty when there are no hits
		if (hits > 0) {
			sda = tdc.topDocs().scoreDocs;
		}
		System.out.println("num hits=" + hits + "--no docs=" + reader.maxDoc()); 
		for (int i=0; i<sda.length && i<10; i++) {
			ScoreDoc sd = sda[i];
			System.out.println("i=" + i + "--" + sd.score + "--docid=" + sd.doc + "--doc=" + reader.document(sd.doc).get(title) );
		}
		reader.close();
    
 

Parameters:
query - Lucene query object
filter - is a Lucene filter object
ir - is a Lucene IndexReader object.
collector - to collect returned results
Returns:
hits, the number of hits
Throws:
java.lang.Throwable
See Also:
RankingHits, RankingScore
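
The role of a collector in the overload above, counting every match while keeping only the best N, can be sketched without Lucene as a bounded min-heap. TopNCollector and Hit are hypothetical stand-ins written for this illustration. They are not Lucene's TopScoreDocCollector or ScoreDoc, and the sketch makes no claim about how those classes are implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a top-N collector; it does not use Lucene classes. */
final class TopNCollector {
    /** A (doc, score) pair, similar in spirit to Lucene's ScoreDoc. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int limit;
    private int totalHits;
    // Min-heap on score: the weakest retained hit sits at the head.
    private final PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));

    TopNCollector(int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        this.limit = limit;
    }

    /** Called once per matching document. */
    void collect(int doc, float score) {
        totalHits++;
        if (heap.size() < limit) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll(); // evict the weakest retained hit
            heap.add(new Hit(doc, score));
        }
    }

    /** Counts every match, including those that were not retained. */
    int getTotalHits() { return totalHits; }

    /** Retained hits, best score first. */
    List<Hit> topDocs() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }
}
```

Because the total hit count can exceed the number of retained hits, loops over the results are safest when bounded by the length of the returned list rather than by the hit count.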

search

public RankingHits search(org.apache.lucene.search.Query query,
                          org.apache.lucene.search.Filter filter,
                          org.apache.lucene.index.IndexReader ir,
                          int docs)
                   throws java.lang.Throwable
Search a Lucene index for terms in the query. Uses the Query object and IndexReader to access the index

A Lucene Filter can be used to limit returned results, while docs limits the number of top hits returned.
 Example:
 		RankingQuery rq = new RankingQuery(); 
 		RankingHits rh = rq.search(query, filter, ir, 100); 
		System.out.println("num hits=" + rh.getNumHits() + "--no docs=" + ir.maxDoc()); 
		for (int i=0; i<rh.getNumHits() && i<100; i++) {
			System.out.println("i=" + i + "--" + rh.score(i) + "--docid=" + rh.docid(i) + "--doc=" + rh.doc(i).get(title) );
		}
 
    
 

Parameters:
query - Lucene query object
filter - is a Lucene filter object
ir - is a Lucene IndexReader object.
docs - number of top hits
Returns:
RankingHits object, a list of documents matching the search terms
Throws:
java.lang.Throwable
See Also:
RankingHits, RankingScore

search

public int search(org.apache.lucene.search.Weight weight,
                  org.apache.lucene.search.Filter filter,
                  org.apache.lucene.index.IndexReader ir,
                  org.apache.lucene.search.Collector collector)
           throws java.lang.Throwable
Similar to Lucene search. Uses the Weight object and IndexReader to access the index.

A Lucene Filter can be used to limit returned results, while a Lucene Collector (like TopScoreDocCollector) can be used to collect the relevant docs.
 Example: 
 		IndexReader reader = IndexReader.open(FSDirectory.open(new File(index))); 
		RankingQuery rq = new RankingQuery();			
		StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		QueryParser parser = new QueryParser(Version.LUCENE_30, field, analyzer);
		Query query = parser.parse(searchterms);
		TopScoreDocCollector tdc = TopScoreDocCollector.create(1000, true);
		Weight weight = query.weight(new IndexSearcher(reader)); // Lucene 3.0: Query.weight(Searcher)
		rq.search(weight, null, reader, tdc); 
		int hits = tdc.getTotalHits();
		ScoreDoc[] sda = new ScoreDoc[0]; // stays empty when there are no hits
		if (hits > 0) {
			sda = tdc.topDocs().scoreDocs;
		}
		System.out.println("num hits=" + hits + "--no docs=" + reader.maxDoc()); 
		for (int i=0; i<sda.length && i<10; i++) {
			ScoreDoc sd = sda[i];
			System.out.println("i=" + i + "--" + sd.score + "--docid=" + sd.doc + "--doc=" + reader.document(sd.doc).get(title) );
		}
		reader.close();
    
 

Parameters:
weight - Lucene weight object
filter - is a Lucene filter object
ir - is a Lucene IndexReader object.
collector - to collect returned results
Returns:
hits, the number of hits
Throws:
java.lang.Throwable

search

public int search(org.apache.lucene.search.Weight weight,
                  org.apache.lucene.search.Filter filter,
                  org.apache.lucene.index.IndexReader ir,
                  org.apache.lucene.search.Collector collector,
                  org.tgels.search.rankingalgorithm.Parameters parms)
           throws java.lang.Throwable
Similar to Lucene search. Uses the Weight object and IndexReader to access the index.

A Lucene Filter can be used to limit returned results, while a Lucene Collector (like TopScoreDocCollector) can be used to collect the relevant docs. A Parameters object can be used to pass options. It is useful in a multi-threaded environment where each request uses a different algorithm, mode or scan.
 Example: 
 		IndexReader reader = IndexReader.open(FSDirectory.open(new File(index))); 
		RankingQuery rq = new RankingQuery();			
		StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		QueryParser parser = new QueryParser(Version.LUCENE_30, field, analyzer);
		Query query = parser.parse(searchterms);
		TopScoreDocCollector tdc = TopScoreDocCollector.create(1000, true);
		Parameters parms = new Parameters(rq);
		parms.algorithm = RankingQuery.ALGORITHM_SIMPLE;
		Weight weight = query.weight(new IndexSearcher(reader)); // Lucene 3.0: Query.weight(Searcher)
		rq.search(weight, null, reader, tdc, parms); 
		int hits = tdc.getTotalHits();
		ScoreDoc sda[] = null;
		if (hits > 0) {
			sda = tdc.topDocs().scoreDocs;
		
			System.out.println("num hits=" + hits + "--no docs=" + reader.maxDoc()); 
			for (int i=0; i<sda.length && i<10; i++) {
				ScoreDoc sd = sda[i];
				System.out.println("i=" + i + "--" + sd.score + "--docid=" + sd.doc + "--doc=" + reader.document(sd.doc).get(title) );
			}
		}
		// Second request on the same reader with a different algorithm and mode
		parms.algorithm = RankingQuery.ALGORITHM_COMPLEX;
		parms.mode = RankingQuery.MODE_PRODUCT;
		tdc = TopScoreDocCollector.create(1000, true); // fresh collector for the second search
		rq.search(weight, null, reader, tdc, parms);
		reader.close();
    
 

Parameters:
weight - Lucene weight object
filter - is a Lucene filter object
ir - is a Lucene IndexReader object.
collector - to collect returned results
parms - list of options
Returns:
hits, the number of hits
Throws:
java.lang.Throwable
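
The multi-threaded use described for this overload, where each request carries its own algorithm and mode instead of changing shared settings, can be sketched as follows. All names here (PerRequestOptions, Options, search, searchAll) are hypothetical and do not mirror the real Parameters class. The search method is a stand-in that only echoes its inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of per-request options; names do not mirror the real Parameters class. */
final class PerRequestOptions {
    /** Immutable option bundle created for a single request. */
    static final class Options {
        final String algorithm;
        final String mode;
        Options(String algorithm, String mode) { this.algorithm = algorithm; this.mode = mode; }
    }

    /** Stand-in for a search call: it reads only the options passed in, never shared state. */
    static String search(String query, Options options) {
        return query + " [" + options.algorithm + "/" + options.mode + "]";
    }

    /** Runs each query on a thread pool with its own options; results keep the input order. */
    static List<String> searchAll(List<String> queries, List<Options> options) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < queries.size(); i++) {
                final String q = queries.get(i);
                final Options o = options.get(i);
                Callable<String> task = () -> search(q, o);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt status
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing the options with each call means no request can observe another request's algorithm or mode, which is the risk when a shared object is reconfigured through setters between searches.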

addToLowerBoostSet

public void addToLowerBoostSet(java.lang.String keywords)
Experimental, can change


log

public static void log(java.lang.String s)