Indexing Using Tika with Solr & Java

Nagendra Nagarajayya

solr.tgels.com



Introduction

This document shows how to submit content such as text, HTML and XML, as well as rich content such as PDF, Word and ODF documents, to Solr for indexing, using Java and SolrJ.

Code

The code below is adapted from the Lucene IndexFiles demo and modified to submit files to Solr for indexing.

import java.io.File;
import java.io.FileNotFoundException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;

public class IndexFiles {

    public static void main(String[] args) throws Exception {
        // The first argument (file or directory to index) is required.
        if (args.length < 1) {
            System.err.println("Usage: java IndexFiles <file-or-directory> [solr-url]");
            System.exit(1);
        }
        // Default Solr URL; override it with the optional second argument.
        String urlString = "http://localhost:8989/solr";
        if (args.length > 1) {
            urlString = args[1];
        }

        SolrServer solr = new CommonsHttpSolrServer(urlString);
        indexDocs(solr, new File(args[0]));
    }

    static void indexDocs(SolrServer solr, File file) throws Exception {
        // do not try to index files that cannot be read
        if (file.canRead()) {
            if (file.isDirectory()) {
                String[] files = file.list();
                // an IO error could occur
                if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                        indexDocs(solr, new File(file, files[i]));
                    }
                }
            } else {
                System.out.println("adding " + file);
                try {
                    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
                    // Use the text after the last dot as the type; default to "text".
                    String[] parts = file.getName().split("\\.");
                    String type = "text";
                    if (parts.length > 1) {
                        type = parts[parts.length - 1];
                    }
                    req.addFile(file);
                    req.setParam("literal.id", file.getAbsolutePath());
                    req.setParam("literal.name", file.getName());
                    req.setParam("literal.content_type", type);
                    // Commit after every file; waitFlush and waitSearcher are both true.
                    req.setAction(ACTION.COMMIT, true, true);

                    solr.request(req); // submits one request at a time
                } catch (FileNotFoundException fnfe) {
                    fnfe.printStackTrace();
                }
            }
        }
    }
}


The above code allows you to get started using Solr, Tika and Solr Cell very easily.
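
The value sent as literal.content_type is derived from the file name, so the way the extension is picked matters for names with more than one dot (for example report.final.pdf). A minimal sketch of one way to do this with plain Java is shown here. The class name FileExtension and the method extensionOf are hypothetical helpers introduced only for illustration; they are not part of SolrJ or Tika.

```java
import java.util.Locale;

// Hypothetical helper, not part of SolrJ or Tika: picks the text after the
// last dot of a file name, or a fallback when there is no usable extension.
class FileExtension {

    static String extensionOf(String fileName, String fallback) {
        int dot = fileName.lastIndexOf('.');
        // No dot, a leading dot only (".bashrc"), or a trailing dot ("notes.")
        // are all treated as "no extension".
        if (dot <= 0 || dot == fileName.length() - 1) {
            return fallback;
        }
        // Lower-case the result so "PDF" and "pdf" are treated alike.
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("report.final.PDF", "text")); // prints "pdf"
        System.out.println(extensionOf("README", "text"));           // prints "text"
    }
}
```

The result is only a file extension, not a real MIME type, so it is best treated as a rough label.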

Note:

I had to copy all the jar files under contrib/extraction/lib to examples/lib to get Tika to work, even though the lib parameter in solrconfig.xml was set to point to contrib/extraction/lib.

Conclusion

A few lines of Java code are all it takes to have Tika index your rich-content documents in Solr.

References

http://wiki.apache.org/solr/ExtractingRequestHandler

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika

http://www.abcseo.com/tech/search/integrating-solr-and-tika