Indexing Using Tika with Solr & Java
Nagendra Nagarajayya
solr.tgels.com
This document shows how to submit content such as text, HTML, and XML, as well as rich content such as PDFs and Word/ODF documents, to Solr for indexing, using Java and SolrJ in a very simple way.
The code below is adapted from Lucene's IndexFiles demo code and modified to submit documents and files to Solr for indexing.
import java.io.File;
import java.io.FileNotFoundException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class IndexFiles {

    // Usage: java IndexFiles <file-or-directory> [solr-url]
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: java IndexFiles <file-or-directory> [solr-url]");
            System.exit(1);
        }
        String urlString = "http://localhost:8989/solr";
        // the optional second argument overrides the default Solr URL
        if (args.length > 1) {
            urlString = args[1];
        }
        SolrServer solr = new CommonsHttpSolrServer(urlString);
        indexDocs(solr, new File(args[0]));
    }

    static void indexDocs(SolrServer solr, File file) throws Exception {
        // do not try to index files that cannot be read
        if (!file.canRead()) {
            return;
        }
        if (file.isDirectory()) {
            String[] files = file.list();
            // list() returns null if an IO error occurred
            if (files != null) {
                for (String name : files) {
                    indexDocs(solr, new File(file, name));
                }
            }
            return;
        }
        System.out.println("adding " + file);
        try {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            // use the last dot-separated segment of the name as the type, defaulting to "text"
            String[] parts = file.getName().split("\\.");
            String type = "text";
            if (parts.length > 1) {
                type = parts[parts.length - 1];
            }
            req.addFile(file);
            req.setParam("literal.id", file.getAbsolutePath());
            req.setParam("literal.name", file.getName());
            req.setParam("literal.content_type", type);
            // commit after every file, waiting for the flush and for a new searcher
            req.setAction(ACTION.COMMIT, true, true);
            solr.request(req); // submits one request at a time
        } catch (FileNotFoundException fnfe) {
            fnfe.printStackTrace();
        }
    }
}
The code above is enough to get started with Solr, Tika, and Solr Cell.
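For orientation, the request built in the listing can be pictured as a POST to /update/extract whose parameters carry the literal field values. The sketch below only assembles such a URL with the standard library and never contacts Solr. The class and method names (ExtractUrlSketch, buildExtractUrl) and the sample file path are hypothetical, and the exact parameter set SolrJ sends (for example, extra commit-related flags) may differ by version.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: illustrates the shape of a Solr Cell extract request URL.
class ExtractUrlSketch {

    // URL-encodes a value as UTF-8; UTF-8 is always supported, so the
    // checked exception is converted to an unchecked one.
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds <solrUrl>/update/extract with the same literal.* fields the listing sets.
    static String buildExtractUrl(String solrUrl, String id, String name, String type) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("literal.id", id);
        params.put("literal.name", name);
        params.put("literal.content_type", type);
        params.put("commit", "true"); // assumption: a plain commit flag only

        StringBuilder url = new StringBuilder(solrUrl).append("/update/extract");
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator)
               .append(enc(entry.getKey()))
               .append('=')
               .append(enc(entry.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildExtractUrl(
                "http://localhost:8989/solr", "/docs/report.pdf", "report.pdf", "pdf"));
    }
}
```

Seeing the parameters spelled out this way can help when comparing what the Java client submits against a request made by hand with a tool such as curl.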
Note:
I had to copy all the jar files under contrib/extraction/lib to examples/lib to get Tika to work, even though the lib parameter in solrconfig.xml was already set to point to contrib/extraction/lib.
A few lines of Java code are all it takes to index your rich content documents with Tika.
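The copy step in the note can also be scripted. The following is a minimal sketch using only the standard library; the class and method names (CopyExtractionJars, copyJars) are hypothetical, and it assumes the two directories from the note sit under a Solr installation directory passed as the first argument (defaulting to the current directory).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies the extraction jars next to the example's own jars.
class CopyExtractionJars {

    // Copies every *.jar directly inside sourceDir into targetDir
    // (creating targetDir if needed) and returns the number of files copied.
    static int copyJars(Path sourceDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int copied = 0;
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(sourceDir, "*.jar")) {
            for (Path jar : jars) {
                Files.copy(jar, targetDir.resolve(jar.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0], if given, is the Solr installation directory.
        Path solrDir = Paths.get(args.length > 0 ? args[0] : ".");
        int n = copyJars(solrDir.resolve("contrib/extraction/lib"),
                         solrDir.resolve("examples/lib"));
        System.out.println("copied " + n + " jar file(s)");
    }
}
```

Existing files with the same name in the target directory are overwritten, so the sketch can be rerun after an upgrade.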