I was invited to present again this year at Lucene/Solr Revolution 2014 in Washington, D.C. My presentation took place this afternoon and covered the topic of “Semantic & Multilingual Strategies in Lucene/Solr.” The material was drawn partly from the extensive Multilingual Search chapter (ch. 14) in Solr in Action and partly from the exciting semantic search work we’ve been doing recently at CareerBuilder.
Video:
Slides:
http://www.slideshare.net/treygrainger/semantic-multilingual-strategies-in-lucenesolr
Talk Summary: When searching on text, choosing the right CharFilters, Tokenizers, stemmers, and other TokenFilters for each supported language is critical. Additional tools of the trade include language detection through UpdateRequestProcessors, part-of-speech analysis, entity extraction, stopword and synonym lists, relevancy differentiation for exact vs. stemmed vs. conceptual matches, and identification of statistically interesting phrases per language. For multilingual search, you also need to choose among several strategies, such as searching across multiple fields, using a separate collection per language combination, or combining multiple languages in a single field (custom code is required for this and will be open sourced). Each of these has its own strengths and weaknesses depending upon your use case.
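To make the per-language analysis-chain and language-detection pieces a bit more concrete, here is a minimal sketch of the kind of Solr configuration involved. The field names (text_en, text_de), the specific filters chosen, and the langid parameter values are illustrative assumptions rather than the exact configuration covered in the talk:

<!-- schema.xml: one analysis chain per language -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.KStemFilterFactory"/> <!-- light English stemmer -->
  </analyzer>
</fieldType>

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt" ignoreCase="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="text_en" type="text_en" indexed="true" stored="true"/>
<field name="text_de" type="text_de" indexed="true" stored="true"/>

Language detection at index time can then route each document’s content into the matching per-language field using the langid UpdateRequestProcessor (again, the source field name and parameter values here are just assumptions for the sketch):

<!-- solrconfig.xml: detect the language of the "text" field and map it into text_en / text_de -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text</str>
    <str name="langid.langField">language</str>
    <bool name="langid.map">true</bool>
    <str name="langid.whitelist">en,de</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>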
This talk will provide a tutorial (with code examples) on how to pull off each of these strategies. It will also compare and contrast the different kinds of stemmers, review the precision/recall impact of stemming vs. lemmatization, and describe some techniques for extracting meaningful relationships between terms to power a per-language semantic search experience. Come learn how to build an excellent semantic and multilingual search system using the best tools and techniques Lucene/Solr has to offer!
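As a small taste of the exact vs. stemmed relevancy differentiation and the precision/recall tradeoff mentioned above: stemming generally raises recall (more word variants match) at some cost to precision, so one common pattern is to index the same content both unstemmed and stemmed and then weight exact matches higher at query time. The field names and boost values below are illustrative assumptions, not the configuration from the talk:

<!-- schema.xml: the same English content indexed with and without stemming -->
<fieldType name="text_en_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/> <!-- no stemmer, so only exact terms match -->
  </analyzer>
</fieldType>
<field name="text_en_exact" type="text_en_exact" indexed="true" stored="false"/>
<copyField source="text_en" dest="text_en_exact"/>

At query time, eDisMax can then prefer exact matches over stemmed ones, e.g. defType=edismax&qf=text_en_exact^10 text_en (the boosts are arbitrary and would be tuned per use case).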