Extract entities from document with Solr Text Tagger
30 Mar 2017
Algorithms for recognizing entities in text are one of the most crucial aspects of text analysis. They lead to a better understanding of the content, enable additional operations like filtering or grouping and - most importantly - allow data to be processed automatically. In the previous post I announced that I would combine text indexing with such extraction, and in order to keep my promise I created a fork of Solr Text Tagger.
To demonstrate entity recognition I will track down city names in the indexed documents and then add their IDs to those documents. After that, operations like faceting or filtering can be done in the standard Solr way.
I will also show that cities can be found in a query, together with their offsets. This can be used to rewrite the initial text to enable proximity search, or to query another set of fields.
It is worth mentioning that Solr Text Tagger works with a user-defined list of entities. What’s more, one entity can be recognized by many names, in my case city names from different countries.
How does it work?
At a high level of abstraction it works as follows: Solr Text Tagger scans the text in search of previously declared entities. Those entities are Solr documents with an id & names, in my case city_names_tags.
Let’s discuss what is important here: during indexing, the tokens in the tag field are concatenated into a single final token, as we need a complete match in the text. Another thing is that the terms of this field are kept in memory in an efficient structure called a Finite State Transducer (FST).
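To make the "complete match" point concrete, here is a minimal Python sketch. The normalization used (lowercasing, splitting on whitespace, joining with a single space) is an assumption made for illustration; the real analysis chain is whatever the tag field type declares in the schema.

```python
def to_tag_term(name):
    """Collapse an entity name into one term, roughly the way a
    concatenating analyzer might: lowercase, split, join with one space."""
    return " ".join(name.lower().split())


# Hypothetical in-memory stand-in for the tag field's term dictionary.
tag_terms = {to_tag_term(n) for n in ["New York", "York", "Liverpool"]}

print("new york" in tag_terms)  # the whole name is stored as a single term
print("new" in tag_terms)       # a fragment of a name is not a term by itself
```

Because every name is stored as one term, a partial name such as new can never be reported as an entity on its own.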
What is an FST? It is a kind of finite state machine (FSM), that is a graph with nodes (states) & labeled edges (transitions). A finite state machine has an input which determines the transitions between nodes. An FST is an extension of a state machine that also produces output while it consumes the input. In Lucene it is used to map a term to an integer and was implemented by Michael McCandless, whose example of such a structure is quite clear:
FST maps the sorted words mop, moth, pop, star, stop and top to their ordinal number (0, 1, 2, …). As you traverse the arcs, you sum up the outputs, so stop hits 3 on the s and 1 on the o, so its output ordinal is 4.
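The summing of outputs can be tried out with a small Python sketch. It is not Lucene's implementation: it is a plain trie (no suffix sharing or minimization) in which each arc carries an integer output, and the class and function names are made up for this example.

```python
class Node:
    def __init__(self):
        self.arcs = {}       # character -> (output, child Node)
        self.final = False   # True if a word ends in this node


def build(words):
    """Build a trie whose arc outputs sum up to each word's ordinal."""
    root = Node()
    for ordinal, word in enumerate(sorted(set(words))):
        node, total = root, 0
        for ch in word:
            if ch not in node.arcs:
                # Words arrive in sorted order, so the first word through a
                # new arc has the smallest ordinal below it; put the missing
                # part of the ordinal on this arc.
                node.arcs[ch] = (ordinal - total, Node())
            output, node = node.arcs[ch]
            total += output
        node.final = True
    return root


def lookup(root, word):
    """Follow the arcs for `word`, summing outputs; None if not a word."""
    node, total = root, 0
    for ch in word:
        if ch not in node.arcs:
            return None
        output, node = node.arcs[ch]
        total += output
    return total if node.final else None


fst = build(["mop", "moth", "pop", "star", "stop", "top"])
print(lookup(fst, "stop"))  # 3 on the s plus 1 on the o
```

A real FST additionally shares common suffixes between words, which is where most of the memory saving comes from.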
This mapping reduces memory usage, but increases the CPU cost of lookups. As memory is the main concern with Solr/Lucene, it’s a cost we can afford.
The last question to answer in this section is how the recognition algorithm works with this FST representation. Let’s say we are querying the phrase new york times - there is a hit for new, as there are multiple city names starting that way. Then the second token is taken into consideration, and there is still a hit in the dictionary (new york), but york alone is also marked as a starting point for a new entity. The next steps are checking new york times and york times, both of which end the lookup. Times is also a dead end, so we finish with the entities york & new york, unless overlapping entities are reduced to the longest one. The path for such a result is as follows:
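The walk over new york times can be reproduced with a short Python sketch. It is a simplification under stated assumptions: a set of name prefixes stands in for the FST, the scan restarts from every token instead of working in a single pass, and the name new orleans is added only so that more than one name starts with new.

```python
def tag_tokens(tokens, names):
    """Return (start, end, name) token spans for every known name found.

    `names` holds lowercased, space-joined entity names. The prefix set
    answers the question "can this phrase still grow into a known name?".
    """
    prefixes = set()
    for name in names:
        words = name.split()
        for i in range(1, len(words) + 1):
            prefixes.add(" ".join(words[:i]))

    tags = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            phrase = " ".join(tokens[start:end])
            if phrase not in prefixes:
                break  # dead end: no name continues this way
            if phrase in names:
                tags.append((start, end, phrase))
    return tags


cities = {"new york", "york", "new orleans"}
print(tag_tokens("new york times".split(), cities))
```

With overlapping entities kept, both new york and york are reported; times never matches anything.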
Before we start tagging, the indexing part has to be set up. For that purpose I wrote my own Solr update client: ProcessorAwareConcurrentUpdateSolrClient. I wanted to be able to pass the update.chain parameter, which decides which update processor chain will be used. For cities I need the default one; for articles (my source documents) it has to use my entity recognition processor.
For more information about that process I would again point you to my previous post.
The processor I wrote retrieves one of the document fields and passes it to the handler responsible for text tagging. The returned city ids are added to the document as a new field.
It is worth mentioning that the list of cities I used is part of Solr Text Tagger, so you can also use it for tests.
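The logic of such a processor can be sketched in a few lines of Python. This is not the actual processor: the field names title and city_ids, the function names and the toy tagger are assumptions made for this example, and only the Liverpool id comes from the test described later in this post.

```python
def add_city_ids(doc, tagger, source_field="title", target_field="city_ids"):
    """Tag the text of one field and store the matched entity ids in another."""
    text = doc.get(source_field, "")
    ids = []
    for entity_id in tagger(text):
        if entity_id not in ids:  # keep first-seen order, drop duplicates
            ids.append(entity_id)
    enriched = dict(doc)          # leave the incoming document untouched
    if ids:
        enriched[target_field] = ids
    return enriched


def toy_tagger(text):
    """Stand-in for the tagging handler: single-word, case-insensitive lookup."""
    known = {"liverpool": "2644210"}
    return [known[w] for w in text.lower().split() if w in known]


doc = {"id": "a1", "title": "Liverpool FC - Liverpool Echo"}
print(add_city_ids(doc, toy_tagger))
```

In the real setup the tagger call goes to the Solr Text Tagger handler rather than to a local dictionary.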
Although Solr Text Tagger is a working piece of code, I decided to fork it and change its behaviour a bit - I changed the way it processes the text that has to be tagged. The solrconfig definition is unchanged, but I provided my version of the library to the Solr instance:
The field parameter says from which field tags should be retrieved; overlaps says how overlapping entities should be handled (for example York & New York). In this case only the longest one will be returned. More detailed information about the changes I made to the tagger can be found on GitHub.
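The effect of the overlaps setting can be illustrated with a simplified Python sketch. The mode names "all" and "longest" are placeholders chosen for this example, not the tagger's own parameter values, and the tie-breaking rule (prefer the right-most span) is an assumption.

```python
def reduce_overlaps(tags, mode="longest"):
    """Filter (start, end, name) character spans; `end` is exclusive."""
    if mode == "all":
        return sorted(tags)
    kept = []
    # Visit the longest spans first; among equal lengths, the right-most one.
    for tag in sorted(tags, key=lambda t: (-(t[1] - t[0]), -t[0])):
        start, end, _ = tag
        # Keep the span only if it does not overlap anything already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append(tag)
    return sorted(kept)


found = [(0, 8, "new york"), (4, 8, "york")]
print(reduce_overlaps(found, "longest"))
print(reduce_overlaps(found, "all"))
```

For the text new york times the first call keeps only new york, while the second one returns both entities.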
I added to Solr an article with the title Mario Balotelli Brendan Rodgers will help me realise my potential at Liverpool FC - Liverpool Echo. As expected, the indexed tag was the one with id 2644210 - Liverpool.
I also used curl to test how the request handler works with a provided text - a query could be analyzed that way. What’s more, I tried different options for overlapping: the first query chooses the longest city name, the second one should return all of them:
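A request of that kind could be prepared from Python roughly as shown here. The host, the core name cities, the /tag handler path and the fl value are assumptions, and the overlaps values ALL and LONGEST_DOMINANT_RIGHT are the ones documented for upstream Solr Text Tagger, so the fork may behave differently. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_tag_request(text, overlaps,
                      base_url="http://localhost:8983/solr/cities/tag"):
    """Build (but do not send) a POST request with plain text to be tagged."""
    params = urlencode({"overlaps": overlaps, "fl": "id,name", "wt": "json"})
    return Request(
        url=f"{base_url}?{params}",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


longest = build_tag_request("new york times", "LONGEST_DOMINANT_RIGHT")
everything = build_tag_request("new york times", "ALL")
print(longest.full_url)
print(everything.full_url)
```

Passing either request to urllib.request.urlopen against a running Solr instance would return the matched entities together with their offsets.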
It also worked as expected. I’d suggest paying attention to the offsets returned by those queries: in both cases the start and end positions of the extracted entities are valid.
I ran more queries and didn’t encounter any problems with Solr Text Tagger - the only difficulties come from the data. I preprocessed it a bit - I filtered out cities with a population below 100k to avoid entries like Merkel - but more should be done. For example, if the word liga (league in Polish) appears in a title, it is recognized as the Latvian city Riga (as it is one of the foreign names of this city). City names should be matched to the language of the document or search, and only that subset used - but this & other improvements are out of the scope of this post.
In this post I showed that - with minor modifications - Solr Text Tagger can be used to recognize entities in the text of documents being indexed. The recognized entities can then be processed further, for example added to a document field or used to modify a user query. Another approach could be to use their offsets to add payloads to the text and then boost documents - but that may be a topic for another post.