In a previous blog post, Denny and Kyle described how to train a classifier to isolate mentions of specific kinds of people, places, and things in free-text documents, a task known as Named Entity Recognition (NER). In general, tools such as Stanford CoreNLP can do a very good job of this for formal, well-edited text such as newspaper articles. However, a lot of the data that we need to process at HumanGeo comes from social media, in particular Twitter. Tweets are full of informal language, misspellings, abbreviations, hashtags, @-mentions, URLs, and unreliable capitalization and punctuation. Also, users can talk about anything and everything on Twitter, and new entities that were rarely or never mentioned before may suddenly become popular. All these factors present huge challenges for general-purpose NER systems that were not designed for this type of text.

Fortunately, there is a good deal of academic research on ways to make NER better for Twitter data. In fact, every year since 2015 there has been a shared task at the Workshop on Noisy User-generated Text (W-NUT) for Twitter NER. A shared task is a competition in which all the participants are asked to submit a program for a specific task and the entries are scored and ranked based on a common metric. So we already know which system is the best of those that participated, but we don’t know how good the systems that didn’t compete are, and even the best system is of no use to us if we can’t get our hands on it. Unfortunately, none of the popular off-the-shelf NER tools have participated in this shared task, and I have only been able to find one entry, the seventh-place finisher from 2016, that is currently available on the internet.

With this in mind, I decided to use the test data from the 2016 shared task to evaluate systems that you can actually download and start using today to see how well they perform on tweets. The general-purpose NER systems that I selected are Stanford CoreNLP, spaCy, NLTK, MITIE, and Polyglot. The two Twitter-specific systems that I selected are OSU Twitter NLP Tools and TwitterNER (the seventh place entry for 2016). Each of these systems uses a slightly different set of entity types, so I decided to map the types in the output of these systems to just PERSON, LOCATION, and ORGANIZATION, which were common to all of them. I just ignored any types that didn’t match these three.
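The type mapping described above can be sketched in a few lines of Python. This is only an illustration: the native label names in the table (such as "PER" or "GPE") and the choice to treat "GPE" as a location are assumptions, not the exact mapping used for this evaluation, and the function name is hypothetical.

```python
# Hypothetical mapping from tool-specific labels to the three shared types.
# The label names on the left are assumptions; check each tool's documentation.
COMMON_TYPES = {
    "PERSON": "PERSON",
    "PER": "PERSON",
    "LOCATION": "LOCATION",
    "LOC": "LOCATION",
    "GPE": "LOCATION",  # assumption: geo-political entities count as locations
    "ORGANIZATION": "ORGANIZATION",
    "ORG": "ORGANIZATION",
}


def normalize_entities(entities):
    """Map (text, label) pairs to the shared types and drop everything else."""
    normalized = []
    for text, label in entities:
        common = COMMON_TYPES.get(label.upper())
        if common is not None:  # ignore types outside the three shared ones
            normalized.append((text, common))
    return normalized


example = [("Obama", "PER"), ("Paris", "GPE"), ("iPhone", "PRODUCT")]
print(normalize_entities(example))
```

Keeping the mapping in a single dictionary makes it easy to see, and to change, which native types count toward each of the three shared types.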

The Stanford CoreNLP NER tool can be run with several options that could potentially improve accuracy for tweets. In particular, there is a part-of-speech (POS) tagger that is optimized for tweets. Since part of speech is one of the features used for NER, improving the POS tagger should also improve NER accuracy. Additionally, there are two options for dealing with text that has inconsistent capitalization. This is a big problem for NER systems because, at least in well-edited text, capitalization is one of the strongest clues that a word is part of a proper noun, and therefore is likely to be a named entity. Systems trained only on well-edited text therefore tend to rely on capitalization too strongly when applied to text with inconsistent capitalization. The first option is to preprocess the text with a truecaser, which attempts to automatically figure out what the correct capitalization of the text should be. The second option is to use models that simply ignore case altogether.

Here are the precision, recall, and F1 scores for these systems, sorted by highest F1 score first:

System Name                                 Precision    Recall       F1 Score
Stanford CoreNLP                            0.526600541  0.453416149  0.487275761
Stanford CoreNLP (with Twitter POS tagger)  0.526600541  0.453416149  0.487275761
TwitterNER                                  0.661496966  0.380822981  0.483370288
OSU NLP                                     0.524096386  0.405279503  0.45709282
Stanford CoreNLP (with caseless models)     0.547077922  0.392468944  0.457052441
Stanford CoreNLP (with truecasing)          0.413084823  0.421583851  0.417291066
MITIE                                       0.322916667  0.457298137  0.378534704
spaCy                                       0.278140062  0.380822981  0.321481239
Polyglot                                    0.273080661  0.327251553  0.297722055
NLTK                                        0.149006623  0.331909938  0.205677171

Precision measures the fraction of the entities that the system came up with that were correct, whereas recall measures the fraction of the correct entities that the system was able to find. F1 score is the harmonic mean of these two numbers. Which of these numbers is most important to you will depend on how you plan to use NER. For example, if the output of the NER system is always reviewed by a human, you might prefer a high recall/low precision system over a low recall/high precision system. In this case, the human reviewers can always toss out any bad entities that the system outputs, but if the system simply doesn’t report entities at all, the reviewers will never see them. On the other hand, if something important happens automatically to all of the entities that the system outputs, you might prefer the low recall/high precision system, so that any entities that the system outputs are as likely as possible to be correct. All other things being equal, if you just want one number to look at, you should use F1 score.
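To make these definitions concrete, here is a minimal sketch of the three metrics in Python. It assumes that each entity is represented as a (tweet id, start token, end token, type) tuple and that a prediction only counts as correct on an exact match; that matching rule and the function names are assumptions for illustration, not necessarily how the shared task scorer works.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(predicted, gold):
    """Score predicted entities against gold entities using exact matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy data: (tweet_id, start_token, end_token, entity_type)
gold = {(1, 0, 2, "PERSON"), (1, 5, 6, "LOCATION"),
        (2, 3, 4, "ORGANIZATION"), (2, 7, 8, "PERSON")}
predicted = {(1, 0, 2, "PERSON"), (1, 5, 6, "ORGANIZATION"),
             (2, 3, 4, "ORGANIZATION")}

# Two of three predictions are correct, and two of four gold entities are found.
print(precision_recall_f1(predicted, gold))
```

As a sanity check, plugging the Stanford CoreNLP precision and recall from the table above into `f1_score` gives approximately 0.487, which agrees with the F1 score reported there.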

Out of the box, Stanford CoreNLP is the winner, as measured by F1 score, though TwitterNER has a much higher precision. It is interesting to note that none of the alternative configurations for Stanford CoreNLP resulted in any improvement in F1 score. The improved POS tagger didn’t change the results at all for any of the entity types I examined (though it did change the results for some other entity types), indicating that POS tagging plays a relatively minor role. Truecasing and caseless models both lowered the F1 score, although the caseless models did raise precision slightly at the cost of recall. My guess is that the truecaser is probably creating more capitalization errors than it is fixing, and the drop from the caseless models probably means that the capitalization information, as unreliable as it is, is still useful overall.

Given that they were designed explicitly for Twitter, it is somewhat surprising that TwitterNER and OSU Twitter NLP Tools did not get the highest F1 scores, but they were trained on a fairly small amount of data compared to the general-purpose systems, even if the quality of the data was better for this task.

One improvement that can easily be made to all of the systems is to exclude all detected entities that are @-mentions. These do refer to accounts, which correspond to either a person or an organization, so it would be natural to categorize them as entities. However, they are not marked as entities in the test data, since they are easy to identify with nearly 100% accuracy with a regular expression, and account profile information is likely to be a better source for distinguishing between people and organizations than the text of the tweet. Here are the results for all systems with @-mention entities excluded:

System Name                                 Precision    Recall       F1 Score
Stanford CoreNLP                            0.526838069  0.453416149  0.487377425
Stanford CoreNLP (with Twitter POS tagger)  0.526838069  0.453416149  0.487377425
TwitterNER                                  0.661496966  0.380822981  0.483370288
OSU NLP                                     0.524096386  0.405279503  0.45709282
Stanford CoreNLP (with caseless models)     0.547077922  0.392468944  0.457052441
Stanford CoreNLP (with truecasing)          0.413084823  0.421583851  0.417291066
MITIE                                       0.340364057  0.457298137  0.390260063
spaCy                                       0.28426543   0.380822981  0.325535092
Polyglot                                    0.273080661  0.327251553  0.297722055
NLTK                                        0.149006623  0.331909938  0.205677171

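The improvement is only noticeable for MITIE and spaCy (the default Stanford CoreNLP configuration gains a tiny amount of precision, and the other systems are unchanged), but, as expected, no scores went down, so it’s still worth doing.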
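Filtering out @-mention entities takes only a few lines. The sketch below assumes that each system's output has already been collected as (text, label) pairs and that a handle is an "@" followed by 1 to 15 letters, digits, or underscores; the function name and the data layout are hypothetical.

```python
import re

# Assumption: a Twitter handle is "@" plus 1-15 letters, digits, or underscores.
MENTION_RE = re.compile(r"@[A-Za-z0-9_]{1,15}")


def drop_mention_entities(entities):
    """Remove entities whose entire text is a single @-mention."""
    return [
        (text, label)
        for text, label in entities
        if not MENTION_RE.fullmatch(text.strip())
    ]


detected = [("@HumanGeo", "ORGANIZATION"), ("Stanford", "ORGANIZATION")]
print(drop_mention_entities(detected))
```

Using `fullmatch` means that only entities consisting entirely of a handle are dropped; a multi-word entity that merely contains an "@" is left alone.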

Since TwitterNER can be easily retrained, let’s see if we can make it better. The W-NUT Twitter NER shared task includes a set of training data that all participants are required to use, and if they use any additional training data it’s considered cheating. From a research perspective this is a really good idea, because this way, you know that the winner won because it was the best algorithm, not just because it used the most training data. But if you want the best system, you want to throw as much training data as you can at it. Fortunately, there are at least three more sets of tweets annotated for named entities available on the internet: the Finin data, the Hege data, and the W-NUT 2017 data.

One of the challenges with using data from other sources is that the formatting can be inconsistent, so you have to be careful. I cleaned up the following issues in this data:

  • The W-NUT 2017 data incorrectly splits hashtags and @-mentions into two tokens (e.g. "@" and "username" rather than "@username"). I re-joined them.
  • All three of these sources annotate @-mentions as Person entities. I removed the Person entity annotations for all @-mentions.
  • All URLs and numbers are replaced with "URL" and "NUMBER", respectively. The reason for this is that it reduces data sparsity without sacrificing too much information, since it usually doesn't matter what the URL or number is exactly for the purpose of doing NER, and there is literally an infinite number of them. But TwitterNER has specialized features for numbers and URLs that expect numbers to look like numbers and URLs to look like URLs. So I replaced all "NUMBER" tokens with "1" and all "URL" tokens with "http://url.com".
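The three cleanup steps above could look roughly like the following sketch. It assumes that each tweet is stored as parallel lists of tokens and BIO tags (for example "B-person", "I-person", "O") and that every @-mention is a single-token entity; the tag names and function names are hypothetical rather than taken from the actual cleanup scripts.

```python
PLACEHOLDERS = {"NUMBER": "1", "URL": "http://url.com"}


def rejoin_split_tokens(tokens, tags):
    """Merge a lone "@" or "#" with the token that follows it.

    Caveat: this also merges a free-standing "@" (as in "meet @ 5"),
    which a more careful version would need to detect.
    """
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        if tokens[i] in ("@", "#") and i + 1 < len(tokens):
            out_tokens.append(tokens[i] + tokens[i + 1])
            # Keep the first tag unless it is "O", so "B-x" + "I-x" stays "B-x".
            out_tags.append(tags[i] if tags[i] != "O" else tags[i + 1])
            i += 2
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags


def untag_mentions(tokens, tags):
    """Mark every @-mention token as a non-entity ("O")."""
    return [
        "O" if tok.startswith("@") and len(tok) > 1 else tag
        for tok, tag in zip(tokens, tags)
    ]


def restore_placeholders(tokens):
    """Swap placeholder tokens for values that look like real numbers and URLs."""
    return [PLACEHOLDERS.get(tok, tok) for tok in tokens]


def clean_sentence(tokens, tags):
    """Apply all three cleanup steps to one tweet."""
    tokens, tags = rejoin_split_tokens(tokens, tags)
    tags = untag_mentions(tokens, tags)
    return restore_placeholders(tokens), tags


tokens = ["@", "bob", "visited", "Paris", "NUMBER", "times", "URL"]
tags = ["B-person", "I-person", "O", "B-geo-loc", "O", "O", "O"]
print(clean_sentence(tokens, tags))
```

The order matters here: the split tokens have to be re-joined first, because the mention check only recognizes tokens of the form "@username".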

Here are the results if you train TwitterNER with each of these datasets in addition to the shared task training data:

System Name                                 Precision    Recall       F1 Score
TwitterNER (with Hege training data)        0.657213317  0.413819876  0.507860886
TwitterNER (with W-NUT 2017 training data)  0.675307842  0.404503106  0.505948046
TwitterNER (with Finin training data)       0.598086124  0.388198758  0.470809793

After adding either the Hege or the W-NUT 2017 data, TwitterNER now has the highest F1 score of all of the systems, though adding the Finin data actually decreases the F1 score. This is likely due to the fact that the quality of the Finin annotations is not the best since they were crowdsourced rather than being produced by a smaller number of well-trained annotators like the other datasets. If we combine just the W-NUT 2017 and Hege data, we get a small but measurable additional improvement:

System Name                                          Precision    Recall      F1 Score
TwitterNER (with W-NUT 2017 and Hege training data)  0.652276759  0.42818323  0.51699086

So for most use cases, TwitterNER with this extra training data is the best NER system to use for Twitter, since its F1 score is the highest of all the configurations tested and its precision is well above that of every system other than TwitterNER itself. In particular, if you need a high precision system, it’s significantly better than any of the other tools. However, its recall is still a bit lower than Stanford CoreNLP’s, so if recall is especially important to you, you might still want to stick with CoreNLP.

The source code for this evaluation is available here. Like a lot of academic software, TwitterNER takes quite a bit of time and expertise to get up and running, so I created an easier-to-use version bundled with the best model (with the added W-NUT 2017 and Hege training data) here.