Denny and Kyle recently interned at HumanGeo. This blog post represents the culmination of their efforts. If you’re interested in interning or joining our team, let us know!

Here at HumanGeo we make sense of large amounts of unstructured text data, but one of the constant challenges is extracting the key details we’re interested in across disparate data sources. An initial, low-effort approach is to create a keyword matcher that searches each document for words in various dictionaries, but eventually the keyword list grows too long and that process becomes too costly. Once you reach that point, the method of attack needs to shift to a more powerful, more hands-off solution: Named Entity Recognition (NER). NER is a field of natural language processing that uses sentence structure to identify proper nouns and classify them into a given set of categories. For our project, we used Stanford’s CoreNLP, a Java library that provides the ability to create custom classifiers for NER. More information on CoreNLP can be found at both the FAQ and the NERFeatureFactory documentation.

Download the version of CoreNLP we are using here.

Using CoreNLP

Tokenizing

In order to train your custom classifier with CoreNLP, you must first annotate a corpus with the tags you want the tool to recognize. Since our end goal was to analyze events in news articles, the tags we used were for people, organizations, locations and time. Before the annotating can start, the corpus must be tokenized. CoreNLP provides an easy way to do this:

java -cp stanford-ner.jar:lib/joda-time.jar:lib/jollyday-0.4.7.jar:lib/slf4j-api.jar:lib/slf4j-simple.jar:lib/stanford-ner-resources.jar edu.stanford.nlp.process.PTBTokenizer <text_file> | perl -ne 'chomp; print "$_\tOTHER\n";' > training.tsv
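
The perl one-liner seeds every token with the OTHER tag, so training.tsv initially looks like the snippet below. Annotating then amounts to changing the second column for each token that belongs to an entity:

This	OTHER
is	OTHER
an	OTHER
...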

One thing to note is that tokenizing text will not only put each token on a new line, but will also split quotes and apostrophes onto their own lines. Because of this, annotating words containing those characters can be tricky, but we found that a conservative approach of annotating only the root token provided the best results.

For instance, with the sample text:

This is an "example" of an article's text

Tokenizing the text results in:

This
is
an
"
example
"
of
an
article
's
text
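
Concretely, when an entity token carries a possessive, we tag only the root and leave the clitic as OTHER. In the annotation format described in the next section, that looks like this (a hypothetical snippet):

Musk	PERSON
's	OTHER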

Annotations

The annotations themselves are stored in tab-separated value (TSV) format, where the first column is a single token and the second column is the correct tag for that token. Any token that does not belong to one of the categories we are looking for is marked OTHER.

SpaceX	ORGANIZATION
initially	OTHER
tweeted	OTHER
that	OTHER
one	OTHER
of	OTHER
the	OTHER
legs	OTHER
broke	OTHER
after	OTHER
a	OTHER
hard	OTHER
landing	OTHER
,	OTHER
but	OTHER
CEO	OTHER
Elon	PERSON
Musk	PERSON
followed	OTHER
up	OTHER
with	OTHER
a	OTHER
better	OTHER
explanation	OTHER

(article source)

Building the Classifier

After annotating the text, we need to build the classifier that CoreNLP will use to characterize new information. To do this, we concatenated all of our annotations into a single TSV file, created a properties.prop file following the CoreNLP example, and ran the command below.

java -cp stanford-ner.jar:lib/joda-time.jar:lib/jollyday-0.4.7.jar:lib/slf4j-api.jar:lib/slf4j-simple.jar:lib/stanford-ner-resources.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop properties.prop

In your properties.prop file, you only really need to modify the trainFile and serializeTo fields to point to your annotated text path and your classifier destination.

#Location of the training file
trainFile = training.tsv

#Location where you would like to save (serialize to) your classifier;
# (adding .gz suffix automatically gzips the file)
serializeTo = news-model.ser.gz

#Structure of your training file;
# this tells the classifier that the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1

#These are the features we'd like to train with; some are discussed below, and the rest can be
#gleaned by looking at the NERFeatureFactory documentation
useClassFeature=true
useWord=true
useNGrams=true
#No ngrams will be included that do not contain either the beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1

#Word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

Using the Classifier

Once the operation finishes, you’re ready to use CoreNLP to run the classifier against tokenized bodies of text:

java -cp stanford-ner.jar:lib/joda-time.jar:lib/jollyday-0.4.7.jar:lib/slf4j-api.jar:lib/slf4j-simple.jar:lib/stanford-ner-resources.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier <your_classifier> -testFile <your_text>
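
If you plan to run this step repeatedly, a thin wrapper script can help. Below is a minimal Python sketch; the classpath matches the commands above, while the helper and file names are hypothetical:

import subprocess

# Same classpath as the java commands above
CLASSPATH = ":".join([
    "stanford-ner.jar",
    "lib/joda-time.jar",
    "lib/jollyday-0.4.7.jar",
    "lib/slf4j-api.jar",
    "lib/slf4j-simple.jar",
    "lib/stanford-ner-resources.jar",
])

def run_classifier(classifier_path, test_file):
    """Run the trained CRF classifier over a tokenized/annotated file and
    return its stdout (one token per line, with the classifier's guess
    appended as an extra column)."""
    cmd = [
        "java", "-cp", CLASSPATH,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", classifier_path,
        "-testFile", test_file,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(run_classifier("news-model.ser.gz", "article.tsv"))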

Testing Our Classifier

We began testing our classifier by hand-annotating a selection of nine articles and comparing our annotations to those produced by CoreNLP. The tool annotates TSVs by appending its guesses to the end of each line, yielding a Word |TAB| Human Annotation |TAB| NER Annotation format that can be easily analyzed to see where the two disagree.

SpaceX	ORGANIZATION	ORGANIZATION
initially	OTHER	OTHER
tweeted	OTHER	OTHER
that	OTHER	OTHER
one	OTHER	OTHER
of	OTHER	OTHER
the	OTHER	OTHER
...
landing	OTHER	OTHER
,	OTHER	OTHER
but	OTHER	OTHER
CEO	OTHER	OTHER
Elon	PERSON	PERSON
Musk	PERSON	PERSON
followed	OTHER	OTHER
up	OTHER	OTHER
with	OTHER	OTHER
a	OTHER	OTHER
better	OTHER	OTHER
explanation	OTHER	OTHER

To tackle this output, we wrote a quick script to characterize the classifier’s mistakes. Some sample output is shown below:

{
  "mistake_by_tag": {
    "PERSON": {
      "MISCLASSIFICATION": 2,
      "MISS": 0,
      "FALSE-POSITIVE": 0
    },
    // ...
  },
  "non_other_count": 31,
  "mistakes": [
    {
      "HUM": "PERSON",
      "WORD": "Kolo",
      "NER": "LOCATION",
      "TYPE": "MISCLASSIFICATION"
    },
    // ...
  ],
  "num_mistakes": 6,
  "overall_mistake_type": {
    "MISCLASSIFICATION": 3,
    "TEMPORAL": 1,
    "PERSON": 2,
    "LOCATION": 4,
    "ORGANIZATION": 2,
    "MISS": 3,
    "FALSE-POSITIVE": 0
  },
  "mistake_percentage": 12,
  "num_correct": 299,
  "num_words": 305,
  "ke_mistake_type": {
    "MISCLASSIFICATION": 2,
    "MISS": 2,
    "FALSE-POSITIVE": 0
  }
}
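
Our actual script tracked a few more fields, but a minimal Python sketch of the core comparison looks like the following (the MISS / FALSE-POSITIVE / MISCLASSIFICATION taxonomy mirrors the sample output above; the function name is our own):

import json
import sys
from collections import Counter

def characterize(tsv_path):
    # Expects one token per line: word <TAB> human tag <TAB> NER tag
    mistakes = []
    totals = Counter()
    with open(tsv_path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip blank lines between sentences
            word, human, ner = parts
            totals["num_words"] += 1
            if human != "OTHER":
                totals["non_other_count"] += 1
            if human == ner:
                totals["num_correct"] += 1
                continue
            if human == "OTHER":
                kind = "FALSE-POSITIVE"      # tagged a token we marked OTHER
            elif ner == "OTHER":
                kind = "MISS"                # failed to tag an entity
            else:
                kind = "MISCLASSIFICATION"   # entity found, but wrong category
            mistakes.append({"WORD": word, "HUM": human, "NER": ner, "TYPE": kind})
    report = dict(totals)
    report["mistakes"] = mistakes
    report["num_mistakes"] = len(mistakes)
    return report

if __name__ == "__main__":
    print(json.dumps(characterize(sys.argv[1]), indent=2))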

Using this output, we were able to see where the classifier was failing and set about correcting it by modifying our initial annotations. For instance, the first version of our source corpus was very inclusive in how tokens were annotated, which led to a lot of false positives. With our second pass we were more conservative, and while our misses increased slightly, we eliminated nearly all of the false positives. After updating the annotations, we wanted to see how the NER performed as we trained the classifier with more words. The figure below shows how well CoreNLP identified the terms in our nine test articles as more words were annotated.

Mistakes vs. Number of Annotated Words

For those articles, we tracked the number of mistakes made in the main categories (Person, Organization) against the number of annotated words used to train the classifier. As the figure shows, the number of mistakes falls rapidly at first but begins to level out around 15,000 words. An interesting correlation also surfaced during testing: adding certain portions of our annotations was accompanied by a rise in the number of errors. Future efforts to improve the NER pipeline may want to investigate those spikes.

Comparing NER to the Keyword Enricher

On our project we had been using a keyword enricher to pick out pre-defined terms from news articles. While it worked reasonably well, it had issues when dealing with alternate spellings of keywords (frequently a problem with non-English names). An example of this would be matching “Munich” (the anglicized name of the Bavarian capital) but not “München” (the native German spelling). NER drastically outperformed the keyword enricher when identifying people, but the two were more evenly matched when it came to organizations. In the end, our NER implementation provides a viable replacement for person keyword detection, as well as a selector for possible locations (the hits would then be run through a geocoder). In the case of organizations, however, the keyword enricher will probably work just as well, since the organizations we track do not change frequently.

Mistakes by Technique