One of the persistent issues we deal with at HumanGeo is determining the language of a block of text. There are multiple Language Detection (LD) libraries available claiming high accuracy, so building our own wasn’t necessary considering that these are libraries built by experts in Computational Linguistics. Out of the many choices, we were interested in determining the accuracy and performance of the different libraries for detecting the language of tweets. In the past, we have used the following libraries in various projects:
Recently, we needed to perform LD on text in Java, so we focused our efforts on two Java libraries, LangID and LanguageDetection. We ran tests on both to determine their accuracy and performance. Below are the highlights of the process and a discussion of the results.
Testing a classifier like this requires a prelabeled data set of text. For this, we use Twitter data that we gathered from the Twitter Streaming API.
Messages from Twitter contain a language field (`lang`) holding a two-letter ISO code for the language that Twitter has assigned to the tweet using its own process.
We understand that this classification by Twitter is not perfect, but we will overlook the issue for now because the large quantity of categorized data available outweighs it.
We will not ignore the issue completely, and plan to study the data Twitter provides at a later time.
Now that we have collected adequate data (millions of tweets), we take the text of every tweet and remove #hashtags, @mentions, and URLs.
These elements consist primarily of English characters, even in tweets written in other languages, so we don’t want them to throw off the language detection.
This filtered text is passed into the two Language Detectors to perform detection.
The two detectors each return a list of languages that are ranked from most likely to least likely.
For convenience, each library has a “threshold” value that filters the returned list down to the languages whose score/probability is higher than the threshold.
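The cleaning step described above can be sketched with a few regular expressions. This is a minimal illustration, not the code used in the study: the class name `TweetCleaner` and the exact patterns are assumptions, and real tweets may need more careful handling (for example, e-mail addresses would also be caught by the mention pattern).

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips URLs, #hashtags and @mentions before detection.
final class TweetCleaner {
    // URLs are removed first so that '#' or '@' inside a link goes with the link.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    // UNICODE_CHARACTER_CLASS lets \w match letters in non-Latin scripts too.
    private static final Pattern TAG_OR_MENTION =
            Pattern.compile("[#@]\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String clean(String text) {
        String out = URL.matcher(text).replaceAll(" ");
        out = TAG_OR_MENTION.matcher(out).replaceAll(" ");
        // Collapse the gaps left behind by the removed elements.
        return SPACES.matcher(out).replaceAll(" ").trim();
    }
}
```

For example, `TweetCleaner.clean("@bob Hola mundo #saludos http://t.co/abc123")` leaves only `Hola mundo` for the detectors to work with.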
A quick note on probabilities
Many people ask what the “probabilities” mean. Each detected language has a probability associated with it. For example, a detection could return the following list of languages: (en: .6, es: .3, fr: .1). These numbers mean that English (en) is twice as likely as Spanish (es) to be the language of the text. English is also six times as likely as French (fr), and Spanish is three times as likely as French. The numbers are derived from true probabilities, which end up being very small. They are scaled so that they keep their relative proportions and sum to 1.
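The scaling can be illustrated with a short sketch. It assumes the raw scores are plain (very small) probabilities stored in a map; the class name `ScoreNormalizer` is hypothetical, and neither library necessarily normalizes its scores in exactly this way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: rescales raw scores so they sum to 1.
final class ScoreNormalizer {
    static Map<String, Double> normalize(Map<String, Double> raw) {
        double sum = 0.0;
        for (double v : raw.values()) {
            sum += v;
        }
        Map<String, Double> scaled = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            // Dividing every score by the same total keeps the ratios intact.
            scaled.put(e.getKey(), sum > 0 ? e.getValue() / sum : 0.0);
        }
        return scaled;
    }
}
```

With raw scores of 6e-40 for en, 3e-40 for es, and 1e-40 for fr, the scaled values come out at roughly .6, .3, and .1, which matches the example list above: the ratios between languages are unchanged, only the total is brought to 1.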
The following is the high-level portion of code used to perform the main detection and store the results. These blocks were derived from code found in the tutorials and examples of the respective libraries.
The first portion just handles retrieving and stripping the text of a Twitter message.
As stated, hashtags, mentions, and URLs are removed.
The `lang` variable is the “true” language according to Twitter.
The `detLang` variable holds the detected language from each library.
The following block of code retrieves the top language from LanguageDetection’s detection.
If no language met its default threshold, the code assigns a value of
The code then updates the confusion matrix with the true and detected languages.
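A confusion matrix of this kind can be kept as a nested map of counts, with one row per “true” language and one column per detected language. The sketch below is an assumption about how such a structure might look; the class `ConfusionMatrix` and its methods are hypothetical and are not taken from either library or from the original code.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical structure: rows are "true" languages, columns are detected ones.
final class ConfusionMatrix {
    private final Map<String, Map<String, Long>> counts = new TreeMap<>();

    // Record one message whose true language is lang and detected language is detLang.
    void update(String lang, String detLang) {
        counts.computeIfAbsent(lang, k -> new TreeMap<>())
              .merge(detLang, 1L, Long::sum);
    }

    // Number of messages with the given true/detected pair (0 if never seen).
    long count(String lang, String detLang) {
        Map<String, Long> row = counts.get(lang);
        return row == null ? 0L : row.getOrDefault(detLang, 0L);
    }

    // Total number of messages whose true language is lang.
    long rowTotal(String lang) {
        Map<String, Long> row = counts.get(lang);
        if (row == null) {
            return 0L;
        }
        long total = 0L;
        for (long c : row.values()) {
            total += c;
        }
        return total;
    }
}
```

Each processed tweet then results in a single `update(lang, detLang)` call per library, and the diagonal cells (`count(x, x)`) hold the correct detections.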
The following block of code uses LangID to perform Language Detection similar to the block of code above. One difference here is that a bit of wrangling/sorting is needed to get the top detected language.
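The “wrangling” amounts to picking the highest-scoring entry from an unordered list of candidates. A minimal sketch follows, using a hypothetical `Scored` holder rather than any class from LangID or LanguageDetection. The fallback to `und` when nothing clears the threshold is an assumption made for this sketch.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: selects the most probable language from a candidate list.
final class TopLanguage {
    // Illustrative (language, probability) pair; not a type from either library.
    static final class Scored {
        final String lang;
        final double prob;

        Scored(String lang, double prob) {
            this.lang = lang;
            this.prob = prob;
        }
    }

    static String pick(List<Scored> candidates, double threshold) {
        return candidates.stream()
                .filter(c -> c.prob > threshold)                      // drop weak guesses
                .max(Comparator.comparingDouble((Scored c) -> c.prob)) // best remaining
                .map(c -> c.lang)
                .orElse("und");                                        // assumed fallback
    }
}
```

Taking the maximum avoids sorting the whole list when only the top language is needed.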
Results - Precision, Recall, F1 Scores - Confusion Matrices
Below we present the F1, recall, and precision scores and the total number of messages for each language. The most frequent language is English, with 1.4 million messages. Spanish, Portuguese, Japanese, Arabic, Indonesian, Turkish, and Russian round out the top languages, each with at least 100k total messages.
There are a few differences in the results statistics between the two libraries.
LanguageDetection seems to return more `und` (undetermined) values than LangID.
Having more undetermined identifications lowers the recall score of a category.
This is seen in several recall scores of LanguageDetection, in particular for English, with a recall score under .5.
In turn, by potentially removing ambiguities, a classifier may improve its precision.
Again, LanguageDetection has some precision scores above .9, as seen in English and Spanish.
LangID, on the other hand, was willing to make mistakes in the precision, but typically had reasonable recall.
Overall, which library someone would prefer “out of the box” depends on which metric is more important. Another way to see this: the classifier can be either:
- very certain about its decision, while balking at any ambiguous text
- OK with fielding a guess even though it may be wrong, with a focus on doing better at capturing certain languages.
|el||“Greek Modern (1453-)”||0.545||0.914||0.68283756||5124|
|ht||Haitian; Haitian Creole||0.16||0.022||0.038681319||6180|
|ro||Romanian; Moldavian; Moldovan||0.166||0.035||0.057810945||2640|
|ro||Romanian; Moldavian; Moldovan||0.124||0.048||0.069209302||2622|
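For reference, the three scores follow the standard definitions: precision is the share of detections of a language that were correct, recall is the share of that language’s true messages that were detected, and F1 is their harmonic mean. The sketch below uses a hypothetical `Metrics` class and works from true-positive, false-positive, and false-negative counts.

```java
// Hypothetical helper implementing the standard precision/recall/F1 formulas.
final class Metrics {
    // tp: correctly detected as the language; fp: wrongly detected as the language.
    static double precision(long tp, long fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // fn: messages of the language that were detected as something else (including und).
    static double recall(long tp, long fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

As a check against the table rows above, `Metrics.f1(0.545, 0.914)` gives about 0.6828, the F1 value listed for Greek. The definitions also show why an undetermined result hurts recall only: it adds a false negative for the true language without adding a false positive for any other language.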
Below are the visual Confusion Matrices that represent the categorization totals of each language. The rows are labeled with the “truth” language, while the columns are labeled with detected language. Each cell in the matrix is color and transparency coded to represent the relative weight to the other cells in the row. The values of the cells are not linear (.1 value -> .9 transparency) but log scaled to bring out low scores and visualize any potential clusters.
The diagonals are labeled as blue because in this view with logarithmic scales, it would not be appropriate to compare and contrast the values of correct detection to the incorrect detections. That type of analysis is meant to be done using the recall/precision as shown above. The matrix is meant to determine what other languages are being detected instead of the true language.
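The exact scaling behind the matrices is not spelled out here. One common way to log-scale a row, assumed for this sketch, is to divide log(1 + count) by log(1 + row maximum), so that every weight falls between 0 and 1. The class name `LogScale` is hypothetical.

```java
// Hypothetical helper: log-scaled weights for one row of a confusion matrix.
final class LogScale {
    static double[] rowWeights(long[] rowCounts) {
        long max = 0;
        for (long c : rowCounts) {
            max = Math.max(max, c);
        }
        double[] weights = new double[rowCounts.length];
        if (max == 0) {
            return weights; // empty row: all weights stay 0
        }
        double denom = Math.log1p(max);
        for (int i = 0; i < rowCounts.length; i++) {
            // log1p(x) = ln(1 + x), so a count of 0 maps to a weight of 0.
            weights[i] = Math.log1p(rowCounts[i]) / denom;
        }
        return weights;
    }
}
```

For a row with counts 0, 9, and 999, a linear scale would give the middle cell a weight of about 0.009, which is nearly invisible, while this log scale gives it about 0.33. That is the effect of “bringing out low scores”.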
The frequency ranking in the dropdown sorts the rows by the number of “true” texts in each row.
As stated before, this orders the rows as English, Spanish, Portuguese, Japanese, Arabic, Indonesian, Turkish, Russian.
Additionally, there is an `und` row for the “true” language, which represents messages that Twitter has not identified.
Some interesting false detections:
- When Arabic messages get detected incorrectly, they are usually tagged: Pashto, Urdu, Farsi
- For Japanese messages, Chinese is the culprit
- For English: French, German, Spanish, Italian, and Chinese (??)
- For Russian: Bulgarian, Serbian, Ukrainian
- For Korean: Japanese, Thai
LanguageDetection Confusion Matrix
LangID Confusion Matrix
This study was an overview of a few Java-based Language Detection libraries. Though there is no clear indication of a “better” library, our preference is to use LangID “out of the box”, because it has a reasonable recall score for many languages.
We will delve into other issues, such as the true accuracy of Twitter’s language detection, in the future. This will help us create a proper Gold-Standard test (and maybe training) set for future studies. This study is meant to reiterate the fact that no machine learning classifier is perfect. It is also a helpful push to anyone interested in this problem and looking to contribute, since the source code of these libraries is on github.com and available for modification.
Twitter is still tagging messages as `in` for Indonesian and `iw` for Hebrew, rather than the current ISO 639-1 codes `id` and `he`.
This has been reported to Twitter.
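When comparing Twitter’s tag with a detector’s output, legacy codes like these can be mapped to their current equivalents first, so that a correct detection is not counted as a miss. The sketch below is an assumption about how that could be done, not a description of the original code. The class name `LangCodes` is hypothetical, and the mapping covers only the two codes mentioned here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps withdrawn ISO 639-1 codes to their current forms.
final class LangCodes {
    private static final Map<String, String> LEGACY = new HashMap<>();

    static {
        LEGACY.put("in", "id"); // Indonesian
        LEGACY.put("iw", "he"); // Hebrew
    }

    // Returns the current code, or the input unchanged if it is not a legacy code.
    static String normalize(String code) {
        return LEGACY.getOrDefault(code, code);
    }
}
```

Applying `LangCodes.normalize` to both the “true” and the detected code before comparing them keeps the two sides consistent, whichever convention a given library or data source uses.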