NoraIdentification
Language classification
A quick experiment with [http://search.cpan.org/~ambs/Lingua-Identify-0.23/lib/Lingua/Identify.pm Lingua::Identify] yielded promising results with little effort. Most documents were classified as either English or Norwegian, as expected; text files containing gibberish tended to be assigned arbitrary languages such as Afrikaans or Somali. The greatest potential for error seems to be in the files classified as French or Swedish.
The full results are in [http://heim.ifi.uio.no/olasba/nora/noralang.log] (or ~olasba/noralang.log on ps). Note that the "confidence score" is relative to the other possible languages, as described at [http://search.cpan.org/~ambs/Lingua-Identify-0.23/lib/Lingua/Identify.pm#confidence], so a high score doesn't necessarily mean the text isn't gibberish.
This list has been used to categorize my working copy of the converted text in /logon/scratch/olasba/nora0908/logon/scratch/nora/pdf/
The biggest problem with Lingua::Identify was that a large percentage of Norwegian texts were misclassified as Turkish; this was rectified by eliminating 'tr' as a possible language:
Lingua::Identify::deactivate_language('tr');
The most accurate methods for separating Norwegian from English seemed to be 4-grams (strictly speaking, 4-graphs) and stopwords (the "smallwords" method), weighted as follows:
use Lingua::Identify qw(langof);
my @guess = langof({method => {smallwords => 0.8, ngrams4 => 1.2}}, $text);
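To make the "4-graph" idea concrete, here is a minimal sketch of counting overlapping character 4-grams in plain Perl. It is only an illustration of the kind of feature the ngrams4 method works from, not Lingua::Identify's actual implementation; the helper name count_4grams and the sample string are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Identify): count the
# overlapping character 4-grams in a string.
sub count_4grams {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/\s+/ /g;    # collapse whitespace runs to a single space
    my %count;
    # Slide a 4-character window over the text; strings shorter than
    # four characters produce an empty range and therefore no 4-grams.
    for my $i (0 .. length($text) - 4) {
        $count{ substr($text, $i, 4) }++;
    }
    return \%count;
}

# Print the 4-grams of a short sample, most frequent first.
my $counts = count_4grams("ikke ikke");
for my $gram (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    printf "%-6s %d\n", "'$gram'", $counts->{$gram};
}
```

A classifier built on this would compare such counts against per-language profiles; that comparison step is left out here.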
The source code is at [http://heim.ifi.uio.no/olasba/nora/identifylang03.pl.txt].
- —Ola