You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It occurs to me that if you could determine why the fuzzy matching algorithm goes wrong in this one particular example, it might suggest ways to improve accuracy overall.
The single letter OCR error Euglcna gracilis (for Euglena gracilis) is instead consistently replaced by Egilina gracilis, which requires substitution of four letters. Both names are in GNI. Other OCR errors for this name are also resolved in the same way.
As I pointed out previously, a single letter difference in the Latin name can be the difference between a butterfly and a snail, which makes fuzzy matching inappropriate. There seem to be a variety of algorithms specifically for detecting and correcting OCR errors, however, which are qualitatively different from the sort of phonetic misspelling and qwerty keyboard typographical errors that fuzzy matching is intended to overcome. OCR specific algorithms "know" about the nature of OCR errors, which result from similarly shaped characters being confounded or not recognized. In the example case, "c" is more likely to be an OCR error for "e" than for "i". Some OCR errors are obvious (mid-string non-alpha characters and case changes) while others are statistically much more likely than others. A probability table could suggest likely and eliminate unlikely substations. There seems to be a considerable literature on the subject, for example http://www.cs.cmu.edu/~rcarlson/docs/RyanCarlson_nlp.pdf. Use of GNI as a dictionary should be done with caution. It is rather "dirty," both significantly incomplete and full of errors. It must be relied on for recognition of genus group names, but a species group name string not being in GNI, particularly in combination with a particular genus, does not mean that it is invalid and must be replaced.
Regards,
Pat
The text was updated successfully, but these errors were encountered:
Hi Dima,
It occurs to me that if you could determine why the fuzzy matching algorithm goes wrong in this one particular example, it might suggest ways to improve accuracy overall.
The single letter OCR error Euglcna gracilis (for Euglena gracilis) is instead consistently replaced by Egilina gracilis, which requires substitution of four letters. Both names are in GNI. Other OCR errors for this name are also resolved in the same way.
BHL pages with OCR errors in the name Euglena gracilis that are misinterpreted as Egilina gracilis:
http://www.biodiversitylibrary.org/page/929138
http://www.biodiversitylibrary.org/page/1364518
http://www.biodiversitylibrary.org/page/1364524
http://www.biodiversitylibrary.org/page/1364525
http://www.biodiversitylibrary.org/page/1364531
http://www.biodiversitylibrary.org/page/1364532
http://www.biodiversitylibrary.org/page/1488979
http://www.biodiversitylibrary.org/page/1503443
http://www.biodiversitylibrary.org/page/1503451
http://www.biodiversitylibrary.org/page/1547005
http://www.biodiversitylibrary.org/page/1547013
http://www.biodiversitylibrary.org/page/1547015
http://www.biodiversitylibrary.org/page/1539409
http://www.biodiversitylibrary.org/page/1488189
As I pointed out previously, a single letter difference in the Latin name can be the difference between a butterfly and a snail, which makes fuzzy matching inappropriate. There seem to be a variety of algorithms specifically for detecting and correcting OCR errors, however, which are qualitatively different from the sort of phonetic misspelling and qwerty keyboard typographical errors that fuzzy matching is intended to overcome. OCR specific algorithms "know" about the nature of OCR errors, which result from similarly shaped characters being confounded or not recognized. In the example case, "c" is more likely to be an OCR error for "e" than for "i". Some OCR errors are obvious (mid-string non-alpha characters and case changes) while others are statistically much more likely than others. A probability table could suggest likely and eliminate unlikely substations. There seems to be a considerable literature on the subject, for example http://www.cs.cmu.edu/~rcarlson/docs/RyanCarlson_nlp.pdf. Use of GNI as a dictionary should be done with caution. It is rather "dirty," both significantly incomplete and full of errors. It must be relied on for recognition of genus group names, but a species group name string not being in GNI, particularly in combination with a particular genus, does not mean that it is invalid and must be replaced.
Regards,
Pat
The text was updated successfully, but these errors were encountered: