%% BioMed_Central_Tex_Template_v1.06
%% %
% bmc_article.tex ver: 1.06 %
% %
%%IMPORTANT: do not delete the first line of this template
%%It must be present to enable the BMC Submission system to
%%recognise this template!!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% LaTeX template for BioMed Central %%
%% journal article submissions %%
%% %%
%% <8 June 2012> %%
%% %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% For instructions on how to fill out this Tex template %%
%% document please refer to Readme.html and the instructions for %%
%% authors page on the biomed central website %%
%% http://www.biomedcentral.com/info/authors/ %%
%% %%
%% Please do not use \input{...} to include other tex files. %%
%% Submit your LaTeX manuscript as one .tex document. %%
%% %%
%% All additional figures and files should be attached %%
%% separately and not embedded in the \TeX\ document itself. %%
%% %%
%% BioMed Central currently use the MikTex distribution of %%
%% TeX for Windows) of TeX and LaTeX. This is available from %%
%% http://www.miktex.org %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% additional documentclass options:
% [doublespacing]
% [linenumbers] - put the line numbers on margins
%%% loading packages, author definitions
%\documentclass[twocolumn]{bmcart}
% uncomment this for twocolumn layout and comment line below
\documentclass{bmcart}
%%% Load packages
%\usepackage{amsthm,amsmath}
%\RequirePackage{natbib}
\RequirePackage{hyperref}
% \usepackage{etoolbox}
\usepackage[utf8x]{inputenc} %unicode support
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{amsmath}
\usepackage{listings}
\usepackage{bera}
\usepackage{multirow}
\usepackage{fancyvrb}
\usepackage{soul}
\usepackage{enumitem}
\usepackage{xcolor}
% \usepackage[colorinlistoftodos, prependcaption, textsize=large]{todonotes}
% \usepackage[top=1.5cm, bottom=1.5cm, outer=5cm, inner=3cm, heightrounded,
% marginparwidth=4.5cm, marginparsep=0.2cm]{geometry}
% \usepackage{marginnote}
% \setmarginnotefont{\color{blue}}
% \usepackage[nomarkers,figuresonly]{endfloat}
%\usepackage[applemac]{inputenc} %applemac support if unicode package fails
%\usepackage[latin1]{inputenc} %UNIX support if unicode package fails
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% If you wish to display your graphics for %%
%% your own use using includegraphic or %%
%% includegraphics, then comment out the %%
%% following two lines of code. %%
%% NB: These line *must* be included when %%
%% submitting to BMC. %%
%% All figure files must be submitted as %%
%% separate graphics through the BMC %%
%% submission process, not included in the %%
%% submitted article. %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\def\includegraphic{}
%\def\includegraphics{}
\errorcontextlines1000
% \makeatletter
% \patchcmd{\@addmarginpar}{\ifodd\c@page}{\ifodd\c@page\@tempcnta\m@ne}{}{}
% \makeatother
% \reversemarginpar
\makeatletter
\newcommand{\verbatimfont}[1]{\def\verbatim@font{#1}}%
\makeatother
\setlist[description]{leftmargin=\parindent,labelindent=\parindent}
%%% Put your definitions there:
\startlocaldefs
\newcommand{\comment}[2]{\hspace{0in}#2}
\colorlet{punct}{red!60!black}
\definecolor{background}{HTML}{EEEEEE}
\definecolor{delim}{RGB}{20,105,176}
\lstdefinelanguage{json}{basicstyle=\scriptsize\ttfamily,
numbers=left,
numberstyle=\scriptsize,
stepnumber=1,
numbersep=8pt,
showstringspaces=false,
breaklines=true,
frame=lines,
backgroundcolor=\color{background},
extendedchars=true,
literate=
*{:}{{{\color{punct}{:}}}}{1}
{,}{{{\color{punct}{,}}}}{1}
{\{}{{{\color{delim}{\{}}}}{1}
{\}}{{{\color{delim}{\}}}}}{1}
{[}{{{\color{delim}{[}}}}{1}
{]}{{{\color{delim}{]}}}}{1}
{ü}{{{\"u}}}{1}}
% "define" Scala
\lstdefinelanguage{scala}{%
morekeywords={abstract,case,catch,class,def,%
do,else,extends,false,final,finally,%
for,if,implicit,import,match,mixin,%
new,null,object,override,package,%
private,protected,requires,return,sealed,%
super,this,throw,trait,true,try,%
type,val,var,while,with,yield},
otherkeywords={=>,<-,<\%,<:,>:,\#,@},
sensitive=true,
morecomment=[l]{//},
morecomment=[n]{/*}{*/},
morestring=[b]",
morestring=[b]',
morestring=[b]"""
}
\endlocaldefs
%%% Begin ...
\begin{document}
%%% Start of article front matter
\begin{frontmatter}
\begin{fmbox}
\dochead{Software}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% Enter the title of your article here %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{``gnparser'': A powerful parser for scientific names based on
Parsing Expression Grammar}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% Enter the authors here %%
%% %%
%% Specify information, if available, %%
%% in the form: %%
%% <key>={<id1>,<id2>} %%
%% <key>= %%
%% Comment or delete the keys which are %%
%% not used. Repeat \author command as much %%
%% as required. %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\author[
addressref={aff1},
noteref={n1},
corref={aff1}, % id of corresponding address, if any
email={[email protected]}
]{\inits{DYM}\fnm{Dmitry Y.} \snm{Mozzherin}}
\author[ % id's of addresses, e.g. {aff1,aff2}
addressref={aff2},
noteref={n1},% id's of article notes, if any
email={[email protected]} % email address
]{\inits{AAM}\fnm{Alexander A.} \snm{Myltsev}}
\author[
% id of corresponding address, if any
addressref={aff3},
email={[email protected]}
]{\inits{DJP}\fnm{David J.} \snm{Patterson}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% Enter the authors' addresses here %%
%% %%
%% Repeat \address commands as much as %%
%% required. %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\address[id=aff1]{\orgname{University of Illinois,
Illinois Natural History Survey, Species File Group},
\street{1816 South Oak St.},
\city{Champaign},
\state{IL},
\postcode{61820},
\cny{US}}
\address[id=aff2]{\orgname{IP Myltsev},
\street{Kaslinskaya St.},
\city{Chelyabinsk},
\postcode{454084},
\cny{Russia}}
\address[id=aff3]{\orgname{University of Sydney},
\city{Sydney},
\cny{Australia}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% Enter short notes here %%
%% %%
%% Short notes will be after addresses %%
%% on first page. %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{artnotes}
%\note{Sample of title note} % note to the article
\note[id=n1]{Equal contributors} % note, connected to author
\end{artnotes}
\end{fmbox}% comment this for two column layout
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% The Abstract begins here %%
%% %%
%% Please refer to the Instructions for %%
%% authors on http://www.biomedcentral.com %%
%% and include the section headings %%
%% accordingly for your article type. %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{abstractbox}
\begin{abstract} % abstract
\parttitle{Background} Scientific names in biology act as universal links. They allow information about organisms to be cross-referenced globally. However, variations in the spelling of scientific names greatly diminish their ability to interconnect data. Such variations may include abbreviations, annotations, misspellings, etc. Authorship is a part of a scientific name and may also differ significantly. To match all possible variations of a name we need to divide them into their elements and classify each element according to its role. We refer to this as `parsing' the name. Parsing categorizes a name's elements into those that are stable and those that are prone to change. Names are matched first by combining them according to their stable elements. Matches are then refined by examining their varying elements. This two-stage process dramatically improves the number and quality of matches. It is especially useful for automatic data exchange within the context of ``Big Data'' in biology.
\parttitle{Results} We introduce Global Names Parser (\textit{gnparser}). It is a Java tool written in the Scala language (a language for the Java Virtual Machine) to parse scientific names. It is based on a Parsing Expression Grammar. The parser can be applied to scientific names of any complexity. It assigns a semantic meaning (such as genus name, species epithet, rank, year of publication, authorship, annotations, etc.) to all elements of a name. It is able to work with nested structures as in the names of hybrids. \textit{gnparser} performs with $\approx99\%$ accuracy and processes 30 million name-strings/hour per CPU thread. The \textit{gnparser} library is compatible with Scala, Java, R, Jython, and JRuby. The parser can be used as a command line application, as a socket server, as a web-app, or as a RESTful HTTP-service. It is released under an Open source MIT license.
\parttitle{Conclusions} Global Names Parser (\textit{gnparser}) is a fast, high precision tool for biodiversity informaticians and biologists working with large numbers of scientific names. It can replace expensive and error-prone manual parsing and standardization of scientific names in many situations, and can quickly enhance the interoperability of distributed biological information.
\end{abstract}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% The keywords begin here %%
%% %%
%% Put each keyword in separate \kwd{}. %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{keyword}
\kwd{biodiversity}
\kwd{biodiversity informatics}
\kwd{scientific name}
\kwd{parser}
\kwd{semantic parser}
\kwd{names-based cyberinfrastructure}
\kwd{Scala}
\kwd{Parsing Expression Grammar}
\end{keyword}
% MSC classifications codes, if any
%\begin{keyword}[class=AMS]
%\kwd[Primary ]{}
%\kwd{}
%\kwd[; secondary ]{}
%\end{keyword}
\end{abstractbox}
%
%\end{fmbox}% uncomment this for twcolumn layout
\end{frontmatter}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% The Main Body begins here %%
%% %%
%% Please refer to the instructions for %%
%% authors on: %%
%% http://www.biomedcentral.com/info/authors%%
%% and include the section headings %%
%% accordingly for your article type. %%
%% %%
%% See the Results and Discussion section %%
%% for details on how to create sub-sections%%
%% %%
%% use \cite{...} to cite references %%
%% \cite{koon} and %%
%% \cite{oreg,khar,zvai,xjon,schn,pond} %%
%% \nocite{smith,marg,hunn,advi,koha,mouse}%%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%% start of article main body
% <put your article body there>
\section*{Background}
\subsection*{Conventions}
Throughout the paper we use the terms ``name'', ``scientific name'', and ``name-string'' in particular ways. ``Name'' refers to one or several words that act as a label for a taxon. A ``scientific name'' is a name formed in compliance with a nomenclatural code (Code) or, if beyond the scope of the Codes, is consistent with the expectations of a Code. The term ``name-string'' refers to the sequence of characters (letters, numbers, punctuation, spaces, symbols) that forms the name. A name can be expressed in the form of many name-strings (for example, see Figure~\ref{figure:carex}). There are about two and a half million currently accepted names for extinct and extant species. There are approximately ten million legitimately formed scientific names and hundreds of millions of possible name-strings for them. We use the term ``elements'' for the components of a name-string. Traditionally, in biological literature, scientific names for genera and taxa below genus are presented in \textit{italics}. In this paper, where we wish to emphasize examples of name-strings, we use \textbf{bold font}.
\subsection*{Introduction}
Biology is entering a ``Big Data'' age, where global and fast access to all knowledge is envisaged. Progress towards this vision is still limited in scope. One impediment, especially for the long tail of smaller sources (of which some are not yet digital), is the absence of devices to interconnect distributed data. The names of organisms are invaluable in ``Big Data'' biology because they can be treated as metadata and as such can be used to discover, index, organize, and interconnect distributed information about species and other taxa~\cite{Patterson2010}. The use of names for informatics purposes is not straightforward because, for example, there may be many legitimate spellings for a name (Figure~\ref{figure:carex}). A cyberinfrastructure that uses names to manage information about organisms must determine which name-strings are variant forms of the same scientific name.
Figure~\ref{figure:carex} presents some of the different legitimate variants of a scientific name in order to make the point that there is not a single correct way to spell scientific names. Because of these variations, fewer than 15\% of the names in comparisons of large biological databases could be matched based on exact spellings of name-strings~\cite{Patterson2016}. In order to improve this simple metric for interoperability, we need to identify variants of the same name. We refer to the process of addressing variant spellings (there being other causes of different names for the same taxon) as ``lexical reconciliation''. Lexical reconciliation involves linking the alternative spelling variants for the same taxon into a ``lexical group''. Most biologists do this intuitively: they recognize that the name-strings in Figure~\ref{figure:carex} refer to the same taxon. They do so by ``parsing'' the name-strings into elements (genus name, species name, authors, ranks, etc.) and mentally discarding less significant elements such as annotations and authorship. It then becomes clear that all of the name-strings are formed around the Latin elements \textbf{Carex scirpoidea convoluta}. We refer to the form of the scientific name without authority or annotations as the ``canonical form''. Further analysis of the name-strings reveals two different lexical groups (separated in Figure~\ref{figure:carex} by a line break) for, probably, one taxonomic concept:
\begin{itemize}
\item \textbf{Carex scirpoidea var.\ convoluta} description by
\textbf{Kükenthal}
\item \textbf{Carex scirpoidea subsp.\ convoluta} rank determination by
\textbf{Dunlop}.
\end{itemize}
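The grouping step can be illustrated with a short Scala sketch. It is not part of \textit{gnparser}: the \texttt{ParsedName} record, its fields, and the three unauthored name-strings are hypothetical, and the name-strings are assumed to have been parsed already. Name-strings are first collected by canonical form and then separated by rank, treating \textbf{ssp.}\ and \textbf{subsp.}\ as the same rank.

\begin{lstlisting}[language=scala,basicstyle=\scriptsize\ttfamily,breaklines=true]
// Hypothetical record for an already parsed name-string (not the gnparser API).
final case class ParsedName(verbatim: String, canonical: String, rank: Option[String])

object LexicalGroups {
  // Rank abbreviations that denote the same rank are mapped to one spelling.
  def normalizeRank(rank: String): String =
    if (rank == "ssp.") "subsp." else rank

  // Stage 1: collect name-strings by their stable element, the canonical form.
  // Stage 2: split each collection by a varying element, here the rank.
  def group(names: Seq[ParsedName]): Map[String, Map[Option[String], Seq[String]]] =
    names.groupBy(_.canonical).map { case (canonical, sameForm) =>
      val byRank = sameForm
        .groupBy(n => n.rank.map(normalizeRank))
        .map { case (rank, ns) => rank -> ns.map(_.verbatim) }
      canonical -> byRank
    }

  // Illustrative input: three spellings without authorship.
  val examples: Seq[ParsedName] = Seq(
    ParsedName("Carex scirpoidea var. convoluta", "Carex scirpoidea convoluta", Some("var.")),
    ParsedName("Carex scirpoidea subsp. convoluta", "Carex scirpoidea convoluta", Some("subsp.")),
    ParsedName("Carex scirpoidea ssp. convoluta", "Carex scirpoidea convoluta", Some("ssp."))
  )

  def main(args: Array[String]): Unit =
    group(examples).foreach { case (canonical, byRank) =>
      println(canonical + " -> " + byRank)
    }
}
\end{lstlisting}

Under these assumptions the three name-strings share one canonical form and fall into two lexical groups, one for \textbf{var.}\ and one for \textbf{subsp.}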
In the past, the need to parse scientific names to form normalized names has mostly been met manually. A person familiar with the rules of botanical nomenclature would be able to analyse the 24 name-strings in this example with relative ease, but not thousands or millions of name-strings, especially if they include scientific names to which more than one nomenclatural code may be applied. The manual splitting of names into even only two parts (the latinized elements of taxon names that make up the canonical form, and the authorship) is slow and therefore expensive. To scale this exercise up requires an algorithmic solution: a scientific name parser.
\begin{figure}
\begin{center}
\caption{Some legitimate versions of the scientific name for the `Northern
Bulrush' or `Singlespike Sedge'. The genus (\textit{Carex}), species
(\textit{scirpoidea}), and subspecies (\textit{convoluta}) may be annotated
(var., subsp., and ssp.) or include or omit the name of the original
authority for the infraspecies (Kükenthal), or for the species (Michaux), or for the
current infraspecific combination (Dunlop). The name of the authority is sometimes abbreviated, sometimes
differently spelled, and may be with or without initials and dates. This list is
not complete. Image courtesy of~\cite{FNA2002}.}\label{figure:carex}
\end{center}
\end{figure}
The strategy of the algorithmic approach is to identify which combinations of the most atomic parts of a name-string (i.e.\ the UTF-8 encoded characters) represent words (such as genus name, species name, authors, annotations) or dates. An early algorithmic approach to parsing scientific names used a ``regular language'' implemented as regular expressions~\cite{Leary2007}. A regular expression is a sequence of characters that describes a search pattern~\cite{aho1992foundations}. For example, the regular expression ``[A-Z][a-z]\{2\}'' recognizes a word that starts with a capital letter followed by two lowercase letters (e.g.\ ``Zoo''). Scientific names almost universally follow patterns that are influenced by the Codes of Nomenclature, such as the use of spaces to separate words, capitalization of generic names and authors, or the inclusion of four-digit dates between the middle of the 18th century and the present. This makes most names amenable to parsing by regular expressions. Current examples of scientific name parsers based on regular expressions are GBIF's \textit{name-parser}~\cite{gbifNameParser} and \textit{YASMEEN}~\cite{VandenBerghe2015}.
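As an illustration, the three-letter pattern above and two similar ones can be tried in Scala with the standard \texttt{scala.util.matching.Regex} class. The \texttt{RegexExample} object and the patterns for capitalized words and years are illustrative assumptions; they are not taken from any of the parsers cited here.

\begin{lstlisting}[language=scala,basicstyle=\scriptsize\ttfamily,breaklines=true]
object RegexExample {
  // The pattern from the text: a capital letter followed by two lowercase letters.
  val ThreeLetterWord = "[A-Z][a-z]{2}".r

  // Illustrative assumptions: a capitalized word such as a genus or an author,
  // and a four-digit year from 1700 onwards.
  val CapitalizedWord = "[A-Z][a-z]+".r
  val Year = """\b(?:1[7-9]|20)\d{2}\b""".r

  // True only if the whole string matches the three-letter pattern.
  def isThreeLetterWord(s: String): Boolean =
    ThreeLetterWord.pattern.matcher(s).matches()

  // All capitalized words found anywhere in the string.
  def capitalizedWords(s: String): List[String] =
    CapitalizedWord.findAllIn(s).toList

  // The first four-digit year in the string, if any.
  def firstYear(s: String): Option[String] =
    Year.findFirstIn(s)

  def main(args: Array[String]): Unit = {
    val nameString = "Homo sapiens Linnaeus, 1758"
    println(isThreeLetterWord("Zoo"))
    println(capitalizedWords(nameString))
    println(firstYear(nameString))
  }
}
\end{lstlisting}

For the name-string \textbf{Homo sapiens Linnaeus, 1758} these patterns find the capitalized words \textbf{Homo} and \textbf{Linnaeus} and the year \textbf{1758}, but they say nothing about the role each word plays in the name.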
While regular expressions are a powerful approach to string parsing, they have limitations. They cannot elegantly deal with name-strings where an authorship element is present in the middle of the name (for example \textbf{Carex scirpoidea Michx.\ subsp.\ convoluta (Kük.) D.A.Dunlop}). Indeed, regular expressions are not well suited to any targets with recursive (nested) elements~\cite{yu1997handbook}, such as hybrid formulae (e.g.\ \textbf{Brassica oleracea L.\ subsp.\ capitata (L.) DC.\ convar.\ fruticosa (Metzg.) Alef.\ $\times$ B.\ oleracea L.\ subsp.\ capitata (L.) var.\ costata DC.}). Name parsing built on regular expressions is impractical for complex name-strings.
Another limitation of most regular expression software tools is that they are ``black boxes'' that allow developers very limited interaction with the parsing process. They do not reveal much information about the parsing context, and developers cannot call a procedure during a parsing event. As a result, complex regular expression-based parsers are difficult to implement and maintain, and functions such as error recovery, detailed warnings, and descriptions of errors are missing.
We wanted to deal with scientific names across a very broad range of complexity and to give more flexibility than can be achieved with a regular expression approach. We believe that a scientific name parser should satisfy the following requirements.
\begin{enumerate}
\item \textbf{High Quality.} A parser should be able to break names into their semantic elements to the same standards that can be achieved by a trained nomenclaturalist or better. This will give users confidence in the automated process and allow them to set aside tedious and expensive manual parsing.
\item \textbf{Global Scope.} A parser should be able to parse all types of scientific names, inclusive of the most complex name-strings such as hybrid formulae, multi-infraspecific names, names with multilevel authorships and so on. No name-strings should be left unparsed, otherwise biological information attached to them may remain undiscoverable.
\item \textbf{Parsing Completeness.} All information included in a name-string is important, not just the canonical form of the scientific name. Authorship, year, rank information allow us to distinguish homonyms, similar names, synonyms, spelling mistakes, or chresonyms. Access to such information improves the performance of subsequent reconciliation (the mapping of all alternative name-strings for the same taxon against each other).
\item \textbf{Speed.} Users, especially large-scale aggregators of biodiversity data, are more satisfied with speedy processing of data as it allows them to move forward to more purposeful value-adding tasks. Speed reduces the purchasing/operating costs of the hardware used for production parsing.
\item \textbf{Accessibility.} To be available to the widest possible audience, a parser should be released as a stand-alone program, have good documentation, and be able to work as a library, as a command line tool, as a tool within a graphical interface, as a socket server, or as a RESTful service.
\end{enumerate}
These requirements became our design goals. Based on our experience with prototype systems, we chose to use a Parsing Expression Grammar and the Scala language.
\subsection*{Adoption of Parsing Expression Grammar}
Parsing Expression Grammars (PEG)~\cite{Ford2004} were introduced for parsing strings. PEG allows developers to define the rules (``grammar'') that describe the general structure of target strings. Such rules can be used to deconstruct scientific names. The rules are built from the ground up, starting from the simplest, such as a combination of ``characters'' separated by ``spaces''. That `rule' identifies most ``words''. Digits and other characters make dates identifiable. Further rules can then be added. For example, a ``genus'' rule can describe a part of a polynomial name-string in which the first word begins with a ``capital\_character'' followed by several ``lower\_case\_characters'' that fall within a relatively small spectrum of allowed characters; an ``authorship'' rule would consist of one or more capitalized words, perhaps followed by a ``year''. Within some instances of authorship, authors may be grouped to form ``author-teams''. PEG rules are designed to be recursive. They can be expanded to deal with increasingly complex name-strings, or to address errors such as absent or extra spaces, or OCR errors. Each rule can have programmatic logic attached, making the PEG approach very flexible. We believe that PEG suits our goals better than regular expressions for the following reasons:
\begin{itemize}
\item PEG is better suited than regular expressions for strings with a recursive structure;
\item the syntax of scientific names is formal enough to be closer to an algebraic structure rather than to a natural language. Inconsistencies and ambiguities in scientific name-strings are relatively rare because they usually comply with the requirements and conventions of nomenclatural codes;
\item scientific name-strings are short enough to avoid problems with computational complexity and memory consumption;
\item a parser programmed with PEG can describe its parsing rules in a domain-specific language;
\item domain-specific languages offer great flexibility for logic within the rules, for example to report errors in name-strings.
\end{itemize}
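The way such rules compose can be sketched with a few hand-written combinators in plain Scala. This is a minimal illustration of sequence, optional, and recursive rules under simplifying assumptions; it is neither the API of an existing PEG library nor the \textit{gnparser} grammar. All identifiers (\texttt{Rule}, \texttt{seq}, \texttt{opt}, \texttt{after}, \texttt{PegSketch}) are hypothetical, and the grammar covers only a genus, an optional species epithet, an optional author with year, and hybrid formulae joined by the multiplication sign.

\begin{lstlisting}[language=scala,basicstyle=\scriptsize\ttfamily,breaklines=true]
object PegSketch {
  final case class Name(genus: String, epithet: Option[String],
                        author: Option[String], year: Option[String])

  // A rule maps (input, position) to an optional (value, next position).
  type Rule[A] = (String, Int) => Option[(A, Int)]

  // Terminal: the longest non-empty run of characters accepted by ok.
  def run(ok: Char => Boolean): Rule[String] = (in, pos) => {
    val stop = in.indexWhere(c => !ok(c), pos)
    val end = if (stop == -1) in.length else stop
    if (end > pos) Some((in.substring(pos, end), end)) else None
  }

  // Terminal: a literal string.
  def lit(s: String): Rule[String] = (in, pos) =>
    if (in.startsWith(s, pos)) Some((s, pos + s.length)) else None

  // Sequence: a, then b. b is by-name so that a rule may refer to itself.
  def seq[A, B](a: Rule[A], b: => Rule[B]): Rule[(A, B)] = (in, pos) =>
    a(in, pos).flatMap { case (x, p) => b(in, p).map { case (y, q) => ((x, y), q) } }

  // Optional: never fails and consumes nothing when a fails.
  def opt[A](a: Rule[A]): Rule[Option[A]] = (in, pos) =>
    a(in, pos) match {
      case Some((x, p)) => Some((Some(x), p))
      case None         => Some((None, pos))
    }

  // Attach logic to a rule by transforming the value it produces.
  def map[A, B](a: Rule[A])(f: A => B): Rule[B] = (in, pos) =>
    a(in, pos).map { case (x, p) => (f(x), p) }

  // A separator followed by a rule; only the value of the rule is kept.
  def after[A](sep: String, a: => Rule[A]): Rule[A] = map(seq(lit(sep), a))(_._2)

  lazy val upper: Rule[String] = (in, pos) =>
    if (pos < in.length && in.charAt(pos).isUpper) Some((in.substring(pos, pos + 1), pos + 1))
    else None
  lazy val genus: Rule[String] =
    map(seq(upper, run(_.isLower))) { case (u, rest) => u + rest }
  lazy val epithet: Rule[String] = run(_.isLower)
  lazy val author: Rule[String] =
    map(seq(upper, opt(run(c => c.isLetter || c == '.')))) {
      case (u, rest) => u + rest.getOrElse("")
    }
  lazy val year: Rule[String] = (in, pos) =>
    run(_.isDigit)(in, pos).filter { case (digits, _) => digits.length == 4 }
  lazy val authorship: Rule[(String, Option[String])] =
    map(seq(author, opt(seq(opt(lit(",")), after(" ", year))))) {
      case (a, y) => (a, y.map(_._2))
    }
  lazy val name: Rule[Name] =
    map(seq(seq(genus, opt(after(" ", epithet))), opt(after(" ", authorship)))) {
      case ((g, e), a) => Name(g, e, a.map(_._1), a.flatMap(_._2))
    }
  // Recursive rule: a hybrid formula is a name, optionally followed by
  // the multiplication sign and another hybrid formula.
  lazy val hybrid: Rule[List[Name]] =
    map(seq(name, opt(after(" \u00d7 ", hybrid)))) {
      case (first, rest) => first :: rest.getOrElse(Nil)
    }

  // Succeeds only when the whole name-string is consumed.
  def parse(s: String): Option[List[Name]] =
    hybrid(s, 0).collect { case (names, end) if end == s.length => names }
}
\end{lstlisting}

With this sketch, \textbf{Homo sapiens Linnaeus, 1758} is parsed into a single name with all four elements, whereas a name-string that begins with a lowercase letter is rejected. Ranks, basionym authorship in parentheses, and author teams would each need further rules.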
The Global Names project created a specialized parsing library \textit{biodiversity} in 2008~\cite{biodiversity}. It was written in Ruby and based on PEG\@. It uses the \textit{TreeTop} Ruby library~\cite{treetop} as the underlying PEG implementation.
The PEG approach allowed us to deal with complex scientific names gracefully. It gave us flexibility to incorporate edge cases and to detect common mistakes during the parsing process. The \textit{biodiversity} library has enjoyed considerable popularity. At the time of writing, it had been downloaded more than 150,000 times~\cite{bdiv-downloads}, and it is used by many taxon name resolution projects (e.g.\ Encyclopedia of Life~\cite{eol}, Canadian Register of Marine Species (CARMS)~\cite{carms}, the iPlant TNRS~\cite{iplant}, and World Register of Marine Species (WoRMS)~\cite{worms}). According to statistics compiled by BioRuby, \textit{biodiversity} was, at the time of writing, the most popular bio-library in the Ruby language~\cite{biogems}.
We were pleased with the PEG approach for parsing scientific names, but regard the \textit{biodiversity} parser library as a working prototype. It has allowed us to make further improvements and deliver a better, faster production-grade parser.
\subsection*{Other approaches}
There is a growing number of algorithms and tools in machine learning and natural language processing that aim to recognize parts of texts. They include statistical parsing~\cite{charniak1996statistical}, context-free grammars~\cite{aho1972theory}, fuzzy context-free grammars~\cite{asveld1995fuzzy}, and named entity recognition~\cite{nadeau2007survey}. Unsupervised deep learning~\cite{mikolov2013distributed, schmidhuber2015deep} increases the quality of entity recognition without extensive curation and programming efforts by people. We chose not to use these approaches for the following reasons.
\begin{itemize}
\item The limited scope of a parser. A parser of scientific names very rarely needs to work with name-strings of more than 15 words.
\item There is no need for recognition. A scientific name-string parser is usually applied to preexisting lists of scientific names. There is no requirement to recognize scientific names in larger bodies of text. Other scientific name recognition and discovery tools are available.
\item Formal grammar. Scientific names are formed in compliance with well-defined and formal codes of nomenclature. They have predictable structures, which makes the requirements for a scientific name-string parser more similar to those for parsers of programming languages than to those for tools designed to work with natural languages.
\item Scale and throughput. We created the parser to serve the needs of biodiversity aggregators. A core design requirement was to develop a lightweight library for inputs of millions of scientific name-strings per second, and to be processed locally.
\item Stand-alone approach. We did not wish the parser to rely on local or remote previously known information about genera, species, author names, or other scientific names. \textit{gnparser} relies instead on morphological features of scientific name-strings.
\item Determinism. Biologists know that there is only a single correct parsed version of a scientific name. A scientific name parser must produce a single ``correct'' result for each input string. A parser should provide meta information on every part of the string.
\end{itemize}
\subsection*{Adoption of Scala}
The pre-existing \textit{biodiversity} package is not speedy and cannot scale because it uses Ruby as its programming language. Ruby is one of the best languages for rapid prototyping, but it is an interpreted dynamic language with, originally, a single-threaded runtime during execution. This makes it slow and inappropriate for ``Big Data'' tasks. We concluded that we needed a replacement language environment with the following properties:
\begin{itemize}
\item a mature technology;
\item multithreaded, with high performance and scalability;
\item an active support community with an Open source friendly culture;
\item a wide range of libraries: utilities, web frameworks, etc.;
\item a powerful development environment with IDEs, testing frameworks, debuggers, profilers and the like;
\item mature libraries for search and cluster computations;
\item interoperable with languages popular in the scientific community (R, Python, Matlab);
\item natural support of domain specific languages embedded in the host language.
\end{itemize}
While many of these properties hold for Ruby, others, such as high performance, scalability and interoperability, do not. To meet all requirements, and exploiting what we had learned from \textit{biodiversity}, we rewrote the code using Scala (a Java virtual machine programming language~\cite{odersky2004overview}), and the Open source \textit{parboiled2} library~\cite{Myltsev:inpress-a} which we improved~\cite{parboiled2-gna}. The \textit{parboiled2} library implements PEG in Scala. An alternative to \textit{parboiled2} is the Scala parser combinators library~\cite{moors2008parser}. We did not use it because it is slow and has memory consumption problems.
The functional programming features of Scala allowed us to build a domain specific language that describes the rules of the grammars to parse scientific names. This produces a Parsing Expression Grammar with considerably more flexibility than external parser generators such as Bison or Yacc. As this domain specific language is within \textit{parboiled2}, it can take advantage of the macro facility of Scala~\cite{Burmako:2013:SML:2489837.2489840} to optimize the compilation of the code and the subsequent running of the program. As a result, the software performs with high efficiency. The resulting \textit{gnparser} library is faster, more scalable and more flexible than its predecessor.
We limited this version to work with scientific names that comply with the botanical, zoological, and prokaryotic codes of nomenclature, but not with names of viruses because they are formed in different ways~\cite{ICTV, Patterson2016} and need a different PEG\@. We intend to add this later.
\section*{Implementation}
The \textit{gnparser} project is entirely written in Scala. It supports two major Scala versions: 2.10.6+ and 2.11.x. The code is organized into four modules:
\begin{enumerate}
\item ``\textit{parser}'' is the core module used by all other modules. It parses scientific names from the most atomic components of a name-string to semantically-defined terms. It includes the parsing grammar, an abstract syntax tree (AST) composed of the elements of scientific names, warning and error facilities. When the parsing is complete and semantic elements of name-strings have been assigned to AST nodes, the elements can be recombined and formatted to meet further needs. For example:
\begin{itemize}
\item \textit{normalizer} converts input name-strings into a consistent style;
\item \textit{canonizer} creates canonical forms of the latinized elements of names;
\item \textit{JSON renderer} converts the parsing result to JSON~\cite{bray2014javascript} to allow developers to work with the output using other languages. The output (Figure~\ref{figure:webgui}, also see \hyperref[sec:discussion]{Discussion}) has the following information: \textbf{'details'} contains the JSON representation of a parsed scientific name; \textbf{'quality\_warnings'} describes potential problems if names are not well-formed; \textbf{'quality'} depicts a quality level of the parsed name; and \textbf{'positions'} maps the positions of every element in a parsed name to the semantic meaning of the element. A full and formal explanation of all parser fields is given as a JSON schema and can be found online~\cite{gnparser-json} [also see Additional file 1].
\end{itemize}
\item The ``\textit{spark-python}'' module contains facilities to use ``\textit{gnparser}'' with Apache Spark scripts written in Python. Apache Spark is a highly scalable environment for distributed processing of massive data sets. Spark is written in Scala, but can also be used with Python, R and Java. Spark programs written in Java and Scala are able to run ``\textit{parser}'' in a distributed fashion natively.
\item The ``\textit{examples}'' module contains examples to assist developers in adding ``\textit{parser}'' functionality into other popular programming languages such as Java, Scala, Jython, JRuby, and R.
\item The ``\textit{runner}'' module contains the code that allows users to run ``\textit{parser}'' from a command line as a standalone tool or to run it as a TCP/IP socket or HTTP web server. It depends on the ``\textit{parser}'' module. The core part is the launch script ``\textit{gnparse}'' (for Linux/Mac and Windows) that creates a JVM instance and runs ``\textit{parser}'' on multiple threads against the input provided via a socket or file. This module also contains a web application and a RESTful interface to offer simpler ways to access ``\textit{parser}''. The ``\textit{web}'' component interacts with ``\textit{parser}'' via the HTTP protocol. It works both with simple web (HTML) and REST API interfaces. Figure~\ref{figure:webgui} illustrates a parsing example using the web interface. Socket and REST services use the Akka framework, which makes them highly concurrent and scalable.
\end{enumerate}
``\textit{parser}'' and ``\textit{examples}'' can run in JVM~1.6+. ``\textit{runner}'' requires JVM~1.8+. Documentation is available in a README file [see Additional file 3].
\begin{figure}
\begin{center}
\caption{Web Graphical User Interface~\cite{gnparser-web}. In this example a user entered a name-string of a hybrid name consisting of 21 elements. The ``Results'' section contains detailed parsed output using compact JSON format.}\label{figure:webgui}
\end{center}
\end{figure}
\subsection*{Parsing rules}
\textit{gnparser} v0.3.1 contains 76 PEG rules. In turn, these rules make use of more elementary rules provided by the \textit{parboiled2} library. The rules are domain-specific and are based on hours of conversations with leading taxonomists, study of the nomenclatural codes, and feedback from users.
As an example, the \textit{yearNumber} rule is given below. It detects the year in which a name was published. \textit{Rule[Year]} is the return type of the rule. Using the domain-specific language and the elementary rules of \textit{parboiled2}, we capture the start and the end positions of a year substring (lines \#1 and \#2). This matches a substring that represents a year in scientific name-strings. A publication year is usually a number between 1753~\cite{Linne1753} and the present. A year substring might have one or two digits substituted with question marks if the exact year of a publication is unknown. The capture is then passed as a parameter to a parser action (line \#3). The parser action, a Scala function, might produce warnings or a class instance of the defined type (\textit{Rule[Year]}).
\begin{lstlisting}[language=scala]
def yearNumber: Rule[Year] = rule { capturePos(   // #1
    CharPredicate("12") ~ CharPredicate("0789") ~
    Digit ~ (Digit|'?') ~ '?'.?                   // #2
  ) ~> { (yPos: CapturePosition) =>               // #3
    FactoryAST.year(yPos)                         // #4
  }
}
\end{lstlisting}
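The character pattern of this rule can also be approximated outside of \textit{parboiled2}. The following Python sketch is an illustration only and is not part of \textit{gnparser}; the names \texttt{YEAR\_NUMBER} and \texttt{year\_positions} are hypothetical. It mirrors the \textit{yearNumber} matcher with a regular expression and returns the start and end positions of the first year-like substring. Unlike the PEG rule, which is applied at a specific point of the grammar, the regular expression scans the whole string.
\begin{lstlisting}[language=python]
import re

# Regex counterpart of yearNumber (an approximation, not gnparser code):
# [12], then [0789], then a digit, then a digit or '?', then an optional '?'
YEAR_NUMBER = re.compile(r"[12][0789][0-9][0-9?]\??")

def year_positions(name_string):
    """Return (start, end) of the first year-like substring, or None."""
    found = YEAR_NUMBER.search(name_string)
    return found.span() if found else None

print(year_positions("Myosorex muricauda (Miller, 1900)"))
print(year_positions("Carex scirpoidea Michx."))
\end{lstlisting}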
We then assemble more complex inter-dependent rules (lines \#5 to \#10), and finally combine all of them into the rule \textit{year} on line \#11 that consists of prioritized alternatives of all previously defined rules.
\begin{lstlisting}[language=scala]
def yearWithChar = rule { yearNumber ~ capturePos(Alpha) }  // #5
def yearWithParens = rule { '(' ~ (yearWithChar |
    yearNumber) ~ ')' }                                     // #6
def yearWithPage = rule { (yearWithChar | yearNumber) ~
    ':' ~ oneOrMore(Digit) }                                // #7
def yearApprox = rule { '[' ~ yearNumber ~ ']' }            // #8
def yearWithDot = rule { yearNumber ~ '.' }                 // #9
def yearRange = rule { yearNumber ~ '-' ~
    capturePos(Digit.+) ~ (Alpha ++ "?").* }                // #10
def year = rule { yearRange | yearApprox |
    yearWithParens | yearWithPage | yearWithDot |
    yearWithChar | yearNumber                               // #11
}
\end{lstlisting}
This enables the incorporation of the \textit{year} rule into all cases where it might be needed. For example, on line \#12 we indicate that \textit{year} must be present in the matcher for the \textit{authorsYear} rule.
\begin{lstlisting}[language=scala]
def authorsYear: RuleNodeMeta[AuthorsGroup] = rule {
  authorsGroup ~ softSpace ~ (',' ~ softSpace).? ~ year ~> {  // #12
    (aM: NodeMeta[AuthorsGroup], yM: NodeMeta[Year]) =>
      val a1 = for { a <- aM; y <- yM } yield a.copy(year = y.some)
      a1.changeWarningsRef((aM.node, a1.node))
  }
}
\end{lstlisting}
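The effect of prioritized alternatives can be sketched in Python. This is an illustration under stated assumptions rather than the \textit{gnparser} implementation: the regular expressions only approximate the PEG rules listed above, and \texttt{ALTERNATIVES} and \texttt{match\_year} are hypothetical names. Each alternative is tried in the order of the \textit{year} rule, and the first one that matches at the given position wins.
\begin{lstlisting}[language=python]
import re

YEAR_NUMBER = r"[12][0789][0-9][0-9?]\??"
YEAR_WITH_CHAR = YEAR_NUMBER + r"[A-Za-z]"

# Alternatives in the same priority order as the year rule (line #11)
ALTERNATIVES = [
    ("yearRange", YEAR_NUMBER + r"-[0-9]+[A-Za-z?]*"),
    ("yearApprox", r"\[" + YEAR_NUMBER + r"\]"),
    ("yearWithParens",
     r"\((?:" + YEAR_WITH_CHAR + "|" + YEAR_NUMBER + r")\)"),
    ("yearWithPage",
     "(?:" + YEAR_WITH_CHAR + "|" + YEAR_NUMBER + "):[0-9]+"),
    ("yearWithDot", YEAR_NUMBER + r"\."),
    ("yearWithChar", YEAR_WITH_CHAR),
    ("yearNumber", YEAR_NUMBER),
]

def match_year(text, pos=0):
    """Try each alternative in order at pos.

    Returns (rule name, matched text) for the first match, or None.
    """
    for name, pattern in ALTERNATIVES:
        found = re.compile(pattern).match(text, pos)
        if found:
            return name, found.group(0)
    return None

for sample in ["1900", "1900a", "[1972]", "(1900)", "1970-72"]:
    print(sample, match_year(sample))
\end{lstlisting}
Because \textit{yearNumber} is a prefix of most other alternatives, it has to come last; placed first, it would always succeed and the more specific forms would never be tried.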
\subsection*{Installation}
``\textit{gnparser}'' is available for launch in three bundles.
\begin{itemize} \item A \textit{parser} artifact is provided via the Maven~central~repository of Java code~\cite{maven-globalnames}. Physically it is a relatively small jar file without embedded external dependencies. The artifact can be accessed in custom projects by a build system such as Maven, Gradle, or SBT\@. The build system identifies and provides access to all dependent jars.
\item A Zip-archived ``fat jar'' is located at the project's GitHub repository. The jar contains the compiled files of \textit{gnparser} along with all necessary dependencies to launch it within JVM\@. The archive is also bundled with a launch script (for Windows, OS~X and Linux) that can run a command line interface to \textit{gnparser}.
\item The project's Docker container image is located at Docker~Hub~\cite{gnparser-docker}. Docker provides an additional layer of abstraction and automation of operating-system-level virtualization on Linux. It can be thought of as a lightweight virtualization technology within a Linux OS host. When it is set up properly, everything, from the JVM to Scala and SBT, can be run with simple commands that will, for example, pull the \textit{gnparser} Docker image from Docker~Hub, and run the socket or web server on an appropriate port.
\end{itemize}
\subsection*{Testing Methods}
Data for our tests were sets of 1,000 and 100,000 name-strings randomly chosen from 24 million unique name-strings of the Global Names Index (GNI)~\cite{gn:index}. The name-strings in GNI are collected from a large variety of biodiversity data sources and are pre-identified as scientific names. While GNI contains some incorrectly classified strings, it is the largest compilation of name-strings representing scientific names. It is not biased towards any particular taxon or particular variant of name, and so the extracted datasets are believed to represent naturally occurring data quite well. The datasets are randomly chosen and are therefore mixtures of well-formed names, lexical variants of names, names with formatting and spelling mistakes, and name-strings that were misrepresented as names. Name-strings in the sets are independent of each other. An evaluation dataset with 1,000 names is included as Additional file 4.
We compared the performance of \textit{gnparser} with two other projects: \textit{biodiversity} parser~\cite{Boyle2013, biodiversity} (also developed by Global Names team), and the GBIF \textit{name-parser}~\cite{gbifNameParser}. The following versions were used: \textit{gnparser} v. 0.2.0, GBIF \textit{name-parser} v. 0.1.0, \textit{biodiversity} v. 3.4.1. To make comparisons, we calculated $Precision$, $Recall$ and $Accuracy$ (as described below) using a dataset consisting of 1,000 name-strings. We also tested the YASMEEN parser from iMarine~\cite{VandenBerghe2015}. With our dataset, YASMEEN generated many more mistakes than other parsers ($Precision$ 0.534, $Recall$ 1.0, $F1$ 0.6962), and was unable to finish a full dataset without crashing. We excluded it from further tests.
To estimate the quality of the parsers, we relied on their performance in representing canonical forms and terminal authorships. A canonical form represents the latinized elements of taxon names, while the terminal authorship refers to the author of the lowest subtaxon found in the scientific name. For example, with \textbf{Oriastrum lycopodioides Wedd.\ var.\ glabriusculum Reiche}, the canonical form is \textbf{Oriastrum lycopodioides glabriusculum} and the terminal authorship is \textbf{Reiche}, not \textbf{Wedd.}.
When both the canonical form and the terminal authorship were determined correctly we marked the result as true positive ($N_{tp}$). If one or both of them were determined incorrectly, the result was marked as a false positive ($N_{fp}$). Name-strings correctly discarded from parsing were marked as true negatives ($N_{tn}$). False negatives ($N_{fn}$) were name-strings which should have been parsed, but were not. The results of the tests are summarized in Table~\ref{table:precision}:
$Accuracy$ --- the proportion of all results that were correct. It is calculated as:
\[Accuracy = \dfrac{N_{tp} + N_{tn}}{N_{tp} + N_{tn} + N_{fp} + N_{fn}}\]
$Precision$ --- the proportion of name-strings parsed correctly compared to all detected name-strings. It is calculated as:
\[Precision = \dfrac{N_{tp}}{N_{tp} + N_{fp}}\]
$Recall$ --- the proportion of correctly detected name-strings relative to all parseable name-strings. It is calculated as:
\[Recall = \dfrac{N_{tp}}{N_{tp} + N_{fn}}\]
The $F1$-measure is a balanced harmonic mean (where $Precision$ and $Recall$ have the same weight). When $Precision$ and $Recall$ differ, the $F1$-measure allows results to be compared. It is calculated as:
\[F1 = \dfrac{2 \times Precision \times Recall}{Precision + Recall}\]
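The four measures can be expressed directly in code. The Python sketch below is illustrative only; the counts in it are hypothetical and are not taken from our results. The last line applies the $F1$ formula to the $Precision$ and $Recall$ values reported above for YASMEEN.
\begin{lstlisting}[language=python]
def accuracy(tp, tn, fp, fn):
    """Proportion of all results that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Correctly parsed name-strings among all parsed ones."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Correctly parsed name-strings among all parseable ones."""
    return tp / (tp + fn)

def f1(p, r):
    """Balanced harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts for a set of 1,000 name-strings (not real data)
tp, tn, fp, fn = 970, 10, 15, 5
p, r = precision(tp, fp), recall(tp, fn)
print(accuracy(tp, tn, fp, fn), p, r, f1(p, r))

# Precision 0.534 and Recall 1.0 give the reported F1 of 0.6962
print(round(f1(0.534, 1.0), 4))
\end{lstlisting}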
Some names in the dataset were not well-formed. If a human could extract the canonical form and the terminal authorship from them, we included them in our assessment. Examples of such name-strings are \textbf{``Hieracium nobile subsp.\ perclusum (Arv.\ -Touv.\ ) O.\ Bolòs \& Vigo''} (the problem for the parser here is an introduced space within an author's name), \textbf{``Campylium gollanii C. M?ller ex Vohra 1970 [1972]''} (with a miscoded UTF-8 symbol and an additional year in square brackets), \textbf{``Myosorex muricauda (Miller, 1900).''} (with a period after the authorship).
Parsers analyze the structure of name-strings, but they cannot determine if a string is a ``real'' name. Consider, for example, a name-string that has the same form as an infraspecific name, such as \textbf{``Example name Word var.\ something Capitalized Words, 1900''}. In such a case, the identification of a canonical form as \textbf{``Example name something''} and terminal authorship as \textbf{``Capitalized Words, 1900''} would be considered a true positive. Clearly, it will be important for name-management services to distinguish between name-strings of scientific names, names of viruses, surrogate names, and non-names. To find out how well parsers distinguished strings which are not scientific names, we calculated $Accuracy$ for discarded/non-parsed strings. If the parser worked well, non-parsed strings would include only names of viruses and terms that do not comply with the codes of zoological, prokaryotic, and botanical nomenclature.
We processed 100,000 name-strings with each parser. Each parser discarded close to 1,000 name-strings as non-parseable. $Accuracy$, in this case, is the percentage of correctly discarded names out of all names discarded by the parser. We do not know $Recall$, as it was not reasonable to determine this manually for 100,000 names. To get a sense of names which should have been discarded but were parsed instead, we analysed intersections and differences of the results between the three parsers as shown in Table~\ref{table:unparsed}.
To establish the throughput of parsing we used a computer with an Intel i7-4930K CPU (6 cores, 12 threads, at 3.4 GHz), 64GB of memory, and a 250GB Samsung 840 EVO SSD, running Ubuntu version 14.04. Throughput was determined by processing 1,000,000 random name-strings from the Global Names database.
To study the effects of parallel execution on throughput we used the \textit{ParallelParser} class from \textit{biodiversity} parser. We used `\textit{gnparse file --simple}' (a command line-based script set to return simplified output) for \textit{gnparser}. For GBIF \textit{name-parser}, we created a thin wrapper with multithreaded capabilities~\cite{gbifparser}. The following versions were used for throughput benchmarks: \textit{gnparser} v. 0.3.1, GBIF \textit{name-parser} v. 0.1.0, \textit{biodiversity} v. 3.4.1.
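The comparison of discarded name-strings amounts to simple set operations. The Python sketch below shows the idea; the three sets are hypothetical examples and do not reproduce the output of any of the parsers.
\begin{lstlisting}[language=python]
# Hypothetical sets of name-strings discarded by each parser
discarded = {
    "gnparser": {"Tobacco mosaic virus", "unidentified",
                 "acid mine drainage metagenome"},
    "biodiversity": {"Tobacco mosaic virus",
                     "acid mine drainage metagenome"},
    "gbif": {"Tobacco mosaic virus", "incertae sedis"},
}

sets = list(discarded.values())
discarded_by_all = set.intersection(*sets)
discarded_by_any = set.union(*sets)

# Discarded by some parsers but parsed by others:
# these are the candidates for manual inspection
disputed = discarded_by_any - discarded_by_all

# Name-strings discarded by exactly one parser
unique = {
    parser: names - set.union(
        *(other for name, other in discarded.items() if name != parser))
    for parser, names in discarded.items()
}

print(sorted(discarded_by_all))
print(sorted(disputed))
print({parser: sorted(names) for parser, names in unique.items()})
\end{lstlisting}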
\section*{Results and Discussion}\label{sec:discussion}
We discuss and compare \textit{gnparser}, GBIF \textit{name-parser} and \textit{biodiversity} parser in the context of our requirements for quality, global scope, parsing completeness, speed, and accessibility.
\subsection*{High Quality Parsing}
Quality is the most important of the 5 requirements. GBIF \textit{name-parser} uses a regular expression approach, while \textit{gnparser} and \textit{biodiversity} parser use the PEG approach. Results for quality measurements are shown in Table~\ref{table:precision} and Table~\ref{table:unparsed}. We include the 1,000 tested names as Additional File 4.
If test data contain a large proportion of true negatives ($N_{tn}$) $Accuracy$ will not be a good measure as it favors algorithms that distinguish negative results rather than finding positive ones. We manually checked our test datasets and established that $\approx1\%$ were not scientific names. Given that true negatives are rare, they will have very limited influence on $Accuracy$. $Recall$ for all parsers was high, hence false negatives are not important.
$Accuracy$ is probably the best measure for our tests. All 3 parsers performed very well, with $Accuracy$ values higher than $95\%$. Both \textit{gnparser} and \textit{biodiversity} parser approached the 99\% mark which we regard as the metric for production quality. Most of the false positives came from name-strings with mistakes. For example, out of 11 false positives (below) that \textit{gnparser} found in the 1,000 name-string test data set, only 2 (the first 2) were well-formed names.
\vspace{0.5cm}
\verbatimfont{\bfseries\rmfamily\small}
\begin{verbatim}
Eucalyptus subser. Regulares Brooker
Jacquemontia spiciflora (Choisy) Hall. fil.
Acanthocephala declivis variety guianensis Osborn, 1904
Atysa (?) frontalis
Bumetopia (bumetopia) quadripunctata Breuning, 1950
Cyclotella kã¼tzingiana Thwaites
Elaphidion (romaleum) tæniatum Leconte, 1873
Hieracium nobile subsp. perclusum (Arv. -Touv. ) O. Bolòs & Vigo
Leptomitus vitreus (Roth) Agardh{?}
Myosorex muricauda (Miller, 1900).
Papillaria amblyacis (M<81>ll.Hal.) A.Jaeger}
\end{verbatim}
\vspace{0.5cm}
We do expect a parser to deal with names that are not well-formed. That means overcoming problems such as aberrant characters which might arise from Unicode character miscodings, inappropriate annotations, or other mistakes. To alert users, \textit{gnparser} generates a warning when it identifies a problem in a name-string. The other parsers do not have this feature.
When parsers reach $\approx80\%$ $Accuracy$, they hit a ``long tail'' of problems where each particular type of a problem is rare. Every new manual check of additional test sets of 1,000--10,000 name-strings reveals new issues. Examples of these challenges are given elsewhere~\cite{Patterson2016}. For all three parsers, developers have to perform the meticulous task of adding new rules to address each rare case. That is, parsers need to be subject to continuous improvement. The problems found during preparation of this paper are being addressed in the next version of \textit{gnparser}. As the parsing rules improve, we believe that \textit{gnparser} can reach $>99.5\%$ $Accuracy$ without diminishing $Recall$.
As we incorporate new rules to increase $Recall$, we have to consider the risks of reducing $Precision$ by introducing new false positives. For example, the GBIF \textit{name-parser} allows the genus element of a name-string to start with a lowercase character. As a result the name-strings below were parsed as if they were scientific names, while the other parsers ignored them:
\vspace{0.5cm}
\begin{verbatim}
acid mine drainage metagenome
agricultural soil bacterium CRS5639T18-1
agricultural soil bacterium SC-I-8
algal symbiont of Cladonia variegata MN075
alpha proteobacterium AP-24
anaerobic bacterium ANA No.5
anoxygenic photosynthetic bacterium G16
archaeon enrichment culture clone AOM-SR-A23
bacterium endosymbiont of Plateumaris fulvipes
bacterium enrichment culture DGGE band 61_3_FG_L
barley rhizosphere bacterium JJ-220
bovine rumen bacterium niuO17
\end{verbatim}
\vspace{0.5cm}
Strategies like these may increase $Recall$ with certain low-quality datasets, but they decrease $Precision$. Many ``dirty'' datasets contain recurring problems. As an example, DRYAD contains many name-strings in which elements of scientific names are concatenated with an interpolated character such as `\_' (e.g. ``Homo\_sapiens'' and ``Pinoyscincus\_jagori\_grandis'')~\cite{Patterson2016}. For them, our solution was to include a ``preparser'' script which ``normalizes'' known problems that are inherent within particular datasets, and then to apply a high quality parser to the result.
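A ``preparser'' of this kind can be very small. The following Python sketch is a hypothetical example and not the script we used; the function name \texttt{prenormalize} is an assumption. It replaces the underscore separators found in such datasets with spaces and collapses repeated whitespace before the name-string is passed to a parser.
\begin{lstlisting}[language=python]
import re

def prenormalize(name_string):
    """Replace '_' separators with spaces and collapse whitespace."""
    with_spaces = name_string.replace("_", " ")
    return re.sub(r"\s+", " ", with_spaces).strip()

for raw in ["Homo_sapiens", "Pinoyscincus_jagori_grandis"]:
    print(prenormalize(raw))
\end{lstlisting}
Keeping dataset-specific repairs in a separate step means that the parsing rules themselves do not need to be relaxed, so $Precision$ on clean data is unaffected.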
Our testing also revealed differences between regular expressions and PEG approaches. Both can achieve high quality results with canonical forms of scientific names, but the regular expressions are less suitable for more complex name-strings. The recursive or nested nature of some scientific names can cause problems which become insurmountable for regular expressions.
\subsection*{Global Scope}
If we want to connect biological data using scientific names, no name-strings should be missed or rejected, no matter how complex they are. During our testing we found that $Accuracy$ of GBIF's \textit{name-parser} was depressed because, in part, the parser did not recognize hybrid formulae and infrasubspecific names with more than one infraspecific epithet. This case underscores the limitations of the regular expression approach. As examples, the following were not parsed by the GBIF \textit{name-parser}:
\vspace{0.5cm}
\textbf{Erigeron peregrinus ssp.callianthemus var.\ eucallianthemus} (a name-string with two infraspecific epithets)
\textbf{Polyporus varius var.\ nummularius f.\ undulatus (Pilát) Domanski, Orlos \& Skirg.} (two infraspecific epithets)
\textbf{Salvelinus fontinalis x Salmo gairdneri} (hybrid formula)
\textbf{Echinocereus fasciculatus var.\ bonkerae × E. fasciculatus var.\ fasciculatus} (hybrid formula)
\vspace{0.5cm}
The PEG approach supports nested parsing rules to create progressively more complex rules that manage such cases. The capacity to address recursion allows \textit{gnparser} to handle the full spectrum of scientific names that we have presented to it.
\begin{table}[htb]
\begin{center}
\caption{Precision/Recall for parsers applied to 1,000
name-strings}\label{table:precision}
\end{center}
\end{table}
\begin{table}[htb]
\begin{center}
\caption{Accuracy of non-parseable names detection out of 100,000
name-strings}\label{table:unparsed}
\end{center}
\end{table}
\subsection*{Parsing Completeness}
The extraction of canonical forms from name-strings representing scientific names is the most beneficial and widely used parsing goal. Sometimes, however, this may not be sufficient because the canonical form does not always distinguish a name completely.
In the example in Figure~\ref{figure:carex} \textbf{Carex scirpoidea convoluta} is a canonical form for \textbf{Carex scirpoidea var.\ convoluta Kükenthal} and \textbf{Carex scirpoidea ssp.\ convoluta (Kük.) Dunlop.} The first non-parsed name-string refers to the variety \textbf{convoluta} of \textbf{Carex scirpoidea} that had been described by \textbf{Kükenthal}. The second captures Dunlop's reclassification of \textbf{convoluta} as a subspecies. We are not able to distinguish between these two different names without knowing the rank and/or the corresponding authorship. Furthermore, it is useful to see in the second example that \textbf{(Kük.)} was the original author and \textbf{Dunlop} was the author of the new combination. Also, canonical forms do not distinguish between homonyms. The heather, \textit{Pieris japonica} (Thunb.) D. Don ex G. Don and the butterfly, \textit{Pieris japonica} Shirôzu, 1952 have the same canonical form \textbf{Pieris japonica}.
After matching by canonical form, the rank, authors, and ``types'' of authorship allow us to distinguish name-strings with similar or identical canonical elements. The name-string \textbf{Carex scirpoidea Michx.\ var.\ convoluta Kükenth.} adds the information that the species \textbf{Carex scirpoidea} was described by \textbf{Michx.}, which is not evident in the examples in the paragraph above.
Another area in which parsers with limited abilities can give misleading results is with negated names~\cite{Patterson2016}. In these cases, the name-string includes some annotation or marks to indicate that the information associated with the name does NOT refer to the taxon with the scientific name that is included. Examples include \textbf{Gambierodiscus aff toxicus} or \textbf{Russula xerampelina-like sp}.
All components of a name may be important and need to be parsed and categorized. With \textit{gnparser}, we describe the meaning of every element in the parsed name-string and present the results in JSON format. Parsing of \textbf{Carex scirpoidea Michx.\ subsp.\ convoluta (Kük.) D.A. Dunlop} gives the following JSON output:
\vspace{0.1cm}
\begin{lstlisting}[language=json]
{
"name_string_id" : "203213f3-99d1-5f5e-810a-4453c4d220cb",
"parsed" : true, "quality" : 1, "parser_version" : "0.3.1",
"verbatim" : "Carex scirpoidea Michx. subsp. convoluta (Kük.) D.A. Dunlop",
"normalized" : "Carex scirpoidea Michx. ssp. convoluta (Kük.) D. A. Dunlop",
"canonical_name" : {
"value" : "Carex scirpoidea convoluta", "extended" : "Carex scirpoidea ssp. convoluta"
},
"hybrid" : false, "surrogate" : false, "virus" : false,
"details" : [ {
"genus" : { "value" : "Carex" },
"specific_epithet" : {
"value" : "scirpoidea",
"authorship" : {
"value" : "Michx.",
"basionym_authorship" : { "authors" : [ "Michx." ] }
}
},
"infraspecific_epithets" : [ {
"value" : "convoluta", "rank" : "ssp.",
"authorship" : {
"value" : "(Kük.) D. A. Dunlop",
"basionym_authorship" : { "authors" : [ "Kük." ] },
"combination_authorship" : { "authors" : [ "D. A. Dunlop" ] }
}
} ]
} ],
"positions" : [ [ "genus", 0, 5 ], [ "specific_epithet", 6, 16 ], [ "author_word", 17, 23 ],
[ "rank", 24, 30 ], [ "infraspecific_epithet", 31, 40 ], [ "author_word", 42, 46 ],
[ "author_word", 48, 50 ], [ "author_word", 50, 52 ], [ "author_word", 53, 59 ] ]
}
\end{lstlisting}
\vspace{0.5cm}
The output includes the semantic meaning of all parsed elements in a name-string, indicates if the name-string was parsed successfully, if it is a virus name, a hybrid, or a surrogate. Surrogates are name-strings that are alternatives to names (such as acronyms) and they may or may not include part of a scientific or colloquial name (e.g. \textbf{Coleoptera sp. BOLD:AAV0432}). The output also includes a statement of the position of each element in the name-string. Last, but not least, the JSON output contains UUID version 5 calculated from the verbatim name-string. This UUID is guaranteed to be the same for the same name-string, promoting its use to globally connect information and annotations.
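The determinism of a version 5 UUID follows from its definition: it is a hash of a namespace identifier and a name. The Python sketch below illustrates this with the standard \texttt{uuid} module. The namespace used here, derived from the DNS name ``globalnames.org'', is an assumption made for the example, and the function name \texttt{name\_string\_id} is hypothetical; the identifiers it produces should not be expected to match \textit{gnparser} output unless the namespaces agree.
\begin{lstlisting}[language=python]
import uuid

# Assumption: a namespace derived from the DNS name "globalnames.org".
# The namespace actually used by gnparser is not stated in this paper.
GN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

def name_string_id(verbatim):
    """Return a UUID v5 string computed from a verbatim name-string."""
    return str(uuid.uuid5(GN_NAMESPACE, verbatim))

first = name_string_id("Homo sapiens L.")
second = name_string_id("Homo sapiens L.")
print(first, first == second)
\end{lstlisting}
Since no registry or database lookup is involved, independent systems can compute the same identifier for the same name-string without coordination.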
The output usually covers every semantic element in the name-string. The fields in the output illustrated above have the following meanings.
\begin{description}
\item[name\_string\_id:] UUID v5 identifier;
\item[parsed:] whether a name-string was successfully parsed (true/false);
\item[quality:] how well-formed a name-string is (range from 1 to 3, 1 is the best);
\item[parser\_version:] version of the parser used;
\item[verbatim:] name-string as submitted to \textit{gnparser};
\item[normalized:] name-string modified by the parser to give a normalized style;
\item[canonical\_name:] a special form of normalization that includes only the scientific elements of the name, this form is contained within most name-strings relating to scientific names;
\item[hybrid:] whether the name-string refers to a hybrid (true/false);
\item[surrogate:] whether a name-string is a surrogate name (true/false);
\item[details:] describes the semantic elements within the name-string inclusive of the following;
\item[genus:] reports the genus part of the name (in this case Carex);
\item[specific\_epithet:] reports the species epithet (scirpoidea);
\item[authorship:] reports the authorship of the combination (Michx.);
\item[basionym\_authorship:] reports the authorship of the basionym (Michx.);
\item[infraspecific\_epithets:] reports the infraspecies name if present (convoluta) with rank (ssp.);
\item[authorship:] reports the authors of the infraspecies name ((Kük.) D. A. Dunlop);
\item[basionym\_authorship:] reports the author of the basionym of infraspecies name element ([``Kük.'']);
\item[combination\_authorship:] reports the author of the infraspecies name combination (D. A. Dunlop); and
\item[positions:] identifies each name element and where it starts and ends.
\end{description}
The complete list of fields for the \textit{gnparser}'s output exists as a JSON Schema file~\cite{gnparser-json} [see Additional file 1].
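The \textbf{positions} field can be used to recover each element from the verbatim name-string. The Python sketch below copies the values from the JSON output shown above and treats each entry as a start offset (inclusive) and an end offset (exclusive); the variable names are chosen for the example.
\begin{lstlisting}[language=python]
verbatim = "Carex scirpoidea Michx. subsp. convoluta (Kük.) D.A. Dunlop"

# Copied from the "positions" field of the JSON output above
positions = [
    ["genus", 0, 5], ["specific_epithet", 6, 16],
    ["author_word", 17, 23], ["rank", 24, 30],
    ["infraspecific_epithet", 31, 40], ["author_word", 42, 46],
    ["author_word", 48, 50], ["author_word", 50, 52],
    ["author_word", 53, 59],
]

# Slice the verbatim string with each (start, end) pair
elements = [(role, verbatim[start:end])
            for role, start, end in positions]

for role, text in elements:
    print(role, text)
\end{lstlisting}
This makes it possible, for example, to highlight the semantic elements of a name in a user interface without parsing the string again.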
\subsection*{Parsing Speed}
In the areas of performance discussed above, there is little difference between \textit{biodiversity} parser and \textit{gnparser}. There is, however, a dramatic difference in their parsing speed and ability to scale. Parsing tasks that took 20 hours with earlier \textit{biodiversity} parsers can now be completed in a few minutes on a multithreaded computer. Parsing is a key to other services such as name-reconciliation and subsequent resolution. Improvements to the speed of the parser will increase user satisfaction elsewhere.
Results on the speed performance are given in Figure~\ref{figure:throughput}. The performance depends on the number of CPU threads used. On 1 thread \textit{gnparser} was 7 times faster than \textit{biodiversity}, 10 times faster on 4 threads, and 14 times faster on 12 threads.
\begin{figure}
\begin{center}
\caption{Names parsed per second by GN, GBIF and Biodiversity parsers
(running on 1--12 parallel threads).}\label{figure:throughput}
\end{center}
\end{figure}
\textit{gnparser} offers functionality not present in the GBIF \textit{name-parser}, as described in previous sections. In spite of this additional functionality, \textit{gnparser} outperformed the other tested parsers.
\subsection*{Accessibility}
By `accessibility' we refer to the ability of the software code to be used by a wide audience. For Open source projects, accessibility is very important. The more people use a piece of software, the more cost-effective its development becomes.
Parsing scientific names is essential for organizing biodiversity data. Many biodiversity database environments and projects include a parsing algorithm. Examples are uBio~\cite{ubio:parser}, the Botanical Society of Britain and Ireland~\cite{botsociety:parser}, FAT~\cite{Sautter2006}, NetiNeti~\cite{Akella2012}, and Taxonome~\cite{Kluyver2013}. A modular approach offers an option of re-use and avoids replication of effort. \textit{biodiversity} was the first biodiversity parser to be released as a stand-alone package that could be used as a module --- as it was with the iPlant project~\cite{Boyle2013}. The same approach has now been adopted with the GBIF \textit{name-parser}~\cite{gbifNameParser}, \textit{YASMEEN}~\cite{VandenBerghe2015}, and \textit{gnparser}.
We designed \textit{gnparser} with accessibility in mind from the outset. The Scala language allows the use of \textit{gnparser} as a library in Scala, Java, Jython, JRuby and a variety of other languages based on the Java Virtual Machine. It can also be used natively in R and Python via JVM-binding libraries. Apache Spark, a ``Big Data'' framework, is also supported. The following example illustrates how a client written in Jython can access the \textit{gnparser} functionality.
\begin{lstlisting}[language=python]
# Jython (Python 2 syntax); the gnparser jar must be on the classpath
from org.globalnames.parser import ScientificNameParser
snp = ScientificNameParser.instance()
# parse a name-string and render the result as compact JSON
result = snp.fromString("Homo sapiens L.").renderCompactJson()
print result
\end{lstlisting}
If programmers want to use \textit{gnparser} in a JVM-incompatible language, they can connect to the parser via a socket server interface. There is also a command line tool, a web interface, and a RESTful API\@. In 2016, Encyclopedia of Life started to parse name-strings using the \textit{gnparser} socket server.
We pay close attention to documentation, trying to keep it detailed, clear, and up to date. We have an extensive test suite [see Additional file 2] that describes the parser's behavior and contains examples of \textit{gnparser} functionality and output format.
This commitment to accessibility creates a larger potential audience for the parser, and will help many researchers and programmers deal with the problems that arise from variant forms of scientific names.
\section*{Conclusions}
The performance of the scientific name parsers is summarised in Table~\ref{table:summary}. The two PEG-based parsers, \textit{biodiversity} and \textit{gnparser}, are similar. They are based on the same algorithmic approach and follow similar design goals. While we had the option of modifying the rules for \textit{biodiversity} to improve $Accuracy$, we preferred to create a new tool from scratch to overcome limitations in speed, scalability and accessibility. We needed to address speed at Global Names because existing software took too long to parse or reparse 24 million name-strings. \textit{gnparser} can be used natively by a larger variety of programming languages than \textit{biodiversity}, because JVM-based languages and tools are so widely used. Our first goal for \textit{gnparser} was complete coverage of the \textit{biodiversity} test suite. We continue to improve \textit{gnparser}, while \textit{biodiversity} has entered maintenance mode. That explains the slight difference in $Accuracy$ between these two parsers.
GBIF \textit{name-parser} is a high quality product. However, its regular expression-based algorithm limits its usability. The recursive nature of some scientific names creates significant obstacles for intrinsically non-recursive algorithms such as regular expressions. Coverage of multi-infraspecific names and hybrids, both with recursive patterns, is prohibitively expensive for such an approach.
\begin{table}[htb]
\begin{center}
\caption{Summary comparison of Scientific Name Parsers}
\label{table:summary}
\end{center}
\end{table}
In conclusion, this paper describes \textit{gnparser}, a powerful tool for working with biodiversity information. It transforms names of taxa into their semantic elements. This allows standardization of names by, for example, representing them as canonical forms. This step dramatically improves name matching within and among data sources, which in turn increases the amount of data on a single taxon that can be integrated. Parsing can be used to improve the discovery of names in sources and to create a common taxonomic index to multiple sources. Parsing also allows users to extract, compare and analyse metadata within the name-strings, for example to compare the efforts of individuals or to map trends over time. The \textit{gnparser} tool is released under the open source MIT License, includes a command line executable as well as socket, web, and REST services, and is optimized for use as a library in languages such as Scala, Java, R, Jython, and JRuby.
\section*{Additional Files}
\subsection*{Additional file 1 --- gnparser.json} Includes a full and formal explanation of all parser fields as a JSON schema.
\subsection*{Additional file 2 --- test\_data.txt} Extensive test suite that describes the parser's behavior. It is also a source of examples of parser functionality and output format. Each entry in the test suite consists of pipe-delimited input (a scientific name) and the parsed output in JSON format.
\subsection*{Additional file 3 --- README.rst.html} The README.rst file converted to HTML format. It is also available at the project home page~\cite{gnparser}.
\subsection*{Additional file 4 --- 1000-name-strings.txt} 1,000 name-strings randomly selected from GNI and used to determine $Accuracy$, $Precision$ and $Recall$ data (Table~\ref{table:precision}).
\section*{Abbreviations}
\begin{description}
\item[AAM] -- Alexander A. Myltsev
\item[API] -- Application Programming Interface
\item[AST] -- Abstract Syntax Tree
\item[BHL] -- Biodiversity Heritage Library
\item[DJP] -- David J. Patterson
\item[DYM] -- Dmitry Y. Mozzherin
\item[GBIF] -- Global Biodiversity Information Facility
\item[GNA] -- Global Names Architecture
\item[GNI] -- Global Names Index
\item[JSON] -- JavaScript Object Notation
\item[JVM] -- Java Virtual Machine
\item[PEG] -- Parsing Expression Grammar
\item[REST] -- Representational State Transfer
\end{description}
\section*{Declarations}
All authors have read the manuscript, and the contents of this article have not been published elsewhere.
\subsection*{Acknowledgements}
The authors thank David Mark Welch (Josephine Bay Paul Center, Marine Biological Laboratory) for his leadership at the beginning of the \textit{gnparser} project. The authors also thank the administrators of the Species File Group for their much-needed support during the transfer of the GNA grant from the Marine Biological Laboratory to the University of Illinois.
\subsection*{Funding}
This work is supported by the National Science Foundation (NSF DBI-1356347). The Species File Group of the University of Illinois provided additional funding. The funding bodies had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
\subsection*{Availability of Data and Materials}
\begin{description}
\item[Project Name:] gnparser
\item[Project home page:] https://github.com/GlobalNamesArchitecture/gnparser
\item[Operating System:] Any platform able to run JVM~1.8
\item[Programming Language:] Scala
\item[License:] The MIT License
\item[Any restrictions to use by non-academic:] no restriction
\end{description}
The data supporting the conclusions of this article are available in the repository https://github.com/GlobalNamesArchitecture/gnparser-paper under
the \textit{data} directory.
\subsection*{Authors' Contributions}
DYM and AAM designed \textit{gnparser}. DYM created the requirements, the test suite and the original version of \textit{gnparser}. AAM optimized \textit{gnparser} for speed and refactored it into three internal subprojects. DYM set up Docker containers and Kubernetes scripts. DYM and AAM wrote the online documentation and the JSON schema that formalizes the output. DJP corrected the parser's results and calibrated the quality and error output. DYM and AAM drafted the manuscript and DJP edited its final version. All authors read and approved the final manuscript.
\subsection*{Competing Interests}
The authors declare that they have no competing interests.
\subsection*{Consent for Publication}
Not applicable
\subsection*{Ethics Approval and Consent to Participate}
Not applicable
\subsection*{Author Information}
Dmitry Y. Mozzherin, Alexander A. Myltsev, David J. Patterson
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% The Bibliography %%
%% %%
%% Bmc_mathpys.bst will be used to %%
%% create a .BBL file for submission. %%
%% After submission of the .TEX file, %%
%% you will be prompted to submit your .BBL file. %%
%% %%
%% %%
%% Note that the displayed Bibliography will not %%
%% necessarily be rendered by Latex exactly as specified %%
%% in the online Instructions for Authors. %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% if your bibliography is in bibtex format, use those commands:
\bibliographystyle{bmc-mathphys} % Style BST file
\bibliography{gnparser} % Bibliography file (usually '*.bib' )
\end{document}