ErgTokenization_ComplexExample
Differences in tokenization at various levels of analysis often present a technical (and sometimes conceptual) challenge, for example when seeking to apply a sequence classification model (e.g. a PoS tagger, supertagger, or uebertagger) prior to full parsing.
In the following, we distinguish three levels of processing (see the ErgTokenization page for background): (a) initial tokenization, i.e. the result of string-level pre-processing (see the ReppTop page for details on pre-processing rules included with the ERG); (b) internal tokenization, the state of affairs immediately prior to lexical lookup, i.e. upon completion of the token mapping phase; and (c) lexical tokenization, by which we refer to the result of lexical instantiation, i.e. the segmentation between instantiated lexical entries.
Note that only level (a) has a 'flat' form, i.e. forms a single sequence of tokens, whereas levels (b) and (c) will typically take the form of a lattice, i.e. admit token-level ambiguity. Compared to level (a), level (b) can both split up initial tokens and combine multiple initial tokens into a single internal token. Conversely, moving from level (b) to level (c), there is only further combination of multiple internal tokens into a single lexical token, viz. by virtue of instantiating a multi-word lexical entry.
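To make the contrast between a flat token sequence and a token lattice concrete, here is a minimal sketch in Python. It is not taken from the ERG or any DELPH-IN tool: the edge representation `(start, end, form)`, the function name `paths`, and the placeholder forms `w1`, `w2`, and `w1 w2` are all assumptions made for illustration only.

```python
from collections import defaultdict

def paths(edges, start, end):
    """Enumerate all token sequences leading from vertex `start` to vertex `end`.

    Each edge is a hypothetical (start_vertex, end_vertex, form) triple.
    """
    outgoing = defaultdict(list)
    for s, e, form in edges:
        outgoing[s].append((e, form))

    def walk(vertex):
        if vertex == end:
            return [[]]
        return [[form] + rest
                for nxt, form in outgoing[vertex]
                for rest in walk(nxt)]

    return walk(start)

# A flat sequence (as at level (a)): exactly one path through the vertices.
flat = [(0, 1, "w1"), (1, 2, "w2")]

# A lattice (as at levels (b) or (c)): an extra edge spans both tokens,
# e.g. a made-up multi-word entry, so there is token-level ambiguity.
lattice = flat + [(0, 2, "w1 w2")]

print(paths(flat, 0, 2))
print(paths(lattice, 0, 2))
```

In this toy setting, the flat sequence yields a single path, while the lattice yields one path per alternative segmentation.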
To get started, consider the (silly) example:
'Sun-filled', well-kept Mountain View.
The ERG REPP rules (as of mid-2011) will tokenize according to PTB conventions, splitting off (most) punctuation marks, but not breaking at dashes (or slashes). Thus, at level (a) there will be eight tokens, which (in YY format, and assuming PoS tags from TnT) might be the following:
(42, 0, 1, <0:1>, 1, "‘", 0, "null", "``" 1.0000)
(43, 1, 2, <1:11>, 1, "Sun-filled", 0, "null", "JJ" 0.7540 "NNP" 0.2211 "VBN" 0.0249)
(44, 2, 3, <11:12>, 1, "’", 0, "null", "''" 0.7433 "POS" 0.2567)
(45, 3, 4, <12:13>, 1, ",", 0, "null", "," 1.0000)
(46, 4, 5, <14:23>, 1, "well-kept", 0, "null", "VBD" 0.4979 "JJ" 0.3014 "NN" 0.0780 "VBN" 0.0464 "NNP" 0.0374 "VB" 0.0156 "JJS" 0.0129 "VBP" 0.0103)
(47, 5, 6, <24:32>, 1, "Mountain", 0, "null", "NNP" 1.0000)
(48, 6, 7, <33:37>, 1, "View", 0, "null", "NNP" 0.9591 "NN" 0.0409)
(49, 7, 8, <37:38>, 1, ".", 0, "null", "." 1.0000)
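For readers who want to inspect such tokens programmatically, the following is a rough sketch of a reader for token lines shaped like the ones above. It is an assumption-laden illustration, not the parser used by PET or any other DELPH-IN tool: the field names in `YYToken`, the function `parse_yy_token`, and the regular expressions are hypothetical, and the sketch assumes a single integer in the path position and tag/probability pairs written exactly as in this example.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# A double-quoted string, allowing backslash escapes (assumed convention).
QUOTED = r'"((?:[^"\\]|\\.)*)"'

# Fields as they appear in the example lines:
# (id, start, end, <from:to>, path, "form", ipos, "lrules", tag/probability pairs)
TOKEN_RE = re.compile(
    r'^\(\s*(\d+),\s*(\d+),\s*(\d+),\s*<(\d+):(\d+)>,\s*(\d+),\s*'
    + QUOTED + r',\s*(\d+),\s*' + QUOTED + r'(?:,\s*(.*))?\)\s*$'
)
TAG_RE = re.compile(QUOTED + r'\s+([0-9.]+)')

@dataclass
class YYToken:
    id: int
    start: int        # start vertex
    end: int          # end vertex
    char_from: int    # character offset where the token begins
    char_to: int      # character offset where the token ends
    form: str
    tags: List[Tuple[str, float]]

def parse_yy_token(line: str) -> YYToken:
    """Parse one token line; raise ValueError if it does not fit the assumed shape."""
    m = TOKEN_RE.match(line.strip())
    if m is None:
        raise ValueError("not a recognizable token line: " + line)
    tags = [(tag, float(prob)) for tag, prob in TAG_RE.findall(m.group(10) or "")]
    return YYToken(int(m.group(1)), int(m.group(2)), int(m.group(3)),
                   int(m.group(4)), int(m.group(5)), m.group(7), tags)

lines = [
    '(43, 1, 2, <1:11>, 1, "Sun-filled", 0, "null", "JJ" 0.7540 "NNP" 0.2211 "VBN" 0.0249)',
    '(47, 5, 6, <24:32>, 1, "Mountain", 0, "null", "NNP" 1.0000)',
    '(49, 7, 8, <37:38>, 1, ".", 0, "null", "." 1.0000)',
]

for line in lines:
    tok = parse_yy_token(line)
    best_tag = max(tok.tags, key=lambda pair: pair[1])[0]
    # In this example, each character span is exactly as long as the token form.
    assert tok.char_to - tok.char_from == len(tok.form)
    print(tok.start, tok.end, tok.form, best_tag)
```

Because the start and end vertices of these eight tokens form one unbroken chain from 0 to 8, the parsed tokens reflect the flat shape of level (a); a lattice would instead contain several tokens sharing a start or end vertex.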