
ErgTokenization_ComplexExample

StephanOepen edited this page Aug 12, 2011 · 17 revisions

Motivation

Differences in tokenization at various levels of analysis often present a technical (and sometimes conceptual) challenge, for example when seeking to apply a sequence classification model (e.g. a PoS tagger, supertagger, or uebertagger) prior to full parsing.

In the following, we distinguish three levels of processing (see the ErgTokenization page for background): (a) initial tokenization, i.e. the result of string-level pre-processing (see the ReppTop page for details on pre-processing rules included with the ERG); (b) internal tokenization, the state of affairs immediately prior to lexical lookup, i.e. upon completion of the token mapping phase; and (c) lexical tokenization, by which we refer to the result of lexical instantiation, i.e. the segmentation between instantiated lexical entries.

Note that only level (a) has a 'flat' form, i.e. forms a single sequence of tokens, whereas levels (b) and (c) will typically take the form of a lattice, i.e. admit token-level ambiguity. Compared to level (a), level (b) can both split up initial tokens and combine multiple initial tokens into a single internal token. Conversely, moving from level (b) to level (c), there is only further combination of multiple internal tokens into a single lexical token, viz. by virtue of instantiating a multi-word lexical entry.
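
To make the contrast between a flat sequence and a lattice concrete, here is a minimal sketch in Python. It represents tokens as edges between numbered vertices; the helper names (`is_flat`, `paths`) and the combined "Mountain View" edge are purely illustrative assumptions, not part of the ERG or its actual lexicon.

```python
from collections import defaultdict

def outgoing(edges):
    """Index edges (start, end, form) by their start vertex."""
    out = defaultdict(list)
    for start, end, form in edges:
        out[start].append((end, form))
    return out

def is_flat(edges):
    """True if no vertex has more than one outgoing edge (no token-level ambiguity)."""
    return all(len(targets) == 1 for targets in outgoing(edges).values())

def paths(edges, start, goal):
    """Enumerate all token sequences leading from vertex start to vertex goal."""
    out = outgoing(edges)
    def walk(vertex):
        if vertex == goal:
            yield []
            return
        for end, form in out[vertex]:
            for rest in walk(end):
                yield [form] + rest
    return list(walk(start))

# A flat fragment, as at level (a): one token per vertex span.
initial = [(5, 6, "Mountain"), (6, 7, "View")]

# A hypothetical lattice: an additional edge spans both tokens, as a
# multi-word entry would; whether such an entry exists is an assumption here.
lattice = initial + [(5, 7, "Mountain View")]

print(is_flat(initial), is_flat(lattice))
print(paths(lattice, 5, 7))
```

In this representation, combining tokens adds an edge that skips intermediate vertices, while splitting a token would introduce new intermediate vertices inside an existing span.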

Initial Tokenization

To get started, consider the (silly) example

  'Sun-filled', well-kept Mountain View.

The ERG REPP rules (as of mid-2011) will tokenize according to PTB conventions, splitting off (most) punctuation marks, but not breaking at dashes (or slashes). Thus, at level (a) there will be eight tokens, which (in YY format, and assuming PoS tags from TnT) might be the following:

  (42, 0, 1, <0:1>, 1, "‘", 0, "null", "``" 1.0000)
  (43, 1, 2, <1:11>, 1, "Sun-filled", 0, "null", "JJ" 0.7540 "NNP" 0.2211 "VBN" 0.0249)
  (44, 2, 3, <11:12>, 1, "’", 0, "null", "''" 0.7433 "POS" 0.2567)
  (45, 3, 4, <12:13>, 1, ",", 0, "null", "," 1.0000)
  (46, 4, 5, <14:23>, 1, "well-kept", 0, "null", "VBD" 0.4979 "JJ" 0.3014 "NN" 0.0780 "VBN" 0.0464 "NNP" 0.0374 "VB" 0.0156 "JJS" 0.0129 "VBP" 0.0103)
  (47, 5, 6, <24:32>, 1, "Mountain", 0, "null", "NNP" 1.0000)
  (48, 6, 7, <33:37>, 1, "View", 0, "null", "NNP" 0.9591 "NN" 0.0409)
  (49, 7, 8, <37:38>, 1, ".", 0, "null", "." 1.0000)
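
For readers who want to inspect such tokens programmatically, the following Python sketch parses one YY token line of the shape shown above. The field names (`cfrom`, `cto`, and so on) and the regular expression are assumptions based only on the eight example lines; this is not an official YY reader and it ignores parts of the format that do not occur here.

```python
import re
from typing import List, NamedTuple, Tuple

class YYToken(NamedTuple):
    id: int
    start: int                      # start vertex
    end: int                        # end vertex
    cfrom: int                      # character offset where the token begins
    cto: int                        # character offset where the token ends
    form: str
    tags: List[Tuple[str, float]]   # (PoS tag, probability) pairs

# Pattern for the layout used in the examples above (an assumption):
# (id, start, end, <from:to>, paths, "form", ipos, "lrule", "tag" prob ...)
TOKEN = re.compile(
    r'\((\d+), (\d+), (\d+), <(\d+):(\d+)>, (\d+), '
    r'"(.*?)", (\d+), "(.*?)"(?:, (.*))?\)$'
)
TAG = re.compile(r'"([^"]+)"\s+([0-9.]+)')

def parse_yy(line: str) -> YYToken:
    """Parse a single YY token line into a YYToken."""
    match = TOKEN.match(line.strip())
    if match is None:
        raise ValueError(f"not a YY token: {line!r}")
    ident, start, end, cfrom, cto, _paths, form, _ipos, _lrule, rest = match.groups()
    tags = [(tag, float(prob)) for tag, prob in TAG.findall(rest or "")]
    return YYToken(int(ident), int(start), int(end), int(cfrom), int(cto), form, tags)

text = "'Sun-filled', well-kept Mountain View."
line = '(43, 1, 2, <1:11>, 1, "Sun-filled", 0, "null", "JJ" 0.7540 "NNP" 0.2211 "VBN" 0.0249)'

token = parse_yy(line)
best_tag = max(token.tags, key=lambda pair: pair[1])[0]
print(token.form, best_tag, text[token.cfrom:token.cto])
```

The character offsets index into the original string, so slicing the input with `cfrom` and `cto` recovers the surface form of each token; the quote tokens are an exception in this example, since their forms are shown as directional quote marks while the input uses straight quotes.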