ErgTokenization_ComplexExample

Motivation

Differences in tokenization at various levels of analysis often present a technical (and sometimes conceptual) challenge, for example when seeking to apply a sequence classification model (e.g. a PoS tagger, supertagger, or uebertagger) prior to full parsing.

In the following, we distinguish three levels of processing (see the ErgTokenization page for background): (a) initial tokenization, i.e. the result of string-level pre-processing (see the ReppTop page for details on pre-processing rules included with the ERG); (b) internal tokenization, the state of affairs immediately prior to lexical lookup, i.e. upon completion of the token mapping phase; and (c) lexical tokenization, by which we refer to the result of lexical instantiation, i.e. the segmentation between instantiated lexical entries.

Note that only level (a) has a 'flat' form, i.e. forms a single sequence of tokens, whereas levels (b) and (c) will typically take the form of a lattice, i.e. admit token-level ambiguity. Compared to stage (a), stage (b) can both split up initial tokens and combine multiple initial tokens into a single internal token. Conversely, moving from stage (b) to stage (c), there is only further combination of multiple internal tokens into a single lexical token, viz. by virtue of instantiating a multi-word lexical entry.
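
To render these two rewrite operations concrete, here is a toy Python sketch (not ERG code; split() and combine() are hypothetical helpers introduced purely for illustration), treating tokens as edges between chart vertices; note that the actual token mapping machinery renumbers vertices in its output, as visible in the listings below:

  import itertools

  # Toy illustration of the two lattice rewrites described above;
  # tokens are (start, end, form) edges between chart vertices.
  fresh = ("v%d" % i for i in itertools.count())

  def split(edge, parts):
      """Split one edge into a chain of edges, one per sub-form,
      introducing fresh interior vertices."""
      start, end, _ = edge
      vertices = [start, *(next(fresh) for _ in parts[1:]), end]
      return [(vertices[i], vertices[i + 1], form)
              for i, form in enumerate(parts)]

  def combine(chain, form):
      """Combine a chain of adjacent edges into a single spanning edge."""
      return (chain[0][0], chain[-1][1], form)

  print(split((4, 5, "well-kept"), ["well-", "kept"]))
  print(combine([(5, 6, "Mountain"), (6, 7, "View")], "Mountain View"))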

Initial Tokenization

To get started, consider a (silly) example:

  'Sun-filled', well-kept Mountain View.

The ERG REPP rules (as of mid-2011) will tokenize according to PTB conventions, splitting off (most) punctuation marks, but not breaking at dashes (or slashes). Thus, at level (a) there will be eight tokens, which (in YY format, and assuming PoS tags from TnT) might be the following:

  (42, 0, 1, <0:1>, 1, "‘", 0, "null", "``" 1.0000)
  (43, 1, 2, <1:11>, 1, "Sun-filled", 0, "null", "JJ" 0.7540 "NNP" 0.2211)
  (44, 2, 3, <11:12>, 1, "’", 0, "null", "''" 0.7433 "POS" 0.2567)
  (45, 3, 4, <12:13>, 1, ",", 0, "null", "," 1.0000)
  (46, 4, 5, <14:23>, 1, "well-kept", 0, "null", "VBD" 0.4979 "JJ" 0.3014)
  (47, 5, 6, <24:32>, 1, "Mountain", 0, "null", "NNP" 1.0000)
  (48, 6, 7, <33:37>, 1, "View", 0, "null", "NNP" 0.9591)
  (49, 7, 8, <37:38>, 1, ".", 0, "null", "." 1.0000)
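
For readers who want to manipulate such token lists programmatically, the following Python sketch parses single YY token lines of the exact shape exemplified above (identifier, start and end chart vertex, character link, path membership, surface form, ipos, lexical rule, and optional PoS tag and probability pairs); parse_yy_token() is a hypothetical helper, not part of any DELPH-IN distribution, and it assumes forms contain no embedded double quotes:

  import re

  # One YY token per line, in the field layout exemplified above.
  TOKEN_RE = re.compile(
      r'\((?P<id>\d+), (?P<start>\d+), (?P<end>\d+), '
      r'<(?P<frm>\d+):(?P<to>\d+)>, (?P<paths>\d+), '
      r'"(?P<form>[^"]*)", (?P<ipos>\d+), "(?P<lrule>[^"]*)"'
      r'(?:, (?P<tags>"[^"]+" [0-9.]+(?: "[^"]+" [0-9.]+)*))?\)')

  def parse_yy_token(line):
      match = TOKEN_RE.match(line.strip())
      if match is None:
          raise ValueError("not a YY token: " + line)
      tags = re.findall(r'"([^"]+)" ([0-9.]+)', match.group('tags') or '')
      return {'id': int(match.group('id')),
              'span': (int(match.group('start')), int(match.group('end'))),
              'link': (int(match.group('frm')), int(match.group('to'))),
              'form': match.group('form'),
              'tags': [(tag, float(p)) for tag, p in tags]}

  print(parse_yy_token(
      '(43, 1, 2, <1:11>, 1, "Sun-filled", 0, "null", '
      '"JJ" 0.7540 "NNP" 0.2211)'))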

Internal Tokenization

The parser-internal token mapping phase seeks to rewrite the initial tokens into a form that meets the ERG-internal assumptions about tokenization. Specifically, token mapping will re-attach (most) punctuation marks, on the one hand, and, on the other, introduce additional token boundaries, for example breaking at intra-word dashes (and slashes). For our running example, token mapping thus takes us back to what one would have obtained by just breaking at whitespace and after dashes: a lattice with a sequence of six token spans at its core, viz.

  (133, 0, 2, <0:11>, 1, "‘Sun-", 0, "null", "NN" 1.0000)
  (135, 0, 2, <0:11>, 1, "‘sun-", 0, "null")
  (123, 2, 5, <1:13>, 1, "filled’,", 0, "null")
  (128, 2, 5, <1:13>, 1, "filled’,", 0, "null", "JJ" 0.7540)
  (130, 2, 5, <1:13>, 1, "filled’,", 0, "null", "NNP" 0.2211)
  (117, 5, 6, <14:23>, 1, "well-", 0, "null")
  (132, 5, 6, <14:23>, 1, "well-", 0, "null", "NN" 1.0000)
  (125, 6, 7, <14:23>, 1, "kept", 0, "null")
  (126, 6, 7, <14:23>, 1, "kept", 0, "null", "VBD" 0.4979)
  (129, 6, 7, <14:23>, 1, "kept", 0, "null", "JJ" 0.3014)
  (87, 7, 8, <24:32>, 1, "Mountain", 0, "null")
  (131, 7, 8, <24:32>, 1, "Mountain", 0, "null", "NNP" 1.0000)
  (134, 7, 8, <24:32>, 1, "mountain", 0, "null")
  (91, 8, 10, <33:38>, 1, "View.", 0, "null")
  (127, 8, 10, <33:38>, 1, "View.", 0, "null", "NNP" 0.9591)
  (136, 8, 10, <33:38>, 1, "view.", 0, "null")
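
As a final illustration, grouping these internal tokens by their chart span (another small, hypothetical Python sketch; the data are transcribed from the listing above, with the tagged duplicates folded in) recovers the six underlying spans and the capitalization variants competing at each:

  from collections import defaultdict

  # (start, end) spans and forms transcribed from the lattice above.
  internal = [
      ((0, 2), "‘Sun-"), ((0, 2), "‘sun-"),
      ((2, 5), "filled’,"),
      ((5, 6), "well-"),
      ((6, 7), "kept"),
      ((7, 8), "Mountain"), ((7, 8), "mountain"),
      ((8, 10), "View."), ((8, 10), "view."),
  ]

  by_span = defaultdict(set)
  for span, form in internal:
      by_span[span].add(form)

  for span in sorted(by_span):
      print(span, sorted(by_span[span]))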

Lexical Tokenization

Finally, lexical instantiation can further combine multiple internal tokens into a single lexical token, viz. when a multi-word lexical entry is instantiated. In our running example, for instance, instantiating a multi-word lexical entry for 'Mountain View' (assuming the grammar provides one) would yield a single lexical token spanning the two internal tokens 'Mountain' and 'View'.
