ErgTokenization_ComplexExample

Motivation

Differences in tokenization at various levels of analysis often present a technical (and sometimes conceptual) challenge, for example when seeking to apply a sequence classification model (e.g. a PoS tagger, supertagger, or uebertagger) prior to full parsing.

In the following, we distinguish three levels of processing (see the ErgTokenization page for background): (a) initial tokenization, i.e. the result of string-level pre-processing (see the ReppTop page for details on pre-processing rules included with the ERG); (b) internal tokenization, the state of affairs immediately prior to lexical lookup, i.e. upon completion of the token mapping phase; and (c) lexical tokenization, by which we refer to the result of lexical instantiation, i.e. the segmentation between instantiated lexical entries.

Note that only level (a) has a 'flat' form, i.e. it forms a single sequence of tokens, whereas levels (b) and (c) will typically take the form of a lattice, i.e. they admit token-level ambiguity. Compared to stage (a), stage (b) can both split up initial tokens and combine multiple initial tokens into a single internal token. Conversely, moving from stage (b) to stage (c), there is only further combination of multiple internal tokens into a single lexical token, viz. by virtue of instantiating a multi-word lexical entry.
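
To make the lattice view concrete, the following is a minimal, purely illustrative sketch (in Python; it is not part of any DELPH-IN tool, and all names are made up) of tokens as edges between chart vertices. A flat level (a) sequence is just the special case where every vertex has exactly one outgoing edge:

  from dataclasses import dataclass, field

  @dataclass(frozen=True)
  class Token:
      """One edge in a token lattice, spanning chart vertices [start, end)."""
      start: int           # chart vertex where the token begins
      end: int             # chart vertex where the token ends
      form: str            # surface form
      tags: tuple = ()     # optional (PoS tag, probability) pairs

  @dataclass
  class Lattice:
      tokens: list = field(default_factory=list)

      def add(self, token):
          self.tokens.append(token)

      def outgoing(self, vertex):
          """All tokens leaving a vertex; more than one signals ambiguity."""
          return [t for t in self.tokens if t.start == vertex]

  # Two competing edges leaving the same vertex, as in the level (b)
  # lattice below, where "Mountain" coexists with its downcased variant.
  lattice = Lattice()
  lattice.add(Token(7, 8, "Mountain", (("NNP", 1.0),)))
  lattice.add(Token(7, 8, "mountain"))
  print([t.form for t in lattice.outgoing(7)])   # ['Mountain', 'mountain']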

Initial Tokenization

To get started, consider a (silly) example:

  'Sun-filled', well-kept Mountain View.

The ERG REPP rules (as of mid-2011) will tokenize according to PTB conventions, splitting off (most) punctuation marks, but not breaking at dashes (or slashes). Thus, at level (a) there will be eight tokens, which (in YY format, and assuming PoS tags from TnT) might be the following:

  (1, 0, 1, <0:1>, 1, "‘", 0, "null", "``" 1.0000)
  (2, 1, 2, <1:11>, 1, "Sun-filled", 0, "null", "JJ" 0.7540 "NNP" 0.2211)
  (3, 2, 3, <11:12>, 1, "’", 0, "null", "''" 0.7433 "POS" 0.2567)
  (4, 3, 4, <12:13>, 1, ",", 0, "null", "," 1.0000)
  (5, 4, 5, <14:23>, 1, "well-kept", 0, "null", "VBD" 0.4979 "JJ" 0.3014)
  (6, 5, 6, <24:32>, 1, "Mountain", 0, "null", "NNP" 1.0000)
  (7, 6, 7, <33:37>, 1, "View", 0, "null", "NNP" 0.9591 "NN" 0.0409)
  (8, 7, 8, <37:38>, 1, ".", 0, "null", "." 1.0000)
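
As a concrete illustration of how such a line might be taken apart, here is a small sketch in plain Python and regular expressions; it is not an official YY reader, and the fields between the character span and the surface form (here 1, 0, and "null") are simply carried along as an opaque remainder rather than interpreted:

  import re

  # Rough pattern for the token lines above: id, start/end chart vertices,
  # <from:to> character span, surface form, and trailing "TAG" PROB pairs.
  YY_LINE = re.compile(
      r'\(\s*(?P<id>\d+),\s*(?P<start>\d+),\s*(?P<end>\d+),\s*'
      r'<(?P<cfrom>\d+):(?P<cto>\d+)>,\s*(?P<rest>.*?),\s*'
      r'"(?P<form>[^"]*)",\s*(?P<tail>.*)\)$')
  TAG = re.compile(r'"(?P<tag>[^"]+)"\s+(?P<p>[0-9.]+)')

  def parse_yy_token(line):
      m = YY_LINE.match(line.strip())
      if m is None:
          raise ValueError("not a YY token line: " + line)
      tags = [(t.group("tag"), float(t.group("p")))
              for t in TAG.finditer(m.group("tail"))]
      return {"id": int(m.group("id")),
              "start": int(m.group("start")), "end": int(m.group("end")),
              "span": (int(m.group("cfrom")), int(m.group("cto"))),
              "form": m.group("form"), "tags": tags}

  print(parse_yy_token(
      '(2, 1, 2, <1:11>, 1, "Sun-filled", 0, "null", "JJ" 0.7540 "NNP" 0.2211)'))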

Internal Tokenization

The parser-internal token mapping phase seeks to rewrite the initial tokens into a form that meets the ERG-internal assumptions about tokenization. Specifically, token mapping will, on the one hand, re-attach (most) punctuation marks and, on the other, introduce additional token boundaries, for example breaking at intra-word dashes (and slashes). For our running example, token mapping essentially takes us back to what one would have obtained by just breaking at whitespace and after dashes, yielding a lattice with a sequence of six token spans at its core, viz.

  (133, 0, 2, <0:11>, 1, "‘Sun-", 0, "null", "NN" 1.0000)
  (135, 0, 2, <0:11>, 1, "‘sun-", 0, "null")
  (123, 2, 5, <1:13>, 1, "filled’,", 0, "null")
  (128, 2, 5, <1:13>, 1, "filled’,", 0, "null", "JJ" 0.7540)
  (130, 2, 5, <1:13>, 1, "filled’,", 0, "null", "NNP" 0.2211)
  (117, 5, 6, <14:23>, 1, "well-", 0, "null")
  (132, 5, 6, <14:23>, 1, "well-", 0, "null", "NN" 1.0000)
  (125, 6, 7, <14:23>, 1, "kept", 0, "null")
  (126, 6, 7, <14:23>, 1, "kept", 0, "null", "VBD" 0.4979)
  (129, 6, 7, <14:23>, 1, "kept", 0, "null", "JJ" 0.3014)
  (87, 7, 8, <24:32>, 1, "Mountain", 0, "null")
  (131, 7, 8, <24:32>, 1, "Mountain", 0, "null", "NNP" 1.0000)
  (134, 7, 8, <24:32>, 1, "mountain", 0, "null")
  (91, 8, 10, <33:38>, 1, "View.", 0, "null")
  (127, 8, 10, <33:38>, 1, "View.", 0, "null", "NNP" 0.9591)
  (136, 8, 10, <33:38>, 1, "view.", 0, "null")
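
The following toy sketch in Python illustrates one of the rewrites involved, splitting an internal token after an intra-word dash; the actual token mapping rules are chart mapping rules distributed with the grammar, and the vertex numbers here are only meant to mirror the listing. Note how, just as in the listing, both halves keep the character span of the original token, while a new chart vertex is introduced between them:

  def split_after_dash(token, new_vertex):
      """Split a token dict after its first dash, inserting a new vertex."""
      form = token["form"]
      cut = form.find("-")
      if cut < 0 or cut == len(form) - 1:
          return [token]                        # nothing to split
      left = dict(token, end=new_vertex, form=form[:cut + 1])
      right = dict(token, start=new_vertex, form=form[cut + 1:])
      return [left, right]

  well_kept = {"start": 5, "end": 7, "span": (14, 23), "form": "well-kept"}
  for t in split_after_dash(well_kept, new_vertex=6):
      print(t)
  # {'start': 5, 'end': 6, 'span': (14, 23), 'form': 'well-'}
  # {'start': 6, 'end': 7, 'span': (14, 23), 'form': 'kept'}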

Lexical Tokenization

Finally, at level (c), what we call lexical tokenization is simply the segmentation between successfully instantiated lexical items. In the following, we show a subset of the lexical items in the parser chart after the completion of lexical instantiation (i.e. lookup by the surface strings associated with internal tokens, including instantiation of multi-word lexical entries), lexical parsing (application of lexical rules until a fixpoint is reached), and lexical filtering (the pruning of duplicate or otherwise undesirable entries from the chart).

  (849 w_sqleft_plr 0 0 1
   (694 w_hyphen_plr 0 0 1
    (129 sun_n1 0 0 1
     ("‘sun-" 
      114 
      "[ +FROM \"0\" +TO \"11\" 
         +ID [ LIST [ FIRST \"1\" REST [ FIRST \"2\" REST #1 ] ]
               LAST #1 ] ... ]"))))
  (1016 w_comma-nf_plr 0 1 2
   (845 w_sqright_plr 0 1 2
    (521 v_pas_odlr 0 1 2
     (173 fill_v1 0 1 2
      ("filled’," 
       102
       "[ +FROM \"1\" +TO \"13\" 
          +ID [ LIST [ FIRST \"2\" REST [ FIRST \"3\" REST [ FIRST \"4\" REST #1 ] ] ]
                LAST #1 ] ... ]")))))
  (457 well_kept_a1 0 2 4
   ("well- kept"
    96
    "[ +FROM \"14\" +TO \"23\" 
       +ID [ LIST [ REST #1 FIRST \"5\" ] LAST #1 ] ... ]"
    104
    "[ +FROM \"14\" +TO \"23\"
       +ID [ LIST [ FIRST \"5\" REST #1 ] LAST #1 ] ... ]"))
  (699 w_period_plr 0 4 6
   (529 n_sg_ilr 0 4 6
    (458 mtn_view_n1 0 4 6
     ("mountain view." 
      113
      "[ +FROM \"24\" +TO \"32\" 
         +ID [ LIST [ REST #1 FIRST \"6\" ] LAST #1 ] ... ]"
      115
      "[ +FROM \"33\" +TO \"38\"
         +ID [ LIST [ FIRST \"7\" REST [ FIRST \"8\" REST #1 ] ]
               LAST #1 ] ... ]"))))