ErgTokenization

Overview

Aiming for a balance of linguistic precision and broad coverage, the [English Resource Grammar](http://www.delph-in.net/erg) (ERG) includes detailed analyses of punctuation and a wide variety of 'text-level' phenomena (e.g. various formats for temporal and numeric expressions). The grammar makes specific assumptions about tokenization, and successful application of the grammar requires that these assumptions be understood and respected. In early 2009, the ERG approach to tokenization underwent a major revision; this page aims to spell out some of the basic assumptions, the specific decisions made, and the technology used in preparing input text for parsing with the ERG.

This page was predominantly authored by StephanOepen, who developed the current ERG approach to tokenization jointly with DanFlickinger. As of early 2009, Stephan is the maintainer of the ERG tokenizer and token mapping rules. Please do not make substantial changes to this page unless you (a) are reasonably sure of the technical correctness of your revisions and (b) believe strongly that your changes are compatible with the general design and recommended use patterns of the ERG, and of course with the goals of this page.

String-Level Pre-Processing and Initial Tokenization

This section documents tokenization and a handful of other surface-level decisions. Technically speaking, when parsing with the ERG and PET (the reference setup for production use), the parser takes as its input a lattice of tokens, each a structured object (i.e. a typed feature structure). Please see the PetInput page for additional background. In this view, string-level pre-processing and initial tokenization is the process of mapping a 'flat' string into a token lattice.
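To make the token lattice concrete, the following is a minimal sketch of the kind of input PET accepts in its so-called YY mode, here for the string 'The dog barks.' (see the PetInput page for the authoritative description of this format; the identifiers, character spans, and PoS tags below are purely illustrative):

```
(1, 0, 1, <0:3>, 1, "the" "The", 0, "null", "DT" 1.00)
(2, 1, 2, <4:7>, 1, "dog" "dog", 0, "null", "NN" 0.80)
(3, 2, 3, <8:13>, 1, "barks" "barks", 0, "null", "VBZ" 0.90)
(4, 3, 4, <13:14>, 1, "." ".", 0, "null", "." 1.00)
```

Each line describes one token: an identifier, start and end vertices in the lattice, the character span in the original string, a path identifier, the normalized token form together with its original surface form, two fields that are typically left at their default values (`0, "null"`), and optionally one or more PoS tags with confidence values. Ambiguity is expressed by multiple tokens sharing lattice vertices; the parser instantiates each token as a typed feature structure before parsing.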

In the standard setup for the ERG, this task is accomplished by means of so-called REPP (Regular Expression Pre-Processor) modules, which are included with the ERG sources (in the rpp/ subdirectory); for general background on the technology, please see the ReppTop page. The REPP modules provided by the ERG can be configured in various ways to accommodate different input conventions, i.e. variation in the punctuation and markup conventions used in texts from different sources. As of mid-2010, these REPP modules have stabilized to a certain degree but remain to be documented (beyond the generous use of comments in the REPP source files). In the following, we document the normalized result of string-level pre-processing, i.e. the expected result of applying a set of REPP modules.
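For a flavor of the technology, here is a minimal, purely illustrative REPP fragment (not taken from the actual ERG modules; see the ReppTop page for the authoritative syntax). Rewrite rules are introduced by '!', with pattern and replacement separated by one or more tab characters, and the tokenization pattern is introduced by ':':

```
;; detach common punctuation marks from a preceding word
!([^ ])([.,;:?!])	\1 \2

;; break the resulting string into tokens at whitespace
:[ \t]+
```

In the actual ERG setup, a cascade of such modules (e.g. for markup removal and punctuation normalization) is applied before the tokenization pattern yields the initial token lattice.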

Token Mapping

Unknown Word Handling
