This document details errata that shaping engines may encounter, such as ambiguities or omissions in the existing OpenType or Unicode specification documents.
Table of Contents
- Unicode
- OpenType
- See also
This section lists errata pertaining to the Unicode Standard.
Unicode provides the Zero Width Joiner (ZWJ) and Zero Width Non-Joiner (ZWNJ) control characters so that a text sequence can "request a rendering system to have more or less of a connection between characters than they would otherwise have."
The generic examples used in the standard show how ZWJ and ZWNJ characters can affect the cursive-joining behavior or the ligature-forming behavior between two characters. However, the standard does not explicitly say whether the presence of a ZWJ or ZWNJ should influence the shaping behavior of characters that are not adjacent to the ZWJ or ZWNJ.
For example, in the sequence "a,b,ZWNJ,c,d" the ZWNJ should prevent the application of a ligature between "b" and "c" (if such a ligature lookup exists in the active font).
However, if the active font contains a contextual ligature lookup for "c,d" when preceded by "b", it is not clear whether or not the ZWNJ in the same "a,b,ZWNJ,c,d" sequence should inhibit the application of the ligature between "c" and "d".
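The two possible readings can be sketched as follows. This is only an illustration under stated assumptions: the function `ligate_cd_after_b`, its `skip_zwnj_in_context` flag, and the glyph name `c_d` are hypothetical and do not correspond to any real shaping-engine API.

```python
ZWNJ = "\u200C"

def ligate_cd_after_b(glyphs, skip_zwnj_in_context):
    """Apply a hypothetical contextual ligature: 'c,d' -> 'c_d' when preceded by 'b'.

    skip_zwnj_in_context=True models the reading in which ZWNJ only affects
    its immediate neighbors and is ignored when matching the preceding context.
    skip_zwnj_in_context=False models the reading in which ZWNJ remains
    visible in the context and therefore blocks the match.
    """
    out = []
    i = 0
    while i < len(glyphs):
        if glyphs[i:i + 2] == ["c", "d"]:
            # Build the preceding context, optionally dropping ZWNJ from it.
            before = [g for g in glyphs[:i]
                      if not (skip_zwnj_in_context and g == ZWNJ)]
            if before and before[-1] == "b":
                out.append("c_d")
                i += 2
                continue
        out.append(glyphs[i])
        i += 1
    return out

sequence = ["a", "b", ZWNJ, "c", "d"]
print(ligate_cd_after_b(sequence, skip_zwnj_in_context=True))
print(ligate_cd_after_b(sequence, skip_zwnj_in_context=False))
```

The first call forms the ligature and the second leaves the sequence unchanged, which is exactly the difference that the standard leaves unresolved.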
An "Implementation Notes" section in chapter 23.2 of the Unicode Standard says that font vendors should add ZWJ sequences to ligature lookups. For example, if the sequence "f,i" triggers the "fi" ligature, then the font should also include a lookup that triggers the "fi" ligature for "f,ZWJ,i".
However, the text of chapter 23.2 prior to the "Implementation Notes" says that ZWJ and ZWNJ "are not to be used in all cases where ligatures or cursive connections are desired; instead, they are meant only for over-riding the normal behavior of the text." That logic makes the suggested "f,ZWJ,i" ligature lookup superfluous, because it duplicates the effects of the existing "f,i" ligature lookup.
Using ZWJ within lookup patterns in the manner suggested by the "Implementation Notes" is not common practice.
This section lists errata pertaining to the OpenType specification.
The headers of the GSUB and GPOS tables include fields that contain the offsets at which other structures within the font binary are found. For example, the value of the featureVariationsOffset field indicates the byte offset at which the featureVariations structure is located.
The OpenType specification notes that featureVariationsOffset can be NULL, but the specification does not indicate whether any other offset values can also be NULL (nor, conversely, does it indicate whether NULL should be considered invalid).
In practice, other fields -- such as scriptListOffset, featureListOffset, and lookupListOffset -- may have NULL values.
In such situations, NULL is usually interpreted as meaning that the structure nominally pointed to by the offset is empty.
Furthermore, font-validation functions may write NULL into an offset field if the original value encountered was invalid.
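A minimal sketch of the common interpretation follows, in which a NULL (zero) offset is treated as "structure empty or absent" rather than as an error. The function name `read_layout_header` is hypothetical. The field layout assumed here is the GSUB/GPOS header as defined in the OpenType specification: two uint16 version fields, three Offset16 fields, and, in version 1.1, an Offset32 featureVariationsOffset.

```python
import struct

def read_layout_header(table: bytes) -> dict:
    """Parse a GSUB or GPOS header, mapping NULL (0) offsets to None."""
    major, minor, script_list, feature_list, lookup_list = struct.unpack_from(
        ">HHHHH", table, 0
    )
    offsets = {
        "scriptListOffset": script_list,
        "featureListOffset": feature_list,
        "lookupListOffset": lookup_list,
    }
    if (major, minor) >= (1, 1):
        # Version 1.1 headers append a 32-bit featureVariationsOffset.
        (offsets["featureVariationsOffset"],) = struct.unpack_from(">L", table, 10)
    # Assumption: a NULL offset means the structure is empty, so callers
    # receive None and can skip the structure instead of rejecting the font.
    return {name: (value or None) for name, value in offsets.items()}

# A version 1.0 header in which only featureListOffset is non-NULL.
header = struct.pack(">HHHHH", 1, 0, 0, 10, 0)
print(read_layout_header(header))
```

Returning None rather than 0 forces callers to handle the empty case explicitly, so a NULL offset is never dereferenced as though it pointed at the start of the table.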
The OpenType specification requires that lookups in the GSUB table must be sorted into numeric order before they are applied.
Lookups in the GPOS table, however, are not expected to be sorted first, because GPOS lookups are applied in a specified order.
Some OpenType feature tags are defined only to apply to text runs in specific scripts. Other feature tags are defined to apply to text in any script.
However, the definitions of some feature tags list a limited number of example scripts to which the feature should apply, but do not specify every supported script.
For example, the pstf (post-base forms) tag is described as required for "scripts of south and southeast Asia that have post-base forms for consonants eg: Gurmukhi, Malayalam, Khmer."
The Microsoft script-development specification for all Indic2-model scripts states parenthetically that "post-base forms have to follow below-base forms".
If this statement is taken to be a rule, it would affect the base-consonant search algorithm.
For example, in the Bengali sequence "Ka,Halant,Ba,Halant,Ya" (U+0995,U+09CD,U+09AC,U+09CD,U+09AF), "Ka" would be identified as the syllable base, with "Ba" designated a below-base form and "Ya" designated a post-base form. However, in the similar sequence "Ka,Halant,Ya,Halant,Ba" (U+0995,U+09CD,U+09AF,U+09CD,U+09AC), "Ya" would be identified as the base consonant.
Real-world Bengali texts provide counterexamples that contradict the assumption that "post-base forms follow below-base forms" is a requirement.
In other scripts, such as Telugu, the "post-base forms have to follow below-base forms" statement is, perhaps, statistically likely, but is certainly not an orthographic rule.
Consequently, it is unclear if the statement should be enforced as a rule or if it should be regarded as a suggestion, and it is unclear to what degree that answer varies between the Indic2-model scripts.
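The effect of the statement on the two Bengali sequences above can be sketched with a heavily simplified base-consonant search. The sketch is an assumption-laden illustration, not the algorithm of any particular engine: it considers only consonants (Halants are omitted), the function `find_base` and its `enforce_order` flag are hypothetical, and the below-base and post-base designations of "Ba" and "Ya" are taken from the example above.

```python
BELOW, POST, PLAIN = "below", "post", "plain"

def find_base(consonants, enforce_order):
    """Return the index of the base consonant in a simplified syllable.

    consonants: list of (name, form) pairs in logical order, where form is
    the form that the consonant could take in non-base position.
    enforce_order: if True, treat "post-base forms have to follow
    below-base forms" as a rule.
    """
    seen_below = False
    # Search backward from the end of the syllable; the first consonant is
    # the fallback base, so the loop stops before index 0.
    for i in range(len(consonants) - 1, 0, -1):
        form = consonants[i][1]
        if form == PLAIN:
            return i
        if form == POST and enforce_order and seen_below:
            # A below-base form already follows this consonant, so under
            # the strict reading it cannot be a post-base form.
            return i
        if form == BELOW:
            seen_below = True
    return 0

ka_ba_ya = [("Ka", PLAIN), ("Ba", BELOW), ("Ya", POST)]
ka_ya_ba = [("Ka", PLAIN), ("Ya", POST), ("Ba", BELOW)]
for syllable in (ka_ba_ya, ka_ya_ba):
    for enforce in (True, False):
        base = find_base(syllable, enforce_order=enforce)
        print([name for name, _ in syllable], enforce, syllable[base][0])
```

With the rule enforced, the second sequence selects "Ya" as the base. Without it, both sequences select "Ka".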
The GSUB specification says that a MultipleSubst substitution cannot be used to delete a glyph: it always substitutes at least one replacement glyph. However, some implementations allow the replacement-glyph array to be zero-length.
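The difference between the two behaviors can be sketched as follows. The function `apply_multiple_subst` and its `allow_empty` flag are hypothetical, and skipping the substitution is only one assumed way of handling a zero-length array under the strict reading; an implementation could instead reject the lookup.

```python
def apply_multiple_subst(glyphs, index, mapping, allow_empty):
    """Apply a one-to-many substitution to the glyph at the given index.

    mapping: dict from an input glyph to its list of replacement glyphs.
    allow_empty: if True, a zero-length replacement deletes the glyph
    (the permissive behavior); if False, the substitution is skipped.
    """
    replacement = mapping.get(glyphs[index])
    if replacement is None:
        return list(glyphs)  # glyph is not covered by this lookup
    if not replacement and not allow_empty:
        return list(glyphs)  # strict reading: a deletion is not permitted
    return glyphs[:index] + list(replacement) + glyphs[index + 1:]

glyphs = ["a", "b", "c"]
print(apply_multiple_subst(glyphs, 1, {"b": ["x", "y"]}, allow_empty=False))
print(apply_multiple_subst(glyphs, 1, {"b": []}, allow_empty=True))
print(apply_multiple_subst(glyphs, 1, {"b": []}, allow_empty=False))
```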
The GSUB specification allows contextual substitutions to invoke other contextual substitutions. It is unclear how implementations ought to handle certain cases of these nested lookups.
For example:
context: 'a'
  subst index 0:
    context: 'ab'
      subst index 1: 'b' → 'ab'
This nested set of substitutions could cause an infinite loop on certain input strings, if it is interpreted in a naive manner:
'[]ab' // begin at start of glyph sequence
'[a]b' // context matches
'[ab]' // nested context matches at index 0
'[aab]' // subst applies at index 1
'[a]ab' // return to parent context, uh oh!
'a[]ab' // move on to next glyph
'a[a]b' // context matches, infinite loop!
In short, if a nested contextual substitution can insert glyphs ahead of its parent contextual substitution's context, then it creates a "stack" that allows Turing-complete computation.
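The naive interpretation traced above can be simulated with a step limit to show that it fails to terminate on the input 'ab'. The function `naive_apply` and its `max_steps` guard are hypothetical; the guard exists only so that the demonstration halts and is not a recommendation for how a shaping engine should resolve the ambiguity.

```python
def naive_apply(glyphs, max_steps):
    """Naively apply the nested lookup; return (glyphs, finished).

    finished is False if the step limit was hit before the end of the
    glyph sequence was reached.
    """
    glyphs = list(glyphs)
    pos = 0
    steps = 0
    while pos < len(glyphs):
        if steps == max_steps:
            return glyphs, False
        steps += 1
        if glyphs[pos] == "a":                      # outer context: 'a'
            if glyphs[pos:pos + 2] == ["a", "b"]:   # nested context: 'ab'
                # subst index 1: 'b' -> 'ab'
                glyphs[pos + 1:pos + 2] = ["a", "b"]
        pos += 1  # move on to the next glyph, which is now the inserted 'a'
    return glyphs, True

print(naive_apply("ab", max_steps=5))
print(naive_apply("ac", max_steps=5))
```

Each step inserts one more 'a' ahead of the current position, so the sequence grows as fast as the position advances and the end of the sequence is never reached.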
The Microsoft script-development specifications say that marks should be reordered "to canonical order" (step 3 in the linked Devanagari document) in the reordering phase. However, the same step goes on to describe the operation as "Adjacent nukta and halant or nukta and vedic sign are always repositioned if necessary, so that the nukta is first."
Taken together, these statements leave it ambiguous whether only "Halant,Nukta" and "vedicsign,Nukta" sequences should be reordered by moving the "Nukta" to the beginning, or whether all sequences of marks require reordering into Unicode canonical combining class order, with "Nukta" moving to the initial position as a special case.
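The two readings can be compared with a short sketch. The function names are hypothetical, `VEDIC_SIGNS` is only an illustrative subset chosen for this example, and the input is assumed to be a run consisting solely of combining marks (no base characters).

```python
import unicodedata

NUKTA = "\u093C"   # DEVANAGARI SIGN NUKTA
HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA
# Assumption: an illustrative subset of vedic signs, not a complete list.
VEDIC_SIGNS = {"\u0951", "\u0952"}

def nukta_first_only(marks):
    """Narrow reading: only swap an adjacent halant or vedic sign with a following nukta."""
    marks = list(marks)
    for i in range(len(marks) - 1):
        if marks[i + 1] == NUKTA and (marks[i] == HALANT or marks[i] in VEDIC_SIGNS):
            marks[i], marks[i + 1] = marks[i + 1], marks[i]
    return "".join(marks)

def canonical_order(marks):
    """Broad reading: stable sort of the whole run by canonical combining class."""
    return "".join(sorted(marks, key=unicodedata.combining))

halant_nukta = HALANT + NUKTA
two_vedic_signs = "\u0951\u0952"
for run in (halant_nukta, two_vedic_signs):
    print([f"U+{ord(c):04X}" for c in nukta_first_only(run)],
          [f"U+{ord(c):04X}" for c in canonical_order(run)])
```

Both readings agree on "Halant,Nukta", but they differ on a run such as U+0951,U+0952: only the broad reading reorders it, because the two signs have different combining classes and neither is a nukta.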
When the application of a shaping operation merges two or more adjacent glyphs (for example, when two adjacent glyphs are substituted with a single ligature glyph), the OpenType specification does not dictate how shaping engines should combine (for example, merge, replace, or drop) the properties of the input glyphs to determine the properties of the output glyph.
This may result in ambiguities when a sequence of glyphs has several substitutions applied in series.
For example, when shaping Indic scripts, glyphs may be tagged for the possible application of multiple features, such as half and rkrf, which are applied serially.
HarfBuzz and Uniscribe both take the approach of retaining the properties of the first input glyph in a sequence, propagating those properties to the merged output glyph.
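The first-glyph approach can be sketched as follows. The `Glyph` class, the `ligate` function, and the glyph names are hypothetical; the sketch shows only the propagation policy described above and does not reflect the internal data structures of HarfBuzz or Uniscribe.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    name: str
    # Feature tags for which this glyph has been marked as a candidate.
    features: frozenset = frozenset()

def ligate(glyphs, start, count, ligature_name):
    """Replace `count` glyphs beginning at `start` with a single ligature glyph.

    The output glyph inherits the properties of the first input glyph;
    the properties of the remaining input glyphs are dropped.
    """
    first = glyphs[start]
    merged = Glyph(ligature_name, first.features)
    return glyphs[:start] + [merged] + glyphs[start + count:]

run = [
    Glyph("ka", frozenset({"half"})),
    Glyph("halant", frozenset({"rkrf"})),
    Glyph("ra"),
]
print(ligate(run, 0, 2, "ka_halant"))
```

Under this policy the merged glyph remains a candidate for half but not for rkrf, so the result of a later feature can depend on which input glyph carried its tag.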
Shaping engines may also want to offer explicit compatibility with Microsoft Uniscribe, for the purpose of ensuring that users' existing documents do not break. Therefore, implementors may wish to consult the Uniscribe compatibility notes.
These compatibility notes record test-driven observations about Uniscribe's behavior, and they include any behavior that is a known bug or a known deviation from specifications. Consequently, the issues raised by offering Uniscribe compatibility cannot be considered errata in the sense described above.