[Khmer] Establish a syllable regex strategy #129
Here is what I've found as the existing in-the-wild Khmer syllable regular expressions:
Unicode (as of v13)
Microsoft
Open Forum of Cambodia
SIL Mondulkiri
Quick attempt to rewrite the above expressions with the same symbol set / syntax:
Unicode
Microsoft
Open Forum
SIL Mondulkiri
See also Issues in Khmer syllable validation
At the risk of overcomplicating matters, I'm taking a stab at updating the syllable-identification expressions in the docs to correctly match what HarfBuzz does. That branch is available here: https://github.com/n8willis/opentype-shaping-documents/blob/khmer-syllables-2/opentype-shaping-khmer.md#1-identifying-syllables-and-other-sequences

Note that HarfBuzz's current Ragel machine uses general terms ("Xgroup" and "Ygroup") for the marks/diacritics classes that require separate treatment. The two groups are almost, but not exactly, the same as the W3C-cg's "non-spacing diacritics" and "right side, spacing diacritics" classes, so I elected to use the W3C-cg names in the WIP branch. The main differences are (1) that HarfBuzz puts U+17DD in with the "right side, spacing diacritics" instead of the "non-spacing diacritics" and (2) that HarfBuzz puts U+17D3 into the "right side, spacing diacritics" class, whereas the W3C-cg documents don't include it anywhere.

In any case, those are still just classes, not the expressions to match syllables & subsyllables. The actual syllable-identification expressions under development by the W3C-cg folks take an entirely different approach from HarfBuzz's expressions, but I thought that at least aligning the classes a bit closer might reduce confusion. We shall see....
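
For anyone comparing the two class sets mechanically, here is a minimal sketch of the two differences described in this comment. It records only U+17DD and U+17D3, because those are the only code points called out here; the full class memberships are deliberately left out. The dictionary names and the `class_differences` helper are hypothetical, not taken from HarfBuzz or the W3C-cg documents.

```python
# Class labels as used in the W3C-cg documents (and in the WIP branch).
NON_SPACING = "non-spacing diacritics"
RIGHT_SPACING = "right side, spacing diacritics"

# Assumption: only the two code points discussed in this comment are listed.
# These tables are NOT complete class definitions.
HARFBUZZ = {
    0x17DD: RIGHT_SPACING,
    0x17D3: RIGHT_SPACING,
}
W3C_CG = {
    0x17DD: NON_SPACING,
    0x17D3: None,  # None means "not listed in any class"
}


def class_differences(a, b):
    """Return {code point: (class in a, class in b)} where the two disagree."""
    return {
        cp: (a.get(cp), b.get(cp))
        for cp in sorted(set(a) | set(b))
        if a.get(cp) != b.get(cp)
    }


for cp, (hb, w3c) in class_differences(HARFBUZZ, W3C_CG).items():
    print(f"U+{cp:04X}: HarfBuzz={hb!r}, W3C-cg={w3c!r}")
```

Extending both tables to the full class lists would let the same helper flag any other drift between the two sources.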
Informational update only: It looks as though the ad-hoc interest group has settled on a set of regular expressions that works well for contemporary Khmer, and has at least developed an extension that also handles Middle Khmer. However, the Middle Khmer exception set is rather large, though some shuffling and additional work could simplify it. Eventually, the effort is likely to move to a UTC issue, as well as to several Khmer-language ministries and institutions, although the concerns of some of those groups are essentially "input" (e.g., how to design software keyboards to let people type things in the correct order and avoid incorrect orders that used to result in virtually identical renderings) and "data-cleaning" (e.g., how to identify confusable strings in old documents and fix the incorrect ones to their correct orderings).
This issue asks the question "how should these docs define syllables for Khmer", which is a non-trivial question because there are several non-identical definitions from the upstream sources that are common reference points for some of the other scripts.
Basically, there is what's written in the Unicode Standard, there are Microsoft's script-shaping docs, there's the Open Forum of Cambodia, and there are also the practical examples of SIL's Mondulkiri font family and HarfBuzz. There may also be others that I don't have (or haven't found) access to (e.g., Apple, various Adobe implementations, type foundries, etc.).
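
To make concrete what a "syllable definition" means in this context, here is a deliberately simplified sketch of a regex-based syllable matcher. It is not any of the expressions from the sources listed above: it only handles a base letter, optional coeng-plus-consonant subscripts, optional dependent vowels, and optional signs, and it ignores joiners, register-shifter ordering, and every other detail the real definitions disagree on. The class names and `split_syllables` are hypothetical; the code point ranges are taken from the Unicode Khmer block.

```python
import re

# Simplified character classes (assumption: illustrative only, not any
# upstream source's classes).
CONSONANT = "[\u1780-\u17A2]"
INDEP_VOWEL = "[\u17A5-\u17B3]"           # skips deprecated U+17A3 and U+17A4
COENG = "\u17D2"                           # subscript-forming sign
DEP_VOWEL = "[\u17B6-\u17C5]"
SIGN = "[\u17C6-\u17D1\u17D3\u17DD]"      # signs/diacritics, excluding coeng

# base, then any number of coeng+consonant pairs, then vowels, then signs
SYLLABLE = re.compile(
    f"(?:{CONSONANT}|{INDEP_VOWEL})"
    f"(?:{COENG}{CONSONANT})*"
    f"{DEP_VOWEL}*"
    f"{SIGN}*"
)


def split_syllables(text):
    """Return the substrings of text that match the simplified pattern."""
    return SYLLABLE.findall(text)


# The word "Khmer": KHA, COENG, MO, SIGN AE, RO
word = "\u1781\u17D2\u1798\u17C2\u179A"
print(split_syllables(word))
```

Each upstream definition differs from a sketch like this in which classes exist, which code points belong to them, and what orderings are permitted, and that is exactly where the sources diverge.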
When the original version of the Khmer doc in this repo was added, it just took the "follow what HarfBuzz does" approach. That was only a snapshot, though, since the HarfBuzz developers have continued to improve their implementation.
It seems to be the consensus view that the existing/prior sources' regular expressions don't contain a single, perfect model that the others should all adopt. The W3C font-and-text Community Group has been having regular working calls to tackle issues for Khmer (of which syllable formulation is just one).
That group discussion seems like a good place to pin down "what we want this repo to document", although it is not yet complete, so anybody tackling a Khmer implementation today wouldn't have an easy time of it.
In the meantime, the question remains whether these docs ought to stick to the "follow HarfBuzz" approach or do something else. Considering that Allsorts is using these docs as a reference, it would seem pretty valuable to be able to say that the various FOSS shaping engines are in step, but this is an open forum, so if anyone has a better bird's-eye-view approach to suggest, please do so.