Access to collation and equivalence classes #26
Conversation
I hope this is the right place. I have to get around to releasing a useful version first, though.
That is, unfortunately, irrelevant, because only GNU/Linux has that, while Rust aims to work cross-platform. The binding is actually scheduled to go away in the next release. Do you know how to get the data from CLDR?
CLDR: that sounds much better than a binding. Are we supposed to use https://crates.io/crates/cldr? Is there any WIP branch where you use that or a similar crate?
Actually, as far as I can tell, the cldr crate is 100% useless. There is a WIP branch right here, called
The cldr crate is useless. It was just a first attempt at delivering the data together with the means of using it. Consider it abandoned until further notice, I guess. I haven't had any time to work on my personal projects for a long time, but I haven't completely forgotten about them...
@mrhota Thanks. It seems Unicode doesn't have such a concept of equivalence classes: https://www.revolvy.com/main/index.php?s=Unicode%20equivalence Anyway, reading a bit, it is clear that Unicode is far superior to that concept: https://unicode.org/cldr/utility/character.jsp?a=%C3%B2&B1=Show easily detects confusable characters, because Unicode can be normalized into decomposed form (e.g. ò decomposes to o plus U+0300 COMBINING GRAVE ACCENT).
So by getting the normalized form of any character you can get all the characters that are equivalent to it. Here in Spain the accented vowels are the useful cases, but it will work for any language. Do we have Unicode normalizations available in Rust?
@albfan, yes, we do have Unicode normalizations in Rust, in unicode-normalization. However, those are just the language-agnostic normalizations, while the classes are language-specific. The language-specific stuff is defined by TR#35. I would expect the data to be derived from Part 5, Collation, but I am not sure. Can you try to find the definition? The corresponding data is available in the CLDR. I am currently using the JSON exports, because they have all the references resolved, so it is much easier to work with them.
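For illustration (a minimal sketch, not code from this project): the unicode-normalization crate exposes the normal forms as iterator adapters on strings, so inspecting a character's decomposition is a one-liner.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFD (canonical decomposition) turns 'ò' into 'o' (U+006F)
    // followed by U+0300 COMBINING GRAVE ACCENT.
    for c in "ò".nfd() {
        println!("U+{:04X}", c as u32);
    }
}
```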
Great to know about unicode-normalization in Rust. I don't think there will be any success using CLDR for that. The only nearly similar concept is the "confusables" thing (but I don't know how two characters, or groups of characters, are detected as "confusable" in Unicode). For what I need, I guess forking ripgrep and enhancing its bracket parsing for my cases will be enough. I could implement a normalization preprocess for files, to detect those use cases without predefined data, and see how much it slows the searches. Thanks for all the links and related info.
@albfan, well, it depends on what those “equivalence classes” are supposed to mean. I can imagine two things:

- characters that are equal up to accents, i.e. that share the same base character once combining marks are removed, or
- characters that a given language considers equal for sorting, i.e. that collate equally.

The former can be obtained by simply converting to the (compatibility, probably) decomposed normal form and dropping the combining characters. The latter can be obtained from the collation data, which is available in CLDR. I have done the normal-form thing in a past job where we had an on-screen keyboard without accents and a matching search that ignored them. It was just the raw normalization data with a couple of tweaks, because a couple of accented characters exist on keyboards, e.g. ‘Ё’. I haven't looked at the collation algorithm at all yet, so I can't tell how hard that one would be (some accented characters are considered equal to their bases, but others are not, and it is language-dependent).
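A minimal sketch of the former approach, using the unicode-normalization crate mentioned above (an illustration, not code from this project): compatibility-decompose with NFKD, then drop the combining marks.

```rust
use unicode_normalization::{char::is_combining_mark, UnicodeNormalization};

/// Strip accents: compatibility-decompose, then drop combining marks.
fn strip_marks(s: &str) -> String {
    s.nfkd().filter(|c| !is_combining_mark(*c)).collect()
}

fn main() {
    assert_eq!(strip_marks("ò"), "o");
    assert_eq!(strip_marks("canción"), "cancion");
}
```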
Trying to add equivalence support to the Rust regex crate:
rust-lang/regex#404
I found your crate. I'm a little lost in Rust and locale support (i18n, l10n, l20n), so I hope yours is the right place.
This is only a WIP trying to retrieve equivalent characters from the locale:
get_equivalence('o') should return ['o', 'ò', 'ó', 'ö'].
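For illustration only, a rough sketch of what such an API could do if the locale data were approximated with Unicode decomposition instead of libc localedata (the function body and the scanned code-point range are assumptions, not the actual WIP code):

```rust
use unicode_normalization::{char::is_combining_mark, UnicodeNormalization};

// Illustrative only: approximates locale equivalence classes by grouping
// characters that share the same base after stripping combining marks.
fn get_equivalence(base: char) -> Vec<char> {
    // Scan ASCII + Latin-1 Supplement + Latin Extended-A/B for the demo.
    (0x20u32..0x250)
        .filter_map(char::from_u32)
        .filter(|c| {
            c.to_string()
                .nfd()
                .filter(|d| !is_combining_mark(*d))
                .eq(std::iter::once(base))
        })
        .collect()
}

fn main() {
    // Prints ['o', 'ò', 'ó', 'ô', 'õ', 'ö', ...] with this naive scan.
    println!("{:?}", get_equivalence('o'));
}
```

Note this is language-agnostic: unlike localedata, it cannot know that, say, 'ö' belongs to the class in one locale but is a distinct letter in another.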
All that info is in libc localedata
link to code on glibc
link to doc