
Access to collation and equivalence classes #26

Closed



@albfan albfan commented Nov 5, 2017

Trying to add equivalence support for rust regexp

rust-lang/regex#404

I found your crate. I'm a little lost in Rust and locale support (i18n, l10n, l20n), so I hope yours is the right place

This is only a WIP attempt to retrieve equivalence characters from the locale

get_equivalence('o') should return ['o', 'ò', 'ó', 'ö'].

All that info is in libc localedata

link to code on glibc

link to doc
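The proposed API could be sketched in Rust like this; `get_equivalence` is only this PR's proposal, and the hardcoded table is a purely illustrative stand-in for real locale data:

```rust
/// Hypothetical sketch of the proposed `get_equivalence`; the hardcoded
/// table stands in for real locale data and is not an existing API.
fn get_equivalence(c: char) -> Vec<char> {
    const CLASSES: &[&[char]] = &[
        &['o', 'ò', 'ó', 'ö'],
        &['a', 'à', 'á', 'ä'],
        &['n', 'ñ'],
    ];
    for class in CLASSES {
        if class.contains(&c) {
            return class.to_vec();
        }
    }
    // A character with no known equivalents forms a class of its own.
    vec![c]
}

fn main() {
    println!("{:?}", get_equivalence('o'));
}
```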

@jan-hudec
Collaborator

I'm a little lost in Rust and locale support (i18n, l10n, l20n), so I hope yours is the right place

I hope it will be the right place. I have to get around to releasing a useful version first though.

All that info is in libc localedata

That is, unfortunately, irrelevant, because only GNU/Linux has that, while Rust aims to work cross-platform. The binding is actually scheduled to go away in the next release.

Do you know how to get the data from CLDR?

@albfan
Author

albfan commented Nov 5, 2017

CLDR:

That sounds much better than binding

Are we supposed to use

https://crates.io/crates/cldr
https://github.com/mrhota/cldr-rs?

Is there any WIP branch where you use that or a similar crate?

I'm looking to retrieve equivalence classes from CLDR

@jan-hudec
Collaborator

Actually, as far as I can tell, the cldr crate is 100% useless.

There is a WIP branch right here, called next

@mrhota

mrhota commented Nov 5, 2017

The cldr crate is useless. It was just a start at delivering the data along with the means of using it. Consider it abandoned until further notice, I guess. I haven't had any time to work on my personal projects for a long time. But I haven't completely forgotten about them...

@albfan
Author

albfan commented Nov 5, 2017

@mrhota Thanks

It seems Unicode doesn't have such a concept of equivalence classes.

https://www.revolvy.com/main/index.php?s=Unicode%20equivalence

Anyway, reading a bit, it's clear that Unicode's model is far superior to that concept:

https://unicode.org/cldr/utility/character.jsp?a=%C3%B2&B1=Show

it easily detects confusable characters:

ò: 00F2 "LATIN SMALL LETTER O WITH GRAVE"

because in Unicode it can be decomposed into

o 006F "LATIN SMALL LETTER O"
+
 ̀ 0300 "COMBINING GRAVE ACCENT"

So from the normalized form of any character you can get all characters that are equivalent to it:

Here in Spain:

[=a=] = aá
[=e=] = eé
[=i=] = ií
[=o=] = oó
[=u=] = uúü
[=n=] = nñ

These are the useful cases here, but it will work for any language
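The idea above can be sketched in Rust; the tiny decomposition table is a hand-rolled stand-in for what a real normalization library would provide, and the helper names are mine:

```rust
/// Hand-rolled NFD mappings for a few characters, standing in for a real
/// normalization library (in Rust, the unicode-normalization crate).
fn decompose(c: char) -> Vec<char> {
    match c {
        'á' => vec!['a', '\u{0301}'], // a + combining acute
        'é' => vec!['e', '\u{0301}'],
        'ò' => vec!['o', '\u{0300}'], // o + combining grave
        'ñ' => vec!['n', '\u{0303}'], // n + combining tilde
        'ü' => vec!['u', '\u{0308}'], // u + combining diaeresis
        _ => vec![c],
    }
}

/// The base letter: decompose, then drop combining marks (U+0300..=U+036F).
fn base(c: char) -> char {
    decompose(c)
        .into_iter()
        .find(|d| !('\u{0300}'..='\u{036F}').contains(d))
        .unwrap_or(c)
}

fn main() {
    // Two characters are "equivalent" in this sense iff they share a base.
    assert_eq!(base('ñ'), base('n'));
    println!("base of ò is {}", base('ò'));
}
```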

unicode normalizations

Do we have Unicode normalizations available in Rust?

@jan-hudec
Collaborator

@albfan, yes, we do have unicode normalizations in Rust, in unicode-normalization. However, that is just the language-agnostic normalizations, while the classes are language-specific.

The language-specific stuff is defined by TR#35. I would expect the data to be derived from Part 5, Collation, but I am not sure. Can you try to find the definition?

The corresponding data is available in the CLDR. I am currently using the JSON exports, because they have all the references resolved, so it is much easier to work with them.

@albfan
Author

albfan commented Nov 5, 2017

Great to know about unicode-normalization on rust.

I think there will be no success using CLDR for that. The only nearly similar concept is the "confusables" thingy (but I don't know how two characters, or groups of characters, are detected as "confusable" in Unicode)

For what I need, I guess forking ripgrep and enhancing its bracket parsing for my cases will be enough. I could implement a normalization preprocessing pass over files to detect those use cases without predefined data and see how much it slows the searches.

Thanks for all the links and related info

@albfan albfan closed this Nov 5, 2017
@jan-hudec
Collaborator

@albfan, well, it depends on what those “equivalence classes” are supposed to mean. I can imagine two things:

  • Characters that only differ in accent, or
  • Characters that collate the same.

The former can be obtained by simply converting to the (probably compatibility) decomposed normal form and dropping the combining characters. The latter can be obtained from the collation data, which is available in CLDR.

I have done the normal-form thing in a past job where we had an on-screen keyboard without accents and a matching search that ignored them. It was just the raw normalization data with a couple of tweaks, because a couple of accented characters exist on keyboards, e.g. ‘Ё’.
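An accent-ignoring search of that kind might look like the following sketch; the mapping table is hardcoded for the demo (a real implementation would derive it from Unicode decomposition data, as described above), and both function names are mine:

```rust
/// Map a few accented characters to their base letters; a real
/// implementation would derive this table from normalization data.
fn strip_accents(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            'à' | 'á' | 'ä' => 'a',
            'è' | 'é' => 'e',
            'ì' | 'í' => 'i',
            'ò' | 'ó' | 'ö' => 'o',
            'ù' | 'ú' | 'ü' => 'u',
            'ñ' => 'n',
            _ => c,
        })
        .collect()
}

/// Accent-insensitive substring search: normalize both sides, then compare.
fn matches_ignoring_accents(haystack: &str, needle: &str) -> bool {
    strip_accents(haystack).contains(&strip_accents(needle))
}

fn main() {
    println!("{}", matches_ignoring_accents("camión", "camion"));
}
```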

I haven't looked at the collation algorithm at all yet, so I can't tell how hard that one would be (some accented characters are considered equal to their base characters, but others are not, and it is language-dependent).
