
Access to collation and equivalence classes #26

Closed



@albfan albfan commented Nov 5, 2017

Trying to add equivalence support for rust regexp

rust-lang/regex#404

I found your crate. I'm a little lost in Rust and locale support (i18n, l10n, l20n), so I hope yours is the right place

This is only a WIP attempt to retrieve equivalence characters from the locale

get_equivalence('o') should return ['o', 'ò', 'ó', 'ö'].

All that info is in libc localedata

link to code on glibc

link to doc
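The proposed API could be sketched in Rust like this; `get_equivalence` is only this PR's proposal, and the hardcoded table is a purely illustrative stand-in for real locale data:

```rust
/// Hypothetical sketch of the proposed `get_equivalence`; the hardcoded
/// table stands in for real locale data and is not an existing API.
fn get_equivalence(c: char) -> Vec<char> {
    const CLASSES: &[&[char]] = &[
        &['o', 'ò', 'ó', 'ö'],
        &['a', 'à', 'á', 'ä'],
        &['n', 'ñ'],
    ];
    for class in CLASSES {
        if class.contains(&c) {
            return class.to_vec();
        }
    }
    // A character with no known equivalents forms a class of its own.
    vec![c]
}

fn main() {
    println!("{:?}", get_equivalence('o'));
}
```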

@jan-hudec
Collaborator

I'm a little lost in Rust and locale support (i18n, l10n, l20n), so I hope yours is the right place

I hope it will be the right place. I have to get around to releasing a useful version first though.

All that info is in libc localedata

That is, unfortunately, irrelevant, because only GNU/Linux has that, while Rust aims to work cross-platform. The binding is actually scheduled to go away in the next release.

Do you know how to get the data from CLDR?

@albfan
Author

albfan commented Nov 5, 2017

CLDR:

That sounds much better than binding

Are we supposed to use

https://crates.io/crates/cldr
https://github.com/mrhota/cldr-rs?

Is there any WIP branch where you use that or a similar crate?

I'm looking to retrieve equivalence classes from CLDR

@jan-hudec
Collaborator

Actually, as far as I can tell, the cldr crate is 100% useless.

There is a WIP branch right here, called next

@mrhota

mrhota commented Nov 5, 2017

The cldr crate is useless. It was just a start at delivering the data along with the means of using it. Consider it abandoned until further notice, I guess. I haven't had any time to work on my personal projects for a long time. But I haven't completely forgotten about them...

@albfan
Author

albfan commented Nov 5, 2017

@mrhota Thanks

It seems Unicode doesn't have such a concept of equivalence classes.

https://www.revolvy.com/main/index.php?s=Unicode%20equivalence

Anyway, reading a bit, it's clear that Unicode's model is far superior to that concept:

https://unicode.org/cldr/utility/character.jsp?a=%C3%B2&B1=Show

it easily detects confusable characters:

ò: 00F2 "LATIN SMALL LETTER O WITH GRAVE"

because in Unicode it can be decomposed into

o 006F "LATIN SMALL LETTER O"
+
 ̀ 0300 "COMBINING GRAVE ACCENT"

So from the normalized form of any character you can get all characters that are equivalent to it:

Here in Spain:

[=a=] = aá
[=e=] = eé
[=i=] = ií
[=o=] = oó
[=u=] = uúü
[=n=] = nñ

These are the useful cases here, but it will work for any language
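The idea above can be sketched in Rust; the tiny decomposition table is a hand-rolled stand-in for what a real normalization library would provide, and the helper names are mine:

```rust
/// Hand-rolled NFD mappings for a few characters, standing in for a real
/// normalization library (in Rust, the unicode-normalization crate).
fn decompose(c: char) -> Vec<char> {
    match c {
        'á' => vec!['a', '\u{0301}'], // a + combining acute
        'é' => vec!['e', '\u{0301}'],
        'ò' => vec!['o', '\u{0300}'], // o + combining grave
        'ñ' => vec!['n', '\u{0303}'], // n + combining tilde
        'ü' => vec!['u', '\u{0308}'], // u + combining diaeresis
        _ => vec![c],
    }
}

/// The base letter: decompose, then drop combining marks (U+0300..=U+036F).
fn base(c: char) -> char {
    decompose(c)
        .into_iter()
        .find(|d| !('\u{0300}'..='\u{036F}').contains(d))
        .unwrap_or(c)
}

fn main() {
    // Two characters are "equivalent" in this sense iff they share a base.
    assert_eq!(base('ñ'), base('n'));
    println!("base of ò is {}", base('ò'));
}
```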

unicode normalizations

Do we have Unicode normalizations available in Rust?

@jan-hudec
Collaborator

@albfan, yes, we do have unicode normalizations in Rust, in unicode-normalization. However, that is just the language-agnostic normalizations, while the classes are language-specific.

The language-specific stuff is defined by TR#35. I would expect the data to be derived from Part 5, Collation, but I am not sure. Can you try to find the definition?

The corresponding data is available in the CLDR. I am currently using the JSON exports, because they have all the references resolved, so it is much easier to work with them.

@albfan
Author

albfan commented Nov 5, 2017

Great to know about unicode-normalization on rust.

I think there will be no success using CLDR for that. The only nearly similar concept is the "confusables" thingy (but I don't know how two characters, or groups of characters, are detected as "confusable" in Unicode)

For what I need, I guess forking ripgrep and enhancing its bracket parsing for my cases will be enough. I could implement a normalization preprocessing pass over files to detect those use cases without predefined data and see how much it slows the searches.

Thanks for all the links and related info

@albfan albfan closed this Nov 5, 2017
@jan-hudec
Collaborator

@albfan, well, it depends on what those “equivalence classes” are supposed to mean. I can imagine two things:

  • Characters that only differ in accent, or
  • Characters that collate the same.

The former can be obtained by simply converting to the (probably compatibility) decomposed normal form and dropping the combining characters. The latter can be obtained from the collation data, which is available in CLDR.

I have done the normal-form thing in a past job where we had an on-screen keyboard without accents and a matching search that ignored them. It was just the raw normalization data with a couple of tweaks, because a couple of accented characters exist on keyboards, e.g. ‘Ё’.
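An accent-ignoring search of that kind might look like the following sketch; the mapping table is hardcoded for the demo (a real implementation would derive it from Unicode decomposition data, as described above), and both function names are mine:

```rust
/// Map a few accented characters to their base letters; a real
/// implementation would derive this table from normalization data.
fn strip_accents(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            'à' | 'á' | 'ä' => 'a',
            'è' | 'é' => 'e',
            'ì' | 'í' => 'i',
            'ò' | 'ó' | 'ö' => 'o',
            'ù' | 'ú' | 'ü' => 'u',
            'ñ' => 'n',
            _ => c,
        })
        .collect()
}

/// Accent-insensitive substring search: normalize both sides, then compare.
fn matches_ignoring_accents(haystack: &str, needle: &str) -> bool {
    strip_accents(haystack).contains(&strip_accents(needle))
}

fn main() {
    println!("{}", matches_ignoring_accents("camión", "camion"));
}
```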

I haven't looked at the collation algorithm at all yet, so I can't tell how hard that one would be (some accented characters are considered equal to their base characters, but others are not, and it is language-dependent).
