Bengali / Khmer / Gujarati / Odia / Hindi regression? #126

masaccio · 2024-05-15T14:14:31Z

I recently updated from 0.2.6 to 0.2.13 and I have some tests breaking in a package that uses wcswidth. The following test fails every check in 0.2.13 but passes in 0.2.6:

from wcwidth import wcswidth

def check(length, value):
    if wcswidth(value) == length:
        print(value, "OK")
    else:
        print(value, "FAIL")

check(10, '"আবখাজিয়া"')
check(12, '"আফগানিস্তান"')
check(10, '"আলবেনিয়া"')
check(10, '"अबख़ाज़िया"')
check(11, '"ඇෆ්ගනිස්ථානය"')
check(10, '"ඇල්බේනියාව"')
check(12, '"अफ़ग़ानिस्तान"')
check(12, '"ஆக்கானித்தான்"')
check(14, '"អាហ្វហ្គានីស្ថាន"')
check(12, '"અફગાનિસ્તાન"')
check(9, '"અલ્બેનિયા"')
check(12, '"ଆଫାଗାନିସ୍ତାନ୍"')
check(13, '"गन्धार, अश्वक"')
check(10, '"अल्बानिया"')

Aligning some ASCII text in my terminal, I believe that the check lengths are correct:

masaccio · 2024-05-15T14:17:19Z

Looks like 0.2.9 is where the change happened.

jquast · 2024-05-15T16:45:44Z

@masaccio which terminal?

jquast · 2024-05-15T16:48:49Z

Related issue and comment about the change: #91 (comment)

masaccio · 2024-05-15T17:17:42Z

This was iTerm2 on a Mac with TERM=xterm-256color

masaccio · 2024-05-15T17:25:25Z

This is the text I expect to see aligned - https://raw.githubusercontent.com/masaccio/compact-json/main/tests/data/test-issue-4.ref-1.json

Though in my browser it's not aligned so I don't know what the right answer is.

jquast · 2024-05-15T23:23:23Z

I will say that I also use iTerm2, and that it is not a great indicator of multilanguage support. I have since authored a testing and reporting tool, ucs-detect, and have published results for ~27 terminals.

The following terminals match this library's measurements for Hindi:

WezTerm
QTerminal
mlterm
Kovid Goyal's kitty

The other ~23 terminals, including iTerm2, do not. iTerm2 gets an overall "B" rating for its LANG score, while the ones listed above get A's.

Some of the mismatches are systematic errors, and I may create bug reports for the respective projects. However, scripts like Devanagari, used for Hindi, make very heavy use of combining characters (category codes Mc and Mn), and strictly following the Unicode specifications, as these 4 terminals and this library do, may result in so much "squeezing" that the text becomes totally illegible!
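
To illustrate the point about Mc and Mn, here is a minimal sketch using only the standard library's unicodedata module. It does not measure terminal width; it only counts the Unicode general category of each code point. The helper name category_counts and the sample word are chosen for illustration and are not part of wcwidth.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count the Unicode general category of every code point in text."""
    return Counter(unicodedata.category(ch) for ch in text)

# The word "Hindi" in Devanagari, written with escapes so the code points are explicit:
# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
word = "\u0939\u093f\u0928\u094d\u0926\u0940"
counts = category_counts(word)
print(len(word), dict(counts))
```

Half of the six code points in this short word are combining marks (two Mc, one Mn), which is why counting code points and counting terminal cells diverge so quickly for these scripts.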

As for your findings in the browser, I have found that browsers do not make the effort to align by column as a terminal is expected to (see screenshots in #123 (comment)).

I have authored a dummy "check" function to display a sequence where '|' should align:

import wcwidth

def check(n, phrase):  # n (the expected width) is not used here
    print('|' + (' ' * wcwidth.wcswidth(phrase)) + '|\n' + '|' + phrase + '|\n')
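
A variant of the same idea, as a sketch only: it returns the two rows instead of printing them and also reports the expected width next to the measured one. The name ruler and the measure parameter are hypothetical. The default measure=len is a stand-in that counts code points, not terminal cells; to reproduce the comparison in this thread, pass wcwidth.wcswidth instead.

```python
def ruler(phrase, expected, measure=len):
    """Build a two-row alignment check for phrase.

    measure is any callable that returns a column width for a string.
    The default, len, is only a placeholder (an assumption for this sketch);
    pass wcwidth.wcswidth to use the library's measurement.
    """
    measured = measure(phrase)
    # A width function may return -1 for unprintable input; pad with nothing then.
    pad = ' ' * max(measured, 0)
    return f"|{pad}| expected={expected} measured={measured}\n|{phrase}|"

print(ruler("abc", 3))
```

If the terminal agrees with the measurement, the closing '|' of both rows lands in the same column.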

And these are the results for iTerm (left) and WezTerm (right)

I don't know Devanagari well enough to say for sure, but I would say that iTerm2 appears to fail to correctly combine characters of category Mc and Mn, while WezTerm does combine them but also sometimes reduces the font size to accommodate their expected width, and maybe some combining characters are also poorly aligned.

masaccio · 2024-05-16T07:29:03Z

Thanks for the comprehensive debug. I can see I'm staring into a large rabbit hole of encodings I don't understand, so I'll step away! WezTerm does indeed agree with your library (though not when editing in vim) and that is enough for me.

masaccio closed this as completed May 16, 2024
