Bengali / Khmer / Gujarati / Odia / Hindi regression? #126

Closed
masaccio opened this issue May 15, 2024 · 7 comments

@masaccio

I recently updated from 0.2.6 to 0.2.13 and I have some tests breaking in a package that uses wcswidth. The following test fails every check in 0.2.13 but passes in 0.2.6:

from wcwidth import wcswidth

def check(length, value):
    if wcswidth(value) == length:
        print(value, "OK")
    else:
        print(value, "FAIL")

check(10, '"আবখাজিয়া"')
check(12, '"আফগানিস্তান"')
check(10, '"আলবেনিয়া"')
check(10, '"अबख़ाज़िया"')
check(11, '"ඇෆ්ගනිස්ථානය"')
check(10, '"ඇල්බේනියාව"')
check(12, '"अफ़ग़ानिस्तान"')
check(12, '"ஆக்கானித்தான்"')
check(14, '"អាហ្វហ្គានីស្ថាន"')
check(12, '"અફગાનિસ્તાન"')
check(9, '"અલ્બેનિયા"')
check(12, '"ଆଫାଗାନିସ୍ତାନ୍"')
check(13, '"गन्धार, अश्वक"')
check(10, '"अल्बानिया"')

Aligning some ASCII text in my terminal, I believe that the check lengths are correct:

[Image: Screenshot 2024-05-15 at 15 13 55]

@masaccio
Author

Looks like 0.2.9 is where the change happened.

@jquast
Owner

jquast commented May 15, 2024

@masaccio which terminal?

@jquast
Owner

jquast commented May 15, 2024

Related issue and comment about the change: #91 (comment)

@masaccio
Author

This was iTerm2 on a Mac with TERM=xterm-256color

@masaccio
Author

This is the text I expect to see aligned - https://raw.githubusercontent.com/masaccio/compact-json/main/tests/data/test-issue-4.ref-1.json

Though in my browser it's not aligned so I don't know what the right answer is.

@jquast
Owner

jquast commented May 15, 2024

I will say that I also use iTerm2, and that it is not a great indicator of multilanguage support. I have since authored a testing and reporting tool, ucs-detect, and have published results for ~27 terminals.

The following terminals match this library's measurements for Hindi:

  • WezTerm
  • QTerminal
  • mlterm
  • Kovid Goyal's kitty

The other ~23 terminals, including iTerm2, do not. iTerm2 receives an overall "B" rating for its LANG score, while the four listed above receive A's.

Some of these are systematic errors, and I may file bug reports with the respective projects. However, scripts like Devanagari (used for Hindi) make heavy use of combining characters (category codes Mc and Mn), and strictly following the Unicode specification, as these 4 terminals and this library do, may "squeeze" the text so much that it becomes totally illegible!
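
To make the point about Mc and Mn concrete, here is a minimal sketch that uses only Python's standard `unicodedata` module to tally the general categories in one of the Hindi test words from the report. The helper name `category_counts` is made up for this illustration and is not part of wcwidth.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Tally Unicode general categories (e.g. Lo, Mc, Mn) per code point."""
    return Counter(unicodedata.category(ch) for ch in text)


# The Hindi test word from the report, without its surrounding quotes,
# written as escapes so the individual code points are visible.
word = "\u0905\u0932\u094d\u092c\u093e\u0928\u093f\u092f\u093e"

for ch in word:
    # One line per code point: code point, category, character name.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

print(dict(category_counts(word)))
```

Of the nine code points in this word, five are base letters (Lo), three are spacing combining marks (Mc), and one is a non-spacing mark (Mn), so nearly half of the string is combining characters.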

As for what you saw in the browser: I have found that browsers do not make the effort to align by column as a terminal is expected to (see screenshots in #123 (comment)).

I have written a dummy "check" function that displays a sequence in which the '|' characters should align:

import wcwidth

def check(n, phrase):  # n (the expected width) is unused here
    print('|' + (' ' * wcwidth.wcswidth(phrase)) + '|\n|' + phrase + '|\n')

And these are the results for iTerm (left) and WezTerm (right):

[Image: side-by-side terminal output, iTerm (left) and WezTerm (right)]

I don't know Devanagari well enough to say for sure, but I would say that iTerm2 appears to fail to correctly combine characters of category Mc and Mn, while WezTerm does combine them but sometimes also reduces the font size to fit their expected width, and some combining characters may also be poorly aligned.
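
As a rough illustration of why the two counts can differ, the sketch below estimates cell counts under two rules: one where spacing marks (Mc) each take a cell, and one where they take none. This is an assumption-laden approximation built on the standard library only; it is not wcwidth's algorithm, it ignores wide and other special characters, and `estimate_cells` is a hypothetical helper name.

```python
import unicodedata


def estimate_cells(text, spacing_marks_take_a_cell):
    """Crude cell estimate: Mn is always zero-width, Mc depends on the flag.

    Assumption for illustration only: every other character is one cell.
    """
    cells = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Mn":
            continue  # non-spacing marks never take a cell here
        if category == "Mc" and not spacing_marks_take_a_cell:
            continue  # strict rule: spacing marks combine into the base
        cells += 1
    return cells


# One of the quoted Hindi test strings from the report, written as escapes.
phrase = '"\u0905\u0932\u094d\u092c\u093e\u0928\u093f\u092f\u093e"'

print(estimate_cells(phrase, spacing_marks_take_a_cell=True))
print(estimate_cells(phrase, spacing_marks_take_a_cell=False))
```

Under the first rule the estimate matches the length of 10 that the original test expected for this string; under the second it is smaller, which is the kind of gap the failing checks report.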

@masaccio
Author

Thanks for the comprehensive debug. I can see I'm staring into a large rabbit hole of encodings I don't understand, so I'll step away! WezTerm does indeed agree with your library (though not when editing in vim), and that is enough for me.
