Code marked as comment due to Umlauts #134

NielsNet · 2017-09-04T06:03:54Z

There seems to be an issue with TextMate and umlauts.
As shown in angelozerr/typescript.java#203, as soon as two or more letters with an umlaut are used, the following lines of code are marked as a comment even though they should not be.
The following screenshot shows different cases where the behaviour appears.

angelozerr · 2017-09-05T00:40:54Z

I'm not sure, but perhaps the problem is that tm4e doesn't support oniguruma with UTF-16 (see the ignored tests at https://github.com/eclipse/tm4e/blob/master/org.eclipse.tm4e.core.tests/src/main/java/org/eclipse/tm4e/core/grammar/GrammarSuiteTest.java#L44).

I think we should try to implement ConvertUtf16OffsetToUtf8 https://github.com/atom/node-oniguruma/blob/master/src/onig-searcher.cc#L4 in Java at https://github.com/eclipse/tm4e/blob/master/org.eclipse.tm4e.core/src/main/java/org/eclipse/tm4e/core/internal/oniguruma/OnigSearcher.java

Any PR is welcome!
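For readers following along, here is a minimal sketch of what a UTF-16 to UTF-8 offset conversion could look like in Java. It is not a translation of the node-oniguruma function and not the code that ended up in tm4e; the class and method names are hypothetical, and it assumes the offset falls on a code point boundary of a well-formed string.

```java
final class Utf16ToUtf8Offset {

    private Utf16ToUtf8Offset() {}

    /**
     * Returns how many UTF-8 bytes are needed to encode the first
     * utf16Offset chars (UTF-16 code units) of text.
     * Hypothetical helper; assumes the offset does not split a surrogate pair.
     */
    static int utf16OffsetToUtf8(String text, int utf16Offset) {
        if (utf16Offset < 0 || utf16Offset > text.length()) {
            throw new IndexOutOfBoundsException("offset " + utf16Offset);
        }
        int bytes = 0;
        for (int i = 0; i < utf16Offset; i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                bytes += 1; // ASCII
            } else if (c < 0x800) {
                bytes += 2; // e.g. umlauts such as U+00E4
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < utf16Offset
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                bytes += 4; // surrogate pair encodes one 4-byte code point
                i++;        // skip the low surrogate
            } else {
                bytes += 3; // rest of the Basic Multilingual Plane
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        // "aäb": the char offset of 'b' is 2, its UTF-8 byte offset is 3.
        System.out.println(utf16OffsetToUtf8("a\u00e4b", 2));
    }
}
```

With two umlauts in front of a position, the char offset and the byte offset already differ by two, which is consistent with the symptom reported above.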

fabioz · 2017-09-16T13:30:29Z

I already have some utilities for converting to/from bytes/multibyte that I used in LiClipseText (https://github.com/fabioz/LiClipseText/blob/master/plugins/org.brainwy.liclipsetext.editor/src/org/brainwy/liclipsetext/editor/partitioning/Utf8WithCharLen.java).

I'll take a look at porting them to OnigString (but I will leave it up to you where to place the calls, because it's not just a one-way conversion: the result has to be converted back too).

Maybe it would be better to always use utf-8 internally and convert just the final result? Anyway, I will provide a pull request for the conversions shortly, but will leave it up to you where to do the convertUtf16OffsetToUtf8 and convertUtf8OffsetToUtf16 calls.
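To illustrate why the result has to be converted back, here is a small sketch of the reverse direction. A plain byte search stands in for the regex engine (an assumption made only for this demo), and the class and method names are hypothetical rather than taken from the pull request mentioned above. Offsets that land in the middle of a multi-byte character are rounded up to the next character boundary.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ToUtf16Offset {

    private Utf8ToUtf16Offset() {}

    /**
     * Maps a UTF-8 byte offset back to a UTF-16 char offset in text.
     * Hypothetical helper; assumes a well-formed string.
     */
    static int utf8OffsetToUtf16(String text, int utf8Offset) {
        int bytes = 0;
        int i = 0;
        while (i < text.length() && bytes < utf8Offset) {
            int cp = text.codePointAt(i);
            bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            i += Character.charCount(cp);
        }
        return i;
    }

    /** Naive byte-level search, standing in for a matcher that works on bytes. */
    static int indexOfBytes(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "\u00e4\u00f6 // comment";
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);
        int byteHit = indexOfBytes(utf8, "//".getBytes(StandardCharsets.UTF_8));
        int charHit = utf8OffsetToUtf16(line, byteHit);
        // The match starts at byte 5 but at char 3.
        System.out.println(byteHit + " -> " + charHit);
    }
}
```

If the byte offset were used directly as a char offset, the match would be reported two characters too far to the right.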

#134

angelozerr · 2017-09-19T18:02:59Z

@NielsNet could you reinstall http://download.eclipse.org/tm4e/snapshots/ and tell me if it fixes your problem. Thanks!

fabioz · 2017-09-19T18:38:16Z

@angelozerr With this change I get the same behavior I had with the approach I was using, which used utf-8 offsets for everything internally and converted only the final result from utf-16 to utf-8.

(i.e. fabioz/LiClipseText@1d8f88c: the interesting bits are Token.java and Grammar.java. Mainly, no conversion is needed internally and just the final result is converted.)

So I'm not sure... I must say I'm more inclined toward the patch which keeps everything utf-8 internally and only changes offsets when returning to grammar callers. It should be a bit faster, as it doesn't have to convert all the time internally (which may happen more often), and it makes internal handling simpler, since everything is utf-8 internally and oniguruma has to work on bytes anyway.

If you want I can provide a patch for that...
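A rough sketch of the "utf-8 internally, convert only the final result" idea: build a byte-to-char lookup table once per line, then remap each token's byte offsets with a constant-time lookup. This is only an illustration under stated assumptions and not the code in the referenced LiClipseText commit; the class, the method names, and the `{start, end}` token representation are all hypothetical.

```java
import java.util.Arrays;

final class LineOffsetTable {

    private LineOffsetTable() {}

    /**
     * Builds a table where table[b] is the UTF-16 index of the character
     * that owns UTF-8 byte b, and the last entry equals line.length().
     */
    static int[] byteToCharTable(String line) {
        // A UTF-16 code unit never needs more than 3 UTF-8 bytes on average.
        int[] table = new int[line.length() * 3 + 1];
        int b = 0;
        int i = 0;
        while (i < line.length()) {
            int cp = line.codePointAt(i);
            int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            for (int k = 0; k < len; k++) {
                table[b + k] = i;
            }
            b += len;
            i += Character.charCount(cp);
        }
        table[b] = line.length();
        return Arrays.copyOf(table, b + 1);
    }

    /** Converts {startByte, endByte} tokens into {startChar, endChar} tokens. */
    static int[][] toUtf16(int[][] byteTokens, String line) {
        int[] table = byteToCharTable(line);
        int[][] result = new int[byteTokens.length][];
        for (int t = 0; t < byteTokens.length; t++) {
            result[t] = new int[] {table[byteTokens[t][0]], table[byteTokens[t][1]]};
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "\u00e4\u00f6 x";
        // Pretend a byte-based tokenizer reported 'x' at bytes [5, 6).
        int[][] chars = toUtf16(new int[][] {{5, 6}}, line);
        System.out.println(line.substring(chars[0][0], chars[0][1]));
    }
}
```

The table costs one pass over the line, so the conversion work is paid once per line instead of once per regex search.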

angelozerr · 2017-09-19T18:59:20Z

@fabioz I have tried to translate the node oniguruma code and it seems to fix this issue and #136 (I have only tested on Windows).

I'm waiting for user feedback to know whether it fixes the encoding problem on other OSes.

"If you want I can provide a patch for that..."

I'm fine with any patch if it improves performance and fixes a problem.

Please note that I would like to consume tokenizeLine2 (#38) like VSCode does, but nothing will change for you if you wish to keep using tokenizeLine.

NielsNet · 2017-10-15T15:09:25Z

@angelozerr Sorry, I did not notice your comment. I tried to install the snapshot but it doesn't work. I get a "Cannot perform operation. Computing alternate solutions..." message and the installation fails.

angelozerr · 2017-10-15T15:29:19Z

@NielsNet please try to reinstall typescript.java from http://oss.opensagres.fr/typescript.ide/snapshots/, which installs the latest version of tm4e.

NielsNet · 2017-10-24T19:34:39Z

The problem is partially gone:

However:

Omcsesz · 2022-04-07T22:21:49Z

@angelozerr After installing the latest version of typescript.java, the problem is fully gone:

mickaelistria · 2022-04-08T07:32:29Z

Thanks, let's close then.

angelozerr mentioned this issue Sep 5, 2017

Problem with HTML coloring #136

Closed

angelozerr mentioned this issue Sep 16, 2017

Fix recursion error on OnigRegExp with unicode chars #144

Closed

fabioz mentioned this issue Sep 16, 2017

Utilities to convert to/from utf-8, utf-16. #146

Merged

angelozerr added a commit that referenced this issue Sep 19, 2017

Support UTF-16 with onigurama. See #134 (f711b76)

mickaelistria added the bug label May 3, 2018

mickaelistria closed this as completed Apr 8, 2022
