Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Zinc] Infinite recursion on UTF encoding error #17542

Open
Gabriel-Darbord opened this issue Dec 18, 2024 · 10 comments
Open

[Zinc] Infinite recursion on UTF encoding error #17542

Gabriel-Darbord opened this issue Dec 18, 2024 · 10 comments

Comments

@Gabriel-Darbord
Copy link
Contributor

Bug description

  1. Attempting to encode a codepoint that is not valid in UTF throws an exception.
  2. If it is not handled, Pharo wants to log it in the PharoDebug.log file.
  3. The logger uses UTF-8 encoding and attempts to print the problematic character, looping back to step 1 (although the original exception may be different from the ZnInvalidUTF8 exception that causes the loops).

To reproduce
Minimal example that triggers the infinite recursion:

55296 asCharacter asString utf8Encoded

Expected behavior
The error logging should not be susceptible to the encoding error it is trying to log.

Additional context

Click here to show/hide the cropped top of the stack.
Context>>printDetails:
Context>>errorReportOn:
[:logger |
		logger
			nextPutAll: 'THERE_BE_DRAGONS_HERE'; cr;
		  	nextPutAll: errMsg; cr.
			"just print the error message if no context is given"
			aContext ifNotNil: [
				aContext errorReportOn: logger ].
		" write some type of separator"
		logger nextPutAll: (String new: 79 withAll: $- ); cr; cr] in SmalltalkImage>>logError:inContext: in Block: [:logger |...
[
		logStream := self openLog.
		aMonadicBlock value: logStream.
	] in SmalltalkImage>>logDuring: in Block: [...
FullBlockClosure(BlockClosure)>>ensure:
SmalltalkImage>>logDuring:
SmalltalkImage>>logError:inContext:
MorphicUIManager(UIManager)>>logError:
MorphicUIManager>>handleError:log:
MorphicUIManager(UIManager)>>handleError:
Process>>handleError:
ZnInvalidUTF8(Exception)>>unhandledErrorAction
UnhandledError>>defaultAction
UndefinedObject>>handleSignal:
Context>>handleSignal:
Context>>handleSignal:
UnhandledError(Exception)>>signal
UnhandledError class>>signalForException:
ZnInvalidUTF8(Exception)>>raiseUnhandledError
ZnInvalidUTF8(Exception)>>defaultAction
UndefinedObject>>handleSignal:
ZnInvalidUTF8(Exception)>>pass
[ :ex |
			ActiveWorld := priorWorld.
			ex pass ] in WorldMorph>>becomeActiveDuring: in Block: [ :ex |...
FullBlockClosure(BlockClosure)>>cull:
Context>>evaluateSignal:
Context>>handleSignal:
ZnInvalidUTF8(Exception)>>pass
[ :e | e pass ] in SpCodePrintItCommand(SpCodeSelectionCommand)>>evaluate:andDo: in Block: [ :e | e pass ]
[ :e |
		self announcer announce: (SpCodeEvaluationFailedAnnouncement 
			newContent: aString
			error: e).
		errorBlock value: e ] in SpCodePresenter>>evaluate:onCompileError:onError: in Block: [ :e |...
FullBlockClosure(BlockClosure)>>cull:
Context>>evaluateSignal:
Context>>handleSignal:
ZnInvalidUTF8(Exception)>>signal
ZnInvalidUTF8(Exception)>>signal:
ZnInvalidUTF8 class(Exception class)>>signal:
ZnUTF8Encoder>>error:
ZnUTF8Encoder(ZnCharacterEncoder)>>errorOutsideRange
ZnUTF8Encoder>>nextPutCodePoint:toStream:
ZnUTF8Encoder(ZnCharacterEncoder)>>nextPut:toStream:
ZnCharacterWriteStream>>nextPut:
ZnNewLineWriterStream>>nextPut:
[:v | self nextPut: v] in ZnNewLineWriterStream(Stream)>>nextPutAll: in Block: [:v | self nextPut: v]
WideString(SequenceableCollection)>>do:
ZnNewLineWriterStream(Stream)>>nextPutAll:
ZnNewLineWriterStream(WriteStream)>>nextPutAll:
Context>>printDetails:
Context>>errorReportOn:
[:logger |
		logger
			nextPutAll: 'THERE_BE_DRAGONS_HERE'; cr;
		  	nextPutAll: errMsg; cr.
			"just print the error message if no context is given"
			aContext ifNotNil: [
				aContext errorReportOn: logger ].
		" write some type of separator"
		logger nextPutAll: (String new: 79 withAll: $- ); cr; cr] in SmalltalkImage>>logError:inContext: in Block: [:logger |...
[
		logStream := self openLog.
		aMonadicBlock value: logStream.
	] in SmalltalkImage>>logDuring: in Block: [...
FullBlockClosure(BlockClosure)>>ensure:
SmalltalkImage>>logDuring:
SmalltalkImage>>logError:inContext:
MorphicUIManager(UIManager)>>logError:
MorphicUIManager>>handleError:log:
MorphicUIManager(UIManager)>>handleError:
Process>>handleError:
ZnInvalidUTF8(Exception)>>unhandledErrorAction
UnhandledError>>defaultAction
UndefinedObject>>handleSignal:
Context>>handleSignal:
Context>>handleSignal:
UnhandledError(Exception)>>signal
UnhandledError class>>signalForException:
ZnInvalidUTF8(Exception)>>raiseUnhandledError
ZnInvalidUTF8(Exception)>>defaultAction
UndefinedObject>>handleSignal:
ZnInvalidUTF8(Exception)>>pass
[ :ex |
			ActiveWorld := priorWorld.
			ex pass ] in WorldMorph>>becomeActiveDuring: in Block: [ :ex |...
FullBlockClosure(BlockClosure)>>cull:
Context>>evaluateSignal:
Context>>handleSignal:
ZnInvalidUTF8(Exception)>>pass
[ :e | e pass ] in SpCodePrintItCommand(SpCodeSelectionCommand)>>evaluate:andDo: in Block: [ :e | e pass ]
[ :e |
		self announcer announce: (SpCodeEvaluationFailedAnnouncement 
			newContent: aString
			error: e).
		errorBlock value: e ] in SpCodePresenter>>evaluate:onCompileError:onError: in Block: [ :e |...
FullBlockClosure(BlockClosure)>>cull:
Context>>evaluateSignal:
Context>>handleSignal:
ZnInvalidUTF8(Exception)>>signal
ZnInvalidUTF8(Exception)>>signal:
ZnInvalidUTF8 class(Exception class)>>signal:
ZnUTF8Encoder>>error:
ZnUTF8Encoder(ZnCharacterEncoder)>>errorOutsideRange
ZnUTF8Encoder>>nextPutCodePoint:toStream:
ZnUTF8Encoder(ZnCharacterEncoder)>>nextPut:toStream:
ZnUTF8Encoder(ZnCharacterEncoder)>>next:putAll:startingAt:toStream:
ZnUTF8Encoder>>next:putAll:startingAt:toStream:
[ :stream |
		self next: string size putAll: string startingAt: 1 toStream: stream ] in ZnUTF8Encoder(ZnCharacterEncoder)>>encodeString: in Block: [ :stream |...
ByteArray class(SequenceableCollection class)>>new:streamContents:
ByteArray class(SequenceableCollection class)>>streamContents:
ZnUTF8Encoder(ZnCharacterEncoder)>>encodeString:
WideString(String)>>encodeWith:
WideString(String)>>utf8Encoded
@svenvc
Copy link
Contributor

svenvc commented Dec 18, 2024

That is the classic 'error while printing an error' situation.

Quite annoying indeed.

One thing that might help is the recently added ZnLossyUTF8Encoder

55296 asCharacter asString encodeWith: #utf8lossy

it was originally added to read any byte sequence without erroring,
but it can also handle the writing of erroneous characters,
by outputting nothing.

Anyway that is the current behaviour. Not sure it is helpful.
Maybe it could be changed to output the Unicode replacement character in case of errors.

@noha
Copy link
Member

noha commented Dec 19, 2024

It might be also a clear sign that debuggers should not use printString but something error related that could return a safer variant

@Gabriel-Darbord
Copy link
Contributor Author

I'm looking into fixing this issue at the root.
While ZnLossyUTF8Encoder sounds interesting, writing nothing or the Unicode replacement character does not seem helpful for debugging the issue when reading the logs: we want to know the codepoint that caused the issue.

I think the solution is for the logger, created in SmalltalkImage>>#openLog, to have a safe encoder that still writes the codepoint it failed to encode, using the Unicode standard style: U+XXXX.
For reference, here's the error Python prints (it uses a different style):

'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

I'll submit a PR here first.
I'm not sure if it should go through zinc?

@svenvc
Copy link
Contributor

svenvc commented Dec 20, 2024

If it modifies Zn* classes I need to know about it, at least.

I am curious to what you want to do.

Instead of ZnLossyUTF8Encoder we could have a ZnDebuggingUTF8Encoder that splices errors in the output.

@Gabriel-Darbord
Copy link
Contributor Author

So I looked into what the actual implementation of a ZnDebuggingUTF8Encoder would look like.

  • I was hoping to just override the errorXXX methods, but the stream is always an argument, not accessible from the error methods unless using some context wizardry.
  • For encoding, only overriding nextPutCodePoint:toStream: is necessary, replacing the two calls to errorOutsideRange with a new message to print the codepoint in full.
  • For decoding, which is not part of this issue but would be nice for completeness, the nextCodePointFromStream: method is expected to return a single Character, so this would not work with the idea of representing them as U+XXXX strings (at least not easily).

So now I'm hesitating whether it's worth making a new class just for encoding during error logging, with the single method nextPutCodePoint:toStream: which copies a lot of code from the superclass.

The problem occurs during Context>>#printDetails:, where there are already a lot of guards that default to printing a simple <<error during printing>>.
However, there's no guard for the problematic statement:

string := [ (self tempsAndValuesLimitedTo: 80 indent: 2) padRightTo: 1 with: $x ]
on: Exception
do: [ errorMessage ].
stream nextPutAll: string.

Catching ZnUTF8InvalidEncoding (or Exception to be safe) and using the default 'error during printing' would be the simplest thing to do.
Otherwise, I guess we could replace invalid UTF characters with their U+XXXX strings and try again, with a final guard to finally give up for good measure.

@svenvc
Copy link
Contributor

svenvc commented Dec 24, 2024

In the feenk fork of zinc [ https://github.com/feenkcom/zinc ] in the last two commits I did some refactoring and added an experimental ZnDebuggingUTF8Encoder.

Here is how it works:

testDebuggingUTF8
	self 
		assert: (({ $A. 55296 asCharacter . $B } as: String) encodeWith: #utf8debug) utf8Decoded 
		equals: 'A<Cannot utf-8 encode code point 55296>B'

Not sure how to plug it in though.

@noha
Copy link
Member

noha commented Dec 24, 2024

Can you elaborate about why there is a fork for feenk? Especially for zinc I don't see a reason why this is necessary. Does it mean pharo will become a second class child of support?

@svenvc
Copy link
Contributor

svenvc commented Jan 7, 2025

Can you elaborate about why there is a fork for feenk? Especially for zinc I don't see a reason why this is necessary. Does it mean pharo will become a second class child of support?

Forks are a standard way to collaborate in the git (hub) world. They offer a temporary divergence that can be relatively easily consolidated later on.

This has nothing to do with forking as in going separate ways.

The feenk Zinc fork is a synced fork: it is updated regularly in both directions.

I would prefer this to be the same with Pharo as it is very hard to keep things in sync right now.

@svenvc
Copy link
Contributor

svenvc commented Jan 7, 2025

I did an experiment in Pharo 12. I loaded the new code from Zinc upstream to get the ZnDebuggingUTF8Encoder.

Then I changed two methods by switching from the 'utf8' to the 'utf8debug' encoder:

  • SmalltalkImage>>#openLog
  • SmalltalkImage>>#logStdErrorDuring:

Screenshot 2025-01-07 at 14 27 18

When I execute the offending code, 55296 asCharacter asString utf8Encoded, I get a proper error while PharoDebug.log looks like:

THERE_BE_DRAGONS_HERE
Character Unicode code point outside encoder range
7 January 2025 2:15:33.239425 pm

VM: Mac OS - arm64 - 1000 - CoInterpreter VMMaker-tonel.1 uuid: 60d66b41-80dd-0d00-bdda-516c033ccd3c Sep  2 2024
StackToRegisterMappingCogit VMMaker-tonel.1 uuid: 60d66b41-80dd-0d00-bdda-516c033ccd3c Sep  2 2024
v10.3.1 - Commit: 6cdb1e5d - Date: 2024-09-02 09:56:37 +0200

Image: Pharo12.0.0SNAPSHOT [Build information: Pharo-12.0.0+SNAPSHOT.build.1549.sha.8713046c28ea35a275cfa482fac8223084f5ea36 (64 Bit)]

ZnUTF8Encoder>>error:
	Receiver: a ZnUTF8Encoder
	Arguments and temporary variables: 
		message: 	'Character Unicode code point outside encoder range'
	Receiver's instance variables: 
		ignoreByteOrderMark: 	nil


ZnUTF8Encoder(ZnCharacterEncoder)>>errorOutsideRange
	Receiver: a ZnUTF8Encoder
	Arguments and temporary variables: 
x
	Receiver's instance variables: 
		ignoreByteOrderMark: 	nil


ZnUTF8Encoder(ZnCharacterEncoder)>>errorOutsideRange:to:
	Receiver: a ZnUTF8Encoder
	Arguments and temporary variables: 
		codePoint: 	55296
		stream: 	a WriteStream
	Receiver's instance variables: 
		ignoreByteOrderMark: 	nil


ZnUTF8Encoder>>nextPutCodePoint:toStream:
	Receiver: a ZnUTF8Encoder
	Arguments and temporary variables: 
		codePoint: 	55296
		stream: 	a WriteStream
	Receiver's instance variables: 
		ignoreByteOrderMark: 	nil


ZnUTF8Encoder(ZnCharacterEncoder)>>nextPut:toStream:
	Receiver: a ZnUTF8Encoder
	Arguments and temporary variables: 
		character: 	$<Cannot utf-8 encode code point 55296>
		stream: 	a WriteStream
	Receiver's instance variables: 
		ignoreByteOrderMark: 	nil


ZnUTF8Encoder(ZnCharacterEncoder)>>next:putAll:startingAt:toStream:
	Receiver: a ZnUTF8Encoder
	Arguments and temporary variables: 
		count: 	1
		string: 	'<Cannot utf-8 encode code point 55296>'
		offset: 	1
		stream: 	a WriteStream
		index: 	1
	Receiver's instance variables: 
		ignoreByteOrderMark: 	nil


ZnUTF8Encoder>>next:putAll:startingAt:toStream:
	Receiver: a ZnUTF8Encoder
	Arguments and temporary variables: 
		count: 	1
		string: 	'<Cannot utf-8 encode code point 55296>'
		offset: 	1
		stream: 	a WriteStream
	Receiver's instance variables: 
		ignoreByteOrderMark: 	nil

...

As you can see both the character and string are printed as <Cannot utf-8 encode code point 55296>

@Gabriel-Darbord
Copy link
Contributor Author

Nice, this gets rid of the infinite recursion.
In my opinion, <Cannot utf-8 encode code point N> is a bit too verbose, considering that it is output for every problematic character, but at least it indicates the error explicitly.
My original idea was to print the Unicode Standard code point notation instead, e.g. U+D800 for 55296 asCharacter, but someone unfamiliar with this particular encoding behavior might be confused by that.
I think you can make a PR with your changes :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants