input is not valid Modified UTF-8: illegal continuation byte 0 #1894

Closed
iprovalo opened this issue Feb 18, 2025 · 10 comments
Comments

@iprovalo
Contributor

On the latest master, I am seeing this new exception:

java_vm_ext.cc:598] JNI DETECTED ERROR IN APPLICATION: input is not valid Modified UTF-8: illegal continuation byte 0
java_vm_ext.cc:598]     string: ' ?'
java_vm_ext.cc:598]     input: '0x20 0xc4'
java_vm_ext.cc:598]     in call to NewStringUTF
java_vm_ext.cc:598]     from java.lang.Object[] com.k2fsa.sherpa.onnx.OfflineRecognizer.getResult(long)

Steps to reproduce:

Run ASR with the Whisper small model, pass the language as cs (Czech), and use the utterance "check check". It reproduces almost 100% of the time. This is a regression introduced in the last 2-4 weeks.
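
For context, the two bytes in the log account for the message: 0xc4 is the lead byte of a two-byte sequence (several Czech letters start with it, for example č is 0xc4 0x8d), so it must be followed by a continuation byte of the form 10xxxxxx. Here the string ends instead, and the next byte read is the terminating 0. Plausibly the recognized text was cut off in the middle of a character. A minimal sketch of that reasoning follows; the helper names are hypothetical and are not part of sherpa-onnx:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes a UTF-8 sequence should occupy,
// judged from its lead byte alone. Returns 0 for a byte that cannot
// start a sequence (a continuation byte or an invalid lead byte).
// Simplification: overlong lead bytes such as 0xC0 are not rejected.
inline std::size_t ExpectedUtf8Length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Hypothetical helper: true if the sequence starting at index i needs
// more bytes than the string has left.
inline bool IsTruncatedAt(const std::string &s, std::size_t i) {
  std::size_t n = ExpectedUtf8Length(static_cast<unsigned char>(s[i]));
  return n != 0 && i + n > s.size();
}
```

For the input from the log, `IsTruncatedAt(std::string("\x20\xC4"), 1)` is true: the byte at index 1 announces two bytes, but only one remains.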

@iprovalo
Contributor Author

iprovalo commented Feb 20, 2025

@csukuangfj I tested some fixes locally, like this:

std::string text_str = isValidUtf8(result.text) ? result.text : sanitizeUtf8(result.text);
jstring text = env->NewStringUTF(text_str.c_str());

and added my own implementations of the new methods (isValidUtf8 and sanitizeUtf8). But I think all the normalization is supposed to happen in text-utils.cc.
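
The local implementations are not shown in this thread. As an illustration only, a validity check of this kind could look like the sketch below. The name `IsStructurallyValidUtf8` is hypothetical, and the check is deliberately simplified: it verifies only that each lead byte is followed by the right number of continuation bytes. It does not reject overlong encodings or surrogate code points, and it does not model the ways Modified UTF-8 differs from standard UTF-8 (the encoding of U+0000 and of supplementary characters).

```cpp
#include <cstddef>
#include <string>

// Hypothetical structural check, not the implementation from this thread.
// Returns true if every lead byte is followed by the expected number of
// 10xxxxxx continuation bytes and no sequence runs past the end.
inline bool IsStructurallyValidUtf8(const std::string &s) {
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    if (c < 0x80) {
      len = 1;
    } else if ((c & 0xE0) == 0xC0) {
      len = 2;
    } else if ((c & 0xF0) == 0xE0) {
      len = 3;
    } else if ((c & 0xF8) == 0xF0) {
      len = 4;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (i + len > s.size()) return false;  // cut off by the end of the string
    for (std::size_t k = 1; k < len; ++k) {
      if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
    }
    i += len;
  }
  return true;
}
```

With this sketch, the two bytes from the log (0x20 0xc4) are reported as invalid, while a complete č (0xc4 0x8d) passes.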

@csukuangfj
Collaborator

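We have already done this in sherpa-onnx/sherpa-onnx/csrc/offline-recognizer-impl.cc:

std::string OfflineRecognizerImpl::ApplyInverseTextNormalization(
    std::string text) const {
  text = RemoveInvalidUtf8Sequences(text);

and in sherpa-onnx/sherpa-onnx/csrc/offline-recognizer-whisper-impl.h:

r.text = ApplyInverseTextNormalization(std::move(r.text));

Can you output the byte sequence of the invalid UTF-8 string?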

@iprovalo
Contributor Author

> We have already done
>
> sherpa-onnx/sherpa-onnx/csrc/offline-recognizer-impl.cc
> Lines 498 to 500 in 4801094
>
> std::string OfflineRecognizerImpl::ApplyInverseTextNormalization(
>     std::string text) const {
>   text = RemoveInvalidUtf8Sequences(text);
>
> sherpa-onnx/sherpa-onnx/csrc/offline-recognizer-whisper-impl.h
> Line 159 in 4801094
>
> r.text = ApplyInverseTextNormalization(std::move(r.text));
>
> Can you output the byte sequence of the invalid utf8 string?

This is what I get in the log: input: '0x20 0xc4'

@csukuangfj
Collaborator

So the string has only two bytes 0x20 0xc4?

We need to update our RemoveInvalidUtf8Sequences() to handle that. Would you mind creating a PR to fix RemoveInvalidUtf8Sequences() for your case?
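
To illustrate the kind of handling requested here, the sketch below drops any byte that does not begin a well-formed sequence, including a lead byte whose continuation bytes are missing at the end of the string. It is an assumption-laden example: `DropInvalidUtf8Sketch` is a hypothetical name, and this is neither the existing RemoveInvalidUtf8Sequences() in text-utils.cc nor the change that eventually went into the fix PR. Like any purely structural check, it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: copy well-formed sequences to the output and skip
// every byte that does not start one. A lead byte at the end of the string
// with missing continuation bytes is treated as invalid and dropped.
inline std::string DropInvalidUtf8Sketch(const std::string &s) {
  std::string out;
  out.reserve(s.size());
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;  // stays 0 for bytes that cannot start a sequence
    if (c < 0x80) {
      len = 1;
    } else if ((c & 0xE0) == 0xC0) {
      len = 2;
    } else if ((c & 0xF0) == 0xE0) {
      len = 3;
    } else if ((c & 0xF8) == 0xF0) {
      len = 4;
    }

    // The sequence must fit in the string and use only 10xxxxxx followers.
    bool ok = len != 0 && i + len <= s.size();
    for (std::size_t k = 1; ok && k < len; ++k) {
      ok = (static_cast<unsigned char>(s[i + k]) & 0xC0) == 0x80;
    }

    if (ok) {
      out.append(s, i, len);
      i += len;
    } else {
      ++i;  // drop one offending byte and resynchronize on the next one
    }
  }
  return out;
}
```

Applied to the bytes from the log, `DropInvalidUtf8Sketch("\x20\xC4")` returns a single space, which NewStringUTF can accept. Dropping one byte at a time, rather than the whole announced length, keeps any valid characters that follow a broken lead byte.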

@iprovalo
Contributor Author

No problem, I can do that. How do I run the unit-tests for this project?

@csukuangfj
Collaborator

Thanks!


Please use

git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build
cd build
cmake -DSHERPA_ONNX_ENABLE_TESTS=ON ..
make test-utils-test

./bin/test-utils-test

Please also update text-utils-test.cc to include your case.

@iprovalo
Contributor Author

cmake -DSHERPA_ONNX_ENABLE_TESTS=ON .. ran successfully, but when I then tried to run ./bin/test-utils-test, I got this error:

zsh: no such file or directory: ./bin/test-utils-test

@iprovalo
Contributor Author

I got it working by building everything first:

make -j4 
./bin/test-utils-test

@iprovalo
Contributor Author

@csukuangfj here is the PR with a fix: #1904

@csukuangfj
Collaborator

Closing by #1904
