Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing Whisper Model Token Normalization #1904

Merged

Conversation

iprovalo
Copy link
Contributor

@iprovalo iprovalo commented Feb 21, 2025

Fixing Whisper Model Token Normalization UTF8 characters issues.
Adding more unit tests for the RemoveInvalidUtf8Sequences.

Issue reported here

The summary of the issue: all normalization works for the text in the result, however, it was not applied to tokens, causing JNI exception.

Fixing Whisper Model Token Normalization UTF8 characters issues.  Adding more unit tests for the RemoveInvalidUtf8Sequences.
Formatting Fixes
if (!sym_table.Contains(i)) {
continue;
class OfflineRecognizerWhisperImpl : public OfflineRecognizerImpl {
private:
Copy link
Contributor Author

@iprovalo iprovalo Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@csukuangfj this is a private method, I will move it below after you take a look at the diffs. I just converted the existing static method into this instance function. The reason is that it needs access to ApplyInverseTextNormalization().

@@ -156,7 +159,6 @@ class OfflineRecognizerWhisperImpl : public OfflineRecognizerImpl {
std::move(cross_kv.second));

auto r = Convert(results[0], symbol_table_);
r.text = ApplyInverseTextNormalization(std::move(r.text));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to Convert()

@@ -55,4 +55,77 @@ TEST(RemoveInvalidUtf8Sequences, Case1) {
EXPECT_EQ(s.size() + 4, v.size());
}


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few tests, verified the characters are being removed correctly.

Copy link
Collaborator

@csukuangfj csukuangfj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Just left a minor comment.

if (!sym_table.Contains(i)) {
continue;
class OfflineRecognizerWhisperImpl : public OfflineRecognizerImpl {
private:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move it to the end of the private section for methods in this class?

We usually put the public constructor at the beginning of a class in sherpa-onnx.

(It is a code style issue)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I initially left it at the top so that the diffs are easier to read.

Copy link
Collaborator

@csukuangfj csukuangfj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!

@csukuangfj csukuangfj merged commit 94728bf into k2-fsa:master Feb 21, 2025
164 of 211 checks passed
@iprovalo iprovalo deleted the feature/whisper_token_array_normalization branch February 21, 2025 05:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants