-
Notifications
You must be signed in to change notification settings - Fork 570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing Whisper Model Token Normalization #1904
Fixing Whisper Model Token Normalization #1904
Conversation
Fixing Whisper Model Token Normalization UTF8 characters issues. Adding more unit tests for the RemoveInvalidUtf8Sequences.
Formatting Fixes
if (!sym_table.Contains(i)) { | ||
continue; | ||
class OfflineRecognizerWhisperImpl : public OfflineRecognizerImpl { | ||
private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@csukuangfj this is a private method, I will move it below after you take a look at the diffs. I just converted the existing static method into this instance function. The reason is that it needs access to ApplyInverseTextNormalization()
.
@@ -156,7 +159,6 @@ class OfflineRecognizerWhisperImpl : public OfflineRecognizerImpl { | |||
std::move(cross_kv.second)); | |||
|
|||
auto r = Convert(results[0], symbol_table_); | |||
r.text = ApplyInverseTextNormalization(std::move(r.text)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to Convert()
@@ -55,4 +55,77 @@ TEST(RemoveInvalidUtf8Sequences, Case1) { | |||
EXPECT_EQ(s.size() + 4, v.size()); | |||
} | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few tests, verified the characters are being removed correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Just left a minor comment.
if (!sym_table.Contains(i)) { | ||
continue; | ||
class OfflineRecognizerWhisperImpl : public OfflineRecognizerImpl { | ||
private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move it to the end of the private section for methods in this class?
We usually put the public constructor at the beginning of a class in sherpa-onnx.
(It is a code style issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I initially left it at the top so that the diffs are easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution!
Fixing Whisper Model Token Normalization UTF8 characters issues.
Adding more unit tests for the RemoveInvalidUtf8Sequences.
Issue reported here
The summary of the issue: all normalization works for the text in the result, however, it was not applied to tokens, causing JNI exception.