Hotwords fail when using the sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12 model #842

Open
jianking123 opened this issue May 8, 2024 · 11 comments

Comments

@jianking123

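After generating the Chinese hotwords file from the command line, I can find the corresponding bytes in tokens.txt, but an error occurs when the hotwords are used.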
微信图片_20240508112931
微信图片_20240508113007
微信图片_20240508113014

@jianking123
Author

any update?

@csukuangfj
Collaborator

but an error occurs when the hotwords are used

Please paste the error log.

@jianking123
Author

but an error occurs when the hotwords are used

Please paste the error log.
Uploading 日志.txt…

@jianking123
Author

15:13:55.836 SuggestManager E openApp name = com.k2fsa.sherpa.onnx
15:13:55.984 Perf I Connecting to perf service.
15:13:55.997 FeatureParser I can't find dipper.xml in assets/device_features/,it may be in /system/etc/device_features
15:13:56.008 libc E Access denied finding property "ro.vendor.df.effect.conflict"
15:13:56.013 Perf E Fail to get file list com.k2fsa.sherpa.onnx
15:13:56.013 Perf E getFolderSize() : Exception_1 = java.lang.NullPointerException: Attempt to get length of null array
15:13:56.055 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed
15:13:56.057 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed
15:13:56.062 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/Token;->(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;I)V (greylist, linking, allowed)
15:13:56.062 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/InterceptorProxy;->getWorkThread()Landroid/os/HandlerThread; (greylist, linking, allowed)
15:13:56.062 ViewCo...actory D initViewContentFetcherClass
15:13:56.062 ViewCo...actory D getInterceptorPackageInfo
15:13:56.062 fsa.sherpa.onn W Accessing hidden method Landroid/app/AppGlobals;->getInitialApplication()Landroid/app/Application; (greylist, linking, allowed)
15:13:56.063 ViewCo...actory D getInitialApplication took 0ms
15:13:56.063 ViewCo...actory D packageInfo.packageName: com.miui.catcherpatch
15:13:56.070 ViewCo...actory D initViewContentFetcherClass took 7ms
15:13:56.070 ContentCatcher I ViewContentFetcher : ViewContentFetcher
15:13:56.070 ViewCo...actory D createInterceptor took 7ms
15:13:56.070 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/ContentCatcherManager;->getInstance()Lmiui/contentcatcher/sdk/ContentCatcherManager; (greylist, linking, allowed)
15:13:56.070 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/ContentCatcherManager;->registerContentInjector(Lmiui/contentcatcher/sdk/Token;Lmiui/contentcatcher/sdk/injector/IContentDecorateCallback;)V (greylist, linking, allowed)
15:13:56.072 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/ContentCatcherManager;->getPageConfig(Lmiui/contentcatcher/sdk/Token;)Lmiui/contentcatcher/sdk/data/PageConfig; (greylist, linking, allowed)
15:13:56.072 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/data/PageConfig;->getFeatures()Ljava/util/ArrayList; (greylist, linking, allowed)
15:13:56.072 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/data/PageConfig;->getCatchers()Ljava/util/ArrayList; (greylist, linking, allowed)
15:13:56.079 fsa.sherpa.onn W Accessing hidden method Landroid/view/View;->computeFitSystemWindows(Landroid/graphics/Rect;Landroid/graphics/Rect;)Z (greylist, reflection, allowed)
15:13:56.080 fsa.sherpa.onn W Accessing hidden method Landroid/view/ViewGroup;->makeOptionalFitsSystemWindows()V (greylist, reflection, allowed)
15:13:56.089 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed
15:13:56.092 chatty I uid=10200(com.k2fsa.sherpa.onnx) identical 1 line
15:13:56.093 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed
15:13:56.101 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed
15:13:56.122 sherpa-onnx I Start to initialize model
15:13:56.122 sherpa-onnx I Select model type 11
15:13:56.139 Online...Config I OnlineRecognizerConfig: OnlineRecognizerConfig(featConfig=FeatureConfig(sampleRate=16000, featureDim=80), modelConfig=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx, decoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx, joiner=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.int8.onnx), paraformer=OnlineParaformerModelConfig(encoder=, decoder=), zipformer2Ctc=OnlineZipformer2CtcModelConfig(model=), tokens=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt, numThreads=1, debug=false, provider=cpu, modelType=zipformer2), lmConfig=OnlineLMConfig(model=, scale=0.5), ctcFstDecoderConfig=OnlineCtcFstDecoderConfig(graph=, maxActive=3000), endpointConfig=EndpointConfig(rule1=EndpointRule(mustContainNonSilence=false, minTrailingSilence=2.4, minUtteranceLength=0.0), rule2=EndpointRule(mustContainNonSilence=true, minTrailingSilence=1.4, minUtteranceLength=0.0), rule3=EndpointRule(mustContainNonSilence=false, minTrailingSilence=0.0, minUtteranceLength=20.0)), enableEndpoint=true, decodingMethod=modified_beam_search, maxActivePaths=4, hotwordsFile=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/hotwords_mix_b.txt, hotwordsScore=4.5)
15:13:56.139 sherpa-onnx W config:
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx", decoder="sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx", joiner="sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.int8.onnx"), paraformer=OnlineParaformerModelConfig(encoder="", decoder=""), wenet_ctc=OnlineWenetCtcModelConfig(model="", chunk_size=16, num_left_chunks=4), zipformer2_ctc=OnlineZipformer2CtcModelConfig(model=""), tokens="sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt", num_threads=1, warm_up=0, debug=False, provider="cpu", model_type="zipformer2"), lm_config=OnlineLMConfig(model="", scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointR
15:13:56.671 SuggestManager E openApp name = com.k2fsa.sherpa.onnx
15:13:57.049 libc E Access denied finding property "ro.hardware.chipname"
15:14:02.692 sherpa-onnx W Cannot find ID for token <0xE7> at line: ▁ 频 <0xE7> <0xB9> <0x81>. (Hint: words on the same line are separated by spaces)
15:14:02.692 sherpa-onnx W Encode hotwords failed.
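The warning above names only the first token that could not be mapped to an ID. As a rough debugging aid, the sketch below checks every token on an encoded hotword line against a token table and decodes any `<0xNN>` byte pieces back to text. It is not part of sherpa-onnx: the helper names `load_token_table` and `check_hotword_line` are hypothetical, the table is a toy one, and the code assumes tokens.txt holds one `symbol id` pair per line separated by whitespace.

```python
import re

# Matches byte-fallback pieces such as <0xE7>.
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def load_token_table(lines):
    """Parse tokens.txt-style lines ("symbol id") into a dict (assumed format)."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            table[parts[0]] = int(parts[1])
    return table


def check_hotword_line(line, table):
    """Return (tokens missing from the table, text decoded from byte pieces)."""
    pieces = line.split()
    missing = [p for p in pieces if p not in table]
    raw = bytearray()
    for p in pieces:
        m = BYTE_PIECE.match(p)
        if m:
            raw.append(int(m.group(1), 16))
    return missing, raw.decode("utf-8", errors="replace")


# Toy table for illustration only; a real one would be read from tokens.txt.
toy_tokens = ["<blk> 0", "▁ 1", "频 2"]
table = load_token_table(toy_tokens)

missing, decoded = check_hotword_line("▁ 频 <0xE7> <0xB9> <0x81>", table)
print(missing)  # tokens with no ID in the table
print(decoded)  # the character the byte pieces stand for
```

With the toy table, all three byte pieces are reported as missing, and they decode to 繁 (UTF-8 bytes E7 B9 81). In other words, the second character of the hotword was encoded with byte fallback, and those byte tokens were not found in the token table the recognizer loaded.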

@pkufool
Contributor

pkufool commented May 15, 2024

The fix is in #828; I will merge it as soon as possible.

@lxp3

lxp3 commented Oct 28, 2024

It appears that simple-sentencepiece is unable to tokenize UTF-8 strings (BBPE CJK) correctly.

Python Example 1: Google's sentencepiece works fine. This code successfully produces the expected BPE tokens: ['▁ƋţŅ', '▁ƌŋţ', '▁ƌĭĺ', '▁ƋŠŒ'] with token IDs [6, 24, 433, 693].

from byte_utils import byte_encode
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
rc = sp.load("onnx/bbpe.model")
s = "你 好 北 京"
s_utf8 = byte_encode(s) #  ƋţŅ ƌŋţ ƌĭĺ ƋŠŒ
pieces = sp.encode(s_utf8 , out_type=str) # ['▁ƋţŅ', '▁ƌŋţ', '▁ƌĭĺ', '▁ƋŠŒ']

Python Example 2: simple-sentencepiece causes a segmentation fault (core dumped).

from byte_utils import byte_encode
from ssentencepiece import Ssentencepiece # pip install simple-sentencepiece
ssp = Ssentencepiece("onnx/tokens.txt")
s = "你 好 北 京"
s_utf8 = byte_encode(s) #  ƋţŅ ƌŋţ ƌĭĺ ƋŠŒ
pieces = ssp.encode(s_utf8, out_type=str) # raises a segmentation fault (core dumped)

On the C++ side, in sherpa-onnx, a core dump also occurs when executing bpe_encode().
Our simple workaround is to use Google's library, "sentencepiece::SentencePieceProcessor *bpe_encoder", as the encoder instead of "ssentencepiece::Ssentencepiece *bpe_encoder".

@csukuangfj
Collaborator

@pkufool please have a look.

@pkufool
Contributor

pkufool commented Nov 26, 2024

@lxp3 Sorry, I missed this issue. I tried your example and it runs normally. I think you may have passed the wrong file to Ssentencepiece: it expects bbpe.vocab, not tokens.txt. See scripts/export_bpe_vocab.py for how to export a bpe.vocab from a bpe.model.

@pkufool
Contributor

pkufool commented Nov 26, 2024

from byte_utils import byte_encode
from ssentencepiece import Ssentencepiece # pip install simple-sentencepiece
ssp = Ssentencepiece("bbpe.vocab")  # python scripts/export_bpe_vocab.py --bpe-model bpe.model
s = "你 好 北 京"
s_utf8 = byte_encode(s) #  ƋţŅ ƌŋţ ƌĭĺ ƋŠŒ
pieces = ssp.encode(s_utf8, out_type=str) # ['▁Ƌţ', 'Ņ',  '▁ƌŋţ', '▁ƌĭĺ', '▁ƋŠŒ']  I use a 500-class bbpe trained on aishell texts

@pkufool
Contributor

pkufool commented Nov 26, 2024

@lxp3 I also tried tokens.txt. It runs without crashing as well, although the output is incorrect. Can you share your bpe.model?

@lxp3

lxp3 commented Nov 26, 2024

@pkufool Glad to see your reply. Here are the tokens.txt and bpe.model I used; both are in the zip file.
test_bpe.zip
