Hotwords behave abnormally when using the sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12 model #842
Comments
any update? |
Please post the error log. |
15:13:55.836 SuggestManager E openApp name = com.k2fsa.sherpa.onnx |
The fix is in #828; we will merge it as soon as possible. |
It appears that simple-sentencepiece is unable to tokenize UTF-8 strings (BBPE CJK) correctly. Python Example 1: Google's sentencepiece works fine. This code successfully produces the expected BPE tokens: ['▁ƋţŅ', '▁ƌŋţ', '▁ƌĭĺ', '▁ƋŠŒ'] with token IDs [6, 24, 433, 693].
Python Example 2: simple-sentencepiece causes a segmentation fault (core dumped).
On the C++ side, in sherpa-onnx, a core dump error also occurs when executing bpe_encode(). |
@pkufool please have a look. |
@lxp3 Sorry, I missed this issue. I tried your example, it runs normally. I think you might pass a wrong file to Ssentencepiece, it is not |
from byte_utils import byte_encode
from ssentencepiece import Ssentencepiece # pip install simple-sentencepiece
ssp = Ssentencepiece("bbpe.vocab") # python scripts/export_bpe_vocab.py --bpe-model bpe.model
s = "你 好 北 京"
s_utf8 = byte_encode(s) # ƋţŅ ƌŋţ ƌĭĺ ƋŠŒ
pieces = ssp.encode(s_utf8, out_type=str) # ['▁Ƌţ', 'Ņ', '▁ƌŋţ', '▁ƌĭĺ', '▁ƋŠŒ']
I use a 500-class BBPE model trained on AISHELL texts. |
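For readers unfamiliar with byte-level BPE, the snippet above relies on `byte_encode` turning every UTF-8 byte into one printable character, so each Chinese character becomes three symbols before BPE is applied. The following is only a minimal sketch of that idea: `toy_byte_encode`, `toy_byte_decode`, and the `OFFSET` mapping are hypothetical stand-ins, not the table used by the real `byte_utils`, so the characters it produces will not match `ƋţŅ` and friends.

```python
# Toy illustration of byte-level encoding for BBPE.
# ASSUMPTION: the mapping below (byte b -> chr(0x100 + b)) is made up for
# illustration; the real byte_utils module uses its own lookup table.
OFFSET = 0x100


def toy_byte_encode(text: str) -> str:
    """Map each UTF-8 byte of every whitespace-separated word to one character."""
    return " ".join(
        "".join(chr(OFFSET + b) for b in word.encode("utf-8"))
        for word in text.split()
    )


def toy_byte_decode(encoded: str) -> str:
    """Invert toy_byte_encode by turning each character back into one byte."""
    return " ".join(
        bytes(ord(c) - OFFSET for c in word).decode("utf-8")
        for word in encoded.split()
    )


encoded = toy_byte_encode("你 好 北 京")
# Each CJK character here occupies 3 bytes in UTF-8, hence 3 symbols per word.
symbols_per_word = [len(word) for word in encoded.split()]
decoded = toy_byte_decode(encoded)
```

Because the BPE model only ever sees these byte symbols, a piece such as `Ņ` on its own corresponds to a single byte, not to a whole character.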
@lxp3 I also tried |
@pkufool Glad to see your reply. There are the tokens.txt and bpe.model I used. Both are in the zip file. |
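After generating the hotwords file for Chinese characters from the command line, the corresponding bytes can be found in tokens.txt, but an error occurs when the hotwords are used.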
![微信图片_20240508112931](https://private-user-images.githubusercontent.com/166801697/328745500-a391aab4-3442-48ee-819e-8e80c56e3e39.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzNzM4ODMsIm5iZiI6MTczOTM3MzU4MywicGF0aCI6Ii8xNjY4MDE2OTcvMzI4NzQ1NTAwLWEzOTFhYWI0LTM0NDItNDhlZS04MTllLThlODBjNTZlM2UzOS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMlQxNTE5NDNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1iNjFkOTM1ZjVlOTcxYjlkMWZiZWNjOWQ5ZGM5MDVkODA0MDFjZGY5NGQzYWY1ODcyNTg2OWNiZDdkZjAwY2ZkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.wb-44w1MiBGsOjpH91kiOpr8Kz-VfkkpD24C2Q_FbsA)
![微信图片_20240508113007](https://private-user-images.githubusercontent.com/166801697/328745511-cafcb26b-1f6d-4666-a070-28d037f1e2a1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzNzM4ODMsIm5iZiI6MTczOTM3MzU4MywicGF0aCI6Ii8xNjY4MDE2OTcvMzI4NzQ1NTExLWNhZmNiMjZiLTFmNmQtNDY2Ni1hMDcwLTI4ZDAzN2YxZTJhMS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMlQxNTE5NDNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lZTlhZTI2NjQ0ZDgyNGI5YjRjNjRjNzRlYzM1YjNlNDljMjE0MWFiNGIwMmFiOTEwM2YwY2I0NDc5ZDYxMmU2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.2g8LAjGJ4nxMZGB79ktwD7Tbo6AIqYL0Qz0b6qRX5Fc)
![微信图片_20240508113014](https://private-user-images.githubusercontent.com/166801697/328745516-c134ac7d-2183-4d6f-8682-a2b753ffab0c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzNzM4ODMsIm5iZiI6MTczOTM3MzU4MywicGF0aCI6Ii8xNjY4MDE2OTcvMzI4NzQ1NTE2LWMxMzRhYzdkLTIxODMtNGQ2Zi04NjgyLWEyYjc1M2ZmYWIwYy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMlQxNTE5NDNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jMjA0N2E4ZjNiMjlkM2FlY2ZlZTAzYjQxZmM0MzZmNTQxYzI2ZDQ0NWVkMTg1YTU5M2JhY2I1NTk4ODdiMGNmJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.hnjCQVirPFUsW2w46oqDf7nl1qUo2VuQ9-c_88kC4Fg)
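One quick sanity check for this kind of report is to confirm that every piece produced for a hotword is actually listed in the model's tokens.txt. The sketch below assumes tokens.txt holds one `token id` pair per line; `load_tokens` and `missing_pieces` are hypothetical helpers, and the inline table is a toy built from the four token IDs quoted in this thread (the `<blk>` and `<unk>` entries are placeholders). It does not reproduce the crash; it only flags pieces that come from a different vocabulary.

```python
def load_tokens(text: str) -> dict:
    """Parse tokens.txt content, assuming one '<token> <id>' pair per line."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        token, idx = line.rsplit(maxsplit=1)
        table[token] = int(idx)
    return table


def missing_pieces(pieces, table):
    """Return the pieces that have no entry in the token table."""
    return [p for p in pieces if p not in table]


# Toy table: IDs 6, 24, 433, 693 are the ones quoted in this thread;
# the <blk> and <unk> lines are made-up placeholders.
TOKENS_TXT = """\
<blk> 0
<unk> 2
▁ƋţŅ 6
▁ƌŋţ 24
▁ƌĭĺ 433
▁ƋŠŒ 693
"""

table = load_tokens(TOKENS_TXT)

# Pieces from the first model's vocabulary are all present.
ok = missing_pieces(["▁ƋţŅ", "▁ƌŋţ", "▁ƌĭĺ", "▁ƋŠŒ"], table)

# Pieces produced with a different (500-class) vocabulary are not.
bad = missing_pieces(["▁Ƌţ", "Ņ", "▁ƌŋţ", "▁ƌĭĺ", "▁ƋŠŒ"], table)
```

If the second list is non-empty for a real tokens.txt, the vocabulary used to encode the hotwords likely does not match the one the model was exported with.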