llm_llm suddenly causes error in handling multi-byte utf8 string #7

nyasu3w · 2025-01-27T13:34:38Z

llm_llm sends its output string in separate chunks in JSON format, but a chunk boundary can fall in the middle of a multi-byte character. When that happens, the JSON output is probably corrupted, which then causes an error.

If llm_llm gets "ガンダムについて語ってください" (Japanese for "Please talk about Gundam") as input for inference, it stops with the error below.
[W][inference][ 199]: lLaMa_->Run have error!

The result for this input is always "(snip) 作品は、1960年に発売された(snip)", and the output is split inside the character "発":
"作品は、", "196", "0年にXX", "Y売された"
(発 is a 3-byte character, e7 99 ba: XX = e7 99, Y = ba)
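
One possible way to avoid a split like this is to hold back any incomplete trailing UTF-8 sequence and prepend it to the next chunk before the text is wrapped in JSON. The following is only a minimal sketch under that assumption; `incomplete_utf8_tail` and `Utf8ChunkBuffer` are hypothetical names and are not part of llm_llm.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes at the end of `s` that start a UTF-8 sequence which is
// not finished yet (0 if `s` ends on a character boundary).
std::size_t incomplete_utf8_tail(const std::string& s) {
    const std::size_t n = s.size();
    // A lead byte of an unfinished sequence is at most 3 bytes from the end.
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) == 0x80) continue;        // continuation byte, keep looking
        std::size_t need = 1;                    // ASCII (or an invalid lead byte)
        if ((c & 0xE0) == 0xC0) need = 2;        // 110xxxxx
        else if ((c & 0xF0) == 0xE0) need = 3;   // 1110xxxx
        else if ((c & 0xF8) == 0xF0) need = 4;   // 11110xxx
        return (need > back) ? back : 0;
    }
    return 0;
}

// Hypothetical helper: collects chunks and only releases whole characters.
class Utf8ChunkBuffer {
public:
    // Appends `chunk` and returns the part that ends on a character boundary.
    std::string push(const std::string& chunk) {
        pending_ += chunk;
        const std::size_t tail = incomplete_utf8_tail(pending_);
        assert(tail <= pending_.size());
        std::string ready = pending_.substr(0, pending_.size() - tail);
        pending_.erase(0, ready.size());
        return ready;
    }

    // Returns whatever is still held back (call once at the end of the stream).
    std::string flush() {
        std::string rest;
        rest.swap(pending_);
        return rest;
    }

private:
    std::string pending_;
};
```

With the chunks from this report, `push` would return "0年に" for the third chunk and keep e7 99 back, then return "発売された" once the fourth chunk arrives.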

If JSON output is disabled, the error does not seem to occur.
The extended log follows. Ignore 6066d1; it comes from a mistake in my logging code.

[I][task_output][ 249]: send:作品は、
[I][task_output][ 251]: datalen:12
[I][task_output][ 253]: data:e4,bd,9c,e5,93,81,e3,81
[I][task_output][ 255]: data:af,6066d1
[I][task_output][ 273]: send stream
[I][task_output][ 249]: send:196
[I][task_output][ 251]: datalen:3
[I][task_output][ 273]: send stream
[I][task_output][ 249]: send:0年に��
[I][task_output][ 251]: datalen:9
[I][task_output][ 253]: data:30,e5,b9,b4,e3,81,ab,e7
[I][task_output][ 255]: data:99,6066d1
// if JSON output is enabled, the error occurs here.
[I][task_output][ 249]: send:�売された
[I][task_output][ 251]: datalen:13
[I][task_output][ 253]: data:ba,e5,a3,b2,e3,81,95,e3
[I][task_output][ 255]: data:82,6066d1
[I][task_output][ 273]: send stream

The logging code in llm_llm::task_output() looks like this:

        SLOGI("send:%s", data.c_str());   // this is the original logging 
        const char* cstr = data.c_str();
        SLOGI("datalen:%d",data.length());
        if(data.length() > 8)
            SLOGI("data:%x,%x,%x,%x,%x,%x,%x,%x",cstr[0],cstr[1],cstr[2],cstr[3],cstr[4],cstr[5],cstr[6],cstr[7]);
        if(data.length() > 8)  SLOGI("data:%x, _%x_ ",cstr[8]);  // mistake: the second %x has no argument, so it prints a stray value (the 6066d1 in the log)
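
A less error-prone way to log the bytes might be to format the whole chunk into one hex string first, so that no format specifier can end up without an argument. This is only a sketch; `hex_dump` is a hypothetical helper and does not exist in llm_llm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats every byte of `data` as two-digit hex, separated by commas.
std::string hex_dump(const std::string& data) {
    std::string out;
    char buf[4];  // two hex digits plus the terminating NUL fit easily
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are never sign-extended.
        const unsigned value = static_cast<unsigned char>(data[i]);
        const int written = std::snprintf(buf, sizeof(buf), "%02x", value);
        assert(written == 2);
        (void)written;  // silences the unused warning when NDEBUG is defined
        if (i != 0) out += ',';
        out += buf;
    }
    return out;
}
```

Assuming SLOGI accepts a printf-style format as in the code above, it could then be called as `SLOGI("data:%s", hex_dump(data).c_str());`.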
