You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I think I’ve found a small issue in fsst_compress. I added some details here and would be grateful if someone could have a look :)
Issue: The fsst_compress function (link) returns the number of compressed rows. I call it with enough space to fit the first row (7+2*inputlength), but not enough for all rows. It then returns that x rows were compressed, but the lenOut and strOut are not set for all of them.
Details: This happens when the job-creation loop in the compressSIMD (link) function has already created some jobs and thus increased curLine, but then runs out of space (i.e., budget) in line 253. In that case, these queded jobs are never processed, yet fsst_compress just returns the incremented curLine.
Impact: This makes it a bit tricky, as the return value of fsst_compress can not be used to tell if all given rows were actually compressed. Instead you would always have to have a large enough buffer or check the strOut pointers for nullptrs.
Potential Fix: I think something like the following at the end of compressSIMD should solve the issue. I am pretty sure the jobLines are soreted and we can just use the first.
Apologies for the late response. I indeed have my doubts about this code.
Breaking from the loop seems not the right response, because indeed we would not flush and compress the already buffered jobs for some lines before curLine (returning curLine is then not true, we compressed fewer). Rather, we should stop accumulating jobs and flush (goto line 292), be it with SIMD or not (line 322). Only after that flushing we should break.
Apologies for not coming back with a full fix immediately, but would be happy to review a PR.
Hello,
I think I’ve found a small issue in
fsst_compress
. I added some details here and would be grateful if someone could have a look :)Issue: The
fsst_compress
function (link) returns the number of compressed rows. I call it with enough space to fit the first row (7+2*inputlength), but not enough for all rows. It then returns thatx
rows were compressed, but thelenOut
andstrOut
are not set for all of them.Details: This happens when the job-creation loop in the
compressSIMD
(link) function has already created some jobs and thus increasedcurLine
, but then runs out of space (i.e.,budget
) in line 253. In that case, these queded jobs are never processed, yetfsst_compress
just returns the incrementedcurLine
.Impact: This makes it a bit tricky, as the return value of
fsst_compress
can not be used to tell if all given rows were actually compressed. Instead you would always have to have a large enough buffer or check thestrOut
pointers for nullptrs.Potential Fix: I think something like the following at the end of
compressSIMD
should solve the issue. I am pretty sure thejobLines
are soreted and we can just use the first.Reproduce: link
Could you confirm if this is unintended behavior? If you like the fix, I’d be happy to open a PR :)
Thanks,
Alex
The text was updated successfully, but these errors were encountered: