fsst_compress may return an incorrect number of compressed rows on AVX512 #30

alexandervanrenen · 2025-02-13T20:41:03Z

Hello,

I think I’ve found a small issue in fsst_compress. I added some details here and would be grateful if someone could have a look :)

Issue: The fsst_compress function (link) returns the number of compressed rows. I call it with enough space to fit the first row (7+2*inputlength), but not enough for all rows. It then returns that x rows were compressed, but the lenOut and strOut are not set for all of them.

Details: This happens when the job-creation loop in the compressSIMD (link) function has already created some jobs and thus increased curLine, but then runs out of space (i.e., budget) in line 253. In that case, these queded jobs are never processed, yet fsst_compress just returns the incremented curLine.

Impact: This makes it a bit tricky, as the return value of fsst_compress can not be used to tell if all given rows were actually compressed. Instead you would always have to have a large enough buffer or check the strOut pointers for nullptrs.

Potential Fix: I think something like the following at the end of compressSIMD should solve the issue. I am pretty sure the jobLines are soreted and we can just use the first.

 if (batchPos != 0) {
    return jobLine[0];
 }

Reproduce: link

Could you confirm if this is unintended behavior? If you like the fix, I’d be happy to open a PR :)

Thanks,
Alex

The text was updated successfully, but these errors were encountered:

peterboncz · 2025-02-21T15:58:14Z

Hi Alexander:

Apologies for the late response. I indeed have my doubts about this code.

Breaking from the loop seems not the right response, because indeed we would not flush and compress the already buffered jobs for some lines before curLine (returning curLine is then not true, we compressed fewer). Rather, we should stop accumulating jobs and flush (goto line 292), be it with SIMD or not (line 322). Only after that flushing we should break.

Apologies for not coming back with a full fix immediately, but would be happy to review a PR.

best,

Peter

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fsst_compress may return an incorrect number of compressed rows on AVX512 #30

fsst_compress may return an incorrect number of compressed rows on AVX512 #30

alexandervanrenen commented Feb 13, 2025

peterboncz commented Feb 21, 2025 •

edited

Loading

fsst_compress may return an incorrect number of compressed rows on AVX512 #30

fsst_compress may return an incorrect number of compressed rows on AVX512 #30

Comments

alexandervanrenen commented Feb 13, 2025

peterboncz commented Feb 21, 2025 • edited Loading

peterboncz commented Feb 21, 2025 •

edited

Loading