Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fsst_compress may return an incorrect number of compressed rows on AVX512 #30

Open
alexandervanrenen opened this issue Feb 13, 2025 · 1 comment

Comments

@alexandervanrenen
Copy link

Hello,

I think I’ve found a small issue in fsst_compress. I added some details here and would be grateful if someone could have a look :)

Issue: The fsst_compress function (link) returns the number of compressed rows. I call it with enough space to fit the first row (7+2*inputlength), but not enough for all rows. It then returns that x rows were compressed, but the lenOut and strOut are not set for all of them.

Details: This happens when the job-creation loop in the compressSIMD (link) function has already created some jobs and thus increased curLine, but then runs out of space (i.e., budget) in line 253. In that case, these queded jobs are never processed, yet fsst_compress just returns the incremented curLine.

Impact: This makes it a bit tricky, as the return value of fsst_compress can not be used to tell if all given rows were actually compressed. Instead you would always have to have a large enough buffer or check the strOut pointers for nullptrs.

Potential Fix: I think something like the following at the end of compressSIMD should solve the issue. I am pretty sure the jobLines are soreted and we can just use the first.

 if (batchPos != 0) {
    return jobLine[0];
 }

Reproduce: link

Could you confirm if this is unintended behavior? If you like the fix, I’d be happy to open a PR :)

Thanks,
Alex

@peterboncz
Copy link
Collaborator

peterboncz commented Feb 21, 2025

Hi Alexander:

Apologies for the late response. I indeed have my doubts about this code.

Breaking from the loop seems not the right response, because indeed we would not flush and compress the already buffered jobs for some lines before curLine (returning curLine is then not true, we compressed fewer). Rather, we should stop accumulating jobs and flush (goto line 292), be it with SIMD or not (line 322). Only after that flushing we should break.

Apologies for not coming back with a full fix immediately, but would be happy to review a PR.

best,

Peter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants