Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fwrite length for gzip output #6393

Open
wants to merge 47 commits into
base: master
Choose a base branch
from

Conversation

philippechataignon
Copy link
Contributor

@philippechataignon philippechataignon commented Aug 23, 2024

Closes #6356. Closes #5506.

This PR is an attempt to create a better gzip file with fwrite. Its an important rewrite because it includes some refactoring of actual code.

zlib

  • use Z_SYNC_FLUSH instead of Z_FINISH
  • create manual heading and write crc and len in tail
  • calc len and crc in thread and summarize in main thread
  • len in gzip specification is 32 bits and then is modulo 2 ^ 32 for uncompressed size > 4GiB

C code

  • simplify the implementation with only a #pragma omp parallel for for chunk loop and #pragma omp ordered for the writing and summarizing part.
  • Matt Dowle introduces the use of pool of buffers : the idea is generalized. The pools are created at the beginning and then uses for writing headers and rows. All the malloc occur early and no need for an header buffer.
  • Deobfuscate some part, especially if ( ) is followed by a new line, no =- or =*. Lot of work remains. Use of indent command ?
  • Remove some old debug code (msg)

Copy link

github-actions bot commented Aug 23, 2024

Comparison Plot

Generated via commit 826ab8c

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 4 minutes and 32 seconds
Installing different package versions 8 minutes and 17 seconds
Running and plotting the test cases 2 minutes and 19 seconds

@oliverfoster
Copy link

oliverfoster commented Aug 23, 2024

len in gzip specification is 32 bits and then is false for uncompressed size > 4GiB

Is it not meant to be length modulo 4GiB rather than false? The remainder of dividing by 4GiB multiplied by 4GiB?

Screenshot_20240823-194218

@philippechataignon
Copy link
Contributor Author

philippechataignon commented Aug 24, 2024

You're right and this PR version stores the modulo 2**32 as requested but its not the right size.
Note that in new versions of gzip (after 1.12), the length stored in file is no more used. See note on gzip -l on https://savannah.gnu.org/news/?id=10121

7z l # (use header)
    Date      Time    Attr         Size   Compressed  Name
                     .....   1298954309   2212092276  mtcars.csv

% gzip --version                                                                                                                                                    
gzip 1.12
% gzip -lv mtcars.csv.gz  # (decompress and takes much more time)
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 5cd79282 Aug 24 10:19          2212092276          9888888901  77.6% mtcars.csv

@philippechataignon
Copy link
Contributor Author

Put PR #5513 in this PR with new param compressLevel.

@tdhock
Copy link
Member

tdhock commented Sep 3, 2024

yes that would be great to Point to the Wiki from the first line of .ci/atime/tests.R

I would suggest keeping docs on the wiki, which is easier to update, and include screenshots/graphics.

@philippechataignon
Copy link
Contributor Author

@philippechataignon do you want to have a go at adding a atime performance regression test? Totally fine if not -- what would help at least would be to write a simple benchmark of gzipped fwrite that you think would capture the important pieces of what's changed here, does that make sense?

OK for testing regression but notice that the core of fwrite hasn't change : same buffer sizes, same number of jobs, same number of rows per job. Personally I observe similar timings that previous version.

One point of discussion : I notice that #2020 introduces a change that I never realized before this PR. By default scipen fwrite parameter uses the value of R option scipen. Like a lot of persons, I have a options(scipen = 999) in my Rprofile because I don't like sci output on screen but I never realize that fwrite was penalized. Why ? Because of the maxLineLen which can be very high and gives a lot of batches with few lines and little chunk. In fwrite, scipen has an maximum of 350 but it's a very high limit. And maxLineLen is used for determining number and sizes of fwrite chunks.

For testing impact, I have this little program :

n = 10000
ncol = 1000
dt <- data.table(i=1:n)
dt[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) as.numeric(sample(1:n, replace=T)))]
print(sessionInfo())
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=T, scipen=0))
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=T, scipen=999))

With scipen = 0

maxLineLen=61026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 73 batches of 137 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=21719798 (20 MiB), ratio=44.4%, crc=26a23b0a
Written 10000 rows in 1.569 secs using 4 threads. MaxBuffUsed=7%

With scipen = 999

maxLineLen=761026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 910 batches of 11 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=22745125 (21 MiB), ratio=46.5%, crc=26a23b0a
Written 10000 rows in 1.202 secs using 4 threads. MaxBuffUsed=0%

In last case real mean line length is ~ 5000 but estimated to 761026. Compression ratio is higher because the buffers are very little used. Surprisingly timing is better despite openmp number of threads overhead.

In my opinion, scipen parameter should be 0 in fwrite and not scipen option witch is related to digits option not present in fwrite and digits is higher in fwrite output (20 ?).

I use this little bench for scipen impact and I think it can be used for atime. I've tried to add this :

  # Issue with fwrite length for gzip output, fixed in: https://github.com/Rdatatable/data.table/pull/6393
  # No regression timing test
  "No regression fwrite" = atime::atime_test(
    N = 10^seq(2, 8),
    setup = {
      set.seed(1)
      ncol = 1000
      L <- data.table(i=1:N)
      L[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) rnorm(N))]
    },
    expr = {
      fwrite(dt, "/dev/null", compress="gzip")
    },
    Fast = "117ab45674f1e56304abca83f9f0df50ab0274be", # Close-to-last merge commit in the PR
    Slow = "e73c2c849f921cf4ef51e3809842e0fee9b9f52c"),

but I'm not sure that /dev/null is portable and if we write a real file, that's made the timing.

OK for another one to continue and test that there is not time regression.

@Anirban166
Copy link
Member

I documented the current process here,

Should we maybe (1) migrate that documentation into .ci/atime directly (2) add .ci/atime/README.md pointing to the Wiki (3) Point to the Wiki from the first line of .ci/atime/tests.R?

2 and 3 sounds good to me

yes that would be great to Point to the Wiki from the first line of .ci/atime/tests.R

Should I go ahead and make a PR for this quick addition?

I would suggest keeping docs on the wiki, which is easier to update, and include screenshots/graphics.

I agree, both for being able to include images and in case we miss out on something that other people notice, they should be able to fill in points quickly

@MichaelChirico
Copy link
Member

I notice that #2020 introduces a change that I never realized before this PR. By default scipen fwrite parameter uses the value of R option scipen.

Moved this to #6457 for further discussion, I think it's out of scope here. Thanks!

src/fwrite.c Outdated Show resolved Hide resolved
src/fwrite.c Outdated Show resolved Hide resolved
@tdhock
Copy link
Member

tdhock commented Sep 3, 2024

but I'm not sure that /dev/null is portable

this only has to run on github actions ubuntu vm, so /dev/null should be ok in principle, but I changed it to tempfile() which should be fine too.

Thanks for sharing your code for scipen benchmarking. I adapted it to get the following atime result, which indicates little to no impact on computation time, but a small constant factor increase in memory usage.

image

edit.data.table = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
out.csv <- tempfile()
issue6393 <- atime::atime_versions(
  "~/R/data.table",
  N = 2^seq(1, 20),
  pkg.edit.fun=edit.data.table,
  setup = {
    set.seed(1)
    NC = 10
    L <- data.table(i=1:N)
    L[, paste0("V", 1:NC) := replicate(NC, rnorm(N), simplify=FALSE)]
  },
  expr = {
    data.table::fwrite(L, out.csv, compress="gzip")
  },
  Fast="f339aa64c426a9cd7cf2fcb13d91fc4ed353cd31", # Parent of the first commit https://github.com/Rdatatable/data.table/commit/fcc10d73a20837d0f1ad3278ee9168473afa5ff1 in the PR https://github.com/Rdatatable/data.table/pull/6393/commits with major change to fwrite with gzip.
  PR = "117ab45674f1e56304abca83f9f0df50ab0274be") # Close-to-last merge commit in the PR.
plot(issue6393)

Copy link

codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 91.85185% with 11 lines in your changes missing coverage. Please review.

Project coverage is 98.55%. Comparing base (b48649a) to head (826ab8c).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
src/fwrite.c 91.12% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6393      +/-   ##
==========================================
- Coverage   98.62%   98.55%   -0.08%     
==========================================
  Files          79       79              
  Lines       14569    14624      +55     
==========================================
+ Hits        14368    14412      +44     
- Misses        201      212      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MichaelChirico
Copy link
Member

Thanks again @philippechataignon! I still can't say I've understood the C changes thoroughly, but they pass the existing suite and we have user reports it is working. I am happy to submit now and see if revdep checks tell us anything new. 🚀

DTPRINT(_("Allocate %zu bytes for thread_streams\n"), nth * sizeof(z_stream));
}
if (!thread_streams)
STOP(_("Failed to allocated %d bytes for threads_streams."), (int)(nth * sizeof(z_stream)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
STOP(_("Failed to allocated %d bytes for threads_streams."), (int)(nth * sizeof(z_stream)));
STOP(_("Failed to allocated %d bytes for threads_streams."), (int)(nth * sizeof(z_stream))); // # nocov

// compute zbuffSize which is the same for each thread
z_stream *stream = thread_streams;
if (init_stream(stream) != Z_OK)
STOP(_("Can't init stream structure for deflateBound"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
STOP(_("Can't init stream structure for deflateBound"));
STOP(_("Can't init stream structure for deflateBound")); // # nocov

}
z_stream *stream = thread_streams;
if (init_stream(stream) != Z_OK)
STOP(_("Can't init stream structure for writing header"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
STOP(_("Can't init stream structure for writing header"));
STOP(_("Can't init stream structure for writing header")); // # nocov

@MichaelChirico
Copy link
Member

Ah, apologies @philippechataignon I may draw this out just a wee bit longer 🙃

I think our code coverage suite was not up & running when you first posted this PR -- it's back now.

Would you mind taking a look through the report and suggesting which lines could reasonably be covered by new tests, and which should just get # nocov treatment? I've suggested a few likely # nocov candidates already.

@MichaelChirico
Copy link
Member

Happy new year @philippechataignon! Checking if you'll find time in the next few weeks to add some coverage tests to this PR. Thanks again!

@philippechataignon
Copy link
Contributor Author

Happy new year @philippechataignon! Checking if you'll find time in the next few weeks to add some coverage tests to this PR. Thanks again!

Happy new year too. Not a coverage expert code but will have a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants