Crashes for very large dataframes #21
A stacktrace would have been helpful, as those snippets aren't the most general.
Yes, I was just getting the info down before it was lost to Slack's black hole.
NOTE: The Fannie Mae 2004Q3 data (2.7 GB) isn't what's in the gist, and I don't see a 2.7 GB file documented in the dataset. Running the code on Julia 1.0.3, JLSO is one of the slowest formats to read and write. It might be worth updating the benchmarks to account for file size, though, because by default we're compressing our data.
The last two entries are both JLSO files.
To push things a little more, I also couldn't get it to throw an error with a 9 GB CSV: https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory Fun fact: the default JLSO file format compresses the 9.1 GB file down to < 800 MB, while JLD2 only compresses it to 2.6 GB on disk. I guess this is why JLSO is still pretty handy for our use case (e.g., uploading lots of files to S3).
Oooh, a chance to use DataDepsGenerators.
Yep, can't reproduce on 1.2 either. Looks like the performance is more consistent, at least :) My best guess is that this is a Windows-specific issue, possibly with the compression library.
This is the link to the file that contains 2004Q3: http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2007.tgz In general, more data can be sourced from https://docs.rapids.ai/datasets/mortgage-data
This is the error I get in 1.3-rc2:
ERROR: InexactError: trunc(Int32, 2738917277)
Stacktrace:
[1] throw_inexacterror(::Symbol, ::Type{Int32}, ::Int64) at .\boot.jl:560
[2] checked_trunc_sint at .\boot.jl:582 [inlined]
[3] toInt32 at .\boot.jl:619 [inlined]
[4] Int32 at .\boot.jl:709 [inlined]
[5] bson_primitive at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:14 [inlined]
[6] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{UInt8,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
[7] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{UInt8,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
[8] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
[9] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
[10] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{Any,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
[11] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
[12] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Symbol, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
[13] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
[14] bson_primitive at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:36 [inlined]
[15] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
[16] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Dict{Symbol,Any}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
[17] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
[18] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
[19] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{Any,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
[20] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
[21] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Symbol, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
[22] bson_doc(::IOStream, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
[23] bson_primitive(::IOStream, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:36
[24] bson(::IOStream, ::Dict{String,Dict{String,V} where V}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:81
[25] write(::IOStream, ::JLSOFile) at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:7
[26] #save#4 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:59 [inlined]
[27] save at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:59 [inlined]
[28] #7 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61 [inlined]
[29] #open#271(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(open), ::JLSO.var"##7#8"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Tuple{DataFrames.DataFrame}}, ::String, ::Vararg{String,N} where N) at .\io.jl:298
[30] open at .\io.jl:296 [inlined]
[31] #save#6 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61 [inlined]
[32] save(::String, ::DataFrames.DataFrame) at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61
[33] top-level scope at util.jl:155
Interesting, this looks like a known issue with the BSON spec. Apparently, array types are indexed with an Int32. I'm not entirely sure why I wasn't able to hit the condition with the other large files; maybe it's a difference in the efficiency of the serialization/compression on Windows? One option could be to split these large array primitives into multiple parts, as suggested in the BSON.jl issues.
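To make the failure mode concrete: the serialized payload here is 2,738,917,277 bytes, which exceeds typemax(Int32) (2,147,483,647), so BSON's Int32 length field can't represent it. A minimal Julia sketch of the limit and the chunking idea (the CHUNK size and split_bytes helper are hypothetical illustrations, not JLSO's actual code):

```julia
# typemax(Int32) == 2147483647, so this payload size overflows
# BSON's Int32 length field:
n = 2_738_917_277
# Int32(n)  # would throw InexactError: trunc(Int32, 2738917277)

# Hypothetical workaround: split the serialized bytes into parts
# that each fit comfortably below typemax(Int32).
const CHUNK = 2^30  # 1 GiB per part (illustrative choice)

split_bytes(bytes::Vector{UInt8}) =
    [bytes[i:min(i + CHUNK - 1, length(bytes))] for i in 1:CHUNK:length(bytes)]
```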
Just ran into this issue in #74. It would be nice just to get a more informative error message here, possibly linking to this issue. Right now it's really inscrutable.
Alright, I think I came up with an actual solution to this problem in #75 (vs. just improving the error message). The gist is that we can just drop the serialized object bytes from the BSON doc and manually write them after, allowing us to save much larger files.
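As I understand the approach, the large payload stays out of the BSON document and is streamed separately, so it never passes through BSON's Int32-limited array encoding. A rough sketch under that assumption (save_large and the length-prefix layout are illustrative, not the actual #75 implementation):

```julia
# Illustrative sketch: write the object bytes outside the BSON doc.
function save_large(io::IO, bytes::Vector{UInt8})
    # A small, BSON-safe header recording only the payload length...
    write(io, Int64(length(bytes)))
    # ...then the raw serialized bytes streamed directly, bypassing
    # BSON's Int32 length field entirely.
    write(io, bytes)
end
```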
I've not looked into this, but @xiaodaigh reports that JLSO crashes on the Fannie Mae 2004Q3 data (2.7 GB) on Julia 1.2.0-rc1.0. A gist for reproducing is https://gist.github.com/xiaodaigh/2b9c1b6eb068fb8b3dcd1b1f2a55facd
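For reference, a minimal reproduction along the lines of the gist might look like the following (the file name is a placeholder, and the single-argument `JLSO.save(path, obj)` signature is taken from the stacktrace above; CSV.jl's read API has changed across versions):

```julia
using CSV, DataFrames, JLSO

# Load a dataframe large enough that its serialized form exceeds
# typemax(Int32) bytes (~2 GiB), e.g. the Fannie Mae 2004Q3 file
# (pipe-delimited, no header).
df = DataFrame(CSV.File("Performance_2004Q3.txt"; delim='|', header=false))

# Crashes with InexactError: trunc(Int32, ...) for very large frames.
JLSO.save("df.jlso", df)
```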