
Crashes for very large dataframes #21

Open · oxinabox opened this issue Sep 16, 2019 · 11 comments

Labels: blocked · bug (Something isn't working) · unreproducible (Unable to reproduce the error with given info)

Comments

@oxinabox (Member) commented Sep 16, 2019

I've not looked into this, but @xiaodaigh reports that JLSO crashes on the Fannie Mae 2004Q3 data (2.7 GB) on Julia 1.2.0-rc1.0.

Gist for reproducing: https://gist.github.com/xiaodaigh/2b9c1b6eb068fb8b3dcd1b1f2a55facd
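
For reference, the failing call presumably boils down to something like this (a minimal sketch; the file path and read options are illustrative, see the gist for the real code):

julia> using CSV, DataFrames, JLSO

julia> df = DataFrame(CSV.File("Performance_2004Q3.txt"; delim='|', header=false));

julia> JLSO.save("performance.jlso", df)  # crashes for sufficiently large frames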

@rofinn (Member) commented Sep 17, 2019

A stacktrace would have been helpful, as those snippets aren't particularly general.

@oxinabox (Member, Author)

Yes, I was just getting the info down before it was lost to Slack's black hole.

@rofinn (Member) commented Sep 18, 2019

NOTE: the Fannie Mae 2004Q3 data (2.7 GB) isn't what's in the gist, and I don't see a 2.7 GB file documented in the dataset.

Running the code on Julia 1.0.3 with Performance_2000Q4.txt (1.0 GB) seems to work fine. I'll try testing on 1.2, as it's possible there's a bug in the more recent Julia releases that needs to be fixed.

JLSO is one of the slowest formats to read and write, but it might be worth updating the benchmarks to also consider file size, because by default we're compressing our data.

write_perf = [0.0, 0.0, 74.5112, 61.946, 147.795, 91.5526, 23.0511, 10.9649, 485.467, 795.006]
read_perf = [0.0, 0.0, 19.1525, 60.1645, 0.00758378, 0.00147431, 16.6633, 12.3439, 96.9803, 49.7399]

The last two entries are both JLSO files.

@rofinn (Member) commented Sep 18, 2019

To push things a little further, I also couldn't get it to throw an error with a 9 GB CSV: https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory

Fun fact: the default JLSO file format compresses the 9.1 GB file down to < 800 MB on disk, while JLD2 only compresses it to 2.6 GB. I guess this is why JLSO is still pretty handy for our use case (e.g., uploading lots of files to S3).

@oxinabox (Member, Author)

Oooh, a chance to use DataDepsGenerators:

julia> using DataDepsGenerators

julia> println(generate("https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory"))
register(DataDep(
    "Seattle Library Collection Inventory",
    """
	Dataset: Seattle Library Collection Inventory
	Website: https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory
	Author: City of Seattle
	Date of Publication: August 1, 2019
	License: https://creativecommons.org/publicdomain/zero/1.0/

	### Content

	The Seattle Public Library's collection inventory.

	### Context

	This is a dataset hosted by the City of Seattle. The city has an open data platform found [here](https://data.seattle.gov/) and they update their information according the amount of data that is brought in. Explore the City of Seattle using Kaggle and all of the data sources available through the City of Seattle [organization page](https://www.kaggle.com/city-of-seattle)!

	* Update Frequency: This dataset is updated monthly.

	### Acknowledgements

	This dataset is maintained using Socrata's API and Kaggle's API. [Socrata](https://socrata.com/) has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

	[Cover photo](https://unsplash.com/photos/VphuLHwuyks) by [Alexandra Kirr](https://unsplash.com/@alexkirrthegirl) on [Unsplash](https://unsplash.com/)
	_Unsplash Images are distributed under a unique [Unsplash License](https://unsplash.com/license)._
	""",
	["https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/seattle-library-collection-inventory.zip/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/CollectionInventory_Codes_EXCLUDED_INCLUDED.xlsx/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/Library Collection Inventory FAQs.pdf/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/library-collection-inventory.csv/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/socrata_metadata.json/3"],
))
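
Once that registration block has been evaluated, DataDeps resolves the files on first access (standard DataDeps.jl usage; the name comes from the generated registration above):

julia> using DataDeps

julia> path = datadep"Seattle Library Collection Inventory"  # downloads on first use, returns the local directory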

@rofinn added the bug (Something isn't working), unreproducible (Unable to reproduce the error with given info), and blocked labels on Sep 18, 2019
@rofinn (Member) commented Sep 18, 2019

Yep, can't reproduce on 1.2 either.

write_perf = [8.97624532, 6.488284407, 37.267838932, 24.014278148, 67.626921068, 64.134563347, 23.585749921, 12.960553185, 311.955533758, 318.826512276]
read_perf = [11.179993041, 8.72340904, 21.159888766, 10.411692382, 0.887963974, 0.001172795, 14.963994836, 7.80005972, 43.608864438, 43.488846143]

Looks like the performance is more consistent, at least :)

My best guess is that this is a Windows-specific issue... possibly with the compression library.

@xiaodaigh

> NOTE: the Fannie Mae 2004Q3 data (2.7 GB) isn't what's in the gist, and I don't see a 2.7 GB file documented in the dataset.

This is the link to the file that contains 2004Q3: http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2007.tgz

In general, more data can be sourced from https://docs.rapids.ai/datasets/mortgage-data

@xiaodaigh

This is the error I get on 1.3-rc2:

ERROR: InexactError: trunc(Int32, 2738917277)
Stacktrace:
 [1] throw_inexacterror(::Symbol, ::Type{Int32}, ::Int64) at .\boot.jl:560
 [2] checked_trunc_sint at .\boot.jl:582 [inlined]
 [3] toInt32 at .\boot.jl:619 [inlined]
 [4] Int32 at .\boot.jl:709 [inlined]
 [5] bson_primitive at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:14 [inlined]
 [6] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{UInt8,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [7] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{UInt8,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [8] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [9] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [10] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{Any,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [11] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [12] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Symbol, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [13] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [14] bson_primitive at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:36 [inlined]
 [15] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [16] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Dict{Symbol,Any}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [17] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [18] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [19] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{Any,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [20] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [21] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Symbol, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [22] bson_doc(::IOStream, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [23] bson_primitive(::IOStream, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:36
 [24] bson(::IOStream, ::Dict{String,Dict{String,V} where V}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:81
 [25] write(::IOStream, ::JLSOFile) at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:7
 [26] #save#4 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:59 [inlined]
 [27] save at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:59 [inlined]
 [28] #7 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61 [inlined]
 [29] #open#271(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(open), ::JLSO.var"##7#8"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Tuple{DataFrames.DataFrame}}, ::String, ::Vararg{String,N} where N) at .\io.jl:298
 [30] open at .\io.jl:296 [inlined]
 [31] #save#6 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61 [inlined]
 [32] save(::String, ::DataFrames.DataFrame) at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61
 [33] top-level scope at util.jl:155
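
The failing value is just past the Int32 range, which is easy to confirm:

julia> typemax(Int32)
2147483647

julia> 2738917277 > typemax(Int32)
true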

@rofinn (Member) commented Sep 19, 2019

Interesting, looks like this is a known issue with the BSON spec. Apparently, array types are indexed with an Int32, so a single field can't exceed about 2 GiB. I'm not entirely sure why I wasn't able to hit the condition with the other large files; maybe it's a difference in the efficiency of serialization/compression on Windows? One option could be to split these large array primitives into multiple parts, as suggested in the BSON.jl issues (see the sketch below).
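
For illustration, a minimal sketch of that chunking idea (a hypothetical helper, not existing BSON.jl or JLSO.jl API):

# Hypothetical helper: split a large byte vector into chunks that each fit
# comfortably within BSON's Int32 length fields.
function chunk_bytes(bytes::Vector{UInt8}, chunksize::Int=2^30)
    return [bytes[i:min(i + chunksize - 1, end)] for i in 1:chunksize:length(bytes)]
end

Each chunk could then be stored as its own binary field and reassembled with reduce(vcat, chunks) on load.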

@samuela commented Oct 22, 2020

Just ran into this issue in #74. It would be nice to at least get a more informative error message here, possibly linking to this issue; right now it's really inscrutable. Even just catching the InexactError at the save boundary would help (see the sketch below).
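
A hypothetical sketch of what that could look like (a wrapper for illustration, not JLSO's actual code):

# Hypothetical wrapper: surface a clearer error when BSON's Int32 limit is hit.
using JLSO

function checked_save(path::String, data...)
    try
        JLSO.save(path, data...)
    catch e
        if e isa InexactError && e.T === Int32
            error("JLSO/BSON cannot store objects larger than ~2 GiB ",
                  "(BSON lengths are Int32); see JLSO.jl issue #21.")
        end
        rethrow()
    end
end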

@rofinn mentioned this issue on Oct 24, 2020
@rofinn (Member) commented Oct 24, 2020

Alright, I think I came up with an actual solution to this problem in #75 (vs. just improving the error message). The gist is that we can drop the serialized object bytes from the BSON doc and manually write them afterwards, allowing us to save much larger files. The idea is sketched below.
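
A rough sketch of that approach (illustrative only, not the actual #75 implementation):

# Sketch: write a small, BSON-safe header document, then stream the large
# serialized payloads directly to the IO. Raw writes carry no Int32 length
# field, so the ~2 GiB BSON limit no longer applies to the object bytes.
using BSON

function save_sketch(io::IO, objects::Dict{String,Vector{UInt8}}, metadata::Dict)
    header = Dict{Symbol,Any}(
        :metadata => metadata,
        :object_lengths => Dict(k => length(v) for (k, v) in objects),
    )
    bson(io, header)                      # small document, safe for BSON
    for k in sort!(collect(keys(objects)))
        write(io, objects[k])             # large payloads appended as raw bytes
    end
end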
