
Crashes for very large dataframes #21

Open · oxinabox opened this issue Sep 16, 2019 · 11 comments

Labels: blocked · bug (Something isn't working) · unreproducible (Unable to reproduce the error with given info)

Comments

@oxinabox (Member) commented Sep 16, 2019

I've not looked into this, but @xiaodaigh reports that JLSO crashes on the Fannie Mae 2004Q3 data (2.7 GB) on Julia 1.2.0-rc1.0.

Gist for reproducing: https://gist.github.com/xiaodaigh/2b9c1b6eb068fb8b3dcd1b1f2a55facd
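
For reference, the failing call presumably boils down to something like this (a minimal sketch; the file path and read options are illustrative, see the gist for the real code):

julia> using CSV, DataFrames, JLSO

julia> df = DataFrame(CSV.File("Performance_2004Q3.txt"; delim='|', header=false));

julia> JLSO.save("performance.jlso", df)  # crashes for sufficiently large frames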

@rofinn (Member) commented Sep 17, 2019

A stacktrace would have been helpful, as those snippets aren't particularly general.

@oxinabox (Member, Author)

Yes, I was just getting the info down before it was lost to Slack's black hole.

@rofinn (Member) commented Sep 18, 2019

NOTE: the Fannie Mae 2004Q3 data (2.7 GB) isn't what's in the gist, and I don't see a 2.7 GB file documented in the dataset.

Running the code on Julia 1.0.3 with Performance_2000Q4.txt (1.0 GB) seems to work fine. I'll try testing on 1.2, as it's possible there's a bug in the more recent Julia releases that needs to be fixed.

JLSO is one of the slowest formats to read and write, but it might be worth updating the benchmarks to also consider file size, because by default we're compressing our data.

write_perf = [0.0, 0.0, 74.5112, 61.946, 147.795, 91.5526, 23.0511, 10.9649, 485.467, 795.006]
read_perf = [0.0, 0.0, 19.1525, 60.1645, 0.00758378, 0.00147431, 16.6633, 12.3439, 96.9803, 49.7399]

The last two entries are both JLSO files.

@rofinn (Member) commented Sep 18, 2019

To push things a little further, I also couldn't get it to throw an error with a 9 GB CSV: https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory

Fun fact: the default JLSO file format compresses the 9.1 GB file down to < 800 MB on disk, while JLD2 only compresses it to 2.6 GB. I guess this is why JLSO is still pretty handy for our use case (e.g., uploading lots of files to S3).

@oxinabox (Member, Author)

Oooh, a chance to use DataDepsGenerators:

julia> using DataDepsGenerators

julia> println(generate("https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory"))
register(DataDep(
    "Seattle Library Collection Inventory",
    """
	Dataset: Seattle Library Collection Inventory
	Website: https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory
	Author: City of Seattle
	Date of Publication: August 1, 2019
	License: https://creativecommons.org/publicdomain/zero/1.0/

	### Content

	The Seattle Public Library's collection inventory.

	### Context

	This is a dataset hosted by the City of Seattle. The city has an open data platform found [here](https://data.seattle.gov/) and they update their information according the amount of data that is brought in. Explore the City of Seattle using Kaggle and all of the data sources available through the City of Seattle [organization page](https://www.kaggle.com/city-of-seattle)!

	* Update Frequency: This dataset is updated monthly.

	### Acknowledgements

	This dataset is maintained using Socrata's API and Kaggle's API. [Socrata](https://socrata.com/) has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

	[Cover photo](https://unsplash.com/photos/VphuLHwuyks) by [Alexandra Kirr](https://unsplash.com/@alexkirrthegirl) on [Unsplash](https://unsplash.com/)
	_Unsplash Images are distributed under a unique [Unsplash License](https://unsplash.com/license)._
	""",
	["https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/seattle-library-collection-inventory.zip/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/CollectionInventory_Codes_EXCLUDED_INCLUDED.xlsx/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/Library Collection Inventory FAQs.pdf/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/library-collection-inventory.csv/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/socrata_metadata.json/3"],
))
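
Once that registration block has been evaluated, DataDeps resolves the files on first access (standard DataDeps.jl usage; the name comes from the generated registration above):

julia> using DataDeps

julia> path = datadep"Seattle Library Collection Inventory"  # downloads on first use, returns the local directory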

@rofinn added the bug (Something isn't working), unreproducible (Unable to reproduce the error with given info), and blocked labels on Sep 18, 2019
@rofinn (Member) commented Sep 18, 2019

Yep, can't reproduce on 1.2 either.

write_perf = [8.97624532, 6.488284407, 37.267838932, 24.014278148, 67.626921068, 64.134563347, 23.585749921, 12.960553185, 311.955533758, 318.826512276]
read_perf = [11.179993041, 8.72340904, 21.159888766, 10.411692382, 0.887963974, 0.001172795, 14.963994836, 7.80005972, 43.608864438, 43.488846143]

Looks like the performance is more consistent, at least :)

My best guess is that this is a Windows-specific issue... possibly with the compression library.

@xiaodaigh

> NOTE: the Fannie Mae 2004Q3 data (2.7 GB) isn't what's in the gist, and I don't see a 2.7 GB file documented in the dataset.

This is the link to the file that contains 2004Q3: http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2007.tgz

In general, more data can be sourced from https://docs.rapids.ai/datasets/mortgage-data

@xiaodaigh

This is the error I get on 1.3-rc2:

ERROR: InexactError: trunc(Int32, 2738917277)
Stacktrace:
 [1] throw_inexacterror(::Symbol, ::Type{Int32}, ::Int64) at .\boot.jl:560
 [2] checked_trunc_sint at .\boot.jl:582 [inlined]
 [3] toInt32 at .\boot.jl:619 [inlined]
 [4] Int32 at .\boot.jl:709 [inlined]
 [5] bson_primitive at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:14 [inlined]
 [6] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{UInt8,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [7] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{UInt8,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [8] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [9] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [10] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{Any,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [11] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [12] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Symbol, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [13] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [14] bson_primitive at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:36 [inlined]
 [15] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [16] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Dict{Symbol,Any}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [17] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [18] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [19] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{Any,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [20] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [21] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Symbol, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [22] bson_doc(::IOStream, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [23] bson_primitive(::IOStream, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:36
 [24] bson(::IOStream, ::Dict{String,Dict{String,V} where V}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:81
 [25] write(::IOStream, ::JLSOFile) at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:7
 [26] #save#4 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:59 [inlined]
 [27] save at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:59 [inlined]
 [28] #7 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61 [inlined]
 [29] #open#271(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(open), ::JLSO.var"##7#8"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Tuple{DataFrames.DataFrame}}, ::String, ::Vararg{String,N} where N) at .\io.jl:298
 [30] open at .\io.jl:296 [inlined]
 [31] #save#6 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61 [inlined]
 [32] save(::String, ::DataFrames.DataFrame) at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61
 [33] top-level scope at util.jl:155
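
The failing value is just past the Int32 range, which is easy to confirm:

julia> typemax(Int32)
2147483647

julia> 2738917277 > typemax(Int32)
true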

@rofinn (Member) commented Sep 19, 2019

Interesting, looks like this is a known issue with the BSON spec. Apparently, array types are indexed with an Int32, so a single field can't exceed about 2 GiB. I'm not entirely sure why I wasn't able to hit the condition with the other large files; maybe it's a difference in the efficiency of serialization/compression on Windows? One option could be to split these large array primitives into multiple parts, as suggested in the BSON.jl issues (see the sketch below).
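
For illustration, a minimal sketch of that chunking idea (a hypothetical helper, not existing BSON.jl or JLSO.jl API):

# Hypothetical helper: split a large byte vector into chunks that each fit
# comfortably within BSON's Int32 length fields.
function chunk_bytes(bytes::Vector{UInt8}, chunksize::Int=2^30)
    return [bytes[i:min(i + chunksize - 1, end)] for i in 1:chunksize:length(bytes)]
end

Each chunk could then be stored as its own binary field and reassembled with reduce(vcat, chunks) on load.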

@samuela commented Oct 22, 2020

Just ran into this issue in #74. It would be nice to at least get a more informative error message here, possibly linking to this issue; right now it's really inscrutable. Even just catching the InexactError at the save boundary would help (see the sketch below).
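
A hypothetical sketch of what that could look like (a wrapper for illustration, not JLSO's actual code):

# Hypothetical wrapper: surface a clearer error when BSON's Int32 limit is hit.
using JLSO

function checked_save(path::String, data...)
    try
        JLSO.save(path, data...)
    catch e
        if e isa InexactError && e.T === Int32
            error("JLSO/BSON cannot store objects larger than ~2 GiB ",
                  "(BSON lengths are Int32); see JLSO.jl issue #21.")
        end
        rethrow()
    end
end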

@rofinn mentioned this issue on Oct 24, 2020
@rofinn (Member) commented Oct 24, 2020

Alright, I think I came up with an actual solution to this problem in #75 (vs. just improving the error message). The gist is that we can drop the serialized object bytes from the BSON doc and manually write them afterwards, allowing us to save much larger files. The idea is sketched below.
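
A rough sketch of that approach (illustrative only, not the actual #75 implementation):

# Sketch: write a small, BSON-safe header document, then stream the large
# serialized payloads directly to the IO. Raw writes carry no Int32 length
# field, so the ~2 GiB BSON limit no longer applies to the object bytes.
using BSON

function save_sketch(io::IO, objects::Dict{String,Vector{UInt8}}, metadata::Dict)
    header = Dict{Symbol,Any}(
        :metadata => metadata,
        :object_lengths => Dict(k => length(v) for (k, v) in objects),
    )
    bson(io, header)                      # small document, safe for BSON
    for k in sort!(collect(keys(objects)))
        write(io, objects[k])             # large payloads appended as raw bytes
    end
end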
