Manage dtypes and memory usage in SQLiteIOManager #2431
Labels

- `dagster`: Issues related to our use of the Dagster orchestrator
- `data-types`: Dtype conversions, standardization and implications of data types
- `output`: Exporting data from PUDL into other platforms or interchange formats.
- `performance`: Make PUDL run faster!
- `sqlite`: Issues related to interacting with sqlite databases
SQLite only supports a few simple data types internally, so when we round-trip a dataframe to the DB we lose precious typing information. At least within Dagster, we may be able to work around this by managing types in the `SQLiteIOManager`.

Some issues this could address:

- Different string representations of the same `datetime64` values being stored in the DB due to differing precision on the seconds. We can fix this by formatting any datetime columns as strings equivalent to `datetime64[s]` before writing out to the database.
- When reading a table out of the database, we can use `pudl.helpers.apply_pudl_dtypes()` to ensure that we get the correct data types for use in pandas. E.g. `0`, `1`, and `NULL` in a `boolean` column read from SQLite would become `False`, `True`, and `pd.NA` in pandas. We'd also get nullable integers rather than floatified integers, and nullable strings instead of `object` columns.
- We could go one step further and use `pudl.metadata.classes.Resource.enforce_schema()`, which would convert memory-intensive string columns that are actually categoricals into `pd.CategoricalDtype()`.
- To reduce peak memory use when converting strings to categoricals, and not just the size of the finished dataframe, we can read tables with `chunksize=100_000` or something similar, and convert the individual chunks one by one before concatenating the whole table into a single dataframe.

For example, in the case of `demand_hourly_pa_ferc714`, this approach reduced peak memory usage by ~10x and the final in-memory size of the dataframe by ~3x, without adding a noticeable amount of time to the operation.
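On the read side, a sketch of the effect `pudl.helpers.apply_pudl_dtypes()` would have; the real helper derives types from PUDL's metadata, so the hand-written dtype conversions and the table/column names here are assumptions for illustration:

```python
import sqlite3

import pandas as pd

# Toy SQLite table standing in for a PUDL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants (plant_id INTEGER, is_retired INTEGER, name TEXT)")
conn.execute("INSERT INTO plants VALUES (1, 0, 'alpha'), (2, 1, NULL), (3, NULL, 'gamma')")

df = pd.read_sql("SELECT * FROM plants", conn)

# Nullable integer instead of floatified integers; nullable string
# instead of object.
df = df.astype({"plant_id": "Int64", "name": "string"})
# 0 / 1 / NULL -> False / True / pd.NA. Going via Int64 maps the NaN that
# SQLite's NULL produced into pd.NA before the boolean cast.
df["is_retired"] = df["is_retired"].astype("Int64").astype("boolean")
```

Centralizing this mapping in the `SQLiteIOManager` means every downstream op sees the same pandas dtypes regardless of what SQLite stored.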
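The chunked-read strategy might look roughly like this; the table, column names, and chunk size are illustrative stand-ins, not the real `demand_hourly_pa_ferc714` schema:

```python
import sqlite3

import pandas as pd

# Toy table with a highly repetitive string column, the situation where
# categoricals pay off.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demand (utility TEXT, demand_mwh REAL)")
conn.executemany(
    "INSERT INTO demand VALUES (?, ?)",
    [(f"util_{i % 5}", float(i)) for i in range(1000)],
)
conn.commit()

# Read in chunks and convert each chunk's strings to categoricals before
# concatenating, so peak memory stays near one chunk's worth of raw strings
# rather than the whole table's.
chunks = []
for chunk in pd.read_sql("SELECT * FROM demand", conn, chunksize=100):
    chunk["utility"] = chunk["utility"].astype("category")
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
# If the chunks' category sets differed, concat falls back to object, so
# re-apply the categorical dtype once over the assembled table.
df["utility"] = df["utility"].astype("category")
```

The final `astype("category")` is cheap because the strings have already been deduplicated within each chunk; only the per-chunk category indexes need unifying.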