
Manage dtypes and memory usage in SQLiteIOManager #2431

Closed
2 tasks done
Tracked by #2386
zaneselvans opened this issue Mar 22, 2023 · 1 comment · Fixed by #2459
Labels
  • dagster: Issues related to our use of the Dagster orchestrator
  • data-types: Dtype conversions, standardization and implications of data types
  • output: Exporting data from PUDL into other platforms or interchange formats.
  • performance: Make PUDL run faster!
  • sqlite: Issues related to interacting with sqlite databases
Comments


zaneselvans commented Mar 22, 2023

SQLite only supports simple data types internally, so when we round-trip a dataframe to the DB we lose precious typing information. At least within Dagster, we may be able to work around this by managing types in the SQLiteIOManager.

Some issues this can address:

The same datetime64 values get stored in the DB with different string representations because of differing sub-second precision. We can fix this by formatting all datetime columns as strings with whole-second (datetime64[s]) precision before writing them out to the database.

When reading a table out of the database, we can use pudl.helpers.apply_pudl_dtypes() to ensure that we get the correct data types for use in pandas. E.g. 0, 1, and NULL in a boolean column read from SQLite would become False, True, and pd.NA in pandas. We'd get nullable integers rather than floatified integers, and nullable strings instead of objects.
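For illustration, the same conversion with plain pandas nullable dtypes (apply_pudl_dtypes() does this using PUDL's table metadata; the column names here are made up):

```python
import pandas as pd

# What SQLite hands back: 0/1/NULL booleans and NULL-containing integer
# columns both arrive as float64, strings arrive as object.
raw = pd.DataFrame(
    {
        "is_active": [0.0, 1.0, None],  # boolean stored as 0/1/NULL
        "plant_id": [1.0, 2.0, None],   # integer floatified by the NULL
        "name": ["a", None, "c"],       # strings come back as object
    }
)

# Cast to pandas nullable extension dtypes.
typed = raw.astype({"is_active": "boolean", "plant_id": "Int64", "name": "string"})
print(typed["is_active"].tolist())  # [False, True, <NA>]
```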

We could also go one step further and use pudl.metadata.classes.Resource.enforce_schema(), which would convert memory intensive string columns that are actually categoricals into pd.CategoricalDtype().
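The memory win from categoricals is easy to see with plain pandas (a generic sketch; enforce_schema() applies pd.CategoricalDtype based on the resource's metadata):

```python
import pandas as pd

# A low-cardinality string column like a timezone repeats a handful of
# values many times. As object dtype each repeat is a separate Python
# string; as a categorical, each unique value is stored once plus small
# integer codes per row.
tz = pd.Series(["America/Denver", "America/New_York", "US/Pacific"] * 100_000)

as_object = tz.memory_usage(deep=True)
as_category = tz.astype("category").memory_usage(deep=True)
print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```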

To reduce peak memory usage (not just the final size of the dataframe) when converting strings to categoricals, we can read tables with chunksize=100_000 or similar, and convert the chunks one by one before concatenating the whole table into a single dataframe.

For example, in the case of demand_hourly_pa_ferc714, the following reduced peak memory usage by ~10x and the final in-memory size of the dataframe by ~3x, without adding a noticeable amount of time to the operation.

    def demand_hourly_pa_ferc714(self) -> pd.DataFrame:
        """An interim FERC 714 output function."""
        dhpa_res = Package.from_resource_ids().get_resource("demand_hourly_pa_ferc714")
        # Concatenating a bunch of smaller chunks reduces peak memory usage drastically
        # and doesn't seem to take any longer.
        return pd.concat(
            [
                # enforce_schema() cuts memory use by ~70% b/c of categorical tzones
                dhpa_res.enforce_schema(df)
                for df in pd.read_sql(
                    "demand_hourly_pa_ferc714",
                    self.pudl_engine,
                    chunksize=100_000,
                )
            ]
        )


Scope

  • All dataframes are read out from SQLite with correct data types (including categorical values where applicable)
  • Peak memory usage is not much larger than the ultimate size of the dataframe being read
zaneselvans (author) commented:
It looks like we can specify a formatting string in the SQLite-specific DATETIME class with SQLAlchemy:

from sqlalchemy.dialects.sqlite import DATETIME

dt = DATETIME(storage_format="%(year)04d/%(month)02d/%(day)02d %(hour)02d:%(minute)02d:%(second)02d")
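SQLAlchemy renders storage_format with ordinary %-style dict formatting, so we can sanity-check the pattern without a database (a stdlib-only sketch of roughly what the bind processor does):

```python
import datetime

# The named keys the sqlite DATETIME type substitutes into storage_format.
storage_format = (
    "%(year)04d/%(month)02d/%(day)02d %(hour)02d:%(minute)02d:%(second)02d"
)

dt = datetime.datetime(2023, 3, 22, 14, 5, 9, microsecond=123456)
rendered = storage_format % {
    "year": dt.year,
    "month": dt.month,
    "day": dt.day,
    "hour": dt.hour,
    "minute": dt.minute,
    "second": dt.second,
}
# Microseconds are dropped because the format has no %(microsecond) field,
# giving every equivalent timestamp a single canonical string.
print(rendered)  # 2023/03/22 14:05:09
```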
