
Manage dtypes and memory usage in SQLiteIOManager #2431

Closed
2 tasks done
Tracked by #2386
zaneselvans opened this issue Mar 22, 2023 · 1 comment · Fixed by #2459
Labels
  • dagster: Issues related to our use of the Dagster orchestrator
  • data-types: Dtype conversions, standardization and implications of data types
  • output: Exporting data from PUDL into other platforms or interchange formats.
  • performance: Make PUDL run faster!
  • sqlite: Issues related to interacting with sqlite databases
Comments


zaneselvans commented Mar 22, 2023

SQLite only supports simple data types internally, so when we round-trip a dataframe to the DB we lose precious typing information. At least within Dagster, we may be able to work around this by managing types in the SQLiteIOManager.

Some issues this can address:

The same datetime64 values get stored in the DB with different string representations because of differing sub-second precision. We can fix this by formatting all datetime columns as strings with whole-second (datetime64[s]) precision before writing them out to the database.

When reading a table out of the database, we can use pudl.helpers.apply_pudl_dtypes() to ensure that we get the correct data types for use in pandas. E.g. 0, 1, and NULL in a boolean column read from SQLite would become False, True, and pd.NA in pandas. We'd get nullable integers rather than floatified integers, and nullable strings instead of objects.
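For illustration, the same conversion with plain pandas nullable dtypes (apply_pudl_dtypes() does this using PUDL's table metadata; the column names here are made up):

```python
import pandas as pd

# What SQLite hands back: 0/1/NULL booleans and NULL-containing integer
# columns both arrive as float64, strings arrive as object.
raw = pd.DataFrame(
    {
        "is_active": [0.0, 1.0, None],  # boolean stored as 0/1/NULL
        "plant_id": [1.0, 2.0, None],   # integer floatified by the NULL
        "name": ["a", None, "c"],       # strings come back as object
    }
)

# Cast to pandas nullable extension dtypes.
typed = raw.astype({"is_active": "boolean", "plant_id": "Int64", "name": "string"})
print(typed["is_active"].tolist())  # [False, True, <NA>]
```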

We could also go one step further and use pudl.metadata.classes.Resource.enforce_schema(), which would convert memory intensive string columns that are actually categoricals into pd.CategoricalDtype().
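The memory win from categoricals is easy to see with plain pandas (a generic sketch; enforce_schema() applies pd.CategoricalDtype based on the resource's metadata):

```python
import pandas as pd

# A low-cardinality string column like a timezone repeats a handful of
# values many times. As object dtype each repeat is a separate Python
# string; as a categorical, each unique value is stored once plus small
# integer codes per row.
tz = pd.Series(["America/Denver", "America/New_York", "US/Pacific"] * 100_000)

as_object = tz.memory_usage(deep=True)
as_category = tz.astype("category").memory_usage(deep=True)
print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```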

To reduce peak memory usage (not just the final size of the dataframe) when converting strings to categoricals, we can read tables with chunksize=100_000 or similar, and convert the chunks one by one before concatenating the whole table into a single dataframe.

For example, in the case of demand_hourly_pa_ferc714, the following reduced peak memory usage by ~10x and the final in-memory size of the dataframe by ~3x, without adding a noticeable amount of time to the operation.

    def demand_hourly_pa_ferc714(self) -> pd.DataFrame:
        """An interim FERC 714 output function."""
        dhpa_res = Package.from_resource_ids().get_resource("demand_hourly_pa_ferc714")
        # Concatenating a bunch of smaller chunks reduces peak memory usage drastically
        # and doesn't seem to take any longer.
        return pd.concat(
            [
                # enforce_schema() cuts memory use by ~70% b/c of categorical tzones
                dhpa_res.enforce_schema(df)
                for df in pd.read_sql(
                    "demand_hourly_pa_ferc714",
                    self.pudl_engine,
                    chunksize=100_000,
                )
            ]
        )


Scope

  • All dataframes are read out from SQLite with correct data types (including categorical values where applicable)
  • Peak memory usage is not much larger than the ultimate size of the dataframe being read
zaneselvans (author) commented:
It looks like we can specify a formatting string in the SQLite-specific DATETIME class with SQLAlchemy:

from sqlalchemy.dialects.sqlite import DATETIME

dt = DATETIME(storage_format="%(year)04d/%(month)02d/%(day)02d %(hour)02d:%(minute)02d:%(second)02d")
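SQLAlchemy renders storage_format with ordinary %-style dict formatting, so we can sanity-check the pattern without a database (a stdlib-only sketch of roughly what the bind processor does):

```python
import datetime

# The named keys the sqlite DATETIME type substitutes into storage_format.
storage_format = (
    "%(year)04d/%(month)02d/%(day)02d %(hour)02d:%(minute)02d:%(second)02d"
)

dt = datetime.datetime(2023, 3, 22, 14, 5, 9, microsecond=123456)
rendered = storage_format % {
    "year": dt.year,
    "month": dt.month,
    "day": dt.day,
    "hour": dt.hour,
    "minute": dt.minute,
    "second": dt.second,
}
# Microseconds are dropped because the format has no %(microsecond) field,
# giving every equivalent timestamp a single canonical string.
print(rendered)  # 2023/03/22 14:05:09
```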
