Use enforce_schema() and read-chunking in PudlSQLiteIOManager. #2459

zaneselvans · 2023-03-28T02:08:56Z

PR Overview

This PR changes the SQLiteIOManager to run Resource.enforce_schema() before writing to SQLite, and after reading from SQLite IF a resource matching the asset table_name is defined in the pudl Package.

It applies enforce_schema to chunks of the dataframe as it's read out of the DB so that in the case of large tables that contain categorical values, peak and final memory usage is reduced.
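A minimal sketch of the chunked-read idea, assuming a plain `sqlite3` connection and pandas. `enforce_dtypes()`, `read_table_in_chunks()`, `FUEL_DTYPE`, and the `plants` table are hypothetical stand-ins for illustration and are not the actual PUDL code; `enforce_dtypes()` plays the role of `Resource.enforce_schema()`.

```python
import sqlite3

import pandas as pd

# Hypothetical schema: a low-cardinality string column stored as a categorical.
FUEL_DTYPE = pd.CategoricalDtype(categories=["coal", "gas", "wind"])


def enforce_dtypes(chunk: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for Resource.enforce_schema(): cast columns to compact dtypes."""
    return chunk.astype({"plant_id": "int32", "fuel": FUEL_DTYPE})


def read_table_in_chunks(con, table_name: str, chunksize: int = 2) -> pd.DataFrame:
    """Read a table chunk by chunk, casting each chunk before concatenating."""
    query = f"SELECT * FROM {table_name}"
    # With chunksize set, pd.read_sql returns an iterator of DataFrames, so
    # only one chunk at a time holds the strings as plain Python objects.
    chunks = [
        enforce_dtypes(chunk)
        for chunk in pd.read_sql(query, con, chunksize=chunksize)
    ]
    # Every chunk shares the same categories, so concat keeps the categorical.
    return pd.concat(chunks, ignore_index=True)


con = sqlite3.connect(":memory:")
pd.DataFrame(
    {"plant_id": [1, 2, 3, 4, 5], "fuel": ["coal", "gas", "gas", "wind", "coal"]}
).to_sql("plants", con, index=False)

df = read_table_in_chunks(con, "plants")
```

The chunk size here is tiny only so the example exercises several chunks; a real value would be chosen based on table width and available memory.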

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures.
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

zaneselvans

I didn't make any changes to the PudlTabl class, but there's definitely overlap in functionality between it and the SQLiteIOManager. Is that just the way it is for now, until we deprecate PudlTabl? Would it make sense to change PudlTabl to grab the assets using Dagster AssetKeys? That seems like it would be a significant change in how things work.

zaneselvans · 2023-03-28T03:02:55Z

src/pudl/metadata/constants.py

 FIELD_DTYPES_SQL: dict[str, sa.sql.visitors.VisitableType] = {
    "boolean": sa.Boolean,
    "date": sa.Date,
-    "datetime": sa.DateTime,
+    # Ensure SQLite's string representation of datetime uses only whole seconds:
+    "datetime": SQLITE_DATETIME(
+        storage_format="%(year)04d-%(month)02d-%(day)02d %(hour)02d:%(minute)02d:%(second)02d"
+    ),
    "integer": sa.Integer,
    "number": sa.Float,
    "string": sa.Text,


This lets the datetime pseudo-type check work, and removes artificial precision in the datetime fields.

I think that we have enough SQLite-specific stuff elsewhere in the system that our to_sql() is functionally to_sqlite(), so having an SQLite-specific dtype in here (rather than the SQLAlchemy generics) doesn't constrain us any more than we already are.

If we want to be able to write into other DBs, I think we'll probably want dialect-specific schema generation methods, like to_sqlite(), to_bigquery() and to_postgres().

zaneselvans · 2023-03-28T03:04:25Z

src/pudl/metadata/classes.py

+            elif self.type == "datetime":
+                checks.append(f"{name} IS DATETIME({name})")


This works now that I'm providing a formatting string in the SQLite-specific datetime type.
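A small illustration of why the storage format matters for this check, using only the standard-library `sqlite3` module. The `events` table and `report_time` column are hypothetical. SQLite's `DATETIME()` function returns whole-second `YYYY-MM-DD HH:MM:SS` text, so a stored string with fractional seconds does not compare equal to its own `DATETIME()` value and the constraint rejects it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same shape of CHECK constraint as in the diff above, on a hypothetical table.
conn.execute(
    "CREATE TABLE events ("
    "  report_time TEXT CHECK (report_time IS DATETIME(report_time))"
    ")"
)

# Whole seconds: the stored text equals DATETIME(text), so the check passes.
conn.execute("INSERT INTO events VALUES ('2023-03-28 02:08:56')")

# Fractional seconds: DATETIME() drops the fraction, so the check fails.
try:
    conn.execute("INSERT INTO events VALUES ('2023-03-28 02:08:56.000000')")
    fractional_rejected = False
except sqlite3.IntegrityError:
    fractional_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```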

zaneselvans · 2023-03-28T03:07:50Z

src/pudl/io_managers.py

+        pkg = Package.from_resource_ids()
+        all_resources = [resource.name for resource in pkg.resources]
+        res = None
+        if table_name in all_resources:
+            res = pkg.get_resource(table_name)
+


@bendnorman I'm still a little unclear on what our expectations are about tables that would get written to SQLite with this IOManager.

Without the check to see whether the table had a resource, some unit tests failed, since they just had a temporary table with no metadata associated with it.

Do we only expect to write tables to SQLite that have Resource classes defined? Or do we want to be able to write interim tables that aren't so stringently described? If so, should we differentiate between those cases somehow in the IO Manager?

codecov · 2023-03-28T03:32:57Z

Codecov Report

Patch coverage: 100.0% and no project coverage change.

Comparison is base (01bd9af) 86.7% compared to head (775be55) 86.7%.

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #2459   +/-   ##
=====================================
  Coverage   86.7%   86.7%           
=====================================
  Files         81      81           
  Lines       9467    9490   +23     
=====================================
+ Hits        8208    8233   +25     
+ Misses      1259    1257    -2

Impacted Files	Coverage Δ
src/pudl/io_managers.py	`90.4% <100.0%> (+1.9%)`	⬆️
src/pudl/metadata/classes.py	`81.8% <100.0%> (+<0.1%)`	⬆️
src/pudl/metadata/constants.py	`100.0% <100.0%> (ø)`

bendnorman · 2023-03-28T16:17:48Z

src/pudl/io_managers.py

+        pkg = Package.from_resource_ids()
+        all_resources = [resource.name for resource in pkg.resources]
+        res = None
+        if table_name in all_resources:
+            res = pkg.get_resource(table_name)
+


I think we should expect all tables we are loading from the database to have a Resource class defined so we can be confident we are working with the correct dtype information.

self._get_sqlalchemy_table(table_name) throws an error if the table doesn't exist in the metadata.

bendnorman · 2023-03-28T16:19:26Z

src/pudl/io_managers.py

+                    else:
+                        chunk_df = pudl.metadata.fields.apply_pudl_dtypes(chunk_df)
+                    dfs.append(chunk_df)


If we decide load_input should fail if the table_name doesn't show up in our metadata, this call to apply_pudl_dtypes would not be required.

bendnorman · 2023-03-28T16:31:44Z

src/pudl/io_managers.py

+        pkg = Package.from_resource_ids()
+        all_resources = [resource.name for resource in pkg.resources]
+        if table_name in all_resources:
+            res = pkg.get_resource(table_name)
+            df = res.enforce_schema(df)
+        else:
+            df = pudl.metadata.fields.apply_pudl_dtypes(df)


These changes intersect with the work and discussion happening in #2445. It feels duplicative to pass in a SQLAlchemy metadata object created from the Package class and then create a Package object again inside the IO Manager.

I think it makes the most sense to:

Create a package object with the desired etl groups and pass it into a SQLiteIOManager object. self.pkg.enforce_schema() can be called in handle_output and load_input.

Create a new metadata field or etl_group that allows us to filter out SQL view tables so we can track the datatypes but avoid creating schemas for them in the db.
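A rough sketch of the first suggestion, assuming nothing about the real PUDL or Dagster classes: `SketchResource`, `SketchPackage`, and `SketchIOManager` are hypothetical names, the method signatures are simplified (Dagster's real `handle_output` and `load_input` take context objects), and an in-memory dict stands in for the SQLite database. The package is passed in once, and both directions look up the resource and enforce its schema, raising a `ValueError` for tables without metadata.

```python
class SketchResource:
    """Hypothetical table metadata: maps column names to cast functions."""

    def __init__(self, name, casts):
        self.name = name
        self.casts = casts

    def enforce_schema(self, rows):
        # Keep only the declared columns and cast each value to its type.
        return [
            {col: cast(row[col]) for col, cast in self.casts.items()}
            for row in rows
        ]


class SketchPackage:
    """Hypothetical collection of resources, keyed by table name."""

    def __init__(self, resources):
        self._resources = {res.name: res for res in resources}

    def get_resource(self, name):
        if name not in self._resources:
            raise ValueError(f"Table {name!r} has no resource in this package.")
        return self._resources[name]


class SketchIOManager:
    """Receives the package once instead of rebuilding it on every call."""

    def __init__(self, package):
        self.package = package
        self._tables = {}  # in-memory stand-in for the SQLite database

    def handle_output(self, table_name, rows):
        resource = self.package.get_resource(table_name)
        self._tables[table_name] = resource.enforce_schema(rows)

    def load_input(self, table_name):
        resource = self.package.get_resource(table_name)
        return resource.enforce_schema(self._tables[table_name])


package = SketchPackage(
    [SketchResource("plants", {"plant_id": int, "capacity_mw": float})]
)
manager = SketchIOManager(package)
manager.handle_output("plants", [{"plant_id": "1", "capacity_mw": "2.5"}])
loaded = manager.load_input("plants")
```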

bendnorman

Ah yay, the dtypes enforcement and self.package checks feel much better. I left a couple of comments about the tests and which checks are necessary.

Use enforce_schema() and read-chunking in SQLiteIOManager.

9bef568

zaneselvans added the output, sqlite, performance, and data-types labels Mar 28, 2023

zaneselvans linked an issue Mar 28, 2023 that may be closed by this pull request

Manage dtypes and memory usage in SQLiteIOManager #2431

Closed

zaneselvans added this to the 2023Q1 milestone Mar 28, 2023

zaneselvans added the dagster label Mar 28, 2023

zaneselvans added 2 commits March 27, 2023 20:43

Ensure SQLite's string representation of datetime has no fractional seconds.

77e7c53

Use apply_pudl_dtypes() when the table has no Resource.

ba1522e

zaneselvans commented Mar 28, 2023

zaneselvans self-assigned this Mar 28, 2023

zaneselvans requested a review from bendnorman March 28, 2023 03:10

zaneselvans marked this pull request as ready for review March 28, 2023 03:11

bendnorman reviewed Mar 28, 2023

This was referenced Mar 28, 2023

Create simple SQL view assets #2445

Merged

Create PudlSQLiteIOManager to accept a Package object #2466

Merged

zaneselvans added 7 commits March 29, 2023 00:40

Merge branch 'dev' into sqlite-iomanager-dyptes-memory

d4bf214

Merge branch 'dev' into sqlite-iomanager-dyptes-memory

07a4cd8

Merge branch 'dev' into sqlite-iomanager-dyptes-memory

748c9bc

Merge branch 'dev' into sqlite-iomanager-dyptes-memory

071cabb

Merge branch 'dev' into sqlite-iomanager-dyptes-memory

73b4595

Implement _handle_pandas_output() in PudlSQLiteIOManager

07d8e8d

Add unit test for PudlSQLiteIOManager

68ed2e7

zaneselvans changed the title ~~Use enforce_schema() and read-chunking in SQLiteIOManager.~~ Use enforce_schema() and read-chunking in PudlSQLiteIOManager. Mar 30, 2023

zaneselvans requested a review from bendnorman March 30, 2023 20:22

zaneselvans commented Mar 30, 2023

src/pudl/io_managers.py (resolved)

zaneselvans commented Mar 30, 2023

test/unit/io_managers_test.py (outdated, resolved)

Merge branch 'dev' into sqlite-iomanager-dyptes-memory

7c3d16d

bendnorman reviewed Mar 30, 2023

src/pudl/io_managers.py (outdated, resolved)

src/pudl/io_managers.py (outdated, resolved)

test/unit/io_managers_test.py (outdated, resolved)

test/unit/io_managers_test.py (outdated, resolved)

Throw ValueError when table doesn't exist in PudlSQLiteIOManager.package

775be55

bendnorman approved these changes Mar 31, 2023

bendnorman merged commit 7ca27c2 into dev Mar 31, 2023

bendnorman deleted the sqlite-iomanager-dyptes-memory branch March 31, 2023 05:37
