Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use enforce_schema() and read-chunking in PudlSQLiteIOManager. #2459

Merged
merged 12 commits into from
Mar 31, 2023

Conversation

zaneselvans
Copy link
Member

@zaneselvans zaneselvans commented Mar 28, 2023

PR Overview

This PR changes the SQLiteIOManager to run Resource.enforce_schema() before writing to SQLite, and after reading from SQLite IF a resource matching the asset table_name is defined in the pudl Package.

It applies enforce_schema to chunks of the dataframe as it's read out of the DB so that in the case of large tables that contain categorical values, peak and final memory usage is reduced.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans zaneselvans added output Exporting data from PUDL into other platforms or interchange formats. sqlite Issues related to interacting with sqlite databases performance Make PUDL run faster! data-types Dtype conversions, standardization and implications of data types labels Mar 28, 2023
@zaneselvans zaneselvans linked an issue Mar 28, 2023 that may be closed by this pull request
@zaneselvans zaneselvans added this to the 2023Q1 milestone Mar 28, 2023
@zaneselvans zaneselvans added the dagster Issues related to our use of the Dagster orchestrator label Mar 28, 2023
Copy link
Member Author

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't make any changes to the PudlTabl class, but there's definitely overlap in the functionality between that and the SQLiteIOManager Is that just the way it is for now, until we deprecate PudlTabl? Would it make sense to change PudlTabl to grab the assets using Dagster AssetKeys? That seems like it would be a significant change in how things work.

Comment on lines 31 to 40
FIELD_DTYPES_SQL: dict[str, sa.sql.visitors.VisitableType] = {
"boolean": sa.Boolean,
"date": sa.Date,
"datetime": sa.DateTime,
# Ensure SQLite's string representation of datetime uses only whole seconds:
"datetime": SQLITE_DATETIME(
storage_format="%(year)04d-%(month)02d-%(day)02d %(hour)02d:%(minute)02d:%(second)02d"
),
"integer": sa.Integer,
"number": sa.Float,
"string": sa.Text,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lets the datetime pseudo-type check work, and removes artificial precision in the datetime fields.

I think that we have enough SQLite specific stuff elsewhere in the system that our to_sql() is functionally to_sqlite() and so having an SQLite specific dtype in here (rather than the SQLAlchemy generics) isn't any more constrained than we already are.

If we want to be able to write into other DBs I think we'll probably want dialect-specific schema generation methods, like to_sqlite(), to_bigquery() and to_postgres()

Comment on lines +714 to +715
elif self.type == "datetime":
checks.append(f"{name} IS DATETIME({name})")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works now that I'm providing a formatting string in the SQLite specific datetime type.

Comment on lines 359 to 364
pkg = Package.from_resource_ids()
all_resources = [resource.name for resource in pkg.resources]
res = None
if table_name in all_resources:
res = pkg.get_resource(table_name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bendnorman I'm still a little unclear on what our expectations are about tables that would get written to SQLite with this IOManager.

Without the check to see whether the table had a resource, some unit tests failed, since they just had a temporary table with no metadata associated with it.

Do we only expect to write tables to SQLite that have Resource classes defined? Or do we want to be able to write interim tables that are so stringently described? If so, should we differentiate between those cases somehow in the IO Manager?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should expect all tables we are loading from the database to have a Resource class defined so we can be confident we are working with the correct dtype information.

self._get_sqlalchemy_table(table_name) throws an error if the table doesn't exist in the metadata.

@zaneselvans zaneselvans self-assigned this Mar 28, 2023
@zaneselvans zaneselvans requested a review from bendnorman March 28, 2023 03:10
@zaneselvans zaneselvans marked this pull request as ready for review March 28, 2023 03:11
@codecov
Copy link

codecov bot commented Mar 28, 2023

Codecov Report

Patch coverage: 100.0% and no project coverage change.

Comparison is base (01bd9af) 86.7% compared to head (775be55) 86.7%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2459   +/-   ##
=====================================
  Coverage   86.7%   86.7%           
=====================================
  Files         81      81           
  Lines       9467    9490   +23     
=====================================
+ Hits        8208    8233   +25     
+ Misses      1259    1257    -2     
Impacted Files Coverage Δ
src/pudl/io_managers.py 90.4% <100.0%> (+1.9%) ⬆️
src/pudl/metadata/classes.py 81.8% <100.0%> (+<0.1%) ⬆️
src/pudl/metadata/constants.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Comment on lines 359 to 364
pkg = Package.from_resource_ids()
all_resources = [resource.name for resource in pkg.resources]
res = None
if table_name in all_resources:
res = pkg.get_resource(table_name)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should expect all tables we are loading from the database to have a Resource class defined so we can be confident we are working with the correct dtype information.

self._get_sqlalchemy_table(table_name) throws an error if the table doesn't exist in the metadata.

Comment on lines 375 to 377
else:
chunk_df = pudl.metadata.fields.apply_pudl_dtypes(chunk_df)
dfs.append(chunk_df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide load_input should fail if the table_name doesn't show up in our metadata, this call to apply_pudl_dtypes would not be required.

Comment on lines 279 to 285
pkg = Package.from_resource_ids()
all_resources = [resource.name for resource in pkg.resources]
if table_name in all_resources:
res = pkg.get_resource(table_name)
df = res.enforce_schema(df)
else:
df = pudl.metadata.fields.apply_pudl_dtypes(df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes intersect the work/discussion happening in #2445. It feels duplicative pass in a sqlalchemy metadata object created from the Package class to then create a package class inside of the IO Manager again.

I think it makes the most sense to:

  1. Create a package object with the desired etl groups and pass it into a SQLiteIOManager object. self.pkg.enforce_schema() can be called in handle_output and load_input.
  2. Create a new metadata field or etl_group that allows us to filter out SQL view tables so we can track the datatypes but avoid creating schemas for them in the db.

src/pudl/io_managers.py Show resolved Hide resolved
@zaneselvans zaneselvans changed the title Use enforce_schema() and read-chunking in SQLiteIOManager. Use enforce_schema() and read-chunking in PudlSQLiteIOManager. Mar 30, 2023
@zaneselvans zaneselvans requested a review from bendnorman March 30, 2023 20:22
Copy link
Member

@bendnorman bendnorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yay, the dtypes enforcement and self.package checks feel much better. I left a couple of comments about the tests and which checks are necessary.

src/pudl/io_managers.py Outdated Show resolved Hide resolved
src/pudl/io_managers.py Outdated Show resolved Hide resolved
test/unit/io_managers_test.py Outdated Show resolved Hide resolved
test/unit/io_managers_test.py Outdated Show resolved Hide resolved
@bendnorman bendnorman merged commit 7ca27c2 into dev Mar 31, 2023
@bendnorman bendnorman deleted the sqlite-iomanager-dyptes-memory branch March 31, 2023 05:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator data-types Dtype conversions, standardization and implications of data types output Exporting data from PUDL into other platforms or interchange formats. performance Make PUDL run faster! sqlite Issues related to interacting with sqlite databases
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Manage dtypes and memory usage in SQLiteIOManager
2 participants