Use enforce_schema() and read-chunking in PudlSQLiteIOManager. #2459
Conversation
I didn't make any changes to the PudlTabl class, but there's definitely overlap in the functionality between that and the SQLiteIOManager. Is that just the way it is for now, until we deprecate PudlTabl? Would it make sense to change PudlTabl to grab the assets using Dagster AssetKeys? That seems like it would be a significant change in how things work.
FIELD_DTYPES_SQL: dict[str, sa.sql.visitors.VisitableType] = {
    "boolean": sa.Boolean,
    "date": sa.Date,
    # Ensure SQLite's string representation of datetime uses only whole seconds:
    "datetime": SQLITE_DATETIME(
        storage_format="%(year)04d-%(month)02d-%(day)02d %(hour)02d:%(minute)02d:%(second)02d"
    ),
    "integer": sa.Integer,
    "number": sa.Float,
    "string": sa.Text,
}
This lets the datetime pseudo-type check work, and removes artificial precision in the datetime fields.
I think that we have enough SQLite-specific stuff elsewhere in the system that our to_sql() is functionally to_sqlite(), so having an SQLite-specific dtype in here (rather than the SQLAlchemy generics) doesn't constrain us any more than we already are.
If we want to be able to write into other DBs, I think we'll probably want dialect-specific schema generation methods, like to_sqlite(), to_bigquery(), and to_postgres().
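To see what the whole-second storage format does in practice, here is a small self-contained sketch (assuming SQLAlchemy 1.4+; the storage_format string is copied from the diff above, while the table, column, and engine are invented for illustration):

```python
# Sketch: an SQLite DATETIME that stores only whole seconds. The storage_format
# comes from the FIELD_DTYPES_SQL diff; everything else here is made up.
import datetime

import sqlalchemy as sa
from sqlalchemy.dialects.sqlite import DATETIME as SQLITE_DATETIME

whole_second = SQLITE_DATETIME(
    storage_format=(
        "%(year)04d-%(month)02d-%(day)02d "
        "%(hour)02d:%(minute)02d:%(second)02d"
    )
)
metadata = sa.MetaData()
events = sa.Table("events", metadata, sa.Column("ts", whole_second))

engine = sa.create_engine("sqlite://")
metadata.create_all(engine)
with engine.begin() as conn:
    # Microseconds are silently dropped by the storage format:
    conn.execute(
        events.insert().values(ts=datetime.datetime(2023, 4, 1, 12, 30, 15, 999999))
    )
with engine.connect() as conn:
    raw = conn.execute(sa.text("SELECT ts FROM events")).scalar_one()
print(raw)  # '2023-04-01 12:30:15'
```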
elif self.type == "datetime":
    checks.append(f"{name} IS DATETIME({name})")
This works now that I'm providing a formatting string in the SQLite specific datetime type.
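For context, the check relies on SQLite's datetime() function returning its input unchanged only for well-formed, whole-second timestamp strings. A quick illustration with the stdlib sqlite3 module (the literal values are made up):

```python
# SQLite's datetime() round-trips a whole-second timestamp unchanged, strips
# fractional seconds, and returns NULL for garbage -- so "x IS datetime(x)"
# is true only for cleanly formatted whole-second values.
import sqlite3

con = sqlite3.connect(":memory:")
ok = con.execute(
    "SELECT '2023-01-01 00:00:00' IS datetime('2023-01-01 00:00:00')"
).fetchone()[0]
frac = con.execute(
    "SELECT '2023-01-01 00:00:00.123' IS datetime('2023-01-01 00:00:00.123')"
).fetchone()[0]
junk = con.execute("SELECT 'not a date' IS datetime('not a date')").fetchone()[0]
print(ok, frac, junk)  # 1 0 0
```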
src/pudl/io_managers.py
Outdated
pkg = Package.from_resource_ids()
all_resources = [resource.name for resource in pkg.resources]
res = None
if table_name in all_resources:
    res = pkg.get_resource(table_name)
@bendnorman I'm still a little unclear on what our expectations are about tables that would get written to SQLite with this IOManager.
Without the check to see whether the table had a resource, some unit tests failed, since they just had a temporary table with no metadata associated with it.
Do we only expect to write tables to SQLite that have Resource classes defined? Or do we want to be able to write interim tables that aren't so stringently described? If so, should we differentiate between those cases somehow in the IO Manager?
I think we should expect all tables we are loading from the database to have a Resource class defined so we can be confident we are working with the correct dtype information. self._get_sqlalchemy_table(table_name) throws an error if the table doesn't exist in the metadata.
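A minimal sketch of that lookup-or-raise behavior (the helper name and error type here are assumptions for illustration, not pudl's actual implementation):

```python
# Sketch: look the table up in the SQLAlchemy metadata and fail loudly if it
# isn't there, mirroring the _get_sqlalchemy_table behavior described above.
import sqlalchemy as sa


def get_sqlalchemy_table(md: sa.MetaData, table_name: str) -> sa.Table:
    """Return the table's schema from the metadata, or raise if it's missing."""
    try:
        return md.tables[table_name]
    except KeyError:
        raise RuntimeError(
            f"{table_name} does not exist in the database metadata."
        ) from None
```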
Codecov Report
Patch coverage:

@@ Coverage Diff @@
##             dev   #2459   +/-  ##
=====================================
  Coverage   86.7%   86.7%
=====================================
  Files         81      81
  Lines       9467    9490    +23
=====================================
+ Hits        8208    8233    +25
+ Misses      1259    1257     -2
src/pudl/io_managers.py
Outdated
else:
    chunk_df = pudl.metadata.fields.apply_pudl_dtypes(chunk_df)
dfs.append(chunk_df)
If we decide load_input should fail if the table_name doesn't show up in our metadata, this call to apply_pudl_dtypes would not be required.
src/pudl/io_managers.py
Outdated
pkg = Package.from_resource_ids()
all_resources = [resource.name for resource in pkg.resources]
if table_name in all_resources:
    res = pkg.get_resource(table_name)
    df = res.enforce_schema(df)
else:
    df = pudl.metadata.fields.apply_pudl_dtypes(df)
These changes intersect the work/discussion happening in #2445. It feels duplicative to pass in a SQLAlchemy metadata object created from the Package class, only to create a Package instance inside of the IO Manager again.
I think it makes the most sense to:
- Create a package object with the desired etl groups and pass it into a SQLiteIOManager object. self.pkg.enforce_schema() can be called in handle_output and load_input.
- Create a new metadata field or etl_group that allows us to filter out SQL view tables so we can track the datatypes but avoid creating schemas for them in the db.
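The first bullet could look roughly like this sketch. The class and method names follow the discussion above, but the constructor signature and the Fake* stand-ins in the test are invented for illustration, not pudl's actual API:

```python
# Sketch: build the Package once, up front, and hand it to the IO manager so
# schema enforcement is available on both the write and read paths.
import pandas as pd
import sqlalchemy as sa


class SQLiteIOManagerSketch:
    """Hypothetical IO manager holding a pre-built package (not pudl's code)."""

    def __init__(self, engine: sa.engine.Engine, package) -> None:
        self.engine = engine
        self.package = package  # e.g. built once from the desired etl groups

    def handle_output(self, table_name: str, df: pd.DataFrame) -> None:
        # Enforce the schema before writing; no Package is rebuilt here.
        df = self.package.get_resource(table_name).enforce_schema(df)
        df.to_sql(table_name, self.engine, if_exists="append", index=False)

    def load_input(self, table_name: str) -> pd.DataFrame:
        df = pd.read_sql(f"SELECT * FROM {table_name}", self.engine)
        return self.package.get_resource(table_name).enforce_schema(df)
```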
Ah yay, the dtypes enforcement and self.package checks feel much better. I left a couple of comments about the tests and which checks are necessary.
PR Overview
This PR changes the SQLiteIOManager to run Resource.enforce_schema() before writing to SQLite, and after reading from SQLite, IF a resource matching the asset table_name is defined in the pudl Package.
It applies enforce_schema to chunks of the dataframe as it's read out of the DB, so that for large tables that contain categorical values, peak and final memory usage is reduced.
PR Checklist
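The chunked read described in the overview can be sketched like this (the function name, dtype mapping, and chunk size are illustrative only, not pudl's actual code):

```python
# Sketch: read a table in chunks and apply dtypes (including fixed-category
# categoricals) per chunk, so a large string column never fully materializes
# as Python objects before being converted.
import pandas as pd
import sqlalchemy as sa


def read_table_chunked(engine, table_name, dtypes, chunksize=100_000):
    chunks = []
    with engine.connect() as conn:
        for chunk in pd.read_sql(
            sa.text(f"SELECT * FROM {table_name}"), conn, chunksize=chunksize
        ):
            chunks.append(chunk.astype(dtypes))
    return pd.concat(chunks, ignore_index=True)
```

Using a pd.CategoricalDtype with fixed categories matters here: if each chunk inferred its own categories, pd.concat could silently fall back to object dtype when the chunks' category sets differ.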