Parallelize Dagster processing of EPA CEMS #2472
Conversation
For more information, see https://pre-commit.ci
Bummer the partitions aren't working out as planned :/
That seems like really weird behavior from Dagster. I wonder why their idea about how it should work is so different from ours.
I think that Dagster expects partitioned assets to depend only on assets with the same partitions. It sounds like they're working on adding functionality more like what we expect, but for the time being we have to go with another option.
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## dev #2472 +/- ##
=====================================
Coverage 86.7% 86.7%
=====================================
Files 81 81
Lines 9490 9504 +14
=====================================
+ Hits 8233 8247 +14
Misses 1257 1257
☔ View full report in Codecov by Sentry.
…onization write problems
…puts in and writing to single file
I think the performance on this is totally good enough -- it went from 70 minutes to 14 minutes on my machine, and once it's interleaved with all of the other jobs in the DAG I don't think the additional granularity would speed things up much. It kept my 10 CPUs pegged to 100% for 10+ minutes.
I needed to merge in dev to run it locally, and I tried to fix another lingering partitioning config that was in the integration tests, so I went ahead and pushed that merge back up to the PR. I hope that's okay!
I took a look at the Dagster docs for the other abstractions you're using here, which are different from how we're managing the other assets, and I wasn't totally clear on how it's working. Would you be willing to talk me through it? Maybe a little more of an explanation in the epacems_assets.py module-level docstring would be helpful, since it's so different from the straight-up assets we're using everywhere else?
But functionally it seems to work great!
src/pudl/resources.py
Outdated
@@ -30,6 +31,23 @@ def ferc_to_sqlite_settings(init_context) -> FercToSqliteSettings:
    return FercToSqliteSettings(**init_context.resource_config)


class ParquetWriter(pq.ParquetWriter):
Is it worth preserving the ability to write out a single monolithic file directly, rather than compiling it after the fact from partitioned outputs? Once the data is in Parquet, it's extremely fast to read and write and it seems simpler to avoid the issue of concurrency altogether. We're hacking around it with SQLite because we don't have a choice unless we want to switch to another format for the DB entirely, but Parquet is designed to be sliced and diced.
Right now I don't think there's any way to directly write to the monolithic output. I was using this class for testing, but it's removed now.
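To illustrate the "compile it after the fact from partitioned outputs" idea from the comment above, here is a minimal sketch using pyarrow to stream the per-partition files into one monolithic file. The paths and file names are illustrative assumptions, not anything from this PR:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Treat all of the per-partition Parquet outputs as one logical dataset,
# then stream it batch by batch into a single monolithic file so memory
# use stays bounded. Paths are hypothetical.
partitioned = ds.dataset("parquet/epacems/", format="parquet")
with pq.ParquetWriter("hourly_emissions_epacems.parquet", partitioned.schema) as writer:
    for batch in partitioned.to_batches():
        writer.write_table(pa.Table.from_batches([batch], schema=partitioned.schema))
```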
Ack! This was a "pending" comment that never got submitted from 4 days ago when I first saw you push to the branch.
plants_entity_eia,
)
)
return consolidate_partitions(partitions.collect())
Can you say more about what the meaning of the return value is here? consolidate_partitions() doesn't return anything itself and I'm not immediately understanding the Dagster docs on what's going on here.
Basically a graph_asset like this has to return the output from an op for Dagster to be able to figure out how to create the asset. In most cases the op would actually be returning something that would then get passed to an io_manager. This case is kind of confusing though because we are directly writing to the parquet outputs inside the op, so there's nothing to return.
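As a rough illustration of that pattern (not the PR's exact code; op names, the asset name, and the year range are assumptions), the graph_asset fans out one dynamic output per partition, maps an op over them that writes Parquet directly, and returns the output of a final collecting op:

```python
from dagster import DynamicOut, DynamicOutput, graph_asset, op


@op(out=DynamicOut())
def yield_epacems_partitions():
    # One dynamic output per partition (years here, for illustration).
    for year in range(1995, 2022):
        yield DynamicOutput(year, mapping_key=str(year))


@op
def process_single_partition(year: int) -> int:
    # In the PR this step writes one partition of CEMS data straight to
    # Parquet; returning the year just stands in for that side effect.
    return year


@op
def consolidate_partitions(years):
    # Nothing meaningful to return: the Parquet files were already written
    # inside the mapped op. The graph_asset still returns this op's output
    # so Dagster knows how the asset gets produced.
    return None


@graph_asset
def hourly_emissions_epacems():
    partitions = yield_epacems_partitions().map(process_single_partition)
    return consolidate_partitions(partitions.collect())
```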
…pudl into epacems_partitions
For more information, see https://pre-commit.ci
Looks like there's a docstring formatting issue that's breaking the docs build, but other than that this looks good to me.
Background
epacems is currently a major bottleneck in the ETL and could be easily parallelized. Here I've tried using Dagster's partitioned assets to achieve concurrency, but they don't seem to be behaving quite as expected/hoped. It seems that for every partition combination, it launches a run of the etl_fast/etl_full job, which runs the entire ETL instead of materializing all assets just once and then running each CEMS partition individually. This is unsurprisingly blowing up the resource usage on my computer and causing it to crash even when limiting the max concurrency to 5 runs.
Possible solutions
Option 1: Try to get partitions to work as desired
It would be ideal if we could just get partitions working as we want them to, but I'm not entirely sure if it's possible. I'm going to write up a question in Dagster's slack and see if we can get any help, so we'll see what comes of it.
edit: Just got a response from the Dagster slack, and they recommended creating a new asset job for the partitioned asset and running it separately (see the sketch below). This would allow all of the epacems partitions to be generated concurrently, but they wouldn't be running alongside the main ETL, so we do lose a little concurrency there. For that reason, option 2 might still be preferable, but I'd like to hear others' thoughts.
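A rough sketch of that suggestion, with hypothetical names throughout, just to show the shape of the setup:

```python
from dagster import AssetSelection, StaticPartitionsDefinition, define_asset_job

# Same partitions definition the partitioned asset would use (illustrative).
epacems_partitions_def = StaticPartitionsDefinition(
    [str(year) for year in range(1995, 2022)]
)

# A separate job that materializes only the partitioned CEMS asset, so its
# partitions can run concurrently without relaunching the whole ETL.
epacems_job = define_asset_job(
    name="epacems_job",
    selection=AssetSelection.keys("hourly_emissions_epacems"),
    partitions_def=epacems_partitions_def,
)
```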
Option 2: Create an asset factory to generate assets for each partition
This seems to me like a decent solution if we can't get partitions working. We could have the factory generate an asset for each year, and allow Dagster to handle generating each of these assets concurrently (see the sketch below). This behavior seems like exactly what we'd want, but it would add a bunch of assets that create some clutter. I don't think it would be too bad, though.
Option 3: Handle concurrency ourselves
We could just use normal Python concurrency tools to handle this ourselves and only have one asset. I don't think doing our own concurrency would be ideal, since performance will probably be best if we let Dagster handle it. The one advantage I see over the asset factory option is that we don't have to create a bunch of assets.
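For completeness, a sketch of what handling it ourselves might look like inside a single asset, using the standard library; the helper name and year range are hypothetical:

```python
from concurrent.futures import ProcessPoolExecutor


def process_epacems_year(year: int) -> str:
    """Hypothetical helper: extract/transform one year and write it to Parquet."""
    ...
    return f"wrote {year}"


def etl_epacems(years=tuple(range(1995, 2022)), max_workers=5):
    # Fan the years out across worker processes from inside one asset.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_epacems_year, years))
```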