Make a multi-year EIA MECS archive #542

Merged
merged 9 commits into from
Jan 29, 2025

Conversation

@cmgosnell cmgosnell commented Jan 24, 2025

Overview

Closes #516.

Converted the 2018 MECS archive into a multi-year archive. The link patterns for each of the years before 2006 are pretty different. The major and minor table numbers are not in the original file names for the 1998 and 1994 archives, so I didn't attempt to rename those.

Testing

How did you make sure this worked? How can a reviewer verify this?

Unpublished archives:
https://zenodo.org/uploads/14749820
https://sandbox.zenodo.org/uploads/158873

@cmgosnell cmgosnell requested a review from e-belfer January 24, 2025 22:23
@cmgosnell cmgosnell self-assigned this Jan 24, 2025
@cmgosnell cmgosnell changed the title from "wip draft of multi-year mecs" to "Make a multi-year EIA MECS archive" Jan 24, 2025
Comment on lines 37 to 56
if int(year) >= 2006:
    table_link_pattern = re.compile(
        r"(RSE|)[Tt]able(\d{1,2}|\d{1.1})_(\d{1,2})(.xlsx|.xls)"
    )
elif int(year) == 2002:
    table_link_pattern = re.compile(
        r"(RSE|)[Tt]able(\d{1,2}).(\d{1,2})_\d{1,2}(.xlsx|.xls)"
    )
elif int(year) == 1998:
    table_link_pattern = re.compile(
        r"(d|e)\d{2}([a-z]\d{1,2})_(\d{1,2})(.xlsx|.xls)"
    )
elif int(year) == 1994:
    # These earlier years the pattern is functional but not actually that informative.
    # so we will just use the original name by making the whole pattern a match
    table_link_pattern = re.compile(
        r"((rse|)m\d{2}_(\d{2})([a-d]|)(.xlsx|.xls))"
    )
elif int(year) == 1991:
    table_link_pattern = re.compile(r"((rse|)mecs(\d{2})([a-z]|)(.xlsx|.xls))")
Member Author

I don't love this whole situation, but I didn't know what else to do. I thought about making a dict with year as key and pattern as value, but then it wouldn't naturally pick up the next year if the newer pattern holds.

Member

You could make the latest year be the default pattern and update with a dict-key if it's one of the other years? I agree it's a bit verbose.
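The suggestion above could be sketched roughly as follows. This is a minimal illustration, not the code that was merged: the names `DEFAULT_PATTERN`, `YEAR_PATTERNS`, and `get_table_link_pattern` are hypothetical, and the sketch escapes the literal dots and drops the `\d{1.1}` alternative from the newest pattern, which is an assumption about what the original expressions intended.

```python
import re

# Pattern for the newest link layout (2006 onward in the diff under review).
# Escaping the dots before the extension is an assumption about intent.
DEFAULT_PATTERN = re.compile(r"(RSE|)[Tt]able(\d{1,2})_(\d{1,2})(\.xlsx|\.xls)")

# Overrides only for the years whose links do not follow the default layout.
YEAR_PATTERNS = {
    2002: re.compile(r"(RSE|)[Tt]able(\d{1,2})\.(\d{1,2})_\d{1,2}(\.xlsx|\.xls)"),
    1998: re.compile(r"(d|e)\d{2}([a-z]\d{1,2})_(\d{1,2})(\.xlsx|\.xls)"),
    # For 1994 and 1991 the outer group captures the whole original file name.
    1994: re.compile(r"((rse|)m\d{2}_(\d{2})([a-d]|)(\.xlsx|\.xls))"),
    1991: re.compile(r"((rse|)mecs(\d{2})([a-z]|)(\.xlsx|\.xls))"),
}


def get_table_link_pattern(year):
    """Return the override for this year, or the newest pattern otherwise."""
    return YEAR_PATTERNS.get(int(year), DEFAULT_PATTERN)
```

With this shape a future survey year falls through to the default without a code change, which addresses the concern about not picking up the next year. One difference from the `if`/`elif` chain: any unlisted year, including one earlier than 1991, would also get the default pattern.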

Member

@e-belfer e-belfer left a comment

When I run this, 1998 is full of files called 'd' and 'e'. All other years look just fine!
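One possible explanation, offered only as a guess: if the archiver builds each file name from the first capture group, the 1998 pattern would yield just `d` or `e`, because its first group is `(d|e)`. The thread does not confirm how names are built, and the file name `d98n1_1.xls` below is hypothetical. The sketch also shows the 1998 pattern wrapped in an outer group, the way the 1994 pattern is, so that the first group holds the whole original name.

```python
import re

# The 1998 pattern as it appears in the diff under review.
pattern_1998 = re.compile(r"(d|e)\d{2}([a-z]\d{1,2})_(\d{1,2})(.xlsx|.xls)")

# Hypothetical link name, used only to show what each group captures.
sample_name = "d98n1_1.xls"

first_group = pattern_1998.search(sample_name).group(1)
print(first_group)  # only the leading letter

# Variant with an outer group, mirroring the 1994 pattern's approach of
# keeping the original file name by making the whole pattern a match.
whole_name_1998 = re.compile(r"((d|e)\d{2}([a-z]\d{1,2})_(\d{1,2})(.xlsx|.xls))")

full_name = whole_name_1998.search(sample_name).group(1)
print(full_name)  # the complete original file name
```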

@cmgosnell cmgosnell requested a review from e-belfer January 27, 2025 15:58
Member

@e-belfer e-belfer left a comment

Looks good to go! Publish that thing!

src/pudl_archiver/archivers/eia/eiamecs.py (outdated, resolved)
@cmgosnell cmgosnell enabled auto-merge January 29, 2025 14:08
@cmgosnell cmgosnell merged commit 94cc3ab into main Jan 29, 2025
3 checks passed
@cmgosnell cmgosnell deleted the eiamecs-udpate branch January 29, 2025 14:12
@e-belfer e-belfer added the eiamecs EIA Manufacturing Energy Consumption Survey data label Feb 4, 2025
Labels
eiamecs EIA Manufacturing Energy Consumption Survey data
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Update archiver for EIA MECS to archive all available years
2 participants