Add new authors.csv, adjust published columns and tests #99
Changes from 3 commits
d992d2c
f388581
5898235
ccf6816
9ae9adc
fd68077
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,11 +57,7 @@ def dimensions_pubs_df(dimensions_pubs): | |
"document_type", | ||
"funders", | ||
"funding_section", | ||
"linkout", | ||
"open_access", | ||
"publisher", | ||
"research_orgs", | ||
"researchers", | ||
"title", | ||
"type", | ||
"year", | ||
|
@@ -86,7 +82,7 @@ def openalex_pubs_df(openalex_pubs): | |
"publication_year", | ||
"title", | ||
"type", | ||
"best_oa_location", | ||
We found it useful to return this in the Easy Deposit workcycle, because we wanted to get the URLs for OpenAccess PDFs. But I guess it can be removed. Did you find it was not needed or redundant?

My thinking was that this DAG would drive the open access dashboard, and it's not needed for that dashboard. I think we will want another DAG for a publishers data dashboard and others for other purposes. The other option would be to get all of the fields for all dashboards in one DAG and then ignore the fields we don't need, but I recall us having concerns about the size of the file with that approach.

I'm open to either approach. Let me know if you think we should discuss.

It seems like we should minimize the time we have to go out and get metadata from OpenAlex and Dimensions. As long as we have the data in the interim files, there will be an opportunity to use it in other datasets, like you said.
||
"open_access", | ||
), | ||
) | ||
df = df.rename(lambda column_name: "openalex_" + column_name) | ||
|
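For reference, the second option raised in this thread (fetch every field once, then keep only what each dashboard needs) might look roughly like the sketch below. It uses plain dictionaries rather than the project's DataFrame code, and `DASHBOARD_COLUMNS`, `select_columns`, and the `publishers` column set are hypothetical names made up for illustration; only the column names and the `openalex_` prefix come from the diff.

```python
# Hypothetical mapping of dashboard name -> fields it needs.
# The column names come from the diff; the grouping is an assumption.
DASHBOARD_COLUMNS = {
    "open_access": ["publication_year", "title", "type", "open_access"],
    "publishers": ["publication_year", "title", "best_oa_location"],
}


def select_columns(record, dashboard):
    """Keep only the fields one dashboard needs, adding the openalex_ prefix."""
    wanted = DASHBOARD_COLUMNS[dashboard]
    return {"openalex_" + name: record.get(name) for name in wanted}


# An interim record that still carries every harvested field.
interim_record = {
    "publication_year": 2023,
    "title": "An example title",
    "type": "article",
    "best_oa_location": {"pdf_url": "https://example.org/paper.pdf"},
    "open_access": {"is_oa": True},
}

open_access_row = select_columns(interim_record, "open_access")
publishers_row = select_columns(interim_record, "publishers")
```

With this shape, the harvest from OpenAlex happens once and each dashboard reads its own subset from the interim data, at the cost of a larger interim file.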
Why are we limiting to publications from 2018 to 2024? This seems like something we will forget about, and then we will quietly stop getting new data.
The thought was that, since Rochelle only cared about pubs after 2017, we could speed up the run time by filtering in the query. But I don't think it actually sped things up, so I pulled it out.
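If a year filter is ever wanted again, one way to avoid the "quietly stop getting new data" problem would be to apply only a lower bound after the harvest, with no upper year at all. This is a minimal sketch under that assumption; `MIN_YEAR`, `recent_pubs`, and the sample records are hypothetical, and it works on plain dictionaries rather than the project's DataFrames.

```python
# Assumption: "after 2017" from the thread means 2018 onward.
MIN_YEAR = 2018


def recent_pubs(pubs, min_year=MIN_YEAR):
    """Keep publications from min_year onward.

    There is deliberately no upper bound, so publications from future
    years keep flowing through without anyone editing the filter.
    Records with a missing year are dropped.
    """
    return [
        pub for pub in pubs
        if pub.get("year") is not None and pub["year"] >= min_year
    ]


sample = [
    {"title": "too old", "year": 2016},
    {"title": "first kept year", "year": 2018},
    {"title": "future year", "year": 2030},
    {"title": "no year", "year": None},
]

kept = recent_pubs(sample)
```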