Add new authors.csv, adjust published columns and tests #99
I noticed |
`dois_from_orcid()` now returns a list of unique DOIs instead of an iterator of DOIs that may contain duplicates. This allows a failing test to pass. The test for `openalex_publications_from_dois()` was relaxed to expect 231 or more publications, since a lookup for an individual DOI can sometimes return multiple works. The number of columns for OpenAlex is now 53 because we added `institution_assertions`.
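For illustration, here is a minimal sketch of the de-duplication idea described above. It is not the code in this PR: the helper name `unique_dois` is hypothetical, and it assumes DOIs are compared as exact strings with no normalization.

```python
def unique_dois(dois):
    """Return DOIs as a list with duplicates removed, keeping first-seen order.

    Hypothetical helper: accepts any iterable (including a generator) and
    compares DOIs as exact strings, without any case normalization.
    """
    # dict keys are unique and preserve insertion order, so this drops
    # repeats without reordering the remaining DOIs.
    return list(dict.fromkeys(dois))


# A generator with a repeated DOI collapses to a list of distinct values.
example = unique_dois(doi for doi in ["10.1/a", "10.1/b", "10.1/a"])
```

Returning a list rather than an iterator also means a test can call `len()` on the result directly, which fits a relaxed check of the form `assert len(pubs) >= 231`.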
rialto_airflow/harvest/dimensions.py
Outdated
@@ -16,10 +16,10 @@
 def dois_from_orcid(orcid):
     logging.info(f"looking up dois for orcid {orcid}")
     q = """
-        search publications where researchers.orcid_id = "{}"
+        search publications where researchers.orcid_id = "{}" and year in [2018:{}]
Why are we limiting to publications between 2018 and 2024? This seems like something we will forget about, and then we will quietly stop getting new data.
The thought was that, since Rochelle only cared about pubs after 2017, we could speed up the run time by filtering in the query. But I don't think it actually sped things up, so I pulled it out.
@@ -86,7 +82,7 @@ def openalex_pubs_df(openalex_pubs):
         "publication_year",
         "title",
         "type",
-        "best_oa_location",
We found it useful to return this in the Easy Deposit workcycle, because we wanted to get the URLs for OpenAccess PDFs. But I guess it can be removed. Did you find it was not needed or redundant?
My thinking was that this DAG would drive the open access dashboard, and it's not needed for that dashboard. I think we will want another DAG for a publishers data dashboard, and others for other purposes. The other option would be to get all of the fields for all dashboards in one DAG and then ignore the fields we don't need, but I recall us having concerns about the size of the file with that approach.
I'm open to either approach. Let me know if you think we should discuss.
It seems like we should minimize the time we spend going out to get metadata from OpenAlex and Dimensions. As long as we have the data in the interim files, there will be an opportunity to use it in other datasets, like you said.
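To make the trade-off concrete, here is a rough sketch of the "harvest once, select per dashboard" idea discussed in this thread. It is only an illustration under assumptions: `select_columns` and the column lists are hypothetical, and the interim records are modeled as plain dicts rather than whatever format the DAG actually writes.

```python
def select_columns(records, columns):
    """Project each record onto the given columns.

    Hypothetical helper: `records` is an iterable of dicts standing in for
    rows of an interim file; columns missing from a record come back as None.
    """
    return [{col: rec.get(col) for col in columns} for rec in records]


# Interim data keeps every harvested field, including best_oa_location.
interim = [
    {
        "publication_year": 2023,
        "title": "An example work",
        "type": "article",
        "best_oa_location": "https://example.org/paper.pdf",
    }
]

# Assumed column lists: one dashboard skips best_oa_location, another keeps it.
open_access_rows = select_columns(interim, ["publication_year", "title", "type"])
deposit_rows = select_columns(interim, ["title", "best_oa_location"])
```

With this shape, the OpenAlex and Dimensions lookups happen once, and each downstream dataset pays only for the columns it publishes.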
@peetucket made some changes to the RIALTO orgs app that add `primary_school` and `primary_department` to the `authors.csv` file and change how orgs are assigned to researchers. This PR adds the new `authors.csv`, publishes those columns, adds the `openalex_open_access` column, and removes some unneeded columns.