Add new authors.csv, adjust published columns and tests #99

Merged: 6 commits, Sep 26, 2024

Contributor

@jacobthill commented Sep 26, 2024

@peetucket made some changes to the RIALTO orgs app that add primary_school and primary_department to the authors.csv file and change how orgs are assigned to researchers. This PR adds the new authors.csv, publishes those columns, adds the openalex_open_access column, and removes some unneeded columns.

@jacobthill
Contributor Author

I noticed test_openalex.py::test_dois_from_orcid_paging and test_openalex.py::test_publications_from_dois are failing as well, but I don't think that's related to any changes in this PR.

jacobthill and others added 4 commits September 26, 2024 12:56
dois_from_orcid() now returns a list of unique DOIs instead of an
iterator of possibly non-unique DOIs. This allows a failing test to pass.
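
A minimal sketch of the de-duplication described above, assuming the DOIs arrive as an iterable of strings. `unique_dois` is a hypothetical helper for illustration, not the function changed in this PR:

```python
from typing import Iterable, List


def unique_dois(dois: Iterable[str]) -> List[str]:
    """Return the DOIs as a list with duplicates removed, keeping first-seen order."""
    # dict keys preserve insertion order, so this de-duplicates without reordering
    return list(dict.fromkeys(dois))


# made-up DOIs, passed as an iterator to mirror the old return type
dois = unique_dois(iter(["10.1/a", "10.1/b", "10.1/a"]))
print(dois)
```

Returning a concrete list also means callers can take `len()` of the result or iterate over it more than once, which an iterator does not allow.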

The test for openalex_publications_from_dois() was relaxed a bit to look
for 231 or more publications, since a lookup for an individual DOI can
sometimes pull back multiple works.
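
A rough sketch of what the relaxed check could look like, assuming the lookup result is a list of works. The helper name and the sample data are hypothetical; only the threshold of 231 comes from the commit message:

```python
def assert_at_least(works: list, minimum: int) -> None:
    """Fail only when fewer works come back than the minimum expected."""
    assert len(works) >= minimum, f"expected >= {minimum} works, got {len(works)}"


# made-up data: 231 DOIs looked up, one of which matched two works
works = [{"doi": f"10.1/{i}"} for i in range(231)] + [{"doi": "10.1/0"}]
assert_at_least(works, 231)
```

A lower bound keeps the test stable when a DOI resolves to more than one work, while still catching a lookup that silently drops results.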

The number of columns for OpenAlex is now 53 because we added
`institution_assertions`.
@@ -16,10 +16,10 @@
 def dois_from_orcid(orcid):
     logging.info(f"looking up dois for orcid {orcid}")
     q = """
-        search publications where researchers.orcid_id = "{}"
+        search publications where researchers.orcid_id = "{}" and year in [2018:{}]
Contributor

Why are we limiting to publications between 2018 and 2024? This seems like something we will forget about, and then we will quietly stop getting new data.

Contributor Author

The thought was that since Rochelle only cared about pubs after 2017, we could speed up the run time by filtering in the query, but I don't think it actually sped things up, so I pulled it out.
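
For reference, had the filter been kept, one way to avoid the "quietly stop getting new data" problem would be to compute the end year at run time rather than hard-coding it. This is only a sketch: the template reproduces just the query fragment shown in the diff, and `build_query`, its `today` parameter, and the ORCID value are illustrative assumptions:

```python
import datetime
from typing import Optional

# Query fragment as shown in the diff; the second placeholder is the end year.
QUERY_TEMPLATE = 'search publications where researchers.orcid_id = "{}" and year in [2018:{}]'


def build_query(orcid: str, today: Optional[datetime.date] = None) -> str:
    """Fill in the ORCID and use the current year as the upper bound."""
    today = today or datetime.date.today()
    return QUERY_TEMPLATE.format(orcid, today.year)


# placeholder ORCID and a fixed date, so the output is reproducible
print(build_query("0000-0002-1825-0097", today=datetime.date(2024, 9, 26)))
```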

@@ -86,7 +82,7 @@ def openalex_pubs_df(openalex_pubs):
         "publication_year",
         "title",
         "type",
-        "best_oa_location",
Contributor

We found it useful to return this in the Easy Deposit workcycle, because we wanted to get the URLs for open access PDFs. But I guess it can be removed. Did you find it was not needed or redundant?

Contributor Author

My thinking was that this DAG would drive the open access dashboard, and it's not needed for that dashboard. I think we will want another DAG for a publishers data dashboard, and others for other purposes. The other option would be to get all of the fields for all dashboards in one DAG and then ignore the fields we don't need, but I recall us having concerns about the size of the file with that approach.

Contributor Author

I'm open to either approach. Let me know if you think we should discuss.

Contributor

It seems like we should minimize the time we have to go out and get metadata from OpenAlex and Dimensions. As long as we have the data in the interim files, there will be an opportunity to use it in other datasets, like you said.
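
To illustrate the "harvest once, select per dataset" idea, here is a small sketch that keeps every field in an interim CSV and picks out only the columns a given dashboard needs. The helper, the column choice, and the sample row are all made up for the example and do not reflect the project's actual files:

```python
import csv
import io
from typing import Dict, List


def select_columns(interim_csv: str, columns: List[str]) -> List[Dict[str, str]]:
    """Read an interim CSV and keep only the requested columns from each row."""
    reader = csv.DictReader(io.StringIO(interim_csv))
    return [{col: row[col] for col in columns} for row in reader]


# made-up interim data that still carries best_oa_location
interim = "doi,title,best_oa_location\n10.1/a,Example,https://example.org/a.pdf\n"

# the open access dashboard would ignore best_oa_location
dashboard_rows = select_columns(interim, ["doi", "title"])
print(dashboard_rows)
```

The trade-off raised in the thread still applies: the interim file grows with every retained field, so this only helps if that size is acceptable.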

@edsu merged commit a0cfdb1 into main Sep 26, 2024
1 check passed
@edsu deleted the data-fields branch September 26, 2024 19:59