This repository has been archived by the owner on Nov 18, 2023. It is now read-only.

Update UTA and automate future updates #6

Open
korikuzma opened this issue Jul 4, 2023 · 9 comments
Labels
advanced (Project is good for those with advanced experience), uta (Project is for UTA)

Comments

@korikuzma
Contributor

korikuzma commented Jul 4, 2023

Submitter Name

@reece

Submitter Affiliation

MyOme

Requested By

Everyone using UTA

Lead(s)

@reece

biocommons Repo

uta

Project Details

Hackathon Project Slide

UTA data has not been updated since Jan 29, 2021. Instructions for updating UTA are here. This project aims to automate UTA updates and their release to dl.biocommons.org and as Docker images.

Hackathon project plan:

  • Incorporate "Remove materialized views from UTA release" (#11) into this project
  • Review release process and identify challenges
  • Update UTA manually once
  • Develop vision for ideal loading process and automation strategy/tools
  • Identify subprojects to tackle during hackathon and get to it!

Skill Level

Advanced

Required Skills

Python, Docker

@korikuzma added the advanced and uta labels on Jul 4, 2023
@korikuzma
Contributor Author

@reece @andreasprlic Would you be able to provide any additional information on this?

@andreasprlic
Member

  • The main problem is that the update procedure is tied to a local setup on a specific system.
  • Specific steps need to be run in a virtual env on that system.
  • Sometimes these steps break because files change at NCBI, and the steps then need to be updated. (We probably can't easily fix this aspect.)
  • The best way forward is to change the update procedure so that these steps are wrapped as tasks in a workflow (pick your favorite workflow engine).
  • After an update we currently do manual QC on the new UTA build. Can we add automated QC as a new set of steps at the end of the workflow?
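
To make the "steps as tasks, QC at the end" idea concrete, here is a minimal sketch in plain Python with no particular workflow engine. Everything in it is an assumption for illustration: the `Task` and `Workflow` classes and the step functions (`download_ncbi_files`, `load_transcripts`, `run_qc`) are hypothetical placeholders, not the actual UTA loading steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    name: str
    run: Callable[[Dict], None]  # each task reads/writes a shared context dict


@dataclass
class Workflow:
    tasks: List[Task] = field(default_factory=list)

    def add(self, name: str, fn: Callable[[Dict], None]) -> "Workflow":
        self.tasks.append(Task(name, fn))
        return self

    def execute(self, context: Optional[Dict] = None) -> List[str]:
        """Run tasks in order; stop at the first failure and say which task broke."""
        context = {} if context is None else context
        completed: List[str] = []
        for task in self.tasks:
            try:
                task.run(context)
            except Exception as exc:
                raise RuntimeError(
                    f"task {task.name!r} failed; completed so far: {completed}"
                ) from exc
            completed.append(task.name)
        return completed


# Hypothetical stand-ins for the real update steps.
def download_ncbi_files(ctx: Dict) -> None:
    ctx["files"] = ["placeholder-file"]


def load_transcripts(ctx: Dict) -> None:
    ctx["loaded"] = len(ctx["files"])


def run_qc(ctx: Dict) -> None:
    # QC is just another task, placed last so a bad build fails the workflow.
    if ctx["loaded"] == 0:
        raise ValueError("no transcripts loaded")
    ctx["qc_passed"] = True


def build_update_workflow() -> Workflow:
    return (
        Workflow()
        .add("download", download_ncbi_files)
        .add("load", load_transcripts)
        .add("qc", run_qc)
    )
```

A real engine (Conductor, Airflow, Nextflow) would add retries, scheduling, and logging on top of the same shape: named tasks, an explicit order, and a failure that points at the step that broke.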

@andreasprlic
Member

Suggestions for workflow engines: Conductor (slight preference), Airflow, Nextflow

@andreasprlic
Member

QC steps:

  • make sure all Entrez genes make it into the database
  • make sure the latest version of each transcript is in the database
  • no null gene symbols
  • no null CIGAR strings
  • make sure alignments are complete for transcripts (compare the transcript length with the CIGAR strings)
  • make sure CDS start/stop and NP accessions are associated with all coding transcripts
  • check all transcripts and write warnings for deprecated transcripts to a file
  • check all transcripts and write warnings for transcripts with ref-disagrees to a file
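
A few of these checks (null gene symbols, null CIGAR strings, length vs. CIGAR) could be sketched roughly as below. This is an illustration under assumptions, not UTA's actual schema or QC code: the row keys `ac`, `gene_symbol`, `cigar`, and `length` are hypothetical, and the CIGAR arithmetic follows the standard SAM convention for which operations consume the query sequence. Whether the transcript is the "query" side in UTA's stored CIGARs would need to be confirmed.

```python
import re
from typing import Dict, Iterable, List

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# SAM convention: these operations consume bases of the query sequence.
_CONSUMES_QUERY = set("MIS=X")


def cigar_query_length(cigar: str) -> int:
    """Return the number of query bases covered by a CIGAR string."""
    if not cigar:
        raise ValueError("empty CIGAR string")
    pos = 0
    total = 0
    for m in _CIGAR_OP.finditer(cigar):
        if m.start() != pos:  # unparseable characters between operations
            raise ValueError(f"malformed CIGAR: {cigar!r}")
        pos = m.end()
        if m.group(2) in _CONSUMES_QUERY:
            total += int(m.group(1))
    if pos != len(cigar):  # trailing characters that are not an operation
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return total


def qc_transcripts(rows: Iterable[Dict]) -> List[str]:
    """Collect warnings for rows with hypothetical keys ac, gene_symbol, cigar, length."""
    warnings: List[str] = []
    for row in rows:
        ac = row.get("ac")
        if not row.get("gene_symbol"):
            warnings.append(f"{ac}: null gene symbol")
        cigar = row.get("cigar")
        if not cigar:
            warnings.append(f"{ac}: null CIGAR string")
            continue
        covered = cigar_query_length(cigar)
        if covered != row.get("length"):
            warnings.append(
                f"{ac}: alignment covers {covered} bases, transcript length is {row.get('length')}"
            )
    return warnings
```

Returning a list of warnings rather than raising on the first problem fits the "write warnings to a file" steps: the workflow can dump the list and decide separately whether any of them should fail the build.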

@andreasprlic
Member

Feature request for improved update procedure:

  • get the CIGAR string for exon alignments from NCBI rather than computing it ourselves (big speedup)
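
One possible shape for this, offered only as a sketch: the GFF3 specification defines a `Gap` attribute for alignments (for example `M8 D3 M6 I1 M6`), and if the NCBI alignment files used for the update carry it (an assumption here), it can be rewritten into CIGAR syntax. The function name `gap_to_cigar` is hypothetical, and the conversion is purely syntactic: depending on which sequence UTA treats as the reference, `I` and `D` may need to be swapped, and `M` would still need splitting into `=`/`X` if UTA requires that distinction.

```python
import re

_GAP_OP = re.compile(r"([MIDFR])(\d+)")


def gap_to_cigar(gap: str) -> str:
    """Rewrite a GFF3 Gap attribute value (e.g. 'M8 D3 M6') as a CIGAR-style string."""
    parts = gap.split()
    if not parts:
        raise ValueError("empty Gap attribute")
    out = []
    for part in parts:
        m = _GAP_OP.fullmatch(part)
        if m is None:
            raise ValueError(f"malformed Gap operation: {part!r}")
        op, count = m.groups()
        if op in "FR":
            # Frameshift operations have no direct CIGAR counterpart.
            raise ValueError(f"unsupported Gap operation: {part!r}")
        out.append(f"{count}{op}")
    return "".join(out)
```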

@reece changed the title from "Automate UTA updates and releases" to "Update UTA and automate future updates" on Aug 19, 2023
@andreasprlic
Member

@reece can we grant all people who will work on this project access to the minion host (or whatever it is called nowadays)?

  • does this host have a recent copy of ncbi-mirrors?

@reece
Member

reece commented Sep 5, 2023

@andreasprlic Yes.

The machine is stuart (yes, a minion). Yes, current ncbi-mirrors.

Since this account is paid for by MyOme, I'd like to not share root/sudo. That means we'll need a bit of planning.

It would be helpful to have each person's login, real name, and ssh pub key beforehand so that I can do this in a batch.

I'll also make a separate pg RDS database for hackathon dev.

@ktennessenInvitae

@reece What is the best way to send you the ssh pub key?

@reece
Member

reece commented Sep 9, 2023

Hi @ktennessenInvitae: I just DM'd you on Slack.
@andreasprlic: Ditto
