This repository contains code used for the p50 infertility project. There are two separate tasks:
- Running CELLECT on ovary datasets with infertility and hormone GWAS sumstats to prioritise etiologic cell types.
- Finding marker genes for clusters from the ovary datasets.
Using CELLEX and CELLECT on single cell RNA-seq ovary datasets with infertility GWAS summary statistics
To use CELLEX and CELLECT, follow the instructions in their GitHub repositories. Once the CELLECT repository has been cloned from GitHub, create a subdirectory named p50 for this project.
The basic directory structure for the p50 infertility project work:
|-- cluster_markers
|   |-- GSE118127
|   |-- GSE202601
|   `-- GSE213216
|-- data
|   |-- counts
|   |-- esmu
|   `-- sumstats
|-- dbSNP
|-- logs
`-- plots
- CELLECT_OUT_p50 - Created when CELLECT is run. Contains CELLECT output files.
- cluster_markers - Store cluster_marker_genes output files here.
- data - Store input data for CELLEX and CELLECT here.
- dbSNP - Store the MarkerName to RSID map file here.
- logs - For logs.
- plots - Store plots generated from CELLECT results here.
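The layout above could be created in one step with a short script. The following is a minimal sketch, not part of the repository: the function name `make_project_dirs` is hypothetical, and the list of subdirectories is copied from the tree shown earlier. CELLECT_OUT_p50 is left out because CELLECT creates it when it runs.

```python
from pathlib import Path

# Subdirectories copied from the directory tree in this README.
# CELLECT_OUT_p50 is omitted on purpose: CELLECT creates it itself.
SUBDIRS = [
    "cluster_markers/GSE118127",
    "cluster_markers/GSE202601",
    "cluster_markers/GSE213216",
    "data/counts",
    "data/esmu",
    "data/sumstats",
    "dbSNP",
    "logs",
    "plots",
]


def make_project_dirs(root):
    """Create the p50 layout under `root` and return the resulting paths.

    Existing directories are left untouched, so the call is safe to repeat.
    """
    root = Path(root)
    created = []
    for sub in SUBDIRS:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_project_dirs("p50")` from the location where the p50 subdirectory should live creates every folder in the tree and leaves any existing content in place.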
Once the directory structure is set up, follow the pipeline below.
- Download data
We need scRNA-seq count data and the corresponding cell type annotation metadata. We also need GWAS summary statistics (in-house). See datasets for more information.
- Set up environments
Install the required packages or create the recommended conda environments. More information is given in set_up.
- Prepare ESMU files (run CELLEX)
Using the counts and cell type annotation metadata as input, we use CELLEX to produce expression specificity (ESMU) files. See prepare_esmu for the R and Python code used to prepare the data and run CELLEX.
- Prepare sumstats file
Use the pipeline provided in prepare_sumstats to prepare the GWAS summary statistics for input to CELLECT.
- Run CELLECT
Using the munged summary statistics and ESMU files as input, we use CELLECT to prioritise etiological cell types. Use the config_p50.yml file provided. See run_cellect.
- Visualisation
Use R to visualise the results. See visualisation.
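For the sumstats preparation step, the MarkerName to RSID map stored in dbSNP has to be applied to the summary statistics at some point. The actual procedure is the one in prepare_sumstats; the sketch below only illustrates the idea under stated assumptions. It assumes both files are tab-delimited with a header row, that the map file has columns named `MarkerName` and `RSID`, and that the sumstats file has a `MarkerName` column. The function names are hypothetical.

```python
import csv


def load_rsid_map(map_path, key_col="MarkerName", rsid_col="RSID"):
    """Read a tab-delimited map file into a {MarkerName: RSID} dict.

    Column names are assumptions; adjust them to the real map file.
    """
    with open(map_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[key_col]: row[rsid_col] for row in reader}


def add_rsids(sumstats_path, out_path, rsid_map,
              key_col="MarkerName", rsid_col="RSID"):
    """Copy sumstats rows to `out_path` with an added RSID column.

    Rows whose MarkerName is missing from the map are dropped.
    Returns a (kept, dropped) tuple of row counts.
    """
    kept = dropped = 0
    with open(sumstats_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fieldnames = list(reader.fieldnames) + [rsid_col]
        writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        for row in reader:
            rsid = rsid_map.get(row[key_col])
            if rsid is None:
                dropped += 1  # no RSID known for this variant
                continue
            row[rsid_col] = rsid
            writer.writerow(row)
            kept += 1
    return kept, dropped
```

Dropping unmapped variants is a choice made for this sketch. Returning the kept and dropped counts makes it easy to check how many variants were lost before passing the file on.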
Find marker genes for the clusters in the three single-cell RNA-seq ovary datasets. See cluster_marker_genes.