`rna-features` is a package used to generate machine-learning features from RNAseq data.

SpikyClip/rna-features

rna-features

rna-features is a package used to generate machine-learning features from RNAseq data. Given a list of dataset directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' matrix of gene Transcripts per Million (TPM) across samples (generated by the llrnaseq pipeline), it generates a feature matrix containing the following features per dataset:

  • Gene breadth (p <= p-value)
    • down (log2FC <= -1)
    • neither (-1 < log2FC < 1)
    • up (log2FC >= 1)
  • log2FC (p <= p-value)
    • Median Absolute Deviation (MAD)
    • Maximum
    • Median
  • TPM
    • MAD
    • Maximum
    • Median
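The feature definitions above can be sketched in pandas. This is a minimal illustration, not the package's implementation: the long-format input table, the helper names `median_abs_deviation` and `log2fc_features`, and the choice to filter on the `pvalue` column (rather than `padj`) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def median_abs_deviation(values: pd.Series) -> float:
    """Median of the absolute deviations from the median."""
    arr = values.to_numpy(dtype=float)
    return float(np.median(np.abs(arr - np.median(arr))))


def log2fc_features(contrasts: pd.DataFrame, p_value: float = 0.05) -> pd.DataFrame:
    """Summarise log2FC per gene across contrasts.

    `contrasts` is a hypothetical long table with the columns
    gene, log2FoldChange and pvalue (one row per gene per contrast).
    """
    # Assumption: rows are kept when pvalue <= cutoff.
    significant = contrasts[contrasts["pvalue"] <= p_value]
    grouped = significant.groupby("gene")["log2FoldChange"]
    return pd.DataFrame({
        # Breadth: number of contrasts in each regulation class.
        "down": grouped.apply(lambda s: int((s <= -1).sum())),
        "neither": grouped.apply(lambda s: int(((s > -1) & (s < 1)).sum())),
        "up": grouped.apply(lambda s: int((s >= 1).sum())),
        # Summary statistics of the retained log2FC values.
        "mad": grouped.apply(median_abs_deviation),
        "max": grouped.max(),
        "median": grouped.median(),
    })


# Toy data: g1 appears in three contrasts, g2 in two (one not significant).
toy = pd.DataFrame({
    "gene": ["g1", "g1", "g1", "g2", "g2"],
    "log2FoldChange": [1.5, 0.5, 2.5, -2.0, -3.0],
    "pvalue": [0.01, 0.02, 0.03, 0.001, 0.20],
})
features = log2fc_features(toy)
print(features)
```

The TPM features would follow the same pattern, applying the three statistics across the sample columns of each gene's row in the TPM matrix.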

These features are written to a feature_matrix file in both .csv and .pkl formats (the .pkl file can be loaded as a pandas DataFrame with pandas.read_pickle(path)). Below is an output preview:

                         regulation              log2foldchange                              tpm                        
                               down neither   up            mad        max     median        mad         max      median
dataset gene                                                                                                            
set_1   Solyc00g500063.1        0.0     1.0  0.0       0.000000   0.953245   0.953245   8.412766   54.887642   27.721765
        Solyc00g500185.1        0.0     0.0  1.0       0.000000   1.333732   1.333732   0.135050    0.943789    0.254913
        Solyc01g005000.3        0.0     1.0  2.0       0.118566   1.097196   1.093001  44.024541  254.986816  108.668376
        Solyc01g005010.4        4.0     0.0  0.0       0.439194  -1.201843  -1.577684  13.191743   85.372719   12.014153
        Solyc01g005020.3        0.0     1.0  0.0       0.000000   0.649139   0.649139   6.994529   42.430080   18.944556
...                             ...     ...  ...            ...        ...        ...        ...         ...         ...
set_2   Solyc12g150103.1        0.0     2.0  3.0       0.245354   1.598051   1.049794   1.223475    7.559584    3.616534
        Solyc12g150108.1        1.0     0.0  0.0       0.000000 -23.707473 -23.707473   1.287612   13.105947    0.000000
        Solyc12g150113.1        0.0     1.0  4.0       0.251563   1.845714   1.397828  40.746832  193.108032   59.591179
        Solyc12g150124.1        0.0     0.0  2.0       0.076378   1.622478   1.546100   0.468325    4.811159    0.703217
        Solyc12g150132.1        0.0     0.0  1.0       0.000000   4.130969   4.130969   0.074633    0.551118    0.091994
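The preview has two header levels and a two-level (dataset, gene) index, so the .csv needs both passed to pandas when it is read back. The sketch below builds a tiny stand-in table with a few values from the preview and round-trips it through CSV text; it assumes the real feature_matrix.csv uses the layout pandas writes by default for such a table, which is not confirmed here.

```python
import io

import pandas as pd

# Tiny stand-in for the feature matrix (values copied from the preview).
columns = pd.MultiIndex.from_tuples([
    ("regulation", "down"), ("regulation", "neither"), ("regulation", "up"),
    ("log2foldchange", "median"), ("tpm", "median"),
])
index = pd.MultiIndex.from_tuples(
    [("set_1", "Solyc00g500063.1"), ("set_1", "Solyc00g500185.1")],
    names=["dataset", "gene"],
)
matrix = pd.DataFrame(
    [[0.0, 1.0, 0.0, 0.953245, 27.721765],
     [0.0, 0.0, 1.0, 1.333732, 0.254913]],
    index=index,
    columns=columns,
)

# Write to an in-memory CSV, then read it back with both header rows
# and both index columns declared.
buffer = io.StringIO()
matrix.to_csv(buffer)
buffer.seek(0)
loaded = pd.read_csv(buffer, header=[0, 1], index_col=[0, 1])

# Select one feature group, or a single value by (row, column) tuples.
regulation = loaded["regulation"]
up_count = loaded.loc[("set_1", "Solyc00g500185.1"), ("regulation", "up")]
print(regulation)
print(up_count)
```

With a file on disk, the same `header=[0, 1], index_col=[0, 1]` arguments would be passed together with the file path in place of the buffer.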

Installation

To install rna-features, download the latest .whl binary from the releases page and install it with pip (note: the package cannot currently be installed with Python 3.10, because dependencies such as numpy have not yet released compatible wheels):

wget https://github.com/SpikyClip/rna-features/releases/download/0.1.1-dev/rna_features-0.1.1-py3-none-any.whl

pip install rna_features-0.1.1-py3-none-any.whl

This will install rna-features as a Python package, and the rna-features command will be available on $PATH. To check that the installation succeeded:

rna-features -h

The following help message should appear:

usage: rna-features [-h] [-p p-value] dir [dir ...]

Generates machine-learning features from RNAseq data. Takes a list of 
directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' file 
(containing a matrix of tpm values of genes against sample) returning a 
'feature_matrix.csv' containing gene expression breadth and log2fc/tpm 
mad, max and median for each gene.

positional arguments:
  dir         Dataset directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' matrix file.

optional arguments:
  -h, --help  show this help message and exit
  -p p-value  p-value cutoff for filtering log2fc values [default: 0.05]

Usage

To use rna-features, specify a list of directories each containing DESeq2 .csv contrast files and one tpm.tsv file:

rna-features dataset_1 dataset_2 dataset_3

An optional p-value cutoff can be specified:

rna-features -p 0.005 dataset_1 dataset_2 dataset_3
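Before running the command it can help to confirm that each directory has the expected layout. The helper below is a hypothetical pre-flight check, not part of rna-features; it only tests for one tpm.tsv and at least one .csv file per directory, as described above.

```python
from pathlib import Path


def check_dataset_dir(path):
    """Return a list of problems found in one dataset directory (empty if none)."""
    directory = Path(path)
    if not directory.is_dir():
        return [f"{directory} is not a directory"]
    problems = []
    if not (directory / "tpm.tsv").is_file():
        problems.append(f"{directory} has no tpm.tsv")
    if not any(directory.glob("*.csv")):
        problems.append(f"{directory} has no .csv contrast files")
    return problems


# Report any layout problems for the directories used in the example command.
for name in ["dataset_1", "dataset_2", "dataset_3"]:
    for problem in check_dataset_dir(name):
        print(problem)
```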

Additional Notes

  • The contrast files (*.csv) should be in the following format:
                    "",      "baseMean", "log2FoldChange",          "lfcSE",           "stat",            "pvalue",              "padj"
    "Solyc01g005000.3",4496.05232181299, 1.09719580776875,0.313072912511878, 3.50460152865228,0.000457291165260712, 0.0115280270712814
    "Solyc01g005340.3",540.376944106274, 0.52013987940027,0.170624565359894, 3.04844661906186,  0.0023002777636722, 0.0362570019128406
    "Solyc01g005390.3",16.4785747787331,-1.85885261292963,0.471053842373692,-3.94615741496274,7.94154133931579e-05,0.00287540425470711
    "Solyc01g005410.4",1181.71130130374, 1.37296624988023,0.394738835793252, 3.47816359928501,0.000504861691439399, 0.0125485785916511
    
  • The tpm matrix (tpm.tsv) should be in the following tab-delimited format:
    gene_id	01-0-hr-C1	02-0-hr-C2	03-0-hr-C3	04-0-hr-JA1
    Solyc00g500003.1	0.030844	0.011062	0.006824
    Solyc00g500041.1	1.515571	1.78357	1.503047
    Solyc00g500042.1	0.258916	0.273953	0.248473
    
  • NaN values may occur in the regulation and log2foldchange columns if the tpm.tsv matrix contains a broader set of genes than those found in the contrast files. Such NaN values must be handled by the user.
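Two common ways of handling these NaN values are sketched below on a small made-up table (the gene names and numbers are illustrative only). Which option is appropriate depends on why a gene is missing from the contrast files; filling the regulation counts with zero in particular is an assumption, not a recommendation from the package.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("regulation", "down"), ("regulation", "neither"), ("regulation", "up"),
    ("log2foldchange", "median"), ("tpm", "median"),
])
index = pd.MultiIndex.from_tuples(
    [("set_1", "gene_a"), ("set_1", "gene_b")], names=["dataset", "gene"]
)
# gene_b is present in the TPM matrix but absent from every contrast file.
matrix = pd.DataFrame(
    [[0.0, 1.0, 0.0, 0.95, 27.7],
     [np.nan, np.nan, np.nan, np.nan, 3.6]],
    index=index,
    columns=columns,
)

# Option 1: keep only genes that have a value for every feature.
complete = matrix.dropna()

# Option 2: treat a gene missing from the contrasts as counted in none of
# them, leaving the log2foldchange statistics as NaN.
filled = matrix.copy()
for column in filled.columns:
    if column[0] == "regulation":
        filled[column] = filled[column].fillna(0)

print(complete)
print(filled)
```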
