This repository contains an R-based analysis pipeline for investigating differential gene expression in HER2/ERBB2+ breast cancer using TCGA RNA-seq data.
# Install BiocManager if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Install required packages
BiocManager::install(c(
  "DESeq2",           # For differential expression analysis
  "clusterProfiler",  # For pathway analysis
  "pheatmap",         # For heatmap visualization
  "ComplexHeatmap",   # For advanced heatmap visualization
  "glmnet",           # For LASSO regression
  "org.Hs.eg.db"      # For gene annotation
))
# Additional CRAN packages
install.packages(c(
  "survival",   # For survival analysis
  "survminer",  # For survival visualization
  "ggplot2",    # For plotting
  "reshape2"    # For data manipulation
))
The pipeline expects three input files from the TCGA breast cancer dataset:
- RNA-seq data (data_mrna_seq_v2_rsem.txt)
- Clinical data (data_clinical_patient.txt)
- Copy Number Alteration (CNA) data (data_cna.txt)
# Load data files
rna_seq <- read.delim("data_mrna_seq_v2_rsem.txt", sep="\t", header=TRUE)
clinical <- read.delim("data_clinical_patient.txt", sep="\t", header=TRUE)
cna <- read.delim("data_cna.txt", sep="\t", header=TRUE)
Key preprocessing steps:
- ID standardization across datasets
- Removal of metadata rows from clinical data
- Handling of missing values
- Sample matching across datasets
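The metadata-row removal and sample-matching steps above can be sketched in base R as follows. This is a minimal illustration, not the pipeline's own code: it assumes the clinical file marks its metadata rows with a leading `#` (as cBioPortal-style exports usually do), and the toy file contents, the column names `PATIENT_ID` and `OS_MONTHS`, and the object names are all hypothetical.

```r
# Toy clinical file in the layout assumed here: metadata lines start
# with "#", followed by the real header row and the patient rows.
clin_file <- tempfile(fileext = ".txt")
writeLines(c(
  "#Patient Identifier\tOverall Survival (Months)",
  "#STRING\tNUMBER",
  "PATIENT_ID\tOS_MONTHS",
  "TCGA-AA-0001\t12.5",
  "TCGA-AA-0002\t40.1",
  "TCGA-AA-0003\t"
), clin_file)

# comment.char = "#" makes read.delim skip the metadata lines;
# the empty OS_MONTHS field is read as NA.
clinical <- read.delim(clin_file, comment.char = "#")

# Toy expression matrix whose column names use dots instead of hyphens.
expr <- matrix(1:6, nrow = 2,
               dimnames = list(c("ERBB2", "GRB7"),
                               c("TCGA.AA.0002", "TCGA.AA.0001", "TCGA.AA.0009")))

# Bring the clinical IDs to the same style, then keep only the shared
# samples, in the same order, in both objects.
clinical$PATIENT_ID <- gsub("-", ".", clinical$PATIENT_ID, fixed = TRUE)
shared <- intersect(colnames(expr), clinical$PATIENT_ID)
expr_matched <- expr[, shared, drop = FALSE]
clin_matched <- clinical[match(shared, clinical$PATIENT_ID), ]

# Verify that the two objects now line up sample by sample.
stopifnot(identical(colnames(expr_matched), clin_matched$PATIENT_ID))
```

One caveat with `comment.char = "#"`: it also truncates any data field that happens to contain a `#`, so dropping the metadata rows by position may be the safer choice for files where that can occur.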
The pipeline classifies samples based on ERBB2 amplification status:
- Amplified: ERBB2 CNA value > 0
- Not Amplified: ERBB2 CNA value ≤ 0
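As a rough illustration of this rule, the snippet below derives the status factor from a toy CNA table. The table layout (one row per gene, a `Hugo_Symbol` column, one column per sample), the sample IDs, and the level name `Not_Amplified` (written with an underscore so it is safe to use in model formulas) are assumptions made for the example.

```r
# Toy CNA table in the layout assumed here: one row per gene,
# one column per sample (hypothetical sample IDs and values).
cna <- data.frame(
  Hugo_Symbol  = c("TP53", "ERBB2", "MYC"),
  TCGA.AA.0001 = c(0, 2, 1),
  TCGA.AA.0002 = c(-1, 0, 0),
  TCGA.AA.0003 = c(0, 1, 2),
  TCGA.AA.0004 = c(0, -1, 0)
)

# Pull the ERBB2 row and drop the annotation column.
erbb2_row <- cna[cna$Hugo_Symbol == "ERBB2", setdiff(names(cna), "Hugo_Symbol")]
erbb2_cna <- unlist(erbb2_row)  # named numeric vector, one value per sample

# Amplified if CNA > 0, otherwise Not Amplified.
erbb2_status <- factor(
  ifelse(erbb2_cna > 0, "Amplified", "Not_Amplified"),
  levels = c("Not_Amplified", "Amplified")
)
names(erbb2_status) <- names(erbb2_cna)

table(erbb2_status)
```

Putting `Not_Amplified` first makes it the reference level, so downstream fold changes read as Amplified versus Not Amplified.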
DESeq2 is used for:
- Data normalization
- Differential expression testing
- Variance stabilizing transformation (VST)
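A minimal sketch of these three steps is shown below. It runs on simulated counts from DESeq2's `makeExampleDESeqDataSet()` instead of the TCGA matrix, and the relabelled `ERBB2_Status` factor and the object names `dds`, `res` and `vsd` are assumptions for illustration. With the real input, note that RSEM estimates are typically non-integer, so they would likely need `round()` before building the dataset, because DESeq2 expects integer counts.

```r
library(DESeq2)

# Simulated counts stand in for the TCGA matrix (illustration only).
set.seed(42)
dds <- makeExampleDESeqDataSet(n = 2000, m = 8)

# Relabel the simulated two-group factor as ERBB2 status.
dds$ERBB2_Status <- factor(
  ifelse(dds$condition == "B", "Amplified", "Not_Amplified"),
  levels = c("Not_Amplified", "Amplified")
)
design(dds) <- ~ ERBB2_Status

# Drop genes with fewer than 10 reads in total before fitting.
dds <- dds[rowSums(counts(dds)) >= 10, ]

# Normalization (size factors), dispersion estimation and Wald tests.
dds <- DESeq(dds)
res <- results(dds, contrast = c("ERBB2_Status", "Amplified", "Not_Amplified"))

# Variance-stabilized values for PCA and heatmaps.
vsd <- vst(dds, blind = FALSE)

# Genes ranked by adjusted p-value.
head(res[order(res$padj), ])
```

Naming the transformed object `vsd` avoids confusion with the `vst()` function itself.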
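# Generate PCA plot from the variance-stabilized data.
# Note: `vst` here must be the transformed object (e.g. the output of
# DESeq2's vst() function), not the function itself.
pca_data <- plotPCA(vst, intgroup = "ERBB2_Status", returnData = TRUE)
ggplot(pca_data, aes(PC1, PC2, color = ERBB2_Status)) +
  geom_point(size = 3) +   # one point per sample
  geom_density2d()         # density contours per group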
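# Generate heatmap of top DE genes.
# `mat` is a genes x samples matrix; `metadata_factors` is a data frame
# whose row names match colnames(mat).
pheatmap(mat,
         annotation_col = metadata_factors,  # sample annotations (e.g. ERBB2 status)
         scale = "row",                      # z-score each gene across samples
         show_rownames = TRUE)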
clusterProfiler is used for:
- GO enrichment analysis
- GSEA analysis
- Pathway visualization
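GSEA functions work on a ranked gene list, not a thresholded set, so the differential expression results have to be reshaped first. The base-R sketch below shows one way to do that; the toy table, its values, and the choice of log2 fold change as the ranking statistic are assumptions for illustration.

```r
# Toy differential expression table (hypothetical values; column names
# mirror a DESeq2 results table).
de <- data.frame(
  gene = c("ERBB2", "GRB7", "ESR1", "TP53", "ESR1"),
  log2FoldChange = c(3.2, 2.8, -1.5, NA, -1.1),
  stringsAsFactors = FALSE
)

# Drop missing statistics, then keep one entry per gene
# (the one with the largest absolute change).
de <- de[!is.na(de$log2FoldChange), ]
de <- de[order(-abs(de$log2FoldChange)), ]
de <- de[!duplicated(de$gene), ]

# Named numeric vector sorted in decreasing order.
gene_list <- sort(setNames(de$log2FoldChange, de$gene), decreasing = TRUE)
gene_list
```

A vector in this form could then be passed as the `geneList` argument of `clusterProfiler::gseGO()`, together with `OrgDb = org.Hs.eg.db` and a `keyType` that matches the names (here `"SYMBOL"`).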
The pipeline implements LASSO-regularized Cox regression, covering:
- Patient stratification
- Survival curve generation
- Risk score calculation
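The sketch below shows how these pieces can fit together using `glmnet` and `survival`. It runs on simulated data, and the median split used for stratification, the choice of `lambda.min`, and all object names are assumptions, not necessarily what the pipeline does.

```r
library(glmnet)
library(survival)

# Simulated stand-in data: 120 patients, 30 genes (hypothetical).
set.seed(1)
n <- 120
p <- 30
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("gene", 1:p)))
true_risk <- 0.8 * x[, 1] - 0.8 * x[, 2]
time <- rexp(n, rate = 0.05 * exp(true_risk))
status <- rbinom(n, 1, 0.7)  # 1 = event observed, 0 = censored
y <- cbind(time = time, status = status)

# Cross-validated LASSO (alpha = 1) Cox model.
cv_fit <- cv.glmnet(x, y, family = "cox", alpha = 1)

# Risk score = linear predictor at the selected penalty.
risk_score <- as.numeric(
  predict(cv_fit, newx = x, s = "lambda.min", type = "link")
)

# Median split into low- and high-risk groups, then Kaplan-Meier curves.
risk_group <- factor(
  ifelse(risk_score > median(risk_score), "High", "Low"),
  levels = c("Low", "High")
)
km_fit <- survfit(Surv(time, status) ~ risk_group)
print(km_fit)
```

Scoring the same samples that were used for fitting, as done here for brevity, gives an optimistic picture of the separation; a held-out set would be needed for an honest estimate.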
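standardize_ids_strict <- function(ids) {
  ids <- toupper(ids)                  # uppercase everything
  ids <- gsub("[^A-Z0-9.]", ".", ids)  # replace any other character with "."
  ids <- gsub("\\.\\d+$", "", ids)     # drop a trailing numeric suffix such as ".01"
  ids <- gsub("\\.+", ".", ids)        # collapse repeated dots
  ids <- sub("\\.$", "", ids)          # remove a trailing dot
  return(ids)
}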
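A short usage example may help show what the function does to differently formatted IDs. The function is repeated so the snippet runs on its own; the input IDs are made up for illustration.

```r
# Function repeated from above so the snippet runs on its own.
standardize_ids_strict <- function(ids) {
  ids <- toupper(ids)
  ids <- gsub("[^A-Z0-9.]", ".", ids)
  ids <- gsub("\\.\\d+$", "", ids)
  ids <- gsub("\\.+", ".", ids)
  sub("\\.$", "", ids)
}

# Sample-level, lowercase and patient-level spellings of the same ID.
ids <- c("TCGA-3C-AAAU-01", "tcga.3c.aaau.01", "TCGA-3C-AAAU")
standardize_ids_strict(ids)
# [1] "TCGA.3C.AAAU" "TCGA.3C.AAAU" "TCGA.3C.AAAU"
```

Because the third substitution removes any trailing all-digit segment, an ID whose final segment consists only of digits is shortened as well: `standardize_ids_strict("TCGA-A1-1234")` returns `"TCGA.A1"`. It is worth checking for duplicated IDs after standardization.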
- Removes low-count genes (< 10 counts)
- Handles missing values in survival data
- Matches samples across datasets
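The first two filters above might look roughly like this in base R. The example reads "< 10 counts" as fewer than 10 counts summed across all samples, which is an assumption, and the toy data and the column names `OS_MONTHS` and `OS_EVENT` are hypothetical.

```r
# Toy count matrix: genes in rows, samples in columns.
counts <- matrix(
  c(0,  1,  2,
    50, 60, 70,
    3,  3,  3,
    0,  0,  12),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"), c("S1", "S2", "S3"))
)

# Keep genes with at least 10 counts in total (assumed interpretation).
keep <- rowSums(counts) >= 10
filtered <- counts[keep, , drop = FALSE]

# Toy survival table with missing values.
surv <- data.frame(
  sample    = c("S1", "S2", "S3"),
  OS_MONTHS = c(12, NA, 30),
  OS_EVENT  = c(1, 0, NA),
  stringsAsFactors = FALSE
)

# Keep only samples with both survival time and event status.
surv_complete <- surv[complete.cases(surv[, c("OS_MONTHS", "OS_EVENT")]), ]

rownames(filtered)
surv_complete$sample
```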
The pipeline generates:
- Differential expression results
- PCA plots
- Heatmaps
- Pathway enrichment results
- Survival analysis plots
- Risk stratification results
The pipeline includes validation steps:
- Known ERBB2+ signature genes verification
- Data quality checks
- Sample matching verification
- Survival data completeness checks
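The signature-gene check could be sketched as below. The gene list (ERBB2 plus a few neighbouring genes often reported as co-amplified with it), the toy results table and its values, and the 0.05 threshold are all illustrative assumptions, not the pipeline's actual signature or output.

```r
# Toy results table (hypothetical values), indexed by gene symbol.
res <- data.frame(
  log2FoldChange = c(3.1, 2.7, 2.2, -0.1),
  padj           = c(1e-30, 1e-20, 1e-12, 0.8),
  row.names      = c("ERBB2", "GRB7", "STARD3", "ACTB")
)

# Illustrative signature: genes expected to be up in amplified samples.
signature_genes <- c("ERBB2", "GRB7", "STARD3", "PGAP3")

# Report signature genes that are absent from the results.
missing_genes <- setdiff(signature_genes, rownames(res))

# For the genes that are present, check direction and significance.
present <- intersect(signature_genes, rownames(res))
check <- res[present, ]
check$up_and_significant <- check$log2FoldChange > 0 & check$padj < 0.05

missing_genes
check
```

A signature gene that is missing, or that fails the direction check, is a hint to revisit ID matching or group labels before interpreting the rest of the results.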
# Load required libraries
source("required_libraries.R")
# Run analysis pipeline
source("main_analysis.R")
# Generate visualizations
source("visualization.R")
# Perform survival analysis
source("survival_analysis.R")
- Ensure all input files are in the correct format
- Monitor memory usage with large datasets
- Consider using parallel processing for large-scale analyses
- Verify sample IDs match across all input files
Common issues and solutions:
- Sample ID mismatches: Use the standardize_ids_strict function
- Memory issues: Filter low-count genes early
- Missing survival data: Check completeness of clinical data
- Zero-variance genes: Remove before LASSO regression
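For the last item, a minimal base-R sketch of dropping zero-variance genes is shown below. It assumes the predictor matrix has samples in rows and genes in columns, which is the orientation `glmnet` expects; the toy values are made up.

```r
# Toy predictor matrix: samples in rows, genes in columns.
x <- cbind(
  geneA = c(1.2, 3.4, 2.2, 5.0),
  geneB = c(7, 7, 7, 7),          # constant across samples
  geneC = c(0.5, 0.1, 0.9, 0.3)
)

# Per-gene variance; keep genes that actually vary.
gene_var <- apply(x, 2, var)
x_filtered <- x[, !is.na(gene_var) & gene_var > 0, drop = FALSE]

colnames(x_filtered)
# [1] "geneA" "geneC"
```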
Feel free to submit issues and enhancement requests!