Project Topics

This page summarizes the challenge projects that GEN242 students selected and worked on in the current and past course offerings. The projects are designed to strengthen understanding of data analysis problems, method selection, and the interpretation of computational results. Each challenge task is paired with a research paper that students present at an early stage of the project. Students then develop the topic into a small research-style project, present their results at the end of the course, and submit a report written in the format of a short scientific paper that follows reproducible research practices.

Project Topics from this Year (2026)

Expression Profiling

Splice-aware RNA-seq alignment with HISAT2

Area: RNA-seq quantification · Methods: HISAT2, featureCounts · Data type: bulk RNA-seq

Paper covered: Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12:357–360. PMID: 25751142.

Summary: This project benchmarks HISAT2 plus featureCounts against Kallisto-derived quantification by comparing alignment or quantification rates, detected genes, count-table agreement, runtime/resource use, and downstream DEG overlap.

Pseudoalignment and transcript quantification with Kallisto

Area: RNA-seq quantification · Methods: Kallisto, tximport/tximeta · Data type: bulk RNA-seq

Paper covered: Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34:525–527. PMID: 27043002.

Summary: This project benchmarks Kallisto pseudoalignment against HISAT2-based gene counting by comparing transcript-to-gene count estimates, runtime/resource use, count correlations, and the resulting DEG overlap and discrepancies.

Differential expression with DESeq2

Area: differential expression · Methods: DESeq2 · Data type: bulk RNA-seq counts

Paper covered: Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550. PMID: 25516281.

Summary: This project applies DESeq2 to the shared count matrix and contributes one DEG result set to the group-level comparison, evaluating DEG ranks, shrunken fold changes, enrichment results, and overlap with the limma/voom and edgeR outputs.

Differential expression with limma/voom

Area: differential expression · Methods: limma, voom, edgeR preprocessing · Data type: bulk RNA-seq counts

Paper covered: Guo Y, Li C-I, Ye F, Shyr Y (2013) Evaluation of read count based RNAseq analysis methods. BMC Genomics 14 Suppl 8:S2. PMID: 24564449.

Summary: This project applies limma/voom to the same count matrix used by the DEG group and compares its DEG calls, volcano/MA-plot behavior, enrichment results, and gene-list overlap against DESeq2 and edgeR.

Differential expression with edgeR

Area: differential expression · Methods: edgeR · Data type: bulk RNA-seq counts

Paper covered: Zhou X, Lindsay H, Robinson MD (2014) Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Research 42:e91. PMID: 24753412.

Summary: This project applies edgeR to the shared count matrix and focuses on method performance relative to DESeq2 and limma/voom, including DEG overlap, rank concordance, visualization, and downstream functional enrichment differences.
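
DEG overlap and rank concordance, the comparison criteria shared by the three differential expression projects above, can be expressed with simple set and rank statistics. The following is a minimal Python sketch for illustration only: the gene identifiers, scores, and function names are hypothetical, and the packages listed in this section are R packages with their own interfaces.

```python
def jaccard(a, b):
    """Jaccard index of two DEG sets: size of intersection / size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks.

    Assumes equal-length inputs and that neither input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical DEG calls and per-gene scores from two methods
degs_a = {"g1", "g2", "g3", "g4"}
degs_b = {"g2", "g3", "g4", "g5"}
scores_a = [5.1, 3.2, 2.8, 0.4, 0.1]
scores_b = [4.0, 3.9, 1.5, 0.2, 0.3]

print("Jaccard overlap:", jaccard(degs_a, degs_b))
print("Spearman rho:", spearman(scores_a, scores_b))
```

In this toy input the two DEG sets share three of five distinct genes, and the two score vectors differ only in the order of the two lowest-ranked genes.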

Functional Interpretation

GSEA and ORA with fgsea

Area: functional enrichment · Methods: fgsea, ORA, GSEA · Data type: ranked gene lists or DEG sets

Paper covered: Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102:15545–15550. PMID: 16199517.

Summary: This project compares ORA and preranked GSEA/fgsea on the same DEG-derived gene sets or ranked lists, emphasizing concordance of enriched pathways, rank-based differences, and method-dependent biological interpretation.
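
Over-representation analysis (ORA) is commonly based on an upper-tail hypergeometric test. The sketch below shows that calculation under stated assumptions; the function name and the example numbers are hypothetical, and it is not the fgsea implementation.

```python
from math import comb


def ora_pvalue(n_universe, n_in_set, n_degs, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    n_universe: number of genes tested (background)
    n_in_set:   background genes annotated to the gene set
    n_degs:     size of the DEG list drawn from the background
    n_overlap:  DEGs that fall into the gene set
    """
    total = comb(n_universe, n_degs)
    upper = min(n_in_set, n_degs)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 background genes, 5 in the pathway,
# 4 DEGs of which 2 belong to the pathway
p = ora_pvalue(n_universe=20, n_in_set=5, n_degs=4, n_overlap=2)
print(round(p, 4))
```

ORA needs a hard DEG cutoff to define the gene list, whereas preranked GSEA uses the full ranking. That difference is one reason the two approaches can disagree on the same comparison.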

Sample-level pathway scoring with GSVA

Area: pathway activity analysis · Methods: GSVA · Data type: gene expression matrix

Paper covered: Hänzelmann S, Castelo R, Guinney J (2013) GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 14:7.

Summary: This project applies GSVA to compute sample-level pathway scores and compares pathway-driven clustering, group separation, and top pathway signals against ORA/GSEA and PROGENy-based interpretations.

Pathway activity inference with PROGENy

Area: pathway/network inference · Methods: PROGENy, decoupleR · Data type: gene expression profiles

Paper covered: Schubert M, Klinger B, Klünemann M, Sieber A, Uhlitz F, Sauer S, Garnett MJ, Blüthgen N, Saez-Rodriguez J (2018) Perturbation-response genes reveal signaling footprints in cancer gene expression. Nature Communications 9:20.

Summary: This project applies PROGENy/decoupleR to infer pathway activity from gene expression data and compares the inferred pathway signals with GSVA scores and ORA/GSEA results from the same biological comparison.

AI / Machine Learning Classification

Random Forest classification

Area: supervised learning · Methods: randomForest, ranger · Data type: high-dimensional omics features

Paper covered: Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7:3.

Summary: This project trains a random forest classifier on shared omics features and evaluates predictive performance, feature importance, and agreement with XGBoost and SHAP-based interpretations within the group.

Random Forest variable importance bias

Area: model interpretation · Methods: conditional inference forests, partykit/cforest · Data type: high-dimensional omics features

Paper covered: Strobl C, Boulesteix A-L, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8:25.

Summary: This project tests how random forest feature-importance rankings change under standard versus bias-reduced approaches and compares the selected predictors with those prioritized by XGBoost and SHAP analyses.

XGBoost classification

Area: supervised learning · Methods: XGBoost · Data type: high-dimensional omics features

Paper covered: Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.

Summary: This project trains an XGBoost classifier on the same classification task as the random forest projects and compares cross-validated performance, selected features, and interpretability outputs across tree-based methods.

SHAP-based feature interpretation

Area: explainable AI · Methods: TreeSHAP, treeshap, shapviz · Data type: trained tree-based models

Paper covered: Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, Katz R, Himmelfarb J, Bansal N, Lee S-I (2020) From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence 2:56–67.

Summary: This project applies SHAP to trained tree-based models to compare local and global feature contributions with random forest importance and XGBoost feature rankings, emphasizing method agreement and biological interpretability.
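
The local feature contributions that SHAP reports are Shapley values. For intuition, they can be computed exactly on a very small model by enumerating all feature coalitions. The sketch below does this by brute force; it is not the TreeSHAP algorithm, which reaches the same kind of attribution efficiently for tree ensembles. The toy model, the baseline, and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features outside a coalition are set to their baseline value.
    Cost grows as 2^n, so this is only practical for a few features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_tree(z):
    """Hypothetical depth-2 decision tree; the third feature is never used."""
    if z[0] > 0.5:
        return 1.0 if z[1] > 0.5 else 0.6
    return 0.2


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_tree, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for x and the prediction for the baseline, and the unused third feature receives a contribution of zero.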

Single-Cell Genomics

Cluster significance testing

Area: single-cell clustering · Methods: scran bootstrap clustering functions · Data type: scRNA-seq

Paper covered: Grabski IN, Street K, Irizarry RA (2023) Significance analysis for clustering with single-cell RNA-sequencing data. Nature Methods 20:1324–1330. PMID: 37679539.

Summary: This project tests the statistical support of single-cell cluster assignments and compares cluster stability, resolution effects, and biological interpretability against the group’s other single-cell analyses.

Dimensionality reduction benchmarking

Area: visualization and embedding · Methods: PCA, UMAP, t-SNE, diffusion maps · Data type: scRNA-seq

Paper covered: Huang H, Wang Y, Rudin C, Browne EP (2022) Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization. Communications Biology 5:719. PMID: 35851282.

Summary: This project benchmarks dimensionality reduction methods on the same scRNA-seq data by comparing cluster separation, preservation of known labels, visualization stability, and agreement with clustering or annotation results.

Automated cell-type annotation

Area: cell annotation · Methods: SingleR, celldex · Data type: scRNA-seq

Paper covered: Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP, Bhattacharya M, Bhattacharya A, et al. (2019) Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nature Immunology 20:163–172. PMID: 30643263.

Summary: This project applies reference-based cell-type annotation and compares annotation confidence, reference dependence, and consistency with unsupervised clusters, embeddings, and trajectory or subpopulation results.

Trajectory inference and pseudotime analysis

Area: developmental trajectory analysis · Methods: slingshot, tradeSeq · Data type: scRNA-seq

Paper covered: Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, Purdom E, Dudoit S (2018) Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19:477. PMID: 29914354.

Summary: This project applies trajectory and pseudotime inference and compares lineage structure, pseudotime-associated genes, and biological interpretation with cluster assignments and cell-type annotations from the shared analysis context.

Subpopulation prediction

Area: single-cell state analysis · Methods: muscat · Data type: multi-sample scRNA-seq

Paper covered: Crowell HL, Soneson C, Germain P-L, Calini D, Collin L, Raposo C, Malhotra D, Robinson MD (2020) muscat detects subpopulation-specific state transitions from multi-sample multi-group single-cell RNA-seq data. Nature Communications 11:6077. PMID: 33257685.

Summary: This project tests for subpopulation-specific state changes across sample groups and compares detected cell-state shifts with clustering, annotation, and trajectory-based interpretations.

Multi-Omics Analysis

Unsupervised multi-omics integration with MOFA+

Area: latent factor modeling · Methods: MOFA2 · Data type: matched RNA-seq and proteomics

Paper covered: Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, Stegle O (2020) MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biology 21:111. PMID: 32393329.

Additional benchmarking paper: Cantini L, Zakeri P, Hernandez C, Naldi A, Thieffry D, Remy E, Baudot A (2021) Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nature Communications 12:124. PMID: 33420054.

Summary: This project applies MOFA+ to the shared multi-omics data set and compares latent factors with pathwayPCA scores, clinical annotations, and supervised classification features identified by DIABLO and netDx.

Pathway-based multi-omics integration with pathwayPCA

Area: pathway-level integration · Methods: pathwayPCA · Data type: matched RNA-seq and proteomics

Paper covered: Odom GJ, Ban Y, Colaprico A, Liu L, Silva TC, Sun X, Pico AR, Zhang B, Wang L, Chen X (2020) PathwayPCA: an R/Bioconductor package for pathway based integrative analysis of multi-omics data. Proteomics 20:e1900409. PMID: 31610092.

Summary: This project applies pathwayPCA to RNA-seq and proteomics data and compares pathway-level sample scores, assay concordance, and clinical associations with MOFA+ factors and supervised multi-omics results.

Supervised multi-omics classification with DIABLO

Area: supervised multi-omics integration · Methods: mixOmics DIABLO / block.splsda · Data type: matched multi-omics profiles

Paper covered: Singh A, Shannon CP, Gautier B, Rohart F, Vacher M, Tebbutt SJ, Lê Cao K-A (2019) DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics 35:3055–3062. PMID: 30657866.

Summary: This project trains DIABLO multi-omics classifiers and compares cross-validated classification performance, selected molecular drivers, and single-assay versus integrated models against netDx and pathway-level analyses.

Patient similarity network classification with netDx

Area: network-based classification · Methods: netDx · Data type: matched multi-omics profiles

Paper covered: Pai S, Hui S, Isserlin R, Shah MA, Kaka H, Bader GD (2019) netDx: interpretable patient classification using integrated patient similarity networks. Molecular Systems Biology 15:e8497. PMID: 30981949.

Summary: This project builds netDx patient similarity network classifiers and compares pathway-level predictive networks, multi-omics versus single-omics performance, and feature agreement with DIABLO, pathwayPCA, and MOFA+.

Selection of Project Topics from Previous Years

RNA-seq and Expression Analysis

Comparison of RNA-seq aligners

Area: RNA-seq quantification · Methods: HISAT2, Rsubread, STAR, Kallisto · Data type: bulk RNA-seq

Representative paper(s): Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12:357–360; Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34:525–527.

Summary: This project compares RNA-seq alignment or quantification tools on the same data set and evaluates agreement in counts, detected features, DEG results, runtime, and downstream biological interpretation.

Comparison of DEG analysis methods

Area: differential expression · Methods: DESeq2, edgeR, limma/voom, baySeq · Data type: bulk RNA-seq counts

Representative paper(s): Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550; Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11:422.

Summary: This project compares multiple DEG methods on the same count matrix and evaluates overlap, rank concordance, visual summaries, enrichment results, and method-specific behavior under the same biological contrast.

Differential exon and transcript usage analysis

Area: transcript-level analysis · Methods: DEXSeq, Kallisto/Sleuth, DTUrtle · Data type: bulk RNA-seq

Representative paper(s): Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data. Genome Research 22:2008–2017; Pimentel H, Bray NL, Puente S, Melsted P, Pachter L (2017) Differential analysis of RNA-seq incorporating quantification uncertainty. Nature Methods 14:687–690.

Summary: This project compares exon- or transcript-level differential usage methods and evaluates how isoform-aware results agree with each other, with gene-level DEG results, and with published splice-variant findings.

Cluster and network analysis methods

Area: unsupervised learning · Methods: hierarchical clustering, k-means, fuzzy c-means, WGCNA, Clust · Data type: gene expression profiles

Representative paper(s): Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9:559; Abu-Jamous B, Kelly S (2018) Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data. Genome Biology 19:172.

Summary: This project benchmarks clustering and network-analysis methods by comparing group assignments, overlap between clusters/modules, functional enrichment outcomes, and optional performance summaries using annotation-based pseudo ground truth.

Clustering and embedding methods for scRNA-seq

Area: single-cell analysis · Methods: SC3, Seurat, PCA, t-SNE, UMAP, related methods · Data type: scRNA-seq

Representative paper(s): Kiselev VY et al. (2017) SC3: consensus clustering of single-cell RNA-seq data. Nature Methods 14:483–486; Sun S, Zhu J, Ma Y, Zhou X (2019) Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biology 20:269.

Summary: This project benchmarks scRNA-seq clustering and embedding methods using data with known or interpretable cell labels, comparing partition quality, visualization performance, scalability, and agreement with published benchmarks.
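
When cell labels are known, partition quality is often summarized with the adjusted Rand index (ARI). It is one possible metric, chosen here as an assumption since the summary does not name one. The sketch below computes it from the contingency counts; the label vectors are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same cells.

    Assumes equal-length label sequences with at least two cells.
    """
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical known cell labels and two clustering results
known = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same partition, different cluster names
oversplit = [0, 0, 1, 1, 2, 2]   # three clusters cutting across the labels

print(adjusted_rand_index(known, relabeled))
print(adjusted_rand_index(known, oversplit))
```

The index depends only on which cells are grouped together, so renaming clusters leaves it unchanged.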

ChIP-seq and Regulatory Genomics

Comparison of ChIP-seq peak callers

Area: peak detection · Methods: MACS2, PeakSeq, F-Seq, HOMER, CSAR, related tools · Data type: ChIP-seq

Representative paper(s): Zhang Y et al. (2008) Model-based analysis of ChIP-Seq. Genome Biology 9:R137; Feng J, Liu T, Qin B, Zhang Y, Liu XS (2012) Identifying ChIP-seq enrichment using MACS. Nature Protocols 7:1728–1740.

Summary: This project compares ChIP-seq peak callers on the same data set and evaluates shared and method-specific peaks, agreement with published or consensus peaks, and optional ROC/AUC-style performance summaries.
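
Shared and method-specific peaks can be identified by testing genomic intervals for overlap. The following minimal sketch assumes peaks are stored as hypothetical (chromosome, start, end) tuples with half-open coordinates and counts any overlap of at least one base pair. A real comparison would typically use an indexed interval library rather than this quadratic loop.

```python
def overlaps(p, q):
    """True if two (chrom, start, end) half-open intervals share >= 1 bp."""
    return p[0] == q[0] and p[1] < q[2] and q[1] < p[2]


def split_peaks(peaks_a, peaks_b):
    """Split peaks_a into peaks shared with peaks_b and peaks unique to A."""
    shared, unique = [], []
    for p in peaks_a:
        if any(overlaps(p, q) for q in peaks_b):
            shared.append(p)
        else:
            unique.append(p)
    return shared, unique


# Hypothetical peak sets from two peak callers
peaks_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
peaks_b = [("chr1", 150, 250), ("chr2", 300, 400)]

shared, unique = split_peaks(peaks_a, peaks_b)
print("shared:", shared)
print("unique to A:", unique)
```

Calling the function a second time with the arguments swapped gives the peaks unique to the second caller.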

Functional enrichment analysis of ChIP-seq targets

Area: functional interpretation · Methods: ORA, GSEA, GO/KEGG/Reactome enrichment · Data type: ChIP-seq peak-associated genes

Representative paper(s): Subramanian A et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102:15545–15550; Welch RP et al. (2014) ChIP-Enrich: gene set enrichment testing for ChIP-seq data. Nucleic Acids Research 42:e105.

Summary: This project compares functional enrichment methods for ChIP-seq peak-associated genes, emphasizing rank concordance, method-specific pathway calls, and whether inferred functions match expectations for the experiment.

Motif enrichment and motif discovery

Area: regulatory sequence analysis · Methods: PWMEnrich, MEME-ChIP, BCRANK, motif databases · Data type: ChIP-seq peak sequences

Representative paper(s): McLeay RC, Bailey TL (2010) Motif enrichment analysis: a unified framework and an evaluation on ChIP data. BMC Bioinformatics 11:165; Machanick P, Bailey TL (2011) MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27:1696–1697.

Summary: This project compares motif enrichment and motif discovery tools or peak-ranking strategies, evaluating overlap among detected motifs, agreement with known motifs, and effects of ranking criteria on motif results.

Programmable genome summary graphics

Area: genomic visualization · Methods: ggplot2, ggbio, Gviz, RCircos, Shiny · Data type: genomic ranges and annotations

Representative paper(s): Yin T, Cook D, Lawrence M (2012) ggbio: an R package for extending the grammar of graphics for genomic data. Genome Biology 13:R77; Hahne F, Ivanek R (2016) Visualizing genomic data using Gviz and Bioconductor. In Statistical Genomics: Methods and Protocols.

Summary: This project develops and compares programmable genome-visualization approaches for summarizing ChIP-seq or genomic-feature results, emphasizing reusable plotting functions, interpretability, and optional interactive extensions.

Drug-target analysis of peak-associated genes

Area: translational bioinformatics · Methods: orthology mapping, drug-target annotation, structural similarity search · Data type: genes near ChIP-seq peaks

Representative paper(s): Chen X, Reynolds CH (2002) Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. Journal of Chemical Information and Computer Sciences 42:1407–1414.

Summary: This project compares drug-target and structural-similarity strategies for peak-associated genes by evaluating ortholog mapping, annotated targets, overlap among candidate compounds, and visualization of method performance.

lncRNAs, ORFs, miRNAs, and repeats in regulatory regions

Area: genome feature discovery · Methods: sequence feature detection, coding-potential tools, repeat or RNA feature annotation · Data type: ChIP-seq peak sequences

Representative paper(s): Han S et al. (2019) LncFinder: an integrated platform for long non-coding RNA identification. Briefings in Bioinformatics 20:2009–2027; Hu L, Xu Z, Hu B, Lu ZJ (2017) COME: a robust coding potential calculation tool for lncRNA identification. Nucleic Acids Research 45:e2.

Summary: This project compares feature-detection approaches for regulatory peak sequences, focusing on how different tools or feature classes identify lncRNAs, ORFs, miRNAs, repeats, or other sequence-derived annotations.

Variant Analysis

Performance comparison of variant callers

Area: variant discovery · Methods: GATK, BCFtools, Octopus, DeepVariant · Data type: resequencing / VAR-seq

Representative paper(s): DePristo MA et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43:491–498; Poplin R et al. (2018) A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36:983–987.

Summary: This project compares variant callers on the same resequencing data set by evaluating shared and method-specific variants, agreement with published or consensus calls, and optional ROC/AUC-style performance summaries.
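
An ROC/AUC-style summary needs a truth label and a confidence score for each call. The sketch below computes the area under the ROC curve as the probability that a randomly chosen true call scores higher than a randomly chosen false call, with ties counted as one half. The labels and scores are hypothetical, for example membership in a consensus call set and a caller's quality score.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for calls present in the truth set, 0 otherwise
    scores: caller confidence, higher meaning more confident
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative call")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical calls: truth-set membership and caller quality scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(round(roc_auc(labels, scores), 3))
```

The pairwise form is quadratic in the number of calls, so it suits small examples; rank-based formulas give the same value more efficiently on large call sets.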

Optional Tag Index

RNA-seq: aligners, DEG analysis, transcript usage, clustering
ChIP-seq: peak calling, motif analysis, regulatory annotation, visualization
Single-cell genomics: clustering, embedding, annotation, trajectory inference, subpopulation analysis
Machine learning: random forest, XGBoost, SHAP, classification, feature importance
Multi-omics: MOFA+, pathwayPCA, DIABLO, netDx
Functional interpretation: ORA, GSEA, GSVA, PROGENy
Variant analysis: variant callers, benchmarking, annotation