Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem
Abstract
Transcriptomic structural variants (TSVs), large-scale transcriptome sequence changes due to structural variation, are common in cancer. Detecting TSVs from high-throughput sequencing data is a computationally challenging problem. Among the confounding factors, sample heterogeneity, where each sample contains multiple distinct alleles, poses a critical obstacle to accurate TSV prediction. To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome arrangements that maximize the number of reads concordant with at least one arrangement. This models a heterogeneous or diploid sample. We prove that MCAP is NP-complete and provide a $$\frac{1}{4}$$-approximation algorithm for $$k=1$$ and a $$\frac{3}{4}$$-approximation algorithm for the diploid case ($$k=2$$) assuming an oracle for $$k=1$$. Combining these, we obtain a $$\frac{3}{16}$$-approximation algorithm for MCAP when $$k=2$$ (without an oracle). We also present an integer linear programming formulation for general k. We characterize the conflict structures in the graph that require $$k>1$$ alleles to satisfy read concordance and show that such structures are prevalent. We show that the solution to MCAP accurately addresses sample heterogeneity during TSV detection. On TCGA cancer samples and cancer cell line samples, our algorithms perform better than SQUID, a TSV calling tool. The software is available at https://github.com/Kingsford-Group/diploidsquid.
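To make the MCAP objective concrete, the following is a minimal brute-force sketch on a toy instance. It is not the approximation algorithm or the integer linear program described above, and it is only feasible for a handful of segments. The encoding is an assumption made for illustration: each segment has a head (`'h'`) and a tail (`'t'`), each edge is a hypothetical tuple `(u, end_u, v, end_v, weight)` standing for `weight` reads that join two segment ends, and an edge is taken to be concordant with an arrangement when it joins the right-hand end of the earlier segment to the left-hand end of the later one. The function names are likewise hypothetical.

```python
from itertools import combinations_with_replacement, permutations, product


def is_concordant(edge, arrangement):
    """Assumed rule: the edge must join the right-hand end of the earlier
    segment to the left-hand end of the later segment."""
    u, end_u, v, end_v, _ = edge
    pos = {seg: i for i, (seg, _) in enumerate(arrangement)}
    ori = {seg: o for seg, o in arrangement}
    if pos[u] > pos[v]:  # make u the earlier segment
        u, end_u, v, end_v = v, end_v, u, end_u
    right_end_of_u = 't' if ori[u] == 1 else 'h'
    left_end_of_v = 'h' if ori[v] == 1 else 't'
    return end_u == right_end_of_u and end_v == left_end_of_v


def all_arrangements(segments):
    """Yield every ordering of the segments with every orientation choice."""
    for order in permutations(segments):
        for oris in product((1, -1), repeat=len(segments)):
            yield tuple(zip(order, oris))


def solve_mcap_brute_force(segments, edges, k):
    """Return (best weight, k arrangements) maximizing the total weight of
    edges concordant with at least one of the k arrangements."""
    covers = []
    for arr in all_arrangements(segments):
        covered = frozenset(
            i for i, e in enumerate(edges) if is_concordant(e, arr)
        )
        covers.append((arr, covered))
    best_weight, best = -1, None
    for combo in combinations_with_replacement(covers, k):
        union = frozenset().union(*(c for _, c in combo))
        weight = sum(edges[i][4] for i in union)
        if weight > best_weight:
            best_weight, best = weight, [a for a, _ in combo]
    return best_weight, best


# Toy instance: the third edge conflicts with the first two together,
# so no single arrangement makes all three concordant.
segments = [0, 1, 2]
edges = [
    (0, 't', 1, 'h', 5),  # agrees with the reference order
    (1, 't', 2, 'h', 5),  # agrees with the reference order
    (0, 't', 2, 't', 4),  # needs segment 2 inverted after segment 0
]
w1, _ = solve_mcap_brute_force(segments, edges, k=1)
w2, _ = solve_mcap_brute_force(segments, edges, k=2)
print(w1, w2)  # 10 14
```

On this instance a single arrangement can satisfy at most 10 of the 14 reads, while two arrangements satisfy all 14. This is a small example of a conflict structure that needs $$k>1$$ alleles.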