thumbnail

Springer Science and Business Media LLC

 

  1748-7188

 

Cơ quản chủ quản:  BMC , BioMed Central Ltd.

Lĩnh vực:
Computational Theory and MathematicsApplied MathematicsStructural BiologyMolecular Biology

Phân tích ảnh hưởng

Thông tin về tạp chí

 

Các bài báo tiêu biểu

Median quartet tree search algorithms using optimal subtree prune and regraft
- 2024
Shayesteh Arasti, Siavash Mirarab
Gene trees can be different from the species tree due to biological processes and inference errors. One way to obtain a species tree is to find one that maximizes some measure of similarity to a set of gene trees. The number of shared quartets between a potential species tree and gene trees provides a statistically justifiable score; if maximized properly, it could result in a statistically consistent estimator of the species tree under several statistical models of discordance. However, finding the median quartet score tree, one that maximizes this score, is NP-Hard, motivating several existing heuristic algorithms. These heuristics do not follow the hill-climbing paradigm used extensively in phylogenetics. In this paper, we make theoretical contributions that enable an efficient hill-climbing approach. Specifically, we show that a subtree of size m can be placed optimally on a tree of size n in quasi-linear time with respect to n and (almost) independently of m. This result enables us to perform subtree prune and regraft (SPR) rearrangements as part of a hill-climbing search. We show that this approach can slightly improve upon the results of widely-used methods such as ASTRAL in terms of the optimization score but not necessarily accuracy.
Finding driver pathways in cancer: models and algorithms
Tập 7 Số 1 - 2012
Fabio Vandin, Eli Upfal, Benjamin J. Raphael
Resolving spatial inconsistencies in chromosome conformation measurements
Tập 8 - Trang 1-10 - 2013
Geet Duggal, Rob Patro, Emre Sefer, Hao Wang, Darya Filippova, Samir Khuller, Carl Kingsford
Chromosome structure is closely related to its function and Chromosome Conformation Capture (3C) is a widely used technique for exploring spatial properties of chromosomes. 3C interaction frequencies are usually associated with spatial distances. However, the raw data from 3C experiments is an aggregation of interactions from many cells, and the spatial distances of any given interaction are uncertain. We introduce a new method for filtering 3C interactions that selects subsets of interactions that obey metric constraints of various strictness. We demonstrate that, although the problem is computationally hard, near-optimal results are often attainable in practice using well-designed heuristics and approximation algorithms. Further, we show that, compared with a standard technique, this metric filtering approach leads to (a) subgraphs with higher statistical significance, (b) lower embedding error, (c) lower sensitivity to initial conditions of the embedding algorithm, and (d) structures with better agreement with light microscopy measurements. Our filtering scheme is applicable for a strict frequency-to-distance mapping and a more relaxed mapping from frequency to a range of distances. Our filtering method for 3C data considers both metric consistency and statistical confidence simultaneously resulting in lower-error embeddings that are biologically more plausible.
ViennaRNA Package 2.0
Tập 6 Số 1 - 2011
Ronny Lorenz, Stephan Wolf, Christian Höner zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter F. Stadler, Ivo L. Hofacker
Prefix-free parsing for building big BWTs
Tập 14 - Trang 1-15 - 2019
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun
High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive—a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21  GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
Analysis of gene copy number changes in tumor phylogenetics
Tập 11 Số 1 - 2016
Jun Zhou, Yu Lin, Vaibhav Rajan, William Hoskins, Bing Feng, Jijun Tang
HuMiTar: A sequence-based method for prediction of human microRNA targets
- 2008
Jishou Ruan, Hanzhe Chen, Lukasz Kurgan, Ke Chen, Chunsheng Kang, Peiyu Pu
Abstract Background MicroRNAs (miRs) are small noncoding RNAs that bind to complementary/partially complementary sites in the 3' untranslated regions of target genes to regulate protein production of the target transcript and to induce mRNA degradation or mRNA cleavage. The ability to perform accurate, high-throughput identification of physiologically active miR targets would enable functional characterization of individual miRs. Current target prediction methods include traditional approaches that are based on specific base-pairing rules in the miR's seed region and implementation of cross-species conservation of the target site, and machine learning (ML) methods that explore patterns that contrast true and false miR-mRNA duplexes. However, in the case of the traditional methods research shows that some seed region matches that are conserved are false positives and that some of the experimentally validated target sites are not conserved. Results We present HuMiTar, a computational method for identifying common targets of miRs, which is based on a scoring function that considers base-pairing for both seed and non-seed positions for human miR-mRNA duplexes. Our design shows that certain non-seed miR nucleotides, such as 14, 18, 13, 11, and 17, are characterized by a strong bias towards formation of Watson-Crick pairing. We contrasted HuMiTar with several representative competing methods on two sets of human miR targets and a set of ten glioblastoma oncogenes. Comparison with the two best performing traditional methods, PicTar and TargetScanS, and a representative ML method that considers the non-seed positions, NBmiRTar, shows that HuMiTar predictions include majority of the predictions of the other three methods. At the same time, the proposed method is also capable of finding more true positive targets as a trade-off for an increased number of predictions. Genome-wide predictions show that the proposed method is characterized by 1.99 signal-to-noise ratio and linear, with respect to the length of the mRNA sequence, computational complexity. The ROC analysis shows that HuMiTar obtains results comparable with PicTar, which are characterized by high true positive rates that are coupled with moderate values of false positive rates. Conclusion The proposed HuMiTar method constitutes a step towards providing an efficient model for studying translational gene regulation by miRs.
ASP-based method for the enumeration of attractors in non-deterministic synchronous and asynchronous multi-valued networks
Tập 12 - Trang 1-23 - 2017
Emna Ben Abdallah, Maxime Folschette, Olivier Roux, Morgan Magnin
This paper addresses the problem of finding attractors in biological regulatory networks. We focus here on non-deterministic synchronous and asynchronous multi-valued networks, modeled using automata networks (AN). AN is a general and well-suited formalism to study complex interactions between different components (genes, proteins,...). An attractor is a minimal trap domain, that is, a part of the state-transition graph that cannot be escaped. Such structures are terminal components of the dynamics and take the form of steady states (singleton) or complex compositions of cycles (non-singleton). Studying the effect of a disease or a mutation on an organism requires finding the attractors in the model to understand the long-term behaviors. We present a computational logical method based on answer set programming (ASP) to identify all attractors. Performed without any network reduction, the method can be applied on any dynamical semantics. In this paper, we present the two most widespread non-deterministic semantics: the asynchronous and the synchronous updating modes. The logical approach goes through a complete enumeration of the states of the network in order to find the attractors without the necessity to construct the whole state-transition graph. We realize extensive computational experiments which show good performance and fit the expected theoretical results in the literature. The originality of our approach lies on the exhaustive enumeration of all possible (sets of) states verifying the properties of an attractor thanks to the use of ASP. Our method is applied to non-deterministic semantics in two different schemes (asynchronous and synchronous). The merits of our methods are illustrated by applying them to biological examples of various sizes and comparing the results with some existing approaches. It turns out that our approach succeeds to exhaustively enumerate on a desktop computer, in a large model (100 components), all existing attractors up to a given size (20 states). This size is only limited by memory and computation time.
Module detection in complex networks using integer optimisation
Tập 5 - Trang 1-11 - 2010
Gang Xu, Laura Bennett, Lazaros G Papageorgiou, Sophia Tsoka
The detection of modules or community structure is widely used to reveal the underlying properties of complex networks in biology, as well as physical and social sciences. Since the adoption of modularity as a measure of network topological properties, several methodologies for the discovery of community structure based on modularity maximisation have been developed. However, satisfactory partitions of large graphs with modest computational resources are particularly challenging due to the NP-hard nature of the related optimisation problem. Furthermore, it has been suggested that optimising the modularity metric can reach a resolution limit whereby the algorithm fails to detect smaller communities than a specific size in large networks. We present a novel solution approach to identify community structure in large complex networks and address resolution limitations in module detection. The proposed algorithm employs modularity to express network community structure and it is based on mixed integer optimisation models. The solution procedure is extended through an iterative procedure to diminish effects that tend to agglomerate smaller modules (resolution limitations). A comprehensive comparative analysis of methodologies for module detection based on modularity maximisation shows that our approach outperforms previously reported methods. Furthermore, in contrast to previous reports, we propose a strategy to handle resolution limitations in modularity maximisation. Overall, we illustrate ways to improve existing methodologies for community structure identification so as to increase its efficiency and applicability.
Space-efficient representation of genomic k-mer count tables
Tập 17 - Trang 1-15 - 2022
Yoshihiro Shibuya, Djamal Belazzougui, Gregory Kucherov
k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. Coli and C. Elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k’s.