Springer Science and Business Media LLC
Công bố khoa học tiêu biểu
* Dữ liệu chỉ mang tính chất tham khảo
Sắp xếp:
On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
Springer Science and Business Media LLC - Tập 1 - Trang 1-17 - 2006
Given a set S of n locally aligned sequences, it is a needed prerequisite to partition it into groups of very similar sequences to facilitate subsequent computations, such as the generation of a phylogenetic tree. This article introduces a new method of clustering which partitions S into subsets such that the overlap of each pair of sequences within a subset is at least a given percentage c of the lengths of the two sequences. We show that this problem can be reduced to finding all maximal cliques in a special kind of max-tolerance graph which we call a c-max-tolerance graph. Previously we have shown that finding all maximal cliques in general max-tolerance graphs can be done efficiently in O(n3 + out). Here, using a new kind of sweep-line algorithm, we show that the restriction to c-max-tolerance graphs yields a better runtime of O(n2 log n + out). Furthermore, we present another algorithm which is much easier to implement, and though theoretically slower than the first one, is still running in polynomial time. We then experimentally analyze the number and structure of all maximal cliques in a c-max-tolerance graph, depending on the chosen c-value. We apply our simple algorithm to artificial and biological data and we show that this implementation is much faster than the well-known application Cliquer. By introducing a new heuristic that uses the set of all maximal cliques to partition S, we finally show that the computed partition gives a reasonable clustering for biological data sets.
Distributional fold change test – a statistical approach for detecting differential expression in microarray experiments
Springer Science and Business Media LLC - Tập 7 - Trang 1-16 - 2012
Because of the large volume of data and the intrinsic variation of data intensity observed in microarray experiments, different statistical methods have been used to systematically extract biological information and to quantify the associated uncertainty. The simplest method to identify differentially expressed genes is to evaluate the ratio of average intensities in two different conditions and consider all genes that differ by more than an arbitrary cut-off value to be differentially expressed. This filtering approach is not a statistical test and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed. At the same time the fold change by itself provide valuable information and it is important to find unambiguous ways of using this information in expression data treatment. A new method of finding differentially expressed genes, called distributional fold change (DFC) test is introduced. The method is based on an analysis of the intensity distribution of all microarray probe sets mapped to a three dimensional feature space composed of average expression level, average difference of gene expression and total variance. The proposed method allows one to rank each feature based on the signal-to-noise ratio and to ascertain for each feature the confidence level and power for being differentially expressed. The performance of the new method was evaluated using the total and partial area under receiver operating curves and tested on 11 data sets from Gene Omnibus Database with independently verified differentially expressed genes and compared with the t-test and shrinkage t-test. Overall the DFC test performed the best – on average it had higher sensitivity and partial AUC and its elevation was most prominent in the low range of differentially expressed features, typical for formalin-fixed paraffin-embedded sample sets. The distributional fold change test is an effective method for finding and ranking differentially expressed probesets on microarrays. The application of this test is advantageous to data sets using formalin-fixed paraffin-embedded samples or other systems where degradation effects diminish the applicability of correlation adjusted methods to the whole feature set.
Noisy: Identification of problematic columns in multiple sequence alignments
Springer Science and Business Media LLC - Tập 3 - Trang 1-10 - 2008
Sequence-based methods for phylogenetic reconstruction from (nucleic acid) sequence data are notoriously plagued by two effects: homoplasies and alignment errors. Large evolutionary distances imply a large number of homoplastic sites. As most protein-coding genes show dramatic variations in substitution rates that are not uncorrelated across the sequence, this often leads to a patchwork pattern of (i) phylogenetically informative and (ii) effectively randomized regions. In highly variable regions, furthermore, alignment errors accumulate resulting in sometimes misleading signals in phylogenetic reconstruction. We present here a method that, based on assessing the distribution of character states along a cyclic ordering of the taxa, allows the identification of phylogenetically uninformative homoplastic sites in a multiple sequence alignment. Removal of these sites appears to improve the performance of phylogenetic reconstruction algorithms as measured by various indices of "tree quality". In particular, we obtain more stable trees due to the exclusion of phylogenetically incompatible sites that most likely represent strongly randomized characters. The computer program noisy implements this approach. It can be employed to improving phylogenetic reconstruction capability with quite a considerable success rate whenever (1) the average bootstrap support obtained from the original alignment is low, and (2) there are sufficiently many taxa in the data set – at least, say, 12 to 15 taxa. The software can be obtained under the GNU Public License from
http://www.bioinf.uni-leipzig.de/Software/noisy/
.
Using a constraint-based regression method for relative quantification of somatic mutations in pyrosequencing signals: a case for NRAS analysis
Springer Science and Business Media LLC - Tập 11 - Trang 1-10 - 2016
Pyrosequencing Allele Quantification (AQ) is a cost-effective DNA sequencing method that can be used for detecting somatic mutations in formalin-fixed paraffin-embedded (FFPE) samples. The method displays a low turnaround time and a high sensitivity. Pyrosequencing suffers however from two main drawbacks including (i) low specificity and (ii) difficult signal interpretation when multiple mutations are reported in a hotspot genomic region. Using a constraint-based regression method, the new AdvISER-PYRO-SMQ algorithm was developed in the current study and implemented into an R package. As a proof-of-concept, AdvISER-PYRO-SMQ was used to identify a set of 9 distinct point mutations affecting codon 61 of the NRAS oncogene. In parallel, a pyrosequencing assay using the Qiagen software and its AQ module was used to assess selectively the presence of a single point mutation (NRAS
$$c.182A>G$$
- Q61R-1) among the set of codon 61 mutations, and to analyze related pyrosequencing signals. AdvISER-PYRO-SMQ produced a lower limit of blank (0 %) than the AQ module of Qiagen software (5.1 %) and similar limit of detection were obtained for both software (5.6 vs 4.8 %). AdvISER-PYRO-SMQ was able to screen for the presence of 9 distinct mutations with a single pyrosequencing reaction whereas the AQ module was limited to screen a single mutation per reaction. Using a constraint-based regression method enables to analyze pyrosequencing signal and to detect multiple mutations within a hotspot genomic region with an optimal compromise between sensitivity and specificity. The AdvISER-PYRO-SMQ R package provides a generic tool which can be applied on a wide range of somatic mutations. Its implementation in a Shiny web interactive application (available at
https://ucl-irec-ctma.shinyapps.io/Pyrosequencing-NRAS-61/
) enables its use in research or clinical routine applications.
Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail
Springer Science and Business Media LLC - Tập 2 - Trang 1-17 - 2007
The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case, where gaps are allowed. For this case, the distribution is only known empirically in the high-probability region, which is biologically less relevant. We provide a method to obtain numerically the biologically relevant rare-event tail of the distribution. The method, which has been outlined in an earlier work, is based on generating the sequences with a parametrized probability distribution, which is biased with respect to the original biological one, in the framework of Metropolis Coupled Markov Chain Monte Carlo. Here, we first present the approach in detail and evaluate the convergence of the algorithm by considering a simple test case. In the earlier work, the method was just applied to one single example case. Therefore, we consider here a large set of parameters: We study the distributions for protein alignment with different substitution matrices (BLOSUM62 and PAM250) and affine gap costs with different parameter values. In the logarithmic phase (large gap costs) it was previously assumed that the Gumbel form still holds, hence the Gumbel distribution is usually used when evaluating p-values in databases. Here we show that for all cases, provided that the sequences are not too long (L > 400), a "modified" Gumbel distribution, i.e. a Gumbel distribution with an additional Gaussian factor is suitable to describe the data. We also provide a "scaling analysis" of the parameters used in the modified Gumbel distribution. Furthermore, via a comparison with BLAST parameters, we show that significance estimations change considerably when using the true distributions as presented here. Finally, we study also the distribution of the sum statistics of the k best alignments. Our results show that the statistics of gapped and ungapped local alignments deviates significantly from Gumbel in the rare-event tail. We provide a Gaussian correction to the distribution and an analysis of its scaling behavior for several different scoring parameter sets, which are commonly used to search protein data bases. The case of sum statistics of k best alignments is included.
A simple data-adaptive probabilistic variant calling model
Springer Science and Business Media LLC - - 2015
Lossless filter for multiple repeats with bounded edit distance
Springer Science and Business Media LLC - Tập 4 Số 1 - 2009
Graph-distance distribution of the Boltzmann ensemble of RNA secondary structures
Springer Science and Business Media LLC - Tập 9 - Trang 1-14 - 2014
Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function. Pre-mRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium harbors useful information on the shape of the molecule that in turn can give insights into the interplay of its functional domains. Spatial distance can be approximated by the graph-distance in RNA secondary structure. We show here that the equilibrium distribution of graph-distances between a fixed pair of nucleotides can be computed in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity of O(n6D5) for sequence length n and D distinct distance values, it is possible to reduce this to O(n4) for practical applications in which predominantly small distances are of of interest. Further reductions, however, seem to be difficult. Therefore, we introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules. The graph-distance distribution can be computed using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graph-distance can be related to the smFRET data. The additional file and the software of our paper are available from
http://www.rna.uni-jena.de/RNAgraphdist.html
.
Finding the region of pseudo-periodic tandem repeats in biological sequences
Springer Science and Business Media LLC - Tập 1 - Trang 1-8 - 2006
The genomes of many species are dominated by short sequences repeated consecutively. It is estimated that over 10% of the human genome consists of tandemly repeated sequences. Finding repeated regions in long sequences is important in sequence analysis. We develop a software, LocRepeat, that finds regions of pseudo-periodic repeats in a long sequence. We use the definition of Li et al. [1] for the pseudo-periodic partition of a region and extend the algorithm that can select the repeated region from a given long sequence and give the pseudo-periodic partition of the region. LocRepeat is available at
http://www.cs.cityu.edu.hk/~lwang/software/LocRepeat
Investigating the complexity of the double distance problems
Springer Science and Business Media LLC - Tập 19 - Trang 1-20 - 2024
Two genomes
$$\mathbb {A}$$
and
$$\mathbb {B}$$
over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by
$$n_*$$
the number of common families of
$$\mathbb {A}$$
and
$$\mathbb {B}$$
. Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Let
$$c_i$$
and
$$p_j$$
be respectively the numbers of cycles of length i and of paths of length j in the breakpoint graph of genomes
$$\mathbb {A}$$
and
$$\mathbb {B}$$
. Then, the breakpoint distance of
$$\mathbb {A}$$
and
$$\mathbb {B}$$
is equal to
$$n_*-\left( c_2+\frac{p_0}{2}\right)$$
. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance of
$$\mathbb {A}$$
and
$$\mathbb {B}$$
is
$$n_*-\left( c+\frac{p_e }{2}\right)$$
, where c is the total number of cycles and
$$p_e$$
is the total number of paths of even length. The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a
$$\sigma _k$$
distance, defined to be
$$n_*-\left( c_2+c_4+\ldots +c_k+\frac{p_0+p_2+\ldots +p_{k-2}}{2}\right)$$
, and increasingly investigate the complexities of median and double distance for the
$$\sigma _4$$
distance, then the
$$\sigma _6$$
distance, and so on. While for the median much effort was done in our and in other research groups but no progress was obtained even for the
$$\sigma _4$$
distance, for solving the double distance under
$$\sigma _4$$
and
$$\sigma _6$$
distances we could devise linear time algorithms, which we present here.
Tổng số: 299
- 1
- 2
- 3
- 4
- 5
- 6
- 10