Bioinformatics (Oxford, England)
Representative scientific publications
* Data are for reference only
Motivation: Outer membrane beta-barrels (OMBBs) are the proteins found in the outer membrane of bacteria, mitochondria and chloroplasts. There are thousands of beta-barrels reported in genomic databases with ∼2–3% of the genes in gram-negative bacteria encoding these proteins. These proteins have a wide variety of biological functions including active and passive transport, cell adhesion, catalysis and structural anchoring. Of the non-redundant OMBB structures in the Protein Data Bank, half have been solved during the past 5 years. This influx of information provides new opportunities for understanding the chemistry of these proteins. The distribution of charges in proteins in the outer membrane has implications for how the mechanism of outer membrane protein insertion is understood. Understanding the distribution of charges might also assist in organism selection for the heterologous expression of mitochondrial OMBBs.
Results: We find a strong asymmetry in the charge distribution of these proteins. For the outward-facing residues of the beta-barrel within regions of similar amino acid density for both membrane leaflets, the external side of the outer membrane contains almost three times the number of charged residues as the internal side of the outer membrane. Moreover, the lipid bilayer of the outer membrane is asymmetric, and the overall preference for amino acid types to be in the external leaflet of the membrane correlates roughly with the hydrophobicity of the membrane lipids. This preference is demonstrably related to the difference in lipid composition of the external and internal leaflets of the membrane.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: A motif is a short DNA or protein sequence that contributes to the biological function of the sequence in which it resides. Over the past several decades, many computational methods have been described for identifying, characterizing and searching with sequence motifs. Critical to nearly any motif-based sequence analysis pipeline is the ability to scan a sequence database for occurrences of a given motif described by a position-specific frequency matrix.
Results: We describe Find Individual Motif Occurrences (FIMO), a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices. The program computes a log-likelihood ratio score for each position in a given sequence database, uses established dynamic programming methods to convert this score to a P-value and then applies false discovery rate analysis to estimate a q-value for each position in the given sequence. FIMO provides output in a variety of formats, including HTML, XML and several Santa Cruz Genome Browser formats. The program is efficient, allowing for the scanning of DNA sequences at a rate of 3.5 Mb/s on a single CPU.
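The scoring and P-value steps described above can be sketched as follows. This is a minimal illustration, not FIMO's implementation: the function names, the toy motif, the uniform background and the score discretization factor `scale` are all assumptions made for the example, and the q-value step is omitted.

```python
import math

def log_odds_pssm(freqs, background):
    """Convert per-position letter frequencies into log2 likelihood-ratio scores."""
    return [{b: math.log2(col[b] / background[b]) for b in col} for col in freqs]

def scan(sequence, pssm):
    """Yield (start, score) for every window of motif width in the sequence."""
    width = len(pssm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        yield start, sum(col[base] for col, base in zip(pssm, window))

def score_pvalue(pssm, background, threshold, scale=100):
    """P(score >= threshold) under the background model.

    Dynamic programming over discretized scores: after each motif column,
    `dist` maps every reachable (rounded, scaled) partial score to its
    probability under the background.
    """
    dist = {0: 1.0}
    for col in pssm:
        nxt = {}
        for total, prob in dist.items():
            for base, s in col.items():
                key = total + round(s * scale)
                nxt[key] = nxt.get(key, 0.0) + prob * background[base]
        dist = nxt
    cutoff = round(threshold * scale)
    return sum(p for t, p in dist.items() if t >= cutoff)

# Hypothetical 3-column DNA motif favouring "ACG", uniform background.
background = {b: 0.25 for b in "ACGT"}
freqs = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.5, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.125, "G": 0.5, "T": 0.125},
]
pssm = log_odds_pssm(freqs, background)
best_start, best_score = max(scan("TTACGTT", pssm), key=lambda hit: hit[1])
p_value = score_pvalue(pssm, background, best_score)
```

Discretizing the scores keeps the number of distinct partial sums small, which is what makes the exact score distribution tractable for longer motifs.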
Availability and Implementation: FIMO is part of the MEME Suite software toolkit. A web server and source code are available at http://meme.sdsc.edu.
Contact: [email protected]; [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Multi-series time-course microarray experiments are useful approaches for exploring biological processes. In this type of experiment, the researcher is frequently interested in studying gene expression changes over time and in evaluating trend differences between the various experimental groups. The large amount of data, the multiplicity of experimental conditions and the dynamic nature of the experiments pose great challenges to data analysis.
Results: In this work, we propose a statistical procedure to identify genes that show different gene expression profiles across analytical groups in time-course experiments. The method is a two-step regression approach in which the experimental groups are identified by dummy variables. The procedure first fits a global regression model with all the defined variables to identify differentially expressed genes; second, a variable selection strategy is applied to study differences between groups and to find statistically significantly different profiles. The methodology is illustrated on both a real and a simulated microarray dataset.
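To make the role of the dummy variables concrete, the sketch below builds one row of a regression design matrix for a sample taken at a given time in a given group. It assumes a polynomial time trend (the abstract does not state the form of the model), and the function name `design_row` and the group labels are hypothetical. It does not perform the model fitting or the variable selection step.

```python
def design_row(time, group, groups, degree=2):
    """Build one design-matrix row for a time-course sample.

    The first entry of `groups` is treated as the reference group. The row
    holds the polynomial time terms [1, t, t^2, ...], followed, for each
    non-reference group, by the same terms multiplied by that group's 0/1
    dummy variable. Coefficients on the dummy terms therefore describe how
    a group's profile differs from the reference profile.
    """
    powers = [time ** p for p in range(degree + 1)]
    row = list(powers)
    for g in groups[1:]:
        dummy = 1 if group == g else 0
        row.extend(dummy * x for x in powers)
    return row

groups = ["control", "treated"]          # hypothetical experimental groups
row_control = design_row(3, "control", groups)
row_treated = design_row(2, "treated", groups)
```

With this encoding, a gene whose dummy-term coefficients are all indistinguishable from zero has the same time profile in both groups.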
Availability: The method has been implemented in the statistical language R and is freely available from the Bioconductor contributed packages repository and from
Contact: [email protected]; [email protected]
We have established a method for systematic integration of multiple microarray datasets. The method was applied to two different sets of cancer profiling studies. The change of gene expression in cancer was expressed as 'effect size', a standardized index measuring the magnitude of a treatment or covariate effect. The effect sizes were combined to obtain the estimate of the overall mean. The statistical significance was determined by a permutation test extended to multiple datasets. It was shown that the data integration promotes the discovery of small but consistent expression changes with increased sensitivity and reliability. The effect size methods also provided an efficient modeling framework for addressing interstudy variation. Based on the result of homogeneity tests, a fixed effects model was adopted for one set of datasets that had been created in controlled experimental conditions. By contrast, a random effects model was shown to be appropriate for the other set of datasets that had been published by independent groups. We also developed an alternative modeling procedure based on a Bayesian approach, which would offer flexibility and robustness compared to the classical procedure.
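As an illustration of how per-study effect sizes can be combined under the two models, the sketch below uses standard inverse-variance weighting, Cochran's Q as the homogeneity statistic and the DerSimonian-Laird estimate of between-study variance. These specific estimators are assumptions chosen for the example; the abstract does not say which ones the authors used, and the permutation test and Bayesian procedure are not shown.

```python
import math

def combine_effects(effects, variances):
    """Combine per-study effect sizes for one gene.

    Returns the fixed-effects mean, Cochran's Q, the DerSimonian-Laird
    between-study variance tau2, and the random-effects mean with its
    standard error.
    """
    w = [1.0 / v for v in variances]                  # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum_w
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)                     # truncated at zero
    w_re = [1.0 / (v + tau2) for v in variances]      # random-effects weights
    random = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    return {
        "fixed": fixed,
        "q": q,
        "tau2": tau2,
        "random": random,
        "random_se": math.sqrt(1.0 / sum(w_re)),
    }

# Hypothetical effect sizes for one gene in two studies with equal variance.
result = combine_effects([0.0, 1.0], [0.01, 0.01])
```

When Q is no larger than its degrees of freedom, tau2 is zero and the random-effects estimate coincides with the fixed-effects one.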
Contact: [email protected]
Keywords: microarray, meta-analysis, effect size, Bayesian meta-analysis
*To whom correspondence should be addressed.
Motivation: With the proliferation of microarray experiments and their availability in the public domain, the use of meta-analysis methods to combine results from different studies is increasing. In microarray experiments, where the sample size is often limited, meta-analysis offers the possibility to considerably increase the statistical power and give more accurate results.
Results: A moderated effect size combination method was proposed and compared with other meta-analysis approaches. All methods were applied to real publicly available datasets on prostate cancer, and were compared in an extensive simulation study for various amounts of inter-study variability. Although the proposed moderated effect size combination improved already existing effect size approaches, the P-value combination was found to provide a better sensitivity and a better gene ranking than the other meta-analysis methods, while effect size methods were more conservative.
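One common way to combine P-values across studies is the inverse normal (Stouffer) method, sketched below with the Python standard library. The abstract does not specify which combination rule was used, so the choice of method, the function name and the optional weights are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist

def inverse_normal_combine(pvalues, weights=None):
    """Combine one-sided P-values from independent studies.

    Each P-value is mapped to a standard normal z-score, the z-scores are
    averaged with the given weights, and the result is mapped back to a
    P-value. Every input must lie strictly between 0 and 1.
    """
    nd = NormalDist()
    if weights is None:
        weights = [1.0] * len(pvalues)
    z = sum(w * nd.inv_cdf(1.0 - p) for w, p in zip(weights, pvalues))
    z /= sqrt(sum(w * w for w in weights))
    return 1.0 - nd.cdf(z)

# Two studies that are each only borderline significant for a gene.
combined = inverse_normal_combine([0.05, 0.05])
```

Two borderline results of 0.05 combine to roughly 0.01, which shows how pooling studies can raise sensitivity when individual sample sizes are small.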
Availability: An R package metaMA is available on the CRAN.
Contact: [email protected]
Motivation: Experimental evidence has accumulated showing that microRNA (miRNA) binding sites within protein coding sequences (CDSs) are functional in controlling gene expression.
Results: Here we report a computational analysis of such miRNA target sites, based on features extracted from existing mammalian high-throughput immunoprecipitation and sequencing data. The analysis is performed independently for the CDS and the 3′-untranslated regions (3′-UTRs) and reveals different sets of features and models for the two regions. The two models are combined into a novel computational model for miRNA target genes, DIANA-microT-CDS, which achieves higher sensitivity compared with other popular programs and the model that uses only the 3′-UTR target sites. Further analysis indicates that genes with shorter 3′-UTRs are preferentially targeted in the CDS, suggesting that evolutionary selection might favor additional sites on the CDS in cases where there is restricted space on the 3′-UTR.
Availability: The results of DIANA-microT-CDS are available at www.microrna.gr/microT-CDS
Contact: [email protected]; [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: To understand the behaviour of complex biological regulatory networks, a proper integration of molecular data into a full-fledged formal dynamical model is ultimately required. As most available data on regulatory interactions are qualitative, logical modelling offers an interesting framework to delineate the main dynamical properties of the underlying networks.
Results: Transposing a generic model of the core network controlling the mammalian cell cycle into the logical framework, we compare different strategies to explore its dynamical properties. In particular, we assess the respective advantages and limits of synchronous versus asynchronous updating assumptions to delineate the asymptotical behaviour of regulatory networks. Furthermore, we propose several intermediate strategies to optimize the computation of asymptotical properties depending on available knowledge.
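The difference between the two updating assumptions can be shown on a toy Boolean network. The three-node negative feedback loop below is hypothetical and is not the cell cycle model from the paper; the function names are likewise chosen for the example. Under synchronous updating every component changes at once, so each state has exactly one successor; under asynchronous updating only one component changes per transition, so a state can have several successors.

```python
def sync_successor(state, rules):
    """Synchronous update: apply every rule to the current state at once."""
    return tuple(rule(state) for rule in rules)

def async_successors(state, rules):
    """Asynchronous update: one successor per component that is called to change.

    Returns an empty list when the state is stable.
    """
    target = sync_successor(state, rules)
    return [
        state[:i] + (new,) + state[i + 1:]
        for i, (cur, new) in enumerate(zip(state, target))
        if cur != new
    ]

def sync_attractor(state, rules):
    """Follow the synchronous trajectory until a state repeats; return the cycle."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = sync_successor(state, rules)
    return seen[seen.index(state):]

# Hypothetical loop: C inhibits A, A activates B, B activates C.
rules = [
    lambda s: 1 - s[2],   # A
    lambda s: s[0],       # B
    lambda s: s[1],       # C
]
cycle_from_zero = sync_attractor((0, 0, 0), rules)
cycle_from_101 = sync_attractor((1, 0, 1), rules)
branches = async_successors((1, 0, 1), rules)
```

In this toy network the synchronous scheme yields two separate cycles (of six and two states), whereas the asynchronous scheme lets state (1, 0, 1) branch into three different successors. This illustrates why the choice of updating assumption can change the predicted asymptotic behaviour.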
Availability: The mammalian cell cycle model is available in a dedicated XML format (GINML) on our website, along with our logical simulation software GINsim. Higher-resolution state transition graphs are also available on this website (Model Repository page).
Contact: [email protected]
Motivation: Next-generation sequencing captures sequence differences in reads relative to a reference genome or transcriptome, including splicing events and complex variants involving multiple mismatches and long indels. We present computational methods for fast detection of complex variants and splicing in short reads, based on a successively constrained search process of merging and filtering position lists from a genomic index. Our methods are implemented in GSNAP (Genomic Short-read Nucleotide Alignment Program), which can align both single- and paired-end reads as short as 14 nt and of arbitrarily long length. It can detect short- and long-distance splicing, including interchromosomal splicing, in individual reads, using probabilistic models or a database of known splice sites. Our program also permits SNP-tolerant alignment to a reference space of all possible combinations of major and minor alleles, and can align reads from bisulfite-treated DNA for the study of methylation state.
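The general idea of looking up position lists in a genomic index and then merging and filtering them can be sketched with a simple k-mer index. This is a generic seed-and-vote illustration under assumed parameters (the value of k, non-overlapping seeds, a hit-count threshold), not GSNAP's actual search strategy; it does not handle indels, splicing, SNP tolerance or bisulfite conversion.

```python
from collections import Counter, defaultdict

def build_index(genome, k):
    """Map every k-mer in the genome to the sorted list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def candidate_starts(read, index, k, min_hits):
    """Merge the position lists of the read's k-mers and filter by support.

    Each k-mer hit at genome position `pos`, taken from read offset `offset`,
    votes for the read starting at `pos - offset`. Starts supported by at
    least `min_hits` k-mers are kept as candidate alignments.
    """
    votes = Counter()
    for offset in range(0, len(read) - k + 1, k):     # non-overlapping seeds
        for pos in index.get(read[offset:offset + k], ()):
            if pos - offset >= 0:
                votes[pos - offset] += 1
    return sorted(start for start, n in votes.items() if n >= min_hits)

# Hypothetical toy genome and a 12 nt read with one mismatch in its middle seed.
genome = "ACGTTGCAAGGCTTACCGATAGGCTA"
index = build_index(genome, 4)
starts = candidate_starts("TGCAAGTCTTAC", index, 4, min_hits=2)
```

Lowering `min_hits` tolerates more mismatches at the cost of more candidate positions to verify, which is the trade-off that a successively constrained search aims to manage.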
Results: In comparison testing, GSNAP has speeds comparable to existing programs, especially in reads of ≥70 nt and is fastest in detecting complex variants with four or more mismatches or insertions of 1–9 nt and deletions of 1–30 nt. Although SNP tolerance does not increase alignment yield substantially, it affects alignment results in 7–8% of transcriptional reads, typically by revealing alternate genomic mappings for a read. Simulations of bisulfite-converted DNA show a decrease in identifying genomic positions uniquely in 6% of 36 nt reads and 3% of 70 nt reads.
Availability: Source code in C and utility programs in Perl are freely available for download as part of the GMAP package at http://share.gene.com/gmap.
Contact: [email protected]
Motivation: JOY is a program to annotate protein sequence alignments with three-dimensional (3D) structural features. It was developed to display 3D structural information in a sequence alignment and to help understand the conservation of amino acids in their specific local environments.
Results: The JOY representation now constitutes an essential part of the two databases of protein structure alignments: HOMSTRAD (http://www-cryst.bioc.cam.ac.uk/homstrad) and CAMPASS (http://www-cryst.bioc.cam.ac.uk/campass). It has also been successfully used for identifying distant evolutionary relationships.
Availability: The program can be obtained via anonymous ftp from torsa.bioc.cam.ac.uk from the directory /pub/joy/. The address for the JOY server is http://www-cryst.bioc.cam.ac.uk/cgi-bin/joy.cgi.
Contact: [email protected]
We developed a prokaryotic genome annotation pipeline, DFAST, that also supports genome submission to public sequence databases. DFAST was originally started as an on-line annotation server, and to date, over 7000 jobs have been processed since its first launch in 2016. Here, we present a newly implemented background annotation engine for DFAST, which is also available as a standalone command-line program. The new engine can annotate a typical-sized bacterial genome within 10 min, with rich information such as pseudogenes, translation exceptions and orthologous gene assignment between given reference genomes. In addition, the modular framework of DFAST allows users to customize the annotation workflow easily and will also facilitate extensions for new functions and incorporation of new tools in the future.
The software is implemented in Python 3 and runs under both Python 2.7 and 3.4, on Macintosh and Linux systems. It is freely available at https://github.com/nigyta/dfast_core/ under the GPLv3 license with external binaries bundled in the software distribution. An on-line version is also available at https://dfast.nig.ac.jp/.
Supplementary data are available at Bioinformatics online.