BMC Bioinformatics

ISSN: 1471-2105

Publisher: BioMed Central Ltd., BMC

Subject areas:
Computer Science Applications, Biochemistry, Molecular Biology, Applied Mathematics, Structural Biology

Featured articles

AlzPharm: integration of neurodegeneration data using RDF
Volume 8 - Pages 1-12 - 2007
Hugo YK Lam, Luis Marenco, Tim Clark, Yong Gao, June Kinoshita, Gordon Shepherd, Perry Miller, Elizabeth Wu, Gwendolyn T Wong, Nian Liu, Chiquito Crasto, Thomas Morse, Susie Stephens, Kei-Hoi Cheung
Neuroscientists often need to access a wide range of data sets distributed over the Internet. These data sets, however, are typically neither integrated nor interoperable, resulting in a barrier to answering complex neuroscience research questions. Domain ontologies can enable the querying of heterogeneous data sets, but they are not sufficient for neuroscience since the data of interest commonly span multiple research domains. To this end, e-Neuroscience seeks to provide an integrated platform for neuroscientists to discover new knowledge through seamless integration of the very diverse types of neuroscience data. Here we present a Semantic Web approach to building this e-Neuroscience framework by using the Resource Description Framework (RDF) and its vocabulary description language, RDF Schema (RDFS), as a standard data model to facilitate both representation and integration of the data. We have constructed a pilot ontology for BrainPharm (a subset of SenseLab) using RDFS and then converted a subset of the BrainPharm data into RDF according to the ontological structure. We have also integrated the converted BrainPharm data with existing RDF hypothesis and publication data from a pilot version of SWAN (Semantic Web Applications in Neuromedicine). Our implementation uses the RDF Data Model in Oracle Database 10g release 2 for data integration, query, and inference, while our Web interface allows users to query the data and retrieve the results in a convenient fashion. Accessing and integrating biomedical data that cut across multiple disciplines will be increasingly indispensable and beneficial to neuroscience researchers. The Semantic Web approach we undertook has demonstrated a promising way to semantically integrate data sets created independently. It also shows how advanced queries and inferences can be performed over the integrated data, which are hard to achieve using traditional data integration approaches.
Our pilot results suggest that our Semantic Web approach is suitable for realizing e-Neuroscience and generic enough to be applied in other biomedical fields.
Reporting FDR analogous confidence intervals for the log fold change of differentially expressed genes
Volume 12 - Pages 1-9 - 2011
Klaus Jung, Tim Friede, Tim Beißbarth
Gene expression experiments are common in molecular biology, for example in order to identify genes which play a certain role in a specified biological framework. For that purpose expression levels of several thousand genes are measured simultaneously using DNA microarrays. To detect the genes that are differentially expressed between two distinct groups of tissue samples, one statistical test per gene is performed, and the resulting p-values are adjusted to control the false discovery rate. In addition, the expression change of each gene is quantified by some effect measure, typically the log fold change. In certain cases, however, a gene with a significant p-value can have a rather small fold change while in other cases a non-significant gene can have a rather large fold change. The biological relevance of the change of gene expression can be more intuitively judged by a fold change than merely by a p-value. Therefore, confidence intervals for the log fold change which accompany the adjusted p-values are desirable. In a new approach, we employ an existing algorithm for adjusting confidence intervals in the case of high-dimensional data and apply it to a widely used linear model for microarray data. Furthermore, we adopt a concept of different relevance categories for effects in clinical trials to assess biological relevance of genes in microarray experiments. In a brief simulation study the properties of the adjusting algorithm are maintained when being combined with the linear model for microarray data. In two cancer data sets the adjusted confidence intervals can indicate significance of large fold changes and distinguish them from other large but non-significant fold changes. Adjusting the confidence intervals also corrects the assessment of biological relevance. Our new combination approach and the categorization of fold changes facilitate the selection of genes in microarray experiments and help to interpret their biological relevance.
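The idea of pairing FDR-adjusted p-values with adjusted confidence intervals can be illustrated with a minimal sketch. The abstract does not name the adjusting algorithm, so this example assumes a false-coverage-rate style adjustment: after Benjamini-Hochberg selection of R out of m genes at level q, each selected gene gets an interval at level 1 - R*q/m. It also assumes a normal approximation for the log fold change, whereas a linear model for microarray data would typically use a moderated t-distribution. The function names `bh_adjust` and `fcr_intervals` and all input values are hypothetical.

```python
from statistics import NormalDist


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fcr_intervals(log_fc, std_err, pvalues, q=0.05):
    """Adjusted intervals for the genes selected by BH at level q.

    Assumption: normal approximation, interval level 1 - R*q/m,
    where R is the number of selected genes and m the number tested.
    """
    adjusted = bh_adjust(pvalues)
    selected = [i for i, p in enumerate(adjusted) if p <= q]
    m, r = len(pvalues), len(selected)
    if r == 0:
        return {}
    alpha = r * q / m  # stricter than q whenever fewer than m genes are selected
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {i: (log_fc[i] - z * std_err[i], log_fc[i] + z * std_err[i])
            for i in selected}


# Hypothetical toy input: five genes with log fold changes and standard errors.
log_fc = [2.0, 1.5, 0.9, 0.8, 0.1]
std_err = [0.5, 0.5, 0.4, 0.4, 0.3]
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
intervals = fcr_intervals(log_fc, std_err, pvalues, q=0.05)
```

With this toy input only the first two genes are selected, and their intervals are wider than unadjusted 95% intervals because the level is tightened from 0.05 to 2 * 0.05 / 5 = 0.02.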
Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports
Volume 24, Issue 1
Haitham Elmarakeby, Pavel Trukhanov, Vidal M. Arroyo, Irbaz Bin Riaz, Deborah Schrag, Eliezer M. Van Allen, Kenneth L. Kehl
Background: Longitudinal data on key cancer outcomes for clinical research, such as response to treatment and disease progression, are not captured in standard cancer registry reporting. Manual extraction of such outcomes from unstructured electronic health records is a slow, resource-intensive process. Natural language processing (NLP) methods can accelerate outcome annotation, but they require substantial labeled data. Transfer learning based on language modeling, particularly using the Transformer architecture, has achieved improvements in NLP performance. However, there has been no systematic evaluation of NLP model training strategies on the extraction of cancer outcomes from unstructured text. Results: We evaluated the performance of nine NLP models at the two tasks of identifying cancer response and cancer progression within imaging reports at a single academic center among patients with non-small cell lung cancer. We trained the classification models under different conditions, including training sample size, classification architecture, and language model pre-training. The training involved a labeled dataset of 14,218 imaging reports for 1112 patients with lung cancer. A subset of models was based on a pre-trained language model, DFCI-ImagingBERT, created by further pre-training a BERT-based model using an unlabeled dataset of 662,579 reports from 27,483 patients with cancer from our center. A classifier based on our DFCI-ImagingBERT, trained on more than 200 patients, achieved the best results in most experiments; however, these results were marginally better than simpler “bag of words” or convolutional neural network models. Conclusion: When developing AI models to extract outcomes from imaging reports for clinical cancer research, if computational resources are plentiful but labeled training data are limited, large language models can be used for zero- or few-shot learning to achieve reasonable performance. When computational resources are more limited but labeled training data are readily available, even simple machine learning architectures can achieve good performance for such tasks.
A successful hybrid deep learning model aiming at promoter identification
Volume 23 - Pages 1-20 - 2022
Ying Wang, Qinke Peng, Xu Mou, Xinyuan Wang, Haozhou Li, Tian Han, Zhao Sun, Xiao Wang
The zone adjacent to a transcription start site (TSS), namely, the promoter, is primarily involved in the process of DNA transcription initiation and regulation. As a result, proper promoter identification is critical for further understanding the mechanism of the networks controlling genomic regulation. A number of methodologies for the identification of promoters have been proposed. Nonetheless, due to the great heterogeneity existing in promoters, the results of these procedures are still unsatisfactory. In order to establish additional discriminative characteristics and properly recognize promoters, we developed the hybrid model for promoter identification (HMPI), a hybrid deep learning model that can characterize both the native sequences of promoters and the morphological outline of promoters at the same time. We developed the HMPI to combine a method called the PSFN (promoter sequence features network), which characterizes native promoter sequences and deduces sequence features, with a technique referred to as the DSPN (deep structural profiles network), which is specially structured to model the promoters in terms of their structural profile and to deduce their structural attributes. The HMPI was applied to human, plant and Escherichia coli K-12 strain datasets, and the findings showed that the HMPI was successful at extracting the features of the promoter while greatly enhancing the promoter identification performance. In addition, after the improvements of synthetic sampling, transfer learning and label smoothing regularization, the improved HMPI models achieved good results in identifying subtypes of promoters on prokaryotic promoter datasets. The results showed that the HMPI was successful at extracting the features of promoters while greatly enhancing the performance of identifying promoters on both eukaryotic and prokaryotic datasets, and the improved HMPI models are good at identifying subtypes of promoters on prokaryotic promoter datasets. 
The HMPI is additionally adaptable to different biological functional sequences, allowing for the addition of new features or models.
Rank Difference Analysis of Microarrays (RDAM), a novel approach to statistical analysis of microarray expression profiling data
Volume 5, Issue 1
Dietmar E. Martin, Philippe Demougin, Michael N. Hall, Michel Bellis
A vascular image registration method based on network structure and circuit simulation
Volume 18, Issue 1 - 2017
Li Chen, Yuxi Lian, Yi Guo, Yuan‐Yuan Wang, Thomas S. Hatsukami, Kristi Pimentel, Niranjan Balu, Chun Yuan
Hypergraph models of biological networks to identify genes critical to pathogenic viral response
Volume 22 - Pages 1-21 - 2021
Song Feng, Emily Heath, Brett Jefferson, Cliff Joslyn, Henry Kvinge, Hugh D. Mitchell, Brenda Praggastis, Amie J. Eisfeld, Amy C. Sims, Larissa B. Thackray, Shufang Fan, Kevin B. Walters, Peter J. Halfmann, Danielle Westhoff-Smith, Qing Tan, Vineet D. Menachery, Timothy P. Sheahan, Adam S. Cockrell, Jacob F. Kocher, Kelly G. Stratton, Natalie C. Heller, Lisa M. Bramer, Michael S. Diamond, Ralph S. Baric, Katrina M. Waters, Yoshihiro Kawaoka, Jason E. McDermott, Emilie Purvine
Representing biological networks as graphs is a powerful approach to reveal underlying patterns, signatures, and critical components from high-throughput biomolecular data. However, graphs do not natively capture the multi-way relationships present among genes and proteins in biological systems. Hypergraphs are generalizations of graphs that naturally model multi-way relationships and have shown promise in modeling systems such as protein complexes and metabolic reactions. In this paper we seek to understand how hypergraphs can more faithfully identify, and potentially predict, important genes based on complex relationships inferred from genomic expression data sets. We compiled a novel data set of transcriptional host response to pathogenic viral infections and formulated relationships between genes as a hypergraph where hyperedges represent significantly perturbed genes, and vertices represent individual biological samples with specific experimental conditions. We find that hypergraph betweenness centrality is a superior method for identification of genes important to viral response when compared with graph centrality. Our results demonstrate the utility of using hypergraphs to represent complex biological systems and highlight central important responses in common to a variety of highly pathogenic viruses.
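The construction described above, with genes as hyperedges over sample vertices, can be sketched in a few lines. The abstract does not define hypergraph betweenness centrality, so this example assumes one common formulation: build an s-line graph in which two hyperedges (genes) are linked when they share at least s vertices (conditions), then compute ordinary betweenness on that graph with Brandes' algorithm. The function names, the perturbation threshold, and the toy scores are all hypothetical.

```python
from collections import deque
from itertools import combinations


def build_hyperedges(scores, threshold):
    """Map each gene to the set of conditions where |score| >= threshold."""
    return {gene: {cond for cond, value in by_cond.items()
                   if abs(value) >= threshold}
            for gene, by_cond in scores.items()}


def s_line_graph(hyperedges, s=1):
    """Link two hyperedges (genes) that share at least s vertices (conditions)."""
    adj = {gene: set() for gene in hyperedges}
    for a, b in combinations(hyperedges, 2):
        if len(hyperedges[a] & hyperedges[b]) >= s:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    centrality = dict.fromkeys(adj, 0.0)
    for src in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from src
        dist = dict.fromkeys(adj, -1)
        sigma[src], dist[src] = 1, 0
        queue = deque([src])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        # Accumulate dependencies in order of decreasing distance from src.
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != src:
                centrality[w] += delta[w]
    # Each unordered pair of endpoints was counted twice.
    return {v: c / 2 for v, c in centrality.items()}


# Hypothetical log fold changes for three genes across four conditions.
scores = {
    "geneA": {"c1": 2.1, "c2": -1.8, "c3": 0.2, "c4": 0.1},
    "geneB": {"c1": 0.3, "c2": 1.9, "c3": -2.2, "c4": 0.0},
    "geneC": {"c1": 0.1, "c2": 0.4, "c3": 1.7, "c4": 2.5},
}
hyperedges = build_hyperedges(scores, threshold=1.5)
ranking = betweenness(s_line_graph(hyperedges, s=1))
```

In this toy case geneB shares a condition with both other genes, which share none with each other, so geneB is the only gene lying on a shortest path between the others.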
The PHA Depolymerase Engineering Database: A systematic analysis tool for the diverse family of polyhydroxyalkanoate (PHA) depolymerases
Volume 10, Issue 1 - 2009
Michael Knoll, Thomas Hamm, Florian Wagner, Virginia Martínez, Jürgen Pleiss
Background: Polyhydroxyalkanoates (PHAs) can be degraded by many microorganisms using intra- or extracellular PHA depolymerases. PHA depolymerases are very diverse in sequence and substrate specificity, but share a common α/β-hydrolase fold and a catalytic triad, which is also found in other α/β-hydrolases. Results: The PHA Depolymerase Engineering Database (DED, http://www.ded.uni-stuttgart.de) has been established as a tool for systematic analysis of this enzyme family. The DED contains sequence entries of 587 PHA depolymerases, which were assigned to 8 superfamilies and 38 homologous families based on their sequence similarity. For each family, multiple sequence alignments and profile hidden Markov models are provided, and functionally relevant residues are annotated. Conclusion: The DED is a valuable tool which can be applied to identify new PHA depolymerase sequences from complete genomes in silico, to classify PHA depolymerases, to predict their biochemical properties, and to design enzyme variants with improved properties.
NeuronBridge: an intuitive web application for neuronal morphology search across large data sets
2024
Jody Clements, Cristian Goina, Philip M. Hubbard, Takashi Kawase, Donald J. Olbris, Hideo Otsuna, Robert Svirskas, Konrad Rokicki
Neuroscience research in Drosophila is benefiting from large-scale connectomics efforts using electron microscopy (EM) to reveal all the neurons in a brain and their connections. To exploit this knowledge base, researchers relate a connectome’s structure to neuronal function, often by studying individual neuron cell types. Vast libraries of fly driver lines expressing fluorescent reporter genes in sets of neurons have been created and imaged using confocal light microscopy (LM), enabling the targeting of neurons for experimentation. However, creating a fly line for driving gene expression within a single neuron found in an EM connectome remains a challenge, as it typically requires identifying a pair of driver lines where only the neuron of interest is expressed in both. This task and other emerging scientific workflows require finding similar neurons across large data sets imaged using different modalities. Here, we present NeuronBridge, a web application for easily and rapidly finding putative morphological matches between large data sets of neurons imaged using different modalities. We describe the functionality and construction of the NeuronBridge service, including its user-friendly graphical user interface (GUI), extensible data model, serverless cloud architecture, and massively parallel image search engine. NeuronBridge fills a critical gap in the Drosophila research workflow and is used by hundreds of neuroscience researchers around the world. We offer our software code, open APIs, and processed data sets for integration and reuse, and provide the application as a service at http://neuronbridge.janelia.org.
cnvCurator: an interactive visualization and editing tool for somatic copy number variations
2015
Lingnan Ma, Maochun Qin, Biao Liu, Qiang Hu, Lei Wei, Jianmin Wang, Song Liu