PSSMHCpan: a novel PSSM-based software for predicting class I peptide-HLA binding affinity Oxford University Press (OUP) - Volume 6, Issue 5 - 2017
Geng Liu, Dongli Li, Zhang Li, Si Qiu, Wenhui Li, Cheng‐Chi Chao, Naibo Yang, Handong Li, Zhen Cheng, Xin Song, Le Cheng, Xiuqing Zhang, Jian Wang, Huanming Yang, Kun Ma, Yong Hou, Bo Li
Abstract
Predicting peptide binding affinity with human leukocyte antigen (HLA) is a crucial step in developing powerful antitumor vaccines for cancer immunotherapy. Currently available methods work quite well in predicting peptide binding affinity with HLA alleles such as HLA-A*0201, HLA-A*0101, and HLA-B*0702 in terms of sensitivity and specificity. However, quite a few types of HLA alleles that are present in the majority of human populations, including HLA-A*0202, HLA-A*0203, HLA-A*6802, HLA-B*5101, HLA-B*5301, HLA-B*5401, and HLA-B*5701, still cannot be predicted with satisfactory accuracy using currently available methods. Furthermore, the most widely used methods for predicting peptide binding affinity are currently inefficient in identifying neoantigens from large quantities of whole genome and transcriptome sequencing data. Here we present a Position Specific Scoring Matrix (PSSM)-based software called PSSMHCpan to accurately and efficiently predict peptide binding affinity with a broad coverage of HLA class I alleles. We evaluated the performance of PSSMHCpan by 10-fold cross-validation on a training database containing 87 HLA alleles and obtained an average area under receiver operating characteristic curve (AUC) of 0.94 and accuracy (ACC) of 0.85. In an independent dataset (Peptide Database of Cancer Immunity) evaluation, PSSMHCpan is substantially better than the popularly used NetMHC-4.0, NetMHCpan-3.0, PickPocket, sNebula, and SMM, with a sensitivity of 0.90 as compared to 0.74, 0.81, 0.77, 0.24, and 0.79, respectively. In addition, PSSMHCpan is more than 197 times faster than NetMHC-4.0, NetMHCpan-3.0, PickPocket, sNebula, and SMM when predicting neoantigens from 661 263 peptides from a breast tumor sample. Finally, we built a neoantigen prediction pipeline and identified 117 017 neoantigens from 467 cancer samples of various cancers from TCGA.
PSSMHCpan is superior to the currently available methods in predicting peptide binding affinity with a broad coverage of HLA class I alleles.
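To illustrate the general idea behind a position specific scoring matrix, the following minimal Python sketch builds a log-odds matrix from a set of equal-length peptides and scores a candidate peptide by summing its per-position values. It is not PSSMHCpan's actual algorithm, whose weighting scheme and affinity conversion are not described in the abstract. The uniform background frequency, the pseudocount, the function names, and the three toy 9-mers are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(peptides, pseudocount=1.0):
    """Return one {amino acid: log-odds} dict per peptide position.

    Assumes a uniform background distribution over the 20 amino acids
    (an illustrative simplification, not taken from the paper).
    """
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score_peptide(pssm, peptide):
    """Sum the position-specific log-odds values for one peptide."""
    if len(peptide) != len(pssm):
        raise ValueError("peptide length does not match the matrix")
    return sum(column[aa] for column, aa in zip(pssm, peptide))


if __name__ == "__main__":
    # Invented toy 9-mers, not real binding data.
    toy_binders = ["ALAKAAAAV", "LLDKAAAAV", "ALDEAAAAL"]
    matrix = build_pssm(toy_binders)
    for candidate in ("ALAKAAAAV", "WWWWWWWWW"):
        print(candidate, round(score_peptide(matrix, candidate), 3))
```

Scoring is a single table lookup per residue, which suggests why matrix-based predictors can be fast on large peptide sets; a higher score here only means the peptide resembles the toy training set.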
Enhanced reproducibility of SADI web service workflows with Galaxy and Docker Oxford University Press (OUP) - Volume 4 - Pages 1-9 - 2015
Mikel Egaña Aranguren, Mark D. Wilkinson
Semantic Web technologies have been widely applied in the life sciences, for example by data providers such as OpenLifeData and through web services frameworks such as SADI. The recently reported OpenLifeData2SADI project offers access to the vast OpenLifeData data store through SADI services. This article describes how to merge data retrieved from OpenLifeData2SADI with other SADI services using the Galaxy bioinformatics analysis platform, thus making this semantic data more amenable to complex analyses. This is demonstrated using a working example, which is made distributable and reproducible through a Docker image that includes SADI tools, along with the data and workflows that constitute the demonstration. The combination of Galaxy and Docker offers a solution for faithfully reproducing and sharing complex data retrieval and analysis workflows based on the SADI Semantic web service design patterns.
OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis Oxford University Press (OUP) - Volume 5 - Pages 1-16 - 2016
Davide Verzotto, Audrey S. M. Teo, Axel M. Hillmer, Niranjan Nagarajan
Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilo base pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences. We introduce a novel seed-and-extend glocal (short for global-local) alignment method, OPTIMA (and a sliding-window extension for overlap alignment, OPTIMA-Overlap), which is the first to create indexes for continuous-valued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technology-dependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutation-based tests. We show that OPTIMA and OPTIMA-Overlap outperform other state-of-the-art approaches (1.6−2 times more sensitive) and are more efficient (170−200 %) and precise in their alignments (nearly 99 % precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.
miRNA Temporal Analyzer (mirnaTA): a bioinformatics tool for identifying differentially expressed microRNAs in temporal studies using normal quantile transformation Oxford University Press (OUP) - Volume 3 - Pages 1-8 - 2014
Regina Z Cer, J Enrique Herrera-Galeano, Joseph J Anderson, Kimberly A Bishop-Lilly, Vishwesh P Mokashi
Understanding the biological roles of microRNAs (miRNAs) is an active area of research that has produced a surge of publications in PubMed, particularly in cancer research. Along with this increasing interest, many open-source bioinformatics tools to identify existing and/or discover novel miRNAs in next-generation sequencing (NGS) reads have become available. While miRNA identification and discovery tools have significantly improved, the development of miRNA differential expression analysis tools, especially for temporal studies, remains substantially challenging. Further, the installation of currently available software is non-trivial, and the steps of testing with example datasets, trying with one’s own dataset, and interpreting the results require notable expertise and time. Consequently, there is a strong need for a tool that allows scientists to normalize raw data, perform statistical analyses, and obtain intuitive results without having to invest significant effort. We have developed miRNA Temporal Analyzer (mirnaTA), a bioinformatics package to identify differentially expressed miRNAs in temporal studies. mirnaTA is written in Perl and R (Version 2.13.0 or later) and can be run across multiple platforms, such as Linux, Mac and Windows. In the current version, mirnaTA requires users to provide a simple, tab-delimited matrix file containing miRNA names and count data from a minimum of two to a maximum of 20 time points and three replicates. To recalibrate data and remove technical variability, raw data are normalized using Normal Quantile Transformation (NQT), and a linear regression model is used to locate any miRNAs that are differentially expressed in a linear pattern. Subsequently, the remaining miRNAs that do not fit a linear model are further analyzed by one of two non-linear methods: 1) cumulative distribution function (CDF) or 2) analysis of variance (ANOVA).
After both linear and non-linear analyses are completed, statistically significant miRNAs (P < 0.05) are plotted as heat maps using hierarchical cluster analysis and Euclidean distance matrix computation methods. mirnaTA is an open-source, bioinformatics tool to aid scientists in identifying differentially expressed miRNAs which could be further mined for biological significance. It is expected to provide researchers with a means of interpreting raw data to statistical summaries in a fast and intuitive manner.
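The normalization and linear-fit steps described above can be sketched in Python as follows. This is a hedged illustration only: mirnaTA itself is written in Perl and R, and the abstract does not give its exact NQT formula. The sketch assumes one common rank-based variant, in which each value is replaced by the standard normal quantile at (rank - 0.5) / n with ties sharing an average rank, followed by an ordinary least-squares slope across time points. The function names and the toy counts are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_transform(values):
    """Map values to standard normal quantiles based on their ranks.

    Assumed variant: quantile = (rank - 0.5) / n, tied values share
    the average of their 1-based ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Invented counts for five miRNAs in one sample.
    toy_counts = [10, 200, 30, 4000, 50]
    print(normal_quantile_transform(toy_counts))
    # Invented normalized values for one miRNA over four time points.
    print(least_squares_slope([1, 2, 3, 4], [-1.0, -0.3, 0.4, 1.1]))
```

Because the transform depends only on ranks, extreme raw counts such as the 4000 above no longer dominate the scale, which is the sense in which it reduces technical variability before the regression step.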
A data repository and analysis framework for spontaneous neural activity recordings in developing retina Oxford University Press (OUP) - Volume 3 - Pages 1-12 - 2014
Stephen John Eglen, Michael Weeks, Mark Jessop, Jennifer Simonotto, Tom Jackson, Evelyne Sernagor
During early development, neural circuits fire spontaneously, generating activity episodes with complex spatiotemporal patterns. Recordings of spontaneous activity have been made in many parts of the nervous system over the last 25 years, reporting developmental changes in activity patterns and the effects of various genetic perturbations. We present a curated repository of multielectrode array recordings of spontaneous activity in developing mouse and ferret retina. The data have been annotated with minimal metadata and converted into HDF5. This paper describes the structure of the data, along with examples of reproducible research using these data files. We also demonstrate how these data can be analysed in the CARMEN workflow system. This article is written as a literate programming document; all programs and data described here are freely available. 1. We hope this repository will lead to novel analysis of spontaneous activity recorded in different laboratories. 2. We encourage published data to be added to the repository. 3. This repository serves as an example of how multielectrode array recordings can be stored for long-term reuse.
Integrated metabolomics and phytochemical genomics approaches for studies on rice Oxford University Press (OUP) - Volume 5 - Pages 1-7 - 2016
Yozo Okazaki, Kazuki Saito
Metabolomics is widely employed to monitor the cellular metabolic state and assess the quality of plant-derived foodstuffs because it can be used to manage datasets that include a wide range of metabolites in their analytical samples. In this review, we discuss metabolomics research on rice in order to elucidate the overall regulation of the metabolism as it is related to the growth and mechanisms of adaptation to genetic modifications and environmental stresses such as fungal infections, submergence, and oxidative stress. We also focus on phytochemical genomics studies based on a combination of metabolomics and quantitative trait locus (QTL) mapping techniques. In addition to starch, rice produces many metabolites that also serve as nutrients for human consumers. The outcomes of recent phytochemical genomics studies of diverse natural rice resources suggest there is potential for using further effective breeding strategies to improve the quality of ingredients in rice grains.
A genome draft of the legless anguid lizard, Ophisaurus gracilis Oxford University Press (OUP) - Volume 4 - Pages 1-3 - 2015
Bo Song, Shifeng Cheng, Yanbo Sun, Xiao Zhong, Jieqiong Jin, Rui Guan, Robert W Murphy, Jing Che, Yaping Zhang, Xin Liu
Transition from a lizard-like to a snake-like body form is one of the most important transformations in reptilian evolution. The increasing number of sequenced reptilian genomes is enabling a deeper understanding of vertebrate evolution, although the genetic basis of the loss of limbs in reptiles remains enigmatic. Here we report genome sequencing, assembly, and annotation for the Asian glass lizard Ophisaurus gracilis, a limbless lizard species with an elongated snake-like body form. Addition of this species to the genome repository will provide an excellent resource for studying the genetic basis of limb loss and trunk elongation. O. gracilis genome sequencing using the Illumina HiSeq2000 platform resulted in 274.20 Gbp of raw data that was filtered and assembled to a final size of 1.78 Gbp, comprising 6,717 scaffolds with N50 = 1.27 Mbp. Based on the k-mer estimated genome size of 1.71 Gbp, the assembly appears to be nearly 100% complete. A total of 19,513 protein-coding genes were predicted, and 884.06 Mbp of repeat sequences (approximately half of the genome) were annotated. The draft genome of O. gracilis has similar characteristics to both lizard and snake genomes. We report the first genome of a lizard from the family Anguidae, O. gracilis. This supplements currently available genetic and genomic resources for amniote vertebrates, representing a major increase in comparative genome data available for squamate reptiles in particular.
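For readers unfamiliar with the assembly statistics quoted above, the short Python sketch below shows how an N50 value is conventionally derived from a list of scaffold lengths, and reproduces the ratio between the reported assembly size (1.78 Gbp) and the k-mer estimated genome size (1.71 Gbp). The function name and the toy scaffold lengths are illustrative assumptions; the actual scaffold lengths of the O. gracilis assembly are not given in the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


if __name__ == "__main__":
    # Invented toy scaffold lengths, not taken from the assembly.
    toy_scaffolds = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("toy N50:", n50(toy_scaffolds))

    # Sizes reported in the abstract, in Gbp.
    assembly_gbp = 1.78
    kmer_estimate_gbp = 1.71
    print("assembly / estimate:", round(assembly_gbp / kmer_estimate_gbp, 2))
```

The ratio comes out slightly above 1, meaning the assembly is marginally larger than the k-mer estimate, which is consistent with the abstract's statement that the assembly appears nearly complete.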
Visualizing genome and systems biology: technologies, tools, implementation techniques and trends, past, present and future Oxford University Press (OUP) - Volume 4 - Pages 1-27 - 2015
Georgios A. Pavlopoulos, Dimitris Malliarakis, Nikolas Papanikolaou, Theodosis Theodosiou, Anton J. Enright, Ioannis Iliopoulos
“A picture is worth a thousand words.” This widely used adage sums up in a few words the notion that a successful visual representation of a concept should enable easy and rapid absorption of large amounts of information. Although, in general, the notion of capturing complex ideas using images is very appealing, would 1000 words be enough to describe the unknown in a research field such as the life sciences? The life sciences are among the biggest generators of enormous datasets, mainly as a result of recent and rapid technological advances; their complexity can make these datasets incomprehensible without effective visualization methods. Here we discuss the past, present and future of genomic and systems biology visualization. We briefly comment on many visualization and analysis tools and the purposes that they serve. We focus on the latest libraries and programming languages that enable more effective, efficient and faster approaches for visualizing biological concepts, and also comment on future human-computer interaction trends that would enable further enhancement of visualization.