Prediction of synergistic drug combinations using PCA-initialized deep learning BioData Mining - Volume 14 - Pages 1-15 - 2021
Jun Ma, Alison Motsinger-Reif
Cancer is one of the main causes of death worldwide. Combination drug therapy has been a mainstay of cancer treatment for decades and has been shown to reduce host toxicity and prevent the development of acquired drug resistance. However, the immense number of possible drug combinations and the large synergistic space make it infeasible to screen all effective drug pairs experimentally. Therefore, it is crucial to develop computational approaches to predict drug synergy and guide experimental design for the discovery of rational combinations for therapy. We present a new deep learning approach to predict synergistic drug combinations by integrating gene expression profiles from cell lines and chemical structure data. Specifically, we use principal component analysis (PCA) to reduce the dimensionality of the chemical descriptor data and gene expression data. We then propagate the low-dimensional data through a neural network to predict drug synergy values. We apply our method to O’Neil’s high-throughput drug combination screening data as well as a dataset from the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge. We compare the neural network approach with and without dimension reduction. Additionally, we demonstrate the effectiveness of our deep learning approach and compare its performance with three state-of-the-art machine learning methods: Random Forests, XGBoost, and elastic net, with and without PCA-based dimensionality reduction. Our approach outperforms the other machine learning methods, and the use of dimension reduction dramatically decreases the computation time without sacrificing accuracy.
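The PCA step described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the feature matrix is made up, only the first principal component is extracted (by power iteration), and the downstream neural network is omitted. All function names are hypothetical.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=200, seed=0):
    """Leading eigenvector of cov (unit length), found by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in cov]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """Score of each centered row along one principal component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]

# Made-up descriptor matrix: 4 samples, 3 features (illustration only).
features = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.4], [3.0, 5.9, 0.6], [4.0, 8.0, 0.5]]
centered = center(features)
pc1 = top_component(covariance(centered))
scores = project(centered, pc1)  # one low-dimensional value per sample
```

In a pipeline of the kind the abstract describes, scores like these (for several components rather than one) would replace the raw high-dimensional features as the network input.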
matK-QR classifier: a patterns based approach for plant species identification BioData Mining - Volume 9, Issue 1 - Pages 1-15 - 2016
More, Ravi Prabhakar, Mane, Rupali Chandrashekhar, Purohit, Hemant J.
DNA barcoding is a widely used and highly efficient approach that facilitates rapid and accurate identification of plant species based on a short, standardized segment of the genome. The nucleotide sequences of the maturaseK (matK) and ribulose-1,5-bisphosphate carboxylase (rbcL) marker loci are commonly used in plant species identification. Here, we present a new and highly efficient approach for identifying a unique set of discriminating nucleotide patterns to generate a signature (i.e. a regular expression) for plant species identification. To generate molecular signatures, we used matK and rbcL loci datasets, which encompass 125 plant species in 52 genera reported by the CBOL plant working group. Initially, we performed Multiple Sequence Alignment (MSA) of all species, followed by Position Specific Scoring Matrix (PSSM) construction for both loci to estimate the percentage of discrimination among species. Further, we detected Discriminating Patterns (DP) at genus and species level using the PSSM for the matK dataset. Combining DP and consecutive pattern distances, we generated molecular signatures for each species. Finally, we performed a comparative assessment of these signatures against existing methods, including BLASTn, Support Vector Machines (SVM), Jrip-RIPPER, J48 (C4.5 algorithm), and Naïve Bayes (NB), on the NCBI-GenBank matK dataset. Because matK gave higher discrimination success than rbcL, we selected the matK gene for signature generation. We generated signatures for 60 species based on the identified discriminating patterns at genus and species level.
Our comparative assessment results suggest that 46 out of 60 species could be correctly identified using the generated signatures, followed by BLASTn (34 species), SVM (18 species), C4.5 (7 species), NB (4 species) and RIPPER (3 species). As a final outcome of this study, we converted the signatures into QR codes and developed a software tool, matK-QR Classifier ( http://www.neeri.res.in/matk_classifier/index.htm ), which searches for signatures in query matK gene sequences and predicts the corresponding plant species. This novel approach of employing pattern-based signatures opens new avenues for the classification of species. In addition to existing methods, we believe that matK-QR Classifier would be a valuable tool for molecular taxonomists, enabling precise identification of plant species.
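The idea of a signature built from discriminating patterns and the distances between consecutive patterns can be sketched as a regular expression. This is an illustration under assumptions, not the matK-QR Classifier's actual signature format: the patterns, gap lengths, and species names are invented, and `build_signature` and `identify` are hypothetical helpers.

```python
import re

def build_signature(patterns, gaps):
    """Join discriminating patterns with fixed-length gaps into one regex.

    patterns: nucleotide strings, in the order they occur in the gene.
    gaps: number of arbitrary bases between consecutive patterns.
    """
    if len(gaps) != len(patterns) - 1:
        raise ValueError("need exactly one gap per adjacent pattern pair")
    parts = [re.escape(patterns[0])]
    for gap, pattern in zip(gaps, patterns[1:]):
        parts.append("[ACGTN]{%d}" % gap)  # any base, exactly `gap` times
        parts.append(re.escape(pattern))
    return re.compile("".join(parts))

def identify(sequence, signatures):
    """Return the names of all species whose signature occurs in sequence."""
    seq = sequence.upper()
    return [name for name, sig in signatures.items() if sig.search(seq)]

# Invented signatures for two placeholder species (not real matK patterns).
signatures = {
    "species_A": build_signature(["ATGGC", "TTAC"], [3]),
    "species_B": build_signature(["ATGGC", "GGAT"], [3]),
}
query = "ccATGGCAAATTACgg"
hits = identify(query, signatures)
print(hits)
```

Both placeholder species share the first pattern, so it is the second pattern at the expected distance that discriminates between them.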
LoFTK: a framework for fully automated calculation of predicted Loss-of-Function variants and genes BioData Mining -
Abdulrahman Alasiri, Konrad J. Karczewski, Brian Cole, Bao‐Li Loza, Jason H. Moore, Sander W. van der Laan, Folkert W. Asselbergs, Brendan J. Keating, Jessica van Setten
Abstract
Background
Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. The association of LoF variants with complex diseases and traits may lead to the discovery and validation of novel therapeutic targets. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Moreover, there is a lack of methods for detecting knockout genes caused by compound heterozygous (CH) LoF variants.
Results
We have developed the Loss-of-Function ToolKit (LoFTK), which allows efficient and automated prediction of LoF variants from genotyped, imputed and sequenced genomes. LoFTK enables the identification of genes that are inactive in one or two copies and provides summary statistics for downstream analyses. LoFTK can identify CH LoF variants, which result in LoF genes with two copies lost. Using data from parents and offspring we show that 96% of CH LoF genes predicted by LoFTK in the offspring have the respective alleles donated by each parent.
Conclusions
LoFTK is a command-line tool that provides a reliable computational workflow for predicting LoF variants from genotyped and sequenced genomes and for identifying genes that are inactive in one or two copies. LoFTK is open software and is freely available to non-commercial users at https://github.com/CirculatoryHealth/LoFTK.
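The notion of a gene being inactive in one or two copies, including the compound heterozygous (CH) case, can be illustrated with a small sketch. It assumes phased genotypes and an invented input format, and does not reflect LoFTK's actual implementation or file formats. The gene names are placeholders.

```python
from collections import defaultdict

def lof_copies_per_gene(variants):
    """Count how many of the two gene copies carry at least one LoF allele.

    variants: iterable of (gene, hap1, hap2); hap1/hap2 are 1 if that phased
    haplotype carries the LoF allele at the variant, else 0.
    """
    hit = defaultdict(lambda: [False, False])
    for gene, hap1, hap2 in variants:
        hit[gene][0] = hit[gene][0] or bool(hap1)
        hit[gene][1] = hit[gene][1] or bool(hap2)
    return {gene: sum(flags) for gene, flags in hit.items()}

def compound_het_genes(variants):
    """Genes losing both copies through different heterozygous LoF variants."""
    variants = list(variants)
    copies = lof_copies_per_gene(variants)
    homozygous = {gene for gene, hap1, hap2 in variants if hap1 and hap2}
    return sorted(g for g, n in copies.items() if n == 2 and g not in homozygous)

# Invented phased calls for one individual.
calls = [
    ("GENE1", 1, 0), ("GENE1", 0, 1),  # two het variants on opposite copies
    ("GENE2", 1, 0), ("GENE2", 1, 0),  # two het variants on the same copy
    ("GENE3", 1, 1),                   # one homozygous variant
]
copies = lof_copies_per_gene(calls)
ch_genes = compound_het_genes(calls)
```

Here `GENE1` is the compound heterozygous case, `GENE2` loses only one copy because both variants sit on the same haplotype, and `GENE3` loses two copies through a single homozygous variant.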
Hypothesis exploration with visualization of variance BioData Mining - Volume 7 - Pages 1-18 - 2014
Douglass Stott Parker, Eliza Congdon, Robert M Bilder
The Consortium for Neuropsychiatric Phenomics (CNP) at UCLA was an investigation into the biological bases of traits such as memory and response inhibition phenotypes, undertaken to explore whether they are linked to syndromes including ADHD, Bipolar disorder, and Schizophrenia. An aim of the consortium was to move from traditional categorical approaches for psychiatric syndromes towards more quantitative approaches based on large-scale analysis of the space of human variation. It represented an application of phenomics (the wide-scale, systematic study of phenotypes) to neuropsychiatry research. This paper reports on a system for exploration of hypotheses in data obtained from the LA2K, LA3C, and LA5C studies in CNP. ViVA is a system for exploratory data analysis using novel mathematical models and methods for visualization of variance. One example of these methods is VISOVA, a combination of visualization and analysis of variance, with the flavor of exploration associated with ANOVA in biomedical hypothesis generation. It permits visual identification of phenotype profiles (patterns of values across phenotypes) that characterize groups. Visualization enables screening and refinement of hypotheses about the variance structure of sets of phenotypes. The ViVA system was designed for exploration of neuropsychiatric hypotheses by interdisciplinary teams. Automated visualization in ViVA supports ‘natural selection’ on a pool of hypotheses, and permits deeper understanding of the statistical architecture of the data. A large-scale perspective of this kind could lead to better neuropsychiatric diagnostics.
Analysis of risk factors progression of preterm delivery using electronic health records BioData Mining - Volume 15 - Pages 1-16 - 2022
Zeineb Safi, Neethu Venugopal, Haytham Ali, Michel Makhlouf, Faisal Farooq, Sabri Boughorbel
Preterm deliveries have many negative health implications for both mother and child. Identifying the population-level factors that increase the risk of preterm delivery is an important step towards mitigating its impact and reducing its frequency. The purpose of this work is to identify preterm delivery risk factors and their progression throughout pregnancy from a large collection of Electronic Health Records (EHR). The study cohort includes about 60,000 deliveries in the USA with the complete medical history from EHR for diagnoses, medications and procedures. We propose a temporal analysis of risk factors by estimating and comparing risk ratios and variable importance at different time points prior to the delivery event. We selected the following time points before delivery: 0, 12 and 24 week(s) of gestation. We did so by conducting a retrospective cohort study of patient history for a selected set of mothers who delivered preterm and a control group of mothers who delivered full-term. We analyzed the extracted data using logistic regression and random forest models. The results of our analyses showed that the highest risk ratio and variable importance correspond to a history of previous preterm delivery. Other risk factors were also identified; some are consistent with those reported in the literature, while others need further investigation. The comparative analysis of the risk factors at different time points showed that risk factors in early pregnancy relate to patient history and chronic conditions, while risk factors in late pregnancy are specific to the current pregnancy. Our analysis unifies several previously reported studies on preterm risk factors. It also gives important insights into how risk factors change over the course of pregnancy. The code used for data analysis will be made available on GitHub.
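The risk-ratio comparison across time points can be made concrete with a short sketch. The counts below are invented purely for illustration and are not taken from the study. `risk_ratio` is a hypothetical helper using the textbook definition: risk among the exposed divided by risk among the unexposed.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of the outcome among the exposed divided by risk among the unexposed."""
    if exposed_total <= 0 or unexposed_total <= 0:
        raise ValueError("group sizes must be positive")
    if unexposed_cases == 0:
        raise ValueError("risk ratio is undefined when the unexposed risk is zero")
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# Invented counts for one risk factor, keyed by time point in weeks:
# (preterm among exposed, exposed, preterm among unexposed, unexposed).
counts_by_week = {
    0: (12, 200, 30, 1000),
    12: (20, 200, 30, 1000),
    24: (36, 200, 30, 1000),
}
ratios = {week: risk_ratio(*c) for week, c in counts_by_week.items()}
for week, rr in sorted(ratios.items()):
    print(f"week {week:>2}: risk ratio = {rr:.2f}")
```

Tabulating the same factor at each time point in this way shows whether its association with preterm delivery strengthens or weakens as the pregnancy progresses.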
DASSI: differential architecture search for splice identification from DNA sequences BioData Mining - - 2021
Shabir Moosa, Abbes Amira, Sabri Boughorbel
Abstract
Background
The data explosion caused by unprecedented advances in genomics is constantly challenging conventional methods for interpreting the human genome. The demand for robust algorithms in recent years has brought great success to the field of Deep Learning (DL) in solving many difficult tasks in image, speech and natural language processing by automating the architecture design process. This has been driven by the development of new DL architectures. However, genomics poses particular challenges that require customization and the development of new DL models.
Methods
We propose a new model, DASSI, by adapting a differential architecture search method and applying it to the splice site (SS) recognition task on DNA sequences to discover new high-performance convolutional architectures in an automated manner. We evaluated the discovered model against state-of-the-art tools for classifying true and false SS in Homo sapiens (Human), Arabidopsis thaliana (Plant), Caenorhabditis elegans (Worm) and Drosophila melanogaster (Fly).
Results
Our experimental evaluation shows that the discovered architecture outperforms baseline models and fixed architectures, and gives competitive results against state-of-the-art models used for splice site classification. The proposed model, DASSI, has a compact architecture and gives very good results on a transfer learning task. Benchmarking experiments on execution time and accuracy during architecture search and evaluation show better performance on currently available GPUs, making it feasible to apply architecture search methods to large datasets.
Conclusions
We propose the use of a differential architecture search method (DASSI) to perform SS classification on raw DNA sequences and to discover new neural network models with a low number of tunable parameters and competitive performance compared with manually designed architectures. We have extensively benchmarked the DASSI model against other state-of-the-art models and assessed its computational efficiency. The results show a high potential for using automated architecture search mechanisms to solve various problems in the field of genomics.
#Genomics #Deep Learning #Splice Site Recognition #DNA Sequences #Architecture Search #Neural Networks
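The abstract does not spell out the search mechanism, so the following is only a sketch of the general idea behind gradient-based architecture search in the DARTS style. It assumes that each connection computes a softmax-weighted mixture of candidate operations and that the highest-weighted candidate is kept after the search. The candidate operations, weights, and function names are hypothetical and are not taken from DASSI.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, alphas):
    """Continuous relaxation: softmax-weighted sum of all candidate outputs."""
    weights = softmax(alphas)
    outputs = [op(x) for op in ops]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))]

def discretize(op_names, alphas):
    """After the search, keep the candidate with the largest architecture weight."""
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return op_names[best]

# Two toy candidate operations on a 1-D signal (placeholders).
def identity(x):
    return list(x)

def zero(x):
    return [0.0 for _ in x]

signal = [1.0, 2.0, 3.0]
mixed = mixed_op(signal, [identity, zero], [0.0, 0.0])  # equal weights
chosen = discretize(["identity", "zero"], [0.2, -1.0])
```

Because the mixture is differentiable in the architecture weights, they can be tuned by gradient descent together with the ordinary network weights, which is what makes this kind of search tractable on GPUs.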
Classification of breast cancer recurrence based on imputed data: a simulation study BioData Mining - Volume 15 - Pages 1-13 - 2022
Rahibu A. Abassi, Amina S. Msengwa
Several studies have been conducted to classify various real-life events, but few are in medical fields, particularly on breast cancer recurrence using statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers’ discriminative ability on breast cancer recurrence in the presence of imputed missing data. Therefore, this article aims to fill this gap by comparing the performance of binary classifiers (logistic regression, linear and quadratic discriminant analysis) on several datasets resulting from imputation under various simulation conditions. Our study adds to the knowledge of how classifiers’ accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputation via chained equations, expectation-maximisation, and predictive mean matching were used to impute the incomplete datasets. For each classifier, classification accuracy and area under the Receiver Operating Characteristic (ROC) curve under the MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) on mean-imputed data at 45% missingness under the MCAR mechanism. Logistic regression on data imputed by predictive mean matching yielded the greatest area under the ROC curve (0.6418) at 30% missingness, while k-nearest neighbour imputation gave the top value (0.6428) at 60% missingness under the MCAR mechanism.
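The simulation setup can be illustrated with a minimal sketch of one condition: values are deleted completely at random (MCAR) at a chosen rate and then filled in by mean imputation. The data and helper names are invented, and the study's other imputation methods and the MAR mechanism are not shown.

```python
import random

def make_mcar(values, fraction, seed=0):
    """Set a random `fraction` of entries to None, independently of the data."""
    rng = random.Random(seed)
    n_missing = round(fraction * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

def mean_impute(values):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Invented numeric variable with 20 observations.
complete = [float(i) for i in range(20)]
incomplete = make_mcar(complete, 0.30)  # 30% missingness
imputed = mean_impute(incomplete)

observed = [v for v in incomplete if v is not None]
print(f"missing: {incomplete.count(None)} of {len(incomplete)}")
print(f"observed mean: {sum(observed) / len(observed):.3f}")
print(f"imputed mean:  {sum(imputed) / len(imputed):.3f}")
```

Mean imputation leaves the variable's mean unchanged but shrinks its variance, which is one reason classifier performance can differ across imputation methods and missingness rates.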
The accelerated aging model reveals critical mechanisms of late-onset Parkinson’s disease BioData Mining - Volume 13, Issue 1 - 2020
Shiyan Li, Hongxin Liu, Shiyu Bian, Xianyi Sha, Yixue Li, Yin Wang
Abstract
Background
Late-onset Parkinson’s disease (LOPD) is a common neurodegenerative disorder that lacks disease-modifying treatments, and it is attracting increasing attention as the population ages. Considerable evidence supports accelerated aging as the primary risk factor for LOPD, which suggests that the mechanisms of PD should be examined in terms of aging acceleration. However, how accelerated aging triggers PD remains unclear, and a systematic prediction model is needed to study the mechanisms of PD.
Results
In this paper, we present an improved PD predictor built by comparison with the normal aging process, and we identify both aging and PD markers using machine learning methods. Based on the aging scores, we constructed an aging acceleration network, in which enrichment analysis shed light on key characteristics of LOPD. As a result, dysregulated energy metabolism, cell apoptosis, neuroinflammation and ion imbalances were identified as crucial factors that jointly link accelerated aging and PD, along with dysfunctions in the immune system.
Conclusions
In short, our computational pipeline integrated the mechanisms linking aging and LOPD.