Learning protein-ligand binding affinity with atomic environment vectors. Springer Science and Business Media LLC - 2021
Rocco Meli, Andrew Anighoro, Michael J. Bodkin, Garrett M. Morris, Philip C. Biggin
Abstract
Scoring functions for the prediction of protein-ligand binding affinity have seen renewed interest in recent years as novel machine learning and deep learning methods have started to consistently outperform classical scoring functions. Here we explore the use of atomic environment vectors (AEVs) and feed-forward neural networks, the building blocks of several neural network potentials, for the prediction of protein-ligand binding affinity. The AEV-based scoring function, which we term AEScore, is shown to perform as well as or better than other state-of-the-art scoring functions on binding affinity prediction, with an RMSE of 1.22 pK units and a Pearson's correlation coefficient of 0.83 on the CASF-2016 benchmark. However, AEScore does not perform as well in docking and virtual screening tasks, for which it has not been explicitly trained. Therefore, we show that the model can be combined with the classical scoring function AutoDock Vina in the context of Δ-learning, where corrections to the AutoDock Vina scoring function are learned instead of the protein-ligand binding affinity itself. Combined with AutoDock Vina, Δ-AEScore has an RMSE of 1.32 pK units and a Pearson's correlation coefficient of 0.80 on the CASF-2016 benchmark, while retaining the docking and screening power of the underlying classical scoring function.
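The Δ-learning setup described above can be sketched in a few lines: rather than predicting affinity directly, a model learns a correction added to a classical baseline score. This is a minimal illustrative sketch, not the authors' implementation; the toy scoring and correction functions below are hypothetical stand-ins for AutoDock Vina and the trained AEV network.

```python
# Minimal sketch of the Delta-learning idea (illustrative only): the final
# prediction is a classical baseline score plus a learned correction.
# Both functions below are toy stand-ins, not the paper's models.

def baseline_score(features):
    """Stand-in for a classical scoring function such as AutoDock Vina."""
    return sum(features) * 0.1  # toy linear score, in pK units

def learned_correction(features, weights):
    """Stand-in for a trained feed-forward network on AEV-like features."""
    return sum(f * w for f, w in zip(features, weights))

def delta_score(features, weights):
    # Delta-learning: classical score + learned correction.
    return baseline_score(features) + learned_correction(features, weights)

features = [1.0, 2.0, 3.0]          # hypothetical per-complex features
weights = [0.05, -0.02, 0.01]       # would be learned in practice
print(round(delta_score(features, weights), 3))
```

In practice the correction model is trained on the residuals between experimental affinities and the baseline scores, so the classical function's docking behaviour is preserved while its affinity estimates improve.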
CardioTox net: a robust predictor for hERG channel blockade based on deep learning meta-feature ensembles. Springer Science and Business Media LLC - Volume 13 - Pages 1-13 - 2021
Abdul Karim, Matthew Lee, Thomas Balle, Abdul Sattar
Blockade of the human ether-a-go-go-related gene (hERG) channel by small molecules is a major concern in drug development in the pharmaceutical industry. Blocking hERG channels can cause prolonged QT intervals, which can lead to cardiotoxicity. Many in silico techniques, including deep learning models, are widely used to screen small molecules for potential hERG-related toxicity. Most published deep learning methods use a single type of feature, which can limit their performance. Methods based on more feature types, such as DeepHIT, struggle to aggregate the extracted information. DeepHIT shows better performance when evaluated on one or two accuracy metrics, such as negative predictive value (NPV) and sensitivity (SEN), but struggles when evaluated on others, such as the Matthews correlation coefficient (MCC), accuracy (ACC), positive predictive value (PPV) and specificity (SPE). Therefore, a method is needed that can effectively aggregate the information obtained from models based on different chemical representations and improve hERG toxicity prediction across a range of performance metrics. In this paper, we propose a deep learning framework based on step-wise training to predict the hERG channel blocking activity of small molecules. Our method uses five individual deep learning base models with their respective base features, and a separate neural network to combine the outputs of the five base models. Using three independent external test sets with potent activity (IC50 at a 10 µM threshold), our method achieved better performance on a combination of classification metrics. We also investigated the effective aggregation of the extracted chemical information for robust hERG activity prediction.
In summary, CardioTox net can serve as a robust tool for screening small molecules for hERG channel blockade in drug discovery pipelines, and performs better than previously reported methods across a range of classification metrics.
#hERG #cardiotoxicity #ion channel blockade #deep learning models #drug screening #cheminformatics
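The meta-ensemble idea above (several base models on different chemical representations, merged by a separate learned combiner) can be sketched abstractly. This is a hedged illustration with made-up numbers and weights: the "combiner" here is a toy logistic layer, not the paper's trained network.

```python
import math

# Illustrative sketch of a meta-feature ensemble: base models each emit a
# blocking probability, and a separate learned layer combines them.
# All numbers and weights below are hypothetical.

def meta_combine(base_probs, weights, bias):
    """Combine base-model probabilities with a logistic meta-layer."""
    z = bias + sum(p * w for p, w in zip(base_probs, weights))
    return 1.0 / (1.0 + math.exp(-z))  # combined hERG-blocker probability

# Five hypothetical base-model outputs for one molecule:
base_probs = [0.9, 0.8, 0.7, 0.85, 0.6]
weights = [1.2, 0.8, 0.5, 1.0, 0.4]   # would be learned in practice
prob = meta_combine(base_probs, weights, bias=-1.5)
print(prob > 0.5)  # classify as a blocker at the 0.5 threshold
```

The step-wise training described in the abstract would first fit each base model on its own feature type, then fit the combiner on the base models' outputs.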
InChI in the wild: an assessment of InChIKey searching in Google. Springer Science and Business Media LLC - Volume 5 - Pages 1-10 - 2013
Christopher Southan
While chemical databases can be queried using the InChI string and InChIKey (IK), the latter was designed for open-web searching. It is becoming increasingly effective for this as more sources enhance the crawling of their websites by the Googlebot and the consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK, as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in about 0.3 seconds. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets. Image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures, as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. IK indexing has therefore turned Google into a de facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching.
However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.
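The "skeleton layer" search mentioned above relies on the structure of a standard InChIKey: its first 14-character block is a hash of the molecular connectivity, so querying with that block alone captures stereoisomers sharing the same skeleton. A minimal helper (the key below is a made-up example, not a real compound's key):

```python
# Extract the skeleton (connectivity) layer of a standard InChIKey:
# the first hyphen-separated block of 14 letters.

def skeleton_layer(inchikey):
    """Return the first (connectivity) block of a standard InChIKey."""
    block = inchikey.split("-")[0]
    if len(block) != 14 or not block.isalpha():
        raise ValueError("not a standard InChIKey: %r" % inchikey)
    return block

key = "ABCDEFGHIJKLMN-OPQRSTUVWX-Y"  # hypothetical key, for illustration
print(skeleton_layer(key))  # prints "ABCDEFGHIJKLMN"
```

Pasting just that 14-character block into Google then acts as the isomer-capture query described in the review.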
Virtual screening of bioassay data. Springer Science and Business Media LLC - Volume 1 - Pages 1-12 - 2009
Amanda C Schierz
There are three main problems associated with the virtual screening of bioassay data. The first is access to freely available curated data, the second is the number of false positives that occur in the physical primary screening process, and the third is that the data are highly imbalanced, with a low ratio of Active to Inactive compounds. This paper first discusses these three problems, and then a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) is applied to a variety of bioassay datasets. Pharmaceutical bioassay data are not readily available to the academic community. The data held at PubChem are not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. With regard to the number of false positives that occur in the primary screening process, the analysis carried out has been shallow due to the lack of cross-referencing mentioned above. In the six cases found, the average percentage of false positives from the High-Throughput Primary screen is quite high, at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the setting of the Weka cost matrix depends on the base classifier used, and not solely on the ratio of class imbalance. Understandably, pharmaceutical data are hard to obtain. However, it would be beneficial to both the pharmaceutical industry and academics for curated primary screening data and corresponding confirmatory data to be provided. Two benefits could be gained by employing virtual screening techniques on bioassay data: first, the search space of compounds to be screened is reduced; and second, by analysing the false positives that occur in the primary screening process, the screening technology may be improved.
The number of false positives arising from primary screening raises the issue of whether this type of data should be used for virtual screening at all. Care is also needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based solely on class ratios should not be used when comparing different classifiers on the same dataset.
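The decision rule underlying cost-sensitive classifiers like those above can be sketched independently of Weka: given class probabilities and a misclassification cost matrix, predict the class with the lowest expected cost rather than the highest probability. This is an illustrative stdlib sketch of that rule, not Weka's implementation, and the costs below are made up.

```python
# Cost-sensitive prediction: choose the class minimising expected cost.
# With imbalanced bioassay data, penalising missed Actives heavily shifts
# predictions toward the rare Active class.

def min_expected_cost(probs, cost):
    """probs[i] = P(true class i); cost[i][j] = cost of predicting j when i."""
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__)

# Classes: 0 = Inactive, 1 = Active. Missing an Active is 10x worse (toy costs).
cost = [[0, 1],    # true Inactive: predicting Active costs 1
        [10, 0]]   # true Active: predicting Inactive costs 10
probs = [0.8, 0.2]  # the model favours Inactive...
print(min_expected_cost(probs, cost))  # ...but the expected cost picks Active
```

The abstract's caveat follows directly: the cost matrix that works best depends on the base classifier's probability estimates, not just on the class ratio.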
Comparative analysis of chemical similarity methods for modular natural products with a hypothetical structure enumeration algorithm. Springer Science and Business Media LLC - 2017
Michael A. Skinnider, Chris A. Dejong, Brian C. Franczak, Paul D. McNicholas, Nathan A. Magarvey
Natural products represent a prominent source of pharmaceutically and industrially important agents. Calculating the chemical similarity of two molecules is a central task in cheminformatics, with applications at multiple stages of the drug discovery pipeline. Quantifying the similarity of natural products is a particularly important problem, as the biological activities of these molecules have been extensively optimized by natural selection. The large and structurally complex scaffolds of natural products distinguish their physical and chemical properties from those of synthetic compounds. However, no analysis of the performance of existing methods for molecular similarity calculation specific to natural products has been reported to date. Here, we present LEMONS, an algorithm for the enumeration of hypothetical modular natural product structures. We leverage this algorithm to conduct a comparative analysis of molecular similarity methods within the unique chemical space occupied by modular natural products using controlled synthetic data, and comprehensively investigate the impact of diverse biosynthetic parameters on similarity search. We additionally investigate a recently described algorithm for natural product retrobiosynthesis and alignment, and find that when rule-based retrobiosynthesis can be applied, this approach outperforms conventional two-dimensional fingerprints, suggesting it may represent a valuable approach for the targeted exploration of natural product chemical space and microbial genome mining. Our open-source algorithm is an extensible method of enumerating hypothetical natural product structures with diverse potential applications in bioinformatics.
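The baseline that the retrobiosynthetic alignment is compared against above is Tanimoto similarity on two-dimensional fingerprints. The Tanimoto coefficient itself is simple to state; this is a minimal stdlib sketch over sets of "on" bit positions (the bit positions below are arbitrary illustrations, not real fingerprints).

```python
# Tanimoto (Jaccard) similarity of two fingerprints, represented as
# sets of on-bit indices: |A ∩ B| / |A ∪ B|.

def tanimoto(bits_a, bits_b):
    """Return the Tanimoto coefficient of two on-bit sets."""
    if not bits_a and not bits_b:
        return 1.0  # convention: two empty fingerprints are identical
    inter = len(bits_a & bits_b)
    union = len(bits_a | bits_b)
    return inter / union

fp1 = {1, 4, 7, 9}   # hypothetical fingerprint A
fp2 = {1, 4, 8}      # hypothetical fingerprint B
print(tanimoto(fp1, fp2))  # 2 shared bits / 5 total bits = 0.4
```

For the large, modular scaffolds discussed above, many substructure bits are shared across unrelated natural products, which is one reason fingerprint Tanimoto scores can discriminate poorly in this chemical space.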
Comparison and improvement of the predictability and interpretability with ensemble learning models in QSPR applications. Springer Science and Business Media LLC - Volume 12 - Pages 1-16 - 2020
Chia-Hsiu Chen, Kenichi Tanaka, Masaaki Kotera, Kimito Funatsu
Ensemble learning improves machine learning results by combining several models, producing better predictive performance than a single model. It also benefits and accelerates research in quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) modeling. With the growing number of ensemble learning models such as random forest, the effectiveness of QSAR/QSPR will be limited by the difficulty of interpreting the predictions for researchers. In fact, many implementations of ensemble learning models are able to quantify the overall contribution of each feature. For example, feature importance allows us to assess the relative importance of features and to interpret the predictions. However, different ensemble learning methods or implementations may lead to different feature selections for interpretation. In this paper, we compared the predictability and interpretability of four typical well-established ensemble learning models (random forest, extremely randomized trees, adaptive boosting and gradient boosting) for regression and binary classification modeling tasks. Then, blending methods were built by combining the four ensemble learning methods. The blending method led to better performance and a unified interpretation by summarizing the individual predictions from the different learning models. The important features of two case studies, which gave valuable information about compound properties, are discussed in detail in this report. QSPR modeling with interpretable machine learning techniques can move chemical design forward, working more efficiently, confirming hypotheses, and establishing knowledge for better results.
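The blending step described above can be sketched abstractly: several base learners each produce a prediction, and a blender summarises them. This is a hedged toy version in which the "base models" are simple callables and the blender is a plain average; the paper's blender is likewise built on top of the base models, but its exact form is not reproduced here.

```python
# Minimal blending sketch: average the predictions of several base
# regressors. The four lambdas are hypothetical stand-ins for random
# forest, extra trees, AdaBoost and gradient boosting models.

def blend(models, x):
    """Average the predictions of several base regressors for input x."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

models = [lambda x: 2.0 * x, lambda x: 2.2 * x,
          lambda x: 1.8 * x, lambda x: 2.4 * x]
print(blend(models, 1.0))  # average of the four base predictions
```

A blended importance ranking can be produced the same way, by averaging each base model's per-feature importance scores, which is one route to the "unified interpretation" the abstract describes.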
A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature. Springer Science and Business Media LLC - 2015
Buzhou Tang, Ying Feng, Xiaolong Wang, Yonghui Wu, Yaoyun Zhang, Min Jiang, Jingqi Wang, Hua Xu
Abstract
Background
Chemical compounds and drugs (together called chemical entities) embedded in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of chemical entity recognition systems, the Spanish National Cancer Research Center (CNIO) and The University of Navarra organized a challenge on Chemical and Drug Named Entity Recognition (CHEMDNER). The CHEMDNER challenge contains two individual subtasks: 1) Chemical Entity Mention recognition (CEM); and 2) Chemical Document Indexing (CDI). Our study proposes machine learning-based systems for the CEM task.
Methods
The 2013 CHEMDNER challenge organizers provided 10,000 manually annotated, UTF-8-encoded PubMed abstracts, annotated according to a predefined guideline: a training set of 3,500 abstracts, a development set of 3,500 abstracts and a test set of 3,000 abstracts. We developed machine learning-based systems for the CEM task on this data set, based on conditional random fields (CRF) and structured support vector machines (SSVM), respectively. The effects of three types of word representation (WR) features, generated by Brown clustering, random indexing and skip-gram, on both machine learning-based systems were also investigated. The performance of our systems was evaluated on the test set using scripts provided by the CHEMDNER challenge organizers. The primary evaluation measures were micro precision, recall, and F-measure.
Results
Our best system was among the top ranked systems with an official micro F-measure of 85.05%. Fixing a bug caused by inconsistent features marginally improved the performance (micro F-measure of 85.20%) of the system.
Conclusions
The SSVM-based CEM systems outperformed the CRF-based CEM systems when using the same features. Each type of WR feature was beneficial to the CEM task. Both the CRF-based and SSVM-based systems using all three types of WR features showed better performance than the systems using only one type of WR feature.
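Both the CRF and SSVM systems above consume per-token feature functions. As an illustration of the kind of orthographic features typically used for chemical entity mention recognition (the feature names here are hypothetical; word-representation features such as Brown-cluster prefixes would be appended to the same dictionary):

```python
# Basic orthographic features for one token in a sentence, of the kind
# fed to CRF/SSVM sequence labellers for chemical NER. Illustrative only.

def token_features(tokens, i):
    """Return a feature dict for token i, with left/right context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),   # e.g. "COX-2"
        "has_hyphen": "-" in tok,
        "suffix3": tok[-3:],                          # chemical suffixes
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }

sent = ["Aspirin", "inhibits", "COX-2", "."]
f = token_features(sent, 2)
print(f["has_digit"], f["has_hyphen"])  # True True
```

Digits, hyphens and suffixes are informative for chemical mentions, which is why such surface features remain competitive alongside the WR features the paper investigates.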
Next generation community assessment of biomedical entity recognition web servers: metrics, performance, interoperability aspects of BeCalm. Springer Science and Business Media LLC - Volume 11 - Pages 1-16 - 2019
Martin Pérez-Pérez, Gael Pérez-Rodríguez, Aitor Blanco-Míguez, Florentino Fdez-Riverola, Alfonso Valencia, Martin Krallinger, Anália Lourenço
Shared tasks and community challenges represent key instruments to promote research, collaboration and determine the state of the art of biomedical and chemical text mining technologies. Traditionally, such tasks relied on the comparison of automatically generated results against a so-called Gold Standard dataset of manually labelled textual data, regardless of efficiency and robustness of the underlying implementations. Due to the rapid growth of unstructured data collections, including patent databases and particularly the scientific literature, there is a pressing need to generate, assess and expose robust big data text mining solutions to semantically enrich documents in real time. To address this pressing need, a novel track called “Technical interoperability and performance of annotation servers” was launched under the umbrella of the BioCreative text mining evaluation effort. The aim of this track was to enable the continuous assessment of technical aspects of text annotation web servers, specifically of online biomedical named entity recognition systems of interest for medicinal chemistry applications. A total of 15 out of 26 registered teams successfully implemented online annotation servers. They returned predictions during a two-month period in predefined formats and were evaluated through the BeCalm evaluation platform, specifically developed for this track. The track encompassed three levels of evaluation, i.e. data format considerations, technical metrics and functional specifications. Participating annotation servers were implemented in seven different programming languages and covered 12 general entity types. The continuous evaluation of server responses accounted for testing periods of low activity and moderate to high activity, encompassing overall 4,092,502 requests from three different document provider settings. The median response time was below 3.74 s, with a median of 10 annotations/document. 
Most of the servers showed great reliability and stability, being able to process over 100,000 requests in a 5-day period. The presented track was a novel experimental task that systematically evaluated the technical performance aspects of online entity recognition systems. It raised the interest of a significant number of participants. Future editions of the competition will address the ability to process documents in bulk as well as to annotate full-text documents.
MINEs: open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics. Springer Science and Business Media LLC - Volume 7 - Pages 1-8 - 2015
James G Jeffryes, Ricardo L Colastani, Mona Elbadawi-Sidhu, Tobias Kind, Thomas D Niehaus, Linda J Broadbelt, Andrew D Hanson, Oliver Fiehn, Keith E J Tyo, Christopher S Henry
In spite of its great promise, metabolomics has proven difficult to execute in an untargeted and generalizable manner. Liquid chromatography-mass spectrometry (LC-MS) has made it possible to gather data on thousands of cellular metabolites. However, matching metabolites to their spectral features continues to be a bottleneck, meaning that much of the collected information remains uninterpreted and that new metabolites are seldom discovered in untargeted studies. These challenges require new approaches that consider compounds beyond those available in curated biochemistry databases. Here we present Metabolic In silico Network Expansions (MINEs), an extension of known metabolite databases to include molecules that have not been observed, but are likely to occur based on known metabolites and common biochemical reactions. We utilize an algorithm called the Biochemical Network Integrated Computational Explorer (BNICE) and expert-curated reaction rules based on the Enzyme Commission classification system to propose the novel chemical structures and reactions that comprise MINE databases. Starting from the Kyoto Encyclopedia of Genes and Genomes (KEGG) COMPOUND database, the MINE contains over 571,000 compounds, of which 93% are not present in the PubChem database. However, these MINE compounds have on average higher structural similarity to natural products than compounds from KEGG or PubChem. MINE databases were able to propose annotations for 98.6% of a set of 667 MassBank spectra, 14% more than KEGG alone and equivalent to PubChem, while returning far fewer candidates per spectrum than PubChem (46 vs. 1,715 median candidates). Application of MINEs to LC-MS accurate mass data enabled the identity of an unknown peak to be confidently predicted. MINE databases are freely accessible for non-commercial use via user-friendly web-tools at http://minedatabase.mcs.anl.gov and developer-friendly APIs. MINEs improve metabolomics peak identification as compared to general chemical databases, whose results include irrelevant synthetic compounds. Furthermore, MINEs complement and expand on previous in silico generated compound databases that focus on human metabolism. We are actively developing the database; future versions of this resource will incorporate transformation rules for spontaneous chemical reactions and more advanced filtering and prioritization of candidate structures.
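The network-expansion step at the heart of MINE generation can be sketched abstractly: repeatedly apply generalized reaction rules to a seed compound set and keep the novel products. This is a hedged illustration in which compounds are opaque strings and the two "rules" are made-up stand-ins; the real system applies BNICE reaction operators to chemical structures.

```python
# Abstract sketch of iterative network expansion: apply every rule to
# every frontier compound, keep only previously unseen products, repeat.

def expand(seeds, rules, generations=1):
    """Breadth-first expansion of a compound set by reaction rules."""
    known = set(seeds)
    frontier = set(seeds)
    for _ in range(generations):
        products = {rule(c) for c in frontier for rule in rules}
        frontier = products - known   # keep only novel predicted compounds
        known |= frontier
    return known

rules = [lambda c: c + "+OH",   # stand-in for a hydroxylation rule
         lambda c: c + "+CH3"]  # stand-in for a methylation rule
mine = expand({"A"}, rules, generations=2)
print(len(mine))  # 1 seed + 2 first-generation + 4 second-generation products
```

The combinatorial growth visible even in this toy (products double per generation per rule) is why rule curation and candidate prioritization matter for keeping MINE databases usable.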
Efficient enumeration of monocyclic chemical graphs with given path frequencies. Springer Science and Business Media LLC - 2014
Masaki Suzuki, Hiroshi Nagamochi, Tatsuya Akutsu
The enumeration of chemical graphs (molecular graphs) satisfying given constraints is one of the fundamental problems in chemoinformatics and bioinformatics, because it leads to a variety of useful applications, including structure determination and the development of novel chemical compounds. We consider the problem of enumerating chemical graphs with a monocyclic structure (a graph structure that contains exactly one cycle) from a given set of feature vectors, where a feature vector represents the frequencies of the prescribed paths in a chemical compound to be constructed and the set is specified by a pair of upper and lower feature vectors. To enumerate all tree-like (acyclic) chemical graphs from a given set of feature vectors, Shimizu et al. and Suzuki et al. proposed efficient branch-and-bound algorithms based on a fast tree enumeration algorithm. In this study, we devise a novel method for extending these algorithms to the enumeration of chemical graphs with a monocyclic structure, by designing a fast algorithm for testing uniqueness. The results of computational experiments reveal that the computational efficiency of the new algorithm is as good as that of the algorithms for enumerating tree-like chemical compounds. We thus expand the class of chemical graphs that can be enumerated efficiently.
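The feature-vector constraint used above can be made concrete with a toy example: a feature vector counts occurrences of prescribed labelled paths in a molecular graph, and a candidate is accepted only if its counts fall between the lower and upper vectors. This sketch counts labelled paths of length 1 (bonded atom-label pairs) in an adjacency-list graph; the molecule and the bounds are made-up illustrations, and the real algorithms handle longer paths.

```python
# Path-frequency feature vectors over a labelled graph, plus a check
# against lower/upper bound vectors. Illustrative toy version.

def path1_frequencies(labels, edges):
    """Count each unordered labelled pair over the graph's edges."""
    freq = {}
    for u, v in edges:
        key = tuple(sorted((labels[u], labels[v])))
        freq[key] = freq.get(key, 0) + 1
    return freq

def within_bounds(freq, lower, upper):
    """Accept a candidate iff lower <= freq <= upper componentwise."""
    keys = set(freq) | set(lower) | set(upper)
    return all(lower.get(k, 0) <= freq.get(k, 0) <= upper.get(k, float("inf"))
               for k in keys)

# Toy "molecule": a C-C-O chain.
labels = {0: "C", 1: "C", 2: "O"}
edges = [(0, 1), (1, 2)]
freq = path1_frequencies(labels, edges)
print(freq[("C", "C")], freq[("C", "O")])  # 1 1
print(within_bounds(freq, lower={("C", "C"): 1}, upper={("C", "C"): 2}))
```

In the branch-and-bound algorithms cited above, partial structures whose path counts already violate the upper vector can be pruned before the enumeration recurses further.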