On InChI and evaluating the quality of cross-reference links

Springer Science and Business Media LLC - Volume 6 - Pages 1-15 - 2014

Jakub Galgonek¹, Jiří Vondrášek¹

¹Institute of Organic Chemistry and Biochemistry, Academy of Sciences of the Czech Republic, Prague 6, Czech Republic

Abstract

Many databases of small molecules exist, each focused on a different aspect of research and its applications. Some tasks require integrating information from several of these databases, but determining which entries in different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in the databases. This study uses the well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones.

We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by unclear interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistency of the manually curated links is very high (28.85% in the worst case). Even under a weaker definition of consistency, the measured inconsistency was generally very high. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links.

We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links when they are measured using InChI identifiers. However, inconsistency can be caused both by errors in the manually curated links and by the inherent limitations of the InChI method.
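The comparison described in the abstract can be illustrated with a minimal Python sketch. It assumes that InChI strings have already been generated for every entry, and it uses simplified working definitions that are assumptions of this example rather than the paper's exact ones: a manual link counts as inconsistent when its two endpoints have different InChI strings, and completeness is the share of InChI-derived links that also appear among the manual links. The database contents, entry IDs, and function names are hypothetical.

```python
from collections import defaultdict

# Standard InChI strings for a few simple compounds (toy data).
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"
WATER = "InChI=1S/H2O/h1H2"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"


def find_duplicates(db):
    """Group entry IDs within one database that share the same InChI."""
    groups = defaultdict(list)
    for entry_id, inchi in db.items():
        groups[inchi].append(entry_id)
    return {inchi: sorted(ids) for inchi, ids in groups.items() if len(ids) > 1}


def inchi_links(db_a, db_b):
    """Automatically generated links: entry pairs with identical InChI."""
    by_inchi = defaultdict(list)
    for id_b, inchi in db_b.items():
        by_inchi[inchi].append(id_b)
    return {
        (id_a, id_b)
        for id_a, inchi in db_a.items()
        for id_b in by_inchi.get(inchi, [])
    }


def inconsistency(manual_links, db_a, db_b):
    """Fraction of manual links whose endpoints have different InChIs.

    Links that point at unknown entries are skipped (an assumption).
    """
    checked = [(a, b) for a, b in manual_links if a in db_a and b in db_b]
    if not checked:
        return 0.0
    bad = sum(1 for a, b in checked if db_a[a] != db_b[b])
    return bad / len(checked)


def completeness(manual_links, auto_links):
    """Fraction of InChI-derived links that the manual links also contain."""
    if not auto_links:
        return 1.0
    return len(manual_links & auto_links) / len(auto_links)


# Hypothetical databases mapping entry ID -> InChI.
db_a = {"A1": ETHANOL, "A2": METHANOL, "A3": WATER, "A4": ETHANOL}
db_b = {"B1": ETHANOL, "B2": METHANOL, "B3": BENZENE}

# Hypothetical manually curated cross-references from db_a to db_b.
manual = {("A1", "B1"), ("A2", "B3")}

auto = inchi_links(db_a, db_b)
duplicates = find_duplicates(db_a)
inconsistency_rate = inconsistency(manual, db_a, db_b)
completeness_rate = completeness(manual, auto)
```

In this toy data, `A1` and `A4` are reported as duplicates because they share the ethanol InChI, one of the two manual links joins methanol to benzene and is therefore inconsistent, and only one of the three InChI-derived links is present among the manual links. As the abstract notes, an inconsistent link in such a comparison may reflect a limitation of the InChI method rather than a curation error.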

