Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI
Abstract
There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.
I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 million compounds in the ChEMBL database, and on a subset of 1 million compounds from the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully, as was 99.77% of the PubChem subset.
The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain – such as the development of a standard aromatic model for SMILES – the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.
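The core idea, deriving a SMILES atom order from canonical labels, can be illustrated with a minimal Python sketch. It is not the published algorithm: it assumes the canonical labels are already available (here they are hard-coded, standing in for labels obtained from an InChI calculation), it treats all bonds as single, and it ignores stereochemistry, aromaticity and disconnected fragments. The function name `smiles_from_labels` and the traversal rules (start at the lowest label, visit neighbours in order of increasing label) are assumptions made for illustration.

```python
from collections import defaultdict


def smiles_from_labels(symbols, bonds, labels):
    """Write a SMILES string whose atom order is fixed by canonical labels.

    symbols: element symbol per atom, e.g. ["C", "C", "O"]
    bonds:   list of (i, j) atom-index pairs (all treated as single bonds)
    labels:  unique canonical rank per atom (hypothetical stand-in for
             the labels produced by the InChI canonicalisation)
    """
    # Neighbour lists, sorted by canonical label rather than input order.
    nbrs = defaultdict(list)
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for a in nbrs:
        nbrs[a].sort(key=lambda x: labels[x])

    start = min(range(len(symbols)), key=lambda i: labels[i])

    # Pass 1: depth-first search to split bonds into tree bonds
    # (parent -> child) and ring-closure bonds.
    visited = set()
    children = defaultdict(list)
    ring_partners = defaultdict(list)
    closure_bonds = set()

    def explore(atom, parent):
        visited.add(atom)
        for n in nbrs[atom]:
            if n == parent:
                continue
            if n in visited:
                key = frozenset((atom, n))
                if key not in closure_bonds:
                    closure_bonds.add(key)
                    ring_partners[atom].append(n)
                    ring_partners[n].append(atom)
            else:
                children[atom].append(n)
                explore(n, atom)

    explore(start, None)

    # Pass 2: write the string, numbering ring closures in output order.
    digit = {}

    def write(atom):
        out = symbols[atom]
        for other in sorted(ring_partners[atom], key=lambda x: labels[x]):
            key = frozenset((atom, other))
            if key not in digit:
                digit[key] = len(digit) + 1
            d = digit[key]
            out += str(d) if d < 10 else f"%{d}"
        kids = children[atom]
        for k in kids[:-1]:          # all but the last child are branches
            out += "(" + write(k) + ")"
        if kids:
            out += write(kids[-1])
        return out

    return write(start)


# Ethanol entered with two different atom orders but equivalent labels.
print(smiles_from_labels(["C", "C", "O"], [(0, 1), (1, 2)], [1, 2, 3]))
print(smiles_from_labels(["O", "C", "C"], [(0, 1), (1, 2)], [3, 2, 1]))
```

Because the traversal depends only on the labels, both calls print the same string even though the atoms were supplied in a different order. That property is what a shared canonical labelling offers to toolkits that otherwise order atoms differently.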