1) Representation and Manipulation of 2D Molecular Structures.pptx

Full Transcript

Cheminformatics is a field of information technology that uses computers and computer programs to facilitate the collection, storage, analysis, and manipulation of large quantities of chemical data.

Billones Lecture Notes

Chem(o)informatics is an abbreviated form of CHEMICAL INFORMATICS, a term coined by Frank Brown in 1998.

Central concepts behind cheminformatics:
• quantitative structure–activity relationships (QSAR)
• compound property prediction through quantitative structure–property relationships (QSPR)

1.2.1 Graph Theoretic Representations of Chemical Structures

• Chemical structures are usually stored in a computer as molecular graphs.
• A graph is an abstract structure that contains nodes connected by edges.
[Figure: a graph with its nodes and edges labeled]
• In a molecular graph the nodes correspond to the atoms and the edges to the bonds.
• The nodes and edges may have properties associated with them:
  - an atom type with each node
  - a bond order with each edge
• A graph represents only the topology of a molecule, that is, the way the nodes (atoms) are connected.
• A subgraph is a subset of the nodes and edges of a graph.
  - e.g. the graph of benzene is a subgraph of the graph of aspirin
[Figure: aspirin and benzene]
• A tree is a special type of graph in which there is just a single path connecting each pair of vertices.
  - there are no cycles or rings within the graph
[Figure: a tree]
• The root node of a tree is the starting point; the other vertices are either branch nodes or leaf nodes.
• Acyclic molecules are represented using trees.
• In a completely connected graph there is an edge between all pairs of nodes.
• Two graphs that are the same are said to be isomorphic.
  - they have the same number of nodes and edges
  - isomorphism is established by mapping one graph to the other such that every node and edge in one graph has an equivalent counterpart in the other graph

1.2.2 Connection Tables and Linear Notations

• A connection table is a means to communicate the molecular graph to and from the computer.
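As a minimal sketch (not taken from the slides), the simplest connection table can be held as a list of atomic numbers plus a list of bonded atom pairs. The constant and function names below are hypothetical, and the bond order stored with each pair is an optional extra. Because the same two lists define the molecular graph, the tree test for acyclic molecules can be run on them directly.

```python
from collections import deque
from typing import Dict, List, Tuple

Bond = Tuple[int, int, int]  # (atom index, atom index, bond order)

# Hydrogen-suppressed connection tables (illustrative data only).
ETHANOL_ATOMS: List[int] = [6, 6, 8]                 # C, C, O by atomic number
ETHANOL_BONDS: List[Bond] = [(0, 1, 1), (1, 2, 1)]

CYCLOHEXANE_ATOMS: List[int] = [6] * 6
CYCLOHEXANE_BONDS: List[Bond] = [(i, (i + 1) % 6, 1) for i in range(6)]


def neighbors(n_atoms: int, bonds: List[Bond]) -> Dict[int, List[int]]:
    """Turn the bond list into a node -> neighboring nodes view of the graph."""
    adj: Dict[int, List[int]] = {i: [] for i in range(n_atoms)}
    for a, b, _order in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def is_tree(n_atoms: int, bonds: List[Bond]) -> bool:
    """A graph is a tree if it is connected and has exactly n - 1 edges."""
    if n_atoms == 0 or len(bonds) != n_atoms - 1:
        return False
    adj = neighbors(n_atoms, bonds)
    seen = {0}
    queue = deque([0])
    while queue:  # breadth-first walk from atom 0
        for nxt in adj[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == n_atoms


print(is_tree(len(ETHANOL_ATOMS), ETHANOL_BONDS))          # acyclic molecule
print(is_tree(len(CYCLOHEXANE_ATOMS), CYCLOHEXANE_BONDS))  # contains a ring
```

Storing atoms by index keeps the table independent of any drawing, but the result still depends on the arbitrary atom numbering chosen.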
• The simplest type of connection table consists of two sections:
  - a list of the atomic numbers of the atoms in the molecule
  - a list of the bonds, specified as pairs of bonded atoms
• More detailed forms include additional information, e.g. the hybridization state of each atom, bond orders, and xyz coordinates.
• A connection table is hydrogen-suppressed if the hydrogen atoms are not explicitly included (i.e. they are implied).

• Linear notation uses alphanumeric characters to encode the molecular structure.
• Linear notations are more compact than connection tables.
  - useful for storing and transmitting large numbers of molecules
  - an early line notation was the Wiswesser Line Notation (WLN) [1954]

• A more recent linear notation that has found widespread acceptance is the Simplified Molecular Input Line Entry Specification (SMILES) notation [Weininger 1988].
  - much easier to use and comprehend than the WLN
  - just a few rules are needed to write and understand most SMILES strings
  - atoms are represented by their atomic symbols
• Hydrogen atoms are not normally represented explicitly, as SMILES is a hydrogen-suppressed notation.
• Upper case symbols are used for aliphatic atoms and lower case for aromatic atoms.
• Ring closures are shown by giving the two atoms that close the ring the same digit, and branches are enclosed in parentheses.

Name         Formula   SMILES notation
Methane      CH4       C
Cyclohexane  C6H12     C1CCCCC1
Benzene      C6H6      c1ccccc1

• Double bonds are written using “=” and triple bonds using “#”; single and aromatic bonds are not explicitly represented by any symbol.
• The absolute stereochemistry at chiral atoms is indicated using the “@” symbol.
  - e.g. the two enantiomers of alanine: N[C@@H](C)C(O)=O and N[C@H](C)C(O)=O
• Geometrical (E/Z or cis–trans) isomerism about double bonds is indicated using slashes.
  - trans-2-butene: C/C=C/C
  - cis-2-butene: C/C=C\C

1.2.3 Canonical Representation of Molecular Structures

• There may be many different ways to construct the connection table or the SMILES string for a given molecule.
  - in a connection table one may choose different ways to number the atoms
  - in SMILES notation the string may be written starting at a different atom or by following a different sequence through the molecule
  - e.g. aspirin: OC(=O)c1ccccc1OC(=O)C or c1cccc(OC(=O)C)c1C(=O)O
• Chemical database systems need to be able to establish whether two connection tables or two SMILES strings represent the same chemical structure or not (to avoid duplicates).
  - this problem could in principle be tackled by renumbering one of the connection tables in all possible ways and testing for identity
  - this is computationally infeasible, since there are N! different ways of numbering a connection table consisting of N atoms
• A canonical representation is a unique ordering of the atoms for a given molecular graph.
  - a well-known and widely used method for determining a canonical order of the atoms is the Morgan algorithm [Morgan 1965]
  - the SEMA method extends the Morgan algorithm to incorporate stereochemistry information [Wipke et al. 1974]
• A key part of the Morgan algorithm is the iterative calculation of “connectivity values” to enable differentiation of the atoms.
  - initially, each atom is assigned a connectivity value equal to the number of connected atoms
  - in the second and subsequent iterations a new connectivity value is calculated as the sum of the connectivity values of the neighbors
  - the procedure continues until the number of different connectivity values (n) reaches a maximum
  - the atom with the highest connectivity value is then chosen as the first atom in the connection table; its neighbors are then listed (in order of their connectivity values), then their neighbors, and so on
  - if a “tie” occurs then additional properties are considered, such as atomic number and bond order
• The CANGEN algorithm has been developed to generate a unique SMILES string for each molecule [Weininger et al. 1989].
  - this uses similar principles to the Morgan algorithm to produce a canonical ordering of the atoms in the graph
  - the canonical SMILES for aspirin is CC(=O)Oc1ccccc1C(=O)O
• Other formats designed to have general utility include:
  - the canonical InChI representation (formerly IChI) by IUPAC [InChI; Stein et al. 2003]
  - the Chemical Markup Language (CML) [Murray-Rust and Rzepa 1999]

1.3 STRUCTURE SEARCHING

• A structure search converts the structure (the query) into its canonical representation, which can then be compared with the representations stored in the database.
• The representation can also provide the means to retrieve information about a given structure more directly from the database through the generation of a hash key.
  - a hash key is an integer with a value between 0 and some large number (e.g. 2^32 − 1)
• If the hash key corresponds to the physical location on the computer disk where the data associated with that structure is stored, then the information can be retrieved by moving the disk read mechanism directly to that location.
  - e.g. the Augmented Connectivity Molecular Formula used in the Chemical Abstracts Service (CAS) Registry System [Freeland et al. 1979]

1.4 SUBSTRUCTURE SEARCHING

• A substructure search identifies all the molecules in the database that contain a specified substructure.
  - e.g. identifying all structures that contain a particular functional group
[Figure: results of a substructure search of the World Drug Index (WDI), using the dopamine-derived query shown at the top left]

1.4.1 Screening Methods

In graph-theoretic terms, substructure searching is equivalent to determining whether one graph is entirely contained within another, a problem known as subgraph isomorphism.
• Screens are first used to rapidly eliminate molecules that cannot possibly match the substructure query; the aim is to discard about 99% of the database.
• The remaining structures are then subjected to the subgraph isomorphism procedure to determine which of them truly do match the substructure.

For screening, the query substructure and the database molecules are represented as bitstrings, each consisting of a sequence of “0”s and “1”s.
• A “1” in a bitstring usually indicates the presence of a particular structural feature and a “0” its absence.
• A database molecule passes the screen only if every bit set to “1” in the query is also set to “1” in the molecule.
[Figure: the bitstring representation of a query substructure and that of two database molecules, one of which passes the screen and one of which fails]

Screening Methods: Structural Key
• Each position in the bitstring corresponds to the presence or absence of a predefined substructure or molecular feature, which is specified using a fragment dictionary.
[Figure: examples of the types of substructural fragments included in screening systems (in bold)]

Screening Methods: Hashed Fingerprint
• Does not require a predefined fragment dictionary, and so in principle has the advantage of being applicable to any type of molecular structure.
• The fingerprint is built from the linear paths of atoms and bonds found in the molecule, e.g. when using the SMILES representation for aspirin the paths of length zero are just the atoms c, C and O; the paths of length one are cc, cC, CC, C=O, CO and cO, and so on.
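The path enumeration for aspirin, and the hashing of each path into a few bits, can be sketched as follows. This is an illustrative sketch rather than the procedure of any particular fingerprint system: the function names, the 1024-bit length, the use of MD5 as the hash, and the choice of four bits per path are all assumptions made here for the example.

```python
import hashlib
from typing import Dict, List, Set, Tuple

Bond = Tuple[int, int, str]  # (atom index, atom index, bond symbol)

# Aspirin, CC(=O)Oc1ccccc1C(=O)O, hydrogen-suppressed. Lower case marks aromatic
# atoms; "" stands for a single or aromatic bond and "=" for a double bond.
ASPIRIN_ATOMS: List[str] = ["C", "C", "O", "O", "c", "c", "c", "c", "c", "c", "C", "O", "O"]
ASPIRIN_BONDS: List[Bond] = [
    (0, 1, ""), (1, 2, "="), (1, 3, ""), (3, 4, ""),
    (4, 5, ""), (5, 6, ""), (6, 7, ""), (7, 8, ""), (8, 9, ""), (9, 4, ""),
    (9, 10, ""), (10, 11, "="), (10, 12, ""),
]


def linear_paths(atoms: List[str], bonds: List[Bond], max_bonds: int) -> Dict[int, Set[str]]:
    """Collect the distinct linear paths, grouped by the number of bonds they contain."""
    adj: Dict[int, List[Tuple[int, str]]] = {i: [] for i in range(len(atoms))}
    for a, b, symbol in bonds:
        adj[a].append((b, symbol))
        adj[b].append((a, symbol))
    found: Dict[int, Set[str]] = {n: set() for n in range(max_bonds + 1)}

    def walk(path_atoms: List[int], tokens: List[str]) -> None:
        n_bonds = len(path_atoms) - 1
        # A path and its reverse are the same path: keep the spelling that sorts first.
        found[n_bonds].add(min("".join(tokens), "".join(reversed(tokens))))
        if n_bonds == max_bonds:
            return
        for nxt, symbol in adj[path_atoms[-1]]:
            if nxt not in path_atoms:  # linear paths never revisit an atom
                walk(path_atoms + [nxt], tokens + [symbol, atoms[nxt]])

    for start in range(len(atoms)):
        walk([start], [atoms[start]])
    return found


def fingerprint(paths: Set[str], n_bits: int = 1024, bits_per_path: int = 4) -> int:
    """Hash every path to (at most) bits_per_path positions and set those bits."""
    fp = 0
    for path in paths:
        digest = hashlib.md5(path.encode()).digest()  # 16 bytes, stable across runs
        for k in range(bits_per_path):
            position = int.from_bytes(digest[2 * k:2 * k + 2], "big") % n_bits
            fp |= 1 << position
    return fp


def passes_screen(query_fp: int, molecule_fp: int) -> bool:
    """A molecule passes only if every bit set in the query is also set in it."""
    return (query_fp & molecule_fp) == query_fp


paths = linear_paths(ASPIRIN_ATOMS, ASPIRIN_BONDS, max_bonds=3)
print(sorted(paths[0]))
print(sorted(paths[1]))
aspirin_fp = fingerprint(set().union(*paths.values()))
print(bin(aspirin_fp).count("1"), "bits set")
```

For aspirin the length-one set comes out as CC, C=O, CO, Oc, cc and Cc, where Oc and Cc are cO and cC written in the other direction and CC is the carbon-carbon bond of the acetyl group. Because different paths can hash to the same bit, a screen built this way can let through molecules that do not contain the query, but it never rejects one that does.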
• Each of these paths in turn serves as the input to a second program that uses a hashing procedure to set a small number of bits (typically four or five) to “1” in the fingerprint bitstring.
[Figure: aspirin, CC(=O)Oc1ccccc1C(=O)O]

1.4.2 Algorithms for Subgraph Isomorphism

The Ullmann algorithm
• uses adjacency matrices to represent the molecular graphs of the query substructure and the molecule.
• In an adjacency matrix the rows and columns correspond to the atoms in the structure, such that if atoms i and j are bonded then the elements (i, j) and (j, i) of the matrix have the value “1”. If they are not bonded then the value is “0”.
[Figure: the adjacency matrix for aspirin, using the atom numbering shown]
• Suppose there are N_S atoms in the substructure query and N_M atoms in the database molecule, and that S is the adjacency matrix of the substructure and M is the adjacency matrix of the molecule. A matching matrix A is constructed with N_S rows and N_M columns, so that the rows represent the atoms of the substructure and the columns represent the atoms of the database molecule.
• The elements of the matching matrix take the value “1” if a match is possible between the corresponding pair of atoms and “0” otherwise.
  - the aim is to find matching matrices in which each row contains just one element “1” and each column contains no more than one element “1”
  - in such a matrix, an element (i, j) equal to “1” means that atom i in the substructure matches atom j in the molecule; generating and testing every possible matching matrix would find these matches but is time-consuming
  - in the refinement (or relaxation) step, a possible match between substructure atom i and molecule atom j is kept only if every neighbor of i also has a possible match among the neighbors of j; removing impossible matches early makes the search efficient
[Figure: the Ullmann algorithm illustrated with a 5-atom substructure and a 7-atom “molecule”, each represented by its adjacency matrix; one of the proposed matches is shown together with the corresponding matching matrix]
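The matching matrix, the refinement step, and the search for one "1" per row can be sketched as below. This is a simplified illustration under stated assumptions, not a production implementation: atom and bond types are ignored so only the topology is matched, the initial matching matrix uses a degree test, and the function names and the two small example graphs are hypothetical.

```python
from typing import FrozenSet, List, Optional, Tuple

Matrix = List[List[int]]


def adjacency(n_atoms: int, bonds: List[Tuple[int, int]]) -> Matrix:
    """Build a symmetric 0/1 adjacency matrix from a list of bonded atom pairs."""
    m = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        m[a][b] = m[b][a] = 1
    return m


def refine(A: Matrix, S: Matrix, M: Matrix) -> bool:
    """Clear A[i][j] unless every neighbor of query atom i can still match some
    neighbor of molecule atom j. Returns False if a query atom has no candidate left."""
    changed = True
    while changed:
        changed = False
        for i in range(len(S)):
            for j in range(len(M)):
                if not A[i][j]:
                    continue
                for x in range(len(S)):
                    if S[i][x] and not any(M[j][y] and A[x][y] for y in range(len(M))):
                        A[i][j] = 0
                        changed = True
                        break
    return all(any(row) for row in A)


def ullmann(S: Matrix, M: Matrix) -> Optional[List[int]]:
    """Return one mapping (query atom -> molecule atom), or None if there is no match."""
    ns, nm = len(S), len(M)
    deg_s = [sum(row) for row in S]
    deg_m = [sum(row) for row in M]
    # Initial matching matrix: a match is possible if the molecule atom has
    # at least as many neighbors as the query atom.
    start = [[1 if deg_m[j] >= deg_s[i] else 0 for j in range(nm)] for i in range(ns)]

    def search(i: int, A: Matrix, used: FrozenSet[int]) -> Optional[List[int]]:
        if not refine(A, S, M):
            return None
        if i == ns:  # every row holds a single 1 and refinement confirmed the bonds
            return [row.index(1) for row in A]
        for j in range(nm):
            if A[i][j] and j not in used:
                trial = [row[:] for row in A]
                trial[i] = [1 if k == j else 0 for k in range(nm)]  # fix atom i -> j
                result = search(i + 1, trial, used | {j})
                if result is not None:
                    return result
        return None

    return search(0, start, frozenset())


# A 5-atom branched query and a 7-atom "molecule" (six-membered ring plus one substituent).
QUERY_BONDS = [(0, 1), (0, 2), (0, 3), (1, 4)]
MOLECULE_BONDS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]

mapping = ullmann(adjacency(5, QUERY_BONDS), adjacency(7, MOLECULE_BONDS))
print(mapping)
```

Calling `refine` after each tentative assignment is what prunes the search: a branch is abandoned as soon as any query atom is left without a candidate, instead of only when a complete matching matrix fails.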
1.4.3 Practical Aspects of Structure Searching

Modern substructure search methods enable very complex queries to be constructed, e.g.
- an atom is one of a group (e.g. “can match any halogen”)
- an atom should not be a specific atom (e.g. “oxygen or carbon but not nitrogen”)
- certain atoms or bonds must be part of a ring

• The MDL connection table provides special fields to specify substructure-related information.
• SMARTS is an extension of the SMILES language for substructure specification.

Ring systems play a key role in chemistry.
• When bridged and fused ring systems are present there may be a number of different ways in which the rings can be classified.
  - in decalin, for example, one could identify the two 6-membered rings together with the encompassing 10-membered ring
• One approach is to identify the so-called Smallest Set of Smallest Rings (SSSR), the set of rings from which all others in the molecular graph can be constructed.
  - the SSSR comprises those rings containing the fewest atoms; thus, for example, in decalin the SSSR contains the two 6-membered rings

Another practical concern is the representation of aromatic systems.
[Figure: three alternative representations of an aromatic structure]
• Each of the structures in the figure would be considered valid by a chemist, so a computer system must be able to recognize the equivalence of these three representations.
• Other examples: azides, diazo compounds, and compounds that exhibit tautomerism.

1.5 REACTION DATABASES

Reactions are central to the subject of chemistry, being the means by which new chemical entities are produced.
• The Beilstein Handbuch der Organischen Chemie (containing information from 1771 onwards) is useful for chemists wishing to plan their own syntheses.
  - over 9 million reactions
• A reaction search involves the structures or substructures of the precursor or reactant and the product, together with the specific reagents, catalysts, solvents, reaction conditions (temperature, pressure, pH, time, etc.) and the yield.
• SciFinder Scholar
  - over 6 million reactions in the CASREACT database

1.6 REPRESENTATION OF PATENTS AND PATENT DATABASES

The patent system confers a period of exclusivity and protection for a novel, useful and non-obvious invention in exchange for its disclosure in the form of a patent specification.
• The scope of chemical disclosures in patents is most frequently expressed using generic, or Markush, structures, in which variables are used to encode more than one structure into a single representation.
  - the representation can refer to an extremely large number of molecules
• Many of the individual molecules may not actually have been synthesized or tested but are included to ensure that the coverage of the patent is sufficiently broad to embrace the full scope of the invention.

Main types of variation seen in Markush structures:
1. Substituent variation refers to different substituents at a fixed position, for example “R1 is methyl or ethyl”.
2. Position variation refers to different attachment positions.
3. Frequency variation refers to different occurrences of a substructure, for example “(CH2)m where m is from 1 to 3”.
4. Homology variation refers to terms that represent chemical families, for example “R3 is alkyl or an oxygen-containing heterocycle”.

Three stages of the search system:
1. Fragment screening, similar to that used for the substructure search of specific structures.
  - two sets of fragments are generated: those that are common to all specific structures represented by the generic structure and those that occur at least once
2. Search based on a reduced graph representation.
  - in one form (ring/non-ring), fragments such as “phenyl” can be matched against homologous series identifiers such as “aryl ring”
  - a second form (carbon/heteroatom) involves the differentiation between contiguous assemblies of carbons and heteroatoms
[Figure: generation of the ring/non-ring (left) and carbon/heteroatom (right) forms of reduced graph]
3. Search based on a modified Ullmann algorithm.
  - applied on a node-by-node basis
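To see how quickly a Markush structure multiplies into specific molecules, the substituent and frequency variation examples from the list of variation types can be combined in a small sketch. The generic structure used here, phenyl-(CH2)m-O-R1, is a hypothetical example invented for illustration, as are the names in the code; the output strings are plain, non-canonical SMILES.

```python
from itertools import product
from typing import Dict, List

# Hypothetical generic structure: phenyl-(CH2)m-O-R1
R1_CHOICES: Dict[str, str] = {"methyl": "C", "ethyl": "CC"}  # substituent variation
M_VALUES = range(1, 4)                                       # frequency variation, m = 1 to 3


def enumerate_markush() -> List[str]:
    """Write out every specific structure covered by the generic structure."""
    structures = []
    for r1_smiles, m in product(R1_CHOICES.values(), M_VALUES):
        structures.append("c1ccccc1" + "C" * m + "O" + r1_smiles)
    return structures


for smiles in enumerate_markush():
    print(smiles)
```

Two choices for R1 and three values of m already give six structures, and every further variable multiplies the total, which suggests why search systems work on the generic representation rather than on an enumerated list.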
