2011.10.07. TÁMOP – 4.1.2-08/2/A/KMR-2009-0006
Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial framework
Consortium leader
PETER PAZMANY CATHOLIC UNIVERSITY
Consortium members
SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER
The project has been realised with the support of the European Union and has been co-financed by the European Social Fund.
Exploring known information:
The importance and illustration of literature searching and databases
(World of Molecules)
Compiled by dr. Péter Mátyus
Table of Contents
1. Introduction
2. Useful tools and applications
3. Chemical databases
4. University databases
5. Free databases
6. Protein databases
What is the purpose of a literature search?
Information about a relevant project:
- novelty test
- reproduction
- ‘walk the known way as long as possible...’
- biological effect/property
- new project
I. Introduction
Scientific Publications
Publications in a scientific journal
• letter / short communication
• full article
• review
Scientific lectures
• at a conference (abstract)
• posters (abstract)
Impact factor
The impact factor (IF) is a measure reflecting the average number of citations to articles published in science journals. It is frequently used as a proxy for the relative importance of a journal within its field.
In a given year, the impact factor of a journal is the average number of citations received per paper published in that journal during the two preceding years.
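The two-year definition above can be written out as a short calculation. The sketch below is only an illustration: the function name and the citation and paper counts are hypothetical and do not describe any real journal.

```python
def impact_factor(citations_to_prev_two_years: int, papers_in_prev_two_years: int) -> float:
    """Two-year impact factor of a journal for a given year.

    citations_to_prev_two_years: citations received in the given year by
        papers the journal published in the two preceding years.
    papers_in_prev_two_years: number of papers the journal published in
        those two preceding years.
    """
    if papers_in_prev_two_years <= 0:
        raise ValueError("the journal must have published at least one paper")
    return citations_to_prev_two_years / papers_in_prev_two_years


# Hypothetical journal: 150 + 130 papers in the two preceding years,
# cited 840 times in the year of interest.
print(impact_factor(840, 150 + 130))  # 3.0
```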
The impact factor was devised by Eugene Garfield, the founder of the Institute for Scientific Information (ISI). Impact factors are calculated yearly for those journals that are indexed in Thomson Reuters' Journal Citation Reports.
http://thomsonreuters.com/products_services/science/science_products/a-z/journal_citation_reports
Impact factor and citation
A citation is a reference to a published source. It is an abbreviated alphanumeric expression embedded in the body of an intellectual work that denotes an entry in the bibliographic references section of the work, acknowledging the relevance of the works of others to the topic of discussion at the spot where the citation appears.
A prime purpose of a citation is intellectual honesty: to attribute to other authors the ideas they have previously expressed, rather than give the work's readers the impression that the work's authors are the original wellsprings of those ideas.
A scientific article in a database (Chemical Abstracts)
Basic citation data:
• the title of the publication
• authors (the first is the most important)
• abstract (summary)
A citation should always give:
• title of the journal (abbreviations!)
• volume/issue number, page number(s)
• year of publication
A scientific article in printed form
Available as a ‘Journal’ booklet, or it can be downloaded and printed (as a pdf file).
A scientific article usually has:
• an abstract
• an introduction chapter
• a materials and methods chapter
• a discussion chapter (results)
• a conclusions chapter
• a references (literature) chapter
EXAMPLE: find a scientific article in a database (ScienceDirect)
http://www.sciencedirect.com
What we know:
Journal: Tetrahedron, volume: 66, page number: 2331
• get the abstract
• get the whole article (as a pdf)
EXAMPLE: find a scientific article in a database (ScienceDirect)
http://www.sciencedirect.com
What we know:
Keyword (term, in abstract/title): tert-amino effect; author name: Matyus
Patents
A patent is a set of exclusive rights granted by a state (national government) to an inventor (or inventors) or its (their) assignee for a limited period of time in exchange for a public disclosure of an invention.
The exclusive right granted to a patentee is the right to prevent others from making, using, selling, or distributing the patented invention without permission.
Typically, a patent application must include one or more claims defining the invention which must be new, non-obvious, and useful or industrially applicable. In many countries, certain subject areas are excluded from patents, such as business methods and mental acts, etc.
These rights vary widely between countries according to national laws and international agreements.
Source: http://www.msz.hu/
Patent offices
A patent office is a governmental (or intergovernmental) organization which controls the issue of patents. They are government bodies that may grant a patent or reject the patent application based on whether or not the application fulfils the requirements for patentability.
Hungarian Patent Office: http://www.msz.hu/
European Patent Office: http://ep.espacenet.com/
United States Patent and Trademark Office: http://www.uspto.gov/
A patent office’s database (European Patent Office)
Some typical search options:
• keyword (in title/abstract)
• patent application number
• patent publication number
• applicant (institute/firm)
• inventor (person)
A patent in printed form
Available as a pdf file, it can be downloaded from a patent office’s website
A patent usually has:
• a bibliography
• an abstract
• a description chapter
• a claims chapter
• a search report
EXAMPLE: find a patent on a patent office’s website
http://ep.espacenet.com/
What we know:
Patent publication number: WO9929655
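Before searching, it can help to split a publication number such as WO9929655 into its parts. The sketch below assumes the common layout of a two-letter office code, a run of digits, and an optional kind code (for example A1); this layout is an assumption that is not stated on the slide, and the names `PublicationNumber` and `parse_publication_number` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class PublicationNumber(NamedTuple):
    office: str           # two-letter code of the publishing office, e.g. "WO"
    number: str           # serial digits, kept as text to preserve leading zeros
    kind: Optional[str]   # optional kind code such as "A1"; None if absent


# Assumed layout: two letters, digits, optional letter plus optional digit.
_PATTERN = re.compile(r"([A-Z]{2})(\d+)([A-Z]\d?)?")


def parse_publication_number(text: str) -> PublicationNumber:
    cleaned = text.replace(" ", "").upper()
    match = _PATTERN.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"unrecognised publication number: {text!r}")
    return PublicationNumber(match.group(1), match.group(2), match.group(3))


print(parse_publication_number("WO9929655"))
```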
EXAMPLE: find a patent on a patent office’s website
http://ep.espacenet.com/
What we know:
Keyword: SSAO; inventor: Matyus
II. Useful tools and applications
IUPAC
The International Union of Pure and Applied Chemistry (IUPAC) serves to advance the worldwide aspects of the chemical sciences and to contribute to the application of chemistry in the service of mankind. As a scientific, international, non-governmental and objective body, IUPAC can address many global issues involving the chemical sciences.
IUPAC provides various types of electronic resources:
• Educational resources
• Databases
• Nomenclature and Terminology
• Other
IUPAC nomenclature books
General
Principles of Chemical Nomenclature: a Guide to IUPAC Recommendations
Leigh, G.J.; Favre, H.A. and Metanomski, W.V.
Blackwell Science, 1998 [ISBN 0-86542-6856]
The Gold Book
Compendium of Chemical Terminology
Gold, V.; Loening, K.L.; McNaught, A.D. and Sehmi, P.
Blackwell Science, 1987 [ISBN 0-63201-7651(8)]
IUPAC nomenclature books
The Blue Book
Nomenclature of Organic Chemistry
Rigaudy, J. and Klesney, S.P.
Pergamon, 1979 [ISBN 0-08022-3699]
A Guide to IUPAC Nomenclature of Organic Compounds (recommendations 1993)
Panico, R.; Powell, W.H. and Richer, J-C.
Blackwell Science, 1993 [ISBN 0-63203-4882]
Corrections published in Pure Appl. Chem., Vol. 71, No. 7, pp.1327-
A free tool to generate the IUPAC name of a compound
ChemAxon application: MarvinSketch
http://www.chemaxon.com/
“Free ongoing provision of all tools for teaching, including licenses to allow students of the department to use during tuition”
Encode chemical structure with ASCII: SMILES
The simplified molecular input line entry specification (SMILES) is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.
In July 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI; it also has a wide base of software support with extensive theoretical (e.g., graph theory) backing.
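To make the idea of a short ASCII string concrete, the sketch below counts the non-hydrogen atoms written out in a simple SMILES string. It is a deliberately minimal illustration that handles only the organic-subset atom symbols and rejects bracket atoms; the function name is hypothetical, and a real toolkit should be used for anything beyond a demonstration.

```python
import re
from collections import Counter

# Two-letter symbols are listed first so that "Cl" is not read as "C".
# Lowercase letters denote aromatic atoms (e.g. "c" in benzene).
_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms in a SMILES string (organic subset only)."""
    if "[" in smiles:
        raise ValueError("bracket atoms are outside the scope of this sketch")
    counts = Counter()
    for symbol in _ATOM.findall(smiles):
        counts[symbol.capitalize()] += 1
    return counts


print(heavy_atom_counts("CCO"))       # ethanol: two carbons, one oxygen
print(heavy_atom_counts("c1ccccc1"))  # benzene: six aromatic carbons
```

Bond symbols, ring-closure digits and parentheses are simply skipped here, because they describe connectivity rather than atoms.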
Encode chemical structure with ASCII: InChI, InChI keys
The IUPAC International Chemical Identifier (InChI) is a textual identifier for chemical substances, designed to provide a standard and human-readable way to encode molecular information and to facilitate the search for such information in databases and on the web. Developed by IUPAC and NIST during 2000–2005, the format and algorithms are non-proprietary and the software is freely available under the open source LGPL license (though the term "InChI" is a trademark of IUPAC).
Example 1: ethanol [structure drawing]
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
InChI Key=LFQSCWFLJHTTHZ-UHFFFAOYAB
Example 2: ascorbic acid [structure drawing]
InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1
InChI Key=CIWBSHSKHKDKBQ-JLAZNSOCBT
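An InChI string is built from layers separated by "/": the version, the molecular formula, and then layers introduced by a one-letter prefix (for example "c" for connectivity, "h" for hydrogens, and "t", "m", "s" for stereochemistry). The sketch below splits an identifier into these layers. The function name is hypothetical, and the sketch assumes that each prefix occurs only once, which is not true of every InChI.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into its version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len(prefix):].split("/")
    parsed = {"version": version, "formula": formula}
    for layer in layers:
        if layer:  # skip empty layers
            parsed[layer[0]] = layer[1:]
    return parsed


# Ethanol, as on the slide.
print(split_inchi("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"))
```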
A free tool to generate SMILES/InChI codes for a compound
ACD/Labs application: ChemSketch
“Advanced Chemistry Development (ACD/Labs) has donated free ChemSketch licenses to numerous academic institutions.”
III. Chemical databases
Some commercially available chemical databases
Chemical Abstracts Database
• Company: Chemical Abstracts Service (CAS)
• Application: SciFinder Scholar
• http://www.cas.org/
Reaxys (Beilstein)
• Company: MDL ELSEVIER
• Application: Reaxys
• https://www.reaxys.com/
The Cambridge Structural Database (CSD)
• Cambridge Crystallographic Data Centre (CCDC)
What is Chemical Abstracts?
http://www.cas.org
CAS (Chemical Abstracts Service) is a division of the American Chemical Society. CAS is the most authoritative and comprehensive source for chemical information.
…monitors, indexes, and abstracts the world's chemistry-related literature and patents, updates this information daily, and makes it accessible…
Chemical Abstracts Registry Number (CAS RN)
CAS Registry Numbers (often referred to as CAS RNs or CAS Numbers) are unique identifiers for chemical substances. A CAS Registry Number itself has no inherent chemical significance but provides an unambiguous way to identify a chemical substance or molecular structure when there are many possible systematic, generic, proprietary, or trivial names.
CAS RN 1219909-65-5 is the most recent CAS Registry Number
CAS Registry Numbers are used in many other public and private databases as well as chemical inventory listings and, of course, are
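Although a CAS Registry Number carries no chemical meaning, its last digit is a check digit, so a mistyped number can often be caught before searching. The sketch below assumes the published CAS rule, which is not stated on the slide: the number has the form of two to seven digits, two digits and one check digit, and the check digit equals the sum of the other digits, each multiplied by its position counted from the right, modulo 10. The function name is hypothetical.

```python
import re


def is_valid_cas_rn(cas_rn: str) -> bool:
    """Check the format and the check digit of a CAS Registry Number."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn)
    if match is None:
        return False
    body = match.group(1) + match.group(2)   # all digits except the check digit
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(position * int(digit)
                for position, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


print(is_valid_cas_rn("1219909-65-5"))  # the number quoted on the slide
print(is_valid_cas_rn("1219909-65-4"))  # same number with a wrong check digit
```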
Source: http://www.cas.org
CAS databases
Patent and journal references from all scientific disciplines
• CAplus: > 32 million documents
• MEDLINE: > 18 million references
CAplus: journal articles and patent documents from chemistry and related sciences
• Proteomics
• Genomics
• Biochemistry
• Biochemical genetics
• Organic
• Macromolecular
• Applied
• Physical, inorganic, analytical
MEDLINE: produced by NLM, covers all areas in the broad field of biomedicine
Substance information
• > 53 million organic and inorganic substances
• > 61 million sequences
Information about the many different types of substances, including:
• Synonyms
• Molecular formulas
• Nucleic acid and protein sequences
• Ring analysis data
• Structure diagrams
• Experimental and calculated property data
Chemical synthesis information
• > 23 million single- and multi-step reactions
Reaction information consisting of:
• Structure diagrams for reactants and products
• CAS Registry Numbers for all reactants, products, reagents, solvents, and catalysts
• Yields for many products
• Textual reaction information
CAS databases
Patent and journal references from all scientific disciplines
CAplus
• 1907 to present, plus many records from earlier years
• More than 10,000 scientific journals
• Patents from 60 patent authorities
• Conference proceedings
• Technical reports
• Books
• Dissertations
• Reviews
• Meeting abstracts
• Electronic-only journals
• Web preprints
MEDLINE
• 1947 to present
Substance information
• Complete coverage from 1957 to present
• Many substances back to the early 1900s
• New substances as identified by the CAS Registry System
• GenBank sequences
Organic and inorganic substances including:
• Alloys
• Coordination compounds
• Minerals
• Mixtures
• Polymers
• Salts
Chemical synthesis information
• 1840 to present
• Journals covered for Chemical Abstracts™ since 1985
• Patents covered for CA from 1991 to present
What is SciFinder?
SciFinder is a research discovery tool for the CAS databases, suitable for both professional searchers and research scientists. You do not have to be an expert searcher
SciFinder – search by Research topic
Refine
Too many hits…
Filter by:
• further keywords
• author name
• date of publication
• type of documents
• etc.
Analyze
Organize hit list by:
• author name
• company name
• date of publication
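The difference between the two operations can be sketched in a few lines: Refine narrows a hit list with filters, while Analyze groups the hits without discarding any. The code below is not SciFinder's interface; the `Hit` record, the function names and the sample hits are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    author: str
    year: int
    doc_type: str


def refine(hits, keyword=None, author=None, since=None, doc_type=None):
    """Refine: keep only the hits that pass every filter that was given."""
    kept = []
    for hit in hits:
        if keyword is not None and keyword.lower() not in hit.title.lower():
            continue
        if author is not None and hit.author != author:
            continue
        if since is not None and hit.year < since:
            continue
        if doc_type is not None and hit.doc_type != doc_type:
            continue
        kept.append(hit)
    return kept


def analyze_by_author(hits):
    """Analyze: organize the hit list by author, most frequent first."""
    return Counter(hit.author for hit in hits).most_common()


# Hypothetical hit list, invented for this illustration.
HITS = [
    Hit("Synthesis via the tert-amino effect", "Author A", 2008, "journal"),
    Hit("Amine oxidase inhibitors", "Author B", 2003, "patent"),
    Hit("New applications of the tert-amino effect", "Author A", 2010, "review"),
]

print(len(refine(HITS, keyword="tert-amino", since=2009)))
print(analyze_by_author(HITS))
```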
SciFinder – search by Structure
SciFinder has its own built-in molecule drawing tool to carry out structure-based searches.
SciFinder – search by Structure
• a graphical hit list gives detailed information about each compound
• it is possible to save and organize results
SciFinder – search by Structure
A reaction search returns a list of reaction schemes, which generally give information about the reaction conditions:
- reactants/reagents and their order of application
- reaction time
- temperature
- catalysts, etc.
IV. University databases
Free databases at the University
Semmelweis University Central Library
• Semmelweis University http://www.lib.sote.hu/
• Journal database
• Other databases
‘Elektronikus Információszolgáltatás’ (EISZ, Electronic Information Service)
• National program http://www.eisz.hu/
• Web of Science (WoS)
• Science Direct
Databases available through EISZ
EISZ: Web of Science
http://thomsonreuters.com/products_services/scientific/Web_of_Science
Web of Science® provides researchers, administrators, faculty, and students with quick, powerful access to the world's leading citation databases. Authoritative, multidisciplinary
EISZ: ScienceDirect
http://www.sciencedirect.com/
V. Free databases
Free databases available through the World Wide Web
Public Medline (PubMed):
• U.S. National Library of Medicine
• includes over 19 million citations from MEDLINE and other life science journals
• http://www.pubmed.gov/
Protein databases: UniProt and PDB
• http://www.uniprot.com/
• http://www.pdb.org/
• Protein sequences, structures and protein-related data
PubMed: biomedical literature
PubMed comprises approximately 20 million citations for biomedical literature from MEDLINE, life science journals, and online books. PubMed citations and abstracts include the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and preclinical sciences. PubMed also provides access to additional relevant Web sites and links to the other NCBI molecular biology resources.
PubMed is a free resource that is developed and maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH).
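Because PubMed is free, the same kind of keyword and author query used in the ScienceDirect example can also be prepared programmatically. The sketch below only builds a search URL and does not send it. It assumes NCBI's E-utilities ESearch endpoint and the PubMed field tags [Title/Abstract] and [Author], none of which are mentioned on the slides; the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed endpoint: NCBI E-utilities ESearch (not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(keyword: str, author: str = None, max_results: int = 20) -> str:
    """Build (but do not send) a PubMed search URL."""
    term = f"{keyword}[Title/Abstract]"
    if author:
        term += f" AND {author}[Author]"
    query = urlencode({"db": "pubmed", "term": term, "retmax": max_results})
    return f"{ESEARCH_URL}?{query}"


url = pubmed_search_url("tert-amino effect", author="Matyus")
print(url)
# Decoding the query string recovers the search term that was encoded.
print(parse_qs(urlparse(url).query)["term"][0])
```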
PubMed: Genome Project
PubMed: PubChem Project
http://pubchem.ncbi.nlm.nih.gov/
PubChem provides information on the biological activities of small molecules. It is a component of
VI. Protein databases
PDB identification code
Structures deposited in the Protein Data Bank (PDB) are assigned a unique four-character code, which is often called the PDB accession code or PDB code. Because of the PDB's importance as the central repository for biological macromolecular structures, the PDB code is often used in the scientific literature to refer to a particular structure which has been used in a study.
By convention, the PDB code consists of a single numeric digit followed by three alphanumeric characters. The PDB code is not case sensitive, i.e. 1abc and 1ABC refer to the same structure. For classification purposes, e.g. for the directory structure of the PDB archive, the two middle characters (the second and third character of the PDB code) are sometimes used as an index to group PDB codes into not too large and equally sized bins. This two-character index is preferred over the first and second character because the number of possible values for the first character is limited to the ten digits and the majority of PDB codes in use start with the character '1'.
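The convention described above (one digit, three alphanumeric characters, case-insensitive, middle two characters used as a grouping index) can be sketched as follows. The function names are hypothetical, and normalizing to lowercase is a choice made for this sketch.

```python
import re

# One digit followed by three alphanumeric characters.
_PDB_CODE = re.compile(r"[0-9][A-Za-z0-9]{3}")


def normalize_pdb_code(code: str) -> str:
    """Validate a PDB code and return it in lowercase (codes ignore case)."""
    if _PDB_CODE.fullmatch(code) is None:
        raise ValueError(f"not a valid PDB code: {code!r}")
    return code.lower()


def archive_bin(code: str) -> str:
    """Return the two middle characters used to group codes into bins."""
    return normalize_pdb_code(code)[1:3]


print(normalize_pdb_code("1ABC") == normalize_pdb_code("1abc"))  # True
print(archive_bin("1ABC"))  # ab
```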
Accession number (AC)
Source: http://www.uniprot.org/
This subsection of the ‘Entry information’ section provides one or more accession number(s). These are stable identifiers and should be used to cite UniProtKB entries. Upon integration into UniProtKB, each entry is assigned a unique accession number, which is called ‘Primary (citable) accession number’.
UniProtKB accession numbers consist of 6 alphanumerical characters in the format:
Position:  1          2      3          4          5          6
Format 1:  [A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]
Format 2:  [O,P,Q]    [0-9]  [A-Z,0-9]  [A-Z,0-9]  [A-Z,0-9]  [0-9]
Examples: A2BC19, P12345, P4A123, Q1AAA9
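The two six-character formats above translate directly into a regular expression. The sketch below checks a string against them; the function name is hypothetical, and the pattern covers only the six-character format described here.

```python
import re

# Format 1: [A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
# Format 2: [O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
_ACCESSION = re.compile(
    r"[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]"
    r"|[OPQ][0-9][A-Z0-9]{3}[0-9]"
)


def is_accession(text: str) -> bool:
    """True if text matches one of the two six-character formats."""
    return _ACCESSION.fullmatch(text) is not None


for candidate in ["A2BC19", "P12345", "P4A123", "Q1AAA9", "A12345"]:
    print(candidate, is_accession(candidate))
```

The last candidate fails because an accession number starting with A-N or R-Z must have a letter in the third position.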
Entry name
The UniProtKB/Swiss-Prot entry name consists of up to 11 uppercase alphanumeric characters with a naming convention that can be symbolized as X_Y, where:
• X is a mnemonic protein identification code of at most 5 alphanumeric characters;
• The ’_’ sign serves as a separator;
• Y is a mnemonic species identification code of at most 5 alphanumeric characters.
The mnemonic code ‘X’ is an abbreviation of the protein/gene name, which does not necessarily correspond to the recommended protein name or to the gene name.
Code (X)   Recommended protein name   Gene name
B2MG       Beta-2-microglobulin       B2M
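The X_Y convention can be checked and split with a short pattern. In the sketch below the function name is hypothetical, and the species code HUMAN is an assumption used for illustration, since the table above gives only the protein code B2MG.

```python
import re

# X and Y: one to five uppercase alphanumeric characters each.
_ENTRY_NAME = re.compile(r"([A-Z0-9]{1,5})_([A-Z0-9]{1,5})")


def split_entry_name(name: str) -> tuple:
    """Split an entry name X_Y into (protein code X, species code Y)."""
    match = _ENTRY_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid entry name: {name!r}")
    return match.group(1), match.group(2)


print(split_entry_name("B2MG_HUMAN"))
```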
The .pdb file format
The Protein Data Bank (pdb) file format is a textual file format describing the three dimensional structures of molecules held in the Protein Data Bank. Most of the information in that database pertains to proteins, and the pdb format accordingly provides for rich description and annotation of protein properties. However, proteins are often crystallized in association with other molecules or ions such as water, ions, nucleic acids, drug molecules and so on.
The .pdb file format
HEADER, TITLE and AUTHOR records
provide information about the researchers who defined the structure;
numerous other types of records are available to provide other types of information
REMARK records
can contain free-form annotation, but they also accommodate standardized information; for example, how to compute the coordinates of the experimentally observed multimer from those of the explicitly specified ones of a single repeating unit
The .pdb file format
SEQRES records
give the sequences of the peptide chains (named A, B, C, etc.), which are very short in this example but usually span multiple lines
ATOM records
describe the coordinates of the atoms that are part of the protein. The first three floating-point numbers are the atom's x, y and z coordinates, in units of Ångströms. The next three columns are the occupancy, temperature factor, and the element name, respectively
HETATM records
describe the coordinates of hetero-atoms, that is, those atoms which are not part of the protein molecule
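ATOM and HETATM records use fixed column positions, so they are read by slicing rather than by splitting on whitespace. The sketch below builds one synthetic ATOM record field by field and parses it back. The column positions follow the published PDB format specification rather than the slides, and the atom, residue and coordinate values are hypothetical.

```python
def parse_atom_record(line: str) -> dict:
    """Parse one ATOM or HETATM line using fixed column positions."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),           # columns 7-11
        "name": line[12:16].strip(),         # columns 13-16
        "res_name": line[17:20].strip(),     # columns 18-20
        "chain_id": line[21],                # column 22
        "res_seq": int(line[22:26]),         # columns 23-26
        "x": float(line[30:38]),             # columns 31-38, in Angstroms
        "y": float(line[38:46]),             # columns 39-46
        "z": float(line[46:54]),             # columns 47-54
        "occupancy": float(line[54:60]),     # columns 55-60
        "temp_factor": float(line[60:66]),   # columns 61-66
        "element": line[76:78].strip(),      # columns 77-78
    }


# Synthetic record with hypothetical values, assembled field by field so
# that every value lands in its fixed columns.
SAMPLE_ATOM_LINE = (
    f"{'ATOM':<6}{1:>5} {' N':<4}{'':1}{'MET':>3} {'A':1}{1:>4}{'':1}{'':3}"
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}{1.00:6.2f}{9.67:6.2f}{'':10}{'N':>2}"
)

atom = parse_atom_record(SAMPLE_ATOM_LINE)
print(atom["name"], atom["res_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```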
UniProt Protein database
The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of
A UniProt search example
RCSB Protein Data Bank
http://www.pdb.org/
The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans. Understanding the shape of a molecule helps to understand how it
RCSB Protein Data Bank
Advanced Search: Allows searches of all types - database fields, browsable ontologies, and text searches
Search by organism: browse based on NCBI Taxonomy
RCSB Protein Data Bank
Advanced Search Options
• Author Name
• Chain Length
• Chemical ID
• Chemical Name
• Citation
• Crystal Properties
• Deposit Date
• Enzyme Classification
• Expression Organism
• Keywords
• Latest Released Structures
• Macromolecule Name
• Macromolecule Type
• Molecular Weight
RCSB Protein Data Bank
PDB structures can be viewed on the site in 3D with free plugin PDB viewers. This requires downloading additional free software (or a correctly configured Web browser).
Several free interactive viewers can be downloaded from the web:
• KiNG
• Jmol
• WebMol
• QuickPDB
• Protein Workshop