Basics in bioinformatics
Gábor Rákhely PhD.
Institute of Biophysics BRC HAS
Department of Biotechnology
University of Szeged
rakhely@brc.hu (599)-726
This presentation can be found:
http://biotech.szbk.u-szeged.hu/bioinf/bioinfo_itc.html
Books are available in English
BIOINFORMATICS
INFORMATICS BIOINFORMATICS BIOLOGY
“More than 99% of all scientists who have ever lived are alive today”
The same is true for the data revolution in informatics
INFORMATICS
- experiments, information, production of new information
- treatment, classification (grouping) and displaying of data
- harmonizing of data
Entering data, arrangement of data databanks
Databanks:
- fast exchange of data
- interactive link between databanks
Processing, displaying and evaluation of data
newer information newer, other databanks
PREBIOINFORMATICS:
RESOLVING THE INFORMATION CARRIER
1866 Mendel: crossing experiments with peas, heredity in units
1869 Miescher: purification of salmon sperm DNA
DNA as the hereditary material
1903 WS Sutton: the inheritable pattern is linked to the properties of chromosomes during proliferation
cytochemistry: the chromosome consists of DNA and protein
1925-1928 F. Griffith: mouse infections with Streptococcus pneumoniae
transforming principle
1944 Avery: the transforming compound is DNA
PREBIOINFORMATICS:
RESOLVING THE INFORMATION CARRIER
1952 Hershey and Chase: the DNA of the T2 phage enters the cells
THE ROAD TO THE DOUBLE HELIX
Chargaff E.: the ratios of the nucleotides (A:T and G:C) are equal in humans and E. coli
Biophysical data: e.g. water content of DNA
Rosalind Franklin and Maurice Wilkins X-ray diffraction data
The double helical DNA
The central dogma and the main areas of the bioinformatics in molecular biology
degradation
Transcriptomics, transcriptome
Proteomics, proteome
Genomics
Basically to determine the nucleotide sequence of a genome or extrachromosomal elements
In silico prediction of functional regions, including coding, regulatory regions, splice sites, etc.
The main three branches of the evolutionary tree
(by Woese and colleagues)
Genome sizes in nucleotide base pairs
[Figure: genome size ranges on a scale from 10^4 to 10^11 bp for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds and mammals]
The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.
The human genome is thought to contain ~30,000-40,000 genes.
COMPARISON OF THE CELL ORGANIZATION IN
PROKARYOTES AND EUKARYOTES
exon intron exon
upstream downstream
Start of the biological information (coding region)
End of biological information (coding region)
Regulatory elements
STRUCTURE OF GENES IN EUKARYOTES
alternative splicing
neurofibromatosis type I gene exons
introns
OGMP EVI2B EVI2A
Genes within genes
THE ORGANIZATION OF THE PROKARYOTE GENOME
The model of the E. coli
nucleoid
THE ORGANIZATION OF THE GENES IN
PROKARYOTES: polycistronic structure
DNA MANIPULATION
WITH COMPUTER
DNA sequencing according to SANGER
THE PRINCIPLE OF AUTOMATIC DNA
SEQUENCING
GENOME SEQUENCING STRATEGIES
Shot gun
Primer walking
ALTERNATIVE SHOT GUN STRATEGIES
PRODUCTION OF BACTERIAL
SHOT GUN LIBRARY
Preparation of shotgun library
chromosomal DNA
broken DNA fragments
blunting the ends
Preparative gel electrophoresis
2-3.5 kb fragments
dephosphorylation
transformation electroporation
E. coli
Sequence analysis
checking, validation
Removal of vector and other contaminating sequences
SEQUENCE PROCESSING
Phrap
Vector_clipping SeqMan/DNASTAR
STADEN programme
Manual checking of the sequences
[Figure: individual reads (S11T7, S17T7, S19T7, S148T7, S17SK, S19SK, S148O8, ... SC110T7, SC110SK, S148SK) laid out along a scale marked 2000, 4000, 6000 over the genes orf1, pcaB, orf2, macA, orf-3, pcaH and pcaG]
ARRANGEMENT OF PRIMARY SEQUENCES INTO CONTIG
an example
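Arranging reads into a contig can be sketched as a greedy overlap merge. This is only an illustration on made-up reads, not the method used by Phrap, SeqMan or the STADEN package; the function names and the min_overlap parameter are assumptions introduced here.

```python
def overlap(a: str, b: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing overlaps any more: gaps remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads cut from the start of the example sequence in this presentation
reads = ["CTCGAGACGCTG", "ACGCTGTTTCTG", "TTTCTGGGGTCA"]
print(greedy_assemble(reads))
```

Reads that share no overlap stay as separate contigs, which is the situation that gap closure has to resolve. The sketch ignores sequencing errors, the reverse strand and repeats.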
Partial digestion of genomic DNA with MboI (Sau3AI)
(compatible end with BamHI end)
Size fractionation for 30 – 45 kb fragments
BamHI- XbaI digestion
cos cos
AmpR
ori
ligation
30 – 45 kb fragments
cos cos
in vitro packaging with
GigaPack extract
Selection for ampicillin-resistant clones
Cosmid library
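The compatibility of MboI (Sau3AI) and BamHI ends mentioned above follows from their recognition sites: GATC lies inside GGATCC. The sketch below locates the sites and performs a complete digest of one strand of a made-up sequence; the helper names are hypothetical, and a real partial digest would cut only a subset of these sites.

```python
MBOI_SITE = "GATC"      # MboI / Sau3AI recognition sequence, cut before the G
BAMHI_SITE = "GGATCC"   # BamHI recognition sequence; it contains GATC


def find_sites(seq: str, site: str) -> list[int]:
    """Return the 0-based start of every (possibly overlapping) site."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def complete_digest(seq: str, site: str, cut_offset: int = 0) -> list[str]:
    """Cut a linear sequence at every site; cut_offset is counted from the site start."""
    cuts = [p + cut_offset for p in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


seq = "AAGGATCCTTGATCAA"  # made-up example sequence
print(find_sites(seq, MBOI_SITE))       # MboI sites
print(find_sites(seq, BAMHI_SITE))      # BamHI sites
print(complete_digest(seq, MBOI_SITE))  # fragments after cutting at every GATC
```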
COSMID LIBRARY
A tool for connecting non-overlapping
contigs
PRIMER WALKING
TEMPLATE GENERATING SYSTEMS
In cosmid, BAC, YAC libraries
- STS: sequence tagged site, single 100-500 bp fragment
- EST: expressed sequence tag
USEFUL TOOLS FOR ASSEMBLY:
MAPPING
- genetic: positioning of genes and properties
- physical: arrangement of sequences and genes
ASSEMBLY OF THE CONTIGS: gap closure
DIFFICULTIES IN THE ASSEMBLY:
Abnormal genetic elements:
formation of pseudogenes
A.
conventional pseudogene: loss-of-function mutation
The coding region is damaged
B.
No regulatory region, no elements driving transcription
DIFFICULTIES IN THE ASSEMBLY
Retroelements and retrotransposition
DIFFICULTIES IN THE ASSEMBLY
DNA transposons
Retrotransposons are characteristic mainly of eukaryotes
chromosome 1
chromosome 2
interspersed repeats
tandemly repeated DNA
Long Interspersed Nuclear Elements: LINE
Microsatellites
(short tandem repeats, STR)
repeat unit up to 13 bp, array up to 150 bp long,
e.g. CACACACACACA
On average one occurs every 2 kb
Minisatellites
repeat unit up to 25 bp, array up to 20 kbp long
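Tandem repeats such as the CA microsatellite above can be located with a regular expression that uses a back-reference. This is a rough sketch for exact repeats only; the function name and the max_unit and min_copies parameters are assumptions chosen for illustration.

```python
import re


def find_tandem_repeats(seq: str, max_unit: int = 13, min_copies: int = 4):
    """Return (start, unit, copies) for exact tandem repeats in seq.

    The shortest repeating unit is tried first, so CACACACA is reported
    with unit CA and not with unit CACA.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits


# The CA repeat from the slide, embedded in made-up flanking bases
seq = "GGT" + "CACACACACACA" + "TTG"
print(find_tandem_repeats(seq))
```

Repeats of this kind are a problem for assembly because reads that fall entirely inside a repeat array can be placed at more than one position.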
DIFFICULTIES IN THE ASSEMBLY
REPETITIVE SEQUENCES IN THE GENOMES
“THE COMEDY OF ERRORS”
A SEGMENT OF THE HUMAN GENOME
IF EVERYTHING IS OK, WE HAVE SEQUENCES
What does it contain, a gene or non-coding region?
How do we know we can find anything, e.g. a gene?
CTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGCGGGCTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGGCCGCGGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGCAAAAGA TCGGCATGTCCAGCGCACCTGAAGCTTGGGTGGTTGCTGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGGGCGGTTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGATGTCCCTGG TCAAGCCGCCGCTTACCGCTGCCGGATTGCTGCTCCCGATCTATGTCGTTTCTGATGCATTCGGCGTCTGGCTTTATCGGCACCGGTATTCTGCCTCCAATCTGCGCATCCTGATTCCTT CGGGATTTTTTGGGGTCCTGATTGGCTGGTTATTGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTCATTGTTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTGCTGGCACGACGAG GGGTGCCATCGGTGCCGCGTCAAGCCAACGTGCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGCTTTACCAGCTTTTTGACTCATTCCGGTGCGGCGACCTTCCAGATGTTCGTGCTGC CGCAACGGCTGGACAAGACCATGTTCGCGGGCACATCAACGCTTACCTTTGCTGCCATAAACCTATTCAAGATTCCGTCCTACTGGGCATTGGGACAGCTTTCGACTTCCTCGGTCATGT CCGCGCTAGTGTTGATTCCGGTGGCCGTGGCCGGGACGTTCGCAGGTGTTTTTGCGACGCGCAGGCTATCGACATCCTGGTTCTTCATTCTGGTCCAGGCGATGTTGCTGGTGGTCTCCA TTCAGCTTCTGTGGAGGGGAATGTCGGATATCCTGAACTAGCTGGAGATCGCAATGTCAGAACGCTCAATCAATCAGAATGTAATCTTGACATAGAATACCGTTCCGATTTATTGCTTCG AGTGAAGCTGCCCGTCCGCTGAGATGTCATGACATTTTCCCCGCTTGATTCCGCCCTGCTTGGACCGTTGTTCGCGACCGATGAAATGCGCACGGTCTTCTCCGAACGGCGTTTTTTGGC GGGAATGCTTCGTGTTGAAGTGGCCCTGGCGCGCGCGCAGGCGGCAGAGGGCCTTGTCAGTTCGGAATTGGCCGACGCGATCGAGGTTGTTGGTACTGCCGGGTTGGACCCCGAGGCGAT GGCGGCGACTACTCGCATGACAGGAGTGCCCGCAATATCGTTCGTCCGTGCGGTGCAATCGGCCCTGCCGCCCTCACTGGCGGGTGGATTTCATTTCGGCGCCACCAGTCAAGACATCGT GGATACGGCCCACGCGCTCCAGCTGGCCGAGGCACTCGATATTATAGAAGTCGATTTACACGCCACTGTCAGCGCAATGATGAATCTGGCCGCTGCTCACTGCAATACACCCTGTATCGG GCGCACGGCCTTGCAGCACGCAGCGCCAGTTACGTTCGGCTACAAGGCGTCCGGCTGGTGCGTTGCCCTGGCGGAGCATCTGGTGCAGCTTCCCGCGCTGCGAAAGCGGGTTCTGGTGGC GTCGCTAGGGGGGCCGGTTGGTACCCTTGCCGCGATGGAGGAGCGGGCCGACGCTGTACTGGAGGGTTTCGCTGCGGACCTGGGGTTGGCCATTCCCGCCCTGGCCTGGCACACGCAGCG GGCCCGGATCGTCGAGGTGGCCAGTTGGCTGGCCATATTGCTGGGAATTCTGGCAAAAATGGCCACCGATGTCGTTCACTTGTCCTCCACGGAAGTGCGCGAGCTTTCCGAACCTGTAGC GCCGGGCAGGGGGGGCTCCTCGGCGATGCCTCACAAGCGGAACCCGATTTCCTCGATTACCATCCTGTCCCAGCATGCTGCGGCAGGGGCCCAGCTCTCCATTCTCGTGAACGGCATGGC 
CAGTCTGCACGAACGTCCGGTGGGGGCGTGGCATTCGGAATGGTTGGCTCTGCCGACGCTGTTCGGCCTTGCCGGCGGTGCCGTGCGCGAGGGCAGGTTTCTGGCCGAGGGGCTGCTGGT CGATGCCGACCAGATGGGTCGCAATCTACAATTGACCAATGGCCTGATTTTCAGCGACGCGGTAGCCGGCCAGTTGGCAAAGCACTTGGGTCGGGCCGAGGCTTATGCCGCTGTCGAGGA TGCCGCCGCCGAGGTGTTGCGTTCAGGCGGCAGCTTTCAGGGTCAGCTGAACCAGCGCCTGCCCGATCACCGCGACGCTATCGCTATTGCTTTTGATACGACGCCGGCGATCCAGGCCGG GGCCGCCCGCTGCCGTAGTGCGCTGGATCATGTGGCTCGTATTCTTGGACCCGCCTCTACCATCGGATTTCAAGGAGGCTAATGACGTGACGACACTGTTTGAGGCGACGACCATCCCGA TTTGCGAGGGCCCGCGCGACCAGACCGCCGAGATCCTTTTCGAGATGCCGCCGGGTGCGTGGGATACCCATTTTCATGTTTTTGGCCCAGTTTCATCGTTTCCATACGCAGAACACAGGC TCTATTCCCCACCGGAGTCGCCACTTGAGGATTATCTGGTGTTGATGGAGGCTTTGGGGATCGAGCGCGGCGTTTGTGTCCATCCGAATGTTCATGGTGCCGACAATTCGGTGACGCTCG ACGCAGTTGCGCGGTCCGATGGTCGTCTGCTGGCGGTGATCAAGCCACATCACGAGATGACTTTTGTTCAGCTGCGGGACATGAAGGCGCAGGGGGTCTGCGGGGTACGTTTTGCCTTCA ATCCGCAGCATGGCTCGGGCGAGTTGGATACTCGTTTGTTCGAGCGTATGTTGGACTGGTGCCGCGACCTAGGCTGGTGCGTAAAATTGCATTTCGCGCCCGCTGCGCTGGACGGTCTGG CTGAACGTTTGGCGCGCGTCGATATTCCGATCATCATCGATCATTTCGGGCGGGTGGACACCGCGCAAGGTGTGGATCAGCCGCACTTCCTGCGTTTGCTCGATCTGGCCAAACTGGACC ATGTCTGGATCAAGCTTACGGGGGCAGATCGTATTAGCGGTTCCGGCGCGCCATATGACGATGTCGTGCCCTTCGCGCACGCTTTGGCAGATGTGGCGCCCGACCGCCTCCTCTGGGGTT CGGATTGGCCGCATTCAGGCTATTTCGATCCGAAGCACATACCCAATGACGGCGACTTGTTGAACCTTTTGGCGCGTTTTGCCCCCGATGCTGAACTGCGTCGTAAGATCCTTGTGGACA ACCCGCAGCGCCTGTTCGGGGCTGCTTGAGGAGCCGAGCCGATGCAACCTTTCGTCTACGAAACAGCCCCAGCGCGCGTCGTTTTCGGGCGCGGCACTTCGCAGAATCTGCGGCGGGAAC TTGAGGCCCTGAATTTTGGCAGGGCGCTGGTTCTTTCCACGCCCGACCAAAAAGAACAATCGCTGCGAATTGCCCAGGGCCTGGGTTCTCAGCTGGCGGGGTCGTTCCACGCCGCTGCCA TGCATACGCCTGTCGAGGTCACCTTGCAGGCGCTTGAGGTGCTGAAGGATGTGCAGGCCGATTGCATCGTGGCGATTGGCGGCGGCTCAACCATTGGGTTGGGCAAGGCACTGGCCCTGC GCACCGATCTGCCGCAGATCGTCGTCCCGACGACTTATGCCGGCTCGGAAATGACGCCGATCCTGGGAGAGACGGAAAACGGGCTGAAGACCACACAGCGTAATCCCAAAGTGCAGCCGA GGGTGGTTCTCTACGATGTGGACCTGACTGTGACGCTTCCGGTGCAGGCCTCGGTTACATCAGGCATGAATGCGATCGCCCATGCGGCCGAGGCATTATATGCGCGGGACGGCAATCCGG 
TGATCTCGCTGATGGCCGAAGAGGCGATCCGCGCGCTGGCCCATGCCCTGCCGCGTATCGTTGCCACTCCCGACGATATCGAAGCGCGCAGCGATGCCCTCTATGGCGCGTGGCTGTGCG GAACGTGCCTGGGTTCGGCCGGAATGGCGTTGCACCATAAGCTCTGCCACACCCTCGGCGGAAGTTTCGATTTGCCACATGCCCCGACCCACACGGTCATCCTCCCCTATGCGCTCGCCT ATAATAGTGATGCGGCCAGGCCCGCAATGGCAGCCATCGCGCGCGCGCTGGGCATGGCGGATGCAGCGATGGGCATGAGAGCGTTGTCCATGCGGTTGGGCGCCCCGACATCGCTGCGTG AGTTGGGCATGGCAGAAGCCGATCTTGACCGCGCCGCCGACCTGGCCACGCAAAATGCCTATTGGAACCCGCGACCCATCGAGCATGGGCCGATTCGTAACCTTCTGGGACGGGCCTGGG CTGGAACTCCGGTCTGAAGGACCTAGAGGACAGTCAATTCATTGATCTGAAGTCACCAACGAGGAGATATGGGATGAACGAGAACATTGCGATCCGCAAATTGGGCCGCCGACTCCGATT GGGCATTGCCGGTGGCGCGGGTCATTCGCTGATTGGTCCGGTTCACCGGGAGGCGGCTCGGCTTGACGATTTGTTCTCTCTCGATGCTGCGGTGCTGTCCAGTAACGCGGAACGCGGGGA TGCTGAGGCCGCGGCTCTCGGAATTCCGCGCTCCTATTCGTCCACCGCCGAGATGTTCGCAATGGAGAAGGCTAGGCCCGACGGTATTGAGGCCGTTGCCATAGCCACGCCGAATGACAG CCATTACCGGATTCTGTGCGAGGCGCTGGACGCCGGGTTGCATGTAATCTGCGACAAGCCTTTAACCTCCACGAAGGCCGAGGCCGACGACGTGCTGGTGCGGGCGAAGGCCGCGGGCAA GGTTGTGGTCCTGACCCACAATTATTCTGGCTACGCCATGGTACGCCAAGCCCGCGCCATGGTCGCCGCCGGTGAACTTGGGAAAATCCACCAGATTCACGGGGTCTACGCTCTGGGCCA GATGGGCCGTTTGTTCGAGGCCGACGAAGGGGGCGTGCCTCCGGGGATGCGTTGGCGGATTGATCCTGCGCGCGGTGGCGACAGTCACGCCCTGGTGGATATCGGCACCCATGTGCACCA TCTGGCTACCTTCATCACGCAGTTACAGGTCGTTGAGGTAATGGCCGATCTTGGGCCGGCGGTTCAAGGCCGCGCGGCCCATGACAGTGCCAACGTCATGTTCCGTATGGAAAACGGAGC TTTCGGATCGTTCTGGGCCACCAAGGCGGCATCGGGGGCCAGCAAGCTGGCGATCGAAGTCTACGGTGACAAGGGCGGCGTCCTGTGGGAGCAGGCCGACGCCAATAACTTGCTACATAT GCGGCAGGGCCAACCCCCAGCCCTGATTGGTCGACAAGTTGCCGGGCTGCATCCTGCGGCAATCCGCGCGATGCGGGGGCCGGGTTATCATTTCGTGGAAGGCTATCGCGAGGCCTTTGC GAATATGTACGTGGATTTCGCCGAACAGATCTTGGCCATGATGGGCAAGGGGGCCGCAGATCACCTGGCATTGGAAGCGCCGTCGGTCGTGGACGGCCTGCGCTCCATGGCGTTCATCGA AGCCTGTGTGGCGTCGTCGCAGGACCGCCAATGGCGGCAGGTGGAGCAAGTCAGTTGATCTCTCAGCGGCTTCGGCATTTTTCCCGGGCTGGCGGCTCCCCGCAGCTCCCTCCGGTGGAA AGAACGGGTAATCAAAATAATATTCTGATTTTAAAGGATGTTCCAGACAGCTGATTATTCCTGAAATTTAGGGCTCTTTCGGCTGTAGCAATTGACTAAAAGCCGAATTTAAGGGTAA 
TTAAACAAACGCTGTTCGTATTATTTAAACAGGTGAGTGATGGCGATATTCCTGGAAGGCTGGCCGATGGTTTCATCTGAATACCCGGCCAGAAGCGTTGAGGCGCACCCGGCCTATCTG AC
GCCAGACTATGTTTTCACGCGAAAGCGTGCGCCGACTCGACCGCTGCGGTTAATTCCTCAGTCTGCGACGGAGCTGTATGGCCCGGTTTATGGACAAGAGAGCGTCCGTCCGGGGGATAA CGACCTGACCCGTCAGCACGAAGCTGAGCCGGTGGGGGAGCGGATTCTGGTGACGGGGCGCGTGACCGACGAAGACGGGCGGGGTGTCCCTAATACGCTGCTAGAGATCTGGCAGGCCAA TGCCGCCGGTCGCTATATCCACAAGCTTGACCAGCATCTTGCCCCGCTTGATCCAAATTTCTCGGGGGCAGGGCGTACGGTTACGGGGGCTGATGGCTCTTATTCCTTCATCACGATCGT GCCGGGCGCCTATCCGGTCGTGGGGCTGCACAATGTCTGGCGCCCGCGCCACATCCATGTGTCGTTGTTCGGTCCGTCCTTCGTGACCCGCTTGGTTACCCAGATATATTTCGAGGGCGA TCCGCTGCTGAAATATGACACGATCTACAACACGGCGCCCGACATCTCGAAGCGCAGCATGGTGGCGCAGTTGGACATGGGCGCCACGCAATCCGAATGGGGCCTGACCTATCGCTTCGA CATCGTTCTGCGTGGGCGCAACGGCAGCTATTTCGAGGAACCCCATGACCACTAAGACCCCACTGACCATCACCCCCTCGCAGACTGTCGGGCCTTTCTATGCCTATTGCCTGACCCCGG AGGACTACGGGACGCTTCCACCGCTGTTCGGCGCGCAGCTTGCGACCGAGGACGCCGAAGGGGAACGGATTACGATCCAGGGAACGATCACGGACGGAGAGGGGGCCATGGTTCCCGATG CCTTGATCGAGATCTGGCAGCCGGACGGGCAGGGGCGTTTTGCTGGAGCCCATCCAGAGCTGCGGAATTCGGCCTTCAAGGGCTTCGGGCGCCGCCACTGTGACAAAAGCGGAAACTTCA GTTTCCAAACCGTGAAGCCTGGCCGGGTGCCCACTGCCGACGGCGTGATGCAGGCACCCCATATCGCTTTGTCGATCTTCGGCAAGGGATTGAACCGCCGGCTCTATACGCGGATCTACT TCGCAGACGAGGCATCGAATGCCGAGGACCCCGTTCTGTCGATGCTGTCCGAGGATGAGCGCGTGACCCTGATCGCCACCTCTGAATCGCCCGCCGCATATCGCCTCGACATCCGCCTGC AAGGCGACGGCGAAACGGTGTTTTTCGAGGCCTGAGTCGGCCGGCAAGTTTGCGGGGATCCGTCCGCCGCAATTGTGTTTCGCTATAGACGCCACGGCTGCCGCATGCCGCCGGGTGGAA GGGCCTTGCAAGGCCTGTCAACGGCGGAGTAAAATCCGGCCAGGCGGCGGAGTAAAACCAGGCCACTTGTGGCCCACGCATGAGACACCCGGGAGGGCGTAGCCCAAGCGGGGGTCTCAT GCGTGTGCGGCGGTTTTCTGGGGGTTCAGCCAGCCTTGCGGGCGCGGCTTTGAGCGAGACGATAGCTGTCGCCGTTCATCTCGAG
Comparison to known sequences
The sequence obtained can be compared to known sequences in the databanks
Question: what is similar?
What to compare: DNA or protein?
SIMILARITY
CTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGCGGG CTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGGCCGC GGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGCAAAA GATCGGCATGTCCAGCGCACCTGAAGCTTGGGTGGTTGC TGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGGGCGG TTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGATGTC CCTGGTCAAGCCGCCGCTTACCGCTGCCGGATTGCTGCTC CCGATCTATGTCGTTTCTGATGCATTCGGCGTCTGGCTT TATCGGCACCGGTATTCTGCCTCCAATCTGCGCATCCTGA TTCCTTCGGGATTTTTTGGGGTCCTGATTGGCTGGTTAT TGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTCATTG TTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTGCTGGC ACGACGAGGGGTGCCATCGGTGCCGCGTCAAGCCAACGT GCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGCTTTAC CAGCTTTTTGACTCATTCCGGTGCGGCGACCTTCCAGAT GTTCGTGCTGCCGCAACGGCTGGACAAGACCATGTTCGC GGGCACATCAACGCTTACCTTTGCTGCCATAAACCTATT CAAGATTCCGTCCTACTGGGCATTGGGACAGCTTTCGAC TTCCTCGGTCATGTCCGCGCTAGTGTTGATTCCGGTGGCC GTGGCCGGGACGTTCGCAGGTGTTTTTGCGACGCGCAGG CTATCGACATCCTGGTTCTTCATTCTGGTCCAGGCGATG TTGCTGGTGGTCTCCATTCAGCTTCTGTGGAGGGGAATG TCGGATATCCTGAACTAGCTGGAGATCGCAATGTCAGAA CGCTCAATCAATCAGAATGTAATCTTGACATAGAATAC CGTTCCGATTTATTGCTTCGAGTGAAGCTGCCCGTCCGC TGAGATGTCATGACATTTTCCCCGCTTGATTCCGCCCTGC TTGGACCGTTGTTCGCGACCGATGAAATGCGCACGGTCT TCTCCGAACGGCGTTTTTTGGC
CTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGCGGG CTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGGCCGC GGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGCAAAA GATCGGCATGTCCAGCGCACCTGAAGCTTGGGTGGTTGC TGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGGGCGG TTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGATGTC CCTGGTCAAGCCGCCGCTTACCGCTGCCGGATTGCTGCTC CCGATCTATGTCGTTTCTGATGCATTCGGCGTCTGGCTT TATCGGCACCGGTATTCTGCCTCCAATCTGCGCATCCTGA TTCCTTCGGGATTTTTTGGGGTCCTGATTGGCTGGTTAT TGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTCATTG TTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTGCTGGC ACGACGAGGGGTGCCATCGGTGCCGCGTCAAGCCAACGT GCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGCTTTAC CAGCTTTTTGACTCATTCCGGTGCGGCGACCTTCCAGAT GTTCGTGCTGCCGCAACGGCTGGACAAGACCATGTTCGC GGGCACATCAACGCTTACCTTTGCTGCCATAAACCTATT CAAGATTCCGTCCTACTGGGCATTGGGACAGCTTTCGAC TTCCTCGGTCATGTCCGCGCTAGTGTTGATTCCGGTGGCC GTGGCCGGGACGTTCGCAGGTGTTTTTGCGACGCGCAGG CTATCGACATCCTGGTTCTTCATTCTGGTCCAGGCGATG TTGCTGGTGGTCTCCATTCAGCTTCTGTGGAGGGGAATG TCGGATATCCTGAACTAGCTGGAGATCGCAATGTCAGAA CGCTCAATCAATCAGAATGTAATCTTGACATAGAATAC CGTTCCGATTTATTGCTTCGAGTGAAGCTGCCCGTCCGC TGAGATGTCATGACATTTTCCCCGCTTGATTCCGCCCTGC TTGGACCGTTGTTCGCGACCGATGAAATGCGCACGGTCT TCTCCGAACGGCGTTTTTTGGC
the two sequences are (and look) the same
CTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGCGGG CTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGGCCGC GGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGCAAAA GATCGGCATGTCCAGCGCACCTGAAGCTTGGGTGGTTGC TGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGGGCGG TTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGATGTC CCTGGTCAAGCCGCCGCTTACCGCTGCCGGATTGCTGCTC CCGATCTATGTCGTTTCTGATGCATTCGGCGTCTGGCTT TATCGGCACCGGTATTCTGCCTCCAATCTGCGCATCCTGA TTCCTTCGGGATTTTTTGGGGTCCTGATTGGCTGGTTAT TGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTCATTG TTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTGCTGGC ACGACGAGGGGTGCCATCGGTGCCGCGTCAAGCCAACGT GCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGCTTTAC CAGCTTTTTGACTCATTCCGGTGCGGCGACCTTCCAGAT GTTCGTGCTGCCGCAACGGCTGGACAAGACCATGTTCGC GGGCACATCAACGCTTACCTTTGCTGCCATAAACCTATT CAAGATTCCGTCCTACTGGGCATTGGGACAGCTTTCGAC TTCCTCGGTCATGTCCGCGCTAGTGTTGATTCCGGTGGCC GTGGCCGGGACGTTCGCAGGTGTTTTTGCGACGCGCAGG CTATCGACATCCTGGTTCTTCATTCTGGTCCAGGCGATG TTGCTGGTGGTCTCCATTCAGCTTCTGTGGAGGGGAATG TCGGATATCCTGAACTAGCTGGAGATCGCAATGTCAGAA CGCTCAATCAATCAGAATGTAATCTTGACATAGAATAC
AAACTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGC GGGCTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGG CCGCGGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGC AAAAGATCGGCATGTCCAGCGCACCTGAAGCTTGGGTGG TTGCTGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGG GCGGTTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGA TGTCCCTGGTCAAGCCGCCGCTTACCGCTGCCGGATTGCT GCTCCCGATCTATGTCGTTTCTGATGCATTCGGCGTCTG GCTTTATCGGCACCGGTATTCTGCCTCCAATCTGCGCATC CTGATTCCTTCGGGATTTTTTGGGGTCCTGATTGGCTGG TTATTGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTC ATTGTTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTG CTGGCACGACGAGGGGTGCCATCGGTGCCGCGTCAAGCC AACGTGCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGC TTTACCAGCTTTTTGACTCATTCCGGTGCGGCGACCTTCC AGATGTTCGTGCTGCCGCAACGGCTGGACAAGACCATGT TCGCGGGCACATCAACGCTTACCTTTGCTGCCATAAACC TATTCAAGATTCCGTCCTACTGGGCATTGGGACAGCTTT CGACTTCCTCGGTCATGTCCGCGCTAGTGTTGATTCCGGT GGCCGTGGCCGGGACGTTCGCAGGTGTTTTTGCGACGCG CAGGCTATCGACATCCTGGTTCTTCATTCTGGTCCAGGC GATGTTGCTGGTGGTCTCCATTCAGCTTCTGTGGAGGGG AATGTCGGATATCCTGAACTAGCTGGAGATCGCAATGTC AGAACGCTCAATCAATCAGAATGTAATCTTGACATAGA
Now the two sequences are almost the same (the second one only has three extra bases at its start), but they seem to be dissimilar
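Why a shifted copy looks dissimilar can be shown with a position-by-position comparison. The sketch below is not how databank search programs work internally; it only scans ungapped shifts, and the function names and the max_shift parameter are assumptions for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions over the shared length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def best_shift(a: str, b: str, max_shift: int = 5) -> tuple[int, float]:
    """Slide b to the left by 0..max_shift bases and keep the best identity."""
    scores = [(identity(a, b[s:]), s) for s in range(max_shift + 1)]
    score, shift = max(scores, key=lambda t: (t[0], -t[1]))
    return shift, score


first = "CTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGCGGG"  # start of the example sequence
second = "AAA" + first                               # same bases, three extra in front

print(identity(first, second))    # naive comparison: most positions disagree
print(best_shift(first, second))  # after shifting by 3 bases the match is perfect
```

Real similarity searches also have to allow insertions and deletions inside the sequences, which a simple shift scan cannot handle.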
SIMILARITY
Problems with DNA comparison
Codon usage preference: various codons may code for the same amino acid,
the DNA sequences are different, the protein sequences are the same
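The point about synonymous codons can be checked with a small translation routine. The table below is the standard genetic code; the two DNA fragments are made up for illustration and are not taken from any databank.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate in frame 1; '*' marks a stop, 'X' an unrecognised codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


# Two hypothetical genes using different codons for the same amino acids
dna_a = "GCTCGTCTGTCT"
dna_b = "GCCAGACTCAGC"

print(translate(dna_a), translate(dna_b))  # identical proteins
print(sum(x == y for x, y in zip(dna_a, dna_b)), "of", len(dna_a), "bases identical")
```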
… AND DOES IT CODE FOR ANY PROTEIN?
Open reading frames:
Usually they start with ATG, but in software this is an option
Length: default 100 amino acids, but this is also an option
The result is hypothetical; it should be checked and compared
to the existing data
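A simple ORF search with the two options mentioned above (start codon and minimum length) might look like the following sketch. It scans only the three frames of the given strand; the function name, the parameter names and the returned tuple layout are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_aa: int = 100, start_codon: str = "ATG"):
    """Return (start, end, frame) for each ORF on the given strand.

    start and end are 0-based, end is exclusive and includes the stop codon.
    min_aa is the minimum number of amino acids, not counting the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == start_codon:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_aa:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs


# Made-up toy sequence: ATG AAA TTT GGG TAA in the third frame
print(find_orfs("CCATGAAATTTGGGTAACC", min_aa=3))
```

With the default of 100 amino acids the toy sequence yields nothing, which is why the length threshold has to be adjustable. The complementary strand can be searched by passing its reverse complement.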
Finding orfs
FRAME SHIFT MUTATION – A SOLUTION FOR IT
Translation in each reading frame
Stop codons are not taken into account, they are treated just as missing amino acids
It compares everything to everything at the protein level
example
BLASTX
Six frame translation
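The six-frame translation step can be sketched as below. This is not BLASTX itself, only the translation that precedes a protein-level comparison; the frame labels (+1 to +3, -1 to -3) and function names are assumptions. The test fragment is positions 2281-2340 of the example sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    """Return the complementary strand read in the 5' to 3' direction."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate from the first base; '*' is a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq: str) -> dict[str, str]:
    """Translate three frames of the given strand and three of its reverse complement."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


fragment = "GCCGCCCGCTGCCGTAGTGCGCTGGATCATGTGGCTCGTATTCTTGGACCCGCCTCTACC"
for name, protein in six_frame_translation(fragment).items():
    print(name, protein)
```

Because stop codons are kept as ordinary symbols in the output, a frameshift shows up as a similarity that continues in a different frame.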
FRAMESHIFT
WHERE DOES IT START FROM?
2290 2300 2310 2320 2330 2340
GCCGCCCGCTGCCGTAGTGCGCTGGATCATGTGGCTCGTATTCTTGGACCCGCCTCTACC
A A R C R S A L D H V A R I L G P A S T
M W L V F L D P P L P
2350 2360 2370 2380 2390 2400
ATCGGATTTCAAGGAGGCTAATGACGTGACGACACTGTTTGAGGCGACGACCATCCCGAT
I G F Q G G *
S D F K E A N D V T T L F E A T T I P I
Who knows?
- Identification of other elements
GENOMIC CONTEXT
[Figure: degradation pathway of sulfanilic acid]
Compounds shown: sulfanilic acid, 4-sulfocatechol, sulfomuconate, sulfolactone, maleylacetate, TCA cycle
Enzymes shown: P340 II dioxygenase, sulfomuconate cycloisomerase, sulfolactone hydrolase, maleylacetate reductase