
A computational workflow for automated genome annotation and result validation

Zsolt Gelencsér

(Supervisor: Prof. Sándor Pongor) gelzs@digitus.itk.ppke.hu

Abstract: The increasing number of available DNA sequences requires fast, automated methods to process the new data and to extract as much information as possible. Genome annotation methods are designed for this task, which is especially difficult because of the high diversity of biological organisms. I created an automated, subsystem-based genome annotation pipeline that contains multiple, independent methods to validate the results.

The pipeline is based on Hidden Markov Model search and requires comprehensive knowledge about the analyzed subsystem.

Keywords: HMM; genome annotation; validation; topology

I. INTRODUCTION

In the past few years the speed of sequencing has greatly increased, while the cost of the analysis has dramatically decreased. The number of currently available DNA sequences is over 100 million (NCBI GenBank report 2008 [1]), but the number of those well characterized in terms of 3D structure and function is small, and their ratio is constantly decreasing.

Without any added annotation, a sequence is practically only a string without any biological meaning.

The method by which we add further information to a genome sequence is called genome annotation. Some annotation (e.g., source origin) is added to the raw data in the phase of data production, but the substantial part of genome annotation begins when the data are submitted to a public database. Genome annotation is based on a deep knowledge of biology and bioinformatics and relies on many databases, programs and algorithms. From the computational point of view, the raw data are a simple string, the DNA sequence of the genome, that consists of only 4 possible characters: A, C, G and T. The theoretical topology of the genome is a linear or circular number line of positive integers; every position corresponds to a nucleotide. A bacterial genome consists of one (sometimes a few) long chromosomal sequence(s) and several short plasmid sequences. The sequences can be linear or circular.
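To make this data model concrete, the following minimal sketch shows one possible way to represent a replicon and to measure the distance between two positions on it; the class and field names are my own illustration, not part of the pipeline.

```python
# Minimal sketch (illustrative, not the pipeline's actual code): a replicon as a
# sequence of nucleotide positions with a linear or circular topology.
from dataclasses import dataclass

@dataclass
class Replicon:
    name: str
    length: int        # number of nucleotide positions
    circular: bool     # True for most bacterial chromosomes and plasmids

    def distance(self, pos_a: int, pos_b: int) -> int:
        """Shortest distance between two 1-based positions on this replicon."""
        d = abs(pos_a - pos_b)
        if self.circular:
            d = min(d, self.length - d)   # the sequence wraps around the origin
        return d

chromosome = Replicon("chromosome", length=4_600_000, circular=True)
print(chromosome.distance(10, 4_599_990))   # 20, thanks to the circular topology
```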

During genome annotation we assign attributes (in bioinformatics we call them descriptors) to these structures. There are two types of descriptors: global descriptors refer to the whole structure (e.g., name, function, source origin), and local descriptors refer only to a part of it (e.g., protein segments, domains). The descriptors come from many sources: human knowledge, computational algorithms, database cross-references, etc. There are many methods to recognize the genes of a genome, but we can sort them into two main groups: total genome annotation and annotation of a chosen subsystem throughout many genomes.
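As an illustration of the two descriptor types, the sketch below models an annotated gene with global descriptors and a list of local descriptors; the structure and the example values are hypothetical, not the pipeline's actual representation.

```python
# Hypothetical data structure for global and local descriptors; the field names
# and the example values are for illustration only.
from dataclasses import dataclass, field

@dataclass
class LocalDescriptor:
    start: int     # first position of the described segment
    end: int       # last position of the described segment
    kind: str      # e.g. "domain" or "binding site"
    value: str     # e.g. a domain family name

@dataclass
class AnnotatedGene:
    # global descriptors refer to the whole structure
    name: str
    function: str
    source_organism: str
    cross_references: dict = field(default_factory=dict)   # e.g. {"COG": "..."}
    # local descriptors refer only to a part of it
    local_descriptors: list = field(default_factory=list)

gene = AnnotatedGene("luxR", "transcriptional regulator", "example organism")
gene.local_descriptors.append(LocalDescriptor(180, 240, "domain", "HTH domain"))
```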

In the first case we choose a genome and we try to recognize the function of the unknown genes via biological study or database search. The advantage of this method is that proper knowledge of one species is enough to obtain information about its genes, but we must rely on the data of the gene databases, which can lead to incorrect annotation (e.g., a similarity search points to the most similar function in the database, which has no data on novel or undocumented biological roles).

In the case of subsystem annotation throughout several genomes, we choose a subsystem that refers to a well-defined biological process or structure in a small, well-characterized set of genomes, and we characterize it with a set of rules [2]. After we identify the rules that define the genes of the subsystem, we search the known genome sequences using this set of rules. The rules of the subsystem help us to validate the new genes. This method does not give a certain result either, because there may be unknown variations of the subsystem, or the subsystem may contain only a few genes; nevertheless, the identification is relatively robust against the “noise” caused by random similarities.
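To make the idea of a rule set tangible, here is a hypothetical example of how such a subsystem description could look; the family names come from the test case discussed later, while the file names, thresholds and topology notation are my own assumptions rather than the pipeline's actual format.

```python
# Hypothetical rule set describing a subsystem; file names, thresholds and the
# topology notation are illustrative and not taken from the actual pipeline.
SUBSYSTEM_RULES = {
    "name": "AHL-driven quorum sensing circuit",
    "families": {
        # short name -> HMM profile and numeric identification thresholds
        "luxR": {"hmm_profile": "luxR.hmm", "max_evalue": 1e-10, "min_length": 150},
        "luxI": {"hmm_profile": "luxI.hmm", "max_evalue": 1e-10, "min_length": 120},
    },
    # probable topologies: gene order with strand orientation ('>' forward, '<' reverse)
    "topologies": [
        ("luxR>", "luxI>"),   # tandem arrangement on the same strand
        ("<luxR", "luxI>"),   # divergently transcribed pair
    ],
}
```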

Figure 1. The steps of genome annotation. In this figure we can see the connection between the steps of genome annotation and the bioinformatics databases.


TABLE IV. EXAMPLES OF THE MOST SIMILAR CONCEPTS FOR SOME TARGET WORDS, WITH THE MEASURE OF SIMILARITY, FOR A SMALL TEST SET.

kontroll (N): vizsgálat 0.200137414551; eset 0.164033128256; panasz 0.155728295147; műtét 0.0913003247578
cornea (N): hátlap 0.14516804352; csarnok 0.137587581556; conj 0.119715252986; pupilla 0.0924046308243; lencse 0.085279167572
javasol (V): végez 0.0757854615073; történik 0.0727785274109; felír 0.0647341637093; használ 0.0642329428926
használ (V): javul 0.0934773088188; cseppent 0.0876840590879; lát 0.0693182060826; fáj 0.067417048954; kezel 0.0656283245778
tiszta (ADJ): békés 0.154524127446; sima 0.137185024375; sekély 0.113384183723; sárgás 0.11103999917; ép 0.0983724197695
rossz (ADJ): homályos 0.12185879024; piros 0.103178644583; kicsi 0.0754819988997; jó 0.0648128527085; bal 0.0632304528447; száraz 0.0456163404988

1) Examined area: since I used documents only from the department of ophthalmology, the main target area of the examinations documented in the texts is limited to either the left, the right, or both eyes. There might be some additional symptoms or examinations carried out, but the main target can easily be detected by retrieving the target side.

2) Wish of the patient: in most cases, the patient has a direct wish about the purpose of visiting a doctor. They want an examination carried out, or have the doctor prescribe some medication, or, especially in the domain I use, they want glasses or contact lenses. Such wishes are retrieved by looking for some trigger words and their variations (“szeretne”, “kér”, etc.).

3) History and past events: these include verbs in past tense together with their complement or target. These are not necessarily neighbouring words of the verb, but I still applied a baseline algorithm to find the nearest possible complement candidate. These are then divided into groups of negated and non-negated events. Some examples of such events are “occlusio zajlott”, “olvasószemüveget viselt”, “károsodás nem igazolódott”.

4) Present findings and symptoms: similar to the previous category, but in this case the verb is in present tense, for example “elfogadhatóan lát”, “kóros nem látszik”. These are also grouped into negated and non-negated events.

5) Nominal events: a specific characteristic of the clinical narrative language is the short telegraphic phrases which, although they sometimes describe an event or a state, do not include any verb. For example, phrases like “sárgás magreflex”, “ép papilla”, “szűkebb artériák” are used as standalone phrases. In standard language these are less frequent, or at least an existential verb is present along with them.

The methods I applied to extract such events are basic pattern recognition or grammatical structures, which work with high recall but low precision. However, a significant improvement can be expected from integrating the above-described distributional thesaurus into the event extraction methodology.
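A minimal sketch of what such a baseline could look like is shown below; the trigger and negation word lists are taken from or inspired by the examples above, while the function names, the regular expressions and the test sentences are my own assumptions rather than the system's actual implementation.

```python
# Baseline sketch (an assumption, not the described system's code): trigger-word
# search for patient wishes and a rough negated / non-negated grouping of events.
import re

WISH_TRIGGERS = ("szeretne", "kér")       # trigger words mentioned in the text
NEGATION_MARKERS = ("nem", "sem")         # common Hungarian negation words

def contains_wish(sentence: str) -> bool:
    """True if the sentence contains a wish trigger word in any inflected form."""
    return any(re.search(rf"\b{t}\w*", sentence, re.IGNORECASE)
               for t in WISH_TRIGGERS)

def classify_negation(event_phrase: str) -> str:
    """An event is treated as negated if it contains a negation marker."""
    words = event_phrase.lower().split()
    return "negated" if any(w in NEGATION_MARKERS for w in words) else "non-negated"

print(contains_wish("Szemüveget szeretne."))              # True
print(classify_negation("károsodás nem igazolódott"))     # negated
print(classify_negation("olvasószemüveget viselt"))       # non-negated
```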

V. FURTHER PLANS

The medical language processing system already has several modules developed, with some baseline results. The above-described experiments are carried out by using these modules in a pipelined manner. However, a more complex integration of and communication between these modules is required, with higher-level tools having an effect on the lower levels of processing. Such a parallel integration of the different processing methods is of crucial importance and is one of the next steps to achieve.

Another aspect, besides improving the described modules, is developing the missing links, such as disambiguating abbreviations with the help of lexical resources and the distributional methods, and processing the measurement results included in the texts, which contain one of the most valuable types of information.

REFERENCES

[1] Z. S. Harris, “The structure of science information,” J. of Biomedical Informatics, vol. 35, no. 4, pp. 215–221, Aug. 2002.
[2] B. Siklósi, G. Orosz, A. Novák, and G. Prószéky, “Automatic structuring and correction suggestion system for Hungarian clinical records,” in 8th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, 2012, pp. 29–34.
[3] B. Siklósi, A. Novák, and G. Prószéky, “Context-aware correction of spelling errors in Hungarian medical documents,” in 1st International Conference on Statistical Language and Speech Processing, 2013.
[4] A. Novák, “What is good humor like?” in I. Magyar Számítógépes Nyelvészeti Konferencia. Szeged: SZTE, 2003, pp. 138–144.
[5] G. Prószéky and B. Kis, “A unification-based approach to morpho-syntactic parsing of agglutinative and other (highly) inflectional languages,” in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ser. ACL '99. Stroudsburg, PA, USA: Association for Computational Linguistics, 1999, pp. 261–268.
[6] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses: Open source toolkit for statistical machine translation,” in Proceedings of the ACL 2007 Demo and Poster Sessions. Prague: Association for Computational Linguistics, 2007, pp. 177–180.
[7] G. Orosz and A. Novák, “PurePos – an open source morphological disambiguator,” in Proceedings of the 9th International Workshop on Natural Language Processing and Cognitive Science, B. Sharp and M. Zock, Eds., Wroclaw, 2012, pp. 53–63.
[8] E. Pirk, “Névkifejezések automatikus felismerése orvosi szövegekben,” 2013.
[9] J. R. Firth, “A synopsis of linguistic theory 1930–55,” vol. 1952–59, pp. 1–32, 1957.
[10] J. Carroll, R. Koeling, and S. Puri, “Lexical acquisition for clinical text mining using distributional similarity,” in Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing, Volume Part II, ser. CICLing '12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 232–246.
[11] D. Lin, “Automatic retrieval and clustering of similar words,” in Proceedings of the 17th International Conference on Computational Linguistics, Volume 2, ser. COLING '98. Stroudsburg, PA, USA: Association for Computational Linguistics, 1998, pp. 768–774.
[12] N. Sager, M. Lyman, C. Bucknall, N. Nhan, and L. J. Tick, “Natural language processing and the representation of clinical data,” Journal of the American Medical Informatics Association, vol. 1, no. 2, Mar/Apr 1994.
[13] S. Meystre, G. Savova, K. Kipper-Schuler, and J. Hurdle, “Extracting information from textual documents in the electronic health record: a review of recent research,” Yearb Med Inform, vol. 35, pp. 128–144, 2008.



TABLE I. THE VALIDATION RESULTS OF THE GENOME ANNOTATION. This table contains the results of the four types of validation. The COG-based and product-based validation can be equal to the expected value, not equal, or there may simply be no information available. The columns named BLAST and topology show the number of genes whose prediction was strengthened. (* manually checked)

II. THE STEPS OF A GENOME ANNOTATION

Now we can sketch a logical outline of subsystem-based DNA sequence annotation. A DNA sequence is considered annotated if the coding genes and other important regions have been located. The protein-coding genes are linked to as many databases as possible: primary protein databases (e.g., UniProt), function-based cluster databases (e.g., COG [3]) and structure-based cluster databases (e.g., Pfam). To achieve this status, first we need to locate the position of the gene. Once we know the exact location of the gene, we search for substructures (domains) like binding sites, HTH structures, etc., and associate them with known domain families. We add the recognized protein and the collected information to the UniProt database. After the new protein is deposited in the database, we analyze the UniRef clusters that contain our protein. From this analysis we can collect some new information regarding our protein.

This outline leads us to two trivial conclusions: i) Database annotations change very fast, because the background databases are updated frequently; ii) Most annotations are incomplete. To reach the idealized state we need a well-organized, updateable, integrated database, but the current public databases are far from this stage.

III. THE WORKFLOW

In my previous work I created an automated protein search algorithm based on HMM profiles [1, 2]. This method searches for all members of the protein families defined by the HMM profiles in the chosen dataset, so it can be considered part of a subsystem-based annotation workflow. The 0th step is the collection of the necessary knowledge about our subsystem and the selection of the search dataset. Besides the HMM profiles, we need some information about the protein families (e.g., short name, numeric threshold values for identification). In the first step, the program executes an hmmsearch (an algorithm of the hmmer program [3]) on the chosen dataset and collects all hits as well as their associated significances (e-values). In the next step we carry out a non-strict pre-verification of the hits: we only check the length of the hits and apply a permissive threshold for the e-values. With this method we filter out the certainly false hits, so we greatly shorten our list without losing true positives. In the next step we determine the topology of the genes, which denotes their relative position on the chromosome. For this we need more data, such as the position of the gene in the genome and the predicted COG group; the program collects these from NCBI's .ptt files. Once all necessary data are collected, we analyze the genes located near each other (the maximum distance between them is 3000 base pairs) and determine the observed topologies. In a topology the relative position and orientation of the genes are fixed, and these topologies help us to validate our results, because the knowledge about the subsystem contains the list of the probable topologies.
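The following simplified sketch illustrates the pre-verification and neighbour-grouping logic described above. It assumes the hmmsearch hits have already been parsed into simple records; the field names and the numeric thresholds are illustrative assumptions, not the pipeline's actual values.

```python
# Simplified sketch of the described filtering and topology-grouping steps.
# Hit records, field names and thresholds are illustrative assumptions.

MAX_EVALUE = 1e-3    # permissive pre-verification e-value threshold
MIN_LENGTH = 100     # minimal hit length accepted in pre-verification
MAX_GAP = 3000       # maximum distance between neighbouring genes (base pairs)

def pre_verify(hits):
    """Drop the certainly false hits without risking the loss of true positives."""
    return [h for h in hits
            if h["evalue"] <= MAX_EVALUE and h["length"] >= MIN_LENGTH]

def group_neighbours(genes):
    """Group genes of the same replicon that lie within MAX_GAP of each other."""
    genes = sorted(genes, key=lambda g: (g["replicon"], g["start"]))
    clusters, current = [], []
    for gene in genes:
        if (current and gene["replicon"] == current[-1]["replicon"]
                and gene["start"] - current[-1]["end"] <= MAX_GAP):
            current.append(gene)
        else:
            if current:
                clusters.append(current)
            current = [gene]
    if current:
        clusters.append(current)
    return clusters

def topology(cluster):
    """Signature of a cluster: family names with strand orientation, in genomic order."""
    return tuple(("" if g["strand"] == "+" else "<") + g["family"]
                 + (">" if g["strand"] == "+" else "") for g in cluster)
```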

IV. THE VALIDATION OF THE RESULTS

The most critical part of genome annotation is the validation of the results. If we obtain new information about a genome sequence via a manual annotation survey, the reliability of the data is high, but these methods are very slow. The automated annotation programs are far faster; however, there is a high chance of errors because of the natural diversity of genomic sequences. If we want to accept our results, we have to validate them. There are many different ways to make sure our data are correct. I used the following rules to validate the results of my HMM search based annotation.

Figure 2. The diagram of the workflow. It shows the flow of data (arrows) during the genome annotation.

The simplest method of validation is the examination of the predicted COG values. The COG database (Clusters of Orthologous Groups) is a protein database based on phylogenetic clustering [4]. Proteins included in the same group have the same biological role. If the genes of our subsystem are members of one specific COG cluster, we can easily compare that cluster with the candidate protein's COG value. If they are the same, it increases the credibility of the new protein's prediction. However, a discordant COG value does not necessarily decrease or abolish the prediction's credibility; it simply means that with this method we are unable to increase its reliability. The NCBI COG dataset contains the protein names in natural language, so this database can be used only for manual validation.
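A trivially simple sketch of this check is given below; the cluster identifiers and the returned labels are illustrative examples only.

```python
# Illustrative sketch of the COG-based check; identifiers and labels are examples.
def cog_check(predicted_cog, expected_cog):
    if predicted_cog is None:
        return "no information"
    if predicted_cog == expected_cog:
        return "supports the prediction"
    return "neutral (reliability cannot be increased this way)"

print(cog_check("COG0583", "COG0583"))   # supports the prediction
print(cog_check(None, "COG0583"))        # no information
```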

In parallel with the HMM-based search, I execute another search based on the BLAST algorithm [4]. Both algorithms are similarity-based methods, but the HMM search examines protein sequences while the BLAST algorithm analyzes the DNA sequences. The BLAST algorithm is less accurate, but it can help to eliminate errors introduced during the translation. If a hit of the BLAST search overlaps with a result of the HMM search, the probability of its correctness is higher.
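The overlap test can be sketched as follows; the record fields are assumed, and both hit sets are expected to use the same replicon names and coordinate system.

```python
# Sketch of the cross-validation idea: an HMM hit is considered strengthened when a
# BLAST hit on the same replicon overlaps its coordinates. Field names are assumed.
def overlaps(hmm_hit, blast_hit):
    return (hmm_hit["replicon"] == blast_hit["replicon"]
            and hmm_hit["start"] <= blast_hit["end"]
            and blast_hit["start"] <= hmm_hit["end"])

def strengthened_by_blast(hmm_hits, blast_hits):
    return [h for h in hmm_hits
            if any(overlaps(h, b) for b in blast_hits)]
```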

I tested the annotation and validation method on an AHL-driven quorum sensing circuit subsystem [5]. The subsystem contains 4 types of genes: luxR, luxI [6, 7], rsaL and rsaM [8, 9]. Table 1 shows the results of a test run performed on the bacterial section of the NCBI database. In the case of luxR the annotation was quite efficient. Most of the COG and product values were correct (even if there were two different groups for it) and almost all hits were also found by the BLAST search. Only about one third of the genes were found to be in a known topology, but we know that a) there are solo/orphan


