As described in Chapter II, efficient tools have been proposed to index and analyze pan-genomes. However, such methods and data structures do not cover all the features expected for pan-genome analysis. Most of them operate only on draft or finished assemblies as input, while such assemblies are available for only a small fraction of species. Furthermore, hundreds or thousands of such assemblies might be required to characterize the pan-genome of a species, a number far larger than what is available in most cases. By the end of February 2017, the National Center for Biotechnology Information (NCBI) Genome database (NCBI, 2017) contained 23,004 assembled genomes, of which about 85% had only one assembly available. Unassembled reads, in contrast, abound in databases and represent the vast majority of the data available: by the end of February 2017, the NCBI Sequence Read Archive (SRA) database (NCBI, 2007) contained about 9.8 petabases of reads. In addition, methods that use an assembly as a reference bias the analysis towards that reference. Finally, it has been shown that assembly errors can lead to an over-estimation of the number of genes inferred from an assembled genome (Denton et al., 2014), which in turn might cause an over-estimation of the size and growth of the core, accessory and singleton genomes. Hence, an ideal data structure indexing a pan-genome should be reference-free and accept assemblies as well as reads as input, to take advantage of all the data available in genomic databases.
Results: The species pan-genome of L. monocytogenes is highly stable but open, suggesting an ability to adapt to new niches by generating or including new genetic information. The majority of gene-scale differences represented by the accessory genome resulted from nine hypervariable hotspots, a similar number of different prophages, three transposons (Tn916, Tn554, IS3-like), and two mobilizable islands. Only a subset of strains showed CRISPR/Cas bacteriophage resistance systems of different subtypes, suggesting a supplementary function in the maintenance of chromosomal stability. Multiple phylogenetic branches of the genus Listeria imply long common histories of the strains of each lineage, as revealed by a SNP-based core genome tree that highlights the impact of small mutations on the evolution of the species L. monocytogenes. Frequent loss or truncation of genes described as vital for virulence or pathogenicity was confirmed as a recurring pattern, especially for strains belonging to lineages III and II. New candidate genes implicated in virulence were predicted based on functional domains and phylogenetic distribution. A comparative analysis of small regulatory RNA candidates supports observations of a differential distribution of trans-encoded RNA, hinting at a diverse range of adaptations and regulatory impact.
Variation among the Spiraeoideae-infecting Strains
Phylogenetic and MUMi analyses have shown that Spiraeoideae-infecting strains of E. amylovora are highly homogeneous at the chromosome level, which is consistent with previous studies. When a singleton development analysis using only the Spiraeoideae-infecting strains with nearly identical chromosomes was conducted in EDGAR (including plasmids), the pan-genome of this subgroup was open (Figure 3C), with a prediction of 30 new genes to be added to the pan-genome with each additional genome sequenced. When the same analysis was done excluding plasmids, the pan-genome of Spiraeoideae-infecting strains was still predicted to be open, with 11 new genes to be added to the pan-genome with each additional genome sequenced (Figure 3D), highlighting the important role plasmids play in the genetic diversity of E. amylovora. It is likely that the figures for all of the pan-genome calculations are slightly inflated due to the use of draft genomes (i.e., with contig breaks that influence CDS prediction and comparison) and that the pan-genome of the Spiraeoideae-infecting strains, excluding plasmids, is closed.
different plasmids. All strains are characterized by a high GC content of 65 - 67% and mostly contain approximately 5500 - 6500 genes. Besides the general core genome, the dispensable genome consists of an additional clone specific core genome (Wiehlmann et al., 2007). Wiehlmann et al. also state that the P. aeruginosa genome is assembled non-randomly: "Individual clones prefer a specific repertoire of accessory segments. Moreover, some parts of the core genome tolerate only a subset of the possible combinations of sequence variants, whereas other segments are freely recombining." In P. aeruginosa many important genes for pathogenicity are located in the dispensable genome (Klockgether et al., 2011). Hot spots of genomic island and RGP integration in the P. aeruginosa genome are tRNA genes. Notable differences to core genome regions are the anomalous mono- to tetradecanucleotide usage and GC content of RGP sequences (Klockgether et al., 2011; Kung et al., 2010). For P. aeruginosa, the GC content of RGPs is mostly lower than the high GC content of the core genome (Kung et al., 2010). Further, RGPs can be identified by mobility factors in their flanking regions. Frequently, the content of RGPs is even a mosaic of regions from different mobile elements (Klockgether et al., 2004). The above mentioned studies act on the assumption that P. aeruginosa has an open pangenome, because all newly analyzed strains added several genes to the pangenome while the core genome slowly decreased.
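The open-pangenome pattern described above (each newly analyzed strain adds genes while the core genome shrinks) can be illustrated with a small accumulation curve. The following is a minimal sketch, not the procedure used in the cited studies: the function name `accumulation_curve` and the toy gene sets are hypothetical, and genes are assumed to be already grouped into comparable identifiers.

```python
from typing import List, Set, Tuple


def accumulation_curve(gene_sets: List[Set[str]]) -> List[Tuple[int, int, int]]:
    """Return (n_genomes, pan_size, core_size) after adding each genome in order."""
    pan: Set[str] = set()
    core: Set[str] = set()
    curve: List[Tuple[int, int, int]] = []
    for i, genes in enumerate(gene_sets, start=1):
        pan |= genes  # the pangenome is the union of all genes seen so far
        # the core genome is the intersection of all genomes seen so far
        core = set(genes) if i == 1 else core & genes
        curve.append((i, len(pan), len(core)))
    return curve


# Hypothetical toy strains; the gene and island names are placeholders.
strains = [
    {"gyrA", "recA", "rpoB", "islandA"},
    {"gyrA", "recA", "rpoB", "islandB"},
    {"gyrA", "recA", "islandA", "phage1"},
]
curve = accumulation_curve(strains)
for n, pan_size, core_size in curve:
    print(n, pan_size, core_size)
```

In this toy input the pangenome grows with every strain while the core shrinks, which is the signature of an open pangenome. Because the curve depends on the order in which genomes are added, a real analysis would typically average over many orderings before drawing conclusions.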
Between November 30th, 2016 and January 20th, 2017, eight new M. tuberculosis genomes became available in the NCBI RefSeq database. This alone highlights the importance of being able to extend a pan-genome structure. Methods such as the investigated whole genome alignment tools, which force the user to start the alignment afresh with the increased number of genomes, are at risk of reaching computational limits (some indications of this could already be observed for Mugsy in the experiments). Our iterative approach mitigates this risk by quickly adding new sequences without having to rebuild previously calculated results. Furthermore, publicly available sets of genomes, such as the collection of “Complete Genomes” in the NCBI RefSeq database, are subject to change due to altered quality standards or the redefinition of reference genomes, such as the commonly used M. tuberculosis H37Rv strain. Therefore, it is essential that pan-genome representations also make it easy to remove genomes from the initial set without impacting the remaining genomes. Most of the evaluated tools do not provide methods for updating a constructed pan-genome. Research such as molecular surveillance in particular, where new data is continuously analyzed and incorporated, depends on data structures that allow the integration of an up-to-date set of genomes.
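The update requirement stated above (adding and removing genomes without rebuilding) can be made concrete with a toy index. The sketch below is an illustration only and is not the alignment-based method the text refers to: it swaps in a simple k-mer-to-genome map, and the class `UpdatableKmerIndex` with its method names is a hypothetical construction.

```python
from collections import defaultdict
from typing import Dict, Set


class UpdatableKmerIndex:
    """Toy pan-genome index: maps each k-mer to the genomes that contain it."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.kmer_to_genomes: Dict[str, Set[str]] = defaultdict(set)

    def _kmers(self, sequence: str) -> Set[str]:
        # All distinct substrings of length k in the sequence.
        return {sequence[i:i + self.k] for i in range(len(sequence) - self.k + 1)}

    def add_genome(self, genome_id: str, sequence: str) -> None:
        # Only the new genome is scanned; existing entries are left untouched.
        for kmer in self._kmers(sequence):
            self.kmer_to_genomes[kmer].add(genome_id)

    def remove_genome(self, genome_id: str) -> None:
        # Drop the genome's label everywhere and delete k-mers that become empty.
        for kmer in list(self.kmer_to_genomes):
            members = self.kmer_to_genomes[kmer]
            members.discard(genome_id)
            if not members:
                del self.kmer_to_genomes[kmer]

    def genomes_with(self, kmer: str) -> Set[str]:
        return set(self.kmer_to_genomes.get(kmer, set()))


index = UpdatableKmerIndex(k=3)
index.add_genome("g1", "ACGTA")   # hypothetical genome identifiers and sequences
index.add_genome("g2", "CGTAC")
shared_before = index.genomes_with("CGT")
index.remove_genome("g1")
shared_after = index.genomes_with("CGT")
```

Adding a genome touches only that genome's k-mers, and removal leaves the entries of the remaining genomes intact. These are the two properties the paragraph asks of an updatable pan-genome representation.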
At the core of pan-genomics is the idea of replacing traditional, linear reference genomes by richer data structures. The paradigm of a single reference genome has endured in part because of its simplicity. It has provided an easy framework within which to organize and think about genomic data; for example, it can be visualized as nothing more than linear text, which has allowed the development of rich two-dimensional genome browsers [18, 19]. With the rapidly growing number of sequences we have at our disposal, this approach increasingly fails to fully capture the information on variation, similarity, frequency and functional content implicit in the data. Although pan-genomes promise to be able to represent this information, there is not yet a conceptual framework or a toolset for working with pan-genomes that has achieved widespread acceptance. For many biological questions, it is not yet established how best to extract the relevant information from any particular pan-genome representation, and even when the right approach can be identified, novel bioinformatics tools often need to be developed to apply it.
11th Young Scientists Meeting 2018, Braunschweig, Germany, November 14-16
Lörincz-Besenyei et al.
Potato improvement by genome editing
Enikö Lörincz-Besenyei 1,2, Thorben Sprink 2, Janina Metje 2, Uwe Sonnewald 3 and Björn Krenz 1
1 Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures, Braunschweig
2 Julius Kühn Institute, Institute for Biosafety in Plant Biotechnology, Quedlinburg
The MSH2 gene was selected as the target gene for genome editing. MSH2 is part of the mismatch repair (MMR) system. The MMR system is highly conserved in eukaryotes and is involved in the correction and reorganization of mispaired nucleotides to prevent homeologous recombination. For DNA-free genome editing, six gRNAs with low predicted off-target effects were designed.
Thanks to CRISPR/Cas9 and other nucleases, it is now possible for the first time to alter sequences in plant genomes at specific sites. There are four different nuclease techniques: meganucleases, zinc finger nucleases and TALE nucleases, which are all protein-guided nucleases, and CRISPR/Cas9, which belongs to the RNA-guided nucleases. Of these four, CRISPR/Cas9 is the easiest to construct and apply, and it has turned out to be the most efficient one used in research. In 2012 the first application of CRISPR/Cas9 in eukaryotes was published; since then it has spread to thousands of research labs worldwide and has been successfully applied in plant breeding and in first medical treatments. This article presents the different nucleases used for genome editing and some of their first successful applications in plant breeding and research.
The numerous studies of the human paranasal sinuses contrast with considerably fewer in the chimpanzee (Pan troglodytes). While the morphological studies are often based on descriptions of individual animals from macerated skulls, the volumetric measurements were mostly carried out with invasive methods (WEGNER, 1936; CAVE and HAINES, 1940; BEZOLD, 1945; CAVE, 1949; HOUSE, 1966; BLANEY, 1986; KOPPE and SCHUMACHER, 1992; KOPPE and OHKAWA, 1999). These methods are less suited to determining the volume of the paranasal sinuses precisely than the three-dimensional imaging methods now available to us. Because the ethmoid cells are delicate, sometimes numerous, and separated from one another only by thin bony septa, determining their volume and assigning them to individual groups is particularly difficult. Likewise, because of interindividual variation, a description of the ethmoid bone, the nasal cavity and the paranasal sinuses based on a small number of skulls is of only limited use for formulating generally valid statements.
The first assembly algorithms date back to the times when few but relatively long Sanger reads were predominant. They are called overlap-layout-consensus approaches because they compare the reads all-against-all in order to merge overlapping parts into longer contiguous consensus sequences. The merging of reads into contigs is often done in a greedy fashion, sometimes by building an overlap graph. Some well-known programs of this era are the Celera assembler, which was used to assemble one of the first available human genomes, ARACHNE, and Bambus. One of the earliest assemblers able to cope with high-throughput data was the Newbler assembler [63, Supplem. material] that is shipped with 454 sequencing devices. Subsequently, SSAKE, VCAKE, and SHARCGS were developed to handle short reads as well. The most problematic issue of these assemblers is the computational time needed for comparing the reads all-against-all. A different class of assemblers solves this problem elegantly by using a so-called de Bruijn graph for assembly, or more precisely a subgraph of it. The advantage is that the graph can be built in time linear in the input size, as opposed to the quadratic time that the overlap-layout-consensus approaches need in general. A de Bruijn subgraph consists of nodes representing all substrings of length k, called k-mers, of the reads. Two nodes are connected by an edge if the corresponding overlapping k-mers occur adjacently in one of the reads. This way, common substrings are condensed and the overlaps of the reads are collected implicitly in the graph. If the sequencing data were perfect (that is, without sequencing errors and with reads longer than the repetitive regions of the genome), then the de Bruijn graph would reveal the complete genomic sequence: by following an Eulerian path, which traverses every edge exactly once, the desired complete sequence could be obtained.
The de Bruijn graphs generated from real sequencing data are, however, much more complex, such that heuristics have been developed to cope with the limitations.
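The idealized case described above (error-free reads, an Eulerian path through the de Bruijn subgraph) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any of the assemblers named: the function names are hypothetical, edges are stored without multiplicity, and the code assumes that an Eulerian path exists rather than checking for it.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are k-mers; an edge joins two k-mers that are adjacent in a read."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def eulerian_path(graph: Dict[str, Set[str]]) -> List[str]:
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    out_edges = {node: sorted(succs) for node, succs in graph.items()}
    in_degree: Dict[str, int] = defaultdict(int)
    for succs in out_edges.values():
        for succ in succs:
            in_degree[succ] += 1
    nodes = sorted(set(out_edges) | set(in_degree))
    # Start where out-degree exceeds in-degree by one, else at any node with edges.
    start = next(
        (n for n in nodes if len(out_edges.get(n, [])) - in_degree[n] == 1),
        min(out_edges),
    )
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path: List[str]) -> str:
    """Concatenate the first k-mer with the last character of each later k-mer."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGCAT", "GCATGA"]  # toy error-free reads
graph = build_de_bruijn(reads, k=3)
genome = spell(eulerian_path(graph))
print(genome)  # ATGCATGA
```

The k-mer ATG occurs twice in the toy sequence, so the graph contains a cycle, yet the Eulerian path is still unique. With real data, sequencing errors and repeats longer than the reads break this uniqueness, which is why the heuristics mentioned above are needed.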
“The computational tools developed by NRGene, which use Illumina’s sequence data, combined with the sequencing expertise of IWGSC has generated a version of the wheat genome sequence that is better ordered than anything we have seen to date. We are starting to get a better idea of the complex puzzle that is the wheat genome.”
Prof. Nils Stein, of Germany’s Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) “Overall, the quality is breathtaking. NRGene’s results are just amazing and will have a major impact.”
In summary, the new genome editing technologies make it possible to introduce almost any desired change at a known gene locus in a plant. In this way, known desirable traits can be combined in entirely new ways within species or across species boundaries. Because the CRISPR/Cas9 system is very easy to construct and highly efficient, it is also possible to generate genome modifications in plants that are not among the “cash crops”, and this is within reach of small and medium-sized enterprises as well, not only of large corporations.
References
WILDMAN et al. (2003) compared human DNA with that of the great apes, a representative of the Cercopithecoidea, and the mouse. They found that the chimpanzee differs from humans by 0.87%, the gorilla by 1.07% and the orangutan by 2.32%, which corresponds to a 99.13% identity of the DNA sequence between Pan and Homo. They also examined the DNA altered by nucleotide substitution during evolution, distinguishing non-synonymous substitutions (which change the encoded amino acid) from synonymous ones (which leave it unchanged), and found 99.4% and 98.4% identity, respectively, between humans and chimpanzees. In contrast to GEISSMANN (2003), however, WILDMAN et al. (2003) propose on the basis of their results that the genus Homo be divided into Homo (Homo) and Homo (Pan). The chimpanzee genus comprises two species, Pan troglodytes (common chimpanzee) and Pan paniscus (bonobo; pygmy chimpanzee). Whereas the range of the common chimpanzee extends in a broad east-west belt across Central Africa, the pygmy chimpanzee inhabits a relatively small area in Central Africa south of the Congo River. Four subspecies can be distinguished within Pan troglodytes:
In the previous chapter we mainly focused on a measure of gene family-free genome comparison for two genomes. Here, we go beyond pairwise comparisons and discuss a gene family-free model for the reconstruction of a possible candidate for the common ancestor of three genomes. In doing so, we extend the gene family-based problem of computing the mixed multichromosomal breakpoint median to a gene family-free setting. The present chapter is structured similarly to the previous one: after a short review of the gene family-based problem in the subsequent section, we propose a gene family-free generalization. We then discuss its computational complexity by proving that the presented problem is MAX SNP-hard. Further, we formulate a 0-1 linear program that allows us to compute exact solutions. Whereas our model for computing family-free adjacencies between two genomes tolerated events of gene duplication and loss, the model presented here is susceptible to gene losses and resolves gene duplications only to a limited extent. We discuss the effects of gene family evolution in our model and proceed to present a 0-1 linear program for computing gene family-free adjacencies between three genomes, thereby extending results of the previous chapter. Our algorithm gives rise to a heuristic approach to constructing a median of three genomes in a family-free setting. We then compare both methods on simulated datasets. Lastly, we use our heuristic method to reconstruct the genome sequence of the Black Death agent from the genome sequences of three Yersinia pestis strains. We compare our results to those of Rajaraman et al.
because it was considered too radical to be implemented as foreign policy. How was it possible, therefore, that pan-Asian ideology came to influence politics and, in the 1930s, to become Japan’s foreign policy doctrine?
The turning point in the development of Pan-Asianism is to be found in the late Meiji and the Taishô period, when a new kind of informal political society became more and more influential in Japanese politics, namely small political organisations (seiji kessha, 政治結社 or seisha, 政社) below the level of the party, most of them with a certain orientation towards right-wing extremism, in whatever sense. Even though these organisations are primarily considered a phenomenon of the 1930s (e.g. Storry 1957: 9 and passim), some of them had already launched political activities in the Meiji and Taishô periods. In English-speaking research, these organisations are called “patriotic societies” after the pioneer study by E. Herbert Norman (1944) 17. However, the societies already active during the Meiji and Taishô periods were not so much “patriotic” as pan-Asian. The “prototypes” of these organisations were the Kôakai (興亜会), founded in 1880 (Kuroki 1984, Hazama 2001a); the Gen’yôsha (“Black Ocean Society”, 玄洋社), founded in 1881 (Norman 1944); the Tôa-Dôbunkai (“Society for Common East Asian Culture”, 東亜同文会); and the Kokuryûkai (“Amur Society”, 黒龍会 18). These societies will be dealt with in detail later in this chapter. Besides numerous publications, these organisations used informal, personal channels in order to campaign for their objectives amongst politicians, the military and financial and
Asia, particularly its major economies, has witnessed slower growth in recent years. To make Asia more economically sustainable and resilient against external shocks, and to recover from the falling growth, most regional economies need to rebalance their export-oriented production and growth (directed mostly to advanced economies) towards Asian markets and regional demand, and towards trade-driven growth through increased intraregional infrastructure connectivity and regional economic integration. In 1992, pan-Asian transport connectivity was initiated through the Asian Highway Network and the Trans-Asian Railways Network. In 2015, an ambitious pan-Asian connectivity initiative, the “One Belt, One Road” (ancient Silk Road) initiative, was proposed. This initiative plans to create an economic zone covering Asia, Europe and Africa. To successfully promote and finance greater physical connectivity at the pan-Asian, sub-regional and national levels, Asia will require a strong and appropriate institutional framework for effective coordination, cooperation and collaboration among national, subregional and region-wide institutions as well as other stakeholders. This paper discusses the prospects and challenges facing Asian connectivity as well as infrastructure financing needs in Asia. It also examines the nature and characteristics of existing and new institutions and the emerging role of regional and international institutions in enhancing Asian connectivity. Lastly, it proposes an institutional architecture consisting of a new “Asian Infrastructure Coordination Facility (AICF)” involving major stakeholders for building seamless pan-Asian connectivity through bilateral, regional and international cooperation, partnership and collaboration in infrastructure development. JEL-Codes: R100, R400, R420.
The chloroplast genomes of the three carefully selected taxa of phototrophic euglenoids in this study were compared with further euglenoid chloroplast genomes to obtain an overview of chloroplast evolution in this highly diverse lineage within the Euglenozoa. Although the general gene composition was almost identical in all investigated cpGenomes, the chloroplast genomes show remarkable differences in size. The varying number of RNA repeats, IGS differences and, as the main factor, intron and twintron content were identified as the three principal causes of these size differences. The intrageneric and intergeneric comparisons revealed large cluster rearrangements, which occurred between different clades and resulted in a high synteny of derived taxa due to merging clusters. Although lineage-encompassing trends and consistencies in the evolution of euglenoids and their chloroplasts could be found only between species, and only exceptionally between genera, molecular morphology trends were detected here in the chloroplast genomes of euglenoids for the first time. These metacharacters appear suitable to support phylogenomic and phylogenetic analyses and can be used to understand and possibly rule out questionable positions. Among others, cluster arrangement, gene order and individual introns were determined to be significant metacharacters of the euglenoid chloroplast genome. Since plastid genomes from some euglenoid families are still not available, a broader sampling of euglenoid taxa across the tree would make it possible to explore, and perhaps confirm, the described metacharacters in other taxa as well. To this end it would be important to define further appropriate molecular morphology features and to establish a standardized system for analyzing these genome-level features.