Delicious apple orchards in Cuauhtémoc County; LA637 was isolated near Creel, Guerrero County. Whole-genome sequencing (Illumina HiSeq 2 × 100-bp shotgun sequencing) yielded 15,871,242 (LA635), 15,755,564 (LA636), and 16,055,090 (LA637) reads representing ~400× genome coverage. Genomes were assembled by combining de novo assembly using the Velvet short-read assembler plugin of the Geneious Server (Biomatters, Ltd., Auckland, NZ) and mapping against E. amylovora CFBP 1430 using Lasergene NGen 11 (DNASTAR, Madison, WI) with 8,000,000 reads. For each strain, a 3.8-Mb draft chromosome (8 contigs) and a circular plasmid (pEA29) were assembled. All se-
*These authors contributed equally to this article and share first authorship. Cite this article: Enkelmann J, von Laer A, Simon S, Fruth A, Lachmann R, Michaelis K, Borowiak M, Gillesberg Lassen S, Frank C (2020). Disentangling outbreaks using whole-genome sequencing: concurrent multistate outbreaks of Salmonella Kottbus in Germany, 2017. Epidemiology and Infection 148, e51, 1–6. https://doi.org/10.1017/S0950268820000394 Received: 16 December 2019
Current genomic studies are limited by the poor availability of fresh-frozen tissue samples. Although formalin-fixed diagnostic samples are abundant, they are seldom used in current genomic studies because of concern about formalin-fixation artifacts. Better characterization of these artifacts will allow the use of archived clinical specimens in translational and clinical research studies. To provide a systematic analysis of formalin-fixation artifacts on Illumina sequencing, we generated 26 DNA sequencing data sets from 13 pairs of matched formalin-fixed paraffin-embedded (FFPE) and fresh-frozen (FF) tissue samples. The results indicate a high rate of concordant calls between matched FF/FFPE pairs at reference and variant positions in three commonly used sequencing approaches (whole genome, whole exome, and targeted exon sequencing). Global mismatch rates and C·G > T·A substitutions were comparable between matched FF/FFPE samples, and discordant rates were low (<0.26%) in all samples. Finally, low-pass whole-genome sequencing produces a similar pattern of copy number alterations between FF/FFPE pairs. The results from our studies suggest the potential use of diagnostic FFPE samples for cancer genomic studies to characterize and catalog variations in cancer genomes.
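As a rough illustration of what a concordance or discordance rate between a matched FF/FFPE pair means, the following Python sketch compares genotype calls at positions called in both samples. The function name, the genotype encoding, and the toy calls are assumptions made for illustration only; they are not the pipeline or data used in the study.

```python
def concordance(ff_calls, ffpe_calls):
    """Return (concordant, discordant) fractions over positions called in both samples.

    Each argument maps a (chromosome, position) tuple to a genotype string.
    This encoding is a hypothetical choice for the sketch.
    """
    shared = ff_calls.keys() & ffpe_calls.keys()
    if not shared:
        raise ValueError("no positions called in both samples")
    agree = sum(1 for pos in shared if ff_calls[pos] == ffpe_calls[pos])
    rate = agree / len(shared)
    return rate, 1.0 - rate


# Toy calls (made up): four shared positions, one of which disagrees.
ff = {
    ("chr1", 100): "A/A",
    ("chr1", 200): "A/G",
    ("chr2", 50): "C/C",
    ("chr2", 75): "C/T",
    ("chr3", 10): "G/G",  # called only in the FF sample, so ignored
}
ffpe = {
    ("chr1", 100): "A/A",
    ("chr1", 200): "A/G",
    ("chr2", 50): "C/T",  # discordant call
    ("chr2", 75): "C/T",
}

concordant_rate, discordant_rate = concordance(ff, ffpe)
print(f"concordant: {concordant_rate:.2%}, discordant: {discordant_rate:.2%}")
```

Restricting the comparison to positions called in both samples is one possible design choice; a real analysis would also have to decide how to treat positions missing from one sample.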
chr10:13,280,000–15,440,000 (containing SUV39H2), chr1:3,360,000–3,359,999 (MUM1, GNA15, GNA11, STK11, and TCF3), chr14:67,880,000–69,720,000 (RAD51B), and chr19:14,600,000–16,800,000 (BRD4) (Supplementary Data 13). Copy number aberration calling. Copy number aberrations (CNAs) were called using ACEseq [64], which is available on GitHub [https://github.com/eilslabs/ACEseqWorkflow]. Briefly, ACEseq (allele-specific copy number estimation from whole-genome sequencing) determines copy number states, tumor cell content, ploidy, and sex in the tumor by using read coverage and the B-allele frequency (BAF). Heterozygous germline positions (with BAF 0.33–0.77 at dbSNP version 135 SNP loci) [65] are identified for later allele-specific copy number and loss-of-
Campylobacter is the major bacterial agent of human gastroenteritis worldwide and represents a crucial global public health burden. Species differentiation of C. jejuni and C. coli and phylogenetic analysis are challenged by inter-species horizontal gene transfer. Routine real-time PCR on more than 4000 C. jejuni and C. coli field strains identified isolates with ambiguous PCR results for species differentiation, in particular among isolates from eggs. K-mer analysis of whole-genome sequencing data indicated the presence of C. coli hybrid strains with large amounts of C. jejuni introgression. Recombination events were distributed over the whole chromosome. MLST typing was impaired, since C. jejuni sequences were also found in six of the seven housekeeping genes. cgMLST suggested that the strains were phylogenetically unrelated. Intriguingly, the strains shared a stress response set of C. jejuni variant genes, with proposed roles in oxidative, osmotic and general stress defence, chromosome maintenance and repair, membrane transport, cell wall and capsular biosynthesis and chemotaxis. The results have practical impact on routine typing and on the understanding of the functional adaptation to harsh environments, enabling successful spreading and persistence of Campylobacter.
Results: A high-throughput screening assay with a Vibrio cholerae reporter strain constitutively expressing green
fluorescent protein (GFP) was developed and applied in the investigation of the growth inhibitory effect of approximately 28,300 structurally diverse natural compounds and synthetic small molecules. Several compounds with activities in the low micromolar concentration range were identified. The most active structure, designated vz0825, displayed a minimal inhibitory concentration (MIC) of 1.6 μM and a minimal bactericidal concentration (MBC) of 3.2 μM against several strains of V. cholerae and was specific for this pathogen. Mutants with reduced sensitivity to vz0825 were generated and whole-genome sequencing of 15 pooled mutants was carried out. Comparison with the genome of the wild type strain identified the gene VC_A0531 (GenBank: AE003853.1) as the major site of single nucleotide polymorphisms in the resistant mutants. VC_A0531 is located on the small chromosome of V. cholerae and encodes the osmosensitive K+-channel sensor histidine kinase (KdpD). Nucleotide exchange of the major mutation site in the wild type strain confirmed the sensitive phenotype.
A subset of isolates from UK cases was selected for whole-genome sequencing (WGS). This included 24 of 37 outbreak isolates from UK cases and the isolate from the watermelon slice. An additional 11 non-outbreak isolates were selected for comparison. Genomic DNA was extracted using the Wizard Genomic DNA Purification Kit (Promega), and samples were sequenced using multiplex libraries on the HiSeq platform (Illumina) using 100 bp paired-end reads. The sequence data were aligned to the reference strain S. Newport SL254 (hereafter, SL254) along with its associated plasmids pSL254_3 and pSN254 (accession numbers CP001113, CP001112 and CP000604, respectively) using SMALT v0.6.4. Single nucleotide polymorphisms (SNPs) were identified relative to the reference strain and a maximum likelihood phylogeny of the isolates was constructed using RAxML [8-9]. A high divergence between the outbreak isolates and the reference SL254 was observed, with ca 50,600 SNPs separating the outbreak isolates from the reference SL254. To improve resolution, the outbreak isolates,
In this review, we analyse the current health economic evidence with respect to the sequencing of the human genome. Health economists distinguish between several methodological approaches. On one hand, some studies calculate the costs of new technologies and their economic burden. On the other hand, full economic evaluations go beyond pure effectiveness or cost measurements by combining assessments of costs and the consequences/outcomes of defined diagnostic procedures or interventions. Three evaluation approaches can be distinguished: cost-effectiveness analysis, cost-utility analysis, and cost-benefit analysis. Cost-effectiveness analyses evaluate alternative technologies (e.g. genome sequencing versus standard diagnostic techniques) by comparing costs and a common effectiveness parameter (e.g. life-years gained through the diagnostic link of patient subgroups to specific treatments). Cost-utility analyses use utilities like quality-adjusted life-years (QALYs) as benefit parameters. The two main advantages of cost-utility analyses are that they adjust for quality of life and allow comparisons between indications. In cost-benefit analyses, not only the cost but also the benefit is measured in monetary units. However, because of the difficulty in expressing patient benefit in monetary terms, this approach is rarely used in practice. The incremental approach is a common factor in all economic evaluations: they divide the additional costs of alternative A versus alternative B by the additional
Genome sequencing and assembly
For each sample, a whole-genome sequence library was prepared using the Illumina-Compatible Nextera DNA Sample Prep Kit (Epicentre, Madison, WI USA), according to the manufacturer’s protocol. Each library was tagged with an individual tag combination and a library pool containing equimolar amounts of the individual libraries was prepared. The library pool was sequenced in 2x250 bp paired read runs on the MiSeq platform, yielding 21,928,122 total reads. After de-multiplexing, the individual sample reads were assembled using the Newbler assembler v2.8 (Roche, Branford, CT USA). Contigs of the initial Newbler assemblies (unordered drafts) were then aligned to the reference genome of S. pneumoniae ATCC 700669 using
Human genetics is a still young and rapidly developing medical discipline. Since the end of the 1950s, it has been possible to investigate certain genetic diseases by chromosome analysis (Theile 1976). Over time, further methods have steadily been developed with which a growing spectrum of diseases can be diagnosed. Since the development of Sanger sequencing (Sanger, Nicklen, and Coulson 1977) in 1977, the sequences of the genome have increasingly been studied in order to identify new genes and to search for the genetic causes of diseases. With the completion of the Human Genome Project in 2004, the first version of a complete human genome became known, which contains only about 20,000–25,000 genes (International Human Genome Sequencing 2004). Since the beginning of the 21st century, new technologies have continually been developed, the so-called next-generation sequencing (NGS) techniques, which make sequencing of the human genome faster and cheaper. These technologies make it possible for more patients to undergo genetic testing today and for ever larger parts of the genome to be analysed in diagnostics as well.
The majority of the pan-genome hits were related to K. pneumoniae (6206 hits of 11,267) followed by E. aerogenes (1537 hits) and K. oxytoca (1014 hits). S. aureus, a Gram-positive species, served as an outgroup and no hits to its pan-genome were found. In total, 37 hits to 21 unique Resfams (core database) were found in the query genome CDS with 23 hits on the chromosome and 14 on the plasmid. The three most frequent Resfams were RF0115 (8 hits, RND antibiotic efflux pump), RF0098 (3 hits, multidrug efflux RND membrane fusion protein MexE, RND antibiotic efflux), and RF0053 (3 hits, class A beta-lactamase). Furthermore, the CDSs of eight antibiotic resistance genes reported in the original genome announcement were investigated. The HMM-based search of pan-genome centroids resulted in the identification of two chromosomal CDSs, WP_076027158.1 (multidrug efflux RND transporter periplasmic adaptor subunit OqxA) and WP_004146118.1 (FosA family fosfomycin resistance glutathione transferase), being classified as K. pneumoniae-derived centroids according to their top hits (with respect to the full sequence score). The top hits of the remaining genes (5 plasmid-derived and 1 chromosome-derived) included centroids from other Gram-negative species. However, the centroid cluster annotations matched the expected protein functions for all eight CDSs independent of the species. The top three hits for WP_004146118.1 were centroids from K. pneumoniae, E. aerogenes, and K. oxytoca, matching the expected annotation and present in almost all isolates (>98%) of the respective pan-genomes. This high prevalence matches the observations made by Ryota et al., who reported a similarly high frequency (>96%) of fosA in these species. For the beta-lactamases WP_004176269.1 (class A broad-spectrum beta-lactamase SHV-11) and WP_000027057.1 (class A broad-spectrum beta-lactamase TEM-1), the top hits in Klebsiella were associated with resistance to penicillins and cephalosporins.
For the aminoglycoside transferases WP_000018329.1 (aminoglycoside O-phosphotransferase APH(3′)-Ia), WP_032491824.1 (ANT(3″)-Ia family aminoglycoside nucleotidyltransferase AadA22), and WP_000557454.1 (aminoglycoside N-acetyltransferase AAC(3)-IId), the top hits in K. pneumoniae were associated with resistance to aminoglycosides. Moreover, all three chromosome-derived CDSs (WP_004176269.1, WP_076027158.1, and WP_004146118.1) matched centroids found in >92% of the K. pneumoniae isolates, two of the five plasmid-derived CDSs (WP_032491824.1 and WP_000027057.1) matched centroids with a frequency of >25%, while the remaining CDSs matched centroids with a frequency of <12%.
55. Hale JD, Ting YT, Jack RW, Tagg JR, Heng NC: Bacteriocin (mutacin) production by Streptococcus mutans genome sequence reference strain UA159: elucidation of the antimicrobial repertoire by genetic dissection. Appl Environ Microbiol 2005, 71:7613–7617. 56. Dufour D, Cordova M, Cvitkovitch DG, Levesque CM: Regulation of the competence pathway as a novel role associated with a streptococcal bacteriocin. J Bacteriol 2011, 193:6552–6559.
18. Muzzi A, Donati C: Population genetics and evolution of the pan-genome of Streptococcus pneumoniae. Int J Med Microbiol 2011, 301:619–622.
19. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, et al.: Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A 2005, 102:13950–13955.
and the expected average new gene number with the addition of a new genome is estimated to be 15. The infinite pan-genome was first proposed by Tettelin et al. for S. agalactiae based on the use of 9 S. agalactiae genomes. The three regression models used in this study are all based on the assumption that contingency genes are independently sampled from the pan-genome with equal probability, except in the case of “specific/unique genes”, which are modeled as unique events that appear only once in the entire global population. Hogg et al. proposed a finite supragenome model for the pan-genome based on a different supposition, namely that contingency genes are sampled from the pan-genome with unequal probability. By applying this finite supragenome model to 44 S. pneumoniae genomes, the predicted number of new genes drops sharply to zero when the number of genomes exceeds 50. However, in the case of S. mutans we could not observe such a sharp decrease of new gene number even after 67 genomes were included. Cornejo et al. proposed a finite pan-genome for S. mutans after using a special “pseudogene cluster” identification process to exclude about 30% of the rare genes that are considered to be pseudogenes. However, they did not provide the detailed parameters they obtained from fitting. Our modeling using the 67 S. mutans genomes, applying the model described above without any restrictions, pointed to an infinite pan-genome of S. mutans.
However, we would like to interpret this predicted “infinite” pan-genome as follows: 1) a “pan-genome” should be considered “dynamic” rather than “static”, meaning that the pan-genome content changes during evolution regardless of whether its size is infinite or finite; 2) a change in pan-genome content can be caused by the acquisition of new genes or by the loss of genes; 3) the actual pan-genome size can be more stable than the content of the pan-genome but can also change during evolution together with changes in the environment. Thus, when “gene loss events” are not considered, a “growing” or “infinite” pan-genome is to be expected, since gene acquisition occurs no matter how slow it might be. Interestingly, Cornejo et al. found a high rate of LGT in S. mutans, where many genes were acquired from related streptococci and bacterial strains predominantly residing not only in the oral cavity, but also in the respiratory tract, the digestive tract, cattle, genitalia, in insect pathogens and in the environment in general. Such a high rate of LGT might also lead to a continuously growing pan-genome.
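One common way to make the distinction between a finite and an infinite pan-genome concrete is to fit a power law n(N) = κ·N^(−α) to the number of new genes n found when the N-th genome is added; under that parametrisation, α ≤ 1 corresponds to an unbounded (open) pan-genome and α > 1 to a bounded (closed) one. The Python sketch below fits such a curve by least squares in log-log space. It is an assumption that this resembles any of the three regression models referred to above; the function names and the toy data are invented for illustration and are not the S. mutans results.

```python
import math


def fit_power_law(genome_counts, new_genes):
    """Fit n = kappa * N**(-alpha) by ordinary least squares on log-transformed data.

    Returns (kappa, alpha). All inputs must be positive.
    """
    xs = [math.log(n) for n in genome_counts]
    ys = [math.log(g) for g in new_genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def is_open_pan_genome(alpha):
    """Under the power-law parametrisation, alpha <= 1 implies unbounded growth."""
    return alpha <= 1.0


# Toy data (made up): new genes decay as 100 * N**-0.5.
counts = [2, 4, 8, 16, 32]
observed = [100 * n ** -0.5 for n in counts]

kappa, alpha = fit_power_law(counts, observed)
print(f"kappa = {kappa:.1f}, alpha = {alpha:.2f}, open = {is_open_pan_genome(alpha)}")
```

A fit of this kind describes gene acquisition only; as argued above, it does not account for gene loss, so an "open" result should be read with that limitation in mind.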
Effect of mapper on mutation calling. The differences between sets of mutations submitted by the participating groups raised questions about the impact of individual pipeline components on the results. The extent of observed pipeline customization (Supplementary Methods and Supplementary Data 1) did not allow for exhaustive testing of all potentially important analysis steps; however, three pipeline components were selected for closer inspection because of their expected high impact: mapper, reference genome build and mutation caller. Four mappers (Novoalign2, BWA, BWA-mem and GEM), two SSM callers (MuTect [31] and Strelka) and three versions of the human reference genome (b37, b37 + decoy and ‘hg19r’, a reduced version of hg19 with unplaced contigs and haplotypes removed) were selected for testing, based on their usage by the benchmarking groups (Supplementary Methods for software versions and settings). To limit the effect of non-tested software on the produced mutation sets, a simple SSM-calling pipeline was established. First, we compared the effect of the mapper with each of the SSM callers. With a single SSM caller employed, a considerable fraction of unfiltered SSM calls for a given mapper (0.22–0.69, depending on the mapper–caller combination) is not reproducible by that caller with any other mapper (Supplementary Fig. 19). When compared with the Gold Set (Tier 3 SSMs), calls supported by a single mapper are almost exclusively FPs (precision <0.02). On the other hand, a large majority of calls supported by all four mappers are TPs (with precision ranging from 0.87 for MuTect to 0.99 for Strelka).
Effect of primary mutation caller on mutation calling. Similar trends are observed when SSM callers are compared while holding the mapper constant (Supplementary Table 5). A sizable fraction (0.22–0.87, depending on the mapper) of unfiltered SSM calls for any given mapper–caller combination is not reproducible by the other caller on the same alignment file.
Remarkably, in the case of Novoalign2, the same alignment file leads to the most somatic calls and the lowest overall precision when used with MuTect, but the fewest somatic calls and highest overall precision when used with Strelka. When compared with the Gold Set, calls private to a single caller appear to be mostly FPs, with precision ranging from 0.01 to 0.05. Calls supported by both callers prove to be mostly correct (with precision between 0.89 and 0.93; Supplementary Table 6). The consensus sets seem to be robust, considerably improving the precision rates while only minimally lowering the sensitivity. The results of reference genome choice and a detailed examination of the alignment characteristics of the different aligners are presented in Supplementary Note 1.
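The comparison of private versus consensus calls can be sketched in a few lines of Python. The sketch below splits calls into those seen with exactly one mapper and those seen with every mapper, then scores each group against a gold set. The function names, the tuple encoding of a call, and all call data are hypothetical and chosen only to illustrate the bookkeeping; they are not taken from the benchmark.

```python
from collections import Counter


def split_by_support(calls_by_mapper):
    """Return (private, consensus): calls seen with exactly one mapper / with all mappers."""
    support = Counter(
        call for calls in calls_by_mapper.values() for call in set(calls)
    )
    n_mappers = len(calls_by_mapper)
    private = {call for call, k in support.items() if k == 1}
    consensus = {call for call, k in support.items() if k == n_mappers}
    return private, consensus


def precision(calls, gold):
    """Fraction of calls that are present in the gold set (TP / (TP + FP))."""
    if not calls:
        raise ValueError("precision is undefined for an empty call set")
    return len(calls & gold) / len(calls)


# Toy SSM calls as (chromosome, position, ref, alt) tuples; all values are made up.
a = ("chr1", 1000, "C", "T")
b = ("chr2", 2000, "G", "A")
c = ("chr3", 3000, "A", "C")
d = ("chr4", 4000, "T", "G")
e = ("chr5", 5000, "C", "A")

calls_by_mapper = {
    "Novoalign2": {a, b, c},
    "BWA": {a, b, d},
    "BWA-mem": {a, b},
    "GEM": {a, b, e},
}
gold = {a, b, d}

private, consensus = split_by_support(calls_by_mapper)
print("private precision:", precision(private, gold))
print("consensus precision:", precision(consensus, gold))
```

The same two functions apply unchanged when the dictionary keys are callers rather than mappers.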
Sequences deposited in the MIPS database, but not immediately published by the sub-contracting laboratory, were confidential, although third parties – for instance the other sub-contractors and the companies who were members of the YIP – could have limited and controlled access to them before they were made available to everyone with the publication of the complete chromosome sequence in an open access database (Vassarotti et al. 1995, p. 134; Joly and Mangematin 1998, pp. 81–82). Before this final publication, the sequence data were checked to reduce the error rate, and assembled. Some re-sequencing was also undertaken for verification purposes by the participating laboratories, as part of their contractual obligations. The publication of the complete chromosome sequence had to follow the conclusion of the data collection process within six months and, at that point, previous intellectual property rights were lost and the sequencing laboratories were “liable for free distribution of DNA material and other biological materials to third parties” (Vassarotti et al. 1995, p. 136). Yeast geneticists shared a deep communitarian ethos, and if the European yeast genome network had to break this tradition to stimulate rapid analysis of the genomic data, it also had to reinstate it eventually, as the project itself had benefitted from this ethos in the acquisition of the original genomic libraries, donated by the US researchers Maynard Olson, Linda Riles and Carol Newlon (Dujon 2015).
We have resequenced the genome of P. protegens CHA0 starting with genomic DNA from its accession number CCOS 2 at the Culture Collection of Switzerland (CCOS). The strain was grown in LB broth at 28°C for 1 day. Total DNA was extracted from the pure culture using the DNeasy blood and tissue kit (Qiagen, Hilden, Germany). Genomic library preparation and genome sequencing were outsourced to GATC Biotech, AG (Constance, Germany). Libraries were prepared using a SPRIworks fragment library system I (Beckman Coulter, Brea, CA), following the manufacturer’s instructions. The TruSeq paired-end (PE) cluster kit v3-cBot-HS (Illumina, San Diego, CA) was used for cluster generation. Sequencing was performed on a HiSeq 2000 Illumina sequencer with 2 × 50-bp paired-end reads using the TruSeq SBS kit v3-HS (Illumina). A total of 97,357,690 quality-filtered reads were obtained from GATC, giving an approximate coverage of 700×. For de novo assembly using SeqMan NGen v12.2 (DNAStar, Madison, WI) with standard settings, only 8,500,000 reads (55× coverage) were used. Repeated cycles of read mapping with the SeqMan NGen software and inspection in different subroutines of the Lasergene package (DNAStar) yielded a complete genome of 6,868,156 bp with a G+C content of 63.39%. Based on a small region close to an rRNA region, the genome was 176 bp larger than the previous version (7). Additionally, a few indels, mainly in homopolymer regions, were observed.
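As a back-of-the-envelope check, mean sequencing depth is commonly estimated as the number of reads times the read length divided by the genome size. The Python sketch below applies that formula to the full read set. It assumes that every counted read is a full 50 bp and contributes to coverage, which is a simplification; the function name is invented for illustration.

```python
def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Estimate mean depth as total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return n_reads * read_length_bp / genome_size_bp


# Values taken from the text: 97,357,690 reads of 50 bp and a 6,868,156-bp genome.
full_set = mean_coverage(97_357_690, 50, 6_868_156)
print(f"estimated coverage with all reads: {full_set:.0f}x")
```

Under these assumptions the estimate is close to 709-fold, in line with the approximate 700-fold coverage reported for the full read set.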
accurate diagnostics of clinically relevant germline and somatic mutations [45]. Different methods using semiconductors (Ion Torrent), pyrosequencing (Roche), sequencing by ligation (Applied Biosystems), and the widely used sequencing by synthesis with reversible terminators (Solexa, Illumina) enabled gene panel, whole-exome, and whole-genome sequencing within a few days at moderate costs [43]. However, both Sanger sequencing and NGS technologies deliver only short-read DNA fragments within the range of 50–1000 bases. The short reads prevent analysis of complex genomic loci, repetitive elements, or variant phasing (haplotyping) and result in inefficient and incomplete genome assemblies. Moreover, PCR amplification of sequencing templates generates artefacts and precludes detection of native base modifications. Several of these shortcomings can be overcome by third-generation sequencing technologies (TGS), also referred to as long-read sequencing in the following.
Next generation sequencing (NGS) provides the opportunity to rapidly and at relatively low cost establish gene space assemblies for virtually any species. These assemblies consist of tens to hundreds of thousands of short contiguous pieces of DNA sequence (contigs) and often represent only the low-copy portion of the genome. Despite the limitations of such assemblies, they have been widely proposed as surrogates for draft genome sequences for the purposes of gene isolation, genomics-assisted breeding and the assessment of diversity within and between species (Brenchley et al., 2012; IBSC, 2012; Xu et al., 2012; Guo et al., 2012). However, in most cases, particularly those concerning large and complex genomes, they remain disconnected collections of short sequence contigs that are not embedded in a genomic context. Bringing these fragments together into a tentative linear order, or even associating contigs with individual chromosomes or chromosome arms, has been a major and costly undertaking. In a recent example, the International Barley Genome Sequencing Consortium (IBSC, 2012) reported a gene space assembly of the 5.1 Gb genome of barley. The development and use of a BAC-based physical map, BAC end sequences, flow-sorted and chromosome-arm survey sequences, fully sequenced BAC clones and conserved synteny were all required to fully contextualize only 410 Mb of genomic sequence (IBSC, 2012). These genomic resources provide an established path towards a reference sequence by sequencing a minimum tiling path of overlapping BAC clones and hierarchically (Feuillet et al., 2012). The development of the necessary resources requires a substantial amount of time, labor and finances, which makes this strategy prohibitive for smaller and more poorly resourced research communities, e.g. research in non-model organisms or orphan crops.
The establishment of a BAC-based reference sequence of the maize genome took about seven years, required the coordinated effort of several laboratories and cost about US $50 million (Chandler and Brendel, 2002; Martienssen et al., 2004; Schnable et al., 2009). Similarly, the reference sequence of a single 1 Gb chromosome of hexaploid wheat has not been finished five years after the publication of a physical map (Paux et al., 2008).