ABSTRACT Ridge regression with heteroscedastic marker variances provides an alternative to Bayesian genome-wide prediction methods. Our objectives were to suggest new methods to determine marker-specific shrinkage factors for heteroscedastic ridge regression and to investigate their properties with respect to computational efficiency and accuracy of estimated effects. We analyzed published data sets of maize, wheat, and sugar beet as well as simulated data with the new methods. Ridge regression with shrinkage factors that were proportional to single-marker analysis of variance estimates of variance components (i.e., RRWA) was the fastest method. It required computation times of less than 1 sec for medium-sized data sets, which have dimensions that are common in plant breeding. A modification of the expectation-maximization algorithm that yields heteroscedastic marker variances (i.e., RMLV) resulted in the most accurate marker effect estimates. It outperformed the homoscedastic ridge regression approach for best linear unbiased prediction, in particular in situations with high marker density and strong linkage disequilibrium along the chromosomes, which occur often in plant breeding populations. We conclude that the RRWA and RMLV approaches provide alternatives to the commonly used Bayesian methods, in particular for applications in which computational feasibility or accuracy of effect estimates is important, such as detection or functional analysis of genes or planning crosses.
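The idea of marker-specific shrinkage can be illustrated with a minimal sketch. It is not the RRWA or RMLV procedure itself: the function name, the toy data, and the shrinkage factors passed in are hypothetical, and the sketch only shows how a vector of per-marker shrinkage factors replaces the single penalty of homoscedastic ridge regression in the normal equations.

```python
import numpy as np

def heteroscedastic_ridge(X, y, lam):
    """Solve (Xc'Xc + diag(lam)) u = Xc'yc for the marker effects u.

    X   : (n, p) marker matrix
    y   : (n,) phenotypic values
    lam : (p,) marker-specific shrinkage factors (assumed to be given;
          how they are estimated is outside the scope of this sketch)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lam, dtype=float)
    Xc = X - X.mean(axis=0)   # centre markers so the intercept equals mean(y)
    yc = y - y.mean()
    A = Xc.T @ Xc + np.diag(lam)
    return np.linalg.solve(A, Xc.T @ yc)

# Toy data: 50 lines, 10 biallelic markers, one marker with a large effect.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 10)).astype(float)
true_u = np.zeros(10)
true_u[3] = 5.0
y = X @ true_u + rng.normal(scale=0.5, size=50)

# Homoscedastic case: the same shrinkage factor for every marker.
u_homo = heteroscedastic_ridge(X, y, np.full(10, 5.0))

# Heteroscedastic case: weak shrinkage for marker 3, strong for the rest.
lam = np.full(10, 50.0)
lam[3] = 0.5
u_het = heteroscedastic_ridge(X, y, lam)
```

Passing a constant vector for `lam` recovers ordinary ridge regression, so the homoscedastic approach is a special case of the same solver.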
We applied the BLUP and RMLV analyses to two experimental data sets to derive guidelines for the application of genome-wide prediction methods to introgression populations. In the analysis of the rapeseed introgression population, a major gene for glucosinolate content was found that controls the phenotypic difference between the donor and the recipient (Figure 3). The RMLV analysis estimated an effect size of 23 and the BLUP analysis an effect size of 18. In addition, the BLUP analysis detected a large number of significant donor segments with small effects. Many of these were shrunken to near zero in the RMLV analysis. The results presented in Figure 1C suggest that the true effect size might be closer to the RMLV estimate than to the BLUP estimate, because the differences between donor and recipient can mainly be attributed to a single major gene.
For the maize data set, the trait GY-WW was investigated, for the wheat data set the trait GY, and for the sugar beet data set the trait SC. GWP, genome-wide prediction; RIR, ridge regression employing preliminary estimates of the heritability; BLUP, best linear unbiased prediction; RMLV, modification of the restricted maximum likelihood procedure that yields heteroscedastic variances; RRWA, ridge regression with weighting factors according to analysis of variance components; RMLA, estimation of the error and genetic variance components with restricted maximum likelihood and partitioning according to analysis of variance components; BL, Bayesian LASSO; HEM, heteroscedastic effects model; SNP, single-nucleotide polymorphism; DArT, diversity array technology; GY, grain yield; WW, well-watered; SC, sugar content.
A drawback of genome-wide association studies (GWAS) is the occurrence of ‘spurious or false’ associations between genetic markers and the trait of interest. Cryptic population structure has long been recognized as one of the main causes of such spurious associations (Li, 1969; Lander and Schork, 1994). Pritchard et al. (2000) inferred population structure with a Bayesian clustering approach (STRUCTURE). They assumed a model with K populations in which individuals are assigned to populations on the basis of their genotypes while the allele frequencies of each population are estimated at the same time. Patterson et al. (2006) introduced a technique to examine population structure in genetic data through principal component analysis (Cavalli-Sforza and Feldman, 2003), which determines the statistically significant ‘axes of variation’. However, all these approaches have had limited success in dealing with this issue effectively (Pritchard et al. 2000; Price et al. 2006; Yu et al. 2006). Variation at the DNA level can provide enough information about the underlying population structure, in addition to the conventional approaches based on pedigree or phenotypic records (Varshney et al. 2005). This knowledge of population structure can play an important role in organising the efficient exploitation of germplasm in crop breeding (Bus et al. 2011). GWAS has the potential to be used directly in plant breeding programmes (Jannink et al. 2010).
With the development of sequencing technology, genomics-assisted crop improvement became a popular approach to predicting hybrid performance. Hybrid prediction can be performed using either marker-assisted selection (MAS; Lande and Thompson 1990) or genomic selection (GS; Meuwissen et al. 2001). The procedures of both strategies are similar, requiring a phenotyped and genotyped training population and a genotyped test population (Figure 2). The most obvious difference is the marker resource used in the test population: prediction of genotypic values by MAS is based on the effects of a limited number of selected markers that show significant marker-trait associations. In contrast, in GS all markers are used without marker-specific significance tests (Heffner et al. 2009; Zhao et al. 2015b). MAS is most effective for traits that are controlled by a few major genes, and GS is preferable if the genetic architecture of the target traits is complex (Heffner et al. 2009; Heslot et al. 2012).
To determine whether these RRBS libraries were generally representative we compared the GC content, the representation of CpG islands, transcripts, promoter regions and different classes of repeat elements between the entire mouse genome (Waterston et al., 2002), the 500-600 bp BglII fraction thereof and the genome sequences hit by the RRBS clones (Table 5). While reducing the representation introduced a noticeable bias, in particular a reduction of repeats, bisulfite conversion, PCR amplification, cloning and sequencing did not. The GC content of loci covered by RRBS sequences ranged from 32 to 63%, indicating satisfactory performance of our protocol over a wide range of GC content. Likewise, the distribution of the sequenced clones in the genome did not show conspicuous hot or cold spots (see Figs 20 and 21). Taken together, our data suggest that RRBS libraries are sufficiently random and representative of the genome fraction used to make them.
For these reasons, there is considerable interest in the use of forward genetic screens capable of engineering into the cancer genome mutational events that can be tested for their ability to cause drug resistance in an unbiased fashion. Such screens, if sufficiently unbiased, could in theory capture the entire breadth of genetic resistance mechanisms for any drug. Recent studies have demonstrated the power of both genome-wide gain- and loss-of-function screens using CRISPR/Cas9, lentiviral shRNA, and large-scale open-reading frame technologies to identify clinically relevant drug resistance mechanisms in cancer (Hu and Zhang 2016). However, these screens all fail to capture a third important mechanism of drug resistance, namely that of point mutations. Point mutations account for resistance in large numbers of patients receiving targeted therapies in melanoma, colon and lung cancers, and chronic myeloid leukemia (Supplemental Table S1; Kobayashi et al. 2005; Katayama et al. 2012; Montagut et al. 2012; Ohashi et al. 2012; Bettegowda 2014; Long et al. 2014; Van Allen et al. 2014; Wagle et al. 2014; Arena et al. 2015; Russo et al. 2015; Siravegna et al. 2015; Thress et al. 2015).
Background: Mutans streptococci are a group of bacteria that contribute significantly to tooth decay. Their genetic variability, however, is still not well understood.
Results: Genomes of 6 clinical S. mutans isolates of different origins, one isolate of S. sobrinus (DSM 20742) and one isolate of S. ratti (DSM 20564) were sequenced and comparatively analyzed. Genome alignment revealed a mosaic-like structure of genome arrangement. Genes related to pathogenicity were found to vary strongly among the strains, whereas genes for oxidative stress resistance are well conserved, indicating the importance of this trait in the dental biofilm community. Analysis of genome-scale metabolic networks revealed significant differences in 42 pathways. A striking dissimilarity is the unique presence of two lactate oxidases in S. sobrinus DSM 20742, probably indicating an unusual capability of this strain to produce H2O2 and expand its ecological niche. In addition, lactate oxidases may form, together with other enzymes, a novel energetic pathway in S. sobrinus DSM 20742 that can remedy its deficiency in the citrate utilization pathway.
To show the applicability of the approach described above, we purified CD14++ CD16- monocytes and subsequently prepared formaldehyde-cross-linked chromatin extracts from 4 healthy blood donors. Purity was monitored by FACS and was in all cases equal to or above 95% (Fig. S1). For each sample we performed chromatin immunoprecipitation experiments with antibodies specific for the histone modifications H3K4me3, H3K27me3 and H3K9ac. Whereas H3K4me3 and H3K9ac have been repeatedly shown to be associated with promoters of actively transcribed genes, H3K27me3 is known for its role in polycomb-mediated gene repression. After preparation of barcoded libraries, we always sequenced 4 of them on a single lane of an Illumina HiSeq 2000. After read alignment, filtering and quality control, inspection of the ChIP-seq data revealed highly similar profiles when comparing the binding data for a given histone modification across the 4 different donors. This is shown as an example for H3K4me3 and H3K27me3 data in genome browser snapshots (Fig. 2). Even at this level, the high degree of reproducibility is immediately visible. Additionally, several principal features of the data can be appreciated. Of the four genes in the depicted genomic interval on chromosome 12, three are strongly marked by H3K4me3 with the typical bimodal pattern across the transcriptional start sites of the respective genes (ING4, ZNF384 and COPS7A). These genes are virtually free of H3K27me3. In contrast, the PIANP gene shows only relatively weak association with H3K4me3 but instead is marked by a strong H3K27me3 signal. Similar to H3K4me3, the H3K27me3 signal peaks in the promoter region but is much more spread out, which is in line with reports from the literature.
cation transporters that may also act as calcium transporters. Significant associations were also noted on chromosome 2A, with 11 SNP markers located within this region (64–66.6 cM), some of them encoding a disease resistance protein, CBS domain-containing protein, receptor-like protein kinase 2, phosphatidylinositol-4-phosphate 5-kinase family protein, NHL domain-containing protein or Rho GTPase-activating protein, besides other genes with unknown function. The LD region on chromosome 2A is widely spread on the physical map of the IWGSC1 genome assembly, extending to the long and the short arm of chromosome 2A. Discrepancies in the order of the contigs in this genome assembly were already described in Zanke et al. (2017). This region contains a number of genes potentially related to calcium accumulation, such as mechanosensitive ion channel family proteins (Traes_2AL_6069A884, Traes_2AL_72F83E7B0) and a number of heavy metal transport/detoxification superfamily proteins (Traes_2AS_95611CAD2, Traes_2AL_6DD37E6BE, Traes_2AL_9B175F3Da, Traes_2AL_F360E3FE3, Traes_2AL_13CBA4FEA, Traes_2AS_AA84E72D4, and Traes_161086245). Nine significant SNPs occurred on chromosome 5B, encoding different functions; some of them may be involved in calcium transport, such as the gene Traes_5BL_DF8D1B819, which is located at 100.9 cM and encodes an ammonium transporter. On chromosome 5D, there were two significant markers, Jagger_c8037_96 and BS00032035_51, with unknown functions. Six significant SNP markers are located on chromosome 6A; they are related to two genes encoding histone superfamily proteins with a role in the activation of calcium/calmodulin-dependent protein kinases (Davis et al., 2003). Based on our results, the annotated functions of significant genes and genes in the LD region suggest the presence of several genes controlling calcium uptake.
These genes can be considered as putative candidate genes for calcium accumulation in wheat grains and provide a solid resource for future work. However, further functional validation of these genes and their role in calcium uptake in wheat grains is still needed.
implicated in nucleosome positioning in S. pombe and its binding was also observable at euchromatic regions, it was considered as a candidate that may generally determine nucleosome positioning. To test for a genome-wide role of Mit1 on nucleosome positioning, a nucleosome occupancy map of the mit1 mutant was prepared. The TSS-aligned overlay of nucleosome occupancy profiles revealed a strongly compromised amplitude of the nucleosomal array compared to wt and the spectral analysis of the nucleosome occupancy profile did not reveal the prominent frequency of 6.5 nucleosomes per 1000 bp (Fig. 28A and B). Further, not only the downstream arrays but also the weaker upstream arrays at promoters containing H2A.Z were diminished in the mit1 mutant (Fig. 28C). These findings argue for a role of Mit1 in regular nucleosome spacing up- and downstream of promoter NDRs.
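The spectral analysis mentioned above can be sketched as follows. This is a minimal illustration on a synthetic profile, not the analysis pipeline used for the data; the function name and the cosine test signal are hypothetical. It locates the dominant periodicity of an occupancy profile with a discrete Fourier transform and reports it in nucleosomes per 1000 bp.

```python
import numpy as np

def dominant_frequency(profile, bp_per_bin=1):
    """Return the strongest periodicity of a profile in cycles per 1000 bp."""
    profile = np.asarray(profile, dtype=float)
    centred = profile - profile.mean()             # drop the constant offset
    power = np.abs(np.fft.rfft(centred)) ** 2      # power spectrum
    freqs = np.fft.rfftfreq(profile.size, d=bp_per_bin)  # cycles per bp
    k = np.argmax(power[1:]) + 1                   # skip the zero-frequency bin
    return freqs[k] * 1000.0

# Synthetic regular array: 6.5 nucleosomes per 1000 bp over a 2000 bp window.
positions = np.arange(2000)
synthetic = np.cos(2 * np.pi * positions * 6.5 / 1000.0)
peak = dominant_frequency(synthetic)
```

A frequency of 6.5 nucleosomes per 1000 bp corresponds to a repeat length of about 154 bp (1000 / 6.5). For a profile without regular spacing, the power spectrum has no pronounced peak at this frequency.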
1,369.41225 ± 1.986, τc = 15.90248 ± 0.66807 (detailed information on all core- and pan-genome modeling is given in Additional file 3). Using this fitting result to describe the core-genome of S. mutans, the theoretical core-genome size (Ω) was estimated to be around 1,370 genes, which is slightly lower than the core-genome size calculated from 67 genomes (1,373). Compared with other streptococcus species, the core-genome of S. mutans is similar in size to that of S. pyogenes (1,400 genes determined using 11 strains) and smaller than those of S. pneumoniae (1,647 genes determined using 47 strains) and S. agalactiae (1,800 genes determined using eight strains) [19,22,23]. However, we should be cautious with such comparisons. In a recent study by Cornejo et al., the core genome size of S. mutans was determined as 1,490 using 57 S. mutans genomes, which clearly differs from the core genome size we estimated, although we included the 57 S. mutans genomes used by Cornejo et al. in our study. The difference can have several causes, such as differences in the correction step for core gene determination and, very likely, different methods and parameter settings used for determining orthologs. Apparently, we used a more stringent process to determine orthologs, which led to a smaller estimated core genome size of S. mutans.
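For orientation, the role of the fitted parameters can be written out under the assumption that the widely used exponential decay model for core-genome size was fitted here; the passage is truncated before the model is stated, so both the functional form and the symbol κc are assumptions and not taken from the text:

```latex
F_c(n) = \kappa_c \exp\!\left(-\frac{n}{\tau_c}\right) + \Omega ,
\qquad
\lim_{n \to \infty} F_c(n) = \Omega
```

In this form, n is the number of genomes included, τc is the decay constant that describes how quickly the number of shared genes levels off, and Ω is the asymptotic (theoretical) core-genome size. Reading the first fitted value in the passage as Ω is consistent with the stated estimate of around 1,370 genes.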
The bovine (Bos taurus) genome was completely sequenced a few years ago within the Bovine genome project (Eck et al. 2009). With the advent of high-throughput genotyping technologies, the discovery of cattle SNPs and the development of commercial cattle SNP-chips with many thousands of polymorphic markers have become straightforward. SNP-chips can be used for genome-wide association studies (GWAS) to find SNPs that are in linkage disequilibrium (LD) with a quantitative trait locus (QTL) behind a trait of interest. The main purpose of a GWAS is to identify chromosome regions that harbor the gene(s) that contribute to the phenotypic variation of a trait, which could then serve as putative QTL regions for further studies (Sahana et al. 2010). In genomic selection, SNPs with large effects in GWAS can be selected to obtain more accurate breeding values, even for individuals without phenotypic observations. Moreover, because of the high density of SNPs, GWAS is better suited for fine-mapping of QTLs than traditional linkage analysis, which usually locates QTLs within very large chromosome intervals (Goddard & Hayes 2009). Hence, GWAS can be expected to have higher power than linkage studies to detect QTLs behind quantitative traits that are influenced by many genes of small effect (Cordell & Clayton 2005; Sahana et al. 2010).
To illustrate the mechanism for exclusion of termination factors from a central transcribed region by Tyr1 phosphorylation, we summed up the genome-wide occupancies for Ser2- and Ser5-phosphorylated Pol II and subtracted from this sum the occupancy with Tyr1-phosphorylated Pol II. Although calculation of such a difference profile is problematic due to unknown normalization factors between data sets, we obtained a curve that contains peaks just downstream of the TSS and the pA site, and an extended depression in between (Figure 26). Whereas the peaks correspond to regions where termination factors are usually recruited (5’ recruiting region, Nrd1; 3’ recruiting region, Pcf11 and Rtt103), the depression indicates a central region in which the Ser2-phosphorylated CTD is masked and Pol II is shielded from termination factors. We note that additional factors can contribute to the recruitment of termination factors to elongating Pol II. For instance, Nrd1 functions in a complex with Nab3 and Sen1 (308). Nrd1 and Nab3 interact specifically with nascent RNA (60), which may contribute to Nrd1 recruitment or its persistence near Pol II even when Tyr1 phosphorylation levels rise. As detailed later, this seems to be especially true at snoRNA genes. In addition, our model assumes uniform CTD phosphorylation on all repeats, which does not necessarily occur, but there are currently no data that address this issue.
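The arithmetic behind the difference profile is simple and can be sketched on toy numbers. The occupancy values below are invented for illustration, the function name is hypothetical, and no normalization between data sets is attempted, which is exactly the caveat raised in the text.

```python
import numpy as np

def difference_profile(ser2, ser5, tyr1):
    """Sum of Ser2P and Ser5P occupancy minus Tyr1P occupancy, per position."""
    ser2, ser5, tyr1 = (np.asarray(a, dtype=float) for a in (ser2, ser5, tyr1))
    return ser2 + ser5 - tyr1

# Toy gene with five positions: upstream, just after the TSS, gene body,
# just after the pA site, downstream. All values are made up.
ser5 = np.array([0.0, 3.0, 1.0, 0.0, 0.0])   # high near the TSS
ser2 = np.array([0.0, 0.0, 1.0, 3.0, 0.0])   # high near the pA site
tyr1 = np.array([0.0, 1.0, 3.0, 1.0, 0.0])   # high in the gene body

diff = difference_profile(ser2, ser5, tyr1)
# diff has two peaks (positions 1 and 3) flanking a depression (position 2)
```

With these toy inputs the result reproduces the qualitative shape described above: one peak downstream of the TSS, one downstream of the pA site, and a depression in between.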
The mainstay of all genetic studies has been genome-wide linkage scans in families with at least two asthma-affected siblings. Based on a previous analysis of a genome-wide scan of asthma [7,8], whose chromosomal findings were inconsistent with earlier studies, we decided to expand the initial sample with additional families, using the same core protocol for clinical examination and the same set of microsatellite markers [7,8]. The increased number of identically pheno- and genotyped families could be used to define sub-phenotypes, which may be a promising strategy to explain the aetiological heterogeneity observed so far. Relevant clinical subsets may be defined by different age of onset, different disease course by degree of severity, extrinsic (allergic sensitization detectable) and intrinsic (no allergic sensitization detectable, symptoms often during infections of the upper respiratory tract) asthma type, and house dust mite allergy (HDM), as well as genetic background as judged by geographical origin of parents (Table 1). We hypothesized that the restriction to a smaller, well-defined sample would reduce heterogeneity and improve the power of detecting linkage. Linkage regions should show higher lod scores than in the total sample and lead from phenotype subgroups to a genotypic dissection.
Chapter 5 concentrates on the first proposed CNV analysis strategy, which has a special focus on the selection of the genomic marker probe sets that are tested for an association with the trait of interest. One of the most striking differences between the genome-wide analysis of CNVs and other genomic variants, such as SNPs or short tandem repeats (STRs), is that the locations in which individuals have gained or lost copies of genetic material are a priori unknown. Current genotyping platforms provide SNP probe sets that are designed to reflect the presence or absence of the two SNP marker alleles and (additional) CNV probe sets that are selected for their linear response to copy number changes. The corresponding genome-wide analysis of SNPs is a straightforward procedure, which includes the assignment of genotype classes AA, AB or BB to each recruited individual and the subsequent association testing at each available genetic marker. In contrast, any CNV association testing has to additionally address the question of how and with which precision the quantitative continuous measurements produced by genotyping platforms can be converted into precise DNA copy numbers. Bypassing the genotype calling step and instead directly testing the CNV intensity measurements does not sufficiently solve this problem. Instead, a new problem arises, since not even the existence of CNVs is ensured for probe sets that are found to be statistically significantly associated with phenotypic characteristics. Consequently, the introduced strategy restricts association testing to those probe sets with a certain minimal copy number variability. The consideration of this aspect was first proposed by Ionita-Laza et al. (2008). As a first application of the suggested method, a genome-wide CNV association analysis for the binary trait obesity was performed, which was published in Jarick et al. (2011).
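The pre-selection step can be illustrated with a minimal sketch. It assumes, purely for illustration, that copy number variability is measured by the sample variance of the intensity values of a probe set across individuals; the actual variability criterion, the function name, and the threshold are hypothetical and not taken from the text.

```python
import numpy as np

def select_variable_probes(intensities, min_variance):
    """Return column indices of probe sets whose variance reaches a threshold.

    intensities  : (individuals, probe sets) matrix of intensity measurements
    min_variance : hypothetical cut-off for "minimal copy number variability"
    """
    intensities = np.asarray(intensities, dtype=float)
    variances = intensities.var(axis=0, ddof=1)   # sample variance per probe set
    return np.flatnonzero(variances >= min_variance)

# Toy data: 3 individuals, 2 probe sets. The first shows no variation at all,
# the second varies between individuals.
toy = np.array([[0.0, 0.0],
                [0.0, 1.0],
                [0.0, 2.0]])
kept = select_variable_probes(toy, min_variance=0.5)
```

Only the probe sets returned by the filter would then enter association testing, which reduces the number of tests and avoids reporting associations at probe sets that show no evidence of copy number variation.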
of the German genome scan. We now recruited another set of families over a period of 18 months. Trained staff from 3 university hospitals as well as 6 pediatric pulmonary practices carried out a phenotyping procedure identical to that described previously. This procedure comprised detailed interviews of every family member, skin prick tests (SPT) with frequent allergens, blood samples (for IgE and allergen-specific IgE (RAST) measurements and eosinophil count), peak flow tests over a period of 10 days and dust collection at patients' homes. SPT and RAST assays included several pollens, animal furs, mould, and house dust mite allergens (ALK-SCHERAX, Hamburg, Germany). The ethics commission of "Nordrhein-Westfalen" approved all study methods, and informed consent was obtained from all parents and children.
To determine the regions of capping enzyme recruitment to DNA at high resolution, we collected genome-wide occupancy profiles for capping enzymes by ChIP-chip in exponentially growing S. cerevisiae culture. ChIP-chip and data analysis were performed as described in Sections 5.6 and 5.7, respectively. Averaging of ChIP profiles after alignment of genes at their TSS revealed a sharp peak for Ceg1 (Figure 17A and B, upper panel) ∼20 bp downstream of the TSS. The peak location and shape were virtually identical to that for Cet1 (Figure 17A and B, upper panel, and Section 6.2). Both peaks were independent of gene length, gene type or expression level. The highly similar profiles for Cet1 and Ceg1 (R = 0.93) are consistent with the formation of a stable Cet1-Ceg1 heterodimer, and confirm the high resolution of our occupancy profiling. Cet1-Ceg1 occupancy increases and decreases sharply, with a peak width similar to that obtained for the initiation factor TFIIB (Section 6.2) that binds a defined promoter region (Figure 17A and B, lower panel). Thus the recruitment and activity of the Cet1-Ceg1 heterodimer is apparently restricted to a narrow window near the TSS. Although ChIP data reflect only physical presence of proteins, and although cross-linking can be indirect via other proteins, these results are consistent with the model that the first two capping reactions occur on the polymerase surface when the nascent RNA 5'-end emerges from the Pol II RNA exit channel.
physical examination, semen analysis and hormone profile with serum FSH, LH and total testosterone are undertaken to define the cause of azoospermia (Figure 2). Together, these factors provide a >90% prediction of the type of azoospermia (obstructive azoospermia vs. non-obstructive azoospermia). Males diagnosed with OA may conceive children in one of two ways: 1) surgical correction of the obstruction, which allows the couple to conceive naturally and obviates the need for ART, or 2) retrieval of spermatozoa directly from the testis or epididymis, using sperm retrieval techniques such as testicular sperm extraction (TESE), testicular sperm aspiration (TESA) and testicular epididymis fine needle aspiration (TEFNA), followed by IVF or ICSI. The use of these techniques in clinical practice revolutionized the treatment of patients with severe male factor infertility (Palermo et al., 1992).
approach (Ito et al., 2000; Uetz et al., 2000). Ito and colleagues constructed a DNA-binding domain hybrid and an activation domain hybrid for each of the ≈6000 predicted yeast proteins. This approach resulted in 4,549 two-hybrid positives. Uetz and colleagues used another strategy: individual DNA-binding domain fusion proteins were tested against an array of ≈6000 separate activation domain transformants, and individual DNA-binding domain transformants were tested against a library of all activation domain hybrids. This study resulted in the identification of 957 putative interactions. There is only a small overlap in the results among all three approaches, and neither the first nor the second study recapitulates more than ≈13% of the published interactions detected so far by conventional single-protein analysis (Hazbun and Fields, 2001). Not only does this rather small fraction of overlapping interactions hint at a high number of false negatives, it also suggests that genome ‘interactomes’ are larger than estimated by earlier studies.