Our research focuses on computational approaches for identifying functionally relevant genetic changes in disease and adaptation, as well as on developing more sensitive diagnostic methods (especially for exome, genome and cell-free DNA sequencing). More broadly, our research spans the fields of sequence analysis, data mining, machine learning and functional genomics.

Areas of Research

We focus on genome-wide variant effects, functional read-outs of regulatory effects, and support for diagnostic assays.

Genome-wide predictors of variant effects

We develop and maintain a widely used variant effect scoring tool, Combined Annotation Dependent Depletion (CADD), which uses machine learning to integrate more than 80 different gene-based and genome-wide annotations. CADD was the first tool to predict variant deleteriousness for all possible single base-pair alterations genome-wide, while also supporting the scoring of multi-base and insertion/deletion changes. Our latest manuscript describes the updates since the initial publication of CADD (including the transition to GRCh38) and how we envision its use. It was published in Nucleic Acids Research in 2018.

While many variant annotation and scoring tools exist, most exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Thus, a broadly applicable metric that objectively weights and integrates diverse information is needed. CADD is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations. The resulting C-scores strongly correlate with allelic diversity, pathogenicity of both coding and non-coding variants, and experimentally measured regulatory effects, and they rank causal variants highly within individual genome sequences. Finally, C-scores of complex trait-associated variants from genome-wide association studies (GWAS) are significantly higher than those of matched controls and correlate with study sample size, likely reflecting the increased accuracy of larger GWAS. CADD can quantitatively prioritize functional, deleterious, and disease-causal variants across a wide range of functional categories, effect sizes and genetic architectures, and it can be used to prioritize causal variation in both research and clinical settings.
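The contrast between variants that survived selection and simulated mutations can be illustrated with a minimal sketch. This is a toy example and not the CADD implementation: the two annotations (a conservation value and a coding flag), the data points, the plain logistic regression and all function names are assumptions chosen for illustration. The model learns to separate simulated variants (label 1) from observed ones (label 0), so a higher raw score means a variant looks more like the simulated class.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression with batch gradient descent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            pred = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = pred - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def raw_score(x, weights, bias):
    """Linear score; higher means more similar to simulated variants."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

# Hypothetical annotations per variant: [conservation, is_coding].
observed = [[0.1, 0], [0.2, 0], [0.15, 1], [0.3, 0]]   # survived selection
simulated = [[0.8, 1], [0.9, 1], [0.2, 0], [0.7, 0]]   # simulated mutations

features = observed + simulated
labels = [0] * len(observed) + [1] * len(simulated)
weights, bias = train_logistic(features, labels)

conserved_coding = raw_score([0.9, 1], weights, bias)
unconserved_noncoding = raw_score([0.1, 0], weights, bias)
print(conserved_coding > unconserved_noncoding)
```

The design point the sketch is meant to convey is that no curated set of known pathogenic variants is needed for training; the labels come only from whether a variant was observed or simulated.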

We are now extending the principal idea behind CADD to comprehensively score structural variants (SVs) with CADD-SV. Recent technological advances in SV detection have made the interpretation of SVs a research focus for us.

Functional read-outs of genomic sequence and its organization

Regulatory sequences (promoters and enhancers) are involved in the control of gene expression. Mutations (stars) increase or decrease the binding of different proteins (transcription factors) to the DNA sequence, thereby changing the amount of RNA produced, which can in turn affect certain phenotypes.

We collaborate with the labs of Nadav Ahituv and Jay Shendure to obtain and analyze experimental measures of non-coding sequence activity, specifically from Massively Parallel Reporter Assays (MPRA). Although the majority of all mutations affect non-coding sequences, and despite growing evidence of their substantial phenotypic effects and clinical relevance, alterations in these sequences remain less well understood than those in coding regions. The lab uses experimental data, for example, to build computational models of regulatory sequence effects (i.e. in promoter and enhancer sequences). The goal is to contribute to a better understanding of regulatory sequence function and, later, to integrate regulatory sequence models into the next generation of genome-wide variant scores. Here you can have a look at the effects of variants in 20 regulatory elements published in our Saturation Mutagenesis Manuscript.
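As a rough illustration of how a variant effect might be read out from reporter assay counts, the sketch below compares the log2 RNA/DNA ratio of an alternative allele with that of the reference allele. This is a common simplification and not necessarily the analysis used in our manuscripts; the function names, the pseudocount and the counts are all assumptions made for the example.

```python
import math

def log2_activity(rna_count, dna_count, pseudocount=1.0):
    """Reporter activity as log2 of RNA over DNA counts (pseudocount avoids log of zero)."""
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))

def variant_effect(ref_rna, ref_dna, alt_rna, alt_dna):
    """Difference in activity between alternative and reference allele.

    Negative values indicate reduced expression from the alternative allele.
    """
    return log2_activity(alt_rna, alt_dna) - log2_activity(ref_rna, ref_dna)

# Hypothetical counts: the variant halves the RNA output at equal DNA input.
effect = variant_effect(ref_rna=99, ref_dna=99, alt_rna=49, alt_dna=99)
print(effect)
```

Normalizing RNA by DNA counts corrects for how often each construct is present in the library, so that the effect reflects expression rather than abundance.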

Supporting the development of molecular assays and diagnostics

In addition, we develop pipelines to support data interpretation and visualization for next-generation sequencing applications. This includes work on primary data processing (e.g. base calling, overlap read merging and adapter trimming), as well as support for the development of new library preparation protocols. In this context, we studied sample cross-contamination in Illumina multiplex experiments and proposed a protocol that introduces indexes into both library adapters (now known as double or dual indexing). In another study, we looked at cell-free DNA (cfDNA) and were able to show that nucleosome and transcription factor (TF) binding footprints in DNA fragmentation can be used to infer cell type contribution and provide information about pathological states. For assays based on targeted enrichment using molecular inversion probes (MIPs), we developed an automated analysis and result reporting pipeline for identifying disease-causal variants in hemophilia patients (HemoMIPs).
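Why indexing both adapters helps against cross-contamination can be shown with a small sketch. It is illustrative only and not taken from our pipelines: the index sequences, the sample sheet and the function names are hypothetical. A read is assigned to a sample only if both of its indexes match one expected pair, so a read that carries indexes from two different samples is rejected instead of being misassigned.

```python
# Hypothetical sample sheet: each sample has a unique (i7, i5) index pair.
SAMPLE_SHEET = {
    ("ACGTAC", "TTGCAA"): "sample_A",
    ("GTCAGT", "CCATGG"): "sample_B",
}

def assign_dual(i7, i5, sheet=SAMPLE_SHEET):
    """Return the sample name only if both indexes match one expected pair."""
    return sheet.get((i7, i5))

def assign_single(i7, sheet=SAMPLE_SHEET):
    """Single-index assignment for comparison: only the i7 index is checked."""
    for (expected_i7, _), name in sheet.items():
        if expected_i7 == i7:
            return name
    return None

# A chimeric read: i7 index from sample_A, i5 index from sample_B.
chimeric_i7, chimeric_i5 = "ACGTAC", "CCATGG"

print(assign_single(chimeric_i7))             # assigned to sample_A by mistake
print(assign_dual(chimeric_i7, chimeric_i5))  # rejected (None)
```

Real demultiplexing additionally has to handle sequencing errors in the index reads, which this sketch leaves out by requiring exact matches.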