T lymphocyte activation by antigen conditions adaptive immune responses and immunopathologies, but we know little about its variation in humans and its genetic or environmental roots. We analyzed gene expression in CD4+ T cells during unbiased activation or in T helper 17 (TH17) conditions from 348 healthy subjects representing European, Asian, and African ancestries. We observed interindividual variability, most marked for cytokine transcripts, with clear biases on the basis of ancestry, and following patterns more complex than simple TH1/2/17 partitions. We identified 39 genetic loci specifically associated in cis with activated gene expression. We further fine-mapped and validated a single-base variant that modulates YY1 binding and the activity of an enhancer element controlling the autoimmune-associated IL2RA gene, affecting its activity in activated but not regulatory T cells. Thus, interindividual variability affects the fundamental immunologic process of T helper activation, with important connections to autoimmune disease.
To extend our understanding of the genetic basis of human immune function and dysfunction, we performed an expression quantitative trait locus (eQTL) study of purified CD4+ T cells and monocytes, representing adaptive and innate immunity, in a multi-ethnic cohort of 461 healthy individuals. Context-specific cis- and trans-eQTLs were identified, and cross-population mapping allowed, in some cases, putative functional assignment of candidate causal regulatory variants for disease-associated loci. We note an over-representation of T cell–specific eQTLs among susceptibility alleles for autoimmune diseases and of monocyte-specific eQTLs among Alzheimer’s and Parkinson’s disease variants. This polarization implicates specific immune cell types in these diseases and points to the need to identify the cell-autonomous effects of disease susceptibility variants.
Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods.
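As a point of reference for the eQTL mapping described above, the basic single-variant association test can be sketched as a least-squares regression of a gene's expression on genotype dosage. This is a minimal illustration only, not the hotspot-correction method the abstract introduces; the function name `eqtl_association` and the toy data are assumptions made for the example.

```python
import math

def eqtl_association(dosages, expression):
    """Regress expression on genotype dosage (0, 1, or 2 copies of an allele).

    Returns the slope, the Pearson correlation r, and the t statistic
    (n - 2 degrees of freedom). Hypothetical helper for illustration;
    it assumes the data are not perfectly collinear (|r| < 1).
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    t_stat = r * math.sqrt((n - 2) / (1.0 - r * r))
    return slope, r, t_stat

# Toy data: six individuals, expression rising with allele count.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, r, t_stat = eqtl_association(dosages, expression)
```

In a real study this test is repeated over many variant-gene pairs, which is where confounding and multiple testing become central concerns.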
Little is known about how human genetic variation affects the responses to environmental stimuli in the context of complex diseases. Experimental and computational approaches were applied to determine the effects of genetic variation on the induction of pathogen-responsive genes in human dendritic cells. We identified 121 common genetic variants associated in cis with variation in expression responses to Escherichia coli lipopolysaccharide, influenza, or interferon-β (IFN-β). We localized and validated causal variants to binding sites of pathogen-activated STAT (signal transducer and activator of transcription) and IRF (IFN-regulatory factor) transcription factors. We also identified a common variant in IRF7 that is associated in trans with type I IFN induction in response to influenza infection. Our results reveal common alleles that explain interindividual variation in pathogen sensing and provide functional annotation for genetic variants that alter susceptibility to inflammatory diseases.
Gene expression data, in conjunction with information on genetic variants, have enabled studies to identify expression quantitative trait loci (eQTLs) or polymorphic locations in the genome that are associated with expression levels. Moreover, recent technological developments and cost decreases have further enabled studies to collect expression data in multiple tissues. One advantage of multiple tissue datasets is that studies can combine results from different tissues to identify eQTLs more accurately than examining each tissue separately. The idea of aggregating results of multiple tissues is closely related to the idea of meta-analysis which aggregates results of multiple genome-wide association studies to improve the power to detect associations. In principle, meta-analysis methods can be used to combine results from multiple tissues. However, eQTLs may have effects in only a single tissue, in all tissues, or in a subset of tissues with possibly different effect sizes. This heterogeneity in terms of effects across multiple tissues presents a key challenge to detect eQTLs. In this paper, we develop a framework that leverages two popular meta-analysis methods that address effect size heterogeneity to detect eQTLs across multiple tissues. We show by using simulations and multiple tissue data from mouse that our approach detects many eQTLs undetected by traditional eQTL methods. Additionally, our method provides an interpretation framework that accurately predicts whether an eQTL has an effect in a particular tissue.
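To make the idea of combining per-tissue results concrete, the following sketch applies a standard inverse-variance fixed-effects meta-analysis together with Cochran's Q heterogeneity statistic. The abstract does not name the two meta-analysis methods it builds on, so this is a generic textbook illustration rather than the authors' framework; the function name and the example effect sizes are assumptions.

```python
import math

def fixed_effects_meta(betas, std_errs):
    """Inverse-variance fixed-effects meta-analysis across tissues.

    betas, std_errs: per-tissue effect estimates and their standard errors.
    Returns the pooled effect, its standard error, the z score, and
    Cochran's Q (large Q suggests effect sizes differ between tissues).
    """
    weights = [1.0 / se ** 2 for se in std_errs]
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    z = beta / se
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return beta, se, z, q

# Same effect in all three tissues: pooling sharpens the estimate, Q is ~0.
beta_hom, se_hom, z_hom, q_hom = fixed_effects_meta([0.5, 0.5, 0.5], [0.1, 0.1, 0.1])

# Effect in only one tissue: the pooled estimate is diluted and Q is large.
beta_het, se_het, z_het, q_het = fixed_effects_meta([1.0, 0.0, 0.0], [0.1, 0.1, 0.1])
```

The second case shows why heterogeneity matters: a tissue-specific eQTL is averaged toward zero by a fixed-effects model, which is the kind of situation heterogeneity-aware methods are designed to handle.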
Although genetic lesions responsible for some mendelian disorders can be rapidly discovered through massively parallel sequencing of whole genomes or exomes, not all diseases readily yield to such efforts. We describe the illustrative case of the simple mendelian disorder medullary cystic kidney disease type 1 (MCKD1), mapped more than a decade ago to a 2-Mb region on chromosome 1. Ultimately, only by cloning, capillary sequencing and de novo assembly did we find that each of six families with MCKD1 harbors an equivalent but apparently independently arising mutation in sequence markedly under-represented in massively parallel sequencing data: the insertion of a single cytosine in one copy (but a different copy in each family) of the repeat unit comprising the extremely long (~1.5–5 kb), GC-rich (>80%) coding variable-number tandem repeat (VNTR) sequence in the MUC1 gene encoding mucin 1. These results provide a cautionary tale about the challenges in identifying the genes responsible for mendelian, let alone more complex, disorders through massively parallel sequencing.
Background—Essential hypertension, a common complex disease, displays substantial genetic influence. Contemporary methods to dissect the genetic basis of complex diseases such as the genomewide association study are powerful, yet a large gap exists between the fraction of population trait variance explained by such associations and total disease heritability.
Methods and Results—We developed a novel, integrative method (combining animal models, transcriptomics, bioinformatics, molecular biology, and trait-extreme phenotypes) to identify candidate genes for essential hypertension and the metabolic syndrome. We first undertook transcriptome profiling on adrenal glands from blood pressure extreme mouse strains: the hypertensive BPH (blood pressure high) and hypotensive BPL (blood pressure low). Microarray data clustering revealed a striking pattern of global underexpression of intermediary metabolism transcripts in BPH. The MITRA algorithm identified a conserved motif in the transcriptional regulatory regions of the underexpressed metabolic genes, and we then hypothesized that regulation through this motif contributed to the global underexpression. Luciferase reporter assays demonstrated transcriptional activity of the motif through transcription factors HOXA3, SRY, and YY1. We finally hypothesized that genetic variation at HOXA3, SRY, and YY1 might predict blood pressure and other metabolic syndrome traits in humans. Tagging variants for each locus were associated with blood pressure in a human population blood pressure extreme sample with the most extensive associations for YY1 tagging single nucleotide polymorphism rs11625658 on systolic blood pressure, diastolic blood pressure, body mass index, and fasting glucose. Meta-analysis extended the YY1 results into 2 additional large population samples with significant effects preserved on diastolic blood pressure, body mass index, and fasting glucose.
Conclusions—The results outline an innovative, systematic approach to the genetic pathogenesis of complex cardiovascular disease traits and point to transcription factor YY1 as a potential candidate gene involved in essential hypertension and the cardiometabolic syndrome.
The analysis of gene coexpression is at the core of many types of genetic analysis. The coexpression between two genes can be calculated by using a traditional Pearson’s correlation coefficient. However, unobserved confounding effects may cause inflation of the Pearson’s correlation so that uncorrelated genes appear correlated. Many general methods have been suggested, which aim to remove the effects of confounding from gene expression data. However, the residual confounding which is not accounted for by these generic correction procedures has the potential to induce correlation between genes. Therefore, a method that specifically aims to calculate gene coexpression between gene expression arrays, while accounting for confounding effects, is desirable.
In this article, we present a statistical model for calculating gene coexpression called mixed model coexpression (MMC), which models coexpression within a mixed model framework. Confounding effects are expected to be encoded in the matrix representing the correlation between arrays, the inter-sample correlation matrix. By conditioning on the information in the inter-sample correlation matrix, MMC is able to produce gene coexpressions that are not influenced by global confounding effects and thus significantly reduce the number of spurious coexpressions observed. We applied MMC to both human and yeast datasets and show that it prioritizes strong coexpressions more effectively than either a traditional Pearson’s correlation or a Pearson’s correlation applied to data corrected with surrogate variable analysis (SVA).
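The inflation of Pearson's correlation by a shared confounder, described above, can be illustrated with a small deterministic example. Two genes whose within-batch variation is uncorrelated appear strongly coexpressed once a common batch offset is added; centering each batch removes the artifact. This sketch assumes the batch labels are known, which is a simplification: the abstract concerns unobserved confounding, and the code below is not an implementation of MMC. All names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def center_by_batch(values, batches):
    """Subtract each batch's mean from the samples in that batch."""
    sums, counts = {}, {}
    for v, b in zip(values, batches):
        sums[b] = sums.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    return [v - sums[b] / counts[b] for v, b in zip(values, batches)]

batches = [0, 0, 0, 0, 1, 1, 1, 1]
# Gene-specific variation, constructed to be uncorrelated between the genes.
noise_a = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
noise_b = [0.1, 0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1]
# A shared batch offset of 2.0 shifts both genes in the second batch.
gene_a = [n + 2.0 * b for n, b in zip(noise_a, batches)]
gene_b = [n + 2.0 * b for n, b in zip(noise_b, batches)]

raw_r = pearson(gene_a, gene_b)
adjusted_r = pearson(center_by_batch(gene_a, batches),
                     center_by_batch(gene_b, batches))
```

Here `raw_r` is close to 1 even though the genes share no biological signal, while `adjusted_r` is essentially zero after the batch means are removed.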
Inference of biological networks from high-throughput data is a central problem in bioinformatics. Particularly powerful for network reconstruction is data collected by recent studies that contain both genetic variation information and gene expression profiles from genetically distinct strains of an organism. Various statistical approaches have been applied to these data to tease out the underlying biological networks that govern how individual genetic variation mediates gene expression and how genes regulate and interact with each other. Extracting meaningful causal relationships from these networks remains a challenging but important problem. In this article, we use causal inference techniques to infer the presence or absence of causal relationships between yeast gene expressions in the framework of graphical causal models. We evaluate our method using a well studied dataset consisting of both genetic variations and gene expressions collected over randomly segregated yeast strains. Our predictions of causal regulators, genes that control the expression of a large number of target genes, are consistent with previously known experimental evidence. In addition, our method can detect the absence of causal relationships and can distinguish between direct and indirect effects of variation on a gene expression level.
Understanding the relationship between genetic variation and gene expression is a central question in genetics. With the availability of data from high-throughput technologies such as ChIP-Chip, expression, and genotyping arrays, we can begin to not only identify associations but to understand how genetic variations perturb the underlying transcription regulatory networks to induce differential gene expression. In this study, we describe a simple model of transcription regulation where the expression of a gene is completely characterized by two properties: the concentrations and promoter affinities of active transcription factors. We devise a method that extends Network Component Analysis (NCA) to determine how genetic variations in the form of single nucleotide polymorphisms (SNPs) perturb these two properties. Applying our method to a segregating population of Saccharomyces cerevisiae, we found statistically significant examples of trans-acting SNPs located in regulatory hotspots that perturb transcription factor concentrations and affinities for target promoters to cause global differential expression and cis-acting genetic variations that perturb the promoter affinities of transcription factors on a single gene to cause local differential expression. Although many genetic variations linked to gene expressions have been identified, it is not clear how they perturb the underlying regulatory networks that govern gene expression. Our work begins to fill this void by showing that many genetic variations affect the concentrations of active transcription factors in a cell and their affinities for target promoters. Understanding the effects of these perturbations can help us to paint a more complete picture of the complex landscape of transcription regulation. The software package implementing the algorithms discussed in this work is available as a MATLAB package upon request.
In genomewide mapping of expression quantitative trait loci (eQTL), it is widely believed that thousands of genes are trans-regulated by a small number of genomic regions called “regulatory hotspots,” resulting in “trans-regulatory bands” in an eQTL map. As several recent studies have demonstrated, technical confounding factors such as batch effects can complicate eQTL analysis by causing many spurious associations including spurious regulatory hotspots. Yet little is understood about how these technical confounding factors affect eQTL analyses and how to correct for these factors. Our analysis of data sets with biological replicates suggests that it is this intersample correlation structure inherent in expression data that leads to spurious associations between genetic loci and a large number of transcripts inducing spurious regulatory hotspots. We propose a statistical method that corrects for the spurious associations caused by complex intersample correlation of expression measurements in eQTL mapping. Applying our intersample correlation emended (ICE) eQTL mapping method to mouse, yeast, and human identifies many more cis associations while eliminating most of the spurious trans associations. The concordances of cis and trans associations have consistently increased between different replicates, tissues, and populations, demonstrating the higher accuracy of our method to identify real genetic effects.
Motivation: Recently, a new type of expression data is being collected which aims to measure the effect of genetic variation on gene expression in pathways. In these datasets, expression profiles are constructed for multiple strains of the same model organism under the same condition. The goal of analyses of these data is to find differences in regulatory patterns due to genetic variation between strains, often without a phenotype of interest in mind. We present a new method based on notions of tight regulation and differential expression to look for sets of genes which appear to be significantly affected by genetic variation.
Results: When we use categorical phenotype information, as in the Alzheimer’s and diabetes datasets, our method finds many of the same gene sets as gene set enrichment analysis. In addition, our notion of correlated gene sets allows us to focus our efforts on biological processes subjected to tight regulation. In murine hematopoietic stem cells, we are able to discover significant gene sets independent of a phenotype of interest. Some of these gene sets are associated with several blood-related phenotypes.
Determining phylogenetic relationships between species is a difficult problem, and many phylogenetic relationships remain unresolved, even among eutherian mammals. Repetitive elements provide excellent markers for phylogenetic analysis, because their mode of evolution is predominantly homoplasy-free and unidirectional. Historically, phylogenetic studies using repetitive elements have relied on biological methods such as PCR analysis, and computational inference is limited to a few isolated repeats. Here, we present a novel computational method for inferring phylogenetic relationships from partial sequence data using orthologous repeats. We apply our method to reconstructing the phylogeny of 28 mammals, using more than 1000 orthologous repeats obtained from sequence data available from the NISC Comparative Sequencing Program. The resulting phylogeny has robust bootstrap numbers, and broadly matches results from previous studies which were obtained using entirely different data and methods. In addition, we shed light on some of the debatable aspects of the phylogeny. With rapid expansion of available partial sequence data, computational analysis of repetitive elements holds great promise for the future of phylogenetic inference.
The prediction of regulatory elements is a problem where computational methods offer great hope. Over the past few years, numerous tools have become available for this task. The purpose of the current assessment is twofold: to provide some guidance to users regarding the accuracy of currently available tools in various settings, and to provide a benchmark of data sets for assessing future tools.