Intersection of population variation and autoimmunity genetics in human T cell activation

2014, Genomics, Human Genetics, Immunology
Ye CJ, Feng T, Kwon HK, Raj T, Wilson M, Asinovski N, McCabe C, Lee MH, Frohlich I, Paik H, Zaitlen N, Hacohen N, Stranger BE, De Jager P, Mathis D, Regev A*, Benoist C*
Science 12 September 2014: 345 (6202)


T lymphocyte activation by antigen conditions adaptive immune responses and immunopathologies, but we know little about its variation in humans and its genetic or environmental roots. We analyzed gene expression in CD4+ T cells during unbiased activation or in T helper 17 (TH17) conditions from 348 healthy subjects representing European, Asian, and African ancestries. We observed interindividual variability, most marked for cytokine transcripts, with clear biases on the basis of ancestry, and following patterns more complex than simple TH1/2/17 partitions. We identified 39 genetic loci specifically associated in cis with activated gene expression. We further fine-mapped and validated a single-base variant that modulates YY1 binding and the activity of an enhancer element controlling the autoimmune-associated IL2RA gene, affecting its activity in activated but not regulatory T cells. Thus, interindividual variability affects the fundamental immunologic process of T helper activation, with important connections to autoimmune disease.

Fig. 5. Trans-ancestry fine mapping identifies activation specific cis-eQTLs in GWAS regions. (A) In 48-hour a3+28/TH17, rs2863204 (MAF = 0.53, European) is best associated with IL23R expression and is independent from rs11209026 (R381Q). (B) At IL23R, a coding variant (rs11209026, R381Q, MAF = 0.06, European) and a regulatory variant (rs12095335) are associated with Crohn’s disease. (C) Conditioning on rs2863204 recovers rs12095335 (secondary Crohn’s disease association) as a significant secondary association. (D) At the IL2RA locus, the best T1D association is a regulatory variant, rs12722495. (E) In 48-hour a3+28, rs12251836 (MAF = 0.41, European) is best associated with IL2RA expression and is independent from rs12722495. (F) Conditioning on rs12251836 recovers rs12722495 (primary T1D association). X axis defines genomic intervals local to each gene, and y axis shows the –log10 P value of association in GWAS (Crohn’s disease, T1D) and eQTL analysis (48-hour a3+28 or 48-hour a3+28/TH17).

Polarization of the Effects of Autoimmune and Neurodegenerative Risk Alleles in Leukocytes

2014, Genomics, Human Genetics, Immunology
Raj T, Rothamel K, Mostafavi S, Ye C, Lee MN, Replogle JM, Feng T, Lee M, Asinovski N, Frohlich I, Imboywa S, Von Korff A, Okada Y, Patsopoulos NA, Davis S, McCabe C, Paik H, Srivastava GP, Raychaudhuri S, Hafler DA, Koller D, Regev A, Hacohen N, Mathis D, Benoist C*, Stranger BE*, De Jager PL*
Science 2 May 2014: 344 (6183), 519-523.


To extend our understanding of the genetic basis of human immune function and dysfunction, we performed an expression quantitative trait locus (eQTL) study of purified CD4+ T cells and monocytes, representing adaptive and innate immunity, in a multi-ethnic cohort of 461 healthy individuals. Context-specific cis- and trans-eQTLs were identified, and cross-population mapping allowed, in some cases, putative functional assignment of candidate causal regulatory variants for disease-associated loci. We note an over-representation of T cell–specific eQTLs among susceptibility alleles for autoimmune diseases and of monocyte-specific eQTLs among Alzheimer’s and Parkinson’s disease variants. This polarization implicates specific immune cell types in these diseases and points to the need to identify the cell-autonomous effects of disease susceptibility variants.


Fig. 3a. Polarization of cis-regulatory effects for disease-associated variants. (Left) For each evaluated trait, we report, in parentheses, the number of trait-associated (GWAS) SNPs with cis-eQTL effects over the total number of SNPs, and the number of genes influenced by one or more of these SNPs. For each trait, we present the distribution of the proportion of cell specificity (estimated using a Bayesian hierarchical model) observed for each of 1000 random samplings of matched SNP sets. The proportion of cell-specific cis-eQTL effects observed for a given trait is shown over this distribution using an orange line for monocytes and a green line for T cells. (Right) We report the proportion of cis-eQTLs that are monocyte-specific (orange), shared (blue), and T cell–specific (green).

Effectively identifying regulatory hotspots while capturing expression heterogeneity in gene expression studies.

2014, Genomics, Human Genetics
Joo JWJ, Sul JH, Han B, Ye C, Eskin E
Genome Biology 7 April 2014: 15:r61


Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods.


Common Genetic Variants Modulate Pathogen-Sensing Responses in Human Dendritic Cells

2014, Genomics, Human Genetics, Immunology
Lee MN*, Ye C*, Villani AC, Raj T, Li W, Eisenhaure TM, Imboywa SH, Chipendo PI, Ran FA, Slowikowski K, Ward LD, Raddassi K, McCabe C, Lee MH, Frohlich IY, Hafler DA, Kellis M, Raychaudhuri S, Zhang F, Stranger BE, Benoist CO, De Jager PL, Regev A†, Hacohen N†
Science 7 March 2014: 343 (6175), 1246980


Little is known about how human genetic variation affects the responses to environmental stimuli in the context of complex diseases. Experimental and computational approaches were applied to determine the effects of genetic variation on the induction of pathogen-responsive genes in human dendritic cells. We identified 121 common genetic variants associated in cis with variation in expression responses to Escherichia coli lipopolysaccharide, influenza, or interferon-β (IFN-β). We localized and validated causal variants to binding sites of pathogen-activated STAT (signal transducer and activator of transcription) and IRF (IFN-regulatory factor) transcription factors. We also identified a common variant in IRF7 that is associated in trans with type I IFN induction in response to influenza infection. Our results reveal common alleles that explain interindividual variation in pathogen sensing and provide functional annotation for genetic variants that alter susceptibility to inflammatory diseases.

Fig. 3A-C. Association analysis reveals cis-eQTLs and cis-reQTLs. (A and B) Manhattan plot of cis-eQTLs (A) (baseline expression) and cis-reQTLs (B) (LPS-, FLU-, and IFN-β–stimulated fold changes relative to baseline) showing −log10(P values) (left y axis) and R² values (right y axis) for all cis-SNPs, displayed on the x axis with associated genes ordered by chromosomal location. (C) Box-whisker plots showing expression [left; log2(nCounts), y axis] or fold change [right; log2(fold), y axis] of DCBLD1, IFNA21, TEC, and ARL5B in resting, LPS-stimulated, FLU-infected, and IFN-β–stimulated MoDCs as a function of genotype of the respective cis-SNPs (x axis: rs27434, rs10964871, rs10938526, and rs11015435). African Americans, Asians, and Europeans in this order are displayed as separate box-whisker plots adjacent to each other in each condition. −log10(P values) and β statistics are displayed in top right corners.


Effectively identifying eQTLs from multiple tissues combining mixed model and meta-analytic approaches.

Sul JH, Han B, Ye C, Choi T, Eskin E
PLoS Genetics 13 June 2013: 9(6): e1003491


Gene expression data, in conjunction with information on genetic variants, have enabled studies to identify expression quantitative trait loci (eQTLs) or polymorphic locations in the genome that are associated with expression levels. Moreover, recent technological developments and cost decreases have further enabled studies to collect expression data in multiple tissues. One advantage of multiple tissue datasets is that studies can combine results from different tissues to identify eQTLs more accurately than examining each tissue separately. The idea of aggregating results of multiple tissues is closely related to the idea of meta-analysis which aggregates results of multiple genome-wide association studies to improve the power to detect associations. In principle, meta-analysis methods can be used to combine results from multiple tissues. However, eQTLs may have effects in only a single tissue, in all tissues, or in a subset of tissues with possibly different effect sizes. This heterogeneity in terms of effects across multiple tissues presents a key challenge to detect eQTLs. In this paper, we develop a framework that leverages two popular meta-analysis methods that address effect size heterogeneity to detect eQTLs across multiple tissues. We show by using simulations and multiple tissue data from mouse that our approach detects many eQTLs undetected by traditional eQTL methods. Additionally, our method provides an interpretation framework that accurately predicts whether an eQTL has an effect in a particular tissue.

Mutations causing medullary cystic kidney disease type 1 lie in a large VNTR in MUC1 missed by massively parallel sequencing

2013, Human Disease, Human Genetics
Kirby A, Gnirke A, Jaffe DB, Barešová V, Pochet N, Blumenstiel B, Ye C, Aird D, Stevens C, Robinson JT, Cabili MN, Gat-Viks I, Kelliher E, Daza R, DeFelice M, Hůlková H, Sovová J, Vylet'al P, Antignac C, Guttman M, Handsaker RE, Perrin D, Steelman S, Sigurdsson S, Scheinman SJ, Sougnez C, Cibulskis K, Parkin M, Green T, Rossin E, Zody MC, Xavier RJ, Pollak MR, Alper SL, Lindblad-Toh K, Gabriel S, Hart PS, Regev A, Nusbaum C, Kmoch S, Bleyer AJ, Lander ES, Daly MJ
Nature Genetics 10 February 2013: 45, 299–303.


Although genetic lesions responsible for some mendelian disorders can be rapidly discovered through massively parallel sequencing of whole genomes or exomes, not all diseases readily yield to such efforts. We describe the illustrative case of the simple mendelian disorder medullary cystic kidney disease type 1 (MCKD1), mapped more than a decade ago to a 2-Mb region on chromosome 1. Ultimately, only by cloning, capillary sequencing and de novo assembly did we find that each of six families with MCKD1 harbors an equivalent but apparently independently arising mutation in sequence markedly under-represented in massively parallel sequencing data: the insertion of a single cytosine in one copy (but a different copy in each family) of the repeat unit comprising the extremely long (~1.5–5 kb), GC-rich (>80%) coding variable-number tandem repeat (VNTR) sequence in the MUC1 gene encoding mucin 1. These results provide a cautionary tale about the challenges in identifying the genes responsible for mendelian, let alone more complex, disorders through massively parallel sequencing.

Figure 2 – Discovery of a cytosine insertion in a coding VNTR of MUC1. (a) The major domains of the full-length MUC1 precursor protein are shown (TM, transmembrane domain). Based on fully and unambiguously assembled VNTR alleles, the frameshift caused by insertion of a cytosine in the coding strand is expected to introduce a premature stop codon shortly beyond the VNTR domain. (b) Where possible, we used knowledge of segregating phased SNP marker haplotypes to select for de novo VNTR sequencing and assembly those individuals sharing only a single haplotype across the region, as this aided identification of the VNTR allele segregating with the shared risk haplotype. (c) Independent de novo assembly of the shared VNTR allele in 2 individuals from family 4 shows exactly identical complete sequence, with the seventh 60-base unit (white X) out of 44 containing a cytosine insertion. The assembly is oriented relative to the coding strand of MUC1 and covers bases 155,160,963–155,162,030 on chromosome 1 (hg19). Each unique 60-base repeat segment is represented by a different letter or number (supplementary Fig. 2). (d) Translational impact of the cytosine insertion frameshift.

Integrated computational and experimental analysis of the neuroendocrine transcriptome in genetic hypertension identifies novel control points for the cardio-metabolic syndrome.

2012, Genomics, Human Genetics
Friese RS, Ye C, Nievergelt CM, Schork AJ, Mahapatra NR, Rao F, Napolitan PS, Waalen J, Ehret GB, Munroe PB, Schmid-Shonbein GW, Eskin E, O'Connor DT
Circ Cardiovasc Genet 1 Aug 2012: 5(4): 430-40

Background: Essential hypertension, a common complex disease, displays substantial genetic influence. Contemporary methods to dissect the genetic basis of complex diseases such as the genomewide association study are powerful, yet a large gap exists between the fraction of population trait variance explained by such associations and total disease heritability.

Methods and Results: We developed a novel, integrative method (combining animal models, transcriptomics, bioinformatics, molecular biology, and trait-extreme phenotypes) to identify candidate genes for essential hypertension and the metabolic syndrome. We first undertook transcriptome profiling on adrenal glands from blood pressure extreme mouse strains: the hypertensive BPH (blood pressure high) and hypotensive BPL (blood pressure low). Microarray data clustering revealed a striking pattern of global underexpression of intermediary metabolism transcripts in BPH. The MITRA algorithm identified a conserved motif in the transcriptional regulatory regions of the underexpressed metabolic genes, and we then hypothesized that regulation through this motif contributed to the global underexpression. Luciferase reporter assays demonstrated transcriptional activity of the motif through transcription factors HOXA3, SRY, and YY1. We finally hypothesized that genetic variation at HOXA3, SRY, and YY1 might predict blood pressure and other metabolic syndrome traits in humans. Tagging variants for each locus were associated with blood pressure in a human population blood pressure extreme sample with the most extensive associations for YY1 tagging single nucleotide polymorphism rs11625658 on systolic blood pressure, diastolic blood pressure, body mass index, and fasting glucose. Meta-analysis extended the YY1 results into 2 additional large population samples with significant effects preserved on diastolic blood pressure, body mass index, and fasting glucose.

Conclusions: The results outline an innovative, systematic approach to the genetic pathogenesis of complex cardiovascular disease traits and point to transcription factor YY1 as a potential candidate gene involved in essential hypertension and the cardiometabolic syndrome.

Mixed-model coexpression: calculating gene coexpression while accounting for expression heterogeneity.

Furlotte NA, Kang HM, Ye C, Eskin E
Bioinformatics July 2011: 27(13): i288-94


The analysis of gene coexpression is at the core of many types of genetic analysis. The coexpression between two genes can be calculated by using a traditional Pearson’s correlation coefficient. However, unobserved confounding effects may cause inflation of the Pearson’s correlation so that uncorrelated genes appear correlated. Many general methods have been suggested, which aim to remove the effects of confounding from gene expression data. However, the residual confounding which is not accounted for by these generic correction procedures has the potential to induce correlation between genes. Therefore, a method that specifically aims to calculate gene coexpression between gene expression arrays, while accounting for confounding effects, is desirable.
In this article, we present a statistical model for calculating gene coexpression called mixed model coexpression (MMC), which models coexpression within a mixed model framework. Confounding effects are expected to be encoded in the matrix representing the correlation between arrays, the inter-sample correlation matrix. By conditioning on the information in the inter-sample correlation matrix, MMC is able to produce gene coexpressions that are not influenced by global confounding effects and thus significantly reduce the number of spurious coexpressions observed. We applied MMC to both human and yeast datasets and show it is better able to effectively prioritize strong coexpressions when compared to a traditional Pearson’s correlation and a Pearson’s correlation applied to data corrected with surrogate variable analysis (SVA).

Detecting the presence and absence of causal relationships between expression of yeast genes with very few samples.

Kang EY, Ye C, Shpitser I, Eskin E
Journal of Computational Biology March 2010: 17(3): 533-546


Inference of biological networks from high-throughput data is a central problem in bioinformatics. Particularly powerful for network reconstruction is data collected by recent studies that contain both genetic variation information and gene expression profiles from genetically distinct strains of an organism. Various statistical approaches have been applied to these data to tease out the underlying biological networks that govern how individual genetic variation mediates gene expression and how genes regulate and interact with each other. Extracting meaningful causal relationships from these networks remains a challenging but important problem. In this article, we use causal inference techniques to infer the presence or absence of causal relationships between yeast gene expressions in the framework of graphical causal models. We evaluate our method using a well studied dataset consisting of both genetic variations and gene expressions collected over randomly segregated yeast strains. Our predictions of causal regulators, genes that control the expression of a large number of target genes, are consistent with previously known experimental evidence. In addition, our method can detect the absence of causal relationships and can distinguish between direct and indirect effects of variation on a gene expression level.

Using network component analysis to dissect regulatory networks mediated by transcription factors in yeast

Ye C, Galbraith SJ, Liao JC, Eskin E
PLoS Computational Biology 20 March 2009: 5(3): e1000311


Understanding the relationship between genetic variation and gene expression is a central question in genetics. With the availability of data from high-throughput technologies such as ChIP-Chip, expression, and genotyping arrays, we can begin to not only identify associations but to understand how genetic variations perturb the underlying transcription regulatory networks to induce differential gene expression. In this study, we describe a simple model of transcription regulation where the expression of a gene is completely characterized by two properties: the concentrations and promoter affinities of active transcription factors. We devise a method that extends Network Component Analysis (NCA) to determine how genetic variations in the form of single nucleotide polymorphisms (SNPs) perturb these two properties. Applying our method to a segregating population of Saccharomyces cerevisiae, we found statistically significant examples of trans-acting SNPs located in regulatory hotspots that perturb transcription factor concentrations and affinities for target promoters to cause global differential expression and cis-acting genetic variations that perturb the promoter affinities of transcription factors on a single gene to cause local differential expression. Although many genetic variations linked to gene expressions have been identified, it is not clear how they perturb the underlying regulatory networks that govern gene expression. Our work begins to fill this void by showing that many genetic variations affect the concentrations of active transcription factors in a cell and their affinities for target promoters. Understanding the effects of these perturbations can help us to paint a more complete picture of the complex landscape of transcription regulation. The software package implementing the algorithms discussed in this work is available as a MATLAB package upon request.

Accurate Discovery of Expression Quantitative Trait Loci Under Confounding From Spurious and Genuine Regulatory Hotspots

2008, Human Genetics
Kang HM*, Ye C*, Eskin E
Genetics 9 September 2008: 180: 1909–1925


In genomewide mapping of expression quantitative trait loci (eQTL), it is widely believed that thousands of genes are trans-regulated by a small number of genomic regions called “regulatory hotspots,” resulting in “trans-regulatory bands” in an eQTL map. As several recent studies have demonstrated, technical confounding factors such as batch effects can complicate eQTL analysis by causing many spurious associations including spurious regulatory hotspots. Yet little is understood about how these technical confounding factors affect eQTL analyses and how to correct for these factors. Our analysis of data sets with biological replicates suggests that it is this intersample correlation structure inherent in expression data that leads to spurious associations between genetic loci and a large number of transcripts inducing spurious regulatory hotspots. We propose a statistical method that corrects for the spurious associations caused by complex intersample correlation of expression measurements in eQTL mapping. Applying our intersample correlation emended (ICE) eQTL mapping method to mouse, yeast, and human identifies many more cis associations while eliminating most of the spurious trans associations. The concordances of cis and trans associations have consistently increased between different replicates, tissues, and populations, demonstrating the higher accuracy of our method to identify real genetic effects.

Discovering tightly regulated and differentially expressed gene sets in whole genome expression data

Ye C, Eskin E
Bioinformatics January 2007: 23(2): e84-90


Motivation: Recently, a new type of expression data is being collected which aims to measure the effect of genetic variation on gene expression in pathways. In these datasets, expression profiles are constructed for multiple strains of the same model organism under the same condition. The goal of analyses of these data is to find differences in regulatory patterns due to genetic variation between strains, often without a phenotype of interest in mind. We present a new method based on notions of tight regulation and differential expression to look for sets of genes which appear to be significantly affected by genetic variation.

Results: When we use categorical phenotype information, as in the Alzheimer’s and diabetes datasets, our method finds many of the same gene sets as gene set enrichment analysis. In addition, our notion of correlated gene sets allows us to focus our efforts on biological processes subjected to tight regulation. In murine hematopoietic stem cells, we are able to discover significant gene sets independent of a phenotype of interest. Some of these gene sets are associated with several blood-related phenotypes.

Orthologous repeats and mammalian phylogenetic inference

Bashir A, Ye C, Price AL, Bafna V
Genome Research 3 May 2005: 15: 998-1006


Determining phylogenetic relationships between species is a difficult problem, and many phylogenetic relationships remain unresolved, even among eutherian mammals. Repetitive elements provide excellent markers for phylogenetic analysis, because their mode of evolution is predominantly homoplasy-free and unidirectional. Historically, phylogenetic studies using repetitive elements have relied on biological methods such as PCR analysis, and computational inference is limited to a few isolated repeats. Here, we present a novel computational method for inferring phylogenetic relationships from partial sequence data using orthologous repeats. We apply our method to reconstructing the phylogeny of 28 mammals, using more than 1000 orthologous repeats obtained from sequence data available from the NISC Comparative Sequencing Program. The resulting phylogeny has robust bootstrap numbers, and broadly matches results from previous studies which were obtained using entirely different data and methods. In addition, we shed light on some of the debatable aspects of the phylogeny. With rapid expansion of available partial sequence data, computational analysis of repetitive elements holds great promise for the future of phylogenetic inference.

Assessing computational tools for the discovery of transcription factor binding sites

Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, Van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z
Nature Biotechnology 6 January 2005: 23, 137 - 144


The prediction of regulatory elements is a problem where computational methods offer great hope. Over the past few years, numerous tools have become available for this task. The purpose of the current assessment is twofold: to provide some guidance to users regarding the accuracy of currently available tools in various settings, and to provide a benchmark of data sets for assessing future tools.