Example of how to create a genome-wide genomic pairs plot
8 minute read
Introduction
In this post, I want to give an example of how pairs of genomic entities, such
as genotypes and associated genes (QTLs), can be plotted in a genome-wide scale using
tidyverse/ggplot2 and the GenomicRanges package in R.
We will use publicly available trans-eQTL data (see below) and will create a
single overview matrix showing the frequencies of QTLs in these data.
To this end, we will bin/tile all standard chromosomes of the human reference genome
(GRCh37/hg19) into bins of 10MB, overlap these bins with our QTL results and then
plot them using a simple tile plot available in ggplot2.
In addition, we’ll add margin plots to the main matrix-like plot, showing the
row and column summary counts over all associations.
Data
Expression quantitative trait loci (eQTL) are, in essence, statistical associations
between genetic variants/genotypes (e.g. single nucleotide polymorphisms, SNPs) and
the expression of genes measured ideally in a large number of samples to obtain
sufficient statistical power.
In other words, eQTL are a way of pinpointing DNA sequence variants which, most likely,
have a functional impact in the many complex mechanisms taking place in the studied organism.
Generally, eQTL can be classified in cis- and trans-eQTL, cis meaning that the SNP
and the respective gene reside on the same chromsome and trans meaning that the two
entities are located on different chromosomes.
For our example, we will focus on trans-eQTL only.
We downloaded all significant trans-eQTL data identified by the eQTLgen consortium and use those to generate a nice
overview on trans-eQTLs identified in whole blood data.
Hands-on
Data preparation
As a first step, we load the eQTLgen trans-eQTLs and create GenomicRanges objects, i.e.
objects to easily handle genomic position annotations. We also filter for ‘real’ trans eQTL (a matter of definition), in our case this means that SNP and Gene are on distinct chromosomes.
Ok, nice! We now got all our data loaded and have them neatly available as GenomicRanges objects.
We are looking at SNPs which are associated with a total of 5455 genes ( total associations).
Now, since we want to do a genome-wide plot, we need to get some information on the size of the chromosomes etc. in order to be able to indicate chromosome boundaries.
For that, we get the hg19 genome annotation and extract the sequence lengths for chromosomes 1-22.
We further tile our genome information into tiles (or bins) of size 10MB.
These tiles will be used later on to map the individual genomic positions from the SNPs and genes to the
respective position in the plot.
With the above steps done, we can now almost start creating the plot.
Last remaining things to do is to map our eQTL entities to the respective genome bins and to create the summary
counts for the x and y (row and column) margins.
Let’s do that now:
Nice, now everything is prepared and we can start with the actual plotting!
The actual plotting
Let’s keep it brief, I added some comments to the code (as I’m supposed to do anyway) so you can figure out what is happening, but I kept it in a single code block so you can just copy, paste and adjust if you want.
So, as you can see, we create our final plot by arranging three distinct plots in a single frame using quite some functions from different packages (e.g. plot_grid from the cowplot package, and the nullGrob() method from the grid package).
NOTE: it could happen that the axes are not perfectly aligned with the margins. the plot_grid() function provides an
argument to tackle this (align and axis), but it can get tricky to try to align the axes in both dimensions!
The row and column annotations show the total number of entities falling into the respective row or column and the red dots in the main plot indicate the presence of a SNP (columns) associated with a gene (row) for the two chromosome regions.
This actually looks rather good (well, good enough)! So good in fact that, we are content for now and stop tweaking the plot (I’m lazy today).
Anyway, we can see that there are some SNPs or LD blocks which exhibit a huge number of trans associations as well as some regions in the genome which harbor a relatively larger number of genes which are influenced by those SNPs in trans as compared to other regions.
Conclusion
Alright, you have seen how we can get an impression of genome-wide results from an eQTL study, just by using tidyverse/ggplot2 and the GenomicRanges packages in R.
Of course, there are different ways of doing this, for example you could try to do a circos plot using the ggbio package in R or you can add continuous information about the number of eQTLs per bin directly in the main matrix plot (e.g. using the fill aesthetics).
Maybe I’ll extend the plot later to show how to do this, but in the meanwhile: have fun playing around with this example!
Comments