In this short document we use the XML package to download and parse an HTML table from REMAP. The table gives an overview of the cell-lines in REMAP and of the transcription factors (TFs) whose binding sites were measured in each of these cell-types.
We then summarize the number of TFs per cell-type and generate a plot showing the number of TFs accumulated over all cell-types. The cell-types are ordered as follows: we start with the cell-type that has the most TFs available, continue with the one that adds the most new TFs, and so forth.
REMAP
REMAP is a large resource that collects TF binding sites (TFBS) for numerous TFs across hundreds of cell-types. These TFBS are derived from ChIP-seq, an experimental protocol that provides a genome-wide readout of the DNA sites bound by specific transcription factors. The current version (2018) contains over 80 million TFBS for 485 TFs identified in 346 cell-types.
Implementation
Data processing
First, we load the required libraries and read in the table. Optionally, we could save the table to disk as a TSV file.
NOTE: We load the cowplot package only to set up a nice, lightweight default theme for ggplot. I can also really recommend the package for publication-ready figures.
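The loading step can be sketched as follows. This is a minimal illustration, not the original code: to stay runnable offline it parses a small inline HTML string instead of the REMAP page, and the table contents and the column names `cell_type` and `tfs` are made up for the example. For the real page you would pass the page's URL or a downloaded file to readHTMLTable() instead.

```r
library(XML)

# Hypothetical stand-in for the REMAP overview page (made-up rows).
html <- '<html><body><table>
<tr><th>Cell type</th><th>TFs</th></tr>
<tr><td>K562</td><td>GATA1, TAL1, CTCF</td></tr>
<tr><td>GM12878</td><td>CTCF, PAX5</td></tr>
</table></body></html>'

# Parse the HTML text into a document object.
doc <- htmlParse(html, asText = TRUE)

# For a whole document, readHTMLTable() returns a list with one
# element per <table> it finds.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Keep only the first table and give it nicer column names.
tab <- tibble::as_tibble(tables[[1]])
names(tab) <- c("cell_type", "tfs")

# Optional: save as TSV (commented out to avoid writing files).
# write.table(tab, "remap_table.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
```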
That was easy enough: we got the table and transformed it into a tibble, getting nicer column names on the way. You can download the transformed table from here.
NOTE: readHTMLTable() returns a list of all the tables it finds on the page. We immediately subset this list and keep only the first element, which in our case is the main table.
Since all TFs of a cell-type are gathered in a single table cell, in the next step we split each row by its TFs using the very convenient separate_rows() function. It splits the string in a given column at a separator and expands the row into one row per resulting value. For us, this yields one row per available TF for each cell-type.
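A small sketch of this step, using a made-up three-row table in place of the REMAP data (the column names `cell_type` and `tfs` and the comma separator are assumptions for illustration). It also counts the TFs per cell-type, which is one way to get the ranking discussed next.

```r
library(dplyr)
library(tidyr)

# Hypothetical example table: one row per cell-type, all TFs in one string.
tab <- tibble(
  cell_type = c("K562", "GM12878", "HepG2"),
  tfs = c("GATA1, TAL1, CTCF", "CTCF, PAX5", "CTCF")
)

# Split the 'tfs' column at commas (plus optional whitespace) and
# expand each row into one row per TF.
long <- separate_rows(tab, tfs, sep = ",\\s*")

# Count TFs per cell-type, largest first.
tf_counts <- count(long, cell_type, name = "n_tfs", sort = TRUE)
```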
We can see that the cell-type with the most TFs measured is K562 (used in ENCODE), closely followed by the GM12878 LCL cell-line. In fact, most of the top cell-types were used in ENCODE.
Finally, we generate the cumulative numbers we want to plot. Since in each step we only want to add the cell-line contributing the most new TFs, we do this manually and evaluate the overlap of the TF lists on the way. This might be a somewhat crude implementation, but it does the job. If you know a shortcut for achieving this, let me know!
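One possible way to write this selection is a simple greedy loop in base R. This is a sketch under assumptions, not the original implementation: the function name `greedy_tf_contributions()`, the input format (a named list of TF vectors) and the example data are all hypothetical. Ties are resolved in favour of the cell-type listed first.

```r
# Hypothetical TF lists per cell-type (made-up data).
tf_lists <- list(
  K562    = c("GATA1", "TAL1", "CTCF"),
  GM12878 = c("CTCF", "PAX5"),
  HepG2   = c("CTCF", "HNF4A", "GATA1"),
  HeLa    = c("CTCF")
)

# In each step, pick the cell-type that adds the most TFs not yet covered.
greedy_tf_contributions <- function(tf_lists) {
  remaining <- tf_lists
  covered <- character(0)
  cell_type <- character(0)
  new_tfs <- integer(0)
  while (length(remaining) > 0) {
    # Number of new TFs each remaining cell-type would contribute.
    gains <- vapply(remaining,
                    function(tfs) length(setdiff(tfs, covered)),
                    integer(1))
    best <- which.max(gains)  # first maximum wins on ties
    cell_type <- c(cell_type, names(remaining)[best])
    new_tfs <- c(new_tfs, gains[[best]])
    covered <- union(covered, remaining[[best]])
    remaining <- remaining[-best]
  }
  data.frame(cell_type = cell_type, new_tfs = new_tfs,
             stringsAsFactors = FALSE)
}

contrib <- greedy_tf_contributions(tf_lists)
```

The contributions sum to the number of distinct TFs across all cell-types, which makes for an easy sanity check.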
Plotting
Now we can use the accumulated TF contributions for plotting with ggplot.
We filter for contributions > 0 and calculate the cumulative sum for the y-axis.
We further add the cell type labels to each data point using the geom_text() ggplot layer.
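The plotting steps above can be sketched like this. The data frame `contrib` and its columns `cell_type` and `new_tfs` are hypothetical stand-ins for the accumulated contributions, and the choice of geom_line() plus geom_point() is an assumption about how the curve is drawn.

```r
library(ggplot2)

# Hypothetical contributions, already in the order the cell-types were added.
contrib <- data.frame(
  cell_type = c("K562", "GM12878", "HepG2", "HeLa"),
  new_tfs = c(3, 1, 1, 0)
)

# Keep only cell-types that add at least one new TF,
# then compute the cumulative sum for the y-axis.
plot_data <- contrib[contrib$new_tfs > 0, ]
plot_data$step <- seq_len(nrow(plot_data))
plot_data$cumulative_tfs <- cumsum(plot_data$new_tfs)

p <- ggplot(plot_data, aes(x = step, y = cumulative_tfs)) +
  geom_line() +
  geom_point() +
  # Label each point with its cell-type, slightly above the point.
  geom_text(aes(label = cell_type), vjust = -1, size = 3) +
  labs(x = "Number of cell-types added", y = "Cumulative number of TFs")

# print(p)  # draw the plot in an interactive session
```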
Summary
That’s all! This was a quick (not necessarily dirty) way of extracting an HTML table from a website and generating a brief overview.