Skidmore ZL, Wagner AH, Lesurf R, Campbell KM, Kunisaki J, Griffith OL, Griffith M. 2016. GenVisR: Genomic Visualizations in R. Bioinformatics. pii: btw325. [Epub ahead of print]
A commonly desired genomic visualization is the so-called mutation lolliplot. After identifying genes with recurrent mutations, the next step is often to visualize how those mutations are distributed across the coding space of the gene. This allows the viewer to identify locations of mutation clustering (hotspots), protein domains affected by mutations, or other patterns related to the position and type of mutation. Typically, a simple model of the protein-coding portion of a gene is shown with mutations marked by stacks of connected dots above and/or below the gene, possibly colored by mutation type, and with appropriate mutation, gene, and protein domain labels. There are several web-based tools for visualizing data in this manner. For example, St. Jude's excellent ProteinPaint application, available through their PeCan Data Portal, provides such visualizations for pre-loaded pediatric cancers and COSMIC data. The ICGC data portal and cBioPortal provide such visualizations for ICGC, TCGA and other data, as does COSMIC for its own massive pre-loaded datasets. The MutationMapper tool from cBioPortal allows custom mutation lists to be uploaded. A command-line tool developed by David Larson was the inspiration for GenVisR::lolliplot and is also available through the Genome Modeling System to create similar plots.
However, in many cases, producing publication-ready lolliplots requires further customization. A user may wish to visualize a custom dataset not included in the above web portals, or may wish to choose a different protein isoform or source of protein domain annotations. In other cases, automated generation of plots for multiple sets of genes (e.g., all recurrently mutated genes) is desired. Such custom plots have historically been created through ad hoc R plotting. To address the needs for automation, customization and accessibility, we have created the GenVisR package for Genomic Visualizations in R. The lolliplot function is just one of many convenient functions for producing highly customizable, publication-quality graphics for genomic data, primarily at the cohort level.
The first required step is to install GenVisR. Make sure that you have the latest version of R (3.3.0 or later), available from CRAN, and launch an R session. GenVisR is available through Bioconductor and can be installed by the usual method. At an R prompt, install GenVisR and then load the GenVisR library.
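A minimal installation sketch, assuming a current Bioconductor setup in which packages are installed through the BiocManager package. Older Bioconductor releases, including those contemporary with R 3.3, used the biocLite() installer instead, so adjust this to match your R version.

```r
# Install BiocManager from CRAN if it is not already present
# (assumption: a recent R/Bioconductor setup)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install GenVisR from Bioconductor, then load it into the session
BiocManager::install("GenVisR")
library(GenVisR)
```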
Now, let's get the mutation data for Ma et al. (2015). This is available as Supplementary Table S3 at the paper's Supplementary Data page. I opened this Excel file and saved it as a tab-delimited text file for import into R. Take note of where you saved that file and import it into R. The read.table function is a useful tool for this purpose.
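As a sketch of the import step, the block below writes a tiny stand-in table to a temporary file and reads it back with read.table. The file contents and column names are hypothetical placeholders, not the actual layout of Supplementary Table S3; substitute the path to your own saved file.

```r
# Hypothetical stand-in for the saved tab-delimited file; replace `tsv_path`
# with the location of your own export of the supplementary table.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "gene\tmutation_type\tamino_acid_change",
  "GENE1\tmissense\tp.A10T",
  "GENE1\tnonsense\tp.Q25*",
  "GENE2\tmissense\tp.R50H"
), tsv_path)

# header = TRUE: the first line holds the column names
# sep = "\t": fields are tab-delimited
# quote = "": do not treat quote characters inside fields as delimiters
# stringsAsFactors = FALSE: keep text columns as character vectors
mutation_data <- read.table(tsv_path, header = TRUE, sep = "\t",
                            quote = "", stringsAsFactors = FALSE)

# Inspect the column names and types before going further
str(mutation_data)
```

Checking the result with str() or head() right after import is a cheap way to catch a wrong separator or a shifted header row.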
Plot the lolliplot. We can customize the plot by coloring dots by mutation type, labelling by amino acid change, and tweaking the text size and angle. Note that the variant annotations from Ma et al. (2015) were reported for Ensembl version 74 (see Patients and Methods). By default, GenVisR uses the latest version of Ensembl. To ensure consistency between the reported mutations and the transcript annotation/structure, you can specify the appropriate Ensembl archive for version 74 with the 'host' parameter.
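A minimal sketch of such a call, assuming GenVisR is loaded and the Ensembl archive is reachable from your machine. The three rows are illustrative TP53 values, not data from Ma et al. (2015). The input column names (transcript_name, gene, amino_acid_change) and the archive host for Ensembl version 74 are assumptions on our part, so confirm them against ?lolliplot and the Ensembl archive list before relying on them.

```r
library(GenVisR)

# Illustrative input only (not the Ma et al. data): one row per mutation.
# `mutation_type` is an extra, hypothetical column used for coloring.
my_data <- data.frame(
  transcript_name   = "ENST00000269305",
  gene              = "TP53",
  amino_acid_change = c("p.R175H", "p.R248Q", "p.R273C"),
  mutation_type     = c("missense", "missense", "missense"),
  stringsAsFactors  = FALSE
)

# Color by mutation type, label by amino acid change, shrink and rotate labels.
# Assumption: dec2013.archive.ensembl.org is the archive host for Ensembl 74.
lolliplot(my_data,
          fillCol  = "mutation_type",
          labelCol = "amino_acid_change",
          txtSize  = 3,
          txtAngle = 70,
          host     = "dec2013.archive.ensembl.org")
```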
Apologies, I did not see the tag until now. The lolliplot function uses biomaRt to grab the nucleotide sequence for a given transcript and then converts it to amino acids. That specific line of code checks that the length of the nucleotide sequence is a multiple of three.
I suspect that, because x is of length zero, a nucleotide sequence was not retrieved; perhaps biomaRt was down at the time? Depending on the version of GenVisR you are using, you may or may not have received a warning to this effect.
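To illustrate the kind of check described above, here is a standalone sketch (not GenVisR's actual source). The hypothetical helper reports whether a coding sequence can be split cleanly into codons, and it flags an empty result, as when no sequence is retrieved, explicitly.

```r
# Hypothetical helper, not part of GenVisR
is_whole_codons <- function(nt_seq) {
  # Treat a missing or empty sequence as a failure rather than "zero codons"
  if (length(nt_seq) == 0 || is.na(nt_seq) || nchar(nt_seq) == 0) {
    return(FALSE)
  }
  # A translatable coding sequence has a length divisible by three
  nchar(nt_seq) %% 3 == 0
}

is_whole_codons("ATGGCCTAA")   # 9 nt, three complete codons
is_whole_codons("ATGGCCTA")    # 8 nt, last codon incomplete
is_whole_codons(character(0))  # nothing retrieved
```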
I am wondering whether there is a package or a method in R that gives access to information such as the amino acid length and domain regions of an input protein, so that I can generate a lolliplot like this with trackViewer or ggplot for my list of genes of interest.
Yes, thanks, but I already knew about trackViewer. I was wondering how to get the GRanges object for a protein so that I can pass it to trackViewer. Your second link looks very good, but I cannot integrate it into my R pipeline. Thanks anyway.
I have been using trackViewer package in R to make plots of RNAseq, ChIPseq and ATACseq data. We have now some methylation data and I was trying to make some lolliplots to show differential methylation levels in genes/promoters.
I have been following the vignettes, but I am really confused with the section dedicated to lolliplots for methylation data, and have been unable to make it work. My knowledge in Bioinformatics is basic and it's the first time we have methylation data.
What type of input data do I need for these plots? We would like to show basically differentially methylated regions (DMRs) in two conditions (cell types). I would appreciate some help.
What kind of file do you have? Any file that can be imported into a GRanges object will work for a trackViewer lolliplot. If you want to show differences between two conditions, you can show them one by one or in the caterpillar layout. Because you may want to fix the methylation positions, you can try setting jitter="label".
Thanks for your reply. I was completely unable to make a lolliplot before with my data, mainly because I couldn't read my files. But using your code, I could plot some! However, I am confused about the type of data that I should use. I will try to explain what type of data I have (we did not perform the analysis ourselves, but we got the results from the sequencing facility, who also did the basic bioinformatics analysis).
On the other hand, I have one other file after differential analysis was performed between conditions (group 1 vs group 2, each group including 3 samples), containing differentially methylated regions (DMRs). This file contains in one single table information for each group (g1 vs g2). This file has info on chr, start, end, q-value, mean difference between groups, #CpGs, mean g1 and mean g2:
I tried using this file, but I get an error saying that I should have ranges with width = 1. I tried modifying the width of the range to 1 using the Start position as the location, and then it works, but obviously I am losing information, because there are several CpGs in the region, not only one nucleotide at the Start. So I am not sure what data I should use or if I can work with any of those files.
If there are multiple CpGs in one region, split the region into one entry per CpG. Currently, lolliplot only supports ranges with width equal to 1. To plot multiple samples, please follow the guide in the vignettes. Basically, your idea should work.
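One way to do the split, sketched in base R with made-up coordinates. It assumes you have a table of individual CpG positions (hypothetical here; in practice it would come from a per-CpG methylation file) alongside the DMR table, and it keeps each CpG that falls inside a DMR as its own single-position record.

```r
# Hypothetical DMR table (one row per region)
dmrs <- data.frame(chr = c("chr1", "chr1"),
                   start = c(100, 500),
                   end = c(120, 510),
                   stringsAsFactors = FALSE)

# Hypothetical per-CpG table (one row per CpG, with a methylation value)
cpgs <- data.frame(chr = c("chr1", "chr1", "chr1", "chr1", "chr2"),
                   pos = c(101, 115, 300, 505, 105),
                   meth = c(0.80, 0.65, 0.10, 0.30, 0.50),
                   stringsAsFactors = FALSE)

# For every DMR, pull the CpGs on the same chromosome inside [start, end]
per_cpg <- do.call(rbind, lapply(seq_len(nrow(dmrs)), function(i) {
  hit <- cpgs$chr == dmrs$chr[i] &
    cpgs$pos >= dmrs$start[i] &
    cpgs$pos <= dmrs$end[i]
  out <- cpgs[hit, , drop = FALSE]
  out$dmr <- rep(i, nrow(out))  # remember which region each CpG came from
  out
}))
rownames(per_cpg) <- NULL

per_cpg
# Each row is now a single position. With the GenomicRanges package loaded,
# it could be converted to width-1 ranges with, for example:
#   GRanges(per_cpg$chr, IRanges(per_cpg$pos, width = 1), score = per_cpg$meth)
```

If only the DMR summary file is available, the individual CpG coordinates are not recoverable from the region start, end and CpG count alone, so a per-CpG file from the sequencing facility would be needed.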
I am trying to use the trackViewer package to create a lolliplot to portray the variants in the fut3 gene from the 1000 genomes .popvcf file for this locus. I am able to replicate the example given in the trackViewer vignette, but when I alter the code to plot fut3 variants I get this error: "Error: all(width(SNP.gr[[i]]) == 1) is not TRUE"
For the majority of users we recommend installing GenVisR from the release branch of Bioconductor. Installation instructions using this method can be found on the GenVisR landing page on Bioconductor.
Development for GenVisR occurs on the Griffith Lab GitHub repository. For users wishing to contribute to development, we recommend cloning the GenVisR repo there and submitting a pull request. Please note that development occurs on the R version that will be available at each Bioconductor release cycle. This ensures that GenVisR will be stable for each Bioconductor release, but it may require developers to download R-devel.
To view the general behavior of waterfall we use the brcaMAF data structure available within GenVisR. This data structure is a truncated MAF file consisting of 50 samples from the TCGA project corresponding to Breast invasive carcinoma (complete data from TCGA public web portal).
This type of view is of limited use without expanding the graphic device, given the large number of genes. Often it is beneficial to reduce the number of cells in the plot by limiting the number of genes plotted. There are three ways to accomplish this. The mainRecurCutoff parameter accepts a numeric value between 0 and 1 and will remove genes from the data which do not have at least that proportion of samples mutated. For example, to plot only those genes with mutations in >= 6% of samples, set mainRecurCutoff to 0.06.
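To make the cutoff concrete, here is a base-R sketch of the filter it describes, using a made-up mutation table. It illustrates the idea only and is not GenVisR's internal code; in particular, the cohort size used as the denominator is an assumption. With GenVisR loaded, the corresponding call would be waterfall(brcaMAF, mainRecurCutoff = 0.06).

```r
# Hypothetical long-format mutation table: one row per (sample, gene) mutation
muts <- data.frame(
  sample = c("s1", "s2", "s3", "s1", "s4", "s2"),
  gene   = c("TP53", "TP53", "TP53", "PIK3CA", "PIK3CA", "GATA3"),
  stringsAsFactors = FALSE
)
n_samples <- 20    # hypothetical cohort size, including unmutated samples
cutoff    <- 0.06  # keep genes mutated in at least 6% of samples

# Count each sample at most once per gene, then convert counts to proportions
unique_hits  <- unique(muts[, c("sample", "gene")])
prop_mutated <- table(unique_hits$gene) / n_samples

# Keep only the genes that reach the cutoff
keep          <- names(prop_mutated)[prop_mutated >= cutoff]
muts_filtered <- muts[muts$gene %in% keep, ]

prop_mutated
keep
```

In this toy cohort TP53 (3 of 20 samples) and PIK3CA (2 of 20) pass the 6% threshold, while GATA3 (1 of 20) is dropped.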
Occasionally there may be samples not represented within the .maf file (due to a lack of mutations). It may still be desirable to plot these samples. To accomplish this simply add the relevant samples into the appropriate column before loading the data and leave the rest of the columns as NA. Alternatively the user can specify a list of samples to plot via the plotSamples parameter which will accept samples not in the input data.
In an effort to maintain a high degree of flexibility the user has the option of selecting columns on which to fill and label. The parameters fillCol and labelCol allow this behavior by taking column names on which to fill and label respectively. Additionally one can plot the amino acid sidechain information in lieu of protein domains.
lolliplot uses a force field model from the package FField to repulse and attract data in an attempt to achieve a reasonable degree of separation between points. Suitable defaults have been set for the majority of use cases. On occasion the user may need to manually adjust the force field parameters, especially if the number of points to apply the model to is large. This can be done for the upper and lower tracks individually via rep.fact, rep.dist.lmt, attr.fact, adj.max, adj.lmt and iter.max. Please see the documentation for FField::FFieldPtRep for a complete description of these parameters.