Brief query on how to extract gene lists and download only parts of a huge database (JASPAR CORE 2022)

Marcos Cases

Jun 20, 2022, 12:10:36 PM

to gen...@soe.ucsc.edu

To whom it may concern,

First, let me thank you for taking the time to read my email.

My name is Marcos Cases and I am currently doing my Ph.D. in Immunology at the German Center for Research in Rheumatology (DRFZ) Berlin.

As part of my research project I would like to compare three regions that can act as promoters of my target gene, FOXP3. The comparison would be in terms of which signalling pathways each promoter region appears to respond to.

To this aim, I decided to apply the following steps:

1) Identify candidate Transcription Factors (TFs) that can bind to the promoter regions based on their TF Binding Site (TFBS) being predicted there.

Create lists.

Relevant database/algorithm: JASPAR CORE 2022

2) Filter candidate TFs according to:

a. conservation of the putative TFBS across species

b. Correlation with reported TSS

c. Focus on those TFs that are reported to be expressed in my target cell type (scavenge available RNA-seq data and Mass Spectrometry data)

d. Give special priority to those TFs that are reported to bind to the putative promoter regions (scavenge available ChIP-seq/ChIP-Chip data)

Relevant database/algorithm: a. Vertebrate Multiz Alignment & Conservation (100 Species)

b. FANTOM5 summary tracks

c. ???

d. ???

3) Identify signaling pathways that could regulate the promoter regions by applying a GO analysis using the previously generated candidate TF list.

Relevant database/algorithm: - DAVID (Huang DW, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, Baseler MW, Lane HC, Lempicki RA. 2007. The DAVID gene functional classification tool: A novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biology 8: R183. DOI: https://doi.org/10.1186/gb-2007-8-9-r183, PMID: 17784955 Hulsen T, de Vlieg J, Alkem)

- JASPAR ENRICHMENT ANALYSIS (https://jaspar.genereg.net/enrichment/)

I tried to summarize what I am aiming for in the figure below.

Having said all this, I would like to ask you for help with:

a) retrieving the information from the JASPAR CORE 2022 track only for the selected genomic regions (“visible range/window”). There is a link to download* the data that was used to create the track, but the bigBed file is simply too big and I do not have enough space on my disk. Furthermore, I would like to get an Excel sheet with the predicted TFBS in a specific stretch of the genome.

b) finding any RNA-seq and/or Mass Spectrometry database generated using human regulatory T cells (T_REGS, my target cell type) that I could use to filter out those TFs that are not expressed in these cells. Is there any hint you could give me, please?

Thank you so very much for your time and for your consideration! I will be very much looking forward to hearing back from you, as this analysis would mean a great deal for my project and, therefore, my thesis.

Sincerely,

Marcos Cases

* Binding site predictions for all and individual TF profiles are available for download at http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/.

Brian Lee

Jun 28, 2022, 6:26:39 PM

to Marcos Cases, gen...@soe.ucsc.edu

Dear Marcos,

Thank you for using the UCSC Genome Browser and for your question about JASPAR CORE 2022 data access. The first place to start with track data is the Track Description page, https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=jaspar, which has a "Data Access" section. Please review that section.

There are ways to access the data through the user interface with the Table Browser and Data Integrator, and also programmatic ways with an API or directly on the command line. The data in the browser is often stored in bigBed files, and you can extract regions of information from them programmatically with the bigBedToBed tool, which you can get from our utilities directory: https://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads
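As a rough illustration of the API route, here is a minimal Python sketch that only builds a request URL for the getData/track endpoint of the UCSC REST API. The track name "jaspar2022" and the helper name jaspar_api_url are assumptions made for this example; please confirm the actual track name in the "Data Access" section of the Track Description page before relying on it.

```python
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def jaspar_api_url(chrom, start0, end, genome="hg38", track="jaspar2022"):
    """Build a getData/track URL for one region.

    start0 is 0-based and end is exclusive, as in BED files.
    The default track name is an assumption; check the track's
    Data Access section for the real one.
    """
    if start0 < 0 or end <= start0:
        raise ValueError("expected 0 <= start0 < end")
    params = [
        ("genome", genome),
        ("track", track),
        ("chrom", chrom),
        ("start", start0),
        ("end", end),
    ]
    # Parameters are joined with semicolons, as in the API help examples.
    query = ";".join(f"{key}={value}" for key, value in params)
    return f"{API_BASE}?{query}"


if __name__ == "__main__":
    print(jaspar_api_url("chrX", 49262560, 49262903))
```

The URL can be opened in a web browser or fetched with any HTTP client; the response is JSON, so only the requested region is transferred rather than the whole file.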

For instance, here is a command that would extract the data for the coordinate range chrX:49,262,561-49,262,903:

bigBedToBed -chrom=chrX -start=49262560 -end=49262903 http://hgdownload.soe.ucsc.edu/gbdb/hg38/jaspar/JASPAR2022.bb stdout
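Note that the -start value is one less than the first coordinate shown in the browser, because BED-style coordinates are 0-based while browser positions are 1-based. The following Python sketch shows one way to derive the command arguments from a browser position and to turn the tab-separated output into a CSV file that Excel can open. The helper names (parse_position, bigbed_command, bed_to_csv) and the "extra_N" labels for columns beyond the six standard BED fields are made up for this example; the real column names are listed in the track's schema.

```python
import csv
import io
import re

JASPAR_BB = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/jaspar/JASPAR2022.bb"
BED6 = ["chrom", "chromStart", "chromEnd", "name", "score", "strand"]


def parse_position(position):
    """Turn a 1-based browser position such as 'chrX:49,262,561-49,262,903'
    into (chrom, 0-based start, end)."""
    match = re.fullmatch(r"([^:\s]+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    start1 = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    return match.group(1), start1 - 1, end


def bigbed_command(position, source=JASPAR_BB):
    """Return the bigBedToBed argument list for one browser position."""
    chrom, start0, end = parse_position(position)
    return ["bigBedToBed", f"-chrom={chrom}", f"-start={start0}",
            f"-end={end}", source, "stdout"]


def bed_to_csv(bed_text):
    """Convert tab-separated BED text into CSV text with a header row.
    Columns past the sixth get placeholder names (extra_1, extra_2, ...)."""
    rows = [line.split("\t") for line in bed_text.splitlines() if line.strip()]
    width = max((len(row) for row in rows), default=len(BED6))
    header = BED6[:width] + [f"extra_{i}" for i in range(1, width - len(BED6) + 1)]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(" ".join(bigbed_command("chrX:49,262,561-49,262,903")))
```

With bigBedToBed installed, something like subprocess.run(bigbed_command(position), capture_output=True, text=True, check=True).stdout would give the BED text to pass to bed_to_csv, and the result can be saved with a .csv extension.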

There are quite a lot of details packed into the rest of your questions, and answering them entails making connections between other datasets beyond JASPAR 2022. Here is a session of the region discussed: https://genome.ucsc.edu/s/brianlee/FOXP3

It sounds as though, once you have used JASPAR as a source of predicted TFBS, you wish to filter them against other data such as conserved sites, TSS, and cell-type-specific data. Unfortunately there is no brief way to summarize the steps involved. We do have tools that allow you to interact with different data sets. Primarily, I would recommend building custom tracks of your regions of interest and then doing intersections in the Table Browser or more complex extractions with the Data Integrator. I recommend watching some of our videos on these tools and looking at their user guides:
https://genome.ucsc.edu/goldenPath/help/hgIntegratorHelp.html
https://genome.ucsc.edu/goldenPath/help/hgIntegratorHelp.html#video
https://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
https://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#Videos
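
To make the idea of custom tracks and intersections a little more concrete, here is a small Python sketch. It is only a conceptual illustration and not how the Table Browser or Data Integrator work internally. It writes a minimal BED custom track for a set of regions and keeps the features that overlap any of them. The function names are invented for this example, and any coordinates you pass in would be your own promoter regions.

```python
def custom_track(name, regions):
    """Return custom-track text: a track line followed by BED3 rows.
    regions is a list of (chrom, 0-based start, end) tuples."""
    lines = [f'track name="{name}" description="{name}"']
    lines += [f"{chrom}\t{start}\t{end}" for chrom, start, end in regions]
    return "\n".join(lines) + "\n"


def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def intersect(features, regions):
    """Keep features (chrom, start, end, name) that overlap any region."""
    kept = []
    for chrom, start, end, name in features:
        if any(chrom == r_chrom and overlaps(start, end, r_start, r_end)
               for r_chrom, r_start, r_end in regions):
            kept.append((chrom, start, end, name))
    return kept
```

The text returned by custom_track can be pasted into the custom track submission page, and intersect could be applied to rows extracted with bigBedToBed if you prefer to do the filtering offline.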

I also recommend taking some time to look at our archives of mailing list questions, where you can search terms, for instance TFBS: https://groups.google.com/a/soe.ucsc.edu/g/genome/search?q=TFBS

It sounds like conserved DNA is your top priority, so you may wish to use that as the start of your Data Integrator query: a custom track of conserved DNA that intersects with your promoter regions, to which you then add data such as the JASPAR TFBS, before looking at the pathways those TFBS connect with across the genome. Along these lines, in the above session the top track is our Protein Interactions track, where if you click into the FOXP3: IL2... annotation, you will see a collection of pathways on our Gene Interactions Tool. Click the top "Genome Browser" to return to your prior view, but you can also use that tool to explore pathway information: https://genome.ucsc.edu/goldenPath/help/hgGeneGraph.html

Another track to mention in the session is the "Interactions between GeneHancer regulatory elements" track toward the middle. GeneHancer used FANTOM5, ENCODE, Ensembl and other data sources to build a view of distance relations between TSS and promoters and enhancers. For instance, in this region there are annotations that point to the genes FTSJ1, RBM3, PIM2, PLP2, and PRICKLE3, to name a few. Read more about the track here: https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chrX&g=geneHancer

We wish you the best of luck on your research and hope this has been useful for you. Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further public questions, please send new questions to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum to help others find answers to similar questions. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu, which is a private internal list to our support team.

All the best,
