How to download annotations for all JASPAR CORE profiles?


carina...@gmail.com

Mar 23, 2020, 11:24:28 PM
to JASPAR Q&A Forum
Hello,

I would like to download the annotation for all JASPAR CORE profiles, something like this for each profile:

Name: TGA1A
Matrix ID: MA0129.1
Class: Basic leucine zipper factors (bZIP)
Family:
Collection: CORE
Taxon: Plants
Species: Nicotiana sp.
Data Type: SELEX
Validation: 10561063
Uniprot ID: P14232  
Source:
Comment:

Is there a way to do it?

Best,

Paola

Anthony Mathelier

Mar 24, 2020, 3:50:34 AM
to carina...@gmail.com, JASPAR Q&A Forum
Dear Paola,

All of this information can be retrieved programmatically. You can, for instance, use our Biopython JASPAR module (see https://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc218) or our REST API (see https://academic.oup.com/bioinformatics/article/34/9/1612/4747882).

Best
AM

-- 
Anthony

Jaime Castro

Mar 24, 2020, 10:44:26 AM
to JASPAR Q&A Forum
Hi Paola

I am sending you the code I used to extract these features with the API in R.

The code is tricky because the API returns nested lists, so it requires a lot of data manipulation, but it works.

Change the taxon variable if you want to use it with other taxa.

Let us know if this works for you.

Jaime


##################################################################
##################################################################

library("dplyr")
library("data.table")
library("jsonlite")
library("purrr")

taxon <- "Vertebrates"
jaspar.url <- paste0("http://jaspar.genereg.net/api/v1/matrix/?page_size=800&collection=CORE&tax_group=", taxon, "&version=latest&format=json")
result <- fromJSON(jaspar.url)

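## Fetch the full record of every profile. simplify = FALSE keeps the result
## as a list named by matrix ID, which the purrr::map() calls below expect.
all.profiles.info <- sapply(result$results$matrix_id, function(id){

  indiv.jaspar.url <- paste0("http://jaspar.genereg.net/api/v1/matrix/", id, "/?format=json")
  indiv.mat.info <- fromJSON(indiv.jaspar.url)

  indiv.mat.info

}, simplify = FALSE)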

## Fields available in each profile record:
# names(all.profiles.info[[1]])
#
# [1] "pubmed_ids"    "description"   "family"        "pfm"           "tax_group"     "matrix_id"     "sequence_logo" "remap_tf_name"
# [9] "pazar_tf_ids"  "versions_url"  "collection"    "base_id"       "class"         "tffm"          "tfe_ids"       "name"        
# [17] "tfbs_shape_id" "uniprot_ids"   "sites_url"     "species"       "alias"         "version"       "unibind"       "type"        
# [25] "symbol"  


## Vertebrates: 746 profiles
all.profiles.info.subset <- map(all.profiles.info, `[`, c("name", "matrix_id", "class", "family", "tax_group", "species", "type", "pubmed_ids", "uniprot_ids"))

## Species is a nested list with two entries: name and tax_id.
## They must be processed separately
species.df <- data.frame( species = do.call(rbind, lapply(map(all.profiles.info.subset, c("species", "name")), paste, collapse = "::") ))
tax.id.df <- data.frame( tax_id = do.call(rbind, lapply(map(all.profiles.info.subset, c("species", "tax_id")), paste, collapse = "::") ))

## Family/Class/Uniprot_ids may contain two or more entries (e.g., dimers), therefore, they must be processed separately
family.df <- data.frame( family = do.call(rbind, lapply(map(all.profiles.info.subset, "family"), paste, collapse = "::") ))
class.df <- data.frame( class = do.call(rbind, lapply(map(all.profiles.info.subset, "class"), paste, collapse = "::") ))
uniprot.df <- data.frame( uniprot_ids = do.call(rbind, lapply(map(all.profiles.info.subset, "uniprot_ids"), paste, collapse = "::") ))

## Some profiles contain two PubMed IDs (a mistake made when we curated the database). To avoid problems, we concatenate them.
## This problem must be fixed in future releases.
pubmed.df <- data.frame( pubmed_ids = do.call(rbind, lapply(map(all.profiles.info.subset, "pubmed_ids"), paste, collapse = "::") ))

## Convert list to data.frame
all.profiles.info.subset <- map(all.profiles.info.subset, `[`, c("name", "matrix_id", "tax_group", "type"))
all.profiles.info.tab <- rbindlist(all.profiles.info.subset)


## Concat all the data.frames
all.profiles.info.tab.clean <-
  cbind(all.profiles.info.tab, species.df, tax.id.df, class.df, family.df, uniprot.df, pubmed.df) %>%
  dplyr::select(name, matrix_id, class, family, tax_group, species, tax_id, type, uniprot_ids, pubmed_ids)

fwrite(all.profiles.info.tab.clean, sep = "\t", file = paste0("Jaspar_2020_", taxon,"_table.tab"))
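
The repeated do.call(rbind, lapply(..., paste, collapse = "::")) steps above all do the same job: they collapse a field that may hold several values into one string per profile. A minimal sketch of that step as a reusable helper follows. The helper names collapse_field and field_column, and the two example records with made-up matrix IDs, are hypothetical and only mimic the nested structure of the API response. Unlike the script above, this sketch turns empty fields into NA rather than an empty string.

```r
# Collapse a possibly multi-valued field into one "::"-separated string.
# NULL or empty fields become NA so that every profile yields exactly one value.
collapse_field <- function(x, sep = "::") {
  if (is.null(x) || length(x) == 0) return(NA_character_)
  paste(unlist(x), collapse = sep)
}

# Build one column: apply collapse_field() to the same field of every record.
field_column <- function(records, field) {
  vapply(records, function(p) collapse_field(p[[field]]), character(1),
         USE.NAMES = FALSE)
}

# Hypothetical records (made-up IDs and values) mimicking the nested lists
profiles <- list(
  MA0000.1 = list(name = "TF_A", family = c("Fam1", "Fam2"), uniprot_ids = "P00001"),
  MA0000.2 = list(name = "TF_B", family = character(0), uniprot_ids = c("P00002", "P00003"))
)

annotation <- data.frame(
  matrix_id   = names(profiles),
  name        = field_column(profiles, "name"),
  family      = field_column(profiles, "family"),
  uniprot_ids = field_column(profiles, "uniprot_ids"),
  stringsAsFactors = FALSE
)
print(annotation)
```

With the real data, field_column(all.profiles.info, "family") could stand in for the family.df line, and likewise for class, uniprot_ids and pubmed_ids.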

Ruo Kery

Nov 13, 2020, 5:51:37 AM
to JASPAR Q&A Forum
Hi Paola,

Using the JASPAR2020 R package, it takes only a few steps to extract the annotations for all TFs.

# Retrieve all profiles for CORE + vertebrates as follows
library(JASPAR2020)
library(TFBSTools)

opts <- list()
opts[["tax_group"]] <- "vertebrates"
opts[["collection"]] <- "CORE"
JASPAR_PFMatrixList <- getMatrixSet(JASPAR2020, opts)

## JASPAR_PFMatrixList is a PFMatrixList object
## Extract the matrix IDs as follows
ID(JASPAR_PFMatrixList)
## Extract the TF names as follows
name(JASPAR_PFMatrixList)
## Extract the remaining annotations (tags) as follows
tags(JASPAR_PFMatrixList)
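
To get one table out of the tags, something like the sketch below might help. It assumes that tags() on the PFMatrixList returns a list named by matrix ID, with one list of tags per profile. The function name tags_to_table, the field names and the mock input (made-up IDs and values) are hypothetical, so please check names(tags(JASPAR_PFMatrixList)[[1]]) for the fields you really have.

```r
# Flatten a named list of per-profile tag lists into one data.frame.
# Multi-valued tags are joined with "::"; missing tags become NA.
tags_to_table <- function(tag_list, fields) {
  rows <- lapply(tag_list, function(tg) {
    vals <- vapply(fields, function(f) {
      v <- tg[[f]]
      if (is.null(v) || length(v) == 0) NA_character_
      else paste(unlist(v), collapse = "::")
    }, character(1))
    # One-row data.frame whose columns are named after the requested fields
    as.data.frame(as.list(vals), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, unname(rows))
  data.frame(matrix_id = names(tag_list), out,
             row.names = NULL, stringsAsFactors = FALSE)
}

# Mock input standing in for tags(JASPAR_PFMatrixList); all values are made up
mock_tags <- list(
  MA0000.1 = list(family = "Fam1", species = c("9606" = "Homo sapiens"), type = "SELEX"),
  MA0000.2 = list(family = c("Fam1", "Fam2"), type = "ChIP-seq")
)

tab <- tags_to_table(mock_tags, c("family", "species", "type"))
print(tab)
```

If the assumption about tags() holds, tags_to_table(tags(JASPAR_PFMatrixList), c("family", "species", "type")) would give one row per profile, which can then be written out with write.table().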