Curl timeout when fetching anything related to paralogs

Kyle Duyck

May 7, 2020, 1:00:01 PM
to biomart-users
This query worked before the migration but has since stopped working. I have tested multiple species, Ensembl versions, and mirrors, all to no avail. It seems that all paralog queries are broken.

library(biomaRt)

mymart <- useEnsembl(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
# Request paralog attributes for every gene, with no filter (this is the query that times out)
mydatabase <- getBM(attributes = c("ensembl_gene_id",
                                   "hsapiens_paralog_ensembl_gene",
                                   "hsapiens_paralog_subtype",
                                   "hsapiens_paralog_orthology_type"),
                    mart       = mymart,
                    uniqueRows = FALSE)

Kyle Duyck

May 22, 2020, 5:08:45 PM
to biomart-users
I found that this problem is partly due to the size of the query, but also to a possible oversight in biomaRt's query batching.

The solution below filters to only those genes with a known paralog, and it might seem that this solves the timeout by reducing the query size (speculated to be related to a Cartesian product). In fact, the filter reduces the query size by less than 5%. The two-step solution also does not change the Cartesian product: it gives results identical to the single query when the timeout limit in getBM's member method submitQueryXML is raised above 300 seconds, which gives such large queries enough time to finish. So why does the solution below work?

Matthew's two-step approach, which first identifies the gene IDs with paralogs and then queries the paralogs by gene ID, triggers a condition in getBM's member method generateFilterXML that splits the job into batches of 5000 values using the relatively recently added splitValues member method. Because each of these small batches is sure not to time out, the batches are queried individually and the results are stitched together without issue. This leaves me wondering why this is not the default for attributes such as paralogs, which are undoubtedly going to time out for most users.
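
To make the batching idea concrete, here is a rough sketch of the split-and-stitch pattern in plain R. It is not biomaRt's actual code: run_in_batches and fake_query are hypothetical names I made up for illustration, and fake_query only stands in for a real getBM() call so that the example runs offline. The batch size of 5000 is taken from the behaviour described above.

```r
# Hypothetical helper, not part of biomaRt: run a query function over
# fixed-size batches of filter values and stitch the results together.
run_in_batches <- function(values, query_fn, batch_size = 5000) {
  # Nothing to split: hand the empty input straight to the query function
  if (length(values) == 0) {
    return(query_fn(values))
  }
  # Assign each value to a batch: 1,1,...,2,2,... in chunks of batch_size
  batch_id <- ceiling(seq_along(values) / batch_size)
  batches  <- split(values, batch_id)
  # Query each batch on its own, then row-bind the per-batch data frames
  results <- lapply(batches, query_fn)
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Stand-in for a real getBM() call so the sketch runs without a network
fake_query <- function(ids) {
  data.frame(id = ids, n_char = nchar(ids), stringsAsFactors = FALSE)
}

# 12000 made-up IDs are split into batches of 5000, 5000 and 2000
ids <- sprintf("GENE%05d", 1:12000)
res <- run_in_batches(ids, fake_query)
```

With biomaRt, query_fn could presumably be something like function(ids) getBM(attributes = para.attr, filters = "ensembl_gene_id", values = ids, mart = human), using the objects from the Biostars solution below. I have not checked how closely this matches what getBM does internally.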

Solution from Biostars: https://www.biostars.org/p/331872/

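library(biomaRt)
library(data.table)

# Assumed setup, not shown in the snippet: the mart and its attribute table.
human <- useEnsembl(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
attr  <- as.data.table(listAttributes(human))

# All paralog attributes, plus the gene ID to key them on
para.attr <- c("ensembl_gene_id", attr[grepl("paralog", name), name])

# Step 1: IDs of the genes that have at least one paralog
hgid <- getBM(attributes = "ensembl_gene_id",
              filters    = "with_hsapiens_paralog",
              values     = TRUE,
              mart       = human)$ensembl_gene_id

# Step 2: query the paralogs by gene ID
para <- getBM(attributes = para.attr,
              filters    = "ensembl_gene_id",
              values     = hgid,
              mart       = human)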