Processing Genemania data with Gallia


Anthony Cros

Feb 26, 2021, 10:41:39 AM
to genemani...@googlegroups.com

Hi,

I'd like to offer a restructured version of your data as part of the examples for my library (see announcement on Biostars). The processing would likely resemble this code, though with more "wide" transformations involved (notably joins), and the result would look something like this gist. The idea is to make the data more search-indexable, which in turn allows creating faceted searches such as ICGC's (a project I formerly worked on).

 

In light of this, I have a few questions:

- Is there any specific license agreement I should be aware of? Any attribution format you would prefer?

- Are all interactions symmetrical and if not, what is the best way to determine which ones are not?

- I initially wanted to use the "combined network" files, but I don't see how the two files could be reliably joined, as they only share the weight column (at least for the current version). Unlike the uncombined counterparts, the filenames cannot be leveraged to determine group/network.

Regards,

 

Anthony

Gary Bader

Feb 26, 2021, 12:12:44 PM
to genemani...@googlegroups.com

Hi Anthony,
Sounds great. The license follows that of the original source - almost all are free to use. All the interactions are symmetric. The combined network is a new type of network that combines all the other networks using this method: https://academic.oup.com/bioinformatics/article/26/14/1759/177586

Best,
Gary
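
Since all interactions are stated to be symmetric, a flat edge list can be expanded so that each gene lists every partner, whichever side of the pair it was recorded on. The following is only a minimal sketch of that idea in Python, not the code used by gallia-genemania: the function name `nest_symmetric_edges`, the `targets` field, and the in-memory `(gene_a, gene_b, weight)` tuples are assumptions made for illustration.

```python
from collections import defaultdict


def nest_symmetric_edges(edges, network):
    """Group undirected (gene_a, gene_b, weight) edges by gene.

    Hypothetical helper: every edge is recorded under both of its
    endpoints, because the interactions are symmetric.
    """
    by_gene = defaultdict(dict)
    for gene_a, gene_b, weight in edges:
        # A set avoids recording a self-interaction twice.
        for src, dst in {(gene_a, gene_b), (gene_b, gene_a)}:
            by_gene[src].setdefault(dst, []).append(
                {"weight": weight, "network": network}
            )
    # One nested row per gene, with targets in a stable (sorted) order.
    return [
        {
            "_id": gene,
            "targets": [
                {"target": target, "context": context}
                for target, context in sorted(targets.items())
            ],
        }
        for gene, targets in sorted(by_gene.items())
    ]
```

Calling it with two edges that share a gene yields one row per gene, and the shared gene carries both partners.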


Anthony Cros

Mar 4, 2021, 3:09:54 PM
to genemania-discuss

Hi Gary,

Thanks for clarifying what the combined network is; I had misunderstood it to be a naive concatenation. I ended up crawling the full set of files instead and just pushed the resulting code: https://github.com/galliaproject/gallia-genemania

The result data file for Homo_sapiens is 3.5GB (gzipped) and each row basically looks like the following (dummified+prettified):

  {
    "_id": "ENSG00000123456",
    "co_expression": [
      { "target": "ENSG00000654321",
        "context": [
          { "weight": 0.021,
            "network": "Meier-Seiler-2009",
            "source": "GEO",
            "pubmed": "12345678" },
          { "weight": 0.019,
            "network": ... },
          ...
        ] },
      ...
    ],
    "predicted": [ ... ],
    ...
  }

Would it be ok for me to post the full data somewhere? or a subset thereof?

Notes:

1- The code for gallia-genemania will switch to an Apache2 license once gallia-core switches to BSL (in the works). The result data could be Apache2-licensed right away however.
2- There are results for both the data transformations and their schema counterparts (see schema result), per this section of the documentation.
3- This version uses my poor man's scaling, which takes about 8 hours on my machine (16 cores/64GB mem) and comes with the following caveats:
    a. It wouldn't run on Windows as-is (see note nt210302124437 in the code)
    b. I actually enabled some of the hacks commented out here to shave off extra runtime
    c. More generally, performance is likely to be subpar until I can get to these tasks
    d. I will provide a Spark-powered version of this processing next (I guess the "rich man's counterpart") 

Regards,

Anthony
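
For readers who want to consume rows shaped like the dummified sample in the message above, here is a minimal Python sketch. It assumes the result file is gzipped JSON Lines (one object per line) and that rows use the `_id`, `target`, `context` and `weight` fields as shown in the sample; the helper names `read_rows` and `strongest_edges` are hypothetical and not part of Gallia.

```python
import gzip
import json
from typing import Iterator, List, Tuple


def read_rows(path: str) -> Iterator[dict]:
    """Stream one JSON object per line from a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield json.loads(line)


def strongest_edges(row: dict, group: str = "co_expression") -> List[Tuple[str, float]]:
    """Return (target, max weight) pairs for one group, strongest first.

    Assumes each entry of the group has a "target" and a "context" list
    whose items carry a numeric "weight", as in the sample row.
    """
    edges = []
    for entry in row.get(group, []):
        weights = [item["weight"] for item in entry.get("context", [])]
        if weights:
            edges.append((entry["target"], max(weights)))
    return sorted(edges, key=lambda edge: edge[1], reverse=True)
```

Streaming line by line keeps memory use bounded by the size of a single row, which matters for a file of several gigabytes.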

Gary Bader

Mar 4, 2021, 9:14:05 PM
to genemania-discuss

Hi Anthony - great! Sure, feel free to remix the data and share. FYI, we haven't updated the database in a while, but are working on an update now - definitely this year.

Best,
Gary

Anthony Cros

Mar 5, 2021, 12:13:26 PM
to genemani...@googlegroups.com

Hi Gary,


I uploaded the full set of results: https://github.com/galliaproject/gallia-genemania/tree/init/result


I also included the first JSON object in a prettified form: https://github.com/galliaproject/gallia-genemania/tree/init/result/data/ENSG00000000003.pretty.json


For those unfamiliar with git-lfs:


  # requires git-lfs to be installed
  git clone https://github.com/galliaproject/gallia-genemania.git
  cd gallia-genemania
  git lfs pull                                             # fetch all LFS-tracked files
  git lfs pull --include "result/data/genemania.jsonl.gz"  # or only the one file


I opted for CC-BY-4.0 in the end (for the data): https://github.com/galliaproject/gallia-genemania/blob/init/result/LICENSE-CC-BY-4.0.txt; in a nutshell, this means everyone can share/adapt the data as long as they provide adequate credit (see details). I encourage anyone interested in re-using the data to reach out to me regardless.


A few more comments about the actual run:

- correction: the full run takes 4.5h on my machine, not 8h as previously said

- clarification: the hack pertaining to parallelizing Iterator processing is not being used, mostly because the grouped objects are quite big (which makes this sub-grouping too prone to OOM errors)

- wrapping GNU sort: I added more details to the documentation in this section


Regards,


Anthony


PS: happy to rerun it once the new data is available
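
The general idea behind sorting first and grouping afterwards can be illustrated with a short Python sketch. This is not how Gallia implements it; it only shows that once records are sorted by key (for instance by an external tool such as GNU sort), grouping needs a single streaming pass that holds one group in memory at a time. The function name `group_sorted` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_sorted(pairs):
    """Group (key, value) pairs that are ALREADY sorted by key.

    Only one group is materialized at a time, so memory use is bounded
    by the largest group rather than by the whole input.
    """
    for key, chunk in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in chunk]
```

Memory is then driven by the largest single group, which is consistent with the observation above that very large grouped objects remain the main OOM risk.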



Gary Bader

Mar 5, 2021, 4:56:19 PM
to genemani...@googlegroups.com

Hi Anthony - awesome! Thanks so much - I hope people will use it. We’ll announce the new data here when ready.

Best,
Gary

Anthony Cros

Mar 31, 2021, 10:41:04 AM
to genemania-discuss
Hi Gary,

I posted the Apache Spark-powered counterpart of this processing as a separate repo (which depends on the earlier repo).

People can try it out via this script, provided they are set up with the AWS CLI and don't mind the AWS charges (it makes use of S3 and EMR). See also the driver (GeneManiaSparkDriver.scala) and the actual transformations in the Spark-unaware GeneMania.scala (from the initial repo).

It's possible to bring the run down to ~1h, although it requires quite a lot of worker instances due to the overhead incurred by distributing computation.

Regards,

Anthony

PS: switched the library's license to BSL (see license FAQ): in a nutshell it’s free if you are essential or small (e.g. genemania qualifies as both)

Gary Bader

Apr 2, 2021, 11:10:47 PM
to genemania-discuss

Thanks Anthony. Hopefully people see it in this email forum (or elsewhere) and find it valuable.

Best,
Gary
