Automating data ingestion


Lachlan Musicman

May 22, 2015, 11:15:32 AM
to genome...@soe.ucsc.edu
Hola,

So, following the "Building a new genome database" instructions
http://genomewiki.ucsc.edu/index.php/Building_a_new_genome_database

I've almost got them automated.

One of the last things I am looking at is the use of symlinks in the howto.

In particular, step 3 (fasta to 2bit) and step 10 (gc5Base data).

I am wondering why the files aren't copied across so that there is a single definitive root for all necessary files?

(Given that the folder from which the various steps are run is more of a "preparatory" directory, containing transient files that are only needed once, to load into the SQL database, and that will be overwritten, e.g. chrom.sizes.)

cheers
L.

------
let's build quiet armies friends, let's march on their glass towers...let's build fallen cathedrals and make impractical plans

- GYBE

Lachlan Musicman

May 22, 2015, 11:15:36 AM
to genome...@soe.ucsc.edu
I guess the next question (or follow up question) is:

It seems to make sense to me that I put my trackDB directory in /gbdb/ along with all the other files. Then everything that needs keeping is in one handy directory, rather than scattered across the file system.

I ask because the documentation seems to suggest I should keep these files within the source tree, which is....unusual.

Or are the .ra files only used the once - when loading into the genome's db? If that's the case, then they can be disposed of I guess. But if they are coupled with html files, then they should be well ordered and kept...hmmm.

Opinions welcome.

Note: this is a non mirror, independent GB setup.




Matthew Speir

May 22, 2015, 7:23:11 PM
to Lachlan Musicman, genome...@soe.ucsc.edu
Hi Lachlan,

Thank you for your questions about setting up your own assemblies in a Genome Browser mirror.


>I am wondering why the files aren't copied across so that there is a single definitive root for all necessary files?

Soft links allow us to place only final product data files in gbdb while also keeping all of our intermediate products for future reference. We could copy the files instead of soft-linking, but many of the final products are quite large (especially 2bit and bigWig files). If you want to delete your working directories and just place the data files in /gbdb/ directly, that's your choice.
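The soft-link approach described above could be scripted along these lines. This is only a minimal sketch, not the script UCSC uses: the helper name `link_into_gbdb` and the example paths are hypothetical, and it assumes a working directory that holds the final product file and a per-assembly directory under /gbdb/.

```python
import os
from pathlib import Path


def link_into_gbdb(product: Path, gbdb_dir: Path) -> Path:
    """Symlink one final product file into gbdb_dir and return the link path.

    The data stays in the working directory; only a link lands in gbdb_dir.
    """
    product = product.resolve()
    if not product.is_file():
        raise FileNotFoundError(product)
    gbdb_dir.mkdir(parents=True, exist_ok=True)
    link = gbdb_dir / product.name
    if link.is_symlink():
        # Replace a stale link left over from an earlier run.
        link.unlink()
    elif link.exists():
        # Never clobber a real file that someone copied in by hand.
        raise FileExistsError(f"{link} is a regular file, not a symlink")
    os.symlink(product, link)
    return link


# Hypothetical usage; both paths are placeholders for your own layout:
# link_into_gbdb(Path("/data/genomes/myDb/myDb.2bit"), Path("/gbdb/myDb"))
```

If you would rather have a single self-contained /gbdb/ tree, the `os.symlink` call could be swapped for `shutil.copy2`, at the cost of duplicating large files such as 2bit and bigWig.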


>It seems to make sense to me that I put my trackDB directory in /gbdb/ along with all the other files.
>...

>I ask because the documentation seems to suggest I should keep these files within the source tree, which is....unusual.

You can keep these files wherever you'd like, and the documentation even discourages you from storing them in the source tree. From kent/src/product/README.trackDb, starting at line 81:
   
    To work independently of the UCSC source tree,
    establish your own trackDb.ra files outside the UCSC source tree in
    a directory of your choice under your control.

Here is a link to README.trackDb on the web, starting at the relevant section about trackDb_local: http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/product/README.trackDb;h=acb06cc6f599384d15c4d757042e5ec3fd136a3f;hb=master#l74. The rest of the instructions in that README explain how you can load your own tracks using your trackDb_local files.

We don't keep trackDb.ra files in /gbdb/ because, again, we want to keep those directories clean of extraneous files and keep only links to the raw data there.


>Or are the .ra files only used the once - when loading into the genome's db? If that's the case, then they can be disposed of I guess. But if they are coupled with html files, then they should be well ordered and kept...hmmm.

Again, it's your choice as to how you want to run the mirror. You get to choose which files you want to keep and how easy you want to make it to maintain. The trackDb.ra files are used in conjunction with the track description html files when loading new tracks into the trackDb table. After the tracks have been loaded into the trackDb table, you could, in theory, delete both the ra and html files. However, we strongly recommend that you keep these trackDb.ra and track description html files. Keeping the files will make it easier to add new tracks in the future, as well as to update the current tracks. Without the original files, you would also run the risk of completely deleting all of your previous tracks when you add a new track using the track loader scripts.
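If you do keep the ra and html files together, a small check like the one below can help keep them in order. It is a rough sketch under stated assumptions, not part of the UCSC tools: it assumes each track stanza in trackDb.ra contains a line of the form `track <name>`, and that the description for a track lives in `<name>.html` in the same directory as the ra file. The function names are hypothetical.

```python
from pathlib import Path


def track_names(ra_path: Path) -> list[str]:
    """Return the track names declared in a trackDb.ra file, in file order."""
    names = []
    for line in ra_path.read_text().splitlines():
        words = line.split()
        # A stanza declares its track as "track <name>"; commented-out
        # lines start with "#" and so never match here.
        if len(words) >= 2 and words[0] == "track":
            names.append(words[1])
    return names


def missing_descriptions(ra_path: Path) -> list[str]:
    """Return tracks with no <name>.html next to the ra file (assumed layout)."""
    html_dir = ra_path.parent
    return [
        name
        for name in track_names(ra_path)
        if not (html_dir / f"{name}.html").is_file()
    ]


# Hypothetical usage; the path is a placeholder for your own trackDb directory:
# print(missing_descriptions(Path("/data/trackDb_local/myDb/trackDb.ra")))
```

Running this before loading makes it easier to spot a track whose description file has gone missing.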

I hope this is helpful. If you have any further questions, please reply to genome...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group


Lachlan Musicman

May 26, 2015, 11:29:11 AM
to Matthew Speir, genome...@soe.ucsc.edu
Thank you for the clarification.

cheers
L.

