Newbie questions, genome data set assembly number?


Mary Mangan

Aug 2, 2010, 3:24:42 PM
to gaggle-discuss
Hey folks--

I'm just getting started (but I swear I read the
documentation...mostly...).

So I loaded up a new Gaggle browser and chose the Homo sapiens genome. I
chose that from the new project wizard. But it only has the species
name--I don't know which assembly it's pulling. Is there a way for me
to know that? Is there a way to choose alternates?

I suspect I'm using UCSC Mar 2006 because if I click a gene to view
there that's what loads. But that is also what may be in my cookies
because I do use that most of the time right now still.

But ok. I have my genes loaded up. I wanted to import a track. I
pulled down TFBS conserved for chr 1, and sent it to Galaxy to convert
to GFF. I download that file, and try to pull it into my Gaggle
interface. Here's the message I get:
Error loading track:
java.lang.RuntimeException
java.lang.ArrayIndexOutOfBoundsException: 6

By the way--I'm also assuming TFBS should be in the Mar 06 assembly
coordinates--but maybe that's wrong. Any other hints?

Thanks for any guidance.

Mary

(Oh, I can do screenshots or a little movie of what I'm doing if that
would help--let me know. I could also give you my Galaxy workflow if
you want to see that.)

Christopher Bare

Aug 2, 2010, 6:39:36 PM
to gaggle-...@googlegroups.com, Mary Mangan
Hi Mary,

You're correct that you're getting the hg18 assembly of the human genome from March 2006. UCSC's genome browsers don't make the metadata for their assemblies available through the table-browser interface, which is how I'm pulling their data. I can get the eukaryotic data through the public MySQL server and the microbial genomes by emailing a post-doc. Now my dirty laundry is revealed! Anyway, I'll try to get that data updated shortly. I haven't given much thought to allowing the user to select a specific assembly, but that's certainly a good suggestion.

One issue you'll likely run into is that our gene models are very simple - just start and end position - due to our focus on prokaryotes. Full support for rendering exons is a yet-to-be-developed feature.

Regarding the GFF import, I think I have a decent guess about what's going wrong. The Gaggle Genome Browser's support for loading data from files was coded up on an "as needed" basis, meaning that it worked for our purposes at the time. And, GFF is a particularly squirrelly format. We support only a very specific variant of GFF. I put some new docs up for that here:


The 7th column (or 6th if you start counting at 0) is supposed to contain strand. Is it possible that your TFBS data has no entry in this column? My code interprets '.' as no specific strand. If your TFBS data is open, I'd be glad to take a look.
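For anyone hitting the same error: an index of 6 being out of bounds suggests that some line in the file split into fewer than seven fields. A quick way to find such lines before loading is a small check like the one below. This is only a minimal sketch in Python, not the browser's own loader; the function name `check_gff_lines` and the choice to skip blank and `#` lines are assumptions made for illustration.

```python
def check_gff_lines(lines, min_columns=7):
    """Return (line_number, problem) pairs for lines that look malformed.

    Assumes tab-separated GFF with strand in the 7th column (index 6).
    Blank lines and lines starting with '#' are skipped.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < min_columns:
            problems.append((number, f"only {len(fields)} tab-separated columns"))
            continue
        strand = fields[6]
        if strand not in {"+", "-", "."}:
            problems.append((number, f"unexpected strand value {strand!r}"))
    return problems
```

Calling it on an open file, for example `check_gff_lines(open("my_track.gff"))` with a hypothetical file name, returns an empty list when every feature line has at least seven columns and a recognizable strand.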

I hope this helps,

-Chris






--
------------------------------------------------
 J. Christopher Bare
 Software Engineer, Baliga Lab
 Institute for Systems Biology
------------------------------------------------

Mary Mangan

Aug 2, 2010, 8:03:52 PM
to gaggle-...@googlegroups.com

Hey Chris—

 

Here’s my data: http://main.g2.bx.psu.edu/u/Mary%20M%20from%20OpenHelix/h/marytfbsforgaggle

 

I did a table browser query for  the conserved TFBS on chromosome 1 (just to keep it reasonable). Then picked BED format, sent to Galaxy. Then at Galaxy I converted to GFF (their GFF, whatever that means vs yours). I looked at some of the formatting docs, but didn’t notice anything in the columns right off. But I’ll take a look at the strand column tomorrow, and those other docs you put up—thanks for that. A quick pass suggests they have strand differently located.  That’s easily altered in Galaxy and I can try again.

 

Thanks!

 

Mary

Christopher Bare

Aug 3, 2010, 10:15:15 AM
to gaggle-...@googlegroups.com, Mary Mangan
Hi Mary,

It looks like a simpler problem than my first guess. Let me throw in a couple lines of code and I'll get back to you shortly.

-chris

Christopher Bare

Aug 3, 2010, 12:26:08 PM
to gaggle-...@googlegroups.com, Mary Mangan
Hi Mary,

That GFF file should load now. There were two separate bugs. The first was handling comment lines prefixed with a # character. The second was in the attributes column, which I was expecting to be in the form key1=value1;key2=value2;...

Those are fixed in build 0.9.164.
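To illustrate the two cases, here is a minimal Python sketch of a reader that skips `#` comment lines and tolerates attribute text that is not in key=value form. It is not the code from the browser, which is not shown in this thread; the function names and the `_raw` fallback key are hypothetical.

```python
def parse_attributes(text):
    """Parse 'key1=value1;key2=value2' into a dict.

    Parts without '=' are kept under the hypothetical '_raw' key
    instead of raising an error.
    """
    attributes = {}
    for part in text.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            key, value = part.split("=", 1)
            attributes[key.strip()] = value.strip()
        else:
            attributes.setdefault("_raw", []).append(part)
    return attributes


def parse_gff(lines):
    """Yield one dict per feature line, skipping blanks and '#' comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # too short to be a feature; a real loader might report it
        yield {
            "seqname": fields[0],
            "source": fields[1],
            "feature": fields[2],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "score": fields[5],  # left as text because it may be '.'
            "strand": fields[6],
            "frame": fields[7],
            "attributes": parse_attributes(fields[8]) if len(fields) > 8 else {},
        }
```

Keeping unrecognized attribute text rather than rejecting the line is one possible design choice; it lets files from other converters load even when their last column is free-form.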

Give that a try and see if it works for you. If it does, your next problem will be a nice way to visualize that data. The program plots data points as circles by default with their y position proportional to the value (the score column, in this case). I've been using either triangular markers or vertical bars to mark TF positions, putting each TF in its own track where it could be colored separately. But, that's probably not a reasonable approach for this data. Do you have in mind a nice way to visualize this kind of data?

Thanks for the bug reports.

-- Chris

Mary Mangan

Aug 3, 2010, 1:14:32 PM
to gaggle-...@googlegroups.com

Hey Chris—thanks! That did it. Loaded up.

 

I was going to select triangular markers, actually. I wasn’t sure how it was going to look. I’m fearless and I’ll try anything with software first—including breaking it (!), and then use that to evaluate what I want next.  That was really just a first test of loading some data in.

 

Now looking at it I think I’d like to go back and do it as separate tracks for each TF (or some interesting subset thereof).  That’s also an easy query + filter at UCSC, or I might be able to just re-sort the file at Galaxy and extract that. I don’t know. I’ll have to think about it now. But now that the upload works for me I can try other stuff.  With that stored history at Galaxy I can probably just pull some subsets out. That will probably be my first pass.
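A rough sketch of that splitting step, for anyone who would rather do it locally than in Galaxy: the Python below groups GFF feature lines by the value in one column, so each TF can be saved as its own file and loaded as a separate track. Which column actually holds the TF name depends on how the file was converted, so the default `column=2` (the feature column) is an assumption, and `split_by_column` is a made-up name.

```python
from collections import defaultdict


def split_by_column(lines, column=2):
    """Group GFF feature lines by the value in one tab-separated column.

    Blank lines, '#' comments, and lines too short to have the chosen
    column are ignored. Returns a dict mapping value -> list of lines.
    """
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue
        groups[fields[column]].append(line)
    return dict(groups)
```

Each list in the result can then be written out with `writelines` to a file of its own and imported as one track per TF.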

 

One other newbie question: how can I tell which tracks are which? That’s not obvious to me. I loaded up the sample view of Halobacterium to look around but I wasn’t sure which track was which.  And if I load up the individual TFs I can see that might confuse me too.  I mean, I can write down the color codes (Red = MyoD or something), but is there a way from the interface?

 

Mary

 

 

 


Christopher Bare

Aug 3, 2010, 1:46:02 PM
to gaggle-...@googlegroups.com
Hi Mary,

On Tue, Aug 3, 2010 at 10:14 AM, Mary Mangan <mma...@openhelix.com> wrote:
>
> Hey Chris—thanks! That did it. Loaded up.
>
>

> [...]


>
> One other newbie question: how can I tell which tracks are which? That’s not obvious to me. I loaded up the sample view of Halobacterium to look around but I wasn’t sure which track was which.  And if I load up the individual TFs I can see that might confuse me too.  I mean, I can write down the color codes (Red = MyoD or something), but is there a way from the interface?
>
>
> Mary
>
>

Legends are not a strong point of my program. The reason is that
tracks are drawn in a very free-form way. For an early use-case I
needed to draw tracks on top of each other and overlapping but offset.
That means that any point on the screen might intersect with several
tracks.

To compensate, what I've been doing is making legends by hand, such as
the ones in the demos here:

http://gaggle.systemsbiology.net/docs/geese/genomebrowser/demo/b_anthracis/

While that's not entirely sustainable, the best in-program means I
have so far is the Track-Info item on the right-click menu. That shows
information for all tracks at the point where the mouse clicks, which
may leave you more confused than before.

Mary Mangan

Aug 3, 2010, 2:35:23 PM
to gaggle-...@googlegroups.com
Ok, that's fine--just wanted to be sure I hadn't missed it somehow. I can do
my own.

Thanks for the help--I can get much further now for what I want to do.

Mary
