Problem with the search function for gene names in an assembly hub


David da Silva Pires

May 20, 2015, 12:18:55 PM
to gen...@soe.ucsc.edu
Hi, supers.

I am trying to discover what is wrong with the search function for gene names at the following assembly hub:

http://www.vision.ime.usp.br/~davidsp/hub/geneNetwork2/hub.txt

If I search for some genes, everything is OK. For example:
Smp_186980     Chr_1:11159-12750
Smp_193370     Chr_2:294607-298727
Smp_133840     Chr_ZW:663448-722280

But for other genes, the result is incorrect:

Smp_206290    SC_2440:1297-1719    (instead of SC_2388:1297-1719)
Smp_206300    mitochondria:906-1124    (instead of SC_2440:906-1124)
Smp_183620    SC_2388:859-1140    (instead of SC_2321:859-1140)

Note that only the scaffold is incorrect, not the coordinates. Note also that the examples I chose are not random: the correct ones are on the largest chromosomes, while the incorrect ones are on small scaffolds.

It seems that the index built by the "-extraIndex=name" parameter of bedToBigBed is wrong. Can you help me solve this problem?

Thanks in advance.

--
David da Silva Pires

David da Silva Pires

May 20, 2015, 1:57:24 PM
to gen...@soe.ucsc.edu
Hello.

One of my colleagues noticed that all the genes for which the search succeeds are located on sequences whose names start with "Chr_" (the longest ones). The searches that fail are all for genes on scaffolds or mitochondrial DNA (the shortest ones).

Is this an undocumented feature? Must the chromosome names of the genome, and consequently of all the tracks for that genome, start with "Chr_" for the search tool to work?

Thanks.

Jonathan Casper

May 20, 2015, 3:38:19 PM
to David da Silva Pires, gen...@soe.ucsc.edu

Hello David,

Thank you for your question about a problem with the search index for your bigBed. We are able to see the issue with your bigBed file, but have been unable to reproduce it with anything we create ourselves. Are you able to send us the data files and the program binaries (e.g., bedToBigBed) that you used to construct your smps.bb file? You can send them to me privately to avoid sharing with the mailing list if you prefer.

One of our engineers notes that search names are not required to start with "Chr_"; we suspect that another bug is responsible.

If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

David da Silva Pires

May 22, 2015, 4:13:22 PM
to gen...@soe.ucsc.edu
Hi, Jonathan.

I sent you all the files that you asked for.

If possible, let's continue the thread here in order to help users who might have the same problem.

Thank you very much for all your help.

Best regards.

--
David da Silva Pires

David da Silva Pires

May 26, 2015, 11:21:36 AM
to gen...@soe.ucsc.edu
Hello.

Maybe some more information could be helpful:

* The commands "sed" and "sort" correspond to the versions that are distributed with Kubuntu Linux 15.04.
* The commands "twoBitInfo" and "bedToBigBed" correspond to the versions that are distributed with GBiB (Ubuntu 14.04.1 LTS), updated with the command ~browser/updateBrowser.

If you need anything else, just tell me.

Greetings.

--
David da Silva Pires

Jonathan Casper

May 26, 2015, 1:49:35 PM
to David da Silva Pires, gen...@soe.ucsc.edu

Hello David,

One of our engineers looked into this issue and reports that the problem has to do with your environment settings. Your LC_COLLATE environment variable is most likely set to something like en_US.UTF-8. The problem is that this does not provide a case-sensitive sort, which our tools expect to find. Our engineer is now updating bedToBigBed to explicitly check for unsorted input. You can make sure that you get case-sensitive sorting by re-running your sort command as

$ LC_COLLATE=C sort -k1,1 -k2,2n smps-shortChromNames.bed > smps-shortChromNames-sorted.bed
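
To illustrate why the collation setting matters, here is a minimal sketch that sorts a few toy BED lines in the C locale and then verifies the order with `sort -c`. The scaffold names are taken from this thread; the "mitochondria" entry (coordinates and the name gene_mt) is invented for illustration. It uses LC_ALL rather than LC_COLLATE because LC_ALL, when set, takes precedence over LC_COLLATE.

```shell
#!/bin/sh
# Toy BED lines: scaffold names come from the thread; the mitochondria
# entry (coordinates and gene name) is invented for illustration.
bed=$(mktemp)
sorted=$(mktemp)
printf 'mitochondria\t100\t200\tgene_mt\n' >  "$bed"
printf 'SC_2440\t906\t1124\tSmp_206300\n'  >> "$bed"
printf 'Chr_1\t11159\t12750\tSmp_186980\n' >> "$bed"
printf 'SC_2388\t1297\t1719\tSmp_206290\n' >> "$bed"

# LC_ALL overrides LC_COLLATE, so this forces byte-order sorting.
LC_ALL=C sort -k1,1 -k2,2n "$bed" > "$sorted"
cut -f1 "$sorted"

# sort -c exits with status 0 only if the file is already in order.
if LC_ALL=C sort -c -k1,1 -k2,2n "$sorted"; then
    sorted_ok=yes
else
    sorted_ok=no
fi
echo "sorted: $sorted_ok"
```

In the C locale the first column comes out as Chr_1, SC_2388, SC_2440, mitochondria, because uppercase letters precede lowercase ones in byte order. Running the same `sort -c` check on a file before passing it to bedToBigBed is a cheap way to catch a locale-dependent sort.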

I hope this is helpful.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

David da Silva Pires

May 28, 2015, 12:13:31 PM
to Jonathan Casper, gen...@soe.ucsc.edu
Hi, Jonathan.

Thank you very much for solving this problem.

You were right: look at the output of the locale command in my GBiB shell:

browser@browserbox:~$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=C
LANGUAGE=
LC_CTYPE=pt_BR.UTF-8
LC_NUMERIC=pt_BR.UTF-8
LC_TIME=pt_BR.UTF-8
LC_COLLATE="C"
LC_MONETARY=pt_BR.UTF-8
LC_MESSAGES="C"
LC_PAPER=pt_BR.UTF-8
LC_NAME=pt_BR.UTF-8
LC_ADDRESS=pt_BR.UTF-8
LC_TELEPHONE=pt_BR.UTF-8
LC_MEASUREMENT=pt_BR.UTF-8
LC_IDENTIFICATION=pt_BR.UTF-8
LC_ALL=

These are the values that are set on my machine (I live in Brazil). I really didn't know that such environment variables could be carried over from one shell to another via ssh. I thought that all the environment variables were reset after logging in to GBiB.
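
A likely explanation, assuming the stock Debian/Ubuntu OpenSSH configuration, is the SendEnv/AcceptEnv pair: the client configuration ships with "SendEnv LANG LC_*" and the server configuration with "AcceptEnv LANG LC_*", so the locale variables of the local shell are forwarded to the remote session. The sketch below lists those directives. The helper name show_env_forwarding is hypothetical, and the file paths are the usual defaults, which may differ on other systems.

```shell
#!/bin/sh
# Hypothetical helper: print any active SendEnv/AcceptEnv directives found
# in the given OpenSSH configuration files. Unreadable files are skipped.
show_env_forwarding() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        grep -iE '^[[:space:]]*(SendEnv|AcceptEnv)[[:space:]]' "$f" |
            sed "s|^|$f: |"
    done
    return 0
}

# Usual default locations (an assumption; adjust for your system).
show_env_forwarding /etc/ssh/ssh_config /etc/ssh/sshd_config
```

Commenting out the SendEnv line on the client, or the AcceptEnv line on the server, should stop the locale from being forwarded.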

Anyway, with your explanation, I put the following lines at the bottom of ~browser/.bashrc on my GBiB:

# Define custom locale settings.
export LANG="C"
export LANGUAGE="C"
export LC_MESSAGES="C"
export LC_CTYPE="C"
export LC_NUMERIC="C"
export LC_TIME="C"
export LC_COLLATE="C"
export LC_MONETARY="C"
export LC_PAPER="C"
export LC_NAME="C"
export LC_ADDRESS="C"
export LC_TELEPHONE="C"
export LC_MEASUREMENT="C"
export LC_IDENTIFICATION="C"
export LC_ALL="C"
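
A shorter variant may be enough, since LC_ALL, when set, takes precedence over LANG and every other LC_* variable. The sketch below relies on that standard precedence rule and has not been tested on GBiB itself; the three sequence names are taken from this thread.

```shell
# Minimal alternative for ~/.bashrc: LC_ALL overrides LANG and all LC_* settings.
export LC_ALL=C

# Quick check: in the C locale lowercase letters sort after uppercase ones,
# so "mitochondria" should be the last name.
last=$(printf 'mitochondria\nSC_2440\nChr_1\n' | sort | tail -n 1)
echo "$last"
```

If the last line printed is "mitochondria", sort is using byte order, which is what bedToBigBed expects.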

After reloading .bashrc, I rebuilt my track. There was only one difference in the sort order: the entries in the BED file located on the chromosome called "mitochondria" moved from the middle to the bottom of the file. As you said, this happened because it is the only sequence name that starts with a lowercase letter. This correction was sufficient to fix all the searches that were failing.

It would be great if bedToBigBed were able to sort BED files itself. Anyway, the solution that I adopted is acceptable for GBiB users, and I suggest including those lines in the default ~browser/.bashrc.

Thanks again.