One line of 50 Ns is missing in chr10.fa for downloading

Cheng Cui

Nov 19, 2012, 5:27:35 PM
to gen...@soe.ucsc.edu, Rama
Hi,

In the mouse mm9 assembly, the chr10.fa.gz file downloaded from this page:

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/

seems to be missing one line of 50 N's.

If you blast a sequence against this file, the reported position is shifted
by 50 base pairs compared with the position you get when you blast using
the UCSC Genome Browser web page.

Could someone check this? I would be grateful to be informed of the
result.

Thanks.

Cheng

Hiram Clawson

Nov 23, 2012, 7:20:35 PM
to Cheng Cui, gen...@soe.ucsc.edu, Rama
Good Afternoon:

Can you please clarify what error you are reporting?
Is this a discussion of NCBI blast and the command-line blat operation?
Can you describe the two different operations you are performing
with these two procedures, and the difference in the results?
Thank you,

--Hiram

Hiram Clawson

Nov 27, 2012, 11:23:32 AM
to Cheng Cui, gen...@soe.ucsc.edu, Rama
Can you give an example location in chr10 that is in the incorrect position,
and the example sequence you use with blat to find this location?
Can you use any of the kent utilities to extract sequence from the fasta
or .2bit files from UCSC to compare with your extraction program?

--Hiram

Cheng Cui

Nov 27, 2012, 10:35:37 AM
to Hiram Clawson, gen...@soe.ucsc.edu, Rama
Hi Hiram,

We wrote a custom C++ program to fetch a sequence of a particular length
10 bp before a given chromosome and position. To do this, we used the
.fa files downloaded from

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/

We then used UCSC blat (with mm9) to confirm the locations of the
sequences we found with our custom program. All sequences except those
from chr10 were fine. For sequences from chr10, the positions from UCSC
blat are 50 bp greater than the positions we got from our custom program.

To test this, we put an extra line of 50 N's at the beginning of chr10.fa;
the search results then matched between our custom program and UCSC blat.

Let me know if you need more information.

Cheng

Cheng Cui

Nov 27, 2012, 5:10:22 PM
to Hiram Clawson, gen...@soe.ucsc.edu, Rama
Hi Hiram,

I think I got it wrong: instead of adding a line of 50 N's, you need to
delete the first line of 50 N's from chr10.fa in order to match the
result from the Blat tool.

This is what I did:

1) at this page:

http://genome.ucsc.edu/cgi-bin/hgBlat?hgsid=312903247&command=start

Blat sequence within mm9:

GCAGTGAGGAGAGCTTTTCTACTTTATTACAAAGAGAAGAAGCCTTTATG
TGGTCAGAAAATACAAGATTGATCTTTTTCTCTTCCTATGAAACGCTGCT
GTTCAAACAGCCAGAGTGAATTGTCTCAGTTCACGTCTTGAAGCCCCATC
ACATTCACTTGGGTATGGATGTCCTTGGCTGCTGAAAATCTGGCCCTTTA

You will get a start position of 115922217 on chr10.

2) With my custom software, when I look for this sequence within the
chr10.fa.gz file downloaded from this page:

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/

I got a start position of 115922267, which is offset by 50 bp.

3) When I delete the first line of 50 N's from chr10.fa and run my
custom program to look for this sequence, it gives the right position:
115922217.

If you have your own program, you can try to look for this sequence
within chr10.fa to find the position by yourself. Otherwise I can
provide you with my tools to do this.

I am not using Kent Source Utilities. You can try it yourself. But some
kent utilities had N's removed, so take that into account when you are
doing so.

Let me know if it is not clear to you.

Cheng

Mingfeng Li

Nov 27, 2012, 10:24:17 PM
to Cheng Cui, Hiram Clawson, gen...@soe.ucsc.edu, Rama
I was worried about this error because I am currently using mm9, so
I checked, and I did not see it. I guess your tool has a bug. Counting 50 bases per sequence line, the position works out as:
(2318445-1)*50+17=115922217
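
For anyone who wants to check this arithmetic, here is a minimal sketch. It assumes 50 bases per sequence line, 1-based line and column numbers, and that the ">chr10" header is not counted as a sequence line. The function name is made up for illustration.

```python
def line_col_to_position(seq_line, column, line_width=50):
    """Convert a 1-based sequence line number and 1-based column
    into a 1-based chromosome position.

    seq_line counts sequence lines only (the '>chr10' header is excluded).
    line_width=50 is an assumption that holds for these files,
    not for FASTA files in general.
    """
    return (seq_line - 1) * line_width + column


print(line_col_to_position(2318445, 17))  # 115922217
```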

--
*********************************************
Mingfeng Li, Ph.D.
Postdoctoral Associate
Department  of Neurobiology
Yale University School of Medicine
333 Cedar Street, SHM C-327C
New Haven, CT 06510

E-mail: mingfeng.li@yale.edu
Lab: (203) 785-5941
Lab website: www.sestanlab.org
*********************************************

Cheng Cui

Nov 27, 2012, 10:43:08 PM
to Mingfeng Li, Hiram Clawson, gen...@soe.ucsc.edu, Rama
Dear all,

I think I found what went wrong. The first line of chr10.fa is ">chr10", not "NNNNNNNN..." as in all the other chr*.fa files. My program reads and registers positions by line number, so ">chr10" was counted as an extra line. That is a bug in my program, although I do think this extra ">chr10" line can cause confusion, as it did for me.

Sorry for the confusion, but please keep in mind that the first line of chr10.fa is different. I never looked at the .fa files one by one, and I thought they would all start with "NNNNNNN....".
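
A small sketch of how this kind of bug produces exactly a 50 bp shift. It is not the actual C++ program; the function names are hypothetical, and it assumes the match starts on sequence line 2318445, column 17, as in the calculation posted earlier in this thread.

```python
LINE_WIDTH = 50  # assumed fixed line width; true for these files only


def position_buggy(file_line, column):
    # Bug: file_line counts the '>chr10' header as line 1,
    # so every position is shifted by one full line (50 bp).
    return (file_line - 1) * LINE_WIDTH + column


def position_fixed(file_line, column, header_lines=1):
    # Fix: discount the header line(s) before converting to a position.
    return (file_line - 1 - header_lines) * LINE_WIDTH + column


# Sequence line 2318445 is physical file line 2318446 once the header is counted.
print(position_buggy(2318446, 17))  # 115922267
print(position_fixed(2318446, 17))  # 115922217
```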

Cheng

Hiram Clawson

Nov 28, 2012, 12:00:19 AM
to Cheng Cui, Mingfeng Li, gen...@soe.ucsc.edu, Rama
All the fasta files begin with a >chr... line.
That is the fasta header line, which is part of the format's definition.
You cannot count on each line in a fasta file being
50 characters long. They may be in the files you have used,
but that is never guaranteed.
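
As an illustration of reading a fasta file without relying on line numbers or line width, here is a rough sketch. The function names and the toy record are invented for the example, and it holds the whole record in memory, which is a simplification rather than a recommendation.

```python
import io


def read_fasta_sequence(handle):
    """Return the sequence of the first record in a FASTA stream as one string."""
    parts = []
    seen_header = False
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if seen_header:
                break  # stop at the start of the next record
            seen_header = True
            continue  # header lines never contribute to positions
        if line:
            parts.append(line)
    return "".join(parts)


def find_start(handle, query):
    """Return the 1-based start of query in the first record, or 0 if absent."""
    # upper() so that lowercase bases in the file still match the query
    seq = read_fasta_sequence(handle).upper()
    return seq.find(query.upper()) + 1


# Toy record with uneven line lengths; the wrapping does not affect positions.
toy = io.StringIO(">chrToy\nNNNNN\nACGTAC\nGT\nTTGACA\n")
print(find_start(toy, "GTTTGA"))  # 12
```

Because the sequence lines are joined before searching, the header and the line wrapping cannot shift the reported position.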

You could save yourself a lot of time and trouble and use the kent utilities
to work with the genome sequence files from UCSC. The .2bit files are the
most convenient files to work with since they are indexed. You can
extract any bit of sequence from the .2bit files much faster than reading
the fasta ascii files.

Good luck with your research.

--Hiram

Galt Barber

Nov 28, 2012, 11:50:29 AM
to Cheng Cui, Mingfeng Li, gen...@soe.ucsc.edu, Rama
Here's one quick way to see the beginning of the fasta files:

curl -s 'http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/chr10.fa.gz' | gunzip -c | head
>chr10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN


curl -s 'http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/chr9.fa.gz' | gunzip -c | head
>chr9
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
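
If the file is already downloaded, the same check can be scripted. This is only a sketch: the function name is made up, and the path in the usage note is a placeholder for wherever your local copy lives.

```python
import gzip
import itertools


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]
```

For example, head_gz("chr10.fa.gz", 2) on a local copy should return the ">chr10" header followed by the first line of N's, matching the output above.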

-Galt


Cheng Cui

Nov 28, 2012, 11:52:20 AM
to Galt Barber, Mingfeng Li, gen...@soe.ucsc.edu, Rama
Thanks.

Cheng