Some problems in understanding the data from NCBI website


Louis

Aug 29, 2016, 10:36:32 AM
to gen...@soe.ucsc.edu
Dear Sir or Madam,


I am Zhuotong Li, a bioinformatics researcher from Melow technologies, Inc.

We have downloaded all the “1000genomes” data from the NCBI FTP site: ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/

We have tallied all the sequence segments from HG01699 and found something interesting.

We tried to solve the problems we encountered, but the further we go, the more confusing concepts we run into, even at a very basic level.

We would greatly appreciate your answers. Here are the questions:


1. We have downloaded “sequence.index” from ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/.

Form 1

There are many institutes, such as BCM, SC, and WUGSC, and many instrument models. How can we determine whether a sequence is good enough to use?


2. In the same file, our attention is on the “PAIRED FASTQ” column. “PAIRED FASTQ” gives the mate-pair file, if one exists. Why are some mate-pair files missing from PAIRED FASTQ? And why do some mate-pair files not match the original file? (There is a base count figure in question 3.) We think every sequence should have a mate pair.

Form 2


3. The sample ID HG00097 has fourteen lines. Form 3 below shows the read count and base count of HG00097. Each line has a different read count and base count (except for mate pairs). But the file we actually downloaded is a single CRAM file.

See: ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/HG00097/alignment/

How were these data selected and integrated into one CRAM file? Does the CRAM file use all the data, including mate pairs?



Form 3


4. We notice there is a “quality score” in the FASTQ file. How is this score calculated? Is there an algorithm for it?

Information from: http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm
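
For context on question 4: FASTQ quality characters conventionally encode a Phred score Q = -10 * log10(p), where p is the probability, as estimated by the base-calling software, that the base call is wrong. The sketch below assumes the Phred+33 (Sanger-style) ASCII offset; older Illumina pipelines used other offsets, so the offset should be checked against the actual data. The function names are illustrative only.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger-style) encoding


def decode_quality_string(qual, offset=PHRED_OFFSET):
    """Turn a FASTQ quality line into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q):
    """Convert a Phred score Q into an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability p into a Phred score Q = -10*log10(p)."""
    return -10 * math.log10(p)


# Hypothetical quality string, decoded character by character.
scores = decode_quality_string("I5#")
probs = [phred_to_error_prob(q) for q in scores]
```

Under this convention a score of 30 corresponds to an estimated error probability of 1 in 1000.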


5. As we said at the beginning, we counted all bases from HG01699 and loaded them into a database table, then compared them with the Samtools tview result.

Form 4

Columns from left to right are: chromosome ID; Samtools algorithm result; sequencing depth; the number of "A" in the column of sequence data; the number of "a"; the number of "T"; the number of "t"; the number of "C"; the number of "c"; the number of "G"; the number of "g"; the number of "*".
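
A tally like the one in Form 4 could be produced along these lines. This is a minimal sketch, not the code actually used for the table: the function name `tally_column` and the example column string are hypothetical, and depth is assumed here to be the sum of the counted symbols.

```python
from collections import Counter

# Symbols counted per column, in the same order as the table columns.
SYMBOLS = ["A", "a", "T", "t", "C", "c", "G", "g", "*"]


def tally_column(column):
    """Count each case-sensitive base symbol in one alignment column."""
    counts = Counter(column)
    row = {sym: counts.get(sym, 0) for sym in SYMBOLS}
    # Assumption: depth is the total number of counted symbols.
    row["depth"] = sum(row[sym] for sym in SYMBOLS)
    return row


# Hypothetical column: six reads covering one position.
row = tally_column("AAaTc*")
```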


We wonder what the difference is between upper-case and lower-case letters.


In Figure 1, the underlined line is the output of the tool “samtools_bp”.

The lines under “samtools_bp” are “seq_depth” (without blank lines). The seq_depth in the red square is 11 (that is, 11 lines of data); the seq_depth in the yellow square is 1.


Our question concerns the position where the depth is only 1 (this base was sequenced only once). Form 4 above shows that the sequenced base is C/c, yet the samtools algorithm outputs “*”. We do not understand how “*” can come from a measured C/c. Are there other factors involved in the process?


The yellow square shows bases that the samtools algorithm changed from the original “atg” to “K”. We wonder what K means.


Figure 1


Let’s continue with our statistics.

Form 5


Form 6

From forms 5 and 6 above, it seems that any base can be turned into “K”. But what does “K” stand for?


6. We computed the proportion of each base in chromosome 1.

Form 7


We notice that, besides A, T, C, G, and N, the Samtools algorithm result also contains K, M, R, S, W, Y, and *. Together, K, M, R, S, W, Y, and * account for over 0.25%. What do these letters stand for, and how are they derived?
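
For reference on the letters in question 6: K, M, R, S, W, and Y coincide with the standard IUPAC nucleotide ambiguity codes, each of which denotes a pair of possible bases. Whether the tool in question emits them with exactly this meaning is an assumption to confirm against the samtools documentation. A small lookup sketch (the names are illustrative):

```python
# Standard IUPAC two-base ambiguity codes.
IUPAC_TWO_BASE = {
    "K": {"G", "T"},
    "M": {"A", "C"},
    "R": {"A", "G"},
    "S": {"C", "G"},
    "W": {"A", "T"},
    "Y": {"C", "T"},
}

PLAIN_BASES = {"A", "C", "G", "T"}


def expand(code):
    """Return the set of bases that a single IUPAC letter may represent."""
    code = code.upper()
    if code in PLAIN_BASES:
        return {code}
    return IUPAC_TWO_BASE[code]


k_bases = expand("K")
```

Under this reading, a “K” at a position would mean that either G or T is possible there.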


7.

Form 8


HG01699 was sequenced 66 times with the Illumina HiSeq 2000.

The error rate of the Illumina HiSeq 2000 is 0.1%. See http://www.illumina.com/documents/products/datasheets/datasheet_hiseq2000.pdf

This means that, within the same base column, the probability of one differing base should be 0.1%, the probability of 4 differing bases should be about 1/1,000,000,000,000, and more differing bases are much less likely.
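
The arithmetic above can be checked with a short sketch. It assumes that each read's error at a position is independent with probability 0.001, which is a simplification. The depth of 20 is an arbitrary illustrative value, and the function name is hypothetical.

```python
from math import comb


def prob_at_least_k_errors(n, k, p):
    """Binomial tail: P(at least k of n independent reads are miscalled)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p = 0.001  # per-base error probability quoted above

# Four specific reads all miscalled: p^4, about 1e-12.
four_specific = p**4

# Any 4 or more of 20 reads miscalled (illustrative depth of 20).
tail = prob_at_least_k_errors(20, 4, p)
```

The product p^4 matches the 1/1,000,000,000,000 figure, but it applies to four specific reads; the chance that any four or more reads in a deeper column disagree is larger, because there are many ways to choose which reads are wrong.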

But in fact, we find that where “samtools_bp” shows the base “A”, the proportion of “A/a” in the column may be less than 20%, with over 80% being T/t, C/c, or G/g. Why does samtools produce an “A” here?

Since the probability that one column contains 4 or more differing bases is very low (1/1,000,000,000,000), can we assume these columns contain synthesis errors?


8. Our genes may have insertions and deletions. How does samtools_bp show these variations?


9.

Figure 2

In our statistics, over 2% of columns have a depth of only 1. Can these columns be trusted?

Most depths are between 1 and 20. The largest depth is over 8000. Is this possible in a sequencing experiment?


Yours sincerely,

Zhuotong Li


Christopher Lee

Aug 29, 2016, 12:38:09 PM
to Louis, UCSC Genome Browser Discussion List

Hi Zhuotong,

Thank you for your question about NCBI data.

This mailing list is for questions relating to software or data produced here at UCSC, and is not a source of general scientific advice. Since your questions relate to NCBI data, please direct your questions there, or to a more general knowledge forum such as Biostars.

NCBI Help Desk:
https://www.ncbi.nlm.nih.gov/home/about/contact.shtml

Biostars:
https://www.biostars.org/

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Thanks,

Christopher Lee
UCSC Genomics Institute

