Aligning 3 fastq files from the same sample

Osama Hamzah

Nov 16, 2016, 10:32:04 AM

to rna-...@googlegroups.com

I am quite new to Bioinformatics, but I found the STAR aligner to be one of the fastest tools to use in this field.

I am trying to align 106 publicly available samples https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54460

When I use the SRA tools to download them with the --split-3 option (./fastq-dump --split-3 SRR1164866), I get three fastq files. For example, sample 'SRR1164866' is a paired-end sample (http://www.ebi.ac.uk/ena/data/view/SRR1164866). It is downloaded into three different fastq files: SRR1164866_1.fastq, SRR1164866_2.fastq and SRR1164866.fastq.

After researching this behavior, I found this article, http://www.internationalgenome.org/faq/why-sequence-data-distributed-2-or-3-files-labelled-srr1-srr2-and-srr/, which states:

'We distribute our fastq files for our paired end sequencing in 2 files, mate1 is found in a file labelled _1 and mate2 is found in the file labelled _2. The files which do not have a number in their name are singled ended reads, this can be for two reasons, some sequencing early in the project was singled ended also, as we filter our fastq files as described in our README if one of a pair of reads gets rejected the other read gets placed in the single file.'

I am getting great alignment results with the _1.fastq and _2.fastq files, but what about the third .fastq file, which is sometimes huge (2-3 gigabytes), almost the same size as each of the _1.fastq and _2.fastq files?

Is there a way for STAR to align all the data (_1.fastq, _2.fastq and .fastq) of my samples?

Alexander Dobin

Nov 17, 2016, 3:38:15 PM

to rna-star

Hi Osama,

I have only used the --split-files option with the SRA tools.

Here is a good explanation:

https://www.biostars.org/p/156909/

It seems that for PE sequencing, the 3rd file contains "orphaned" (unpaired) reads and should be disregarded most of the time.

I am not sure what it means when this file is comparable in size to _1 and _2. Poor sequencing quality? A mixture of SE and PE sequencing?
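
File size alone can mislead, so one quick check is to count the reads in each file and compare. A minimal sketch, assuming uncompressed FASTQ with exactly 4 lines per record; the helper name count_reads is made up for illustration, and the file names are the ones from this thread:

```shell
# Count reads in an uncompressed FASTQ file.
# Assumption: every record is exactly 4 lines (no wrapped sequences).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print a read count for each of the three files, skipping any that are absent.
for f in SRR1164866_1.fastq SRR1164866_2.fastq SRR1164866.fastq; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_reads "$f")"
    fi
done
```

If the unnumbered file holds roughly as many reads as _1 and _2, that would point toward a large single-end component rather than a handful of orphans.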

I guess you could map it separately as SE reads.
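
Roughly, that would be two separate STAR runs per sample, one paired-end and one single-end. This is only a sketch: the genome index path, thread count and output prefixes are placeholders, and the run wrapper with its DRY_RUN switch is a made-up convenience that prints the commands by default instead of executing them.

```shell
# Sketch: map the paired files and the orphan file as two separate STAR runs.
# GENOME_DIR, THREADS and the output prefixes are placeholders (assumptions).
SAMPLE=${SAMPLE:-SRR1164866}
GENOME_DIR=${GENOME_DIR:-/path/to/STAR_index}
THREADS=${THREADS:-8}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands; set to 0 to run STAR

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Paired-end run: mate 1 and mate 2 are passed together to --readFilesIn.
run STAR --runThreadN "$THREADS" --genomeDir "$GENOME_DIR" \
    --readFilesIn "${SAMPLE}_1.fastq" "${SAMPLE}_2.fastq" \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix "${SAMPLE}_PE_"

# Single-end run: the unnumbered file with the unpaired reads.
run STAR --runThreadN "$THREADS" --genomeDir "$GENOME_DIR" \
    --readFilesIn "${SAMPLE}.fastq" \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix "${SAMPLE}_SE_"
```

The two runs write separate BAM files and separate Log.final.out summaries, so the mapping rates of the paired and unpaired reads can be compared before deciding whether the third file is worth keeping.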

Cheers

Alex

Osama Hamzah

Nov 17, 2016, 4:08:40 PM

to rna-star

Thanks for the reply.

I am getting up to 97% alignment from the two files only. I am just going to ignore the third file.
