Merging unmapped reads

Pawan Chinari

unread,

Jul 10, 2020, 3:59:25 PM7/10/20

to rna-star

Hi,

I am trying to merge aligned and unmapped reads together for paired-end sequencing data.

I have downloaded aligned BAM file and unampped reads in FASTQ format from a public resource.

The public resource has used STAR to do the alignment using hg38 as the reference genome.

I want to do the alignment using hg19 as the reference genome so these are the steps that I am doing:

1. Convert hg38 BAM files to FASTQ format using Samtools sort -n followed by Samtools fastq

2. Merge mate1 and mate2 unmapped files to FASTQ_1 and FASTQ_2 files (obtained after running Samtools fastq) respectively.

3. Use the merged FASTQ files to align to hg19 reference genome using STAR.

Can I please know if it is correct to follow these steps to convert from hg38 to hg19 and merging unmapped reads ?

Do I have to consider any other step when converting from hg38 to hg19 format and also merging mate1/2 files to FASTQ_1/2 files ?

Thank you for your help.

Best Regards,

Pawan

Alexander Dobin

unread,

Jul 15, 2020, 2:50:08 PM7/15/20

to rna-star

Hi Pawan,

I am not sure what you mean by "Merge mate1 and mate2 unmapped files to FASTQ_1 and FASTQ_2 ".

It's important that the reads in two fastq files are in exactly the same order.

I do not use samtools fastq - I prefer picard SamToFastq. It's a bit slow, but it checks carefully for errors. It will produce two fastq files that you can directly use for mapping.

Cheers

Alex

Pawan Chinari

unread,

Jul 15, 2020, 3:46:06 PM7/15/20

to rna-star

Hi Prof. Dobin,

Thank you for your response.

By this sentence "Merge mate1 and mate2 unmapped files to FASTQ_1 and FASTQ_2 ":

I mean that I am trying to merge the "Unmapped.out.mate1/2" from STAR output to "FASTQ_1/2" files respectively before aligning to reference genome hg19.

Please let me tell you a bit more detail of the data:

The downloaded data from the public resource has two sets of files - aligned bam files and unmapped files (Unmapped.out.mate1/2).

The public resource has used STAR to do the alignment and the alignment is being done using reference genome hg38.

In order to ensure that the fastq files are in same order, first I sort the bam file using samtools like this: samtools sort -n

And then I use samtools fastq which gives me two fastq files _1.gz and _2.gz.

Now I want to merge the unmapped reads to these two fastq files.

The rationale for merging the unmapped reads is that those reads which did not map to hg38 may map to hg19.

Please let me know if it is correct to proceed this way for the unmapped reads or there is any alternative way to process the unmapped reads ?

Thanks,

Pawan

Alexander Dobin

unread,

Jul 15, 2020, 7:25:06 PM7/15/20

to rna-star

Hi Pawan,

here a few things you need to check to make sure your procedure works:

1. The BAM file does not contain unmapped reads - otherwise, you will be including them twice.

2. The BAM file does not contain single-end alignments. If it does, then I think samtools fastq -s flag will write them in a separate file. You do not need to add these reads since they are recorded in Unmapped.

3. I am not sure if samtools fastq outputs primary alignments only. I think you should be able to filter the non-primary alignments with the -F 0x100 option.

Also, after you create the FASTQs, I would recommend that you check that

1. The reads in two files are in the same exact order, by comparing the read file names (i.e. lines 1,5,9...) in both files.

2. There are no duplicate read names

3. The set of read names in FASTQs is exactly the same as the union of BAM and Unmapped reads.

Cheers

Alex

Pawan Chinari

unread,

Jul 16, 2020, 7:00:23 PM7/16/20

to rna-star

Hi Prof. Dobin,

Thank you for your detailed response and suggestions.

I have performed these checks:

-----------------------------------------

2. The BAM file does not contain single-end alignments. If it does, then I think samtools fastq -s flag will write them in a separate file. You do not need to add these reads since they are recorded in Unmapped.

I did write the -s files and none of them have any reads.

3. I am not sure if samtools fastq outputs primary alignments only. I think you should be able to filter the non-primary alignments with the -F 0x100 option.

Samtools does filter out secondary alignments (0x100) and supplementary alignments (0x800) by default using -F 0x900

1. The reads in two files are in the same exact order, by comparing the read file names (i.e. lines 1,5,9...) in both files.

I have checked these also and they are in the same exact order.

I have these follow-up queries:

----------------------------------------

1. The BAM file does not contain unmapped reads - otherwise, you will be including them twice.

Why will the aligned BAM file contain unmapped reads ?

And if the aligned does contain unmapped reads, can I check them with this flag (0x4) ?

3. The set of read names in FASTQs is exactly the same as the union of BAM and Unmapped reads.

For this can check the list of sequence names again for e.g. lines (1,5,9,....) ?

In addition, I wanted to ask

------------------------------------

1. how important it is to use the unmapped reads to align to hg19 ?

2. Also I noticed that the read order for mate1 and mate2 in the unmapped files "Unmapped.out.mate1/2" are not sorted.

I am thinking of first sorting the unmapped files and then concatenating with the respective fastq _1 and _2 files.

Is this the proper way to proceed ?

I look forward to your response again.

Thank you for your help.

Thanks,

Pawan

Alexander Dobin

unread,

Jul 22, 2020, 11:47:13 AM7/22/20

to rna-star

Hi Pawan,

we can store unmapped reads in BAM so that we can have *all* of the reads in the BAM file, and thus can restore the original "raw" FASTQs. Then, in principle, we can only keep the BAM file and do not need to save the original FASTQ. STAR has an option to write unmapped reads to the BAM file.

Indeed, samtools view -f0x4 will output unmapped reads.

There was a bug in earlier version of STAR that sometimes produced improperly ordered Unmapped.* files. Sorting them by read name is exactly the right way to go. I think the easiest thing is to convert the 4-line FASTQ into one line, sort, and then convert back to 4-line.

It's probably not very important to map to hg19 those reads that were unmapped to h38, unless you are specifically looking into the difference between hg19 and h38 alignments.

h38 is a better assembly than hg19, so reads that do not map to h38 but map to hg19 will be suspicious.