Alignment with BAM file as input

1,528 views
Skip to first unread message

Jose Fernandez Navarro

unread,
Sep 4, 2017, 6:03:36 AM9/4/17
to rna-star
Hello, 

I am wondering if there is any plan to add full support
for BAM input files (as oppose to FASTQ for the alignment mode). 

If so, is that something that will happen in the next release/s? 

Thanks!

Best,
Jose

Alexander Dobin

unread,
Sep 6, 2017, 5:38:58 PM9/6/17
to rna-star
Hi Jose,

this feature is high on my list. I think I hope to release it within 1-2 months.

Cheers
Alex

Jose Fernandez Navarro

unread,
Oct 6, 2017, 12:39:39 PM10/6/17
to rna-star
Hi Alex, 

I saw that you implemented this in your latest commit. 
I am about to test it :) I have one question though. What 
will happen now with the option --outReadsUnmapped?
It would be great if it that option would support BAM/SAM as well. 

Best,
Jose

Alexander Dobin

unread,
Oct 6, 2017, 12:47:35 PM10/6/17
to rna-star
Hi Jose,

great, it would be very helpful - please test it thoroughly and let me know if there are any issues or you have any suggestions.
I am waiting for several users to test it before I make an official release.

There is an option --outSAMunmapped Within which will output unmapped reads in SAM/BAM format within the main BAM file.
All the tags from uBAM are supposed to be transferred. You can then extract the unmapped reads with samtools.

Cheers
Alex

Jose Fernandez Navarro

unread,
Oct 8, 2017, 11:54:29 AM10/8/17
to rna-star
Hi Alex,

I tested in a Linux environment and I got the following error:

EXITING because of FATAL INPUT ERROR: unknown/unimplemented value for --readFilesType: SAM SE

SOLUTION: specify one of the allowed values: Fastx or SAM


However, if I use --readFilesType SAM I get this error:

EXITING because of FATAL INPUT ERROR: --readFilesType SAM requires specifying SE or PE reads

SOLUTION: specify --readFilesType SAM SE for single-end reads or --readFilesType SAM PE for paired-end reads



I am about to test it in MAC but I had to build GCC from scratch and it is taking forever. 

I still think that adding the option to output unmapped reads in BAM format
with --outReadsUnmapped would be convenient. However, I understand
if this would not have a high priority :) 

Best,
Jose

Jose Fernandez Navarro

unread,
Oct 9, 2017, 7:10:06 AM10/9/17
to rna-star
Update: I could not build STAR on my MAC. I get this error:

bam_cat.c:57:10: fatal error: 'cstring' file not found

#include <cstring>


I tried 

$ brew reinstall --build-from-source gcc

and 

$ gcc -c -O3 bam_cat.c

but I still get the same error. I will try to build gcc from the source. 

Best,
Jose

Alexander Dobin

unread,
Oct 10, 2017, 12:40:57 PM10/10/17
to rna-star
Hi Jose,

could you please send me the Log.out file for the Linux run?

For the Mac, the problem likely indicates that the gcc standard libraries are not installed.

Cheers
Alex

Jose Fernandez Navarro

unread,
Oct 12, 2017, 3:39:08 AM10/12/17
to rna-star
Hi Alex, 

I managed to build STAR in my MAC. I had installed GCC
but I was not referencing to the correct compiler. 

I also managed to run STAR with --readFilesType: SAM SE
The problem was that the Python wrapper that was invoking STAR was adding
some spaces between SAM and SE. 

So, STAR runs now but the outputted BAM does not contain
the FASTQ sequences and qualities. 

Input: 

NS500688:11:H5FLYBGXX:1:11101:25104:1041 2:N:0:1 0 * 0 0 * * 0 0 CCATCTTTCTATCGAAAGATGGTGAACTATGCCTANGCAGGGCGAAGCCAGAGGAAACTCTGGTGGAGGTCCGTAGCGGTCCTGACGT bb]bb]b]gOJXgg]gbgbgb]Og]bggggbgggJDgg]]Ogb]OJgbgOgObJOgggggbXXgOO]]]bgbXgOgbOOXbgbbggX] B0:Z:ATGAANGACGGAGGTACT B3:Z:ACGTACGGG

NS500688:11:H5FLYBGXX:1:11101:19443:1041 2:N:0:1 0 * 0 0 * * 0 0 CGGCCCACGGCCCTGGCGGAGCGCTGAGAAGACGGTCGAACTTGACTATCTAGAGGAAGC bX]X]gggJOgggb]]bg]ggg]ggXgbggXggbggg]gggggXgggbggb]gbgggg]g B0:Z:TCCTANCGCTACTATCTG B3:Z:AGGCTCTAG

NS500688:11:H5FLYBGXX:1:11101:20435:1042 2:N:0:1 0 * 0 0 * * 0 0 CGCTCTCTCTCCGAGTTTCAGGCTCGTGCTAAGCTAGCGCCGTCGTCGTCTCACTTCAGTCGCCATCATGATTATCTACCGGGACCTCATCAGNCACGANGAGATGNTCTC bXbObggJggggOg]gggg]JJg]ggbOgbggJggggg]XggbggggbggggJ]ggggXgggbOObg]ggbbgggggggXOgbOggg]OgXOXDb]ggbD]gbO]]Dbbg] B0:Z:CGCTCNTTCGCAGACGTA B3:Z:AGTTAGATA

NS500688:11:H5FLYBGXX:1:11101:24680:1042 2:N:0:1 0 * 0 0 * * 0 0 CACTTCTCCCCAGTCTTCAACTCTACTTTATGCAAGAGGTTGTTGACCACGTGCCCTGGGCAGCTTCAAGACTGGTGGCTCACGCCTGTCATCNCAGCACTCTGGGNGGCAGAGGCGGGC bXbbbgggObXbJbbggXJJOJgbOggbgOgggggggX]ggggbgXg]]gJgOJJggbgOJOXggbgOggggXgOggOgbgOgOgOgOgg]bJDbXgbO]XbO]ggDgOgOgJ]g]g]Xg B0:Z:AGCAGNACAGACAGCGTT B3:Z:TCTATCACA

NS500688:11:H5FLYBGXX:1:11101:12075:1043 2:N:0:1 0 * 0 0 * * 0 0 GAGGCTCTGCACAACAACTACACGCAGAAGAGCCTCTCCCTGTCCCCGGGTTAATGAGTGCGACGGCCGGCAAGCCCCCGCTCCCCGGGCTCTNGCGGTCGCACGANGATGCTTGGCAC XbOJ]gggOgJ]ObgJJgggJX]JgXggggbgOggggJOXg]gOJXJXOggJO]ggXOgg]J]]]bOXOOXObX]JJJbbggOJJggOOgggODJbXbgbObOOgJDb]ggggggXbO] B0:Z:TGGTANGCGGGCTACACC B3:Z:AGTATGCTA


Output:

NS500688:11:H5FLYBGXX:1:11101:25104:1041 4 * 0 0 * * 0 0 N 0 NH:i:0 HI:i:0 AS:i:nM:i:0 uT:A:0 B0:Z:ATGAANGACGGAGGTACT B3:Z:ACGTACGGG

NS500688:11:H5FLYBGXX:1:11101:19443:1041 4 * 0 0 * * 0 0 N 0 NH:i:0 HI:i:0 AS:i:nM:i:0 uT:A:0 B0:Z:TCCTANCGCTACTATCTG B3:Z:AGGCTCTAG

NS500688:11:H5FLYBGXX:1:11101:20435:1042 4 * 0 0 * * 0 0 N 0 NH:i:0 HI:i:0 AS:i:nM:i:0 uT:A:0 B0:Z:CGCTCNTTCGCAGACGTA B3:Z:AGTTAGATA

NS500688:11:H5FLYBGXX:1:11101:24680:1042 4 * 0 0 * * 0 0 N 0 NH:i:0 HI:i:0 AS:i:nM:i:0 uT:A:0 B0:Z:AGCAGNACAGACAGCGTT B3:Z:TCTATCACA

NS500688:11:H5FLYBGXX:1:11101:12075:1043 4 * 0 0 * * 0 0 N 0 NH:i:0 HI:i:0 AS:i:nM:i:0 uT:A:0 B0:Z:TGGTANGCGGGCTACACC B3:Z:AGTATGCTA


I am attaching the LOG file.

Best,
Jose
Log.out

Alexander Dobin

unread,
Oct 13, 2017, 11:01:04 AM10/13/17
to rna-star
Hi Jose,

thanks for testing it!
This looks like a bug, I will fix it shortly.

Cheers
Alex

Alexander Dobin

unread,
Oct 20, 2017, 12:15:56 PM10/20/17
to rna-star
Hi Jose,

I think the problem here is that read name contains 2:N:0:1 strings separated by space from the main read name.
I do not think read names are allowed to have spaces.
If you use picard to convert the FASTQ into SAM, it does not keep these strings, while I think it reads the "Y/N" information into the FLAG.
If you want to keep them, you will need to separate them with the underscore, i.e. 
NS500688:11:H5FLYBGXX:1:11101:25104:1041_2:N:0:1
I would not recommend it since the 1st and 2nd mates will have different names, _1:N:0:1 and _2:N:0:1.

Cheers
Alex

Jose Fernandez Navarro

unread,
Oct 23, 2017, 5:39:19 AM10/23/17
to rna-star
Hi Alex,

Is that what causes the outputted BAM/SAM file to not contain
the reads sequences and qualities? Is that the expected behaviour?

In our case, it doesn't really matter if the reads names get trimmed out
but we do need the sequences for downstream analysis after aligning. 

Thanks for looking into this. 

Best,
Jose

Alexander Dobin

unread,
Oct 23, 2017, 12:40:04 PM10/23/17
to rna-star
Hi Jose,

the problem is that the uBAM file you generated contains a space in the read ID of the file, and it causes STAR to read wrong fields for sequence and quality.
If you replace the space in the name with an underscore "_", the problem should go away. Or, better, when you generate the uBAM file, only keep the part of the read name before the space, i.e. drop the "1:N:0" fields.

Cheers
Alex

Jose Fernandez Navarro

unread,
Oct 24, 2017, 11:37:17 AM10/24/17
to rna-star
Hi Alex,

I see :) I fixed that but now I get this message on stderr (the mapping job completes though). 

samtools: writing to standard output failed: Broken pipe

samtools: error closing standard output: -1


However, the outputted BAM file seems to be unsorted. 

I attached the log file. 

Best,
Jose
Log.out

Jose Fernandez Navarro

unread,
Oct 24, 2017, 11:38:32 AM10/24/17
to rna-star
unsorted and truncated*

Jose Fernandez Navarro

unread,
Oct 25, 2017, 9:12:11 AM10/25/17
to rna-star
Hi again Alex,

I also noticed that output BAM files have duplicated the NH tag:

 NH:i:1  HI:i:1  AS:i:110        nM:i:2  NH:i:0  HI:i:0  AS:i:9  nM:i:1


Is that expected? 

Thanks! 

Best,
Jose

Alexander Dobin

unread,
Oct 26, 2017, 11:35:54 AM10/26/17
to rna-star
Hi Jose,

are you trying to re-map unmapped reads that were output as SAM?
"nM:i:2  NH:i:0  HI:i:0  AS:i:9  nM:i:1" look like the unmapped reads tags.
For uBAM input STAR imply copies all the SAM tags from the uBAM. So if any tags in uBAM are the same as the tags that STAR generates, they will be duplicate.
You would have to remove all unwanted tags from uBAM before supplying it to STAR.

I have run STAR with your parameters on one of my uBAM test files, and the runs completed OK with the output BAM sorted and not truncated.
I think the problem is likely with the formatting of your uBAM file.
Could you please try to generate a uBAM file with picard, and check if it can be mapped fine? I used:
$ java -jar FastqToSam.jar F1=r1.fq F2=r2.fq O=uBAM.sam RG=aa  QUALITY_FORMAT=Illumina SAMPLE_NAME=s1

Cheers
Alex

e...@spatialtranscriptomics.com

unread,
Nov 8, 2017, 12:04:49 PM11/8/17
to rna-star
Hi Alex,

I'm working with Jose and I've tested the picard FastqToSam conversion you suggested.


In summary I used two input files for the tests:
    1. A test set of reads quality trimmed and output to fastq-format
    2. The same set of reads (quality trimmed) but processed in another way to produce a bam file
Both these files were converted to the other formats if needed (bam, sam and fastq).

Ie the unmapped reads are now available as bam, sam and fastq-formatted files
Created by conversion from two the different "source formats" (bam and fastq above),
in total six files.



I then tested to map all these six files with both
    STAR_2.5.3a
  and
    STAR_2.5.3a_modified (built from the code in the GitHub repo 20171106).

I initially mapped with the standard parameters for STAR,
this produced identical output for all input files and STAR versions.

Then I added the following parameters:
    --alignIntronMin 1
    --alignIntronMax 1
    --outFilterMultimapNmax 1
    --alignEndsType EndToEnd
    --outFilterMatchNmin 25
This resulted in that the two versions of STAR gave different output stats
for the fastq input files ran with the same parameters
(as bam/sam not possible for STAR_2.5.3a I could not test that input format).


I also tried to use:
    --alignEndsType Local instead of --alignEndsType EndToEnd
and tried to add the parameters one by one to the command line.
But I still get different output stats for same input files.

Have I found a bug here or am I doing something wrong?

The script I used for conversions and to map can be found here:
     https://github.com/elhb/scripts/blob/master/scripts/star_test_171107.sh

I also have a .tar.gz archive (~80MB) with the input and output-files that I can share with you if you like.

Best,
Erik

Alexander Dobin

unread,
Nov 8, 2017, 4:11:19 PM11/8/17
to rna-star
Hi Erik,

if I understood it correctly, the results are different if you run the same fastqs with the same parameters with two different versions of STAR?
Could you please send me the Log.out and Log.final.out file for the two runs?

Cheers
Alex

e...@spatialtranscriptomics.com

unread,
Nov 9, 2017, 4:44:47 AM11/9/17
to rna-star
Here you go,
Best,
Erik
STAR_2.5.3a_modified.v151.param_3.fastq_Log.final.out
STAR_2.5.3a_modified.v151.param_3.fastq_Log.out
STAR_2.5.3a.v151.param_3.fastq_Log.final.out
STAR_2.5.3a.v151.param_3.fastq_Log.out

Alexander Dobin

unread,
Nov 9, 2017, 6:56:11 PM11/9/17
to rna-star
Hi Erik,

this is strange indeed, I cannot see the problem in my tests.

Let's try this again. I have pushed my latest commit to the GitHub master.
Please download it and try to run the compiled binary.

If it still shows different result, please send me the fastq file and the genome fasta - I will try to reproduce it.

Cheers
Alex

e...@spatialtranscriptomics.com

unread,
Nov 14, 2017, 5:59:21 AM11/14/17
to rna-star
Hi Alex,
seems like the problem still remains
Find the fastq and fasta file attached
Best regards,
//Erik
Rn45s_Rn5s.fasta
R2_quality_trimmed.v160.fastq.gz

Alexander Dobin

unread,
Nov 15, 2017, 12:31:43 PM11/15/17
to rna-star
Hi Erik,

ahh, I forgot about it - this was actually a bug that was fixed in a post-2.5.3a patch 4c42f224ea165e .
It occurs when you have a small genome and use --alignIntronMax 1.
The results from the master are the correct one - and you have more reads mapped. :)

Cheers
Alex

Jose Fernandez Navarro

unread,
Nov 17, 2017, 3:59:00 AM11/17/17
to rna-star
Hi Alex, 

Thanks for all the help! 

We can confirm that the current master branch of STAR works 
for us with the BAM input in both Linux and MAC. Are you considering
making a release soon? 

One last thing, I would like to propose to add a flag to output discarded
reads in BAM format. I think conceptually it makes sense to have (or be able to)
the discarded reads in the same format as the input format. What do you think? 

Best,
Jose

Alexander Dobin

unread,
Nov 17, 2017, 6:19:56 PM11/17/17
to rna-star
Hi Jose,

sounds good, thanks for testing it!
I will try to make a release next week before the Thanksgiving, or at the latest the week after.

I will add the unmapped SAM output to my TODO list.

Cheers
Alex

Jose Fernandez Navarro

unread,
Jan 23, 2018, 5:12:29 AM1/23/18
to rna-star
Hi Alex,

Have you gotten around making a release yet? 

Best,
Jose

Alexander Dobin

unread,
Jan 25, 2018, 11:03:01 AM1/25/18
to rna-star
Hi Jose,

thanks for the reminder and sorry for the delay.
2.5.4a is released:

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages