RSEM after HISAT2

Petros Tsantoulis

unread,

Apr 22, 2016, 9:26:27 AM4/22/16

to RSEM Users

Hello all,

Excuse my ignorance, but I'm trying to use HISAT2 for alignment (I like the idea that it considers SNPs). Unfortunately, I can't get the output to be read by rsem.

RSEM index:
# rsem-prepare-reference --gtf ../ensembl-GRC38/Homo_sapiens.GRCh38.84.gtf ../ensembl-GRC38/Homo_sapiens.GRCh38.dna.primary_assembly.fa ensembl.GRCh38.84
rsem-extract-reference-transcripts ensembl.GRCh38.84 0 ../ensembl-GRC38/Homo_sapiens.GRCh38.84.gtf None 0 ../ensembl-GRC38/Homo_sapiens.GRCh38.dna.primary_assembly.fa
Parsed 200000 lines
Parsed 400000 lines
Parsed 600000 lines
Parsed 800000 lines
Parsed 1000000 lines
Parsed 1200000 lines
Parsed 1400000 lines
Parsed 1600000 lines
Parsed 1800000 lines
Parsed 2000000 lines
Parsed 2200000 lines
Parsed 2400000 lines
Parsing gtf File is done!
../ensembl-GRC38/Homo_sapiens.GRCh38.dna.primary_assembly.fa is processed!
199184 transcripts are extracted.
Extracting sequences is done!
Group File is generated!
Transcript Information File is generated!
Chromosome List File is generated!
Extracted Sequences File is generated!

rsem-preref ensembl.GRCh38.84.transcripts.fa 1 ensembl.GRCh38.84
Refs.makeRefs finished!
Refs.saveRefs finished!
ensembl.GRCh38.84.idx.fa is generated!
ensembl.GRCh38.84.n2g.idx.fa is generated!

These (GRCh84.dna.primary_assembly.fa and 84.gtf) are the same files that are used for the HISAT2 index.

HISAT2 run:
for ((i=0; i<${#readlist[@]}; i++ )) ; do
    echo "Align ${readlist[$i]} (outfile ${samplename[$i]}.sam)" ;
    $HISAT \
        -t -p 4 \
    -x $REFERENCE \
    -U ${readlist[$i]} \
    -S ${samplename[$i]}.sam ;
    PREFIX=${samplename[$i]} ;
    INFILE=${PREFIX}.sam ;
    OUTFILE=${PREFIX}.bam ;
    samtools sort -m 4G -@ 4 -o $OUTFILE $INFILE ;
    samtools index -b $OUTFILE
    rm $INFILE

RSEM run:
# rsem-calculate-expression --alignments 361174.bam ../../reference/rsem/ensembl.GRCh38.84 361174
rsem-parse-alignments ../../reference/rsem/ensembl.GRCh38.84 361174.temp/361174 361174.stat/361174 361174.bam 1 -tag XM
Warning: The SAM/BAM file declares less reference sequences (194) than RSEM knows (199184)!
RSEM can not recognize reference sequence name 1!
"rsem-parse-alignments ../../reference/rsem/ensembl.GRCh38.84 361174.temp/361174 361174.stat/361174 361174.bam 1 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!

Note that I'm using a locally compiled version of RSEM 1.2.30 with default flags.No error during compilation.

Am I missing something obvious here? Has someone used HISAT2 + RSEM? Is it technically possible?

I did notice that a similar error was reported previously with STAR and was later resolved.

Thanks a lot!
Petros

Bo Li

unread,

Apr 22, 2016, 6:13:40 PM4/22/16

to rsem-...@googlegroups.com

Hi Petros,

RSEM only takes alignments in transcript coordinates. It seems that you
align your reads to the genome directly. That's why it does not work.

Hope it helps,
Bo

> --
> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [1]
> ---
> You received this message because you are subscribed to the Google
> Groups "RSEM Users" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to rsem-users+...@googlegroups.com.
> To post to this group, send email to rsem-...@googlegroups.com.
> Visit this group at https://groups.google.com/group/rsem-users [2].
>
>
> Links:
> ------
> [1] http://deweylab.biostat.wisc.edu/rsem/
> [2] https://groups.google.com/group/rsem-users

Petros Tsantoulis

unread,

Apr 24, 2016, 3:06:59 AM4/24/16

to RSEM Users

Hello Bo,

Thank you for your answer!

I tried to align against the transcripts, which resolves the warning, but the names do not match:
$ rsem-calculate-expression --alignments 361174.bam ../../reference/rsem/gencode.v24 361174
rsem-parse-alignments ../../reference/rsem/gencode.v24 361174.temp/361174 361174.stat/361174 361174.bam 1 -tag XM
RSEM can not recognize reference sequence name ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|!
"rsem-parse-alignments ../../reference/rsem/gencode.v24 361174.temp/361174 361174.stat/361174 361174.bam 1 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!

Then I simplified the RSEM reference, by removing the GTF
rsem-prepare-reference ../gencode-GRC38/gencode.v24.transcripts.fa gencode.v24.trans

Finally, the names matched, but I got the following:
$ rsem-calculate-expression --alignments 361174.bam ../../reference/rsem/gencode.v24.trans 361174
rsem-parse-alignments ../../reference/rsem/gencode.v24.trans 361174.temp/361174 361174.stat/361174 361174.bam 1 -tag XM
Read WINDU:108:C9AY9ANXX:2:1308:9460:47184: RSEM currently does not support gapped alignments, sorry!

"rsem-parse-alignments ../../reference/rsem/gencode.v24.trans 361174.temp/361174 361174.stat/361174 361174.bam 1 -tag XM" failed! Plase check if you provide co rrect parameters/options for the pipeline!

I do not know if there is any particular option in HISAT2 that would make this work, but at this point I'm thinking it's probably not (easily) doable. For me, the main benefit of HISAT2 would be the inclusion of SNPs, which slightly increases mapping rate and also the fact that it can run in a machine with less than 32GB RAM. I don't know if you are willing to consider integration of HISAT2 in a future RSEM version (if it's doable!), but I'll just use STAR+RSEM for the moment.

Thanks a lot for your help!
Petros

Reply all

Reply to author

Forward