Hi,
I was using gffread to convert transcripts from gtf (output of Cufflinks) to fasta. I get some GFaSeqGet errors, as follows:
$ gffread transcripts.gtf -g ../genome.fa -w transcripts.fasta
Error (GFaSeqGet): end coordinate (32519) cannot be larger than sequence length 32512
Error (GFaSeqGet): end coordinate (7812) cannot be larger than sequence length 7800
Error (GFaSeqGet): end coordinate (7812) cannot be larger than sequence length 7800
Error (GFaSeqGet): end coordinate (6327) cannot be larger than sequence length 6326
Error (GFaSeqGet): end coordinate (4651) cannot be larger than sequence length 4648
Error (GFaSeqGet): subsequence cannot be larger than 3551
Error getting subseq for CUFF.29386.2 (1..3568)!
That's a bit weird since I used the same genome file as the one used to generate the alignment file required by Cufflinks, and the coordinates overhang does not make a lot of sense to me.
The alignment file was generated using Hisat2:
$ hisat2 -q --rna-strandness RF --dta-cufflinks -x ./hisat_idx -1 ../1A_1.fq.gz,../1B_1.fq.gz,../2A_1.fq.gz,../2B_1.fq.gz,../3A_1.fq.gz,../3B_1.fq.gz -2 ../1A_2.fq.gz,../1B_2.fq.gz,../2A_2.fq.gz,../2B_2.fq.gz,../3A_2.fq.gz,../3B_2.fq.gz -S out.sam
Then sorted:
$ sort -k 3,3 -k 4,4n -T .
My question is, are only the few sequences reported in the log affected? Or is it a more general problem?
I got a fasta file out of gffread (537M out of 404M gtf file), and no error in generation of the fasta - but there are around 13k transcripts missing from the fasta (if I am not wrong with interpretation):
$ cat transcripts.gtf | awk '$3=="transcript"{print $0}' | wc -l
231139
$ cat transcripts.fasta | grep '^>' | wc -l
218075
Do you know what would be a cause of this problem?
Best
Amina