Hi,
I've created a panhuman reference following the steps in the Kaminow et al. paper and, as a simple sanity check, would like to map the GENCODE reference transcriptome onto it to see whether any transcripts are excluded from the new reference. This has proven more challenging than I expected.
Initially, using a FASTA file for the reference transcripts (--readFilesIn), only the first 30 transcripts were aligned. I then converted the FASTA file to a FASTQ file using reformat.sh from the BBMap suite of tools, setting the fake quality scores uniformly to 40, as follows:
reformat.sh in=gencode.v44.transcripts.fa.gz out1=gencode.v44.transcripts.fq.gz qfake=40
I tried again, running STAR with the parameters for long reads (--outFilterMismatchNmax 999 --alignIntronMin 20 --alignIntronMax 1000000). At this point, STAR flagged the first read in the FASTQ file as having a quality string whose length differs from that of the read sequence:
EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length
Manually checking this showed the file structure to be correct, as seen in the line lengths for the first record (the sequence on line 2 and the quality string on line 4 are both 1657 characters):
gunzip -c gencode.v44.transcripts.fq.gz | head -n4 | awk '{print length; }'
92
1657
1
1657
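The check above only covers the first record. To rule out a mismatch elsewhere in the file, a minimal sketch along these lines could compare the sequence and quality lengths of every record. It assumes strict 4-line FASTQ records (no wrapped sequences), and check_fastq_lengths is a hypothetical helper name, not part of STAR, BBMap or seqtk.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads FASTQ text on stdin and compares the length of
# the sequence line with the length of the quality line for every record.
# Assumes each record is exactly 4 lines (header, sequence, +, quality).
check_fastq_lengths() {
  awk '
    NR % 4 == 1 { header = $0 }            # line 1 of the record: @header
    NR % 4 == 2 { seqlen = length($0) }    # line 2: sequence
    NR % 4 == 0 {                          # line 4: quality string
      if (length($0) != seqlen) {
        printf "mismatch at record %d (%s): seq=%d qual=%d\n", NR / 4, header, seqlen, length($0)
        bad++
      }
    }
    END { printf "%d records checked, %d mismatches\n", NR / 4, bad + 0 }
  '
}

# Usage with the file from this post:
# gunzip -c gencode.v44.transcripts.fq.gz | check_fastq_lengths
```

If this reports zero mismatches, the FASTQ file itself is consistent and the problem is more likely in how the reads are being parsed.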
I also tried using seqtk for the conversion from .fa to .fq and received the same error, which leads me to think that something else is going wrong and the error isn't what it appears to be. I'm not sure whether it's a bug or whether I'm missing something.
Has anyone else tried this, and is there a way to perform this task?
Thanks in advance,
Stevie