Hi Bo,
I created a truncated SAM file by concatenating the entire header
section with the first 1000 lines of the alignment portion of the file.
When I ran rsem-calculate-expression on it, I did get isoforms.results
and genes.results files, but again nearly all of the expected read
counts were 0 (this was the original problem I posted to start this
thread).
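For reference, the truncation I described can be sketched roughly as
below. This is only an illustration, not the exact commands I ran: the
function name truncate_sam and the file names in the usage comment are
made up, and it assumes header lines are the ones starting with "@" at
the top of the file.

```python
def truncate_sam(lines, max_alignments=1000):
    """Yield every header line plus the first `max_alignments` alignment lines.

    Assumes SAM header lines start with '@' and precede all alignments.
    """
    kept = 0
    for line in lines:
        if line.startswith("@"):
            # Header line: always keep.
            yield line
        elif kept < max_alignments:
            # Alignment line within the requested limit.
            yield line
            kept += 1
        else:
            # Limit reached; nothing further is needed.
            break


# Hypothetical usage (file names are placeholders):
# with open("full.sam") as src, open("truncated.sam", "w") as dst:
#     dst.writelines(truncate_sam(src, max_alignments=1000))
```

One caveat with a fixed line count like this: the cut can fall in the
middle of a read's set of alignments, so the last read in the truncated
file may be incomplete.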
When I looked line by line at the truncated SAM file, I noticed that
many of the contigs were short and contained regions duplicated in
other contigs. In case this was causing a problem, I tried a second
run, where I filtered my original reference FASTA (i.e. my de novo
assembly) to retain only those sequences >1000nt. I ran
rsem-prepare-reference on this filtered reference, and then ran
rsem-calculate-expression. Again, rsem-run-em ran forever. I then went
through the same truncation procedure. When I ran
rsem-calculate-expression with the new truncated SAM, the .results
files were created, but again ~99% of 'genes' showed 0 expected counts,
and the other ~1% had an expected count of 1.
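The length filter on the reference amounts to something like the
following. Again this is a sketch rather than what I actually ran: the
helper names read_fasta and filter_by_length and the file names are
made up, and it treats ">1000nt" as strictly longer than 1000.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            # Sequence may be wrapped over several lines; collect the pieces.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=1000):
    """Keep only records whose sequence is strictly longer than `min_length`."""
    for header, seq in records:
        if len(seq) > min_length:
            yield header, seq


# Hypothetical usage (file names are placeholders):
# with open("assembly.fa") as src, open("assembly.gt1000.fa", "w") as dst:
#     for header, seq in filter_by_length(read_fasta(src)):
#         dst.write(header + "\n" + seq + "\n")
```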
The reference was assembled from this same set of reads, so it is
strange to me that I would get so few expected counts. Though I
didn't see alignment stats in the output from the run with the
truncated SAM, the run with the full filtered (>1000nt) reference
showed that about 24 million reads aligned:
# reads processed: 61349468
# reads with at least one reported alignment: 24294996 (39.60%)
# reads that failed to align: 37025542 (60.35%)
# reads with alignments suppressed due to -m: 28930 (0.05%)
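As a quick sanity check on those numbers, the three categories should
add up to the number of reads processed, and the percentages should
match. A small sketch, with the counts copied from the output above
(the variable names and the pct helper are just mine):

```python
# Counts copied from the alignment summary above.
processed = 61349468
aligned = 24294996
failed = 37025542
suppressed = 28930


def pct(n, total):
    """Return n as a percentage of total."""
    return 100.0 * n / total


# The three categories should account for every processed read.
total_accounted = aligned + failed + suppressed
print("accounted for:", total_accounted, "of", processed)

for label, n in [("aligned", aligned),
                 ("failed to align", failed),
                 ("suppressed by -m", suppressed)]:
    print("%s: %d (%.2f%%)" % (label, n, pct(n, processed)))
```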
I will send you the filtered, truncated SAM file and the
ref.transcripts.fa file in case you'd like to look at them. Please let
me know if you have any thoughts.
Thanks,
Aman (pathunk)
> >> >> > Parsed 65000000 entries...