I have created a script to run STAR on my samples looking at both mRNAs and circRNAs.
I ran the script on some samples where the libraries were polyA enriched (as I was only looking at the mRNAs), but I had left the option '--chimSegmentMin 10' in my script for the second mapping.
I noticed from the Log.final.out stats that about 5% of the reads were coming out as chimeric reads, e.g.:
Number of input reads | 64320253
Average input read length | 177
UNIQUE READS:
Uniquely mapped reads number | 55240784
Uniquely mapped reads % | 85.88%
Average mapped length | 175.14
Number of splices: Total | 20634301
Number of splices: Annotated (sjdb) | 20464285
Number of splices: GT/AG | 20429756
Number of splices: GC/AG | 180974
Number of splices: AT/AC | 14807
Number of splices: Non-canonical | 8764
Mismatch rate per base, % | 0.30%
Deletion rate per base | 0.01%
Deletion average length | 1.67
Insertion rate per base | 0.00%
Insertion average length | 1.44
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 4467520
% of reads mapped to multiple loci | 6.95%
Number of reads mapped to too many loci | 22404
% of reads mapped to too many loci | 0.03%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 7.06%
% of reads unmapped: other | 0.07%
CHIMERIC READS:
Number of chimeric reads | 3362030
% of chimeric reads | 5.23%
When I ran your 'filterCirc.awk' script on this output, around 45000 came out. These are all false positives, right, since we wouldn't expect any circRNAs in a polyA library?
I'm worried about how high this number is, given that I still need to run the actual circRNA analysis.
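
To put the numbers above in proportion, here is a minimal Python sketch of the arithmetic. It assumes the 'key | value' layout of the Log.final.out excerpt pasted above, and it takes the filterCirc.awk count as a flat 45000, which is only the approximate figure I quoted. The function name parse_star_log and the variable names are just illustrative.

```python
def parse_star_log(text):
    """Parse 'key | value' lines from a STAR Log.final.out into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip section headers such as 'CHIMERIC READS:'
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


# Excerpt of the stats pasted above; a real run would read the whole file instead.
log_excerpt = """\
Number of input reads | 64320253
CHIMERIC READS:
Number of chimeric reads | 3362030
% of chimeric reads | 5.23%
"""

stats = parse_star_log(log_excerpt)
n_input = int(stats["Number of input reads"])
n_chimeric = int(stats["Number of chimeric reads"])

# Assumption: approximate count of candidates left after filterCirc.awk.
n_circ_candidates = 45000

chimeric_pct = 100.0 * n_chimeric / n_input
circ_pct_of_chimeric = 100.0 * n_circ_candidates / n_chimeric
circ_pct_of_input = 100.0 * n_circ_candidates / n_input

print(f"chimeric reads / input reads:      {chimeric_pct:.2f}%")
print(f"circ candidates / chimeric reads:  {circ_pct_of_chimeric:.2f}%")
print(f"circ candidates / input reads:     {circ_pct_of_input:.2f}%")
```

So the roughly 45000 candidates are a small fraction of the chimeric reads (a bit over 1%) and well under 0.1% of all input reads, but that is still the background level I would like to understand before running the real circRNA samples.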