make_final_table.R fails because blat uses transcript-specific contigs

51 views

Skip to first unread message

peter....@gmail.com

unread,

Apr 4, 2018, 9:31:59 AM4/4/18

to jaffa-project

Hi -

I'm using jaffa v.1.09 on ubuntu 17.10, and am using the version that I downloaded from github, although I also installed jaffa via bioconda (see below for more info). Jaffa runs fine until the make_final_table.R script is run

The table generated by blat contains non-standard contig names, i.e. it doesn't use chr1, chr2, etc. Instead, the table it generates contains contig names like the following:

hg38_wgEncodeGencodeCompV22_ENST00000615060.3__range=chr6:73515752-73521032__5'pad=0__3'pad=0__strand=-__repeatMasking=none

It appears that this is causing problems when the make_final_table.R script tries to map these contigs to the transTable data.frame.

Since I installed this by downlaoding from the github site, and building it using the install_linux64.sh script, the version of blat that I'm using (the one that the script downloads is the following:

blat - Standalone BLAT v. 36x2

Since I also installed JAFFA via bioconda, I looked up the version of blat that you used with the bioconda, and it appears that it uses blat version 35. So, to see if there was a versioning issue with blat, I substituted that in the tools.groovy file; however, I got the same behavior when I tried using the bioconda version of blat as well.

Below is the log of the output from Jaffa:

bpipe run -n 15 /home/local/ARCS/pw2470/bin/jaffa/1.09/JAFFA_direct.groovy /home/local/ARCS/pw2470/isilon/home/PROJECTS/fusionTesting/jaffa/CUAC1854/CUAC1854_S2_L004_R*_001.fastq.gz
====================================================================================================
| Starting Pipeline at 2018-04-03 17:50 |
====================================================================================================

========================================= Stage run_check ==========================================

================================== Stage prepare_reads (CUAC1854) ==================================
TrimmomaticPE: Started with arguments: -threads 15 -phred33 /home/local/ARCS/pw2470/isilon/home/PROJECTS/fusionTesting/jaffa/CUAC1854/CUAC1854_S2_L004_R1_001.fastq.gz /home/local/ARCS/pw2470/isilon/home/PROJECTS/fusionTesting/jaffa/CUAC1854/CUAC1854_S2_L004_R2_001.fastq.gz /data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/tempp1.fq /dev/null /data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/tempp2.fq /dev/null LEADING:0 TRAILING:0 MINLEN:35
Input Read Pairs: 80342599 Both Surviving: 80342599 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
80342599 reads; of these:
80342599 (100.00%) were paired; of these:
16417390 (20.43%) aligned concordantly 0 times
63925209 (79.57%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
79.57% overall alignment rate
16417390 reads; of these:
16417390 (100.00%) were paired; of these:
4911154 (29.91%) aligned concordantly 0 times
11506236 (70.09%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
70.09% overall alignment rate

================================== Stage get_unmapped (CUAC1854) ===================================
9822308 reads; of these:
9822308 (100.00%) were unpaired; of these:
7028136 (71.55%) aligned 0 times
2794172 (28.45%) aligned exactly 1 time
0 (0.00%) aligned >1 times
28.45% overall alignment rate
[bam_sort_core] merging from 2 files...
java -ea -Xmx200m -cp /home/local/ARCS/pw2470/bin/jaffa/1.09/tools/bbmap/current/ jgi.ReformatReads in=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/unmapped.fastq out=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/temp.fasta threads=15
Executing jgi.ReformatReads [in=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/unmapped.fastq, out=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/temp.fasta, threads=15]
Set threads to 15
Input is being processed as unpaired
Input: 7028136 reads 529971481 bases
Output: 7028136 reads (100.00%) 529971481 bases (100.00%)
Time: 3.130 seconds.
Reads Processed: 7028k 2245.37k reads/sec
Bases Processed: 529m 169.32m bases/sec
java -Djava.library.path=/home/local/ARCS/pw2470/bin/jaffa/1.09/tools/bbmap/jni/ -ea -Xmx49183m -Xms49183m -cp /home/local/ARCS/pw2470/bin/jaffa/1.09/tools/bbmap/current/ jgi.Dedupe sort=d in=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/temp.fasta out=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/CUAC1854.fasta threads=15 absorbcontainment=f
Executing jgi.Dedupe [sort=d, in=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/temp.fasta, out=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/CUAC1854.fasta, threads=15, absorbcontainment=f]
Set threads to 15
Initial:
Memory: max=49424m, free=48651m, used=773m
Found 2194215 duplicates.
Finished exact matches. Time: 4.667 seconds.
Memory: max=49424m, free=40914m, used=8510m
Input: 7028136 reads 529971481 bases.
Duplicates: 2194215 reads (31.22%) 165316730 bases (31.19%) 0 collisions.
Result: 4833921 reads (68.78%) 364654751 bases (68.81%)
Sorted output. Time: 1.945 seconds.
Memory: max=49424m, free=36968m, used=12456m
Printed output. Time: 2.531 seconds.
Memory: max=49424m, free=46202m, used=3222m
Time: 9.152 seconds.
Reads Processed: 7028k 767.98k reads/sec
Bases Processed: 529m 57.91m bases/sec

============================ Stage align_reads_to_annotation (CUAC1854) ============================
Using tileSize of 15
Loaded 297165649 letters in 195175 sequences
Searched 364654751 bases in 4833921 sequences
real 192m43.464s
user 184m54.839s
sys 0m3.386s

=============================== Stage filter_transcripts (CUAC1854) ================================
real 29m3.683s
user 27m50.832s
sys 0m2.887s

============================ Stage extract_fusion_sequences (CUAC1854) =============================
java -ea -Xmx200m -cp /home/local/ARCS/pw2470/bin/jaffa/1.09/tools/bbmap/current/ jgi.ReformatReads in=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/CUAC1854.fasta out=stdout.fasta fastawrap=0
Executing jgi.ReformatReads [in=/data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/CUAC1854.fasta, out=stdout.fasta, fastawrap=0]
Input is being processed as unpaired
Input: 4833921 reads 364654751 bases
Output: 4833921 reads (100.00%) 364654751 bases (100.00%)
Time: 2.746 seconds.
Reads Processed: 4833k 1760.54k reads/sec
Bases Processed: 364m 132.81m bases/sec

=========================== Stage align_transcripts_to_genome (CUAC1854) ===========================
Loaded 297165649 letters in 195175 sequences
Searched 274072 bases in 3629 sequences

============================= Stage make_simple_reads_table (CUAC1854) =============================
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> #options(echo=FALSE)
>
> options(stringsAsFactors=F)
>
> args = commandArgs(trailingOnly = TRUE)
>
> ## load the table of candidates
> candidates=read.delim(args[1],header=F)
>
> ## load the reference annotation which maps gene IDs to
> ref=read.delim(args[2],stringsAsFactors=F)[,c("name","name2")]
>
> ## load the actually gene ids as they are stored in the bam file
> ref_contigs=read.delim(args[3],header=F,stringsAsFactors=F)[,1]
>
> ## replace the contig ids in the ref table
> ## ^ENST(ensembl gene) ENST\d{11} ENST[:digit:]{11}
> getEnsemblTranscriptID <- function(x) {
+ y <- regexpr("ENST\\d{11}(.\\d+)?", x, ignore.case=FALSE)
+ substr(x, y, y+attr(y, "match.length")-1)
+ }
> names(ref_contigs)<-sapply(ref_contigs,getEnsemblTranscriptID)
> ref$name<-ref_contigs[ref$name]
> sref=split(ref$name,ref$name2)
> sref=sapply(sref,function(x){paste(x,collapse=" ")})
>
>
> #now make the table which need to be run on the .bam file
> genes=unique(candidates$V4)
> sgenes=lapply(strsplit(genes,":"),function(x){sort(x)})
> sgenes=unique(sgenes)
> g1=sapply(sgenes,function(x){ sref[[x[1]]] })
> g2=sapply(sgenes,function(x){ sref[[x[2]]] })
> genes_new=sapply(sgenes,function(x){ paste(x[1],x[2],sep=":") })
> write.table(data.frame(genes_new,g1,g2),args[4],sep="?",row.names=F,col.names=F,quote=F)
>
[main_samview] region "NA" specifies an unknown reference name. Continue anyway.
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> #options(echo=FALSE)
>
> options(stringsAsFactors=F)
> args = commandArgs(trailingOnly = TRUE)
>
> ## load the table of candidates
> candidates=read.delim(args[1],header=F)
>
> ## fix the ids to be alphabetically ordered
> candidates$V4=sapply(strsplit(candidates$V4,":"),function(x){ paste(sort(x),collapse=":") })
> colnames(candidates)<-c("transcript","break_min","break_max","fusion_genes","length")
> candidates$length<-NULL
> candidates$spanning_pairs<-0
> candidates$spanning_reads<-1
>
> ## load the count data
> sr=read.delim(args[2],header=F)
>
> for( n in 1:length(sr[,1])){
+ entries=candidates$fusion_genes==sr[n,1]
+ spanning_rs=sr[n,2]
+ if(spanning_rs>0){ #distribute evelyn amongst instances of this fusion in the list
+ scounts=table(rep_len(1:sum(entries),spanning_rs))
+ if(length(scounts)<sum(entries))
+ scounts=c(scounts,rep(0,sum(entries)-length(scounts)))
+ candidates$spanning_pairs[entries]<-scounts
+ }
+ }
>
> write.table(candidates,args[3],sep="\t",row.names=F,col.names=T,quote=F)
>

================================= Stage get_final_list (CUAC1854) ==================================
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> options(echo=FALSE)
[1] "Getting the location of fusion transcripts in the genome.."
Calculating gap size in the genome...
Checking if the fusions are in frame...
Merging with read coverage data...
Error in `$<-.data.frame`(`*tmp*`, known, value = "-") :
replacement has 1 row, data has 0
Calls: $<- -> $<-.data.frame
Execution halted
ERROR: Command failed with exit status = 1 :

/usr/bin/R --vanilla --args /data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/CUAC1854_genome.psl /data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/CUAC1854.reads /data/resources/TOOL_SPECIFIC/JAFFA//hg38_genCode22.tab /home/local/ARCS/pw2470/bin/jaffa/1.09/known_fusions.txt 10000 NoSupport,PotentialRegularTranscript /data/PROJECTS/NET/rnaseq/fusions/jaffa/CUAC1854/CUAC1854.summary < /home/local/ARCS/pw2470/bin/jaffa/1.09/make_final_table.R

========================================= Pipeline Failed ==========================================

One or more parallel stages aborted. The following messages were reported:

Branch CUAC1854 in stage Unknown reported message:

Command failed with exit status = 1 :

Use 'bpipe errors' to see output from failed commands.

peter....@gmail.com

unread,

Apr 4, 2018, 10:52:44 AM4/4/18

to jaffa-project

I've figured out what the issue is. The resource file that you provide for hg38 (https://drive.google.com/open?id=0B1FwZagazjpcd2RDVmtiazVWVk0) doesn't include the genome reference, so I accidentally specified the hg38_genCode22.fa as the genome fasta.

It would help if you clarified in the wiki how to set up the genome fasta along with the reference files that you provide. According the FAQ, I should be able to generate my own version if I want. However, it's not clear to me where we should download the reference file, relative to those from the resource directory. The instructions say that we should put it into JAFFA's bin directory, but it still isn't clear from the documentation whether or not we need to put all of the reference files into the bin directory as well.

It also insn't clear whether or not we should be specifying a "Masked" version of the reference genome, or the un-"Masked" version. Since the bowtie2 indices in the resource tarbal are for a Masked_hg38 genome, it would seem like we would want to use that one, rather than the "raw" version that we can download from UCSC.

For those of us who are interested in using your took, I think it would really behoove you to either include the Masked_hg38 reference in your resource kit, or clean up and/or update the sections on how to use them.

Reply all

Reply to author

Forward

0 new messages