Unable to get R script to work from inside Nextflow, but Rscript works fine on it's own

3,104 views
Skip to first unread message

Carlos Guzman

unread,
Sep 12, 2016, 9:09:07 PM9/12/16
to Nextflow
I have an R script I am trying to run from inside Nextflow. It contains a standard DGE analysis that I'd like to automate. Whenever I run the script by itself, it works just as it is supposed to.

Whenever I attempt to run the script using the exact same parameters inside of Nextflow, I get an error. I've been attempting to fix this for days now, and I haven't been successful so i'm hoping somebody here can help me out.

This is my Nextflow Script:

// salmon index transcriptome
process salmon_index
{

    input
:
    file
(transcriptome_file)

    output
:
    file
("transcriptome.index") into transcriptome_index

    script
:
   
"""
    salmon --no-version-check index -t ${transcriptome_file} -p ${params.threads} -i transcriptome.index
    """

}

//use salmon to psuedo-align for DGE
process salmon_quant
{

    publishDir
'results/rnaseq/dge', mode: 'copy'

    input
:
   
set id, file(read1_trimmed_fastq), file(read2_trimmed_fastq), condition from trimmedfastqs2
    file transcriptome_index
from transcriptome_index.first()

    output
:
    file
"salmon_${id}" into salmon_out_dirs

    script
:
   
"""
    salmon --no-version-check quant -i transcriptome.index -l IU -1 <(gunzip -c ${read1_trimmed_fastq}) -2 <(gunzip -c ${read2_trimmed_fastq}) -p ${params.threads} -o salmon_${id}
    """

}

//conduct DGE analysis using DESeq2

process differential_gene_expression
{

    input
:
    file
'salmon_*' from salmon_out_dirs.toSortedList()
    file exp_file

    output
:
    file
"rnaseq_dge_report.html"

    script
:
   
"""
    Rscript '$baseDir/bin/dge.R' results/rnaseq/dge ${exp_file}
    """

}

Here is my R script:

#!/usr/bin/env Rscript

# CIPHER DGE Script
# Input from CIPHER is several folders under a single directory located
# in results/rnaseq/salmon ... folders are named using ID from CIPHER
# ARGUMENTS: 1 -> directory of salmon folders
# 2 -> experiment file

# Import libraries
library
("DESeq2")
library
("tximport")
library
("readr")
library
("ReportingTools")
library
("AnnotationDbi")
library
("org.Hs.eg.db")
library
("biomaRt")

# Set up variable to control command line arguments
args
<- commandArgs(TRUE)

# Set up variable for the directory holding our sample folders
sample_id
<- dir(args[1])
sample_id

# Set up variable to hold folder directory names
salmon_dirs
<- sapply(sample_id, function(id) file.path(args[1], id, "quant.sf"))
salmon_dirs

# Read in experiment file
exp_file
<- read.table(args[2], header = TRUE, stringsAsFactors=FALSE)

# Set up tximport annotation base
library
(EnsDb.Hsapiens.v79)
txdf
<- transcripts(EnsDb.Hsapiens.v79, return.type="DataFrame")
tx2gene
<- as.data.frame(txdf[,c("tx_id", "gene_id")])

# Import files using tximport
txi
.salmon <- tximport(salmon_dirs, type = "salmon", tx2gene = tx2gene, reader = read_tsv)

# Create a sampleTable of conditions
sampleTable
<- data.frame(condition = factor(exp_file$condition))
rownames
(sampleTable) <- colnames(txi.salmon$counts)

# Create DESeqDataset from Tximport
dds
<- DESeqDataSetFromTximport(txi.salmon, sampleTable, ~condition)
dds
<- DESeq(dds)
rld
<- rlog(dds)

# Generate RESULTS from DESeq2 object
res
<- results(dds)
res
.05 <- results(dds, alpha=.05)
resSig
<- subset(res, padj < 0.05)
resSig
<- as.data.frame(resSig)
resSig$ensemblID
<- rownames(resSig)

# Annotation of results
res$symbol
<- mapIds(org.Hs.eg.db, keys=row.names(res),
                     column
="SYMBOL",
                     keytype
="ENSEMBL",
                     multiVals
="first")

resSig$symbol
<- mapIds(org.Hs.eg.db, keys=row.names(res),
                        column
="SYMBOL",
                        keytype
="ENSEMBL",
                        multiVals
="first")

# Use BioMart to Add Ensembl Gene Annotations
mart
<- useMart("ensembl", dataset="hsapiens_gene_ensembl")

add
.anns <- function(df, mart, ...) {
  nm
<- rownames(df)
  anns
<- getBM(
    attributes
= c("ensembl_gene_id", "hgnc_symbol", "description"),
    filters
= "ensembl_gene_id", values = nm, mart = mart)
  anns
<- anns[match(nm, anns[, 1]), ]
  colnames
(anns) <- c("ID", "Gene Symbol", "Gene Description")
  df
<- cbind(anns, df[, -1, drop=FALSE])
  rownames
(df) <- nm
  df
}

# Generate visualization plots
maplot
<- plotMA(res.05, ylim=c(-5,5))
pcaplot
<- plotPCA(rld, intgroup="condition")

# Generate an HTML report
htmlRep
<- HTMLReport(shortName = "rnaseq_dge_report",
                      title
= "CIPHER RNA-seq Report",
                      reportDirectory
= "./reports")

publish
("Table of top 1000 differentially expressed genes and their associated individual gene counts (boxplots)", htmlRep)
publish
(dds, htmlRep, factor = colData(dds)$condition,
       
.modifyDF = list(add.anns, modifyReportDF), mart = mart,
        reportDir
="./reports")

publish
("Table of all differentially expressed genes with a p adjusted value of < 0.05", htmlRep)
publish
(resSig, htmlRep)

finish
(htmlRep)

And here is the error that keeps being produced:

[7a/183c7e] Submitted process > differential_gene_expression (1)
Error executing process > 'differential_gene_expression (1)'

Caused by:
 
Process 'differential_gene_expression (1)' terminated with an error exit status

Command executed:

 
Rscript '/dataA/code/cipher-nf/bin/dge.R' results/rnaseq/dge experiment.txt

 
Error in file.exists(files) : invalid 'file' argument
 
Calls: tximport -> stopifnot -> file.exists
 
Execution halted


Officially at a loss. Any help is appreciated.

Robert Syme

unread,
Sep 12, 2016, 9:58:26 PM9/12/16
to Nextflow
Hi Carlos

It looks to me like a mismatch in where the R script expects to find the input files (the salmon outputs) and where you're asking Nextflow to put the input files.

The R script is reading in the input files at this line:

sample_id <- dir(args[1])

The problem might be that args[1] contains the string 'results/rnaseq/dge' but it looks like you're supplying the output directories from the salmon runs as directories that start as 'salmon_*':

file 'salmon_*' from salmon_out_dirs.toSortedList()

You can check to see the directory structure by  running:

    ls -lh 7a/183c7e*

P.S. The $baseDir/bin directory is automatically added to the $PATH, so if you make your R script executable with something like `chmod +x bin/dge.R`, you can simplify the differential_gene_expression process from:

    "Rscript '$baseDir/bin/dge.R' results/rnaseq/dge ${exp_file}"
to 
    "dge.R results/rnaseq/dge ${exp_file}"

Carlos Guzman

unread,
Sep 12, 2016, 10:09:01 PM9/12/16
to Nextflow
Thanks for the help Robert! I figured that this was likely the issue. I've unable to address it though. When I look in my work folder, my directories are showing up as "salmon_1" "salmon_2" salmon_3" etc. 

Any recommendations on how to alter the nextflow script so that the files are read correctly?

Thanks for the tip about the Rscript though! 

Carlos Guzman

unread,
Sep 13, 2016, 12:41:52 AM9/13/16
to Nextflow
Was able to fix the issue by setting `file 'salmon/*' from salmon_out_dirs.toSortedList()` and setting the DGE script to  `"dge.R salmon ${exp_file}"`. Thanks everyone!

Robert Syme

unread,
Oct 4, 2016, 11:33:15 PM10/4/16
to Nextflow
Oh great! Sorry for not following up on this. Totally fell off my radar (I need a better system). Glad you found a solution!
-r
Reply all
Reply to author
Forward
0 new messages