Piping SRA data through fastq-dump directly to STAR?

Darren Tyson

Aug 5, 2016, 4:34:56 PM

to rna-star

Hi all,

Just started using STAR recently (awesome software!) and decided I should join the group so I can learn more about what the app does and is capable of doing. I've written a shell script to parse through our data files and STAR handles them like a champ. (I'm happy to post the script if anyone is interested.)

Because these were our own data, the files were local. Now, however, I would like to tap into the vast array of publicly available data, particularly the data available through NCBI's SRA portal. These are compressed files that can be pulled from their remote location and converted to FASTQ format using the SRA Toolkit program fastq-dump, which can save files locally or write them to stdout. I was wondering if anyone has been able to (or if it is possible to) replace the --readFilesIn argument with the SRA ID and use fastq-dump as the --readFilesCommand argument? I tried formatting the STAR call multiple ways, to no avail. Below is one example:

STAR \
  --genomeDir ../RNAseq_analysis/STAR_analysis/STAR_ref_genome \
  --sjdbGTFfile /Volumes/Wade/RNAseq_analysis/Ref_genome/gencode.v24.annotation.gtf \
  --runThreadN 8 \
  --outFileNamePrefix ../RNAseq_analysis/STAR_analysis/STAR_aligned/SRR2968938_ \
  --readFilesIn ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR2968/SRR2968921/SRR2968921.sra \
  --readFilesCommand fastq-dump -z \
  --outSAMtype BAM Unsorted \
  --outReadsUnmapped Fastx \
  --outSAMmode Full \
  --quantMode TranscriptomeSAM GeneCounts

I know I could just download the SRA file using fastq-dump to save it locally and then run STAR on the local file, but I'd like to set this up so I could process many files through our cluster and there is no reason to keep a local copy of the original (very large) data file.

Any help would be appreciated. Thanks!

Darren

Alexander Dobin

Aug 10, 2016, 5:56:59 PM

to rna-star

Hi Darren,

You can do it using FIFO files (named pipes), e.g.:

$ mkfifo Read_1

$ mkfifo Read_2

$ fastq-dump --split-spot --stdout SRR768411 | awk '{print > "Read_" ( (NR-1)%8 < 4 ? 1 : 2 ) }' &

$ STAR ....... --readFilesIn Read_1 Read_2

It works on my system, but I did not test it thoroughly.
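
For anyone who wants to try the FIFO approach without downloading anything, here is a minimal sketch of the same de-interleaving idea. It assumes the layout that the awk command above relies on: each spot arrives as mate 1 (4 lines) followed by mate 2 (4 lines). The function fake_dump and the output names mate1.fastq and mate2.fastq are hypothetical stand-ins for fastq-dump and STAR, and the two background cat processes only imitate a consumer that reads both pipes at the same time.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so the pipes do not clash with other runs.
workdir=$(mktemp -d)
cd "$workdir"

# Named pipes that stand in for the two mate files.
mkfifo Read_1 Read_2

# Hypothetical stand-in for `fastq-dump --split-spot --stdout <run>`:
# two spots, each written as mate 1 (4 lines) then mate 2 (4 lines).
fake_dump() {
    for spot in 1 2; do
        printf '@spot%s/1\nACGT\n+\nIIII\n' "$spot"
        printf '@spot%s/2\nTGCA\n+\nIIII\n' "$spot"
    done
}

# Hypothetical stand-ins for STAR. Both pipes need a concurrent reader,
# otherwise the writer blocks when it opens the second pipe.
cat Read_1 > mate1.fastq &
cat Read_2 > mate2.fastq &

# Lines 1-4 of every 8-line group go to Read_1, lines 5-8 to Read_2.
# The output file name is parenthesized so every awk parses it the same way.
fake_dump | awk '{ print > ("Read_" ((NR-1)%8 < 4 ? 1 : 2)) }'

# Wait for both readers to see end-of-file, then remove the pipes.
wait
rm Read_1 Read_2
```

In a real run the two cat lines would be replaced by the single STAR call with --readFilesIn Read_1 Read_2, and fake_dump by the actual fastq-dump command, started in the background before STAR as in the commands above.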

Cheers

Alex

Varun Gupta

Aug 10, 2016, 6:03:29 PM

to rna-star

Hi Darren,
Do you run your script as a job array for different SRA files?

Thanks

Regards
Varun


Darren Tyson

Aug 13, 2016, 12:25:28 PM

to rna-star

Thanks Alex and Varun.

I'm gearing up to run this on our compute cluster but haven't implemented it yet. The mkfifo option looks great. I'll test it locally, then see what I need to do to batch it through the cluster.