Running the pipeline

Neha Joshi

unread,

May 3, 2022, 1:26:41 PM5/3/22

to Smart-3SEQ

Hi,

I am an analyst at University of Chicago, in Dr Luis Barreiro lab. I am trying to follow the same analysis pipeline for the smart 3-seq that was sequenced in our lab. However, I am getting the following error while running the align_smart-3seq.sh script on my samples:

Could not retrieve index file for '/project2/lbarreiro/DATA/analysis/Joao_redo/alignment_2/output_Phix30-i5_S10_R1_001.bam_tmp'
Traceback (most recent call last):
File "/project2/lbarreiro/DATA/analysis/Joao/unzippedfiles/umi-dedup-master/dedup.py", line 50, in <module>
for alignment in dup_marker:
File "/project2/lbarreiro/DATA/analysis/Joao/unzippedfiles/umi-dedup-master/lib/markdup_sam.py", line 83, in __next__
return next(self.output_generator)
File "/project2/lbarreiro/DATA/analysis/Joao/unzippedfiles/umi-dedup-master/lib/markdup_sam.py", line 227, in get_marked_alignment
alignment = umi_data.set_umi(alignment, truncate = self.truncate_umi)
File "/project2/lbarreiro/DATA/analysis/Joao/unzippedfiles/umi-dedup-master/lib/umi_data.py", line 98, in set_umi
if umi is None: umi = parse_umi(alignment.query_name, truncate)
File "/project2/lbarreiro/DATA/analysis/Joao/unzippedfiles/umi-dedup-master/lib/umi_data.py", line 95, in parse_umi
raise RuntimeError('read name %s does not contain UMI in expected Casava/bcl2fastq format' % label)
RuntimeError: read name A00639:1070:H2FJ7DRX2:2:2228:30770:5149:ATGCA:GGGGG:ACTGG does not contain UMI in expected Casava/bcl2fastq format

It creates the alignment log but the bam and the bai files are empty.

This is what I did: After unzipping the fastq files , I ran the extract_umi.py script and then ran the umi_homopolymer.py script. This output was used as input for the align_smart-3seq.sh script.

I also tried the other way to run umi_homopolymer.py first and then extract_umi.py but that didn't work out and gave errors.

Can you please guide me on how to run the pipeline and if there is something that I am doing wrong? I really appreciate the help!!

Thanks!!

Joe Foley

unread,

May 3, 2022, 1:38:42 PM5/3/22

to smart...@googlegroups.com

"extract_umi.py" and "umi_homopolymer.py" are two versions of the same script so you should run one or the other but not both. "umi_homopolymer.py" moves the UMI into the read name and trims (or at least measures) the trailing poly(A), while "extract_umi.py" is a replacement in the development version of the pipeline that moves the UMI into the read name but doesn't handle the trailing poly(A) since that can be done by the aligner instead.

"align_smart-3seq.sh" includes this step, so you actually extracted UMIs three times, which is why your read name has three UMIs instead of one. That shell script is meant to be run on FASTQ files whose only preprocessing was removal of the sequencing adapters, by bcl2fastq or Cutadapt or equivalent (the replacement scripts in the development version include that step as well).

--
You received this message because you are subscribed to the Google Groups "Smart-3SEQ" group.
To unsubscribe from this group and stop receiving emails from it, send an email to smart-3seq+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/smart-3seq/7b71ab49-31c6-4969-93be-05daa22b9743n%40googlegroups.com.

OpenPGP_signature

Neha Joshi

unread,

May 4, 2022, 2:18:29 PM5/4/22

to Smart-3SEQ

Hi Dr Foley,

Thank you so much for your help! I am running it now.

Thanks !

Neha Joshi

unread,

May 17, 2022, 11:15:34 AM5/17/22

to Smart-3SEQ

Hi Dr Foley,

I have one more question regarding the alignment step. Are these bams ready to be used for the read counting step? Or do we have to remove the duplicates with a separate step? Because when I look at the dedup.log , it shows that a different number for usable alignments read than the one being used as input reads in align.log.

Thank you!

Best,

Neha

Joe Foley

unread,

May 17, 2022, 12:01:46 PM5/17/22

to smart...@googlegroups.com

That's a very good question and I don't have a very good answer. The short version is: if you ran the pipeline with the deduplicator and got dedup.log files, duplicates are already marked (not removed) in the BAM files and you can choose whether to exclude them in featureCounts or equivalent, but I recommend you include the duplicates. The dedup.log files are mainly useful for QC metrics (mainly estimated library size). That's why the deduplicator is optional.

Duplicate removal is common in other applications like whole-genome sequencing, when it's very unlikely that you'll ever see two identical DNA fragments by chance so you can safely assume all duplicates are PCR duplicates, but with RNA-seq in general and especially 3'-end sequencing all your sequenceable fragments are concentrated in a very small number of sites in the genome, so fragments with identical start sites will often occur before library construction. Adding a UMI before PCR helps distinguish PCR duplicates from library duplicates, but our UMIs are only 5 nt (longer ones create other problems) so there's theoretically a 1 in 1024 probability of two molecules also having the same UMI by chance, and realistically the probability is much higher because UMI sequences vary widely in their efficiency at becoming usable library molecules. So the assumption that all duplicates with the same UMI are PCR duplicates also does not hold.

If you have false positives in your duplicate removal (incorrectly marking some reads as PCR duplicates when they were actually real duplicate RNA fragments), they will affect the results unevenly: more-abundant transcripts will have disproportionately more false positives. Transcript abundance spans several orders of magnitude so this effect is quite pronounced: the usual method of duplicate removal makes the results substantially less accurate (bends the standard curves from ERCC data). I tested some algorithms to recover true library duplicates that have the same UMI by chance and the best one ("weighted_average2" in the software, described in the paper, an extension of a previous idea from Fumiaki Katagiri) merely removed the harm of deduplicating but didn't actually achieve any benefit. So I've never observed a benefit of removing duplicates for RNA quantification with Smart-3SEQ, only a spectrum from destructive to dubiously harmless.

I gave up on this years ago but it's possible someone else has since come up with an algorithm that does more good than harm with low-diversity sequence libraries, short UMIs, and widely varying coverage. If you're aware of any, please let me know and I'll test them.

To view this discussion on the web visit https://groups.google.com/d/msgid/smart-3seq/46b17a74-f1b5-4b19-9329-f0cad9ddc8dcn%40googlegroups.com.

OpenPGP_signature

Reply all

Reply to author

Forward