Merging datasets produced in different sequencing runs with asymmetric coverage


Tiago Antão

May 17, 2017, 3:22:59 PM
to Qiime 1 Forum
Hi,

I have two datasets:
1. low-coverage, multiplexed paired-end reads
2. high-coverage, de-multiplexed paired-end reads

Reads are the same length in both studies.

When I run dataset 1 alone, I get a median of circa 20,000 OTU reads per sample. But if I merge the datasets (i.e. concatenate seqs.fna from both and call pick_open_reference_otus.py), this drops to 7,000 OTU reads per sample (as per the BIOM file). The high-coverage study has many more reads.

I am a bit lost as to the cause of this. Should I subsample the higher-coverage study to correct any bias, or am I doing something blatantly wrong? Or maybe merging the datasets is a bad idea altogether?

Any help would be most appreciated.

Thanks,
Tiago

Jai Ram Rideout

May 17, 2017, 7:11:03 PM
to Qiime 1 Forum
Hi Tiago,

Your approach generally sounds correct -- you should be able to combine datasets from multiple sequencing runs and provide the combined dataset to pick_open_reference_otus.py. However, combining sequencing runs can be tricky -- you'll want to be wary of any run/batch effects in your results (e.g. seeing differences in your samples based on what sequencing run they came from).

1. Did you demultiplex both datasets and combine the seqs.fna files with the cat command?

2. Can you run validate_demultiplexed_fasta.py on each demultiplexed dataset (i.e. on each of your seqs.fna files) as well as on the combined seqs.fna? That'll do some sanity checks that may give us some hints as to what's going on.
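
As a quick cross-check alongside the validation script, here is a minimal Python sketch (not part of QIIME) that counts reads per sample in a demultiplexed seqs.fna. It assumes sequence labels of the form `>SampleID_N ...`, where N is a running sequence number, so the sample ID is taken to be everything before the last underscore of the first token. The function names are hypothetical.

```python
from collections import Counter
from statistics import median


def sample_id_from_label(header_line):
    """Return the sample ID from a FASTA header line.

    Assumption: labels look like '>SampleID_N ...', so the sample ID is
    everything before the last '_' in the first whitespace-delimited token.
    """
    label = header_line[1:].split()[0]
    return label.rsplit("_", 1)[0]


def reads_per_sample(lines):
    """Count FASTA records per sample from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            counts[sample_id_from_label(line)] += 1
    return counts


def summarize(path):
    """Print sample count, total reads, and median reads per sample."""
    with open(path) as handle:
        counts = reads_per_sample(handle)
    if counts:
        print(path, "samples:", len(counts),
              "reads:", sum(counts.values()),
              "median reads per sample:", median(counts.values()))
    return counts
```

Calling `summarize()` on each seqs.fna and on the concatenated file shows how many reads each sample has before OTU picking. Those numbers can then be compared with the per-sample counts in the BIOM table.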

Best,
Jai

Tiago Antão

May 18, 2017, 8:53:24 PM
to Qiime 1 Forum
Hi,

I did the cat as you say.
One of the projects was verified with validate_demultiplexed_fasta.py.
The other one was not, because it came out of the sequencer already demuxed, so there is no mapping file (two files per sample, paired data).

The headers look like this (for the demuxed data):
>Sample29_S29_L001_R1_001_0 M02585:44:000000000-ARTCH:1:1101:19866:1629 1:N:0:29 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0

So it makes some sense to me. Also, I do get the correct sample names in the BIOM file.

For the dataset that was already demuxed, I used:
multiple_split_libraries_fastq.py -i join_pairs/ -o seqs --demultiplexing_method sampleid_by_file --include_input_dir_path --remove_filepath_in_name


Thanks a lot for your help,
Tiago

Jai Ram Rideout

May 19, 2017, 6:31:32 PM
to Qiime 1 Forum
Hi Tiago,

Can you run validate_demultiplexed_fasta.py on the seqs.fna file you received from multiple_split_libraries_fastq.py? You can create a simple mapping file for those samples. And can you also run validate_demultiplexed_fasta.py on the combined (concatenated) seqs.fna file?
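
A minimal Python sketch of one way to build such a mapping file from the sample IDs already present in seqs.fna. The column names follow the standard QIIME 1 mapping file header. Leaving the barcode and primer columns empty is an assumption that suits data with no barcodes, and it may need adjusting if the validation scripts reject it. The function names are hypothetical, and sample IDs are parsed as everything before the last underscore of the sequence label.

```python
def sample_ids_from_fasta(lines):
    """Collect the distinct sample IDs from FASTA header lines.

    Assumption: labels look like '>SampleID_N ...'.
    """
    ids = set()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            label = line[1:].split()[0]
            ids.add(label.rsplit("_", 1)[0])
    return ids


def minimal_mapping_lines(sample_ids):
    """Build the lines of a tab-separated mapping file, one row per sample."""
    header = ["#SampleID", "BarcodeSequence",
              "LinkerPrimerSequence", "Description"]
    rows = ["\t".join(header)]
    for sid in sorted(set(sample_ids)):
        # Barcode and primer are left empty (assumption: data has none).
        rows.append("\t".join([sid, "", "", sid]))
    return rows


def write_mapping(fasta_path, mapping_path):
    """Read a seqs.fna and write a minimal mapping file for its samples."""
    with open(fasta_path) as handle:
        ids = sample_ids_from_fasta(handle)
    with open(mapping_path, "w") as out:
        out.write("\n".join(minimal_mapping_lines(ids)) + "\n")
```

For example, `write_mapping("seqs/seqs.fna", "fake_map.txt")` would produce a file that can be passed to validate_demultiplexed_fasta.py with `-m`.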

Best,
Jai

Tiago Antão

Jun 20, 2017, 12:38:00 PM
to Qiime 1 Forum
Hi Jai and all,

First, my apologies for the long delay in answering.
Let me restate my problem:

I have two datasets:
1. low-coverage, multiplexed paired-end reads
2. high-coverage, de-multiplexed paired-end reads

Reads are the same length in both studies.

When I run dataset 1 alone, I get a median of circa 20,000 OTU reads per sample. But if I merge the datasets (i.e. concatenate seqs.fna from both and call pick_open_reference_otus.py), this drops to 5,000 OTU reads per sample (as per the BIOM file). The high-coverage study has many more reads.

As you suggested, I have run validate_demultiplexed_fasta.py and all seems fine (I also made a fake mapping file for the high-coverage data).

The gist of the script is as follows. For the multiplexed data:
join_paired_ends.py -f R1.fastq.gz -r R2.fastq.gz -b i1.fastq.gz -o join
split_libraries_fastq.py -i join/fastqjoin.join.fastq -b join/fastqjoin.join_barcodes.fastq -m Map.txt -o seqs --rev_comp_mapping_barcodes
pick_open_reference_otus.py -o otus -i seqs/seqs.fna -p ../params

For the de-multiplexed data:
multiple_join_paired_ends.py -p ../params -i ../data/ -o join_pairs
multiple_split_libraries_fastq.py -i join_pairs -o seqs --demultiplexing_method sampleid_by_file --include_input_dir_path --remove_filepath_in_name
pick_open_reference_otus.py -o otus -i seqs/seqs.fna -p ../params


After this, I cat the two seqs.fna files and run pick_open_reference_otus.py on the concatenation.
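
One thing that can be checked before concatenating is whether any sequence labels occur in both seqs.fna files, since the two runs were numbered independently. This is a minimal Python sketch with hypothetical function names. It only reports overlaps and does not by itself explain the drop in reads per sample.

```python
def fasta_labels(lines):
    """Yield the label (first token after '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def shared_labels(lines_a, lines_b):
    """Return the sorted labels that occur in both inputs."""
    return sorted(set(fasta_labels(lines_a)) & set(fasta_labels(lines_b)))


def check_files(path_a, path_b):
    """Report how many sequence labels two FASTA files have in common."""
    with open(path_a) as a, open(path_b) as b:
        common = shared_labels(a, b)
    print(len(common), "labels occur in both files")
    return common
```

An empty result from `check_files()` on the two seqs.fna files means no label from one run repeats in the other.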

Any help explaining the disparity would be appreciated.

Thanks,
Tiago