post split fasta file import

Masanobu Hiraoka

May 12, 2020, 6:55:26 PM
to Qiime 1 Forum

Hi,


I am a beginner on the QIIME forum. I want to import my Sanger-sequenced data into the QIIME 2 pipeline. With the support of Q-tips colleagues, I came across this thread on the QIIME 2 forum:

https://forum.qiime2.org/t/fasta-files-importing/5380/3


In this thread, the suggestion was to "convert the FASTA to the QIIME 1 post split_libraries.py seqs.fna format and import as SampleData[Sequences]".
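For reference, the QIIME 1 post-split_libraries.py seqs.fna format labels every read as `>SampleID_<running count>`. A minimal Python sketch of writing that format by hand (the sample IDs and sequences below are invented for illustration):

```python
import os
import tempfile

# Toy sketch of the QIIME 1 post-split_libraries.py seqs.fna format:
# each read is labelled ">SampleID_<running count>".
# The sample data below is invented for illustration only.
samples = {
    "24wk01": ["ACGTACGT", "ACGTTTGT"],
    "24wk02": ["TTGACGGA"],
}

def write_post_split_fasta(samples, path):
    count = 0  # the running count is global across all samples
    with open(path, "w") as out:
        for sample_id, seqs in samples.items():
            for seq in seqs:
                out.write(">%s_%d\n%s\n" % (sample_id, count, seq))
                count += 1

fna_path = os.path.join(tempfile.mkdtemp(), "seqs.fna")
write_post_split_fasta(samples, fna_path)
```

The scripts discussed below (split_libraries.py, add_qiime_labels.py) produce this label scheme for you; the sketch only shows what the target format looks like.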


I have fasta files without barcodes or quality information, and I think I need to run the split_libraries.py command to make the .fna file. But I cannot figure out how to run it properly.


Please let me know how to proceed.


Following the script documentation (http://qiime.org/scripts/split_libraries.html), I ran this command using only one fasta file, and the error message below came back.


command:

(qiime1)

$ split_libraries.py -m Mapping_File.txt -f 24wk01-seq-DNAcut.csv.fasta


error:

Traceback (most recent call last):

  File "/home/qiime2/miniconda/envs/qiime1/bin/split_libraries.py", line 4, in <module>

    __import__('pkg_resources').run_script('qiime==1.9.1', 'split_libraries.py')

  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script

    self.require(requires)[0].run_script(script_name, ns)

  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script

    exec(code, namespace, namespace)

  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime-1.9.1-py2.7.egg-info/scripts/split_libraries.py", line 17, in <module>

    from qiime.util import parse_command_line_parameters, get_options_lookup

  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime/util.py", line 41, in <module>

    from biom.util import compute_counts_per_sample_stats, biom_open, HAVE_H5PY

  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/biom/__init__.py", line 51, in <module>

    from .table import Table

  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/biom/table.py", line 195, in <module>

    from biom.util import (get_biom_format_version_string,

  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/biom/util.py", line 27, in <module>

    import h5py

  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/h5py/__init__.py", line 36, in <module>

    from ._conv import register_converters as _register_converters

  File "h5py/h5t.pxd", line 14, in init h5py._conv

  File "h5py/numpy.pxd", line 66, in init h5py.h5t

ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, got 96


---

I also wonder whether the split command is necessary at all.
Do I have to do this to make the .fna file?

Thanks,

TonyWalters

May 13, 2020, 3:24:40 AM
to Qiime 1 Forum
Hello,

Since you have multiple fasta files (already split according to sample, it appears), I would recommend using this script:
http://qiime.org/scripts/add_qiime_labels.html

It will create the QIIME 1 demultiplexed form of the data, combined into a single fasta file, which you can import into QIIME 2.

You do have to make a mapping file that matches each fasta file name with the desired SampleID.
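A mapping file like that can be scripted. A minimal sketch, assuming one .fasta file per sample and a SampleID derived from the file name (the barcode and primer values are placeholders to be filled in, not real sequences):

```python
import os
import tempfile

# Sketch: build a QIIME 1 mapping file pairing each fasta file in a
# directory with a SampleID taken from the file name. Barcode and
# primer values here are placeholders, not real sequences.
HEADER = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence",
          "InputFileName", "Description"]

def build_mapping(fasta_dir, out_path):
    with open(out_path, "w") as out:
        out.write("\t".join(HEADER) + "\n")
        for name in sorted(os.listdir(fasta_dir)):
            if not name.endswith(".fasta"):
                continue
            sample_id = name.split("-")[0]  # "24wk01-seq.fasta" -> "24wk01"
            row = [sample_id, "NNNN", "NNNN", name, sample_id]
            out.write("\t".join(row) + "\n")

# Demo with two empty fasta files in a temporary directory.
d = tempfile.mkdtemp()
for name in ("24wk01-seq-DNAcut.csv.fasta", "24wk02-seq-DNAcut.csv.fasta"):
    open(os.path.join(d, name), "w").close()
mapping_path = os.path.join(d, "example_mapping.txt")
build_mapping(d, mapping_path)
```

The column names follow the QIIME 1 mapping file documentation; adapt the SampleID extraction to your actual file naming.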

-Tony

Masanobu Hiraoka

May 13, 2020, 6:36:53 AM
to Qiime 1 Forum
Thank you, Tony.

I tried to use the script you recommended,

$ add_qiime_labels.py -i fasta_dir -m example_mapping.txt -c InputFileName -o combined_fasta


The error message was as below:
---
Traceback (most recent call last):
  File "/home/qiime2/miniconda/envs/qiime1/bin/add_qiime_labels.py", line 4, in <module>
    __import__('pkg_resources').run_script('qiime==1.9.1', 'add_qiime_labels.py')
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime-1.9.1-py2.7.egg-info/scripts/add_qiime_labels.py", line 14, in <module>
    from qiime.util import parse_command_line_parameters, get_options_lookup,\
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime/util.py", line 41, in <module>
    from biom.util import compute_counts_per_sample_stats, biom_open, HAVE_H5PY
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/biom/__init__.py", line 51, in <module>
    from .table import Table
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/biom/table.py", line 195, in <module>
    from biom.util import (get_biom_format_version_string,
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/biom/util.py", line 27, in <module>
    import h5py
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/h5py/__init__.py", line 36, in <module>
    from ._conv import register_converters as _register_converters
  File "h5py/h5t.pxd", line 14, in init h5py._conv
  File "h5py/numpy.pxd", line 66, in init h5py.h5t
ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, got 96
--

My mapping .txt file looks like this:
"#SampleID,BarcodeSequence,LinkerPrimerSequence,InputFileName,Description
24wk01,AAAACCCCGGGG,CTACATAATCGGRATT,24wk01-seq-DNAcut.csv.fasta,24wk01
24wk02,AAAACCCCGGGG,CTACATAATCGGRATT,24wk02-seq-DNAcut.csv.fasta,24wk02
..."

The barcode and primer columns were filled with the same text for every sample.

Would you give me some advice?

TonyWalters

May 13, 2020, 7:07:36 AM
to Qiime 1 Forum
Hmmm, I think that error is probably an issue with the numpy dependency creating a conflict. You could try installing an earlier version (1.10.4 should be fine for QIIME 1). Hopefully you are already using a conda environment for qiime1 (otherwise you may have to download and install the 1.10.4 version, run this, then update to the newer version of numpy if it's being used by other software).


You probably will encounter the same error if you run:
print_qiime_config.py

but it may be worth seeing if it gives you output. Otherwise, you can check the version of numpy by typing:
python
import numpy
numpy.__version__
quit()

But if you have a conda environment (hopefully you do), try, when you're in the qiime1 environment:

conda install numpy=1.10.4

If successful, see if the prior commands will complete.

Masanobu Hiraoka

May 13, 2020, 5:56:52 PM
to Qiime 1 Forum
Hi, Tony,

Following that thread on the QIIME 2 forum, I installed Miniconda, and two environments exist there.

$ conda list
numpy                     1.9.3            py27h7e35acb_3  


This was my environment at that point. The installation then proceeded as follows:

--
Downloading and Extracting Packages
libgfortran-3.0.0    | 281 KB    | ##################################### | 100% 
numpy-1.10.4         | 6.0 MB    | ##################################### | 100% 
matplotlib-1.4.3     | 45.4 MB   | ##################################### | 100% 
mkl-11.3.3           | 122.1 MB  | ##################################### | 100% 
scipy-0.17.1         | 30.1 MB   | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
--

After rebooting Ubuntu, I got this:

# Name                    Version                   Build  Channel
numpy                     1.10.4                   py27_2  

I retried the command:
$ add_qiime_labels.py -i fasta_dir -m example_mapping.txt -c InputFileName -o combined_fasta

The error message partly changed:
--
Traceback (most recent call last):
  File "/home/qiime2/miniconda/envs/qiime1/bin/add_qiime_labels.py", line 4, in <module>
    __import__('pkg_resources').run_script('qiime==1.9.1', 'add_qiime_labels.py')
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime-1.9.1-py2.7.egg-info/scripts/add_qiime_labels.py", line 111, in <module>
    main()
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime-1.9.1-py2.7.egg-info/scripts/add_qiime_labels.py", line 107, in main
    output_dir, count_start)
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime/add_qiime_labels.py", line 39, in add_qiime_labels
    variable_len_barcodes=False)
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime/check_id_map.py", line 182, in process_id_map
    added_demultiplex_field)
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime/check_id_map.py", line 822, in check_header
    desc_ix, bc_ix, linker_primer_ix, added_demultiplex_field)
  File "/home/qiime2/miniconda/envs/qiime1/lib/python2.7/site-packages/qiime/check_id_map.py", line 888, in check_header_required_fields
    if (header[curr_check] != header_checks[curr_check] and
IndexError: list index out of range
--

Was this IndexError caused by my fasta file, or by a version problem?
What do you think?

Thanks,

TonyWalters

May 14, 2020, 12:38:39 AM
to Qiime 1 Forum
Hello,

That looks like an issue with the supplied mapping file.

You'll want to make sure the mapping file passes this script's tests: http://qiime.org/scripts/validate_mapping_file.html

Here's a page about the mapping file format: http://qiime.org/documentation/file_formats.html#metadata-mapping-files

Masanobu Hiraoka

May 14, 2020, 3:29:52 AM
to Qiime 1 Forum
Thank you, Tony.

I made it work.
My mapping file had been saved as .csv, and it used the same barcode sequence for all samples.

After checking the mapping file with the validation script, I made the barcode sequences unique and saved the file as tab-delimited.
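Both fixes (tab-delimited output, unique barcodes) can be scripted. A minimal sketch, with synthetic barcodes generated only to satisfy the validator's uniqueness check (they are not real barcodes):

```python
import itertools

# Sketch: convert a comma-separated QIIME 1 mapping file to
# tab-delimited and give every sample a unique synthetic barcode,
# since the validator rejects duplicate barcodes.
def fix_mapping(csv_text):
    # Generator of distinct 6-mer barcodes: AAAAAA, AAAAAC, ...
    barcodes = ("".join(p) for p in itertools.product("ACGT", repeat=6))
    fixed = []
    for i, line in enumerate(csv_text.strip().splitlines()):
        fields = line.split(",")
        if i > 0:  # skip the header row
            fields[1] = next(barcodes)  # BarcodeSequence column
        fixed.append("\t".join(fields))
    return "\n".join(fixed)

csv_in = (
    "#SampleID,BarcodeSequence,LinkerPrimerSequence,InputFileName,Description\n"
    "24wk01,AAAACCCCGGGG,CTACATAATCGGRATT,24wk01-seq-DNAcut.csv.fasta,24wk01\n"
    "24wk02,AAAACCCCGGGG,CTACATAATCGGRATT,24wk02-seq-DNAcut.csv.fasta,24wk02"
)
tsv = fix_mapping(csv_in)
```

After a conversion like this, the result should still be checked with validate_mapping_file.py before use.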

I now have the combined_seqs.fna file.
Following the QIIME 2 import tutorial, I tried this command:
$ qiime tools import \
  --input-path combined_seqs.fna \
  --output-path sequences.qza \
  --type 'FeatureData[Sequence]'


And I have got "Imported combined_seqs.fna as DNASequencesDirectoryFormat to sequences.qza" message.

Maybe I should run vsearch to dereplicate the data next, but I should ask further questions on the QIIME 2 forum, right?

I appreciate all of your support. 

TonyWalters

May 14, 2020, 3:43:34 AM
to Qiime 1 Forum
Hello again,

It does look like we're on the right track. I would proceed to dereplicating/clustering with vsearch in QIIME2.
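Conceptually, dereplication collapses identical reads into a single representative while keeping per-sample counts. A toy Python sketch of the idea (not vsearch itself, just an illustration of what the step produces):

```python
from collections import defaultdict

# Toy dereplication: identical sequences collapse to one feature,
# with a per-sample count kept for each. This is roughly the shape
# of the feature table that dereplication yields before clustering.
def dereplicate(labelled_reads):
    # labelled_reads: iterable of (sample_id, sequence) pairs
    table = defaultdict(lambda: defaultdict(int))
    for sample_id, seq in labelled_reads:
        table[seq][sample_id] += 1
    return table

reads = [("24wk01", "ACGT"), ("24wk01", "ACGT"),
         ("24wk02", "ACGT"), ("24wk02", "TTGA")]
table = dereplicate(reads)
```

In QIIME 2 the actual step is performed by the q2-vsearch plugin on the imported artifact; the sketch only illustrates the collapse-and-count logic.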

-Tony

Masanobu Hiraoka

May 15, 2020, 7:31:48 PM
to Qiime 1 Forum
Thank you, Tony,

In the end, I did not use the import command I posted previously.
After importing and dereplicating appropriately, I analyzed the rep-seqs.qza and table.qza files to investigate PCoA and diversity, using the same example_mapping.txt as in QIIME 1.

To evaluate significant differences between groups, I think I had better add metadata such as age or body site to the mapping file.

I re-read the input-file instructions and found the section "Mapping Files Without Barcodes and/or Primers".
Let me ask one more question about it.

When I arrange the mapping file, should I add a (virtual) barcode sequence like 'AAAACCCCGGGG' in the second column from the left? Or use the -p and -b parameters?

I think the barcode sequences are used for identifying samples in table.qzv when I analyze it at https://view.qiime2.org/.

And should LinkerPrimerSequence be the same as my forward primer sequence?

Thanks,

TonyWalters

May 16, 2020, 12:03:55 AM
to Qiime 1 Forum
Hello,

QIIME 2 can use the QIIME 1 mapping files, but it's less restrictive in some ways; see: https://docs.qiime2.org/2020.2/tutorials/metadata/

It should already be split according to sample, as it would have read the sample names from the fasta sequence labels in the merged fasta file produced by add_qiime_labels.py.

I guess the simplest way to address this is to look at the taxonomy visualization object from QIIME 2 at view.qiime2.org: when it first loads, does it show the sample names at the bottom of the columns, as they were specified (#SampleID) in the mapping file used for add_qiime_labels.py?

You can, as you've noted, add other metadata to the mapping file, such as body site (and rerun the QIIME 2 scripts that generated the visualization objects, like the taxa plots), and then use the sorting controls at the top right-hand side to select body site or other metadata.

The barcodes and primers can be added, but they aren't necessary for QIIME2 in this case (for QIIME1, the mapping validation is more strict, so there is a requirement for unique barcodes or other data to uniquely identify the samples from multiplexed reads).

-Tony


Masanobu Hiraoka

May 29, 2020, 7:52:06 PM
to Qiime 1 Forum
Hello again, Tony,

Following your instructions, I was able to arrange my mapping file and perform non-phylogenetic analysis with the background data.

But I cannot proceed to phylogenetic analysis with the dereplicated table file and the aligned, rooted tree file.

One of the reasons might be that there are so few reads per sample in my dataset; the highest number of reads in one sample is 139.

Is this presumably not enough to run typical microbial diversity analyses?


Do you think it might originate in the QIIME 1 procedure used to convert my Sanger fasta dataset into one usable in the QIIME 2 pipeline?


Considering that this comes from the QIIME 2 process, let me also consult the QIIME 2 forum about this problem.

Thanks,

Masanobu Hiraoka

May 30, 2020, 11:02:13 PM
to qiime...@googlegroups.com
Let me add a comment about my previous post.

When I added de novo clustering (97% identity, following my previous analysis), I could carry out both alpha and beta diversity analyses with the renewed table and sequence files, regardless of the number of reads per sample.
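For intuition, de novo clustering greedily seeds clusters and assigns each read to the first representative it matches at or above the identity threshold. A toy sketch using a naive position-wise identity measure (vsearch's alignment-based identity calculation is far more sophisticated):

```python
def identity(a, b):
    # Naive identity: matching positions over the longer sequence length.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def cluster_de_novo(seqs, threshold=0.97):
    seeds = []        # one representative sequence per cluster
    members = {}      # sequence -> cluster index
    for seq in seqs:
        for i, seed in enumerate(seeds):
            if identity(seq, seed) >= threshold:
                members[seq] = i
                break
        else:  # no seed was close enough: start a new cluster
            seeds.append(seq)
            members[seq] = len(seeds) - 1
    return seeds, members

# Two near-identical sequences (99% identity) and one unrelated one.
seeds, members = cluster_de_novo(["A" * 100, "A" * 99 + "C", "G" * 100])
```

Because assignment is greedy and order-dependent, real tools sort by abundance or length first; this sketch skips that detail.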


TonyWalters

May 31, 2020, 5:58:01 PM
to Qiime 1 Forum
Hello again,

I'm sorry that I don't have an answer on this (I've not tried Sanger data in QIIME 2). The 97% clustering approach (vsearch, I assume) seems appropriate for this data, since you can't use the denoising software (dada2 or deblur, which are built for Illumina error models). When you first ran it, did it not make a tree with tips matching the OTUs? That's the only thing I can think of offhand that would give you an OTU counts object that worked (did it give reasonable clustering?) for, e.g., Bray-Curtis but not for UniFrac or PD in alpha diversity.

Masanobu Hiraoka

Jun 2, 2020, 3:44:56 PM
to Qiime 1 Forum
Thank you for your comment again, Tony,

Here is my history of commands.
I succeeded in analyzing my Sanger data with the QIIME 2 pipeline in my preliminary trials; see the topic "Diversity core metrics of Sanger sequencing data" on the QIIME 2 forum.

I'm sorry that I don't have an answer on this (i've not tried Sanger data in QIIME2). The 97% clustering approach (vsearch, I assume) seems appropriate to use for this data, since you can't use the denoising software (dada2 or deblur, made for Illumina error models). 

OK, I will proceed with de novo clustering at 97% identity.

When you first ran it, did it not make a tree with matching tips to the OTUs?
No, it did not make a tree at first, for the reason described in the topic above.
 
That's the only thing I can think of offhand that would give you an OTU counts object that worked (did it give reasonable clustering?) for, e.g. bray-curtis but not for UniFrac or PD in alpha diversity.

And with de novo clustering after dereplicating, I could perform phylogenetic diversity analyses.

A colleague advised me that the read numbers are lower than for NGS data.
According to the document "Key differences between next-generation sequencing and Sanger sequencing", this seems to be a difference in data type...

Do I need to worry about this problem as long as I get reasonable results?

Thanks,

TonyWalters

Jun 2, 2020, 3:49:16 PM
to Qiime 1 Forum
That's good to hear. I think you're okay as far as proceeding with the analyses.