help installing EMIRGE

Lauren

Jul 7, 2017, 1:45:41 PM

to EMIRGE users

Hello,

I am having difficulty installing EMIRGE on my laptop (Mac OS X 10.11.6). I downloaded "EMIRGE-0.61.0.tar.gz" from GitHub (https://github.com/csmiller/EMIRGE/releases) and untarred the file. I then tried to build, install, and test the installation. When I run "emirge.py" to test the installation, I get an error (see below). Can someone please help me troubleshoot?

Thanks,

Lauren

Build EMIRGE (again):

ltom-mr:EMIRGE-0.61.0 ltom$ python setup.py build

Found Cython.

running build

running build_ext

running build_scripts

NOTE:

To download a standard candidate SSU database to use with EMIRGE, run

python emirge_download_candidate_db.py

Install EMIRGE (again):

ltom-mr:EMIRGE-0.61.0 ltom$ python setup.py install

Found Cython.

running install

running bdist_egg

running egg_info

writing top-level names to EMIRGE.egg-info/top_level.txt

writing EMIRGE.egg-info/PKG-INFO

writing requirements to EMIRGE.egg-info/requires.txt

writing dependency_links to EMIRGE.egg-info/dependency_links.txt

reading manifest file 'EMIRGE.egg-info/SOURCES.txt'

writing manifest file 'EMIRGE.egg-info/SOURCES.txt'

installing library code to build/bdist.macosx-10.6-x86_64/egg

running install_lib

running build_ext

creating build/bdist.macosx-10.6-x86_64/egg

copying build/lib.macosx-10.6-x86_64-3.5/_emirge.cpython-35m-darwin.so -> build/bdist.macosx-10.6-x86_64/egg

copying build/lib.macosx-10.6-x86_64-3.5/_emirge_amplicon.cpython-35m-darwin.so -> build/bdist.macosx-10.6-x86_64/egg

copying build/lib.macosx-10.6-x86_64-3.5/pykseq.cpython-35m-darwin.so -> build/bdist.macosx-10.6-x86_64/egg

creating stub loader for pykseq.cpython-35m-darwin.so

creating stub loader for _emirge.cpython-35m-darwin.so

creating stub loader for _emirge_amplicon.cpython-35m-darwin.so

byte-compiling build/bdist.macosx-10.6-x86_64/egg/pykseq.py to pykseq.cpython-35.pyc

byte-compiling build/bdist.macosx-10.6-x86_64/egg/_emirge.py to _emirge.cpython-35.pyc

byte-compiling build/bdist.macosx-10.6-x86_64/egg/_emirge_amplicon.py to _emirge_amplicon.cpython-35.pyc

installing package data to build/bdist.macosx-10.6-x86_64/egg

running install_data

warning: install_data: setup script did not provide a directory for 'pykseq/kseq.h' -- installing right in 'build/bdist.macosx-10.6-x86_64/egg'

copying pykseq/kseq.h -> build/bdist.macosx-10.6-x86_64/egg

warning: install_data: setup script did not provide a directory for '_emirge_C.h' -- installing right in 'build/bdist.macosx-10.6-x86_64/egg'

copying _emirge_C.h -> build/bdist.macosx-10.6-x86_64/egg

creating build/bdist.macosx-10.6-x86_64/egg/EGG-INFO

installing scripts to build/bdist.macosx-10.6-x86_64/egg/EGG-INFO/scripts

running install_scripts

running build_scripts

creating build/bdist.macosx-10.6-x86_64/egg/EGG-INFO/scripts

copying build/scripts-3.5/emirge.py -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO/scripts

copying build/scripts-3.5/emirge_amplicon.py -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO/scripts

copying build/scripts-3.5/emirge_rename_fasta.py -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO/scripts

changing mode of build/bdist.macosx-10.6-x86_64/egg/EGG-INFO/scripts/emirge.py to 755

changing mode of build/bdist.macosx-10.6-x86_64/egg/EGG-INFO/scripts/emirge_amplicon.py to 755

changing mode of build/bdist.macosx-10.6-x86_64/egg/EGG-INFO/scripts/emirge_rename_fasta.py to 755

copying EMIRGE.egg-info/PKG-INFO -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO

copying EMIRGE.egg-info/SOURCES.txt -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO

copying EMIRGE.egg-info/dependency_links.txt -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO

copying EMIRGE.egg-info/requires.txt -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO

copying EMIRGE.egg-info/top_level.txt -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO

writing build/bdist.macosx-10.6-x86_64/egg/EGG-INFO/native_libs.txt

zip_safe flag not set; analyzing archive contents...

__pycache__._emirge.cpython-35: module references __file__

__pycache__._emirge_amplicon.cpython-35: module references __file__

__pycache__.pykseq.cpython-35: module references __file__

creating 'dist/EMIRGE-0.61.0-py3.5-macosx-10.6-x86_64.egg' and adding 'build/bdist.macosx-10.6-x86_64/egg' to it

removing 'build/bdist.macosx-10.6-x86_64/egg' (and everything under it)

Processing EMIRGE-0.61.0-py3.5-macosx-10.6-x86_64.egg

removing '/Users/ltom/anaconda/lib/python3.5/site-packages/EMIRGE-0.61.0-py3.5-macosx-10.6-x86_64.egg' (and everything under it)

creating /Users/ltom/anaconda/lib/python3.5/site-packages/EMIRGE-0.61.0-py3.5-macosx-10.6-x86_64.egg

Extracting EMIRGE-0.61.0-py3.5-macosx-10.6-x86_64.egg to /Users/ltom/anaconda/lib/python3.5/site-packages

EMIRGE 0.61.0 is already the active version in easy-install.pth

Installing emirge.py script to /Users/ltom/anaconda/bin

Installing emirge_amplicon.py script to /Users/ltom/anaconda/bin

Installing emirge_rename_fasta.py script to /Users/ltom/anaconda/bin

Installed /Users/ltom/anaconda/lib/python3.5/site-packages/EMIRGE-0.61.0-py3.5-macosx-10.6-x86_64.egg

Processing dependencies for EMIRGE==0.61.0

Searching for biopython==1.69

Best match: biopython 1.69

Processing biopython-1.69-py3.5-macosx-10.6-x86_64.egg

biopython 1.69 is already the active version in easy-install.pth

Using /Users/ltom/anaconda/lib/python3.5/site-packages/biopython-1.69-py3.5-macosx-10.6-x86_64.egg

Searching for scipy==0.17.1

Best match: scipy 0.17.1

Adding scipy 0.17.1 to easy-install.pth file

Using /Users/ltom/anaconda/lib/python3.5/site-packages

Searching for pysam==0.11.2.2

Best match: pysam 0.11.2.2

Processing pysam-0.11.2.2-py3.5-macosx-10.6-x86_64.egg

pysam 0.11.2.2 is already the active version in easy-install.pth

Using /Users/ltom/anaconda/lib/python3.5/site-packages/pysam-0.11.2.2-py3.5-macosx-10.6-x86_64.egg

Searching for numpy==1.11.1

Best match: numpy 1.11.1

Adding numpy 1.11.1 to easy-install.pth file

Using /Users/ltom/anaconda/lib/python3.5/site-packages

Finished processing dependencies for EMIRGE==0.61.0

NOTE:

To download a standard candidate SSU database to use with EMIRGE, run

python emirge_download_candidate_db.py

Test installation (again):

ltom-mr:EMIRGE-0.61.0 ltom$ emirge.py

Traceback (most recent call last):

File "/Users/ltom/anaconda/bin/emirge.py", line 4, in <module>

__import__('pkg_resources').run_script('EMIRGE==0.61.0', 'emirge.py')

File "/Users/ltom/anaconda/lib/python3.5/site-packages/setuptools-23.0.0-py3.5.egg/pkg_resources/__init__.py", line 719, in run_script

File "/Users/ltom/anaconda/lib/python3.5/site-packages/setuptools-23.0.0-py3.5.egg/pkg_resources/__init__.py", line 1503, in run_script

File "/Users/ltom/anaconda/lib/python3.5/site-packages/EMIRGE-0.61.0-py3.5-macosx-10.6-x86_64.egg/EGG-INFO/scripts/emirge.py", line 369

raise OSError, "\n\nERROR: Cannot resume from non-existent directory %s"%(resume_iterdir)

^

SyntaxError: invalid syntax

Lauren

Jul 7, 2017, 4:03:13 PM

to EMIRGE users

Chris Miller

Jul 12, 2017, 11:20:50 PM

to emirge...@googlegroups.com

Hi,

It looks like you are using a Python 3 version of Anaconda. EMIRGE was written in Python 2, so you will need to install and use Python 2.7 when installing EMIRGE.
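
A small sketch of why the traceback above ends in a SyntaxError: the `raise OSError, "..."` statement on line 369 of emirge.py is Python 2 syntax, which a Python 3 interpreter refuses to parse. The helper name `parses` and the shortened message are illustrative only. Rewriting this one line would not by itself make EMIRGE run under Python 3.

```python
# Python 2-style raise statement, as in the traceback (message shortened).
PY2_RAISE = 'raise OSError, "Cannot resume from non-existent directory"'

# The call form is accepted by both Python 2 and Python 3.
PORTABLE_RAISE = 'raise OSError("Cannot resume from non-existent directory")'


def parses(source):
    """Return True if `source` compiles under the running interpreter."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    import sys

    # Under Python 3 the first check is False; under Python 2.7 both are True.
    print("interpreter: %d.%d" % sys.version_info[:2])
    print("py2-style raise parses: %s" % parses(PY2_RAISE))
    print("call-style raise parses: %s" % parses(PORTABLE_RAISE))
```

Running this with the same interpreter that `python setup.py install` used shows quickly whether that interpreter can parse EMIRGE's scripts at all.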

Thanks,

Chris

--
--
You received this message because you are subscribed to the Google
Groups "EMIRGE users" group.
To post to this group, send email to emirge...@googlegroups.com
To unsubscribe from this group, send email to
emirge-users+unsubscribe@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/emirge-users?hl=en?hl=en

EMIRGE code:
https://github.com/csmiller/EMIRGE

---
You received this message because you are subscribed to the Google Groups "EMIRGE users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to emirge-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ltom.b...@gmail.com

Jul 17, 2017, 4:03:22 PM

to EMIRGE users

Hi Chris,

Thanks for your reply. I was able to successfully install EMIRGE using python2.7.

Now for running "emirge.py", I am trying to determine what values to input for -l, -i, and -s.

I used FastQC to look at the distribution of sequence lengths. My sequences are 150bp (I'm guessing that is what I put for -l, max read length). How do I determine insert mean and insert stddev?

Thanks,
Lauren

Chris Miller

Jul 17, 2017, 4:07:47 PM

to EMIRGE users

You are correct on --max_read_length.

The insert mean and insert standard deviation are best estimated by mapping reads to an assembly or to SILVA. Sometimes your sequencing core will give you an estimate from library preparation. In practice, I have found that these parameters have little effect on EMIRGE output. They were originally designed to make the mapping process more efficient: EMIRGE rejects read pairs that map with too large an insert size, based on this distribution. If you don't know these values, just choose something reasonable with a large standard deviation so that all pairs can map. For example, a reasonable choice might be:

mean 500

std dev 500

Then all 16S pairs should map.
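
One way to estimate the insert mean and standard deviation from a mapping, sketched in Python under some assumptions: the paired reads have already been mapped to an assembly or to SILVA, and the result is available as SAM text. The function name `insert_size_stats` and the `max_tlen` cutoff are illustrative and not part of EMIRGE.

```python
import math


def insert_size_stats(sam_lines, max_tlen=2000):
    """Return (mean, stddev, n) of template lengths from SAM text lines.

    Only properly paired records with a positive TLEN (column 9) are
    counted, so each pair contributes once. `max_tlen` is an arbitrary,
    hypothetical cutoff that drops obvious outliers.
    """
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):       # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:           # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        # 0x2 is the "properly aligned pair" bit in the SAM FLAG field.
        if flag & 0x2 and 0 < tlen <= max_tlen:
            sizes.append(tlen)
    n = len(sizes)
    if n == 0:
        raise ValueError("no properly paired records with positive TLEN")
    mean = sum(sizes) / float(n)
    variance = sum((s - mean) ** 2 for s in sizes) / float(n)
    return mean, math.sqrt(variance), n
```

With a hypothetical file `pairs.sam`, usage would be `mean, sd, n = insert_size_stats(open("pairs.sam"))`, and the two numbers would be the values given to -i and -s. The mapper applies its own insert-size limits when it flags pairs as properly paired, so the estimate can come out too low; treat it as a rough guide.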

Chris

Lauren

Jul 18, 2017, 12:40:13 PM

to EMIRGE users

Hi Chris,

Thanks for your response. I am indeed trying to map all 16S reads using EMIRGE.

I set the parameters for mean and std dev to 500, as you suggested, and am currently running "emirge.py" (on my Mac laptop). Here is the command I used:

ltom-mr:EMIRGE-testing ltom$ python2.7 /Users/ltom/EMIRGE-0.61.0/emirge.py 37A-test1 -1 f.6437.3.44325.CTTGTA.adnq.fastq -2 r.6437.3.44325.CTTGTA.adnq.fastq -f SILVA_128_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed.fasta -b SILVA_128_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed -l 150 -i 500 -s 500 --phred33

The run is taking a long time to complete. It is on iteration 5 after ~18 hours of run time. Is this typical?

Also, I noticed that a large number of reads are failing to align. For example, on iteration 4:

Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Seeded quality full-index search: 01:44:16
# reads processed: 47552348
# reads with at least one reported alignment: 34372 (0.07%)
# reads that failed to align: 47517976 (99.93%)
Reported 17186 paired-end alignments to 1 output stream(s)
Time searching: 01:44:16
Overall time: 01:44:16
Finished Bowtie for iteration 04 at Tue Jul 18 09:28:42 2017:
DONE with read mapping for iteration 4 at Tue Jul 18 09:28:42 2017...
Finished iteration 4 at Tue Jul 18 09:28:42 2017...
Total time for iteration 4: 1:44:36.200247

Is this typical for mapping 16S reads?

Please let me know if you need additional information about my current run.

Thanks,
Lauren

Chris Miller

Jul 18, 2017, 6:07:00 PM

to emirge...@googlegroups.com

Lauren,

Your results are typical for a normal shotgun metagenomics dataset dominated by non-16S reads. It's not uncommon for only about 0.05% of reads to map. The slow speed is probably to be expected, because each iteration spends a long time trying to map the other 99.95% of reads, which have no home in the SILVA database.

Modern shotgun metagenomics datasets are very large compared to those available when EMIRGE was written. To speed up shotgun metagenomics runs with EMIRGE (that is, if your data is not 16S amplicon data), we are currently doing something similar to the following:
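
The mapping rate in the Bowtie summary quoted earlier in the thread can be recomputed with a short Python sketch. The function name `mapping_rate` is hypothetical, and the regular expressions assume the summary lines are worded exactly as in the iteration 4 log above.

```python
import re

# Summary lines copied from the iteration 4 log quoted earlier in the thread.
SUMMARY = """\
# reads processed: 47552348
# reads with at least one reported alignment: 34372 (0.07%)
# reads that failed to align: 47517976 (99.93%)
"""


def mapping_rate(summary_text):
    """Return (aligned, processed, percent aligned) from a Bowtie summary."""
    processed = int(
        re.search(r"# reads processed: (\d+)", summary_text).group(1)
    )
    aligned = int(
        re.search(
            r"# reads with at least one reported alignment: (\d+)",
            summary_text,
        ).group(1)
    )
    return aligned, processed, 100.0 * aligned / processed


if __name__ == "__main__":
    aligned, processed, percent = mapping_rate(SUMMARY)
    print("%d of %d reads aligned (%.3f%%)" % (aligned, processed, percent))
```

For this log the rate works out to a little over 0.07%, in line with the figure Bowtie itself reports.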

1. Screen your metagenome for "candidate" 16S reads.

One way we currently do this is to use bowtie2 to map the reads as unpaired against SILVA with something like the following, where $SILVA_BT2 is set to the path of a bowtie2 database of SILVA, and $cpus is set to the number of CPUs you want to use.

bowtie2 -x $SILVA_BT2 -U reads_1.fastq,reads_2.fastq --very-sensitive-local --time --threads $cpus -k 1 --no-unal -S SSUhits.sam

2. Collect read name prefixes of any single read in a pair that mapped.

The best way to do this will depend on your read naming convention, but you want to remove the trailing 1 and 2 (/1, .1, #1, etc.) that indicate the first and second read. Perhaps some egrep or sed command like:

samtools view SSUhits.sam | cut -f 1 | egrep -o '^[^/]+' | sort | uniq > names.txt

3. Get the correctly paired, filtered candidate SSU reads into fastq files.

You might use filterbyname.sh from the bbmap tools:

filterbyname.sh include=t names=names.txt prefix=t in=reads_1.fastq in2=reads_2.fastq out1=filtered_reads_1.fastq out2=filtered_reads_2.fastq

4. Run emirge_amplicon.py on the files:

filtered_reads_1.fastq

filtered_reads_2.fastq

emirge_amplicon.py is optimized under the assumption that nearly all reads are 16S reads and that they will fit in memory. This makes it faster than emirge.py, but it uses more memory.
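
Steps 2 and 3 above can also be sketched in plain Python. This is a hypothetical stand-in for the egrep and filterbyname.sh commands, not the workflow described in this thread. It assumes mate suffixes of the form /1, .1 or #1 (and the same with 2), uncompressed four-line FASTQ records, and mates listed in the same order in both files. All function names are illustrative.

```python
import re

# ASSUMPTION: mates are marked by a trailing "/1", ".1" or "#1" (or "2").
_MATE_SUFFIX = re.compile(r"[/.#][12]$")


def read_prefix(name):
    """Strip a trailing mate indicator from a read name."""
    return _MATE_SUFFIX.sub("", name)


def mapped_prefixes(sam_lines):
    """Collect name prefixes of reads that have a mapped SAM record."""
    prefixes = set()
    for line in sam_lines:
        if line.startswith("@"):                  # SAM header line
            continue
        fields = line.split("\t")
        if len(fields) < 3 or fields[2] == "*":   # RNAME "*": unmapped
            continue
        prefixes.add(read_prefix(fields[0]))
    return prefixes


def fastq_records(handle):
    """Yield (name, record_text) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header.strip():                    # end of file
            return
        rest = [handle.readline() for _ in range(3)]
        name = header[1:].split()[0]              # drop "@" and any comment
        yield name, header + "".join(rest)


def filter_pairs(fq1, fq2, out1, out2, keep):
    """Write pairs whose shared prefix is in `keep`; return the pair count."""
    kept = 0
    for (n1, rec1), (n2, rec2) in zip(fastq_records(fq1), fastq_records(fq2)):
        if read_prefix(n1) != read_prefix(n2):
            raise ValueError("mate files out of sync: %s vs %s" % (n1, n2))
        if read_prefix(n1) in keep:
            out1.write(rec1)
            out2.write(rec2)
            kept += 1
    return kept
```

A pair is kept when either mate mapped, which matches the intent of step 2. For large metagenomes the dedicated tools named above are likely to be much faster than this sketch.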

Chris

ltom.b...@gmail.com

Jul 18, 2017, 6:24:21 PM

to EMIRGE users

Hi Chris,
Thanks for the quick response and also for your suggestion on how to speed up the process by screening for candidate 16S reads and then running "emirge_amplicon.py".
Regards,
Lauren

ltom.b...@gmail.com

Jul 20, 2017, 1:47:35 PM

to EMIRGE users

Hi Chris,

My "emirge.py" run completed and I ran "emirge_rename_fasta.py" on my "iter.40" directory to generate a "renamed.fasta" file.

The file looks like this:

ltom@sirocco:~/Beatrice/EMIRGE-testing/37A-test1$ head renamed.fasta

>1|KF836151.1.1526 Prior=0.371982 Length=1519 NormPrior=0.349536

AGAGTTTGATCCTGGCTCAGAACGAACGTTAGCGGCGCGCTTAACACATGCAAGTCGAAC

GCGTGAGGGCTTGCCCTCACGAGTGGCGCACGGGTGAGGAACACGTAGGTAATCTGCCCT

CGAGTGGTGGATAACTCTCCGAAAGGAGAGCTAATACAGCATGAGACCACGTCCCCTCGG

GGATGCGGCCAAAGCGGGGGAACTTCGGTCCTCGCGCTTGAGGAGGAGCCTGCGGCCCAT

CAGCTAGTTGGTAGTGTAACGGACTACCAAGGCTAAGACGGGTAGCTGGTCTGAGAGGAT

GAACAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAA

TCTTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGCGATGAAGGCCCTCGG

GTTGTAAAGCTCTGTGGATGGGAAAGAATAAGTGTACGCTAACACCGTGCATGATGACGG

TACCCATTTAGCAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAAGACAGAGGGTGC

I would like to determine the taxonomy of the sequences in the file. To do this, I think I need a taxonomy file (perhaps a .txt file) with matching accession numbers to the header lines in "renamed.fasta".

SILVA has a number of taxonomy files listed here: https://www.arb-silva.de/no_cache/download/archive/release_128/Exports/taxonomy/
But I'm not sure which one to choose.
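
A possible starting point for the accession matching, sketched in Python. `parse_emirge_header` splits a "renamed.fasta" header like the one shown above. `descriptions_by_accession` builds a lookup from reference FASTA header lines, under the ASSUMPTION that each reference header has the form ">ACCESSION description" with the taxonomy string in the description; that format is not confirmed anywhere in this thread. Both function names are hypothetical.

```python
def parse_emirge_header(header):
    """Split a header such as
    '>1|KF836151.1.1526 Prior=0.371982 Length=1519 NormPrior=0.349536'.
    """
    fields = header.strip().lstrip(">").split()
    # The number before "|" is kept as-is; its meaning is not stated here.
    index, accession = fields[0].split("|", 1)
    info = dict(field.split("=", 1) for field in fields[1:])
    return {
        "index": int(index),
        "accession": accession,
        "prior": float(info["Prior"]),
        "length": int(info["Length"]),
        "norm_prior": float(info["NormPrior"]),
    }


def descriptions_by_accession(reference_lines):
    """Map accession -> rest of the header line for a reference FASTA.

    ASSUMPTION: headers look like '>ACCESSION description'.
    """
    lookup = {}
    for line in reference_lines:
        if not line.startswith(">"):
            continue
        parts = line[1:].strip().split(None, 1)
        if not parts:
            continue
        lookup[parts[0]] = parts[1] if len(parts) > 1 else ""
    return lookup
```

Looking up `parse_emirge_header(line)["accession"]` in the dictionary returned by `descriptions_by_accession` would then give the description attached to each EMIRGE sequence, provided the accessions in the two files are written identically.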

FYI: I used "emirge_makedb.py" to download the most current SILVA database (release 128, I believe):

ltom-mr:Beatrice_Garcia_Rodriguez ltom$ python2.7 /Users/ltom/EMIRGE-0.61.0/emirge_makedb.py

Fetching https://ftp.arb-silva.de/current/Exports/

Fetching https://ftp.arb-silva.de/release_128/Exports/LICENSE.txt

The SILVA database is published under a custom license. To proceed,

you need to read this license and agree with its terms:

Contents of "https://ftp.arb-silva.de/release_128/Exports/LICENSE.txt":

> The SILVA database project employs a dual licensing model as follows:

>

> (1) SILVA is free for academic/non-commercial users. The SILVA webpage can be

> browsed and all downloads offered (data sets, subsets thereof, and analysis

> results) can be used and modified without any restrictions. Also redistribution

> of the downloads or derivatives thereof is permitted, but only as long as the

> SILVA Terms of Use/License Information are made transparent by linking /

> referring to www.arb-silva.de/silva-license-information.

>

> The group of academic/non-commercial users is represented by universities and

> non-commercial research institutes such as members of the German Helmholtz

> Association, Leibniz Association and Max-Planck Society, as well as US National

> Labs. In case of doubt, please contact emailli...@arb-silva.de.

>

> (2) If you represent a NON-ACADEMIC/COMMERCIAL USER, you need to purchase a

> license as soon as you exploit any SILVA downloads (data sets, subsets thereof,

> and/or analysis results). Without a valid license, use of SILVA data downloads

> is only allowed for test purposes. All downloaded files must be deleted latest

> after 10 days and commercial exploitation of your results/outcome is never

> permitted.

>

> The SILVA Commercial Licenses are granted by the SILVA partner Ribocon GmbH,

> Bremen. Further information and contact data are available at

> www.ribocon.com/silva_licenses.

>

> For browsing/using the SILVA webpage without downloading any data, no

> restrictions for non-academic users apply (no commercial license is required).

> A non-academic environment is defined by a direct or indirect commercial

> interest in the data and includes all industrial research entities.

>

> Last updated: September 2016

> (clarifications and extension of SILVA downloads test period for non-academic

> users)

Do you agree to these terms? [yes|NO]yes

Downloading https://ftp.arb-silva.de/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz

Local filename is "./SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz"

I would also like to use the "renamed.fasta" file as input to QIIME for further analysis and visualization. Any suggestions on how to integrate it with QIIME would be much appreciated.

Thanks,
Lauren

ltom.b...@gmail.com

Jul 28, 2017, 3:18:20 PM

to EMIRGE users

Hi Chris,
I got 40 16S taxa as output from EMIRGE (using the default SILVA reference database). For comparison, I also ran EMIRGE using a Greengenes reference database and got 25 16S taxa as output. Both results seem rather low, considering that my original metagenome consists of 95 million reads. Even with ~0.07% mapping, I had expected many more hits. Is this typical for shotgun metagenomics?
Thanks,
Lauren
