converting a makefile to *.nf


Pierre Lindenbaum

Jul 29, 2015, 6:15:34 AM
to Nextflow
Hi Paolo & al.

I'm currently trying to find a way to convert a GNU-Makefile to the nextflow syntax.

Overview: I'm currently generating *.nf files from a text template. I'm sure there are ways to simplify my syntax (patterns, loops, etc.), but for now all the plain 'targets' are generated.

My Makefile below downloads some FASTA sequences from NCBI, gets the longest sequence and prints "Done" at the end. Some targets (all, clean, ...) are declared ".PHONY", meaning that they don't really generate a file.

.PHONY: all all_fasta clean
GILIST=52854274 156118490 290782623 209485592 149126991 254749437 269857780 14971105 256041807 269857713

%.fa:
	$(description $@,download gi:$(basename $@) from NCBI as fasta)wget -O "$@"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=$(basename $@)&retmode=text&rettype=fasta"

all: all_fasta
	echo "Done"

all_fasta: longest.fa

longest.fa : all.fa
	$(description $@,get the longest sequence in $<)awk '/^>/ { printf("%s%s\t",(NR==1?"":"\n"),$$0);next;} {printf("%s",$$0);} END {printf("\n");}' $< |\
	awk -F '\t' '{printf("%d\t%s\n",length($$2),$$0);}' | sort -t '	' -k1,1n | tail -n1 | cut -f 2- |\
	tr "\t" "\n" > $@

all.fa : $(addsuffix .fa,${GILIST})
	$(description $@,concatenate everything)cat $^ > $@

clean:
	rm -f $(addsuffix .fa,${GILIST}) longest.fa
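
For anyone who wants to check the longest.fa recipe outside of Make, here is a minimal sketch in plain shell. It assumes a made-up input file, toy.fa, holding three short invented records; it writes `$0` where the Makefile writes `$$0` (the doubling is only there to escape Make), and it builds the tab separator for sort with printf.

```shell
#!/bin/sh
# Toy input (invented for illustration): the second record is the longest
# once its two sequence lines are joined.
cat > toy.fa <<'EOF'
>seq1
MKT
>seq2
MKTAYIAK
QRQISFVK
>seq3
MKTAY
EOF

# A literal tab character, used as the field separator for sort.
TAB="$(printf '\t')"

# 1. Linearize every record to "header<TAB>sequence".
# 2. Prefix each record with the length of its sequence.
# 3. Sort numerically on that length, keep the last line, drop the length.
# 4. Turn the remaining tab back into a newline to restore the FASTA layout.
awk '/^>/ { printf("%s%s\t",(NR==1?"":"\n"),$0);next;} { printf("%s",$0);} END {printf("\n");}' toy.fa |
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' | sort -t "$TAB" -k1,1n | tail -n1 | cut -f 2- |
tr "\t" "\n" > longest.fa

cat longest.fa
```

With this toy input, longest.fa should contain the seq2 record, with its sequence on a single line.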


an expanded version of this Makefile would be:



SHELL=/bin/sh

.PHONY:  all_fasta all
all:  all_fasta
	echo "Done"

all_fasta:  longest.fa

longest.fa :  all.fa
	awk '/^>/ { printf("%s%s\t",(NR==1?"":"\n"),$$0);next;} { printf("%s",$$0);} END {printf("\n");}' all.fa |\
	awk -F '\t' '{printf("%d\t%s\n",length($$2),$$0);}' | sort -t '	' -k1,1n | tail -n1 | cut -f 2- |\
	tr "\t" "\n" > longest.fa

all.fa :  52854274.fa 156118490.fa 290782623.fa 209485592.fa 149126991.fa 254749437.fa 269857780.fa 14971105.fa 256041807.fa 269857713.fa
	cat 52854274.fa 156118490.fa 290782623.fa 209485592.fa 149126991.fa 254749437.fa 269857780.fa 14971105.fa 256041807.fa 269857713.fa > all.fa

269857713.fa :
	wget -O "269857713.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=269857713&retmode=text&rettype=fasta"

256041807.fa :
	wget -O "256041807.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=256041807&retmode=text&rettype=fasta"

14971105.fa :
	wget -O "14971105.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=14971105&retmode=text&rettype=fasta"

269857780.fa :
	wget -O "269857780.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=269857780&retmode=text&rettype=fasta"

254749437.fa :
	wget -O "254749437.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=254749437&retmode=text&rettype=fasta"

149126991.fa :
	wget -O "149126991.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=149126991&retmode=text&rettype=fasta"

209485592.fa :
	wget -O "209485592.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=209485592&retmode=text&rettype=fasta"

290782623.fa :
	wget -O "290782623.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=290782623&retmode=text&rettype=fasta"

156118490.fa :
	wget -O "156118490.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=156118490&retmode=text&rettype=fasta"

52854274.fa :
	wget -O "52854274.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=52854274&retmode=text&rettype=fasta"



I tried to generate a nextflow file:

#!/usr/bin/env nextflow

/** 52854274.fa  : download gi:52854274 from NCBI as fasta */
process proc1
{
    output:
    file '52854274.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "52854274.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=52854274&retmode=text&rettype=fasta"
    '''
}

/** 156118490.fa  : download gi:156118490 from NCBI as fasta */
process proc2
{
    output:
    file '156118490.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "156118490.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=156118490&retmode=text&rettype=fasta"
    '''
}

/** 290782623.fa  : download gi:290782623 from NCBI as fasta */
process proc3
{
    output:
    file '290782623.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "290782623.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=290782623&retmode=text&rettype=fasta"
    '''
}

/** 209485592.fa  : download gi:209485592 from NCBI as fasta */
process proc4
{
    output:
    file '209485592.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "209485592.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=209485592&retmode=text&rettype=fasta"
    '''
}

/** 149126991.fa  : download gi:149126991 from NCBI as fasta */
process proc5
{
    output:
    file '149126991.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "149126991.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=149126991&retmode=text&rettype=fasta"
    '''
}

/** 254749437.fa  : download gi:254749437 from NCBI as fasta */
process proc6
{
    output:
    file '254749437.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "254749437.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=254749437&retmode=text&rettype=fasta"
    '''
}

/** 269857780.fa  : download gi:269857780 from NCBI as fasta */
process proc7
{
    output:
    file '269857780.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "269857780.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=269857780&retmode=text&rettype=fasta"
    '''
}

/** 14971105.fa  : download gi:14971105 from NCBI as fasta */
process proc8
{
    output:
    file '14971105.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "14971105.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=14971105&retmode=text&rettype=fasta"
    '''
}

/** 256041807.fa  : download gi:256041807 from NCBI as fasta */
process proc9
{
    output:
    file '256041807.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "256041807.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=256041807&retmode=text&rettype=fasta"
    '''
}

/** 269857713.fa  : download gi:269857713 from NCBI as fasta */
process proc10
{
    output:
    file '269857713.fa' into proc11_input

    '''
    #!/bin/sh
    wget -O "269857713.fa"  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=269857713&retmode=text&rettype=fasta"
    '''
}

/** all.fa  : concatenate everything */
process proc11
{
    output:
    file 'all.fa' into proc12_input

    input:
    file '52854274.fa' from proc11_input
    file '156118490.fa' from proc11_input
    file '290782623.fa' from proc11_input
    file '209485592.fa' from proc11_input
    file '149126991.fa' from proc11_input
    file '254749437.fa' from proc11_input
    file '269857780.fa' from proc11_input
    file '14971105.fa' from proc11_input
    file '256041807.fa' from proc11_input
    file '269857713.fa' from proc11_input

    '''
    #!/bin/sh
    cat 52854274.fa 156118490.fa 290782623.fa 209485592.fa 149126991.fa 254749437.fa 269857780.fa 14971105.fa 256041807.fa 269857713.fa > all.fa
    '''
}

/** longest.fa  : get the longest sequence in all.fa */
process proc12
{
    output:
    file 'longest.fa' into proc13_input

    input:
    file 'all.fa' from proc12_input

    '''
    #!/bin/sh
    awk '/^>/ { printf("%s%s\t",(NR==1?"":"\n"),$0);next;} { printf("%s",$0);} END {printf("\n");}' all.fa |\
    awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' | sort -t '	' -k1,1n | tail -n1 | cut -f 2- |\
    tr "\t" "\n" > longest.fa
    '''
}

/** all_fasta  : all_fasta */
process proc13
{
    output:
    file 'all_fasta' into proc14_input

    input:
    file 'longest.fa' from proc13_input

    '''
    #!/bin/sh
    '''
}

/** all  : all */
process proc14
{
    input:
    file 'all_fasta' from proc14_input

    '''
    #!/bin/sh
    echo "Done"
    '''
}



running the workflow:

./nextflow run tests/test03.nf
N E X T F L O W  ~  version 0.15.0
Launching tests/test03.nf
[warm up] executor > local
[2c/08a2c4] Submitted process > proc3 (1)
[9a/97009a] Submitted process > proc1 (1)
[8e/7036d6] Submitted process > proc2 (1)
[a2/b5807b] Submitted process > proc4 (1)
[0f/45c69e] Submitted process > proc7 (1)
[15/9a9e43] Submitted process > proc6 (1)
[58/200508] Submitted process > proc5 (1)
[56/865af8] Submitted process > proc8 (1)
[34/34afae] Submitted process > proc9 (1)
[f0/9aaa2b] Submitted process > proc10 (1)


searching the sequences


$ find work/ -name "*.fa"
work/2c/08a2c430b3ead36c8540b8b8ea9fdf/290782623.fa
work/a2/b5807b98341f815b9cb9b77f41d7ce/209485592.fa
work/56/865af85167270d0061ff20816397d0/14971105.fa
work/34/34afae00a1b0b972cfa217e13fbe39/256041807.fa
work/f0/9aaa2b533b431cc3ff297ef8780438/269857713.fa
work/58/200508a14f1b2f9e9cdee8291548c4/149126991.fa
work/9a/97009ad2fd080f6253bc910965c825/52854274.fa
work/0f/45c69e070768543be978047527bbcf/269857780.fa
work/15/9a9e43942415d77875ec95b1b3db2f/254749437.fa
work/8e/7036d60e51565cc18bbf68be238dea/156118490.fa


So only the gi sequences have been downloaded, and the longest sequence was not computed. I would guess it comes from the PHONY state?
How should I write the *.nf file so that it runs all the way to `echo "Done"`?
About the PHONY target: can I set an integer somewhere to tell the engine that this target was computed?


Thank you for your help.

Pierre

Paolo Di Tommaso

Jul 29, 2015, 6:31:07 AM
to nextflow
Hi Pierre, 

The problem is that you have multiple processes writing into the same channel, namely "proc11_input". This is not allowed by nextflow.

A channel must connect exactly one producer and one consumer. If you don't want to change the structure of your script because you need to translate it automatically from a Makefile, you will need to use a unique name for each of those output channels.

Otherwise, you should refactor your script to have a single download process where the URL is specified as an input parameter.
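
For a generated script, one rough way to spot this situation before running it is to list every channel name that follows more than one `into`. The sketch below is only a text-level check in plain shell, not a nextflow feature; dup_demo.nf is a made-up, abbreviated file that serves as input for sed and is not meant to be run, and the check assumes each `into <name>` sits on a single line.

```shell
#!/bin/sh
# Hypothetical, abbreviated stand-in for a generated script: only the
# channel declarations matter for this check.
cat > dup_demo.nf <<'EOF'
process proc1 {
    output:
    file '52854274.fa' into proc11_input
}
process proc2 {
    output:
    file '156118490.fa' into proc11_input
}
process proc11 {
    output:
    file 'all.fa' into proc12_input
}
EOF

# Extract the name after each "into", then keep the names seen more than once.
dups=$(sed -n 's/.* into \([A-Za-z0-9_]*\).*/\1/p' dup_demo.nf | sort | uniq -d)

echo "channels with more than one producer: $dups"
```

Here the check reports proc11_input, the channel that both proc1 and proc2 write into.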


Hope it helps. 

Cheers,
Paolo
  


Paolo Di Tommaso

Jul 29, 2015, 6:52:01 AM
to nextflow
I've realised that I didn't reply to your question about PHONY targets. 

It is perfectly valid to have a process without inputs and/or outputs. However, a process with no inputs will *always* be executed because it is not waiting for any input(!).

So if you want to mimic the behaviour of phony targets in Make, you could do something similar to this example:

process foo {
  when:
  params.target == 'foo'

  script:
  '''
  echo Foo
  '''
}

process bar {
   when: 
   params.target == 'bar'
 
   script:
   '''
   echo Bar 
   '''
}

Then when launching the script you can specify the target as shown below: 

    nextflow run <script name> --target foo|bar 


Let me know if this answers your question.


Cheers,
Paolo

Pierre Lindenbaum

Jul 29, 2015, 6:54:47 AM
to Nextflow, paolo.d...@gmail.com


On Wednesday, July 29, 2015 at 12:31:07 PM UTC+2, Paolo Di Tommaso wrote:
Hi Pierre, 

The problem is that you have multiple processes writing into the same channel, that is "proc11_input". This is not allowed by nextflow. 

A channel must connect exactly one producer and one consumer. If you don't want to change the structure of your script because you need to translate it automatically from a Makefile, you will need to use a unique name for each of those output channels.


ok, if I get the problem well, I need a unique name to specify the link between the targets. So I re-wrote my nf file:


#!/usr/bin/env nextflow

/** 52854274.fa  : download gi:52854274 from NCBI as fasta */
process proc1
{
    output:
    file '52854274.fa' into proc_1_to_11

    '''

}


(...)/* skipped content */


/** all.fa  : concatenate everything */
process proc11
{
    output:
    file 'all.fa' into proc_11_to_12

    input:
    file '52854274.fa' from proc1_to_11
    file '156118490.fa' from proc2_to_11
    file '290782623.fa' from proc3_to_11
    file '209485592.fa' from proc4_to_11
    file '149126991.fa' from proc5_to_11
    file '254749437.fa' from proc6_to_11
    file '269857780.fa' from proc7_to_11
    file '14971105.fa' from proc8_to_11
    file '256041807.fa' from proc9_to_11
    file '269857713.fa' from proc10_to_11

    '''
    #!/bin/sh
    cat 52854274.fa 156118490.fa 290782623.fa 209485592.fa 149126991.fa 254749437.fa 269857780.fa 14971105.fa 256041807.fa 269857713.fa > all.fa
    '''
}



when I run the workflow, I get a 'Not such variable: proc1_to_11'


N E X T F L O W  ~  version 0.15.0
Launching tests/test03.nf
[warm up] executor > local
[19/e1e6da] Submitted process > proc3 (1)
[0f/8bcb50] Submitted process > proc1 (1)
[91/894b8b] Submitted process > proc4 (1)
[fc/1f5338] Submitted process > proc2 (1)
[0c/8f8645] Submitted process > proc6 (1)
[46/39314b] Submitted process > proc7 (1)
[d5/38ceaa] Submitted process > proc5 (1)
[23/35dec9] Submitted process > proc8 (1)
[fb/3a4f63] Submitted process > proc9 (1)
[c6/1c2b54] Submitted process > proc10 (1)
ERROR ~ Not such variable: proc1_to_11

 -- Check script 'test03.nf' at line: 204 or see '.nextflow.log' file for more details
make: *** [test-nextflow] Error 255




 

Paolo Di Tommaso

Jul 29, 2015, 6:59:53 AM
to Pierre Lindenbaum, Nextflow
You have an extra (or a missing) underscore: 

into proc_1_to_11

  and 

from proc1_to_11
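
A mismatch like this can also be caught with a quick text-level check before launching the workflow. The following is a rough sketch in plain shell, not a nextflow feature: link_demo.nf is a made-up, abbreviated file used only as input for sed, and the check assumes that every channel is created by an `into <name>` clause and consumed by a `from <name>` clause, each on a single line.

```shell
#!/bin/sh
# Use byte-wise collation so that sort and comm agree on the ordering.
LC_ALL=C
export LC_ALL

# Hypothetical, abbreviated stand-in reproducing the underscore mismatch.
cat > link_demo.nf <<'EOF'
process proc1 {
    output:
    file '52854274.fa' into proc_1_to_11
}
process proc11 {
    input:
    file '52854274.fa' from proc1_to_11
    output:
    file 'all.fa' into proc_11_to_12
}
EOF

# Names declared with "into" and names consumed with "from", one per line.
sed -n 's/.* into \([A-Za-z0-9_]*\).*/\1/p' link_demo.nf | sort -u > declared.txt
sed -n 's/.* from \([A-Za-z0-9_]*\).*/\1/p' link_demo.nf | sort -u > consumed.txt

# comm -13 keeps the lines that appear only in the second file.
missing=$(comm -13 declared.txt consumed.txt)

echo "consumed but never declared: $missing"
```

On this input it reports proc1_to_11, the name that is read but never written.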


Cheers,
Paolo

Maria Chatzou

Jul 29, 2015, 7:03:45 AM
to next...@googlegroups.com

Hi Paolo,

I think we should have an error message when someone does that by mistake. This would help Nextflow users debug more easily.

Cheers,
Maria

Pierre Lindenbaum

Jul 29, 2015, 7:04:57 AM
to Nextflow, paolo.d...@gmail.com


On Wednesday, July 29, 2015 at 12:59:53 PM UTC+2, Paolo Di Tommaso wrote:
You have an extra (or a missing) underscore: 

into proc_1_to_11

  and 

from proc1_to_11


 
Oops. Works better now :-)

Now a problem with the phony target: no file is produced. I'm going to fool nextflow by creating a dummy file, but just tell me if you think there is another way :-)

By the way, as the files are not all produced in the same directory, how does the nextflow engine retrieve the files when the 'script' part is invoked?


[49/b5ba98] Submitted process > proc13 (1)
Error executing process > 'proc13 (1)'

Caused by:
  Missing output file(s): 'all_fasta' expected by process: proc13 (1)





 

Paolo Di Tommaso

Jul 29, 2015, 7:18:39 AM
to Pierre Lindenbaum, Nextflow
Well, the problem is that your last two steps do nothing other than renaming that file.

If this structure is required by your conversion tool, the only way is to create a dummy file, or to have "proc13" rename "longest.fa" to "all_fasta", with a move command for example.


Otherwise I would simply replace both "proc13" and "proc14" with these lines:

  proc13_input.subscribe { file -> 
      file.copyTo('all_fasta')
      println "Done"
  }


Cheers, Paolo

Paolo Di Tommaso

Jul 29, 2015, 7:19:08 AM
to nextflow
Yes, I agree. 

The problem is that mistakes are infinite :)


p

Pierre Lindenbaum

Jul 29, 2015, 7:35:48 AM
to Nextflow, paolo.d...@gmail.com
OK, my workflow is running now ! Thank you for the help !

I'm now trying to compile a regular project like 'tabix', that is, one with pre-existing files. If I keep my previous structure, nextflow complains:

Command executed:

  #!/bin/sh
  gcc -c -g -Wall -O2 -fPIC  -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE  kstring.c -o kstring.o

Command exit status:
  4

Command output:
  (empty)

Command error:
  gcc: error: kstring.c: No such file or directory
  gcc: fatal error: no input files
  compilation terminated.


$ ls kstring.c
kstring.c



is there a way to specify a 'path' where to search for the local files ?

Thank you !

Paolo Di Tommaso

Jul 29, 2015, 8:16:52 AM
to Pierre Lindenbaum, Nextflow
Wow, a pipeline task compiling itself... ! :)

Um, no. However, you simply need to provide the source file as an input, e.g.

process compile {
   storeDir 'bin/'

   input:
   file 'kstring.c' from file('kstring.c')

   output:
   // the gcc command below writes kstring.o, so that is the declared output
   file 'kstring.o'

   script:
   """
   gcc -c -g -Wall -O2 -fPIC  -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE  kstring.c -o kstring.o
   """
}


You can also add the "storeDir" directive; by doing that, the output will be copied to the specified folder (and the task skipped if that file already exists).

Also note that the folder "bin/" in the project launching dir is added to the PATH of the executed command  automatically, so you will be able to use it without having to specify an absolute path. 


Cheers,
Paolo