OLCF deployment: looking for developer / source documentation


dvs...@gmail.com

Jun 28, 2016, 3:40:31 PM
to fireworkflows
Hello,
I'm new to Fireworks but have been given the task of supporting Fireworks workflows for a facilities integration project here at the ORNL LCF. Users here have previously deployed Fireworks as jobs running entirely within LCF compute environments; however, my particular use case differs in that Fireworks needs to be deployed across both LCF and external compute/instrument environments. This requires the MongoDB to be located outside of the LCF security enclave, and, unlike other LCFs, we have moderate security restrictions which do not allow us to open arbitrary ports to allow LCF jobs to communicate with external services (or vice versa). Thus, the documented approach of integrating Fireworks with a PBS queuing system won't work for us.

I believe I will need to develop a privileged proxy launcher facility that will interface with the Fireworks MongoDB service and launch and monitor jobs on our machines. It would be very helpful for me to get an understanding of the overall software architecture of Fireworks as well as the specifics of the MongoDB interactions necessary for launching and monitoring Fireworks. I've been perusing the source code, but I haven't found any architecture/design docs yet. Can someone point me to this level of documentation, assuming it exists?

Thanks,
Dale

Anubhav Jain

Jun 28, 2016, 4:32:17 PM
to dvs...@gmail.com, fireworkflows
Hi Dale,

The best resource for architecture / design is the paper:


Let me know if you have any problems accessing it and I can send it to you.

Although we don't have such security restrictions on our machines, some approaches to consider are:

1) Having two separate FW dbs, one for LCF jobs and one for outside jobs. Of course, this assumes that you can separate the workflows neatly into two piles. Some users are currently taking this approach.
2) You can try the "offline" mode (see documentation), but that requires being able to access the database from a login node or similar, e.g. the LCF login node can reach an outside MongoDB service even if the compute nodes cannot. That might not be the case in your situation.
3) Set up an ssh tunnel, although I am not sure how secure this is.
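
For option 3, here is a minimal sketch of what a tunnel could look like, written as a small Python helper that only builds the `ssh` command line. The host names, the user, and the function name `build_tunnel_command` are placeholders and not part of FireWorks; the port 27017 is just MongoDB's default and may differ on your service. Whether such a forward is permitted is a question for your site's security policy.

```python
import shlex


def build_tunnel_command(gateway, mongo_host, mongo_port=27017, local_port=27017):
    """Build the argv for an SSH local port forward to a remote MongoDB.

    gateway:    hypothetical "user@host" that can reach the MongoDB server
    mongo_host: MongoDB host name as seen from the gateway
    """
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward the port
        "-L", f"{local_port}:{mongo_host}:{mongo_port}",
        gateway,
    ]


if __name__ == "__main__":
    # Placeholder hosts for illustration only.
    cmd = build_tunnel_command("user@gateway.example.org", "mongo.example.org")
    print(shlex.join(cmd))
```

With the tunnel up, the LaunchPad configuration on the tunneling machine would point at `localhost` and the chosen local port rather than at the external host.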

Other than that, I don't have any great advice going forward, but am happy to take a stab at any specific questions you might have.

Best,
Anubhav


dvs...@gmail.com

Jun 29, 2016, 9:30:53 AM
to fireworkflows, dvs...@gmail.com
Hi Anubhav,
I'll read the doc you linked and discuss the options you provided with the PI; however, I'm fairly sure we can't currently access external networks from login nodes either (this is an ongoing issue I'm pursuing here). Tunneling is an option I'm currently pursuing - just have to get Ops to go along with it.

Thanks,
Dale

Anubhav Jain

Jun 29, 2016, 1:00:41 PM
to dvs...@gmail.com, fireworkflows
Hi Dale,

Ok - if you think a phone call / video chat would be helpful at a later point, just let me know by email (aj...@lbl.gov).

Best,
Anubhav

dvs...@gmail.com

Jul 1, 2016, 10:55:24 AM
to fireworkflows, dvs...@gmail.com
Hi Anubhav,
It looks like we will be able to use the queue reservation offline mode for our case. Ops has a method of accommodating external access from special nodes (like login nodes) that we can use to reserve fireworks tasks and schedule them on Titan. I was able to get the offline mode working after modifying the qadapter template and temporarily tunneling the MongoDB port.

Thanks,
Dale

Anubhav Jain

Jul 1, 2016, 1:46:59 PM
to dvs...@gmail.com, fireworkflows
Hi Dale,

Ok, happy to hear that you found some solution.

I haven't used offline mode much myself since it is less powerful and more difficult to use/maintain than normal FWS. That said, I am interested in making this feature better if possible for the future. Please let me know if you come across any points of friction during your work that you think could be smoothed out with a little development of FWS, or if you end up making any changes yourself to FWS that you think might be useful for the overall codebase.

Best,
Anubhav

dvs...@gmail.com

Jul 1, 2016, 1:56:03 PM
to fireworkflows, dvs...@gmail.com
Hi Anubhav,
I am having one issue that I'm not able to resolve: I can launch fireworks on Titan, I can see that they run, and I see the "FW_offline.json" file created that says the job is "complete"; however, the job in MongoDB remains in "reserved" status and never updates to "completed". Any ideas?

Thanks,
Dale

dvs...@gmail.com

Jul 1, 2016, 1:59:02 PM
to fireworkflows, dvs...@gmail.com
Never mind - just found the lpad recover_offline command.
Thanks
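
For anyone hitting the same symptom: in offline mode the database is only updated when `lpad recover_offline` is run from a node that can reach MongoDB, so jobs stay "reserved" until then. Below is a rough sketch of running that command on a fixed interval from such a node. The helper names, the interval, and the injectable `runner`/`sleep` arguments are my own assumptions for illustration, not FireWorks APIs; the only FireWorks pieces used are the `lpad` command, its `-l` launchpad-file option, and the `recover_offline` subcommand.

```python
import subprocess
import time


def build_recover_command(launchpad_file=None):
    """Return the argv for 'lpad recover_offline', optionally with -l <file>."""
    cmd = ["lpad"]
    if launchpad_file:
        cmd += ["-l", launchpad_file]
    cmd.append("recover_offline")
    return cmd


def recover_periodically(interval_seconds, max_rounds,
                         runner=subprocess.run, sleep=time.sleep,
                         launchpad_file=None):
    """Run the recover command max_rounds times, pausing between rounds.

    runner and sleep are parameters so the loop can be exercised without
    a FireWorks installation or real waiting.
    """
    cmd = build_recover_command(launchpad_file)
    for round_index in range(max_rounds):
        runner(cmd)
        # No need to wait after the final round.
        if round_index < max_rounds - 1:
            sleep(interval_seconds)
```

Calling `recover_periodically(300, 12)` on the node with database access would sync offline results every five minutes for about an hour; a cron entry running `lpad recover_offline` would do the same job without a long-lived process.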