Executing Java code with dependencies on MiG


Casper Petersen

Sep 16, 2013, 5:13:26 AM
to mig...@googlegroups.com
I am in the process of parsing a large number of files (16,000,000). The Java program (it has to be Java due to dependencies) that processes them depends on other *.jar files, which must be available on each node for the processing to work, and it writes the output for each input file to a new file with (broadly) the same name.

The questions:
1) Is it even possible to "host" the 16,000,000 files on the MiG cluster?
2) Is it even possible for the Java program to have dependencies and still be processed on the nodes?
3) My intuition tells me that it should be possible to submit a job using the Java program, point it to an input directory where a subset of the 16,000,000 files is stored, and point it to an output directory where the parsed files should be placed. Is that even possible?

The files:
In its naive sequential form the Java program simply reads a file from disk, processes it and stores the result on disk. In its current multithreaded form, with RAID SSD disks for storage and 16 threads, it processes roughly 2500 files in 14 minutes. I am hoping MiG can speed it up even more.
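
For scale, here is a back-of-the-envelope estimate from the figures above. It assumes the rate of 2500 files per 14 minutes holds for the whole set, which is an assumption and not a measurement:

```shell
#!/bin/bash
# Rough total runtime on the current single machine, using the numbers
# quoted in this post. Integer arithmetic, so the day count is rounded down.
total_files=16000000
batch_files=2500
batch_minutes=14

total_minutes=$(( total_files / batch_files * batch_minutes ))
total_days=$(( total_minutes / 60 / 24 ))

echo "$total_minutes minutes, about $total_days days"
```

That is 6400 batches of 14 minutes each, or roughly two months of wall time on one machine.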

From the documentation I half get the sense that what I want is not possible, but then again it might be. I hope you can clarify it for me, and thanks in advance for taking the time to answer my questions.

Jonas Bardino

Sep 16, 2013, 6:31:25 AM
to mig...@googlegroups.com


Thanks for posting your questions here where others may find them useful or chip in.

I'd say yes to all three questions as such, but let's go a bit more into detail.

In principle you have only 1 GB of storage in your MiG home, where your files are hosted, but we provide external storage resources when much more than that is needed. The files are not automatically available on compute resources like the cluster. They need to be explicitly transferred as part of the job (INPUTFILES) or made available in a more ad-hoc fashion. The same applies to the results (OUTPUTFILES). I'm sure we can find a suitable setup for your actual file access pattern.
It sounds like your jobs take ~2500 files and produce one output file. Is that about right and approximately how big are those files?
Do the input files overlap between runs, or does each run work on a disjoint subset of the complete set of files?

I'd suggest that you upload e.g. a zip file with a single set of job files and see if you can make a job description file to get it processed on the Octuplets cluster.
You may find the java example around page 32 in the 'Intro for new users of MiG' pdf on
https://sites.google.com/site/minimumintrusiongrid/tutorials-and-talks
useful for inspiration.
The runtime environment you want for java jobs is called JAVA-ANY-1 and it is available on the Octuplets cluster now. Do you know if your Java app requires a specific Java flavor and version? We generally have OpenJDK 6 available.
Other dependencies have to be fulfilled either by other runtime environments or with software explicitly included in the job. If you have a set of jar files they can easily be included in the job using INPUTFILES or EXECUTABLES. If they are huge we may prefer to make a runtime environment for them on the resource(s) instead of transferring them for each job, however.
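
As a rough sketch of how a job could pick up a set of jar files once they have been transferred, the helper below builds a classpath string from every jar in a directory. The directory name lib, the class name my.pkg.ParserMain and the input/output arguments are hypothetical placeholders, and how the JAVA-ANY-1 environment exposes the java command is not shown here:

```shell
#!/bin/bash
# build_classpath DIR
# Print a colon-separated classpath containing every *.jar file in DIR.
build_classpath() {
    local dir=$1 cp="" jar
    for jar in "$dir"/*.jar; do
        [ -e "$jar" ] || continue      # skip the literal pattern when no jar matches
        cp="${cp:+$cp:}$jar"           # add a ':' separator only between entries
    done
    printf '%s\n' "$cp"
}

# Hypothetical use inside a job script:
#   java -cp "$(build_classpath lib)" my.pkg.ParserMain input/ output/
```

Java 6 and later also accept a wildcard entry such as -cp 'lib/*', which may be simpler if all jars live in one directory.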

Please feel free to ask ...

Cheers, Jonas

Casper Petersen

Sep 16, 2013, 6:40:12 AM
to mig...@googlegroups.com
Hi Jonas,

Thanks for the reply. As for your questions:
1) The example was poorly formulated. Each input file produces one output file, so 2500 input files result in 2500 output files (in its current state).
2) The files are between 1 kB and 300 kB of plaintext.
3) The runs work on disjoint subsets of the files.
4) No particular version of Java, except that what I currently have is compiled and tested using Sun's latest version of Java.
5) The jar files (4 in total) are between 5 and 200 MB.

Thanks for the link (caught it while reading the documentation). I will try to see if I can get it all to work.

best,

Casper

Jonas Bardino

Sep 16, 2013, 7:11:29 AM
to mig...@googlegroups.com
1+2) Ah, that should not be a problem. A few hundred megabytes of text can surely be packed into quite small input and output archives so it can just be transferred along with the job.
3) perfect, then the input can simply be kept in packed format on the grid.
4) AFAIK it is a bit cumbersome but not impossible to install Oracle (=Sun) Java on the resources. Do you have all the source code so that you can recompile there if it doesn't work with OpenJDK out of the box?
5) Depending on the total size and number of jobs we might want to cache/install the files permanently on the resource and avoid the transfer each time.
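
Packing the input as described in 1+2) and 3) could be sketched roughly like this. The function name, the batch_ prefix and the .tar.gz format are illustrative choices, and the sketch assumes file names without newlines:

```shell
#!/bin/bash
# pack_batches SRC_DIR OUT_DIR BATCH_SIZE
# Split the files under SRC_DIR into disjoint batches of at most BATCH_SIZE
# files and write one compressed archive per batch into OUT_DIR.
pack_batches() {
    local src=$1 out=$2 size=$3 list_dir list
    mkdir -p "$out" || return 1
    list_dir=$(mktemp -d) || return 1
    # One list file per batch; -a 4 allows far more than the default 676 batches.
    (cd "$src" && find . -type f | sort) | split -a 4 -l "$size" - "$list_dir/batch_"
    for list in "$list_dir"/batch_*; do
        [ -e "$list" ] || continue     # nothing to pack when SRC_DIR is empty
        tar -czf "$out/$(basename "$list").tar.gz" -C "$src" -T "$list" || return 1
    done
    rm -rf "$list_dir"
}

# Hypothetical use: pack_batches ./input ./packed 2500
```

Each job would then take one archive as input, unpack it, process the files and pack the results the same way.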

If you get stuck with writing the MiG job description you can just post it here along with any job IDs for related jobs. Then we will look into it.

Cheers, Jonas

Casper Petersen

Sep 16, 2013, 9:10:00 AM
to mig...@googlegroups.com
I am getting increasingly convinced that it is doable :).

1+2+3) Super.
4) I have all the source code and will give it a try or ten with OpenJDK. When you say install, is that an administrative task, i.e. do you need to do it, or can I do it as an ordinary user?
5) That sounds like a good idea. I imagine overhead will kill the entire idea otherwise.

Will post back when I have news. Thanks again for the help.

/Casper

Jonas Bardino

Sep 16, 2013, 11:32:06 AM
to mig...@googlegroups.com
4) Alright, I think OpenJDK is quite focused on compatibility with Oracle/Sun so you can give it a shot with the originals first, but it's always good to have a fallback plan.
5) The transfers are running on a LAN so it should not be that bad, but we might as well save the time for repeated transfers.

By install I mean a 'proper' installation with admin rights to some read-only location. You can use some tricks to cache files on the resource between jobs even without admin help, but then in principle you can never be sure that they will still be there unchanged for the next job.
We've previously done that kind of manual caching with something like:
#!/bin/bash
#
# Simple job wrapper: fetch shared data into the cache if it is not there
# already, then actually run the job.
# fill_cache and run_my_job are placeholders for your own commands.

DATACACHE="$HOME/.cache/mycachedir"
if [ ! -d "$DATACACHE" ]; then
    # fetch my data into $DATACACHE here
    mkdir -p "$DATACACHE"
    fill_cache "$DATACACHE"
fi
run_my_job "$DATACACHE"
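
One weakness of the simple wrapper is that a fetch that fails halfway leaves the cache directory in place, so later jobs would treat it as complete. A slightly more defensive variant, sketched here with a hypothetical ensure_cache helper, fills a temporary directory first and only renames it into place on success. It does not handle two jobs filling the same cache at the same time:

```shell
#!/bin/bash
# ensure_cache CACHE_DIR FILL_COMMAND [ARGS...]
# Run FILL_COMMAND with a temporary directory appended as its last argument
# and move that directory to CACHE_DIR only if the command succeeds.
ensure_cache() {
    local cache=$1 tmp
    shift
    [ -d "$cache" ] && return 0                  # filled by an earlier job
    mkdir -p "$(dirname "$cache")" || return 1
    tmp=$(mktemp -d "$cache.tmp.XXXXXX") || return 1
    if "$@" "$tmp" && mv "$tmp" "$cache"; then
        return 0
    fi
    rm -rf "$tmp"                                # drop the partial copy
    return 1
}

# Hypothetical use, with fill_cache and run_my_job as in the wrapper above:
#   ensure_cache "$HOME/.cache/mycachedir" fill_cache && run_my_job "$HOME/.cache/mycachedir"
```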


Anyway, we do have admin rights on the cluster so it's only a matter of asking us to install once you have something that runs.

Cheers, Jonas

Casper Petersen

Sep 17, 2013, 5:16:40 AM
to mig...@googlegroups.com
So after a few hours of fiddling with some stuff I have more basic questions:

- What is the difference between Queued (Stay) and Queued in Job Manager?
- When submitting jobs, the ::EXECUTE:: header requires the path of the script (for the Fibonacci example I uploaded the script to a code dir, so it says code/fib.py).
  Is the same true for ::EXECUTABLES::, i.e. path/script.py?
- I tried the Fibonacci program (fib.py is located in the code directory), except that I put the commands under their respective headlines (instead of one blurb as in the example):
  1) Execute Commands: $PYTHON code/fib.py 5
  2) Executable Files: code/fib.py
  3) CPU/Wall Time = 300
  4) Memory = 256
  5) Runtime environments: PYTHON-2-X-1
  Everything else was set to default.
But after a few hours the program is still queuing. Is there somewhere I can check the load of the grid or, assuming the above is correct, is it just a waiting game?

Thanks again,

Casper

Jonas Bardino

Sep 17, 2013, 5:44:51 AM
to mig...@googlegroups.com
Queued (stay) is the result of you calling schedule status on the job and it means that the job is queued and doesn't fit any known resources. More info at
http://code.google.com/p/migrid/wiki/UserFAQ#Why_do_my_jobs_remain_in_QUEUED_state_seemingly_forever?
That happens if you request some combination of specs that is not available, and particularly if you did not request a vgrid with resources (use the ANY keyword if in doubt). You want to target the eScience vgrid where the cluster is:
https://dk-cert.migrid.org/cgi-bin/showvgridmonitor.py?vgrid_name=eScience

If that still doesn't help you can post the raw job description or give me a job ID so I can take a look.

Files are transferred to the same relative path unless you use the SRC DST form in e.g. INPUTFILES. Examples and further explanation at
http://code.google.com/p/migrid/wiki/UsingJobFiles

You want the PYTHON-2.X-1 runtime env for the fib example but maybe that was just a typo.

Generally you can use the monitor links from your VGrids page to see available resources and their status.
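
Put together as a raw job description, the fib job with the two fixes above (the eScience vgrid and the PYTHON-2.X-1 runtime environment name) might look roughly like the file written by this sketch. The values come from this thread. Apart from ::EXECUTE:: and ::EXECUTABLES::, the keyword spellings and the file name fib.mRSL are assumptions to check against the job description documentation:

```shell
#!/bin/bash
# Write a job description for the fib example to fib.mRSL.
# The quoted 'EOF' keeps $PYTHON literal so it is expanded on the resource.
cat > fib.mRSL <<'EOF'
::EXECUTE::
$PYTHON code/fib.py 5

::EXECUTABLES::
code/fib.py

::CPUTIME::
300

::MEMORY::
256

::RUNTIMEENVIRONMENT::
PYTHON-2.X-1

::VGRID::
eScience
EOF
```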

Cheers, Jonas