OpenCoarrays On a Cluster


evansste

Mar 23, 2022, 5:16:44 AM3/23/22
to OpenCoarrays
        I own a cluster that is composed of eight Dell PowerEdge M910 blade servers.  I'm using Rocks Cluster software (rocksclusters.org), and have successfully installed OpenCoarrays 2.9.2.  The setup seems to work okay when I launch programs using a relatively small number of cores (24).  However, I've noticed that when using 248 cores, programs don't tend to work as expected, and I get error messages.
        I'm interested in connecting with OpenCoarray users, or developers, who have managed to use OpenCoarrays on a cluster.  Are you running into issues, or is everything running smoothly?  Have you experienced limitations?  If so, have you found a way around those limitations?  These are the types of questions that I'm interested in.
        I've written two Fortran programs, called "rigor" and "rigor2".  They're designed to show how my installation isn't working correctly.  If my installation, of both Rocks 7 and OpenCoarrays, were working correctly, I should be able to run "rigor" and "rigor2", with 248 cores, and receive no error messages.  However, when I run these programs with 248 cores, the programs print out statements which are designed to clearly show that my system is not working correctly.  An example would be that it complains that 'A' is not equal to 'B', but will then print out 'A' and 'B', showing them to be equal.  In contrast, these programs will always complete with no errors, if they're run with 24 cores.
        "rigor" and "rigor2" are very similar, and they both perform the same task.  The only difference is that "rigor" uses a function that generates random values, while carrying out its task.  As a result, it's able to show that the problem occurs with a wide range of random scenarios.  "rigor2", on the other hand, uses a fixed set of values; allowing it to show the problem with only one specific scenario.
        I have attached copies of the two programs.  They appear to be a bit long.  However, this is mostly because they use a few simple functions, which are also included.
        Here are the results of running these two programs with 24 cores versus 248 cores.  When compiling, make sure you use the "-g" option:

[evansste@cluster work]$ caf /home/evansste/Base/work/rigorc.f08 -g -o /home/evansste/Base/work/rigorc
[evansste@cluster work]$ cafrun -np 24 -machinefile machines /home/evansste/Base/work/rigorc
[evansste@cluster work]$ cafrun -np 248 -machinefile machines /home/evansste/Base/work/rigorc
 On image            7            30 -          21 +1 should equal           10  So why does this message print, and why didn't it fail pretest on line 253?
 On image           16             0 -           0 +1 should equal           10  So why does this message print, and why didn't it fail pretest on line 253?
...

[evansste@cluster work]$ caf /home/evansste/Base/work/rigor2c.f08 -g -o /home/evansste/Base/work/rigor2c
[evansste@cluster work]$ cafrun -np 24 -machinefile machines /home/evansste/Base/work/rigor2c
[evansste@cluster work]$ cafrun -np 248 -machinefile machines /home/evansste/Base/work/rigor2c
 On image          133            10 -           1 +1 should equal            6  So why does this message print, and why didn't it fail pretest on line 153?
 On image          128            10 -           1 +1 should equal            6  So why does this message print, and why didn't it fail pretest on line 153?
...

     Because the full results are quite lengthy, what's shown is only a part of what is displayed.  In order to see the full contents of what is displayed, feel free to look at the attached files.
     If anyone is running OpenCoarrays on a cluster, I'd love to know whether or not you can get these programs to complete, with no errors, while running them with 248 cores.  If you are, then your cluster is more stable than mine, and I'd love to know what you've done to set it up correctly.
        Thank you for taking the time to read my post, and I look forward to receiving any input on running OpenCoarrays on a cluster.
rigorc full response.txt
rigor2c full response.txt
rigorc.f08
rigor2c.f08

London Drugs

Apr 1, 2022, 9:05:10 PM4/1/22
to OpenCoarrays
(replying from work; Knarfnarf/Frank here)

My understanding is that the -np option specifies the number of ACTUAL cores that you have, or are willing to allocate from what you have.

If you over-allocate, you will be time slicing on the cores that you do have.

So if you have 24 actual cores, slicing 248 across 24 gives 10 slices on each core with 8 slices to spare.

The algorithm that allocates slices to cores is beyond me, but it will require extra time to evaluate the spare slices while most of the cores sleep.

Even multiples of your actual cores would time slice better, but over-allocating your cores will never be as effective as just allocating them properly.

As to why things don't add up as you are expecting, check out the parallel sort routine that I (knarfnarf) posted earlier in this forum. I could not get sync all to sync anything, so I used a local data structure to store the data and the round number, and copied in from the remote image to test the round number for that data.
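The kind of scheme I mean can be sketched like this (illustrative only — the names and structure are made up here, not the exact code from that post):

```fortran
program round_sketch
  ! Sketch: publish data plus a round number, and have the reader
  ! poll the round number instead of relying on sync all.
  implicit none
  integer, parameter :: n = 4
  real    :: buffer(n)[*]          ! data published by each image
  integer :: round[*]              ! round the published data belongs to
  real    :: local_copy(n)
  integer :: me, partner

  me    = this_image()
  round = 0
  sync all                         ! one-time startup barrier

  buffer = real(me)                ! write this round's data first...
  sync memory                      ! ...make it visible to other images...
  round  = 1                       ! ...then advance the round marker

  partner = merge(me + 1, 1, me < num_images())

  do                               ! spin until the partner's round 1 is out
    sync memory
    if (round[partner] >= 1) exit
  end do
  local_copy = buffer(:)[partner]  ! now safe to copy that round's data
end program
```

The point is that each image decides for itself, from the round marker, whether the remote data is ready, rather than trusting a global barrier.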

As to text output; it is a known bug and you'll have to get used to it, as there is very little (as said by those who know) that can be done about it.

Knarfnarf

evansste

Apr 1, 2022, 9:41:18 PM4/1/22
to OpenCoarrays
Thanks so much for responding, Knarfnarf.

I should point out that my cluster has 256 actual cores.  It's composed of eight blade servers, which are set up as compute nodes.  Each blade has 32 actual cores.  I chose 248 cores in order to leave a single core available, on each compute node, to run the operating system.  I don't know whether or not my cluster software delegates cores that way; that was just my reasoning for choosing 248.

Because of all of this, I still wonder if the problem has to do with my cluster setup and internal MPI commands.

It's good to know that the text outputs may be inaccurate.  Even so, it still makes me wonder what would cause the corresponding "if-then" tests to fail.  After all, these statements are allowed to print only if the "if-then" conditions are true.

I'd be very interested in seeing whether or not these programs would run into any problems if they were run on a single machine that has 256 cores -- not a cluster.  Unfortunately, a machine like that is almost impossible to come by.

I was able to find the following:


It has 128 cores in a single computer.  Such machines are extremely rare.

Thank you for taking the time to respond to my post.  I truly value your input and attention.

Frank Meyer

Apr 2, 2022, 2:43:12 AM4/2/22
to OpenCoarrays
I regret to say that most of your code is well beyond my understanding, and my ability to run!

The code does compile, but it does not run, and I do not have the math skill to figure out why. I might try reasoning out the execution path by adding more print statements.

As to the test statements; remember that all threads do not run at the same speed (!!!), so without testing to make sure that data is being replicated properly, there will be errors. Try adopting a scheme like the one I had to use, and test for the need to sleep before bringing in data from adjacent nodes.

I do notice that you are not using the built-in routines for team management; is that a choice? My understanding is that they reduce the cross chatter between threads of different teams and allow the correct threads to replicate data faster.

As to the cores per machine; I remember a monster touted by IBM as the next big thing that had a ridiculous number of cores on it... Can't remember the year that was or the total core count. Through put killed it, though.

I hope you find any of that useful!

Knarfnarf

evansste

Apr 2, 2022, 9:13:41 PM4/2/22
to OpenCoarrays
Thanks for your response.

You said that you were able to compile the program, but that it won't run.  Are you saying that it won't run even with 24 cores, or fewer?

I should point out that a successful run of this program means that nothing will print out.  So, if you run the program and all you receive is a return prompt a few seconds later, then it ran successfully.  Both programs only print something if something goes wrong.

You asked why I've chosen not to use the team feature of OpenCoarrays.  I've chosen not to use it because, the last time I checked, their implementation of teams is not fully functional.  You're allowed to create teams, but you're limited in the degree to which you can use them.  If I remember correctly, OpenCoarrays is not yet written so that you can use the square bracket notation to transfer information between teams.  For instance, if you write a program that has a statement like "a[1, team_number=2] = n", the program won't compile.  The coarray Fortran compiler gives you an error message, and doesn't recognize "team_number=2"; it treats it like a syntax error.
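For anyone curious, the standard Fortran 2018 teams syntax I'm referring to looks roughly like this (a sketch per the standard; whether your OpenCoarrays build accepts all of it is exactly the problem):

```fortran
program teams_sketch
  ! Sketch of Fortran 2018 teams, as described in the standard.
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: my_team_var
  integer :: a[*], n_teams, my_team

  n_teams = 2
  my_team = mod(this_image() - 1, n_teams) + 1
  form team (my_team, my_team_var)

  change team (my_team_var)
    a = this_image()      ! image numbers are relative to the team here
  end change team

  ! Cross-team access via team_number= in the image selector is the
  ! part that, in my experience, OpenCoarrays rejects at compile time:
  ! a[1, team_number=2] = 5
end program
```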

Because of this, I've written the program in such a way that I can use the teams concept with plain coarrays and no actual teams.

I understand how the program may seem confusing.  However, it mostly boils down to envisioning how the teams concept might assign a team number to each image.  In this program, I imagine it assigning values in a systematic way -- like filling a two-dimensional matrix.  You could imagine the images being assigned team numbers by filling a matrix that has 'N' columns.  It fills this matrix, from left to right, top to bottom, the following way:

image1,        image2,        image3,        image4,        ...,  imageN
imageN+1,      imageN+2,      imageN+3,      imageN+4,      ...,  image2*N
image2*N+1,    image2*N+2,    image2*N+3,    image2*N+4,    ...,  image3*N
...

This is with "N" being the number of teams.  It fills an imaginary matrix this way until it runs out of images.  Each column of this imagined matrix represents the images that belong to a team.  The first column has all images of team 1, the second column has all images in team 2, and so on.  So image1, imageN+1, and image2*N+1 would all be in team 1; image2, imageN+2, and image2*N+2 would all be in team 2; and so on.
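In code, that assignment rule is just modular arithmetic on the image index (a sketch, with an example value of N; these names are illustrative, not from the attached programs):

```fortran
program team_map_sketch
  ! Sketch: recover the team number (column) and row for this image,
  ! given N teams, using the image-numbering scheme described above.
  implicit none
  integer :: img, n, team, row

  img  = this_image()
  n    = 8                       ! N, the number of teams (example value)
  team = mod(img - 1, n) + 1     ! column of the imagined matrix
  row  = (img - 1) / n + 1       ! row of the imagined matrix
  print *, 'image', img, '-> team', team, ', row', row
end program
```

For instance, with N = 8, imageN+1 = image9 maps to team 1, row 2, matching the second row of the matrix above.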

I understand not being up to the task of untangling the intricacies of someone else's program.  I'm exactly the same way.  Unless I wrote it, I usually don't like trying to debug it, or follow it.  In any case, hopefully the imagined matrix will help make it seem less daunting.

You're right that each image will work at its own rate.  For this reason, the programs use the "syncimlist" function.  That function waits until all of the images in its list reach the same line in the program before moving forward.  The program makes sure that all of the images it's going to work with have made all of their changes before attempting to access any of those remote images.
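In effect, this is a partial barrier over a subset of images, which standard Fortran expresses with the sync images statement (a simplified sketch of the idea, not the exact code from the attachments):

```fortran
subroutine syncimlist_sketch(imlist)
  ! Sketch: a partial barrier over the listed images only.
  ! Each image in imlist must execute a matching sync images
  ! naming this image, or the statement will wait forever.
  implicit none
  integer, intent(in) :: imlist(:)

  sync images (imlist)   ! waits pairwise on every image in imlist
end subroutine
```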

For me, the biggest piece of evidence that the issue is with MPI and the cluster is the fact that it works with no problem on a single machine.  I can run either of these programs all day long, on a single machine, using 24 images, and never have any issues.

Thanks again for taking the time to help.  As always, I truly appreciate all of your input.

Frank Meyer

Apr 10, 2022, 7:55:17 PM4/10/22
to OpenCoarrays
Hmmm...

My understanding was that the team designation didn't change the thread number, just the number of updates sent out. So if you have threads 1:80 in teams of 10, then the majority of updates will pass through each team, but you can still call t_pointertodata = t_data[80] from thread 1 and expect the data to show up...

Knarfnarf
