Parallel Computing modules

223 views

David Mertens

unread,
19.07.2012, 12:46:30
to the-quanti...@googlegroups.com
I've added a section on parallel computing to perl4science including MPI and GPU stuff. Any others that you have used that I missed?

http://perl4science.github.com/software/

(If that hasn't hit the main web site, you can view the draft here: https://github.com/perl4science/perl4science.github.com/blob/source/source/software/index.markdown)

- David

Christopher Frenz

unread,
19.07.2012, 14:51:29
to the-quanti...@googlegroups.com
Parallel::Loops can be nice for simple tasks on multicore machines. 
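For readers who haven't used it, a minimal Parallel::Loops sketch (the worker count and toy workload here are invented for illustration):

```perl
use strict;
use warnings;
use Parallel::Loops;

# Fork up to 4 child processes at a time.
my $pl = Parallel::Loops->new(4);

my %squares;
$pl->share(\%squares);   # children's writes become visible in the parent

$pl->foreach([1 .. 10], sub {
    # $_ holds the current element; this body runs in a forked child
    $squares{$_} = $_ * $_;
});

print "$_ => $squares{$_}\n" for sort { $a <=> $b } keys %squares;
```

Each iteration runs in a forked child, so shared containers are synchronized back to the parent by serialization rather than true shared memory.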

B. Estrade

unread,
19.07.2012, 18:08:37
to the-quanti...@googlegroups.com
Anything related to OpenMP or threading?

Brett


David Mertens

unread,
19.07.2012, 18:30:31
to the-quanti...@googlegroups.com
On Thursday, July 19, 2012 1:51:29 PM UTC-5, cfrenz wrote:
Parallel::Loops can be nice for simple tasks on multicore machines. 

Not exactly high performance... but ok.

David Mertens

unread,
19.07.2012, 19:12:24
to the-quanti...@googlegroups.com
On Thursday, July 19, 2012 5:08:37 PM UTC-5, B. Estrade wrote:
Anything related to OpenMP or threading?

Brett

I know of no OpenMP stuff for Perl, sadly. As for threading, I've added a few cool fork-based modules, found in large part thanks to Parallel::Loops, but there seem to be a ton of Thread:: modules and I don't have time to comb through them. Do you have any experience with these? Can you make any recommendations?

David
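One of the better-known fork-based modules is Parallel::ForkManager; a minimal sketch (the task list and worker count are invented for illustration):

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(4);   # at most 4 children at once

for my $task (1 .. 8) {
    $pm->start and next;      # parent forks and moves on to the next task
    # --- child process from here on ---
    my $result = $task ** 2;  # stand-in for real work
    print "task $task -> $result\n";
    $pm->finish;              # child exits
}
$pm->wait_all_children;
```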

Felipe Leprevost

unread,
19.07.2012, 19:43:05
to the-quanti...@googlegroups.com
There is a Perl API for Gearman (http://gearman.org/). Gearman is an "application framework to farm out work to other machines or processes that are better suited to do the work". It's pretty nice.

cheers. 
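A rough sketch of the Perl interface (assuming the Gearman::Worker and Gearman::Client CPAN modules and a gearmand job server on the default port; the function name is invented):

```perl
use strict;
use warnings;
use Gearman::Worker;
use Gearman::Client;

# Worker side: register a named function with the job server.
my $worker = Gearman::Worker->new;
$worker->job_servers('127.0.0.1:4730');
$worker->register_function(reverse_it => sub {
    my $job = shift;
    return scalar reverse $job->arg;
});
# $worker->work while 1;     # serve jobs forever

# Client side: farm the work out and wait for the result.
my $client = Gearman::Client->new;
$client->job_servers('127.0.0.1:4730');
my $result_ref = $client->do_task(reverse_it => 'hello');
# do_task returns a reference to the result scalar
```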

David Mertens

unread,
20.07.2012, 23:19:54
to the-quanti...@googlegroups.com
OK, so I've had suggestions for Parallel::Loops (and therefore Perl forks), Perl threading, and Gearman. These are all excellent ways to parallelize Perl tasks, but I am not confident that they are good strategies for scientific (i.e. high-performance) computing. Some investigation and benchmarking may be needed to work out which of these technologies is truly useful for such computing. My guess is that none of them support efficient data transfer or shared memory, meaning they are not well suited for anything apart from embarrassingly parallel problems, and maybe not even that.

If we create a useful shoot-out, I'll be happy to write about them on the perl4science blog. :-)

B. Estrade

unread,
27.07.2012, 11:48:15
to the-quanti...@googlegroups.com
Sorry for the delay.

I thought I had seen some PDL threading work recently. This might have been it:

http://search.cpan.org/~chm/PDL-2.4.11/Basic/Pod/ParallelCPU.pod

I have searched high and low for OpenMP-related Perl interfaces, but I think the
issues involved in providing them may simply be too complex. I toy with approaches
every once in a while, but never seem to get anywhere.

Thanks!
Brett



David Mertens

unread,
02.08.2012, 09:03:09
to the-quanti...@googlegroups.com
Ah, yes, PDL does offer some built-in facilities for parallel processing if your system supports posix threads. That's worth mentioning. :-)

David
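For the record, enabling PDL's built-in pthread support looks roughly like this (a sketch based on the ParallelCPU pod linked earlier; the thread count is illustrative):

```perl
use strict;
use warnings;
use PDL;

# Let PDL use up to 4 POSIX threads for its threaded operations
# (requires a PDL built with pthread support).
set_autopthread_targ(4);
set_autopthread_size(1);   # only auto-thread sufficiently large piddles
                           # (see the pod for the exact size units)

my $x = sequence(4000, 4000);
my $y = sin($x) + 1;       # element-wise ops may now run multi-threaded

print 'pthreads used: ', get_autopthread_actual(), "\n";
```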

Mario Roy

unread,
06.02.2013, 23:01:17
to the-quanti...@googlegroups.com

This is an early code demonstration of parallelizing PDL with MCE. MCE release 1.4 will be out in a couple of weeks.

MCE loves big log files. MCE is a chunking engine plus parallel engine all in one. It's also powerful for things like helping PDL make full use of available cores.

; mario


#!/usr/bin/env perl

##
## Demonstration of parallelizing PDL with Many-core Engine for Perl (MCE)
## using Strassen's algorithm for matrix multiplication.
##
## Requires MCE version 1.4 or later to run.
## http://code.google.com/p/many-core-engine-perl/
##
## MCE is my personal project. I'm new to PDL and wanted to see if PDL + MCE
## can be combined to maximize on available cores. I had no idea what to
## expect and was pleasantly surprised. The 1.4 release adds the send method.
##
## PDL is extremely powerful by itself. However, add MCE to it and be amazed.
##
## Regards,
##   Mario Roy
##

use strict;
use warnings;

use FindBin;
use lib "$FindBin::Bin/lib";
use Time::HiRes qw(time);

use PDL;
use PDL::IO::Storable;                   ## Required for the PDL + MCE combo

use MCE::Signal qw(-use_dev_shm);
use MCE;

my $tam = 2048;                          ## Size 2048x2048 (power of 2 only)

my $a = sequence $tam,$tam;
my $b = sequence $tam,$tam;
my $c = zeroes   $tam,$tam;

my $max_parallel_level = 1;              ## Levels deep to parallelize
my @p = ( );                             ## For MCE results - must be global

my $start = time();

strassen($a, $b, $c, $tam);              ## Start matrix multiplication

my $end = time();

# print $c;

printf STDERR "\n## Compute time: %0.03f (secs)\n\n", $end - $start;

##

sub strassen {

   my $a = $_[0]; my $b = $_[1]; my $c = $_[2]; my $tam = $_[3];
   my $level = $_[4] || 0;

   ## Perform the classic multiplication when matrix is <= 128 x 128

   if ($tam <= 128) {

      # for my $i (0 .. $tam - 1) {        ## Perl arrays
      #    for my $j (0 .. $tam - 1) {
      #       $c->[$i][$j] = 0;
      #       for my $k (0 .. $tam - 1) {
      #          $c->[$i][$j] += $a->[$i][$k] * $b->[$k][$j];
      #       }
      #    }
      # }

      ins(inplace($c), $a x $b);         ## PDL

      return;
   }

   ## Otherwise, perform multiplication using Strassen's algorithm

   my ($mce, $p1, $p2, $p3, $p4, $p5, $p6, $p7);

   my $nTam = $tam / 2;

   if (++$level <= $max_parallel_level) {

      ## Configure and spawn MCE workers early

      sub store_result {
         my ($n, $result) = @_;
         $p[$n] = $result;
      }

      $mce = MCE->new(
         max_workers => 7,
         user_tasks => [{
            user_func => sub {
               my $self = $_[0];
               my $data = $self->{user_data};
               my $result = zeroes $nTam,$nTam;
               strassen($data->[0], $data->[1], $result, $data->[3], $level);
               $self->do('store_result', $data->[2], $result);
            },
            task_end => sub {
               $p1 = $p[1]; $p2 = $p[2]; $p3 = $p[3]; $p4 = $p[4];
               $p5 = $p[5]; $p6 = $p[6]; $p7 = $p[7];
               @p  = ( );
            }
         }]
      );

      $mce->spawn();
   }

   ## Allocate memory after spawning MCE workers

   my $a11 = zeroes $nTam,$nTam;  my $a12 = zeroes $nTam,$nTam;
   my $a21 = zeroes $nTam,$nTam;  my $a22 = zeroes $nTam,$nTam;
   my $b11 = zeroes $nTam,$nTam;  my $b12 = zeroes $nTam,$nTam;
   my $b21 = zeroes $nTam,$nTam;  my $b22 = zeroes $nTam,$nTam;

   my $t1  = zeroes $nTam,$nTam;  my $t2  = zeroes $nTam,$nTam;

   $p1 = zeroes $nTam,$nTam;  $p2 = zeroes $nTam,$nTam;
   $p3 = zeroes $nTam,$nTam;  $p4 = zeroes $nTam,$nTam;
   $p5 = zeroes $nTam,$nTam;  $p6 = zeroes $nTam,$nTam;
   $p7 = zeroes $nTam,$nTam;

   ## Divide the matrices into 4 sub-matrices

   divide_m($a11, $a12, $a21, $a22, $a, $nTam);
   divide_m($b11, $b12, $b21, $b22, $b, $nTam);

   ## Calculate p1 to p7

   if ($level <= $max_parallel_level) {
      sum_m($a11, $a22, $t1, $nTam);
      sum_m($b11, $b22, $t2, $nTam);
      $mce->send([ $t1, $t2, 1, $nTam ]);

      sum_m($a21, $a22, $t1, $nTam);
      $mce->send([ $t1, $b11, 2, $nTam ]);

      subtract_m($b12, $b22, $t2, $nTam);
      $mce->send([ $a11, $t2, 3, $nTam ]);

      subtract_m($b21, $b11, $t2, $nTam);
      $mce->send([ $a22, $t2, 4, $nTam ]);

      sum_m($a11, $a12, $t1, $nTam);
      $mce->send([ $t1, $b22, 5, $nTam ]);

      subtract_m($a21, $a11, $t1, $nTam);
      sum_m($b11, $b12, $t2, $nTam);
      $mce->send([ $t1, $t2, 6, $nTam ]);

      subtract_m($a12, $a22, $t1, $nTam);
      sum_m($b21, $b22, $t2, $nTam);
      $mce->send([ $t1, $t2, 7, $nTam ]);

      $mce->run();

   } else {
      sum_m($a11, $a22, $t1, $nTam);
      sum_m($b11, $b22, $t2, $nTam);
      strassen($t1, $t2, $p1, $nTam, $level);

      sum_m($a21, $a22, $t1, $nTam);
      strassen($t1, $b11, $p2, $nTam, $level);

      subtract_m($b12, $b22, $t2, $nTam);
      strassen($a11, $t2, $p3, $nTam, $level);

      subtract_m($b21, $b11, $t2, $nTam);
      strassen($a22, $t2, $p4, $nTam, $level);

      sum_m($a11, $a12, $t1, $nTam);
      strassen($t1, $b22, $p5, $nTam, $level);

      subtract_m($a21, $a11, $t1, $nTam);
      sum_m($b11, $b12, $t2, $nTam);
      strassen($t1, $t2, $p6, $nTam, $level);

      subtract_m($a12, $a22, $t1, $nTam);
      sum_m($b21, $b22, $t2, $nTam);
      strassen($t1, $t2, $p7, $nTam, $level);
   }

   ## Calculate and group into a single matrix $c

   calc_m($p1, $p2, $p3, $p4, $p5, $p6, $p7, $c, $nTam);

   return;
}

sub divide_m {

   my $m11 = $_[0]; my $m12 = $_[1]; my $m21 = $_[2]; my $m22 = $_[3];
   my $m   = $_[4]; my $tam = $_[5];

   # for my $i (0 .. $tam - 1) {           ## Perl arrays
   #    for my $j (0 .. $tam - 1) {
   #       $m11->[$i][$j] = $m->[$i][$j];
   #       $m12->[$i][$j] = $m->[$i][$j + $tam];
   #       $m21->[$i][$j] = $m->[$i + $tam][$j];
   #       $m22->[$i][$j] = $m->[$i + $tam][$j + $tam];
   #    }
   # }

   my $n1 = $tam - 1;                     ## PDL
   my $n2 = $tam + $n1;

   ins(inplace($m11), $m->slice("0:$n1,0:$n1"));
   ins(inplace($m12), $m->slice("$tam:$n2,0:$n1"));
   ins(inplace($m21), $m->slice("0:$n1,$tam:$n2"));
   ins(inplace($m22), $m->slice("$tam:$n2,$tam:$n2"));

   return;
}

sub calc_m {

   my $p1  = $_[0]; my $p2  = $_[1]; my $p3  = $_[2]; my $p4  = $_[3];
   my $p5  = $_[4]; my $p6  = $_[5]; my $p7  = $_[6]; my $c   = $_[7];
   my $tam = $_[8];

   my $c11 = zeroes $tam,$tam;  my $c12 = zeroes $tam,$tam;
   my $c21 = zeroes $tam,$tam;  my $c22 = zeroes $tam,$tam;
   my $t1  = zeroes $tam,$tam;  my $t2  = zeroes $tam,$tam;

   sum_m($p1, $p4, $t1, $tam);
   sum_m($t1, $p7, $t2, $tam);
   subtract_m($t2, $p5, $c11, $tam);

   sum_m($p3, $p5, $c12, $tam);
   sum_m($p2, $p4, $c21, $tam);

   sum_m($p1, $p3, $t1, $tam);
   sum_m($t1, $p6, $t2, $tam);
   subtract_m($t2, $p2, $c22, $tam);

   # for my $i (0 .. $tam - 1) {           ## Perl arrays
   #    for my $j (0 .. $tam - 1) {
   #       $c->[$i][$j] = $c11->[$i][$j];
   #       $c->[$i][$j + $tam] = $c12->[$i][$j];
   #       $c->[$i + $tam][$j] = $c21->[$i][$j];
   #       $c->[$i + $tam][$j + $tam] = $c22->[$i][$j];
   #    }
   # }

   ins(inplace($c), $c11, 0, 0);          ## PDL
   ins(inplace($c), $c12, $tam, 0);
   ins(inplace($c), $c21, 0, $tam);
   ins(inplace($c), $c22, $tam, $tam);

   return;
}

sub sum_m {

   my $a = $_[0]; my $b = $_[1]; my $r = $_[2]; my $tam = $_[3];

   # for my $i (0 .. $tam - 1) {           ## Perl arrays
   #    for my $j (0 .. $tam - 1) {
   #       $r->[$i][$j] = $a->[$i][$j] + $b->[$i][$j];
   #    }
   # }

   ins(inplace($r), $a + $b);             ## PDL

   return;
}

sub subtract_m {

   my $a = $_[0]; my $b = $_[1]; my $r = $_[2]; my $tam = $_[3];

   # for my $i (0 .. $tam - 1) {           ## Perl arrays
   #    for my $j (0 .. $tam - 1) {
   #       $r->[$i][$j] = $a->[$i][$j] - $b->[$i][$j];
   #    }
   # }

   ins(inplace($r), $a - $b);             ## PDL

   return;
}




Mario Roy

unread,
11.02.2013, 08:19:03
to the-quanti...@googlegroups.com

Many-core Engine for Perl (MCE) 1.4 has been released.


That release comes with several examples demonstrating matrix multiplication under examples/matmult/. The README file also includes benchmark results taken on modern hardware.

Regards,
Mario

Demian Riccardi

unread,
11.02.2013, 13:11:59
to the-quanti...@googlegroups.com
This is very exciting!!

Mario Roy

unread,
11.02.2013, 17:24:35
to the-quanti...@googlegroups.com
I want to share some great news on the Windows side.

All matrix multiplication examples work equally well under the Windows environment when the threads modules are loaded after PDL:

use PDL;
use PDL::IO::Storable;                   ## Required for PDL + MCE combo

use threads;
use threads::shared;


MCE will auto-detect that and use threads internally.  One doesn't have to explicitly set the use_threads => 1 option for MCE.

I will update the examples to include the threads modules automatically when running under the Windows environment. Now, both strassen_pdl_m.pl and matmult_pdl_m.pl work as expected under the Windows environment. Yeah :)

Tested with Windows 7 (32-bit) and ActiveState 5.14.3.1404 and PDL 2.4.9 (installed via ppm).

Wanted to let you guys know.

PDL is amazingly fast. PDL + MCE is simply powerful. I created many matrix multiplication examples just to showcase the various ways one can pass data around with MCE. Workers can fetch data for the A matrix and read a shared cache file for the B matrix. Another example sends data individually, after spawning workers and prior to running. The do method in MCE can send and also receive data (a very natural cross-over), and it is serialized: only one worker can call do at any given time, so one does not have to worry about locking resources at the manager-process level. http://cpansearch.perl.org/src/MARIOROY/MCE-1.400/images/08_Natural_Callback.gif

MCE is a chunking engine, follows a bank-queuing model for input data, has a sequence engine (which can be chunked as well), and has the serialized do action (which can be called as often as needed). It's the Many-core Engine I've been working on for quite some time and just released last November. MCE loves big files -- look at the egrep.pl and wc.pl examples, and try them on very big files (with very long lines). The bank-queuing model helps ensure only one worker is reading IO at a time, because sequential IO is generally reported to be faster than random IO (many workers reading simultaneously). There are only 4 shared socket pairs in total, no matter the number of workers. The egrep example shows how one can use the chunk_id to preserve output order in real time without having to wait until the entire content has been processed; line numbers are accurately reported.

Not sure if folks have seen the images, but when you have a chance, take a look at:


Our SNMP poller is now using the new user_tasks option introduced in MCE 1.2. Pingers, pollers, and writers all run simultaneously (all under one MCE instance). Net::Ping + Net::SNMP + AnyEvent::SNMP + MCE is very powerful. We are polling more than 30,000 devices for 20+ metrics (a couple of walks each) all in a single minute, using chunk_size => 600. This is powerful and has been running for quite some time: http://code.google.com/p/many-core-engine-perl/wiki/MCE_Tasks
It's a beauty to watch the CPU percentage go over 2250% for the threaded process. It's ridiculously powerful. What I've learned along the way is that chunking reduces overhead.

Then, parallelizing matrix multiplication was my next step.  What's next -- oh, cannot tell you yet :)

Mario

Mario Roy

unread,
12.02.2013, 01:21:44
to the-quanti...@googlegroups.com
Hi all,

MCE 1.401 will be released this coming weekend. The matrix multiplication examples will get a small update. The Windows environment does not support mmap. Thus far, no changes are needed to MCE.pm or MCE::Signal, just the examples.

It looks like MCE will be able to parallelize PDL quite well after all, among other things. I will benchmark 2048x2048 on the next round instead of 1024x1024. MCE's advantage grows with the matrix dimensions compared to plain PDL's $c = $a x $b. I had no idea.

The research on having MCE parallelize matrix multiplication using pure Perl or PDL will be completed with MCE 1.401. All matmult examples will work under the Windows environment as well -- testing just completed.

:) Regards, Mario

Mario Roy

unread,
12.02.2013, 19:58:43
to the-quanti...@googlegroups.com
I'm happy to report that all regressions with PDL + MCE have been solved. One can use either use_threads => 0 or 1; it doesn't matter, even under the MSWin32 environment. Workers exit gracefully as usual. I had no idea that MCE was failing when used with PDL. The same was true of PDL + MCE 1.306 when using threads under Unix or forking under MSWin32.

This completes the long research on having MCE perform Matrix Multiplication as pure Perl code or with PDL.


It turns out PDL::CLONE_SKIP { 1 } was all I needed, which I read about just days ago. The following are the lines added inside MCE.pm. David, had I not received the initial email from you, I would have struggled sorely to figure out why threads were crashing on exit when using MCE with PDL. Your email led me to this site, which got me reading a few posts, one of which mentioned CLONE_SKIP.

## PDL + MCE (spawning as threads) is not stable. A comment from David Mertens
## mentioned the fix for his PDL::Parallel::threads module. The CLONE_SKIP is
## also needed here in order for PDL + MCE threads to not crash during exiting.
## Thanks goes to David !!! I would have definitely struggled with this one.
##
sub PDL::CLONE_SKIP { 1 }


Thanks David and all.

Mario

Mario Roy

unread,
12.02.2013, 20:14:51
to the-quanti...@googlegroups.com

If folks agree, MCE 1.401 can be added to the following site.


Perl under Windows or Unix can run with use_threads => 0 or 1 with PDL. The matrix multiplication examples have been updated to show how workers keep only a local copy of the "b" matrix while the manager process holds the "a" and "c" matrices. MCE's do method is used to fetch data as well as send results. Although mmap IO doesn't work under Windows, I went with a cache file to store the "b" matrix for workers to read from.

The README contains benchmark results for 1024x1024, 2048x2048, 4096x4096, and 8192x8192. The matmult_pdl_m.pl example scales linearly across all logical processors with near 100% CPU utilization. In my opinion, the technique is good at making full use of all available cores, keeping memory consumption low, and keeping overhead to a minimum.

Regards,
Mario



David Mertens

unread,
13.02.2013, 00:07:15
to the-quanti...@googlegroups.com
 
Mario -

I think this is great work and a very nice set of benchmarks. I have meant for some time to take all the different parallelization frameworks out for a spin and write up my experiences. I have not done that, but I was able to spend a couple of hours working through some of your code and trying my hand at things. So, I propose:

A Challenge!

I wrote PDL::Parallel::threads specifically to allow for transparent PDL data-sharing across multiple Perl threads. Pretty much the only computational or memory overhead involved in this scheme is Perl's thread overhead. I would like to see this eventually rolled into PDL's core modules (especially the CLONE_SKIP part), but for now it sits out on CPAN.
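The data-sharing interface looks roughly like this (a sketch based on the module's documented share_as/retrieve_pdls calls; the piddle name and workload are invented):

```perl
use strict;
use warnings;
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls);
use threads;

my $data = sequence(1_000_000);
$data->share_as('workspace');       # share the underlying memory by name

my @workers = map {
    my $tid = $_;
    threads->create(sub {
        # Same physical memory as $data -- no copying between threads.
        my $local = retrieve_pdls('workspace');
        my ($lo, $hi) = ($tid * 250_000, ($tid + 1) * 250_000 - 1);
        my $chunk = $local->slice("$lo:$hi");
        $chunk *= 2;                # work on this thread's quarter in place
    });
} 0 .. 3;
$_->join for @workers;
```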

Using PDL::Parallel::threads for data sharing and the included but fairly independent PDL::Parallel::threads::SIMD module, I wrote a Perl-threaded implementation of PDL matrix multiplication. Benchmarks on my machine are shown below; my implementation is called matmult_pdl_thr.pl:

## matmult_pdl_b.pl   32: compute time: 0.000 secs
## matmult_pdl_thr.pl 32: compute time: 0.136 secs
## matmult_pdl_m.pl   32: compute time: 0.188 secs
## strassen_pdl_m.pl  32: compute time: 0.000 secs

## matmult_pdl_b.pl   64: compute time: 0.001 secs
## matmult_pdl_thr.pl 64: compute time: 0.138 secs
## matmult_pdl_m.pl   64: compute time: 0.109 secs
## strassen_pdl_m.pl  64: compute time: 0.001 secs

## matmult_pdl_b.pl   128: compute time: 0.007 secs
## matmult_pdl_thr.pl 128: compute time: 0.139 secs
## matmult_pdl_m.pl   128: compute time: 0.141 secs
## strassen_pdl_m.pl  128: compute time: 0.007 secs

## matmult_pdl_b.pl   256: compute time: 0.053 secs
## matmult_pdl_thr.pl 256: compute time: 0.152 secs
## matmult_pdl_m.pl   256: compute time: 0.499 secs
## strassen_pdl_m.pl  256: compute time: 0.040 secs

## matmult_pdl_b.pl   512: compute time: 0.431 secs
## matmult_pdl_thr.pl 512: compute time: 0.222 secs
## matmult_pdl_m.pl   512: compute time: 2.046 secs
## strassen_pdl_m.pl  512: compute time: 0.147 secs

## matmult_pdl_b.pl   1024: compute time: 3.605 secs
## matmult_pdl_thr.pl 1024: compute time: 0.965 secs
## matmult_pdl_m.pl   1024: compute time: 8.530 secs
## strassen_pdl_m.pl  1024: compute time: 0.801 secs

## matmult_pdl_b.pl   2048: compute time: 29.346 secs
## matmult_pdl_thr.pl 2048: compute time: 6.233 secs
## matmult_pdl_m.pl   2048: compute time: 37.156 secs
## strassen_pdl_m.pl  2048: compute time: 4.927 secs

## matmult_pdl_b.pl   4096: compute time: 234.716 secs
## matmult_pdl_thr.pl 4096: compute time: 64.579 secs
## matmult_pdl_m.pl   4096: -- took too long --
## strassen_pdl_m.pl  4096: -- took too long --

For small matrices, my code loses due to the overhead of setting up the Perl threads. However, for sizes 1024 and larger, the multithreaded approach pays off. The big difference between my code and the Strassen algorithm is that my code churns the CPU while the Strassen algorithm demolishes memory. The memory consumption ultimately leads to Strassen's demise in the 4096 case, where enormous memory allocations and/or swapping out to disk kill the performance. Also, my implementation is *much* shorter. :-)

I suspect that matmult_pdl_m.pl could be improved if you change how you assign values. I don't quite understand the structure of your code, or how threaded/forked code in MCE interacts with data in the main thread/fork, but within a thread the general rule with PDL assignment is to assign to a slice. So, for example, it is better to say this:

# uses PDL::NiceSlice
$piddle(5:10) .= $new_values;

than to say this:

$piddle->ins($new_values, 5);

and a single assignment of many rows (or columns) is better than one Perl assignment for each row (or column), especially if the number of Perl assignments scales with the size of the data.

I'll have to look more into the Strassen algorithm. I had never heard of it until this thread, and I'd like to learn more.

So, do you accept the challenge? Do you think you can beat my threaded implementation? Do you think you can improve matmult_pdl_m.pl with more PDLish assignments, or by using PDL::Parallel::threads to manage the data sharing instead of memory mapping?

David

P.S. Memory mapping should now work for PDL on Windows as well as Linux and Mac as of the latest (non-developer) version of PDL. If you ran into trouble with memory mapping, updating your copy of PDL should fix that.

Mario Roy

unread,
13.02.2013, 00:46:14
to the-quanti...@googlegroups.com
Hi David,

That was a great post comparing the two. I'm new to PDL, just started about 3 weeks ago. I can try a couple of things.

Can you post the matmult_pdl_thr.pl source somewhere where I can look at it? That would be quite helpful.

It's a definite yes on the challenge, for the sole reason of wanting to help scientists out there who want more parallelism with PDL. :)

Thanks,
Mario

Mario Roy

unread,
13.02.2013, 00:51:25
to the-quanti...@googlegroups.com

I did a search and found your amazing example at:

Very interesting. What a great way to learn PDL.

Thanks,
Mario

Mario Roy

unread,
13.02.2013, 01:00:42
to the-quanti...@googlegroups.com
Hi David,

One of the reasons your example is extremely fast is the use of the "x" operator for matrix multiplication; the MCE example uses the "*" operator. The "x" operator in PDL is very fast. The strassen examples in MCE make use of "x", but matmult_pdl_m.pl does not.

What a journey. You've been so helpful. First the tip on PDL::CLONE_SKIP, and now a great example that stays close to idiomatic PDL during parallelization.

Amazing. :)

Mario Roy

unread,
13.02.2013, 06:16:46
to the-quanti...@googlegroups.com
Thank you David !!!  What a way to learn PDL. Check out both matmult_pdl_m.pl and matmult_pdl_n.pl (the latter makes use of PDL::IO::FastRaw).

The SVN repo has been updated with the latest updates to the examples (r191).

The two examples mentioned above make use of the step size for sequence, which helps remove some overhead; a step size of 10 is all that's needed. However, the main boost comes from using the "x" operator when performing the matrix multiplication, which was not the case before. That, essentially, was the main reason for the large difference in compute time between the two implementations.

e.g.  sequence  => { begin => 0, end => $rows - 1, step => $step_size },

The matmult_pdl_n.pl example runs faster than strassen_pdl_m.pl, which was a pleasant surprise, up to 4096x4096 on a 24-way box.

Enjoy the new results below, specifically for matmul_pdl_m.pl and matmul_pdl_n.pl:


## -- Results for 1024x1024 ---------------------------------------------------
##
## matmul_pdl_b.pl    1024: compute:   2.705 secs   1 worker   ( 1 running)
## matmul_pdl_b.pl    1024: compute:  11.035 secs   1 worker   (24 running)
##
## matmul_pdl_m.pl    1024: compute:   0.697 secs   8 workers  ( 1 running)
## matmul_pdl_m.pl    1024: compute:   1.625 secs   8 workers  ( 3 running)
## matmul_pdl_m.pl    1024: compute:   0.705 secs  24 workers  ( 1 running)
##
## matmul_pdl_n.pl    1024: compute:   0.500 secs   8 workers  ( 1 running)
## matmul_pdl_n.pl    1024: compute:   0.978 secs   8 workers  ( 3 running)
## matmul_pdl_n.pl    1024: compute:   0.368 secs  24 workers  ( 1 running)
##
## matmul_perl_m.pl   1024: compute:  33.833 secs   8 workers  ( 1 running)
## matmul_perl_m.pl   1024: compute:  69.830 secs   8 workers  ( 3 running)
## matmul_perl_m.pl   1024: compute:  23.995 secs  24 workers  ( 1 running)
##
## strassen_pdl_m.pl  1024: compute:   0.564 secs   7 workers  ( 1 running)
## strassen_perl_m.pl 1024: compute:  45.408 secs   7 workers  ( 1 running)
##
## Output
##    (0,0) 365967179776  (1023,1023) 563314846859776
##

## -- Results for 2048x2048 ---------------------------------------------------
##
## matmul_pdl_b.pl    2048: compute:  21.470 secs   1 worker   ( 1 running)
## matmul_pdl_b.pl    2048: compute:  96.217 secs   1 worker   (24 running)
##
## matmul_pdl_m.pl    2048: compute:   4.873 secs   8 workers  ( 1 running)
## matmul_pdl_m.pl    2048: compute:  12.610 secs   8 workers  ( 3 running)
## matmul_pdl_m.pl    2048: compute:   4.715 secs  24 workers  ( 1 running)
##
## matmul_pdl_n.pl    2048: compute:   3.198 secs   8 workers  ( 1 running)
## matmul_pdl_n.pl    2048: compute:   7.453 secs   8 workers  ( 3 running)
## matmul_pdl_n.pl    2048: compute:   2.515 secs  24 workers  ( 1 running)
##
## matmul_perl_m.pl   2048: compute: 270.556 secs   8 workers  ( 1 running)
## matmul_perl_m.pl   2048: compute: 558.837 secs   8 workers  ( 3 running)
## matmul_perl_m.pl   2048: compute: 190.302 secs  24 workers  ( 1 running)
##
## strassen_pdl_m.pl  2048: compute:   2.734 secs   7 workers  ( 1 running)
## strassen_perl_m.pl 2048: compute: 322.932 secs   1 level  parallelization
## strassen_perl_m.pl 2048: compute: 200.440 secs   2 levels parallelization
##
## Output
##    (0,0) 5859767746560  (2047,2047) 1.80202496872953e+16  matmul examples
##    (0,0) 5859767746560  (2047,2047) 1.8020249687295e+16   strassen examples
##

## -- Results for 4096x4096 ---------------------------------------------------
##
## matmul_pdl_b.pl    4096: compute: 172.220 secs   1 worker   ( 1 running)
## matmul_pdl_m.pl    4096: compute:  35.923 secs  24 workers  ( 1 running)
## matmul_pdl_n.pl    4096: compute:  23.580 secs  24 workers  ( 1 running)
## strassen_pdl_m.pl  4096: compute:  16.941 secs   7 workers  ( 1 running)
##
## Output
##    (0,0) 93790635294720  (4095,4095) 5.76554474219245e+17  matmul examples
##    (0,0) 93790635294720  (4095,4095) 5.76554474219244e+17  strassen example
##


Again, the updates are posted at the SVN repo mentioned above. I have not cut a new MCE release just yet :) No changes were made to MCE.pm or MCE::Signal.pm; only examples/matmult/* were updated.

Cheers,
Mario

David Mertens

unread,
13.02.2013, 08:58:13
to the-quanti...@googlegroups.com
On Tue, Feb 12, 2013 at 11:46 PM, Mario Roy <mari...@gmail.com> wrote:
Hi David,

That was a great post comparing the two. I'm new to PDL, just started about 3 weeks ago. I can try a couple of things.

Feel free to ask questions about PDL as they arise, either here or on the PDL mailing list.
 
Can you post the matmul_pdl_thr.pl source somewhere where I can look at it. That will be quite helpful.

Yeah, the link was in the original post, but it probably wasn't obvious. I see you've found it, but in case anybody else missed it: https://gist.github.com/run4flat/4942132. Note you'll need to install both PDL as well as PDL::Parallel::threads to run my code.
 
It's a definite yes on the challenge for the sole reason on wanting to help scientists out there wanting more parallelism with PDL. :)

Thanks,
Mario

Excellent! I'll try to work on my stuff and we'll see if we can converge on something cool. :-)

Joel Berger

unread,
13.02.2013, 12:34:46
to the-quanti...@googlegroups.com
I'm sure we can squeeze you in!

:-)

Joel



Mario Roy

unread,
13.02.2013, 17:34:56
to the-quanti...@googlegroups.com
Thank you Joel.

Mario Roy

unread,
13.02.2013, 17:51:08
to the-quanti...@googlegroups.com
I will release MCE 1.402 containing the latest updates to the matrix multiplication examples (submitted in SVN revision r191) this coming weekend. In the meantime, I wanted to benchmark this myself and post the results here.

##  MacBook Pro Intel Core i7 2.00 GHz (4 Cores -- 8 Logical Processors)
##  Perl 5.16.2, PDL 2.4.11, MCE 1.401 (plus updated examples from SVN r191)
##  CentOS 6.3 Linux VM under Parallels Desktop (w/ 8 logical processors)
##  Scripts are configured to use 8 workers, strassen is configured with 7

Both matmult_pdl_m.pl (new) and matmult_pdl_n.pl (new) are similar; the latter utilizes PDL::IO::FastRaw. MCE can keep up with PDL::Parallel::threads::SIMD.

The old version performs matrix multiplication using the "*" operator, whereas the others use the "x" operator, which is quite fast. That was the main reason for the time difference between matmult_pdl_m.pl (old) and matmult_pdl_thr.pl.

matmult_pdl_thr.pl uses PDL::Parallel::threads. All others use MCE.

Compute time for 128x128

   matmult_pdl_b.pl            0.008 secs
   matmult_pdl_m.pl (old)      0.118 secs
   matmult_pdl_thr.pl          0.187 secs
   matmult_pdl_m.pl (new)      0.023 secs
   matmult_pdl_n.pl (new)      0.014 secs
   strassen_pdl_m.pl           0.008 secs

Compute time for 256x256

   matmult_pdl_b.pl            0.058 secs
   matmult_pdl_m.pl (old)      0.409 secs
   matmult_pdl_thr.pl          0.195 secs
   matmult_pdl_m.pl (new)      0.047 secs
   matmult_pdl_n.pl (new)      0.025 secs
   strassen_pdl_m.pl           0.041 secs

Compute time for 512x512

   matmult_pdl_b.pl            0.499 secs
   matmult_pdl_m.pl (old)      1.626 secs
   matmult_pdl_thr.pl          0.290 secs
   matmult_pdl_m.pl (new)      0.180 secs
   matmult_pdl_n.pl (new)      0.127 secs
   strassen_pdl_m.pl           0.151 secs

Compute time for 1024x1024

   matmult_pdl_b.pl           11.352 secs
   matmult_pdl_m.pl (old)      6.839 secs
   matmult_pdl_thr.pl          3.010 secs
   matmult_pdl_m.pl (new)      3.016 secs
   matmult_pdl_n.pl (new)      0.862 secs
   strassen_pdl_m.pl           0.768 secs

Compute time for 2048x2048

   matmult_pdl_b.pl          106.714 secs
   matmult_pdl_m.pl (old)     32.280 secs
   matmult_pdl_thr.pl         29.467 secs
   matmult_pdl_m.pl (new)     28.917 secs
   matmult_pdl_n.pl (new)      7.141 secs
   strassen_pdl_m.pl           4.857 secs

Compute time for 4096x4096

   matmult_pdl_thr.pl        249.702 secs
   matmult_pdl_m.pl (new)    188.379 secs
   matmult_pdl_n.pl (new)     72.565 secs


As always, thanks to David Mertens. Had it not been for your first email suggesting I post here, I don't know if I would have gone this far. PDL's "x" operator is quite fast on its own for matrix multiplication. The modules help PDL make the most of extra cores.

Regards,
-- mario


Mario Roy

unread,
13.02.2013, 18:07:39
to the-quanti...@googlegroups.com
I forgot to mention that I tested with ActiveState's Perl binary under Linux. That binary does not appear to be as optimized as the native Perl binary that is part of CentOS 6.3. Due to lack of time, I chose to test with ActiveState's Perl binary in order to get the latest PDL release, including David's PDL::Parallel::threads package, installable via ppm.

This is perl 5, version 16, subversion 2 (v5.16.2) built for x86_64-linux-thread-multi
(with 1 registered patch, see perl -V for more detail)

Binary build 1602 [296513] provided by ActiveState http://www.ActiveState.com
Built Dec 18 2012 15:07:22

Because everything ran inside a VM, I'm not sure whether, or by how much, additional performance penalties occurred. The benchmarks just posted were all from the same VM using ActiveState's Perl binary. I exported my PATH to /opt/ActivePerl-5.16/bin:$PATH.

Mario Roy

unread,
13.02.2013, 22:30:56
to the-quanti...@googlegroups.com
The results I posted made me wonder about ActiveState's Perl 5.16.2 build. So, I ran matmult_pdl_thr.pl 2048 comparing the native CentOS Perl 5.10.1/PDL 2.4.9 against ActiveState's Perl 5.16.2/PDL 2.4.11, this time on a physical box (running CentOS 6.2) and utilizing 8 worker threads.

CentOS matmult_pdl_thr.pl 2048:  5.670 secs
ActiveState matmult_pdl_thr.pl 2048:   7.968 secs

It is surprising news to me to see ActiveState's Perl binary run so slowly. I remember reading somewhere about Perl 5.16.x being faster these days. Maybe it's how the Perl binary was built, such as the compiler version and optimization options.

-- mario

Mario Roy

unread,
14.02.2013, 08:12:07
to the-quanti...@googlegroups.com

Mario Roy

unread,
14.02.2013, 13:13:42
to the-quanti...@googlegroups.com

This is exciting. First, I updated SVN to wrap PDL::CLONE_SKIP in a block so that no warning appears when using MCE together with PDL::Parallel::threads. Yes, that's right. :)  One can choose either PDL::Parallel::threads::SIMD or MCE to run parallel workers.

{
   no warnings 'redefine';
   sub PDL::CLONE_SKIP { 1 }
}

Here is the code combining PDL, PDL::Parallel::threads, and MCE. Below, all matrices are shared. I will create 2 new examples with the next MCE update. One example will demonstrate sharing only the "b" matrix (workers will fetch and submit results). The 2nd example is shown below (all matrices are shared).

The nice thing about MCE is that workers can be spawned early. One can keep a pool of workers up and running and do many things with the same MCE instance (process many jobs using the same pool of workers).

David's PDL::Parallel::threads module is great because it keeps memory consumption low. MCE provides a lot of flexibility.

Cheers,
Mario


#!/usr/bin/env perl

##
## Usage:
##    perl matmult_pdl_n.pl 1024  ## Default size is 512:  $c = $a x $b
##

use strict;
use warnings;

use FindBin;
use lib "$FindBin::Bin/../../lib";

my $prog_name = $0; $prog_name =~ s{^.*[\\/]}{}g;

use threads;
use threads::shared;

use Time::HiRes qw(time);

use PDL;
use PDL::IO::Storable;                   ## Required for PDL + MCE combo
use PDL::Parallel::threads qw(retrieve_pdls);

use MCE;

my $pdl_version = sprintf("%20s", $PDL::VERSION); $pdl_version =~ s/_.*$//;
my $chk_version = sprintf("%20s", '2.4.11');

if ($^O eq 'MSWin32' && $pdl_version lt $chk_version) {
   print "This script requires PDL 2.4.11 or later for PDL::IO::FastRaw\n";
   print "to work using MMAP IO under the Windows environment.\n";
   exit;
}

###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

my $tam = shift;
   $tam = 512 unless (defined $tam);

unless ($tam > 1) {
   print STDERR "Error: $tam must be an integer greater than 1. Exiting.\n";
   exit 1;
}

my $cols = $tam;
my $rows = $tam;

my $max_workers = 8;

my $mce = configure_and_spawn_mce($max_workers);

my $a = sequence $cols,$rows;  $a->share_as('left_input');
my $b = sequence $rows,$cols;  $b->share_as('right_input');
my $c = zeroes   $rows,$rows;  $c->share_as('output');

my $start = time();

$mce->run(0);

my $end = time();

$mce->shutdown();

printf STDERR "\n## $prog_name $tam: compute time: %0.03f secs\n\n",
   $end - $start;

my $dim_1 = $tam - 1;

print "## (0,0) ", $c->at(0,0), "  ($dim_1,$dim_1) ", $c->at($dim_1,$dim_1);
print "\n\n";

###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

sub configure_and_spawn_mce {

   my $max_workers = shift || 8;

   return MCE->new(

      max_workers => $max_workers,
      job_delay   => ($tam > 2048) ? 0.043 : undef,

      user_func   => sub {
         my ($self) = @_;

         my ($l, $r, $o) = retrieve_pdls(
            'left_input', 'right_input', 'output'
         );

         my $step  = int($rows / $max_workers);
         my $start = ($self->wid - 1) * $step;
         my $stop  = $start + $step - 1;
         $stop = $rows - 1 if ($stop >= $rows);

         use PDL::NiceSlice;
         $o(:,$start:$stop) .= $l(:,$start:$stop) x $r;
         no PDL::NiceSlice;

         return;
      }

   )->spawn;
}
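A side note on the version check near the top of the script: the sprintf "%20s" padding is what makes the plain string comparison behave. Without it, "2.4.9" sorts after "2.4.11" character by character. A minimal pure-Perl sketch of the same trick (a heuristic that works for these version strings, not a general version comparator):

```perl
#!/usr/bin/env perl
# Demonstrates why the script right-aligns version strings in a fixed
# width before comparing with "lt": the padding makes the shorter
# (numerically smaller) version sort first.
use strict;
use warnings;

my $have = sprintf "%20s", '2.4.9';
my $need = sprintf "%20s", '2.4.11';

print "unpadded: ", ('2.4.9' lt '2.4.11' ? 'ok' : 'WRONG'), "\n";  # WRONG
print "padded:   ", ($have   lt $need    ? 'ok' : 'WRONG'), "\n";  # ok
```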

Mario Roy

unread,
14.02.2013, 13:21:17
to the-quanti...@googlegroups.com

Amazing... :)

PDL + PDL::Parallel::threads + PDL::IO::FastRaw + MCE are ingredients for some very powerful applications. There is a lot of flexibility. Folks can choose what to share, what to configure as MMAP IO, and what to serialize; the imagination is the limit. Workers can request and/or submit results as often as needed.

Just wanted to share this news.  If memory is plentiful, PDL::IO::FastRaw can be used to help speed things up. For flexibility, PDL::Parallel::threads comes to mind in order to keep memory consumption low.

Happy Valentine's Day to all!!!

-- mario

Mario Roy

unread,
15.02.2013, 01:00:38
to the-quanti...@googlegroups.com

I tried something different tonight. This is on my Linux VM (this time using the native Perl binary, with the PDL package installed via yum install perl-PDL).

Below, matrices are configured as shown for both demo_thr.pl and demo_mce.pl. Both are configured to use 8 workers. Essentially the only difference between the two is that one uses PDL::Parallel::threads::SIMD and the other MCE. Both are using PDL::Parallel::threads and PDL::IO::FastRaw. I really like PDL::Parallel::threads.

   writefraw(sequence($rows,$cols), "$tmp_dir/raw.b");

   my $b = mapfraw "$tmp_dir/raw.b", { ReadOnly => 1 };
   my $a = sequence $cols,$rows;
   my $o = zeroes   $rows,$rows;

   $a->share_as('left_input');
   $b->share_as('right_input');
   $o->share_as('output');


$ time perl demo_thr.pl 4096 (memory consumption 1.4 GB)

   ## test.pl 4096: compute time: 48.156 secs

   ## (0, 0): 93790635294720
   ## (324, 5): 797336174714880
   ## (42, 172): 2.42948503082552e+16
   ## (4095, 4095): 5.76554474219245e+17

   real 0m48.653s
   user 6m11.813s
   sys 0m0.809s


$ time perl demo_mce.pl 4096 (memory consumption 1.3 GB)

   ## matmult_pdl_v.pl 4096: compute time: 44.416 secs

   ## (0,0) 93790635294720  (4095,4095) 5.76554474219245e+17

   real 0m45.195s
   user 5m43.655s
   sys 0m0.741s


The demo_thr.pl example is David's original code (8 workers), with matrices configured as described above and no other changes. The demo_mce.pl example is shown below. What makes the MCE code faster?

1.  The job_delay option staggers workers at initial processing. This helps minimize initial memory fragmentation when workers read the "b" matrix from the raw shared file. It also lets a bit of randomness kick in.

2.  MCE has the sequence option, which enables chunking. The step size is set to 64, so the 4096 rows split into 64 chunks, and each of the 8 workers runs through user_func 8 times (8 workers x 8 chunks x 64 rows = 4096 rows). The larger the matrices, the more noticeable this becomes. Processing in smaller chunks helps the data fit in the CPU L1/L2 cache while a chunk is being processed, letting the engine run at the same sustained rate no matter the size of the matrices. Imagine the worker is a car and the matrix (or data) is the road: however long the road, the car in cruise control does not slow down. Chunking in MCE sustains the engine's rate during processing, especially with larger data sets.

3.  Unrelated to the timing difference: MCE comes with MCE::Signal, which accepts an option to place the $tmp_dir location on /dev/shm, a file system that lives in memory. For demo_thr.pl, I also specified /dev/shm/ for the "b" raw matrix.
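The chunking arithmetic from point 2 can be sketched in plain Perl (no MCE required): MCE's sequence option hands each worker a starting row in steps of the step size, and the worker derives its own [start, stop] row range exactly as user_func does in the script below.

```perl
#!/usr/bin/env perl
# Sketch of the chunk boundaries produced by a sequence of
# [ 0, $rows - 1, $step_size ], mirroring the user_func logic.
use strict;
use warnings;

my $rows      = 4096;
my $step_size = 64;

my @ranges;
for (my $seq_n = 0; $seq_n < $rows; $seq_n += $step_size) {
   my $stop = $seq_n + $step_size - 1;
   $stop = $rows - 1 if $stop >= $rows;     # clamp the final chunk
   push @ranges, [ $seq_n, $stop ];
}

# 64 chunks in total; with 8 workers, each runs user_func 8 times.
printf "chunks: %d  first: [%d,%d]  last: [%d,%d]\n",
   scalar(@ranges), @{$ranges[0]}, @{$ranges[-1]};
# chunks: 64  first: [0,63]  last: [4032,4095]
```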


#!/usr/bin/env perl

##
## Usage:
##    perl demo_mce.pl 1024  ## Default size is 512:  $c = $a x $b
##

use strict;
use warnings;

use Cwd qw( abs_path );
use lib abs_path . "/../../lib";

my $prog_name = $0; $prog_name =~ s{^.*[\\/]}{}g;

use Time::HiRes qw(time);

use PDL;
use PDL::Parallel::threads qw(retrieve_pdls);

use PDL::IO::Storable;                   ## Required for PDL + MCE combo
use PDL::IO::FastRaw;                    ## Required for MMAP IO

use MCE::Signal qw($tmp_dir -use_dev_shm);
use MCE;

my $pdl_version = sprintf("%20s", $PDL::VERSION); $pdl_version =~ s/_.*$//;
my $chk_version = sprintf("%20s", '2.4.11');

if ($^O eq 'MSWin32' && $pdl_version lt $chk_version) {
   print "This script requires PDL 2.4.11 or later for PDL::IO::FastRaw\n";
   print "to work using MMAP IO under the Windows environment.\n";
   exit;
}

###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

my $tam = shift;
   $tam = 512 unless (defined $tam);

unless ($tam > 1) {
   print STDERR "Error: $tam must be an integer greater than 1. Exiting.\n";
   exit 1;
}

my $cols = $tam;
my $rows = $tam;

my $step_size   = 64;
my $max_workers =  8;

my $mce = configure_and_spawn_mce($max_workers);

writefraw(sequence($rows,$cols), "$tmp_dir/raw.b");

my $b = mapfraw "$tmp_dir/raw.b", { ReadOnly => 1 };
my $a = sequence $cols,$rows;
my $o = zeroes   $rows,$rows;

$a->share_as('left_input');
$b->share_as('right_input');
$o->share_as('output');

my $start = time();
$mce->run(0, { sequence => [ 0, $rows - 1, $step_size ] });
my $end = time();

$mce->shutdown();

printf STDERR "\n## $prog_name $tam: compute time: %0.03f secs\n\n",
   $end - $start;

my $dim_1 = $tam - 1;

print "## (0,0) ", $o->at(0,0), "  ($dim_1,$dim_1) ", $o->at($dim_1,$dim_1);
print "\n\n";

###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

sub configure_and_spawn_mce {

   my $max_workers = shift || 8;

   return MCE->new(

      job_delay   => ($tam > 2048) ? 0.031 : undef,
      max_workers => $max_workers,

      user_begin  => sub {
         my ($self) = @_;

         ( $self->{l}, $self->{r}, $self->{o} ) = retrieve_pdls(
            'left_input', 'right_input', 'output'
         );
      },

      user_func   => sub {
         my ($self, $seq_n, $chunk_id) = @_;

         my $l = $self->{l};
         my $r = $self->{r};
         my $o = $self->{o};

         my $start = $seq_n;
         my $stop  = $start + $step_size - 1;

         $stop = $rows - 1 if ($stop >= $rows);

         use PDL::NiceSlice;
         $o(:,$start:$stop) .= $l(:,$start:$stop) x $r;
         no PDL::NiceSlice;

         return;
      }

   )->spawn;
}


PDL + PDL::Parallel::threads + PDL::IO::Storable + PDL::IO::FastRaw + MCE allow for some very powerful applications with lots of flexibility. PDL::IO::Storable is required if passing piddle data around using MCE's "do" method.
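To illustrate the serialization that data crossing a process boundary goes through, here is a plain-Perl Storable round trip (core module, no PDL needed; the data is a made-up example). PDL::IO::Storable exists so that piddles can survive this same freeze/thaw trip.

```perl
#!/usr/bin/env perl
# Storable round trip with ordinary Perl data: freeze produces the
# byte string that would cross the process boundary, thaw rebuilds
# an equivalent structure on the other side.
use strict;
use warnings;
use Storable qw(freeze thaw);

my $job = { rows => [ 0, 63 ], tag => 'chunk-1' };   # hypothetical payload

my $wire = freeze($job);   # serialized form
my $copy = thaw($wire);    # reconstructed copy

print "rows: @{$copy->{rows}}  tag: $copy->{tag}\n";  # rows: 0 63  tag: chunk-1
```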

MCE 1.403 will include 2 new examples, demo_thr.pl and demo_mce.pl, demonstrating these modules working together.

In addition, MCE 1.403 will wrap "sub PDL::CLONE_SKIP { 1 }" in a no warnings 'redefine' block. I will slow down a bit so that I don't miss anything and can put out a complete 1.403 release (matmult-pdl-m.pl was missing from the MANIFEST).


-- mario

Mario Roy

unread,
17.02.2013, 18:48:12
to the-quanti...@googlegroups.com
Hi all,

https://metacpan.org/release/MARIOROY/MCE-1.403

I didn't give up until I was happy with MCE. With that, I'm happy to report the MCE 1.403 release. The strassen multiplication examples were updated; memory consumption dropped by more than 50%. I can now run 8192 for the strassen examples on the same box (with 32 GB memory).

I captured memory consumption this time and updated the README file.

## -- Results for 2048x2048 ---------------------------------------------------
##
## matmult_pdl_b.pl   2048: compute:   21.470 secs   1 worker    0.3% memory
## matmult_pdl_m.pl   2048: compute:    4.706 secs  24 workers   2.7% memory
## matmult_pdl_n.pl   2048: compute:    2.613 secs  24 workers   2.7% memory
## matmult_pdl_o.pl   2048: compute:    2.751 secs  24 workers   3.0% memory
## matmult_pdl_p.pl   2048: compute:    4.313 secs  24 workers   0.9% memory
## matmult_pdl_thr.pl 2048: compute:    4.524 secs  24 workers   0.8% memory
## strassen_pdl_m.pl  2048: compute:    2.522 secs   7 workers   2.7% memory
## strassen_pdl_n.pl  2048: compute:    2.496 secs   7 workers   2.0% memory
##
## matmult_perl_m.pl  2048: compute:  190.302 secs  24 workers   9.7% memory
## strassen_perl_m.pl 2048: compute:  321.655 secs   7 workers   8.6% memory
## strassen_pdl_h.pl  2048: compute:    4.023 secs   4 workers   2.0% memory


## -- Results for 4096x4096 ---------------------------------------------------
##
## matmult_pdl_b.pl   4096: compute:  172.220 secs   1 worker    1.2% memory
## matmult_pdl_m.pl   4096: compute:   34.873 secs  24 workers  10.8% memory
## matmult_pdl_n.pl   4096: compute:   22.941 secs  24 workers  10.8% memory
## matmult_pdl_o.pl   4096: compute:   21.971 secs  24 workers  10.9% memory
## matmult_pdl_p.pl   4096: compute:   34.253 secs  24 workers   1.8% memory
## matmult_pdl_thr.pl 4096: compute:   33.664 secs  24 workers   2.0% memory
## strassen_pdl_m.pl  4096: compute:   14.577 secs   7 workers  10.0% memory
## strassen_pdl_n.pl  4096: compute:   14.384 secs   7 workers   9.3% memory
##
## strassen_pdl_h.pl  4096: compute:   24.608 secs   4 workers   7.8% memory


## -- Results for 8192x8192 ---------------------------------------------------
##
## matmult_pdl_b.pl   8192: compute: 1388.001 secs   1 worker    4.8% memory
## matmult_pdl_m.pl   8192: compute:  275.778 secs  24 workers  45.7% memory
## matmult_pdl_n.pl   8192: compute:  455.516 secs  24 workers  43.2% memory
## matmult_pdl_o.pl   8192: compute:  470.470 secs  24 workers  42.1% memory
## matmult_pdl_p.pl   8192: compute:  269.506 secs  24 workers   5.5% memory
## matmult_pdl_thr.pl 8192: compute:  274.152 secs  24 workers   6.9% memory
## strassen_pdl_m.pl  8192: compute:   95.015 secs   7 workers  40.0% memory
## strassen_pdl_n.pl  8192: compute:   92.477 secs   7 workers  37.2% memory
##
## strassen_pdl_h.pl  8192: compute:  161.786 secs   4 workers  31.6% memory


Look at how little memory is utilized by matmult_pdl_p.pl (MCE) and matmult_pdl_thr.pl (SIMD). These two were faster at 8192 than the other matmult examples; it's the other way around at 4096x4096 and below. Interesting. The README contains the URL if folks want to try matmult_pdl_thr.pl.

The strassen_pdl_h.pl example breaks the job into 2 submissions; the idea is to further reduce memory consumption. Here, MCE reuses the same workers without spawning again for the 2nd run.

That's the best I can do for now. MCE 1.403 is a nice release. I'm taking a break from all of this. :)

Best regards,
Mario

David Mertens

unread,
17.02.2013, 21:39:31
to the-quanti...@googlegroups.com
Fantastic work! You're taking a break for the moment, but some time in the future we should see if we can rope the other Parallel module authors in and race competing implementations. :-)

I am thoroughly impressed with your work. I really thought I'd have you beat with my threaded implementation, but your module clearly is able to do just fine. In truth, I am marvelously impressed by Perl's multithreading capabilities for improving the speed of calculations. On a problem with nontrivial data sharing, we manage to get a 5x speed-up for a somewhat naive parallelization (and even better speed-up for smart parallelization). How many cores do you have on your machine? Eight?

David


--
 
---
You received this message because you are subscribed to the Google Groups "The Quantified Onion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to the-quantified-o...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
 "Debugging is twice as hard as writing the code in the first place.
  Therefore, if you write the code as cleverly as possible, you are,
  by definition, not smart enough to debug it." -- Brian Kernighan

Mario Roy

unread,
18.02.2013, 09:27:40
to the-quanti...@googlegroups.com
Oh yes, should have listed the hardware configuration when posting the results.
The following is taken from the README file.

## Benchmarked under Linux -- RHEL 6.3, Perl 5.10.1, perl-PDL-2.4.7-1.
## System is configured with both Turbo-Boost and Hyper-Threads enabled.
## Hardware is an Intel(R) Xeon(R) CPU E5649 @ 2.53GHz x 2 (24 logical procs).
## The system memory size is 32 GB.


Mario
--
This message was deleted

Mario Roy

unread,
22.02.2013, 02:03:05
to the-quanti...@googlegroups.com
Hello,

Added strassen_pdl_o.pl (double-level parallelization, 7 + 49 workers). First-level workers
can submit data to the 2nd-level workers. It works quite well (either forking or threading).
All workers are spawned right from the start to minimize memory copying of variables.



my (@mce_a, $lvl);

if ($tam > 128) {
   $lvl = 2;  $mce_a[$_] = configure_and_spawn_mce() for (1 .. 7);
   $lvl = 1;  $mce_a[ 0] = configure_and_spawn_mce();
}

...

sub strassen_r {

   ...

   if ($tam <= 128) {
      ins(inplace($c), $a x $b);
      return;
   }
   elsif ($lvl < 2 && defined $mce) {
      strassen($a, $b, $c, $tam, $mce_a[ $mce->wid ]);
      return;
   }

   ...
}



Each worker at the 1st level is able to control an MCE instance (a different process/thread)
at the 2nd level, such as sending data to each of its workers and then running.

My next attempt will combine both levels into one (for 49 workers max). A first shot at this did not
succeed; I will revisit it at a later date. In the meantime, strassen_pdl_o.pl (7 + 49 workers) is
completed and posted in SVN.

-- mario

Mario Roy

unread,
22.02.2013, 02:05:22
to the-quanti...@googlegroups.com
I deleted my post just previous to the one above because the formatting was all messed up,
and I wasn't able to edit the post.

Mario Roy

unread,
22.02.2013, 20:12:17
to the-quanti...@googlegroups.com
Tonight, a new example strassen_pdl_p.pl was posted to SVN.


When I first began with the strassen algorithm, I could not compute 8192x8192 on a 32 GB server using 7 workers due to swapping to disk. Not only can I compute with 7 workers now, but even (7 + 49) strassen_pdl_o.pl and (49) strassen_pdl_p.pl are possible.

## matmult_pdl_b.pl   8192: compute: 1388.001 secs   1 worker    4.8% memory
## matmult_pdl_m.pl   8192: compute:  275.778 secs  24 workers  45.7% memory
## matmult_pdl_n.pl   8192: compute:  455.516 secs  24 workers  43.2% memory
## matmult_pdl_o.pl   8192: compute:  470.470 secs  24 workers  42.1% memory
## matmult_pdl_p.pl   8192: compute:  269.506 secs  24 workers   5.5% memory
## matmult_pdl_thr.pl 8192: compute:  274.152 secs  24 workers   6.9% memory
##
## strassen_pdl_m.pl  8192: compute:   95.015 secs   7 workers  40.0% memory
## strassen_pdl_n.pl  8192: compute:   92.477 secs   7 workers  37.2% memory
## strassen_pdl_o.pl  8192: compute:   79.309 secs  56 workers  83.4% memory
## strassen_pdl_p.pl  8192: compute:   73.246 secs  49 workers  54.3% memory

Look at strassen_pdl_p.pl go. This example will also run on a 24 GB box. It is nearly 19x faster than PDL utilizing a single core on a 24-logical-processor box (dual Intel E5649 at 2.53 GHz, 6 cores each). Please note that PDL running on a single core (nothing else running) may be benefiting from Intel's Turbo Boost.

Will this drop below 50 seconds on a dual Intel E5-2660 (32 logical processors combined)? That will be very interesting.
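For readers wondering where the recurring 7- and 49-worker configurations come from: the Strassen scheme these examples parallelize replaces the 8 block multiplications of ordinary divide-and-conquer with 7 products (hence 7 workers, and 7 x 7 = 49 for two levels). A minimal pure-Perl sketch on scalar 2x2 matrices (the real examples recurse on matrix sub-blocks instead of scalars):

```perl
#!/usr/bin/env perl
# Strassen's 7 products m1..m7 for a 2x2 multiply, combined into the
# four entries of C = A x B. Illustrative scalar inputs only.
use strict;
use warnings;

sub strassen_2x2 {
   my ($a, $b) = @_;     # each argument: [ [a11, a12], [a21, a22] ]
   my ($a11, $a12, $a21, $a22) = ($a->[0][0], $a->[0][1], $a->[1][0], $a->[1][1]);
   my ($b11, $b12, $b21, $b22) = ($b->[0][0], $b->[0][1], $b->[1][0], $b->[1][1]);

   my $m1 = ($a11 + $a22) * ($b11 + $b22);
   my $m2 = ($a21 + $a22) *  $b11;
   my $m3 =  $a11         * ($b12 - $b22);
   my $m4 =  $a22         * ($b21 - $b11);
   my $m5 = ($a11 + $a12) *  $b22;
   my $m6 = ($a21 - $a11) * ($b11 + $b12);
   my $m7 = ($a12 - $a22) * ($b21 + $b22);

   return [ [ $m1 + $m4 - $m5 + $m7,  $m3 + $m5             ],
            [ $m2 + $m4,              $m1 - $m2 + $m3 + $m6 ] ];
}

my $c = strassen_2x2([ [1, 2], [3, 4] ], [ [5, 6], [7, 8] ]);
print "@{$c->[0]} | @{$c->[1]}\n";   # 19 22 | 43 50
```

The 7 independent products are what each of the 7 workers computes in parallel; matmult examples that do not use Strassen simply split rows across all 24 workers instead.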

-- mario

This message was deleted

Mario Roy

unread,
23.02.2013, 17:41:24
to the-quanti...@googlegroups.com

One thing remained: trying David's PDL::Parallel::threads with the strassen examples.


## strassen_pdl_q.pl  8192: compute:   68.299 secs  49 workers  29.6% memory
## strassen_pdl_r.pl  8192: compute:   88.005 secs   7 workers  18.5% memory

On a 32-way box, strassen_pdl_q.pl completes in 35.852 seconds.

-- mario