Problems on running GPS

Yi Lu

Mar 19, 2014, 11:23:42 AM
to stanford...@googlegroups.com
Hi Everyone,

Thanks for developing such an amazing system. I have some problems when using the GPS system.

1) Does this system support multithreading? I have a cluster of machines. Do I need to run multiple GPS instances (processes) on each machine?
2) When I run multiple processes on each machine, I run into a new problem. The slaves file in master-scripts looks like this:

worker1
worker1
worker1
worker1
....
....
worker9
worker9
worker9
worker9

The cfg file in HDFS looks like this:

-1 master 5999
0 worker1 6000
1 worker1 6001
2 worker1 6002
3 worker1 6003
....
....
32 worker9 6032
33 worker9 6033
34 worker9 6034
35 worker9 6035
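
Both listings above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal Python sketch, assuming (as in the listings) that the master has id -1 on port 5999 and that worker instance i listens on port 6000 + i. The function name `build_gps_files` is hypothetical and not part of GPS.

```python
def build_gps_files(hosts, instances_per_host, master="master",
                    master_port=5999, base_port=6000):
    """Return (slaves_text, cfg_text) for the layout shown in the post.

    Assumption: instance ids are assigned host by host, and the port of
    instance i is base_port + i, mirroring the example listing.
    """
    slaves_lines = []
    cfg_lines = [f"-1 {master} {master_port}"]
    instance_id = 0
    for host in hosts:
        for _ in range(instances_per_host):
            # One slaves entry and one cfg entry per GPS instance.
            slaves_lines.append(host)
            cfg_lines.append(f"{instance_id} {host} {base_port + instance_id}")
            instance_id += 1
    return "\n".join(slaves_lines) + "\n", "\n".join(cfg_lines) + "\n"


hosts = [f"worker{i}" for i in range(1, 10)]  # worker1 .. worker9
slaves_text, cfg_text = build_gps_files(hosts, instances_per_host=4)
print(cfg_text)
```

With nine hosts and four instances per host this yields 36 worker entries, ids 0 through 35, matching the cfg file quoted above.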

Is that right? If it is, then when I start GPS, some processes hang in DOING_INITIAL_VERTEX_PARTITIONING and never reach READY_TO_DO_COMPUTATION, so the computation cannot continue. How can I fix this?

3) I wrote an algorithm in GPS. In the first superstep, some vertices send messages to their neighbors and all vertices vote to halt. I expected that in the second superstep, the vertices that received messages in the first superstep would be activated. However, it seems none are activated and the job finishes. Is this the design of GPS, or is my implementation wrong?

Thanks for your time.

Yi Lu

Mar 19, 2014, 11:37:23 AM
to stanford...@googlegroups.com
For problem 2), the status log looks like the following and then never changes.

Id  Host     Latest Status                                      Latest Status Receive Time     Connection Establishment Time
0   worker1  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:49.783  Wed, 19 Mar 2014 23:35:19.184
1   worker1  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:49.934  Wed, 19 Mar 2014 23:35:20.259
2   worker1  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:50.432  Wed, 19 Mar 2014 23:35:20.261
3   worker1  ready_to_do_computation superstepNo: -1            Wed, 19 Mar 2014 23:35:52.039  Wed, 19 Mar 2014 23:35:20.265
4   worker2  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:49.882  Wed, 19 Mar 2014 23:35:21.277
5   worker2  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:49.858  Wed, 19 Mar 2014 23:35:21.281
6   worker2  ready_to_do_computation superstepNo: -1            Wed, 19 Mar 2014 23:35:52.055  Wed, 19 Mar 2014 23:35:22.288
7   worker2  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:50.686  Wed, 19 Mar 2014 23:35:22.292
8   worker3  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:49.905  Wed, 19 Mar 2014 23:35:23.299
9   worker3  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:49.754  Wed, 19 Mar 2014 23:35:24.306
10  worker3  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:50.434  Wed, 19 Mar 2014 23:35:24.309
11  worker3  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:49.271  Wed, 19 Mar 2014 23:35:24.311
12  worker4  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:49.989  Wed, 19 Mar 2014 23:35:26.320
13  worker4  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:49.751  Wed, 19 Mar 2014 23:35:26.323
14  worker4  doing_initial_vertex_partitioning superstepNo: -1  Wed, 19 Mar 2014 23:35:50.507  Wed, 19 Mar 2014 23:35:26.325
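
With many instances it can help to group the status rows by state to see at a glance which instances are stuck. The following is a small Python sketch, assuming each data row starts with whitespace-separated `<id> <host> <status>` fields as in the listing above. The function name `summarize_status` is hypothetical, and the sample text is copied from two rows of the log.

```python
def summarize_status(log_text):
    """Group (instance id, host) pairs by their reported status.

    Assumption: each data row starts with "<id> <host> <status> ...",
    separated by whitespace, as in the status listing above.
    """
    by_status = {}
    for line in log_text.splitlines():
        fields = line.split()
        # Skip the header row and any blank or malformed lines.
        if len(fields) < 3 or not fields[0].lstrip("-").isdigit():
            continue
        instance_id, host, status = int(fields[0]), fields[1], fields[2]
        by_status.setdefault(status, []).append((instance_id, host))
    return by_status


sample = """\
Id Host Latest Status Latest Status Receive Time Connection Establishment Time
2 worker1 doing_initial_vertex_partitioning superstepNo: -1 Wed, 19 Mar 2014 23:35:50.432 Wed, 19 Mar 2014 23:35:20.261
3 worker1 ready_to_do_computation superstepNo: -1 Wed, 19 Mar 2014 23:35:52.039 Wed, 19 Mar 2014 23:35:20.265
"""

for status, instances in summarize_status(sample).items():
    print(status, instances)
```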

Semih Salihoglu

Mar 19, 2014, 11:45:38 AM
to Yi Lu, stanford...@googlegroups.com
Hi Yi Lu,

Comments inline.


On Wednesday, March 19, 2014, Yi Lu <luyi...@gmail.com> wrote:
> Hi Everyone,
> Thanks for developing such an amazing system. I have some problems when using the GPS system.
> 1) Does this system support multithreading? I have a cluster of machines. Do I need to run multiple GPS instances (processes) on each machine?
You can run as many instances as you want on any machine. I recommend running (number of cores on the machine)/6 instances per machine.
If the instances are in the DOING_INITIAL_VERTEX_PARTITIONING state, that means they connected successfully. The problem is most likely that one instance is running out of memory or something similar. So I recommend trying one instance per machine and giving more memory to your instances. You do that by changing the xms and xmx flags in the start_gps_nodes.sh script.
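
As a rough illustration of the sizing advice above, the following Python sketch applies the cores/6 rule of thumb and checks whether the combined heap of all instances fits in a machine's RAM. It is only a back-of-the-envelope check under stated assumptions: the function name `plan_instances` is hypothetical, the 24-core, 48000 MB machine is a made-up example, and JVM memory beyond the heap is not modelled.

```python
def plan_instances(cores, ram_mb, xmx_mb, cores_per_instance=6):
    """Apply the cores/6 rule of thumb and check the heap budget.

    Assumption: the heap (-Xmx) is the dominant memory cost per instance;
    JVM overhead beyond the heap is not modelled here.
    Returns (instances, total_heap_mb, fits_in_ram).
    """
    instances = max(1, cores // cores_per_instance)
    total_heap_mb = instances * xmx_mb
    return instances, total_heap_mb, total_heap_mb <= ram_mb


# Hypothetical machine: 24 cores, 48000 MB RAM, 8000 MB heap per instance.
instances, total_heap_mb, fits = plan_instances(24, 48000, 8000)
print(instances, total_heap_mb, fits)
```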

> 3) I wrote an algorithm in GPS. In the first superstep, some vertices send messages to their neighbors and all vertices vote to halt. I expected that in the second superstep, the vertices that received messages in the first superstep would be activated. However, it seems none are activated and the job finishes. Is this the design of GPS, or is my implementation wrong?
This sounds like a problem in your implementation. Be sure that not every vertex votes to halt in the same superstep. For example, make sure the vertex that sent the message does not vote to halt.
> Thanks for your time.

Yi Lu

Mar 19, 2014, 11:58:50 AM
to stanford...@googlegroups.com, Yi Lu, se...@stanford.edu
Thanks for your quick reply. I appreciate it very much. I have some further questions.

1) My graph is very small, say 50MB, and I set xms and xmx to 2000M. Each machine has 48GB of memory. Will running 4 processes together still be a problem?

3) Imagine I have only two vertices (0 and 1). In the first superstep, 0 sends a message to 1 and 1 sends a message to 0, then both vote to halt. I think that in the original Pregel paper, both vertices would be reactivated because they received messages. Will GPS terminate the computation? Is there a difference between Pregel and GPS here? Thanks very much.

Yi Lu

Mar 19, 2014, 12:11:53 PM
to stanford...@googlegroups.com, Yi Lu, se...@stanford.edu
The output of free -m is as follows:

             total       used       free     shared    buffers     cached
Mem:         48218       4745      43473          0         18        537
-/+ buffers/cache:       4189      44029
Swap:        20479          0      20479



Arash Jalal Zadeh Fard

Mar 19, 2014, 1:29:20 PM
to Yi Lu, stanford...@googlegroups.com, se...@stanford.edu

Hi Yi,


I used to run several GPS instances on the same node, and my config and slaves files were similar to yours. I'd like to remind you that you need to set the Xmx size in the script that runs the GPS nodes in order to set the memory for each JVM.


The termination condition might be slightly different from Pregel's. Whenever all the vertices vote to halt, the algorithm terminates, regardless of any message that would be delivered in the next superstep and activate another vertex. To work around this, the vertex that sends a message should not vote to halt in the same superstep. You then need to add another condition to your if-else ladder so that any vertex with no task and no incoming messages votes to halt.
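
The termination rule described above can be illustrated with a toy simulation of the two-vertex case from the earlier question. This is a Python sketch of the behavior as described in this thread, not the GPS API and not checked against the GPS source. The function name `run_two_vertex_job` and its parameter are hypothetical.

```python
def run_two_vertex_job(sender_votes_to_halt, max_supersteps=10):
    """Toy model: vertices 0 and 1 message each other in superstep 1.

    Assumed rule (from the thread): the job stops as soon as every vertex
    votes to halt in the same superstep, even if messages are in flight.
    Returns (supersteps_executed, messages_processed).
    """
    inbox = {0: [], 1: []}
    processed = 0
    for superstep in range(1, max_supersteps + 1):
        outbox = {0: [], 1: []}
        votes = []
        for v in (0, 1):
            processed += len(inbox[v])  # consume messages delivered to v
            if superstep == 1:
                outbox[1 - v].append(f"hello from {v}")
                votes.append(sender_votes_to_halt)
            else:
                # No task left and messages consumed: vote to halt.
                votes.append(True)
        inbox = outbox  # messages become visible in the next superstep
        if all(votes):
            return superstep, processed
    return max_supersteps, processed


print(run_two_vertex_job(sender_votes_to_halt=True))
print(run_two_vertex_job(sender_votes_to_halt=False))
```

In this model, when both senders vote to halt in superstep 1 the job ends after one superstep and the two messages are never processed. When the senders stay active, a second superstep runs and both messages are consumed before the vertices halt.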


Semih, 

Your reply implies that each GPS instance creates 6 processes/threads. Can you please clarify the number of processes and threads in each GPS instance?


Thanks,

Arash




Yi Lu

Mar 19, 2014, 1:43:34 PM
to stanford...@googlegroups.com, Yi Lu, se...@stanford.edu, ar...@uga.edu
Thanks for your great help. 
Do I need to set lines 21-22 and lines 25-26 together? Previously, I only set lines 21 and 22, and it hung there. Now it works.

# Set the XMS_SIZE and XMX_SIZE properties according to the RAM in the machines of your cluster.
 21 XMS_SIZE=8000M
 22 XMX_SIZE=8000M
 23 OUTPUT_FILE_NAME=${4}-output-${2}-of-${3}
 24 if [ $2 -eq -1 ]; then
 25     XMS_SIZE=8000M
 26     XMX_SIZE=8000M
 27     OUTPUT_FILE_NAME=${4}-machine-stats
 28 fi

Arash Jalal Zadeh Fard

Mar 19, 2014, 1:53:38 PM
to Yi Lu, stanford...@googlegroups.com, se...@stanford.edu

As you know, XMS sets the initial heap size and XMX sets the maximum heap size, so they can be different. AFAIK, lines 25 and 26 set the JVM memory for the master and lines 21 and 22 set it for the workers. Usually, you need more memory for the workers. Semih, please correct me if I am wrong.


Arash

