Data structures using protocol buffers

GDR

unread,

Oct 21, 2008, 5:02:28 PM10/21/08

to Protocol Buffers

Hi,

I'm wondering how would one go about implementing self-referential
data structures? As an exercise, I tried to implement a PB version of
the adjacency list representation of a graph. I'm having a hard time
getting it work. Any suggestions?

Thanks!

--- graph.proto ---

package graph;

option java_package = "graph";
option java_outer_classname = "UndirectedGraph";
option optimize_for = CODE_SIZE;

message UndirectedGraphNode {
required string id = 0;
repeated UndirectedGraphNode neighbors;
}

message UndirectedGraph {
repeated UndirectedGraphNode nodes;
}

Jeremy Leader

unread,

Oct 21, 2008, 6:37:17 PM10/21/08

to GDR, Protocol Buffers

Keep in mind that protobufs describe serialized data, and there's no
concept of an object reference like Java uses. In your example, if A
and B are neighbors, then in your proto, the data representing A
contains the data representing B, and the data representing B contains
the data representing A!

One way around this is to implement your own form of references, perhaps
using the node ids like this:

package graph;

option java_package = "graph";
option java_outer_classname = "UndirectedGraph";
option optimize_for = CODE_SIZE;

message UndirectedGraphNodeReference {

required string id = 0;
}

message UndirectedGraphNode {
required string id = 0;

repeated UndirectedGraphNodeReference neighbors;
}

message UndirectedGraph {
repeated UndirectedGraphNode nodes;
}

--
Jeremy Leader
jle...@oversee.net

GDR

unread,

Oct 22, 2008, 2:09:34 PM10/22/08

to Protocol Buffers

Thanks Jeremy. That worked!
But we now have information about the same node being replicated. For
instance, let's say we have a field 'weight' attached to each node as
shown below. This setup will replicate the weight information of a
node as many times as its degree. If the weight of a node changes, I
will have update all it's occurrences in the PB. Any way I can avoid
it?

package graph;

option java_package = "graph";

option java_outer_classname = "UndirectedGraphType";

option optimize_for = CODE_SIZE;

message UndirectedGraphNodeReference {

required string id = 1;
required double weight = 2;
}

message UndirectedGraphNode {
required string id = 1;
repeated UndirectedGraphNodeReference neighbors = 2;
}

message UndirectedGraph {
repeated UndirectedGraphNode nodes = 1;

}

On Oct 21, 6:37 pm, Jeremy Leader <jlea...@oversee.net> wrote:
> Keep in mind that protobufs describe serialized data, and there's no
> concept of an object reference like Java uses. In your example, if A
> and B are neighbors, then in your proto, the data representing A
> contains the data representing B, and the data representing B contains
> the data representing A!
>
> One way around this is to implement your own form of references, perhaps
> using the node ids like this:
>
> package graph;
>
> option java_package = "graph";
> option java_outer_classname = "UndirectedGraph";
> option optimize_for = CODE_SIZE;
>
> message UndirectedGraphNodeReference {
> required string id = 0;
>
> }
>
> message UndirectedGraphNode {
> required string id = 0;
> repeated UndirectedGraphNodeReference neighbors;
>
> }
>
> message UndirectedGraph {
> repeated UndirectedGraphNode nodes;
>
> }
>
> --
> Jeremy Leader

> jlea...@oversee.net

Jeremy Leader

unread,

Oct 22, 2008, 2:14:38 PM10/22/08

to GDR, Protocol Buffers

I was assuming all the properties of a node (weight, label, color,
whatever) would be in UndirectedGraphNode; UndirectedGraphNodeReference
would only have the id and nothing else.

--
Jeremy Leader
jle...@oversee.net

GDR

unread,

Oct 22, 2008, 2:36:03 PM10/22/08

to Protocol Buffers

That does solve the duplicate information problem but it makes updates
to node attributes (like weight) difficult. Let's say, I want to
assign the weight of each node to the average of its neighbors.

for(UndirectedGraphNode node : UndirectedGraph.getNodesList() ) {
double sum = 0;
int count = 0;
for(UndirectedGraphNodeReference neighbor :
node.getNeighborsList() ) {
sum += ????
count++;
}
node.setWeight(sum/count);
}

----- graph.proto -----

package graph;

option java_package = "graph";
option java_outer_classname = "UndirectedGraphType";
option optimize_for = CODE_SIZE;

message UndirectedGraphNodeReference {
required string id = 1;
required double weight = 2;
}

message UndirectedGraphNode {
required string id = 1;
repeated UndirectedGraphNodeReference neighbors = 2;
}

message UndirectedGraph {
repeated UndirectedGraphNode nodes = 1;
}

On Oct 22, 2:14 pm, Jeremy Leader <jlea...@oversee.net> wrote:
> I was assuming all the properties of a node (weight, label, color,
> whatever) would be in UndirectedGraphNode; UndirectedGraphNodeReference
> would only have the id and nothing else.
>
> --
> Jeremy Leader

> jlea...@oversee.net

GDR

unread,

Oct 22, 2008, 2:38:15 PM10/22/08

to Protocol Buffers

my bad. The code snippet should be as follows:

for(UndirectedGraphNode node : UndirectedGraph.getNodesList() ) {
double sum = 0;
int count = 0;
for(UndirectedGraphNodeReference neighbor :
node.getNeighborsList() ) {
sum += ????
count++;
}
node.setWeight(sum/count);

}

----- graph.proto -----

package graph;

option java_package = "graph";
option java_outer_classname = "UndirectedGraphType";
option optimize_for = CODE_SIZE;

message UndirectedGraphNodeReference {
required string id = 1;
}

message UndirectedGraphNode {
required string id = 1;
required double weight = 2;
repeated UndirectedGraphNodeReference neighbors = 2;
}

message UndirectedGraph {
repeated UndirectedGraphNode nodes = 1;
}

Jeremy Leader

unread,

Oct 22, 2008, 2:48:45 PM10/22/08

to GDR, Protocol Buffers

Protocol Buffers are a serialization format, rather than general-purpose
data structures. To do computations, you'd probably want to build some
auxiliary data structures, which you populate when you deserialize the
protobuf data. You could have node objects that resemble your original
.proto file, where nodes have references to their neighbors, and you'd
probably need a map from node id to node object reference, which you'd
use during deserialization.

--
Jeremy Leader
jle...@oversee.net

GDR

unread,

Oct 23, 2008, 3:12:55 AM10/23/08

to Protocol Buffers

Thanks. I am trying to use protocol buffers with a map reduce
framework to work on large graphs. The graphs are too big to fit in
memory so building auxiliary data structures is difficult. Any ideas?

On Oct 22, 2:48 pm, Jeremy Leader <jlea...@oversee.net> wrote:
> Protocol Buffers are a serialization format, rather than general-purpose
> data structures. To do computations, you'd probably want to build some
> auxiliary data structures, which you populate when you deserialize the
> protobuf data. You could have node objects that resemble your original
> .proto file, where nodes have references to their neighbors, and you'd
> probably need a map from node id to node object reference, which you'd
> use during deserialization.
>
> --
> Jeremy Leader

> jlea...@oversee.net

Dave Bailey

unread,

Oct 23, 2008, 4:44:33 PM10/23/08

to Protocol Buffers

Sounds like fun. Just curious what is actually supposed to be encoded
in the protobuf message, if the graph is too big to fit in memory?
Are you sending just a single node, with its edges? It seems like you
wouldn't be able to avoid scenarios where some of the edges in your
message point to nodes that are not also serialized in your message,
given that you can't actually store the entire graph in memory, let
alone a protobuf message? What graph operations are you looking to
parallelize?

-dave

GDR

unread,

Oct 24, 2008, 12:58:06 PM10/24/08

to Protocol Buffers

Abstractly, a lot of graph algorithms fall into this paradigm:

foreach node in the graph
neighbors = getNeighbors(graph, node);
doSomething(node, neighbors);
end

I am trying to parallelize this operation for large graphs with
millions of nodes. So, one approach would be to partition the graph
nodes and have a mapper operate on each partition. The representation
I'm seeking is one which will allow me to access the neighbors of a
node and read/write to their attributes.

Reply all

Reply to author

Forward