Welcome Sebastiano Vigna and WebGraph

34 views
Skip to first unread message

John Sichi

unread,
Oct 27, 2020, 11:34:51 PM10/27/20
to jgrapht-dev
Hey all,

Dimitrios and I had some email conversations (copied below) with the creator of WebGraph, fastutil, and many other libraries.  He's interested in collaborating with us on a JGraphT/WebGraph adapter as well as other projects which can help make JGraphT's algorithm library usable for much bigger data sets.

Here's a summary of some of the proposals so far:

1) Add the new adapter in JGraphT (similar to our Guava adapter).  It can either live directly in jgrapht-opt, or in a new jgrapht-webgraph module which depends on jgrapht-opt.  We will either explicitly exclude the spurious Jung dependencies in our pom.xml (at least until there's a webgraph release which discards the Jung adapter).  We do need to wait for the dual-licensed release of WebGraph, since it is currently GPL-only.

2) Widen our Graph interface with support for sets of size greater than 32 bits, support for iterating vertices and edges without materializing an entire result set, and possibly some related forward-looking changes for parallel algorithm support.  This might be done in something like a new BigGraph sub-interface, but probably we can just do it in place in the existing Graph interface via default methods.  The adapter in #1 would implement this BigGraph support.

3) Sweep through our existing algorithms to make them take advantage of the BigGraph support wherever it is straightforward to do so.

For discussions of specifics of the proposals, let's create new jgrapht-dev topics (instead of replying on this thread).  Or for proposals which are ready for code-level discussions, let's do it via github issues and then pull requests.

---------- Forwarded message ---------
From: Sebastiano Vigna <sebastia...@unimi.it>
Date: Sun, Oct 25, 2020 at 7:12 PM
Subject: Re: Contributing to JGrapthT
To: John Sichi <jsi...@gmail.com>
Cc: Dimitrios Michail <dimitrio...@gmail.com>


> On 24 Oct 2020, at 05:51, John Sichi <jsi...@gmail.com> wrote:
>
> Greetings Sebastiano,
>
> With libraries like fastutil and WebGraph, JGraphT can truly stand on the shoulders of giants :)
>
> I took a look at the dependencies for WebGraph, and in general it looks like those should be fine for the jgrapht-opt module.  (As you noted, we try to keep jgrapht-core as light as possible.)  However, I did notice compile-time dependencies on jung-api and jung-io, which are needed for the JungAdapter included by WebGraph.  It probably wouldn't be great for jgrapht-opt to drag in Jung spuriously.

You can exclude Jung from the WebGraph dependencies, as you won't be using it. Would that work for you? I don't think it can do any harm.

But yes, we can also put the Jung adapter aside. Frankly, I even forgot it existed...

> Regarding the license, note that we dual-license under both LGPL and EPL for maximum compatibility.  We don't require any copyright assignment; all contributors retain their own copyright on the portions they contributed (same model as Linux).

That is fine for us. In fact, WebGraph will be soon double-licensed in a similar way (ASL/LGPL) because of its usage in a EU project (the current GPL licensing might be a problem with JGraphT, I believe).

> And on the topic of the BigGraph proposals, if we add something like the iterators you propose, we may want to think about parallel algorithm support at the same time (like the spliterators in Java).

Well, one can have the same approach: have a, say, edgeSetSplitIterator() method with a default implementation that simples uses Guava's Streams.stream(Iterable) to adapt the result of edgeSetIterable(). At that point classes can provide more efficient implementations (WebGraph could, for example). That would be backward-compatible.

But it's a huge difference in that the Iterable extension can improve the efficiency of, I guess, 90% of the algorithm in your codebase with a 1-hour effort overall--it's just a matter of chasing for loops on (incoming, outgoing, etc.) edge sets and replacing the set with the iterable. Implementing facilities for parallel computation is a good idea, but then you have to rewrite the algorithms to take advantage of that feature.

I understood now how the sparse representation works--you actually have a distinct edge index for each edge. That is impossible with WebGraph, as we support graphs with up to 2^64 edges. We could do an adapter <Integer,Long> (by representing the two endpoints in a Long), but I don't know if it would be useful for the Python bindings.

(Now that I think of it, maybe, indeed, given the goal of WebGraph and of the adapter using Endpoint is not a good idea--three objects per arc).

Ciao,

                                        seba

---------- Forwarded message ---------
From: John Sichi <jsi...@gmail.com>
Date: Sat, Oct 24, 2020 at 12:51 PM
Subject: Re: Contributing to JGrapthT
To: Sebastiano Vigna <sebastia...@unimi.it>
Cc: Dimitrios Michail <dimitrio...@gmail.com>


Greetings Sebastiano,

With libraries like fastutil and WebGraph, JGraphT can truly stand on the shoulders of giants :)

I took a look at the dependencies for WebGraph, and in general it looks like those should be fine for the jgrapht-opt module.  (As you noted, we try to keep jgrapht-core as light as possible.)  However, I did notice compile-time dependencies on jung-api and jung-io, which are needed for the JungAdapter included by WebGraph.  It probably wouldn't be great for jgrapht-opt to drag in Jung spuriously.

I can think of a couple of approaches:

a) since you already have a Jung adapter, put the JGraphT->WebGraph adapter next to it in WebGraph  (instead of inside of JGraphT)

or

b) move the Jung adapter out of WebGraph into a separate module, and then have the JGraphT->WebGraph adapter in jgrapht-opt depend only on a Jung-free release of WebGraph

There may be other good ways to solve this.

Regarding the license, note that we dual-license under both LGPL and EPL for maximum compatibility.  We don't require any copyright assignment; all contributors retain their own copyright on the portions they contributed (same model as Linux).

And on the topic of the BigGraph proposals, if we add something like the iterators you propose, we may want to think about parallel algorithm support at the same time (like the spliterators in Java).

Looking forward to working more closely with you!

On Fri, Oct 23, 2020 at 8:53 PM Sebastiano Vigna <sebastia...@unimi.it> wrote:

> On 23 Oct 2020, at 11:44, Dimitrios Michail <dimitrio...@gmail.com> wrote:
>
> thanks a lot for your kind words. Your libraries and research efforts are also very impressive!  I remember listening to a talk by Ian Munro about succinct data structures and wondering if there are any actual implementations of similar results.

Well, there's Sux (both C++ and Java), and Simon Gog's library of succinct data structures. In fact, Facebook is running (graph and index) on the Elias-Fano quasi-succinct representation of monotone sequences (https://github.com/facebook/folly/blob/master/folly/experimental/EliasFanoCoding.h). More precisely, on the improved version with partitioning who won the best paper at one of the last SIGIR.

> Regarding the adapter classes for WebGraph, I would personally be very happy to include them in the library. Since this is not a solo decision, I am also cc-ing my colleague John Sichi whose opinion I value a lot.

Of course!

> Some questions for both of you:
>   - Would this fit into the jgrapht-opt package?

Yes, certainly. You do not want all the WebGraph dependencies in jgraph-core.

>   - For the "big" version is the degree only the limit or do you also have problems with the Sets we are returning?  I was thinking of adding in the future one more
>     `BigGraph` interface where degrees would be longs and all sets would be returned as streams or iterators, thus avoiding such issues.

You read my mind. I would like to propose (see below) a backward-compatible, non-intrusive extension of the JGraphT Graph interface that would address that issue. It is not only a problem with "big" graphs--even a small Wikipedia graph (see the example below) will give you problems if you try to call edgeSet() or run recursive algorithms.

>   - We recently released the python bindings for JGraphT which compiles natively the JGraphT library (as a shared library using GraalVM). Any chance
>     this would also work with the webgraph? The only restriction we have is that the graph implements Graph<Integer, Integer>.

Well, my current implementation uses <Integer, EndpointPair<Integer>>. I do not understand how an <Integer,Integer> parameterization would work--assigning a distinct integer to each edge? Like, iterators would return successors? But then how containsEdge(Integer e) would work?

> Maybe it makes sense to see some of the code first, in order to give you a bit more involved feedback.

I'm including below the standard adapter (the other one is equivalent).

> On a different note, if you feel that we can jointly work on some research oriented problems regarding graph algorithms or graph representations, do not
> hesitate to ask. I am actively looking to expand my research collaborations.

That would be great.

Ciao,

                                        seba



----------------------------------
I would also like to propose a backward-compatible extension of the Graph interface that would make JGraphT applicable to much larger graphs.

The problem we face presently when trying to use the adapter is that JGraphT is built about the idea of materializing all edge sets (entire graph, incoming, outgoing, etc.). While this is not a big issue with dense representations, it becomes an issue with compressed or implicit representations as those materialized edge sets might be orders of magnitude larger than the graph representation. This problem should affect also your sparse representation.

As an example, I tried to compute the strongly connected components of a small Wikipedia graph (≈5M nodes, ≈140M arcs). I needed to eliminate the part of the code creating the components, as it was materializing the whole arc set, and then I needed 8G of RAM because during the recursion a large number of sets of outgoing arcs have to represented in main memory at the same time.

What I suggest is adding for each set method a corresponding Iterable method. Like, edgeSetIterable(), outgoingEdgesOfIterable() (or any other name). At that point compressed or implicit representations can materialize the edges as they're used: using this approach I was able to run strongly connected components using about a GB of RAM, that is, almost ten times less memory: the result was also computed 20% faster (better cache usage and less garbage collection), and in the same time of our WebGraph implementation.

These methods can have default implementation returning the set, which is of course an Iterable, so the change would be entirely backward-compatible. But compressed or implicit representation could provide specialized lazy implementations.

Then, algorithmic classes enumerating incoming or outgoing edges to do something would just have to replace, say, in for loops outgoingArcsOf() with outgoingArcsOfIterable() to enormously reduce their memory footprint at zero cost, and probably performing faster due to the lessened burden on the garbage collector. My guess is that the vast majority of your algorithmic classes, at their core, are based on such enumerations, rather than on accessing randomly sets of arcs.




---------- Forwarded message ---------
From: Dimitrios Michail <dimitrio...@gmail.com>
Date: Fri, Oct 23, 2020 at 6:45 PM
Subject: Re: Contributing to JGrapthT
To: Sebastiano Vigna <sebastia...@unimi.it>
Cc: John Sichi <jsi...@gmail.com>


Dear Seba, 

thanks a lot for your kind words. Your libraries and research efforts are also very impressive!  I remember listening to a talk by Ian Munro about succinct data structures and wondering if there are any actual implementations of similar results. 

Regarding the adapter classes for WebGraph, I would personally be very happy to include them in the library. Since this is not a solo decision, I am also cc-ing my colleague John Sichi whose opinion I value a lot. 

Some questions for both of you: 
  - Would this fit into the jgrapht-opt package? 
  - For the "big" version is the degree only the limit or do you also have problems with the Sets we are returning?  I was thinking of adding in the future one more 
    `BigGraph` interface where degrees would be longs and all sets would be returned as streams or iterators, thus avoiding such issues.
  - We recently released the python bindings for JGraphT which compiles natively the JGraphT library (as a shared library using GraalVM). Any chance 
    this would also work with the webgraph? The only restriction we have is that the graph implements Graph<Integer, Integer>. 

Maybe it makes sense to see some of the code first, in order to give you a bit more involved feedback. 

On a different note, if you feel that we can jointly work on some research oriented problems regarding graph algorithms or graph representations, do not 
hesitate to ask. I am actively looking to expand my research collaborations. 

Kind Regards, 
Dimitrios
    




 

On Thu, Oct 22, 2020 at 9:42 AM Sebastiano Vigna <sebastia...@unimi.it> wrote:
Dear Dimitrios,
first of all let me congratulate with you for JGrapthT--a truly formidable library. I'm also happy you found an application for fastutil's data structures.

I'm writing as I would like to contribute adapter classes for JGraphT that would make it possible to use WebGraph's compressed graph format (http://webgraph.di.unimi.it/). This would make available to JGraphT's user advanced compressed and succinct representation techniques (e.g., at http://law.di.unimi.it/datasets.php you can find large web graph represented using less than a bit per link, and Common Crawl https://commoncrawl.org/ distributes graphs in WebGraph format). The user would be able to compress its data source using WebGraph, and then use the representation in place of the sparse internal representation to analyze graphs that would be too large otherwise.

The present design contains two adapters: one based on Integer nodes for the standard WebGraph version, and a "big" one for the big version of WebGraph (albeit we cannot support graphs with in/outdegrees >= Integer.MAX_INT as JGraphT's methods return an int). We reused Guava's EndpointPair class to represent arcs and edges, and we have 99% covering by unit tests.

Let me know if you're interested--I can send you the two classes, or create a pull request if you prefer so. Aside from, of course, releasing them under the EPL, we will be happy to release the copyright of the classes to you or another JGraphT author if that's your preference.

Ciao,

                                        seba

Sebastiano Vigna

unread,
Oct 28, 2020, 11:57:11 AM10/28/20
to jgrapht-dev
Il giorno mercoledì 28 ottobre 2020 alle 04:34:51 UTC+1 John Sichi ha scritto:

1) Add the new adapter in JGraphT (similar to our Guava adapter).  It can either live directly in jgrapht-opt, or in a new jgrapht-webgraph module which depends on jgrapht-opt.  We will either explicitly exclude the spurious Jung dependencies in our pom.xml (at least until there's a webgraph release which discards the Jung adapter).  We do need to wait for the dual-licensed release of WebGraph, since it is currently GPL-only.

Would this be OK? We need to triple-licence because of the FASTEN EU project, where we will be using WebGrap:

/*

 * Copyright (C) 2003-2020 Paolo Boldi and Sebastiano Vigna

 *

 * This program and the accompanying materials are made available under the

 * terms of the Eclipse Public License 2.0 which is available at

 * http://www.eclipse.org/legal/epl-2.0, or the

 * GNU Lesser General Public License v2.1 or later

 * which is available at

 * http://www.gnu.org/licenses/old-licenses/lgpl-2.1-standalone.html,

 * or the Apache Software License 2.0 which is available at

 * https://www.apache.org/licenses/LICENSE-2.0.

 *

 * This program is distributed in the hope that it will be useful, but

 * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY

 * or FITNESS FOR A PARTICULAR PURPOSE.

 *

 * SPDX-License-Identifier: EPL-2.0 OR LGPL-2.1-or-later OR Apache-2.0

 */


Ciao,

                       seba

Sebastiano Vigna

unread,
Oct 28, 2020, 12:15:59 PM10/28/20
to jgrapht-dev
Actually (sorry for the noise) we would prefer to release WebGraph under a dual license ASF-2.0/LGPL2.1+ , which should make it linkable to JGraphT with both licences. The adapters would be released exactly under JGraphT's licensing scheme instead. Would that work?

seba

John Sichi

unread,
Oct 29, 2020, 2:14:32 AM10/29/20
to jgrapht-dev
Yes, Apache+LGPL should be fine, since Apache and Eclipse are compatible.

Sebastiano Vigna

unread,
Oct 29, 2020, 2:02:56 PM10/29/20
to jgrapht-dev
So, just for the record I built a sparse JGraphT representation for our dewiki-2013 graph (playing dirty with the constructor, as materializing the edge list is not possible on my laptop), and serialized is about 573M. In WebGraph BVGraph format is about 110M. A toy web graph (the eu domain in 2005) is 300M in the sparse representation, 17M with WebGraph.

On the dewiki-2013 graph, strongly connected components (Gabow) took 10s with the sparse representation and 15s with WebGraph. Recompressing the graph using EFGraph brought down the time to 10s, with the graph occupying about 160M. So I think there are interesting tradeoffs: EFGraph is probably a smaller-footprint easy substitute for sparse (with the added bonus of constant-time adjacency), whereas BVGraph can shrink the space significantly

BTW, with the new methods it is trivial to write a space-efficient constructor for the sparse representation.

Ciao,

seba
Reply all
Reply to author
Forward
0 new messages