How long will twitterlarge.txt run for?


Keerthikanth Kodali

May 2, 2011, 4:17:02 PM
to cornell-c...@googlegroups.com
How long will twitterlarge.txt run for?

Keerthikanth Kodali

May 2, 2011, 4:19:41 PM
to cornell-c...@googlegroups.com
How long will twitterlarge.txt run for using an adjacency list?

Joseph Manganelli

May 2, 2011, 5:26:21 PM
to cs2110-sp11
On my computer, TwitterLarge takes:
1) 18 seconds to load the text file into a graph
2) 2 seconds for connectedComponents()
3) 3 seconds for maxst()

These times are for AdjListGraph. Note: I uphold a strict definition
of the adjacency list. I do not include a list of incoming edges for
each vertex. Instead, when an algorithm requires the reverse list, it
computes it in O(|V|+|E|) time, and then the list gets tossed out.

This becomes somewhat problematic for computing the in-degree
distribution, because computing the in-degree of each vertex takes
O(|V|+|E|) on its own (since I keep no "reverse" information in the
adjacency list). So I had to write a separate algorithm that builds
the reverse list once and reuses it while iterating over in-degrees.
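
(For illustration only: a minimal sketch of that one-pass in-degree
computation, assuming the adjacency list is stored as a map from each
vertex to its outgoing neighbors. The names and shape here are
hypothetical, not Joseph's actual implementation.)

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InDegreeSketch {
    // Compute every vertex's in-degree in one O(|V|+|E|) pass over the
    // outgoing lists, instead of paying O(|V|+|E|) per vertex.
    static Map<Integer, Integer> inDegrees(Map<Integer, List<Integer>> adj) {
        Map<Integer, Integer> in = new HashMap<Integer, Integer>();
        for (Integer v : adj.keySet()) {
            in.put(v, 0);                        // default for vertices with no incoming edges
        }
        for (List<Integer> neighbors : adj.values()) {
            for (Integer w : neighbors) {
                Integer c = in.get(w);
                in.put(w, (c == null ? 0 : c) + 1);  // one increment per edge
            }
        }
        return in;
    }
}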

Keerthikanth Kodali

May 2, 2011, 7:18:44 PM
to cornell-c...@googlegroups.com
Thanks!! My code runs forever on the large sets but works very well on the smaller data sets.
This is why I don't believe in the principle of mathematical induction... hahaha

On Mon, May 2, 2011 at 5:21 PM, Joseph Manganelli <jtm...@gmail.com> wrote:
> On my computer, TwitterLarge takes:
>
> 1) 18 seconds to load the text file into a graph
> 2) 2 seconds for connectedComponents()
> 3) 3 seconds for maxst()
>
> These times are for AdjListGraph. Note: I uphold a strict definition
> of the adjacency list. I do not include a list of incoming edges for
> each vertex. Instead, when an algorithm requires the reverse list, it
> computes it in O(|V|+|E|) time, and then the list gets tossed out.

Gregory Hill

May 2, 2011, 11:57:16 PM
to cs2110-sp11
Wait, how did you get it to run in 18 seconds? Did you use
scn.nextInt() and addVertex()/addEdge(), or did you do something like
reading in a whole array at a time?

Gregory Hill

May 2, 2011, 11:57:47 PM
to cs2110-sp11
Mine gets a heap overflow after 3 minutes with twitterLarge.

Nishant George

May 3, 2011, 12:02:12 AM
to cornell-c...@googlegroups.com
@Joe, I don't know how you got it to run that fast!

Did you use hash maps to store the vertex list and/or the neighbor lists for each vertex? I'm not getting anywhere with the large datasets; even just loading them, I wait for minutes.

Any speed-enhancing recommendations anybody can make?
--
Nishant George
Cornell University '11
Applied & Engineering Physics

Andrew Casey

May 3, 2011, 12:07:36 AM
to cornell-c...@googlegroups.com
Mine runs in 22 seconds, including keeping a list of outgoing and incoming edges for each Vertex. Try looking at the data structures you're using: not all data structures have constant insert time, so as you add more and more items, each insert takes longer. Hope this helps.
--
Andrew Casey
Class of 2014
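
(To illustrate Andrew's point, a hedged sketch of an adjacency list
built on structures with amortized constant-time insert: a HashMap for
the vertex table and an ArrayList per neighbor list. The class and
method names are invented for this example, not anyone's submission.)

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AdjListSketch {
    // vertex id -> outgoing edges; HashMap insert/lookup is expected O(1)
    private final Map<Integer, List<int[]>> adj =
            new HashMap<Integer, List<int[]>>();

    void addEdge(int from, int to, int weight) {
        List<int[]> out = adj.get(from);
        if (out == null) {
            out = new ArrayList<int[]>();   // ArrayList append is amortized O(1)
            adj.put(from, out);
        }
        out.add(new int[] { to, weight }); // one {to, weight} pair per edge
    }
}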

Joseph Manganelli

May 3, 2011, 12:36:19 AM
to cs2110-sp11
@Nishant

It could be your file-reading procedure. I can't really disclose my
implementation of AdjListGraph, but I will tell you I'm not doing
anything tricky. However, since I'm a nice person, I will post some
pseudo-code that may help you load your graph from the file more
quickly. (It should be noted that posting the following code does not
violate academic integrity, because we are in no way required to
implement a file-loading method.)

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Scanner;

// Slurp the whole file into memory in 8KB chunks; this is much faster
// than letting a Scanner pull directly from the file stream.
BufferedReader b = new BufferedReader(new FileReader(fileName));
char[] inBuf = new char[8192];
StringBuffer fileData = new StringBuffer(1000);
int numRead;
while ((numRead = b.read(inBuf)) != -1) {
    fileData.append(inBuf, 0, numRead);
}
b.close();

// Parse the big in-memory string; each edge is "from to weight".
Scanner s = new Scanner(fileData.toString());
while (s.hasNext()) {
    int from = Integer.parseInt(s.next());
    int to = Integer.parseInt(s.next());
    int weight = Integer.parseInt(s.next());
    addEdge(from, to, weight);
}


This method improves I/O efficiency by reading 8KB at a time from the
hard disk into a string buffer, then uses a Scanner to parse the
resulting (very large) string.

Keerthikanth Kodali

May 3, 2011, 12:51:12 AM
to cornell-c...@googlegroups.com
Hi,

When I ran it with that code, I could certainly see that it was faster, but I still ran out of heap space.
How about you, Greg?

Thanks!

Joseph Manganelli

May 3, 2011, 1:07:15 AM
to cs2110-sp11
It depends on your computer AND your adjacency list implementation. My
AdjListGraph requires about 1.25 GB of memory to load the large files.
If you do not have the available RAM, or your max heap space in the
Java Runtime Environment is set below the required amount, you see
that heap-space exception. I wouldn't worry too much about running the
large files; I heard they will be using much smaller files to test our
code. Moreover, a minimal bit representation of the large files in an
adjacency matrix would require 8+ GB of RAM (assuming a |V|^2 matrix
representation).

Robert Escriva

May 3, 2011, 5:45:23 AM
to cornell-c...@googlegroups.com
We will still be using large files, so do not assume that the files we
grade with will be significantly smaller.

If you are having problems with the large files, look at the overheads
of your adjacency list structure. 1.25 GB sounds about right, but it
might be open to improvement, depending upon the implementation.
Anyone who can do less than 960MB for Twitter is probably pulling some
tricks out of Google's hat.

-Robert

Robert Escriva

May 3, 2011, 6:18:35 AM
to cornell-c...@googlegroups.com
On Mon, May 02, 2011 at 04:17:02PM -0400, Keerthikanth Kodali wrote:
> How long will twitterlarge.txt run for?

Other people have posted numbers regarding their algorithms'
performance. Please note that "TwitterLarge.txt" is just a text file,
and therefore implies no runtime.

-Robert

Joseph Manganelli

May 3, 2011, 12:13:45 PM
to cs2110-sp11
@Robert

We are not expected to fit the large graphs in the adjacency matrix,
are we?

(unless you test on computers with > 8GB of memory)...

Gregory Hill

May 3, 2011, 12:59:11 PM
to cs2110-sp11
How does one see how large the overhead is while the program is
running?


Christina Brandt

May 3, 2011, 1:11:04 PM
to cornell-c...@googlegroups.com
I suggest jconsole.
Look in your Java SDK's bin directory and run jconsole (you need to start it while your program is actually running).
A typical path:
C:\Program Files\Java\jdk1.6.0_22\bin\jconsole.exe
Cheers,
Christie
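
(If you want a quick readout from inside the program instead of
attaching jconsole, one option is Java's Runtime API; a small sketch
you could print wherever convenient in your own code:)

// Rough heap snapshot from inside the running program.
Runtime rt = Runtime.getRuntime();
long usedMB = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
long maxMB = rt.maxMemory() / (1024 * 1024);
System.out.println("heap used: " + usedMB + " MB (max " + maxMB + " MB)");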

Joshua Hagins

May 3, 2011, 3:44:30 PM
to cornell-c...@googlegroups.com
Joseph, I used your exact file reading procedure, and almost immediately ran out of Java heap space. How are you handling all of this memory? My machine has 8GB of RAM, so that shouldn't be an issue...

Nishant George

May 3, 2011, 3:59:09 PM
to cornell-c...@googlegroups.com
I think what Joshua is asking (and I would also like to know) is: how can we allocate Eclipse more heap space to work with? I'm also running out of heap space.

Joseph Manganelli

May 3, 2011, 5:51:04 PM
to cs2110-sp11
@Joshua and Nishant

I did not modify the max heap space of my JRE in Eclipse. It may be
that mine is set high by default (because I use Java EE), or it could
be that the object structure that holds your graph data is
"inefficient". Also note that if you're trying to load the "Large"
files into an adjacency matrix, it will run out of heap space on any
"good" computer (they require about 8GB of free memory to hold the
entire matrix). So you should use the adjacency list implementation to
load the "Large" graphs.

I can't really disclose my data structure for AdjListGraph, but I will
say that it's not too complicated, and it doesn't waste space.

If you want to increase your max heap space in the JRE for Eclipse,
Google is a good place to start, but I don't know how to do it off the
top of my head :P

Christina Brandt

May 3, 2011, 6:01:53 PM
to cornell-c...@googlegroups.com
Robert or Nikos can give the final word on this, but it is probably reasonable if running your implementation on TwitterLarge requires heap space of 1GB or so.

HOW TO CHANGE YOUR HEAP SPACE IN ECLIPSE:

Go to Run -> Run Configurations -> Arguments tab. Then, in the VM arguments, enter a memory requirement; for example,

-Xmx512m -Xms512m

sets a max heap size of 512MB.

You CAN implement the assignment so that all of the datasets, including TwitterLarge, work with heap space of 512MB.  It is probably also fine to require a heap space of 1GB; Robert or Nikos can give the final word on maximum space requirements.

Christie

Joseph Manganelli

May 3, 2011, 6:30:58 PM
to cs2110-sp11
...but not in adjMatrixGraph, because:

LiveJournalLarge has 264552 vertices.
A non-condensed matrix for this graph is 264552 * 264552 = ~70 billion
values.
If each value is represented as a single bit of data, then 70 billion
bits = ~8.15 GB.

It was specifically stated in several posts that we need to keep
references to "empty slots" to reflect the run times of a true
adjacency matrix. If you "condensed" the adjacency matrix, it would
essentially perform as well as the adjacency list implementation.
Please correct me if I'm wrong.
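
(As a quick sanity check on that arithmetic, at one bit per matrix
cell:)

long v = 264552L;
long bits = v * v;                              // ~7.0e10 cells, one bit each
double gb = bits / 8.0 / (1024 * 1024 * 1024);  // bits -> bytes -> GB
System.out.println(gb);                         // prints roughly 8.15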

Christina Brandt

May 3, 2011, 6:50:21 PM
to cornell-c...@googlegroups.com
The assignment specifically says that you are not required to be able to load the large graphs into memory as an adjacency matrix.

Nikos Karampatziakis

May 3, 2011, 8:38:26 PM
to cornell-c...@googlegroups.com
Right. Running everything in 512MB is doable, running everything in
2*512MB is acceptable, and running anything in > 3*512MB is considered
bad style.

-Nikos

Nikos Karampatziakis

May 3, 2011, 9:19:37 PM
to cornell-c...@googlegroups.com
On Tue, May 3, 2011 at 8:38 PM, Nikos Karampatziakis <n...@cs.cornell.edu> wrote:
> Right. Running everything in 512MB is doable, running everything in
> 2*512MB is acceptable, and running anything in > 3*512MB is considered
> bad style.
>

Let me add that the measurements above come from 32-bit JVMs. If you
are using a 64-bit JVM, my back-of-the-envelope calculations suggest
that 512MB should be replaced with 716MB.

-Nikos

Peter

May 4, 2011, 12:45:03 PM
to cs2110-sp11
So if we are running on a 64-bit JVM, being able to run everything in
1432MB is okay? As in, we won't be docked for this?
What exactly will the allocated memory be when the graders run tests
on large data sets?

Peter

May 4, 2011, 12:50:49 PM
to cs2110-sp11
To clarify, I can run connectedComponents() on the large data sets
in 1432MB in about 30 seconds, but I run out of memory with 1024MB.

Nikos Karampatziakis

May 4, 2011, 3:13:51 PM
to cornell-c...@googlegroups.com
The graders will by default use 1.5GB if they have a 32-bit JVM and
2GB if they have a 64-bit JVM.

-Nikos

Steve Spagnola

May 6, 2011, 10:24:09 AM
to cs2110-sp11
For 64-bit, then, that is the scaling requirement, so that would be OK.
