Correct Answers for text files given


Mark Li

unread,
May 1, 2011, 9:19:26 PM5/1/11
to cs2110-sp11
What is everyone getting for connectedComponents and maxst on the given text files? I'd like some data to compare against.

Thanks.

Chong Wang

unread,
May 2, 2011, 12:44:46 AM5/2/11
to cornell-c...@googlegroups.com
For LiveJournalSmall.txt we got 843, and our algorithm takes a long time to run on the two large ones...

Nishant George

unread,
May 2, 2011, 2:32:13 PM5/2/11
to cornell-c...@googlegroups.com
We are getting the following, for the small files:

LiveJournalSmall:
843
93464

TwitterSmall:
756
7134

Smaller number is connectedComponents; larger is maxst.

Anybody else getting these?
--
Nishant George
Cornell University '11
Applied & Engineering Physics

Chaoran Song

unread,
May 2, 2011, 3:27:15 PM5/2/11
to cs2110-sp11
Would people also mind posting some results for clustering? Just the nodes with more connections; no need for the redundant ones that only cycle on themselves.

James

unread,
May 2, 2011, 6:39:56 PM5/2/11
to cs2110-sp11
I got 844 and 757 for LiveJournalSmall and TwitterSmall respectively, but for all of my really small (20 vertices or fewer) handmade cases it gives the correct answer. My algorithm takes into account self loops (A-->A) and vertices connected through two edges (A-->B && B-->A). Any ideas as to why I'm getting one additional connected component? Thanks.
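
For anyone comparing connected-component counts, here is a minimal union-find sketch that collapses A-->B and B-->A into a single merge and lets a self loop register a vertex without merging anything. The class and method names are hypothetical, not the assignment's actual API.

import java.util.*;

/** Minimal sketch: count connected components with union-find, treating edges as undirected. */
class ComponentCounter {
    private final Map<Integer, Integer> parent = new HashMap<>();

    private int find(int v) {
        parent.putIfAbsent(v, v);
        int root = v;
        while (parent.get(root) != root) root = parent.get(root);
        while (parent.get(v) != root) {           // path compression
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    /** A self loop (a == b) registers the vertex but merges nothing; A-->B and B-->A merge once. */
    public void addEdge(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra != rb) parent.put(ra, rb);
    }

    /** The number of distinct roots is the number of connected components. */
    public int count() {
        Set<Integer> roots = new HashSet<>();
        for (int v : parent.keySet()) roots.add(find(v));
        return roots.size();
    }
}

One caveat with a sketch like this: a vertex that appears in the file only as the endpoint of a self loop still counts as its own component here, which is one possible source of an off-by-one like the 844 vs. 843 above.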

Jeff Heidel

unread,
May 2, 2011, 6:45:35 PM5/2/11
to cornell-c...@googlegroups.com
I am also getting 843 and 756 for connectedComponents, respectively.
Still need to implement maxst. Will check back later.

Jeff Heidel

unread,
May 2, 2011, 9:46:04 PM5/2/11
to cornell-c...@googlegroups.com
Hey Nishant,
I just finished implementing maxst...

I got:
LiveJournalSmall.txt
93464

TwitterSmall.txt
6571

Very weird that we're getting identical numbers for LiveJournal but not for Twitter...
Any ideas?

Jacob Bucci

unread,
May 3, 2011, 3:06:30 PM5/3/11
to cs2110-sp11
I'm getting 843 and 756 respectively. Any numbers from the large
files?

Jeff Heidel

unread,
May 3, 2011, 3:08:03 PM5/3/11
to cornell-c...@googlegroups.com
I got:

TwitterLarge.txt
Connected Components: 512076
Max ST: 1883627

LiveJournalLarge.txt
Connected Components: 1961
Max ST: 2354926

Nishant George

unread,
May 3, 2011, 3:39:27 PM5/3/11
to cornell-c...@googlegroups.com
@ Jeff:

Not sure why we got different maxst values for TwitterSmall, especially since we got the same numbers for LiveJournal...

I disallowed self edges (as I'm sure you did) by implementing a disjoint set forest and running Kruskal's. Pretty much by the book. Did you do anything fancy?
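
For anyone who wants to compare against the "by the book" version, here is a rough sketch of Kruskal's adapted for a maximum spanning tree/forest: sort edges by descending weight and use union-find to skip anything that would form a cycle. It skips self edges and assumes the two directions of each edge have already been combined into one undirected weight; the Edge class and names are made up, not the assignment's API.

import java.util.*;

/** Sketch of Kruskal's for a MAXIMUM spanning tree/forest: heaviest edges first. */
class MaxSpanningForest {
    static class Edge {
        final int u, v, weight;
        Edge(int u, int v, int weight) { this.u = u; this.v = v; this.weight = weight; }
    }

    /** Assumes each undirected edge appears once, with reverse directions already summed. */
    static long maxSpanningWeight(Collection<Edge> edges) {
        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort((a, b) -> Integer.compare(b.weight, a.weight));   // descending weight

        Map<Integer, Integer> parent = new HashMap<>();
        long total = 0;
        for (Edge e : sorted) {
            if (e.u == e.v) continue;                 // disallow self edges
            int ru = find(parent, e.u), rv = find(parent, e.v);
            if (ru != rv) {                           // taking this edge creates no cycle
                parent.put(ru, rv);
                total += e.weight;
            }
        }
        return total;
    }

    private static int find(Map<Integer, Integer> parent, int v) {
        parent.putIfAbsent(v, v);
        while (parent.get(v) != v) {
            parent.put(v, parent.get(parent.get(v))); // path halving
            v = parent.get(v);
        }
        return v;
    }
}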

Jeff Heidel

unread,
May 3, 2011, 7:49:03 PM5/3/11
to cornell-c...@googlegroups.com
Ok, I am now also getting a MaxST of 7134 for TwitterSmall.

My change: I disabled all exception handling when adding a duplicate edge.

Some graphs, like Twitter, must contain many duplicate edges that need to be summed up to produce the correct answer.

I'm curious, what are you doing for handling DuplicateEdgeException? What is the correct procedure here?

Charlie Pan

unread,
May 3, 2011, 7:54:08 PM5/3/11
to cornell-c...@googlegroups.com
There are no duplicate edges in any of the graphs. You might not be adding the weights together for two-direction edges.

On a different note, how fast do your addEdge() methods run? I'm having some problems loading the larger graphs, and the bottleneck for me seems to be actually constructing the graph. Are you iterating across your list every time?

Best
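
For what it's worth, here is a rough sketch of an addEdge that avoids scanning a list on every insertion by keeping a HashMap of HashMaps, and that folds A-->B and B-->A into one summed undirected weight (the situation Charlie describes above). The class name and representation are hypothetical, not the assignment's Graph API.

import java.util.*;

/** Sketch: undirected weighted graph where the two directions of an edge are summed together. */
class WeightedGraph {
    // adjacency.get(u).get(v) == combined weight of the undirected edge {u, v}
    private final Map<Integer, Map<Integer, Integer>> adjacency = new HashMap<>();

    /** Expected O(1): two map lookups, no list iteration. */
    public void addEdge(int from, int to, int weight) {
        bump(from, to, weight);
        if (from != to) bump(to, from, weight);   // mirror so lookups work from either endpoint
    }

    private void bump(int u, int v, int weight) {
        adjacency.computeIfAbsent(u, k -> new HashMap<>())
                 .merge(v, weight, Integer::sum);  // summing handles the reverse-direction case
    }

    public int weight(int u, int v) {
        return adjacency.getOrDefault(u, Collections.emptyMap()).getOrDefault(v, 0);
    }

    public Map<Integer, Integer> neighbors(int u) {
        return adjacency.getOrDefault(u, Collections.emptyMap());
    }
}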

Gary Peng

unread,
May 3, 2011, 8:39:41 PM5/3/11
to cs2110-sp11
I had trouble with that too, but increasing the heap size helped a lot.

Check out this thread:

https://groups.google.com/group/cornell-cs2110-sp11/browse_thread/thread/ec549ad3d621f334?hl=en
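
For example, the heap ceiling can be raised with the standard JVM -Xmx flag, either on the command line or in the VM arguments box of an Eclipse run configuration (the class name below is just a placeholder):

java -Xmx1024m GraphMain TwitterLarge.txt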

Jeff Heidel

unread,
May 3, 2011, 9:08:24 PM5/3/11
to cornell-c...@googlegroups.com
I am now getting the following results:

TwitterSmall.txt

CC: 756
ST: 7134

LiveJournalSmall.txt
CC: 843
ST: 93464


TwitterLarge.txt
CC: 512076
ST: 1954534

LiveJournalLarge.txt
CC: 1961
ST: 2571038


It turns out my reverse edge summing wasn't implemented correctly and was sometimes throwing exceptions when it shouldn't have.

Mark Li

unread,
May 3, 2011, 11:25:19 PM5/3/11
to cs2110-sp11
I have all the same answers as Jeff for TwitterSmall and for LiveJournalSmall and LiveJournalLarge. However, my TwitterLarge takes forever to parse (LiveJournalLarge takes about 20 seconds). I haven't the slightest idea why this would be.

Jeff Heidel

unread,
May 4, 2011, 2:31:48 AM5/4/11
to cornell-c...@googlegroups.com
Mine takes about the same amount of time (around 18 seconds).
This is similar to times that people are reporting in this thread.

I believe I remember something from 1110 about how, due to an implementation nuance, reading and parsing a file one line at a time in Java actually takes more time than reading the entire file into a String and then parsing that string. You might try that if you would like to see better parse times. Personally, I think the ~18-second parse time is fine for the large files.
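
In case anyone wants to try that, here is a rough sketch of the slurp-then-split approach, assuming each line of the input is three whitespace-separated integers (from, to, weight) and reusing the hypothetical WeightedGraph sketch from earlier in the thread:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Sketch: read the whole file into one String, then split, instead of parsing line by line. */
class FastParse {
    static void load(WeightedGraph g, String path) throws IOException {
        String whole = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        for (String line : whole.split("\n")) {
            line = line.trim();                       // also strips a trailing \r
            if (line.isEmpty()) continue;
            String[] parts = line.split("\\s+");      // assumed format: "from to weight"
            g.addEdge(Integer.parseInt(parts[0]),
                      Integer.parseInt(parts[1]),
                      Integer.parseInt(parts[2]));
        }
    }
}

Whether this is actually faster than a BufferedReader with a generous buffer is worth measuring rather than assuming.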

Mark Li

unread,
May 4, 2011, 3:39:18 PM5/4/11
to cs2110-sp11
I now get the same results for TwitterLarge as well, but it took 31 seconds to parse the file, and we went over the allowed memory usage. We will have to fix that.

Any results for clustering or standardRecommendation?

Alex

unread,
May 5, 2011, 2:46:05 PM5/5/11
to cs2110-sp11
If my LiveJournalLarge ST is off by 5, does that mean I somehow missed an edge? I am running Kruskal's, and this is the only instance in which this happens. I used the section code with only minor tweaking. Did anyone else have this problem or have to do major tweaking?

Jeff Heidel

unread,
May 5, 2011, 3:28:43 PM5/5/11
to cornell-c...@googlegroups.com
Oh, critical note: my standardRecommendation implementation is directed with edges pointing away from the recommendee and towards the input node. If you implemented it differently, you will certainly get different results.

Mark Li

unread,
May 5, 2011, 4:02:51 PM5/5/11
to cs2110-sp11
By recommendee, do you mean the two-neighbor you are considering?

In other words, are you only considering the edges pointing from the two-neighbor --> middle node --> node for which we are making a recommendation (and the sum of all of those paths, of course), or the other way around?

Mark Li

unread,
May 5, 2011, 4:10:21 PM5/5/11
to cs2110-sp11
We got the same clustering results. I tested the first 4 standardRecommendations; 2 and 3 were the same, but I got 1: 199190716 and 4: 221357672.

What max weights did you get for the first 4?

Jeff Heidel

unread,
May 5, 2011, 5:35:40 PM5/5/11
to cornell-c...@googlegroups.com
Disregard my previous post (which I deleted from the Google group); I misunderstood.

You are correct, Mark: I am implementing standardRecommendation as two-neighbor --> middle node --> node (as indicated by my example here).

As for the discrepancy, I found a blatant error in my standardRecommendation code and I have new values:
Are these the values you are getting?

-------------------------------------------------------------------------

100000803 0.00909090909090909 162681855
132086139 0.027777777777777776 170807787
100008234 0.16666666666666666 100008234
133268998 0.06944444444444445 183226865
29056784 0.06060606060606061 183226865
229792648 0.16666666666666666 149783988
40519997 2.5125628140703518E-5 199750010
20120877 5.7260650480989464E-5 174673178
25970331 6.172146784872629E-4 181288376
100017602 0.05555555555555555 114240102
24776235 4.663754663754664E-4 159621650
26041084 0.08928571428571429 148118085
65142177 0.10714285714285714 148118085
27260086 5.628518733990867E-6 57687354
164998160 3.397893306150187E-4 133183531
14994303 2.229654403567447E-5 204914410
100019889 0.013888888888888888 85819676
57733010 0.5 184758174
100022034 0.007575757575757576 34919005
147305691 0.001829268292682927 228842907

Format: [Node x] [Clustering about x] [standardRecommendation on x]
These are the first 20 nodes in TwitterLarge.txt that have clustering > 0 (eliminating 0s and NaN's).
It might be easier to read here (first 500, same format): https://spreadsheets.google.com/ccc?key=0AmP6pIkAJXcjdEluMGFuOU50OGJCYmM5N0FiTmpJMGc&hl=en&authkey=CL3Dm5wF (updated)


Alex Leung

unread,
May 5, 2011, 6:13:59 PM5/5/11
to cornell-c...@googlegroups.com
132086139  0.027777777777777776   170807787
100008234  0.16666666666666666    100008234
133268998  0.06944444444444445    183226865
29056784   0.06818181818181818    43910805
229792648  0.16666666666666666    149783988
40519997   2.5125628140703518E-5  199750010
20120877   5.7260650480989464E-5  64497685
25970331   6.172146784872629E-4   181288376
100017602  0.05555555555555555    114240102
24776235   4.777504777504777E-4   159621650
26041084   0.10714285714285714    65142177
65142177   0.10714285714285714    148118085
27260086   5.628518733990867E-6   57687354
164998160  3.397893306150187E-4   133183531
14994303   2.229654403567447E-5   204914410
100019889  0.013888888888888888   85819676
57733010   1.0                    184758174
100022034  0.015151515151515152   34919005
147305691  0.001829268292682927   228842907

It seems that some of our values match and some differ, for both clustering and standardRecommendation. I, too, am doing my standardRecommendation with the two-neighbor --> middle node --> node direction.

Jeff Heidel

unread,
May 5, 2011, 6:28:10 PM5/5/11
to cornell-c...@googlegroups.com
Hey Alex, what do you get as a max weight for the standardRecommendation of node 29056784? (According to the results above, we differ on this node.)
I'm getting a connection to 183226865 with a weight of 3.

Andrew Casey

unread,
May 5, 2011, 6:48:54 PM5/5/11
to cornell-c...@googlegroups.com
Just a thought, but how you break ties may be the reason for your slight differences. Most weights in TwitterLarge are not that large, so you may have a whole slew of possible recommendations for any one vertex.


--
Andrew Casey
Class of 2014

Alex Leung

unread,
May 5, 2011, 7:24:26 PM5/5/11
to cornell-c...@googlegroups.com
I am also getting a max weight of 3 with a connection to 43910805.  I'm guessing that Andrew's explanation is the reason for our discrepancy.

Jeff Heidel

unread,
May 5, 2011, 7:28:38 PM5/5/11
to cornell-c...@googlegroups.com
Our clustering values should be the same, though. There are a couple discrepancies (for example, node 57733010).
I think Mark got the same results as I did for clustering.

Alex Leung

unread,
May 5, 2011, 7:37:30 PM5/5/11
to cornell-c...@googlegroups.com
Are you checking for self-edges? That could throw off the clustering.

A node should not be considered a neighbor of itself, and a neighbor who talks to himself should not be counted as a connection.
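
In case it helps with debugging, here is a rough sketch of a local clustering computation with both of those self-edge cases excluded. This is one reading of the spec, not the official one; the neighbor-set representation is hypothetical, and in particular the k * (k - 1) denominator may differ from the course's convention.

import java.util.*;

/** Sketch: clustering about x, ignoring self loops entirely. */
class Clustering {
    /** adjacency.get(u) = the set of u's neighbors, viewed as undirected (hypothetical representation). */
    static double clustering(Map<Integer, Set<Integer>> adjacency, int x) {
        Set<Integer> neighbors = new HashSet<>(adjacency.getOrDefault(x, Collections.emptySet()));
        neighbors.remove(x);                        // a node is not its own neighbor
        int k = neighbors.size();
        if (k < 2) return 0.0;                      // or NaN, depending on the spec

        int links = 0;
        for (int a : neighbors) {
            for (int b : adjacency.getOrDefault(a, Collections.emptySet())) {
                if (b != a && neighbors.contains(b)) links++;   // b != a drops a's self loop
            }
        }
        // Each neighbor-to-neighbor connection is seen from both endpoints, hence k * (k - 1).
        return (double) links / (k * (k - 1));
    }
}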

Jeff Heidel

unread,
May 5, 2011, 8:21:32 PM5/5/11
to cornell-c...@googlegroups.com
I am checking and handling both of those self-edge cases.

Joshua Lui Wen Hao

unread,
May 5, 2011, 11:52:53 PM5/5/11
to cornell-c...@googlegroups.com
Hi Jeff and Alex,

I've had results that are somewhat different from yours for standardRecommendation, so I went to check TwitterLarge.txt itself for node 29056784 (since your path weights are available as well). This is what I found:

A direct connection from 43910805 to 29056784 exists, so it cannot be a valid recommendation:
43910805 29056784 2

183226865 indeed has a combined path weight of 3. But using my algorithm, I found instead node 87971425, which has a combined path weight of 5 via:

87971425 133268998 1, then  133268998 29056784 1
and 87971425 43910805 1, then 43910805 29056784 2

So 87971425 should be the correct recommendation.

Let me know if there's been a mistake.

Thanks,
Joshua

Jeff Heidel

unread,
May 6, 2011, 12:03:27 AM5/6/11
to cornell-c...@googlegroups.com
Now that I think about it, I think we're all doing it incorrectly.
In another thread, Robert states that we are to "sum the weight of all 2-edge paths".

My understanding of this is that the longer, 4-edge path you have specified is not valid.
In addition, my current standardRecommendation only looks for the largest single 2-edge path to the recommended node; it does not consider the sum of all 2-edge paths that could connect the recommended node to the source node (the case of many mutual first-order neighbors between the two nodes).

Joshua Lui Wen Hao

unread,
May 6, 2011, 12:28:28 AM5/6/11
to cornell-c...@googlegroups.com
I did not specify a 4-edge path, but rather two distinct 2-edge paths. Why would these two paths be invalid?

1st path:
87971425 133268998 1, then  133268998 29056784 1

2nd path:
87971425 43910805 1, then 43910805 29056784 2


My understanding of the sum of all 2-edge paths is the same. 
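
Since a few different readings are floating around, here is a rough sketch of the interpretation Joshua describes: for a target v, every candidate w is scored by summing weight(w-->m) + weight(m-->v) over all middle nodes m, skipping self loops and anyone who already has a direct edge into v. The incoming-edge map used here is a hypothetical representation, not the assignment's API, and the tie-breaking is arbitrary.

import java.util.*;

/** Sketch of the "two-neighbor --> middle --> v" reading of standardRecommendation. */
class Recommend {
    /** in.get(x).get(y) = weight of the directed edge y --> x (hypothetical representation). */
    static int recommend(Map<Integer, Map<Integer, Integer>> in, int v) {
        Map<Integer, Integer> intoV = in.getOrDefault(v, Collections.emptyMap());
        Map<Integer, Long> score = new HashMap<>();

        for (Map.Entry<Integer, Integer> mid : intoV.entrySet()) {
            int m = mid.getKey();
            int mToV = mid.getValue();
            if (m == v) continue;                              // ignore v's own self loop
            for (Map.Entry<Integer, Integer> two : in.getOrDefault(m, Collections.emptyMap()).entrySet()) {
                int w = two.getKey();
                if (w == v || w == m || intoV.containsKey(w)) continue;  // skip v, m's self loop, and direct neighbors of v
                score.merge(w, (long) (two.getValue() + mToV), Long::sum);
            }
        }

        int best = -1;
        long bestScore = Long.MIN_VALUE;
        for (Map.Entry<Integer, Long> e : score.entrySet()) {
            if (e.getValue() > bestScore) { best = e.getKey(); bestScore = e.getValue(); }
        }
        return best;   // -1 means no candidate was found
    }
}

On the example above this gives 87971425 a score of (1 + 1) + (1 + 2) = 5 for v = 29056784, via the middles 133268998 and 43910805, which matches Joshua's numbers.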


Jeff Heidel

unread,
May 6, 2011, 12:32:04 AM5/6/11
to cornell-c...@googlegroups.com
Ok, that makes sense and looks correct. I misunderstood.
I will fix my algorithm and then post my corrected results on here.

Do you have clustering implemented by any chance? Are your results for clustering matching up with any of the results posted here?

Joshua Lui Wen Hao

unread,
May 6, 2011, 12:40:50 AM5/6/11
to cornell-c...@googlegroups.com
I've used the first 15 or so nodes you listed earlier for reference. Here are my results (the format is [Node]: [clustering] [standardRecommendation]).

From a cursory glance, I think the clustering results match up well.

Node 100000803: 0.00909090909090909   106747762
Node 132086139: 0.027777777777777776   170807787
Node 100008234: 0.16666666666666666   100008234
Node 133268998: 0.06944444444444445   221357672
Node 29056784: 0.045454545454545456   87971425
Node 229792648: 0.16666666666666666   149783988
Node 40519997: 2.5125628140703518E-5   199750010
Node 20120877: 5.7260650480989464E-5   158262046
Node 25970331: 6.172146784872629E-4   186213617
Node 100017602: 0.05555555555555555   114240102
Node 24776235: 4.663754663754664E-4   143494683
Node 26041084: 0.08928571428571429   25002265
Node 65142177: 0.10714285714285714   44111710
Node 27260086: 5.628518733990867E-6   46248772
Node 164998160: 3.397893306150187E-4   133183531
Node 14994303: 2.229654403567447E-5   63572500
Node 100019889: 0.013888888888888888   85819676
Node 57733010: 0.5   184758174

Jeff Heidel

unread,
May 6, 2011, 1:02:42 AM5/6/11
to cornell-c...@googlegroups.com
I think I have fixed my code for standardRecommendation.
I am now also getting a maxWeight of 5 for node 29056784, although it recommends a different node (appears to be a tie).

I have updated my list of values, including all the maxWeights if you wish to compare:

https://spreadsheets.google.com/ccc?key=0AmP6pIkAJXcjdEluMGFuOU50OGJCYmM5N0FiTmpJMGc&hl=en&authkey=CL3Dm5wF

Alex Leung

unread,
May 6, 2011, 1:29:53 AM5/6/11
to cornell-c...@googlegroups.com
I had bugs in both my standard recommendation and clustering.  Hopefully I was able to fix all the bugs in my program.

My values now read:

132086139  0.027777777777777776   170807787
100008234  0.16666666666666666    100008234
133268998  0.06944444444444445    183226865
29056784   0.06060606060606061    183226865
229792648  0.16666666666666666    149783988
40519997   2.5125628140703518E-5  199750010
20120877   5.7260650480989464E-5  174673178
25970331   6.172146784872629E-4   181288376
100017602  0.05555555555555555    114240102
24776235   4.663754663754664E-4   159621650
26041084   0.08928571428571429    65142177
65142177   0.10714285714285714    148118085
27260086   5.628518733990867E-6   57687354
164998160  3.397893306150187E-4   133183531
14994303   2.229654403567447E-5   204914410
100019889  0.013888888888888888   85819676
57733010   0.5                    184758174
100022034  0.007575757575757576   34919005
147305691  0.001829268292682927   228842907

These are consistent with your results, Jeff.

Alex Leung

unread,
May 6, 2011, 1:31:30 AM5/6/11
to cornell-c...@googlegroups.com
And in terms of standardRecommendation, I think we are allowed to implement it in one of three ways: directed from the two-neighbor to the middle node to the node, directed from the node to the middle node to the two-neighbor, or undirected.

Mark Li

unread,
May 6, 2011, 3:25:53 AM5/6/11
to cs2110-sp11
I'm getting the same max weight values, but I'm picking different nodes to recommend (probably because we are breaking ties differently). I'm taking the first max value I get, but we may also be traversing the nodes in different orders. I think that since our weights match up, we should be fine.

Clustering of course still matches.

Peter

unread,
May 7, 2011, 5:10:15 PM5/7/11
to cs2110-sp11
Hey guys,

Robert Escriva mentioned three strategies for implementing standardRec (u-->v, u<--v, u--v), which differ in directionality and in which vertices we consider as 2-neighbors.

Thus the answers for standardRec can be very different depending on which strategy is used, right?

Jeff Heidel

unread,
May 7, 2011, 5:49:10 PM5/7/11
to cornell-c...@googlegroups.com
Yes, you will only get the same answers as others if you implement standardRecommendation using the same strategy. Even among similar implementations, ties between possible recommendations are likely (which is why comparing the maxWeight of the paths is the more reliable check).

Joshua Hagins

unread,
May 7, 2011, 8:28:45 PM5/7/11
to cornell-c...@googlegroups.com
Don't paths for standardRecommendation have to be directed and u-->v? As per the footnote from page 3 of the PDF: "Notice that we specify paths to v. The input graph is directed."

James Feher

unread,
May 7, 2011, 8:47:13 PM5/7/11
to cornell-c...@googlegroups.com
https://mail.google.com/mail/?shva=1#search/clarifications/12fb558c2cb8c483

But I believe in another post he says that your method is also correct.
