Correct Answers for text files given


Mark Li

unread,
May 1, 2011, 9:19:26 PM5/1/11
to cs2110-sp11
What is everyone getting for connectedComponents and maxst on the given text files? I'd like some data to compare against.

Thanks.

Chong Wang

unread,
May 2, 2011, 12:44:46 AM5/2/11
to cornell-c...@googlegroups.com
For LiveJournalSmall.txt we got 843, and our algorithm takes a long time to run on the two large ones...

Nishant George

unread,
May 2, 2011, 2:32:13 PM5/2/11
to cornell-c...@googlegroups.com
We are getting the following, for the small files:

LiveJournalSmall:
843
93464

TwitterSmall:
756
7134

Smaller number is connectedComponents; larger is maxst.

Anybody else getting these?
--
Nishant George
Cornell University '11
Applied & Engineering Physics

Chaoran Song

unread,
May 2, 2011, 3:27:15 PM5/2/11
to cs2110-sp11
Would people also mind posting some results for clustering? Just the nodes with more connections; no need for the redundant ones that only cycle on themselves.

James

unread,
May 2, 2011, 6:39:56 PM5/2/11
to cs2110-sp11
I got 844 and 757 for LiveJournalSmall and TwitterSmall respectively, but for all of my really small (20 vertices or fewer) handmade cases it gives the correct answer. My algorithm takes into account self loops (A-->A) and vertices connected through two edges (A-->B && B-->A). Any ideas as to why I'm getting one additional connected component? Thanks.
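
For anyone comparing connected-component counts, here is a minimal union-find sketch that collapses A-->B and B-->A into a single merge and lets a self loop register a vertex without merging anything. The class and method names are hypothetical, not the assignment's actual API.

import java.util.*;

/** Minimal sketch: count connected components with union-find, treating edges as undirected. */
class ComponentCounter {
    private final Map<Integer, Integer> parent = new HashMap<>();

    private int find(int v) {
        parent.putIfAbsent(v, v);
        int root = v;
        while (parent.get(root) != root) root = parent.get(root);
        while (parent.get(v) != root) {           // path compression
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    /** A self loop (a == b) registers the vertex but merges nothing; A-->B and B-->A merge once. */
    public void addEdge(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra != rb) parent.put(ra, rb);
    }

    /** The number of distinct roots is the number of connected components. */
    public int count() {
        Set<Integer> roots = new HashSet<>();
        for (int v : parent.keySet()) roots.add(find(v));
        return roots.size();
    }
}

One caveat with a sketch like this: a vertex that appears in the file only as the endpoint of a self loop still counts as its own component here, which is one possible source of an off-by-one like the 844 vs. 843 above.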

Jeff Heidel

unread,
May 2, 2011, 6:45:35 PM5/2/11
to cornell-c...@googlegroups.com
I am also getting 843 and 756 for connectedComponents, respectively.
Still need to implement maxst. Will check back later.

Jeff Heidel

unread,
May 2, 2011, 9:46:04 PM5/2/11
to cornell-c...@googlegroups.com
Hey Nishant,
I just finished implementing maxst...

I got:
LiveJournalSmall.txt
93464

TwitterSmall.txt
6571

Very weird that we're getting identical numbers for LiveJournal but not for Twitter...
Any ideas?

Jacob Bucci

unread,
May 3, 2011, 3:06:30 PM5/3/11
to cs2110-sp11
I'm getting 843 and 756 respectively. Any numbers from the large
files?

Jeff Heidel

unread,
May 3, 2011, 3:08:03 PM5/3/11
to cornell-c...@googlegroups.com
I got:

TwitterLarge.txt
Connected Components: 512076
Max ST: 1883627

LiveJournalLarge.txt
Connected Components: 1961
Max ST: 2354926

Nishant George

unread,
May 3, 2011, 3:39:27 PM5/3/11
to cornell-c...@googlegroups.com
@ Jeff:

Not sure why we got different maxst values for TwitterSmall, especially since we got the same numbers for LiveJournal...

I disallowed self edges (as I'm sure you did) by implementing a disjoint set forest and running Kruskal's. Pretty much by the book. Did you do anything fancy?
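
For anyone who wants to compare against the "by the book" version, here is a rough sketch of Kruskal's adapted for a maximum spanning tree/forest: sort edges by descending weight and use union-find to skip anything that would form a cycle. It skips self edges and assumes the two directions of each edge have already been combined into one undirected weight; the Edge class and names are made up, not the assignment's API.

import java.util.*;

/** Sketch of Kruskal's for a MAXIMUM spanning tree/forest: heaviest edges first. */
class MaxSpanningForest {
    static class Edge {
        final int u, v, weight;
        Edge(int u, int v, int weight) { this.u = u; this.v = v; this.weight = weight; }
    }

    /** Assumes each undirected edge appears once, with reverse directions already summed. */
    static long maxSpanningWeight(Collection<Edge> edges) {
        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort((a, b) -> Integer.compare(b.weight, a.weight));   // descending weight

        Map<Integer, Integer> parent = new HashMap<>();
        long total = 0;
        for (Edge e : sorted) {
            if (e.u == e.v) continue;                 // disallow self edges
            int ru = find(parent, e.u), rv = find(parent, e.v);
            if (ru != rv) {                           // taking this edge creates no cycle
                parent.put(ru, rv);
                total += e.weight;
            }
        }
        return total;
    }

    private static int find(Map<Integer, Integer> parent, int v) {
        parent.putIfAbsent(v, v);
        while (parent.get(v) != v) {
            parent.put(v, parent.get(parent.get(v))); // path halving
            v = parent.get(v);
        }
        return v;
    }
}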

Jeff Heidel

unread,
May 3, 2011, 7:49:03 PM5/3/11
to cornell-c...@googlegroups.com
Ok, I am now also getting a MaxST of 7134 for TwitterSmall.

My change: I disabled all exception handling when adding a duplicate edge.

Some graphs, like Twitter, must contain many duplicate edges that need to be summed up to produce the correct answer.

I'm curious, what are you doing for handling DuplicateEdgeException? What is the correct procedure here?

Charlie Pan

unread,
May 3, 2011, 7:54:08 PM5/3/11
to cornell-c...@googlegroups.com
There are no duplicate edges in any of the graphs. You might not be adding the weights together for two-direction edges.

On a different note, how fast do your addEdge() methods run? I'm having some problems loading the larger graphs, and the bottleneck for me seems to be actually constructing the graph. Are you iterating across your list every time?

Best
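
For what it's worth, here is a rough sketch of an addEdge that avoids scanning a list on every insertion by keeping a HashMap of HashMaps, and that folds A-->B and B-->A into one summed undirected weight (the situation Charlie describes above). The class name and representation are hypothetical, not the assignment's Graph API.

import java.util.*;

/** Sketch: undirected weighted graph where the two directions of an edge are summed together. */
class WeightedGraph {
    // adjacency.get(u).get(v) == combined weight of the undirected edge {u, v}
    private final Map<Integer, Map<Integer, Integer>> adjacency = new HashMap<>();

    /** Expected O(1): two map lookups, no list iteration. */
    public void addEdge(int from, int to, int weight) {
        bump(from, to, weight);
        if (from != to) bump(to, from, weight);   // mirror so lookups work from either endpoint
    }

    private void bump(int u, int v, int weight) {
        adjacency.computeIfAbsent(u, k -> new HashMap<>())
                 .merge(v, weight, Integer::sum);  // summing handles the reverse-direction case
    }

    public int weight(int u, int v) {
        return adjacency.getOrDefault(u, Collections.emptyMap()).getOrDefault(v, 0);
    }

    public Map<Integer, Integer> neighbors(int u) {
        return adjacency.getOrDefault(u, Collections.emptyMap());
    }
}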

Gary Peng

unread,
May 3, 2011, 8:39:41 PM5/3/11
to cs2110-sp11
I had trouble with that too, but increasing the heap size helped a lot.

Check out this thread:

https://groups.google.com/group/cornell-cs2110-sp11/browse_thread/thread/ec549ad3d621f334?hl=en
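
For example, the heap ceiling can be raised with the standard JVM -Xmx flag, either on the command line or in the VM arguments box of an Eclipse run configuration (the class name below is just a placeholder):

java -Xmx1024m GraphMain TwitterLarge.txt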

Jeff Heidel

unread,
May 3, 2011, 9:08:24 PM5/3/11
to cornell-c...@googlegroups.com
I am now getting the following results:

TwitterSmall.txt

CC: 756
ST: 7134

LiveJournalSmall.txt
CC: 843
ST: 93464


TwitterLarge.txt
CC: 512076
ST: 1954534

LiveJournalLarge.txt
CC: 1961
ST: 2571038


It turns out my reverse edge summing wasn't implemented correctly and was sometimes throwing exceptions when it shouldn't have.

Mark Li

unread,
May 3, 2011, 11:25:19 PM5/3/11
to cs2110-sp11
I have all the same answers as Jeff for TwitterSmall and for LiveJournalSmall and LiveJournalLarge. However, my TwitterLarge takes forever to parse (LiveJournalLarge takes about 20 seconds). I haven't the slightest idea why this would be.

Jeff Heidel

unread,
May 4, 2011, 2:31:48 AM5/4/11
to cornell-c...@googlegroups.com
Mine takes about the same amount of time (around 18 seconds).
This is similar to times that people are reporting in this thread.

I believe I remember something from 1110 about how, due to an implementation nuance, reading and parsing a file one line at a time in Java actually takes more time than reading the entire file into a String and then parsing that string. You might try that if you would like to see better parse times. Personally, I think the ~18-second parse time is fine for the large files.
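
In case anyone wants to try that, here is a rough sketch of the slurp-then-split approach, assuming each line of the input is three whitespace-separated integers (from, to, weight) and reusing the hypothetical WeightedGraph sketch from earlier in the thread:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Sketch: read the whole file into one String, then split, instead of parsing line by line. */
class FastParse {
    static void load(WeightedGraph g, String path) throws IOException {
        String whole = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        for (String line : whole.split("\n")) {
            line = line.trim();                       // also strips a trailing \r
            if (line.isEmpty()) continue;
            String[] parts = line.split("\\s+");      // assumed format: "from to weight"
            g.addEdge(Integer.parseInt(parts[0]),
                      Integer.parseInt(parts[1]),
                      Integer.parseInt(parts[2]));
        }
    }
}

Whether this is actually faster than a BufferedReader with a generous buffer is worth measuring rather than assuming.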

Mark Li

unread,
May 4, 2011, 3:39:18 PM5/4/11
to cs2110-sp11
I now get the same results for TwitterLarge as well, but it took 31 seconds to parse the file, and we went over the allowed memory usage. We will have to fix that.

Any results for clustering or standardRecommendation?

Alex

unread,
May 5, 2011, 2:46:05 PM5/5/11
to cs2110-sp11
If my LiveJournalLarge ST is off by 5, does that mean I somehow missed an edge? I am running Kruskal's, and this is the only instance in which this happens. I used the section code with only minor tweaking. Did anyone else have this problem or have to do major tweaking?

Jeff Heidel

unread,
May 5, 2011, 3:28:43 PM5/5/11
to cornell-c...@googlegroups.com
Oh, critical note: my standardRecommendation implementation is directed with edges pointing away from the recommendee and towards the input node. If you implemented it differently, you will certainly get different results.

Mark Li

unread,
May 5, 2011, 4:02:51 PM5/5/11
to cs2110-sp11
By recommendee, do you mean the two-neighbor you are considering?

In other words, are you only considering the edges pointing from the two-neighbor --> middle node --> node for which we are making a recommendation (and the sum of all of those paths, of course), or the other way around?

Mark Li

unread,
May 5, 2011, 4:10:21 PM5/5/11
to cs2110-sp11
We got the same clustering results. I tested the first 4 standardRecommendations; 2 and 3 were the same, but I got 1: 199190716 and 4: 221357672.

What max weights did you get for the first 4?

Jeff Heidel

unread,
May 5, 2011, 5:35:40 PM5/5/11
to cornell-c...@googlegroups.com
Disregard my previous post (which I deleted from the Google group); I misunderstood.

You are correct, Mark: I am implementing standardRecommendation as two-neighbor --> middle node --> node (as indicated by my example here).

As for the discrepancy, I found a blatant error in my standardRecommendation code and I have new values:
Are these the values you are getting?

-------------------------------------------------------------------------

100000803 0.00909090909090909 162681855
132086139 0.027777777777777776 170807787
100008234 0.16666666666666666 100008234
133268998 0.06944444444444445 183226865
29056784 0.06060606060606061 183226865
229792648 0.16666666666666666 149783988
40519997 2.5125628140703518E-5 199750010
20120877 5.7260650480989464E-5 174673178
25970331 6.172146784872629E-4 181288376
100017602 0.05555555555555555 114240102
24776235 4.663754663754664E-4 159621650
26041084 0.08928571428571429 148118085
65142177 0.10714285714285714 148118085
27260086 5.628518733990867E-6 57687354
164998160 3.397893306150187E-4 133183531
14994303 2.229654403567447E-5 204914410
100019889 0.013888888888888888 85819676
57733010 0.5 184758174
100022034 0.007575757575757576 34919005
147305691 0.001829268292682927 228842907

Format: [Node x] [Clustering about x] [standardRecommendation on x]
These are the first 20 nodes in TwitterLarge.txt that have clustering > 0 (eliminating 0s and NaN's).
It might be easier to read here (first 500, same format): https://spreadsheets.google.com/ccc?key=0AmP6pIkAJXcjdEluMGFuOU50OGJCYmM5N0FiTmpJMGc&hl=en&authkey=CL3Dm5wF (updated)


Alex Leung

unread,
May 5, 2011, 6:13:59 PM5/5/11
to cornell-c...@googlegroups.com
132086139  0.027777777777777776   170807787
100008234  0.16666666666666666    100008234
133268998  0.06944444444444445    183226865
29056784   0.06818181818181818    43910805
229792648  0.16666666666666666    149783988
40519997   2.5125628140703518E-5  199750010
20120877   5.7260650480989464E-5  64497685
25970331   6.172146784872629E-4   181288376
100017602  0.05555555555555555    114240102
24776235   4.777504777504777E-4   159621650
26041084   0.10714285714285714    65142177
65142177   0.10714285714285714    148118085
27260086   5.628518733990867E-6   57687354
164998160  3.397893306150187E-4   133183531
14994303   2.229654403567447E-5   204914410
100019889  0.013888888888888888   85819676
57733010   1.0                    184758174
100022034  0.015151515151515152   34919005
147305691  0.001829268292682927   228842907

It seems that some of our values match and some differ, for both clustering and standardRecommendation. I, too, am doing my standardRecommendation with the two-neighbor --> middle node --> node direction.

Jeff Heidel

unread,
May 5, 2011, 6:28:10 PM5/5/11
to cornell-c...@googlegroups.com
Hey Alex, what do you get as a max weight for the standardRecommendation of node 29056784? (According to the results above, we differ on this node.)
I'm getting a connection to 183226865 with a weight of 3.

Andrew Casey

unread,
May 5, 2011, 6:48:54 PM5/5/11
to cornell-c...@googlegroups.com
Just a thought, but how you break ties may be the reason for your slight differences. Most weights in TwitterLarge are not that large, so you may have a whole slew of possible recommendations for any one vertex.


--
Andrew Casey
Class of 2014

Alex Leung

unread,
May 5, 2011, 7:24:26 PM5/5/11
to cornell-c...@googlegroups.com
I am also getting a max weight of 3 with a connection to 43910805.  I'm guessing that Andrew's explanation is the reason for our discrepancy.

Jeff Heidel

unread,
May 5, 2011, 7:28:38 PM5/5/11
to cornell-c...@googlegroups.com
Our clustering values should be the same, though. There are a couple discrepancies (for example, node 57733010).
I think Mark got the same results as I did for clustering.

Alex Leung

unread,
May 5, 2011, 7:37:30 PM5/5/11
to cornell-c...@googlegroups.com
Are you checking for self-edges? That could throw off the clustering.

A node should not be considered a neighbor of itself, and a neighbor who talks to himself should not be counted as a connection.
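
In case it helps with debugging, here is a rough sketch of a local clustering computation with both of those self-edge cases excluded. This is one reading of the spec, not the official one; the neighbor-set representation is hypothetical, and in particular the k * (k - 1) denominator may differ from the course's convention.

import java.util.*;

/** Sketch: clustering about x, ignoring self loops entirely. */
class Clustering {
    /** adjacency.get(u) = the set of u's neighbors, viewed as undirected (hypothetical representation). */
    static double clustering(Map<Integer, Set<Integer>> adjacency, int x) {
        Set<Integer> neighbors = new HashSet<>(adjacency.getOrDefault(x, Collections.emptySet()));
        neighbors.remove(x);                        // a node is not its own neighbor
        int k = neighbors.size();
        if (k < 2) return 0.0;                      // or NaN, depending on the spec

        int links = 0;
        for (int a : neighbors) {
            for (int b : adjacency.getOrDefault(a, Collections.emptySet())) {
                if (b != a && neighbors.contains(b)) links++;   // b != a drops a's self loop
            }
        }
        // Each neighbor-to-neighbor connection is seen from both endpoints, hence k * (k - 1).
        return (double) links / (k * (k - 1));
    }
}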

Jeff Heidel

unread,
May 5, 2011, 8:21:32 PM5/5/11
to cornell-c...@googlegroups.com
I am checking and handling both of those self-edge cases.

Joshua Lui Wen Hao

unread,
May 5, 2011, 11:52:53 PM5/5/11
to cornell-c...@googlegroups.com
Hi Jeff and Alex,

I've had results that are somewhat different from yours for standardRecommendation, so I went to check TwitterLarge.txt itself for node 29056784 (since your path weights are available as well). This is what I found:

A direct connection from 43910805 to 29056784 exists, so it cannot be a valid recommendation:
43910805 29056784 2

183226865 indeed has a combined path weight of 3. But using my algorithm, I found instead node 87971425, which has a combined path weight of 5 via:

87971425 133268998 1, then  133268998 29056784 1
and 87971425 43910805 1, then 43910805 29056784 2

So 87971425 should be the correct recommendation.

Let me know if there's been a mistake.

Thanks,
Joshua

Jeff Heidel

unread,
May 6, 2011, 12:03:27 AM5/6/11
to cornell-c...@googlegroups.com
Now that I think about it, I think we're all doing it incorrectly.
In another thread, Robert states that we are to "sum the weight of all 2-edge paths".

My understanding of this is that the longer, 4-edge path you have specified is not valid.
In addition, my current standardRecommendation only looks for the largest single 2-edge path to the recommended node; it does not consider the sum of all 2-edge paths that could connect the recommended node to the source node (the case of many mutual first-order neighbors between the two nodes).

Joshua Lui Wen Hao

unread,
May 6, 2011, 12:28:28 AM5/6/11
to cornell-c...@googlegroups.com
I did not specify a 4-edge path, but rather two distinct 2-edge paths. Why would these two paths be invalid?

1st path:
87971425 133268998 1, then  133268998 29056784 1

2nd path:
87971425 43910805 1, then 43910805 29056784 2


My understanding of the sum of all 2-edge paths is the same. 
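
Since a few different readings are floating around, here is a rough sketch of the interpretation Joshua describes: for a target v, every candidate w is scored by summing weight(w-->m) + weight(m-->v) over all middle nodes m, skipping self loops and anyone who already has a direct edge into v. The incoming-edge map used here is a hypothetical representation, not the assignment's API, and the tie-breaking is arbitrary.

import java.util.*;

/** Sketch of the "two-neighbor --> middle --> v" reading of standardRecommendation. */
class Recommend {
    /** in.get(x).get(y) = weight of the directed edge y --> x (hypothetical representation). */
    static int recommend(Map<Integer, Map<Integer, Integer>> in, int v) {
        Map<Integer, Integer> intoV = in.getOrDefault(v, Collections.emptyMap());
        Map<Integer, Long> score = new HashMap<>();

        for (Map.Entry<Integer, Integer> mid : intoV.entrySet()) {
            int m = mid.getKey();
            int mToV = mid.getValue();
            if (m == v) continue;                              // ignore v's own self loop
            for (Map.Entry<Integer, Integer> two : in.getOrDefault(m, Collections.emptyMap()).entrySet()) {
                int w = two.getKey();
                if (w == v || w == m || intoV.containsKey(w)) continue;  // skip v, m's self loop, and direct neighbors of v
                score.merge(w, (long) (two.getValue() + mToV), Long::sum);
            }
        }

        int best = -1;
        long bestScore = Long.MIN_VALUE;
        for (Map.Entry<Integer, Long> e : score.entrySet()) {
            if (e.getValue() > bestScore) { best = e.getKey(); bestScore = e.getValue(); }
        }
        return best;   // -1 means no candidate was found
    }
}

On the example above this gives 87971425 a score of (1 + 1) + (1 + 2) = 5 for v = 29056784, via the middles 133268998 and 43910805, which matches Joshua's numbers.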


Jeff Heidel

unread,
May 6, 2011, 12:32:04 AM5/6/11
to cornell-c...@googlegroups.com
Ok, that makes sense and looks correct. I misunderstood.
I will fix my algorithm and then post my corrected results on here.

Do you have clustering implemented by any chance? Are your results for clustering matching up with any of the results posted here?

Joshua Lui Wen Hao

unread,
May 6, 2011, 12:40:50 AM5/6/11
to cornell-c...@googlegroups.com
I've used the first 15 or so nodes you listed earlier for reference. Here are my results (the format is [Node]: [clustering] [standardRecommendation]).

From a cursory glance, I think the clustering results match up well.

Node 100000803: 0.00909090909090909   106747762
Node 132086139: 0.027777777777777776   170807787
Node 100008234: 0.16666666666666666   100008234
Node 133268998: 0.06944444444444445   221357672
Node 29056784: 0.045454545454545456   87971425
Node 229792648: 0.16666666666666666   149783988
Node 40519997: 2.5125628140703518E-5   199750010
Node 20120877: 5.7260650480989464E-5   158262046
Node 25970331: 6.172146784872629E-4   186213617
Node 100017602: 0.05555555555555555   114240102
Node 24776235: 4.663754663754664E-4   143494683
Node 26041084: 0.08928571428571429   25002265
Node 65142177: 0.10714285714285714   44111710
Node 27260086: 5.628518733990867E-6   46248772
Node 164998160: 3.397893306150187E-4   133183531
Node 14994303: 2.229654403567447E-5   63572500
Node 100019889: 0.013888888888888888   85819676
Node 57733010: 0.5   184758174

Jeff Heidel

unread,
May 6, 2011, 1:02:42 AM5/6/11
to cornell-c...@googlegroups.com
I think I have fixed my code for standardRecommendation.
I am now also getting a maxWeight of 5 for node 29056784, although it recommends a different node (appears to be a tie).

I have updated my list of values, including all the maxWeights if you wish to compare:

https://spreadsheets.google.com/ccc?key=0AmP6pIkAJXcjdEluMGFuOU50OGJCYmM5N0FiTmpJMGc&hl=en&authkey=CL3Dm5wF

Alex Leung

unread,
May 6, 2011, 1:29:53 AM5/6/11
to cornell-c...@googlegroups.com
I had bugs in both my standard recommendation and clustering.  Hopefully I was able to fix all the bugs in my program.

My values now read:

132086139  0.027777777777777776   170807787
100008234  0.16666666666666666    100008234
133268998  0.06944444444444445    183226865
29056784   0.06060606060606061    183226865
229792648  0.16666666666666666    149783988
40519997   2.5125628140703518E-5  199750010
20120877   5.7260650480989464E-5  174673178
25970331   6.172146784872629E-4   181288376
100017602  0.05555555555555555    114240102
24776235   4.663754663754664E-4   159621650
26041084   0.08928571428571429    65142177
65142177   0.10714285714285714    148118085
27260086   5.628518733990867E-6   57687354
164998160  3.397893306150187E-4   133183531
14994303   2.229654403567447E-5   204914410
100019889  0.013888888888888888   85819676
57733010   0.5                    184758174
100022034  0.007575757575757576   34919005
147305691  0.001829268292682927   228842907

These are consistent with your results, Jeff.

Alex Leung

unread,
May 6, 2011, 1:31:30 AM5/6/11
to cornell-c...@googlegroups.com
And in terms of standardRecommendation, I think we are allowed to implement it in one of three ways: directed from the two-neighbor to the middle node to the node, directed from the node to the middle node to the two-neighbor, or undirected.

Mark Li

unread,
May 6, 2011, 3:25:53 AM5/6/11
to cs2110-sp11
I'm getting the same max weight values, but I'm picking different nodes to recommend (probably because we are breaking ties differently). I'm taking the first max value I get, but we may also be traversing the nodes in different orders. I think that since our weights match up, we should be fine.

Clustering of course still matches.

Peter

unread,
May 7, 2011, 5:10:15 PM5/7/11
to cs2110-sp11
Hey guys,

Robert Escriva mentioned three strategies for implementing standardRec (u-->v, u<--v, u--v), which differ in directionality and in which vertices we consider as 2-neighbors.

Thus the answers for standardRec can be very different depending on which strategy is used, right?

Jeff Heidel

unread,
May 7, 2011, 5:49:10 PM5/7/11
to cornell-c...@googlegroups.com
Yes, you will only get the same answers as others if you implement standardRecommendation using the same strategy. Even among similar implementations, ties between possible recommendations are likely (which is why comparing the maxWeight of the paths is the more reliable check).

Joshua Hagins

unread,
May 7, 2011, 8:28:45 PM5/7/11
to cornell-c...@googlegroups.com
Don't paths for standardRecommendation have to be directed and u-->v? As per the footnote from page 3 of the PDF: "Notice that we specify paths to v. The input graph is directed."

James Feher

unread,
May 7, 2011, 8:47:13 PM5/7/11
to cornell-c...@googlegroups.com
https://mail.google.com/mail/?shva=1#search/clarifications/12fb558c2cb8c483

But I believe in another post he says that your method is also correct.
