Tag clouds and "similar pages"


Velenux

Aug 10, 2009, 8:59:31 PM8/10/09
to nanoc
Hi there,

I'm quite new to both Ruby and nanoc, so please excuse my bad
coding style; I'm still learning :D

I'm working on a rewrite of my site (currently using oddmuse wiki
engine). My new site design has two major navigation methods: tag
clouds (home page and each internal page) and "similar pages" (each
internal page), so I'm heavily relying on the Tagging helper.

For the "similar pages" I'm using this code (I hope it can help
someone out there!):
http://pastie.textmate.org/579382

Still thinking about the tag cloud implementation; maybe I'll follow
this approach (rake task, then render from the layout):
http://www.remerson.plus.com/articles/nanoc-tags/

... but I'm open to suggestions :)

bye!

Gilberto "Velenux" Ficara

JLF

Aug 11, 2009, 7:22:37 AM8/11/09
to nanoc
Ciao Gilberto,

On Aug 11, 2:59 am, Velenux <g.fic...@gmail.com> wrote:
> For the "similar pages" I'm using this code (I hope it could help
> someone out there!): http://pastie.textmate.org/579382

The similarity measurement you proposed is not very effective, for two
main reasons:

1. Articles with a large number of tags will often appear very similar;
2. Some tags are very common, so their absence or presence carries
little meaning.

You could first improve your calculation by dividing the size of the
tag intersection by the sum of the two tag lists' lengths. Note the
.to_f below: without it, Ruby performs integer division and the result
is always zero:

similar_pages[p.path] = (@page.tags & p.tags).length.to_f /
( @page.tags.size + p.tags.size )
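
As a quick self-contained illustration of that normalized overlap
(tag_overlap is a made-up name for this example, and I've used .to_f
to avoid integer division):

```ruby
# Normalized tag overlap: shared tags divided by the combined tag count.
# Illustrative sketch only; not the code from the pastie.
def tag_overlap(tags_a, tags_b)
  total = tags_a.size + tags_b.size
  return 0.0 if total.zero?
  (tags_a & tags_b).length.to_f / total
end

puts tag_overlap(%w[a b c d e], %w[a b c])  # 3 shared out of 8 tags
puts tag_overlap(%w[x], %w[a b c])          # no overlap
```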

But I would rather use a Vector Space Model with tf-idf weighting:
http://en.wikipedia.org/wiki/Vector_space_model

I am implementing such a solution this week; if you can wait, I'll try
to post my code this weekend.
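
For readers unfamiliar with the model, a minimal tf-idf plus
cosine-similarity sketch over tag lists might look like this (all
names here are mine, not the helper posted later in this thread):

```ruby
# Minimal vector space model over tag lists with tf-idf weighting.
# Purely illustrative; not the helper from this thread.

# Inverse document frequency: rare tags get a higher weight.
def idf(tag, docs)
  df = docs.count { |d| d.include?(tag) }
  Math.log(docs.size.to_f / df)
end

# Build a tf-idf vector (as a Hash) for one item's tag list.
def tfidf_vector(tags, docs)
  tags.each_with_object({}) do |t, vec|
    tf = tags.count(t).to_f / tags.size
    vec[t] = tf * idf(t, docs)
  end
end

# Cosine similarity between two sparse vectors.
def cosine(v1, v2)
  dot = v1.sum { |k, w| w * (v2[k] || 0.0) }
  n1 = Math.sqrt(v1.values.sum { |w| w * w })
  n2 = Math.sqrt(v2.values.sum { |w| w * w })
  return 0.0 if n1.zero? || n2.zero?
  dot / (n1 * n2)
end

docs = [%w[ruby nanoc tags], %w[ruby rails], %w[nanoc tags cloud]]
v0 = tfidf_vector(docs[0], docs)
v2 = tfidf_vector(docs[2], docs)
puts cosine(v0, v2)  # the shared tags "nanoc" and "tags" drive similarity
```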

> Still thinking about the tag cloud implementation, maybe I'll follow
> this way (rake task and then render from layout): http://www.remerson.plus.com/articles/nanoc-tags/

With Nanoc3 (in beta, but clean enough to try), you can generate tag
pages in memory using the 'preprocess' section of the 'Rules' file.
Something like:

http://pastie.org/579803

I am currently writing a new tagging helper with more features, like:
- finding all items with a certain tag.
- finding all the tags existing in the site or in an item array.
- counting tag occurrences in the site or in an item array.
- generating a tag cloud.
- etc.

I will post the result here, but if people are interested in
collaborating, I can put the code on GitHub.

Cheers,

jl

Gilberto Ficara

Aug 11, 2009, 8:17:36 AM8/11/09
to na...@googlegroups.com
Il giorno mar, 11/08/2009 alle 04.22 -0700, JLF ha scritto:
> Ciao Gilberto,
>
> On Aug 11, 2:59 am, Velenux <g.fic...@gmail.com> wrote:
> > For the "similar pages" I'm using this code (I hope it could help
> > someone out there!): http://pastie.textmate.org/579382
>
> The similarity measurement you proposed is not very effective, for two
> main reasons:
>
> 1. Articles having a large number of tags will often be very similar;
> 2. Some tags are very common and thus their absence or presence carry
> litte meaning.

thank you for pointing this out!


> But I would rather use a Vector Space Model with Tf-Idf weighting
> http://en.wikipedia.org/wiki/Vector_space_model
>
> I am implementing such solution this week, if you can wait, I'll try
> to post my code this week-end.

Wow! That looks a bit like overkill for my little, humble site, but as
a general solution it looks like The Right Thing To Do[tm]. I'm in no
hurry, so I'll be glad to try your code when you post it!! :)


> > Still thinking about the tag cloud implementation, maybe I'll follow
> > this way (rake task and then render from layout): http://www.remerson.plus.com/articles/nanoc-tags/
>
> With Nanoc3 (in beta but clean enough to be tried), you can generate
> tag pages in memory using the 'preprocessing' section of the 'Rules'
> file. Something like:
>
> http://pastie.org/579803

Interesting! Looks like I should give nanoc3 a try!


> I am currently writing a new tagging helper with more features like:
> - finding all item with a certain tag.
> - finding all the tags existing in the site or in an item array.
> - Counting tag occurencies in the site or in an item array.
> - Generating a tag cloud.
> - etc.

that would be just perfect!! I can't wait to see the code and try it :)


Velenux

JLF

Aug 11, 2009, 1:03:23 PM8/11/09
to nanoc
> that would be just perfect!! I can't wait to see the code and try it :)

Here is the first draft:

http://pastie.org/580199

I tested it with my prototype site and it seemed to work, but I didn't
have enough time to write proper tests.

The functions included allow you to retrieve the tag set from a bunch
of items, or to select the items carrying a given tag.

The last two functions let you count tag occurrences and classify tags
by occurrence into a given number of classes. That should make it easy
to build a tag cloud. I will post an example soon.
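
The counting-and-classing idea can be sketched as follows (a rough
illustration with made-up names, not the pastie's code):

```ruby
# Count tag occurrences across a list of tag arrays, then assign each
# tag to one of `n_classes` size classes (e.g. font sizes for a cloud).
# Illustrative sketch only.
def tag_counts(tag_lists)
  tag_lists.flatten.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
end

def tag_classes(counts, n_classes)
  min, max = counts.values.minmax
  range = [max - min, 1].max
  counts.transform_values do |c|
    # Linear mapping of the count into 0..(n_classes - 1).
    (((c - min) * (n_classes - 1)) / range.to_f).round
  end
end

counts = tag_counts([%w[ruby nanoc], %w[ruby tags], %w[ruby]])
puts counts.inspect          # "ruby" occurs 3 times, the others once
puts tag_classes(counts, 5)  # "ruby" lands in the top class
```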

Gilberto Ficara

Aug 12, 2009, 3:15:06 AM8/12/09
to na...@googlegroups.com
Il giorno mar, 11/08/2009 alle 10.03 -0700, JLF ha scritto:
> > that would be just perfect!! I can't wait to see the code and try it :)
>
> Here is the first draft:
>
> http://pastie.org/580199
>

At line 30:

def items_with_tag( tag, items=0)

should it be items=nil instead?

I didn't try the code yet, just spotted that while reading it...

velenux

JLF

Aug 12, 2009, 12:52:11 PM8/12/09
to nanoc
On Aug 12, 9:15 am, Gilberto Ficara <g.fic...@gmail.com> wrote:
> should be items=nil instead?

Yes, indeed!

I finished the first draft of the similarity helper:

http://pastie.org/581515

I just tested it with an easy case and with my website prototype. It
seems to work well, but use it with caution.

Ruby is slow for this kind of computation, so with the current
implementation website compilation can be sluggish on big sites
(hundreds of items and hundreds of tags); that should not be a problem
for small sites. If it turns out to be a problem, I see room for
optimization: for example, the whole similarity matrix could be
computed during preprocessing, and the similar_items method could then
simply look it up in the matrix.
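
That optimization could be sketched like this (the names and the
similarity function here are mine, purely illustrative):

```ruby
# Sketch of the precomputation idea: build the full similarity matrix
# once, then answer similar-item queries with a cheap hash lookup.
# Uses a simple Jaccard-style measure, not the helper from the pastie.
def build_similarity_matrix(tag_lists)
  sim = lambda do |a, b|
    union = (a | b).length
    union.zero? ? 0.0 : (a & b).length.to_f / union
  end
  matrix = {}
  tag_lists.each_with_index do |a, i|
    tag_lists.each_with_index do |b, j|
      matrix[[i, j]] = sim.call(a, b)
    end
  end
  matrix
end

lists = [%w[a b c], %w[a b], %w[x y]]
matrix = build_similarity_matrix(lists)
puts matrix[[0, 1]]  # lookup instead of recomputation at render time
```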

Ciao,

jl

Gilberto Ficara

Aug 25, 2009, 5:51:32 PM8/25/09
to na...@googlegroups.com
Il giorno mer, 12/08/2009 alle 09.52 -0700, JLF ha scritto:
> On Aug 12, 9:15 am, Gilberto Ficara <g.fic...@gmail.com> wrote:
> I just tested it with an easy case and with my website prototype. It
> seems to work well but use it with caution.
>

hi!

I finally had some spare time to test the module. I made a small change
to the code, since the Similarity module had a problem with some tags
("can't compare fixnum with string")... in TaggingExtra, line 14:

item[:tags].each { |tag| tags << tag.to_s }

I'm using it this way:
http://pastie.org/594691

it works like a charm :)


> In its current implementation, ruby is slow for this kind of
> computation. Therefore the website compilation can be sluggish with
> big sites (hundred of items and hundred of tags), but that should not
> be a problem with small sites. If it turns to be a problem, I see room
> for optimization, for example the whole similarity matrix can be
> computed during preprocessing and the similar_items method can do a
> look-up in the matrix.

yep, it's quite slow, ranging from 1.5 to 3.5s per page, but I think
that's ok since it's a static site generator ;)

thank you!

Gilberto


Denis Defreyne

Aug 29, 2009, 5:33:04 AM8/29/09
to na...@googlegroups.com
On 12 Aug 2009, at 18:52, JLF wrote:

> I finish the first draft of the similarity helper:
>
> http://pastie.org/581515

Hi,

Cool! There is one mistake around line 60 in make_vector, though:
there is a line that says

tags = tag_set().sort

but the implementation of #tag_set is missing. Should it be something
like this?

def tag_set
  @items.map { |i| i[:tags] }.flatten.uniq
end

Otherwise, this is quite nice!

> I just tested it with an easy case and with my website prototype. It
> seems to work well but use it with caution.
>
> In its current implementation, ruby is slow for this kind of
> computation. Therefore the website compilation can be sluggish with
> big sites (hundred of items and hundred of tags), but that should not
> be a problem with small sites. If it turns to be a problem, I see room
> for optimization, for example the whole similarity matrix can be
> computed during preprocessing and the similar_items method can do a
> look-up in the matrix.

Yeah, there are probably quite a few redundant computations going on
here. Memoization should be able to speed up the process quite a bit I
think. Still, for usage in a simple site it's probably fast enough
already.

Regards,

Denis

--
Denis Defreyne
denis.d...@stoneship.org

Denis Defreyne

Aug 29, 2009, 5:39:24 AM8/29/09
to na...@googlegroups.com
On 12 Aug 2009, at 18:52, JLF wrote:

> I finish the first draft of the similarity helper:
>
> http://pastie.org/581515

Hi,

I've tried a few more examples of the similarity helper and I believe
that I've stumbled on a case where the result is incorrect.

I have three items, with the following sets of tags:

1. [ 'a', 'b', 'c', 'd', 'e' ]
2. [ 'a', 'b', 'c' ]
3. [ 'c', 'd', 'e' ]

Item 2 has three tags which are in Item 1's tags. Similarly, Item 3
has three tags which also are in Item 1's tags. However, when I check
the similarity between Item 1 and 2, and Item 1 and 3, I get the
following results respectively:

0.929038883389869
0.453133677544008

This would mean that Item 3 is only half as similar to Item 1 as Item
2 is similar to Item 1, which is incorrect. I haven't taken a good
look at the mathematics involved; I just wanted to let you know.

Denis Defreyne

Aug 29, 2009, 6:20:03 AM8/29/09
to na...@googlegroups.com
Hi,

Here's my take on the similarity helper: something completely
different from the helper mentioned earlier in this thread. My helper
is a small function that intersects two arrays (of tags, or anything
else) and calculates the atan of the intersection size, scaled between
0.0 and 1.0:

def similarity_between(vec1, vec2, exp=0.8)
  x = (vec1 & vec2).size
  (Math.atan(x) / (Math::PI/2.0))**exp
end

For the following tags:

[
  [ 'a', 'b', 'c', 'd', 'e' ],
  [ 'a', 'b', 'c', 'd' ],
  [ 'a', 'b', 'c' ],
  [ 'c', 'd', 'e' ],
  [ 'd', 'e' ],
  [ 'd' ],
  [ 'x' ]
]

… it gives this result (with exp=0.8):

0.898135632446692
0.873154585249893
0.832466530062486
0.832466530062486
0.755907870987557
0.574349177498517
0.0

The "exp" parameter defines how quickly arrays are considered
similar. For low values (e.g. 0.3), arrays become similar very
rapidly. For higher values (e.g. 0.9), arrays are only considered
similar when they share a large number of elements. Graph the "y =
(Math.atan(x) / (Math::PI/2.0))**exp" function and you'll see how it
works.
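
To make the effect of exp concrete, here is the same function with a
tiny experiment (my own demo, not part of the helper):

```ruby
# Denis's similarity function, reproduced for a quick experiment.
def similarity_between(vec1, vec2, exp=0.8)
  x = (vec1 & vec2).size
  (Math.atan(x) / (Math::PI/2.0))**exp
end

shared_two = %w[a b]
# A low exponent saturates quickly: two shared tags already score high.
puts similarity_between(shared_two, shared_two, 0.3)
# A high exponent is stricter: the same overlap scores lower.
puts similarity_between(shared_two, shared_two, 0.9)
```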

Note that two identical arrays don't have a similarity of 1.0. This
makes sense because a finite array is not complete in the sense that
there will always be tags missing; you can never fully describe an
article with a finite set of tags. Therefore, the more tags, the more
similar two arrays can be (or rather, the similarity stays the same
but the accuracy increases with the length of the intersection). The
similarity between [ 'a' ] and [ 'a' ] should be lower than the
similarity between [ 'a', 'b', 'c' ] and [ 'a', 'b', 'c' ].

At the moment the function only checks the number of shared array
elements ((vec1 & vec2).size). The function should probably also take
into account the total number of tags specified and therefore also the
number of elements in one array but not in the other.

Improvements are welcome. :)

Denis Defreyne

Aug 29, 2009, 7:22:25 AM8/29/09
to na...@googlegroups.com
Hi,

Slightly modified version that also takes into account the number of
non-shared elements between the array:

def similarity_between(vec1, vec2, exp=0.8, exp2=0.9)
  shared_elems = vec1 & vec2
  nonshared_elems = (vec1 | vec2) - shared_elems

  a = (Math.atan(shared_elems.size) / (Math::PI/2.0))**exp
  b = (Math.atan(nonshared_elems.size) / (Math::PI/2.0))**exp2

  (a-b)/2.0 + 0.5
end

For the following tags:

[
  [ 'a', 'b', 'c', 'd', 'e' ],
  [ 'a', 'b', 'c', 'd', 'z' ],
  [ 'a', 'b', 'c', 'd' ],
  [ 'a', 'b', 'c' ],
  [ 'c', 'd', 'e' ],
  [ 'd', 'e' ],
  [ 'd' ],
  [ 'x' ]
]

it now gives this result (with exp=0.8 and exp2=0.9):

0.949067816223346
0.57161543346498
0.668633926990873
0.551271405871277
0.551271405871277
0.4711523532727
0.357937208397694
0.0475704750914407

Note how non-shared elements have an impact: replacing the 'e' with a
'z' in the first list of tags seriously lowers the similarity. The 3rd
set of tags (without the 'e') now scores higher than the 2nd set (with
'z' instead of 'e'). Playing with the weights (exp and exp2) may also
be useful; they control how quickly items become similar because of
shared elements (exp) or dissimilar because of non-shared elements
(exp2).

Thanks to apeiros for his valuable feedback and contributions.

Denis Defreyne

Aug 29, 2009, 7:25:56 AM8/29/09
to na...@googlegroups.com
On 29 Aug 2009, at 13:22, Denis Defreyne wrote:

> For the following tags: [..] it now gives this result (with exp=0.8
> and exp2=0.9):

I should have mentioned that those values are calculated between each
item from the list and the array [ 'a', 'b', 'c', 'd', 'e' ], i.e.
they are calculated like this:

tag_sets.each do |tags|
  puts similarity_between(tag_sets[0], tags)
end

with tag_sets being the array of arrays containing the tags.

Gilberto Ficara

Aug 30, 2009, 6:31:44 PM8/30/09
to na...@googlegroups.com
Il giorno mar, 25/08/2009 alle 23.51 +0200, Gilberto Ficara ha scritto:
> Il giorno mer, 12/08/2009 alle 09.52 -0700, JLF ha scritto:
> I finally had some spare time to test the module, I made a small change
> to the code since Similarity module had a problem with some tags ("can't
> compare fixnum with string")... in TaggingExtra, row 14:
>
> item[:tags].each { |tag| tags << tag.to_s }

I made another change in TaggingExtra, line 73:

ranks[t.to_s] = rank.to_i

I added the .to_s because some of my tags kept getting parsed as
integers instead of strings :)

The code with all the changes I applied is at http://pastie.org/599864

Since I didn't want to build the tag cloud for each page, I put the
generation code in a separate layout and rendered it a single time in
the Rules file's preprocess block, adding the resulting HTML to each
item: http://pastie.org/599853

I can't say why, but the similarity calculations have slowed down a
lot; I'll check where I made changes... maybe I'll check your code
too, Denis! :)

Gilberto

Gilberto Ficara

Sep 5, 2009, 5:46:26 PM9/5/09
to na...@googlegroups.com
Wow,
I finally managed to try your code too, Denis! :)

I moved all the computations into the Rules preprocess block; it's
quite fast and accurate!

So, here's the code: http://pastie.org/607246

Gilberto

datagator

Oct 8, 2009, 12:32:36 PM10/8/09
to nanoc
Wow, this thread is very computational. I lifted some code from the
Jekyll project for related_posts and turned it into related_items. I'm
using it on a site to get the items related to the current item.

Here's the code. I just stuck it into a file called related_items.rb
in my site's lib directory:
http://pastie.org/647081

Here's a tiny example showing how I use it.
<div id="related">
  <% @item.related_items(@site.items.select { |p| p.attributes[:type] == 'article' }).each do |a| %>
    <div><%= a[:title] %></div>
  <% end %>
</div>

You need to install the classifier gem to use this; classifier
recommends GNU GSL and rb-gsl to improve performance by about 10x.
Even without them, only the first page compile is slower; once the LSI
index is built, the rest compiles at normal nanoc super speed.

In the code, I used the raw_content rather than a content snapshot
because I couldn't seem to get around the circular dependencies
otherwise. The raw_content is good enough for what I'm doing now, so I
went with it.

Being able to display related posts was the one thing I really liked
about Jekyll, but after trying to use Jekyll with multiple categories,
I found nanoc more flexible for my needs. Bringing related posts to
nanoc is something I hope others will enjoy.

Harry

Denis Defreyne

Oct 11, 2009, 7:26:17 AM10/11/09
to na...@googlegroups.com
On 08 Oct 2009, at 18:32, datagator wrote:

> Wow, this thread is very computational. I lifted some code from the
> Jekyll project for related_posts and turned it into related_items. I'm
> using it for a site to get related items to the current item.
>
> Here's the code. I just stuck it into a file I called related_items.rb
> in my site's lib directory

> http://pastie.org/647081 [..]

Hi,

Ahh, this is another cool approach. I didn't know about the classifier
gem.

I would recommend against modifying Nanoc3::Item directly. If you can,
try to put the code into a helper module instead. The
Nanoc3::Item#related_items method could then become
Nanoc3::Helpers::RelatedItems#items_related_to(item, items_to_search).
It would functionally be the same, but it has a smaller chance of
breaking when Nanoc3::Item's internals change.

I'm not sure why you check whether self.site.config[:lsi] exists.
Would it not be better to let nanoc simply throw an error if
self.site.config[:lsi] is nil?

In any case, this looks like a very useful addition, and it should
probably go on the nanoc wiki, along with the other two
implementations of a similarity/related-posts helper. If you want, you
can improve the new "Finding Similar Items" wiki page at
<http://projects.stoneship.org/trac/nanoc/wiki/Tips/FindingSimilarItems>
(it's rather empty at the moment).
