Thank you for pointing this out!
> But I would rather use a Vector Space Model with Tf-Idf weighting
> http://en.wikipedia.org/wiki/Vector_space_model
>
> I am implementing such a solution this week; if you can wait, I'll try
> to post my code this weekend.
Wow! That looks a bit overkill for my little and humble site, but as a
general solution it looks like The Right Thing To Do[tm]. I'm in no hurry,
so I'll be glad to try your code when you post it!! :)
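For readers who haven't met the approach before, a minimal sketch of tf-idf weighted cosine similarity over tag sets could look like the following. All names here are illustrative, not taken from the helper being discussed; note that with pure idf weighting, a tag that appears in every item gets a weight of zero and contributes nothing to the score.

```ruby
# Minimal sketch of tf-idf weighted cosine similarity over tag sets.
# All names are illustrative, not taken from the helper discussed above.

# idf(tag) = log(N / number of tag sets containing the tag); a tag that
# appears in every item gets an idf of 0 and contributes nothing.
def idf(tag, tag_sets)
  n = tag_sets.count { |tags| tags.include?(tag) }
  Math.log(tag_sets.size.to_f / n)
end

# Build a tf-idf vector for one tag set over the whole vocabulary
# (tag frequency within an item is 0 or 1, so tf is just membership).
def tfidf_vector(tags, vocabulary, tag_sets)
  vocabulary.map { |t| tags.include?(t) ? idf(t, tag_sets) : 0.0 }
end

# Cosine similarity between two equal-length vectors.
def cosine(v1, v2)
  dot  = v1.zip(v2).map { |a, b| a * b }.reduce(:+)
  norm = lambda { |v| Math.sqrt(v.map { |x| x * x }.reduce(:+)) }
  dot / (norm.call(v1) * norm.call(v2))
end

tag_sets   = [%w[a b c d e], %w[a b c], %w[c d e]]
vocabulary = tag_sets.flatten.uniq.sort
v1 = tfidf_vector(tag_sets[0], vocabulary, tag_sets)
v2 = tfidf_vector(tag_sets[1], vocabulary, tag_sets)
puts cosine(v1, v2)  # approximately 0.707 (= 1/sqrt(2))
```

Here the shared tag 'c' occurs in all three items, so it is downweighted to zero and the score depends only on the rarer shared tags 'a' and 'b'.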
> > Still thinking about the tag cloud implementation, maybe I'll follow
> > this way (rake task and then render from layout):http://www.remerson.plus.com/articles/nanoc-tags/
>
> With Nanoc3 (in beta but clean enough to be tried), you can generate
> tag pages in memory using the 'preprocessing' section of the 'Rules'
> file. Something like:
>
> http://pastie.org/579803
Interesting! Looks like I should give nanoc3 a try!
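For the record, here's a rough reconstruction of the idea, not the original pastie; the exact nanoc3 calls should be double-checked against the nanoc3 docs.

```ruby
# Rough reconstruction of the "tag pages in memory" idea, not the
# original pastie; the exact nanoc3 API (Nanoc3::Item.new, the
# preprocess block) should be double-checked against the nanoc3 docs.

# Collect every tag used across a set of item attribute hashes.
def all_tags(items)
  items.map { |item| item[:tags] || [] }.flatten.uniq
end

# In the Rules file it would be used roughly like this:
#
#   preprocess do
#     all_tags(items).each do |tag|
#       items << Nanoc3::Item.new(
#         "Pages tagged '#{tag}' ...",           # content, or a render call
#         { :title => "Pages tagged '#{tag}'" },
#         "/tags/#{tag}/"
#       )
#     end
#   end
```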
> I am currently writing a new tagging helper with more features, like:
> - finding all items with a certain tag.
> - finding all the tags existing in the site or in an item array.
> - counting tag occurrences in the site or in an item array.
> - generating a tag cloud.
> - etc.
That would be just perfect!! I can't wait to see the code and try it :)
Velenux
@line 30:
  def items_with_tag( tag, items=0)
should that be items=nil instead?
I didn't try the code yet, just spotted that while reading it...
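A sketch of what the suggested fix might look like (hypothetical; the actual method body lives in the pastie):

```ruby
# Hypothetical reconstruction with the suggested items=nil default:
# when no item array is passed, fall back to the site-wide @items.
def items_with_tag(tag, items = nil)
  items ||= @items
  items.select { |item| (item[:tags] || []).include?(tag) }
end
```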
velenux
hi!
I finally had some spare time to test the module. I made a small change
to the code, since the Similarity module had a problem with some tags
("can't compare fixnum with string")... in TaggingExtra, row 14:
  item[:tags].each { |tag| tags << tag.to_s }
I'm using it this way:
http://pastie.org/594691
it works like a charm :)
> In its current implementation, ruby is slow for this kind of
> computation. Therefore the website compilation can be sluggish with
> big sites (hundreds of items and hundreds of tags), but that should not
> be a problem with small sites. If it turns out to be a problem, I see
> room for optimization; for example, the whole similarity matrix can be
> computed during preprocessing and the similar_items method can do a
> look-up in the matrix.
Yep, it's quite slow, ranging from 1.5 to 3.5 s per page, but I think
that's ok since it's a static site generator ;)
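The precomputed-matrix optimization mentioned in the quote could be sketched like this; the pairwise measure below is just a stand-in, not the helper's actual one.

```ruby
# Stand-in pairwise measure (NOT the helper's actual one): arctangent
# of the shared-tag count, scaled to 0..1.
def similarity(a, b)
  shared = (a[:tags] & b[:tags]).size
  (Math.atan(shared) / (Math::PI / 2.0))**0.8
end

# Build the full symmetric matrix once (e.g. in the preprocess block),
# keyed by item identifier; similar_items then becomes a cheap lookup.
def build_similarity_matrix(items)
  matrix = Hash.new { |h, k| h[k] = {} }
  items.combination(2) do |a, b|
    s = similarity(a, b)
    matrix[a[:identifier]][b[:identifier]] = s
    matrix[b[:identifier]][a[:identifier]] = s
  end
  matrix
end
```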
thank you!
Gilberto
> I finished the first draft of the similarity helper:
>
> http://pastie.org/581515
Hi,
Cool! There is one mistake around line 60 in make_vector, though:
there is a line that says
  tags = tag_set().sort
but the implementation of #tag_set is missing. Should it be something
like this?
  def tag_set
    @items.map { |i| i[:tags] }.flatten.uniq
  end
Otherwise, this is quite nice!
> I just tested it with an easy case and with my website prototype. It
> seems to work well, but use it with caution.
>
> In its current implementation, ruby is slow for this kind of
> computation. Therefore the website compilation can be sluggish with
> big sites (hundreds of items and hundreds of tags), but that should not
> be a problem with small sites. If it turns out to be a problem, I see
> room for optimization; for example, the whole similarity matrix can be
> computed during preprocessing and the similar_items method can do a
> look-up in the matrix.
Yeah, there are probably quite a few redundant computations going on
here. Memoization should be able to speed up the process quite a bit I
think. Still, for usage in a simple site it's probably fast enough
already.
Regards,
Denis
--
Denis Defreyne
denis.d...@stoneship.org
> I finished the first draft of the similarity helper:
>
> http://pastie.org/581515
Hi,
I've tried a few more examples of the similarity helper and I believe
that I've stumbled on a case where the result is incorrect.
I have three items, with the following sets of tags:
1. [ 'a', 'b', 'c', 'd', 'e' ]
2. [ 'a', 'b', 'c' ]
3. [ 'c', 'd', 'e' ]
Item 2 has three tags which are in Item 1's tags. Similarly, Item 3
has three tags which also are in Item 1's tags. However, when I check
the similarity between Item 1 and 2, and Item 1 and 3, I get the
following results respectively:
0.929038883389869
0.453133677544008
This would mean that Item 3 is only half as similar to Item 1 as Item
2 is similar to Item 1, which is incorrect. I haven't taken a good
look at the mathematics involved; I just wanted to let you know.
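For what it's worth, a quick probe with plain, unweighted cosine similarity on binary tag vectors gives identical scores for both pairs, which suggests the asymmetry comes from the tf-idf weighting (common tags like 'c' being downweighted) rather than from the cosine measure itself. This probe is not the helper's implementation:

```ruby
# Plain cosine similarity on binary tag vectors, as a sanity check of
# the numbers above (NOT the helper's implementation). The dot product
# of two binary vectors is just the size of the tag intersection.
def binary_cosine(tags1, tags2)
  shared = (tags1 & tags2).size
  shared / (Math.sqrt(tags1.size) * Math.sqrt(tags2.size))
end

item1 = %w[a b c d e]
item2 = %w[a b c]
item3 = %w[c d e]
puts binary_cosine(item1, item2)  # approximately 0.7746
puts binary_cosine(item1, item3)  # approximately 0.7746 -- identical
```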
Here's my take on the similarity helper--something completely
different from the helper mentioned before in this thread. My helper
is a small function that performs an array intersection on two arrays
(of tags or anything else) and scales the arctangent of the
intersection size to between 0.0 and 1.0:
def similarity_between(vec1, vec2, exp=0.8)
  x = (vec1 & vec2).size
  (Math.atan(x) / (Math::PI/2.0))**exp
end
For the following tags:
[
  [ 'a', 'b', 'c', 'd', 'e' ],
  [ 'a', 'b', 'c', 'd' ],
  [ 'a', 'b', 'c' ],
  [ 'c', 'd', 'e' ],
  [ 'd', 'e' ],
  [ 'd' ],
  [ 'x' ]
]
… it gives this result (with exp=0.8):
0.898135632446692
0.873154585249893
0.832466530062486
0.832466530062486
0.755907870987557
0.574349177498517
0.0
The "exp" parameter defines how quickly arrays are considered
similar. For low values (e.g. 0.3), arrays become similar very
rapidly. For higher values (e.g. 0.9), arrays are only considered
similar when they share a large number of elements. Graph the "y =
(Math.atan(x) / (Math::PI/2.0))**exp" function and you'll see how it
works.
Note that two identical arrays don't have a similarity of 1.0. This
makes sense because a finite array is not complete in the sense that
there will always be tags missing; you can never fully describe an
article with a finite set of tags. Therefore, the more tags, the more
similar two arrays can be (or rather, the similarity stays the same
but the accuracy increases with the length of the intersection). The
similarity between [ 'a' ] and [ 'a' ] should be lower than the
similarity between [ 'a', 'b', 'c' ] and [ 'a', 'b', 'c' ].
At the moment the function only checks the number of shared array
elements ((vec1 & vec2).size). The function should probably also take
into account the total number of tags specified, and therefore also the
number of elements that are in one array but not in the other.
Improvements are welcome. :)
Here is a slightly modified version that also takes into account the
number of non-shared elements between the arrays:
def similarity_between(vec1, vec2, exp=0.8, exp2=0.9)
  shared_elems = vec1 & vec2
  nonshared_elems = (vec1 | vec2) - shared_elems
  a = (Math.atan(shared_elems.size) / (Math::PI/2.0))**exp
  b = (Math.atan(nonshared_elems.size) / (Math::PI/2.0))**exp2
  (a-b)/2.0 + 0.5
end
For the following tags:
[
  [ 'a', 'b', 'c', 'd', 'e' ],
  [ 'a', 'b', 'c', 'd', 'z' ],
  [ 'a', 'b', 'c', 'd' ],
  [ 'a', 'b', 'c' ],
  [ 'c', 'd', 'e' ],
  [ 'd', 'e' ],
  [ 'd' ],
  [ 'x' ]
]
It now gives this result (with exp=0.8 and exp2=0.9):
0.949067816223346
0.57161543346498
0.668633926990873
0.551271405871277
0.551271405871277
0.4711523532727
0.357937208397694
0.0475704750914407
Note how non-shared elements have an impact: replacing the 'e' with a
'z' in the first list of tags seriously lowers the similarity. The 3rd
set of tags (without 'e') now scores higher than the 2nd set of tags.
Playing with the weights (exp and exp2) may also be useful; they
control how quickly items become similar because of shared elements
(exp) or dissimilar because of non-shared elements (exp2).
Thanks to apeiros for his valuable feedback and contributions.
> For the following tags: [..] it now gives this result (with exp=0.8
> and exp2=0.9):
I should have mentioned that those values are calculated between each
item from the list and the array [ 'a', 'b', 'c', 'd', 'e' ], i.e. it
is calculated like this:
tag_sets.each do |tags|
  puts similarity_between(tag_sets[0], tags)
end
with tag_sets being the array of arrays containing the tags.
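For completeness, here is the whole thing as one runnable script; the function is copied verbatim from the post above, and its output matches the numbers listed there:

```ruby
# similarity_between, copied verbatim from the post above.
def similarity_between(vec1, vec2, exp = 0.8, exp2 = 0.9)
  shared_elems    = vec1 & vec2
  nonshared_elems = (vec1 | vec2) - shared_elems
  a = (Math.atan(shared_elems.size)    / (Math::PI / 2.0))**exp
  b = (Math.atan(nonshared_elems.size) / (Math::PI / 2.0))**exp2
  (a - b) / 2.0 + 0.5
end

tag_sets = [
  %w[a b c d e],
  %w[a b c d z],
  %w[a b c d],
  %w[a b c],
  %w[c d e],
  %w[d e],
  %w[d],
  %w[x]
]

# Compare every tag set against the first one, as in the thread.
tag_sets.each do |tags|
  puts similarity_between(tag_sets[0], tags)
end
```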
I made another change in TaggingExtra, row 73:
  ranks[t.to_s] = rank.to_i
I added the .to_s because some of my tags kept getting parsed as
integers instead of strings :)
The code with all the changes I applied is at http://pastie.org/599864
Since I didn't want to build the tag cloud for each page, I put the
generation code in one separate layout and rendered it one single time
in the Rules file preprocess directive, adding the resulting html to
each item: http://pastie.org/599853
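In case the pastie goes away, the shape of the trick is roughly this (a reconstruction, not the actual code; the cloud markup below is made up):

```ruby
# Reconstruction of the render-once idea (the pastie is authoritative).
# Build the cloud HTML a single time, then copy the same string into
# every item's attributes so layouts can just emit it.
def tag_cloud_html(tag_counts)
  max = tag_counts.values.max
  links = tag_counts.sort.map do |tag, count|
    size = 80 + (120 * count / max)  # font size between 80% and 200%
    %(<span style="font-size: #{size}%">#{tag}</span>)
  end
  "<div class=\"tag-cloud\">#{links.join(' ')}</div>"
end

# In the Rules preprocess directive, roughly:
#
#   preprocess do
#     counts = Hash.new(0)
#     items.each { |i| (i[:tags] || []).each { |t| counts[t.to_s] += 1 } }
#     cloud = tag_cloud_html(counts)
#     items.each { |i| i[:tag_cloud] = cloud }
#   end
```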
I can't say why, but the similarity calculations have slowed down a
lot; I'll check to see where I made changes... maybe I'll check your
code too, Denis! :)
Gilberto
I moved all the computations into the Rules preprocess directive; it's
quite fast and accurate!
So, here's the code: http://pastie.org/607246
Gilberto
> Wow, this thread is very computational. I lifted some code from the
> Jekyll project for related_posts and turned it into related_items. I'm
> using it for a site to get related items to the current item.
>
> Here's the code. I just stuck it into a file I called related_items.rb
> in my site's lib directory
> http://pastie.org/647081 [..]
Hi,
Ahh, this is another cool approach. I didn't know about the classifier
gem.
I would recommend against modifying Nanoc3::Item directly. If you can,
try to put the code into a helper module instead. The
Nanoc3::Item#related_items method could then become
Nanoc3::Helpers::RelatedItems#items_related_to(item, items_to_search).
It would functionally be the same, but it has a smaller chance of
breaking when Nanoc3::Item's internals change.
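A sketch of the module shape being suggested here; the method name follows the proposal, but the body is an illustrative tag-overlap ranking, not the classifier-based implementation from the pastie:

```ruby
# Sketch of the helper-module shape suggested above. The method name
# follows the proposal; the body is an illustrative tag-overlap
# ranking, NOT the classifier-based implementation from the pastie.
module Nanoc3
  module Helpers
    module RelatedItems
      # Return the items sharing at least one tag with `item`,
      # most-overlapping first.
      def items_related_to(item, items_to_search)
        items_to_search
          .reject  { |other| other == item }
          .map     { |other| [other, ((other[:tags] || []) & (item[:tags] || [])).size] }
          .select  { |_, shared| shared > 0 }
          .sort_by { |_, shared| -shared }
          .map(&:first)
      end
    end
  end
end
```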
I'm not sure why you check whether self.site.config[:lsi] exists.
Would it not be better to let nanoc simply throw an error if
self.site.config[:lsi] is nil?
In any case, this looks like a very useful addition, and it should
probably go on the nanoc wiki, along with the other two
implementations of a similarity/related-posts helper. If you want, you
can improve the new "Finding Similar Items" wiki page at
<http://projects.stoneship.org/trac/nanoc/wiki/Tips/FindingSimilarItems>
(it's rather empty at the moment).