CiteULike dataset

谷文栋

Nov 25, 2009, 5:14:25 AM

to re...@googlegroups.com

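CiteULike recently launched a recommender system and has published two blog posts about it: http://blog.citeulike.org/?p=11 and http://blog.citeulike.org/?p=136.
They had also opened up a dataset earlier: http://www.citeulike.org/faq/data.adp.
Is anyone in the group doing research on it? If so, could you share a basic overview of the dataset?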

王立才

Nov 25, 2009, 8:40:17 PM

to resys

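What does CiteULike use to compute your nearest neighbors? And which similarity measures does it use?

I looked at the tag-RS work in Res2009, and it seems 大司 has studied this area in depth. Could you elaborate on it?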

王立才

Nov 26, 2009, 1:38:58 AM

to resys

If nearest neighbors are determined solely by the number of articles two users both follow, isn't that a bit crude?
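
To make the concern concrete, here is a minimal sketch comparing a raw shared-article count with two normalized measures (Jaccard and cosine over article sets). It is not CiteULike's actual method, which this thread does not describe; the user names and article IDs are made up for illustration.

```python
import math

def overlap(a, b):
    """Raw number of articles two users have in common."""
    return len(a & b)

def jaccard(a, b):
    """Shared articles divided by all articles either user has."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine similarity between two binary article vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical libraries: sets of article IDs.
target = {1, 2, 3, 4}
heavy_user = set(range(1, 101))   # posts everything, shares 4 with target
focused_user = {1, 2, 3, 5}       # small library, shares 3 with target

for name, other in [("heavy_user", heavy_user), ("focused_user", focused_user)]:
    print(name,
          overlap(target, other),
          round(jaccard(target, other), 3),
          round(cosine(target, other), 3))
```

The raw count ranks the heavy user first, while both normalized measures prefer the focused user, which is one reason a plain co-occurrence count can be misleading.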

huizi liang

Nov 26, 2009, 2:37:27 AM

to re...@googlegroups.com

Thanks for this information! That's very helpful:)

I'm doing research very similar to the author's: using tags for item recommendation.

I'm looking forward to reading the thesis :)

Basically, I think tags contain a lot of noise, so it's better to deal with the noise first. Then, the relationships among users, items and tags are three-dimensional, so how to make use of these relationships for recommendation is very important. It's very natural to do content filtering based on tags.

I published one paper at WI09, titled "Personalized recommender system integrating social tags and item taxonomy". I did not use the CiteULike dataset with that approach; the new approach does also use the CiteULike dataset. Basically, it is very sparse.

Is anyone doing research on tags, whether tag recommendation or item recommendation based on tags? Would you please follow this topic? (I'd like to have a discussion.) Thanks.

xlvector

Nov 26, 2009, 3:25:38 AM

to Resys

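We are currently using this dataset. It has roughly 5,000 active users. The baseline is around 17% to 19%; I mean recall.

With tags you can get to about 21%. If anyone can get above 25%, that is basically publishable, haha.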
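
For readers unfamiliar with the metric, here is a minimal sketch of recall over a top-N recommendation list, averaged across users. The exact evaluation protocol behind the numbers above (value of N, how the test split was made) is not stated in this thread, so the function names and the protocol below are assumptions.

```python
def recall_at_n(recommended, held_out, n):
    """Fraction of a user's held-out items found in the top-n recommendations."""
    if not held_out:
        return 0.0
    hits = len(set(recommended[:n]) & set(held_out))
    return hits / len(held_out)

def mean_recall_at_n(recs_by_user, held_out_by_user, n):
    """Average recall@n over users that have at least one held-out item."""
    users = [u for u, items in held_out_by_user.items() if items]
    if not users:
        return 0.0
    total = sum(
        recall_at_n(recs_by_user.get(u, []), held_out_by_user[u], n)
        for u in users
    )
    return total / len(users)

# Hypothetical data: ranked recommendations and hidden test items per user.
recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
held_out = {"u1": ["b", "d"], "u2": ["q"]}
print(mean_recall_at_n(recs, held_out, n=2))
```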

谷文栋

Nov 26, 2009, 3:28:58 AM

to re...@googlegroups.com

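@huizi liang, are you in Beijing? Next January, Resys will hold a dedicated IBM CRL session, and I heard from Yuan Quan that part of it should be tag-related. Yuan Quan's team are experts in this area, so you are welcome to come and PK with them then, haha.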

王立才

Nov 26, 2009, 3:37:32 AM

to resys

The topics keep coming, hehe~~ Let me sign up first.

Quan Yuan

Nov 26, 2009, 3:58:05 AM

to re...@googlegroups.com

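Our group simply started working on tags relatively early. We began in 2007 and submitted a paper to IUI 08 (Improved Recommendation based on Collaborative Tagging Behaviors), so we were probably among the first to use tags for recommendation. Because user-item(page)-tag is a classic triple structure, we later tried many methods, such as graph-based approaches and tensors, and we are now working with matrix factorization.

In terms of how tags are used, a tag can serve as input to a recommender, or as output, i.e. recommending tags to users. At the meetup after next, my colleague and I plan to give a talk on social recommenders, of which tag-based recommenders will be one part.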
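
As a small illustration of the user-item-tag triple structure, the sketch below flattens a list of triples into pairwise views (user-tag counts, item-tag counts, user-item sets). This is only a simple starting point, not the graph, tensor, or matrix factorization methods mentioned above; the function name and sample triples are hypothetical.

```python
from collections import Counter, defaultdict

def project(triples):
    """Flatten (user, item, tag) triples into three pairwise views."""
    user_tags = defaultdict(Counter)   # how often each user applied each tag
    item_tags = defaultdict(Counter)   # how often each item received each tag
    user_items = defaultdict(set)      # which items each user bookmarked
    for user, item, tag in triples:
        user_tags[user][tag] += 1
        item_tags[item][tag] += 1
        user_items[user].add(item)
    return user_tags, item_tags, user_items

# Hypothetical tagging events.
triples = [
    ("u1", "p1", "python"),
    ("u1", "p2", "python"),
    ("u2", "p1", "snake"),
]
user_tags, item_tags, user_items = project(triples)
print(dict(user_tags["u1"]), dict(item_tags["p1"]), user_items["u1"])
```

The user-tag view can feed item recommendation (tags as input), while the item-tag view is a natural basis for suggesting tags (tags as output).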

王立才

Nov 26, 2009, 7:24:42 AM

to resys

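huizi liang just mentioned that tags contain a lot of noise. I don't know much about how tag-based recommender systems compute and analyze user-item-tag data, so I'm not sure whether the following observation is related to noise.

When using CiteULike, you have to add tags yourself. I basically fill them in from keywords, the title, and my own understanding, but related concepts from one field often end up split across several tags.

For example, I follow context-aware RS and contextual user preference. My context-related tags include: context contextual context-aware context-modelling context-sensitive context-computing, and so on; my RS-related tags include recommender_systems recommender recommendation.

Does a tag-RS need to compute the similarity or relatedness between them? Should they be clustered into one class of tag? (Many tags in my tag list have a count of 1.)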
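
A rough sketch of how such variants could be collapsed, using the tags listed above. The grouping key (first token of the tag, with a tiny hand-written suffix list) is a crude heuristic chosen for illustration; a real system would more likely use a proper stemmer, and `tag_key` and `group_tags` are hypothetical names.

```python
import re
from collections import defaultdict

# Crude suffix list, checked in order; a stand-in for a real stemmer (assumption).
SUFFIXES = ("ation", "ual", "er", "s")

def tag_key(tag):
    """Map a raw tag to a coarse grouping key."""
    # Keep only the first token of compound tags such as "context-aware".
    head = re.split(r"[-_\s]+", tag.strip().lower())[0]
    for suffix in SUFFIXES:
        if head.endswith(suffix) and len(head) - len(suffix) >= 4:
            return head[: -len(suffix)]
    return head

def group_tags(tags):
    """Group raw tags that share the same coarse key."""
    groups = defaultdict(list)
    for tag in tags:
        groups[tag_key(tag)].append(tag)
    return dict(groups)

tags = [
    "context", "contextual", "context-aware", "context-modelling",
    "context-sensitive", "context-computing",
    "recommender_systems", "recommender", "recommendation",
]
for key, members in group_tags(tags).items():
    print(key, members)
```

With this heuristic the nine tags fall into two groups, keyed `context` and `recommend`. Grouping on the first token is lossy (it would also merge unrelated tags that merely start with the same word), so it only shows the idea.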

xlvector

Nov 26, 2009, 7:34:50 AM

to Resys

Much of the noise is a natural language understanding matter: word stems, synonyms, and so on.

王立才

Nov 26, 2009, 7:40:30 AM

to resys

It would be great if those tags of mine that fall under the same category could be merged into a single tag~~~

王立才

Nov 26, 2009, 7:43:02 AM

to resys

Oh, and there is also polysemy, hehe. Does that count as noise? python could be a snake or a programming language.

xlvector

Nov 26, 2009, 8:03:28 AM

to Resys

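That is a natural language understanding problem. It cannot be solved completely, but it can be partly solved. After all, the exceptions are few.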

bing wang

Nov 26, 2009, 8:29:51 AM

to re...@googlegroups.com

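Our group has just started research on tag recommendation. The main idea is to use social relationships to recommend tags to users in a social bookmarking system. We will need to learn a lot from all of you who have been at this longer.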

wiizane

Nov 26, 2009, 8:54:59 AM

to Resys

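In CiteULike, many of my tags have a count of 1. It really does look like a "long tail"~~

I also clicked through the tags of a few random users and groups, and they all show a similar distribution.

If CiteULike has made its data public, could we analyze whether each person's tags fit the "long tail theory"?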
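
A quick way to check this for one user is to count how often each tag is used and look at the share of tags used only once. The sketch below assumes the input is simply a list of that user's tag assignments; `tail_stats` and the sample data are hypothetical.

```python
from collections import Counter

def tail_stats(tag_assignments, head_size=1):
    """Summarize how skewed one user's tag usage is."""
    counts = Counter(tag_assignments)
    total = sum(counts.values())
    if total == 0:
        return {"distinct": 0, "singleton_share": 0.0, "head_share": 0.0}
    singletons = sum(1 for c in counts.values() if c == 1)
    head = sum(c for _, c in counts.most_common(head_size))
    return {
        "distinct": len(counts),                      # number of different tags
        "singleton_share": singletons / len(counts),  # tags used exactly once
        "head_share": head / total,                   # usage covered by top tags
    }

# Hypothetical tag assignments for one user.
assignments = ["recsys"] * 5 + ["context"] * 2 + ["lda", "tensor", "python"]
print(tail_stats(assignments))
```

A high singleton share together with a high head share is suggestive of a long tail, but it is not a test for a power law; that would need a proper fit over the rank-frequency curve.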

Summer

Nov 26, 2009, 9:36:26 AM

to Resys

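Before using tags, we generally stem them first. After stemming, you can run further text analysis on the tags, for example using LDA to compute each tag's distribution over latent topics. These distributions can then be used to compute the semantic similarity between two tags. This largely removes the noise caused by word variants and synonyms. Once you have tag-to-tag similarity, grouping the tags becomes simple too.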
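
The similarity step can be sketched as follows, assuming each tag already has a topic distribution from some LDA run. The three-topic vectors below are invented for illustration, cosine similarity is just one reasonable choice of measure, and `similar_tags` and the 0.9 threshold are hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity between two equal-length topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Hypothetical per-tag distributions over 3 latent topics (not real LDA output).
topic_dist = {
    "recommender": [0.8, 0.1, 0.1],
    "recommendation": [0.7, 0.2, 0.1],
    "context": [0.1, 0.8, 0.1],
}

def similar_tags(tag, topic_dist, threshold=0.9):
    """Return the other tags whose topic distribution is close to `tag`'s."""
    base = topic_dist[tag]
    return [
        other for other, dist in topic_dist.items()
        if other != tag and cosine(base, dist) >= threshold
    ]

print(similar_tags("recommender", topic_dist))
```

Tags that pass the threshold could then be merged into one group, which is the grouping step described above.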

Xiance SI(司宪策)

Nov 26, 2009, 9:23:49 PM

to re...@googlegroups.com

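I'm also doing related research, but with the focus on content-based tag recommendation, i.e. recommending tags from the text content, for example recommending tags for a blog post that is about to be published. I'll need to learn a lot from Quan Yuan and the other veterans :)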

宋辉

Nov 26, 2009, 11:31:23 PM

to re...@googlegroups.com

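Suppose we use the distribution of tags to compute the distance between them. Then we could build nodes according to the meaning of each tag: tags with the same meaning are merged into one node, and each meaning of a single tag gets its own node. When locating a tag, we can use the content to extract the specific meaning of that tag, and then look up the distances in the node graph. This partly solves the noise problem.

In that case, wouldn't the node index have to become tag + meaning? Is that the right way to think about it?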
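
One possible reading of the "tag + meaning" index, as a sketch: nodes are keyed by a (tag, sense) pair, synonyms point to the same node, and the sense is picked by overlap between the item's content words and a few hint words per sense. The node IDs, hint words, and the overlap rule are all assumptions made for illustration.

```python
# (tag, sense) -> node id; synonyms with the same sense share a node.
NODE_INDEX = {
    ("python", "programming language"): "n_python_lang",
    ("python", "snake"): "n_python_snake",
    ("py", "programming language"): "n_python_lang",
}

# Hypothetical hint words that suggest each sense.
SENSE_HINTS = {
    "programming language": {"code", "library", "script"},
    "snake": {"reptile", "venom", "zoo"},
}

def resolve_node(tag, content_words):
    """Pick the node for `tag` whose sense best matches the content words."""
    words = {w.lower() for w in content_words}
    candidates = [
        (sense, node)
        for (t, sense), node in NODE_INDEX.items()
        if t == tag.lower()
    ]
    if not candidates:
        return None
    # Ties (including zero overlap) fall back to the first listed sense.
    best = max(candidates, key=lambda c: len(SENSE_HINTS.get(c[0], set()) & words))
    return best[1]

print(resolve_node("python", ["a", "library", "for", "code"]))
print(resolve_node("python", ["venom", "of", "a", "reptile"]))
```

Distances would then be looked up between node IDs rather than between raw tag strings.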

huizi liang

Nov 26, 2009, 11:52:19 PM

to re...@googlegroups.com

Hi, all,

Thanks. It's good to see so many people are interested in tags :)

To 文栋. Sorry, I'm not in Beijing. Otherwise, I think I would attend this group's meetings. I probably can't attend the PK meeting either :)

To Quan Yuan. I actually read that paper. It's good to know that you are also in this group :) I started this research in early 08. Has your group got any new results or more recent publications? Thanks.

To 立才. Since tags are words freely given by users, they contain a lot of personal tags, synonyms, and different words that mean the same thing. This causes incorrect neighbourhood formation and content filtering. What you mentioned also shows that tags contain a lot of noise. There are various ways to deal with it, such as the one Summer mentioned. Tags, users and items all follow a power-law distribution in real-life datasets (according to publications using delicious, citeulike, amazon), so the distribution of tags has a long tail. As for the similarity calculation, its purpose is to form a proper neighbourhood, so that is a key issue for the proposed approaches.

To Summer: What kind of dataset are you using with LDA? I plan to compare with this approach but haven't finished this part yet.

Thanks. Have a good weekend!

Cheers,

Huizi

Quan Yuan

Nov 27, 2009, 1:15:48 AM

to re...@googlegroups.com

Huizi,

Yes, we've made some new progress in tag-based recommendation, and we are now targeting a top conference next year.

tensor zhang

Nov 27, 2009, 7:40:00 AM

to re...@googlegroups.com

Isn't the author of that CiteULike blog post writing his PhD thesis, Recommender Systems for Social Bookmarking?

It can be downloaded after December 8.

If anyone reads it then, please share, hoho

huizi liang

Nov 28, 2009, 1:58:07 AM

to re...@googlegroups.com

As for the dataset, I'm wondering why no dataset has been publicly released for research purposes by Chinese websites such as 豆瓣 or 当当.

Actually, there are quite a number of excellent Chinese researchers (and students) all over the world. It's a bit of a shame that information processing research is mostly based on English textual information only, though processing Chinese is different from processing English information. (This issue is separate from writing in English for the purpose of communication.)

I became aware of this issue through the experience of a student from Thailand who does research on Thai language processing. He had a paper accepted at PAKDD 09, which was held in Thailand. He told me that his research was actually not that strong, but since the organizers wanted to encourage research on their own language, he had a better chance than other people.

Then I realized that one limitation of my research proposal is the limitation of language :) But personally, I hope my time and effort can also contribute to Chinese information processing. (The implications would be more significant :) ) Basically, if there are publicly released datasets of Chinese tags, I'd like to try them and modify my approach to make it applicable to Chinese textual information. I don't think it's a good thing that the more advanced approaches deal only with English information. I hope researchers and industry people will become aware of this issue.

Besides, I think the website itself will benefit not only from the new approaches but also from the good reputation and free advertising (e.g. Netflix, CiteULike).

wiizane

Nov 30, 2009, 2:27:10 AM

to Resys

Has anyone already used the CiteULike dataset?

I sent an email to request access, and two days have passed without a reply.

Are there any requirements?

huizi liang

Nov 30, 2009, 2:46:29 AM

to re...@googlegroups.com

You can download it directly.

http://www.citeulike.org/faq/data.adp

王立才

Nov 30, 2009, 7:16:09 AM

to resys

The latest data snapshot can always be downloaded at http://static.citeulike.org/data/current.bz2

Older datasets are available on a daily basis and can be found at URLs of the form http://static.citeulike.org/data/2007-05-30.bz2

The old datasets need no permission and can be downloaded directly; the new dataset requires access.
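
Once a snapshot has been downloaded, it can be read with Python's standard `bz2` module. The sketch below assumes each line holds four pipe-delimited fields (article ID, hashed user name, timestamp, tag); that layout is an assumption here and should be checked against the description at http://www.citeulike.org/faq/data.adp. The local file name is hypothetical.

```python
import bz2

def read_postings(path):
    """Yield (article_id, user_hash, timestamp, tag) tuples from a .bz2 snapshot."""
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("|")
            if len(parts) != 4:
                continue  # skip lines that do not match the assumed layout
            yield tuple(parts)

def count_users(path):
    """Count distinct user hashes in a snapshot."""
    return len({user for _, user, _, _ in read_postings(path)})

# Example (hypothetical local file name):
# print(count_users("2007-05-30.bz2"))
```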

huizi liang

Nov 30, 2009, 9:33:07 PM

to re...@googlegroups.com

Sorry about that. I think they restricted access only very recently. I downloaded the latest one directly about 2 or 3 months ago.

It seems that a free lunch on the Internet is only ever available at the very beginning :)

谷文栋

Dec 17, 2009, 2:00:16 AM

to re...@googlegroups.com

The person who built the recommender system for CiteULike has released his full thesis. Those interested can take a look: http://ilk.uvt.nl/~toine/publications/phd-thesis.pdf