Organize Your Interests with Topics

Wendong's cutt has finally arrived after a long wait. Wendong wrote an article introducing their system, which you can take a look at here:
http://www.guwendong.com/post/2010/cutt.html

And here is their logo.

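OK, pleasantries aside, here is my take. The core of this system is using topics to connect users and items. A well-known figure in the recommendation field once said that the essence of recommendation is connecting users with items: user-CF connects them through users with similar interests, item-CF connects them through similar items, and cutt connects them through topics.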

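The idea of connecting users and items through topics has a long history in data mining. The representative models are LSI, pLSA, and LDA, all of which are called latent factor models or topic models. Ideally, we can assign items to topics, compute how much a user likes each topic from the user's history, and then make recommendations accordingly.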

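One benefit of topic models is that they address the cold-start problem for new items. With collaborative filtering, when a new item appears and no user has interacted with it yet, we have no way to tell who will like it. But if the item is an article, a topic model can tell us which topics it belongs to (by analyzing its content), so we can recommend it to the users who like those topics.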
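Here is a minimal sketch of that idea in Python. It is not cutt's implementation: the item names, topic weights, and the helper functions `user_topic_preferences` and `score` are all made up for illustration, and the topic weights are assumed to come from some content-based topic model. It estimates each user's topic preferences from their history and then scores a brand-new item that nobody has interacted with yet.

```python
from collections import defaultdict

def user_topic_preferences(history, item_topics):
    """Sum the topic weights of the items in a user's history, then normalize."""
    prefs = defaultdict(float)
    for item in history:
        for topic, weight in item_topics[item].items():
            prefs[topic] += weight
    total = sum(prefs.values())
    return {t: w / total for t, w in prefs.items()} if total else {}

def score(user_prefs, topics_of_item):
    """Dot product between a user's topic preferences and an item's topic weights."""
    return sum(user_prefs.get(t, 0.0) * w for t, w in topics_of_item.items())

# Hypothetical topic weights, e.g. produced by a topic model run on item content.
item_topics = {
    "a_chinese_odyssey": {"stephen_chow": 0.5, "comedy": 0.5},
    "shaolin_soccer": {"stephen_chow": 0.6, "comedy": 0.4},
    "documentary_x": {"history": 1.0},
}
history = {
    "alice": ["a_chinese_odyssey", "shaolin_soccer"],
    "bob": ["documentary_x"],
}

# A new item with no user behavior at all: only its content-derived topics are known.
new_item_topics = {"stephen_chow": 0.7, "comedy": 0.3}

prefs = {u: user_topic_preferences(h, item_topics) for u, h in history.items()}
scores = {u: score(p, new_item_topics) for u, p in prefs.items()}
```

With this toy data the new item scores higher for alice than for bob, even though no one has rated it, which is exactly what collaborative filtering alone cannot do.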

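This sounds simple, but it is actually quite complicated. The first problem is how to define a topic. Take movies: A Chinese Odyssey (大话西游) could belong to costume dramas, Stephen Chow films, Journey to the West, Hong Kong films, films by the brothers Charles Heung and Jimmy Heung (向华强/向华胜), comedies, 1990s Hong Kong films, and many other topics. Different data yield different topics, and different users may see the same movie through different topics: some watch it for Stephen Chow, while others just want a comedy.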

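The core data needed to build a topic model is the relationships between different entities. In collaborative filtering, a user having viewed an item is one kind of relationship. A semantic network provides another kind: for example, Stephen Chow is an actor and has also been a director. cutt has put a lot of effort into semantic networks. Based on different data, they cluster items in several different ways and then show users the clusters they care about most.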

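In its feedback design, cutt also lets users give feedback directly on the topics it finds, which should lead to better recommendations. Of course cutt has plenty of problems too, though these may have more to do with the state of the Chinese Internet, where original articles are scarce and interesting ones scarcer. As a result, the many articles under a topic often boil down to a single article, with the rest being reposts of each other. User cold start is also hard to solve: since there is no user preference data yet, cutt has not actually built a recommender system so far. It only offers a way to browse related topics, so that users can find the topics they like more easily.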

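In terms of technical details, cutt is far more complex than what I have described. Consider this post a starting point for discussion, hehe.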

Implementing LDA

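First, thanks to Wang Yi for his help. I never had a deep understanding of statistical models, but after studying LDA carefully I found that it has many advantages that matrix models lack, and that the algorithm is simple to implement. The image below shows the main code.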

I also wrote a simple parallel version with OpenMP, which makes full use of the 4 cores on my machine.
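The original code is not reproduced here, so the following is only a rough idea of what a compact LDA implementation can look like. It is a serial Python sketch that assumes collapsed Gibbs sampling as the inference method (the post does not say which one was used); the function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are all illustrative.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                               # total tokens assigned to topic k
    assign = []                                                  # topic of every token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Toy corpus: word ids 0-1 and 2-3 tend to appear in different documents.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4, iters=50)
```

The inner loop is the whole algorithm: take a token out of the counts, resample its topic, and put it back. Each token's update depends on the shared counts, so a parallel version has to decide how to handle concurrent updates to them.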

A Light Break

How to Modify Chromium's User-Agent

In the chrome solution, under the webkit/support/glue project, there is a file called webkit_glue.cc. Find the following statement in it:

StringAppendF(
    result,
    "Mozilla/5.0 (%s; %c; %s; %s) AppleWebKit/%d.%d"
    " (KHTML, like Gecko) %s Safari/%d.%d",
    mimic_windows ? "Windows" : kUserAgentPlatform,
    kUserAgentSecurity,
    ((mimic_windows ? "Windows " : "") + BuildOSCpuInfo()).c_str(),
    kUserAgentLocale,
    WEBKIT_VERSION_MAJOR,
    WEBKIT_VERSION_MINOR,
    product.c_str(),
    WEBKIT_VERSION_MAJOR,
    WEBKIT_VERSION_MINOR
);
This is the statement Chromium uses to generate the user-agent, so you can change it however you like.
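To see which placeholder ends up where before editing the C++ source, the same format string can be tried out in Python, whose `%` operator accepts the same `%s`, `%c`, and `%d` specifiers. This is only a sketch: `build_user_agent` is a hypothetical helper, and the platform, locale, version, and product values below are made-up examples, not values taken from the Chromium source.

```python
# Same format string as in the StringAppendF call above.
UA_FORMAT = ("Mozilla/5.0 (%s; %c; %s; %s) AppleWebKit/%d.%d"
             " (KHTML, like Gecko) %s Safari/%d.%d")

def build_user_agent(platform, security, os_cpu, locale, major, minor, product):
    """Fill the placeholders in the same order as the C++ call does."""
    return UA_FORMAT % (platform, security, os_cpu, locale,
                        major, minor, product, major, minor)

# All argument values here are illustrative assumptions.
ua = build_user_agent("Windows", "U", "Windows NT 5.1", "en-US",
                      532, 5, "Chrome/4.0.249.0")
print(ua)
```

Note that the WebKit version is used twice, once after AppleWebKit/ and once after Safari/, so changing only one of the two in the C++ code makes them disagree.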

Multiple Users Sharing One ID?

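Back during the Netflix Prize, our team, The Ensemble, discussed the following internally: if several people share one account and have different interests, the recommendations may become inaccurate. So how can we tell apart the people sharing an account? Our teammate Lester Mackey proposed some ideas at the time, but we were short on time and never implemented the model.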

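Recently Lester Mackey published a paper at ICML, "Mixed Membership Matrix Factorization", that discusses this idea in detail. He combines LDA with SVD: each user ID may correspond to several people, so the model first samples, from a multinomial distribution, the person behind the ID, and each person has their own latent factor. The improvement in RMSE is fairly noticeable: 0.9 => 0.896.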

The model is very interesting and worth a look. My understanding of it may not be entirely accurate.
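The description above can be sketched in a few lines of Python. This follows only the simplified reading given in this post (one account, a multinomial over the people behind it, one latent factor per person) and is not the model from the paper; the function names and all numbers are invented for illustration.

```python
import random

def predict(person_factor, item_factor):
    """Matrix-factorization style prediction: dot product of two latent factors."""
    return sum(p * q for p, q in zip(person_factor, item_factor))

def sample_rating(user_mix, person_factors, item_factor, rng):
    """Sample which person behind the account is rating, then predict with their factor."""
    person = rng.choices(range(len(user_mix)), weights=user_mix)[0]
    return person, predict(person_factors[person], item_factor)

rng = random.Random(0)
user_mix = [0.7, 0.3]                      # assumed mixture: person 0 uses the account 70% of the time
person_factors = [[1.0, 0.0], [0.0, 1.0]]  # assumed latent factors, one per person
item_factor = [4.5, 2.0]                   # assumed latent factor of one item

draws = [sample_rating(user_mix, person_factors, item_factor, rng) for _ in range(1000)]

# Averaging over the mixture gives the prediction when we do not know who is rating.
expected_rating = sum(w * predict(f, item_factor)
                      for w, f in zip(user_mix, person_factors))
```

In this toy setup the two people would rate the item 4.5 and 2.0, while a single factor per account can only produce one number for both of them.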

Surprising Recommendations Need Reasonable Explanations

In recent years, much research has focused on increasing recommendation serendipity, that is, on how to make surprising recommendations. In my research, I have found that a surprising recommendation is not enough on its own; we must also give a reasonable explanation, because users need one to be sure the recommendation is relevant to them.

In real life, we always give a detailed explanation when we make an unusual recommendation. For example, suppose a student working on data mining asks me to recommend some books. If I recommend a data mining book, a short explanation is enough, such as "this is the best book on data mining". However, if I recommend a math book, he may be surprised. This time I need to give a more detailed explanation, such as "if you want to do well in data mining, you need a good math foundation". Without that explanation, the student may never read the math book I recommended. So a reasonable explanation is very important for a surprising recommendation.

We are working hard to increase recommendation serendipity, to help users find items they do not know about but might like. Increasing serendipity usually drives CTR down. I think this is because we do not give a reasonable explanation. If a user sees only the name of an item we think she may like, she may be surprised, but she may not click on it because she does not know what it is.

So if we want to make surprising recommendations to users, we must give the explanation at the same time. There are many possible reasons to offer, such as:

  1. This is the best book for your age
  2. This movie is directed by XXX, who also directed XXX, which you watched before
  3. Women living in Zhongguancun always go to this shop
  4. ………………………….

In real life, we give an explanation when we recommend that our friends do something. On the Internet we need to give explanations too, especially for recommendations that may surprise users.

Successfully Building Chromium

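I had been trying to build Chromium for a while without success. It turned out to be a VS2005 version problem: the build only succeeds with VS2005 SP1. There are still a few dozen errors, but chrome.exe has been produced.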

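The point of building Chromium is to study the browser more closely; after all, the browser is the foundation of the Internet. Sharpening my engineering skills is another goal. Next comes studying the Chrome code, and there is a long way to go.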

YouTube Offers Personalized TV-Like Experience

http://www.readwriteweb.com/archives/youtube_offers_personalized_tv-like_experience.php


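YouTube has launched a service called Leanback that offers a TV-like experience.
Once you open the Leanback interface, YouTube starts playing videos continuously based on personalized recommendations. Users can browse the videos with the keyboard.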

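Kuan Yong, a senior project manager at YouTube, explained how the programming is generated: the videos are chosen based on the user's YouTube settings and preferences, including the user's subscriptions and the videos the user's friends share on Facebook.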

Facebook Unveils One of History’s Most Powerful Recommendation Engines

http://www.readwriteweb.com/archives/facebook_unveils_one_of_the_historys_most_powerful.php


Important parts:

Facebook will now recommend that new users sign up for updates from (“Like”) publishers with high reader engagement and subscribed-to by people demographically similar to themselves.
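This means recommending popular accounts and demographically similar friends to newly registered users. Twitter seems to have done this too: you can pick a few people to follow when you sign up.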

The Facebook vs. Google battle could become a fight between Recommendation and Search.
That is a thrilling claim, but surely recommendation and search won't end up as opponents.

Recommendation-geeks have argued that recommendation may someday become bigger, more important and more lucrative than search. Recommendation is like a smarter, pre-emptive search before you even thought to search for anything.
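I have always felt that recommendation and search meet two different user needs, and neither can be dropped. The two markets do not conflict and can grow together.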

It’s too bad this had to happen under a proprietary platform with privacy problems.
Privacy!! The hurdle that personalization can never get around.

Comments on Google’s Personal News

Update: ReadWriteWeb: Did Google Blow It with the Google News Redesign?

Google redesigned News yesterday and added more personalization modules. Many articles discuss Google's personalized news:

ReadWriteWeb: New Google News is More Personal and Spontaneous

TechCrunch: Google News Gets Biggest Overhaul Since 2002, Adds Trending Topics And Personal News Stream

Google's specific changes are shown in the image below:

Here are some comments from readers:

The personalized news idea is a total disrespect to the human intelligence.
Instead of news/analytics which you have to browse,read,analyze, understand you get a personalized stream which “fits your interests”
It is well known result from recommendation engines and collaborative filtering: such recommendation/personalization engine significantly decrease the diversity of the information processed into the system, at the end the interests of majority of participants converge to the very narrow set of topics.
I love traditional newspapers and magazines more than Google’s Newsburger.
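This commenter worries that recommender systems reduce the diversity of results, so that the final recommendations end up confined to a handful of topics.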

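Reading the comments online, the reviews are overwhelmingly negative. I saw hardly anyone say anything good; most say it is worse than the old design.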

I do not like the new layout – the older layout gave a good sample of the news in different categories – this layout is just plain confusing – very poor layout and organization. Just because something has been around for awhile does NOT mean that it needs to be messed with. For example, many people really prefer the UI of XP to Windows 7. New does NOT equate to better!

These changes, improvements, customization features, personalization are awesome, innovative, inspired, clever BUT unnecessary. The clustering of news the way it was and the option to add sections was just fine. I do google news a dozen times a day. Today I struggled with this new layout to read the titles, figure out the organization … and frankly just got so #^$%’ed off that I closed it and deleted my bookmark.

I love the new features, but hate the new UI. Google has generally been amazing at minimalist UI’s that do not distract you from what you are looking for (just look at their homepage). I have used google news for a while as my main news source. Google, please fix!

Google doesn’t give a crap what you think. They will tell you what you will like. Your only job is to login like a good sheep and proceed to provide them with endless amounts of preference and selection data, for free. Anonymous customization? Long gone. Multiple points of view… god they know how much you hate that, and now you can get rid of those pesky people who think differently. And of course, everything must be “social”. Because of course if thousands of other retards think something is interesting, you’d better think so too! How DARE you not conform?! 100 years from now, historians will look back on this age as the time when the word “social” became totally corrupted and an oxymoron.

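My own view on Google's personalized news: I don't see why they went for customization. Letting users specify topics themselves doesn't count as recommendation; it is much like iGoogle, just user customization. For news, recommendation should bring users surprises. A typical user reads the news to learn what happened each day, and Google's old homepage served that need well. So if they want to personalize, they should bring users more surprises rather than a personalized way of browsing.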

ICML 2010 and Yahoo! Learning to Rank Workshop

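The most influential competition in the first half of this year was Yahoo!'s Learning to Rank challenge. After taking part for two weeks I felt this wasn't my strength, so I stopped. The competition ended at the end of May, and the winners can publish their methods at the ICML 2010 Learning to Rank workshop.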

First place went to Chris J.C. Burges of Microsoft Research. Their method is:

From RankNet to LambdaRank to LambdaMART: An Overview

The papers from the other two winners are:

* BagBoo: Bagging the Gradient Boosting by Dmitry Pavlov and Cliff Brunk
* YetiRank: Everybody Lies by Andrey Gulin and Igor Kuralenok