Explanation:Explanation is one of the most important components of every recommendation system. The explanation module generates some reasoning for every recommendation result using the user’s historical behaviors. For example, we will recommend “American Dad” to a user who had previously watched “Family Guy.” The explanation will say, “We recommend ‘American Dad’ to you because you have watched ‘Family Guy’”.

Figure 1 : Architecture for Hulu
Off-line Architecture
In the above on-line architecture, some components rely on offline resources, such as the topic model, related model, feedback model, etc. The off-line system is also an important part of our recommendation system. Our off-line system has these main components:
- Data Center: The data center contains all user behavior data in Hulu. Some of them are stored in Hadoop clusters and some of them are stored in a relational database.
- Related Table Generator: The related table is an important resource for on-line recommendation. We use two main types of related table: one that’s based on collaborative filtering (which we’ll call CF), and another based on content. In CF, show A and show B will have high similarity if users who like show A also like show B. With content filtering, we use content information including title, description, channel, company, actor/actress, and tags.
- Topic Model: A topic is represented by a group of shows that have similar content. Topics are thus larger in scope than shows, but they’re still smaller than channels. Our topics are learned by LDA, which is a popular topic model in machine learning.
- Feedback Analyzer: Feedback specifically means users’ reactions to recommendation results. Using user feedback can improve recommendation quality. For example, say a show is recommended to many users, but most of them do not click this show. In that case, we’ll decrease the rank of this show. Users will also have different types of behavior, so we’ll use all these behaviors in developing the recommendations. However, some users may prefer recommendations to come from their prior watch history, and some users may prefer their recommendations to come from their voting behavior. All these effects can be modeled offline by analyzing users’ feedback on their recommendations.
- Report Generator: Evaluation is most important part of the recommendation system. The report generator will generate a report including multiple metrics every day to show the quality of recommendations. At Hulu we monitor metrics including CTR, conversion ratio, etc.

Figure 2 : Architecture for Hulu
Algorithms
So far, we’ve given a brief overview of our recommendation architecture. From previous discussion, we can see that Hulu’s recommendation system is primarily based on ItemCF. We’ve added many improvements on top of the ItemCF algorithm, too, in order to make it generate better recommendations. To test these improvements, we’ve performed many A/B tests on different algorithms. In following sections, we’ll introduce some of these algorithms and the experiment results.
Item-based Collaborative Filtering
Item-based Collaborative Filtering (ItemCF) is the basis of all our algorithms. In ItemCF, let N(u) be a set of items user u has preferred previously. User u’s preference on item j (j is not in N(u)) can then be measured by:
p(u,i) = \sum_{j \in N(u)} r(u,j) s(i,j)
Here, r(u,i) is the preference weight of user u on show i, and s(i,j) is the similarity between show i and show j. In CF, the similarity between two shows is calculated by user behavior data on these two shows. Let N(i) be a set of users who watched show i and N(j) be a set of users who watched show j. Then, the similarity s(i,j) between show i and show j is calculated by following formula:
In this definition, show i will be highly relevant to show j if most users who watch show i will also watch show j. However, this definition will have the “Harry Potter problem,” which means that every show will have high relevance with popular shows.
Recent Behavior
The first lesson we learned from A/B testing is that recommendations should fit users’ recent preference and that users’ recent behavior is more important than their older, historical behaviors. So, in our engine, we will put more weight on users’ recent behaviors. In our system, CTR of recommendations that originate from users’ recent watch behavior is 1.8 times higher than CTR of recommendations originating from users’ old watch behavior.

Novelty
Just because a recommendation system can accurately predict user behavior does not mean it produces a show that you want to recommend to an active user. For example, “Family Guy” is a very popular show on Hulu, and thus most users have watched at least some episodes from this show. These users do not need us to recommend this show to them — the show is popular enough that users will decide whether or not to watch it by themselves.
Thus, novelty is also an important metric to evaluate recommendations. The first way we think can increase novelty is by revising ItemCF algorithm:
- First, we will decrease weight of popular shows that users have watched before.
- Then, we’ll put more weight on shows that are not only similar to shows the active user watched before, but also less popular than shows the active user watched before.
Explanation-based Diversity
Most users have diverse preferences, so the recommendation should also meet their diverse interests. In our system, we use explanations to diversify our recommendations. We think a diverse recommendation means most of the recommendation shows have different explanations.
We have performed an A/B test to show the usefulness of diversification (shown in the above figure). The results of the experiment show that, for active users who had previously watched 10 or more shows, diversification can increase recommendation CTR significantly.
Temporal Diversity
A good recommendation system should not generate static recommendations. Users want to see new suggestions every time they visit the recommendation system. If a user has new behaviors, she will find her recommendations have changed because we have put more weight on the user’s recent behaviors. But if a user has no new behaviors, we also need to change our recommendations. We use three methods to keep temporal diversity of our system:
- First, we’ll recommend recently-added shows to users. Many new shows are added to Hulu every day, and we will suggest these shows to users who will like them. Thus, users will see fresh ideas for shows to watch when new ones are added.
- Second, we will randomize our recommendations. Randomization is the simplest way to keep recommendations fresh.
- Finally, we’ll decrease rank of recommendations which users have seen many times. This is called implicit feedback, and data show that CTR is increased by 10% after using this method.
Performance of Hulu’s Recommendation Hub
The recommendation hub is a personal recommendation page for every user. On this page users will see 6 carousels. The top carousel is “top recommendations”, which includes shows that we think users will prefer very much. After top recommendations, there are three carousels for three genres. These three genres are selected by analyzing users’ historical preferences. The next carousel is bookmarks, which include shows that users have indicated they’d like to watch later. The last carousel is filled with shows that the user has already rated. This carousel is designed to collect more explicit feedback from users.

We have performed an A/B test to compare our recommendation algorithms with two simple recommendation algorithms: Most Popular (which recommends the most popular shows to every user) and Highest Rated (which recommends highly-rated shows to every user). As shown in the above figure, experiment results show that the CTR of our algorithm is much higher than both simple methods.
Lessons
Every user behavior can reflect user preferences.
In our system, we use a slew of user behaviors to come up with our recommendations. We’ve calculated the CTR of recommendations originating from different types of behaviors. As shown in Figure 3, we can see that recommendations from every type of behavior can generate recommendations that will be clicked by users.

Figure 3 : CTR of recommendations come from different types of behaviors
Explicit Feedback data is more important than implicit feedback data
As shown in Figure 3, CTR of recommendations that originate from users’ historically loved (vote 5 stars on shows) and liked (vote 4 stars on shows) behaviors is higher than CTR of recommendations that come from users’ historical subscribe/watch/search behavior. So although the size our explicit feedback data is much smaller than implicit feedback data, they’re much more important.
Recent behaviors are much more important than old behaviors
Novelty, Diversity, and offline Accuracy are all important factors
Most researchers focus on improving offline accuracy, such as RMSE, precision/recall. However, recommendation systems that can accurately predict user behavior alone may not be a good enough for practical use. A good recommendation system should consider multiple factors together. In our system, after considering novelty and diversity, the CTR has improved by more than 10%.
Based on the paper “Recommendation System at Hulu” by Liang Xiang, Hua Zheng and Hang Li.
Hua Zheng is the senior lead developer in charge of the Hulu content recommendation and behavior targeting systems.
Dr. Xiang and Dr. Li, associate researchers, are working together on the recommendation system, helping users discover and enjoy relevant premium videos.