A trick for doing SVD on binary data

In the GitHub contest, some people applied SVD to the GitHub data. This data is binary: it records only who watches what.

Most previous research on SVD has been done on rating data. On the Netflix data, many people used Funk-SVD, which is trained only on the observed ratings. In binary data, however, every observed entry has label 1, while the user-item pairs with label 0 are all missing values. So we cannot train SVD only on the observed user-item pairs in binary data: a model that sees nothing but 1s has nothing to tell items apart.

For this reason, some people apply classical SVD directly to the binary user-item matrix. However, this method does not produce more accurate recommendations than UserKNN or ItemKNN.

I have been researching binary SVD for a long time, and today I found that by changing the value 0 in the binary user-item matrix, we can get more accurate recommendations.

In binary data, the user-item matrix R is defined as:
R(u, i) = 1 if user u likes item i
R(u, i) = 0 if the value is missing

If we set R(u, i) = e wherever R(u, i) = 0, where e is a positive number less than 1, and then use classical SVD to factorize R, we can produce very accurate recommendations.
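
Here is a minimal sketch of this trick in Python with NumPy. It is only an illustration, not the exact code I ran: the function names binary_svd_scores and recommend, the rank k, and the toy matrix are all assumptions made up for the example. It fills the 0 entries with e, takes a rank-k truncated SVD, and ranks each user's unseen items by the reconstructed score.

```python
import numpy as np

def binary_svd_scores(R, e=0.2, k=2):
    """Fill missing (0) entries with e, then return the rank-k SVD reconstruction.

    e and k are illustrative defaults, not tuned values.
    """
    R = np.asarray(R, dtype=float)
    filled = np.where(R == 1, 1.0, e)          # 1 stays 1, every 0 becomes e
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]      # keep only the top k factors

def recommend(R, scores, user, n=2):
    """Return up to n item indices the user has not seen, best score first."""
    unseen = np.flatnonzero(np.asarray(R)[user] == 0)
    ranked = unseen[np.argsort(-scores[user, unseen])]
    return ranked[:n].tolist()

# Toy "who watches what" matrix: 4 users x 5 items (made-up data).
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
])

scores = binary_svd_scores(R, e=0.2, k=2)
print(recommend(R, scores, user=0, n=2))
```

On a real dataset the filled matrix is dense, so a full SVD like this only works for small matrices; a larger dataset would need a truncated or sparse-aware solver.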

Another result I found is that for a sparse dataset we should choose a small e, for example e = 0.2, and for a dense dataset we should choose a large e, for example e = 0.8. So the best choice of e depends on the sparsity of the dataset.
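
One way to explore this dependence is to measure the density of the matrix and then try several values of e on held-out data. The sketch below is an assumption about how such a check could look, not the procedure behind the numbers above: the names density and hit_rate, the leave-one-out split, the rank k, and the toy matrix are all hypothetical.

```python
import numpy as np

def density(R):
    """Fraction of user-item pairs that are observed (equal to 1)."""
    return float(np.mean(np.asarray(R) == 1))

def hit_rate(R, e, k=2, n=2, seed=0):
    """Hide one liked item per user, refit with fill value e, count top-n hits."""
    rng = np.random.default_rng(seed)
    R = np.asarray(R, dtype=float)
    train = R.copy()
    held = {}
    for u in range(R.shape[0]):
        liked = np.flatnonzero(R[u] == 1)
        if len(liked) > 1:                     # keep at least one 1 per user
            i = int(rng.choice(liked))
            train[u, i] = 0.0                  # hidden item now looks missing
            held[u] = i
    filled = np.where(train == 1, 1.0, e)
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    scores = (U[:, :k] * s[:k]) @ Vt[:k, :]
    hits = 0
    for u, i in held.items():
        unseen = np.flatnonzero(train[u] == 0)
        top = unseen[np.argsort(-scores[u, unseen])][:n]
        hits += int(i in top)
    return hits / max(len(held), 1)

# Toy matrix (made-up data), 4 users x 5 items.
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
])

print("density:", density(R))
for e in (0.2, 0.5, 0.8):
    print(e, hit_rate(R, e))
```

A matrix this small says nothing about which e is best; the point is only the shape of the experiment, which would be repeated on datasets of different sparsity.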

Comments 1

  1. CQ wrote:

    Interesting, similar to the initialization of a logistic neural network.

    Posted 13 March 2010 at 8:36 pm
