The last decade led to an unbelievable growth in the importance of social media. One of the most important parts of social media are blogs, where users can express their views and opinions on a variety of topics and also react to the opinions of other users. Blog users and their comments form a complex social network, the evolution of which is a hard to predict process.
The primary aim of my thesis is predicting which user is expected to comment on which blog in the future. This problem can be formulated as a link prediction problem in bipartite graphs. Recently, sparse matrix factorization became popular for link prediction (it is one of the most frequently used methods for recommendation systems). I implemented an algorithm that can be applied to real-life data that has been transformed into a correct format using data-mining techniques.
The first results showed that although the factorization of matrices has great potential to predict new links between blogs and user, it fails to capture the true distribution of comments and therefore a straight-forward application of the factorization is not able to predict future comments effectively. Therefore various experiments have been tried to define a way of using the result of factorization that best takes advantage of its hidden information about future commenting trends. The results of the experiments were evaluated using the measurements precision and recall.
The rows of the matrix - on which the factorization is carried out - can be perceived an n-dimensional vectors that describe the posting habits of each user (where n is the number of blogs processed). One of the by-products of matrix factorization is a matrix that assigns a low-dimensional vector to each user instead of their original n-dimensional representation, which in theory still contain the same information about his/her commenting habits. The second part of the study examines whether these vectors may be capable of identifying unusual, abnormal behavior among users. Furthermore, I also examine, if the low-dimensional vectors that describe users based on the result of the factorization identify the same users as anomalous as the original vectors did with the same techniques.