Searching overlapping clusters with similarity matrices

OData support
Dr. Katona Gyula
Department of Computer Science and Information Theory

Cluster analisys is basic datamining task. The essence of the procedure is partitioning a

set of objects according to the their attributes. A slightly altered method is capable of

putting one object into many subset. Also the membership of an element in a set can be a

real number between zero and one, such as in fuzzy logic. In many cases the core of cluster

analisys is the distance function, which assigns a number to a pair of elements according

to their diversity. These distance values can form a distance matrix.

Applications of cluster analisys can be various. The first field I examined was the

databases of supermarkets. The task was to identify customer behaviour patterns based

on baskets. Another application is clustering organic molecules. Pharmaceutical companies

and chemical research laboratories need to order molecules into clusters according to their

chemical attributes. The third field was the world of recommending systems. Costumers

can rate products and we can create recommending softwares using these ratings, which

can do recommendations and make optimized personal advertisements.

There are two important aspects of my method. The first is using asymmetric "distance"

values, so the similarity matrix of the objects can be asymmetric. The other contribution

is a modified matrix factoring method, which approximates the similarity matrix with a

smaller matrix using a gradient based search method. This contrasts the original matrix

factoring methods which use two matrices to decompose to. The result of my algorithm is a

matrix with two dimensions, one for the objects and one for the clusters. The i, j element of

the matrix is the membership of the ith element in the jth cluster. This overlapping, fuzzy

clustering is capable of identifying customer behaviour patterns. It can also be transformed

easily to discrete clusters thus solving more conventional cluster analisys tasks.


Please sign in to download the files of this thesis.