Clustering for high dimensional categorical data based on text similarity

Link to full Paper

Many cluster analysis techniques exist for grouping objects with related characteristics. Implementing them is often difficult, however, because much of the data in today’s databases is categorical. Despite recent advances in algorithms for clustering categorical data, some cannot handle uncertainty in the clustering process and others have stability issues. This paper proposes a clustering technique based on text similarity. First, the relevant features are selected from the input dataset. These features are then clustered using the Possibilistic Fuzzy C-Means (PFCM) clustering algorithm, where the quantity used for clustering is the similarity between categorical data items. That similarity is computed with SMTP (similarity measure for text processing), applied to pairs of categorical data items. The clustering-based method has a high probability of producing a subset of useful, independent features. To improve efficiency, a minimum spanning tree is constructed with the help of an optimization algorithm: an adaptive artificial bee colony (AABC) algorithm is used to select the optimal features. Performance is evaluated by clustering accuracy, the Jaccard coefficient, and Dice’s coefficient. The proposed method will be implemented in MATLAB using datasets from a machine learning repository.
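
The abstract names the Jaccard coefficient and Dice’s coefficient as evaluation measures but does not say how they are computed. One common reading, assumed here, is the pair-counting form: every pair of objects is checked for whether it shares a cluster in the reference labelling and in the predicted labelling. The sketch below is in Python rather than MATLAB, and the function names are illustrative, not taken from the paper.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Count object pairs by agreement between two labellings.

    Returns (a, b, c) where
      a = pairs in the same cluster in both labellings,
      b = pairs together in the reference but split in the prediction,
      c = pairs split in the reference but together in the prediction.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labellings must have the same length")
    a = b = c = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            a += 1
        elif same_true:
            b += 1
        elif same_pred:
            c += 1
    return a, b, c


def jaccard_coefficient(labels_true, labels_pred):
    """Pair-counting Jaccard coefficient: a / (a + b + c)."""
    a, b, c = pair_counts(labels_true, labels_pred)
    denom = a + b + c
    # Assumption: if no pair is co-clustered in either labelling, treat as perfect agreement.
    return a / denom if denom else 1.0


def dice_coefficient(labels_true, labels_pred):
    """Pair-counting Dice coefficient: 2a / (2a + b + c)."""
    a, b, c = pair_counts(labels_true, labels_pred)
    denom = 2 * a + b + c
    # Same convention as above for the degenerate case.
    return 2 * a / denom if denom else 1.0


if __name__ == "__main__":
    reference = [0, 0, 1, 1]   # hypothetical ground-truth clusters
    predicted = [0, 0, 1, 2]   # hypothetical clustering output
    print(jaccard_coefficient(reference, predicted))
    print(dice_coefficient(reference, predicted))
```

Both measures ignore the cluster label values themselves, so a clustering that matches the reference up to renaming of clusters scores 1.0.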
