Clustering, an important research field of machine learning, aims to partition unlabeled data into clusters. Traditional clustering methods such as K-means are simple and easy to use, but their performance on real-world data is limited. Recently, clustering methods based on neural networks have been proposed to handle complex, large-scale data. However, these methods are time-consuming and still struggle to extract effective features, which limits their applicability. Moreover, most clustering methods cannot be applied to online clustering. To address these problems, we propose the Representation Contrast Clustering (RCC) method. To enhance feature extraction, we introduce contrastive learning into clustering, which makes it possible to extract effective, clustering-friendly features from complex data. Even without labels, contrastive learning extracts features comparable to those of supervised learning. Moreover, RCC adopts a “pre-training & fine-tuning” structure, which reduces clustering time and supports online clustering. In the pre-training phase, a contrastive learning framework combining data augmentation and neural networks extracts clustering-friendly representations. In the fine-tuning phase, the representations are clustered using a “label-as-representation” strategy. Experimental results show that RCC achieves state-of-the-art performance on most datasets. For example, on CIFAR-10, RCC’s NMI of 0.764 exceeds the best known methods by 6 percentage points, and its ACC of 0.855 is at least 6 percentage points higher. In summary, RCC does not require access to the complete dataset in either the pre-training or the fine-tuning stage, so it can run on large-scale data and be used for online clustering. The method is also efficient: once pre-training is complete, the downstream clustering task requires very little training time while still achieving state-of-the-art performance on most datasets in our clustering experiments.
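To make the two-phase pipeline concrete, below is a minimal PyTorch sketch of the structure described above: contrastive pre-training over two augmented views with an NT-Xent-style loss, followed by a softmax cluster head whose assignment plays the role of the “label-as-representation”. All names, architectures, augmentations, and hyperparameters here (Encoder, ClusterHead, nt_xent_loss, feature_dim, etc.) are illustrative assumptions, not the paper’s actual implementation.

```python
# Minimal sketch of contrastive pre-training + label-as-representation fine-tuning.
# All module names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Backbone + projection used in the contrastive pre-training phase."""
    def __init__(self, in_dim=3 * 32 * 32, feature_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, feature_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)  # unit-norm features

def nt_xent_loss(z1, z2, temperature=0.5):
    """Standard NT-Xent contrastive loss over two augmented views."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                      # (2N, d)
    sim = z @ z.t() / temperature                       # cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))          # drop self-similarity
    # Positive of sample i is its other view: i <-> i + n
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

class ClusterHead(nn.Module):
    """Fine-tuning head: the softmax assignment is the 'label-as-representation'."""
    def __init__(self, feature_dim=128, num_clusters=10):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_clusters)

    def forward(self, z):
        return F.softmax(self.fc(z), dim=1)

# --- Pre-training phase (contrastive) --------------------------------------
encoder = Encoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
x = torch.randn(32, 3, 32, 32)                          # a mini-batch of images
view1 = x + 0.1 * torch.randn_like(x)                   # stand-in augmentations
view2 = x + 0.1 * torch.randn_like(x)
loss = nt_xent_loss(encoder(view1), encoder(view2))
opt.zero_grad()
loss.backward()
opt.step()

# --- Fine-tuning phase (clustering) -----------------------------------------
head = ClusterHead()
with torch.no_grad():
    z = encoder(x)                                      # frozen representations
assignments = head(z).argmax(dim=1)                     # one cluster label per sample
```

Because each phase operates on mini-batches rather than the full dataset, a sketch like this also illustrates why the structure lends itself to large-scale and online clustering.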