Abstract:
With the widespread use of distributed information, traditional clustering algorithms can no longer meet the demands of processing massive data in terms of either accuracy or computational efficiency, so clustering algorithms based on the Spark distributed platform have become a research hotspot. The K-means algorithm is widely used in both academic research and business by virtue of its ease of implementation and high scalability. However, traditional K-means is based on Euclidean distance, which is not applicable in some scenarios, and it is inefficient when dealing with large-scale data. This study implements and tests the K-means, K-means++, and Canopy+K-means algorithms with both Euclidean distance and Manhattan distance on the Spark distributed platform and analyses the resulting changes in performance. The experimental results show that introducing Manhattan distance lengthens the clustering time and that the optimisation effect of each algorithm changes in a different way. © 2024 IEEE.
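As a rough illustration of what swapping the distance measure in K-means involves, the following single-machine sketch runs Lloyd-style iterations with a pluggable distance function. It is not the paper's Spark implementation: the function names (`euclidean`, `manhattan`, `kmeans`), the fixed iteration count, and the choice to keep the coordinate-wise mean as the centroid update under Manhattan distance are all assumptions made for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Point, b: Point) -> float:
    """City-block (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(
    points: List[Point],
    initial_centroids: List[Point],
    distance: Callable[[Point, Point], float],
    iterations: int = 10,
) -> Tuple[List[int], List[List[float]]]:
    """Hypothetical helper: plain K-means with a caller-supplied distance."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # under the chosen distance measure.
        labels = [
            min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        # Assumption: the mean is kept for both distance measures.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
start = [(0.0, 0.0), (10.0, 10.0)]
labels_l2, centroids_l2 = kmeans(points, start, euclidean)
labels_l1, centroids_l1 = kmeans(points, start, manhattan)
```

On this small, well-separated data set both distance measures give the same assignment. On real data the two can disagree, since only the assignment step depends on the distance function in this sketch.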
Year: 2024
Page: 15-20
Language: English
ESI Highly Cited Papers on the List: 0