Parallel Hierarchical Clustering on Market Basket Data

Document Type

Conference Proceeding

Publication Date



Computer Sciences


Data clustering has been proven to be a promising data mining technique. Recently, there have been many attempts for clustering market-basket data. In this paper, we propose a parallelized hierarchical clustering approach on market-basket data (PH-Clustering), which is implemented using MPI. Based on the analysis of the major clustering steps, we adopt a partial local and partial global approach to decrease the computation time meanwhile keeping communication time at minimum. Load balance issue is always considered especially at data partitioning stage. Our experimental results demonstrate that PH-Clustering speeds up the sequential clustering with a great magnitude. The larger the data size, the more significant the speedup when the number of processors is large. Our results also show that the number of items has more impact on the performance of PH-Clustering than the number of transactions.