Till KTH:s startsida Till KTH:s startsida

Master thesis "Building Evolutionary Clustering Algorithms on Spark"

Tid: Onsdag 6 september 2017 kl 09:00 - 10:30 2017-09-06T09:00:00 2017-09-06T10:30:00

Kungliga Tekniska högskolan
KTH Kista Degree projects, Master-level (Examensarbete, Master)

Plats: Ada room

Info:

Student:  Xinye Fu
Date and Time:  9:00 am, Wednesday, 6th September 2017
Examiner:  Šarūnas Girdzijauskas
Supervisor:  Vladimir Vlassov
Title: "Building Evolutionary Clustering Algorithms on Spark"
Opponent: Jing Li, Pietro Cannalire


Abstract
Evolutionary clustering (EC) is a kind of clustering algorithm to handle the noise of time-evolved data. It can track the truth drift of clustering across time by considering history. EC tries to make clustering result fit both current data and historical data/model well, so each EC algorithm defines snapshot cost (SC) and temporal cost (TC) to reflect both requests. EC algorithms minimize both SC and TC by different methods, and they have different ability to deal with a different number of cluster, adding/deleting nodes, etc.

Until now, there are more than 10 EC algorithms, but no survey about that. Therefore, a survey of EC is written in the thesis. The survey first introduces the application scenario of EC, the definition of EC, and the history of EC algorithms. Then two categories of EC algorithms - model-level algorithms and data-level algorithms are introduced one-by-one. What's more, each algorithm is compared with each other. Finally, performance prediction of algorithms is given. Algorithms which optimize the whole problem (i.e., optimize change parameter or don't use change parameter to control), accept a change of cluster number perform best in theory.

EC algorithm always processes large datasets and includes many iterative data-intensive computations, so they are suitable for implementing on Spark. Until now, there is no implementation of EC algorithm on Spark. Hence, four EC algorithms are implemented on Spark in the project. In the thesis, three aspects of the implementation are introduced. Firstly,  algorithms which can parallelize well and have a wide application are selected to be implemented. Secondly, program design details for each algorithm have been described. Finally, implementations are verified by correctness and efficiency experiments.

Hela världen får läsa.

Senast ändrad 2017-08-31 10:16

Taggar: Saknas än så länge.