The data stream model accommodates many types of data, and streaming algorithms have been evaluated empirically on both synthetic and real data streams. Analyzing such data is crucial, and it depends on the ability to process the data in a single pass, or in a small number of passes, with minimal memory. Clustering is a key data mining task: the problem of partitioning a set of observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar.
The traditional setup, in which a static dataset is available in its entirety for random access, does not apply here: the entire dataset is not available when learning begins, the data continue to arrive at a rapid rate, the data cannot be accessed randomly, and only one pass, or at most a small number of passes, can be made over the data to generate the clustering results. Data of this kind are referred to as data streams, and they are also relevant to in-depth big data analysis.
A data stream is a sequence of data items that is accessed in order and can be read only once or a small number of times. Each reading of the sequence is called a linear scan or a pass. The stream model is motivated by emerging applications involving massive data sets. The data stream clustering problem therefore requires a process that can partition observations continuously while respecting restrictions on memory and time.
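As a rough illustration of clustering in a single linear scan with bounded memory, the sketch below uses sequential (online) k-means. This particular algorithm is not named in the text; it is chosen here only because it keeps nothing but k centroids and k counters in memory. The function name `sequential_kmeans` and the choice to seed the centroids with the first k points are assumptions made for the example.

```python
from typing import Iterable, List, Sequence


def sequential_kmeans(stream: Iterable[Sequence[float]], k: int) -> List[List[float]]:
    """Cluster a stream in one pass, storing only k centroids and k counts."""
    centroids: List[List[float]] = []
    counts: List[int] = []
    for point in stream:  # each point is read exactly once, in arrival order
        if len(centroids) < k:
            # Assumption: the first k points seed the centroids.
            centroids.append([float(x) for x in point])
            counts.append(1)
            continue
        # Nearest centroid by squared Euclidean distance.
        j = min(
            range(k),
            key=lambda i: sum((c - x) ** 2 for c, x in zip(centroids[i], point)),
        )
        # Move the centroid toward the point; the step shrinks as the
        # cluster grows, so the centroid stays the running mean of its points.
        counts[j] += 1
        step = 1.0 / counts[j]
        centroids[j] = [c + step * (x - c) for c, x in zip(centroids[j], point)]
    return centroids


# Usage: a generator stands in for a stream that cannot be rewound.
centers = sequential_kmeans(iter([(0, 0), (10, 10), (0, 2), (10, 12)]), k=2)
```

Memory use is independent of the stream length, which is the property the stream model asks for. The price is that the result depends on arrival order and on the seeding.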
In the literature on data stream clustering, many algorithms use a two-phase scheme: an online component processes the incoming data points and produces summary statistics, and an offline component uses the summary data to generate the clusters. An alternative class of algorithms generates the final clusters without an offline phase. Along with a comprehensive survey of data stream clustering, the paper presents an overview of the best-known streaming platforms that implement clustering.
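The two-phase scheme can be sketched as follows. This is a minimal illustration rather than any specific published algorithm: the online phase keeps one summary per micro-cluster (a point count and a per-dimension linear sum), and the offline phase runs weighted k-means over the micro-cluster centroids. The names `MicroCluster`, `online_phase`, and `offline_phase`, the fixed `radius` threshold, and the seeding rule are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class MicroCluster:
    n: int                    # number of points absorbed
    linear_sum: List[float]   # per-dimension sum of absorbed points

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point: Sequence[float]) -> None:
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]


def online_phase(stream: Iterable[Sequence[float]], radius: float) -> List[MicroCluster]:
    """One pass over the stream; raw points are discarded after summarizing."""
    summaries: List[MicroCluster] = []
    for point in stream:
        nearest = min(
            summaries, key=lambda m: math.dist(m.centroid(), point), default=None
        )
        if nearest is not None and math.dist(nearest.centroid(), point) <= radius:
            nearest.absorb(point)
        else:
            summaries.append(MicroCluster(1, [float(x) for x in point]))
    return summaries


def offline_phase(summaries: List[MicroCluster], k: int, iterations: int = 10) -> List[List[float]]:
    """Weighted k-means over micro-cluster centroids; never touches raw data."""
    points = [m.centroid() for m in summaries]
    weights = [m.n for m in summaries]
    # Assumption: seed with the k heaviest micro-clusters.
    heaviest = sorted(range(len(points)), key=lambda i: -weights[i])[:k]
    centers = [points[i][:] for i in heaviest]
    for _ in range(iterations):
        sums = [[0.0] * len(points[0]) for _ in centers]
        totals = [0] * len(centers)
        for p, w in zip(points, weights):
            j = min(range(len(centers)), key=lambda i: math.dist(centers[i], p))
            totals[j] += w
            sums[j] = [s + w * x for s, x in zip(sums[j], p)]
        # Keep the old center if no micro-cluster was assigned to it.
        centers = [
            [s / t for s in row] if t else c
            for row, t, c in zip(sums, totals, centers)
        ]
    return centers


stream = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
summaries = online_phase(iter(stream), radius=2.0)
clusters = offline_phase(summaries, k=2)
```

Because the offline phase sees only the summaries, it can be rerun on demand with a different `k` without revisiting the stream.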
The data stream model and the online or incremental model are similar in that both require decisions to be made before all the data are available. Clustering is a useful and ubiquitous tool in data analysis: it partitions data so that, under some definition of “similarity,” similar items are in the same part of the partition and different items are in different parts. There is a significant difference between the clustering quality achieved by streaming techniques and the clustering quality of running the algorithm on all the data at once. The results will be compared across different streaming algorithms, and a similar tradeoff between clustering quality and running time will be found. For most natural clustering objective functions, the optimization problems turn out to be NP-hard.
Therefore, most theoretical work is in the design of approximation algorithms: algorithms that guarantee a solution whose objective function value is within a fixed factor of the value of the optimal solution. Novel clustering approaches could be derived by drawing on data analytics tools at different levels, including big data analytics. Because the distribution of a data stream may vary over time, data stream clustering algorithms should incorporate outlier detection mechanisms that can differentiate between actual outliers and cluster evolution. Outlier detection mechanisms can be divided into statistical-based approaches and density-based approaches. Although most algorithms cluster objects, there are also works that perform attribute clustering, which is also known as variable clustering. Attribute clustering is usually considered a batch offline procedure, in which the common strategy is to employ a traditional clustering algorithm over the transposed data matrix.
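The transposed-matrix strategy for attribute clustering can be sketched in a few lines. The example below transposes the data so that each attribute becomes one item, then applies single-linkage agglomerative clustering as the "traditional" algorithm. The distance 1 - |r| (with r the Pearson correlation), the linkage rule, and the function names `pearson` and `cluster_attributes` are illustrative assumptions, not choices stated in the text.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; assumes neither attribute is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def cluster_attributes(rows: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Group attribute (column) indices into k clusters."""
    # Transpose: each column of the data matrix becomes one item to cluster.
    columns = [list(col) for col in zip(*rows)]
    groups = [[j] for j in range(len(columns))]

    def distance(g: List[int], h: List[int]) -> float:
        # Single linkage over the assumed distance 1 - |correlation|.
        return min(1 - abs(pearson(columns[a], columns[b])) for a in g for b in h)

    while len(groups) > k:
        # Merge the closest pair of groups.
        i, j = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: distance(groups[ij[0]], groups[ij[1]]),
        )
        merged = groups.pop(j)  # j > i, so index i is unaffected
        groups[i].extend(merged)
    return groups


# Columns 0 and 1 are perfectly correlated, so they end up together.
rows = [(1, 2, 5), (2, 4, 3), (3, 6, 4), (4, 8, 1)]
attribute_groups = cluster_attributes(rows, k=2)
```

This needs the full matrix in memory, which is why the procedure is normally treated as batch and offline rather than streaming.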
Data Science and Big Data Analytics
Join in for a session on Data Science and Big Data Analytics on July 22, 2017
Time : 10.30 AM to 11.30 AM EST
Limited Seats. Book now at 331-999-0059 or [email protected]