A REVIEW ON HADOOP BASED DATA CLUSTERING

Authors

  • Dhirendra Pandey, Sandhya Satyarthi, Vandana Pandey, Virendra Singh

Keywords:

Big-data, Hadoop, Data Clustering, Document clustering, Text mining, MapReduce.

Abstract

Clustering problems are becoming more challenging as data volumes grow and unstructured data becomes harder to handle. Data that is too large and complex for today's DBMS tools to handle is known as Big Data. Processing time increases as the volume of data increases. Data clustering is a good way to make access to big data (reads and writes) efficient, because it separates dissimilar data. Common data clustering problems are considered NP-hard. To deal with this, a great deal of research is under way on the parallelization of resources and algorithms.

However, parallelization performs poorly when data items depend on one another, and it requires a dedicated setup. The alternative to a parallel architecture is a distributed environment, which can use the processing capacity of remote systems on the fly. Hadoop, as a distributed environment and framework, runs the same code on partitioned data and finally gathers the results in one place. A distributed environment is therefore very helpful for solving data clustering problems. Classifying the data, that is, distributing it into distinct classes, is needed to reduce this dependency. Data clustering groups similar items from a large data set so that they are available at a single point. A distributed environment has been proposed to solve the clustering problem for big data. This review surveys applications and outlines the research scope, to encourage researchers to solve big data clustering problems using a distributed computing architecture.
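
The pattern of running the same code on partitioned data and then gathering the results in one place can be illustrated with a single k-means iteration. The following is a minimal sketch in plain Python, not Hadoop code and not the method of any particular surveyed paper; the function names (`map_partition`, `reduce_partials`, `kmeans_iteration`) and the sample points are assumptions made for illustration only.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def map_partition(partition, centroids):
    """Map step (hypothetical helper): the same code runs on every partition.
    Emits (cluster_id, (vector_sum, count)) pairs for that partition only."""
    dim = len(centroids[0])
    partial = defaultdict(lambda: ([0.0] * dim, 0))
    for point in partition:
        k = nearest(point, centroids)
        s, n = partial[k]
        partial[k] = ([a + b for a, b in zip(s, point)], n + 1)
    return list(partial.items())

def reduce_partials(mapped, centroids):
    """Reduce step (hypothetical helper): gather the partial sums in one
    place and compute the new centroids."""
    totals = {}
    for k, (s, n) in mapped:
        ts, tn = totals.get(k, ([0.0] * len(s), 0))
        totals[k] = ([a + b for a, b in zip(ts, s)], tn + n)
    # A centroid with no assigned points keeps its previous position.
    return [
        [v / totals[k][1] for v in totals[k][0]] if k in totals else list(c)
        for k, c in enumerate(centroids)
    ]

def kmeans_iteration(partitions, centroids):
    """One iteration: map over each partition, then reduce the results."""
    mapped = [pair for part in partitions for pair in map_partition(part, centroids)]
    return reduce_partials(mapped, centroids)

# Illustrative data: two partitions, two initial centroids.
partitions = [[(1.0, 1.0), (9.0, 9.0)], [(2.0, 2.0), (10.0, 10.0)]]
centroids = [(0.0, 0.0), (8.0, 8.0)]
new_centroids = kmeans_iteration(partitions, centroids)
print(new_centroids)  # [[1.5, 1.5], [9.5, 9.5]]
```

Each partition only ever emits per-cluster sums and counts, so the amount of data gathered at the reduce step depends on the number of clusters rather than on the number of points.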

Published

2022-01-30