A REVIEW ON HADOOP BASED DATA CLUSTERING
Keywords:
Big data, Hadoop, Data clustering, Document clustering, Text mining, MapReduce.
Abstract
Clustering problems are becoming more challenging as data volumes grow and unstructured data becomes more complex to handle. Data that is too large and too complex for today's DBMS tools to handle is known as Big Data. Processing time grows with the amount of data. Data clustering is a good way to access (read/write) big data efficiently, because it separates dissimilar data. Common clustering formulations are considered NP-hard problems. To deal with this, extensive research is under way on parallelizing resources and algorithms.
However, parallelization performs poorly when data items depend on one another, and it requires a dedicated setup. The alternative to a parallel architecture is a distributed environment, which can use the processing capacity of remote systems on the fly. Hadoop, as a distributed environment and framework, runs the same code on partitioned data and finally gathers the results in one place. A distributed environment is therefore very helpful for solving data clustering problems. Classifying data, or distributing it into distinct classes, is needed to reduce this dependency. Data clustering brings similar data from a large data set together at a single point. A distributed environment has been proposed to solve the data clustering problem for big data. This dissertation report surveys applications and outlines the research scope, to encourage researchers to solve big data clustering problems using distributed computing architectures.
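The execution model described above, where the same code runs on every data partition and the results are gathered in one place, can be illustrated with a single k-means iteration written in MapReduce style. This is a minimal single-process Python sketch, not Hadoop code and not the method of any particular surveyed work. The function names (`map_partition`, `reduce_centroids`, `kmeans_iteration`), the two in-memory partitions, and the starting centroids are all assumptions made for illustration.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(partition, centroids):
    """Map step (hypothetical name): the same code runs on every partition.

    Emits (cluster_id, (coordinate_sums, count)) pairs, so only small
    partial results leave the partition, not the raw points.
    """
    partial = {}
    for point in partition:
        cid = nearest(point, centroids)
        total, count = partial.get(cid, ([0.0] * len(point), 0))
        partial[cid] = ([t + p for t, p in zip(total, point)], count + 1)
    return list(partial.items())


def reduce_centroids(mapped, centroids):
    """Reduce step (hypothetical name): gather partial sums in one place."""
    sums = {}
    for cid, (total, count) in mapped:
        acc_total, acc_count = sums.get(cid, ([0.0] * len(total), 0))
        sums[cid] = ([a + t for a, t in zip(acc_total, total)], acc_count + count)
    new_centroids = list(centroids)  # a centroid with no points keeps its old position
    for cid, (total, count) in sums.items():
        new_centroids[cid] = tuple(t / count for t in total)
    return new_centroids


def kmeans_iteration(partitions, centroids):
    """One iteration: map over each partition independently, then reduce once."""
    mapped = [pair for part in partitions for pair in map_partition(part, centroids)]
    return reduce_centroids(mapped, centroids)


# Illustrative data: two partitions standing in for blocks held on different nodes.
partitions = [
    [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0)],
    [(2.0, 1.0), (8.0, 9.0), (9.0, 8.0)],
]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # assumed starting centroids
for _ in range(5):
    centroids = kmeans_iteration(partitions, centroids)
print(centroids)
```

Each call to `map_partition` depends only on its own partition and the current centroids, which is what allows the partitions to be processed on separate machines. The dependency between data items is confined to the reduce step, where the partial sums are combined.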