Volume 3, Issue 6, December 2015, Page: 88-94
Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study
Md. Bipul Hossen, Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
Md. Siraj-Ud-Doulah, Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
Aminul Hoque, Department of Statistics, Rajshahi University, Rajshahi, Bangladesh
Received: Dec. 5, 2015; Accepted: Dec. 14, 2015; Published: Dec. 30, 2015
DOI: 10.11648/j.cbb.20150306.12
Abstract
Microarrays are a well-established technique for understanding cellular functions by profiling transcriptomic data. Various analytical and statistical approaches have been developed to capture the overall features of high-dimensional microarray datasets. Agglomerative Hierarchical Clustering (AHC) is one of the most widely used approaches for cluster analysis of gene expression data; however, little work has been done to compare the performance of clustering methods on such data. Some authors compared three or four AHC methods, and others at most five. All of these authors recommended the complete linkage method to subsequent researchers for clustering gene expression data. This paper compares the performance of seven AHC methods for clustering gene expression data with respect to five major proximity measures, using the corrected Rand (cR) index to assess each clustering method. We found that Ward's method performed best among all of the AHC methods and that the Cosine proximity measure outperformed the other measures on both Affymetrix and cDNA datasets.
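To make the evaluation criterion concrete, the following is a minimal sketch of a corrected Rand index computation. It assumes the cR index is the chance-corrected Rand index in the Hubert and Arabie form, which the abstract does not spell out; the function name `corrected_rand_index` and the toy labels are illustrative and not taken from the paper.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_true, labels_pred):
    """Chance-corrected Rand index (assumed Hubert-Arabie form)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)

    # Contingency counts: objects shared by each (reference class, cluster) pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: partitions agree trivially

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy labels: a known sample grouping versus a clustering result.
reference = [0, 0, 1, 1]
clustering = [1, 1, 0, 0]
score = corrected_rand_index(reference, clustering)
```

In this toy case the two partitions are identical up to relabeling, so the score is 1. Values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.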
Keywords
Agglomerative Hierarchical Clustering, Proximity Measures, Corrected Rand Index, Gene Expression Data
To cite this article
Md. Bipul Hossen, Md. Siraj-Ud-Doulah, Aminul Hoque, Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study, Computational Biology and Bioinformatics. Vol. 3, No. 6, 2015, pp. 88-94. doi: 10.11648/j.cbb.20150306.12
Copyright
Copyright © 2015 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.