With continued advances in communication network technology and sensing
technology, there has been an astounding growth in the amount of data
produced and made available through cyberspace.
Efficient, high-quality clustering of large datasets continues to be one
of the most important problems in large-scale data analysis. A commonly used
methodology for cluster analysis on large datasets is the three-phase
framework of "sampling/summarization - iterative cluster analysis -
disk-labeling". There are three known problems with this framework, which
demand effective solutions. The first problem is how to effectively define
and validate irregularly shaped clusters, especially in large datasets.
Automated algorithms and statistical methods are typically not effective in
handling such clusters. The second problem is how to effectively
label the entire data on disk (disk-labeling) without introducing additional
errors, including the solutions for dealing with outliers, irregular
clusters, and cluster boundary extension. The third problem is the lack of
research on how to integrate the three phases effectively.
The iVIBRATE project studies an interactive-visualization based three-phase
framework for clustering large datasets. The two main components of iVIBRATE
are its VISTA visual cluster rendering
subsystem, which brings the human user into the large-scale iterative
clustering process through interactive visualization, and its Adaptive
ClusterMap Labeling subsystem, which offers visualization-guided
disk-labeling solutions that are effective in dealing with outliers,
irregular clusters, and cluster boundary extension. Another important
contribution of the iVIBRATE work is identifying the special issues
that arise in integrating the two components and the sampling approach into a
coherent framework, along with solutions that improve the reliability of the
framework and minimize the number of errors generated throughout the
cluster analysis process. We study the effectiveness of the iVIBRATE
framework through a walkthrough example on a dataset of one million records
and experimentally evaluate the iVIBRATE approach using both real-life
datasets and synthetic datasets. Our results show that iVIBRATE can
efficiently involve the user in the clustering process and generate
high-quality clustering results for large datasets.
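
The generic three-phase framework described above can be illustrated with a minimal sketch. It is not the iVIBRATE implementation: plain k-means stands in for the interactive VISTA phase, and nearest-centroid assignment with a fixed distance cutoff stands in for Adaptive ClusterMap labeling. The function names (`sample_records`, `kmeans`, `label_stream`) and the `outlier_radius` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_records(records, sample_size, seed=0):
    """Phase 1: draw a uniform random sample small enough to fit in memory."""
    rng = random.Random(seed)
    return rng.sample(records, min(sample_size, len(records)))


def kmeans(points, k, iterations=20, seed=0):
    """Phase 2 stand-in: plain k-means on the sample (assumption, not VISTA)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid; keep the old one if its group is empty.
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def label_stream(records, centroids, outlier_radius):
    """Phase 3 stand-in: label every record in one pass; -1 marks outliers."""
    for p in records:
        dists = [math.dist(p, c) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        yield best if dists[best] <= outlier_radius else -1
```

Because `label_stream` is a generator, the full dataset can be read from disk record by record without loading it into memory. A fixed cutoff such as `outlier_radius` assumes roughly spherical clusters, so it handles irregularly shaped clusters and boundary extension poorly, which is the weakness of automated labeling that the abstract points to.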
Representative papers:
- Keke Chen and Ling Liu: "iVIBRATE: Interactive Visualization Based Framework for Clustering Large Datasets." ACM Transactions on Information Systems (TOIS), 2006.
- Keke Chen and Ling Liu: "VISTA: Validating and Refining Clusters via Visualization." Journal of Information Visualization, Sept. 2004.
- Keke Chen and Ling Liu: "ClusterMap: Labeling Large Datasets via Visualization." ACM Conf. on Information and Knowledge Management (CIKM04), Washington DC, Nov. 2004.
- Keke Chen and Ling Liu: "Validating and Refining Clusters via Visual Rendering." Proc. of Intl. Conf. on Data Mining (ICDM03), Melbourne, FL, November 2003.
- Keke Chen and Ling Liu: "A Visual Framework Invites Human into the Clustering Process." Proc. of Scientific and Statistical Database Management (SSDBM03), Cambridge, Boston, July 2003.
- Keke Chen and Ling Liu: "Cluster Rendering of Skewed Datasets via Visualization." Proc. of ACM Symposium on Applied Computing (ACM SAC03), Melbourne, FL, March 2003.
- Keke Chen and Ling Liu: "A Comparative Study on Star-Coordinates Cluster Visualizations." Under review.