CloudVista: interactive visual analysis of large data in the cloud
Data Intensive Analysis and Computing (DIAC) Lab, the Ohio Center of Excellence on Knowledge Enabled Computing (Kno.e.sis Center)
Wright State University
Motivation:
With the deployment of more and more cloud applications, large datasets are now generated, stored, and analyzed in the cloud. A critical problem is how to analyze such large datasets efficiently using the parallel processing power provided by the cloud. A few relational data analysis techniques, data mining algorithms, and machine learning algorithms have been developed for handling data in the cloud. However, visual exploratory data analysis, such as visual cluster analysis, which is very important for human-involved knowledge discovery, presents unique challenges for data hosted in the cloud. We list some of the challenges:
First, new cloud-based processing frameworks and algorithms are needed to generate visualizations efficiently from large datasets. In particular, visualization models and data transformation algorithms that contain non-scalable components need to be redesigned to fit the scalable processing framework.
Second, processing large data inevitably brings significant delays, which conflict with the interactivity that visual analysis requires. Handling these delays gracefully while guaranteeing sufficient interactivity is the key to interactive visual data exploration.
Third, the economics of processing is a distinctive concern for cloud applications. We should understand the time-resource complexity of each component and design a workflow that minimizes both the financial cost and the latency.
The CloudVista approach:
We design the CloudVista approach for interactively visualizing data clusters in large datasets hosted in the cloud. CloudVista utilizes our previously developed VISTA visualization model and develops a set of methods to address the interactivity problem. We represent one static visualization as a visual frame. A series of interactive operations generates a number of continuously changing visual frames, which help the user understand the clustering structure in a dynamic way. The technical problems are: (1) how to generate one visual frame, and a batch of visual frames, in a parallel program; and (2) how to handle the delays between interactive operations.
The first problem can be easily addressed by using the VISTA visualization model. We address the second problem with two approaches. First, we change the traditional visual interactive model to a hybrid batch-interactive model. Second, we use the randomized continuous frame generation algorithm to generate a series of continuously changing visual frames.
Batch-interactive model: In the traditional interactive model, the system waits for the user’s interactive operation, which only changes the visualization parameters. With the new visualization parameters, the system updates the visualization.
This real-time feedback model is not realistic for data in the cloud because of the long delays caused by cloud-side processing. It is also difficult to predict the user’s next operation from historical operations. Therefore, we propose a batch-interactive model, as the following figure shows.
The main idea is to generate a batch of related visual frames on the server for each round of exploration; the user then explores this series of visual frames locally until a request for a new batch is issued. The key to implementing batch generation is to automatically generate a series of correlated visual frames, which the RandGen algorithm addresses.
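The batch-interactive loop can be illustrated with a minimal client-side sketch. It is not the CloudVista implementation: the class name, the `request_batch` callback, and the frame representation (a small grid of counts) are all hypothetical, chosen only to show that stepping through frames stays local while a new round costs one server request.

```python
from typing import Callable, List, Sequence

Frame = List[List[int]]  # hypothetical frame format: a grid of per-cell counts


class BatchInteractiveClient:
    """Explores a locally cached batch of frames; the server is contacted
    only when a new batch is requested (illustrative sketch)."""

    def __init__(self,
                 request_batch: Callable[[Sequence[float], int], List[Frame]],
                 batch_size: int) -> None:
        self.request_batch = request_batch  # stands in for the cloud-side job
        self.batch_size = batch_size
        self.frames: List[Frame] = []
        self.position = 0
        self.server_requests = 0

    def start_round(self, params: Sequence[float]) -> Frame:
        """Slow path: ask the server for a whole batch of related frames."""
        self.frames = self.request_batch(params, self.batch_size)
        self.position = 0
        self.server_requests += 1
        return self.frames[0]

    def step(self, delta: int) -> Frame:
        """Fast path: move within the cached batch, no server round trip."""
        if not self.frames:
            raise RuntimeError("call start_round() before step()")
        last = len(self.frames) - 1
        self.position = max(0, min(last, self.position + delta))
        return self.frames[self.position]


def fake_server_batch(params: Sequence[float], batch_size: int) -> List[Frame]:
    """Placeholder server: each 'frame' is a 1x1 grid holding its own index."""
    return [[[i]] for i in range(batch_size)]
```

With `client = BatchInteractiveClient(fake_server_batch, 5)`, one call to `client.start_round(...)` fetches five frames, and any number of `client.step(...)` calls afterwards leave `client.server_requests` unchanged.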
RandGen frame generation algorithm: The RandGen frame generation algorithm automatically generates a series of correlated visual frames. Its design is based on the special properties of the VISTA visualization model.
The VISTA visualization model is used to visualize the clustering structure of a multidimensional dataset. It maps the data from the high-dimensional space to the two-dimensional visual space, partially preserving the relative Euclidean distance relationships. As a result, the clustering structure (based on Euclidean distance) is only partially preserved: clusters that are well separated in the original space may be visualized as overlapping clusters. The VISTA visualization is determined by dimensional parameters. Continuously changing these parameters creates an “animation” of the visualized clustering structure.
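This page does not give the mapping formula, so the sketch below assumes a star-coordinates-style linear projection with one weight per dimension. The function name `vista_like_map`, the evenly spaced axis angles, and the `scale` argument are assumptions for illustration, not the published VISTA definition. It only shows how a set of per-dimension parameters can determine where a multidimensional record lands in the 2D visual space.

```python
import math
from typing import Sequence, Tuple


def vista_like_map(point: Sequence[float],
                   alphas: Sequence[float],
                   scale: float = 1.0) -> Tuple[float, float]:
    """Project a k-dimensional point to 2D (assumed star-coordinates form).

    Dimension i is drawn along an axis at angle 2*pi*i/k, and its
    contribution is weighted by the dimensional parameter alphas[i].
    """
    if len(point) != len(alphas):
        raise ValueError("point and alphas must have the same length")
    k = len(point)
    x = 0.0
    y = 0.0
    for i, (value, alpha) in enumerate(zip(point, alphas)):
        theta = 2.0 * math.pi * i / k
        x += alpha * value * math.cos(theta)
        y += alpha * value * math.sin(theta)
    return (scale * x / k, scale * y / k)
```

Because the mapping is linear in the data, changing one weight slightly moves every projected point slightly, which is consistent with the smooth "animation" described above.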
Unlike the normally used interactive dimensional tuning, RandGen generates the dimensional parameter settings for many frames in a batch. It starts with the user’s initial dimensional setting and then randomly changes the setting by -s or +s, each with 50% probability, without user intervention.
We have shown that the visualized distances follow a certain pattern in the randomized process: overall, close points stay close together on average, while distant points are visually separated with high probability.
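The random ±s update can be sketched as a simple random walk over the dimensional parameters. The description above leaves several details open, so this sketch assumes that every dimension is perturbed in every frame and that values are clamped to a fixed range; the function name, the `low`/`high` bounds, and the `seed` argument are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def randgen_settings(initial: Sequence[float],
                     step: float,
                     num_frames: int,
                     seed: Optional[int] = None,
                     low: float = -1.0,
                     high: float = 1.0) -> List[List[float]]:
    """Return num_frames parameter settings, starting from `initial`.

    Each new setting moves every parameter by -step or +step with equal
    probability (assumption: all dimensions change in each frame), then
    clamps the result to [low, high] (also an assumption).
    """
    rng = random.Random(seed)
    current = list(initial)
    settings = [list(current)]  # frame 0 uses the user's initial setting
    for _ in range(num_frames - 1):
        current = [
            min(high, max(low, value + rng.choice((-step, step))))
            for value in current
        ]
        settings.append(list(current))
    return settings
```

Consecutive settings differ by at most `step` in each dimension, so the frames rendered from them form a correlated series rather than unrelated views.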
The following figure shows a comparison between visualizing a sample set (one thousand records, 68 dimensions) with the VISTA system and visualizing the extended large set (25 million records, 68 dimensions). We can see that, by exploring the entire dataset, more details of the clustering structure can be observed (the right subfigure). The visual frame is encoded with density information and visualized with the heatmap technique. The more records are mapped to a cell, the warmer the color used for that cell; clusters are located in the dense areas.
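A density-encoded frame of this kind can be sketched as a grid of per-cell record counts. The function names, the square `resolution`, and the `bounds` tuple are assumptions for illustration; the color mapping is left out. Since the counts are additive, frames computed on separate data partitions can be summed cell by cell, which is one plausible way to build a frame in parallel (the actual parallel implementation is not described on this page).

```python
from typing import Iterable, List, Tuple

Grid = List[List[int]]


def density_frame(points_2d: Iterable[Tuple[float, float]],
                  resolution: int,
                  bounds: Tuple[float, float, float, float]) -> Grid:
    """Count how many projected points fall into each cell of a square grid.

    bounds is (xmin, xmax, ymin, ymax); points outside it are skipped.
    """
    xmin, xmax, ymin, ymax = bounds
    grid = [[0] * resolution for _ in range(resolution)]
    for x, y in points_2d:
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue
        # min() keeps points lying exactly on the upper edge in the last cell
        col = min(resolution - 1, int((x - xmin) / (xmax - xmin) * resolution))
        row = min(resolution - 1, int((y - ymin) / (ymax - ymin) * resolution))
        grid[row][col] += 1
    return grid


def merge_frames(a: Grid, b: Grid) -> Grid:
    """Sum two frames of the same shape cell by cell."""
    return [[ca + cb for ca, cb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]
```

A heatmap renderer would then map larger counts to warmer colors, so dense cells stand out as cluster regions.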
Related Resources:
1. A demo system will be available soon; check this page for details.
2. The original VISTA interactive visual cluster exploration system.
3. Some video clips: exploring extended Census data (25 million records, 68 dimensions, 5.3GB data) and extended KDDCUP99 data (40 million records, 41 dimensions, 13.5GB data)
4. Keke Chen, Huiqi Xu, Fengguang Tian, Shumin Guo: “CloudVista: visual cluster exploration for extreme scale data in the cloud”, Scientific and Statistical Database Management Conference, Portland, OR, 2011.