The CloudVista Demo System
1 Introduction
Analysis of big data has become an important problem for many business and scientific applications, among which clustering and visualizing clusters in big data raise unique challenges. This demonstration presents the CloudVista prototype system, which addresses the problems that existing data-reduction approaches cause for big data. It promotes a whole-big-data visualization approach that preserves the details of the clustering structure. The prototype system has several merits. (1) Its visualization model is naturally parallel, which guarantees scalability. (2) The visual frame structure minimizes the data transferred between the cloud and the client. (3) The RandGen algorithm is used to achieve a good balance between interactivity and batch processing. (4) The approach is also designed to minimize the financial cost of interactive exploration in the cloud. The demonstration will highlight the problems with existing approaches and show the advantages of the CloudVista approach. Viewers will have the chance to play with the CloudVista prototype system and compare the visualization results generated with different approaches.
2 The demo system
2.1 Sample exploration video
2.2 Client-cloud demo system
Download the demo system with a sample exploration: cloudvista-1.1.zip.
2.3 About the dataset on the cloud
This demo system handles high-dimensional numeric data. Each row should contain one space-separated data record.
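For example, a small three-dimensional dataset file might look like this (values are hypothetical):

0.12 3.45 2.71
1.30 0.07 4.20
0.88 2.19 0.54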
Data normalization (either max-min or Gaussian standardization) is required to make sure the visualized data frame is centered in the visual window.
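As an illustration, max-min normalization rescales each column to the range [0, 1]. A minimal two-pass sketch with awk (the file names data.txt and data-normalized.txt are placeholders):

$ awk '
# pass 1: record per-column min and max
NR == FNR {
    for (i = 1; i <= NF; i++) {
        if (FNR == 1 || $i < min[i]) min[i] = $i
        if (FNR == 1 || $i > max[i]) max[i] = $i
    }
    next
}
# pass 2: rescale every value to [0, 1]
{
    for (i = 1; i <= NF; i++) {
        range = max[i] - min[i]
        printf "%s%s", (range ? ($i - min[i]) / range : 0), (i < NF ? " " : "\n")
    }
}' data.txt data.txt > data-normalized.txt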
2.4 Related publications
CloudVista: Interactive and Economical Visual Cluster Analysis for Big Data in the Cloud. Demo, VLDB, Istanbul, Turkey, 2012.
CloudVista: Visual Cluster Exploration for Extreme Scale Data in the Cloud. Full paper, SSDBM, Portland, Oregon, 2011.
3 Installation Instructions
3.1 Prerequisites
Java version >= 1.6
Working Hadoop cluster
3.2 Hadoop Cluster Configuration
The demo system does not have any special requirements on the Hadoop cluster. You can use the default settings for all Hadoop configuration parameters; for more configuration details, please refer to the official Hadoop documentation.
3.3 Private Key Generation for Passwordless Login to the Hadoop Server
The demo system uses the SSH protocol for communication between the client and the Hadoop server, so a private key file is required for login. Private key generation differs between Windows and Linux; please refer to the respective instructions in the following sections.
- Generating a private key on a Windows system.
We recommend the open-source PuTTY suite's key generator, PuTTYgen, which can be downloaded from the PuTTY website. The AWS documentation on generating private keys with PuTTY is another good reference.
After generating the private key, copy the public key to the Hadoop server and configure the demo system with the path to the private key on the local file system.
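For example, if the public key text has been saved on the server in OpenSSH format as cloudvista.pub (a placeholder name), it can be appended to the authorized keys file like this:

$ cat cloudvista.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys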
- Generating a private key on a Linux system.
We recommend OpenSSH for generating key files under Linux; it is installed by default in most Linux distributions.
You can use the following command to generate a key pair:
$ ssh-keygen
Next, use the following command to copy your public key to the cloud/Hadoop server:
$ ssh-copy-id <username>@<server>
Make sure passwordless login is correctly configured by logging out from the server and logging in again. If the login succeeds without prompting for a password, the configuration is correct.
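For example (user and host names are placeholders):

$ ssh <username>@<server>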
The next step is to tell the demo system where to find the private key file on the local file system (see Section 3.4).
3.4 Demo System Configuration
Users can modify configurations through the user interface.
[Figure: CloudVista Client main window]
The following configuration items are required to run the demo system:
- Local working directory: defaults to the directory where the program is run.
- Remote/server working directory (required): it stores a mirror of the local working directory and will be used to sync from the server to the local machine. This directory is usually under the home directory of the user whose private key was generated.
- Map/Reduce jar file (RR.jar) on the server: put it into the root working directory on the server (see the example after this list).
- Remote/server Hadoop HDFS directory: where the dataset is hosted for MapReduce jobs.
- Path to the private key file on the local file system.
- The user name used to log in to the cloud server.
- The server host name (DNS name) or IP address.
- The server SSHD service port number (default is 22).
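As an illustration, a completed configuration might look like the following (all values are hypothetical):

Local working directory:   /home/alice/cloudvista
Remote working directory:  /home/hadoop/cloudvista
Map/Reduce jar file:       /home/hadoop/cloudvista/RR.jar
HDFS dataset directory:    /user/hadoop/cloudvista/datasets
Private key file:          /home/alice/.ssh/id_rsa
User name:                 hadoop
Server host:               hadoop-master.example.com
SSH port:                  22

The RR.jar file can be copied into the server working directory with a standard tool such as scp, for example:

$ scp RR.jar hadoop@hadoop-master.example.com:/home/hadoop/cloudvista/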
4 Running the demo
4.1 Start program
To start up the demo program, first unpack the whole software package with an archive tool such as unzip.
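For example, with the package from Section 2.2:

$ unzip cloudvista-1.1.zip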
Then run the startup script from the command line.
For Windows, start a command prompt window, cd to the directory where you unpacked the software, and type the following command:
> run.bat
For Linux, start a terminal, change to the unpacked directory, and run the following command:
$ ./run.sh
5 The user interface
The demo client window contains two major areas: visualization on the left and exploration manipulation on the right.
5.1 The Visualization Region
On the left is the visual frame visualization window, in which data points are colored by density: in general, the lower the density of a point, the deeper black it appears, and the higher the density, the darker red it appears. This makes it easy to identify interesting clusters buried in high-dimensional datasets.
Users can operate on the visual frames, for example zooming in and out and moving the frames in different directions within the visual window. Visual explorations are created in batch: a number of visual frames are generated in one exploration. To inspect these frames one by one, the user can use the auto-play mode, in which the visual frames of the exploration are displayed in sequence. The visual frame play control tool bar makes it easy to control the auto-play of visual frames.
5.2 The Management Area
On the right side of the client window is the exploration manipulation and management area. From this area, you can create new explorations and edit, delete, and visualize existing explorations. These operations are available through a handy popup menu opened by right-clicking a node in the tree.
The tree view shows the structure of the explorations that currently exist on the client machine. The root of the tree is "Datasets"; the next level contains all currently available datasets. Under each dataset, the explorations of that dataset are kept. The deepest level of the tree corresponds to subset explorations generated from an existing exploration by selecting a subset region.
Below the tree structure is an area for editing existing explorations or modifying parameters when creating new explorations. The "Save" button is used to apply the changes.
At the bottom of the management area lies the system status area, which shows operations on the client.
6 Configure from Main Window
After the program is started, you can configure the aforementioned settings by clicking the configure button in the upper-right corner of the window.
After filling in all the fields, click the "OK" button.
[Figure: System configuration]
7 Manipulation Instructions
7.1 Dataset operations
If the HDFS directory is correctly set, all available datasets (directories under the HDFS dataset root) will be automatically synced to the local client. These datasets are read-only during all operations.
You can add a new dataset to the HDFS root directory, and it will be synced to the local client automatically.
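For example, a new dataset directory can be uploaded with the standard HDFS shell (the dataset name and HDFS path are placeholders):

$ hadoop fs -put sampleDs <HDFS-dataset-root>/sampleDs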
- NOTE: The auto-sync feature is currently not available, so you need to manually create the dataset directory in the local working directory to make the exploration work. For example, if you have a dataset named sampleDs, you need to create a directory $EXPLOREDIR/sampleDs, as shown below. If the directory is not created, you will not be able to create explorations.
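For example (assuming $EXPLOREDIR refers to the exploration directory under the local working directory):

$ mkdir -p $EXPLOREDIR/sampleDs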
7.2 Exploration Operations
Explorations are organized in a tree; you can create, edit, visualize, delete, and visually explore explorations.
- To create a new exploration, right-click a dataset item and select "New" in the popup menu. You will be asked to enter the exploration name and to set up other exploration parameters under the exploration tree window.
- To edit an exploration, right-click the desired exploration item in the tree view and select "Edit"; you can then edit the parameters of the exploration. After the parameters are changed, click the "Save" button, or right-click the item and select "Save" from the popup menu.
- To visualize an exploration, right-click the desired exploration item in the exploration tree view and select "Visualize" from the popup menu. If the selected exploration has been built previously, it will be loaded from the local cache, which is fast. Otherwise, a Hadoop Map/Reduce job will be started to build the exploration on the server; after the Hadoop job is done, the visual frame files are downloaded from the server to the client and loaded into the visualizer.
- To delete an exploration, right-click the desired exploration and select "Delete" from the popup menu. The item will disappear from the client window as well as from the local exploration directory, but the server keeps a backup of the exploration (current behavior), which can be deleted if desired.
- To visually explore an exploration, first visualize it in the left window of the client program. You can move the visual frame left, right, up, and down; zoom in and out; and play the visual frames frame by frame automatically. Most importantly, you can select an interesting area for more detailed exploration (subset exploration, see the next section).
7.3 Subset Exploration Operations
A subset exploration is created from an existing exploration by selecting an interesting area. More details about the data clusters can be viewed through subset explorations.
The life cycle of a subset exploration starts when a sub-area of a visual frame is selected. You first need to specify the name of the sub-exploration, and you can also set parameters as for a regular exploration. All regular exploration operations are supported by a subset exploration, except that it cannot be created on the fly; it must be derived from an existing exploration.
8 Related Publications
Please cite our paper using the following BibTeX entry:
@article{Xu:2012:CIE:2367502.2367529,
  author     = {Xu, Huiqi and Li, Zhen and Guo, Shumin and Chen, Keke},
  title      = {CloudVista: interactive and economical visual cluster analysis for big data in the cloud},
  journal    = {Proc. VLDB Endow.},
  issue_date = {August 2012},
  volume     = {5},
  number     = {12},
  month      = aug,
  year       = {2012},
  issn       = {2150-8097},
  pages      = {1886--1889},
  numpages   = {4},
  url        = {http://dl.acm.org/citation.cfm?id=2367502.2367529},
  acmid      = {2367529},
  publisher  = {VLDB Endowment},
}
9 Project members
- Advisor: Keke Chen
- Shumin Guo, Ph.D. Candidate
- Huiqi Xu
- Zhen Li
10 Copyright Notice
This software is for research purposes only. Several third-party software packages were used during the development of this demo; we would like to thank the authors of these packages.