Cluster-Driven Navigation of the Query Space
Additional Material

Thibault Sellam - Martin Kersten

This page describes the experiments presented in the article Cluster-Driven Navigation of the Query Space, under review for IEEE TKDE.

Screencast


Blaeu is currently being developped as a commercial product. This screencast showcases an early prototype. The video may also be downloaded here (hosted by Google Drive).
The visualization used in the video is different from that used in the paper - the video uses trees, the paper uses radial pie charts - but the underlying principles are exactly the same.

Use cases

The first use case is based on a database provided by US Bureau Transportation Statistics, available here. We used the month of January 2010. The Hollywood dataset was curated by the Website Information is Beautiful Awards. You may find a copy here.

Experiments

Synthetic data

Data generator

Our generator operates in two steps. First, it creates subspaces by assigning each column to a group. Second, it generates the clusters for each groups. To do so, it draws samples from several multinomial Gaussian distributions.
We used the following parameters:
Parameter Distribution Value
Tuples Constant 20,000
Columns Constant 40
Columns per subspace Uniform [2,10]
Clusters per subspace Constant 5
Centroid position Uniform [0,100]
Cluster radius Uniform [2,10]

Algorithms

We inferred several parameters from the characteristics of the data. The variable NCLU represents the number of clusters in each subspace, NSUB the number of subspaces, NROWS and NCOLS the number of rows and columns. We ran the experiment CMI + SimpleMap with two different set of parameters, and reported the best results.
SimpleMap Max. number of clusters NCLU
MultiMap Max. number of clusters NCLU
LightMap Max. number of clusters NCLU
CMI + SimpleMap Beam size 400
Alpha 1.0
Seeds 10
Num subspaces NSUB
CMI + Simple Map (2) Num subspaces 100
FIRES Base DBSCAN eps. 0.5
Base DBSCAN min pts. NROWS/100
Pre. min percent 25.0
Graph split 0.66
Graph k ceil(NCOLS / NSUB)
Graph mu ceil(NCOLS / NSUB)
Graph min clu 1
Post DBSCAN eps. 0.5
Post DBSCAN min pts. 25
PROCLUS k NCLU * NSUB
Dim NCOLS / NSUB
We ran each experiment five times and reported the average results.

Real data

Data

All the datasets we used were made available by the the UCI repository. For convenience, we gathered them in this archive (mutant1) and this archive (all the other sets).

Settings

With real-life datasets, we cannot calculate the best parameters a priori. Therefore, we tried several combinations, and reported the best results. The table below describes the settings we tried. For instance, we tested 10*10=100 combinations for PROCLUS. We did not change the other parameters.
SimpleMap Max. number of clusters 2,3,5,8,13,21,34
MultiMap Max. number of clusters 2,3,5,8,13,21,34
LightMap Max. number of clusters 2,3,5,8,13,21,34
CMI + MultiMap. Num subspaces 2^i, i in [1,8]
Max. number of clusters 2,3,5,8,13,21,34
CMI + SimpleMap. Num subspaces 2^i, i in [1,8]
Max. number of clusters 2,3,5,8,13,21,34
FIRES Graph K 2*i, i in [1,5]
Graph Mu 2*i, i in [1,5]
Graph minclu 2^i, i in [0,2]
PROCLUS k 2^i, i in [1,10]
Dim 2^i, i in [1,10]

Sampling

The datasets used for the sampling experiments may be found here and here.

Software

We used the implementations of FIRES and PROCLUS provided with the Open Subspace suite.
You may find our implementation of SimpleMap, MultiMap and LightMap here. These scripts require R to be installed, with the packages cluster, dynamicTreeCut and rpart. On a Linux machine, they may be called as follows:
# First, compile the Information Theory functions
(cd lib && R CMD SHLIB info_theory.c )

# SimpleMap
./SimpleMap.R data_file out_file number_clusters [sample_size]

# MultiMap
./MultiMap.R data_file out_file number_clusters [sample_size]

# LightMap
./LightMap.R data_file out_file number_clusters
The folder lib contains a few primitives written in R and C (for the functions related to entropy, mutual information, variation of information). It should be stored in the same folder as the R scripts. The input file should be a headless CSV file, with the separator ";". We included an example in the archive. The output file contains the subspace-clusters. Each entry has the following format:
[binary column vector] [n items] [item indices]
The column vector describes the subspace in which the cluster is relevant. The next field describes the number of items in the cluster. The third field describes the index of the tuples, starting at 0. For instance, the following entry indicates a 6 tuples cluster, based on the second and third columns:
0 1 1 6 0 1 12 13 14 51
This format was initiated by the Open Subspace project. It is compatible with E4SC.
CWI DISCLAIMER