Cluster-Driven Navigation of the Query Space Additional
Material
Thibault Sellam - Martin Kersten
This page describes the experiments presented in the article
Cluster-Driven Navigation of the Query Space, under review for IEEE TKDE.
Screencast
Blaeu is currently being developped as a commercial product. This
screencast showcases an early prototype. The video may also be downloaded
here
(hosted by Google Drive). The visualization used in the video is
different from that used in the paper - the video uses trees, the paper uses
radial pie charts - but the underlying principles are exactly the same.
Use cases
The first use case is based on a database provided by US Bureau
Transportation Statistics, available here. We
used the month of January 2010. The Hollywood dataset was curated by the
Website Information is
Beautiful Awards. You may find a copy here.
Experiments
Synthetic data
Data generator
Our generator operates in two steps. First, it creates subspaces by assigning each column to a group. Second, it generates the clusters for each groups. To do so, it draws samples from several multinomial Gaussian distributions.
We used the following parameters:
Parameter
Distribution
Value
Tuples
Constant
20,000
Columns
Constant
40
Columns per subspace
Uniform
[2,10]
Clusters per subspace
Constant
5
Centroid position
Uniform
[0,100]
Cluster radius
Uniform
[2,10]
Algorithms
We inferred several parameters from the characteristics of the data.
The variable NCLU represents the number of clusters in each subspace,
NSUB the number of subspaces, NROWS and NCOLS the number of rows and
columns. We ran the experiment CMI + SimpleMap with two
different set of parameters, and reported the best results.
SimpleMap
Max. number of clusters
NCLU
MultiMap
Max. number of clusters
NCLU
LightMap
Max. number of clusters
NCLU
CMI + SimpleMap
Beam size
400
Alpha
1.0
Seeds
10
Num subspaces
NSUB
CMI + Simple Map (2)
Num subspaces
100
FIRES
Base DBSCAN eps.
0.5
Base DBSCAN min pts.
NROWS/100
Pre. min percent
25.0
Graph split
0.66
Graph k
ceil(NCOLS / NSUB)
Graph mu
ceil(NCOLS / NSUB)
Graph min clu
1
Post DBSCAN eps.
0.5
Post DBSCAN min pts.
25
PROCLUS
k
NCLU * NSUB
Dim
NCOLS / NSUB
We ran each experiment five times and reported the average results.
Real data
Data
All the datasets we used were made available by the the UCI repository. For
convenience, we gathered them in this
archive (mutant1) and this
archive (all the other sets).
Settings
With real-life datasets, we cannot calculate the best parameters a priori.
Therefore, we tried several combinations, and reported the best results.
The table below describes the settings we tried. For instance, we tested
10*10=100 combinations for PROCLUS. We did not change the other parameters.
SimpleMap
Max. number of clusters
2,3,5,8,13,21,34
MultiMap
Max. number of clusters
2,3,5,8,13,21,34
LightMap
Max. number of clusters
2,3,5,8,13,21,34
CMI + MultiMap.
Num subspaces
2^i, i in [1,8]
Max. number of clusters
2,3,5,8,13,21,34
CMI + SimpleMap.
Num subspaces
2^i, i in [1,8]
Max. number of clusters
2,3,5,8,13,21,34
FIRES
Graph K
2*i, i in [1,5]
Graph Mu
2*i, i in [1,5]
Graph minclu
2^i, i in [0,2]
PROCLUS
k
2^i, i in [1,10]
Dim
2^i, i in [1,10]
Sampling
The datasets used for the sampling experiments may be found here
and here.
Software
We used the implementations of FIRES and PROCLUS provided with the Open Subspace
suite. You may find our implementation of SimpleMap,
MultiMap and LightMap here. These scripts require
R to be installed,
with the packages cluster, dynamicTreeCut and rpart. On a Linux
machine, they may be called as follows:
# First, compile the Information Theory functions
(cd lib && R CMD SHLIB info_theory.c )
The folder lib contains a few primitives written in R and C (for the
functions related to entropy, mutual information, variation of
information). It should be stored in the same folder as the R scripts.
The input file should be a headless CSV file, with the separator ";".
We included an example in the archive. The output file contains the
subspace-clusters. Each entry has the following format:
[binary column vector] [n items] [item indices]
The column vector describes the subspace in which the cluster is
relevant. The next field describes the number of items in the cluster. The
third field describes the index of the tuples, starting at 0. For instance, the
following entry indicates a 6 tuples cluster, based on the second and third columns: