Cluster-Driven Navigation of the Query Space Additional
Material

Thibault Sellam - Martin Kersten

This page describes the experiments presented in the article
Cluster-Driven Navigation of the Query Space, under review for IEEE TKDE.

Screencast

Blaeu is currently being developped as a commercial product. This
screencast showcases an early prototype. The video may also be downloaded
here
(hosted by Google Drive). The visualization used in the video is
different from that used in the paper - the video uses trees, the paper uses
radial pie charts - but the underlying principles are exactly the same.

Use cases

The first use case is based on a database provided by US Bureau
Transportation Statistics, available here. We
used the month of January 2010. The Hollywood dataset was curated by the
Website Information is
Beautiful Awards. You may find a copy here.

Experiments

Synthetic data

Data generator

Our generator operates in two steps. First, it creates subspaces by assigning each column to a group. Second, it generates the clusters for each groups. To do so, it draws samples from several multinomial Gaussian distributions.
We used the following parameters:

Parameter

Distribution

Value

Tuples

Constant

20,000

Columns

Constant

40

Columns per subspace

Uniform

[2,10]

Clusters per subspace

Constant

5

Centroid position

Uniform

[0,100]

Cluster radius

Uniform

[2,10]

Algorithms

We inferred several parameters from the characteristics of the data.
The variable NCLU represents the number of clusters in each subspace,
NSUB the number of subspaces, NROWS and NCOLS the number of rows and
columns. We ran the experiment CMI + SimpleMap with two
different set of parameters, and reported the best results.

SimpleMap

Max. number of clusters

NCLU

MultiMap

Max. number of clusters

NCLU

LightMap

Max. number of clusters

NCLU

CMI + SimpleMap

Beam size

400

Alpha

1.0

Seeds

10

Num subspaces

NSUB

CMI + Simple Map (2)

Num subspaces

100

FIRES

Base DBSCAN eps.

0.5

Base DBSCAN min pts.

NROWS/100

Pre. min percent

25.0

Graph split

0.66

Graph k

ceil(NCOLS / NSUB)

Graph mu

ceil(NCOLS / NSUB)

Graph min clu

1

Post DBSCAN eps.

0.5

Post DBSCAN min pts.

25

PROCLUS

k

NCLU * NSUB

Dim

NCOLS / NSUB

We ran each experiment five times and reported the average results.

Real data

Data

All the datasets we used were made available by the the UCI repository. For
convenience, we gathered them in this
archive (mutant1) and this
archive (all the other sets).

Settings

With real-life datasets, we cannot calculate the best parameters a priori.
Therefore, we tried several combinations, and reported the best results.
The table below describes the settings we tried. For instance, we tested
10*10=100 combinations for PROCLUS. We did not change the other parameters.

SimpleMap

Max. number of clusters

2,3,5,8,13,21,34

MultiMap

Max. number of clusters

2,3,5,8,13,21,34

LightMap

Max. number of clusters

2,3,5,8,13,21,34

CMI + MultiMap.

Num subspaces

2^i, i in [1,8]

Max. number of clusters

2,3,5,8,13,21,34

CMI + SimpleMap.

Num subspaces

2^i, i in [1,8]

Max. number of clusters

2,3,5,8,13,21,34

FIRES

Graph K

2*i, i in [1,5]

Graph Mu

2*i, i in [1,5]

Graph minclu

2^i, i in [0,2]

PROCLUS

k

2^i, i in [1,10]

Dim

2^i, i in [1,10]

Sampling

The datasets used for the sampling experiments may be found here
and here.

Software

We used the implementations of FIRES and PROCLUS provided with the Open Subspace
suite. You may find our implementation of SimpleMap,
MultiMap and LightMap here. These scripts require
R to be installed,
with the packages cluster, dynamicTreeCut and rpart. On a Linux
machine, they may be called as follows:

# First, compile the Information Theory functions
(cd lib && R CMD SHLIB info_theory.c )

The folder lib contains a few primitives written in R and C (for the
functions related to entropy, mutual information, variation of
information). It should be stored in the same folder as the R scripts.
The input file should be a headless CSV file, with the separator ";".
We included an example in the archive. The output file contains the
subspace-clusters. Each entry has the following format:

[binary column vector] [n items] [item indices]

The column vector describes the subspace in which the cluster is
relevant. The next field describes the number of items in the cluster. The
third field describes the index of the tuples, starting at 0. For instance, the
following entry indicates a 6 tuples cluster, based on the second and third columns: