Cluster-Driven Navigation of the Query Space
Additional Material

Thibault Sellam - Martin Kersten

This page describes the experiments presented in the article Cluster-Driven Navigation of the Query Space, under review for IEEE TKDE.

Screencast

Blaeu is currently being developped as a commercial product. This screencast showcases an early prototype. The video may also be downloaded here (hosted by Google Drive).
The visualization used in the video is different from that used in the paper - the video uses trees, the paper uses radial pie charts - but the underlying principles are exactly the same.

Use cases

The first use case is based on a database provided by US Bureau Transportation Statistics, available here. We used the month of January 2010. The Hollywood dataset was curated by the Website Information is Beautiful Awards. You may find a copy here.

Experiments

Synthetic data

Data generator

Our generator operates in two steps. First, it creates subspaces by assigning each column to a group. Second, it generates the clusters for each groups. To do so, it draws samples from several multinomial Gaussian distributions.
We used the following parameters:

Parameter	Distribution	Value
Tuples	Constant	20,000
Columns	Constant	40
Columns per subspace	Uniform	[2,10]
Clusters per subspace	Constant	5
Centroid position	Uniform	[0,100]
Cluster radius	Uniform	[2,10]

Algorithms

We inferred several parameters from the characteristics of the data. The variable NCLU represents the number of clusters in each subspace, NSUB the number of subspaces, NROWS and NCOLS the number of rows and columns. We ran the experiment CMI + SimpleMap with two different set of parameters, and reported the best results.

SimpleMap	Max. number of clusters	NCLU
MultiMap	Max. number of clusters	NCLU
LightMap	Max. number of clusters	NCLU
CMI + SimpleMap	Beam size	400
	Alpha	1.0
	Seeds	10
	Num subspaces	NSUB
CMI + Simple Map (2)	Num subspaces	100
FIRES	Base DBSCAN eps.	0.5
	Base DBSCAN min pts.	NROWS/100
	Pre. min percent	25.0
	Graph split	0.66
	Graph k	ceil(NCOLS / NSUB)
	Graph mu	ceil(NCOLS / NSUB)
	Graph min clu	1
	Post DBSCAN eps.	0.5
	Post DBSCAN min pts.	25
PROCLUS	k	NCLU * NSUB
PROCLUS	Dim	NCOLS / NSUB

We ran each experiment five times and reported the average results.

Real data

Data

All the datasets we used were made available by the the UCI repository. For convenience, we gathered them in this archive (mutant1) and this archive (all the other sets).

Settings

With real-life datasets, we cannot calculate the best parameters a priori. Therefore, we tried several combinations, and reported the best results. The table below describes the settings we tried. For instance, we tested 10*10=100 combinations for PROCLUS. We did not change the other parameters.

SimpleMap	Max. number of clusters	2,3,5,8,13,21,34
MultiMap	Max. number of clusters	2,3,5,8,13,21,34
LightMap	Max. number of clusters	2,3,5,8,13,21,34
CMI + MultiMap.	Num subspaces	2^i, i in [1,8]
CMI + MultiMap.	Max. number of clusters	2,3,5,8,13,21,34
CMI + SimpleMap.	Num subspaces	2^i, i in [1,8]
CMI + SimpleMap.	Max. number of clusters	2,3,5,8,13,21,34
FIRES	Graph K	2*i, i in [1,5]
	Graph Mu	2*i, i in [1,5]
	Graph minclu	2^i, i in [0,2]
PROCLUS	k	2^i, i in [1,10]
PROCLUS	Dim	2^i, i in [1,10]

Sampling

The datasets used for the sampling experiments may be found here and here.

Software

We used the implementations of FIRES and PROCLUS provided with the Open Subspace suite.
You may find our implementation of SimpleMap, MultiMap and LightMap here. These scripts require R to be installed, with the packages cluster, dynamicTreeCut and rpart. On a Linux machine, they may be called as follows:


		# First, compile the Information Theory functions 

		(cd lib && R CMD SHLIB info_theory.c )



		# SimpleMap 

		./SimpleMap.R data_file out_file number_clusters [sample_size]


		# MultiMap 

		./MultiMap.R  data_file out_file number_clusters [sample_size] 


		# LightMap 

		./LightMap.R data_file out_file number_clusters

The folder lib contains a few primitives written in R and C (for the functions related to entropy, mutual information, variation of information). It should be stored in the same folder as the R scripts. The input file should be a headless CSV file, with the separator ";". We included an example in the archive. The output file contains the subspace-clusters. Each entry has the following format:


        [binary column vector] [n items] [item indices]

The column vector describes the subspace in which the cluster is relevant. The next field describes the number of items in the cluster. The third field describes the index of the tuples, starting at 0. For instance, the following entry indicates a 6 tuples cluster, based on the second and third columns:


        0 1 1 6 0 1 12 13 14 51

This format was initiated by the Open Subspace project. It is compatible with E4SC.

CWI DISCLAIMER

Cluster-Driven Navigation of the Query SpaceAdditional Material