HOWTO: Use the clusterworkspace to conduct clustering experiments

WARNING! The use of all scripts provided here is at your own risk! We give NO WARRANTY that this will work for you.

There are two bash scripts available for experiments:

  /clusterworkspace/scripts/test.sh
  /clusterworkspace/scripts/testfix.sh

which differ in just two points: First, testfix.sh scales the coreset size / chunk size with the parameter k (the number of centers) using a fixed factor. By default, this factor is set to 200, but it may be edited in testfix.sh if needed. Second, testfix.sh takes different starting parameters than test.sh, since it is no longer necessary to specify the coreset size. Instead of the parameters "cstart cincrement cend", the parameters "performstream performlocalsearch performbirch" are used to toggle the desired algorithms on or off.

In the following, we give a short overview of the usage of test.sh (the usage of testfix.sh should become clear from it).

1. Input files
------------------------------

Although StreamKM++ and StreamLS internally use a whitespace-separated file format, all input data must be provided as a comma-separated file (csv), which is usually the case anyway. Invalid entries such as table headers or non-numerical values should be deleted before starting the script!

2. Executing test.sh
-------------------------------

The syntax for test.sh is:

$ ./test.sh path kstart kincrement kend cstart cincrement cend repetitions

Parameters:

path: absolute path to the csv input file
kstart, kincrement, kend: specify the number of centers k
cstart, cincrement, cend: specify the coreset size / chunk size for the algorithms StreamKM++ and StreamLS
repetitions: specifies how often the experiment is repeated for each combination of k and c

Example:

$ ./test.sh /privat/mmarcus/datasets/cov.data 10 5 30 1000 1000 1000 10

This call clusters cov.data for the center counts k = 10, 15, 20, 25, 30, each time with coreset size c = 1000. Every algorithm is called 10 times for each combination of k and c (so 50 runs are performed for StreamKM++). The coreset size can be varied in the same way as the number of centers, for example:

$ ./test.sh /privat/mmarcus/datasets/cov.data 10 5 30 1000 1000 3000 10

performs the same experiments as above, but for c = 1000, 2000, 3000.

testfix.sh takes different parameters. An example call for testfix.sh would be

$ ./testfix.sh /privat/mmarcus/datasets/cov.data 10 5 30 1 1 0 10

which clusters cov.data for the center counts k = 10, 15, 20, 25, 30, using for each k a coreset size of c = 200*k, running StreamKM++ and StreamLS but not BIRCH, with 10 repetitions.

3. Some information about test.sh
---------------------------------------------------

test.sh performs the following jobs:

a) a very simple check for correct parameters

b) validating the csv file and transforming it into a whitespace-separated format, adding a leading 1 for weight 1 to each line (see the sketch after this list). Warning: at this point the script may fail if the input file is corrupted (for example, missing values). In this step, two files are created from data.txt: data.txt.wsv and data.txt.wsv.b. The first one is the input file for StreamKM++ and StreamLS, the second one is for BIRCH.

c) creating the report file: A report file is created in the same directory as the input file and is named after the current date and the filename of the input file. This directory is also the working directory for almost everything test.sh does.

d) starting the algorithms: The algorithms are started on their respective input files. Running time is measured using /usr/bin/time. A file with the calculated centers is also created; this file is validated afterwards to determine the exact number of centers the respective algorithm has calculated. Then the sum of squared distances of all points to their nearest centers (the clustering costs) is calculated and saved in the report file.

e) finishing: Some temporary files are deleted, and all center files are packed into a tar-gzipped archive called centers.tar.gz every time the script is started. If specified, an email notification about the successful experiment is sent and, if specified as well, all results are checked into an svn repository.
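The csv-to-wsv transformation of step b) essentially boils down to the following one-liner (a minimal sketch of the described conversion, not necessarily the exact code used in test.sh; data.txt stands for an arbitrary input file):

$ awk -F',' '{ printf "1"; for (i = 1; i <= NF; i++) printf " %s", $i; print "" }' data.txt > data.txt.wsv

For an input line "1.5,2.0,3.5" this produces "1 1.5 2.0 3.5", i.e. the original values separated by whitespace with a leading weight of 1.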
4. Configuration
---------------------------------------

There are several ways to change the behavior of the script. If you set any of these variables, take care not to use whitespace before or after the equals sign! All variables can be found (commented) at the beginning of test.sh; a sketch of such a variable block is given at the end of this HOWTO. It is recommended to make a backup of test.sh before changing anything!

a) paths

toolpath: absolute path to the binaries that are not used for the clustering itself, but to create seeds, transform file formats, or calculate costs.
algpath: absolute path to the binaries of the clustering algorithms (namely StreamKM++, StreamLS and BIRCH).
repopath: absolute path to a repository (just in case you want one).

b) settings

repair: if set, missing values in a csv file will not cause test.sh to fail. Instead, for each missing value a zero is appended at the end of the line. The dimension of the wsv file will then be correct, although the inner structure of the data may be disturbed. Default: 0
performstream, performlocalsearch, performbirch: specify which algorithms are executed. stream = StreamKM++, localsearch = StreamLS, birch = BIRCH.
addtosvn: if true, all results are moved to svn and checked in. If false, the data remains in the working directory. Default: 0
sendmail: if true, an email is sent after the experiments have finished. The email contains the complete report file generated by test.sh. Default: 1 (this is a useful feature!)
cleanup: if true, temporary files are removed, provided test.sh is not aborted. Default: 1 (recommended)
birchattributetype: the only valid values are "int" and "double". Which one you need depends on the data you want to cluster!
birchmemory: a parameter of BIRCH that tells it how much memory to set aside for its internal data structures. The authors recommend a value of 5, but a value of 10 has proven more useful in our experiments. Default: 10
adressaten (German for "recipients"): holds all addresses for the email notification (if sendmail="1"). All addresses must be given comma-separated and without any whitespace, e.g.: "a@b.edu,c@d.de,e@f.net"
factor: only present in testfix.sh. This factor determines the coreset size / chunk size as a function of the number of centers k. Default: 200 (so c will be 200*k).

5. Files bigger than 2 GB
----------------------------------------------

For really big files, the normal 32-bit file pointers of our C programs will not work. If you want to cluster this kind of data, you need to use the 64-bit version of the clusterworkspace: change the "toolpath" and "algpath" variables in test.sh to the bin64 directory and everything should work.
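For illustration, the variable block at the top of test.sh could look like the following (a hedged sketch: the variable names are the ones documented above, but all paths and values are only examples and must be adapted to your setup; note that there is no whitespace around the equals signs):

toolpath="/privat/mmarcus/clusterworkspace/bin64"    # bin64 for files bigger than 2 GB, bin otherwise
algpath="/privat/mmarcus/clusterworkspace/bin64"
repopath="/privat/mmarcus/repo"
repair="0"
performstream="1"
performlocalsearch="1"
performbirch="0"
addtosvn="0"
sendmail="1"
cleanup="1"
birchattributetype="double"
birchmemory="10"
adressaten="a@b.edu,c@d.de"
factor="200"    # testfix.sh only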