Author: Jesús Arias Fisteus
This guide explains how you can reproduce the experiments of the paper A middleware for publishing semantic event streams on the Web.
In addition to this guide, we provide the dataset we used in our experiments, the scripts that run the experiments and the scripts that process and plot the data gathered from them. The following download includes those files, as well as scripts that will help you to set up your computers for running the experiments:
http://www.it.uc3m.es/jaf/ztreamy-dl/experiments/ztreamy-experiments.tar.gz
For instructions about how to install this package in the computers in which you will run the experiments, see section Setting up the computers, below in this document.
In order to reproduce the experiments in conditions similar to those of the paper, you'll need five computers with Linux installed. We'll call them computer0 to computer4. In our setup, those computers were arranged as follows:
computer1 to computer4 are in the same 1 Gbps Ethernet LAN. They share the same switch, and no other computer is connected to it. computer0 is in a different 100 Mbps LAN, which connects to the other LAN through a single intermediate router.
In order to measure event delivery delays with accuracy, the clocks of all the computers must be synchronized. Install NTP (the Network Time Protocol) in all of them, using, if possible, the same reference servers. Ideally, all the computers should be synchronized with an offset of no more than 5 ms.
When you run the command ntpq -p, the offset column displays the offset of the local clock with respect to each time server, in milliseconds.
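For example, this one-liner is a minimal sketch that extracts that offset for the peer the machine is currently synchronized to (the line ntpq marks with an asterisk; the field position assumes the standard ntpq -p output shown later in this guide):

ntpq -p | awk '/^\*/ { print "offset:", $9, "ms" }'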
The experiments can be run from a single script executed at computer0. This script needs to connect to the other computers through SSH in order to run servers and clients there. The script must be able to access those computers without the user needing to type a password. Therefore, you need to set up an SSH public/private key pair that allows password-less access from computer0 to the other four computers.
In addition to that, the scripts expect an account with the same user name in all the computers.
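The standard OpenSSH tools can be used for this. The following sketch, run at computer0, assumes the hostnames computer1 to computer4 used in this guide; replace them with your own:

# Create a key pair (accept the default location; an empty passphrase
# avoids being prompted for it every time the scripts connect)
ssh-keygen -t rsa
# Install the public key in the other four computers
for host in computer1 computer2 computer3 computer4; do
    ssh-copy-id "$host"
done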
Modern processors allow the operating system to dynamically choose the clock frequency at which they operate. In Linux this is usually done by cpufreq. Before running the experiments, remember to load the performance governor in order to keep the processor always at its maximum frequency:
sudo cpufreq-set -r -g performance
After finishing the experiments, you can set the default governor again:
sudo cpufreq-set -r -g ondemand
At any moment, you can get information about the current governor with this command:
cpufreq-info
Note: in Debian and Ubuntu, the cpufreq commands above are included in the package cpufrequtils. Install it if your system does not recognize those commands.
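The governor must be changed in every computer. Once the password-less SSH access described above is configured, a sketch like this one can do it from computer0 (the -t option gives sudo a terminal for its password prompt; the hostnames are the ones used in this guide):

# Set the performance governor on the local and the remote computers
sudo cpufreq-set -r -g performance
for host in computer1 computer2 computer3 computer4; do
    ssh -t "$host" sudo cpufreq-set -r -g performance
done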
In all the experiments, a server provides a stream of events. The events come from an event source that runs in the same computer as the server and sends the events via HTTP. A number of clients that listen to the stream are run from the rest of the computers.
In the experiments we control the following configuration parameters, which we vary in order to measure their impact on performance: the number of clients, the event rate and the server buffer window.
Each individual experiment consists of three stages:
The following performance indicators are measured: event delivery delay, CPU consumption, network traffic and compression ratio.
Given the limitations in the number of computers available for the experiments, we have to run many clients per computer. The script ztreamy.tools.many_clients, included in the Ztreamy distribution, runs an arbitrary number of clients from the same process. Each client has a separate connection to the server and receives its data through it. In one of the configurable modes of the script, only one of the clients actually parses the events; the other clients just read the data without processing it. This saves CPU resources in the computer that runs the script when the number of clients is large.
The source of events transmits the events it loads from a file. Each round of an experiment begins with the first event of the file, then the second, and so on.
In order to set up the computers for running the experiments, you'll need to run the same commands on several computers. You may find an SSH multiplexer, such as Cluster SSH, very useful for that:
sudo apt-get install clusterssh
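Cluster SSH installs the cssh command, which opens one terminal per computer and replicates your keystrokes in all of them. With the hostnames used in this guide:

cssh computer0 computer1 computer2 computer3 computer4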
You'll need the following packages installed on your computers:
The basic GNU tools for C development (such as libc6-dev, gcc, etc.). In Debian and Ubuntu they can be installed through the package build-essential:
sudo apt-get install build-essential
Java JDK 7 (you'll see it referred to in some places as JDK 1.7).
Python 2.7, its header files and the virtualenv software. In Debian/Ubuntu:
sudo apt-get install python python-dev python-virtualenv
Ztreamy has not been ported yet to Python 3.
BLAS and Lapack, including the development header files (libblas-dev, liblapack-dev), and the Fortran compiler gfortran. In Debian/Ubuntu:
sudo apt-get install libblas3 libblas-dev liblapack3 liblapack-dev gfortran
(in some systems those packages are called libblas3gf and liblapack3gf.)
curl and libcurl, including the development header files. In Debian/Ubuntu:
sudo apt-get install curl libcurl3 libcurl4-openssl-dev
ZeroMQ, including the development header files. In Debian/Ubuntu the packages are libzmq1 and libzmq-dev:
sudo apt-get install libzmq1 libzmq-dev
NTP is probably already installed and running in your computers. If not, you have to install the NTP service in all of them. In Debian and Ubuntu it can be done with:
sudo apt-get install ntp
It may take some time before all the clocks are synchronized. Use ntpq to check whether that is the case before starting to run the experiments.
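With the password-less SSH access described earlier, you can check all the computers at once from computer0. A sketch, with the hostnames used in this guide:

# Print the NTP peer status of every computer; look at the offset column
ntpq -p
for host in computer1 computer2 computer3 computer4; do
    echo "=== $host ==="
    ssh "$host" ntpq -p
done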
Linux imposes a limit on the maximum number of open file descriptors a user can have. Since it is probably too small for running our experiments, you should increase it in all the computers for the user that will run the experiments. Assuming the user is called foo, edit the file /etc/security/limits.conf and append a couple of lines with the new limit:
foo soft nofile 65536
foo hard nofile 65536
Replace foo above with the name of the user that will run the experiments.
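The new limit applies to sessions started after the change. Log in again and check it with:

ulimit -n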
Once you have downloaded our http://www.it.uc3m.es/jaf/ztreamy-dl/experiments/ztreamy-experiments.tar.gz package, decompress it somewhere inside your user account in the five computers you are using for the experiments. Install it at the same path in all the computers. You are expected to run the scripts from the main directory of the package you have just uncompressed.
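As a sketch, assuming the tarball sits in your home directory on computer0 and that your home directory is the place you have chosen:

# Uncompress locally, then copy and uncompress in the other computers
tar xzf ztreamy-experiments.tar.gz
for host in computer1 computer2 computer3 computer4; do
    scp ztreamy-experiments.tar.gz "$host":
    ssh "$host" tar xzf ztreamy-experiments.tar.gz
done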
Before beginning to run the experiments, you need to do some further installations: a virtual environment for the Python programming language (with the additional packages required by Ztreamy) and a virtual environment for the Ruby programming language (with Faye and its required packages). In order to do that, just run from the main directory of our package:
./setup.sh
The script takes about 15 minutes to run, because it has to compile some of the libraries we use.
The installation of Ruby may prompt you to take some actions (install additional dependencies in your system). If asked, you only need the dependencies for Ruby, not those for JRuby, IronRuby or Opal. Follow the instructions provided by the script.
Once the script is finished, you should configure the environment (see the next section). After that, you should be able to begin to run the experiments as explained in section Running the script.
The configuration needed by the scripts that run the experiments, such as the hostnames of the computers and the location of the Java JDK, is kept in config.sh, in the main directory of the package we provide.
You have to configure the following variables in this file:
Once edited, you have to copy config.sh to the five computers, inside the main directory of the package we provide.
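A sketch for propagating the edited file from computer0, assuming the package is uncompressed at the same path everywhere and that you run this from its main directory:

# $PWD expands locally, so the file lands in the same path remotely
for host in computer1 computer2 computer3 computer4; do
    scp config.sh "$host":"$PWD"/
done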
If you want to uninstall these things later, note that the only things that get installed outside the directory of our package are:
The rest is installed inside our package. Just remove the package's directory if you want to uninstall it.
Experiments can be run by invoking the run.py script from computer0. The script receives the parameters of the experiment (number of clients, event rate, number of events, etc.), starts the server, the event source and the clients (many of them in the remote computers), collects the logs they generate and stores them inside a directory.
In order to run the script, open a command line terminal and go to the directory from which you want to run the experiments (the one you have specified as main_dir in config.sh). Run the script from there. Make sure that config.sh is in that directory.
The script can run more than one round of the experiment with the same parameters. Since some factors, such as the instants at which the events are generated, the state of the CPU of the computers, the state of the network, etc., are random, repeating the experiment several times with the same parameters is useful for reporting confidence intervals for the performance indicators that we measure. The number of repetitions to run is specified as a command line argument. Since rounds are numbered, the number of the first round to be run can optionally be specified with the -i option. If not specified, the script begins with round 1. The round number is important because the script creates a separate directory for each round. For example, assume you have already run rounds 1 to 7, and that you want to start a new experiment labeled as round 8:
python scripts/run.py -i 8 ...rest of arguments...
The system on which the experiment will run is selected with a command line argument. If no argument is specified, the script runs Ztreamy. For example, in order to run the experiment on the Dataturbine system, use the -t option:
python scripts/run.py -t ...rest of arguments...
The full list of arguments that select the system to run is:
In our experiments, we usually vary one parameter while the other ones are fixed. Specifically, we perform some experiments in which the number of clients varies, other experiments in which the event rate varies and other experiments in which the server buffer window varies. The following sections describe how to run each kind of experiment.
In order to perform an experiment in which the number of events, the event rate and the server buffer window are fixed, while the number of clients varies, run the script like in the following example:
python scripts/run.py 0.25 0.5 400 5 experiment-logs/ztreamy/clients 300 clients
The meaning of its command line arguments is:
(param. 1) 0.25: average separation between two consecutive events. In this case, it is 0.25s (i.e. 4 events/s). It characterizes the Poisson process that the source of events follows.
(param. 2) 0.5: the server buffer time, in seconds.
(param. 3) 400: the number of events to send. Note that 400 events with an average separation of 0.25 s mean an average duration of 100 s for the experiment.
(param. 4) 5: number of rounds of the experiment to run. In this case, 5 rounds are run. By default, they are numbered 1 to 5. You can change the initial number with the -i option, as explained above.
(param. 5) experiment-logs/ztreamy/clients: directory in which the data gathered from the experiment will be stored. Inside this directory, the script automatically creates a subdirectory named after the number of clients of this experiment. The data associated with each round of the experiment is stored in a separate subdirectory inside it, named after the number of the round. In the example, the results of the first round are stored in the subdirectory:
experiment-logs/ztreamy/clients/300/1
(param. 6) 300: number of clients. Clients are automatically divided by the script between the local computer and the three remote computers that do not run the server.
(param. 7) clients: the parameter that varies in this experiment. In this case, it is always "clients".
The command above runs the experiment 5 times for just one fixed number of clients. You'll probably want to run the experiment for different numbers of clients:
for clients in 1000 2000 3000 4000; do python scripts/run.py 0.25 0.5 400 5 experiment-logs/ztreamy/clients $clients clients; done
This command runs 5 rounds of the experiment with 1000 clients, then another 5 rounds with 2000 clients, etc. Notice the dollar sign ($) that we have inserted before clients, near the end of the command. This command requires bash, dash, sh or another Unix shell compatible with them.
In order to vary the event rate while keeping the number of clients and server window size constant, run the script like in the following example:
python scripts/run.py 0.25 0.5 100 5 experiment-logs/ztreamy/rate 300 rate
There are two main changes in the invocation with respect to the previous case. The first one is that the last parameter is rate. The second one is that the meaning of the third parameter changes: it now represents the average duration of the experiment in seconds (100 s in the example above) instead of the number of events to send.
In this case, the results are stored in a directory named after the event rate. For example, the first round of the previous command stores its results at:
experiment-logs/ztreamy/rate/0.25/1
because the average separation between events is 0.25 s.
You can use a shell loop to automatically run many experiments. For example, if you use the bash shell, you can run:
for rate in 0.1 0.12 0.15 0.2 0.25 0.5 0.75 1.0 1.5 2.0; \
do python scripts/run.py -f $rate 0 100 10 experiment-logs/faye/rate \
500 rate; done
which runs experiments on the Faye server, with 500 clients and an average of 100 s of events per run. The event rate parameter varies, taking the values 0.1, 0.12, 0.15, 0.2, 0.25, 0.5, 0.75, 1.0, 1.5 and 2.0. For each of the 10 event rates the experiment is repeated 10 times. Therefore, the experiment is run 100 times.
We provide the scripts that process the files obtained from the experiments and generate the final plots that are shown in the paper. This section explains how to use them.
Let's suppose you run an experiment in which the parameter that varies is the number of clients (1000, 2000, 3000,...). The data from the experiment will be organized in directories with a structure like the following one:
experiment-logs/
|-- clients
|   |-- 1000
|   |   |-- 1
|   |   |-- 2
|   |   |-- 3
|   |   |-- 4
|   |   |-- 5
|   |-- 2000
|   |   |-- 1
|   |   |-- 2
|   |   |-- 3
|   |   |-- 4
|   |   |-- 5
|   |-- 3000
|   |   |-- 1
|   |   |-- 2
|   |   |-- 3
|   |   |-- 4
|   |   |-- 5
(...)
Each low-level directory contains the data gathered from one round of the experiment, organized in several text files:
This is an example of a metadata.txt file:
Server: ztreamy
Rate: 0.05
Buffer: 0.4
Num events: 2000
Clients: 9000
Dir: experiment-logs/ztreamy-solo/clients_effect/9000/1
Server news.gast.it.uc3m.es NTP:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*roleX.uc3m.es   130.206.3.166    2 u  653 1024  377    0.346   -1.189   0.645
Client ariadna NTP:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*roleX.uc3m.es   130.206.3.166    2 u  185 1024  377    0.314   -2.011   0.489
Client itaca NTP:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*roleX.uc3m.es   130.206.3.166    2 u 1013 1024  375    0.333   -1.339   2.759
Client infoflex NTP:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*roleX.uc3m.es   130.206.3.166    2 u  489 1024  377    0.246   -1.228   1.682
Client semnet NTP:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*roleX.uc3m.es   130.206.3.166    2 u  503 1024  377    0.289   -0.315   1.779
The file stores the parameters used in the experiment as well as the state of the NTP service in every computer.
This is an example of a server-*.log file:
# Node: d8f59642-3c54-4be5-8659-1350f1d8e7c5
# Host: news
#
# Buffer time (ms): 400.0
server_traffic_sent 1354034628.97 0
server_traffic_sent 1354034638.97 34362
server_traffic_sent 1354034648.99 63647
server_traffic_sent 1354034658.97 50948
server_traffic_sent 1354034668.97 50450
server_traffic_sent 1354034678.97 88201
server_traffic_sent 1354034688.97 87205
server_traffic_sent 1354034698.97 126201
server_traffic_sent 1354034708.97 252056
server_traffic_sent 1354034718.97 16146479
server_traffic_sent 1354034728.97 28986000
server_traffic_sent 1354034738.97 29805000
server_traffic_sent 1354034748.97 28507000
server_traffic_sent 1354034758.97 26624000
server_traffic_sent 1354034768.97 27106000
server_traffic_sent 1354034778.97 26322000
server_traffic_sent 1354034788.97 29545000
server_traffic_sent 1354034798.97 31126000
server_traffic_sent 1354034808.97 27182000
server_timing 10.22 102.800895929 1354034715.98
server_traffic_sent 1354034818.97 26481000
server_closed 1354034827.37 1000
server_traffic_sent 1354034827.37 1047000
server_traffic_sent 1354034827.44 72796
The file displays the following information gathered by the server:
For the other systems we do not have the same degree of control. Therefore, the log contains just the server_timing information for them.
Some of the clients launched by the script generate a log file. Therefore, there will be several of these files for each experiment. This is a fragment of an example file:
# Node: 8b1cad78-0a43-4c57-acc1-a4936d0b39aa
# Host: ariadna
#
data_receive 188 188
data_receive 0 0
data_receive 7 0
data_receive 140 152
data_receive 54 152
data_receive 50 152
data_receive 48 152
data_receive 50 152
data_receive 50 152
data_receive 49 152
data_receive 48 152
data_receive 47 152
data_receive 52 152
data_receive 139 246
data_receive 491 957
manyc_event_finish 1 0.630144119263
data_receive 176 778
manyc_event_finish 2 0.487888097763
data_receive 174 899
manyc_event_finish 3 0.417834997177
data_receive 258 1151
manyc_event_finish 4 0.766107797623
data_receive 352 2180
manyc_event_finish 5 0.656394004822
manyc_event_finish 6 0.55241894722
data_receive 142 1039
manyc_event_finish 7 0.39984703064
data_receive 211 975
manyc_event_finish 8 0.374735832214
data_receive 263 2033
manyc_event_finish 10 0.406519889832
manyc_event_finish 9 0.426545143127
The file shows at its beginning the hostname of the computer in which the client runs. Then, it shows:
We provide several scripts that extract the different indicators that we need to measure in the experiments. The scripts aggregate the data captured from the different rounds of the same combination of parameters:
All the scripts are invoked with similar command line parameters. For example:
python scripts/dump_delays.py experiment-logs/ztreamy/clients clients
dumps to its standard output the aggregated delays taken from all the experiments under the directory experiment-logs/ztreamy/clients (at any depth). The second parameter tells the script that the parameter that varies in those experiments is the number of clients. Therefore, the results for each individual number of clients are shown separately:
1000 0.239 0.236 0.241 0.300 0.296 0.305 0.255 0.254 0.257
2000 0.239 0.237 0.241 0.313 0.311 0.316 0.270 0.268 0.271
3000 0.242 0.240 0.245 0.355 0.353 0.358 0.283 0.282 0.285
4000 0.241 0.239 0.244 0.401 0.398 0.404 0.295 0.294 0.297
5000 0.254 0.250 0.257 0.470 0.465 0.475 0.311 0.310 0.313
6000 0.253 0.249 0.257 0.523 0.518 0.529 0.329 0.327 0.330
7000 0.331 0.325 0.337 0.676 0.668 0.684 0.387 0.385 0.389
8000 0.394 0.386 0.401 0.813 0.802 0.823 0.454 0.451 0.457
9000 0.478 0.470 0.487 1.031 1.018 1.044 0.591 0.588 0.594
10000 0.607 0.594 0.621 1.283 1.265 1.302 0.706 0.703 0.710
11000 0.570 0.559 0.580 1.380 1.362 1.399 0.778 0.774 0.782
12000 0.986 0.949 1.023 2.587 2.519 2.656 1.018 1.007 1.029
The file contains one line for each value of the parameter that varies (in this case, the number of clients). This value is shown in the first column. The meaning of the rest of the columns is:
This is an example of a file that shows CPU consumption:
1000 9.982000 9.692210 10.271790 5
2000 19.294000 18.795429 19.792571 5
3000 29.570000 28.395987 30.744013 5
4000 41.090000 40.297715 41.882285 5
5000 51.158000 49.850846 52.465154 5
6000 64.550000 63.988705 65.111295 5
7000 69.344000 66.626435 72.061565 5
8000 71.364000 69.214511 73.513489 5
Again, the first column shows the value of the parameter that varies. The rest of the columns represent the average CPU consumption, the limits of its 95% confidence interval and the number of rounds that were aggregated.
The format of the file generated by dump_traffic.py is similar: the second column is the average traffic rate in Mbit/s. The third and fourth columns are the 95% confidence interval for that average.
In the case of the file generated by dump_compression_ratio.py, the second column is the average compression ratio, and the third and fourth columns are its 95% confidence interval.
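All the dump scripts write to standard output, so you'll typically redirect their output to a data file for plotting. For example (the file name delays-clients.dat is just an illustration):

python scripts/dump_delays.py experiment-logs/ztreamy/clients clients > delays-clients.dat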
The data files produced with the scripts of Aggregating the results of the experiments are in a format that allows easy plotting from many graphical packages. We provide gnuplot scripts that automatically create all the plots shown in the paper.
In addition, we provide the script create_plots.sh, which generates the data files by using the scripts above and then runs all the gnuplot scripts. Note that, in order to work, the script assumes some standard locations and directory names for the logs generated by the experiments. Those standard locations are, relative to the main directory of the package we provide:
experiment-logs
├── dataturbine
│   ├── clients
│   └── rate
├── faye
│   ├── clients
│   └── rate
├── lsm
│   ├── clients
│   └── rate
├── zmq
│   ├── clients
│   └── rate
├── ztreamy
│   ├── clients
│   ├── clients-buffer
│   ├── clients-uncompressed
│   ├── clients-uncompressed-unbuffered
│   ├── rate
│   ├── rate-buffer
│   ├── rate-uncompressed
│   └── rate-uncompressed-unbuffered
└── ztreamy-solo
    ├── buffering_effect
    ├── clients_effect
    ├── clients_effect-large_window
    ├── clients_effect-low_rate
    ├── clients_effect-many_clients
    └── rate_effect
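If your logs follow that layout, all the data files and plots can be generated with a single command, run from the main directory of the package (where we assume the script is located, next to setup.sh):

./create_plots.sh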
In the link below you can obtain all the log files and processed data that we gathered in our experiments. The results presented in the article have been obtained by processing these files with the create_plots.sh script.