This repo contains an implementation of a distributed key-value store using RDMA. It tests various types of communication primitives in different configurations, where a configuration is the manner in which requests and responses are handled. The following configurations are currently implemented (or rather, allowed to be run):
| Request (C) | Response (C) | Request (S) | Response (S) |
|---|---|---|---|
| SEND | RECV | RECV | SEND |
| WRITE | RECV | - | SEND |
| WRITE IMM | RECV | RECV | SEND |
| WRITE | - | - | WRITE |
| WRITE | READ | - | - |
Here, (C) and (S) indicate whether the action required for the request/response is performed by the Client or the Server; a `-` means no operation is posted for that step.
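As an illustration, in the WRITE IMM configuration the client delivers its request by posting an RDMA write with immediate data. The sketch below is C-like pseudocode against the standard verbs API, not this repo's actual code; `qp`, `mr`, `req_buf`, `remote_addr`, `rkey`, and `req_id` are assumed to have been set up during connection establishment:

```c
struct ibv_sge sge = {
    .addr   = (uintptr_t)req_buf,  /* registered buffer holding the request */
    .length = req_len,
    .lkey   = mr->lkey,
};
struct ibv_send_wr wr = {0}, *bad_wr;
wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM; /* data lands in server memory, and the
                                               server still receives a completion */
wr.imm_data   = htonl(req_id);              /* immediate value consumed by the server's RECV */
wr.sg_list    = &sge;
wr.num_sge    = 1;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_addr;       /* server-side buffer, exchanged at setup */
wr.wr.rdma.rkey        = rkey;
ibv_post_send(qp, &wr, &bad_wr);
```

The plain WRITE rows differ only in using `IBV_WR_RDMA_WRITE`, which generates no completion on the server side, so the server has to detect incoming requests some other way (e.g. by polling the buffer).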
- Great RDMA example to get started with understanding RDMA implementations: https://github.com/animeshtrivedi/rdma-example
- How to set up SoftiWARP if you do not have hardware support for RDMA:
- How to RDMA? https://www.rdmamojo.com/
- Memcached: https://memcached.org/downloads
- Libmemcached: https://launchpad.net/libmemcached/+download
For installing locally (without sudo), choose an installation location:

```shell
./configure --prefix=/home/$USER/local && make && make install
```
CMake expects a globally installed installation (found through the regular system paths). If you have trouble, you can manually set an environment variable instead, e.g. `LIBMEMCACHED_PATH=/home/$USER/libmemcached-1.0.18` if you have the source code in your home dir.
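I have not verified how this repo's CMakeLists consumes that variable, but a typical pattern for honoring such an override looks roughly like this (hypothetical sketch; the variable names and directory layout are assumptions):

```cmake
# Hypothetical: prefer an explicitly provided libmemcached tree over system paths
if(DEFINED ENV{LIBMEMCACHED_PATH})
  set(LIBMEMCACHED_INCLUDE_DIR "$ENV{LIBMEMCACHED_PATH}/include")
  set(LIBMEMCACHED_LIBRARY_DIR "$ENV{LIBMEMCACHED_PATH}/lib")
  include_directories(${LIBMEMCACHED_INCLUDE_DIR})
  link_directories(${LIBMEMCACHED_LIBRARY_DIR})
endif()
```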
Then, for actually generating build files and compiling:
- `cmake .` for configuration
- `[ENV] cmake .` for advanced configuration, see below
- `make` for compilation and linking
For extra functionality you can pass in temporary environment variables; currently, there are several for benchmarking purposes:
- `PERK_DEBUG=1`: turns on debugging prints
- `PERK_BM_LATENCY=1`: turns on latency macros (prints latency client-side)
- `PERK_BM_OPS_PER_SEC=1`: turns on ops/sec macros (prints ops per second client- and server-side)
- `PERK_BM_SERVER_EXIT=1`: makes the server exit after all clients disconnect
- `PERK_PRINT_REQUESTS=1`: turns on request printing client-side
- `PERK_OVERRIDE_VALSIZE=<VAL>`: overrides the maximum value size of PERK (a utility for benchmark setups)
How to pass these variables?

```shell
PERK_BM_OPS_PER_SEC=1 PERK_BM_SERVER_EXIT=1 cmake .
make
```
For generating a sample workload for 1 or multiple clients:
```shell
python3 gen_small_workload.py <REQUESTS> <GET_DISTRIBUTION> <MAX_CLIENTS> <OUTPUT_DIR>
```
e.g.:
```shell
python3 gen_small_workload.py 1000000 0.95 16 /var/scratch/$USER/input
```

This generates 16 files of 1,000,000 requests each with a GET/SET ratio of 95:5 and stores them in the folder /var/scratch/$USER/input.
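The real `gen_small_workload.py` defines the actual file format, but conceptually it does something like the following (a hypothetical sketch; the record format, file naming, and key space are illustrative assumptions, not the script's real output):

```python
import os
import random

def gen_workloads(requests, get_ratio, max_clients, out_dir, key_space=100_000):
    """Conceptual sketch of a workload generator like gen_small_workload.py.

    The record format ("G <key>" / "S <key> <value>"), file naming, and key
    space are assumptions for illustration, not the real script's format.
    """
    os.makedirs(out_dir, exist_ok=True)
    rng = random.Random(42)  # fixed seed so runs are reproducible
    for client in range(max_clients):
        path = os.path.join(out_dir, f"client_{client}.txt")
        with open(path, "w") as f:
            for _ in range(requests):
                key = rng.randrange(key_space)
                if rng.random() < get_ratio:
                    f.write(f"G {key}\n")             # GET request
                else:
                    f.write(f"S {key} {'x' * 32}\n")  # SET with a 32-byte value
```

Each client then replays its own file, so the server sees `max_clients` independent request streams with the requested GET/SET ratio.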
Running the server:

```shell
./bin/perk_server [ARGS]
```

Running the client:

```shell
./bin/perk_client [ARGS]
```

Some examples:

```shell
./bin/perk_server -a [RDMA-enabled-ip] -p 20838
./bin/perk_client -a [RDMA-enabled-ip] -p 20838 -c 3000000
```
Or you can use the run scripts, which supply the executables with a number of default arguments.
- `./run_server.sh`, try `./run_server.sh -h` for options
- `./run_client.sh`, try `./run_client.sh -h` for options
This section only applies if you have access to one of the DAS5 clusters:
- `module load cmake/3.15.4; module load prun`, to make sure the right modules are loaded
- `preserve -np 2 -t 900`, to reserve some nodes
To run the server:
- `ssh node0..`, to connect to the node
- `cd bsc-project-rdma`, to cd to the folder with the run script
- `./run_server.sh -n -r <comp>`, to run the server (`-n` binds to the local node, where it expects an `ib0` InfiniBand interface); check `./run_server.sh -h` for example comps

To run the client:
- `ssh node0..`, to connect to the node
- `cd bsc-project-rdma`, to cd to the folder with the run script
- `./run_client.sh -n <N> -r <comp>`, where N is the node number (substitutes in a default ib address); check `./run_client.sh -h` for example comps and other options
Benchmarking everything easily is only going to work on DAS5: https://www.cs.vu.nl/pub/das5/users.shtml.
To benchmark the application, you will probably need some sample workloads, unless you like benchmarking GET requests that always return an empty result; otherwise, generating input files is a good idea. Personally, I like 3,000,000 requests with a 95:5 GET/SET ratio. The utility script can automatically generate files for different clients; the payload sizes [32, 64, 128, 256, 512, 1024, 2048] are included by default, and if you want to modify them you will have to edit the script manually.
```shell
python3 gen_small_workload.py 3000000 0.95 16 /var/scratch/$USER/input
```
Afterwards, benchmarking is easy:
```shell
./run_benchmark.sh -i /var/scratch/$USER/input/ -f /var/scratch/$USER/output/ -a -r 5
```

This runs the CPU, latency, and scalability benchmarks in one go, 5 times in a row. It takes quite a while, so for convenience you may want to use nohup and background the process:

```shell
nohup ./run_benchmark.sh -i /var/scratch/$USER/input/ -f /var/scratch/$USER/output/ -a -r 5 &
```