Current baseline: ~75K req/s (SET with random keys, TCP)
Target: 1M req/s
Date: 2026-01-22
Ordered by expected impact, with cumulative performance targets.
## 1. Pipelining

Expected: 300K-750K req/s
Redis gets most of its speed from pipelining. The benchmark currently does:
send request → wait for response → send next request
With pipelining:
send 100 requests → read 100 responses (amortize network RTT)
What to change:
- Modify `fkvs-benchmark.c` to batch requests (lines 168-180)
- Server-side: read multiple commands from the socket buffer before responding
- Write responses in a batch with `writev()` instead of individual `send()` calls
Quick win: add a `-P` flag to the benchmark that sends N requests before reading responses (a sketch follows the implementation notes below).
Implementation notes:
- Start with a depth of 10-100 pipelined requests
- Server needs to parse the buffer until empty before writing responses
- Use `writev()` for batched response writes
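As a starting point, here is a minimal sketch of the pipelined send/receive loop the `-P` flag could drive. It assumes a blocking TCP socket and fixed-size requests and responses; `REQ_SIZE`, `RESP_SIZE`, and the pre-built request array are placeholder assumptions (fkvs's real framing is variable-length):

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <stddef.h>

#define REQ_SIZE  32   /* placeholder: real SET requests are variable-length  */
#define RESP_SIZE 4    /* placeholder: real responses are variable-length too */

/* Send `depth` requests back-to-back, then drain the matching responses.
 * One network RTT is amortized over `depth` commands instead of paid per command. */
static int run_pipelined_batch(int fd, unsigned char reqs[][REQ_SIZE], int depth) {
    for (int i = 0; i < depth; i++) {
        size_t off = 0;
        while (off < REQ_SIZE) {                      /* handle short writes */
            ssize_t n = send(fd, reqs[i] + off, REQ_SIZE - off, 0);
            if (n <= 0) return -1;
            off += (size_t)n;
        }
    }
    unsigned char resp[RESP_SIZE];
    for (int i = 0; i < depth; i++) {
        size_t off = 0;
        while (off < RESP_SIZE) {                     /* handle short reads */
            ssize_t n = recv(fd, resp + off, RESP_SIZE - off, 0);
            if (n <= 0) return -1;
            off += (size_t)n;
        }
        /* validate resp here */
    }
    return 0;
}
```

With depth=1 this degenerates to the current request-per-response loop, which makes it easy to A/B the `-P` flag against the baseline.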
## 2. Hashtable redesign

Expected: 150K-225K req/s → combined with pipelining, 600K-900K
The hashtable does 4 mallocs per SET and 2 mallocs + deep copy per GET. This is killing performance.
Current problems (`src/core/hashtable.c`):
- Lines 61, 65, 81, 95: separate mallocs for the entry, key, `value_entry_t`, and value
- Lines 136-156: deep copy on GET with 2 mallocs
- Linked-list chaining (cache misses on collision)
Implement Swiss Tables (or similar):
- Open addressing with SIMD probing (see the probe sketch after this list)
- Inline small keys/values (< 64 bytes) directly in entry
- No separate allocations for small data
- 2x better cache locality
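For a flavor of what SIMD probing buys, here is a sketch of the group probe at the heart of a Swiss-table layout (SSE2, 16-slot groups; the control-byte scheme follows the Abseil design doc referenced below, and is an assumption rather than the branch's actual code):

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Each group stores 16 control bytes: a 7-bit hash tag per occupied slot.
 * One compare instruction checks the probe tag against all 16 slots at once. */
static uint32_t probe_group(const uint8_t ctrl[16], uint8_t tag) {
    __m128i group = _mm_loadu_si128((const __m128i *)ctrl);
    __m128i match = _mm_cmpeq_epi8(group, _mm_set1_epi8((char)tag));
    return (uint32_t)_mm_movemask_epi8(match);   /* bit i set => slot i matches */
}
```

Each set bit in the returned mask is a candidate slot whose full key still needs comparing; a chained linked list instead costs roughly one cache miss per node visited.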
Or simpler optimization:
- Arena allocator for hashtable entries (pre-allocate a 10MB slab; sketch after this list)
- Inline keys ≤ 15 bytes and values ≤ 31 bytes in the entry struct
- Only return a pointer on GET (no deep copy)
- This is the `perf-use-swiss-data-structure-for-increased-performance` branch work
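A sketch of the simpler path under the thresholds above. Field names and the exact layout are assumptions, not the branch's actual code:

```c
#include <stdint.h>
#include <stddef.h>

/* Bump allocator over a pre-allocated slab: one malloc at startup,
 * pointer arithmetic per allocation, reset in O(1). */
typedef struct arena {
    unsigned char *base;
    size_t used, cap;          /* cap would be the 10MB slab */
} arena_t;

static void *arena_alloc(arena_t *a, size_t n) {
    n = (n + 7) & ~(size_t)7;  /* keep 8-byte alignment */
    if (a->used + n > a->cap) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

/* Entry with inline storage: keys <= 15 bytes and values <= 31 bytes live
 * inside the struct, so SET needs zero extra allocations and GET touches a
 * single cache line (8+1+1+2+16+32 = 60 bytes, padded to 64). */
typedef struct entry {
    uint64_t hash;
    uint8_t  key_len;          /* <= 15: key is inline                 */
    uint8_t  value_type;
    uint16_t value_len;        /* <= 31: value is inline               */
    char     key[16];
    char     value[32];        /* larger data would spill to the arena */
} entry_t;
```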
References:
- Swiss Tables: https://abseil.io/about/design/swisstables
- Google's implementation: https://github.com/google/swiss-tables
## 3. Eliminate per-command allocations

Expected: combined with the above, 900K+ req/s
Problems:
- `src/commands/server/server_command_handlers.c:104`: `malloc(value_len + 1)` then immediately `free(data)` at line 121
- `src/commands/server/server_command_handlers.c:136`: malloc for the response buffer
- Every command does malloc/free on the hot path
Solutions:
- Per-connection buffer pool: Preallocate 4KB buffer per client, reuse for all commands
- Object pools: Pre-allocate arrays of entry structs, never free them
- Arena/region allocator: Bump pointer allocation, reset after response sent
- Remove the pointless malloc at lines 104-121 in `handle_set_command` (malloc, memcpy, then immediate free!)
Implementation approach:
```c
typedef struct client_t {
    // ... existing fields
    unsigned char *cmd_buffer;   // 4KB reusable buffer
    size_t cmd_buffer_size;
    unsigned char *resp_buffer;  // 4KB reusable buffer
    size_t resp_buffer_size;
} client_t;
```
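As a usage sketch, a GET reply could then be built in the pooled buffer instead of a fresh malloc. This is a handler fragment, not a full function; `client->fd`, the `value->value` field, and the grow-on-demand fallback are assumptions:

```c
// Reply path using the per-connection pool: no malloc/free per command.
size_t needed = value->value_len + 1;
if (needed > client->resp_buffer_size) {
    // rare: grow the pooled buffer once and keep it for future commands
    // (realloc error handling elided)
    client->resp_buffer = realloc(client->resp_buffer, needed);
    client->resp_buffer_size = needed;
}
memcpy(client->resp_buffer, value->value, value->value_len);
send_reply(client->fd, client->resp_buffer, value->value_len);
// no free(): the buffer lives as long as the connection
```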
## 4. Batched network I/O and io_uring

Expected: push to 1M+ req/s

Important: see the io_uring findings for a detailed analysis of why naive io_uring integration underperforms epoll and what architectural changes are needed to realize gains. Key takeaway: fix the hashtable first (Phase 2), since fkvs is CPU-bound, then redesign the io_uring integration with batched SQE submission, registered buffers, DEFER_TASKRUN, and zero-copy paths.
Current:
- Single `recv()` call per event
- Single `send()` per response
Optimize:
- Read until `EAGAIN` (fill a 64KB buffer)
- Parse multiple commands from the buffer
- Batch responses with `writev()` or io_uring `IOSQE_IO_LINK`
- Avoid memcpy where possible (parse in place)
Implementation:

```c
// Read loop: drain the socket until EAGAIN or the buffer is full
while (offset < buffer_size) {
    ssize_t n = recv(fd, buffer + offset, buffer_size - offset, 0);
    if (n == 0)
        break;                      // peer closed the connection
    if (n < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                  // drained the socket
        // handle other errors (e.g. EINTR)
        break;
    }
    offset += (size_t)n;
}

// Parse all complete commands accumulated in the buffer
while (offset >= MIN_COMMAND_SIZE) {
    // parse and execute command
    // add response to iovec array
}

// Write all responses at once
writev(fd, iovecs, iovec_count);
```

## 5. Protocol efficiency

The binary protocol is decent but can be improved.
Current overhead:
- 2 bytes length + 1 byte cmd + 2 bytes key_len = 5 bytes minimum
Optimizations:
- Pre-encoded integer responses (PING → single byte 0x01)
- Varint encoding for lengths (saves bytes for small keys/values; encoder sketch below)
- Inline small responses (e.g., OK, PONG) without length prefix
Example:

```
PING response: currently [0x00, 0x01, 'O', 'K']
Optimized:     [0x01]    // 4 bytes → 1 byte
```
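A sketch of an LEB128-style varint encoder for the length fields. The encoding choice is an assumption, and adopting it would be a breaking protocol change:

```c
#include <stdint.h>
#include <stddef.h>

/* 7 bits of payload per byte, high bit = "more bytes follow".
 * Lengths < 128 (most keys and values) shrink from 2 bytes to 1. */
static size_t varint_encode(unsigned char *out, uint32_t v) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (unsigned char)(v & 0x7F) | 0x80;
        v >>= 7;
    }
    out[n++] = (unsigned char)v;
    return n;   /* bytes written: 1-5 for a uint32_t */
}
```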
## 6. Compiler and micro-optimizations

Compiler flags:

```make
CFLAGS += -O3 -march=native -flto -fno-omit-frame-pointer
```

Hash function:
- Replace DJB2 (`src/core/hashtable.c:8-16`) with xxHash or CityHash (3-5x faster)
- Or use hardware CRC32C (`_mm_crc32_u64` on x86, `__crc32cd` on ARM); x86 sketch below
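A sketch of the CRC32C variant on x86 (requires SSE4.2; covered by the `-march=native` flag above, or build with `-msse4.2`). The simple init/mixing scheme is an assumption, not a tuned hash:

```c
#include <nmmintrin.h>   /* SSE4.2 CRC32 intrinsics */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hardware CRC32C over the key, 8 bytes per instruction. */
static uint64_t crc32c_hash(const unsigned char *key, size_t len) {
    uint64_t h = ~0ULL;
    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, key, 8);          /* safe unaligned load */
        h = _mm_crc32_u64(h, chunk);
        key += 8;
        len -= 8;
    }
    while (len--)
        h = _mm_crc32_u8((uint32_t)h, *key++);
    return h;
}
```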
Hot path optimization:
- Mark command handlers with `__attribute__((hot))`
- Use `__builtin_expect` for error paths
- Use `__builtin_prefetch` for hashtable lookups
- Profile with `perf record` and `perf report` to find hotspots
Example:
__attribute__((hot))
void handle_set_command(int client_fd, unsigned char *buffer, size_t bytes_read) {
if (__builtin_expect(bytes_read < 5, 0)) {
// error handling
return;
}
// hot path
}Effort: 1-2 days
- ✅ Remove pointless malloc/free in command handlers (lines 104-121 in server_command_handlers.c)
- ✅ Add pipelining support to benchmark (`-P` flag)
- ✅ Server: batch read until EAGAIN
- ✅ Server: batch write with writev()
- ✅ Add per-connection buffer pools (4KB)
Expected result: ~250-300K req/s
### Phase 2 (Effort: 1-2 weeks)
- ✅ Implement arena allocator for hashtable entries
- ✅ Inline small keys/values in entry struct
- ✅ Remove deep copy on GET (return pointer)
- ✅ Switch to Swiss Tables or open addressing
- ✅ Replace DJB2 with xxHash
Expected result: ~500-700K req/s
### Phase 3 (Effort: 1 week)
- ✅ Full zero-copy parsing
- ✅ io_uring optimization (batch operations with SQ/CQ) — see io_uring findings for detailed analysis of when and how to optimize io_uring based on VLDB 2026 research
- ✅ Protocol efficiency improvements
- ✅ Profile-guided optimization (PGO)
- ✅ Compiler optimizations (-march=native, LTO)
Expected result: ~900K-1.2M req/s
## Quick wins first

These four changes could get you to 250-300K req/s:
1. Remove wasteful malloc in `src/commands/server/server_command_handlers.c:104-121`
   - Currently: allocate, copy, immediately free
   - Just remove it entirely
2. Add pipelining to benchmark
   - Add `-P N` flag to send N requests before reading
   - Start with N=100
3. Pre-allocate buffers per client
   - Add 4KB `cmd_buffer` and `resp_buffer` to `client_t`
   - Reuse for all operations
4. Don't deep copy on GET
   - Lines 136-156 in `src/core/hashtable.c`
   - Return a pointer instead of allocating + copying
## Benchmarking methodology

After each optimization phase:

- Run baseline benchmark:

  ```
  ./fkvs-benchmark -n 1000000 -c 25 -t set -r
  ```

- Profile with perf:

  ```
  perf record -g ./fkvs-server -c
  perf report
  ```

- Check for regressions:
  - Memory usage (should stay flat or decrease)
  - Latency percentiles (p50, p95, p99)
  - CPU utilization
- Document results:
  - Update this file with actual numbers
  - Note any unexpected bottlenecks discovered
## Expected profile hot spots

Based on code review, expected hot spots:

- malloc/free: every command allocates/frees memory (30-40% of CPU time)
- Hashtable lookups: linked-list chaining causes cache misses (20-30%)
- memcpy: deep copies on GET, unnecessary copies in SET (10-15%)
- Network I/O: single recv/send per command (10-20%)
- Hash function: DJB2 is slow (5-10%)

Run `perf record` to validate these assumptions.
## References

- Redis performance: https://redis.io/docs/management/optimization/benchmarks/
- TigerBeetle architecture: https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/DESIGN.md
- Swiss Tables design: https://abseil.io/about/design/swisstables
- Google's implementation: https://github.com/google/swiss-tables
- xxHash: https://github.com/Cyan4973/xxHash
- Linux perf tutorial: https://perf.wiki.kernel.org/index.php/Tutorial
- io_uring for High-Performance DBMSs (VLDB 2026): https://github.com/mjasny/vldb26-iouring
- io_uring findings for fkvs: detailed analysis with benchmark data
## Bugs found during review

### Bug 1: SET stores every value as INT

File: `src/commands/server/server_command_handlers.c:112-118`

```c
if (!is_integer(&buffer[pos_value], value_len)) {
    set_value(table, &buffer[pos_key], key_len, &buffer[pos_value],
              value_len, VALUE_ENTRY_TYPE_RAW);
}
set_value(table, &buffer[pos_key], key_len, &buffer[pos_value], value_len,
          VALUE_ENTRY_TYPE_INT);
```

The second set_value call runs unconditionally, so every key ends up with VALUE_ENTRY_TYPE_INT encoding regardless of its actual type. For non-integer values, set_value is called twice on the same key: the first call (RAW) is immediately overwritten by the second (INT). This is both a correctness issue (non-integer values get INT encoding) and a performance issue (8 mallocs per SET for non-integer values instead of 4).
Fix: invert the logic so the value is stored as INT only when `is_integer()` is true and as RAW otherwise, with a single set_value call per SET command.
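A minimal sketch of the corrected branch, reusing the identifiers from the snippet above (the concrete type of the encoding parameter is an assumption):

```c
// Decide the encoding once, write the key once: correct type, 4 mallocs instead of 8.
int type = is_integer(&buffer[pos_value], value_len) ? VALUE_ENTRY_TYPE_INT
                                                     : VALUE_ENTRY_TYPE_RAW;
set_value(table, &buffer[pos_key], key_len, &buffer[pos_value], value_len, type);
```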
### Bug 2: GET leaks two allocations per request

File: `src/commands/server/server_command_handlers.c:135-149`

```c
value_entry_t *value;
size_t value_len;
if (get_value(table, &buffer[5], key_len, &value, &value_len)) {
    unsigned char *resp_buffer = malloc(value->value_len + 1);
    // ...
    send_reply(client_fd, resp_buffer, value_len);
}
```

Neither value (allocated by get_value via deep copy) nor resp_buffer is ever freed after send_reply, so every GET leaks both allocations. This compounds with the deep-copy issue in get_value() (hashtable.c:136-155). Once the deep copy is removed in Phase 2, the value leak disappears naturally, but resp_buffer still needs to be freed (or eliminated via the per-connection buffer pool).
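A minimal interim patch until the buffer pool lands. `free_value_entry` is a hypothetical helper: given the two-malloc layout noted in Phase 2, a bare `free(value)` would still leak the inner data buffer:

```c
if (get_value(table, &buffer[5], key_len, &value, &value_len)) {
    unsigned char *resp_buffer = malloc(value->value_len + 1);
    // ... build the reply as before ...
    send_reply(client_fd, resp_buffer, value_len);
    free(resp_buffer);          // plugs the response-buffer leak
    free_value_entry(value);    // hypothetical: frees the entry and its data
}
```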
## Closing notes

- Redis (single-threaded) achieves ~100K-200K req/s over TCP without pipelining
- With pipelining, Redis can hit 1M+ req/s on Unix sockets
- TigerBeetle shows that single-threaded design is production-ready when optimized
- The current ~75K req/s is respectable but has clear optimization opportunities
- The Swiss table branch (`perf-use-swiss-data-structure-for-increased-performance`) should be prioritized