by Emery Berger
The Hoard memory allocator is a fast, scalable, and memory-efficient memory allocator that works on a range of platforms, including Linux, Mac OS X, and Windows.
Hoard is a drop-in replacement for malloc that can dramatically improve application performance, especially for multithreaded programs running on multiprocessors and multicore CPUs. No source code changes necessary: just link it in or set one environment variable (see Building Hoard, below).
"If you'll be running on multiprocessor machines, ... use Emery Berger's excellent Hoard multiprocessor memory management code. It's a drop-in replacement for the C and C++ memory routines and is very fast on multiprocessor machines."
"(To improve scalability), consider an open source alternative such as the Hoard Memory Manager..."
"Hoard dramatically improves program performance through its more efficient use of memory. Moreover, Hoard has provably bounded memory blowup and low synchronization costs."
Companies using Hoard in their products and servers include AOL, British Telecom, Blue Vector, Business Objects (formerly Crystal Decisions), Cisco, Credit Suisse, Entrust, InfoVista, Kamakura, Novell, Oktal SE, OpenText, OpenWave Systems (for their Typhoon and Twister servers), Pervasive Software, Plath GmbH, Quest Software, Reuters, Royal Bank of Canada, SAP, Sonus Networks, Tata Communications, and Verite Group.
Open source projects using Hoard include the Asterisk Open Source Telephony Project, Bayonne GNU telephony server, the Cilk parallel programming language, the GNU Common C++ system, the OpenFOAM computational fluid dynamics toolkit, and the SafeSquid web proxy.
Hoard is now a standard compiler option for the Standard Performance Evaluation Corporation's CPU2006 benchmark suite for the Intel and Open64 compilers.
Hoard has now been released under the widely used and permissive Apache License, version 2.0.
There are a number of problems with existing memory allocators that make Hoard a better choice.
Multithreaded programs often do not scale because the heap is a bottleneck. When multiple threads simultaneously allocate or deallocate memory from the allocator, the allocator will serialize them. Programs making intensive use of the allocator actually slow down as the number of processors increases. Your program may be allocation-intensive without you realizing it, for instance, if your program makes many calls to the C++ Standard Template Library (STL). Hoard eliminates this bottleneck.
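To make the bottleneck concrete, here is a minimal sketch of an allocation-intensive workload. The names `churn` and `run_workers` are made up for illustration and are not part of Hoard. The code never calls malloc directly, yet each iteration is likely to reach the allocator several times through std::string and std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical worker: builds and discards small containers in a loop.
// The std::string and std::vector operations below may each call the
// allocator, even though no malloc or new appears in this code.
std::size_t churn(std::size_t iterations) {
    std::size_t bytes_seen = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        std::vector<std::string> words;
        // 64 characters is long enough to exceed the small-string buffer on
        // common standard libraries, so each string likely owns heap memory.
        words.push_back(std::string(64, 'a'));
        words.push_back(std::string(64, 'b'));
        bytes_seen += words[0].size() + words[1].size();
    }
    return bytes_seen;
}

// Runs `churn` on `num_threads` threads at once and sums the results.
std::size_t run_workers(unsigned num_threads, std::size_t iterations) {
    assert(num_threads > 0);
    std::vector<std::size_t> results(num_threads, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&results, t, iterations] {
            results[t] = churn(iterations);
        });
    }
    for (auto& th : threads) th.join();
    std::size_t total = 0;
    for (std::size_t r : results) total += r;
    return total;
}
```

Timing a call such as `run_workers(8, 1000000)` with the system allocator and again with Hoard preloaded is one way to check whether the heap is a bottleneck for code shaped like this. Results depend on your machine, allocator, and standard library.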
System-provided memory allocators can cause insidious problems for multithreaded code. They can lead to a phenomenon known as "false sharing": threads on different CPUs can end up with memory in the same cache line, or chunk of memory. Accessing these falsely-shared cache lines is hundreds of times slower than accessing unshared cache lines. Hoard is designed to prevent false sharing.
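The sketch below shows one way to check whether two addresses land in the same cache line. It assumes a 64-byte cache line, which is common but not universal. `same_cache_line`, `PaddedCounter`, and `check_layout` are illustrative names, not part of Hoard. The padded struct is an application-level workaround for data you lay out yourself; it does not describe how Hoard works internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLineSize = 64;

// Returns true if both addresses fall within the same cache line.
bool same_cache_line(const void* a, const void* b) {
    auto line_a = reinterpret_cast<std::uintptr_t>(a) / kCacheLineSize;
    auto line_b = reinterpret_cast<std::uintptr_t>(b) / kCacheLineSize;
    return line_a == line_b;
}

// A counter padded so that each instance occupies its own cache line.
struct alignas(kCacheLineSize) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLineSize, "one counter per line");

// Sanity check on a buffer we control: offsets 0 and 8 share a line,
// while offsets 0 and 64 do not.
void check_layout() {
    alignas(kCacheLineSize) unsigned char buffer[2 * kCacheLineSize] = {};
    assert(same_cache_line(&buffer[0], &buffer[8]));
    assert(!same_cache_line(&buffer[0], &buffer[kCacheLineSize]));
}
```

Applying `same_cache_line` to two small objects that were allocated for different threads shows whether the allocator placed them where false sharing can occur. The answer depends on the allocator in use.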
Multithreaded programs can also lead the allocator to blow up memory consumption. This effect can multiply the amount of memory needed to run your application by the number of CPUs on your machine: four CPUs could mean that you need four times as much memory. Hoard is guaranteed (provably!) to bound memory consumption.
You can use Homebrew to install the current version of Hoard as follows:
brew tap emeryberger/hoard
brew install --HEAD emeryberger/hoard/libhoard
This not only installs the Hoard library, but also creates a hoard command that runs any program with Hoard from the command line:
hoard myprogram-goes-here
On Linux, you may need to first install the appropriate version of libstdc++-dev (e.g., libstdc++-12-dev):
sudo apt install libstdc++-dev
Now, to build Hoard from source, do the following:
git clone https://github.com/emeryberger/Hoard
cd Hoard
mkdir build && cd build
cmake ..
make
You can then use Hoard by linking it with your executable, or by setting the LD_PRELOAD environment variable, as in
export LD_PRELOAD=/path/to/libhoard.so
or, in Mac OS X:
export DYLD_INSERT_LIBRARIES=/path/to/libhoard.dylib
Hoard uses Microsoft Detours for function interposition on Windows. Detours is automatically downloaded and built by CMake.
git clone https://github.com/emeryberger/Hoard
cd Hoard
mkdir build && cd build
cmake ..
cmake --build . --config Release
This produces build\Release\hoard.dll along with the withdll.exe and setdll.exe tools. Hoard supports the x86, x64, ARM, and ARM64 architectures.
Important: Programs must be compiled with /MD (dynamic C runtime) for Hoard to intercept allocations. Programs compiled with /MT (static C runtime) have allocation functions embedded directly in the executable, which Hoard cannot intercept.
With unmodified executables (recommended):
Use withdll.exe (built automatically) to inject Hoard into any program at runtime, similar to LD_PRELOAD on Linux:
build\Release\withdll.exe /d:build\Release\hoard.dll yourapp.exe [args...]
Permanent modification:
Use setdll.exe (built automatically) to modify an executable's import table:
# Add Hoard to executable (creates backup as .exe~)
build\Release\setdll.exe /d:build\Release\hoard.dll yourapp.exe
# Remove Hoard from executable
build\Release\setdll.exe /r:hoard.dll yourapp.exe
Linking at build time:
You can also link Hoard directly into your application:
cl /Ox /MD yourapp.cpp /link hoard.lib
The directory benchmarks/ contains a number of benchmarks used to evaluate and tune Hoard.
All benchmarks were run on a 192-core, 2-node NUMA system (AMD EPYC). Graphs are normalized to Hoard (1.0 = Hoard, shown as a green line). Values above the line indicate worse results than Hoard.
Key findings:
- Hoard achieves 1.3-1.5x higher throughput than mimalloc, jemalloc, and glibc on server workloads (Larson)
- Hoard is 2-5x faster on realloc-heavy workloads (Phong)
- Hoard uses less memory than mimalloc and jemalloc at high thread counts
- On NUMA systems, Hoard is up to 1.6x faster due to NUMA-aware memory management
Simulates a multithreaded server handling many short-lived allocations with object passing between threads.
Take-home: Hoard achieves 1.3-1.5x higher throughput than all other allocators across all thread counts. This benchmark is representative of real server workloads.
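The object-passing pattern described above, where one thread allocates a block and another frees it, can be sketched as follows. This is not the Larson benchmark itself, only a minimal illustration of the hand-off. `HandOff` and `pass_objects` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical hand-off queue shared by the two threads.
struct HandOff {
    std::mutex m;
    std::condition_variable cv;
    std::queue<void*> blocks;
    bool done = false;
};

// Allocates up to `count` blocks on a producer thread and frees them on a
// consumer thread. Returns the number of blocks the consumer freed.
std::size_t pass_objects(std::size_t count, std::size_t block_size) {
    assert(block_size > 0);
    HandOff q;
    std::size_t freed = 0;  // written only by the consumer, read after join

    std::thread consumer([&q, &freed] {
        for (;;) {
            std::unique_lock<std::mutex> lock(q.m);
            q.cv.wait(lock, [&q] { return !q.blocks.empty() || q.done; });
            if (q.blocks.empty()) return;  // producer finished, queue drained
            void* p = q.blocks.front();
            q.blocks.pop();
            lock.unlock();
            std::free(p);  // freed by a thread that did not allocate it
            ++freed;
        }
    });

    std::thread producer([&q, count, block_size] {
        for (std::size_t i = 0; i < count; ++i) {
            void* p = std::malloc(block_size);
            if (p == nullptr) break;  // stop early if allocation fails
            std::lock_guard<std::mutex> lock(q.m);
            q.blocks.push(p);
            q.cv.notify_one();
        }
        std::lock_guard<std::mutex> lock(q.m);
        q.done = true;
        q.cv.notify_one();
    });

    producer.join();
    consumer.join();
    return freed;
}
```

A call such as `pass_objects(1000000, 32)` exercises cross-thread frees, which an allocator has to handle without returning memory to the wrong thread's heap or serializing every operation.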
Measures raw allocation/deallocation throughput with minimal work between operations.
Take-home: Hoard is fastest at low-medium thread counts (8-32 threads) and matches mimalloc at 256 threads. Hoard uses significantly less memory than jemalloc at high thread counts.
Tests realloc performance with repeated grow/shrink patterns.
Take-home: Hoard is 2-5x faster than all other allocators at low-medium thread counts (4-64) due to its optimized in-place realloc implementation.
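A repeated grow/shrink realloc pattern like the one this benchmark exercises might look like the sketch below. It is not the benchmark's actual code. The function name `grow_shrink` and the doubling/halving schedule are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Grows a buffer from `min_size` toward `max_size` by doubling, then shrinks
// it back by halving, repeating for `rounds`. Returns the number of realloc
// calls made, or 0 if an allocation fails.
std::size_t grow_shrink(std::size_t rounds, std::size_t min_size,
                        std::size_t max_size) {
    assert(min_size > 0 && min_size <= max_size);
    unsigned char* buf = static_cast<unsigned char*>(std::malloc(min_size));
    if (buf == nullptr) return 0;
    buf[0] = 0x5A;  // marker byte; realloc must preserve existing contents
    std::size_t calls = 0;
    for (std::size_t r = 0; r < rounds; ++r) {
        std::size_t size = min_size;
        // Grow phase: double until the next step would exceed max_size.
        while (size * 2 <= max_size) {
            size *= 2;
            void* p = std::realloc(buf, size);
            if (p == nullptr) { std::free(buf); return 0; }
            buf = static_cast<unsigned char*>(p);
            ++calls;
        }
        // Shrink phase: halve until the next step would go below min_size.
        while (size / 2 >= min_size) {
            size /= 2;
            void* p = std::realloc(buf, size);
            if (p == nullptr) { std::free(buf); return 0; }
            buf = static_cast<unsigned char*>(p);
            ++calls;
        }
        assert(buf[0] == 0x5A);
    }
    std::free(buf);
    return calls;
}
```

With `min_size` 16 and `max_size` 1024, each round makes six growing and six shrinking calls. An allocator that can resize a block in place avoids copying the buffer on many of these calls.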
Pure malloc/free pairs with no work between operations. Tests raw allocator scalability.
Take-home: jemalloc excels here; this workload is adversarial for Hoard's superblock design. However, jemalloc uses significantly more memory.
On NUMA systems, memory locality matters. Hoard's NUMA-aware sharding keeps allocations on the same NUMA node as the allocating thread, reducing cross-node memory traffic.
Take-home: At 128 threads on a 2-node NUMA system, Hoard is 1.4x faster than mimalloc, 1.4x faster than jemalloc, and 1.6x faster than glibc. The advantage grows with thread count.
Hoard has changed quite a bit over the years, but for technical details of the first version of Hoard, read Hoard: A Scalable Memory Allocator for Multithreaded Applications, by Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. The Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November 2000.











