03 May 14:42

8ce76cb

Latest

This new release of StencilStream broadens the previously FPGA-only framework into a portable, SYCL-based 2D stencil framework that targets FPGAs, NVIDIA GPUs, and CPUs from the same transition-function code. It also brings a significant API cleanup, makes the FPGA backends faster through spatial parallelism, and adds an experimental multi-FPGA backend.

New backends and performance features

GPU backend. A new CUDA backend, built on Codeplay's oneAPI for NVIDIA GPUs plugin, brings StencilStream to NVIDIA GPUs. A transparent Array-of-Structs ↔ Struct-of-Arrays transformation lets the very same transition function reach high throughput on both GPUs and FPGAs. See the IWOCL '26 publication (doi:10.1145/3811257.3811259) for a detailed evaluation.
Spatial parallelism on the FPGA backends. The Monotile and Tiling backends now use vectorized processing elements, raising the achievable throughput substantially over 3.0.0. The highest single-device throughput measured for the new release is 176.08 GCells/s (1.58 TFLOPS) for Jacobi on the Tiling backend, and 122.67 GCells/s (1.84 TFLOPS) arithmetic throughput for HotSpot on the Monotile backend (BittWare 520N with Intel Stratix 10 GX 2800).
Experimental multi-FPGA Monotile backend. Uses the custom FPGA networking infrastructure at PC2's Noctua 2 supercomputer to scale a Monotile design beyond a single device.
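As a rough illustration of what an Array-of-Structs ↔ Struct-of-Arrays transformation does, the following sketch converts a small grid between the two layouts. The Cell and CellArrays types and the to_soa / to_aos helpers are hypothetical names chosen for this example; they are not part of StencilStream, whose transformation is applied transparently inside the backend.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cell type, used only for illustration.
struct Cell {
    float temperature;
    float power;
};

// Struct-of-Arrays layout: one contiguous array per cell field.
struct CellArrays {
    std::vector<float> temperature;
    std::vector<float> power;
};

// Convert an Array-of-Structs grid into the Struct-of-Arrays layout.
CellArrays to_soa(const std::vector<Cell> &cells) {
    CellArrays soa;
    soa.temperature.reserve(cells.size());
    soa.power.reserve(cells.size());
    for (const Cell &cell : cells) {
        soa.temperature.push_back(cell.temperature);
        soa.power.push_back(cell.power);
    }
    return soa;
}

// Convert back, restoring the original Array-of-Structs layout.
std::vector<Cell> to_aos(const CellArrays &soa) {
    assert(soa.temperature.size() == soa.power.size());
    std::vector<Cell> cells;
    cells.reserve(soa.temperature.size());
    for (std::size_t i = 0; i < soa.temperature.size(); ++i) {
        cells.push_back(Cell{soa.temperature[i], soa.power[i]});
    }
    return cells;
}
```

The transition function keeps working on whole Cell values in either case; only the storage order of the fields in memory differs.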

Breaking API changes

These changes affect every user upgrading from 3.0.0:

Index types. The configurable stencil::uindex_t and stencil::index_t, along with the STENCIL_INDEX_WIDTH macro, have been replaced by std::size_t and std::ptrdiff_t to align with the SYCL standard. Index width narrowing is now done automatically within the FPGA backends. The StencilStream/Index.hpp header has been removed.
2D coordinates. The custom ID / UID / GenericID types have been replaced by sycl::id<2> and sycl::range<2>, again to align with the SYCL standard. The StencilStream/GenericID.hpp header has been removed.
Index ordering. The first index of a 2D coordinate is now the row and the second is the column, again matching the SYCL standard. Transition functions, grid construction, and accessor calls written against 3.0.0 must be updated accordingly.
No more Boost dependency. StencilStream no longer pulls in Boost; builds and downstream projects can drop the corresponding find/link lines.
Internal headers reorganized. Implementation-detail headers (helpers, I/O / memory / switch kernels, the per-backend kernel and design classes) now live under StencilStream/internal/ and per-backend internal/ subdirectories. Public concepts in Concepts.hpp and Stencil.hpp have been updated to use the new index types.
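The index-ordering change is the easiest one to get wrong when porting, so here is a minimal sketch of it. It uses std::array<std::size_t, 2> as a stand-in for sycl::id<2> so that it compiles without a SYCL toolchain. The helper names from_column_row and linearize are hypothetical, and the sketch assumes that the old code passed coordinates in (column, row) order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-in for sycl::id<2> and sycl::range<2>; index 0 is the row, index 1 the column.
using Id2 = std::array<std::size_t, 2>;

// Hypothetical porting helper: turn an old (column, row) pair into row-first order.
inline Id2 from_column_row(std::size_t column, std::size_t row) {
    return Id2{row, column};
}

// Row-major linearization, the same order SYCL uses for a 2D id within a 2D range.
inline std::size_t linearize(Id2 id, Id2 range) {
    assert(id[0] < range[0] && id[1] < range[1]);
    return id[0] * range[1] + id[1];
}
```

With a grid of 4 rows and 8 columns, the cell in row 1 and column 3 is from_column_row(3, 1) and has the linear index 1 * 8 + 3 = 11.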

New example and documentation

Jacobi example. A new example under examples/jacobi/ provides multiple Jacobi-kernel variants with adjustable computational complexity. It also serves as an additional benchmark in the new performance figures.
Documentation overhaul. The README has been rewritten with up-to-date build, run, and benchmarking instructions covering all backends, and the Doxygen documentation now uses the Doxygen Awesome theme with a dark-mode toggle.

Build and tooling

Toolchain. Validated on Intel oneAPI 24.2.1. Although Intel oneAPI version 25.1.0 is the last one to support FPGAs, StencilStream only supports version 24.2.1.
Environment setup. Separate scripts/env_fpga.sh and scripts/env_cuda.sh scripts replace the previous combined setup, so the FPGA and CUDA toolchains can be loaded independently on Noctua 2.
Per-backend benchmark scripts. Each example now ships benchmark_mono.sh, benchmark_tiling.sh, and benchmark_cuda.sh driver scripts on top of the shared Julia benchmark harness.
Standalone Conway build. The Conway example provides a CMakeLists.standalone.txt that can be used to build it outside of the StencilStream source tree.


31 Aug 09:14

JOOpdenhoevel

v3.0.0

b031c33

StencilStream v3.0.0

Highlights

This release introduces a major API and build-system evolution for StencilStream, centered around a more modern stencil update model and improved project integration.

New update-style stencil API (C++20 concepts):
Stencil updates now use a concept-based design instead of class-inheritance-based interfaces. This makes update definitions more explicit, composable, and easier to validate at compile time.
New Time-Dependent Value (TDV) system:
Added TDV support for precomputing values that depend only on the current iteration/time. The system includes multiple implementation strategies so you can choose the best trade-off for your backend and workload.
Expanded CMake-first support:
Build support has been consolidated around CMake, replacing the previous dual CMake/Make approach and making integration in modern C++ projects cleaner.
New convection example:
Added a new example demonstrating how to use the update-style API to embed stencil updates as subroutines inside a larger algorithmic workflow.

Breaking Changes

The stencil update API has changed to the new concept-based update style.
Existing code will require adaptation.
Build workflows based on the old dual CMake/Make setup should be migrated to CMake.

Migration Notes

Stencil updates: Port custom transition/update logic to the new concept-based update interface.
Build system: Use CMake targets/configuration as the primary integration path.
TDV: TDV is new in this release; no migration from prior TDV APIs is required.


01 Dec 13:51

JOOpdenhoevel

v2.1.1

9a9c8cd

StencilStream Version 2.1.1

This release contains hotfixes for regressions introduced by newer oneAPI versions and brings StencilStream up to date with oneAPI version 2022.3.0.


15 Sep 07:50

JOOpdenhoevel

v2.1.0

7a006a7

StencilStream Version 2.1.0

We are very excited to bring you the new version 2.1.0 of StencilStream, the Generic Stencil Simulation Library for FPGAs!

This major update brings two new features, along with some smaller tweaks:

Arbitrary pipeline lengths
A new stencil executor with a monotile architecture

Arbitrary Pipeline Lengths

Due to a small internal quirk, pipeline lengths previously had to be a power of two: the pipeline length is used to calculate the number of required banks, which DPC++ demands to be a power of two. Version 2.1.0 circumvents this by requesting the next-largest power of two as the number of banks, but only using the banks that are required to implement the demanded pipeline length. DPC++ automatically optimizes the superfluous banks away, which leads to the desired memory and area footprint.
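The rounding step can be sketched as follows. For simplicity, the sketch assumes that one bank is needed per pipeline stage; the real relation between pipeline length and bank count is internal to StencilStream, and bank_count and unused_banks are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is greater than or equal to pipeline_length.
// Assumption for this sketch: one bank per pipeline stage.
inline std::size_t bank_count(std::size_t pipeline_length) {
    assert(pipeline_length > 0);
    std::size_t banks = 1;
    while (banks < pipeline_length) {
        banks *= 2;
    }
    return banks;
}

// Banks that are declared but never accessed, and can therefore be optimized away.
inline std::size_t unused_banks(std::size_t pipeline_length) {
    return bank_count(pipeline_length) - pipeline_length;
}
```

For a pipeline length of 200, this declares 256 banks and leaves 56 of them unused; for a length of 64, nothing changes.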

A new stencil executor

StencilStream v2.0.0 introduced a new memory management architecture that divides the dynamic grid into tiles of static size, which allows unrestricted grid sizes and may provide finer runtime scaling. However, there may be applications where the single-tile approach of the v1.x.x releases is preferable due to its reduced complexity. Therefore, this release introduces the AbstractExecutor, which separates the execution mechanics from the execution strategy, as well as the MonotileExecutor, which implements the monotile architecture.

We are looking forward to seeing how the different strategies compare in different scenarios, so we would appreciate your feedback!


13 Aug 12:26

JOOpdenhoevel

v2.0.1

0e2bfa5

StencilStream Version 2.0.1

We are very excited to bring you the new version 2.0.1 of StencilStream, the Generic Stencil Simulation Library for FPGAs!

This version brings some minor improvements and bug fixes that were requested by users:

Bugfix: Evaluating the correct number of generations when the requested number of generations is not a multiple of the pipeline length.
Bugfix: Allowing cell sizes that are not a divisor of the flush size.
New API: The method RuntimeSample::get_mean_speed returns the number of pipeline passes per second.
New API: The attribute Stencil::grid_range contains the range of the grid.
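The first bugfix concerns the case where the requested number of generations does not divide evenly into pipeline passes. One way to picture the intended behavior is to split the request into passes of at most one pipeline length each, with a shorter final pass. This is only a sketch of that arithmetic under this assumption, not the executor's actual code, and plan_passes is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Split `generations` into passes that each advance at most `pipeline_length` generations.
std::vector<std::size_t> plan_passes(std::size_t generations, std::size_t pipeline_length) {
    assert(pipeline_length > 0);
    std::vector<std::size_t> passes;
    while (generations > 0) {
        // The last pass is shorter if the request is not a multiple of the pipeline length.
        std::size_t step = std::min(generations, pipeline_length);
        passes.push_back(step);
        generations -= step;
    }
    return passes;
}
```

Requesting 10 generations with a pipeline length of 4 yields passes of 4, 4, and 2 generations, so exactly 10 generations are evaluated rather than 8 or 12.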

In addition to that, the FDTD example has been rewritten to produce more useful and correct data.


06 Jul 13:37

JOOpdenhoevel

v2.0.0

e646467

StencilStream Version 2.0.0

We are very excited to bring you the new version 2.0.0 of StencilStream, the Generic Stencil Simulation Library for FPGAs!

For this release, we have fundamentally changed the way StencilStream works internally, which allows simulation grids of arbitrary size and better scaling for smaller grids. Let's go into the details:

What's new?

Architecture

StencilStream now uses a spatial tiling approach introduced by Hamid Reza Zohouri, Artur Podobas and Satoshi Matsuoka that partitions a dynamically sized grid into statically sized tiles which can be better handled by the processing pipeline.
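The partitioning arithmetic behind this approach can be sketched with a ceiling division per dimension. The TileCount type and the tiles_for_grid helper are hypothetical and only show how a dynamically sized grid maps onto tiles of a fixed, compile-time size; they do not reproduce StencilStream's internal bookkeeping.

```cpp
#include <cassert>
#include <cstddef>

// Number of tiles needed along each dimension of the grid.
struct TileCount {
    std::size_t rows;
    std::size_t columns;
};

// Integer division that rounds up, so partially filled tiles are counted too.
inline std::size_t ceil_div(std::size_t value, std::size_t divisor) {
    assert(divisor > 0);
    return (value + divisor - 1) / divisor;
}

// Cover a grid of arbitrary size with tiles of a fixed size.
inline TileCount tiles_for_grid(std::size_t grid_rows, std::size_t grid_columns,
                                std::size_t tile_rows, std::size_t tile_columns) {
    return TileCount{ceil_div(grid_rows, tile_rows), ceil_div(grid_columns, tile_columns)};
}
```

A grid with 1000 rows and 700 columns, covered by 256 × 256 tiles, needs 4 × 3 tiles; the tiles on the far edges are only partially filled.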

Defined Grid Halo

This also allows for a new way to handle the grid halo, that is, the cells outside the grid that are required to calculate the cells on the grid's edge. In the previous version, these cells were undefined and transition functions had to use the indices to check for edge cases. Now, the user can provide a constant value to StencilStream, and the pipeline guarantees that all cells in the grid halo will have this value. Old transition functions might still work, but their complexity can be vastly reduced using this precondition.

For example, you would have needed to write a transition function like this in v1.1.1 in order to sum up the neighbors of a cell:

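auto trans_func = [grid_width, grid_height](Stencil2D<float, 1> const &stencil, Stencil2DInfo const &info) {
    float sum = 0;
    // Left neighbor: undefined in the first column, so skip it there.
    if (info.center_cell_id.c > 0) {
        sum += stencil[ID(-1, 0)];
    }
    // Right neighbor: undefined in the last column.
    if (info.center_cell_id.c < grid_width - 1) {
        sum += stencil[ID(1, 0)];
    }
    // Upper neighbor: undefined in the first row.
    if (info.center_cell_id.r > 0) {
        sum += stencil[ID(0, -1)];
    }
    // Lower neighbor: undefined in the last row.
    if (info.center_cell_id.r < grid_height - 1) {
        sum += stencil[ID(0, 1)];
    }
    return sum;
};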

Now, you can set the halo value to 0.0 and simply write:

auto trans_func = [](Stencil<float, 1> const &stencil) {
    return stencil[ID(-1, 0)] + stencil[ID(1, 0)] + stencil[ID(0, -1)] + stencil[ID(0, 1)];
};

Edge cases are automatically handled by StencilStream.
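The guarantee can be modeled outside of StencilStream with a small accessor that returns the halo value for every coordinate outside the grid. HaloGrid and neighbor_sum are hypothetical names for this sketch, which runs on the host and only mirrors the semantics described above, using the same (column, row) argument order as the ID offsets in the examples.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major grid with a constant halo value; not a StencilStream type.
struct HaloGrid {
    std::ptrdiff_t columns;
    std::ptrdiff_t rows;
    float halo_value;
    std::vector<float> cells; // The cell at column c and row r is cells[r * columns + c].

    // Return the cell at column c and row r, or the halo value outside the grid.
    float at(std::ptrdiff_t c, std::ptrdiff_t r) const {
        if (c < 0 || c >= columns || r < 0 || r >= rows) {
            return halo_value;
        }
        return cells[static_cast<std::size_t>(r * columns + c)];
    }
};

// Sum of the four direct neighbors; the caller needs no edge checks.
float neighbor_sum(const HaloGrid &grid, std::ptrdiff_t c, std::ptrdiff_t r) {
    assert(static_cast<std::size_t>(grid.columns * grid.rows) == grid.cells.size());
    return grid.at(c - 1, r) + grid.at(c + 1, r) + grid.at(c, r - 1) + grid.at(c, r + 1);
}
```

For a 2 × 2 grid holding 1, 2, 3, 4 in row-major order and a halo value of 0.0, the top-left cell has the neighbor sum 2 + 3 = 5, with both out-of-grid neighbors contributing the halo value.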

Pipeline Length as a Template Parameter

The previous version of StencilStream used preprocessor macros to duplicate the execution stages of a pipeline. This came with the limitation that the pipeline length was capped at 1024 stages and that the length had to be set via a macro definition as well. In version 2.0.0, we have overcome this limitation: the pipeline length of a design is now set as a template parameter of the StencilExecutor class.
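The general technique of carrying a stage count as a template parameter rather than a macro can be sketched like this. run_pipeline and Doubler are hypothetical names; the sketch applies one update per stage on the host and says nothing about how StencilExecutor instantiates its stages in hardware.

```cpp
#include <cstddef>

// Apply `update` once per pipeline stage; the stage count is fixed at compile time.
template <std::size_t pipeline_length, typename Cell, typename Update>
constexpr Cell run_pipeline(Cell cell, Update update) {
    static_assert(pipeline_length > 0, "the pipeline needs at least one stage");
    for (std::size_t stage = 0; stage < pipeline_length; ++stage) {
        cell = update(cell);
    }
    return cell;
}

// Example update used below: doubles the cell value.
struct Doubler {
    constexpr int operator()(int cell) const { return 2 * cell; }
};
```

Different designs in one program can then use different lengths side by side, for example run_pipeline<3>(1, Doubler{}) and run_pipeline<10>(1, Doubler{}), which a single global macro definition cannot express.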

Breaking Changes

This release also brings some breaking changes to the user-facing interface to reduce verbosity and increase clarity:

StencilStream now uses StencilStream as the default include directory name, and a one-file-per-class policy has been adopted where suitable. For example, the include line for the StencilExecutor class is #include <StencilStream/StencilExecutor.hpp> instead of #include <stencil/stencil.hpp>.
The Stencil2D class has been renamed to Stencil.
The Stencil2DInfo class has been merged into Stencil; transition functions only accept a Stencil instance as a parameter.
The StencilExecutor class has been completely rewritten.

More information

More information on how StencilStream is structured and how the interface is designed can be found in the documentation. It is both hosted online and attached as a tarball.

What's next?

This version marks the introduction of the spatial tiling architecture. Until now, we have focused on correctness and clarity, with performance only a secondary concern. In subsequent releases, we will profile and improve the performance of StencilStream and also provide optimization guides for users to achieve the full potential of their applications.

Your feedback is always welcome! Please submit an issue if you find a bug or have a feature request.


09 Dec 16:55

JOOpdenhoevel

v1.1.1

8711322

StencilStream Version 1.1.1

Changes

This release adds a benchmark mode to the hotspot and fdtd examples. In fdtd, this can be enabled with the -b flag. In hotspot, it can be enabled by appending true to the list of arguments.

Performance

This release contains the isolated StencilStream library as well as synthesized application binaries. All of these binaries have been synthesized using oneAPI version beta-10, and the targeted boards are the Nallatech/Bittware 520N Board as well as the Intel PAC Stratix 10. Below are performance metrics of some sample applications. The conway application is optimized for readability, not for performance, and is therefore not listed.

Bittware/Nallatech 520N (Stratix 10 GX 2800)

Application	Cycles per Loop	Pipeline Depth	Cycle Frequency	Generations per Second	Overall Performance	Logic Usage	Register Usage	RAM Usage	DSP Usage
`hotspot`	1.07 cycles	200 cores	206.25 MHz	36933 G/s	580.91 GFLOPS	79.38%	49.02%	35.35%	52.13%
`fdtd`	16.54 cycles	35 cores	272.50 MHz	243.56 G/s	136.10 GFLOPS	79.93%	49.61%	47.64%	52.66%

Intel PAC D5005 (Stratix 10 SX 2800)

Application	Cycles per Loop	Pipeline Depth	Cycle Frequency	Generations per Second	Overall Performance	Logic Usage	Register Usage	RAM Usage	DSP Usage
`hotspot`	0.98 cycles	200 cores	163.00 MHz	31644.4 G/s	497.724 GFLOPS	83.67%	50.16%	35.60%	52.13%
`fdtd`	6.69 cycles	20 cores	221.00 MHz	157.61 G/s	78.01 GFLOPS	63.93%	35.65%	33.45%	30.30%


03 Dec 11:35

JOOpdenhoevel

v1.0.0

38c457d

StencilStream Version 1.0.0

Nallatech/Bittware 520N Board (Stratix 10 GX 2800)

Application	Main Loop II	Pipeline Depth	Cycle Frequency	Generations per Second	Overall Performance	Logic Usage	Register Usage	RAM Usage	DSP Usage
`hotspot`	1.05 cycles	225 cores	79.63 MHz	16,328 G/s	256.84 GFLOPS	85.34%	51.23%	38.31%	58.64%
`fdtd`	1.73 cycles	30 cores	225 MHz	233.10 G/s	29.02 KFLOPS	83.19%	50.37%	43.91%	45.42%

Intel PAC (Stratix 10 SX)

Application	Main Loop II	Pipeline Depth	Cycle Frequency	Generations per Second	Overall Performance	Logic Usage	Register Usage	RAM Usage	DSP Usage
`hotspot`	1.06 cycles	100 cores	225.00 MHz	20,161.29 G/s	317.17 GFLOPS	64.26%	35.75%	25.09%	26.11%
`fdtd`	1.45 cycles	20 cores	218.00 MHz	178.95 G/s	24.43 KFLOPS	69.41%	37.87%	34.66%	30.29%


23 Nov 10:46

JOOpdenhoevel

v1.0.0-rc1

d619f19

StencilStream Version 1.0.0 RC1 Pre-release

Pre-release

This release contains the isolated StencilStream library as well as synthesized application binaries. All of these binaries have been synthesized using oneAPI version beta-10 and target the Nallatech/Bittware 520N Board, powered by a Stratix 10 GX 2800. Below are some measured performance metrics. Note that the conway application does not provide runtime information and therefore has some blank fields.

Performance Metrics

Application	Main Loop II	Pipeline Depth	Cycle Frequency	Generations per Second	Overall Performance	Logic Usage	Register Usage	RAM Usage	DSP Usage
`hotspot`	1.05 cycles	225 cores	79.63 MHz	16,328 G/s	256.84 GFLOPS	85.34%	51.23%	38.31%	58.64%
`fdtd`	1.73 cycles	30 cores	225 MHz	233.10 G/s	29.02 KFLOPS	83.19%	50.37%	43.91%	45.42%
`conway`	-	10 cores	353.33 MHz	-	-	25.41%	12.78%	8.45%	0.05%


23 Nov 11:23

JOOpdenhoevel

copy_bug_0

8ffcbff

Copy Bug, build0 Pre-release

Pre-release

copy_bug_0

Changing the scripts for the PAC


Releases: pc2/StencilStream

StencilStream v4.0.0
