CUDA (GPU) implementation of the sieve of Eratosthenes.
Key components used to make it really fast:
- segmented sieve of Eratosthenes
- sieve in shared memory
- bucket concept
- wheel concept
- blackkey concept - a compact layout of the numbers to be sieved, one odd number per bit (see the sketch after this list)
- all data structures are reused again and again - no malloc, no free, hence no memory leaks, and the init time is negligible
- no bulk data transfer between CPU and GPU; only the sieve result is transferred from GPU to CPU
- different sieving methods for very small, small, medium, large and very large numbers
- internal granularity is 10^9; subranges of it are handled by the counting module
- Sieve range 0 ... 2^64 (1.8446744x10^19)
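To illustrate the shared-memory sieve and the one-odd-number-per-bit layout, here is a minimal CUDA sketch of a segmented sieve kernel. It is not LingSieve's actual kernel: the name sieve_segment, the 8 KiB segment size and the launch shape are assumptions for the example, and the bucket and wheel optimizations as well as the per-magnitude method selection are omitted.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// One bit per odd number: bit i of a segment starting at the odd value
// 'low' represents low + 2*i. 8 KiB of shared memory per block.
constexpr int      SEG_WORDS = 2048;
constexpr int      SEG_BITS  = SEG_WORDS * 32;        // odd numbers per segment
constexpr uint64_t SEG_SPAN  = 2ull * SEG_BITS;       // integers covered

// Each block sieves one segment in shared memory, then reduces the
// surviving bits to a prime count with __popc(). 'primes' holds the odd
// sieving primes up to sqrt(end of range); 'start' is assumed even.
// Handling of 2, of the value 1, of a partial last segment and of
// overflow near 2^64 is omitted for brevity.
__global__ void sieve_segment(const uint32_t* primes, int nPrimes,
                              uint64_t start, unsigned long long* count)
{
    __shared__ uint32_t bits[SEG_WORDS];
    uint64_t low = start + (uint64_t)blockIdx.x * SEG_SPAN + 1; // first odd

    // 1) Mark every odd number in the segment as a prime candidate.
    for (int w = threadIdx.x; w < SEG_WORDS; w += blockDim.x)
        bits[w] = 0xFFFFFFFFu;
    __syncthreads();

    // 2) Cross off odd multiples; threads take sieving primes round-robin.
    for (int i = threadIdx.x; i < nPrimes; i += blockDim.x) {
        uint64_t p = primes[i];
        uint64_t m = p * p;                      // first useful multiple
        if (m < low) {
            m = (low + p - 1) / p * p;           // first multiple >= low
            if ((m & 1) == 0) m += p;            // make it odd (p is odd)
        }
        for (; m < low + SEG_SPAN; m += 2 * p) { // skip even multiples
            uint64_t bit = (m - low) >> 1;
            atomicAnd(&bits[bit >> 5], ~(1u << (bit & 31)));
        }
    }
    __syncthreads();

    // 3) Count the surviving bits - the primes found in this segment.
    unsigned long long local = 0;
    for (int w = threadIdx.x; w < SEG_WORDS; w += blockDim.x)
        local += __popc(bits[w]);
    atomicAdd(count, local);
}
```

A launch like sieve_segment<<<numSegments, 256>>>(dPrimes, nPrimes, start, dCount) would cover numSegments x 2^17 numbers per call; the real program layers the bucket and wheel techniques on top of such a kernel and reuses all buffers across calls, as described above.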
The program LingSieve counts primes; it is designed for long duration tests (LDT), for handling large quantities of data, and for generating large prime tables - that is what GPU computing is all about.
As of V3.2, small sieve jobs can be launched as well, e.g. LingSieve 100 -d=48 to sieve from 100 ... 148 and find 9 primes.
The program scales to the GPU hardware automatically; there is no need to configure anything.
Measurements for V2.1 ... V2.3 were taken with an NVIDIA GeForce GTX 1080 (Palit Gamerock)
Measurements for V3.x were taken with an NVIDIA GeForce RTX 5060 Ti (Palit Infinity 3)
| Range | Count | V2.1 | V2.2 | V2.3 | V3.0 | V3.1 | V3.2 |
|---|---|---|---|---|---|---|---|
| 0 … 10^11 | 4,118,054,813 | 0.60s | 0.577s | 0.564s | 0.225s | 0.195s | 0.160s |
| 0 … 10^12 | 37,607,912,018 | 8.31s | 7.32s | 5.91s | 2.993s | 2.643s | 2.368s |
| 0 … 10^13 | 346,065,536,839 | 103s | 87.2s | 75.9s | 36.2s | 33.1s | 30.3s |
| 0 … 10^14 | 3,204,941,750,802 | 1231s | 1019s | 952s | 439s | 408s | 368s |

The following measurements are for windows of size 10^11 at increasing offsets (V3.x times in ms):

| Range | Count | V2.1 | V2.2 | V2.3 | V3.0 | V3.1 | V3.2 |
|---|---|---|---|---|---|---|---|
| 0 … 10^11 | 4,118,054,813 | 0.60s | 0.577s | 0.564s | 225ms | 195ms | 160ms |
| 10^12 + 10^11 | 3,612,791,400 | 0.90s | 0.795s | 0.707s | 303ms | 293ms | 273ms |
| 10^13 + 10^11 | 3,340,141,707 | 1.08s | 0.919s | 0.815s | 376ms | 357ms | 326ms |
| 10^14 + 10^11 | 3,102,063,927 | 1.26s | 1.057s | 1.002s | 469ms | 439ms | 397ms |
| 10^15 + 10^11 | 2,895,317,534 | 1.44s | 1.235s | 1.186s | 553ms | 522ms | 472ms |
| 10^16 + 10^11 | 2,714,336,584 | 1.62s | 1.494s | 1.365s | 642ms | 610ms | 554ms |
| 10^17 + 10^11 | 2,554,712,095 | 1.81s | 1.664s | 1.548s | 732ms | 702ms | 644ms |
| 10^18 + 10^11 | 2,412,731,214 | 2.08s | 1.912s | 1.790s | 827ms | 794ms | 742ms |
| 4x10^18 + 10^11 | 2,334,654,194 | 2.34s | 2.166s | 2.056s | 895ms | 865ms | 814ms |
| 10^19 + 10^11 | 2,285,693,139 | 2.64s | 2.518s | 2.326s | 989ms | 942ms | 895ms |
| 1.4x10^19 + 10^11 | 2,268,304,926 | 2.81s | 2.646s | 2.461s | 1032ms | 984ms | 939ms |
| 1.8x10^19 + 10^11 | 2,255,482,326 | 2.91s | 2.747s | 2.586s | 1073ms | 1027ms | 986ms |
This is the fastest implementation of the sieve of Eratosthenes on a GPU to the best of my knowledge.
Requirements:
- NVIDIA graphics card with driver
- GPU memory: 1346MB ... 1926MB, depending on the range to be sieved
- a 64-bit host application and a non-embedded operating system (Linux, Windows)
The available binaries were compiled with Microsoft Visual Studio Community for 64-bit Windows and the CUDA Toolkit for NVIDIA GPUs with the compute capability (CC) indicated in the table.
| Year | V | MSVS | ToolKit | CC | Driver | Target | MD5 |
|---|---|---|---|---|---|---|---|
| 2018 | 2.1 | 2017 | 9.0 | 6.1 | >=384.81 | W10 | 294001B4092856F5C09A7DCBB907B340 |
| 2022 | 2.2 | 2017 | 9.2 | 6.1 | >=398.26 | W10 | 4EFCB3C000C2C6ACCE0A8C27C68E274F |
| 2026 | 2.3 | 2022 | 12.4 | 6.1 | >=525.xx | W1x | A62DB3C002BD4CF1BFFAB52A65A344EE |
| 2026 | 3.0 | 2022 | 12.9 | 7.5&12.0 | >=575.xx | W1x | E5F9A9A364BD51C4D1FA7761FBA3A259 |
| 2026 | 3.1 | 2022 | 12.9 | 7.5&12.0 | >=575.xx | W1x | 74494FB3BA629F488A926A7F798928C3 |
| 2026 | 3.2 | 2022 | 12.9 | 7.5&12.0 | >=575.xx | W1x | 4EAC17C270A5DDAFD6B1B018E5173E43 |
The sieve source code for V2.2 ... V3.2 is identical; only program and CUDA parameters and the tool sets were modified.
Open a Command Prompt and run the program LingSieve:
1. started without parameters, the program uses the default values - StartValue 0 and segment size 10^11 - and starts an LDT
2. if there is a valid Result file, the program uses its last entry as StartValue, keeps the segment size of 10^11, and starts an LDT
3. if there is valid user input, it is used instead of 1) or 2) - see the sketch below
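A minimal host-side sketch of this startup precedence; the names here (pickStartValue, Result.txt) are assumptions for illustration, not LingSieve's actual code:

```cuda
#include <cstdint>
#include <fstream>

// Hypothetical illustration of the startup rules: user input (rule 3)
// overrides the Result file (rule 2), which overrides the default (rule 1).
uint64_t pickStartValue(bool haveUserInput, uint64_t userStart)
{
    if (haveUserInput)
        return userStart;                 // rule 3: user input wins

    uint64_t start = 0;                   // rule 1: default StartValue 0
    std::ifstream result("Result.txt");   // rule 2: resume after last entry
    for (uint64_t last; result >> last; )
        start = last;                     // keep the file's final entry
    return start;
}
```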
| Examples ≥ V3.2 | Comment | Count |
|---|---|---|
| LingSieve -GPU | Show your graphics card's parameters | - |
| LingSieve -bench | Benchmark your graphics card | - |
| LingSieve /? | Display the help text | - |
| LingSieve 0 | Count from 0 ... 10^11 | 4,118,054,813 |
| LingSieve 0 -d=1e9 | Count from 0 ... 10^9 | 50,847,534 |
| LingSieve 0 -d=1e10 | Count from 0 ... 10^10 | 455,052,511 |
| LingSieve 1e15 -d=1e11 | Count from 10^15 ... 10^15+10^11 | 2,895,317,534 |
| LingSieve 1e17 -d=1e12 | Count from 10^17 ... 10^17+10^12 | 25,546,659,722 |
| LingSieve 100 -d=48 | Count from 100 ... 148 | 9 |
| LingSieve 1e17 -d=2048 | Count from 10^17 ... 10^17+2048 | 55 |
| LingSieve 1e17+123 -d=1e4 | Count from 10^17+123 ... 10^17+123+10^4 | 265 |
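As the last examples above show, the start value and the -d delta accept scientific notation with an optional +offset. A minimal sketch of parsing such arguments; parseValue and the overall structure are assumptions for illustration, not LingSieve's actual parser:

```cuda
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Parse "100", "1e17" or "1e17+123" into a uint64_t. Powers of ten up to
// 10^22 are exact in a double, so the conversion below loses nothing for
// the ranges shown above.
uint64_t parseValue(const char* s)
{
    char* end = nullptr;
    uint64_t v = (uint64_t)strtod(s, &end);   // handles "100" and "1e17"
    if (*end == '+')                          // optional "+offset" suffix
        v += strtoull(end + 1, nullptr, 10);
    return v;
}

int main(int argc, char** argv)
{
    // e.g. LingSieve 1e17+123 -d=1e4
    uint64_t start = argc > 1 ? parseValue(argv[1]) : 0;
    uint64_t delta = 100000000000ull;         // default segment size 10^11
    for (int i = 2; i < argc; ++i)
        if (strncmp(argv[i], "-d=", 3) == 0)
            delta = parseValue(argv[i] + 3);
    printf("sieve %llu ... %llu\n",
           (unsigned long long)start, (unsigned long long)(start + delta));
    return 0;
}
```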
I've sieved 6.8x10^16 numbers in various ranges without any errors - that is 68,000,000 cycles. Once I ran it for more than 60 hours continuously without any problems; it just works as it is supposed to. The results were compared with the extensive prime-counting tables of Tomás Oliveira e Silva and Jan Büthe, and randomly with Kim Walisch's 64-bit program primesieve.
Basic development is finished.