This project demonstrates Homomorphic Encryption (HE) applied to a Machine Learning inference task. A Logistic Regression model classifies breast cancer diagnoses (malignant vs. benign) entirely on encrypted patient data — the compute server never sees the raw input at any point.
- Library used: TenSEAL (Python wrapper over Microsoft SEAL)
- HE Scheme: CKKS (Cheon-Kim-Kim-Song)
- Dataset: Wisconsin Breast Cancer Dataset (sklearn): 569 samples, 30 features, binary label
- Language: Python 3.12
In traditional cloud computing, a hospital wanting to run an ML model on patient data must send the raw (unencrypted) data to the server. The server then sees every value it processes.
Homomorphic Encryption solves this:
- Patient data is encrypted locally before being sent
- The server computes the ML inference directly on ciphertext
- Only the encrypted result is returned — the server learns nothing
- The patient/hospital decrypts the result locally
This enables privacy-preserving machine learning — a critical need in healthcare, finance, and any domain with sensitive data.
Homomorphic Encryption is a form of encryption that allows computation on ciphertexts. The result, when decrypted, matches the result of computing on the plaintext directly (exactly for integer schemes, and up to a small approximation error for CKKS).
Mathematically:
Decrypt( F(Encrypt(x)) ) = F(x)
Where F is some function (in our case: a dot product).
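As a toy illustration of this identity, with F as a dot product against plaintext weights, the sketch below uses the Paillier cryptosystem rather than CKKS, because Paillier's additive homomorphism fits in a few lines of standard-library Python. The primes, function names, and sample vectors are illustrative assumptions; the key is far too small to be secure, and this version handles only non-negative integers.

```python
import math
import random

# Toy Paillier key (tiny primes: insecure, for illustration only).
P, Q = 1789, 1867
N = P * Q
N2 = N * N
G = N + 1
LAM = math.lcm(P - 1, Q - 1)   # private key
MU = pow(LAM, -1, N)           # modular inverse; this shortcut is valid because G = N + 1


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < N using fresh randomness."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Recover the plaintext integer from a ciphertext."""
    return ((pow(c, LAM, N2) - 1) // N) * MU % N


def encrypted_dot(enc_x: list[int], weights: list[int]) -> int:
    """Dot product of an encrypted vector with plaintext non-negative integer weights.

    Multiplying ciphertexts adds their plaintexts; raising a ciphertext to a
    power multiplies its plaintext by that constant.
    """
    acc = encrypt(0)
    for c, w in zip(enc_x, weights):
        acc = (acc * pow(c, w, N2)) % N2
    return acc


x = [3, 1, 4]          # private input (encrypted before leaving the client)
w = [2, 7, 1]          # model weights (known to the server in plaintext)
enc_z = encrypted_dot([encrypt(v) for v in x], w)
z = decrypt(enc_z)     # equals the plaintext dot product of x and w
```

The server-side function `encrypted_dot` only ever touches ciphertexts and the plaintext weights, which mirrors the data flow used in this project.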
There are several HE schemes. We use CKKS, which targets approximate arithmetic on real numbers:
| Scheme | Supports | Used For |
|---|---|---|
| BFV / BGV | Exact integers | Counting, voting |
| CKKS | Approximate real numbers | ML, statistics |
Machine learning models work with floating-point numbers (weights, features). Among the widely used HE schemes, CKKS is the one designed for approximate real-number arithmetic. It introduces a small, controlled approximation error (on the order of 1e-6 in our results), which is negligible for classification.
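The idea of approximate arithmetic can be sketched with plain fixed-point numbers: reals are scaled to integers, multiplied as integers, and scaled back down. This is a simplified plaintext analogy, not the actual CKKS encoding, and it omits the encryption noise that real CKKS adds; `SCALE` and the helper names are assumptions made for this sketch.

```python
SCALE = 2 ** 40  # illustrative scale factor (an assumption of this sketch)


def encode(x: float) -> int:
    """Represent a real number as an integer by scaling and rounding."""
    return round(x * SCALE)


def rescale(v: int) -> int:
    """Divide by SCALE with rounding, so a product carries one scale factor again."""
    return (v + SCALE // 2) // SCALE


def decode(v: int) -> float:
    """Map a scaled integer back to a real number."""
    return v / SCALE


a, b = 0.1234567, -2.7182818
product = rescale(encode(a) * encode(b))  # integer arithmetic only
error = abs(decode(product) - a * b)      # small but generally nonzero
```

The rounding in `encode` and `rescale` is the source of the small error; in this analogy it is far below the 1e-6 level reported for the encrypted computation, where encryption noise dominates.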
poly_modulus_degree = 8192
coeff_mod_bit_sizes = [60, 40, 60]
scale = 2^40
poly_modulus_degree: Defines the size of the polynomial ring that ciphertexts live in. Larger values allow deeper computation (more sequential operations) but are slower and use more memory. 8192 is standard for shallow inference tasks.
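coeff_mod_bit_sizes: Defines the coefficient modulus chain, which determines how many multiplication levels are available. Each rescaled multiplication consumes one level. [60, 40, 60] gives 1 usable multiplication level, which is sufficient for our dot product (one element-wise multiplication with the plaintext weights, followed only by rotations and additions).
scale = 2^40: Controls the precision of encoded real numbers. Values are encoded with roughly 40 fractional bits; the precision observed after encryption noise is lower (errors around 1e-6 in our results), which is still more than enough for ML inference.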
Galois keys: Required for the vector rotation operations used internally by TenSEAL when computing dot products on encrypted vectors.
Every HE operation introduces a small amount of noise into the ciphertext. CKKS manages this through the coefficient modulus chain: each multiplication "consumes" one level of the chain. When the levels are exhausted, further multiplications overflow the remaining modulus and the results become unreliable.
For our project:
- We need a multiplicative depth of only 1 (the dot product)
- Our chain [60, 40, 60] provides exactly 1 usable level
- Average approximation error observed on z: 1.14 × 10⁻⁶, negligible for classification
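The level count above can be checked with a small helper. The sketch below assumes the usual rule of thumb that the first and last primes in the chain are reserved (base level and key-switching prime), so the middle entries are the usable levels, and it uses the 128-bit-security limits on total modulus bits that SEAL applies per `poly_modulus_degree`. The function names are hypothetical.

```python
# Maximum total coefficient-modulus bits for 128-bit security, per degree.
MAX_COEFF_MODULUS_BITS = {4096: 109, 8192: 218, 16384: 438}


def usable_levels(coeff_mod_bit_sizes: list[int]) -> int:
    """Count the middle primes, each of which allows one rescaled multiplication."""
    return max(len(coeff_mod_bit_sizes) - 2, 0)


def fits_security_budget(poly_modulus_degree: int, coeff_mod_bit_sizes: list[int]) -> bool:
    """Check that the chain's total bit length stays within the 128-bit limit."""
    return sum(coeff_mod_bit_sizes) <= MAX_COEFF_MODULUS_BITS[poly_modulus_degree]


project_chain = [60, 40, 60]
levels = usable_levels(project_chain)                    # one middle prime
within_budget = fits_security_budget(8192, project_chain)

# A deeper chain, e.g. three 40-bit levels, no longer fits at degree 8192.
deeper_chain = [60, 40, 40, 40, 60]
deeper_fits = fits_security_budget(8192, deeper_chain)
```

This is why adding multiplication depth is not free: a longer chain may require a larger `poly_modulus_degree`, which makes every operation slower.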
Bootstrapping is an advanced technique that can refresh the noise budget, allowing arbitrarily deep computation. It was not needed here but is a major active research area for deep neural network inference under HE.
A full logistic regression pipeline computes:
z = X · weights + bias
p = sigmoid(z) = 1 / (1 + exp(-z))
label = 1 if p >= 0.5 else 0
The problem: HE only supports addition and multiplication.
sigmoid(z) requires exp() and division — impossible to compute on ciphertexts directly.
The standard solution: Approximate sigmoid with a polynomial:
sigmoid(z) ≈ 0.5 + 0.197z - 0.004z³
This only uses multiplications and additions — HE-compatible.
Why we still don't use it: Degree-3 polynomial approximations are only accurate for |z| ≤ 5. Our logistic regression on 30 standardized features regularly produces |z| > 10, causing the approximation to diverge badly.
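The divergence is easy to reproduce in plaintext. The sketch below compares the exact sigmoid with the degree-3 polynomial quoted above at a few sample points; the function names and the chosen points are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Exact logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_poly3(z: float) -> float:
    """Degree-3 polynomial approximation quoted in the text."""
    return 0.5 + 0.197 * z - 0.004 * z ** 3


for z in (1.0, 5.0, 8.0, 10.0):
    exact = sigmoid(z)
    approx = sigmoid_poly3(z)
    print(f"z={z:5.1f}  sigmoid={exact:.4f}  poly3={approx:.4f}  error={abs(exact - approx):.4f}")
```

At z = 10 the polynomial evaluates to 0.5 + 1.97 - 4.0, a negative value, even though the true sigmoid is close to 1, so thresholding the polynomial at 0.5 would flip the prediction.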
Our solution: For binary classification, sigmoid is unnecessary.
The decision rule sigmoid(z) >= 0.5 is equivalent to z >= 0, because sigmoid is strictly increasing and sigmoid(0) = 0.5.
We compute the encrypted dot product z, decrypt only z, and threshold at 0.
The server never sees the features — only the encrypted z is transmitted back.
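The equivalence of the two decision rules can be checked numerically. This sketch compares them on a grid of plaintext scores; the grid and function names are assumptions for illustration.

```python
import math


def label_via_sigmoid(z: float) -> int:
    """Classify by thresholding the sigmoid probability at 0.5."""
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0


def label_via_sign(z: float) -> int:
    """Classify by thresholding the raw score at 0."""
    return 1 if z >= 0 else 0


# Scores from -20 to 20 in steps of 0.25, including exactly 0.
grid = [k / 4 for k in range(-80, 81)]
mismatches = [z for z in grid if label_via_sigmoid(z) != label_via_sign(z)]
```

An empty `mismatches` list is consistent with the argument that the sigmoid can be dropped without changing any prediction.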
Real-world solutions to this challenge (future work):
- Minimax polynomial approximation (higher degree, requires more CKKS levels)
- Composite polynomial approximation (piecewise, more stable over wider range)
- Training the model to constrain z to a small range
- Using a different activation function that approximates more easily
| | Plaintext | HE Encrypted |
|---|---|---|
| Malignant | 97.6% | 97.6% |
| Benign | 98.6% | 98.6% |
| Overall | 98.25% | 98.25% |
- Prediction agreement: 114/114 (100%) — HE inference matched plaintext on every sample
- CKKS noise on z: avg 1.14 × 10⁻⁶, max 7.94 × 10⁻⁶ — never affected a prediction
| | Total (114 samples) | Per Sample |
|---|---|---|
| Plaintext (sklearn) | 0.35 ms | 0.003 ms |
| HE Encrypted | 1,711 ms | 15 ms |
| Slowdown | n/a | ~4,893× |
Per-operation breakdown (HE, per sample):
- Encryption: 3.86 ms
- Dot product: 10.80 ms ← dominant cost (Galois key rotations)
- Decryption: 0.36 ms
The dot product dominates because it requires expensive Galois key rotation operations internally — this cost scales with the number of features (30 in our case).
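To show where rotations enter, here is a plaintext simulation of the common rotate-and-sum pattern for summing the slots of a packed vector. TenSEAL's internals are not shown in this document, so treat this as an assumption about the general technique rather than a description of its implementation; plain lists stand in for ciphertext slots, and the function names are hypothetical.

```python
def rotate_left(slots: list[float], k: int) -> list[float]:
    """Cyclic rotation of the slot vector (the operation Galois keys enable)."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def rotate_and_sum_dot(x: list[float], w: list[float]) -> tuple[float, int]:
    """Dot product via element-wise multiply, then log2(n) rotate-and-add steps."""
    n = 1
    while n < len(x):
        n *= 2                      # pad to the next power of two
    slots = [a * b for a, b in zip(x, w)] + [0.0] * (n - len(x))

    rotations = 0
    step = n // 2
    while step >= 1:
        rotated = rotate_left(slots, step)
        slots = [a + b for a, b in zip(slots, rotated)]
        rotations += 1
        step //= 2
    return slots[0], rotations      # every slot now holds the full sum


x = [float(i) for i in range(30)]   # 30 features, as in this project
w = [0.5] * 30
z, rotation_count = rotate_and_sum_dot(x, w)
```

In this pattern the number of rotations grows with log2 of the padded vector length (5 for 30 features), and each rotation on real ciphertexts involves a key-switching step, which is what makes it expensive.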
Challenge 1. Problem: The standard degree-3 sigmoid approximation breaks down for |z| > 5, which logistic regression on high-dimensional data commonly produces. Solution: We classify directly using sign(z), which is equivalent to the sigmoid threshold for binary classification.
Challenge 2. Problem: The initial parameters (poly_modulus_degree=8192, [60,40,60]) caused scale overflow when attempting the polynomial sigmoid, because they provide too few multiplication levels. Solution: We removed the polynomial sigmoid (see Challenge 1). The parameters were kept minimal since only 1 multiplication level was ultimately needed.
Challenge 3. Problem: TenSEAL (built on Microsoft SEAL) requires a C++ build toolchain that is difficult to configure on Windows. Solution: We developed on Ubuntu (WSL), where pip install tenseal works out of the box.
he_project/
├── train_model.py # Train LR model, extract weights, save artifacts
├── encrypt_infer.py # HE inference on a single sample (demo)
├── benchmark.py # HE vs plaintext across all 114 test samples
├── model_params.json # Extracted weights + bias (generated)
├── model.pkl # Sklearn model object (generated)
├── scaler.pkl # StandardScaler (generated)
├── X_test_scaled.npy # Scaled test features (generated)
└── y_test.npy # Test labels (generated)
Run order:
python3 -m venv venv # create environment
source venv/bin/activate # activate environment
pip install tenseal scikit-learn numpy # install dependencies
python3 train_model.py # generates all .pkl and .npy files
python3 encrypt_infer.py # single-sample demo
python3 benchmark.py # full 114-sample benchmark

| Member | Responsibilities |
|---|---|
| Member A | Software architecture, encryption pipeline, ML model training, benchmarking, technical analysis |
| Member B | Presentation design, written report, objectives and findings documentation |
- Microsoft SEAL: https://www.microsoft.com/en-us/research/project/microsoft-seal/
- TenSEAL: https://github.com/OpenMined/TenSEAL
- CKKS Scheme: Cheon, J.H., Kim, A., Kim, M., Song, Y. (2017). Homomorphic Encryption for Arithmetic of Approximate Numbers.
- Dataset: Breast Cancer Wisconsin (Diagnostic) — UCI ML Repository / sklearn