VTMT0905/Cyber-Security-HE-Project


Homomorphic Encryption for Private ML Inference

Project Overview

This project demonstrates Homomorphic Encryption (HE) applied to a Machine Learning inference task. A Logistic Regression model classifies breast cancer diagnoses (malignant vs. benign) entirely on encrypted patient data — the compute server never sees the raw input at any point.

  • Library: TenSEAL (Python wrapper over Microsoft SEAL)
  • HE scheme: CKKS (Cheon-Kim-Kim-Song)
  • Dataset: Wisconsin Breast Cancer Dataset (sklearn), 569 samples, 30 features, binary label
  • Language: Python 3.12


Motivation — Why Homomorphic Encryption?

In traditional cloud computing, a hospital that wants to run an ML model on patient data must give the server access to the raw, unencrypted data. The server learns everything.

Homomorphic Encryption solves this:

  • Patient data is encrypted locally before being sent
  • The server computes the ML inference directly on ciphertext
  • Only the encrypted result is returned — the server learns nothing
  • The patient/hospital decrypts the result locally

This enables privacy-preserving machine learning — a critical need in healthcare, finance, and any domain with sensitive data.


What is Homomorphic Encryption?

Homomorphic Encryption is a form of encryption that allows computation on ciphertexts. The result, when decrypted, matches what you would have gotten by computing on the plaintext directly.

Mathematically:

Decrypt( F(Encrypt(x)) ) = F(x)

where F is some function (in our case, a dot product). With CKKS the equality holds up to a small approximation error.
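
The property can be tried out with a toy scheme. The sketch below is not CKKS and is not part of this project: it swaps in a textbook Paillier cryptosystem, which is additively homomorphic over integers, with deliberately tiny and insecure primes. The names encrypt, decrypt, and encrypted_dot are illustrative. It evaluates a dot product between encrypted inputs and plaintext weights, which is the same shape of computation this project performs.

```python
import math
import secrets

# Toy Paillier key material. These primes are far too small to be secure.
P, Q = 1789, 1861
N = P * Q
N2 = N * N
LAM = math.lcm(P - 1, Q - 1)   # private key
MU = pow(LAM, -1, N)           # valid because the generator is g = N + 1


def encrypt(m: int) -> int:
    """Encrypt an integer; negative values are stored modulo N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c: int) -> int:
    """Decrypt and map the result back to a signed integer."""
    u = pow(c, LAM, N2)
    m = (u - 1) // N * MU % N
    return m - N if m > N // 2 else m


def encrypted_dot(enc_x, weights):
    """Dot product of encrypted inputs with plaintext integer weights.

    Multiplying ciphertexts adds the plaintexts; raising a ciphertext to
    the power w multiplies its plaintext by w.
    """
    acc = encrypt(0)
    for c, w in zip(enc_x, weights):
        acc = acc * pow(c, w % N, N2) % N2
    return acc


features = [3, -1, 4]      # hypothetical client data
weights = [2, 5, -3]       # hypothetical plaintext model weights
enc_z = encrypted_dot([encrypt(v) for v in features], weights)
assert decrypt(enc_z) == sum(f * w for f, w in zip(features, weights))
```

The party calling encrypted_dot only ever handles ciphertexts and the plaintext weights. Unlike CKKS, this toy scheme is exact but limited to integers, additions, and multiplication by plaintext constants.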

Why CKKS?

There are several HE schemes. We use CKKS because:

Scheme       Supports                   Used for
BFV / BGV    Exact integers             Counting, voting
CKKS         Approximate real numbers   ML, statistics

Machine learning models work with floating-point numbers (weights, features). CKKS is the standard HE scheme for this kind of arithmetic: it introduces a small, controlled approximation error (on the order of 1e-6 in our results), which is negligible for classification.


CKKS Parameters — Technical Explanation

poly_modulus_degree = 8192
coeff_mod_bit_sizes = [60, 40, 60]
scale               = 2^40

poly_modulus_degree: Defines the size of the polynomial ring that ciphertexts live in. Larger values allow deeper computation (more sequential operations) but are slower and use more memory. 8192 is standard for shallow inference tasks.

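coeff_mod_bit_sizes: Defines the coefficient modulus chain, which determines how many multiplications can be rescaled (loosely, the "noise budget"). Each rescaled multiplication consumes one level. In [60, 40, 60], the single 40-bit middle prime gives 1 usable multiplication level, which is sufficient for our dot product: one element-wise multiplication with the plaintext weights, followed only by additions.

scale = 2^40: Controls the precision of encoded real numbers. Values are multiplied by 2^40 before being rounded into the plaintext, so up to about 40 bits of fractional precision are encoded. Encryption noise lowers the effective precision, but what remains (errors around 1e-6 in our results) is more than enough for ML inference.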

Galois keys: Required for the vector rotation operations used internally by TenSEAL when computing dot products on encrypted vectors.
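
As a rough sketch, the parameters above could be wired into a TenSEAL context as follows. The helper names summarize and build_context are illustrative and are not taken from this project's scripts. The bit limits in SEAL_MAX_BITS_128 are the 128-bit-security bounds tabulated by Microsoft SEAL. TenSEAL is imported lazily so the pure-Python summary runs without it installed.

```python
# Parameters quoted in this README.
CKKS_PARAMS = {
    "poly_modulus_degree": 8192,
    "coeff_mod_bit_sizes": [60, 40, 60],
    "global_scale": 2**40,
}

# Maximum total coefficient-modulus bits for 128-bit security,
# per polynomial degree, as tabulated by Microsoft SEAL.
SEAL_MAX_BITS_128 = {4096: 109, 8192: 218, 16384: 438}


def summarize(params):
    """Derive a few quantities that follow directly from the parameters."""
    sizes = params["coeff_mod_bit_sizes"]
    degree = params["poly_modulus_degree"]
    return {
        "total_bits": sum(sizes),
        "within_128_bit_bound": sum(sizes) <= SEAL_MAX_BITS_128[degree],
        "slots": degree // 2,               # real values packed per ciphertext
        "rescale_levels": len(sizes) - 2,   # first and last primes are not spent
    }


def build_context(params=CKKS_PARAMS):
    """Create a CKKS context with Galois keys for encrypted dot products."""
    import tenseal as ts  # lazy import: only needed when a context is built

    ctx = ts.context(
        ts.SCHEME_TYPE.CKKS,
        poly_modulus_degree=params["poly_modulus_degree"],
        coeff_mod_bit_sizes=params["coeff_mod_bit_sizes"],
    )
    ctx.global_scale = params["global_scale"]
    ctx.generate_galois_keys()
    return ctx


print(summarize(CKKS_PARAMS))
```

With a context in hand, an encrypted input is typically created with ts.ckks_vector(ctx, features) and scored with its dot method against the plaintext weights.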


Noise Budget — A Key HE Concept

Every HE operation adds a small amount of noise to the ciphertext. In CKKS, the limiting resource is the coefficient modulus chain: each multiplication is followed by a rescale that consumes one level of the chain. Once the levels are exhausted, further multiplications cannot be rescaled, the scale outgrows the remaining modulus, and results become unreliable.

For our project:

  • We use a multiplicative depth of only 1 (the dot product)
  • Our chain [60, 40, 60] provides exactly 1 usable level
  • Average noise observed: 1.14 × 10⁻⁶ (negligible for classification)

Bootstrapping is an advanced technique that can refresh the noise budget, allowing arbitrarily deep computation. It was not needed here but is a major active research area for deep neural network inference under HE.
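
The level accounting described in this section can be mimicked with a few lines of bookkeeping. This is a simplified model, not a CKKS implementation: it assumes every multiplication uses an operand at the base scale and is immediately rescaled by one middle prime. The function name simulate_multiplications is illustrative.

```python
def simulate_multiplications(coeff_mod_bit_sizes, scale_bits, count):
    """Track the scale exponent across `count` sequential multiplications."""
    # The first prime holds the final result and the last one is reserved
    # for key switching, so only the primes in between can be spent.
    rescale_primes = list(coeff_mod_bit_sizes[1:-1])
    current_bits = scale_bits
    history = []
    for step in range(1, count + 1):
        if not rescale_primes:
            history.append(f"mult {step}: no level left")
            break
        current_bits += scale_bits            # scales multiply: exponents add
        current_bits -= rescale_primes.pop()  # rescale divides by one prime
        history.append(f"mult {step}: ok, scale 2^{current_bits}")
    return history


# One multiplication (the dot product) fits the chain used here.
print(simulate_multiplications([60, 40, 60], 40, 1))
# A deeper circuit, such as a polynomial activation, runs out of levels.
print(simulate_multiplications([60, 40, 60], 40, 3))
```

Under this model, a longer chain such as [60, 40, 40, 40, 60] would allow three multiplications, at the cost of more modulus bits and slower operations.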


Why We Don't Use Encrypted Sigmoid

A full logistic regression pipeline computes:

z = X · weights + bias
p = sigmoid(z) = 1 / (1 + exp(-z))
label = 1 if p >= 0.5 else 0

The problem: CKKS natively supports only addition and multiplication. sigmoid(z) requires exp() and division, neither of which can be evaluated on ciphertexts directly.

The standard solution: Approximate sigmoid with a polynomial:

sigmoid(z) ≈ 0.5 + 0.197z - 0.004z³

This only uses multiplications and additions — HE-compatible.

Why we still don't use it: This degree-3 approximation is only accurate for roughly |z| ≤ 5. Our logistic regression on 30 standardized features regularly produces |z| > 10, where the cubic term dominates and the approximation diverges badly.
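
The divergence is easy to see in plaintext. The snippet below compares the exact sigmoid with the degree-3 polynomial quoted above at a few sample values of z; the sample values are chosen for illustration only.

```python
import math


def sigmoid(z):
    """Exact logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def poly_sigmoid(z):
    """Degree-3 approximation quoted in this README."""
    return 0.5 + 0.197 * z - 0.004 * z**3


for z in (1.0, 3.0, 5.0, 8.0, 10.0, 15.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.4f}  poly={poly_sigmoid(z):+.4f}")
```

Inside |z| ≤ 5 the two stay within a few hundredths of each other. At z = 8 the polynomial has already dropped below 0.5, and at z = 10 it is negative, so thresholding its output at 0.5 would misclassify confidently positive samples.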

Our solution: For binary classification, sigmoid is unnecessary. Because sigmoid is monotonically increasing and sigmoid(0) = 0.5, the decision rule sigmoid(z) >= 0.5 is equivalent to z >= 0. The server computes the encrypted value z and returns it; the client decrypts z and thresholds at 0. The server never sees the features, and the result it sends back is still encrypted.
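
A quick plaintext check of the equivalence between the two decision rules, using hypothetical logit values:

```python
import math


def label_from_sigmoid(z):
    """Classic rule: threshold the probability at 0.5."""
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0


def label_from_sign(z):
    """Rule used after decryption: threshold the logit at 0."""
    return 1 if z >= 0 else 0


# Hypothetical decrypted logits, including the boundary case z = 0.
logits = [-12.5, -3.0, -0.25, 0.0, 0.25, 3.0, 14.2]
assert all(label_from_sigmoid(z) == label_from_sign(z) for z in logits)
```

Only the sign rule is needed on the client side, so no activation function has to be evaluated under encryption.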

Real-world solutions to this challenge (future work):

  • Minimax polynomial approximation (higher degree, requires more CKKS levels)
  • Composite polynomial approximation (piecewise, more stable over wider range)
  • Training the model to constrain z to a small range
  • Using a different activation function that approximates more easily

Results

Accuracy

Class        Plaintext   HE Encrypted
Malignant    97.6%       97.6%
Benign       98.6%       98.6%
Overall      98.25%      98.25%

  • Prediction agreement: 114/114 (100%) — HE inference matched plaintext on every sample
  • CKKS noise on z: avg 1.14 × 10⁻⁶, max 7.94 × 10⁻⁶ — never affected a prediction
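
Since the label depends only on the sign of z, noise can change a prediction only when |z| is smaller than the noise itself. The sketch below illustrates this with hypothetical logits and the maximum noise reported above; the function names are illustrative.

```python
MAX_NOISE = 7.94e-6  # largest noise on z reported above


def labels(logits):
    """Threshold each logit at 0."""
    return [1 if z >= 0 else 0 for z in logits]


def agreement(z_plain, z_he):
    """Return (matching predictions, total predictions)."""
    same = sum(a == b for a, b in zip(labels(z_plain), labels(z_he)))
    return same, len(z_plain)


z_plain = [-4.2, 0.8, 11.3, -0.002]        # hypothetical plaintext logits
z_he = [z + MAX_NOISE for z in z_plain]    # worst-case shift in one direction
print(agreement(z_plain, z_he))
```

Even the logit closest to zero in this made-up list is hundreds of times larger than the noise, so every label is preserved.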

Performance

Method                 Total (114 samples)   Per sample
Plaintext (sklearn)    0.35 ms               0.003 ms
HE Encrypted           1,711 ms              15 ms
Slowdown               ~4,893×

Per-operation breakdown (HE, per sample):

  • Encryption: 3.86 ms
  • Dot product: 10.80 ms ← dominant cost (Galois key rotations)
  • Decryption: 0.36 ms
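
The figures above can be cross-checked with simple arithmetic. The snippet uses only the rounded numbers printed in this README, so the slowdown it computes differs slightly from the ~4,893× quoted in the table, which was presumably derived from unrounded timings.

```python
# Per-sample timings from the breakdown above (milliseconds).
encryption_ms, dot_ms, decryption_ms = 3.86, 10.80, 0.36
per_sample_ms = encryption_ms + dot_ms + decryption_ms

# Totals from the performance table.
total_he_ms, total_plain_ms, samples = 1711, 0.35, 114

print(f"sum of per-operation times:   {per_sample_ms:.2f} ms")
print(f"total HE time / samples:      {total_he_ms / samples:.2f} ms")
print(f"dot product share of HE time: {dot_ms / per_sample_ms:.0%}")
print(f"slowdown from rounded totals: {total_he_ms / total_plain_ms:,.0f}x")
```

The per-operation sum and the total divided by the sample count both land at about 15 ms, and the dot product accounts for roughly 72% of it.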

The dot product dominates because it requires expensive Galois key rotation operations internally — this cost scales with the number of features (30 in our case).
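
To show where rotations enter, here is a plaintext simulation of the common rotate-and-sum pattern for packed vectors: multiply slot by slot, then repeatedly add a rotated copy so that every slot ends up holding the total. Whether TenSEAL uses exactly this rotation schedule internally is an assumption; the function names are illustrative.

```python
def rotate_left(slots, k):
    """Cyclic rotation, the operation that needs Galois keys under encryption."""
    return slots[k:] + slots[:k]


def rotate_and_sum_dot(x, w):
    """Return (dot product, number of rotations used)."""
    # Pad to the next power of two so halving steps cover every slot.
    n = 1
    while n < len(x):
        n *= 2
    prod = [a * b for a, b in zip(x, w)] + [0] * (n - len(x))

    rotations = 0
    step = n // 2
    while step >= 1:
        rotated = rotate_left(prod, step)
        prod = [a + b for a, b in zip(prod, rotated)]
        rotations += 1
        step //= 2
    return prod[0], rotations


x = list(range(1, 31))   # 30 hypothetical feature values
w = [2] * 30             # hypothetical weights
print(rotate_and_sum_dot(x, w))
```

In this model, 30 features padded to 32 slots need 5 rotations, each of which corresponds to a key-switching operation on real ciphertexts.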


Technical Challenges & Solutions

Challenge 1: Polynomial sigmoid divergence

Problem: The standard degree-3 sigmoid approximation breaks down for |z| > 5, which logistic regression on high-dimensional data commonly produces.

Solution: Classified directly using sign(z), which is mathematically equivalent to the sigmoid threshold for binary classification.

Challenge 2: CKKS scale overflow

Problem: The initial parameters (poly_modulus_degree=8192, coeff_mod_bit_sizes=[60, 40, 60]) caused scale overflow when we attempted the polynomial sigmoid, because they do not provide enough multiplication levels.

Solution: Removed the polynomial sigmoid (see Challenge 1). The parameters were kept minimal since only 1 multiplication level was ultimately needed.

Challenge 3: TenSEAL on Windows

Problem: TenSEAL (built on Microsoft SEAL) requires a C++ build toolchain that is difficult to configure on Windows.

Solution: Developed on Ubuntu (WSL), where pip install tenseal works out of the box.


File Structure

he_project/
├── train_model.py       # Train LR model, extract weights, save artifacts
├── encrypt_infer.py     # HE inference on a single sample (demo)
├── benchmark.py         # HE vs plaintext across all 114 test samples
├── model_params.json    # Extracted weights + bias (generated)
├── model.pkl            # Sklearn model object (generated)
├── scaler.pkl           # StandardScaler (generated)
├── X_test_scaled.npy    # Scaled test features (generated)
└── y_test.npy           # Test labels (generated)

Run order:

python3 -m venv venv                     # create environment
source venv/bin/activate                 # activate environment
pip install tenseal scikit-learn numpy   # install dependencies
python3 train_model.py                   # generates model_params.json and the .pkl and .npy files
python3 encrypt_infer.py                 # single-sample demo
python3 benchmark.py                     # full 114-sample benchmark

Workload Distribution

Member      Responsibilities
Member A    Software architecture, encryption pipeline, ML model training, benchmarking, technical analysis
Member B    Presentation design, written report, objectives and findings documentation
