This repository contains the code for our paper Proxy Compression for Language Modeling.
This codebase supports pretraining language models with various input representations. It is built as an extension of Lingua and implements Proxy Compression Training: training language models on proxy-compressed data. Supported proxies are gzip, neural compression, and tokenizer-based compression, where tokens can be represented as-is or encoded as bytes (e.g., a 65K-vocab BPE token encoded as 2 bytes). For neural compression, see our neural compression library for offline data preprocessing.
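To make the "token encoded as 2 bytes" idea concrete, here is a minimal sketch. The function names and the big-endian byte order are assumptions for illustration, not the repository's actual encoding.

```python
def tokens_to_bytes(token_ids, bytes_per_token=2):
    """Encode each token id as a fixed number of bytes (big-endian is an assumption)."""
    out = bytearray()
    for tid in token_ids:
        # int.to_bytes raises OverflowError if the id does not fit (>= 65536 for 2 bytes).
        out += tid.to_bytes(bytes_per_token, "big")
    return bytes(out)


def bytes_to_tokens(data, bytes_per_token=2):
    """Invert tokens_to_bytes."""
    if len(data) % bytes_per_token != 0:
        raise ValueError("byte length must be a multiple of bytes_per_token")
    return [
        int.from_bytes(data[i:i + bytes_per_token], "big")
        for i in range(0, len(data), bytes_per_token)
    ]


ids = [0, 255, 256, 65535]
encoded = tokens_to_bytes(ids)
decoded = bytes_to_tokens(encoded)
```

With a 65K vocabulary every id fits in 16 bits, so a sequence of `n` tokens becomes exactly `2n` byte-valued symbols.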
We currently support training OpenCoder-style Transformer and EvaByte architectures. EvaByte supports bytes, tokens, sub-byte representations, and compressed formats; Transformers currently support token-level training only.
We also include a minimal evaluation library for downstream code generation tasks (HumanEval+, MBPP+) and BPB (Bits Per Byte) metrics.
# Clone the repository
git clone https://github.com/LZhengisme/proxy-compression.git
cd proxy-compression
# Create virtual environment
bash setup/create_env.sh
source envs/lingua_<date>/bin/activate
# Optional: Install Flash Attention for efficient training of OpenCoder-like models
pip install psutil
pip install flash-attn --no-build-isolation
Set up the following environment variables for logging and model access:
# Weights & Biases logging (or use `wandb login`)
export WANDB_PROJECT=your_project_name
export WANDB_ENTITY=your_entity
export WANDB_API_KEY=your_api_key
# HuggingFace access
export HF_TOKEN=your_huggingface_token
export HF_HOME=/path/to/hf_cache
Install additional dependencies for code generation evaluation:
cd evals
# Alternatively, install eval dependencies in a separate environment
pip install -r requirements.txt
pip install evalplus==0.3.1 --no-deps # Code generation evaluation
We follow the same data format as Lingua and assume training data is stored as JSONL files with a text field, named <your_data_dir>/<dataset_name>/<some_prefix>.<i>.jsonl. For instance, the running example below downloads and shuffles an example dataset (subsampled stack-edu) into data/stackedu/stackedu.chunk.<i>.jsonl with 4 chunks.
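The layout described above can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, and the two-digit index starting at 01 is an assumption taken from the directory listing in this README, not from the setup scripts.

```python
import json
import os
import tempfile


def write_jsonl_chunks(texts, data_dir, dataset_name, prefix, num_chunks):
    """Write documents round-robin into <data_dir>/<dataset_name>/<prefix>.<i>.jsonl."""
    out_dir = os.path.join(data_dir, dataset_name)
    os.makedirs(out_dir, exist_ok=True)
    paths = [
        os.path.join(out_dir, f"{prefix}.{i:02d}.jsonl")
        for i in range(1, num_chunks + 1)
    ]
    handles = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        for n, text in enumerate(texts):
            # Each line is one JSON object with a "text" field.
            handles[n % num_chunks].write(json.dumps({"text": text}) + "\n")
    finally:
        for h in handles:
            h.close()
    return paths


def read_texts(path):
    """Yield the "text" field of every non-empty line in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)["text"]


with tempfile.TemporaryDirectory() as tmp:
    paths = write_jsonl_chunks(["a", "b", "c"], tmp, "stackedu", "stackedu.chunk", 2)
    names = [os.path.basename(p) for p in paths]
    first_chunk = list(read_texts(paths[0]))
    second_chunk = list(read_texts(paths[1]))
```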
Example: Download and prepare Stack-Edu data
# Download sample data from HuggingFaceTB/stack-edu (adjust --num_samples as needed)
python setup/download_stackedu.py --output data/stackedu_raw/data.jsonl --num_samples 1000000
# Shuffle and split into chunks
bash setup/in_mem_shuffle_split.sh data/stackedu_raw/data.jsonl '*.jsonl' 4 data/stackedu stackedu
Your data directory structure should look like:
data/
├── stackedu/
│ ├── stackedu.chunk.01.jsonl
│ ├── stackedu.chunk.02.jsonl
│ ├── stackedu.chunk.03.jsonl
│ └── stackedu.chunk.04.jsonl
We provide example training scripts in scripts/. You can edit these scripts to customize hyperparameters for your setup.
| Script | Description |
|---|---|
| `scripts/train_opencoder.sh` | Standard OpenCoder-style Transformer on tokens |
| `scripts/train_evabyte_bytes.sh` | EvaByte on raw bytes with multi-byte prediction |
| `scripts/train_evabyte_tokenized.sh` | EvaByte on BPE tokens with multi-token prediction |
| `scripts/train_evabyte_gzip_proxy.sh` | EvaByte with gzip proxy compression |
| `scripts/train_evabyte_neural_proxy.sh` | EvaByte with neural proxy compression |
| `scripts/train_evabyte_token_proxy.sh` | EvaByte with token proxy compression |
| `scripts/train_evabyte_token_byte_proxy.sh` | EvaByte with token-byte proxy compression, where BPE tokens are encoded as bytes |
| `scripts/train_evabyte_bits_and_bytes.sh` | EvaByte with various bit/byte-level representations |
By default, we use OpenCoder's tokenizer for most experiments. For token-byte proxy compression, we also use custom tokenizers such as artifacts/superbpe_vocab65k.json.
# Example: Train EvaByte with token-based proxy compression
bash scripts/train_evabyte_token_proxy.sh
Distributed Training:
- `distributed.dp_shard`: FSDP sharding degree. Set to the GPU count or GPUs per node depending on the hardware setup (default: 8).
Multi-byte/Token Prediction:
- `model.num_pred_heads`: Number of prediction heads for future positions.
- `model.apply_raw_multibyte_lm_head`: When `true`, adds separate raw byte prediction heads when using proxy compression training. Set `model.num_raw_pred_heads` for the number of raw byte prediction heads. Conceptually, when training with proxy compression, each position is predicted by `model.num_pred_heads` prediction heads for compressed tokens and by `model.num_pred_heads + model.num_raw_pred_heads` prediction heads for raw bytes.
- `data.n_views`: The number of different views during data processing. For instance, with 2 multi-token prediction heads, `n_views = 3`: the input plus the targets for each head. Equal to `num_pred_heads + 1` (or `num_pred_heads + num_raw_pred_heads + 1` with raw heads).
- `apply_fused_linear_chunked_ce_loss`: Memory-efficient chunked cross-entropy for large vocabularies. Use with `apply_multibyte_loss_mask=true` (disable when using raw prediction heads, as the masking is already handled inside the loss function).
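The relation between prediction heads and `n_views` can be sketched as follows. The `num_views` formula is taken from the description above; `build_views` is a hypothetical helper showing one plausible way to form shifted views, and the repository's data pipeline may construct them differently.

```python
def num_views(num_pred_heads, num_raw_pred_heads=0):
    """One input view plus one target view per prediction head."""
    return num_pred_heads + num_raw_pred_heads + 1


def build_views(tokens, n_views):
    """Split a sequence into an input view and shifted target views.

    View 0 is the input; view k (k >= 1) holds the element k positions ahead.
    This shifting layout is an assumption for illustration.
    """
    usable = len(tokens) - (n_views - 1)
    if usable <= 0:
        raise ValueError("sequence too short for the requested number of views")
    return [tokens[k:k + usable] for k in range(n_views)]


tokens = [10, 11, 12, 13, 14, 15]
views = build_views(tokens, num_views(num_pred_heads=2))
# views[0] is the input, views[1] the next-token targets,
# views[2] the targets two positions ahead.
```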
Prediction Head Tensor Shapes
Consider mixed input representations, where raw bytes take 256 possible values and compressed tokens take 65536 possible values. The vocabulary then consists of 64 special tokens, 256 raw bytes, and 65536 compressed tokens: vocab_size = 65536 + 256 + 64 = 65856
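The arithmetic above can be checked with a small sketch. The three group sizes come from the text; the id ordering (special tokens first, then raw bytes, then compressed symbols) and the helper names are assumptions made only for this example.

```python
NUM_SPECIAL = 64
NUM_RAW_BYTES = 256
NUM_COMPRESSED = 65536

vocab_size = NUM_COMPRESSED + NUM_RAW_BYTES + NUM_SPECIAL

# Hypothetical id layout: [special | raw bytes | compressed symbols].
RAW_OFFSET = NUM_SPECIAL
COMPRESSED_OFFSET = NUM_SPECIAL + NUM_RAW_BYTES


def raw_byte_id(byte_value):
    """Map a raw byte (0..255) to its id in the unified vocabulary."""
    if not 0 <= byte_value < NUM_RAW_BYTES:
        raise ValueError("byte value out of range")
    return RAW_OFFSET + byte_value


def compressed_id(symbol):
    """Map a compressed symbol (0..65535) to its id in the unified vocabulary."""
    if not 0 <= symbol < NUM_COMPRESSED:
        raise ValueError("compressed symbol out of range")
    return COMPRESSED_OFFSET + symbol


def describe(token_id):
    """Return (group, index within group) for a unified-vocabulary id."""
    if not 0 <= token_id < vocab_size:
        raise ValueError("token id out of range")
    if token_id < RAW_OFFSET:
        return ("special", token_id)
    if token_id < COMPRESSED_OFFSET:
        return ("raw", token_id - RAW_OFFSET)
    return ("compressed", token_id - COMPRESSED_OFFSET)
```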
For a single prediction head, the tensor shapes would be
# B = batch size, N = sequence length, V = vocabulary size
# inputs: [B, N]
# logits: [B, N, V]
# labels: [B, N]
For vanilla multi-token/byte prediction, tensor shapes would be
# B = batch size, N = sequence length, V = vocabulary size, H = prediction heads
# inputs: [B, N]
# logits: [B, N, V * H]
# labels: [B, N, H]
For multi-byte prediction with additional raw prediction heads, tensor shapes would be
# B = batch size, N = sequence length, V = vocabulary size, H_1 = main prediction heads, H_2 = additional raw prediction heads
# we assume there are N_r raw bytes out of N elements in total
# inputs: [B, N]
# main_logits: [B, N, V * H_1]
# additional_raw_logits: [B, N_r, 320 * H_2]  (320 = 256 raw bytes + 64 special tokens)
# labels: [B, N, H_1 + H_2]
Attention Masking:
- `apply_doc_boundary_mask`: Prevents cross-document attention within packed contexts.
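As an illustration of what a document boundary mask does, the sketch below builds a causal mask for a packed sequence in which a position may only attend to earlier positions of the same document. It uses plain Python lists and hypothetical helper names; the repository's kernels implement masking differently.

```python
def doc_ids_from_lengths(lengths):
    """Assign a document index to every position of a packed sequence."""
    doc_ids = []
    for doc_index, length in enumerate(lengths):
        doc_ids.extend([doc_index] * length)
    return doc_ids


def doc_boundary_causal_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j."""
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two documents of lengths 2 and 3 packed into one context of length 5.
doc_ids = doc_ids_from_lengths([2, 3])
mask = doc_boundary_causal_mask(doc_ids)
# Position 2 starts the second document, so it can only attend to itself.
```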
Proxy Compression:
- `data.compression_sampling_rate`: Fraction of compressed data (e.g., `0.9` = 90% compressed, 10% raw).
- `data.tokenizer.separate_embedding`: Separate embeddings for raw vs compressed tokens.
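A minimal sketch of how a sampling rate of this kind can be applied is shown below. Drawing one independent choice per sample is an assumption, as is the function name; the actual data loader may sample at a different granularity.

```python
import random


def choose_representation(rng, compression_sampling_rate):
    """Pick "compressed" with probability compression_sampling_rate, else "raw"."""
    return "compressed" if rng.random() < compression_sampling_rate else "raw"


rng = random.Random(0)  # fixed seed so the example is reproducible
choices = [choose_representation(rng, 0.9) for _ in range(10000)]
compressed_fraction = choices.count("compressed") / len(choices)
# compressed_fraction is close to 0.9; the remainder is served as raw bytes.
```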
Neural Compression:
- Requires offline preprocessing with our neural compression tool to build the compressed JSONL files (e.g., `data.sources="{'stackedu_neural_compressed_ow16':1.0}"`).
- Config: `data.compression_alg_config=ac_m1_key-pack_<N>-<key>`, where `N` = bits per compressed symbol (default: 16, meaning there will be 2^16 = 65536 possible compressed symbols) and `key` = JSONL field with compressed data.
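To show how the `ac_m1_key-pack_<N>-<key>` string decomposes, here is a small parsing sketch. The parser and the example key `compressed_ow16` are hypothetical; they only illustrate the naming pattern described above.

```python
import re

# ac_m1_key-pack_<N>-<key>: N is an integer, key is the rest of the string.
_PATTERN = re.compile(r"^ac_m1_key-pack_(\d+)-(.+)$")


def parse_neural_compression_config(config):
    """Split the config string into bits per symbol, symbol count, and JSONL key."""
    match = _PATTERN.match(config)
    if match is None:
        raise ValueError(f"not a neural compression config: {config!r}")
    bits = int(match.group(1))
    return {
        "bits_per_symbol": bits,
        "num_symbols": 2 ** bits,
        "key": match.group(2),
    }


# The key name below is made up for the example.
parsed = parse_neural_compression_config("ac_m1_key-pack_16-compressed_ow16")
```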
To run BPB evaluation on your checkpoints:
# Set evaluation parameters
CKPT_DIR=checkpoints/your_experiment/step_50000 # Path to checkpoint directory
VAL_DATA_PATH=data/validation # Path to validation data (JSONL format)
OUTPUT_DIR=eval_results/bpb_eval # Directory to save evaluation results
NUM_GPUS=1 # Number of GPUs to use
# Optional: Pre-consolidate distributed checkpoints
python3 -m apps.utils.pre_consolidate ckpt_dir=$CKPT_DIR
# Run BPB evaluation
# 1. For EvaByte models
torchrun --nproc_per_node=$NUM_GPUS \
-m apps.evabyte.bpb_eval config=apps/evabyte/configs/bpb_eval.yaml \
name=bpb_eval dump_dir=$OUTPUT_DIR metric_log_dir=$OUTPUT_DIR \
ckpt_dir=$CKPT_DIR val_data_path=$VAL_DATA_PATH \
use_train_doc_attn_mask_config=false use_train_data_config=false \
apply_doc_boundary_mask=true \
evaluate_on_raw_only=true \
context_seqlen=4096 context_stride=4096
# 2. For OpenCoder (Transformer) models
torchrun --nproc_per_node=$NUM_GPUS \
-m apps.main.bpb_eval config=apps/main/configs/bpb_eval.yaml \
name=bpb_eval dump_dir=$OUTPUT_DIR metric_log_dir=$OUTPUT_DIR \
ckpt_dir=$CKPT_DIR val_data_path=$VAL_DATA_PATH \
use_train_doc_attn_mask_config=false use_train_data_config=true \
apply_doc_boundary_mask=true \
context_seqlen=4096 context_stride=4096
Key parameters:
- `use_train_data_config`: When `true`, processes validation data the same way as training data. Set `true` for byte/token models; `false` for proxy-trained models to evaluate on raw bytes.
- `evaluate_on_raw_only`: When `true`, computes BPB only on raw bytes. Use for proxy compression models.
- `context_seqlen`: Context window size for evaluation.
- `context_stride`: Stride between evaluation windows.
Results will be saved to $OUTPUT_DIR/val_results.json.
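For reference, BPB normalizes the total negative log-likelihood by the number of raw bytes rather than by the number of model tokens, which makes models with different input representations comparable. The sketch below shows that formula together with one plausible reading of `context_seqlen` and `context_stride`; the windowing helper is an assumption and may not match the evaluation script exactly.

```python
import math


def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per raw byte."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return total_nll_nats / (math.log(2) * num_bytes)


def eval_windows(num_positions, context_seqlen, context_stride):
    """Return (start, end) windows covering the sequence (assumed semantics)."""
    if context_seqlen <= 0 or context_stride <= 0:
        raise ValueError("context_seqlen and context_stride must be positive")
    windows = []
    start = 0
    while start < num_positions:
        end = min(start + context_seqlen, num_positions)
        windows.append((start, end))
        if end == num_positions:
            break
        start += context_stride
    return windows


# A model that is uniform over 256 byte values pays ln(256) nats per byte,
# which corresponds to approximately 8 bits per byte.
uniform_bpb = bits_per_byte(1000 * math.log(256), 1000)
windows = eval_windows(10000, 4096, 4096)
```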
Step 1: Convert checkpoints to HuggingFace format
# For EvaByte models
python -m apps.evabyte.scripts.hf_lingua_conversion mode=dcp_to_hf \
dcp_to_hf_args.dcp_checkpoint_dir=checkpoints/your_experiment/step_50000 \
dcp_to_hf_args.hf_output_dir=ckpts/your_experiment_hf \
dcp_to_hf_args.ref_hf_output_dir=evals/ref_hf_evabyte_impl
# For Transformer (OpenCoder) models
python -m apps.main.scripts.opencoder_hf_lingua_conversion mode=dcp_to_hf \
dcp_to_hf_args.dcp_checkpoint_dir=checkpoints/your_experiment/step_50000 \
dcp_to_hf_args.hf_output_dir=ckpts/your_experiment_hf \
dcp_to_hf_args.ref_hf_output_dir=evals/ref_hf_opencoder_1B5_impl
Step 2: Run evaluation
cd evals/gen_evals
bash run_gen_eval.sh <MODEL_PATH> <DATASET> <DUMP_DIR> <TOKENIZER_MODE> [SPM_PATH] [PROMPT_HEALING] [DECODING_MODE]
| Argument | Description |
|---|---|
| `MODEL_PATH` | HuggingFace checkpoint path |
| `DATASET` | `humaneval_plus`, `mbpp_plus`, or append `:sample` for sampling |
| `DUMP_DIR` | Output directory for results |
| `TOKENIZER_MODE` | Tokenizer mode for evaluation (see below) |
| `SPM_PATH` | Path to tokenizer on HF Hub (required for `hf_tokenizer` mode) or locally trained SPM models |
| `PROMPT_HEALING` | Prompt boundary handling strategy (see below) |
| `DECODING_MODE` | `vanilla` (default) or `multibyte` (not fully tested for proxy compression training) |
Argument Values:
TOKENIZER_MODE - Controls how input is tokenized during evaluation
| Value | Description |
|---|---|
| `default` | Use the HuggingFace tokenizer bundled with the model checkpoint. Suitable for standard token-level models (e.g., OpenCoder). |
| `raw_sentinel` | Evaluate on raw UTF-8 bytes with a `<raw>` sentinel token prepended to the input. Use for byte-level models (e.g., EvaByte trained on bytes or proxy-compressed data). |
| `hf_tokenizer` | Use a different HuggingFace tokenizer specified via `SPM_PATH`. Useful when the model was trained with a tokenizer different from its HF model specification (e.g., EvaByte trained on BPE tokens). |
| `doublebyte` | Two bytes are grouped into 16-bit tokens. Use for models trained with double-byte representations. |
| `hf_spm` | Use a locally trained HuggingFace SPM tokenizer specified via `SPM_PATH`. |
PROMPT_HEALING - Handles tokenization boundary artifacts at the end of prompts
| Value | Description |
|---|---|
| `strip` | (Default) Strip whitespace at prompt boundary. |
| `heal` | Apply token healing to merge partial tokens across prompt/completion boundary. Can improve generation quality but may affect reproducibility. |
| `pad_to_even` | Pad prompt to even byte length. Required for `doublebyte` tokenizer mode to ensure proper 16-bit alignment. |
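The interaction between `doublebyte` and `pad_to_even` can be illustrated with a short sketch. The pad byte (a space), padding at the end of the prompt, and the big-endian packing order are all assumptions chosen for the example; the evaluation code may use different conventions.

```python
def pad_to_even(data, pad_byte=b" "):
    """Append one pad byte when the length is odd (pad value is an assumption)."""
    return data if len(data) % 2 == 0 else data + pad_byte


def pack_doublebyte(data):
    """Group consecutive byte pairs into 16-bit tokens (big-endian assumed)."""
    if len(data) % 2 != 0:
        raise ValueError("byte length must be even; pad the prompt first")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def unpack_doublebyte(tokens):
    """Invert pack_doublebyte."""
    return b"".join(t.to_bytes(2, "big") for t in tokens)


prompt = "def f".encode("utf-8")   # 5 bytes, so it cannot be split into pairs
padded = pad_to_even(prompt)       # 6 bytes after padding
tokens = pack_doublebyte(padded)   # 3 tokens, each in range(65536)
```

Without padding, the last byte of an odd-length prompt would have no partner, and every later byte pair would be shifted by one position relative to how the model saw the data in training.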
Examples:
cd evals/gen_evals
# OpenCoder trained on tokens
bash run_gen_eval.sh ckpts/opencoder_1b5_hf humaneval_plus logs default
# EvaByte trained on bytes
bash run_gen_eval.sh ckpts/your_evabyte_hf humaneval_plus logs raw_sentinel
# EvaByte trained on tokens
bash run_gen_eval.sh ckpts/evabyte_tokens_multibytepred2_hf humaneval_plus logs hf_tokenizer infly/OpenCoder-1.5B-Base
# EvaByte trained on tokens with sampling (20 samples, temperature=0.2)
bash run_gen_eval.sh ckpts/evabyte_tokens_multibytepred2_hf humaneval_plus:sample logs hf_tokenizer infly/OpenCoder-1.5B-Base
# EvaByte trained on proxy compressed data, evaluated on bytes
bash run_gen_eval.sh ckpts/evabyte_neural_proxy humaneval_plus logs raw_sentinel
Results are saved to <DUMP_DIR>/<DATASET>/<MODEL_NAME>/ with generated samples and evaluation scores.
Pre-defined model configs are available in:
- `apps/main/configs/`: OpenCoder-style Transformer (1.5B, 8B)
- `apps/evabyte/configs/`: EvaByte architecture (500M, 1.5B, 4B, 7B, 14B)
| Use Case | Type | Description |
|---|---|---|
| OpenCoder training on tokens | `data.tokenizer.name=vanilla_hf` | Standard HuggingFace tokenizer. Specify tokenizer via `data.tokenizer.path`. |
| EvaByte on bytes or most proxy compression | `data.tokenizer.name=hf` | A customized HuggingFace tokenizer with sentinel tokens (`<raw>`, `<compressed>`) added to vocabulary. |
| Token-based proxy compression | `data.tokenizer.name=token_plus_byte` | Hybrid tokenizer supporting both BPE tokens and raw bytes in a unified vocabulary. Specify BPE tokenizer via `data.tokenizer.spm_byte_path`. |
| Double-byte experiments | `data.tokenizer.name=doublebyte` | Packs consecutive byte pairs into 16-bit tokens (vocab_size=65536 + sentinels). |
Related tokenizer options:
- `data.tokenizer.spm_byte_path`: Path to BPE tokenizer for byte-level encoding of tokens, or for token-byte proxy compression.
- `data.tokenizer.separate_embedding`: When `true`, uses separate embedding vectors for raw bytes vs compressed tokens. This is required when tokens take a large vocabulary (e.g., 65k). Tokens encoded as bytes can instead share the same embedding space with raw bytes (we did not observe significant performance differences in our experiments).
- `data.tokenizer.byte_converter_config.byte_converter_type`: Byte encoding scheme for token-byte proxy compression (e.g., `gray` for Gray coding to promote locality).
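For readers unfamiliar with Gray coding, the sketch below shows the standard binary-reflected Gray code, in which consecutive integers differ in exactly one bit. How the repository's `gray` byte converter applies it to token ids is not shown here; the functions are a generic illustration.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g):
    """Invert to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# Neighbouring ids map to codes that differ in a single bit.
bit_differences = [
    bin(to_gray(i) ^ to_gray(i + 1)).count("1") for i in range(65535)
]
```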
| Type | Description | vocab_size |
|---|---|---|
| `vanilla` | Use standard tokenization. | tokenizer vocab |
| `gzip` / `gzip_no_mtime` | Gzip compression on raw bytes. `gzip_no_mtime` omits timestamps for deterministic compression. | 256 + sentinels |
| `spm_byte` | BPE tokens as the compression. | either 256 or vocab size + sentinels |
| `doublebyte` | Packs pairs of UTF-8 bytes into 16-bit tokens. | 65536 + sentinels |
| `halfbyte` | Splits each byte into two 4-bit tokens. | 16 + sentinels |
| `doublebit` | Splits each byte into four 2-bit tokens. | 4 + sentinels |
| `bit` | Splits each byte into eight 1-bit tokens. | 2 + sentinels |
| `ac_m1_key-pack_N-KEY` | Neural compression with arithmetic coding. N = bits per symbol (typically 16), KEY = JSONL field storing the compressed data. Requires offline preprocessing with our neural compression library. | |
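Two of the rows above are easy to illustrate with the standard library. The sketch below splits bytes into 4-bit, 2-bit, or 1-bit tokens and produces timestamp-free gzip output. The most-significant-bits-first order and the function names are assumptions; the repository's converters may order bits differently.

```python
import gzip


def split_bytes(data, bits_per_token):
    """Split each byte into 8 // bits_per_token tokens, most significant bits first."""
    if bits_per_token not in (1, 2, 4):
        raise ValueError("bits_per_token must be 1, 2, or 4")
    per_byte = 8 // bits_per_token
    mask = (1 << bits_per_token) - 1
    tokens = []
    for byte in data:
        for k in range(per_byte - 1, -1, -1):
            tokens.append((byte >> (k * bits_per_token)) & mask)
    return tokens


def merge_tokens(tokens, bits_per_token):
    """Invert split_bytes."""
    per_byte = 8 // bits_per_token
    if len(tokens) % per_byte != 0:
        raise ValueError("token count must be a multiple of tokens per byte")
    out = bytearray()
    for i in range(0, len(tokens), per_byte):
        byte = 0
        for token in tokens[i:i + per_byte]:
            byte = (byte << bits_per_token) | token
        out.append(byte)
    return bytes(out)


def gzip_no_mtime(data):
    """Gzip with the header timestamp fixed to 0, so repeated runs give the same bytes."""
    return gzip.compress(data, mtime=0)


halfbyte_tokens = split_bytes(b"\xa7", 4)   # 0xA7 = 1010 0111
doublebit_tokens = split_bytes(b"\xa7", 2)
bit_tokens = split_bytes(b"\xa7", 1)
```

Smaller symbols shrink the vocabulary (16, 4, or 2 values) at the cost of sequences that are 2, 4, or 8 times longer than the raw byte sequence.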
The mixing mode controls how raw and compressed representations are mixed during training.
| Mode | Description |
|---|---|
| `vanilla` | Single representation only, no mixing. Used in standard byte-level or token-level training. |
| `sentinel` | Wraps data with sentinel tokens: `<raw>...</raw>` for raw bytes, `<compressed>...</compressed>` for compressed data. Model learns to distinguish representations. Recommended for proxy compression training. |
| `translation_raw_compressed` | Raw data paired with compressed translation appended: `<raw>...</raw><compressed>...</compressed>` |
| `translation_compressed_raw` | Raw data paired with compressed translation prepended: `<compressed>...</compressed><raw>...</raw>` |
| `translation_random` | Randomly alternates between the two translation orders above. Recommended for initial warmup in proxy compression. |
| `parallel_raw_compressed` | Similar to translation modes, but treats raw and compressed as separate samples, and one cannot attend to the other. |
| `parallel_compressed_raw` | Parallel mode with compressed prepended to raw. |
| `parallel_random` | Randomly alternates between the two parallel orders. |
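The modes in the table can be summarized with a small sketch that uses strings as stand-ins for token sequences. The `mix` function, its `use_compressed` flag, and the choice to wrap parallel segments in sentinels are assumptions for illustration; only the sentinel strings and orderings come from the table.

```python
import random


def wrap_raw(raw):
    return f"<raw>{raw}</raw>"


def wrap_compressed(compressed):
    return f"<compressed>{compressed}</compressed>"


def mix(mode, raw, compressed, use_compressed=False, rng=None):
    """Return the list of independent segments produced for one document."""
    rng = rng or random.Random()
    if mode == "vanilla":
        return [compressed if use_compressed else raw]
    if mode == "sentinel":
        return [wrap_compressed(compressed) if use_compressed else wrap_raw(raw)]
    kind, _, order = mode.partition("_")
    if kind not in ("translation", "parallel"):
        raise ValueError(f"unknown mixing mode: {mode}")
    if order == "random":
        order = rng.choice(["raw_compressed", "compressed_raw"])
    if order == "raw_compressed":
        parts = [wrap_raw(raw), wrap_compressed(compressed)]
    elif order == "compressed_raw":
        parts = [wrap_compressed(compressed), wrap_raw(raw)]
    else:
        raise ValueError(f"unknown mixing mode: {mode}")
    if kind == "translation":
        # One segment: the second half can attend to the first.
        return ["".join(parts)]
    # Parallel: two segments that must not attend to each other.
    return parts


translation = mix("translation_raw_compressed", "abc", "XY")
parallel = mix("parallel_compressed_raw", "abc", "XY")
```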
For proxy compression training, you can schedule both the compression rate (fraction of compressed vs raw data) and mixing mode across training phases:
Rate scheduling: The compression rate ramps up during warmup, stays constant during steady phase, then optionally decays.
- `compression_initial_rate` linearly increased to `compression_peak_rate` during `compression_warmup_steps`
- constant `compression_peak_rate` for `compression_steady_steps`
- `compression_peak_rate` decayed to `compression_final_rate` during `compression_decay_steps` (if any)
Mode scheduling: The mixing mode can also change across phases via compression_initial_mode, compression_steady_mode, and compression_final_mode. A common pattern: start with translation_random during warmup (helps model learn raw↔compressed mapping), then transition to sentinel for the main training.
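The three-phase schedule can be written as a pair of small functions. This is a sketch under stated assumptions: the decay is taken to be linear (the text only says "decayed"), and the initial, steady, and final modes are assumed to apply during the warmup, steady, and decay phases respectively. Argument names mirror the config keys without the `compression_` prefix.

```python
def compression_rate(step, warmup_steps, steady_steps, decay_steps,
                     initial_rate, peak_rate, final_rate):
    """Compression rate at a given step: linear warmup, constant, then linear decay."""
    if step < warmup_steps:
        return initial_rate + (peak_rate - initial_rate) * step / warmup_steps
    if step < warmup_steps + steady_steps:
        return peak_rate
    decay_step = step - warmup_steps - steady_steps
    if decay_step < decay_steps:
        # Linear interpolation is an assumption about the decay shape.
        return peak_rate + (final_rate - peak_rate) * decay_step / decay_steps
    return final_rate


def compression_mode(step, warmup_steps, steady_steps,
                     initial_mode, steady_mode, final_mode):
    """Mixing mode at a given step (phase-to-mode mapping is an assumption)."""
    if step < warmup_steps:
        return initial_mode
    if step < warmup_steps + steady_steps:
        return steady_mode
    return final_mode


# Illustrative values: warm up from 0.4 to 0.9 over 10k steps, then hold.
example = dict(warmup_steps=10000, steady_steps=40000, decay_steps=0,
               initial_rate=0.4, peak_rate=0.9, final_rate=0.9)
halfway_rate = compression_rate(5000, **example)
halfway_mode = compression_mode(5000, 10000, 40000,
                                "translation_random", "sentinel", "sentinel")
```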
enable_compression_rate_schedule=true \
compression_warmup_steps=10000 \
compression_steady_steps=40000 \
compression_decay_steps=0 \
compression_initial_rate=0.4 \
compression_peak_rate=0.9 \
compression_final_rate=0.9 \
compression_initial_mode=translation_random \
compression_steady_mode=sentinel \
compression_final_mode=sentinel
proxy-compression/
├── apps/
│ ├── main/ # Standard Transformer (OpenCoder) implementation
│ │ ├── train.py
│ │ ├── bpb_eval.py
│ │ ├── configs/ # Model configs (1.5B, 8B)
│ │ └── scripts/ # HF conversion scripts
│ ├── evabyte/ # EvaByte architecture
│ │ ├── train.py
│ │ ├── bpb_eval.py
│ │ ├── configs/ # Model configs (500M to 14B)
│ │ ├── component/ # Triton kernels for efficient attention
│ │ └── scripts/ # HF conversion scripts
│ └── utils/ # Shared utilities (checkpoint consolidation)
├── lingua/ # Core library (data loading, tokenizers, training utils)
├── evals/
│ ├── gen_evals/ # Code generation evaluation (HumanEval+, MBPP+)
│ ├── ref_hf_evabyte_impl/ # Reference HF implementation for EvaByte
│ └── ref_hf_opencoder_1B5_impl/ # Reference HF implementation for OpenCoder
├── scripts/ # Example training scripts
├── setup/ # Environment and data preparation scripts
└── data/ # Training data directory (user-provided)
If you find this codebase useful, please cite our paper:
@article{zheng2026proxy,
title={Proxy Compression for Language Modeling},
author={Zheng, Lin and Li, Xinyu and Liu, Qian and Feng, Xiachong and Kong, Lingpeng},
journal={arXiv preprint arXiv:2602.04289},
year={2026}
}
This codebase is built upon Lingua. We thank the authors for their excellent work.