
PLSemanticsBench

About

PLSemanticsBench is the first benchmark for evaluating LLMs as programming language interpreters. It introduces three tasks:

| Task | Description |
| --- | --- |
| PredState | Predicts the final program state |
| PredRule | Predicts the ordered sequence of semantic rules needed to evaluate a program |
| PredTrace | Predicts the step-by-step execution of a program |

PLSemanticsBench is hosted on Hugging Face: https://huggingface.co/datasets/EngineeringSoftware/PLSemanticsBench.

To evaluate your own models, implement the _query method of BaseRunner. We provide two example implementations: GPTRunner for OpenAI models and OllamaRunner for Ollama models.
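For example, a custom runner can be as small as the following sketch. The _query signature shown here (prompt string in, completion string out) is an assumption; check BaseRunner in the source for the actual interface.

from plsemanticsbench import BaseRunner

class MyRunner(BaseRunner):
    # Minimal sketch of a custom runner. The _query signature below is an
    # assumption -- consult BaseRunner in the source for the real interface.
    def _query(self, prompt: str) -> str:
        # Call your model here and return its raw completion text.
        completion = call_my_model(prompt)  # hypothetical client function
        return completion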

Installation

System Requirements

  • Conda package management system
  • Python 3.11 or higher
  • OpenAI API key (for running experiments with OpenAI models)

Step-by-Step Installation

  1. Create and activate the conda environment:
conda env create -f env.yaml
conda activate plsemanticsbench
  2. Set up your OpenAI API key (required only for OpenAI models):
export OPENAI_API_KEY='your-api-key-here'

Quick Start

We provide a bash script quick that:

  1. Sets up the plsemanticsbench conda environment.
  2. Pulls the DeepSeek-R1 1.5B model.
  3. Evaluates the DeepSeek-R1 1.5B model on the PredState task with no-semantics and chain-of-thought prompting on the Human-Written dataset.
  4. Prints the accuracy and malformed-count to screen.
  5. Creates metrics-predstate-deepseek-r1:1.5b.json that contains the evaluation result.
bash quick
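The quick script is equivalent to a short Python run with the OllamaRunner. The sketch below assumes OllamaRunner accepts the same ExperimentArgs as GPTRunner (see the Basic Example below), that the model tag matches the one pulled by Ollama, and that Semantics_Type and PROMPT_STRATEGY expose no-semantics and chain-of-thought members under the names used here.

from plsemanticsbench import OllamaRunner
from plsemanticsbench import ExperimentArgs
from plsemanticsbench import PROMPT_STRATEGY, Task, Semantics_Type, Language, PLDataset

# Mirror the quick script: PredState task, no-semantics, chain-of-thought
# prompting, Human-Written dataset, DeepSeek-R1 1.5B served by Ollama.
exp_args = ExperimentArgs(
    dataset=PLDataset.Human_Written,
    task=Task.PredState,
    language=Language.IMP,
    semantics_type=Semantics_Type.NoSemantics,  # assumed enum member name
    model_name="deepseek-r1:1.5b",
    prompt_strategy=PROMPT_STRATEGY.COT,        # assumed enum member name
)

runner = OllamaRunner(args=exp_args)  # assumes the model was already pulled
predictions = runner.do_experiment()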

Detailed Usage

Basic Example

Here's a minimal example to get started:

from plsemanticsbench import GPTRunner
from plsemanticsbench import ExperimentArgs, LLMEvaluator
from plsemanticsbench import (
    PROMPT_STRATEGY,
    Task,
    Formalization,
    Semantics_Type,
    Language,
    PLDataset
)

# Model name
model_name = "o3-mini"

# Experiment args: Run the PredState task on the IMP language with
# standard semantics formalized using SOS and with direct prompting
exp_args = ExperimentArgs(
    dataset=PLDataset.Human_Written,
    task=Task.PredState,
    language=Language.IMP,
    formalization=Formalization.SOS,
    semantics_type=Semantics_Type.Standard,
    model_name=model_name,
    prompt_strategy=PROMPT_STRATEGY.DA,
    num_datapoints_to_run=2, # Run just 2 datapoints (omit to run entire dataset)
)

# Run inference using the OpenAI API
gpt_runner = GPTRunner(args=exp_args)

# Generation (generate LLM prediction on the predstate task)
predictions = gpt_runner.do_experiment() # path to dump results can be provided

# Evaluation (evaluate LLM prediction against ground-truth)
llm_eval = LLMEvaluator(task=exp_args.task, semantics_type=exp_args.semantics_type)
evaluation_result = llm_eval.evaluate_from_list(results=predictions, model_name=model_name)
print(evaluation_result)

Expected Output

{
    'accuracy': 1,
    'malformed-count': 0,
}
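As noted in the comment in the Basic Example, do_experiment can also dump results to disk. A minimal sketch; the keyword name output_path is a guess, so check the method signature before relying on it:

# NOTE: output_path is a hypothetical keyword; the README only states that
# a dump path can be provided to do_experiment.
predictions = gpt_runner.do_experiment(output_path="predstate-o3-mini.json")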

Benchmark

Our benchmark is hosted on Hugging Face: https://huggingface.co/datasets/EngineeringSoftware/PLSemanticsBench.

Benchmark Access

You can load the dataset using the datasets library. Here is an example:

from datasets import load_dataset

# Load the PredState task with standard semantics (uk), the K-semantics formalization (K), and the Human-Written (human-written) dataset
predstate_IMP_K_uk_human_written = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-K-uk-human-written")

# Load the PredRule task with nonstandard semantics (mk), the SOS formalization (SOS), and the LLM-Translated (llm-translated) dataset
predrule_IMP_SOS_mk_llm_translated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predrule-IMP-SOS-mk-llm-translated")

# Load the PredState task with no-semantics (nk) and the Fuzzer-Generated (fuzzer-generated) dataset
predstate_IMP_nk_fuzzer_generated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-nk-fuzzer-generated")
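Each configuration loads as a regular datasets DatasetDict. A quick way to inspect one is shown below; the split name "test" is an assumption (print the DatasetDict first to see the real splits), and the field names come from the Data Example further down.

print(predstate_IMP_nk_fuzzer_generated)  # shows the available splits and columns

example = predstate_IMP_nk_fuzzer_generated["test"][0]  # assumed split name
print(example["program"])       # field names taken from the Data Example below
print(example["ground-truth"])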

Dataset Split

| Task | Split | Description |
| --- | --- | --- |
| PredState (Final State Prediction) | predstate-IMP-nk-{dataset-name} | No semantics |
| | predstate-IMP-K-uk-{dataset-name} | Standard semantics with K-semantics formalization |
| | predstate-IMP-K-mk-{dataset-name} | Nonstandard semantics with K-semantics formalization |
| | predstate-IMP-SOS-uk-{dataset-name} | Standard semantics with SOS formalization |
| | predstate-IMP-SOS-mk-{dataset-name} | Nonstandard semantics with SOS formalization |
| PredRule (Semantic Rule Prediction) | predrule-IMP-K-uk-human-written | Standard semantics with K-semantics formalization |
| | predrule-IMP-K-mk-human-written | Nonstandard semantics with K-semantics formalization |
| | predrule-IMP-SOS-uk-human-written | Standard semantics with SOS formalization |
| | predrule-IMP-SOS-mk-human-written | Nonstandard semantics with SOS formalization |
| PredTrace (Execution Trace Prediction) | predtrace-IMP-K-uk-human-written | Standard semantics with K-semantics formalization |
| | predtrace-IMP-K-mk-human-written | Nonstandard semantics with K-semantics formalization |
| | predtrace-IMP-SOS-uk-human-written | Standard semantics with SOS formalization |
| | predtrace-IMP-SOS-mk-human-written | Nonstandard semantics with SOS formalization |
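Because every split name follows the same pattern, configuration names can also be assembled programmatically. The helper below is purely illustrative and simply interpolates the fields from the table above:

def config_name(task, semantics, dataset, formalization=None):
    """Build a PLSemanticsBench config name from the table above.

    config_name("predstate", "mk", "human-written", formalization="SOS")
      -> "predstate-IMP-SOS-mk-human-written"
    config_name("predstate", "nk", "fuzzer-generated")
      -> "predstate-IMP-nk-fuzzer-generated"  (no formalization segment
         in the no-semantics setting)
    """
    parts = [task, "IMP"]
    if formalization is not None:
        parts.append(formalization)
    parts += [semantics, dataset]
    return "-".join(parts)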

Data Example

One example of the dataset is as follows:

{
  "program": "int ans; ans = 1; ...",
  "syntax": "<program> :: ...",
  "semantics": "ℤ := Set of integers ...",
  "mutated-program": "int ans; ans = 1; ...",
  "mutation-pattern": "KeyWordSwap",
  "exec-trace": [
    {
      "linenumber": 1,
      "rule": ["Rule 38", "Rule 39"],
      "state": {"ans": 1}
    }
  ],
  "ground-truth": "<answer>...</answer>"
}
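The ground-truth field wraps the expected answer in <answer> tags. If you need the payload outside of LLMEvaluator (which already handles this comparison during evaluation), a small regex suffices; this is an illustrative helper, not part of the package:

import re

def extract_answer(ground_truth):
    """Return the text between <answer> and </answer> tags."""
    match = re.search(r"<answer>(.*?)</answer>", ground_truth, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <answer>...</answer> tag found")
    return match.group(1).strip()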

Citation

@article{ThimmaiahETAL25PLSemanticsBench,
  title={PLSemanticsBench: Large Language Models As Programming Language Interpreters},
  author={Aditya Thimmaiah and Jiyang Zhang and Jayanth Srinivasa and Junyi Jessy Li and Milos Gligoric},
  year={2025},
  eprint={2510.03415},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2510.03415},
}

License

This project is licensed under the CC BY 4.0 License.