Connect AI agents and engineers to Apache Spark History Server for intelligent job analysis, performance monitoring, and failure investigation
> [!IMPORTANT]
> A standalone Go binary that queries Spark History Server directly from your terminal — no MCP, no AI framework, no daemon process. Inspect jobs, compare runs, investigate failures, and script against the Spark REST API.
This project provides two interfaces to your Spark History Server data:
|  | 🛠️ SHS CLI (`shs`) | ⚡ MCP Server |
|---|---|---|
| For | Engineers, shell scripts, CI/CD, coding agents | AI agents and MCP-compatible clients |
| Mental model | "I know the command I want to run" | "Agent, investigate this Spark app" |
| Install | Single static binary — no dependencies | Python 3.12+, uv |
| Get started | CLI docs → | MCP docs → |
```mermaid
graph TB
    subgraph Clients
        A[🤖 AI Agent / LLM]
        B[👩‍💻 Engineer / Script / CI]
        C[🔧 Coding Agent - Claude Code / Kiro]
    end

    subgraph "Kubeflow Spark AI Toolkit"
        D[⚡ MCP Server]
        E[🛠️ CLI - shs]
    end

    subgraph "Spark History Servers"
        F[🔥 Production]
        G[🔥 Staging / Dev]
    end

    A -->|MCP Protocol| D
    B -->|Terminal commands| E
    C -->|shs skill file| E
    D -->|REST API| F
    D -->|REST API| G
    E -->|REST API| F
    E -->|REST API| G
```
## 🛠️ SHS CLI (`shs`)

A standalone Go binary — no MCP, no dependencies, no running daemon. Query your Spark History Server directly from the terminal, shell scripts, or CI/CD pipelines. Also works as a skill for coding agents like Claude Code and Kiro.
```bash
# Auto-detect latest version, OS, and architecture
VERSION=$(curl -s https://api.github.com/repos/kubeflow/mcp-apache-spark-history-server/releases | grep -m1 '"tag_name": "cli/' | cut -d'"' -f4 | sed 's|cli/||')
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)
[ "$ARCH" = "x86_64" ] && ARCH="amd64"
[ "$ARCH" = "aarch64" ] && ARCH="arm64"
curl -sSL "https://github.com/kubeflow/mcp-apache-spark-history-server/releases/download/cli%2F${VERSION}/shs-${VERSION}-${OS}-${ARCH}.tar.gz" | tar xz
sudo mv shs /usr/local/bin/
```
```bash
# Generate a config file
shs setup config > config.yaml   # then set your Spark History Server URL

# Explore applications
shs apps
shs jobs -a APP_ID --status failed
shs stages -a APP_ID --sort duration
shs compare apps --app-a APP1 --app-b APP2

# Use as a skill with Claude Code or Kiro
shs setup skill > ~/.claude/skills/spark-history.md
```

See the CLI documentation for full usage, or check out a real-world example of Claude Code comparing two TPC-DS 3TB benchmark runs.
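Because `shs` is a single static binary, it drops straight into CI. A minimal sketch of a gating step (the check against the CLI's text output is an assumption, not a documented format):

```bash
#!/usr/bin/env bash
# Fail a CI step if the given Spark application has any failed jobs.
# Assumes `shs` is on PATH and config.yaml points at your history server.
set -euo pipefail

APP_ID="$1"

# `shs jobs -a <app> --status failed` lists failed jobs. Treating any
# line mentioning "failed" as a failure signal is an assumption about
# the output format; adjust after inspecting real output.
if shs jobs -a "$APP_ID" --status failed | grep -qi 'failed'; then
  echo "Spark app $APP_ID has failed jobs" >&2
  shs stages -a "$APP_ID" --sort duration   # surface slow stages for triage
  exit 1
fi
```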
## ⚡ MCP Server

An MCP (Model Context Protocol) server that exposes Spark History Server data as tools for AI agents. Agents query your Spark infrastructure using natural language — the server handles tool selection, multi-server routing, and structured data retrieval.
Use the MCP server when you want an AI agent to conduct multi-step investigations, synthesize findings across tools, or answer natural-language questions about your Spark applications.
```bash
# Run directly with uvx (no install needed)
uvx --from mcp-apache-spark-history-server spark-mcp

# Or install with pip
pip install mcp-apache-spark-history-server
spark-mcp
```

The package is published to PyPI.
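Before wiring up an agent, you can exercise the server's tools interactively with the MCP Inspector. A quick sketch (the `/mcp` endpoint path is an assumption; match it to your transport settings):

```bash
# Launch the MCP Inspector UI in a browser, then connect it to the
# local server using the "Streamable HTTP" transport.
npx @modelcontextprotocol/inspector
# URL to enter in the Inspector (path is an assumption): http://localhost:18888/mcp
```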
Edit `config.yaml`:

```yaml
servers:
  local:
    default: true
    url: "http://your-spark-history-server:18080"
    auth:  # optional
      username: "user"
      password: "pass"
    include_plan_description: false  # include SQL plans in responses by default (default: false)

mcp:
  transports:
    - streamable-http  # or: stdio
  port: "18888"
  debug: false
```

Environment variable overrides:
| Variable | Description |
|---|---|
| `SHS_MCP_PORT` | Port for MCP server (default: 18888) |
| `SHS_MCP_TRANSPORT` | Transport mode: `streamable-http` or `stdio` |
| `SHS_MCP_DEBUG` | Enable debug mode (default: false) |
| `SHS_MCP_ADDRESS` | Bind address (default: localhost) |
| `SHS_SERVERS_*_URL` | URL for a specific server |
| `SHS_SERVERS_*_AUTH_USERNAME` | Basic-auth username for a specific server |
| `SHS_SERVERS_*_AUTH_PASSWORD` | Basic-auth password for a specific server |
| `SHS_SERVERS_*_AUTH_TOKEN` | Auth token for a specific server |
| `SHS_SERVERS_*_VERIFY_SSL` | Verify SSL certificates for a specific server |
| `SHS_SERVERS_*_TIMEOUT` | Request timeout for a specific server |
| `SHS_SERVERS_*_EMR_CLUSTER_ARN` | EMR cluster ARN for a specific server |
| `SHS_SERVERS_*_INCLUDE_PLAN_DESCRIPTION` | Include SQL plan descriptions for a specific server |
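These overrides let you repoint a deployment without editing `config.yaml`. A minimal sketch (reading the `*` in the names above as the server key from `config.yaml` is an assumption):

```bash
# Redirect the "local" server to staging and switch to stdio transport,
# overriding config.yaml via environment variables only.
export SHS_SERVERS_LOCAL_URL="http://staging-spark-history:18080"
export SHS_MCP_TRANSPORT=stdio
export SHS_MCP_DEBUG=true

uvx --from mcp-apache-spark-history-server spark-mcp
```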
Configure multiple Spark History Servers and route queries to specific ones:
```yaml
servers:
  production:
    default: true
    url: "http://prod-spark-history:18080"
    auth:
      username: "user"
      password: "pass"
  staging:
    url: "http://staging-spark-history:18080"
```

Agents can target a specific server per query:
"Get application
<app_id>from the production server"
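The same layout can be expressed purely with environment variables, which suits containerized deployments. A sketch, again assuming the middle segment of each variable name is the server key:

```bash
# Declare the production and staging servers without a config file.
export SHS_SERVERS_PRODUCTION_URL="http://prod-spark-history:18080"
export SHS_SERVERS_PRODUCTION_AUTH_USERNAME="user"
export SHS_SERVERS_PRODUCTION_AUTH_PASSWORD="pass"
export SHS_SERVERS_STAGING_URL="http://staging-spark-history:18080"
```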
| Agent | Transport | Guide |
|---|---|---|
| Claude Desktop | stdio | Setup → |
| Amazon Q CLI | stdio | Setup → |
| Kiro | streamable-http | Setup → |
| LangGraph | streamable-http | Setup → |
| Strands Agents | streamable-http | Setup → |
| Local / Inspector | streamable-http | Setup → |
### Available tools

**Applications**

| Tool | Description |
|---|---|
| `list_applications` | List applications with optional status, date, and limit filters |
| `get_application` | Get application detail: status, resources, duration, attempts |

**Jobs**

| Tool | Description |
|---|---|
| `list_jobs` | List jobs with status filtering |
| `list_slowest_jobs` | Top N slowest jobs |

**Stages**

| Tool | Description |
|---|---|
| `list_stages` | List stages with status filtering |
| `list_slowest_stages` | Top N slowest stages |
| `get_stage` | Stage detail with attempt and summary metrics |
| `get_stage_task_summary` | Task metric distributions (execution time, memory, I/O, spill) |

**Executors**

| Tool | Description |
|---|---|
| `list_executors` | List executors (active and optionally inactive) |
| `get_executor` | Executor detail: resources, task stats, performance |
| `get_executor_summary` | Aggregate metrics across all executors |
| `get_resource_usage_timeline` | Chronological executor add/remove with resource totals |

**Environment**

| Tool | Description |
|---|---|
| `get_environment` | Spark config, JVM info, system properties, classpath |

**SQL**

| Tool | Description |
|---|---|
| `list_slowest_sql_queries` | Top N slowest SQL executions with metrics |
| `get_sql_execution` | SQL execution detail with optional plan and node metrics |
| `compare_sql_execution_plans` | Compare SQL plans and metrics between two jobs |

**Bottlenecks**

| Tool | Description |
|---|---|
| `get_job_bottlenecks` | Identify bottlenecks across stages, tasks, and executors |

**Comparison**

| Tool | Description |
|---|---|
| `compare_job_environments` | Diff Spark configs between two applications |
| `compare_job_performance` | Diff performance metrics between two applications |
- "Why is my ETL job running slower than yesterday?" →
get_job_bottlenecks+list_slowest_stages+compare_job_performance - "What caused job 42 to fail?" →
list_jobs+get_stage+get_stage_task_summary - "Compare today's batch with yesterday's run" →
compare_job_performance+compare_job_environments - "Find my slowest SQL queries and explain why" →
list_slowest_sql_queries+get_sql_execution+compare_sql_execution_plans
### Kubernetes deployment

Deploy the MCP server using Helm:

```bash
helm install spark-history-mcp ./deploy/kubernetes/helm/mcp-apache-spark-history-server/

# Production configuration
helm install spark-history-mcp ./deploy/kubernetes/helm/mcp-apache-spark-history-server/ \
  --set replicaCount=3 \
  --set autoscaling.enabled=true
```

See `deploy/kubernetes/helm/` for full configuration options.
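After installing, confirm the release is healthy before pointing clients at it. A sketch (the pod label selector is an assumption about the chart; `helm status` shows the real resources):

```bash
# Verify the Helm release and its pods.
helm status spark-history-mcp
kubectl get pods -l app.kubernetes.io/name=mcp-apache-spark-history-server
```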
When deployed in Kubernetes, connect Claude Desktop via mcp-remote:

```bash
kubectl port-forward svc/mcp-apache-spark-history-server 18888:18888
```
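With the port-forward running, Claude Desktop reaches the in-cluster server through mcp-remote, which bridges its stdio transport to the HTTP endpoint. A sketch of the client-side command (the `/mcp` path is an assumption; use the path your transport config exposes):

```bash
# Configure this as the MCP server command in Claude Desktop.
npx mcp-remote http://localhost:18888/mcp
```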
**AWS integrations**

- AWS Glue — Connect to Glue Spark History Server
- Amazon EMR — Use EMR Persistent UI for Spark analysis
## 🧪 Development

```bash
git clone https://github.com/kubeflow/mcp-apache-spark-history-server.git
cd mcp-apache-spark-history-server

# Install Task runner
brew install go-task  # macOS; see https://taskfile.dev/installation/ for others

# MCP Server
task install             # install Python dependencies
task start-spark-bg      # start Spark History Server with sample data
task start-mcp-bg        # start MCP server
task start-inspector-bg  # open MCP Inspector at http://localhost:6274
task stop-all

# CLI
cd skills/cli
task build     # build ./bin/shs
task test      # unit tests
task test-e2e  # e2e tests (starts/stops Docker SHS automatically)
task start-shs # start SHS with CLI e2e sample data
```

Using this project? Add your organization to ADOPTERS.md and help grow the community.
See CONTRIBUTING.md for guidelines.
Apache License 2.0 — see LICENSE.
Built for use with Apache Spark™ History Server. Not affiliated with or endorsed by the Apache Software Foundation.
Connect your Spark infrastructure to AI agents and engineers
🛠️ SHS CLI · ⚡ MCP Server · 🧪 Test · 🤝 Contribute
Built by the community, for the community 💙

