OncallM Roadmap

Purpose

Make OncallM a reliable on-call AI copilot that triages alerts, inspects Kubernetes signals (resources, events, logs, metrics), summarizes the likely root cause, and proposes safe, auditable remediations. The workflow is link-first: Alertmanager (and other notifiers) include a link to the OncallM web UI in each alert notification, and optional ChatOps integrations can come later. Deep-links can be authenticated via SSO (SAML or OIDC). Local LLM support keeps development and testing low-cost.

Versioning and planning

Pre-GA milestones use 0.x versions (0.1 … 0.9). GA is 1.0 with semantic versioning.
Each milestone lists goals and a concise definition of done (DoD) to keep PRs focused and reviewable.

Milestones

0.1 – MVP stabilization and clarity

Goals

Solidify API endpoints and predictable error responses.
Improve typing, docstrings, and error handling across services.
Unify configuration via environment variables with sane defaults.
Optional: streaming responses for LLM output.
LLM provider abstraction with a local LLM dev mode (OpenAI-compatible endpoint) to enable low-cost development and testing.

DoD

Endpoints return structured errors and consistent JSON.
Type checks and lints pass in CI; essential unit tests added.
A single configuration surface documented.
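
As an illustration of the "single configuration surface" goal, the sketch below loads all settings from environment variables with defaults that point at a local OpenAI-compatible endpoint. It is a minimal sketch, not OncallM's actual configuration: the `Settings` class, the `ONCALLM_*` variable names, and the default URL and model name are all assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names are illustrative only."""
    llm_provider: str       # "local" (dev default) or a cloud driver name
    llm_base_url: str       # OpenAI-compatible endpoint for local dev mode
    llm_model: str
    stream_responses: bool  # optional streaming of LLM output


def _as_bool(raw: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to defaults.

    The ONCALLM_* names and default values are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    return Settings(
        llm_provider=env.get("ONCALLM_LLM_PROVIDER", "local"),
        llm_base_url=env.get("ONCALLM_LLM_BASE_URL", "http://localhost:8000/v1"),
        llm_model=env.get("ONCALLM_LLM_MODEL", "local-model"),
        stream_responses=_as_bool(env.get("ONCALLM_STREAM", "false")),
    )
```

Passing the environment as a mapping keeps the loader easy to unit test: `load_settings({})` yields the local-dev defaults without touching the real process environment.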

0.2 – Production readiness

Goals

0.3 – Link-first workflow expansion

Goals

Cloud LLM provider driver (e.g., OpenAI/Azure/Anthropic) added on top of the provider abstraction (local dev mode remains default for development).
Link-first alert deep-links: canonical URL scheme /alerts/{alert_id} that renders triage status and results.
Alertmanager template examples and a small helper to include the alert link in notifications.
Optional signed links or per-alert tokens to protect access; config toggles for public vs. protected links.
Basic SSO for the UI (SAML or OIDC), including minimal role mapping (viewer/admin) and per-alert access control options.
Metrics providers: a provider interface, with Prometheus as the first implementation.

DoD

Cloud LLM provider works end-to-end via a config toggle alongside the local dev provider.
Alert deep-link renders triage page end-to-end from a notification link.
Alertmanager template snippet documented and covered by an example test.
Signed link/token option implemented and tested.
SSO login flow works with a sample IdP (metadata/issuer), with configuration docs and a smoke test.
Metrics snapshot retrieval used in analysis flow.
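
One possible shape for the signed-link option is an HMAC over the alert path plus an expiry timestamp, appended to the `/alerts/{alert_id}` deep-link as query parameters. The sketch below uses only the Python standard library. The function names, the `exp` and `sig` parameter names, and the message layout are assumptions for illustration, not a committed design; it also assumes `base_url` is a bare origin and `alert_id` is URL-safe.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def sign_alert_link(base_url: str, alert_id: str, secret: bytes, expires_at: int) -> str:
    """Return a deep-link whose path and expiry are covered by an HMAC-SHA256 tag."""
    path = f"/alerts/{alert_id}"
    message = f"{path}|{expires_at}".encode()
    sig = hmac.new(secret, message, hashlib.sha256).hexdigest()
    query = urlencode({"exp": expires_at, "sig": sig})
    return f"{base_url.rstrip('/')}{path}?{query}"


def verify_alert_link(url: str, secret: bytes, now: Optional[int] = None) -> bool:
    """Check the expiry and signature of a link produced by sign_alert_link."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires_at = int(params["exp"][0])
        sig = params["sig"][0]
    except (KeyError, IndexError, ValueError):
        return False  # missing or malformed parameters
    current = int(time.time()) if now is None else now
    if current > expires_at:
        return False  # link has expired
    message = f"{parts.path}|{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), sig.encode())
```

Because the path is part of the signed message, a signature issued for one alert does not validate for another. A public-versus-protected config toggle could simply skip `verify_alert_link` when links are public.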

0.4 – Intelligence and accuracy

Goals

0.5 – UX and workflow

Goals

Web UI polish: show status, inspected resources, recommendations + confidence; quick access to copy/share the alert link.
Link-first flow: from Alertmanager message link to triage page with clear recommended actions.
Incident timeline summaries and resolution notes.

DoD

UI reflects analysis steps and outputs clearly; alert links are copyable from the UI.
End-to-end link-first flow documented with setup steps.
Slack interactions create/update an incident thread.

0.6 – Safety and controlled remediation

Goals

0.7 – Performance and cost

Goals

0.8 – Extensibility and plugin system

Goals

0.9 – Security and compliance

Goals

1.0 – GA readiness

Goals

Quality gates

CI: lint, type-check, tests, security scans on PRs; build/publish images on tags.
Test coverage report and status comment on PRs.
Pre-commit hooks for formatting and static analysis.
Nightly e2e against a kind cluster using example alerts.

Governance

How to propose roadmap changes

Open an issue labeled roadmap with a concise problem statement, motivation, and acceptance criteria.
For cross-cutting or security-impacting changes, start with an RFC and ping maintainers for async review.
