MeteoraAg/os-crawler

os-crawler

Microservices that feed structured knowledge into GBrain (or any Obsidian-compatible vault).

GBrain handles storage, search, enrichment, and back-linking. These feeders watch external sources and push structured markdown into the Obsidian vault that GBrain syncs from.

Originally built to track an on-chain protocol's programs, SDKs, GitHub activity, Twitter/X handles, and internal Notion docs. The same pattern works for any team that wants a single searchable corpus over scattered sources.

Architecture

External Sources          Feeders              Obsidian Vault        GBrain
─────────────────    ─────────────────    ─────────────────    ─────────────
Solana Programs  ──→ idl-watcher      ──→ ants/programs/   ──→ Search
GitHub Repos     ──→ github-feed      ──→ ants/changelog/  ──→ Enrichment
SDKs             ──→ sdk-feed         ──→ ants/sdks/       ──→ Back-links
On-chain Data    ──→ analytics-feed   ──→ ants/analytics/  ──→ Skills
Twitter / X      ──→ twitter-feed     ──→ ants/twitter/    ──→ Cron Jobs
Notion Docs      ──→ notion-feed      ──→ ants/notion/     ──→ Search
Public Docs      ──→ docs-feed        ──→ ants/docs/       ──→ ...
Meetings         ──→ meeting-feed     ──→ ants/meetings/   ──→ ...

Each feeder writes Obsidian-compatible markdown with YAML frontmatter. GBrain's sync picks up changes, chunks content, embeds vectors, and makes everything searchable via MCP.
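
As a rough illustration of the kind of page a feeder emits, the sketch below renders markdown with a YAML frontmatter block. The field names (`title`, `entity`, `source`, `updated`) and the `renderPage` helper are hypothetical; the feeders' actual schema is not documented here.

```typescript
// Hypothetical frontmatter shape; the real feeders may use different fields.
interface PageMeta {
  title: string;
  entity: string;
  source: string;
  updated: string; // ISO 8601 date
}

// JSON string literals are valid YAML double-quoted scalars, so this keeps
// characters such as ":" or "#" from breaking the frontmatter.
function yamlString(value: string): string {
  return JSON.stringify(value);
}

// Build a page: frontmatter between two "---" lines, a blank line, then the body.
function renderPage(meta: PageMeta, body: string): string {
  const lines = [
    "title: " + yamlString(meta.title),
    "entity: " + yamlString(meta.entity),
    "source: " + yamlString(meta.source),
    "updated: " + yamlString(meta.updated),
  ];
  return "---\n" + lines.join("\n") + "\n---\n\n" + body.trim() + "\n";
}
```

Quoting every value is a conservative choice: it costs a few characters per field and avoids having to reason about which titles need escaping.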

Feeders

Feeder          Status  Description
──────────────  ──────  ───────────────────────────────────────────────────────────────
idl-watcher             Watches on-chain Anchor programs, renders IDL to markdown, detects changes
github-feed             Tracks merged PRs, releases, and issues across configured repos
sdk-feed                Parses SDK public API surface, tracks breaking changes, maps methods → IDL instructions
docs-feed               Syncs public documentation, flags stale docs
twitter-feed            Tracks tweets from configured handles plus mentions
notion-feed             Indexes internal Notion design docs, QA plans, FE/BE specs, comms, and milestones. Needs NOTION_TOKEN plus [notion] allowlists.
analytics-feed          Pulls product metrics from Amplitude via declarative event-segmentation queries (DAU, volume, on-chain txs) per entity. Needs AMPLITUDE_API_KEY + AMPLITUDE_SECRET_KEY.
meeting-feed    🚧      Transcribes meetings via whisper.cpp, extracts decisions

Quick Start

cp .env.example .env                      # fill in keys + paths
cp ants.config.example.toml ants.config.toml   # define what to track
bun install

Configuration

Two files drive everything:

  • .env: secrets and per-machine paths (vault root, cache dir, API keys)
  • ants.config.toml: the registry of what to track

Both files are gitignored; commit only the .example versions.

The config layout is hybrid: real products group their program + SDKs + tagged repos under one block; standalone repos and handles stay flat.

[products.<slug>]                  # product metadata (title, description)
[products.<slug>.program]          # → idl-watcher  (on-chain program)
[products.<slug>.sdks.<role>]      # → sdk-feed     (one entry per SDK lang)
[products.<slug>.repos.<role>]     # → github-feed  (one entry per tagged repo)

[repos.<slug>]                     # → github-feed  (standalone repos)
[handles.<slug>]                   # → twitter-feed (handle snapshots)

[notion]                           # → notion-feed allowlists + default queries

Adding a new product is one block: paste [products.foo], then any combination of [products.foo.program], [products.foo.sdks.ts], and [products.foo.repos.<role>]. Repos that aren't tagged to a product (infra, services, third-party forks) live at the top level under [repos.<slug>].

Auto-derived entity slugs: <product> for the program, <product>-sdk for the TS SDK, <product>-<role> for everything else. Override per-subsection with entity = "..." to keep an existing vault page name. Operational config (metrics, TTLs, whisper glossary) and secrets (.env) stay where they are. Long-running daemons pick up edits on restart. Override the config path with ANTS_CONFIG.
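
The slug rules above can be sketched as a small function. This is an illustration under assumptions, not the repo's actual code: the `Section` type and `deriveEntitySlug` name are hypothetical, and the `ts` role is taken from the `[products.foo.sdks.ts]` example.

```typescript
// Hypothetical view of one config subsection under [products.<slug>].
type Section =
  | { kind: "program"; entity?: string }
  | { kind: "sdk" | "repo"; role: string; entity?: string };

function deriveEntitySlug(product: string, section: Section): string {
  // An explicit entity = "..." override always wins.
  if (section.entity) return section.entity;
  // The program uses the bare product slug.
  if (section.kind === "program") return product;
  // The TS SDK gets the short "<product>-sdk" form.
  if (section.kind === "sdk" && section.role === "ts") return product + "-sdk";
  // Everything else is "<product>-<role>".
  return product + "-" + section.role;
}
```

For example, `deriveEntitySlug("foo", { kind: "sdk", role: "rust" })` yields `foo-rust`, while the same call with `entity: "legacy-page"` set yields `legacy-page`.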

notion-feed is deny-by-default. Set NOTION_TOKEN to a Notion integration token with read access to the workspace pages/databases you want indexed, then add those IDs to [notion].allowed_teamspaces or [notion].allowed_databases in ants.config.toml. Empty allowlists return a clean no-op. The orchestrator also runs a startup handshake; if the token is present but invalid or unreachable, Notion is disabled for that session instead of wedging later calls.
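
A minimal sketch of what deny-by-default filtering could look like. The `NotionHit` shape and the `filterAllowed` helper are hypothetical, and how the feeder actually resolves a page's teamspace or database is not shown here; only the allowlist names come from the config above.

```typescript
// Mirrors [notion].allowed_teamspaces / allowed_databases from ants.config.toml.
interface NotionAllowlists {
  allowedTeamspaces: string[];
  allowedDatabases: string[];
}

// Hypothetical, already-resolved view of a search result.
interface NotionHit {
  title: string;
  teamspaceId?: string;
  databaseId?: string;
}

// A hit passes only if one of its parent IDs is explicitly allowlisted.
function isAllowed(hit: NotionHit, lists: NotionAllowlists): boolean {
  const inTeamspace =
    hit.teamspaceId !== undefined &&
    lists.allowedTeamspaces.indexOf(hit.teamspaceId) !== -1;
  const inDatabase =
    hit.databaseId !== undefined &&
    lists.allowedDatabases.indexOf(hit.databaseId) !== -1;
  return inTeamspace || inDatabase;
}

// With both allowlists empty, nothing passes: a clean no-op.
function filterAllowed(hits: NotionHit[], lists: NotionAllowlists): NotionHit[] {
  return hits.filter((hit) => isAllowed(hit, lists));
}
```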

Commands

Command                                      Description
───────────────────────────────────────────  ───────────────────────────────────────────────────────────────
bun run idl-watcher:sync                     Snapshot every configured program's IDL from on-chain and write to the vault
bun run idl-watcher:pr <url|owner/repo#N>    Compare a PR's IDL against main and render the diff
bun run github-feed:sync [entity]            Pull merged PRs, releases, and issues for all repos (or one)
bun run sdk-feed:sync                        Snapshot every SDK's public API surface from main
bun run sdk-feed:pr <url|owner/repo#N>       Compare an SDK PR's surface against base and flag breaking changes
bun run docs-feed                            Sync public docs into the vault (incremental, uses cache)
bun run docs-feed:full                       Full docs sync: ignore cache, fetch everything
bun run docs-feed:dry-run                    Report docs changes without writing to the vault
bun run analytics-feed:sync                  Pull product metrics from Amplitude and snapshot into the vault
bun run analytics-feed:dry-run               List registered metrics without hitting Amplitude or writing
bun run meeting-feed:file <path>             Transcribe a Craig recording (.zip or directory of .flac) into a meeting page
bun run meeting-feed:watch [dir]             Watch $ANTS_RECORDINGS/ (or arg) and auto-ingest every new .zip
bun run meeting-feed:eval                    Score transcription quality (WER) against hand-corrected goldens. See feeders/meeting-feed/eval-README.md
bun run twitter-feed:sync [entity]           Snapshot tweets for all tracked handles (or one) plus configured mentions
bun run twitter-feed:mentions                Refresh just the mentions aggregator
bun run notion-feed:sync [entity]            Search configured Notion queries (or one entity term) and write scoped summaries under ants/notion/
bun run notion-feed:diag [query...]          Dump raw Notion /search results with title + parent type/ID so you can see what UUIDs to add to [notion].allowed_teamspaces / allowed_databases. Does not write to the vault.
bun run notion-feed:clean [--yes|--dry-run]  One-shot reset of <vault>/ants/notion/. Run when scope inference or the slugifier changes and stale files would otherwise pile up.

Tuning meeting transcription

When transcripts come back wrong, you have five knobs:

  1. Glossary — add domain terms (product names, partner names, team handles) to feeders/shared/whisper-prompt.toml. Whisper biases toward words it has seen in the prompt.
  2. Per-meeting context — pass --context "discussion of X" to meeting-feed:file. Layered on top of the glossary, just for that meeting.
  3. Team handles — uncomment and fill in real Discord/Slack display names in whisper-prompt.toml's glossary list.
  4. Model swap — point WHISPER_MODEL at a larger model (ggml-large-v3.bin instead of turbo) for higher accuracy at slower speed.
  5. Eval before/after: bun run meeting-feed:eval re-runs the pipeline against fixed samples and reports WER deltas. Without this you're tuning blind. See feeders/meeting-feed/eval-README.md.
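
For readers unfamiliar with WER (word error rate), here is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and the pipeline's output, divided by the number of reference words. The function names and the lowercase-and-split normalization are assumptions; the repo's eval script may normalize differently.

```typescript
// Lowercase and split on whitespace. Real evals often strip punctuation too.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  // Convention chosen for this sketch: an empty reference scores 0 if the
  // hypothesis is also empty, otherwise 1.
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // prev[j] holds the edit distance between the reference prefix processed
  // so far and the first j hypothesis words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

With this definition, `wordErrorRate("the quick brown fox", "the quick brow fox")` is 0.25: one substitution across four reference words. A glossary change is an improvement only if the score drops on the fixed samples.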

Remote MCP access

The orchestrator (feeders/orchestrator/main.ts) is the MCP server that exposes research plus per-feeder tools to any MCP client (Claude Code, Claude Desktop, ChatGPT custom GPTs, scripts).

It runs in two modes:

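Mode             When to use                                                                                      Auth
───────────────  ───────────────────────────────────────────────────────────────────────────────────────────────  ─────────────────────────────────────
stdio (default)  Local clients on the host, or remote clients spawned via SSH                                     SSH key
http             Remote clients reaching the host over the internet via a tunnel; no inbound SSH, no shared keys  Authorization: Bearer $ANTS_MCP_TOKEN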

One-time setup

# 1. Generate a strong token and add it to .env
echo "ANTS_MCP_TOKEN=$(openssl rand -hex 32)" >> .env
echo "ANTS_MCP_TRANSPORT=http" >> .env

# 2. Install launchd agents (macOS) — feeders + orchestrator
./scripts/launchd/install.sh

# 3. Confirm the orchestrator is listening locally
curl -s http://127.0.0.1:7777/health
# → ok

# 4. Expose it publicly. Pick one:

# 4a. Tailscale Funnel — backed by your tailnet
tailscale funnel --bg --https=443 http://localhost:7777
tailscale funnel status
# → https://<machine>.<tailnet>.ts.net is now publicly reachable

# 4b. ngrok — stable reserved domain, useful when clients can't join the
#     tailnet (ChatGPT Enterprise, Claude.ai web, etc.)
#     Sign up at https://dashboard.ngrok.com, reserve a static domain, then:
echo "NGROK_AUTHTOKEN=<from-dashboard>" >> .env
echo "NGROK_DOMAIN=<your-subdomain>.ngrok.app" >> .env
./scripts/launchd/install.sh --label ngrok
# → https://<your-subdomain>.ngrok.app/mcp is now publicly reachable

Funnel runs as part of tailscaled, so it persists across reboots without its own launchd agent. Tear it down with tailscale funnel --https=443 off.

ngrok runs under its own launchd agent (KeepAlive=true). Tear it down with ./scripts/launchd/install.sh --uninstall --label ngrok. Logs at ~/.ants/logs/ngrok.log.

Wire it into a remote MCP client

Drop this block into your MCP client config:

{
  "mcpServers": {
    "ants": {
      "url": "https://<machine>.<tailnet>.ts.net/mcp",
      "headers": {
        "Authorization": "Bearer <your ANTS_MCP_TOKEN>"
      }
    }
  }
}

Smoke test from any laptop:

curl -s -X POST https://<machine>.<tailnet>.ts.net/mcp \
  -H "Authorization: Bearer $ANTS_MCP_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'

Rotating the token

# 1. Edit .env on the host, set a new ANTS_MCP_TOKEN
# 2. Restart the daemon so it picks up the new value
launchctl kickstart -k gui/$(id -u)/<your-launchd-label>
# 3. Update every client's config with the new token

Threat model

  • Anyone with the token can call every tool the orchestrator exposes — triggering feeder runs, ingesting recordings, writing into the vault. They cannot read arbitrary files outside the vault, and they cannot run shell commands. Treat the token like an API key with full vault write access.
  • The orchestrator binds 127.0.0.1 by default. Funnel relays the encrypted connection to tailscaled on the host, which terminates TLS and proxies to localhost:7777. Without a tunnel (Funnel or ngrok), only processes on the host can reach /mcp.
  • /health is unauthenticated by design (Funnel and uptime probes need it). It returns no information about the vault or the token.
  • Rotate the token regularly, or immediately if a client laptop is lost. There is no per-client revocation — one token, all clients.
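
The single-token model above amounts to a check like the following. This is a sketch, not the orchestrator's actual code: `isAuthorized` and `constantTimeEqual` are hypothetical names, and the constant-time comparison is an assumption about good practice rather than something the source confirms.

```typescript
// Compare two strings without exiting on the first mismatch, so response
// time does not reveal how many leading characters matched.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; "| 0" maps that to 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

function isAuthorized(path: string, authHeader: string | null, token: string): boolean {
  // /health is unauthenticated by design.
  if (path === "/health") return true;
  // Deny when no token is configured or no header was sent.
  if (!token || !authHeader) return false;
  const prefix = "Bearer ";
  if (authHeader.slice(0, prefix.length) !== prefix) return false;
  return constantTimeEqual(authHeader.slice(prefix.length), token);
}
```

Because there is only one token, a check like this has no notion of which client is calling, which is why revocation means rotating the token for everyone.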

Knowledge Model

Every page uses GBrain's compiled truth + timeline pattern:

  • Compiled truth (above ---): Current synthesized state. Gets overwritten on each run.
  • Timeline (below ---): Append-only evidence trail. Existing entries are never edited; new entries are prepended, so the newest comes first.

This means a program page always shows the latest IDL state at the top, with a history of all changes below.
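
A minimal sketch of that update rule, assuming the page body (frontmatter already stripped) contains a single `---` separator line. The `updatePage` helper is hypothetical and is not GBrain's or the feeders' actual implementation.

```typescript
const SEPARATOR = "\n---\n";

// Overwrite the compiled truth and prepend one timeline entry,
// leaving every existing timeline entry untouched.
function updatePage(existing: string, compiledTruth: string, entry: string): string {
  const cut = existing.indexOf(SEPARATOR);
  // Everything after the separator is the existing timeline (empty for a new page).
  const timeline = cut === -1 ? "" : existing.slice(cut + SEPARATOR.length).trim();
  const entries = timeline ? entry.trim() + "\n" + timeline : entry.trim();
  return compiledTruth.trim() + "\n\n---\n\n" + entries + "\n";
}
```

Calling it twice shows the pattern: the first call creates the page, and the second replaces the text above the separator while the earlier entry stays in place below the new one.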

Skills

Skills are reusable agent workflows that live alongside the feeders. The skill system (loader, registry, resolver, runner, MCP tools skill_list / skill_create / skill_validate / skill_reload / skill_resolve) is generic — see skills/AUTHORING.md for how to author one. No skills ship with this repo; bring your own.

v1

The original implementation (ClickHouse-based with custom search) is archived in v1/. v2 and later versions will be shipped to this repo in batches.

License

MIT — see LICENSE.
