os-crawler

Microservices that feed structured knowledge into GBrain (or any Obsidian-compatible vault).

GBrain handles storage, search, enrichment, and back-linking. These feeders watch external sources and push structured markdown into the Obsidian vault that GBrain syncs from.

Originally built to track an on-chain protocol's programs, SDKs, GitHub activity, Twitter/X handles, and internal Notion docs. The same pattern works for any team that wants a single searchable corpus over scattered sources.

Architecture

External Sources          Feeders              Obsidian Vault        GBrain
─────────────────    ─────────────────    ─────────────────    ─────────────
Solana Programs  ──→ idl-watcher      ──→ ants/programs/   ──→ Search
GitHub Repos     ──→ github-feed      ──→ ants/changelog/  ──→ Enrichment
SDKs             ──→ sdk-feed         ──→ ants/sdks/       ──→ Back-links
On-chain Data    ──→ analytics-feed   ──→ ants/analytics/  ──→ Skills
Twitter / X      ──→ twitter-feed     ──→ ants/twitter/    ──→ Cron Jobs
Notion Docs      ──→ notion-feed      ──→ ants/notion/     ──→ Search
Public Docs      ──→ docs-feed        ──→ ants/docs/       ──→ ...
Meetings         ──→ meeting-feed     ──→ ants/meetings/   ──→ ...

Each feeder writes Obsidian-compatible markdown with YAML frontmatter. GBrain's sync picks up changes, chunks content, embeds vectors, and makes everything searchable via MCP.
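
As a rough illustration of what "markdown with YAML frontmatter" means for a feeder, the sketch below renders a page from a key/value map and a body. The helper names (`renderPage`, `yamlScalar`) and the frontmatter fields in the usage note are hypothetical, not the repo's actual schema.

```typescript
// Hypothetical helper: field names and shapes are illustrative only.
type Frontmatter = Record<string, string | number | boolean | string[]>;

function yamlScalar(value: string | number | boolean): string {
  // JSON string quoting is also valid YAML double-quoted style.
  return typeof value === "string" ? JSON.stringify(value) : String(value);
}

function renderPage(frontmatter: Frontmatter, body: string): string {
  const lines: string[] = ["---"];
  for (const [key, value] of Object.entries(frontmatter)) {
    if (Array.isArray(value)) {
      // Lists become YAML block sequences.
      lines.push(`${key}:`);
      for (const item of value) lines.push(`  - ${yamlScalar(item)}`);
    } else {
      lines.push(`${key}: ${yamlScalar(value)}`);
    }
  }
  // Close the frontmatter, then emit the body with a single trailing newline.
  lines.push("---", "", body.trimEnd(), "");
  return lines.join("\n");
}
```

Calling `renderPage({ title: "Foo", tags: ["a", "b"] }, "Body")` would yield a file that opens with a `---` fenced YAML block followed by the markdown body, which is the shape Obsidian expects.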

Feeders

| Feeder | Status | Description |
| --- | --- | --- |
| `idl-watcher` | ✅ | Watches on-chain Anchor programs, renders IDL to markdown, detects changes |
| `github-feed` | ✅ | Tracks merged PRs, releases, and issues across configured repos |
| `sdk-feed` | ✅ | Parses SDK public API surface, tracks breaking changes, maps methods → IDL instructions |
| `docs-feed` | ✅ | Syncs public documentation, flags stale docs |
| `twitter-feed` | ✅ | Tracks tweets from configured handles plus mentions |
| `notion-feed` | ✅ | Indexes internal Notion design docs, QA plans, FE/BE specs, comms, and milestones. Needs `NOTION_TOKEN` plus `[notion]` allowlists. |
| `analytics-feed` | ✅ | Pulls product metrics from Amplitude via declarative event-segmentation queries (DAU, volume, on-chain txs) per entity. Needs `AMPLITUDE_API_KEY` + `AMPLITUDE_SECRET_KEY`. |
| `meeting-feed` | 🚧 | Transcribes meetings via whisper.cpp, extracts decisions |
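
To make "tracks breaking changes" in the sdk-feed row concrete, here is a minimal sketch of one way to compare two API surfaces. It assumes a surface can be reduced to a map from export name to signature text; that representation and the name `diffSurface` are assumptions, and the real feeder may classify changes differently.

```typescript
// Assumed representation: export name mapped to its signature as text.
type ApiSurface = Map<string, string>;

interface SurfaceDiff {
  added: string[];
  removed: string[];
  changed: string[];
  breaking: boolean;
}

function diffSurface(base: ApiSurface, head: ApiSurface): SurfaceDiff {
  const added = Array.from(head.keys()).filter((name) => !base.has(name)).sort();
  const removed = Array.from(base.keys()).filter((name) => !head.has(name)).sort();
  const changed = Array.from(base.keys())
    .filter((name) => head.has(name) && head.get(name) !== base.get(name))
    .sort();
  // Conservative rule: any removal or signature change counts as breaking;
  // additions alone do not.
  return { added, removed, changed, breaking: removed.length > 0 || changed.length > 0 };
}
```

Treating every signature change as breaking over-reports (widening a parameter type is usually safe), but it errs on the side of flagging changes for review.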

Quick Start

cp .env.example .env                      # fill in keys + paths
cp ants.config.example.toml ants.config.toml   # define what to track
bun install

Configuration

Two files drive everything:

.env: secrets and per-machine paths (vault root, cache dir, API keys)
ants.config.toml: the registry of what to track

Both files are gitignored; commit only the .example versions.

The config layout is hybrid: real products group their program + SDKs + tagged repos under one block; standalone repos and handles stay flat.

[products.<slug>]                  # product metadata (title, description)
[products.<slug>.program]          # → idl-watcher  (on-chain program)
[products.<slug>.sdks.<role>]      # → sdk-feed     (one entry per SDK lang)
[products.<slug>.repos.<role>]     # → github-feed  (one entry per tagged repo)

[repos.<slug>]                     # → github-feed  (standalone repos)
[handles.<slug>]                   # → twitter-feed (handle snapshots)

[notion]                           # → notion-feed allowlists + default queries

Adding a new product is one block: paste [products.foo], then any combination of [products.foo.program], [products.foo.sdks.ts], and [products.foo.repos.<role>]. Repos that aren't tagged to a product (infra, services, third-party forks) live at the top level under [repos.<slug>].
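
The section-to-feeder mapping above can be sketched as a small planner over an already-parsed config object. The `Config` shape and the `planJobs` name are assumptions for illustration; only the section paths and feeder names come from the layout shown above.

```typescript
// Assumed shape of the parsed ants.config.toml; section bodies are left opaque.
interface Config {
  products?: Record<string, {
    program?: object;
    sdks?: Record<string, object>;
    repos?: Record<string, object>;
  }>;
  repos?: Record<string, object>;
  handles?: Record<string, object>;
  notion?: object;
}

interface Job {
  feeder: string;
  path: string; // config path that produced the job
}

function planJobs(config: Config): Job[] {
  const jobs: Job[] = [];
  for (const [slug, product] of Object.entries(config.products ?? {})) {
    if (product.program) {
      jobs.push({ feeder: "idl-watcher", path: `products.${slug}.program` });
    }
    for (const role of Object.keys(product.sdks ?? {})) {
      jobs.push({ feeder: "sdk-feed", path: `products.${slug}.sdks.${role}` });
    }
    for (const role of Object.keys(product.repos ?? {})) {
      jobs.push({ feeder: "github-feed", path: `products.${slug}.repos.${role}` });
    }
  }
  // Standalone repos and handles stay flat at the top level.
  for (const slug of Object.keys(config.repos ?? {})) {
    jobs.push({ feeder: "github-feed", path: `repos.${slug}` });
  }
  for (const slug of Object.keys(config.handles ?? {})) {
    jobs.push({ feeder: "twitter-feed", path: `handles.${slug}` });
  }
  if (config.notion) jobs.push({ feeder: "notion-feed", path: "notion" });
  return jobs;
}
```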

Entity slugs are derived automatically: <product> for the program, <product>-sdk for the TS SDK, and <product>-<role> for everything else. Set entity = "..." in any subsection to override the derived slug and keep an existing vault page name.

Operational config (metrics, TTLs, whisper glossary) and secrets (.env) stay where they are. Long-running daemons pick up config edits only on restart. Set ANTS_CONFIG to load the config from a different path.
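
The slug rules can be written down as a small function. This is a sketch under two assumptions: the TS SDK is the entry whose role is `ts` (as in `[products.foo.sdks.ts]`), and the `Section` type and `deriveEntity` name are hypothetical.

```typescript
// Hypothetical model of a product subsection.
type Section =
  | { kind: "program"; entity?: string }
  | { kind: "sdk"; role: string; entity?: string }
  | { kind: "repo"; role: string; entity?: string };

function deriveEntity(product: string, section: Section): string {
  // An explicit entity = "..." always wins.
  if (section.entity) return section.entity;
  if (section.kind === "program") return product;
  // Assumption: the TS SDK is the SDK entry with role "ts".
  if (section.kind === "sdk" && section.role === "ts") return `${product}-sdk`;
  return `${product}-${section.role}`;
}
```

For a product `foo`, this gives `foo` for the program, `foo-sdk` for `sdks.ts`, and `foo-frontend` for `repos.frontend`.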

notion-feed is deny-by-default. Set NOTION_TOKEN to a Notion integration token with read access to the workspace pages/databases you want indexed, then add those IDs to [notion].allowed_teamspaces or [notion].allowed_databases in ants.config.toml. Empty allowlists return a clean no-op. The orchestrator also runs a startup handshake; if the token is present but invalid or unreachable, Notion is disabled for that session instead of wedging later calls.
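
Deny-by-default can be sketched as a filter that keeps a search hit only when its parent is explicitly allowlisted. The `NotionHit` shape below is an assumption made for illustration and is not the Notion API's response type; the allowlist keys mirror the `[notion]` config names.

```typescript
interface NotionAllowlists {
  allowed_teamspaces: string[];
  allowed_databases: string[];
}

// Hypothetical, simplified view of a search hit.
interface NotionHit {
  id: string;
  parentType: "teamspace" | "database" | "page" | "workspace";
  parentId: string;
}

function filterAllowed(hits: NotionHit[], allow: NotionAllowlists): NotionHit[] {
  const teamspaces = new Set(allow.allowed_teamspaces);
  const databases = new Set(allow.allowed_databases);
  // Anything not explicitly listed is dropped, so empty allowlists yield [].
  return hits.filter(
    (hit) =>
      (hit.parentType === "teamspace" && teamspaces.has(hit.parentId)) ||
      (hit.parentType === "database" && databases.has(hit.parentId)),
  );
}
```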

Commands

| Command | Description |
| --- | --- |
| `bun run idl-watcher:sync` | Snapshot every configured program's IDL from on-chain and write to the vault |
| `bun run idl-watcher:pr <url\|owner/repo#N>` | Compare a PR's IDL against `main` and render the diff |
| `bun run github-feed:sync [entity]` | Pull merged PRs, releases, and issues for all repos (or one) |
| `bun run sdk-feed:sync` | Snapshot every SDK's public API surface from `main` |
| `bun run sdk-feed:pr <url\|owner/repo#N>` | Compare an SDK PR's surface against base and flag breaking changes |
| `bun run docs-feed` | Sync public docs into the vault (incremental, uses cache) |
| `bun run docs-feed:full` | Full docs sync: ignore the cache and fetch everything |
| `bun run docs-feed:dry-run` | Report docs changes without writing to the vault |
| `bun run analytics-feed:sync` | Pull product metrics from Amplitude and snapshot into the vault |
| `bun run analytics-feed:dry-run` | List registered metrics without hitting Amplitude or writing |
| `bun run meeting-feed:file <path>` | Transcribe a Craig recording (`.zip` or directory of `.flac`) into a meeting page |
| `bun run meeting-feed:watch [dir]` | Watch `$ANTS_RECORDINGS/` (or arg) and auto-ingest every new `.zip` |
| `bun run meeting-feed:eval` | Score transcription quality (WER) against hand-corrected goldens. See feeders/meeting-feed/eval-README.md |
| `bun run twitter-feed:sync [entity]` | Snapshot tweets for all tracked handles (or one) plus configured mentions |
| `bun run twitter-feed:mentions` | Refresh just the mentions aggregator |
| `bun run notion-feed:sync [entity]` | Search configured Notion queries (or one entity term) and write scoped summaries under `ants/notion/` |
| `bun run notion-feed:diag [query...]` | Dump raw Notion `/search` results with title + parent type/ID so you can see what UUIDs to add to `[notion].allowed_teamspaces` / `allowed_databases`. Does not write to the vault. |
| `bun run notion-feed:clean [--yes\|--dry-run]` | One-shot reset of `<vault>/ants/notion/`. Run when scope inference or the slugifier changes and stale files would otherwise pile up. |
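
The three docs-feed variants in the table differ only in how they treat the cache and whether they write. The sketch below shows one plausible way to model that, assuming the cache stores a content hash per URL; the hash choice (FNV-1a over UTF-16 code units), the cache format, and the names `planDocsSync` and `SyncMode` are all assumptions rather than the feeder's actual design.

```typescript
type SyncMode = "incremental" | "full" | "dry-run";

interface SyncPlan {
  changed: string[]; // URLs whose pages would be (re)written
  write: boolean;    // false means report only
}

// FNV-1a style hash: small and non-cryptographic, enough to notice edits.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

function planDocsSync(
  pages: Map<string, string>, // url mapped to freshly fetched content
  cache: Map<string, string>, // url mapped to hash from the previous run
  mode: SyncMode,
): SyncPlan {
  const changed: string[] = [];
  pages.forEach((content, url) => {
    const unchanged = cache.get(url) === fnv1a(content);
    // "full" ignores the cache; the other modes skip unchanged pages.
    if (mode === "full" || !unchanged) changed.push(url);
  });
  return { changed: changed.sort(), write: mode !== "dry-run" };
}
```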

Tuning meeting transcription

When transcripts come back wrong, you have five knobs:

Glossary — add domain terms (product names, partner names, team handles) to feeders/shared/whisper-prompt.toml. Whisper biases toward words it has seen in the prompt.
Per-meeting context — pass --context "discussion of X" to meeting-feed:file. Layered on top of the glossary, just for that meeting.
Team handles — uncomment and fill in real Discord/Slack display names in whisper-prompt.toml's glossary list.
Model swap — point WHISPER_MODEL at a larger model (ggml-large-v3.bin instead of turbo) for higher accuracy at slower speed.
Eval before/after — bun run meeting-feed:eval re-runs the pipeline against fixed samples and reports WER deltas. Without this you're tuning blind. See feeders/meeting-feed/eval-README.md.
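
For readers unfamiliar with the WER figure the eval reports: word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic implementation, not the code in meeting-feed's eval; it lowercases and splits on whitespace only, so punctuation handling is left out.

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((word) => word.length > 0);
}

function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  // WER is undefined for an empty reference; this sketch reports 0 when both
  // sides are empty and 1 otherwise.
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // prev[j] holds the edit distance between the reference words seen so far
  // and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;      // reference word missing from hypothesis
      const insertion = curr[j - 1] + 1; // extra word in hypothesis
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

A hypothesis that drops one word from a four-word reference scores 0.25; lower is better, and comparing the score before and after a glossary or model change shows whether the change helped.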

Remote MCP access

The orchestrator (feeders/orchestrator/main.ts) is the MCP server that exposes research plus per-feeder tools to any MCP client (Claude Code, Claude Desktop, ChatGPT custom GPTs, scripts).

It runs in two modes:

| Mode | When to use | Auth |
| --- | --- | --- |
| `stdio` (default) | Local clients on the host, or remote clients spawned via SSH | SSH key |
| `http` | Remote clients reaching the host over the internet via a tunnel (no inbound SSH, no shared keys) | `Authorization: Bearer $ANTS_MCP_TOKEN` |

One-time setup

# 1. Generate a strong token and add it to .env
echo "ANTS_MCP_TOKEN=$(openssl rand -hex 32)" >> .env
echo "ANTS_MCP_TRANSPORT=http" >> .env

# 2. Install launchd agents (macOS) — feeders + orchestrator
./scripts/launchd/install.sh

# 3. Confirm the orchestrator is listening locally
curl -s http://127.0.0.1:7777/health
# → ok

# 4. Expose it publicly. Pick one:

# 4a. Tailscale Funnel — backed by your tailnet
tailscale funnel --bg --https=443 http://localhost:7777
tailscale funnel status
# → https://<machine>.<tailnet>.ts.net is now publicly reachable

# 4b. ngrok — stable reserved domain, useful when clients can't join the
#     tailnet (ChatGPT Enterprise, Claude.ai web, etc.)
#     Sign up at https://dashboard.ngrok.com, reserve a static domain, then:
echo "NGROK_AUTHTOKEN=<from-dashboard>" >> .env
echo "NGROK_DOMAIN=<your-subdomain>.ngrok.app" >> .env
./scripts/launchd/install.sh --label ngrok
# → https://<your-subdomain>.ngrok.app/mcp is now publicly reachable

Funnel runs as part of tailscaled, so it persists across reboots without its own launchd agent. Tear it down with tailscale funnel --https=443 off.

ngrok runs under its own launchd agent (KeepAlive=true). Tear it down with ./scripts/launchd/install.sh --uninstall --label ngrok. Logs at ~/.ants/logs/ngrok.log.

Wire it into a remote MCP client

Drop this block into your MCP client config:

{
  "mcpServers": {
    "ants": {
      "url": "https://<machine>.<tailnet>.ts.net/mcp",
      "headers": {
        "Authorization": "Bearer <your ANTS_MCP_TOKEN>"
      }
    }
  }
}

Smoke test from any laptop:

curl -s -X POST https://<machine>.<tailnet>.ts.net/mcp \
  -H "Authorization: Bearer $ANTS_MCP_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
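
The same smoke test can be scripted. The sketch below only builds the request, mirroring the curl flags above, so it can be inspected without network access; `buildToolsListRequest` and `McpRequest` are hypothetical names, and the commented usage assumes a runtime with a global `fetch` (Bun or recent Node).

```typescript
interface McpRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildToolsListRequest(baseUrl: string, token: string): McpRequest {
  return {
    // Strip trailing slashes so the path is always exactly /mcp.
    url: `${baseUrl.replace(/\/+$/, "")}/mcp`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
        Accept: "application/json, text/event-stream",
      },
      body: JSON.stringify({ jsonrpc: "2.0", id: 1, method: "tools/list", params: {} }),
    },
  };
}

// Usage (needs network access and a running orchestrator):
//   const { url, init } = buildToolsListRequest(
//     "https://<machine>.<tailnet>.ts.net",
//     process.env.ANTS_MCP_TOKEN ?? "",
//   );
//   const res = await fetch(url, init);
//   console.log(res.status, await res.text());
```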

Rotating the token

# 1. Edit .env on the host, set a new ANTS_MCP_TOKEN
# 2. Restart the daemon so it picks up the new value
launchctl kickstart -k gui/$(id -u)/<your-launchd-label>
# 3. Update every client's config with the new token

Threat model

Anyone with the token can call every tool the orchestrator exposes — triggering feeder runs, ingesting recordings, writing into the vault. They cannot read arbitrary files outside the vault, and they cannot run shell commands. Treat the token like an API key with full vault write access.
The orchestrator binds 127.0.0.1 by default. Funnel relays the encrypted connection through Tailscale's edge to the host, where tailscaled terminates TLS and proxies to localhost:7777. Without a tunnel (Funnel or ngrok), only processes on the host can reach /mcp.
/health is unauthenticated by design (Funnel and uptime probes need it). It returns no information about the vault or the token.
Rotate the token regularly, or immediately if a client laptop is lost. There is no per-client revocation — one token, all clients.
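
The access rules above (open /health, bearer token on everything else, one shared token) can be sketched as a single gate function. This is an illustration of the policy, not the orchestrator's code; `authorize` and `constantTimeEqual` are hypothetical names, and a production server would more likely use `timingSafeEqual` from `node:crypto` for the comparison.

```typescript
// Compare every position even after a mismatch, so response time does not
// reveal where the first differing character is.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    // charCodeAt returns NaN past the end of a string; treat that as 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

type Decision = "allow" | "deny";

function authorize(path: string, authorization: string | undefined, token: string): Decision {
  if (path === "/health") return "allow"; // unauthenticated by design
  if (token.length === 0) return "deny";  // never accept an unset token
  const prefix = "Bearer ";
  if (!authorization || !authorization.startsWith(prefix)) return "deny";
  return constantTimeEqual(authorization.slice(prefix.length), token) ? "allow" : "deny";
}
```

Because there is a single token, a match grants access to every tool; the gate has no notion of which client is calling.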

Knowledge Model

Every page uses GBrain's compiled truth + timeline pattern:

Compiled truth (above ---): Current synthesized state. Gets overwritten on each run.
Timeline (below ---): Append-only evidence trail in the log sense: existing entries are never edited or removed, and each run prepends its new entry so the newest comes first.

This means a program page always shows the latest IDL state at the top, with a history of all changes below.
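
A minimal sketch of that update rule follows. It assumes the page body has already had its YAML frontmatter stripped, that the separator is a line containing exactly `---`, and that each timeline entry is a single line; `parseBody` and `applyRun` are hypothetical names, and the dates in the usage note are made up.

```typescript
const SEPARATOR = "---";

interface PageBody {
  truth: string;      // compiled truth, above the separator
  timeline: string[]; // one entry per line, newest first
}

function parseBody(body: string): PageBody {
  const lines = body.split("\n");
  const index = lines.indexOf(SEPARATOR);
  if (index === -1) return { truth: body.trim(), timeline: [] };
  return {
    truth: lines.slice(0, index).join("\n").trim(),
    timeline: lines.slice(index + 1).filter((line) => line.trim() !== ""),
  };
}

function applyRun(body: string, newTruth: string, entry: string): string {
  const { timeline } = parseBody(body);
  // The old compiled truth is discarded; old timeline entries are kept
  // verbatim and the new entry goes on top.
  const updated = [entry, ...timeline];
  return [newTruth.trim(), "", SEPARATOR, "", ...updated, ""].join("\n");
}
```

Applying a run with entry `- 2024-02-01: second` to a page whose timeline holds `- 2024-01-01: first` replaces the text above the separator and leaves both entries below it, newest first.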

Skills

Skills are reusable agent workflows that live alongside the feeders. The skill system (loader, registry, resolver, runner, MCP tools skill_list / skill_create / skill_validate / skill_reload / skill_resolve) is generic — see skills/AUTHORING.md for how to author one. No skills ship with this repo; bring your own.

v1

The original implementation (ClickHouse-based with custom search) is archived in v1/. v2 and later versions will be shipped to this repo in batches.

License

MIT — see LICENSE.
