feat: add .actorignore to all templates and populate dataset_schema.json fields #817

Draft
DaveHanns wants to merge 1 commit into master from fix/actorignore-and-dataset-schemas

@DaveHanns

Contributor

Summary

Library-wide hygiene + observability fix for templates created via apify create or fetched via @apify/actor-templates. Three coordinated changes:

  1. .actorignore added to all 44 templates so apify push excludes build artifacts, dependencies, local storage, and dev/IDE files from the uploaded archive. Currently the entire library ships .dockerignore but zero templates ship .actorignore — every CLI-deploying agent or human pays for the gap.
  2. .actor/dataset_schema.json added to 6 data-producing templates that materially push to a dataset but didn't ship a starter schema (ts-empty, js-empty, python-empty, ts-bootstrap-cheerio-crawler, js-bootstrap-cheerio-crawler, python-scrapy).
  3. .fields populated in the 29 existing dataset_schema.json files (they were shipping "fields": {} empty alongside a populated views.overview.transformation.fields — the schemas were structurally incomplete). Derived .fields JSON Schema from each schema's existing field names + display-format hints.

Scope summary

| Change | File count |
| --- | --- |
| .actorignore added | 44 |
| .actor/dataset_schema.json added | 6 |
| .actor/dataset_schema.json .fields populated | 29 |
| Total | 79 |

Total diff: 79 files changed, 2311 insertions(+), 53 deletions(-).

.actorignore — language-appropriate contents

Three variants depending on the template's language:

  • Node (TS / JS — 25 templates): dist/, build/, *.tsbuildinfo, node_modules/, storage/, apify_storage/, crawlee_storage/, *.log, .env*, IDE/OS files, .git/.
  • Python (18 templates): __pycache__/, *.pyc, *.pyo, .venv/, venv/, env/, storage dirs, *.log, .env*, IDE/OS files, .git/.
  • Minimal (cli-start, shell-only): *.log, .env*, IDE/OS files, .git/.
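
For illustration, the three variants can be thought of as pattern lists rendered one per line. This is a sketch, not the files added in this PR: the names `COMMON`, `STORAGE`, `VARIANTS`, and `render_actorignore` are hypothetical, and the IDE/OS entries (`.idea/`, `.vscode/`, `.DS_Store`) are assumed examples, since the list above does not enumerate them.

```python
# Sketch only: pattern lists mirror the three variants described above.
# The IDE/OS entries are assumed examples; the PR text does not list them.
COMMON = ["*.log", ".env*", ".idea/", ".vscode/", ".DS_Store", ".git/"]
STORAGE = ["storage/", "apify_storage/", "crawlee_storage/"]

VARIANTS = {
    "node": ["dist/", "build/", "*.tsbuildinfo", "node_modules/"] + STORAGE + COMMON,
    "python": ["__pycache__/", "*.pyc", "*.pyo", ".venv/", "venv/", "env/"] + STORAGE + COMMON,
    "minimal": list(COMMON),
}


def render_actorignore(variant: str) -> str:
    """Return .actorignore content for a variant, one pattern per line."""
    return "\n".join(VARIANTS[variant]) + "\n"


if __name__ == "__main__":
    print(render_actorignore("python"))
```

Keeping the shared entries in one list makes it harder for the Node and Python variants to drift apart on logs, env files, and `.git/`.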

Templates intentionally NOT given a dataset_schema.json

Audited each template's main code file for Actor.pushData / Actor.push_data / equivalent. The following 9 templates do not push to a dataset and intentionally do not ship a dataset_schema.json:

  • cli-start — shell-only template; no data path.
  • js-langchain — writes only to the key-value store via Actor.setValue('OUTPUT', res). No pushData calls anywhere. (Already ships key_value_store_schema.json and output_schema.json — which is correct for its pattern.)
  • ts-standby, js-standby, python-standby — HTTP servers; request/response, no dataset writes.
  • ts-mcp-empty, ts-mcp-proxy, python-mcp-empty, python-mcp-proxy — MCP servers; serve endpoints, no dataset writes.

Shipping a dataset_schema for a non-dataset-writing template would imply a contract the template doesn't use and would mislead users about output location.
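
The audit described above could be approximated with a simple source scan. This is a rough sketch under stated assumptions, not the procedure actually used: `pushes_to_dataset` and `SOURCE_SUFFIXES` are hypothetical names, and the regex only catches the two call names given above, so "equivalent" write paths (for example a crawler helper or a Scrapy pipeline) would still need a manual look.

```python
import re
from pathlib import Path

# Matches only the two call names mentioned above; other ways of writing
# to a dataset are NOT detected by this sketch.
PUSH_CALL = re.compile(r"\bActor\.(pushData|push_data)\s*\(")
SOURCE_SUFFIXES = {".ts", ".js", ".py"}


def pushes_to_dataset(template_dir: Path) -> bool:
    """Return True if any source file under template_dir calls Actor.pushData or Actor.push_data."""
    for path in template_dir.rglob("*"):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        if "node_modules" in path.parts:
            continue  # skip installed dependencies
        text = path.read_text(encoding="utf-8", errors="ignore")
        if PUSH_CALL.search(text):
            return True
    return False
```

A template for which this returns `False` is only a candidate for the "no dataset schema" list; the final call still depends on reading the main code file.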

.fields population approach

For the 29 existing dataset_schema.json files with empty .fields, derived fields.properties from the schema's existing views.overview.transformation.fields list, with types mapped from the corresponding views.overview.display.properties.<name>.format:

| Display format | Derived JSON Schema |
| --- | --- |
| link, image | `{ "type": "string", "format": "uri" }` |
| array | `{ "type": "array", "items": { "type": "string" } }` |
| number | `{ "type": "number" }` |
| boolean | `{ "type": "boolean" }` |
| date | `{ "type": "string", "format": "date-time" }` |
| object | `{ "type": "object", "additionalProperties": true }` |
| text / default | `{ "type": "string" }` |

Each populated schema also gains a "required": [...] listing all derived field names. The structure remains internally consistent with what the existing views already declared.
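
The derivation can be sketched as a small function over a parsed dataset_schema.json. This is a minimal illustration of the mapping above, not the script used for this PR: `derive_fields`, `FORMAT_TO_SCHEMA`, and the sample field names (`title`, `url`, `price`) are hypothetical, and the sketch emits only `properties` and `required`, so any other keys the real `.fields` objects carry are not shown.

```python
import copy
import json

# Display format -> JSON Schema fragment, following the mapping table above.
FORMAT_TO_SCHEMA = {
    "link": {"type": "string", "format": "uri"},
    "image": {"type": "string", "format": "uri"},
    "array": {"type": "array", "items": {"type": "string"}},
    "number": {"type": "number"},
    "boolean": {"type": "boolean"},
    "date": {"type": "string", "format": "date-time"},
    "object": {"type": "object", "additionalProperties": True},
}
DEFAULT_SCHEMA = {"type": "string"}  # "text" and any unlisted format


def derive_fields(schema: dict) -> dict:
    """Build a `fields` object from views.overview of a parsed dataset schema."""
    overview = schema["views"]["overview"]
    names = overview["transformation"]["fields"]
    display_props = overview.get("display", {}).get("properties", {})
    properties = {}
    for name in names:
        fmt = display_props.get(name, {}).get("format", "text")
        # Deep-copy so properties never share a mutable fragment.
        properties[name] = copy.deepcopy(FORMAT_TO_SCHEMA.get(fmt, DEFAULT_SCHEMA))
    return {"properties": properties, "required": list(names)}


if __name__ == "__main__":
    # Hypothetical schema with an empty `fields`, as the 29 files had.
    sample = {
        "fields": {},
        "views": {"overview": {
            "transformation": {"fields": ["title", "url", "price"]},
            "display": {"properties": {
                "url": {"format": "link"},
                "price": {"format": "number"},
            }},
        }},
    }
    sample["fields"] = derive_fields(sample)
    print(json.dumps(sample["fields"], indent=2))
```

Fields with no display entry fall back to `{"type": "string"}`, which matches the "text / default" row of the table.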

What this fixes downstream

Closes the following items from the chocholous/apify-evals finding tracker:

  • F8 / F27: agents and humans ship bloated archives (no .actorignore).
  • F29: agents ship dist/ alongside src/ because the Dockerfile doesn't get a chance to rebuild from scratch.
  • F26: ts-empty (and other empty starters) ship no dataset_schema.json, triggering T1 soft-warns on every template-using agent.
  • F3: .fields empty soft-warn fires on every template-derived Actor.

Out of scope (separate PRs)

  • The 16 JS/TS templates where name is fundamentally different from id (e.g. js-empty / project_empty, js-start / getting_started_node) warrant a coordinated rename + apify-cli alias support to maintain backward compatibility. Tracked separately.
  • The @apify/json_schemas actor.schema.json .version regex being looser than the platform's MAJOR_MINOR_VERSION_REGEX is an apify-shared-js issue, not a templates one.

Test plan

  • CI green on existing tests
  • Spot-check apify create -t ts-empty produces a scaffold that includes both .actorignore and .actor/dataset_schema.json after dist/templates/ts-empty.zip is regenerated by CI
  • Spot-check apify push archive size shrinks for a template scaffold (no node_modules/ or storage/)
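
The first spot-check could be scripted against a freshly scaffolded directory. The helper below is a hypothetical sketch (`check_scaffold` is not part of apify-cli or this repository); its third check assumes a scaffold's schema should carry a non-empty `fields` object, which this PR implies but the test plan does not state.

```python
import json
from pathlib import Path


def check_scaffold(project_dir: Path) -> list[str]:
    """Return a list of problems found in a scaffolded project; empty means all checks pass."""
    problems = []
    if not (project_dir / ".actorignore").is_file():
        problems.append("missing .actorignore")
    schema_path = project_dir / ".actor" / "dataset_schema.json"
    if not schema_path.is_file():
        problems.append("missing .actor/dataset_schema.json")
    else:
        schema = json.loads(schema_path.read_text(encoding="utf-8"))
        # Assumption: an empty or absent `fields` object counts as a problem.
        if not schema.get("fields"):
            problems.append("dataset_schema.json has empty .fields")
    return problems


if __name__ == "__main__":
    import sys

    found = check_scaffold(Path(sys.argv[1] if len(sys.argv) > 1 else "."))
    print("\n".join(found) if found else "scaffold OK")
```

Run it with the scaffold directory as the only argument; an empty result means both files are present and `.fields` is populated.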

🤖 Generated with Claude Code

…son fields

Library-wide hygiene + observability fix for templates created via apify create
or fetched via @apify/actor-templates.

## .actorignore (all 44 templates)

Adds .actorignore to every template directory so `apify push` excludes:
- Build artifacts the platform regenerates inside the Docker image
  (dist/, build/, *.tsbuildinfo for Node; __pycache__/, *.pyc for Python)
- Dependencies (node_modules/, .venv/) — installed by the platform builder
- Local-development storage (storage/, apify_storage/, crawlee_storage/)
- Logs / env / IDE / OS files / .git/ history

Three .actorignore variants by language: Node (TS/JS, 25 templates),
Python (18 templates), minimal (cli-start, shell-only).

Closes the existing `apify push` upload-bloat soft-warn that fires on every
CLI-deploying agent or human.

## .actor/dataset_schema.json — added to 6 templates

Adds a starter dataset_schema.json to 6 templates that materially push
records to a dataset during normal execution but didn't ship one:
- ts-empty, js-empty, python-empty (starter scaffolds)
- ts-bootstrap-cheerio-crawler, js-bootstrap-cheerio-crawler (crawlers)
- python-scrapy (Scrapy-based crawler)

Deliberately NOT added (9 templates that don't push to the dataset by
design):
- cli-start (shell-only, no data path)
- js-langchain (writes only to key-value store via Actor.setValue('OUTPUT', res))
- ts-standby, js-standby, python-standby (HTTP servers — request/response)
- ts-mcp-empty, ts-mcp-proxy, python-mcp-empty, python-mcp-proxy (MCP servers)

Shipping a dataset schema for a non-dataset-writing template would imply a
contract the template doesn't use and mislead users about output location.

## .actor/dataset_schema.json — `.fields` populated in 29 existing schemas

The 29 templates that already shipped dataset_schema.json had
`"fields": {}` (empty) but a populated `views.overview.transformation.fields`
array. Populated `.fields` by deriving JSON Schema properties from each
schema's existing field names + display-format hints:
- format=link/image -> {"type":"string","format":"uri"}
- format=array      -> {"type":"array","items":{"type":"string"}}
- format=number     -> {"type":"number"}
- format=boolean    -> {"type":"boolean"}
- format=date       -> {"type":"string","format":"date-time"}
- format=object     -> {"type":"object","additionalProperties":true}
- format=text/other -> {"type":"string"}

Closes the existing non-empty-.fields recommendation soft-warn for every
template-derived Actor.

## Out of scope (separate PRs)

- The 16 JS/TS templates with `name` != `id` in manifest.json
  (e.g. js-empty / project_empty, js-start / getting_started_node)
  warrant a coordinated rename + apify-cli alias support; addressed in
  a follow-up PR pair.
- The leftover @apify/json_schemas actor.schema.json regex looser than
  platform admission is an apify-shared-js issue, not a templates one.

Refs:
- chocholous/apify-evals findings F26 (no dataset_schema in ts-empty)
- chocholous/apify-evals findings F27 (no .actorignore library-wide)
- chocholous/apify-evals findings F29 (dist/ shipped because no .actorignore)
- chocholous/apify-evals findings F32 (agents reject apify create + hand-roll)
- chocholous/apify-evals findings F33 (template usage rate ~80% under all stack)
@github-actions github-actions Bot added this to the 143rd sprint - DX team milestone Jun 21, 2026
@github-actions github-actions Bot added the t-dx Issues owned by the DX team. label Jun 21, 2026
