feat: add .actorignore to all templates and populate dataset_schema.json fields #817
Draft
DaveHanns wants to merge 1 commit into
Library-wide hygiene + observability fix for templates created via apify create
or fetched via @apify/actor-templates.
## .actorignore (all 44 templates)
Adds .actorignore to every template directory so `apify push` excludes:
- Build artifacts the platform regenerates inside the Docker image
(dist/, build/, *.tsbuildinfo for Node; __pycache__/, *.pyc for Python)
- Dependencies (node_modules/, .venv/) — installed by the platform builder
- Local-development storage (storage/, apify_storage/, crawlee_storage/)
- Logs / env / IDE / OS files / .git/ history
Three .actorignore variants by language: Node (TS/JS, 25 templates),
Python (18 templates), minimal (cli-start, shell-only).
Closes the existing `apify push` upload-bloat soft-warn that fires for every
agent or human deploying via the CLI.
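To make the effect of the Node variant concrete, here is a minimal sketch of which paths such patterns would keep out of the upload. It is an approximation built on `fnmatch`, not the CLI's actual matcher; `NODE_PATTERNS`, `is_ignored`, and `files_to_upload` are hypothetical names, and the pattern list is only a subset of the variant described above.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Illustrative subset of the Node variant's patterns (assumption: gitignore-like syntax).
NODE_PATTERNS = ["dist/", "build/", "*.tsbuildinfo", "node_modules/", "storage/", "*.log", ".git/"]


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Approximate ignore matching for a relative POSIX-style file path."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignore any file below a directory with that name.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatch(parts[-1], pattern):
            # File pattern: matched against the file name only.
            return True
    return False


def files_to_upload(paths: list[str], patterns: list[str]) -> list[str]:
    """Return the paths that survive the ignore patterns, in input order."""
    return [p for p in paths if not is_ignored(p, patterns)]
```

With a typical scaffold listing, only source and `.actor/` files would remain; negation (`!`) and anchored patterns are deliberately not handled in this sketch.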
## .actor/dataset_schema.json — added to 6 templates
Adds a starter dataset_schema.json to 6 templates that materially push
records to a dataset during normal execution but didn't ship one:
- ts-empty, js-empty, python-empty (starter scaffolds)
- ts-bootstrap-cheerio-crawler, js-bootstrap-cheerio-crawler (crawlers)
- python-scrapy (Scrapy-based crawler)
Deliberately NOT added (9 templates that don't push to the dataset by
design):
- cli-start (shell-only, no data path)
- js-langchain (writes only to key-value store via Actor.setValue('OUTPUT', res))
- ts-standby, js-standby, python-standby (HTTP servers — request/response)
- ts-mcp-empty, ts-mcp-proxy, python-mcp-empty, python-mcp-proxy (MCP servers)
Shipping a dataset schema for a non-dataset-writing template would imply a
contract the template doesn't use and mislead users about output location.
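The split between dataset-writing and non-dataset-writing templates can be spot-checked with a text scan for `pushData` / `push_data` calls. The sketch below is an assumption about how such a check might look, not the audit actually used; `pushes_to_dataset` is a hypothetical helper, and it would miss indirect writes (for example through a Scrapy item pipeline).

```python
import re

# Matches a call to any function or method named pushData or push_data.
PUSH_CALL = re.compile(r"\b(?:pushData|push_data)\s*\(")


def pushes_to_dataset(source: str) -> bool:
    """Return True if the source text contains a pushData/push_data call."""
    return PUSH_CALL.search(source) is not None
```

A template whose main file only calls `Actor.setValue('OUTPUT', res)` returns `False` here, which matches the reasoning for leaving `js-langchain` without a dataset schema.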
## .actor/dataset_schema.json — `.fields` populated in 29 existing schemas
The 29 templates that already shipped dataset_schema.json had
`"fields": {}` (empty) but a populated `views.overview.transformation.fields`
array. Populated `.fields` by deriving JSON Schema properties from each
schema's existing field names + display-format hints:
- format=link/image -> {"type":"string","format":"uri"}
- format=array -> {"type":"array","items":{"type":"string"}}
- format=number -> {"type":"number"}
- format=boolean -> {"type":"boolean"}
- format=date -> {"type":"string","format":"date-time"}
- format=object -> {"type":"object","additionalProperties":true}
- format=text/other -> {"type":"string"}
Closes the existing non-empty-.fields recommendation soft-warn for every
template-derived Actor.
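The mapping above can be sketched as a small function. This is an illustration under stated assumptions, not the script used for this PR: `derive_fields` is a hypothetical name, the format hints are assumed to live at `views.overview.display.properties.<name>.format`, and the `required` list reflects the PR description's note that every derived field name is listed as required.

```python
import copy

# Display-format hint -> JSON Schema fragment, as listed above.
FORMAT_TO_SCHEMA = {
    "link": {"type": "string", "format": "uri"},
    "image": {"type": "string", "format": "uri"},
    "array": {"type": "array", "items": {"type": "string"}},
    "number": {"type": "number"},
    "boolean": {"type": "boolean"},
    "date": {"type": "string", "format": "date-time"},
    "object": {"type": "object", "additionalProperties": True},
}
DEFAULT_SCHEMA = {"type": "string"}  # text and any unrecognised format


def derive_fields(dataset_schema: dict) -> dict:
    """Build a `.fields` value from the schema's existing overview view."""
    overview = dataset_schema["views"]["overview"]
    names = overview["transformation"]["fields"]
    display = overview.get("display", {}).get("properties", {})
    properties = {}
    for name in names:
        fmt = display.get(name, {}).get("format")
        # Copy so that callers can edit one property without touching the table.
        properties[name] = copy.deepcopy(FORMAT_TO_SCHEMA.get(fmt, DEFAULT_SCHEMA))
    return {"properties": properties, "required": list(names)}
```

Deriving from `transformation.fields` keeps the field order stable, and a field with no display entry simply falls back to a string.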
## Out of scope (separate PRs)
- The 16 JS/TS templates with `name` != `id` in manifest.json
(e.g. js-empty / project_empty, js-start / getting_started_node)
warrant a coordinated rename + apify-cli alias support; addressed in
a follow-up PR pair.
- The remaining issue, the @apify/json_schemas actor.schema.json regex being
looser than platform admission, is an apify-shared-js issue, not a templates one.
Refs:
- chocholous/apify-evals findings F26 (no dataset_schema in ts-empty)
- chocholous/apify-evals findings F27 (no .actorignore library-wide)
- chocholous/apify-evals findings F29 (dist/ shipped because no .actorignore)
- chocholous/apify-evals findings F32 (agents reject apify create + hand-roll)
- chocholous/apify-evals findings F33 (template usage rate ~80% under all stack)
## Summary
Library-wide hygiene + observability fix for templates created via `apify create` or fetched via `@apify/actor-templates`. Three coordinated changes:
- `.actorignore` added to all 44 templates so `apify push` excludes build artifacts, dependencies, local storage, and dev/IDE files from the uploaded archive. Currently the entire library ships `.dockerignore` but zero templates ship `.actorignore`, so every CLI-deploying agent or human pays for the gap.
- `.actor/dataset_schema.json` added to 6 data-producing templates that materially push to a dataset but didn't ship a starter schema (`ts-empty`, `js-empty`, `python-empty`, `ts-bootstrap-cheerio-crawler`, `js-bootstrap-cheerio-crawler`, `python-scrapy`).
- `.fields` populated in the 29 existing `dataset_schema.json` files. They were shipping `"fields": {}` (empty) alongside a populated `views.overview.transformation.fields`, so the schemas were structurally incomplete. Derived the `.fields` JSON Schema from each schema's existing field names + display-format hints.
## Scope summary
- `.actorignore` added: 44 templates
- `.actor/dataset_schema.json` added: 6 templates
- `.actor/dataset_schema.json` `.fields` populated: 29 templates
Total diff: 79 files changed, 2311 insertions(+), 53 deletions(-).
## .actorignore: language-appropriate contents
Three variants depending on the template's language:
- Node (TS/JS): `dist/`, `build/`, `*.tsbuildinfo`, `node_modules/`, `storage/`, `apify_storage/`, `crawlee_storage/`, `*.log`, `.env*`, IDE/OS files, `.git/`.
- Python: `__pycache__/`, `*.pyc`, `*.pyo`, `.venv/` / `venv/` / `env/`, storage dirs, `*.log`, `.env*`, IDE/OS files, `.git/`.
- Minimal (`cli-start`, shell-only): `*.log`, `.env*`, IDE/OS files, `.git/`.
## Templates intentionally NOT given a dataset_schema.json
Audited each template's main code file for `Actor.pushData` / `Actor.push_data` / equivalent. The following 9 templates do not push to a dataset and intentionally do not ship one:
- `cli-start`: shell-only template; no data path.
- `js-langchain`: writes only to the key-value store via `Actor.setValue('OUTPUT', res)`. No `pushData` calls anywhere. (Already ships `key_value_store_schema.json` and `output_schema.json`, which is correct for its pattern.)
- `ts-standby`, `js-standby`, `python-standby`: HTTP servers; request/response, no dataset writes.
- `ts-mcp-empty`, `ts-mcp-proxy`, `python-mcp-empty`, `python-mcp-proxy`: MCP servers; serve endpoints, no dataset writes.
Shipping a dataset_schema for a non-dataset-writing template would imply a contract the template doesn't use and would mislead users about output location.
## .fields population approach
For the 29 existing `dataset_schema.json` files with empty `.fields`, derived `fields.properties` from the schema's existing `views.overview.transformation.fields` list, with types mapped from the corresponding `views.overview.display.properties.<name>.format`:
- `link`, `image` -> `{ "type": "string", "format": "uri" }`
- `array` -> `{ "type": "array", "items": { "type": "string" } }`
- `number` -> `{ "type": "number" }`
- `boolean` -> `{ "type": "boolean" }`
- `date` -> `{ "type": "string", "format": "date-time" }`
- `object` -> `{ "type": "object", "additionalProperties": true }`
- `text` / default -> `{ "type": "string" }`
Each populated schema also gains a `"required": [...]` listing all derived field names. The structure remains internally consistent with what the existing `views` already declared.
## What this fixes downstream
Closes the following items from the `chocholous/apify-evals` finding tracker:
- … `.actorignore`).
- … `dist/` alongside `src/` because the Dockerfile doesn't get a chance to rebuild from scratch.
- … `ts-empty` (and other empty starters) ship no `dataset_schema.json`, triggering T1 soft-warns on every template-using agent.
- … `.fields` empty soft-warn fires on every template-derived Actor.
## Out of scope (separate PRs)
- … `name` is fundamentally different from `id` (e.g. `js-empty` / `project_empty`, `js-start` / `getting_started_node`) warrant a coordinated rename + `apify-cli` alias support to maintain backward compatibility. Tracked separately.
- The `@apify/json_schemas` `actor.schema.json` `.version` regex being looser than the platform's `MAJOR_MINOR_VERSION_REGEX` is an `apify-shared-js` issue, not a templates one.
## Test plan
- `apify create -t ts-empty` produces a scaffold that includes both `.actorignore` and `.actor/dataset_schema.json` after `dist/templates/ts-empty.zip` is regenerated by CI
- `apify push` archive size shrinks for a template scaffold (no `node_modules/` or `storage/`)

🤖 Generated with Claude Code