31 changes: 30 additions & 1 deletion docs/geneva/jobs/backfilling.mdx
@@ -11,6 +11,35 @@ Triggering backfill creates a distributed job to run the UDF and populate the co

**Checkpoints**: Each batch of UDF execution is checkpointed so that partial results are not lost in case of job failures. Jobs can resume and avoid most of the expense of having to recalculate values.

## Adaptive checkpoint sizing

Geneva can automatically adjust checkpoint sizes during a backfill. It starts with small checkpoints (faster proof-of-life) and grows them as it observes stable throughput, while staying within safe bounds. Planning still uses your configured checkpoint size (`checkpoint_size`), but the actual checkpoint chunks can be smaller when adaptive sizing is enabled.

Adaptive sizing is always clamped to bounds:

- `max_checkpoint_size`: Upper bound. Defaults to the job's checkpoint size (`checkpoint_size`) and is capped at that value if you set a larger max.
- `min_checkpoint_size`: Lower bound. Defaults to 1.

When `min_checkpoint_size == max_checkpoint_size`, adaptive sizing is disabled and checkpoints are fixed-size.

You can set adaptive bounds in two places:

- On the UDF definition via `@udf(..., min_checkpoint_size=..., max_checkpoint_size=...)`
- On the backfill call via `table.backfill(..., min_checkpoint_size=..., max_checkpoint_size=...)`

Backfill-level values take precedence over UDF defaults.

<CodeGroup>
```python Python icon="python"
@udf(min_checkpoint_size=25, max_checkpoint_size=200)
def embed_udf(text):
    ...

# Override the UDF defaults for this run
tbl.backfill("embedding", min_checkpoint_size=10, max_checkpoint_size=100)
```
</CodeGroup>
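
If you want fixed-size checkpoints for a particular run, set both bounds to the same value (a minimal sketch; the value 50 here is arbitrary):

<CodeGroup>
```python Python icon="python"
# Equal bounds disable adaptive sizing: every checkpoint is exactly 50 rows.
tbl.backfill("embedding", min_checkpoint_size=50, max_checkpoint_size=50)
```
</CodeGroup>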

## Managing concurrency

One way to speed up the execution of a job is to give it more resources and to have it work in parallel. There are a few settings you can use on the backfill command to tune this.
@@ -95,4 +124,4 @@ tbl.backfill("embedding", where="content is not null and embedding is not null"

Reference:
* [`backfill` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill)
* [`backfill_async` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill_async)
4 changes: 2 additions & 2 deletions docs/geneva/jobs/performance.mdx
@@ -26,7 +26,7 @@ The `Table.backfill(..) ` method has several optional arguments to tune performa

`commit_granularity` controls how frequently fragments are committed so that partial results can become visible to table readers.

Setting `checkpoint_size` smaller introduces finer-grained checkpoints and can help provide more frequent proof of life as a job is being executed. This is useful if the computation on your data is expensive.
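
As a sketch, both knobs can be passed on the same call; the specific values below are illustrative, not recommendations, and the exact semantics of `commit_granularity` are described in the `backfill` API reference:

<CodeGroup>
```python Python icon="python"
# Smaller checkpoints give more frequent proof of life for an expensive UDF;
# commit_granularity controls how often completed fragments become visible to readers.
tbl.backfill(
    "embedding",
    checkpoint_size=50,     # illustrative: finer-grained checkpoints
    commit_granularity=10,  # illustrative value; see the backfill API reference for exact units
)
```
</CodeGroup>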

Reference:
* [`backfill` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill)
@@ -44,4 +44,4 @@ Certain jobs that take a small data set and expand it may appear as if the write

An example is a table that contains a list of URLs pointing to large media files. This list is relatively small (< 100MB) and can fit into a single fragment. A UDF that downloads these files will fetch all the data and then attempt to write all of it out through a single writer. That single writer can then be responsible for serially writing 500+GB of data to a single file!

To mitigate this, you can load your initial table so that there will be multiple fragments. Each fragment with new outputs can be written in parallel with higher write throughput.
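
One way to end up with multiple fragments is to ingest the initial rows in several smaller writes instead of one large write. The sketch below assumes the table exposes an `add` method and that each call produces at least one new fragment; the URL list and chunk size are placeholders, so adapt this to your actual ingestion path.

<CodeGroup>
```python Python icon="python"
import pyarrow as pa

# Placeholder URL list; in practice this comes from your own source.
urls = [f"s3://example-bucket/video_{i}.mp4" for i in range(50_000)]

# Assumption: each add() call writes at least one new fragment, so the
# downstream backfill output can be written by multiple writers in parallel.
chunk_size = 5_000
for start in range(0, len(urls), chunk_size):
    tbl.add(pa.table({"url": urls[start:start + chunk_size]}))
```
</CodeGroup>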