rhcc mapping failure can cause repeated layer downloads in disconnected environments

## Summary

In disconnected or egress-restricted environments, Clair can repeatedly re-index the same RHEL-based image manifests when it cannot reach Red Hat mapping files hosted at `security.access.redhat.com`.

Because the mapping fetch failure is returned as a fatal scanner error, the manifest scan aborts before a durable index result is persisted. The manifest remains eligible for indexing and is picked up again repeatedly. Each attempt fetches image layers from object storage before failing at the same CDN call, which can produce sustained and unbounded S3/object-storage read traffic.

I observed this generating approximately **20 TB/day** of traffic in production with **323 unscanned RHEL-based images**.

## Environment

- Red Hat Quay 3.15.4 with Quay Operator on OpenShift
- Clair managed by the Quay Operator
- Image layers stored on S3-compatible storage
- Cluster has no egress to `security.access.redhat.com`
- `ignore_unpatched` is `false` or unset in the Clair config

## What Happens

Two RHEL scanners fetch Red Hat mapping files during layer scans:

- `rhel/rhcc/scanner.go` (`rhel_containerscanner`, package scanner)
- `rhel/repositoryscanner.go` (`rhel-repository-scanner`, repository scanner)

The RHCC scanner fetches:

```text
https://security.access.redhat.com/data/metrics/container-name-repos-map.json
```

The RHEL repository scanner fetches:

```text
https://security.access.redhat.com/data/metrics/repository-to-cpe.json
```

When Clair cannot reach the CDN, the RHCC scanner can fail in `rhel/rhcc/scanner.go`:

```go
vi, err := s.upd.Get(tctx, s.client)
if err != nil && vi == nil {
    return nil, err
}

v, ok := vi.(*mappingFile)
if !ok || v == nil {
    return nil, fmt.Errorf("rhcc: unable to create a mappingFile object")
}
```

The repository scanner has the same fatal pattern in `rhel/repositoryscanner.go`:

```go
cmi, err := r.upd.Get(tctx, r.client)
if err != nil && cmi == nil {
    return []*claircore.Repository{}, err
}

cm, ok := cmi.(*mappingFile)
if !ok || cm == nil {
    return []*claircore.Repository{}, fmt.Errorf("rhel: unable to create a mappingFile object")
}
```

Both paths abort the index operation before a result is persisted. Since the manifest has no successful or degraded index result, it remains eligible for indexing and is retried repeatedly.

With multiple Clair replicas, the impact can be amplified because multiple pods may process failing manifests concurrently.

Typical log signature:

```json
{
  "level": "error",
  "component": "indexer/controller/Controller.Index",
  "manifest": "sha256:...",
  "state": "ScanLayers",
  "error": "failed to scan all layer contents: layer \"sha256:...\": rhcc: unable to create a mappingFile object",
  "message": "error during scan"
}
```

In our environment, the same manifest digests were retried every 8-30 seconds with no increasing delay.

## Measured Impact

I reproduced this on an isolated OpenShift test cluster using MinIO as the S3 backend and a NetworkPolicy blocking Clair egress.

| Condition                      | Object-storage traffic | Behaviour            |
| ------------------------------ | ---------------------: | -------------------- |
| 3 images, CDN blocked          |  ~8 GB/hr and climbing | Repeated re-indexing |
| 3 images, CDN blocked, patched |  ~0.001 GB/hr and flat | Results persisted    |
| Production, 323 images         |             ~20 TB/day | Same loop at scale   |

Scaling Clair to 0 replicas caused an immediate 93% traffic drop, confirming Clair as the primary source of traffic.

I also confirmed that affected manifests were not persisted in the `indexreport` table during the loop, and were persisted after applying the mitigation.

## Proposed Fix

Introduce a degraded terminal state, `IndexPartial`, for scans that produce usable index data but encounter scanner enrichment failures.

For the two RHEL mapping-file scanners, return `indexer.Partial(err)` instead of a fatal error when the mapping file cannot be fetched or the mapping object is invalid.

`rhel/rhcc/scanner.go`:

```go
vi, err := s.upd.Get(tctx, s.client)
if err != nil && vi == nil {
    slog.WarnContext(ctx,
        "rhcc: unable to fetch mapping file, skipping RHCC mapping enrichment",
        "error", err,
    )
    return pkgs, indexer.Partial(err)
}

v, ok := vi.(*mappingFile)
if !ok || v == nil {
    err := fmt.Errorf("rhcc: unable to create a mappingFile object")
    slog.WarnContext(ctx,
        "rhcc: unable to create mappingFile object, skipping RHCC mapping enrichment",
    )
    return pkgs, indexer.Partial(err)
}
```

`rhel/repositoryscanner.go`:

```go
cmi, err := r.upd.Get(tctx, r.client)
if err != nil && cmi == nil {
    slog.WarnContext(ctx,
        "rhel: unable to fetch mapping file, skipping repository enrichment",
        "error", err,
    )
    return []*claircore.Repository{}, indexer.Partial(err)
}

cm, ok := cmi.(*mappingFile)
if !ok || cm == nil {
    err := fmt.Errorf("rhel: unable to create a mappingFile object")
    slog.WarnContext(ctx,
        "rhel: unable to create mappingFile object, skipping repository enrichment",
    )
    return []*claircore.Repository{}, indexer.Partial(err)
}
```

The indexer can then persist the scan as `IndexPartial`, mark the manifest scanned for normal request paths, and avoid immediate repeated layer downloads. Operators can still distinguish degraded results from fully successful `IndexFinished` reports.

The implementation in the linked PR adds:

1. `indexer.ErrScanPartial`, `indexer.PartialError`, and `indexer.Partial(err)`.
2. `IndexPartial` controller state handling.
3. `SetIndexPartial` persistence that stores the degraded index report and marks the manifest scanned.
4. `RequeueIndexPartials` to make old partial reports eligible for later re-indexing.
5. A PostgreSQL migration adding `indexreport.updated_at` and an index for stale partial reports.
6. A `libindex` background retry scheduler with a default 24-hour retry interval.

## Validation

Tested locally:

```sh
go test ./indexer/... ./libindex/... ./rhel/... ./datastore/postgres/...
```

Tested on OpenShift with MinIO as the S3 backend and NetworkPolicy blocking CDN egress.

The patched build behaved as expected:

- RHCC and RHEL repository mapping failures were logged as degraded scanner results.
- Scan results were persisted as `IndexPartial`.
- Repeated immediate rescans stopped.
- Object-storage traffic returned to baseline after the initial scan.

Log evidence of correct behaviour:

```text
WARN  rhel: unable to create mappingFile object, skipping repository enrichment
WARN  scanner produced degraded results  reason="partial scan: ..."
WARN  layers scan completed with partial results
INFO  manifest partially scanned
```

## Trade-off

This approach prioritizes stopping the repeated layer download storm while preserving the usable index data Clair already discovered.

When the mapping files are unavailable, RHCC package mapping enrichment and RHEL repository enrichment can be incomplete for affected images. However, the report remains distinguishable from a fully successful scan through `IndexPartial`.

To avoid permanently suppressing retries after a temporary CDN outage, the background retry scheduler makes partial reports eligible for re-indexing again once the retry interval (24 hours by default) has elapsed.

## Steps to Reproduce

1. Deploy Quay with Clair on OpenShift.
2. Block Clair egress to `security.access.redhat.com`.
3. Push or mirror a RHEL-based image, for example `registry.access.redhat.com/ubi9/ubi:9.3`.
4. Watch Clair logs for repeated mapping-file errors for the same manifest digest.
5. Watch object-storage traffic increase continuously.
6. Confirm the affected manifest is not persisted in the `indexreport` table.
7. Apply the patch and confirm traffic drops to baseline and the result is persisted as `IndexPartial`.

## Related

- PR: https://github.com/quay/claircore/pull/1892
- Red Hat Quay disconnected environments documentation: https://docs.redhat.com/en/documentation/red_hat_quay/3.15/pdf/vulnerability_reporting_with_clair_on_red_hat_quay/Red_Hat_Quay-3.15-Vulnerability_reporting_with_Clair_on_Red_Hat_Quay-en-US.pdf
- `ignore_unpatched` in `ScannerConfig`

