Summary
In disconnected or egress-restricted environments, Clair can repeatedly re-index the same RHEL-based image manifests when it cannot reach Red Hat mapping files hosted at security.access.redhat.com.
Because the mapping fetch failure is returned as a fatal scanner error, the manifest scan aborts before a durable index result is persisted. The manifest remains eligible for indexing and is picked up again repeatedly. Each attempt fetches image layers from object storage before failing at the same CDN call, which can produce sustained and unbounded S3/object-storage read traffic.
I observed this generating approximately 20 TB/day of traffic in production with 323 unscanned RHEL-based images.
Environment
- Red Hat Quay 3.15.4 with Quay Operator on OpenShift
- Clair managed by the Quay Operator
- Image layers stored on S3-compatible storage
- Cluster has no egress to security.access.redhat.com
- ignore_unpatched: false in Clair config, or unset
What Happens
Two RHEL scanners fetch Red Hat mapping files during layer scans:
- rhel/rhcc/scanner.go (rhel_containerscanner, package scanner)
- rhel/repositoryscanner.go (rhel-repository-scanner, repository scanner)
The RHCC scanner fetches:
https://security.access.redhat.com/data/metrics/container-name-repos-map.json
The RHEL repository scanner fetches:
https://security.access.redhat.com/data/metrics/repository-to-cpe.json
When Clair cannot reach the CDN, the RHCC scanner can fail in rhel/rhcc/scanner.go:
vi, err := s.upd.Get(tctx, s.client)
if err != nil && vi == nil {
    return nil, err
}
v, ok := vi.(*mappingFile)
if !ok || v == nil {
    return nil, fmt.Errorf("rhcc: unable to create a mappingFile object")
}
The repository scanner has the same fatal pattern in rhel/repositoryscanner.go:
cmi, err := r.upd.Get(tctx, r.client)
if err != nil && cmi == nil {
    return []*claircore.Repository{}, err
}
cm, ok := cmi.(*mappingFile)
if !ok || cm == nil {
    return []*claircore.Repository{}, fmt.Errorf("rhel: unable to create a mappingFile object")
}
Both paths abort the index operation before a result is persisted. Since the manifest has no successful or degraded index result, it remains eligible for indexing and is retried repeatedly.
With multiple Clair replicas, the impact can be amplified because multiple pods may process failing manifests concurrently.
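The loop described above can be illustrated with a toy model. This is a minimal sketch, not Clair's controller code: the toyIndexer type, its fields, the simulate helper, and the 100-byte layer size are all hypothetical. It only shows why a fatal scanner error that occurs after the layer fetch, and before anything is persisted, makes read traffic grow linearly with the number of attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// errMappingUnavailable stands in for the mapping-file fetch failure (hypothetical name).
var errMappingUnavailable = errors.New("mapping file unavailable")

// toyIndexer is a hypothetical model of the failure mode, not Clair's controller.
type toyIndexer struct {
	persisted    map[string]bool // manifests that have a stored index result
	bytesRead    int64           // cumulative object-storage reads
	layerBytes   int64           // bytes fetched from storage per attempt
	cdnReachable bool            // whether the mapping-file host can be reached
}

// indexOnce models one attempt: layers are fetched first, then the scanner runs.
func (t *toyIndexer) indexOnce(manifest string) error {
	if t.persisted[manifest] {
		return nil // already indexed, so no layer download happens
	}
	t.bytesRead += t.layerBytes
	if !t.cdnReachable {
		return errMappingUnavailable // fatal: nothing is persisted
	}
	t.persisted[manifest] = true
	return nil
}

// simulate runs n indexing passes over one manifest and returns the total bytes read.
func simulate(cdnReachable bool, n int) int64 {
	t := &toyIndexer{
		persisted:    map[string]bool{},
		layerBytes:   100,
		cdnReachable: cdnReachable,
	}
	for i := 0; i < n; i++ {
		_ = t.indexOnce("sha256:example") // the caller simply tries again on the next pass
	}
	return t.bytesRead
}

func main() {
	fmt.Println("CDN reachable, 10 passes:", simulate(true, 10))  // 100
	fmt.Println("CDN blocked, 10 passes:  ", simulate(false, 10)) // 1000
}
```

In the reachable case the layers are read once. In the blocked case every pass reads them again, which matches the shape of the traffic growth reported in this issue.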
Typical log signature:
{
    "level": "error",
    "component": "indexer/controller/Controller.Index",
    "manifest": "sha256:...",
    "state": "ScanLayers",
    "error": "failed to scan all layer contents: layer \"sha256:...\": rhcc: unable to create a mappingFile object",
    "message": "error during scan"
}
In our environment, the same manifest digests were retried every 8-30 seconds with no increasing delay.
Measured Impact
I reproduced this on an isolated OpenShift test cluster using MinIO as the S3 backend and a NetworkPolicy blocking Clair egress.
| Condition | Object-storage traffic | Behaviour |
| --- | --- | --- |
| 3 images, CDN blocked | ~8 GB/hr and climbing | Repeated re-indexing |
| 3 images, CDN blocked, patched | ~0.001 GB/hr and flat | Results persisted |
| Production, 323 images | ~20 TB/day | Same loop at scale |
Scaling Clair to 0 replicas caused an immediate 93% traffic drop, confirming Clair as the primary source of traffic.
I also confirmed that affected manifests were not persisted in the indexreport table during the loop, and were persisted after applying the mitigation.
Proposed Fix
Introduce a degraded terminal state, IndexPartial, for scans that produce usable index data but encounter scanner enrichment failures.
For the two RHEL mapping-file scanners, return indexer.Partial(err) instead of a fatal error when the mapping file cannot be fetched or the mapping object is invalid.
rhel/rhcc/scanner.go:
vi, err := s.upd.Get(tctx, s.client)
if err != nil && vi == nil {
    slog.WarnContext(ctx,
        "rhcc: unable to fetch mapping file, skipping RHCC mapping enrichment",
        "error", err,
    )
    return pkgs, indexer.Partial(err)
}
v, ok := vi.(*mappingFile)
if !ok || v == nil {
    err := fmt.Errorf("rhcc: unable to create a mappingFile object")
    slog.WarnContext(ctx,
        "rhcc: unable to create mappingFile object, skipping RHCC mapping enrichment",
    )
    return pkgs, indexer.Partial(err)
}
rhel/repositoryscanner.go:
cmi, err := r.upd.Get(tctx, r.client)
if err != nil && cmi == nil {
    slog.WarnContext(ctx,
        "rhel: unable to fetch mapping file, skipping repository enrichment",
        "error", err,
    )
    return []*claircore.Repository{}, indexer.Partial(err)
}
cm, ok := cmi.(*mappingFile)
if !ok || cm == nil {
    err := fmt.Errorf("rhel: unable to create a mappingFile object")
    slog.WarnContext(ctx,
        "rhel: unable to create mappingFile object, skipping repository enrichment",
    )
    return []*claircore.Repository{}, indexer.Partial(err)
}
The indexer can then persist the scan as IndexPartial, mark the manifest scanned for normal request paths, and avoid immediate repeated layer downloads. Operators can still distinguish degraded results from fully successful IndexFinished reports.
The implementation in the linked PR adds:
- indexer.ErrScanPartial, indexer.PartialError, and indexer.Partial(err).
- IndexPartial controller state handling.
- SetIndexPartial persistence that stores the degraded index report and marks the manifest scanned.
- RequeueIndexPartials to make old partial reports eligible for later re-indexing.
- A PostgreSQL migration adding indexreport.updated_at and an index for stale partial reports.
- A libindex background retry scheduler with a default 24-hour retry interval.
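To make the error-wrapping idea concrete, here is a minimal sketch of how a partial-scan wrapper and its classification could look. The names ErrScanPartial, PartialError, and Partial come from the list above, but their definitions here are assumptions and may differ from the linked PR. The classify helper, the errFetch value, and the nil handling in Partial are hypothetical and exist only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrScanPartial is a sentinel for degraded scans (name from the issue, definition assumed).
var ErrScanPartial = errors.New("partial scan")

// errFetch is a hypothetical stand-in for a mapping-file fetch failure.
var errFetch = errors.New("rhcc: unable to fetch mapping file")

// PartialError wraps the underlying cause of a degraded scan (assumed shape).
type PartialError struct{ Cause error }

func (e *PartialError) Error() string { return "partial scan: " + e.Cause.Error() }

// Unwrap exposes the cause so errors.Is and errors.As can reach it.
func (e *PartialError) Unwrap() error { return e.Cause }

// Is lets errors.Is(err, ErrScanPartial) match any PartialError in the chain.
func (e *PartialError) Is(target error) bool { return target == ErrScanPartial }

// Partial marks err as a non-fatal, degraded result. A nil error stays nil (assumption).
func Partial(err error) error {
	if err == nil {
		return nil
	}
	return &PartialError{Cause: err}
}

// classify is a hypothetical helper that maps a scan error to a terminal state name.
func classify(err error) string {
	switch {
	case err == nil:
		return "IndexFinished"
	case errors.Is(err, ErrScanPartial):
		return "IndexPartial"
	default:
		return "IndexError"
	}
}

func main() {
	// Wrapping with %w keeps the partial marker visible through added context.
	wrapped := fmt.Errorf("layer scan: %w", Partial(errFetch))
	fmt.Println(classify(nil))      // IndexFinished
	fmt.Println(classify(wrapped))  // IndexPartial
	fmt.Println(classify(errFetch)) // IndexError
}
```

Using a sentinel plus errors.Is means callers can detect a degraded result without knowing which scanner produced it, while the original cause stays available for logging.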
Validation
Tested locally:
go test ./indexer/... ./libindex/... ./rhel/... ./datastore/postgres/...
Tested on OpenShift with MinIO as the S3 backend and NetworkPolicy blocking CDN egress.
Expected patched behaviour was observed:
- RHCC and RHEL repository mapping failures were logged as degraded scanner results.
- Scan results were persisted as IndexPartial.
- Repeated immediate rescans stopped.
- Object-storage traffic returned to baseline after the initial scan.
Log evidence of correct behaviour:
WARN rhel: unable to create mappingFile object, skipping repository enrichment
WARN scanner produced degraded results reason="partial scan: ..."
WARN layers scan completed with partial results
INFO manifest partially scanned
Trade-off
This approach prioritizes stopping the repeated layer download storm while preserving the usable index data Clair already discovered.
When the mapping files are unavailable, RHCC package mapping enrichment and RHEL repository enrichment can be incomplete for affected images. However, the report remains distinguishable from a fully successful scan through IndexPartial.
To avoid permanently suppressing retries after a temporary CDN outage, partial reports can be made eligible for re-indexing later through the background retry scheduler.
Steps to Reproduce
- Deploy Quay with Clair on OpenShift.
- Block Clair egress to security.access.redhat.com.
- Push or mirror a RHEL-based image, for example registry.access.redhat.com/ubi9/ubi:9.3.
- Watch Clair logs for repeated mapping-file errors for the same manifest digest.
- Watch object-storage traffic increase continuously.
- Confirm the affected manifest is not persisted in the indexreport table.
- Apply the patch and confirm traffic drops to baseline and the result is persisted as IndexPartial.
Related
- ignore_unpatched in ScannerConfig