rhcc mapping failure can cause repeated layer downloads in disconnected environments #1884

@mohtork

Summary

In disconnected or egress-restricted environments, Clair can repeatedly re-index the same RHEL-based image manifests when it cannot reach Red Hat mapping files hosted at security.access.redhat.com.

Because the mapping fetch failure is returned as a fatal scanner error, the manifest scan aborts before a durable index result is persisted. The manifest remains eligible for indexing and is picked up again repeatedly. Each attempt fetches image layers from object storage before failing at the same CDN call, which can produce sustained and unbounded S3/object-storage read traffic.

In production, with 323 unscanned RHEL-based images, I observed this loop generating approximately 20 TB/day of object-storage read traffic.

Environment

  • Red Hat Quay 3.15.4 with Quay Operator on OpenShift
  • Clair managed by the Quay Operator
  • Image layers stored on S3-compatible storage
  • Cluster has no egress to security.access.redhat.com
  • ignore_unpatched: false in Clair config, or unset

What Happens

Two RHEL scanners fetch Red Hat mapping files during layer scans:

  • rhel/rhcc/scanner.go (rhel_containerscanner, package scanner)
  • rhel/repositoryscanner.go (rhel-repository-scanner, repository scanner)

The RHCC scanner fetches:

https://security.access.redhat.com/data/metrics/container-name-repos-map.json

The RHEL repository scanner fetches:

https://security.access.redhat.com/data/metrics/repository-to-cpe.json

When Clair cannot reach the CDN, the RHCC scanner can fail in rhel/rhcc/scanner.go:

vi, err := s.upd.Get(tctx, s.client)
if err != nil && vi == nil {
    return nil, err
}

v, ok := vi.(*mappingFile)
if !ok || v == nil {
    return nil, fmt.Errorf("rhcc: unable to create a mappingFile object")
}

The repository scanner has the same fatal pattern in rhel/repositoryscanner.go:

cmi, err := r.upd.Get(tctx, r.client)
if err != nil && cmi == nil {
    return []*claircore.Repository{}, err
}

cm, ok := cmi.(*mappingFile)
if !ok || cm == nil {
    return []*claircore.Repository{}, fmt.Errorf("rhel: unable to create a mappingFile object")
}

Both paths abort the index operation before a result is persisted. Since the manifest has no successful or degraded index result, it remains eligible for indexing and is retried repeatedly.

With multiple Clair replicas, the impact can be amplified because multiple pods may process failing manifests concurrently.

Typical log signature:

{
  "level": "error",
  "component": "indexer/controller/Controller.Index",
  "manifest": "sha256:...",
  "state": "ScanLayers",
  "error": "failed to scan all layer contents: layer \"sha256:...\": rhcc: unable to create a mappingFile object",
  "message": "error during scan"
}

In our environment, the same manifest digests were retried every 8 to 30 seconds, with no backoff between attempts.

Measured Impact

I reproduced this on an isolated OpenShift test cluster using MinIO as the S3 backend and a NetworkPolicy blocking Clair egress.

Condition                        Object-storage traffic    Behaviour
3 images, CDN blocked            ~8 GB/hr and climbing     Repeated re-indexing
3 images, CDN blocked, patched   ~0.001 GB/hr and flat     Results persisted
Production, 323 images           ~20 TB/day                Same loop at scale
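As a rough consistency check, the test-cluster and production figures can be compared per image. The sketch below assumes decimal units (1 TB = 1000 GB) and treats the "~8 GB/hr and climbing" reading as a point-in-time value, so the result is only indicative. The helper name perImageGBPerHour is hypothetical.

```go
package main

import "fmt"

// perImageGBPerHour converts an aggregate daily volume in TB into an average
// per-image hourly rate in GB, using decimal units (1 TB = 1000 GB).
func perImageGBPerHour(tbPerDay float64, images int) float64 {
	return tbPerDay * 1000 / 24 / float64(images)
}

func main() {
	prod := perImageGBPerHour(20, 323) // production figures from the table
	test := 8.0 / 3.0                  // test cluster: ~8 GB/hr across 3 images
	fmt.Printf("production:   %.2f GB/hr per image\n", prod)
	fmt.Printf("test cluster: %.2f GB/hr per image\n", test)
}
```

Both work out to roughly 2.6 to 2.7 GB/hr per image, which suggests the isolated reproduction and the production incident are the same loop at different scale.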

Scaling Clair to 0 replicas caused an immediate 93% traffic drop, confirming Clair as the primary source of traffic.

I also confirmed that affected manifests were not persisted in the indexreport table during the loop, and were persisted after applying the mitigation.

Proposed Fix

Introduce a degraded terminal state, IndexPartial, for scans that produce usable index data but encounter scanner enrichment failures.

For the two RHEL mapping-file scanners, return indexer.Partial(err) instead of a fatal error when the mapping file cannot be fetched or the mapping object is invalid.

rhel/rhcc/scanner.go:

vi, err := s.upd.Get(tctx, s.client)
if err != nil && vi == nil {
    slog.WarnContext(ctx,
        "rhcc: unable to fetch mapping file, skipping RHCC mapping enrichment",
        "error", err,
    )
    return pkgs, indexer.Partial(err)
}

v, ok := vi.(*mappingFile)
if !ok || v == nil {
    err := fmt.Errorf("rhcc: unable to create a mappingFile object")
    slog.WarnContext(ctx,
        "rhcc: unable to create mappingFile object, skipping RHCC mapping enrichment",
    )
    return pkgs, indexer.Partial(err)
}

rhel/repositoryscanner.go:

cmi, err := r.upd.Get(tctx, r.client)
if err != nil && cmi == nil {
    slog.WarnContext(ctx,
        "rhel: unable to fetch mapping file, skipping repository enrichment",
        "error", err,
    )
    return []*claircore.Repository{}, indexer.Partial(err)
}

cm, ok := cmi.(*mappingFile)
if !ok || cm == nil {
    err := fmt.Errorf("rhel: unable to create a mappingFile object")
    slog.WarnContext(ctx,
        "rhel: unable to create mappingFile object, skipping repository enrichment",
    )
    return []*claircore.Repository{}, indexer.Partial(err)
}

The indexer can then persist the scan as IndexPartial, mark the manifest scanned for normal request paths, and avoid immediate repeated layer downloads. Operators can still distinguish degraded results from fully successful IndexFinished reports.

The implementation in the linked PR adds:

  1. indexer.ErrScanPartial, indexer.PartialError, and indexer.Partial(err).
  2. IndexPartial controller state handling.
  3. SetIndexPartial persistence that stores the degraded index report and marks the manifest scanned.
  4. RequeueIndexPartials to make old partial reports eligible for later re-indexing.
  5. A PostgreSQL migration adding indexreport.updated_at and an index for stale partial reports.
  6. A libindex background retry scheduler with a default 24-hour retry interval.

Validation

Tested locally:

go test ./indexer/... ./libindex/... ./rhel/... ./datastore/postgres/...

Also tested on OpenShift, with MinIO as the S3 backend and a NetworkPolicy blocking Clair egress to the CDN.

Expected patched behaviour was observed:

  • RHCC and RHEL repository mapping failures were logged as degraded scanner results.
  • Scan results were persisted as IndexPartial.
  • Repeated immediate rescans stopped.
  • Object-storage traffic returned to baseline after the initial scan.

Log evidence of correct behaviour:

WARN  rhel: unable to create mappingFile object, skipping repository enrichment
WARN  scanner produced degraded results  reason="partial scan: ..."
WARN  layers scan completed with partial results
INFO  manifest partially scanned

Trade-off

This approach prioritizes stopping the repeated layer download storm while preserving the usable index data Clair already discovered.

When the mapping files are unavailable, RHCC package mapping enrichment and RHEL repository enrichment can be incomplete for affected images. However, the report remains distinguishable from a fully successful scan through IndexPartial.

To avoid permanently suppressing retries after a temporary CDN outage, partial reports can be made eligible for re-indexing later through the background retry scheduler.

Steps to Reproduce

  1. Deploy Quay with Clair on OpenShift.
  2. Block Clair egress to security.access.redhat.com.
  3. Push or mirror a RHEL-based image, for example registry.access.redhat.com/ubi9/ubi:9.3.
  4. Watch Clair logs for repeated mapping-file errors for the same manifest digest.
  5. Watch object-storage traffic increase continuously.
  6. Confirm the affected manifest is not persisted in the indexreport table.
  7. Apply the patch and confirm traffic drops to baseline and the result is persisted as IndexPartial.
