Cannot pull certain layers despite successful advertisement #752

Open
diasbro opened this issue Feb 26, 2025 · 1 comment
Labels: bug (Something isn't working)


diasbro commented Feb 26, 2025

Spegel version

v0.0.30

Kubernetes distribution

vanilla

Kubernetes version

v1.28.4

CNI

Cilium v1.13.9

Describe the bug

We are encountering an issue in a 60‑node Kubernetes cluster where certain image layers cannot be pulled via Spegel, even though one of the nodes definitely has them and Spegel logs show those layers are being advertised. In contrast, on a 10‑node cluster (same Spegel configuration), pulling the same images works fine.

> crictl pull mylocalharbor.example.com/proxy-cache/library/nginx:alpine
E0225 21:59:01.973081 1665303 remote_image.go:167] "PullImage from image service failed" err="rpc error: code = NotFound desc = failed to pull and unpack image \"mylocalharbor.example.com/proxy-cache/library/nginx:alpine\": failed to copy: httpReadSeeker: failed open: content at http://localhost:20020/v2/proxy-cache/library/nginx/manifests/sha256:a71e0884a7f1192ecf5decf062b67d46b54ad63f0cc1b8aa7e705f739a97c2fc?ns=mylocalharbor.example.com not found: not found" image="mylocalharbor.example.com/proxy-cache/library/nginx:alpine"
FATA[0000] pulling image: rpc error: code = NotFound desc = failed to pull and unpack image "mylocalharbor.example.com/proxy-cache/library/nginx:alpine": failed to copy: httpReadSeeker: failed open: content at http://localhost:20020/v2/proxy-cache/library/nginx/manifests/sha256:a71e0884a7f1192ecf5decf062b67d46b54ad63f0cc1b8aa7e705f739a97c2fc?ns=mylocalharbor.example.com not found: not found
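
To take containerd out of the picture, the same request can be replayed against the local Spegel mirror by hand. The following is a minimal sketch, not part of Spegel: `mirror_url` and `probe` are hypothetical helper names, and the base URL, path layout and `ns` query parameter are copied from the error message above. It assumes it is run on the node where the pull fails.

```python
import urllib.error
import urllib.parse
import urllib.request


def mirror_url(base, repository, reference, namespace, kind="manifests"):
    """Build a mirror URL in the same shape as the one in the containerd error."""
    query = urllib.parse.urlencode({"ns": namespace})
    return f"{base.rstrip('/')}/v2/{repository}/{kind}/{reference}?{query}"


def probe(url, method="HEAD", timeout=10.0):
    """Return the HTTP status for the URL, or None if the endpoint is unreachable."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the status code is what we want.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout (URLError is a subclass of OSError).
        return None
```

Calling `probe(mirror_url("http://localhost:20020", "proxy-cache/library/nginx", "alpine", "mylocalharbor.example.com"))` and then the same with the `sha256:a71e0884…` digest as the reference and `method="GET"` should show whether the tag resolves while the digest returns 404, independently of `crictl`.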

Spegel logs (Pull Node):

{"time":"2025-02-25T21:59:01.503680186Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handleMirror","file":"/build/pkg/registry/registry.go","line":236},"msg":"handling mirror request from external node","key":"mylocalharbor.example.com/proxy-cache/library/nginx:alpine","path":"/v2/proxy-cache/library/nginx/manifests/alpine","ip":"100.96.27.160"}
{"time":"2025-02-25T21:59:01.508731705Z","level":"DEBUG","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handleMirror","file":"/build/pkg/registry/registry.go","line":287},"msg":"mirrored request","key":"mylocalharbor.example.com/proxy-cache/library/nginx:alpine","path":"/v2/proxy-cache/library/nginx/manifests/alpine","ip":"100.96.27.160","url":"http://100.96.17.54:5000"}
{"time":"2025-02-25T21:59:01.508980219Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":132},"msg":"","path":"/v2/proxy-cache/library/nginx/manifests/alpine","status":200,"method":"HEAD","latency":"5.366138ms","ip":"100.96.27.160"}
...
{"time":"2025-02-25T21:59:01.572263309Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":135},"msg":"","err":"mirror resolve retries exhausted for key: sha256:a71e0884a7f1192ecf5decf062b67d46b54ad63f0cc1b8aa7e705f739a97c2fc","path":"/v2/proxy-cache/library/nginx/manifests/sha256:a71e0884a7f1192ecf5decf062b67d46b54ad63f0cc1b8aa7e705f739a97c2fc","status":404,"method":"GET","latency":"30.012755ms","ip":"100.96.27.160"}

Spegel logs (Node with the blob):

{"time":"2025-02-25T21:56:38.879279272Z","level":"DEBUG","source":{"function":"github.com/spegel-org/spegel/pkg/routing.(*P2PRouter).Advertise","file":"/build/pkg/routing/p2p.go","line":200},"msg":"advertising keys","host":"12D3KooWJ7aG75BFotyMa1SFcd6vCQ6ZFvcu3r5ygqNuTkiBDGbm","keys":["mylocalharbor.example.com/proxy-cache/library/nginx:alpine","sha256:4ff102c5d78d254a6f0da062b3cf39eaf07f01eec0927fd21e219d0af8bc0591","sha256:a71e0884a7f1192ecf5decf062b67d46b54ad63f0cc1b8aa7e705f739a97c2fc","sha256:1ff4bb4faebcfb1f7e01144fa9904a570ab9bab88694457855feb6c6bba3fa07",...]}

Despite Spegel advertising the digest (sha256:a71e0884a7f1192ecf5decf062b67d46b54ad63f0cc1b8aa7e705f739a97c2fc, which the pulling node requests under /manifests/ in the logs above) on one node, the pull times out or returns 404 in the larger cluster. In the smaller (10‑node) cluster, the exact same config and images work perfectly.
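
Matching failed keys against advertisements by eye gets tedious across 60 nodes, so a small script can do the cross-check. This is a rough sketch that assumes only the JSON fields visible in the log lines above (`err`, `msg`, `keys`, `host`, `time`); the function names are made up for illustration. Lines that are not valid JSON, such as the truncated excerpt above, are skipped, so it needs the full log lines.

```python
import json

EXHAUSTED_PREFIX = "mirror resolve retries exhausted for key: "


def parse_lines(lines):
    """Yield decoded JSON objects, skipping blank or non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def exhausted_keys(records):
    """Map each key that exhausted its resolve retries to the failure timestamps."""
    failures = {}
    for record in records:
        err = record.get("err", "")
        if isinstance(err, str) and err.startswith(EXHAUSTED_PREFIX):
            key = err[len(EXHAUSTED_PREFIX):]
            failures.setdefault(key, []).append(record.get("time"))
    return failures


def advertised_keys(records):
    """Map each advertised key to the set of host IDs that advertised it."""
    hosts = {}
    for record in records:
        if record.get("msg") == "advertising keys":
            for key in record.get("keys", []):
                hosts.setdefault(key, set()).add(record.get("host", "unknown"))
    return hosts


def correlate(pull_lines, source_lines):
    """For every key that failed on the pulling node, list the hosts that advertised it."""
    failures = exhausted_keys(parse_lines(pull_lines))
    adverts = advertised_keys(parse_lines(source_lines))
    return {key: sorted(adverts.get(key, set())) for key in failures}
```

Feeding it the pulling node's log as `pull_lines` and the concatenated logs of the other Spegel pods as `source_lines` gives, per failed key, the hosts that claimed to advertise it. An empty list means no advertisement was logged at all; a non-empty list points at lookup rather than advertisement.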

Spegel Helm Values:

      spegel:
        additionalMirrorRegistries: []
        appendMirrors: false
        blobSpeed: ''
        containerdContentPath: /var/lib/containerd/io.containerd.content.v1.content
        containerdMirrorAdd: false
        containerdNamespace: k8s.io
        containerdRegistryConfigPath: /etc/containerd/certs.d
        containerdSock: /run/containerd/containerd.sock
        kubeconfigPath: ''
        logLevel: DEBUG
        mirrorResolveRetries: 100
        mirrorResolveTimeout: "60s"
        registries:
          - https://mylocalharbor.example.com
          - https://docker.io
          - https://k8s.gcr.io
          - https://registry.k8s.io
          - https://public.ecr.aws
          - https://quay.io
        resolveLatestTag: true
        resolveTags: true

Observed behavior:

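  • Spegel successfully resolves the tag (the HEAD request for manifests/alpine returns 200) but fails to fetch the content addressed by digest, returning “not found” or “mirror resolve retries exhausted.” In the logs above the failing request is a GET for manifests/sha256:a71e0884a7f1192ecf5decf062b67d46b54ad63f0cc1b8aa7e705f739a97c2fc.
  • The node that actually has the content logs that it is advertising that digest (see the logs above).
  • Raising mirrorResolveTimeout (to 60s) and mirrorResolveRetries (to 100) does not help.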

Is there a known limitation or bug in how Spegel’s P2P resolves layers in larger clusters? Any suggestions or debugging tips would be greatly appreciated. Thank you!

diasbro added the bug label Feb 26, 2025
@phillebaba
Member

In a distributed system there can be many reasons for content not to be found. I also find libp2p KAD very difficult to debug, as there are many moving components. The go-libp2p-kad-dht project did just merge a PR that seems related to what you are seeing; the bug it fixes was caused by the wrong context being used in a query.

libp2p/go-libp2p-kad-dht#1017

I think the best option for now is to see whether this problem persists in the next release of Spegel. If it does not, we can close this issue.
