after upgrade from v26 to v27 dind fails to start: Unexpected error in sigtimedwait: 'Function not implemented' #503

Open
mrclrchtr opened this issue Jun 27, 2024 · 9 comments

Comments

@mrclrchtr

I tried to upgrade from v26 to v27.

I want to use docker dind in a GitHub Actions runner scale set with the following config:

image: docker:27.0.2-dind
name: dind
securityContext:
  privileged: true
env:
  - name: DOCKER_GROUP_GID
    value: "123"
resources:
  requests:
    cpu: 300m
    memory: 500Mi
  limits:
    cpu: 300m
    memory: 500Mi
args:
  - dockerd
  - --host=unix:///var/run/docker.sock
  - --group=$(DOCKER_GROUP_GID)

This is the complete log I can get:

cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
[FATAL tini (1)] Unexpected error in sigtimedwait: 'Function not implemented'

The underlying OS is Talos v1.7.4.

Do you have any idea what's happening?

@tianon
Member

tianon commented Jun 27, 2024

Interesting -- why is tini involved here? 🤔

Do you have something configured on your system that would be putting tini inside that container automatically (for example, on dockerd there's a --init flag that would do so)?

(That being said, I can't reproduce the issue even using docker run --init to force tini to be the parent of my dockerd process, so that doesn't really help much, it's just the only meaningful thread I can see to pull on 😭)

@mrclrchtr
Author

mrclrchtr commented Jun 27, 2024

Not that I know of... there is an earlier container that unpacks "dind-externals" from the github runner image and provides it via a volume mount for dind. But that shouldn't lead to a different startup behavior, should it?

This is the log of the v26 image:

cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
time="2024-06-27T17:34:14.706370867Z" level=info msg="Starting up"
time="2024-06-27T17:34:14.711383174Z" level=info msg="containerd not running, starting managed containerd"
time="2024-06-27T17:34:14.797946949Z" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=346
time="2024-06-27T17:34:14.903422623Z" level=info msg="starting containerd" revision=ae71819c4f5e67bb4d5ae76a6b735f29cc25774e version=v1.7.18
...
...

I'll see if Talos has anything to do with it.

@mrclrchtr
Author

I found this: https://github.com/docker-library/docker/blob/c0963f96ace4f48d13385cbf20356ae605edcb8b/27/dind/dockerd-entrypoint.sh#L143C2-L144C28

# XXX inject "docker-init" (tini) as pid1 to workaround https://github.com/docker-library/docker/issues/318 (zombie container-shim processes)
set -- docker-init -- "$@"
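To make the effect of that line concrete, here is a minimal sketch of the argument rewrite. It only prints the resulting command line and does not start anything; `wrap_with_init` is a hypothetical helper name used for illustration, not something from the entrypoint script, and the arguments are the ones from the pod spec above with `DOCKER_GROUP_GID` already expanded.

```shell
#!/bin/sh
# Sketch only: mimics the argument rewrite, it does not start dockerd.
# wrap_with_init is a hypothetical helper name, not part of the entrypoint.
wrap_with_init() {
    # Same statement as in dockerd-entrypoint.sh: prepend docker-init (tini).
    set -- docker-init -- "$@"
    # Print the rewritten command line instead of running it.
    printf '%s\n' "$*"
}

# Arguments as in the pod spec above, with DOCKER_GROUP_GID already expanded.
wrap_with_init dockerd --host=unix:///var/run/docker.sock --group=123
```

This prints `docker-init -- dockerd --host=unix:///var/run/docker.sock --group=123`, which is why tini shows up as PID 1 in the log even though the pod spec never mentions it.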

@tianon
Member

tianon commented Jun 27, 2024

Oh lol, good catch -- I forgot all about that. 😭

However, that doesn't really help give us more threads to pull because it works fine here, so my only guess is something in the Talos environment or kernel or something? Maybe something about how Kubernetes is creating the container?

Is there any way you could get lower level on the affected system and debug/test more directly with simpler container run commands like docker run to help narrow down?

@mrclrchtr
Author

> However, that doesn't really help give us more threads to pull because it works fine here, so my only guess is something in the Talos environment or kernel or something? Maybe something about how Kubernetes is creating the container?

Yes, I also think it has to do with Talos. The question is whether the error message means that sigtimedwait is not present?

And I wonder what change to the image makes this function necessary now?

> Is there any way you could get lower level on the affected system and debug/test more directly with simpler container run commands like docker run to help narrow down?

No, unfortunately not. Talos is built in such a way that you can't even set up an SSH tunnel to the machine.

But I could build a very simple Kubernetes deployment with just the image. That's a good idea and helps to isolate the error.
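A stripped-down reproduction along those lines might look like the sketch below. It is an assumption about what "just the image" could mean, not a tested setup: the pod name `dind-repro` and the helper `dind_pod_manifest` are hypothetical, and only the image tag and `privileged: true` are taken from the config above. No `args` are set, so the image's default entrypoint behaviour is used.

```shell
#!/bin/sh
# Print a minimal privileged Pod manifest for a given dind image tag.
# "dind-repro" and dind_pod_manifest are hypothetical names for illustration.
dind_pod_manifest() {
    tag="${1:-27.0.2-dind}"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: dind-repro
spec:
  restartPolicy: Never
  containers:
    - name: dind
      image: docker:${tag}
      securityContext:
        privileged: true
EOF
}

# Emit the manifest for the tag that failed in this report.
dind_pod_manifest 27.0.2-dind
```

The output can be piped into `kubectl apply -f -`, and `kubectl logs dind-repro` should then show whether the tini error appears without the runner scale set involved. Passing a different tag as the first argument makes it easy to compare versions.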

Thank you very much for your help. I'll get back to you as soon as I have more information.

@mrclrchtr
Author

mrclrchtr commented Jul 30, 2024

Today I tried version 27.1.1 (without any further changes) and it works. Unfortunately, I still don't know what was going on in the meantime. Thanks again for your support!

@mrclrchtr
Author

With the upgrade to 27.1.2 the problem is present again 😖🧐

@mrclrchtr mrclrchtr reopened this Aug 15, 2024
@mrclrchtr
Author

OK, this is completely weird: it works in 27.2.0-dind, but it no longer works in 27.2.1-dind.

I will continue to monitor it. Perhaps a pattern will emerge at some point or you can look at the history to see what has changed.

@LaurentGoderre
Member

I'm not familiar with this, but looking at the error handling here: https://github.com/krallin/tini/blob/0b44d3665869e46ccbac7414241b8256d6234dc4/src/tini.c#L505-L512 and the spec here: https://pubs.opengroup.org/onlinepubs/9699919799/functions/sigtimedwait.html, there is an error code that is not handled (EINVAL), and I am wondering if the error message could be misleading.
