
bug: blade create k8s container-container remove does not work as expected #1092

Closed
spencercjh opened this issue Jan 13, 2025 · 2 comments · Fixed by chaosblade-io/chaosblade-exec-cri#22

spencercjh (Contributor) commented Jan 13, 2025

Issue Description

bug report

Describe what happened (or what feature you want)

There is a bug in the container deletion logic for the containerd runtime in chaosblade-exec-cri. The current implementation only removes the container's metadata from the bucket via `ContainerService().Delete`; it does not delete the associated task (including its processes) or the container's rootfs snapshot.

This causes kubelet to fail to update the container state via CRI, which prevents a new container from being created for the affected Pod.
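
For reference, the failing path boils down to roughly the following (a minimal sketch, assuming c.cclient is a *containerd.Client as in the examples below):

// Approximately what the current implementation does: this only removes
// the container's metadata record from containerd's bucket, leaving the
// task (and its processes) and the rootfs snapshot behind.
if err := c.cclient.ContainerService().Delete(ctx, containerId); err != nil {
    return err
}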

Describe what you expected to happen

The target container should be restarted, with a non-zero exit code such as 137.

How to reproduce it (as minimally and precisely as possible)

  1. Use blade create k8s container-container remove to remove a container in a Pod in Kubernetes.
  2. Watch the Pods; the ready-container count SHOULD change (e.g. 1/1 to 0/1 during the experiment, then back to 1/1 once the new container starts).
  3. Instead, kubectl exec -it $THE_TARGET_POD -c $TARGET_CONTAINER -- bash leads to an internal error from the CRI because the container is not found.

Tell us your environment

  • containerd 1.6.5
  • the latest ChaosBlade

Anything else we need to know?

I think the current chaosblade-exec-cri implementation doesn't make much sense and deviates from the "cri" in the repo's name (which implies interacting with the container runtime via the CRI interface).
A pragmatic compromise is to follow containerd's cmd/ctr/commands/containers/containers.go to remove containers together with their associated resources.

Here are some examples that worked in my environment as fixes:

Use containerd.Container

// Assumed imports:
//   "syscall"
//   "github.com/containerd/containerd"

// Load the container
container, err := c.cclient.LoadContainer(ctx, containerId)
if err != nil {
    return err
}
// Look up the container's task (fails if the container has none)
task, err := container.Task(ctx, nil)
if err != nil {
    return err
}

// Stop the task if it's running
if task != nil {
    // Send SIGKILL
    if err := task.Kill(ctx, syscall.SIGKILL); err != nil {
        return err
    }
    // Delete the task (WithProcessKill kills any remaining processes first)
    if _, err := task.Delete(ctx, containerd.WithProcessKill); err != nil {
        return err
    }
}
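
If full removal is the goal, ctr's containers.go goes on to delete the container record together with its rootfs snapshot; a hedged sketch (but see my follow-up comment below: completely removing the container confuses kubelet's bookkeeping):

// Remove the container record and clean up its rootfs snapshot,
// mirroring what ctr does once the task has been deleted.
if err := container.Delete(ctx, containerd.WithSnapshotCleanup); err != nil {
    return err
}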

Use the existing containerd client's Service API

// Assumed imports:
//   "syscall"
//   tasks "github.com/containerd/containerd/api/services/tasks/v1"

// Kill the container's task directly via the tasks service.
if _, err := c.cclient.TaskService().Kill(ctx, &tasks.KillRequest{
	ContainerID: containerId,
	Signal:      uint32(syscall.SIGKILL),
}); err != nil {
	return err
}

return nil
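
This is the least invasive variant: the task dies from SIGKILL, the containerd CRI plugin observes the exit and reports the container as terminated with exit code 137 (128 + 9), and kubelet then recreates the container according to the Pod's restartPolicy.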

Use k8s.io/cri-client

// Assumed imports:
//   "time"
//   "go.opentelemetry.io/otel/trace/noop"
//   cripkg "k8s.io/cri-client/pkg"
//   "k8s.io/klog/v2"

klogger := klog.FromContext(ctx)
// Connect to containerd's CRI endpoint as a remote runtime service.
criSvc, err := cripkg.NewRemoteRuntimeService(
	"unix:///run/containerd/containerd.sock",
	15*time.Second,
	noop.NewTracerProvider(),
	&klogger,
)
if err != nil {
	return err
}

// Stop the container via CRI; 0 means no graceful-termination period.
if err := criSvc.StopContainer(ctx, containerId, 0); err != nil {
	return err
}

return nil
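
Note that the third argument to StopContainer is the grace period in seconds; passing 0 asks the runtime to kill the container right away, matching the SIGKILL behaviour of the other two variants. Going through CRI should also keep the CRI plugin's own status store consistent, since the stop is performed by the plugin itself.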

Any of the above will restart the container in the Pod and leave a restart record.

[screenshot: the restarted container's restart record]

spencercjh (Contributor, Author) commented Jan 14, 2025

In my real-world testing, crictl and kubelet behave abnormally if the container (with all its information) is removed completely, so I changed the code shown in the issue.

The main problem is that it corrupts the containerStatuses field of the Pod (from kubectl get pods -o yaml): the last container shown is wrong (it is not the one you removed), and the container's restart information and the Pod's restart count become inaccurate (e.g. the count stays at 0 forever). I'm sorry I don't have enough time to read through the kube-apiserver, kubelet, and containerd CRI plugin server code to explain why this happens.

I took a cursory look at the containerd CRI plugin code, which has a Status StatusStorage field on its container type; it looks like the CRI plugin stores additional container status information for use by Kubernetes.

xcaspar (Member) commented Jan 22, 2025

The core problem in ChaosBlade is that the stop-container operation is missing from its RemoveContainer implementation, which also lacks the task kill operation. All of the repair solutions above are fine; I suggest the second one, "Use the existing containerd client's Service API".
