Docker 20.10.6 IPv6 bindings shouldn't be mapped as network bindings for tasks for non IPv6 networks. #2870
Comments
@tomelliff Thank you for reporting this. I am trying to reproduce this issue. Could you tell me which AMI you are using and how you are updating Docker on the instance? Thanks. |
We're basing off the latest Canonical Ubuntu 20.04 AMI and then we install Docker from the APT repo at https://download.docker.com/linux/ubuntu. Our ECS agent systemd unit file looks like this:
[Unit]
Description=ECS Agent
Requires=docker.service
After=docker.service cloud-final.service
[Service]
Restart=always
ExecStartPre=/sbin/iptables -t nat -A PREROUTING --dst 169.254.170.2/32 \
-p tcp -m tcp --dport 80 -j DNAT --to-destination 127.0.0.1:51679
ExecStartPre=/sbin/iptables -t filter -I INPUT --dst 127.0.0.0/8 \
! --src 127.0.0.0/8 -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
ExecStartPre=/sbin/iptables -t nat -A OUTPUT --dst 169.254.170.2/32 \
-p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
ExecStartPre=/sbin/sysctl -w net.ipv4.conf.all.route_localnet=1
ExecStartPre=-/usr/bin/docker rm -f ecs-agent
ExecStartPre=-/bin/mkdir -p /var/lib/ecs/dhclient
ExecStart=/usr/bin/docker run --name ecs-agent \
--init \
--cap-add=NET_ADMIN \
--cap-add=SYS_ADMIN \
--restart=on-failure:10 \
--volume=/var/run:/var/run \
--volume=/var/log/ecs/:/log \
--volume=/var/lib/ecs/data:/data \
--volume=/etc/ecs:/etc/ecs \
--volume=/sbin:/sbin:ro \
--volume=/lib:/lib:ro \
--volume=/lib64:/lib64:ro \
--volume=/usr/lib:/usr/lib:ro \
--volume=/proc:/host/proc:ro \
--volume=/sys/fs/cgroup:/sys/fs/cgroup \
--net=host \
--env-file=/etc/ecs/ecs.config \
amazon/amazon-ecs-agent:latest
ExecStopPost=/usr/bin/docker rm -f ecs-agent
[Install]
WantedBy=default.target
and our ECS config file looks like this:
We also add the ECS_CLUSTER variable to the config file to join the instance to the correct cluster via user data. |
I believe this is the same issue as moby/moby#42288, which the docker team is planning to fix in 20.10.7. If we can confirm that this is fixed in 20.10.7 then I think the best course of action would be for ECS to do nothing to workaround this. |
@tomelliff could you confirm that, if the instance has ipv6 enabled, there are no ill effects from this? I'm wondering if there could still be issues if the customer has ipv6 enabled on their instance but does not have ipv6 enabled throughout their entire VPC stack. As in, even after the above moby issue is fixed, should we always have a configuration option to stop the ipv6 networkBinding from being added? |
I experienced the same issue on Ubuntu 18.04 with docker 20.10.6 using bridge network mode, ECS agent 1.51.0. I experienced deployments stuck forever in
|
Currently there are three known workarounds:
|
I built a development version off docker's master branch (20.10.7) and can confirm that this appears to be fixed. I did the same repro steps that I outlined here: moby/moby#42288 (comment), but this time I successfully launched the container and I confirmed that
|
This being said, we do have a change in behavior on 20.10.7 when ipv6 is enabled:
20.10.5
20.10.7
|
@jpradelle I've tried with 20.10.6 and 20.10.7 but have not been able to reproduce the situation where the ipv4 and ipv6 network bindings receive different host ports. How often do you see that happening? And do you have a task definition that reproduces it? |
It takes time for the IPv4 and IPv6 port bindings to diverge. At the beginning both ports are the same. It took almost 3 weeks of running, with at least a hundred deployments/redeployments, before the ports began to differ. So far I have not been able to identify a pattern to reproduce it. Twice I saw targets that responded fine to calls made through the ELB (from my browser, http://my-elb/my-target) but were killed due to ELB health check timeouts at the end of the health check grace period. I have since renewed my cluster instances; all my tasks are running and being deployed with duplicate network bindings on the same ports, and everything works fine. Here is the CloudFormation file I use:
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: stg-mta/my-app
      RetentionInDays: 30
  Task:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: my-app
      NetworkMode: bridge
      TaskRoleArn: ...
      ContainerDefinitions:
        - Name: my-app
          Image: ...
          MemoryReservation: 512
          Environment:
            - ...
          PortMappings:
            - ContainerPort: 8080
              Protocol: tcp
          ReadonlyRootFilesystem: true
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: stg-mta/my-app
              awslogs-region:
                Ref: AWS::Region
              awslogs-stream-prefix: my-app
          MountPoints:
            - SourceVolume: tmp
              ContainerPath: /tmp
      Volumes:
        - Name: tmp
      Tags:
        - ...
  Service:
    Type: AWS::ECS::Service
    DependsOn:
      - ListnerRuleApi
    Properties:
      ServiceName: my-app
      TaskDefinition:
        Ref: Task
      Cluster: stg-mta-ecs
      LaunchType: EC2
      DesiredCount: 1
      HealthCheckGracePeriodSeconds: 240
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 100
      PropagateTags: TASK_DEFINITION
      LoadBalancers:
        - ContainerName: my-app
          ContainerPort: 8080
          TargetGroupArn:
            Ref: TargetGroupApi
        - ContainerName: my-app
          ContainerPort: 8080
          TargetGroupArn:
            Ref: TargetGroupInternalNoSso
      Tags:
        - ...
  TargetGroupApi:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: my-app-api
      Port: 8080
      Protocol: HTTP
      VpcId:
        Fn::ImportValue: vpcid
      TargetType: instance
      Matcher:
        HttpCode: 200-299
      HealthCheckPath: /ping.php
      HealthCheckProtocol: HTTP
      HealthCheckIntervalSeconds: 5
      HealthCheckTimeoutSeconds: 4
      HealthyThresholdCount: 3
      UnhealthyThresholdCount: 2
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: '5'
      Tags:
        - ...
  ListnerRuleApi:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    Properties:
      ListenerArn:
        Fn::ImportValue: stg-mta-elb-api-listener
      Actions:
        - Type: forward
          Order: 50000
          TargetGroupArn:
            Ref: TargetGroupApi
      Conditions:
        - Field: path-pattern
          Values:
            - Fn::Sub: /my-app/*
            - Fn::Sub: /other-route2/*
            - Fn::Sub: /other-route3/*
            - Fn::Sub: /other-route4/*
      Priority: 1190
  TargetGroupInternalNoSso:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: my-app-int
      Port: 8080
      Protocol: HTTP
      VpcId:
        Fn::ImportValue: vpcid
      TargetType: instance
      Matcher:
        HttpCode: 200-299
      HealthCheckPath: /ping.php
      HealthCheckProtocol: HTTP
      HealthCheckIntervalSeconds: 5
      HealthCheckTimeoutSeconds: 4
      HealthyThresholdCount: 3
      UnhealthyThresholdCount: 2
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: '5'
      Tags:
        - ...
  ListnerRuleInternalNoSso:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    Properties:
      ListenerArn:
        Fn::ImportValue: stg-mta-elb-internal-listener
      Actions:
        - Type: forward
          Order: 50000
          TargetGroupArn:
            Ref: TargetGroupInternalNoSso
      Conditions:
        - Field: path-pattern
          Values:
            - Fn::Sub: /my-app/public/*
            - Fn::Sub: /other-route/public/*
      Priority: 1048
|
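For anyone trying to catch the divergence described above (the IPv4 and IPv6 bindings of one container port receiving different host ports), here is a minimal Python sketch. It assumes the port map has the shape of `NetworkSettings.Ports` in `docker inspect` output, and it assumes an empty `HostIp` means "all IPv4 interfaces". The function names and the sample data are hypothetical and are not taken from this thread or from the ECS agent.

```python
import ipaddress


def host_ports_by_family(bindings):
    """Group the host ports of one container port by IP family (4 or 6)."""
    ports = {4: set(), 6: set()}
    for binding in bindings or []:  # Docker reports None for unpublished ports
        # Assumption: an empty HostIp is treated as the IPv4 wildcard.
        host_ip = binding.get("HostIp") or "0.0.0.0"
        family = ipaddress.ip_address(host_ip).version
        ports[family].add(binding["HostPort"])
    return ports


def mismatched_ports(port_map):
    """Return container ports whose IPv4 and IPv6 host ports differ."""
    mismatched = {}
    for container_port, bindings in port_map.items():
        ports = host_ports_by_family(bindings)
        if ports[4] and ports[6] and ports[4] != ports[6]:
            mismatched[container_port] = ports
    return mismatched


# Hypothetical sample shaped like NetworkSettings.Ports; not real output.
sample = {
    "8080/tcp": [
        {"HostIp": "0.0.0.0", "HostPort": "49153"},
        {"HostIp": "::", "HostPort": "49154"},
    ]
}

print(mismatched_ports(sample))
```

An empty result means every published port has matching IPv4 and IPv6 host ports, which is the harmless duplicate-binding case.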
We just hit the issue again on 20.10.7 after unpinning from 20.10.5. So there's still the issue upstream with the different ports for IPv4 and IPv6. We haven't disabled IPv6 on the instance but as mentioned in the issue above ECS supposedly has strict opt ins for IPv6 that are not met here (we're not using |
Thanks for the update @tomelliff
Which issue are you referring to here? I'm not 100% sure what you mean by ECS having "strict opt ins" for IPv6, but this maybe happens at a higher level than at the ecs-agent in a way that I don't fully understand. From the ecs-agent level I'm not exactly sure what would be the best practice here. Obviously a user who intends to use ipv6 should not have their ipv6 interface stripped out, and it's not clear to me how the ecs-agent should determine that a particular ECS instance should be opted out of exposing ipv6 interfaces. I'm tempted to say that the best solution for you would be to disable ipv6 on your instance using the kernel parameter, but I'm also happy to understand the issue better if you think there's something that ecs-agent should be doing to filter out these ipv6 interfaces. I can also help to reroute this to the ECS backend side of things if this is something that could or should be filtered out on their end. |
I think the Docker 20.10.7 upgrade solved the issue on my side. I renewed the 2 instances of my cluster, based on the same AMI version, with the following upgrades:
And now in my task network bindings I no longer have duplicated network bindings for the same ports. For example, on one task I only have an IPv4 binding, which is what I expected:
|
OK, @jpradelle it sounds like maybe you were being affected by the docker bug that was exposing the ipv6 interface even though you had disabled ipv6 on your instances. Can you confirm if you have ipv6 disabled? For anyone else who sees this issue, I believe the current best workaround would be to disable ipv6 on your instances with the linux kernel parameter
|
I'm not sure exactly what my IPv6 configuration is. Using an Ubuntu AMI updated by my corporate service, never did anything on that part, neither on kernel parameter nor on docker configuration.
|
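If you are unsure whether IPv6 is disabled on an instance, one way to check is to read the kernel's per-interface `disable_ipv6` setting under `/proc/sys`. The sketch below is an assumption on my part: the thread does not show which kernel parameter the earlier comment refers to, and the helper name `ipv6_disabled` and its `proc_root` argument are hypothetical.

```python
from pathlib import Path


def ipv6_disabled(interface="all", proc_root="/proc"):
    """Return True/False from the disable_ipv6 sysctl, or None if it is absent.

    Assumption: this reads net.ipv6.conf.<interface>.disable_ipv6, which may
    not be the parameter meant elsewhere in this thread.
    """
    path = (
        Path(proc_root) / "sys" / "net" / "ipv6" / "conf" / interface / "disable_ipv6"
    )
    try:
        return path.read_text().strip() == "1"
    except FileNotFoundError:
        # The file is missing, for example when the kernel has no IPv6 support loaded.
        return None


if __name__ == "__main__":
    print(ipv6_disabled())
```

The `proc_root` argument exists only so the function can be pointed at a test directory instead of the real `/proc`.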
As mentioned by @sparrc, the workaround for now is to disable ipv6 on your instances if docker 20.10.6 is used (note - ECS Agent does not support this version yet). |
Summary
Docker 20.10.6 includes this fix which now means the API returns IPv6 bindings, which I don't think it has ever done before. The ECS Agent then maps this binding to tasks, so you end up with multiple bindings when previously you just had the IPv4 binding. According to https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.html#task-networking-vpc-dual-stack I don't think anything IPv6 should be working by default, and multiple things need to be enabled (and never on the bridge networking mode), although I could be misreading that.
Description
Normally this isn't an issue because you have a 0.0.0.0 binding and a :: binding which then seems to get deduplicated on the target group.
Unfortunately there also appears to be an issue upstream (see moby/libnetwork#2639) where IPv6 host port bindings can be wrong and point to the wrong container when proxying IPv4 traffic to it. This seems to be the root cause of an issue that's been causing spurious healthcheck failures on our container fleet for the last couple of weeks (raised in support ticket 8277473901).
Expected Behavior
Either IPv6 host port bindings should be filtered out here by default or there should be configuration to not enable them.
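As an illustration of the filtering asked for here, a minimal Python sketch follows. It is not the ECS agent's implementation (the agent is not written in Python); it assumes bindings shaped like `NetworkSettings.Ports` in `docker inspect` output, and the names `is_ipv6` and `drop_ipv6_bindings` are hypothetical.

```python
import ipaddress


def is_ipv6(host_ip):
    """Return True only when host_ip parses as an IPv6 address such as '::'."""
    try:
        return ipaddress.ip_address(host_ip).version == 6
    except ValueError:
        # Empty or unparsable HostIp values are not treated as IPv6.
        return False


def drop_ipv6_bindings(port_map):
    """Return a copy of the port map with IPv6 host bindings removed."""
    filtered = {}
    for container_port, bindings in port_map.items():
        filtered[container_port] = [
            b for b in (bindings or []) if not is_ipv6(b.get("HostIp", ""))
        ]
    return filtered
```

A real implementation would presumably need this to be opt-out, so that users who intend to expose IPv6 bindings keep them.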
Only a single host port binding should be mapped to the task as with Docker 20.10.5 instances:
Observed Behavior
Both IPv4 and IPv6 host port bindings are mapped to the task:
Environment Details
Multiple network bindings observed on this instance:
Single IPv4 network binding on this instance:
Supporting Log Snippets