Add check network packet loss implementation #4322

tshan2001 · 2024-09-05T21:31:39Z

Summary

Adding implementation for CheckNetworkPacketLoss() and corresponding unit tests. Also updating the os/exec wrapper to include 2 more helper methods to facilitate mocks in unit testing.

Implementation details

After the TMDS server receives the request to check for network packet loss, the following Linux command will be executed:

tc -j q

The command output is a byte slice containing a JSON string. The output is then unmarshalled, and we check whether an entry like the following exists:

{
"kind":"netem"
...
}

Testing

Unit tests for the TMDS package were run.

 % go test -tags unit -v -run TestCheckNetworkPacketLoss /workplace/tianzes/amazon-ecs-agent/ecs-agent/tmds/handlers/fault/v1/handlers
...
--- PASS: TestCheckNetworkPacketLoss (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_success-running (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_success-not-running (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_success-not-running-but-latency-is-running (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_only-one-ip-exists-not-running (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_unknown_request_body (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_failed_to_unmarshal_json (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_malformed_request_body_1 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_malformed_request_body_2 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_incomplete_request_body_1 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_incomplete_request_body_2 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_incomplete_request_body_3 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_LossPercent_in_the_request_body_1 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_LossPercent_in_the_request_body_2 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_LossPercent_in_the_request_body_3 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_IP_value_in_the_request_body_1 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_IP_CIDR_block_value_in_the_request_body_2 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_task_lookup_fail (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_task_metadata_fetch_fail (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_task_metadata_unknown_fail (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_fault_injection_disabled (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_network_mode (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_empty_task_network_config (0.00s)

New tests cover the changes:
The following test cases were added:

Call was successful but there's no fault running
Call was successful but got internal error when processing the Linux command
Call was successful, but there's network latency fault running instead of network packet loss fault

Manual Testing

Launched a Fargate Instance with the changes. Launched a task with ecs-exec enabled.

# First curl the TMDS endpoint from the task without any fault on the instance
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1"]}'
{"Status":"not-running"}

# Log into the instance as 'su', manually inject a fault with 10% loss with IP '192.168.0.1' associated
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc qdisc add dev eth1 root handle 1: prio priomap 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc qdisc add dev eth1 parent 1:1 handle 10: netem loss 10%
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc filter add dev eth1 protocol ip parent 1:0 prio 1 u32 match ip dst 192.168.0.1 flowid 1:1

# Curl the TMDS endpoint with the same config, result should be running
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1"]}'
{"Status":"running"}

# Now change the lossPercent in payload, result should be not-running
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":6, "Sources":["192.168.0.1"]}'
{"Status":"not-running"}

# Now add an additional IP address 10.1.1.1, and curl the endpoint with both IPs, result should be running
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc filter add dev eth1 protocol ip parent 1:0 prio 1 u32 match ip dst 10.1.1.1 flowid 1:1
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1", "10.1.1.1"]}'
{"Status":"running"}

# If we curl the endpoint with only one of the IPs, result should be running as well:
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1"]}'
{"Status":"running"}

# Lastly, if we add an unfiltered ip in the payload, result should be not-running
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1", "10.1.1.1", "1.1.1.1"]}'
{"Status":"not-running"}

# Edge case: manually add a network latency fault, and curl the check packet loss endpoint. The result should be not-running
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc q show dev eth1
qdisc prio 1: root refcnt 9 bands 3 priomap 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
qdisc netem 10: parent 1:1 limit 1000 delay 100ms
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1", "10.1.1.1", "1.1.1.1"]}'
{"Status":"not-running"}

Description for the changelog

Add check network packet loss implementation

Additional Information

Does this PR include breaking model changes? If so, Have you added transformation functions?

Does this PR include the addition of new environment variables in the README?

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

mye956 · 2024-09-12T00:22:50Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+// runExecCommand wraps around the execwrapper, providing a convenient way of running any Linux command
+// and getting the result in both stdout and stderr.
+func (h *FaultHandler) runExecCommand(ctx context.Context, linuxCommandString string) ([]byte, error) {
+	cmdExec := h.osExecWrapper.CommandContext(ctx, "/bin/sh", "-c", linuxCommandString)


Not quite sure if the /bin/sh path exists within the agent container. Will need to double check this. Is it possible to make the call without this just like with the standard os/exec go library? (e.g. exec.Command("ls -al"))

Yes. I just pushed a new revision that removes "/bin/sh" from the command. Also re-ran all the manual testing to make sure that it still works.
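A minimal sketch of running a command without going through /bin/sh. Note that os/exec takes the program name and its arguments separately, so a single string such as "ls -al" would be looked up as one program name. The helper names splitCommand and runCommand are hypothetical, and splitting on whitespace assumes that no argument contains quoted spaces, which holds for the tc and nsenter commands in this PR but not for arbitrary shell input.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// splitCommand breaks a command string into a program name and its arguments.
// strings.Fields collapses repeated whitespace, so a doubled space does not
// produce an empty argument the way strings.Split(s, " ") would.
func splitCommand(command string) (string, []string, error) {
	fields := strings.Fields(command)
	if len(fields) == 0 {
		return "", nil, errors.New("empty command")
	}
	return fields[0], fields[1:], nil
}

// runCommand executes the command directly (no shell) and returns stdout and
// stderr combined.
func runCommand(ctx context.Context, command string) ([]byte, error) {
	name, args, err := splitCommand(command)
	if err != nil {
		return nil, err
	}
	return exec.CommandContext(ctx, name, args...).CombinedOutput()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// On a host without tc installed this simply reports the lookup error.
	out, err := runCommand(ctx, "tc -j q show dev eth0 parent 1:1")
	fmt.Printf("output=%q err=%v\n", out, err)
}
```

Because no shell is involved, pipes, redirects and variable expansion in the command string are not interpreted.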

mye956 · 2024-09-12T19:19:56Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+	}
+	// Log the command output to better help us debug.
+	logger.Info(fmt.Sprintf("%s command result: %s", tcCheckInjectionCommandComposed, string(cmdOutput[:])))
+	var outputUnmarshalled []map[string]interface{}


Q: Instead of parsing/unmarshaling the output into a map[string]interface{}, I wonder if we can parse it into a struct and then get only the needed json keys from the output? (if it's not possible/output is not deterministic then please disregard).

Just ran tc -j q show dev eth0 parent 1:1 locally and it seems to output the following:

[{ "kind": "netem", "handle": "10:", "parent": "1:1", "options": { "limit": 1000, "loss-random": { "loss": 0.5, "correlation": 0 }, "ecn": false, "gap": 0 } }]

tc -j q show dev eth0 does not always have all the fields. For example, when running on a host that doesn't have a fault injected, it can have the following output:

[{"kind":"pfifo_fast","handle":"0:","parent":":1","options":{"bands":3,"priomap":[1,2,2,2,1,2,0,0,1,1,1,1,1,1,1,1],"multiqueue":false}}]

There's no loss-random field here. Also, when specifying parent 1:1, the output can also be [].

Unmarshalling it into a map[string]interface{} and walking the JSON structure one level at a time lets us catch a missing (nil) field at each step of the parsing.
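For comparison, here is a sketch of the struct-based alternative raised in this thread. It is not the approach taken in this PR; the type qdisc and the function lossFromStruct are hypothetical names. Pointer fields stay nil when a key is absent, so the missing loss-random case and the empty [] case are both handled without type assertions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// qdisc mirrors only the keys of the `tc -j q show` output that matter here.
// Unknown keys are ignored by encoding/json; absent keys leave pointers nil.
type qdisc struct {
	Kind    string `json:"kind"`
	Options *struct {
		LossRandom *struct {
			Loss *float64 `json:"loss"`
		} `json:"loss-random"`
	} `json:"options"`
}

// lossFromStruct returns the loss value exactly as tc reports it and whether
// a netem qdisc with a loss-random.loss field was found.
func lossFromStruct(tcOutput []byte) (float64, bool, error) {
	var qdiscs []qdisc
	if err := json.Unmarshal(tcOutput, &qdiscs); err != nil {
		return 0, false, fmt.Errorf("unmarshal tc output: %w", err)
	}
	for _, q := range qdiscs {
		if q.Kind != "netem" || q.Options == nil || q.Options.LossRandom == nil {
			continue
		}
		if loss := q.Options.LossRandom.Loss; loss != nil {
			return *loss, true, nil
		}
	}
	return 0, false, nil
}

func main() {
	samples := []string{
		`[{"kind":"netem","handle":"10:","parent":"1:1","options":{"limit":1000,"loss-random":{"loss":0.5,"correlation":0}}}]`,
		`[{"kind":"pfifo_fast","handle":"0:","parent":":1","options":{"bands":3,"multiqueue":false}}]`,
		`[]`,
	}
	for _, s := range samples {
		loss, found, err := lossFromStruct([]byte(s))
		fmt.Printf("loss=%v found=%v err=%v\n", loss, found, err)
	}
}
```

One trade-off of this design: a value of an unexpected type (for example a number where an object is expected) surfaces as an unmarshal error rather than as "not found".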

mye956 · 2024-09-12T19:55:35Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+// and getting the result in both stdout and stderr.
+func (h *FaultHandler) runExecCommand(ctx context.Context, linuxCommandString string) ([]byte, error) {
+	commandArray := strings.Split(linuxCommandString, " ")
+	cmdExec := h.osExecWrapper.CommandContext(ctx, commandArray[0], commandArray[1:]...)


Q: Should there be a cmdExec.Run() call somewhere in this method?

cmdExec.CombinedOutput() returns ([]byte, error), the same signature as this method, so we return its result directly. CombinedOutput() is a standard os/exec method: it runs the command and returns stdout and stderr combined (whereas Output() returns only stdout; when the exit code is non-zero, stderr is available only through the returned *exec.ExitError).

Ah I see now

xxx0624 · 2024-09-12T21:28:21Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+	// The command above gives the output of "tc q show dev {INTERFACE} parent 1:1" in json format.
+	// We will then unmarshall the json string and evaluate the fields of it.
+	tcCheckInjectionCommandComposed := nsenterPrefix + fmt.Sprintf(tcCheckInjectionCommandString, interfaceName)
+	fmt.Println(tcCheckInjectionCommandComposed)


Please use logger instead.

xxx0624 · 2024-09-12T21:30:15Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+	fmt.Println(tcCheckInjectionCommandComposed)
+	cmdOutput, err := h.runExecCommand(ctx, tcCheckInjectionCommandComposed)
+	if err != nil {
+		return false, errors.New("failed to check network-packet-loss-fault: " + string(cmdOutput[:]) + err.Error())


Will the cmdOutput be nil if the err is not empty?

xxx0624 · 2024-09-12T21:34:27Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+	ipSources := request.Sources
+	// If task's network mode is awsvpc, we need to run nsenter to access the task's network namespace.
+	nsenterPrefix := ""
+	if networkMode == "awsvpc" {


nit: we already have a constant for awsvpc in ECS, i.e.

amazon-ecs-agent/ecs-agent/netlib/model/tasknetworkconfig/task_network_config.go

Line 29 in 78a2bf0

func New(networkMode string, netNSs ...*NetworkNamespace) (*TaskNetworkConfig, error) {

xxx0624 · 2024-09-12T21:38:55Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+		var responseBody types.NetworkFaultInjectionResponse
+		var stringToBeLogged string
+		var httpStatusCode int
+		if err != nil {


Can we fire a metric here if the err is not empty? If this happens, it means there is something wrong in our DP and not customer/client issue.

xxx0624 · 2024-09-12T21:43:16Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+	logger.Info(fmt.Sprintf("%s command result: %s", tcCheckIPCommandComposed, string(cmdOutput[:])))
+	allIPAddressesInRequestExist := true
+	for _, ipAddress := range ipSources {
+		ipAddressInHex, err := convertIPAddressToHex(*ipAddress)


Can we use aws.StringValue() instead? for the *ipAddress
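A sketch of nil-safe dereferencing followed by a hex conversion. stringValue is a local stand-in that mirrors what aws.StringValue from the AWS SDK for Go does, so the example needs no external module. ipv4ToHex is a hypothetical stand-in for convertIPAddressToHex: the real helper's implementation is not shown in this thread, and this version assumes plain IPv4 addresses only (no CIDR blocks).

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net"
)

// stringValue returns the pointed-to string, or "" for a nil pointer.
func stringValue(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}

// ipv4ToHex converts a dotted IPv4 address into eight lowercase hex digits.
func ipv4ToHex(ip string) (string, error) {
	parsed := net.ParseIP(ip)
	if parsed == nil {
		return "", fmt.Errorf("invalid IP address %q", ip)
	}
	v4 := parsed.To4()
	if v4 == nil {
		return "", fmt.Errorf("not an IPv4 address: %q", ip)
	}
	return hex.EncodeToString(v4), nil
}

func main() {
	addr := "192.168.0.1"
	sources := []*string{&addr, nil}
	for _, src := range sources {
		// A nil entry becomes "" and is rejected by ipv4ToHex instead of panicking.
		hexIP, err := ipv4ToHex(stringValue(src))
		fmt.Printf("hex=%q err=%v\n", hexIP, err)
	}
}
```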

xxx0624 · 2024-09-12T23:16:46Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers_test.go

+	invalidNetworkMode                   = "invalid"
+	tcLatencyFaultExistsCommandOutput    = `[{"kind":"netem","handle":"10:","parent":"1:1","options":{"limit":1000,"delay":{"delay":0.1,"jitter":0,"correlation":0},"ecn":false,"gap":0}}]`
+	tcLossFaultExistsCommandOutput       = `[{"kind":"netem","handle":"10:","dev":"eth0","parent":"1:1","options":{"limit":1000,"loss-random":{"loss":0.06,"correlation":0},"ecn":false,"gap":0}}]`
+	tcLossFaultDoesNotExistCommandOutput = `[{"kind":"dummyname"}]`


probably one more test case without kind

amogh09 · 2024-09-11T23:28:00Z

ecs-agent/utils/execwrapper/exec.go

+	Output() ([]byte, error)
+	CombinedOutput() ([]byte, error)


I think we should add descriptions for these (and existing) methods to let users know what to expect.

amogh09 · 2024-09-12T00:12:09Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+	ipSources := request.Sources
+	// If task's network mode is awsvpc, we need to run nsenter to access the task's network namespace.
+	nsenterPrefix := ""
+	if networkMode == "awsvpc" {


How about using an existing constant?

amazon-ecs-agent/ecs-agent/api/ecs/model/ecs/api.go

Lines 25843 to 25844 in 78a2bf0

// NetworkModeAwsvpc is a NetworkMode enum value

NetworkModeAwsvpc = "awsvpc"

amogh09 · 2024-09-12T00:14:26Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+
+	interfaceName := taskMetadata.TaskNetworkConfig.NetworkNamespaces[0].NetworkInterfaces[0].DeviceName


Probably out of scope for this PR but will Agent always populate device name in TaskResponse returned by GetTaskMetadata method? I think we would want to restrict that to fault injection use case only since computing device name is costly, at least for host network mode tasks.

amogh09 · 2024-09-12T00:16:04Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+	// The command above gives the output of "tc q show dev {INTERFACE} parent 1:1" in json format.
+	// We will then unmarshall the json string and evaluate the fields of it.
+	tcCheckInjectionCommandComposed := nsenterPrefix + fmt.Sprintf(tcCheckInjectionCommandString, interfaceName)
+	fmt.Println(tcCheckInjectionCommandComposed)


Probably don't want this

amogh09 · 2024-09-12T00:38:33Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+		var stringToBeLogged string
+		var httpStatusCode int
+		if err != nil {
+			responseBody = types.NewNetworkFaultInjectionErrorResponse(err.Error())


err contains the underlying error returned by the Linux commands. Should we return a more deterministic error to the caller instead?

amogh09 · 2024-09-12T01:18:48Z

ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go

+				if lossRandom := options.(map[string]interface{})["loss-random"]; lossRandom != nil {
+					if loss := lossRandom.(map[string]interface{})["loss"]; loss != nil {


IMO Since the output is coming from a command I think we should be a bit more defensive in our parsing. Currently if any of these type assertions fail then Agent will panic. An example output is [{"kind":"netem","options":{"loss-random":5}}].

How about checking the type assertion success using something like below?

if lossRandom, ok := options.(map[string]interface{}); ok {}
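Extending that suggestion to every level of the lookup, a sketch of fully defensive parsing with comma-ok assertions. The function name netemLoss is hypothetical; the point is that an unexpected type at any level results in "not found" instead of a panic.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// netemLoss walks the generic JSON structure using comma-ok type assertions.
// It returns the loss value as tc reports it and whether it was found.
func netemLoss(tcOutput []byte) (float64, bool, error) {
	var qdiscs []map[string]interface{}
	if err := json.Unmarshal(tcOutput, &qdiscs); err != nil {
		return 0, false, fmt.Errorf("unmarshal tc output: %w", err)
	}
	for _, q := range qdiscs {
		// A missing "kind" key yields a nil interface; the assertion then
		// fails with ok == false rather than panicking.
		if kind, ok := q["kind"].(string); !ok || kind != "netem" {
			continue
		}
		options, ok := q["options"].(map[string]interface{})
		if !ok {
			continue
		}
		lossRandom, ok := options["loss-random"].(map[string]interface{})
		if !ok {
			continue
		}
		// encoding/json decodes JSON numbers into float64 for interface{} targets.
		if loss, ok := lossRandom["loss"].(float64); ok {
			return loss, true, nil
		}
	}
	return 0, false, nil
}

func main() {
	samples := []string{
		`[{"kind":"netem","options":{"loss-random":{"loss":0.06,"correlation":0}}}]`,
		`[{"kind":"netem","options":{"loss-random":5}}]`,
		`[{"handle":"10:"}]`,
	}
	for _, s := range samples {
		loss, found, err := netemLoss([]byte(s))
		fmt.Printf("loss=%v found=%v err=%v\n", loss, found, err)
	}
}
```

The third sample has no kind key at all, which corresponds to the extra test case suggested in the review.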

tshan2001 requested a review from a team as a code owner September 5, 2024 21:31

tshan2001 added the bot/test label Sep 5, 2024

amazon-ecs-bot removed the bot/test label Sep 5, 2024

tshan2001 force-pushed the dev branch 2 times, most recently from f89a1d0 to 588af68 Compare September 5, 2024 21:56

tshan2001 requested review from xxx0624, amogh09 and mye956 September 5, 2024 22:04

tshan2001 added the bot/test label Sep 6, 2024

amazon-ecs-bot removed the bot/test label Sep 6, 2024

tshan2001 force-pushed the dev branch from 588af68 to 70ed058 Compare September 10, 2024 22:18

xxx0624 reviewed Sep 11, 2024

tshan2001 force-pushed the dev branch 2 times, most recently from 5b987b0 to e521383 Compare September 11, 2024 22:33

tshan2001 requested a review from JoseVillalta September 11, 2024 22:34

tshan2001 force-pushed the dev branch 2 times, most recently from ce5220f to 912072f Compare September 11, 2024 22:44

mye956 reviewed Sep 12, 2024

Add check network packet loss implementation

e0bc434

tshan2001 force-pushed the dev branch from 28bd8c4 to e0bc434 Compare September 12, 2024 18:04

mye956 reviewed Sep 12, 2024

tshan2001 requested a review from xxx0624 September 12, 2024 19:44

mye956 reviewed Sep 12, 2024

xxx0624 reviewed Sep 12, 2024

amogh09 reviewed Sep 13, 2024

tshan2001 force-pushed the dev branch from 1923d74 to e0bc434 Compare September 16, 2024 17:56

tshan2001 added 2 commits September 16, 2024 10:56

Merge branch 'aws:dev' into dev

d6fba4b

Merge branch 'dev' into dev

69e37f7

This was referenced Sep 16, 2024

Add check black hole port fault status implementation #4330

Merged

Add start network packet loss implementation #4344

Merged

tshan2001 closed this Sep 18, 2024
