Add check network packet loss implementation #4322

Closed
wants to merge 3 commits into from

@tshan2001 tshan2001 commented Sep 5, 2024

Summary

Adding implementation for CheckNetworkPacketLoss() and corresponding unit tests. Also updating the os/exec wrapper to include 2 more helper methods to facilitate mocks in unit testing.

Implementation details

After the TMDS server receives the request to check for network packet loss, the following Linux command will be executed:

tc -j q

The command output is a byte array containing a JSON string. The output is then unmarshalled, and we check whether an entry like the following exists:

{
"kind":"netem"
...
}
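
As a rough illustration of the check described above, the sketch below unmarshals `tc -j` style output and looks for an entry whose kind is netem. The helper name `hasNetemQdisc` and the sample input are assumptions made for illustration; this is not the code in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasNetemQdisc is a hypothetical helper: it reports whether any entry in the
// JSON array printed by `tc -j` has "kind" set to "netem".
func hasNetemQdisc(tcOutput []byte) (bool, error) {
	var qdiscs []map[string]interface{}
	if err := json.Unmarshal(tcOutput, &qdiscs); err != nil {
		return false, fmt.Errorf("unmarshal tc output: %w", err)
	}
	for _, q := range qdiscs {
		// The comma-ok assertion avoids a panic when "kind" is missing or not a string.
		if kind, ok := q["kind"].(string); ok && kind == "netem" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Sample input shaped like the snippet above; real output has more fields.
	found, err := hasNetemQdisc([]byte(`[{"kind":"netem","handle":"10:","parent":"1:1"}]`))
	fmt.Println(found, err)
}
```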

Testing

Unit tests for the TMDS package were run.

 % go test -tags unit -v -run TestCheckNetworkPacketLoss /workplace/tianzes/amazon-ecs-agent/ecs-agent/tmds/handlers/fault/v1/handlers
...
--- PASS: TestCheckNetworkPacketLoss (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_success-running (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_success-not-running (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_success-not-running-but-latency-is-running (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_only-one-ip-exists-not-running (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_unknown_request_body (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_failed_to_unmarshal_json (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_malformed_request_body_1 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_malformed_request_body_2 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_incomplete_request_body_1 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_incomplete_request_body_2 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_incomplete_request_body_3 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_LossPercent_in_the_request_body_1 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_LossPercent_in_the_request_body_2 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_LossPercent_in_the_request_body_3 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_IP_value_in_the_request_body_1 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_IP_CIDR_block_value_in_the_request_body_2 (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_task_lookup_fail (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_task_metadata_fetch_fail (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_task_metadata_unknown_fail (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_fault_injection_disabled (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_invalid_network_mode (0.00s)
    --- PASS: TestCheckNetworkPacketLoss/check_network_packet_loss_empty_task_network_config (0.00s)

New tests cover the changes:
3 unhappy test cases were added:

  1. Call was successful but there's no fault running
  2. Call was successful but got internal error when processing the Linux command
  3. Call was successful, but there's network latency fault running instead of network packet loss fault

Manual Testing

Launched a Fargate Instance with the changes. Launched a task with ecs-exec enabled.

# First curl the TMDS endpoint from the task without any fault on the instance
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1"]}'
{"Status":"not-running"}

# Log into the instance as 'su', manually inject a fault with 10% loss with IP '192.168.0.1' associated
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc qdisc add dev eth1 root handle 1: prio priomap 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc qdisc add dev eth1 parent 1:1 handle 10: netem loss 10%
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc filter add dev eth1 protocol ip parent 1:0 prio 1 u32 match ip dst 192.168.0.1 flowid 1:1

# Curl the TMDS endpoint with the same config, result should be running
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1"]}'
{"Status":"running"}

# Now change the lossPercent in payload, result should be not-running
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":6, "Sources":["192.168.0.1"]}'
{"Status":"not-running"}

# Now add an additional IP address 10.1.1.1, and curl the endpoint with both IPs, result should be running
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc filter add dev eth1 protocol ip parent 1:0 prio 1 u32 match ip dst 10.1.1.1 flowid 1:1
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1", "10.1.1.1"]}'
{"Status":"running"}

# If we only curl the endpoint with one of the IPs, the result should be running as well:
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1"]}'
{"Status":"running"}

# Lastly, if we add an unfiltered ip in the payload, result should be not-running
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1", "10.1.1.1", "1.1.1.1"]}'
{"Status":"not-running"}

# Edge case: manually add a network latency fault, and curl the check packet loss endpoint. The result should be not running
root # nsenter --net=/var/run/netns/8aacba9e7ef74030a9310de01f3ca80e-028c5a03d94b tc q show dev eth1
qdisc prio 1: root refcnt 9 bands 3 priomap 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
qdisc netem 10: parent 1:1 limit 1000 delay 100ms
sh-5.2# curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":10, "Sources":["192.168.0.1", "10.1.1.1", "1.1.1.1"]}'
{"Status":"not-running"}
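
The manual tests above suggest that the check reports running only when every requested source IP has a matching tc filter. One way to sketch that rule is shown below. It assumes the filter listing prints each u32 match as the IPv4 address in hexadecimal (for example `match c0a80001/ffffffff at 16`) and handles plain IPv4 addresses only, not CIDR blocks. The names `ipv4ToHex` and `allSourcesFiltered` and the sample listing are hypothetical.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net"
	"strings"
)

// ipv4ToHex converts a dotted IPv4 address such as "192.168.0.1" to "c0a80001".
func ipv4ToHex(ip string) (string, error) {
	parsed := net.ParseIP(ip).To4()
	if parsed == nil {
		return "", fmt.Errorf("not a valid IPv4 address: %q", ip)
	}
	return hex.EncodeToString(parsed), nil
}

// allSourcesFiltered reports whether every source IP appears as a match in the
// filter listing. A single missing IP makes the result false.
func allSourcesFiltered(filterListing string, sources []string) (bool, error) {
	for _, src := range sources {
		h, err := ipv4ToHex(src)
		if err != nil {
			return false, err
		}
		// Anchoring on "match " and "/" avoids matching part of another address.
		if !strings.Contains(filterListing, "match "+h+"/") {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Illustrative listing for filters on 192.168.0.1 and 10.1.1.1.
	listing := "match c0a80001/ffffffff at 16\nmatch 0a010101/ffffffff at 16\n"

	ok, err := allSourcesFiltered(listing, []string{"192.168.0.1", "10.1.1.1"})
	fmt.Println(ok, err)

	ok, err = allSourcesFiltered(listing, []string{"192.168.0.1", "10.1.1.1", "1.1.1.1"})
	fmt.Println(ok, err)
}
```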

Description for the changelog

Add check network packet loss implementation

Additional Information

Does this PR include breaking model changes? If so, Have you added transformation functions?

Does this PR include the addition of new environment variables in the README?

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tshan2001 tshan2001 force-pushed the dev branch 2 times, most recently from 5b987b0 to e521383 Compare September 11, 2024 22:33
@tshan2001 tshan2001 force-pushed the dev branch 2 times, most recently from ce5220f to 912072f Compare September 11, 2024 22:44
// runExecCommand wraps around the execwrapper, providing a convenient way of running any Linux command
// and getting the result in both stdout and stderr.
func (h *FaultHandler) runExecCommand(ctx context.Context, linuxCommandString string) ([]byte, error) {
cmdExec := h.osExecWrapper.CommandContext(ctx, "/bin/sh", "-c", linuxCommandString)
Contributor

Not quite sure if the /bin/sh path exists within the agent container. Will need to double check this. Is it possible to make the call without this just like with the standard os/exec go library? (e.g. exec.Command("ls -al"))

Contributor Author

Yes. I just pushed a new revision that removes "/bin/sh" from the command. Also re-ran all the manual testing to make sure that it still works.
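
For context, os/exec does not interpret a command line the way a shell does: the program name and each argument must be passed separately, so a single string has to be split first. The sketch below shows one way to do that without `/bin/sh`. The names `splitCommand` and `runCommand` are hypothetical, and splitting on whitespace means quoted arguments containing spaces are not supported.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// splitCommand splits a command line on whitespace into a program name and
// its arguments. strings.Fields tolerates repeated spaces.
func splitCommand(command string) (string, []string, error) {
	fields := strings.Fields(command)
	if len(fields) == 0 {
		return "", nil, errors.New("empty command")
	}
	return fields[0], fields[1:], nil
}

// runCommand runs the command directly (no shell) and returns stdout and
// stderr combined.
func runCommand(ctx context.Context, command string) ([]byte, error) {
	name, args, err := splitCommand(command)
	if err != nil {
		return nil, err
	}
	return exec.CommandContext(ctx, name, args...).CombinedOutput()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// On a machine without tc or eth0 this simply prints the resulting error.
	out, err := runCommand(ctx, "tc -j q show dev eth0 parent 1:1")
	fmt.Printf("output=%q err=%v\n", out, err)
}
```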

}
// Log the command output to better help us debug.
logger.Info(fmt.Sprintf("%s command result: %s", tcCheckInjectionCommandComposed, string(cmdOutput[:])))
var outputUnmarshalled []map[string]interface{}
Contributor

Q: Instead of parsing/unmarshaling the output into a map[string]interface{}, I wonder if we can parse it into a struct and then get only the needed json keys from the output? (If it's not possible or the output is not deterministic, then please disregard.)

Just ran tc -j q show dev eth0 parent 1:1 locally and it seems to output the following:

[{
    "kind": "netem",
    "handle": "10:",
    "parent": "1:1",
    "options": {
        "limit": 1000,
        "loss-random": {
            "loss": 0.5,
            "correlation": 0
        },
        "ecn": false,
        "gap": 0
    }
}]

Contributor Author

tc -j q show dev eth0 does not always have all the fields. For example, when running on a host that doesn't have a fault injected, it can have the following output:

[{"kind":"pfifo_fast","handle":"0:","parent":":1","options":{"bands":3,"priomap":[1,2,2,2,1,2,0,0,1,1,1,1,1,1,1,1],"multiqueue":false}}]

There's no loss-random field here. Also, when specifying parent 1:1, the output can also be [].

Unmarshalling it into a map[string]interface{} and indexing the json struct iteratively would allow us to better catch the nil when parsing it
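
For comparison, here is a sketch of the struct-based alternative raised in this thread. Pointer fields decode to nil when a key is absent, so the pfifo_fast output and the empty array can both be handled without type assertions. The type names, `lossFaultRunning`, and the 0.01 tolerance are assumptions for illustration; the JSON keys come from the sample outputs quoted above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

type lossRandom struct {
	Loss float64 `json:"loss"`
}

type qdiscOptions struct {
	// Pointer so that a missing "loss-random" key decodes to nil.
	LossRandom *lossRandom `json:"loss-random"`
}

type qdisc struct {
	Kind    string        `json:"kind"`
	Options *qdiscOptions `json:"options"`
}

// lossFaultRunning reports whether a netem qdisc with the requested loss
// percentage exists. tc prints loss as a fraction, so 6% appears as 0.06.
func lossFaultRunning(tcOutput []byte, lossPercent float64) (bool, error) {
	var qdiscs []qdisc
	// A field of an unexpected JSON type yields an error here, not a panic.
	if err := json.Unmarshal(tcOutput, &qdiscs); err != nil {
		return false, err
	}
	for _, q := range qdiscs {
		if q.Kind != "netem" || q.Options == nil || q.Options.LossRandom == nil {
			continue
		}
		// Assumed tolerance to absorb floating-point rounding.
		if math.Abs(q.Options.LossRandom.Loss*100-lossPercent) < 0.01 {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	out := []byte(`[{"kind":"netem","handle":"10:","parent":"1:1","options":{"limit":1000,"loss-random":{"loss":0.5,"correlation":0},"ecn":false,"gap":0}}]`)
	fmt.Println(lossFaultRunning(out, 50))
	fmt.Println(lossFaultRunning([]byte(`[]`), 50))
}
```

The trade-off is that unknown or extra fields are silently ignored, while a known field with the wrong JSON type surfaces as an unmarshal error for the whole output.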

@tshan2001 tshan2001 requested a review from xxx0624 September 12, 2024 19:44
// and getting the result in both stdout and stderr.
func (h *FaultHandler) runExecCommand(ctx context.Context, linuxCommandString string) ([]byte, error) {
commandArray := strings.Split(linuxCommandString, " ")
cmdExec := h.osExecWrapper.CommandContext(ctx, commandArray[0], commandArray[1:]...)
Contributor

Q: Should there be a cmdExec.Run() call somewhere in this method?

Contributor Author

@tshan2001 tshan2001 Sep 12, 2024

cmdExec.CombinedOutput() has the return signature ([]byte, error), which is the same as the return signature of this method, so we return the result of cmdExec.CombinedOutput() directly. CombinedOutput() is a standard method in os/exec. It runs the command and returns stdout and stderr together (whereas Output() returns only stdout; on a non-zero exit code, stderr is available only through the returned *exec.ExitError).

Contributor

Ah I see now
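
For reference, a small sketch of how the two os/exec methods behave when a command writes to both streams and exits non-zero. It assumes a POSIX `sh` is on PATH purely for the demonstration; `runBoth` and the script are hypothetical and unrelated to the agent code.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runBoth runs the same script twice and returns what each method captured:
// the CombinedOutput result, the Output result, and the stderr that Output
// attaches to *exec.ExitError when the command exits non-zero.
func runBoth(script string) (combined, stdout, stderr []byte) {
	combined, _ = exec.Command("sh", "-c", script).CombinedOutput()

	var err error
	stdout, err = exec.Command("sh", "-c", script).Output()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		stderr = exitErr.Stderr
	}
	return combined, stdout, stderr
}

func main() {
	combined, stdout, stderr := runBoth("echo out; echo err 1>&2; exit 3")
	fmt.Printf("CombinedOutput: %q\n", combined)
	fmt.Printf("Output:         %q\n", stdout)
	fmt.Printf("ExitError:      %q\n", stderr)
}
```

Both methods call Run internally, which is why no separate cmdExec.Run() call is needed.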

// The command above gives the output of "tc q show dev {INTERFACE} parent 1:1" in json format.
// We will then unmarshall the json string and evaluate the fields of it.
tcCheckInjectionCommandComposed := nsenterPrefix + fmt.Sprintf(tcCheckInjectionCommandString, interfaceName)
fmt.Println(tcCheckInjectionCommandComposed)
Contributor

Please use logger instead.

fmt.Println(tcCheckInjectionCommandComposed)
cmdOutput, err := h.runExecCommand(ctx, tcCheckInjectionCommandComposed)
if err != nil {
return false, errors.New("failed to check network-packet-loss-fault: " + string(cmdOutput[:]) + err.Error())
Contributor

Will the cmdOutput be nil if the err is not empty?

ipSources := request.Sources
// If task's network mode is awsvpc, we need to run nsenter to access the task's network namespace.
nsenterPrefix := ""
if networkMode == "awsvpc" {
Contributor

nit: we already have a constant for awsvpc in ECS, i.e.

func New(networkMode string, netNSs ...*NetworkNamespace) (*TaskNetworkConfig, error) {

var responseBody types.NetworkFaultInjectionResponse
var stringToBeLogged string
var httpStatusCode int
if err != nil {
Contributor

Can we fire a metric here if the err is not empty? If this happens, it means there is something wrong in our DP and not customer/client issue.

logger.Info(fmt.Sprintf("%s command result: %s", tcCheckIPCommandComposed, string(cmdOutput[:])))
allIPAddressesInRequestExist := true
for _, ipAddress := range ipSources {
ipAddressInHex, err := convertIPAddressToHex(*ipAddress)
Contributor

Can we use aws.StringValue() instead? for the *ipAddress
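
To illustrate why this matters: dereferencing with `*ipAddress` panics on a nil pointer, while a nil-safe accessor returns an empty string. The local `stringValue` below is a stand-in written for this sketch that mirrors the behavior of `aws.StringValue` from the AWS SDK for Go, so the example stays self-contained.

```go
package main

import "fmt"

// stringValue returns the pointed-to string, or "" when the pointer is nil.
// It is a local stand-in for aws.StringValue.
func stringValue(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}

func main() {
	ip := "192.168.0.1"
	sources := []*string{&ip, nil}
	for _, src := range sources {
		// A plain *src would panic on the nil entry.
		fmt.Printf("%q\n", stringValue(src))
	}
}
```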

invalidNetworkMode = "invalid"
tcLatencyFaultExistsCommandOutput = `[{"kind":"netem","handle":"10:","parent":"1:1","options":{"limit":1000,"delay":{"delay":0.1,"jitter":0,"correlation":0},"ecn":false,"gap":0}}]`
tcLossFaultExistsCommandOutput = `[{"kind":"netem","handle":"10:","dev":"eth0","parent":"1:1","options":{"limit":1000,"loss-random":{"loss":0.06,"correlation":0},"ecn":false,"gap":0}}]`
tcLossFaultDoesNotExistCommandOutput = `[{"kind":"dummyname"}]`
Contributor

probably one more test case without kind

Comment on lines +54 to +55
Output() ([]byte, error)
CombinedOutput() ([]byte, error)
Contributor

I think we should add descriptions for these (and existing) methods to let users know what to expect.

ipSources := request.Sources
// If task's network mode is awsvpc, we need to run nsenter to access the task's network namespace.
nsenterPrefix := ""
if networkMode == "awsvpc" {
Contributor

How about using an existing constant?

// NetworkModeAwsvpc is a NetworkMode enum value
NetworkModeAwsvpc = "awsvpc"

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

interfaceName := taskMetadata.TaskNetworkConfig.NetworkNamespaces[0].NetworkInterfaces[0].DeviceName
Contributor

Probably out of scope for this PR but will Agent always populate device name in TaskResponse returned by GetTaskMetadata method? I think we would want to restrict that to fault injection use case only since computing device name is costly, at least for host network mode tasks.

// The command above gives the output of "tc q show dev {INTERFACE} parent 1:1" in json format.
// We will then unmarshall the json string and evaluate the fields of it.
tcCheckInjectionCommandComposed := nsenterPrefix + fmt.Sprintf(tcCheckInjectionCommandString, interfaceName)
fmt.Println(tcCheckInjectionCommandComposed)
Contributor

Probably don't want this

var stringToBeLogged string
var httpStatusCode int
if err != nil {
responseBody = types.NewNetworkFaultInjectionErrorResponse(err.Error())
Contributor

err contains the underlying error returned by the Linux commands. Should we return a more deterministic error to the caller instead?

Comment on lines +764 to +765
if lossRandom := options.(map[string]interface{})["loss-random"]; lossRandom != nil {
if loss := lossRandom.(map[string]interface{})["loss"]; loss != nil {
Contributor

IMO Since the output is coming from a command I think we should be a bit more defensive in our parsing. Currently if any of these type assertions fail then Agent will panic. An example output is [{"kind":"netem","options":{"loss-random":5}}].

How about checking the type assertion success using something like below?

if lossRandom, ok := options.(map[string]interface{}); ok {}

@tshan2001 tshan2001 closed this Sep 18, 2024

5 participants