Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Windows integration tests #4304

Merged
merged 20 commits into from
Sep 3, 2024
Merged

Conversation

tinnywang
Copy link
Contributor

@tinnywang tinnywang commented Aug 20, 2024

Summary

Be not deceived by the falsehood of GitHub status checks. Windows integration tests aren't actually running and passing. The PowerShell script uses Bash syntax to set the CGO_ENABLED environment variable, which PowerShell doesn't understand, but the script doesn't fail fast because it doesn't have the PowerShell equivalent of set -e. As a result, the tests never run but the script returns successfully.

$env:ECS_LOGLEVEL = 'debug'; CGO_ENABLED=1 go test -race -tags integration -timeout=40m -v ../agent/engine ../agent/stats ../agent/app

Implementation details

  • 292c4d3 sets $ErrorActionPreference = 'Stop', the PowerShell equivalent of set -e, to make the script fail fast.
  • 4c3d855 updates the go test command to use PowerShell syntax.
  • 626361f fixes compilation errors in the agent package (undefined symbol usage).
  • d742ef1 fixes compilation errors in the ecs-agent package (types not implementing interfaces).
  • e0a70d9 makes TestImageManifestPullInteg only run for Linux. We can't test image manifest pulls on Windows because we don't run a local registry, so images are always cached on the machine running the tests and never get pulled.
  • 8040c20 skips task/container manifest pulled assertions if the tests are running on Windows and moves manifest pull tests into Unix integration test file. We can't test image manifest pulls on Windows because we don't run a local registry, so images are always cached on the machine running the tests and never get pulled.
  • 62438de increases timeouts on tests that were flaking on Windows Server 2019.
  • a2894c6 removes the retry in the PowerShell script because it will always fail. The TestIntegImageCleanup* tests assert that images exist, remove the images, and then assert that the images do not exist. So the images are deleted after the first run and don't exist when the second run starts and expects them to exist. For successful retires, the entire script needs to be rerun so that images can be rebuilt. Note that the linux Make target doesn't retry the go test command.

Testing

  • Ran .\scripts\run-integ-tests.ps1 on Windows Server 2016, 2019, and 2022 and verified that the tests ran and passed.
  • Verified that the PowerShell script returned a non-zero exit code if any command failed.
  • GitHub status checks passed after actually running the tests.

New tests cover the changes: no

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tinnywang tinnywang force-pushed the fix_windows_integ_test branch 5 times, most recently from a64dbf2 to 2eba926 Compare August 27, 2024 18:01
@tinnywang tinnywang force-pushed the fix_windows_integ_test branch from 2eba926 to a2894c6 Compare August 27, 2024 19:28
@tinnywang tinnywang marked this pull request as ready for review August 27, 2024 19:44
@tinnywang tinnywang requested a review from a team as a code owner August 27, 2024 19:44
scripts/run-integ-tests.ps1 Outdated Show resolved Hide resolved
scripts/run-integ-tests.ps1 Outdated Show resolved Hide resolved
agent/engine/engine_windows_integ_test.go Outdated Show resolved Hide resolved
taskEngine, done, _, _ := SetupIntegTestTaskEngine(cfg, nil, t)
defer done()

first := createTestContainerWithImageAndName(testRegistryImage, "first")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use a Windows image from public ECR, is it trivial to support this test on Windows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed testRegistryImage to point the Windows Server image from MCR in 725e2b7. (We're already using it in TestStartStopUnpulledImage and TestStartStopUnpulledImageDigest.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! So do we still need to move TestManifestPulledDoesNotDependOnContainerOrdering and other tests to this unix specific test file?

agent/engine/common_integ_testutil.go Show resolved Hide resolved
amogh09
amogh09 previously approved these changes Aug 29, 2024
return []string{"ping", "-t", "localhost"}
}

// TODO: implement this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please implement this method.

Copy link

@mcregan23 mcregan23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what Tom said. Might as well implement it now so it doesn't get forgotten about

func isDockerRunning() bool {
return true
_, err := os.Stat("//./pipe/docker_engine")
Copy link
Member

@oldschool-engineer oldschool-engineer Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the os.Stat implementation will work. In my testing it always returned the same result regardless of whether docker is running or not.
Here's an alternative implementation that does work.

package main

import (
    "flag"
    "golang.org/x/sys/windows/svc/mgr"
    "log"
)

func isDockerRunning() bool {
    var svcName = "docker"
    flag.StringVar(&svcName, "name", svcName, "name of the service")
    flag.Parse()
    svcMgmt, err := mgr.Connect()
    if err != nil {
        log.Fatalf("Error connecting to service manager: %v", err)
    }

    dockerSvc, err := svcMgmt.OpenService(svcName)
    if err != nil {
        log.Fatalf("Error opening service %q: %v", svcName, err)
    }

    dockerStatus, err := dockerSvc.Query()
    if err != nil {
        log.Fatalf("Error querying docker service: %v", err)
    }
    log.Printf("State:%v, ProcessId:%v, Win32ExitCode: %v", dockerStatus.State, dockerStatus.ProcessId, dockerStatus.Win32ExitCode)
    return dockerStatus.ProcessId != 0
}

func main() {
    dockerStatus := isDockerRunning()
    if dockerStatus != true {
        log.Fatalf("Docker is not running")
    }
    log.Printf("Docker is running")
}
2024/08/30 13:55:25 State:4, ProcessId:4500, Win32ExitCode: 0
2024/08/30 13:55:25 Docker is running
2024/08/30 13:55:04 State:1, ProcessId:0, Win32ExitCode: 1067
2024/08/30 13:55:04 Docker is not running

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have minimal knowledge/experience with Windows development, so bear with me while I explain my testing process...

To exercise the os.Stat implementation of isDockerRunning, I added a test

func TestIsDockerRunning(t *testing.T) {
	assert.True(t, isDockerRunning())
}

which I ran with

go test -tags integration  -count 1 -v ./agent/engine/... -run "TestIsDockerRunning"

I ran get-service docker to check if docker was already running. It was, and the test passed.
I ran stop-service docker to stop docker. The test failed when I reran it, as was expected.
I ran start-service docker to restart docker. The test passed.

Am I not starting/stopping docker correctly? I got the same results when I swapped out the os.Stat implementation with the service manager implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on your test results, it seems that Go is handling it properly if run as part of the integration test framework. Given your test results, I'm OK with your earlier implementation since it is very simple and it works. (Plus, the implementation I provided should be modified to include deferred closures to the service manager and service instance handle.)

My concern was whether Go's os.Stat method can differentiate between named pipes and regular on-disk files. Named pipes are accessed via file system drivers (NPFS), but they are are not part of the actual on-disk file system, which is (typically) NTFS.

Running the code snippet, as-is, on Windows in a stand-alone way doesn't seem to work right, however. That said, connecting through service manager requires the unit tests to be run as an administrator.


Am I not starting/stopping docker correctly?

There is more than one way to start/stop services on Windows. start-service docker and stop-service docker is the PowerShell way to start/stop docker and is perfectly acceptable.

@tinnywang tinnywang merged commit ad039c2 into aws:dev Sep 3, 2024
40 checks passed
@tinnywang tinnywang deleted the fix_windows_integ_test branch September 3, 2024 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants