Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support (m)TLS API Socket #24601

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

meln5674
Copy link

@meln5674 meln5674 commented Nov 18, 2024

Add flags --tls-cert, --tls-key, --tls-ca/--tls-client-ca to the commands podman remote, podman system service, and podman system connection add to support serving the API socket using TLS and mTLS, as well as connecting to such a socket.

This relies on containers/common#2249 and will fail CI until merged.

Fixes #24583

Does this PR introduce a user-facing change?

* The `podman system service` command now supports serving over tcp with TLS and mTLS
* The `podman system connection add` command now supports creating connections to TLS and mTLS tcp sockets
* The `podman remote` commands now support connecting to TLS and mTLS tcp sockets

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note labels Nov 18, 2024
@github-actions github-actions bot added the kind/api-change Change to remote API; merits scrutiny label Nov 18, 2024
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@Luap99
Copy link
Member

Luap99 commented Nov 19, 2024

@jwhonce @baude @mheon FYI. Had you ever looked at TLS support for the podman service and remote client?
With the amount of code here this looks easy enough to accept to me but I was wondering if there is/was any reason why we do not support it.

@mheon
Copy link
Member

mheon commented Nov 19, 2024 via email

@mheon
Copy link
Member

mheon commented Nov 19, 2024

Just skimmed, but LGTM on approach. A lot less code than I was expecting for certificate auth.

Now, if we have to start providing docs for how to do certificate auth right, I think the amount of changes grows substantially, but I don't think it's really worth it at this point.

@TomSweeneyRedHat
Copy link
Member

Sending the directory location of the private key out over the wire doesn't give me the warm fuzzies. Is it something that other software does?

@jwhonce
Copy link
Member

jwhonce commented Nov 21, 2024

@Luap99 I seem to remember the idea behind using the ssh executable as well as the golang libraries was to push this kind of handshake into that code vs. podman. But @cdoern did that work, so I don't know where it stands.

@meln5674
Copy link
Author

Sending the directory location of the private key out over the wire doesn't give me the warm fuzzies. Is it something that other software does?

That certainly wasn't my intent, but I'm not sure what you're referring to. Is that a consequence of adding it to the conf file that I'm unaware of?

@Luap99
Copy link
Member

Luap99 commented Nov 25, 2024

Sending the directory location of the private key out over the wire doesn't give me the warm fuzzies. Is it something that other software does?

That certainly wasn't my intent, but I'm not sure what you're referring to. Is that a consequence of adding it to the conf file that I'm unaware of?

It is not being send anywhere, both the client and server read the files locally and then use the certificates to perform a normal TLS handshake via the go std lib AFAICT.
@TomSweeneyRedHat What are you referring to by sending the private key over the write?

@Luap99 I seem to remember the idea behind using the ssh executable as well as the golang libraries was to push this kind of handshake into that code vs. podman. But @cdoern did that work, so I don't know where it stands.

@jwhonce I am not sure what are you referring to? The question for me was if there was a specific reason why TLS support was not done before. If not then I think we should accept this PR

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this will need some e2e or system tests where we spawn a server with TLS and the connect with the remove client.

pkg/util/tls.go Outdated
"os"
)

func ReadCertBundle(path string) (*x509.CertPool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you mope this in a new separate package, maybe call it cert and then name the function ReadBundle()

We should really move away from dumping everything into util packages as this causes a lot of unwanted side effects via other imports that really should not have to be imported on the remote client, i.e. #23818. Fixing that will be quite some work but in the meantime we should not add new things there that are needed by the remote client.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 323 to 338
if len(tlsCAFile) != 0 || len(tlsCertFile) != 0 || len(tlsKeyFile) != 0 {
logrus.Debugf("using TLS cert=%s key=%s ca=%s", tlsCertFile, tlsKeyFile, tlsCAFile)
transport.TLSClientConfig = &tls.Config{}
connection.tls = true
}
if len(tlsCAFile) != 0 {
pool, err := util.ReadCertBundle(tlsCAFile)
if err != nil {
return connection, fmt.Errorf("unable to read CA bundle: %w", err)
}
transport.TLSClientConfig.RootCAs = pool
}
if len(tlsCertFile) != 0 && len(tlsKeyFile) != 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition here do not seem to line up perfectly.

If only tlsCertFile is set then the first condition matches and says tls is used but then we never add the the key on the last one. That case should return a hard error I would assume.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent was to have that check performed in the cmd/podman/... packages, but on reflection it makes sense to also perform the same check here, as that wouldn't account for the env vars. Done.

Comment on lines 160 to 182
if cmd.Flags().Changed("tls-cert") {
return errors.New("--tls-cert option not supported for ssh scheme")
}
if cmd.Flags().Changed("tls-key") {
return errors.New("--tls-key option not supported for ssh scheme")
}
if cmd.Flags().Changed("tls-ca") {
return errors.New("--tls-ca option not supported for ssh scheme")
}
return ssh.Create(entities, sshMode)
case "unix":
if cmd.Flags().Changed("identity") {
return errors.New("--identity option not supported for unix scheme")
}
if cmd.Flags().Changed("tls-cert") {
return errors.New("--tls-cert option not supported for unix scheme")
}
if cmd.Flags().Changed("tls-key") {
return errors.New("--tls-key option not supported for unix scheme")
}
if cmd.Flags().Changed("tls-ca") {
return errors.New("--tls-ca option not supported for unix scheme")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicates the same conditions. Move them outside the switch case and match if scheme != tcp so you do not duplicate that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* Moved cert bundle read to separate package
* Removed duplication for non-tcp tls flag check
* Added tls info to `system connection list`
* Removed TCP warning if TLS is enabled
* Fixed not using TLS when using ABI instead of remote
* Added central check for cert without key or vice-versa

Signed-off-by: Andrew Melnick <[email protected]>
@meln5674
Copy link
Author

Also this will need some e2e or system tests where we spawn a server with TLS and the connect with the remove client.

Of course. I was hoping to have my environment capable of passing tests on main before I went on vacation last week, but that didn't end up happening. After throwing up my hands and just making a fresh CentOS VM, I have that working now.

I'd like to use any existing tests for SSH/TCP remotes as a basis for my tests, as well as ensure my tests have parity with them, but for the life of me, I cannot seem to find any such tests. Would you mind pointing me to where they are?

@TomSweeneyRedHat
Copy link
Member

@Luap99 sorry, just back to this. This line: https://github.com/containers/podman/pull/24601/files#diff-cf5cc76e70a369c5221f08f29ec0452710af1f4ecd423c162d5b669688416caeR525 seemed to indicate that you were looking for the directory where the private key resigned. Unless I misinterpreted, that doesn't feel right.

@meln5674
Copy link
Author

meln5674 commented Dec 9, 2024

@Luap99 sorry, just back to this. This line: https://github.com/containers/podman/pull/24601/files#diff-cf5cc76e70a369c5221f08f29ec0452710af1f4ecd423c162d5b669688416caeR525 seemed to indicate that you were looking for the directory where the private key resigned. Unless I misinterpreted, that doesn't feel right.

Yes, the purpose of this line is to create a command line argument so that the user can provide the path to their TLS client private key, which is necessary to perform authentication. Unless I too have misinterpreted something else in the codebase, neither that path, nor especially not the data in that file, ever goes over the wire. This is the analog of the --identity flag for SSH authentication.

* Fixed attach endpoint client using unencrypted socket when TLS is enabled
* Duplicated libpod remote unix socket e2e suite for plaintext TCP, TLS,
  and mTLS

Signed-off-by: Andrew Melnick <[email protected]>
Copy link
Contributor

openshift-ci bot commented Dec 30, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: meln5674
Once this PR has been reviewed and has the lgtm label, please assign luap99 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@meln5674
Copy link
Author

Also this will need some e2e or system tests where we spawn a server with TLS and the connect with the remove client.

Apologies for the delay on this, a number of other items were pushed onto my priority stack, but I've been able to resume work on it.

As mentioned, there weren't any substantial e2e tests for the TCP API, and the libpod remote test suite was entirely using the unix socket, so I have refactored it to allow running it for unix, plain TCP, TLS, and mTLS. This might seem excessive, but it was only by running this complete suite did I find additional issues I had to fix. Suggestions are welcome for how to achieve similar coverage with fewer tests.

Speaking of said issues, it would seem that the container attach client has a subtle race condition in it, as well as a memory leak.

By overwriting the http transport and capturing the net.Conn as a local variable, if the same Connection is used concurrently, there is a possibility that the first net.Conn will be overwritten on the first attach call's local variable, resulting in the two calls proxying each-other's streams, or just outright corrupting them.

As well, because the original transport is never restored, if the same client is used to attach to multiple containers, the closure is never freed, meaning the dial function will call an ever-increasing stack of closures. Fixing this issue felt outside the scope of this PR, but I have added a note warning of it to whoever feels like addressing it in the future.

I would suggest adding some sort of DoRequestWithConn method which handles using the new transport for that request alone and manually performs the dialing, writing the request, and reading the response, and validating the upgrade.

Finally, there were a number of other e2e tests that were failing in my fresh centos 9 environment, even on main, so I have simply ignored them for the time being.

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few comments, I need more time to look at the test changes. But we definitely cannot just explode the matrix and run 4 times the amount.

Comment on lines +685 to +688
$(MAKE) ginkgo-run TAGS="$(REMOTETAGS) remote_testing remote_unix_testing"
$(MAKE) ginkgo-run TAGS="$(REMOTETAGS) remote_testing remote_tcp_testing"
$(MAKE) ginkgo-run TAGS="$(REMOTETAGS) remote_testing remote_tls_testing"
$(MAKE) ginkgo-run TAGS="$(REMOTETAGS) remote_testing remote_mtls_testing"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not acceptable to me, it is certainly great to force coverage. But we run this many times on each Pr. A 4x time increase is not acceptable.

What we can consider is some split testing, we run the test on fedora rawhide, 41, 40 and debian sid so technically would could wire this up in CI ro run each case on a different distro to not add any new overhead will still getting full coverage. The transport layer should certainly not care about the distro (except underlying kernel bugs of course) so I think that may be best option.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this was a "nuclear option" in order to get tests to run with the least amount of change to the tests themselves, and to make sure that I wasn't missing anything.

One thought I had in the interim was that ginkgo allows tagging tests and sets of tests, and selecting a subset of them on the command line. Using this, it would be possible to run a full remote tests via unix sockets, and then a subset of all tests against tcp, tls, and mtls. It would then also be possible to add a separate target to run all tests in all remotes, but wouldn't be run in CI, only on-demand in development environments.

The main thing I would need for this is guidance from the podman core team on which tests they believed were critical to be tested over every possible remote, and which are "good enough" to only be tested over unix. For example, the attach endpoints would definitely need to be tested on all different remotes, as that's how I discovered the additional fixes I had to make.

@@ -118,19 +118,22 @@ func inspect(cmd *cobra.Command, args []string) error {
rpt, err = rpt.Parse(report.OriginUser, format)
} else {
rpt, err = rpt.Parse(report.OriginPodman,
"{{range .}}{{.Name}}\t{{.URI}}\t{{.Identity}}\t{{.Default}}\t{{.ReadWrite}}\n{{end -}}")
"{{range .}}{{.Name}}\t{{.URI}}\t{{.Identity}}\t{{.TLSCAFile}}\t{{.TLSCertFile}}\t{{.TLSKeyFile}}\t{{.Default}}\t{{.ReadWrite}}\n{{end -}}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is a breaking change. While nobody should relay on the order of the output and use --format if they use it in scripts we can never know.
I guess I could be convinced to add them as last keys after readwrite.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree, and while I'm not sure if podman or the the umbrella containers project has strict guidelines on the topic, I've always considered "human readable" data to be excluded from breaking changes. I would no sooner worry about breaking scripts that scrape this output than scripts that break if a new log message was added.

BeforeEach(setupConnectionsConf)

Context("without running API service", func() {
It("add ssh://", func() {
cmd := []string{"system", "connection", "add",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these changes are done by your formatter? Can you revert them, they bloat the diff unnecessarily and make reviewing harder

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 8, 2025
@openshift-merge-robot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/api-change Change to remote API; merits scrutiny needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support (m)TLS API socket
6 participants