DRAFT: TAP-21 - Scale-out Architecture for High-volume Repositories #189

Open
wants to merge 7 commits into
base: master

ergonlogic

Coming from theupdateframework/specification/issues/309, and following discussion at the TUF community meeting on 2024-11-01, we've drafted "TAP-21 - Scale-out Architecture for High-volume Repositories".

This TAP is still very much a "draft". Several sections are currently marked "TBD". Our motivation for a PR at this stage is to validate the motivation and rationale, as well as the calculations we undertook to better understand the issues at hand.

Member

@JustinCappos JustinCappos left a comment

Can you remove the changes to tap1? If they need to be part of a PR, they can be a separate one.

@ergonlogic
Author

> Can you remove the changes to tap1? If they need to be part of a PR, they can be a separate one.

Yes, of course. I was separately experimenting with the idea of turning the headers into YAML front-matter. Not sure how those changes snuck in here.

tap21.md Outdated

One concern with the present proposal is that it lacks a global snapshot. See [Security Analysis](#security-analysis) (below) for a more detailed discussion on this topic.

This concern is only relevant to registries where package versions are tightly-coupled (such as Debian Apt repositories). However, package versions are loosely-coupled in registries of language libraries (e.g. Packagist, PyPI), which is the intended audience for this TAP.

Even a loosely coupled registry might want to provide consistent snapshots to make mirroring the repository easier. Setting up a mirror involves getting the latest snapshot, and downloading everything in the latest snapshot. Updating a mirror involves getting the latest snapshot, diffing with the previous snapshot, and downloading the changes. All of this is significantly harder without a snapshotting system.
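
To make the mirror-update step concrete, here is a minimal sketch of diffing two snapshots. It assumes each snapshot has already been verified and reduced to a plain mapping of metadata filename to version number; the function name `diff_snapshots` and the sample filenames are hypothetical, not part of any TUF client API.

```python
def diff_snapshots(previous: dict, latest: dict) -> dict:
    """Compare two {metadata filename: version} mappings (an assumed, simplified shape)."""
    # Files listed only in the latest snapshot must be downloaded for the first time.
    added = sorted(name for name in latest if name not in previous)
    # Files listed only in the previous snapshot can be dropped from the mirror.
    removed = sorted(name for name in previous if name not in latest)
    # Files present in both but with a different version must be re-downloaded.
    changed = sorted(
        name for name in latest
        if name in previous and latest[name] != previous[name]
    )
    return {"added": added, "changed": changed, "removed": removed}


# Hypothetical snapshot contents for illustration only.
previous = {"bin_0.json": 3, "bin_1.json": 7}
latest = {"bin_0.json": 3, "bin_1.json": 8, "bin_2.json": 1}
print(diff_snapshots(previous, latest))
```

A mirror would then fetch only the `added` and `changed` entries.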

Author

While snapshot metadata is convenient as a mirror manifest, presumably comparably diff'able data can be generated relatively easily from filesystem "modified" timestamps.

Member

@trishankatdatadog trishankatdatadog left a comment

Thanks for sending your PR, Christopher and Derek! Left some comments based on one round of reading.

I think I see part of your problem but not its entire context. Do you think getting on a call might help?

## High churn in TUF metadata using hashed bins

Hashed bins bundle targets together, distributed evenly across multiple `bin_n` metadata files. A release of any package in the registry has an equal chance to update any given `bin_n` metadata. As a result, packages that are not in use for a given client project will still require re-download of `bin_n` metadata on an ongoing basis.
Member

But why? Again, this depends on the type of clients above. Most clients are going to be lazy rather than "eager" like mirrors, and so I don't see them needing to download all bin_n metadata, only what they need at the moment.

It's not as clear as I would like it to be (see step 7 in the spec here), but TUF clients are supposed to perform a lazy evaluation of targets using a preorder DFS.
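
As a rough illustration of that lazy, preorder depth-first evaluation, the sketch below walks a toy delegation graph and records which role metadata would be fetched. The `ROLES` table, the glob patterns, and `find_target` are all hypothetical simplifications; real clients also handle terminating delegations, thresholds, and signature checks.

```python
from fnmatch import fnmatch

# Hypothetical in-memory role graph: role -> (targets it signs, ordered delegations).
ROLES = {
    "targets": (set(), [("bins", "*")]),
    "bins": (set(), [("bin_0", "a*"), ("bin_1", "b*")]),
    "bin_0": ({"alpha.tar.gz"}, []),
    "bin_1": ({"beta.tar.gz"}, []),
}


def find_target(path, role="targets", fetched=None):
    """Preorder DFS: check the current role first, then matching delegations in order."""
    fetched = [] if fetched is None else fetched
    fetched.append(role)  # stands in for downloading this role's metadata
    targets, delegations = ROLES[role]
    if path in targets:
        return role, fetched
    for child, pattern in delegations:
        # Only descend into delegations whose pattern matches, and never revisit a role.
        if fnmatch(path, pattern) and child not in fetched:
            found, _ = find_target(path, child, fetched)
            if found is not None:
                return found, fetched
    return None, fetched


print(find_target("beta.tar.gz"))
```

In this toy graph, resolving `beta.tar.gz` never touches `bin_0`, which is the lazy behaviour described above.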

Author

TAP-21 does not suggest that clients need to download all bin_n metadata.

Clients do need to download the latest version of each bin_n metadata file that is relevant to the targets that they require. Each of those (relevant) bin_n metadata files includes targets that are not required. New releases of extraneous targets can produce new versions of relevant bin_n metadata.

Therefore, clients need to download new bin_n metadata even if there have been no releases of required targets.
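
A small sketch may help show why unrelated targets end up sharing a bin. It assigns each target path to a bin from the SHA-256 of the path; taking the first 32 bits modulo the bin count is a simplification chosen for illustration (real hashed-bin delegations match on hash prefixes), and the package names are made up.

```python
import hashlib
from collections import defaultdict


def bin_for(target_path: str, num_bins: int) -> int:
    """Map a target path to a bin index (simplified: 32-bit hash prefix modulo num_bins)."""
    digest = hashlib.sha256(target_path.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % num_bins


def group_by_bin(target_paths, num_bins):
    """Group target paths by the bin that would list them."""
    bins = defaultdict(list)
    for path in target_paths:
        bins[bin_for(path, num_bins)].append(path)
    return dict(bins)


# Hypothetical package names. With 9 targets and 4 bins, at least one bin must
# hold 3 or more targets, so a release of any of them bumps that bin's version.
targets = [f"vendor{i}/package{i}" for i in range(9)]
bins = group_by_bin(targets, num_bins=4)
for index, members in sorted(bins.items()):
    print(index, members)
```

A client that needs only one target in a crowded bin still has to re-download that bin whenever any of its neighbours is released.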

Member

@trishankatdatadog trishankatdatadog Dec 4, 2024

OK, I think I now see the issues (in increasing order of significance):

  1. The bins role metadata size can be reduced quite a lot with succinct hashed bin delegations.
  2. Do you need hashes (SHA2-256 and SHA2-512) in your snapshot metadata? You really only need them for protection against malicious mirrors (see Section 5.6 here). If not, you should be able to shave a lot off your snapshot by distributing only version numbers. Even if you do need hashes, have you calculated what snapshot Merkle trees can buy you?
  3. What are the dependencies between your packages? How do you plan to solve for mix-and-match and rollback attacks there?

Author

> 1. The bins role metadata size can be reduced quite a lot with succinct hashed bin delegations.

We address this in the Motivation section: Hashed bin delegation. TAP-21 is primarily concerned with the churn of bin_n metadata. bins metadata, by virtue of being essentially static, does not factor into it.

Also, since we expect the top-level repo to require hashed bins, implementing TAP-15 would reduce overall TUF metadata when implementing TAP-21 as well.

> 2. Do you need hashes (SHA2-256 and SHA2-512) in your snapshot metadata?

Including both SHA2-256 and SHA2-512 does seem redundant. I've updated the calculator spreadsheet to reflect removing SHA2-512 hashes. This streamlining also reduces both top-level and sub-repo metadata sizes (though the latter is not reflected in the calculations yet).

> You really only need them for protection against malicious mirrors (see Section 5.6 here). If not, you should be able to shave a lot off your snapshot by distributing only version numbers.

Very large repositories are almost certainly going to use mirrors. In my (admittedly limited) experience, they do want to protect against the potential of malicious mirrors. I proposed just distributing version numbers, and was asked to keep the hashes.

However, this streamlining would also reduce both top-level and sub-repo metadata sizes in TAP-21.

> Even if you do need hashes, have you calculated what snapshot Merkle trees can buy you?

Not directly, no. We did address this in the Motivation section: Snapshot Merkle Trees (TAP 16). For ease of comparison, I assumed a 90% reduction in snapshot metadata size using TAP-16. While reducing overall download size, this has no effect on bin_n churn.

Also, a recent Slack discussion on TAP-16 suggests there may be other performance issues with this approach.

> 3. What are the dependencies between your packages?

For our primary use case, package dependencies are explicitly listed in the registry indexes, and resolved (independently of TUF) by the package manager. This is what we refer to in the TAP as "loosely-coupled".

> How do you plan to solve for mix-and-match and rollback attacks there?

We address both mix-and-match and rollback attacks in the Security Analysis section. Did you have a specific concern that we did not address?

To validate these hypotheses, the authors of this TAP created a [TAP-21 metadata overhead calculator](https://docs.google.com/spreadsheets/d/1Q1BPtS5T92e7Djx6878I0MdAOktg8hh73dEvMVhXIms) (based on the [calculator from PEP 458](https://docs.google.com/spreadsheets/d/11_XkeHrf4GdhMYVqpYWsug6JNz5ZK6HvvmDZX0__K2I)). The results for hashed bins (without this TAP) are outlined below:

Assumptions:
- (A1) Number of targets: 5M
Member

Are these specifically all versions of all packages? One optimisation we did for PEP 458 is that we only sign the "indirect" simple HTML indices that point to the packages themselves. So instead of signing ~6M releases, we sign only ~600K simple indices.

Author

How does the client verify the packages themselves? Do the indices contain the package hashes and file sizes?

Member

Yes: assumption is that the project indices contain at least the hashes, if not also the sizes. We can set a default upper bound on the file size in the TUF clients to avoid endless data attacks anyway.
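
For illustration, a client-side upper bound on download size could look like the sketch below. The names `read_bounded` and `DownloadTooLargeError` are hypothetical, and the stream is any file-like object with a `read` method; this is not taken from an existing TUF client.

```python
import io


class DownloadTooLargeError(Exception):
    """Raised when a download exceeds the configured upper bound."""


def read_bounded(stream, max_bytes: int, chunk_size: int = 8192) -> bytes:
    """Read a stream to the end, aborting once more than max_bytes have arrived."""
    received = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(received)
        received.extend(chunk)
        # Stop early so a server that never ends the response cannot exhaust memory.
        if len(received) > max_bytes:
            raise DownloadTooLargeError(f"exceeded {max_bytes} bytes")


# Usage with an in-memory stream standing in for a network response.
data = read_bounded(io.BytesIO(b"x" * 100), max_bytes=100)
print(len(data))
```

When the index does list a size, that value could be passed as `max_bytes` instead of the default bound.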

Author

> Are these specifically all versions of all packages? One optimisation we did for PEP 458 is that we only sign the "indirect" simple HTML indices that point to the packages themselves. So instead of signing ~6M releases, we sign only ~600K simple indices.

As noted elsewhere in this PR, this reduces the number of bin_n metadata required for a project to just those containing the required indices; as opposed to the bins for both indices and packages. This reduces the overall size of the downloaded metadata per update.

However, the frequency at which TUF metadata changes remains constant, since each new package release requires an update to the associated index. Thus, it does not reduce the frequency that bin_n metadata requires an update. Both the dividend and the divisor are halved in the relevant calculations:

(C2) Number of bins downloaded in example project: 100 [ C2 = 2 x A4 ] (1 for each of package and index)
(C3) Average number of bins requiring update per minute: 0.3656 [ C3 = C1 x C2 ]
(C4) Time to require full metadata refresh: 274 minutes [ C4 = C2/C3 ]

Drastically reducing the number of targets suggests that the number of bins should also be reduced. This, perhaps counterintuitively, results in higher bin_n churn:

Assumptions:
- (A1) Number of targets: 600K
- (A2) Frequency of new releases: 60/minute
- (A3) Number of bins: 2048
- (A4) Number of dependencies in example project: 50

Calculations:
- (C1) Likelihood that a bin will require an update in one minute: 2.89% [ C1 = 1 - ( (A3-1) / A3 )^A2 ]
- (C2) Number of bins downloaded in example project: 50 [ C2 = A4 ] (only indices)
- (C3) Average number of bins requiring update per minute: 1.4439 [ C3 = C1 x C2 ]
- (C4) Time to require full metadata refresh: 35 minutes [ C4 = C2/C3 ]

Note that A1 is not used in any of these calculations. It merely informs the optimal number of bins.
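
The C1, C3 and C4 formulas can be checked with a few lines of Python. This is only a sketch of the arithmetic quoted in this comment; the function and parameter names are made up here and do not come from the calculator spreadsheet.

```python
def bin_churn(releases_per_minute: int, num_bins: int, bins_tracked: int):
    """Return (C1, C3, C4) using the formulas quoted in this thread."""
    # C1: probability that a given bin is touched by at least one release in a minute.
    c1 = 1 - ((num_bins - 1) / num_bins) ** releases_per_minute
    # C3: expected number of tracked bins needing an update per minute.
    c3 = c1 * bins_tracked
    # C4: minutes until the expected number of updates equals the number of tracked bins.
    c4 = bins_tracked / c3
    return c1, c3, c4


# A2 = 60 releases/minute, A3 = 2048 bins, C2 = 50 tracked bins.
c1, c3, c4 = bin_churn(releases_per_minute=60, num_bins=2048, bins_tracked=50)
print(f"C1={c1:.2%}  C3={c3:.4f}/min  C4={c4:.0f} min")
```

Calling it with `num_bins=16384` and `bins_tracked=100` gives roughly 0.37%, 0.3656 and 274 minutes, which matches the earlier figures; the 16,384 bin count is inferred from the 0.37% value rather than stated in this thread.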

Calculations:
- (C1) Likelihood that a bin will require an update in one minute: 0.37%
[ C1 = 1 - ( (A3-1) / A3 )^A2 ]
- (C2) Number of bins downloaded in example project: 100
Member

I think C2 can be reduced to A4 if the bins sign only the indices instead of the packages themselves.

Author

This is also reduced to A4 by having the index and packages all be covered by the same sub-repo.


Instead of scaling-up a single TUF repository, this TAP proposes scaling-out to multiple smaller repositories (e.g. per-vendor, per-package, etc.). This dramatically reduces the rate at which metadata must be refreshed, since only packages actually used within a project are tracked.

However, this approach introduces a Trust-On-First-Use (TOFU) issue for the root metadata for each of these TUF repos, as it would not be feasible to ship them with the client, as is currently specified. This challenge can be overcome with the use of a higher-level TUF repo, whose targets are the initial root metadata of the "sub-repos". Only the root metadata for the top-level TUF repository would have to ship with the client.
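
One way to picture the bootstrap step is the sketch below, which checks a downloaded sub-repo root file against the length and SHA-256 hash that the top-level repository's targets metadata would list for it. The function name, the sample bytes, and the idea of passing the expected values in directly are illustrative assumptions; a real client would obtain them from verified top-level metadata.

```python
import hashlib
import hmac


def verify_sub_repo_root(root_bytes: bytes, expected_length: int, expected_sha256: str) -> bool:
    """Check a sub-repo's initial root metadata against values from the top-level repo."""
    # Reject early if the size differs from what the top-level targets metadata lists.
    if len(root_bytes) != expected_length:
        return False
    actual = hashlib.sha256(root_bytes).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(actual, expected_sha256)


# Hypothetical sub-repo root content and the matching top-level target entry.
root_bytes = b'{"signed": {"_type": "root", "version": 1}}'
entry = {"length": len(root_bytes), "sha256": hashlib.sha256(root_bytes).hexdigest()}
print(verify_sub_repo_root(root_bytes, entry["length"], entry["sha256"]))
```

Only after this check passes would the client treat the file as the trusted initial root for that sub-repo and continue with the usual root update workflow.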
Member

I like this idea of using a "high-level" TUF repo to distribute the root metadata for other "low-level" TUF "subrepos": it's something I had considered myself for other use cases (e.g., a company having multiple TUF repos for different packages, and wishing to have a cleaner separation between them, especially when there are no dependencies between these packages as appears to be your own use case).

However, I think that this is a different problem than what you are really describing below, and thus the solution doesn't fit your real problem.

Author

I agree that this solution seems promising for a number of other use-cases. However, we believe that it does directly address the challenges outlined in TAP-21's Motivation and Rationale sections.

> [...] I think that this is a different problem than what you are really describing below, and thus the solution doesn't fit your real problem.

Based on your earlier comments (e.g. "I don't see [clients] needing to download all bin_n metadata, only what they need at the moment"), we may not have been sufficiently clear about the problem we're trying to solve. That said, if you think we've misidentified the root cause, I'm willing to re-evaluate.

Member

Let's continue discussing in our video meetings, and we'll get back here once we agree on some ways forward.

ergonlogic and others added 2 commits December 2, 2024 09:25
4 participants