DRAFT: TAP-21 - Scale-out Architecture for High-volume Repositories #189
Conversation
Can you remove the changes to tap1? If they need to be part of a PR, they can be a separate one.
Yes, of course. I was separately experimenting with the idea of turning the headers into YAML front-matter. Not sure how those changes snuck in here.
This reverts commit eb6cac4.
tap21.md (Outdated)
One concern with the present proposal is that it lacks a global snapshot. See [Security Analysis](#security-analysis) (below) for a more detailed discussion on this topic.
This concern is only relevant to registries where package versions are tightly-coupled (such as Debian Apt repositories). However, package versions are loosely-coupled in registries of language libraries (e.g. Packagist, PyPI), which is the intended audience for this TAP.
Even a loosely coupled registry might want to provide consistent snapshots to make mirroring the repository easier. Setting up a mirror involves getting the latest snapshot and downloading everything in it. Updating a mirror involves getting the latest snapshot, diffing with the previous snapshot, and downloading the changes. All of this is significantly harder without a snapshotting system.
While snapshot metadata is convenient as a mirror manifest, presumably comparably diff'able data can be generated relatively easily from filesystem "modified" timestamps.
Thanks for sending your PR, Christopher and Derek! Left some comments based on one round of reading.
I think I see part of your problem but not its entire context. Do you think getting on a call might help?
## High churn in TUF metadata using hashed bins
Hashed bins bundle targets together, distributed evenly across multiple `bin_n` metadata files. A release of any package in the registry has an equal chance to update any given `bin_n` metadata. As a result, packages that are not in use for a given client project will still require re-download of `bin_n` metadata on an ongoing basis.
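To make the bundling concrete, here is a minimal sketch of hashed-bin bucketing (not from the TAP; the bin count and exact hashing scheme are assumptions, loosely following PEP 458's convention of bucketing targets by a prefix of the SHA2-256 digest of the target path):

```python
import hashlib

def bin_for_target(target_path: str, number_of_bins: int = 2048) -> int:
    """Bucket a target path into one of number_of_bins hashed bins.
    Illustrative only: uses the leading 32 bits of the SHA2-256 digest
    of the path; assumes number_of_bins is a power of two, so the modulo
    divides 2**32 evenly and the distribution stays uniform."""
    prefix = int(hashlib.sha256(target_path.encode("utf-8")).hexdigest()[:8], 16)
    return prefix % number_of_bins

# Two unrelated packages can land in the same bin, so a release of one
# forces clients that depend only on the other to re-download that bin.
```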
But why? Again, this depends on the type of clients above. Most clients are going to be lazy rather than "eager" like mirrors, and so I don't see them needing to download all `bin_n` metadata, only what they need at the moment.
It's not as clear as I would like it to be (see step 7 in the spec here), but TUF clients are supposed to perform a lazy evaluation of targets using a preorder DFS.
TAP-21 does not suggest that clients need to download all `bin_n` metadata.
Clients do need to download the latest version of each `bin_n` metadata file that is relevant to the targets they require. Each of those (relevant) `bin_n` metadata files includes targets that are not required, and new releases of those extraneous targets produce new versions of the relevant `bin_n` metadata.
Therefore, clients need to download new `bin_n` metadata even if there have been no releases of the targets they require.
OK, I think I now see the issues (in increasing order of significance):
- The `bins` role metadata size can be reduced quite a lot with succinct hashed bin delegations.
- Do you need hashes (SHA2-256 and SHA2-512) in your snapshot metadata? You really only need them for protection against malicious mirrors (see Section 5.6 here). If not, you should be able to shave a lot off your snapshot by distributing only version numbers. Even if you do need hashes, have you calculated what snapshot Merkle trees can buy you?
- What are the dependencies between your packages? How do you plan to solve for mix-and-match and rollback attacks there?
- The bins role metadata size can be reduced quite a lot with succinct hashed bin delegations.

We address this in the Motivation section: Hashed bin delegation. TAP-21 is primarily concerned with the churn of `bin_n` metadata. `bins` metadata, by virtue of being essentially static, does not factor into it.
Also, since we expect the top-level repo to require hashed bins, implementing TAP-15 would reduce overall TUF metadata when implementing TAP-21 as well.
- Do you need hashes (SHA2-256 and SHA2-512) in your snapshot metadata?

Including both SHA2-256 and SHA2-512 does seem redundant. I've updated the calculator spreadsheet to reflect removing SHA2-512 hashes. This streamlining also reduces both top-level and sub-repo metadata sizes (though the latter is not reflected in the calculations yet).

- You really only need it for protection against malicious mirrors (see Section 5.6 here). If not, you should be able to shave a lot off your snapshot by distributing only version numbers.

Very large repositories are almost certainly going to use mirrors. In my (admittedly limited) experience, they do want to protect against the potential of malicious mirrors. I proposed just distributing version numbers, and was asked to keep the hashes. However, this streamlining would also reduce both top-level and sub-repo metadata sizes in TAP-21.
- Even if you do need hashes, have you calculated what snapshot Merkle trees can buy you?

Not directly, no. We did address this in the Motivation section: Snapshot Merkle Trees (TAP 16). For ease of comparison, I assumed a 90% reduction in snapshot metadata size using TAP-16. While reducing overall download size, this has no effect on `bin_n` churn.
Also, a recent Slack discussion on TAP-16 suggests there may be other performance issues with this approach.
- What are the dependencies between your packages?

For our primary use case, package dependencies are explicitly listed in the registry indexes, and resolved (independently of TUF) by the package manager. These are what the TAP refers to as "loosely-coupled".

- How do you plan to solve for mix-and-match and rollback attacks there?

We address both mix-and-match and rollback attacks in the Security Analysis section. Did you have a specific concern that we did not address?
To validate these hypotheses, the authors of this TAP created a [TAP-21 metadata overhead calculator](https://docs.google.com/spreadsheets/d/1Q1BPtS5T92e7Djx6878I0MdAOktg8hh73dEvMVhXIms) (based on the [calculator from PEP 458](https://docs.google.com/spreadsheets/d/11_XkeHrf4GdhMYVqpYWsug6JNz5ZK6HvvmDZX0__K2I)). The results for hashed bins (without this TAP) are outlined below:
Assumptions:
- (A1) Number of targets: 5M
Are these specifically all versions of all packages? One optimisation we did for PEP 458 is that we only sign the "indirect" simple HTML indices that point to the packages themselves. So instead of signing ~6M releases, we sign only ~600K simple indices.
How does the client verify the packages themselves? Do the indices contain the package hashes and file sizes?
Yes: the assumption is that the project indices contain at least the hashes, if not also the sizes. We can set a default upper bound on the file size in the TUF clients to avoid endless data attacks anyway.
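That client-side check could look roughly like this (a sketch under assumptions: the index-entry format, field names, and the size cap are all hypothetical, not from PEP 458 or the TAP):

```python
import hashlib

MAX_TARGET_BYTES = 1 << 30  # assumed default cap; bounds endless-data attacks

def verify_package(data: bytes, index_entry: dict) -> bool:
    """Check a downloaded package against the hash (and, when present,
    the size) recorded in a TUF-verified project index."""
    if len(data) > index_entry.get("size", MAX_TARGET_BYTES):
        return False
    return hashlib.sha256(data).hexdigest() == index_entry["sha256"]
```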
Are these specifically all versions of all packages? One optimisation we did for PEP 458 is that we only sign the "indirect" simple HTML indices that point to the packages themselves. So instead of signing ~6M releases, we sign only ~600K simple indices.
As noted elsewhere in this PR, this reduces the number of `bin_n` metadata files required for a project to just those containing the required indices, as opposed to the bins for both indices and packages. This reduces the overall size of the downloaded metadata per update.
However, the frequency at which TUF metadata changes remains constant, since each new package release requires an update to the associated index. Thus, it does not reduce the frequency at which `bin_n` metadata requires an update. Both the dividend and the divisor are halved in the relevant calculations:
- (C2) Number of bins downloaded in example project: 100 [ C2 = 2 x A4 ] (1 for each of package and index)
- (C3) Average number of bins requiring update per minute: 0.3656 [ C3 = C1 x C2 ]
- (C4) Time to require full metadata refresh: 274 minutes [ C4 = C2/C3 ]
Drastically reducing the number of targets suggests that the number of bins should also be reduced. This, perhaps counterintuitively, results in higher `bin_n` churn:

Assumptions:
- (A1) Number of targets: 600K
- (A2) Frequency of new releases: 60/minute
- (A3) Number of bins: 2048
- (A4) Number of dependencies in example project: 50

Calculations:
- (C1) Likelihood that a bin will require an update in one minute: 2.89% [ C1 = 1 - ( (A3-1) / A3 )^A2 ]
- (C2) Number of bins downloaded in example project: 50 [ C2 = A4 ] (only indices)
- (C3) Average number of bins requiring update per minute: 1.4439 [ C3 = C1 x C2 ]
- (C4) Time to require full metadata refresh: 35 minutes [ C4 = C2/C3 ]

Note that A1 is not used in any of these calculations. It merely informs the optimal number of bins.
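The arithmetic for the 600K-target scenario can be reproduced in a few lines (a sketch mirroring the spreadsheet formulas quoted in this thread; variable names are mine):

```python
def bin_update_likelihood(releases_per_minute: float, number_of_bins: int) -> float:
    """C1: chance that any given bin needs an update within one minute,
    assuming each new release lands in a uniformly random bin."""
    return 1 - ((number_of_bins - 1) / number_of_bins) ** releases_per_minute

# Assumptions A2, A3, A4 from the 600K-target scenario:
c1 = bin_update_likelihood(60, 2048)  # ~2.89%
c2 = 50                               # bins downloaded (indices only)
c3 = c1 * c2                          # ~1.44 bins needing update per minute
c4 = c2 / c3                          # ~35 minutes to a full metadata refresh
```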
Calculations:
- (C1) Likelihood that a bin will require an update in one minute: 0.37% [ C1 = 1 - ( (A3-1) / A3 )^A2 ]
- (C2) Number of bins downloaded in example project: 100
I think C2 can be reduced to A4 if the bins sign only the indices instead of the packages themselves.
This is also reduced to A4 by having the index and packages all be covered by the same sub-repo.
Instead of scaling up a single TUF repository, this TAP proposes scaling out to multiple smaller repositories (e.g. per-vendor, per-package, etc.). This dramatically reduces the rate at which metadata must be refreshed, since only packages actually used within a project are tracked.
However, this approach introduces a Trust-On-First-Use (TOFU) issue for the root metadata of each of these TUF repos, as it would not be feasible to ship them with the client, as is currently specified. This challenge can be overcome with the use of a higher-level TUF repo, whose targets are the initial root metadata of the "sub-repos". Only the root metadata for the top-level TUF repository would have to ship with the client.
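As a rough illustration of that bootstrap flow (hypothetical structure and names, not the TAP's specified format): the top-level repository lists each sub-repo's initial root metadata as an ordinary target, so the client can verify it by length and hash before trusting the sub-repo.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypothetical top-level targets metadata: one entry per sub-repo,
# pointing at that sub-repo's initial root.json.
subrepo_root = b'{"signed": {"_type": "root", "version": 1}}'
top_level_targets = {
    "vendor-a/root.json": {
        "length": len(subrepo_root),
        "hashes": {"sha256": sha256_hex(subrepo_root)},
    },
}

def bootstrap_subrepo_root(path: str, downloaded: bytes) -> bytes:
    """Accept a sub-repo's initial root only if it matches the length and
    hash recorded in the (already-verified) top-level targets metadata,
    sidestepping TOFU for sub-repo roots."""
    info = top_level_targets[path]
    if len(downloaded) != info["length"]:
        raise ValueError(f"length mismatch for {path}")
    if sha256_hex(downloaded) != info["hashes"]["sha256"]:
        raise ValueError(f"hash mismatch for {path}")
    return downloaded
```

Subsequent root rotations would then proceed via each sub-repo's own root-update chain; only the initial root needs this top-level anchor.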
I like this idea of using a "high-level" TUF repo to distribute the root metadata for other "low-level" TUF "subrepos": it's something I had considered myself for other use cases (e.g., a company having multiple TUF repos for different packages, and wishing to have a cleaner separation between them, especially when there are no dependencies between these packages as appears to be your own use case).
However, I think that this is a different problem than what you are really describing below, and thus the solution doesn't fit your real problem.
I agree that this solution seems promising for a number of other use-cases. However, we believe that it does directly address the challenges outlined in TAP-21's Motivation and Rationale sections.
[...] I think that this is a different problem than what you are really describing below, and thus the solution doesn't fit your real problem.
Based on your earlier comments (eg. "I don't see [clients] needing to download all bin_n metadata, only what they need at the moment"), we may not have been sufficiently clear about the problem we're trying to solve. That said, if you think we've misidentified the root cause, I'm willing to re-evaluate.
Let's continue discussing in our video meetings, and we'll get back here once we agree on some ways forward.
Co-authored-by: Trishank Karthik Kuppusamy <[email protected]>
Signed-off-by: Christopher Gervais <[email protected]>
Coming from theupdateframework/specification/issues/309, and following discussion at the TUF community meeting on 2024-11-01, we've drafted "TAP-21 - Scale-out Architecture for High-volume Repositories".
This TAP is still very much a "draft". Several sections are currently marked "TBD". Our motivation for a PR at this stage is to validate the motivation and rationale, as well as the calculations we undertook to better understand the issues at hand.