Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Come up with a rename to clarify aid instance vs aid #141

Open
pdobacz opened this issue Jan 5, 2022 · 6 comments
Open

Come up with a rename to clarify aid instance vs aid #141

pdobacz opened this issue Jan 5, 2022 · 6 comments
Labels
low priority Low priority

Comments

@pdobacz
Copy link
Contributor

pdobacz commented Jan 5, 2022

Related to discussions in the PR here and in Slack.

In the code, it is often not clear if aid means aid instance (so a column or expression) or a value ("instance" 😉) of aid instance, being something derived from an AID value. (But it's not really the AID value anymore, because it may have been hashed!)

Here is one example of the former: https://github.com/diffix/pg_diffix/blob/master/src/aggregation/count.c#L59
Here is another: https://github.com/diffix/pg_diffix/blob/master/pg_diffix/query/oid_cache.h#L14

Here is an example of the latter: https://github.com/diffix/pg_diffix/blob/master/pg_diffix/aggregation/contribution_tracker.h#L40

@pdobacz pdobacz added the low priority Low priority label Jan 5, 2022
@cristianberneanu
Copy link
Collaborator

I think the term instance hinders more than it helps. We have AID columns and AID values.
We should differentiate between them if the defining scope allows for both to exist at the same time.
Otherwise, using aid generically and taking context into account is mostly fine IMO (we can't put all the relevant information into variable names anyway).
As Edon said, the hash part is not relevant after initialization anymore, so we can leave it out of the name.

@pdobacz
Copy link
Contributor Author

pdobacz commented Jan 6, 2022

As Edon said, the hash part is not relevant after initialization anymore, so we can leave it out of the name.

I have been thinking about this: If you're open to reconsider aid_hash, it would be a option which could potentially help solve this. Arguments:

  1. We do use it in reference, e.g.: (type AidCountState = { AidContributions: Dictionary<AidHash, float>; mutable UnaccountedFor: int64 })
  2. It is relevant insofar as we know "is safe to use in subsequent steps of anonymization, like seeding", as opposed to unhashed AID value.

@edongashi
Copy link
Member

There is no way to get an unsafe AID value because it's literally impossible without the type adapter. A postgres Datum is an opaque value unless we have type info, which is handled by the AidSpec.

@pdobacz
Copy link
Contributor Author

pdobacz commented Jan 6, 2022

There is no way to get an unsafe AID value

wait, so what is the hashing for? isn't it there to make the seed pseudo-random? we're turning of hashing only in DEBUG, so I assumed that it is there for some sort of safety

@pdobacz
Copy link
Contributor Author

pdobacz commented Jan 6, 2022

ah, ok I see what you mean now. nvm

@edongashi
Copy link
Member

wait, so what is the hashing for? isn't it there to make the seed pseudo-random? we're turning of hashing only in DEBUG, so I assumed that it is there for some sort of safety

We need to squash various types into an aid_t. For text we hash to get a number, for numbers we hash to mix distribution (IDs tend to be sequential, using only the lower end of bits). For debug we leave it off in case we want to investigate what entity is contributing to what.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
low priority Low priority
Projects
None yet
Development

No branches or pull requests

3 participants