
Help understanding ApplySortingLsh? #3

Open
stepthom opened this issue Dec 9, 2021 · 6 comments

stepthom commented Dec 9, 2021

Hi,

Thank you for your package - it is very nicely written and very easy to use.

I am trying to understand how the ApplySortingLsh function works. In particular:

  • What do the values in cluster_data represent? I see that there are 33872 values in the array, so I assume there is one per cohort. But I see that most of the values seem to be 35, 34, 36, 98, or 99. What do these values mean/represent in the real world? Or, why/how are they used?
shigeki (Owner) commented Dec 10, 2021

Each cluster_data entry packs together a blocked flag (the 7th bit) and the bit width of that cohort Id's SimHash range (the 1st-6th bits).
In the figure below (ignore the Japanese in it), you can see that each cohort Id spans a range of about 34-36 bits according to cluster_data and the SimHash value.
A cluster_data value of 98 or 99 has the blocked flag set in the 7th bit (98 = 64 + 34, 99 = 64 + 35), so that cohort Id is blocked.
[figure: cluster_data bit widths mapped to SimHash ranges per cohort Id]
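
A minimal sketch of that decoding, assuming only what is stated above (7th bit = blocked flag, low 6 bits = bit width); decodeClusterEntry is a hypothetical helper, not this package's actual API:

```go
package main

import "fmt"

const blockedFlag = 1 << 6 // the 7th bit (0x40)

// decodeClusterEntry splits a cluster_data entry into its bit width
// (low 6 bits) and blocked flag (7th bit).
func decodeClusterEntry(entry byte) (bitWidth byte, blocked bool) {
	return entry & 0x3F, entry&blockedFlag != 0
}

func main() {
	for _, e := range []byte{34, 35, 36, 98, 99} {
		w, b := decodeClusterEntry(e)
		fmt.Printf("entry=%2d  bitWidth=%d  blocked=%v\n", e, w, b)
	}
}
```

Running this prints bitWidth 34/35/36 with blocked=false for the first three values, and bitWidth 34/35 with blocked=true for 98 and 99.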

@stepthom (Author)

Thank you, that helps. Some more questions:

  • Why are some cohort IDs blocked?
  • Why is the bit width of each cohort ID different? E.g., why not make them all 35?

Thanks,
Steve

@stepthom (Author)

Thank you, @shigeki. In the document you linked to, I assume the relevant part is this:

The inputs are turned into a cohort ID using a technique we're calling PrefixLSH. It is similar to a SimHash variant called SortingLSH that was described by our colleagues in Google Research and Google Ads last October.

  • ...

  • The browser then uses the full set of domain name inputs to deterministically produce a 50-bit Locality-Sensitive Hash bitvector, where the i'th bit indicates the sign (positive or negative) of the sum of the i'th coordinates of all the floating-point vectors derived from the domain names.

  • A Chrome-operated server-side pipeline counts how many times each 50-bit hash occurs among qualifying users — those for whom we log cohort calculations along with their sync data.

  • The 50-bit hashes start in two big cohorts: all hashes whose first bit is 0, versus all hashes whose first bit is 1. Then each cohort is repeatedly divided into two smaller cohorts by looking at successive bits of the hash value, as long as such a division yields two cohorts each with at least 2000 qualifying users. (Each cohort will comprise thousands of people total, when including those Chrome users for whom we don't sync cohort data.)

  • ...
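
As a side note, the sign-of-sums step in the quote could be sketched roughly like this (hashToFloats is a hypothetical stand-in for however Chrome actually derives the per-domain float vectors; only the sign-of-sums structure comes from the quote):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const simHashBits = 50

// hashToFloats derives 50 floats from a domain name by seeding a PRNG
// with a hash of the name. (Hypothetical stand-in for Chrome's method.)
func hashToFloats(domain string) [simHashBits]float64 {
	h := fnv.New64a()
	h.Write([]byte(domain))
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	var v [simHashBits]float64
	for i := range v {
		v[i] = r.NormFloat64()
	}
	return v
}

// simHash sets bit i when the sum of the i'th coordinates over all
// domains' vectors is positive.
func simHash(domains []string) uint64 {
	var sums [simHashBits]float64
	for _, d := range domains {
		v := hashToFloats(d)
		for i := range sums {
			sums[i] += v[i]
		}
	}
	var bits uint64
	for i, s := range sums {
		if s > 0 {
			bits |= 1 << uint(simHashBits-1-i)
		}
	}
	return bits
}

func main() {
	fmt.Printf("%050b\n", simHash([]string{"example.com", "github.com"}))
}
```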

Two questions:

I assume that the "Chrome-operated server-side pipeline" has already been run and that its results are the contents of "cluster_data" - is that correct?

Also, the step "The 50-bit hashes start in two big cohorts" makes sense to me, but I don't see how the code in ApplySortingLsh implements this. The code seems to do the following:

Given a sim_hash (computed from the given URL list):

  1. Set cumulative_sum to 0
  2. Set index to 0
  3. Add 2^w to cumulative_sum, where w is the bit width (e.g., 34, 35, or 36) read from cluster_data.
  4. Check if cumulative_sum is bigger than sim_hash
    a. If yes, then done. Set cohort_id to current index and return
    b. If no, continue
  5. Increment index by 1
  6. GOTO step 3

I'm having trouble seeing how this is the same as what the document describes.
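
For reference, here is that loop written out as a sketch (my paraphrase of the steps above, not the actual ApplySortingLsh source; how blocked cohorts are handled is an assumption):

```go
package main

import (
	"errors"
	"fmt"
)

const blockedFlag = 1 << 6

// applySortingLsh walks cluster_data, adding 2^bitWidth per cohort to a
// cumulative sum, and returns the index of the first cohort whose
// cumulative sum exceeds simHash (steps 1-6 above).
func applySortingLsh(simHash uint64, clusterData []byte) (uint64, error) {
	var cumulativeSum uint64
	for index, entry := range clusterData {
		bitWidth := uint(entry & 0x3F)
		cumulativeSum += 1 << bitWidth
		if cumulativeSum > simHash {
			if entry&blockedFlag != 0 {
				// Assumption: blocked cohorts yield no cohort Id.
				return 0, errors.New("cohort is blocked")
			}
			return uint64(index), nil
		}
	}
	return 0, errors.New("sim_hash beyond the last cohort boundary")
}

func main() {
	// Toy cluster_data: four entries of width 48 partition the 50-bit
	// SimHash space into four equal quarters.
	clusterData := []byte{48, 48, 48, 48}
	id, err := applySortingLsh(1<<49|123, clusterData)
	fmt.Println(id, err) // 2 <nil>: the hash falls in the third quarter
}
```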

@stepthom (Author)

To expand:

It seems like the document is saying the algorithm should do something like this (I'm just making up the numbers for illustration):

  • Cohort 0 is assigned to any sim_hash between 10000000000000000000000000000100000000000000000000 and 10000000000000000000000000001100000000000000000000
  • Cohort 1 is assigned to any sim_hash between 10000000000000000000000000001100000000000000000000 and 10000000000000000000000000001101000000000000000000
  • ...
  • Cohort 33,872 is assigned to any sim_hash between 11110000000000010000000000001100000000000000000000 and 11110000000000110000000000001100000000000000000000
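
One way to connect the two views: the cumulative sums in the loop above do produce contiguous ranges of exactly this shape. Cohort i covers sim_hash values from the running sum of 2^bit_width over entries 0..i-1 up to (but not including) the sum over entries 0..i. A toy illustration with made-up widths:

```go
package main

import "fmt"

func main() {
	// Toy cluster_data widths (made up); they sum to the full 2^50 space.
	clusterData := []byte{48, 49, 48}
	var lo uint64
	for i, entry := range clusterData {
		hi := lo + 1<<uint(entry&0x3F)
		fmt.Printf("cohort %d: sim_hash in [%050b, %050b]\n", i, lo, hi-1)
		lo = hi
	}
}
```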

shigeki (Owner) commented Dec 15, 2021

I assume that the "Chrome-operated server-side pipeline" has already been run and that its results are the contents of "cluster_data" - is that correct?

Right.

Also, the step "The 50-bit hashes start in two big cohorts" makes sense to me, but I don't see how the code in ApplySortingLsh implements this.

That step was performed on an internal server at Google, not in floc_simulator.
The floc_simulator just uses its results, via the cluster_data file that Google provides in Chrome. I do not know the details of Google's internal pipeline. Please ask them at https://github.com/WICG/floc/issues.
