Incremental stats sitewide #3114

Open: wants to merge 21 commits into base: master


amCap1712
Member

@amCap1712 amCap1712 commented Jan 5, 2025

In the ListenBrainz Spark cluster, full dump listens (which remain constant for ~15 days) and incremental listens (ingested daily) are the two main sources of data. Incremental listens are cleared whenever a new full dump is imported. Aggregating full dump listens daily for various statistics is inefficient since this data does not change.

To optimize this process:

  1. A partial aggregate is generated from the full dump listens the first time a stat is requested. This partial aggregate is stored in HDFS for future use, eliminating the need for redundant full dump aggregation.
  2. Incremental listens are aggregated daily. Each run uses all incremental listens since the full dump’s import (not just today’s), which introduces some redundant computation.
  3. The incremental aggregate is combined with the existing partial aggregate, forming a combined aggregate from which final statistics are generated.

For non-sitewide statistics, further optimization is possible: If an entity’s listens (e.g., for a user) are not present in the incremental listens, its statistics do not need to be recalculated. Similarly, entity-level listener stats can skip recomputation when relevant data is absent in incremental listens.
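
A rough sketch of that per-entity shortcut, again in plain Python rather than Spark. The `(user, artist)` tuples and the helper name `aggregate_by_user` are hypothetical and only illustrate the idea of recomputing stats solely for users that appear in the incremental listens.

```python
from collections import Counter

# Hypothetical (user, artist) listen pairs.
full_dump_listens = [
    ("alice", "artist_a"),
    ("bob", "artist_b"),
    ("alice", "artist_a"),
]
incremental_listens = [("alice", "artist_c")]


def aggregate_by_user(listens):
    """Build a per-user Counter of artist listen counts."""
    per_user = {}
    for user, artist in listens:
        per_user.setdefault(user, Counter())[artist] += 1
    return per_user


partial = aggregate_by_user(full_dump_listens)
incremental = aggregate_by_user(incremental_listens)

# Only users present in the incremental listens can have changed stats,
# so everyone else keeps the stats computed on an earlier run.
users_to_refresh = set(incremental)

refreshed = {
    user: partial.get(user, Counter()) + incremental[user]
    for user in sorted(users_to_refresh)
}
```

Here `bob` has no incremental listens, so his stats are left untouched, while `alice` gets a freshly combined aggregate.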

@pep8speaks

pep8speaks commented Jan 5, 2025

Hello @amCap1712! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2025-01-07 20:40:29 UTC

@amCap1712 amCap1712 force-pushed the incremental-stats-sitewide branch from 46caca9 to a9a62ce January 7, 2025 20:40
@amCap1712 amCap1712 marked this pull request as ready for review January 7, 2025 20:41
@amCap1712
Member Author

Note that for sitewide statistics, the final listen counts are slightly inaccurate. To keep the computation efficient, the user listen count limit can only be enforced per aggregate. In the worst case, where both the full dump listens and the incremental listens contain the maximum allowed number of listens for a user, that user's effective listen count limit can be up to 2x the desired limit.
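
To make the worst case concrete, here is a small plain-Python sketch. The limit value, the listen data, and the `capped_aggregate` helper are all made up for illustration, and how the cap is actually applied in the Spark queries is an assumption.

```python
from collections import Counter

USER_LISTEN_LIMIT = 3  # hypothetical cap, not the real configured value


def capped_aggregate(listens, limit):
    """Count listens per artist, keeping at most `limit` listens per user."""
    seen_per_user = Counter()
    counts = Counter()
    for user, artist in listens:
        if seen_per_user[user] < limit:
            seen_per_user[user] += 1
            counts[artist] += 1
    return counts


# Worst case: the same user exceeds the cap in both listen sources.
full_dump = [("alice", "artist_a")] * 5
incremental = [("alice", "artist_a")] * 5

# Cap enforced separately on each aggregate, then combined.
combined = (
    capped_aggregate(full_dump, USER_LISTEN_LIMIT)
    + capped_aggregate(incremental, USER_LISTEN_LIMIT)
)

# Cap enforced once over all listens (the exact but more expensive variant).
exact = capped_aggregate(full_dump + incremental, USER_LISTEN_LIMIT)
```

With these inputs `combined["artist_a"]` is 6 while `exact["artist_a"]` is 3, which is the 2x overshoot described above.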


2 participants