Incremental stats sitewide #3114

amCap1712 · 2025-01-05T18:39:10Z

In the ListenBrainz Spark cluster, full dump listens (which remain constant for ~15 days) and incremental listens (ingested daily) are the two main sources of data. Incremental listens are cleared whenever a new full dump is imported. Aggregating full dump listens daily for various statistics is inefficient since this data does not change.

To optimize this process:

A partial aggregate is generated from the full dump listens the first time a stat is requested. This partial aggregate is stored in HDFS for future use, eliminating the need for redundant full dump aggregation.
Incremental listens are aggregated daily. All incremental listens since the full dump’s import are used (not just today’s), so some redundant computation remains.
The incremental aggregate is combined with the existing partial aggregate, forming a combined aggregate from which final statistics are generated.
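
The three steps above can be sketched in plain Python. This is only an illustration and not the code in this PR: the real job runs on Spark and stores the partial aggregate in HDFS, whereas here `Counter` stands in for a grouped DataFrame, and the names `aggregate`, `combine`, and the `artist` field are hypothetical.

```python
from collections import Counter


def aggregate(listens):
    """Count listens per artist (stand-in for a Spark group-by)."""
    return Counter(listen["artist"] for listen in listens)


def combine(partial, incremental):
    """Merge two aggregates by summing the counts for each key."""
    return partial + incremental


# Hypothetical sample data: the full dump is static, incremental grows daily.
full_dump_listens = [{"artist": "A"}, {"artist": "A"}, {"artist": "B"}]
incremental_listens = [{"artist": "B"}, {"artist": "B"}, {"artist": "C"}]

# Step 1: computed once per full dump, then reused (cached in HDFS in the real job).
partial_agg = aggregate(full_dump_listens)

# Step 2: recomputed daily over all incremental listens since the dump import.
incremental_agg = aggregate(incremental_listens)

# Step 3: final statistics are read off the combined aggregate.
combined = combine(partial_agg, incremental_agg)
top_artists = combined.most_common(2)
print(top_artists)
```

The point of the split is that only `incremental_agg` has to be rebuilt each day; `partial_agg` stays valid until the next full dump import.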

For non-sitewide statistics, further optimization is possible: if an entity’s listens (e.g., for a user) are not present in the incremental listens, its statistics do not need to be recalculated. Similarly, entity-level listener stats can skip recomputation when the relevant data is absent from the incremental listens.
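
A minimal sketch of that skip logic for per-user stats, assuming previously generated stats are available in some cache. The function names, the `user_id` field, and the `recompute` callback are hypothetical and chosen only for illustration.

```python
def users_with_new_listens(incremental_listens):
    """Return the set of user ids that appear in the incremental listens."""
    return {listen["user_id"] for listen in incremental_listens}


def refresh_user_stats(cached_stats, all_user_ids, incremental_listens, recompute):
    """Recompute stats only for users whose data changed or who have no cached stats."""
    changed = users_with_new_listens(incremental_listens)
    result = {}
    for user_id in all_user_ids:
        if user_id in changed or user_id not in cached_stats:
            result[user_id] = recompute(user_id)
        else:
            # No new listens since the full dump: reuse the earlier result.
            result[user_id] = cached_stats[user_id]
    return result


# Hypothetical usage: alice has no incremental listens, so she is skipped.
recomputed = []


def recompute(user_id):
    recomputed.append(user_id)
    return f"fresh stats for {user_id}"


cached = {"alice": "old stats for alice", "bob": "old stats for bob"}
incremental = [{"user_id": "bob"}, {"user_id": "carol"}]
stats = refresh_user_stats(cached, ["alice", "bob", "carol"], incremental, recompute)
```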

pep8speaks · 2025-01-05T18:39:18Z

Hello @amCap1712! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2025-01-07 20:40:29 UTC

amCap1712 · 2025-01-07T20:44:46Z

Note that for sitewide statistics there is a slight inaccuracy in the final listen counts. To stay efficient, the per-user listen count limit can only be enforced within each aggregate separately, not across the combined result. In the worst case, where both the full dump listens and the incremental listens contain the maximum allowed number of listens for a user, that user's effective listen count can be up to 2x the desired limit.
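
The worst case can be worked through with small numbers. The sketch below is illustrative only: `LIMIT`, `cap_per_user`, and the sample counts are assumptions, not values or functions from this PR.

```python
LIMIT = 1000  # hypothetical per-user listen count limit


def cap_per_user(counts_by_user, limit):
    """Apply the per-user limit within a single aggregate."""
    return {user: min(count, limit) for user, count in counts_by_user.items()}


# Hypothetical raw listen counts per user in each source.
full_dump_counts = {"alice": 5000, "bob": 10}
incremental_counts = {"alice": 3000, "bob": 5}

# The limit is enforced on each aggregate independently...
capped_full = cap_per_user(full_dump_counts, LIMIT)
capped_incremental = cap_per_user(incremental_counts, LIMIT)

# ...and the capped values are then summed when the aggregates are combined.
combined_counts = {
    user: capped_full.get(user, 0) + capped_incremental.get(user, 0)
    for user in set(capped_full) | set(capped_incremental)
}
print(combined_counts)
```

Here alice contributes 2000 listens, twice the limit, because she hits the cap in both aggregates; bob stays well under it and is counted exactly.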

amCap1712 added 21 commits January 8, 2025 02:10

interim checkin

169a3f6

fix table use

006b367

fix combined table

4805bd2

fix partial df use

11c8042

add per user limit for sitewide stats

ca2e1fb

testing more scenarios

bb62d95

refactor incremental sitewide stats

d64f7ea

fix import

570e3ce

add all time incremental stats for other entities

6fae52c

Delete partial sitewide aggregates on import of full dump

4e037c2

Add bookkeeping for using aggregates of any stats_range

85ea8a4

fix imports

0d64068

fix metadata path

1fe0397

add logging to debug

da7fce5

fix existing agg usable check

28551bd

add schema to json read

6b97954

fix skip_trash arg in dump upload

b9fd965

Refactor SitewideEntity for sharing with other stats

dd9ec00

Fix constructors

1b9df3a

Fix call to generate_stats

f1af83c

Fix call to generate_stats - 2

a9a62ce

amCap1712 force-pushed the incremental-stats-sitewide branch from 46caca9 to a9a62ce Compare January 7, 2025 20:40

amCap1712 marked this pull request as ready for review January 7, 2025 20:41
