Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add running totals functionality #76

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

PetrDlouhy
Copy link
Collaborator

With huge amount of transaction I started to have problems with performance of counting account totals.
I tried to start counting running totals for the accounts, but I realized that the task is more complicated than I anticipated. I ended up rather optimizing the performance of the server, but I will probably need to implement this functionality eventually.

I am leaving the work in progress code here if anyone would be interested in finishing it.

@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch from aab3992 to b7cf6d2 Compare November 21, 2022 15:42
@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch 6 times, most recently from 26c9655 to d71730b Compare December 20, 2022 06:12
@codecov-commenter
Copy link

codecov-commenter commented Dec 20, 2022

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (40a4be4) 92.97% compared to head (fefe20a) 93.18%.

Files Patch % Lines
hordak/admin.py 80.00% 2 Missing and 2 partials ⚠️
hordak/tests/models/test_concurrently.py 86.20% 4 Missing ⚠️
.../management/commands/recalculate_running_totals.py 92.30% 1 Missing and 1 partial ⚠️
hordak/models/core.py 96.42% 0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #76      +/-   ##
==========================================
+ Coverage   92.97%   93.18%   +0.20%     
==========================================
  Files          59       62       +3     
  Lines        3828     4106     +278     
  Branches      248      279      +31     
==========================================
+ Hits         3559     3826     +267     
- Misses        224      231       +7     
- Partials       45       49       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch 4 times, most recently from b160cc4 to 3b30eae Compare December 20, 2022 13:58
@PetrDlouhy
Copy link
Collaborator Author

PetrDlouhy commented Dec 20, 2022

Now I have updated this PR to working state. It is working, but have some problems:

@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch 8 times, most recently from a5a3f46 to a5f05c7 Compare December 23, 2022 15:53
@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch 5 times, most recently from cd4162c to 8192ecc Compare January 11, 2023 17:00
@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch from 1d7bbef to 741b68d Compare January 11, 2023 19:28
@PetrDlouhy PetrDlouhy marked this pull request as ready for review January 11, 2023 20:09
@PetrDlouhy PetrDlouhy changed the title (experimental) add running totals functionality add running totals functionality Jan 11, 2023
@nitsujri
Copy link
Contributor

nitsujri commented May 19, 2023

@PetrDlouhy I was thinking about this and would recommend a different approach if you're open to it.

To speed up Balance calculations with a large number of Legs, a different solution would be to use an AccountStatement record. This comes direct from the StackOverflow post discussed in #44 (comment):

Notification_Center

In the SO post, it's called LedgerStatement, but for Hordak, it would be called AccountStatement.

Key Columns of the AccountStatement

  • balance - money
  • balance_currency
  • account_id
  • calculated_to_leg_id - Points to last leg it includes as part of its balance
  • account_statement_id - Previous time this account created an AccountStatement.

Functionality

  • Current balance calculation only needs to sum Legs from the last AccountStatement for a given currency.
  • Individual Accounts would only sum its Legs that are directly attached.
    • To get inclusive of children, recursively perform above.
  • Create-function via Background job would create an AccountStatement every X number of Legs for each currency for that account since the last AccountSatement.
  • Background-job AccountStatementAuditor or similar would double check accuracy of each AccountStatement.

Differences from StackOverflow (SO)

  • The SO diagram uses date, while this hard references Legs.
    • Dates are common for like "Monthly Statements" which is great for humans, but that's not what we care about here. Obviously if we used date/time you could end up with 0 Legs between statements or 1b.

      We want predicable speed, so by hard linking via calculated_to_leg_id instead of time we can use "number of legs between each statement" and tune it to 10,000, 100k, 1m? Whatever has a good balance between speed vs number of AccountStatement created for that DB's instance CPU+HD speed.

Benefits

  • Fully accurate
  • Predictable performance
  • Completely optional - if AccountStatement is never generated (i.e. background job is never run) then query includes all Legs.

Drawbacks

  • If you update a historical Leg, obviously this is all inaccurate, thus requiring a recalculation from the changed Leg.
    • It begs the question though, why are historical Legs being updated. In terms of accounting, you should never update historicals, but be adding adjustments/transactions/legs for corrections.
    • Optional paths, on update:
      • Prevent/throw DB error for all legs 'belong to' an AccountStatement
      • Trigger a recalculation for all AccountStatements equal to or older than the Leg being updated.
      • Do nothing, just warn of footgun via documentation.
  • This pattern could still be slow if you have 1 master account that has an enormous tree.

Thoughts?

@PetrDlouhy
Copy link
Collaborator Author

Some quick thoughts:

  • I see the biggest benefit in the fact that this is not dependent on signals (except hypothetical the historical update). Hordak so far counts with other applications writing in the database which would break with this implmentation of running totals. But on the other hands the DB functions are causing lot of trouble.
  • I am not sure about the name AccountStatement, because it is often used for account statements provided to the users in form of PDF or something. I know, that this is technically very similar, but it still might be confusing.
  • In my application I sometimes do historical updates. It might not be methodically best approach, but it is easier especially when I do run some balance checks periodically. There also might be some possibility to make that correction transactions automatically when the model is updated, but it would add some complexity.
  • Not entirely sure, if the benefits are worth the effort. The current running totals are working for me pretty good and for some time already the balances are matching.
  • We might want to change to join table for account currencies before merging either running totals or account statements.

@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch 2 times, most recently from 61f16c8 to 568f99e Compare May 23, 2023 14:58
@nitsujri
Copy link
Contributor

@PetrDlouhy thanks for the feedback:

  • Agreed, AccountStatement is a confusing name. Maybe BalanceStatement?
  • Historical updates should be allowed.
    • I walk my "never historical updates" back - way too aggressive. I spoke with our accountant - historical updates should be allowed until the "books are closed" which is EOM for our company. But for Hordak, I don't think this is a concept that we should handle since it's business/organization specific.
    • I don't have a strong answer for allowing historical updates in my proposal other than update all after the updated Leg. I'm not a huge fan of this as it can take a long time if it's a "really old" transaction.

Reservations with current RunningTotals

Agreed that signals makes me pause a bit, but I don't think that's a blocker for this feature.

In my eyes the biggest concern in the current architecture, is that RunningTotals is, at its core, a differential cache. These are hard to get right at scale. Edge cases take forever to crop up and many times are insanely difficult to debug.

One is - while it's been correctly identified a table lock is helpful, adding a sleep(2) to simulate the connections holding onto the lock too long (db network issue), the result is weird balances.

Separately, looking at django-concurrent-test-helper there's a note about the methodology of process forking. That it uses threads instead of processes. I'm not familiar enough with psycopg2, django, and python to know if the threading is isolated enough such that the concurrent test is doing exactly what we expect it to do.

Recommendations

  • Move update_running_totals() to a DB function/trigger, resulting in guarantees on the values.
  • Move to a non-cache based architecture

As it stands, any organization like our own would not be able to adopt this feature because we rely on Balance being 100% correct as we move money in bank accounts based the current value of the balance. If a Transaction+Legs fail to insert because RunningTotal could not be updated, that's okay/handle-able. It's when there's a failure but it acts like it worked, we have real trouble.

@PetrDlouhy
Copy link
Collaborator Author

@nitsujri Seems very reasonable.
I am not sure, if the BalanceStatements are 100 % prone to rounding errors, but the clean thread safety seems like big benefit. It makes the design more like integral part of the application and less like a caching tool.

Although I am not sure, if I will find enough time for this in the near future.

@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch 4 times, most recently from a8f3fa9 to e8bd6ba Compare May 30, 2023 10:49
@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch from 2399fd7 to 3d59235 Compare January 5, 2024 08:51
@PetrDlouhy PetrDlouhy force-pushed the experimental/running_totals branch from c2d176a to fefe20a Compare January 5, 2024 13:55
@adamcharnock
Copy link
Owner

adamcharnock commented Jun 25, 2024

This discussion sounds interesting and good to me. I have been aware of this possibly becoming a performance issue.

Question: What kind of scale are you seeing this become an issue at? (i.e. roughly how many legs do you have on a big account)

I'm very much in favour of this being done in-database rather than in-django. I think there are a couple of sides of this:

  1. Maintenance of the BalanceStatement records (Is this done online upon insert, or offline as a batch job?)
  2. Deletion of any relevant BalanceStatement records should a Leg be updated.

I could also see this functionality being provided rather easily by a postgresql materialised view, but that would be Postgresql-only (and BalanceStatements would be updated as a batch job). How would people feel about that?

@adamcharnock
Copy link
Owner

adamcharnock commented Jun 25, 2024

I just had a little play with the SQL required to generate balances for all accounts (would would be useful for implementing this as a materialised view). Not sure if this will be useful, but I need to run so I'll leave it here in case it is:

UPDATE: Ignore this old implementation, instead see the new version in source.

SELECT
    A.id as account_id,
    L.*
FROM hordak_account A
INNER JOIN LATERAL
    (
        SELECT
            L2.amount_currency as balance_currency,
            COALESCE(SUM(L2.amount), 0.0) as balance,
            MAX(L2.id) calculated_to_leg_id
        FROM hordak_account A2
        INNER JOIN public.hordak_leg L2 on L2.account_id = A2.id
        WHERE A2.lft >= A.lft AND A2.rght <= A.rght AND A.tree_id = A2.tree_id
        GROUP BY L2.amount_currency
    ) L ON True;

The results will be unique on (account_id, balance_currency)


EDIT 1: Related: Calculating the balances for a list of accounts is also slow because this is all calculated in Python/Django, and not directly in the database. If we had a database function for account balance calculations (ACCOUNT_BALANCE(account_id)) then we could pull balances from the database along with their accounts (relates to #52). For example: Account.objects.annotate(balance=Balance(F('account_id')))


EDIT 2: A question occurs to me: Is a BalanceStatement supposed to be used as an internal caching tool only, or is this also a user-facing concept in some way? I suspect it is the former.

@PetrDlouhy
Copy link
Collaborator Author

PetrDlouhy commented Jun 28, 2024

@adamcharnock Ours service has ~1M users, (x2 Hordak accounts), ~13M transactions (x2 legs). Some of the accounts have much larger number of accounts than others, I would expect that maximum can reach 1M transactions per accont.
But I had to solve the performance issues few years back, so I expect at ~10x less data.

We are using PostgreSQL, so PSQL only is not a problem for us.
I am not sure, if DB functions would by fast enough, but I can do some tests.

@PetrDlouhy
Copy link
Collaborator Author

That SQL function doesn't work very quickly for me. This is explain analyze for 1 account:

explain analyze SELECT
    A.id as account_id,
    L.*
FROM hordak_account A
INNER JOIN LATERAL
    (
        SELECT
            L2.amount_currency as balance_currency,
            COALESCE(SUM(L2.amount), 0.0) as balance,
            MAX(L2.id) calculated_to_leg_id
        FROM hordak_account A2
        INNER JOIN public.hordak_leg L2 on L2.account_id = A2.id
        WHERE A2.lft >= A.lft AND A2.rght <= A.rght AND A.tree_id = A2.tree_id
        GROUP BY L2.amount_currency
    ) L ON True where A.id=1;
                                                                                   QUERY PLAN                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=15610.28..15614.34 rows=1 width=44) (actual time=31348.592..31348.600 rows=1 loops=1)
   ->  Index Scan using hordak_account_pkey on hordak_account a  (cost=0.09..4.09 rows=1 width=16) (actual time=0.010..0.015 rows=1 loops=1)
         Index Cond: (id = 1)
   ->  GroupAggregate  (cost=15610.20..15610.24 rows=1 width=40) (actual time=31348.579..31348.581 rows=1 loops=1)
         Group Key: l2.amount_currency
         ->  Sort  (cost=15610.20..15610.21 rows=19 width=14) (actual time=25666.045..28243.838 rows=12684935 loops=1)
               Sort Key: l2.amount_currency
               Sort Method: external merge  Disk: 322928kB
               ->  Nested Loop  (cost=0.17..15610.12 rows=19 width=14) (actual time=0.392..20596.186 rows=12684935 loops=1)
                     ->  Index Scan using idx_hordak_account_on_tree_id_lft_id on hordak_account a2  (cost=0.09..4.09 rows=1 width=4) (actual time=0.020..0.022 rows=1 loops=1)
                           Index Cond: ((tree_id = a.tree_id) AND (lft >= a.lft))
                           Filter: (rght <= a.rght)
                     ->  Index Scan using hordak_leg_account_id on hordak_leg l2  (cost=0.09..15581.32 rows=8237 width=18) (actual time=0.370..18445.563 rows=12684935 loops=1)
                           Index Cond: (account_id = a2.id)
 Planning Time: 1.824 ms
 Execution Time: 31401.735 ms
(16 rows)

@adamcharnock
Copy link
Owner

Oof, 31 seconds. Ok, I'll take another look and see what I can do. Off-the-cuff thoughts:

  1. This is getting the balance including child accounts, which will be slower and is often not required. Maybe a 'get_simple_balance()' would be faster, which ignores child accounts. I'll put that together and do some tests.
  2. I would be interested to know how Create accounting-oriented database view on Legs table #120 performs for you too

@PetrDlouhy
Copy link
Collaborator Author

I don't use any child accounts. I just have 2 accounts for every user and then 3 internal accounts with the in/outband transactions.

@adamcharnock
Copy link
Owner

adamcharnock commented Jun 28, 2024

Ah. Yes, I see you are (reasonably) using the function from the comment above. I've improved this now and you can find the better version here:

https://github.com/adamcharnock/django-hordak/blob/feature/get-balance/hordak/migrations/0047_get_balance.pg.sql

Once you've run this SQL, you should be able to do this:

postgres.public> select get_balance(7)
[2024-06-28 20:25:18] 1 row retrieved starting from 1 in 430 ms (execution: 420 ms, fetching: 10 ms)

That 420ms for an account with 1 million legs. What do you see on your side?

UPDATE: This is on an M2 Macbook. Also, if I just calculate the balance for the one account (not including children) then it shaves about 30-40% off the execution time. This is a decent win, but I think we'll get bigger gains from adding running totals, as per this PR.

I'm copying this comment to the #126 PR too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants