Append creates very slow node to read #2093

Open
giuse88 opened this issue Dec 26, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@giuse88

giuse88 commented Dec 26, 2024

Describe the bug

Hi,

I noticed that when I append a dataframe to a node, reading that node becomes very slow. To clarify, this is the code that reproduces the problem:

import time
from datetime import datetime

import numpy as np
import pandas as pd

# `ac` is an already-connected Arctic instance (its setup is not shown in this report).
lib = ac['test_lib']
lib.write('test', pd.Series([1, 1, 1]))
print(lib.read('test').data)

# Build a 50-column frame of random integers with an hourly DatetimeIndex.
cols = ['COL_%d' % i for i in range(50)]
df = pd.DataFrame(np.random.randint(0, 50, size=(1000, 50)), columns=cols)
df.index = pd.date_range(datetime(2000, 1, 1, 5), periods=1000, freq="h")
print(lib.write('test', df), 'Data Written')

# Time a read of the symbol written with a single write.
start = time.time()
print(lib.read('test').data)
end = time.time()
print(end - start)

# Append the same frame one row at a time to a second symbol.
lib.delete('test_ap')
for idx in range(len(df)):
    lib.append('test_ap', df.iloc[[idx]])
    print(idx)

print('append done')
# Time a read of the symbol built from single-row appends.
start = time.time()
print(lib.read('test_ap').data)
end = time.time()
print(end - start)

output:

Lib available
0    1
1    1
2    1
dtype: int64
VersionedItem(symbol='test', library='test_lib', data=n/a, version=15, metadata=None, host='S3(endpoint=s3.eu-west-2.amazonaws.com, bucket=crypto-data-s3)', timestamp=1735175412730032201) Data Written
                     COL_0  COL_1  COL_2  COL_3  COL_4  COL_5  COL_6  COL_7  COL_8  COL_9  COL_10  COL_11  ...  COL_38  COL_39  COL_40  COL_41  COL_42  COL_43  COL_44  COL_45  COL_46  COL_47  COL_48  COL_49
2000-01-01 05:00:00     19     49     34     13     40      1     32     36     32     14      38       2  ...      44       2      23       0      33       9       3      22      33      20      11      26
2000-01-01 06:00:00      8     25     41     26     33     48     32     36      1      5      21      45  ...      25      21       6      16      12      47       6      11      48      37      23      48
2000-01-01 07:00:00     21     14     45     21     10      7      5     22     24     27      49       8  ...       3      10      22      29      33       4      44      12       4      27      43      26
2000-01-01 08:00:00     33      7     49     19     40     47     26     32      7     20      28      30  ...      21       1      23      45      22      31      18      12      43      11       2       3
2000-01-01 09:00:00     38      6     15     13     17      7     11     22     12     39      35       1  ...       3       6      13      46       6       1       6      12       0      25       7      18
...                    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...     ...     ...  ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
2001-02-20 16:00:00     32      0     25     15     17     35     15     34     19     46      32      29  ...       1      18      17       5      42      26      24      16       5      31      48      40
2001-02-20 17:00:00     31      9     32     18     46     12      5     13      0     49      28      31  ...      33       5      49      45      33       4      22      38      41      15      23      25
2001-02-20 18:00:00      0     24     13     35     13     43     34     41     35      0      19       8  ...      28       7      31      14      47      10      17      38       5      41      32      47
2001-02-20 19:00:00     29     45     21     40     19     12     13     25     39     38      16      10  ...      36      26      49      23       8       2      18      46      42      39      27      29
2001-02-20 20:00:00     46      5     37     41     14     25     17     37      0     15       0       6  ...      39      19      28       2       2      25      28      48       7      13      35      42

[10000 rows x 50 columns]
0.3140294551849365
append done
                     COL_0  COL_1  COL_2  COL_3  COL_4  COL_5  COL_6  COL_7  COL_8  COL_9  COL_10  COL_11  ...  COL_38  COL_39  COL_40  COL_41  COL_42  COL_43  COL_44  COL_45  COL_46  COL_47  COL_48  COL_49
2000-01-01 05:00:00     11     18     21     42     42     28     48     35      5     35      35      37  ...       2      44      44      46       5      49      13      26      35      49       6       7
2000-01-01 06:00:00      9     48     16     14     21     39     33     27     21      0      40      31  ...      28      13      23      24      39      44      25      26      43      40      16       7
2000-01-01 07:00:00     11     25      1     35     29     18     19     32     31     10      29      21  ...      31      49       8      19      17       7      35      32      35      31      43       4
2000-01-01 08:00:00      2     22      6     12      6     34     12     42     21     49       6      43  ...      47       6      46      46      45      15      16       7      14      37      29       5
2000-01-01 09:00:00     28      8      3     34     39      9      2     32     34      1      29       1  ...      19      32      22      43      24       8      19      43      41      15       4      47
...                    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...     ...     ...  ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
2000-04-26 21:00:00     44     10      9     15     41     11     12     38     31      7      22      47  ...      40       4      12      31      23      38      30      25      38      20      42      39
2000-04-26 22:00:00      9     28     31     18     10     21     21      5     36     33      18      45  ...      30      33       8      19      42      24      24      29      18      19       8      34
2000-04-26 23:00:00      4     46     42     26     10     36     23      0     46     23       6      26  ...      31      42      27      32      13      32      35      36      35      10      10       5
2000-04-27 00:00:00     35     44     46      3     37     42      3      5     41     31      44      13  ...       5      19       7      15      44      26      46      33       2      17      25      49
2000-04-27 01:00:00      5     40     40     37     28     16      5     48     36     37      43      38  ...      49      45      18      46       3      45      13      14      40      29      35      49

[2805 rows x 50 columns]
47.270530462265015

You can see that reading the data from the appended node takes about 47 s, compared with about 0.31 s for the node written in one call.

The library is exactly the same; the only difference is the use of the append function.

Expected Results

The read takes the same amount of time.

OS, Python Version and ArcticDB Version

Python: 3.11.11 (main, Dec 4 2024, 08:55:08) [GCC 13.2.0]
OS: Linux-6.8.0-1018-aws-x86_64-with-glibc2.39
ArcticDB: 5.1.2

Backend storage used

AWS S3

Additional Context

None

giuse88 added the bug label Dec 26, 2024
@G-D-Petrov
Collaborator

G-D-Petrov commented Dec 27, 2024

Hi @giuse88,
This is more or less expected, as each append creates a separate version in storage.
When reading, all of the individual versions need to be read, and in the repro above this means that 1000 individual IO operations need to be performed, which can be expensive over the network.
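
To put rough numbers on this, here is a back-of-the-envelope calculation using only the figures printed in the report above. It assumes (this is not confirmed in the thread) that each single-row append is fetched with its own storage request. Note that the single-write read returned 10000 rows while the appended read returned 2805, so the comparison if anything understates the gap.

```python
# Figures copied from the output in the original report.
single_write_read_s = 0.3140294551849365   # read after one lib.write
appended_read_s = 47.270530462265015       # read after row-by-row lib.append
appended_rows = 2805                       # rows returned, one append per row

# Assumption: every appended row costs one separate storage request on read.
seconds_per_append = appended_read_s / appended_rows
slowdown = appended_read_s / single_write_read_s

print(f"{seconds_per_append * 1000:.1f} ms per appended row")
print(f"{slowdown:.0f}x slower than the single-write read")
```

A per-row cost in the tens of milliseconds is in the range one might expect for a network round trip to object storage, which is consistent with the explanation above but does not prove it.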

It does seem slower than expected, though, so I will investigate further.
In the meantime, a workaround for your use case is to read the data after the appends and rewrite it to the symbol, e.g. something like lib.write("test_ap", lib.read("test_ap").data)
This will make subsequent reads much faster, i.e. as fast as the read after the initial write in your repro.
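
The workaround can be wrapped in a small helper. This is only a sketch: `compact_symbol` is a made-up name, `lib` is assumed to be an ArcticDB library object like the one in the repro, and passing `prune_previous_versions` through to `write` is an optional extra that the thread does not discuss.

```python
def compact_symbol(lib, symbol, prune_previous_versions=False):
    """Rewrite `symbol` with a single write so later reads touch fewer pieces.

    `lib` is assumed to be an ArcticDB library; the helper name is illustrative.
    """
    # One slow read that gathers everything the individual appends stored.
    data = lib.read(symbol).data
    # A full write stores the same data again as a new version from one call.
    return lib.write(symbol, data, prune_previous_versions=prune_previous_versions)
```

For the repro this would be called as compact_symbol(lib, "test_ap") once the append loop has finished. The rewrite itself still pays for the slow read once.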

@giuse88
Author

giuse88 commented Dec 27, 2024

This is a bit weird; I thought the data structure would be the same for a write and an append. Apart from what you suggested, isn’t there a way to turn off this behaviour?

The other thing I've noticed is that I am creating a version for each append. Is it possible to turn that off?

@giuse88
Author

giuse88 commented Dec 27, 2024

Does ArcticDB support write with an index?

@G-D-Petrov
Collaborator

> isn’t there a possibility to turn off this behaviour?
>
> The other thing I've noticed is that I am creating a version for each append. Is it possible to turn it off?

It is not possible to turn off this behavior for append. It is intentional that every new 'write' operation (e.g. write, append, update) creates a new version.
That is what the architecture of ArcticDB is based around.
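
Since every append creates a version, one way to end up with fewer of them is to append several rows per call instead of one. The sketch below is an illustration, not something proposed in this thread: `append_in_batches` and `batch_size` are hypothetical names, `lib` is assumed to be an ArcticDB library, and `df` a pandas DataFrame like the one in the repro.

```python
def append_in_batches(lib, symbol, df, batch_size=100):
    """Append `df` to `symbol` in slices of `batch_size` rows.

    Returns the number of append calls made, which is also the number of
    new versions created under the behaviour described above.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    calls = 0
    for start in range(0, len(df), batch_size):
        # Positional slicing keeps the rows (and their index) in order.
        lib.append(symbol, df.iloc[start:start + batch_size])
        calls += 1
    return calls
```

With the 1000-row frame from the repro and batch_size=100, this makes 10 append calls rather than 1000. Whether batching fits depends on how the data arrives; if rows really do arrive one at a time, they would need to be buffered first.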

> Does ArcticDB support write with an index?

I am not sure what you mean by this. We support writing data frames to symbols.
You can:

  • write a whole symbol/data frame with write
  • add to an existing symbol/data frame with append
  • change an existing symbol/data frame with update

@giuse88
Author

giuse88 commented Jan 4, 2025

Hi @G-D-Petrov, thank you for your help. I started using LMDB as the backend to improve performance, and the speed is much better. I will also do what you suggested.

However, I've noticed another difference between append and write: append seems to create massive files. I have a library holding ~2M of data, but ArcticDB is using 166M to store it! If I copy this library into a new one with a single write, the size is less than 1M.

Is this expected behaviour?

[Image: Screenshot 2025-01-04 at 10 27 57]

@giuse88
Author

giuse88 commented Jan 4, 2025

Just for reference, versions & snapshots are off. No extra data is created
