Retrieving large amounts of data from a database #24
I think your assessment is right. I am considering adding streaming retrieval for the PostgreSQL driver, but haven't looked into the details yet. On the other hand the
let rec loop batch acc =
  if batch >= batch_count then Lwt.return_ok acc else
  Db.fold_s request f (params, batch) acc >>=? loop (batch + 1)
in
loop 0
where
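For readers who want to try the batching idea without a database, here is a minimal, self-contained sketch of the same loop shape. It is not Caqti code: `table` and `fold_batch` are hypothetical stand-ins for a real table and for `Db.fold_s` on an OFFSET/LIMIT query, and plain `result` replaces the Lwt result monad.

```ocaml
(* Hypothetical stand-in for a database table: rows 1..10. *)
let table = List.init 10 (fun i -> i + 1)

(* Hypothetical stand-in for Db.fold_s on an OFFSET/LIMIT query:
   folds [f] over the rows belonging to batch number [batch] only. *)
let fold_batch ~batch_size f batch acc : (_, string) result =
  let lo = batch * batch_size and hi = (batch + 1) * batch_size in
  let rows = List.filteri (fun i _ -> i >= lo && i < hi) table in
  Ok (List.fold_left (fun acc row -> f row acc) acc rows)

(* Fold over all batches in order, threading the accumulator through
   and stopping at the first error. *)
let fold_batched ~batch_size ~batch_count f init =
  let rec loop batch acc =
    if batch >= batch_count then Ok acc
    else
      match fold_batch ~batch_size f batch acc with
      | Ok acc -> loop (batch + 1) acc
      | Error _ as e -> e
  in
  loop 0 init

(* Sum all rows using 4 batches of at most 3 rows each. *)
let total = fold_batched ~batch_size:3 ~batch_count:4 ( + ) 0
```

Only one batch of rows is materialised at a time, which is the point of the approach; the batch count has to be known or computed up front.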
Thanks for the prompt reply. I have had a go at implementing the batching solution in my application and it seems to work quite nicely. I would still be happy to implement single row mode as it would be useful going forward, particularly when using the stream functions added by #22. I haven't looked into it in detail, but my guess would be that there are three main ways this could be added:
Do you have a preference for which way would be best? 1 seems like it might be overkill, but it would mean that the other drivers do not have to care about the single row mode config. I think that 2 and 3 would both work, but I think 2 might be a cleaner API.
A PR would be much appreciated, and discussing the details now is already useful even if I implement it myself later. I agree that 1 is overkill, but see my option 5 below which would provide the same functionality from the user's point of view. Your option 2 offers selection per query, while 3 offers selection per query and per query parameters. Both get a bit invasive on the API, since the parameter must be duplicated across convenience functions, but I think we can find an acceptable solution. I have a slight preference for 3 due to the finer granularity, supplementing the
I think there are two other options for selecting single-row mode:
These are nice from an API point of view, but if the segmentation of transfers implied by single-row-mode has a significant impact on performance, having better granularity would be good. Option 5 would allow exploring performance impact, as well as allowing the sysadmin to configure the mode according to DB size vs memory size. So, I think one way forward would be to implement 5, and supplement with 2 or 3 only when/if needed, in which case the URI parameter would act as a default.
5 sounds sensible to me. Just to check, do you mean that the user could pass a connection URL that looks like:
to enable single row mode for queries with
Looking at the documentation here, my understanding is that the parameter would be ignored by other clients, which would print a warning to
Yes, this was my thought precisely. I see your point about the URI being reused; that could even happen if another part of a project uses the same configuration parameter to connect directly. But I think we can leave it up to the application to split the URI into two parts, e.g. like
db_uri: "postgresql://localhost/foo?connect_timeout=10"
db_caqti_options: "?single_row_mode=true"
and merge in the second query string before passing the URI to the Caqti connector. The sqlite3 driver already has some facilities to decode query parameters like
At the moment the URI documentation is in the mli files of the drivers, though it would be good to make it more visible. Candidate places to document it would be in
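As an illustration of the merge step described above, here is a rough sketch using only the standard library. The `merge_query` helper is hypothetical and deliberately naive: it treats the URI as a plain string and assumes it has no fragment part. A real application would more likely parse and rebuild the URI with a proper URI library before handing it to the connector.

```ocaml
(* Hypothetical helper: merge an extra query string such as
   "?single_row_mode=true" into a URI given as a plain string.
   Assumes the URI carries no "#fragment" part. *)
let merge_query base extra =
  (* Drop a leading '?' from the extra options, if present. *)
  let extra =
    if String.length extra > 0 && extra.[0] = '?'
    then String.sub extra 1 (String.length extra - 1)
    else extra
  in
  if extra = "" then base
  else if String.contains base '?' then base ^ "&" ^ extra
  else base ^ "?" ^ extra

(* The two configuration values from the example above. *)
let db_uri = "postgresql://localhost/foo?connect_timeout=10"
let db_caqti_options = "?single_row_mode=true"

let merged = merge_query db_uri db_caqti_options
```

Keeping the Caqti-specific options in a separate setting means other clients reading `db_uri` never see parameters they do not understand.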
I'd like to argue in favor of point "4.":
The main concern is:
Other database client libraries in other languages usually solve this by using a batch mode. That is, they do perform streaming, but fetch 100 rows at once before further processing (or 128 rows if you prefer round numbers). This is usually not configurable by enabling/disabling streaming completely, but instead by allowing the batch size to be specified on a per-query basis. In practice I've never found a reason to adjust those defaults. I did perform experiments and measurements, but never found a batch size that performs significantly better than the defaults. However, in theory this might be desirable if you fetch huge rows, e.g. if every single row is already 100 MiB.
Do you know how batches are implemented in those client libraries? Using |
Oh, I'm sorry, I just realized that I confused reading and writing. I was describing batch inserts, where the inserts are not executed one-by-one, but also not executed all at once, but instead in batches of around 100 rows. Sorry for the noise!
I'm curious whether the streaming reading will actually have a performance impact or not. In interpreted languages, it usually does, but that doesn't tell us anything about OCaml. In the best case, it doesn't, so we can just fully switch to that one in the future.
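For what it's worth, the batching part of such batch inserts boils down to chunking the rows. A minimal sketch follows; the `chunks` helper is hypothetical and not part of Caqti or any client library mentioned here, and the actual insert per chunk is left out.

```ocaml
(* Hypothetical helper: split a list into consecutive chunks of at most
   [n] elements, preserving order. *)
let chunks n xs =
  if n <= 0 then invalid_arg "chunks";
  (* [cur] holds the current chunk in reverse, [k] is its length. *)
  let rec go acc cur k = function
    | [] -> List.rev (if cur = [] then acc else (List.rev cur :: acc))
    | x :: rest ->
      if k = n then go (List.rev cur :: acc) [x] 1 rest
      else go acc (x :: cur) (k + 1) rest
  in
  go [] [] 0 xs

(* 250 rows split into batches of 100: two full batches and a remainder. *)
let sizes = List.map List.length (chunks 100 (List.init 250 Fun.id))
```

Each chunk would then be sent as one multi-row insert, so neither one statement per row nor one huge statement is needed.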
@vog Did they do any benchmarking to pin down the bottleneck? I would think, at least for compiled languages, the slowdown would be due to fragmenting the data into individual packets, rather than in the application itself. If so, all solutions based on the same client library would have this issue.
@paurkedal Sorry, I did not pin down the bottleneck, I was just tuning parameters. But when I talked about other programming languages, I mostly had interpreted languages in mind (Python, Ruby, PHP, Perl), so those languages would probably not be good for comparison anyway.
What I found on the net about libpq single-row mode performance [1, 2] suggests a slowdown by a factor of 2, for presumably efficient C code running against localhost. I am not sure how this compares to the following benchmarks, since we don't have the absolute row rate from the C benchmarks.
This is less relevant to this issue, but I benchmarked the stream implementation against other options:
I committed the code; use |
Commit 60f2e01 implements single-row mode for multi-row requests when
for normal mode vs
for single-row mode, when using a local socket connection. The test bench_fetch_many.ml.txt can be compiled with
(executable
 (name bench_fetch_many)
 (modules bench_fetch_many)
 (libraries
  bechamel bechamel-notty
  caqti caqti.blocking caqti-driver-postgresql
  notty.unix))
It looks like the inefficiency is due to polling. With the following incorrect replacement, single-row mode is within a factor of 2 of normal mode:
diff --git a/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml b/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml
index a2712b6..d2ece8b 100644
--- a/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml
+++ b/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml
@@ -381,18 +381,8 @@ module Connect_functor (System : Caqti_platform_unix.System_sig.S) = struct
| exception Pg.Error msg -> return (Error msg)
| socket -> Unix.wrap_fd aux (Obj.magic socket))
- let get_next_result ~uri ~query db =
- let rec retry fd =
- db#consume_input;
- if db#is_busy then
- Unix.poll ~read:true fd >>= (fun _ -> retry fd)
- else
- return (Ok db#get_result)
- in
- try Unix.wrap_fd retry (Obj.magic db#socket)
- with Pg.Error err ->
- let msg = extract_communication_error db err in
- return (Error (Caqti_error.request_failed ~uri ~query msg))
+ let get_next_result ~uri:_ ~query:_ db =
+ return (Ok db#get_result)
let get_one_result ~uri ~query db =
get_next_result ~uri ~query db >>=? function
Firstly, thanks for a great package, caqti has proved to be super-useful for handling database interactions in OCaml.
I'm currently trying to work out how to efficiently extract large amounts of data (multiple GB) from a PostgreSQL database, which I then want to fold over using lwt. Unfortunately, naively using a single SELECT query and fold_s does not work, as caqti loads all of the data in memory, and causes my machine to run out of RAM.
I have had a look through the PostgreSQL documentation, and it seems like there are two ways to do this (if you can think of others I would be happy to hear them):
OFFSET/LIMIT batching, potentially with an extra query at the start to count the rows that will be returned
caqti as far as I can tell
I could potentially implement the OFFSET/LIMIT behaviour in my application, but that would mean that I could no longer make use of the fold_s helper function, which is very useful IMO. Similarly, it probably does not make sense for caqti to implement this sort of batching, as the API does not provide enough detail about the SQL statements being executed to make sure that adding OFFSET and LIMIT parameters would not conflict.
Do you think it would be worth adding the single row mode behaviour to caqti when using the postgres driver - either by default or (more likely) as an option?
I would be happy to put in a PR to add this function.