
Memory Consumption when performing a left_join() #434

Open · larry77 opened this issue Jan 1, 2025 · 10 comments

larry77 commented Jan 1, 2025

Hello,
Please have a look at the reprex at the end of this post.
I have a seasoned laptop with 8 GB of RAM running Debian stable.
When I carry out an aggregation and then a left join with arrow, I need to dip into swap when I collect the result (a long data frame), but the computation completes.
Instead, if I run just the duckplyr part (commented out in the second half of the reprex), memory is so insufficient that the laptop freezes. Can someone look into this? Does duckplyr have a much higher memory consumption than arrow here? Thanks a lot.

library(tidyverse)

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:lubridate':
#> 
#>     duration
#> The following object is masked from 'package:utils':
#> 
#>     timestamp


# Uncomment and run this only once to create the test data
## dd <- tibble(x = 1:100000000, y = rep(LETTERS[1:20], 5000000))
## write_csv(dd, "test.csv")

df2 <- open_dataset("test.csv",
  format = "csv",
  skip_rows = 0)

system.time({
    df_stat2 <- df2 |>
        group_by(y) |>
        summarise(total = sum(x)) |>
        ungroup()

    df_out2 <- df2 |>
        left_join(y = df_stat2, by = c("y")) |>
        collect()
})
#>    user  system elapsed 
#>  36.892   5.439  26.582


df_out2|>glimpse()
#> Rows: 100,000,000
#> Columns: 3
#> $ x     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
#> $ y     <chr> "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",…
#> $ total <int64> 249999955000000, 249999960000000, 249999965000000, 24999997000…


## Uncomment to run -- this takes a lot of memory on my system

## library(duckplyr)

## df <- duck_csv("test.csv")

## system.time({
##     df_stat <- df |>
##         summarise(total = sum(x), .by = y)
##
##     df_out <- df |>
##         left_join(y = df_stat, by = c("y")) |>
##         collect()
## })


sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] arrow_15.0.1    lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1  
#>  [5] dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.1    
#>  [9] tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5          gtable_0.3.5       compiler_4.4.2     reprex_2.1.0      
#>  [5] tidyselect_1.2.1   assertthat_0.2.1   scales_1.3.0       yaml_2.3.8        
#>  [9] fastmap_1.1.1      R6_2.5.1           generics_0.1.3     knitr_1.46        
#> [13] munsell_0.5.1      R.cache_0.16.0     pillar_1.10.1.9000 tzdb_0.4.0        
#> [17] R.utils_2.12.3     rlang_1.1.4        stringi_1.8.4      xfun_0.43         
#> [21] fs_1.6.4           bit64_4.0.5        timechange_0.3.0   cli_3.6.3         
#> [25] withr_3.0.0        magrittr_2.0.3     digest_0.6.35      grid_4.4.2        
#> [29] hms_1.1.3          lifecycle_1.0.4    R.methodsS3_1.8.2  R.oo_1.26.0       
#> [33] vctrs_0.6.5        evaluate_0.23      glue_1.7.0         styler_1.10.3     
#> [37] colorspace_2.1-0   rmarkdown_2.26     tools_4.4.2        pkgconfig_2.0.3   
#> [41] htmltools_0.5.8.1

Created on 2025-01-01 with reprex v2.1.0

krlmlr added this to the 1.0.0 milestone on Jan 1, 2025
krlmlr (Member) commented Jan 1, 2025

Thanks, we need to work on the docs here.

Can you try:

duck_exec("set memory_limit='1GB'")

or a similar value?
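
For instance, right after loading duckplyr (the 1GB value is just a starting point to tune):

library(duckplyr)

# Cap the memory DuckDB may use in this session
duck_exec("set memory_limit='1GB'")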

larry77 (Author) commented Jan 1, 2025

Mmmhhh... I added that line just after loading the duckplyr library, but RAM consumption still goes through the roof on my system.

krlmlr (Member) commented Jan 2, 2025

Thanks. The session info indicates you're using arrow 15.0.1, but it doesn't show which version of duckdb is installed. Can you please double-check that you're on the latest, 1.1.3-1?
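
For example, from R:

# Print the installed version of the duckdb package
packageVersion("duckdb")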

larry77 (Author) commented Jan 2, 2025

Hello, duckdb was automagically installed when I installed duckplyr. From a more explicit sessionInfo() (after loading duckdb and duckplyr) I see this:

other attached packages:
[1] duckdb_1.1.3-1        DBI_1.2.2             duckplyr_0.99.99.9910
[4] dplyr_1.1.4

So I am not using an outdated version of duckdb.

krlmlr (Member) commented Jan 2, 2025

Thanks. Confirming: on a VM with 4 GB of RAM running Debian Bookworm in OrbStack, the following example is killed:

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(duckplyr)
library(readr)

if (!file.exists("test.csv")) {
  dd <- tibble(x = 1:100000000, y = rep(LETTERS[1:20], 5000000))
  write_csv(dd, "test.csv")
}

duck_exec("set memory_limit='1GB'")

df <- duck_csv("test.csv")

df_stat <- df |>
  summarise(total = sum(x), .by = y)

df_out <- df |>
  left_join(y = df_stat, by = c("y")) |>
  collect()

df_out

krlmlr changed the title from "Memory Consumption when performing a left_join" to "Memory Consumption when performing a left_join()" on Jan 2, 2025
larry77 (Author) commented Jan 2, 2025

I see. I have 8 GB on my machine, and I opened the issue because I can run the code with arrow (with some difficulty), but not with duckplyr. No intention to start a competition between the two tools, but I suspected there might be a memory leak in duckplyr.

krlmlr (Member) commented Jan 2, 2025

Thanks. This could be related to the following phenomenon, which I'd classify as a bug: I'd expect collect() to return a plain tibble, but it doesn't:

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(duckplyr)

duck_tbl(a = 1) |>
  collect() |>
  class()
#> [1] "duckplyr_df" "tbl_df"      "tbl"         "data.frame"

duck_tbl(a = 1, .lazy = TRUE) |>
  collect() |>
  class()
#> [1] "duckplyr_df" "tbl_df"      "tbl"         "data.frame"

Created on 2025-01-02 with reprex v2.1.1
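
If it helps in the meantime, stripping the class explicitly should work as a stopgap (an untested sketch; I'm assuming as_tibble() on a duckplyr_df returns a plain tibble):

# Same reprex as above, with an explicit conversion that drops the
# duckplyr_df class; expected to print "tbl_df" "tbl" "data.frame"
duck_tbl(a = 1) |>
  collect() |>
  as_tibble() |>
  class()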

larry77 (Author) commented Jan 2, 2025

To finish this off on my side: if I kill almost every other process, duckplyr also gets the job done on my machine, but its memory consumption is significantly higher than arrow's.

krlmlr (Member) commented Jan 4, 2025

I'm no longer sure that collect() is really the problem here.

The following example works, even with 4 GB:

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(duckplyr)
library(readr)

if (!file.exists("test.csv")) {
  dd <- tibble(x = 1:100000000, y = rep(LETTERS[1:20], 5000000))
  write_csv(dd, "test.csv")
}

duck_exec("set memory_limit='1GB'")

df <- duck_csv("test.csv")

df_stat <- df |>
  summarise(total = sum(x), .by = y)

df_out <- df |>
  left_join(y = df_stat, by = c("y")) |>
  compute_parquet("test.parquet")

df_out

Perhaps you can also use compute(). #439 might be a problem, though; this also needs a very recent version of duckdb, on CRAN later this month.
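
For reference, a minimal sketch of the compute() variant (untested here; the idea is to materialize the join into a temporary DuckDB table rather than an R data frame):

df_out <- df |>
  left_join(y = df_stat, by = c("y")) |>
  compute()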

The large memory consumption is still interesting, though.

larry77 (Author) commented Jan 4, 2025

Thanks. I now also have a procedure to convert a CSV into a Parquet file without ingesting everything into memory.
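
For the record, the recipe condensed from the code above (a sketch; adjust the memory limit to your machine):

library(duckplyr)

# Stream the CSV through DuckDB and write Parquet without
# materializing the full table in R
duck_exec("set memory_limit='1GB'")
duck_csv("test.csv") |>
  compute_parquet("test.parquet")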
