Performance experiments with duckplyr in R
I recently became aware of the `duckplyr` library for R. Here are the results of my experimenting with it and benchmarking it against `dplyr`.
2024-08-26 19:00 +0200
I recently became aware of the `duckplyr` library for R, which takes the place of tidyverse’s `dplyr` library but uses the DuckDB database under the hood. Without really knowing anything about how `dplyr` works internally, or whether DuckDB would improve my workflow at all, I decided to perform an experiment. I am currently analyzing two datasets, one with ~80k records and ~70 variables and one with ~60k records and ~100 variables. Both datasets are wrangled with Tidyverse-foo in multiple ways and finally combined. The wrangling of the data involves things like `rowwise()` and `c_across()`, which I know from experience are quite ’expensive’ operations.
In order to get the execution times of my code, I did this repeatedly:

- Restart R (by pressing Ctrl+Shift+F10).
- Run `system.time(rmarkdown::render("my_file.Rmd"))`.
- Record the user time and the system time elapsed.
- Repeat twice.
I did this with both the “balanced power mode” and the “performance mode” on my laptop. During execution of the code, I left the laptop alone in order not to interfere with the timing.
This is the result of my benchmarking (the table of average times is below). The times are user times; I left out the system times, which are in the range of 2-3 seconds.
Not really mind-boggling, right? It occurred to me that I had better double-check that `duckplyr` was really being used. Indeed, this was not the case:

```
> class(clinical_data)
[1] "tbl_df" "tbl" "data.frame"
```

`clinical_data` was missing the `duckplyr_df` class. How come?
I import the raw data from Excel files (don’t ask…) into tibbles, and evidently, this prevents `duckplyr` from seeing the data frames. So I piped the data frames through `as_duckplyr_tibble()` explicitly, and this got me the right classes:

```
> class(clinical_data)
[1] "duckplyr_df" "tbl_df" "tbl" "data.frame"
```
However, this did not really speed up the execution either.
I looked around my RMarkdown chunks and their outputs, but I did not find any warning that `duckplyr` had to fall back to `dplyr`’s methods, which could have explained the absence of a noticeable difference.
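Had I wanted a hard guarantee rather than an absence of warnings, one option (going by my reading of the `duckplyr` documentation; treat the variable name as an assumption on my part) is the `DUCKPLYR_FORCE` environment variable, which turns a silent fallback into an error:

```r
# Assumption: with DUCKPLYR_FORCE set, operations that duckplyr cannot
# translate to DuckDB raise an error instead of silently falling back
# to dplyr, so any fallback becomes immediately visible.
Sys.setenv(DUCKPLYR_FORCE = TRUE)
system.time(rmarkdown::render("my_file.Rmd"))
```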
Here are the average times (in seconds) for the benchmarking runs.
```
> runs_table
# A tibble: 6 × 4
# Groups:   library, power_mode [6]
  library                            power_mode   mean    sd
  <chr>                              <chr>       <dbl> <dbl>
1 dplyr                              balanced     31.8 0.722
2 dplyr                              performance  32.6 0.477
3 duckplyr                           balanced     31.4 1.10
4 duckplyr                           performance  31.3 0.495
5 duckplyr with `as_duckplyr_tibble` balanced     36.0 0.517
6 duckplyr with `as_duckplyr_tibble` performance  33.6 0.303
```
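For completeness, a table like this can be computed from the raw timings with plain `dplyr`; the numbers below are invented placeholders just to show the shape of the computation, not my actual measurements.

```r
library(dplyr)

# Invented example timings (seconds), two runs per combination.
runs <- tribble(
  ~library,   ~power_mode, ~time,
  "dplyr",    "balanced",   31.2,
  "dplyr",    "balanced",   32.4,
  "duckplyr", "balanced",   30.9,
  "duckplyr", "balanced",   31.8
)

# Average and spread per library/power-mode combination.
runs_table <- runs |>
  group_by(library, power_mode) |>
  summarise(mean = mean(time), sd = sd(time), .groups = "keep")
```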
So at least for my (!!!) use case, the use of `duckplyr` instead of `dplyr` did not make any practical difference, and I can also leave my laptop’s performance mode alone. When it comes to optimizing performance, you can’t just buy a solution off the shelf; you always have to try and find the best solution for your specific problem.
Your mileage will vary, of course. The people who develop `duckplyr` are brilliant, and the fact that it does not work for me tells more about me and my work than it does about `duckplyr`.
The duckplyr demo dataset
As a case in point, the `duckplyr` demo repository contains a taxi data set. The ZIP file alone is a ~1.7 GB download. Unpacked, the files take up 2.4 GB. With about 21 million records (24 variables), this dataset is considerably larger than mine.
Here are the results from running `dplyr/run_all_queries.R` and `duckplyr/run_all_queries.R` on my ThinkPad P14s (performance mode in F40 KDE):
| Library  | q01   | q02   | q03   | q04    |
|----------|-------|-------|-------|--------|
| dplyr    | 3.4 s | 3.9 s | 9.1 s | 14.3 s |
| duckplyr | 4.3 s | 4.4 s | 9.4 s | 14.8 s |
I should add that execution times vary with each run, but the big picture stays the same.
Maybe I’m missing the point and it’s not about execution times, after all.
¯\_(ツ)_/¯