Performance experiments with duckplyr in R


I recently became aware of the duckplyr library for R, which takes the place of the tidyverse’s dplyr library but uses the DuckDB database under the hood. Without really knowing how dplyr works internally or whether the use of DuckDB would improve my workflow at all, I decided to perform an experiment. I am currently analyzing two datasets, one with ~80k records and ~70 variables and one with ~60k records and ~100 variables. Both datasets are wrangled with Tidyverse-foo in multiple ways and finally combined. The wrangling involves things like rowwise() and c_across(), which I know from experience are quite ‘expensive’ operations.
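To illustrate the kind of operation I mean, here is a toy sketch (the column names and values are made up; my real wrangling is more involved):

```r
library(dplyr)

# Hypothetical toy data; columns only illustrate the shape of the real thing.
scores <- tibble(
  id      = 1:3,
  score_a = c(1, 2, 3),
  score_b = c(4, 5, 6),
  score_c = c(7, 8, 9)
)

# rowwise() + c_across() computes a per-row aggregate across columns.
# Because it evaluates R code for every single row, it gets slow on
# large data frames.
scores |>
  rowwise() |>
  mutate(score_total = sum(c_across(starts_with("score")))) |>
  ungroup()
```

For plain row sums, a vectorised alternative like `rowSums(across(starts_with("score")))` is usually much faster, but my actual computations do not all reduce to that.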

In order to get the execution times of my code, I did this repeatedly:

  1. Restart R (by pressing Ctrl+Shift+F10).

  2. Run

    system.time(rmarkdown::render("my_file.Rmd"))
    
  3. Record the user time and the system time elapsed.

  4. Repeat twice.
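Written out, a single run looks roughly like this (a sketch; `my_file.Rmd` stands in for my actual analysis document):

```r
# Executed in a fresh R session (step 1 above); "my_file.Rmd" is a
# stand-in for my actual analysis document.
timing <- system.time(rmarkdown::render("my_file.Rmd", quiet = TRUE))

# system.time() returns a proc_time vector; these two components are
# what I recorded in step 3.
timing[["user.self"]]  # user time
timing[["sys.self"]]   # system time
```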

I did this with both the “balanced power mode” and the “performance mode” on my laptop. During execution of the code, I left the laptop alone in order not to interfere with the timing.

This is the result of my benchmarking (the averages are tabulated further down). The times are user times; I left out the system times, which were in the range of 2-3 seconds.

Not really mind-boggling, right? It occurred to me that I had better double-check that duckplyr was really being used. Indeed, it was not:

> class(clinical_data)
[1] "tbl_df"      "tbl"         "data.frame"

clinical_data was missing the `duckplyr_df` class. How come?

I import the raw data from Excel files (don’t ask…) into tibbles, and evidently, this prevents duckplyr from seeing the data frames. So I piped the data frames through as_duckplyr_tibble() explicitly, and this got me the right classes:

> class(clinical_data)
[1] "duckplyr_df" "tbl_df"      "tbl"         "data.frame"
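In code, that explicit conversion looks roughly like this (the file name is hypothetical, and I am assuming readxl for the Excel import):

```r
library(duckplyr)
library(readxl)

# read_excel() returns a plain tibble, which duckplyr does not touch.
# Piping through as_duckplyr_tibble() adds the "duckplyr_df" class, so
# subsequent dplyr verbs are routed through DuckDB.
# "clinical_data.xlsx" is a hypothetical stand-in for my real files.
clinical_data <- read_excel("clinical_data.xlsx") |>
  as_duckplyr_tibble()
```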

However, this did not really speed up the execution either.

I looked around my RMarkdown chunks and their outputs, but I did not find any warning that duckplyr had to fall back to dplyr’s methods; such a fallback could have explained the absence of a noticeable difference.
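duckplyr can also be asked to announce fallbacks explicitly. Had I set this option (the name is taken from the duckplyr README at the time of writing, so treat it as version-dependent), silent fallbacks would have shown up on the console:

```r
library(duckplyr)

# Print a message, including the reason, whenever duckplyr has to fall
# back to dplyr (unsupported verb, unsupported column type, ...).
# Option name per the duckplyr README; it may differ in newer versions.
options(duckplyr.fallback_info = TRUE)
```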

Here are the average times (in seconds) for the benchmarking runs.

> runs_table
# A tibble: 6 × 4
# Groups:   library, power_mode [6]
  library                            power_mode   mean    sd
  <chr>                              <chr>       <dbl> <dbl>
1 dplyr                              balanced     31.8 0.722
2 dplyr                              performance  32.6 0.477
3 duckplyr                           balanced     31.4 1.10 
4 duckplyr                           performance  31.3 0.495
5 duckplyr with `as_duckplyr_tibble` balanced     36.0 0.517
6 duckplyr with `as_duckplyr_tibble` performance  33.6 0.303

So at least for my (!!!) use case, using duckplyr instead of dplyr did not make any practical difference, and I can also leave my laptop’s performance mode alone. When it comes to optimizing performance, you can’t just buy a solution off the shelf; you always have to experiment and find the best solution for your specific problem.

Your mileage will vary, of course. The people who develop duckplyr are brilliant, and the fact that it does not work for me says more about me and my work than it does about duckplyr.

The duckplyr demo dataset

As a case in point, the duckplyr demo repository contains a taxi data set. The ZIP file alone is a ~1.7 GB download. Unpacked, the files take up 2.4 GB. With about 21 million records (24 variables), this dataset is considerably larger than mine.

Here are the results from running dplyr/run_all_queries.R and duckplyr/run_all_queries.R on my Thinkpad P14s (performance mode in F40 KDE):

Library    q01     q02     q03     q04
dplyr      3.4 s   3.9 s   9.1 s   14.3 s
duckplyr   4.3 s   4.4 s   9.4 s   14.8 s

I should add that execution times vary with each run, but the big picture stays the same.

Maybe I’m missing the point and it’s not about execution times, after all.

¯\_(ツ)_/¯