doltr is a package to interface with Dolt, an SQL database with git-like versioning.

Installation

You will need the dolt command-line utility installed on your computer to use doltr. Installation instructions for Linux, macOS, and Windows can be found here.

Install the R package with

remotes::install_github("ecohealthalliance/doltr")

Usage

doltr package provides two DBI-compliant drivers to connect to a dolt database dolt_remote() connects to a dolt server via TCP, and is a thin wrapper around the RMariaDB package because Dolt shares a communication protocol with MySQL and MariaDB. dolt_local() connects to a Dolt database directory locally on-disk. Behind the scenes dolt_local() launches and manages a background server process, which can also be done manually with dolt_server(). Both can be used to connect to a database as you would with other DBI packages:

library(doltr)
remote_conn <- DBI::dbConnect(dolt_remote(), dname = "dbname", username = "user", ...)
local_conn <- DBI::dbConnect(dolt_local(), dir = "/path/to/my/dolt/db/directory")

Since Dolt has git-like versioning concepts, doltr’s API design includes both components drawn from DBI and also from git interfaces like gert and git2r (as well as Dolt’s command-line interface).

doltr has the concept of a “default database” for a project. When working with git (or git2r or gert), commands apply to the current working directory by default. Similarly, with doltr, many commands use the default database. The default database is set with the environment directory DOLT_DIR, which is doltdb by default.
For a project you might set DOLT_DIR in a project-level .Renviron or .env file

To explore doltr’s capabilities, let’s pull an existing database. dolt_clone(), like git clone clones a database to a local directory, using DoltHub as the default remote source (though dolt Database remotes can be hosted elsewhere). We’ll clone doltr/nycflights, which contains a subset of the data from the {nycflights13} package.

library(doltr)
dolt_clone("doltr/nycflights")

This creates an nycflights directory. Let’s set it as our default database for this session:

Sys.setenv(DOLT_DIR="nycflights")

You can use the dolt() function to connect to the database. dolt() is a shortcut for dbConnect(dolt_local/dolt_remote(), ...). It also caches the database connection, so it can be called repeatedly in place of a connection variable. dolt() is also the default argument for a database connection in many functions.

Running dolt() prints a summary of the database state:

dolt()
#> 
#> ── <DoltLocalConnection> nycflights ────────────────────────────────────────────
#> • Serving /home/runner/work/doltr/doltr/vignettes/nycflights, PID 14662
#> • Connected at: root@127.0.0.1:3306
#> • On branch main
#> • Last commit by noamross 2 hours ago: Update README
#> • Working database clean

You can use dolt() with DBI or dbplyr functions to read from or write to the database:

dbListTables(dolt())
#> [1] "airlines" "airports" "flights"  "planes"   "weather"
dbReadTable(dolt(), "airlines")
#>                                                carrier                   name
#> 1                                                   9E      Endeavor Air Inc.
#> 2                                                   9E      Endeavor Air Inc.
#> 3                                                   9E      Endeavor Air Inc.
#> 4                                                   9E      Endeavor Air Inc.
#> 5                                                   MQ              Envoy Air
#> 6                                                   MQ              Envoy Air
#> 7                                                   MQ              Envoy Air
#> 8                                                   MQ              Envoy Air
#> 9                                                   B6        JetBlue Airways
#> 10                                                  B6        JetBlue Airways
#> 11                                                  B6        JetBlue Airways
#> 12                                                  B6        JetBlue Airways
#> 13 virus that MVP Safe Mode VC WeChat freemium DPI zip     we https in runway
#> 14                                                  WN Southwest Airlines Co.
#> 15                                                  WN Southwest Airlines Co.
#> 16                                                  WN Southwest Airlines Co.
#> 17                                                  WN Southwest Airlines Co.
#> 18                                                  AA American Airlines Inc.
#> 19                                                  AA American Airlines Inc.
#> 20                                                  AA American Airlines Inc.
#> 21                                                  AA American Airlines Inc.
#> 22                                                  UA  United Air Lines Inc.
#> 23                                                  UA  United Air Lines Inc.
#> 24                                                  UA  United Air Lines Inc.
#> 25                                                  UA  United Air Lines Inc.
#> 26                                                Test                   test
#> 27                                                Test                   test
tbl(dolt(), "flights") %>% 
  filter(origin == "EWR", dest == "MDW") %>%
  head() %>% 
  collect()
#> # A tibble: 6 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     2     9     1754           1750         4     1913           1920
#> 2  2013     2    11      605            610        -5      729            735
#> 3  2013     2    23      827            830        -3      954            955
#> 4  2013     2     6     1946           1940         6     2100           2100
#> 5  2013     2     9       NA            850        NA       NA           1025
#> 6  2013     2    13     1549           1550        -1     1710           1710
#> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dbl>
dbWriteTable(dolt(), "mtcars", mtcars)

With the last command, we changed the database by adding a table. This is reflected in the change to the database working state when we print dolt()

dolt()
#> 
#> ── <DoltLocalConnection> nycflights ────────────────────────────────────────────
#> • Serving /home/runner/work/doltr/doltr/vignettes/nycflights, PID 14662
#> • Connected at: root@127.0.0.1:3306
#> • On branch main
#> • Last commit by noamross 2 hours ago: Update README
#> • Working: mtcars (new)

The summary no longer says “Working database clean” but shows that the working state of the database now includes a new table called mtcars. As with a new file in a git repository, we can stage this table for comitting, with dolt_add(). Rather than printing the whole database summary, we can get just the last bullet with dolt_status()

dolt_add("mtcars")
dolt_status()
#> Staged: mtcars (new)

dolt_status() pretty-prints but actually yields a table of working or stages changes to the database:

as.data.frame(dolt_status())
#>   table_name staged    status
#> 1     mtcars      1 new table

With the table staged, we can commit it to the database and provide a message:

dolt_commit(message = "Add mtcars table")
dolt_status()
#> Working database clean
dolt_last_commit()
#> Last commit by P.D. En'ko 0 secs ago: Add mtcars table

Exploring Dolt history

You can view the commit history of the database with dolt_log(), which collects the Dolt system table dolt_log:

dolt_log()
#> # A tibble: 12 × 5
#>    commit_hash                      committer  email date                message
#>    <chr>                            <chr>      <chr> <dttm>              <chr>  
#>  1 dhdlfbdqin235i3hostit72rqrlv6av0 P.D. En'ko pd@e… 2022-05-19 22:16:28 "Add m…
#>  2 42m1pf2kjnf5j6fs9nopucm5mdkol3q8 noamross   ross… 2022-05-19 20:40:24 "Updat…
#>  3 88kolt95duve79hm59s4kvsntrak03dn noamross   ross… 2022-05-13 17:44:37 "Updat…
#>  4 766q8e7hfe08qn46h1ksuidr5363gq9s noamross   ross… 2022-05-13 17:31:43 "Remov…
#>  5 d51247codqq86uccqo375punlnofkgp6 noamross   ross… 2022-04-05 00:51:04 "test3"
#>  6 k4kdaldectieuudbnh09p0qgbp3rjr3j noamross   ross… 2022-04-05 00:44:32 "taggi…
#>  7 arpqa7d7hb33fjvc21k7b0fkfu4b1pfg noamross   ross… 2022-04-05 00:33:52 "Updat…
#>  8 1qrt92dpemtnmvvhc2v2s7dl0pd5g1t4 noamross   ross… 2022-04-05 00:12:29 "Updat…
#>  9 9l309ujv7i1hks67lklnds6j5e8q8vrm noamross   ross… 2021-11-14 17:52:38 "Add m…
#> 10 0ofs6si34lge6e4ke9pjr9nudajsn3fn noamross   ross… 2021-11-13 17:52:00 "\"add…
#> 11 iauk9n1in8508djitue1869stutn04ke noamross   noam… 2021-11-09 16:14:29 "Add n…
#> 12 raa6u0g1bk1pi1p200f6pq7tpoobn6cl noamross   noam… 2021-11-09 16:13:05 "Initi…

The Connection Pane

For RStudio users, doltr provides a connection pane with which you can explore the database.

doltr’s connection pane shows information about the versioning state of your database in addition to your tables, the Dolt system tables and the database information schema.

Created by EcoHealth Alliance