doltr is a package to interface with Dolt, an SQL database with git-like versioning.
You will need the dolt command-line utility installed on your computer to use doltr. Installation instructions for Linux, macOS, and Windows can be found here.
Install the R package with
remotes::install_github("ecohealthalliance/doltr")
doltr package provides two DBI-compliant drivers to connect to a dolt database dolt_remote()
connects to a dolt server via TCP, and is a thin wrapper around the RMariaDB package because Dolt shares a communication protocol with MySQL and MariaDB. dolt_local()
connects to a Dolt database directory locally on-disk. Behind the scenes dolt_local()
launches and manages a background server process, which can also be done manually with dolt_server()
. Both can be used to connect to a database as you would with other DBI packages:
library(doltr)
remote_conn <- DBI::dbConnect(dolt_remote(), dname = "dbname", username = "user", ...)
local_conn <- DBI::dbConnect(dolt_local(), dir = "/path/to/my/dolt/db/directory")
Since Dolt has git-like versioning concepts, doltr’s API design includes both components drawn from DBI and also from git interfaces like gert and git2r (as well as Dolt’s command-line interface).
doltr has the concept of a “default database” for a project. When working with git (or git2r or gert), commands apply to the current working directory by default. Similarly, with doltr, many commands use the default database. The default database is set with the environment directory DOLT_DIR
, which is doltdb
by default.
For a project you might set DOLT_DIR
in a project-level .Renviron
or .env
file
To explore doltr’s capabilities, let’s pull an existing database. dolt_clone()
, like git clone
clones a database to a local directory, using DoltHub
as the default remote source (though dolt Database remotes can be hosted elsewhere). We’ll clone doltr/nycflights
, which contains a subset of the data from the {nycflights13
} package.
library(doltr)
dolt_clone("doltr/nycflights")
This creates an nycflights
directory. Let’s set it as our default database for this session:
Sys.setenv(DOLT_DIR="nycflights")
You can use the dolt()
function to connect to the database. dolt()
is a shortcut for dbConnect(dolt_local/dolt_remote(), ...)
. It also caches the database connection, so it can be called repeatedly in place of a connection variable. dolt()
is also the default argument for a database connection in many functions.
Running dolt()
prints a summary of the database state:
dolt()
#>
#> ── <DoltLocalConnection> nycflights ────────────────────────────────────────────
#> • Serving /home/runner/work/doltr/doltr/vignettes/nycflights, PID 14662
#> • Connected at: root@127.0.0.1:3306
#> • On branch main
#> • Last commit by noamross 2 hours ago: Update README
#> • Working database clean
You can use dolt()
with DBI or dbplyr functions to read from or write to the database:
dbListTables(dolt())
#> [1] "airlines" "airports" "flights" "planes" "weather"
dbReadTable(dolt(), "airlines")
#> carrier name
#> 1 9E Endeavor Air Inc.
#> 2 9E Endeavor Air Inc.
#> 3 9E Endeavor Air Inc.
#> 4 9E Endeavor Air Inc.
#> 5 MQ Envoy Air
#> 6 MQ Envoy Air
#> 7 MQ Envoy Air
#> 8 MQ Envoy Air
#> 9 B6 JetBlue Airways
#> 10 B6 JetBlue Airways
#> 11 B6 JetBlue Airways
#> 12 B6 JetBlue Airways
#> 13 virus that MVP Safe Mode VC WeChat freemium DPI zip we https in runway
#> 14 WN Southwest Airlines Co.
#> 15 WN Southwest Airlines Co.
#> 16 WN Southwest Airlines Co.
#> 17 WN Southwest Airlines Co.
#> 18 AA American Airlines Inc.
#> 19 AA American Airlines Inc.
#> 20 AA American Airlines Inc.
#> 21 AA American Airlines Inc.
#> 22 UA United Air Lines Inc.
#> 23 UA United Air Lines Inc.
#> 24 UA United Air Lines Inc.
#> 25 UA United Air Lines Inc.
#> 26 Test test
#> 27 Test test
tbl(dolt(), "flights") %>%
filter(origin == "EWR", dest == "MDW") %>%
head() %>%
collect()
#> # A tibble: 6 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 2 9 1754 1750 4 1913 1920
#> 2 2013 2 11 605 610 -5 729 735
#> 3 2013 2 23 827 830 -3 954 955
#> 4 2013 2 6 1946 1940 6 2100 2100
#> 5 2013 2 9 NA 850 NA NA 1025
#> 6 2013 2 13 1549 1550 -1 1710 1710
#> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dbl>
dbWriteTable(dolt(), "mtcars", mtcars)
With the last command, we changed the database by adding a table. This is reflected in the change to the database working state when we print dolt()
dolt()
#>
#> ── <DoltLocalConnection> nycflights ────────────────────────────────────────────
#> • Serving /home/runner/work/doltr/doltr/vignettes/nycflights, PID 14662
#> • Connected at: root@127.0.0.1:3306
#> • On branch main
#> • Last commit by noamross 2 hours ago: Update README
#> • Working: mtcars (new)
The summary no longer says “Working database clean” but shows that the working state of the database now includes a new table called mtcars
. As with a new file in a git repository, we can stage this table for comitting, with dolt_add()
. Rather than printing the whole database summary, we can get just the last bullet with dolt_status()
dolt_add("mtcars")
dolt_status()
#> Staged: mtcars (new)
dolt_status()
pretty-prints but actually yields a table of working or stages changes to the database:
as.data.frame(dolt_status())
#> table_name staged status
#> 1 mtcars 1 new table
With the table staged, we can commit it to the database and provide a message:
dolt_commit(message = "Add mtcars table")
dolt_status()
#> Working database clean
dolt_last_commit()
#> Last commit by P.D. En'ko 0 secs ago: Add mtcars table
You can view the commit history of the database with dolt_log()
, which collects the Dolt system table dolt_log
:
dolt_log()
#> # A tibble: 12 × 5
#> commit_hash committer email date message
#> <chr> <chr> <chr> <dttm> <chr>
#> 1 dhdlfbdqin235i3hostit72rqrlv6av0 P.D. En'ko pd@e… 2022-05-19 22:16:28 "Add m…
#> 2 42m1pf2kjnf5j6fs9nopucm5mdkol3q8 noamross ross… 2022-05-19 20:40:24 "Updat…
#> 3 88kolt95duve79hm59s4kvsntrak03dn noamross ross… 2022-05-13 17:44:37 "Updat…
#> 4 766q8e7hfe08qn46h1ksuidr5363gq9s noamross ross… 2022-05-13 17:31:43 "Remov…
#> 5 d51247codqq86uccqo375punlnofkgp6 noamross ross… 2022-04-05 00:51:04 "test3"
#> 6 k4kdaldectieuudbnh09p0qgbp3rjr3j noamross ross… 2022-04-05 00:44:32 "taggi…
#> 7 arpqa7d7hb33fjvc21k7b0fkfu4b1pfg noamross ross… 2022-04-05 00:33:52 "Updat…
#> 8 1qrt92dpemtnmvvhc2v2s7dl0pd5g1t4 noamross ross… 2022-04-05 00:12:29 "Updat…
#> 9 9l309ujv7i1hks67lklnds6j5e8q8vrm noamross ross… 2021-11-14 17:52:38 "Add m…
#> 10 0ofs6si34lge6e4ke9pjr9nudajsn3fn noamross ross… 2021-11-13 17:52:00 "\"add…
#> 11 iauk9n1in8508djitue1869stutn04ke noamross noam… 2021-11-09 16:14:29 "Add n…
#> 12 raa6u0g1bk1pi1p200f6pq7tpoobn6cl noamross noam… 2021-11-09 16:13:05 "Initi…
For RStudio users, doltr provides a connection pane with which you can explore the database.
doltr’s connection pane shows information about the versioning state of your database in addition to your tables, the Dolt system tables and the database information schema.