Tidyverse and Some SQLite

May 27, 2025

How to Navigate this Slide Show?

This ioslides presentation contains scrollable slides.
Which slides are scrollable, is indicated by a tag at the bottom of the corresponding slides stating:

[ Scroll down to continue ]

The following single character keyboard shortcuts enable alternate display modes of ioslides:
- f: enable fullscreen mode
- w: toggle widescreen mode
- o: enable overview mode
- h: enable code highlight mode
Pressing Esc exits all of these modes. Additional details can be found here.

Outline

Overview
Install
File Import and Export
Usage
Chaining (Pipes)
SQLite Databases
References

Overview

Modern object classes and methods for handling data.frame like structures are provided by the dplyr (tidyr) and data.table packages. A related example is Bioconductor’s DataTable object class. This tutorial provide a short introduction to the usage and functionalities of the dplyr and related packages.

Outline

Overview
Install
File Import and Export
Usage
Chaining (Pipes)
SQLite
References

Installation

The dplyr (tidyr) environment has evolved into an ecosystem of packages. To simplify package management, one can install and load the entire collection via the tidyverse package (“Learn the tidyverse,” n.d.). For more details on tidyverse see here.

install.packages("tidyverse")

Load tidyverse package environment

library(tidyverse)

## ── Attaching core tidyverse packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Outline

Overview
Install
File Import and Export
Usage
Chaining (Pipes)
SQLite Databases
References

Construct objects

Construct a `tibble`

as_tibble(iris) # coerce data.frame to tibble tbl

Reading and writing tabular files

While the base R read/write utilities can be used for data.frames, best time performance with the least amount of typing is achieved with the export/import functions from the readr package. For very large files the fread function from the data.table package achieves the best time performance.

Import with `readr`

Import functions provided by readr include:

read_csv(): comma separated (CSV) files
read_tsv(): tab separated files
read_delim(): general delimited files
read_fwf(): fixed width files
read_table(): tabular files where colums are separated by white-space.
read_log(): web log files

[ scroll down to continue ]

Create a sample tab delimited file for import

write_tsv(iris, "iris.txt") # Creates sample file

Import with read_tsv

iris_df <- read_tsv("iris.txt") # Import with read_tbv from readr package

## Rows: 150 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

iris_df

To import Google Sheets directly into R, see here.

Fast table import with `fread`

The fread function from the data.table package provides the best time performance for reading large tabular files into R.

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week, yday, year

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

iris_df <- as_tibble(fread("iris.txt")) # Import with fread and conversion to tibble
iris_df

Note: to ignore lines starting with comment signs, one can pass on to fread a shell command for preprocessing the file. The following example illustrates this option.

fread("grep -v '^#' iris.txt")

Export with `readr`

Export function provided by readr inlcude

write_delim(): general delimited files
write_csv(): comma separated (CSV) files
write_excel_csv(): excel style CSV files
write_tsv(): tab separated files

For instance, the write_tsv function writes a data.frame or tibble to a tab delimited file with much nicer default settings than the base R write.table function.

write_tsv(iris_df, "iris.txt")

Outline

Overview
Install
File Import and Export
Usage
Chaining (Pipes)
SQLite Databases
References

How to use `tibbles`?

Column and row binds

The equivalents to base R’s rbind and cbind are bind_rows and bind_cols, respectively.

bind_cols(iris_df, iris_df)[1:2,]

## New names:
## • `Sepal.Length` -> `Sepal.Length...1`
## • `Sepal.Width` -> `Sepal.Width...2`
## • `Petal.Length` -> `Petal.Length...3`
## • `Petal.Width` -> `Petal.Width...4`
## • `Species` -> `Species...5`
## • `Sepal.Length` -> `Sepal.Length...6`
## • `Sepal.Width` -> `Sepal.Width...7`
## • `Petal.Length` -> `Petal.Length...8`
## • `Petal.Width` -> `Petal.Width...9`
## • `Species` -> `Species...10`

[ scroll down to continue ]

bind_rows(iris_df, iris_df)[1:2,]

Extract column as vector

The subsetting operators [[ and $can be used to extract from a tibble single columns as vector.

iris_df[[5]][1:12]

##  [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

iris_df$Species[1:12]

##  [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

Important `dplyr` functions

filter() and slice()
arrange()
select() and rename()
distinct()
mutate() and transmute()
summarise()
sample_n() and sample_frac()

Slice and filter functions

Filter function

filter(iris_df, Sepal.Length > 7.5, Species=="virginica")

Base R code equivalent

iris_df[iris_df[, "Sepal.Length"] > 7.5 & iris_df[, "Species"]=="virginica", ]

Including boolean operators

filter(iris_df, Sepal.Length > 7.5 | Sepal.Length < 5.5, Species=="virginica")

Subset rows by position

dplyr approach

slice(iris_df, 1:2)

Base R code equivalent

iris_df[1:2,]

Subset rows by names

Since tibbles do not contain row names, row wise subsetting via the [,] operator cannot be used. However, the corresponding behavior can be achieved by passing to select a row position index obtained by basic R intersect utilities such as match.

Create a suitable test tibble

df1 <- bind_cols(data_frame(ids1=paste0("g", 1:10)), as_data_frame(matrix(1:40, 10, 4, dimnames=list(1:10, paste0("CA", 1:4)))))

## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a tibble, or `as.data.frame()` to convert to a data frame.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

df1

dplyr approach

slice(df1, match(c("g10", "g4", "g4"), ids1))

Base R equivalent

df1_old <- as.data.frame(df1)
rownames(df1_old) <- df1_old[,1]
df1_old[c("g10", "g4", "g4"),]

Sorting with `arrange`

Row-wise ordering based on specific columns

dplyr approach

arrange(iris_df, Species, Sepal.Length, Sepal.Width)

For ordering descendingly use desc() function

arrange(iris_df, desc(Species), Sepal.Length, Sepal.Width)

Base R code equivalent

iris_df[order(iris_df$Species, iris_df$Sepal.Length, iris_df$Sepal.Width), ]

iris_df[order(iris_df$Species, decreasing=TRUE), ]

Select columns with `select`

Select specific columns

select(iris_df, Species, Petal.Length, Sepal.Length)

Select range of columns by name

select(iris_df, Sepal.Length : Petal.Width)

Drop specific columns (here range)

select(iris_df, -(Sepal.Length : Petal.Width))

Renaming columns with `rename`

dplyr approach

rename(iris_df, new_col_name = Species)

Base R code approach

colnames(iris_df)[colnames(iris_df)=="Species"] <- "new_col_names"

Obtain unique rows with `distinct`

dplyr approach

distinct(iris_df, Species, .keep_all=TRUE)

Base R code approach

iris_df[!duplicated(iris_df$Species),]

Add columns

`mutate`

The mutate function allows to append columns to existing ones.

mutate(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)

`transmute`

The transmute function does the same as mutate but drops existing columns

transmute(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)

`bind_cols`

The bind_cols function is the equivalent of cbind in base R. To add rows, use the corresponding bind_rows function.

bind_cols(iris_df, iris_df)

## New names:
## • `Sepal.Length` -> `Sepal.Length...1`
## • `Sepal.Width` -> `Sepal.Width...2`
## • `Petal.Length` -> `Petal.Length...3`
## • `Petal.Width` -> `Petal.Width...4`
## • `Species` -> `Species...5`
## • `Sepal.Length` -> `Sepal.Length...6`
## • `Sepal.Width` -> `Sepal.Width...7`
## • `Petal.Length` -> `Petal.Length...8`
## • `Petal.Width` -> `Petal.Width...9`
## • `Species` -> `Species...10`

Summarize data

Summary calculation on single column

summarize(iris_df, mean(Petal.Length))

Summary calculation on many columns

summarize_all(iris_df[,1:4], mean)

Summarize by grouping column

summarize(group_by(iris_df, Species), mean(Petal.Length))

Aggregate summaries

summarize_all(group_by(iris_df, Species), mean)

Note: group_by does the looping for the user similar to aggregate or tapply.

Merging tibbles

The dplyr package provides several join functions for merging tibbles by a common key column similar to the merge function in base R. These *_join functions include:

inner_join(): returns join only for rows matching among both tibbles
full_join(): returns join for all (matching and non-matching) rows of two tibbles
left_join(): returns join for all rows in first tibble
right_join(): returns join for all rows in second tibble
anti_join(): returns for first tibble only those rows that have no match in the second one

Sample tibbles to illustrate *.join functions.

df1 <- bind_cols(data_frame(ids1=paste0("g", 1:10)), as_data_frame(matrix(1:40, 10, 4, dimnames=list(1:10, paste0("CA", 1:4)))))
df1

df2 <- bind_cols(data_frame(ids2=paste0("g", c(2,5,11,12))), as_data_frame(matrix(1:16, 4, 4, dimnames=list(1:4, paste0("CB", 1:4)))))
df2

Inner join

inner_join(df1, df2, by=c("ids1"="ids2"))

Left join

left_join(df1, df2, by=c("ids1"="ids2"))

Right join

right_join(df1, df2, by=c("ids1"="ids2"))

Full join

full_join(df1, df2, by=c("ids1"="ids2"))

Anti join

anti_join(df1, df2, by=c("ids1"="ids2"))

For additional join options users want to cosult the *_join help pages.

Outline

Overview
Install
File Import and Export
Usage
Chaining (Pipes)
SQLite Databases
References

Chaining

To simplify chaining of serveral operations, dplyr provides the %>% operator, where x %>% f(y) turns into f(x, y). This way one can pipe together multiple operations by writing them from left-to-right or top-to-bottom. This makes for easy to type and readable code. Since R-4.1.0, a native pipe |> operator is available that works largely the same as %>%.

[ scroll down to continue ]

Example 1

Series of data manipulations and export

read_tsv("iris.txt") %>% # Import with read_tbv from readr package
    as_tibble() %>% # Declare to use tibble
    select(Sepal.Length:Species) %>% # Select columns
    filter(Species=="setosa") %>% # Filter rows by some value
    arrange(Sepal.Length) %>% # Sort by some column
    mutate(Subtract=Petal.Length - Petal.Width) # Calculate and append

## Rows: 150 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

    # write_tsv("iris.txt") # Export to file, omitted here to show result

Example 2

Series of summary calculations for grouped data (group_by)

iris_df %>% # Declare tibble to use 
    group_by(Species) %>% # Group by species
    summarize(Mean_Sepal.Length=mean(Sepal.Length), 
              Max_Sepal.Length=max(Sepal.Length),
              Min_Sepal.Length=min(Sepal.Length),
              SD_Sepal.Length=sd(Sepal.Length),
              Total=n())

Example 3

Combining dplyr chaining with ggplot

iris_df %>% 
    group_by(Species) %>% 
    summarize_all(mean) %>% 
    reshape2::melt(id.vars=c("Species"), variable.name = "Samples", value.name="Values") %>%
    ggplot(aes(Samples, Values, fill = Species)) + 
           geom_bar(position="dodge", stat="identity")

Outline

Overview
Install
File Import and Export
Usage
Chaining (Pipes)
SQLite Databases
References

SQLite Databases

SQLite is a lightweight relational database solution. The RSQLite package provides an easy to use interface to create, manage and query SQLite databases directly from R. Basic instructions for using SQLite from the command-line are available here. A short introduction to RSQLite is available here.

Loading data into SQLite databases

The following loads two data.frames derived from the iris data set (here mydf1 and mydf2) into an SQLite database (here test.db).

library(RSQLite)
unlink("test.db") # Delete any existing test.db
mydb <- dbConnect(SQLite(), "test.db") # Creates database file test.db
mydf1 <- data.frame(ids=paste0("id", seq_along(iris[,1])), iris)
mydf2 <- mydf1[sample(seq_along(mydf1[,1]), 10),]
dbWriteTable(mydb, "mydf1", mydf1)
dbWriteTable(mydb, "mydf2", mydf2)

List names of tables in database

dbListTables(mydb)

## [1] "mydf1" "mydf2"

[ scroll down to continue ]

Import table into `data.frame`

dbGetQuery(mydb, 'SELECT * FROM mydf2')

Query database

dbGetQuery(mydb, 'SELECT * FROM mydf1 WHERE "Sepal.Length" < 4.6')

Join tables

The two tables can be joined on the shared ids column as follows.

dbGetQuery(mydb, 'SELECT * FROM mydf1, mydf2 WHERE mydf1.ids = mydf2.ids')

Session Info

sessionInfo()

## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux 11 (bullseye)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Los_Angeles
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RSQLite_2.3.9     data.table_1.16.4 lubridate_1.9.4   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2       readr_2.1.5       tidyr_1.3.1       tibble_3.2.1      ggplot2_3.5.1     tidyverse_2.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    stringi_1.8.4     hms_1.1.3         digest_0.6.37     magrittr_2.0.3    evaluate_1.0.1    grid_4.4.3        timechange_0.3.0  blob_1.2.4        fastmap_1.2.0    
## [13] plyr_1.8.9        jsonlite_1.8.9    DBI_1.2.3         fansi_1.0.6       scales_1.3.0      jquerylib_0.1.4   cli_3.6.3         rlang_1.1.4       crayon_1.5.3      bit64_4.5.2       munsell_0.5.1     withr_3.0.2      
## [25] cachem_1.1.0      yaml_2.3.10       tools_4.4.3       parallel_4.4.3    reshape2_1.4.4    tzdb_0.4.0        memoise_2.0.1     colorspace_2.1-1  vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   bit_4.5.0.1      
## [37] vroom_1.6.5       pkgconfig_2.0.3   pillar_1.9.0      bslib_0.8.0       gtable_0.3.6      glue_1.8.0        Rcpp_1.0.13-1     xfun_0.49         tidyselect_1.2.1  knitr_1.49        farver_2.1.2      htmltools_0.5.8.1
## [49] labeling_0.4.3    rmarkdown_2.29    compiler_4.4.3

Outline

Overview
Install
File Import and Export
Usage
Chaining (Pipes)
SQLite Databases
References

References

“Learn the tidyverse.” n.d. https://www.tidyverse.org/learn/. https://www.tidyverse.org/learn/.

How to Navigate this Slide Show?

Outline

Overview

Related documentation

Outline

Installation

Outline

Construct objects

Construct a tibble

Reading and writing tabular files

Import with readr

Fast table import with fread

Export with readr

Outline

How to use tibbles?

Column and row binds

Extract column as vector

Important dplyr functions

Slice and filter functions

Filter function

Base R code equivalent

Including boolean operators

Subset rows by position

Subset rows by names

Sorting with arrange

Select columns with select

Renaming columns with rename

Obtain unique rows with distinct

Add columns

mutate

transmute

bind_cols

Summarize data

Merging tibbles

Inner join

Left join

Right join

Full join

Anti join

Outline

Chaining

Example 1

Example 2

Example 3

Outline

SQLite Databases

Loading data into SQLite databases

List names of tables in database

Import table into data.frame

Query database

Join tables

Session Info

Outline

References

Construct a `tibble`

Import with `readr`

Fast table import with `fread`

Export with `readr`

How to use `tibbles`?

Important `dplyr` functions

Sorting with `arrange`

Select columns with `select`

Renaming columns with `rename`

Obtain unique rows with `distinct`

`mutate`

`transmute`

`bind_cols`

Import table into `data.frame`