May 28, 2024

How to Navigate this Slide Show?


  • This ioslides presentation contains scrollable slides.
  • Which slides are scrollable, is indicated by a tag at the bottom of the corresponding slides stating:

[ Scroll down to continue ]

  • The following single character keyboard shortcuts enable alternate display modes of ioslides:
    • f: enable fullscreen mode
    • w: toggle widescreen mode
    • o: enable overview mode
    • h: enable code highlight mode
  • Pressing Esc exits all of these modes. Additional details can be found here.

Outline

  • Overview
  • Install
  • File Import and Export
  • Usage
  • Chaining (Pipes)
  • SQLite Databases
  • References

Overview

Modern object classes and methods for handling data.frame like structures are provided by the dplyr (tidyr) and data.table packages. A related example is Bioconductor’s DataTable object class. This tutorial provide a short introduction to the usage and functionalities of the dplyr and related packages.

Related documentation

More detailed tutorials on this topic can be found here:

Outline

  • Overview
  • Install
  • File Import and Export
  • Usage
  • Chaining (Pipes)
  • SQLite
  • References

Installation

The dplyr (tidyr) environment has evolved into an ecosystem of packages. To simplify package management, one can install and load the entire collection via the tidyverse package (“Learn the tidyverse,” n.d.). For more details on tidyverse see here.

install.packages("tidyverse")

Load tidyverse package environment

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Outline

  • Overview
  • Install
  • File Import and Export
  • Usage
  • Chaining (Pipes)
  • SQLite Databases
  • References

Construct objects

Construct a tibble

as_tibble(iris) # coerce data.frame to tibble tbl

Reading and writing tabular files

While the base R read/write utilities can be used for data.frames, best time performance with the least amount of typing is achieved with the export/import functions from the readr package. For very large files the fread function from the data.table package achieves the best time performance.

Import with readr

Import functions provided by readr include:

  • read_csv(): comma separated (CSV) files
  • read_tsv(): tab separated files
  • read_delim(): general delimited files
  • read_fwf(): fixed width files
  • read_table(): tabular files where colums are separated by white-space.
  • read_log(): web log files

[ scroll down to continue ]





Create a sample tab delimited file for import

write_tsv(iris, "iris.txt") # Creates sample file

Import with read_tsv

iris_df <- read_tsv("iris.txt") # Import with read_tbv from readr package
## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
iris_df

To import Google Sheets directly into R, see here.

Fast table import with fread

The fread function from the data.table package provides the best time performance for reading large tabular files into R.

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week, yday, year
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
iris_df <- as_tibble(fread("iris.txt")) # Import with fread and conversion to tibble
iris_df

Note: to ignore lines starting with comment signs, one can pass on to fread a shell command for preprocessing the file. The following example illustrates this option.

fread("grep -v '^#' iris.txt") 

Export with readr

Export function provided by readr inlcude

  • write_delim(): general delimited files
  • write_csv(): comma separated (CSV) files
  • write_excel_csv(): excel style CSV files
  • write_tsv(): tab separated files

For instance, the write_tsv function writes a data.frame or tibble to a tab delimited file with much nicer default settings than the base R write.table function.

write_tsv(iris_df, "iris.txt")

Outline

  • Overview
  • Install
  • File Import and Export
  • Usage
  • Chaining (Pipes)
  • SQLite Databases
  • References

How to use tibbles?

Column and row binds

The equivalents to base R’s rbind and cbind are bind_rows and bind_cols, respectively.

bind_cols(iris_df, iris_df)[1:2,]
## New names:
## • `Sepal.Length` -> `Sepal.Length...1`
## • `Sepal.Width` -> `Sepal.Width...2`
## • `Petal.Length` -> `Petal.Length...3`
## • `Petal.Width` -> `Petal.Width...4`
## • `Species` -> `Species...5`
## • `Sepal.Length` -> `Sepal.Length...6`
## • `Sepal.Width` -> `Sepal.Width...7`
## • `Petal.Length` -> `Petal.Length...8`
## • `Petal.Width` -> `Petal.Width...9`
## • `Species` -> `Species...10`

[ scroll down to continue ]





bind_rows(iris_df, iris_df)[1:2,]

Extract column as vector

The subsetting operators [[ and $can be used to extract from a tibble single columns as vector.

iris_df[[5]][1:12]
##  [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
iris_df$Species[1:12]
##  [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

Important dplyr functions

  1. filter() and slice()
  2. arrange()
  3. select() and rename()
  4. distinct()
  5. mutate() and transmute()
  6. summarise()
  7. sample_n() and sample_frac()

Slice and filter functions

Filter function

filter(iris_df, Sepal.Length > 7.5, Species=="virginica")

Base R code equivalent

iris_df[iris_df[, "Sepal.Length"] > 7.5 & iris_df[, "Species"]=="virginica", ]

Including boolean operators

filter(iris_df, Sepal.Length > 7.5 | Sepal.Length < 5.5, Species=="virginica")

Subset rows by position

dplyr approach

slice(iris_df, 1:2)

Base R code equivalent

iris_df[1:2,]

Subset rows by names

Since tibbles do not contain row names, row wise subsetting via the [,] operator cannot be used. However, the corresponding behavior can be achieved by passing to select a row position index obtained by basic R intersect utilities such as match.

Create a suitable test tibble

df1 <- bind_cols(data_frame(ids1=paste0("g", 1:10)), as_data_frame(matrix(1:40, 10, 4, dimnames=list(1:10, paste0("CA", 1:4)))))
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a tibble, or `as.data.frame()` to convert to a data frame.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
df1

dplyr approach

slice(df1, match(c("g10", "g4", "g4"), ids1))

Base R equivalent

df1_old <- as.data.frame(df1)
rownames(df1_old) <- df1_old[,1]
df1_old[c("g10", "g4", "g4"),]

Sorting with arrange

Row-wise ordering based on specific columns

dplyr approach

arrange(iris_df, Species, Sepal.Length, Sepal.Width)

For ordering descendingly use desc() function

arrange(iris_df, desc(Species), Sepal.Length, Sepal.Width)

Base R code equivalent

iris_df[order(iris_df$Species, iris_df$Sepal.Length, iris_df$Sepal.Width), ]
iris_df[order(iris_df$Species, decreasing=TRUE), ] 

Select columns with select

Select specific columns

select(iris_df, Species, Petal.Length, Sepal.Length)

Select range of columns by name

select(iris_df, Sepal.Length : Petal.Width)

Drop specific columns (here range)

select(iris_df, -(Sepal.Length : Petal.Width))

Renaming columns with rename

dplyr approach

rename(iris_df, new_col_name = Species)

Base R code approach

colnames(iris_df)[colnames(iris_df)=="Species"] <- "new_col_names"

Obtain unique rows with distinct

dplyr approach

distinct(iris_df, Species, .keep_all=TRUE)

Base R code approach

iris_df[!duplicated(iris_df$Species),]

Add columns

mutate

The mutate function allows to append columns to existing ones.

mutate(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)

transmute

The transmute function does the same as mutate but drops existing columns

transmute(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)

bind_cols

The bind_cols function is the equivalent of cbind in base R. To add rows, use the corresponding bind_rows function.

bind_cols(iris_df, iris_df)
## New names:
## • `Sepal.Length` -> `Sepal.Length...1`
## • `Sepal.Width` -> `Sepal.Width...2`
## • `Petal.Length` -> `Petal.Length...3`
## • `Petal.Width` -> `Petal.Width...4`
## • `Species` -> `Species...5`
## • `Sepal.Length` -> `Sepal.Length...6`
## • `Sepal.Width` -> `Sepal.Width...7`
## • `Petal.Length` -> `Petal.Length...8`
## • `Petal.Width` -> `Petal.Width...9`
## • `Species` -> `Species...10`

Summarize data

Summary calculation on single column

summarize(iris_df, mean(Petal.Length))

Summary calculation on many columns

summarize_all(iris_df[,1:4], mean)

Summarize by grouping column

summarize(group_by(iris_df, Species), mean(Petal.Length))

Aggregate summaries

summarize_all(group_by(iris_df, Species), mean) 

Note: group_by does the looping for the user similar to aggregate or tapply.

Merging tibbles

The dplyr package provides several join functions for merging tibbles by a common key column similar to the merge function in base R. These *_join functions include:

  • inner_join(): returns join only for rows matching among both tibbles
  • full_join(): returns join for all (matching and non-matching) rows of two tibbles
  • left_join(): returns join for all rows in first tibble
  • right_join(): returns join for all rows in second tibble
  • anti_join(): returns for first tibble only those rows that have no match in the second one

Sample tibbles to illustrate *.join functions.

df1 <- bind_cols(data_frame(ids1=paste0("g", 1:10)), as_data_frame(matrix(1:40, 10, 4, dimnames=list(1:10, paste0("CA", 1:4)))))
df1
df2 <- bind_cols(data_frame(ids2=paste0("g", c(2,5,11,12))), as_data_frame(matrix(1:16, 4, 4, dimnames=list(1:4, paste0("CB", 1:4)))))
df2

Inner join

inner_join(df1, df2, by=c("ids1"="ids2"))

Left join

left_join(df1, df2, by=c("ids1"="ids2"))

Right join

right_join(df1, df2, by=c("ids1"="ids2"))

Full join

full_join(df1, df2, by=c("ids1"="ids2"))

Anti join

anti_join(df1, df2, by=c("ids1"="ids2"))

For additional join options users want to cosult the *_join help pages.

Outline

  • Overview
  • Install
  • File Import and Export
  • Usage
  • Chaining (Pipes)
  • SQLite Databases
  • References

Chaining





To simplify chaining of serveral operations, dplyr provides the %>% operator, where x %>% f(y) turns into f(x, y). This way one can pipe together multiple operations by writing them from left-to-right or top-to-bottom. This makes for easy to type and readable code. Since R-4.1.0, a native pipe |> operator is available that works largely the same as %>%.









[ scroll down to continue ]





Example 1

Series of data manipulations and export

read_tsv("iris.txt") %>% # Import with read_tbv from readr package
    as_tibble() %>% # Declare to use tibble
    select(Sepal.Length:Species) %>% # Select columns
    filter(Species=="setosa") %>% # Filter rows by some value
    arrange(Sepal.Length) %>% # Sort by some column
    mutate(Subtract=Petal.Length - Petal.Width) # Calculate and append
## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # write_tsv("iris.txt") # Export to file, omitted here to show result 

Example 2

Series of summary calculations for grouped data (group_by)

iris_df %>% # Declare tibble to use 
    group_by(Species) %>% # Group by species
    summarize(Mean_Sepal.Length=mean(Sepal.Length), 
              Max_Sepal.Length=max(Sepal.Length),
              Min_Sepal.Length=min(Sepal.Length),
              SD_Sepal.Length=sd(Sepal.Length),
              Total=n()) 

Example 3

Combining dplyr chaining with ggplot

iris_df %>% 
    group_by(Species) %>% 
    summarize_all(mean) %>% 
    reshape2::melt(id.vars=c("Species"), variable.name = "Samples", value.name="Values") %>%
    ggplot(aes(Samples, Values, fill = Species)) + 
           geom_bar(position="dodge", stat="identity")

Outline

  • Overview
  • Install
  • File Import and Export
  • Usage
  • Chaining (Pipes)
  • SQLite Databases
  • References

SQLite Databases





SQLite is a lightweight relational database solution. The RSQLite package provides an easy to use interface to create, manage and query SQLite databases directly from R. Basic instructions for using SQLite from the command-line are available here. A short introduction to RSQLite is available here.

Loading data into SQLite databases

The following loads two data.frames derived from the iris data set (here mydf1 and mydf2) into an SQLite database (here test.db).

library(RSQLite)
unlink("test.db") # Delete any existing test.db
mydb <- dbConnect(SQLite(), "test.db") # Creates database file test.db
mydf1 <- data.frame(ids=paste0("id", seq_along(iris[,1])), iris)
mydf2 <- mydf1[sample(seq_along(mydf1[,1]), 10),]
dbWriteTable(mydb, "mydf1", mydf1)
dbWriteTable(mydb, "mydf2", mydf2)

List names of tables in database

dbListTables(mydb)
## [1] "mydf1" "mydf2"

[ scroll down to continue ]





Import table into data.frame

dbGetQuery(mydb, 'SELECT * FROM mydf2')

Query database

dbGetQuery(mydb, 'SELECT * FROM mydf1 WHERE "Sepal.Length" < 4.6')

Join tables

The two tables can be joined on the shared ids column as follows.

dbGetQuery(mydb, 'SELECT * FROM mydf1, mydf2 WHERE mydf1.ids = mydf2.ids')

Session Info

sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux 11 (bullseye)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Los_Angeles
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RSQLite_2.3.6     data.table_1.15.4 lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2       readr_2.1.5       tidyr_1.3.1       tibble_3.2.1      ggplot2_3.5.1     tidyverse_2.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    stringi_1.8.3     hms_1.1.3         digest_0.6.35     magrittr_2.0.3    evaluate_0.23     grid_4.4.0        timechange_0.3.0  blob_1.2.4        fastmap_1.1.1    
## [13] plyr_1.8.9        jsonlite_1.8.8    DBI_1.2.2         fansi_1.0.6       scales_1.3.0      jquerylib_0.1.4   cli_3.6.2         rlang_1.1.3       crayon_1.5.2      bit64_4.0.5       munsell_0.5.1     withr_3.0.0      
## [25] cachem_1.0.8      yaml_2.3.8        tools_4.4.0       parallel_4.4.0    reshape2_1.4.4    tzdb_0.4.0        memoise_2.0.1     colorspace_2.1-0  vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   bit_4.0.5        
## [37] vroom_1.6.5       pkgconfig_2.0.3   pillar_1.9.0      bslib_0.7.0       gtable_0.3.5      glue_1.7.0        Rcpp_1.0.12       highr_0.10        xfun_0.43         tidyselect_1.2.1  knitr_1.46        farver_2.1.1     
## [49] htmltools_0.5.8.1 labeling_0.4.3    rmarkdown_2.26    compiler_4.4.0

Outline

  • Overview
  • Install
  • File Import and Export
  • Usage
  • Chaining (Pipes)
  • SQLite Databases
  • References

References