Programming in R

GEN242: Data Analysis in Genome Biology

Thomas Girke

2026-07-08

Overview

One of the main attractions of R is how easy it is to write custom functions and programs — even for users with no prior programming experience. Once the basic control structures are understood, R becomes a powerful environment for complex custom analyses of almost any type of data.

Topics covered in this tutorial:

Why programming in R?
R scripts — structure and execution
Control structures: if, ifelse
Loops: for, while, apply family
Speed performance of loops
Writing custom functions
Useful utilities: regex, string operations, debugging
Executing R scripts from console and command-line
Programming exercises (HW04)

Note

Prerequisite: The R Basics tutorial provides the foundational R knowledge assumed here.

Why Programming in R?

Powerful statistical environment and programming language
Facilitates reproducible research — analyses are scripts, not clicks
Efficient data structures make programming very easy
Easy to implement custom functions for any analysis
Powerful, publication-quality graphics
Access to a rapidly growing ecosystem of packages
Widely used language in bioinformatics
Standard for data mining and biostatistical analysis
Free, open-source, available for all operating systems

R Scripts

An R script is a plain text file (.R or .Rmd/.qmd) containing R code and comments. It is the primary way to write reproducible analyses.

Structure of a well-organized R script

#!/usr/bin/env Rscript
# -----------------------------------------------
# Script: my_analysis.R
# Author: Thomas Girke
# Description: Example analysis workflow
# -----------------------------------------------

## Section 1 — Load libraries
library(ggplot2)
library(dplyr)

## Section 2 — Import data
myDF <- read.delim("mydata.txt", sep="\t")

## Section 3 — Analysis
result <- myDF |> filter(value > 10) |> summarize(mean=mean(value))

## Section 4 — Export
write.table(result, "result.txt", sep="\t", quote=FALSE)

Style guidelines

Use # for comments — explain why, not just what
Organize code into labeled sections with ## Section name
Keep functions in a separate file and load with source()
Follow the Tidyverse style guide
Use the formatR package to auto-format scripts
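
As a rough sketch of the source() guideline above: the function file is written to a temporary path here only so the example is self-contained, and the helper name row_cv is hypothetical. In a real project the definitions would live in a file such as my_functions.R next to the analysis script.

```r
# Write a small function file to a temporary location
# (stand-in for a real file such as my_functions.R)
fct_file <- tempfile(fileext = ".R")
writeLines("row_cv <- function(m) apply(m, 1, sd) / rowMeans(m)", fct_file)

# In the analysis script, load the definitions before using them
source(fct_file)

# Row-wise coefficient of variation for the first three iris rows
cv <- row_cv(as.matrix(iris[1:3, 1:4]))
cv
```

Keeping definitions in their own file means several analysis scripts can share them without copy and paste.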

Rmd and Quarto scripts

.Rmd and .qmd files extend R scripts with narrative text, results, and formatted output. They render to HTML, PDF, and other formats. Details in the R Markdown tutorial.

Control Structures — Operators

Comparison operators

Operator	Meaning
`==`	equal
`!=`	not equal
`>` / `>=`	greater than / or equal
`<` / `<=`	less than / or equal

Logical operators

Operator	Meaning	Scope
`&`	AND	element-wise on vectors
`&&`	AND	single logical values only, short-circuits; use in `if` statements
`|`	OR	element-wise on vectors
`||`	OR	single logical values only, short-circuits; use in `if` statements
`!`	NOT	element-wise on vectors

Tip

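Use && and || in if statements: they expect a single logical value on each side and short-circuit, so the right-hand side is evaluated only when it is needed. Since R 4.3.0, passing a vector longer than one to && or || is an error (earlier versions silently used only the first element). Use & and | for element-wise operations on vectors.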
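
A minimal sketch of the difference, using made-up vectors x and y that are not part of the tutorial data:

```r
# Hypothetical example vectors for illustration
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Element-wise operators return one result per element
and_vec <- x & y    # TRUE FALSE FALSE
or_vec  <- x | y    # TRUE TRUE TRUE

# Scalar operators short-circuit: the right-hand side is evaluated
# only if the left-hand side does not already decide the result
v <- NULL
safe <- !is.null(v) && v[1] > 0    # FALSE; v[1] > 0 is never evaluated

# Reduce a vector to a single logical with any()/all() before using it in if()
if (any(x) && all(or_vec)) {
    msg <- "at least one x is TRUE and every x | y is TRUE"
} else {
    msg <- "condition not met"
}
msg
```

Wrapping vector comparisons in any() or all() is one common way to get the single TRUE/FALSE that if expects.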

Conditional Execution — `if` and `ifelse`

`if` statement — operates on a single logical value

if (TRUE) {
    statements_1
} else {
    statements_2
}

Warning

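Keep } else { on the same line. At the top level of a script or at the console, a closing } on its own line already completes the if statement, so an else that starts the next line triggers a syntax error.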

Examples

# Basic if / else
if (1 == 0) {
    print(1)
} else {
    print(2)     # runs this branch
}

# if / else if / else chain
if (1 == 0) {
    print(1)
} else if (1 == 2) {
    print(2)
} else {
    print(3)     # runs this branch
}

`ifelse` — vectorized conditional, operates on entire vectors

ifelse(test, true_value, false_value)   # syntax

x <- 1:10
ifelse(x < 5, sqrt(x), 0)   # sqrt for values < 5, else 0

ifelse is much more efficient than a for loop with an if inside when operating on vectors.
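
To make the comparison concrete, the sketch below writes the same computation both ways and checks that the results agree. The variable names res_loop and res_vec are only illustrative.

```r
x <- 1:10

# Loop version: one if/else decision per element
res_loop <- numeric(length(x))
for (i in seq_along(x)) {
    if (x[i] < 5) {
        res_loop[i] <- sqrt(x[i])
    } else {
        res_loop[i] <- 0
    }
}

# Vectorized version: one call handles the whole vector
res_vec <- ifelse(x < 5, sqrt(x), 0)

all.equal(res_loop, res_vec)   # TRUE
```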

`for` Loops

Iterate over elements of a sequence:

for (variable in sequence) {
    statements
}

Example — compute row means (append approach)

mydf <- iris
myve <- NULL
for (i in seq(along=mydf[,1])) {
    myve <- c(myve, mean(as.numeric(mydf[i, 1:3])))  # appends result each iteration
}
myve[1:8]

Warning

The append approach (c()) is slow for large objects — each iteration creates a new copy of the entire vector. Use the inject approach instead.

Inject approach — pre-allocate the result vector (much faster)

myve <- numeric(length(mydf[,1]))   # pre-allocate vector of correct length
for (i in seq(along=myve)) {
    myve[i] <- mean(as.numeric(mydf[i, 1:3]))   # assign result by index
}
myve[1:8]

Conditional stop inside a loop

Use stop() to abort a loop with an error message when a condition is met. Note that stop() ends the entire evaluation, not only the loop:

x <- 1:10
z <- NULL
for (i in seq(along=x)) {
    if (x[i] < 5) {
        z <- c(z, x[i]-1)
        print(z)
    } else {
        stop("values need to be < 5")   # breaks loop and prints error
    }
}
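
When the goal is only to leave or skip part of a loop rather than to signal an error, base R also provides break and next. The following is a small sketch with arbitrary conditions chosen for illustration:

```r
x <- 1:10
z <- NULL
for (i in seq_along(x)) {
    if (x[i] %% 2 == 0) next     # skip even values, continue with the next i
    if (x[i] >= 7) break         # leave the loop quietly, no error is raised
    z <- c(z, x[i] - 1)
}
z   # 0 2 4
```

Code after the loop keeps running in this case, whereas stop() would halt the script.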

`while` Loop

Iterates as long as a condition remains TRUE:

while (condition) {
    statements
}

Example

z <- 0
while (z < 5) {
    z <- z + 2    # increment z each iteration
    print(z)      # prints: 2, 4, 6  (stops when z >= 5)
}

Tip

Use while when the number of iterations is not known in advance. Use for when iterating over a known sequence or vector.
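
A small sketch of the "unknown number of iterations" case: the starting value and tolerance below are arbitrary, and the loop simply halves a number until it falls below the tolerance.

```r
# The number of iterations depends on the starting value,
# so a while loop is a natural fit
value  <- 100
tol    <- 1
n_iter <- 0
while (value >= tol) {
    value  <- value / 2
    n_iter <- n_iter + 1
}
n_iter   # 7
value    # 0.78125
```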

The `apply` Function Family

Apply functions replace explicit loops with a single call. They are often more readable, although not necessarily faster than a well-written, pre-allocated for loop.

`apply` — apply a function over rows or columns of a matrix/data.frame

apply(X, MARGIN, FUN, ...)
# X:      matrix, array, or data.frame
# MARGIN: 1 = rows, 2 = columns
# FUN:    function to apply

apply(iris[1:8, 1:3], 1, mean)    # row-wise mean for first 8 rows, cols 1-3
apply(iris[, 1:4], 2, mean)       # column-wise mean for all numeric columns

`tapply` — apply a function to groups defined by a factor

tapply(vector, factor, FUN)

tapply(iris$Sepal.Length, iris$Species, mean)   # mean Sepal.Length per species

`lapply` and `sapply` — apply a function to each element of a list or vector

l <- list(a=1:10, beta=exp(-3:3), logic=c(TRUE,FALSE,FALSE,TRUE))

lapply(l, mean)    # returns a list
sapply(l, mean)    # returns a vector or matrix when possible
vapply(l, mean, FUN.VALUE=numeric(1))   # safer: enforces output type

Often used with an inline anonymous function:

sapply(names(l), function(x) mean(l[[x]]))    # same result, explicit element access

Choosing between `lapply`, `sapply`, `vapply`

Function	Returns	Best for
`lapply`	always a list	when output types may vary
`sapply`	vector/matrix if possible, else list	interactive use
`vapply`	vector/array of specified type	scripts — safer, faster
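
The differences in the table can be seen in a short sketch that reuses the list l from above. The functions range and unique are chosen here only because they return results of fixed and varying length, respectively.

```r
l <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE, FALSE, FALSE, TRUE))

# range() returns two values per element, so sapply builds a 2 x 3 matrix
r_sapply <- sapply(l, range)
dim(r_sapply)                      # 2 3

# vapply states the expected shape up front
r_vapply <- vapply(l, range, FUN.VALUE = numeric(2))
dim(r_vapply)                      # 2 3

# With results of different lengths, sapply silently falls back to a list
u_sapply <- sapply(l, unique)
is.list(u_sapply)                  # TRUE

# The same call through vapply raises an error instead of changing type
vapply_failed <- inherits(
    try(vapply(l, unique, FUN.VALUE = numeric(1)), silent = TRUE),
    "try-error")
vapply_failed                      # TRUE
```

In a script, the early error from vapply is usually easier to track down than a list turning up where a vector was expected.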

Loop Speed Performance

Looping over large data sets can be slow. The key principle: avoid growing objects inside loops and prefer vectorized operations over loops entirely.

Test matrix used in all benchmarks

myMA <- matrix(rnorm(1000000), 100000, 10,
               dimnames=list(1:100000, paste("C", 1:10, sep="")))

1. `for` loop: append vs inject

# SLOW — append approach: c() creates a new copy each iteration
results <- NULL
system.time(for(i in seq(along=myMA[,1])) results <- c(results, mean(myMA[i,])))
#  user: 39.2s   elapsed: 45.6s

# FAST — inject approach: pre-allocate, assign by index
results <- numeric(length(myMA[,1]))
system.time(for(i in seq(along=myMA[,1])) results[i] <- mean(myMA[i,]))
#  user: 1.6s    elapsed: 1.6s   → ~25× faster

2. `apply` vs `rowMeans`

system.time(myMAmean <- apply(myMA, 1, mean))     # user: 1.5s
system.time(myMAmean <- rowMeans(myMA))            # user: 0.005s  → ~300× faster

3. `apply` vs vectorized calculation

# apply loop for row-wise standard deviation
system.time(myMAsd <- apply(myMA, 1, sd))                               # user: 3.7s

# vectorized equivalent — same result, dramatically faster
system.time(myMAsd <- sqrt((rowSums((myMA - rowMeans(myMA))^2)) /
                            (length(myMA[1,]) - 1)))                    # user: 0.02s

Key takeaways

Use rowMeans, rowSums, colMeans, colSums instead of apply where possible
Pre-allocate result objects before loops — never grow with c(), cbind(), rbind()
Design data structures for matrix-to-matrix operations — eliminates loops entirely
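
The takeaways can be sketched on a small matrix. The matrix m and the seed are arbitrary stand-ins for myMA, and the point is only that the pre-allocated loop and the loop-free version agree.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
m <- matrix(rnorm(50), nrow = 10, ncol = 5)  # small stand-in for myMA

# Pre-allocate a result matrix and fill it row by row
stats <- matrix(NA_real_, nrow = nrow(m), ncol = 2,
                dimnames = list(NULL, c("mean", "sd")))
for (i in seq_len(nrow(m))) {
    stats[i, ] <- c(mean(m[i, ]), sd(m[i, ]))
}

# Loop-free equivalent using built-in row operations
row_mean <- rowMeans(m)
row_sd   <- sqrt(rowSums((m - row_mean)^2) / (ncol(m) - 1))

all.equal(stats[, "mean"], row_mean)   # TRUE
all.equal(stats[, "sd"],   row_sd)     # TRUE
```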

Fast Querying with Matrix Operations

A practical example: filtering differentially expressed genes (DEGs) across multiple contrasts using matrix-to-matrix logic — no looping required.

Create a test DEG matrix (LFCs and p-values)

lfcPvalMA <- function(Nrow=200, Ncol=4, stats_labels=c("lfc", "pval")) {
    set.seed(1410)
    lfc  <- runif(n=Nrow*Ncol, min=-4, max=4)   # random log fold changes
    pval <- runif(n=Nrow*Ncol, min=0,  max=1)   # random p-values
    lfc_ma  <- matrix(lfc,  Nrow, Ncol, dimnames=list(paste("g",1:Nrow,sep=""),
                      paste("t",1:Ncol,"_",stats_labels[1],sep="")))
    pval_ma <- matrix(pval, Nrow, Ncol, dimnames=list(paste("g",1:Nrow,sep=""),
                      paste("t",1:Ncol,"_",stats_labels[2],sep="")))
    statsMA <- cbind(lfc_ma, pval_ma)
    return(statsMA[, order(colnames(statsMA))])   # interleave lfc/pval columns per contrast
}
degMA <- lfcPvalMA(Nrow=200, Ncol=4)
degMA[1:4,]

Separate LFC and p-value matrices into a list

degList <- list(
    lfc  = degMA[, grepl("lfc",  colnames(degMA))],
    pval = degMA[, grepl("pval", colnames(degMA))]
)
sapply(degList, dim)   # confirm dimensions match

Apply combinatorial filter (|LFC| >= 1 AND pval <= 0.5)

queryResult <- (degList$lfc >= 1 | degList$lfc <= -1) & degList$pval <= 0.5
colnames(queryResult) <- gsub("_.*", "", colnames(queryResult))
queryResult[1:4,]

Extract matching gene IDs

# Per-column: genes passing filter in each contrast
matchingIDlist <- sapply(colnames(queryResult),
                         function(x) names(queryResult[queryResult[,x], x]),
                         simplify=FALSE)

# Across columns: genes passing filter in > 2 contrasts
matchingID <- rowSums(queryResult) > 2
names(matchingID[matchingID])    # gene names meeting the threshold

Tip

Storing LFC and p-value as two parallel matrices in a list enables flexible, fast, zero-loop combinatorial filtering — a pattern worth reusing in any multi-contrast analysis.
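
The same pattern on a tiny, hand-made example may make the logic easier to follow. All values below are hypothetical, and the p-value cutoff of 0.05 is chosen only for this illustration.

```r
# Tiny hand-made LFC and p-value matrices (hypothetical values),
# filled column by column: first t1, then t2
lfc  <- matrix(c( 2.0, -0.5,  1.5,
                 -3.0,  0.2, -1.2),
               nrow = 3, dimnames = list(c("g1", "g2", "g3"), c("t1", "t2")))
pval <- matrix(c(0.01, 0.04, 0.20,
                 0.03, 0.50, 0.01),
               nrow = 3, dimnames = dimnames(lfc))

# One logical matrix answers the query for every gene/contrast pair
hit <- abs(lfc) >= 1 & pval <= 0.05

n_hits   <- rowSums(hit)                         # contrasts passed per gene: 2 0 1
all_hits <- names(which(n_hits == ncol(hit)))    # genes passing in all contrasts: "g1"
```

Because both matrices share the same dimensions and dimnames, the comparison lines up gene by gene and contrast by contrast without any indexing code.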

Functions — Overview and Syntax

Functions are the primary way to organize and reuse code in R. Almost everything in R is a function.

Define a function

myfct <- function(arg1, arg2, ...) {
    # function body — operations on the arguments
    result <- arg1 + arg2
    return(result)    # return value explicitly, or just: result
}

Call a function

myfct(arg1=3, arg2=4)   # with argument names (recommended)
myfct(3, 4)              # positional — order must match definition

Key rules

Concept	Rule
Naming	Avoid names of existing functions (e.g. don’t name a function `mean`)
Default args	Provide defaults with `arg=value` — caller can then omit them
Empty args	`function() { ... }` is valid for functions that need no input
`...`	Pass unknown arguments through to another function
Return value	Last unassigned expression, or explicit `return()`
Scope	Variables inside a function are local — invisible outside
Global assign	`<<-` assigns to a variable in an enclosing environment, creating it in the global environment if none is found
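
A short sketch of the default-argument and `...` rules. The function scaled_summary and its arguments mult and fun are hypothetical names introduced for this example.

```r
# Hypothetical helper: scale a numeric vector, then summarize it.
# 'mult' and 'fun' have defaults; ... is passed on to whatever 'fun' is.
scaled_summary <- function(x, mult = 1, fun = mean, ...) {
    fun(x * mult, ...)
}

v <- c(1, 2, NA, 4)
scaled_summary(v)                                       # NA: mean() sees the missing value
scaled_summary(v, na.rm = TRUE)                         # na.rm travels through ... to mean()
scaled_summary(v, mult = 10, fun = max, na.rm = TRUE)   # 40
```

The function itself never mentions na.rm, so any argument understood by fun can be supplied by the caller.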

Functions — Examples

Define a function with a default argument

myfct <- function(x1, x2=5) {   # x2 has default value 5
    z1 <- x1 / x1
    z2 <- x2 * x2
    myvec <- c(z1, z2)
    return(myvec)
}

Call with and without the default argument

myfct(x1=2, x2=5)   # explicit: returns c(1, 25)
myfct(2, 5)          # positional: same result
myfct(x1=2)          # uses default x2=5: same result
myfct                 # without () prints the function definition

Scope — variables inside functions are local

x <- 10                     # global x
myfct2 <- function() {
    x <- 99                 # local x — does not affect global x
    cat("inside:", x, "\n")
}
myfct2()                    # prints: inside: 99
cat("outside:", x, "\n")    # prints: outside: 10

# Force global assignment with <<-
myfct3 <- function() {
    x <<- 99                # modifies global x
}
myfct3()
cat("outside:", x, "\n")    # prints: outside: 99

Tip

Avoid <<- in general — it makes code harder to reason about. Prefer returning values explicitly with return() and assigning outside the function.
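
One way to follow this advice, sketched with a hypothetical increment function: the current value goes in as an argument and the new value comes back as the result.

```r
counter <- 0

# Instead of modifying 'counter' from inside the function with <<-,
# take the current value as input and return the new one
increment <- function(value, amount = 1) {
    value + amount
}

counter <- increment(counter)               # 1
counter <- increment(counter, amount = 5)   # 6
counter
```

Every change to counter is now visible at the call site, which makes the data flow easier to follow.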

Useful Utilities — Debugging

R provides several tools for finding and fixing errors in code:

Function	Purpose
`traceback()`	Shows the call stack after an error
`browser()`	Insert a breakpoint — pauses execution and opens interactive prompt
`debug(myfct)`	Step through `myfct` line by line
`undebug(myfct)`	Remove debug mode from a function
`options(error=recover)`	On error, open interactive debugger at the call stack
`options(error=NULL)`	Reset to default error handling

# Example: use browser() as a breakpoint inside a function
myfct <- function(x) {
    browser()          # execution pauses here — inspect variables interactively
    result <- x^2
    return(result)
}
myfct(5)

Full guide: Debugging in R (Advanced R)

Useful Utilities — Regular Expressions

R’s regex utilities work similarly to other languages. Main reference: ?regexp

Pattern matching with `grep`

month.name[grep("^A", month.name)]        # months starting with A
grep("^J", month.name, value=TRUE)        # months starting with J; value=TRUE returns the matches directly
grepl("^A", month.name)                   # returns logical vector

String substitution with `gsub` and `sub`

# gsub: replace ALL matches
gsub("(i.*a)", "xxx_\\1", "virginica", perl=TRUE)   # back reference with \\1

# sub: replace FIRST match only
sub("a", "X", "banana")    # returns "bXnana"

String operations

# Insert a character with back reference, then split on it
x <- gsub("(a)", "\\1_", month.name[1], perl=TRUE)   # "Ja_nua_ry"
strsplit(x, "_")                                       # split on "_"

# Reverse a string
paste(rev(unlist(strsplit("hello", NULL))), collapse="")   # "olleh"

Import lines matching a pattern from a file

cat(month.name, file="months.txt", sep="\n")         # write months to file
x <- readLines("months.txt")                          # read all lines
x[grep("^J", x, perl=TRUE)]                          # keep lines starting with J

Useful Utilities — String and Time Functions

String paste and manipulation

paste("sample", 1:5, sep="_")           # "sample_1" ... "sample_5"
paste0("C", 1:5)                         # "C1" ... "C5" (no separator)
paste(month.name[1:3], collapse=", ")    # "January, February, March"
nchar("hello")                           # 5  (string length)
toupper("hello"); tolower("HELLO")       # case conversion
trimws("  hello  ")                      # remove leading/trailing whitespace

Interpret a string as R code

myfct <- function(x) x^2
mylist <- ls()
n <- which(mylist %in% "myfct")

get(mylist[n])       # retrieves the object named by the string
get(mylist[n])(2)    # calls it as a function with argument 2
eval(parse(text=mylist[n]))   # alternative: parse string as expression

Timing and system calls

system.time(ls())       # measure time for an expression
date()                  # current system date and time
Sys.sleep(1)            # pause R for 1 second

Call external command-line tools from R

system("blastall -p blastp -i seq.fasta -d uniprot -o seq.blastp")   # legacy BLAST syntax
system2("blastp", args=c("-query", "seq.fasta", "-db", "uniprot", "-out", "seq.blastp"))   # BLAST+ syntax

File integrity check

library(tools)
md5 <- as.vector(md5sum(dir(R.home(), pattern="^COPY", full.names=TRUE)))
identical(md5, md5)                        # TRUE
identical(md5, sub("^.", "z", md5))       # FALSE: changing even one character is detected

Executing R Scripts

From the R console

source("my_script.R")    # execute entire script, output printed to console

From the command-line (preferred for automation)

Rscript myscript.R                   # standard method
./myscript.R                         # requires shebang + executable permission
R CMD BATCH myscript.R               # older alternative
R --no-echo < myscript.R             # older alternative (option was named --slave before R 4.0.0)

Shebang line — required for `./myscript.R` execution

Add as the first line of the script:

#!/usr/bin/env Rscript

Then make the script executable:

chmod +x myscript.R
./myscript.R

Passing arguments from command-line to R

Create test.R:

myarg <- commandArgs(trailingOnly=TRUE)   # keep only the user-provided arguments
n <- as.integer(myarg[1])                 # arguments arrive as character strings
print(iris[1:n, ])                        # first n rows of iris

Run it:

Rscript test.R 10    # prints first 10 rows of iris

Tip

For scripts accepting multiple complex arguments, use the argparse or optparse packages for clean argument parsing with help messages.
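
For simple cases, a few lines of base R can be enough. The sketch below assumes arguments in the form --key=value; the helper parse_key_value_args is hypothetical and is not part of argparse or optparse. The argument vector is simulated here so the example runs in an interactive session.

```r
# Hypothetical helper: turn c("--rows=3", "--species=setosa") into a named list.
# In a real script the input would come from commandArgs(trailingOnly = TRUE).
parse_key_value_args <- function(args) {
    keep <- grepl("^--[^=]+=", args)                    # accept only --key=value
    keys <- sub("^--([^=]+)=.*$", "\\1", args[keep])    # text between -- and =
    vals <- sub("^--[^=]+=", "", args[keep])            # text after the first =
    as.list(setNames(vals, keys))
}

# Simulated command line: Rscript test.R --rows=3 --species=setosa
opts <- parse_key_value_args(c("--rows=3", "--species=setosa"))
n    <- as.integer(opts$rows)                           # arguments arrive as strings

head(iris[iris$Species == opts$species, ], n)
```

Unlike the dedicated packages, this sketch does no validation and prints no help message, so it only suits small scripts.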

Programming Exercises

Exercise 1 — Comparing loop approaches for row-wise means

Create the test matrix:

myMA <- matrix(rnorm(500), 100, 5,
               dimnames=list(1:100, paste("C", 1:5, sep="")))

Task 1.1 — for loop with append (slow but instructive):

myve_for <- NULL
for (i in seq(along=myMA[,1])) {
    myve_for <- c(myve_for, mean(as.numeric(myMA[i,])))
}

Task 1.2 — while loop:

z <- 1; myve_while <- NULL
while (z <= nrow(myMA)) {
    myve_while <- c(myve_while, mean(as.numeric(myMA[z,])))
    z <- z + 1
}

Task 1.3 — confirm both methods give identical results:

all(myve_for == myve_while)    # should return TRUE

Task 1.4 — apply loop:

myve_apply <- apply(myMA, 1, mean)

Task 1.5 — built-in rowMeans (fastest):

mymean <- rowMeans(myMA)
# Compare all approaches side by side:
myResult <- cbind(myMA, mean_for=myve_for, mean_while=myve_while,
                  mean_apply=myve_apply, mean_rowMeans=mymean)
myResult[1:4, -(1:ncol(myMA))]    # drop the data columns, show only the mean columns

Programming Exercises (cont.)

Exercise 2 — Custom function for grouped column means

Task 2.1 — implement a function that computes means for user-specified column groups in any matrix or data frame:

myMA <- matrix(rnorm(100000), 10000, 10,
               dimnames=list(1:10000, paste("C", 1:10, sep="")))

# Group columns: cols 1-3 → group 1, cols 4-6 → group 2, etc.
myList <- tapply(colnames(myMA), c(1,1,1,2,2,2,3,3,4,4), list)
names(myList) <- sapply(myList, paste, collapse="_")

# Apply mean to each column group
myMAmean <- sapply(myList, function(x) apply(myMA[,x], 1, mean))
myMAmean[1:4,]

Exercise 3 — Nested loops: pairwise similarity matrix

Task 3.1 — create a list of character vectors of varying lengths:

setlist <- lapply(11:30, function(x) sample(letters, x, replace=TRUE))
names(setlist) <- paste("S", seq(along=setlist), sep="")

Task 3.2 — compute all pairwise intersect sizes:

setlist <- lapply(setlist, unique)    # remove duplicates first; lapply always returns a list
olMA <- sapply(names(setlist), function(x)
               sapply(names(setlist), function(y)
               sum(setlist[[x]] %in% setlist[[y]])))
olMA[1:4, 1:4]

Task 3.3 — plot as heatmap:

library(pheatmap); library(RColorBrewer)
pheatmap(olMA, color=brewer.pal(9,"Blues"),
         cluster_rows=FALSE, cluster_cols=FALSE,
         display_numbers=TRUE, number_format="%.0f", fontsize_number=10)

HW04

Important

Assignment: HW04 — Programming in R

The programming exercises above (Exercises 1–3) form the basis of HW04.

Summary of exercise tasks

Exercise	Task	Topic
1.1	`for` loop with append	Row means on matrix
1.2	`while` loop	Same computation
1.3	Confirm identical results	`all()` comparison
1.4	`apply` loop	Same computation
1.5	`rowMeans`	Fastest approach
2.1	Custom function	Grouped column means with `tapply` + `sapply`
3.1	Create list	Character vectors of varying lengths
3.2	Nested loops	Pairwise intersect matrix with `%in%`
3.3	Heatmap	Visualize similarity matrix with `pheatmap`

Note

For the full homework instructions and submission details, see the HW04 page.

Summary — Key R Programming Commands

Category	Command	Purpose
Conditionals	`if / else`	single-value branching
	`ifelse(test, yes, no)`	vectorized conditional
Loops	`for (i in seq)`	iterate over sequence
	`while (cond)`	iterate while condition is TRUE
	`stop("msg")`	abort execution with an error
Apply family	`apply(X, 1, FUN)`	rows of matrix
	`apply(X, 2, FUN)`	columns of matrix
	`tapply(vec, fac, FUN)`	by factor groups
	`sapply(list, FUN)`	returns vector/matrix
	`lapply(list, FUN)`	returns list
Speed	`rowMeans`, `rowSums`	fast built-in row ops
	pre-allocate with `numeric(n)`	avoid growing with `c()`
Functions	`function(arg1, arg2=default)`	define
	`return(value)`	explicit return
Strings	`grep`, `grepl`	pattern match
	`gsub`, `sub`	pattern substitute
	`strsplit`, `paste`, `paste0`	split / combine
Scripts	`source("script.R")`	run from R console
	`Rscript script.R`	run from command-line
Timing	`system.time(expr)`	benchmark expression

Next: T5 — Parallel R
