GEN242: Data Analysis in Genome Biology
2026-04-10
Topics covered in this tutorial:
Note
Homework: HW02 tasks are linked throughout these slides at the relevant sections.
All tasks are assembled into a single R script HW2.R submitted via GitHub.
R is a powerful statistical environment and programming language for data analysis and visualization, widely used in bioinformatics and data science.
| Repository | Packages | Focus |
|---|---|---|
| CRAN | >14,000 | General data analysis |
| Bioconductor | >2,000 | Bioscience data analysis |
| Omegahat | >90 | Programming interfaces |
Several IDEs support syntax highlighting and sending code to the R console:
Key shortcuts in RStudio:
| Shortcut | Action |
|---|---|
| Ctrl+Enter | Send code to R console |
| Ctrl+Shift+C | Comment / uncomment |
| Ctrl+1 / Ctrl+2 | Switch between editor and console |
Terminal-based environment combining Neovim + R + Tmux. Ideal for working on the HPCC cluster.
Key shortcuts in Nvim-R: \rf starts a connected R session, and Enter sends code to the R console.
Emacs (ESS), VS Code, gedit, Notepad++, Eclipse: all support R to varying degrees.
Tip
For detailed Bioconductor install instructions see the Bioc Install page and the BiocManager vignette.
When working in R, a good practice is to write all commands directly into an R script, instead of the R console, and then send the commands for execution to the R console with the Ctrl+Enter shortcut in RStudio/Posit, or similar shortcuts in other R coding environments, such as Nvim-R. This way all work is preserved and can be reused in the future.
This section gives a short overview of the standard routine for loading R-based tutorials into an R IDE.
Step 1. Download the *.qmd, *.Rmd or *.R file. These so-called source files are always linked in the top right corner of each tutorial or slide show. The download can be done from within R via download.file, with wget from the command line, or with the save function of a web browser. The following downloads the Rmd file of this tutorial via download.file from the R console.
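The download step might look like the sketch below. The tutorial's real URL is not reproduced here, so a small temporary file addressed through a file:// URL stands in for it. The names `src`, `url` and `dest` are illustrative, and the file:// form shown assumes a Unix-like path.

```r
# Stand-in for a remote source file: a tiny script written to a temp location.
src <- tempfile(fileext = ".R")
writeLines("x <- 1:3", src)

# Hypothetical URL; replace it with the link copied from the tutorial page.
url <- paste0("file://", normalizePath(src))
dest <- file.path(tempdir(), "tutorial_copy.R")

# download.file() returns 0 on success.
status <- download.file(url, destfile = dest, quiet = TRUE)
readLines(dest)
```

With a real tutorial link, only `url` and `dest` need to change.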
Step 2. Load *.qmd, *.Rmd or *.R file in Nvim-R or RStudio.
Step 3. Send code from the code editor to the R console by pressing Ctrl+Enter in RStudio or Enter in Nvim-R. In *.Rmd files the code lines are in so-called code chunks, and only those can be sent to the console. To obtain a connected R session in Neovim, start one by pressing the \rf key combination. For details see here.
Warning
Answer n when asked to save the workspace. Saving .RData creates large files. Better practice: save your analysis as an R script and re-run it to restore your session.
The %>% pipe from dplyr/magrittr chains operations left to right. The newer native R pipe is |>.
Pipes make code more readable by avoiding deeply nested calls. Details are in the dplyr tutorial.
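As a small illustration of the readability argument, the sketch below computes the same value once with nested calls and once with the native pipe. It uses only base R (R 4.1 or later); the variable names are illustrative. With dplyr or magrittr loaded, `%>%` could be used in the same positions.

```r
# Nested calls are read from the inside out.
nested <- round(mean(sqrt(iris$Sepal.Length)), 2)

# The same computation with the native pipe is read left to right.
piped <- iris$Sepal.Length |> sqrt() |> mean() |> round(2)

identical(nested, piped)
```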
Preferred version
Older alternatives
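Assuming the two labels above refer to R's assignment forms (the original code is not preserved here), a minimal sketch of the three common variants looks like this. The object names are illustrative.

```r
a <- 1:3           # arrow operator, the form usually recommended in R style guides
b = 1:3            # works at the top level, but is easy to confuse with argument passing
assign("d", 1:3)   # functional form, handy when the name is built at run time

identical(a, b) && identical(a, d)
```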
[1] 1 2 3
[1] TRUE
[1] "1" "2" "3"
[1] "1" "2" "3"
[1] TRUE
[1] 1 2 3
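The printed values above are consistent with a short sequence of type checks and coercions. The calls below are a reconstruction under that assumption, not the original chunk; `x` and `y` are illustrative names.

```r
x <- 1:3
x                      # numeric (integer) vector
is.numeric(x)          # type check
as.character(x)        # coercion to character
y <- as.character(x)   # store the coerced copy
y
is.character(y)
as.numeric(y)          # and back to numeric
```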
| Type | Dimensions | Data types | Example |
|---|---|---|---|
| vector | 1D | uniform | c(1, 2, 3) |
| factor | 1D | grouping labels | factor(c("a","b","a")) |
| matrix | 2D | uniform | matrix(1:9, 3, 3) |
| data.frame | 2D | mixed | data.frame(x=1:3, y=c("a","b","c")) |
| tibble | 2D | mixed | modern data.frame |
| list | any | any | list(name="Fred", age=30) |
| function | n/a | code | function(x) x^2 |
a b c d e
1 2 3 4 5
b d f h
2 4 6 8
b d f
2 4 6
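Output of this shape can be produced by indexing a named vector by position and by name. The following is a sketch under that assumption; the object name `myVec` is illustrative.

```r
myVec <- 1:10
names(myVec) <- letters[1:10]   # name the elements a..j

myVec[1:5]                      # first five elements by position
myVec[c(2, 4, 6, 8)]            # selected positions
myVec[c("b", "d", "f")]         # selection by name
```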
[1] dog cat mouse dog dog cat
Levels: cat dog mouse
Factors encode categorical variables with defined levels — essential for statistical modeling.
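A minimal sketch of how such a factor can be built and inspected follows. The name `animalf` is illustrative.

```r
animalf <- factor(c("dog", "cat", "mouse", "dog", "dog", "cat"))
animalf               # values plus the sorted set of levels
levels(animalf)       # levels are ordered alphabetically by default
table(animalf)        # counts per level
as.integer(animalf)   # integer codes that point into levels()
```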
[1] "matrix" "array"
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 2 3 4 5 6 7 8 9 10
[2,] 11 12 13 14 15 16 17 18 19 20
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 2 3 4 5 6 7 8 9 10
[1] "data.frame"
Col1 Col2
1 1 10
2 2 9
[1] "matrix" "array"
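The matrix and data frame output above can be reproduced with calls along these lines. This is a reconstruction with illustrative object names (`myMA`, `myDF`), assuming a 3 x 10 matrix filled by row and a two-column data frame.

```r
myMA <- matrix(1:30, nrow = 3, ncol = 10, byrow = TRUE)
class(myMA)
myMA[1:2, ]                # first two rows
myMA[1, , drop = FALSE]    # drop = FALSE keeps the result a 1-row matrix

myDF <- data.frame(Col1 = 1:10, Col2 = 10:1)
class(myDF)
myDF[1:2, ]
class(as.matrix(myDF))     # a data frame with uniform types converts cleanly
```

Without `drop = FALSE`, selecting a single row returns a plain vector instead of a matrix.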
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
Tip
The iris dataset is built into R — no import needed. It is used throughout these examples.
$name
[1] "Fred"
$wife
[1] "Mary"
$no.children
[1] 3
$child.ages
[1] 4 7 9
[1] 4 7
Lists are the most flexible R object — they can hold vectors, data frames, other lists, and functions all at once.
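A list like the one printed above can be created and accessed as sketched below. The object name `my_list` is illustrative.

```r
my_list <- list(name = "Fred", wife = "Mary", no.children = 3,
                child.ages = c(4, 7, 9))
my_list                 # print all components
my_list[[4]][1:2]       # first two elements of the fourth component
my_list$child.ages      # the same component accessed by name
my_list["name"]         # single brackets return a list of length 1
my_list[["name"]]       # double brackets return the component itself
```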
A B C D
1 2 3 4
E F G H I J K L M N O P Q R S T U V W X Y Z
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
K L M N O P Q R S T U V W X Y Z
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
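The three printouts above match positive, negative and logical indexing of a named vector. The sketch below assumes that reading; `my_object` is an illustrative name.

```r
my_object <- 1:26
names(my_object) <- LETTERS

my_object[1:4]              # positive index: keep positions 1 to 4
my_object[-c(1:4)]          # negative index: drop positions 1 to 4
my_object[my_object > 10]   # logical index: keep values greater than 10
```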
$ sign (single column or list component)
[1] setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
21 5.4 3.4 1.7 0.2 setosa
22 5.1 3.7 1.5 0.4 setosa
23 4.6 3.6 1.0 0.2 setosa
24 5.1 3.3 1.7 0.5 setosa
25 4.8 3.4 1.9 0.2 setosa
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
29 5.2 3.4 1.4 0.2 setosa
30 4.7 3.2 1.6 0.2 setosa
31 4.8 3.1 1.6 0.2 setosa
32 5.4 3.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
36 5.0 3.2 1.2 0.2 setosa
37 5.5 3.5 1.3 0.2 setosa
38 4.9 3.6 1.4 0.1 setosa
39 4.4 3.0 1.3 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa
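The iris printouts in this part are consistent with the following subsetting calls. This is a sketch using the built-in dataset; the name `setosa` is illustrative.

```r
iris$Species[1:8]          # $ extracts one column, then index within it
iris[1:4, ]                # rows 1 to 4, all columns
iris[1:4, 1:2]             # rows 1 to 4, columns 1 and 2

# Logical row index: keep only the rows of one species.
setosa <- iris[iris$Species == "setosa", ]
dim(setosa)
```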
[1] 1 2 3
[1] 1 2 3 101 102 103
x y
[1,] 1 101
[2,] 2 102
[3,] 3 103
[4,] 1 101
[5,] 2 102
[6,] 3 103
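Combining objects as in the output above can be sketched with `c()`, `cbind()` and `rbind()`. The names `x`, `y` and `ma` are illustrative.

```r
x <- 1:3
y <- 101:103

c(x, y)             # append vectors end to end
ma <- cbind(x, y)   # bind as columns; column names come from the object names
rbind(ma, ma)       # stack two matrices by row
```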
[1] 150
[1] 150 5
[1] "1" "2" "3" "4" "5" "6" "7" "8"
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
[1] "name" "wife" "no.children" "child.ages"
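Dimensions and names can be queried as sketched below. The list `my_list` is an illustrative stand-in for the list object used earlier.

```r
length(iris$Species)    # number of elements in a vector
dim(iris)               # rows and columns of a data frame
rownames(iris)[1:8]     # row names are stored as character strings
colnames(iris)

my_list <- list(name = "Fred", wife = "Mary", no.children = 3,
                child.ages = c(4, 7, 9))
names(my_list)          # component names of a list
```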
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[1] 11 11 11 11 11 11 11 11 11 11
[1] 55
[1] 5.5
1 2 3 4 5 6
3.333333 3.100000 3.066667 3.066667 3.333333 3.666667
Sepal.Length Sepal.Width Petal.Length
4.950000 3.383333 1.450000
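The results above fit a set of vectorized comparisons, basic arithmetic and `apply()` calls. The following reconstruction is an assumption about the lost code; `x` and `y` are illustrative names.

```r
x <- 1:10
y <- 10:1

x > y            # element-wise comparison
x > y & x > 5    # element-wise logical AND
x == y           # no position holds equal values
x + y            # element-wise addition
sum(x)
mean(x)

# apply() over a data frame slice: 1 = by row, 2 = by column.
apply(iris[1:6, 1:3], 1, mean)
apply(iris[1:6, 1:3], 2, mean)
```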
Widely used read.table and read.delim import functions
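A minimal round trip with the base R functions might look like this. It is a sketch that writes to a temporary file, so the path and the names `tmp`, `df1` and `df2` are illustrative.

```r
# Export a few rows as tab-delimited text.
tmp <- tempfile(fileext = ".tsv")
write.table(iris[1:5, ], file = tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# read.delim() assumes a header line and tab separators by default.
df1 <- read.delim(tmp)

# read.table() needs both stated explicitly.
df2 <- read.table(tmp, header = TRUE, sep = "\t")

all.equal(df1, df2)
```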
Better alternative from readr package with better default arguments and performance. For details see here.
Import from Google Sheet directly
Note
HW02 — Task A: Sort iris by first column, subset first 12 rows, export to file, modify column names in a spreadsheet program, re-import with read.table.
→ HW02 instructions
[1] 150
[1] 35
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
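The values above are consistent with counting unique entries, computing per-group means and testing membership. The calls below are a reconstruction under that assumption; `agg` is an illustrative name.

```r
length(iris$Sepal.Length)            # total number of values
length(unique(iris$Sepal.Length))    # number of distinct values

# Mean of each numeric column per species.
agg <- aggregate(iris[, 1:4], by = list(iris$Species), FUN = mean)
agg

# %in% returns one logical per element of the left-hand vector.
month.name %in% c("May", "July")
```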
| System | Level | Package |
|---|---|---|
| Base R graphics | Low + high | built-in |
| grid | Low-level | built-in |
| lattice | High-level | lattice |
| ggplot2 | High-level | ggplot2 |
plot, barplot, boxplot, hist, pie, pairs, image, heatmap
Tip
For new code, ggplot2 is generally recommended. Base R graphics remain useful for quick exploration and highly customized plots.
plot(y[,1], y[,2], pch=20, col="red", main="Symbols and Labels")
text(y[,1]+0.03, y[,2], rownames(y))
Call:
lm(formula = y[, 2] ~ y[, 1])
Residuals:
Min 1Q Median 3Q Max
-0.40357 -0.17912 -0.04299 0.22147 0.46623
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5764 0.2110 2.732 0.0258 *
y[, 1] -0.3647 0.3959 -0.921 0.3839
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3095 on 8 degrees of freedom
Multiple R-squared: 0.09589, Adjusted R-squared: -0.01712
F-statistic: 0.8485 on 1 and 8 DF, p-value: 0.3839
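The plotting calls and the regression summary above depend on a matrix `y` whose definition is not preserved here. The sketch below assumes a 10 x 3 matrix of uniform random numbers with letters as row names; the seed is arbitrary, so the fitted values will differ from the summary shown. The plot is written to a temporary PDF so the code also runs non-interactively.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(y[, 1], y[, 2], pch = 20, col = "red", main = "Symbols and Labels")
text(y[, 1] + 0.03, y[, 2], rownames(y))   # label points by row name
fit <- lm(y[, 2] ~ y[, 1])                 # simple linear regression
abline(fit, col = "blue")                  # add the fitted line
invisible(dev.off())

summary(fit)
```

With 10 observations and two estimated coefficients, the model has 8 residual degrees of freedom, as in the summary above.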
| Argument | Description |
|---|---|
| col | color of symbols |
| pch | symbol type (example(points) to see options) |
| lwd | line/symbol width |
| cex.* | font size controls |
| mar | margin sizes c(bottom, left, top, right) |
| log="xy" | log scale on both axes |
Note
HW02 — Task B: Generate a scatter plot of iris columns 1 and 2, colored by Species. Use xlim/ylim to restrict data to the bottom-left quadrant.
→ HW02 instructions
Tip
When input is a matrix, barplot uses column names as group labels and row names as within-group labels. Convert data.frame input with as.matrix() first.
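bar <- barplot(m <- rowMeans(y) * 10, ylim=c(0, 10))
stdev <- apply(y, 1, sd)  # per-row standard deviation; sd(t(y)) returns a single value in current R
arrows(bar, m, bar, m + stdev, length=0.15, angle=90)
Saving a plot to a file with pdf() and dev.off() works the same for jpeg(), png(), svg() and tiff().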
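A self-contained sketch of a bar plot with error bars that is saved to a file follows. The matrix `y`, the seed and the output path are assumptions made for illustration; other devices such as `png()` take the place of `pdf()` in the same way.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

m <- rowMeans(y)            # bar heights
stdev <- apply(y, 1, sd)    # one standard deviation per row

out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 7, height = 5)     # open the graphics device
bar <- barplot(m, ylim = c(0, max(m + stdev) * 1.1))
arrows(bar, m, bar, m + stdev, length = 0.15, angle = 90)
invisible(dev.off())                     # close the device to finish the file

file.exists(out_file)
```

`barplot()` returns the x positions of the bar midpoints, which is why its result can be passed to `arrows()`.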
Note
HW02 — Task C: Calculate mean values per Species for first four iris columns. Organize as a matrix. Generate stacked and horizontally arranged bar plots.
→ HW02 instructions
A step-by-step workflow using two sample biological datasets. This analysis routine is used by Homework 2D-H.
Open in Excel, save as tab-delimited text, then import:
Or import directly from the web:
Note
HW02 — Task D: Execute merge to return only common rows directly (without na.omit). Prove both methods return identical results.
HW02 — Task E: Replace all NA values in my_mw_target2a with zeros.
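As background for these two tasks, the sketch below shows how the `all` argument of `merge()` changes the result and how `NA` values can be replaced by logical indexing. The data frames are hypothetical toy data, not the homework tables.

```r
# Hypothetical toy tables sharing an ID column.
df1 <- data.frame(ID = c("a", "b", "c"), MW = c(4100, 5200, 4800))
df2 <- data.frame(ID = c("b", "c", "d"), Loc = c("cyto", "nucl", "mito"))

# all = TRUE keeps every ID and fills missing fields with NA.
outer <- merge(df1, df2, by = "ID", all = TRUE)
outer

# The default (all = FALSE) keeps only IDs present in both tables.
inner <- merge(df1, df2, by = "ID")
inner

# Replace NA values in one column with zeros via logical indexing.
outer_filled <- outer
outer_filled$MW[is.na(outer_filled$MW)] <- 0
outer_filled
```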
Note
HW02 — Task F: How many proteins have MW > 4,000 and < 5,000? Subset and sort by MW to verify.
Note
HW02 — Task G: Retrieve rows where second column contains specific IDs using %in%. Also use the second column as a row index and repeat. Explain the difference between the two approaches.
Assemble all solutions into a single R script HW2.R and submit via GitHub.
| Task | Topic | Key functions |
|---|---|---|
| A | Sort iris, export, modify columns, re-import | order, write.table, read.table |
| B | Scatter plot iris col 1-2, colored by Species | plot, xlim, ylim |
| C | Mean matrix by Species, stacked & horizontal bars | aggregate, barplot |
| D | Merge returning only common rows; prove equivalence | merge(all=FALSE), all() |
| E | Replace NAs with zeros | is.na, indexing |
| F | Filter proteins by MW range 4,000–5,000 | boolean indexing |
| G | Subset rows by ID using %in% and row index | %in%, rownames |
| H | Assemble all code into HW2.R, run with source() | source, Rscript |
Homework/HW2/HW2.R
Due: Thu, April 16th at 6:00 PM
Note
The preassembled workflow script for Task H is available here — it does not include solutions for Tasks A–C.
GEN242 · UC Riverside · Tutorial source