Introduction to R

GEN242: Data Analysis in Genome Biology

Thomas Girke

2026-07-08

Overview

Topics covered in this tutorial:

What is R and why use it?
R working environments (RStudio, Nvim-R-Tmux)
Installation of R, RStudio and packages
Navigating directories and basic syntax
Data types and data objects
Subsetting, utilities, and calculations
Reading and writing external data
Graphics in R (base graphics)
Analysis routine: data import, merging, filtering, plotting

Note

Homework: HW02 tasks are linked throughout these slides at the relevant sections.
All tasks are assembled into a single R script HW2.R submitted via GitHub.

What is R?

R is a powerful statistical environment and programming language for data analysis and visualization, widely used in bioinformatics and data science.

Why use R?

Complete statistical environment and programming language
Efficient functions and data structures for data analysis
Powerful, publication-quality graphics
Access to a fast-growing number of analysis packages
One of the most widely used languages in bioinformatics
Standard for data mining and biostatistical analysis
Free, open-source, available for all operating systems

Key package repositories

Repository	Packages	Focus
CRAN	>14,000	General data analysis
Bioconductor	>2,000	Bioscience data analysis
Omegahat	>90	Programming interfaces

R Working Environments

Several IDEs support syntax highlighting and sending code to the R console:

RStudio / Posit

RStudio Desktop — local installation
RStudio Server / OnDemand — web-based, available at UCR HPCC
Posit Cloud — cloud-based, no local install needed

Key shortcuts in RStudio:

Shortcut	Action
`Ctrl+Enter`	Send code to R console
`Ctrl+Shift+C`	Comment / uncomment
`Ctrl+1` / `Ctrl+2`	Switch between editor and console

Nvim-R-Tmux

Terminal-based environment combining Neovim + R + Tmux. Ideal for working on the HPCC cluster.

Start R session: \rf
Send line to R console: Enter
Full instructions: Nvim-R-Tmux tutorial

Other editors

Emacs (ESS), VS Code, gedit, Notepad++, Eclipse — all support R to varying degrees.

Installation of R and Packages

Install R and RStudio

Install R from CRAN
Install RStudio from posit.co

Install CRAN packages

install.packages(c("pkg1", "pkg2"))
install.packages("pkg.zip", repos=NULL)   # install from local file

Install Bioconductor packages

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")        # install BiocManager if not available
BiocManager::version()                     # check Bioconductor version
BiocManager::install(c("pkg1", "pkg2"))   # install Bioc packages

Load packages

library("my_library")                                          # single package
lapply(c("lib1", "lib2"), require, character.only=TRUE)       # multiple packages

Explore a package

library(help="my_library")    # list functions
vignette("my_library")        # open manual (PDF or HTML)

Tip

For detailed Bioconductor install instructions see the Bioc Install page and the BiocManager vignette.

Working Routine for Tutorials

A good practice in R is to write all commands into an R script rather than typing them into the console, and to send them to the console for execution with the Ctrl+Enter shortcut in RStudio/Posit or the equivalent shortcut in other environments such as Nvim-R. This way all work is preserved and can be reused later.

This section gives a short overview of the standard routine for loading R-based tutorials into an R IDE.

Step 1. Download the *.qmd, *.Rmd or *.R file. These so-called source files are linked in the top right corner of each tutorial or slide show. The download can be done from within R via download.file (see below), with wget from the command line, or with the save function of a web browser. The following downloads the qmd file of this tutorial via download.file from the R console.

download.file("https://raw.githubusercontent.com/tgirke/GEN242/main/slides/rbasics/rbasics_slides.qmd", "rbasics.qmd")

Step 2. Load *.qmd, *.Rmd or *.R file in Nvim-R or RStudio.

Step 3. Send code from the code editor to the R console by pressing Ctrl + Enter in RStudio or Enter in Nvim-R. In *.qmd and *.Rmd files the code lines are organized in so-called code chunks, and only those can be sent to the console. In Neovim, a connected R session first has to be started by pressing the \rf key combination. For details see the Nvim-R-Tmux tutorial.

Getting Around

Starting and closing R

q()                    # quit R
# Save workspace image? [y/n/c]:

Warning

Answer n when asked to save the workspace. Saving .RData creates large files. Better practice: save your analysis as an R script and re-run it to restore your session.

Navigating directories

ls()                              # list objects in current R session
dir()                             # list files in current working directory
getwd()                           # print path of current working directory
setwd("/home/user")               # change working directory

File information

list.files(path="./", pattern="\\.txt$", full.names=TRUE)  # list files by pattern (regular expression, not a glob)
file.exists(c("file1", "file2"))                            # check if files exist
file.info(list.files(path="./", pattern="\\.txt$", full.names=TRUE))  # file details
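The pattern argument of list.files is a regular expression, not a shell glob, so a literal dot has to be escaped. The following is a minimal sketch of the difference; the scratch directory and the three file names are hypothetical and exist only for this illustration.

```r
# Create a scratch directory with three hypothetical files
tmp <- file.path(tempdir(), "pattern_demo")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(tmp, c("a.txt", "b.txt.bak", "ctxt"))))

# Unescaped dot matches any character, so "ctxt" is returned as well
loose <- list.files(tmp, pattern = ".txt$")

# Escaped dot matches only names ending in a literal ".txt"
strict <- list.files(tmp, pattern = "\\.txt$")

# glob2rx() translates a shell-style glob into an equivalent regular expression
from_glob <- list.files(tmp, pattern = glob2rx("*.txt"))

print(loose)
print(strict)
```

If glob-style patterns feel more natural, wrapping them in glob2rx() avoids writing the escapes by hand.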

Basic Syntax

Assignment and general syntax

object <- ...                          # assignment operator (preferred over =)
object <- function_name(arguments)     # call a function
object <- object[arguments]            # subset an object
assign("x", function_name(arguments))  # alternative: assign()

Pipes

The %>% pipe from dplyr/magrittr chains operations left to right. The native R pipe |>, available since R 4.1.0, works similarly without extra packages.

x %>% f(y)    # equivalent to f(x, y)

Makes code readable by avoiding deeply nested calls. Details in the dplyr tutorial.
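To make the readability argument concrete, here is a small sketch that writes the same calculation as a nested call and as a pipeline. It uses the native pipe and therefore assumes R 4.1.0 or later; the input vector is made up for the example.

```r
x <- c(4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Native pipe: the same steps read left to right
piped <- x |> sqrt() |> mean() |> round(1)

identical(nested, piped)
```

With magrittr loaded, replacing |> by %>% in the pipeline should give the same result.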

Getting help

?function_name       # open help page for a function

Run scripts

Preferred version

Rscript my_script.R        # execute from command-line (preferred)

Older alternatives

source("my_script.R")      # execute R script from within R
R CMD BATCH my_script.R    # alternative

Data Types

Numeric

x <- c(1, 2, 3)
x

[1] 1 2 3

is.numeric(x)

[1] TRUE

as.character(x)    # convert to character

[1] "1" "2" "3"

Character

x <- c("1", "2", "3")
x

[1] "1" "2" "3"

is.character(x)

[1] TRUE

as.numeric(x)      # convert to numeric

[1] 1 2 3

Mixed types (coerced to character)

c(1, "b", 3)       # numeric values coerced to character

[1] "1" "b" "3"
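Coercion also matters in the opposite direction. The sketch below, using the same toy values, shows that converting a mixed vector back to numeric turns unparsable entries into NA (normally with a warning, which is suppressed here on purpose).

```r
mixed <- c(1, "b", 3)
class(mixed)                                   # the whole vector is character

# "b" cannot be parsed as a number and becomes NA
nums <- suppressWarnings(as.numeric(mixed))
is.na(nums)

# Sum of the values that could be converted
sum(nums, na.rm = TRUE)
```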

Logical

x <- 1:10 < 5
x                  # TRUE/FALSE vector

 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

!x                 # negate

 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

which(x)           # indices of TRUE values

[1] 1 2 3 4

Data Objects — Overview

Common object types

Type	Dimensions	Data types	Example
`vector`	1D	uniform	`c(1, 2, 3)`
`factor`	1D	grouping labels	`factor(c("a","b","a"))`
`matrix`	2D	uniform	`matrix(1:9, 3, 3)`
`data.frame`	2D	mixed	`data.frame(x=1:3, y=c("a","b","c"))`
`tibble`	2D	mixed	modern `data.frame`
`list`	any	any	`list(name="Fred", age=30)`
`function`	—	code	`function(x) x^2`

Naming rules

Object names should not start with a number
Avoid spaces and special characters like # in names
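As a quick illustration of these rules, base R's make.names() converts arbitrary strings into syntactically valid names. The input strings below are invented examples.

```r
raw_names <- c("1st sample", "gene#1", "valid_name")

# Invalid characters are replaced by "." and an "X" is prepended
# when the name would otherwise start with a digit
clean_names <- make.names(raw_names)
print(clean_names)

# Non-syntactic names remain usable, but only when quoted with backticks
`1st sample` <- 42
print(`1st sample`)
```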

Vectors and Factors

Vectors (1D, uniform type)

myVec <- setNames(1:10, letters[1:10])   # named numeric vector
myVec[1:5]                                # subset by position

a b c d e 
1 2 3 4 5

myVec[c(2,4,6,8)]                        # subset by multiple positions

b d f h 
2 4 6 8

myVec[c("b", "d", "f")]                  # subset by name

b d f 
2 4 6

Factors (1D, grouping information)

factor(c("dog", "cat", "mouse", "dog", "dog", "cat"))

[1] dog   cat   mouse dog   dog   cat  
Levels: cat dog mouse

Factors encode categorical variables with defined levels — essential for statistical modeling.
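The following sketch looks inside the factor from the example above: the levels are the sorted unique labels, and the data are stored as integer codes pointing into them. The reordered factor animals2 is an illustrative addition.

```r
animals <- factor(c("dog", "cat", "mouse", "dog", "dog", "cat"))

levels(animals)        # sorted unique labels
as.integer(animals)    # integer codes pointing into levels
table(animals)         # counts per level

# Levels can be given in an explicit order, e.g. to control plot or model order
animals2 <- factor(animals, levels = c("mouse", "cat", "dog"))
levels(animals2)
```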

Matrices and Data Frames

Matrices (2D, uniform type)

myMA <- matrix(1:30, 3, 10, byrow=TRUE)
class(myMA)

[1] "matrix" "array"

myMA[1:2, ]                  # first two rows

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]   11   12   13   14   15   16   17   18   19    20

myMA[1, , drop=FALSE]        # first row, keep matrix structure

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10

class(as.data.frame(myMA))   # convert to data.frame

[1] "data.frame"

Data Frames (2D, mixed types)

myDF <- data.frame(Col1=1:10, Col2=10:1)
myDF[1:2, ]

  Col1 Col2
1    1   10
2    2    9

class(as.matrix(myDF))       # convert to matrix

[1] "matrix" "array"

Tibbles — modern data frames

library(tidyverse)
as_tibble(iris)              # nicer printing, same structure as data.frame

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

Tip

The iris dataset is built into R — no import needed. It is used throughout these examples.

Lists and Functions

Lists (containers for any object type)

myL <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
myL

$name
[1] "Fred"

$wife
[1] "Mary"

$no.children
[1] 3

$child.ages
[1] 4 7 9

myL[[4]][1:2]     # access fourth element, first two values

[1] 4 7

Lists are the most flexible R object — they can hold vectors, data frames, other lists, and functions all at once.

Functions (reusable pieces of code)

myfct <- function(arg1, arg2, ...) {
    function_body
}
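The template above can be filled in as follows. This is only an illustrative sketch: the function name cv and its na.rm argument are hypothetical choices, not part of the course material.

```r
# Hypothetical helper: coefficient of variation of a numeric vector
cv <- function(x, na.rm = FALSE) {
    if (!is.numeric(x)) stop("x must be numeric")
    sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(c(1, 2, 3))             # call with a plain vector
sapply(iris[, 1:4], cv)    # apply to each numeric column of iris
```

The last expression of the function body is returned automatically, so an explicit return() is optional here.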

Subsetting Data Objects

1. By position

myVec <- 1:26; names(myVec) <- LETTERS
myVec[1:4]          # first four elements

A B C D 
1 2 3 4

myVec[-(1:4)]       # everything except first four

 E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z 
 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

2. By logical vector

myLog <- myVec > 10
myVec[myLog]        # elements where condition is TRUE

 K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z 
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

3. By name

myVec[c("B", "K", "M")]

 B  K  M 
 2 11 13

4. By `$` sign (single column or list component)

iris$Species[1:8]

[1] setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

Subsetting 2D objects

iris[1:4, ]                          # first 4 rows, all columns

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

iris[1:4, 1:2]                       # first 4 rows, first 2 columns

  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1

iris[iris$Species=="setosa", ]       # rows matching a condition

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
17          5.4         3.9          1.3         0.4  setosa
18          5.1         3.5          1.4         0.3  setosa
19          5.7         3.8          1.7         0.3  setosa
20          5.1         3.8          1.5         0.3  setosa
21          5.4         3.4          1.7         0.2  setosa
22          5.1         3.7          1.5         0.4  setosa
23          4.6         3.6          1.0         0.2  setosa
24          5.1         3.3          1.7         0.5  setosa
25          4.8         3.4          1.9         0.2  setosa
26          5.0         3.0          1.6         0.2  setosa
27          5.0         3.4          1.6         0.4  setosa
28          5.2         3.5          1.5         0.2  setosa
29          5.2         3.4          1.4         0.2  setosa
30          4.7         3.2          1.6         0.2  setosa
31          4.8         3.1          1.6         0.2  setosa
32          5.4         3.4          1.5         0.4  setosa
33          5.2         4.1          1.5         0.1  setosa
34          5.5         4.2          1.4         0.2  setosa
35          4.9         3.1          1.5         0.2  setosa
36          5.0         3.2          1.2         0.2  setosa
37          5.5         3.5          1.3         0.2  setosa
38          4.9         3.6          1.4         0.1  setosa
39          4.4         3.0          1.3         0.2  setosa
40          5.1         3.4          1.5         0.2  setosa
41          5.0         3.5          1.3         0.3  setosa
42          4.5         2.3          1.3         0.3  setosa
43          4.4         3.2          1.3         0.2  setosa
44          5.0         3.5          1.6         0.6  setosa
45          5.1         3.8          1.9         0.4  setosa
46          4.8         3.0          1.4         0.3  setosa
47          5.1         3.8          1.6         0.2  setosa
48          4.6         3.2          1.4         0.2  setosa
49          5.3         3.7          1.5         0.2  setosa
50          5.0         3.3          1.4         0.2  setosa

Important Utilities

Combining objects

c(1, 2, 3)

[1] 1 2 3

x <- 1:3; y <- 101:103
c(x, y)                   # concatenate vectors

[1]   1   2   3 101 102 103

ma <- cbind(x, y)         # bind as columns
rbind(ma, ma)             # bind as rows

     x   y
[1,] 1 101
[2,] 2 102
[3,] 3 103
[4,] 1 101
[5,] 2 102
[6,] 3 103

Dimensions and names

length(iris$Species)      # number of elements

[1] 150

dim(iris)                 # rows x columns

[1] 150   5

rownames(iris)[1:8]

[1] "1" "2" "3" "4" "5" "6" "7" "8"

colnames(iris)

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

names(myL)                # names of list components

[1] "name"        "wife"        "no.children" "child.ages"

Sorting

sort(10:1)
sortindex <- order(iris[,1], decreasing=FALSE)
iris[sortindex, ][1:2, ]
iris[order(iris$Sepal.Length, iris$Sepal.Width), ][1:2, ]  # sort by multiple columns
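Since the sorting commands above are shown without output, a small sketch on a made-up named vector may help to see the difference between sort(), which returns the sorted values, and order(), which returns the positions that would sort them.

```r
v <- c(b = 30, a = 10, c = 20)

sorted <- sort(v)                       # values in increasing order
idx <- order(v)                         # positions that would sort v
desc <- v[order(v, decreasing = TRUE)]  # use the index to reorder

print(sorted)
print(idx)
print(desc)
```

Because order() returns an index, it can reorder the rows of a data frame, which sort() cannot do.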

Checking identity

myma <- iris[1:2,]
all(myma == iris[1:2,])       # all values equal?

[1] TRUE

identical(myma, iris[1:2,])   # strict identity?

[1] TRUE
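For floating point numbers, exact comparisons can fail because of rounding error. The sketch below uses the standard 0.1 + 0.2 example and all.equal(), which compares within a small tolerance.

```r
a <- 0.1 + 0.2
b <- 0.3

exact <- a == b                     # exact comparison is affected by rounding error
strict <- identical(a, b)           # strict identity fails for the same reason
tolerant <- isTRUE(all.equal(a, b)) # comparison within a numeric tolerance

print(c(exact = exact, strict = strict, tolerant = tolerant))
```

Wrapping all.equal() in isTRUE() is advisable because it returns a descriptive string rather than FALSE when the values differ.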

Operators and Calculations

Comparison operators

1 == 1    # equal

[1] TRUE

1 != 2    # not equal

[1] TRUE

# also: <, >, <=, >=

Logical operators

x <- 1:10; y <- 10:1
x > y & x > 5    # AND

 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

x > y | x > 5    # OR

 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

!x                # NOT (numeric x is coerced to logical: nonzero values count as TRUE)

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Basic calculations

x + y

 [1] 11 11 11 11 11 11 11 11 11 11

sum(x)

[1] 55

mean(x)

[1] 5.5

apply(iris[1:6, 1:3], 1, mean)    # row means (margin=1)

       1        2        3        4        5        6 
3.333333 3.100000 3.066667 3.066667 3.333333 3.666667

apply(iris[1:6, 1:3], 2, mean)    # column means (margin=2)

Sepal.Length  Sepal.Width Petal.Length 
    4.950000     3.383333     1.450000
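For the common case of means, base R also provides rowMeans() and colMeans(). The following sketch checks on the same iris subset that they agree with the apply() calls above.

```r
m <- as.matrix(iris[1:6, 1:3])

row_apply <- apply(m, 1, mean)   # general approach, works with any function
row_fast <- rowMeans(m)          # specialized and usually faster

col_apply <- apply(m, 2, mean)
col_fast <- colMeans(m)

all.equal(row_apply, row_fast)
all.equal(col_apply, col_fast)
```

apply() remains the more general tool, since it accepts arbitrary functions such as sd or a user-defined one.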

Reading and Writing Data

Import tabular data

Widely used read.table and read.delim import functions

myDF <- read.delim("myData.tsv", sep="\t")           # tab-delimited file

A better alternative is read_tsv from the readr package, which has better default arguments and performance. For details see the readr documentation.

myTibble <- readr::read_tsv("myData.tsv")

Import from Google Sheet directly

library(googlesheets4)
gs4_deauth()                                           # for public sheets
mysheet <- read_sheet("1U-32UcwZP1k3saKeaH1mbvEAOfZRdNHNkWK2GI1rpPM", skip=4)
myDF <- as.data.frame(mysheet)

library(readxl)
mysheet <- read_excel(targets_path, sheet="Sheet1")   # Excel files

Export tabular data

write.table(myDF, file="myfile.xls", sep="\t", quote=FALSE, col.names=NA)

Line-wise import/export

myDF <- readLines("myData.txt")           # import line by line
writeLines(month.name, "myData.txt")      # export line by line

Save and load R objects

mylist <- list(C1=iris[,1], C2=iris[,2])
saveRDS(mylist, "mylist.rds")             # save
mylist <- readRDS("mylist.rds")           # load
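A round trip through saveRDS and readRDS should restore the object unchanged. The sketch below verifies this with a temporary file; the path is a scratch location chosen for the example.

```r
mylist <- list(C1 = iris[, 1], C2 = iris[, 2])

rds_file <- tempfile(fileext = ".rds")   # scratch path for this example
saveRDS(mylist, rds_file)                # write a single object to disk
restored <- readRDS(rds_file)            # read it back under any name

identical(mylist, restored)              # check that nothing changed
unlink(rds_file)                         # clean up
```

Unlike text exports, this preserves the R object structure, including names and data types.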

Note

HW02 — Task A: Sort iris by first column, subset first 12 rows, export to file, modify column names in a spreadsheet program, re-import with read.table.
→ HW02 instructions

Useful R Functions

Unique entries

length(iris$Sepal.Length)          # 150 total entries

[1] 150

length(unique(iris$Sepal.Length))  # number of unique values

[1] 35

Count occurrences

table(iris$Species)    # frequency table per group


    setosa versicolor  virginica 
        50         50         50

Aggregate statistics

aggregate(iris[,1:4], by=list(iris$Species), FUN=mean, na.rm=TRUE)

     Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

Set operations

month.name %in% c("May", "July")    # logical: which elements are in set

 [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
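Besides %in%, base R offers intersect(), union() and setdiff() for the usual set operations. A short sketch with two made-up vectors:

```r
a <- c("Jan", "Feb", "Mar", "Apr")
b <- c("Mar", "Apr", "May")

common <- intersect(a, b)   # elements present in both
either <- union(a, b)       # elements present in at least one
only_a <- setdiff(a, b)     # elements of a that are not in b

# %in% returns a logical vector, which can be used for subsetting
via_in <- a[a %in% b]

print(common); print(either); print(only_a)
```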

Merge data frames

frame1 <- iris[sample(1:nrow(iris), 30), ]
my_result <- merge(frame1, iris, by.x=0, by.y=0, all=TRUE)
# all=TRUE: outer join (keep all rows)
# all=FALSE: inner join (keep only common rows)
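The effect of the all arguments is easiest to see on small tables. The sketch below uses two invented data frames (the IDs and values are hypothetical) to compare inner, outer and left joins.

```r
left <- data.frame(ID = c("g1", "g2", "g3"), MW = c(100, 200, 300))
right <- data.frame(ID = c("g2", "g3", "g4"), Loc = c("C", "M", "S"))

inner <- merge(left, right, by = "ID")                      # only IDs found in both
outer <- merge(left, right, by = "ID", all = TRUE)          # all IDs, NA where no match
left_join <- merge(left, right, by = "ID", all.x = TRUE)    # all rows of the first table

nrow(inner); nrow(outer); nrow(left_join)
print(outer)
```

Rows without a partner in the other table are filled with NA, which is why the outer and left joins usually need NA handling afterwards.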

Graphics in R — Overview

Why R graphics?

Powerful environment for scientific visualization
Integrated with statistics infrastructure
Publication-quality, fully reproducible output
Supports LaTeX and Markdown via knitr

Four main graphics systems

System	Level	Package
Base R graphics	Low + high	built-in
grid	Low-level	built-in
lattice	High-level	`lattice`
ggplot2	High-level	`ggplot2`

Key base graphics functions

plot, barplot, boxplot, hist, pie, pairs, image, heatmap

Tip

For new code, ggplot2 is generally recommended. Base R graphics remain useful for quick exploration and highly customized plots.

Scatter Plots

Sample dataset

set.seed(1410)
y <- matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))

Basic scatter plot

plot(y[,1], y[,2])

All pairs

pairs(y)

With color and labels

plot(y[,1], y[,2], pch=20, col="red", main="Symbols and Labels")
text(y[,1]+0.03, y[,2], rownames(y))

Add regression line

plot(y[,1], y[,2])
myline <- lm(y[,2] ~ y[,1])
abline(myline, lwd=2)

summary(myline)


Call:
lm(formula = y[, 2] ~ y[, 1])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40357 -0.17912 -0.04299  0.22147  0.46623 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.5764     0.2110   2.732   0.0258 *
y[, 1]       -0.3647     0.3959  -0.921   0.3839  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3095 on 8 degrees of freedom
Multiple R-squared:  0.09589,   Adjusted R-squared:  -0.01712 
F-statistic: 0.8485 on 1 and 8 DF,  p-value: 0.3839

Important plot parameters

Argument	Description
`col`	color of symbols
`pch`	symbol type (`example(points)` to see options)
`lwd`	line/symbol width
`cex.*`	font size controls
`mar`	margin sizes `c(bottom, left, top, right)`
`log="xy"`	log scale on both axes

Note

HW02 — Task B: Generate a scatter plot of iris columns 1 and 2, colored by Species. Use xlim/ylim to restrict data to the bottom-left quadrant.
→ HW02 instructions

Bar Plots, Histograms and More

Bar plot with legend

barplot(y[1:4,], ylim=c(0, max(y[1:4,])+0.3), beside=TRUE, legend=letters[1:4])

Tip

When input is a matrix, barplot uses column names as group labels and row names as within-group labels. Convert data.frame input with as.matrix() first.

Bar plot with error bars

bar <- barplot(m <- rowMeans(y) * 10, ylim=c(0, 10))
stdev <- apply(y, 1, sd) * 10   # per-row SD on the same x10 scale as m; sd(t(y)) would return a single value
arrows(bar, m, bar, m + stdev, length=0.15, angle=90)

Histogram and density plot

hist(y, freq=TRUE, breaks=10)

plot(density(y), col="red")

Save graphics to file

pdf("test.pdf")
plot(1:10, 1:10)
dev.off()         # always close the device!

Works the same for jpeg(), png(), svg(), tiff().

Note

HW02 — Task C: Calculate mean values per Species for first four iris columns. Organize as a matrix. Generate stacked and horizontally arranged bar plots.
→ HW02 instructions

Analysis Routine — Data Import

A step-by-step workflow using two sample biological datasets. This analysis routine is used by HW02 Tasks D-H.

Step 1 — Download sample data

Open in Excel, save as tab-delimited text, then import:

my_mw <- read.delim(file="MolecularWeight_tair7.xls", header=TRUE, sep="\t")
my_mw[1:2,]
my_target <- read.delim(file="TargetP_analysis_tair7.xls", header=TRUE, sep="\t")
my_target[1:2,]

Or import directly from the web:

my_mw <- read.delim("https://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/MolecularWeight_tair7.xls",
                     header=TRUE, sep="\t")
my_target <- read.delim("https://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/TargetP_analysis_tair7.xls",
                          header=TRUE, sep="\t")

Analysis Routine — Merging Data Frames

Step 2 — Assign uniform ID column names

colnames(my_target)[1] <- "ID"
colnames(my_mw)[1] <- "ID"

Step 3 — Merge on common ID field (left join: all.x=TRUE keeps all rows of my_mw)

my_mw_target <- merge(my_mw, my_target, by.x="ID", by.y="ID", all.x=TRUE)

Step 4 — Merge shortened table, then remove non-matching rows

my_mw_target2a <- merge(my_mw, my_target[1:40,], by.x="ID", by.y="ID", all.x=TRUE)
my_mw_target2 <- na.omit(my_mw_target2a)    # remove rows with NAs

Note

HW02 — Task D: Execute merge to return only common rows directly (without na.omit). Prove both methods return identical results.
HW02 — Task E: Replace all NA values in my_mw_target2a with zeros.

Analysis Routine — Filtering and String Operations

Step 5 — Filter rows by conditions

# Proteins with MW > 100,000 AND targeted to chloroplast (Loc == "C")
query <- my_mw_target[my_mw_target[,2] > 100000 & my_mw_target[,4] == "C", ]
query[1:4, ]
dim(query)

Note

HW02 — Task F: How many proteins have MW > 4,000 and < 5,000? Subset and sort by MW to verify.

Step 6 — Remove gene model extensions with regex

# AT1G01010.1 → AT1G01010  (remove everything from . onward)
my_mw_target3 <- data.frame(
    loci = gsub("\\..*", "", as.character(my_mw_target[,1]), perl=TRUE),
    my_mw_target
)
my_mw_target3[1:3, 1:8]
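To see what the regular expression does in isolation, here is a sketch on a few gene model IDs. The IDs are illustrative, and the anchored alternative with sub() is an addition that is not part of the routine above.

```r
ids <- c("AT1G01010.1", "AT1G01020.2", "ATCG00010")

# "\\." matches a literal dot, ".*" matches everything after it
loci <- gsub("\\..*", "", ids)

# Stricter variant: remove only a trailing ".<digits>" suffix
loci2 <- sub("\\.[0-9]+$", "", ids)

print(loci)
identical(loci, loci2)
```

IDs without a dot are returned unchanged, so the operation is safe to apply to a mixed column.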

Note

HW02 — Task G: Retrieve rows where second column contains specific IDs using %in%. Also use the second column as a row index and repeat. Explain the difference between the two approaches.

Analysis Routine — Calculations and Export

Step 7 — Count duplicates

mycounts <- table(my_mw_target3[,1])[my_mw_target3[,1]]
my_mw_target4 <- cbind(my_mw_target3, Freq=mycounts[as.character(my_mw_target3[,1])])
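The counting idiom can be tried on a toy vector first. In this sketch (locus names invented for the example), table() counts each value and indexing the table by the original vector maps the counts back to every row.

```r
loci <- c("AT1G01010", "AT1G01020", "AT1G01020", "AT1G01030")

counts <- table(loci)              # one count per unique locus
freq <- as.vector(counts[loci])    # look up the count for every element

toy <- data.frame(loci, Freq = freq)
print(toy)

# Rows whose locus occurs more than once
toy[toy$Freq > 1, ]
```

Converting with as.vector() drops the table attributes, so the result is a plain integer column.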

Step 8 — Vectorized calculation (average AA weight)

data.frame(my_mw_target4, avg_AA_WT=(my_mw_target4[,3] / my_mw_target4[,4]))[1:2,]

Step 9 — Row-wise mean and standard deviation

mymean  <- apply(my_mw_target4[,6:9], 1, mean)
mystdev <- apply(my_mw_target4[,6:9], 1, sd, na.rm=TRUE)
data.frame(my_mw_target4, mean=mymean, stdev=mystdev)[1:2, 5:12]

Step 10 — Scatter plot

plot(my_mw_target4[1:500, 3:4], col="red")

Step 11 — Export results

write.table(my_mw_target4, file="my_file.xls", quote=FALSE, sep="\t", col.names=NA)

Note

HW02 — Task H: Assemble all commands from this exercise into HW2.R and run it:

source("HW2.R")    # from within R

Rscript HW2.R      # from command-line

HW02 Summary

Assemble all solutions into a single R script HW2.R and submit via GitHub.

Task	Topic	Key functions
A	Sort `iris`, export, modify columns, re-import	`order`, `write.table`, `read.table`
B	Scatter plot `iris` col 1-2, colored by Species	`plot`, `xlim`, `ylim`
C	Mean matrix by Species, stacked & horizontal bars	`aggregate`, `barplot`
D	Merge returning only common rows; prove equivalence	`merge(all=FALSE)`, `all()`
E	Replace NAs with zeros	`is.na`, indexing
F	Filter proteins by MW range 4,000–5,000	boolean indexing
G	Subset rows by ID using `%in%` and row index	`%in%`, `rownames`
H	Assemble all code into `HW2.R`, run with `source()`	`source`, `Rscript`

Submission path

Homework/HW2/HW2.R

Due: Thu, April 16th at 6:00 PM

Note

The preassembled workflow script for Task H is available here — it does not include solutions for Tasks A–C.
