Introduction to R
Overview
What is R?
R is a powerful statistical environment and programming language for the analysis and visualization of data. The associated Bioconductor and CRAN package repositories provide many additional R packages for statistical data analysis for a wide array of research areas. The R software is free and runs on all common operating systems.
Why Use R?
- Complete statistical environment and programming language
- Efficient functions and data structures for data analysis
- Powerful graphics
- Access to a fast-growing number of analysis packages
- One of the most widely used languages in bioinformatics
- A standard for data mining and biostatistical analysis
- Technical advantages: free, open-source, available for all OSs
Books and Documentation
R Working Environments
Some R working environments with support for syntax highlighting and utilities to send code to the R console:
- RStudio/Posit Desktop: excellent choice for beginners (Cheat Sheets)
- RStudio/Posit Server: web-based UI for RStudio. Available at UCR via onDemand (old standalone web instance will be discontinued).
- RStudio/Posit Cloud: cloud-based RStudio Server
- Vim-R-Tmux: R working environment based on vim and tmux.
- Emacs (ESS add-on package)
- gedit, Rgedit, RKWard, Eclipse, Tinn-R, Notepad++, NppToR
Example: RStudio
Integrated development environment (IDE) for R. Highly functional for both beginners and advanced users.
Some useful shortcuts: Ctrl+Enter (send code), Ctrl+Shift+C (comment/uncomment), Ctrl+1/2 (switch window focus)
Example: Nvim-R-Tmux
Terminal-based Working Environment for R: Nvim-R-Tmux.
Install and usage instructions for Nvim-R are provided in this slide show and this tutorial. The most detailed instructions can be found on the Nvim-R_Tmux GitHub repos.
R Package Repositories
Working routine for tutorials
When working in R, a good practice is to write all commands directly into an R script, instead of the R console, and then send the commands for execution to the R console with the Ctrl+Enter shortcut in RStudio/Posit, or similar shortcuts in other R coding environments, such as Nvim-R. This way all work is preserved and can be reused in the future.
The following instructions in this section provide a short overview of the standard working routine users should use to load R-based tutorials of this website into an R IDE (Nvim-R or RStudio). For Nvim-R on HPCC users can visit the Quick Demo slide here.
Step 1. Download the *.Rmd or *.R file. These so-called source files are always linked in the top right corner of each tutorial. The ones for this tutorial are here. The file download can be accomplished via download.file from within R (see below), wget from the command-line or with the save function in a user’s web browser. The following downloads the Rmd file of this tutorial via download.file from the R console.
Step 2. Load the *.Rmd or *.R file in Neovim (Nvim-R) or RStudio.
Step 3. Send code from the code editor to the R console by pressing the space bar in Neovim (Nvim-R) or Ctrl + Enter in RStudio. In *.Rmd files the code lines are in so-called code chunks, and only those can be sent to the console. To obtain a connected R session in Neovim, one has to initiate it by pressing the \rf key combination. For details see here.
Installation of R, RStudio and R Packages
Install R for your operating system from CRAN.
Install RStudio from RStudio.
Install CRAN Packages from R console like this:
Install Bioconductor packages as follows:
For more details consult the Bioc Install page and BiocManager package.
Instructions for upgrading R and packages to newer versions are given at the end of this tutorial here.
Getting Around
Startup and Closing Behavior
Starting R: The R GUI versions, including RStudio, under Windows and Mac OS X can be opened by double-clicking their icons. Alternatively, one can start it by typing R in a terminal (default under Linux).
Startup/Closing Behavior: The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional).
Closing R:
q()
Save workspace image? [y/n/c]:
- Note: When responding with y, the entire R workspace will be written to the .RData file, which can become very large. Often it is better to select n here, because a much better working practice is to save an analysis protocol to an R or Rmd source file. This way one can quickly regenerate all data sets and objects needed in a future session.
Basic Syntax
Create an object with the assignment operator <- or =
General R command syntax
Instead of the assignment operator one can use the assign function
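As a minimal sketch of the three points above (assignment operators, the general command pattern and the assign function), the following uses made-up object names that are not part of the tutorial:

```r
# Assignment with the two operators
a <- 10          # '<-' is the conventional assignment operator
b = 20           # '=' also works at the top level

# General command pattern: object <- function_name(arguments)
s <- sum(a, b)

# Equivalent assignment with assign(); the object name is given as a string
assign("total", a + b)

s      # 30
total  # 30
```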
To simplify chaining of several operations, dplyr (magrittr) provides the %>% (pipe) operator, where x %>% f(y) turns into f(x, y). This way one can pipe together multiple operations by writing them from left-to-right or top-to-bottom. This makes for easy to type and readable code. Details on this are provided in the dplyr tutorial here.
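The piping idea can be sketched without installing any packages by using the native pipe |> that base R provides since version 4.1. This is a stand-in for %>% and not the dplyr/magrittr operator itself, but for simple calls like the ones below both rewrite x |> f(y) as f(x, y).

```r
x <- c(4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: reads from left to right
piped <- x |> sqrt() |> mean() |> round(1)

identical(nested, piped)  # TRUE
```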
Finding help
Load one or more R packages (libraries)
List functions defined by a library
Load library manual (PDF or HTML file)
Execute an R script from within R
Execute an R script from command-line (the first of the three options is preferred)
Data Types
Numeric data
Example: 1, 2, 3, ...
Character data
Example: "a", "b", "c", ...
Complex data
Example: mix of both, e.g. c(1, "b", 3); such a mixed vector is stored as character data
Logical data
Example: TRUE or FALSE
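The data types listed above can be inspected with the class function. The following is a small sketch with made-up vectors:

```r
num   <- c(1, 2, 3)
chr   <- c("a", "b", "c")
lgl   <- c(TRUE, FALSE, TRUE)
mixed <- c(1, "b", 3)   # the numbers are coerced to character strings

class(num)    # "numeric"
class(chr)    # "character"
class(lgl)    # "logical"
class(mixed)  # "character"
```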
Data Objects
Object types
- List of common object types
- vectors: ordered collection of numeric, character, complex and logical values.
- factors: special type vectors with grouping information of its components
- data.frames including modern variants DataFrame, tibbles, etc.: two dimensional structures with different data types
- matrices: two dimensional structures with data of same type
- arrays: multidimensional arrays of vectors
- lists: general form of vectors with different types of elements
- functions: piece of code
- Many more …
- Simple rules for naming objects and their components
- Object, row and column names should not start with a number
- Avoid spaces in object, row and column names
- Avoid special characters like ‘#’
Vectors (1D)
Definition: numeric or character
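A short sketch of creating and accessing a vector; the object name myVec and its content are illustrative choices:

```r
myVec <- 1:10                   # numeric (integer) vector
names(myVec) <- letters[1:10]   # optional name for each element

myVec[1:5]                      # first five elements
myVec[c(2, 4, 6, 8)]            # elements at selected positions
myVec[c("b", "d", "f")]         # elements selected by name
```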
Factors (1D)
Definition: vectors with grouping information
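As an illustration with made-up values, a factor stores each distinct value once as a level and keeps the grouping of the elements:

```r
grp <- factor(c("dog", "cat", "mouse", "dog", "dog", "cat"))

levels(grp)      # "cat" "dog" "mouse"
table(grp)       # number of elements per level
as.integer(grp)  # internal integer codes pointing to the levels
```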
Matrices (2D)
Definition: two dimensional structures with data of same type
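The commands below are a sketch of how output like that printed next could be generated. The object name myMA and the number of rows (3) are assumptions; only the first two rows and ten columns are visible in the printed output.

```r
# Fill a 3 x 10 matrix row by row with the numbers 1 to 30
myMA <- matrix(1:30, nrow = 3, ncol = 10, byrow = TRUE)

class(myMA)
myMA[1:2, ]               # first two rows
myMA[1, , drop = FALSE]   # first row, kept as a one-row matrix
```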
[1] "matrix" "array"
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]   11   12   13   14   15   16   17   18   19    20
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[1] "data.frame"
Data Frames (2D)
Definition: data.frames are two dimensional objects with data of variable types
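A minimal sketch with an invented data frame that mixes integer and character columns:

```r
myDF <- data.frame(Col1 = 1:10, Col2 = 10:1, Col3 = letters[1:10])

class(myDF)          # "data.frame"
dim(myDF)            # 10 rows, 3 columns
myDF[1:2, ]          # first two rows
sapply(myDF, class)  # data type of each column
```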
Tibbles
Tibbles are a more modern version of data.frames. Among many other advantages, one can see here that tibbles have a nicer printing behavior. Much more detailed information on this object class is provided in the dplyr/tidyverse manual section.
Note: The above example uses the iris test dataset that is available in every R installation without explicitly importing or loading it. The following examples will often make use of this dataset.
Arrays
Definition: data structure with one, two or more dimensions
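For illustration, a three-dimensional array with arbitrarily chosen dimensions:

```r
# 2 rows x 3 columns x 4 layers, filled column-wise with 1 to 24
myArray <- array(1:24, dim = c(2, 3, 4))

dim(myArray)
myArray[1, 2, 3]   # row 1, column 2, layer 3
myArray[, , 1]     # first layer as a 2 x 3 matrix
```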
Lists
Definition: containers for any object type
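A small sketch of a list holding components of different types; the names and values are invented:

```r
myL <- list(name = "Fred", married = TRUE, child.ages = c(4, 7, 9))

myL$name            # access a component by name
myL[[3]]            # access a component by position
myL[[3]][2]         # second element of the third component
length(myL)         # number of components
```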
Functions
Definition: piece of code
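As a minimal example, the following defines and calls a function. The name myfct and its arguments are hypothetical:

```r
# Raises x to the power y; y defaults to 2
myfct <- function(x, y = 2) {
    x^y
}

myfct(3)         # 9
myfct(3, y = 3)  # 27
```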
Subsetting of data objects
(1.) Subsetting by positive or negative index/position numbers
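A sketch of positional subsetting on a named vector and on the built-in iris data frame; myVec is an illustrative object:

```r
myVec <- 1:26
names(myVec) <- LETTERS

myVec[1:4]          # positive index: keep positions 1 to 4
myVec[-(1:4)]       # negative index: drop positions 1 to 4

iris[1:3, 1:2]      # rows 1-3 and columns 1-2
iris[-(1:148), ]    # drop rows 1-148, keep all columns
```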
(2.) Subsetting by same length logical vectors
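The following sketch assumes a named vector myVec with the values 1 to 26 and the letters A to Z as names. Subsetting it with a logical vector of the same length keeps the elements where the logical vector is TRUE.

```r
myVec <- 1:26
names(myVec) <- LETTERS

myLog <- myVec > 10   # logical vector with one TRUE/FALSE per element
myVec[myLog]          # returns only the elements larger than 10
```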
K L M N O P Q R S T U V W X Y Z
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
(3.) Subsetting by field names
(4.) Subset with $ sign: references a single column or list component by its name
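Subsetting by names and with the $ sign can be sketched as follows; myVec and myL are made-up objects, while iris ships with R:

```r
myVec <- 1:26
names(myVec) <- LETTERS
myVec[c("B", "K", "M")]                    # vector elements by name

iris[1:3, c("Sepal.Length", "Species")]    # columns by name
iris$Species[1:3]                          # single column via $

myL <- list(a = 1:3, b = "x")
myL$a                                      # single list component via $
```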
Important Utilities
Combining Objects
The c function combines vectors and lists
The cbind and rbind functions can be used to append columns and rows, respectively.
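A brief sketch of the three functions with invented values:

```r
v <- c(1, 2, 3)
w <- c(v, 4, 5)               # append elements to a vector

ma  <- cbind(a = 1:3, b = 4:6)   # combine vectors as columns
ma2 <- rbind(ma, c(10, 20))      # append a row

w
dim(ma2)                      # 4 rows, 2 columns
```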
Accessing Dimensions of Objects
Length and dimension information of objects
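For example, using the iris data set:

```r
length(iris$Species)   # number of elements in a vector: 150
dim(iris)              # rows and columns of a 2D object: 150 5
nrow(iris)             # number of rows only
ncol(iris)             # number of columns only
```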
Accessing Name Slots of Objects
Accessing row and column names of 2D objects
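Calls along the following lines return row and column names such as those printed below for the iris data set:

```r
rownames(iris)[1:8]   # first eight row names
colnames(iris)        # all column names
```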
[1] "1" "2" "3" "4" "5" "6" "7" "8"
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Return name field of vectors and lists
Sorting Objects
The function sort returns a vector in ascending or descending order.
The function order returns a sorting index for sorting an object alphanumerically.
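The difference between sort and order can be sketched with the first column of iris: sort returns the sorted values, whereas order returns the positions that would sort them. The last command should yield an index like the one printed below.

```r
sort(iris[, 1])[1:6]                      # six smallest values
sort(iris[, 1], decreasing = TRUE)[1:6]   # six largest values

idx <- order(iris[, 1])                   # sorting index (row positions)
iris[idx, ][1:3, ]                        # data frame sorted by column 1
idx[1:12]
```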
[1] 14 9 39 43 42 4 7 23 48 3 30 12
Sorting multiple columns
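When several vectors are passed to order, ties in the first one are broken by the next one. A sketch using iris, where the choice of columns is arbitrary:

```r
# Sort by Species (ascending), then by Sepal.Length (descending)
sorted <- iris[order(iris$Species, -iris$Sepal.Length), ]
head(sorted, 3)
```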
Check differences
To check whether the values in two objects are the same, one can use the == comparison operator. The all function tests whether all of these comparisons are TRUE. To check whether two objects are exactly identical, use the identical function.
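The three checks behave differently, as the following sketch with made-up vectors shows. Note that identical also compares the storage type, so a numeric and an integer vector with equal values are not identical.

```r
a <- c(1, 2, 3)
b <- c(1, 2, 3)
d <- c(1, 2, 4)

a == d              # element-wise: TRUE TRUE FALSE
all(a == b)         # TRUE
all(a == d)         # FALSE
identical(a, b)     # TRUE
identical(a, 1:3)   # FALSE: numeric versus integer
```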
Operators and Calculations
Comparison Operators
Comparison operators: ==, !=, <, >, <=, >=
Logical operators for boolean operations: AND: &, OR: |, NOT: !
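A short sketch combining comparison and logical operators on an example vector:

```r
x <- 1:10

x > 5                      # comparison returns a logical vector
x[x > 5 & x %% 2 == 0]     # AND: 6 8 10
x[x < 3 | x > 8]           # OR: 1 2 9 10
x[!(x > 2)]                # NOT: 1 2
```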
Basic Calculations
To look up math functions, see Function Index here
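A few basic calculations as a sketch; most math functions in R are vectorized and operate on every element of a vector:

```r
x <- 1:10

sum(x)       # 55
mean(x)      # 5.5
sd(x)        # standard deviation
sqrt(16)     # 4
log2(8)      # 3
x / 2        # element-wise division
```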
Reading and Writing External Data
Import of tabular data
Import of a tab-delimited tabular file
Import of Google Sheets. The following example imports a sample Google Sheet from here. Detailed instructions for interacting from R with Google Sheets with the required googlesheets4 package are here.
Import from Excel sheets works well with readxl. For details see the readxl package manual here. Note: working with tab- or comma-delimited files is more flexible and highly preferred for automated analysis workflows.
Additional import functions are described in the readr package section here.
Export of tabular data
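Export and import of tab-delimited files can be sketched as a round trip. The file is written to a temporary location here, which is an assumption of this example; in practice one would provide a file name such as "iris.txt".

```r
outfile <- tempfile(fileext = ".txt")   # hypothetical output file

# Export: tab-delimited, without quotes and without row names
write.table(iris, file = outfile, sep = "\t", quote = FALSE, row.names = FALSE)

# Import the same file again
myDF <- read.delim(outfile, sep = "\t")
dim(myDF)
myDF[1:3, ]
```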
Line-wise import
Line-wise export
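Line-wise export and import can be sketched with writeLines and readLines. The temporary file is an assumption made so that the example runs anywhere.

```r
txtfile <- tempfile(fileext = ".txt")   # hypothetical file

writeLines(month.name, txtfile)         # export: one element per line
myLines <- readLines(txtfile)           # import: one line per element
myLines[1:3]
```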
Export R object
Import R object
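One way to export R objects and import them again is the save and load pair, sketched here with made-up objects and a temporary file. load restores the objects under their original names.

```r
rdafile <- tempfile(fileext = ".RData")   # hypothetical file

x  <- 1:5
y2 <- letters[1:3]
save(x, y2, file = rdafile)   # export one or more objects

rm(x, y2)                     # remove them from the session
load(rdafile)                 # import: x and y2 exist again
x
```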
Copy and paste into R
On Windows/Linux systems
On Mac OS X systems
Copy and paste from R
On Windows/Linux systems
On Mac OS X systems
Homework 3A
Homework 3A: Object Subsetting Routines and Import/Export
Useful R Functions
Unique entries
Make vector entries unique with unique
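For example, with the Species column of iris:

```r
length(iris$Species)           # 150 entries
unique(iris$Species)           # each entry only once
length(unique(iris$Species))   # 3 distinct entries
```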
Count occurrences
Count occurrences of entries with table
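A sketch using iris; the result is a named table of counts that can be indexed by name:

```r
counts <- table(iris$Species)
counts
counts["setosa"]   # count for a single entry
```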
Aggregate data
Compute aggregate statistics with aggregate
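As an illustration, the following computes the mean of each numeric iris column per species. The choice of mean as summary function is arbitrary.

```r
agg <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
agg
```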
Intersect data
Compute intersect between two vectors with %in%
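A sketch with two made-up vectors; %in% returns a logical vector that can be used for subsetting:

```r
a <- month.name[1:6]
b <- c("March", "June", "December")

b %in% a        # TRUE TRUE FALSE
b[b %in% a]     # "March" "June"
```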
Merge data frames
Join two data frames by common field entries with merge (here row names by.x=0). To obtain only the common rows, change all=TRUE to all=FALSE. To merge on specific columns, refer to them by their position numbers or their column names.
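The effect of the all argument can be sketched with two small invented data frames that are joined by their row names:

```r
df1 <- data.frame(A = 1:4, row.names = c("g1", "g2", "g3", "g4"))
df2 <- data.frame(B = c(10, 30, 50), row.names = c("g1", "g3", "g5"))

# by.x = 0 and by.y = 0 refer to the row names of each data frame
all_rows <- merge(df1, df2, by.x = 0, by.y = 0, all = TRUE)    # union, gaps as NA
common   <- merge(df1, df2, by.x = 0, by.y = 0, all = FALSE)   # common rows only

all_rows
common
```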
Graphics in R
Advantages
- Powerful environment for visualizing scientific data
- Integrated graphics and statistics infrastructure
- Publication quality graphics
- Fully programmable
- Highly reproducible
- Full LaTeX and Markdown support via knitr and R markdown
- Vast number of R packages with graphics utilities
Documentation for R Graphics
General
- Graphics Task Page - URL
- R Graph Gallery - URL
- R Graphical Manual - URL
- Paul Murrell’s book R (Grid) Graphics - URL
Interactive graphics
Graphics Environments
Viewing and saving graphics in R
- On-screen graphics
- postscript, pdf, svg
- jpeg, png, wmf, tiff, …
Four major graphic environments
- Low-level infrastructure
- R Base Graphics (low- and high-level)
- grid: Manual
- High-level infrastructure
Base Graphics: Overview
Important high-level plotting functions
- plot: generic x-y plotting
- barplot: bar plots
- boxplot: box-and-whisker plot
- hist: histograms
- pie: pie charts
- dotchart: Cleveland dot plots
- image, heatmap, contour, persp: functions to generate image-like plots
- qqnorm, qqline, qqplot: distribution comparison plots
- pairs, coplot: display of multivariate data
Help on graphics functions
- ?myfct
- ?plot
- ?par
Preferred Object Types
- Matrices and data frames
- Vectors
- Named vectors
Scatter Plots
Basic Scatter Plot
Sample data set for subsequent plots
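A data set with the same layout as the one printed below (10 rows named a to j, 3 columns named A to C, uniform random values between 0 and 1) could be generated as follows. The seed value is an assumption, so the numbers may differ from the printed ones.

```r
set.seed(1410)   # assumed seed, only for reproducibility of this sketch
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))
y
```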
           A          B            C
a 0.26904539 0.47439030 0.4427788756
b 0.53178658 0.31128960 0.3233293493
c 0.93379571 0.04576263 0.0004628517
d 0.14314802 0.12066723 0.4104402000
e 0.57627063 0.83251909 0.9884746270
f 0.49001235 0.38298651 0.8235850153
g 0.66562596 0.70857731 0.7490944304
h 0.50089252 0.24772695 0.2117313873
i 0.57033245 0.06044799 0.8776291364
j 0.04087422 0.85814118 0.1061618729
Plot data
All pairs
With labels
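The three scatter plot variants named above (plot data, all pairs, with labels) can be sketched as follows. The sample matrix is regenerated with an assumed seed, and the plots are written to a temporary PDF so that the example also runs in non-interactive sessions; omit the pdf and dev.off lines to draw on screen.

```r
set.seed(1410)   # assumed seed
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

plotfile <- tempfile(fileext = ".pdf")   # hypothetical output file
pdf(plotfile)

# Plot data: first column against second column
plot(y[, 1], y[, 2])

# All pairs: scatter plot matrix of all columns
pairs(y)

# With labels: row names printed next to the points
plot(y[, 1], y[, 2], pch = 20, col = "red", main = "Symbols and Labels")
text(y[, 1] + 0.03, y[, 2], rownames(y))

dev.off()
```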
More examples
Print the row names instead of symbols
Usage of important plotting parameters
Important arguments
- mar: specifies the margin sizes around the plotting area in order: c(bottom, left, top, right)
- col: color of symbols
- pch: type of symbols, samples: example(points)
- lwd: line width, including the outline of plotting symbols
- cex.*: control font sizes
- For details see ?par
Add regression line
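A regression line can be added by fitting a linear model with lm and passing the result to abline. The sketch below regenerates the sample matrix with an assumed seed and writes the plot to a temporary PDF; the summary call produces a report of the kind shown next, although the numbers may differ.

```r
set.seed(1410)   # assumed seed
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

plotfile <- tempfile(fileext = ".pdf")   # hypothetical output file
pdf(plotfile)
plot(y[, 1], y[, 2])
myline <- lm(y[, 2] ~ y[, 1])   # linear model: column 2 as function of column 1
abline(myline, lwd = 2)         # add the fitted line to the existing plot
dev.off()

summary(myline)
```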

Call:
lm(formula = y[, 2] ~ y[, 1])
Residuals:
Min 1Q Median 3Q Max
-0.40357 -0.17912 -0.04299 0.22147 0.46623
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5764 0.2110 2.732 0.0258 *
y[, 1] -0.3647 0.3959 -0.921 0.3839
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3095 on 8 degrees of freedom
Multiple R-squared: 0.09589, Adjusted R-squared: -0.01712
F-statistic: 0.8485 on 1 and 8 DF, p-value: 0.3839
Log scale
Same plot as above, but on log scale
Add a mathematical expression
Homework 3B
Homework 3B: Scatter Plots
Line Plots
Single data set
Many Data Sets
The following plots a line graph for each column in data frame y. The split.screen function is used in this example in a for loop to overlay several line graphs in the same plot.
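The overlay approach can be sketched as follows. The sample matrix is regenerated with an assumed seed, the axis labels are illustrative, and the plot is written to a temporary PDF so that the code also runs non-interactively.

```r
set.seed(1410)   # assumed seed
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

plotfile <- tempfile(fileext = ".pdf")   # hypothetical output file
pdf(plotfile)

split.screen(c(1, 1))   # a single screen that is reused for every line
plot(y[, 1], ylim = c(0, 1), xlab = "Measurement", ylab = "Intensity",
     type = "l", lwd = 2, col = 1)
for (i in 2:ncol(y)) {
    screen(1, new = FALSE)   # return to screen 1 without erasing it
    plot(y[, i], ylim = c(0, 1), type = "l", lwd = 2, col = i,
         xaxt = "n", yaxt = "n", ylab = "", xlab = "", main = "", bty = "n")
}
close.screen(all.screens = TRUE)

dev.off()
```

For simple overlays, calling lines after the first plot is a common alternative to split.screen.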
Bar Plots
Basics
barplot(y[1:4,], ylim=c(0, max(y[1:4,])+0.3), beside=TRUE, legend=letters[1:4])
text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13, by=1) + sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.04) 
The barplot function has a convenient default behavior when the input data are provided as matrix containing row and column names. The column names are used in the barplot as group labels (here A to C) and the row names as labels for each measurement within a group (here: a to d). When working with a data.frame or tibble, use as.matrix to coerce the input to a matrix; and to populate or change the rownames or colnames slots, use rownames(y) <- ... or colnames(y) <- ..., respectively.
Error Bars
Histograms
Density Plots
Pie Charts
Color Selection Utilities
Default color palette and how to change it
[1] "black" "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BBC" "#F5C710" "gray62"
[1] "#FF9900" "#FFBF00" "#FFE600" "#F2FF00" "#CCFF00"
The gray function generates shades of gray from values between 0 (black) and 1 (white)
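For example, with an arbitrary sequence of gray levels:

```r
gray(0)                        # "#000000" (black)
gray(1)                        # "#FFFFFF" (white)
shades <- gray(seq(0.1, 0.9, by = 0.2))
shades                         # five shades from dark to light
```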
Color gradients with the colorpanel function from the gplots library
[1] "#00008B" "#808046" "#FFFF00" "#FFFF80" "#FFFFFF"
For much more on colors in R, see Earl Glynn’s color chart here
Saving Graphics to File
After the pdf() command, all graphs are redirected to the file test.pdf until the device is closed with dev.off(). This works similarly for all common formats: jpeg, png, ps, tiff, …
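A minimal sketch of this pattern is given below. The file is placed in the temporary directory of the session, which is an assumption of this example; a plain file name such as "test.pdf" writes to the current working directory instead.

```r
outfile <- file.path(tempdir(), "test.pdf")   # hypothetical location

pdf(outfile)           # open the PDF device
plot(1:10, 1:10)       # this plot goes into the file, not to the screen
dev.off()              # close the device to finish writing the file

file.exists(outfile)
```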
Generates Scalable Vector Graphics (SVG) files that can be edited in vector graphics programs, such as InkScape.
Homework 3C
Homework 3C: Bar Plots
Analysis Routine
Overview
The following exercise introduces a variety of useful data analysis utilities in R.
Analysis Routine: Data Import
Step 1: To get started with this exercise, direct your R session to a dedicated workshop directory and download into this directory the following sample tables. Then import the files into Excel and save them as tab delimited text files.
Import the tables into R
Import molecular weight table
Import subcellular targeting table
Online import of molecular weight table
my_mw <- read.delim(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/MolecularWeight_tair7.xls", header=TRUE, sep="\t")
my_mw[1:2,]
Online import of subcellular targeting table
Merging Data Frames
- Step 2: Assign uniform gene ID column titles
- Step 3: Merge the two tables based on common ID field
- Step 4: Shorten one table before the merge and then remove the non-matching rows (NAs) in the merged file
- Homework 3D: How can the merge function in the previous step be executed so that only the common rows among the two data frames are returned? Prove that both methods - the two step version with na.omit and your method - return identical results.
- Homework 3E: Replace all NAs in the data frame my_mw_target2a with zeros.
Filtering Data
- Step 5: Retrieve all records with a value of greater than 100,000 in ‘MW’ column and ‘C’ value in ‘Loc’ column (targeted to chloroplast).
[1] 170 8
- Homework 3F: How many protein entries in the my_mw_target data frame have a MW of greater than 4,000 and less than 5,000? Subset the data frame accordingly and sort it by MW to check that your result is correct.
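The filtering technique of Step 5 can be sketched on a small invented table. The data frame my_tab and its values are hypothetical stand-ins for the merged table; only the column names MW and Loc are taken from the text above.

```r
# Hypothetical stand-in for the merged table
my_tab <- data.frame(ID  = c("g1", "g2", "g3", "g4"),
                     MW  = c(120000, 4500, 250000, 98000),
                     Loc = c("C", "C", "M", "C"))

# Keep rows that satisfy both conditions
query <- my_tab[my_tab$MW > 100000 & my_tab$Loc == "C", ]
dim(query)                 # number of matching rows and columns
query[order(query$MW), ]   # sort the result by MW for inspection
```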
String Substitutions
- Step 6: Use a regular expression in a substitute function to generate a separate ID column that lacks the gene model extensions.
my_mw_target3 <- data.frame(loci=gsub("\\..*", "", as.character(my_mw_target[,1]), perl = TRUE), my_mw_target)
my_mw_target3[1:3,1:8]
- Homework 3G: Retrieve those rows in my_mw_target3 where the second column contains the following identifiers: c("AT5G52930.1", "AT4G18950.1", "AT1G15385.1", "AT4G36500.1", "AT1G67530.1"). Use the %in% function for this query. As an alternative approach, assign the second column to the row index of the data frame and then perform the same query again using the row index. Explain the difference of the two methods.
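The effect of the regular expression used in Step 6 can be seen on a few identifiers from the text above. The pattern "\\..*" matches the first dot and everything after it, so replacing the match with an empty string removes the gene model extension.

```r
ids  <- c("AT5G52930.1", "AT4G18950.1", "AT1G15385.1")
loci <- gsub("\\..*", "", ids, perl = TRUE)
loci   # "AT5G52930" "AT4G18950" "AT1G15385"
```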
Calculations on Data Frames
- Step 7: Count the number of duplicates in the loci column with the table function and append the result to the data frame with the cbind function.
- Step 8: Perform a vectorized division of columns 3 and 4 (average AA weight per protein)
- Step 9: Calculate for each row the mean and standard deviation across several columns
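Steps 7 to 9 can be sketched on a small invented table. The data frame tab, its values and the new column names are hypothetical; the real exercise applies the same operations to the merged table.

```r
# Hypothetical stand-in for the exercise table
tab <- data.frame(loci     = c("L1", "L1", "L2", "L3"),
                  MW       = c(11000, 22000, 33000, 44000),
                  Residues = c(100, 200, 300, 400))

# Step 7: count how often each locus occurs and append the counts
counts <- table(tab$loci)
tab <- cbind(tab, Freq = as.vector(counts[as.character(tab$loci)]))

# Step 8: vectorized division of two columns
tab$avg_AA_WT <- tab$MW / tab$Residues

# Step 9: mean and standard deviation for each row across several columns
num_cols <- c("MW", "Residues")
tab$Mean  <- apply(tab[, num_cols], 1, mean)
tab$Stdev <- apply(tab[, num_cols], 1, sd)
tab
```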
Plotting Example
- Step 10: Generate scatter plot for the ‘MW’ and ‘Residues’ columns.
Export Results and Run Entire Exercise as Script
- Step 11: Write the data frame my_mw_target4 into a tab-delimited text file and inspect it in Excel.
- Homework 3H: Write all commands from this exercise into an R script named exerciseRbasics.R, or download it from here. For demonstration the downloadable script version contains code for generating some additional plots that are not part of this exercise. Then execute the script with the source function like this: source("exerciseRbasics.R"). This will run all commands of this exercise and generate the corresponding output files in the current working directory. For homework 3H it is not necessary to submit the result files generated by the exerciseRbasics.R script. Stating how the script was executed (e.g. source or Rscript command) will be sufficient.
Or run it from the command-line (not from R!) with Rscript like this:
Rscript exerciseRbasics.R
Miscellaneous Topics
Upgrading to New R/Bioc Versions
When upgrading to a new R version, it is important to understand that a reinstall of all R packages is necessary because CRAN/Bioc packages are developed and tested for specific R versions. This means when upgrading R, then the corresponding packages need to be upgraded to the versions that match the new R version. The following steps will work in many situations.
Step 1. Export a list of all packages installed in a current version of R to a file (below named my_R_pkgs.txt) by running the following commands from within R (or use Rscript -e from command-line)
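One possible way to perform this export is sketched below; it is not necessarily the exact routine of the original tutorial. The file is written to the temporary directory of the session here, which is an assumption of this example; for a real upgrade one would choose a location that survives the session.

```r
pkg_file <- file.path(tempdir(), "my_R_pkgs.txt")   # hypothetical location

my_R_pkgs <- rownames(installed.packages())   # names of all installed packages
writeLines(my_R_pkgs, pkg_file)               # one package name per line

length(readLines(pkg_file))                   # number of exported package names
```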
Step 2. Install the new version of R, and then from within the new R version install all packages one had installed before. The first install command below installs a series of packages that are useful to have in general. Custom packages are then installed in the next lines. Note, this can only install packages from CRAN and Bioconductor. Packages from custom sources, including private GitHub accounts, need to be installed separately. Usually, one can identify them by the report generated at the end of the below install routine telling which packages are not available on CRAN or Bioconductor.
Session Info
sessionInfo()












