Introduction to R | Computational Biology Core

Basic R Usage

$ R                               # Starts the R console under Unix/Linux. The R GUI versions under Windows and Mac OS X can be started by double-clicking their icons.
object <- function(arguments)     # This general R command syntax uses the assignment operator '<-' to assign data generated by a command to an object.
object = function(arguments)      # A more recently introduced assignment operator is '='. At the top level both work the same way, but only the arrow form can also be written in the other direction ('value -> object').
                                  # For consistency reasons one should use only one of them.
assign("x", function(arguments))  # Has the same effect, but uses the assignment function instead of the assignment operator.
source("my_script")               # Command to execute an R script, here 'my_script'. For example, generate a text file 'my_script' with the command 'print(1:100)',
                                  # then execute it with the source function.
x <- edit(data.frame())           # Starts empty GUI spreadsheet editor for manual data entry.
x <- edit(x)                      # Opens existing data frame (table) 'x' in GUI spreadsheet editor.
x <- scan(what="character")       # Lets you enter values from the keyboard or by copy & paste and assigns them to vector 'x'.
q()                               # Quits R console.
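
The assignment forms can be compared directly. The following is a minimal sketch; the object names are arbitrary and chosen only for illustration.

```r
# Four ways to bind the same value to a name.
a <- 1:3            # arrow operator, left assignment
b = 1:3             # equals operator
assign("d", 1:3)    # assignment function, name given as a string
1:3 -> e            # arrow operator, right assignment (no '=' counterpart)

# All four objects hold identical content.
same <- identical(a, b) && identical(a, d) && identical(a, e)
same
```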

R Startup Behavior

The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional)

Finding Help

Various online manuals are available on the R project site. Very useful manuals for beginners are:

R Stats Tutorial
An Introduction to R
Quick-R
simpleR – Using R for Introductory Statistics
R for Beginners
Kelly Black’s R Tutorial
Kim Seefeld’s R-introduction for Biostatistics
Peter Dalgaard’s book Introductory Statistics with R
Applied Statistics for Bioinformatics using R by Wim Krijnen
The R Reference Card
Paul Murrell’s R graphics book
R search site
R Seek
References on R programming are listed in the ‘Programming in R’ chapter of this manual.

Documentation within R can be found with the following commands

? function            # Opens documentation on a function
apropos("term")       # Finds all functions whose names contain a given term, e.g. apropos("mean").
example(heatmap)      # Executes examples for function 'heatmap'
help.search("topic")  # Searches help system for documentation.
RSiteSearch('regression', restrict='functions', matchesPerPage=100) # Searches for key words or phrases in the R-help mailing list archives, help pages, 
                      # vignettes or task views, using the search engine at http://search.r-project.org, and views them in a web browser. 
help.start()          # Starts local HTML interface. The link 'Packages' provides a list of all installed packages. 
                      # After initiating 'help.start()' in a session the '?function' commands will open as HTML pages.
sessionInfo()         # Prints version information about R and all loaded packages. 
# The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists.
$ R -h                # or 'R --help'; provides help on R environment, more detailed information on page 90 of 'An Introduction to R'

Basics on Functions and Packages

R contains most arithmetic functions like mean, median, sum, prod, sqrt, length, log, etc. An extensive list of R functions can be found on the function and variable index page. Many R functions and datasets are stored in separate packages, which are only available after loading them into an R session. Information about installing new packages can be found in the administrative section of this manual.
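
As a brief illustration of the functions named above, the sketch below applies them to a small vector and checks whether a package is installed before it is loaded. The vector 'x' and the choice of the 'stats' package are assumptions made only for this example.

```r
x <- c(4, 9, 16)                       # sample vector
basic <- c(mean = mean(x), median = median(x), sum = sum(x),
           prod = prod(x), length = length(x))
basic                                  # named vector of the results
roots <- sqrt(x)                       # element-wise square root
logs  <- log(x)                        # natural logarithm

# requireNamespace() returns TRUE if a package is installed, without attaching it.
has_stats <- requireNamespace("stats", quietly = TRUE)
```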

Loading of libraries/packages

library()                     # Shows all libraries available on a system.
library(my_package)           # Loads a particular package.
library(help=my_package)      # Lists the documentation index of a particular package without loading it.
help.search("topic")          # Searches help system for documentation.
(v <- vignette("my_package")) # Opens vignette of a package. 
Stangle(file.path(v$Dir, "doc", v$File)) # Writes code chunks of vignette to a file named my_package.R for loading into R IDE (e.g. RStudio). Older R versions stored the source path in 'v$file'.
search()                      # Lists which packages are currently loaded.

Information and management of objects

ls(); objects()                  # Both list the R objects created during a session. They are stored in the file '.RData' when the workspace is saved on exiting R. 
rm(my_object1, my_object2, ...)  # Removes objects.
rm(list = ls())                  # Removes all objects without warning!
str(object)                      # Displays object types and structure of an R object.
ls.str(pattern="")               # Lists object type info on all objects in a session. 
print(ls.str(), max.level=0)     # If a session contains very long list objects then one can  simplify the output with this command.
lsf.str(pattern="")              # Lists object type info on all functions in a session.
class(object)                    # Prints the class of an object.
mode(object)                     # Prints the mode of an object; 'storage.mode(object)' prints its storage mode.
summary(object)                  # Generic summary info for all kinds of objects.
attributes(object)               # Returns an object's attribute list.
gc()                             # Initiates garbage collection. This is sometimes useful to clean up memory allocations after deleting large objects.
length(object)                   # Provides length of object.
.Last.value                      # Prints the value of the last evaluated expression.
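
A short sketch of the inspection commands above. It runs inside a separate environment 'e', an assumption made here so that the example does not touch the objects of the current session.

```r
e <- new.env()                              # scratch environment (hypothetical name)
assign("v", 1:5, envir = e)                 # an integer vector
assign("f", function(x) x + 1, envir = e)   # a function

before <- ls(e)                             # names of both objects
v_info <- c(class(e$v), mode(e$v), storage.mode(e$v))
v_info                                      # class, mode and storage mode of 'v'

rm("v", envir = e)                          # removes 'v' only
after <- ls(e)                              # only 'f' is left
```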

Reading and changing directories

dir()               # Reads content of current working directory. 
getwd()             # Returns current working directory.
setwd("/home/user") # Changes current working directory to the specified directory.

System commands under Linux

R IN/OUTPUT & BATCH Mode

One can redirect R input and output with ‘|’, ‘>’ and ‘<’ from the Shell command line.

$ R --slave < my_infile > my_outfile # The argument '--slave' makes R run as 'quietly' as possible (since R 4.0.0 this option is named '--no-echo'). 
     # This option is intended to support programs which use R to compute results for them. 
     # For example, if my_infile contains 'x <- c(1:100); x;' the result for this R expression will be written to 'my_outfile' (or STDOUT).
$ R CMD BATCH [options] my_script.R [outfile] # Syntax for running R programs in BATCH mode from the command-line. 
     # The output file lists the commands from the script file and their outputs. 
     # If no outfile is specified, the name of the script file is used with '.Rout' appended in place of a '.R' extension (here 'my_script.Rout'). 
     # To stop all the usual R command line information from being written to the outfile, add this as first line to my_script.R file: 'options(echo=FALSE)'. 
     # If the command is run like this 'R CMD BATCH --no-save my_script.R', then nothing will be saved in the .RData file, which can often get very large. 
     # More on this can be found on the help pages: '$ R CMD BATCH --help' or '> ?BATCH'.
$ echo 'sqrt(3)' | R --slave # Calculates sqrt of 3 in R and prints it to STDOUT.

Executing Shell & Perl commands from R with system() function.
Remember, single escapes (e.g. ‘\n’) need to be double escaped in R (e.g. ‘\\n’)

system("ls -al") # Prints content of current working directory.
system("perl -ne 'print if (/my_pattern1/ ? ($c=1) : (--$c > 0)); print if (/my_pattern2/ ? ($d = 1) : (--$d > 0));' my_infile.txt > my_outfile.txt") 
# Runs Perl one-liner that extracts two patterns from external text file and writes result into new file.

Reading and Writing Data from/to Files

Import from files

read.delim("clipboard", header=T)       # Command to copy & paste tables from Excel or other programs into R. 
                                        # If the 'header' argument is set to FALSE, then the first line of the data set will not be used as column titles. 
read.delim(pipe("pbpaste"))             # Command to copy & paste on Mac OS X systems.
file.show("my_file")                    # Prints content of file to screen, allows scrolling.
scan("my_file")                         # Reads vector/array into vector from file or keyboard.
my_frame <- read.table(file="my_table") # Reads in table and assigns it to data frame. 
my_frame <- read.table(file="my_table", header=TRUE, sep="\t") # Same as above, but with info on column headers and field separators. 
                                        # If you want to import the data in character mode, then include this argument: colClasses = "character".
my_frame <- read.delim("my_file", na.strings = "", fill=TRUE, header=T, sep="\t") 
                                        # Often more flexible for importing tables with empty fields and long character strings (e.g. gene descriptions). 
cat(month.name, file="zzz.txt", sep="\n"); x <- readLines("zzz.txt"); x <- x[c(grep("^J", as.character(x), perl = TRUE))]; t(as.data.frame(strsplit(x,"u"))) 
     # A somewhat more advanced example for retrieving specific lines from an external file with a regular expression. 
     # In this example, an external file is created with the 'cat' function, all lines of this file are imported into a vector with 'readLines', the specific elements (lines) 
     # are then retrieved with the 'grep' function, and the resulting lines are split into sub-fields with 'strsplit'.

Export to files

write.table(iris, "clipboard", sep="\t", col.names=NA, quote=F) 
     # Command to copy & paste from R into Excel or other programs. 
     # It writes the data of an R data frame object into the clipboard from where it can be pasted into other applications. 
zz <- pipe('pbcopy', 'w'); write.table(iris, zz, sep="\t", col.names=NA, quote=F); close(zz) # Command to copy & paste from R into Excel or other programs on Mac OS X systems.
write.table(my_frame, file="my_file", sep="\t", col.names = NA) 
     # Writes data frame to a tab-delimited text file. 
     # The argument 'col.names = NA' makes sure that the titles align with columns when row/index names are exported (default).
save(x, file="my_file.RData"); load(file="my_file.RData")       # Commands to save R object to an external file and to read it in again from this file.
files <- list.files(pattern="\\.txt$")
for(i in files) {
     x <- read.table(i, header=TRUE, row.names=1, comment.char = "A", sep="\t")
     assign(print(i, quote=FALSE), x)
     write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA)
} 
     # Batch import and export of many files. 
     # First, the *.txt file names in the current directory are assigned to a character vector (the '$' sign anchors the pattern '.txt' to the end of the names).
     # Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. 
     # Read ?read.table to understand arguments 'row.names=1' and 'comment.char = "A"'. 
     # Third, the data frames are exported using their names for file naming and appending '*.out'.
HTML(my_frame, file = "my_table.html") 
     # Writes data frame to HTML table. Subsequent exports to the same file will arrange several tables in one HTML document. 
     # In order to access this function one needs to load the library 'R2HTML' first. This library is usually not installed by default.
sink("My_R_Output") # Redirects all subsequent R output to a file 'My_R_Output' without showing it in the R console anymore.
sink() # Restores normal R output behavior.
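
The import and export commands can be checked with a round trip. The sketch below writes a few rows of 'iris' to a temporary file and reads them back; the temporary file and the object name 'back' are assumptions of this example.

```r
tmp <- tempfile(fileext = ".txt")            # throw-away file for the example

# Export: 'col.names = NA' writes an empty header field above the row names.
write.table(head(iris), file = tmp, sep = "\t", col.names = NA, quote = FALSE)

# Import: the first column holds the row names written above.
back <- read.table(tmp, header = TRUE, sep = "\t", row.names = 1)
dim(back)                                    # rows and columns read back
names(back)                                  # column titles are preserved

unlink(tmp)                                  # removes the temporary file
```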

R Objects

Data and Object Types

Data Types

Numeric data: 1, 2, 3

x <- c(1, 2, 3); x; is.numeric(x); as.character(x) # Creates a numeric vector, checks for the data type and converts it into a character vector.

Character data: “a”, “b” , “c”

x <- c("1", "2", "3"); x; is.character(x); as.numeric(x) # Creates a character vector, checks for the data type and converts it into a numeric vector.

Complex data: 1+2i, 3-1i

Logical data: TRUE, FALSE, TRUE

 1:10 < 5 # Returns TRUE for the elements that are < 5.
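
The following sketch touches each of the four data types and shows what happens when a conversion is not possible. The sample values are arbitrary.

```r
chr <- c("1", "2", "x")
num <- suppressWarnings(as.numeric(chr))  # "x" cannot be converted and becomes NA
num

z <- complex(real = 1, imaginary = 2)     # complex number 1+2i
is.complex(z)

lgl <- 1:10 < 5                           # logical vector
n_true <- sum(lgl)                        # TRUE counts as 1, FALSE as 0
```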

Object Types in R

vectors: ordered collection of numeric, character, complex and logical values.
factors: special type of vector with grouping information for its components
data frames: two dimensional structures with different data types
matrices: two dimensional structures with data of same type
arrays: multidimensional arrays of vectors
lists: general form of vectors with different types of elements
functions: piece of code
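
As a quick illustration of this list, the sketch below creates one small object of each type and reports its class. All object names and values are made up for the example.

```r
v  <- c(1.5, 2, 3)                                   # vector
f  <- factor(c("a", "b", "a"))                       # factor
df <- data.frame(id = 1:3, name = c("x", "y", "z"))  # data frame
m  <- matrix(1:6, nrow = 2)                          # matrix
a  <- array(1:24, dim = c(2, 3, 4))                  # array
l  <- list(v, f, "text")                             # list
fn <- function(x) x^2                                # function

# First class entry of each object (a matrix also inherits from 'array' in R >= 4.0).
kinds <- sapply(list(v, f, df, m, a, l, fn), function(o) class(o)[1])
kinds
```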

Naming Rules

Object, row and column names should not start with a number.
Avoid spaces in object, row and column names.
Avoid special characters like ‘#’.
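
Base R's make.names() function shows how R itself repairs names that break these rules. A minimal sketch with made-up names:

```r
raw   <- c("1st_run", "my col", "a#b", "ok_name")   # hypothetical names
fixed <- make.names(raw)   # prefixes 'X' to a leading digit, replaces invalid characters with '.'
fixed
```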

General Subsetting Rules

Subsetting syntax:

my_object[row]           # Subsetting of one dimensional objects, like vectors and factors.
my_object[row, col]      # Subsetting of two dimensional objects, like matrices and data frames.
my_object[row, col, dim] # Subsetting of three dimensional objects, like arrays.

There are three possibilities to subset data objects:

Subsetting by positive or negative index/position numbers:

my_object <- 1:26; names(my_object) <- LETTERS # Creates a sample vector with named elements.
my_object[1:4]     # Returns the elements 1-4.
my_object[-c(1:4)] # Excludes elements 1-4.

Subsetting by same length logical vectors:

my_logical <- my_object > 10 # Generates a logical vector as example.
my_object[my_logical]        # Returns the elements where my_logical contains TRUE values.

Subsetting by field names:

my_object[c("B", "K", "M")]  # Returns the elements with element titles: B, K and M

Calling a single column or list component by its name with the ‘$’ sign

iris$Species         # Returns the 'Species' column in the sample data frame 'iris'.
iris[,c("Species")]  # Has the same effect as the previous step.

Assigning values to object components

zzz <- iris[,1:3]; zzz <- 0          # Reassignment syntax to create/replace an entire object.
zzz <- iris[,1:3]; zzz[,] <- 0       # Populates all fields in an object with zeros.
zzz <- iris[,1:3]; zzz[zzz < 4] <- 0 # Populates only specified fields with zeros
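
The three subsetting possibilities return the same elements when they address the same positions. A small sketch based on the sample vector used above:

```r
my_object <- 1:26; names(my_object) <- LETTERS

by_index   <- my_object[c(2, 11, 13)]                           # position numbers
by_logical <- my_object[names(my_object) %in% c("B", "K", "M")] # logical vector of the same length
by_name    <- my_object[c("B", "K", "M")]                       # element names

identical(by_index, by_name); identical(by_logical, by_name)
```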

Basic Operators and Calculations

Comparison operators

equal: ==
not equal: !=
greater/less than: > <
greater/less than or equal: >= <=

Example:

1 == 1 # Returns TRUE.

Logical operators

AND: &

x <- 1:10; y <- 10:1 # Creates the sample vectors 'x' and 'y'.
x > y & x > 5        # Returns TRUE where both comparisons return TRUE.

OR: |

x == y | x != y # Returns TRUE where at least one comparison is TRUE.

NOT: !

!x > y # The '!' sign returns the negation (opposite) of a logical vector.
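
One detail worth checking is operator precedence: '!' binds less tightly than the comparison operators, so '!x > y' negates the whole comparison. The sketch below verifies this and summarizes logical vectors with any() and all().

```r
x <- 1:10; y <- 10:1

same_negation <- identical(!x > y, !(x > y))  # '!' is applied after the comparison

both   <- x > y & x > 5        # element-wise AND
either <- x == y | x != y      # element-wise OR, TRUE everywhere
n_both <- sum(both)            # number of positions where both comparisons hold
any(both); all(either)
```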

Calculations

Four basic arithmetic functions: addition, subtraction, multiplication and division

1 + 1; 1 - 1; 1 * 1; 1 / 1 # Returns results of basic arithmetic calculations.

Calculations on vectors

x <- 1:10; sum(x); mean(x); sd(x); sqrt(x) # Calculates for the vector x its sum, mean, standard deviation and square root. 
     # A list of the basic R functions can be found on the function and variable index page.
x <- 1:10; y <- 1:10; x + y                # Calculates the sum for each element in the vectors x and y.

Iterative calculations

apply(iris[,1:3], 1, mean)            # Calculates the mean across the columns 1-3 for each row in the sample data frame 'iris'. 
# With the argument setting '1', row-wise iterations are performed and with '2' column-wise iterations.
tapply(iris[,4], iris$Species, mean)  # Calculates the mean values for the 4th column based on the grouping information in the 'Species' column in the 'iris' data frame.
sapply(x, sqrt)                       # Calculates the square root for each element in the vector x. Generates the same result as 'sqrt(x)'.
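
The effect of the margin argument can be seen by comparing the lengths of the results. This sketch also checks the row-wise result against the built-in rowMeans function.

```r
row_means <- apply(iris[, 1:3], 1, mean)   # margin 1: one value per row
col_means <- apply(iris[, 1:3], 2, mean)   # margin 2: one value per column
length(row_means); length(col_means)

# rowMeans() gives the same values without an explicit iteration.
agree <- all.equal(unname(row_means), unname(rowMeans(iris[, 1:3])))
agree
```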

Regular expressions

R’s regular expression utilities work similarly to those in other languages. To learn how to use them in R, one can consult the main help page on this topic with: ?regexp.

?regexp                                              # Opens general help page for regular expression support in R.
month.name[grep("A", month.name)]                    # The grep function can be used for finding patterns in strings, here letter 'A' in vector 'month.name'.
gsub('(i.*a)', '\\1_xxx', iris$Species, perl = TRUE) # Example for using regular expressions to substitute a pattern by another one using a back reference.
                                                     # Remember: single escapes '\' need to be double escaped '\\' in R.
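
A few more regular expression calls as a sketch, using only the built-in 'month.name' vector and the species names from 'iris':

```r
j_months <- month.name[grepl("^J", month.name)]        # months starting with 'J'
j_months

stems <- sub("(\\w+)ber$", "\\1", month.name[9:12])    # back reference keeps the part before 'ber'
stems

# The substitution from above, applied to the three species names only.
tagged <- gsub("(i.*a)", "\\1_xxx", levels(iris$Species), perl = TRUE)
tagged
```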

Vectors

General information

Vectors are ordered collections of ‘atomic’ (same data type) components of one of the following four modes: numeric, character, complex and logical. Missing values are indicated by ‘NA’. R inserts them automatically in blank fields.

x <- c(2.3, 3.3, 4.4)         # Example for creating a numeric vector with ordered collection of numbers using the c() function.
z <- scan(file="my_file")     # Reads data line-wise from file and assigns them to vector 'z'.
vector[rows]                  # Syntax to access vector sections.
z <- 1:10; z; as.character(z) # The function 'as.character' changes the data mode from numeric to character.
as.numeric(character)         # The function 'as.numeric' changes the data mode from character to numeric. 
d <- as.integer(x); d         # Transforms numeric data into integers.
x <- 1:100; sample(x, 5)      # Selects a random sample of size of 5 from a vector.
x <- as.integer(runif(100, min=1, max=5)); sort(x); rev(sort(x)); order(x); x[order(x)]
     # Generates random integers from 1 to 4. The sort() function sorts the items by size. The rev() function reverses the order. The order() function returns the corresponding
     # indices for a sorted object. The order() function is usually the one that needs to be used for sorting complex objects, such as data frames or lists.
x <- rep(1:10, times=2); x; unique(x) # The unique() function removes the duplicated entries in a vector.
sample(1:10, 5, replace=TRUE) # Returns a set of randomly selected elements from a vector (here 5 numbers from 1 to 10), sampled with replacement; use 'replace=FALSE' to sample without replacement.
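
Missing values affect many of the functions above. The following sketch, with a made-up vector, shows how 'NA' entries are handled by mean, sort and order.

```r
x <- c(3, NA, 1, 2)

m_all   <- mean(x)                 # NA, because one value is missing
m_clean <- mean(x, na.rm = TRUE)   # ignores the missing value

sorted <- sort(x)                  # sort() drops NA by default
idx    <- order(x)                 # order() places NA last
x[idx]
```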

Sequences

R has several facilities to create sequences of numbers:

1:30                                    # Generates a sequence of integers from 1 to 30.
letters; LETTERS; month.name; month.abb # Generates lower case letters, capital letters, month names and abbreviated month names, respectively.
2*1:10                                  # Creates a sequence of even numbers.
seq(1, 30, by=0.5)                      # Generates a sequence from 1 to 30 with 0.5 increments.
seq(length=100, from=20, by=0.5)        # Creates number sequence with specified start and length.
rep(LETTERS[1:8], times=5)              # Replicates given sequence or vector x times.
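
Two related helpers are worth knowing, shown here as a sketch: 'length.out' fixes the number of values instead of the increment, and seq_len() is a safe replacement for '1:n' when n can be zero.

```r
half_steps <- seq(1, 30, by = 0.5)      # increment given
length(half_steps)

quarters <- seq(0, 1, length.out = 5)   # number of values given
quarters

n <- 0
backwards <- 1:n                        # counts down from 1 to 0
empty     <- seq_len(n)                 # empty integer vector
```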

Character Vectors

paste(LETTERS[1:8], 1:12, sep="") # The command 'paste' merges vectors after converting to characters.
x <- paste(rep("A", times=12), 1:12, sep=""); y <- paste(rep("B", times=12), 1:12, sep=""); append(x,y)
# Possibility to build plate location vector in R (better example under 'arrays').

Subsetting Vectors

x <- 1:100; x[2:23]                        # Values in square brackets select vector range.
x <- 1:100; x[-(2:23)]                     # Prints everything except the values in square brackets.
x[5] <- 99                                 # Replaces value at position 5 with '99'.
x <- 1:10; y <- c(x[1:5],99,x[6:10]); y    # Inserts new value at defined position of vector.
letters=="c"                               # Returns logical vector of FALSE and TRUE values.
which(rep(letters,2)=="c")                 # Returns index numbers where "c" occurs in the duplicated 'letters' vector (here 3 and 29). 
                                           # For retrieving indices of several strings provided by query vector, use the following 'match' function:
match(c("c","g"), letters)                 
     # Returns index numbers for "c" and "g" in the 'letters' vector. If the query vector (here 'c("c","g")') contains entries 
     # that are duplicated in the target vector, then this syntax returns only the first occurrence(s) for each duplicate. 
     # To retrieve the indices for all duplicates, use the following '%in%' function:
x <- rep(1:10, 2); y <- c(2,4,6); x %in% y 
 # The function '%in%' returns a logical vector. This syntax allows the subsetting of vectors and data frames with a query vector ('y') containing entries that are duplicated in the target vector ('x'). The resulting logical vector can be used for the actual subsetting step of vectors and data frames.
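
The difference between 'match' and '%in%' shows up as soon as the target vector contains duplicates. A minimal sketch with made-up vectors:

```r
target <- c("a", "b", "c", "b", "a")
query  <- c("b", "a")

first_hits <- match(query, target)       # first position of each query entry
all_hits   <- which(target %in% query)   # every position that matches any query entry
subset_all <- target[target %in% query]  # logical vector used for subsetting
```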

Finding Identical and Non-Identical Entries between Vectors

intersect(month.name[1:4], month.name[3:7])                 # Returns identical entries of two vectors.
month.name[month.name %in% month.name[3:7]]                 # Returns the same result as in the previous step. The vector comparison with %in% returns first a 
                                                            # logical vector of identical items that is then used to subset the first vector.
setdiff(x=month.name[1:4], y=month.name[3:7]); setdiff(month.name[3:7], month.name[1:4]) 
     # Returns the unique entries occurring only in the first vector. 
     # Note: if the argument names are not used, as in the second example, then the order of the arguments is important.
union(month.name[1:4], month.name[3:7])                    # Joins two vectors without duplicating identical entries.
x <- c(month.name[1:4], month.name[3:7]); x[duplicated(x)] #  Returns duplicated entries

Factors

Factors are vector objects that contain grouping (classification) information for their components.

animalf <- factor(c("dog", "cat", "mouse", "dog", "dog", "cat")) # Creates factor 'animalf' from vector.
animalf        # Prints out factor 'animalf', this lists first all components and then the different levels (unique entries); alternatively one can print only levels with 'levels(animalf)'.
table(animalf) # Creates frequency table for levels.

Function tapply applies calculation on all members of a level

weight <- c(102, 50, 5, 101, 103, 52) # Creates new vector with weight values for 'animalf' (both need to have same length).
mean_weight <- tapply(weight, animalf, mean) # Applies function (length, mean, median, sum, sd, etc) to all level values; 'length' provides the number of entries (replicates) in each level.

Function cut divides a numeric vector into size intervals

y <- 1:200; interval <- cut(y, right=F, breaks=c(1, 2, 6, 11, 21, 51, 101, length(y)+1), labels=c("1","2-5","6-10", "11-20", "21-50", "51-100", ">=101")); table(interval) 
# Prints the counts for the specified size intervals (breaks) in the numeric vector: 1:200.
plot(interval, ylim=c(0,110), xlab="Intervals", ylab="Count", col="green"); text(labels=as.character(table(interval)), x=seq(0.7, 8, by=1.2), y=as.vector(table(interval))+2) 
# Plots the size interval counts as bar diagram.
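
The factor examples above can be condensed into the following sketch. The three-interval 'cut' call is a simplified variant chosen for this example, not the breaks used above.

```r
animalf <- factor(c("dog", "cat", "mouse", "dog", "dog", "cat"))
weight  <- c(102, 50, 5, 101, 103, 52)

mean_weight <- tapply(weight, animalf, mean)    # one mean per level
n_per_level <- tapply(weight, animalf, length)  # number of entries per level
mean_weight

# Left-closed intervals [1,6), [6,11) and [11,Inf) with readable labels.
bins <- cut(c(1, 5, 10, 150), breaks = c(1, 6, 11, Inf), right = FALSE,
            labels = c("1-5", "6-10", ">=11"))
bin_counts <- table(bins)
```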

Matrices and Arrays

Matrices are two dimensional data objects consisting of rows and columns. Arrays are similar, but they can have one, two or more dimensions. In contrast to data frames (see below), one can store only a single data type in the same object (e.g. numeric or character).

x <- matrix(1:30, 3, 10, byrow = T) # Lays out vector (1:30) in 3 by 10 matrix. The argument 'byrow' defines whether the matrix is filled by row or columns.
dim(x) <- c(3,5,2)                  # Transforms above matrix into multidimensional array.
x <- array(1:25, dim=c(5,5))        # Creates 5 by 5 array ('x') and fills it with values 1-25.
y <- c(x)                           # Writes array into vector.
x[c(1:5),3]                         # Returns the values from the 3rd column as vector.
mean(x[c(1:5),3])                   # Calculates mean of 3rd column.

Subsetting matrices and arrays

array[rows, columns]              # Syntax to access columns and rows of matrices and two-dimensional arrays.
as.matrix(iris)                   # Many functions in R require matrices as input. If something doesn't work then try to convert the object into a matrix with the as.matrix() function.
i <- array(c(1:5,5:1),dim=c(5,2)) # Creates 5 by 2 index array ('i') and fills it with the values 1-5, 5-1.
x[i]                              # Extracts the corresponding elements from 'x' that are indexed by 'i'.
x[i] <- 0                         # Replaces those elements by zeros.
array1 <- array(scan(file="my_array_file", sep="\t"), c(4,3)) # Reads data from 'my_array_file' and writes it into 4 by 3 array.
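
Subsetting with an index array can be sketched as follows. Each row of 'i' is read as one (row, column) pair, so the 5 by 2 index array used here addresses the anti-diagonal of 'x'.

```r
x <- array(1:25, dim = c(5, 5))               # 5 by 5 array, filled column-wise
i <- array(c(1:5, 5:1), dim = c(5, 2))        # pairs (1,5), (2,4), (3,3), (4,2), (5,1)

picked <- x[i]                                # one element per index pair
picked

x[i] <- 0                                     # replaces exactly these elements
n_zero <- sum(x == 0)
```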

Subsetting arrays with more than two dimensions

array[rows, columns, dimensions]             #  Syntax to access columns, rows and dimensions in arrays with more than two dimensions.
x <- array(1:250, dim=c(10,5,5)); x[2:5,3,]  # Example to generate 10x5x5 array and how to retrieve slices from all sub-arrays.

Calculations between arrays

Z <- array(1:12, dim=c(12,8)); X <- array(12:1, dim=c(12,8))   # Creates arrays Z and X with same dimensions.
calarray <- Z/X                                                # Divides Z/X elements and assigns result to array 'calarray'.
t(my_array)                                                    # Transposes 'my_array'; a more flexible transpose function is 'aperm(my_array, perm)'

Information about arrays

dim(X); nrow(X); ncol(X)  # Returns the dimensions, the number of rows and the number of columns of array 'X'.

Data Frames

Data frames are two dimensional data objects that are composed of rows and columns. They are very similar to matrices. The main difference is that data frames can store different data types, whereas matrices allow only one data type (e.g. numeric or character).

Constructing data frames

my_frame <- data.frame(y1=rnorm(12), y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)) # Creates data frame with four columns of 12 random numbers each.
rownames(my_frame) <- month.name[1:12]                                         # Assigns row (index) names. These names need to be unique.
names(my_frame) <- c("y4", "y3", "y2", "y1")                                   # Assigns new column titles.
names(my_frame)[c(1,2)] <- c("y3", "y4")                                       # Changes titles of specific columns.
my_frame <- data.frame(IND=row.names(my_frame), my_frame)                      # Generates new column with title "IND" containing the row names.
my_frame[,2:5]; my_frame[,-1]                                                  # Different possibilities to remove column(s) from a data frame.

Accessing and slicing data frame sections

my_frame[rows, columns]                            # Generic syntax to access columns and rows in data frames.
dim(my_frame)                                      # Gives dimensions of data frame.
length(my_frame); length(my_frame$y1)              # Provides number of columns or rows of data frame, respectively
colnames(my_frame); rownames(my_frame)             # Gives column and row names of data frame.
row.names(my_frame)                                # Prints row names or indexing column of data frame.
my_frame[order(my_frame$y2, decreasing=TRUE), ]    # Sorts the rows of a data frame by the specified columns, here 'y2'; for increasing order use 'decreasing=FALSE'.
my_frame[order(my_frame[,4], -my_frame[,3]),]      # Subsequent sub-sorts can be performed by specifying additional columns. 
                                                   # By adding a "-" sign one can reverse the sort order.
my_frame$y1                                        # Notation to print entire column of a data frame as vector or factor.
my_frame$y1[2:4]                                   # Notation to access column element(s) of a data frame.
v <- my_frame[,4]; v[3]                            # Notation for returning the value of an individual cell. In this example the corresponding column is 
                                                   # first assigned to a vector and then the desired field is accessed by its index number.
my_frame[1:5,]                                     # Notation to view only the first five rows of all columns.
my_frame[, 1:2]                                    # Notation to view all rows of the first two columns.
my_frame[, c(1,3)]                                 # Notation to view all rows of the specified columns.
my_frame[1:5,1:2]                                  # Notation to view only the first five rows of the columns 1-2.
my_frame["August",]                                # Notation to retrieve row values by their index name (here "August").
x <- data.frame(row.names=LETTERS[1:10], letter=letters[1:10], Month=month.name[1:10]); x; match(c("c","g"), x[,1]) 
     # Returns matching index numbers of data frame or vector using 'match' function. This syntax returns for duplicates only the index of their first occurrence. 
     # To return all, use the following syntax:
x[x[, 2] %in% month.name[3:7],]                    # Subsets a data frame with a query vector using the '%in%' function. This returns all occurrences of duplicates.
as.vector(as.matrix(my_frame[1,]))                 # Returns one row entry of a data frame as vector.
as.data.frame(my_list)                             # Returns a list object as data frame if the list components can be converted into a two dimensional object.

Calculations on data frames

summary(my_frame)                                  # Prints a handy summary for a data frame.
colMeans(my_frame[,2:5])                           # Calculates the mean for all numeric columns. In current R versions 'mean(my_frame)' returns NA with a warning.
data.frame(my_frame, mean=apply(my_frame[,2:5], 1, mean), ratio=(my_frame[,2]/my_frame[,3]))
     # The apply function performs row-wise or column-wise calculations on data frames or matrices (here mean and ratios). The results are returned as vectors.
     # In this example, they are appended to the original data frame with the data.frame function. The argument '1' in the apply function specifies row-wise calculations.
     # If '2' is selected, then the calculations are performed column-wise.
aggregate(my_frame[,2:5], by=list(c("G1","G1","G1","G1","G2","G2","G2","G2","G3","G3","G3","G4")), FUN=mean) 
     # The aggregate function can be used to compute the mean (or any other stats) for data groups specified under the argument 'by'. Only the numeric columns are passed on here.
cor(my_frame[,2:4]); cor(t(my_frame[,2:4]))         # Syntax to calculate a correlation matrix based on all-against-all columns and all-against-all rows.
x <- matrix(rnorm(48), 12, 4, dimnames=list(month.name, paste("t", 1:4, sep=""))); corV <- cor(x["August",], t(x), method="pearson"); y <- cbind(x, correl=corV[1,]); y[order(-y[,5]), ] 
     # Commands to perform a correlation search and ranking across all rows (matrix object required here). First, an example matrix 'x' is created.
     # Second, the correlation values for the "August" row against all other rows are calculated. 
     # Finally, the resulting vector with the correlation values is merged with the original matrix 'x' and sorted by decreasing correlation values.
merge(frame1, frame2, by.x = "frame1col_name", by.y = "frame2col_name", all = TRUE) 
     # Merges two data frames (tables) by common field entries. To obtain only the common rows, change 'all = TRUE' to 'all = FALSE'. 
     # To merge on row names (indices), refer to it with "row.names" or '0', e.g.: 'by.x = "row.names", by.y = "row.names"'.
my_frame1 <- data.frame(title1=month.name[1:8], title2=1:8); my_frame2 <- data.frame(title1=month.name[4:12], title2=4:12); merge(my_frame1, my_frame2, by.x = "title1", by.y = "title1", all = TRUE) # Example for merging two data frames by common field.
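
The effect of the 'all' argument can be checked on the two example frames. In this sketch the frame names 'f1' and 'f2' are shortened versions of the ones above.

```r
f1 <- data.frame(title1 = month.name[1:8],  title2 = 1:8)
f2 <- data.frame(title1 = month.name[4:12], title2 = 4:12)

all_rows    <- merge(f1, f2, by = "title1", all = TRUE)   # keeps unmatched rows, filled with NA
common_rows <- merge(f1, f2, by = "title1", all = FALSE)  # keeps only April to August

nrow(all_rows); nrow(common_rows)
names(all_rows)   # the shared non-key column gets the suffixes '.x' and '.y'
```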

Fast computations on large data frames and matrices

myDF <- as.data.frame(matrix(rnorm(100000), 10000, 10)) # Creates an example data frame.
myCol <- c(1,1,1,2,2,2,3,3,4,4); myDFmean <- t(aggregate(t(myDF), by=list(myCol), FUN=mean, na.rm=T)[,-1])
colnames(myDFmean) <- tapply(names(myDF), myCol, paste, collapse="_"); myDFmean[1:4,]
     # The aggregate function can be used to perform calculations (here mean) across rows for any combination of column selections (here myCol) after transposing the data frame.
     # However, this will be very slow for data frames with millions of rows.
myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list)
names(myList) <- sapply(myList, paste, collapse="_"); myDFmean <- sapply(myList, function(x) apply(myDF[,x], 1, mean)); myDFmean[1:4,]              
     # Alternative for performing the aggregate computation of the previous step. However, this step will still be very slow for very large data sets, 
     # due to the apply loop over the rows.
myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list)
myDFmean <- sapply(myList, function(x) rowSums(myDF[,x])/length(x)); colnames(myDFmean) <- sapply(myList, paste, collapse="_")
myDFmean[1:4,]             
     # By using the rowSums or rowMeans functions one can perform the above aggregate computations 100-1000 times faster by avoiding a loop over the rows altogether.
myDFsd <- sqrt((rowSums((myDF-rowMeans(myDF))^2)) / (length(myDF)-1)); myDFsd[1:4]
     # Similarly, one can compute the standard deviation for large data frames by avoiding loops.
     # This approach is about 100 times faster than the loop-based alternative: apply(myDF, 1, sd).
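
As a quick sanity check, the vectorized standard deviation can be compared against the loop-based version. The following is a minimal sketch; the object names 'myMA', 'sd_fast' and 'sd_loop' are chosen here for illustration only, and the timings will vary between machines.

```r
set.seed(1)                                   # Makes the random example reproducible.
myDF <- as.data.frame(matrix(rnorm(100000), 10000, 10))
myMA <- as.matrix(myDF)                       # Matrix arithmetic avoids data frame overhead.
sd_fast <- sqrt(rowSums((myMA - rowMeans(myMA))^2) / (ncol(myMA) - 1)) # Vectorized row-wise standard deviation.
sd_loop <- apply(myMA, 1, sd)                 # Loop-based reference result.
all.equal(unname(sd_fast), unname(sd_loop))   # Both approaches should agree within numerical tolerance.
system.time(apply(myMA, 1, sd))               # Timing of the loop-based version.
system.time(sqrt(rowSums((myMA - rowMeans(myMA))^2) / (ncol(myMA) - 1))) # Timing of the vectorized version.
```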

Conditional selections

my_frame[!duplicated(my_frame[,2]),]   # Removes rows with duplicated values in selected column.
my_frame[my_frame$y2 > my_frame$y3,]   # Prints all rows of the data frame where the values in column 'y2' are greater than those in column 'y3'.
     # Comparison operators are: == (equal), != (not equal), >= (greater than or equal), etc. Logical operators are & (and), | (or) and ! (not).
x <- 0.5:10; x[x<1.0] <- -1/x[x<1.0]   # Replaces all values in vector or data frame that are below 1 with their negative reciprocal value.
x <-data.frame(month=month.abb[1:12], AB=LETTERS[1:2], no1=1:48, no2=1:24); x[x$month == "Apr" & (x$no1 == x$no2 | x$no1 > x$no2),] 
     # Prints all records of frame 'x' that contain 'Apr' AND have equal values in columns 'no1' and 'no2' OR have greater values in column 'no1'.
x[x[,1] %in% c("Jun", "Aug"),]         # Retrieves rows with column matches specified in a query vector.
x[c(grep("\\d{2}", as.character(x$no1), perl = TRUE)),] 
     # Possibility to print out all rows of a data frame where a regular expression matches (here all double digit values in col 'no1').
x[apply(x, 1, function(r) any(grepl("\\d{2}", r, perl = TRUE))),]  # Same as above, but searches all columns (1-4) by applying the pattern to every row.
z <- data.frame(chip1=letters[1:25], chip2=letters[25:1], chip3=letters[1:25]); z; y <- apply(z, 1, function(x) sum(x == "m") > 2); z[y,]
     # Identifies in a data frame ('z') all those rows that contain a certain number of identical fields (here 'm' > 2).
x <- rep(1:10, times=2); x; unique(x)  # The unique() function removes the duplicated entries in a vector.
sample(1:10, 5, replace=TRUE)          # Returns a set of randomly selected elements from a vector (here 5 numbers from 1 to 10), sampled either with or without replacement.
z <- data.frame(chip1=1:25, chip2=25:1, chip3=1:25); zcount <- data.frame(z, count=apply(z[,1:3], 1, FUN = function(x) sum(x >= 5))); zcount
     # Counts in each row of a data frame the number of fields that are above or below a specified value (here >= 5) and appends this information to the data frame.
     # By default, rows with 'NA' values return 'NA' instead of a count. To work around this limitation, one can replace the NA fields with a value that doesn't affect the result,
     # e.g.: z[is.na(z)] <- 1.
x <- data.frame(matrix(rep(c("P","A","M"), length.out=50),10,5)); x; index <- x == "P"; cbind(x, Pcount=rowSums(index)); x[rowSums(index)>=2,]
     # Example of how one can count occurrences of strings across rows. In this example the occurrences of "P" in a data frame of "PMA" values are counted by converting
     # to a matrix of logical values and then counting the 'TRUE' occurrences with the 'rowSums' function, which does in this example the same as:
     # 'cbind(x, count=apply(x=="P",1,sum))'.
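
The effect of missing values on row-wise counting can be illustrated with a small example. This is only a sketch with made-up values; the names 'cnt_default', 'cnt_narm' and 'cnt_fast' are hypothetical.

```r
z <- data.frame(chip1=c(1, 6, NA), chip2=c(7, 8, 9), chip3=c(2, NA, 10)) # Small example with two NA fields.
cnt_default <- apply(z, 1, function(x) sum(x >= 5))             # Rows containing NA return NA.
cnt_narm <- apply(z, 1, function(x) sum(x >= 5, na.rm=TRUE))    # NA fields are skipped when counting.
cnt_fast <- rowSums(z >= 5, na.rm=TRUE)                         # Same counts without an explicit loop over the rows.
cnt_default; cnt_narm; cnt_fast
```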

Reformatting data frames with reshape2 and splitting/apply routines with plyr

 ##############
 ## reshape2 ##
 ##############
 ## Some of these operations are important for plotting routines with the ggplot2 library.
 ## melt: rbinds many columns

 library(reshape2)
 (iris_mean <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean))
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
 1     setosa        5.006       3.428        1.462       0.246
 2 versicolor        5.936       2.770        4.260       1.326
 3  virginica        6.588       2.974        5.552       2.026
 (df_mean <- melt(iris_mean, id.vars=c("Species"), variable.name = "Samples")) # See ?melt.data.frame for details.
      Species      Samples value
 1     setosa Sepal.Length 5.006
 2 versicolor Sepal.Length 5.936
 3  virginica Sepal.Length 6.588
 4     setosa  Sepal.Width 3.428
 5 versicolor  Sepal.Width 2.770
 6  virginica  Sepal.Width 2.974
 ...
 ## dcast: cbinds row aggregates (reverses melt result)
 dcast(df_mean, formula = Species ~ ...)
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
 1     setosa        5.006       3.428        1.462       0.246
 2 versicolor        5.936       2.770        4.260       1.326
 3  virginica        6.588       2.974        5.552       2.026
 ## colsplit: splits strings into columns
 x <- c("a_1_4", "a_2_3", "b_2_5", "c_3_9")
 colsplit(x, "_", c("trt", "time1", "time2"))
   trt time1 time2
 1   a     1     4
 2   a     2     3
 3   b     2     5
 4   c     3     9

 ##########
 ## plyr ##
 ##########
 library(plyr)
 ddply(.data=iris, .variables=c("Species"), mean=mean(Sepal.Length), summarize)
      Species  mean
 1     setosa 5.006
 2 versicolor 5.936
 3  virginica 6.588
 ddply(.data=iris, .variables=c("Species"), mean=mean(Sepal.Length), transform)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  mean
 1          5.1         3.5          1.4         0.2  setosa 5.006
 2          4.9         3.0          1.4         0.2  setosa 5.006
 3          4.7         3.2          1.3         0.2  setosa 5.006
 4          4.6         3.1          1.5         0.2  setosa 5.006
 ...
 ## Usage with parallel package
 library(parallel); library(doMC); registerDoMC(2) # 2 cores
 test <- ddply(.data=iris, .variables=c("Species"), mean=mean(Sepal.Length), summarize, .parallel=TRUE)

Lists

Lists are ordered collections of objects that can be of different modes (e.g. numeric vector, array, etc.). A name can be assigned to each list component.

my_list <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9)) # Example of how to create a list.
attributes(my_list); names(my_list) # Prints attributes and names of all list components.
my_list[2]; my_list[[2]] # Returns component 2 of list; The [[..]] operator returns the object type stored in this component, while '[..]' returns a single component list.
my_list$wife # Named components in lists can also be called with their name. Command returns same component as 'my_list[[2]]'.
my_list[[4]][2] # Returns from the fourth list component the second entry.
length(my_list[[4]]) # Prints the length of the fourth list component.
my_list$wife <- 1:12 # Replaces the content of the existing component with new content.
my_list$wife <- NULL # Deletes list component 'wife'.
my_list <- c(my_list, list(my_title2=month.name[1:12])) # Appends new list component to existing list.
my_list <- c(my_name1=my_list1, my_name2=my_list2, my_name3=my_list3) # Concatenates lists.
my_list <- c(my_title1=my_list[[1]], list(my_title2=month.name[1:12])) # Concatenates existing and new components into a new list.
unlist(my_list); data.frame(unlist(my_list)); matrix(unlist(my_list)); data.frame(my_list) 
# Useful commands to convert lists into other objects, such as vectors, data frames or matrices.
my_frame <- data.frame(y1=rnorm(12),y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)); my_list <- apply(my_frame, 1, list); my_list <- lapply(my_list, unlist); my_list 
# Stores the rows of a data frame as separate vectors in a list container.
mylist <- list(a=letters[1:10], b=letters[10:1], c=letters[1:3]); lapply(names(mylist), function(x) c(x, mylist[[x]]))
# Syntax to access list component names with the looping function 'lapply'. In this example the list component names are prepended to the corresponding vectors.
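
The difference between the '[..]' and '[[..]]' operators is easiest to see by inspecting the class of what they return. A minimal sketch using the example list from above:

```r
my_list <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
class(my_list[4])            # "list": single brackets keep the list container.
class(my_list[[4]])          # "numeric": double brackets extract the stored object itself.
my_list[["child.ages"]][2]   # Components can also be addressed by name; returns the second age (7).
sapply(my_list, length)      # Returns the length of every component as a named vector.
```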

Missing Values

Missing values are represented in R data objects by the missing value placeholder ‘NA’. By default, many R functions (e.g. mean or sum) return ‘NA’ when their input contains missing values, unless the argument ‘na.rm=TRUE’ is set. For modeling functions, several ‘na.action’ options (e.g. ‘na.omit’ or ‘na.fail’) are available to change how missing values are handled.

x <- 1:10; x <- x[1:12]; z <- data.frame(x,y=12:1) # Creates sample data.frame with 'NA' values
is.na(x) # Finds out which elements of x contain "NA" fields.
na.omit(z) # Removes rows with 'NA' fields in numeric data objects.
x <- letters[1:10]; print(x); x <- x[1:12]; print(x); x[!is.na(x)] # A similar result can be achieved on character data objects with the function 'is.na()'.
apply(z, 1, mean, na.rm=T) # Uses available values to perform calculation while ignoring the 'NA' values; the default 'na.rm=F' returns 'NA' for rows with missing values.
z[is.na(z)] <- 0
# Replaces 'NA' values with zeros. To replace the special <NA> signs in character columns of data frames, convert the data frame with as.matrix() to a matrix,
# perform the substitution on this level and then convert it back to a data frame with as.data.frame().
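
The following sketch summarizes the most common ways of dealing with missing values on a small, made-up data frame (the column name 'lab' and the replacement string "none" are arbitrary choices). As an alternative to the matrix conversion described above, it replaces the NA entries of a character column directly.

```r
z <- data.frame(x=c(1:10, NA, NA), y=12:1, lab=c(letters[1:11], NA), stringsAsFactors=FALSE)
mean(z$x)                       # Returns NA, because the default is na.rm=FALSE.
mean(z$x, na.rm=TRUE)           # Returns 5.5, computed from the ten available values.
z[complete.cases(z), ]          # Keeps only the rows without any NA, similar to na.omit(z).
z$lab[is.na(z$lab)] <- "none"   # Replaces NA entries in a character column.
colSums(is.na(z))               # Counts the remaining NA fields per column.
```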

Some Great R Functions

This section contains a small collection of extremely useful R functions.

The unique() function makes vector entries unique:

unique(iris$Sepal.Length); length(unique(iris$Sepal.Length))
# The first step returns the unique entries for the Sepal.Length column in the iris data set and the second step counts the number of its unique entries.

The table() function counts the occurrence of entries in a vector. It is the most basic “clustering function”:

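my_counts <- as.vector(table(iris$Sepal.Length, exclude=NULL)[as.character(iris$Sepal.Length)]); cbind(iris, CLSZ=my_counts)[1:4,]
# Counts identical entries in the Sepal.Length column of the iris data set and aligns the counts with the original data frame.
# The counts are looked up by name via as.character(), because numeric indices would be interpreted as positions.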
# Note: to count NAs, set the 'exclude' argument to NULL.
myvec <- c("a", "a", "b", "c", NA, NA); table(factor(myvec, levels=c(unique(myvec), "z"), exclude=NULL))
# Additional count levels can be specified by turning the test vector into a factor and specifying them with the 'levels' argument.
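
When aligning counts with the original vector, the table entries need to be looked up by name rather than by position. The following sketch on a made-up numeric vector 'v' illustrates this and shows the base function ave() as one possible alternative.

```r
v <- c(5.1, 4.9, 5.1, 7.0, 4.9, 5.1)            # Small example vector with repeated values.
counts <- table(v)                              # Counts per unique value; the values become the names of the table.
by_name <- as.vector(counts[as.character(v)])   # Looks up the count of each element by name.
by_ave <- ave(v, v, FUN=length)                 # Alternative that computes the group sizes directly.
cbind(v, by_name, by_ave)
```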

The combn() function creates all combinations of elements:

labels <- paste("Sample", 1:5, sep=""); combn(labels, m=2, FUN=paste, collapse="-")
# Example to create all possible pairwise combinations of sample labels. The number of samples to consider in the comparisons can be controlled with the 'm' argument.
allcomb <- lapply(seq(along=labels), function(x) combn(labels, m=x, simplify=FALSE, FUN=paste, collapse="-")); unlist(allcomb)
# Creates all possible combinations of sample labels.
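
The number of combinations returned by combn() can be verified against the binomial coefficient. A short sketch, where 'pairs_only' and 'allcomb' are illustrative object names:

```r
labels <- paste("Sample", 1:5, sep="")
pairs_only <- combn(labels, m=2, FUN=paste, collapse="-")
length(pairs_only) == choose(5, 2)         # The number of pairwise combinations equals 'n choose 2'.
allcomb <- unlist(lapply(seq(along=labels), function(x) combn(labels, m=x, simplify=FALSE, FUN=paste, collapse="-")))
length(allcomb) == 2^length(labels) - 1    # All non-empty combinations of n labels: (2^n) - 1.
```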

The aggregate() function computes any type of summary statistics of data subsets that are grouped together:

aggregate(iris[,1:4], by=list(iris$Species), FUN=mean, na.rm=T) # Computes the mean (or any other stats) for the data groups defined by the 'Species' column in the iris data set.
t(aggregate(t(iris[,1:4]), by=list(c(1,1,2,2)), FUN=mean, na.rm=T)[,-1])
# The function can also perform computations for column aggregates (here: columns 1-2 and 3-4) after transposing the data frame.

The %in% operator returns a logical vector indicating which elements of its first argument also occur in the second. In a subsetting context with ‘[ ]‘, it can be used to intersect vectors, matrices, data frames and lists:

my_frame <- data.frame(Month=month.name, N=1:12); my_query <- c("May", "August") # Generates a sample data frame and a sample query vector.
my_frame[my_frame$Month %in% my_query, ] # Returns the rows in my_frame where the "Month" column and my_query contain identical entries.
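
The behavior of %in% compared to the set function intersect() can be seen on two short vectors. This is a minimal sketch with arbitrary example values:

```r
a <- c("May", "August", "May", "June"); b <- c("May", "July")
a %in% b          # Returns one logical value per element of 'a'.
a[a %in% b]       # Subsetting keeps the duplicates and the order of 'a'.
intersect(a, b)   # The set intersect drops duplicates.
which(a %in% b)   # Returns the positions of the matching elements.
```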

The merge() function joins data frames based on a common key column:

frame1 <- iris[sample(1:length(iris[,1]), 30), ]
my_result <- merge(frame1, iris, by.x = 0, by.y = 0, all = TRUE); dim(my_result)
# Merges two data frames (tables) by common field entries, here row names (by.x=0). To obtain only the common rows, change 'all = TRUE' to 'all = FALSE'. 
# To merge on specific columns, refer to them by their position numbers or their column names (e.g. "Species").
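
The effect of the 'all' argument is easier to follow on two very small tables. The data frames 'f1' and 'f2' below are made up for this sketch:

```r
f1 <- data.frame(id=c("a","b","c"), v1=1:3)
f2 <- data.frame(id=c("b","c","d"), v2=4:6)
inner <- merge(f1, f2, by="id", all=FALSE)    # Only the common rows 'b' and 'c'.
outer <- merge(f1, f2, by="id", all=TRUE)     # All rows 'a' to 'd'; fields without a match are filled with NA.
left <- merge(f1, f2, by="id", all.x=TRUE)    # Keeps all rows of 'f1' only.
inner; outer; left
```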

Graphical Procedures

Introduction to R Graphics

R provides comprehensive graphics utilities for visualizing and exploring scientific data. In addition, several powerful graphics environments extend these utilities. These include the grid, lattice and ggplot2 packages. The grid package is part of R’s base distribution. It provides the low-level infrastructure for many graphics packages, including lattice and ggplot2. Extensive information on graphics utilities in R can be found on the Graphics Task Page, the R Graph Gallery and the R Graphical Manual. Another useful reference for graphics procedures is Paul Murrell’s book R Graphics. Interactive graphics in R can be generated with rggobi (GGobi) and iplots.

Base Graphics

The following list provides an overview of some very useful plotting functions in R’s base graphics. To get familiar with their usage, it is recommended to carefully read their help documentation with ?myfct as well as the help on the functions plot and par.

plot: generic x-y plotting
barplot: bar plots
boxplot: box-and-whisker plot
hist: histograms
pie: pie charts
dotchart: Cleveland dot plots
image, heatmap, contour, persp: functions to generate image-like plots
qqnorm, qqline, qqplot: distribution comparison plots
pairs, coplot: display of multivariate data

Lattice [ Manuals: lattice, Intro, book ]

The lattice package developed by Deepayan Sarkar implements in R the Trellis graphics system from S-Plus. The environment greatly simplifies many complicated high-level plotting tasks, such as automatically arranging complex graphical features in one or several plots. The syntax of the package is similar to R’s base graphics; however, high-level lattice functions return an object of class “trellis”, that can be either plotted directly or stored in an object. The command library(help=lattice) will open a list of all functions available in the lattice package, while ?myfct and example(myfct) can be used to access and/or demo their documentation. Important functions for accessing and changing global parameters are: ?lattice.options and ?trellis.device.

ggplot2 [ Manuals: ggplot2, Docs, Intro and book ]

ggplot2 is another more recently developed graphics system for R, based on the grammar of graphics theory. The environment streamlines many graphics routines so that complex multi-layered plots can be generated with minimum effort. Its syntax is centered around the main ggplot function, while the convenience function qplot provides many shortcuts. The ggplot function accepts two arguments: the data set to be plotted and the corresponding aesthetic mappings provided by the aes function. Additional plotting parameters such as geometric objects (e.g. points, lines, bars) are passed on by appending them with ‘+’ as separator. For instance, the following command will generate a scatter plot for the first two columns of the iris data frame: ggplot(iris, aes(iris[,1], iris[,2])) + geom_point(). A list of the available geom_* functions can be found here. The settings of the plotting theme can be accessed with the command theme_get(). Their settings can be changed with the theme() function. To benefit from the many convenience features built into ggplot2, the expected input data class is usually a data frame where all labels for the plot are provided by the column titles and/or grouping factors in additional column(s).

The following graphics sections demonstrate how to generate different types of plots first with R’s base graphics device and then with the lattice and ggplot2 packages.

Scatter Plots

Base Graphics

y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], y[,2]) # Plots two vectors (columns) in form of a scatter plot against each other.
plot(y[,1], y[,2], type="n", main="Plot of Labels"); text(y[,1], y[,2], rownames(y)) # Prints instead of symbols the row names.
plot(y[,1], y[,2], pch=20, col="red", main="Plot of Symbols and Labels"); text(y[,1]+0.03, y[,2], rownames(y))
# Plots both symbols and their labels.
grid(5, 5, lwd = 2) # Adds a grid to existing plot.
op <- par(mar=c(8,8,8,8), bg="lightblue")
plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2, cex.axis=1.2, cex.main=1.2, cex.sub=1, lwd=4, pch=20, xlab="x label", ylab="y label", main="My Main", sub="My Sub")
par(op)
# The previous commands demonstrate the usage of the most important plotting parameters. The 'mar' argument specifies the margin sizes around the plotting area in this order:
# c(bottom, left, top, right). The color of the plotted symbols can be controlled with the 'col' argument. The plotting symbols can be selected with the 'pch' argument.
# Their size is generally controlled by the 'cex' argument, while 'lwd' sets the line width used to draw them. A selection palette for 'pch' plotting symbols can be opened with the command 'example(points)'.
# As alternative, one can plot any single character by passing it on to 'pch', e.g.: pch=".". The font sizes of the different text components
# can be changed with the 'cex.*' arguments. Please consult the '?par' documentation for more details on fine tuning R's plotting behavior.
plot(y[,1], y[,2]); myline <- lm(y[,2]~y[,1], data=y[,1:2]); abline(myline, lwd=2) # Adds a regression line to the plot.
summary(myline) # Prints summary about regression line.
plot(y[,1], y[,2], log="xy") # Generates the same plot as above, but on log scale.
plot(y[,1], y[,2]); text(y[1,1], y[1,2], expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3) # Adds a mathematical formula to the plot.
plot(y) # Produces all possible scatter plots for all-against-all columns in a matrix or a data frame.
# The column headers of the matrix or data frame are used as axis titles.
pairs(y) # Alternative way to produce all possible scatter plots for all-against-all columns in a matrix or a data frame.
library(scatterplot3d); scatterplot3d(y[,1:3], pch=20, color="red") # Plots a 3D scatter plot for first three columns in y.
library(geneplotter); smoothScatter(y[,1], y[,2]) # Plots the first two columns again, but as a smooth scatter plot that shows the density of the data points.

Scatter Plot Generated with Base Graphics

lattice

library(lattice)
xyplot(1:10 ~ 1:10) # Simple scatter plot.
xyplot(1:10 ~ 1:10 | rep(LETTERS[1:5], each=2), as.table=TRUE) 
# Plots subcomponents specified by grouping vector after '|' in separate panels. The argument as.table controls the order of the panels.
myplot <- xyplot(Petal.Width ~ Sepal.Width | Species , data = iris); print(myplot) # Assigns plotting function to an object and executes it.
xyplot(Petal.Width ~ Sepal.Width | Species , data = iris, layout = c(3, 1, 1)) # Changes layout of individual plots.
## Change plotting parameters
show.settings() # Shows global plotting parameters in a set of sample plots.
default <- trellis.par.get(); mytheme <- default; names(mytheme) # Stores the global plotting parameters in list mytheme and prints its component titles.
mytheme["background"][[1]][[2]] <- "grey" # Sets background to grey
mytheme["strip.background"][[1]][[2]] <- "transparent" # Sets background of title bars to transparent.
trellis.par.set(mytheme) # Sets global parameters to 'mytheme'.
show.settings() # Shows custom settings.
xyplot(1:10 ~ 1:10 | rep(LETTERS[1:5], each=2), as.table=TRUE, layout=c(1,5,1), col=c("red", "blue"))
trellis.par.set(default)

Scatter Plot Generated with lattice

ggplot2

library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() # Plots two vectors (columns) in form of a scatter plot against each other.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color = Species), size=4) # Plots larger dots and colors them with default color scheme.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color = Species), size=4) + ylim(2,4) + xlim(4,8) + scale_color_manual(values=rainbow(10)) # Colors dots with custom scheme.
ggplot(iris, aes(Sepal.Length, Sepal.Width, label=1:150)) + geom_text() + ggtitle("Plot of Labels") # Prints a set of custom labels instead of symbols and adds a title to the plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width, label=1:150)) + geom_point() + geom_text(hjust=-0.5, vjust=0.5) # Prints both symbols and labels.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + theme(panel.background=element_rect(fill = "white", colour = "black")) # Changes background color to white.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + stat_smooth(method="lm", se=FALSE) # Adds a regression line to the plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + coord_trans(x = "log2", y = "log2") # Plots axes in log scale.

Scatter Plot Generated with ggplot2

Line Plots

Base Graphics

y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], type="l", lwd=2, col="blue") # Plots a line graph instead.
split.screen(c(1,1))
plot(y[,1], ylim=c(0,1), xlab="Measurement", ylab="Intensity", type="l", lwd=2, col=1)
for(i in 2:length(y[1,])) { screen(1, new=FALSE); plot(y[,i], ylim=c(0,1), type="l", lwd=2, col=i, xaxt="n", yaxt="n", ylab="", xlab="", main="", bty="n") }
close.screen(all=TRUE)
# Plots a line graph for all columns in data frame 'y'. The 'split.screen()' function is used in this example in a for loop to overlay several line graphs in the same plot.
# More details on this topic are provided in the 'Arranging Plots' section.

lattice

xyplot(Sepal.Length ~ Sepal.Width | Species, data=iris, type="a", layout=c(1,3,1))
parallelplot(~iris[1:4] | Species, iris) # Plots data for each species in iris data set in separate line plot.
parallelplot(~iris[1:4] | Species, iris, horizontal.axis = FALSE, layout = c(1, 3, 1)) # Changes layout of plot.

ggplot2

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(aes(color=Species), size=1) # Plots lines for three samples in the 'Species' column in a single plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(aes(color=Species), size=1) + facet_wrap(~Species, ncol=1)  # Plots three line plots, one for each sample in Species column.

Bar Plots

Base Graphics

y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
barplot(as.matrix(y[1:4,]), ylim=c(0,max(y[1:4,])+0.1), beside=T)
# Generates a bar diagram for the first four rows in y. The barplot() function expects as input format a matrix, 
# and it uses by default the column titles for labeling the bars.
text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13, by=1)+sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.02)
# Adds corresponding values on top of each bar.
ysub <- as.matrix(y[1:4,]); myN <- length(ysub[,1])
mycol1 <- gray(1:(myN+1)/(myN+1))[-(myN+1)]
mycol2 <- sample(colors(),myN); barplot(ysub, beside=T, ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar Plot", sub="data: ysub")
legend("topright", legend=row.names(ysub), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=mycol2, ncol=myN)
# Generates a bar plot with a legend for the first four rows of the input matrix 'y'. To plot the diagram in black and white, set the 'col' argument to 'col=mycol1'.
# The argument 'ncol' controls the number of columns that are used for printing the legend.
par(mar=c(10.1, 4.1, 4.1, 2.1)); par(xpd=TRUE);
barplot(ysub, beside=T, ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar Plot"); legend(x=4.5, y=-0.3, legend=row.names(ysub), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=mycol2, ncol=myN)
# Same as above, but places the legend below the bar plot. The arguments 'x' and 'y' control the placement of the legend and the 'mar' argument specifies 
# the margin sizes around the plotting area in this order: c(bottom, left, top, right).
bar <- barplot(x <- abs(rnorm(10,2,1)), names.arg = letters[1:10], col="red", ylim=c(0,5))
stdev <- x/5; arrows(bar, x, bar, x + stdev, length=0.15, angle = 90)
arrows(bar, x, bar, x + -(stdev), length=0.15, angle = 90)
# Creates bar plot with standard deviation bars. The example in the 'barplot' documentation provides a very useful outline of this function.

Bar Plot with Error Bars Generated with Base Graphics

lattice

y <- matrix(sample(1:10, 40, replace=TRUE), ncol=4, dimnames=list(letters[1:10], LETTERS[1:4])) # Creates a sample data set.
barchart(y, auto.key=list(adj = 1), freq=T, xlab="Counts", horizontal=TRUE, stack=FALSE, groups=TRUE) 
# Plots all data in a single bar plot. The TRUE/FALSE arguments control the layout of the plot.
barchart(y, col="grey", layout = c(2, 2, 1), xlab="Counts", as.table=TRUE, horizontal=TRUE, stack=FALSE, groups=FALSE)
# Plots the different data components in separate bar plots. The TRUE/FALSE and layout arguments allow to rearrange the bar plots in different ways.

Bar Plot Generated with lattice

ggplot2

## (A) Sample Set: the following transforms the iris data set into a ggplot2-friendly format
iris_mean <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean) # Calculates the mean values for the aggregates given by the Species column in the iris data set.
iris_sd <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=sd) # Calculates the standard deviations for the aggregates given by the Species column.
convertDF <- function(df, mycolnames=c("Species", "Values", "Samples")) { myfactor <- rep(colnames(df)[-1], each=length(df[,1])); mydata <- as.vector(as.matrix(df[,-1])); df <- data.frame(df[,1], mydata, myfactor); colnames(df) <- mycolnames; return(df) } # Defines function to convert data frames into ggplot2-friendly format.
df_mean <- convertDF(iris_mean, mycolnames=c("Species", "Values", "Samples")) # Converts iris_mean.
df_sd <- convertDF(iris_sd, mycolnames=c("Species", "Values", "Samples")) # Converts iris_sd.
limits <- aes(ymax = df_mean[,2] + df_sd[,2], ymin=df_mean[,2] - df_sd[,2]) # Define standard deviation limits.

## (B) Bar plots of data stored in df_mean
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="dodge") # Plots bar sets defined by 'Species' column next to each other.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="dodge") + coord_flip() + theme(axis.text.y=element_text(angle=0, hjust=1))
# Plots bars and labels sideways.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="stack") # Plots same data set as stacked bars.
ggplot(df_mean, aes(Samples, Values)) + geom_bar(aes(fill = Species), stat="identity") + facet_wrap(~Species, ncol=1) # Plots data sets below each other.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(limits, position="dodge")
# Generates the same plot as before, but with error bars.

## (C) Customizing colors
library(RColorBrewer); display.brewer.all() # Select a color scheme and pass it on to 'scale_*' arguments.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(limits, position="dodge") + scale_fill_brewer(palette="Greys") + scale_color_brewer(palette="Greys") # Generates the same plot as before, but with grey color scheme.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(limits, position="dodge") + scale_fill_manual(values=c("red", "green3", "blue")) + scale_color_manual(values=c("red", "green3", "blue")) # Uses custom colors passed on as vectors.

Bar Plot Generated with ggplot2

Pie Charts

Base Graphics

y <- table(rep(c("cat", "mouse", "dog", "bird", "fly"), c(1,3,3,4,2))) # Creates a sample data set.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), main="Pie Chart", clockwise=T) # Plots a simple pie chart.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), labels=NA, main="Pie Chart", clockwise=T)
legend("topright", legend=row.names(y), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=rainbow(length(y), start=0.1, end=0.8), ncol=1)
# Same pie chart as above, but with legend.

Pie Chart Generated with Base Graphics

ggplot2

df <- data.frame(variable=rep(c("cat", "mouse", "dog", "bird", "fly")), value=c(1,3,3,4,2)) # Creates a sample data set.
ggplot(df, aes(x = "", y = value, fill = variable)) + geom_bar(stat="identity", width = 1) + coord_polar("y", start=pi / 3) + ggtitle("Pie Chart") # Plots pie chart.
ggplot(df, aes(x = variable, y = value, fill = variable)) + geom_bar(stat="identity", width = 1) + coord_polar("y", start=pi / 3) + ggtitle("Pie Chart") # Plots wind rose pie chart.

Wind Rose Pie Chart Generated with ggplot2

Heatmaps

Base Graphics

y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
image(y) # The image function plots matrices as heatmaps.

lattice

## Sample data
library(lattice); library(gplots)
y <- lapply(1:4, function(x) matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))))
## Plot single heatmap:
levelplot(y[[1]])
## Arrange several heatmaps in one plot
x1 <- levelplot(y[[1]], col.regions=colorpanel(40, "darkblue", "yellow", "white"), main="colorpanel")
x2 <- levelplot(y[[2]], col.regions=heat.colors(75), main="heat.colors")
x3 <- levelplot(y[[3]], col.regions=rainbow(75), main="rainbow")
x4 <- levelplot(y[[4]], col.regions=redgreen(75), main="redgreen")
print(x1, split=c(1,1,2,2))
print(x2, split=c(2,1,2,2), newpage=FALSE)
print(x3, split=c(1,2,2,2), newpage=FALSE)
print(x4, split=c(2,2,2,2), newpage=FALSE)

Venn Diagrams

Computation of Venn intersects

The following imports several functions from the overLapper.R script for computing Venn intersects and plotting Venn diagrams (old version: vennDia.R). These functions are relatively generic and scalable by supporting the computation of Venn intersects of 2-20 or more samples. The upper limit around 20 samples is unavoidable because the complexity of Venn intersects increases exponentially with the sample number n according to this relationship: (2^n) – 1. A useful feature of the actual plotting step is the possibility to combine the counts from several Venn comparisons with the same number of test sets in a single Venn diagram. The overall workflow of the method is to first compute for a list of sample sets their Venn intersects using the overLapper function, which organizes the result sets in a list object. Subsequently, the Venn counts are computed and plotted as bar or Venn diagrams. The current implementation of the plotting function, vennPlot, supports Venn diagrams for 2-5 sample sets. To analyze larger numbers of sample sets, the Intersect Plot methods often provide reasonable alternatives. These methods are much more scalable than Venn diagrams, but lack their restrictive intersect logic. Additional Venn diagram resources are provided by limma, gplots, vennerable, eVenn, VennDiagram, shapes, C Seidel (online) and Venny (online).
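
To illustrate the intersect logic, the following base R sketch computes the sizes of the exclusive Venn regions for three small sets. It does not use or reproduce the overLapper functions; the sample sets and the object names 'setlist', 'member', 'region' and 'venn_counts' are made up for this example.

```r
setlist <- list(A=c("a","b","c","d"), B=c("c","d","e"), C=c("d","e","f")) # Hypothetical sample sets.
universe <- unique(unlist(setlist))                      # All elements occurring in any set.
member <- sapply(setlist, function(s) universe %in% s)   # Logical membership matrix (elements x sets).
rownames(member) <- universe
region <- apply(member, 1, function(r) paste(names(setlist)[r], collapse="-")) # Labels each element with the sets that contain it.
venn_counts <- table(region)                             # Sizes of the exclusive Venn regions.
venn_counts
2^length(setlist) - 1                                    # Maximum number of non-empty regions for n sets.
```

Because every element is assigned to exactly one region, the counts sum to the total number of unique elements.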

Sample Venn Diagrams (here 4- and 5-way)

Histograms

Base Graphics

x <- rnorm(100); hist(x, freq=FALSE); curve(dnorm(x), add=TRUE) 
# Plots a random distribution as histogram and overlays curve ('freq=F': ensures density histogram; 'add=T': enables 'overplotting').
plot(x<-1:50, dbinom(x,size=50,prob=.33), type="h") # Creates pin diagram for binomial distribution with n=50 and p=0.33.

Basic Histogram Generated with Base Graphics

ggplot2

ggplot(iris, aes(x=Sepal.Width)) + geom_histogram(aes(fill = ..count..), binwidth=0.2) # Plots histogram for second column in 'iris' data set.
ggplot(iris, aes(x=Sepal.Width)) + geom_histogram(aes(y = ..density.., fill = ..count..), binwidth=0.2) + geom_density() # Density representation.

Histogram Generated with ggplot2

Density Plots

plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red") # Plots a random distribution in form of a density plot.
split.screen(c(1,1))
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red")
screen(1, new=FALSE)
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="green", xaxt="n", yaxt="n", ylab="", xlab="", main="",bty="n")
close.screen(all = TRUE)
# Several density plots can be overlaid with the 'split.screen' function. In this example two density plots are plotted at the same scale ('xlim/ylim') and
# overlaid on the same graphics device using 'screen(1, new=FALSE)'.

Box Plots

y <- as.data.frame(matrix(runif(300), ncol=10, dimnames=list(1:30, LETTERS[1:10]))) # Generates a sample data set.
boxplot(y, col=1:length(y))
# Creates a box plot. A box plot (also known as a box-and-whisker diagram) is a graphical representation of a five-number summary, which consists of the smallest
# observation, lower quartile, median, upper quartile and largest observation.
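
The numbers behind a box plot can be inspected without drawing it. The following sketch compares the five-number summary from fivenum() with the statistics computed by boxplot(); the vector 'v' is random example data. Note that with the default settings the whiskers extend only to the most extreme data points within 1.5 times the box length, so the whisker ends can differ from the minimum and maximum when outliers are present.

```r
set.seed(1)                    # Makes the random example reproducible.
v <- runif(30)
fivenum(v)                     # Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
b <- boxplot(v, plot=FALSE)    # Computes the box plot statistics without drawing the plot.
b$stats[,1]                    # Whisker ends, hinges and median as used by boxplot().
b$out                          # Data points beyond the whiskers (outliers), if any.
```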

Basic Box Plot Generated with Base Graphics

Miscellaneous Plotting Utilities

plot(1:10); lines(1:10, lty=1)
# Superimposes lines onto existing plots. Passing the plotted values to lines() connects the data points. Different line types can be specified with the 'lty' argument, e.g. lty=2.
# More on this can be found in the documentation for 'par'.
plot(x <- 1:10, y <- 1:10); abline(-1,1, col="green"); abline(1,1, col="red"); abline(v=5, col="blue"); abline(h=5, col="brown")
# Function 'abline' adds lines in different colors to x-y-plot.