R Programming - Getting Started

Latest revision as of 21:29, 5 June 2024

This page is part of Programming and CS - Getting Started

So you want to learn R programming. Good for you. This page will hopefully walk you through getting into R.

Reading

There are many good tutorials and getting-started guides for R, and reading through just about any of them will help. Here are a few you can try, but feel free to pick your own as well.

Software Setup

R is free to use and has numerous free packages as well. We recommend using the RStudio IDE since it is the most popular and has some very nice features.

Install on Your Computer

  1. Download and install the latest version of R from https://cloud.r-project.org
  2. Download and install RStudio Desktop (the free version) from https://www.rstudio.com/products/rstudio/download/

Use on ISU CS Systems

To use R on the ISU CS systems, you can either use RStudio when you are in one of the labs or run R from the terminal when you are logged in remotely. To run RStudio on one of the CS lab computers, run the rstudio command (either from a terminal or via the graphical menu). To run R from a terminal, run the R command.

Packages

Note - before trying to install a package, first try to load it with the library command. If loading fails because the package isn't installed, then install it as described next.

One of the best features of R is the large number of very good packages that are easy to install and use. Once you have downloaded and installed R and RStudio and opened RStudio, you can download and install packages using the install.packages command. For example, to install openxlsx, run the following.

install.packages("openxlsx")

You only need to run this once on your computer. Once it is installed, you use the library command to load the package so it is available for use.

library("openxlsx")
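The "try to load first, install only if missing" advice from the note above can be wrapped into a small helper. This is just a sketch: the function name load_or_install is made up for illustration, while requireNamespace, install.packages, and library are standard R functions.

```r
# Hypothetical helper: load a package, installing it first only if it is missing.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE (instead of stopping with an error)
  # when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call loads it without downloading anything
load_or_install("stats")
```

You would call it the same way for other packages, for example load_or_install("openxlsx").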

Many packages related to biology and medicine are installed a little differently, using a system called Bioconductor. You must first install the BiocManager package by running the following.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

Once this is complete, you can install a Bioconductor package (here the package edgeR) as follows.

BiocManager::install("edgeR")

Sample Quizzes

R Getting Started

For R reserved words, operators, and functions you should be able to give a short description given the reserved word / operator / function. You should also be able to identify the reserved word / operator / function given a short description. For functions you should be able to describe the parameters and return value of the function.

Reserved words

  • if - execute a statement only if a condition is true
  • else - specifies statements to run when the if condition is not true
  • for - loop that iterates through a vector, matrix, etc.
  • while - loop that iterates as long as a boolean condition is true
  • break - jumps out of a loop
  • repeat - loop that repeats forever until a break statement inside the loop
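  • next - skip the rest of the current iteration and jump back to the top of the loop
  • in - used in a for loop to separate the loop variable from the values being iterated over (e.g., for (x in v))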
  • TRUE, FALSE - boolean values
  • Inf - result of math operation that would give infinity (e.g., 1/0)
  • NaN - result of math operation where result is not defined (e.g., 1/0 - 1/0)
  • NA - value for missing data (e.g., an empty cell in an imported csv file), "dominates" in operations (e.g., 1 + NA will result in NA)
  • NA_integer_, NA_real_, NA_complex_, NA_character_ - versions of NA
  • NULL - the null object, has length 0
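To see several of these reserved words in action, here is a short sketch you can run one piece at a time. The variable names (odd_sum, x, count) are made up for the example.

```r
# for / in / if / else / next: add up only the odd numbers from 1 to 10
odd_sum <- 0
for (i in 1:10) {
  if (i %% 2 == 0) {
    next                      # even number: skip to the next iteration
  } else {
    odd_sum <- odd_sum + i
  }
}
print(odd_sum)                # 1 + 3 + 5 + 7 + 9 = 25

# while: keep doubling as long as the value is below 100
x <- 1
while (x < 100) {
  x <- x * 2
}
print(x)                      # 128

# repeat / break: loops forever unless break is reached
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}
print(count)                  # 3

# Special values
print(1 / 0)                  # Inf
print(1 / 0 - 1 / 0)          # NaN
print(1 + NA)                 # NA
print(length(NULL))           # 0
```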

Operators

  • Arithmetic - + plus, - minus, * times, / divide, ^ exponentiation, %% remainder. Note that these operate element-wise on vectors, matrices, and data frames.
  • Comparison - < <= == != >= >
  • Boolean - ! not, & and, | or
  • Model formula - ~
  • Assignment - -> = <- <<- ->>
  • List indexing - $
  • Sequence - :
  • Special infix operators - %% remainder, %*% matrix multiplication, %/% integer division, %in% testing membership, %o% outer product, %x% Kronecker product
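A few of the operators above, tried on a small made-up vector v and matrix m. Typing each line into the console prints its result; the comments show what to expect.

```r
v <- c(1, 2, 3, 4)

# Arithmetic works element by element
v + 10               # 11 12 13 14
v * v                # 1 4 9 16
v ^ 2                # 1 4 9 16
7 %% 3               # remainder: 1
7 %/% 3              # integer division: 2

# Comparison and boolean operators also work element by element
v > 2                # FALSE FALSE TRUE TRUE
v > 1 & v < 4        # FALSE TRUE TRUE FALSE
!(v == 2)            # TRUE FALSE TRUE TRUE

# Sequence and membership
1:5                  # 1 2 3 4 5
3 %in% v             # TRUE
c(2, 9) %in% v       # TRUE FALSE

# Matrix multiplication and outer product
m <- matrix(1:4, nrow = 2)   # filled column by column: columns (1, 2) and (3, 4)
m %*% m                      # 2 x 2 matrix product
(1:3) %o% (1:2)              # 3 x 2 table of all pairwise products
```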

Other punctuation

  • Grouping - () for order of operations, calling functions, defining functions
  • Compound statement - {} for specifying body of loops, if, functions
  • Indexing - [] to index arrays, vectors, matrices, data frames. [[]] to index lists. Also use $ for lists/data-frames with named columns/items
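The different kinds of indexing are easiest to see side by side. The following sketch uses small made-up objects (v, m, person, df); none of them come from the course files.

```r
v <- c(10, 20, 30, 40)
v[2]                    # 20 (indexing starts at 1)
v[c(1, 3)]              # 10 30
v[v > 15]               # 20 30 40 (index with TRUE/FALSE values)

m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column by column
m[2, 3]                 # 6: row 2, column 3
m[1, ]                  # 1 3 5: all of row 1

person <- list(name = "Ada", scores = c(90, 85))
person$name             # "Ada"
person[["scores"]]      # 90 85
person[["scores"]][2]   # 85

df <- data.frame(gene = c("a", "b", "c"), count = c(5, 0, 12),
                 stringsAsFactors = FALSE)
df$count                # 5 0 12
df[df$count > 0, ]      # only the rows where count is positive
df[2, "gene"]           # "b"
```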

Functions

  • Statistics - min for minimum, max for maximum, mean for average, var for variance, cor for correlation, cov for covariance, sd for standard deviation
  • Input/output - print to print in the console, View to view graphically, read.csv/write.csv to read/write csv files, setwd sets the working directory
  • Dataframes / matrices / arrays - summary gives information about each column, table counts how often each value occurs, ncol # of columns, nrow # of rows, dim for dimensions, sapply to apply a function to each element of a vector and simplify the result, tapply to apply a function to each level of a factor, cbind to combine by column, rbind to combine by row, rowMeans to compute the mean of each row
  • vectors - c to combine into one vector, length to get # items, rep to create vector from repeating
  • math - log2, sum for summation, prod for product
  • plotting - plot, boxplot, hist
  • strings - grep to search for pattern, substring to take a substring of a string, gsub to substitute/replace
  • sets - unique to pull out just the unique items, intersect, union, setdiff
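Here is a quick sketch that exercises several of the functions above on made-up toy data (scores, group, and m are invented for the example), which is a good way to check your understanding of the parameters and return values.

```r
scores <- c(4, 8, 6, 2)
mean(scores)                         # 5
min(scores)                          # 2
max(scores)                          # 8
sum(scores)                          # 20
prod(scores)                         # 384
log2(8)                              # 3
var(scores)                          # sample variance
sd(scores)                           # square root of the variance

length(scores)                       # 4
rep(0, 3)                            # 0 0 0

group <- c("a", "b", "a", "b")
table(group)                         # how many of each: a 2, b 2
tapply(scores, group, mean)          # mean per group: a 5, b 5
sapply(scores, function(s) s * 2)    # 8 16 12 4

m <- rbind(c(1, 2, 3), c(4, 5, 6))   # stack two vectors as rows
dim(m)                               # 2 3
nrow(m)                              # 2
ncol(m)                              # 3
rowMeans(m)                          # 2 5
cbind(m, c(7, 8))                    # add a fourth column
```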

The Basics Quiz

Look through the R Getting Started slides again to refresh your memory about the basics, and then download and open the following "quizzes" R files. For each of these, your goal is to open the file, look at the first line, decide what you think will happen after the first line is run, then run the first line and see what happens; then proceed one line at a time trying to think what will happen, and then running the line to see what actually happens.

Case Studies

Read through one of the tutorials, and start looking at each of the following case studies. These are R files that are looking at some interesting data. Our first goal is just to understand what the data is and how the code works. Once we understand how the code works we can ask some more questions about the data.

Gene Expression in Developing Heart Cells

For this example we look at some of the data from a scientific study by researchers looking at heart cell development. The data was downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE69618, and the research of the authors was published at https://www.ncbi.nlm.nih.gov/pubmed/26485529. You can also view ISU posters related to looking at this data at http://cs.indstate.edu/info/posters/

Let's take a look at the data and some R code to begin looking at it. Log in to one of the CS systems, and run the following commands.

cd ~
mkdir heart-genes
cd heart-genes
cp /u1/junk/bd4isu/GSE69618_data.csv .
cp /u1/junk/bd4isu/GSE69618.R .

You can also download the files from http://cs.indstate.edu/~jkinne/bd4isu-summer-2019/code/. Open the GSE69618.R file in RStudio and run each line in the file. Note - your instructor will show you how to do this and explain the different parts of RStudio that you are seeing.
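Before stepping through GSE69618.R, it can help to take a first look at the data file on your own. The following is only a sketch and is not taken from GSE69618.R: it assumes your working directory is the heart-genes directory created above and that the file is an ordinary CSV with a header row.

```r
# Assumption: the working directory contains the file copied above
data_file <- "GSE69618_data.csv"

if (file.exists(data_file)) {
  expr <- read.csv(data_file)   # assumes the first row holds column names
  print(dim(expr))              # number of rows and columns
  print(head(expr))             # first few rows
  print(summary(expr))          # per-column summary
} else {
  message("Could not find ", data_file, " - use setwd() to change directory")
}
```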

References

Cheat Sheets

R Language Definition

Every programming language contains a list of "reserved" words that have special meaning and cannot be used for variable or function names. For R's list, see R reserved words (you can also run ?Reserved in the R console).

Every programming language assigns special meaning to punctuation - normally parentheses () are used for enforcing order of operations and for defining and calling functions. Every language is slightly different in the rules. For R, this is all listed in the specification of the R parser (that is a bit of a boring read, but there you go).

And the complete R language specification is at https://cran.r-project.org/doc/manuals/r-release/R-lang.html. This is aimed at "mature" programmers, so view at your own risk.

R Functions

The following are R functions that we commonly use. You can find examples by searching online. You can also look up help on a function in RStudio by typing ? followed by the function name in the console (e.g., ?mean).

  • statistics - min, max, mean, var, cor, cov, sd, quantile
  • I/O - print, View, read.csv, write.csv, setwd
  • data frames / matrices / arrays - summary, table, ncol, nrow, dim, tapply, sapply, cbind, rowMeans, read.csv, rownames, colnames, t
  • vectors - c, length, rep, which, order
  • lists - names
  • math - log2, sum, prod
  • plotting - plot, boxplot, hist, abline, points, beanplot, par, image, mtext, smoothScatter, pairs
  • strings - grep, substring, gsub, paste, paste0
  • sets - unique, intersect, union, setdiff
  • packages/running-files - install.packages, library, source, file.path, file.exists, load, save
  • data types - as.list, as.data.frame, as.matrix, as.numeric, as.vector
  • variable management - rm, View, head, print
  • clustering - hclust, heatmap
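As a starting point, here is a short sketch that runs a handful of the string, set, vector, and data type functions from the list on made-up values (genes, a, b, x, and df are invented for the example).

```r
# strings
genes <- c("abc1", "xyz2", "abc3")
grep("abc", genes)                   # 1 3: positions that match
grep("abc", genes, value = TRUE)     # "abc1" "abc3"
substring("abc1", 1, 3)              # "abc"
gsub("abc", "ABC", genes)            # "ABC1" "xyz2" "ABC3"
paste("gene", 1:2)                   # "gene 1" "gene 2"
paste0("gene", 1:2)                  # "gene1" "gene2"

# sets
a <- c(1, 2, 2, 3)
b <- c(3, 4)
unique(a)                            # 1 2 3
intersect(a, b)                      # 3
union(a, b)                          # 1 2 3 4
setdiff(a, b)                        # 1 2

# vectors
x <- c(30, 10, 20)
which(x > 15)                        # 1 3: positions where the test is TRUE
order(x)                             # 2 3 1: positions in increasing order
x[order(x)]                          # 10 20 30

# data types
as.numeric("3.5") + 1                # 4.5
df <- as.data.frame(matrix(1:4, nrow = 2))
colnames(df)                         # "V1" "V2"
head(df, 1)                          # first row only
```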

Packages to install - beanplot, Polychrome, oompaBase

You can download and run the following file which demonstrates running many of these functions - coming soon