# A very quick tour of R

This post is a quick introduction to R. I learnt R when I was an undergrad and I still use it from time to time. It was also one of the first major programs I compiled from source.

What is R? It is simply the best statistical computing environment in use today. Better yet, it’s open source. If you’re working with data, you need to learn R. This post won’t teach you how to use R. Instead, it will give you a whirlwind tour so you can appreciate R’s flavour. The only thing I don’t like much about R is searching for material on it. Its one-letter name makes that hard.

I will assume that you have R installed, and have opened the interactive console. In the examples below, lines starting with '>' are what you type at the console prompt, and the lines after them are R's replies.

## Variables

Variables are perhaps the most important part of R because all your data will be stored in variables. Variables are declared as follows:

> a = 12


This stores the value 12 in the variable named ‘a’. Pretty easy, right? You can now type ‘a’ in the console and hit enter, and this is what you’ll get:

> a
[1] 12


Chances are you’ll be reading R code somewhere along the line and encounter this kind of variable assignment:

> a <- 12


In everyday use this does the same thing as 'a = 12', and it is the form most R style guides prefer, but I find it uglier and harder to type.

## Vectors

Vectors in R are really fun. They're like lists in Python, or arrays in C. Here's how to declare a vector with the numbers 2,4,6,8 in it:

> a = c(2,4,6,8)


The function 'c' stands for 'combine' and just takes the numbers 2,4,6,8 and puts them in a vector. For that reason, it's a good idea not to use the letter 'c' as a variable name. R has a really great way of handling vectors, at least for the purposes of numerical computing. If you want to get a new vector whose elements are the squares of the elements of the vector stored in 'a', you just do:

> a^2
[1]  4 16 36 64


The vector 'a' itself is unchanged. If you wanted to modify 'a' by squaring each entry then you would do:

a = a^2


Pretty much all arithmetic operations operate on vectors pointwise. So 'a + a' will return a new vector via vector addition, rather than concatenating the vectors:

> a + a
[1]  4  8 12 16


If you want to concatenate instead:

> c(a,a)
[1] 2 4 6 8 2 4 6 8
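Going back to elementwise arithmetic for a moment, here is a short sketch of a few more operations on the same vector. The particular operations are just ones I picked for illustration; the outputs in the comments are what these calls return for this 'a'.

```r
a = c(2, 4, 6, 8)

a * a      # elementwise product: 4 16 36 64
a / 2      # a single number is applied to every element: 1 2 3 4
a - 1      # likewise: 1 3 5 7
sqrt(a)    # many functions also work element by element
```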


Finding the sum of all the elements of the vector 'a' is easy:

> sum(a)
[1] 20


There are many built-in functions that can be applied to vectors like 'mean', 'summary', 'sd', 'min', 'max', etc. Chances are if you need a function like this, you can just guess what it is and it will probably be right.
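As a quick sketch of that guesswork paying off, here are a few of those functions applied to the vector from earlier. The variable name 'a' follows the examples above, and the values in the comments are what these calls return for this particular vector.

```r
a = c(2, 4, 6, 8)

mean(a)     # arithmetic mean: 5
min(a)      # smallest element: 2
max(a)      # largest element: 8
length(a)   # number of elements: 4
sd(a)       # sample standard deviation (divides by n - 1): about 2.581989
summary(a)  # minimum, quartiles, median, mean and maximum in one go
```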

## Probability distributions

R has a bunch of built-in probability distributions. The density functions are named as follows:

1. dbeta: Beta distribution
2. dbinom: Binomial distribution
3. dcauchy: Cauchy distribution
4. dchisq: Chi-squared distribution
5. dexp: Exponential distribution
6. df: F distribution
7. dgamma: Gamma distribution
8. dgeom: Geometric distribution
9. dhyper: Hypergeometric distribution
10. dlnorm: Log-normal distribution
11. dmultinom: Multinomial distribution
12. dnbinom: Negative binomial distribution
13. dnorm: Normal distribution
14. dpois: Poisson distribution
15. dt: Student's t distribution
16. dunif: Uniform distribution
17. dweibull: Weibull distribution

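By replacing the 'd' in these names with:

1. 'p' you get the cumulative distribution function
2. 'q' you get the quantile function
3. 'r' you get a random number generator for that distribution

The multinomial distribution is the exception: it only comes with 'dmultinom' and 'rmultinom'.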
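To make the naming scheme concrete, here is a small sketch using the normal distribution, where the four prefixes give 'dnorm', 'pnorm', 'qnorm' and 'rnorm'. The input values and the variable names 'x' and 'round_trip' are just ones I picked for illustration.

```r
dnorm(0)        # density of the standard normal at 0
pnorm(1.96)     # P(X <= 1.96), roughly 0.975
qnorm(0.975)    # the quantile function undoes pnorm: roughly 1.96
x = rnorm(5)    # five random draws (different on every run)

# 'q' is the inverse of 'p', so this round trip gets back 0.5
# (up to floating-point error)
round_trip = qnorm(pnorm(0.5))
round_trip
```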

Now would be a good time to tell you that if you need help with any function, such as what parameters it accepts, just type '?function' at the console. For example,

> ?dunif


will tell you about the density function for the uniform distribution. Let's see how to use some of these density functions. The typical usage for 'dnorm' is

> dnorm(x, mean = 0, sd = 1, log = FALSE)


Evaluating it at zero gives:

> dnorm(0)
[1] 0.3989423


If you'll recall, the density function for the normal distribution with mean zero and standard deviation one is
$$f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$$
Then 'dnorm(0)' is just evaluating this function at zero. Notice that we didn't actually specify 'mean' or 'sd' but they took on the default values in R. The default values are indicated in the help. By typing '?dnorm' you'll see that the function is written with 'mean = 0'. That means the default 'mean' for 'dnorm' is zero.
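As a sanity check, the formula can be typed in directly and compared with 'dnorm'. This is only a sketch: 'f' is a name I made up for the hand-written density, and the non-default 'mean' and 'sd' values are arbitrary.

```r
# Hand-written standard normal density, straight from the formula above
f = function(x) 1 / sqrt(2 * pi) * exp(-x^2 / 2)

f(0)                            # 0.3989423, same as dnorm(0)
all.equal(f(1.3), dnorm(1.3))   # TRUE

# Overriding the defaults: density at 10 for mean 10 and sd 2.
# Doubling sd halves the height of the peak.
dnorm(10, mean = 10, sd = 2)    # equals dnorm(0) / 2
```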

For more on distributions, type '?Distributions' in R.

## Random number generation

One of the cool things about R is that it can generate a bunch of random numbers for each of the distributions we talked about. Of course, all you really need is a uniform random number generator, but it's good that these other ones are built into R for convenience. For example:

> rnorm(10)
[1] -1.4569007  0.8524113  0.2940385  0.5111377  1.6543332 -0.8684520
[7]  2.0536998 -0.3351626 -2.0603866 -0.9382230


This gave me ten numbers from a normal distribution with mean zero and variance one (the 'standard normal'). This calculation:

> pnorm(0.5)
[1] 0.6914625


evaluates the cumulative distribution function of the standard normal. This means that the probability of a standard normal random variable being less than or equal to 0.5 is about 0.6914625. Let's see if the random number generator gives something believable:

> sum(rnorm(1000) <= 0.5)
[1] 696


This gives us an 'experimental' probability of 0.696. Pretty good. I should tell you how this code works. The boolean expression

rnorm(1000) <= 0.5


generates a vector of 1000 i.i.d. samples from the standard normal, and then returns a new vector that has TRUE in the places where the original vector is less than or equal to 0.5, and FALSE in the places where it is greater than 0.5. This may explain it better:

> c(1,2,3,4,5) > 2
[1] FALSE FALSE  TRUE  TRUE  TRUE


The function 'sum()' just sums up all the values in any vector. If the vector consists of TRUE and FALSE, then FALSE is treated as zero and TRUE is treated as one. For example:

> sum(c(1,2,3,4,5) > 2)
[1] 3


In other words, there are three numbers in the vector c(1,2,3,4,5) greater than two. This illustrates R's powerful vector handling mechanism.
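Since TRUE counts as one, taking the 'mean' of a logical vector gives the proportion of TRUE entries directly, which saves dividing by 1000 by hand. Here is a sketch of the earlier experiment written that way. The call to 'set.seed' is my addition so that the run is repeatable, the seed value is arbitrary, and 'samples' and 'below' are names I chose.

```r
set.seed(1)                 # fix the random number stream (arbitrary seed)
samples = rnorm(1000)       # 1000 draws from the standard normal
below = samples <= 0.5      # logical vector: TRUE where the draw is <= 0.5

sum(below)                  # how many draws were <= 0.5
mean(below)                 # the same count as a proportion
pnorm(0.5)                  # theoretical value to compare against: 0.6914625
```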

## Statistical Tests

The heart of R is statistics. Part of statistics is hypothesis testing. How do you find out what tests are included with R? R has a way of listing all the functions whose names contain a certain word: the 'apropos()' function. For example:

> apropos('test')
[1] "ansari.test"             "bartlett.test"
[3] "binom.test"              "Box.test"
[5] "chisq.test"              "cor.test"
[7] "file_test"               "fisher.test"
[9] "fligner.test"            "friedman.test"
[11] "kruskal.test"            "ks.test"
[13] "mantelhaen.test"         "mauchly.test"
[15] "mcnemar.test"            "mood.test"
[17] "oneway.test"             "pairwise.prop.test"
[19] "pairwise.t.test"         "pairwise.wilcox.test"
[21] "poisson.test"            "power.anova.test"
[23] "power.prop.test"         "power.t.test"
[25] "PP.test"                 "prop.test"
[29] "shapiro.test"            "testInheritedMethods"
[31] "testPlatformEquivalence" "testVirtual"
[33] "t.test"                  ".valueClassTest"
[35] "var.test"                "wilcox.test"


If you've ever taken a statistics course, you will have heard of many of these tests, such as t.test and chisq.test. Again, the question mark is your friend. Type '?t.test' to see its syntax. Many of these functions not only test hypotheses, but automatically give you confidence intervals as well!
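Here is a minimal sketch of a one-sample t-test on simulated data, so it runs without any files. The sample size, true mean and seed are made up for illustration, and 'result' is just a variable name I chose.

```r
set.seed(1)                        # arbitrary seed, for repeatability
x = rnorm(30, mean = 1, sd = 2)    # simulated sample with true mean 1

# Test the null hypothesis that the population mean is 0
result = t.test(x, mu = 0)

result            # prints the test statistic, p-value and confidence interval
result$p.value    # the pieces can also be pulled out individually
result$conf.int   # 95% confidence interval for the mean (the default level)
```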

## Importing Data

So far, we've covered what you can do to data but not how you get data into R. The simplest function is 'scan'. Let's say I had a file 'data.txt' that looked like this:

1
5
2
5


To load these numbers into R, I would type:

a = scan('data.txt')


This will store the vector 'c(1,5,2,5)' in the variable 'a'. If the data points are separated by tabs instead, you could type

a = scan('data.txt',sep='\t')


If instead you have data in columns as it appears in a spreadsheet, you need to use the 'read.table' function:

a = read.table('data.txt')


Instead of a vector, 'a' now has a 'data frame' type. It's basically like a spreadsheet with columns. You might have to pass additional arguments to 'read.table' depending on the format of your data. Use '?read.table' to see how to use 'read.table'. For example, the 'header' parameter controls whether the first line of the file is read as column labels. The boolean values in R are TRUE and FALSE, which can be abbreviated to T and F respectively. Be aware that T and F are ordinary variables that can be reassigned, so spelling out TRUE and FALSE is safer in scripts.
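To try 'read.table' without hunting for a data file, the sketch below first writes a small made-up table to a temporary file and then reads it back. The column names 'height' and 'weight', the numbers, and the variable names 'path' and 'd' are all invented for the example.

```r
# Write a tiny two-column file with a header row (made-up data)
path = tempfile(fileext = ".txt")
writeLines(c("height weight",
             "170 65",
             "182 80",
             "165 58"), path)

# header = TRUE tells read.table that the first line holds column names
d = read.table(path, header = TRUE)

d               # shows the data frame
d$height        # one column, as an ordinary vector
mean(d$weight)  # vector functions work on columns
```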

## Installing new packages

R comes with a good truckload of tests and modeling routines, such as generalised linear models. But what if it doesn't have your favourite test, model, or function? That's what installing new packages is for, and doing it is dead easy. Just type

> install.packages()


at the console and you'll be presented with a rudimentary-looking but functional interface to select new packages. Once you find the package you want, click it and install it!
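If you already know the name of the package, you can pass it to 'install.packages' directly and skip the menu. The sketch below uses 'MASS' purely as an example name. The install line is commented out because it needs a network connection, and the check afterwards just reports whether the package is available on your machine; 'have_mass' is a name I made up.

```r
# install.packages("MASS")   # downloads and installs from CRAN (run once)

# Installing is not the same as loading: a package has to be attached
# with library() in each session before its functions can be used.
if (requireNamespace("MASS", quietly = TRUE)) {
  library(MASS)
  have_mass = TRUE
} else {
  have_mass = FALSE          # not installed on this machine
}
have_mass
```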

## Concluding Remarks

Still here? Amazing. Hopefully that whetted your appetite for more. If so, you may want to check out Modern Statistics for the Social and Behavioral Sciences by Rand Wilcox. This book will tell you how to do all the basic stats stuff in R. Wilcox's book covers a huge range of topics rather than being comprehensive on a few, and will give you a great start. If you're interested in some specific topic, there are many books that use R for their examples. If you're into linear modeling, Alan Agresti's book Categorical Data Analysis is great, and is the book I used when I studied this topic as an undergraduate. While it doesn't use R in the book itself, the book's website has supplementary material on using R with the methods described in the book.