Variables, data frames, indexing, functions etc¶

Computer representation of numbers¶

Real numbers are not stored exactly on computers. They are stored using a binary version of "scientific" notation (the decimal analogue is e.g. $1.234 \times 10^2$), so only a finite number of digits is kept. This needs care, e.g.

x = seq(0, 0.5, 0.1) ##generate a sequence from 0 to 0.5 in steps of 0.1
x ##Look at x

Is x equal to (0, 0.1, 0.2, 0.3, 0.4, 0.5)? To find out, type

x == c(0, 0.1, 0.2, 0.3, 0.4, 0.5)

This is known as FAQ 7.31: see [https://cran.r-project.org/doc/FAQ/R-FAQ.html].
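Exact comparison with == can fail for values that look identical when printed. A minimal sketch of the usual workaround is to compare up to a small tolerance; the name target and the tolerance 1e-8 are choices made for this illustration only.

```r
x = seq(0, 0.5, 0.1)
target = c(0, 0.1, 0.2, 0.3, 0.4, 0.5)

all.equal(x, target)          # TRUE if equal up to a small tolerance
isTRUE(all.equal(x, target))  # safe to use inside if()
abs(x - target) < 1e-8        # element-wise comparison with an explicit tolerance
```

all.equal() returns a description of the difference rather than FALSE when the values differ, which is why it is wrapped in isTRUE() for use in conditions.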

Rounding problems¶

Tiny inaccuracies can accumulate:

The sample variance of a vector x is often calculated as

$\mathrm{var}(x) = (\sum x^2 - n\bar{x}^2) / (n-1)$

Try it out and compare it with var():

# define a function myvar(x) that estimates the variance using the formula above
myvar = function(x) (sum(x^2) - length(x) * mean(x)^2) / (length(x) - 1)
x = 1:100
myvar(x)
var(x)
x = 1:100 + 1e10 # same spread, shifted by a large constant
myvar(x)
var(x)

Can you see why there was a problem?
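One way to see it: for the shifted vector, sum(x^2) and n * mean(x)^2 are both of order $10^{22}$, while doubles carry only about 16 significant digits, so their small difference is lost in rounding. A common remedy, sketched here under the hypothetical names myvar2 and big, is to subtract the mean first and then square the (small) deviations.

```r
# Two-pass formula: centre the data first, then sum the squared deviations
myvar2 = function(x) sum((x - mean(x))^2) / (length(x) - 1)

big = 1:100 + 1e10   # same data as above, shifted by a large constant
myvar2(big)          # agrees with var(big)
var(big)
```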

Variables¶

Basic Types of Variables

Variables are the equivalent of memories in your calculator, but you can have (almost!) unlimited quantities of them, they have names of your choosing, and they come in different types. The basic types are

integer
double
character
logical: these take one of the two values TRUE or FALSE (or NA, see later)
factor or categorical
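As a quick check, class() and typeof() report how R classifies a value. The values below are arbitrary examples chosen for this sketch.

```r
class(3L)        # "integer": the L suffix asks for an integer
class(3)         # "numeric": plain numbers are stored as doubles
typeof(3)        # "double"
class("three")   # "character"
class(TRUE)      # "logical"
class(factor(c("a", "b", "a")))  # "factor"
```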

More Specialised Classes¶

As well as the basic types of variables, R recognises many more complicated objects such as

vectors, matrices, arrays: groups of objects all of the same type
lists of other objects which may be of different types
Specialised objects such as Dates, Linear Model fits

Special values of objects¶

There are some types of data which need to be treated specially in calculations:

NA The value NA is given to any data which R knows to be missing. This is not a character string, so a character string with value "NA" will be treated differently from one with the value NA.
Inf The result of e.g. dividing any non-zero number by zero
NaN The result of e.g. attempting to find the logarithm of a negative number.
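These values can be detected with is.na(), is.nan() and is.finite(). The short sketch below uses a made-up vector v to show how NA propagates through calculations and how the string "NA" differs from a missing value.

```r
v = c(1, NA, 3)
is.na(v)                       # FALSE TRUE FALSE
mean(v)                        # NA: missing values propagate
mean(v, na.rm = TRUE)          # 2: drop missing values first
is.na(c("NA", NA))             # FALSE TRUE: the string "NA" is not missing
is.nan(0 / 0)                  # TRUE
is.finite(c(1, Inf, NA, NaN))  # TRUE FALSE FALSE FALSE
```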

as.numeric(c("a", "1"))
x / 0
log(-x)

Warning message in eval(expr, envir, enclos):
"NAs introduced by coercion"

Warning message in log(-x):
"NaNs produced"

Factors¶

Factors are variables which can only take one of a finite set of discrete values. They naturally occur as vectors, and can be

numeric e.g. drug doses with values 1mg, 2mg, 5mg
or character e.g. voting intention with values Liberal Democrat, Conservative, Labour, Other

Although factors are stored as numbers, along with the label corresponding to each number, they cannot be treated as numeric. Would it make sense to ask R to calculate mean(voting intention)?

A more useful function for factors is table which will count how many of each value occur in the vector.

Ordered Factors¶

Some factor variables have a natural ordering. Drug doses do, but voting intentions usually do not. R will treat the two types differently. It is important not to allow R to treat non-ordered factors as ordered ones, since the results could be meaningless.
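A small sketch of the difference, using made-up dose and party vectors: an ordered factor is created by listing the levels in order and setting ordered = TRUE, after which comparisons such as < become meaningful.

```r
dose = factor(c("1mg", "5mg", "2mg", "1mg"),
              levels = c("1mg", "2mg", "5mg"), ordered = TRUE)
dose
dose < "5mg"       # TRUE FALSE TRUE TRUE: comparisons use the level order
max(dose)          # 5mg

party = factor(c("Labour", "Other", "Labour"))
is.ordered(dose)   # TRUE
is.ordered(party)  # FALSE: no ordering is implied
```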

Creating factors¶

Use cut to create factor variables from continuous ones:

age <- runif(100) * 50
table(cut(age, c(0, 10, 20, 30, 40, 50)))

 (0,10] (10,20] (20,30] (30,40] (40,50] 
     13      20      27      20      20

The function factor() can be used to create factor variables from characters.
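For instance, a minimal sketch with a made-up vector of voting intentions (votes and vote_f are names chosen for this illustration):

```r
votes = c("Labour", "Other", "Conservative", "Labour", "Liberal Democrat", "Labour")
vote_f = factor(votes)
levels(vote_f)       # the distinct values, sorted alphabetically by default
table(vote_f)        # counts per level
as.integer(vote_f)   # the underlying integer codes
```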

Data frames¶

Data frames store data which is a collection of observations (rows) of a set of variables (columns), e.g. book titles and prices.
They are similar to a matrix, but variables in different columns can have different types.
There is always the same number of entries in each row, although some may be missing (NA).
They can be formed by reading in data, e.g. from a spreadsheet, or constructed using the function data.frame.
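As an illustration of the second route, here is a sketch that builds a small data frame by hand. The titles and prices are invented for this example.

```r
books = data.frame(
    title = c("Book A", "Book B", "Book C"),
    price = c(12.99, 8.50, NA),   # NA marks a missing price
    stringsAsFactors = FALSE      # keep titles as character strings
)
books
str(books)     # one line per column, with its type
nrow(books)    # 3 observations
ncol(books)    # 2 variables
```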

You will need the example chicken weights CSV file (chickwt.csv).

Download the file to the folder where you are running R.

mydata = read.csv('chickwt.csv')
head(x = mydata)

Lists¶

A data frame is a kind of list, which is a vector of objects of possibly different types. For example, an employee record might be created by

Empl = list(employee = "Anna", spouse = "Fred", children = 3,
             child.ages = c(4, 7, 9))
Empl

Try the following:

Empl[[1]]
Empl$spouse

Empl[[4]]
length(Empl[[4]]) # numeric vector of length 3

Empl[4]
length(Empl[4]) # list of length 1
Empl[2:4]  # this works

Components are always numbered and may be referred to by number, e.g. Empl[[1]]. If they are named, they can also be referred to by name using the \$ operator, e.g. `Empl$spouse`.
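The difference between [[ ]] and [ ], and the fact that a data frame behaves like a list of columns, can be checked directly. The objects rec and small_df below are made up for this sketch.

```r
rec = list(name = "Anna", ages = c(4, 7, 9))
class(rec[["ages"]])   # "numeric": [[ ]] extracts the component itself
class(rec["ages"])     # "list": [ ] returns a sub-list
rec$pet = "cat"        # assigning to a new name adds a component
names(rec)

small_df = data.frame(a = 1:2, b = c("x", "y"))
is.list(small_df)      # TRUE: each column is a component of the list
small_df[["a"]]        # so list-style extraction works on columns
```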

Keeping track of objects¶

Once you have created some objects, how do you remind yourself what you called them?

ls()
str(Empl)

List of 4
 $ employee  : chr "Anna"
 $ spouse    : chr "Fred"
 $ children  : num 3
 $ child.ages: num [1:3] 4 7 9

ls() lists the names of all the objects in your workspace and str() gives information about a specific object.

Operators¶

Arithmetic Operators: +, -, /, *, ^.

3^2
10 %% 3 # modulo reduction
10 %/% 3 # integer division

a = matrix(1:4, nrow=2)
b = matrix(c(2, 1, 2, 4), nrow=2)
a %*% b # matrix multiplication

Operators can also be used on vectors, with recycling if necessary.

x = c(1, 2, 3, 4)
y = c(5, 6)
x + 3 # gives (4, 5, 6, 7): 3 is recycled four times
x + y # gives (6, 8, 8, 10): y is recycled twice
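Recycling also happens when the longer length is not a multiple of the shorter one, but R then issues a warning. A short sketch with arbitrary numbers:

```r
c(1, 2, 3, 4) + c(10, 20)       # 11 22 13 24: the shorter vector is used twice
c(1, 2, 3, 4) + c(10, 20, 30)   # 11 22 33 14, with a warning about the lengths
```

The warning is usually a sign of a mistake in the vector lengths rather than something to rely on.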

Logical operators¶

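== (equal), != (not equal), >, <, >=, <=
! (not), | (or), || (or), & (and), && (and)
|, & work element-wise on vectors
|| and && expect single values (in R 4.3.0 and later, an argument of length greater than one is an error; older versions silently used only the first element)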

Examples:

x = c(TRUE, FALSE, TRUE)
y = c(FALSE, TRUE, TRUE)

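x | y # element-wise or
x[1] || y[1] # || expects single values; x || y is an error in R >= 4.3.0

x & y # element-wise and
x[1] && y[1]
x[3] && y[3]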

The && operator is also a "short-circuit and", i.e. it won't evaluate its second argument if the first argument is FALSE.
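A minimal sketch of short-circuit evaluation: noisy() is a hypothetical helper, defined only for this illustration, that prints a message whenever it is evaluated.

```r
# noisy() reports when it is evaluated, then returns TRUE
noisy = function() {
    cat("noisy() was evaluated\n")
    TRUE
}

FALSE && noisy()   # FALSE; noisy() is never called
FALSE & noisy()    # FALSE; noisy() is called, because & evaluates both sides
TRUE || noisy()    # TRUE; noisy() is never called

# A common use: guard a test that would fail on a NULL value
v = NULL
if (!is.null(v) && v > 0) print("positive")
```

Without the short circuit, the condition v > 0 would be evaluated on NULL and if() would stop with an error.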

Dates¶

Example:

myDate = as.Date('10-Jan-1993', format="%d-%b-%Y")

class(myDate)
as.numeric(myDate)

myDate2 = as.Date('10-Jan-1994', format="%d-%b-%Y")

myDate2-myDate # can subtract two dates

Time difference of 365 days
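A few more operations on dates, as a sketch: the variable d and the chosen formats are for illustration only. Dates written as yyyy-mm-dd are parsed without a format argument, which also avoids the locale-dependent month names used by %b.

```r
d = as.Date("1993-01-10")           # ISO format needs no format argument
d + 30                              # "1993-02-09": adding a number adds days
format(d, "%d/%m/%Y")               # "10/01/1993"
as.numeric(as.Date("1970-01-01"))   # 0: dates are stored as days since this origin
difftime(as.Date("1994-01-10"), d, units = "weeks")
```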

Some useful functions¶

Look at the following functions (you can use R help to see what they do using "?funcname"):

c(1, "a")
1:5
c(1,2,3,4,5)
seq(1, 10, by=2) #sequence from 1 to 10 in steps of 2
rep(c(1, 2), times=3)  # replicate (1,2) 3 times
rep(c(1, 2), each=3)   # replicate (1,2) first entry 3 times and second 3 times 
rep(c(1, 2), c(2, 3))  # replicate (1,2) first entry 2 times and second 3 times


paste(c(1, 2), c('x', 'y', 'z')) # create vector of concatenations of the two vectors
paste(c(1, 2), c('x', 'y', 'z'), collapse=' ') # create the previous vector and concatenate its entries into a single string

sort(mydata$weight) # sort ascending
sort(mydata$weight, decreasing=TRUE) # sort descending

table(rpois(20, 5)) # tabulate 20 random Poisson variables with rate parameter lambda=5

 0  2  3  4  5  6  7  9 10 
 1  4  3  3  3  1  3  1  1

Matrices, arrays and indexing¶

(mymat = matrix(1:12, 3, 4)) # Entries go down columns unless you specify byrow=TRUE.
dim(mymat)
myarr = mymat
dim(myarr) = c(3,2,2) # creating an array
myarr
myarr[,,1] # use indexing to select the first slice along the array's third dimension

x = c(2,4,6,8,10,12)
names(x) = c("a", "b", "c", "d", "e", "f")

## Use different types of indices
x[c(1,3,6,5)] 
x[c("a","c","f", "e")] ## gives same as above
x[c(TRUE,FALSE,TRUE,rep(FALSE,3))] # can also use logical vectors to select entries
x[c(-1,-4)] ## exclude first and fourth entry

y = x
(y[] = 0) ## empty index selects everything; useful for replacing all entries of a vector
names(y) ## will be the same as before

# compare with:
y = x
(y = 0) ## y is now the single number 0 and the names are lost

# Recycling:
x[c(1,3)] = 4.5 # the right-hand side is recycled if the sub-vector selected for replacement is longer than it

(x[10] = 8) # assigning to an index greater than the length of the vector extends it, filling the gap with NAs

x[11] # indexing beyond the end returns NA

# indexing matrices and arrays
mymat[1:2, -2]
mymat[mymat>1] = NA # note: no comma

mymat = matrix(1:12, 3, 4)
mymat[cbind(rep(1,3), c(2,3,4))] = NA

# If the result has length 1 in any dimension, this is dropped unless you use the argument drop=FALSE:
mydata[1:2, 1] #is a vector
class(mydata[1:2, 1])

mydata[1, 1:2] #is a data.frame
class(mydata[1, 1:2])
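For a matrix, the effect of drop=FALSE can be seen from the dimensions of the result. This sketch uses a separate matrix m so that the objects above are left untouched.

```r
m = matrix(1:12, 3, 4)
m[1, ]                      # a plain vector of length 4
dim(m[1, ])                 # NULL: the row dimension was dropped
m[1, , drop = FALSE]        # still a 1 x 4 matrix
dim(m[1, , drop = FALSE])   # 1 4
```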

Indexing data frames¶

Data frames can be indexed like matrices, but dimensions are only dropped if you select from a single column, not if you select from a single row.
If you select rows from a data frame that has only one column, the result will be a vector unless you use drop=FALSE.
You will often want to select the rows of a data frame which meet some criterion; use logical indexing for this.

mydata[1,]
attributes(mydata[,2])

mydata[mydata$weight> 400,] #index by rows where weight is greater than 400
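The same ideas on a self-contained example: the data frame chicks below contains made-up rows (it is not the chickwt.csv file), with assumed column names weight and feed.

```r
chicks = data.frame(weight = c(179, 423, 260, 404),
                    feed = c("horsebean", "sunflower", "linseed", "casein"))

chicks[chicks$weight > 400, ]                      # rows meeting the criterion, all columns
chicks[chicks$weight > 400, "feed"]                # a single column is dropped to a vector
chicks[chicks$weight > 400, "feed", drop = FALSE]  # stays a data frame
subset(chicks, weight > 400 & feed != "casein")    # subset() is an alternative
```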

More Examples on matrices: Lower triangular, adding matrices, eigenvalues, column sums, ...¶

Check out the following:

mymat = matrix(1:12, nrow=3)
mymat
mymat2 = matrix(1:12, nrow=3, byrow=TRUE)
mymat2
mymat + mymat2 #componentwise addition
mymat %*% t(mymat2) #matrix multiplication

mysq = matrix(rnorm(9), nrow=3)
solve(mysq) #invert mysq

mysym = mysq
mysym[lower.tri(mysym)] = t(mysym)[lower.tri(mysym)] #make symmetric version of mysq by copying the upper triangle into the lower one (works for any size)
eigen(mysym) #eigen decomposition
colSums(mymat) #column sums

eigen() decomposition
$values
[1]  0.634857 -0.660316 -1.247812

$vectors
           [,1]       [,2]       [,3]
[1,] -0.6416955  0.7622148 0.08517897
[2,] -0.5820237 -0.5562791 0.59312901
[3,]  0.4994750  0.3310321 0.80058886
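The output of eigen() can be checked by rebuilding the matrix from its eigenvalues and eigenvectors. This is a sketch on a freshly generated symmetric matrix s; the seed is arbitrary and only there for reproducibility.

```r
set.seed(1)
s = matrix(rnorm(9), nrow = 3)
s[lower.tri(s)] = t(s)[lower.tri(s)]   # copy the upper triangle into the lower one
e = eigen(s)
V = e$vectors

all.equal(V %*% diag(e$values) %*% t(V), s)   # TRUE: the decomposition rebuilds s
all.equal(crossprod(V), diag(3))              # TRUE: the eigenvectors are orthonormal
```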

Functions¶

Writing simple functions:

x = rnorm(100, mean=0.3, sd=1.2) # create a vector of 100 i.i.d. random numbers from a normal distribution with mean 0.3 and standard deviation 1.2


std.dev = function(x) sqrt(var(x)) # function to calculate the standard deviation of a vector

# function to calculate the two-tailed p-value of a t.test
# note that function arguments can have default values
t.test.p = function(x, mu=0)  {
    n = length(x)
    t = sqrt(n) * (mean(x) - mu) /std.dev(x)
    2 * (1 - pt(abs(t), n - 1)) # the object of the final line will be returned
}

std.dev(x)
t.test.p(x) # this will use the default value for mu
t.test.p(mu=1, x=x) 
t.test.p(x, 1)
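A hand-written p-value like this can be checked against R's built-in t.test(). The sketch below repeats the calculation under the hypothetical name p_manual so that it runs on its own; the seed and the sample smp are arbitrary.

```r
set.seed(1)
smp = rnorm(100, mean = 0.3, sd = 1.2)

# Same calculation as above, written with the built-in sd()
p_manual = function(x, mu = 0) {
    n = length(x)
    tstat = sqrt(n) * (mean(x) - mu) / sd(x)
    2 * (1 - pt(abs(tstat), n - 1))
}

p_manual(smp)
t.test(smp)$p.value            # built-in one-sample t-test, mu = 0 by default
p_manual(smp, mu = 1)
t.test(smp, mu = 1)$p.value
```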

Flow control: if, for, while, repeat¶

#function generates 3 samples of size n from U(0,1), stores their means, and returns the mean of those stored means that exceed 0.2, using conditional indexing
myfn = function(n=100)  
{
    tmp = rep(NA, 3)
    tmp[1] = mean(runif(n))
    tmp[2] = mean(runif(n))
    tmp[3] = mean(runif(n))
    mean(tmp[tmp > .2])
}
set.seed(1)
myfn()
myfn(1000)

Control flow: if¶

Example

#function does the same things as the previous function but using if statements instead of conditional indexing
myfna = function(n=100)
{
    tmp = rep(NA, 3)
    x <- mean(runif(n))
    if (x > 0.2)
        tmp[1] = x
    x = mean(runif(n))
    if (x > 0.2)
        tmp[2] = x
    x = mean(runif(n))
    if (x > 0.2)
        tmp[3] = x
    mean(tmp, na.rm=TRUE)
}
set.seed(1)
myfna()
myfna(1000)

Control Flow: For¶

#generates n samples of size obs of U(0,1) variables and saves means in x. Returns mean and standard deviation of x 
myfn1 = function(obs=10, n=100) 
{
    x = rep(NA, n)
    for (i in 1:n)
    {
        tmp = runif(obs)
        x[i] = mean(tmp)
    }
    c(mn=mean(x), std=sd(x))
}
set.seed(1)
myfn1()
myfn1(1000) # the first argument is obs, so this uses samples of size 1000

The while and repeat loops don't require loop variables.
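A short sketch of both constructs follows. The variable names, the threshold and the seed are arbitrary choices for this illustration.

```r
# while: keep halving until the value drops below 1
val = 100
steps = 0
while (val >= 1) {
    val = val / 2
    steps = steps + 1
}
steps   # 7, since 100 / 2^7 < 1 <= 100 / 2^6

# repeat: loops forever unless break is reached
set.seed(1)
draws = 0
repeat {
    draws = draws + 1
    if (runif(1) > 0.9) break   # stop at the first draw above 0.9
}
draws   # number of draws needed
```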

The function ifelse¶

The function ifelse reduces the need for loops and can make code more efficient. Example:

x = c(0, 1, 1, 2)
y = c(44, 45, 56, 77)

z = rep(NA, 4)
for (i in seq_along(x))
{
    if (x[i] > 0)
        z[i] <- y[i] / x[i]
    else
        z[i] <- y[i] / 99
}
z

This can be replaced by:

(z = ifelse(x > 0, y / x, y / 99))
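One point worth knowing, sketched below with made-up vectors: ifelse() works on whole vectors, so both alternatives are computed in full and the test only decides which result is kept for each element. Values that are never used can therefore still produce warnings.

```r
xs = c(0, 1, 1, 2)
ys = c(44, 45, 56, 77)

ys / xs                            # Inf 45 56 38.5: computed for every element
ifelse(xs > 0, ys / xs, ys / 99)   # the Inf is not used, because xs[1] > 0 is FALSE

v = c(4, -1, 9)
ifelse(v >= 0, sqrt(v), NA)        # 2 NA 3, but sqrt(-1) still triggers a warning
```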

Exercise 1

Create a vector containing all the dates in 2007, using seq and as.Date.
There is a version of cut for dates, called cut.Date. Use this to create a factor with values corresponding to the date of the first day of the week in which each of these dates falls. Start the weeks on Sundays.
Create x, a vector of length 100, with integer values in the range $1:5$, randomly ordered. (Hint: look at the function sample.)
Use paste to create a vector of labels: ("Colour 1", "Colour 2", "Colour 3", "Colour 4", "Colour 5")
Use the factor command to create a factor from the vector x, with the labels created above.
Create a data frame with 100 rows and two columns, one containing a random sample of the vector of dates created above, and the other containing the factor vector of colour names.
Select the rows for which the date is after 1st June 2007.

Exercise 2

Generate a matrix with 10 rows and 5 columns, with random entries between 0 and 10. (Hint: look at runif)
Write a function using for to calculate the column means of the matrix.
Extract the even rows from the matrix.