Variables, data frames, indexing, functions etc

Computer representation of numbers

Most real numbers cannot be stored exactly on a computer. They are held in a binary version of "scientific" notation (the decimal analogue is e.g. $1.234 \times 10^2$), so only a limited number of digits is kept. This needs care, e.g.

In [1]:
x = seq(0, 0.5, 0.1) ##generate a sequence from 0 to 0.5 in steps of 0.1
x ##Look at x
  1. 0
  2. 0.1
  3. 0.2
  4. 0.3
  5. 0.4
  6. 0.5

Is x equal to (0, 0.1, 0.2, 0.3, 0.4, 0.5)? To find out, type

In [2]:
x == c(0, 0.1, 0.2, 0.3, 0.4, 0.5)
  1. TRUE
  2. TRUE
  3. TRUE
  4. FALSE
  5. TRUE
  6. TRUE

This is known as FAQ 7.31: see [https://cran.r-project.org/doc/FAQ/R-FAQ.html].
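A common way round this is to compare numbers up to a small tolerance instead of using ==. The lines below are a minimal sketch using all.equal(); the variable name target is only illustrative.

```r
x <- seq(0, 0.5, 0.1)
target <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5)

# all.equal() compares numbers up to a small tolerance (about 1.5e-8 by default);
# wrap it in isTRUE() because it returns a description when the values differ
isTRUE(all.equal(x, target))

# element-wise comparison with an explicit tolerance
abs(x - target) < 1e-8

# the size of the discrepancy in the fourth element (tiny, but not zero)
x[4] - target[4]
```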

Rounding problems

Tiny inaccuracies can accumulate:

The sample variance of a vector x is often calculated as

$\mathrm{var}(x) = \left(\sum x_i^2 - n\bar{x}^2\right) / (n-1)$

Try it out and compare it with var():

In [3]:
# define a function myvar(x) that estimates the variance with the one-pass formula above
myvar = function(x) (sum(x^2) - length(x) * mean(x)^2) / (length(x) - 1)
x = 1:100
myvar(x)
var(x)
x = 1:100 + 10000000000 # shift every value by 1e10; the true variance is unchanged
myvar(x)
var(x)
841.666666666667
841.666666666667
0
841.666666666667

Can you see why there was a problem?
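One way to see the problem: for the shifted data, sum(x^2) and n * mean(x)^2 are both of order 1e22, far beyond the roughly 16 significant digits that a double holds, so their small difference is lost to rounding. A sketch of a more stable alternative subtracts the mean before squaring. The name myvar2 is made up for this illustration and is not part of the original notes.

```r
# two-pass formula: centre the data first, then sum the squared deviations
myvar2 <- function(x) sum((x - mean(x))^2) / (length(x) - 1)

x <- 1:100 + 1e10
myvar2(x)   # agrees with var(x)
var(x)
```

The deviations x - mean(x) are small numbers, so squaring and summing them does not run out of digits.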

Variables

Basic Types of Variables

Variables are the equivalent of the memories in your calculator, but you can have an (almost!) unlimited number of them, they have names of your choosing, and they come in different types. The basic types are

  • integer
  • double
  • character
  • logical: these take one of the two values TRUE or FALSE (or NA, see later)
  • factor or categorical
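The types listed above can be inspected with typeof() and class(). The following is a small sketch; the values are arbitrary examples.

```r
typeof(1L)      # "integer" (the L suffix requests an integer)
typeof(1)       # "double"  (plain numbers are doubles by default)
typeof("a")     # "character"
typeof(TRUE)    # "logical"

f <- factor(c("low", "high", "low"))
class(f)        # "factor"
typeof(f)       # "integer": a factor is stored as integer codes plus labels
```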

More Specialised Classes

As well as the basic types of variables, R recognises many more complicated objects such as

  • vectors, matrices, arrays: groups of objects all of the same type

  • lists of other objects which may be of different types

  • Specialised objects such as Dates, Linear Model fits
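As a rough illustration of the objects listed above, the sketch below builds one of each. The object names are arbitrary, and cars is a small data set shipped with R.

```r
v <- c(1.5, 2, 3)                      # vector
m <- matrix(1:6, nrow = 2)             # matrix with 2 rows and 3 columns
a <- array(1:24, dim = c(2, 3, 4))     # three-dimensional array
l <- list(1, "a", TRUE)                # list: each component keeps its own type

mixed <- c(1, "a", TRUE)               # a vector has a single type,
mixed                                  # so everything is coerced to character

class(Sys.Date())                      # "Date"
fit <- lm(dist ~ speed, data = cars)   # a linear model fit
class(fit)                             # "lm"
```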

Special values of objects

There are some types of data which need to be treated specially in calculations:

  • NA The value NA is given to any data which R knows to be missing. This is not a character string, so a character string with value "NA" will be treated differently from one with the value NA.

  • Inf The result of e.g. dividing any non-zero number by zero

  • NaN The result of e.g. attempting to find the logarithm of a negative number.

In [4]:
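as.numeric(c("a", "1")) # "a" cannot be converted to a number, so it becomes NA
x / 0   # x is still the vector of 100 large numbers from the variance example
log(-x) # logarithm of negative numbers gives NaN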
Warning message in eval(expr, envir, enclos):
"NAs introduced by coercion"
  1. <NA>
  2. 1
  1. Inf
  2. Inf
  3. Inf
  4. Inf
  5. Inf
  6. Inf
  7. Inf
  8. Inf
  9. Inf
  10. Inf
  11. Inf
  12. Inf
  13. Inf
  14. Inf
  15. Inf
  16. Inf
  17. Inf
  18. Inf
  19. Inf
  20. Inf
  21. Inf
  22. Inf
  23. Inf
  24. Inf
  25. Inf
  26. Inf
  27. Inf
  28. Inf
  29. Inf
  30. Inf
  31. Inf
  32. Inf
  33. Inf
  34. Inf
  35. Inf
  36. Inf
  37. Inf
  38. Inf
  39. Inf
  40. Inf
  41. Inf
  42. Inf
  43. Inf
  44. Inf
  45. Inf
  46. Inf
  47. Inf
  48. Inf
  49. Inf
  50. Inf
  51. Inf
  52. Inf
  53. Inf
  54. Inf
  55. Inf
  56. Inf
  57. Inf
  58. Inf
  59. Inf
  60. Inf
  61. Inf
  62. Inf
  63. Inf
  64. Inf
  65. Inf
  66. Inf
  67. Inf
  68. Inf
  69. Inf
  70. Inf
  71. Inf
  72. Inf
  73. Inf
  74. Inf
  75. Inf
  76. Inf
  77. Inf
  78. Inf
  79. Inf
  80. Inf
  81. Inf
  82. Inf
  83. Inf
  84. Inf
  85. Inf
  86. Inf
  87. Inf
  88. Inf
  89. Inf
  90. Inf
  91. Inf
  92. Inf
  93. Inf
  94. Inf
  95. Inf
  96. Inf
  97. Inf
  98. Inf
  99. Inf
  100. Inf
Warning message in log(-x):
"NaNs produced"
  1. NaN
  2. NaN
  3. NaN
  4. NaN
  5. NaN
  6. NaN
  7. NaN
  8. NaN
  9. NaN
  10. NaN
  11. NaN
  12. NaN
  13. NaN
  14. NaN
  15. NaN
  16. NaN
  17. NaN
  18. NaN
  19. NaN
  20. NaN
  21. NaN
  22. NaN
  23. NaN
  24. NaN
  25. NaN
  26. NaN
  27. NaN
  28. NaN
  29. NaN
  30. NaN
  31. NaN
  32. NaN
  33. NaN
  34. NaN
  35. NaN
  36. NaN
  37. NaN
  38. NaN
  39. NaN
  40. NaN
  41. NaN
  42. NaN
  43. NaN
  44. NaN
  45. NaN
  46. NaN
  47. NaN
  48. NaN
  49. NaN
  50. NaN
  51. NaN
  52. NaN
  53. NaN
  54. NaN
  55. NaN
  56. NaN
  57. NaN
  58. NaN
  59. NaN
  60. NaN
  61. NaN
  62. NaN
  63. NaN
  64. NaN
  65. NaN
  66. NaN
  67. NaN
  68. NaN
  69. NaN
  70. NaN
  71. NaN
  72. NaN
  73. NaN
  74. NaN
  75. NaN
  76. NaN
  77. NaN
  78. NaN
  79. NaN
  80. NaN
  81. NaN
  82. NaN
  83. NaN
  84. NaN
  85. NaN
  86. NaN
  87. NaN
  88. NaN
  89. NaN
  90. NaN
  91. NaN
  92. NaN
  93. NaN
  94. NaN
  95. NaN
  96. NaN
  97. NaN
  98. NaN
  99. NaN
  100. NaN
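Special values can be detected with is.na(), is.nan() and is.finite(); comparing with == does not work for NA. A short sketch on an arbitrary vector v:

```r
v <- c(1, NA, Inf, NaN)

is.na(v)       # FALSE  TRUE FALSE  TRUE  (NaN also counts as missing)
is.nan(v)      # FALSE FALSE FALSE  TRUE
is.finite(v)   #  TRUE FALSE FALSE FALSE

NA == NA       # NA, not TRUE: the comparison itself is unknown
-1 / 0         # -Inf
0 / 0          # NaN

mean(v[is.finite(v)])   # use only the ordinary numbers
```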

Factors

Factors are variables which can only take one of a finite set of discrete values. They naturally occur as vectors, and can be

  • numeric e.g. drug doses with values 1mg, 2mg, 5mg
  • or character e.g. voting intention with values Liberal Democrat, Conservative, Labour, Other

Although factors are stored as numbers, along with the label corresponding to each number, they cannot be treated as numeric. Would it make sense to ask R to calculate mean(voting intention)?

A more useful function for factors is table(), which counts how many times each value occurs in the vector.
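For instance, with a small made-up vector of voting intentions (the data are invented for illustration):

```r
votes <- factor(c("Labour", "Other", "Labour", "Conservative", "Labour"))

levels(votes)       # the distinct values, in alphabetical order by default
as.integer(votes)   # the integer codes stored behind the labels
table(votes)        # counts per level

# taking the mean is not meaningful: R returns NA (and warns if not suppressed)
suppressWarnings(mean(votes))
```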

Ordered Factors

Some factor variables have a natural ordering. Drug doses do, but voting intentions usually do not. R will treat the two types differently. It is important not to allow R to treat non-ordered factors as ordered ones, since the results could be meaningless.
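A sketch of an ordered factor, using the drug doses mentioned earlier as invented data. The levels argument fixes the order, and ordered=TRUE tells R that the order is meaningful.

```r
dose <- factor(c("2mg", "1mg", "5mg", "2mg"),
               levels = c("1mg", "2mg", "5mg"),
               ordered = TRUE)

dose            # levels are shown as 1mg < 2mg < 5mg
dose < "5mg"    # comparisons are allowed for ordered factors
max(dose)       # so are min() and max()
is.ordered(dose)
```

For an unordered factor, the comparison dose < "5mg" would give NA with a warning.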

Creating factors

Use cut to create factor variables from continuous ones:

In [5]:
age <- runif(100) * 50
table(cut(age, c(0, 10, 20, 30, 40, 50)))
 (0,10] (10,20] (20,30] (30,40] (40,50] 
     13      20      27      20      20 

The function factor() can be used to create factor variables from characters.

Data frames

  • For storing data which is a collection of observations (rows) of a set of variables (columns). E.g. book titles and prices.
  • Similar to a matrix but variables in different columns can have different types.
  • Always the same number of entries in each row, although some may be missing (NA).
  • Can be formed by reading in data e.g. from a spreadsheet, or constructed using the function data.frame.
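A minimal sketch of constructing a data frame by hand, using the book example; the titles and prices are invented.

```r
books <- data.frame(
  title = c("R Basics", "Stats 101", "Data Viz"),   # hypothetical titles
  price = c(25.5, NA, 40),                           # one price is missing
  stringsAsFactors = FALSE                           # keep titles as character
)

str(books)                        # one character column, one numeric column
dim(books)                        # 3 rows, 2 columns
mean(books$price, na.rm = TRUE)   # 32.75, ignoring the missing price
```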

You will need the following file: Example Chicken weights csv (chickwt.csv).

Download the file from the link to the folder where you are running R.

In [6]:
mydata = read.csv('chickwt.csv')
head(x = mydata)
weight feed
   179 horsebean
   160 horsebean
   136 horsebean
   227 horsebean
   217 horsebean
   168 horsebean

Lists

A data frame is a kind of list, which is a vector of objects of possibly different types. For example, an employee record might be created by

In [7]:
Empl = list(employee = "Anna", spouse = "Fred", children = 3,
             child.ages = c(4, 7, 9))
Empl
$employee
'Anna'
$spouse
'Fred'
$children
3
$child.ages
  1. 4
  2. 7
  3. 9

Try the following:

In [8]:
Empl[[1]]
Empl$spouse

Empl[[4]]
length(Empl[[4]]) # numeric vector of length 3

Empl[4]
length(Empl[4]) # list of length 1
Empl[2:4]  # this works
'Anna'
'Fred'
  1. 4
  2. 7
  3. 9
3
$child.ages
  1. 4
  2. 7
  3. 9
1
$spouse
'Fred'
$children
3
$child.ages
  1. 4
  2. 7
  3. 9

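Components are always numbered and may be referred to by number, e.g. Empl[[1]]. If they are named, they can also be referred to by name using the $ operator, e.g. Empl$spouse. Note the difference between double and single brackets: Empl[[4]] returns the component itself, whereas Empl[4] returns a list of length 1 containing it.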
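A few more things that can be done with list components, sketched on the same employee record. The salary component is invented for illustration.

```r
Empl <- list(employee = "Anna", spouse = "Fred", children = 3,
             child.ages = c(4, 7, 9))

Empl[["spouse"]]       # same as Empl$spouse
Empl$child.ages[2]     # second element of the vector component: 7

Empl$salary <- 30000   # assigning to a new name adds a component (hypothetical value)
names(Empl)

Empl$spouse <- NULL    # assigning NULL removes a component
length(Empl)           # 4

# a data frame is a list of columns, so the same operators work on it
df <- data.frame(a = 1:2, b = c("x", "y"))
is.list(df)            # TRUE
df[["b"]]
```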

Keeping track of objects

Once you have created some objects, how do you remind yourself what you called them?

In [9]:
ls()
str(Empl)
  1. 'age'
  2. 'Empl'
  3. 'mydata'
  4. 'myvar'
  5. 'x'
List of 4
 $ employee  : chr "Anna"
 $ spouse    : chr "Fred"
 $ children  : num 3
 $ child.ages: num [1:3] 4 7 9

ls() lists the names of all the objects in your workspace and str() gives information about a specific object.

Operators

Arithmetic Operators: +, - , /, *, ^.

In [10]:
3^2
10 %% 3 # modulo reduction
10 %/% 3 # integer division

a = matrix(1:4, nrow=2)
b = matrix(c(2, 1, 2, 4), nrow=2)
a %*% b # matrix multiplication
9
1
3
5 14
8 20

Operators can also be used on vectors, with recycling if necessary.

In [11]:
x = c(1, 2, 3, 4)
y = c(5, 6)
x + 3
#(4,5,6,7) and
x + y
#(6,8,8,10)
  1. 4
  2. 5
  3. 6
  4. 7
  1. 6
  2. 8
  3. 8
  4. 10

Logical operators

  • == (equal), != (not equal), >, <, >=, <=

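  • ! (not), | (or), || (or), & (and), && (and)

  • | and & work element-wise on vectors

  • || and && work on single (length-one) values only. Older versions of R silently used just the first element of a longer vector; since R 4.3.0 this is an error.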

Examples:

In [12]:
x = c(TRUE, FALSE, TRUE)
y = c(FALSE, TRUE, TRUE)

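x | y          # element-wise or
x[1] || y[1]   # scalar or: both sides must have length 1 (an error otherwise since R 4.3.0)

x & y          # element-wise and
x[1] && y[1]   # scalar and
x[3] && y[3]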
  1. TRUE
  2. TRUE
  3. TRUE
TRUE
  1. FALSE
  2. FALSE
  3. TRUE
FALSE
TRUE

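The && operator is also a "short-circuit and", i.e. it won't evaluate its second argument if the first argument is FALSE. Likewise, || won't evaluate its second argument if the first is TRUE. The operators & and | always evaluate both sides.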
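The difference can be seen with a function that records when it is called. This is a sketch; the function f and the counter calls are made up for the illustration.

```r
calls <- 0
f <- function() {
  calls <<- calls + 1   # count how often f is evaluated
  TRUE
}

FALSE && f()   # FALSE; f() is never called
calls          # 0
FALSE & f()    # FALSE; & evaluates both sides, so f() is called
calls          # 1

# typical use: guard a test that would fail on its own
x <- NULL
!is.null(x) && x > 0   # FALSE, and x > 0 is never evaluated
```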

Dates

Example:

In [13]:
myDate = as.Date('10-Jan-1993', format="%d-%b-%Y")

class(myDate)
as.numeric(myDate)

myDate2 = as.Date('10-Jan-1994', format="%d-%b-%Y")

myDate2-myDate # can subtract two dates
'Date'
8410
Time difference of 365 days
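A few further operations on dates, as a sketch. Dates written in the ISO form year-month-day need no format string, and purely numeric formats avoid any dependence on the language settings of the computer.

```r
d <- as.Date("1993-01-10")

format(d, "%d/%m/%Y")     # "10/01/1993"
d + 30                    # adding a number adds days: 1993-02-09

# differences in other units
difftime(as.Date("1994-01-10"), d, units = "weeks")

# regular sequences of dates
seq(d, by = "month", length.out = 3)
```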

Some useful functions

Look at the following functions (you can use R help to see what they do using "?funcname"):

In [14]:
c(1, "a")
1:5
c(1,2,3,4,5)
seq(1, 10, by=2) #sequence from 1 to 10 in steps of 2
rep(c(1, 2), times=3)  # repeat the whole vector (1,2) 3 times
rep(c(1, 2), each=3)   # repeat each entry of (1,2) 3 times in turn
rep(c(1, 2), c(2, 3))  # repeat the first entry 2 times and the second 3 times


paste(c(1, 2), c('x', 'y', 'z')) # create vector of concatenations of the two vectors
paste(c(1, 2), c('x', 'y', 'z'), collapse=' ') # create the previous vector and concatenate its entries into a single string

sort(mydata$weight) # sort ascending
sort(mydata$weight, decreasing=TRUE) # sort descending

table(rpois(20, 5)) # tabulate 20 random Poisson variables with rate parameter lambda=5
  1. '1'
  2. 'a'
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  1. 1
  2. 3
  3. 5
  4. 7
  5. 9
  1. 1
  2. 2
  3. 1
  4. 2
  5. 1
  6. 2
  1. 1
  2. 1
  3. 1
  4. 2
  5. 2
  6. 2
  1. 1
  2. 1
  3. 2
  4. 2
  5. 2
  1. '1 x'
  2. '2 y'
  3. '1 z'
'1 x 2 y 1 z'
  1. 108
  2. 124
  3. 136
  4. 140
  5. 141
  6. 143
  7. 148
  8. 153
  9. 158
  10. 160
  11. 168
  12. 169
  13. 171
  14. 179
  15. 181
  16. 193
  17. 199
  18. 203
  19. 206
  20. 213
  21. 216
  22. 217
  23. 222
  24. 226
  25. 227
  26. 229
  27. 230
  28. 242
  29. 243
  30. 244
  31. 248
  32. 248
  33. 250
  34. 257
  35. 257
  36. 258
  37. 260
  38. 260
  39. 263
  40. 267
  41. 271
  42. 271
  43. 283
  44. 295
  45. 297
  46. 303
  47. 309
  48. 315
  49. 316
  50. 318
  51. 318
  52. 320
  53. 322
  54. 325
  55. 327
  56. 329
  57. 332
  58. 334
  59. 339
  60. 340
  61. 341
  62. 344
  63. 352
  64. 359
  65. 368
  66. 379
  67. 380
  68. 390
  69. 392
  70. 404
  71. 423
  1. 423
  2. 404
  3. 392
  4. 390
  5. 380
  6. 379
  7. 368
  8. 359
  9. 352
  10. 344
  11. 341
  12. 340
  13. 339
  14. 334
  15. 332
  16. 329
  17. 327
  18. 325
  19. 322
  20. 320
  21. 318
  22. 318
  23. 316
  24. 315
  25. 309
  26. 303
  27. 297
  28. 295
  29. 283
  30. 271
  31. 271
  32. 267
  33. 263
  34. 260
  35. 260
  36. 258
  37. 257
  38. 257
  39. 250
  40. 248
  41. 248
  42. 244
  43. 243
  44. 242
  45. 230
  46. 229
  47. 227
  48. 226
  49. 222
  50. 217
  51. 216
  52. 213
  53. 206
  54. 203
  55. 199
  56. 193
  57. 181
  58. 179
  59. 171
  60. 169
  61. 168
  62. 160
  63. 158
  64. 153
  65. 148
  66. 143
  67. 141
  68. 140
  69. 136
  70. 124
  71. 108
 0  2  3  4  5  6  7  9 10 
 1  4  3  3  3  1  3  1  1 

Matrices, arrays and indexing

In [15]:
(mymat = matrix(1:12, 3, 4)) # Entries go down columns unless you specify byrow=TRUE.
dim(mymat)
myarr = mymat
dim(myarr) = c(3,2,2) # creating an array
myarr
myarr[,,1] # use indexing to select the first slice along the array's third dimension

x = c(2,4,6,8,10,12)
names(x) = c("a", "b", "c", "d", "e", "f")

## Use different types of indices
x[c(1,3,6,5)] 
x[c("a","c","f", "e")] ## gives same as above
x[c(TRUE,FALSE,TRUE,rep(FALSE,3))] # can also use logical vectors to select entries
x[c(-1,-4)] ## exclude first and fourth entry

y = x
(y[]  =  0 ) ## Empty. Select all, useful to replace all vector entries
names(y) ## will be the same as before

# compare with:
y = x
(y = 0) 

# Recycling:
x[c(1,3)] = 4.5 # recycling will be used if sub-vector selected for replacement is longer than the right-hand side.

(x[10] = 8) # assigning to an index beyond the length of the vector extends it, filling the gap with NAs

x[11] # returns NA

# indexing matrices and arrays
mymat[1:2, -2]
mymat[mymat>1] = NA # note: no comma

mymat = matrix(1:12, 3, 4)
mymat[cbind(rep(1,3), c(2,3,4))] = NA

# If the result has length 1 in any dimension, this is dropped unless you use the argument drop=FALSE:
mydata[1:2, 1] #is a vector
class(mydata[1:2, 1])

mydata[1, 1:2] #is a data.frame
class(mydata[1, 1:2])
1 4 7 10
2 5 8 11
3 6 9 12
  1. 3
  2. 4
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
1 4
2 5
3 6
 a  c  f  e
 2  6 12 10
 a  c  f  e
 2  6 12 10
a c
2 6
 b  c  e  f
 4  6 10 12
0
  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'
  5. 'e'
  6. 'f'
0
8
NA: <NA>
1 7 10
2 8 11
  1. 179
  2. 160
'integer'
weight feed
   179 horsebean
'data.frame'
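The effect of drop=FALSE and of indexing by a matrix of positions can be checked on a small example. This is a sketch; the matrix m is arbitrary.

```r
m <- matrix(1:12, 3, 4)

m[1, ]                      # a plain vector: the row dimension is dropped
dim(m[1, ])                 # NULL
dim(m[1, , drop = FALSE])   # 1 4: still a matrix

m[m > 10]                   # logical indexing without a comma returns a vector: 11 12

# each row of the index matrix is one (row, column) position
m[cbind(c(1, 2), c(2, 3))]  # elements [1,2] and [2,3]: 4 8
```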

Indexing data frames

  • Data frames can be indexed like matrices, but only drop dimensions if you select from a single column, not if you select from a single row.

  • If you select rows from a data frame with only one column the result will be a vector unless you use drop=FALSE

  • You will often want to select the rows of a data frame which meet some criterion; use logical indexing for this.

In [16]:
mydata[1,]
attributes(mydata[,2])

mydata[mydata$weight> 400,] #index by rows where weight is greater than 400
weight feed
   179 horsebean
$levels
  1. 'casein'
  2. 'horsebean'
  3. 'linseed'
  4. 'meatmeal'
  5. 'soybean'
  6. 'sunflower'
$class
'factor'
   weight feed
37    423 sunflower
64    404 casein
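Logical indexing needs some care when the criterion involves missing values. The sketch below uses a small invented data frame df rather than the chicken weights file.

```r
df <- data.frame(weight = c(179, 423, NA, 404),
                 feed   = c("horsebean", "sunflower", "casein", "casein"))

df[df$weight > 400, ]          # 3 rows: the NA comparison yields a row of NAs
df[which(df$weight > 400), ]   # 2 rows: which() keeps only the TRUE positions

# subset() also drops the NA cases and lets you name columns directly
subset(df, weight > 400 & feed == "casein")

# keep a single column as a data frame rather than a vector
df[which(df$weight > 400), "weight", drop = FALSE]
```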

More Examples on matrices: Lower triangular, adding matrices, eigenvalues, column sums, ...

Check out the following:

In [17]:
mymat = matrix(1:12, nrow=3)
mymat
mymat2 = matrix(1:12, nrow=3, byrow=TRUE)
mymat2
mymat + mymat2 #componentwise addition
mymat %*% t(mymat2) #matrix multiplication

mysq = matrix(rnorm(9), nrow=3)
solve(mysq) #invert mysq

mysym = mysq 
mysym[lower.tri(mysym)] = t(mysym)[lower.tri(mysym)] #make symmetric version of mysq by copying the upper triangle into the lower one
eigen(mysym) #eigen decompositions
colSums(mymat) #column sums
 1  4  7 10
 2  5  8 11
 3  6  9 12
 1  2  3  4
 5  6  7  8
 9 10 11 12
 2  6 10 14
 7 11 15 19
12 16 20 24
70 158 246
80 184 288
90 210 330
 0.07146669 -0.4187667  0.3390059
 1.29757118 -0.7294324 -0.1576092
-0.92321345 -0.6068482 -0.2549748
eigen() decomposition
$values
[1]  0.634857 -0.660316 -1.247812

$vectors
           [,1]       [,2]       [,3]
[1,] -0.6416955  0.7622148 0.08517897
[2,] -0.5820237 -0.5562791 0.59312901
[3,]  0.4994750  0.3310321 0.80058886
  1. 6
  2. 15
  3. 24
  4. 33
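The results of solve() and eigen() can be checked numerically. The sketch below uses a fixed, arbitrary matrix A so that the output is reproducible.

```r
A <- matrix(c(4, 1, 2,
              0, 3, 1,
              1, 0, 5), nrow = 3)   # filled column by column

# inverse: A %*% solve(A) should be the identity matrix up to rounding
all.equal(A %*% solve(A), diag(3))

# symmetric version: copy the upper triangle into the lower one
S <- A
S[lower.tri(S)] <- t(S)[lower.tri(S)]
isSymmetric(S)

# eigen decomposition of a symmetric matrix: S = V diag(values) V'
e <- eigen(S)
all.equal(e$vectors %*% diag(e$values) %*% t(e$vectors), S)

rowSums(A)         # 5 4 8
apply(A, 2, max)   # column maxima: 4 3 5
```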

Functions

Writing simple functions:

In [18]:
x = rnorm(100, mean=0.3, sd=1.2) # create a vector of 100 i.i.d. random numbers from a normal distribution with mean 0.3 and standard deviation 1.2


std.dev = function(x) sqrt(var(x)) # function to calculate the standard deviation of a vector

# function to calculate the two-tailed p-value of a t.test
# note that function arguments can have default values
t.test.p = function(x, mu=0)  {
    n = length(x)
    t = sqrt(n) * (mean(x) - mu) /std.dev(x)
    2 * (1 - pt(abs(t), n - 1)) # the object of the final line will be returned
}

std.dev(x)
t.test.p(x) # this will use the default value for mu
t.test.p(mu=1, x=x) 
t.test.p(x, 1)
1.16538947500526
0.00364280772494219
1.903917294932e-07
1.903917294932e-07
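A hand-written function like this can be checked against R's built-in t.test(). The sketch below repeats the definition so that it runs on its own, using sd() in place of std.dev(); the seed and the value mu = 0.5 are arbitrary choices.

```r
set.seed(42)
x <- rnorm(100, mean = 0.3, sd = 1.2)

# two-tailed p-value of a one-sample t-test, as defined above
t.test.p <- function(x, mu = 0) {
  n <- length(x)
  t <- sqrt(n) * (mean(x) - mu) / sd(x)
  2 * (1 - pt(abs(t), n - 1))
}

t.test.p(x, mu = 0.5)
t.test(x, mu = 0.5)$p.value   # the built-in test gives the same p-value
```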

Flow control: if, for, while, repeat

In [19]:
#function generates 3 samples of size n from U(0,1), stores the means, prints the mean of entries > 0.2 using conditional indexing
myfn = function(n=100)  
{
    tmp = rep(NA, 3)
    tmp[1] = mean(runif(n))
    tmp[2] = mean(runif(n))
    tmp[3] = mean(runif(n))
    mean(tmp[tmp > .2])
}
set.seed(1)
myfn()
myfn(1000)
0.490111385922258
0.495000922671442

Control flow: if

Example

In [20]:
#function does the same things as the previous function but using if statements instead of conditional indexing
myfna = function(n=100)
{
    tmp = rep(NA, 3)
    x <- mean(runif(n))
    if (x > 0.2)
        tmp[1] = x
    x = mean(runif(n))
    if (x > 0.2)
        tmp[2] = x
    x = mean(runif(n))
    if (x > 0.2)
        tmp[3] = x
    mean(tmp, na.rm=TRUE)
}
set.seed(1)
myfna()
myfna(1000)
0.490111385922258
0.495000922671442

Control Flow: For

In [21]:
#generates n samples of size obs of U(0,1) variables and saves means in x. Returns mean and standard deviation of x 
myfn1 = function(obs=10, n=100) 
{
    x = rep(NA, n)
    for (i in 1:n)
    {
        tmp = runif(obs)
        x[i] = mean(tmp)
    }
    c(mn=mean(x), std=sd(x))
}
set.seed(1)
myfn1()
myfn1(1000)
               mn                 std
 0.49969167268509  0.0977624464898482
               mn                 std
0.499504085900935 0.00902177635803868

The functions while and repeat don't require loop variables.
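A minimal sketch of both constructs; the numbers are arbitrary. while tests its condition before each pass, whereas repeat runs until break is reached.

```r
# while: keep halving n until it drops below 1
n <- 100
steps <- 0
while (n >= 1) {
  n <- n / 2
  steps <- steps + 1
}
steps   # 7

# repeat: add 1, 2, 3, ... until the total exceeds 20
total <- 0
i <- 0
repeat {
  i <- i + 1
  total <- total + i
  if (total > 20) break
}
c(i = i, total = total)   # i = 6, total = 21
```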

The function ifelse

The function ifelse reduces the need for loops and can make code more efficient. Example:

In [22]:
x = c(0, 1, 1, 2)
y = c(44, 45, 56, 77)

z = rep(NA, 4)
for (i in seq_along(x))
{
    if (x[i] > 0)
        z[i] <- y[i] / x[i]
    else
        z[i] <- y[i] / 99
}
z
  1. 0.444444444444444
  2. 45
  3. 56
  4. 38.5

This can be replaced by:

In [23]:
(z = ifelse(x > 0, y / x, y / 99))
  1. 0.444444444444444
  2. 45
  3. 56
  4. 38.5
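One point to be aware of: ifelse evaluates each branch on the whole vector and then picks element by element, so a branch may be computed for elements where it is not wanted. A sketch with arbitrary values:

```r
x <- c(-1, 4, 9)

# sqrt(x) is evaluated for every element, including -1, which would
# trigger a "NaNs produced" warning if it were not suppressed here
r1 <- suppressWarnings(ifelse(x >= 0, sqrt(x), NA))
r1   # NA 2 3

# alternative: compute the branch only where the condition holds
ok <- x >= 0
r2 <- rep(NA_real_, length(x))
r2[ok] <- sqrt(x[ok])
r2   # NA 2 3
```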

Exercise 1

  • Create a vector containing all the dates in 2007, using seq and as.Date.

  • There is a version of cut for dates, called cut.Date. Use this to create a factor with values corresponding to the date of the first day of the week in which each of these dates falls. Start the weeks on Sundays.

  • Create x, a vector of length 100, with integer values in the range $1:5$, randomly ordered. (Hint: look at the function sample.)

  • Use paste to create a vector of labels: ("Colour 1", "Colour 2", "Colour 3", "Colour 4", "Colour 5")

  • Use the factor command to create a factor from the vector x, with the labels created above.

  • Create a data frame with 100 rows and two columns, one containing a random sample of the vector of dates created above, and the other containing the factor vector of colour names.

  • Select the rows for which the date is after 1st June 2007.


Exercise 2

  • Generate a matrix with 10 rows and 5 columns, with random entries between 0 and 10. (Hint: look at runif)

  • Write a function using for to calculate the column means of the matrix.

  • Extract the even rows from the matrix.
