Real numbers are not stored exactly on computers. They are held in a binary version of ``scientific'' notation (the decimal analogue is e.g. $1.234 \times 10^2$), so most decimal fractions are only approximated. This needs care, e.g.
x = seq(0, 0.5, 0.1) ##generate a sequence from 0 to 0.5 in steps of 0.1
x ##Look at x
Is x equal to (0, 0.1, 0.2, 0.3, 0.4, 0.5)? To find out, type
x == c(0, 0.1, 0.2, 0.3, 0.4, 0.5)
This is known as FAQ 7.31: see [https://cran.r-project.org/doc/FAQ/R-FAQ.html].
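One common way round this is to compare with a tolerance rather than with ==. The sketch below uses base R's all.equal(); the name target is introduced here purely for illustration.

```r
x = seq(0, 0.5, 0.1)
target = c(0, 0.1, 0.2, 0.3, 0.4, 0.5) # illustrative name, not from the text above
0.1 + 0.2 == 0.3              # FALSE: the two sides are stored as slightly different doubles
all.equal(x, target)          # TRUE: equal up to a small tolerance
isTRUE(all.equal(x, target))  # the safe form to use inside if()
abs(x - target) < 1e-8        # elementwise comparison with an explicit tolerance
```

all.equal() returns a description of the difference rather than FALSE when the objects differ, which is why it is wrapped in isTRUE() for use in conditions.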
Tiny inaccuracies can accumulate. The sample variance of a vector x is often calculated as
$var(x) = (\sum x^2 - n\bar{x}^2) / (n-1)$
Try it out and compare it with var():
# define a function myvar(x) to estimate the variance with the formula above
myvar = function(x) (sum(x^2) - length(x) * mean(x)^2) / (length(x) - 1)
x = 1:100
myvar(x)
var(x)
x = 1:100 + 1e10 # the same data shifted by 10^10; the variance should not change
myvar(x)
var(x)
Can you see why there was a problem?
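The problem is cancellation: for the shifted data both sum(x^2) and n * mean(x)^2 are of order 10^22, far beyond the roughly 16 significant digits a double holds, so their difference is mostly rounding error. A minimal sketch of the usual remedy, subtracting the mean before squaring (myvar2 is a name chosen here for illustration):

```r
# two-pass variance: centre the data first, then square the deviations
myvar2 = function(x) sum((x - mean(x))^2) / (length(x) - 1)
x = 1:100 + 1e10
myvar2(x)                     # agrees with var(x)
all.equal(myvar2(x), var(x))  # TRUE
```

The deviations x - mean(x) are small numbers, so squaring and summing them loses no significant digits.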
Basic Types of Variables
Variables are the equivalent of memories in your calculator, but you can have (almost!) unlimited quantities of them, they have names of your choosing, and they come in different types. The basic types include numeric, character and logical: TRUE or FALSE (or NA, see later).
As well as the basic types of variables, R recognises many more complicated objects such as
vectors, matrices, arrays: groups of objects all of the same type
lists of other objects which may be of different types
Specialised objects such as Dates, Linear Model fits
There are some types of data which need to be treated specially in calculations:
NA
The value NA is given to any data which R knows to be missing. This is not a character string, so a character string with value "NA" will be treated differently from one with the value NA.
Inf
The result of e.g. dividing any non-zero number by zero.
NaN
The result of e.g. attempting to find the logarithm of a negative number.
as.numeric(c("a", "1"))
x / 0
log(-x)
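These special values propagate through calculations, so it helps to know how to test for them. A short sketch using only base R functions (the vector v is made up for illustration):

```r
v = c(3, NA, 5)
mean(v)                # NA: anything computed from a missing value is missing
mean(v, na.rm = TRUE)  # 4: drop the missing values first
is.na(v)               # FALSE TRUE FALSE
NA == NA               # NA, so test for missingness with is.na(), not ==
is.na("NA")            # FALSE: this is an ordinary character string
c(1, -1, 0) / 0        # Inf -Inf NaN
is.nan(c(NA, NaN))     # FALSE TRUE
is.na(c(NA, NaN))      # TRUE TRUE: NaN also counts as missing
```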
Factors are variables which can only take one of a finite set of discrete values. They naturally occur as vectors, and can be
Although factors are stored as numbers, along with the label corresponding to each number, they cannot be treated as numeric. Would it make sense to ask R to calculate mean(voting intention)?
A more useful function for factors is table, which will count how many of each value occur in the vector.
Some factor variables have a natural ordering. Drug doses do, but voting intentions usually do not. R will treat the two types differently. It is important not to allow R to treat non-ordered factors as ordered ones, since the results could be meaningless.
Use cut to create factor variables from continuous ones:
age <- runif(100) * 50
table(cut(age, c(0, 10, 20, 30, 40, 50)))
The function factor() can be used to create factor variables from characters.
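As a small illustration of these points, the sketch below builds an unordered and an ordered factor from character vectors. The data and the names votes, vote.f and dose are made up for this example.

```r
votes = c("Green", "Red", "Blue", "Red", "Red", "Blue") # made-up voting intentions
vote.f = factor(votes)
levels(vote.f)      # "Blue" "Green" "Red": levels are sorted alphabetically by default
as.integer(vote.f)  # 2 3 1 3 3 1: the underlying integer codes
table(vote.f)       # counts per level

# an ordered factor: the order of the levels is given explicitly
dose = factor(c("low", "high", "medium", "low"),
              levels = c("low", "medium", "high"), ordered = TRUE)
dose < "high"       # TRUE FALSE TRUE TRUE: comparisons are defined only for ordered factors
```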
Tabular data are read into a data.frame, with missing entries coded as NA.
You will need the following file: Example Chicken weights csv.
Download the file from the link to the folder where you are running R.
mydata = read.csv('chickwt.csv')
head(x = mydata)
A data frame is a kind of list, which is a vector of objects of possibly different types. For example, an employee record might be created by
Empl = list(employee = "Anna", spouse = "Fred", children = 3,
child.ages = c(4, 7, 9))
Empl
Try the following:
Empl[[1]]
Empl$spouse
Empl[[4]]
length(Empl[[4]]) # numeric vector of length 3
Empl[4]
length(Empl[4]) # list of length 1
Empl[2:4] # this works
Components are always numbered and may be referred to by number, e.g. Empl[[1]]. If they are named, they can also be referred to by name using the \$ operator, e.g. ```Empl$spouse```.
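The difference between double and single brackets is easy to miss, so here is a short sketch that repeats the Empl list and checks what each form returns. The pets component and the data frame df are hypothetical additions for illustration.

```r
Empl = list(employee = "Anna", spouse = "Fred", children = 3,
            child.ages = c(4, 7, 9))
class(Empl[["child.ages"]])  # "numeric": [[ ]] extracts the component itself
class(Empl["child.ages"])    # "list": [ ] returns a shorter list
Empl$child.ages[2]           # 7: index into the extracted vector
Empl$pets = "none"           # assigning to a new name adds a component
names(Empl)

# a data frame is a list whose components are columns of equal length
df = data.frame(id = 1:3, score = c(2.5, 4, 1))
is.list(df)                  # TRUE
df[["score"]]                # the column as a plain vector
```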
Once you have created some objects, how do you remind yourself what you called them?
ls()
str(Empl)
ls() lists the names of all the objects in your workspace and str() gives information about a specific object.
Arithmetic Operators: +, -, /, *, ^.
3^2
10 %% 3 # modulo reduction
10 %/% 3 # integer division
a = matrix(1:4, nrow=2)
b = matrix(c(2, 1, 2, 4), nrow=2)
a %*% b # matrix multiplication
Operators can also be used on vectors, with recycling if necessary.
x = c(1, 2, 3, 4)
y = c(5, 6)
x + 3
#(4,5,6,7) and
x + y
#(6,8,8,10)
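Recycling also happens when the longer length is not a multiple of the shorter one, but R then issues a warning. A minimal sketch; the handler only prints the warning text so that the result can still be inspected.

```r
x = c(1, 2, 3, 4)
x * c(10, 100)  # 10 200 30 400: the shorter vector is reused twice

# lengths 3 and 2: R still recycles, but warns that the lengths do not fit
res = withCallingHandlers(
  c(1, 2, 3) + c(10, 20),
  warning = function(w) {
    message("warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
res             # 11 22 13
```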
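Comparison Operators: == (equal), != (not equal), >, <, >=, <=.
Logical Operators: ! (not), | (or), || (or), & (and), && (and).
| and & work elementwise on vectors.
|| and && consider only a single element; since R 4.3.0 it is an error to give them a vector of length greater than one (earlier versions used only the first element).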
Examples:
x = c(TRUE, FALSE, TRUE)
y = c(FALSE, TRUE, TRUE)
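x | y # elementwise or
x[1] || y[1] # || takes single values; x || y is an error since R 4.3.0
x & y # elementwise and
x[1] && y[1]
x[3] && y[3]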
The && is also a "short circuit and", i.e. it won't evaluate its second argument if the first argument is FALSE. Compare the following
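A minimal sketch of the difference, using a hypothetical helper check() that stops with an error whenever it is evaluated:

```r
check = function() stop("second argument was evaluated") # hypothetical helper

FALSE && check()  # FALSE: check() is never called
TRUE || check()   # TRUE: || short-circuits in the same way

# & always evaluates both sides, so the error is triggered
res = tryCatch(FALSE & check(), error = function(e) conditionMessage(e))
res               # "second argument was evaluated"
```

This is why && is the usual choice in if() conditions such as `!is.null(v) && v > 0`, where the second test is only meaningful when the first one passes.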
Example:
myDate = as.Date('10-Jan-1993', format="%d-%b-%Y")
class(myDate)
as.numeric(myDate)
myDate2 = as.Date('10-Jan-1994', format="%d-%b-%Y")
myDate2 - myDate # can subtract two dates
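Dates are stored as the number of days since 1 January 1970, which is what makes this arithmetic work. The sketch below uses ISO-formatted strings (year-month-day) instead, because month abbreviations such as %b are matched according to the current locale; d1 and d2 are names chosen for illustration.

```r
d1 = as.Date("1993-01-10")  # ISO format needs no format= argument
d2 = as.Date("1994-01-10")
as.numeric(d1)              # 8410 days since 1970-01-01
as.numeric(d2 - d1)         # 365
d1 + 30                     # adding a number moves the date forward by days
format(d1, "%d/%m/%Y")      # "10/01/1993"
seq(d1, by = "month", length.out = 3) # the 10th of three consecutive months
d1 < d2                     # TRUE: dates can be compared
```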
Look at the following functions (you can use R help to see what they do using "?funcname"):
c(1, "a")
1:5
c(1,2,3,4,5)
seq(1, 10, by=2) #sequence from 1 to 10 in steps of 2
rep(c(1, 2), times=3) # replicate (1,2) 3 times
rep(c(1, 2), each=3) # replicate (1,2) first entry 3 times and second 3 times
rep(c(1, 2), c(2, 3)) # replicate (1,2) first entry 2 times and second 3 times
paste(c(1, 2), c('x', 'y', 'z')) # create vector of concatenations of the two vectors
paste(c(1, 2), c('x', 'y', 'z'), collapse=' ') # create the previous vector and concatenate its entries into a single string
sort(mydata$weight) # sort ascending
sort(mydata$weight, decreasing=TRUE) # sort descending
table(rpois(20, 5)) # tabulate 20 random Poisson draws with rate parameter lambda=5
(mymat = matrix(1:12, 3, 4)) # Entries go down columns unless you specify byrow=TRUE.
dim(mymat)
myarr = mymat
dim(myarr) = c(3,2,2) # creating an array
myarr
myarr[,,1] # select the first slice along the array's third dimension
x = c(2,4,6,8,10,12)
names(x) = c("a", "b", "c", "d", "e", "f")
## Use different types of indices
x[c(1,3,6,5)]
x[c("a","c","f", "e")] ## gives same as above
x[c(TRUE,FALSE,TRUE,rep(FALSE,3))] # can also use logical vectors to select entries
x[c(-1,-4)] ## exclude first and fourth entry
y = x
(y[] = 0) ## an empty index selects every entry; useful for overwriting all values while keeping the names
names(y) ## will be the same as before
# compare with:
y = x
(y = 0)
# Recycling:
x[c(1,3)] = 4.5 # recycling will be used if sub-vector selected for replacement is longer than the right-hand side.
(x[10] = 8) # replacing to an index greater than the length of the vector extends it, filling in with NA's
x[11] # returns NA
# indexing matrices and arrays
mymat[1:2, -2]
mymat[mymat>1] = NA # note: no comma
mymat = matrix(1:12, 3, 4)
mymat[cbind(rep(1,3), c(2,3,4))] = NA
# If the result has length 1 in any dimension, this is dropped unless you use the argument drop=FALSE:
mydata[1:2, 1] #is a vector
class(mydata[1:2, 1])
mydata[1, 1:2] #is a data.frame
class(mydata[1, 1:2])
Data frames can be indexed like matrices, but only drop dimensions if you select from a single column, not if you select from a single row.
If you select rows from a data frame with only one column the result will be a vector unless you use drop=FALSE.
Often you will want to select the rows of a data frame which meet some criterion. To do this, use logical indexing:
mydata[1,]
attributes(mydata[,2])
mydata[mydata$weight> 400,] #index by rows where weight is greater than 400
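The same idea extends to combined conditions. The sketch below uses R's built-in chickwts data set as a stand-in for mydata; that chickwt.csv has the same weight and feed columns is an assumption, and the names chicks, heavy, both, w, wdf and same are illustrative.

```r
chicks = chickwts # built-in data set, assumed to mirror chickwt.csv
heavy = chicks[chicks$weight > 400, ]                          # rows with weight above 400
both = chicks[chicks$weight > 300 & chicks$feed == "casein", ] # two conditions combined with &
w = chicks[chicks$weight > 400, "weight"]                      # single column: a plain vector
wdf = chicks[chicks$weight > 400, "weight", drop = FALSE]      # drop=FALSE keeps a data frame
same = subset(chicks, weight > 400)                            # subset() selects the same rows
nrow(heavy) == nrow(same)
```

Note the elementwise & in the combined condition: the row index must be a logical vector with one entry per row.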
Check out the following:
mymat = matrix(1:12, nrow=3)
mymat
mymat2 = matrix(1:12, nrow=3, byrow=TRUE)
mymat2
mymat + mymat2 #componentwise addition
mymat %*% t(mymat2) #matrix multiplication
mysq = matrix(rnorm(9), nrow=3)
solve(mysq) #invert mysq
mysym = mysq
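mysym[lower.tri(mysym)] = t(mysym)[lower.tri(mysym)] #make symmetric version of mysq by copying the upper triangle into the lower (the transpose keeps the entries matched for any matrix size)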
eigen(mysym) #eigen decompositions
colSums(mymat) #column sums
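It is worth checking results like these numerically. A short sketch, with a seed fixed here only to make the run reproducible:

```r
set.seed(1) # seed chosen for illustration
mysq = matrix(rnorm(9), nrow=3)
round(mysq %*% solve(mysq), 10)  # the identity matrix, up to rounding error
solve(mysq, c(1, 2, 3))          # solves mysq %*% b = c(1, 2, 3) without forming the inverse

mysym = mysq
mysym[lower.tri(mysym)] = t(mysym)[lower.tri(mysym)]
isSymmetric(mysym)               # TRUE

e = eigen(mysym)
# rebuild mysym from its eigenvalues and eigenvectors
all.equal(e$vectors %*% diag(e$values) %*% t(e$vectors), mysym)
```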
Writing simple functions:
x = rnorm(100, mean=0.3, sd=1.2) # create a vector of i.i.d. random numbers from a normal distribution with mean 0.3 and standard deviation 1.2
std.dev = function(x) sqrt(var(x)) # function to calculate the standard deviation of a vector
# function to calculate the two-tailed p-value of a t.test
# note that function arguments can have default values
t.test.p = function(x, mu=0) {
  n = length(x)
  t = sqrt(n) * (mean(x) - mu) / std.dev(x)
  2 * (1 - pt(abs(t), n - 1)) # the object of the final line will be returned
}
std.dev(x)
t.test.p(x) # this will use the default value for mu
t.test.p(mu=1, x=x)
t.test.p(x, 1)
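A quick way to gain confidence in a hand-written function is to compare it with a built-in equivalent. The sketch below repeats the two definitions so that it runs on its own and checks them against sd() and t.test(); the seed is an arbitrary choice.

```r
set.seed(1) # arbitrary seed so the comparison is reproducible
x = rnorm(100, mean=0.3, sd=1.2)
std.dev = function(x) sqrt(var(x))
t.test.p = function(x, mu=0) {
  n = length(x)
  t = sqrt(n) * (mean(x) - mu) / std.dev(x)
  2 * (1 - pt(abs(t), n - 1))
}
all.equal(std.dev(x), sd(x))                               # TRUE
all.equal(t.test.p(x), t.test(x)$p.value)                  # TRUE
all.equal(t.test.p(x, mu=0.5), t.test(x, mu=0.5)$p.value)  # TRUE
```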
#function generates 3 samples of size n from U(0,1), stores the means, prints the mean of entries > 0.2 using conditional indexing
myfn = function(n=100)
{
  tmp = rep(NA, 3)
  tmp[1] = mean(runif(n))
  tmp[2] = mean(runif(n))
  tmp[3] = mean(runif(n))
  mean(tmp[tmp > .2])
}
set.seed(1)
myfn()
myfn(1000)
Example
#function does the same things as the previous function but using if statements instead of conditional indexing
myfna = function(n=100)
{
  tmp = rep(NA, 3)
  x = mean(runif(n))
  if (x > 0.2)
    tmp[1] = x
  x = mean(runif(n))
  if (x > 0.2)
    tmp[2] = x
  x = mean(runif(n))
  if (x > 0.2)
    tmp[3] = x
  mean(tmp, na.rm=TRUE)
}
set.seed(1)
myfna()
myfna(1000)
#generates n samples of size obs of U(0,1) variables and saves means in x. Returns mean and standard deviation of x
myfn1 = function(obs=10, n=100)
{
  x = rep(NA, n)
  for (i in 1:n)
  {
    tmp = runif(obs)
    x[i] = mean(tmp)
  }
  c(mn=mean(x), std=sd(x))
}
set.seed(1)
myfn1()
myfn1(1000)
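Loops like this can often be replaced by a vectorised calculation. As a sketch under that idea, the hypothetical myfn1v below draws all the numbers at once, arranges them with one sample per column, and uses colMeans(); myfn1 is repeated so that the comparison runs on its own.

```r
myfn1 = function(obs=10, n=100)
{
  x = rep(NA, n)
  for (i in 1:n)
  {
    tmp = runif(obs)
    x[i] = mean(tmp)
  }
  c(mn=mean(x), std=sd(x))
}
# vectorised version (illustrative name): one column per sample
myfn1v = function(obs=10, n=100)
{
  draws = matrix(runif(obs * n), nrow=obs)
  x = colMeans(draws)
  c(mn=mean(x), std=sd(x))
}
set.seed(1)
a = myfn1()
set.seed(1)
b = myfn1v()
all.equal(a, b) # TRUE: with the same seed both versions use the same random numbers
```

The two agree because matrix() fills column by column, so each column holds exactly the values that one pass of the loop would have drawn.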
The functions while and repeat don't require loop variables.
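A minimal sketch of both, with made-up stopping rules: while tests its condition before each pass, whereas repeat runs until break is called.

```r
# while: keep halving until the value falls below a tolerance
x = 1
steps = 0
while (x > 1e-3)
{
  x = x / 2
  steps = steps + 1
}
steps # 10, since 1/2^10 is the first power of one half below 0.001

# repeat: draw uniform numbers until one exceeds 0.9
set.seed(1) # arbitrary seed
n = 0
repeat
{
  n = n + 1
  if (runif(1) > 0.9)
    break
}
n # number of draws needed
```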
The function ifelse reduces the need for loops and can make code more efficient. Example:
x = c(0, 1, 1, 2)
y = c(44, 45, 56, 77)
z = rep(NA, 4)
for (i in seq_along(x)) # seq_along(x) is safer than 1:length(x) when x is empty
{
  if (x[i] > 0)
    z[i] <- y[i] / x[i]
  else
    z[i] <- y[i] / 99
}
z
This can be replaced by:
(z = ifelse(x > 0, y / x, y / 99))
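One point to keep in mind is that ifelse computes both alternatives for the whole vector and then picks from them elementwise. The sketch below shows this and an equivalent form using logical indexing; z2 is a name introduced for illustration.

```r
x = c(0, 1, 1, 2)
y = c(44, 45, 56, 77)
y / x                             # Inf 45 56 38.5: computed for every element, including x = 0
z = ifelse(x > 0, y / x, y / 99)  # the test then chooses between the two results
z

# the same answer by logical indexing, overwriting only where the test is TRUE
z2 = y / 99
z2[x > 0] = y[x > 0] / x[x > 0]
all.equal(z, z2)                  # TRUE
```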