SAMBa Induction Training: R Worksheet

This worksheet contains a few simple problems involving some basic methods in probability and statistics. You are expected to solve these problems using the programming language $\texttt{R}$.

The Central Limit Theorem states that the sum $Z = X_1 +...+ X_n$ of $n$ independent identically distributed (iid) random variables is approximately normally distributed for large $n$. In particular, if $X_1,... X_n$ have mean $\mu$ and variance $\sigma^2$, then the standardised sum $Z_s=((X_1-\mu) + ... + (X_n-\mu))/(\sqrt{n\sigma^2})$ has a standard normal distribution $N(0,1)$ for large $n$.

1

Susie D gets many emails each day. Let $X_i \overset{\text{iid}}{\sim} Pois(70)$ for $i=1,...,n$ denote the number of emails she receives during day $i$.

Write a function func that takes n as input, generates a random sample $x$ of $n$ independent draws of $X_i$ and returns the sample mean and variance of $x$. You may find the function rpois helpful.

Write a function function func2 that takes n as an input, makes a call to func, and uses both outputs of func to calculate an approximate standardised sum $Z_s$, and finally returns $Z_s$.

Use the function sapply with the first argument rep(500,1000) to generate $N = 1000$ such $Z_s$ for $n = 500$, saving the output to the variable Zs_samples.

Iteratively generate another $N = 1000$ such $Z_S$ for $n = 500$ by implementing a for loop that calls func2 during each iteration, and stores the result into a vector Zs_samples_loop.

Plot a histogram of your samples of $Z_s$. What do you observe? Use the function qqnorm on your samples of $Z_s$ to produce a normal QQ plot. What do you observe and why is this the case?

2

Data in $\texttt{R}$ are usually stored in data frames. A data frame is a table in which each column is a vector containing the values of one variable. The variables can be of either numeric, factor or character type.

One of Andreas K's favourite built in datasets in R is called trees. Use the command dat <- trees to create a new data frame with this data. Have a play around with this dataset so that you are comfortable extracting a particular row or column. Use the following commands to perform some exploratory data analysis.

Fit a simple linear model with response variable Volume and predictor variable Girth, storing the results as an object mod.

Use plot(mod) and summary(mod) to assess the goodness of fit of the model. Does Girth predict Volume well?

From the formula for the volume of a cylinder, we may expect the variable Girth$^2$ to be a better predictor for Volume. Add this variable as a new column Girth2 to the data frame dat.

Fit a linear model as before but with Girth replaced by Girth2. Did this improve the model?

3 (Bonus)

NOTE: This is more a test of your statistical intuition, rather than coding ability in $\texttt{R}$.

When Paul M was a child he sold newspapers at a kiosk. In the morning he bought newspapers at 25 Brazilian Reals each, but after the first 200 newspapers the price falls to 5 Reals each. He sells the newspapers at 100 Reals each. If any remain unsold at the end of the day, he gets no refund. From previous experiences, he infers the demand for newspapers each day is distributed according to a $N(178,21^2)$ distribution. He would like to know what the optimal number of newspapers to buy each day was.

Let $n$ be the number of papers Paul M buys each day. Write R functions for the following functions:

cost(n) to give the cost of buying $n$ papers,

profit(n,d) to give the profit from sales when there is demand for $d$ papers and Paul M bought $n$ papers, and

average.profit(n,nreps) to simulate nreps values of demands $d\sim N(178,21^2)$ and return the average profit for these cases.

Use average.profit to work out the optimal number of newspapers Paul M should buy each day.

You may find it helpful to use a small number of simulations initially, then increase this number to get accurate reuslts as your search closes in on the optimal value. Careful- a local maximum may not be a global maximum! Make sure to let Paul M know what you discover.

Can you think of a more efficient way to get the answer? Hint: investigate the set.seed function.