Occasionally you will want your user to input data via the keyboard; for instance, this might be the name of a file to analyse. We can do this using the function input:
text = input("Enter some text: ")
print(text)
When prompted, type the string at the command line (or in the cell output if using Jupyter) and press the return key to confirm your choice.
The value read in by input is always a string, even if the user types a number. We can check this with type.
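As a minimal sketch of that check, the example below uses a fixed string as a stand-in for what a user might type, so that it runs without a prompt. The value "42" and the variable name number are illustrative assumptions, not part of the lesson data.

```python
# Stand-in for: text = input("Enter some text: ")
# input() always returns a str, so a plain string behaves the same way here.
text = "42"

print(type(text))    # the value is a string, even though it looks like a number

# Convert explicitly if a number is needed.
number = int(text)
print(type(number))
print(number + 1)
```

If the text cannot be interpreted as a whole number, int raises a ValueError, so real programs usually check or handle that case.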
Most of the time the data we want to read will be in files. As an example, we will use the data files provided in the library you cloned from notebooks.azure.com (or alternatively downloaded as described in Setup).
In order to read from a file, we first need to open the file:
file = open("./data/data/inflammation-01.csv")
Note that we have prepended the filename with ./data/data/. This is a relative path to the file inflammation-01.csv from the directory our notebook is running in.
You will have to do something similar depending on where you saved the data files, and depending on the directory that you are currently running Python in.
If the file is in the same directory as you are currently running/working in, you don't need to prepend anything.
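If opening the file fails, it can help to check which directory Python is running in and whether the relative path points at a real file. The following is a minimal sketch using pathlib from the standard library; the variable name data_file is an assumption made for illustration, and the path is the one used in this lesson.

```python
from pathlib import Path

# The directory that relative paths are resolved against.
print(Path.cwd())

# The relative path used in this lesson; adjust it to where you saved the data.
data_file = Path("./data/data/inflammation-01.csv")

# True if the file can be found from the current directory, False otherwise.
print(data_file.exists())
```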
When we open the file it is a bit like picking a book off the shelf and opening it at the first page. We have not yet read in any of the data contained in the file. In order to do this we must read from the file. In Python and many languages we can do this by reading each line of the file in turn.
line = file.readline()
print(line)
Python allows us to treat the file as a collection of lines which it reads in automatically so we can use the more readable form:
for line in file:
    print(line)
This is equivalent to reading all the lines in a book. If we run this again:
for line in file:
    print(line)
There is no output. It is as if we have reached the end of the book and are stuck.
We need to close the book and then we could open it and read it again, or choose another book and read that instead:
file.close()
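The open, read, and close cycle can be sketched end to end as follows. So that the example runs anywhere, it writes a small temporary file with made-up contents rather than using the lesson data; the names path, first_pass, second_pass, and third_pass are illustrative. It also shows the with statement, a common alternative that closes the file automatically.

```python
import os
import tempfile

# Create a small stand-in file (the contents are made up for this sketch).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("0,0,1\n0,1,2\n")
    path = tmp.name

file = open(path)
first_pass = [line for line in file]   # reads every line in the file
second_pass = [line for line in file]  # nothing left: we are at the end
file.close()

# Opening the file again starts from the first line.
# The "with" block closes the file for us when it finishes.
with open(path) as file:
    third_pass = [line for line in file]

os.remove(path)  # tidy up the temporary file
print(len(first_pass), len(second_pass), len(third_pass))
```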
There are of course a number of libraries that we can use to read in our data. One of the most useful of these is numpy, a numerical library for Python that has a host of features and optimised routines for performing efficient calculations. In particular, for our purposes, it has a function for reading 'csv' files and converting them automatically into numerical values, if possible. First we must import the library and, according to near-universal tradition, when we import numpy we use the alias np.
import numpy as np
You do not have to use the alias np; you can just import numpy and write numpy everywhere we use np in the code that follows. However, we mention and use it because if you look at anyone else's code, it is almost certain that this is how they will use the library. To read in a 'csv' file we can now use the single command:
data = np.loadtxt(fname='./data/data/inflammation-01.csv', delimiter=',')
Let's check that something has happened and that the data has been read in:
print(data)
We can see that data contains values and that printing them results in something different from just printing out each line in the file.
When print is used with a numpy object and the data is bigger than can be neatly printed, only the first and last few values are printed. In between, an ellipsis is printed to indicate the data that is present in data but omitted for clarity.
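Another way to confirm how much data was read, without printing all of it, is to look at the array's shape and dtype attributes. The sketch below uses a short made-up piece of CSV text in place of the real file so that it runs on its own; the name fake_file is an assumption for illustration.

```python
import io
import numpy as np

# Made-up CSV text standing in for the real file. np.loadtxt also accepts
# file-like objects such as io.StringIO, not only filenames.
fake_file = io.StringIO("0,0,1\n0,1,2\n")
data = np.loadtxt(fname=fake_file, delimiter=",")

print(data.shape)  # (number of rows, number of columns)
print(data.dtype)  # the type used to store each value
```

Applied to the data read from inflammation-01.csv, the same two attributes report the size of the full data set even when print shows only part of it.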
To check that all the data has been read in as before we can access each 'line' of the data with:
for line in data:
    print(line)
Now, as before when we read and printed each line of the file, the full data set is visible.
The format is slightly different: all the commas have been removed and, instead of the original strings, each value has now been converted to a float, as indicated by the decimal point.
The use of the loop shows that we can treat data much like a list. Each line of the original data can be accessed by its index, just as with a list:
print(data[0])
print(data[17])
We can access individual items in the dataset with two indices and also use the slice that we applied to lists earlier:
print(data[0][0])
print(data[0][1])
print(data[0][2])
print(data[0][:3])
print(data[17][-3:])
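NumPy arrays also accept both indices inside a single pair of square brackets, separated by a comma, which is the form you will most often see in other people's code. A minimal sketch, using a small made-up array in place of the inflammation data:

```python
import numpy as np

# Small made-up array standing in for the inflammation data.
data = np.array([[0.0, 1.0, 2.0],
                 [3.0, 4.0, 5.0]])

print(data[1][2])    # index the row first, then the item within it
print(data[1, 2])    # the same item, using one pair of brackets
print(data[1, -2:])  # the last two values of row 1
```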
We can also verify that the value in the dataset has been converted to a numerical type with type:
print(type(data[0][0]))
This reveals that the value is not simply a float but a special numpy.float64. The 64 refers to the number of bits of memory allocated to the value; we can think of this as how accurately the computer can represent the value.
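The size of this type can be inspected directly. The following sketch creates a single value by hand (1.5 is an arbitrary choice) rather than reading it from the data set, and uses np.finfo, which reports properties of floating-point types.

```python
import numpy as np

# An arbitrary value of the same type as the items in the data set.
value = np.float64(1.5)

print(type(value))
print(np.finfo(np.float64).bits)  # number of bits used to store the value
print(value.itemsize)             # the same size expressed in bytes

# numpy.float64 is also a kind of ordinary Python float.
print(isinstance(value, float))
```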