Scientific Python Basics

Prepared by: Cindee Madison, Thomas Kluyver (Any errors are our own)
Thanks to: Justin Kitzes, Matt Davis

1. Individual things

The most basic components of any programming language are "things", also called variables or (in special cases) objects.

The most common basic "things" in Python are integers, floats, strings, booleans, and some special objects of various types. We'll meet many of these as we go through the lesson.

TIP: To run the code in a cell quickly, press Ctrl-Enter.

TIP: To quickly create a new cell below an existing one, type Ctrl-m then b. Other shortcuts for making, deleting, and moving cells are in the menubar at the top of the screen.

In [ ]:
# A thing
2
In [ ]:
# Use print to show multiple things in the same cell
# Note that you can use single or double quotes for strings
print(2)
print('hello')
In [ ]:
# Things can be stored as variables
a = 2
b = 'hello'
c = True  # This is case sensitive
print(a, b, c)
In [ ]:
# The type function tells us the type of thing we have
print(type(a))
print(type(b))
print(type(c))
In [ ]:
# What happens when a new variable points to a previous variable?
a = 1
b = a
a = 2
## What is b?
print(b)

2. Commands that operate on things

Just storing data in variables isn't much use to us. Right away, we'd like to start performing operations and manipulations on data and variables.

There are three very common means of performing an operation on a thing.

2.1 Use an operator

All of the basic math operators work like you think they should for numbers. They can also do some useful operations on other things, like strings. There are also boolean operators that compare quantities and give back a bool variable as a result.

In [ ]:
# Standard math operators work as expected on numbers
a = 2
b = 3
print(a + b)
print(a * b)
print(a ** b)  # a to the power of b (a^b does something completely different!)
print(a / b)   # Careful with dividing integers if you use Python 2
In [ ]:
# There are also operators for strings
print('hello' + 'world')
print('hello' * 3)
#print('hello' / 3)  # You can't do this!
In [ ]:
# Boolean operators compare two things
a = (1 > 3)
b = (3 == 3)
print(a)
print(b)
print(a or b)
print(a and b)
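
The comments above mention two operator surprises: ^ is not exponentiation, and dividing integers behaves differently between Python versions. Here is a minimal sketch of both, assuming Python 3 (the variable names are just for illustration):

```python
a = 2
b = 3

power = a ** b      # exponentiation: 2 * 2 * 2
xor = a ^ b         # bitwise XOR of binary 10 and 11, not a power
true_div = b / a    # in Python 3, / always gives a float
floor_div = b // a  # // rounds down to a whole number

print(power, xor, true_div, floor_div)
```

In Python 2, b / a with two integers would behave like b // a, which is why the comment above asks you to be careful.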

2.2 Use a function

These will be very familiar to anyone who has programmed in any language, and work like you would expect.

In [ ]:
# There are thousands of functions that operate on things
print(type(3))
print(len('hello'))
print(round(3.3))

TIP: To find out what a function does, you can type its name and then a question mark to get a pop up help window. Or, to see what arguments it takes, you can type its name, an open parenthesis, and hit tab.

In [ ]:
round?
#round(
round(3.14159, 2)

TIP: Many useful functions are not in the Python built-in library, but are in external scientific packages. These need to be imported into your Python notebook (or program) before they can be used. Probably the most important of these are numpy and matplotlib.

In [ ]:
# Many useful functions are in external packages
# Let's meet numpy
import numpy as np
In [ ]:
# To see what's in a package, type the name, a period, then hit tab
#np?
np.
In [ ]:
# Some examples of numpy functions and "things"
print(np.sqrt(4))
print(np.pi)  # Not a function, just a variable
print(np.sin(np.pi))

2.3 Use a method

Before we get any farther into the Python language, we have to say a word about "objects". We will not be teaching object oriented programming in this workshop, but you will encounter objects throughout Python (in fact, even seemingly simple things like ints and strings are actually objects in Python).

In the simplest terms, you can think of an object as a small bundled "thing" that contains within itself both data and functions that operate on that data. For example, strings in Python are objects that contain a set of characters and also various functions that operate on the set of characters. When bundled in an object, these functions are called "methods".

Instead of the "normal" function(arguments) syntax, methods are called using the syntax variable.method(arguments).

In [ ]:
# A string is actually an object
a = 'hello, world'
print(type(a))
In [ ]:
# Objects have bundled methods
#a.
print(a.capitalize())
print(a.replace('l', 'X'))

Exercise 1 - Conversion

Throughout this lesson, we will successively build towards a program that will calculate the variance of some measurements, in this case Height in Metres. The first thing we want to do is convert from an antiquated measurement system.

To change inches into metres we use the following equation (conversion factor is rounded)

$metre = \frac{inches}{39}$

  1. Create a variable for the conversion factor, called inches_in_metre.
  2. Create a variable (inches) for your height in inches, as inaccurately as you want.
  3. Divide inches by inches_in_metre, and store the result in a new variable, metres.
  4. Print the result

Bonus

Convert from feet and inches to metres.

TIP: A 'gotcha' for all Python 2 users (it was changed in Python 3) is the result of integer division. To make it work the obvious way, either:

  1. inches_in_metre = 39. (add the decimal to cast to a float, or use 39.4 to be more accurate)
  2. from __future__ import division - Put this at the top of the code and it will work
In [ ]:
 

3. Collections of things

While it is interesting to explore your own height, in science we work with larger, slightly more complex datasets. In this example, we are interested in the characteristics and distribution of heights. Python provides us with a number of objects to handle collections of things.

Probably 99% of your work in scientific Python will use one of four types of collections: lists, tuples, dictionaries, and numpy arrays. We'll look quickly at each of these and what they can do for you.

3.1 Lists

Lists are probably the handiest and most flexible type of container.

Lists are declared with square brackets [].

Individual elements of a list can be selected using the syntax a[ind].

In [ ]:
# Lists are created with square bracket syntax
a = ['blueberry', 'strawberry', 'pineapple']
print(a, type(a))
In [ ]:
# Lists (and all collections) are also indexed with square brackets
# NOTE: The first index is zero, not one
print(a[0])
print(a[1])
In [ ]:
## You can also count from the end of the list
print('last item is:', a[-1])
print('second to last item is:', a[-2])
In [ ]:
# you can access multiple items from a list by slicing, using a colon between indexes
# NOTE: The end value is not inclusive
print('a =', a)
print('get first two:', a[0:2])
In [ ]:
# You can leave off the start or end if desired
print(a[:2])
print(a[2:])
print(a[:])
print(a[:-1])
In [ ]:
# Lists are objects, like everything else, and have methods such as append
a.append('banana')
print(a)

a.append([1,2])
print(a)

a.pop()
print(a)

TIP: A 'gotcha' for some new Python users is that many collections, including lists, actually store pointers to data, not the data itself.

Remember when we set b=a and then changed a?

What happens when we do this in a list?

HELP: look into the copy module

In [ ]:
a = 1
b = a
a = 2
## What is b?
print('What is b?', b)

a = [1, 2, 3]
b = a
print('original b', b)
a[0] = 42
print('What is b after we change a ?', b)
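
The HELP line above points to the copy module. As a rough sketch of what it offers, copy.copy makes a new outer list while copy.deepcopy also copies anything nested inside it. The list contents and variable names here are made up for illustration:

```python
import copy

a = [1, 2, [3, 4]]
alias = a                  # same list object, just a second name
shallow = copy.copy(a)     # new outer list, but the inner list is shared
deep = copy.deepcopy(a)    # fully independent copy

a[0] = 42       # change an item of the outer list
a[2][0] = 99    # change an item of the nested list

print('alias:  ', alias)
print('shallow:', shallow)
print('deep:   ', deep)
```

Only the deep copy is unaffected by both changes. The shallow copy keeps its own first item but still sees the change to the nested list.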

EXERCISE 2 - Store a bunch of heights (in metres) in a list

  1. Ask five people around you for their heights (in metres).
  2. Store these in a list called heights.
  3. Append your own height, calculated above in the variable metres, to the list.
  4. Get the first height from the list and print it.

Bonus

  1. Extract the last value in two different ways: first, by using the index for the last item in the list, and second, presuming that you do not know how long the list is.

HINT: len() can be used to find the length of a collection

In [ ]:
 

3.2 Tuples

We won't say a whole lot about tuples except to mention that they basically work just like lists, with two major exceptions:

  1. You declare tuples using () instead of []
  2. Once you make a tuple, you can't change what's in it (referred to as immutable)

You'll see tuples come up throughout the Python language, and over time you'll develop a feel for when to use them.

In general, they're often used instead of lists:

  1. to group items when the position in the collection is critical, such as coord = (x,y)
  2. when you want to prevent accidental modification of the items, e.g. shape = (12,23)
In [ ]:
xy = (23, 45)
print(xy[0])
xy[0] = "this won't work with a tuple"

Anatomy of a traceback error

Tracebacks are shown when you try to do something with code that it isn't meant to do. They are meant to be informative but, like many things, are not always as informative as we would like.

Looking at our error:

TypeError                                 Traceback (most recent call last)
<ipython-input-25-4d15943dd557> in <module>()
      1 xy = (23, 45)
      2 xy[0]
----> 3 xy[0] = 'this wont work with a tuple'

TypeError: 'tuple' object does not support item assignment

  1. The command you tried to run raised a TypeError. This suggests you are using a variable in a way that its type doesn't support.
  2. The arrow ----> points to the line where the error occurred, in this case line 3 of the code above.
  3. Learning how to read a traceback error is an important skill to develop, and helps you know how to ask questions about what has gone wrong in your code.
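
The two key pieces of a traceback, the error type and the message, can also be inspected in code. The sketch below goes slightly beyond this lesson by using try and except, which we have not covered; treat it as an illustration rather than something you need right now:

```python
xy = (23, 45)

try:
    xy[0] = 'this will not work with a tuple'
except TypeError as err:
    # The same two pieces of information shown on the last traceback line
    error_name = type(err).__name__
    error_message = str(err)
    print(error_name + ':', error_message)
```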

3.3 Dictionaries

Dictionaries are the collection to use when you want to store and retrieve things by their names (or some other kind of key) instead of by their position in the collection. A good example is a set of model parameters, each of which has a name and a value. Dictionaries are declared using {}.

In [ ]:
# Make a dictionary of model parameters
convertors = {'inches_in_feet' : 12,
              'inches_in_metre' : 39}

print(convertors)
print(convertors['inches_in_feet'])
In [ ]:
## Add a new key:value pair
convertors['metres_in_mile'] = 1609.34
print(convertors)
In [ ]:
# Raise a KeyError by asking for a key that is not in the dictionary
print(convertors['blueberry'])
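
If you are not sure whether a key exists, there are ways to look it up without raising a KeyError. This is a small sketch using the in operator and the dictionary get method. The default value of 0 is an arbitrary choice for illustration:

```python
convertors = {'inches_in_feet': 12,
              'inches_in_metre': 39}

# The in operator checks whether a key is present
has_blueberry = 'blueberry' in convertors

# get returns the value, or a default if the key is missing
fallback = convertors.get('blueberry', 0)
known = convertors.get('inches_in_feet')

print(has_blueberry, fallback, known)
```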

3.4 Numpy arrays (ndarrays)

Even though numpy arrays (often written as ndarrays, for n-dimensional arrays) are not part of the core Python libraries, they are so useful in scientific Python that we'll include them here in the core lesson. Numpy arrays are collections of things, all of which must be the same type, that work similarly to lists (as we've described them so far). The most important differences are:

  1. You can easily perform elementwise operations (and matrix algebra) on arrays
  2. Arrays can be n-dimensional
  3. There is no equivalent to append, although arrays can be concatenated

Arrays can be created from existing collections such as lists, or instantiated "from scratch" in a few useful ways.

When getting started with scientific Python, you will probably want to try to use ndarrays whenever possible, saving the other types of collections for those cases when you have a specific reason to use them.
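
To illustrate the third point in the list above, here is a minimal sketch of joining two arrays with np.concatenate in place of append. The values are made up, not real measurements:

```python
import numpy as np

first = np.array([1.5, 1.6])     # made-up values
second = np.array([1.7, 1.8])

# Arrays have no append method; concatenate builds a new, longer array
combined = np.concatenate([first, second])
print(combined, combined.shape)

# Elementwise operation: every element is doubled at once
print(combined * 2)
```

Note that concatenate returns a new array and leaves first and second unchanged.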

In [ ]:
# We need to import the numpy library to have access to it 
# We can also create an alias for a library; you will commonly see this with numpy
import numpy as np
In [ ]:
# Make an array from a list
alist = [2, 3, 4]
blist = [5, 6, 7]
a = np.array(alist)
b = np.array(blist)
print(a, type(a))
print(b, type(b))
In [ ]:
# Do arithmetic on arrays
print(a**2)
print(np.sin(a))
print(a * b)
print(a.dot(b), np.dot(a, b))
In [ ]:
# Boolean operators work on arrays too, and they return boolean arrays
print(a > 2)
print(b == 6)

c = a > 2
print(c)
print(type(c))
print(c.dtype)
In [ ]:
# Indexing arrays
print(a[0:2])

c = np.random.rand(3,3)
print(c)
print('\n')
print(c[1:3,0:2])

c[0,:] = a
print('\n')
print(c)
In [ ]:
# Arrays can also be indexed with other boolean arrays
print(a)
print(b)
print(a > 2)
print(a[a > 2])
print(b[a > 2])

b[a == 3] = 77
print(b)
In [ ]:
# ndarrays have attributes in addition to methods
#c.
print(c.shape)
print(c.prod())
In [ ]:
# There are handy ways to make arrays full of ones and zeros
print(np.zeros(5), '\n')
print(np.ones(5), '\n')
print(np.identity(5), '\n')
In [ ]:
# You can also easily make arrays of number sequences
print(np.arange(0, 10, 2))

EXERCISE 3 - Using Arrays for simple analysis

Revisit your list of heights

  1. turn it into an array
  2. calculate the mean
  3. create a mask of all heights greater than a certain value (your choice)
  4. find the mean of the masked heights

BONUS

  1. find the number of heights greater than your threshold
  2. mean() can take an optional argument called axis, which allows you to calculate the mean across different axes, e.g. across rows or across columns. Create an array with two dimensions (not equal sized) and calculate the mean across rows and the mean across columns. Use 'shape' to understand how the means are calculated.
In [ ]:
 

4. Repeating yourself

So far, everything that we've done could, in principle, be done by hand calculation. In this section and the next, we really start to take advantage of the power of programming languages to do things for us automatically.

We start here with ways to repeat yourself. The two most common ways of doing this are known as for loops and while loops. For loops in Python are useful when you want to cycle over all of the items in a collection (such as all of the elements of an array), and while loops are useful when you want to cycle for an indefinite amount of time until some condition is met.

The basic examples below will work for looping over lists, tuples, and arrays. Looping over dictionaries is a bit different, since there is a key and a value for each item in a dictionary. Have a look at the Python docs for more information.
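
As a quick sketch of how looping over a dictionary differs, the example below reuses the convertors dictionary from section 3.3. Looping over the dictionary itself gives the keys, while the items method gives key and value together:

```python
convertors = {'inches_in_feet': 12,
              'inches_in_metre': 39}

# Looping over a dictionary directly gives its keys
for name in convertors:
    print(name)

# items() gives (key, value) pairs that can be unpacked in the loop
pairs = []
for name, value in convertors.items():
    pairs.append((name, value))
    print(name, '=', value)
```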

In [ ]:
# A basic for loop - don't forget the white space!
wordlist = ['hi', 'hello', 'bye']
for word in wordlist:
    print(word + '!')

Note on indentation: Notice the indentation once we enter the for loop. Every indented statement after the for loop declaration is part of the for loop. This rule holds true for while loops, if statements, functions, etc. Required indentation is one of the reasons Python is such a beautiful language to read.

If you do not have consistent indentation you will get an IndentationError. Fortunately, most code editors will help keep your indentation correct.

NOTE: In Python the convention is to use four (4) spaces for each level of indentation. Most editors can be configured to follow this guide.

In [ ]:
# Indentation error: Fix it!
for word in wordlist:
    new_word = word.capitalize()
   print(new_word + '!') # Bad indent
In [ ]:
# Sum all of the values in a collection using a for loop
numlist = [1, 4, 77, 3]

total = 0
for num in numlist:
    total = total + num
    
print("Sum is", total)
In [ ]:
# Often we want to loop over the indexes of a collection, not just the items
print(wordlist)

for i, word in enumerate(wordlist):
    print(i, word, wordlist[i])
In [ ]:
# While loops are useful when you don't know how many steps you will need,
# and want to stop once a certain condition is met.
step = 0
prod = 1
while prod < 100:
    step = step + 1
    prod = prod * 2
    print(step, prod)
    
print('Reached a product of', prod, 'at step number', step)

TIP: Once we start really generating useful and large collections of data, it becomes unwieldy to inspect our results manually. The code below shows how to make a very simple plot of an array. We'll do much more plotting later on; this is just to get started.

In [ ]:
# Load up pyplot, matplotlib's plotting interface
%matplotlib inline
import matplotlib.pyplot as plt

# Make some x and y data and plot it
y = np.arange(100)**2
plt.plot(y)

EXERCISE 4 - Variance

We can now calculate the variance of the heights we collected before.

As a reminder, sample variance is calculated from the sum of squared differences of each observation from the mean:

$variance = \frac{\Sigma{(x-mean)^2}}{n-1}$

where mean is the mean of our observations, x is each individual observation, and n is the number of observations.

First, we need to calculate the mean:

  1. Create a variable total for the sum of the heights.
  2. Using a for loop, add each height to total.
  3. Find the mean by dividing this by the number of measurements, and store it as mean.

Note: To get the number of things in a list, use len(the_list).

Now we'll use another loop to calculate the variance:

  1. Create a variable sum_diffsq for the sum of squared differences.
  2. Make a second for loop over heights.
    • At each step, subtract the mean from the height and call it diff.
    • Square this and call it diffsq.
    • Add diffsq on to sum_diffsq.
  3. Divide sum_diffsq by n-1 to get the variance.
  4. Display the variance.

Note: To square a number in Python, use **, e.g. 5**2.
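
If you want a reference value to check your loops against, one option is the variance function in Python's standard statistics module, which also divides by n-1. The sketch below uses made-up numbers, not heights, with the arithmetic worked by hand in the comments:

```python
import statistics

data = [1, 2, 3, 4]   # made-up numbers, not real heights

# By hand: mean = 2.5
# squared differences = 2.25, 0.25, 0.25, 2.25, which sum to 5.0
# 5.0 / (4 - 1) is about 1.667
reference = statistics.variance(data)   # sample variance (divides by n - 1)
print(reference)
```

Running your own loops on the same four numbers should give the same result.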

Bonus

  1. Test whether variance is larger than 0.01, and print out a line that says "variance more than 0.01: " followed by the answer (either True or False).
In [ ]:
 

5. Making choices

Often we want to check if a condition is True and take one action if it is, and another action if the condition is False. We can achieve this in Python with an if statement.

TIP: You can use any expression that returns a boolean value (True or False) in an if statement. Common boolean operators are ==, !=, <, <=, >, >=. You can also use is and is not if you want to check if two variables are identical in the sense that they are stored in the same location in memory.
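
The difference between == and is can be surprising, so here is a small sketch with lists. Two lists with equal contents compare as equal with ==, but is only gives True when both names refer to the very same object. The variable names are just for illustration:

```python
first = [1, 2, 3]
second = [1, 2, 3]   # equal contents, but a separate list
alias = first        # another name for the same list

same_value = (first == second)
same_object = (first is second)
alias_is_first = (alias is first)

print(same_value, same_object, alias_is_first)
```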

In [ ]:
# A simple if statement
x = 3
if x > 0:
    print('x is positive')
elif x < 0:
    print('x is negative')
else:
    print('x is zero')
In [ ]:
# If statements can rely on boolean variables
x = -1
test = (x > 0)
print(type(test)); print(test)

if test:
    print('Test was true')

6. Creating chunks with functions and modules

One way to write a program is to simply string together commands, like the ones described above, in a long file, and then to run that file to generate your results. This may work, but it can be cognitively difficult to follow the logic of programs written in this style. Also, it does not allow you to reuse your code easily - for example, what if we wanted to run our variance calculation on several different sets of heights?

The most important way to "chunk" code into more manageable pieces is to create functions and then to gather these functions into modules, and eventually packages. Below we will discuss how to create functions and modules. A third common type of "chunk" in Python is classes, but we will not be covering object-oriented programming in this workshop.

In [ ]:
# We've been using functions all day
x = 3.333333
print(round(x, 2))
print(np.sin(x))
In [ ]:
# It's very easy to write your own functions
def multiply(x, y):
    return x*y
In [ ]:
# Once a function is "run" and saved in memory, it's available just like any other function
print(type(multiply))
print(multiply(4, 3))
In [ ]:
# It's useful to include docstrings to describe what your function does
def say_hello(time, people):
    '''
    Function says a greeting. Useful for engendering goodwill
    '''
    return 'Good ' + time + ', ' + people

Docstrings: A docstring is a special type of comment that tells you what a function does. You can see them when you ask for help about a function.
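
Besides the question mark help in the notebook, a docstring can be read directly from the function. This is a minimal sketch using a hypothetical function called double, which is not part of the lesson:

```python
def double(x):
    '''
    Return x multiplied by two.
    '''
    return 2 * x

# The docstring is stored on the function as the __doc__ attribute
doc_text = double.__doc__
print(doc_text)

# Calling help(double) would show the same text along with the signature
```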

In [ ]:
say_hello('afternoon', 'friends')
In [ ]:
# All arguments must be present, or the function will raise an error
say_hello('afternoon')
In [ ]:
# Keyword arguments can be used to make some arguments optional by giving them a default value
# All mandatory arguments must come first, in order
def say_hello(time, people='friends'):
    return 'Good ' + time + ', ' + people
In [ ]:
say_hello('afternoon')
In [ ]:
say_hello('afternoon', 'students')
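
Arguments can also be passed by name when calling a function, which can make calls easier to read. The sketch below repeats the say_hello definition from above so that it runs on its own:

```python
def say_hello(time, people='friends'):
    return 'Good ' + time + ', ' + people

# Leaving out people falls back to the default value
default_greeting = say_hello('morning')

# Arguments passed by name can be given in any order
named_greeting = say_hello(time='evening', people='students')
reordered = say_hello(people='students', time='evening')

print(default_greeting)
print(named_greeting)
```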

EXERCISE 5 - Creating a variance function

Finally, let's turn our variance calculation into a function that we can use over and over again. Copy your code from Exercise 4 into the box below, and do the following:

  1. Turn your code into a function called calculate_variance that takes a list of values and returns their variance.
  2. Write a nice docstring describing what your function does.
  3. In a subsequent cell, call your function with different sets of numbers to make sure it works.

Bonus

  1. Refactor your function by pulling out the section that calculates the mean into another function, and calling that inside your calculate_variance function.
  2. Make sure it works properly when all the data are integers as well.
  3. Give a better error message when it's passed an empty list. Use the web to find out how to raise exceptions in Python.
In [ ]:
 

EXERCISE 6 - Putting the calculate_mean and calculate_variance function(s) in a module

We can make our functions more easily reusable by placing them into modules that we can import, just like we have been doing with numpy. It's pretty simple to do this.

  1. Copy your function(s) into a new text file, in the same directory as this notebook, called stats.py.
  2. In the cell below, type import stats to import the module. Type stats. and hit tab to see the available functions in the module. Try calculating the variance of a number of samples of heights (or other random numbers) using your imported module.
In [ ]: