Now that you understand the basics of programming in Python, we'll move on to discuss strategies for managing the code that you write. In this lesson, we'll cover four topics: moving your code out of notebooks and into plain text files, separating reusable modules from analysis scripts, testing your functions, and documenting your code.
Before we get started, let's review the code that we'll use for this lesson, which comes from our earlier Scientific Programming II lesson. In that lesson, we developed code to load a small data table of bird sightings, calculate the mean number of birds seen per present species, and save the results. By the final exercise in that lesson, we ended up with a cell in our notebook containing the following code:
import pandas as pd                                            # 1

birds_df = pd.read_csv('birds_sm.csv', index_col='Species')   # 2

def count_birds(birds_df):                                     # 3
    years = birds_df.columns
    results_df = pd.DataFrame(index=['Mean'], columns=years)   # 4
    for year in years:                                          # 5
        birds_this_year = birds_df[year]
        sum_counts = birds_this_year.sum()                      # 6
        species_seen = (birds_this_year > 0).sum()
        if species_seen == 0:                                   # 7
            results_df[year] = 0
        else:
            results_df[year] = sum_counts / species_seen
    return results_df                                           # 8

results_df = count_birds(birds_df)                              # 9
results_df.to_csv('birds_results.csv')                          # 10

ax = results_df.T.plot()                                        # 11
fig = ax.get_figure()
fig.savefig('birds_results.pdf')
As a reminder, this code does the following (the numbers below correspond to the numbers in the comments in the code above):

1. Import the pandas module
2. Load the data table birds_sm.csv, using the Species column as the row index
3. Define a function, count_birds, that takes a data frame of bird counts
4. Create an empty results data frame with a single row, Mean, and one column per year
5. Loop over each year in the table
6. Sum the counts of all birds seen in that year
7. If no species were seen that year, record a mean of zero (avoiding a division by zero); otherwise record the total count divided by the number of species seen
8. Return the results data frame
9. Call count_birds on our data table
10. Save the results to birds_results.csv
11. Plot the results and save the figure as birds_results.pdf

We'll use this code as a starting point for this lesson on modules and testing. If you haven't done so already, you'll need to download the file birds_sm.csv and make sure that it's in the same directory where you eventually save your code files.
Up to this point, we've written and executed all of our Python code in Jupyter notebooks. A notebook is a great way to try new things and to write prototype code, and some "advanced" data scientists use notebooks for their entire workflow. However, the more common means of structuring scientific software is to save your code in various plain text files and to run these files from the command line. We'll look at how to do this now.
To get started, open a plain text editor on your computer (you don't want to use nano for this exercise — you'll want a fully fledged graphical editor). If you haven't downloaded a good plain text editor yet, you might try Sublime Text.
Exercise 1
Using your editor, copy and paste the entire block of code above into a file called runall.py. Change the data table from birds_sm.csv to birds_lg.csv, since we're now interested in running our full analysis. See the notes below for some potential issues to avoid.
As you start to work with plain text code files, there are three important pitfalls to note. First, be careful with file extensions. Some editors will try to append something like .txt when you save a file — you want to be sure that your files actually have the extension .py, as shown above. Second, watch out for indentation. The code in these files should sit flush against the left margin of the editor, with indents used, as before, only inside the function definition, the for loop, and the if statement. Third, never use tabs. Always use spaces to indent code, with the standard size of an indent being four spaces. Most good text editors have an option to automatically substitute spaces when you press the tab key (in Sublime Text, look at the bar at the bottom of the window, click on the words Tab Size, and check the option to indent using spaces).
At this point, we have a file that contains the same code that we had in a Jupyter notebook cell at the end of the Scientific Programming II lesson. In the notebook, we could press a button or use Ctrl-Enter to execute a cell. So how do we run the code contained in a file? We can actually do this quite easily from the command line.
Exercise 2
Open a terminal window and navigate to the directory holding your runall.py script. Make sure that it also contains the data file birds_lg.csv (keep birds_sm.csv around as well, since we'll need it later for testing). If you still have the results files birds_results.csv and birds_results.pdf in this directory from the previous lesson, delete them using rm. Then type the command python runall.py and hit Return. What happens?

Bonus: Open your runall.py file and add a line to print the contents of the results data frame. Run the script again.
The statement python my_file.py, run at the command line, thus does the exact same thing as executing a cell, or several cells, in a Jupyter notebook containing the same code.
Now that you've seen how to get your code out of a notebook and into a plain text file, let's start thinking about how to organize your code into files. Perhaps the single most important piece of advice that we can give you for managing your code is to separate the reusable, conceptual parts of your code (the part that does the "science") from the code used to run a specific individual analysis. There are at least three important reasons to do this: it makes the core code easier to reuse in other analyses, easier to read and maintain, and, as we'll see below, much easier to test.
Looking at the code that we've written above, which part is the scientific "guts" and which part is specific to this analysis? Sometimes, as in this case, it's fairly obvious — here, the conceptual analysis part of the code occurs in the count_birds function, while the rest of the code loads and saves a particular data set.
(As an aside, you might argue that the loading and saving parts are actually conceptual as well, and that we should consider everything other than just picking the input data table to be part of the core scientific analysis. This is reasonable, and in this case, I would encourage you to refactor the function above so that you have a single function that takes in a path to the data file and then saves the result data frame and plots. You'll see later, though, that this design would make our code harder to test.)
Our task, then, is to separate our count_birds function from the rest of the code.
Exercise 3
Open your runall.py file and create a new file called analysis_functions.py. Cut the function count_birds out of runall.py and paste it into analysis_functions.py. At the top of this file, before the function, add the line import pandas as pd, since our function makes use of the pandas library.

Run the runall.py script again and you'll see an error. What's the problem? Can you fix it? (Hint: Think about how you tell your runall.py to find the pandas module, and then how you call a function within that module.)
At this point, we have two files: the file analysis_functions.py contains the scientific core of our analysis, the function count_birds, and the file runall.py contains the surrounding code that executes and saves the results from our particular data set analysis. Below, we'll refer to the analysis_functions.py file as a module, the common term for a file containing functions that is designed to be imported, and we'll refer to runall.py as a script, the term for a file that is meant to be run itself.
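For reference, once the function has been moved out (and Exercise 3 solved), your runall.py might look roughly like the sketch below. The alias af is just a convenient choice, not something the lesson requires:

import pandas as pd
import analysis_functions as af   # our new module of analysis functions

# Load the particular data set for this analysis
birds_df = pd.read_csv('birds_lg.csv', index_col='Species')

# Run the core analysis function from the module
results_df = af.count_birds(birds_df)

# Save the results table and a plot of the results
results_df.to_csv('birds_results.csv')
ax = results_df.T.plot()
fig = ax.get_figure()
fig.savefig('birds_results.pdf')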
Now that we've separated our core, conceptual scientific code out into a module, we can start to think not about the technicalities of writing it, but about the bigger-picture question of "righting" it — that is, how can we tell that our code is actually giving us the correct answer? As disciplinary scientists, we would never trust a lab measurement that we made with uncalibrated instruments. Similarly, as computational scientists, we shouldn't trust the results that our code gives us until we have tested it.
In this lesson, we'll focus on unit tests, perhaps the most basic type of testing that we can run. Unit tests focus on a single "unit" of code, which in our case will be functions that we've written. We'll write tests to ensure that when our function is given a certain set of arguments as input, it generates output that we know to be correct. Once we have a complete test suite for a function, we can run the entire suite to make sure that all the tests pass (i.e., that our function gives the correct output for all the combinations of input that we have decided to test).
In our example, we'll focus on testing the count_birds function in our analysis_functions.py module. Our strategy will be simply to give this function a data frame for which we can calculate the correct result by hand, and we'll make sure that the function gives us back this known correct answer. We can do this several times for different test data sets. If our function passes all of our tests, it gives us more confidence that if we run the function on a different data set, perhaps a huge one for which we can't verify the results by hand, we'll get an accurate result.
Although we won't discuss it explicitly here, testing has another huge benefit for the long-term maintenance and reuse of code: if we make changes to the internals of our function, we can run our tests again to make sure that we haven't accidentally broken anything (such a breakage is known as a "regression"). This leaves us freer to continue improving our code over time, and helps us avoid the dreaded "it's working, don't touch it" phenomenon.
In this lesson, we're going to use the simple and very popular nose package to write and run our tests.
We'll start by creating a new text file called test_analysis_functions.py, which will hold our unit tests. At the top of this file, type (or copy) in the following lines to import pandas, numpy (an important scientific package that, among other things, contains some useful functions for testing), and our module:
import pandas as pd
import numpy as np
import analysis_functions as af
Now, let's write our first test function, which will simply check that our function gives the right answer for the first year in our small data set. This is a very simple test which, if nothing else, will at least confirm that the function runs without raising an error, and that it gives a correct result for a very typical case.
Test functions (written for the nose testing package) can contain any type of Python code, just like regular functions, but they have a few key features. First, they don't take any arguments. Second, they contain at least one assert statement that compares two values: the one calculated by our function and a correct one that we enter by hand.
The test function below loads our small data table, uses count_birds to analyze it, and tests the result for 2010. Copy and paste this code into the test_analysis_functions.py file, below the import statements:
def test_count_birds_small_table():
    input_df = pd.read_csv('birds_sm.csv', index_col='Species')
    results_df = af.count_birds(input_df)
    np.testing.assert_array_equal(results_df['2010'], 24.5)
The first of the three lines in this test function reads in our small data table. The second line runs the count_birds function on the table and saves the resulting data frame as results_df. The third line uses a function from numpy to assert that the two arguments given to it, here results_df['2010'] and 24.5, are equal. There are a few different ways to write assert statements, but this function from numpy will work for all reasonably simple kinds of tests.
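For example (purely as an illustration; these lines aren't part of our test suite), an exact comparison can also be written with a plain Python assert, and numpy's assert_allclose is useful when floating-point rounding might make an exact comparison too strict:

# A plain Python assert fails with an AssertionError if the comparison is False
assert results_df['2010'].iloc[0] == 24.5

# numpy's assert_allclose tolerates tiny floating-point differences
np.testing.assert_allclose(results_df['2010'], 24.5)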
Note that we had to calculate the correct value for 2010, which is 24.5, by looking at the csv file manually. Make sure that you get this right. And resist the temptation to run the function, print the result, assume that it's correct, and then make a test based on that output — you'll never catch a bug that way!
Now we're ready to run our suite of tests (so far, just this one test). Open a terminal window, and cd to the directory containing your Python files. Type nosetests, and examine the output. It should look something like this:
.
----------------------------------------------------------------------
Ran 1 test in 0.238s
OK
The dot on the first line indicates that we had one test and that it passed. One character is printed for each test: a '.' means the test passed, an 'F' means the test failed, and an 'E' means there was an error in the test function itself or in a function that it called.
Just for fun, try changing your test so that it fails (for example, assert that the answer for 2010 should be 30). What output do you see now? Don't forget to change the test back so that it passes after you're done.
In addition to reading in data, we could also have constructed the input data frame manually, which gives us an easy way to test and verify the behavior of our function under special circumstances (like only one species, or an empty input table). We can create a data table manually using the code below:
input_df = pd.DataFrame([[1,2],[3,4]], index=['Sp1', 'Sp2'],
                        columns=['2010', '2011'])
This code uses the function DataFrame from the pandas library to create a new data frame called input_df. The first argument, which looks a bit funny, is just a list of lists (remember lists from the first scientific programming lesson?), which is how we tell pandas that we want to make a two-dimensional table. The second argument, index, tells the function what names we want to use for the index of the records (i.e., the names of the rows), and the third argument, columns, is a list of the names of the columns. Note that it's OK for the arguments to a function to appear on multiple lines.
It's not always obvious what a line of code like the above will actually do. To see it better, open a terminal window, navigate to your folder containing the Python lessons, and run the command jupyter notebook. Create a new notebook using the menu bar near the top of the screen. In that notebook, first import pandas, then copy the line of code above into a cell, run the cell, and then print(input_df). This will show you what our manually created data frame actually looks like, and in particular what numbers have ended up in the rows and columns of the table.
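If all goes well, you should see something like the following (the exact spacing may differ slightly depending on your pandas version):

     2010  2011
Sp1     1     2
Sp2     3     4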
Exercise 4
Here's another test function that looks a lot like the first, except that we've manually constructed our input data frame using the pd.DataFrame function.

def test_count_birds_simulated_data():
    input_df = pd.DataFrame([[1,2],[3,4]], index=['Sp1', 'Sp2'],
                            columns=['2010', '2011'])
    results_df = af.count_birds(input_df)
    np.testing.assert_array_equal(results_df, [[XXX, YYY]])

Figure out what the numbers XXX and YYY should be in the final assert statement. Fill them in, then copy the entire test function into your test_analysis_functions.py file, and run nosetests again to make sure that this function also passes.
Exercise 5
Recall that when we were writing the count_birds function, we had to be careful to specify what the function should do if there were no species seen in a particular year. Make a copy of the test function from the previous exercise, and using it as a starting point, design a test to ensure that the count_birds function returns a result of zero for a year in which no birds of any species were seen. Add this third function to your test file and run nosetests again.
Of course, our count_birds function was relatively simple, and so there wasn't a whole lot of behavior to test. Many of the functions you write will take a number of different parameters as inputs, and you'll want to test several different combinations of them.
When designing tests, I'd recommend thinking in two steps. First, think about writing tests that will make sure that all possible behavior of your function is exercised (that's why we wrote the test above for a year with zero counts for every species — we want to verify all possible behavior). Second, think about how the arguments that you, or someone else, might give your function could cause it to break. Can you verify that it works correctly in these cases? Think in particular about boundary cases (what if an input argument is zero? What if it's huge?).
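As one illustration (a hypothetical extra test, not part of the exercises above), a boundary test for the smallest sensible input, a single species observed in a single year, might look something like this:

def test_count_birds_single_species():
    # Boundary case: the smallest sensible table, one species and one year
    input_df = pd.DataFrame([[5]], index=['Sp1'], columns=['2010'])
    results_df = af.count_birds(input_df)
    # With only one species seen, the mean should equal that species' count
    np.testing.assert_array_equal(results_df['2010'], 5)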
Although here we've written our function first and then our tests, this isn't the way that software development proceeds in reality. There's always an interaction between writing your code and its tests — writing tests often highlights ways in which you need to modify your code, and modifying your code often adds new features that you know you should test. Both code and tests are thus best written at the same time. (There's also a school of thought called Test Driven Development, or TDD, that says that all of your tests should be written first, and that you should never actually write any production code until there's already a test to determine whether it works. This is an interesting ideal, if one that's difficult to follow in practice.)
A final and very important part of managing your scientific code is writing useful documentation. First and foremost, good documentation will help you, and your collaborators, better understand and maintain your code. Documentation is also an important component of a reproducible workflow, which we'll discuss in the next lesson — just as your code helps to reproduce the results of your analysis, documentation helps you reproduce the thought process that led you to write your code in a particular way.
Every programming language has its own particular guidelines and style for documentation — regardless of what language you're using, I'd suggest Googling something like "SomeLanguage style guide" to read about common conventions. Below, we'll briefly discuss documentation in the context of Python code, functions, and modules, but a similar high-level conceptual framework will apply to any language.
In short, I would suggest mentally dividing your documentation into four basic levels, ranging from the smallest to the broadest scale.
At the smallest scale are line-by-line comments on your code, such as
# If no species present, mean counts per species should be zero
if species_seen == 0:
    results_df[year] = 0
else:
    results_df[year] = sum_counts / species_seen
Code comments such as these should generally be restricted to one line, although two or three lines would be OK for cases that require more explanation. Most importantly, comments should describe what the code is intended to do and why, not simply repeat literally what the code does. The above comment, for example, explains the purpose of the subsequent lines. In contrast, the comment below is basically useless, as it simply repeats what anyone reading the code could have already told you.
# Set x to zero
x = 0
A better option would be
# Initialize running count of individuals
x = 0
There's an art to determining how many comments are too many. I personally tend to be quite verbose with my comments, since I use them as markers to help me find the sections of my code that perform particular conceptual steps. Most of my code thus has a comment every three to five lines or so, although some would find this excessive.
Finally, make sure not to let your comments get out of date — when you update your code, you must update the corresponding comments. An out-of-date or incorrect comment is worse than no comment at all.
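For example (a made-up illustration using our own function), imagine that the calculation changed at some point but the comment didn't:

# Mean count over ALL species in the table   <- no longer true; misleads the reader
results_df[year] = sum_counts / species_seen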
Exercise 6
Open your analysis_functions.py module and add a few comments to various lines to help explain what's happening there.
All languages have conventions surrounding how to document the operations of a function. In Python, a description of a function is known as a docstring. Many scientific Python packages use a convention similar to the one below.
def count_birds(birds_df):
    """
    Calculate mean number of birds per species present.

    Parameters
    ----------
    birds_df : DataFrame
        Table with species in rows and years in columns, values are counts

    Returns
    -------
    DataFrame
        One row titled Mean for the result, with a column for each year
    """
Docstrings in Python begin and end with triple quotes, on their own lines, and they are indented just like the code that comes within the function.
Other than the specific syntax, two main features of this docstring apply to any scientific programming language. First, the description opens with a single line summarizing the high-level purpose of the function. Second, the inputs and outputs of the function are clearly described. If the function were more complex, it would also be common to include here an example of how it could be used.
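For instance, if we did want to include one, an Examples section in the same docstring style might look something like the sketch below (the counts here are just made up for illustration):

    Examples
    --------
    >>> df = pd.DataFrame([[2, 0], [4, 6]], index=['Sp1', 'Sp2'],
    ...                   columns=['2010', '2011'])
    >>> count_birds(df)
          2010  2011
    Mean   3.0   6.0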
Exercise 7
Copy the above docstring (feel free to modify it if you'd like) into your count_birds function.
At a higher level, you should also provide some overarching documentation of each of your Python module files. This is usually a relatively short summary, compared to a function-level docstring, that states the purpose of the module and lists what the module contains.
"""
Module containing functions to calculate mean number of sightings of
bird species.
Functions
---------
count_birds - Calculate mean number of birds per species present.
"""
Like function docstrings, a module docstring starts and ends with triple quotes. Module docstrings appear at the very top of your module file, above the import statements and any other Python code.
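Putting the pieces together, the top of your analysis_functions.py file would then be ordered roughly like this (docstring first, then imports, then function definitions):

"""
Module containing functions to calculate mean number of sightings of
bird species.

Functions
---------
count_birds - Calculate mean number of birds per species present.
"""

import pandas as pd


def count_birds(birds_df):
    # ... function body as before ...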
Exercise 8
Copy the above module documentation to the top of your analysis_functions.py module.
At the highest level of documentation, we find information that is intended (mostly) to be read by users of your code to gain an overview of everything that your code does. We won't talk about this level in detail, as it is most important for larger projects that are shared widely and maintained on an ongoing basis (which may not apply to many of your research projects). That said, even for relatively small projects, it never hurts to create a short document, or even a script, that shows off how your code can and should be used.
If you are interested in learning more about this process for Python packages, I'd suggest having a look at Sphinx, which is the tool used to create documentation for Python itself as well as most of the main scientific packages. The websites documenting core Python, matplotlib, and numpy provide some useful examples of Sphinx in use as well as some general documentation styles that you might wish to review.