Programming with R

Learning Objectives

Exposure to different debugging strategies and options in R

Explain strategies to minimize time debugging

Debug a function

When I am embarking upon a particular analysis, I first outline the different functions I will need, their input and outputs. I then write the functions in the order in which they are needed, testing/debugging the functions along the way.

Basic debugging strategies

(modified by a tutorial from Chris Paciorek)

Read and think about the error message. Sometimes it’s inscrutable, but often it just needs a bit of deciphering. Looking up a given error message in the R mailing list archive or on Stack Overflow or simply doing a web search with the exact message in double quotes can be a good strategy.
Fix errors from the top down — fix the first error that is reported, because later errors are often caused by the initial error. It’s common to have a string of many errors, which looks daunting, caused by a single initial error.
Is the bug reproducible – does it always happen in the same way at at the same point? It can help to restart R and see if the bug persists – this can sometimes help in figuring out if there is a scoping issue (your function is not looking for objects where you think it is) and we are using a global variable that we did not mean to.
Another basic strategy is to build up code in pieces (or tear it back in pieces to a simpler version). This allows you to isolate where the error is occurring.
If you’ve written your code modularly (functions are not nested within each other) with lots of functions, you can test individual functions. Often the error will be in what gets passed into and out of each function.
At the beginning of time (the 1970s?), the standard debugging strategy was to insert print statements in one’s code to see the value of a variable and thereby decipher what could be going wrong. We have better tools nowadays.
R is a scripting language, so you can usually run your code line by line to figure out what is happening. This can be a decent approach, particularly for simple code. However, when you are trying to find errors that occur within a series of many nested function calls or when the errors involve variable scoping (how R looks for variables that are not local to a function), or in other complicated situations, using formal debugging tools can be much more effective. Finally, if the error occurs inside of functions provided by R, rather than ones you write, it can be hard to run the code in those functions line by line.

Some common causes of bugs

Some of these are R-specific, while others are common to a variety of languages.

Parenthesis mis-matches
[[...]] vs. [...]
== vs. =
Silent type conversion when you don’t want it, or lack of coercion where you’re expecting it!!!!!!
Using the wrong function or variable name
Giving unnamed arguments to a function in the wrong order
Forgetting to define a variable in the environment of a function and having R, get that variable as a global variable from one of the enclosing environments. At best the types are not compatible and you get an error; at worst, you use a garbage value and the bug is hard to trace. In some cases your code may work fine when you develop the code (if the variable exists in the enclosing environment), but then may not work when you restart R if the variable no longer exists or is different.

More insidious bugs - Comparing real numbers exactly using == is dangerous because numbers on a computer are only represented to limited numerical precision. For example,

    1/3 == 4*(4/12-3/12)

## [1] FALSE

You want to compare an entire vector but your code just compares the first value (e.g., in an if statement) – consider using identical or all.equal
In an if-else statement, the else cannot be on its own line (unless all the code is enclosed in {}) because R will see the if part of the statement, which is a valid R statement, will execute that, and then will encounter the else and return an error.

General tips to avoid bugs

Use core R functionality and algorithms already coded. Figure out if a functionality already exists in (or can be adapted from) an R package (or potentially in a C/Fortran library/package): code that is part of standard mathematical/numerical packages will probably be more efficient and bug-free than anything you would write.
Code in a modular fashion, making good use of functions, so that you don’t need to debug the same code multiple times. Smaller functions are easier to debug, easier to understand, and can be combined in a modular fashion (like the UNIX utilities).
Write code for clarity and accuracy first; then worry about efficiency. Write an initial version of the code in the simplest way, without trying to be efficient (e.g., you might use for loops even if you’re coding in R); then make a second version that employs efficiency tricks and check that both produce the same output.
Plan out your code in advance, including all special cases/possibilities.
Write tests for your code early in the process.
Build up code in pieces, testing along the way. Make big changes in small steps, sequentially checking to see if the code has broken on test case(s).
Don’t hard code numbers
use variables (e.g., number of iterations, parameter values in simulations), even if you don’t expect to change the value, as this makes the code more readable and reduces bugs when you use the same number multiple times; e.g. speedOfLight <- 3e8 or nIts <- 1000.

R options for debigging

Interactive debugging via the browser

The core strategy for interactive debugging is to use the browser function, which pauses the current execution, and provides an interpreter, allowing you to view the current state of R. You can invoke browser in four ways

by inserting a call to browser() in your code if you suspect where things are going wrong
by invoking the browser after every step of a function using debug
by using options(error = recover) to invoke the browser when an error occurs
by temporarily modifying a function to allow browsing using trace

Once in the browser, you can execute any R commands you want. In particular, using ls to look at the objects residing in the current function environment, looking at the values of objects, and examining the classes of objects is often helpful.

Using debug to step through code

To step through a function, use debug(nameOfFunction). Then run your code. When the function is executed, R will pause execution just before the first line of the function. You are now using the browser and can examine the state of R and execute R statements.

Once in the browser context, you can use ‘n’ or return to step to the next line, ‘c’ to execute the entire current function or current loop, and `’ to stop debugging. We’ll see this in the screencast demo.

To unflag the function so that calling it doesn’t invoke debug, use undebug(nameOfFunction). In addition to working with functions you write you can use debug with standard R functions and functions from packages. For example you could do debug(glm).

Tracing errors in the call stack

traceback and recover allow you to see the call stack at the time of the error - i.e., they will show you all the functions that have been called, in the order called. This helps pinpoint where in a series of function calls the error may be occurring.

If you’ve run the code and gotten an error, you can invoke traceback after things have gone awry. R will show you the call stack, which can help pinpoint where an error is occurring.

More helpful is to be able to browse within the call stack. To do this invoke options(error = recover) (potentially in your .Rprofile if you do a lot of programming). Then when an error occurs, recover gets called, usually from the function in which the error occurred. The call to recover allows you to navigate the stack of active function calls at the time of the error and browse within the desired call. You just enter the number of the call you’d like to enter (or 0 to exit). You can then look around in the frame of a given function, entering an empty line when you want to return to the list of calls again.

You can also combine this with options(warn = 2), which turns warnings into errors to get to the point where a warning was issued.

Using trace to temporarily insert code

trace lets you temporarily insert code into a function (including standard R functions and functions in packages!) that can then be easily removed. You can use trace in a variety of ways.

The most flexible way to use trace is to use the argument edit = TRUE and then insert whatever code you want wherever you want in the function given as the first argument to trace. If I want to ensure I use a particular editor, such as emacs, I can use the argument edit = “emacs”. A standard approach would be to add a line with browser() at some point in the function to be able to step through the code from that point.

You can also use trace without directly editing the function. Here are a couple examples:

trace(lm, recover) # invoke recover when the function (lm in this case) starts
trace(lm, exit = browser) # invoke browser when the function ends

You call untrace, e.g., untrace(lm), to remove the temporarily inserted code; otherwise it’s removed when the session ends.

To figure out why warnings are being issued, you can do trace(warning, recover) which will insert a call to recover whenever warning is called.

Of course you can manually change the code in a function without using trace, but it’s very easy to forget to change things back (and a pain to remember exactly what you changed) and hard to do this with functions in packages, so trace is a nice way to do things.

Challenge - Debugging

Use browser() to find the bug in this function to plot the number of bee operations in New Zealand

 hb <- read.csv('data/beeOperations.csv')
 hb <- hb[order(hb$year),]

 plotVar <- function(hb, var, ylabs,
                    types=c("hobby", "semicom", "com"),
                    div=10^3){
    ## function to plot honey bee data by region
    ## takes hb, a data.frame with the varaibles to be plotted
    ## var, the name of the column of interest
    ## types, the different types of honey bee keepers
    ## div, a scalar to divide the values on the yaxis by to avoid
    ## ugly numbers
    layout(matrix(1:(length(unique(hb$region))+1), ncol=2))
    par(oma=c(3,3,2,1), mar=c(2,2,3,1), mgp=c(2,1,0),
        cex.axis=1.5)
    ntypes <- length(types)
    for(i in unique(hb$region)){
       this.region <- hb[hb$region == i,]
        plot(x=this.region$year[this.region$type == types[1]],
             y=this.region[,var][this.region$type == types[1]],
             type="l",
             ylim=c(0, max(hb[,var], na.rm=TRUE)),
             xlab="Year",
             ylab=ylabs,
             yaxt="n",
             main=i)
        ltys <- c("dashed", "dotted")
        for(j in 2:length(types)){
        points(x=this.region$year[this.region$type == types[j]],
               y=this.region[,var][this.region$type == types[j]],
               type="l", lty=ltys[j])
        }
        if(i %in% unique(hb$region)[1:4]){
            axis(2, pretty(c(0, max(hb[,var], na.rm=TRUE))),
                 labels=pretty(c(0, max(hb[,var], na.rm=TRUE)))/div)
        }
    }
}


plotVar(hb, "hives", "Hives")
plotVar(hb, "apiaries", "Apiaries")
plotVar(hb, "beekeepers", "Beekeepers")
plotVar(hb, "hives_beekeeper", "Hives per beekeeper")
plotVar(hb, "apiary_beekeeper", "Apiaries per beekeeper", div=1)

Practice defensive programming

When writing functions, and software more generally, you’ll want to warn the user or stop execution when there is an error and exit gracefully, giving the user some idea of what happened. Here are some things to consider:

check function inputs and warn users if the code will do something they might not expect or makes particular choices;
check inputs to if and the ranges in for loops;
provide reasonable default arguments;
document the range of valid inputs;
check that the output produced is valid; and
stop execution based on checks and give an informative error message.

The warning and stop functions allow you to do stop execution and issue warnings or errors in the same way that base R code does; in general they would be called based on an if statement. More succinctly, to stop code if a condition is not satisfied, you can use stopifnot. This allow you to catch errors that can be anticipated.

Here’s an example of building a robust square root function using stop and warning. Note you could use stopifnot(is.numeric(x)) in place of one of the checks here.

mysqrt <- function(x) {
    if (is.list(x)) {
        warning("x is a list; converting to a vector")
        x <- unlist(x)
    }
    if (!is.numeric(x)) {
        stop("What is the square root of 'bob'?")
        ## alternatively: stopifnot(is.numeric(x))
    } else {
        if (any(x < 0)) {
            warning("mysqrt: found negative values; proceeding anyway")
            x[x >= 0] <- (x[x >= 0])^(1/2)
            x[x < 0] <- NaN
            return(x)
        } else return(x^(1/2))
    }
}
mysqrt(c(1, 2, 3))
mysqrt(c(5, -7))
mysqrt(c("asdf", "sdf"))
mysqrt(list(5, 3, "ab"))

Catch run-time errors with `try` statements

Also, sometimes a function you call will fail, but you want to continue execution. For example, suppose you are doing a stratified analysis in which you take subsets of your data based on some categorical variable and fit a statistical model for each value of the categorical variable. If some of the subsets have no or very few observations, the statistical model fitting might fail. To do this, you might be using a for loop or lapply. You want your code to continue and fit the model for the rest of the cases even if one (or more) of the cases cannot be fit. You can wrap the function call that may fail within the try function (or tryCatch) and then your code won’t stop, even when an error occurs. Here’s a toy example.

set.seed(0)
nCats <- 30
n <- 100
y <- rnorm(n)
x <- rnorm(n)
cats <- sample(1:nCats, n, replace = TRUE)
data <- data.frame(y, x, cats)

params <- matrix(NA, nr = nCats, nc = 2)
for(i in 1:nCats){
    sub <- data[data$cats == i, ]
    fit <- try(
        lm(y ~ x, data = sub) )
    if(!is(fit, "try-error"))
        params[i, ] = fit$coef
}
params

##              [,1]       [,2]
##  [1,]  0.32220511  0.4166179
##  [2,]  0.47524008 -0.6892083
##  [3,] -0.96340970  0.8029426
##  [4,]  0.91113661  0.4604388
##  [5,]  0.50491729  0.2285024
##  [6,]  9.48183414 13.2340397
##  [7,]          NA         NA
##  [8,] -0.14232948  0.1600386
##  [9,] -0.08800939  1.1267142
## [10,]  0.43568330         NA
## [11,]  0.16216315  0.0224648
## [12,] -0.08576737 -0.1363258
## [13,] -0.43036777  1.0122300
## [14,]  0.39878728  0.1609506
## [15,] -0.57225776 -0.1405326
## [16,]  0.76815674  1.0656086
## [17,]  0.72675075         NA
## [18,] -0.38587004  0.3660344
## [19,] -1.32035875 -0.1513769
## [20,] -6.29053429 -8.1642739
## [21,] -1.01985678  0.5352108
## [22,] -0.27658878  0.2977353
## [23,]  0.27678418  0.2170440
## [24,]  0.11880266  0.2487930
## [25,] -0.35410886 -1.0930109
## [26,]  0.21821623 -0.5951547
## [27,]  0.33884808 -0.1287089
## [28,]  0.61548557  1.5411494
## [29,] -0.21555810  0.1195182
## [30,] -2.24250180 -1.4846365

The seventh stratum had no observations, so that call to lm failed, but the loop continued because we ‘caught’ the error with try. In this example, we could have checked the sample size for the subset before doing the regression, but in other contexts, we may not have an easy way to check in advance whether the function call will fail.