It’s come up a few times in my career working with lakes that I’ve been asked to visualize the results of a sonde-based sampling program. These programs (and the data collected by them) are common, and it’s easy to see why: taking a boat to one of the deeper places in a lake and recording the temperature, oxygen saturation, and pH from a multi-parameter sonde is a relatively easy and engaging way to collect data. Automated multi-parameter sonde stations mounted on rafts are becoming more and more common as well, collecting vast amounts of high-resolution water quality data with a depth component.

An excellent example of such data is that of the Lake Champlain Long-Term monitoring program - a joint effort between the states of Vermont and New York. The data I’ll use for this post was collected between 1992 and 2018 from Mallett’s Bay (where I grew up learning to sail!). To make it easier to see what’s going on, I’ll only use data from 1993 for the first exercise. As usual, I’ll use the tidyverse, and to help with date processing I’ll use the lubridate package.

```
library(tidyverse)
library(lubridate)
sonde_tbl_1993
```

```
## # A tibble: 252 x 3
## date depth temp
## <date> <dbl> <dbl>
## 1 1993-05-13 1 9.68
## 2 1993-05-13 2 9.63
## 3 1993-05-13 3 9.49
## 4 1993-05-13 4 9.22
## 5 1993-05-13 5 9.14
## 6 1993-05-13 6 8.91
## 7 1993-05-13 7 8.9
## 8 1993-05-13 8 8.85
## 9 1993-05-13 9 8.78
## 10 1993-05-13 10 8.73
## # … with 242 more rows
```

The best way to start visualizing this type of data is to use points, using some other aesthetic (colour is probably best) to visualize the variability in temperature. I used `scale_colour_gradient2()`

instead of the default colour scale because it allowed for a more intuitive reading of the graphic (red is hot, blue is cold), and using a middle value helps show the depth of the thermocline deepening as the year progresses.

```
ggplot(sonde_tbl_1993, aes(x = date, y = depth, colour = temp)) +
geom_point() +
scale_y_reverse() +
scale_colour_gradient2(
midpoint = 15,
high = scales::muted("red"),
low = scales::muted("blue")
)
```

This type of visualization is excellent for many reasons. First, it makes no assumptions…the points represent was was measured and no more. It’s also a good way to visualize what was done: without reading any description of the sampling program, we can see that measurements were taken for all depths every two weeks from May to October. Usually a value was collected at 1 m depth intervals, but later in the year values were collected at 1 m intervals above the thermocline, below which values were only collected every 5 m.

However, visualizing these measurements as points is not ideal. It’s difficult to see the deepening of the thermocline over time, and difficult to estimate what the temperature or thermal structure might be at a specific date (that might not have been measured directly). A common way of visualizing this is to “fill in” the missing pieces mathematically, creating a heatmap:

At its heart, this graphic is just a `ggplot()`

with a `geom_raster()`

layer. The complication is, `geom_raster()`

requires equally spaced points in both the x- and y-direction. The rest of this post is about how to make that happen.

### Linear interpolation

If you had the foresight to design a sampling program with exactly equal numbers of days between samplings (and the good luck/determination to carry it out perfectly), you can skip the interpolation section and head straight for the plotting. However, even if you did collect perfect data (in-situ sensor data is likely to be nearly perfect), you might want it at higher resolution, or you might want to fill in values that are missing due to sensor errors. To make that happen, we need to interpolate.

I interpolate in the depth direction first, because I think this is the better assumption: as you go down in depth, a reasonable way to estimate the temperature at a depth which you did not measure is to draw a straight line between the temperatures at the depths that you did measure.

In the above figure I’ve done this for one sampling event (the one on August 8, 1993), but to make the figure we need to do this for every sampling date. A good way to go about this is to make a function that will estimate the temperature at a given depth for a given sampling date. The approach I’ve taken is to filter our original data to only contain values for that date, then use `approx(known_x, known_y, xout = new_x)$y`

to do the interpolation. Here `target_date`

has to be a single value, but `target_depth`

can be a vector of depths, which we’ll see is useful in a moment.

```
estimate_temp_by_date <- function(target_date, target_depth) {
data_for_date <- sonde_tbl_1993 %>%
filter(date == target_date) %>%
arrange(depth)
# approx() is one way to do a linear interpolation
approx(data_for_date$depth, data_for_date$temp, xout = target_depth)$y
}
estimate_temp_by_date(ymd("1993-05-13"), c(0, 1, 1.5, 2))
```

```
## [1] NA 9.680 9.655 9.630
```

Notice here that the function returns `NA`

for values outside the range measured (there was never any measurement above 1 m depth), returns the value that was measured for depths that are in `sonde_tbl_1993`

, and interpolates if the requested depth is between two other measured depths.

Scaling this up to the entire dataset can be done a few ways, but I like the grouped mutate method the best. First, we need to create a tibble containing all the dates and depths for which we want to compute a value. Because of how we’ve written our estimator function, the dates have to be dates that are in `sonde_tbl_1993`

, but depths can be anything we want. Here I’ve chosen to create a sequence of depths from 1 to 20 with 100 equally-spaced values in between. This is helpful for creating a figure, because you can increase `length.out`

if the figure looks pixelly. I’ve used `crossing()`

to do this…for those familiar with base R, `expand.grid(..., stringsAsFactors = FALSE)`

does something similar (I prefer `crossing()`

because forgetting `stringsAsFactors = FALSE`

can lead to some hard-to-track-down errors).

Second, we need to compute the temperature values. Most people familiar with dplyr know the pattern of `group_by()`

(assign groups to unique combinations of one or more variables) and `summarise()`

(compute summary values for each group), but the `group_by() %>% mutate()`

pattern is useful when a column computation depends on a grouped summary value. In this case, we need to pass a different `target_date`

(of length 1) to `estimate_temp_by_date()`

for each sampling date.

```
temp_interp_depth <- crossing(
# the same dates as sonde_tbl_1993
tibble(date = unique(sonde_tbl_1993$date)),
# depths can now be any value
tibble(depth = seq(1, 20, length.out = 100))
) %>%
group_by(date) %>%
mutate(temp = estimate_temp_by_date(date[1], depth))
```

Now we have a tibble that contains equally-spaced `depth`

values, but unequally-spaced `date`

values:

To interpolate in the date dimension, we can use a similar process. First, we write a function that estimates the temperature at any date given a depth that is in `temp_interp_depth`

(the tibble we just calculated). This assumption becomes more and more spurious as the time between sampling events increases, but I think it’s still the best way to process this type of data.

```
# create a function that will, given a depth, estimate the temp on any given day
estimate_temp_by_depth <- function(target_depth, target_date) {
data_for_depth <- temp_interp_depth %>%
filter(depth == target_depth) %>%
arrange(date)
approx(data_for_depth$date, data_for_depth$temp, xout = target_date)$y
}
estimate_temp_by_depth(
target_depth = 1,
target_date = seq(ymd("1993-05-12"), ymd("1993-05-15"), by = 1)
)
```

```
## [1] NA 9.680000 9.905385 10.130769
```

Next, we create a tibble of the values we want using `crossing()`

, then use a grouped mutate (grouping by `depth`

this time) to estimate the temperature at all depth/date combinations.

```
temp_raster <- crossing(
# dates can now be any value
tibble(date = seq(ymd("1993-05-13"), ymd("1993-10-06"), by = 1)),
# depths must be the same as in temp_interp_depth
tibble(depth = unique(temp_interp_depth$depth))
) %>%
group_by(depth) %>%
mutate(temp = estimate_temp_by_depth(depth[1], date))
```

Finally, we have equally-spaced values in both the date and depth dimensions. This can be visualized using `geom_raster()`

, with `temp`

mapped to the `fill`

aesthetic. I’ve again used `scale_fill_gradient2()`

to ensure that red values represent “hot”, blue values represent “cool”, and that there is some way to visualize the depth of the thermocline. Finally, I’ve used `coord_cartesian(expand = FALSE)`

to eliminate the white border around the outside of the raster layer…I think it looks nicer that way.

```
ggplot(temp_raster, aes(date, depth, fill = temp)) +
geom_raster() +
scale_y_reverse() +
scale_fill_gradient2(
midpoint = 15,
high = scales::muted("red"),
low = scales::muted("blue")
) +
coord_cartesian(expand = FALSE)
```

And bingo! We’ve got a heatmap (quite literally representing heat in this case). It’s worth noting that there are still some parts of the figure that look “boxy”, which is a result of interpolating in the date dimension (a lot of variability in temperature is probably occurring in the two weeks between sampling events that isn’t well-characterized by the linear interpolation).

### Fitting a 2D smoother

There are many instances in which the linear interpolation method won’t work. The most common reason I can think of is when you have more than one season of data that you want to summarise into a single figure. For example, the dataset in this post includes 304 sampling events from 1992 to 2018:

```
sonde_tbl %>%
summarise(min(date), max(date), n_distinct(date))
```

```
## # A tibble: 1 x 3
## `min(date)` `max(date)` `n_distinct(date)`
## <date> <date> <int>
## 1 1992-06-23 2018-10-08 304
```

Plotting this one one figure requires some manipulation beforehand. In particular, we need a way to get dates from any year on to the same axis. I’ll do this by calculating the `day_in_year`

. `lubridate::floor_date()`

is very handy for this. Next, I’ll pretend that all the dates are in a single year (purely for plotting purposes).

```
sonde_tbl_yearless <- sonde_tbl %>%
mutate(
day_in_year = as.numeric(date - floor_date(date, unit = "years"), unit = "days"),
date_label = ymd("2019-01-01") + day_in_year
)
```

In this case, interpolating by date makes no sense: what happened on June 1 in 1993 has nothing to do with what happened on June 2 in 2018, yet on the figure these two values plot next to each other. Yet to make a proper heatmap we still need to estimate the temperature at equally-spaced values in the depth and date dimensions.

We can calculate these using a 2D smoother. I’m going to use the `loess()`

smoother here, but a GAM fit using `mgcv::gam()`

is probably more appropriate in many cases. Both have inherent assumptions, notably that the data is in some sense, “smooth”. From the above plot I think it’s a reasonable assumption: there is a similar relationship between years (some years being slightly different than others in the timing and magnitude of warming). An assumption that is unlikely to be true is that of independence: values at these points are *very* dependent on the values at other data points (e.g., the ones above, below, before, and after them). This shouldn’t keep you from using a smoother to create an informative graphic, however you should never do any kind of statistical test on the smooth.

The first step (for any type of smoother) is to compute and evaluate the fit. For a GAM, you should use the `mgcv::gam.check()`

function to make sure you GAM isn’t terribly misguided…for all smoothers, you will have to graphically compare the points plot that you created with the smooth to see if the smooth is a good representation of an “average” year. You will have to fiddle with the options…for `loess()`

this will mean adjusting the `span`

parameter; for a GAM you’ll have to adjust (at least) the value of `k`

. From some trial and error, a `loess()`

smooth with a `span`

of 0.2 seems reasonable to me.

```
smooth_fit <- loess(
temp ~ day_in_year + depth,
data = sonde_tbl_yearless,
span = 0.2
)
```

Then we need to (1) create a tibble with equally-spaced date and depth values and (2) estimate the temperature at each date/depth combination. I use the same approach here as above for step 1 (`crossing()`

); for step 2 I use the `predict()`

function with the model fit as the first argument and a `tibble()`

of “new data” as the second. This will work for most smooth fit objects as long as the columns in `newdata`

are named identically to the ones you used to fit the model.

```
temp_raster_smooth <- crossing(
tibble(date = seq(ymd("2019-04-02"), ymd("2019-10-27"), by = 1)),
tibble(depth = seq(1, 20, length.out = 100))
) %>%
mutate(
day_in_year = as.numeric(date - floor_date(date, unit = "years"), unit = "days"),
temp = predict(
smooth_fit,
newdata = tibble(day_in_year = day_in_year, depth = depth)
)
)
```

Now we have a tibble with equally-spaced date and depth values, so we can use similar ggplot2 code to plot it!

I think this is a good representation of the dataset as a whole: on an average year, Mallett’s Bay begins warming at all depths in early May, developing a warm mixed layer in early June. This warm mixed layer becomes progressively thicker over the course of the year, until the surface temperature begins decreasing in September. By October, the entire water column has mixed and cooled considerably.

As somebody who has been swimming in Mallett’s Bay for over 25 years, I can unequivocally vouch for this story.