Chapter 6 Combining trait, weather, and image datasets

The objective of this vignette is to walk through how to combine our several types of data, and demonstrate several realistic analyses that can be done on these merged data.

For the first analysis, we want to figure out how the number of sufficiently warm days affects the amount of canopy cover at our site. We do this by combining the canopy cover data with the meteorological data on growing degree days, then modeling and plotting their relationship. We are specifically interested in figuring out when the increase in canopy cover starts to slow down in response to warm temperature days.

The second analysis compares greenness from image data with canopy cover.

6.1 Get and join data

Here we combine two dataframes. The first contains all the canopy cover values for 2018, which was created in the traits vignette. The second is the cumulative growing degree days for all of 2018, which were calculated from the daily minimum and maximum temperatures in the weather vignette. They are combined by their common column, the date.

library(dplyr)
library(ggplot2)
library(jsonlite)
library(lubridate)
library(traits)
library(sf)
library(stringr)
options(betydb_url = "https://terraref.ncsa.illinois.edu/bety/",
        betydb_api_version = 'beta', 
        betydb_key = '9999999999999999999999999999999999999999')

trait_canopy_cover <- betydb_query(table     = "search", 
                      trait     = "canopy_cover", 
                      date      = "~2018",
                      limit     =  "none")
trait_canopy_cover_day = trait_canopy_cover %>% 
  mutate(trans_date = with_tz(ymd_hms(raw_date), "America/Phoenix"), 
         day = as.Date(raw_date))

weather <- fromJSON('https://terraref.ncsa.illinois.edu/clowder/api/geostreams/datapoints?stream_id=46431&since=2018-01-01&until=2018-12-31', flatten = FALSE)
weather <- weather$properties %>% 
  mutate(time = with_tz(ymd_hms(weather$end_time), "America/Phoenix"))
daily_values = weather %>% 
  mutate(day = as.Date(time), 
         air_temp_converted = air_temperature - 273.15) %>% 
  group_by(day) %>% 
  summarise(min_temp = min(air_temp_converted), 
            max_temp = max(air_temp_converted), 
            gdd = ifelse(sum(min_temp, max_temp) / 2 > 10, 
                         (max_temp + min_temp) / 2 - 10, 0))
daily_values <- daily_values %>% 
  mutate(gdd_cum = cumsum(gdd))

trait_weather_df <- full_join(trait_canopy_cover_day, daily_values, by = "day") %>% 
  select(day, cultivar, mean, gdd_cum) %>% 
  na.omit()

6.2 Plot and model relationship between GDD and canopy cover for each cultivar

We are interested in how growing degree days affects canopy cover. To investigate this, we are going to model and plot their relationship. We are using a logistic growth model here because it is appropriate for the shape of the GDD-cover relationship.

The logistic growth model is specified as

\[y = \frac{c}{1+e^{a + b * \textrm{x}}}\]

where \(y\) is the response variable canopy cover, \(x\) is the predictor growing degree days, \(c\) is the asymptote or maximum canopy cover, \(a\) is the initial value for canopy cover, and \(b\) is the steepness of the curve. (reference)

We want to know the relationship for each cultivar, so we’ll start of by determining the parameters of the model for one of the cultivars in our dataset. We provide estimated values for the asymptote \(c\) and initial canopy cover value \(a\), and provide canopy cover \(y\) with corresponding growing degree days \(x\) for one measurement of the chosen cultivar.

The below provides better estimates for the \(c\), \(a\), and \(b\) parameters, which are used to plot the model as an orange line on top of the black points which are actual values.

single_cultivar <- trait_weather_df %>% 
  filter(cultivar == "PI656026")

c <- 90
a <- 0.1
y <- single_cultivar$mean[3]
g <- single_cultivar$gdd_cum[3]
b <- ((log((c/y) - 1)) - a)/g
model_single_cultivar <- nls(mean ~ c / (1 + exp(a + b * gdd_cum)), 
                             start = list(c = c, a = a, b = b),
                             data = single_cultivar)
summary(model_single_cultivar)
coef(model_single_cultivar)

single_c <- coef(model_single_cultivar)[1]
single_a <- coef(model_single_cultivar)[2]
single_b <- coef(model_single_cultivar)[3]

single_cultivar <- single_cultivar %>% 
  mutate(mean_predict = single_c / (1 + exp(single_a + single_b * gdd_cum)))
ggplot(single_cultivar) +
  geom_point(aes(x = gdd_cum, y = mean)) +
  geom_line(aes(x = gdd_cum, y = mean_predict), color = "orange") +
  labs(x = "Cumulative growing degree days", y = "Canopy Height")

We then calculate the inflection point for this cultivar’s model.

The maximum growth rate is the change in canopy cover per day at the rate of maximum growth. The growing degree day at which maximum growth is obtained is called the inflection point. This occurs near the midpoint of the y-axis, or \(\frac{c - a}{2}\).

inf_y <- (as.numeric(single_c) - as.numeric(single_a)) / 2
inf_x <- ((log((as.numeric(single_c) / inf_y) - 1)) - as.numeric(single_a)) / as.numeric(single_b)

ggplot(single_cultivar) +
  geom_point(aes(x = gdd_cum, y = mean)) +
  geom_line(aes(x = gdd_cum, y = mean_predict), color = "orange") +
  geom_hline(yintercept = inf_y, linetype = "dashed") +
  geom_vline(xintercept = inf_x) +
  labs(x = "Cumulative growing degree days", y = "Canopy Height")

We then use the parameters from a single cultivar to run a model for each of the rest of the cultivars. These results are used to plot the model predictions, which are shown as an orange line. We also calculated the inflection point from each cultivar’s model, which will be used in the following section.

all_cultivars <- c(day = as.double(), cultivar = as.character(), mean = as.numeric(), 
                   gdd_cum = as.numeric(), mean_predict = as.numeric(), 
                   inf_y = as.numeric(), inf_x = as.numeric())

for(each_cultivar in unique(trait_weather_df$cultivar)){
  each_cultivar_df <- filter(trait_weather_df, cultivar == each_cultivar)
  each_cultivar_model <- nls(mean ~ c / (1 + exp(a + b * gdd_cum)), 
                             start = list(c = c, a = a, b = b), 
                             data = each_cultivar_df)
  model_c <- coef(each_cultivar_model)[1]
  model_a <- coef(each_cultivar_model)[2]
  model_b <- coef(each_cultivar_model)[3]
  each_cultivar_df <- each_cultivar_df %>% 
    mutate(mean_predict = model_c / (1 + exp(model_a + model_b * gdd_cum)), 
           inf_y = (as.numeric(model_c) - as.numeric(model_a)) / 2, 
           inf_x = ((log((as.numeric(model_c) / inf_y) - 1)) - 
                      as.numeric(single_a)) / as.numeric(single_b))
  all_cultivars <- rbind(each_cultivar_df, all_cultivars)
}

ggplot(all_cultivars) +
  geom_point(aes(x = gdd_cum, y = mean)) +
  geom_line(aes(x = gdd_cum, y = mean_predict), color = "orange") +
  facet_wrap(~cultivar, scales = "free_y") +
  geom_hline(yintercept = inf_y, linetype = "dashed") +
  geom_vline(xintercept = inf_x) +
  labs(x = "Cumulative growing degree days", y = "Canopy Height")

6.3 Create histogram of growth rate for all cultivars

The last thing that we are going to do is assess the difference in this relationship among the cultivars. We are going to use the inflection point from the logistic growth model, which indicates when canopy cover stops increasing as quickly with increasingly more warm days. The resulting inflection points for each cultivar are plotted as a histogram.

ggplot(data.frame(inf_points = unique(all_cultivars$inf_x))) +
  geom_histogram(aes(x = inf_points), bins = 300) +
  xlim(min(all_cultivars$gdd_cum), max(all_cultivars$gdd_cum)) +
  labs(x = "Inflection points", y = "Number")

6.4 Get image data

In this example we will extract our plot data from a series of images taken in May of Season 6, measure its “greeness” annd plot that against the plant heights from above in this vignette.

The chosen statistic here is the normalised green-red difference index, \(\textrm{NGRDI}=\frac{R-G}/{R+G}\) (Rasmussen et al., 2016), which uses the red and green bands from the image raster.

Below we retrieve all the available plots for a particular date, then find and convert the plot boundary JSON into tuples. We will use these tuples to extract the data for our plot.

# Making the query for our site
sites <- betydb_query(table     = "sites",  
                      sitename  = "MAC Field Scanner Season 6 Range 19 Column 1")

# Assigning the geometry of the site (GeoJSON format)
site.geom <- sites$geometry

# Convert the polygon to something we can clip with. CRS value represents WGS84 Lat/Long
site.shape <- st_as_sfc(site.geom,crs = 4326)
site.poly <- st_cast(site.shape, "POINT")
site.clip <- as(site.poly,"Spatial")

These are the names of the full field RGB data for the month of May. We will be extracting our plot data from these files. A compressed file containing these images can be found on Clowder. The code below downloads the image files into a .zip file, which takes a few minutes, and then unzips that file so the image files are accessible.

if(!file.exists("rgb_images.zip")){
  download.file("https://terraref.ncsa.illinois.edu/clowder/files/5c8175874f0c78f6486d6870/blob", destfile = "rgb_images.zip")
  unzip("rgb_images.zip", exdir = ".")
}

We will loop through these images, extract our plot data, and calculate the “greeness” of each extract. We are using the name of the file to extract the date for later.

library(raster)

# Get file paths for all image files
image_files <- list.files(".", pattern = "*.tif")
image_files_paths <- file.path(".", image_files)

# Extract the date from the file name
getDate <- function(file_name){
  date <- str_match_all(file_name, '[0-9]{4}-[0-9]{2}-[0-9]{2}')[[1]][,1]
  return(date)
}

# Returns the greeness value of the plot in the specified file
getGreeness <- function(file_name, clip_coords){
  
  band_image_red   <- raster(file_name, band = 1)
  red_crop         <- crop(band_image_red, clip_coords)

  band_image_green <- raster(file_name, band = 2)
  green_crop       <- crop(band_image_green, clip_coords)

  add_rasters      <- green_crop + red_crop
  numerator        <- cellStats(add_rasters, stat = "sum")

  subtract_rasters <- green_crop - red_crop
  denominator      <- cellStats(subtract_rasters, stat = "sum")

  greeness         <- numerator / denominator

  return(greeness)
}

# Extract all the dates from the images
date <- sapply(image_files_paths, getDate, USE.NAMES = FALSE)

# Extract all the greeness for the plot
greeness <- sapply(image_files_paths, getGreeness, clip_coords=site.clip, USE.NAMES = FALSE)

# Build the final day and greeness
greenness_df <- data.frame(date, greeness) %>% 
  as_tibble() %>% 
  mutate(day = as.Date(date))

We then pull in the canopy data for our charting purposes.

trait_canopy_cover <- betydb_query(table     = "search", 
                                   trait     = "canopy_cover", 
                                   date      = "~2018 May",
                                   limit     =  "none")
                       
trait_canopy_cover_day <- trait_canopy_cover %>% 
  mutate(trans_date = with_tz(ymd_hms(raw_date), "America/Phoenix"), 
         day = as.Date(raw_date))

We now need to add the height data to the data set to plot.

We then determine the average canopy cover across the site for the day that the sensor data were collected. The relationship between our greenness metric and average canopy cover are plotted.

trait_canopy_cover_daily <- trait_canopy_cover_day %>% 
  filter(day %in% greenness_df$day) %>% 
  group_by(day) %>% 
  summarise(mean_canopy_cover = mean(mean), 
            sd_canopy_cover = sd(mean))
sensor_trait_df <- left_join(trait_canopy_cover_daily, greenness_df, by = "day")

ggplot(sensor_trait_df, aes(x = mean_canopy_cover, y = greeness)) +
  geom_point()