# DATA SOURCE: Case study on Soil Moisture

The website Awesome public datasets contains links to several hundred datasets. This example deals with a randomly selected one, the “Hyperspectral benchmark dataset on soil moisture”.

That dataset is described as a sequence of measurements of a soil sample, recording soil temperature, soil moisture, and the intensity in each of 125 spectral bands (wavelengths in nm) of an image of the sample.

## Process

1. Uncompress the ZIP file.

This results in a directory with a CSV file, a README.txt, and a license file.

1. Use the “Import Dataset” tool in the Environment tab to read in the data. The tool constructs a command, which is used in the following chunk.
library(readr)
# "…" stands for the path to the unzipped CSV file, which is not shown here
Soil <-
  read_csv("…",
           col_types = cols(datetime = col_datetime(format = "%Y-%m-%d %H:%M:%S")))

Notice the datetime format string used in the command. When working with time data, it helps to know how such a string tells the parser to convert text timestamps into a machine-readable datetime.
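As a small illustration of what the format string does, here is a sketch in base R that parses a single timestamp written in this dataset's layout. It uses `as.POSIXct()` rather than the `readr` parser from the import command, and the `tz = "UTC"` setting is an assumption made for the example.

```r
# Parse one timestamp using the same format string as in the import command.
stamp <- as.POSIXct("2017-05-16 11:26:07",
                    format = "%Y-%m-%d %H:%M:%S",
                    tz = "UTC")

class(stamp)            # "POSIXct" "POSIXt"
as.numeric(stamp)       # stored internally as seconds since 1970-01-01 UTC
format(stamp, "%H:%M")  # back to text, keeping only hours and minutes: "11:26"
```

Each `%` code names one field of the text: `%Y` the four-digit year, `%m` the month, `%d` the day, and `%H`, `%M`, `%S` the hours, minutes, and seconds.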

Looking at the Soil data frame …

nrow(Soil)
## [1] 679
head(names(Soil))
## [1] "index"            "datetime"         "soil_moisture"
## [4] "soil_temperature" "454"              "458"
range(Soil$datetime)
## [1] "2017-05-16 11:26:07 UTC" "2017-05-26 14:08:10 UTC"

So we have slightly more than 10 days of measurements.
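The length of the measurement period can be checked with a short calculation. This sketch types in the two endpoints reported by `range()` by hand; with the data loaded, `diff(range(Soil$datetime))` should give the same span.

```r
# First and last timestamps, copied from the range() output
first <- as.POSIXct("2017-05-16 11:26:07", tz = "UTC")
last  <- as.POSIXct("2017-05-26 14:08:10", tz = "UTC")

# Elapsed time between the two, expressed in days
span <- difftime(last, first, units = "days")
span
```

The result is about 10.1 days.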

## What to use the data for

The data were not gathered specifically for the purpose of teaching statistics, so there is no documentation that we can draw on to decide how to fit this dataset into a course. Creativity is required. Some ideas:

1. Telling a story with simple graphics. Look at the time series of soil temperature and of soil moisture.
1. How much of the variation in moisture is accounted for by temperature?
library(ggformula)
library(dplyr)
gf_point(soil_temperature ~ datetime, data = Soil)

gf_point(soil_moisture ~ datetime, data = Soil)
1. What time of day were the measurements taken?
Soil %>%
  mutate(time = lubridate::hour(datetime) + lubridate::minute(datetime) / 60) %>%
  # plot from the piped data frame, which contains the computed time column
  gf_jitter(time ~ 1, width = 0.2) %>%
  gf_violin(fill = "blue", alpha = 0.4, color = NA) %>%
  gf_lims(x = c(0, 2))
1. Soil moisture as a function of time of day
Soil %>%
  mutate(time = lubridate::hour(datetime) + lubridate::minute(datetime) / 60,
         day = as.character(lubridate::mday(datetime))) %>%
  # plot from the piped data frame, which contains the time and day columns
  gf_point(soil_moisture ~ time, color = ~ day)

2. Simple analysis. Is soil moisture a function of temperature?
1. Is there a correlation?
gf_point(soil_moisture ~ soil_temperature, data = Soil)
1. Illuminating the pattern.
Soil %>%
  mutate(time = lubridate::hour(datetime) + lubridate::minute(datetime) / 60,
         day = as.character(lubridate::mday(datetime))) %>%
  gf_path(soil_moisture ~ soil_temperature, color = ~ day)
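Two quantities from the list above can also be computed numerically: the time of day as a decimal hour, and the share of the variation in moisture accounted for by temperature. The following is a minimal sketch in base R, not part of the original analysis. The helper names `hour_of_day()` and `moisture_r_squared()` are hypothetical, and the second helper assumes that a straight-line fit is an adequate summary, which the path plot may well contradict.

```r
# Decimal hour of day (e.g. 14.5 for 14:30) from a POSIXct vector.
# as.POSIXlt() keeps the time zone stored with the input.
hour_of_day <- function(datetime) {
  parts <- as.POSIXlt(datetime)
  parts$hour + parts$min / 60
}

# Share of the variance in soil_moisture accounted for by a straight-line
# fit on soil_temperature (the R-squared of a simple linear regression).
moisture_r_squared <- function(df) {
  fit <- lm(soil_moisture ~ soil_temperature, data = df)
  summary(fit)$r.squared
}

# Usage, once Soil has been read in:
# hour_of_day(Soil$datetime)
# moisture_r_squared(Soil)
# cor(Soil$soil_moisture, Soil$soil_temperature)^2  # same value as above
```

For a regression with a single predictor, R-squared equals the squared correlation, so the last two usage lines should agree.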

Perhaps we can use the spectral measures to read soil moisture?

library(rpart)
library(rpart.plot)
library(randomForest)
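# Column names such as "454" are awkward in model formulas, so prefix them with "v"
newnames <- function(df) {
  orig <- names(df)
  # parse_number() gives NA (with a warning) for names that contain no number
  new <- ifelse(is.na(parse_number(orig)), orig, paste0("v", orig))
  return(new)
}
names(Soil) <- newnames(Soil)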
## Warning: 4 parsing failures.
## row col expected           actual
##   1  -- a number index
##   2  -- a number datetime
##   3  -- a number soil_moisture
##   4  -- a number soil_temperature
mod1 <-
  Soil %>%
  select(-index, -datetime) %>%
  randomForest(soil_moisture ~ . - soil_temperature, data = .)
Tmp <- importance(mod1)
Res <- tibble(score = Tmp[, 1], wavelength = row.names(Tmp)) %>%
  arrange(desc(score))
head(Res)
## # A tibble: 6 x 2
##   score wavelength
##   <dbl> <chr>
## 1 1001. v950
## 2  981. v946
## 3  549. v942
## 4  491. v830
## 5  355. v454
## 6  309. v826
gf_point(soil_moisture ~ v950, data = Soil)

gf_point(v950 ~ v454, color = ~ soil_moisture, data = Soil)
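The renaming step earlier relies on `parse_number()`, which is why the four non-numeric column names produce parsing warnings. A possible warning-free variant is sketched below in base R. It assumes that the spectral columns are exactly those whose names consist only of digits. The helper names `prefix_numeric_names()` and `wavelength_nm()` are hypothetical.

```r
# Prefix purely numeric column names (e.g. "454") with "v";
# leave all other names untouched.
prefix_numeric_names <- function(nms) {
  is_numeric <- grepl("^[0-9]+$", nms)
  ifelse(is_numeric, paste0("v", nms), nms)
}

# Recover the wavelength in nm from a prefixed name such as "v950",
# e.g. to plot importance scores against wavelength.
wavelength_nm <- function(nms) {
  as.numeric(sub("^v", "", nms))
}

prefix_numeric_names(c("index", "soil_moisture", "454", "458"))
wavelength_nm(c("v950", "v946"))
```

With the data loaded, `names(Soil) <- prefix_numeric_names(names(Soil))` would take the place of the `newnames()` call.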