Life as a grad student, as many of you already know, is a constant battle to complete assignments on time, keep your grades up for future funding opportunities, and get enough sleep. My first semester in the Master of Spatial Analysis program is nearing its end in what seems like the shortest semester of school I've ever taken. During this time I have had the pleasure of strengthening my skills in the R language for statistical computing for the purpose of water quality trend analysis (SA8904 — GIS Project Management). This post will show how I have used R for water quality trend analysis, using publicly available data from the USGS (an NDA prohibits me from sharing my actual school work).
Download
I retrieved my data from the USGS open data portal for water quality analysis. You can find your own, but I was specifically looking for a site that would make a good example for this blog. For the following examples, a "good" site would be one that reported often, regularly, and for a reasonable length of time, which allows the non-parametric tests to perform as expected. Sites exhibiting large temporal gaps and anomalies (outliers) were avoided. There are many other criteria to consider when performing water quality trend analysis in a research setting, and this post is not aimed at providing that level of detail or guidance (there is plenty of literature on the subject).
If you want to follow along with me, you can practice navigating the USGS portal and locate station: USGS 07227500 Canadian Rv nr Amarillo, TX. Alternatively, you can download the raw data as I retrieved it on 2013-11-22 here: gist.github.com/MichaelMarkieta/7610033/raw/07227500.txt (right-click > save as…) and open it in your favourite spreadsheet software (note: TAB-delimited). As you can see, this water quality station in Texas has been reporting since the late 1940s and, I believe, continues to report to this day. I devised the following examples by looking at one water quality parameter:
# 70303 — Dissolved solids, water, filtered, tons per acre-foot
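If you would rather not leave R at all, here is a minimal sketch for grabbing the file directly (assuming the gist URL above is still live):

# Fetch the raw TAB-delimited USGS export from the gist linked above
download.file("https://gist.github.com/MichaelMarkieta/7610033/raw/07227500.txt",
              destfile = "07227500.txt")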
Filter
Using a pivot table in Excel, I extracted the sample dates (sample_dt) and test result values (result_va) for dissolved solids (70303) at this water quality reporting station. The exact definition of dissolved solids varies, but Health Canada provides a thorough description. While the presence, increase, or variance of dissolved solids in our drinking water is not of particular concern, it does have a direct influence on water palatability (taste) and pipe scaling. The resulting pivot table has 823 records of dissolved solids measurements (gist.github.com/MichaelMarkieta/7610033/raw/07227500_70303.txt).
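If you prefer to stay in R for this step too, the sketch below does roughly what the pivot table did. It leans on my reading of the USGS RDB layout (comment lines starting with #, a column-format row directly under the header, and columns named sample_dt, parm_cd, and result_va), so double-check the column names in your own download:

# Read the raw TAB-delimited file, skipping the "#" comment lines
raw <- read.delim("07227500.txt", comment.char = "#", stringsAsFactors = FALSE)
raw <- raw[-1, ]  # drop the RDB column-format row (e.g. "5s 15d ...")
# Keep the sample dates and result values for parameter 70303 only
ds <- raw[raw$parm_cd == "70303", c("sample_dt", "result_va")]
names(ds) <- c("Date", "Result")
ds$Result <- as.numeric(ds$Result)       # values arrive as text
ds <- ds[!is.na(ds$Result), ]            # drop empty results
ds$Date <- format(as.Date(ds$Date), "%m/%d/%Y")  # match the date format used below
write.csv(ds, "07227500_70303.csv", row.names = FALSE)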
Remove Outliers
If you were to sort the 823 records by the result value, you would immediately see that there are 2 records that are extreme outliers. Using R we can confirm this anomaly in our data by testing for skewness using the moments package and the boxplot function, like so:
> library(moments)
> data <- read.csv("07227500_70303.csv")
> skewness(data[,2])
[1] 18.65909
> boxplot(data[,2])
Those 2 pesky outliers can safely be assumed to be anomalies (contaminated samples, errors produced by the equipment, etc.) and can be removed from the data set:
Date,Result
3/9/1958,89
6/1/1961,67
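Rather than deleting those two rows by hand in a spreadsheet, a couple of lines in R do the same job (the values 89 and 67, and the Date/Result headers, are taken straight from the two records above):

# Drop the two anomalous records identified above and save a cleaned copy
data <- read.csv("07227500_70303.csv")
clean <- data[!(data$Result %in% c(89, 67)), ]
write.csv(clean, "07227500_70303_clean.csv", row.names = FALSE)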
The cleaned version can be retrieved here: gist.github.com/MichaelMarkieta/7610033/raw/07227500_70303_clean.txt. For good measure, check the skewness and boxplot in R after removing those two values. The remaining outliers reflect the nature of the data and sit within a reasonable distance of the mean result values.
> data <- read.csv("07227500_70303_clean.csv")
> skewness(data[,2])
[1] 1.319998
> boxplot(data[,2])
Analysis
Now we can use the Mann-Kendall trend test on our water quality data to check if there is a monotonic trend (a gradual and steady change over time). Here we use the Kendall package and the MannKendall function to determine if a trend exists in a vector of temporal data. At this point, our data is just a plain data frame read in from the csv on our hard drive. We must convert it into a temporal vector of data using the xts package before running our MannKendall function. All together:
> library(xts)
> library(Kendall)
> data <- read.csv("07227500_70303_clean.csv")
> data_xts <- xts(data[,2], as.Date(data[,1], "%m/%d/%Y"))
> MannKendall(data_xts)
tau = 0.227, 2-sided pvalue =< 2.22e-16
With this bit of code we've determined that there is a significant (p-value < 2.22e-16) monotonic trend of increasing (tau = 0.227) dissolved solids over time at the "USGS 07227500 Canadian Rv nr Amarillo, TX" water quality reporting station. We can visualize this increasing trend using a LOWESS line fitted to the data on a plot in R. As we can see from the plot, the dissolved solids result value is not only increasing over time but also fanning out. Of course, this method could be extended for any water quality parameter; detecting seasonal variation (a rough sketch of one approach follows the plotting code below); identifying shifts in trends (positive to negative); and so on.
> x <- time(data_xts)
> y <- coredata(data_xts)[,1]
> plot(x, y)
> lines(lowess(x, y, f=0.5), lwd=2, col=2)
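On the seasonal variation point: the Kendall package also provides a SeasonalMannKendall function for a proper seasonal test, but it expects a regular ts object, which our irregularly sampled series is not. As a quick, hedged stand-in, the sketch below simply runs MannKendall on each calendar month's subseries (it assumes data_xts from the code above):

# Crude seasonal check: compare January samples across years, February
# samples across years, and so on, reporting one tau per calendar month
months <- format(index(data_xts), "%m")
taus <- sapply(split(coredata(data_xts)[, 1], months),
               function(v) MannKendall(v)$tau)
round(taus, 2)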
Resources
Here is a list of R packages that are useful to have for water quality testing, as well as other statistical computation and visualization, starting with the three used in this post:

moments: skewness, kurtosis, and other descriptive statistics
xts: extensible time series objects for temporal data
Kendall: Mann-Kendall and related rank-based trend tests
If this post helped you and you enjoy my site, I would happily accept Litecoin donations:
LKPfT772e9HxvXYcA8LVDctTmENoqQxQF3