Skip to main content
Open access
Technical Papers
Jul 14, 2017

Estimation of the Variogram Using Kendall’s Tau for a Robust Geostatistical Interpolation

Publication: Journal of Hydrologic Engineering
Volume 22, Issue 9

Abstract

The estimation of an appropriate variogram is a crucial step toward the description of spatial dependence, the geostatistical interpolation of environmental variables, and the subsequent hydrological engineering. The classical variogram in the literature ideally necessitates a normal distribution of the variable and is not robust against outliers within the data. These presumptions are hardly given under empirical conditions and, therefore, a new estimation method is proposed for the variogram. The new method is based on the description of spatial dependence by the robust rank coefficient τ and generalizes the method from the Gaussian to the general case of empirical distributions. The conversion of the robust estimate using a Monte-Carlo simulation and subsequent quantile-quantile transformation with the empirical marginal distribution performs the generalization. Monthly precipitation data from South Africa serve as the variable and were artificially contaminated with outliers. The effects on the variogram and subsequent geostatistical interpolation were investigated for the proposed, classical, and four existing robust variogram models in this comparative study. The investigation revealed that the proposed variogram describes a distinct spatial dependence structure under empirical conditions, which is robust against outliers. The cross validation of the linear estimator demonstrates that the proposed variogram tends to improve the bias and spread of the resulting error distribution, and hence the quality of the geostatistical interpolation.

Introduction

Many hydrological variables are only measured at some distinct measurement locations and require spatial interpolation prior to subsequent hydrological applications. Geostatistical models are often implemented as preferred methods of interpolation for many environmental variables. The theory of geostatistical models (Matheron 1963, 1965; Journel 1974) predicts sets of possible outcomes of the random variable Z(x). Hereby, it regards the existing values of the variable at every location x =(x1,x2) [for the two-dimensional (2D) space] within the domain D as one realization z(x) of Z(x). Unfortunately, many variables bear only one singular value for one event, are nonrepeating events, and are only partially known at a few measurement locations xi. The spatial description of the existing but unknown distribution of Z(x) can be estimated as follows:
E[Z(x)]=m(x)=constant
(1)
E{[Z(x+h)Z(x)]2}=2γ(h)
(2)
presuming an ergodic variable and an intrinsic random function (RF).
The theory assumes the increment Z(x+h)Z(x) as a stationary RF with the variance only depending on the translation vector h. The function γ(h) is the (semi) variogram, which forms the basis for the structural analysis and succeeding interpolation methods.
The variogram plays a central role within geostatistical models because it describes the spatial dependence of the variable and is eventually deducted from the few samples (or observations). The estimation of the sample variogram is a crucial step because it determines the spatial predictions over the entire domain D. The classical estimator of the sample variogram is according to Matheron (1962)
2γ^M(h)=1N(h)i=1N(h)[z(xi)z(xi+h)]2
(3)
where z(xi) = observations; and N(h) = number of pairs for one distance class h. The values γ^M(h) of the sample variogram are calculated in practice for distance intervals (h±Δh). A careful choice of the constant intervals should guarantee a minimum number of pairs of observed values in each distance class in order to ensure a stable estimation of the respective variogram values (Armstrong 1984).
Unfortunately, the estimation of the sample variogram [Eq. (3)] suffers a major setback: it imposes the assumption of Gaussianity on the underlying empirical data and the independence of random errors in its estimation. However, outliers, as a relatively small portion of the observations, frequently induce deviations from this assumption. They can unduly affect the estimation of the variogram and thereby the outcome of the geostatistical model; the arithmetic mean of the squared differences becomes a biased estimator of the distribution of the squared increments. In addition, every outlier affects the estimation of the variogram values γ^M(h) of several distance classes h according to its spatial positioning.
There are many geostatistical interpolation models from the kriging family that incorporate the variogram with its description of spatial dependence. They are also implemented within the field of hydrological engineering, whether for scientific investigations or for applied water resources management. In this context, the precipitation recordings from rain gauges are an important input variable to geostatistical interpolations (e.g., Subyani 2004; Basistha et al. 2008; Keblouti et al. 2012) and subsequent system analyses. A rising drawback is that many rain-gauge networks, especially in developing countries, have faced a systematic decline in quantity and quality in the last years; the number of operating rain gauges has decreased and the recorded data are more prone to outliers because of the lack of sufficient quality controls (Schneider et al. 2011).
In addition, many geostatistical applications only regard the resulting linear estimator Z and disregard the estimation variance σK2 as a measure of associated uncertainty. Therefore, the objective of this study is twofold: (1) implementation of the proposed variogram as a possible descriptor of spatial dependence, and (2) validation of an interpolated linear estimator Z, which is robust against outliers. A geostatistical interpolation with a robust linear estimator could improve the quality of future water resources engineering, especially in developing countries with deteriorating observation networks.

Materials and Methods

Existing Robust Estimators for the Variogram

It has been shown from a broad spectrum of applied sciences that observed data often contain outliers (Hampel 2001; Huber 2011). The problem has been addressed by robust estimators of the variogram, attempting to provide alternative models insensitive against outliers. The following section briefly reviews the existing estimators used in this study.

Trimmed Mean

The observations z(xi) within one distance class h are sorted and the α portion of the lowest and highest values are eliminated from the data without replacement. Consequently, the α-trimmed mean is the arithmetic mean of the central (12α)-portion of the observed data, enabling the estimation of a variogram γ^T(h), which is robust against outliers. It is a compromise between the representative arithmetic mean (α=0) with a statistical breakdown point of 0% (Hampel et al. 1986) and the robust median (α=0.5) with a statistical breakdown point of 50% (Hampel 2001). The median diverts from the arithmetic mean and is a decreasingly appropriate representation of the distribution of the squared increments (Chilès and Delfiner 1999) in case of an outlier-induced skewness [Eq. (3)].

Cressie-Hawkins Approach

Cressie and Hawkins (1980) introduced a robust estimator by targeting the induced skewness by the squares of the increments and replacing them with a lower-order power. It is classified as a robust estimator of location of the squared increments (Cressie 1991) and is formulated
2γ^CH(h)=[1N(h)i=1N(h)|z(xi)z(xi+h)|(1/2)]40.457+0.494N(h)+0.045N2(h)
(4)
where the power reduction envisages normally distributed square roots of the absolute increments. The observed skew of the resulting distribution is assumed to be negligible and promotes the arithmetic mean as an unbiased estimator (Cressie and Hawkins 1980). The fourth power subsequently rescales the arithmetic mean, and the denominator is bias-correction for the assumption of Gaussianity of the increments over all distance classes (Lark 2000).
Although Eq. (4) reduces the effects of outliers, it retains a statistical breakdown point of 0%; one possible outlier, if large enough, can still alter the estimate considerably. However, the robust variogram mitigates the effect of outliers and is widely applied in practice (Marchant and Lark 2007).

Dowd Approach

The robust estimator for the variogram proposed by Dowd (1984) is a deduction from the aforementioned considerations; it advocates the median as descriptor because of its statistical robustness, but implicitly presumes a symmetrical distribution of the increments around zero. The median of the absolute increments is an estimator of scale (Lark 2000) and the variogram is formulated
2γ^D(h)=2.198{median[|z(xi)z(xi+h)|]}2
(5)
where the second power rescales the estimate to the scale of the variogram. The constant factor ensures consistency with the standard deviation of the normal distribution (Dowd 1984), bearing the same statistical breakdown point as the median.

Genton Approach

Genton (1998) proposed the last examined and existing estimator for a robust variogram. The estimator implements the robust estimate of scale QNh (Rousseeuw and Croux 1992, 1993)
QNh=2.2191[|Vi(h)Vj(h)|;i<j](k)
(6)
with Vi(h)=z(xi)z(xi+h) and k=[(Nh+1)2], implying a total number of [N2] absolute differences of the increments Vi(h) for a data size N. The absolute differences are sorted and the kth value is taken as representative estimator. The constant factor corrects for consistency at Gaussianity (Genton 1998). The variogram is deduced by applying Eq. (6) to every distance class h and rescaling the differences by the second power to the scale of the variogram
2γ^G(h)=[QNh]2
(7)
The derived robust estimator is numerically expensive, but possesses a maximal breakdown point of 50% (Genton 1998).

Proposed Estimator for the Variogram Using Kendall’s Tau

The robust core of the proposed method is the description of the spatial relationship by the nonparametric rank correlation coefficient τ (Kendall 1938). The coefficient of Kendall’s τ(h) is estimated for a certain distance class h
τ(h)=i=1N(h)j=1N(h)sign[Vi(h)]·sign[Vj(h)]N(h)(N(h)1)
(8)
where Vi(h)=z(xi)z(xi+h); and N(h) = number of observation pairs separated by the respective distance. The estimation of the coefficient of Kendall’s τ(h) includes both possible definitions of the difference between a pair of observations [i.e., ±Vi(h)], but it omits the case when i=j.
The progression of Kendall’s τ(h) with increasing distance h is characterized by the theoretical maximum [τ(0)=1], a decline for increasing distances h, and absence of spatial dependence beyond a certain range [τ(h)0]. In general, the coefficient by Kendall’s τ(h) is a robust estimator of linear dependence with a higher statistical breakdown point than the comparable Pearson correlation coefficient ρN(h) (Lindskog et al. 2003). Furthermore, the relationship between Kendall’s τ and Pearson’s ρN of two frequency distributions, e.g., f[Vi(h)] and f[Vj(h)], can be estimated under the assumption of Gaussianity according to Cramér (1946)
ρN(h)=sin[π2·τ(h)]
(9)
which Lindskog et al. (2003) extended to all elliptical distributions.
However, the derivation of empirical distributions from the assumed Gaussianity hampers the application of Eq. (9). Therefore, a preceding preparation of the empirical distributions is implemented in two steps in order to circumvent the problem of non-Gaussianity in empirical data.
At first, a Monte-Carlo simulation independently generates two variables, R1 and R2, each with e.g., 1,000 realizations, yielding Gaussian distributions with N[0|1]. The two distributions are correlated by
R1=R1
(10a)
R2=ρN·R1+[1ρN2(h)]·R2
(10b)
which factually is the solution of a simple Cholesky decomposition of the 2×2 correlation matrix from R1 and R2. Thus, the two simulated random variables are correlated by ρN(h) and are still Gaussian with N[0|1].
Second, a quantile-quantile transformation of the linear dependent variables R1 and R2 with the empirical marginal cumulative distribution FE[z(xi)] is conducted, yielding the correlated cumulative random distributions FE(R1) and FE(R2). These two generated random distributions exhibit the underlying empirical distribution of the data, and the corresponding covariance Cov[FE(R1),FE(R2)] can be calculated for the distance class h. The resulting sample variogram γ^tau(h) is consequently defined as follows:
γ^tau(h)=C(0)Cov[FE(R1),FE(R2)]
(11)
where C(0) = a priori variance, being calculated from the observations and implying second-order stationarity. The estimation of γ^tau(h) is repeated for each distance class h of the proposed variogram.
Finally, the proposed method is based on the robust description of spatial dependence by the coefficient of Kendall’s τ and theoretically extends the estimation of a variogram from the Gaussian to the general case of empirical distributions.

Study Area and Data

The rectangular study area extends from 27″00′00″ to 30″30′00″ E and from 24″30′00″ to 28″00′00″ S (Fig. 1). The study area covers a projected area of approximately 132,000  km2 and is roughly located around the City of Johannesburg within the Republic of South Africa. It encompasses parts of the provinces North-West, Gauteng, Mpumalanga, and Free State, including extensions of the coastal plains in its north and parts of the Great Escarpment in the east. The study area exhibits an elevation from 608 to 2,331 meters above sea level (m.a.s.l.) with an average altitude of 1,445 m.a.s.l. (Fig. 1).
Fig. 1. Study area, elevation, and location of rain gauges
Precipitation recordings from rain gauges serve as input data and originate from four different sources: (1) Hydrological Information Service (HIS) of the Department of Water Affairs (DWA) of South Africa (DWA 2008); (2) Global Historical Climatology Network (GHCN, United States) of the National Climate Data Center (NCDC) (Vose et al. 1992); (3) Climate Research Unit (CRU, Great Britain) of the University of East Anglia (Mitchell and Jones 2005); and (4) internal database of the University of KwaZulu-Natal (Republic of South Africa) (Lynch 2004). The original data were subject to a test on duplicity and an automated quality control as described by Lebrenz (2013). Although the data from the CRU are provided in monthly resolution, the daily recordings from the HIS, NCDC, and University of KwaZulu-Natal required accumulation to monthly values. The 24 months of the years 1986 and 1987 are the study period, thus yielding 4,224 values of monthly precipitation from 176 rain gauges within the study area. The temporal evolution of the number of operating rain gauges motivated this choice of the study period; the number reached its maximum during the 1980s and has been steadily decreasing ever since (Lynch 2004). The crosses in Fig. 1 indicate the spatial locations of the rain gauges. These processed data form the first set of data, which is assumed to be without outliers.
Two additional data sets were derived by artificially inducing outliers to the first set. Therefore, the arithmetic mean μo and standard deviation σo were calculated from the original 176 monthly precipitation values of a specific month. The data were subsequently contaminated with outliers by artificially replacing an arbitrarily selected 2.5 and 5% portions of the original values with a randomly generated value from a normal distribution with N[3×μo|2×σo]. The three sets of monthly precipitation data ultimately served as input to the variogram models.
Fig. 2 illustrates the absolute frequency distributions of the monthly precipitation values for January 1986. The empirical distribution of the data without outliers already deviates from Gaussianity and exhibits a slight skewness. The introduction of outliers skews the distribution to higher values; the range nearly doubled, the arithmetic mean and standard deviation increased, and the skewness shifted further to more positive values.
Fig. 2. Absolute frequency distributions of monthly precipitation from 176 rain gauges for January 1986: (a) data set without outliers; (b) data set with 2.5% artificially induced outliers; (c) data set with 5% artificially induced outliers

Interpolation and Evaluation

A theoretical variogram model needed to be fitted after the data-driven estimation of the sample variograms by the six examined models. The mathematically admissible spherical variogram in combination with a nugget model (Cressie 1991; Chilès and Delfiner 1999) was chosen as the theoretical model. The combined model contains three parameters: c0 and c1 as the respective sills of the nugget and spherical portion and a as the range of influence. The three defining parameters are practically estimated by fitting the theoretical variogram to the values of the sample variogram. However, the estimated values γ^(h) of the sample variogram are based on a varying number of pairs of observations for different distance classes hk (with k=1,,K). A common observation is that these values are considerably fluctuating with γ^(hk+1)<γ^(hk). Therefore, an automated monotonization routine started at the first distance class (k=1) and scanned the distance classes in ascending order until two variogram values fulfilled γ^(hk+1)<γ^(hk).
These two variogram values were then replaced by their common mean. The routine subsequently started over again at the first distance class, repeated the scan for the updated variogram values, and replaced the next two values for which γ^(hk+1)<γ^(hk). The routine finally aborted when a monotonic increase of all variogram values [γ^M(hk+1)γ^M(hk)] was obtained for all distances classes K. This routine might replace several consecutive variogram values backward over a corresponding number of distance classes in order to ensure a monotonic increase over the entire range. As a consequence, the monotonization routine factually smooths the shape of the sample variogram and eases the fitting of the theoretical variogram.
A line through the values of the first two distance classes of the monotonized variogram was then extended onto the variogram axis. Thus, the sill of the nugget portion c0 of the theoretical variogram model was estimated. The sill of the spherical portion c1 equals the difference between the monotonized variogram value of the last distance class γ^(hK) and the nugget portion c0. The distance at which the tangent to the curve at the origin intersected the sill c1 defines two-thirds of the range a of the spherical variogram model (Journel and Huijbregts 2003). The three parameters fully describe the theoretical variogram and finally enable the subsequent geostatistical interpolation. The widely applied ordinary kriging (Matheron 1965) was selected as interpolation method, enabling the calculation of the linear estimator Z(x) and the associated kriging variance σK2(x) at every chosen but unknown location x.
A subsequent cross validation for the evaluation of the quality of interpolation was conducted by eliminating each observed value z(xi) from the data in consecutive turns and recalculating the linear estimator Z(xi) and the associated kriging variance σK2(xi) from only the remaining observations by ordinary kriging.
The distribution of the differences between the observed and the estimated value [Z(xi)z(xi)] were statistically described by three objective functions: mean error (ME), root-mean square error (RMSE), and mean square deviation ratio (R)
ME=1ni=1n[Z(xi)z(xi)]
(12a)
RMSE={1ni=1n[Z(xi)z(xi)]2}0.5
(12b)
R=1ni=1n[Z(xi)z(xi)]2σK2(xi)
(12c)
The ME quantifies the bias of the error distribution, the RMSE describes the mean range over which the errors are spread, and R evaluates the appropriateness of the variogram. R should be ideally 1 if the resulting errors are normally distributed and the input data are of second-order stationarity. The kriging variance σK2 would scale the errors to a mean of 1, resulting in a χ2-distribution with one degree of freedom (Lark 2000).
The aforementioned procedures of fitting the theoretical variogram, geostatistical interpolation, and evaluation were implemented for each of the six different variogram models and three sets of data: without outliers, with a 2.5% portion of artificial outliers, and with a 5% portion of artificial outliers.

Results and Discussion

This section presents the analysis of the outcomes for January 1986 and a qualitative comparison of the results from cross-validation for the 24 months of January 1986–December 1987.

Results for January 1986

Fig. 3 illustrates the resulting sample variograms of the six methods together with Matheron’s original variogram of the outlier-free data set.
Fig. 3. Sample variograms for the six estimation models for January 1986; thick solid line represents data without outliers, dashed line represents 2.5% outliers, chain-dotted line represents 5% outliers, and thin dotted line is Matheron’s variogram for the data set without outliers for comparison: (a) Matheron; (b) trimmed mean; (c) Cressie-Hawkins; (d) Dowd; (e) Genton; (f) Kendall’s τ
The nonrobust variogram by Matheron exhibits a high degree of alteration for all distances and especially close to the origin when the data are contaminated [Fig. 3(a)]. The increasing portion of outliers evokes a deterioration of variogram values, and the representativeness of the derived theoretical variogram becomes doubtful.
The robust models of the trimmed mean [Fig. 3(b)] and Cressie and Hawkins (1980) [Fig. 3(c)] are comparatively diminishing the effects of a 2.5% portion of outliers. Moreover, the models by Dowd (1984) [Fig. 3(d)] and Genton (1998) [Fig. 3(e)] only display a minimal outlier-induced deviation, especially for the important smaller distances close to the origin. As a summary, all four existing estimators for a robust sample variogram are reasonably reproducing Matheron’s variogram (1962) in the presence of a 2.5% portion of outliers.
All existing models exhibited a systematic increase in the absolute variogram models as well as in the variation between variogram values of consecutive distance classes for the data set containing 5% outliers. In particular, the models of the trimmed mean [Fig. 3(b)] and Dowd [Fig. 3(d)] considerably deteriorate, whereas the models by Cressie-Hawkins [Fig. 3(c)] and Genton [Fig. 3(e)] show a relative small alteration.
Regarding the respective sample variogram without outliers for the proposed method using Kendall’s τ [Fig. 3(f)], the apparent characteristic is the smoother shape, which increases from the origin almost monotonically and exhibits a distinct sill after a certain distance. This property should ease the fitting of the theoretical variogram model. The existing variogram estimators show in comparison a more bumpy development with higher variability over increasing distances and hardly display any sill, supporting use of Kendall’s τ as a measure for the description of spatial dependence. Its distinct shape is also maintained for an increasing number of outliers, which is attributed to the robustness of Kendall’s τ. However, the induced outliers clearly increase the absolute values of the variogram through their direct impact on the a priori variance C(0) [Eq. (11)].
Table 1 displays the three selected objective functions for January 1986 as descriptors of the error distribution from the cross validation. The illustrated values may surprise because Matheron’s variogram does not yield the best ME for the outlier-free data set. However, the marginal distribution (Fig. 2) is slightly skewed and diverts from the normal assumption. The non-Gaussianity is typical for empirical distributions and the variogram of Matheron becomes a biased estimator. The induced outliers are shifting the distribution further to the left, which is reflected by a highest positive ME of all variogram models. The other four existing models yield a better ME than Matheron’s variogram and exhibit a certain robustness in both cases of induced outliers. The proposed method using Kendall’s τ shows the best ME for the data set without and with a 2.5% portion and ranks second for the data set with a 5% portion of outliers. The new method shows the smallest change (ΔME=0.09 and 3.87 mm) for both cases of adding 2.5% of outliers to the data.
Table 1. Summary Statistics of the Error Distributions for January 1986, Subdivided into the Six Variogram Models and Three Data Sets: Without, with 2.5% Outliers, and with 5% Outliers
VariogramME (mm)RMSE (mm)R
Without outliers2.5%5%Without outliers2.5%5%Without outliers2.5%5%
Matheron0.110.728.0931.6834.3336.942.670.480.40
Trimmed mean0.100.418.0831.7133.7834.312.5210.391.24
Cressie-Hawkins0.400.133.9933.7931.8635.169.963.174.09
Dowd0.250.026.8233.9633.8744.575.634.1335.88
Genton0.080.253.6234.0433.9640.624.765.6011.48
Kendall’s τ0.000.093.9632.0931.4833.762.181.121.30

Note: Bold values are indicate the best values.

The mean spread of the errors is originally the lowest for Matheron’s variogram, but the outliers induce a high deterioration. The model of the trimmed mean exhibits a similar RMSE for the outlier-free data set, but relatively limits its deterioration with contamination. The remaining models numerically improve the mean spread of their errors for the data set with 2.5% of outliers, with Genton’s model being the least influenced (ΔRMSE=0.08  mm). However, the spread of errors increases for all implemented models when the data contain 5% outliers. The model by Dowd hereby shows the highest RMSE, being attributed to the considerable deterioration of its variogram [Fig. 3(d)]. The proposed model with Kendall’s τ exhibits the best RMSE for both contaminated data sets of January 1986.
The shape of the error distribution appears to be relatively close to normal for the model by Matheron in case of all three data sets. In contrast, the deviation ratio R greatly deteriorates for the trimmed mean when exposed to 2.5% outliers and for the models by Dowd and by Genton when 5% outliers are present. The other robust estimators yield comparable values with Kendall’s τ apparently bearing an error distribution close to Gaussianity for all three data sets. Low values in the deviation ratio R correspond to high variogram sills (compare Fig. 3) for the data set with 5% of outliers; high variogram values produce high estimation variances and a low deviation ratio R [Eq. (12c)].

Results for January 1986–December 1987

The comparative investigation was extended to the entire study period in order to generalize the findings. The respective error distributions from the cross validation were derived and evaluated by the outlined methodology for the 24 months of January 1986–December 1987. A ranking among the six variogram models was applied to each month, each objective function, and the three data sets for qualitative comparison.
Table 2 illustrates the summary of rankings with respect to the ME. It shows that the model using Kendall’s τ constitutes the best method for the outlier-free data set in 9 out of 24 months. This method also exhibits the best average ranking and its estimation of centrality tends to be superior even to Matheron’s variogram. The finding might result from the general deviation of the empirical data from the Gaussian assumption. Genton’s estimation method represents the best method for 6 months, but its relatively low average ranking indicates that its performance considerably deteriorated for other months. Dowd’s variogram exhibits the worst average ranking, questioning the appropriate representation of the distribution of increments by the median [Eq. (5)] for skewed distributions.
Table 2. Summary of Rankings of the ME from All 24 Months and Three Data Sets
VariogramNumber of first ranksAverage rank
Without outliers2.5%5%Without outliers2.5%5%
Matheron3403.04.24.8
Trimmed mean1303.13.45.0
Cressie-Hawkins3463.33.02.4
Dowd2414.03.44.7
Genton6493.83.52.0
Kendall’s τ9582.73.02.2

Note: Bold values are indicate the best values.

The insertion of 2.5% outliers into the data changes the qualitative ranking of the methods (Table 2). All estimation methods represent the best method for a similar number of months during the 2 years. The average ranking underlines the deterioration of Matheron’s variogram in the presence of outliers and indicates an inappropriate estimation method when regarding the mean error. The method using Kendall’s τ and the Cressie-Hawkins method exhibit a superior average ranking for the precipitation data with 2.5% outliers.
The data containing a 5% portion of outliers exposes further insights. The average rankings of the variograms by Matheron, trimmed mean, and Dowd systematically decrease, indicating their inappropriateness in the estimation of the mean error for nearly all 24 months. On the contrary, the number of first ranks and the average ranking of the models by Cressie-Hawkins and Genton, as well as the model using Kendall’s τ, show superior values in case of the data set with 5% outliers. The finding is attributed to the importance of a robust description of the relative spatial dependence rather than a robust estimation of the absolute variogram values. The geostatistical calculation of the estimator Z largely depends on the allocation of relative variogram values to the nearby rain gauges rather than on their absolute values because multiplication of the variogram values would not affect the linear estimator Z.
The ranking with respect to the RMSE of the error distributions (Table 3) substantiates the findings. For the case of the outlier-free data, the proposed method using Kendall’s τ bears the best RMSE for 6 out of 24 months, with Matheron’s and Genton’s variograms displaying a similar number of first ranks. However, the superior average ranking proposes that the variogram using Kendall’s τ have a better average ranking when not being the best.
Table 3. Summary of Rankings of the RMSE from All 24 Months and Three Data Sets
VariogramNumber of first ranksAverage rank
Without outliers2.5%5%Without outliers2.5%5%
Matheron5112.94.93.6
Trimmed mean3413.43.44.2
Cressie-Hawkins3653.33.02.8
Dowd2404.03.15.1
Genton5543.52.93.1
Kendall’s τ64132.53.22.3

Note: Bold values are indicate the best values.

The introduction of an increasing portion of outliers to the data alters the comparative rankings. The variogram by Matheron clearly deteriorates and displays a poor average ranking for both data sets containing outliers. The sensitivity to outliers of the classical model and the subsequent increase of the sill is reflected in a high spread of the error distribution. In contrast, the other existing variogram models exhibit a similar number of first ranks for the case of a 2.5% portion of outliers, with Cressie-Hawkins’ variogram exhibiting the absolute highest number. In case of the data set with a 5% portion of outliers, the number of first ranks abruptly decreases for the model of the trimmed mean and that by Dowd, which indicates that these models have reached their breakdown thresholds. The similar number of first ranks and the good average ranking among the models of Genton and Cressie-Hawkins suggests that these methods are relatively able to limit the effects of outliers on the RMSE. The proposed method using Kendall’s τ shows similar good rankings for data set with a 2.5% portion of outliers and a superior performance when the portion of outliers is further increased to 5%. The proposed method is relatively robust in limiting the spread of errors because of the already outlined robust description of spatial dependence.
Table 4 displays the summarized ranking with regard to the mean deviation ratio R, revealing that the model by Cressie-Hawkins comparably fails in the description of a normal error distribution in the case of outlier-free data. The models of Dowd and with Kendall’s τ have the same number of first ranks and a comparable average ranking when no outliers are present. The variogram by Genton displays the highest number of first ranks and the best average ranking for the data sets without and with a 2.5% portion of outliers, whereas the model using Kendall’s τ bears the best performance for the data set with a 5% portion of outliers. They appear to be barely influenced by outliers and generally are suitable models to describe Gaussian error distributions.
Table 4. Summary of Rankings of R from All 24 Months and Three Data Sets
VariogramNumber of first ranksAverage rank
Without outliers2.5%5%Without outliers2.5%5%
Matheron3343.33.02.7
Trimmed mean1343.64.73.6
Cressie-Hawkins0225.04.33.8
Dowd6202.73.65.2
Genton8852.02.43.0
Kendall’s τ6693.12.82.6

Note: Bold values are indicate the best values.

The variogram model by Matheron does not show the same relative deterioration in performance for the contaminated data compared to the objective functions of the ME and RMSE. The indicated average ranking leads to the conclusion, that it has an error distribution closer to normal than the respective distributions of the trimmed mean, Cressie-Hawkins, and Dowd. The outliers induce exaggerated sills of the variogram, leading to increased kriging variances σK2 by the geostatistical interpolation and ultimately to a relatively low mean deviation ratio R [Eq. (12c)].

Conclusions

This comparative study investigates the outcomes from the proposed method for a sample variogram using Kendall’s τ and from five existing variogram models. The proposed method proved to be a good descriptor of spatial dependence for empirical non-Gaussian distributions, which are typically encountered with environmental variables. The resulting variogram elaborates a relatively smooth and distinct slope near the origin and a clear sill after a certain range, which the other examined methods show little evidence of under empirical conditions. The distinct description of spatial dependence is advantageous for the subsequent fitting procedure of the theoretical variogram. The contamination by outliers directly affects the nonrobust a priori variance C(0), and the proposed method is, therefore, not robust in terms of absolute variogram values.
However, the maintenance of the distinct description of spatial dependence in the presence of outliers is important for the subsequent geostatistical interpolation. The robust core of the proposed method ensures an adequate estimation of the variogram values close to the origin. This circumstance becomes important because the relative variogram values of the smaller distances dominate the outcome for the linear estimator Z(x) at the unknown location x.
Therefore, the subsequent cross validation of the linear estimator Z(x) underlines that the proposed method tends to outperform the existing variogram models and produces an improved bias, smaller spread of errors, and a good Gaussian estimate of the error distribution even for data containing outliers. This study advocates the proposed method as an alternative for geostatistical interpolations, which have the objective of a robust linear estimator with an improved centrality and spread of its error distribution.

Acknowledgments

This research was executed at the Institute for Modelling Hydraulic and Environmental Systems of the University of Stuttgart.

References

Armstrong, M. (1984). “Common problems seen in variograms.” Math. Geol., 16(3), 305–313.
Basistha, A., Arya, D. S., and Goel, N. K. (2008). “Spatial distribution of rainfall in Indian Himalayas—A case study of Uttarakhand region.” Water Resour. Manage., 22(10), 1325–1346.
Chilès, J. P., and Delfiner, P. (1999). Geostatistics: Modeling spatial uncertainty, Wiley, New York.
Cramér, H. (1946). Mathematical methods of statistics, 1st Ed., Princeton University Press, Princeton, NJ.
Cressie, N., and Hawkins, D. M. (1980). “Robust estimation of the variogram: I.” Math. Geol., 12(2), 115–125.
Cressie, N. A. C. (1991). Statistics for spatial data, Wiley, New York.
Dowd, P. A. (1984). “The variogram and kriging: Robust and resistant estimators.” Geostatistics for natural resources characterization: Part 1, G. Verly, M. David, A. G. Journel, and A. Marechal, eds., Springer, Dordrecht, Netherlands, 91–106.
DWA. (2008). “Hydrological Information System.” ⟨⟩.
Genton, M. G. (1998). “Highly robust variogram estimation.” Math. Geol., 30(2), 213–221.
Hampel, F. R. (2001). “Robust statistics: A brief introduction and overview.”, Eidgenössische Technische Hochschule, Zürich, Switzerland.
Hampel, F. R., Ronchetti, E. M., Rousseseeuw, P. J., and Stahel, W. A. (1986). Robust statistics: The approach based on influence functions, Wiley, New York.
Huber, P. (2011). International encyclopedia of statistical science, Springer, Berlin.
Journel, A. (1974). “Simulations conditionelles—Théorie et Pratique.” Ph.D. thesis, Centre de Géostatistique, Ecole de Mines, Fontainebleu, France (in French).
Journel, A. G., and Huijbregts, C. J. (2003). Mining geostatistics, 5th Ed., Blackburn, Caldwell, NJ.
Keblouti, M., Ouerdachi, L., and Boutaghane, H. (2012). “Spatial interpolation of annual precipitation in Annaba-Algeria—Comparison and evaluation of methods.” Energy Procedia, 18, 468–475.
Kendall, M. G. (1938). “A new measure of rank correlation.” Biometrika, 30(1–2), 81–93.
Lark, R. M. (2000). “A comparison of some robust estimators of the variogram for use in soil survey.” Eur. J. Soil Sci., 51(1), 137–157.
Lebrenz, H. (2013). “Addressing the input uncertainty for hydrological modeling by a new geostatistical method.” Ph.D. thesis, Univ. of Stuttgart, Stuttgart, Germany.
Lindskog, F., McNeil, A., and Schmock, U. (2003). “Kendall’s tau for elliptical distributions.” Credit risk: Measurement, evaluation andmanagement, Physica-Verlag, Heidelberg, Germany, 149–156.
Lynch, S. (2004). “Development of a raster database of annual, monthly and daily rainfall for southern Africa.”, Water Research Commission, Pretoria, South Africa.
Marchant, B., and Lark, R. (2007). “Robust estimation of the variogram by residual maximum likelihood.” Geoderma, 140(1–2), 62–72.
Matheron, G. (1962). “Traité de géostatistique appliqué.” Mémoires du Bureau de Recherches Géologiques et Minières, Éditions Technip, Paris, 333 (in French).
Matheron, G. (1963). “Principles of geostatistics.” Econ. Geol., 58(8), 1246–1266.
Matheron, G. (1965). “Les variables régionalisées et leur estimation: Une application de la théorie des fonctions aléatoires aux sciences de la nature.” Ph.D. thesis, Univ. of Paris, Paris (in French).
Mitchell, T. D., and Jones, P. D. (2005). “An improved method of constructing a database of monthly climate observations and associated high-resolution grids.” Int. J. Climatol., 25(6), 693–712.
Rousseeuw, P. J., and Croux, C. (1992). “Explicit scale estimators with high breakdown point.” L1-statistical analysis and related methods, Y. Dodge, ed., North Holland, Amsterdam, Netherlands, 77–92.
Rousseeuw, P. J., and Croux, C. (1993). “Alternatives to the median absolute deviation.” J. Am. Stat. Assoc., 88(424), 1273–1283.
Schneider, U., Becker, A., Meyer-Christoffer, A., Ziese, M., and Rudolf, B. (2011). “Global precipitation analysis products of the GPCC.”, Global Precipitation Climatology Centre, Offenbach, Germany.
Subyani, A. M. (2004). “Geostatistical study of annual and seasonal mean rainfall patterns in southwest Saudi Arabia/distribution géostatistique de la pluie moyenne annuelle et saisonnière dans le sud-ouest de l’Arabie Saoudite.” Hydrol. Sci. J., 49(5), 803–817.
Vose, R. S., et al. (1992). “The global historical climatology network: Long-term monthly temperature, precipitation, sea-level pressure, and station pressure data.”, Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, Oak Ridge, TN.

Information & Authors

Information

Published In

Go to Journal of Hydrologic Engineering
Journal of Hydrologic Engineering
Volume 22Issue 9September 2017

History

Received: Oct 13, 2016
Accepted: Apr 14, 2017
Published online: Jul 14, 2017
Published in print: Sep 1, 2017
Discussion open until: Dec 14, 2017

Authors

Affiliations

Professor, Institute of Civil Engineering, Univ. of Applied Sciences and Arts—Northwestern Switzerland, 4132 Muttenz, Switzerland; Institute for Modelling Hydraulic and Environmental Systems, Univ. of Stuttgart, 70174 Stuttgart, Germany (corresponding author). E-mail: [email protected]
A. Bárdossy
Professor, Institute for Modelling Hydraulic and Environmental Systems, Univ. of Stuttgart, 70174 Stuttgart, Germany.

Metrics & Citations

Metrics

Citations

Download citation

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. Simply select your manager software from the list below and click Download.

Cited by

View Options

Figures

Tables

Media

Share

Share

Copy the content Link

Share with email

Email a colleague

Share