
Fitting loss distributions

In the previous sections, we discussed the most common distributions for modelling insurance losses and claim frequency, and how to estimate the parameters of each distribution. Different estimation methods may yield different parameter values, so we would like to examine whether the data sample is consistent with a given probability distribution and estimation method. This is the problem of model fitting.

In this section, we discuss three ways to assess a fitted model: the Kolmogorov-Smirnov (K-S) test, the Chi-square goodness-of-fit test and the Akaike Information Criterion (AIC).

Kolmogorov-Smirnov (K-S) test

The K-S test is a procedure for testing whether a sample comes from a specified population distribution. The test is based on the empirical distribution function (e.d.f.).

The primary idea of the K-S test is to reject the null hypothesis $H_0$ if there is a significant difference between the e.d.f. from the given sample and the hypothesised c.d.f. $F_0$ from a particular population distribution, i.e. if the maximum absolute difference $d_n$ between $F_0$ and the estimated c.d.f. is large.

Test Procedures

To set up the test, we start from the c.d.f. We state the null hypothesis $H_0$ and the alternative hypothesis $H_a$ for the population distribution with given parameters.

$$\begin{align*} H_0&: X \sim F_0(x) \\ H_a&: X \nsim F_0(x) \end{align*}$$

Let $X_1, \cdots, X_n$ be a random sample. The e.d.f. $\hat{F}_n(x)$ is defined as:

$$\hat{F}_n(x) = \frac{\#(x_i \leq x)}{n}$$

where $\#(x_i \leq x)$ is the number of $x_i$ satisfying $x_i \leq x$, for $i = 1, \cdots, n$.
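As a small sketch (using a made-up sample, not the theft data), the e.d.f. can be evaluated in R either with the built-in `ecdf` function or directly from the definition above:

```r
# Illustrative sample (not the theft data)
x <- c(2, 5, 1, 7, 3)

# Built-in empirical distribution function
Fn <- ecdf(x)

# F^_n(3) = #(x_i <= 3) / n = 3/5
Fn(3)
mean(x <= 3)   # the same value, computed from the definition
```

Both calls return 0.6 here, since three of the five observations are at most 3.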

The Kolmogorov-Smirnov test statistic is given by:

$$d_n = \sup_{-\infty < x < \infty} |\hat{F}_n(x) - F_0(x)|$$

The statistic $d_n$ is the largest vertical distance between the e.d.f. and the hypothesised c.d.f. A larger $d_n$ represents a larger numerical discrepancy between the estimated and hypothesised c.d.f.
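To see what $d_n$ measures, it can be computed by hand and compared with the statistic reported by `ks.test`. The following is a sketch with simulated exponential data; the seed, sample size and rate are arbitrary choices for illustration. Because $F_0$ is continuous and $\hat{F}_n$ is a step function, the supremum is attained at an order statistic, so it suffices to check just below and at each sorted data point:

```r
set.seed(1)
x <- rexp(50, rate = 2)   # simulated sample (illustration only)

# Evaluate F_0 at the order statistics
xs <- sort(x)
n  <- length(xs)
F0 <- pexp(xs, rate = 2)

# Largest gap between the e.d.f. and F_0, checking the step
# heights i/n (at x_(i)) and (i-1)/n (just below x_(i))
d_n <- max(pmax(abs((1:n)/n - F0), abs((0:(n - 1))/n - F0)))

# Agrees with the D statistic computed by ks.test
ks.test(x, "pexp", 2)$statistic
d_n
```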

If the null hypothesis is true, $d_n$ should be smaller than a critical value $d_{\alpha}$ at significance level $\alpha$. The critical value $d_{\alpha}$ can be found in a K-S table. Equivalently, the p-value should be larger than $\alpha$ if the null hypothesis is true.

Implementation

We will use R to carry out the K-S test in practice. The following R code demonstrates the inputs and outputs of each step. We test the data under the assumptions that the sample is drawn from an exponential and a lognormal distribution.

```r
# Input data
theft <- read.table("theft.txt")
# Collect the test data as a numeric vector
X <- theft[, 1]
# Apply the K-S test for the exponential distribution,
# with the rate parameter estimated by 1/mean(X)
ks.test(X, "pexp", 1/mean(X))

# Lognormal distribution
ks.test(X, "plnorm", 6.654601, sqrt(2.291516))
```

```text
	Asymptotic one-sample Kolmogorov-Smirnov test

data:  X
D = 0.2005, p-value = 0.0001291
alternative hypothesis: two-sided
```

```text
	Asymptotic one-sample Kolmogorov-Smirnov test

data:  X
D = 0.087018, p-value = 0.3236
alternative hypothesis: two-sided
```

From the above results, we can see that the p-value for the exponential distribution is 0.0001291, which is less than 0.05. Therefore, we reject the null hypothesis that the sample is drawn from the exponential distribution. On the other hand, the p-value for the lognormal distribution is 0.3236 > 0.05. Therefore, we fail to reject the null hypothesis that the sample is drawn from the lognormal distribution.

The lognormal distribution is therefore a better fit for the data than the exponential distribution.

Limitations

The K-S test applies only to continuous distributions, and the standard critical values assume that the parameters of $F_0$ are fully specified in advance. When the parameters are estimated from the same sample, as in the example above, the test is conservative, so the reported p-values should be interpreted with caution.

Chi-square goodness-of-fit test

The Chi-square goodness-of-fit test is another statistical test to determine whether the sample data is consistent with a hypothesised distribution. The test is based on the difference between the observed and expected frequencies.

When we have a random sample $X_1, \cdots, X_n$ from a population, we can divide the sample into $k$ intervals of the form $I_i = [c_i, c_{i+1})$ for $i = 1, \cdots, k$.

Let $O_i$ be the observed frequency and $E_i$ the expected frequency for the $i$-th interval under the hypothesised distribution. We then compare the observed and expected frequencies for each interval.

The Chi-square test statistic is given by:

$$\chi^2_{GF} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

Under the null hypothesis, the test statistic approximately follows a Chi-square distribution with $k-1$ degrees of freedom (reduced by one further degree for each parameter estimated from the data). We can use a Chi-square table to find the critical value at significance level $\alpha$, rejecting the null hypothesis when $\chi^2_{GF}$ exceeds it.
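A minimal sketch of the test in R, with made-up observed counts and a hypothesised uniform distribution over four intervals (the numbers are illustrative, not from the theft data):

```r
# Observed frequencies in k = 4 intervals (illustrative data)
O <- c(22, 18, 25, 15)
n <- sum(O)

# Expected frequencies under a hypothesised uniform distribution
p <- rep(1/4, 4)
E <- n * p

# Chi-square goodness-of-fit statistic and p-value (df = k - 1)
chi2 <- sum((O - E)^2 / E)
pval <- pchisq(chi2, df = length(O) - 1, lower.tail = FALSE)

# The built-in test gives the same statistic and p-value
chisq.test(O, p = p)
```

Here $E_i = 25$ for each interval, the statistic works out to $58/20 = 2.9$, and the p-value is well above 0.05, so the uniform hypothesis would not be rejected for these illustrative counts.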

Akaike Information Criterion (AIC)

The last method for assessing the model is the Akaike Information Criterion (AIC). The AIC measures the goodness of fit of a statistical model while penalising its complexity. It is based on the likelihood function and the number of parameters in the model.

The AIC is given by:

$$AIC = -2 \log(L) + s \cdot r$$

where $L$ is the maximised likelihood, $s$ is the penalty per parameter (usually $s = 2$), and $r$ is the number of parameters in the model. When comparing candidate models, the model with the smallest AIC is preferred.
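As a sketch, the AIC with $s = 2$ can be computed from the fitted log-likelihoods and used to compare the exponential and lognormal fits. The data below are simulated from a lognormal distribution purely for illustration; in practice one would use the theft sample:

```r
set.seed(42)
x <- rlnorm(100, meanlog = 1, sdlog = 0.5)   # simulated data (illustration)

# Exponential fit: MLE of the rate is 1/mean(x); r = 1 parameter
loglik_exp <- sum(dexp(x, rate = 1/mean(x), log = TRUE))
aic_exp <- -2 * loglik_exp + 2 * 1

# Lognormal fit: MLEs of meanlog and sdlog; r = 2 parameters
mu    <- mean(log(x))
sigma <- sqrt(mean((log(x) - mu)^2))
loglik_lnorm <- sum(dlnorm(x, meanlog = mu, sdlog = sigma, log = TRUE))
aic_lnorm <- -2 * loglik_lnorm + 2 * 2

# The smaller AIC indicates the better fit
c(exponential = aic_exp, lognormal = aic_lnorm)
```

Since the data were generated from a lognormal distribution, the lognormal fit attains the smaller AIC here despite its extra parameter.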
