
Fitting loss distributions

In the previous sections, we discussed the most common distributions for modelling insurance losses and claim frequency, and how to estimate the parameters of each distribution. Different estimation methods may yield different parameter values, so we would like to examine whether the data sample is consistent with a given probability distribution and estimation method. This is the problem of model fitting.

In this section, we discuss three ways to assess a fitted model: the Kolmogorov-Smirnov (K-S) test, the Chi-square goodness-of-fit test and the Akaike Information Criterion (AIC).

Kolmogorov-Smirnov (K-S) test

The K-S test is a procedure for testing whether a sample comes from a specified population distribution. The test is based on the empirical distribution function (e.d.f.).

The primary idea of the K-S test is to reject the null hypothesis $H_0$ if there is a significant difference between the e.d.f. from the given sample and the hypothesised c.d.f. $F_0$ from a particular population distribution, i.e. if the maximum absolute difference $d_n$ between $F_0$ and the estimated c.d.f. is large.

Test Procedures

To set up the test, we start from the c.d.f. We state the null hypothesis $H_0$ and the alternative hypothesis $H_a$ for the population distribution with given parameters.

$$\begin{align*} H_0&: X \sim F_0(x) \\ H_a&: X \nsim F_0(x) \end{align*}$$

Let $X_1, \cdots, X_n$ be a random sample. The e.d.f. $\hat{F}_n(x)$ is defined as:

$$\hat{F}_n(x) = \frac{\#(x_i \leq x)}{n}$$

where $\#(x_i \leq x)$ is the number of $x_i$ satisfying $x_i \leq x$, for $i = 1, \cdots, n$.
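As a small sketch (using a made-up sample, not the theft data), the e.d.f. can be evaluated in R either with the built-in `ecdf` function or directly from the definition above:

```r
# Illustrative sample (not the theft data)
x <- c(2, 5, 1, 7, 3)

# Built-in empirical distribution function
Fn <- ecdf(x)

# F^_n(3) = #(x_i <= 3) / n = 3/5
Fn(3)
mean(x <= 3)   # the same value, computed from the definition
```

Both calls return 0.6 here, since three of the five observations are at most 3.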

The Kolmogorov-Smirnov test statistic is given by:

$$d_n = \sup_{-\infty < x < \infty} |\hat{F}_n(x) - F_0(x)|$$

The statistic $d_n$ is the largest vertical distance between the e.d.f. and the hypothesised c.d.f. A larger $d_n$ represents a larger numerical discrepancy between the estimated and hypothesised c.d.f.
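To see what $d_n$ measures, it can be computed by hand and compared with the statistic reported by `ks.test`. The following is a sketch with simulated exponential data; the seed, sample size and rate are arbitrary choices for illustration. Because $F_0$ is continuous and $\hat{F}_n$ is a step function, the supremum is attained at an order statistic, so it suffices to check just below and at each sorted data point:

```r
set.seed(1)
x <- rexp(50, rate = 2)   # simulated sample (illustration only)

# Evaluate F_0 at the order statistics
xs <- sort(x)
n  <- length(xs)
F0 <- pexp(xs, rate = 2)

# Largest gap between the e.d.f. and F_0, checking the step
# heights i/n (at x_(i)) and (i-1)/n (just below x_(i))
d_n <- max(pmax(abs((1:n)/n - F0), abs((0:(n - 1))/n - F0)))

# Agrees with the D statistic computed by ks.test
ks.test(x, "pexp", 2)$statistic
d_n
```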

If the null hypothesis is true, $d_n$ should be smaller than a critical value $d_{\alpha}$ at significance level $\alpha$. The critical value $d_{\alpha}$ can be found in a K-S table. Equivalently, the p-value should be larger than $\alpha$ if the null hypothesis is true.

Implementation

We will use R to carry out the K-S test in practice. The following R code demonstrates the inputs and outputs of each step. We test the data under the assumptions that the sample is drawn from an exponential and a lognormal distribution.

```r
# Input data
theft <- read.table("theft.txt")
# Collect the test data as a numeric vector
X <- theft[, 1]
# Apply the K-S test for the exponential distribution,
# with the rate parameter estimated by 1/mean(X)
ks.test(X, "pexp", 1/mean(X))

# Lognormal distribution
ks.test(X, "plnorm", 6.654601, sqrt(2.291516))
```

```text
	Asymptotic one-sample Kolmogorov-Smirnov test

data:  X
D = 0.2005, p-value = 0.0001291
alternative hypothesis: two-sided
```

```text
	Asymptotic one-sample Kolmogorov-Smirnov test

data:  X
D = 0.087018, p-value = 0.3236
alternative hypothesis: two-sided
```

From the above results, we can see that the p-value for the exponential distribution is 0.0001291, which is less than 0.05. Therefore, we reject the null hypothesis that the sample is drawn from the exponential distribution. On the other hand, the p-value for the lognormal distribution is 0.3236 > 0.05. Therefore, we fail to reject the null hypothesis that the sample is drawn from the lognormal distribution.

The lognormal distribution is therefore a better fit for the data than the exponential distribution.

Limitations

The K-S test applies only to continuous distributions, and the standard critical values assume that the parameters of $F_0$ are fully specified in advance. When the parameters are estimated from the same sample, as in the example above, the test is conservative, so the reported p-values should be interpreted with caution.

Chi-square goodness-of-fit test

The Chi-square goodness-of-fit test is another statistical test to determine whether the sample data is consistent with a hypothesised distribution. The test is based on the difference between the observed and expected frequencies.

When we have a random sample $X_1, \cdots, X_n$ from a population, we can divide the sample into $k$ intervals of the form $I_i = [c_i, c_{i+1})$ for $i = 1, \cdots, k$.

Let $O_i$ be the observed frequency and $E_i$ the expected frequency for the $i$-th interval under the hypothesised distribution. We then compare the observed and expected frequencies for each interval.

The Chi-square test statistic is given by:

$$\chi^2_{GF} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

Under the null hypothesis, the test statistic approximately follows a Chi-square distribution with $k-1$ degrees of freedom (reduced by one further degree for each parameter estimated from the data). We can use a Chi-square table to find the critical value at significance level $\alpha$, rejecting the null hypothesis when $\chi^2_{GF}$ exceeds it.
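A minimal sketch of the test in R, with made-up observed counts and a hypothesised uniform distribution over four intervals (the numbers are illustrative, not from the theft data):

```r
# Observed frequencies in k = 4 intervals (illustrative data)
O <- c(22, 18, 25, 15)
n <- sum(O)

# Expected frequencies under a hypothesised uniform distribution
p <- rep(1/4, 4)
E <- n * p

# Chi-square goodness-of-fit statistic and p-value (df = k - 1)
chi2 <- sum((O - E)^2 / E)
pval <- pchisq(chi2, df = length(O) - 1, lower.tail = FALSE)

# The built-in test gives the same statistic and p-value
chisq.test(O, p = p)
```

Here $E_i = 25$ for each interval, the statistic works out to $58/20 = 2.9$, and the p-value is well above 0.05, so the uniform hypothesis would not be rejected for these illustrative counts.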

Akaike Information Criterion (AIC)

The last method for assessing the model is the Akaike Information Criterion (AIC). The AIC measures the goodness of fit of a statistical model while penalising its complexity. It is based on the likelihood function and the number of parameters in the model.

The AIC is given by:

$$AIC = -2 \log(L) + s \cdot r$$

where $L$ is the maximised likelihood, $s$ is the penalty per parameter (usually $s = 2$), and $r$ is the number of parameters in the model. When comparing candidate models, the model with the smallest AIC is preferred.
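As a sketch, the AIC with $s = 2$ can be computed from the fitted log-likelihoods and used to compare the exponential and lognormal fits. The data below are simulated from a lognormal distribution purely for illustration; in practice one would use the theft sample:

```r
set.seed(42)
x <- rlnorm(100, meanlog = 1, sdlog = 0.5)   # simulated data (illustration)

# Exponential fit: MLE of the rate is 1/mean(x); r = 1 parameter
loglik_exp <- sum(dexp(x, rate = 1/mean(x), log = TRUE))
aic_exp <- -2 * loglik_exp + 2 * 1

# Lognormal fit: MLEs of meanlog and sdlog; r = 2 parameters
mu    <- mean(log(x))
sigma <- sqrt(mean((log(x) - mu)^2))
loglik_lnorm <- sum(dlnorm(x, meanlog = mu, sdlog = sigma, log = TRUE))
aic_lnorm <- -2 * loglik_lnorm + 2 * 2

# The smaller AIC indicates the better fit
c(exponential = aic_exp, lognormal = aic_lnorm)
```

Since the data were generated from a lognormal distribution, the lognormal fit attains the smaller AIC here despite its extra parameter.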
