The objective of this vignette is to demonstrate the application of Poisson regression in analyzing insurance data, specifically focusing on the freMTPLfreq and freMTPLsev datasets from Charpentier (2014). These datasets provide information on insurance contracts and claims related to French motor third-party liability insurance.
By leveraging Poisson regression, our goal is to model the frequency of claims and investigate the factors influencing claim occurrence within the insurance data. Understanding these factors is crucial for accurate risk assessment, pricing strategies, and effective decision-making in the insurance industry.
The Data used in this vignette come from French motor third party liability insurance portfolio.
The first dataset, freMTPLfreq, encompasses details regarding contracts and clients obtained from a French insurance company, related to some motor insurance portfolio.
The second dataset, freMTPLsev, contains claims information, from the same company.
For convenience, the freMTPLfreq table will be named CONTRACTS hereafter. The freMTPLsev will be named CLAIMS
Dictionaries
The list of the 10 variables from the freMTPLfreq dataset is reported in Table 1. Similarly, the two variables encapsulated within the freMTPLsev dataset are reported in table Table 2.
Table 1: Content of the freMTPLfreq dataset: CONTRACTS
Attribute
Type
Description
PolicyID
Numeric
Unique identifier for the contract
ClaimNB
Numeric
Number of claims during the exposure period
Exposure
Numeric
Exposure duration in years
Power
Factor
Power of the car (ordered categorical)
CarAge
Numeric
Age of the car in years
DriverAge
Numeric
Age of the driver in years
Brand
Factor
Brand of the car categorized
Gaz
Factor
Type of fuel used (diesel or regular)
Region
Factor
Region in France where the driver resides
Density
Numeric
Population density in the city where the driver resides
Table 2: Content of the freMTPLsev dataset: CLAIMS
In the realm of automobile insurance, Poisson regression is a reliable tool for understanding and predicting accident frequencies, repair costs, and claims trends.
By utilizing Poisson regression, insurers can anticipate forthcoming challenges, refine pricing strategies, and ensure resilience in a dynamic landscape of risk.
Pay Attention
The results from Poisson regression models are valid if:
the responses are independent.
the responses are distributed according to a Poisson distribution with parameter Lambda.
In this analysis, we will explore the relationship between the response variable ClaimNb and the explanatory variables DriverAge and Density. This modeling framework aligns with the principles outlined by Agresti (2013), a prominent figure in statistical methodology, who emphasizes the significance of considering multiple explanatory factors in regression analysis.
To model the frequency of insurance claims, we employ a Poisson regression approach. The response variable in our model, denoted as ClaimNb, represents the count of insurance claims and is assumed to follow a Poisson distribution:
ClaimNb ∼ Poisson(λ),
where λ is the mean rate of claims. The Poisson regression model relates λ to a set of predictor variables through a logarithmic link function. This link function ensures that the predicted rate of claims is always positive, as required by the Poisson distribution. More precisely, we express the natural logarithm of λ as a linear combination of the predictors:
log (λ) = β0 + β1 × DriverAge + β2 × Density,
where DriverAge represents the age of the driver, Density indicates the population density of the city in which the driver resides, and β0, β1, and β2 are the regression coefficients that need to be estimated.
The estimated lambda parameter, which represents the mean of claims, is: 0.05.
Call:
glm(formula = ClaimNb ~ DriverAge + Density, family = poisson,
data = CONTRACTS.f)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.5436 0.0843 -30.16 < 2e-16 ***
DriverAge(22,26] -0.3902 0.1009 -3.87 0.00011 ***
DriverAge(26,42] -0.7148 0.0819 -8.73 < 2e-16 ***
DriverAge(42,74] -0.7152 0.0807 -8.87 < 2e-16 ***
DriverAge(74,Inf] -0.6946 0.0924 -7.52 5.5e-14 ***
Density(40,200] 0.1721 0.0377 4.57 4.9e-06 ***
Density(200,500] 0.2965 0.0421 7.04 1.9e-12 ***
Density(500,4.5e+03] 0.4935 0.0375 13.15 < 2e-16 ***
Density(4.5e+03,Inf] 0.6987 0.0583 11.98 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 42434 on 135106 degrees of freedom
Residual deviance: 42071 on 135098 degrees of freedom
AIC: 55722
Number of Fisher Scoring iterations: 6
dispersiontest(reg)
Overdispersion test
data: reg
z = 7, p-value = 5e-12
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
1
This is a Poisson regression model predicting ClaimNb (number of claims) with DriverAge and Density as predictors. The coefficients in the model indicate the change in the log count of claims associated with each predictor level compared to a reference level.
For example, as DriverAge increases from 22 to 26, the log count of claims decreases by 0.39. Similarly, as Density increases within each category, the log count of claims increases.
All coefficients in the model are statistically significant, reinforcing the reliability of these predictors in explaining the variation in the number of insurance claims.
Presence of overdispersion
Overdispersion can lead to an inadequate model fit, inflated standard errors, incorrect inference, and biased parameter estimates. As such, the following interpretations should be approached with caution. Addressing overdispersion may require alternative modeling approaches, such as negative binomial regression or quasi-Poisson regression, which offer greater flexibility in capturing the variability in the data.
reg_coef <-tidy(reg)reg_coef$p.value <-format(reg_coef$p.value, scientific =TRUE, digits =3)reg_coef <- reg_coef |>mutate(significance =case_when( p.value <0.001~"***", p.value <0.01~"**", p.value <0.05~"*",TRUE~"" ))kable(reg_coef, format ="html", escape =FALSE) |>kable_styling(full_width =FALSE) |>add_footnote(c("Significance levels: *** p < 0.001, ** p < 0.01, * p < 0.05"), notation ="none")
Table 3: Coefficients
term
estimate
std.error
statistic
p.value
significance
(Intercept)
-2.54
0.08
-30.2
7.79e-200
DriverAge(22,26]
-0.39
0.10
-3.9
1.10e-04
***
DriverAge(26,42]
-0.71
0.08
-8.7
2.64e-18
***
DriverAge(42,74]
-0.72
0.08
-8.9
7.56e-19
***
DriverAge(74,Inf]
-0.69
0.09
-7.5
5.46e-14
***
Density(40,200]
0.17
0.04
4.6
4.92e-06
***
Density(200,500]
0.30
0.04
7.0
1.91e-12
***
Density(500,4.5e+03]
0.49
0.04
13.2
1.74e-39
***
Density(4.5e+03,Inf]
0.70
0.06
12.0
4.65e-33
***
Significance levels: *** p < 0.001, ** p < 0.01, * p < 0.05
Code to create the table
reg_count_ratio <-tidy(exp(coef(reg)[-1]))reg_count_ratio <- reg_count_ratio |>mutate(p.value = reg_coef$p.value[-1]) |>mutate(significance =case_when( p.value <0.001~"***", p.value <0.01~"**", p.value <0.05~"*",TRUE~"" )) |> dplyr::select(-p.value)kable(reg_count_ratio, format ="html", escape =FALSE) |>kable_styling(full_width =FALSE) |>add_footnote(c("Significance levels: *** p < 0.001, ** p < 0.01, * p < 0.05"), notation ="none")
Table 4: Count Ratio
names
x
significance
DriverAge(22,26]
0.68
***
DriverAge(26,42]
0.49
***
DriverAge(42,74]
0.49
***
DriverAge(74,Inf]
0.50
***
Density(40,200]
1.19
***
Density(200,500]
1.35
***
Density(500,4.5e+03]
1.64
***
Density(4.5e+03,Inf]
2.01
***
Significance levels: *** p < 0.001, ** p < 0.01, * p < 0.05
Each count ratio represents the change in the count of making a claim associated with a one-unit increase in the predictor variable, compared to the reference category DriverAge (18,22]. For example, a count ratio of 0.68 for DriverAge (22,26] implies that the count of making a claim for individuals aged 22 to 26 is approximately 32% lower compared to the reference category.
Similarly, count ratios above 1 for Density categories suggest an increase in the count of making a claim as the population density increases within each category.
For more similar claim frequency datasets with a Poisson-like distribution, see ausprivauto0405 (import with data("ausprivauto0405")): Australian automobile dataset, norauto: Norwegian automobile dataset (import with data("norauto")), beMTPL16: Belgian automobile dataset (import with data("beMTPL16")), or pg17trainpol (import with data("pg17trainpol")).