Title: | Zero-Inflated and Hurdle Modelling Using Bayesian Inference |
---|---|
Description: | When considering count data, it is often the case that many more zero counts than would be expected of some given distribution are observed. It is well-established that data such as this can be reliably modelled using zero-inflated or hurdle distributions, both of which may be applied using hurdlr functions. Bayesian analysis methods are used to best model problematic count data that cannot be fit to any typical distribution. The hurdlr package functions are flexible and versatile, and can be applied to varying count distributions, parameter estimation with or without covariate information, and are able to allow for multiple hurdles as it is also not uncommon that count data have an abundance of large-number observations which would be considered outliers of the typical distribution. In lieu of throwing out data or mis-specifying the typical distribution, these extreme observations can be applied to a second, extreme distribution. With the given functions of the hurdlr package, such a two-hurdle model may be easily specified in order to best manage data that is both zero-inflated and over-dispersed. |
Authors: | Taylor Trippe [aut, cre], Earvin Balderama [aut] |
Maintainer: | Taylor Trippe <[email protected]> |
License: | GPL (>=2) |
Version: | 0.1 |
Built: | 2025-02-28 05:47:09 UTC |
Source: | https://github.com/ebalderama/hurdlr |
dist_ll
is the data likelihood fuction for hurdle model
regression using hurdle
.
dist_ll(y, hurd = Inf, lam = NULL, size = 1, mu = NULL, xi = NULL, sigma = NULL, dist = c("poisson", "nb", "lognormal", "gpd"), g.x = F, log = T)
dist_ll(y, hurd = Inf, lam = NULL, size = 1, mu = NULL, xi = NULL, sigma = NULL, dist = c("poisson", "nb", "lognormal", "gpd"), g.x = F, log = T)
y |
numeric response vector. |
hurd |
numeric threshold for 'extreme' observations of two-hurdle
models. |
lam |
current value for the poisson likelihood lambda parameter. |
size |
size parameter for negative binomial likelihood distributions. |
mu |
current value for the negative binomial or log normal likelihood mu parameter. |
xi |
current value for the generalized pareto likelihood xi parameter. |
sigma |
current value for the generalized pareto likelihood sigma parameter. |
dist |
character specification of response distribution. |
g.x |
logical operator. |
log |
logical operator. if |
Currently, Poisson, Negative Binomial, log-Normal, and Generalized Pareto distributions are available.
The log-likelihood of the zero-inflated Poisson fit for the current iteration of the MCMC algorithm.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
Density, distribution function, quantile function and random
generation for the Generalized Pareto distribution with parameters
mu
, sigma
, and xi
.
dgpd(x, mu = 0, sigma = 1, xi = 1, log = F) mgpd(x, mu = 0, sigma = 1, xi = 1, log = F) pgpd(q, mu = 0, sigma = 1, xi = 1, lower.tail = T) qgpd(p, mu = 0, sigma = 1, xi = 1, lower.tail = T) rgpd(n, mu = 0, sigma = 1, xi = 1)
dgpd(x, mu = 0, sigma = 1, xi = 1, log = F) mgpd(x, mu = 0, sigma = 1, xi = 1, log = F) pgpd(q, mu = 0, sigma = 1, xi = 1, lower.tail = T) qgpd(p, mu = 0, sigma = 1, xi = 1, lower.tail = T) rgpd(n, mu = 0, sigma = 1, xi = 1)
x , q
|
vector of quantiles. |
mu |
location parameter. |
sigma |
(non-negative) scale parameter. |
xi |
shape parameter. |
log |
logical; if |
lower.tail |
logical; if |
p |
numeric predictor matrix. |
n |
number of random values to return. |
The generalized pareto distribution has density
dgpd
gives the continuous density, pgpd
gives the distribution
function, qgpd
gives the quantile function, and rgpd
generates random deviates.
mgpd
gives a probability mass function for a discretized version of GPD.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
dexp(1,rate=.5) #Exp(rate) equivalent to gpd with mu=0 AND xi=0, and sigma=1/rate. dgpd(1,mu=0,sigma=2,xi=0) #cannot take xi=0. dgpd(1,mu=0,sigma=2,xi=0.0000001) #but can get close. ##"mass" function of GPD mgpd(8) == pgpd(8.5) - pgpd(7.5)
dexp(1,rate=.5) #Exp(rate) equivalent to gpd with mu=0 AND xi=0, and sigma=1/rate. dgpd(1,mu=0,sigma=2,xi=0) #cannot take xi=0. dgpd(1,mu=0,sigma=2,xi=0.0000001) #but can get close. ##"mass" function of GPD mgpd(8) == pgpd(8.5) - pgpd(7.5)
hurdle
is used to fit single or
double-hurdle regression models to count data via Bayesian inference.
hurdle(y, x = NULL, hurd = Inf, dist = c("poisson", "nb", "lognormal"), dist.2 = c("gpd", "poisson", "lognormal", "nb"), control = hurdle_control(), iters = 1000, burn = 500, nthin = 1, plots = FALSE, progress.bar = TRUE)
hurdle(y, x = NULL, hurd = Inf, dist = c("poisson", "nb", "lognormal"), dist.2 = c("gpd", "poisson", "lognormal", "nb"), control = hurdle_control(), iters = 1000, burn = 500, nthin = 1, plots = FALSE, progress.bar = TRUE)
y |
numeric response vector. |
x |
numeric predictor matrix. |
hurd |
numeric threshold for 'extreme' observations of two-hurdle models.
|
dist |
character specification of response distribution. |
dist.2 |
character specification of response distribution for 'extreme' observations of two-hurdle models. |
control |
list of parameters for controlling the fitting process,
specified by |
iters |
number of iterations for the Markov chain to run. |
burn |
numeric burn-in length. |
nthin |
numeric thinning rate. |
plots |
logical operator. |
progress.bar |
logical operator. |
Setting dist
and dist.2
to be the same distribution creates a
single dist
-hurdle model, not a double-hurdle model. However, this
is being considered in future package updates.
hurdle
returns a list which includes the items
measure of model dimensionality where
) is the
Deviance Information Criterion where
Posterior Predictive Ordinate (PPO) measure of fit
Conditional Predictive Ordinate (CPO) measure of fit
posterior mean(s) of third-component parameter(s) if
hurd != Inf
posterior means of the log-likelihood distributions of all model components
posterior means regression coefficients
posterior deviation where
posterior distributions of regression coefficients
posterior distribution(s) of third-component parameter(s) if
hurd != Inf
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
#Generate some data: p=0.5; q=0.25; lam=3; mu=10; sigma=7; xi=0.75; n=200 set.seed(2016) y <- rbinom(n,1,p) nz <- sum(1-y) extremes <- rbinom(sum(y),1,q) ne <- sum(extremes) nt <- n-nz-ne yt <- sample(mu-1,nt,replace=TRUE,prob=dpois(1:(mu-1),3)/(ppois(mu-1,lam)-ppois(0,lam))) yz <- round(rgpd(nz,mu,sigma,xi)) y[y==1] <- c(yt,yz)
#Generate some data: p=0.5; q=0.25; lam=3; mu=10; sigma=7; xi=0.75; n=200 set.seed(2016) y <- rbinom(n,1,p) nz <- sum(1-y) extremes <- rbinom(sum(y),1,q) ne <- sum(extremes) nt <- n-nz-ne yt <- sample(mu-1,nt,replace=TRUE,prob=dpois(1:(mu-1),3)/(ppois(mu-1,lam)-ppois(0,lam))) yz <- round(rgpd(nz,mu,sigma,xi)) y[y==1] <- c(yt,yz)
Various parameters for fitting control of hurdle
model regression using hurdle
.
hurdle_control(a = 1, b = 1, size = 1, beta.prior.mean = 0, beta.prior.sd = 1000, beta.tune = 1, pars.tune = 0.2, lam.start = 1, mu.start = 1, sigma.start = 1, xi.start = 1)
hurdle_control(a = 1, b = 1, size = 1, beta.prior.mean = 0, beta.prior.sd = 1000, beta.tune = 1, pars.tune = 0.2, lam.start = 1, mu.start = 1, sigma.start = 1, xi.start = 1)
a |
shape parameter for gamma prior distributions. |
b |
rate parameter for gamma prior distributions. |
size |
size parameter for negative binomial likelihood distributions. |
beta.prior.mean |
mu parameter for normal prior distributions. |
beta.prior.sd |
standard deviation for normal prior distributions. |
beta.tune |
Markov-chain tuning for regression coefficient estimation. |
pars.tune |
Markov chain tuning for parameter estimation of 'extreme' observations distribution. |
lam.start |
initial value for the poisson likelihood lambda parameter. |
mu.start |
initial value for the negative binomial or log normal likelihood mu parameter. |
sigma.start |
initial value for the generalized pareto likelihood sigma parameter. |
xi.start |
initial value for the generalized pareto likelihood xi parameter. |
A list of all input values.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
Data likelihood fuction for zero-inflated negative binomial
model regression using zero_nb
.
loglik_zinb(y, z, mu, size, p)
loglik_zinb(y, z, mu, size, p)
y |
numeric response vector. |
z |
vector of binary operators. |
mu |
current value for the negative binomial likelihood mu parameter. |
size |
size parameter for negative binomial distribution. |
p |
vector of 'extra' zero-count probabilities. |
The log-likelihood of the zero-inflated negative binomial fit for the current iteration of the MCMC algorithm.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
Data likelihood fuction for zero-inflated Poisson model
regression using zero_poisson
.
loglik_zip(y, z, lam, p)
loglik_zip(y, z, lam, p)
y |
numeric response vector. |
z |
vector of binary operators. |
lam |
current value for the Poisson likelihood lambda parameter. |
p |
vector of 'extra' zero-count probabilities. |
The log-likelihood of the zero-inflated Poisson fit for the current iteration of the MCMC algorithm.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
Density function of the discrete log normal distribution
whose logarithm has mean equal to meanlog
and standard deviation
equal to sdlog
.
mlnorm(x, meanlog = 0, sdlog = 1, log = T)
mlnorm(x, meanlog = 0, sdlog = 1, log = T)
x |
vector of quantiles. |
meanlog |
mean of the distribution on the log scale. |
sdlog |
standard deviation of the distribution on the log scale. |
log |
logical; if |
Discrete log-normal distributional density.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
PE
is used to calculate the likelihood of a user-defined
'extreme' value count observation in a double-hurdle regression model.
PE(p, q, log = T)
PE(p, q, log = T)
p |
vector of zero-count probabilities. |
q |
vector of 'extreme' count probabilities. |
log |
logical operator. If |
A vector of probabilities.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
PT
is used to calculate the likelihood of a user-defined
'typical' value count observation in a double-hurdle regression model.
PT(p, q, log = T)
PT(p, q, log = T)
p |
vector of zero-count probabilities. |
q |
vector of 'typical' count probabilities. |
log |
logical operator. If |
A vector of probabilities.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
PZ
is used to calculate the likelihood of a
zero-value count observation in a single or double-hurdle regression model.
PZ(p, log = T)
PZ(p, log = T)
p |
vector of zero-count probabilities. |
log |
logical operator. If |
A vector of probabilities.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
MCMC algorithm for updating the second-component likelihood
parameters in hurdle model regression using hurdle
.
update_beta(y, x, hurd, dist, like.part, beta.prior.mean, beta.prior.sd, beta, XB, beta.acc, beta.tune, g.x = F)
update_beta(y, x, hurd, dist, like.part, beta.prior.mean, beta.prior.sd, beta, XB, beta.acc, beta.tune, g.x = F)
y |
numeric response vector of observations within the bounds of the
second component of the likelihood function, |
x |
optional numeric predictor matrix for response observations within
the bounds of the second component of the likelihood function,
|
hurd |
numeric threshold for 'extreme' observations of two-hurdle models. |
dist |
character specification of response distribution for the third component of the likelihood function. |
like.part |
numeric vector of the current third-component likelihood values. |
beta.prior.mean |
mu parameter for normal prior distributions. |
beta.prior.sd |
standard deviation for normal prior distributions. |
beta |
numeric matrix of current regression coefficient parameter values. |
XB |
|
beta.acc |
numeric matrix of current MCMC acceptance rates for regression coefficient parameters. |
beta.tune |
numeric matrix of current MCMC tuning values for regression coefficient estimation. |
g.x |
logical operator. |
A list of MCMC-updated regression coefficients for the estimation of the second-component likelihood parameter as well as each coefficient's MCMC acceptance ratio.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
MCMC algorithm for updating the third-component likelihood
parameters in hurdle model regression using hurdle
.
update_pars(y, hurd, dist, like.part, a, b, size, lam, mu, xi, sigma, lam.acc, mu.acc, xi.acc, sigma.acc, lam.tune, mu.tune, xi.tune, sigma.tune, g.x = F)
update_pars(y, hurd, dist, like.part, a, b, size, lam, mu, xi, sigma, lam.acc, mu.acc, xi.acc, sigma.acc, lam.tune, mu.tune, xi.tune, sigma.tune, g.x = F)
y |
numeric response vector of observations within the bounds of the
third component of the likelihood function, |
hurd |
numeric threshold for 'extreme' observations of two-hurdle models. |
dist |
character specification of response distribution for the third component of the likelihood function. |
like.part |
numeric vector of the current third-component likelihood values. |
a |
shape parameter for gamma prior distributions. |
b |
rate parameter for gamma prior distributions. |
size |
size parameter for negative binomial likelihood distributions. |
lam |
current value for the poisson likelihood lambda parameter. |
mu |
current value for the negative binomial or log normal likelihood mu parameter. |
xi |
current value for the generalized pareto likelihood xi parameter. |
sigma |
current value for the generalized pareto likelihood sigma parameter. |
lam.acc , mu.acc , xi.acc , sigma.acc
|
current MCMC values for third-component parameter acceptance rates. |
lam.tune , mu.tune , xi.tune , sigma.tune
|
current MCMC tuning values for each third-component parameter. |
g.x |
logical operator. |
A list of MCMC-updated likelihood estimator(s) for the third-component parameter(s) and each parameter's MCMC acceptance ratio.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
MCMC algorithm for updating the likelihood probabilities in
hurdle model regression using hurdle
.
update_probs(y, x, hurd, p, q, beta.prior.mean, beta.prior.sd, pZ, pT, pE, beta, XB2, XB3, beta.acc, beta.tune)
update_probs(y, x, hurd, p, q, beta.prior.mean, beta.prior.sd, pZ, pT, pE, beta, XB2, XB3, beta.acc, beta.tune)
y |
numeric response vector. |
x |
optional numeric predictor matrix. |
hurd |
numeric threshold for 'extreme' observations of two-hurdle models. |
p |
numeric vector of current 'p' probability parameter values for zero-value observations. |
q |
numeric vector of current 'q' probability parameter values for 'extreme' observations. |
beta.prior.mean |
mu parameter for normal prior distributions. |
beta.prior.sd |
standard deviation for normal prior distributions. |
pZ |
numeric vector of current 'zero probability' likelihood values. |
pT |
numeric vector of current 'typical probability' likelihood values. |
pE |
numeric vector of current 'extreme probability' likelihood values. |
beta |
numeric matrix of current regression coefficient parameter values. |
XB2 |
|
XB3 |
|
beta.acc |
numeric matrix of current MCMC acceptance rates for regression coefficient parameters. |
beta.tune |
numeric matrix of current MCMC tuning values for regression coefficient estimation. |
A list of MCMC-updated regression coefficients for the estimation of the parameters 'p' (the probability of a zero-value observation) and 'q' (the probability of an 'extreme' observation) as well as each coefficient's MCMC acceptance ratio.
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
zero_nb
is used to fit zero-inflated
negative binomial regression models to count data via Bayesian inference.
zero_nb(y, x, size, a = 1, b = 1, mu.start = 1, beta.prior.mean = 0, beta.prior.sd = 1, iters = 1000, burn = 500, nthin = 1, plots = T, progress.bar = T)
zero_nb(y, x, size, a = 1, b = 1, mu.start = 1, beta.prior.mean = 0, beta.prior.sd = 1, iters = 1000, burn = 500, nthin = 1, plots = T, progress.bar = T)
y |
numeric response vector. |
x |
numeric predictor matrix. |
size |
size parameter for negative binomial likelihood distributions. |
a |
shape parameter for gamma prior distributions. |
b |
rate parameter for gamma prior distributions. |
mu.start |
initial value for mu parameter. |
beta.prior.mean |
mu parameter for normal prior distributions. |
beta.prior.sd |
standard deviation for normal prior distributions. |
iters |
number of iterations for the Markov chain to run. |
burn |
numeric burn-in length. |
nthin |
numeric thinning rate. |
plots |
logical operator. |
progress.bar |
logical operator. |
Fits a zero-inflated negative binomial (ZINB) model.
zero_nb
returns a list which includes the items
numeric vector; posterior distribution of mu parameter
numeric matrix; posterior distributions of regression coefficients
numeric vector; posterior distribution of parameter 'p', the probability of a given zero observation belonging to the model's zero component
numeric vector; posterior log-likelihood
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>
zero_poisson
is used to fit zero-inflated
poisson regression models to count data via Bayesian inference.
zero_poisson(y, x, a = 1, b = 1, lam.start = 1, beta.prior.mean = 0, beta.prior.sd = 1, iters = 1000, burn = 500, nthin = 1, plots = T, progress.bar = T)
zero_poisson(y, x, a = 1, b = 1, lam.start = 1, beta.prior.mean = 0, beta.prior.sd = 1, iters = 1000, burn = 500, nthin = 1, plots = T, progress.bar = T)
y |
numeric response vector. |
x |
numeric predictor matrix. |
a |
shape parameter for gamma prior distributions. |
b |
rate parameter for gamma prior distributions. |
lam.start |
initial value for lambda parameter. |
beta.prior.mean |
mu parameter for normal prior distributions. |
beta.prior.sd |
standard deviation for normal prior distributions. |
iters |
number of iterations for the Markov chain to run. |
burn |
numeric burn-in length. |
nthin |
numeric thinning rate. |
plots |
logical operator. |
progress.bar |
logical operator. |
Fits a zero-inflated Poisson (ZIP) model.
zero_poisson
returns a list which includes the items
numeric vector; posterior distribution of lambda parameter
numeric matrix; posterior distributions of regression coefficients
numeric vector; posterior distribution of parameter 'p', the probability of a given zero observation belonging to the model's zero component
numeric vector; posterior log-likelihood
Taylor Trippe <[email protected]>
Earvin Balderama <[email protected]>