[BUG] Update GLM Normal and Gamma distribution parameter calculations to use self.scale_ #718
Conversation
fkiraly
left a comment
Can you please explain how you infer that using `self.scale_` is the right thing to do?
This seems to be a scalar, rather than a value that gets predicted for each point in the prediction; and, in particular, a scalar that is fitted in `fit`, rather than coming out of `predict`.
So, it seems wrong? Can you please explain?
Hi @fkiraly, under the GLM formulation, Var(Y|X) = φ · V(μ), where φ is the dispersion and V is the variance function. For the families used here, V(μ) = 1 (Normal) and V(μ) = μ² (Gamma).
So even though the scale is a scalar, the predictive variance is not necessarily constant. `self.scale_` estimates the dispersion φ of the data and is the quantity used by statsmodels to model Var(Y|X). Using it is therefore consistent with both GLM theory and the statsmodels implementation when constructing P(Y|X). Let me know if I'm wrong somewhere or if any changes are needed; this is something I discovered while working with the Gamma distribution.
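As an illustration of the point above (the numbers are made up for the sketch, not taken from the PR), a scalar dispersion still gives a per-point predictive variance through the variance function:

```python
import numpy as np

# Hypothetical values: phi stands in for the fitted self.scale_,
# mu for the per-point predicted means.
phi = 0.5
mu = np.array([1.0, 2.0, 4.0])

# Var(Y|X) = phi * V(mu)
var_normal = phi * np.ones_like(mu)  # Normal: V(mu) = 1, constant variance
var_gamma = phi * mu ** 2            # Gamma:  V(mu) = mu**2, grows with mu
```

Even with a single scalar `phi`, the Gamma predictive variance differs across points because `V(mu)` does.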
Oh ok, can I confirm what you are saying:
Please correct me if you think I am wrong.
Yes, that's correct.
About `get_prediction`: the statsmodels documentation describes the returned standard errors as standard errors of the predicted mean, and says that the prediction results are used to construct confidence intervals for the mean. This indicates that the associated standard errors (including `mean_se`) quantify estimation uncertainty of the mean, not the variability of the response.
Separately, for the Gamma family, the variance function is documented as V(μ) = μ² (www.statsmodels.org/stable/generated/statsmodels.genmod.families.family.Gamma.html), i.e., this is implying that the conditional variance of Y scales with μ², while the dispersion remains a fitted scalar.
Hm, this reference only tells me that there is a scalar dispersion parameter. Though I think what you are saying is very plausible. It is just very unfortunate that the documentation is not more explicit about this.
Yeah, it's true that the documentation is not fully explicit. However, we can verify the correctness of the theoretical GLM assumption empirically. I ran a simulation with a known Gamma distribution (true variance = 50.0) and increasing sample size N. You can check it out here:
As N grows, `mean_se` shrinks toward 0, while the `scale`-based variance stays near the true value. This should confirm that using `mean_se` for the predictive distribution is incorrect, as it would lead to arbitrarily narrow confidence intervals for large datasets, whereas `scale` allows us to recover the true data dispersion.
Ok, that is a very strong argument. You made a mistake in the exact expression for the predictive variance, but this does not defeat the argument.
Reference Issues/PRs
Fixes #717
What does this implement/fix? Explain your changes.
This PR fixes the construction of predictive distributions in `GLMRegressor` for the Normal and Gamma families.
Previously, the implementation used `mean_se` from `get_prediction().summary_frame()` to derive `sigma` (Normal) and `alpha`/`beta` (Gamma). However, `mean_se` represents the standard error of the estimated mean, not the conditional variance of the response.
This PR updates the implementation to use the fitted dispersion parameter (`scale_`) instead, following standard GLM theory:
- Normal: `sigma = sqrt(scale_)`
- Gamma: `alpha = 1 / scale_`, `beta = 1 / (scale_ * mu)`

This ensures that predictive distributions model observation noise rather than estimation uncertainty.
Does your contribution introduce a new dependency? If yes, which one?
No, this change does not introduce any new dependencies.
What should a reviewer concentrate their feedback on?
Whether the use of `scale_` is consistent with the `skpro` distribution interfaces.
Did you add any tests for the change?
No new tests were added in this PR.
I would be happy to add tests for probabilistic calibration if requested.
Any other comments?
This change aligns the probabilistic output of `GLMRegressor` with the theoretical definition of GLMs and the documentation of `statsmodels`.
Feedback and suggestions are welcome.
PR checklist
For all contributions
For new estimators
Added the `python_dependencies` tag and ensured dependency isolation.