Recently, I’ve been working on two problems that might be related to semiotic issues in predictive modeling (i.e. instead of a standard regression table, how can we plot coefficient values in a regression model). To be more specific, I have a variable of interest Y that is observed for several individuals i, with explanatory variables \mathbf{x}_i, year t, in a specific region z_i\in\{A,B,C,D,E\}. Suppose that we have a simple (standard) linear model (forget about time here) y_i=\beta_0+\beta_1x_{1,i}+\cdots+\beta_kx_{k,i}+\sum_j \alpha_j \mathbf{1}(z_i\in j)+\varepsilon_i

Let us forget the temporal effect to focus on the spatial effect today. And consider some simulated dataset. There will be only one (continuous) explanatory variable. And I will generate correlated covariates, just to be more realistic.

n=1000 library(mnormt) r=.5 Sigma=matrix(c(1,r,r,1), 2, 2) set.seed(1) X=rmnorm(n,c(0,0),Sigma) X1=cut(X[,1],c(-100,quantile(X[,1],c(.1,.4,.7,.85)), 100),labels=LETTERS[1:5]) X2=X[,2] Y=5+X[,1]-X[,2]+rnorm(n)/2 db=data.frame(Y,X1,X2) |

Here we have y_i=\beta_0+\beta_1x_{1,i}+\sum_{j\in\{A,B,C,D,E\}} \alpha_j \mathbf{1}(z_i\in j)+\varepsilon_i The goal here is to get to graph to visualize the vector \hat\alpha=(\hat\alpha_A,\cdots,\hat\alpha_E). Let us run the linear regression

reg1=lm(Y~X1+X2,data=db) idx=which(substr(names(reg1$coefficients), 1,2)=="X1") v1=reg1$coefficients[idx] names(v1)=LETTERS[2:5] barplot(v1,col=rgb(0,0,1,.4)) |

Note that it is possible to add some sort of “confidence interval” to discuss significance (or to avoid to spend hours discussing differences in bar heights that are not significantly different)

library(Hmisc) sv1=summary(reg1)$coefficients[idx,2] (bp1=barplot(v1,ylim=range(c(0,v1+2*sv1)))) errbar(bp1[,1],v1,v1-2*sv1,v1+2*sv1,add=TRUE) |

My main concern here is the “reference” that is considered. Should A be the reference? Why not B

db$X1=relevel(db$X1,"B") reg1=lm(Y~X1+X2,data=db) idx=which(substr(names(reg1$coefficients),1,2)=="X1") v1=reg1$coefficients[idx] names(v1)=LETTERS[c(1,3:5)] library(Hmisc) sv1=summary(reg1)$coefficients[idx,2] (bp1=barplot(v1) errbar(bp1[,1],v1,v1-2*sv1,v1+2*sv1,add=TRUE) |

Why not the smallest one? Why not the largest one?… What if there is no simple way to choose. Furthermore, let us get back to the original point, which is that there might be some temporal aspects. More precisely, we can have \hat\alpha^{(t)}=(\hat\alpha_A^{(t)},\cdots,\hat\alpha_E^{(t)}). If we have also \hat\alpha^{(t+1)} and we get another plot, how do we interpret it. If for E the bar is taller, it means that relative to A, the difference has increased. I have the feeling that the interpretation is more complicated because we do not see, on that graph, changes in \hat\alpha^{(t)}_A.

Let us try something else. First, let us get back to the original setting

db$X1=relevel(db$X1,"A") |

Consider here the regression without the intercept, so that all values remain

reg1=lm(Y~0+X1+X2,data=db) idx=which(substr(names(reg1$coefficients),1,2)=="X1") v1=reg2$coefficients[idx] names(v1)=LETTERS[1:5] barplot(v1) |

It can be hard to read, especially if Y takes (very) large values, and you think that barplots should start at 0. But still, having those 5 values is nice. Why not rescale that graph?

A natural idea my be to consider the case where no spatial component is considered, and to look at the difference with that reference.

reg1=lm(Y~1+X2,data=db) reg2=lm(Y~0+X1+X2,data=db) idx=which(substr(names(reg2$coefficients),1,2)=="X1") v1=reg2$coefficients[idx] v2=v1-reg1$coefficients["(Intercept)"] barplot(v2,col=rgb(0,0,1,.4)) sv2=summary(reg2)$coefficients[idx,2] (bp2=barplot(v2,ylim=range(c(v2-2*sv2,v2+2*sv2)))) errbar(bp2[,1],v2,v2-2*sv2,v2+2*sv2,add=TRUE) |

I like that graph, I should admit it. Now, I still have some remaining questions. For instance, can we insure that when only the intercept is considered, the value of \hat\beta_0 is somewhere between \hat\beta_A,\cdots,\hat\beta_E? Is it possible that \hat\beta_A-\hat\beta_0,\cdots,\hat\beta_E-\hat\beta_0 are all positive? In that case, I would find that hard to interpret.

Actually, if I really want values that can be seen as compared to some average, why not consider a (weighted) average of \hat\beta_A,\cdots,\hat\beta_E? (weights being here proportion in each class, in each region)

w=table(db$X1) v3=v1-sum(w*v1)/sum(w) (bp3=barplot(v3,ylim=range(c(v3-2*sv3,v3+2*sv3)))) errbar(bp3[,1],v3,v3-2*sv3,v3+2*sv3,add=TRUE) |

I like that one. But what if, instead of normalizing at the end, we normalize the original dependent variable. By “normalize”, I mean “rescale”, to have a centered variable.

db$Y0=db$Y-mean(db$Y) reg3=lm(Y0~0+X1+X2,data=db) sv3=summary(reg3)$coefficients[idx,2] (bp3=barplot(v3,ylim=range(c(v3-2*sv3,v3+2*sv3)))) errbar(bp3[,1],v3,v3-2*sv3,v3+2*sv3,add=TRUE) |

This one is nice, because it is extremely simple to explain. But what if instead of a linear regression, we add a logistic one (with Y\in\{0,1\})? or a Poisson regression…

So maybe it cannot be the best solution here. Let us try something else… In insurance ratemaking, people like to use “zonier“. It is a two-stage regression. The idea is to run a regression without any spatial components, first. Then, consider the regression of residuals on spatial variables. Here, it would be something like

reg1=lm(Y~1+X2,data=db) reg4=lm(residuals(reg1)~0+db$X1) |

Since we focus on residuals, those are centered, and we have an easy interpretation of respective values

sv4=summary(reg4)$coefficients[idx,2] v4=reg4$coefficients (bp4=barplot(v4,names.arg=LETTERS[1:5]))) errbar(bp4[,1],v4,v4-2*sv4,v4+2*sv4,add=TRUE) |

I guess that it can also be use in generalized linear models, with Pearson (or deviance) residuals.

Another possible idea can be the following. Again, the goal is not to have the true values, but to visualize on a graph how regions can be different. Here, all of them are significantly different. And in region A, Y is smaller, ceteris paribus (other things equal in the sense that we have taken into account x_1). And in region E it is larger. Here, the graph helps to “see” those differences.

Why not consider a completely different graph. What if we plot vector a instead of \alpha, where a_A can be interpreted as the value of the coefficient if we consider region A against “not region A“. What if we consider 5 regressions where dichotomous versions of Z are considered : Z_j=\mathbf{1}_{Z=j}.

v5=sv5=rep(NA,5) names(v5)=LETTERS[1:5] for(k in 1:5){ reg=lm(Y~I(X1==LETTERS[k])+X2,data=db) v5[k]=reg$coefficients[2] sv5[k]=summary(reg)$coefficients[2,2]} |

We can plot that sequence of values, including some confidence intervals (that would be related to significance with respect to all other regions)

(bp5=barplot(v5,ylim=range(c(v5-2*sv5,v5+2*sv5)))) errbar(bp5[,1],v5,v5-2*sv5,v5+2*sv5,add=TRUE) |

Looking at values does not give intuitive results, but I have the feeling that it is easy to explain what we plot (we compare each region to “the rest of the world”), and the ordering of a seems to be consistent with \alpha (but I could not prove it).

Here are some ideas I got. I should be able to provide other graphs, but I would love to discuss with anyone on that topics, to find a proper and nice way to visualize effects of a categorical explanatory variable in a regression model (that can be a logistic one). Comments are open…