This morning, Benoit sent me an email, about an exercise he found in an econometric textbook, about linear regression. Consider the following dataset,

Here, variable X denotes the income, and Y the expenses. The goal was to fit a linear regression (actually, in the email, it was mentioned that we should try to fit an heteroscedastic model, but let us skip this part). So Benoit’s question was more or less: **how do you fit a linear regression from a contingency table?**

Usually, when I got an email on Saturday morning, I try to postpone. But the kids had their circus class, so I had some time to answer. And this did not look like a complex puzzle… Let us import this dataset in R, so that we can start playing

> df=read.table("http://freakonometrics.free.fr/baseexo.csv",sep=";",header=TRUE) > M=as.matrix(df[,2:ncol(df)]) > M[is.na(M)]<-0 > M X14 X19 X21 X23 X25 X27 X29 X31 X33 X35 [1,] 74 13 7 1 0 0 0 0 0 0 [2,] 6 4 2 7 4 0 0 0 0 0 [3,] 2 3 2 2 4 0 0 0 0 0 [4,] 1 1 2 3 3 2 0 0 0 0 [5,] 2 0 1 3 2 0 6 0 0 0 [6,] 2 0 2 1 0 0 1 2 1 0 [7,] 0 0 0 2 0 0 1 1 3 0 [8,] 0 1 0 1 0 0 0 0 2 0 [9,] 0 0 0 0 1 1 0 1 0 1

The first idea I had was to use those counts as weights. Weighted least squares should be perfect. The dataset is built from this matrix,

> W=as.vector(M) > x=df[,1] > X=rep(x,ncol(M)) > y=as.numeric(substr(names(df)[-1],2,3)) > Y=rep(y,each=nrow(M)) > base1=data.frame(X1=X,Y1=Y,W1=W)

Here we have

> head(base1,10) X1 Y1 W1 1 16 14 74 2 23 14 6 3 25 14 2 4 27 14 1 5 29 14 2 6 31 14 2 7 33 14 0 8 35 14 0 9 37 14 0 10 16 19 13

The regression is the following,

> summary(reg1) Call: lm(formula = Y1 ~ X1, data = base1, weights = W1) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.35569 2.03022 2.145 0.038 * X1 0.68263 0.09016 7.572 3.04e-09 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 7.892 on 40 degrees of freedom Multiple R-squared: 0.589, Adjusted R-squared: 0.5787 F-statistic: 57.33 on 1 and 40 DF, p-value: 3.038e-09

It looks like the output is the same as what Benoit found, so we should be happy. Now, I had a second thought. Why not create the implied dataset. Using replicates, we should be able to create the dataset that was used to get this contingency table,

> vX=vY=rep(NA,sum(W)) > sumW=c(0,cumsum(W)) > for(i in 1:length(W)){ + if(W[i]>0){ + vX[(1+sumW[i]):sumW[i+1]]=X[i] + vY[(1+sumW[i]):sumW[i+1]]=Y[i] + }} > base2=data.frame(X2=vX,Y2=vY)

Here, the dataset is much larger, and there is no weight,

> tail(base2,10) X2 Y2 172 31 31 173 33 31 174 37 31 175 31 33 176 33 33 177 33 33 178 33 33 179 35 33 180 35 33 181 37 35

If we run a linear regression on this dataset, we obtain

> summary(reg2) Call: lm(formula = Y2 ~ X2, data = base2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.35569 0.95972 4.538 1.04e-05 *** X2 0.68263 0.04262 16.017 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3.731 on 179 degrees of freedom Multiple R-squared: 0.589, Adjusted R-squared: 0.5867 F-statistic: 256.5 on 1 and 179 DF, p-value: < 2.2e-16

If we compare the two regressions, we have

> rbind(coefficients(summary(reg1)), + coefficients(summary(reg2))) Estimate Std. Error t value Pr(>|t|) (Intercept) 4.3556857 2.03021637 2.145429 3.804237e-02 X1 0.6826296 0.09015771 7.571506 3.038443e-09 (Intercept) 4.3556857 0.95972279 4.538483 1.036711e-05 X2 0.6826296 0.04261930 16.016913 2.115373e-36

The estimators are exactly the same (which does not surprise me), but standard deviation (ans significance levels) are quite different. And to be honest, I find that surprising. Which approach here is the most legitimate (since they are finally not equivalent)?

This may help (frequency vs analytical weights)

http://www.parisschoolofeconomics.eu/docs/dupraz-yannick/using-weights-in-stata(1).pdf

Arthur, that was a question I always postponed to follow up on. Thanks for doing that. Now, I have a related question: Given a representative sample, should one use (1) no weighting, (2) weighted samples, (3) or the “implied” data set as you describe above? I have an actual case where I am looking at a demand system and estimated it without weighting (1) and I still think it is right to do that. (3) would seem strange as it also suggests more certainty about values with a higher sample weight (multiplier). What is your take on that?

Awesome, another question ! I like that… this is exactly why I created this blog.

You might be too optimistic using the weighted-sample procedure (correcting the degrees of freedom), since it makes as if all data were observed. For instance, if you study French population with a 1000 sample, the weighted procedure just uses the observed data, when the implied data-set procedure corrects to make as if 70 millions persons where observed. But actually they are not!

The implied data set procedure is valid when data IS observed but one cannot be accessed individual data for privacy reasons or other motives.

Next step : how should we do if, in the above survey example, the data is only available grouped by 20 persons, for privacy reasons? (We therefore have a 50 observations reduced set, standing for an actual 1000 observations set, sampling a 60 millions population) I guess a mix of the 2 procedures is the answer, but which mix?

[to moderator] I actually messed up “weighting sample” and “implied data set procedure” in my previous comment. Please post this instead :

You might be too optimistic using the weighted-sample procedure (correcting the degrees of freedom), since it makes as if all data were observed. For instance, if you study French population with a 1000 sample, the weighted procedure just uses the observed data, when the implied data-set procedure corrects to make as if 70 millions persons where observed. But actually they are not!

The implied data set procedure is valid when data IS observed but one cannot be accessed individual data for privacy reasons or other motives.

Next step : how should we do if, in the above survey example, the data is only available grouped by 20 persons, for privacy reasons? (We therefore have a 50 observations reduced set, standing for an actual 1000 observations set, sampling a 60 millions population) I guess a mix of the 2 procedures is the answer, but which mix?

What you are observing can be summarized as “Frequencies are not weights.” When you expand the data, then N=181 is used in the model/error degrees of freedom and any statistics (such as stderrs and p-values) that use the DF. When you don’t expand the data, you are doing a weighted least squares, where the weight of the i_th observation is w_i / sum(w_i).

SAS regression procedures support two different statements: the FREQ statement is used to specify repeated observations, whereas the WEIGHT statement is used to specify how each obs contributes to the model fit. See http://support.sas.com/kb/22/600.html

this is clever, I like that ! two different statements !

Yes. AFAIK, SPSS uses the weights as frequencies, which I’ve seen cause problems.

Il me semble qu’il y a exactement cet exemple (enfin : cette comparaison entre une régression pondérée et une régression non pondérée) au début de Harmless econometrics. Un truc comme : je veux avoir un coefficient de régression mais, au lieu de données individuelles, je n’ai que des données groupées (si on parle de personnes, je peux avoir l’information qu’au niveau de la commune de résidencce, par exemple). On peut tout à fait faire la régression sur les groupes, pondérés par leurs effectifs, à condition de corriger les degrés de liberté pour les tests.

j’ai l’impression que tu évoques des choses liées à l’ecological fallacy. On travaille dessus avec Baptiste Coulmont, je n’ai pas encore les idées assez claires pour faire un billet propre sur le sujet

Your first count is incorrect. Should be 18 not 16.

yes, there was a typo in the csv file, but still…

Did you notice that ratios between standard deviation estimates from reg1 and reg2 equal to the same value (2.11542) for both intercept and slope? Are you the underlying model when performing reg1 and reg2 are equivalent?

I should add that this is also handled differently between different R functions/packages. Most R packages that employ maximum likelihood do use case weights, I think, including glm.nb() from MASS if I recall correctly while glm() and lm() do not.

From Poisson-like models I know that they are often weigthed by 1/counts. Shouldn’t this also be the case here?

Cheers,

Andrej

Hello Benoit,

I think reg2 gives the correct estimates of the standard deviation. If you use the counts as weights, the model fitting procedure does not “know” that the weights are actually counts (and cannot adjust the degrees of freedoms appropriately, which would affect the estimate of the standard deviation). The absolute magnitude of the weights is not of importance.

Try this:

lm(formula = Y1 ~ X1, data = base1, weights = W1 * 2)

The estimate of the standard deviations is not changed.

On the other hand, the model reg2 has the correct number of degrees of freedom, resulting in “correct” estimates of the standard deviation. If you multiply all counts in your original contingency table by two, the standard deviations will be shrunk by sqrt(2) I believe.

Best greetings,

Sebastian

Just an additional thought:

It can be sensible to use the counts as weights, if take the averages for each x value over all available y values. For example like this:

n <- 10^4

x <- sample(20, n, TRUE)

y <- rnorm(n, 0, sd = 10) + x

w <- tapply(y, x, length)

y.means <- tapply(y, x, mean)

x.values <- as.numeric(names(y.means))

plot(y ~ jitter(x), pch = '.')

points(y.means ~ x.values, col = 'red')

summary(lm(y ~ x))

summary(lm(y.means ~ x.values, weights = w))

The two models yield very similar estimates. The mean of the standard deviation of the coefficients, if the procedure is repeated, should be the same I think.

lm() does _not_ interpret the weights as case weights but just as inversely proportional to the variance. In either case the weighted residual sum of squares is minimized, hence the coefficient estimates are identical. The main difference is that for case weights the total number of observations is the sum of weights sum(w) while for propotionality weights it is the number of weights with non-zero weight sum(w > 0).

Here, vcov(reg1) / vcov(reg2) is 4.475 which corresponds to 179 / 40. 179 are the residual degrees of freedom of reg2: sum(base1$W1) – 2 which can also be computed as df.residual(reg2). Analogously, 40 are the residual degrees of freedom of reg1: sum(base1$W1 > 0) – 2 or just df.residual(reg1).

Here, the case weights are probably more appropriate as there are really 181 observations in the dataset.

Further comments:

– I wasn’t able to access the .csv file.

– The lowest X value in the image is 18 but in the data it appears to be 16.

– Instead of read.table() with additional arguments, read.csv2() could be used and should have the right defaults.

– The as.data.frame() method for “table” objects might be helpful, as it expands to X and Y variable plus Freq (= weights).

– Going from base1 to base2 can be simplified via

base2 <- base1[rep(1:nrow(base1), base1$W1), 1:2]

thanks Achim for your detailed comment ! (on your tip on as.data.frame(as.table(M)))

“lm() does _not_ interpret the weights as case weights”, I have to keep that in mind ! I might have wrongly given that explanation in class when I tried to give heuristics on weighted least squares…

about 16 vs 18, indeed, their was a typo when I typed it (which is not a big deal since I want to compare two regression on the same dataset)