
Occasional Papers

Modelling Census Under-Enumeration - A Logistic Regression Perspective

2. Using Logistic Regression Modelling to Investigate Census Under-Enumeration

The main objective of the analysis in this paper is to find a multivariable logistic regression model that can investigate the independent effect of specific deprivation and Census variables on the proportion of residents imputed in a ward (i.e. synthetic individuals). The advantage of logistic regression over other forms of statistical regression is that it can be used to predict group membership of a binary outcome.

The analysis was carried out in the statistical analysis package SAS. The output first gives the response profile (see Table 2.1), which describes the outcome variable: whether a person is imputed (an event) or not (a nonevent) [Footnote 1].

Logistic regression allows a discrete outcome, such as group membership, to be predicted from a set of variables that may be continuous, discrete, binary or a mixture of these. The fitted model retains only the variables that are significant predictors of the outcome variable. The outcome variable in logistic regression is binary: it takes the value 1 (an event, or success) with probability p and the value 0 (a nonevent, or failure) with probability 1-p.

Another advantage of logistic regression analysis is that it makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related or of equal variance. The difference between logistic and linear regression is reflected both in the choice of a parametric model and the underlying assumptions. Once this difference is accounted for, the methods employed in logistic regression analysis follow the same general principles as linear regression.

After fitting a model, the emphasis shifts from the computation and assessment of significance of the estimated coefficients to the interpretation of their values. The predictive power of a logistic model is then assessed by examining the classification tables: the higher the percentage of cases correctly classified (in this paper, persons correctly predicted as imputed or not imputed), the better the predictive power of the model.
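As an illustration of how a classification table summarises predictive power, the sketch below cross-tabulates observed outcomes against predictions. The function name, the cutoff of 0.5 and the sample values are assumptions made for this example; they are not taken from the SAS output used in the paper.

```python
def classification_table(observed, predicted_prob, cutoff=0.5):
    """Cross-tabulate observed outcomes against predicted outcomes.

    observed: 0/1 values (1 = event, e.g. a person is imputed).
    predicted_prob: fitted event probabilities from some model.
    cutoff: hypothetical threshold at or above which an event is predicted.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(observed, predicted_prob):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            counts["tp"] += 1   # event correctly predicted
        elif y == 0 and predicted == 1:
            counts["fp"] += 1   # nonevent predicted as event
        elif y == 0 and predicted == 0:
            counts["tn"] += 1   # nonevent correctly predicted
        else:
            counts["fn"] += 1   # event predicted as nonevent
    total = sum(counts.values())
    counts["percent_correct"] = 100.0 * (counts["tp"] + counts["tn"]) / total
    return counts


# Made-up values for illustration only.
table = classification_table([1, 1, 0, 0, 0], [0.9, 0.3, 0.2, 0.6, 0.1])
print(table)
```

A different cutoff changes the balance between correctly predicted events and correctly predicted nonevents, so the percentage correct should be read alongside the cutoff used.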

The strength of this type of analysis lies in the fact that an entire set of variables can be taken into account simultaneously. The results then show which variables are independent predictors of the binary variable. For example, the preliminary analyses might show that floor level is highly correlated with under-enumeration, and that wards with a high proportion of residents on the 3rd floor or above have a high number of imputed persons. However, when account is taken of the fact that residents of the 3rd floor or above are much more likely to be resident in flats, and much more likely to be renting, the impact of floor level on the likelihood of a person being imputed is removed. Any impact that floor level has on under-enumeration is better accounted for by other variables, and it can be concluded that it is not floor level that is relevant but tenure and dwelling type. It might be true, however, that floor level is a good predictor of tenure and dwelling type and that, through its effect on these variables, it (indirectly) affects under-enumeration.

Odds ratios

The relationship between the predictor and outcome variables is not a linear function in logistic regression. Instead, the logistic regression function is used, which is expressed in Formula 2.1.
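Formula 2.1 is not reproduced on this page. Assuming it is the standard logistic regression function, it would take the form below, where p is the probability of an event, x_1, ..., x_k are the predictor variables and β_0, ..., β_k are the coefficients to be estimated (this notation is introduced here for illustration and may differ from the paper's own):

```latex
p = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}
         {1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}},
\qquad\text{equivalently}\qquad
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k .
```

The second form shows that the model is linear in the log of the odds rather than in the probability itself.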

Thus the interpretation of any fitted model relies on an understanding of odds ratios. These odds ratios are derived from the estimated coefficients in the model. Odds can be understood as the ratio of the probability of an event occurring to the probability of it not occurring.
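The relationship between probabilities, odds and coefficients can be sketched in a few lines. The function names and input values below are hypothetical; the sketch assumes the usual convention that the odds ratio for a one-unit increase in a predictor is the exponential of its estimated coefficient.

```python
import math


def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return p / (1.0 - p)


def odds_ratio_from_coefficient(beta):
    """Odds ratio for a one-unit increase in a predictor: exp(beta)."""
    return math.exp(beta)


# Illustrative values only, not taken from the paper.
print(odds(0.75))                        # three to one in favour of the event
print(odds_ratio_from_coefficient(0.0))  # a zero coefficient leaves the odds unchanged
```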

An odds ratio of 1 indicates equal odds in the two groups. Therefore, when the 95% confidence interval of an odds ratio estimate contains 1, no inference can be made as to the direction of the change in odds, and the variable is not considered a useful predictor in the logistic model.
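The check described above can be sketched as follows. The example assumes a Wald-type interval, built on the log-odds scale from a coefficient and its standard error and then exponentiated; the paper does not state which interval type was used, and the coefficient values are made up.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) for an approximate
    95% Wald interval, computed on the log-odds scale and exponentiated."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))


def interval_contains_one(lower, upper):
    """True when the interval gives no evidence on the direction of the effect."""
    return lower <= 1.0 <= upper


# Hypothetical coefficients and standard errors, for illustration only.
for beta, se in [(0.1, 0.2), (0.5, 0.1)]:
    ratio, lower, upper = odds_ratio_ci(beta, se)
    print(round(ratio, 3), round(lower, 3), round(upper, 3),
          interval_contains_one(lower, upper))
```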

Over-dispersion

Over-dispersion occurs when binary data exhibit variances larger than those permitted by the binomial model, and is therefore sometimes referred to as ‘extra variation’. It is usually caused by clustering and lack of independence, and it seems to be the norm rather than the exception (McCullagh and Nelder, 1989). For a correctly specified model, the Pearson chi-squared statistic (and the deviance) divided by the degrees of freedom should be approximately equal to one. When these values are much larger than one, the assumption of binomial variability may not be valid and the data exhibit over-dispersion. In effect, when data are over-dispersed, terms in the model appear more significant than they actually are.
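The diagnostic described above can be illustrated for grouped data, where each group (for example a ward) has a count of events out of a number of trials and a fitted probability. This is a minimal sketch with made-up numbers; the function name and arguments are hypothetical and do not reproduce the SAS calculation.

```python
def pearson_dispersion(events, trials, fitted_p, n_params):
    """Pearson chi-squared statistic divided by its degrees of freedom.

    events:   observed number of events in each group
    trials:   number of individuals in each group
    fitted_p: fitted event probability for each group
    n_params: number of parameters estimated in the model
    """
    chi2 = 0.0
    for y, n, p in zip(events, trials, fitted_p):
        expected = n * p
        variance = n * p * (1.0 - p)   # binomial variance
        chi2 += (y - expected) ** 2 / variance
    df = len(events) - n_params
    return chi2 / df


# Three hypothetical groups fitted with a single common probability.
ratio = pearson_dispersion([10, 30, 20], [100, 100, 100], [0.2, 0.2, 0.2], 1)
print(ratio)  # well above 1, which would suggest over-dispersion
```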

The preliminary analysis carried out prior to the logistic modelling suggested that over-dispersion was a problem in this particular data set. Similar individuals are likely to reside in the same area, and this similarity between individuals within a ward produces clustering that is not fully explained by the model. There are also other factors affecting the variability that the modelling process cannot take into account, for example geography, terrain and enumerator effect. Therefore, in order to correct for over-dispersion in the logistic regression analyses, the covariance matrix was multiplied by a heterogeneity factor.
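The effect of this correction can be sketched as follows. The covariance matrix and the factor of 4 are made-up values, and how the heterogeneity factor was estimated in the paper's SAS analysis is not stated here; the sketch only shows that multiplying the covariance matrix by a factor inflates each standard error by the square root of that factor.

```python
import math


def scale_covariance(cov, factor):
    """Multiply every element of a covariance matrix by a heterogeneity factor."""
    return [[factor * value for value in row] for row in cov]


def standard_errors(cov):
    """Standard errors are the square roots of the diagonal elements."""
    return [math.sqrt(cov[i][i]) for i in range(len(cov))]


# Hypothetical 2x2 covariance matrix of two estimated coefficients.
cov = [[0.04, 0.01],
       [0.01, 0.09]]
adjusted = scale_covariance(cov, 4.0)

print(standard_errors(cov))       # unadjusted standard errors
print(standard_errors(adjusted))  # each one doubled, since sqrt(4) = 2
```

Because the coefficient estimates themselves are unchanged, larger standard errors mean wider confidence intervals and fewer terms judged significant.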

 

Footnote 1

The population of Scotland after the 2001 Census was 5,062,011. This was made up of 4,863,024 people counted by the Census and an additional 198,987 people imputed after the Census Coverage Survey.

 


Page last updated: 18 October 2006



© Crown Copyright 2008