
Occasional Paper

Modelling Census Under-Enumeration - A Logistic Regression Perspective

Results - Assessment of How Well the Models Fit the Data

Assessing the fit and the predictive power of the model is a multi-faceted investigation, involving summary tests and measures as well as diagnostic statistics. The most straightforward way to summarise the results of a fitted logistic regression model is a classification table. This table is the result of cross-classifying the outcome variable with a dichotomous variable whose values are derived from the estimated logistic probabilities [Hosmer & Lemeshow (2000) p.156]. To derive the dichotomous variable, a cut-point is chosen and each estimated probability is compared with it: observations whose estimated probability exceeds the cut-point are predicted to be events, and the remainder are predicted to be non-events. A model is therefore said to be of reasonable fit if, based on some criterion, it predicts group membership correctly.
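As a minimal sketch of how such a table is built, the Python below cross-classifies observed outcomes with predictions obtained by comparing estimated probabilities with a cut-point. The function name, the cut-point of 0.5 and the eight data values are illustrative assumptions, not figures from the census analysis.

```python
from collections import Counter


def classification_table(outcomes, probabilities, cut_point=0.5):
    """Cross-classify observed outcomes (1 = event, 0 = non-event)
    with predictions derived from estimated probabilities."""
    table = Counter()
    for observed, p in zip(outcomes, probabilities):
        # Predict an event when the estimated probability exceeds the cut-point.
        predicted = 1 if p > cut_point else 0
        table[(observed, predicted)] += 1
    return table


# Made-up illustrative data (hypothetical, not from the paper).
outcomes = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

table = classification_table(outcomes, probabilities, cut_point=0.5)
for observed in (1, 0):
    for predicted in (1, 0):
        count = table[(observed, predicted)]
        print(f"observed={observed} predicted={predicted}: {count}")
```

Changing `cut_point` changes the table, which is why a single table gives only a partial picture of how well the model classifies.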

The accuracy of the classification is measured by its sensitivity and specificity. Sensitivity is the ability to predict an event correctly, while specificity is the ability to predict a non-event correctly. A more complete description of the classification accuracy is given by the area under the Receiver Operating Characteristic (ROC) curve. ROC curves plot the probability of detecting a true signal (sensitivity) against the probability of detecting a false signal (1 – specificity) over the entire range of possible cut-points.
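These definitions can be illustrated with a short Python sketch that computes sensitivity and specificity at one cut-point and then traces the points of a ROC curve by sweeping the cut-point over the observed probabilities. The function names and the data are hypothetical, and the convention of predicting an event when the probability is at or above the cut-point is an assumption made for the illustration.

```python
def sensitivity_specificity(outcomes, probabilities, cut_point):
    """Return (sensitivity, specificity) for a given cut-point."""
    tp = fn = tn = fp = 0
    for y, p in zip(outcomes, probabilities):
        predicted_event = p >= cut_point
        if y == 1:
            if predicted_event:
                tp += 1  # event correctly predicted
            else:
                fn += 1  # event missed
        else:
            if predicted_event:
                fp += 1  # non-event wrongly predicted as an event
            else:
                tn += 1  # non-event correctly predicted
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(outcomes, probabilities):
    """Return (1 - specificity, sensitivity) pairs, one per cut-point."""
    points = [(0.0, 0.0)]  # cut-point above every probability
    for cut_point in sorted(set(probabilities), reverse=True):
        sens, spec = sensitivity_specificity(outcomes, probabilities, cut_point)
        points.append((1.0 - spec, sens))
    return points


# Made-up illustrative data (hypothetical, not from the paper).
outcomes = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

sens, spec = sensitivity_specificity(outcomes, probabilities, cut_point=0.5)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
for x, y in roc_points(outcomes, probabilities):
    print(f"1 - specificity={x:.2f}  sensitivity={y:.2f}")
```

Lowering the cut-point raises sensitivity at the expense of specificity, so the curve runs from (0, 0) to (1, 1).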

As a general rule, we require c, the area under the ROC curve, to satisfy 0.5 < c ≤ 1. The value of c indicates how good the model is at discriminating between events (synthetic individuals) and non-events (actual individuals): c = 0.5 corresponds to no discrimination, and c = 1 to perfect discrimination.
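One way to read c is as a concordance probability: over all pairs of one event and one non-event, it is the proportion in which the event receives the higher estimated probability, with ties counted as one half. The Python sketch below computes c in this way. It is an illustration on made-up data with a hypothetical function name, not the procedure used to obtain the figures in this paper.

```python
def concordance_c(outcomes, probabilities):
    """Area under the ROC curve, computed as the share of
    (event, non-event) pairs ranked correctly; ties count one half."""
    event_probs = [p for y, p in zip(outcomes, probabilities) if y == 1]
    non_event_probs = [p for y, p in zip(outcomes, probabilities) if y == 0]
    score = 0.0
    for pe in event_probs:
        for pn in non_event_probs:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(event_probs) * len(non_event_probs))


# Made-up illustrative data (hypothetical, not from the paper).
outcomes = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

c = concordance_c(outcomes, probabilities)
print(f"c = {c:.3f}")

# A model that gives everyone the same probability cannot discriminate.
c_uninformative = concordance_c(outcomes, [0.04] * len(outcomes))
print(f"c (uninformative) = {c_uninformative:.3f}")
```

The pairwise loop is quadratic in the number of observations, so it suits small illustrations only; for a population-sized dataset a rank-based calculation would be needed.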

A good prediction curve is one that lies well above the diagonal line. The graph (Figure 4.7.1) shows the ROC curves for all the models. The diagonal line has been superimposed to show which curve achieves the best ‘lift’ and is therefore the best model.

Furthermore, Figure 4.7.1 shows that even the ‘best’ model does not achieve optimal lift: the area under the curve for the ‘best’ model is 0.631, which is not greatly above the minimum value of 0.5. Classification is sensitive to the relative sizes of the two component groups and always favours classification into the larger group. This is particularly relevant for this dataset, where the model is trying to split the population of Scotland (5,062,011) into 198,987 events (i.e. imputed persons) and 4,863,024 non-events.
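The imbalance between the two groups can be made concrete with simple arithmetic on the counts quoted above. The Python sketch below computes the share of events in the population and the accuracy that a trivial rule would attain by assigning every person to the larger group. The trivial rule and the variable names are assumptions introduced for illustration; they are not part of the models fitted in this paper.

```python
# Counts quoted in the text.
events = 198_987        # imputed persons
non_events = 4_863_024  # actual persons
population = 5_062_011  # population of Scotland

# The two groups should account for the whole population.
assert events + non_events == population

# Share of the population that are events.
event_share = events / population

# Accuracy of a hypothetical rule that classifies everyone as a non-event:
# it is right for every non-event and wrong for every event.
trivial_accuracy = non_events / population

print(f"share of events:  {event_share:.2%}")
print(f"trivial accuracy: {trivial_accuracy:.2%}")
```

Because events make up only about 4% of the population, a rule with no discriminating ability at all would still classify about 96% of people correctly. Overall classification accuracy is therefore a weak guide here, and the area under the ROC curve is the more informative measure.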

Although the lift of the final model is not optimal, the model does have some ability to discriminate between events and non-events. More importantly, the results of the analysis show which variables are significantly associated with under-enumeration.



Page last updated: 17 October 2006



© Crown Copyright 2008