Merging SHS and FRS Data
February 7, 2021
Some notes as I try to add some Scottish Household Survey (SHS) data to my FRS-based dataset.
Why?
Because a lot of my public access FRS is blank. In particular I’ve decided I can’t really proceed with housing-related benefits modelling without Local Housing Allowance identifiers and council taxes. And these aren’t in the public FRS datasets I use.
Plus, there’s loads of good stuff about housing, heating and transport in the SHS which might be useful later on.
How?
There’s some theory about this, and some software; see King et al., Eurostat and the StatMatch software.
I’d really like to replicate StatMatch in Julia.
For now, I’m using a rather hacked, ad-hoc implementation based on King’s Coarsened Exact Matching idea.
There’s a large literature on matching more generally, used as a technique in evaluation studies, but I don’t think much of it is useful for what I’m after here. Propensity Score Matching is fun, and I’d also like to implement a Julia version, but the kind of matching it produces isn’t really useful here, since it matches on scores and not characteristics (a white young male could get matched with a black old female if they have the same score; we need to avoid that, I think).
So I’m just using a hand-coded matching routine: select records from the SHS (the Donor) and the FRS (the Recipient) based on a bunch of characteristics. ‘Coarsened’ here means progressively widening and then dropping characteristics if there are no perfect matches; for example, we might match by tenure type, but if there’s no private renter in the SHS amongst those that match on other useful characteristics, we might find one that rents in any way (e.g. from a council) or, in extremis, drop tenure type as a matching criterion for that observation.
This video is a good intro.
Our strategy has to be slightly different:
- Matching in evaluation is usually done only over observations that actually match (that have ‘common support’, in the jargon): if some bins have just one side (treated/donor, etc.), or some propensity scores don’t match, those observations are dropped. We have to find a match for every FRS observation, since we never want to lose observations, so for some we might just have to use a bad match;
- some matches may be catastrophic - assigning a male health record to a female, for example;
- Coarsened matching applies the same coarsening to every observation (matching on ‘any renting’ for everyone, even where matching on private renting works for some), but we want tight matches where they’re available.
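A minimal sketch of the per-observation idea in the last bullet, written in Python for brevity rather than Julia. It is not the coarse_match function from Utils.jl: the tenure categories, the three-step ladder and the names tenure_key and match_one are all made up for illustration, and numadults simply stands in for any criterion we would never want to relax.

```python
# Hypothetical coarsening ladder for tenure: exact -> renter/owner -> dropped.
RENTERS = {"private_rent", "council_rent", "housing_association"}

def tenure_key(tenure, level):
    """Return the tenure matching key at a given coarsening level."""
    if level == 0:
        return tenure                      # exact tenure type
    if level == 1:
        return "rent" if tenure in RENTERS else "own"
    return None                            # level 2+: tenure is ignored

def match_one(recipient, donors, max_level=2):
    """Find donors for one recipient, coarsening only as far as needed.

    Returns (level, matching donors). The numadults key is never relaxed,
    standing in for criteria whose mismatch would be catastrophic.
    """
    for level in range(max_level + 1):
        hits = [
            d for d in donors
            if d["numadults"] == recipient["numadults"]
            and tenure_key(d["tenure"], level) == tenure_key(recipient["tenure"], level)
        ]
        if hits:
            return level, hits
    return None, []  # nothing found; a real version needs a further fallback

donors = [
    {"id": 1, "tenure": "council_rent", "numadults": 2},
    {"id": 2, "tenure": "owned_outright", "numadults": 2},
    {"id": 3, "tenure": "private_rent", "numadults": 1},
]
frs_a = {"tenure": "private_rent", "numadults": 1}  # an exact match exists
frs_b = {"tenure": "private_rent", "numadults": 2}  # needs 'any renter'
frs_c = {"tenure": "mortgaged", "numadults": 1}     # needs tenure dropped

level_a, hits_a = match_one(frs_a, donors)
level_b, hits_b = match_one(frs_b, donors)
level_c, hits_c = match_one(frs_c, donors)
```

Each recipient stops at the tightest level that yields a donor, so frs_a keeps its exact tenure match while frs_b and frs_c are coarsened only as far as they need to be.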
Li-Chun-ing
This is an idea suggested to us on a previous project by Li-Chun Zhang of the University of Southampton.
We can get an idea of the errors produced by this procedure by recording not just the best match but progressively more coarsened matches, and then using all the matches in your simulation - bootstrapping of a sort.
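Roughly, and only as I understand the idea, this could look like the Python sketch below. The field names, the list of coarsening levels and the helpers matches_by_level and draw_donors are hypothetical, and pooling all levels with equal probability is an assumption made to keep the example short.

```python
import random

def matches_by_level(recipient, donors, keys_by_level):
    """Record the donors matched at each coarsening level, tightest first.

    keys_by_level is a list of tuples of field names; each later tuple
    drops one more criterion.
    """
    out = []
    for keys in keys_by_level:
        hits = [d["id"] for d in donors
                if all(d[k] == recipient[k] for k in keys)]
        out.append(hits)
    return out

def draw_donors(all_matches, n_reps, rng):
    """Pick one donor per replicate from the pooled matches of all levels."""
    pooled = sorted({i for hits in all_matches for i in hits})
    return [rng.choice(pooled) for _ in range(n_reps)]

donors = [
    {"id": 1, "tenure": "rent", "acctype": "flat"},
    {"id": 2, "tenure": "rent", "acctype": "house"},
    {"id": 3, "tenure": "own", "acctype": "house"},
]
recipient = {"tenure": "rent", "acctype": "flat"}
levels = [("tenure", "acctype"), ("tenure",), ()]

all_matches = matches_by_level(recipient, donors, levels)
reps = draw_donors(all_matches, n_reps=5, rng=random.Random(0))
```

Running the simulation once per replicate, each time with a different drawn donor, gives a spread of results that hints at how much the matching itself contributes to the error.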
SHS (Donor) Side
SHS has a seriously weird structure. Not everyone in a household is sampled - instead there’s a randomly chosen person, and there’s also a bunch of stuff for the ‘highest income person’.
Household Characteristics
The object initially is to match in household records. In future I might match in individual-level stuff (health, transport), in which case we’ll need to match a bit differently (include gender, for example, and de-emphasise household characteristics like accommodation type).
- shelter : sheltered accommodation
- tenure : tenure type
- acctype : type of dwelling
- singlepar : lone parent hhld flag
- numadults : num adults
- numkids : num children
- empstathigh : employment status, highest income person (HIP)
- sochigh : socio-economic, HIP
- agehigh : age HIP
- ethnichigh : ethnic group, HIP
- datayear : data year (2016..18)
See this script for the actual SHS->FRS mappings, and the coarse_match function in Utils.jl for a simple matching algorithm.
Todo: match on income, benefit receipts.
The mean of annetinc is 27k in the SHS, but the mean of hhinc in the FRS is 38k, so I need to construct something comparable, or at least figure out how these are constructed.
SHS benefit receipts are also problematic because of the reporting of adults.
… much later …
Oversampling of small councils. Really messes up matching - you end up replicating the oversampling.
To fix this, select randomly from all the matches, with the selection conditioned on sample frequencies.
But restricting the selection to the best matches makes little difference, so use all matches, even bad ones, with the probability of choosing each one a (crude) function of match quality and implied sample weight.
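A minimal Python sketch of that kind of weighted selection. The scoring rule (match quality times sample weight), the quality values and the function names are assumptions for illustration, not the function actually used; the two sample weights are the Orkney Islands and Glasgow City figures from the table.

```python
import random

def selection_probs(candidates):
    """Turn match quality and sample weight into selection probabilities.

    Each candidate is (donor_id, quality, sample_weight); the score
    quality * sample_weight is an assumed, deliberately crude choice.
    """
    scores = [q * w for _, q, w in candidates]
    total = sum(scores)
    return [s / total for s in scores]

def pick_donor(candidates, rng):
    """Draw one donor id at random using the probabilities above."""
    ids = [c[0] for c in candidates]
    return rng.choices(ids, weights=selection_probs(candidates), k=1)[0]

# (donor id, match quality in (0, 1], council sample weight)
candidates = [
    ("shs_orkney", 1.0, 14),     # tight match, oversampled small council
    ("shs_glasgow", 0.5, 105),   # looser match, undersampled large council
]
probs = selection_probs(candidates)
chosen = pick_donor(candidates, random.Random(1))
```

With these made-up qualities the looser Glasgow match is still the more likely pick, because its higher sample weight outweighs its lower quality; that is the sort of trade-off the crude function has to make.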
code | name | sample weight | Modelled hhlds | Actual 2019 hhlds | %diff | |
S12000033 | Aberdeen City | 104 | 99870 | 108381 | 7.85 | |
S12000034 | Aberdeenshire | 107 | 122813 | 112114 | -9.54 | |
S12000041 | Angus | 72 | 56233 | 54221 | -3.71 | |
S12000035 | Argyll and Bute | 55 | 41088 | 41789 | 1.68 | |
S12000036 | City of Edinburgh | 102 | 213161 | 238269 | 10.54 | |
S12000005 | Clackmannanshire | 31 | 24368 | 23890 | -2.00 | |
S12000006 | Dumfries and Galloway | 90 | 72128 | 69699 | -3.49 | |
S12000042 | Dundee City | 88 | 62531 | 70685 | 11.54 | |
S12000008 | East Ayrshire | 74 | 58782 | 55387 | -6.13 | |
S12000045 | East Dunbartonshire | 57 | 54659 | 46228 | -18.24 | |
S12000010 | East Lothian | 56 | 47461 | 46771 | -1.47 | |
S12000011 | East Renfrewshire | 51 | 48210 | 39345 | -22.53 | |
S12000014 | Falkirk | 93 | 75877 | 72672 | -4.41 | |
S12000047 | Fife | 102 | 184519 | 169239 | -9.03 | |
S12000049 | Glasgow City | 105 | 240153 | 294622 | 18.49 | |
S12000017 | Highland | 110 | 120264 | 109514 | -9.82 | |
S12000018 | Inverclyde | 48 | 32990 | 37614 | 12.29 | |
S12000019 | Midlothian | 47 | 39363 | 39733 | 0.93 | |
S12000020 | Moray | 58 | 47011 | 42932 | -9.50 | |
S12000013 | Na h-Eileanan Siar | 14 | 13271 | 12833 | -3.41 | |
S12000021 | North Ayrshire | 87 | 65530 | 64140 | -2.17 | |
S12000050 | North Lanarkshire | 103 | 155653 | 152443 | -2.11 | |
S12000023 | Orkney Islands | 14 | 9147 | 10589 | 13.62 | |
S12000048 | Perth and Kinross | 90 | 68979 | 69003 | 0.03 | |
S12000038 | Renfrewshire | 108 | 89898 | 86683 | -3.71 | |
S12000026 | Scottish Borders | 74 | 53834 | 54715 | 1.61 | |
S12000027 | Shetland Islands | 13 | 11621 | 10439 | -11.33 | |
S12000028 | South Ayrshire | 68 | 57283 | 52588 | -8.93 | |
S12000029 | South Lanarkshire | 109 | 166000 | 147434 | -12.59 | |
S12000030 | Stirling | 49 | 40263 | 39654 | -1.54 | |
S12000039 | West Dunbartonshire | 53 | 40713 | 43030 | 5.39 | |
S12000040 | West Lothian | 94 | 81950 | 78966 | -3.78 | |
totals | | | 2495622 | 2495622 | 0.00 | |
sample weight = number of households in the 2019 NRS estimates / total number of cases for that council in the pooled SHS
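The two calculations behind the table can be sketched in Python as below. The %diff formula is inferred from the table's figures rather than stated anywhere (it reproduces the Aberdeen City row), and the SHS case count used for the weight is invented purely to show the arithmetic.

```python
def sample_weight(households, shs_cases):
    """Households in the council per pooled SHS case."""
    return households / shs_cases

def pct_diff(modelled, actual):
    """Percentage gap between actual and modelled households, relative to actual."""
    return 100.0 * (actual - modelled) / actual

# Aberdeen City row from the table: modelled 99870, actual 108381.
aberdeen_diff = round(pct_diff(99870, 108381), 2)

# The case count of 1000 is made up; the real pooled SHS counts are not shown here.
example_weight = sample_weight(108381, 1000)
```

A positive %diff means the matched dataset under-represents the council (as with Glasgow City), a negative one that it over-represents it (as with East Renfrewshire).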