Merging SHS and FRS Data

February 7, 2021

Some notes as I try to add some Scottish Household Survey (SHS) data to my FRS-based dataset.

Why?

Because a lot of my public access FRS is blank. In particular I’ve decided I can’t really proceed with housing-related benefits modelling without Local Housing Allowance identifiers and council taxes. And these aren’t in the public FRS datasets I use.

Plus, there’s loads of good stuff about housing, heating and transport in the SHS which might be useful later on.

How?

There’s some theory about this, and some software; see King et al., Eurostat and the StatMatch software.

I’d really like to replicate StatMatch in Julia.

For now, I’m using a rather hacked, ad-hoc implementation based on King’s Coarsened Exact Matching idea.

There’s a large literature on matching more generally, used as a technique in evaluation studies, but I don’t think much of it is useful for what I’m after here. Propensity Score Matching is fun - and I’d also like to implement a Julia version - but the kind of matching it produces isn’t really useful here, since it matches on scores and not characteristics (a white young male could get matched with a black old female if they have the same score - we need to avoid that, I think).

So I’m just using a hand-coded matching thing: a hand-written program that selects records from SHS (the Donor) and FRS (the Recipient) based on a bunch of characteristics. ‘Coarsened’ here means progressively widening and then dropping characteristics if there are no perfect matches; for example, we might match by tenure type, but if there’s no private renter in the SHS amongst those that match on the other useful characteristics, we might find one that rents in any way (e.g. from a council) or, in extremis, drop tenure type as a matching criterion for that observation.
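A minimal sketch of that widen-then-drop idea, in Python rather than Julia. Everything here is illustrative: the tenure labels, the `coarsen_tenure` ladder and the `find_match` helper are assumptions made up for this example, not the actual `coarse_match` code.

```python
# Hypothetical three-step ladder for tenure:
# level 0 = exact tenure, level 1 = any renter vs owner, level 2 = tenure dropped.
def coarsen_tenure(tenure, level):
    if level == 0:
        return tenure
    if level == 1:
        return "rent" if tenure in ("private_rent", "council_rent") else "own"
    return None  # None means "ignore tenure altogether"


def find_match(recipient, donors, exact_keys=("numadults",)):
    """Return (level, matching donors) at the tightest level that has any match."""
    for level in range(3):
        target = coarsen_tenure(recipient["tenure"], level)
        hits = [
            d for d in donors
            if all(d[k] == recipient[k] for k in exact_keys)
            and (target is None or coarsen_tenure(d["tenure"], level) == target)
        ]
        if hits:
            return level, hits
    return None, []  # nothing matches even with tenure dropped


# Made-up donor (SHS) records.
donors = [
    {"id": "shs1", "tenure": "council_rent", "numadults": 2},
    {"id": "shs2", "tenure": "own", "numadults": 1},
]

# A private renter with two adults: no exact match, so we widen to "any renter".
level, hits = find_match({"tenure": "private_rent", "numadults": 2}, donors)
print(level, [d["id"] for d in hits])
```

Because the ladder is walked separately for each recipient, an observation that has an exact match keeps it, and only the awkward ones get widened.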

This video is a good intro.

Our strategy has to be slightly different:

Matching in evaluation is usually done only over observations that actually match (have ‘common support’ in the jargon): if some bins have just one side (treated/donor, etc.) or some propensity scores don’t match, those observations are dropped. But we have to find a match for every FRS observation, since we never want to lose observations. So for some, we might just have to use a bad match;
some matches may be catastrophic - assigning a male health record to a female, for example;
Coarsened matching coarsens all observations in the same way (everyone gets matched on ‘any renting’, even if matching on private renting works for some), but we might want tight matches where available.

Li-Chun-ing

This is an idea suggested to us on a previous project by Li-Chun Zhang of the University of Southampton.

We can get an idea of the errors produced by this procedure by recording not just the best match but progressively more coarsened matches, and then using all the matches in your simulation - bootstrapping of a sort.
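As a rough illustration of recording every level rather than only the best one, here is a Python sketch. The `ladder` of key functions, the `donated_values_by_level` helper and the `council_tax` figures are all invented for the example; the point is only that each coarsening level yields its own set of donated values, and the spread across levels hints at how sensitive results are to match quality.

```python
from statistics import mean

# Hypothetical ladder of matching keys, tightest first.
ladder = [
    lambda r: (r["tenure"], r["numadults"]),         # exact tenure
    lambda r: (r["tenure"] != "own", r["numadults"]),  # renter vs owner
    lambda r: (r["numadults"],),                     # tenure dropped
]


def donated_values_by_level(recipient, donors, ladder, var):
    """For each level, collect the donor values of `var` that match at that level."""
    results = []
    for key in ladder:
        target = key(recipient)
        results.append([d[var] for d in donors if key(d) == target])
    return results


# Made-up donor records with a made-up variable to donate.
donors = [
    {"tenure": "private_rent", "numadults": 2, "council_tax": 2},
    {"tenure": "council_rent", "numadults": 2, "council_tax": 1},
    {"tenure": "own", "numadults": 2, "council_tax": 4},
]
recipient = {"tenure": "private_rent", "numadults": 2}

levels = donated_values_by_level(recipient, donors, ladder, "council_tax")
level_means = [mean(v) for v in levels if v]
print(levels)
print(level_means)
```

Running the simulation once per level (or sampling across levels) then gives a range of outcomes instead of a single point estimate.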

SHS (Donor) Side

SHS has a seriously weird structure. Not everyone in a household is sampled - instead there’s a randomly chosen person and there’s also a bunch of stuff for the ‘highest income person’.

Household Characteristics

The initial objective is to match in household records. In future I might match in individual-level stuff (health, transport), in which case we’ll need to match a bit differently (include gender, for example, and de-emphasise household characteristics like accommodation type).

shelter : sheltered accommodation
tenure : tenure type
acctype : type of dwelling
singlepar : lone parent hhld flag
numadults : num adults
numkids : num children
empstathigh : employment status, highest income person (HIP)
sochigh : socio-economic group, HIP
agehigh : age, HIP
ethnichigh : ethnic group, HIP
datayear : data year (2016..18)

See this script for the actual SHS->FRS mappings, and the coarse_match function in Utils.jl for a simple matching algorithm.

Todo: match on income, benefit receipts. The mean of annetinc is 27k in the SHS, but mean hhinc in FRS is 38k, so I need to construct something comparable, or at least figure out the construction of these.

SHS benefit receipts are also problematic because of the reporting of adults.

… much later …

Oversampling of small councils. Really messes up matching - you end up replicating the oversampling.

To fix this, select randomly from all the matches, with the selection conditioned on sample frequencies.

But restricting this to just the best matches makes little difference, so use all matches, even bad ones, with the probability of choosing each one a (crude) function of match quality and implied sample weight.
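One way this weighted draw could look, sketched in Python. The particular weighting, sample weight divided by one plus the coarsening level, is an assumption chosen for illustration; the notes above only say the function is crude. The sample weights 105 and 14 are the Glasgow City and Orkney Islands values from the table below; the donor ids and match levels are made up.

```python
import random


def selection_weights(candidates):
    """candidates: list of (donor_id, match_level, sample_weight).

    Hypothetical rule: a higher sample weight (under-sampled council) raises the
    chance of selection, a coarser match (higher level) lowers it.
    """
    return [sw / (1 + level) for _, level, sw in candidates]


def pick_donor(candidates, rng):
    """Draw one donor id at random, in proportion to selection_weights."""
    ids = [c[0] for c in candidates]
    return rng.choices(ids, weights=selection_weights(candidates), k=1)[0]


candidates = [
    ("glasgow_exact", 0, 105),   # exact match, under-sampled council
    ("orkney_exact", 0, 14),     # exact match, heavily over-sampled council
    ("glasgow_coarse", 1, 105),  # coarser match, under-sampled council
]

rng = random.Random(0)  # seeded so the run is repeatable
draws = [pick_donor(candidates, rng) for _ in range(1000)]
counts = {c[0]: draws.count(c[0]) for c in candidates}
print(selection_weights(candidates))
print(counts)
```

With these weights the over-sampled Orkney record is drawn far less often than the Glasgow ones, which is what stops the matched dataset from inheriting the SHS oversampling.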

code	name	sample weight	Modelled hhlds	Actual 2019 hhlds	%diff
S12000033	Aberdeen City	104	99870	108381	7.85
S12000034	Aberdeenshire	107	122813	112114	-9.54
S12000041	Angus	72	56233	54221	-3.71
S12000035	Argyll and Bute	55	41088	41789	1.68
S12000036	City of Edinburgh	102	213161	238269	10.54
S12000005	Clackmannanshire	31	24368	23890	-2.00
S12000006	Dumfries and Galloway	90	72128	69699	-3.49
S12000042	Dundee City	88	62531	70685	11.54
S12000008	East Ayrshire	74	58782	55387	-6.13
S12000045	East Dunbartonshire	57	54659	46228	-18.24
S12000010	East Lothian	56	47461	46771	-1.47
S12000011	East Renfrewshire	51	48210	39345	-22.53
S12000014	Falkirk	93	75877	72672	-4.41
S12000047	Fife	102	184519	169239	-9.03
S12000049	Glasgow City	105	240153	294622	18.49
S12000017	Highland	110	120264	109514	-9.82
S12000018	Inverclyde	48	32990	37614	12.29
S12000019	Midlothian	47	39363	39733	0.93
S12000020	Moray	58	47011	42932	-9.50
S12000013	Na h-Eileanan Siar	14	13271	12833	-3.41
S12000021	North Ayrshire	87	65530	64140	-2.17
S12000050	North Lanarkshire	103	155653	152443	-2.11
S12000023	Orkney Islands	14	9147	10589	13.62
S12000048	Perth and Kinross	90	68979	69003	0.03
S12000038	Renfrewshire	108	89898	86683	-3.71
S12000026	Scottish Borders	74	53834	54715	1.61
S12000027	Shetland Islands	13	11621	10439	-11.33
S12000028	South Ayrshire	68	57283	52588	-8.93
S12000029	South Lanarkshire	109	166000	147434	-12.59
S12000030	Stirling	49	40263	39654	-1.54
S12000039	West Dunbartonshire	53	40713	43030	5.39
S12000040	West Lothian	94	81950	78966	-3.78
totals			2495622	2495622	0.00

sample weight = number of households in the 2019 NRS estimates / total number of cases for that council in the pooled SHS
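The arithmetic behind the table can be sketched in Python as follows. The `%diff` formula is inferred from the table values (it reproduces the printed figures) rather than stated anywhere in these notes, and the pooled case count of 1042 for Aberdeen City is a hypothetical number picked only because it gives the tabulated weight of 104; the true count isn't shown here.

```python
def sample_weight(actual_households, pooled_cases):
    """Households in the council divided by its number of pooled SHS cases."""
    return actual_households / pooled_cases


def pct_diff(modelled, actual):
    """Percentage by which the actual count exceeds the modelled one (inferred formula)."""
    return 100.0 * (actual - modelled) / actual


# Aberdeen City and East Renfrewshire, modelled vs actual households from the table.
print(round(pct_diff(99870, 108381), 2))   # 7.85
print(round(pct_diff(48210, 39345), 2))    # -22.53

# Hypothetical pooled case count, for illustration only.
print(round(sample_weight(108381, 1042)))  # 104
```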
