# Merging SHS and FRS Data

February 7, 2021

Some notes as I try to add some Scottish Household Survey (SHS) data to my FRS-based dataset.

## Why?

Because a lot of my public access FRS is blank. In particular I’ve decided I can’t really proceed with housing-related benefits modelling without Local Housing Allowance identifiers and council taxes. And these aren’t in the public FRS datasets I use.

Plus, there’s loads of good stuff about housing, heating and transport in the SHS which might be useful later on.

## How

There’s some theory about this, and some software; see King et al., EuroStat and the StatMatch software.

I’d really like to replicate StatMatch in Julia.

For now, I’m using a rather hacked, ad-hoc implementation based on King’s Coarsened Exact Matching idea.

There’s a large literature on matching more generally, used as a technique in evaluation studies, but I don’t think much of it is useful for what I’m after here. Propensity Score Matching is fun, and I’d also like to implement a Julia version, but the kind of matching it produces isn’t really useful here, since it matches on scores and not characteristics (a white young male could get matched with a black old female if they have the same score; we need to avoid that, I think).

So I’m just using a hand-coded matching routine: select records from SHS (the Donor) and FRS (the Recipient) based on a bunch of characteristics. ‘Coarsened’ here means progressively widening and then dropping characteristics if there are no perfect matches; for example, we might match by tenure type, but if there’s no private renter in the SHS amongst those that match on other useful characteristics, we might find one that rents in any way (e.g. from a council) or, in extremis, drop tenure type as a matching criterion for that observation.
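A minimal Python sketch of that per-observation coarsening. This is not the Julia `coarse_match` from Utils.jl; the field names, the broad tenure grouping and the choice of which keys must always agree are all made up for illustration.

```python
from typing import Optional

# Hypothetical coarsening of tenure: detailed category -> broad category.
TENURE_BROAD = {
    "private_rent": "rent",
    "council_rent": "rent",
    "housing_association": "rent",
    "owned_outright": "own",
    "mortgaged": "own",
}

# Keys that must always agree in this sketch (an illustrative assumption).
ALWAYS_MATCH = ("numadults", "numkids")


def match_level(recipient: dict, donor: dict) -> Optional[int]:
    """0 = exact tenure match, 1 = same broad tenure, 2 = tenure dropped.

    Returns None if any of the always-required keys differ.
    """
    for key in ALWAYS_MATCH:
        if recipient[key] != donor[key]:
            return None
    if recipient["tenure"] == donor["tenure"]:
        return 0
    if TENURE_BROAD[recipient["tenure"]] == TENURE_BROAD[donor["tenure"]]:
        return 1
    return 2


def find_donors(recipient: dict, donors: list):
    """Return (level, donors) at the tightest level that has any donor.

    The level is chosen separately for each recipient, so one record can
    keep an exact tenure match while another falls back to 'any renter'.
    """
    by_level = {0: [], 1: [], 2: []}
    for donor in donors:
        level = match_level(recipient, donor)
        if level is not None:
            by_level[level].append(donor)
    for level in (0, 1, 2):
        if by_level[level]:
            return level, by_level[level]
    return None, []
```

Here a private renter with no private-renting donor would come back at level 1 with any renting donors, and only at level 2 (tenure ignored) if there are no renters at all.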

This video is a good intro.

Our strategy has to be slightly different:

• Matching in evaluation is usually done only over observations that actually match (have ‘common support’ in the jargon): if some bins have just one side (treated/donor, etc.) or some propensity scores don’t match, those observations are dropped. But we have to match every FRS observation, since we never want to lose observations, so for some we might just use a bad match;
• some matches may be catastrophic - assigning a male health record to a female, for example;
• Coarsened matching coarsens across all observations in the same way (all renting for everyone, even if using private renting works for some), but we might want tight matches where available.

## Li-Chun-ing

This is an idea suggested to us on a previous project by Li-Chun Zhang of the University of Southampton.

We can get an idea of the errors produced by this procedure by recording not just the best match but progressively more coarsened matches, and then using all the matches in your simulation - bootstrapping of a sort.
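A rough Python sketch of how I read this, not a formal statement of Li-Chun Zhang’s method: keep every candidate donor for each recipient (best and coarser), rerun the simulation many times with a randomly drawn donor each time, and look at the spread of the result. The function name, the `outcome` callback and the `council_tax` field are hypothetical.

```python
import random
import statistics


def simulate_with_all_matches(matches, outcome, n_runs=200, seed=1):
    """Rerun a simulation over random draws from the recorded matches.

    matches: dict of recipient id -> list of candidate donor records
             (the best match plus progressively coarser ones).
    outcome: function taking {recipient id: donor record} and returning
             a single number (e.g. a total simulated cost).
    Returns (mean, standard deviation) of the outcome across runs.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        # One donor per recipient, drawn uniformly from its candidates.
        draw = {rid: rng.choice(cands) for rid, cands in matches.items()}
        results.append(outcome(draw))
    return statistics.mean(results), statistics.pstdev(results)


# Toy example: recipient 1 has two candidate donors, recipient 2 only one.
matches = {
    1: [{"council_tax": 1000}, {"council_tax": 1400}],
    2: [{"council_tax": 900}],
}
total_tax = lambda draw: sum(d["council_tax"] for d in draw.values())
mean_total, sd_total = simulate_with_all_matches(matches, total_tax)
```

The standard deviation is the ‘error’ attributable to the choice of match; it is zero when every recipient has only one candidate donor.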

## SHS (Donor) Side

SHS has a seriously weird structure. Not everyone in a household is sampled - instead there’s a randomly chosen person and there’s also a bunch of stuff for the ‘highest income person’.

### Household Characteristics

The object initially is to match in household records. In future I might match in individual-level stuff (health, transport), in which case we’ll need to match a bit differently (include gender, for example, and de-emphasise household characteristics like accommodation type).

• shelter : sheltered accommodation
• tenure : tenure type
• acctype : type of dwelling
• singlepar : lone parent hhld flag
• numadults : num adults
• numkids : num children
• empstathigh : employment status, highest income person (HIP)
• sochigh : socio-economic, HIP
• agehigh : age HIP
• ethnichigh : ethnic group, HIP
• datayear : data year (2016..18)

See this script for the actual SHS->FRS mappings, and the coarse_match function in Utils.jl for a simple matching algorithm.

Todo: match on income, benefit receipts. The mean of annetinc is 27k in the SHS, but mean hhinc in FRS is 38k, so I need to construct something or at least figure out the construction of these.

SHS benefit receipts are also problematic because of the reporting of adults.

… much later …

Oversampling of small councils. Really messes up matching - you end up replicating the oversampling.

To fix this, select randomly from all the matches, with the selection conditioned by sample frequencies.

But just using the best matches makes little difference, so use all matches, even bad ones, with the probability of choosing each one a (crude) function of match quality and implied sample weight.
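A Python sketch of that weighted draw. The scoring rule (halve the score for each level of coarsening, then scale by the donor council’s sample weight) is my own stand-in for the ‘(crude) function’, and the tuple layout is hypothetical.

```python
import random


def choose_donor(candidates, rng):
    """Pick one donor id at random from all matches.

    candidates: list of (donor_id, match_level, sample_weight), where
    match_level 0 is the tightest match and sample_weight is the implied
    weight of the donor's council (low for oversampled small councils).
    """
    # Assumed scoring: each coarsening level halves the chance of selection,
    # and donors from oversampled councils are proportionately less likely.
    scores = [weight * 0.5 ** level for _, level, weight in candidates]
    return rng.choices(candidates, weights=scores, k=1)[0][0]


rng = random.Random(42)
candidates = [
    ("glasgow_hh", 0, 105),  # large council, undersampled
    ("orkney_hh", 0, 14),    # small council, oversampled
    ("fife_hh", 2, 102),     # poor match, but still eligible
]
picks = [choose_donor(candidates, rng) for _ in range(1000)]
```

With these weights the Orkney record is still picked occasionally, but far less often than it would be under a uniform draw over the three candidates.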

| code | name | sample weight | Modelled hhlds | Actual 2019 hhlds | %diff |
|---|---|---:|---:|---:|---:|
| S12000033 | Aberdeen City | 104 | 99870 | 108381 | 7.85 |
| S12000034 | Aberdeenshire | 107 | 122813 | 112114 | -9.54 |
| S12000041 | Angus | 72 | 56233 | 54221 | -3.71 |
| S12000035 | Argyll and Bute | 55 | 41088 | 41789 | 1.68 |
| S12000036 | City of Edinburgh | 102 | 213161 | 238269 | 10.54 |
| S12000005 | Clackmannanshire | 31 | 24368 | 23890 | -2.00 |
| S12000006 | Dumfries and Galloway | 90 | 72128 | 69699 | -3.49 |
| S12000042 | Dundee City | 88 | 62531 | 70685 | 11.54 |
| S12000008 | East Ayrshire | 74 | 58782 | 55387 | -6.13 |
| S12000045 | East Dunbartonshire | 57 | 54659 | 46228 | -18.24 |
| S12000010 | East Lothian | 56 | 47461 | 46771 | -1.47 |
| S12000011 | East Renfrewshire | 51 | 48210 | 39345 | -22.53 |
| S12000014 | Falkirk | 93 | 75877 | 72672 | -4.41 |
| S12000047 | Fife | 102 | 184519 | 169239 | -9.03 |
| S12000049 | Glasgow City | 105 | 240153 | 294622 | 18.49 |
| S12000017 | Highland | 110 | 120264 | 109514 | -9.82 |
| S12000018 | Inverclyde | 48 | 32990 | 37614 | 12.29 |
| S12000019 | Midlothian | 47 | 39363 | 39733 | 0.93 |
| S12000020 | Moray | 58 | 47011 | 42932 | -9.50 |
| S12000013 | Na h-Eileanan Siar | 14 | 13271 | 12833 | -3.41 |
| S12000021 | North Ayrshire | 87 | 65530 | 64140 | -2.17 |
| S12000050 | North Lanarkshire | 103 | 155653 | 152443 | -2.11 |
| S12000023 | Orkney Islands | 14 | 9147 | 10589 | 13.62 |
| S12000048 | Perth and Kinross | 90 | 68979 | 69003 | 0.03 |
| S12000038 | Renfrewshire | 108 | 89898 | 86683 | -3.71 |
| S12000026 | Scottish Borders | 74 | 53834 | 54715 | 1.61 |
| S12000027 | Shetland Islands | 13 | 11621 | 10439 | -11.33 |
| S12000028 | South Ayrshire | 68 | 57283 | 52588 | -8.93 |
| S12000029 | South Lanarkshire | 109 | 166000 | 147434 | -12.59 |
| S12000030 | Stirling | 49 | 40263 | 39654 | -1.54 |
| S12000039 | West Dunbartonshire | 53 | 40713 | 43030 | 5.39 |
| S12000040 | West Lothian | 94 | 81950 | 78966 | -3.78 |
| totals | | | 2495622 | 2495622 | 0.00 |

sample weight = number of households in the 2019 NRS estimates / total number of cases for that council in the pooled SHS
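The arithmetic behind the table, as a small Python sketch. The weight follows the definition above; the %diff formula, (actual - modelled) / actual, is inferred from the table’s figures rather than stated anywhere, and the case count of 1042 is a made-up number used only to show the calculation.

```python
def sample_weight(households_2019: int, pooled_cases: int) -> float:
    """Households in the council divided by its cases in the pooled SHS."""
    return households_2019 / pooled_cases


def pct_diff(modelled: float, actual: float) -> float:
    """Percentage by which the actual count exceeds the modelled one.

    Inferred from the table: positive means the model has too few households.
    """
    return 100.0 * (actual - modelled) / actual


# Aberdeen City, using the household counts from the table.
aberdeen_diff = round(pct_diff(99870, 108381), 2)

# 1042 is a hypothetical pooled case count, not a figure from the SHS.
aberdeen_weight = round(sample_weight(108381, 1042))
```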

