Merging SHS and FRS Data

February 7, 2021

Some notes as I try to add some Scottish Household Survey (SHS) data to my FRS-based dataset.

Why?

Because a lot of my public-access FRS data is blank. In particular I’ve decided I can’t really proceed with housing-related benefits modelling without Local Housing Allowance identifiers and council taxes, and these aren’t in the public FRS datasets I use.

Plus, there’s loads of good stuff about housing, heating and transport in the SHS which might be useful later on.

How?

There’s some theory about this, and some software; see King et al., Eurostat and the StatMatch software.

I’d really like to replicate StatMatch in Julia.

For now, I’m using a rather hacked, ad-hoc implementation based on King’s Coarsened Exact Matching idea.

There’s a large literature on matching more generally, used as a technique in evaluation studies, but I don’t think much of it is useful for what I’m after here. Propensity Score Matching is fun, and I’d also like to implement a Julia version, but the kind of matching it produces isn’t really useful here, since it matches on scores and not characteristics (a white young male could get matched with a black old female if they have the same score; we need to avoid that, I think).

So I’m just using a hand-coded matching routine: select records from SHS (the Donor) and FRS (the Recipient) that agree on a bunch of characteristics. ‘Coarsened’ here means progressively widening and then dropping characteristics if there are no perfect matches. For example, we might match by tenure type, but if there’s no private renter in the SHS amongst those that match on the other useful characteristics, we might find one that rents in any way (e.g. from a council) or, in extremis, drop tenure type as a matching criterion for that observation.
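To make the widen-then-drop idea concrete, here is a minimal sketch in Python. It is not the coarse_match function from Utils.jl: the field names, the tenure categories and the ordering of the levels are all invented for illustration.

```python
# Hypothetical donor (SHS-like) records; field names are invented.
donors = [
    {"id": "shs1", "tenure": "council_rent", "hh_size": 2},
    {"id": "shs2", "tenure": "owner", "hh_size": 2},
    {"id": "shs3", "tenure": "council_rent", "hh_size": 4},
]

# Assumed grouping of tenure types into "rents in any way".
RENTED = {"private_rent", "council_rent", "housing_association"}

def same_tenure(r, d):
    return r["tenure"] == d["tenure"]

def both_rent_or_both_not(r, d):
    # Widened tenure test: any kind of renting counts as the same thing.
    return (r["tenure"] in RENTED) == (d["tenure"] in RENTED)

def same_size(r, d):
    return r["hh_size"] == d["hh_size"]

# Each level lists the tests a donor must pass; later levels are coarser.
LEVELS = [
    [same_tenure, same_size],            # exact match
    [both_rent_or_both_not, same_size],  # tenure widened
    [same_size],                         # tenure dropped altogether
]

def coarsened_match(recipient, donors, levels=LEVELS):
    """Return (level, donor ids) for the first level with any match."""
    for level, tests in enumerate(levels):
        found = [d["id"] for d in donors if all(t(recipient, d) for t in tests)]
        if found:
            return level, found
    return None, []

frs_household = {"tenure": "private_rent", "hh_size": 2}
level, ids = coarsened_match(frs_household, donors)
print(level, ids)  # 1 ['shs1']
```

There is no private renter among the donors, so the private-renting recipient only finds a match once tenure is widened to any kind of renting. Returning the level as well as the matches keeps a record of how much coarsening each recipient needed.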

This video is a good intro.

Our strategy has to be slightly different:

Li-Chun-ing

This is an idea suggested to us on a previous project by Li-Chun Zhang of the University of Southampton.

We can get an idea of the errors produced by this procedure by recording not just the best match but progressively more coarsened matches, and then using all the matches in your simulation - bootstrapping of a sort.
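One way to read this, sketched in Python under some loud assumptions: the donor records, the `council_tax` field and the three matching levels are all hypothetical, and pooling the matches from every level is my guess at what "using all the matches" could look like, not a description of the actual code.

```python
import random
import statistics

# Hypothetical donor records; "council_tax" stands in for whatever is imputed.
donors = [
    {"tenure": "private_rent", "rooms": 3, "council_tax": 1200.0},
    {"tenure": "private_rent", "rooms": 4, "council_tax": 1500.0},
    {"tenure": "owner", "rooms": 3, "council_tax": 1800.0},
]

# Keys matched on at each level, strictest first (assumed ordering).
LEVELS = [("tenure", "rooms"), ("tenure",), ()]

def matches_by_level(recipient, donors, levels=LEVELS):
    """For every level, list the donors that agree on that level's keys."""
    return [
        [d for d in donors if all(d[k] == recipient[k] for k in keys)]
        for keys in levels
    ]

def imputed_spread(recipient, donors, draws=200, seed=0):
    """Impute from a randomly drawn match many times; return mean and sd."""
    rng = random.Random(seed)
    # Pool the matches from all levels. A donor that matches at a strict
    # level also matches at every coarser one, so it appears several times
    # and is drawn more often.
    pool = [d for level in matches_by_level(recipient, donors) for d in level]
    values = [rng.choice(pool)["council_tax"] for _ in range(draws)]
    return statistics.mean(values), statistics.stdev(values)

recipient = {"tenure": "private_rent", "rooms": 3}
by_level = matches_by_level(recipient, donors)
mean_tax, sd_tax = imputed_spread(recipient, donors)
```

The standard deviation across draws gives a rough feel for how much the imputed value depends on which match happened to be chosen.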

SHS (Donor) Side

SHS has a seriously weird structure. Not everyone in a household is sampled - instead there’s a randomly chosen person and there’s also a bunch of stuff for the ‘highest income person’.

Household Characteristics

The object initially is to match in household records. In future I might match in individual-level stuff (health, transport), in which case we’ll need to match a bit differently (include gender, for example, and de-emphasise household characteristics like accommodation type).

See this script for the actual SHS->FRS mappings, and the coarse_match function in Utils.jl for a simple matching algorithm.

Todo: match on income and benefit receipts. The mean of annetinc is 27k in the SHS, but mean hhinc in FRS is 38k, so I need to construct something comparable, or at least figure out how these two are constructed.

SHS benefit receipts are also problematic because of the reporting of adults.

… much later …

Oversampling of small councils. Really messes up matching - you end up replicating the oversampling.

To fix this, select randomly from all the matches, with the selection conditioned on sample frequencies.

But just using the best matches makes little difference, so use all matches, even bad ones, with the probability of choosing each one a (crude) function of match quality and implied sample weight.
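As a rough illustration of that selection step, here is a Python sketch. The scoring rule (quality times sample weight) is an assumption, since the post only says the function is crude, and the quality scores and case names are made up. The two weights are the Glasgow City and Orkney Islands values from the table.

```python
import random

def selection_probabilities(candidates):
    """candidates: list of (donor_id, quality, sample_weight) tuples.

    Assumed rule: probability proportional to quality * sample_weight.
    """
    scores = [quality * weight for _, quality, weight in candidates]
    total = sum(scores)
    return [s / total for s in scores]

def choose_donor(candidates, rng):
    """Draw one donor id at random using the probabilities above."""
    probs = selection_probabilities(candidates)
    donor_id, _, _ = rng.choices(candidates, weights=probs, k=1)[0]
    return donor_id

# Hypothetical matches for one FRS household: a mediocre match from a
# lightly sampled council and a perfect one from a heavily oversampled one.
candidates = [
    ("case_from_glasgow", 0.5, 105),
    ("case_from_orkney", 1.0, 14),
]

probs = selection_probabilities(candidates)
rng = random.Random(42)
picks = [choose_donor(candidates, rng) for _ in range(10_000)]
glasgow_share = picks.count("case_from_glasgow") / len(picks)
```

With these numbers the Glasgow case is chosen roughly four times in five despite being the worse match, which is the point: cases from oversampled small councils stop dominating the matched data.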

| code | name | sample weight | Modelled hhlds | Actual 2019 hhlds | % diff |
|---|---|---|---|---|---|
| S12000033 | Aberdeen City | 104 | 99870 | 108381 | 7.85 |
| S12000034 | Aberdeenshire | 107 | 122813 | 112114 | -9.54 |
| S12000041 | Angus | 72 | 56233 | 54221 | -3.71 |
| S12000035 | Argyll and Bute | 55 | 41088 | 41789 | 1.68 |
| S12000036 | City of Edinburgh | 102 | 213161 | 238269 | 10.54 |
| S12000005 | Clackmannanshire | 31 | 24368 | 23890 | -2.00 |
| S12000006 | Dumfries and Galloway | 90 | 72128 | 69699 | -3.49 |
| S12000042 | Dundee City | 88 | 62531 | 70685 | 11.54 |
| S12000008 | East Ayrshire | 74 | 58782 | 55387 | -6.13 |
| S12000045 | East Dunbartonshire | 57 | 54659 | 46228 | -18.24 |
| S12000010 | East Lothian | 56 | 47461 | 46771 | -1.47 |
| S12000011 | East Renfrewshire | 51 | 48210 | 39345 | -22.53 |
| S12000014 | Falkirk | 93 | 75877 | 72672 | -4.41 |
| S12000047 | Fife | 102 | 184519 | 169239 | -9.03 |
| S12000049 | Glasgow City | 105 | 240153 | 294622 | 18.49 |
| S12000017 | Highland | 110 | 120264 | 109514 | -9.82 |
| S12000018 | Inverclyde | 48 | 32990 | 37614 | 12.29 |
| S12000019 | Midlothian | 47 | 39363 | 39733 | 0.93 |
| S12000020 | Moray | 58 | 47011 | 42932 | -9.50 |
| S12000013 | Na h-Eileanan Siar | 14 | 13271 | 12833 | -3.41 |
| S12000021 | North Ayrshire | 87 | 65530 | 64140 | -2.17 |
| S12000050 | North Lanarkshire | 103 | 155653 | 152443 | -2.11 |
| S12000023 | Orkney Islands | 14 | 9147 | 10589 | 13.62 |
| S12000048 | Perth and Kinross | 90 | 68979 | 69003 | 0.03 |
| S12000038 | Renfrewshire | 108 | 89898 | 86683 | -3.71 |
| S12000026 | Scottish Borders | 74 | 53834 | 54715 | 1.61 |
| S12000027 | Shetland Islands | 13 | 11621 | 10439 | -11.33 |
| S12000028 | South Ayrshire | 68 | 57283 | 52588 | -8.93 |
| S12000029 | South Lanarkshire | 109 | 166000 | 147434 | -12.59 |
| S12000030 | Stirling | 49 | 40263 | 39654 | -1.54 |
| S12000039 | West Dunbartonshire | 53 | 40713 | 43030 | 5.39 |
| S12000040 | West Lothian | 94 | 81950 | 78966 | -3.78 |
| totals | | | 2495622 | 2495622 | 0.00 |
       
       
Sample weight = number of households in the 2019 NRS estimates / total number of cases for that council in the pooled SHS.
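For reference, the two derived columns can be written out as a couple of small Python functions. The sample-weight formula is the one given above. The % diff formula is not stated anywhere in the post; (actual - modelled) / actual is inferred from the table, and it reproduces the rows checked here. The function names and the round-number case count are illustrative only.

```python
def sample_weight(households_2019, shs_cases):
    """Households represented by each pooled SHS case in a council."""
    return households_2019 / shs_cases

def pct_diff(modelled, actual):
    """Percentage gap between actual and modelled households.

    Inferred from the table rather than stated in the text.
    """
    return 100.0 * (actual - modelled) / actual

# Rows taken from the table: (modelled, actual, published % diff).
rows = {
    "Aberdeen City": (99870, 108381, 7.85),
    "Glasgow City": (240153, 294622, 18.49),
    "East Renfrewshire": (48210, 39345, -22.53),
}
recomputed = {name: round(pct_diff(m, a), 2) for name, (m, a, _) in rows.items()}

# Made-up round numbers, just to show the scale of the weight.
example_weight = sample_weight(100_000, 1_000)  # 100.0
```

A positive % diff means the matched data has too few households for that council (Glasgow, Edinburgh, Dundee), and a negative one means too many.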
Merging SHS and FRS Data - February 7, 2021 - Graham Stark