Originality declaration

I, [Leandra Lee], confirm that the work presented in this assessment is my own. Where information has been derived from other sources, I confirm that this has been indicated in the work.

date: 17 December, 2021

Obesity and Diet in London

library(sf)
library(tidyverse)
library(tmap)
library(janitor)
library(here)
library(broom)
library(car)
library(cowplot)
library(corrr)
library(sp)
library(spdep)
library(spgwr)
library(spatialreg)

Project Scope

Research question

  • Is there a relationship between diet and obesity in London?

Hypothesis

  • Higher weights of the average food product purchased, higher volumes of the average drink product purchased, higher weights of carbohydrate in the average product purchased, higher weights of fat in the average product purchased, a higher fraction of soft drinks purchased, and lower nutritional energy of the average product lead to a higher percentage of the population aged 16+ with a BMI of 30+.

Analysis

  • Map obesity rates across London MSOA
  • Bring in data on food purchases (e.g. weight, carb, fat, energy)
  • Conduct multiple linear regression with obesity as the dependent variable and selected food purchase data as predictor variables
  • Experiment with the spatial lag and spatial error models, comparing their results
  • Run geographically weighted regression (GWR) to see if there is local variation that is not captured by the global model

Assumptions

  • Assume that food purchasing patterns do not vary too much over time, so that data collected in 2015 can reasonably be used to explain obesity rates modelled for 2006 to 2008.
  • Assume that the modelled estimates of obesity are an accurate representation of the actual obesity percentages in MSOAs.

Data

Statistical GIS Boundary Files for London

  • This data contains various GIS boundary files for London at different spatial aggregation units, including Lower Super Output Area (LSOA), Middle Super Output Area (MSOA), Wards and Boroughs.
  • For this analysis, I will use MSOA data so that the spatial units correspond with the data available on obesity.
  • I will transform the data to a projected coordinate reference system, the British National Grid.

Obesity data at MSOA level

  • This MSOA atlas provides a summary of demographic and related data for each Middle Super Output Area in Greater London.
  • The dataset contains two columns related to obesity:
    • percentage of measured children in Year 6 who were classified as obese, 2009/10-2011/12
    • a modelled estimate of the percentage of the population aged 16+ with a BMI of 30+, 2006-2008
  • I will use the modelled estimates, as they have no missing values.
  • The last row of the dataset contains the MSOA average for London, which I will filter out.

Tesco supermarket grocery data

  • This dataset contains aggregated information on food purchases at different geographic aggregation (LSOA, MSOA, Ward, Borough) in London, enriched with information coming from the census.
  • Data are available for selected months of 2015. I have chosen the December MSOA data for this analysis, as it appears to have complete information for all MSOAs, but other months could be considered too.
  • The columns in the dataset that I am interested in are:
    • weight
    • volume
    • fat
    • carb
    • energy_tot
    • f_soft_drinks

MSOA boundary data

msoa_shape <- st_read(here::here("data", 
                           "statistical-gis-boundaries-london", 
                           "ESRI", "MSOA_2011_London_gen_MHW.shp")) %>%
  st_transform(., 27700)
## Reading layer `MSOA_2011_London_gen_MHW' from data source 
##   `N:\GIS\casa0005-final-exam-leandrafaith\data\statistical-gis-boundaries-london\ESRI\MSOA_2011_London_gen_MHW.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 983 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
## Projected CRS: OSGB 1936 / British National Grid
summary(msoa_shape)
##    MSOA11CD           MSOA11NM           LAD11CD            LAD11NM         
##  Length:983         Length:983         Length:983         Length:983        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    RGN11CD            RGN11NM             USUALRES        HHOLDRES    
##  Length:983         Length:983         Min.   : 5184   Min.   : 5184  
##  Class :character   Class :character   1st Qu.: 7338   1st Qu.: 7240  
##  Mode  :character   Mode  :character   Median : 8156   Median : 8064  
##                                        Mean   : 8315   Mean   : 8213  
##                                        3rd Qu.: 9110   3rd Qu.: 9002  
##                                        Max.   :14719   Max.   :14657  
##    COMESTRES        POPDEN           HHOLDS       AVHHOLDSZ    
##  Min.   :   0   Min.   :  2.90   Min.   :2031   Min.   :1.600  
##  1st Qu.:   9   1st Qu.: 47.45   1st Qu.:2850   1st Qu.:2.300  
##  Median :  42   Median : 72.90   Median :3212   Median :2.500  
##  Mean   : 102   Mean   : 83.42   Mean   :3323   Mean   :2.507  
##  3rd Qu.: 106   3rd Qu.:115.55   3rd Qu.:3746   3rd Qu.:2.700  
##  Max.   :2172   Max.   :247.20   Max.   :5936   Max.   :3.900  
##           geometry  
##  MULTIPOLYGON :983  
##  epsg:27700   :  0  
##  +proj=tmer...:  0  
##                     
##                     
## 
# qtm(msoa_shape)

There are 983 MSOAs in London.

Obesity data

msoa_atlas <- read_csv(here::here("data", "msoa-data.csv"),
                       locale=locale(encoding = "latin1")) %>%
  clean_names()
## New names:
## * `House Prices;Sales;2011` -> `House Prices;Sales;2011...129`
## * `House Prices;Sales;2011` -> `House Prices;Sales;2011...130`
## Rows: 984 Columns: 207
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (2): Middle Super Output Area, MSOA Name
## dbl (205): Age Structure (2011 Census);All Ages;, Age Structure (2011 Census...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# drop the last row that contains average MSOA data
msoa_atlas <- msoa_atlas %>%
  filter(str_detect(middle_super_output_area, "^E02"))

msoa_obesity <- msoa_atlas %>%
  select(middle_super_output_area,
         obesity_percentage_of_the_population_aged_16_with_a_bmi_of_30_modelled_estimate_2006_2008) %>%
  rename(., 
         obesity_percent = obesity_percentage_of_the_population_aged_16_with_a_bmi_of_30_modelled_estimate_2006_2008)

summary(msoa_obesity)
##  middle_super_output_area obesity_percent
##  Length:983               Min.   : 9.80  
##  Class :character         1st Qu.:16.90  
##  Mode  :character         Median :21.10  
##                           Mean   :20.88  
##                           3rd Qu.:25.10  
##                           Max.   :33.90
# join obesity data to MSOA boundary
msoa_shape <- msoa_shape %>%
  left_join(.,
            msoa_obesity, 
            by = c("MSOA11CD" = "middle_super_output_area"))

Datatypelist <- msoa_shape %>% 
  st_drop_geometry() %>%
  summarise_all(class) %>%
  pivot_longer(everything(), 
               names_to="All_variables", 
               values_to="Variable_class")

qtm(msoa_shape, 
    fill = "obesity_percent",
    fill.palette = "Blues")

Tesco grocery purchase data

grocery_purchases <- read_csv(here::here("data",
                                         "Dec_msoa_grocery.csv"))
## Rows: 983 Columns: 202
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (1): area_id
## dbl (201): weight, weight_perc2.5, weight_perc25, weight_perc50, weight_perc...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
selected_variables <- grocery_purchases %>%
  select(area_id, weight, volume, fat, 
         carb, energy_tot, f_soft_drinks, population)

h1 <- ggplot(selected_variables, aes(x=weight)) +
  geom_histogram(color = "darkblue",
                 fill = "lightblue") +
  theme_bw()

h2 <- ggplot(selected_variables, aes(x=volume)) +
  geom_histogram(color = "darkblue",
                 fill = "lightblue") +
  theme_bw()

h3 <- ggplot(selected_variables, aes(x=fat)) +
  geom_histogram(color = "darkblue",
                 fill = "lightblue") +
  theme_bw()

h4 <- ggplot(selected_variables, aes(x=carb)) +
  geom_histogram(color = "darkblue",
                 fill = "lightblue") +
  theme_bw()

h5 <- ggplot(selected_variables, aes(x=energy_tot)) +
  geom_histogram(color = "darkblue",
                 fill = "lightblue") +
  theme_bw()

h6 <- ggplot(selected_variables, aes(x=f_soft_drinks)) +
  geom_histogram(color = "darkblue",
                 fill = "lightblue") +
  theme_bw()

plot_grid(h1, h2, h3, h4, h5, h6)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

selected_variables <- selected_variables %>%
  mutate(log_f_soft_drinks = log(f_soft_drinks))

Most of the variables appear approximately normally distributed, except for f_soft_drinks, which I will log-transform before using it in the regression model.
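As a quick check (a sketch reusing the plotting style above), I can inspect the distribution of the transformed variable:

# histogram of the log-transformed soft drinks fraction
ggplot(selected_variables, aes(x=log_f_soft_drinks)) +
  geom_histogram(color = "darkblue",
                 fill = "lightblue") +
  theme_bw()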

shape_joined <- msoa_shape %>%
  left_join(., 
            selected_variables, 
            by = c("MSOA11CD" = "area_id"))

Multiple Linear Regression

Check for multicollinearity

# check their correlations are OK
# pearson correlation is only appropriate for interval or ratio data
# we are only concerned about multicollinearity between predictor variables
myvars <- shape_joined %>%
  dplyr::select(weight,
                volume,
                fat,
                carb,
                energy_tot,
                log_f_soft_drinks) %>%
  st_drop_geometry()

# get correlation matrix
cormat <- cor(myvars, use="complete.obs", method="pearson")

# significance test
sig1 <- corrplot::cor.mtest(myvars, conf.level = .95)

# create a correlogram
corrplot::corrplot(cormat, type="lower",
                   method = "circle", 
                   order = "original", 
                   tl.cex = 0.7,
                   p.mat = sig1$p, sig.level = .05, 
                   col = viridis::viridis(100, option = "plasma"),
                   diag = FALSE)

The size of each circle reflects the strength of the relationship as captured by the Pearson correlation coefficient. Crosses indicate relationships that are not statistically significant at the 95% confidence level.

It looks like the nutritional energy of the average product (energy_tot) is highly correlated with the weight of fat and carb in the average product, so I will remove energy_tot as a predictor variable.

Weight and volume are highly correlated, so I will remove volume as a predictor variable.

Fat and carb are highly correlated, so I will remove carb as a predictor variable.

I will check for multicollinearity between the remaining predictor variables.

myvars2 <- myvars %>%
  select(-volume, -energy_tot, -carb)

cormat <- cor(myvars2, use="complete.obs", method="pearson")

# significance test
sig1 <- corrplot::cor.mtest(myvars2, conf.level = .95)

# create a correlogram
corrplot::corrplot(cormat, type="lower",
                   method = "circle", 
                   order = "original", 
                   tl.cex = 0.7,
                   p.mat = sig1$p, sig.level = .05, 
                   col = viridis::viridis(100, option = "plasma"),
                   diag = FALSE)

The remaining variables no longer exhibit strong multicollinearity, so I will refine my hypothesis as follows:

  • Hypothesis: Higher weights of average food product purchased, higher weights of fat in the average product purchased and a higher fraction of soft drinks purchased lead to a higher percentage of the population aged 16+ with a BMI of 30+.

Check for linear relationship between x and y

regression_data <- shape_joined %>%
  dplyr::select(MSOA11CD,
                obesity_percent,
                weight,
                fat,
                log_f_soft_drinks)

p1 <- ggplot(regression_data, aes(x=weight, y=obesity_percent)) + 
  geom_point() +
  geom_smooth(method=lm) +
  theme_classic()

p2 <- ggplot(regression_data, aes(x=fat, y=obesity_percent)) + 
  geom_point() +
  geom_smooth(method=lm) +
  theme_classic()

p3 <- ggplot(regression_data, aes(x=log_f_soft_drinks, y=obesity_percent)) + 
  geom_point() +
  geom_smooth(method=lm) +
  theme_classic()

plot_grid(p1, p2, p3)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

There appears to be a slight positive relationship between each of the predictor variables and the obesity percentage.

Linear model

Now I will fit a linear model using the ordinary least squares method.

eq <- obesity_percent ~
  weight + fat + log_f_soft_drinks

ols_model <- lm(eq,
                data = regression_data)

# show the summary of those outputs
tidy(ols_model)
## # A tibble: 4 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        29.3      3.06         9.60 6.64e-21
## 2 weight              0.0216   0.00265      8.15 1.15e-15
## 3 fat                 1.37     0.204        6.69 3.78e-11
## 4 log_f_soft_drinks   8.88     0.413       21.5  3.56e-84
glance(ols_model)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.417         0.416  3.95      234. 2.19e-114     3 -2742. 5494. 5519.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
summary(ols_model)
## 
## Call:
## lm(formula = eq, data = regression_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0491  -2.8129  -0.2523   2.8492  11.4064 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       29.321925   3.055460   9.597  < 2e-16 ***
## weight             0.021597   0.002652   8.145 1.15e-15 ***
## fat                1.367470   0.204447   6.689 3.78e-11 ***
## log_f_soft_drinks  8.877024   0.413230  21.482  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.946 on 979 degrees of freedom
## Multiple R-squared:  0.4174, Adjusted R-squared:  0.4157 
## F-statistic: 233.8 on 3 and 979 DF,  p-value: < 2.2e-16

The results indicate that 41.7% of the variation in obesity percentage can be explained by the average weight of food purchased, the weight of fat in the average product, and the fraction of soft drinks purchased. The individual p-values show that all predictor variables are significant at the 0.1% level.

The coefficients of all the predictor variables are positive, indicating a positive linear relationship between these variables and obesity. For example, a one-gram increase in the weight of fat in the average product purchased is associated with a 1.37 percentage point increase in obesity, holding all other variables constant.
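As an additional safeguard against multicollinearity inflating the coefficient standard errors, variance inflation factors could be computed from the fitted model. This is a minimal sketch using the car package loaded above; values well below the common thresholds of 5 to 10 would confirm the correlogram-based screening.

# variance inflation factors for the retained predictors
vif(ols_model)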

Check for normally distributed residuals

shape_joined <- shape_joined %>%
  mutate(ols_model_res = residuals(ols_model))

ggplot(shape_joined, aes(x=ols_model_res)) +
  geom_histogram(color = "darkblue",
                 fill = "lightblue") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The residuals appear to be normally distributed.
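A formal test could back up the visual impression, bearing in mind that with 983 observations such tests are sensitive to even small departures from normality. A minimal sketch using the base R Shapiro-Wilk test:

# Shapiro-Wilk test of the OLS residuals
shapiro.test(shape_joined$ols_model_res)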

Check for homoscedasticity

Homoscedasticity means the errors have equal variance for all values of x; that is, the spread of the errors about the regression line does not change with x. This can be checked using a residuals vs fitted values plot.

par(mfrow = c(2, 2))
plot(ols_model)

The residuals vs fitted values plot in the top left corner shows that at the largest fitted values most of the residuals are negative, which may indicate a mild violation of the equal variance assumption; elsewhere the spread of the residuals is fairly constant.

The Normal Q-Q plot in the top right corner can be used to check if the residuals are normally distributed. If they are, they should follow a straight line, which is the case here.
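A more formal check of the equal variance assumption could be a Breusch-Pagan test. This is a sketch that assumes the lmtest package is installed (it is not loaded above); a small p-value would indicate heteroscedasticity.

# Breusch-Pagan test for heteroscedasticity
lmtest::bptest(ols_model)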

Check for independent residuals

tm_shape(shape_joined) +
  tm_polygons("ols_model_res")
## Variable(s) "ols_model_res" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

The map shows that MSOAs with positive residuals seem to be neighbouring other MSOAs with positive residuals, while those with negative residuals also exhibit some clustering.

To obtain a more quantitative measure of spatial autocorrelation, I will turn to Moran’s I.

Moran’s I

# calculate the centroids of MSOAs
centroids <- shape_joined %>%
  st_centroid() %>%
  st_geometry()
## Warning in st_centroid.sf(.): st_centroid assumes attributes are constant over
## geometries of x
# the poly2nb function builds a neighbours list based on regions with contiguous boundaries, that is sharing one or more boundary point
# queen = TRUE means a single shared boundary point meets the contiguity condition
neighbours_list <- shape_joined %>%
  poly2nb(., queen=T)

summary(neighbours_list)
## Neighbour list object:
## Number of regions: 983 
## Number of nonzero links: 52 
## Percentage nonzero weights: 0.005381413 
## Average number of links: 0.05289929 
## 932 regions with no links:
## 1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
## 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
## 118 119 120 121 124 125 126 127 128 129 130 131 132 133 134 135 136 137
## 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
## 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173
## 174 175 176 177 178 179 180 181 182 183 184 185 187 189 190 191 192 193
## 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211
## 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229
## 230 231 232 233 234 235 236 237 238 239 240 241 243 244 245 246 247 248
## 249 250 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267
## 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285
## 286 287 288 289 290 291 292 293 294 295 296 298 299 300 302 303 305 306
## 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 325
## 326 328 329 330 331 332 333 334 335 336 337 338 339 341 342 343 344 345
## 347 348 349 350 351 352 353 354 355 356 357 358 359 361 363 364 366 370
## 371 372 373 374 375 376 377 378 379 381 383 384 385 386 387 388 389 390
## 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408
## 410 411 412 413 414 416 417 418 419 420 421 422 423 424 425 426 427 428
## 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446
## 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
## 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482
## 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518
## 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536
## 537 538 540 541 543 544 545 546 547 548 549 550 551 552 553 554 555 556
## 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574
## 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592
## 593 594 595 596 597 598 599 600 601 602 603 604 606 607 608 609 610 611
## 612 613 614 615 616 617 618 619 621 622 624 626 627 629 630 631 632 633
## 634 636 640 641 643 644 645 646 648 649 650 651 652 653 654 655 656 657
## 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675
## 676 677 678 679 680 681 682 683 684 685 686 687 690 691 692 693 694 695
## 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713
## 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731
## 732 733 736 737 738 739 740 741 742 743 745 747 748 749 750 751 752 753
## 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771
## 772 773 774 775 776 777 778 779 780 782 783 784 785 786 787 788 790 791
## 792 793 794 795 796 798 799 800 801 802 803 804 805 806 807 808 809 810
## 811 812 813 814 815 816 817 818 820 821 822 823 824 825 827 828 829 830
## 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848
## 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866
## 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884
## 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902
## 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920
## 921 922 923 924 925 926 927 928 929 930 933 934 935 936 937 938 939 940
## 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958
## 959 960 961 962 963 964 965 966 968 969 970 971 972 973 974 975 976 977
## 978 979 980 981 982 983
## Link number distribution:
## 
##   0   1   2 
## 932  50   1 
## 50 least connected regions:
## 8 122 123 186 188 242 251 297 304 324 327 340 346 360 362 365 367 368 369 380 382 409 415 539 542 605 620 623 625 628 635 637 638 639 642 647 688 689 734 735 744 746 781 789 797 819 826 931 932 967 with 1 link
## 1 most connected region:
## 301 with 2 links
plot(neighbours_list, st_geometry(centroids), col="red")
plot(shape_joined$geometry, add = T)

932 out of 983 MSOAs have no contiguity-based neighbours, which would cause problems later in the analysis.

Thus, I will use k nearest neighbours with k set to 4 to generate the neighbours list.

# knearneigh returns a matrix with the indices of points belonging to the set of the k nearest neighbours of each other
knn_msoa <- centroids %>%
  knearneigh(., k=4)

# convert the knn object returned by knearneigh into a neighbours list
knn_nb <- knn_msoa %>%
  knn2nb()

plot(knn_nb, st_geometry(centroids), col="blue")
plot(shape_joined$geometry, add = T)

Now run Moran’s I test using row standardised spatial weights.

# add spatial weights to neighbours list
msoa_spatial_weights_list <- knn_nb %>%
  nb2listw(., style="W")

# run Moran's I test for spatial autocorrelation
global_moran <- shape_joined %>%
  pull(ols_model_res) %>%
  as.vector() %>%
  moran.test(., msoa_spatial_weights_list) 

global_moran
## 
##  Moran I test under randomisation
## 
## data:  .  
## weights: msoa_spatial_weights_list    
## 
## Moran I statistic standard deviate = 25.065, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5380476445     -0.0010183299      0.0004625464

The global Moran’s I statistic is 0.54 (1 = clustered, 0 = no pattern, -1 = dispersed), which indicates distinct positive spatial autocorrelation (clustering) in the residuals.

The p-value is less than 0.05, indicating that the result is statistically significant at the 95% confidence level. We can reject the null hypothesis that the model residuals are randomly distributed among MSOAs.

The standard deviate (or Z-score) is positive, indicating that the spatial distribution of high values and/or low values of residuals is more spatially clustered than would be expected if underlying spatial processes were random.
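A Moran scatterplot offers a complementary view of this clustering by plotting each MSOA's residual against the weighted average residual of its neighbours. A minimal sketch using spdep, which is loaded above:

# Moran scatterplot of the OLS residuals; an upward-sloping point cloud
# is consistent with the positive global Moran's I
moran.plot(shape_joined$ols_model_res,
           msoa_spatial_weights_list,
           xlab = "OLS residuals",
           ylab = "Spatially lagged OLS residuals")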

Spatial Regression Models

Now that we have established that spatial autocorrelation is present, there are two ways to deal with it, either by using a spatially lagged regression model or a spatial error model.

Lagrange Multiplier Test

I will use the Lagrange multiplier tests to decide whether to use the spatial lag or the spatial error model.

lm.LMtests(ols_model, msoa_spatial_weights_list, 
           test = c("LMerr", "LMlag", "RLMerr", "RLMlag", "SARMA"))
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = eq, data = regression_data)
## weights: msoa_spatial_weights_list
## 
## LMerr = 623.19, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = eq, data = regression_data)
## weights: msoa_spatial_weights_list
## 
## LMlag = 766.87, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = eq, data = regression_data)
## weights: msoa_spatial_weights_list
## 
## RLMerr = 10.928, df = 1, p-value = 0.0009472
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = eq, data = regression_data)
## weights: msoa_spatial_weights_list
## 
## RLMlag = 154.6, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = eq, data = regression_data)
## weights: msoa_spatial_weights_list
## 
## SARMA = 777.8, df = 2, p-value < 2.2e-16

Both robust tests are significant, but the test statistic for the robust spatially lagged model (RLMlag = 154.6, p < 2.2e-16) is far larger than that for the robust spatial error model (RLMerr = 10.9, p = 0.0009), so I will pick the spatially lagged regression model.

Spatially lagged regression model

# Spatial simultaneous autoregressive lag model estimation
splag_model <- lagsarlm(eq, 
                        data = regression_data, 
                        msoa_spatial_weights_list)

# what do the outputs show?
tidy(splag_model)
## # A tibble: 5 x 5
##   term              estimate std.error statistic      p.value
##   <chr>                <dbl>     <dbl>     <dbl>        <dbl>
## 1 rho                0.740     0.0199      37.1  0           
## 2 (Intercept)        9.60      1.93         4.98 0.000000647 
## 3 weight             0.00979   0.00178      5.50 0.0000000370
## 4 fat                0.395     0.130        3.04 0.00239     
## 5 log_f_soft_drinks  3.48      0.281       12.4  0
glance(splag_model)
## # A tibble: 1 x 6
##   r.squared   AIC   BIC deviance logLik  nobs
##       <dbl> <dbl> <dbl>    <dbl>  <dbl> <int>
## 1     0.775 4726. 4755.    5962. -2357.   983
summary(splag_model)
## 
## Call:
## lagsarlm(formula = eq, data = regression_data, listw = msoa_spatial_weights_list)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7.98025 -1.69662 -0.08926  1.72955  8.87246 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                    Estimate Std. Error z value  Pr(>|z|)
## (Intercept)       9.5995064  1.9289249  4.9766 6.471e-07
## weight            0.0097920  0.0017789  5.5044 3.704e-08
## fat               0.3954316  0.1302135  3.0368  0.002391
## log_f_soft_drinks 3.4769186  0.2813487 12.3580 < 2.2e-16
## 
## Rho: 0.74041, LR test value: 770.31, p-value: < 2.22e-16
## Asymptotic standard error: 0.019949
##     z-value: 37.114, p-value: < 2.22e-16
## Wald statistic: 1377.5, p-value: < 2.22e-16
## 
## Log likelihood: -2357.018 for lag model
## ML residual variance (sigma squared): 6.0647, (sigma: 2.4627)
## Number of observations: 983 
## Number of parameters estimated: 6 
## AIC: 4726, (AIC for lm: 5494.3)
## LM test for residual autocorrelation
## test value: 9.728, p-value: 0.0018148

The spatially lagged regression model incorporates spatial dependence explicitly by adding a spatially lagged dependent variable (the weighted average of the dependent variable in neighbouring MSOAs) to the right-hand side of the regression equation; its coefficient is rho.

The results show that rho is statistically significant, and its estimate (0.74) is many times larger than its standard error (0.02), so we can conclude that there is substantial spatial dependence which we should bear in mind when interpreting the results of the regression model.

In our scenario, this means that the obesity percentage in one MSOA may be associated with or influenced by the obesity percentage in the four nearest neighbouring MSOAs (as defined in our neighbours list).
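One caveat is that, because of this feedback through the lag term, the coefficients of the lag model cannot be read as simple marginal effects in the way the OLS coefficients were. The impacts() function in spatialreg decomposes each predictor's effect into direct and indirect (spillover) components; a minimal sketch:

# direct, indirect and total impacts of each predictor in the lag model
impacts(splag_model, listw = msoa_spatial_weights_list)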

To check whether the dependent variable, obesity percentage, is indeed spatially clustered, I will run Moran’s I on it.

global_moran_obesity <- shape_joined %>%
  pull(obesity_percent) %>%
  as.vector() %>%
  moran.test(., msoa_spatial_weights_list) 

global_moran_obesity
## 
##  Moran I test under randomisation
## 
## data:  .  
## weights: msoa_spatial_weights_list    
## 
## Moran I statistic standard deviate = 35.292, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.7582270485     -0.0010183299      0.0004628242

The Moran’s I statistic confirms that the spatial distribution of high values and/or low values of obesity percentage across MSOAs is spatially clustered.

# write out the residuals
shape_joined <- shape_joined %>%
  mutate(splag_model_res = residuals(splag_model))

tm_shape(shape_joined) + 
  tm_polygons(col = "splag_model_res") +  
  tm_compass(type = "arrow", position = c("right", "top") , size = 2) + 
  tm_scale_bar(breaks = c(0,1,2), text.size = 0.7, position =  c("right", "bottom")) + 
  tm_layout(bg.color = "white", legend.outside = TRUE)
## Variable(s) "splag_model_res" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

# check that the residuals from the spatially lagged model are now no longer exhibiting spatial autocorrelation
splag_moran <- moran.test(shape_joined$splag_model_res, 
                          msoa_spatial_weights_list)

splag_moran
## 
##  Moran I test under randomisation
## 
## data:  shape_joined$splag_model_res  
## weights: msoa_spatial_weights_list    
## 
## Moran I statistic standard deviate = -2.0055, p-value = 0.9775
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##     -0.0441412871     -0.0010183299      0.0004623305

The Moran’s I statistic for the residuals of the spatially lagged model is close to 0 (in fact slightly negative), and the large p-value for the one-sided test (alternative hypothesis: greater) means we cannot reject the null hypothesis of no spatial autocorrelation. In other words, the lag model appears to have removed the spatial autocorrelation seen in the OLS residuals.
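For completeness, the spatial error model mentioned in the project scope could also be fitted and compared with the lag model, for example via AIC. A minimal sketch using errorsarlm() from spatialreg:

# spatial simultaneous autoregressive error model, for comparison with
# the lag model chosen above; the lower AIC indicates the better fit
sperr_model <- errorsarlm(eq,
                          data = regression_data,
                          msoa_spatial_weights_list)

AIC(sperr_model)
AIC(splag_model)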

Geographically Weighted Regression

Perhaps the spatial autocorrelation that we have observed for obesity percentages in London MSOAs is indicative of some local variation that the global OLS model has failed to capture.

GWR explores how the relationship between a dependent variable (Y) and one or more independent variables (the Xs) might vary geographically.

centroids2 <- st_coordinates(centroids)

shape_joined2 <- cbind(shape_joined, centroids2)

# find optimal kernel bandwidth
GWR_bandwidth <- gwr.sel(eq,
                         data = shape_joined2,
                         coords = cbind(shape_joined2$X,
                                        shape_joined2$Y),
                        # provide an adaptive bandwidth as opposed to a fixed bandwidth
                        adapt = TRUE)
## Adaptive q: 0.381966 CV score: 13556.81 
## Adaptive q: 0.618034 CV score: 14215.75 
## Adaptive q: 0.236068 CV score: 12837.09 
## Adaptive q: 0.145898 CV score: 11941.14 
## Adaptive q: 0.09016994 CV score: 10918.93 
## Adaptive q: 0.05572809 CV score: 9857.78 
## Adaptive q: 0.03444185 CV score: 8916.646 
## Adaptive q: 0.02128624 CV score: 8126.211 
## Adaptive q: 0.01315562 CV score: 7465.366 
## Adaptive q: 0.008130619 CV score: 6932.924 
## Adaptive q: 0.005024999 CV score: 6770.535 
## Adaptive q: 0.002617092 CV score: 6726.558 
## Adaptive q: 0.002341337 CV score: 6795.212 
## Adaptive q: 0.003729339 CV score: 6671.612 
## Adaptive q: 0.003646188 CV score: 6667.365 
## Adaptive q: 0.003426197 CV score: 6660.891 
## Adaptive q: 0.003117146 CV score: 6666.343 
## Adaptive q: 0.003370807 CV score: 6660.507 
## Adaptive q: 0.003330116 CV score: 6660.58 
## Adaptive q: 0.003370807 CV score: 6660.507
GWR_bandwidth
## [1] 0.003370807

The optimal adaptive bandwidth is about 0.0034, meaning that roughly 0.34% of the spatial units, or about 3 of the 983 MSOAs, are used as k-nearest neighbours in each local regression.

# fit the gwr model based on adaptive bandwidth
gwr_abw <- gwr(eq,
               data = shape_joined2, 
               coords = cbind(shape_joined2$X, shape_joined2$Y), 
               adapt = GWR_bandwidth, 
               # matrix output
               hatmatrix = TRUE, 
               # standard error
               se.fit = TRUE)

# print the results of the model
gwr_abw
## Call:
## gwr(formula = eq, data = shape_joined2, coords = cbind(shape_joined2$X, 
##     shape_joined2$Y), adapt = GWR_bandwidth, hatmatrix = TRUE, 
##     se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.003370807 (about 3 of 983 data points)
## Summary of GWR coefficient estimates at data points:
##                         Min.    1st Qu.     Median    3rd Qu.       Max.
## X.Intercept.      -91.660266   4.460280  24.726248  43.582462 148.532683
## weight             -0.109115   0.012258   0.038205   0.066856   0.166229
## fat               -10.604538  -1.420237   0.177817   1.151848   8.159289
## log_f_soft_drinks  -8.065644   2.002665   4.849726   7.987723  25.560716
##                    Global
## X.Intercept.      29.3219
## weight             0.0216
## fat                1.3675
## log_f_soft_drinks  8.8770
## Number of data points: 983 
## Effective number of parameters (residual: 2traceS - traceS'S): 474.7981 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 508.2019 
## Sigma (residual: 2traceS - traceS'S): 2.112379 
## Effective number of parameters (model: traceS): 353.8443 
## Effective degrees of freedom (model: traceS): 629.1557 
## Sigma (model: traceS): 1.898502 
## Sigma (ML): 1.518844 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 4723.684 
## AIC (GWR p. 96, eq. 4.22): 3965.167 
## Residual sum of squares: 2267.671 
## Quasi-global R2: 0.9133393
results <- as.data.frame(gwr_abw$SDF)
# names(results)

The coefficients for the weight of fat in the average food product range from a minimum of -10.6 (a one-gram increase in fat associated with a 10.6 percentage point drop in obesity) to a maximum of +8.16 (a one-gram increase in fat associated with an 8.16 percentage point rise in obesity).

For the middle half of the MSOAs in the dataset (between the 1st and 3rd quartiles), a one-gram increase in the weight of the average food product is associated with an increase in obesity of between 0.012 and 0.067 percentage points.

# save localR2 and coefficients to original data frame
shape_joined2$abw_localR2 <- results$localR2

shape_joined2$coef_weight <- results$weight

shape_joined2$coef_fat <- results$fat

shape_joined2$coef_log_f_soft_drinks <- results$log_f_soft_drinks

Map local R2

# map local R2
legend_title = expression("Adaptive bandwidth: Local R2")

map_abgwr1 = tm_shape(shape_joined2) +
  tm_fill(col = "abw_localR2", title = legend_title, 
          palette = "Oranges", style = "cont") +
  tm_borders(col = "gray", lwd = .1) + 
  tm_compass(type = "arrow", position = c("right", "top") , size = 2) + 
  tm_scale_bar(breaks = c(0,1,2), text.size = 0.7, position =  c("right", "bottom")) +
  tm_layout(bg.color = "white", legend.outside = TRUE)

map_abgwr1

The local R-squared values range from about 0.2 to 0.8, with several MSOAs near the centre having higher R-squared values.
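A numeric summary of the local R-squared values can back up this visual impression; a minimal sketch:

# summary of the local R2 values saved above
summary(shape_joined2$abw_localR2)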

Map the coefficients

# weight
legend_title = expression("Weight of average food product (grams)")

map_abgwr2 = tm_shape(shape_joined2) +
  tm_fill(col = "coef_weight", title = legend_title, 
          palette = "RdBu", style = "cont") + 
  tm_borders(col = "gray", lwd = .1)  +
  tm_compass(type = "arrow", position = c("right", "top") , size = 2) +
  tm_scale_bar(breaks = c(0,1,2), text.size = 0.7, position =  c("left", "bottom")) + 
  tm_layout(bg.color = "white", legend.outside = TRUE)

map_abgwr2
## Variable(s) "coef_weight" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

# fat
legend_title = expression("Weight of fat in the average product (grams)")

map_abgwr3 = tm_shape(shape_joined2) +
  tm_fill(col = "coef_fat", title = legend_title, 
          palette = "RdBu", style = "cont") + 
  tm_borders(col = "gray", lwd = .1)  + 
  tm_compass(type = "arrow", position = c("right", "top") , size = 2) +
  tm_scale_bar(breaks = c(0,1,2), text.size = 0.7, position =  c("left", "bottom")) + 
  tm_layout(bg.color = "white", legend.outside = TRUE) 

map_abgwr3
## Variable(s) "coef_fat" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

# soft drinks
legend_title = expression("Log(fraction of soft drinks purchased)")

map_abgwr4 = tm_shape(shape_joined2) +
  tm_fill(col = "coef_log_f_soft_drinks", title = legend_title, 
          palette = "RdBu", style = "cont") + 
  tm_borders(col = "gray", lwd = .1)  + 
  tm_compass(type = "arrow", position = c("right", "top") , size = 2) +
  tm_scale_bar(breaks = c(0,1,2), text.size = 0.7, position =  c("left", "bottom")) + 
  tm_layout(bg.color = "white", legend.outside = TRUE) 

map_abgwr4
## Variable(s) "coef_log_f_soft_drinks" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

The coefficients for the weight of the average food product show that the relationship between this variable and obesity percentage is positive for most MSOAs, especially in some central areas towards the south of the river.

Interestingly, on the outskirts of London, the relationship between weight of the average food product and obesity is negative (indicated in red on the map), which is unexpected. There could be some other factors at play that are unrelated to diets, perhaps related to lifestyle habits.

The coefficients for the weight of fat in the average product show that some MSOAs have negative coefficients, meaning that in those areas an increase in the weight of fat in the average product is associated with a decrease in obesity percentage.

The coefficients for the log transformed fraction of soft drinks purchased show that there is mostly a positive relationship between this variable and obesity percentage across MSOAs.

Assessing statistical significance

Roughly, if the t value of a coefficient estimate (the estimate divided by its standard error) has an absolute value greater than 1.96 and the sample is sufficiently large, then the coefficient is statistically significant at the 95% confidence level.

# compute t statistic
shape_joined2$t_weight = results$weight / results$weight_se

# categorise t values
shape_joined2$t_weight_cat <- cut(shape_joined2$t_weight,
                     breaks = c(min(shape_joined2$t_weight), 
                              -1.96, 1.96, 
                              max(shape_joined2$t_weight)),
                     labels = c("sig","nonsig", "sig"))

# map statistically significant coefs
legend_title = expression("Weight of avg food product: significant")

map_sig_weight = tm_shape(shape_joined2) + 
  tm_fill(col = "t_weight_cat", title = legend_title, 
          legend.hist = TRUE, midpoint = NA, 
          textNA = "", colorNA = "white") +  
  tm_borders(col = "grey", lwd = .1)  + 
  tm_compass(type = "arrow", position = c("right", "top") , size = 2) + 
  tm_scale_bar(breaks = c(0,1,2), text.size = 0.7, position =  c("right", "bottom")) + 
  tm_layout(bg.color = "white", legend.outside = TRUE)

map_sig_weight

For slightly more than half of the MSOAs, the results for the weight of the average food product are statistically significant in explaining obesity percentages.
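This impression could be quantified by tabulating the significance categories (the same check could be repeated for the other predictors); a minimal sketch:

# count MSOAs with significant vs non-significant local weight coefficients
table(shape_joined2$t_weight_cat)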

# compute t statistic
shape_joined2$t_fat = results$fat / results$fat_se

# categorise t values
shape_joined2$t_fat_cat <- cut(shape_joined2$t_fat,
                     breaks = c(min(shape_joined2$t_fat), 
                              -1.96, 1.96, 
                              max(shape_joined2$t_fat)),
                     labels = c("sig","nonsig", "sig"))

# map statistically significant coefs 
legend_title = expression("Weight of fat in avg food product: significant")

map_sig_fat = tm_shape(shape_joined2) + 
  tm_fill(col = "t_fat_cat", title = legend_title, 
          legend.hist = TRUE, midpoint = NA, 
          textNA = "", colorNA = "white") +  
  tm_borders(col = "grey", lwd = .1)  + 
  tm_compass(type = "arrow", position = c("right", "top") , size = 2) + 
  tm_scale_bar(breaks = c(0,1,2), text.size = 0.7, position =  c("right", "bottom")) + 
  tm_layout(bg.color = "white", legend.outside = TRUE)

map_sig_fat

This map shows that the weight of fat in the average food product is not statistically significant for most of the MSOAs, so we should take this into account when interpreting the results of this variable in the GWR.

Reflection

In conclusion, I have looked at three dietary factors and their relationship with obesity percentages, namely, the weight of the average food product, the weight of fat in the average food product, and the fraction of soft drinks purchased.

I have fitted three different models of the relationship between these factors and obesity: ordinary least squares multiple linear regression, a spatially lagged regression model and, finally, geographically weighted regression.

In general, the results show that higher weights of the average food product and of fat in the average food product are associated with higher obesity percentages, so policymakers could look into ways of encouraging people to reduce the weight and fat content of the food they purchase as one avenue for combating obesity.