1 Preparation

dataFolder <- paste0(ws, "data/")

Load data from geopackage

charging_stations <- st_read(dsn= paste0(dataFolder, "HH.gpkg"), layer="chargingStations")
## Reading layer `chargingStations' from data source `/home/slautenb/Documents/lehre/HD/ws_2020_21/heikaLab/R4UrbanDataLab_2020/data/HH.gpkg' using driver `GPKG'
## Simple feature collection with 1061 features and 12 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 552777.7 ymin: 5921782 xmax: 581876.5 ymax: 5954451
## projected CRS:  ETRS89 / UTM zone 32N

2 Non-spatial regression

What drives the clustered pattern of charging stations? One assumption could be that it is related to population density: the more people inhabit a place, the higher the number of charging stations. This is of course overly simplistic, but it is a start.

Read population shapefile

pop <- st_read(dsn= paste0(dataFolder, "popPoly.shp"))
## Reading layer `popPoly' from data source `/home/slautenb/Documents/lehre/HD/ws_2020_21/heikaLab/R4UrbanDataLab_2020/data/popPoly.shp' using driver `ESRI Shapefile'
## Simple feature collection with 7163 features and 6 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 522696.9 ymin: 5898155 xmax: 616696.9 ymax: 5975155
## projected CRS:  ETRS89 / UTM zone 32N
summary(pop)
##        id            left             top              right       
##  Min.   :   1   Min.   :523197   Min.   :5899655   Min.   :524197  
##  1st Qu.:1832   1st Qu.:546197   1st Qu.:5918655   1st Qu.:547197  
##  Median :3623   Median :570197   Median :5937655   Median :571197  
##  Mean   :3629   Mean   :569821   Mean   :5937938   Mean   :570821  
##  3rd Qu.:5424   3rd Qu.:593197   3rd Qu.:5956655   3rd Qu.:594197  
##  Max.   :7237   Max.   :616197   Max.   :5975655   Max.   :617197  
##      bottom          pop_total                geometry   
##  Min.   :5898655   Min.   :   -1.0   POLYGON      :7163  
##  1st Qu.:5917655   1st Qu.:   -1.0   epsg:25832   :   0  
##  Median :5936655   Median :   30.0   +proj=utm ...:   0  
##  Mean   :5936938   Mean   :  471.8                       
##  3rd Qu.:5955655   3rd Qu.:  248.0                       
##  Max.   :5974655   Max.   :20088.0

-1 represents missing data, so these values need to be set to NA

idx <- which(pop$pop_total == -1)
pop$pop_total[idx] <- NA
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(pop) + 
  tm_fill(col="pop_total", convert2density = TRUE, alpha=.5, palette="plasma", style="kmeans") + 
  tm_shape(charging_stations) + 
  tm_dots(alpha=.5)

Spatial join between the data

To study the relationship between the two variables we need to combine them. A straightforward approach is a spatial join that counts the number of charging points (one charging station can have more than one charging point) per vector grid cell of the population data set.

First we join the polygon information to the charging stations.

Add a simple ID to the charging stations since gml_id is cumbersome.

charging_stations$statID <- paste0("charge_", 1:nrow(charging_stations))
chargingStationsPoly <- st_join(charging_stations, pop)

2.0.1 Aggregate the number of charging points per grid cell

head(chargingStationsPoly)
## Simple feature collection with 6 features and 19 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 563600.1 ymin: 5924303 xmax: 571126.4 ymax: 5938687
## projected CRS:  ETRS89 / UTM zone 32N
##                            gml_id standort anzahl_ladepunkte ladesaeule_status
## 1 APP_STROMNETZ_EMOBILITY_7035387      398                 2  teilweise belegt
## 2 APP_STROMNETZ_EMOBILITY_7035388      441                 2              frei
## 3 APP_STROMNETZ_EMOBILITY_7035389      305                 2  teilweise belegt
## 4 APP_STROMNETZ_EMOBILITY_7035390      476                 2              frei
## 5 APP_STROMNETZ_EMOBILITY_7035391      052                 3            belegt
## 6 APP_STROMNETZ_EMOBILITY_7035392      076                 3              frei
##                    adresse           koordinaten typ ladepunkt     stecker
## 1          Am Kaiserkai 53  N53.541290 E9.987390  AC      <NA> Typ2/Schuko
## 2    Neuenfelder Straße 19 N53.497900 E10.002170  AC      <NA> Typ2/Schuko
## 3   Bellealliancestraße 40  N53.568036 E9.960359  AC      <NA> Typ2/Schuko
## 4            Wallgraben 46  N53.463710 E9.978910  AC      <NA> Typ2/Schuko
## 5 Wandsbeker Zollstraße 15 N53.574780 E10.074180  DC      <NA>     CHAdeMO
## 6                Grasweg 6 N53.592820 E10.002150  AC      <NA>        Typ2
##   status          authmethod_1 authmethod_2   statID   id     left     top
## 1 belegt Öffentlich zugänglich   RFID-Karte charge_1 3277 565196.9 5933655
## 2   frei Öffentlich zugänglich   RFID-Karte charge_2 3359 566196.9 5928655
## 3   frei Öffentlich zugänglich   RFID-Karte charge_3 3120 563196.9 5936655
## 4   frei Öffentlich zugänglich   RFID-Karte charge_4 3285 565196.9 5925655
## 5 belegt Öffentlich zugänglich   RFID-Karte charge_5 3735 571196.9 5937655
## 6   frei Öffentlich zugänglich   RFID-Karte charge_6 3348 566196.9 5939655
##      right  bottom pop_total                     geom
## 1 566196.9 5932655       790 POINT (565431.5 5932941)
## 2 567196.9 5927655      2214 POINT (566478.7 5928127)
## 3 564196.9 5935655     13242 POINT (563600.1 5935892)
## 4 566196.9 5924655       578 POINT (564988.1 5924303)
## 5 572196.9 5936655      9534 POINT (571126.4 5936750)
## 6 567196.9 5938655     10461 POINT (566328.9 5938687)

We drop the geometry here since it would only get in the way of the reshaping.

chargingStationsPolyWide <- chargingStationsPoly %>% 
  st_drop_geometry() %>% 
  pivot_wider( id_cols = id, names_from = statID, values_from = anzahl_ladepunkte)

Next we simply add up numbers across rows, ignoring NA values

chargingStationsPolyWide$sumChargingPoints <- chargingStationsPolyWide %>%
  select(-one_of("id")) %>% 
  rowSums(na.rm=TRUE)

Drop the other columns

chargingStationsPolyWide <- chargingStationsPolyWide %>% 
  select(c(id, sumChargingPoints ))
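
For reference, the same aggregation could also be done more directly with dplyr's group_by/summarise instead of the wide pivot. A minimal sketch (this should give an equivalent result for the data used here, but it is not the approach taken above):

# Equivalent aggregation: sum charging points per grid cell id without the wide pivot
chargingStationsAgg <- chargingStationsPoly %>% 
  st_drop_geometry() %>% 
  group_by(id) %>% 
  summarise(sumChargingPoints = sum(anzahl_ladepunkte, na.rm = TRUE))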

2.0.2 Join aggregated table to polygons

We can join the resulting aggregated data frame to the polygon data set based on the field id. Let us check whether that is a unique identifier:

length(pop$id)
## [1] 7163
length(unique(pop$id))
## [1] 7163

Yes, no duplicates.

popCharging <- merge(pop, chargingStationsPolyWide, by="id", all.x=TRUE)
dim(popCharging)
## [1] 7163    8
head(popCharging)
## Simple feature collection with 6 features and 7 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 522696.9 ymin: 5969155 xmax: 523696.9 ymax: 5975155
## projected CRS:  ETRS89 / UTM zone 32N
##   id     left     top    right  bottom pop_total sumChargingPoints
## 1  1 523196.9 5975655 524196.9 5974655         3                NA
## 2  2 523196.9 5974655 524196.9 5973655         7                NA
## 3  3 523196.9 5973655 524196.9 5972655        10                NA
## 4  4 523196.9 5972655 524196.9 5971655        18                NA
## 5  5 523196.9 5971655 524196.9 5970655        10                NA
## 6  6 523196.9 5970655 524196.9 5969655        34                NA
##                         geometry
## 1 POLYGON ((522696.9 5975155,...
## 2 POLYGON ((522696.9 5974155,...
## 3 POLYGON ((522696.9 5973155,...
## 4 POLYGON ((522696.9 5972155,...
## 5 POLYGON ((522696.9 5971155,...
## 6 POLYGON ((522696.9 5970155,...
summary(popCharging)
##        id            left             top              right       
##  Min.   :   1   Min.   :523197   Min.   :5899655   Min.   :524197  
##  1st Qu.:1832   1st Qu.:546197   1st Qu.:5918655   1st Qu.:547197  
##  Median :3623   Median :570197   Median :5937655   Median :571197  
##  Mean   :3629   Mean   :569821   Mean   :5937938   Mean   :570821  
##  3rd Qu.:5424   3rd Qu.:593197   3rd Qu.:5956655   3rd Qu.:594197  
##  Max.   :7237   Max.   :616197   Max.   :5975655   Max.   :617197  
##                                                                    
##      bottom          pop_total       sumChargingPoints          geometry   
##  Min.   :5898655   Min.   :    3.0   Min.   : 4.00     POLYGON      :7163  
##  1st Qu.:5917655   1st Qu.:   22.0   1st Qu.: 4.00     epsg:25832   :   0  
##  Median :5936655   Median :  104.0   Median : 8.00     +proj=utm ...:   0  
##  Mean   :5936938   Mean   :  663.6   Mean   :11.99                         
##  3rd Qu.:5955655   3rd Qu.:  480.0   3rd Qu.:16.00                         
##  Max.   :5974655   Max.   :20088.0   Max.   :56.00                         
##                    NA's   :2067      NA's   :6969

We have a lot of NA values, which in our case indicate that no charging points are present. So we should set them to zero.

idx <- which(is.na(popCharging$sumChargingPoints))
popCharging$sumChargingPoints[idx] <- 0

For pop_total we have real missing values. For the regression analysis it is best to drop those rows.

idx <- which(is.na(popCharging$pop_total))
popChargingCleaned <- popCharging[-idx,]

2.1 Fit non-spatial regression models

2.1.1 Linear model

ggplot(popChargingCleaned, mapping= aes(x=pop_total, y= sumChargingPoints)) +
  geom_point()

So it looks as if there is a relationship between the two variables. Let's fit a simple linear model to the data.

lmModel <- lm(sumChargingPoints ~ pop_total, data=popChargingCleaned)
summary(lmModel)
## 
## Call:
## lm(formula = sumChargingPoints ~ pop_total, data = popChargingCleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.099  -0.107   0.271   0.358  42.846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.806e-01  3.728e-02  -10.21   <2e-16 ***
## pop_total    1.259e-03  2.228e-05   56.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.443 on 5094 degrees of freedom
## Multiple R-squared:  0.3853, Adjusted R-squared:  0.3852 
## F-statistic:  3193 on 1 and 5094 DF,  p-value: < 2.2e-16

So we have a significant positive effect of population on the number of charging points: for every additional 1000 inhabitants in a grid cell we expect on average 1.259 more charging points.
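
As a quick sanity check of that statement, we can multiply the estimated slope by 1000 (not part of the original analysis):

# expected change in charging points per 1000 additional inhabitants
coef(lmModel)["pop_total"] * 1000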

What does the regression line look like?

ggplot(popChargingCleaned, mapping= aes(x=pop_total, y= sumChargingPoints)) +
  geom_point(alpha=.25) + 
  geom_smooth(method="lm") + 
  labs(title="Charging stations Hamburg", subtitle = "1sqm grid cells") +
  xlab("Population") + ylab("Number of charging points")
## `geom_smooth()` using formula 'y ~ x'

The regression line does not fit the data too well.

What do the residuals look like?

par(mfrow=c(2,3))
plot(lmModel, which=1:6)

Long story short: the residuals are not normally distributed and a few grid cells have a large Cook's distance (influential points).

2.1.2 Count regression

Another option would be to fit a generalized linear model with a Poisson distribution and a log link:
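
Formally, the Poisson GLM assumes \(y_i \sim Poisson(\lambda_i)\) with \(\log(\lambda_i) = \beta_0 + \beta_1 x_i\), where \(x_i\) is pop_total, so the expected count changes multiplicatively with the predictor.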

poisModel <- glm(sumChargingPoints ~ pop_total, data=popChargingCleaned, family=poisson)
summary(poisModel)
## 
## Call:
## glm(formula = sumChargingPoints ~ pop_total, family = poisson, 
##     data = popChargingCleaned)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -19.4121   -0.6592   -0.6255   -0.6181   14.2015  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.662e+00  2.900e-02   -57.3   <2e-16 ***
## pop_total    3.660e-04  2.816e-06   130.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 16759.2  on 5095  degrees of freedom
## Residual deviance:  8609.3  on 5094  degrees of freedom
## AIC: 9384.3
## 
## Number of Fisher Scoring iterations: 6
ggplot(popChargingCleaned, mapping= aes(x=pop_total, y= sumChargingPoints)) +
  geom_point(alpha=.25) + 
  geom_smooth(method="glm", method.args = list(family = "poisson")) +
  labs(title="Charging stations Hamburg", subtitle = "1sqm grid cells, Poisson GLM") +
  xlab("Population") + ylab("Number of charging points")
## `geom_smooth()` using formula 'y ~ x'

AIC(lmModel)
## [1] 23569.92
AIC(poisModel)
## [1] 9384.344

The Poisson model fits the data better, indicated by the smaller AIC. However, it looks unrealistic for large population values. We might think about transforming the predictor, e.g. by a square root transformation.

poisModel <- glm(sumChargingPoints ~ sqrt(pop_total), data=popChargingCleaned, family=poisson)
summary(poisModel)
## 
## Call:
## glm(formula = sumChargingPoints ~ sqrt(pop_total), family = poisson, 
##     data = popChargingCleaned)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.4747   -0.5059   -0.3777   -0.3270   14.2358  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.1664820  0.0469653  -67.42   <2e-16 ***
## sqrt(pop_total)  0.0577123  0.0005307  108.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 16759.2  on 5095  degrees of freedom
## Residual deviance:  6548.9  on 5094  degrees of freedom
## AIC: 7323.9
## 
## Number of Fisher Scoring iterations: 6

The AIC is again smaller, indicating a better fit.

ggplot(popChargingCleaned, mapping= aes(x=pop_total, y= sumChargingPoints)) +
  geom_point(alpha=.25) + 
  geom_smooth(method="glm", formula= y ~ sqrt(x), method.args = list(family = "poisson")) + 
  labs(title="Charging stations Hamburg", subtitle = "1sqm grid cells, Poisson GLM") +
  xlab("Population") + ylab("Number of charging points")

2.2 Exercise

Rerun the analysis using the data from Berlin. The population data is stored in the geopackage Berlin.gpkg, the name of the layer is population.

Take care: the Berlin data does not contain information on the number of charging points. The simplest solution is to add a field count, set it to 1 and use that instead of the number of charging points. Also, the attribute names differ between the data sets: the id of the population data set is named idPoly, while the charging stations have a field id as a unique identifier. A minimal preparation sketch is shown below.
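
A possible first step (the name of the charging station layer in Berlin.gpkg is an assumption here; inspect the available layers first):

# list the layers in the geopackage; the charging station layer name below is hypothetical
st_layers(paste0(dataFolder, "Berlin.gpkg"))
charging_stations_b <- st_read(dsn = paste0(dataFolder, "Berlin.gpkg"), layer = "chargingStations")
charging_stations_b$count <- 1   # use this field instead of anzahl_ladepunkte
pop_b <- st_read(dsn = paste0(dataFolder, "Berlin.gpkg"), layer = "population")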

2.2.1 Check spatial autocorrelation

2.2.1.1 Define neighborhood

Define the neighborhood as polygons sharing at least one point, i.e. queen contiguity, which is the default.

pop.nb <- poly2nb(popChargingCleaned)
print(pop.nb)
## Neighbour list object:
## Number of regions: 5096 
## Number of nonzero links: 31556 
## Percentage nonzero weights: 0.1215131 
## Average number of links: 6.192308 
## 6 regions with no links:
## 1487 5912 6828 6934 7153 7163

Six regions without links were reported!
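
We can identify these isolated cells directly; card() from spdep returns the number of neighbours per region (a small sketch):

# grid cells without any contiguity neighbours
noNeighbors <- which(card(pop.nb) == 0)
popChargingCleaned$id[noNeighbors]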

Convert to sp objects; this is needed for some of the analysis functions. sp and sf are two packages designed to work with spatial data. sf is more recent and, from my perspective, easier to work with and faster. Since sp has been around for longer, more packages accept it (and not sf objects) as input. While a nuisance, it takes little effort to convert from one representation to the other, and the conversion is lossless.
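
A minimal illustration of the conversion in both directions (illustration only, the object names are arbitrary; the sf-to-sp direction is used right below):

# sf -> sp: needed by some sp/spdep-based plotting helpers
popSpDemo <- as(popChargingCleaned, "Spatial")
# sp -> sf: lossless round trip
popSfDemo <- st_as_sf(popSpDemo)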

Attach the residuals to the polygon data set:

popChargingCleaned$residLm <- residuals(lmModel)
popChargingCleaned$residPoissonSqrt <- residuals(poisModel)
popChargingCleanedSp <- as(popChargingCleaned, "Spatial")
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum European Terrestrial Reference System 1989 in CRS
## definition
plot(popChargingCleanedSp, border="grey")
plot(pop.nb, coords=coordinates(popChargingCleanedSp)[,1:2], col="brown", add=TRUE, points=FALSE)

The neighbor list is symmetric.

is.symmetric.nb(pop.nb)
## [1] TRUE

Create a spatial weights matrix from the neighbors using row standardization

pop.lw <- nb2listw(pop.nb, style="W", zero.policy = TRUE)
hist(unlist(pop.lw$weights), las=1, xlab="weights", main="Contiguity nb, W")

2.2.1.1.1 kNN as an alternative

As an alternative we could use e.g. the 8 nearest neighbors

pop.k8nb <- knn2nb(knearneigh(coordinates(popChargingCleanedSp), k = 8))
plot(popChargingCleanedSp, border="grey")
plot(pop.k8nb, coords=coordinates(popChargingCleanedSp), col="red", add=TRUE, cex=.7)

The neighbor list is not symmetric.

is.symmetric.nb(pop.k8nb)
## [1] FALSE

Create the spatial weights list.

pop.k8lw <- nb2listw(pop.k8nb, style="W")

All grid cells have the same number of neighbors, so all weights are identical

par(mfrow=c(1,2))
hist(unlist(pop.lw$weights), las=1, xlab="weights", main="Contiguity nb, W", col="blue")
hist(unlist(pop.k8lw$weights), las=1, xlab="weights", main="nn-8 nb, W", col="blue")

We could also calculate distances and inversely weight by distance before row-standardization.

dlist <- nbdists(pop.k8nb, coordinates(popChargingCleanedSp))
dlist <- lapply(dlist, function(x) 1/x)
pop.k8lwDist <- nb2listw(pop.k8nb, glist=dlist, style = "W")
hist(unlist(pop.k8lwDist$weights), las=1, xlab="weights", main="nn-8 nb, idw, W", col="blue")

The idw (inverse distance weighted) approach can be applied to the other neighborhood definitions as well.

2.2.1.2 Global spatial autocorrelation - Moran’s I

Moran’s I is calculated based on a neighborhood definition (a spatial weights list), the number of zones (typically the number of features) and the global sum of weights. Convenience functions help to obtain the required parameters.
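
For reference, for a variable \(x\) observed in \(n\) zones with spatial weights \(w_{ij}\) and global weight sum \(S_0 = \sum_i \sum_j w_{ij}\), Moran’s I is defined as

\(I = \frac{n}{S_0} \cdot \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}\)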

moran.test provides the corresponding test statistic; for regression residuals we use lm.morantest below. As usual, do not just get excited about p-values but also look at the absolute numbers.
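
For a plain variable the call would look like this, sketched here on the lm residuals (the lm.morantest calls below are preferable for residuals since they use the correct expectation under the fitted model):

# Moran's I test, treating the residuals as an ordinary variable
moran.test(popChargingCleaned$residLm, listw = pop.lw, zero.policy = TRUE)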

lm.morantest(lmModel, listw = pop.lw, zero.policy=TRUE)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = sumChargingPoints ~ pop_total, data =
## popChargingCleaned)
## weights: pop.lw
## 
## Moran I statistic standard deviate = 59.959, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.5006840535    -0.0003277572     0.0000698220
lm.morantest(poisModel, listw = pop.lw, zero.policy=TRUE)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm(formula = sumChargingPoints ~ sqrt(pop_total), family =
## poisson, data = popChargingCleaned)
## weights: pop.lw
## 
## Moran I statistic standard deviate = 45.506, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     3.887178e-01    -1.574914e-03     7.356117e-05

This indicates strong positive spatial autocorrelation for both the linear model as well as the GLM.

2.2.2 Correlogram

A correlogram basically calculates Moran’s I for different distance classes. We use the function correlog from the package ncf for this purpose. The method works on distances and does not consider a weight matrix W.

increment specifies the size of the bins (distance bands) for the calculation and is measured in the coordinates of the spatial data (meters for our data set).

resamp allows the specification of the number of iterations for a permutation test.

The results contain the value of Moran’s I in the slot correlation, the average distance for each distance class in mean.of.class, the number of pairs within each distance class in n, and the p-value (if resamp > 0) in p.

Here I plot the first 20 distance bands, scaling the point size by the number of pairs in each distance band.

correlogResidLm <- ncf::correlog(coordinates(popChargingCleanedSp)[,1], 
                                 coordinates(popChargingCleanedSp)[,2], 
                                 popChargingCleanedSp$residLm,
                                 na.rm=TRUE, increment=500, resamp=0)

par(mar=c(5,7, 4.1, 0.1))
plot(correlogResidLm$correlation[1:20] ~ correlogResidLm$mean.of.class[1:20] , type="b", pch=16, 
     cex=sqrt(correlogResidLm$n)/70, lwd=1.5, xlab="distance ", ylab="", cex.lab=1, cex.axis=1, 
     las=1, main="Linear model, Hamburg")
abline(h=0)
mtext(text = "Moran's I", side = 2, line = 5, cex=1)

We see how spatial autocorrelation decreases with distance between grid cells.

2.2.3 Local Moran’s I

First we source a script file that helps us with plotting.

source("plotLocalMoransI.r")
## Loading required package: maptools
## Checking rgeos availability: TRUE
mylocalMI(plotvar= popChargingCleanedSp$residLm, nb =  pop.lw, shape =  popChargingCleanedSp, 
          main = "LISA on residuals of linear model, Hamburg", verbose = FALSE, zero.policy = TRUE)

Clearly, there is a cluster of high residuals in the center, surrounded by cells with low residuals, as well as several further clusters of low residuals.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(popChargingCleanedSp) + 
  tm_fill(col="residLm", palette = "-RdBu", midpoint=0)
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

So we clearly have a problem with spatial autocorrelation. Let’s look at different possibilities for dealing with it.

2.3 Spatial lag model

We fit a spatial lag model that has the form:

\(y = \rho W y + X \beta + \epsilon\)

The operation is computationally intensive, so you can potentially get a cup of coffee while you wait for the results. Setting method=“SE_classic” is much faster due to the approximation used and might be fine for our case.

lagModel2 <- lagsarlm(sumChargingPoints ~ pop_total, data=popChargingCleaned, listw=pop.lw, 
                      zero.policy=TRUE, method="SE_classic")
summary(lagModel2)
## 
## Call:
## lagsarlm(formula = sumChargingPoints ~ pop_total, data = popChargingCleaned, 
##     listw = pop.lw, method = "SE_classic", zero.policy = TRUE)
## 
## Residuals:
##         Min          1Q      Median          3Q         Max 
## -15.4662575  -0.0083005   0.1732037   0.2114703  33.8309176 
## 
## Type: lag 
## Regions with no neighbours included:
##  1487 5912 6828 6934 7153 7163 
## Coefficients: (numerical Hessian approximate standard errors) 
##                Estimate  Std. Error z value  Pr(>|z|)
## (Intercept) -2.2143e-01  2.6245e-02 -8.4371 < 2.2e-16
## pop_total    5.2420e-04  1.5712e-05 33.3636 < 2.2e-16
## 
## Rho: 0.707, LR test value: 2941, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 4.0254e-06
##     z-value: 175630, p-value: < 2.22e-16
## Wald statistic: 3.0848e+10, p-value: < 2.22e-16
## 
## Log likelihood: -10311.45 for lag model
## ML residual variance (sigma squared): 2.9729, (sigma: 1.7242)
## Number of observations: 5096 
## Number of parameters estimated: 4 
## AIC: 20631, (AIC for lm: 23570)

This would be the more computationally intensive call, which leads to very similar results for the data at hand.

lagModel <- lagsarlm(sumChargingPoints ~ pop_total, data=popChargingCleaned, listw=pop.lw, zero.policy=TRUE)
summary(lagModel)

Interpretation of the regression coefficients is not straightforward. We need to distinguish between the direct, the indirect (via the lagged response) and the total effect. The total effect of a unit change in one of the predictors is: \(\hat{\beta}/ (1- \hat{\rho})\)

Using the full weights matrix is computationally intensive for a weights matrix of our size:

impacts(lagModel2, listw= pop.lw)

We will speed up the computation by providing a vector of traces of powers of the spatial weights matrix, created using trW, for approximate impact measures.

W <- as(pop.lw, "CsparseMatrix")
#trMatc <- trW(W, type="mult")
trMC <- trW(W, type="MC")
impacts(lagModel2, tr=trMC)
## Impact measures (lag, trace):
##                 Direct    Indirect       Total
## pop_total 0.0006064485 0.001182576 0.001789024

We see that the total effect is nearly three times as large as the direct effect. This is a consequence of the large \(\rho\) value.
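
A quick back-of-the-envelope check using the reported estimates (\(\hat{\beta} = 5.242 \cdot 10^{-4}\), \(\hat{\rho} = 0.707\)):

# total effect = beta / (1 - rho); roughly 0.00179, in line with the Total impact above
5.2420e-04 / (1 - 0.707)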

2.3.1 Spatial Durbin linear (SLX, spatially lagged X) model

We now fit an lm model augmented with the spatially lagged right-hand-side (RHS) variables. The idea is to use lagged versions of the predictors in addition to the predictors themselves.
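
In the notation used above this corresponds to

\(y = X \beta + W X \theta + \epsilon\)

where \(\theta\) holds the coefficients of the spatially lagged predictors (lag.pop_total in the output below).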

slxModel <- lmSLX(sumChargingPoints ~ pop_total, data=popChargingCleaned, listw=pop.lw, zero.policy=TRUE)
summary(slxModel)
## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.206  -0.177   0.322   0.482  44.504 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.6074666  0.0375937  -16.16   <2e-16 ***
## pop_total      0.0006282  0.0000379   16.58   <2e-16 ***
## lag.pop_total  0.0009433  0.0000467   20.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.351 on 5093 degrees of freedom
## Multiple R-squared:  0.4309, Adjusted R-squared:  0.4307 
## F-statistic:  1928 on 2 and 5093 DF,  p-value: < 2.2e-16
impacts(slxModel, tr=trMC)
## Impact measures (SLX, estimable):
##                Direct     Indirect       Total
## pop_total 0.000628213 0.0009432578 0.001571471
popChargingCleaned$residLag <- residuals(lagModel2)
popChargingCleaned$residSlx <- residuals(slxModel)
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(popChargingCleaned) + 
  tm_fill(col="residLag", palette = "-RdBu", midpoint=0, alpha=0.5) +
  tm_layout(title="Residuals spatial lag model, Hamburg")
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(popChargingCleaned) + 
  tm_fill(col="residSlx", palette = "-RdBu", midpoint=0, alpha=0.5) +
  tm_layout(title="Residuals spatially lagged X model, Hamburg")

If we study the distribution of the residuals we see that the patterns differ: the spatial lag model shows a more random pattern, which is beneficial.
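
To back up this impression quantitatively we could test the lag-model residuals for remaining autocorrelation (a sketch, output not shown here):

# remaining spatial autocorrelation in the residuals of the spatial lag model
moran.test(popChargingCleaned$residLag, listw = pop.lw, zero.policy = TRUE)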

2.4 Spatial eigenvector mapping

If we want to use the count regression approach (which is more suitable), spatial eigenvector mapping is the tool at hand. Unfortunately, it is again computationally intensive.

sevmPoiss <- ME(sumChargingPoints ~ sqrt(pop_total), data=popChargingCleaned, family=poisson, zero.policy = TRUE, listw=pop.lw)
## Warning: glm.fit: fitted rates numerically 0 occurred

## Warning: glm.fit: fitted rates numerically 0 occurred

## Warning: glm.fit: fitted rates numerically 0 occurred

## Warning: glm.fit: fitted rates numerically 0 occurred

## Warning: glm.fit: fitted rates numerically 0 occurred

## Warning: glm.fit: fitted rates numerically 0 occurred

## Warning: glm.fit: fitted rates numerically 0 occurred
print(sevmPoiss)
##   Eigenvector ZI pr(ZI)
## 0          NA NA   0.01
## 1         695 NA   0.01
## 2         118 NA   0.01
## 3         617 NA   0.01
## 4         256 NA   0.01
## 5         539 NA   0.01
## 6          71 NA   0.01
## 7         814 NA   0.09
sevmPoissModel <- glm(sumChargingPoints ~ sqrt(pop_total) + fitted(sevmPoiss), data=popChargingCleaned, family=poisson)
summary(sevmPoissModel)
## 
## Call:
## glm(formula = sumChargingPoints ~ sqrt(pop_total) + fitted(sevmPoiss), 
##     family = poisson, data = popChargingCleaned)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -9.4185  -0.5059  -0.3354  -0.2436  12.2322  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -3.523e+00  5.414e-02 -65.075   <2e-16 ***
## sqrt(pop_total)          5.607e-02  6.324e-04  88.663   <2e-16 ***
## fitted(sevmPoiss)vec695  6.257e-01  1.275e+00   0.491    0.624    
## fitted(sevmPoiss)vec118  2.739e+01  1.510e+00  18.143   <2e-16 ***
## fitted(sevmPoiss)vec617  1.961e+01  1.701e+00  11.528   <2e-16 ***
## fitted(sevmPoiss)vec256 -1.990e+01  1.571e+00 -12.666   <2e-16 ***
## fitted(sevmPoiss)vec539 -2.060e+01  1.257e+00 -16.386   <2e-16 ***
## fitted(sevmPoiss)vec71  -3.895e+01  1.939e+00 -20.093   <2e-16 ***
## fitted(sevmPoiss)vec814 -1.679e+01  1.809e+00  -9.282   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 16759.2  on 5095  degrees of freedom
## Residual deviance:  4950.9  on 5087  degrees of freedom
## AIC: 5739.9
## 
## Number of Fisher Scoring iterations: 6

Presumably we have missed a lot of other predictors that are now partially absorbed into the different eigenvectors. It might be worthwhile to plot and investigate the eigenvectors that made it into the model. Therefore, we attach the selected eigenvectors to the sf object and plot them.

selEV <- fitted(sevmPoiss)

popChargingCleanedSelEV <- st_sf(data.frame(popChargingCleaned, selEV))
for(aName in colnames(selEV))
{ 
  plot(popChargingCleanedSelEV[aName], border=NA) 
}

Most of the selected spatial eigenvectors represent rather short-range spatial autocorrelation patterns, which presumably represent the spillover effects we captured in the spatial lag model. In addition we see some relatively long-range spatial structures (vec 71, 118 and 256).

3 Exercises

Try the following changes to the analysis:

3.1 Exercise a

How do the estimated regression coefficients change if the 8 nearest neighbors are used in the idw version, both for the spatial lag and the spatially lagged X model?

3.2 Exercise b

Rerun the analysis using the data from Berlin. The population data is stored in the geopackage Berlin.gpkg, the name of the layer is population.