1 Preparation

dataFolder <- paste0(ws, "data/")

Load data from geopackage

charging_stations <- st_read(dsn=paste0(dataFolder, "chargingStations_berlin.shp"))
## Reading layer `chargingStations_berlin' from data source `/home/slautenb/Documents/lehre/HD/ws_2020_21/heikaLab/R4UrbanDataLab_2020/data/chargingStations_berlin.shp' using driver `ESRI Shapefile'
## Simple feature collection with 486 features and 20 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 373818.6 ymin: 5807999 xmax: 407135.9 ymax: 5829471
## projected CRS:  ETRS89 / UTM zone 33N

2 Non-spatial regression

What drives the clustered pattern of charging stations? One assumption could be that it is related to population density. The more people inhabit a plce the higher the number of charing statiosn. This is of course overly simplistic but it is a start.

Read population shapefile

pop <- st_read(dsn= paste0(dataFolder, "Berlin.gpkg"), layer="population")
## Reading layer `population' from data source `/home/slautenb/Documents/lehre/HD/ws_2020_21/heikaLab/R4UrbanDataLab_2020/data/Berlin.gpkg' using driver `GPKG'
## Simple feature collection with 805 features and 2 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 372713.8 ymin: 5807488 xmax: 407713.8 ymax: 5830488
## projected CRS:  ETRS89 / UTM zone 33N
summary(pop)
##    pop_total             idPoly               geom    
##  Min.   :    0.229   Min.   :  1   POLYGON      :805  
##  1st Qu.:  826.406   1st Qu.:202   epsg:NA      :  0  
##  Median : 2751.180   Median :403   +proj=utm ...:  0  
##  Mean   : 3999.399   Mean   :403                      
##  3rd Qu.: 5724.300   3rd Qu.:604                      
##  Max.   :20881.894   Max.   :805                      
##  NA's   :14

-1 represents mising data so these need to be set to NA

idx <- which(pop$pop_total == -1)
pop$pop_total[idx] <- NA
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(pop) + tm_fill(col="pop_total", convert2density = TRUE, alpha=.5, pallette="plasma", style="kmeans") + tm_shape(charging_stations) + tm_dots(alpha=.5)

## Spatial join between the data

To study the relationship between the two variables we need to combine them. A straight forward approach is a spatial join that counts the number of charging places (one charging station could have more than one charging point) per vector grid cell of the population data set.

First we join the polygon information to the charging stations.

Add a simple ID to the charging stations since gml_id is cumbersome.

charging_stations$statID <- paste0("charge_", 1:nrow(charging_stations))
chargingStationsPoly <- st_join(charging_stations, pop)

2.0.1 Aggregate the number of charging points per grid cell

head(chargingStationsPoly)
## Simple feature collection with 6 features and 23 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 389462.1 ymin: 5820993 xmax: 391281.2 ymax: 5821719
## projected CRS:  ETRS89 / UTM zone 33N
##   id quelle p1_kw         steckertyp p2_kw steckert_1 p3_kw steckert_2 p4_kw
## 1  1 bnetza    11 AC Steckdose Typ 2    11       <NA>    NA       <NA>    NA
## 2  2 bnetza    11 AC Steckdose Typ 2    11       <NA>    NA       <NA>    NA
## 3  3 bnetza    11 AC Steckdose Typ 2    11       <NA>    NA       <NA>    NA
## 4  4 bnetza    11 AC Steckdose Typ 2    11       <NA>    NA       <NA>    NA
## 5  5 bnetza    11 AC Steckdose Typ 2    11       <NA>    NA       <NA>    NA
## 6  6 bnetza    22 AC Steckdose Typ 2    22       <NA>    NA       <NA>    NA
##                inbetrieb_   x_wert  y_wert                       betreiber
## 1 2017/03/21 00:00:00.000 391281.2 5821719                     Allego GmbH
## 2 2017/03/21 00:00:00.000 391281.2 5821719                     Allego GmbH
## 3 2017/03/13 00:00:00.000 390867.9 5820993                     Allego GmbH
## 4 2017/01/26 00:00:00.000 389462.1 5821715                     Allego GmbH
## 5 2016/12/21 00:00:00.000 389951.3 5821431                     Allego GmbH
## 6 2010/04/14 00:00:00.000 391180.4 5821101 innogy eMobility Solutions GmbH
##              ladeeinr_a      plz_ort           adresse ladeart bemerkunge
## 1 Normalladeeinrichtung 10115 Berlin  Anklamer Str. 12    <NA>       <NA>
## 2 Normalladeeinrichtung 10115 Berlin  Anklamer Str. 12    <NA>       <NA>
## 3 Normalladeeinrichtung 10115 Berlin      Borsigstr. 1    <NA>       <NA>
## 4 Normalladeeinrichtung 10115 Berlin     Kieler Str. 5    <NA>       <NA>
## 5 Normalladeeinrichtung 10115 Berlin Habersaathstr. 34    <NA>       <NA>
## 6 Normalladeeinrichtung 10115 Berlin     Torstra�e 165    <NA>       <NA>
##   anschlussl         steckert_3   statID pop_total idPoly
## 1         22 AC Steckdose Typ 2 charge_1 14295.598    299
## 2         22 AC Steckdose Typ 2 charge_2 14295.598    299
## 3         22 AC Steckdose Typ 2 charge_3 11397.844    334
## 4         22 AC Steckdose Typ 2 charge_4  6535.285    297
## 5         22 AC Steckdose Typ 2 charge_5  4881.026    333
## 6         44 AC Steckdose Typ 2 charge_6 11397.844    334
##                   geometry
## 1 POINT (391281.2 5821719)
## 2 POINT (391281.2 5821719)
## 3 POINT (390867.9 5820993)
## 4 POINT (389462.1 5821715)
## 5 POINT (389951.3 5821431)
## 6 POINT (391180.4 5821101)

We will drop geometries here since they tend to mix up things. We have no information on the number of charging points so we will simply use the count

chargingStationsPoly$count <- 1
chargingStationsPolyWide <- chargingStationsPoly %>%
  st_drop_geometry() %>% 
  pivot_wider( id_cols = idPoly, names_from = statID, values_from = count)
head(chargingStationsPolyWide[,1:10])
## # A tibble: 6 x 10
##   idPoly charge_1 charge_2 charge_3 charge_4 charge_5 charge_6 charge_7 charge_8
##    <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1    299        1        1       NA       NA       NA       NA       NA       NA
## 2    334       NA       NA        1       NA       NA        1        1       NA
## 3    297       NA       NA       NA        1       NA       NA       NA        1
## 4    333       NA       NA       NA       NA        1       NA       NA       NA
## 5    368       NA       NA       NA       NA       NA       NA       NA       NA
## 6    404       NA       NA       NA       NA       NA       NA       NA       NA
## # … with 1 more variable: charge_9 <dbl>

Next we simply add up numbers across rows, ignoring NA values

chargingStationsPolyWide$sumChargingStations <- chargingStationsPolyWide %>% 
  select(-one_of("idPoly")) %>% 
  rowSums(na.rm=TRUE)
head(chargingStationsPolyWide[,c("idPoly", "sumChargingStations", "charge_1", "charge_2")])
## # A tibble: 6 x 4
##   idPoly sumChargingStations charge_1 charge_2
##    <int>               <dbl>    <dbl>    <dbl>
## 1    299                   6        1        1
## 2    334                   4       NA       NA
## 3    297                   3       NA       NA
## 4    333                   3       NA       NA
## 5    368                   8       NA       NA
## 6    404                   5       NA       NA

Drop the other columns

chargingStationsPolyWide <- chargingStationsPolyWide %>% select(c(idPoly, sumChargingStations ))
head(chargingStationsPolyWide)
## # A tibble: 6 x 2
##   idPoly sumChargingStations
##    <int>               <dbl>
## 1    299                   6
## 2    334                   4
## 3    297                   3
## 4    333                   3
## 5    368                   8
## 6    404                   5

2.0.2 Join aggregated table to polygons

We can join the resulting aggregated data frame to the polygon data set based on glm_id. Let us check if that is a qunique id:

length(pop$idPoly)
## [1] 805
length(unique(pop$idPoly))
## [1] 805

Yes, no dublicates.

popCharging <- merge(pop, chargingStationsPolyWide, by="idPoly", all.x=TRUE)
dim(popCharging)
## [1] 805   4
head(popCharging)
## Simple feature collection with 6 features and 3 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 372713.8 ymin: 5829488 xmax: 378713.8 ymax: 5830488
## projected CRS:  ETRS89 / UTM zone 33N
##   idPoly   pop_total sumChargingStations                       geometry
## 1      1  659.475547                  NA POLYGON ((372713.8 5830488,...
## 2      2  491.451880                  NA POLYGON ((373713.8 5830488,...
## 3      3  194.851756                  NA POLYGON ((374713.8 5830488,...
## 4      4    2.889551                  NA POLYGON ((375713.8 5830488,...
## 5      5  386.571960                  NA POLYGON ((376713.8 5830488,...
## 6      6 1120.683779                  NA POLYGON ((377713.8 5830488,...
summary(popCharging)
##      idPoly      pop_total         sumChargingStations          geometry  
##  Min.   :  1   Min.   :    0.229   Min.   : 1.000      POLYGON      :805  
##  1st Qu.:202   1st Qu.:  826.406   1st Qu.: 1.000      epsg:NA      :  0  
##  Median :403   Median : 2751.180   Median : 2.000      +proj=utm ...:  0  
##  Mean   :403   Mean   : 3999.399   Mean   : 2.531                         
##  3rd Qu.:604   3rd Qu.: 5724.300   3rd Qu.: 3.000                         
##  Max.   :805   Max.   :20881.894   Max.   :36.000                         
##                NA's   :14          NA's   :613

We have a lot of NA values which indicate - in our case here - that no charging points are present. So we should set them to zero.

idx <- which(is.na(popCharging$sumChargingStations))
popCharging$sumChargingStations[idx] <- 0

For pop_total we have real missing values. For the regression analysis it is best to drop those rows.

idx <- which(is.na(popCharging$pop_total))
popChargingCleaned <- popCharging[-idx,]

2.1 Fit non-spatial regression models

2.1.1 Linear model

ggplot(popChargingCleaned, mapping= aes(x=pop_total, y= sumChargingStations)) + geom_point()

The relationship looks different from the one from Hamburg. There are some interesting outliers.

lmModel <- lm(sumChargingStations ~ pop_total, data=popChargingCleaned)
summary(lmModel)
## 
## Call:
## lm(formula = sumChargingStations ~ pop_total, data = popChargingCleaned)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.352 -0.511 -0.058  0.201 33.707 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.407e-01  8.100e-02  -2.971  0.00306 ** 
## pop_total    2.138e-04  1.424e-05  15.016  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.62 on 789 degrees of freedom
## Multiple R-squared:  0.2223, Adjusted R-squared:  0.2213 
## F-statistic: 225.5 on 1 and 789 DF,  p-value: < 2.2e-16

So we have a signisifcant positive effect of population on the number of chargng points. For grid cells with 1000 inhabitants more we expect on average 2.132 more charging stations.

How does the regression line looks like?

ggplot(popChargingCleaned, mapping= aes(x=pop_total, y= sumChargingStations)) + geom_point(alpha=.25) + geom_smooth(method="lm") + labs(title="Charging stations Berlin", subtitle = "1sqm grid cells") + xlab("Population") + ylab("Number of charging stations")
## `geom_smooth()` using formula 'y ~ x'

How do the residulas look like?

par(mfrow=c(2,3))
plot(lmModel, which=1:6)

Long story short: not normally distributed (thought not as bad as for Hamburg), a few grid cells with large cook’s distance (influential points).

2.1.2 Count regression

Another option would be to fit a generalized linear model with a Poisson distribution and a log link:

poisModel <- glm(sumChargingStations ~ pop_total, data=popChargingCleaned, family=poisson)
summary(poisModel)
## 
## Call:
## glm(formula = sumChargingStations ~ pop_total, family = poisson, 
##     data = popChargingCleaned)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6867  -0.8051  -0.6393  -0.5709  12.0378  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.822e+00  8.582e-02  -21.23   <2e-16 ***
## pop_total    2.063e-04  7.513e-06   27.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1710.0  on 790  degrees of freedom
## Residual deviance: 1063.3  on 789  degrees of freedom
## AIC: 1565.2
## 
## Number of Fisher Scoring iterations: 6
ggplot(popChargingCleaned, mapping= aes(x=pop_total, y= sumChargingStations)) + geom_point(alpha=.25) + geom_smooth(method="glm", method.args = list(family = "poisson")) + labs(title="Charging stations Berlin", subtitle = "1sqm grid cells, Poisson GLM") + xlab("Population") + ylab("Number of charging stations")
## `geom_smooth()` using formula 'y ~ x'

AIC(lmModel)
## [1] 3012.243
AIC(poisModel)
## [1] 1565.221

The Poisson model fits the data better, indicated by the smaller AIC. HWe might think about transforming the predictor, e.g. by a square root transformation.

poisModel <- glm(sumChargingStations ~ sqrt(pop_total), data=popChargingCleaned, family=poisson)
summary(poisModel)
## 
## Call:
## glm(formula = sumChargingStations ~ sqrt(pop_total), family = poisson, 
##     data = popChargingCleaned)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1408  -0.8106  -0.5149  -0.3159  11.5776  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.282084   0.147313  -22.28   <2e-16 ***
## sqrt(pop_total)  0.037337   0.001482   25.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1710.01  on 790  degrees of freedom
## Residual deviance:  994.35  on 789  degrees of freedom
## AIC: 1496.2
## 
## Number of Fisher Scoring iterations: 6
AIC(poisModel)
## [1] 1496.236

AIC value smaller -> better fit.

ggplot(popChargingCleaned, mapping= aes(x=pop_total, y= sumChargingStations)) + geom_point(alpha=.25) + geom_smooth(method="glm", formula= y ~ sqrt(x), method.args = list(family = "poisson")) + labs(title="Charging stations Berlin", subtitle = "1sqm grid cells, Poisson GLM") + xlab("Population") + ylab("Number of charging stations")

2.1.3 Check spatial autocorrelation

2.1.3.1 Define neighborhood

Define neighborhood as polygons sharing at least one point - queen neighborhood as the default.

pop.nb <- poly2nb(popChargingCleaned)
print(pop.nb)
## Neighbour list object:
## Number of regions: 791 
## Number of nonzero links: 5908 
## Percentage nonzero weights: 0.9442511 
## Average number of links: 7.469027

No regions without links were reported!

Convert to sp objects. This is needed for some analysis. sp and sf are two packes designed to work with spatial data. sf is more recent and from my perspective easier to work with and faster. Sinc sp has been around for a longer time it has more packages accepting it as an input (and not sf objects). While a nuisance, it is not a big effort to convert from one representation to another and it is a lossless conversion.

Get the residuals to the polygon data set:

popChargingCleaned$residLm <- residuals(lmModel)
popChargingCleaned$residPoissonSqrt <- residuals(poisModel)
popChargingCleanedSp <- as(popChargingCleaned, "Spatial")
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum European Terrestrial Reference System 1989 in CRS
## definition
plot(popChargingCleanedSp, border="grey")
plot(pop.nb, coords=coordinates(popChargingCleanedSp)[,1:2], col="brown", add=TRUE, points=FALSE)

The neighbor list is symetric.

is.symmetric.nb(pop.nb)
## [1] TRUE

Create spatial weight Matrix from the neighbors using row standardization

pop.lw <- nb2listw(pop.nb, style="W", zero.policy = TRUE)
hist(unlist(pop.lw$weights), las=1, xlab="weights", main="Continguity nb, W")

2.1.3.1.1 kNN as an alternative

As an alternative we could use e.g. the 8 nearest neighbors

pop.k8nb <- knn2nb(knearneigh(coordinates(popChargingCleanedSp), k = 8))
plot(popChargingCleanedSp, border="grey")
plot(pop.k8nb, coords=coordinates(popChargingCleanedSp), col="red", add=TRUE, cex=.7)

The neighbor list is not symetric.

is.symmetric.nb(pop.k8nb)
## [1] FALSE

Create weighted list.

pop.k8lw <- nb2listw(pop.k8nb, style="W")

All census tracks have the same number of neighbors

par(mfrow=c(1,2))
hist(unlist(pop.lw$weights), las=1, xlab="weights", main="Continguity nb, W", col="blue")
hist(unlist(pop.k8lw$weights), las=1, xlab="weights", main="nn-8 nb, W", col="blue")

We could also calculate distances and inversely weight by distance before row-standardization.

dlist <- nbdists(pop.k8nb, coordinates(popChargingCleanedSp))
dlist <- lapply(dlist, function(x) 1/x)
pop.k8lwDist <- nb2listw(pop.k8nb, glist=dlist, style = "W")
hist(unlist(pop.k8lwDist$weights), las=1, xlab="weights", main="nn-8 nb, idw, W", col="blue")

The idw (inverse distance weighted) approach can be applied to the other neighborhood definitions as well.

2.1.3.2 Global spatial autocorrelation - Moran’s I

Moran’s I is calculated based on a neighborhood definition (weighted list) as well as the definition of the number of zones (typically the number of features) and the global sum of weights. Convenience functions help to get the required parameter.

moran.test provides the according test statistics. Like usual, don’t get just excited about p-values but also look at the absolute numbers.

lm.morantest(lmModel, listw = pop.lw, zero.policy=TRUE)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = sumChargingStations ~ pop_total, data =
## popChargingCleaned)
## weights: pop.lw
## 
## Moran I statistic standard deviate = 7.1768, p-value = 3.569e-13
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.1305577551    -0.0022004182     0.0003421887

Significant but low spatial autocorrelation.

lm.morantest(poisModel, listw = pop.lw, zero.policy=TRUE)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm(formula = sumChargingStations ~ sqrt(pop_total), family =
## poisson, data = popChargingCleaned)
## weights: pop.lw
## 
## Moran I statistic standard deviate = 13.861, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.2710026154    -0.0072753282     0.0004030688

This indicates strong positive spatial autocorrelation for both the linear model as well as the GLM. However, the absolute size for the linear model is relatively week.

2.1.4 Correlogram

A correlogram basically calculates Moran’s for different distance classes. We use the function correlog from package ncf for this purpose. The method works on distance and does not consider a weight matrix W.

increment specifies the size of the bins (distance bands) for the calculation and is measured in the coordinates of the spatial data (meters for our data set).

resamp allows the specification of the number of iterations for a permutation test.

The results contain the value of Moran’s I in slot correlation, the average of the distances for eac distance class in * mean.of.class$, the number of pairs within each distance class in n, the p-value (if resamp > 0) in p.

Here I plot the first 20 distance bands and reflecting a scaled version of the number of pairs in the distance band for the point size .

correlogResidLm <- ncf::correlog(coordinates(popChargingCleanedSp)[,1], coordinates(popChargingCleanedSp)[,2], popChargingCleanedSp$residLm,
na.rm=TRUE, increment=500, resamp=0)

par(mar=c(5,7, 4.1, 0.1))
plot(correlogResidLm$correlation[1:20] ~ correlogResidLm$mean.of.class[1:20] , type="b", pch=16, cex=sqrt(correlogResidLm$n)/70, lwd=1.5, xlab="distance ", ylab="", cex.lab=1, cex.axis=1, las=1, main="Linear model, Berlin")
abline(h=0)
mtext(text = "Moran's I", side = 2, line = 5, cex=1)

We see how spatial autocorrelation decreases with distance between grid cells.

correlogResidPoisSqrt <- ncf::correlog(coordinates(popChargingCleanedSp)[,1], coordinates(popChargingCleanedSp)[,2], popChargingCleanedSp$residPoissonSqrt,
na.rm=TRUE, increment=500, resamp=0)

par(mar=c(5,7, 4.1, 0.1))
plot(correlogResidPoisSqrt$correlation[1:20] ~ correlogResidPoisSqrt$mean.of.class[1:20] , type="b", pch=16, cex=sqrt(correlogResidPoisSqrt$n)/70, lwd=1.5, xlab="distance ", ylab="", cex.lab=1, cex.axis=1, las=1, main="Poisson model sqrt trnaformation, Berlin")
abline(h=0)
mtext(text = "Moran's I", side = 2, line = 5, cex=1)

2.1.5 Local Moran’s I

First we source a script file that helps us with plotting.

source("plotLocalMoransI.r")
## Loading required package: maptools
## Checking rgeos availability: TRUE
mylocalMI(plotvar= popChargingCleanedSp$residLm, nb =  pop.lw, shape =  popChargingCleanedSp, main = "LISA on residuals of linear model, Berlin", verbose = FALSE, zero.policy = TRUE)

Clearly, there is a cluster of high residuals in the center and two small clusters with low residuals. surrounded by cells with low residuals.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(popChargingCleanedSp) + tm_fill(col="residLm", palette = "-RdBu", midpoint=0)
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

So for sure we have a problem with spatial autocorrelation. Let’s look at different possibilities how to deal with it.

2.2 Spatial lag model

We fit a spatial lag model that has the form:

\(y = \rho W y + X \beta + \epsilon\)

The operation is computational intense so you potentially can get a cup of coffee while you wait for the resuls. Setting method=“SE_classic” is much faster by an approximation used and might be fine for our case.

lagModel2 <- lagsarlm(sumChargingStations ~ pop_total, data=popChargingCleaned, listw=pop.lw, zero.policy=TRUE, method="SE_classic")
summary(lagModel2)
## 
## Call:
## lagsarlm(formula = sumChargingStations ~ pop_total, data = popChargingCleaned, 
##     listw = pop.lw, method = "SE_classic", zero.policy = TRUE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.61518 -0.45280 -0.07562  0.17764 33.44959 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value  Pr(>|z|)
## (Intercept) -2.0726e-01  7.7910e-02 -2.6602  0.007809
## pop_total    1.5194e-04  1.6197e-05  9.3803 < 2.2e-16
## 
## Rho: 0.348, LR test value: 45.56, p-value: 1.4805e-11
## Asymptotic standard error: 0.052755
##     z-value: 6.5966, p-value: 4.2071e-11
## Wald statistic: 43.515, p-value: 4.2071e-11
## 
## Log likelihood: -1480.341 for lag model
## ML residual variance (sigma squared): 2.4272, (sigma: 1.558)
## Number of observations: 791 
## Number of parameters estimated: 4 
## AIC: 2968.7, (AIC for lm: 3012.2)
## LM test for residual autocorrelation
## test value: 14.021, p-value: 0.00018079

This would be the more intense computaion - which lead for the data at hands to very similar results.

lagModel <- lagsarlm(sumChargingStations ~ pop_total, data=popChargingCleaned, listw=pop.lw, zero.policy=TRUE)
summary(lagModel)

Interpretation of the regression coefficients is not straight forward. We need to distinguish between the direct, the indicrect (via the lagged response) and the total effect The total effect of a unit change in one of the predictors is: \(\hat{\beta}/ (1- \hat{\rho})\)

Using the full weight matrix is computational intense for a weight matrix of our size:

impacts(lagModel2, listw= pop.lw)

We will speed up computation by proving a vector of traces of powers of the spatial weights matrix created using trW, for approximate impact measures

W <- as(pop.lw, "CsparseMatrix")
#trMatc <- trW(W, type="mult")
trMC <- trW(W, type="MC")
impacts(lagModel2, tr=trMC)
## Impact measures (lag, trace):
##                 Direct    Indirect       Total
## pop_total 0.0001549312 7.81008e-05 0.000233032

The see that the total effect is nearly 1.5 times as big as the direct effects.

2.3 Spatial Durbin linear (SLX, spatially lagged X) model

We now fit a lm model augmented with the spatially lagged RHS variables. The idea is to use a lagged version of the predictors in addition to the predictors.

slxModel <- lmSLX(sumChargingStations ~ pop_total, data=popChargingCleaned, listw=pop.lw, zero.policy=TRUE)
summary(slxModel)
## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.724 -0.581 -0.061  0.297 33.497 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.053e-01  9.079e-02  -4.464  9.2e-06 ***
## pop_total      1.016e-04  3.216e-05   3.158 0.001648 ** 
## lag.pop_total  1.525e-04  3.925e-05   3.885 0.000111 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.606 on 788 degrees of freedom
## Multiple R-squared:  0.2369, Adjusted R-squared:  0.2349 
## F-statistic: 122.3 on 2 and 788 DF,  p-value: < 2.2e-16
impacts(slxModel, tr=trMC)
## Impact measures (SLX, estimable):
##                 Direct     Indirect        Total
## pop_total 0.0001015569 0.0001524842 0.0002540411
popChargingCleaned$residLag <- residuals(lagModel2)
popChargingCleaned$residSlx <- residuals(slxModel)
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(popChargingCleaned) + tm_fill(col="residLag", palette = "-RdBu", midpoint=0, alpha=0.5) + tm_layout(title="Residuals spatial lag model, Berlin")
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(popChargingCleaned) + tm_fill(col="residSlx", palette = "-RdBu", midpoint=0, alpha=0.5) + tm_layout(title="Residuals spatially lagged X model, Berlin")

2.4 Spatial eigenvector mapping

If we want to use the count regression approach (which is more suitable), the spatial eigenvector mapping approach is the tool at hands. Unfortunately, it is again computational intense.

sevmPoiss <- ME(sumChargingStations ~ sqrt(pop_total), data=popChargingCleaned, family=poisson, zero.policy = TRUE, listw=pop.lw)
print(sevmPoiss)
##   Eigenvector ZI pr(ZI)
## 0          NA NA   0.01
## 1          10 NA   0.02
## 2          28 NA   0.39
sevmPoissModel <- glm(sumChargingStations ~ sqrt(pop_total) + fitted(sevmPoiss), data=popChargingCleaned, family=poisson)
summary(sevmPoissModel)
## 
## Call:
## glm(formula = sumChargingStations ~ sqrt(pop_total) + fitted(sevmPoiss), 
##     family = poisson, data = popChargingCleaned)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0407  -0.7163  -0.5142  -0.3293  10.0131  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.884923   0.143510 -20.103  < 2e-16 ***
## sqrt(pop_total)         0.028309   0.001569  18.044  < 2e-16 ***
## fitted(sevmPoiss)vec10 16.900378   1.480986  11.412  < 2e-16 ***
## fitted(sevmPoiss)vec28  4.715084   1.175176   4.012 6.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1710.01  on 790  degrees of freedom
## Residual deviance:  836.43  on 787  degrees of freedom
## AIC: 1342.3
## 
## Number of Fisher Scoring iterations: 6

Presumably we have missed a lot of other predictors that are now partially absorbed into different eigenvectors. It might be worth to plot and investigate the eigenvectors that made it into the model. Therefore, we attach the selected eigenvectors to the sf object and plot them.

selEV <- fitted(sevmPoiss)

popChargingCleanedSelEV <- st_sf(data.frame(popChargingCleaned, selEV))
for(aName in colnames(selEV))
{ 
  plot(popChargingCleanedSelEV[aName], border=NA) 
}

These eigenvectors reval intresting pattern which should be investigated to explore further hypothesis.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(popChargingCleanedSelEV) + tm_fill(col="vec28", alpha=.5, palette = "-RdBu", midpoint=0) + tm_shape(charging_stations) + tm_dots()
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(popChargingCleanedSelEV) + tm_fill(col="vec10", alpha=.5, palette = "-RdBu", midpoint=0, n=7) + tm_shape(charging_stations) + tm_dots()