This document analyzes the 2006 Oakland Street Tree Survey, using R. All code and outputs are shows. This code and analysis is taken from an ongoing project for a data analytic class in the LSE statistics department. The code and data is available at: https://github.com/noahkouchekinia/Oakland_Street_Trees[https://github.com/noahkouchekinia/Oakland_Street_Trees]
setwd("/home/noah/Documents/Code Samples/Street_Trees/")
library(tidyr); library(dplyr) #Data manipulation
library(tree); library(lattice); library(caret) #For predictive modeling
library(tigris); library(rgdal); library(sp); library(raster); library(rgeos) #Spatial Data
library(ggplot2); library(plotly); library(leaflet); library(RColorBrewer) #Data visualization
library(taxize) #Manages calls to taxonomy API
I read the data into R from a flat file. The data-set is available at at: https://data.oaklandnet.com/dataset/Street-Trees/9e7e-63pp
<- read.csv("./Data/TreesAlongSidewalks.csv") trees
The data-set needs some cleaning and recording to be usable. This is done below.
#Five entries contain mostly mostly values
<- trees[complete.cases(trees[1:30]),]
trees
#The longitude and latitude of each tree needs to be pulled out of "the_geom" variable (using regex)
$lng <- sub(" .*","",sub('POINT \\(','', trees$the_geom))
trees$lng <-as.numeric(trees$lng)
trees
$lat <- sub("\\)","",sub('POINT \\(.* ','', trees$the_geom))
trees$lat <- as.numeric(trees$lat)
trees
$the_geom <- NULL
trees
#Many binary factor variables are missrepresented as numbers
$PLANAREAID <- as.factor(trees$PLANAREAID)
trees$COUNCILDIS <- as.factor(trees$COUNCILDIS)
trees$SURVEYOR <- as.factor(trees$SURVEYOR)
trees
$TREE <- as.factor(ifelse(trees$TREE == -1, "YES","NO"))
trees$WELL <- as.factor(ifelse(trees$WELL == -1, "YES","NO"))
trees$HVW <- as.factor(ifelse(trees$HVW == -1, "YES","NO"))
trees$STUMP <- as.factor(ifelse(trees$STUMP == -1, "YES","NO"))
trees$BRKNHNGLMB <- as.factor(ifelse(trees$BRKNHNGLMB == -1, "YES","NO"))
trees$OPENCAVITY <- as.factor(ifelse(trees$OPENCAVITY == -1, "YES","NO"))
trees$STRCTRCNT <- as.factor(ifelse(trees$STRCTRCNT == -1, "YES","NO"))
trees$WATERMET <- as.factor(ifelse(trees$WATERMET == -1, "YES","NO"))
trees$FIREHYD <- as.factor(ifelse(trees$FIREHYD == -1, "YES","NO"))
trees$DRIVEWAY <- as.factor(ifelse(trees$DRIVEWAY == -1, 'Yes','No'))
trees$LIGHTPOLE <- as.factor(ifelse(trees$LIGHTPOLE == -1, 'Yes','No'))
trees$STRTCRNR <- as.factor(ifelse(trees$STRTCRNR == -1, 'Yes','No'))
trees$STAKEGUARD <- as.factor(ifelse(trees$STAKEGUARD == -1, 'Yes','No'))
trees$DEAD <- as.factor(ifelse(trees$DEAD == -1, 'Yes','No'))
trees
#date and time collumns need to be coded as such
$COLLDATE <- as.Date(as.character(trees$COLLDATE), format = "%m/%d/%Y")
trees$COLLTIME <- as.Date(as.character(trees$COLLTIME), format = "%H:%M:%S")
trees
$ARB_DATE <- as.Date(as.character(trees$ARB_DATE), format = "%m/%d/%Y")
trees$ARB_TIME <- as.Date(as.character(trees$ARB_TIME), format = "h:m:s")
trees
#The content of the species collumn is used inconsistantly. I split it into other variables.
<- c('Other','TBD','Dead', 'Stump', 'Unknown', 'Tree well only')
not_tree $SPECIES[trees$SPECIES %in% not_tree] <- NA
trees
<- c('Fig Tree','Fruit tree','Queen Palm','Guava','Banana',
not_species 'Walnut tree', 'Almond tree', 'Apricot tree', 'Shrub')
$COLLOQIALNAME <- ifelse(trees$SPECIES %in% not_species, trees$SPECIES, NA)
trees$SPECIES[trees$SPECIES %in% not_species] <- NA
trees
$GENUS <- sub('\\s.*', '', trees$SPECIES)
trees$GENUS[trees$GENUS == ''] <- NA
trees$GENUS <- as.factor(trees$GENUS)
trees
$SPECIES <- sub('^\\w*\\s(\\w*).*', '\\1', trees$SPECIES)
trees$SPECIES[trees$SPECIES == 'sp'] <- NA
trees
rm(not_species); rm(not_tree)
#Many variables are unhelpful
$SEGMENTID <- NULL #Cannot find deffinition
trees$LOCATION <- NULL #Poorly defined
trees$TRUNKDIAM <- NULL #Redundant (See TRUNKDIA_N)
trees$TREE_KEY <- NULL #Redundant (See COLLDATE and COLLTIME)
trees$ADDRNUM <- NULL #Cannot find deffinition
trees$SYM_CODE <- NULL #Cannot find deffinition
trees$FACILITYID <- NULL #Redundant (See OBJECTID)
trees$LEGACYID <- NULL #Redundant (See OBJECTID)
trees$WARRANTYDA <- NULL #Every entry is NA
trees$INSTALLDAT <- NULL #Every entry is NA
trees$INSTALL_ID <- NULL #Every entry is NA
trees$DESIGNATIO <- NULL #Mostly empty
trees$CONDITION <- NULL #Almost entirely empty
trees$CONDITIOND <- NULL #Entirely Empty
trees
#The arborist comments make more sense as charecter strings than factors
$ARB_CMNT <- as.character(trees$ARB_CMNT)
trees$ARB_CMNT[trees$ARB_CMNT=="None"] <- NA
trees
#The IS_PROBLEM collumn hold information on wether a free is a current problem, past problem,
#or never a problem. I want a binary variable that stores wether a tree was ever a problem.
$WAS_PROBLEM <- as.factor(ifelse(trees$IS_PROBLEM == "False", "NO",
treesifelse(trees$IS_PROBLEM %in% c('Fixed','true'),'YES', NA)))
The data-set is now much cleaner, and consistently formatted. And Ready for further analysis.
The data-set contains genus and species data for most observations. More detailed taxonomic information upstream taxonomic information (family, order class, etc.) would be would be useful. First, more detailed classification information provides a way to measure the similarity between the trees in the data-set. Secondly, it will make incorporating taxonomic information into predictive models feasible. Using species or genus as a factor variable in a decision-tree or regression model would be impractical given the enormous number of dummy variables this would require. A better approach would be to use a higher level of classification, at least in initial models.
###Collecting Taxonomy Data
The taxize
package provides a function that handles calls for classification information from the Catalog of Life API.
<- classification(sci_id = unique(trees$genus), db = "ncbi") trees_tax
The taxize::classification
function returns its results as a list of data-frames. Below, the list is reorganized into a single data-frame.
<- trees_tax[! is.na(trees_tax)]
trees_tax
<- lapply(trees_tax, function(x){x[c('rank','name')]})
trees_tax <- lapply(trees_tax, function(x){pivot_wider(x, names_from = 'rank', values_from = 'name')})
trees_tax <- lapply(trees_tax, function(x){x[c('kingdom','phylum','class','order','family','genus')]})
trees_tax <- do.call(rbind, trees_tax)
trees_tax
<- trees_tax[trees_tax$kingdom == 'Plantae',] trees_tax
The tree taxonomic information can now be paired with the original street tree data-set. Now observation contains the tree’s full taxonomic classification.
<- merge(trees_tax, trees, by.x = 'genus', by.y = 'GENUS', all.y = TRUE)
trees
$SPECIES[!is.na(trees$kingdom)] <- paste(trees$genus[!is.na(trees$kingdom)],
treesifelse(is.na(trees$SPECIES[!is.na(trees$kingdom)]),
"Unknown",
$SPECIES[!is.na(trees$kingdom)]))
trees
rm(trees_tax)
It would be interesting to visualize the newly loaded taxonomic information. Below, A layered pie chart is created which will show which species, genuses, families, etc. are most common in Oakland’s street trees.
#Extract the relevant taxonomic information from the data
<- trees[c('kingdom','phylum','class','order','family','genus','SPECIES')]
trees_tax <- trees_tax[!is.na(trees_tax$kingdom),]
trees_tax
#Reorder data
<- trees_tax[order(trees_tax$kingdom,
trees_tax $phylum,
trees_tax$class,
trees_tax$order,
trees_tax$family,
trees_tax$genus,
trees_tax$SPECIES),
trees_tax
]
#Define a palette
<- colorRampPalette(colors = c('gold', 'dark green', 'dark green', "brown", 'brown','gold'))
palMaker <- palMaker(nrow(trees_tax)+10)
pal
#Assign color to each observation
$color <- 1:nrow(trees_tax)
trees_tax
#Build plot with plotly
<- plot_ly(textinfo = 'none')
p
#Each layer is added inside a for loop
for(i in 1:7){
#Extract relevent data for layer and reshape
<- dplyr::grouped_df(trees_tax, vars = names(trees_tax)[i])
data <- summarize(data, count = length(color), color = mean(color))
data <- data[order(data$color),]
data names(data)[1] <- 'var'
#Determin appropriate dimentions for layer
<-1-(1/i)
hole <- i*.14
diam <- diam/2
rad <- list(x=list(.5-rad, .5+rad), y=list(.5-rad, .5+rad))
domain
#Build layer
<- add_pie(p,
p values = data$count,
labels = data$var,
hole = hole,
domain = domain,
sort = FALSE,
hoverinfo = 'text',
text = paste0(tools::toTitleCase(tolower(names(trees_tax)[i])),": ",tools::toTitleCase(data$var),'\n',
as.character(data$count), " Street Trees \n",
as.character(round(100*(data$count/nrow(trees_tax)), digits = 3)),"%"),
marker = list(colors = pal[round(data$color)],
line = list(color = 'white', width = (.04*(10-i)^1.8))
)
) }
`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
#remove unessesary objects
rm(list = c('data', 'hole', 'diam', 'rad', 'domain', 'trees_tax'))
#set layour and configuration parameters
<- layout(p, title = 'Oakland Street Tree Taxonomy: Layered Pie Chart', showlegend = F, autosize = T,
p xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
<- config(p, displaylogo = FALSE, collaborate = FALSE, displayModeBar = F) p
Warning in config(p, displaylogo = FALSE, collaborate = FALSE, displayModeBar = F): The collaborate button is no longer supported
#Display the Chart!
p
Warning: 'config' objects don't have these attributes: 'collaborate'
Valid attributes include:
'staticPlot', 'plotlyServerURL', 'editable', 'edits', 'autosizable', 'responsive', 'fillFrame', 'frameMargins', 'scrollZoom', 'doubleClick', 'doubleClickDelay', 'showAxisDragHandles', 'showAxisRangeEntryBoxes', 'showTips', 'showLink', 'linkText', 'sendData', 'showSources', 'displayModeBar', 'showSendToCloud', 'showEditInChartStudio', 'modeBarButtonsToRemove', 'modeBarButtonsToAdd', 'modeBarButtons', 'toImageButtonOptions', 'displaylogo', 'watermark', 'plotGlPixelRatio', 'setBackground', 'topojsonURL', 'mapboxAccessToken', 'logging', 'notifyOnLogging', 'queueLength', 'globalTransforms', 'locale', 'locales'
Since the data-set contains precise coordinates for each tree, we can map them. Below, two kinds of maps are created.
The first map will be an interactive scatter plot of every street tree in the city.
#Let's make a scatter map of all the trees
<- leaflet(width = '100%')
m <- addProviderTiles(m,providers$CartoDB.Positron)
m <- addCircleMarkers(m, data = trees,
m lng = ~lng, lat = ~lat,
radius = 11, color = "#009933",opacity = .75, fillOpacity = .75,
label = paste("Species: ", trees$GENUS, trees$SPECIES),
clusterOptions = markerClusterOptions(showCoverageOnHover = FALSE)
#^The clustering makes the widget run efficently
)#Display the map!
m
I would like to determine the density of street trees in Oakland’s various neighborhoods. This allows us to pair trees with information about the neighborhoods they are in, fertile ground for future analysis. In the short term, this will be used to create a neighborhood level chloroplast map of tree density.
The first step is to load geospatial data into R. Census tracts will be used as neighborhoods, so the tigris
packages tracts
function is used to download the county’s census tract boundaries from the Census Bureau’s Tiger Database.
A shape file of the Oakland’s city limits, and a shape file of the coast of the San Francisco Bay are downloaded. These were retrieved from UC Berkeley’s Geo-spatial Database.
#First, lets import geospacial data (city and tract boundries)
<- tigris::tracts("Ca", county = "Alameda", year = 2018, class = "sp")
tracts
<- rgdal::readOGR("./alameda_cities/", "alameda_cities")
cities <- cities[cities$name == "Oakland",]
Oakland rm(cities)
<-spTransform(Oakland, tracts@proj4string@projargs)
Oakland
<- rgdal::readOGR("./bay_boundry/", "GIS_ADMIN_OCEAN_BAY")
bay <- spTransform(bay, Oakland@proj4string@projargs) bay
The census tract boundary data is then then cropped so that it just includes tracts within the city of Oakland, and only the portion of those tracts on land.
<- (Oakland - crop(Oakland,bay, snap = 'out')) #Stage 1: crop Oakland City limits by coastal boundry Oakland
Warning in RGEOSUnaryPredFunc(spgeom, byid, "rgeos_isvalid"): Self-intersection at or near point -121.56725280384251 37.763822031572118
y is invalid
Warning in rgeos::gUnaryUnion(y): Invalid objects found; consider using set_RGEOS_CheckValidity(2L)
<- crop(tracts, Oakland, snap = 'out') #Stage 2: crop tracts by Oaland City limits
tracts <- tracts[! area(tracts) < 10000,] #Due to border discrepencies, slivers of non-oakland were retained after crop
tracts #This removes those tiny protions of tracts
rm(bay); rm(Oakland)
Each tree can now be associated with the neighborhood it is within.
#First, lets add which tract each tree is in to the 'trees' dataframe
$tract <- vector(mode = 'numeric', dim(trees)[1])
treesfor(i in 1:length(tracts@polygons)){
<- point.in.polygon(trees$lng,
includes $lat,
trees@polygons[[i]]@Polygons[[1]]@coords[,1],
tracts@polygons[[i]]@Polygons[[1]]@coords[,2])
tracts$tract[includes == 1] <- i
trees
}$tract[trees$tract == 0] <- NA
trees
#Now we can add how many trees are in each tract to our 'tracts' spacial dataframe
@data$Trees <- tabulate(trees$tract)
tracts@data$TreeDensity <- tracts@data$Trees / as.numeric(tracts@data$ALAND) tracts
Finally, below is a map showing the relative density of trees in different sections of Oakland.
#Lets create a palette
<- colorNumeric(
pal palette = "Greens",
domain = tracts@data$TreeDensity)
#Lets build the map
<- leaflet(width = '100%')
m <- addProviderTiles(m,providers$CartoDB.Positron)
m <- addPolygons(m,
m data = tracts,
fillColor = ~pal(TreeDensity),
weight = 2,
opacity = 1,
color = "grey",
dashArray = "3",
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 2,
color = "white",
dashArray = "3",
fillOpacity = 0.75,
bringToFront = TRUE)
#^Highlight options make neighborhood boundies more cclear on hover
)#Display the map!
m
The data-set contains information on which trees have been problems. Based on the variables correlation with other factors, it appears a tree is marked as a problem tree if it has excessive deadwood or is otherwise unhealthy. To help the city divert its tree management resources to were they will matter the most, it would be useful to predict which trees will be problem trees. Below, two models are constructed to do just that.
First up, a classification tree (ba dum tss).
#Lets build a tree model
#Variables are selected as inclusivly as possible, since the function picks which to use.
<- tree(WAS_PROBLEM ~ as.factor(order) + PLANAREAID + SURVEYOR + COLLDATE + TREE + WELL + WELLWIDTH + LOWWELL +
model_1 + STUMP + BRKNHNGLMB + OPENCAVITY + STRCTRCNT + LOWLIMB + TRNKLEAN + GRATE +
HVW + LANDUSE + TRUNKDIA_N,
STAKEGUARD data = trees)
#Let's use K-fold cross validation to select the minimum optimal size for the tree
<- cv.tree(model_1, K = 10,method = 'misclass')
cv.results <- min(cv.results$size[cv.results$dev == min(cv.results$dev)])
optimal.size <- prune.tree(model_1, best = optimal.size)
model_1
#Lets take a look at the tree.
summary(model_1)
Classification tree:
snip.tree(tree = model_1, nodes = 2L)
Variables actually used in tree construction:
[1] "as.factor(order)"
Number of terminal nodes: 3
Residual mean deviance: 0.1883 = 6584 / 34970
Misclassification error rate: 0.0195 = 682 / 34971
plot(model_1)
text(model_1)
#Lets identify the at risk from the tree above orders.
levels(as.factor(trees$order))[which(letters %in% c('c','d','g','o'))]
[1] "Arecales" "Asparagales" "Ericales" "Malpighiales"
The above model predicts that trees in the orders of Arecales, Asparagales, Ericales, and Malpighiales will become problem trees while the others will not. The misclassification rate on the training data is about 2%. Since K-fold cross validation was used to prune the model, it is expected that performance would be similar on new data.
Next, a logistic regression model is fitted using the same predictors that the tree function used. It is trained using K-fold cross validation to prevent over fitting.
<- caret::train(WAS_PROBLEM ~ order,
model2 data = trees,
trControl = trainControl(method = 'cv', number = 10),
method = 'glm',
family=binomial(link = "logit"),
metric = "Accuracy",
na.action = na.omit
)
summary(model2)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5934 -0.2008 -0.1888 -0.1628 3.1814
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.896909 0.505051 -7.716 1.20e-14 ***
orderAquifoliales 20.462978 799.848399 0.026 0.9796
orderArecales 1.192788 0.535216 2.229 0.0258 *
orderAsparagales 2.249320 0.536844 4.190 2.79e-05 ***
orderCelastrales 0.090247 0.714585 0.126 0.8995
orderCornales -12.669159 251.541141 -0.050 0.9598
orderEricales 1.566153 0.641734 2.441 0.0147 *
orderFabales -0.420579 0.560967 -0.750 0.4534
orderFagales 0.275907 0.535875 0.515 0.6066
orderGentianales 0.046762 0.875051 0.053 0.9574
orderGinkgoales -0.246225 0.606590 -0.406 0.6848
orderLamiales -0.209321 0.526473 -0.398 0.6909
orderLaurales -0.739436 0.650660 -1.136 0.2558
orderMagnoliales -0.121550 0.542429 -0.224 0.8227
orderMalpighiales 0.996781 0.563148 1.770 0.0767 .
orderMalvales 0.025708 1.129562 0.023 0.9818
orderMyrtales -0.399329 0.523993 -0.762 0.4460
orderPinales 0.675930 0.531453 1.272 0.2034
orderProteales -1.011680 0.538343 -1.879 0.0602 .
orderRosales 0.002541 0.510285 0.005 0.9960
orderSapindales 0.497855 0.514351 0.968 0.3331
orderSaxifragales -1.157515 0.550453 -2.103 0.0355 *
orderSolanales 20.462978 641.305501 0.032 0.9745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6900.4 on 34970 degrees of freedom
Residual deviance: 6426.1 on 34948 degrees of freedom
AIC: 6472.1
Number of Fisher Scoring iterations: 15
The results from the second model are similar to the results of the first. Four orders of trees have higher log odds of becoming problem trees (at a significance level of .1 or below). They are in fact the same four orders of trees that the previous model predicted would be problem trees: Arecales, Asparagales, Ericales, and Malpighiales. This consistency is reassuring.
The regression model however gives more detailed results than the classification tree model. For example, this model also reveals that two orders of trees, Saxifragales and Proteales, are less likely to be problem trees (again, at a significance level of .1 or below). Since the logistic regression also returns the log odds ratios, one can determine the relative magnitude of trees probabilities of becoming problem trees. For example, also trees of the order Arecales and Asparagales are both more likely than others to become problem trees, trees of order Asparagales are more likely than trees of order Arecales to become problem trees.
Most of the information in the data-set was not helpful in predicting which trees would be problem trees. The exception is the taxonomic information collected in the survey. Both models consistently show that trees of the orders Arecales, Asparagales, Ericales, and Malpighiales are likely to become problem trees. This could have policy implications for the city of Oakland. Perhaps it should plant less of these trees, and trees in the orders predicted to not be problem trees, Saxifragales and Proteales. Alternatively, the city could channel limited monitoring and trimming resources towards trees in the problematic orders.