A quick analysis of Baltimore crime

I'm going to do a very simple analysis of Baltimore crime to show off R. We'll use data downloaded from Baltimore City's awesome open data site.

Getting data

Let's load the data:

arrest_tab = read.csv("BPD_Arrests.csv", stringsAsFactors = FALSE)
cctv_tab = read.csv("CCTV_Locations.csv", stringsAsFactors = FALSE)

# these columns are mislabeled, so fix them
tmp = arrest_tab$sex
arrest_tab$sex = arrest_tab$race
arrest_tab$race = tmp

Exploring data

# dimension of table (data.frame)
dim(arrest_tab)
## [1] 104528     15

# what are the columns
names(arrest_tab)
##  [1] "arrest"            "age"               "sex"              
##  [4] "race"              "arrestDate"        "arrestTime"       
##  [7] "arrestLocation"    "incidentOffense"   "incidentLocation" 
## [10] "charge"            "chargeDescription" "district"         
## [13] "post"              "neighborhood"      "Location.1"

# what is the average arrest age?
mean(arrest_tab$age)
## [1] 33.2

# the range of arrest ages
range(arrest_tab$age)
## [1]  0 87

# how many arrests per sex
table(arrest_tab$sex)
## 
##           F     M 
##     2 19431 85095

# what are the most common offenses
head(sort(table(arrest_tab$incidentOffense), decreasing = TRUE))
## 
##         Unknown Offense            87-Narcotics       4E-Common Assault 
##                   38649                   24744                    6739 
## 87O-Narcotics (Outside)     97-Search & Seizure                79-Other 
##                    6515                    3670                    3461

# range of arrests after removing those w/ age==0
range(arrest_tab$age[arrest_tab$age > 0])
## [1]  8 87

Let's see a table of arrests by sex and race

table(sex = arrest_tab$sex, race = arrest_tab$race)
##    race
## sex           A     B     H     I     U     W
##         2     0     0     0     0     0     0
##   F     0    37 14663     0    34   183  4514
##   M     0   205 72605     1   184  1566 10534

A histogram of age

hist(arrest_tab$age, nc = 100)

plot of chunk unnamed-chunk-4

with(arrest_tab, hist(age[sex == "M"], nc = 100))  # males only

plot of chunk unnamed-chunk-4

with(arrest_tab, hist(age[sex == "F"], nc = 100))  # females only

plot of chunk unnamed-chunk-4

Are males and females arrested at different ages on average?

Let's take a look at how age depends on sex. Let's plot age as a function of sex first (notice how we indicate that sex is a factor).

plot(arrest_tab$age ~ factor(arrest_tab$sex))

plot of chunk unnamed-chunk-5

One of the neat things about R is that statistical model building and testing is built-in. The model we use is \( y_i=\beta_0+\beta_1 x_i \) where \( y_i \) is age of sample (example) \( i \) and \( x_i \) is an indicator variable \( x_i \in \{0,1\} \) with \( x_i=1 \) if the \( i \)-th record (example) is male. You can check that \( \beta_1 \) is the difference in mean age between females and males. We use the formula syntax to build a linear regression model.

# let's ignore those records with missing sex
fit = lm(age ~ factor(sex), data = arrest_tab, subset = arrest_tab$sex %in% 
    c("M", "F"))
summary(fit)
## 
## Call:
## lm(formula = age ~ factor(sex), data = arrest_tab, subset = arrest_tab$sex %in% 
##     c("M", "F"))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33.39 -10.15  -3.15   9.61  53.85 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.3878     0.0847   394.1   <2e-16 ***
## factor(sex)M  -0.2343     0.0939    -2.5    0.013 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.8 on 104524 degrees of freedom
## Multiple R-squared:  5.96e-05,   Adjusted R-squared:  5e-05 
## F-statistic: 6.23 on 1 and 104524 DF,  p-value: 0.0126

We see that \( \beta_1 \approx -0.2 \) meaning that the arrest age for males is about 2.5 months younger. So there is very little difference in the average age (which is what the linear model is testing) but we see that the probability of observing this difference from a sample of this size when there is no difference in average age is small \( p \approx 0.01 \). Since we have a very large number of examples, or records, this testing framework will declare very small differences as statistically significant. We'll return to this theme later in class.

Geographic distribution of arrests.

First we need to extract latitude and longitude from location, we'll use some string functions to do this

tmp = gsub("\\)", "", gsub("\\(", "", arrest_tab$Location))
tmp = strsplit(tmp, split = ",")
arrest_tab$lon = as.numeric(sapply(tmp, function(x) x[2]))
arrest_tab$lat = as.numeric(sapply(tmp, function(x) x[1]))

Now let's plot

plot(arrest_tab$lon, arrest_tab$lat, xlab = "Longitude", ylab = "Latitude", 
    main = "Arrests in Baltimore")

plot of chunk unnamed-chunk-8

We can also use density estimates to make this nicer:

smoothScatter(arrest_tab$lat, arrest_tab$lon, xlab = "Latitude", ylab = "Longitude", 
    main = "Arrests in Baltimore")
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009

plot of chunk unnamed-chunk-9

Let's make this fancier using the ggplot2 graphics systems and the maps package containing map data.

library(maps)
library(ggplot2)

balto_map = subset(map_data("county", region = "maryland"), subregion == "baltimore city")
plt = ggplot()
plt = plt + geom_polygon(data = balto_map, aes(x = long, y = lat), color = "white", 
    fill = "gray40")
plt = plt + geom_point(data = arrest_tab, aes(x = lon, y = lat), color = "blue", 
    alpha = 0.1)
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-10

Now let's add CCTV cameras.

tmp = gsub("\\)", "", gsub("\\(", "", cctv_tab$Location))
tmp = strsplit(tmp, split = ",")
cctv_tab$lon = as.numeric(sapply(tmp, function(x) x[2]))
cctv_tab$lat = as.numeric(sapply(tmp, function(x) x[1]))

plt = ggplot()
plt = plt + geom_polygon(data = balto_map, aes(x = long, y = lat), color = "white", 
    fill = "gray40")
plt = plt + geom_point(data = arrest_tab, aes(x = lon, y = lat), color = "blue", 
    alpha = 0.1)
plt = plt + geom_point(data = cctv_tab, aes(x = lon, y = lat), color = "red")
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-11

A challenge

Is there any relationship between the number of CCTV cameras and the number of arrests? Divide the city into a grid and plot the number of CCTV cameras vs. the number of arrests.

latRange = range(arrest_tab$lat, na.rm = TRUE)
lonRange = range(arrest_tab$lon, na.rm = TRUE)
latGrid = seq(min(latRange), max(latRange), len = 50)
lonGrid = seq(min(lonRange), max(lonRange), len = 50)
latFac = as.numeric(cut(arrest_tab$lat, breaks = latGrid))
lonFac = as.numeric(cut(arrest_tab$lon, breaks = lonGrid))

gridFac = (latFac - 1) * length(latGrid) + (lonFac - 1)

latFac = as.numeric(cut(cctv_tab$lat, breaks = latGrid))
lonFac = as.numeric(cut(cctv_tab$lon, breaks = lonGrid))
cctvGridFac = (latFac - 1) * length(latGrid) + (lonFac - 1)

arrestTab = table(gridFac)
cctvTab = table(cctvGridFac)
m = match(names(cctvTab), names(arrestTab))
plot(arrestTab[m] ~ factor(cctvTab))

plot of chunk unnamed-chunk-12

Extra analyses

As part of HW1 you will add to this analysis. Please use the following template:

Your name(s) here

What question are you asking?:

What is the code you use to answer it?:

plot(1:10, 1:10)

plot of chunk yourname

# code goes here

What did you observe?