Analysis of Boston Crime Incident Open Data Using Pandas
While learning data analysis with Pandas in Python, I became interested in analyzing open data about the city I currently live in.
Open data gives people plenty of opportunities to use and analyze data as they wish, without restrictions. By looking at a city’s crime incident open data, you can get a sense of whether the city or a specific neighborhood is safe. So I used Pandas to analyze and visualize the open data of Boston Crime Incident Reports. Hopefully, reading about my experience gives you a better understanding of how to obtain, analyze, and visualize open data.
Obtaining Data
- First, I went to Analyze Boston, Boston’s open data website, searched for “crime,” and opened Crime Incident Reports.
- To get the data, I copied the download link for the CSV file:
url = "https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/12cb3883-56f5-47de-afa5-3b1cf61b257b/download/tmppj4rb047.csv"
- Use the wget command to download the data.
- Import pandas
import pandas as pd
- Use read_csv in Pandas to read the data and save it into a data frame.
df = pd.read_csv('tmppj4rb047.csv')
- Check that the data was loaded successfully.
df.head()
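The loading step can be sketched offline as follows. This is only an illustration: the three rows are made up, and only the column names come from the analysis in this post. read_csv accepts a file path, a URL, or a file-like object, so an in-memory string stands in for the downloaded file here.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the downloaded CSV file.
# Only the column names come from the article; the rows are made up.
sample_csv = io.StringIO(
    "OFFENSE_CODE_GROUP,STREET,YEAR,MONTH,HOUR\n"
    "Larceny,WASHINGTON ST,2018,5,17\n"
    "Motor Vehicle Accident Response,BOYLSTON ST,2018,6,8\n"
    "Larceny From Motor Vehicle,NEWBURY ST,2019,1,23\n"
)

# read_csv accepts a file path, a URL, or a file-like object such as this one.
sample_df = pd.read_csv(sample_csv)

# Quick sanity checks after loading: dimensions, column names, first rows.
print(sample_df.shape)
print(sample_df.columns.tolist())
print(sample_df.head())
```

Checking shape and the column names alongside head() makes it easier to spot a truncated download or a wrong delimiter early.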
Data Analysis
1. Types of Crimes
After obtaining the data, I want to know what types of crimes occur in Boston and how common each one is.
- Use value_counts in Pandas. value_counts counts how many times each type of crime appears and sorts the counts in descending order.
df.OFFENSE_CODE_GROUP.value_counts().iloc[:10]
To keep the output short, I only asked for the first ten results.
Here, we can see that the most frequent crime incident in Boston is “Motor Vehicle Accident Response,” and “Larceny” is also very common.
Then, I plotted the result as a horizontal bar chart to make it easier to read.
df.OFFENSE_CODE_GROUP.value_counts().iloc[:10].sort_values().plot(kind="barh")
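To see what value_counts does on a small scale, here is a minimal sketch on a made-up series of six offense labels (the labels and counts are illustrative, not taken from the Boston data). It also shows the optional normalize=True argument, which returns shares instead of raw counts.

```python
import pandas as pd

# Made-up offense labels for illustration only.
offenses = pd.Series(
    [
        "Larceny",
        "Larceny",
        "Motor Vehicle Accident Response",
        "Larceny",
        "Motor Vehicle Accident Response",
        "Vandalism",
    ],
    name="OFFENSE_CODE_GROUP",
)

# Count each label; the result is sorted from most to least frequent.
counts = offenses.value_counts()

# Keep only the two most frequent labels, like iloc[:10] in the article.
top_two = counts.iloc[:2]

# normalize=True returns the share of each label instead of the raw count.
shares = offenses.value_counts(normalize=True)

print(top_two)
print(shares)
```

The extra sort_values() before plotting with kind="barh" matters because a horizontal bar chart draws the first row at the bottom, so sorting in ascending order puts the largest bar at the top.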
2. Analyzing a Specific Crime
I want to look specifically at larceny in Boston, so I put the rows of the data frame whose offense group contains “Larceny” into another data frame and called it “larceny.”
larceny = df[df.OFFENSE_CODE_GROUP.str.contains("Larceny")]
larceny.head()
- Check the size of the larceny data frame with the shape attribute.
larceny.shape
There are 17961 records of Larceny incidents, and each record has 17 columns.
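A small sketch of the filtering step, on a made-up four-row frame, shows two details worth knowing. str.contains matches substrings, so a group such as "Larceny From Motor Vehicle" is included too, and passing na=False treats missing values as non-matches so the mask can be used for indexing. The rows here are hypothetical.

```python
import pandas as pd

# Hypothetical rows; one offense group is missing on purpose.
toy = pd.DataFrame(
    {
        "OFFENSE_CODE_GROUP": [
            "Larceny",
            "Larceny From Motor Vehicle",
            "Vandalism",
            None,
        ],
        "STREET": ["WASHINGTON ST", "NEWBURY ST", "BOYLSTON ST", "TREMONT ST"],
    }
)

# Substring match; na=False makes the missing value count as "no match".
mask = toy["OFFENSE_CODE_GROUP"].str.contains("Larceny", na=False)
toy_larceny = toy[mask]

# shape is an attribute that holds (number of rows, number of columns).
n_rows, n_cols = toy_larceny.shape
print(n_rows, n_cols)

# An exact comparison keeps only the rows labeled exactly "Larceny".
exact = toy[toy["OFFENSE_CODE_GROUP"] == "Larceny"]
print(len(exact))
```

Whether the substring match or the exact match is the right choice depends on which offense groups you want to count as larceny.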
3. Analyzing Places
I want to see how crime incidents are distributed across locations in Boston and, more specifically, which places see the most larceny.
I used the groupby method in Pandas to group the records by street, and the size method to count the entries in each group.
larceny.groupby("STREET").size().sort_values(ascending=False)
Looking at the result, the streets in Boston where larceny is reported most often are Washington St, Boylston St, and Newbury St.
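The group, count, and sort chain can be tried on a tiny made-up frame. The street names below echo the ones in the result, but the counts are invented for illustration.

```python
import pandas as pd

# Hypothetical records: one row per incident.
toy = pd.DataFrame(
    {
        "STREET": [
            "WASHINGTON ST",
            "BOYLSTON ST",
            "WASHINGTON ST",
            "NEWBURY ST",
            "BOYLSTON ST",
            "WASHINGTON ST",
        ]
    }
)

# groupby collects the rows for each street; size counts the rows per group.
by_street = toy.groupby("STREET").size()

# Sort so that the street with the most records comes first.
by_street = by_street.sort_values(ascending=False)

print(by_street.head(3))
```

Note that groupby skips rows whose STREET value is missing by default, so incidents without a recorded street do not appear in this kind of ranking.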
4. Analyzing Time
I also want to see how larceny incidents in Boston vary over time.
larceny.groupby("MONTH").size().plot(kind="bar")
Based on the bar chart, larceny happened most often in May, June, and December, whereas August, September, and October appear to be safer.
Let’s also look at how the number of larceny incidents changes over the course of a day.
larceny.groupby("HOUR").size().plot(kind="bar")
Here, we can tell that the time of day when larceny is least likely to happen in Boston is 5 am. However, people need to be more careful from 4 to 6 pm.
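Instead of reading the peak hour off the chart, it can also be looked up directly. The sketch below uses made-up hours and assumes an HOUR column with values from 0 to 23. Because groupby only returns hours that actually occur in the data, the result is reindexed so that hours with no incidents show up as zero.

```python
import pandas as pd

# Hypothetical incident hours (0 to 23), for illustration only.
toy = pd.DataFrame({"HOUR": [17, 17, 16, 5, 17, 16, 12]})

# Count incidents per hour; hours with no incidents are absent at this point.
hourly = toy.groupby("HOUR").size()

# Fill in the missing hours with a count of zero so all 24 hours are present.
hourly = hourly.reindex(range(24), fill_value=0)

# idxmax returns the index label (the hour) with the highest count.
busiest_hour = hourly.idxmax()
print(busiest_hour, hourly.loc[busiest_hour])
```

On the full dataset every hour probably has at least one record, so the reindex step mainly matters for small subsets such as a single street or month.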
Now, I want an overall look at larceny incidents in Boston by month and by hour of the day. Let’s look specifically at 2018, since 2019 has not ended yet and its data is incomplete.
If we group by MONTH and HOUR using groupby in Pandas, we get the following result.
larceny[larceny.YEAR == 2018].groupby(["MONTH", "HOUR"]).size()
However, this long format is not easy to read. To improve it, I used unstack, which turns the result into a more readable table with one column per month.
larceny[larceny.YEAR == 2018].groupby(["MONTH", "HOUR"]).size().unstack(0)
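A minimal sketch of what unstack(0) does, using six made-up records: the first grouping level (MONTH) moves from the row index to the columns, leaving HOUR as the rows. The fill_value=0 argument is an addition here; without it, a month and hour combination that never occurs shows up as NaN.

```python
import pandas as pd

# Hypothetical records: month 3 has no incident at hour 17.
toy = pd.DataFrame(
    {
        "MONTH": [1, 1, 1, 2, 2, 3],
        "HOUR": [9, 9, 17, 9, 17, 9],
    }
)

# Long format: one count per (MONTH, HOUR) pair that occurs in the data.
long_counts = toy.groupby(["MONTH", "HOUR"]).size()
print(long_counts)

# Wide format: level 0 (MONTH) becomes the columns, HOUR stays as the rows.
# fill_value=0 replaces the missing (3, 17) combination with a zero count.
wide = long_counts.unstack(0, fill_value=0)
print(wide)
```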
Now, I want to visualize the result.
Since I am visualizing 12 months of data as 12 separate graphs, I want to use a facet plot. In Pandas, plot has a parameter called subplots; set it to True to draw one graph per column. I can also adjust the width and height of the figure with figsize.
larceny[larceny.YEAR == 2018].groupby(["MONTH", "HOUR"]).size().unstack(0).plot(subplots=True, kind="bar", figsize=(5, 30))
Now, we have a complete visualization of the larceny incidents that happened in Boston in 2018.
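The subplots behavior can be sketched on a small hand-built table. The numbers are made up, and the Agg backend is selected only so the example runs without a display; in a notebook that line is not needed. With subplots=True, plot returns one Axes object per column rather than a single Axes.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend, an assumption for script use.

import pandas as pd

# Hypothetical counts: rows are hours, columns are months.
table = pd.DataFrame(
    {1: [2, 1], 2: [1, 1], 3: [1, 0]},
    index=pd.Index([9, 17], name="HOUR"),
)
table.columns.name = "MONTH"

# One bar chart per column; figsize is (width, height) in inches.
axes = table.plot(subplots=True, kind="bar", figsize=(5, 9))

# All subplots share one figure, which can be adjusted or saved as a whole.
fig = axes[0].get_figure()
fig.tight_layout()
print(len(axes))
```

A tall figsize such as (5, 30) makes sense for 12 stacked subplots because the height is shared among all of them.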
Summary
By using Pandas, I analyzed and visualized the open data of Boston Crime Incident Reports. It turns out that Pandas is a very powerful Python package for extracting, grouping, sorting, analyzing, and plotting data. If you are interested in data analysis, using Pandas to analyze real datasets is a good way to start.