Toronto Police - Crime Prediction

(Written January 2021)

WordCloud ranking the top crimes in Toronto

I am currently enrolled in York's certificate for Big Data Analytics, and we just finished a group project to perform predictive analysis and clustering using the Cross Industry Standard Process for Data Mining (CRISP-DM) framework. It was a great opportunity to take our analysis end to end: data wrangling, modeling, evaluation, and presentation. In particular, we enjoyed trying a variety of supervised and unsupervised machine learning models.

It was not easy. We made many mistakes along the way, but miraculously, by the 11th hour, we were able to produce some results and insights, and even had a few surprises.

For our data set we decided to work with the crime data provided by the Toronto Police Service (TPS) through their Public Safety Data Portal, which is "intended to improve the understanding of policing, improve transparency and enhance confidence through the creation and use of open data for public safety in Toronto".

Our objective


To create a model that accurately predicts crime while investigating the possible correlation between the prediction, demographics and social services.

The purpose of our model is to assist in crime reduction via effective resource reallocation.

The team

Suzanne Douglas

Rachna Kumari

Herby Robinson

Pushpendra Sharma

Don Sohn



Mis-steps along the way


  • Needed to remind ourselves that while correlation may exist, it does not imply causation

  • Overfitting our model

  • Including our target variable in our unsupervised model (gasp!)

  • ...and many more mishaps

When determining the optimal k for K-means, how did I manage to invent the double-jointed elbow?

Exploring the data

Clustering

In our unsupervised learning we asked whether a neighbourhood's social services and demographics mattered.

Using K-means and hierarchical clustering, we also tried to identify similar groups or clusters based on all of the data's attributes. In hindsight, we could have fed the resulting clusters into our supervised modeling to improve our prediction scores.
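For anyone unfamiliar with the elbow check behind that plot, here is a minimal sketch of how an inertia-versus-k curve can be produced. It is not our project code: the toy points, the `kmeans_inertia` and `elbow_curve` helpers, and the number of restarts are all assumptions made for illustration, and in practice a library such as scikit-learn would do this work.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_inertia(points, k, iterations=20, seed=0):
    """Run plain Lloyd's k-means once; return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def elbow_curve(points, max_k, restarts=5):
    """Best inertia over a few random restarts, for k = 1..max_k."""
    return {k: min(kmeans_inertia(points, k, seed=s) for s in range(restarts))
            for k in range(1, max_k + 1)}

# Hypothetical toy data: three tight groups of 2-D points (not TPS data).
toy_points = [(0, 0), (0, 1), (1, 0),
              (10, 10), (10, 11), (11, 10),
              (20, 0), (20, 1), (21, 0)]

curve = elbow_curve(toy_points, max_k=5)
for k, inertia in curve.items():
    print(k, round(inertia, 2))
```

The "elbow" is the k after which inertia stops dropping sharply. When the curve seems to bend twice, picking k becomes a judgment call rather than something the plot decides for you.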


Predicting

In supervised learning we tried a variety of models, resulting in mediocre accuracy.

We then realized that we had a class imbalance: the majority of crimes fell into the assault category, so our results were naturally skewed. Once we split our model up into one per crime category, our accuracy improved to 70-80% for each category.
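To see why imbalance makes a single accuracy number misleading, here is a small sketch with made-up labels. The class proportions, the `majority_baseline_accuracy` and `per_class_recall` helpers, and the "lazy" model are all hypothetical; none of this is the real TPS distribution or our actual models.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels for illustration only (not the real TPS distribution).
y_true = ["assault"] * 6 + ["burglary"] * 2 + ["auto theft"] * 2
y_pred = ["assault"] * len(y_true)  # a lazy model that always says "assault"

baseline = majority_baseline_accuracy(y_true)
recall = per_class_recall(y_true, y_pred)
print(baseline)
print(recall)
```

The lazy model scores a respectable-looking overall accuracy while never finding a single burglary or auto theft, which is why looking at each category separately is more informative than one blended score.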

As a secondary exercise, we performed time series forecasting using the SARIMA method. SARIMA was selected because it models trend, seasonality, and autoregressive and moving-average components. In our data analysis we saw seasonal patterns emerge, e.g. crimes occurring more frequently on certain days of the week, months of the year, etc.
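To make the "seasonal" part concrete: one building block of SARIMA is seasonal differencing, where each value has the value from one full season earlier subtracted from it. The sketch below shows only that single step. It is not a SARIMA fit and not our project code; the `seasonal_difference` helper and the daily counts are invented for illustration.

```python
def seasonal_difference(series, period):
    """Subtract the value one full season earlier: y[t] - y[t - period]."""
    return [series[t] - series[t - period] for t in range(period, len(series))]

# Hypothetical daily incident counts (Mon..Sun), not TPS data:
# a fixed weekly shape plus one extra incident per day for each later week.
weekly_shape = [5, 3, 3, 4, 8, 9, 6]
daily_counts = [value + week for week in range(3) for value in weekly_shape]

deseasonalized = seasonal_difference(daily_counts, period=7)
print(deseasonalized)
```

With the weekly shape removed, what is left is the week-over-week change. A full SARIMA model adds autoregressive and moving-average terms on top of this step; statsmodels, for instance, provides an implementation as `SARIMAX`.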

Presenting

In our final presentation to the class, we were fortunate to have Ian Williams, the Head of Analytics & Innovation at Toronto Police Service, join us to provide feedback, insights into their team, and the direction they are headed.


Project Presentation

What does this all mean?

  • Besides discovering that I live in the heart of crime in Toronto (surprise!), we found, in response to our objective, that there does appear to be a correlation between crime and both demographics and social services.

  • Along the way we saw trends in when and where crime occurred: Fridays, evenings, certain months of the year, certain neighbourhoods, etc.

  • The most common crime in Toronto is assault, far more common than property crimes such as burglary or auto theft.

  • Why are there so many crimes reported at noon and midnight? Some crimes are reported without the exact time being known. For example, if a person is at work and their home is burglarized sometime during the day, they may select noon as the time when filling out the report.

  • On two sides of the fence you could have a neighbourhood with high crime activity right next to one with low activity. The one with low activity happens to have more recreation centers.

  • Crime is forecast to increase in the coming years. Not surprising, but what will the data-driven decisions be, given the resources and budget?

  • We learned that decisions made around resourcing are not knee-jerk reactions to the crimes of the day, month, or year, but require thoughtful, multi-year retrospectives.

  • We attempted to use Twitter feeds as a source of real-time updates on potential crime, but it turns out there aren't enough citizens reporting it.

  • As of January 2020, TPS has started collecting race-based data. This opens up a whole new world of conversation and analysis.

Link to our lovely 40 page report and code >>>