2020-01-13: Data Science Fall 2019 Class Projects

Here’s a list of projects from the CS 620 Introduction to Data Science & Analytics course from Fall 2019. All the projects are implemented using Python and Google Colab. Google Colab (Colaboratory) is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.
All the projects are based on publicly available datasets. In the Colab reports you'll find links to the datasets along with all the pre-processing, wrangling, analytics, machine learning, and visualization steps in Python. If you need a quick overview, there's a summary of the project at the very end of each Colab report.

Here are a few projects that I'd like to highlight from the list.

This project uses two datasets, both taken from City of Norfolk Open Data: the street light outage dataset and the Police Incident Reports dataset. The end goal was to find the relationship between street light outages and the number of incidents reported to the police in Norfolk. Careful exploration of the datasets may also surface other useful information. For example, prediction models built on the street light outage dataset might reveal the factors that contribute to whether street lights are functioning.
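One way to relate the two datasets is to count events per day in each and then correlate the two series. The sketch below uses only the standard library and made-up dates; the list names, the daily granularity, and the `pearson` helper are assumptions for illustration, not the project's actual code or column names.

```python
from collections import Counter
from math import sqrt

# Made-up toy records: one date string per reported event.
# The real Norfolk datasets have different columns and far more rows.
outage_dates = ["2019-01-01", "2019-01-01", "2019-01-02",
                "2019-01-03", "2019-01-03", "2019-01-03"]
incident_dates = ["2019-01-01", "2019-01-01", "2019-01-01",
                  "2019-01-02", "2019-01-02",
                  "2019-01-03", "2019-01-03", "2019-01-03",
                  "2019-01-03", "2019-01-03",
                  "2019-01-04"]


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


outages_per_day = Counter(outage_dates)
incidents_per_day = Counter(incident_dates)

# Use every day seen in either dataset; a Counter returns 0 for missing days.
days = sorted(set(outages_per_day) | set(incidents_per_day))
outage_counts = [outages_per_day[d] for d in days]
incident_counts = [incidents_per_day[d] for d in days]

r = pearson(outage_counts, incident_counts)
print(f"Pearson r over {len(days)} days: {r:.2f}")
```

A high coefficient on real data would only indicate that the two counts move together; it would not by itself show that outages cause incidents.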

Dissolving the Myth surrounding Gender, Ethnic and Job discrimination in the city of Norfolk
In the city of Norfolk, which is home to over 244,000 people, one might say that discrimination and bias towards certain individuals do not exist. But as the sayings go, "the facts are in the details" and "numbers don't lie," and it can be inferred from the charts in the report that yes, there is some form of discrimination going on.

A careful look at the police officer group reveals that white women in this group earn far less in base salary than white men. A strong concentration around the 25th percentile of the violin plot supports this hypothesis. There was not much difference in salary between Black women and Black men. Hispanic officers had a rather staggering base salary, though Hispanic women do earn more. American Indian officers, although few in number, still had higher pay.
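The quartiles that a violin plot visualizes can also be tabulated directly. Here is a minimal sketch that groups base salaries by two attributes and reports the quartiles per group. The rows, group labels, and salary figures are invented placeholders, not Norfolk payroll data.

```python
from collections import defaultdict
from statistics import quantiles

# Invented placeholder rows: (group, gender, base_salary).
# These are NOT real figures from the Norfolk employee dataset.
rows = [
    ("Group A", "Female", 41000), ("Group A", "Female", 43000),
    ("Group A", "Female", 52000),
    ("Group A", "Male", 45000), ("Group A", "Male", 51000),
    ("Group A", "Male", 58000),
    ("Group B", "Female", 44000), ("Group B", "Female", 47000),
    ("Group B", "Female", 50000),
    ("Group B", "Male", 43000), ("Group B", "Male", 48000),
    ("Group B", "Male", 51000),
]

# Collect salaries for each (group, gender) pair.
salaries_by_group = defaultdict(list)
for group, gender, salary in rows:
    salaries_by_group[(group, gender)].append(salary)

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
# It needs at least two values per group on older Python versions.
summary = {}
for key, salaries in salaries_by_group.items():
    q1, median, q3 = quantiles(salaries, n=4)
    summary[key] = {"count": len(salaries), "q1": q1,
                    "median": median, "q3": q3}

for key in sorted(summary):
    print(key, summary[key])
```

Comparing the `q1` values across groups gives a numeric counterpart to reading the lower part of each violin.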

Determining the contributing factors and similarities of absentees at work and predicting future absentees
The goal of the project was to use data on absences from work to determine which factors play the biggest roles, and to attempt to predict employee absences along with the reasons for them. There can be many reasons why an employee may have to take time off work, relating to health, family life, and social activities. If the reasons and factors are fairly consistent with the attributes of an individual, it would be possible to predict why a new individual with similar attributes would take time off work.
The major reasons for missing work are medical consultation and dental consultation.

Factors that affect absenteeism:
  • Spring has the most absences, and March is the most frequently missed month (flu season).
  • The most commonly missed day of the week is Monday, and the least commonly missed are Thursday and Friday.
  • There is a strong correlation between employees who have a disciplinary failure and the reason they're absent. They're mostly absent because of a dentist appointment.
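Findings such as the most frequently missed weekday or month come down to simple tallies. The sketch below assumes each absence is available as an ISO date string, which may not match how the project's dataset stores it; the dates are made up.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up absence dates, one entry per absence.
absence_dates = ["2019-03-04", "2019-03-11", "2019-03-14", "2019-07-05"]

by_weekday = Counter()
by_month = Counter()
for text in absence_dates:
    day = date.fromisoformat(text)
    by_weekday[WEEKDAYS[day.weekday()]] += 1  # weekday(): Monday is 0
    by_month[day.month] += 1

# most_common() sorts from the most to the least frequent value.
print(by_weekday.most_common())
print(by_month.most_common())
```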
Correlations between employees:
  • Employees who are social drinkers are more likely to live far away from work.
  • Employees who own a pet are more likely to have an increased transportation expense.
  • Employees who worked with the company for long periods of time are more likely to be social drinkers.
Boston Crimes Exploratory Data Analysis
Boston is the largest city and the capital of Massachusetts. It's one of the oldest and most famous cities in the U.S. As a cultural anchor in the thriving Seaport District, Boston attracts thousands of tourists every year. This dataset is provided by the Boston Police Department (BPD), which records crime types, dates, frequencies, and so on. As we can see from the EDA:
  • Larceny is by far the most common type of serious crime.
  • Serious crimes are most likely to occur in the afternoon and evening.
  • Serious crimes are most likely to occur on Friday and least likely to occur on Sunday.
  • Serious crimes are most likely to occur in the summer and early fall, and least likely to occur in the winter (with the exception of January, which has a crime rate more similar to the summer).
  • There is no outstanding connection between major holidays and crime rates.
  • Serious crimes are most common in the city center, especially districts A1 and D4.
This EDA is just one approach to analyzing the dataset. Further study could examine how different types of crimes vary in time and space. Another interesting direction would be to combine this with other data about the city of Boston, such as demographics or even the weather, to investigate which factors help predict crime rates across time and space.
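The kind of tallies behind the findings listed above can be sketched as counts by offense type, district, and part of day. The rows are made up, and the part-of-day boundaries in `part_of_day` are an assumption chosen for illustration, not the definition used in the project.

```python
from collections import Counter
from datetime import datetime

# Made-up rows: (offense, district, timestamp). Not real BPD records.
rows = [
    ("Larceny", "A1", "2019-06-14 15:30:00"),
    ("Larceny", "D4", "2019-06-14 19:05:00"),
    ("Robbery", "A1", "2019-06-16 02:10:00"),
    ("Larceny", "A1", "2019-06-21 13:45:00"),
]


def part_of_day(hour):
    """Map an hour (0-23) to a coarse bucket; boundaries are assumed."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


by_offense = Counter()
by_district = Counter()
by_part_of_day = Counter()
for offense, district, stamp in rows:
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    by_offense[offense] += 1
    by_district[district] += 1
    by_part_of_day[part_of_day(when.hour)] += 1

print(by_offense.most_common())
print(by_district.most_common())
print(by_part_of_day.most_common())
```

Keying the counters on pairs such as `(offense, district)` would be one way to start on the further study of how crime types vary in time and space.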

-- Sampath