My Data Science Portfolio
25 May 2020
This portfolio highlights the wide range of data science case studies I’ve worked across and the “industry-like experience” I’ve gained along the way.
Classification Problems
Census Income Classification with Bayesian Optimization
This project was part of my PhD course 80-629 Machine Learning for Large-Scale Data Analysis and Decision Making under the direction of Laurent Charlin at HEC Montreal. The data was taken from the UCI Machine Learning Repository. The goal was to solve a binary classification task: determining whether a person makes over 50k a year. The core of the project lies in applying Bayesian optimization to fine-tune the hyperparameters. To achieve this, we used the scikit-optimize package.
Coding a Binary Tree Classifier from Scratch
This project was part of my master’s course 60-607 Algorithms for Optimization and Big Data Analysis under the direction of Gilles Caporossi at HEC Montreal. The aim was to develop an algorithm from scratch in Python, using as few third-party packages as possible. We decided to build our own binary tree classifier. Our algorithm was less efficient than sklearn’s decision tree classifier at predicting 1 million observations, but we managed to get better accuracy on the Iris dataset.
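To give a feel for what such a classifier involves, here is a minimal sketch of a binary tree classifier using only the standard library. It is not the project's implementation: the Gini impurity criterion, the exhaustive threshold search, the `max_depth` stopping rule, and the toy data are all assumptions chosen for brevity.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:  # keep only strict improvements
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively grow the tree; a leaf is simply the majority class label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i, row in enumerate(rows) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Toy data (made up for illustration): two features, two classes.
rows = [[1.0, 5.0], [1.5, 4.0], [3.0, 1.0], [3.5, 0.5]]
labels = ["a", "a", "b", "b"]
tree = build_tree(rows, labels)
print(predict(tree, [1.2, 4.5]), predict(tree, [3.2, 0.7]))
```

The exhaustive search over every observed value of every feature keeps the code short but scales poorly, which is one plausible reason a from-scratch tree ends up slower than an optimized library on large inputs.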
Regression Problems
Health Care Cost Forecast by using Linear Regression
This project was part of my master’s course 60-619 Analysis and Statistical Inference under the direction of Ferdaous Somrani at HEC Montreal. It was originally written in SAS, but I ported it to Python. The data was taken from Kaggle. The goal of our analysis was to assess the impact of body mass index and smoking status on health insurance costs in the United States, controlling for socio-demographic variables.
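The kind of model involved can be sketched as an ordinary least squares fit of cost on body mass index and a smoker indicator. The snippet below is only an illustration: the numbers are made up (generated from a known linear formula without noise, so the fit can be checked), the variable names are hypothetical, and the socio-demographic controls from the actual analysis are left out.

```python
import numpy as np

# Made-up observations (assumption: NOT the Kaggle data).
bmi = np.array([22.0, 27.5, 31.0, 24.3, 35.2, 29.1])
smoker = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])  # 1 = smoker, 0 = non-smoker

# Costs generated from a known linear rule so the estimates are verifiable.
charges = 2000.0 + 250.0 * bmi + 20000.0 * smoker

# Design matrix with an intercept column, then BMI and the smoker dummy.
X = np.column_stack([np.ones_like(bmi), bmi, smoker])

# Least squares solution of X @ coef ≈ charges.
coef, *_ = np.linalg.lstsq(X, charges, rcond=None)
intercept, beta_bmi, beta_smoker = coef
print(intercept, beta_bmi, beta_smoker)
```

In this setup, `beta_smoker` reads as the estimated difference in cost between a smoker and a non-smoker with the same BMI, and `beta_bmi` as the estimated change in cost per additional BMI point.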
Computer Vision
Natural Language Processing
Building a Fake News Classifier
This project was part of my master’s course 60-621 Text Mining and Social Networks under the direction of Gilles Caporossi. Our task was to classify whether a news article was fake or real. The emphasis was placed on applying traditional techniques such as Bag-of-words, Stemming and TF-IDF rather than state-of-the-art algorithms for sequential data. We also implemented Topic Modelling and Cosine Similarity to better understand our results.
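To make two of these techniques concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity between documents. It is not the project's code: the whitespace tokenizer, the particular IDF formula (`log(N / df) + 1`, one of several common variants), and the three example sentences are assumptions, and stemming is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: (count / len(tokens)) * idf[term]
                        for term, count in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example sentences, not taken from any dataset.
docs = [
    "the senate passed the budget bill",
    "senate votes on budget bill",
    "aliens built the pyramids",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_related > sim_unrelated)
```

The first two sentences share several informative terms, so their similarity comes out higher than that of the first and third, which only share the word "the".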
Kaggle Competitions
Web Scraping
DataCamp Projects
Here is a summary of the projects that I’ve done throughout my journey in the Data Scientist with Python Career Track on DataCamp. The GitHub repository containing these projects is available here.
Project | Description |
---|---|
Introduction to DataCamp Projects | Gentle introduction to DataCamp projects |
Exploring the Evolution of Linux | Explore the development of the Linux operating system through its Git repository history |
Exploring the Bitcoin Cryptocurrency Market | Market capitalization exploration of Bitcoin and other cryptocurrencies |
Exploring 67 years of LEGO | Database exploration of every LEGO set ever built |
Dr. Semmelweis and the Discovery of Handwashing | Data analysis behind one of the most important discoveries of modern medicine: Handwashing |
Word Frequency in Moby Dick | Use web scraping and NLP to find the most frequent words in Herman Melville’s novel, Moby Dick |
Book Recommendations from Charles Darwin | Build a book recommendation system using NLP and the text of books like “On the Origin of Species” |
Find Movie Similarity from Plot Summaries | Use NLP and clustering on movie plot summaries from IMDb and Wikipedia to quantify movie similarity |
TV, Halftime Shows, and the Big Game | Load, clean, and explore Super Bowl data in the age of soaring ad costs and flashy halftime shows |