My Data Science Portfolio

25 May 2020

This portfolio shows the range of data science case studies I’ve worked on and the “industry-like experience” I’ve gained along the way.

Classification Problems

Census Income Classification with Bayesian Optimization

I completed this project as part of my PhD course 80-629 Machine Learning for Large-Scale Data Analysis and Decision Making under the direction of Laurent Charlin at HEC Montreal. The data comes from the UCI Machine Learning Repository. The goal was to solve a binary classification task: predicting whether a person earns more than $50K a year. The core of the project is the use of Bayesian optimization to tune the hyperparameters, for which we used the scikit-optimize package.

Notebook GitHub

Coding a Binary Tree Classifier from Scratch

I completed this project as part of my master’s course 60-607 Algorithms for Optimization and Big Data Analysis under the direction of Gilles Caporossi at HEC Montreal. The aim was to develop an algorithm from scratch in Python, using as few third-party packages as possible. We decided to build our own binary tree classifier. Our algorithm was less efficient than sklearn’s decision tree classifier when predicting 1 million observations, but it achieved better accuracy on the Iris dataset.

Notebook GitHub

Regression Problems

Health Care Cost Forecast by using Linear Regression

I completed this project as part of my master’s course 60-619 Analysis and Statistical Inference under the direction of Ferdaous Somrani at HEC Montreal. It was originally written in SAS, and I later ported it to Python. The data comes from Kaggle. The goal of our analysis was to assess the impact of body mass index and smoking status on health insurance costs in the United States, while controlling for socio-demographic variables.
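To make the setup concrete, here is a minimal sketch of an ordinary least squares fit with a continuous predictor (BMI) and a 0/1 smoker indicator. It is not the notebook's code: the `ols` helper is hypothetical, the socio-demographic controls are left out, and the data is synthetic, generated without noise from made-up coefficients (2000, 300, 20000) so that the fit can be checked by eye.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * target for r, target in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for i in range(k):
            if i != c:
                A[i] = [vi - A[i][c] * vc for vi, vc in zip(A[i], A[c])]
    return [A[i][k] for i in range(k)]


# Synthetic rows: (bmi, smoker). Charges follow an invented exact formula,
# so these numbers say nothing about the real Kaggle data or the project's results.
features = [(22.0, 0), (27.0, 1), (31.0, 0), (35.0, 1), (25.0, 1), (29.0, 0)]
charges = [2000.0 + 300.0 * bmi + 20000.0 * smoker for bmi, smoker in features]

intercept, bmi_effect, smoker_effect = ols(features, charges)
print(round(intercept, 2), round(bmi_effect, 2), round(smoker_effect, 2))
```

In this setup the smoker coefficient reads as the expected difference in cost between a smoker and a non-smoker with the same BMI, which is the kind of effect the analysis set out to estimate.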

Notebook GitHub

Computer Vision

Natural Language Processing

Building a Fake News Classifier

I completed this project as part of my master’s course 60-621 Text Mining and Social Networks under the direction of Gilles Caporossi. Our task was to classify news articles as fake or real. We focused on traditional techniques such as bag-of-words, stemming, and TF-IDF rather than state-of-the-art algorithms for sequential data. We also used topic modelling and cosine similarity to better understand our results.
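For readers unfamiliar with these techniques, the sketch below shows one way to compute TF-IDF vectors and compare documents with cosine similarity in plain Python. It is an illustration only, not the notebook's code: the tokenizer, the smoothed IDF formula, the function names, and the three headlines are assumptions invented for the example, and stemming is left out.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (w.strip(".,!?").lower() for w in text.split())
    return [t for t in tokens if t]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (one common variant)
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({term: (freq / len(toks)) * idf[term] for term, freq in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Invented headlines, used only to exercise the functions.
headlines = [
    "Scientists confirm the vaccine is safe",
    "Vaccine confirmed safe by scientists",
    "Aliens secretly control the stock market",
]
vecs = tfidf_vectors(headlines)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(sim_related > sim_unrelated)
```

Without stemming, "confirm" and "confirmed" count as different terms here, which is the kind of gap that stemming is meant to close.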

Notebook GitHub

Kaggle Competitions

Web Scraping

DataCamp Projects

Here is a summary of the projects that I’ve done throughout my journey in the Data Scientist with Python Career Track on DataCamp. The GitHub repository containing these projects is available here.

Project	Description
Introduction to DataCamp Projects	Gentle introduction to DataCamp projects
Exploring the Evolution of Linux	Development discovery of the Linux operating system by exploring its Git repository history
Exploring the Bitcoin Cryptocurrency Market	Market capitalization exploration of Bitcoin and other cryptocurrencies
Exploring 67 years of LEGO	Database exploration of every LEGO set ever built
Dr. Semmelweis and the Discovery of Handwashing	Data analysis behind one of the most important discoveries of modern medicine: Handwashing
Word Frequency in Moby Dick	Use web scraping and NLP to find the most frequent words in Herman Melville’s novel, Moby Dick
Book Recommendations from Charles Darwin	Build a book recommendation system using NLP and the text of books like “On the Origin of Species”
Find Movie Similarity from Plot Summaries	Use NLP and clustering on movie plot summaries from IMDb and Wikipedia to quantify movie similarity
TV, Halftime Shows, and the Big Game	Load, clean, and explore Super Bowl data in the age of soaring ad costs and flashy halftime shows

Florent Fettu Data Scientist

