Machine Learning
Industry / contests for clients
First real-world contact with Machine Learning came in 2010, during an internship at Facebook in Palo Alto
- Worked in the Ads Optimization team on improving the Click-Through-Rate (CTR) prediction
- The task was to come up with new features in order to better match users with ads
- Wrote a neural network from scratch in C++ and used it to train an Autoencoder over the sparse user text data
- It was one of the first neural network implementations / usages at Facebook
- Used k-means to bucket the compressed autoencoder representations into a categorical feature (sketched below)
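A minimal sketch of that bucketing step using sklearn (the original autoencoder was a from-scratch C++ implementation; the array shapes and cluster count here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the compressed user representations produced by the autoencoder
# (illustrative shapes; the original model was written from scratch in C++).
embeddings = np.random.rand(10_000, 64)   # n_users x embedding_dim

# Cluster the embeddings; each cluster id becomes one level of a categorical
# feature that the CTR model can consume.
kmeans = KMeans(n_clusters=256, random_state=0)
user_bucket = kmeans.fit_predict(embeddings)  # shape (n_users,), values in [0, 256)
```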
Competed and won first prize in the Spoken Language Identification challenge on TopCoder - github link
- Built a language classifier which received a 10-second mp3 as input and predicted the language spoken, among 176 output language classes
- Trained a Gaussian Mixture Model (GMM) with 2048 components per output language (i.e. 176 GMMs with 2048 components each). The GMMs used diagonal covariance matrices to speed up training.
- Used logistic regression as the final step to calibrate the individual GMMs' predictions (see the sketch after this list)
- Rented 5 AWS c4.8xlarge spot instances in order to train the models in time
- Used iPython notebooks, sklearn, AWS spot machines and this GMM library
- Client was Faith Comes by Hearing
- The above was done within 2 weeks, in Summer 2015
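A minimal sketch of the GMM-per-language approach using sklearn (the contest solution used a separate GMM library; the data layout, feature extraction and everything beyond the component count and diagonal covariance are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def train_gmms(frames_by_language, n_components=2048):
    # One diagonal-covariance GMM per language; the diagonal covariance keeps
    # training 2048-component models tractable.
    gmms = {}
    for lang, frames in frames_by_language.items():  # frames: (n_frames, n_features)
        gmms[lang] = GaussianMixture(n_components=n_components,
                                     covariance_type='diag').fit(frames)
    return gmms

def clip_scores(gmms, frames):
    # Average per-frame log-likelihood of one clip under every language's GMM.
    return np.array([gmms[lang].score(frames) for lang in sorted(gmms)])

def train_calibrator(gmms, calibration_clips):
    # Logistic regression over the per-language scores calibrates the raw
    # GMM likelihoods into the final prediction.
    X = np.stack([clip_scores(gmms, frames) for frames, _ in calibration_clips])
    y = np.array([lang for _, lang in calibration_clips])
    return LogisticRegression(max_iter=1000).fit(X, y)
```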
Competed and got 5th prize in the Master Data Management challenge on TopCoder - github link
- Built a database deduplication model which removed duplicates from a large database of medical providers (around 500,000 rows)
- Training data consisted of pairs of rows known to be duplicates, together with the database itself, which contained each provider's id, name, address and labels (such as 'internal medicine' or 'emergency')
- Used different indexing keys in order to reduce the set of candidate pairs of rows to be evaluated to a manageable number (blocking)
- Trained a Gradient Tree Boosting classifier (via XGBoost) on the ground-truth pairs which, given a pair of rows, predicted whether they are duplicates
- Used the fact that the duplication relationship is transitive to obtain an additional improvement after the classification phase (see the sketch after this list)
- Used iPython, XGBoost, pandas, AWS spot machines and some custom C++ code for fast candidate generation
- The above was also done within 2 weeks, in Summer 2015
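A condensed sketch of the dedup pipeline: blocking, a pairwise XGBoost classifier and the transitive-closure step. Field names, blocking keys, features and hyperparameters are illustrative assumptions, not the original code:

```python
import itertools
from collections import defaultdict
import xgboost as xgb

def candidate_pairs(rows, key_funcs):
    # Blocking: only rows sharing an indexing key (e.g. a name prefix or an
    # address token) become candidate pairs, keeping the comparisons manageable.
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for row_id, row in rows.items():
            blocks[key_func(row)].append(row_id)
        for block in blocks.values():
            pairs.update(itertools.combinations(sorted(block), 2))
    return pairs

def train_pair_classifier(pair_features, is_duplicate):
    # Gradient tree boosting over pair similarity features (name, address, labels).
    return xgb.XGBClassifier(n_estimators=200, max_depth=6).fit(pair_features, is_duplicate)

def transitive_closure(duplicate_pairs):
    # Duplication is transitive, so union-find merges the classified pairs into
    # complete duplicate groups, recovering pairs the classifier missed.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())
```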
At Alien Labs
- Built a data processing pipeline for Slack chat logs received from users
- The pipeline included data cleaning: stopword removal, stemming, removing first names, keeping only English chat logs, etc.
- Used Latent Dirichlet Allocation (LDA) for topic modelling (see the gensim sketch below)
- Used fastText for text classification (predicting whether a specific chat log is design-related or just chit-chat)
- Used Python Jupyter notebooks, pandas, Amazon Redshift (via the pandas connector), gensim (for LDA), pyLDAvis (to visualize the LDA topics), sklearn and Facebook fastText
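A minimal sketch of the gensim LDA step, assuming the messages have already been cleaned and tokenised by the pipeline above (the tiny corpus and topic count are illustrative):

```python
from gensim import corpora, models

# Already-cleaned, tokenised Slack messages (stopwords removed, stemmed, English only).
documents = [["design", "mockup", "button"],
             ["lunch", "today", "anyone"]]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Latent Dirichlet Allocation over the bag-of-words corpus; the fitted topics
# can then be inspected, e.g. with pyLDAvis.
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10)
print(lda.print_topics(num_words=5))
```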
Self-learning
Competed and got 6th place in the HackerEarth Deep Learning Challenge (team of 3) - github link
- The task was to build a 27-class grocery image classifier (given around 100 training images per class)
- Used multiple pretrained CNNs (VGG16, VGG19, ResNet50 and InceptionResNetV2), adding dense layers on top (see the Keras sketch below)
- Each model was trained independently, progressively increasing the strength of the data augmentation as the model started to overfit
- Used an ensemble over the 4 types of pretrained models as well as different checkpoints of each model
- Used Python Jupyter notebooks, Keras and Google Compute Engine (Tesla K80 GPUs)
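A minimal Keras sketch of one ensemble member: a pretrained VGG16 base with new dense layers for the 27 classes. The layer sizes, frozen base and optimizer are illustrative assumptions, not the exact contest configuration:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Pretrained convolutional base, initially frozen, with new dense layers on top.
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3), pooling='avg')
base.trainable = False

model = models.Sequential([
    base,
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(27, activation='softmax'),  # 27 grocery classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Ensembling: average the softmax outputs of the different models / checkpoints, e.g.
#   final_probs = np.mean([m.predict(x) for m in trained_models], axis=0)
```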