Machine Learning
Industry / contests for clients
First real-world contact with Machine Learning came in 2010, during an internship at Facebook in Palo Alto
- Worked in the Ads Optimization team on improving the Click-Through-Rate (CTR) prediction
- The task was to come up with new features in order to better match users with ads
- Wrote a neural network from scratch in C++ and used it to train an Autoencoder over the sparse user text data
- It was one of the first neural network implementations / usages at Facebook
- Used k-means to bucket the compressed autoencoder representations into a categorical feature (sketched below)
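A minimal sketch of that bucketing step using sklearn (the original autoencoder was a from-scratch C++ implementation; the array shapes and cluster count here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the compressed user representations produced by the autoencoder
# (illustrative shapes; the original model was written from scratch in C++).
embeddings = np.random.rand(10_000, 64)   # n_users x embedding_dim

# Cluster the embeddings; each cluster id becomes one level of a categorical
# feature that the CTR model can consume.
kmeans = KMeans(n_clusters=256, random_state=0)
user_bucket = kmeans.fit_predict(embeddings)  # shape (n_users,), values in [0, 256)
```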
Competed and won first prize in the Spoken Language Identification challenge on TopCoder - github link
- Built a language classifier which received a 10-second mp3 as input and predicted the language spoken, among 176 output language classes
- Trained a Gaussian Mixture Model (GMM) with 2048 components per output language (i.e. 176 GMMs with 2048 components each). The GMMs used diagonal covariance matrices to speed up training.
- Used logistic regression as the final step to calibrate the individual GMMs' predictions (see the sketch after this list)
- Rented 5 AWS c4.8xlarge spot instances in order to train the models in time
- Used iPython notebooks, sklearn, AWS spot machines and this GMM library
- Client was Faith Comes by Hearing
- The above was done within 2 weeks, in Summer 2015
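A minimal sketch of the GMM-per-language approach using sklearn (the contest solution used a separate GMM library; the data layout, feature extraction and everything beyond the component count and diagonal covariance are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def train_gmms(frames_by_language, n_components=2048):
    # One diagonal-covariance GMM per language; the diagonal covariance keeps
    # training 2048-component models tractable.
    gmms = {}
    for lang, frames in frames_by_language.items():  # frames: (n_frames, n_features)
        gmms[lang] = GaussianMixture(n_components=n_components,
                                     covariance_type='diag').fit(frames)
    return gmms

def clip_scores(gmms, frames):
    # Average per-frame log-likelihood of one clip under every language's GMM.
    return np.array([gmms[lang].score(frames) for lang in sorted(gmms)])

def train_calibrator(gmms, calibration_clips):
    # Logistic regression over the per-language scores calibrates the raw
    # GMM likelihoods into the final prediction.
    X = np.stack([clip_scores(gmms, frames) for frames, _ in calibration_clips])
    y = np.array([lang for _, lang in calibration_clips])
    return LogisticRegression(max_iter=1000).fit(X, y)
```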
Competed and got 5th prize in the Master Data Management challenge on TopCoder - github link
- Built a database deduplication model which removed duplicates from a large database of medical providers (around 500,000 rows)
- Training data consisted of pairs of rows known to be duplicates, together with the database itself, which contained each provider's id, name, address and labels (such as 'internal medicine' or 'emergency')
- Used different indexing keys in order to reduce the set of candidate pairs of rows to be evaluated to a manageable number (blocking)
- Trained a Gradient Tree Boosting classifier (via XGBoost) on the ground-truth pairs which, given a pair of rows, predicted whether they are duplicates
- Used the fact that the duplication relationship is transitive to obtain an additional improvement after the classification phase (see the sketch after this list)
- Used iPython, XGBoost, pandas, AWS spot machines and some custom C++ code for fast candidate generation
- The above was also done within 2 weeks, in Summer 2015
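A condensed sketch of the dedup pipeline: blocking, a pairwise XGBoost classifier and the transitive-closure step. Field names, blocking keys, features and hyperparameters are illustrative assumptions, not the original code:

```python
import itertools
from collections import defaultdict
import xgboost as xgb

def candidate_pairs(rows, key_funcs):
    # Blocking: only rows sharing an indexing key (e.g. a name prefix or an
    # address token) become candidate pairs, keeping the comparisons manageable.
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for row_id, row in rows.items():
            blocks[key_func(row)].append(row_id)
        for block in blocks.values():
            pairs.update(itertools.combinations(sorted(block), 2))
    return pairs

def train_pair_classifier(pair_features, is_duplicate):
    # Gradient tree boosting over pair similarity features (name, address, labels).
    return xgb.XGBClassifier(n_estimators=200, max_depth=6).fit(pair_features, is_duplicate)

def transitive_closure(duplicate_pairs):
    # Duplication is transitive, so union-find merges the classified pairs into
    # complete duplicate groups, recovering pairs the classifier missed.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())
```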
At Alien Labs
- Built a data processing pipeline for Slack chat logs received from users
- The pipeline included data cleaning: stopword removal, stemming, removing first names, keeping only English chat logs, etc.
- Used Latent Dirichlet Allocation (LDA) for topic modelling (see the gensim sketch below)
- Used fastText for text classification (predicting whether a specific chat log is design-related or just chit-chat)
- Used Python Jupyter notebooks, pandas, Amazon Redshift (via the pandas connector), gensim (for LDA), pyLDAvis (to visualize the LDA topics), sklearn and Facebook fastText
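A minimal sketch of the gensim LDA step, assuming the messages have already been cleaned and tokenised by the pipeline above (the tiny corpus and topic count are illustrative):

```python
from gensim import corpora, models

# Already-cleaned, tokenised Slack messages (stopwords removed, stemmed, English only).
documents = [["design", "mockup", "button"],
             ["lunch", "today", "anyone"]]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Latent Dirichlet Allocation over the bag-of-words corpus; the fitted topics
# can then be inspected, e.g. with pyLDAvis.
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10)
print(lda.print_topics(num_words=5))
```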
Self-learning
Competed and got 6th place in the HackerEarth Deep Learning Challenge (team of 3) - github link
- The task was to build a 27-class grocery image classifier (given around 100 training images per class)
- Used multiple pretrained CNNs (VGG16, VGG19, ResNet50 and InceptionResNetV2), adding dense layers on top (see the Keras sketch below)
- Each model was trained independently, progressively increasing the strength of the data augmentation as the model started to overfit
- Used an ensemble over the 4 types of pretrained models as well as different checkpoints of each model
- Used Python Jupyter notebooks, Keras and Google Compute Engine (Tesla K80 GPUs)
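A minimal Keras sketch of one ensemble member: a pretrained VGG16 base with new dense layers for the 27 classes. The layer sizes, frozen base and optimizer are illustrative assumptions, not the exact contest configuration:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Pretrained convolutional base, initially frozen, with new dense layers on top.
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3), pooling='avg')
base.trainable = False

model = models.Sequential([
    base,
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(27, activation='softmax'),  # 27 grocery classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Ensembling: average the softmax outputs of the different models / checkpoints, e.g.
#   final_probs = np.mean([m.predict(x) for m in trained_models], axis=0)
```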