Catalin Tiseanu

A place for stuff I find cool


Machine Learning

Industry / contests for clients

First real-world contact with Machine Learning came in 2010, during an internship at Facebook in Palo Alto

  • Worked in the Ads Optimization team on improving Click-Through-Rate (CTR) prediction
  • The task was to come up with new features in order to better match users with ads
  • Wrote a neural network from scratch in C++ and used it to train an autoencoder over the sparse user text data
  • It was one of the first neural network implementations / usages at Facebook
  • Used k-means to bucket the compressed representations into a categorical feature (sketched after this list)
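
A minimal sketch of the embed-then-bucket idea, in Python with sklearn rather than the original from-scratch C++ (the data, dimensions and cluster count below are made up for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.cluster import KMeans

# Stand-in for the sparse user text features (hypothetical shape).
X = np.random.rand(1000, 500)

# An autoencoder is just a network trained to reconstruct its input;
# here a single-hidden-layer MLP fit on X -> X plays that role.
ae = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                  max_iter=300, random_state=0)
ae.fit(X, X)

# Forward pass up to the hidden layer: the compressed representation.
hidden = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])

# Bucket the compressed representations with k-means; the cluster id
# becomes a categorical feature for the CTR model.
buckets = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(hidden)
```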

Competed and won first prize in the Spoken Language Identification challenge on TopCoder - github link

  • Built a language classifier which received a 10-second mp3 as input and had to classify the language spoken in it - among 176 output language classes
  • Trained one Gaussian Mixture Model (GMM) with 2048 components per output language (i.e. 176 GMMs with 2048 components each). The GMMs used diagonal covariance matrices to speed up training (sketched after this list).
  • Used logistic regression as a final step to calibrate the individual GMMs' predictions
  • Rented 5 AWS c4.8xlarge spot instances in order to train the models in time
  • Used IPython notebooks, sklearn, AWS spot machines and this GMM library
  • Client was Faith Comes by Hearing
  • The above was done within 2 weeks, in Summer 2015
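
A minimal sketch of the per-language GMM + logistic regression setup with sklearn; the toy data, component counts and feature dimensions below are placeholders (the contest used 176 languages and 2048 components per GMM):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_langs, dim = 3, 13  # toy values; MFCC-like frame features assumed

# One diagonal-covariance GMM per language, fit on that language's frames.
train_frames = [rng.normal(loc=i, size=(500, dim)) for i in range(n_langs)]
gmms = [GaussianMixture(n_components=4, covariance_type="diag",
                        random_state=0).fit(x) for x in train_frames]

def clip_scores(clip):
    # Average per-frame log-likelihood of the clip under each language model.
    return [gmm.score(clip) for gmm in gmms]

# Calibration: logistic regression maps the vector of raw log-likelihoods
# to properly calibrated class probabilities.
clips = [rng.normal(loc=i % n_langs, size=(100, dim)) for i in range(60)]
X = np.array([clip_scores(c) for c in clips])
y = np.array([i % n_langs for i in range(60)])
calibrator = LogisticRegression(max_iter=1000).fit(X, y)
print(calibrator.predict_proba(X[:1]).round(3))
```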

Competed and got 5th prize in the Master Data Management challenge on TopCoder - github link

  • Built a database deduplication model which removed duplicates from a large database of medical providers - around 500,000 rows
  • Training data consisted of pairs known to be duplicates, plus the database itself, which contained the id, name, address and labels (such as 'internal medicine' or 'emergency') for each provider
  • Used different indexing keys to reduce the set of candidate row pairs to a manageable number (blocking; sketched after this list)
  • Trained a Gradient Tree Boosting classifier (via XGBoost) which, given a pair of rows, predicted whether they are duplicates - using the ground-truth data
  • Used the fact that the duplicate relationship is transitive to obtain an additional improvement after the classification phase
  • Used IPython notebooks, XGBoost, pandas, AWS spot machines and some custom C++ code for fast candidate generation
  • The above was also done within 2 weeks, in Summer 2015
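
A minimal sketch of the blocking and transitive-closure steps (toy rows and a toy blocking key; the actual XGBoost pair classifier is stubbed out):

```python
from collections import defaultdict
from itertools import combinations

rows = [
    {"id": 1, "name": "John Smith", "zip": "10001"},
    {"id": 2, "name": "Jon Smith",  "zip": "10001"},
    {"id": 3, "name": "Mary Jones", "zip": "94301"},
]

# Blocking: only rows sharing an indexing key (here, the zip code) form
# candidate pairs, which keeps the number of comparisons manageable.
blocks = defaultdict(list)
for r in rows:
    blocks[r["zip"]].append(r["id"])
candidates = {p for ids in blocks.values() for p in combinations(sorted(ids), 2)}

# In the contest an XGBoost classifier scored each candidate pair here;
# a stand-in prediction is used below.
predicted_dups = {(1, 2)}

# Transitive closure via union-find: if a~b and b~c, then a~c.
parent = {r["id"]: r["id"] for r in rows}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
for a, b in predicted_dups:
    parent[find(a)] = find(b)

clusters = defaultdict(list)
for r in rows:
    clusters[find(r["id"])].append(r["id"])
print(list(clusters.values()))  # [[1, 2], [3]]
```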

At Alien Labs

  • Built a data processing pipeline for Slack chat logs we got from users
  • The pipeline included data cleaning: stopword removal, stemming, removing first names, keeping only English chat logs, etc.
  • Used Latent Dirichlet Allocation (LDA) for topic modelling (sketched after this list)
  • Used fastText for text classification (predicting whether a specific chat log is design-related or just chit-chat)
  • Used Python Jupyter Notebooks, pandas, Amazon Redshift (via the pandas connector), gensim (for LDA), pyLDAvis (to visualize the LDA topics), sklearn and Facebook fastText
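
A minimal sketch of the cleaning + LDA step with gensim (the chat lines are invented; first-name removal and language filtering are omitted):

```python
from gensim import corpora, models
from gensim.parsing.preprocessing import remove_stopwords, stem_text

logs = [
    "the new landing page mockups look great",
    "lunch anyone? I am starving",
    "can we review the font choices for the design system",
]

# Cleaning: stopword removal + stemming, as in the real pipeline.
docs = [stem_text(remove_stopwords(t.lower())).split() for t in logs]

# Standard gensim LDA flow: dictionary -> bag-of-words corpus -> model.
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```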

Self-learning

Competed and got 6th place in the HackerEarth Deep Learning Challenge (team of 3) - github link

  • The task was to build a 27-class grocery image classifier (given around 100 training images per class)
  • Used multiple pretrained CNNs, such as VGG16, VGG19, ResNet50 and InceptionResNetV2, adding dense layers on top (sketched after this list)
  • Each model was trained independently, progressively increasing the strength of the data augmentation as the model started to overfit
  • Used an ensemble over the 4 types of pretrained models, as well as over different checkpoints from each model
  • Used Python Jupyter Notebooks, Keras and Google Compute Engine (Tesla K80 GPUs)
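
A minimal sketch of one ensemble member in Keras: a frozen pretrained base with a dense head (the head sizes, augmentation strengths and paths are placeholders, not the contest settings):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Frozen pretrained convolutional base; only the new dense head is trained.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(27, activation="softmax"),  # 27 grocery classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Augmentation; in the contest its strength was increased as overfitting set in.
augment = ImageDataGenerator(rescale=1.0 / 255, rotation_range=20,
                             zoom_range=0.2, horizontal_flip=True)
# train_gen = augment.flow_from_directory("train/", target_size=(224, 224),
#                                         batch_size=32, class_mode="categorical")
# model.fit(train_gen, epochs=10)
```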