Catalin Tiseanu

A place for stuff I find cool


Machine Learning

Industry / contests for clients

My first real-world contact with Machine Learning came in 2010, during an internship at Facebook in Palo Alto

  • Worked in the Ads Optimization team on improving the Click-Through-Rate (CTR) prediction
  • Task was to come up with new features in order to better match users with ads
  • Wrote a neural network from scratch in C++ and used it to train an Autoencoder over the sparse user text data
  • It was one of the first neural network implementations / usages at Facebook
  • Used k-means in order to bucket the compressed representations into a categorical feature
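The bucketing step can be illustrated with a minimal sketch. It is not the original C++ code: the toy 2-dimensional `codes`, the choice of `k=2` and the function names are all assumptions made for illustration. It runs plain Lloyd's k-means over dense vectors (standing in for the autoencoder's compressed representations) and uses the index of the nearest centroid as the categorical feature.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(point, centroids):
    # Index of the closest centroid; this index is the categorical bucket.
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

# Hypothetical compressed representations (real ones would be higher-dimensional).
codes = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids = kmeans(codes, k=2)
buckets = [nearest_centroid(c, centroids) for c in codes]
print(buckets)
```

Each user then carries a single bucket id, which a CTR model can consume like any other categorical feature.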

Competed and won first prize in the Spoken Language Identification challenge on TopCoder - github link

  • Built a language classifier that took a 10-second MP3 as input and had to identify the language spoken in it, out of 176 possible languages
  • Trained a Gaussian Mixture Model (GMM) with 2048 components per output language (176 GMMs with 2048 components each). The GMMs used diagonal covariance matrices in order to speed up training.
  • Used logistic regression as the final step to calibrate the individual GMMs' predictions
  • Rented 5 AWS c4.8xlarge spot instances in order to train the models in time
  • Used IPython notebooks, sklearn, AWS spot machines and this GMM library
  • Client was Faith Comes by Hearing
  • The above was done within 2 weeks, in Summer 2015
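As a rough sketch of how per-language GMM scoring works (not the contest code), the example below evaluates a clip's feature frames under one diagonal-covariance GMM per language and picks the language with the highest average log-likelihood. The two toy models, the 2-dimensional frames and the names `lang_a` / `lang_b` are hypothetical; the logistic regression calibration step is not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    # With a diagonal covariance the density factorises over dimensions,
    # so one component costs O(d) instead of O(d^2) for a full matrix.
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def average_log_likelihood(frames, gmm):
    # gmm is a list of (weight, mean, variance) components.
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)
        # log-sum-exp over the components, for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)

# Hypothetical toy models: 2 components each instead of 2048.
models = {
    "lang_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "lang_b": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])],
}
frames = [[0.2, -0.1], [0.9, 1.2], [0.4, 0.5]]  # hypothetical audio features

scores = {lang: average_log_likelihood(frames, gmm) for lang, gmm in models.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The per-language scores are what a final calibration model, such as the logistic regression mentioned above, would take as input.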

Competed and got 5th prize in the Master Data Management challenge on TopCoder - github link

  • Built a database deduplication model which removed duplicates from a large database of medical providers - around 500,000 rows
  • Training data consisted of pairs which were known to be duplicates and the database containing the id, name, address and labels (such as 'internal medicine' or 'emergency').
  • Used different indexing keys in order to reduce the set of candidate pairs of rows to be evaluated to a manageable number (blocking)
  • Trained a Gradient Tree Boosting classifier (via XGBoost) on the ground truth data to predict whether a given pair of rows are duplicates
  • Used the fact that the duplication relationship is transitive in order to obtain an additional improvement after the classification phase
  • Used IPython, XGBoost, pandas, AWS spot machines and some custom C++ code for fast candidate generation
  • The above was also done within 2 weeks, in Summer 2015
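The blocking and transitivity steps can be sketched as follows. This is an illustrative example, not the contest solution: the four toy rows, the two blocking keys (zip code and lowercased surname) and the hard-coded `predicted` pairs standing in for classifier output are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(rows, key_funcs):
    # Blocking: only rows that share at least one indexing key are compared.
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for row_id, row in rows.items():
            blocks[key_func(row)].append(row_id)
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs

def merge_transitively(row_ids, duplicate_pairs):
    # Union-find: if (a, b) and (b, c) are duplicates, a and c end up together.
    parent = {r: r for r in row_ids}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for r in row_ids:
        groups[find(r)].add(r)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical provider rows.
rows = {
    1: {"name": "John Smith", "zip": "94301"},
    2: {"name": "Jon Smith", "zip": "94301"},
    3: {"name": "J. Smith", "zip": "94301"},
    4: {"name": "Mary Jones", "zip": "10001"},
}
key_funcs = [lambda r: r["zip"], lambda r: r["name"].split()[-1].lower()]

pairs = candidate_pairs(rows, key_funcs)
predicted = [(1, 2), (2, 3)]  # stand-in for pairs the classifier marked as duplicates
groups = merge_transitively(rows, predicted)
print(sorted(pairs), groups)
```

Here the pair (1, 3) is never predicted directly, yet rows 1, 2 and 3 land in one group, which is the kind of gain transitivity can give after classification.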

At Alien Labs

  • Built a data processing pipeline for Slack chat logs we got from users
  • The pipeline included data cleaning: stopword removal, stemming, removing first names, keeping only English chat logs, etc.
  • Used Latent Dirichlet Allocation (LDA) for topic modelling
  • Used fastText for text classification (predicting whether a specific chat log is design-related or just chit-chat)
  • Used Python Jupyter Notebook, Pandas, Amazon Redshift (via the Pandas connector), gensim (for LDA), pyLDAvis (in order to visualize the LDA topics), sklearn and Facebook fastText
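A minimal sketch of the token-level cleaning, assuming a tiny hand-written stopword list and first-name list (both hypothetical). Stemming and language filtering are left out because they need an external library or model.

```python
import re

STOPWORDS = {"the", "a", "is", "to", "and", "of", "for"}  # hypothetical, far from complete
FIRST_NAMES = {"alice", "bob"}                            # hypothetical

def clean_message(text):
    # Lowercase, keep alphabetic tokens, then drop stopwords and first names.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and t not in FIRST_NAMES]

cleaned = clean_message("Alice, is the new logo ready for Bob?")
print(cleaned)  # ['new', 'logo', 'ready']
```

Token lists of this kind are the usual input for both topic modelling and text classification.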


Competed and got 6th place in the HackerEarth Deep Learning Challenge (team of 3) - github link

  • Task was to build a 27-class grocery image classifier (given around 100 training images per class)
  • Used multiple pretrained CNNs, such as VGG16, VGG19, ResNet50 and InceptionResNetV2, adding dense layers on top
  • Each model was trained independently, progressively increasing the strength of data augmentation as the model started to overfit the data
  • Used an ensemble over the 4 types of pretrained models as well as different checkpoints from each model
  • Used Python Jupyter Notebooks, Keras and Google Compute Engine (Tesla K80 GPUs)
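How the ensemble combined its members is not stated above; a common choice, assumed here, is to average the class probabilities from every model and checkpoint and take the most probable class. The three probability vectors below are made up and use 3 classes instead of 27.

```python
def ensemble_predict(prob_sets):
    # prob_sets: one class-probability vector per model or checkpoint.
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    averaged = [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged

# Hypothetical softmax outputs for one image from three ensemble members.
predictions = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
]
label, averaged = ensemble_predict(predictions)
print(label, averaged)
```

In this toy case the first member alone would pick class 0, while the averaged vote picks class 1.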