I build machine learning systems at a bookstore & write at https://eugeneyan.com. Applied science @ Amazon. Prev Lazada, Alibaba, IBM. twitter.com/eugeneyan

Stop procrastinating, go off the happy path, learn just-in-time, and get your hands dirty.

Read the latest version here.

Don’t get me wrong, I love MOOCs. They’re great for trying to learn a new programming language (e.g., Python, Scala) or framework (e.g., Spark, TensorFlow) or subject (e.g., statistics, machine learning). The structured learning environment, excellent teaching, and exercises (and solutions) guide us through the best way to learn new concepts.

But most of the time, we don’t really need it. If we already know machine learning, taking that shiny new MOOC won’t help with applying it more effectively. Doing another Python tutorial won’t help with writing better code. Most MOOCs follow the Pareto Principle


Why real-time? How have Chinese & US companies built real-time recommenders? How to design & build an MVP?

Read the latest version here.

A few weeks ago, Chip compared the state of real-time machine learning in China and the US: while many Chinese companies have adopted real-time ML, US companies are still assessing its value. She also wrote about ML going real-time here.

This post continues the thread and shares how real-time ML looks in practice. Drawing from my experience and industry papers/blogs, we’ll discuss real-time recommendations.

  • When does real-time recommendation make sense? When does it not?
  • How have Chinese and US companies implemented real-time recommenders?
  • How can we design and implement a simple MVP?

Note: This discussion…


Office Hours

How he switched from engineering to data science, what “senior” means, and how writing helps.

Photo by Brendan Church on Unsplash

Read the latest version here (video included!)

This week, we chat with Alexey Grigorev, a lead data scientist from OLX Group. We actually met in person last year at OLX Group’s Prod Tech Conference where he presented how to deduplicate images. However, we didn’t recognize each other online, and only found out when I asked him about it right before this chat!

Alexey has an interesting career. He started as a software engineer focused on Java. Then, he stumbled upon something that made him want to switch to data science. …


Opinion

Read the latest version here.

I think a lot about machine learning. And I think a lot about life. Sometimes, the channels mix and I find certain lessons from machine learning applicable to life.

Photo by Fabio Comparelli on Unsplash

Here are seven lessons. While I assume most readers are familiar with these machine learning concepts, I begin each lesson with a brief explanation.

Data cleaning: Assess what you consume

We clean data so our downstream analysis or machine learning is correct. As Randy Au shares, data cleaning isn’t grunt work; it is the work.

We don’t use data without exploring and cleaning it. …


Photo by Markus Spiske on Unsplash

Read the latest version here.

In 2012, the data scientist was named the sexiest job of the 21st century. Now in 2020, this catch-all role is more often split into multiple roles such as data scientist, applied scientist, research scientist, and machine learning engineer.

I used to get questions like “What does a data scientist do?” Now, I get questions such as “What does a data/applied/research scientist do? What is a machine learning engineer? How are they different from each other?”

Here’s my attempt to explain the goals, skills, and deliverables of each role. If you’re trying to enter…


Chip shares the setbacks she faced, how she overcame them, and how writing changed her life.

Read the latest version here.

Recently, I had a chat with Chip Huyen, a computer scientist and writer. She’s someone I greatly admire. I had read about her life and career and was curious how she did it.

Little did I know that her story is more inspirational than I had thought. She went from learning English on her own, to attending Stanford, and then joining NVIDIA and now, SnorkelAI (with plenty of setbacks in between). We also chatted about her love for writing and how it changed her life, as well as her thoughts on machine learning in production.


Office Hours

Thinking of building your data science portfolio? If we google for “data science portfolio”, we’ll get many results on “how” to build one.

Google search results for “data science portfolio”. Many are related to getting a job.

Read the latest version here.

However, most resources don’t say enough about the “why” and the “what”. Why work on personal projects and build a portfolio? What does a portfolio demonstrate, other than technical skills?

Whether you’re starting on your first or fifth personal project, I hope this will help you find a meaningful “why” and make projects more enjoyable and sustainable. We’ll also hear from awesome creators on their motivations for building and writing (click on 👉). …


Thoughts and Theory

Read the latest version here.

RecSys 2020 concluded recently and was a great opportunity to peek into some of the latest thinking about recommender systems from academia and industry. Here are some observations and notes on papers I enjoyed.

Photo by JOSHUA COLEMAN on Unsplash

Emphasis on ethics & bias; More sequences & bandits

There was increased emphasis on ethics and bias this year. Day 1’s keynote was “4 Reasons Why Social Media Make Us Vulnerable to Manipulation” (Video) while Day 2’s keynote was “Bias in Search and Recommender Systems” (Slides).

Two (out of nine) sessions were on “Fairness, Filter Bubbles, and Ethical Concerns” and “Unbiased Recommendation and Evaluation”, discussing papers such as:



Read the latest version here.

In the past year or two, many companies have shared their data discovery platforms (the latest being Facebook’s Nemo). Based on this list, we now know of more than 10 implementations.

I haven’t been paying much attention to these developments in data discovery and wanted to catch up. I was interested in:

By the end of this, we’ll learn about the key features that solve 80% of data discoverability problems…


Hands-on Tutorials

Checking for correct implementation, expected learned behaviour, and satisfactory performance (with a sample GitHub!)

Testing machine learning (source)

Read the latest version here.

A while back, Jeremy wrote a great post on Effective Testing for Machine Learning Systems. He distinguished between traditional software tests and machine learning (ML) tests; software tests check the written logic while ML tests check the learned logic.

ML tests can be further split into testing and evaluation. We’re familiar with ML evaluation, where we train a model and evaluate its performance on an unseen validation set; this is done via metrics (e.g., accuracy, area under the receiver operating characteristic curve (AUC ROC)) and visuals (e.g., precision-recall curve).
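To make the evaluation half concrete, here is a minimal sketch in plain Python of the two metrics named above. The labels, scores, and the 0.5 threshold are made up for illustration, and the helper names (accuracy, auc_roc) are my own, not from Jeremy’s post or the sample repo. AUC ROC is computed here through its pairwise interpretation: the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc_roc(y_true, y_score):
    """AUC ROC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC ROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels and model scores (assumed, not real data).
val_labels = [1, 0, 1, 1, 0, 0, 1, 0]
val_scores = [0.9, 0.2, 0.65, 0.4, 0.5, 0.1, 0.8, 0.3]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
val_preds = [1 if s >= 0.5 else 0 for s in val_scores]

val_accuracy = accuracy(val_labels, val_preds)
val_auc = auc_roc(val_labels, val_scores)
print(f"accuracy={val_accuracy:.3f}, auc_roc={val_auc:.3f}")
```

The pairwise loop is quadratic, which is fine for a toy validation set; in practice we’d more likely reach for accuracy_score and roc_auc_score from sklearn.metrics.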

On the other hand, ML testing
