This week, we chat with Alexey Grigorev, a lead data scientist at OLX Group. We actually met in person last year at OLX Group’s Prod Tech Conference, where he presented on deduplicating images. However, we didn’t recognize each other online, and only found out when I asked him about it right before this chat!
Alexey has an interesting career. He started as a software engineer focused on Java. Then, he stumbled upon something that made him want to switch to data science. …
I think a lot about machine learning. And I think a lot about life. Sometimes, the channels mix and I find certain lessons from machine learning applicable to life.
Here are seven lessons. While I assume most readers are familiar with these machine learning concepts, I begin each lesson with a brief explanation.
We clean data so our downstream analysis or machine learning is correct. As Randy Au shares, data cleaning isn’t grunt work; it is the work.
We don’t use data without exploring and cleaning it. Similarly, we shouldn’t consume life’s inputs without assessing and filtering them.
Take food…
In 2012, data scientist was named the sexiest job of the 21st century. Now in 2020, this catch-all role is more often split into multiple roles such as data scientist, applied scientist, research scientist, and machine learning engineer.
I used to get questions like “What does a data scientist do?” Now, I get questions such as “What does a data/applied/research scientist do? What is a machine learning engineer? How are they different from each other?”
Here’s my attempt to explain the goals, skills, and deliverables of each role. If you’re trying to enter or transition within the field…
Recently, I had a chat with Chip Huyen, a computer scientist and writer. She’s someone I greatly admire. I had read about her life and career and was curious how she did it.
Little did I know that her story was even more inspirational than I had thought. She went from learning English on her own, to attending Stanford, to joining NVIDIA and now SnorkelAI (with plenty of setbacks in between). We also chatted about her love for writing and how it changed her life, as well as her thoughts on machine learning in production.
Regardless of what stage of…
However, most resources don’t discuss the “why” and the “what” enough. Why work on personal projects and build a portfolio? What does a portfolio demonstrate, other than technical skills?
Whether you’re starting on your first or fifth personal project, I hope this will help you find a meaningful “why” and make projects more enjoyable and sustainable. We’ll also hear from awesome creators on their motivations for building and writing (click on 👉). In addition, we’ll discuss the various skills (technical and non-technical) and traits projects demonstrate so you can pick projects that better demonstrate your strengths.
(Note: I’ll address…
RecSys 2020 concluded recently and was a great opportunity to peek into some of the latest thinking about recommender systems from academia and industry. Here are some observations and notes on papers I enjoyed.
There was increased emphasis on ethics and bias this year. Day 1’s keynote was “4 Reasons Why Social Media Make Us Vulnerable to Manipulation” (Video) while Day 2’s keynote was “Bias in Search and Recommender Systems” (Slides).
Two (out of nine) sessions were on “Fairness, Filter Bubbles, and Ethical Concerns” and “Unbiased Recommendation and Evaluation”, discussing papers such as:
In the past year or two, many companies have shared their data discovery platforms (the latest being Facebook’s Nemo). Based on this list, we now know of more than 10 implementations.
I haven’t been paying much attention to these developments in data discovery and wanted to catch up. I was interested in:
By the end of this, we’ll learn about the key features that solve 80% of data discoverability problems. …
A while back, Jeremy wrote a great post on Effective Testing for Machine Learning Systems. He distinguished between traditional software tests and machine learning (ML) tests; software tests check the written logic while ML tests check the learned logic.
ML tests can be further split into testing and evaluation. We’re familiar with ML evaluation, where we train a model and evaluate its performance on an unseen validation set; this is done via metrics (e.g., accuracy, area under the receiver operating characteristic curve (ROC AUC)) and visuals (e.g., precision-recall curve).
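As a minimal sketch of what that evaluation step looks like (a generic scikit-learn example I’ve added for illustration, not code from the original post):

```python
# Sketch of ML evaluation: measure the learned logic on unseen data.
# Dataset and model are stand-ins, not from the original post.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Metrics on the held-out validation set.
acc = accuracy_score(y_val, model.predict(X_val))
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
```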
On the other hand, ML testing involves checks on model behaviour…
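One common form of behavioural check (my own sketch, using a toy stand-in classifier rather than any example from the post) is an invariance test: a perturbation that shouldn’t matter shouldn’t change the prediction.

```python
# Sketch of a behavioural (invariance) test on model predictions.
# The classifier below is a toy stand-in, not a real model.
def predict_sentiment(text: str) -> int:
    """Toy classifier: 1 (positive) if positive words >= negative words."""
    words = text.lower().split()
    positive = sum(w in {"good", "great", "love"} for w in words)
    negative = sum(w in {"bad", "awful", "hate"} for w in words)
    return int(positive >= negative)


def test_invariance_to_names():
    # Swapping a person's name should not flip the predicted sentiment.
    a = predict_sentiment("Alice thought the movie was great")
    b = predict_sentiment("Bob thought the movie was great")
    assert a == b
```

Unlike evaluation metrics, such tests assert specific expected behaviours and can run in CI like ordinary software tests.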
Originally published on eugeneyan.com.
“Instead of manually checking our data, why not try what LinkedIn did? It helped them achieve 95% precision and 80% recall.”
My teammate then shared how LinkedIn used k-nearest neighbours to identify inconsistent labels (in job titles). Next, LinkedIn trained a support vector machine (SVM) on the consistent labels and used the SVM to update the inconsistent labels. This helped them achieve 95% precision on their job title classifier.
This was the most useful suggestion in our discussion. Following up on it led to our product classifier’s eventual accuracy of 95%. How was she…
Originally published on eugeneyan.com.
It’s hard to keep up with the rapid progress of natural language processing (NLP). To organize my thoughts better, I took some time to review my notes, compare the various papers, and sort them chronologically. This helped in my understanding of how NLP (and its building blocks) has evolved over time.
To reinforce my learning, I’m writing this summary of the broad strokes, including brief explanations of how models work and some details (e.g., corpora, ablation studies). Here, we’ll see how NLP has progressed from 1985 till now: