Problems before tools: A story of the full-stack data scientist
I recently read an article by Eric Colson, Chief Algorithms Officer at Stitch Fix, in which he argued that we should avoid building data science teams like a manufacturing plant, composed of highly specialized individuals each operating a different part of the process. Instead, data science teams should be built with a full-stack approach in which data scientists are generalists. Being a generalist means being able to perform diverse functions, from conception to modelling to implementation to measurement. I won’t give a detailed summary of the article here, but you should read Eric’s article before continuing.
The purpose of this article is to provide a complementary view of Eric’s philosophy. In his article, he took a very top-down approach to describing why a data science team should be built with generalists. I believe the same conclusion can be reached bottom-up, from the perspective of a data science practitioner and what it really means to work in data science.
Let’s start this discussion off by defining data science. What exactly is data science, and what does a data scientist do? Looking to our friendly neighbourhood Mr. Interweb for some help, here are some definitions that I’ve found.
Tell Me Sir, What Is This Data Science You Refer To?
“Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.” – Wikipedia
“Data science is a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.” – Wikipedia
“Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract value from data. Data scientists combine a range of skills—including statistics, computer science, and business knowledge—to analyze data collected from the web, smartphones, customers, sensors, and other sources.” – Oracle
“Data science is the field of study that combines domain expertise, programming skills, and knowledge of math and statistics to extract meaningful insights from data. Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more to produce artificial intelligence (AI) systems that perform tasks which ordinarily require human intelligence. In turn, these systems generate insights that analysts and business users translate into tangible business value.” – Data Robot
These are some nicely crafted definitions (far better than I could have articulated myself), but a consistent pattern across them (and across numerous others on the internet) is that data science tends to be defined by performing a set of specific tasks or using a set of specific tools. If you do X, Y, Z and use A, B, C, then that’s considered data science. In other words, the causal chain runs from the tasks and tools to data science: do these things, and you are doing data science.
Data Science - The Most Unsexy Field In The 21st Century
I think the causality runs the other way: practicing data science will lead you to use math and statistics, apply machine learning algorithms, learn programming, and increase your domain expertise. Why? I think that in the process of packaging up and marketing data science as this sexy field, we have defined it incorrectly. I don’t think the definitions mentioned above are wrong, but I think a definition like this one is more authentic:
“Data science, in its most basic terms, can be defined as obtaining insights and information, really anything of value, out of data. Like any new field, it’s often tempting but counterproductive to try to put concrete bounds on its definition.” – Thinkful
To make the definition even less sexy, I believe data science should simply be: “Just using data to solve problems”. A data scientist should be “the data girl or the data guy”. Not so sexy anymore.
Though the two causal chains contain all the same elements, the subtle reversal of causality leads to a big difference in how data science is practiced in reality. It is the same idea as viewing a cup of water as half full versus half empty. Both views are correct, yet individuals who see the cup as half full behave and act quite differently from individuals who see it as half empty.
In a world viewed through the first causal chain, we see a lot of data science practitioners who are solely interested in specific tasks of data science. Have you met anybody who explicitly said they are only interested in building models and don’t care about analytics or working with stakeholders? Odds are you probably have. In a world viewed through the second causal chain, specific data science tasks are performed because they help solve the problem at hand.
How Does This Relate to Being A Generalist?
So, after this detour, how does any of this relate to being a generalist? I believe that if we look at data science the right way, the only way to really do it well is to be a generalist (once again, I’d love to hear your thoughts and opinions in the comments below). Why? If we always start with the problem as the main focus, we will realize that to solve it in the best way possible, we cannot simply do one part of data science and ignore the others. Echoing Eric in his article, a generalist role provides all the things that drive job satisfaction: autonomy, mastery and purpose. Mastery especially benefits: if we always start with the problem, we can determine whether our current toolkit can adequately answer the question or whether we need to go deeper and apply new data science techniques or technology to address it.