What exactly is data science, where did it come from, and why is it more important now than ever?
So often, when I come across a nebulous interdisciplinary term or concept, I like to go to Wikipedia and read the first intro paragraph. In a sense, that paragraph represents the general consensus of all the people who have edited that Wikipedia entry. So through natural crowdsourcing, it gives you maybe not the objective truth of what the term means, but the agreed-upon truth.
What is Data Science
And if we go to the Wikipedia entry on data science, the first sentence basically says data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data and apply knowledge and actionable insights across a broad range of domains. That one sentence says a lot but also says nothing. It’s kind of too vague.
But the key thing to take away from this is that data science is very interdisciplinary, and because of that, it’s fairly broad. You’re going to use techniques from computer science, statistics, math, business, and kind of everything in between. The second bit that I want to pull from this sentence is "knowledge and insights from data." This is an overlap between data science and something like statistics, but I would say it’s what distinguishes data science from something like pure math. Math deals with the same kind of analytical thinking that you might use in data science, but pure math is one step removed and purely abstract in a sense.
Now, data science and statistics both ask: how do we get some insight or knowledge from some data? So that’s where the overlap is. There are always people who would argue that applied statistics is the same as data science, and you could argue it either way. But for the sake of these videos, when I talk about statistics, I’m going to talk about it more in an academic sense. And when I talk about data science, it’s really going to be about applying the insights you’ve garnered from your data to some broad range of applications. So there needs to be some end application, and usually it needs to provide some value to someone. So with that definition out of the way, the second thing I like to do when I come to a new term, field, concept, or domain is get a little bit of historical context on it.
A Brief Historical Diversion
And maybe try to figure out why things might be the way they are. So this is my brief historical diversion, my opinionated history of data science. And again, there’s a long history. There are a lot of other things that I’m not going to mention, but for the sake of brevity, I’ve put down what I think are the key landmark moments.
Broad Street Cholera Outbreak
So the first event I have on here is the Broad Street cholera outbreak, which happened in 1854 in London. Now, I’m sure you could argue that scientists have been doing data science for as long as data has existed, but this is kind of the modern history of data science, if you will. In the Broad Street cholera outbreak, a physician and epidemiologist, John Snow, was trying to figure out where the outbreak came from. The dominant theory at the time was miasma theory, the idea that disease spread through bad air. John Snow, being the good data detective that he was, didn’t actually know where the outbreak came from. So he did what any good data scientist would do: he started plotting where the cholera cases occurred geographically around Broad Street, and he noticed that they were centered around one water pump. The data was there all along, but it wasn’t until he actually plotted it on a map that he had a clear sense that cholera was spread through the water.
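To make the map idea a little more concrete, here is a minimal Python sketch of the same kind of reasoning: assign each case to its nearest pump and count. The coordinates, the pump names other than Broad Street, and the helper `nearest_pump` are all made up for illustration. This is not Snow’s actual data or method, just a toy version of "where do the cases cluster?"

```python
import math
from collections import Counter

# Hypothetical coordinates in arbitrary grid units, NOT Snow's real data.
pumps = {
    "Broad Street": (0.0, 0.0),
    "Pump B": (5.0, 1.0),    # hypothetical neighbouring pump
    "Pump C": (-4.0, 4.0),   # hypothetical neighbouring pump
}
cases = [(0.2, 0.1), (-0.5, 0.4), (0.9, -0.3), (0.1, 1.1), (4.6, 1.2), (-0.8, -0.6)]


def nearest_pump(point, pumps):
    """Return the name of the pump closest to `point` (straight-line distance)."""
    return min(pumps, key=lambda name: math.dist(point, pumps[name]))


# Count how many cases fall closest to each pump.
counts = Counter(nearest_pump(case, pumps) for case in cases)

for name, n in counts.most_common():
    print(f"{name}: {n} cases")
```

With these toy numbers, most cases land nearest the Broad Street pump, which is the pattern the map made visible.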
The Future of Data Analysis
The second event is less an event or discovery and more a person. So if you’re not familiar, this is a picture of John Tukey. In 1962, he published an article called The Future of Data Analysis. And John Tukey, if you’re not familiar with him, is kind of the godfather of modern statistics. He co-developed the fast Fourier transform algorithm with James Cooley. He defined what exploratory data analysis is and should be, and he invented things like the box plot visualization. In The Future of Data Analysis, he argues that the bright future for statistics is in its applications: using data and empirical studies to garner insights.
Concise Survey of Computer Methods
The third event in this timeline is a very specific thing. It’s when the term data science was coined. In Peter Naur’s seminal work, Concise Survey of Computer Methods, published in 1974, he coined the term data science. This was a book all about the computational techniques you could use now that computers were becoming more accessible and widely used. And in it, he basically used the term data science in the meaning that we use it today.
First Data Science Conference
The fourth event here actually is an event. This was the first conference around data science, or the first official conference that had data science in its name and as its main theme. It was held in 1996 in Kobe, Japan. The reason I want to put it on here, and why I think it’s important, is that it’s the first public discussion or gathering around data science.
Data Science Teams
The last bit on the timeline is 2008, when Jeff Hammerbacher and DJ Patil were independently forming data science teams at their respective companies, Jeff Hammerbacher at Facebook and DJ Patil at LinkedIn. The thing that sets this apart from just people doing data science is that it was the formal institutionalization of the data scientist as a role.
The Emergence of a Profession
If I had to summarize each of these events: the Broad Street cholera outbreak, I would say, is the invention, or maybe the discovery, of data science in our modern use. The second bit is the application. Data science is a thing, so how do we apply it, and how do we develop it more? That’s really where John Tukey steps into this timeline. The third bit is the definition. This is probably less important than some of the other bits on the timeline, but I think it is important nonetheless. Now we actually have a term, data science, and it’s somewhat rigorously defined in this book. And once there is a definition, we can argue about it. Is this what we think data science is? Is this the right thing to be developing further? What future should we push data science into? That’s where the conference, the public discourse, has its importance. And finally, the data science teams really solidified the field as a profession. It was now something that had the weight of capitalism behind it, pushing it forward. So in this timeline, from invention to application to definition to discussion, and finally to profession, we arrive at where data science stood 13 years ago.
Categorizing Data Science Applications
So data science now is fairly ubiquitous. It’s probably harder to find some application, some company, some thing that doesn’t use data science than it is to find something that does. So I’m just going to give my categorization of all of these things, in what I think is a useful ontology. The first application of data science, broadly speaking, is what would typically be called business intelligence.
And this is analytics. The question here is: how do we get metrics about some process that generates data, in order to get some business value? If you want to categorize the type of data science, it’s inferential statistics. You’re looking at historical data, and you’re trying to make some inference about a future action to take. Examples of this are customer lifetime value, split testing (or A/B testing), and automated churn prediction for the users of a company.
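To make split testing a bit more concrete, here is a minimal sketch of one common way to analyze an A/B test: a two-proportion z-test, written with only the Python standard library. The function name `two_proportion_z_test` and the conversion counts are made up for illustration and do not come from any real experiment.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rates of variants A and B.

    Returns (z, p_value), where p_value is two-sided and relies on the
    normal approximation, so it assumes reasonably large samples.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "A and B convert equally".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 1,000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference between the two variants is unlikely to be chance alone. In practice you would also decide on the sample size and the significance threshold before running the test.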
So this here is a screenshot of an old notification that Facebook sent to the Zipfian Academy page. After Zipfian Academy was acquired, a lot of its social internet presence was incorporated into Galvanize, so we weren’t using the Zipfian Academy Facebook page. Facebook’s algorithms picked up on that and said, “Hey, you haven’t come to your page in a while. You haven’t made a post in a while. You haven’t done some action in a while.” So it sends us an email or a notification. It gives us a nudge to try to get us engaged again.
The second category of data science application is this idea of data products. These are basically products that use data and machine learning to enhance some existing application. An example is the recommendation algorithm on YouTube. YouTube could exist, and I’m sure in the early days did exist, without the recommendation algorithm. Bare bones, YouTube is a video hosting platform. It doesn’t need a recommendation algorithm. It’s just that the recommendation algorithm provides a much better user experience. Often data products are predictive in a sense. So while YouTube does do a lot of this first application, analytics, something like the recommender system is more of a predictive system. It’s trying to say, “What can I show you that I think you would like? Let me try to predict a video that you will click on.” Other examples are LinkedIn’s People You May Know, Pandora as a company, giving music recommendations, and even things like Google Photos, which can intelligently pick up faces and objects in photos to categorize them.
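As a rough illustration of the "predict a video that you will click on" idea, here is a toy item-based recommender in Python. It scores each unseen video by how similar its audience is to the audiences of videos the user has already watched. This is only a sketch and not how YouTube’s system works. The watch history, the user names, and the functions `cosine_similarity` and `recommend` are all hypothetical.

```python
import math

# Hypothetical watch history: user -> set of video ids they have watched.
watch_history = {
    "ana": {"cats", "python_intro", "pandas_tips"},
    "ben": {"python_intro", "pandas_tips", "numpy_basics"},
    "cho": {"cats", "dogs"},
    "dee": {"python_intro", "numpy_basics"},
}


def cosine_similarity(viewers_a, viewers_b):
    """Cosine similarity between two videos, each given as its set of viewers."""
    if not viewers_a or not viewers_b:
        return 0.0
    return len(viewers_a & viewers_b) / math.sqrt(len(viewers_a) * len(viewers_b))


def recommend(user, history, top_n=2):
    """Rank videos the user has not seen by similarity to the ones they have."""
    # Invert the history: video -> set of users who watched it.
    viewers = {}
    for name, videos in history.items():
        for video in videos:
            viewers.setdefault(video, set()).add(name)

    seen = history[user]
    scores = {
        candidate: sum(cosine_similarity(viewers[candidate], viewers[s]) for s in seen)
        for candidate in viewers
        if candidate not in seen
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda v: (-scores[v], v))[:top_n]


print(recommend("ana", watch_history))
```

Real recommenders use far more signal than co-viewing (watch time, recency, content features), but the shape is the same: turn historical behavior into a score for things the user has not seen yet.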
And the last application I’ll put here is not necessarily tied to a company or a business: it’s this idea of data tools. These tools are usually made to make some data more accessible. That could mean more accessible in the sense of "let me see what the data is, download it, and go through it myself." But it can also mean helping someone get some insight from, or explore, a large data set. What I like to think of as the modern canonical example of this is the COVID Tracking Project. If you’re not familiar with the COVID Tracking Project, it was an initially crowdsourced way to get statistics and data on COVID cases, hospitalizations, and deaths in the early stages of the pandemic, before there were a lot of official sources of this data. And it actually ended up being the source of truth. Even after government sources started providing their own data, the COVID Tracking Project was, in a lot of cases and in a lot of places, a better source of truth than the official government data. And in addition to providing the raw data and access through an API, they provided visualizations. So non-experts, non-technical folks, and non-data scientists could get a sense of what the data actually has in it without needing to go through it, program, or do a complicated analysis themselves.
So hopefully this gives you a sense of both what data science is, or what I mean when I say data science, and some of the things that we’re going to touch on in this video series. Also, I want this to be a little bit more dynamic and interactive. So if you have some topic that you’ve heard about or want to learn more about, feel free to either post a comment or chime in in the chat room. In my opinion, the absolute best way to learn is to just get our hands dirty and start building some of these things, which is why in the next video we’re going to start programming some Python.