There are plenty of articles and discussions on the web about what data science is, what qualities define a data scientist, how to nurture them, and how you should position yourself to be a competitive applicant. There are far fewer resources out there about the steps to take in order to obtain the skills necessary to practice this elusive discipline. Here I will provide a collection of freely accessible materials and content to jump-start your understanding of the theory and tools of Data Science.
While the emerging field of data science is not tied to any specific tools, certain languages and frameworks have become the bread and butter for those working in the field. I recommend Python as the programming language of choice for aspiring data scientists because of its general-purpose applicability, its gentle learning curve, and, perhaps most compelling of all, the rich ecosystem of resources and libraries actively used by the scientific community.
When learning a new language in a new domain, it helps immensely to have an interactive environment to explore and to receive immediate feedback. IPython provides an interactive REPL which also allows you to integrate a wide variety of frameworks (including R) into your Python programs.
It was once said that a data scientist is someone who is better at software engineering than any statistician and better at statistics than any software engineer. Statistical inference underpins much of the theory behind data analysis, and a solid foundation in statistical methods and probability serves as a stepping stone into the world of data science.
- edX: Introduction to Statistics: A basic introductory statistics course.
- MIT: Statistical Thinking and Data Analysis: Introduction to probability, sampling, regression, common distributions, and inference.
While R is the de facto standard for performing statistical analysis, it has a fairly steep learning curve, and there are other areas of data science for which it is not well suited. To avoid learning a new language for a specific problem domain, I recommend trying to perform the exercises of these courses with Python and its numerous statistical libraries. You will find that much of the functionality of R can be replicated with NumPy, SciPy, matplotlib, and pandas.
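As a small illustration of that claim, here is a minimal sketch that computes a few everyday statistics with NumPy alone. The data (hours studied versus exam scores) is made up for the example, and the R expressions in the comments are only rough equivalents.

```python
import numpy as np

# Hypothetical sample data, invented for illustration only.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 57.0, 61.0, 68.0, 72.0])

mean_score = scores.mean()                # roughly R's mean(scores)
sd_score = scores.std(ddof=1)             # roughly R's sd(scores): sample std. dev.
r = np.corrcoef(hours, scores)[0, 1]      # roughly R's cor(hours, scores)

# Least-squares line, roughly R's coef(lm(scores ~ hours)).
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(hours, scores, 1)

print(f"mean={mean_score:.2f} sd={sd_score:.2f} r={r:.3f}")
print(f"scores ~ {intercept:.2f} + {slope:.2f} * hours")
```

SciPy and pandas build on the same arrays, so once this style feels natural, hypothesis tests and data frames are a short step away.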
Well-written books can be a great reference for (and supplement to) these courses, and also provide a more independent learning experience. These may be useful if you already have some knowledge of the subject or just need to fill in some gaps in your understanding:
- Think Stats: An introduction to Probability and Statistics for Python programmers.
- Introduction to Probability: Textbook for Berkeley’s Stats 134 class, an introductory treatment of probability with complementary exercises.
- Lecture notes for Introduction to Probability: Compiled lecture notes of the above textbook, complete with exercises.
- OpenIntro: Statistics: Introductory textbook with supplementary exercises and labs in an online portal.
- Think Bayes: A simple introduction to Bayesian Statistics with Python code examples.
A solid base of Computer Science and algorithms is essential for an aspiring data scientist. Luckily there is a wealth of great resources online, and machine learning is one of the more lucrative (and advanced) skills of a data scientist.
- Coursera: Machine Learning: Stanford’s famous machine learning course taught by Andrew Ng.
- MIT: Data Mining: An introduction to the techniques of data mining and how to apply ML algorithms to garner insights.
- CS188: Introduction to Artificial Intelligence: Berkeley’s popular introductory AI course that teaches you to build autonomous agents to efficiently make decisions in stochastic and adversarial settings.
- edX: Introduction to Computer Science and Programming: MIT’s introductory course to the theory and application of Computer Science.
- A first encounter with Machine Learning: An introduction to machine learning concepts focusing on the intuition and explanation behind why they work.
- A Programmer’s Guide to Data Mining: A web based book complete with code samples (in Python) and exercises.
- Elements of Statistical Learning: One of the most comprehensive treatments of data mining and ML, often used as a university textbook.
- An Introduction to Information Retrieval: Textbook from a Stanford course on NLP and information retrieval with sections on text classification, clustering, indexing, and web crawling.
Data ingestion and cleaning
One of the most under-appreciated aspects of data science is the cleaning and munging of data that often represents the most significant time sink during analysis. While there is never a silver bullet for such a problem, knowing the right tools, techniques, and approaches can help minimize time spent wrangling data.
- School of Data: A gentle introduction to cleaning data: A hands on approach to learning to clean data, with plenty of exercises and web resources.
- OpenRefine (formerly Google Refine): A powerful tool for working with messy data, cleaning, transforming, extending it with web services, and linking to databases. Think Excel on steroids.
- DataWrangler: Stanford research project that provides an interactive tool for data cleaning and transformation.
- sed: “The ultimate stream editor”, used to process files with regular expressions, most often for substitution.
- awk: “Another cornerstone of UNIX shell programming” — used for processing rows and columns of information.
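To give a feel for the kind of work these tools do, here is a minimal sketch in Python using only the standard library. It is not a replacement for any of the tools above; the messy input and the column names (`name`, `signup_date`, `amount`) are hypothetical, chosen to show a sed-style substitution and an awk-style column sum.

```python
import csv
import io
import re

# Hypothetical messy input: stray whitespace, mixed date separators,
# and inconsistent capitalization.
raw = """name, signup_date ,amount
 Alice ,2014/01/05, 10.50
Bob,2014-01-06,7
  carol,2014.01.07 ,  3.25
"""

def clean_rows(text):
    """Yield one cleaned dict per data row of the messy CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        row = dict(zip(header, (f.strip() for f in fields)))
        # sed-style substitution, similar in spirit to s#[/.]#-#g
        row["signup_date"] = re.sub(r"[/.]", "-", row["signup_date"])
        row["name"] = row["name"].title()
        row["amount"] = float(row["amount"])
        yield row

rows = list(clean_rows(raw))

# awk-style column processing: sum one field across all rows.
total = sum(row["amount"] for row in rows)

for row in rows:
    print(row)
print("total:", total)
```

For larger or messier datasets, the same steps map naturally onto pandas or OpenRefine; the point here is only that each cleaning rule is small and explicit.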
The most insightful data analysis is useless unless you can effectively communicate your results. The art of visualization has a long history, and while being one of the more qualitative aspects of data science, its methods and tools are well documented.
- UC Berkeley: Visualization: Graduate class on the techniques and algorithms for creating effective visualizations.
- Rice: Data Visualization: A treatment of data visualization and how to meaningfully present information from the perspective of Statistics.
- Tufte: The Visual Display of Quantitative Information: Not freely available, but perhaps the most influential text for the subject of data visualization. A classic that defined the field.
- School of Data: From Data to Diagrams: A gentle introduction to plotting and charting data, with exercises.
- D3.js: Data-Driven Documents — Declarative manipulation of DOM elements with data dependent functions (with Python port).
- Vega: A visualization grammar built on top of D3 for declarative visualizations in JSON. Released by the dream team at Trifacta, it provides a higher level abstraction than D3 for creating SVG based graphics.
- modest maps: A lightweight library with a simple interface for working with maps in the browser (with ports to multiple languages).
- Chart.js: Very simple (only six charts) HTML5 based plotting library with beautiful styling and animation.
Computing at Scale
When you start operating with data at the scale of the web (or greater), the fundamental approach and process of analysis must change. To combat the ever-increasing amount of data, Google developed the MapReduce paradigm. This programming model has become the de facto standard for large-scale batch processing since the 2007 release of Apache Hadoop, the open-source MapReduce framework.
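To make the paradigm concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It illustrates the programming model only and does not use Hadoop or its API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are made up for this illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input; in a real cluster each document (or file split)
# would be handled by a separate mapper.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(counts)
```

Because each map call looks at one document and each reduce call looks at one key, both phases can be spread across many machines, which is what a framework like Hadoop handles for you.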
- UC Berkeley: Analyzing Big Data with Twitter: A course — taught in close collaboration with Twitter — that focuses on the tools and algorithms for data analysis as applied to Twitter microblog data (with project based curriculum).
- CMU: Machine Learning with Large Datasets: A course on scaling machine learning algorithms on Hadoop to handle massive datasets.
- U of Chicago: Large Scale Learning: A treatment of handling large datasets through dimensionality reduction, classification, feature parametrization, and efficient data structures.
- UC Berkeley: Scalable Machine Learning: A broad introduction to the systems, algorithms, models, and optimizations necessary at scale.
- Mining Massive Datasets: Stanford course resources on large scale machine learning and MapReduce with accompanying book.
- Data-Intensive Text Processing with MapReduce: An introduction to algorithms for the indexing and processing of text that teaches you to “think in MapReduce.”
- Hadoop: The Definitive Guide: The most thorough treatment of the Hadoop framework, a great tutorial and reference alike.
- Programming Pig: An introduction to the Pig framework for programming data flows on Hadoop.
Putting it all together
Data Science is an inherently multidisciplinary field that requires a myriad of skills to be a proficient practitioner. The necessary curriculum has not fit into traditional course offerings, but as awareness of the need for individuals who have such abilities is growing, we are seeing universities and private companies creating custom classes.
- UC Berkeley: Introduction to Data Science: This course “combines three perspectives: inferential thinking, computational thinking, and real-world relevance.”
- How to Process, Analyze and Visualize Data: A lab-oriented course that teaches you the entire pipeline of data science, from acquiring datasets and analyzing them at scale to effectively visualizing the results.
- Columbia: Applied Data Science (with book): Teaches applied software development fundamentals using real data, targeted towards people with mathematical backgrounds.
- Coursera: Data Analysis: An applied statistics course that covers algorithms and techniques for analyzing data and interpreting the results to communicate your findings.
- Kaggle: Getting Started with Python for Data Science: A guided tour of setting up a development environment, an introduction to making your first competition submission, and validating your results.
- Data Beta: Professor Joe Hellerstein’s blog about education, computing, and data.
- Dataists: Hilary Mason and Vince Buffalo’s old blog that has a wealth of information and resources about the field and practice of data science.
- FiveThirtyEight: Nate Silver’s famous NYT blog where he discusses predictive modeling and political forecasts.
- grep alex: Alex Holmes’s blog about distributed computing and the intricacies of Hadoop.
- Data Science 101: One man’s personal journey to becoming a data scientist (with plenty of resources).
- no free hunch: Kaggle’s blog about the practice of data science and its competition highlights.
This just scratches the surface of the infinitely deep field of Data Science, and I encourage you to go out and try it yourself!