Many CU INFO majors ask me for resources for studying data science on their own. Here is a list of some of my favorites. I’ve broken up this list into resources focused on underlying math, machine learning concepts, and programming. I also include a section for students interested in natural language processing, and a short section on professional advice for those who are interested in working as data scientists.

Underlying math

Data science is based on math and statistics, so it makes sense to spend time continually deepening your understanding of these fundamental ideas. Also, while the field will change all the time, the underlying quantitative concepts will remain consistent.

When you are starting out, I would recommend making sure you are comfortable with the fundamentals of calculus, statistics and linear algebra. (More math will deepen your understanding, but I would start there.) Here are a few resources I have found helpful.

  • 3Blue1Brown is an AMAZING math YouTuber. I learn a ton from his videos, and am inspired by the quality of his teaching. I also leaned on videos from mathematicalmonk to build my quantitative background during my first year in grad school. There are tons of math YouTubers, and I suggest poking around to find a teaching style that works for you.

  • Wasserman’s All of Statistics is a great book that offers a terse and rigorous introduction to statistics. It will be a challenge to get started with this resource, but you will learn a lot if you put in the effort. (I think the title is a little bit of a joke. All of statistics cannot fit in one book, but it is a good start.)

  • Zico Kolter has a good quick start for linear algebra. I have not found a full linear algebra textbook I really like yet. (I’ve used books from Gilbert Strang and David Poole.)

  • I really like the book Computer Age Statistical Inference to give an overall modern perspective on the history of stats, but I would not start with this resource.

Machine learning concepts

Machine learning is either the same thing as statistics or somehow closely related to statistics, depending on your perspective. Here are a few ML-focused books to get started.

  • Packt’s Python Machine Learning (PML). This is the textbook for INFO 4604/5604. It offers a nice hands-on introduction to ML, with a focus on using and understanding Python ML libraries.

  • Kevin Murphy’s Machine Learning. This is a standard graduate textbook in ML. It is much harder than Python Machine Learning (PML) and you will need to put in a lot of effort to get comfortable with the contents. (I came to grad school with a limited math background, and it took me around 6 months to get to the point where I could read Murphy.) But the book offers a much deeper perspective on the field than you will get from PML. If you are serious about understanding the underlying details of machine learning, you will need to spend time working through resources with some mathematical depth. You don’t need to read every chapter in this book at once. Give yourself years to learn the material.

Programming and tools

There are tens if not hundreds of thousands of books, tutorials and online courses for learning how to code. Someone even made a game that teaches you to use the Vim text editor. With that said, here are a few scattered thoughts on how to navigate the landscape.

  • Basic stuff to learn. Tools and languages change all the time, and I am not sure it makes much sense to get too focused on any particular technology or framework. However, you will need to show some fluency with current tools to get started in your career. It is a good idea to learn how to use PyTorch or TensorFlow, as well as to be comfortable with SQL, NumPy, pandas and scikit-learn. I love the tidyverse for exploratory data analysis for small datasets (it’s a collection of R packages), and some people like Julia for writing faster code (Python can be slow). Altair is nice for data visualization in Python, but there are many competitors like matplotlib. It’s a good idea to get the hang of using GitHub, and to be comfortable using the command line. There is an MIT course that focuses on filling in these sorts of programming-adjacent skills, with free materials posted online.

  • Where to focus your effort. There are so many tools and languages to learn, it can be hard to know where to focus. I can think of at least two strategies:

    1. One good option is to just follow your own curiosity. Learn what you want to learn to answer the kinds of questions you want to answer and take on the kinds of projects you’d like to complete!

    2. Alternatively, I think it also makes sense to be a bit analytic about what you study first. The programmer competency matrix helped me a lot when I started working as a software developer without a CS degree. What I like about it is that you can see where you land on this matrix, and systematically fill gaps in your background. Of course, if you are interested in a particular role or job, definitely speak with people in that area to understand what you need to know.

NLP resources

If you are interested in natural language processing specifically, I recommend Jacob Eisenstein’s textbook. If you don’t want to buy a copy, you can download the original lecture notes from GitHub. You should also probably learn to use the spaCy and Hugging Face Transformers libraries.

Professional advice

I am not a great resource for advice on how to get started with a career in data science. For that, you are much better off speaking with a hiring manager, a recruiter, a career advisor at CU or an actual data scientist working in industry. However, a friend who works in the field suggests the book Build a Career in Data Science. I also think that it makes sense to take an empirical approach to finding a job; if you want to get a specific job in the future, interview people who currently have that job, analyze their backgrounds, and then try to make your resume look more like their resume. I think people often do this using LinkedIn.
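If you want to make that empirical approach concrete, here is a minimal sketch of one way to do it. Everything in it is an assumption for illustration: the profiles, the skill names, and the skill_gaps helper are all made up, and you would collect the real data by hand from the backgrounds of people you talk to. The idea is just to count how often each skill shows up and list the ones you are missing, most common first.

```python
from collections import Counter

# Hypothetical, hand-collected data: skills listed by people who hold the job you want.
profiles = [
    {"python", "sql", "pandas", "pytorch"},
    {"python", "sql", "scikit-learn"},
    {"python", "r", "sql", "pandas"},
]

# Hypothetical skills on your own resume.
my_skills = {"python", "r"}


def skill_gaps(profiles, my_skills):
    """Return (skill, count) pairs for skills you lack, most common first."""
    counts = Counter(skill for profile in profiles for skill in profile)
    gaps = [(skill, n) for skill, n in counts.items() if skill not in my_skills]
    # Sort by descending frequency, then alphabetically so ties have a stable order.
    return sorted(gaps, key=lambda pair: (-pair[1], pair[0]))


gaps = skill_gaps(profiles, my_skills)
for skill, n in gaps:
    print(f"{skill}: listed by {n} of {len(profiles)} people")
```

A count like this is only a rough guide. Three profiles is a tiny sample, and a conversation with someone in the role will tell you more than a tally of keywords.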