I am thrilled to announce that the second edition of Data Science from Scratch is now available! (buy from Amazon or your other favorite bookstore, or read on Safari, or get a PDF from ebooks.com it looks like.)
However, the first edition used Python 2.7. And as time ticks by, I've been feeling guiltier and guiltier about having a book out there with my name on it that tells people to use Python 2. Because in [current year], you should not be using Python 2. Stop using Python 2!
Eventually I realized that the only way to clear my conscience was to prepare a second edition that advocated for Python 3. Accordingly, the new edition is based on fresh, clean Python 3.6. (Except for a standalone section on dataclasses, which is based on Python 3.7, for obvious reasons.)
All that said, on some level it is just an improved, more-modern version of the first edition. If you are a Joel Grus completist (or if you haven't read the first edition) (or if you need a kick in the pants to upgrade to Python 3) (or if you want to learn about type annotations) then you probably want to read it. If you already read the first edition then maybe you'll be happy just poking at the new code on GitHub.
Anyway, I am extremely thrilled to share the new edition with you and (in particular) to no longer have a Python 2 book out there with my name on it. (I mean, the first edition is still out there, and I'm sure I'll still be fielding errata about it until the sun burns out, but at least now it's officially defunct.)
I am super-excited to announce that the book I've been working on for more than a year, Data Science from Scratch: First Principles with Python, is finally available! (buy from O'Reilly, use discount code AUTHD to save some money) (buy from Amazon).
Although I am myself a "math person", the first approach never resonated with me. The fun of data science for me has always been working with data. At the same time, I've never been thrilled with the second approach -- it's a good way to start doing data science without ever really understanding what you're doing.
Hence Data Science from Scratch. It's got math, but only as much as is totally necessary. It's got scraping and cleaning and munging. It's got machine learning. It's got databases and MapReduce. Necessarily it doesn't go deep into any of these, but I like to think it establishes a broad, solid foundation for someone who knows some math and some programming but is not (necessarily) an expert at either.
Many technical books (I won't name names) explain things in their text and then dump pages of hard-to-follow code at you that you are expected to puzzle through. I spent a lot of time trying to write clean code that illuminated the concepts on its own and that reinforced the ideas from the text. As is the current fashion these days, all of the code and data is on GitHub, if you'd like to get a sense of what the book is about.
Books are amazing, aren't they? An expert pours all the knowledge he's accumulated over the long years of his career into the pages of a book, all of which you can read and learn in a tiny fraction of the time. It blows my mind.
That said, I have to admit that I don't typically pick up STEM books (if I want to learn something technical I'll generally do a course), but after having read this one, I think I'll make more of an effort.
Being something of a Data Science hobbyist - I've gone through a couple courses and done a handful of projects - I thought I'd check out Data Science From Scratch by Joel Grus, one of the more popular beginner books on the subject. I'm a relative DS newbie with some pretty significant knowledge gaps, so I'm well within the book's target audience.
I've divided this review into 2 parts - part 1 is just a short overview, and if you want the gist then you can read that part. Part 2 is a chapter-by-chapter breakdown of the book, which I did because A) attempting to describe what I learn in my own words helps the concepts stick better, and B) because maybe someone wants a reaaally thorough idea of what the book is about.
Data Science, from A to Z, from scratch. There are 27 chapters, each dedicated to a specific area of Data Science - from Python and Data Visualisation to Neural Nets and Deep Learning, with Linear Algebra, Stats and a whole bunch of other stuff in between.
I spent about 25 hours going through the book, and I'd say that if you were really coming to Data Science from scratch and running/messing around with every piece of code (which I didn't always do), you could add a bunch more time to that estimate.
It's comprehensive in terms of the topics touched on. It's a resource that exposes you to the full breadth (note - not depth) of the Data Science field. For some reason, I find it mentally reassuring to be able to see the "limit" of a given topic, if just for confirmation that it's not actually an infinitely wide subject.
Even though it's dealing with a technical topic, I think it's a pretty approachable book, doesn't get too heavy (though there's some math notation and a bunch of code), and it's written in a light, sometimes humorous style.
Also, I'd have liked to have had more concrete explanations as to how exactly some of the subjects tie into Data Science. For example, the book notes that Linear Algebra underpins much of Data Science and Machine Learning, but the book doesn't really go into detail as to why, though you do get a sense of its function as the book progresses.
I liked Data Science From Scratch - reading this book will give you a good overview of what Data Science entails, and hopefully pique your curiosity enough to go and search out other, more in-depth resources for each of the subjects touched upon. You can pick it up on Amazon here.
The first chapter runs through a hypothetical day in the life of a Data Scientist, which is written as a kind of microcosm of the book's various subjects, in more or less the order they're approached in. It's motivating stuff to be able to have an idea of what you'll (hopefully) be able to do at the end of the book.
Type annotations - declaring what type a variable is going to be, which the book makes heavy use of throughout. It was a bit odd to me at first, given that I had no exposure to static typing before, but I got used to looking at it after a while and even started to appreciate the added clarity.
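To give a rough idea of what annotated code looks like, here is a minimal sketch. The function name total is made up for illustration and is not taken from the book.

```python
from typing import List

def total(xs: List[float]) -> float:
    """Sum a list of floats; the annotations document the expected types."""
    result: float = 0.0  # a variable annotation
    for x in xs:
        result += x
    return result

# Python does not enforce annotations at runtime; a separate checker
# such as mypy reads them and flags mismatches.
print(total([1.0, 2.5, 3.5]))  # 7.0
```

The annotations change nothing about how the code runs, they just tell the reader (and the type checker) what goes in and what comes out.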
Chapter 3 is a fairly brief chapter that goes into data visualisation w/ Matplotlib, just long enough to describe the basics and best practices for bar charts, line charts and scatter plots. The chapter notes that Matplotlib, while still used, is starting to show its age, and closes by listing some alternatives - though not Plotly, which I think also deserves a mention.
Anyway, here we learn what vectors are, what matrices are and some of their properties. As I noted in the short review, here I'd have liked a more thorough explanation on how exactly Linear Algebra is involved in the Data Science process.
Vectors add componentwise. This means that if two vectors v and w are the same length, their sum is just the vector whose first element is v[0] + w[0], whose second element is v[1] + w[1], and so on.
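In code, componentwise addition might look something like the sketch below. It treats a vector as a plain list of floats, and the Vector alias and add function are illustrative names.

```python
from typing import List

Vector = List[float]

def add(v: Vector, w: Vector) -> Vector:
    """Add two vectors element by element."""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

print(add([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
```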
The definition of a "Bernoulli trial" - that being a random experiment with only 2 possible outcomes whose probability doesn't change across different experiments. A coin flip, for example, is a Bernoulli trial.
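As a quick illustration, a Bernoulli trial can be simulated in a couple of lines. This is only a sketch: the function name bernoulli_trial, the fixed seed and the number of flips are my own choices.

```python
import random

def bernoulli_trial(p: float, rng: random.Random) -> int:
    """Return 1 with probability p and 0 with probability 1 - p."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)  # seeded so the run is repeatable
flips = [bernoulli_trial(0.5, rng) for _ in range(10_000)]

# The share of 1s should land close to p = 0.5.
print(sum(flips) / len(flips))
```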
Continuity Correction - the idea of widening a discrete figure into an interval when mapping it onto a continuous distribution. For example, say you're calculating the probability of somebody being 185cm in height in a normally distributed population of people with an average height of 180cm. Since a single exact value has zero probability on a continuous distribution, you extend the 185cm figure by .5cm on either side (184.5-185.5), as this covers the heights that round to 185cm better than 185-186 does. 185.75cm, for example, is closer to 186cm.
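The height example can be worked through numerically. The sketch below assumes a standard deviation of 10cm, which is a figure I made up for illustration (only the 180cm mean comes from the example), and builds the normal CDF from math.erf.

```python
import math

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Probability that a normal(mu, sigma) variable is at most x."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

mu = 180.0     # average height from the example
sigma = 10.0   # assumed standard deviation, for illustration only

# With the continuity correction, "185cm" means the interval 184.5 to 185.5.
p_corrected = normal_cdf(185.5, mu, sigma) - normal_cdf(184.5, mu, sigma)

# Without it, 185 to 186 covers heights that mostly round to 186.
p_uncorrected = normal_cdf(186.0, mu, sigma) - normal_cdf(185.0, mu, sigma)

print(p_corrected, p_uncorrected)
```

The two intervals have the same width, but the corrected one is centred on 185cm, so it is the one that matches "heights that round to 185cm".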
Chapter 8 is on Gradient Descent, a subject that gets pretty dense by the end. It covers partial derivatives, difference quotients, choosing the right step size, and minibatch and stochastic gradient descent.
Minibatch and Stochastic gradient descent are both descent techniques I hadn't heard of, the latter taking a gradient step on a single data point at a time, the former being a balance (both in terms of precision and computing cost) between Stochastic descent and the regular "batch" variety.
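To make the minibatch idea concrete, here is a rough sketch of a helper that splits a dataset into batches. The name minibatches and its parameters are illustrative, and the gradient step you would run on each batch is left out.

```python
import random
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def minibatches(dataset: List[T], batch_size: int,
                shuffle: bool = True) -> Iterator[List[T]]:
    """Yield consecutive slices of batch_size items, optionally in random order."""
    starts = list(range(0, len(dataset), batch_size))
    if shuffle:
        random.shuffle(starts)  # visit the batches in a random order
    for start in starts:
        yield dataset[start:start + batch_size]

for batch in minibatches(list(range(10)), batch_size=3, shuffle=False):
    print(batch)
```

With batch_size=1 each batch is a single data point, and with batch_size=len(dataset) you are back to one batch containing everything.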
One approach to maximizing a function is to pick a random starting point, compute the gradient, take a small step in the direction of the gradient (i.e., the direction that causes the function to increase the most), and repeat with the new starting point.
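The procedure described above can be sketched in a few lines. The function being maximized, f(v) = -(sum of squares of v), the step size and the number of steps are all my own choices for illustration; the function has its maximum at the origin, so the result is easy to check.

```python
import random
from typing import Callable, List

Vector = List[float]

def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Move step_size along the gradient, i.e. uphill."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

def neg_sum_of_squares_gradient(v: Vector) -> Vector:
    """Gradient of f(v) = -(v_1^2 + ... + v_n^2)."""
    return [-2 * v_i for v_i in v]

def maximize(grad_fn: Callable[[Vector], Vector], start: Vector,
             step_size: float = 0.01, n_steps: int = 1000) -> Vector:
    v = start
    for _ in range(n_steps):
        v = gradient_step(v, grad_fn(v), step_size)  # repeat from the new point
    return v

random.seed(0)
start = [random.uniform(-10, 10) for _ in range(3)]  # random starting point
best = maximize(neg_sum_of_squares_gradient, start)
print(best)  # every component ends up very close to 0
```

To minimize instead, you would step in the opposite direction by using a negative step size.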
Chapter 9 touches on gathering data, reading files using (vanilla) Python, working with delimited data, using web scraping (w/ Beautiful Soup + Requests), and accessing APIs using the example of Twitter. This chapter is a fair bit gentler, difficulty-wise, than the previous handful.
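For a flavour of the delimited-data part, here is a small sketch using Python's built-in csv module. The tab-delimited text and its column names are invented for this example, and io.StringIO stands in for a real file.

```python
import csv
import io

# Hypothetical tab-delimited data standing in for a file on disk.
raw = "name\theight_cm\nalice\t170\nbob\t185\n"

with io.StringIO(raw) as f:
    reader = csv.DictReader(f, delimiter="\t")  # first row is the header
    rows = [dict(row) for row in reader]

for row in rows:
    # csv gives you strings, so numbers need converting by hand.
    print(row["name"], float(row["height_cm"]))
```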
Chapter 10 goes through working with the data, starting from basic histograms of 1-dimensional data through to scatterplot matrices for many dimensions. It also touches briefly on data cleaning - dealing with bad or missing data, how to handle outliers, as well as data scaling, giving a good example of how scale can be changed, depending on the measurements used (in this case, inches vs centimeters).
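One common way to deal with the units problem is to rescale each measurement to have mean 0 and standard deviation 1. The sketch below uses made-up heights and an illustrative rescale helper to show that inches and centimeters then give the same numbers.

```python
import statistics
from typing import List

def rescale(xs: List[float]) -> List[float]:
    """Shift and scale xs to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(xs)
    stdev = statistics.stdev(xs)
    return [(x - mean) / stdev for x in xs]

heights_in = [63.0, 67.0, 70.0]              # hypothetical heights in inches
heights_cm = [h * 2.54 for h in heights_in]  # the same heights in centimeters

print(rescale(heights_in))
print(rescale(heights_cm))  # matches the line above, up to floating-point error
```

After rescaling, distances between data points no longer depend on which unit was used to record them.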
Principal Component Analysis (PCA) as a means of reducing the dimensionality (and thus complexity) of a dataset. This lets you, among other things, speed up ML training and potentially visualise data in 2D or 3D.
While I'd heard of the 2 metrics precision and recall (the accuracy of the positive predictions, and the percentage of actual positives that were identified, respectively), I hadn't heard of the F-score, that being a combination of the 2 metrics (their harmonic mean, in the case of the F1 score).
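For reference, the three metrics can be computed from counts of true positives (tp), false positives (fp) and false negatives (fn). The counts in this sketch are made up, and the F-score shown is the F1 variant.

```python
def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were found."""
    return tp / (tp + fn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts, for illustration only.
tp, fp, fn = 70, 30, 10
print(precision(tp, fp))     # 0.7
print(recall(tp, fn))        # 0.875
print(f1_score(tp, fp, fn))  # roughly 0.778
```

Because it is a harmonic mean, the F1 score always sits between precision and recall, and closer to the lower of the two.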
Chapter 12 goes through the K-Nearest Neighbors algorithm, a good starting point for getting to grips with ML models. It shows how this algorithm does fairly well with lower dimension datasets (such as the Iris dataset) but struggles with higher dimension datasets, where points tend to be far apart, so there may not be very many close neighbors at all.
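A bare-bones version of the algorithm fits in a few lines. This is a simplified sketch with invented data points and names: ties in the vote are simply broken in favour of whichever tied label appears nearest.

```python
import math
from collections import Counter
from typing import List, Tuple

Vector = List[float]

def distance(v: Vector, w: Vector) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def knn_classify(k: int, labeled_points: List[Tuple[Vector, str]],
                 new_point: Vector) -> str:
    """Label new_point by majority vote among its k nearest neighbors."""
    by_distance = sorted(labeled_points,
                         key=lambda pair: distance(pair[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Hypothetical 2D training data with two clusters.
points = [([0.0, 0.0], "a"), ([0.0, 1.0], "a"),
          ([5.0, 5.0], "b"), ([6.0, 5.0], "b"), ([5.0, 6.0], "b")]

print(knn_classify(3, points, [0.5, 0.5]))  # a
print(knn_classify(3, points, [5.0, 5.5]))  # b
```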