Hey guys,
One of the things I've been learning over the last year or so is introductory 'data science' techniques. Last night at the meeting I attempted to explain some of it, but did not do so to my own satisfaction. This is my attempt to re-explain it more clearly. Forgive me in advance if any of this is patronizing:
1. A lot of what we call data science is built on linear algebra: the math of vectors and matrices, which lets you work with (and predict values along) a line, or a plane, or some other object that lives in a space of 'n' dimensions.
2. The computer takes in 'observations' consisting of whatever factors you feed it. If I have a tulip, one observation of it might record several measurements: petal color, petal size, petal shape, whatever. The observation of one tulip is spread out over several 'features' or columns:
Petal color, size, shape, length, height of plant
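In code, those observations become the rows of a table and the features become its columns. Here's a minimal sketch in Python with NumPy; every number (and the color/shape encodings) is invented just for illustration:

```python
import numpy as np

# Each row is one tulip (an observation); each column is a feature.
# Columns: petal_color (0=red, 1=yellow), petal_size, petal_shape (encoded),
# petal_length, plant_height -- all values made up for the example.
tulips = np.array([
    [0, 3.2, 1, 5.1, 40.0],
    [1, 2.8, 0, 4.7, 35.5],
    [0, 3.5, 1, 5.4, 42.3],
])

print(tulips.shape)  # (3, 5): 3 observations, each with 5 features
```

Each of those 5 columns is one axis of the space the observations live in, which is where the 'dimensions' below come from.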
3. If I had 100 such observations of tulips, I could then plot each observation on a graph, with each feature representing a dimension.
Only one feature measured = points on a line
2 features = a 2d graph
3 features = an isometric 3d plot
4+ = monkey mind blown!
4. Even though such a categorization very often lives in a space of more than three dimensions, if you think in 2d or 3d, the principles pretty much carry over.
-Just like how you can fit a line that describes the trend of data that has a linear relationship, and then predict what further or future observations might hold (https://en.wikipedia.org/wiki/Linear_regression), you can do this among any 2 dimensions in your n-dimensional space.
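Here's a tiny sketch of that line-fitting idea. The x/y data is made up, but has a roughly linear relationship; NumPy's polyfit does the least-squares fit:

```python
import numpy as np

# Made-up observations with a roughly linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of a degree-1 polynomial: y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)

# Predict what a future observation at x = 6 might hold.
prediction = slope * 6 + intercept
print(round(prediction, 2))  # ~12.0, continuing the trend of the data
```

With more than two features you'd fit a plane (or hyperplane) instead of a line, but it's the same least-squares machinery.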
-Just like how you might notice that certain points on a 2d graph clump together and form clusters, you can find clusters in n-dimensional space by calculating the straight-line (Euclidean) distance from each observation point to its 'k' (ex. 8) nearest neighbouring observations (https://en.wikipedia.org/wiki/Nearest_neighbor_graph).
-And so on. This is one important way in which computers are able to recognize patterns in data and make predictions about new data of the same type: by using algorithms that borrow heavily from linear algebra.
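To make the nearest-neighbour step concrete, here's a minimal sketch: 100 made-up observations in a 5-dimensional feature space, and a small helper (my own, just for illustration) that finds the k closest points by Euclidean distance:

```python
import numpy as np

# 100 made-up observations, each with 5 features (random, for illustration).
rng = np.random.default_rng(0)
points = rng.random((100, 5))

def k_nearest(points, query, k):
    """Indices of the k observations closest to `query` (Euclidean distance)."""
    dists = np.linalg.norm(points - query, axis=1)  # distance to each row
    return np.argsort(dists)[:k]                    # indices of the k smallest

neighbours = k_nearest(points, points[0], k=8)
print(neighbours)  # point 0 is its own nearest neighbour (distance 0)
```

The same distance formula works whether the space has 2 dimensions or 500, which is exactly why the 2d intuition carries over.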
Travis
GPG Key: BFEB 7E65 04EB 184B A150 2E2C CC11 933F EE27 D86E