Data Science The Stanford Way

Data science is a relatively new term for a relatively old discipline. Essentially, it is data analysis, particularly for large data sets. It draws on fields as wide-ranging as statistics, computer science, and information theory. Want to know more? Stanford has a “Data Science Handbook” that you can read online.

Topics range from how to design a study and create an analytic plan to how to do data visualization, summarization, and analysis. The document covers quite a bit but is very concise.

Data science tends to use Python, although we aren’t sure why that is. However, you might look into the Python Data Science Handbook and Think Stats to apply what you’ve learned about data science to Python. Be sure, too, to check out Stanford Online’s playlist for Statistics and Data Science for many interesting seminars, including “How to be a Statistical Detective.”

Generating a lot of data is something sensors are good at, so it makes sense that data science and statistics techniques might apply. Data science is supposed to be new and shiny, but in reality, it has been going on for a very long time. Ask World War II statistician Abraham Wald.

Title graphic: by [Schutz] CC-SA-3.0.

27 thoughts on “Data Science The Stanford Way”

      1. Thousands per seat in many cases. Now PSPP and R exist in addition to Python, and all three have been displacing expensive packages for 20 years now for people who bother looking… but most don’t. The reason for expensive packages has more to do with entrenched business use and proprietary plugins than anything else, and most knowledgeable department heads will encourage adoption of open tools.

    1. Quite so. Not having strongly typed variables alone makes it far more approachable for not-quite-programmers than e.g. C or C++. Strongly typed variables prevent a lot of mistakes and bugs and are definitely very useful to have in some multi-user, service-type software, especially Internet-facing stuff, but then you really need to wrap your head around a very specific way of thinking about logic.

        1. That’d be the closest to type inference you get in modern, Statically Typed, languages.

          Dynamic typing refers to how
          a = 1
          print(a)
          a = "2"
          print(a)

          Is perfectly valid code*. It has less upfront friction, but all it takes is a tiny error combined with weak fundamentals to snowball out of control.

          *nonexhaustive

          1. A tiny modification here will get you most of the way where you want to go:

            a: int = 1
            print(a)
            a = "2"
            print(a)

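            Your IDE, or a type checker such as mypy, will inform you that you’re assigning a mismatched type to “a”.
            Type hints are part of the language itself, with supporting tools in the standard library’s typing module. The interpreter does not enforce them at runtime.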
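
To illustrate, here is a minimal sketch showing that annotations are metadata for tools rather than runtime checks. The function name `double` is a hypothetical example, not something from the thread.

```python
from typing import get_type_hints


def double(x: int) -> int:
    # The annotation documents intent; the interpreter does not enforce it.
    return x * 2


print(double(2))               # an int, as the hint promises
print(double("2"))             # still runs: str * 2 repeats the string
print(get_type_hints(double))  # the hints are stored as plain metadata
```

A static checker run over this file would be expected to flag the `double("2")` call, even though the program itself runs without error.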

    2. It is but mostly because it’s really simple vs. other faster, albeit structured, languages. Less thinking about how to properly structure the application and more just getting to work. Plus, all the real code is the bare metal stuff with a python wrapper so win-win.

    3. That line in the article is the lamest clickbait. Nobody who has done any data science work is in the dark about why Python is prevalent: it’s because it’s what everybody else is using and it’s what is well supported by a huge ecosystem. There are plenty of reasons another language might be a better choice (as well as plenty of reasons that python is a good choice), but there really is no mystery why people are using it.

      1. Yup. Among other things, numpy + matplotlib + scipy is making MAJOR inroads to replacing Matlab in a lot of scenarios where someone previously might have used Matlab.

        Back in the grad school days, and even for a while after grad school, I did a lot of work using GNU Octave as a replacement for Matlab, but at this point, I don’t think a single thing that I ever used Matlab/Octave for in the past is something I wouldn’t use numpy/matplotlib/scipy for now.

        1. “numpy + matplotlib + scipy is making MAJOR inroads to replacing Matlab in a lot of scenarios”

          But there’s a secondary reason for that: because a lot of numpy/scipy’s functions are direct equivalents to MATLAB, straight down to the names.

  1. Dear Al, I’ve communicated with you previously and like you a lot. I am also currently finishing up the HarvardX Data Science Professional Certificate. We don’t use Python; we play with R. Unfortunately, neither is multi-processor/CUDA enabled, so both are slow. Where I have to disagree with you a little bit, though: among other subjects, I studied finance circa 2000-2004. In the textbooks we learned about ‘Monte Carlo’ simulations (which, I will agree with you, are an ‘ancient’ concept attributed to Von Neumann). But the ‘so cool’ thing is that processors have actually gotten fast enough that we can play with them. Further, while stats have long been around (and I promise you, you don’t want to research where ‘regression’ comes from), honestly I do feel this is a bit different. The difficult thing I struggle with is that no one person could possibly hold all this data in their head. That makes it an interesting challenge.
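
As a rough illustration of the Monte Carlo idea mentioned above, this sketch estimates pi by random sampling using only the standard library. The function name `estimate_pi`, the sample count, and the seed are assumptions chosen for the example.

```python
import random


def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / samples


print(estimate_pi(100_000))
```

More samples tighten the estimate, which is why faster processors make this kind of simulation practical.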

    1. Python itself is, indeed, effectively single-threaded for CPU-bound code at the moment, but the developers are actively exploring ways of getting rid of the GIL (the global interpreter lock), which would allow it to take proper advantage of modern CPUs. Also, most data science and maths libraries in Python either have built-in support for multiple CPUs or cores, CUDA/OpenCL, and so on, or have companion libraries that add it. For example, if you are used to Numpy, there’s a JIT compiler called Numba ( https://numba.pydata.org/ ) that can run Numpy-style code across multiple cores.

      I am not telling you to use Python, of course, but if you or someone you know ever does so, it may behoove them to spend a couple of minutes finding out which libraries support multiple cores out-of-the-box and which ones have wrapper libraries available for that.

      1. Even with the GIL, there are workarounds to it for some applications/use cases. GPU acceleration is becoming more and more available also, thanks to stuff like Pytorch and JAX.

    2. Well, I think lots of fields have changed to deal with large data sets. As programmers, for example, we once had no ability to deal with video, but now we do, and it’s still programming. I’m not saying there’s nothing new about data science, but I am saying it’s not a radical departure from what people have been doing with numbers for centuries. It’s just an extension of it using modern tools with modern-sized data sets. Which is fine, of course.

  2. Generally, the best packages are in Python (slow) and R (syntactically awful). It’s often hard enough to rewrite the algorithms that you’re better off just sucking it up.

    It’s no wonder Go and Rust are gaining popularity, though in general it’s for very bespoke applications.

  3. Slightly off topic but related: I have a simple Python program that manages some alphanumeric lists. I’m learning which data structures are faster than others for this, but I do seem to have one routine I’ve written that takes the majority of the processing time. I think I remember reading a few years back about a technique to compile just a single, slow routine in your Python program to speed things up. But I can’t remember now what that was. Any suggestions from my Hackaday cohorts?? Thx..

  4. This was dinner discussion yesterday – how convenient! Everyone in our family works with huge piles of data, and for one of us, Python is the way. For another, Spotfire rules (look it up – it’s great, and I’m not that person). I had to generate visualizations with Excel for millions of data points, and was forced to learn about pivot tables, version-to-version incompatibilities in Excel, and its absolutely horrific formatting limitations.
    I have nightmares about Tableau and Qlik.

    1. But you trade it for a horrid global namespace and some back asswards predilection for stupid and verbose symbology like <- (although it supports =, which I use even though it somehow confuses others).

      You can get quite a bit of the power of various python tools along with the user-friendly interface of commercial products like SPSS in Orange (Data Mining); see handle link. Also, while certainly less elegant, NIST's dataplot can be handy for some more esoteric analyses.

  5. I would wager it is Python’s list comprehensions. Other languages were too concerned with memory use and performance. They make you fiddle around with inserts and deletions.

    Want to remove an item from a list? Create a new list that excludes that one item! Single line of code in Python.
    Years ago I found that crazy, today I see the wisdom of it.
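
A minimal sketch of that pattern, with a made-up list for illustration. Note that the comprehension drops every matching element, not just the first, and leaves the original list untouched.

```python
items = ["resistor", "capacitor", "diode", "capacitor"]

# Build a new list that excludes the unwanted value.
without_caps = [item for item in items if item != "capacitor"]

print(without_caps)  # ['resistor', 'diode']
print(items)         # the original list is unchanged
```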
