Data Science The Stanford Way

November 12, 2023

Data science is a relatively new term for a relatively old discipline. Essentially, it is data analysis, particularly for large data sets. It involves techniques as wide-ranging as statistics, computer science, and information theory. Want to know more? Stanford has a “Data Science Handbook” that you can read online.

Topics range from how to design a study and create an analytic plan to how to do data visualization, summarization, and analysis. The document covers quite a bit but is very concise.

Data science tends to use Python, although we aren’t sure why that is. However, you might look into the Python Data Science Handbook and Think Stats to apply what you’ve learned about data science to Python. Be sure, too, to check out Stanford Online’s playlist for Statistics and Data Science for many interesting seminars, including “How to be a Statistical Detective.”

Generating a lot of data is something sensors are good at, so it makes sense that data science and statistics techniques might apply. Data science is supposed to be new and shiny, but in reality, it has been going on for a very long time. Ask World War II statistician Abraham Wald.

Title graphic: by [Schutz] CC-SA-3.0.

27 thoughts on “Data Science The Stanford Way”

Ostracus says:

November 12, 2023 at 7:08 pm

I’m guessing Python is easy to work with.

https://www.humblebundle.com/books/data-science-no-starch-press-books

1. Pat says:
  
  November 12, 2023 at 7:15 pm
  
  I’m also guessing it also doesn’t cost hundreds of dollars like some other analysis packages.
  
  1. S O says:
    
    November 13, 2023 at 3:59 pm
    
    Thousands per seat in many cases. Now PSPP and R exist in addition to Python, and all three have been displacing expensive packages for 20 years now for people who bother looking… but most don’t. The reason for expensive packages has more to do with entrenched business use and proprietary plugins than anything, and most knowledgeable department heads will encourage adoption of open tools.
    
2. WereCatf says:
  
  November 13, 2023 at 12:57 am
  
  Quite so. Not having strongly typed variables alone makes it far more approachable for not-quite-programmers than e.g. C or C++. Strongly typed variables prevent a lot of mistakes and bugs and are definitely very useful to have in some multi-user, service-type software (especially Internet-facing stuff), but then you really need to wrap your head around a very specific way of thinking about logic.
  
  1. Leithoa says:
    
    November 13, 2023 at 10:40 am
    
    Dynamic typing != weak typing.
    
    If you try to run
    >>> a = 1
    >>> b = "1"
    >>> a + b
    you’re gonna get an error.
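A minimal sketch of the same point using only built-ins: Python raises a TypeError rather than silently coercing, so the int/str mix has to be resolved explicitly. The helper name `try_add` is made up for illustration.

```python
def try_add(a, b):
    """Return a + b, or the name of the exception Python raises instead."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


a = 1
b = "1"

mixed = try_add(a, b)   # int + str is rejected, not silently coerced
as_int = a + int(b)     # explicit conversion: numeric addition
as_str = str(a) + b     # explicit conversion: string concatenation
print(mixed, as_int, as_str)
```

The variable names stay untyped (dynamic typing), but the values keep their types and refuse to mix (strong typing).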
    
    1. Maple says:
      
      November 13, 2023 at 9:36 pm
      
      That’d be the closest to type inference you get in modern, Statically Typed, languages.
      
      Dynamic typing refers to how
      a = 1
      print(a)
      a = "2"
      print(a)
      
      Is perfectly valid code*. It has less upfront friction, but all it takes is a tiny error combined with weak fundamentals to snowball out of control.
      
      *nonexhaustive
      
      1. Dario says:
        
        November 15, 2023 at 1:50 am
        
        A tiny modification here will get you most of the way where you want to go:
        
        a: int = 1
        print(a)
        a = "2"
        print(a)
        
        Your IDE or a static type checker such as mypy will inform you that you’re assigning a mismatched type to “a”.
        Type hints are part of the language itself, and the typing module ships with the standard library.
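One caveat worth sketching: the hints are metadata for tools, and the interpreter does not enforce them when the code runs. The function name `describe` below is hypothetical; `typing.get_type_hints` is the standard-library call for reading the annotations back.

```python
import typing


def describe(value: int) -> str:
    """Annotated helper; the annotations are metadata, not runtime checks."""
    return f"{value!r} is a {type(value).__name__}"


hints = typing.get_type_hints(describe)  # the annotations as a dict
result = describe("2")                   # runs despite the int hint
print(hints)
print(result)
```

A static checker would be expected to flag the `describe("2")` call, but the interpreter runs it without complaint.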
        
3. Yeshua Watson says:
  
  November 13, 2023 at 4:55 am
  
  It is but mostly because it’s really simple vs. other faster, albeit structured, languages. Less thinking about how to properly structure the application and more just getting to work. Plus, all the real code is the bare metal stuff with a python wrapper so win-win.
  
4. Wade says:
  
  November 13, 2023 at 8:12 am
  
  That line in the article is the lamest clickbait. Nobody who has done any data science work is in the dark about why Python is prevalent: it’s because it’s what everybody else is using and it’s what is well supported by a huge ecosystem. There are plenty of reasons another language might be a better choice (as well as plenty of reasons that python is a good choice), but there really is no mystery why people are using it.
  
  1. Andrew Dodd says:
    
    November 13, 2023 at 8:41 am
    
    Yup. Among other things, numpy + matplotlib + scipy is making MAJOR inroads to replacing Matlab in a lot of scenarios where someone previously might have used Matlab.
    
    Back in the grad school days, and even for a while after grad school, I did a lot of work using GNU Octave as a replacement for Matlab, but at this point, I don’t think a single thing that I ever used Matlab/Octave for in the past is something I wouldn’t use numpy/matplotlib/scipy for now.
    
    1. Pat says:
      
      November 13, 2023 at 9:43 am
      
      “numpy + matplotlib + scipy is making MAJOR inroads to replacing Matlab in a lot of scenarios”
      
      But there’s a secondary reason for that: because a lot of numpy/scipy’s functions are direct equivalents to MATLAB, straight down to the names.
      
Neverm|nd says:

November 12, 2023 at 8:53 pm

Dear Al, I’ve communicated with you previously and like you a lot. I am also currently finishing up the HarvardX Data Science Professional Certificate. We don’t use Python; we play with R. Unfortunately, neither is multi-processor/CUDA enabled, so both are slow. Where I have to disagree with you a little bit, though: among other subjects, I studied finance circa 2000-2004. In the textbooks we learned about ‘Monte Carlo’ simulations (which, I will agree with you, is an ‘ancient’ concept attributed to von Neumann). But the ‘so cool’ thing is that processors have actually gotten up to speed, to where we can play with it. Further, while stats have long been around (and I promise you, you don’t want to research where ‘regression’ comes from), honestly I do feel this is a bit different. The difficult question I struggle with is that no one person could possibly hold all this data in their head. Thus it makes for an interesting challenge.
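For readers who have not met the technique, here is a minimal Monte Carlo sketch in plain Python (standard library only). It estimates pi by sampling random points in the unit square, a textbook illustration rather than anything from the finance texts mentioned above. The function name `estimate_pi`, the sample count, and the seed are arbitrary choices for the example.

```python
import math
import random


def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / samples


estimate = estimate_pi(100_000)
print(estimate, abs(estimate - math.pi))
```

The error shrinks only with the square root of the sample count, which is why this kind of simulation became practical as processors got faster.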

1. WereCatf says:
  
  November 13, 2023 at 1:05 am
  
  Python itself does, indeed, currently run on a single core at a time, but they are actively exploring ways of getting rid of the GIL (the global interpreter lock), which would allow for taking advantage of modern CPUs properly. Also, most data science and maths libraries in Python have built-in support for taking advantage of multiple CPUs and/or cores and/or CUDA/OpenCL/etc. For example, if one is used to using Numpy, there’s a JIT compiler called Numba ( https://numba.pydata.org/ ) that lets you run Numpy code on multiple cores.
  
  I am not telling you to use Python, of course, but if you or someone you know ever does so, it may behoove them to spend a couple of minutes finding which libraries support multiple cores out of the box and which ones have wrapper libraries available for that.
  
  1. Andrew Dodd says:
    
    November 13, 2023 at 8:42 am
    
    Even with the GIL, there are workarounds to it for some applications/use cases. GPU acceleration is becoming more and more available also, thanks to stuff like Pytorch and JAX.
    
2. Al Williams says:
  
  November 13, 2023 at 5:47 am
  
  Well, I think lots of fields have altered to deal with large data sets. As programmers, for example, we never had the ability to deal with video, but now we do, and it’s still programming. I’m not saying there’s nothing new about data science, but I am saying it’s not a radical departure from what people have been doing with numbers for centuries. It’s just an extension of it using modern tools with modern-sized data sets. Which is fine, of course.
  
Grant says:

November 13, 2023 at 5:48 am

Generally, the best packages are in Python (slow) and R (syntactically awful). It’s often just hard enough to rewrite algorithms, you’re better off just sucking it up.

It’s no wonder Go and Rust are gaining popularity, though in general it’s for very bespoke applications.

CMH62 says:

November 13, 2023 at 6:45 am

Slightly off topic but related: I have a simple Python program that manages some alphanumeric lists. I’m learning which data structures are faster than others for this, but I do seem to have one routine I’ve written that takes the majority of the processing time. I think I remember reading a few years back about a technique to compile just a single, slow routine in your Python program to speed things up. But I can’t remember now what that was. Any suggestions from my Hackaday cohorts?? Thx..

1. Andrew Dodd says:
  
  November 13, 2023 at 8:45 am
  
  cython? JAX’s JIT tricks?
  
  There are a few potential answers to what you’re looking for here, depending on the exact situation.
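On the data-structure half of the question, a quick measurement with the standard-library `timeit` module may be worth trying before compiling anything. This is only a sketch: the list contents and sizes are invented, and the real routine may be slow for an entirely different reason.

```python
import timeit

# Hypothetical alphanumeric list; the real program's data will differ.
names = [f"item{i}" for i in range(20_000)]
name_set = set(names)   # same entries, stored in a hash table
probe = "item19999"     # last element: worst case for a list scan

# Time 200 membership tests against each structure.
list_time = timeit.timeit(lambda: probe in names, number=200)
set_time = timeit.timeit(lambda: probe in name_set, number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

If the slow routine is dominated by membership tests like this, switching the container may help more than compiling would.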
  
NPHighview says:

November 13, 2023 at 8:42 am

This was dinner discussion yesterday – how convenient! Everyone in our family works with huge piles of data, and for one of us, Python is the way. For another, Spotfire rules (look it up – it’s great, and I’m not that person). I had to generate visualizations with Excel for millions of data points, and was forced to learn about pivot tables, version-to-version incompatibilities in Excel, and its absolutely horrific formatting limitations.
I have nightmares about Tableau and Qlik.

Robert says:

November 13, 2023 at 10:26 am

There is also RStudio, which one can use as an open-source package for data analysis. Much like the Python functionality, but without that dreadful whitespace dependency.

1. Bob Marlee says:
  
  November 13, 2023 at 5:14 pm
  
  But you trade it for a horrid global namespace and some back asswards predilection for stupid and verbose symbology like <- (although it supports =, which I use even though it somehow confuses others).
  
  You can get quite a bit of the power of various python tools along with the user-friendly interface of commercial products like SPSS in Orange (Data Mining); see handle link. Also, while certainly less elegant, NIST's dataplot can be handy for some more esoteric analyses.
  
AD says:

November 13, 2023 at 9:32 pm

I thought the “Stanford way” was to fabricate data…

1. Glenn says:
  
  November 17, 2023 at 9:14 pm
  
  🤣👍
  
Prfesser says:

November 14, 2023 at 5:05 am

Early work on data analysis: ;-)

“The Art Of Finding The Right Graph Paper For A Straight Line”, Journal of Irreproducible Results, 17, 235 (1979)

Also, “All Theories Proven With One Graph” is illuminating. The actual paper is available at: https://web.archive.org/web/20190928214938/http://jir.com/graph_contest/index.html#OneGraph

Always remember the first rule of analysis: First draw your curves, then plot your data.

Christian says:

November 14, 2023 at 2:28 pm

I would wager it is Python’s list comprehension. Other languages were too concerned with memory use and performance. They make you fiddle around with inserts and deletions.

Want to remove an item from a list? Create a new list that excludes that one item! Single line of code in Python.
Years ago I found that crazy, today I see the wisdom of it.
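The one-liner described above might look like the following sketch (the list contents are invented for illustration). Note that it drops every matching entry, not just the first, and leaves the original list untouched.

```python
items = ["a1", "b2", "c3", "b2"]
unwanted = "b2"

# Build a new list that excludes the unwanted value.
kept = [item for item in items if item != unwanted]

print(kept)   # ['a1', 'c3']
print(items)  # ['a1', 'b2', 'c3', 'b2']
```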

1. combinatorylogic says:
  
  November 15, 2023 at 6:01 am
  
  And this is how all the functional languages always operated. It’s painful to see people crediting Python for the introduction of the list comprehension.
  
2. Bob Marlee says:
  
  November 15, 2023 at 5:52 pm
  
  Why bother making a new list? Splice splice baby: splice(@F, $index, 1). See also grep to remove items by value. As Christian noted, these are not “new” concepts bestowed on us by Guido.
  