Data Before Hypothesis

I was taking DS101X Statistical Thinking for Data Science and Analytics and came upon a profound idea:

“…more emphasis needed to be placed on using data to suggest hypotheses to test…”
[Source: “A Very Short History of Data Science.” Press, Gil. Forbes. May 28, 2013.]

The scientific method is taught in every elementary school: make a hypothesis, gather data, and judge the validity of the original hypothesis to guide future work. I remember this getting ingrained into every science fair project. Every foam board presentation started with paper cut-outs of your hypothesis, followed by a chart of the colored lights used to make plants grow. I remember putting together a poster board on balsa wood arches whose legs were set at varying angles. My board had a giant metal arch for additional flair that spanned up over the blue foam poster board, joining the left and right panels. Hanging from the arch was a laminated sign drawing everyone in to learn which arch shape was the strongest. The left panel: background and hypothesis; middle panel: data; right panel: conclusion and future work.

It was always that: make an educated guess first and then test it. For centuries the effort of capturing data has been the bottleneck of scientific research. It is all of the hours spent painstakingly setting up experiments in the lab and taking meticulous notes. It was Galileo who pointed his telescope to the sky and charted the movements of the planets night after night. It is amazing to think that the optics he employed were vastly inferior to the hobby telescopes of today, but through such a small lens he was able to uncover heretofore unknown mysteries of the universe. The cost of data gathering has always been too high, and so the scientist has always been tasked with shaping the course of work by making hypotheses. Ideas and understanding have always run out ahead of what could finally be pinned down through observation.

Mathematics is much more interesting after you get all of the theorems and rote methods out of the way. It gets really engaging when you first learn that we had no idea how to solve some differential equations. No headway was made until computers came around that could crunch through hours of attempted solutions to approximate the answer numerically. The scientific method has always been hypothesis first, then data. Data science turns this picture on its head. Now that the cost of data is coming down with cheaper sensors and computing power, many problems can be solved data first. Gather information on everything, then crunch on these data sets with numerical analysis to find the patterns. It is data before hypothesis, and some problems lend themselves exceptionally well to this technique.
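To make the data-first idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the readings are made up (borrowing the plant-and-lights science fair setup), the `suggest_hypotheses` helper is hypothetical, and a plain Pearson correlation stands in for whatever numerical analysis a real project would use. The point is only the order of operations: collect the columns first, then let the strong relationships suggest which hypotheses are worth testing.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def suggest_hypotheses(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of columns whose
    correlation is at least `threshold` in absolute value, strongest first."""
    candidates = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            candidates.append((name_a, name_b, r))
    return sorted(candidates, key=lambda c: abs(c[2]), reverse=True)


# Made-up readings, gathered with no hypothesis in mind.
readings = {
    "light_hours": [2, 4, 6, 8, 10, 12],
    "plant_height_cm": [3.1, 5.0, 7.2, 8.9, 11.1, 13.0],
    "room_number": [4, 1, 6, 2, 5, 3],
}

for name_a, name_b, r in suggest_hypotheses(readings):
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```

With these invented numbers, only the light and height columns clear the threshold. A strong correlation is not a conclusion, though. It is a suggested hypothesis, which still has to be tested the old-fashioned way.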

Columbia University has created a Data Science Institute. The goal is to manage a new discipline at the intersection of computer science, mathematics, and every other discipline. Its staff includes individuals harnessing statistical tools for comparative literature, biology, and civil engineering, among others. It is a new breed of big data training that will continue to change and shape how we gather information and understand complex systems. I am looking forward to taking some of these classes from edX; we’ll see what this one offers!
