Big Introduction to Big Data

“The extraction of actionable knowledge from the vast amounts of available digital information seems to be the natural next step in the ongoing evolution from the Information Age to the Knowledge Age.” – Martin Hilbert, “Big Data for Dev.”, pre-published version, Jan. 2013.

“But the number of meaningful relationships in the data – those that speak to causality rather than correlation and testify to how the world really works – is orders of magnitude smaller. Nor is it likely to be increasing at nearly so fast a rate as the information itself; there isn’t any more truth in the world than there was before the Internet or the printing press. Most of the data is just noise, as most of the universe is filled with empty space.” – Nate Silver, “The Signal and the Noise”, p. 250

In the following discussion we will use the same notation one uses in an Excel spreadsheet. The asterisk or “*” means “times” or multiplication and the up arrow or caret “^” means “raised to the power of”.

A few weeks ago I posed the question whether Homo sapiens (all of us, you, me, Americans and everybody) are sufficiently scientifically literate to survive the technological world we’ve created. The National Science Foundation had developed a test for scientific literacy but the answers to their questions should have been acquired in elementary school. I speculated that the NSF questions do not test for what I called “sufficient” science literacy.

Figure 1 is a plot of the energy flow through the human economy during the last two thousand years as well as the world GDP during the same period of human history. That they align is no coincidence. Growth of an individual, an ecosystem or an economy requires energy flow and this growth is fundamentally constrained by any physical limits on this rate of flow.

Figure 1, Energy flow through the human economy during the last two millennia (red curve) and human accumulation of wealth (blue curve)

This figure introduces some scientific tools that we all need to understand but that we, individually and collectively, seem not to have assimilated. The exponential function itself is poorly understood. Humans cannot keep increasing energy flow exponentially forever; this trend will end, and therefore wealth cannot continue to grow exponentially either. This is what energy economists and ecologists mean when they point out that continued economic growth is unsustainable. It is simply an acknowledgment that our world is finite: a planet only 12,742 km in diameter, living off an energy gradient between the 5778 kelvin surface of the Sun (our energy source) and the 3 kelvin temperature of deep space (our energy sink). This discussion matters because many economists believe our economy is entering an information age or knowledge age in which economic growth can continue independent of real physical resources, simply by growing “information” and selling it back and forth to each other. We will discuss this in a future article.
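To make the limit on exponential growth concrete, here is a minimal Python sketch. It uses three figures quoted in this article (500 EJ of world energy use in 2010, the 1.361 kJ per second per square meter solar constant, and Earth's 12,742 km diameter). The 2% annual growth rate is purely an assumption chosen for illustration, not a figure from this article, and the function names are hypothetical.

```python
import math

# Figures quoted in this article
WORLD_ENERGY_2010 = 500e18   # joules per year (500 EJ)
SOLAR_CONSTANT = 1.361e3     # joules per second per square meter
EARTH_DIAMETER = 12.742e6    # meters (12,742 km)

# Assumption for illustration only: a steady 2% annual growth rate
ASSUMED_GROWTH_RATE = 0.02

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def doubling_time(rate):
    """Years needed to double at a constant fractional growth rate per year."""
    return math.log(2) / math.log(1 + rate)


def years_to_reach(start, limit, rate):
    """Years of compound growth at `rate` before `start` grows to `limit`."""
    return math.log(limit / start) / math.log(1 + rate)


def sunlight_intercepted_per_year():
    """Solar energy crossing Earth's cross-sectional disk in one year, in joules."""
    disk_area = math.pi * (EARTH_DIAMETER / 2) ** 2
    return SOLAR_CONSTANT * disk_area * SECONDS_PER_YEAR


years = years_to_reach(WORLD_ENERGY_2010,
                       sunlight_intercepted_per_year(),
                       ASSUMED_GROWTH_RATE)
print(f"Doubling time at 2% growth: {doubling_time(ASSUMED_GROWTH_RATE):.0f} years")
print(f"Years until energy use equals all intercepted sunlight: {years:.0f}")
```

Under the assumed 2% rate, the doubling time is roughly 35 years, and energy use would match every joule of sunlight the Earth intercepts within a few centuries. A different assumed rate shifts the date but not the conclusion.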

The second thing to note is the use of scientific notation. We live in a world of large numbers growing larger and small numbers getting smaller. The NSA’s new data center will be able to process several yottabytes (YB, or 10^24 bytes) of data every year, and nanotechnology, the latest economic and technological fad, is measured in nanometers (nm, or 10^-9 meters), a very small size. The amount of money wrapped up in derivatives, and the net present value of the economic destruction of climate change, must be expressed in petadollars (P$, or 10^15 dollars: thousands of trillions, or millions of billions, of dollars), an extremely large amount of money. Speaking of genetically modified organisms, the mass of one copy of one genome is measured in fractions of a picogram (pg, or 10^-12 grams).

In order to comprehend the expanse of the data on which our science-based economic enterprise depends, we need to understand scientific notation. Anyone can grasp the concept in a matter of minutes, but it takes some experience before scientific notation becomes second nature. Familiarity and ease of use take time, effort and practice.

In discussing energy, economy or big data or the human condition, there is no other way to express the information than by using scientific notation. I postulate that two necessary (but not sufficient) requirements for sufficient scientific literacy are the understanding of the exponential function and the familiarity with scientific notation.
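For readers who want to practice, the following Python sketch converts a plain number into scientific notation and into the matching SI prefix. The function names `to_scientific` and `with_prefix` are hypothetical helpers written for this illustration. Note that Python writes "raised to the power of" as `**` rather than the caret used in this article.

```python
import math

# SI prefixes keyed by their power of ten
SI_PREFIXES = {
    24: "Y", 21: "Z", 18: "E", 15: "P", 12: "T", 9: "G", 6: "M", 3: "k",
    0: "",
    -3: "m", -6: "µ", -9: "n", -12: "p", -15: "f", -18: "a", -21: "z", -24: "y",
}


def to_scientific(value):
    """Return (mantissa, exponent) so that value = mantissa * 10**exponent."""
    if value == 0:
        return 0.0, 0
    exponent = math.floor(math.log10(abs(value)))
    return value / 10 ** exponent, exponent


def with_prefix(value, unit):
    """Format a value with the SI prefix for the nearest lower multiple of 3."""
    if value == 0:
        return f"0 {unit}"
    _, exponent = to_scientific(value)
    # Round the exponent down to a multiple of 3 and keep it within the table
    exponent3 = max(-24, min(24, (exponent // 3) * 3))
    scaled = value / 10 ** exponent3
    return f"{scaled:g} {SI_PREFIXES[exponent3]}{unit}"


print(with_prefix(500e18, "J"))   # world energy use in 2010, from this article
print(with_prefix(88e12, "J"))    # Fat Man yield, from this article
print(with_prefix(20e-9, "J"))    # Higgs boson mass-energy, from this article
```

The three calls print the same quantities in the prefixed form used throughout this article (EJ, TJ and nJ).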

To appreciate how this applies to big data, consider Figure 2. The inset figure was published in The Economist based on data provided by Martin Hilbert [1]. If we step back a bit we see that information growth is the overachiever of all exponential curves. Both information and information storage are growing hyper-exponentially, as is the capacity of our networks to communicate information, which is not shown. Secondly, we have to use exabytes (EB, or 10^18 bytes) to express the phenomenal amount of information. Note that information creation has surpassed information storage. We have to throw the stuff away.

Some questions we want to explore in this series include whether or not we can base an economy on information decoupled from resources, as economists assume. What is the difference between information and knowledge, and between knowledge and wisdom? Is there any benefit to trawling vast databases such as Google’s to determine important trends? How long can these exponential growth trajectories be maintained? Where is all this leading us?

To answer these questions and more we need to feel comfortable with very large and incredibly small numbers spanning tens of orders of magnitude. I suggest searching Wikipedia for “orders of magnitude” and the “International System of Units (SI)”.

Here are some interesting examples of the use of scientific notation applied to energy from Wikipedia:

2*10^-33 Joules is the average kinetic energy of translational motion of a molecule at the lowest temperature reached as of 2003, 100 picokelvins (100*10^-12 kelvins).

4.1 zJ or 4.1*10^-21 Joules is the common rough approximation for the total thermal energy of each molecule in a system at 25 degrees C or 77 degrees F, and 2.856 zJ is, by Landauer’s principle, the minimum amount of energy required at the same temperature to change one bit of information.
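The two smallest entries above, and the Landauer figure, all follow from the Boltzmann constant k multiplied by the absolute temperature T. The sketch below assumes the standard textbook expressions: kT for the thermal energy scale, (3/2)kT for average translational kinetic energy, and kT*ln(2) for the Landauer limit. The function names are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23  # joules per kelvin (exact value in the SI since 2019)


def thermal_energy(temperature_k):
    """Characteristic thermal energy kT, in joules."""
    return BOLTZMANN * temperature_k


def mean_translational_energy(temperature_k):
    """Average translational kinetic energy per molecule, (3/2)kT, in joules."""
    return 1.5 * BOLTZMANN * temperature_k


def landauer_limit(temperature_k):
    """Minimum energy to change one bit, kT*ln(2), in joules."""
    return BOLTZMANN * temperature_k * math.log(2)


ROOM_TEMPERATURE = 25 + 273.15  # 25 degrees C expressed in kelvins
COLDEST_2003 = 100e-12          # 100 picokelvins

print(f"kT at 25 C:          {thermal_energy(ROOM_TEMPERATURE):.2e} J")
print(f"Landauer limit:      {landauer_limit(ROOM_TEMPERATURE):.2e} J")
print(f"(3/2)kT at 100 pK:   {mean_translational_energy(COLDEST_2003):.1e} J")
```

The results land close to the quoted values of 4.1 zJ, about 2.85 zJ and 2*10^-33 Joules; small differences in the last digit come from rounding and from the exact temperature assumed.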

20 nJ or 20 nano Joules or 20*10^-9 Joules is the mass-energy of the particle believed to be the Higgs Boson (125.3 GeV, or Giga electron Volts, meaning billions of electron Volts) announced on March 14, 2013. By the way, the Large Hadron Collider generates 5*10^20 Bytes of data per day, but only 0.001% of it is recorded, or 5 Peta Bytes each day. While the collider is turned on, this is 0.2% of all data generated by humans.
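Both conversions in this entry are one-line calculations. This sketch assumes the standard SI value of the electron volt, 1.602176634*10^-19 Joules; the function names are hypothetical.

```python
ELECTRON_VOLT = 1.602176634e-19  # joules per electron volt (exact SI value)


def gev_to_joules(energy_gev):
    """Convert an energy in giga electron volts to joules."""
    return energy_gev * 1e9 * ELECTRON_VOLT


def recorded_bytes(generated_bytes, percent_recorded):
    """Bytes kept when only a given percentage of generated data is recorded."""
    return generated_bytes * percent_recorded / 100


higgs_joules = gev_to_joules(125.3)
lhc_recorded = recorded_bytes(5e20, 0.001)

print(f"Higgs mass-energy: {higgs_joules / 1e-9:.1f} nJ")
print(f"LHC data recorded per day: {lhc_recorded / 1e15:.1f} PB")
```

The first result is about 20 nJ and the second about 5 Peta Bytes, matching the figures quoted above.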

1 Joule is approximately the kinetic energy gained as an extra small apple (100 grams) falls 1 meter under Earth’s gravity. Scientists and engineers express 1 as 10^0; in fact, any nonzero number raised to the power of 0 is exactly 1.
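The apple example is the textbook relation energy = mass * g * height. This small sketch assumes a surface gravity of 9.81 meters per second squared, a value not given in this article, and also checks the zero-exponent rule.

```python
GRAVITY = 9.81  # meters per second squared; assumed standard surface value


def falling_energy(mass_kg, height_m, g=GRAVITY):
    """Energy in joules gained by a mass falling through a height: m * g * h."""
    return mass_kg * g * height_m


apple_joules = falling_energy(0.100, 1.0)  # 100 gram apple, 1 meter drop
print(f"Apple falling 1 m: {apple_joules:.2f} J")

# Any nonzero number raised to the power 0 is 1
print(10 ** 0, 7 ** 0, 0.5 ** 0)
```

The apple comes out just under 1 Joule, which is why it serves as a handy mental picture of the unit.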

1.361 kJ or kilo Joules or 1.361*10^3 Joules is the total solar radiation received from the Sun each second by 1 square meter at the distance of Earth’s orbit. This is called the solar constant, though over geologic time it of course isn’t constant. This is the solar energy the Earth intercepts at the top of the atmosphere before any of it is reflected back out into space.

88 TJ or TeraJoules was the yield of the Fat Man atomic bomb dropped on the city of Nagasaki on August 9, 1945, and about 63 TJ was the yield of Little Boy, dropped on Hiroshima on August 6, 1945.

174 PJ or PetaJoules is the total energy from the Sun that strikes the face of the Earth each second (the solar constant multiplied by the area of the disk the Earth presents to the Sun).
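This per-second total can be estimated from two numbers already quoted in this article: the solar constant (1.361 kJ per second per square meter) and Earth's diameter (12,742 km). The sketch treats the Earth as a flat disk facing the Sun, which is the usual simplifying assumption; the function name is hypothetical.

```python
import math

SOLAR_CONSTANT = 1.361e3   # joules per second per square meter (from this article)
EARTH_DIAMETER = 12.742e6  # meters (12,742 km, from this article)


def solar_energy_per_second():
    """Joules of sunlight crossing Earth's cross-sectional disk each second."""
    disk_area = math.pi * (EARTH_DIAMETER / 2) ** 2
    return SOLAR_CONSTANT * disk_area


intercepted = solar_energy_per_second()
print(f"Sunlight intercepted each second: {intercepted / 1e15:.0f} PJ")
```

The estimate comes out at roughly 1.7*10^17 Joules per second, on the order of a hundred PetaJoules.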

500 EJ or 500 ExaJoules was the total world energy consumption for the year 2010.

500 ZettaJoules or 500 ZJ is the approximate energy released in the formation of the Chicxulub Crater in the Yucatan Peninsula by the collision, 65 million years ago, of an asteroid 6 miles in diameter traveling at 45,000 miles per hour. This impact caused the extinction of the dinosaurs and every animal larger than a cat.

All of these numbers are interesting, many are useful, and some are very important to know. None of them can be comprehended without the use of scientific notation. More next time.

Figure 2, Cumulative human information creation (dark blue curve and dotted curve) and cumulative information storage capability (light blue curve and solid curve).

[1] Figure 2 is adapted from “Special Report: Data, data everywhere”, The Economist February 27, 2010. Data source: Martin Hilbert and Priscila López, The World’s Technological Capacity to Store, Communicate, and Compute Information, Science 332, 60 (2011); DOI: 10.1126/science.1200970

