Notes on Big Data Book

Notes from…

Big Data: A Revolution That Will Transform How We Live, Work and Think

“With Correlation, there is no certainty, only probability. But if a correlation is strong the likelihood of a link is high” Nicholas Taleb

Optimal Proxy -> Billions of Models

We don’t need to rely on our own taste or expectations; instead we subject big data to correlation analysis and let it tell us which search queries are the best proxies.
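A rough sketch of what "let the correlation analysis pick the proxy" could look like. The weekly numbers and query names below are made up for illustration, and `pearson` and `rank_proxies` are hypothetical helper names, not anything from the book.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / math.sqrt(var_x * var_y)

def rank_proxies(query_counts, target):
    """Rank candidate queries by the strength of their correlation with the target."""
    scores = {q: pearson(counts, target) for q, counts in query_counts.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Made-up weekly figures: reported flu cases and search volumes per query.
flu_cases = [10, 20, 35, 50, 30, 15]
query_counts = {
    "cough medicine": [12, 22, 33, 52, 29, 14],
    "football scores": [40, 38, 41, 39, 40, 42],
    "umbrella": [5, 9, 4, 8, 6, 7],
}

ranking = rank_proxies(query_counts, flu_cases)
for query, r in ranking:
    print(f"{query}: r = {r:+.2f}")
```

Nothing here says why a query tracks flu cases; the ranking only says which one tracks it best, which is the point of the note above.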

Benefit – less biased, more accurate, faster

Causality – quick, illusory causality vs. slow, methodical causality

We have an intuitive desire to see causal connections. With big data, the aim is not to look for causal reasons but to find the correlations.

“Everything is obvious when you know the answer” Duncan Watts

The future will be driven by the abundance of data rather than by hypotheses.

Datafication #NewWord

Example – Maury, US Navy (mapping the weather) – took information generated for one purpose and converted it into something else. Doing so allowed the information to be used in a novel way to create unique value.

To datafy a phenomenon is to put it in a quantified format so it can be tabulated and analysed.

Modern IT systems certainly make big data possible, but at its heart the move to big data is a continuation of humankind’s ancient quest to measure, record and analyse the world.

“The IT revolution is evident all around us but the emphasis has mostly been on the technology. It is time to recast our gaze to focus on the information.”

Datafy – Data Source (How to measure), Data Entry (How to record what we measure)

There is the need to look at the whole picture.

Data History

Basic counting and the measurement of length and weight were among the oldest conceptual tools of early civilisations.

By the 3rd millennium BC the idea of recorded information had advanced significantly in the Indus Valley, Egypt and Mesopotamia. It was used to track production and business transactions (the early foundation of datafication).

Roman numerals were poorly suited to calculation. An alternative system developed in India around the 1st century AD travelled to Persia and was passed on to the Arabs, who greatly refined it. The Crusades brought destruction, but knowledge migrated.

~130 million unique books published since the 15th century. Google scanned 20 million, about 15% of the world’s written heritage. New discipline – computational lexicology, which tries to understand human behaviour and cultural trends through quantitative analysis of words.

Fewer than half of the English words that appear in books are included in dictionaries.

Google – book-scanning project, 2004 – digitized text (scans captured as high-resolution image files) – datafied (optical character recognition software that could take a digital image and recognise letters, words, sentences, paragraphs), making the text indexable and searchable – textual analysis (when words or phrases were first used or became popular, the spread of ideas and the evolution of human thought across centuries and many languages, shifts in thought)
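Once text is datafied, a question like "when was a word first used" becomes a simple lookup. A toy sketch, assuming a tiny made-up corpus of (year, text) pairs; `first_appearance` is a hypothetical helper, not how Google does it.

```python
import re

def first_appearance(corpus):
    """Map each word to the earliest year in which it appears.

    corpus is an iterable of (year, text) pairs, in any order.
    """
    first_seen = {}
    for year, text in corpus:
        for word in re.findall(r"[a-z]+", text.lower()):
            # Keep the year only if it is earlier than what we have so far.
            if year < first_seen.get(word, year + 1):
                first_seen[word] = year
    return first_seen

# Made-up sample "books" for illustration only.
corpus = [
    (1850, "The telegraph carried news across the country"),
    (1790, "News travelled by horse across the country"),
    (1995, "The internet carried news faster than the telegraph"),
]

first_seen = first_appearance(corpus)
for word in ("news", "telegraph", "internet"):
    print(word, first_seen[word])
```

None of this works on scanned page images alone; it needs the OCR step that turns pictures of pages into words.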

“Words are like fossils encased within pages instead of sedimentary rock. The practitioners of culturomics can mine them like archaeologists”

Quantification, Standardization, Collection

Twitter – sells access to its data firehose, at a cost, via DataSift and Gnip

Many companies parse tweets for sentiment analysis to understand customer feedback and judge the impact of marketing campaigns.

Derwent Capital and MarketPsych analyze the datafied text of tweets as signals for investment.

Bernardo Huberman and a colleague at HP developed a model that uses the rate at which new tweets are posted to forecast a film’s success.

509 million tweets from 2.4 million people in 84 countries – people’s moods follow similar daily and weekly patterns.
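A very small sketch of the idea, assuming a hand-made word list. Real sentiment tools are far more sophisticated; the word sets and sample tweets below are invented for illustration.

```python
import re

# Hypothetical mini-lexicons; a real system would use a much larger, tested list.
POSITIVE = {"love", "great", "good", "happy", "excellent"}
NEGATIVE = {"hate", "bad", "terrible", "broken", "slow"}

def sentiment_score(tweet):
    """Count positive words minus negative words in a tweet."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def label(score):
    """Turn a numeric score into a coarse label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Made-up tweets.
tweets = [
    "Love the new phone, great camera",
    "Delivery was slow and the box arrived broken",
    "Picked up my order today",
]

scores = [sentiment_score(t) for t in tweets]
for tweet, score in zip(tweets, scores):
    print(f"{label(score):8} {score:+d}  {tweet}")
```

Word counting like this misses sarcasm and negation ("not good"), so it is only a starting point for reading customer feedback at scale.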

Big data means all data.

  1. Infrastructure in place
  2. Special value in combining datasets
  3. One-stop shop to obtain data simplifies life for data users

Risks – Equifax, Experian, Acxiom

Personal Data Guru – Mydex & ID3

Netflix – contest to improve its recommendations by 10%

3 core strategies to ensure privacy – individual notice and consent, opting out, anonymization (all have lost much of their effectiveness)

Regulatory shift from “privacy by consent” to “privacy through accountability”. “Differential privacy”

Data-user accountability is a fundamental and essential change necessary for effective big data governance.
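Differential privacy in a nutshell: blur the answer to a query so that no single person’s record can be inferred from it. One common way to do this is the Laplace mechanism, sketched below for a simple count. The function names, the sample ages and the choice of epsilon are all assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching predicate.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)

# Made-up ages; the exact number of people aged 40+ is never released.
ages = [23, 35, 41, 29, 62, 57, 33, 48]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0))
print(round(noisy, 2))
```

The answer is only approximate, but averaged over a large dataset it is still useful, which is the trade-off the approach relies on.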

Safeguards for big data predictions

  1. Openness: making available the data and algorithms underlying the predictions that affect individuals.
  2. Certification: having algorithms certified for certain sensitive uses by an expert 3rd party as sound and valid.
  3. Disprovability: concrete ways for people to disprove predictions about themselves.

Big data governance – Privacy, Propensity, Algorithm authority.

“Big data is a resource and a tool meant to inform rather than explain. It points toward understanding but it can still lead to misunderstanding depending on how well or poorly it is wielded.”

Big Data = N = all

We can never have perfect information, so our predictions are inherently fallible.

It doesn’t offer ultimate answers, just good-enough ones to help us now until better methods and hence better answers come along.

We must use this tool with a generous degree of humanity.