Thinking about Big Data

“Big Data” has leapt rapidly into one of the most hyped terms in our industry, yet the hype should not blind people to the fact that this is a genuinely important shift about the role of data in the world. The amount, speed, and value of data sources are rapidly increasing. Data management has to change in five broad areas: extraction of data from a wider range of sources, changes to the logistics of data management with new database and integration approaches, the use of agile principles in running analytics projects, an emphasis on techniques for data interpretation to separate signal from noise, and the importance of well-designed visualization to make that signal more comprehensible. Summing up, this means we don't need big analytics projects; instead we want the new data thinking to permeate our regular work.

Martin Fowler

29 January 2013


Our agenda
Messy: in structure, in content
Distributed: it can come from many people

Contributors (more than ten edits) to Wikipedia in January of each year since 2002.


We used to think of getting data from computers... but now there are many devices to consider: 98% of internet access points in Africa are mobile... and more to consider.

Medical analysis of urine could be a way of detecting health problems early.

Cars these days carry a considerable number of lines of code. Connectivity to the internet provides navigation information to the driver and the car's movement can provide information about traffic conditions and diagnostics about its health.

Sensors are used to detect the presence of invasive fish species.


and yes, Big
most importantly, Valuable

Responding to the changes in data

We need to look in more places to find data, and it will take more work to get at it.

Do data analysis with small, quick projects focused on a particular business goal.

Make sure we can tell the difference between signal and noise in the data.

Choose particular database technologies and integrate them through services.

Explore visualizations that clarify the meaning of the data, making use of interactive and dynamic approaches.


Data comes from many sources

Customers can provide user-generated data, such as reviews for products or corrections for mapping errors. This data needs a greater degree of checking, both for meaning and to avoid deception.

Messaging is a popular way for software systems to communicate, and one of the advantages is that these messages can also be processed by analytic systems. Synchronous calls can be fed into monitoring systems. In such ways we can analyze application communications. Sometimes this data can be very well structured, but even then it is usually non-uniform, with different messages having varied structures. Some messages, such as email, carry very little structure at all.
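
As a small illustration of that non-uniformity, here is a sketch of a normalization step that maps differently-structured messages onto one common shape before they reach an analytic store. The field names ("type", "event", "order_id", "orderRef", and so on) are invented for the example, not taken from any particular system.

    # Minimal sketch: normalize messages with varied structures into one shape.
    # All field names are hypothetical.
    def normalize(message):
        return {
            "kind": message.get("type") or message.get("event", "unknown"),
            "order_id": message.get("order_id") or message.get("orderRef"),
            "timestamp": message.get("ts") or message.get("created_at"),
            "raw": message,   # keep the original for later re-processing
        }

    messages = [
        {"type": "order-placed", "order_id": "o-17", "ts": "2013-01-29T10:00:00"},
        {"event": "shipment", "orderRef": "o-17", "created_at": "2013-01-30T09:30:00"},
    ]
    uniform = [normalize(m) for m in messages]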

It is increasingly affordable to equip physical devices with sensors, which monitor their location, condition, and physical environment. For many years we've taken advantage of bar-codes and RFID tags to track objects through supply chains. Increasingly, physical devices can be more active sources of information: sensors with memory and simple communication devices can record continuously and upload their data to networks when they have the opportunity.

Sensor data is usually fairly well-structured and uniform, but it can present challenges of volume, both in the amount of storage required and in the high rate of writes to the data store.
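
One common way to cope with that write rate, sketched here with an assumed store object, is to buffer readings in memory and write them in batches rather than one at a time. The write_batch method is a placeholder for whatever bulk-insert facility the chosen datastore offers.

    # Minimal sketch: batch sensor readings to reduce the number of writes.
    # "store.write_batch" stands in for a real bulk-insert call.
    class BufferedWriter:
        def __init__(self, store, batch_size=500):
            self.store = store
            self.batch_size = batch_size
            self.buffer = []

        def add(self, reading):
            self.buffer.append(reading)
            if len(self.buffer) >= self.batch_size:
                self.flush()

        def flush(self):
            if self.buffer:
                self.store.write_batch(self.buffer)
                self.buffer = []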

Every time a customer, employee, or partner interacts with an application, that interaction can be logged - and these application logs are often a valuable source of information. This is most common for web sites, where analytics software commonly traces the paths that users follow through a sequence of web pages. The data in these logs can be used to improve the user experience of the application, and can also suggest ideas for new features and products.

Application data like this requires a good bit of work to tease out intent and make it consumable by analytic systems. Furthermore, there's lots of it, so this is one of those cases where the bigness of the data is part of the challenge.
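
As a sketch of that teasing-out, assuming a simple made-up log format of "timestamp user-id url" on each line (real logs vary), reconstructing the path each user followed can start as small as this:

    # Minimal sketch: rebuild per-user page paths from an application log.
    # Assumes each line looks like: "2013-01-29T10:02:11 user42 /products/123"
    from collections import defaultdict

    def paths_from_log(lines):
        visits = defaultdict(list)
        for line in lines:
            timestamp, user, url = line.split()
            visits[user].append((timestamp, url))
        return {user: [url for _, url in sorted(entries)]
                for user, entries in visits.items()}

    log = [
        "2013-01-29T10:02:11 user42 /products/123",
        "2013-01-29T10:03:05 user42 /cart",
        "2013-01-29T10:01:50 user7 /home",
    ]
    print(paths_from_log(log))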

Mobile devices can be used both as explicit application interfaces and as sensors, in a similar way to remote devices. For example, Google uses real-time phone data to help estimate traffic delays and improve its travel-time estimates.

Such usage has similar challenges to that of any physical device sensor, but with additional privacy issues. While it's useful for an airline to track its passengers and employees continuously, many people will reasonably object to the police-state implications of such scrutiny.

Much important data is stored in organizations in the form of documents: word-processing files, PDFs, spreadsheets, and presentation decks. This is very ill-structured data, but it often contains critical information - indeed much critical computation is done inside spreadsheets.


Extracting data is complex... but the real problem is knowing where to look. So cross-functional collaboration is essential.

The role of data management needs re-thinking: it was... but the changes require new strategies.
The era of relational mono-culture is over. We now have to ask: what is the right database for our needs?
Aggregate-oriented databases: good for... not for...
Graph databases: good for...
We have found that NoSQL databases are suitable for enterprise applications, but this does not mean relational is dead.
NoSQL, relational, and other database technologies are all on the table. We call this Polyglot Persistence.
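
To give a feel for what "aggregate-oriented" means, here is a sketch (with invented fields) of an order held as a single aggregate, the unit a document or key-value store would save and fetch in one go, where a relational design would spread it across several tables.

    # Minimal sketch: an order modeled as one aggregate, as a document store would hold it.
    # The fields are illustrative only.
    order = {
        "id": "order-1001",
        "customer": {"id": "cust-42", "name": "Jane Doe"},
        "lines": [
            {"sku": "book-7", "quantity": 2, "price": 15.00},
            {"sku": "mug-3", "quantity": 1, "price": 8.50},
        ],
        "total": 38.50,
    }
    # An aggregate-oriented database stores and fetches the whole thing by its id,
    # which suits "read or write the whole order" access, but not ad-hoc
    # cross-cutting queries such as "total sales of mug-3 across all orders".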

Many organizations integrate applications through shared databases; instead, encapsulate each database and share its data via service APIs.
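
A sketch of that service style, using an in-process example rather than any particular framework (the class, method, and field names are invented): the owning application keeps its database private and other systems go through a narrow, deliberate interface.

    # Minimal sketch: encapsulate a database behind a service instead of letting
    # other applications query its tables directly. Names are illustrative.
    class CustomerService:
        def __init__(self, db):
            self._db = db   # private: only this service touches the database

        def customer_profile(self, customer_id):
            row = self._db.get(customer_id)
            if row is None:
                return None
            # expose a deliberate, stable shape rather than raw table columns
            return {"id": customer_id, "name": row["name"], "segment": row["segment"]}

    db = {"cust-42": {"name": "Jane Doe", "segment": "gold", "internal_flags": "not exposed"}}
    print(CustomerService(db).customer_profile("cust-42"))
    # In practice the call would go over HTTP or messaging, but the point is the same:
    # consumers see the service's API, not the tables behind it.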

A workflow for analytics

How will we find data that might be interesting?

How will we gather the data?

How will we eliminate the useless data?

How will we combine data from disparate sources?

These early stages of data preparation result in a cohesive convergence of data

This is followed by a rapid sequence of analytics that may diverge from the original goals based on new discoveries and directions.

How will we enhance the data?

How will we uncover interesting patterns?

How will we operationalize models for action?


Six months is too long for effective action. Thus you need an agile approach with rapid cycles.
“An Example Would Be Handy Right About Now”

First establish a high level business goal

How can we identify high-value customers who are about to leave and motivate them to stay?

Next, choose a small, simple aspect of the goal as an analytical starting point.

What are the common features of customers who leave?

Validate the usefulness and actionability of results with business stakeholders...

... and choose another aspect to explore

What are the shopping behaviors of customers who leave?

Repeat, exploring more aspects of the goal

Has the business goal been achieved, or is continued evolution needed?

Is there a time series of events that lead to customers leaving?

What do customers about to leave say about us on social media?

Can we determine customers’ sentiment for our company just before they leave?

What sequence of events seems to encourage leaving customers to stay?

Have our incentives reduced the number of high value customers who leave?
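
The first analytical cut above, "What are the common features of customers who leave?", can start very modestly indeed. Here is a sketch, with invented data and field names, comparing a couple of candidate features between customers who left and those who stayed:

    # Minimal sketch of a first cut: compare churned and retained customers on a
    # few candidate features. The data and field names are invented.
    customers = [
        {"tenure_months": 3,  "orders_per_month": 0.5, "churned": True},
        {"tenure_months": 26, "orders_per_month": 2.1, "churned": False},
        {"tenure_months": 5,  "orders_per_month": 0.8, "churned": True},
        {"tenure_months": 31, "orders_per_month": 1.7, "churned": False},
    ]

    def average(group, feature):
        return sum(c[feature] for c in group) / len(group)

    churned = [c for c in customers if c["churned"]]
    retained = [c for c in customers if not c["churned"]]
    for feature in ("tenure_months", "orders_per_month"):
        print(feature, average(churned, feature), average(retained, feature))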


Some guidelines for making an agile analytics approach work

Ken Collier, the Agile Analytics Practice Lead at Thoughtworks, has been pioneering the use of agile techniques for analytics.



Consider the counties in the US with the lowest incidence of kidney cancer. They tend to be mostly rural, sparsely populated, and located in traditionally Republican states. Is it because of the clean living of the rural lifestyle?

Now consider the counties in which the incidence of kidney cancer is highest. These ailing counties also tend to be mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.


this is due to the law of small numbers

I took this example from Kahneman. His explanation of this effect is to imagine Jack and Jill each drawing colored balls from jars. Each jar contains equal numbers of red and white balls. Jack draws four balls, Jill draws seven. Just by probability Jack will see more draws with all balls of the same color than Jill will (by a factor of 8).

which is an example of bad intuitive reasoning with statistics
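
The factor of 8 mentioned above checks out with a quick calculation, treating each ball drawn as an independent 50/50 event:

    # Chance of drawing n balls all the same colour from a 50/50 jar: 2 * (1/2) ** n
    p_jack = 2 * 0.5 ** 4   # 4 draws -> 0.125
    p_jill = 2 * 0.5 ** 7   # 7 draws -> 0.015625
    print(p_jack / p_jill)  # 8.0 - Jack sees an all-one-colour draw 8 times as often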

It is our responsibility to educate ourselves and our users about probabilistic illusions

If we are going to build tools to allow people to dig for meaning in big data, it is our responsibility to make sure the information people find isn't just statistical noise.
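
One way to make the danger concrete: generate many candidate "metrics" that are nothing but random noise, and notice that some of them still correlate noticeably with an outcome just by chance. A small simulation sketch, standard library only:

    # Minimal sketch: with enough candidate metrics, some correlate with the
    # outcome by chance alone, even though every column here is pure noise.
    import random

    random.seed(1)
    rows, metrics = 50, 200
    outcome = [random.random() for _ in range(rows)]

    def correlation(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var_x = sum((x - mx) ** 2 for x in xs)
        var_y = sum((y - my) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    best = max(abs(correlation([random.random() for _ in range(rows)], outcome))
               for _ in range(metrics))
    print(best)  # usually above 0.3: looks like a relationship, but it is only noise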

Educate Ourselves

We need to ensure that we have a better grip on probability and statistics - at least enough to alert us to possible problems.

Incorporate Statistics Skills

Teams involved in analytics need people with a background in statistics, who have the experience and knowledge to tell the difference between signal and noise

Educate Users

We must ensure the customers and users aren't probabilistically illiterate by helping them understand the actual significance of the numbers.


Data Scientist is the new hot job

“Data Scientist” will soon be the most over-hyped job title in our industry. Lots of people will attach it to their resumé in the hopes of better positions

but despite the hype, there is a genuine skill set

Although most data scientists will be comfortable using specialized tools, the skill set is much more than knowing how to use R. Understanding when to use models is usually more important than being able to run them, as is knowing how to avoid probabilistic illusions and overfitting.
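
A small sketch of the overfitting point, standard library only and with invented data: a model that effectively memorizes its training data scores perfectly on it, yet predicts fresh data worse than a much simpler model.

    # Minimal sketch of overfitting. The data is noise around the line y = 2x + 1.
    import random

    random.seed(0)

    def sample(n):
        return [(i / n, 2 * (i / n) + 1 + random.gauss(0, 0.3)) for i in range(n)]

    train, test = sample(20), sample(200)

    def memorizer(x):
        # "complex" model: return the y of the nearest training point
        return min(train, key=lambda p: abs(p[0] - x))[1]

    # "simple" model: a least-squares straight line fitted to the training data
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x

    def straight_line(x):
        return slope * x + intercept

    def mse(model, data):
        return sum((model(x) - y) ** 2 for x, y in data) / len(data)

    for name, model in [("memorizer", memorizer), ("straight line", straight_line)]:
        print(name, round(mse(model, train), 3), round(mse(model, test), 3))
    # the memorizer is flawless on the data it has seen and worse on new data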



Visualizations play a key role in turning data into insight. Explore the possibilities.
Impressive visualizations take considerable effort, but you can often do something useful with ease.
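
As an example of the "with ease" end of the spectrum, a sketch using matplotlib with invented numbers: a plain bar chart is often enough to make a pattern visible at a glance.

    # Minimal sketch: a quick bar chart of invented monthly churn figures.
    import matplotlib.pyplot as plt

    months = ["Sep", "Oct", "Nov", "Dec", "Jan"]
    lost_customers = [120, 135, 128, 210, 190]

    plt.bar(months, lost_customers)
    plt.title("Customers lost per month")
    plt.savefig("churn-by-month.png")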
So... is Big Data == Big Hype?
Big Data does not mean Big Projects: any software project should incorporate “Big Data” thinking.
Some Thank-Yous