Thinking about Big Data
"Big Data" has leapt rapidly into one of the most hyped terms in our industry, yet the hype should not blind people to the fact that this is a genuinely important shift about the role of data in the world. The amount, speed, and value of data sources is rapidly increasing. Data management has to change in five broad areas: extraction of data from a wider range of sources, changes to the logistics of data management with new database and integration approaches, the use of agile principles in running analytics projects, an emphasis on techniques for data interpretation to separate signal from noise, and the importance of well-designed visualization to make that signal more comprehensible. Summing up this means we don't need big analytics projects, instead we want the new data thinking to permeate our regular work.
29 January 2013
Martin Fowler
- There is a significant change in the role that data plays in our activities.
- Although the term big data is often used to describe this change, it's not just about how much data we are looking to use.
- You may want to think of "big" applying to the importance of data - that data is playing a bigger role in our lives. Or just don't try to read "big data" literally.
- Big data is a term that's generated a lot of hype. But I think it's important to resist our usual aversion to hype in this case - there is a significant change in thinking that's happening.
- This shift forces us to change many long-held assumptions about data. It opens up new opportunities, but also calls for new thinking and new skills.
- This deck explores my take on this change.
Our agenda
- How is the world of data changing?
- How is the software profession responding to these changes?
Messy in structure
- Traditionally data is thought of as coming from well organized databases with controlled schemas sporting strong validation conditions.
- But we are now seeing data in many forms: log files, message queues, spreadsheets. This data is scattered throughout an organization and its ecosystem.
- There is often little or no schema to control its structure.
- Data is often non-uniform, with each element having different properties.
- With multiple sources of data, crowdsourcing, and even automated inferencing and discovery of data, there are big problems with data quality (the sketch after this list shows one way to cope with such non-uniform records).
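Since this data often arrives without a controlling schema, code that consumes it has to be defensive. Here is a minimal Python sketch, purely illustrative - the field names, sources, and the crude unit conversion are all assumptions - of normalizing non-uniform records and flagging suspect ones:

```python
# Hypothetical example: readings arrive from several sources, each with
# different properties and no controlling schema.
records = [
    {"id": 1, "temp_c": 21.5, "source": "sensor"},
    {"id": 2, "temperature": "70F", "entered_by": "crowd"},
    {"id": 3, "source": "spreadsheet"},           # value missing entirely
]

def normalize(record):
    """Coerce a record into a common shape, flagging anything suspect."""
    raw = record.get("temp_c") or record.get("temperature")
    if raw is None:
        return {"id": record.get("id"), "temp_c": None, "suspect": True}
    if isinstance(raw, str) and raw.endswith("F"):
        raw = (float(raw[:-1]) - 32) * 5 / 9      # crude unit conversion
    return {"id": record.get("id"), "temp_c": float(raw), "suspect": False}

cleaned = [normalize(r) for r in records]
print(cleaned)
```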
Distributed
it can come from many people
Contributors (more than ten edits) to Wikipedia in January of each year since 2002.
- The wide availability, and ease of access, through the internet means that data comes from many more contributors.
- This raises issues around handling many updates from varied sources, ensuring that people are encouraged to enter their useful data, and considering how to check entered data for consistency and veracity.
We used to think of getting data from computers...
...but now there are many devices to consider
98% of internet access points in Africa are mobile
...and more to consider
Medical analysis of urine could be a way of detecting health problems early.
Cars these days carry a considerable number of lines of code. Connectivity to the internet provides navigation information to the driver and the car's movement can provide information about traffic conditions and diagnostics about its health.
Sensors are used to detect the presence of invasive fish species.
and yes, Big
- Walmart: 1 million transactions per hour
- eBay: 50 petabytes of data per day
- Facebook: 40 billion photos
- The sheer volume of data is enough to defeat many long-followed approaches to data management.
- Centralized database systems cannot handle many of the data volumes, forcing the use of clusters.
most importantly, Valuable
- $300 billion per year: U.S. Healthcare
- 60% increase: retail margins
- Although it's difficult to get hard figures on the value of making full use of your data, much of the success of companies such as Amazon and Google is credited to their effective use of data.
- We've covered the changes that are occurring in the world of data
- Now that we have that initial context, we're ready to look at how the software development world is responding to these changes.
Responding to the changes in data
We need to look in more places to find data, and it will take more work to get at it.
Do data analysis with small, quick projects focused on a particular business goal.
Make sure we can tell the difference between signal and noise in the data.
Choose particular database technologies and integrate through services.
Explore visualizations that clarify the meaning of data, making use of interactive and dynamic approaches.
Data comes from many sources
Customers can provide user-generated data, such as reviews for products or corrections for mapping errors. This data needs a greater degree of checking, both for meaning and to avoid deception.
Messaging is a popular way for software systems to communicate, and one of the advantages is that these messages can also be processed by analytic systems. Synchronous calls can be fed into monitoring systems. In such ways we can analyze application communications. Sometimes this data can be very well structured - but even then it is usually non-uniform, with different messages having varied structures. Some messages, such as email, carry very little structure at all.
It is increasingly affordable to equip physical devices with sensors, which monitor their location, condition, and physical environment. For many years we've taken advantage of bar-codes and RFID tags to track objects through supply chains. Increasingly physical devices can be more active sources of information, sensors with memory and simple communication devices can record continuously and download to networks when they have the opportunity.
Sensor data is usually fairly well-structured and uniform. It can, however, present challenges in handling the volumes, both in terms of storage and the high rate of writes to a data store.
Every time a customer, employee, or partner interacts with an application, that interaction can be logged - and these application logs are often a valuable source of information. This is most common for web sites where analytics software commonly traces the paths that users follow through a sequence of web pages. The data in these logs can be used to improve the user experience of the application, and also suggest information for new features and products.
Application data like this requires a good bit of work to tease out intent and make it consumable by analytic systems. Furthermore there's lots of it, so this is one of those cases where the bigness of the data is part of the challenge.
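As an illustration of that work, here is a minimal Python sketch of pulling requested paths out of a web server log. The log layout is an assumption, loosely following the common Apache access-log format, and the helper names are made up:

```python
import re
from collections import Counter

# Assumed log format, loosely based on a common Apache access-log line:
# 127.0.0.1 - - [29/Jan/2013:10:00:01 +0000] "GET /products/42 HTTP/1.1" 200 512
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+)')

def parse(line):
    m = LINE.match(line)
    if not m:
        return None                     # messy data: skip what we can't read
    ip, ts, method, path, status, size = m.groups()
    return {"ip": ip, "path": path, "status": int(status)}

def popular_paths(lines, top=10):
    """Count successfully served paths - a first step in tracing user journeys."""
    hits = (parse(l) for l in lines)
    return Counter(h["path"] for h in hits if h and h["status"] == 200).most_common(top)

# Usage: popular_paths(open("access.log"))
```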
Mobile devices can be used both as explicit application interfaces, and also as sensors in a similar way to remote devices. For example Google uses real-time phone data to help estimate traffic delays to improve travel-time estimation.
Such usage has similar challenges to that of any physical device sensor, but with additional privacy issues. While it's useful for an airline to track its passengers and employees continuously, many people will reasonably object to the police-state implications of such scrutiny.
Much important data is stored in organizations in the form of documents: word-processing, pdfs, spreadsheets, and presentation decks. This is very ill-structured data, but often contains critical information - indeed much critical computation is done inside spreadsheets.
Extracting data is complex...
...but the real problem is knowing where to look
- With useful data present in so many places, the challenge is often more about realizing how valuable some of it may be.
- Often it's only technologists who work with applications on a day-to-day basis who know where useful data is hiding. They may know where data is, but often don't know how potentially valuable it can be.
- Business people are often aware of problems, but aren't aware of how data can help them, if that data exists, and if so where it is.
- If you want to match important problems with the data, you need collaboration between people with business knowledge, those who know what data exists, and those who can see how to process the data to shine a light on the problems.
- Knowing what data is available is also a multi-disciplinary exercise. Database people usually know the databases well, but with many more sources to consider, it's important to involve a wide range of technologists.
The role of data management needs re-thinking
it was...
- aiming towards a single, coherent and consistent model of data in the enterprise
- primarily based on relational databases
- focused on storing only validated data
...but now
- new database technologies are needed to support application needs more directly. Application teams now need to consider which database technology is appropriate for their situation, rather than using a single (relational) technology for everything.
- centralized management of data is giving way to particular applications managing their own data needs. Central groups now need to focus on enabling effective sharing between application teams.
- Relational databases have been the dominant data storage technology in the enterprise for over twenty years.
- They have resisted many challenges in the past, but the recent rise of NoSQL databases is cracking that control.
Aggregate-oriented databases
Good for
- Single hierarchic data-structure which is read and manipulated as a single unit-of-work (aggregate).
- Clustered operation (since aggregates make good units of distribution)
Not so good for
- When you want data to be sliced and diced in different structures
- Aggregate-oriented databases store complex structures of data in a single unit, rather than spreading the data over many rows in many tables.
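For example, an order together with its customer details and line items can be stored and read back as one aggregate. Here is a minimal sketch using MongoDB through pymongo (MongoDB is one of the aggregate-oriented databases mentioned later in this deck; the database, collection, and field names are purely illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")    # assumes a local MongoDB
orders = client.shop_db.orders                        # illustrative database/collection

# The whole order - customer, line items, payment - is one aggregate,
# stored and retrieved as a single unit rather than spread over many rows.
order = {
    "_id": "order-1001",
    "customer": {"id": "cust-42", "name": "Ana"},
    "line_items": [
        {"sku": "book-7", "qty": 2, "price": 15.00},
        {"sku": "mug-3", "qty": 1, "price": 8.50},
    ],
    "payment": {"method": "card"},
}
orders.insert_one(order)

# One read brings back the complete aggregate - convenient, but slicing the
# data differently (say, sales totals per SKU) is more awkward than in SQL.
fetched = orders.find_one({"_id": "order-1001"})
```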
Graph Databases
Good for
- Small units of data with rich connection structures
- Graph databases represent data as a node-and-arc graph structure. They are designed for rapid traversal across the graph structure and support queries that can be framed in terms of the graph.
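As a small illustration, here is a sketch using Neo4j (mentioned later in this deck) through its Python driver; the connection details, node labels, and relationship names are all assumptions made for the example:

```python
from neo4j import GraphDatabase

# Assumes a local Neo4j instance; credentials and data are illustrative.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Small nodes, rich connections: people and who they know.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Alice", b="Bob",
    )
    # Queries are framed as traversals across the graph:
    # people reachable from Alice within two KNOWS hops.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:KNOWS*1..2]->(other) "
        "RETURN DISTINCT other.name AS name",
        name="Alice",
    )
    print([record["name"] for record in result])

driver.close()
```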
We have found that NoSQL databases are suitable for enterprise applications
- Thoughtworks have built critical production systems with several NoSQL databases, in particular Couchbase, Riak, MongoDB (aggregate-oriented) and Neo4J (graph). Project teams report excellent productivity and we would recommend these for future projects.
- The relational data model, with its simple tabular structure and powerful query language, is the right choice for many kinds of data.
- Relational databases are a mature technology that lots of people are familiar with, and they have good tooling. Unless there is a good argument for something else, they are currently still the default choice.
NoSQL, Relational, and other database technologies are all on the table
- The essential point is that the era when data storage was not a decision is over. You now have to actively choose your database(s) depending on how you are going to use that data.
- Enterprises should expect multiple data-storage technologies for different applications.
- Even a single application may use polyglot persistence when its datasets have different characteristics.
Many organizations integrate applications with shared databases
- Shared database integration has been encouraged by the presence of SQL as a standard query language.
- It couples applications to each other by sharing database structure, making it harder for applications to change rapidly.
- It encourages a single database technology and schema for all applications, making it harder to use the appropriate database technology for a single application's needs.
- This makes reporting easier for simple cases as reporting tools are plentiful for SQL. But reporting needs often slow down applications and can only report on data within the shared database.
- An application database is used only by a single application. Any external integration is done through APIs built and exposed by that application.
- By encapsulating a database through an application API, clients are no longer directly coupled to the database technology and structure.
- Application APIs provide a greater range of data models than the underlying database model.
- A problem is that analytics clients may need APIs to be specially created for them to get at important data in an effective manner. If this is not important to the application development team, there can be frustrating delays and handoffs. The Service Custodian approach can help with this.
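A minimal sketch of the idea, using Flask and SQLite purely for illustration: the application owns its database as an internal detail, and all clients - including analytics consumers - integrate through the API rather than the schema. The routes, table, and field names are assumptions.

```python
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

def db():
    # The application's own database - an internal detail, never shared directly.
    return sqlite3.connect("orders.db")

@app.route("/orders/<order_id>")
def get_order(order_id):
    row = db().execute(
        "SELECT id, customer, total FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"id": row[0], "customer": row[1], "total": row[2]})

@app.route("/reports/daily-totals")
def daily_totals():
    # An endpoint added specifically for analytics clients, so they don't
    # need to reach into the application's schema.
    rows = db().execute(
        "SELECT date(created_at), sum(total) FROM orders GROUP BY date(created_at)"
    ).fetchall()
    return jsonify([{"day": d, "total": t} for d, t in rows])

# Usage: app.run()  -- clients integrate via HTTP, not via the database.
```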
A workflow for analytics
- Business leadership uses their strategic goals to indicate what metrics should be used for analytics.
How will we find data that might be interesting?
How will we gather the data?
How will we eliminate the useless data?
How will we combine data from disparate sources?
These early stages of data preparation result in a cohesive convergence of data
This is followed by a rapid sequence of analytics that may diverge from the original goals based on new discoveries and directions.
How will we enhance the data?
How will we uncover interesting patterns?
How will we operationalize models for action?
6 months is too long for effective action
- There is competitive advantage to taking action quickly.
- Long cycle times lead to wasted effort, because it's hard to understand what data is important until you start to try to use it.
- Most importantly, slow cycle times reduce your ability to learn. Each time you pass through the cycle you learn what forms of analysis are valuable and gain a greater understanding of what the next steps should be. Speed in learning amplifies the advantage of speed through the cycle.
- Reduce the scope of each cycle so you can run through the cycle quickly
- Use the results of one cycle to decide what to do on the next.
"An Example Would Be Handy Right About Now"
First establish a high level business goal
How can we identify high-value customers who are about to leave and motivate them to stay?
Next, choose a small, simple aspect of the goal as an analytical starting point.
What are the common features of customers who leave?
Validate the usefulness and actionability of results with business stakeholders...
... and choose another aspect to explore
What are the shopping behaviors of customers who leave?
Repeat, exploring more aspects of the goal
Has the business goal been achieved, or is continued evolution needed?
Is there a time series of events that lead to customers leaving?
What do customers about to leave say about us on social media?
Can we determine customers’ sentiment for our company just before they leave?
What sequence of events seems to encourage leaving customers to stay?
Have our incentives reduced the number of high value customers who leave?
Some guidelines for making an agile analytics approach work
- Use small teams, tightly focused on one aspect at a time.
- Don't try to build a grand analytics platform, instead solve particular problems and harvest a platform.
- Favor lightweight tools that allow you to gradually build up capability as you need it.
- Treat the analytics operation as an agile software development project, with the usual disciplines for agile application development.
Ken Collier, the Agile Analytics Practice Lead at Thoughtworks, has been pioneering the use of agile techniques for analytics.
Consider the counties in the United States in which the incidence of kidney cancer is lowest: they tend to be mostly rural, sparsely populated, and located in traditionally Republican states.
Is it because of...
- Republican politics?
- clean air and environment in rural areas?
Now consider the counties in which the incidence of kidney cancer is highest. These ailing counties tend to be mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.
this is due to the law of small numbers
- rural counties have a small population
- small populations are a small sample size
- smaller sample sizes tend to extremes
I took this example from Kahneman. His explanation of this effect is to imagine Jack and Jill each drawing colored balls from jars. Each jar contains equal numbers of red and white balls. Jack draws four balls, Jill draws seven. Just by probability Jack will see more draws with all balls of the same color than Jill will (by a factor of 8).
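The factor of 8 is simple arithmetic: the chance that n draws all show the same color is 2 × (1/2)^n, which is 1/8 for Jack's four draws and 1/64 for Jill's seven. A few lines of Python confirm it by simulation:

```python
import random

def all_same_color(draws, trials=100_000):
    """Estimate the chance that `draws` balls from a 50/50 jar are all one color."""
    same = sum(
        len({random.choice("RW") for _ in range(draws)}) == 1
        for _ in range(trials)
    )
    return same / trials

jack = all_same_color(4)   # analytically 2 * (1/2)**4 = 0.125
jill = all_same_color(7)   # analytically 2 * (1/2)**7 ≈ 0.0156
print(jack, jill, jack / jill)   # the ratio comes out close to 8
```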
which is an example of bad intuitive reasoning with statistics
- People commonly make the mistake of attributing a cause to something that is just chance.
- There are many ways in which this tendency to see patterns in randomness can deceive us.
- The law of small numbers is one of many examples of a phenomenon I call a probabilistic illusion
- Just like optical illusions confuse the eye, so probabilistic illusions confuse our reasoning.
- As we make more use of data this problem is likely to be ever more prevalent, since so many people suffer from probabilistic illiteracy. Indeed even scientists and mathematicians get fooled frequently by these illusions.
- Kahneman gives many examples of how our intuitive reasoning can jump to erroneous conclusions. I strongly recommend this book to help understand how this happens.
It is our responsibility to educate ourselves and our users about probabilistic illusions
If we are going to build tools to allow people to dig for meaning in big data, it is our responsibility to make sure the information people find isn't just statistical noise.
Educate Ourselves
We need to ensure that we have a better grip on probability and statistics - at least enough to alert us to possible problems.
Incorporate Statistics Skills
Teams involved in analytics need people with a background in statistics, who have the experience and knowledge to tell the difference between signal and noise.
Educate Users
We must ensure the customers and users aren't probabilistically illiterate by helping them understand the actual significance of the numbers.
Data Scientist is the new hot job
"Data Scientist" will soon be the most over-hyped job title in our industry. Lots of people will attach it to their resumé in the hopes of better positions
but despite the hype, there is a genuine skill set
- The ability to explore questions and formulate them as hypotheses that can be tested with statistics.
- Business knowledge, consulting, and collaboration skills.
- Understanding machine-learning techniques.
- Programming ability enough to implement the various models they are working with.
Although most data scientists will be comfortable using specialized tools, all this is much more than knowing how to use R. Understanding when to use models is usually more important than being able to use them, as is knowing how to avoid probabilistic illusions and overfitting.
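As a small illustration of the overfitting point, the following Python sketch (using scikit-learn on synthetic data; everything here is illustrative rather than a recommended modeling approach) shows a model that looks perfect on the data it was fitted to but does noticeably worse on data it hasn't seen:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data - purely illustrative.
X, y = make_classification(n_samples=400, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training set...
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# ...a constrained one generalizes better.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("deep", deep), ("shallow", shallow)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))
# Typically the deep tree scores ~1.0 on training data but noticeably worse on
# the test set - the gap is the signature of overfitting.
```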
Visualizations play a key role turning data into insight
- It's hard to see what's going on with raw data
- Good visualizations should be focused on how the data can inform the goals you have for your analytics.
- Modern visualization tools can use interactivity and dynamism to allow exploration and digging into details while retaining an overall view.
- A good way to learn about visualization is to explore examples of different approaches.
- This "periodic table" is an interactive display that's a great source for inspiration in using different visualization techniques.
- d3.js is an important tool for building visualizations.
- d3.js is a framework that binds JavaScript data to DOM elements, which is particularly valuable for creating dynamic SVG visualizations.
- This gallery shows many interesting visualizations built with d3 and acts as inspiration for both visualizations and implementation techniques.
Impressive visualizations take considerable effort... but you can often do something useful with ease
- Impressive visualizations can be extremely valuable, worth the considerable effort made to create them.
- But don't get hung up on the complex, you can often build useful visualizations with surprisingly little effort.
- For a personal project I experimented with using sparklines to provide historic context to a value.
- Without any prior knowledge it took me a couple of hours to get from a Google search to a useful display (using jQuery sparklines; an analogous Python sketch follows this list).
- Only after building the sparkline did I realize that some other data displays weren't needed
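The same spirit carries over to other toolkits. As a rough analogue - not the jQuery plugin mentioned above, just a hypothetical sketch in Python using matplotlib - a sparkline is only a few lines:

```python
import matplotlib.pyplot as plt

def sparkline(values, path="sparkline.png"):
    """Draw a tiny, axis-free line chart to give a value some historic context."""
    fig, ax = plt.subplots(figsize=(2, 0.4))      # deliberately small
    ax.plot(values, linewidth=1)
    ax.plot(len(values) - 1, values[-1], "r.")    # highlight the latest value
    ax.axis("off")                                # no axes, ticks or labels
    fig.savefig(path, bbox_inches="tight", dpi=200)
    plt.close(fig)

# e.g. weekly values leading up to the current figure
sparkline([12, 14, 11, 15, 17, 16, 21, 19, 23])
```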
So... is Big Data == Big Hype?
- There is a lot of hype, but behind the smoke there is considerable fire.
- There are significant changes in the way data appears to us, and these call for appropriately significant actions in response from the profession.
- Many pundits, including me, will make incorrect predictions as to the exact changes over the next few years. Yet I am confident that there will be significant changes.
- The actions I've sketched out as the profession's response outline some of the most interesting changes in how our profession is handling data.
- As with any technology initiative, big data efforts need to be driven by the business. But the topics I've explored in this deck mean that a close collaboration with technology groups is even more important than usual.
- These suggest capabilities for companies to grow, and skills for individuals to acquire.
Big Data != Big Projects
Any software project should incorporate "Big Data" thinking
- Many software projects can do more to expose their data effectively.
- Look for more places where you can usefully extract data
- Collaborate closely with customers and users to explore what data is useful.
- Be careful to avoid probabilistic illusions
- Experiment with visualizations, starting with simple ones that can be built quickly
- All this requires innovative thinking, which usually comes from small, diverse teams operating within a self-adaptive process.
Some Thank-Yous
- This deck's structure was based on a talk given by Rebecca Parsons and myself at QCon London in March 2012. Rebecca did most of the work on the narrative structure for that talk, and I made use of it again for this deck.
- Ken Collier contributed important slides for agile analytics.
- The phrase "An example would be handy right about now" is of course forever associated with its coiner, Brian Marick
