Schemaless Data Structures

In recent years, there's been an increasing amount of talk about the advantages of schemaless data. Being schemaless is one of the main reasons for interest in NoSQL databases. But there are many subtleties involved in schemalessness, both with respect to databases and in-memory data structures. These subtleties are present both in the meaning of schemaless and in the advantages and disadvantages of using a schemaless approach.

Martin Fowler

7 January 2013

Be wary of schemaless data structures, since they still have an implicit schema and, with a couple of exceptions, an explicit schema is better.

Our agenda

What we mean by “schemaless”
The notion of an implicit schema
In-memory schema
Implementing schemaless structures in relational databases
Storage and Predicate Schemas

Implicit schemas are hidden
Custom fields
Non-uniform types
Schema Migration

To understand “schemaless”, begin with a relational schema

The relational schema defines what columns appear in the table, their names, and their datatypes.

It is an error to insert data that doesn't fit the schema.

A schemaless database allows you to store any data

A schemaless database allows any data, structured with individual fields and structures, to be stored in the database.
Being schemaless reduces ceremony (you don't have to define schemas) and increases flexibility (you can store all sorts of data without prior definition).

but schemaless structures still have an implicit schema

Any code that manipulates the data needs to make some assumptions about its structure, such as the names of fields.

Any data that doesn't fit this implicit schema will not be manipulated properly, leading to errors.
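
To make that concrete, here is a minimal sketch in Python. The records and the helper names (full_name, try_full_name) are invented for illustration; the point is only that the access code carries the schema even though the store does not.

```python
# Hypothetical records from a schemaless store: nothing enforces their shape.
customers = [
    {"firstname": "Martin", "lastname": "Fowler"},
    {"first_name": "Kent", "last_name": "Beck"},  # same idea, different field names
]


def full_name(customer):
    # This line is the implicit schema: it assumes the fields are called
    # "firstname" and "lastname".
    return customer["firstname"] + " " + customer["lastname"]


def try_full_name(customer):
    # Report a record that does not fit the implicit schema instead of crashing.
    try:
        return full_name(customer)
    except KeyError as missing:
        return "missing field: " + missing.args[0]


results = [try_full_name(c) for c in customers]
```

The second record was stored without complaint, but the code cannot handle it, because it does not fit the schema that the code assumes.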

The concept of “schema” also applies in-memory

A class definition defines the logical fields you can use to manipulate it. This is effectively a schema.

The same is true of any record structure (typed or not)

A Dictionary (aka Hash | Associative Array | Map) is a common way to make a schemaless data structure in memory. The notion of an implied schema still applies: there is little difference between aCustomer.firstname and aCustomer['firstname'].

Kent Beck's essential books on object-oriented programming describe this difference as the difference between Common State and Variable State.

One object can combine a schema and schemaless access

This simple example supports Common State for properties where it's better to have a defined schema and Variable State to allow for more flexible options.
A more sophisticated implementation using reflection could allow Common State properties to be accessed through the Variable State API, so that aCustomer['firstname'] and aCustomer.firstname would access the same data. But this would be a questionable feature, since any access with literals into a defined object is a smell.
A better form of sophistication is to support a method like keys that returns Variable State keys and Common State field names.
It is often useful to wrap a schemaless data structure with defined methods. This way other parts of the software are only dependent on the methods, not the schemaless data structure.
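
One possible shape for such an object, sketched in Python. The Customer class, its field names, and the zip_code property are assumptions made for illustration, not code taken from the deck.

```python
class Customer:
    """Hypothetical customer combining Common State and Variable State."""

    def __init__(self, firstname, lastname):
        # Common State: fields every customer has, declared up front.
        self.firstname = firstname
        self.lastname = lastname
        # Variable State: ad-hoc fields held in a dictionary.
        self._extras = {}

    def __getitem__(self, key):
        # Dictionary-style read of Variable State only.
        return self._extras[key]

    def __setitem__(self, key, value):
        # Dictionary-style write of Variable State only.
        self._extras[key] = value

    def keys(self):
        # Common State field names followed by Variable State keys.
        return ["firstname", "lastname"] + sorted(self._extras)

    @property
    def zip_code(self):
        # A defined method wrapping the schemaless data, so callers depend
        # on the method rather than on the dictionary key.
        return self._extras.get("zip")


a_customer = Customer("Martin", "Fowler")
a_customer["zip"] = "02201"
```

Here a_customer.firstname uses the defined schema, a_customer["zip"] uses the flexible part, and a_customer.keys() reports both kinds of field.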

Schemaless extensions are common even with relational systems

Custom columns cause sparse tables and put a fixed limit on how many custom fields you can have.

customField_1_name	customField_1_value	customField_2_name
zip	02201

A field can have embedded structured data, such as putting JSON into a text field. Data like this is difficult to index, and requires a programming language to get at.

firstname	lastname	customData
Martin	Fowler	{'middle_initial': 'X', 'zip': '02201'}
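
A rough sketch of what getting at such data looks like, using Python's sqlite3 and json modules with an in-memory database. The table layout follows the example above, with the custom data written as valid JSON (double quotes); I'm assuming the database treats the column as plain text and ignoring any JSON functions a particular database may offer.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (firstname TEXT, lastname TEXT, customData TEXT)")
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?)",
    ("Martin", "Fowler", json.dumps({"middle_initial": "X", "zip": "02201"})),
)

# The database sees customData as opaque text, so the filtering on zip
# happens in the programming language after every row has been fetched.
rows = conn.execute("SELECT firstname, customData FROM customers").fetchall()
in_02201 = [first for first, custom in rows if json.loads(custom).get("zip") == "02201"]
```

Every row has to be read and parsed to answer a question about one embedded field, which is why data like this is awkward to index and query.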

An attribute table adds extra joins whenever you need to access the custom data.

Customers
id	firstname
1234	Martin

CustomAttributes
table	key	fieldName	fieldValue
customers	1234	zip	02201
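
The extra join might look something like this sketch, again using Python's sqlite3 module with an in-memory database. The tables mirror the example above; table and key are SQL keywords, so I've quoted them as column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (id INTEGER PRIMARY KEY, firstname TEXT);
    CREATE TABLE CustomAttributes (
        "table" TEXT, "key" INTEGER, fieldName TEXT, fieldValue TEXT
    );
    INSERT INTO Customers VALUES (1234, 'Martin');
    INSERT INTO CustomAttributes VALUES ('customers', 1234, 'zip', '02201');
""")

# Reading one custom field costs a join against the attribute table.
row = conn.execute("""
    SELECT c.firstname, a.fieldValue
    FROM Customers AS c
    JOIN CustomAttributes AS a
      ON a."table" = 'customers' AND a."key" = c.id AND a.fieldName = 'zip'
    WHERE c.id = 1234
""").fetchone()
```

Each additional custom field you want in the same result needs another join (or a pivot), which is where this approach starts to hurt.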

None of these techniques mesh well with SQL, as they all involve bending the relational schema of their host database.
But this doesn't stop them from being a way to support ad-hoc schemaless data on top of a relational schema.

The examples so far are of Storage Schemas

A storage schema defines how a storage mechanism stores data. This means that any attempt to store data that violates the storage schema fails. You can't store data unless it passes the schema.
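
As a small illustration, here is a sketch using Python's sqlite3 module with an in-memory database. The customers table and the nickname field are hypothetical; the insert that names a column outside the schema is refused, and nothing is stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (firstname TEXT NOT NULL, lastname TEXT NOT NULL)")
conn.execute("INSERT INTO customers (firstname, lastname) VALUES ('Martin', 'Fowler')")

try:
    # "nickname" is not part of the storage schema, so the store rejects the row.
    conn.execute("INSERT INTO customers (firstname, nickname) VALUES ('Kent', 'kb')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True

stored = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```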

Another form of schema is a Predicate Schema

With a predicate schema, you store whatever data you like. The schema is applied to the stored data to determine if the data matches the predicate.

XML schemas are a familiar example of a predicate schema

[Figure: an XML file shown alongside its schema, written in RELAX NG compact syntax]

An XML file is just text, so it can store anything that is text. Even a well-formed XML file can have any combination of elements. A schema can be applied to an XML file to see if that file conforms to the schema's definition.

The storage mechanism (text) is schemaless, but you can still use a predicate schema for validation.

You can have many predicate schemas for the same data structure
This allows you to use different schemas to validate data as appropriate for different contexts.
This can extend to using multiple schema languages, which is useful since one language may be better for a particular style of validation.
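
The same idea works outside XML. In this Python sketch each predicate schema is just a function that answers yes or no for a record that has already been stored. The function names and the two contexts are made up for illustration.

```python
def has_name(record):
    # Predicate schema for contexts that only need to address the customer.
    return isinstance(record.get("firstname"), str) and isinstance(
        record.get("lastname"), str
    )


def has_shipping_address(record):
    # A stricter predicate schema for a shipping context.
    return has_name(record) and isinstance(record.get("zip"), str)


# The store accepts anything; validation happens afterwards, per context.
store = []
record = {"firstname": "Martin", "lastname": "Fowler"}
store.append(record)

valid_for_greeting = has_name(record)
valid_for_shipping = has_shipping_address(record)
```

The record is stored either way. It passes the first schema and fails the second, and which result matters depends on what you are about to do with the data.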

Schemas are a mechanism for documenting a contract

Bertrand Meyer came up with the notion of thinking of a component interface as a contract: a statement about what a supplier provides and a consumer expects.
Often people think that the supplier should determine the contract, but as in commerce, the best contracts involve collaboration between a supplier and its consumers. A useful approach for this thinking is consumer-driven contracts.
This thinking is particularly important for widely exposed components, such as internet APIs.

So what are the factors we should consider between using a schema and going without?

Implicit Schemas are hidden

In order to properly manipulate data, you need to know what the schema is, so you know whether to use customer.lastname or customer.last_name.
With an explicit schema, this is easy to find.
But an implicit schema is easily scattered amongst all the code that accesses the data, making it hard to find, thus slowing down any further development on top of the data structure.
A clear data access layer and/or a predicate schema can reduce this problem.
The problems of a hidden schema are why my default preference is to have an explicit schema. However, there are plenty of cases when schemalessness can be worthwhile.

So use a schema if you can

Custom fields work best with a schemaless approach

In many situations people want to add their own fields to store data in addition to the fields defined in the product.
Usually these fields are only displayed in a UI, but they may also be used in users' ad-hoc scripts.
These fields may be individual to a user, or used by a user-group. The latter is common with product software where customers want to add specific fields.

Since custom fields aren't used by the base software (as opposed to custom scripts), they don't impose an implicit schema on that base software, so the key problem with schemalessness doesn't apply.

An alternative to using a schemaless structure is to add fields at runtime. This seems to be rarely considered seriously, but I've seen cases where it's been effective.
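
One reading of "adding fields at runtime" is altering the storage schema while the application is running. This is a sketch under that assumption, using Python's sqlite3 module; the add_custom_field helper is hypothetical, and its name check is deliberately crude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, firstname TEXT)")
conn.execute("INSERT INTO customers VALUES (1234, 'Martin')")


def add_custom_field(connection, field_name):
    # Column names cannot be bound as SQL parameters, so check the name
    # before interpolating it into the statement.
    if not field_name.isidentifier():
        raise ValueError("unsafe field name: " + field_name)
    connection.execute(f"ALTER TABLE customers ADD COLUMN {field_name} TEXT")


add_custom_field(conn, "zip")
conn.execute("UPDATE customers SET zip = '02201' WHERE id = 1234")

# The new field is now part of the explicit schema.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
```

The custom field ends up as an ordinary column, so it can be queried and indexed like any other.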

Non-uniform types are tricky for schemas

Some data types are naturally non-uniform, in that the fields differ greatly between individuals.
Events are a good example, since they may take on different attributes depending on the kind of event.

If you use a single record type you end up with lots of optional fields, and your schema may not indicate which fields should be used for each type of event.

If you use a different record type for each event, then you have lots of record types, it's difficult to operate on all events together, and the schema may not indicate which fields should be expected on all events.

Structural inheritance is very useful here: a supertype event holds common fields and subtypes carry the variations. But many datastores do not support inheritance well, and type hierarchies often get complicated and cannot express all the combinations of fields that you need.
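
In memory, structural inheritance for events might look like this Python sketch. The event kinds and field names are invented for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class Event:
    # Fields common to every event live on the supertype.
    occurred_at: str
    customer_id: int


@dataclass
class AddressChanged(Event):
    # A subtype carries only its own variation.
    new_zip: str


@dataclass
class OrderPlaced(Event):
    order_total: float


events = [
    AddressChanged("2013-01-07", 1234, "02201"),
    OrderPlaced("2013-01-08", 1234, 99.5),
]

# Code that works on all events relies only on the supertype's fields.
customer_ids = [event.customer_id for event in events]
field_names = [sorted(asdict(event)) for event in events]
```

This stays tidy with two kinds of event. It gets harder when fields combine in ways a single hierarchy cannot express, or when the datastore has no good way to hold the subtypes.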

Schemaless stores offer a pragmatic alternative

Suppose you want to change from storing a customer's name in one field to storing it in separate firstname and lastname fields.

No schema to update

However you still need to be aware of the implicit schema and make corresponding changes to any access code.

Schemaless migration is easier

Schemaless stores can ease migration, since access code can be defined to read from either a single name field or separate fields. If it also saves to only the two fields, this will gradually migrate the data.
Doing this will add complexity to the data access code, so it's usually a good idea to remove the code that supports the old data structure once the data has been migrated.
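
A sketch of such access code in Python. The read_name and save_name helpers are hypothetical, and splitting the old name on its first space is a simplifying assumption that real names will often break.

```python
def read_name(record):
    # Prefer the new structure; fall back to the old single "name" field.
    if "firstname" in record and "lastname" in record:
        return record["firstname"], record["lastname"]
    first, _, last = record["name"].partition(" ")
    return first, last


def save_name(record, firstname, lastname):
    # Always write the new structure and drop the old field, so each save
    # migrates one more record.
    updated = {key: value for key, value in record.items() if key != "name"}
    updated["firstname"] = firstname
    updated["lastname"] = lastname
    return updated


old_record = {"id": 1234, "name": "Martin Fowler"}
migrated = save_name(old_record, *read_name(old_record))
```

Records that are never read and saved again stay in the old form, so the fallback branch in read_name can only be deleted after something has swept up the stragglers.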

Schemaless migration still requires care

While many people are sadly unfamiliar with evolutionary database design, Thoughtworks teams have been continually migrating relational databases for many years, including production databases.

Similar migration techniques are usually required with schemaless databases too, although flexible access code does help.

Custom fields and non-uniform types are both good reasons to use a schemaless approach. Schema migration is a much less compelling reason. But often it's not a choice, in that factors other than schemalessness will dominate your technology decisions.

When using a schemaless approach, consider using a clear data access layer and/or a predicate schema to reduce the problems of a hidden, implicit schema.