Sitecore xDB: A Technical Introduction To MongoDB

Big Changes In Sitecore 7.5

Let's talk about Sitecore 7.5 and MongoDB.

MongoDB is the backbone of Sitecore 7.5's new xDB data-engine. It's central to the re-design of Sitecore's DMS. The hype says it thrives with large data-sets, has excellent performance and scales horizontally with ease.

Let's break down the MongoDB hype and understand what we're working with.

NoSQL Databases (MongoDB) vs Relational Databases (RDBMS)?

What are the chances that a data storage technology developed in 1974 would be the best fit to solve 2014's problems?

NoSQL is an attempt to re-think aspects of data storage and access for our new distributed, granular, real-time, high-volume world which values versatility, simplicity and scalability.

NoSQL databases are built-for-purpose. They abandon the generic table / row / column idea of relational databases and store data as documents (MongoDB), key-value pairs (Redis) or graphs (Neo4j).

For example, a document database stores complete objects, not a series of tables joined to create a complete view of the data.

NoSQL databases use objects and APIs (like a service) to perform queries instead of SQL (Structured Query Language). This more literal approach can be easier to understand than the oft-broken English of SQL.

With regard to scaling, NoSQL databases are intended to scale horizontally. This allows one large database (1 x 1 GB) to be transparently sharded into a series of smaller databases (20 x 50 MB). Small databases keep the queries fast, even with terabytes of data. This largely decouples database size from performance, a coupling that plagues vertically scaled relational databases.

Overview of MongoDB

MongoDB is a document database. It abandons the RDBMS (relational database management system) concept of tables and rows for collections and documents.

Think of a collection as a table.

Think of a document as a JSON object.

A JSON object can contain key-values, key-arrays and further nested JSON objects. This allows documents to have a very rich structure.

Documents typically represent complete structures of data: something akin to what you'd expect an ORM to return, rather than a raw database row. Take this example of a document representing a blog post:


{
    "title" : "MongoDB Explained",
    "author" : "Dan Cruickshank",
    "date" : "07/30/2014",
    "content" : "Many words go here.",
    "pageViews" : 100,
    "comments" : [
        {
            "name" : "Andre 3000",
            "comment" : "Thanks for this post.  It was mildly helpful."
        },
        {
            "name" : "Big Boi",
            "comment" : "Our reunion tour isn't coming to Calgary"
        }
    ],
    "tags" : ["mongo", "nosql", "xdb", "sitecore"]
}
 

In a relational database, this single document would require joins across posts, authors, commenters, comments and tags tables.

We replace this with a single document of tiered objects and mixed-types. Having each entry be a complete picture of the data, as opposed to spread across many tables, is very advantageous for scaling.
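
To make the contrast concrete, here is a minimal sketch in plain JavaScript (no database required). The table arrays, the ids and the `assemblePost` helper are hypothetical, invented for illustration. The point is only how much stitching the normalized layout needs compared with reading one document.

```javascript
// Normalized, relational-style layout: one array per "table" (hypothetical data).
const authors = [{ id: 1, name: "Dan Cruickshank" }];
const posts = [{ id: 10, authorId: 1, title: "MongoDB Explained" }];
const comments = [
  { postId: 10, name: "Andre 3000", comment: "Thanks for this post." },
  { postId: 10, name: "Big Boi", comment: "Our reunion tour isn't coming to Calgary" }
];
const tags = [
  { postId: 10, tag: "mongo" },
  { postId: 10, tag: "nosql" }
];

// The application (or the SQL engine) has to join the rows back together.
function assemblePost(postId) {
  const post = posts.find(p => p.id === postId);
  const author = authors.find(a => a.id === post.authorId);
  return {
    title: post.title,
    author: author.name,
    comments: comments
      .filter(c => c.postId === postId)
      .map(c => ({ name: c.name, comment: c.comment })),
    tags: tags.filter(t => t.postId === postId).map(t => t.tag)
  };
}

// Document layout: the same post is already stored in its final shape.
const postDocument = {
  title: "MongoDB Explained",
  author: "Dan Cruickshank",
  comments: [
    { name: "Andre 3000", comment: "Thanks for this post." },
    { name: "Big Boi", comment: "Our reunion tour isn't coming to Calgary" }
  ],
  tags: ["mongo", "nosql"]
};

console.log(JSON.stringify(assemblePost(10)) === JSON.stringify(postDocument)); // true
```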

Why Is Sitecore xDB Using MongoDB?

Analytics data consists of rapidly captured discrete interactions. That data is processed to create a larger picture of how users interact with our site.

MongoDB's ability to quickly capture and scale high volumes of data makes it a perfect fit for the data-aggregation component of Sitecore 7.5's xDB.

Here are a few key ways Sitecore 7.5 benefits from MongoDB:

  • Horizontal-Scaling: Not only does this allow Sitecore to handle extremely large datasets (think petabytes), it also allows our analytics database to gracefully scale from 500 MB to 5 GB to 500 GB.
  • Dynamic Schemas: Allows Sitecore to dynamically structure analytics data while simultaneously reducing application and database complexity.
  • No RDBMS Bottlenecks: Relational databases created a technical bottleneck in the volume and speed of analytics data that could be effectively processed.
  • Query Unstructured Data: The ability to query against unstructured datasets allows for versatile analytics reporting with minimal application complexity.

The Large Hadron Collider at CERN uses MongoDB for data aggregation - why can't Sitecore 7.5?

What Role Does MongoDB Play in xDB Architecture?

Like at CERN, MongoDB is used to collect all of the analytical data. Once analytics are gathered, they are processed and the results are stored in a relational database, better suited for large-scale reporting.
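
The collect-then-process flow can be sketched in plain JavaScript. This illustrates the idea only and is not Sitecore's actual processing code: the `interactions` array, its field names and the `summarizeByPage` helper are all hypothetical.

```javascript
// Raw interaction documents, as they might be captured one visit at a time.
const interactions = [
  { contactId: "c1", page: "/home", durationMs: 1200 },
  { contactId: "c2", page: "/home", durationMs: 800 },
  { contactId: "c1", page: "/pricing", durationMs: 3000 }
];

// Processing step: roll the raw documents up into flat, report-friendly rows,
// the kind of shape that fits naturally in a relational reporting table.
function summarizeByPage(docs) {
  const rows = new Map();
  for (const doc of docs) {
    const row = rows.get(doc.page) || { page: doc.page, visits: 0, totalMs: 0 };
    row.visits += 1;
    row.totalMs += doc.durationMs;
    rows.set(doc.page, row);
  }
  return Array.from(rows.values());
}

console.log(summarizeByPage(interactions));
```

Each output row has the same fixed columns (`page`, `visits`, `totalMs`), which is why the processed results suit a relational store even though the raw captures do not.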

Dynamic Schemas

MongoDB uses dynamic schemas. Data does not require a fixed structure. Any combination of JSON objects can be stored and queried in the same collection (a table, in relational-database terms).

Your application can shape its data on-the-fly. This freedom leads to much faster prototyping and development cycles.

Dynamic schemas also remove the need for time-consuming data and schema migrations.
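
As a rough illustration, the plain JavaScript below treats an array as a stand-in for a collection. The `visits` data and its field names are hypothetical. It shows differently shaped documents living side by side, and a new field appearing without any migration.

```javascript
// One "collection" holding documents with different shapes (hypothetical data).
const visits = [
  { page: "/home", browser: "Chrome" },
  { page: "/contact", form: { name: "Ann", email: "ann@example.com" } },
  { page: "/shop", cart: ["sku-1", "sku-2"], campaign: "summer" }
];

// A query simply skips documents that lack the field it asks about.
const withCart = visits.filter(doc => Array.isArray(doc.cart));
const formEmails = visits
  .filter(doc => doc.form !== undefined)
  .map(doc => doc.form.email);

// A new field can appear on new documents without migrating the old ones.
visits.push({ page: "/home", browser: "Firefox", referrer: "newsletter" });

console.log(withCart.length, formEmails, visits.length);
```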

Queries

SQL has been replaced with object-based API queries. People say this is a more natural approach than SQL. I agree. Here are queries against a blog collection (think table):

By author:


db.blog.find( { author: "Dan Cruickshank" } )
 

By author with tags:


db.blog.find( {
    author: "Dan Cruickshank",
    tags : { $in : ["xdb", "sitecore"] }
} )

By author with more than 50 page views:


db.blog.find( {
    author: "Dan Cruickshank",
    pageViews : { $gt : 50 }
} )
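
To show what these filter objects mean, here is a small in-memory matcher in plain JavaScript. It is a teaching sketch, not MongoDB's implementation: the `matches` function and the sample `blog` array are hypothetical, and only equality, `$in` and `$gt` are covered.

```javascript
// Returns true when `doc` satisfies every condition in `filter`.
function matches(doc, filter) {
  return Object.keys(filter).every(field => {
    const condition = filter[field];
    const value = doc[field];
    // An array field matches if any of its elements matches.
    const candidates = Array.isArray(value) ? value : [value];

    if (condition !== null && typeof condition === "object" && !Array.isArray(condition)) {
      return Object.keys(condition).every(operator => {
        const operand = condition[operator];
        if (operator === "$in") return candidates.some(v => operand.includes(v));
        if (operator === "$gt") return candidates.some(v => v > operand);
        throw new Error("Unsupported operator: " + operator);
      });
    }
    // Plain value: equality test.
    return candidates.some(v => v === condition);
  });
}

const blog = [
  { author: "Dan Cruickshank", tags: ["mongo", "xdb"], pageViews: 100 },
  { author: "Dan Cruickshank", tags: ["nosql"], pageViews: 20 },
  { author: "Someone Else", tags: ["sitecore"], pageViews: 500 }
];

const tagged = blog.filter(doc =>
  matches(doc, { author: "Dan Cruickshank", tags: { $in: ["xdb", "sitecore"] } }));
const popular = blog.filter(doc =>
  matches(doc, { author: "Dan Cruickshank", pageViews: { $gt: 50 } }));

console.log(tagged.length, popular.length); // 1 1
```

Conditions on different fields are combined with a logical AND, which is why the third post is excluded from both results despite matching on tags and page views.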
 

Insert a new post:


db.blog.save( { author: "Dan Cruickshank", title : "How to Mongo" });
 

Update a post:


// if a document with this _id exists, save() replaces it; otherwise it inserts
db.blog.save( { _id: 101, author: "Dan Cruickshank", title : "How to Mongo Again" });
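
The insert-or-replace behaviour of `save` can be sketched with a plain JavaScript `Map` keyed by `_id`. The `save` function and the id counter below are hypothetical stand-ins, not the shell's implementation (real MongoDB generates an ObjectId, not a counter).

```javascript
const collection = new Map(); // _id -> document
let nextId = 1;               // stand-in for MongoDB's generated ObjectId

function save(doc) {
  // No _id supplied: assign one, so this is always an insert.
  const id = doc._id !== undefined ? doc._id : nextId++;
  const existed = collection.has(id);
  // The stored document is replaced wholesale, not merged field by field.
  collection.set(id, Object.assign({}, doc, { _id: id }));
  return existed ? "replaced" : "inserted";
}

console.log(save({ _id: 101, author: "Dan Cruickshank", title: "How to Mongo" })); // inserted
console.log(save({ _id: 101, title: "How to Mongo Again" }));                      // replaced
console.log(collection.get(101).author); // undefined: the old field is gone
```

Because the whole document is replaced, any field you want to keep has to be included again in the second call.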
 

MongoDB Speaks Your Language

MongoDB communicates using JSON. JSON is the data format of choice for the modern web. The fact that a website and a database can now communicate with each other in the same format feels like a small miracle.

MongoDB supports a REST API through various wrappers, and C# through a variety of MongoDB drivers.

REST APIs are always attractive and the entity mapping built into the C# driver gives you plenty to look forward to as well.

Scaling

MongoDB scales horizontally through a technique known as sharding. When scaled, a MongoDB database consists of many shards. A 5 GB database may consist of 50 x 100 MB shards.

Each shard is a complete database that contains unique data. Data is not replicated between shards.

As shards are added or removed from a database, MongoDB manages the migration of documents between the existing shards until an equilibrium between shards is reached.

The small shard size ensures MongoDB stays extremely performant by working with small datasets even when the total database may be terabytes.
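
Here is a rough sketch of the routing idea in plain JavaScript, assuming hash-based distribution on a shard key. MongoDB's real balancer works with chunks of key ranges and is considerably more involved. The `hashKey` and `shardFor` functions, the shard count and the sample documents are all hypothetical.

```javascript
// Simple deterministic string hash (illustrative, not MongoDB's hash function).
function hashKey(key) {
  let hash = 0;
  for (const ch of String(key)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep it an unsigned 32-bit int
  }
  return hash;
}

// Every document with the same shard key always lands on the same shard.
function shardFor(key, shardCount) {
  return hashKey(key) % shardCount;
}

const shardCount = 4;
const shards = Array.from({ length: shardCount }, () => []);

for (let i = 0; i < 1000; i++) {
  const doc = { _id: "contact-" + i, visits: i };
  shards[shardFor(doc._id, shardCount)].push(doc);
}

// A lookup by shard key only has to search one small shard.
const target = shards[shardFor("contact-42", shardCount)];
const found = target.find(doc => doc._id === "contact-42");

console.log(shards.map(s => s.length), found.visits);
```

The lookup touches a fraction of the documents because the key alone determines which shard to search.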

Replication

To increase availability, MongoDB supports replication. Replication promotes high availability, while scaling through sharding promotes performance.

Automatic fail-over between MongoDB replicated instances is supported.

Transactions

There are no transactions. But there is the ability to do atomic updates, a property called atomicity.

MongoDB's atomicity still provides guaranteed all-or-nothing updates like transactions. However, atomicity is only guaranteed for one document at a time. If 10 documents need to be updated or created, that is 10 atomic actions, not 1.

That said, what used to be stored across many tables is now stored in 1 document in MongoDB. So atomic updates do provide a great deal of reliability.

You can also create your own transactions using the two-phase commit pattern.
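
The two-phase commit pattern can be sketched in plain JavaScript, with `Map` objects standing in for collections. This is a simplified illustration under stated assumptions, not production code: the account data, the `transfer` and `applyTo` helpers and the state names are illustrative, and crash recovery is only described, not implemented.

```javascript
// In-memory stand-ins for two collections (hypothetical data).
const accounts = new Map([
  ["A", { _id: "A", balance: 100, pendingTransactions: [] }],
  ["B", { _id: "B", balance: 20, pendingTransactions: [] }]
]);
const transactions = new Map();

function transfer(txId, from, to, amount) {
  // 1. Record the intent in its own document.
  const tx = { _id: txId, from, to, amount, state: "initial" };
  transactions.set(txId, tx);

  // 2. Mark it pending, then apply it to each account exactly once.
  tx.state = "pending";
  applyTo(accounts.get(from), txId, -amount);
  applyTo(accounts.get(to), txId, amount);

  // 3. Mark it applied, clear the pending markers, then mark it done.
  tx.state = "applied";
  for (const id of [from, to]) {
    const account = accounts.get(id);
    account.pendingTransactions = account.pendingTransactions.filter(t => t !== txId);
  }
  tx.state = "done";
}

// Each call is one single-document update. The pending marker makes it safe to
// re-run after a crash: an account that already has the marker is skipped.
function applyTo(account, txId, delta) {
  if (account.pendingTransactions.includes(txId)) return;
  account.balance += delta;
  account.pendingTransactions.push(txId);
}

transfer("tx-1", "A", "B", 30);
console.log(accounts.get("A").balance, accounts.get("B").balance); // 70 50
console.log(transactions.get("tx-1").state); // done
```

A recovery job would look for transaction documents stuck in the `pending` or `applied` state and resume from that step.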

When Would I Use MongoDB?

MongoDB feels like a home run any time you're gathering data. Behaviours. Form inputs. Shopping carts. Also, if you're working in JavaScript (either in the browser or Node.js), the synergy MongoDB provides by using JSON is very beneficial.

MongoDB also has geospatial querying abilities. It can query by distance and containment on a flat 2D plane or on a sphere (for latitude and longitude). If your application is location-aware, it's something to consider.
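
The distance arithmetic behind a "find everything near this point" query can be sketched in plain JavaScript. MongoDB answers such queries with geospatial indexes rather than the brute-force scan shown here. The `haversineKm` helper, the venue list and the 300 km radius are hypothetical, and the coordinates are approximate city centres.

```javascript
const EARTH_RADIUS_KM = 6371;

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two { lat, lon } points, in kilometres.
function haversineKm(a, b) {
  const dLat = toRadians(b.lat - a.lat);
  const dLon = toRadians(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Hypothetical venue documents.
const venues = [
  { name: "Edmonton", lat: 53.5461, lon: -113.4938 },
  { name: "Vancouver", lat: 49.2827, lon: -123.1207 }
];
const calgary = { lat: 51.0447, lon: -114.0719 };

// "Everything within 300 km of Calgary", done by brute force.
const nearby = venues.filter(v => haversineKm(calgary, v) <= 300).map(v => v.name);

console.log(nearby); // [ 'Edmonton' ]
```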

For most things it's more than capable of replacing your current RDBMS. Preference and existing infrastructure should surely play a role.

Who Is Using MongoDB?

From Wikipedia:

  • CERN uses MongoDB as the primary back-end for the Data Aggregation System for the Large Hadron Collider.
  • MetLife uses MongoDB for "The Wall", a customer service application providing a "360-degree view" of MetLife customers (billions of documents in MongoDB).
  • SAP uses MongoDB in the SAP PaaS.
  • Forbes stores articles and companies data in MongoDB.
  • The New York Times uses MongoDB in its form-building application for photo submissions.
  • Shutterfly uses MongoDB for its photo platform. As of 2013, the photo platform stores 18 billion photos uploaded by Shutterfly's 7 million users.
  • The Guardian uses MongoDB for its identity system.
  • Foursquare deploys MongoDB on Amazon AWS to store venues and user check-ins into venues.
  • eBay uses MongoDB in the search suggestion and the internal Cloud Manager State Hub.

Not to mention Craigslist, IBM, Salesforce, ADP, Under Armour and eHarmony. The list goes on and on. Suffice it to say, this is mainstream technology.

Other MongoDB Notes

Geospatial Indexes. I've mentioned this already, but querying using distance / space is very handy.

Upserts. When saving a document, MongoDB will create or update it depending on whether it already exists.

BSON. Internally MongoDB uses the BSON (Binary JSON) format for its data. It's derived from JSON but adds more types to your document for richer querying functionality. Some of the additional supported types include Binary data, Timestamp, JavaScript and Symbol.

Closing

MongoDB and Sitecore 7.5 xDB are a perfect match of technology and function. MongoDB's speed, scalability and ease of use put Sitecore in an industry-leading position to capture the next generation of analytics.

This article was authored using Markdown for Sitecore.
