It is interesting to rewind 15 years to when I was getting ready for a job interview. I was advised to refresh the concepts behind normalization, referential integrity, constraints, and so on. It would have been hard to imagine anyone working on databases without a solid understanding and practice of those concepts. Fast forward to today: the RDBMS is being challenged by the emergence of NoSQL, which differs from it so fundamentally that one has to unlearn much of what was learnt over the years.
NoSQL stands for Not Only SQL and represents the next generation of databases, built to support emerging needs. Relational databases introduced concepts such as strongly typed columns, tight relationships between entities, and constraints, all of which made sense when moving away from flat-file persistent stores. The digital revolution has penetrated our lives so much that more than 90% of the data generated so far has been created in the past few years. Storage costs have fallen by a factor of 300,000 in the past two decades. According to IBM, 2.5 billion gigabytes of data have been generated every day since 2012. And to make matters more interesting, over 75% of the data generated is unstructured, such as images, text, voice and video. This new context challenges the conventional way of persisting and accessing ever-growing data.
Challenges with RDBMS
The three dimensions of Big Data are Volume, Velocity and Variety. First, querying massive volumes of data to serve online channels such as web or mobile requires scaling the database to run a heavy workload. Second, in the IoT arena, millions of devices pushing data to the cloud produce a high velocity of data to be ingested and persisted. This, again, requires the database to scale to allow parallelism, sometimes on the order of a million transactions per second. Third, the RDBMS was not designed with unstructured data such as images, video and voice in mind, though it offers limited support for such data types. An RDBMS scales very well for enterprise applications. However, a scale-up architecture is fundamental to the RDBMS world, and that approach has an inherent limit: there is only a finite amount of memory and CPU one can add to a single node before having to think outside the box. Running a farm of tens or hundreds of application server nodes while still expecting to scale up a single database node is not practical. Further, with the emergence of standard data structures such as JSON and the growth of unstructured data, a database with native support for them is the need of the hour.
NoSQL is a category of databases that scale out across a large cluster, are mostly open source, and are often schema-less. Scaling out across a large cluster makes it possible to process massive amounts of data, thanks to distributed computing. A schema-less or less restrictive schema supports unstructured data and extensible data structures for ever-evolving business needs. NoSQL databases often distribute data using techniques such as sharding (partitioning the data across nodes) and replication (keeping copies of the data on more than one node). At a broad level, NoSQL databases fall into four categories:
- Key-Value databases
- Document databases
- Column-family databases
- Graph databases
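Before looking at each category, the sharding and replication mentioned above can be illustrated with a small sketch. This is only one possible scheme, assuming simple hash-modulo placement; the function names shard_for and replicas_for are made up for this illustration and do not belong to any particular database.

```python
import hashlib
from typing import List


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and so would not give a stable placement.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def replicas_for(key: str, num_shards: int, replication_factor: int) -> List[int]:
    """Return the primary shard followed by the shards holding extra copies."""
    if not 1 <= replication_factor <= num_shards:
        raise ValueError("replication_factor must be between 1 and num_shards")
    primary = shard_for(key, num_shards)
    # Copies go to the next shards in order, wrapping around at the end.
    return [(primary + i) % num_shards for i in range(replication_factor)]


if __name__ == "__main__":
    for device_id in ("device-17", "device-42", "device-99"):
        print(device_id, "->", replicas_for(device_id, num_shards=5, replication_factor=3))
```

With modulo placement, changing the number of shards moves most keys to a different shard, which is why real systems often use variations such as consistent hashing instead.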
As the name indicates, key-value databases store a value against a key, and the value can be a free-form data structure that is interpreted by the client. Clients typically fetch the value by its key. Because of this simplicity, key-value databases scale really well. Examples of key-value databases include Redis, Riak, Memcached, Berkeley DB and Couchbase.
Document databases store documents in formats such as XML, JSON and BSON as the value in a key-value store. The documents are self-describing, and their structure may be similar or entirely different from one document to the next. Document databases perform very well in content management systems and blogging platforms. Some of the popular document databases are MongoDB, CouchDB and OrientDB.
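A minimal sketch of the idea, using a plain in-memory Python dictionary rather than any real product: the KeyValueStore class and the "session:42" key are invented for illustration. The point is that the store treats the value as opaque bytes and the client decides how to interpret it.

```python
import json
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes to the store."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Lookup is by key only; the store cannot query inside the value.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
# The client chooses the encoding (JSON here); the store never inspects it.
store.put("session:42", json.dumps({"user": "asha", "cart": [101, 205]}).encode("utf-8"))

raw = store.get("session:42")
session = json.loads(raw) if raw is not None else None
print(session)
```

Because every operation touches exactly one key, the keys can be spread across many nodes with little coordination, which is where the scalability comes from.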
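To make "self-describing" concrete, here is a small sketch in plain Python, with no real document database involved. The posts collection and the find helper are hypothetical; they only show that documents in the same collection can carry different fields and still be queried by field.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]

# Each document names its own fields; no two need the same shape.
posts: List[Document] = [
    {"_id": 1, "title": "Hello NoSQL", "author": "asha", "tags": ["intro"]},
    {"_id": 2, "title": "Scaling out", "author": "ravi",
     "comments": [{"by": "asha", "text": "Nice post"}]},
    {"_id": 3, "title": "Polyglot persistence", "author": "asha"},
]


def find(collection: List[Document], **criteria: Any) -> List[Document]:
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(field in doc and doc[field] == value for field, value in criteria.items())
    ]


print([doc["title"] for doc in find(posts, author="asha")])
```

Unlike the key-value case, the database can look inside the document, so queries on fields other than the key become possible.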
Column Family databases
Column family databases store data in rows that consist of a key and a collection of columns. Related groups of columns form column families, which in the RDBMS world would typically have been broken down into multiple tables. Column family databases scale very well for massive amounts of data. However, since the design is not general-purpose, they are most effective when the common retrieval queries are known up front, while the column families are being designed. Another flexibility is that columns can vary across rows: a column can be added to any row dynamically without having to add it to the other rows. Column family databases are well suited to IoT use cases that involve ingesting high-velocity data and retrieving it at high speed for online channels. Popular column family databases include Cassandra and HBase; Amazon DynamoDB is often listed alongside them, although it is more commonly described as a key-value and document store.
Graph databases store entities (also known as nodes) and the relationships (known as edges) between them. Technically, there is no limit to the number of relationships between entities. Supporting multiple relationships and dynamic graphs in the RDBMS world would involve many schema changes, and even data migration, every time a new kind of relationship is introduced. Social media is a classic domain where graph databases excel. Popular graph databases include Neo4j and InfiniteGraph.
The choice of database really depends on the nature of the data and on the processing and retrieval needs. The emergence of NoSQL is by no means a death knell for the RDBMS. The RDBMS is here to stay and will remain relevant for many years to come. NoSQL excels in certain areas and complements the RDBMS in enterprise data management. The industry is clearly moving towards polyglot persistence, so a heterogeneous combination of database technologies within an enterprise to handle massive amounts of data is only natural.
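The row, column family and column layout can be sketched with nested dictionaries. This is a toy model, not how Cassandra or HBase are implemented; the ColumnFamilyStore class, the device IDs and the column names are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict


class ColumnFamilyStore:
    """Toy layout: row key -> column family -> column name -> value."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row_key: str, family: str, column: str, value: Any) -> None:
        self._rows[row_key][family][column] = value

    def get_family(self, row_key: str, family: str) -> Dict[str, Any]:
        # .get() avoids creating empty rows or families on a read.
        return dict(self._rows.get(row_key, {}).get(family, {}))


store = ColumnFamilyStore()
# One row per device; the "readings" family grows a column per timestamp.
store.put("device-17", "profile", "model", "TH-200")
store.put("device-17", "readings", "2016-01-01T10:00", 21.5)
store.put("device-17", "readings", "2016-01-01T10:05", 21.7)
# Another row can carry columns that the first row does not have.
store.put("device-42", "profile", "model", "TH-300")
store.put("device-42", "profile", "firmware", "1.4")

print(store.get_family("device-17", "readings"))
print(store.get_family("device-42", "profile"))
```

Reading all readings for one device is a single lookup by row key and column family, which is the kind of known-up-front query this layout is designed around.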
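As a rough sketch of the node-and-edge model, the following uses an in-memory adjacency list. The Graph class, the relationship names "FRIEND" and "FOLLOWS", and the friends_of_friends helper are illustrative assumptions, not the API of Neo4j or any other product.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class Graph:
    """Toy directed graph: node -> list of (relationship, target node)."""

    def __init__(self) -> None:
        self._edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def relate(self, source: str, relationship: str, target: str) -> None:
        # A new kind of relationship is just a new label; no schema change.
        self._edges[source].append((relationship, target))

    def neighbours(self, node: str, relationship: str) -> List[str]:
        return [t for r, t in self._edges.get(node, []) if r == relationship]


def friends_of_friends(graph: Graph, person: str) -> Set[str]:
    """People two FRIEND hops away who are not already direct friends."""
    direct = set(graph.neighbours(person, "FRIEND"))
    result: Set[str] = set()
    for friend in direct:
        for candidate in graph.neighbours(friend, "FRIEND"):
            if candidate != person and candidate not in direct:
                result.add(candidate)
    return result


g = Graph()
for a, b in (("asha", "ravi"), ("ravi", "meena"), ("ravi", "john")):
    # Friendship is mutual, so store an edge in each direction.
    g.relate(a, "FRIEND", b)
    g.relate(b, "FRIEND", a)
g.relate("asha", "FOLLOWS", "john")  # a second relationship type, added freely

print(sorted(friends_of_friends(g, "asha")))
```

In a relational schema the same traversal would need a self-join per hop, and each new relationship type would typically mean a new table or column.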