DBMS revisit

Database Storage

  1. Slotted Page database - Relational - MySQL, PostgreSQL & Non-Relational - MongoDB

  2. Log-Structured database - Write-heavy operations - Cassandra, Bigtable, etc.

Slotted Page Database Management System

In a slotted page database, data is stored in fixed-size blocks on disk called pages. Each page keeps a slot directory in its header: an array of small, fixed-size entries, each recording the offset (and typically the length) of one record. The records themselves are variable-length and are packed from the end of the page, so the directory and the record area grow toward each other. When a new record is inserted, the DBMS adds a slot entry and writes the record into the page's free space; if the page cannot fit it, the record goes to another page (allocating a new one if necessary). Because external record IDs refer to slots rather than byte offsets, records can be moved within a page (for example, during defragmentation) without invalidating references to them.
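The layout above can be sketched in a few lines. This is a minimal illustration, not any real engine's on-disk format: the header fields, the slot encoding, and the `SlottedPage` class are all assumptions chosen for clarity.

```python
# Minimal slotted-page sketch: header and slot directory grow forward
# from the start of the page; variable-length records grow backward
# from the end. (Illustrative layout, not a real DBMS format.)
import struct

PAGE_SIZE = 4096
HEADER = struct.Struct("<HH")   # (slot_count, free_space_end)
SLOT = struct.Struct("<HH")     # (record_offset, record_length)

class SlottedPage:
    def __init__(self):
        self.buf = bytearray(PAGE_SIZE)
        self.slot_count = 0
        self.free_end = PAGE_SIZE  # records are packed from the back

    def _free_space(self):
        # Gap between the end of the slot directory and the record area.
        return self.free_end - (HEADER.size + self.slot_count * SLOT.size)

    def insert(self, record: bytes):
        """Place a record in the page; return its slot id, or None if full."""
        if len(record) + SLOT.size > self._free_space():
            return None  # caller must try another (possibly new) page
        self.free_end -= len(record)
        self.buf[self.free_end:self.free_end + len(record)] = record
        slot_id = self.slot_count
        SLOT.pack_into(self.buf, HEADER.size + slot_id * SLOT.size,
                       self.free_end, len(record))
        self.slot_count += 1
        HEADER.pack_into(self.buf, 0, self.slot_count, self.free_end)
        return slot_id

    def read(self, slot_id: int) -> bytes:
        off, length = SLOT.unpack_from(self.buf,
                                       HEADER.size + slot_id * SLOT.size)
        return bytes(self.buf[off:off + length])

page = SlottedPage()
sid = page.insert(b"alice,30")
print(page.read(sid))  # b'alice,30'
```

Note how the caller only ever sees the slot id: the page is free to shuffle record bytes around as long as it updates the directory.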

History

  1. 1970s - Relational databases: Edgar F. Codd introduced the concept of a relational database, which used a tabular data model grounded in set theory and relational algebra. Relational databases went on to become the dominant database technology of the 1980s.

  2. 1980s - Multi-version concurrency control (MVCC): MVCC was introduced as a technique for handling concurrency in relational databases. It allowed multiple transactions to access the same data simultaneously without blocking each other.

  3. Slotted pages were introduced as a way to optimize the storage and retrieval of variable-length data in relational databases. Before slotted pages, many systems used fixed-length records, which wasted space whenever a record did not completely fill its allocated area. A slotted page holds multiple variable-length records, with a fixed-size slot entry in the page's directory pointing to each record. This reduced wasted space and improved performance by allowing the database to locate, and relocate, records within a page efficiently.

    The slotted page architecture quickly became popular and was adopted by many relational databases, including Microsoft SQL Server, Oracle Database, and IBM DB2. It remains a common feature of modern relational databases, as well as many NoSQL and cloud databases.

Log-Structured Database Management System

A Log-Structured Database Management System (DBMS) uses a log-based approach to store and manage data. In a Log-Structured DBMS, data is organized into sequential log files, which serve both as the write-ahead log and as the primary storage. Here's a closer look at the components of a Log-Structured DBMS:

  1. Write-Ahead Logging: In a Log-Structured DBMS, all changes to the database are first written to a log file before being applied to the main database. This write-ahead logging ensures that changes can be recovered in the event of a failure.

  2. Log-Structured Storage: Instead of using a traditional page-based storage model, a Log-Structured DBMS stores data in log files that are appended to sequentially. As new data is written, it is appended to the end of the log file, and older data is removed during garbage collection.

  3. Garbage Collection: Garbage collection (often called compaction) is the process of reclaiming space in the log files that is no longer needed. In a Log-Structured DBMS, it is performed by merging log files together and discarding records that have been superseded or deleted. This process can be resource-intensive, so it is typically performed in the background.

  4. Indexing: To support efficient queries, a Log-Structured DBMS may use indexing structures like B-trees or Bloom filters to map data to specific locations in the log files.

  5. Recovery: In the event of a failure, a Log-Structured DBMS can use the write-ahead log to recover data. The log is replayed from the last consistent checkpoint to the point of failure, ensuring that all changes are applied in the correct order.
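Components 1-3 and 5 above can be sketched together as a tiny append-only key-value store. This is an illustrative toy, not a real engine: the `LogStore` class, its record format, and its single-file compaction are simplifying assumptions. Writes append to a sequential log, an in-memory index maps each key to its latest offset, recovery replays the log, and compaction rewrites only live records.

```python
# Toy log-structured key-value store: sequential append-only log,
# in-memory index, replay-based recovery, and rewrite-style compaction.
import os, struct, tempfile

LEN = struct.Struct("<II")  # per-record header: (key_length, value_length)

class LogStore:
    def __init__(self, path):
        self.path = path
        self.index = {}            # key -> offset of the latest record
        open(path, "ab").close()   # make sure the log file exists
        self._recover()

    def _recover(self):
        """Replay the log sequentially, as a DBMS does after a crash."""
        with open(self.path, "rb") as f:
            while True:
                off = f.tell()
                hdr = f.read(LEN.size)
                if len(hdr) < LEN.size:
                    break
                klen, vlen = LEN.unpack(hdr)
                key = f.read(klen)
                f.seek(vlen, os.SEEK_CUR)
                self.index[key] = off  # later records win

    def put(self, key: bytes, value: bytes):
        with open(self.path, "ab") as f:
            off = f.tell()
            f.write(LEN.pack(len(key), len(value)) + key + value)
        self.index[key] = off

    def get(self, key: bytes):
        off = self.index.get(key)
        if off is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(off)
            klen, vlen = LEN.unpack(f.read(LEN.size))
            f.seek(klen, os.SEEK_CUR)
            return f.read(vlen)

    def compact(self):
        """Garbage-collect stale versions by rewriting only live records."""
        live = {k: self.get(k) for k in self.index}
        os.remove(self.path)
        self.index.clear()
        open(self.path, "ab").close()
        for k, v in live.items():
            self.put(k, v)

path = os.path.join(tempfile.mkdtemp(), "data.log")
db = LogStore(path)
db.put(b"user:1", b"alice")
db.put(b"user:1", b"bob")      # supersedes the earlier version
db.compact()                   # drops the stale b"alice" record
print(db.get(b"user:1"))       # b'bob'
```

A real engine would segment the log, batch writes, and fsync on commit; the point here is only the append/index/replay/compact cycle.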

Log-Structured DBMSs are often used in applications that require high write throughput, such as logging, time-series, analytics, and data warehousing workloads. The tradeoff is read amplification: a key's data may be spread across several log segments, so they can be less suitable for applications that demand consistently low-latency point queries or have strict transactional requirements.
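On the indexing component above: LSM-style engines often keep a Bloom filter per log segment so a read can skip segments that definitely do not contain a key. The sketch below is a toy (the bit-array size and hash scheme are arbitrary choices, not taken from any real system).

```python
# Toy Bloom filter: a probabilistic set that can answer "definitely not
# present" with no false negatives, at the cost of rare false positives.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add(b"user:1")
print(bf.might_contain(b"user:1"))  # True
```

A negative answer lets the engine skip a disk read entirely, which is why filters sit in front of each on-disk segment in systems like Cassandra and RocksDB.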

History

  1. 1991: Log-structured storage was introduced by Mendel Rosenblum and John Ousterhout in their paper "The Design and Implementation of a Log-Structured File System" (presented at SOSP 1991). The paper described a file system that writes all data sequentially to a log, which provided faster write performance and fast crash recovery.

  2. 1996: Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil published a paper titled "The Log-Structured Merge-Tree (LSM-Tree)," which introduced a new data structure for efficient disk-based indexing. The LSM-Tree was based on the principles of log-structured storage: writes accumulate in an in-memory component that is periodically merged into larger sorted components on disk, providing a way to efficiently store and retrieve large amounts of data.

  3. 1997: Jim Gray and Goetz Graefe published a paper titled "The Five Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb," revisiting the "five minute rule" for disk storage. The rule states that data accessed more frequently than about once every five minutes is worth keeping in memory, while less frequently accessed data should stay on disk. These economics inform the memory/disk tradeoffs in log-structured designs.

  4. 2006: Google published the Bigtable paper. Bigtable buffers incoming writes in memory and flushes them to immutable, sorted SSTable files on disk, which background compactions merge together - a log-structured design built to manage large-scale web data with high write throughput and scalability, and one that heavily influenced later systems.

  5. 2008: Facebook open-sourced Cassandra, a distributed NoSQL database that uses an LSM-tree-based storage engine and is designed to provide high write throughput and fault tolerance. It became an Apache project the following year.

  6. 2013: Facebook open-sourced RocksDB, an embeddable key-value store (a fork of Google's LevelDB) that uses an LSM-tree for storage. RocksDB is designed for high write throughput and low latency, and is used as a storage engine by many large-scale systems.

  7. 2015: ScyllaDB, a distributed NoSQL database that uses an LSM-tree for storage, was released. ScyllaDB is designed to be compatible with Apache Cassandra, but with improved performance and scalability.

  8. 2015: Google released Cloud Bigtable, a fully managed NoSQL database service that uses an LSM-tree for storage. Cloud Bigtable is designed to provide high performance and scalability for large-scale analytics and data processing workloads.

  9. 2017: Microsoft released Azure Cosmos DB, a globally distributed NoSQL database service that uses a multi-model approach to support multiple data models and APIs. Azure Cosmos DB is designed to provide high availability and low latency for globally distributed applications.

  10. 2017: YugabyteDB, a distributed SQL database that uses an LSM-tree for storage, was released. YugabyteDB is designed to provide high performance, scalability, and fault tolerance, and is used by many modern cloud-native applications.

  11. 2019: FaunaDB, a distributed multi-model database that uses an LSM-tree for storage, was released. FaunaDB is designed to provide strong consistency, low latency, and global scalability, and is used by many modern applications, including Netlify and Remind.

  12. 2019: Amazon released Amazon QLDB, a fully managed ledger database that uses a log-structured approach to provide a transparent, immutable, and cryptographically verifiable transaction log. Amazon QLDB is designed for applications that require an immutable audit trail, such as financial and supply chain applications.

Since then, log-structured DBMSs have continued to evolve and gain popularity, particularly in write-heavy applications such as analytics, search, and data warehousing.