Data Structures
SSTable
Sorted Strings Table (SSTable) is a persistent file format used by ScyllaDB, Apache Cassandra, and other NoSQL databases to take the in-memory data stored in memtables, order it for fast access, and store it on disk in a persistent, ordered, immutable set of files. Immutable means SSTables are never modified. They are later merged into new SSTables or deleted as data is updated. - https://www.scylladb.com/glossary/sstable/
For Things like Primary Key
LSM tree
In computer science, the log-structured merge-tree (also known as LSM tree, or LSMT) is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM trees, like other search trees, maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.
B-tree
In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems. - https://en.wikipedia.org/wiki/B-tree
B-tree vs LSM-tree
-
Because of write amplification (multiple writes to the disk over the course of the database’s lifetime), LSM-trees are probably more optimal for applications that have a lot of writes. LSM-trees also generate smaller files, and periodically remove fragmentation, which reduces the storage overhead.
-
If not setup properly setup, the compaction may also affect the writes to the database on the LSM-tree
-
Advantage of B-tree is that a key exists only once (since it overwrites it on any update).
Concepts
Indexes
- Secondary indexes may point to rows in a Heap File, this is useful for avoiding duplicatoin.
- The most common type of multi-column index is called a concactenated index, which simply combines several fields into one key by appending one column to another. For example, a last_name and name multi-column index would be (last_name, name). Similar to a phone book.
- Multi-dimensional indexes (https://www.postgresql.org/docs/current/indexes-multicolumn.html) are a more general way of querying several columns at once (e.g.
SELECT name FROM test2 WHERE major = constant AND minor = constant;
). When user is looking for example for a restaurante in a certain region, we need to filter by latitute and longitude. A standard B-tree or LSM-tree index is not able to efficiently find the data: it can give you either all the restaurants in a range of latitude, or all the restaurants in a range of longitude. One option is to translate a two-dimensional location into a single number using a space-filling curve, and then to use a regular B-tree index. There is also the option to use a specialized data structure, such as R-trees.
Terms
Back Compatibility
newer code can read data that was written by older code
Forward compatibility
older code can read data that was written by newer code
Monotonic reads
Monotonic reads ensures that if a process performs read r1, then r2, then r2 cannot observe a state prior to the writes which were reflected in r1; intuitively, reads cannot go backwards. Monotonic reads does not apply to operations performed by different processes, only reads by the same process.
Leaderless replication
- Client can write to multiple instances, if w of n instances responds with ok, the write can be considered done
- Client can read from multiple instances, and consider the biggest version number as the correct result.
- We can use quorums for writes/reads.
Sloppy Quorum
Consider a scenario where we have to assigned N nodes for a particular key. These N nodes are spread across two clusters and none of these clusters have more than W - 1 nodes for this key. Whenever we receive a request from the client to update the key, we send it to these N nodes and if we receive a confirmation from W of these N nodes, we consider the update to be successful. Now if we encounter a network failure between these two node clusters then we won’t be able to reach a quorum for all the upcoming write requests as we are unable to communicate across all N nodes and gather at least W votes to get a quorum.
To handle the above issue we have a modified version of quorum called as Sloppy Quorum. Under sloppy quorum, when we are unable to reach all the N nodes due to a partition failure, we temporarily store the updates on backup nodes. These backup nodes were not initially responsible for storing updates for the required key but they store the updates only in case of partition failure. The updates are stored along with the metadata which describes the original node that the key was required to be stored on.
In simple words:
A real-life example of sloppy quorum & hinted handoff will be if a coworker takes a message on your behalf once you are out on a break and shares the message with you once you are back. If you didn’t had such an understanding coworker with you then you might have missed the message completely. You will have to figure out on your own about what messages you missed and will be scared to leave your desk in future for a break. - https://distributed-computing-musings.com/2022/05/sloppy-quorum-and-hinted-handoff-quorum-in-the-times-of-failure/
Write-a-head logging
In computer science, write-ahead logging (WAL) is a family of techniques for providing atomicity and durability (two of the ACID properties) in database systems. A write ahead log is an append-only auxiliary disk-resident structure used for crash and transaction recovery. The changes are first recorded in the log, which must be written to stable storage, before the changes are written to the database. - https://en.wikipedia.org/wiki/Write-ahead_logging