Other notes here regarding types of databases and cloud providers for these types of databases.
http://solidfish.com/types-of-databases/
History
Relational theory
- Tuples = unordered set of attribute values (row and attribute of column in a table)
- Relation = collection of tuples and corresponding relations
- Constraints = enforce consistency; used to identify tuples and relationships between them
- Operations = joins, projectsions, unions, etc; generally based on the relations
Transaction model
Changes on a databases done with transactions, to ensure consistency and integrity. Follows the ACID model for transactions:
- Atomic = transaction is indivisible, either all parts go thru or none
- Consistent = db remains consistent state before and after transaction
- Isolated = transactions do not effect each other, it is focused on its own
- Durable = Once transaction saved, changes are expected to persists even if OS or Hardware failure
Key Differences between SQL vs NoSQL/Document Databases
Queries
SQL statements make querying relational databases very easy. Essentially we can query almost anything in the database. On the NoSQL side we would need to use things like MapReduce where we define maps in the data attributes and then reduce it.
Transactions
SQL databases provide atomicity out-of-the-box and one of the key capabilities of SQL databases. NoSQL databases need to control transactions at the application level and ensure the program implements it. It is far more complex in doing transactions in NoSQL versus SQL.
Consistency
Data consistency is far easier to control on SQL databases (its part of the “C” in the CAP Theorem that makes SQL databases beneficial). NoSQL can to a certain degree provide data consistency but there is latency. (NoSQL does the “A” in CAP Theorem).
Scalability
NoSQL has more scalability as by design as it is distributed. It easily distributes the data across multiple servers and by design makes it easy to add new nodes.
Schema
Being that NoSQL doesnt enforce relationships, its easier to expand the database. Its also is easier to handle schema changes. In SQL databases, it is tightly coupled to schema definitions and therefore changes to schema cascades through the relationships and can be disruptive to multiple tables and the overall database.
Example Use Cases for NoSQL/Document Databases
The largest benefit of NoSQL is its ability to quickly read and write large amounts of unrelational data that is constantly changing in format. This, paired with Hadoop, provides the industry’s best tool for data analytics and currently the leading tool of ‘big data’.
User Profiles
User profile information is constantly changing. Not only existing user data changing but the addition/removal of user attributes. As such, we need data model flexibility. Also, if the application is growing quickly, the user data needs to be able to scale.
Internet of Things (IoT)
As new devices are constantly introduced, we are creating new types of data. As such this requires the flexibility and scalability that NoSQL provides. Some of these devices may store a large set of data as well.
Catalogs
Any type of product catalog requires a database that can handle the dynamically changing data. All products have different attributes with states changing constantly.
Hierarchies
Data where relationships are hierarchical can be represented in non-relational databases with each record containing it’s parent OR each record containing it’s children (as a set).
Keywords
Having an array of keywords per record can be embedded in that table. Assuming the keywords do not change frequently.
Designing a Non-Relational Database
Evolution led to the storage of JSON objects; first stored on relational db but later evolved to be stored by itself without the relational db middleman – which lead to document databases. Popular ones are MongoDB and CouchBase.
Conceptually, the Document Database differs from a traditional relational database by the way the data is stored. In a relational database, the data is broken down and normalized with relationships. In a document database, the data is represented as a whole (not broken down to it’s individual relationships). When the data is represented as a whole, there is embedded data. This embedded data can be captured as a whole or referenced (like using keys from a relational database).
Embedding Data
One of the key questions to answer when designing a document database is to embed data by value or by reference. As a guideline, if the following criterias are true it is generally a good idea to embed data by value.
- The embedded data values are queried together with the parent record
- The child record (embedded data) is dependent on the parent record
- There is a 1:1 relationship between the child and parent
- There is similar volatility as the child and parent (meaning that the embedded data changes about as frequently as the parent record)
- There is a 1:few relationship between the child and parent AND similar volatility (a blog post could have a few tags associated with it and they both have similar volatility)
On the other hand, it is better to embed data by reference if the following are true:
- There is a 1:Many relationship (a blog post may have many comments associated with it, of which will change more frequently than the blog post)
- In this case it is better to break the comments data into its own table and have a field in there to reference the blog post id
- There is a Many:Many relationship
- In this case we need to choose which relationship to focus on and embed the data values into one of them. For example, if we have multiple teachers that can host multiple courses, in a relational model we would have a 3rd ‘mapping’ table that would contain a teacher_id and course_id. But in the non-relational model, we only need a single teacher table (or course table) where the courses are embedded under it.
- The child record (embedded data) changes at different volatility
- Like the first example above this should be broken into its own table with a field referencing it’s parent’s id
Normalization and De-normalization in NoSQL / Document Databases
In non-relational databases normalization rules are a bit more flexible than those in relational databases. When we have a database of teachers and courses, in the teacher’s table, instead of embedding all the course record information, we could just have the important fields related to the teacher. Then we would have another course table that can be drilled into if we need more information about the course.
Other Notes
Scaling Web 2.0
As web grew, db layer needed to keep up. Some interesting evolutions:
MySQL never scaled as good as Oracle. Not designed for multi-core processing. Some of the first tricks used by places like Facebook were Memcached.
Memcached = open source; distributed object cache; uses cache memory instead of going to database;
MySQL replication = copying of databases; Reads come from duplicates of db; Writes still need to go to master since duplication of writes (or copy of master) was not stable
Sharding
Partitioing of logical db across multiple physical servers; each partition is referred ot as a shard; Example below where shards are shown on bottom at db layer:
Sharding and Memcached are the best ways to scale a relational database. However, this is very costly and has several disadvantages:
- Application complexity = Need a dynamic routing layer that determines which shards to go to; also needs to manage the memcached objects and deal with replication.
- Crippled SQL = SQL statement cannot go across shards; queries can only be done against the whole database;
- Loss of transactional integrity = ACID transactions lost due to complexity; often not implemented or difficult to implement with sharding and memcached.
- Operational Complexity
CAP Theorem
States that in a distributed database systems, can only have two of the following:
- Consistency = every user has identical view of data at any given instand
- Availability = in event of failure, db remains operational
- Partition tolerance = when the network between the distributed systems fail, can maintain operations
Sharding
Partition data across multiple databases based on key attribute. The downside to this is that it looses ACID attributes, so it is difficult to do transactions.
Key-Value Databases
Opposite to sharding is Amazon, which needs the same scability but with strict ACID since it is dealing with transactions. This lead to them creating Dynamo db system.
References
NoSQL vs SQL
https://blog.pandorafms.org/nosql-vs-sql-key-differences/
Google I/O 2012 conference – SQL vs NoSQL
https://www.youtube.com/watch?time_continue=552&v=rRoy6I4gKWU
Other