Relational vs Non-Relational Databases

History

 

Relational theory

  • Tuples = unordered set of attribute values (a row; each value corresponds to a column/attribute of the table)
  • Relation = collection of tuples that share the same attributes (a table)
  • Constraints = enforce consistency; used to identify tuples and the relationships between them
  • Operations = joins, projections, unions, etc.; generally operate on relations

 

Transaction model

Changes to a database are made with transactions to ensure consistency and integrity. Transactions follow the ACID model (see the sketch after the list):

  • Atomic = the transaction is indivisible; either all parts go through or none do
  • Consistent = the db remains in a consistent state before and after the transaction
  • Isolated = transactions do not affect each other; each sees only its own work
  • Durable = once a transaction is committed, its changes are expected to persist even through an OS or hardware failure
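A minimal sketch of atomicity and durability using Python's sqlite3 module (the accounts table and the transfer amounts are made up for illustration):

import sqlite3

# Minimal sketch: two updates grouped into one transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Atomic: either both updates persist or neither does.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()      # durable once committed
except sqlite3.Error:
    conn.rollback()    # undo every change made inside the transaction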

 

 

 

Google, Big Data, Hadoop

Using Memcached (Facebook) – caching of data at the outer endpoints

 

Sharding

Partition data across multiple databases based on a key attribute. The downside is that ACID properties are lost across shards, so transactions become difficult.

 

Key-Value Databases

At the opposite end from sharding is Amazon, which needed the same scalability but with strict ACID guarantees since it is dealing with transactions. This led them to create the Dynamo db system.

 

Document Databases

Evolution led to the storage of JSON objects; at first they were stored in a relational db, but later they came to be stored on their own without the relational db middleman – which led to document databases. Popular ones are MongoDB and Couchbase.

 

Big Data

More Data = store and process all data, including machine generated, multimedia, social networking, transactional…

More Effective = machine learning, predictive analytics, collective intelligence; allows more value to be extracted from the data

 

Google File System (GFS)

Custom distributed cluster file system that Google developed and uses in its modular data centers.

 

MapReduce

Nonrelational processing system that uses GFS for storage. A programming model that parallelizes data-intensive processing using a mapping phase and a reduce phase:

  • Mapping = data is broken into chunks that can be processed by separate threads
  • Reduce = combines the output from the mappers into the final result

The canonical example of MapReduce is the word-count program:
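The original illustration is not reproduced here; below is a minimal pure-Python sketch of the idea (map emits (word, 1) pairs per chunk, reduce sums the counts per word):

from collections import defaultdict

# Map phase: each chunk of input is turned into (word, 1) pairs independently,
# so separate mappers could run in parallel on different chunks.
def map_phase(chunk):
    return [(word.lower(), 1) for word in chunk.split()]

# Reduce phase: pairs are grouped by key and the counts are summed.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}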

 

Hadoop

Used by Yahoo! in 2008; used by Facebook in 2012.

 

Benefits:

  • Economical, scalable storage model
  • Massive, scalable IO capability
  • Reliability
  • Scalable processing model
  • Schema on read

 

Hadoop runs on the Hadoop Distributed File System (HDFS), which is similar to GFS.

 

Each server has a TaskTracker and a DataNode, which means it both stores and processes data. Above them is a dedicated JobTracker server used by MapReduce; the JobTracker works with each of the TaskTrackers. Also above is the NameNode, which works with each of the DataNodes on the servers. The NameNode only does coordination; it does not handle the actual data.

 

 

 

MapReduce is fronted by products like Pig and Hive. These act like compilers: they take the user's query and translate it down to MapReduce jobs.

 

The top level overview of Hadoop:

 

 

HBase

First formal NoSQL database

 

HBase uses the Hadoop HDFS file system to store its data, similar to how MySQL uses MyISAM to store its data on the file system. Though HDFS is data-structure agnostic, HBase does enforce structure on the objects:

  • Columns
  • Rows
  • Tables
  • Keys

HBase records can each have their own set of columns. Below is an example of how relational-to-nonrelational mapping works:
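The original mapping table is not reproduced here; the following is a rough, hypothetical sketch (not the HBase API) of how one relational row might look as an HBase-style row with a row key and column-family:qualifier columns:

# Hypothetical illustration only: a relational row vs. the same record keyed
# by row key, with columns grouped into column families, HBase-style.
relational_row = {"movie_id": 101, "title": "The Matrix", "year": 1999, "rating": 8.7}

hbase_style_row = {
    "row_key": "movie:101",
    "info": {                      # column family "info"
        "info:title": "The Matrix",
        "info:year": "1999",
    },
    "stats": {                     # column family "stats"
        "stats:rating": "8.7",     # columns can differ from row to row
    },
}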

 

Hive and Pig

Hive = Facebook

  • SQL for Hadoop
  • HQL = Hive Query Language
  • Uses YARN-based processing paradigms

Pig = Yahoo!

  • Pig Latin = a scripting language

Example of their difference:

 

Utilities and Applications

  • Flume = loading file-based data
  • SQOOP = exchanging data with a relational db; import/export relational tables into HDFS
  • Zookeeper = coordinate and sync services within cluster
  • Oozie = workflow scheduler
  • Hue = graphical UI for administering Hadoop

 

3 Sharding, Amazon, NoSQL

 

Scaling Web 2.0

As the web grew, the db layer needed to keep up. Some interesting evolutions followed:

 

MySQL never scaled as well as Oracle; it was not designed for multi-core processing. One of the first tricks used by places like Facebook was Memcached.

 

Memcached = open source distributed object cache; serves data from cache memory instead of going to the database.
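A sketch of the usual cache-aside pattern, assuming a local memcached on the default port and the pymemcache client; query_database is a hypothetical helper standing in for the real db call:

from pymemcache.client.base import Client

# Cache-aside sketch, assuming memcached runs on localhost:11211.
cache = Client(("localhost", 11211))

def get_user_json(user_id):
    key = f"user:{user_id}"
    value = cache.get(key)                    # try cache memory first
    if value is None:
        value = query_database(user_id)       # hypothetical helper returning a JSON string
        cache.set(key, value, expire=300)     # keep it cached for 5 minutes
    return value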

 

MySQL replication = copying of the database; reads come from the replicas; writes still need to go to the master, since replicating writes across multiple masters was not stable.

 

Sharding

Partitioning of a logical db across multiple physical servers; each partition is referred to as a shard. Example below where shards are shown at the bottom, at the db layer:

 

Sharding and Memcached are the best ways to scale a relational database. However, this approach is very costly and has several disadvantages:

  • Application complexity = need a dynamic routing layer that determines which shard to go to (see the sketch after this list); it also has to manage the memcached objects and deal with replication.
  • Crippled SQL = SQL statements cannot span shards; queries can no longer be run against the database as a whole.
  • Loss of transactional integrity = ACID transactions across shards are often not implemented, or are very difficult to implement alongside sharding and memcached.
  • Operational complexity =
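A minimal sketch of the routing idea, with made-up shard host names; a real router would use a stable hash and a lookup table rather than Python's built-in hash():

NUM_SHARDS = 4
shards = [f"mysql-shard-{i}.example.internal" for i in range(NUM_SHARDS)]  # hypothetical hosts

def shard_for(user_id):
    # Key-based routing: every query must first decide which shard owns the row.
    return shards[hash(user_id) % NUM_SHARDS]

# A single-user lookup hits one shard; anything spanning users (joins, global
# queries, cross-shard transactions) has to be stitched together in application code.
print(shard_for(42))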

CAP Theorem

States that in a distributed database system, you can only have two of the following:

  • Consistency = every user has an identical view of the data at any given instant
  • Availability = in the event of a failure, the db remains operational
  • Partition tolerance = when the network between the distributed systems fails, operations can still be maintained

 

Amazon Dynamo

Early Amazon used Oracle DB, which would readily fail. They tried to distribute load by splitting the site into separate functions, each with its own DB layer. They also adopted SOA so that the database technologies were decoupled from the web tier. Eventually, in 2007, this led to the Dynamo system, which did the following:

  • Continuous availability = 24/7, period
  • Network partition tolerant = independent of geographic location; stays available even when a location is offline
  • No-loss conflict resolution = if a user adds to the cart from two completely different sources, both additions should always show; this implies there are no exclusive write locks on objects
  • Efficiency = quick response; reduced delays
  • Economy = runs on commodity hardware
  • Incremental scalability

Amazon focused on availability over consistency. In terms of CAP, Dynamo achieves availability at the expense of consistency (to a reasonable degree).

 

Dynamo provides the following 3 architectural benefits:

  • Consistent hashing = use of hash values to distribute data evenly across a given set of nodes (see the sketch under "Hashing example" below). The node set can change, so the hash mapping must adjust dynamically; ideally the data stays evenly distributed across all nodes.
  • Tunable consistency = consistency can be dialed between strong, eventual, and weak
  • Data versioning = write operations are never blocked, so multiple versions of the same object can exist; the data must be merged, which can leave the user with duplicates – for example, adding the same item from two different computers

 

Hashing example:
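The original diagram is not reproduced here; a minimal Python sketch of a consistent-hash ring (node names are made up, and real systems add virtual nodes for smoother balance):

import bisect
import hashlib

def h(key):
    # Map any string onto a large integer ring position.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((h(node), node) for node in nodes)

    def node_for(self, key):
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect(positions, h(key)) % len(self.ring)   # first node clockwise
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("cart:12345"))
# Adding or removing a node only remaps the keys in its arc of the ring,
# rather than rehashing everything the way hash(key) % N would.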

 

 

 

 

4 Document Databases

 

 

File Databases based on XML and JSON

When dealing with XML documents, we have the following:

  • XPath
  • XQuery
  • XML schema
  • XSLT (Extensible Stylesheet Language Transformations)
  • DOM (Document Object Model)

 

JSON was developed to replace XML in many contexts, as XML documents were becoming verbose, repetitive, and expensive to parse. JSON databases are popular as key-value stores, such as DynamoDB. The JSON document being stored can have a dynamic set of fields rather than a fixed column definition. For example, compare the relational db vs JSON representation of a movie:
JSON objects have a size limit (16 MB per document in MongoDB), so long lists of values sometimes need to be broken out into related documents. For example, a movie with a list of actors could look like the sketch below:
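An illustrative sketch (field names made up): the same movie as a single document with an embedded actor list, and as a slimmer document that references actors stored separately:

# Embedded: everything about the movie lives in one document.
movie_embedded = {
    "_id": 101,
    "title": "The Matrix",
    "year": 1999,
    "actors": [
        {"name": "Keanu Reeves", "role": "Neo"},
        {"name": "Carrie-Anne Moss", "role": "Trinity"},
    ],
}

# Referenced: the list is broken out, useful when it could grow large.
movie_referenced = {"_id": 101, "title": "The Matrix", "year": 1999, "actor_ids": [7, 8]}
actors = [{"_id": 7, "name": "Keanu Reeves"}, {"_id": 8, "name": "Carrie-Anne Moss"}]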

 

 

 

MongoDB

A JSON-oriented document database that stores data in a format called BSON. BSON has lower parse overhead than JSON and richer support for data types like dates and binary.
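A small sketch using the pymongo client, assuming a MongoDB instance on localhost; the driver converts the Python dict to and from BSON:

from pymongo import MongoClient

# Connect to a local MongoDB and pick a database/collection.
client = MongoClient("mongodb://localhost:27017")
movies = client.demo_db.movies

movies.insert_one({"title": "The Matrix", "year": 1999, "genres": ["sci-fi", "action"]})
print(movies.find_one({"title": "The Matrix"}))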

 

*See other training doc

 

5 Graph Databases

Commonly used for social relationships (Facebook), network topologies, access-control systems, and medical models. Graph data could be stored in a relational db, but that is inefficient. Key-value NoSQL stores don't work well either, since the focus is on relationships rather than isolated key-value pairs; for graph data they are actually less efficient than a relational db.

 

Graph Theory

Vertices = nodes, distinct objects

Edges = relationships/arcs, that connect the objects

Properties = on both vertices and edges

 

Because Graph data involves relations, it could be represented in relational databases:

 

 

 

Two problems with using a relational db for graph data:

  • SQL lacks syntax to easily traverse the graph to an unknown depth – for example, determining the degrees of separation between two people on Facebook (see the sketch after this list)
  • Performance = graph traversal on a relational db is very expensive
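A sketch of why the traversal ends up in application code: edges kept as relational-style (person, friend) rows, walked breadth-first until the target is reached (names are made up):

from collections import deque

# Rows of an "edges" table, as they might come back from a relational db.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]

def degrees_of_separation(start, target):
    adjacency = {}
    for a, b in edges:                      # treat friendships as undirected
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        person, depth = queue.popleft()
        if person == target:
            return depth
        for friend in adjacency.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return None

print(degrees_of_separation("alice", "dave"))   # 3 hops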

 

RDF and SPARQL

Resource Description Framework (RDF) = data expressed as triples:

  • Entity
  • Attribute
  • Value

 

Example:

:TheMatrix :is :Movie

:Keanu :is :Person

:Keanu :starredIn :TheMatrix

 

SPARQL = query language for RDF.
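A small sketch using the rdflib library with a made-up namespace; the triples mirror the example above and the SPARQL query asks who starred in The Matrix:

from rdflib import Graph, Namespace

# Hypothetical namespace for the example entities.
EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.TheMatrix, EX.isA, EX.Movie))
g.add((EX.Keanu, EX.isA, EX.Person))
g.add((EX.Keanu, EX.starredIn, EX.TheMatrix))

# SPARQL: who starred in The Matrix?
query = """
SELECT ?who WHERE { ?who <http://example.org/starredIn> <http://example.org/TheMatrix> . }
"""
for row in g.query(query):
    print(row.who)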

 

Graph data is shown like an ontology:

 

 

 

6 Column Databases

 

Star schemas = used in large data warehouses to answer aggregate queries quickly; provide a predictable schema for BI; use a large fact table joined to numerous smaller dimension tables.

 

Columnar Tables

Two advantages to the columnar architecture (see the sketch after this list):

  • Queries that aggregate values of specific columns are optimized, because all the values for a given column are stored together and can be read with far fewer disk reads
  • Compression = since similar values sit next to each other, a column can store deltas instead of the actual values; the second entry stores only its difference from the first
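An illustrative sketch of both points with made-up numbers: the same table row-oriented vs. column-oriented, plus a simple delta encoding of one column:

# Row-oriented: each record keeps all of its columns together.
rows = [
    {"id": 1, "year": 1999, "gross": 463},
    {"id": 2, "year": 2000, "gross": 470},
    {"id": 3, "year": 2001, "gross": 481},
]

# Column-oriented: each column's values sit together, so an aggregate like
# sum(gross) touches only that one contiguous block of values.
columns = {"id": [1, 2, 3], "year": [1999, 2000, 2001], "gross": [463, 470, 481]}
print(sum(columns["gross"]))          # 1414

# Delta encoding: store the first value, then only the difference from the
# previous value; similar neighboring values compress well.
gross = columns["gross"]
deltas = [gross[0]] + [b - a for a, b in zip(gross, gross[1:])]
print(deltas)                         # [463, 7, 11]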

 

 

 

 

7 In Memory Databases

The move from hard disk / tape to solid-state storage – mostly DDR RAM or NAND flash (NAND being the industry standard).

 

 

An in-memory database architecture has the following key features:

  • Cache-less architecture = no need to rely on a cache; no cache management required
  • Alternative persistence model = ensures no data loss on power off (memory loses its contents); the data must be persisted somehow on the backend

 

TimesTen

Oracle's in-memory database since 2005; can also be used as a caching db in front of a traditional disk-based RDBMS. TimesTen is ACID compliant and handles transactions by running a transaction log and checkpointing in parallel.

 

 

Redis

Remote Dictionary Server = a simple in-memory system; developed under VMware, later moved to Pivotal Software.
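A minimal sketch using the redis-py client, assuming a local Redis on the default port:

import redis

# Connect and use Redis as a simple in-memory key-value store.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("session:42", "alice", ex=3600)   # expire after one hour
print(r.get("session:42"))              # 'alice'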

 

SAP HANA

In memory database for BI; can also do OLTP

 

VoltDB

H-Store design; an in-memory db designed with the explicit intention of not requiring disk IO during normal transactional operations; a pure in-memory solution. Supports ACID. Data is persisted not through disk but through replication across multiple machines.

 

Oracle 12c In Memory Database

Parts of Oracle 12c can run as/like in-memory database

 

BDAS / Spark

From UC Berkeley; Berkeley Data Analysis Stack

Spark = distributed, fault-tolerant in-memory data processing engine; written in Scala (runs on the Java VM).
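A small PySpark sketch of the in-memory style of processing ("input.txt" is a placeholder path):

from pyspark.sql import SparkSession

# Word count again, but with the working set kept in cluster memory for reuse.
spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.sparkContext.textFile("input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())               # keep the result in memory across actions
print(counts.take(5))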

 

Other projects from the same school (UC Berkeley):

Mesos = similar to Hadoop YARN; used for cluster management

Tachyon = Hadoop-compatible, memory-centric distributed file system; allows disk storage but uses aggressive caching to provide memory-level response times