Types of Databases

These are some notes on different types of databases used in industry today. Refer to my other post on Relational vs Non-Relational Databases for more information specific to NoSQL document databases.

http://solidfish.com/relational-vs-non-relational-databases/

Below is a list of some different types of databases as well as some platforms/tools for these database types.

 

Document Databases

File Databases based on XML and JSON

When dealing with XML documents, we have the following standards and tools (a small XPath sketch follows the list):

  • XPath
  • XQuery
  • XML schema
  • XSLT (Extensible Stylesheet Language Transformations)
  • DOM (Document Object Model)
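
As a minimal sketch, Python's standard library can apply a limited subset of XPath to an XML document (the movie markup here is hypothetical):

```python
import xml.etree.ElementTree as ET

# A small, hypothetical XML document describing a movie.
doc = """
<movie title="The Matrix" year="1999">
    <actors>
        <actor>Keanu Reeves</actor>
        <actor>Carrie-Anne Moss</actor>
    </actors>
</movie>
"""

root = ET.fromstring(doc)

# ElementTree supports a limited XPath subset; find all actor elements.
for actor in root.findall(".//actor"):
    print(actor.text)
```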

 

JSON was developed to replace XML in many cases, since XML documents were becoming verbose, repetitive, and expensive to parse. JSON databases are popular as key-value stores, such as DynamoDB. The JSON document being stored can have a dynamic list of column definitions. For example, compare a relational DB vs. a JSON document for a movie:
JSON documents have a size limit (16 MB in MongoDB), so sometimes long lists of values should be broken out with relations rather than embedded; for example, a movie with a list of actors where the actors are embedded directly in the movie document.
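
As a minimal sketch (all field names are hypothetical), a movie that would span several relational tables fits in one JSON document with the actor list embedded:

```python
import json

# One self-contained document; the actor list is embedded rather than
# normalized into a separate table as a relational design would require.
movie = {
    "title": "The Matrix",
    "year": 1999,
    "genres": ["Action", "Sci-Fi"],  # dynamic, per-document fields
    "actors": [
        {"name": "Keanu Reeves", "role": "Neo"},
        {"name": "Carrie-Anne Moss", "role": "Trinity"},
    ],
}

print(json.dumps(movie, indent=2))
```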

 

Big Data

More Data = store and process all data, including machine generated, multimedia, social networking, transactional…

More Effective = machine learning, predictive analytics, collective intelligence; allows more value from the data

 

Google File System (GFS)

A custom distributed cluster file system that Google developed and uses in its modular data centers.

 

MapReduce

A programming model that parallelizes data-intensive processing, using GFS for storage at Google. Work is split into a map phase and a reduce phase:

  • Map = data is broken into chunks that can be processed by separate threads
  • Reduce = combines the output from the mappers into the final result

The canonical example of MapReduce is the word-count program:
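
A minimal single-process Python sketch of the idea (a real MapReduce framework would distribute the map and reduce phases across a cluster):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the chunk.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: combine all counts emitted for the same word.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle/sort: group intermediate pairs by key, as the framework would.
pairs = sorted(p for line in lines for p in mapper(line))
result = [reducer(word, (c for _, c in group))
          for word, group in groupby(pairs, key=itemgetter(0))]

print(result)  # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```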

 

Hadoop

Used by Yahoo! in 2008; used by Facebook in 2012

Benefits:

  • Economical, scalable storage model
  • Massive, scalable I/O capability
  • Reliability
  • Scalable processing model
  • Schema on read

 

Hadoop runs on the Hadoop Distributed File System (HDFS), which is similar to GFS.

Each server has a TaskTracker and a DataNode, which means it both stores and processes data. Above these sits a dedicated JobTracker server used by MapReduce; the JobTracker works with each of the TaskTrackers. Also above them is the NameNode, which works with each of the DataNodes on the servers. The NameNode only does coordination; it does not deal with the actual data.

MapReduce is accessed through products like Pig and Hive. These work like compilers, taking the user's queries and translating them down to MapReduce jobs.

 

HBase

Often cited as the first formal NoSQL database

HBase uses the Hadoop HDFS file system to store its data. This is similar to how MySQL uses MyISAM to store its data on the file system. Though HDFS is data-structure agnostic, HBase does enforce structure on the objects:

  • Columns
  • Rows
  • Tables
  • Keys

HBase records can each have several unique columns. Below is an example of how relational-to-nonrelational mapping works:
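
As a rough, library-agnostic sketch (table, column-family, and qualifier names are hypothetical), the same movie row might map like this:

```python
# Relational: fixed columns, one row per table entry.
relational_row = {"movie_id": 1, "title": "The Matrix", "year": 1999}

# HBase: a row key plus column-family:qualifier cells; rows in the same
# table may carry different qualifiers (several unique columns per record).
hbase_row = {
    "row_key": "movie-1",
    "info": {                # 'info' column family
        "title": "The Matrix",
        "year": "1999",      # HBase stores raw bytes; values shown as strings here
    },
    "cast": {                # 'cast' column family, sparse per row
        "actor:neo": "Keanu Reeves",
    },
}
```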

 

Hive and Pig

Hive = Facebook

  • SQL for Hadoop
  • HQL = Hive Query Language (see the sketch after this list)
  • Uses YARN-based processing paradigms
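
As a hedged sketch, an HQL query can be issued from Python with the third-party PyHive client (the host, table, and columns are hypothetical, and a running HiveServer2 is assumed):

```python
from pyhive import hive  # third-party: pip install pyhive

# Connect to a HiveServer2 instance (hypothetical host/port).
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HQL looks like SQL; Hive compiles it down to MapReduce/YARN jobs.
cursor.execute("SELECT year, COUNT(*) FROM movies GROUP BY year")
for row in cursor.fetchall():
    print(row)
```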

Pig = Yahoo!

  • Pig Latin = a scripting language

 

Amazon DynamoDB

Early Amazon used Oracle DB, which would readily fail. Amazon tried to distribute by splitting the site into separate functions, each with its own DB layer. It also tried SOA so that the database technologies could differ and not be coupled to the web tier. This eventually led to the Dynamo system in 2007, which did the following:

  • Continuous availability = up 24/7
  • Network partition tolerant = geographic locations are independent; always available even when a location is offline
  • No-loss conflict resolution = if a user adds to a cart from two completely different sources, both additions should always show; this implies there are no exclusive write locks on objects
  • Efficiency = quick responses; reduced delays
  • Economy = runs on commodity hardware
  • Incremental scalability

Amazon focused on availability over consistency. In terms of CAP, Dynamo achieves availability at the expense of consistency (to a reasonable degree).

Dynamo provides the following 3 architectural benefits:

  • Consistent Hashing = use of hash values to evenly store data across a given set of nodes. The node set can change, so these hash values must adjust with it dynamically. Ideally, data stays evenly distributed across all the nodes.
  • Tunable Consistency = the consistency level can be controlled: strong, eventual, or weak (see the sketch after this list)
  • Data Versioning = write operations are never blocked, so there can be multiple versions of the same object; the data must be merged, which can lead to the user seeing duplicates – for example, adding the same item from two different computers
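
In the modern DynamoDB service, the tunable-consistency idea surfaces as a per-read flag. A minimal boto3 sketch (the table name and key are hypothetical, and configured AWS credentials are assumed):

```python
import boto3  # third-party AWS SDK: pip install boto3

# Hypothetical table; assumes AWS credentials/region are configured.
table = boto3.resource("dynamodb").Table("Movies")

# Default reads are eventually consistent; ConsistentRead=True requests a
# strongly consistent read at the cost of higher latency/throughput usage.
resp = table.get_item(
    Key={"title": "The Matrix"},
    ConsistentRead=True,
)
print(resp.get("Item"))
```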

 

Utilities and Applications

  • Flume = loading file-based data
  • SQOOP = exchanging data with relational DBs; import/export relational tables into/out of HDFS
  • Zookeeper = coordinate and sync services within cluster
  • Oozie = workflow scheduler
  • Hue = graphical UI for administering Hadoop

 

 

MongoDB

A JSON-oriented document database that uses a binary format called BSON. BSON has lower parse overhead than JSON and richer support for data types like dates and binary data.
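
A minimal sketch with the official pymongo client (the server address and database/collection names are hypothetical):

```python
from pymongo import MongoClient  # third-party: pip install pymongo

# Assumes a MongoDB server on localhost; names are hypothetical.
client = MongoClient("mongodb://localhost:27017")
movies = client.moviedb.movies

# The dict is serialized to BSON on the wire and on disk.
movies.insert_one({
    "title": "The Matrix",
    "year": 1999,
    "actors": ["Keanu Reeves", "Carrie-Anne Moss"],
})

print(movies.find_one({"title": "The Matrix"}))
```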

 

Graph Databases

Commonly used for social relationships (Facebook), network topologies, access-control systems, and medical models. Graph data could be stored in a relational DB, but it is inefficient. Key-value NoSQL stores don't work well either, since graph data isn't naturally key-value pairs; the focus is instead on the relationships. Key-value NoSQL is actually less efficient than a relational DB for graph data.

 

Graph Theory

Vertices = nodes, distinct objects

Edges = relationships/arcs that connect the objects

Properties = on both vertices and edges

Because graph data involves relations, it could be represented in a relational database, e.g. as vertex and edge tables (see the sketch after the list below).

Two problems with using relational db for graph data:

  • SQL lacks syntax to easily traverse the graph to an unknown depth – for example, determining the degrees of separation between two people on Facebook
  • Performance = graph traversal on a relational DB is very expensive
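
Both problems show up in a minimal sqlite3 sketch: the graph lives in vertex/edge tables, and even a simple walk of unknown depth needs a recursive CTE, where every level of depth is effectively another self-join (the schema and data are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Graph as relational tables: a vertex table and an edge table.
con.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friend (a INTEGER, b INTEGER);
    INSERT INTO person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO friend VALUES (1, 2), (2, 3);
""")

# Degrees of separation from Alice (id 1) via a recursive CTE.
rows = con.execute("""
    WITH RECURSIVE reachable(id, depth) AS (
        SELECT 1, 0
        UNION
        SELECT f.b, r.depth + 1
        FROM friend f JOIN reachable r ON f.a = r.id
    )
    SELECT p.name, r.depth
    FROM reachable r JOIN person p ON p.id = r.id
""").fetchall()

print(rows)  # e.g. [('Alice', 0), ('Bob', 1), ('Carol', 2)]
```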

 

RDF and SPARQL

Resource Description Framework (RDF) = data expressed as triples:

  • Entity
  • Attribute
  • Value

 

Example:

:TheMatrix :is :Movie
:Keanu :is :Person
:Keanu :starredIn :TheMatrix

SPARQL = query language for RDF.
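
A minimal sketch of the same triples plus a SPARQL query, using the third-party rdflib library (the URIs are hypothetical):

```python
from rdflib import Graph, Namespace  # third-party: pip install rdflib

EX = Namespace("http://example.org/")
g = Graph()

# Triples: (entity, attribute, value).
g.add((EX.TheMatrix, EX.isA, EX.Movie))
g.add((EX.Keanu, EX.isA, EX.Person))
g.add((EX.Keanu, EX.starredIn, EX.TheMatrix))

# SPARQL: who starred in The Matrix?
q = """
    PREFIX ex: <http://example.org/>
    SELECT ?who WHERE { ?who ex:starredIn ex:TheMatrix }
"""
for row in g.query(q):
    print(row.who)
```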

 

Column Databases

Star Schemas = used in large data warehouses to run aggregate queries quickly; provide a predictable schema for BI; use a large fact table joined to numerous smaller dimension tables.

 

Columnar Tables

Two advantages to columnar architecture:

  • Queries that aggregate values of specific columns are optimized; because all the data for a given column is stored contiguously, it can be fetched in a single disk read
  • Compression = since similar values sit next to each other, the store can keep deltas instead of the actual data; the second entry stores only its delta from the first entry (toy sketch below)
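
A toy Python sketch of the delta-encoding idea on a single column:

```python
# Column-store sketch: one column's values laid out contiguously.
years = [1999, 1999, 2001, 2003, 2003, 2003]

# Delta encoding: keep the first value, then only the difference from
# the previous entry; runs of similar values compress to near-zero deltas.
deltas = [years[0]] + [b - a for a, b in zip(years, years[1:])]
print(deltas)  # [1999, 0, 2, 2, 0, 0]

# Decoding reverses the process with a running sum.
decoded = []
total = 0
for d in deltas:
    total += d
    decoded.append(total)
assert decoded == years
```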

 

In Memory Databases

The move from hard disk/tape to solid-state storage – mostly using DDR RAM or NAND flash (NAND being the industry standard).

An in-memory database architecture has the following key features:

  • Cache-less architecture = no need to rely on cache; no need for cache management
  • Alternative persistence model = must ensure no data loss on power-off (memory loses its contents without power); the data must be persisted somehow on the backend

 

TimesTen

Oracle's in-memory database since 2005; it can also be used as the caching DB for a traditional disk-based RDBMS. TimesTen is ACID compliant and handles transactions by running a transaction log and checkpointing in parallel.

 

Redis

Remote Dictionary Server = a simple in-memory key-value system. Originally sponsored by VMware, later moved to Pivotal Software.
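
A minimal sketch with the redis-py client (assumes a Redis server on localhost; the key names are hypothetical):

```python
import redis  # third-party client: pip install redis

# Assumes a Redis server on localhost:6379.
r = redis.Redis(host="localhost", port=6379)

# Simple dictionary-style operations held entirely in memory.
r.set("movie:1:title", "The Matrix")
r.incr("movie:1:views")          # atomic counter
print(r.get("movie:1:title"))    # b'The Matrix'
```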

 

SAP HANA

In-memory database for BI; can also do OLTP

 

VoltDB

Based on the H-Store design; an in-memory DB designed with the explicit intention of not requiring disk I/O during normal transactional operations; a pure in-memory solution. Supports ACID. Data is made durable not through disk but through replication across multiple machines.

 

Oracle 12c In Memory Database

Parts of Oracle 12c can run as/like an in-memory database.

 

BDAS / Spark

From UC Berkeley; Berkeley Data Analysis Stack

Distributed in-memory and fault tolerant; written in Scala (JVM-based).
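
The earlier word-count example becomes much more compact on Spark's RDD API. A minimal local PySpark sketch (assumes pyspark is installed):

```python
from pyspark.sql import SparkSession  # third-party: pip install pyspark

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# The same word-count idea as MapReduce, but on fault-tolerant in-memory
# RDDs; lineage lets Spark recompute lost partitions rather than re-read disk.
lines = sc.parallelize(["the quick brown fox", "the fox"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())

spark.stop()
```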

Other projects from the school:

Mesos = similar to Hadoop YARN; used for cluster management

Tachyon = Hadoop-compatible, memory-centric distributed file system; allows for disk storage but uses aggressive caching to provide memory-level response times