Evolution led to the storage of JSON objects; first stored on relational db but later evolved to be stored by itself without the relational db middleman – which lead to document databases. Popular ones are MongoDB and CouchBase.
More Data = store and process all data, including machine generated, multimedia, social networking, transactional…
More Effective = machine learning, predictive analytics, collective intelligence; allows more value from the data
Google File System (GFS)
Custom distributed cluster file system that Google developed an use for its module data center.
Nonrelational db system that uses GFS for storage. Programming model that parallelizes data-intesnsive processing; using a mapping phase and reduce phase:
- Mapping = data broken into chucks that can be processed by separate threads
- Reduce = combines the output from mappers into final result
Canonical example of MapReduce is word-count program:
Used by Yahoo! In 2008; Used by Facebook 2012
- Economical Scalable storage model
- Massicve scaleable IO capability
- Scalable processing model
- Schema on read
Hadop runs on Hadoop Distributed File System (HDFS) which is similar to GFS.
Each server hs a task tracker and data node, which means it both stores and processes data on the server. Above that is a specific JobTracker server which is used by MapReduce. The JobTracker works with each of the TaskTrackers. Also above is the NameNode, which works with each of Data nodes on the servers. The NameNode only does coordination, it does not deal with the actual data.
MapReduce is interfaced with products like Pig and Hive. This is like a compiler, takes user’s queries and translates it down to MapReduce.
First formal NoSQL database
HBase uses the Hadoop HDFS file system to store its data. This is similar to how MySQL uses MyISAM to store its data on the file system. Though HDFS is data structure agnostic, HBase those enforce structure on the objects:
HBase records can have several unique columns. Below is example of how Relational to Nonrelational mapping works:
Hive and Pig
Hive = Facebook
- SQL for Hadoop
- HQL = hive query language
- Uses YARN based processing paradigms
Pig = Yahoo!
- Pig Latin = a scripting language
Early amazon used Oracle DB which would readily fail. Tried to distribute by having separate functions on the site, each would have its own different DB layers. Also tried to use SOA so that the database technologies could be different not coupled with web. Eventually in 2007 lead to Dynamo system, which did the following:
- Continuous Availability = 24/7 period
- Network partition tolerant = geographical locations independent; available always regardless of a location not being online/available
- No-loss conflict resolution = if user adds to cart from two completely different sources, should always show both; implied there are no excluse write locks on objects
- Efficiency = quick response; reduced delays
- Economy = run on commodity hardware
- Incremental scalability
Amazon was focused on Availability, over consistency. In terms of CAP, Dynamo would achieve consistency at the expense of consistency (to a reasonable degree)
Dynamo provides the following 3 architectural benefits:
- Consistent Hashing = use of has values to evenly store data to a given set of nodes. The node set can change so these hash values must adjust with it dynamically. Ideally, need to keep data evenly distributed across all the nodes.
- Tunable Consistency = Can control consistency = strong, eventual, weak
- Data Versioning = write operations never blocked so can have multiple versions of same object; must merge data which can lead to user having duplicates – for example adding same item from two different computers
Utilites and Applications
- Flume = loading file-based data
- SQOOP = exchanging data with relation db; import/export relational tables into HDFS
- Zookeeper = coordinate and sync services within cluster
- Oozie = workflow scheduler
- Hue = graphical UI for administering Hadoop
File Databases based on XML and JSON
In XML, we have the following when dealing with XML documents:
- XML schema
- XSLT (Extensible stylesheet language transofrmations)
- DOM (Document Object Model)
JSON developed to replace parts of XML documents as those were becoming expensive, repeatitive, expensive to parse. JSON databases popular in key-value stores, such as DynamoDB. The JSON document being stored can have a dynamic list of column definition. For example, the relational db vs JSON for a movie:
JSON objects have a size limit (64mb in MongoDB). So sometimes if there are lists of values, it should be broken out with relations. Examples, a movie with list of actors could look like:
JSON oriented document database using a format called BSON. BSON has lower parse overhead than JSON and richer support for data types like dates and binary.
Commonly used for social relationships (Facebook), network topologies, access-control systems, medical models. Graph data could be stored in relational db but it is inefficient. NoSQL doesn’t work either (key-value pairs) as there are no key-value pairs, its instead focused on the relationships. NoSQL is actually less efficient than relational db for graph data.
Vertices = nodes, distinct objects
Edges = relationships/arcs, that connect the objects
Properties = on both vertices and edges
Because Graph data involves relations, it could be represented in relational databases:
Two problems with using relational db for graph data:
- SQL lacks syntax to easily traverse the graph – example, going through unknown depth or travels. Such as determining the degrees of separation between people on Facebook
- Performance = graph traversal on relational db is very expensive
RDF and SPARQL
Resource Description Framework (RDF) = data expressed as triples:,
The Matrix: is :Movie
Keanu: is :Person
Keanu: starred in :The Matrix
SPARQL = query language for RDF.
Star Schemas = for large data warehouses which aggregate queries quickly; provides predictable schema for BI, uses large Fact Tables with numerous smaller Dimension tables.
Two advantages to columnar architecture:
- Queries that seek to aggregate values of specific columns are optimized; this is because all the data for given column are now in a single row and so only need to be read off a single disc read
- Compression = since all data shown on single row, can store the delta instead of actual data. So the second data entry stores only delta of value from first data entry.
In Memory Databases
The move from hard disc / tape to solid state discs – mostly using DDR RAM or NAND (NAND being the industry standard).
An in-memory database architecture has the following key features:
- Cache-less architecture = no need to rely on cache; no need for cache management
- Alternative persistence model = ensure no data loss on power off (memory looses power); persist the data somehow on the backend
Since 2005, Oracle’s in memory database; can also be used as the caching db for traditional disc-based RDBMS. TimesTen is ACID compliant and deals with transactions by having a transaction log and checkpoint running in parallel.
Remote Dictionary Server = simple in-memory system. By VMWare but moved to Pivotal Software.
In memory database for BI; can also do OLTP
H-Store design; in-memory db designed with explicit intention of not requiring disk IO during normal transactional operations; pure in-memory solution. Supports ACID. Data is persisted not through disc but through multiple machines.
Oracle 12c In Memory Database
Parts of Oracle 12c can run as/like in-memory database
BDAS / Spark
From UC Berkeley; Berkeley Data Analysis Stack
Distributed in-memory database, fault tolerant; written in Scala (Java VM based);
Other databases from the school:
Mesos = similar to Hadoop YARN; used for cluster management
Tachyon = Hadoop compatible memory centric distributed file system; allows for isk storage but uses aggressive caching to provide memory-lelvel response times