- Tuples = unordered set of attribute values (row and attribute of column in a table)
- Relation = collection of tuples and corresponding relations
- Constraints = enforce consistency; used to identify tuples and relationships between them
- Operations = joins, projectsions, unions, etc; generally based on the relations
Changes on a databases done with transactions, to ensure consistency and integrity. Follows the ACID model for transactions:
- Atomic = transaction is indivisible, either all parts go thru or none
- Consistent = db remains consistent state before and after transaction
- Isolated = transactions do not effect each other, it is focused on its own
- Durable = Once transaction saved, changes are expected to persists even if OS or Hardware failure
Using Memcache (Facebook) – caching of data at the outer endpoints
Partition data across multiple databases based on key attribute. The downside to this is that it looses ACID attributes, so it is difficult to do transactions.
Opposite to sharding is Amazon, which needs the same scability but with strict ACID since it is dealing with transactions. This lead to them creating Dynamo db system.
Evolution led to the storage of JSON objects; first stored on relational db but later evolved to be stored by itself without the relational db middleman – which lead to document databases. Popular ones are MongoDB and CouchBase.
More Data = store and process all data, including machine generated, multimedia, social networking, transactional…
More Effective = machine learning, predictive analytics, collective intelligence; allows more value from the data
Custom distributed cluster file system that Google developed an use for its module data center.
Nonrelational db system that uses GFS for storage. Programming model that parallelizes data-intesnsive processing; using a mapping phase and reduce phase:
- Mapping = data broken into chucks that can be processed by separate threads
- Reduce = combines the output from mappers into final result
Canonical example of MapReduce is word-count program:
Used by Yahoo! In 2008; Used by Facebook 2012
- Economical Scalable storage model
- Massicve scaleable IO capability
- Scalable processing model
- Schema on read
Hadop runs on Hadoop Distributed File System (HDFS) which is similar to GFS.
Each server hs a task tracker and data node, which means it both stores and processes data on the server. Above that is a specific JobTracker server which is used by MapReduce. The JobTracker works with each of the TaskTrackers. Also above is the NameNode, which works with each of Data nodes on the servers. The NameNode only does coordination, it does not deal with the actual data.
MapReduce is interfaced with products like Pig and Hive. This is like a compiler, takes user’s queries and translates it down to MapReduce.
The top level overview of Hadoop:
First formal NoSQL database
HBase uses the Hadoop HDFS file system to store its data. This is similar to how MySQL uses MyISAM to store its data on the file system. Though HDFS is data structure agnostic, HBase those enforce structure on the objects:
HBase records can have several unique columns. Below is example of how Relational to Nonrelational mapping works:
Hive = Facebook
- SQL for Hadoop
- HQL = hive query language
- Uses YARN based processing paradigms
Pig = Yahoo!
- Pig Latin = a scripting language
Example of their difference:
- Flume = loading file-based data
- SQOOP = exchanging data with relation db; import/export relational tables into HDFS
- Zookeeper = coordinate and sync services within cluster
- Oozie = workflow scheduler
- Hue = graphical UI for administering Hadoop
As web grew, db layer needed to keep up. Some interesting evolutions:
MySQL never scaled as good as Oracle. Not designed for multi-core processing. Some of the first tricks used by places like Facebook were Memcached.
Memcached = open source; distributed object cache; uses cache memory instead of going to database;
MySQL replication = copying of databases; Reads come from duplicates of db; Writes still need to go to master since duplication of writes (or copy of master) was not stable
Partitioing of logical db across multiple physical servers; each partition is referred ot as a shard; Example below where shards are shown on bottom at db layer:
Sharding and Memcached are the best ways to scale a relational database. However, this is very costly and has several disadvantages:
- Application complexity = Need a dynamic routing layer that determines which shards to go to; also needs to manage the memcached objects and deal with replication.
- Crippled SQL = SQL statement cannot go across shards; queries can only be done against the whole database;
- Loss of transactional integrity = ACID transactions lost due to complexity; often not implemented or difficult to implement with sharding and memcached.
- Operational Complexity =
States that in a distributed database systems, can only have two of the following:
- Consistency = every user has identical view of data at any given instand
- Availability = in event of failure, db remains operational
- Partition tolerance = when the network between the distributed systems fail, can maintain operations
Early amazon used Oracle DB which would readily fail. Tried to distribute by having separate functions on the site, each would have its own different DB layers. Also tried to use SOA so that the database technologies could be different not coupled with web. Eventually in 2007 lead to Dynamo system, which did the following:
- Continuous Availability = 24/7 period
- Network partition tolerant = geographical locations independent; available always regardless of a location not being online/available
- No-loss conflict resolution = if user adds to cart from two completely different sources, should always show both; implied there are no excluse write locks on objects
- Efficiency = quick response; reduced delays
- Economy = run on commodity hardware
- Incremental scalability
Amazon was focused on Availability, over consistency. In terms of CAP, Dynamo would achieve consistency at the expense of consistency (to a reasonable degree)
Dynamo provides the following 3 architectural benefits:
- Consistent Hashing = use of has values to evenly store data to a given set of nodes. The node set can change so these hash values must adjust with it dynamically. Ideally, need to keep data evenly distributed across all the nodes.
- Tunable Consistency = Can control consistency = strong, eventual, weak
- Data Versioning = write operations never blocked so can have multiple versions of same object; must merge data which can lead to user having duplicates – for example adding same item from two different computers
In XML, we have the following when dealing with XML documents:
- XML schema
- XSLT (Extensible stylesheet language transofrmations)
- DOM (Document Object Model)
JSON developed to replace parts of XML documents as those were becoming expensive, repeatitive, expensive to parse. JSON databases popular in key-value stores, such as DynamoDB. The JSON document being stored can have a dynamic list of column definition. For example, the relational db vs JSON for a movie:
JSON objects have a size limit (64mb in MongoDB). So sometimes if there are lists of values, it should be broken out with relations. Examples, a movie with list of actors could look like:
JSON oriented document database using a format called BSON. BSON has lower parse overhead than JSON and richer support for data types like dates and binary.
*See other training doc
Commonly used for social relationships (Facebook), network topologies, access-control systems, medical models. Graph data could be stored in relational db but it is inefficient. NoSQL doesn’t work either (key-value pairs) as there are no key-value pairs, its instead focused on the relationships. NoSQL is actually less efficient than relational db for graph data.
Vertices = nodes, distinct objects
Edges = relationships/arcs, that connect the objects
Properties = on both vertices and edges
Because Graph data involves relations, it could be represented in relational databases:
Two problems with using relational db for graph data:
- SQL lacks syntax to easily traverse the graph – example, going through unknown depth or travels. Such as determining the degrees of separation between people on Facebook
- Performance = graph traversal on relational db is very expensive
RDF and SPARQL
Resource Description Framework (RDF) = data expressed as triples:,
The Matrix: is :Movie
Keanu: is :Person
Keanu: starred in :The Matrix
SPARQL = query language for RDF.
Graph data is shown like an ontology:
Star Schemas = for large data warehouses which aggregate queries quickly; provides predictable schema for BI, uses large Fact Tables with numerous smaller Dimension tables.
Two advantages to columnar architecture:
- Queries that seek to aggregate values of specific columns are optimized; this is because all the data for given column are now in a single row and so only need to be read off a single disc read
- Compression = since all data shown on single row, can store the delta instead of actual data. So the second data entry stores only delta of value from first data entry.
The move from hard disc / tape to solid state discs – mostly using DDR RAM or NAND (NAND being the industry standard).
An in-memory database architecture has the following key features:
- Cache-less architecture = no need to rely on cache; no need for cache management
- Alternative persistence model = ensure no data loss on power off (memory looses power); persist the data somehow on the backend
Since 2005, Oracle’s in memory database; can also be used as the caching db for traditional disc-based RDBMS. TimesTen is ACID compliant and deals with transactions by having a transaction log and checkpoint running in parallel.
Remote Dictionary Server = simple in-memory system. By VMWare but moved to Pivotal Software.
In memory database for BI; can also do OLTP
H-Store design; in-memory db designed with explicit intention of not requiring disk IO during normal transactional operations; pure in-memory solution. Supports ACID. Data is persisted not through disc but through multiple machines.
Oracle 12c In Memory Database
Parts of Oracle 12c can run as/like in-memory database
BDAS / Spark
From UC Berkeley; Berkeley Data Analysis Stack
Distributed in-memory database, fault tolerant; written in Scala (Java VM based);
Other databases from the school:
Mesos = similar to Hadoop YARN; used for cluster management
Tachyon = Hadoop compatible memory centric distributed file system; allows for isk storage but uses aggressive caching to provide memory-lelvel response times