Overview of AWS DynamoDB

DynamoDB is a NoSQL database service provided by AWS. Refer to my other post about NoSQL databases here:

http://solidfish.com/relational-vs-non-relational-databases/

Some of the benefits of DynamoDB are:

  • massively scalable
  • low latency – very fast reads and writes
  • low operational overhead
  • high availability

Some quick notes on Relational vs Non-relational databases

  • SQL DBs use vertical scaling whereas NoSQL DBs scale horizontally (columns can be added dynamically)
  • SQL DBs provide strong consistency guarantees through ACID (eg transactions guarantee the latest data is always returned)
    • Atomicity
    • Consistency
    • Isolation
    • Durability
  • NoSQL DBs provide weaker consistency guarantees through BASE (reads may not return the latest data)
    • Basically Available
    • Soft state
    • Eventual Consistency

CAP Theorem

Based on the CAP theorem, when representing data we can support only two of the following three properties at any one time. SQL DBs provide the ‘C’ and the ‘P’ whereas NoSQL DBs provide the ‘A’ and the ‘P’.

  • Consistency
  • Availability
  • Partition tolerance

When we have a distributed system and want to support consistency, each node needs to verify with the others before responding to a data request, to ensure the correct value is returned. In a distributed system where we want availability, a data request is answered immediately by the first node that receives it; that value may not be in sync with the other nodes, but the response is immediate. In the real world we wouldn’t ever have a situation where partition tolerance is dropped, since that would mean the data in the distributed system is incomplete – in other words, the data would be wrong.

 

DynamoDB Example Walkthrough

In a NoSQL DB, we store data as documents made up of key-value pairs. It is schemaless, but in return it is highly scalable with very low overhead. The level of data consistency can be configured (low to high), data I/O can be streamed, and data requests are made through simple APIs.

DynamoDB integrates easily with AWS Lambda, CloudWatch, CloudSearch, EMR, Data Pipeline and Redshift.

In the following examples we’re using the AWS CLI. An example account was created to use for these DynamoDB examples. This was done through IAM, and the account was given the built-in DynamoDB full access policy.

First configure the CLI:

[user:/]$ aws configure
AWS Access Key ID [None]: XXX
AWS Secret Access Key [None]: AAA/BBBBB
Default region name [None]: us-east-2
Default output format [None]: text
[user:/]$ aws dynamodb list-tables
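
The same list-tables call can also be made from the Java SDK (version 1, which the later samples in this post use). A minimal sketch, assuming credentials and region are picked up from the configuration above; the class name is just for illustration:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class ListTablesExample {
    public static void main(String[] args) {
        // Builds a client using the credentials and region configured above (aws configure)
        AmazonDynamoDB dynamoDB = AmazonDynamoDBClientBuilder.standard().build();

        // Print every table name in the account/region
        dynamoDB.listTables().getTableNames().forEach(System.out::println);
    }
}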

Data Types Supported

  • String
  • Number
  • Binary (Base64-encoded string)
  • Boolean
  • Null
  • List
  • Sets
    • String sets
    • Number sets
    • Binary sets

The format is JSON. The following are key types:

  • Simple Key
    • Partition Key (unique, usually a hash)
  • Composite Key
    • Partition Key
    • Sort Key

An example of a simple key is a Users table where the UserId is the partition key. An example of a composite key is a blog posts table that has a UserId field as the partition key and a timestamp field as the sort key; together they form the composite key.
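
As a rough illustration of the composite-key case, here is a hypothetical Posts table using the object mapper annotations covered later in this post; the table and field names are made up for the example:

import com.amazonaws.services.dynamodbv2.datamodeling.*;

// Hypothetical blog posts table with a composite key:
// userId is the partition (hash) key, timestamp is the sort (range) key.
@DynamoDBTable(tableName = "Posts")
public class Post {

    @DynamoDBHashKey
    private String userId;      // partition key

    @DynamoDBRangeKey
    private Long timestamp;     // sort key

    @DynamoDBAttribute
    private String body;

    // getters/setters omitted
}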

Scans

Scans are a method of querying data using an arbitrary filter expression. They are convenient, but much slower and more expensive to execute than key-based queries since the entire table is read. Scans should be used as a last resort.
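
A minimal scan sketch using DynamoDBMapper and the Item class defined later in this post; the filter expression and attribute names are illustrative only. Note the filter is applied after the items are read, so the full scan cost is still incurred:

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBScanExpression;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

import java.util.Collections;
import java.util.List;

public class ScanExample {
    // Scans the Items table for items whose totalRating exceeds a minimum value
    public static List<Item> findHighlyRated(DynamoDBMapper mapper, int minRating) {
        DynamoDBScanExpression scanExpression = new DynamoDBScanExpression()
                .withFilterExpression("totalRating > :minRating")
                .withExpressionAttributeValues(Collections.singletonMap(
                        ":minRating", new AttributeValue().withN(Integer.toString(minRating))));

        return mapper.scan(Item.class, scanExpression);
    }
}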

Indexes

Two types of indexes:

  • Local Secondary Index (LSI)
    • selects a different sort order for the same partition key (basically a secondary sort key)
    • similar to sort keys, but values can be duplicated
    • limited by the size of a single partition
  • Global Secondary Index (GSI)
    • accesses data using a different partition key
    • effectively a copy of the data with a different partition key
    • unlimited size
    • can emulate an LSI with a GSI

Note that DynamoDB has limits here – a table can have at most 5 LSIs and 5 GSIs. A sketch of declaring both kinds of index with the object mapper follows below.
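
Extending the hypothetical Posts class from the earlier sketch, the following shows roughly how an LSI and a GSI can be declared with the object mapper annotations (the index and attribute names are made up):

import com.amazonaws.services.dynamodbv2.datamodeling.*;

@DynamoDBTable(tableName = "Posts")
public class Post {

    // ... userId (@DynamoDBHashKey) and timestamp (@DynamoDBRangeKey) as in the earlier sketch

    // Local secondary index: same partition key (userId), but sorted by rating instead
    @DynamoDBIndexRangeKey(localSecondaryIndexName = "RatingIndex")
    private Integer rating;

    // Global secondary index: query posts by category, regardless of user
    @DynamoDBIndexHashKey(globalSecondaryIndexName = "CategoryIndex")
    private String category;

    // getters/setters omitted
}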

Configuration and API

When configuring DynamoDB we need to specify the RCU and WCU (read and write capacity units, i.e. the provisioned read/write requests per second). If these settings are lower than the actual load, DynamoDB can absorb the overload with burst capacity, but this is limited; eventually requests will be throttled and fail, so it is important to set the RCU/WCU correctly up front.
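
Provisioned throughput can also be changed after table creation. A minimal sketch using the Java SDK, assuming an existing table named Items and example values of 10 RCU / 5 WCU:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

public class ThroughputExample {
    public static void main(String[] args) {
        AmazonDynamoDB dynamoDB = AmazonDynamoDBClientBuilder.standard().build();

        // Raise the provisioned capacity of the Items table to 10 RCU and 5 WCU
        dynamoDB.updateTable(new UpdateTableRequest()
                .withTableName("Items")
                .withProvisionedThroughput(new ProvisionedThroughput(10L, 5L)));
    }
}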

The APIs use HTTP methods. Example:

POST / HTTP/1.1
Host: dynamodb.us-east-2.amazonaws.com
...
X-Amz-Date: 20160811T123
X-Amz-Target: DynamoDB_20120810.GetItem
{
 "TableName": "mytable",
 "Key": {
  "UserId": {"N": "1"},
  "Timestamp": {"N": "2018010112345"}
 }
}
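
In practice we rarely build these HTTP requests by hand; the SDK does the signing and serialization for us. A rough Java equivalent of the GetItem request above, using the low-level client, might look like this:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;
import com.amazonaws.services.dynamodbv2.model.GetItemResult;

public class GetItemExample {
    public static void main(String[] args) {
        AmazonDynamoDB dynamoDB = AmazonDynamoDBClientBuilder.standard().build();

        // Same composite-key lookup as the raw HTTP request above
        GetItemResult result = dynamoDB.getItem(new GetItemRequest()
                .withTableName("mytable")
                .addKeyEntry("UserId", new AttributeValue().withN("1"))
                .addKeyEntry("Timestamp", new AttributeValue().withN("2018010112345")));

        System.out.println(result.getItem());
    }
}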

Some of the API methods are:

  • CreateTable, DeleteTable, ListTables, UpdateTable
  • BatchGetItem, GetItem, Query, Scan
  • BatchWriteItem, DeleteItem, PutItem, UpdateItem
  • DescribeTimeToLive, UpdateTimeToLive
  • TagResource, UntagResource, ListTagsOfResource
  • DescribeLimits

Object Persistence Interface

Maps a DynamoDB table to an object. The object will have certain annotations such as the partition key and attribute name definitions.

Conditional Updates and Optimistic Locking

This resolves write conflicts when multiple requests access the same data. In traditional SQL databases we would use transactions (or the Unit of Work pattern) to deal with this. In NoSQL databases we use a condition that checks an attribute and executes the update only when the condition is met. This can be done programmatically, outside of DynamoDB. For example, one way to do this is to have a version field that gets incremented every time the item is updated.

In the example code below, this can be seen in the Item object as:

@DynamoDBVersionAttribute
private Long version;

And the data access class uses a SaveBehavior of CLOBBER for the put operation:

    public Item put(Item item) {
        mapper.save(item, DynamoDBMapperConfig
                .builder()
                .withSaveBehavior(DynamoDBMapperConfig.SaveBehavior.CLOBBER)
                .build());

        return item;
    }

Finally, the code below shows how an update retries when the conditional check fails: if another request has updated the item concurrently, the latest version is re-read and the update is attempted again, so concurrent updates do not overwrite each other.

    private static void updateDescription(ItemDao itemDao, Item item) {
        while (true) {
            try {
                item.setDescription("Retina display");
                itemDao.update(item);
                break;
            } catch (ConditionalCheckFailedException ex) {
                item = itemDao.get(item.getId());
            }
        }
    }

Transactions

The section above showed how to implement locking programmatically using conditions and optimistic locking. We can also programmatically implement transactions to mimic the transaction feature of SQL databases. AWS provides a library for this called TransactionManager. When using it, records are first stored in separate shadow (bookkeeping) tables before being saved to the original table.
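
A minimal sketch of how the TransactionManager library might be used, assuming its bookkeeping tables (named Transactions and TransactionImages here) already exist; the table and attribute names are illustrative:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import com.amazonaws.services.dynamodbv2.transactions.Transaction;
import com.amazonaws.services.dynamodbv2.transactions.TransactionManager;

import java.util.HashMap;
import java.util.Map;

public class TransactionExample {
    public static void main(String[] args) {
        AmazonDynamoDB dynamoDB = AmazonDynamoDBClientBuilder.standard().build();

        // Assumed bookkeeping tables the library uses to stage writes before applying them
        TransactionManager txManager =
                new TransactionManager(dynamoDB, "Transactions", "TransactionImages");

        Transaction tx = txManager.newTransaction();

        Map<String, AttributeValue> item = new HashMap<>();
        item.put("id", new AttributeValue("item-1"));
        item.put("name", new AttributeValue("MacBook Pro"));

        // The write is staged by the transaction and applied atomically on commit
        tx.putItem(new PutItemRequest().withTableName("Items").withItem(item));
        tx.commit();
    }
}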

Data Search using AWS CloudSearch

For text searches we can utilize the CloudSearch service, which is basically a managed search service built on top of Apache Solr. It imports data from DynamoDB and runs the actual search. Another benefit is that CloudSearch has ranking ability and can therefore rank our search results. Note there is a limitation in that only up to 5 MB of data can be searched at once from DynamoDB.

Sample Code

The following is some sample Java code defining a table programmatically using the AWS DynamoDB library. The table is created in DynamoDB programmatically, as shown further below. Note the annotations that map these objects to the DynamoDB table.

import com.amazonaws.services.dynamodbv2.datamodeling.*;

@DynamoDBTable(tableName = "Items")
public class Item {

    @DynamoDBAutoGeneratedKey
    @DynamoDBHashKey
    private String id;

    @DynamoDBAttribute
    private String name;

    @DynamoDBAttribute
    private String description;

    @DynamoDBAttribute
    private int totalRating;

    @DynamoDBAttribute
    private int totalComments;

    @DynamoDBVersionAttribute
    private Long version;

    public String getId() {
        return id;
    }
...

The following sample code uses DynamoDBMapper’s generateCreateTableRequest to create the table. Note how we also set up the GSI and LSI throughput and projections here programmatically.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.dynamodbv2.model.*;
import com.amazonaws.services.dynamodbv2.transactions.TransactionManager;
public class Utils {

    public static void createTables(AmazonDynamoDB dynamoDB) {
        DynamoDBMapper dynamoDBMapper = new DynamoDBMapper(dynamoDB);

        createTable(Item.class, dynamoDBMapper, dynamoDB);
    }

    private static void createTable(Class<?> itemClass, DynamoDBMapper dynamoDBMapper, AmazonDynamoDB dynamoDB) {
        CreateTableRequest createTableRequest = dynamoDBMapper.generateCreateTableRequest(itemClass);
        createTableRequest.withProvisionedThroughput(new ProvisionedThroughput(1L, 1L));

        if (createTableRequest.getGlobalSecondaryIndexes() != null)
            for (GlobalSecondaryIndex gsi : createTableRequest.getGlobalSecondaryIndexes()) {
                gsi.withProvisionedThroughput(new ProvisionedThroughput(1L, 1L));
                gsi.withProjection(new Projection().withProjectionType("ALL"));
            }

        if (createTableRequest.getLocalSecondaryIndexes() != null)
            for (LocalSecondaryIndex lsi : createTableRequest.getLocalSecondaryIndexes()) {
                lsi.withProjection(new Projection().withProjectionType("ALL"));
            }

        if (!tableExists(dynamoDB, createTableRequest))
            dynamoDB.createTable(createTableRequest);

        waitForTableCreated(createTableRequest.getTableName(), dynamoDB);
        System.out.println("Created table for: " + itemClass.getCanonicalName());

    }
...

Below is some sample data access code (Java) to interact with the newly created table.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapperConfig;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBScanExpression;

import java.util.List;

public class ItemDao {

    private final DynamoDBMapper mapper;

    public ItemDao(AmazonDynamoDB dynamoDb) {
        this.mapper = new DynamoDBMapper(dynamoDb);
    }

    public Item put(Item item) {
        mapper.save(item, DynamoDBMapperConfig
                .builder()
                .withSaveBehavior(DynamoDBMapperConfig.SaveBehavior.CLOBBER)
                .build());

        return item;
    }

    public Item get(String id) {
        return mapper.load(Item.class, id);
    }

    public void update(Item item) {
        mapper.save(item);
    }

    public void delete(String id) {
        Item item = new Item();
        item.setId(id);

        mapper.delete(item);
    }

    public List<Item> getAll() {
        return mapper.scan(Item.class, new DynamoDBScanExpression());
    }
}
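
Putting the pieces together, a rough usage sketch might look like the following; it assumes the usual getters/setters on Item (elided above) and the Utils class shown earlier:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class Main {
    public static void main(String[] args) {
        AmazonDynamoDB dynamoDB = AmazonDynamoDBClientBuilder.standard().build();

        // Create the Items table defined by the annotated Item class above
        Utils.createTables(dynamoDB);

        ItemDao itemDao = new ItemDao(dynamoDB);

        Item item = new Item();
        item.setName("MacBook Pro");        // assumes setters matching the fields above
        item.setDescription("13 inch");
        item = itemDao.put(item);           // id and version are generated on save

        System.out.println("Stored item with id: " + item.getId());
        System.out.println("All items: " + itemDao.getAll());
    }
}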

 

DynamoDB Streams

This is a feature used for real-time updates, such as cross-region replication or real-time processing. This does not support scans or transactions.

Another popular use case for using DynamoDB Streams is to update CloudSearch.

DynamoDB Streams can be consumed using lower-level API code provided through the AWS developer libraries, or using the Amazon Kinesis Client Library with the DynamoDB Streams adapter. A common setup uses a Lambda function to update CloudSearch; the function is triggered by DynamoDB (configured in the Lambda console under the Triggers tab) and must have a role with access to both CloudSearch and DynamoDB.
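
A minimal sketch of such a Lambda handler (Java, using the aws-lambda-java-events classes); the CloudSearch update itself is left as a comment since it depends on the search domain:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

public class StreamHandler implements RequestHandler<DynamodbEvent, Void> {

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbStreamRecord record : event.getRecords()) {
            // Event name is INSERT, MODIFY or REMOVE
            context.getLogger().log(record.getEventName() + ": " + record.getDynamodb().getKeys());
            // A real handler would push record.getDynamodb().getNewImage() to CloudSearch here
        }
        return null;
    }
}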

 

DynamoDB Design Patterns

Like traditional SQL databases, DynamoDB can model relationships between tables. These can be one-to-one, one-to-many and many-to-many. Though no normalization rules apply here, it is still a good idea to break data into multiple tables for benefits such as:

  • DynamoDB tables have size limits; breaking data into multiple tables keeps us away from these limits
  • DynamoDB charges based on the size of records read and written, so tables with fewer fields save cost
  • more indexes can be created on each of the tables

One-to-one relationships are implemented using a common Id field in both tables, much like foreign keys in traditional databases; just keep track of the key mappings between tables.

One-to-many relationships are implemented using a partition key plus a sort key.

Many-to-many relationships are implemented using GSI or LSI keys, as sketched below.

Composite keys (keys requiring multiple fields) are defined using the partition key plus a GSI.
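
As a rough sketch of the many-to-many case, a hypothetical mapping table with a GSI might look like this (the table, index and field names are made up):

import com.amazonaws.services.dynamodbv2.datamodeling.*;

// Mapping table for a many-to-many relationship between users and items.
// The primary key (userId, itemId) answers "which items does this user have?";
// the GSI on itemId answers the reverse question, "which users have this item?".
@DynamoDBTable(tableName = "UserItems")
public class UserItem {

    @DynamoDBHashKey
    private String userId;

    @DynamoDBRangeKey
    @DynamoDBIndexHashKey(globalSecondaryIndexName = "ItemIndex")
    private String itemId;

    // getters/setters omitted
}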

 

Redshift

This is AWS’s data warehouse solution, which is based on PostgreSQL. It can scale to petabytes of data. Like CloudSearch, it requires a copy of the DynamoDB data. Some limitations of Redshift are:

  • table names must be less than 127 characters
  • table names must not be any of the reserved keywords

To use Redshift we would follow a workflow something like this:

  • Launch a Redshift cluster
  • Copy DynamoDB data
  • Perform the query (SQL)

 

AWS EMR and Apache Hadoop

Whereas DynamoDB is the NoSQL database responsible for reading and writing data, Hadoop is a tool/framework we use to perform data analysis on that data. The NoSQL database provides fast reads and writes with horizontal scalability (and dynamic changes to data formats/schemas) so that we can get data into our storage. Hadoop then takes that dataset and, using MapReduce, is able to compute over it quickly across a large distributed cluster. NoSQL and Hadoop work together to provide the ‘big data’ solution.

The Hadoop stack includes various software such as the HDFS storage system and the MapReduce query system that sits on top of it. AWS provides various tools to create a Hadoop cluster. AWS’s EMR service runs Hadoop projects; it is an elastic service and supports many tools such as Spark and Flink.

Apache Hive is an ad hoc data analysis tool for distributed data such as that stored in HDFS/Hadoop. It supports the SQL-92 specification; basically it converts the SQL into MapReduce jobs. Note that Hive is not a database and should not be treated like one because it has high latencies. Queries can be run directly against DynamoDB or against copies of the data on either S3 or HDFS. This would all be done inside AWS EMR.

 

AWS CloudWatch

This is AWS’s monitoring service, which collects and tracks metrics for various other AWS services such as DynamoDB. We can set alarms that trigger emails or other actions, such as auto-scaling (a sample alarm is sketched after the list below). Some of the metrics CloudWatch can monitor on DynamoDB are:

  • Throughput – Consumed Read/Write Capacity Units
  • Data Returned – number of records returned, size of records
  • Failures
  • Other
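
A minimal sketch of creating such an alarm from the Java SDK, assuming a table named Items and an example threshold; the alarm action (e.g. an SNS topic) is omitted:

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class ThrottleAlarmExample {
    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.standard().build();

        // Alarm when the Items table consumes more than 80 read capacity units,
        // summed over a 5 minute period
        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("ItemsHighReadCapacity")
                .withNamespace("AWS/DynamoDB")
                .withMetricName("ConsumedReadCapacityUnits")
                .withDimensions(new Dimension().withName("TableName").withValue("Items"))
                .withStatistic(Statistic.Sum)
                .withPeriod(300)
                .withEvaluationPeriods(1)
                .withThreshold(80.0)
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold));
    }
}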

 

AWS CloudTrail

CloudTrail stores metadata about all requests made inside AWS. It is a service that provides an audit trail and information for security analysis and troubleshooting errors. Such metadata includes:

  • Who made the request
  • API called
  • Timestamp
  • Call parameters

All this information can be stored inside DynamoDB or S3. It is also forwarded to CloudWatch so that we can monitor it and set alarms.

 

References

AWS DynamoDB Deep Dive (pluralsight)
Ivan Mushketyk; 2017
https://app.pluralsight.com/player?course=aws-dynamodb-deep-dive

Hadoop and NoSQL
https://mapr.com/blog/hadoop-vs-nosql-whiteboard-walkthrough/