AWS Storage and Data Services

This article covers storage and data services in AWS.

  • EBS
  • EC2 Instance Store
  • EFS
  • S3
  • S3 Glacier
  • ElastiCache
  • EMR
  • CloudFront
  • Storage Gateway
  • Import / Export
  • Snowball / Snowmobile

 

EBS

Amazon Elastic Block Store (Amazon EBS) provides block-level storage volumes for use with EC2 instances. EBS volumes behave like raw, unformatted block devices. You can mount these volumes as devices on your instances, and you can mount multiple volumes on the same instance, but each volume can be attached to only one instance at a time. Data on an EBS volume is automatically replicated within its Availability Zone. The data is not lost when the attached instance is stopped, or even terminated, as long as the EBS volume is left intact.

 

SSD vs HDD

SSD is good for random-access workloads, whereas HDD is good for sequential access such as streaming workloads. Imagine the read arm moving about the disk platter while the spindle spins; it's best to keep reads sequential, in one location. Note that the differences below also affect pricing.

Note that the following EBS volume types are available (a short provisioning sketch follows this list):

  • gp2 – General Purpose SSD: balance of price and performance for variety of workloads
  • io1 – Provisioned IOPS SSD: Highest-performance SSD volume for mission-critical low-latency or high-throughput workloads
  • st1 – Throughput Optimized HDD: Low-cost HDD volume designed for frequently accessed, throughput-intensive workloads
  • sc1 – Cold HDD: Lowest cost HDD volume designed for less frequently accessed workloads
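
As a rough illustration, the following boto3 sketch provisions an io1 volume and attaches it to a single instance. The instance ID, Availability Zone, size, and IOPS below are hypothetical values, not from the article:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a 100 GiB Provisioned IOPS (io1) volume in the instance's Availability Zone
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,             # GiB
    VolumeType="io1",     # one of gp2, io1, st1, sc1
    Iops=3000,
    Encrypted=True,
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach the volume; an EBS volume can be attached to only one instance at a time
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",   # hypothetical instance ID
    Device="/dev/sdf",
)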

 

EC2 Instance Store

An instance store provides temporary block-level storage for your instance. This storage is located on disks that are physically attached to the host computer. Instance store is ideal for temporary storage of information that changes frequently, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of web servers.

An instance can have one or more instance store volumes, depending on the instance type. See the table below for examples.

Instance Type    Instance Store Volumes     Type
c1.medium        1 x 350 GB                 HDD
c1.xlarge        4 x 420 GB (1.6 TB)        HDD
c3.large         2 x 16 GB (32 GB)          SSD

 

Amazon EFS

Amazon Elastic File System (Amazon EFS) provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources. Amazon EFS supports authentication, authorization, and encryption capabilities to help you meet your security and compliance requirements. Amazon EFS supports two forms of encryption for file systems, encryption in transit and encryption at rest.

Features of EFS

  • Supports the NFSv4 protocol
  • Pay only for the storage used (no pre-provisioning)
  • Can scale up to petabytes
  • Can support thousands of concurrent NFS connections
  • Data is stored across multiple AZs within a region
  • Read-after-write consistency

Managing File System

You mount your file system on an EC2 instance in your virtual private cloud (VPC) using a mount target that you create for the file system. Managing file system network accessibility refers to managing the mount targets.

EC2 instances in a VPC access an Amazon EFS file system through a mount target created in each Availability Zone of the VPC.
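
A minimal boto3 sketch of creating a file system and a mount target is shown below; the subnet and security group IDs are hypothetical, and the security group must allow inbound NFS (TCP port 2049):

import boto3, time

efs = boto3.client("efs", region_name="us-east-1")

# Create an encrypted-at-rest file system
fs = efs.create_file_system(
    CreationToken="example-efs-token",     # idempotency token (hypothetical)
    PerformanceMode="generalPurpose",
    Encrypted=True,
)

# Wait until the file system is available before adding mount targets
while efs.describe_file_systems(FileSystemId=fs["FileSystemId"])["FileSystems"][0]["LifeCycleState"] != "available":
    time.sleep(5)

# Create a mount target in one subnet; instances in that AZ mount the file system through it
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-0123456789abcdef0",        # hypothetical subnet
    SecurityGroups=["sg-0123456789abcdef0"],    # hypothetical security group
)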

DataSync

AWS DataSync is a data transfer service that simplifies, automates, and accelerates moving and replicating data between on-premises storage systems and AWS storage services over the internet or AWS Direct Connect. AWS DataSync can transfer your file data, and also file system metadata such as ownership, time stamps, and access permissions.

Using amazon-efs-utils

The amazon-efs-utils package is an open-source collection of Amazon EFS tools. There’s no additional cost to use amazon-efs-utils, and you can download these tools from GitHub here: https://github.com/aws/efs-utils. The amazon-efs-utils package is available in the Amazon Linux package repositories, and you can build and install the package on other Linux distributions.

Using encryption of data in transit with the Amazon EFS mount helper requires OpenSSL version 1.0.2 or newer, and a version of stunnel that supports both OCSP and certificate hostname checking. The Amazon EFS mount helper uses the stunnel program for its TLS functionality.

 

S3

S3 is limitless object (blob) storage. Because S3 is distributed, it offers read-after-write consistency for PUTs of new objects and eventual consistency for overwrite PUTs and DELETEs. Data is stored across multiple devices in multiple facilities and is designed to sustain the loss of 2 facilities concurrently. As such, S3 has a durability of 99.999999999% (11 nines) and availability of 99.99%. It supports encryption at rest (SSE-S3, SSE-KMS, SSE-C) and in transit (using HTTPS). It also supports versioning.

The main S3 storage classes are:

  • S3 Standard: 99.99% availability, 99.999999999% (11 nines) durability
  • S3 IA (Infrequently Accessed): lower storage fee than Standard, but there is a retrieval fee. Ideal if accessed less than once a month.
  • Reduced Redundancy Storage (RRS): 99.99% availability but a lower durability of 99.99%.
  • Glacier: very cheap archival storage; minutes to hours for retrieval
  • Glacier Deep Archive: cheaper still; days (1-2) for retrieval

S3 object URLs include the bucket name; the object key is the file name. Bucket names are globally unique. Bucket URLs do not enforce SSL by default. Virtual-hosted-style URLs contain the bucket name as a subdomain of s3, whereas path-style URLs have the bucket name after the endpoint. The region name is optional in the URL (it can be omitted).

Note that if using static website hosting in S3, the URL will look like:

http://<bucket-name>.s3-website-<AWS-region>.amazonaws.com

 

S3 Features

Some additional features of S3:

  • Transfer Acceleration: Enables faster transfer, additional cost
  • Each file/object in S3 can be from 0 bytes to 5 TB. You can have 0-byte files in S3 (with metadata)
  • Unlimited storage
  • Files stored in buckets
  • Uses universal namespace
  • Supports encryption at rest
  • Read after Write consistency for PUTS of new objects
  • Eventual Consistency for Overwrite PUTS of existing objects and DELETES
  • S3 object components:
    • Key
    • Value
    • Version ID
    • Metadata
    • Sub resources
    • ACL
    • Torrent
  • Successful uploads will respond with HTTP 200
  • Multipart upload – Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object’s data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
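
As a hedged example, boto3's transfer manager switches to multipart uploads automatically once a size threshold is crossed; the bucket and file names below are hypothetical:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Use multipart uploads for objects of 100 MB or more, sending 25 MB parts in parallel;
# a failed part can be retried without re-sending the whole object.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=25 * 1024 * 1024,
    max_concurrency=8,
)

s3.upload_file(
    Filename="backup.tar.gz",         # hypothetical local file
    Bucket="my-example-bucket",       # hypothetical bucket
    Key="backups/backup.tar.gz",
    Config=config,
)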

S3 Options on Interface

  • Properties
    • Versioning
      • Can be suspended (stop creating new versions, but keeps previous versions)
      • Cannot be disabled once enabled
      • Supports MFA delete capability
      • Delete markers hide the object as if it's deleted, but the previous versions are retained. A permanent delete will completely remove the object plus its previous versions
    • Logging
      • access log
      • object-level logging, api activity logging via cloudtrail
    • Static website hosting
    • Encryption
      • AES-256
      • AWS-KMS
    • Tagging
    • Object lock
    • Transfer acceleration
    • Requester pays (requester pays transfer fees)
  • Permissions
    • Block Public access (default)
    • Access control list (ACL)
      • Owner Access
      • Access for other AWS account
      • Public access
      • S3 log delivery group
    • Bucket policy
    • CORS configuration
  • Management
    • Lifecycle policy
      • You can manage an object’s lifecycle by using a lifecycle rule, which defines how Amazon S3 manages objects during their lifetime.
      • Lifecycle rules enable you to automatically transition objects to the Standard – IA and/or to the Glacier storage class.
      • Using a lifecycle rule, you can automatically expire objects based on your retention needs or clean up incomplete multipart uploads.
    • Replication
      • Cross-region replication can apply to the whole bucket or to objects under a prefix (requires versioning and an IAM policy)
      • The replicated bucket can have a different storage class (such as IA for backups)
      • Replication only applies to new or changed objects; it will not replicate existing objects. Using the AWS CLI, you can copy all existing bucket contents to the new bucket
      • Delete markers are replicated
        • Deleting individual versions or delete markers will not be replicated
    • Analytics
    • Metrics
    • Inventory
  • Access Points
    • Access points can be used to provide access to your bucket. The S3 console doesn’t support using virtual private cloud (VPC) access points to access bucket resources. To access bucket resources from a VPC access point, you’ll need to use the AWS CLI, AWS SDK, or Amazon S3 REST API.

 

S3 Lifecycle Policies

Before you transition objects from the STANDARD or STANDARD_IA storage classes to STANDARD_IA or ONEZONE_IA, you must store them for at least 30 days in the STANDARD storage class. The same 30-day minimum applies when you specify a transition from STANDARD_IA storage to ONEZONE_IA or INTELLIGENT_TIERING storage.

Using lifecycle configuration, you can transition objects to the GLACIER or DEEP_ARCHIVE storage classes for archiving. When you choose the GLACIER or DEEP_ARCHIVE storage class, your objects remain in Amazon S3. You cannot access them directly through the separate Amazon S3 Glacier service.
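
A minimal lifecycle configuration sketch with boto3 follows; the bucket name, prefix, and day counts are hypothetical choices, not values from the article:

import boto3

s3 = boto3.client("s3")

# Transition objects under "logs/" to STANDARD_IA after 30 days and GLACIER after 90 days,
# expire them after 365 days, and clean up incomplete multipart uploads after 7 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)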

S3 Transfer Acceleration

Transfer Acceleration uses edge locations to allow faster transfer speeds. Amazon S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. It takes advantage of Amazon CloudFront's globally distributed edge locations: as data arrives at an edge location, it is routed to Amazon S3 over an optimized network path. (A short sketch of enabling and using acceleration follows the list of reasons below.)

You might want to use Transfer Acceleration on a bucket for various reasons, including the following:

  • You have customers that upload to a centralized bucket from all over the world.
  • You transfer gigabytes to terabytes of data on a regular basis across continents.
  • You are unable to utilize all of your available bandwidth over the Internet when uploading to Amazon S3.
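
A minimal sketch, assuming a hypothetical bucket name, of enabling acceleration and then uploading through the accelerate endpoint:

import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable Transfer Acceleration on the bucket (an additional per-GB charge applies)
s3.put_bucket_accelerate_configuration(
    Bucket="my-example-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Route transfers through the <bucket>.s3-accelerate.amazonaws.com endpoint
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("big-dataset.tar", "my-example-bucket", "datasets/big-dataset.tar")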

Optimizing S3 Performance

Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by parallelizing reads. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second.
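
The sketch below illustrates the idea of spreading reads across prefixes; the bucket and prefix layout are hypothetical, and real request rates depend on the workload:

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"                        # hypothetical bucket
PREFIXES = [f"shard-{i:02d}/" for i in range(10)]   # hypothetical prefix layout

def read_prefix(prefix):
    # Each prefix supports its own request rate, so reads across prefixes scale out in parallel
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", [])
    return [s3.get_object(Bucket=BUCKET, Key=o["Key"])["Body"].read() for o in objects]

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(read_prefix, PREFIXES))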

S3 Pricing

There are 4 things to consider when estimating costs for S3:

  • Storage class
  • Storage amount
  • Number of Requests
  • Amount of Data Transfer

S3 Presigned URLs vs CloudFront Signed URLs vs Origin Access Identity (OAI)

In S3, we can open up object access with pre-signed URLs, which use the owner's security credentials to grant others time-limited permission to the object. The access expires after the specified time.
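
For example, a minimal boto3 sketch (the bucket and key are hypothetical) that grants one hour of read access to a private object:

import boto3

s3 = boto3.client("s3")

# The URL is signed with the caller's credentials and expires after ExpiresIn seconds
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-example-bucket", "Key": "reports/q1.pdf"},
    ExpiresIn=3600,
)
print(url)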

Both S3 and CloudFront have URL-signing features that work differently. S3 refers to them as pre-signed URLs, while CloudFront refers to them as signed URLs and signed cookies. A signed URL includes additional information, for example an expiration date and time, that gives you more control over access to your content. This additional information appears in a policy statement, which is based on either a canned policy or a custom policy.

In general, if you’re using an Amazon S3 bucket as the origin for a CloudFront distribution, you can either allow everyone to have access to the files there, or you can restrict access.  If you limit access by using, for example, CloudFront signed URLs or signed cookies, you also won’t want people to be able to view files by simply using the direct URL for the file. Instead, you want them to only access the files by using the CloudFront URL, so your protections work.

An origin access identity is an entity inside CloudFront that can be authorized by bucket policy to access objects in a bucket. When CloudFront uses an origin access identity to access content in a bucket, CloudFront uses the OAI’s credentials to generate a signed request that it sends to the bucket to fetch the content. This signature is not accessible to the viewer.

Note that if you use an Amazon S3 bucket configured as a website endpoint, you must set it up with CloudFront as a custom origin and you can’t use the origin access identity feature.
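
A sketch of the corresponding bucket policy, applied with boto3, is shown below; the OAI ID and bucket name are hypothetical, and the policy grants read access only to the OAI so viewers cannot use the direct S3 URL:

import boto3, json

s3 = boto3.client("s3")

oai_id = "E2EXAMPLEOAIID"   # hypothetical origin access identity ID
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCloudFrontOAIReadOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": f"arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity {oai_id}"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-example-bucket/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="my-example-bucket", Policy=json.dumps(policy))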

 

S3 Glacier

S3 Glacier is an extremely low-cost storage service that provides durable storage with security features for data archiving and backup. With S3 Glacier, customers can store their data cost-effectively for months, years, or even decades. The core concepts of the Amazon S3 Glacier (S3 Glacier) data model are vaults and archives. S3 Glacier is a REST-based web service and can be used as a standalone service.

Vault

A vault is a container for storing archives. When you create a vault, you specify a vault name and the AWS Region in which you want to create it. An AWS account can create up to 1,000 vaults per AWS Region. A vault can be locked to enforce write-once, read-many (WORM) access. The following are the vault naming requirements:

  • Names can be between 1 and 255 characters long.
  • Allowed characters are a–z, A–Z, 0–9, ‘_’ (underscore), ‘-‘ (hyphen), and ‘.’ (period).

Archive

An archive is any object, such as a photo, video, or document, that you store in a vault. It is the base unit of storage in Amazon S3 Glacier (S3 Glacier). Each archive has a unique ID and an optional description. When you upload an archive, S3 Glacier returns a response that includes an archive ID, which is unique within the AWS Region in which the archive is stored.
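
A minimal boto3 sketch of creating a vault and uploading an archive follows; the vault name and file are hypothetical, and the returned archive ID is needed later to retrieve or delete the archive:

import boto3

glacier = boto3.client("glacier", region_name="us-east-1")

# "-" means the current account; vault names may use a-z, A-Z, 0-9, '_', '-', '.'
glacier.create_vault(accountId="-", vaultName="my-backup-vault")

# Upload an archive; keep the archiveId from the response
with open("photos-2020.zip", "rb") as f:    # hypothetical local file
    response = glacier.upload_archive(
        accountId="-",
        vaultName="my-backup-vault",
        archiveDescription="2020 photo archive",
        body=f,
    )
archive_id = response["archiveId"]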

 

EMR (Elastic MapReduce)

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type, giving each node a role in a distributed application like Apache Hadoop.

The node types in Amazon EMR are as follows:

  • Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it’s possible to create a single-node cluster with only the master node.
  • Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
  • Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.

By default, Amazon EMR uses YARN (Yet Another Resource Negotiator), which is a component introduced in Apache Hadoop 2.0 to centrally manage cluster resources for multiple data-processing frameworks.
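
A hedged boto3 sketch of launching a small cluster with one master node and two core nodes follows; the release label, instance types, roles, and log bucket are hypothetical placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-spark-cluster",
    ReleaseLabel="emr-5.30.0",                      # hypothetical EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,        # keep the cluster running after steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",              # default EMR roles assumed to exist
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-example-bucket/emr-logs/",      # hypothetical log bucket
)
cluster_id = response["JobFlowId"]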

 

CloudFront

CloudFront can serve static and dynamic content. It can serve content from the following origins:

  • S3
  • EC2
  • ELB
  • HTTP servers

The origins you set up with origin failover can be any combination of AWS origins like EC2 instances, Amazon S3 buckets, or Media Services, or non-AWS origins like an on-premises HTTP server.

CloudFront also provides SSL and can work with AWS Shield Standard/Advanced and AWS WAF (Web Application Firewall).

 

ElastiCache

Caching can be done at the network layer using CloudFront or at the database layer using ElastiCache. CloudFront edge locations will cache blob data such as S3. ElastiCache will sit in front of the databases (such as MongoDB, DynamoDB, RDS) and cache database data. When using ElastiCache we configure it with either Memcached or Redis.

Good objects to store in cache are Session State, Product Catalog, and even Shopping Cart. A Bank Account Balance would not be a good object to cache.

Memcached

A Memcached cluster can run up to 20 nodes.

Redis

A Redis cluster has only a single node, but multiple clusters can be grouped together into a replication group.

Backup and Restore

Amazon ElastiCache clusters running Redis can back up their data. You can use the backup to restore a cluster or seed a new cluster. The backup consists of the cluster’s metadata, along with all of the data in the cluster. All backups are written to Amazon Simple Storage Service (Amazon S3), which provides durable storage. Backups can be scheduled for automatic backups or run manually by taking a snapshot.
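
As a sketch (the cluster IDs, node type, and snapshot name are hypothetical), a manual Redis backup and a restore into a new cluster might look like:

import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

# Take a manual backup (snapshot) of a single-node Redis cluster
elasticache.create_snapshot(
    CacheClusterId="my-redis-cluster",
    SnapshotName="my-redis-backup-001",
)

# Seed a new Redis cluster from that backup
elasticache.create_cache_cluster(
    CacheClusterId="my-redis-restored",
    Engine="redis",
    CacheNodeType="cache.t3.micro",
    NumCacheNodes=1,
    SnapshotName="my-redis-backup-001",
)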

ElastiCache with Memcached does not have a backup function.

Memcached vs Redis

https://aws.amazon.com/elasticache/redis-vs-memcached/

Select Memcached if you have these requirements:

  • You want the simplest model possible.
  • You need to run large nodes with multiple cores or threads.
  • You need the ability to scale out/in, adding and removing nodes as demand on your system increases and decreases.
  • You want to partition your data across multiple shards.
  • You need to cache objects, such as database query results.

Select Redis if you have these requirements:

  • You need complex data types, such as strings, hashes, lists, and sets.
  • You need to sort or rank in-memory data-sets.
  • You want persistence of your key store.
  • You want to replicate your data from the primary to one or more read replicas for read intensive applications.
  • You need automatic failover if your primary node fails.
  • You want publish and subscribe (pub/sub) capabilities—to inform clients about events on the server.
  • You want backup and restore capabilities.
  • Redis authentication tokens enable Redis to require a token (password) before allowing clients to execute commands, thereby improving data security.

 

Storage Gateway

AWS Storage Gateway connects an on-premises software appliance with cloud-based storage to provide seamless integration with data security features between your on-premises IT environment and the AWS storage infrastructure. You can use the service to store data in the AWS Cloud for scalable and cost-effective storage that helps maintain data security.

AWS Storage Gateway offers file-based, volume-based, and tape-based storage solutions:

File Gateway (NFS)

A file gateway supports a file interface into Amazon Simple Storage Service (Amazon S3) and combines a service and a virtual software appliance. By using this combination, you can store and retrieve objects in Amazon S3 using industry-standard file protocols such as Network File System (NFS) and Server Message Block (SMB). The software appliance, or gateway, is deployed into your on-premises environment as a virtual machine (VM) running on VMware ESXi or Microsoft Hyper-V hypervisor. The gateway provides access to objects in S3 as files or file share mount points. With a file gateway, you can do the following:

  • You can store and retrieve files directly using the NFS version 3 or 4.1 protocol.
  • You can store and retrieve files directly using the SMB protocol, versions 2 and 3.
  • You can access your data directly in Amazon S3 from any AWS Cloud application or service.
  • You can manage your Amazon S3 data using lifecycle policies, cross-region replication, and versioning. You can think of a file gateway as a file system mount on S3.

Volume Gateway (block based storage)

A volume gateway provides cloud-backed storage volumes that you can mount as Internet Small Computer System Interface (iSCSI) devices from your on-premises application servers. Since it is block storage, you can store operating systems, database systems, etc. It's like EBS.

The volume gateway is deployed into your on-premises environment as a VM running on VMware ESXi or Microsoft Hyper-V hypervisor.

The gateway supports the following volume configurations:

  • Cached volumes – You store your data in Amazon Simple Storage Service (Amazon S3) and retain a copy of frequently accessed data subsets locally. Cached volumes offer a substantial cost savings on primary storage and minimize the need to scale your storage on-premises. You also retain low-latency access to your frequently accessed data.
  • Stored volumes – If you need low-latency access to your entire dataset, first configure your on-premises gateway to store all your data locally. Then asynchronously back up point-in-time snapshots of this data to Amazon S3. This configuration provides durable and inexpensive offsite backups that you can recover to your local data center or Amazon EC2. For example, if you need replacement capacity for disaster recovery, you can recover the backups to Amazon EC2.

Tape Gateway / Gateway Virtual Tape Library (VTL)

A tape gateway provides cloud-backed virtual tape storage. The tape gateway is deployed into your on-premises environment as a VM running on VMware ESXi or Microsoft Hyper-V hypervisor.

With a tape gateway, you can cost-effectively and durably archive backup data in GLACIER or DEEP_ARCHIVE. A tape gateway provides a virtual tape infrastructure that scales seamlessly with your business needs and eliminates the operational burden of provisioning, scaling, and maintaining a physical tape infrastructure.

You can run AWS Storage Gateway either on-premises as a VM appliance, as a hardware appliance, or in AWS as an Amazon Elastic Compute Cloud (Amazon EC2) instance. You deploy your gateway on an EC2 instance to provision iSCSI storage volumes in AWS. You can use gateways hosted on EC2 instances for disaster recovery, data mirroring, and providing storage for applications hosted on Amazon EC2.

 

AWS Import / Export – Snowball / Snowmobile

The AWS Import/Export Disk service has been replaced by Snowball.

Snowball

The AWS Snowball service uses physical storage devices to transfer large amounts of data between Amazon Simple Storage Service (Amazon S3) and your onsite data storage location at faster-than-internet speeds. By working with AWS Snowball, you can save time and money. Snowball provides powerful interfaces that you can use to create jobs, track data, and track the status of your jobs through to completion.

Snowball is intended for transferring large amounts of data. If you want to transfer less than 10 TB of data between your on-premises data centers and Amazon S3, Snowball might not be your most economical choice.

AWS Snowball with the Snowball device has the following features:

  • 80 TB and 50 TB models are available in US Regions; 50 TB model available in all other AWS Regions.
  • Enforced encryption protects your data at rest and in physical transit.
  • There’s no need to buy or maintain your own hardware devices.
  • You can manage your jobs through the AWS Snowball Management Console or programmatically with the job management API (a short sketch follows this list).
  • You can perform local data transfers between your on-premises data center and a Snowball. You can do these transfers through the Snowball client, a standalone downloadable client. Or you can transfer programmatically using Amazon S3 REST API calls with the downloadable Amazon S3 Adapter for Snowball. For more information, see Transferring Data with a Snowball.
  • The Snowball is its own shipping container, and its E Ink display changes to show your shipping label when the Snowball is ready to ship. For more information, see Shipping Considerations for AWS Snowball.
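
A hedged sketch of creating an import job through the job management API via boto3 follows; the bucket ARN, address ID, and role ARN are hypothetical placeholders:

import boto3

snowball = boto3.client("snowball", region_name="us-east-1")

# Create an import job: AWS ships a Snowball, you load data on-site, ship it back,
# and the contents are imported into the target S3 bucket.
job = snowball.create_job(
    JobType="IMPORT",
    SnowballCapacityPreference="T80",                                  # 80 TB model
    Resources={"S3Resources": [{"BucketArn": "arn:aws:s3:::my-example-bucket"}]},
    AddressId="ADID00000000-0000-0000-0000-000000000000",              # hypothetical shipping address
    RoleARN="arn:aws:iam::123456789012:role/snowball-import-role",     # hypothetical IAM role
    ShippingOption="SECOND_DAY",
    Description="Import on-premises archive into S3",
)
print(job["JobId"])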

Snowball Edge

The AWS Snowball devices support different use cases. Both the Snowball and the Snowball Edge can import data into Amazon S3 and export data from Amazon S3. Only the Snowball Edge additionally supports durable local storage, local compute with AWS Lambda, Amazon EC2 compute instances, use in a cluster of devices, use with AWS IoT Greengrass (IoT), and transferring files through NFS with a GUI.
The Snowball is a large briefcase-sized device that weighs less than 50 pounds and is used to transfer data. The Snowball Edge is a large briefcase-sized device that weighs less than 50 pounds and is used to transfer, store, or compute on data.
Storage capacity (usable capacity): the Snowball comes in 50 TB (42 TB usable, US regions only) and 80 TB (72 TB usable) models; the Snowball Edge offers 100 TB (83 TB usable), or 100 TB clustered (45 TB per node).

Each device has the following physical interfaces for management purposes:

The Snowball has an E Ink display, used to track shipping information and configure your IP address. The Snowball Edge has an LCD display, used to manage connections and provide some administrative functions.

 

 

References

 

Storage Gateway
https://docs.aws.amazon.com/storagegateway/latest/userguide/WhatIsStorageGateway.html