6.5 Redshift

Introduction

Amazon Redshift is a fast, powerful, fully managed, petabyte-scale data warehouse service in the cloud. Customers can start small for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of the cost of most other data warehousing solutions.

See the OLAP example in section 6.1.

Data warehousing is all about columns. It is not about individual records; it is about aggregating whole columns (for example, summing them) and joining the results together. For that reason, data warehousing databases use a different architecture from transactional databases, both at the database layer and at the infrastructure layer.
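To make the row-versus-column idea concrete, here is a minimal sketch in plain Python. The sample records and the helper name to_columns are made up for illustration; this is not how Redshift stores data internally, only a picture of why an aggregate over one column touches less data in a column layout.

```python
# Hypothetical sales records, stored row by row (one dict per record).
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: each whole record has to be fetched to reach its "amount" field.
row_total = sum(record["amount"] for record in rows)

# Column layout: only the "amount" column is read; the others are never touched.
column_total = sum(columns["amount"])

print(row_total, column_total)
```

Both totals are the same; the difference is how much unrelated data has to be read to compute them.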

Redshift Configuration

  • Single Node (up to 160 GB)

  • Multi-Node

    • Leader Node (manages client connections and receives queries).

    • Compute Node (stores data and performs queries and computations). Up to 128 Compute Nodes.

Redshift - Three things make it up to 10 times faster

  • Columnar Data Storage: Instead of storing data as a series of rows, Redshift organizes the data by columns. Unlike row-based systems, which are ideal for transaction processing, column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets. Since only the columns involved in the queries are processed and columnar data is stored sequentially on the storage media, column-based systems require far fewer I/O operations, greatly improving query performance.

  • Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. For example, a row may contain many different data types, so compressing row by row is less effective; within a single column every value has the same data type, and because storage is also column-based, compressing column by column saves more space. In addition, Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Redshift automatically samples your data and selects the most appropriate compression scheme.

  • Massively Parallel Processing (MPP): Redshift automatically distributes data and query load across all nodes. Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as your data warehouse grows.
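The compression point in the list above can be illustrated with a small sketch. It uses run-length encoding purely as an example technique and a made-up six-row table; it does not claim to show which encoding Redshift would select for real data.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical table with two columns: (region, status).
table = [
    ("EU", "paid"), ("EU", "paid"), ("EU", "open"),
    ("US", "paid"), ("US", "paid"), ("US", "paid"),
]

# Row layout: values from different columns are interleaved on disk.
row_stream = [value for row in table for value in row]

# Column layout: each column's values sit next to each other.
region_column = [row[0] for row in table]
status_column = [row[1] for row in table]

row_runs = run_length_encode(row_stream)
region_runs = run_length_encode(region_column)
status_runs = run_length_encode(status_column)

print(len(row_runs), len(region_runs) + len(status_runs))
```

In the interleaved row stream no two neighbouring values are equal, so nothing collapses, while the two columns together need far fewer runs. Real column encodings are more varied, but the effect comes from the same property: similar values stored next to each other.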

Pricing - very cheap

  • Compute Node Hours: the total number of hours run across all your compute nodes in the billing period. You are billed 1 unit per node per hour, so a 3-node data warehouse cluster running continuously for a 30-day month would incur 2,160 instance hours (3 nodes × 24 hours × 30 days). You are not charged for leader node hours; only compute nodes incur charges.

  • Backup

  • Data transfer (only within a VPC, not outside it)
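The compute node hours arithmetic above can be checked with a few lines of Python. The function names are made up for this sketch, and applying the $0.25 per hour entry price from the introduction to every node is an assumption for illustration only; actual rates depend on node type and region, and backup and data transfer charges are not included.

```python
HOURS_PER_DAY = 24


def compute_node_hours(compute_nodes, days):
    """Billable hours: one unit per compute node per hour (leader node excluded)."""
    return compute_nodes * HOURS_PER_DAY * days


def estimated_compute_cost(compute_nodes, days, hourly_rate):
    """Rough compute-only cost; hourly_rate is an assumed per-node price."""
    return compute_node_hours(compute_nodes, days) * hourly_rate


# The example from the text: 3 compute nodes running for a 30-day month.
hours = compute_node_hours(3, 30)

# Assumed rate of $0.25 per node per hour, for illustration only.
cost = estimated_compute_cost(3, 30, 0.25)

print(hours, cost)
```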

Security

  • Encrypted in transit using SSL.

  • Encrypted at rest using AES-256 encryption.

  • By default Redshift takes care of key management. Alternatively, you can:

    • Manage your own keys through an HSM (hardware security module)

    • Use AWS KMS (Key Management Service)
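As a rough sketch of how the three key management options might be expressed when creating a cluster, the helper below builds only the encryption-related request parameters. The parameter names (Encrypted, KmsKeyId, HsmClientCertificateIdentifier, HsmConfigurationIdentifier) follow the boto3 Redshift create_cluster request as I understand it and should be checked against the current documentation; the helper name and all identifier values are hypothetical, and nothing here calls AWS.

```python
def encryption_settings(kms_key_id=None, hsm_client_certificate=None,
                        hsm_configuration=None):
    """Build encryption-related keyword arguments for a cluster request.

    With no arguments, encryption is on and key management is left to Redshift.
    """
    uses_hsm = hsm_client_certificate is not None or hsm_configuration is not None
    if kms_key_id is not None and uses_hsm:
        raise ValueError("Choose either a KMS key or an HSM, not both.")
    if uses_hsm and (hsm_client_certificate is None or hsm_configuration is None):
        raise ValueError("An HSM needs both a client certificate and a configuration.")

    settings = {"Encrypted": True}
    if kms_key_id is not None:
        settings["KmsKeyId"] = kms_key_id
    elif uses_hsm:
        settings["HsmClientCertificateIdentifier"] = hsm_client_certificate
        settings["HsmConfigurationIdentifier"] = hsm_configuration
    return settings


# Hypothetical identifiers, for illustration only.
default_keys = encryption_settings()
kms_keys = encryption_settings(kms_key_id="example-kms-key-id")
hsm_keys = encryption_settings(hsm_client_certificate="example-cert",
                               hsm_configuration="example-hsm-config")

print(default_keys, kms_keys, hsm_keys)
```

The resulting dictionary would be merged into the other cluster parameters (identifier, node type, credentials) before the request is sent.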

Availability

  • Currently only available in 1 AZ; it is not designed to be multi-AZ. Redshift is not customer-facing, so it does not need to be up 24/7 the way a production database does. It is used internally for management reporting and analytical queries, so it doesn't need very high availability.

  • Snapshots can be restored to a new AZ in the event of an outage.
