10.3 Pillar Two - Reliability

The reliability pillar covers the ability of a system to recover from service or infrastructure outages and disruptions, as well as the ability to dynamically acquire computing resources to meet demand.

Design Principles

- Test recovery procedures (e.g. Netflix "Simian Army" in their production environment)
- Automatically recover from failure
- Scale horizontally to increase aggregate system availability
- Stop guessing capacity
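
The claim that horizontal scaling raises aggregate availability can be made concrete with a little arithmetic. The sketch below assumes that instances fail independently and that the system counts as "up" as long as at least one instance is running; both are simplifying assumptions, and the function name and the 99% figure are made up for illustration.

```python
def aggregate_availability(instance_availability: float, instances: int) -> float:
    """Probability that at least one of `instances` independent replicas is up.

    Assumes independent failures, which real deployments only approximate.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The system is down only when every replica is down at the same time.
    all_down = (1.0 - instance_availability) ** instances
    return 1.0 - all_down


if __name__ == "__main__":
    # Hypothetical per-instance availability of 99%.
    for n in (1, 2, 3):
        print(n, f"{aggregate_availability(0.99, n):.6f}")
```

Under these assumptions, going from one instance to two moves the figure from 0.99 to roughly 0.9999, which is why several small instances are usually preferred over one large one.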

Definition

Reliability in the cloud consists of 3 areas:

Foundations
- Best practices: Before building a house, you always make sure that the foundations are in place before you lay the first brick. Similarly, before architecting any system, you need to make sure you have the prerequisite foundations. In traditional IT, one of the first things you should consider is the size of the communications link (e.g. bandwidth) between your HQ and your datacenter. If you misprovision this link, it can take 3 - 6 months to upgrade, which can cause huge disruption to your traditional IT estate.
- With AWS, most of the foundations are handled for you. The cloud is designed to be essentially limitless, meaning that AWS handles the networking and compute requirements itself. However, AWS does set service limits to stop customers from accidentally over-provisioning resources. Search for "AWS Service Limits" online to check the current values.
- Questions to ask yourself:
  - How are you managing AWS service limits for your account?
  - How are you planning your network topology on AWS?
  - Do you have an escalation path to deal with technical issues? (For example, do you have a technical account manager with AWS?)
- AWS Key services:
  - IAM
  - VPC
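
One possible way to approach the service limits question is to compare current usage against each limit and flag anything close to its ceiling. The sketch below is purely illustrative: the `LimitUsage` type, the limit names, and the numbers are all hypothetical, and in a real account the figures would have to come from AWS rather than being hard-coded.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LimitUsage:
    name: str   # hypothetical label for a service limit
    limit: int  # maximum allowed by the limit
    used: int   # how much of the limit is currently consumed


def near_limit(items: Iterable[LimitUsage], threshold_percent: int = 80) -> List[str]:
    """Return the names of limits whose usage is at or above the threshold."""
    flagged = []
    for item in items:
        # Integer arithmetic avoids floating-point surprises at the boundary.
        if item.limit > 0 and item.used * 100 >= item.limit * threshold_percent:
            flagged.append(item.name)
    return flagged


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    usage = [
        LimitUsage("vpcs-per-region", limit=5, used=4),
        LimitUsage("elastic-ips", limit=5, used=1),
    ]
    print(near_limit(usage))
```

A report like this could feed an alert so that a limit increase is requested before a deployment fails, rather than after.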
Change management
- Best practices: You need to be aware of how change affects a system so that you can plan proactively around it. Monitoring allows you to detect any changes to your environment and react. In traditional systems, change control is done manually and is carefully coordinated with auditing.
- With AWS, things are a lot easier. You can use CloudWatch to monitor your environment and services such as Auto Scaling to automate change in response to changes in your production environment. You can also create an image (AMI) from an existing instance and use it to provision a new instance of a different instance type.
- Questions to ask yourself:
  - How does your system adapt to changes in demand?
  - How are you monitoring AWS resources?
  - How are you executing change management? (e.g. when you detect a huge increase in demand, do you auto-scale a fleet of EC2 instances to meet it?)
- AWS Key services:
  - CloudTrail
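
To make "adapting to changes in demand" more concrete, here is a simplified sketch of how a scaling decision might be computed from a monitored metric. It assumes a proportional rule (scale the fleet by the ratio of the observed metric to a target value) clamped between a minimum and maximum size. This is an illustration of the idea only, not a description of how AWS Auto Scaling is implemented; the function and parameter names are hypothetical.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Proportional scaling: size the fleet so the metric returns to `target`.

    `metric` could be, for example, average CPU utilisation in percent.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    # Round up so the fleet is never undersized for the observed load.
    raw = math.ceil(current * metric / target)
    # Never go below the minimum or above the maximum fleet size.
    return max(minimum, min(maximum, raw))


if __name__ == "__main__":
    # 4 instances at 90% CPU with a 60% target: scale out.
    print(desired_capacity(4, 90.0, 60.0, minimum=2, maximum=10))
    # 4 instances at 15% CPU: scale in, but not below the minimum.
    print(desired_capacity(4, 15.0, 60.0, minimum=2, maximum=10))
```

The minimum and maximum bounds matter as much as the rule itself: the minimum keeps enough capacity to survive a failure, and the maximum caps cost if the metric misbehaves.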
Failure management
- With cloud, you should always architect your systems on the assumption that failure will occur. You should become aware of these failures, understand how they occurred and how to respond to them, and then plan how to prevent them from happening again.
- Questions to ask yourself:
  - How are you backing up your data?
  - How does your system withstand component failures?
  - How are you planning for recovery?
- AWS Key services:
  - CloudFormation
  - Multi-AZ
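
Designing for failure can be illustrated at the application level with a small failover routine: try the primary replica first and fall back to a standby if it fails. The sketch below is a minimal, hypothetical example; the `primary` and `standby` functions stand in for replicas in different Availability Zones, and managed Multi-AZ features would normally handle this for you.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result.

    Raises RuntimeError if every replica fails.
    """
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # record the failure and move on to the next replica
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def primary() -> str:
    # Hypothetical replica that is currently unavailable.
    raise ConnectionError("primary unavailable")


def standby() -> str:
    # Hypothetical healthy replica in another Availability Zone.
    return "served from standby"


if __name__ == "__main__":
    print(call_with_failover([primary, standby]))
```

Collecting the errors rather than discarding them supports the second half of failure management: knowing how the failure occurred so it can be prevented next time.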

Exam Tips:

Remember that reliability in the cloud consists of 3 areas (foundations, change management and failure management), and know the questions to ask yourself for each.


Last updated 5 years ago