10.6 Pillar Five - Operational Excellence
The Operational Excellence pillar includes operational practices and procedures used to manage production workloads.
This includes how planned changes are executed, as well as responses to unexpected operational events.
Change execution and responses should be automated. All processes and procedures of operational excellence should be documented, tested, and regularly reviewed.
Design Principles
Perform operations with code
Align operations processes to business objectives (we should collect metrics that can indicate the business objectives are being met)
Make regular, small, incremental changes
Test for responses to unexpected events
Learn from operational events and failures
Keep operations procedures current
Definitions
There are three best practice areas for operational excellence in the cloud: Preparation, Operation, and Response.
Preparation
Best practices: Effective preparation is required to drive operational excellence. Operations checklists will ensure that workloads are ready for production operation, and prevent unintentional production promotion without effective preparation.
Each workload should have:
Runbooks - Operations guidance that operations teams can refer to when performing normal daily tasks.
Playbooks - Guidance for responding to unexpected operational events. Playbooks should include response plans, as well as escalation paths and stakeholder notifications.
In AWS there are several methods, services, and features that can be used to support operational readiness, and the ability to prepare for normal day-to-day operations as well as unexpected operational events.
AWS CloudFormation can be used to ensure that environments contain all required resources when deployed in production, and that the configuration of each environment is based on tested best practices, which reduces the opportunity for human error.
Implementing Auto Scaling, or other automated scaling mechanisms, allows workloads to respond automatically when business-related events affect operational needs.
Services like AWS Config with the AWS Config rules feature create mechanisms to automatically track and respond to changes in your AWS workloads and environments.
It is also important to use features like Tagging to make sure all resources in a workload can be easily identified when needed during operations and responses.
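As a rough illustration of how a template and consistent tagging fit together, the sketch below builds a minimal CloudFormation template as a Python dictionary and prints it as JSON. It does not call AWS. The helper name build_template, the logical ID WorkloadBucket, and the tag keys Workload and Environment are assumptions made for this example; a real template would declare the resources your workload actually needs.

```python
import json


def build_template(workload: str, environment: str) -> dict:
    """Return a minimal CloudFormation template with one tagged S3 bucket."""
    # Hypothetical tag keys; substitute your organization's tagging scheme.
    tags = [
        {"Key": "Workload", "Value": workload},
        {"Key": "Environment", "Value": environment},
    ]
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Example stack for {workload} ({environment})",
        "Resources": {
            # "WorkloadBucket" is a hypothetical logical ID.
            "WorkloadBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"Tags": tags},
            }
        },
    }


if __name__ == "__main__":
    # Render the template so it can be reviewed and version controlled.
    print(json.dumps(build_template("orders", "production"), indent=2))
```

Keeping the template in source control means every environment is created from the same reviewed definition, and the tags make the resulting resources easy to find during operations.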
Be sure that documentation does not become stale or out of date as procedures change, and make sure that it is thorough. Without application designs, environment configurations, resource configurations, response plans, and migration plans, documentation is not complete. If documentation is not updated and tested regularly, it will not be useful when unexpected operational events occur. If workloads are not reviewed before production, operations will be affected when undetected issues surface. If resources are not documented, responding to operational events takes longer, because the correct resources must first be identified.
Questions to ask yourself:
What best practices for cloud operations are you using? (start with preparation.)
How are you doing configuration management for your workload?
AWS Key services:
AWS Config. It provides a detailed inventory of your AWS resources and configuration, and continuously records configuration changes.
AWS Service Catalog. It helps to create a standardized set of service offerings that are aligned to best practices. Designing workloads that use automation with services like Auto Scaling and Amazon SQS is a good way to ensure continuous operations when unexpected operational events occur.
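The kind of check an AWS Config rule performs can be sketched in plain Python. The example below does not call AWS; it evaluates a simplified, hypothetical resource record against a list of required tag keys and reports COMPLIANT or NON_COMPLIANT, the compliance values AWS Config uses. The function name evaluate_tags, the record shape, and the required tag keys are all assumptions made for illustration.

```python
# Hypothetical set of tag keys every resource in the workload must carry.
REQUIRED_TAGS = ("Workload", "Environment", "Owner")


def evaluate_tags(resource: dict, required=REQUIRED_TAGS) -> dict:
    """Classify one simplified resource record by its tags."""
    tags = resource.get("tags", {})
    # A tag counts as missing if the key is absent or its value is empty.
    missing = [key for key in required if not tags.get(key)]
    return {
        "resource_id": resource["id"],
        "compliance": "NON_COMPLIANT" if missing else "COMPLIANT",
        "missing_tags": missing,
    }


if __name__ == "__main__":
    inventory = [
        {"id": "i-001", "tags": {"Workload": "orders", "Environment": "prod", "Owner": "ops"}},
        {"id": "i-002", "tags": {"Workload": "orders"}},
    ]
    for item in inventory:
        print(evaluate_tags(item))
```

Running a check like this on every recorded configuration change is what turns a tagging convention into something that is tracked automatically rather than audited by hand.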
Operation
Best practices: Operations should be standardized and manageable on a routine basis. The focus should be on automation, small frequent changes, regular quality assurance testing, and defined mechanisms to track, audit, roll back, and review changes. Changes should not be large and infrequent, should not require scheduled downtime, and should not require manual execution. A wide range of logs and metrics based on key operational indicators for a workload should be collected and reviewed to ensure continuous operations.
In AWS you can set up a continuous integration/continuous deployment (CI/CD) pipeline (e.g. source code repository, build systems, deployment and testing automation). Release management processes, whether manual or automated, should be tested and be based on small incremental changes, and tracked versions. You should be able to revert changes that introduce operational issues without causing operational impact.
Routine operations, as well as responses to unplanned events, should be automated. Manual processes for deployments, release management, changes, and rollbacks should be avoided. Releases should not be large batches that are done infrequently. Rollbacks are more difficult with large changes, and failing to have a rollback plan, or the ability to mitigate failure impacts, will prevent continuity of operations. Align monitoring to business needs, so that responses are effective at maintaining business continuity. Monitoring that is ad hoc and not centralized, with responses that are manual, will cause more impact to operations during unexpected events.
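The idea of small, incremental releases with automated rollback can be sketched as follows. This is a toy model, not a deployment tool: the function deploy_incrementally and the is_healthy callback are hypothetical names, and a real pipeline would run tests and health checks against live infrastructure.

```python
from typing import Callable, List


def deploy_incrementally(
    current_version: str,
    new_versions: List[str],
    is_healthy: Callable[[str], bool],
) -> str:
    """Apply releases one at a time and return the version left running."""
    active = current_version
    for candidate in new_versions:
        if is_healthy(candidate):
            # The small change passed its checks, so it becomes the new baseline.
            active = candidate
        else:
            # Roll back: stay on the last healthy version and stop releasing.
            break
    return active


if __name__ == "__main__":
    # Pretend that v3 fails its post-deployment health check.
    result = deploy_incrementally("v1", ["v2", "v3", "v4"], lambda v: v != "v3")
    print(result)  # prints "v2"
```

Because each release is small, the rollback target is always the immediately preceding version, which is what keeps the revert cheap.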
Questions to ask yourself:
How are you evolving your workload while minimizing the impact of change?
How do you monitor your workload to ensure it is operating as expected? (e.g. CloudWatch)
AWS Key services
AWS CodeCommit, AWS CodeDeploy, and AWS CodePipeline can be used to manage and automate code changes to AWS workloads.
Use AWS SDKs or third-party libraries to automate operational changes.
Use AWS CloudTrail to audit and track changes made to AWS environments.
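To give a feel for what auditing changes looks like in practice, the sketch below filters a CloudTrail-style log for operations that modified something. It assumes each record carries the eventTime, eventName, readOnly, and userIdentity fields; the sample records, user names, and the function name summarize_changes are made up for illustration, and no AWS call is made.

```python
import json

# Made-up sample in the general shape of a CloudTrail log file.
SAMPLE_LOG = json.dumps({
    "Records": [
        {"eventTime": "2024-01-01T10:05:00Z", "eventName": "DescribeInstances",
         "readOnly": True,
         "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"}},
        {"eventTime": "2024-01-01T10:07:00Z", "eventName": "TerminateInstances",
         "readOnly": False,
         "userIdentity": {"arn": "arn:aws:iam::123456789012:user/bob"}},
        {"eventTime": "2024-01-01T10:01:00Z", "eventName": "PutBucketPolicy",
         "readOnly": False,
         "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"}},
    ]
})


def summarize_changes(log_text: str) -> list:
    """Return (time, caller, action) for every non-read-only record, oldest first."""
    records = json.loads(log_text).get("Records", [])
    # A record without a readOnly flag is treated as a change, to be conservative.
    changes = [r for r in records if not r.get("readOnly", False)]
    changes.sort(key=lambda r: r["eventTime"])
    return [
        (r["eventTime"], r.get("userIdentity", {}).get("arn", "unknown"), r["eventName"])
        for r in changes
    ]


if __name__ == "__main__":
    for when, who, what in summarize_changes(SAMPLE_LOG):
        print(when, who, what)
```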
Response
Best practices: Responses to unexpected operational events should be automated. This applies not just to alerting, but also to mitigation, remediation, rollback, and recovery. Alerts should be timely, and should invoke escalations when responses are not adequate to mitigate the impact of operational events. Quality assurance mechanisms should be in place to automatically roll back failed deployments. Responses should follow a pre-defined playbook that includes stakeholders, the escalation process, and procedures. Escalation paths should be defined and include both functional and hierarchical escalation capabilities. Hierarchical escalation should be automated, and escalated priority should result in stakeholder notifications. In AWS there are several mechanisms that provide appropriate alerting and notification in response to unplanned operational events, as well as automated responses (e.g. CloudWatch, SNS).
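Automated hierarchical escalation can be modelled very simply. In the sketch below, an alert that stays unacknowledged is escalated to more senior contacts as time passes. The policy, the contact names, the time limits, and the function name contacts_to_notify are all hypothetical; a real playbook would define its own levels and deliver the notifications through a service such as SNS.

```python
from typing import List, NamedTuple


class EscalationLevel(NamedTuple):
    after_minutes: int  # minutes without acknowledgement before this level is notified
    contact: str


# Hypothetical escalation path taken from an imaginary playbook.
POLICY = [
    EscalationLevel(0, "on-call engineer"),
    EscalationLevel(15, "team lead"),
    EscalationLevel(30, "engineering manager"),
]


def contacts_to_notify(minutes_unacknowledged: int, policy=POLICY) -> List[str]:
    """Return every contact who should have been notified by now, lowest level first."""
    return [
        level.contact
        for level in sorted(policy)
        if minutes_unacknowledged >= level.after_minutes
    ]


if __name__ == "__main__":
    for minutes in (0, 20, 45):
        print(minutes, contacts_to_notify(minutes))
```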
Questions to ask yourself:
How do you respond to unplanned operational events?
How is escalation managed when responding to unplanned operational events?
AWS Key services
Amazon CloudWatch. Take advantage of the full set of CloudWatch features for effective and automated response. Amazon CloudWatch alarms can be used to set thresholds for alerting and notification (SNS), and Amazon CloudWatch Events can trigger notifications and automated responses.
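Threshold-based alarming can be illustrated with a simplified model. The sketch below looks at the most recent datapoints of a metric and goes into ALARM when enough of them breach a threshold, similar in spirit to the evaluation periods and datapoints-to-alarm settings of a CloudWatch alarm. It is an approximation written for this example, not CloudWatch's actual evaluation logic, and the function name alarm_state is an assumption.

```python
from typing import Sequence


def alarm_state(
    datapoints: Sequence[float],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> str:
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for a metric series."""
    # Simplification: require a full evaluation window before deciding.
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    breaches = sum(1 for value in window if value > threshold)
    return "ALARM" if breaches >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    cpu = [10, 95, 97, 99]  # made-up CPU utilization percentages
    # Alarm when at least 2 of the last 3 datapoints exceed 90 percent.
    print(alarm_state(cpu, threshold=90, evaluation_periods=3, datapoints_to_alarm=2))
```

In a real setup, the transition into ALARM is what would publish to an SNS topic or trigger an automated response.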
Exam Tips:
Know the three best practice areas for operational excellence in the cloud (Preparation, Operation, and Response) and the questions associated with each.