Incident Response on AWS: Outsourcing Cloud Services

Cloud Operations
Tags: AWS, Case Study

It is 4PM on a Friday before a holiday, right before your team leaves for a long weekend. An engineer on your team suddenly cannot connect to certain instances in your AWS environment. The error is affecting the largest projects and biggest customers — across hundreds of instances — including DR.

What happens now?

The answer to this question depends on your support model. In most companies, an incident like this means that no one is going home for the holiday weekend; they will spend 15+ hours diagnosing the problem, then 200+ hours fixing it manually, instance by instance. If they do not get it fixed in time, they will lose data and have to tell their customers — a potentially damaging situation.

This is a true story, but what actually happened was very different. The company instead called their Managed Service Provider (us), who diagnosed the problem and fixed it over the holiday weekend.

Every system has weak points. Every system can fail. It is how you deal with catastrophe — and who you trust to help you during failure — that makes the difference. Every enterprise team needs an insurance policy against mistakes, large and small.

It turns out that one of the company’s internal engineers had caused the problem by inadvertently changing permissions across their entire environment. Logicworks diagnosed the problem in less than an hour, determined the blast radius, got our smartest engineers in a room to develop a complex remediation strategy, and implemented that fix before business resumed after the holiday. This involved writing custom scripts (in Python, Bash, and Puppet) to investigate the scope of the failure, plus a more complex script to partially automate the fix, so that each instance could be repaired in 3-5 minutes rather than 2-3 hours. Ultimately it took 170+ hours of engineering effort, but the company readily admitted that it would have taken them two weeks to fix on their own.
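To give a sense of what a scope-investigation script can look like, here is a minimal Python sketch. It is not the script used in this incident. It assumes that "affected" can be approximated by whether a TCP connection to the SSH port succeeds, and the names `can_connect` and `blast_radius` are hypothetical.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def can_connect(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def blast_radius(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = can_connect,
    workers: int = 32,
) -> dict[str, list[str]]:
    """Probe every host in parallel and split them into two groups."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with hosts.
        results = list(pool.map(probe, hosts))
    report: dict[str, list[str]] = {"reachable": [], "unreachable": []}
    for host, ok in zip(hosts, results):
        report["reachable" if ok else "unreachable"].append(host)
    return report
```

The probe is passed in as a parameter so that the same loop can run a different check, such as a permissions test, without changing the reporting logic. Running the probes in parallel is what keeps a survey of hundreds of instances to minutes rather than hours.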

Managed Infrastructure Service Providers were born in an age when implementing a fix meant going to a datacenter, swapping out hardware, and doing manual configurations. The value of an MSP to enterprises was not having to manage hardware and systems staff.

In the cloud, MSPs must do more. They must be programmers; instead of replacing hardware, they need to write custom scripts to repair virtual cloud instances. MSPs need to think and act like a software company: infrastructure problems are bugs, the solution is code, and speed is paramount.

Not all MSPs operate this way. Many would have looked at this company’s issue and applied the traditional incident response model: reboot everything manually, one instance at a time. (Many also would have said, “You caused the problem, you fix it.”) Under that traditional line of thinking, the company would have lost three to five days of data, along with customer trust.

Running on cloud infrastructure comes with unique risks. It is often easier for your engineers to make a career-limiting mistake when a single wrong click, or a single error in an automated script, can change permissions across an entire system. These new challenges require new answers and a new line of defense.
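One simple line of defense against this kind of mistake is a pre-flight check that refuses to apply a change to more resources than expected unless someone explicitly overrides it. The Python sketch below is illustrative only. `apply_change`, `BlastRadiusError`, and the default threshold of 10 are hypothetical choices, not part of any AWS tool.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class BlastRadiusError(RuntimeError):
    """Raised when a change would touch more targets than allowed."""


def apply_change(
    targets: Iterable[T],
    change: Callable[[T], R],
    max_targets: int = 10,
    force: bool = False,
) -> list[R]:
    """Apply change to each target, refusing oversized batches unless forced."""
    targets = list(targets)
    if len(targets) > max_targets and not force:
        raise BlastRadiusError(
            f"refusing to change {len(targets)} targets "
            f"(limit {max_targets}); pass force=True to override"
        )
    return [change(target) for target in targets]
```

The guard does not prevent a bad change. It only turns an accidental environment-wide change into a deliberate one, because an engineer has to pass `force=True` to go past the limit.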

Importantly, this means that MSPs no longer replace internal IT teams; they provide additional expertise that the enterprise currently lacks (or is in the process of building) in fields like cloud security and automation. They provide an additional layer of defense. In the example above, the internal and MSP team collaborated to fix the problem, since there is shared control of the infrastructure.

In the cloud, the conversation no longer has to be insourcing vs. outsourcing. In fact, you will get the most out of an MSP if you are also setting up internal DevOps teams or implementing software development best practices. As an MSP, we find companies with an existing or growing DevOps team the most exciting to work beside. For example, an MSP cannot automate your entire deployment pipeline alone; most operate only below the application level and can automate only instance spin-up and testing. But if the two teams work together, they can balance application-specific needs with advanced scaling and network options and create a very mature pipeline very quickly.

In other words, an MSP can accelerate your DevOps team-building strategies, not substitute for them. This is an incredibly powerful model that we have watched transform and mature entire cloud projects in a matter of months. Plus, an MSP can take over the crucial compliance work your DevOps team dreads, like setting up backups and logging, and even improve the quality of that work by creating automated tests to ensure logs and backups are always kept.
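An automated test that checks backups are always kept can be quite small. The sketch below is a hypothetical Python example: it assumes you already have a mapping from instance ID to the timestamp of its most recent backup (how you collect that depends on your tooling), and it treats 24 hours as the freshness window. The function name `stale_backups` is illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


def stale_backups(
    latest_backup: Mapping[str, Optional[datetime]],
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(hours=24),
) -> list[str]:
    """Return the sorted IDs of instances whose backup is missing or too old."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for instance_id, taken_at in latest_backup.items():
        # None means no backup was ever recorded for this instance.
        if taken_at is None or now - taken_at > max_age:
            stale.append(instance_id)
    return sorted(stale)
```

Run on a schedule, a check like this fails loudly when the returned list is non-empty, so a missing backup surfaces long before anyone needs to restore from it.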

It is true that internal IT teams sacrifice some control by using an MSP. The key is that you are sacrificing control to a group of people who are held responsible for making your environment secure, available, etc. You control how and when the MSP touches your environment.

Cloud projects are complex, and cloud problems can be equally so. Just make sure that when they happen, you have the right team on the bench.

By Jason Deck
VP – Strategic Development

Logicworks is an enterprise cloud automation and managed services provider with 22+ years of experience transforming enterprise IT. Contact us to learn more about our managed cloud solutions.

February 9, 2016

Logicworks Control Tower

AWS Control Tower is a purpose-built management utility for building, organizing, and maintaining multiple AWS accounts. Control Tower allows you to deploy accounts programmatically by using predetermined templates that assign specific guardrails. Security, identity management, logging, cost management, and other key business functions can be defined and executed through a successful Control Tower implementation. Control Tower operates across Organizational Units and defines rulesets through Service Control Policies. Control Tower Account Factory automates the deployment and configuration of new accounts.

Sessions & Milestones

Briefing & Discovery

Logicworks will lead a workshop to introduce core concepts including use cases, management, automation, and governance. The requirements for your deployment will be identified and documented, to align our technical resources around your project goals & objectives.

Architecture Design

Based on your requirements, Logicworks will present the recommended architecture design. Our team will share a diagram of the proposed configuration and review the specific points of your deployment.

Transfer Knowledge

When your deployment is complete, Logicworks will present the details to your team and provide a guided walkthrough of the environment.

Scope & Details

Scope

Organizational Units
Governance Requirements
Security Guardrails Definition
Service Control Policies
AWS Config Rules
Service Control Policy Definitions
Guardrail Deployment

Deliverables

Default Control Tower in Desired Region
Administer Guardrails
Configure Account Factory
Provide Reusable IAC Template for Default VPC
Standardized Networking & Route Tables
Administer AWS SSO Configuration (can include integration with Active Directory)
Document Multi-Account Structure and Governance Strategy
Deploy Up To 2 Customizations for Control Tower (CfCT)
Cloud Solution Documentation detailing Control Tower Solution
Architecture Diagram and Technical Specifications
