Introduction To AWS Cloud Security Chaos Engineering
AWS Cloud Security Chaos Engineering VS Adversary Simulation/Emulation
Introduction
We live in a crazy world where cyber-attacks keep getting more advanced, which is why we need to be proactive and use concepts like cloud security chaos engineering (SCE). But what does that mean, and what is the difference between cloud security chaos engineering (SCE) and chaos engineering (CE)? That is what we are going to learn in this blog.
But first, we need some background knowledge.
Prerequisites
Before reading this blog, here’s everything you need to know to understand the concepts discussed (you can skip this section).
What is the cloud?
Let’s imagine you travel a lot. Instead of buying a house in each country you visit, you stay at a hotel, and all those hotels are connected with pipes through which you can send letters from one hotel to another.
What is good about this approach is that you don't need to pay a lot of money to own a house; you only pay for the time you spend at the hotel.
It's the same with the cloud: instead of buying servers (just a fancy name for computers) in a lot of countries to serve your clients, all you need to do is rent those servers from AWS, and you only pay for what you use.
What is an instance?
An instance is a virtual server in the cloud
How can we say that something is not secure?
When we break the “CIA Triad”, which stands for Confidentiality, Integrity, and Availability:
- Confidentiality: Protecting information from unauthorized access.
- Integrity: Data are trustworthy, complete, and have not been accidentally altered or modified by an unauthorized user.
- Availability: Data are accessible when you need them.
What’s the difference between proactive and reactive security?
Proactive: Identifying threats and security weaknesses before an attack happens
Reactive: Responding to hacks and data breaches after they happen
Chaos Engineering (CE)
What is Chaos Engineering (CE)?
One of the first companies to implement Chaos Engineering (CE) was Netflix, which publicly described its Chaos Engineering (CE) tool, Chaos Monkey, in 2011 and later released it as open source. But what is Chaos Engineering (CE), and why did Netflix build this tool?
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production
Huh, what does that mean?
Simply put, Chaos Engineering (CE) is the act of deliberately introducing failures into a system to make sure it is resilient and can withstand unexpected failures
Note: Chaos Engineering (CE) is NOT randomly breaking stuff in production
Let's see an example to understand how it works
Example
I like the analogy that Netflix used in their blog
Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.
So Netflix made a framework to automate this process of poking a hole in your tire once a week and they called it Chaos Monkey:
Chaos Monkey is a tool that randomly disables Netflix’s production instances in AWS to ensure they can survive this common type of failure without any customer impact.
The name comes from imagining a monkey loose in the cloud (AWS), randomly chewing cables and shooting instances, all without affecting the clients who are watching Harry Potter for the millionth time (this is not me at all)
This only happens in a controlled way in the production environment, with engineers standing by to fix any problems that might appear
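To make the idea concrete, here is a tiny simulation of a Chaos Monkey style experiment. It is only a sketch and not Netflix's code: it does not touch AWS at all, and the `Instance` class, the `service_is_up` check, and the instance IDs are all made up for illustration. It shows the two ingredients described above: a random failure, and a guard rail that keeps the experiment controlled.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A stand-in for a virtual server; not a real AWS object."""
    instance_id: str
    healthy: bool = True


def service_is_up(fleet):
    """Hypothetical steady-state check: the service survives if any instance is healthy."""
    return any(instance.healthy for instance in fleet)


def run_chaos_experiment(fleet, rng):
    """Disable one randomly chosen healthy instance and report whether the service survived."""
    candidates = [instance for instance in fleet if instance.healthy]
    if len(candidates) < 2:
        # Guard rail: never take down the last healthy instance.
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim.instance_id, service_is_up(fleet)


fleet = [Instance("i-aaa"), Instance("i-bbb"), Instance("i-ccc")]
result = run_chaos_experiment(fleet, random.Random(7))
print(result)  # (ID of the disabled instance, True if the service stayed up)
```

In a real setup the "disable" step would stop an actual instance and the steady-state check would look at real traffic, but the shape of the experiment stays the same.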
Chaos Monkey was one member of a whole army called the Simian Army, where each simian has its own special chaos power 😈
The Simian Army
- Latency Monkey
- Conformity Monkey
- Doctor Monkey
- Janitor Monkey
- Security Monkey
- 10-18 Monkey
- Chaos Gorilla
Then one more joined the Simian Army
- Chaos Kong
To learn more about what each monkey does you can read the official blog that Netflix wrote about the Simian Army
More Examples
- Gremlin (Paid)
- ChaosToolkit (Open Source)
- Litmus (Open Source)
Now that we understand chaos engineering (CE), let's see what cloud security chaos engineering (SCE) is
AWS Cloud Security Chaos Engineering (SCE)
What is AWS Cloud Security Chaos Engineering (SCE)?
It is like Chaos Engineering, but instead of testing the resilience of the infrastructure, AWS Cloud Security Chaos Engineering (SCE) proactively makes sure that the system can respond to security threats
The AWS Cloud Security Chaos Engineering Methodology
- Question: How can your infrastructure be vulnerable?
- Identify: Which of these services are you going to test?
- Experiment: What controlled failures are you going to use to test these services?
- Detect: What vulnerabilities did your experiments reveal?
To automate this, you can use a framework like “ChaoSlingr”; just keep in mind that it is no longer maintained :(
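If you just want to see how the four steps fit together, they can be written down as a small record. This is a minimal sketch and not how ChaoSlingr works: the `SecurityExperiment` class, its field names, and the firewall example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityExperiment:
    question: str   # Question: how could the infrastructure be vulnerable?
    target: str     # Identify: which service is under test?
    failure: str    # Experiment: which controlled failure is injected?
    findings: list = field(default_factory=list)  # Detect: what showed up?

    def record(self, detected, note):
        """Store a finding only when the expected control did not react."""
        if not detected:
            self.findings.append(note)
        return self.findings


# Made-up example values, just to show the four steps filled in.
experiment = SecurityExperiment(
    question="Would we notice a misconfigured firewall rule?",
    target="a test security group",
    failure="open an unused port on purpose",
)
experiment.record(detected=False, note="Nobody was alerted about the open port")
print(experiment.findings)
```

Writing the experiment down like this forces you to answer each step before you break anything, which is what keeps it an experiment instead of random breakage.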
Example
A simple AWS security chaos engineering experiment could be:
Simulate an attacker by making an unauthorized API call, then test whether AWS CloudTrail and Amazon CloudWatch detect it and alert the incident response team
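Here is a rough sketch of the checking half of that experiment. It runs on hand-written sample data, not on a real AWS account. The `eventID`, `eventName`, and `errorCode` fields follow CloudTrail's record format, but the helper functions, the sample values, and the shape of the `alerts` list are assumptions made for illustration.

```python
def is_unauthorized_call(event):
    """Flag denied API calls by looking at the record's errorCode field."""
    code = event.get("errorCode", "")
    return code.startswith("AccessDenied") or code.endswith("UnauthorizedOperation")


def run_detection_experiment(events, alerts):
    """Return the IDs of unauthorized calls that never produced an alert."""
    injected = [event for event in events if is_unauthorized_call(event)]
    alerted_ids = {alert["eventID"] for alert in alerts}
    return [event["eventID"] for event in injected if event["eventID"] not in alerted_ids]


# Hand-written sample records; the values and the alert structure are made up.
events = [
    {"eventID": "1", "eventName": "ListBuckets"},
    {"eventID": "2", "eventName": "GetSecretValue", "errorCode": "AccessDenied"},
    {"eventID": "3", "eventName": "RunInstances", "errorCode": "Client.UnauthorizedOperation"},
]
alerts = [{"eventID": "2"}]

missed = run_detection_experiment(events, alerts)
print(missed)  # ['3']
```

An empty list means every simulated unauthorized call reached the incident response team. Anything left in `missed` is a detection gap you found before a real attacker did.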
What is the difference between Cloud Security Chaos Engineering and Adversary Emulation/Simulation?
It's all about what you want to achieve: cloud security chaos engineering aims for cyber resilience, while adversary emulation/simulation aims for cyber security
But what does that mean?
What does Cyber Security mean?
The definition of Cyber Security according to the Cybersecurity and Infrastructure Security Agency (CISA) is “The art of protecting networks, devices, and data from unauthorized access or criminal use and the practice of ensuring confidentiality, integrity, and availability of information.”
What does Cyber Resilience mean?
The definition according to IBM is that “Cyber Resilience is a concept that brings business continuity, information systems security, and organizational resilience together. The concept describes the ability to continue delivering intended outcomes despite experiencing challenging cyber events, such as cyberattacks, natural disasters, or economic slumps. A measured level of information security proficiency and resilience affects how well an organization can continue business operations with little to no downtime.”
What is the difference between Cyber Resilience and Cyber Security?
Cyber resilience is about preparing for and dealing with cyber attacks, and understanding that they can happen even with strong security measures in place. It includes planning ahead, assessing risks, having a plan for when incidents occur, and making sure there are good backup and recovery plans. The main idea is that businesses should be ready to quickly handle and bounce back from cyber issues to keep their operations running smoothly. Cyber resilience builds on cyber security to achieve that.
If you want to learn more about cyber resilience and Security Chaos Engineering, I really recommend checking out the blog “Leveraging Security Chaos Engineering for Cloud Cyber Resilience — Part I” by Mitigant
Resources
👨🏻💻Videos
- Security Chaos Engineering: When and How You Should Break Your System (OWASP London: Anais Urlichs)
✍🏻Blogs
- Security Chaos Engineering 101: Fundamentals (Mitigant: Kennedy Torkura)
🎧Podcasts
- Security Chaos Engineering — What is it and why should you care? (Snyk: Aaron Rinehart)
📚Books
- Security Chaos Engineering (written by Kelly Shortridge and Aaron Rinehart)
References
- Chaos engineering (Wikipedia)
- Understanding Chaos Engineering (Microsoft Developer)
- The Netflix Simian Army (The Netflix Technology Blog)
- Security-focused chaos engineering experiments for the cloud (Datadog)
- Security Chaos Engineering 101: Fundamentals (Mitigant)
- Leveraging Security Chaos Engineering for Cloud Cyber Resilience — Part I (Mitigant)