Need to know your cloud dependencies in a pinch? Sure, we’ve been there. Here’s how we leveraged ChatOps to make our lives easier.
We’ve got cloud, and we’ve got 99 problems about what’s living in our clouds. On most days you might have the luxury of time to unravel this dependency graph. However, if you’re chasing down an incident, you need to know in a hurry! Here’s how the Sentinels, our security team at Hotstar, solved this using ChatOps.
A cloud Resource Explorer is one of the most important items in the toolkit for anyone building in a modern engineering team. While the reasons can vary, the need to know what resides where, along with the metadata around it, should be met without much drama.
Here’s what most people leverage today to find objects in the cloud, and the challenges with each.
- Console: Doesn’t scale in a multi-account setup; complex correlation is not possible.
- CLI or SDK (e.g. Boto): The CLI needs setup such as configuring keys, role-assumption settings, etc. The SDK requires some programming comfort, so it doesn’t scale for team members who aren’t current with coding. A default problem that always exists with this approach is managing keys at scale and rotating them.
- Cloud Inventory or a Cloud Security Posture Management (CSPM) solution: The focus of this tool is security, not so much inventory. Therefore the data can be stale and only works as a coarse approach, which might not serve all use cases.
While a combination of these might work, it isn’t something that can be used in a pinch, and it would require stitching a solution together.
What we wanted to solve is something seen at scale on a day-to-day basis. For example, someone has a simple question; this someone could be a customer care executive or a backend developer. Their question might go something like:
“I want to know where x.x.x.x IP is in our infra”
Our goal was to make it as easy for people as querying an Excel sheet or a simple database. The usual methods would fail for the simple fact that they require stitching and extra work every time the question is asked, unless you pool together some tooling. Add the complexity of multi-cloud, or even multiple access levels and so on, which is very common. In general, the headwind to answering even a simple question like this is intimidating.
We began to introspect on the questions our teams were asking. Here is a sampling:
- Which account does this S3 bucket belong to, and what kind of encryption is enabled on it?
- For an access key, which account and user does it belong to?
- I want to know what xyz.hs.com points to. Which account’s R53 should I check?
Each of these takes a different amount of complexity to answer! Imagine spinning up bespoke scripts to handle each question; this is just not scalable.
We use Slack extensively for communication, though ChatOps can run on any chat app. Anyone who is wondering about something in the infrastructure comes to Slack first and asks someone; most of the time it’s the DevOps, Infrastructure and Security teams who get these questions.
Our goal was simple: nothing should limit someone from asking a question, while ensuring minimal dependency.
Querying the cloud still remains the same: it’s either the CLI, an SDK, or existing data from a source like a CSPM, which already pulls most of the data for you.
When to use real-time queries versus CSPM data depends on the use case and how live you expect the data to be. For example, I expect IP data to be almost live (a 1–2 hour window), as a lot of IPs keep changing for various reasons: Spot nodes, Auto Scaling, etc. My IAM data can be 6–12 hours old, since user and access key creation is not that frequent. Similarly, S3 or R53 data can also be around 6–12 hours old.
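To make the freshness trade-off concrete, here is a minimal sketch of how such a decision could be encoded. The staleness windows come from the numbers above; the names `MAX_STALENESS` and `choose_source` are hypothetical and not part of our actual tooling.

```python
from datetime import timedelta

# Acceptable staleness per data type, using the windows discussed above.
# The dictionary keys are illustrative labels, not real API names.
MAX_STALENESS = {
    "ip": timedelta(hours=2),
    "iam": timedelta(hours=12),
    "s3": timedelta(hours=12),
    "r53": timedelta(hours=12),
}


def choose_source(data_type: str, snapshot_age: timedelta) -> str:
    """Pick where to answer a query from.

    snapshot_age is how long ago the stored copy (CSPM or search index)
    of this data type was last refreshed.
    """
    allowed = MAX_STALENESS[data_type]
    if snapshot_age <= allowed:
        return "snapshot"   # stored data is fresh enough
    return "realtime"       # too stale: query the cloud directly
```

With this policy, a three-hour-old IP snapshot falls back to a real-time query, while a ten-hour-old IAM snapshot is still served from stored data.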
Here is a simple architecture diagram explaining how it is built and used:
Components in the architecture:
Slack: This is where someone fires a Slash Command depending on the information they want. The command can be fired from their DM or a dedicated channel; the response goes to a pre-defined channel.
API Router: This is where most of the logic sits. It authenticates the Slack user and the incoming payload, then routes the request to the corresponding API. The decision of whether to use the CSPM API, ES, or a real-time CLI query is taken here. The response to Slack is also sent by this component. It is a simple Flask app.
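The two core steps of the router, authenticating the payload and picking a backend, can be sketched with the standard library alone. The signature check follows Slack's documented v0 signing scheme (HMAC-SHA256 over `v0:<timestamp>:<body>`, compared against the `X-Slack-Signature` header). The command names and the `ROUTES` table are hypothetical, and the Flask wiring is left out.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs

# Hypothetical mapping of slash commands to the backend that answers them.
ROUTES = {
    "/whereis-ip": "search_index",
    "/whois-key": "cspm_api",
    "/dns-lookup": "realtime",
}


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, max_skew=300):
    """Validate a request against Slack's v0 signing scheme.

    timestamp and signature come from the X-Slack-Request-Timestamp and
    X-Slack-Signature headers; body is the raw request body as bytes.
    """
    now = time.time() if now is None else now
    try:
        if abs(now - int(timestamp)) > max_skew:
            return False  # reject old requests to limit replay
    except ValueError:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return hmac.compare_digest("v0=" + digest, signature)


def route(body):
    """Parse the form-encoded slash command payload and pick a backend.

    Returns (backend, text, user_id); backend is None for unknown commands.
    """
    form = parse_qs(body.decode())
    command = form.get("command", [""])[0]
    text = form.get("text", [""])[0]
    user_id = form.get("user_id", [""])[0]
    return ROUTES.get(command), text, user_id
```

In a Flask view, these would be called with `request.get_data()` and the two Slack headers before any backend is queried.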
CSPM API: This can be your CSPM, or any other cloud inventory service, which pulls your posture data every 24 hrs. It will have some API exposed to query data out of it, which can be used.
Custom Full Text Search: You can use any full-text search engine here; we used Elasticsearch. We have a few cron jobs running to pull data and keep it as live as possible. The cron frequency depends on what kind of data is being pulled from the cloud. As mentioned before, IAM data can be pulled every 6–12 hrs, IP data every hour, and so on. This frequency depends on your environment and the priority given to certain resources.
Real-Time Queries: You can fire custom queries using the CLI, an SDK such as Boto, or a tool like Steampipe.
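As an illustration of a real-time query, the sketch below answers the "where is this IP" question for a single account and region. It assumes a client that behaves like `boto3.client("ec2")` and uses the EC2 `describe_network_interfaces` call with its IP address filters. The function name `find_ip` and the shape of the returned dictionaries are hypothetical.

```python
def find_ip(ec2_client, ip):
    """Look up network interfaces holding `ip` in one account and region.

    ec2_client is expected to behave like boto3.client("ec2"); it is passed
    in so the same function can be run per account after assuming a role.
    """
    matches = []
    # Check both private addresses and associated public addresses.
    for filter_name in ("addresses.private-ip-address", "association.public-ip"):
        response = ec2_client.describe_network_interfaces(
            Filters=[{"Name": filter_name, "Values": [ip]}]
        )
        # A single-IP filter returns few results, so pagination is skipped here.
        for eni in response.get("NetworkInterfaces", []):
            matches.append({
                "interface_id": eni["NetworkInterfaceId"],
                "owner_account": eni.get("OwnerId"),
                "instance_id": eni.get("Attachment", {}).get("InstanceId"),
                "description": eni.get("Description", ""),
            })
    return matches
```

Covering a multi-account setup would mean looping this over one client per account and region, which is exactly the stitching work the bot hides from the person asking.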
Note: Access Control. Of course, not everyone is allowed to see everything; we want some restrictions on what kind of data can be queried by which class of people. Simple access controls can be written based on the Slack User ID that fires the command. Groups of Slack User IDs can be allowed or denied access to certain APIs.
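A minimal sketch of such a check is shown below. The group names, Slack user IDs, and command names are all made up for illustration; a real deployment would load them from configuration rather than hard-code them.

```python
# Hypothetical groups of Slack user IDs.
GROUPS = {
    "security": {"U01SEC001", "U01SEC002"},
    "devops": {"U02OPS001"},
    "support": {"U03SUP001"},
}

# Hypothetical policy: which groups may fire which slash command.
COMMAND_POLICY = {
    "/whereis-ip": {"security", "devops", "support"},
    "/whois-key": {"security"},
}


def is_allowed(user_id, command):
    """Return True if the Slack user belongs to a group allowed to run command.

    Commands missing from the policy are denied by default.
    """
    allowed_groups = COMMAND_POLICY.get(command, set())
    return any(user_id in GROUPS.get(group, set()) for group in allowed_groups)
```

Denying unknown commands by default keeps a newly added API closed until someone explicitly grants a group access to it.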