Cloud Cost Optimization and Resource Tagging
I've been spending some time reviewing our cloud platform costs, which had gone unreviewed for the last few quarters due to changes of PICs and leadership in the Tech division. Many resources have no ownership tags (apart from shared resources), which makes the cost review difficult. For context, we manage our resources on three cloud platforms (let's call them cloud platform A, B and C).
In cloud platform A, we scanned all the resources and found that only a few were still running in production. The rest had close to zero traffic, and their ownership was unknown because they were old and undocumented (and yes, we kept getting billed for them). We dug into them for clues about what they were and wrote a brief description of each based on what we found. Then we gathered the OG members who might know something about them and made a final sorting of which ones to keep and which to drop. We decided that the resources to keep would be migrated to platform C, after which the originals in platform A would be deleted. The remaining active resources were legacy systems, and migrating them into our newer environment in platform C took a long cross-team planning process. While waiting for the planning and data migration to finish, we also downgraded our underutilized VMs to cut costs. The migration took months, and once it was done we deleted the resources in cloud platform A.
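The triage described above can be sketched roughly as follows. This is a minimal Python illustration, not the process we actually scripted: the `VmUsage` fields, the thresholds, and the resource names are all hypothetical, and in practice the final keep-or-drop call was made by people, not by a rule.

```python
from dataclasses import dataclass


@dataclass
class VmUsage:
    name: str
    avg_cpu_percent: float    # average CPU utilization over the review window
    requests_per_day: float   # average inbound requests per day


# Hypothetical thresholds; tune them to your own workloads.
IDLE_REQUESTS_PER_DAY = 10
LOW_CPU_PERCENT = 20.0


def triage(vm: VmUsage) -> str:
    """Return 'idle', 'downsize', or 'keep' for one VM."""
    if vm.requests_per_day < IDLE_REQUESTS_PER_DAY:
        # Near-zero traffic: needs an ownership check before anything is deleted.
        return "idle"
    if vm.avg_cpu_percent < LOW_CPU_PERCENT:
        # Still in use, but a smaller instance size is probably enough.
        return "downsize"
    return "keep"


# Made-up inventory, for illustration only.
inventory = [
    VmUsage("legacy-api-01", 55.0, 120_000),
    VmUsage("report-batch-02", 6.5, 900),
    VmUsage("unknown-vm-17", 0.8, 2),
]

for vm in inventory:
    print(f"{vm.name}: {triage(vm)}")
```

Anything that lands in the idle bucket still goes through the "ask the OG members" step before it is touched.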
Meanwhile, the cost report for platform B showed that, even though some resources were tagged, a huge portion of the cost was still attributed to untagged resources. Compared to platform A, this platform has more active resources, which also means a higher monthly cost, and our CTO expects clear cost attribution for them. The hard part is identifying which existing resource belongs to which team, since the naming varies and, as stated above, many resources have no ownership tags. The obvious action was to define a tagging standard (which thankfully had already been made by my peer, Wid), then write an ad-hoc bash script to handle the job, and also implement the tagging standard in our IaC for future deployments. The bash script calls the cloud platform's CLI, picks out the resources that are missing the required tags using jq and grep, then loops over that list and tags each resource by calling the tagging API (shoutout to Bli for helping me troubleshoot my scripts to get them running correctly). Once I had the resource list, I started with the easiest task, which was tagging resources with explicit and intuitive names, then identified shared resources, and then started asking around about the rest. We migrated some of our resources to platform C but kept the active ones in platform B, since moving them would incur extra egress cost and effort. Either way, we now have much clearer cost attribution.
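To give a feel for the filtering step, here is a small Python sketch of the same idea. It is not the bash script itself: the required tag keys, the name-prefix hints, and the JSON shape of the inventory are all assumptions made up for illustration, and the real tagging call to the cloud platform API is left out on purpose.

```python
import json

# Hypothetical tagging standard: every resource must carry these keys.
REQUIRED_TAGS = {"owner", "environment"}

# Hypothetical name prefixes that hint at the owning team.
OWNER_HINTS = {"pay-": "payments", "data-": "data-platform"}


def missing_tags(resource: dict) -> set:
    """Return the required tag keys that are absent or empty."""
    tags = resource.get("tags") or {}
    return {key for key in REQUIRED_TAGS if not tags.get(key)}


def guess_owner(name: str):
    """Suggest an owner from the resource name, or None if unclear."""
    for prefix, team in OWNER_HINTS.items():
        if name.startswith(prefix):
            return team
    return None


def plan_tagging(inventory_json: str) -> list:
    """List resources that need tags, with a suggested owner where possible."""
    plan = []
    for resource in json.loads(inventory_json):
        missing = missing_tags(resource)
        if not missing:
            continue  # already compliant with the tagging standard
        plan.append({
            "id": resource["id"],
            "missing": sorted(missing),
            "suggested_owner": guess_owner(resource["name"]),
        })
    return plan


# Made-up inventory standing in for the CLI's JSON output.
sample = json.dumps([
    {"id": "r-1", "name": "pay-gateway",
     "tags": {"owner": "payments", "environment": "prod"}},
    {"id": "r-2", "name": "data-etl-worker", "tags": {"environment": "prod"}},
    {"id": "r-3", "name": "tmp-test-box", "tags": None},
])

for row in plan_tagging(sample):
    print(row)
```

Resources that come back with no suggested owner are the ones that end up in the "ask around" pile.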
Our main resources reside in cloud platform C, and I only recently got my hands on it after dealing with the other two platforms. Our CTO asked us to create a forecast for the next year, given a monthly cost-saving threshold and an estimated business growth percentage. With those budgets defined, we have to create budget alerts in cloud platform C and send them to a dedicated Slack channel whenever they are triggered. Additionally, I created a scheduled Ansible job that calls the cloud platform API every month to collect underutilized resources, which are then picked up as tickets to be reviewed and executed. We do not want this to be fully automated since it might cause unintended disruptions, so we plan to treat it as a regular operational task under a change management approach. These inputs should help us keep our cloud costs optimized.
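For the forecast, one possible way to turn a growth estimate and a saving target into monthly budgets is sketched below in Python. The formula is an assumption, not our actual model: it spreads the annual growth evenly across the months by compounding and then applies the saving target as a flat percentage. The function names, the example numbers, and the alert thresholds (50%, 80%, 100% of budget) are all hypothetical.

```python
def monthly_budgets(baseline: float, annual_growth: float,
                    monthly_saving: float, months: int = 12) -> list:
    """Budget per month: baseline grown by compounding, minus the saving target.

    baseline        current monthly cost
    annual_growth   expected business growth over a year, e.g. 0.20 for 20%
    monthly_saving  targeted saving as a fraction of each month's cost
    """
    budgets = []
    for month in range(1, months + 1):
        # Assumption: growth compounds smoothly, reaching the full
        # annual figure in month 12.
        grown = baseline * (1 + annual_growth) ** (month / 12)
        budgets.append(round(grown * (1 - monthly_saving), 2))
    return budgets


def alert_level(actual: float, budget: float, thresholds=(0.5, 0.8, 1.0)):
    """Return the highest budget threshold the actual spend has crossed."""
    crossed = [t for t in thresholds if actual >= budget * t]
    return max(crossed) if crossed else None


# Made-up numbers: 1000 per month today, 20% growth, 10% saving target.
budgets = monthly_budgets(1000.0, 0.20, 0.10)
for month, budget in enumerate(budgets, start=1):
    print(f"month {month:2d}: budget {budget:.2f}")

print("alert level at 850 spent against 1000:", alert_level(850.0, 1000.0))
```

The value returned by `alert_level` is the kind of signal that would be forwarded to the Slack channel; the actual alerting is configured in the cloud platform itself.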