Reducing BigQuery production cost by 70% with Flex Slots
November 16, 2021
About a year ago, we started our migration from AWS to GCP. We wanted to explore new technologies, redesign our pipelines, and focus on cost savings. After a short time we became assiduous BigQuery users, and we noticed our BigQuery cost increased far beyond expectations.
We found out that using BigQuery slots could keep our costs down, letting us choose how many slots we were willing to pay for the month. After some back and forth with different solutions, we were finally able to keep our BigQuery compute cost under control. However, we still had a tiny problem, paying less also means having less computing power.
As the day goes on, daily aggregates require more resources to deal with the continuous flow of data. As a result, some processing jobs began to run behind. We had to fix this as it is critical for the business to have the data ready on time. That is how we decided to useFlex Slots, a new way to use BigQuery slots rather than the usual monthly or yearly reservations. With Flex Slots, you can purchase slots for short durations and cancel anytime after 60 seconds.
Im going to explain the different experiments we did to find the final solution.
We useAirflow(Composer) to orchestrate our pipelines, so our first idea was to develop a custom BigQuery Flex Slots Operator that could handle the process of commissioning Flex Slots as needed, executing the query, and then decommissioning the slots. We first tried adding this operator only on the most problematic queries, the ones that require the biggest amount of resources.
That solution worked, although it had limitations. One of the main problems was that if we had two DAGs running at the same time with this operator we could not assign more than one reservation per project, so the task failed. This is intrinsic to how the slot architecture is designed.
To avoid this situation we had to make those DAGs wait on each other. Without going into too much detail, this first implementation was not giving us the expected results, and the final cost of triggering Flex Slots that often was not what we were looking for either.
This is the schema that is currently working in production:
High level architecture diagram.
The solution is straightforward compared to the previous one, even though it involves different GCP services. We programmed an alert associated with the monthly reservation slot use. If the reservation goes above a certain amount of slots for more than 30 minutes, we trigger an alert.
Our Reservations structure works as follows:
The alert is only attached to the analytics reservation where the analytics project is the only one that can trigger Flex Slots. Other projects can take advantage of those slots as well. However, by default, their reservation pool is smaller than the analytics one. That makes sense because the analytics project is the one that deals with the data transformation and storage so other projects can do ad-hoc queries or external reporting.
Once the alert is triggered, it sends a request to the Cloud Function we have. Our Cloud Function has the slots commitment logic.
Here you can see a code snippet of the aforementioned Cloud Function:
Because we want to know how often those slots are triggered, we added a Slack notification that notifies us every time we buy slots as well as every time we decommission those slots.
Slack notification message.
The next image shows the BigQuery slot use. As you can see, we only trigger Flex when necessary.
BigQuery weekly slots distribution.
With the initial solution, the cost of running just two DAGs with Flex Slots was more than200$extra dollars per day.
With the solution we currently have, running all the DAGs we have, just triggering Flex Slots when needed, cost approximately60$to80$per day. As a result we were able to reduce cost by70%per day with Flex Slots!