This one little trick will save you up to 79% on your cloud refresh costs

Yes, the title is clickbait, but the savings are real! The approach can be applied to save money on any cloud data infrastructure. The savings in question apply to regular refreshes, not to other data processes (more on those in a future post).

The Problem

It’s not the first time I’ve looked at the numbers thinking this, and it won’t be the last. I glanced at the cloud admin dashboard and thought “Hmmm, those costs have gone up. The engineering time needed to bring them down would cost less than the amount we could save”. The place to start is the famous “low hanging fruit”: easy changes with significant impact.

The lowest hanging fruit

Most of us (especially those with experience of ‘bare metal’, in-house infrastructure) will set our regular data refresh to hourly and then get on with adding value for our consumers. Over time, data volumes increase, the business asks for more, and one day someone looks at the cloud bill and says “Wow, that’s got bigger”. If you can get that number down (especially before someone notices it), you are winning with your stakeholders.

If you are running a data refresh (or any process) hourly, seven days a week, that’s 168 executions in a week. Whatever your cost per refresh, multiplying it by 168 quickly adds up to a significant bill.

24 * 7 = 168
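To put a rough number on it, here is a back-of-the-envelope sketch in Python (the per-refresh cost is a made-up placeholder, not a real figure; substitute your own):

```python
# Hourly refreshes, around the clock, all week.
runs_per_week = 24 * 7  # 168 executions

# Hypothetical per-refresh cost; use your warehouse's real number here.
cost_per_refresh = 0.50
weekly_cost = runs_per_week * cost_per_refresh

print(runs_per_week, weekly_cost)  # 168 84.0
```

Even a small per-refresh cost becomes noticeable at 168 executions a week.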

The Solution

You probably already guessed it… But in my own wise words:

Don’t run your refresh every hour of every day if you don’t need to!

David Roman-Halliday

The best case

If your data consumers are only working during office hours in one country, you probably only need to refresh the data while they are working. If that’s 9 till 5, Monday to Friday, then the hourly refresh only needs to run 09:00 – 15:00 (seven executions covering the working day).

That’s seven times a day, five days a week. A total of 35 executions per week (a saving of 133 executions on our starting point of 168), which is a reduction of 79.17%.

7 * 5 = 35 -- Mon -> Fri, 09:00 -> 15:00
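A quick sanity check of that arithmetic, treating Python purely as a calculator:

```python
baseline = 24 * 7   # hourly, all week: 168 runs
best_case = 7 * 5   # seven runs a day, Monday to Friday: 35 runs

saved = baseline - best_case
saving_pct = 100 * saved / baseline

print(saved, round(saving_pct, 2))  # 133 79.17
```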

A more realistic case

Realistically, some people start earlier (or want the numbers ready for their 9am meeting), while others work a little later than finishing bang on 5pm. There may also be reports and updates which need to go out at weekends.

Refreshing hourly from 8am until (and including) 5pm Monday through Friday, plus twice a day at the weekend (for example, at the start of the morning working hours and the start of the afternoon), still clocks in at only 54 refreshes a week, far fewer than 168.

  10 * 5 = 50 -- Monday  -> Friday : 08:00 -> 17:00
+  2 * 2 =  4 -- Saturday & Sunday : 08:00 & 13:00
=          54

That’s still a serious reduction of almost 68% in the volume of executions, which is a significant saving on regular refreshes.
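A quick check of those numbers, again with Python as a calculator:

```python
baseline = 24 * 7   # hourly, all week: 168 runs
weekday = 10 * 5    # ten runs a day, Monday to Friday
weekend = 2 * 2     # two runs a day, Saturday and Sunday

realistic = weekday + weekend
saving_pct = 100 * (baseline - realistic) / baseline

print(realistic, round(saving_pct, 2))  # 54 67.86
```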

Coding it in

As most orchestration and related tools use cron-style definitions for the schedule, we can save time (and make schedules easier to read and check) using crontab.guru. Some examples are:

  • Every hour: 0 * * * *
  • Our best case:
    • Every hour from 9 through 15 on every day-of-week from Monday through Friday: 0 9-15 * * 1-5
  • Our more realistic case:
    • Every hour from 8 through 17 on every day-of-week from Monday through Friday: 0 8-17 * * 1-5
    • Hour 8 and 13 on Saturday and Sunday: 0 8,13 * * 6,0
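If you want to double-check that a schedule really produces the weekly run counts claimed above, one option is to enumerate every hour of a sample week and count the matches. A minimal sketch in plain Python (no cron parser; the two predicates mirror the best-case and more realistic schedules in words):

```python
from datetime import datetime, timedelta

def runs_per_week(matches):
    """Count the hours in one sample week for which matches(dt) is true."""
    start = datetime(2024, 1, 1)  # a Monday
    hours = (start + timedelta(hours=h) for h in range(24 * 7))
    return sum(1 for dt in hours if matches(dt))

# Best case: hourly 09:00-15:00, Monday to Friday (seven runs a day).
best_case = runs_per_week(lambda dt: dt.weekday() < 5 and 9 <= dt.hour <= 15)

# Realistic case: hourly 08:00-17:00 on weekdays, plus 08:00 and 13:00
# on Saturday and Sunday.
realistic = runs_per_week(
    lambda dt: (dt.weekday() < 5 and 8 <= dt.hour <= 17)
    or (dt.weekday() >= 5 and dt.hour in (8, 13))
)

print(best_case, realistic)  # 35 54
```

The same enumeration trick works for any schedule you can express as a predicate over a datetime, which makes it easy to test a new cadence before committing it to your orchestrator.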

Going Further

If you use dbt (and who in the world of data doesn’t seem to be using it now?), you can go further and use tags to refresh different parts of your project at different cadences (perhaps some sources don’t update as regularly as others). Doing that in detail requires another blog post, on another day.
