Taming the Snowflake: Strategies to Slash Your Cloud Data Warehouse Costs
My experience with Snowflake spans several different organizational contexts, and its power in scaling to meet business data requirements is closely followed by how quickly the costs can scale too. This post explores practical strategies to help you reduce your Snowflake costs without sacrificing performance, starting with general guidance before moving into technical strategies, including dbt-specific packages and approaches.
This shouldn’t discourage usage and innovation (it’s a great tool), but as engineers and technical leaders we should always be aware of costs and enable the business to invest in people and future enablement rather than spiralling infrastructure costs.
General Advice
Snowflake’s cloud data warehouse offers incredible power and flexibility, but its usage-based pricing model can quickly lead to spiralling costs if left unchecked. While the pay-as-you-go approach is generally beneficial, it requires vigilance and proactive optimization to avoid budget overruns.
Understanding Snowflake’s Cost Drivers:
Before diving into optimization techniques, it’s crucial to understand what influences your Snowflake bill. The primary cost drivers are:
- Compute (Virtual Warehouses): This is the biggest expense for most users. You pay for the compute resources used to execute queries, load data, and perform other operations. Warehouse size, runtime, and the number of warehouses all contribute to this cost.
- Storage: Snowflake charges for the data stored in your database. This includes the raw data, metadata, and any data retained for Time Travel.
- Data Transfer: While often less significant than compute or storage, data transfer costs can add up, especially for large-scale data ingestion or egress.
Practical Strategies for Cost Reduction:
1. Optimize Warehouse Usage:
- Right-size your warehouses: Don’t over-provision. Start with smaller warehouses and scale up only when necessary. Monitor warehouse utilization and adjust sizes accordingly. Snowflake’s ability to easily resize makes this a simple and effective optimization.
- Suspend idle warehouses: A running warehouse consumes credits even when no queries are executing. Configure auto-suspend to automatically shut down warehouses after a period of inactivity, with timeout durations based on your workload patterns (see the SQL sketch after this list).
- Use multiple warehouses strategically: Separate workloads with different resource requirements onto different warehouses. For example, use a smaller warehouse for ad-hoc queries and a larger one for ETL processes. This stops small queries from running on, and paying the rate of, a warehouse sized for heavier work.
- Monitor warehouse performance: Regularly analyse query execution plans to identify bottlenecks and optimize query performance. Faster queries mean less compute time and lower costs.
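As a rough sketch of what these warehouse settings look like in practice (the warehouse names here are hypothetical; adjust sizes and timeouts to your own workload patterns):

```sql
-- Resize an over-provisioned warehouse down a size.
alter warehouse reporting_wh set warehouse_size = 'SMALL';

-- Suspend after 60 seconds of inactivity, and resume automatically when a query arrives.
alter warehouse reporting_wh set
    auto_suspend = 60,
    auto_resume = true;

-- Keep heavy ETL work on its own, larger warehouse so ad-hoc queries
-- aren't paying for resources they don't need.
create warehouse if not exists etl_wh
    warehouse_size = 'LARGE'
    auto_suspend = 120
    auto_resume = true;
```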
2. Optimize Data Storage:
- Compress your data: Snowflake supports various compression techniques. Ensure your data is compressed efficiently to minimize storage costs.
- Leverage Time Travel judiciously: Time Travel allows you to access historical data, but it also contributes to storage costs. Configure appropriate retention periods for Time Travel based on your recovery needs; shorter retention periods translate to lower storage costs (see the example after this list).
- Purge unnecessary data: Regularly identify and delete or archive data that is no longer needed. This reduces your storage footprint and associated costs.
- Consider data clustering: Clustering can improve query performance, which indirectly reduces compute costs. However, clustering can also increase storage costs, so carefully evaluate the trade-offs.
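For example, Time Travel retention can be dialled down per table or per database (the table and database names below are hypothetical):

```sql
-- Keep only one day of Time Travel on a large, easily-rebuilt staging table.
alter table analytics.staging.web_events set data_retention_time_in_days = 1;

-- Or set a default for a whole database and override only where needed.
alter database analytics set data_retention_time_in_days = 1;
```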
3. Optimize Queries:
- Write efficient SQL: Optimize your SQL queries to minimize execution time. Snowflake has no traditional indexes, so focus on pruning: filter early on well-clustered columns, avoid unnecessary full table scans, and leverage Snowflake’s analytical functions where possible.
- Use caching effectively: Snowflake’s query caching can significantly improve performance and reduce compute costs. Ensure your queries are structured to take advantage of caching.
- Limit unnecessary data retrieval: Only select the columns and rows you actually need in your queries. Avoid selecting entire tables when only a subset of data is required.
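A small illustration of the “only retrieve what you need” point, using a hypothetical transactions table:

```sql
-- Avoid: scans and returns every column and the full history.
select * from analytics.transactions;

-- Better: select only the columns you need, and filter on a date column
-- (ideally one used as a clustering key) so far less data is scanned.
select transaction_id, customer_id, amount
from analytics.transactions
where transaction_date >= dateadd(day, -7, current_date);
```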
4. Monitor and Analyse Costs:
- Use Snowflake’s cost management features: Snowflake provides built-in tools for monitoring and analysing cost usage. Utilize these features to identify cost drivers and track the effectiveness of your optimization efforts.
- Set up budget alerts: Configure alerts to notify you when your spending approaches predefined thresholds. This allows you to take proactive action to prevent cost overruns.
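Resource monitors are the built-in way to get budget alerts; a minimal sketch follows (the quota and warehouse names are illustrative, and creating monitors requires ACCOUNTADMIN or an equivalent privilege):

```sql
-- Notify at 80% of a monthly credit quota, and suspend assigned warehouses at 100%.
create resource monitor monthly_budget
    with credit_quota = 100
    frequency = monthly
    start_timestamp = immediately
    triggers
        on 80 percent do notify
        on 100 percent do suspend;

-- Attach the monitor to a warehouse so the triggers apply to it.
alter warehouse reporting_wh set resource_monitor = monthly_budget;
```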
Review scheduling
This is the place to start! It’s the famous “low-hanging fruit”: easy changes with significant impact.
Most of us (especially those with experience in ‘bare metal’, in-house infrastructure) will set our regular data refresh to be hourly and then get on with adding value for our consumers. Over time the data volumes will increase, along with the number of models processed, which leads to ever-growing costs.
I covered this in more depth in my whimsically titled blog post This one little trick will save you up to 79% on your cloud refresh costs, but the key themes are:
- Review what and when processing is running.
- Change regular refreshes to run only during office hours (or whenever the outputs are actually needed).
- Split some processing to only run daily/weekly where:
- The sources don’t change hourly.
- The outputs are only used for reporting ‘up to the end of the previous day’.
- You might need one weekend run (for monitoring and the odd enthusiastic colleague).
- Identify the outputs which consumers actually need updated regularly, and only run those (a tagging sketch follows after this list).
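One way to implement the split in dbt is to tag models by cadence and run each tag on its own schedule. A hedged sketch, assuming made-up model names and tags (the dbt commands shown in the comments would live in your orchestrator’s schedule, not in SQL):

```sql
-- models/marts/daily_sales_report.sql
-- A hypothetical reporting model that only needs a daily refresh.
{{ config(
    materialized='table',
    tags=['daily']
) }}

select
    order_date,
    sum(amount) as total_sales
from {{ ref('stg_orders') }}   -- assumed upstream staging model
group by order_date

-- In the scheduler (shell commands, shown here as comments):
--   hourly, office hours only:  dbt run --exclude tag:daily
--   once a day:                 dbt run --select tag:daily
```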
Tips on dbt scheduling
If you are using dbt to manage the pipeline, then I’d recommend looking into YAML selectors in dbt. They give you a human-readable way to dynamically define which models run in each type of refresh (a blog post for the future).
Remove views from regular refreshes in dbt
This is a dbt/Snowflake-specific change: if you have a large number of views, removing them from each refresh can save processing time and a little cash. By default, dbt will recreate every view in the flow when running the refresh, which, as views don’t change (outside of a deployment), is wasted time and processing.
I wouldn’t recommend this unless you have a large number of views, as the per-view refresh cost is still small (but they add up), and it adds extra complexity to your processing. You will have to make sure that views are still recreated as part of a deployment run.
This can be another piece of “low-hanging fruit”, as it’s a case of identifying the views and tagging them with something that can be excluded from the refresh (again, YAML selectors are handy here; a sketch follows below).
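As a sketch of one way to do it (model names and tags are made up, and the commands in the comments assume a reasonably recent dbt version):

```sql
-- models/reporting/customer_summary_view.sql
-- A hypothetical view model, tagged so it can be excluded from routine runs.
{{ config(materialized='view', tags=['view_only']) }}

select
    customer_id,
    count(*) as order_count
from {{ ref('orders') }}   -- assumed upstream model
group by customer_id

-- Routine refresh, skipping views (shell commands, shown as comments):
--   dbt run --exclude tag:view_only
-- or exclude by materialization config instead of a tag:
--   dbt run --exclude config.materialized:view
-- Deployment runs still use a plain `dbt run` so the views get recreated.
```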
Understanding where costs originate
The next step (unless you feel you know the processes intimately) is to audit your refreshes and understand which models are actually costing you money, and when (you may still be surprised).
There is a lot you can do in Snowflake looking at the Query History, and using Resource Monitors, but it can be a bit of a graft going through it all manually.
The dbt package: dbt-snowflake-monitoring
To remove much of the graft if you are using dbt, the easy solution is the dbt-snowflake-monitoring package from select.dev (the package is open source and free to use). Their documentation is very easy to follow, and in a very short space of time you will have detailed logs of your dbt runs and how much each action costs (including breaking incremental refreshes down into their component queries). With this data you can see model-by-model costs and, with a little intuition, know the best places to start optimising.
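As a rough illustration of the kind of question the package lets you answer. The model and column names below are assumptions from memory and may differ between package versions, so check the package documentation for the exact schema:

```sql
-- Which dbt models cost the most over the last 30 days?
-- NOTE: dbt_queries, dbt_node_name, query_cost and start_time are assumed names.
select
    dbt_node_name,
    sum(query_cost) as total_cost,
    count(*)        as query_count
from dbt_queries
where start_time >= dateadd(day, -30, current_date)
group by dbt_node_name
order by total_cost desc
limit 20;
```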
Reduce processing volumes: Incremental Processing & Deltas
The next task, one which has varying difficulty and effort depending on your processes (and how many models you have), is to reduce the data volume processed on each refresh.
In most cases, this means inserting only new or changed records rather than reprocessing everything. A good example of where this can be used is a daily transaction log, where historic data is unlikely to change.
Incremental models in dbt
If you are working with dbt, then it’s relatively easy to implement incremental processing (one of the available materializations) on tables.
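A minimal sketch of an incremental model for the transaction-log example above (the table, column, and source names are hypothetical):

```sql
-- models/marts/fct_transactions.sql
{{ config(
    materialized='incremental',
    unique_key='transaction_id'
) }}

select
    transaction_id,
    customer_id,
    amount,
    transaction_date
from {{ source('shop', 'raw_transactions') }}   -- assumed source definition

{% if is_incremental() %}
  -- On incremental runs, only pick up rows newer than what's already loaded.
  where transaction_date > (select max(transaction_date) from {{ this }})
{% endif %}
```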
Snowflake Query History & Query Profile
After identifying expensive operations, it’s a case of going through them one by one to see what can be improved. This involves getting deeper into Snowflake, but tuning queries in Snowflake is still far easier than in the more hands-on technical days of Oracle. Look at the query using the Query History, and dive into what’s happening under the Query Profile.
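If you’d rather start from SQL than the UI, the ACCOUNT_USAGE views expose the same information. A starting point (this requires access to the SNOWFLAKE database, and the view can lag behind real time by up to around 45 minutes):

```sql
-- Longest-running queries in the last 7 days, with spillage and pruning stats.
select
    query_id,
    warehouse_name,
    warehouse_size,
    total_elapsed_time / 1000 as elapsed_seconds,
    bytes_spilled_to_local_storage,
    bytes_spilled_to_remote_storage,
    partitions_scanned,
    partitions_total
from snowflake.account_usage.query_history
where start_time >= dateadd(day, -7, current_timestamp())
  and execution_status = 'SUCCESS'
order by total_elapsed_time desc
limit 50;
```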
Right-size warehouses
Approaches to right-sizing a warehouse are a blog post in their own right, but the short summary is to look at the results in dbt-snowflake-monitoring, or at the Query Profile. Identify cases of spillage (a nice walkthrough here); those queries need a bigger warehouse, while other operations may run just as well (or better) on a smaller one.
In dbt there are a few options to manage warehouse size for specific models, which can reduce costs for the bulk of the models while keeping a bigger warehouse for those which need it.
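The simplest of those options is the snowflake_warehouse model config in dbt-snowflake, which overrides the connection’s default warehouse for a single model. A sketch, with made-up model and warehouse names:

```sql
-- models/marts/large_fact_model.sql
-- Run just this heavy model on a larger warehouse, while the project
-- default warehouse (from the dbt profile) stays small.
{{ config(
    materialized='table',
    snowflake_warehouse='TRANSFORMING_L'   -- assumed warehouse name
) }}

select *
from {{ ref('stg_events') }}   -- assumed upstream model
```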
Clustering
For those with experience in other databases, clustering isn’t anything new. It has significant benefits in joins and filtering. You can only have one clustering arrangement on any table, and for best results in a join the join condition should have the same clustering on both tables. Pick your clustering carefully!
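A hedged example of defining and checking a clustering key (the table and column names are hypothetical; remember that automatic clustering itself consumes credits, which is part of the trade-off mentioned earlier):

```sql
-- Cluster a transactions table on the columns most used for filtering and joining.
alter table analytics.transactions
    cluster by (transaction_date, customer_id);

-- Check how well the table is clustered on those columns (before and after).
select system$clustering_information(
    'analytics.transactions',
    '(transaction_date, customer_id)'
);
```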
Bigger architectural changes
As engineers, we often develop to get results quickly, and most projects more than a year or two old have grown ‘organically’. Whether the starting point is a culture of “Query-Driven Modeling” (engineers focusing on individual reporting outputs rather than an overall architectural harmony), or changes in source systems and data consumer requirements over time, our data infrastructure can increasingly suffer from issues including:
- Duplicated processes (performing the same calculation in different places)
- Complex and messy joins fitting disparate datasets together in an inefficient part of the processing
- Complex models needing to be refreshed more frequently due to their location in a data flow
- Repeated use of an upstream model
All of these can add unnecessary processing time and cost to a project. The resolution is rarely easy, as it usually requires significant changes, but there are often cases where reviewing the most expensive flow and breaking it down into something more efficient benefits developer efficiency (easier-to-reuse models) as well as costs.
Reviewing dbt projects
Review the dbt best practice guides; there is a lot there which engineers can leverage to improve their processes and use of the tool.
If you are using dbt and have a large project, then I’d recommend using dbt-project-evaluator (another dbt package). As well as encouraging good developer practices like documentation, it can highlight inefficient design patterns.
Cost reduction with SaaS tools and services
I don’t represent these companies, but I have experience with these services and wanted to include them here (there are many others; I’m just using these as examples of the options available).
Spending money on a tool or service makes sense if it enables you to reduce costs elsewhere; the old TCO argument rings as true in data processing as it does anywhere else. A proof-of-concept trial will often quickly give you a rough idea of how much time and money they can save in optimising costs.
Snowflake
Yes, you read that correctly! Snowflake can help you reduce your costs with them. When I spoke with my rep about it, they told me that it’s in their interest, as becoming too expensive isn’t good for them in the long term.
If you have a rep from Snowflake, speak to them about what they provide for reviewing costs and processes. They also have some consultancy-type offerings if you don’t have the expertise in-house.
select.dev
After testing out the dbt-snowflake-monitoring package from select.dev, it’s worth reaching out to them about their paid offering. It provides a very handy UI, some automated savings, and automated detection of models to improve (along with recommendations on how to improve them).
Seemore Data
Similar to select.dev, but working across a wide (and growing) number of other data integrations, Seemore Data identifies where costs are occurring. It works more broadly than select.dev (which focuses on dbt processing), providing insights into how to save money across your wider data infrastructure as well as Snowflake.
They are worth checking out for the whimsical pig mascot, even if you aren’t seriously considering the product.
Conclusion
Controlling Snowflake costs requires a proactive and ongoing approach. By implementing the strategies outlined in this post, you can gain better visibility into your spending, optimize resource utilization, and ultimately reduce your cloud data warehouse costs without compromising performance. Regular monitoring and analysis are key to ensuring continued cost efficiency as your data and workloads evolve. Don’t let your Snowflake bill become a monster – tame it with these practical tips!