Splitting storage and compute can save a lot on your DW costs
Are data warehouses [still] the best of breed?
Today, many of the customers we talk to are finding their data warehouses (DW) inadequate. They certainly need the DW; it’s a critical part of their business. At one customer, for example, the DW is now so large and cumbersome that they’ve had to look at larger offerings to keep up with the volume of data being ingested. Space is an issue.
In another case, the customer is not overwhelmed by the volume of data in their DW, but they want more compute from it. Early in the working day they require more compute, later less. Having increased their compute (and storage), they can’t easily scale it back once the rush is over.
The cloud is elastic, isn’t it?
I hear you say:
“Isn’t that what elastic means in all those cloud offerings?”
And you’d be right.
Amazon Redshift, for example, is a great product that can elastically scale to meet your demand. However, variable scaling is not its forte. Once scaled up, it’s time-consuming to scale back, given the partitioning of the data [which is what makes it fast in the first place].
Split your compute and storage
A customer we spoke to recently expressed dismay at the idea of computing their DW results on the fly instead of keeping them in the DW. Is this even possible? Doesn’t that put the DW out of business?
The answer is yes, it is possible. Of course, there are some caveats with this approach, specifically that you need to be able to throw lots of compute at the problem when you need answers from your “new” DW.
Using Apache Spark, you can split the compute from the storage. Data lakes (DL) store data efficiently and at low cost. Spark allows you to throw compute at the problem. We can continually re-compute, or we can write the results to an intermediate table (perhaps in our smaller, optimized DW), to storage or to a stream. And when demand increases, we can simply add more compute to the problem.
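To make the split concrete, here is a minimal PySpark sketch of a job that reads raw data from the lake, aggregates it, and writes the result back to storage. The bucket names, the `dt=` partition layout, and the `customer_id` and `amount` columns are all hypothetical, and we assume the data is stored as Parquet; treat it as an illustration of the pattern rather than a finished job.

```python
from datetime import date


def partition_path(base: str, day: date) -> str:
    """Build the lake path for one day's data (hypothetical dt=YYYY-MM-DD layout)."""
    return f"{base.rstrip('/')}/dt={day.isoformat()}"


def summarise_day(spark, lake_base: str, out_base: str, day: date) -> None:
    """Read one day of raw events from the lake, aggregate, and write the summary."""
    # Imported here so the path helper above can be used without Spark installed.
    from pyspark.sql import functions as F

    # Storage: the raw data stays in the lake; Spark only reads it.
    events = spark.read.parquet(partition_path(lake_base, day))

    # Compute: the aggregation runs on however many executors the cluster has.
    summary = events.groupBy("customer_id").agg(
        F.sum("amount").alias("total_amount"),   # hypothetical column
        F.count("*").alias("event_count"),
    )

    # Overwrite so the same day can be re-computed safely at any time.
    summary.write.mode("overwrite").parquet(partition_path(out_base, day))
```

A job would call `summarise_day(spark, "s3://example-lake/events", "s3://example-lake/summaries", date(2024, 1, 15))` with a `SparkSession` created via `SparkSession.builder.getOrCreate()`. Nothing in the function fixes the size of the cluster, which is the point: the same code runs on two nodes or on twenty.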
Scaling back is no problem either. Less demand means fewer Spark instances and so we’re not paying for the compute we don’t need after the rush on the DW is over. Older data no longer requiring frequent access can be archived using lifecycle policies, reducing the storage costs too.
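As an illustration of the lifecycle policies mentioned above, the sketch below builds a rule that moves older objects to an archive storage class. We assume the data lake lives in Amazon S3; the `events/` prefix and the 90-day threshold are made-up values, and `archive_rule` is a hypothetical helper of our own.

```python
import json


def archive_rule(prefix: str, archive_after_days: int) -> dict:
    """Build one S3 lifecycle rule that archives objects under a prefix."""
    return {
        "ID": f"archive-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # After the given number of days, move objects to the Glacier class.
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
    }


# Hypothetical policy: archive raw events that are older than 90 days.
lifecycle = {"Rules": [archive_rule("events/", 90)]}

# The resulting dictionary is what boto3's
# put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle)
# expects; here we only print it.
print(json.dumps(lifecycle, indent=2))
```

Once a policy like this is attached to the bucket, archiving happens without any job of ours having to run.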
This is a win-win. Not only do you save on storage costs in the data lake, you save on compute costs when that massive compute is not required. We can add and subtract compute based on the demands of the business.
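A back-of-the-envelope calculation shows where the compute saving comes from. The hourly node counts and the price per node-hour below are invented for illustration only; they are not measurements or quoted prices.

```python
def fixed_cost(peak_nodes: int, hours: int, rate: float) -> float:
    """Cost of keeping a cluster sized for the peak running all day."""
    return peak_nodes * hours * rate


def elastic_cost(nodes_per_hour: list[int], rate: float) -> float:
    """Cost when the cluster is resized every hour to match demand."""
    return sum(nodes_per_hour) * rate


# Hypothetical day: quiet overnight, a morning rush, a moderate afternoon.
demand = [2] * 8 + [10] * 4 + [4] * 12   # nodes needed in each of 24 hours
rate = 0.50                              # assumed price per node-hour

fixed = fixed_cost(max(demand), len(demand), rate)
elastic = elastic_cost(demand, rate)

print(f"sized for peak: {fixed:.2f}")
print(f"sized to demand: {elastic:.2f}")
```

With these made-up numbers, the cluster sized for the peak costs 120.00 per day against 52.00 for the one that follows demand. The real ratio depends entirely on how spiky your workload is.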
Finally, we’re not implying you’ll need to throw out your DW. In fact, with this model, you can optimize the DW to serve those applications where you require mission-critical reporting with clockwork regularity. The data lake and Apache Spark allow you to augment the DW for your other reporting and data exploration requirements. Stitching this all together with Amazon EMR makes a cost-effective, practical solution.
Talk to us
Talk to us about delivering solutions that optimize the cost of storage and compute. With the Cloud Fundis EMR framework for continuous spot instance clusters and Apache Spark, we can deliver the best of both (storage and compute) worlds.