Storage is not always cheap

We are in the golden era of data. The global data sphere produced roughly 2 zettabytes in 2010 and around 64 in 2020, a 32x increase. The growth of data volume is exponential.
As individuals, we produce and consume an insane amount of data. As companies, we want to leverage it to make better decisions and smarter products. Given this, it’s easy to fall into the trap of storing data just for the sake of storing it. With today’s market conditions, data teams will be challenged more and more on their Total Cost of Ownership. “Storage is cheap” can be a myth. Let’s find out why.
🤔Where all the fuss started
Let’s start with a small reminder of where “big data” comes from. It refers to the 3 Vs: Velocity (the speed at which you need the data), Volume (its size), and Variety (the diversity of its sources).
In this article, we will focus on only the most underrated of these dimensions: volume.
💨Data collection & the downside of ELT
The de facto strategy nowadays is to capture almost any data because we have “potential” future use cases, even with no active users or use cases today. And if we don’t capture it, it’s lost.
That’s where ELT (extract-load-transform) shines, and why it has taken over from its older brother ETL (extract-transform-load) these past years: we needed a more flexible way to ingest data without having to transform and model it first. But abusing this pattern ends with a lot of data collected and never used.
🎌Common traps
Here are the most obvious ones I’ve seen repeatedly:
Too many data copies made for development/experimentation purposes without an expiration date: the data stays there as a ghost.
Not converting raw CSV/JSON files: many file formats are far more appropriate for storage and querying (e.g., Parquet, ORC, Avro…); see the conversion sketch after this list.
Too many useless snapshots.
Too much staging/pipeline data that is never pruned.
Too much reloading into the data warehouse instead of leveraging a lakehouse or direct queries on the data lake.
Too much data captured that is not used at all.
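For the file-conversion trap, here is a minimal sketch, assuming pandas with pyarrow installed and hypothetical file paths; converting a raw CSV drop to compressed Parquet typically shrinks it considerably and makes it much cheaper to query:

```python
import pandas as pd

# Hypothetical path: a raw CSV dropped by an ingestion job.
df = pd.read_csv("landing/events.csv")

# Parquet is columnar and compressed: far smaller on disk, and query
# engines only read the columns they actually need.
df.to_parquet("curated/events.parquet", compression="snappy", index=False)
```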
That last trap, the data captured but never used, is hard to tackle, as we could always have a use case in the future. However, when is this future going to happen? Challenge yourself with the following questions:
Is this use case really valuable versus the operational overhead?
Is there any way to backfill this data if we don’t capture it now?
Having a “knowledge first” approach will help you decide whether you need to capture this data in the first place.
But wait, isn’t storage cheap?
Yes, it is. But because it’s so easy to store, we tend to abuse it, and bad patterns are just as easy to adopt. Storing a huge amount of data still adds up: at roughly $0.02 per GB-month for standard object storage, 500 TB of forgotten copies is about $10,000 every month.
🫰Embracing FinOps
As discussed above, there are different ways to solve the issue, but the mindset is to adopt FinOps from day one.
That means, for instance:
Creating cloud budget alerts (see the sketch after this list).
Creating dashboards of consumption.
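For the first point, here is a minimal sketch of a budget alert using AWS Budgets via boto3; the account ID, amount, and email address are hypothetical placeholders:

```python
import boto3

budgets = boto3.client("budgets")

# Hypothetical account ID, monthly limit, and recipient.
budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "monthly-data-platform-budget",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    # Email the team when actual spend crosses 80% of the budget.
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "data-team@example.com"}
            ],
        }
    ],
)
```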
For the latter, ask yourself the following questions:
What are the biggest object storage buckets? (see the sketch below)
What are the biggest tables?
When was the last time this big bucket/table was accessed?
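To answer the first question, here is a minimal sketch using boto3, assuming AWS credentials are configured; at scale, you would prefer S3 Storage Lens or CloudWatch storage metrics over listing every object:

```python
import boto3

s3 = boto3.client("s3")

def bucket_size_bytes(bucket: str) -> int:
    """Sum object sizes by listing the bucket (fine for small/medium buckets)."""
    total = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total

# Print every bucket with its size in GB, biggest offenders first.
sizes = {b["Name"]: bucket_size_bytes(b["Name"]) for b in s3.list_buckets()["Buckets"]}
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {size / 1e9:.2f} GB")
```

Most warehouses expose the table-level equivalent through metadata views (information schema tables, storage metrics views, and the like), while last-access information usually requires access logging to be enabled.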
Clear monitoring and alerting will help you detect when things are going sideways, rather than reacting once the bill is already there.
Plus, having a clear view of your costs is always something your leadership will appreciate, as it makes budget planning easier (for both headcount and resources).
Bonus: if you use AWS S3, you should check out S3 Intelligent-Tiering. It automatically moves your objects to cheaper access tiers based on how often they are accessed.
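As a minimal sketch, assuming boto3 and a hypothetical bucket name, a lifecycle rule can transition new objects into Intelligent-Tiering automatically:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket: transition every object into
# Intelligent-Tiering as soon as it is eligible.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to all objects
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```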
🩹ETL to the rescue?
Aside from FinOps, we also saw how the ELT paradigm can be misused. ETL can help here, because you can apply some of the transformations up front rather than storing the raw data as it is.
For example, if you join some sources and transform them into a more appropriate file format (e.g., Parquet), there is little to no value in keeping the raw files once you’ve transformed them.
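Here is a minimal sketch of that idea, with hypothetical pandas sources and paths; the raw landing files are pruned once the transformed output exists:

```python
import pandas as pd
from pathlib import Path

# Hypothetical landing files dropped by an ingestion job.
orders = pd.read_csv("landing/orders.csv")
users = pd.read_csv("landing/users.csv")

# Transform on the way in: join the sources and store columnar Parquet.
enriched = orders.merge(users, on="user_id", how="left")
enriched.to_parquet("warehouse/orders_enriched.parquet", index=False)

# Once the output is validated, the raw files add no value; prune them
# (or put a short retention policy on the landing area instead).
for f in ("landing/orders.csv", "landing/users.csv"):
    Path(f).unlink(missing_ok=True)
```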
Either way, it’s good to keep the ELT tradeoff in mind and rethink things if your storage costs are exploding.
💸Everything is cheap until it’s a priority.
As you can see, the biggest danger with storage is the common belief that storage is cheap: none of the bad patterns we listed feel like a priority to fix, until your cloud budget is burned.
However, today you can start budgeting and monitoring from the very start of your data journey. No need to go crazy on this, but it should evolve along with your use cases and data maturity, so that you keep costs tight and added value high.
Want to connect? Follow me on 🎥 YouTube and 🔗LinkedIn for more data/code content!