Data Lakes are Technical Debt

I’ve been working on big data since 2014, and so far I’ve managed to avoid taking on the technical debt of data lakes. Here is why.

Myth of reusing existing text logs

For the purpose of this post, let’s define: a data lake is a system that allows you a) to store data from various sources in its original format (including unstructured and semi-structured data) and b) to process this data.

Yes, you can copy your existing text log files into a data lake and run any data processing on them as a second step.

This processing could be either a) converting them into a more appropriate storage format (more on that in just a minute) or b) working with the actual information – for example, exploring it, creating reports, extracting features or executing business rules.

The latter is bad, bad technical debt:

  • Text logs are not compressed, not stored by columns, and have no secondary indexes, so you waste more storage space, RAM, CPU time, energy, carbon emissions, upload and processing time, and money whenever you casually work with the actual information contained in them.
  • Text logs don’t have a schema. Schemas in data pipelines play the same role as strict static typing in programming languages. If somebody just inserts one more column somewhere in the middle of your text log, your pipeline will, in the best case, fail weeks or months later (if you execute it, say, only once a month), or, in the worst case, produce garbage as a result, because it cannot detect the type change dynamically (illustrated below).
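To make the second point concrete, here is a tiny, purely illustrative sketch; the log line and field layout are made up:

# The text log has no schema: nothing stops a new field from landing in the middle.
old_line = 'order_completed 42 12.5'
new_line = 'order_completed EU 42 12.5'   # somebody "just added" a region column

def parse(line):
    action, order_nr, duration_ms = line.split()[:3]
    return action, int(order_nr), float(duration_ms)

parse(old_line)   # ('order_completed', 42, 12.5)
parse(new_line)   # best case: ValueError; had the new column been numeric, silent garbage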

Never work off text logs directly.

A more appropriate format for storage and for data processing is a set of relational tables stored in a compressed, columnar format, with the possibility of adding secondary indexes and projections, and with a fixed schema that checks at least column names and types.

And if we don’t work off the text logs directly, it makes no sense to copy them into a data lake – first, to avoid the temptation to use them “just for this one quick and dirty one-time report”, but also because you can read the logs from the system where they are originally stored, convert them into a proper format, and ingest them into your relational columnar database.
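As a minimal sketch of that conversion step – the log format, file name and regular expression are invented, and Parquet via pyarrow stands in for whatever columnar target you actually use:

import re
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical log line: "2024-03-01T12:00:00 order_completed 42 12.5"
LINE = re.compile(r'(\S+) (\S+) (\d+) ([\d.]+)')

def parse(path):
    with open(path) as f:
        for line in f:
            m = LINE.match(line)
            if m:
                ts, action, order_nr, duration_ms = m.groups()
                yield {'ts': ts, 'action': action,
                       'order_nr': int(order_nr), 'duration_ms': float(duration_ms)}

schema = pa.schema([('ts', pa.string()), ('action', pa.string()),
                    ('order_nr', pa.int64()), ('duration_ms', pa.float64())])

# Typed, compressed, columnar – and schema-checked at write time.
table = pa.Table.from_pylist(list(parse('app.log')), schema=schema)
pq.write_table(table, 'events.parquet', compression='zstd')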

Yes, a data lake would provide a backup for the text logs. But YAGNI. The only use-case where you would need to re-import some older logs is a nasty, hard-to-find bug in the import code. This happens rarely enough that a much cheaper backup solution than the data lake will do.

Another disadvantage of working with text logs in data lakes is that it encourages producing even more technical debt in the future.

Our data scientist needs a little more information? We “just add” one more column to our text log. But at some point the logs become so big and bloated that you can no longer read them with the naked eye in any text editor, and you lose the primary purpose of any text log: tracing the state of the system to enable offline debugging. And if you add this new column in the middle, some old data pipelines can silently break and burn on you.

Our data scientist needs information from our new software service? We will just write it to a new text log, because we’re already doing that in our old system and it “kinda works”. But in fact, logging some information:

logging.info('Order %d has been completed in %f ms', order_nr, time)

takes roughly as much effort as inserting it into a proper, optimal, schema-checked data format:

db.insert(action='order_completed', order=order_nr, duration_ms=time)

but the latter saves time, energy, and storage and processing costs, and rules out possible format mistakes.
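The db object above is pseudocode. As one possible sketch of what could sit behind such a call – ClickHouse via the clickhouse-driver package is purely my assumption here, and the table and column names are invented:

from clickhouse_driver import Client   # assumption: ClickHouse as the columnar store

client = Client(host='localhost')

# One fixed, typed table instead of a free-form text log.
client.execute('''
    CREATE TABLE IF NOT EXISTS events (
        ts DateTime DEFAULT now(),
        action String,
        order_nr UInt64,
        duration_ms Float64
    ) ENGINE = MergeTree ORDER BY (action, ts)
''')

def insert(action, order_nr, duration_ms):
    client.execute(
        'INSERT INTO events (action, order_nr, duration_ms) VALUES',
        [(action, order_nr, duration_ms)],
    )

insert('order_completed', 42, 12.5)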

Myth of decoupled usage

You can insert all the data you have now, store it in the data lake, and if somebody needs to use it later, they will know where to find it.

Busted. Unused data is not an asset, it is a liability:

  • you pay for its storage,
  • you always need to jump over it when you scroll through a long list of data buckets in your lake,
  • it might contain personal data, so you have one more copy to check when you need to fulfill GDPR requirements,
  • it might contain passwords, secure tokens, company secrets or other sensitive information that could be stolen or leak,
  • every time you change the technology or the cloud provider of your data lake, you have to spend time, effort and money porting this unused data too.

Now, don’t get me wrong. Storage is cheap, and nothing makes me angrier at work than people who delete or don’t store data just because they want to save storage costs. Backup storage is not as expensive as data lake storage, and de-personalized parts of the data should be stored forever, just in case we might need them (but remember: YAGNI).

Storing unused data in a data lake is much worse than storing it in an unused backup.

Another real-world issue preventing decoupled usage of data is how quickly the world changes. Even if the data stored in the data lake is meticulously documented down to the smallest detail – which is rarely the case – time doesn’t stand still. Some order types and licensing conditions become obsolete, some features don’t exist any more, and the code that produced the data has already been removed, not only from the master branch but from the code repository altogether, because at some point the company switched from SVN to git and decided to drop the history older than 3 years, and so on.

You will find column names that nobody can understand, and column values that nobody can interpret. And that is the best case. In the worst case, you will find an innocent, fair-looking column named “is_customer” with values 0 and 1, mistake it for a paying-customer flag, and use it for some report going up to the C-level, only to cringe painfully when somebody suddenly remembers that your company had the idea of building a business alliance 10 years ago, and this column was used to identify potential business partners for that cooperation.

I only trust the data I collect myself (or at least data where I can read and fully understand the source code that collects it).

The value of most data decays exponentially with time.

Myth of “you gonna need it anyway”

It goes like this: you collect data in small batches, say every minute, every hour or every day. Having many small files makes your data processing slow, so you re-partition them, for example into monthly partitions. At this point you can also switch to a columnar, schema-checked store and remove unneeded data. These monthly files are still too slow for online, interactive user traffic (with expected latencies of milliseconds), so you run the next aggregation step and shove the pre-computed values into some quick key-value store.
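The re-partitioning step, as a sketch – DuckDB and the file layout are my assumptions, and I assume the small batches have already been converted to Parquet:

import duckdb

# Compact many small batch files into monthly, compressed, columnar partitions.
duckdb.sql("""
    COPY (
        SELECT *, date_trunc('month', ts) AS month
        FROM read_parquet('batches/*.parquet')
    )
    TO 'events' (FORMAT PARQUET, PARTITION_BY (month), COMPRESSION ZSTD)
""")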

Storing the original data in its original format in the lake as the first step feels scientifically sound. It makes the pipeline uniform and is a prerequisite for reproducibility.

And at the very least, you will have three or more copies of that data (in different aggregation states and formats) somewhere anyway, so why not store one more, original copy?

I suppose this very widespread idea comes from some historically very popular big data systems like Hadoop, Hive, Spark, Presto (= AWS Athena), row-based stores like AWS Redshift (= PostgreSQL) or even document-based systems like MongoDB. Coincidentally, these systems are not only very popular, but also have very high latency and / or waste a lot of hardware resources, given that some of them are written in Java (no system software should ever be written in Java) or use storage concepts not suitable for big data (document or row stores). With these systems, there is no other way than to duplicate data and store it pre-computed in different formats according to the consumption use-case.

But we don’t need to use popular software.

Modern column-based storage systems, based on the principles discovered with Dremel and MonetDB, are so efficient that in most use-cases (say 80%) you can store your data exactly once, in a format suitable for a wide variety of queries and use-cases, and deliver responses with sub-second latency for simple queries.

Some of these database systems (in alphabetical order):

  • Clickhouse
  • DuckDB
  • Exasol
  • MS SQL Server 2016 (COLUMNSTORE index)
  • Vertica
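For illustration, a minimal sketch of querying that single copy directly – DuckDB over the monthly Parquet partitions from the earlier sketch; both choices are mine, and any of the systems above would do:

import duckdb

# Ad-hoc analytics straight off the one canonical, columnar copy –
# no pre-aggregated duplicates, no extra key-value store.
print(duckdb.sql("""
    SELECT action, count(*) AS orders, avg(duration_ms) AS avg_ms
    FROM read_parquet('events/*/*.parquet', hive_partitioning = true)
    GROUP BY action
    ORDER BY orders DESC
""").df())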

A direct comparison of Clickhouse running on an EC2 instance versus data stored in S3 and queried by Athena (for a specific data mix and query types that are typical at my current employer, Paessler AG) has shown that in this particular case Clickhouse is 3 to 30 times quicker, and at the same time cheaper, than the naive Athena implementation.

Is it possible to speed up Athena? Yes, if you pre-aggregate some information, pre-compute some other information, and store it in DynamoDB. You’ll then get it cheaper than Clickhouse, and “only” 50% to 100% slower. Is it worth having three copies of the data and employing a full-time DBA to monitor the health of the data pipelines for all that pre-aggregating and pre-computing, as well as using three different APIs to access the data (Athena, DynamoDB and PyArrow)? YMMV.
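For reference, the pre-compute path looks roughly like this – a sketch with boto3, where the table name and key layout are invented:

from decimal import Decimal
import boto3

table = boto3.resource('dynamodb').Table('order_stats_daily')   # hypothetical table

# One pre-aggregated row per (action, day), produced by yet another pipeline job –
# the second or third copy of the same data.
table.put_item(Item={
    'action': 'order_completed',
    'day': '2024-03-01',
    'orders': 1234,
    'avg_ms': Decimal('12.5'),   # DynamoDB numbers must be Decimal, not float
})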

Summary

Data lakes facilitate technical debt:

  • Untyped data (that can lead to silent, epic fuck-ups)
  • Waste of time
  • Waste of money
  • Waste of hardware
  • Waste of energy and higher carbon footprint
  • Many copies of the same data (that can get out of sync)
  • Can go against the data minimization principle of GDPR
  • Can create additional security risks
  • Can easily become a data grave if you don’t remove dead data regularly

Avoid data lakes if you can. If you still have to use them, mind the technical debt you are agreeing to, and be explicit about it.
