How to choose a database for data science

“I have five CSV files, 1 GB each, and loading them with Pandas is too slow. What database should I use instead?”

I often get questions like this from data scientists. Read on for a simple conceptual framework that will let you answer such questions yourself.

Conceptual framework

The time a data operation takes depends on the amount of data, the throughput of the hardware, and the number of operations you can do in parallel:

Time ~ DataSize / (Throughput × Parallelization)
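To make the relation concrete, here is a back-of-the-envelope calculation for the question at the top of this post. The 100 MB/s throughput is an assumed ballpark figure for a mechanical hard drive, not a measurement:

```python
# Back-of-the-envelope estimate: five 1 GB CSV files
# read from a mechanical hard drive by a single process.
data_size_mb = 5 * 1024        # total amount of data, in MB
throughput_mb_s = 100          # assumed sequential read speed of the drive
parallel_readers = 1           # one process reading sequentially

seconds = data_size_mb / (throughput_mb_s * parallel_readers)
print(f"~{seconds:.0f} s just to read the bytes from disk")  # ~51 s
```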

Throughput is the easiest part, because it is defined and limited by the electronics and its physical constraints. Here are the data sources, from slow to fast:

  • Internet (for example, S3)
  • Local network (for example, a NAS: hard drives attached to your local network)
  • Mechanical hard drives inside your computer
  • SSD inside your computer
  • RAM
  • CPU cache memory / GPU memory, if you use a GPU for training

You can reduce data processing time by moving the data to a faster medium. Here is how:

  • Download the data from the internet to your hard drive, or to a NAS on your local network if the data doesn’t fit on your machine.
  • As soon as you read a file once, your OS keeps its data in RAM in the file cache for further reads (the sketch after this list shows the effect). Note that the data will eventually be evicted from memory, because RAM is usually much smaller than the hard drive.
  • If you want to prevent eviction for some particular files that fit in memory, you can create a RAM drive and copy the files there.
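You can observe the file cache directly by timing the same read twice; the second read is typically served from RAM. A minimal sketch, with the file name as a placeholder:

```python
import time
import pandas as pd

def timed_read(path):
    start = time.perf_counter()
    df = pd.read_csv(path)
    print(f"{time.perf_counter() - start:.1f} s for {len(df):,} rows")
    return df

# The first read comes from the hard drive; the second one is usually
# served from the OS file cache in RAM and is noticeably faster.
timed_read("sales_2023.csv")  # placeholder file name
timed_read("sales_2023.csv")
```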

There are a few tricks to reduce the data size:

  • Compress the data with any lossless compression (zip, gzip, etc.). You will still need to decompress it before working on it, but reading from a slow data source like a hard drive will be quicker because there is less data to read, and the decompression happens in fast RAM.
  • Partition the data, for example by month or by customer. If you only need to run a query for one month, you can skip reading the data for the other months. Even if you need the full data set (for example for ML training), you can spread your partitions across different computers and process them simultaneously (see parallelization below; a Parquet sketch follows this list).
  • Store the data not row by row, but column by column. That is the same idea as partitioning, but column-wise: if a particular query doesn’t need some column, a column-based file format lets you skip reading it, reducing the amount of data.
  • Store additional data (an index) that helps you find rows and columns inside your main file. For example, you can create a full-text index over the columns containing free English text; then, if you only need the rows containing the word “dog”, the index lets you read only the bytes of the rows that actually contain “dog”, so you read less data. Computer science focuses heavily on this and gets most excited inventing data structures for especially fast indexes. For data science, this is rarely helpful, because we often need to read all the data anyway (e.g. for training).
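Compression, partitioning and column-based storage are all available straight from Pandas through Parquet. A minimal sketch, assuming fastparquet is installed; the file, column and partition names are made up, and the partition directory path follows the hive-style layout that partitioned Parquet writes typically produce:

```python
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["timestamp"])  # placeholder input
df["month"] = df["timestamp"].dt.to_period("M").astype(str)

# Write a compressed, column-based Parquet dataset with one directory per month.
df.to_parquet("events_parquet", engine="fastparquet",
              compression="gzip", partition_cols=["month"])

# Later, read only the partition and the columns you actually need.
january = pd.read_parquet("events_parquet/month=2023-01",
                          engine="fastparquet",
                          columns=["customer_id", "amount"])
```

Each trick from the list maps to one argument here: gzip handles compression, partition_cols handles partitioning, and columns= skips the columns you don’t need.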

As for parallelization, it is a solution for the situation where your existing hardware is under-used and you want to speed things up by fully using it, or where you can provision more hardware (and pay for it) to speed up your processes. At the same time, this is the most complicated factor to optimize, so don’t go there before you have tried the steps above.

  • Parallelization on the macro level: split your training set across several servers, read the data and train the model simultaneously on each of them, compute weight updates with backpropagation, apply them to a central store of weights, and redistribute the weights back to all servers.
  • Parallelization on the micro level: put several examples from your training set onto the GPU at once. Databases and frameworks like numpy use the same idea on a smaller scale, where it is called vectorization (see the sketch after this list).
  • Parallelization on the hardware level: attach several hard drives to your computer in a so-called RAID array; when you read from it, the hardware reads from all of the drives in parallel, multiplying your throughput.
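Here is the micro-level idea on a single machine: the same sum of squares computed with a plain Python loop and with numpy’s vectorized routines. The array size is arbitrary:

```python
import time
import numpy as np

x = np.random.rand(10_000_000)

# Plain Python loop: one element at a time.
start = time.perf_counter()
total = sum(value * value for value in x)
print(f"loop:       {time.perf_counter() - start:.2f} s")

# Vectorized: numpy processes the whole array in optimized native code,
# typically using SIMD instructions under the hood.
start = time.perf_counter()
total = float(np.dot(x, x))
print(f"vectorized: {time.perf_counter() - start:.2f} s")
```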

Parallelization is sometimes useless or too expensive, and it is hard to get right, so treat it as your last resort.

Some practical solutions

A database is a combination of a smart compressed file format, an engine that can read and write it efficiently, and an API so that several people can work with it simultaneously, data can be replicated between database nodes, and so on. When in doubt, for data science use ClickHouse, because it provides top performance for many different kinds of data and use cases.
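A minimal sketch of what working with ClickHouse from Python can look like, assuming the clickhouse-driver package and a server running locally; the table and column names are made up:

```python
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client("localhost")  # assumes a ClickHouse server running locally

# Hypothetical table and columns, just to show the shape of the API.
rows = client.execute(
    "SELECT customer_id, sum(amount) FROM sales GROUP BY customer_id"
)
print(rows[:5])
```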

But a database is the top-notch full package, and often you don’t need all of it, especially if you work alone on some particular dataset. Luckily, separate parts of databases are also available on the market, under various names like embedded database, database engine, or in-memory analytics framework.

An embedded database like DuckDB, or even just a file format like Parquet (Pandas can read it using fastparquet), is in my opinion the most interesting option and might already be enough.
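A minimal sketch of both options, with the file name as a placeholder: DuckDB runs SQL directly on a Parquet file without loading all of it into RAM, and Pandas reads only selected columns via fastparquet.

```python
import duckdb
import pandas as pd

# Query the Parquet file in place with DuckDB and get a DataFrame back.
top_customers = duckdb.sql("""
    SELECT customer_id, sum(amount) AS total
    FROM 'events.parquet'
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").df()

# Or read only the columns you need into Pandas via fastparquet.
df = pd.read_parquet("events.parquet", engine="fastparquet",
                     columns=["customer_id", "amount"])
```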

Here is a short overview:

  • Data compression: ClickHouse has a native compressed format with configurable codecs and some basic Parquet support; DuckDB has a native compressed format and also supports Parquet; fastparquet offers configurable compression codecs.
  • Partitioning: ClickHouse has native partitioning and basic partitioning support for Parquet; DuckDB has some partitioning support for Parquet; fastparquet relies on manual partitioning using different file names.
  • Column-based storage: ClickHouse stores data column-wise in its native format, Parquet and other file formats; DuckDB supports its native format and Parquet; Parquet itself is a column-based format.
  • Indexes: ClickHouse supports primary and secondary indexes for its native format (apparently no index support for Parquet); DuckDB has a min-max index for its native format (apparently no index support for Parquet); fastparquet does not support indexes to my knowledge, although the Parquet format itself has them.
