New Features in ClickHouse

I have recently discovered a lot of interesting new features in ClickHouse that I had previously missed, and I want to share them with you.

Variant data type

In a Variant column we can store data of different types; which types exactly, we have to specify when creating the column. In practice I often have this requirement when importing data from untyped or loosely typed formats (XML, JSON, YAML). For example,

my_field Variant(Int64, String, Array(Int64), Array(String))

would cover a lot of my use cases. When selecting the data, we can for example use the subcolumn my_field.Int64 to get either a number, or NULL if the row contains a value of some other type.
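
Here is a minimal sketch (table name and values are mine; on older versions you may need to enable the experimental setting first):

set allow_experimental_variant_type = 1; -- only needed while Variant is experimental

create table variant_demo (
  my_field Variant(Int64, String, Array(Int64), Array(String))
)
engine = MergeTree
order by tuple();

insert into variant_demo values (42), ('hello'), ([1, 2, 3]);

-- my_field.Int64 is 42 for the first row and NULL for the other two;
-- variantType() tells us which type each row actually holds
select my_field, my_field.Int64, variantType(my_field)
from variant_demo;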

Dynamic data type

Dynamic is similar to Variant, but you don’t need to define the possible data types up front. When you insert a value into such a column, it is stored along with its data type, and from then on you can query the column by that type.
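
A quick sketch of how this looks (again, names are mine, and older versions gate this behind an experimental setting):

set allow_experimental_dynamic_type = 1; -- only needed while Dynamic is experimental

create table dynamic_demo (
  d Dynamic
)
engine = MergeTree
order by tuple();

insert into dynamic_demo values (42), ('hello'), ([1, 2, 3]);

-- dynamicType() shows the type stored with each value,
-- and d.Int64 works like the Variant subcolumn access above
select d, dynamicType(d), d.Int64
from dynamic_demo;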

Geo data types

The family of Geo data types consists of Point, Ring, Polygon, and MultiPolygon. I think work on the geo features started about nine years ago, but by now they look fully developed and production-ready.
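
For example, a small point-in-ring check (table and values are mine):

create table geo_demo (
  p Point,
  border Ring
)
engine = MergeTree
order by tuple();

-- a Ring is an array of points; here it is a 4 x 4 square
insert into geo_demo values ((1.0, 1.0), [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]);

-- returns 1, because the point lies inside the square
select pointInPolygon(p, border) from geo_demo;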

Materialized database engines

ClickHouse has had database engines for PostgreSQL and MySQL for many years. You connect to ClickHouse and send a query, and ClickHouse forwards it behind the scenes to a running PostgreSQL or MySQL instance and returns the answer. This is nice if we want to join ClickHouse tables with data stored in those databases.

Now we have an option to materialize these databases. When you create one, ClickHouse downloads all the data from PostgreSQL / MySQL, stores it in its native compressed format, and starts receiving updates from the corresponding source server, acting basically as a replica.
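
For PostgreSQL, this looks roughly as follows (connection parameters and the orders table are placeholders; MaterializedMySQL works analogously, and depending on your version the engine may still sit behind an experimental setting):

create database pg_replica
engine = MaterializedPostgreSQL('postgres-host:5432', 'mydb', 'replication_user', 'secret');

-- the tables of mydb show up here and keep receiving changes
select * from pg_replica.orders limit 10;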

Very nice.

New table engines

Again, for many years ClickHouse has had the ability to define a table that gets its data from outside sources like MongoDB, S3, HDFS, Hive, ODBC, Kafka, and RabbitMQ.

The following integrations are new (a small SQLite sketch follows the list):

  • NATS
  • DeltaLake
  • Redis
  • SQLite
  • And some Apache stuff like Iceberg and Hudi
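
As an example, attaching a table from an existing SQLite file looks roughly like this (file path, table, and columns are placeholders):

create table sqlite_users (
  id Int64,
  name String
)
engine = SQLite('/data/app.db', 'users');

-- reads go straight to the SQLite file
select count() from sqlite_users;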

One of the most interesting table engines is EmbeddedRocksDB, which basically turns ClickHouse into a high-performance key-value store, with full support for fast updates and deletes. For example, if you have a history table like this

create table MyTable (
  historydatetime DateTime,
  objid Int64,
  some_property String,
  some_other_property Float64
)
engine = MergeTree
order by (objid, historydatetime)
partition by toYYYYMM(historydatetime);

then, to retrieve the latest values per object, you had to write something like

select objid,
  argMax(some_property, historydatetime) as last_property,
  argMax(some_other_property, historydatetime) as last_other_property
from MyTable
group by objid

Now, you can create a key-value store for this:

create table MyTableLast (
  objid Int64,
  some_property String,
  some_other_property Float64
) 
engine = EmbeddedRocksDB()
primary key objid

and set up a materialized view that copies all incoming data changes into it:

create materialized view MyTableLastCopy to MyTableLast
as select objid, some_property, some_other_property from MyTable

The EmbeddedRocksDB table engine implements inserts as updates if the primary key already exists, so to get the latest value, you just need

select objid, some_property, some_other_property
from MyTableLast

You can also use these tables in joins.
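
For example (a sketch; as far as I understand, with join_algorithm set to 'direct' ClickHouse can even look the keys up in RocksDB directly instead of building a hash table):

select h.objid, h.historydatetime, l.some_property
from MyTable as h
join MyTableLast as l on h.objid = l.objid;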

UDFs

You can implement a user-defined function (UDF) in any programming language, deploy it to the ClickHouse server, and call it from SQL.
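
These executable UDFs are registered via a server-side configuration file, which is too long to show here; but for simple cases there are also plain SQL lambda functions, e.g. (the function is my own toy example):

create function magnitude as (x, y) -> sqrt(x * x + y * y);

select magnitude(3, 4); -- returns 5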

Tukey fences

Yes, I had also never heard this term before, even though I am familiar with this simple method of outlier detection in time series. There are also functions for STL decomposition (into trend, seasonality, and residual) as well as for period detection via fast Fourier transform.
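
If I read the documentation correctly, the functions in question are seriesOutliersDetectTukey, seriesDecomposeSTL, and seriesPeriodDetectFFT:

-- one score per element; non-zero values mark outliers (here: the 100)
select seriesOutliersDetectTukey([2, 3, 2, 100, 3, 2, 2]);

-- decompose a series with period 3 into seasonal, trend, and residual parts
select seriesDecomposeSTL([10, 20, 30, 11, 21, 31, 12, 22, 32], 3);

-- detect the dominant period of a series
select seriesPeriodDetectFFT([1, 5, 1, 5, 1, 5, 1, 5]);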

More fancy functions

Here I am just writing down functions whose terms were new to me.

  • A function to work with ULIDs.
  • UniqTheta functions
  • MortonEncode
  • HilbertEncode
  • DetectLanguage
  • And probably the biggest collection of hashes in the world:
    • halfMD5
    • MD4
    • MD5
    • sipHash64
    • sipHash64Keyed
    • sipHash128
    • sipHash128Keyed
    • sipHash128Reference
    • sipHash128ReferenceKeyed
    • cityHash64
    • intHash32
    • intHash64
    • SHA1, SHA224, SHA256, SHA512, SHA512_256
    • BLAKE3
    • URLHash(url[, N])
    • farmFingerprint64
    • farmHash64
    • javaHash
    • javaHashUTF16LE
    • hiveHash
    • metroHash64
    • jumpConsistentHash
    • kostikConsistentHash
    • murmurHash2_32, murmurHash2_64
    • gccMurmurHash
    • kafkaMurmurHash
    • murmurHash3_32, murmurHash3_64
    • murmurHash3_128
    • xxh3
    • xxHash32, xxHash64
    • ngramSimHash
    • ngramSimHashCaseInsensitive
    • ngramSimHashUTF8
    • ngramSimHashCaseInsensitiveUTF8
    • wordShingleSimHash
    • wordShingleSimHashCaseInsensitive
    • wordShingleSimHashUTF8
    • wordShingleSimHashCaseInsensitiveUTF8
    • wyHash64
    • ngramMinHash
    • ngramMinHashCaseInsensitive
    • ngramMinHashUTF8
    • ngramMinHashCaseInsensitiveUTF8
    • ngramMinHashArg
    • ngramMinHashArgCaseInsensitive
    • ngramMinHashArgUTF8
    • ngramMinHashArgCaseInsensitiveUTF8
    • wordShingleMinHash
    • wordShingleMinHashCaseInsensitive
    • wordShingleMinHashUTF8
    • wordShingleMinHashCaseInsensitiveUTF8
    • wordShingleMinHashArg
    • wordShingleMinHashArgCaseInsensitive
    • wordShingleMinHashArgUTF8
    • wordShingleMinHashArgCaseInsensitiveUTF8
    • sqidEncode
    • sqidDecode

chDB

And last but not least, the embedded version of ClickHouse, a direct competitor to DuckDB! If you need the features of ClickHouse but don’t want to deploy and operate a separate server, you can just embed it into your software.

Your software can currently be written in Python, Go, Rust, or Node.js.

How cool is that! I can’t believe I missed the announcement of this project! Now I just need a suitable project to try it out.

Summary

Every now and then I visit clickhouse.com or read their release notes, and every time (every single time!) I find some cool new, useful functionality. ClickHouse is what, about ten years old now? You would think they’d have reached a plateau, become a corporation, release minor updates, and generate cash flow per running core like Microsoft or Oracle. But no: the development speed is not going down, it is increasing with every year.

How do they do that!?
