Singularity 1

Technological progress keeps accelerating, so the gap between the "modern" and the "proven" keeps widening. By now it is already so wide that communication between the two worlds is sometimes no longer possible.

One example of this is the ethics commission initiated by Dobrindt, which convened last week to determine (quote) "what the programmers may and may not do" when they write the algorithms for self-driving cars. The notion held by politicians that self-driving cars actually contain code somewhere along the lines of "if rechts(is ein Fussgänger) then fahre(links);" is so detached from reality that one is initially at a loss for words.

In the best case for the ethicists (which is rather unlikely), the system is split into two parts:
a) recognition of individual objects from a set of colored pixels
b) prediction of the positions and velocities of these objects relative to one's own car, in order to brake or swerve in time.

It could be that the recognized objects are classified ("house", "human", "car") in order to make the predictions in step b more accurate. But this is not strictly necessary, because from the volume of an object alone (assuming an average density) its mass, and therefore its inertia, can be estimated well enough.

Far more likely, however, is that the self-driving cars of the future will use deep learning. With deep learning, the colored pixels are fed as input into a black box, and this black box is trained until it can drive reasonably well. Whether this black box internally develops concepts such as "space and time", "matter", "object" and "human", as intermediate steps between colored pixels and the positions of the steering wheel and gas pedal, is uncertain, random, and changes from version to version.

With deep learning, the subjective time inside the artificial intelligence is compressed to the point that the entire evolution of intelligence on Earth (from cells to humans) takes place within a few days of our time, especially since the artificial intelligence is immortal in its own world. This evolution is repeated in every training run, until a result desired by humans is achieved.

And whether the artificial intelligence takes the same developmental path as we did, with classical logic, the concepts of space and time, the color scale, light and sound, the division into living and non-living things, the phase concept of plasma, gas, liquid and solid, the concept of life and death, and ultimately traffic signs and "dura lex sed lex", or some other path entirely, is left to chance. It is rather to be expected that the intelligence of self-driving cars will be quite different from ours, if only because they see the world through their sensors in a completely different way, and because they will associate their "I" with non-living things.

It will therefore not be possible for "the programmers" to write the algorithm in such a way that the cars favor human life, because the self-driving software intelligence has no algorithms, at least none written by humans. (And, merely noted in passing, the profession of "programmer" has not existed since the 90s at the latest, because "programming" takes up only a very small share of the time in modern software development and cannot be practiced as a full-time job.)

Rather, we will have to train, or rather raise, the self-driving cars to place more value on human life. And that is an entirely different order of magnitude of effort, research, time and cost than Dobrindt can imagine. We may even have to hold a public discussion with the cars (and not merely about them) in order to convince them to place our lives above theirs.

And I would find that absolutely normal and desirable.

Ultimately, this is about acquiring an intelligence cheaply and having it serve us. The last time humanity tried that, it was with slaves from Africa. That approach backfired spectacularly, and it is still backfiring massively, with no end in sight; witness the inhumane state of Africa today.

I hope we will do better next time.

Service Design Failures of Deutsche Telekom

After learning about Service Design Thinking, I wanted to apply what I had learned, and a fitting occasion presented itself: due to an error by Deutsche Telekom, I was left without an Internet connection for 12 days. It is a good opportunity to analyze the problems and flaws in Telekom's service design.

The necessary backstory.

I have two phone lines at two locations: one for myself and one for my parents. I kept both lines under my name and a single customer number, to make paying simpler and to spare my parents the technical details. Since we hardly ever make phone calls, both lines were on an analog telephony tariff. I had my DSL with Telekom, my parents had theirs with 1und1.

Last year, Telekom canceled both tariffs on its own initiative, because it had upgraded its network and supporting the legacy analog technology meant extra effort. Brazen as Telekom is, it decided that its network upgrade (which earns it more revenue through high-speed contracts anyway) should also be paid for by existing customers. In other words, the cheapest equivalent new tariff, just with IP telephony, cost me more than the old one, without any discernible benefit for me.

Back then I swallowed the pill and simply performed a tariff change for both lines via the Kundencenter (i.e. online). I got a Click and Surf tariff for myself, and for my parents I switched from 1und1 to Telekom with Magenta Home S. At first the switch worked: my line was migrated in January of this year. For my parents, we had to wait until 03.07.2015 because of the notice period at 1und1.

The problem case

Sometime in spring 2015, I looked at the order status and noticed that the Magenta Home S order for my parents had been canceled; instead, that tariff was listed as ordered for my own line, and my Click and Surf tariff, active since January, was listed as canceled. Needless to say, I had not requested any of this.

After discovering the wrong booking in the Kundencenter, I called the Telekom hotline. I opened the conversation with the words "I have discovered a probable problem on your side", to which the hotline employee replied: "I don't have a problem. After all, you are calling me." It continued in a similarly snide manner; at one point I was interrupted with "Hello! Hello! I am talking about the other line now". I nevertheless tried to remain polite. In the end, the hotline employee assured me that he had understood the problem, that he would book Magenta Home S for my parents again, undo everything that was wrong, and that I would receive written confirmation. None of this happened.

On 03.07.2015, I could no longer establish an Internet connection at home, and this outage was only fixed 12 days later, on 15.07.2015. I had to call various hotlines daily and visit the Telekom shop twice before the cause was finally identified: on 03.07.2015, the DSL protocol had been switched from Annex B to Annex J. My modem, a Speedport 221, supports VDSL at 50 Mbit/s, but only with Annex B.

Since the date of the outage was exactly the "release day" at 1und1, I suspected the mix-up with Magenta Home S from the start, and therefore turned to the Telekom shop in Fürth. One of its employees claimed the following:
– that it would be possible to quickly reverse the erroneous order from around 08.07.15 on. This was wrong; other Telekom employees later told me that the reversal, if possible at all, would take 2 months
– that my Internet connection should continue to work with Magenta Home S, because my modem supports VDSL. This was wrong because, as mentioned, the modem only supports Annex B
– that the technical hotline would fix the problem by 04.07., or 06.07.15 at the latest.

The technical hotline only got in touch on 07.07.2015 and wanted to arrange an appointment for a house visit. This was unnecessary, because the problem could have been found over the phone, by asking for the model of my modem and comparing it with the line settings. Indeed, another technician did exactly that on 11.07.2015 within a few minutes.

When the house visit was arranged, I was told that the technician would call me an hour in advance, so that I would have time to get home from work. This did not happen. The technician showed up at the apartment unannounced, and we had to communicate through my family members.

The technician could not tell a modem from a router. In my setup, these are separate devices. He tried to plug the telephone cable into the router, which is technically indefensible.

His claim was that either my router was broken (I don't know whether he meant my modem or my router) or my access credentials were wrong. Why the same equipment and settings had worked before 03.07.2015, he could not explain, nor did he want to investigate. He then closed the ticket.

I then contacted the technical hotline again and described the problem from scratch. The employee made no effort to actually investigate it either; she simply asked me what, concretely, she should do for me. Since at that point I suspected that my access credentials had been changed, I asked her to send them to me. She told me that I would then have a working Internet connection within 15 minutes at the latest. Not only was this claim wrong, because, as we now know, the problem was not the credentials but a protocol switch that apparently happened together with, or at the same time as, the erroneous order. She also sent the credentials to my Telekom e-mail address, which I could not access, because generating new credentials had invalidated my old ones. I had to call again and have another employee send the same credentials to a different e-mail address.

Only the last employee I contacted, on 11.07.2015, correctly identified the problem. Even then, he needed a lot of time to explain it to me, because in Telekom-speak, DSL Annex J is called an "IP-capable router". A router is, by definition, a device that routes IP packets; in that sense, every consumer router in the world is "IP-capable". So I first had to explain this to Deutsche Telekom, nearly losing my voice from shouting in the process, because after 12 days of outage I really felt I was being taken for a fool.

Only slowly could the good man explain to me what had happened technically. However, he was not authorized to reverse anything and could only send me a suitable device (a Speedport Entry) on loan. And even that did not happen overnight, but via the normal delivery route through Deutsche Post, which meant another 4 days of downtime.

At the beginning, in all the hectic, I also ordered a tariff change from Magenta S to Magenta M, because I thought it would solve my problem and restore my desired DSL speed. When I placed the order, the sales employee said the change could happen within 2 or 3 days. In fact, it would have been possible on 17.08 at the earliest, i.e. over a month later. Another sales employee said nothing could be done about it, since accounting cannot keep up that fast. When I remarked that I would presumably not pay for Magenta Home S anyway, because I never ordered that tariff, he asserted that since it was in his booking system, I had ordered it, and if I was of a different opinion, I should communicate with Telekom in writing.

During the outage, which lasted 12 days and was in no way my fault, I had to rely on my mobile phone (with T-Mobile). I had to book SpeedOn twice, because my data allowance was quickly exceeded. Not one Telekom employee thought of unlocking a truly unlimited mobile data tariff for me for the duration of the outage. Worse still, none of them knew my mobile number (and they constantly tried to reach me at my parents'), even though I had given my Telekom customer number when ordering the T-Mobile SIM card, and even though technical support kept sending me text messages.

When, thoroughly fed up, I arranged Internet access from Kabel Deutschland, I wanted to return the loaned device and called Telekom again. Only then did a support employee tell me that the minimum term of the loan contract is one year, so I would have to spend 30 euros for nothing. The employee apparently knew nothing of my history, even though by that point I had already revoked, contested, and terminated everything possible without notice (alternatively at the earliest possible date), and another Telekom employee had tried to reach me at my parents', presumably to talk me out of it.

In general, I had to retell the complete history to every single Telekom employee, even though it really ought to be clearly visible in their CRM system. There was also one case where an employee told me he had to call another department and that I should stay on the line, he would be right back; after 2 minutes, a different woman answered, who again knew nothing, and I had to retell everything from the beginning.

In the first days of the 12-day disaster, I also had to spend over 30 minutes in the waiting queue before getting through.

Telekom has at least two kinds of hotlines, one technical and one for sales, and I was sent from one department to the other and back again, because nobody felt responsible for my problem.

In the Telekom shop, they can do even less than the hotline employees; for example, they cannot revoke an order.

To my direct question of whether there is a phone number for complaints, or whether I should communicate through a lawyer by mail right away, the shop employee said he knew of no complaints number.

In the end, I terminated all my contracts with Telekom. Because of the two-year terms, they still run until 2017, although since this summer I no longer have any telephone wall sockets and use nothing from Telekom. This will cost me over 600 euros. I was a Telekom customer for over 14 years, and I will never be one again.

Analysis and proposed solutions

The actual problem was the mix-up of the two lines, caused either by a human or by a software error. Good service design cannot prevent such errors. It can, however, make them less damaging and easier to fix, for example through the following measures:

1. Create simplicity. Instead of several software systems (a Kundencenter and a separate order management system, as is apparently the case at Telekom), there should be a single system, so that orders do not have to be transferred between systems, i.e. transfer errors cannot happen. This also means that support employees should use the same online Kundencenter as the customers (perhaps with somewhat different access rights). Which in turn means that the online Kundencenter must be substantially improved (usability, speed, features).

2. Transparency, personal responsibility and accountability. Every booking should record who initiated it. If it was a support employee, their full name should be stored there. There must be a way to reach precisely this employee, via an online request form, by e-mail, or by phone, to ask why a particular booking was made.

Even so, the customer might still have to get in touch with a support employee. But what if that person is in a bad mood, underpaid or incompetent?

3. Eliminate the human factor. All contract operations (such as revocation, cancellation, reversal, termination and contestation, as well as the ticket system of the technical support) must be possible online in the Kundencenter. If the company wishes to fight for the customer, the win-back team can still contact them after the cancellation and try to change their mind.

Furthermore, the split of Telekom's hotlines into technology and sales, and a CRM system that either does not work or is unusable (usability!), meant that an enormous amount of time was lost and a great deal of unnecessary effort was expended: I had to talk to 10 to 20 different support employees, and none of them could know enough context about the problem.

4. Should the customer prefer communicating with humans (because, for example, they belong to Generation X or older and are not that comfortable online), this communication must take place with exactly one support employee. This advisor, known to the customer by first and last name, directly reachable by e-mail or phone, and permanently assigned, should know the complete history of the customer and the problem. They then act autonomously on the customer's behalf, work their way through all the relevant departments of Telekom, and solve the problem.

If even that does not help, the customer must be given the opportunity to resolve the problem either independently (given sufficient expertise) or through a qualified third party. For this, it is necessary that

5. transparency also exists at the (technical) level of detail. Every user manual that ships with a device contains a chapter called "Technical Data". It holds important information about what is happening technically (and not merely in sales-speak). Telekom should be just as transparent here. The customer must be able to set new DSL credentials directly in the Kundencenter. They must be able to read the sync status and further technical status information from "their" DSLAM. Authoritative information about the SIP proxy, the STUN server and the SIP ports in use, as well as the status information, must be available directly in the Kundencenter (currently one has to piece it together from various forums).

Apart from that, I see one more lesson learned. Both at T-Online and at T-Mobile, none of the currently orderable tariffs are as cheap as my existing legacy tariffs. A tariff change at Telekom is currently not worthwhile if you are satisfied with the status quo. Worse still: the competition offers the same service for 30% less money. In precisely this situation, Telekom could only have kept its customers through legendarily outstanding service. With the disastrous service they currently offer, they would instead have to drop their prices by at least 60-70% (e.g. offer DSL for 9 to 12 euros per month instead of 30).

And that is the lesson learned: with above-average prices, the service must be above average too. I assume the converse holds as well: above-average customer service can open up the possibility of charging higher prices.

An experience of unsupervised learning

In my previous post I’ve explained why I think you should learn machine learning and promised to share my experiences with its unsupervised part.

Unsupervised machine learning has a mystical attraction. You don't even have to label the examples; you just feed them to the algorithm, it learns from them, and boom: it automatically separates them into classes (clustering).

When I was studying electrical engineering, we learned about so-called optimal filters: electrical circuits that can extract a useful signal from noise, even when the noise is 100 times stronger than the signal, so that the human eye cannot see the signal at all. This was like magic, and I expected similar magic here: I would pass the examples to the clustering algorithm, and it would explore the hidden relationships in the data and give me some new, unexpected and useful insights…

Today, having tried it, I still believe that some other algorithms (deep learning, maybe?) are able to produce such magic (because, well, you should never stop believing in magic), but my first impression was somewhat disappointing.

The very first thing the clustering algorithm wanted to know from me was how many clusters it should look for. Pardon me? I expected that you would find the clusters and tell me how many there are in my data! If I have to pass the number of clusters beforehand, it means I have to analyze the data to discover its inherent clustering structure, which means I have to do all the work I expected the algorithm to perform magically.

Well, it seems that the current state-of-the-art clustering algorithms indeed cannot find clusters autonomously and fully unsupervised, without any hint or input from the user. Some of them don't require the number of clusters, but need some equivalent parameter instead, like the minimum number of examples needed in a neighborhood to establish a new cluster. But on the other hand, this still allows for some useful applications.

One possible use case could be automatic clustering per se: if your common sense and domain knowledge tell you that the data has exactly N clusters, you can just run the data through the clustering algorithm, and there is a good chance it will find exactly the clusters you expected: no need to manually define all the rules and conditions separating the clusters. Besides, it will give you the centroids or medoids of each cluster, so that if new, unclustered objects arrive daily, you can easily assign them to the existing clusters by calculating the distances to all centroids and taking the cluster with the shortest distance.
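
The nearest-centroid assignment described above can be sketched in a few lines of NumPy (the 2-D centroids and the new object here are invented for illustration):

```python
import numpy as np

# Hypothetical centroids produced by an earlier clustering run (2-D features).
centroids = np.array([[0.0, 0.0],
                      [5.0, 5.0],
                      [0.0, 5.0]])

def assign_to_cluster(point, centroids):
    """Return the index of the centroid nearest to the point (Euclidean)."""
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(distances))

new_object = np.array([4.2, 4.9])
print(assign_to_cluster(new_object, centroids))  # nearest to (5, 5), i.e. 1
```

The same idea scales to any feature dimension, as long as the new object is scaled the same way as the data used for clustering.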

Another use case arises if you don't really care about the contents of the clusters and they aren't going to be examined by humans, but you use clustering as a kind of lossy compression of the data space. A typical example would be some classical recommendation engine architectures, where you replace millions of records with a much smaller number of clusters, accepting some loss of recommendation quality just to make the computation feasible on the available hardware. In this case, you'd simply consider how many clusters, at most, your hardware can handle.

Yet another approach, and the one I took, is to ask yourself: how many clusters are too few, and how many are too many? I was clustering people and wanted to present the clusters to my colleagues and myself, so we could base decisions on them. Therefore, due to well-known human constraints, I was looking for at most 7 to 8 clusters. I also didn't want fewer than 5 clusters, because intuitively, anything less would in my case be underfitting. So I played with the parameters until I got a reasonable number of clusters, with contents that were reasonable (and understandable for humans).

Speaking of which, it took me a considerable amount of time to evaluate the clustering results. Just as with any machine learning, it is hard to understand the logic of the algorithm. Here, you just get clusters numbered from 0 to 7, and each person is assigned to exactly one cluster. Now it is up to you to make sense of the clusters and to understand what kind of people were grouped together. To facilitate this process, I wrote a couple of small functions returning the medoid of each cluster (i.e. the single cluster member nearest to the geometric center of the cluster; in other words, its most average member), as well as the average values of all features in the cluster. For some reason, most existing clustering implementations (I'm using scikit-learn) don't bother to compute and return this information as a free service, which, again, speaks for the academic rather than industrial quality of modern machine learning frameworks.
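
The post doesn't show those helper functions, but they could look roughly like this (function names and the toy data are my own invention):

```python
import numpy as np

def cluster_medoid(X, labels, cluster_id):
    """Return the cluster member nearest to the cluster's geometric center."""
    members = X[labels == cluster_id]
    center = members.mean(axis=0)
    distances = np.linalg.norm(members - center, axis=1)
    return members[np.argmin(distances)]

def cluster_feature_means(X, labels, cluster_id):
    """Return the average value of every feature within the cluster."""
    return X[labels == cluster_id].mean(axis=0)

# Tiny invented example: two clusters on a line.
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.4, 0.0],
              [1.0, 1.0], [1.2, 1.0], [1.4, 1.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(cluster_medoid(X, labels, 0))         # [0.2 0. ]
print(cluster_feature_means(X, labels, 1))  # [1.2 1. ]
```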

By the way, another thing not provided for free was pre-scaling. In my first attempts, I just collected my features, converted them to real numbers, put them into a matrix and fed this matrix to the clustering algorithm. I didn't receive any warnings, just fully unusable results (like several hundred clusters). Luckily, my previous experience with supervised learning had taught me that fully unusable results usually mean some problem with the input data, and I got the idea to scale all features into the range of 0 to 1, just as with supervised learning. This fixed that particular problem, but I'm still wondering: if clustering algorithms usually cannot work meaningfully on unscaled data, why don't they scale the data for me as a free service? In industrial-grade software, I would rather opt out of pre-scaling via some configuration parameter, for the rare special case where I want it off, than have to implement scaling myself, which is the most common case anyway. If this is some kind of performance optimization, I'd say it is a very, very premature one.
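
Scaling every feature into the 0-to-1 range is a one-liner with scikit-learn's MinMaxScaler; a minimal sketch (the raw feature values, e.g. age and yearly income, are invented):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented raw features on wildly different scales: [age, yearly income].
X_raw = np.array([[25, 30000.0],
                  [40, 90000.0],
                  [55, 60000.0]])

# Rescale each feature column into [0, 1] before feeding it to clustering.
X_scaled = MinMaxScaler().fit_transform(X_raw)
print(X_scaled)
```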

But I digress. Another extremely useful tool for evaluating clustering quality was the silhouette metric (and the class implementing it in scikit-learn). This metric is a number from -1 to 1 showing how homogeneous a cluster is. If a cluster has a silhouette of 0.9, its members are very similar to each other and dissimilar to the members of the other clusters.
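
In scikit-learn, the metric is available as `silhouette_score` (the average over all examples) and `silhouette_samples` (one value per example); a small sketch with two invented, well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Two tight, well-separated toy blobs.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Both values are close to 1 here, because the clusters are very homogeneous.
print(silhouette_score(X, labels))
print(silhouette_samples(X, labels).min())
```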

Last but not least, some clustering algorithms (DBSCAN among them) create clusters for many, but not all, examples. Some of the examples remain unclustered and are considered outliers. Usually, you want the algorithm to cluster the examples in such a way that there are not too many outliers.

So I’ve assumed the following simple criteria:

  • 5 to 8 clusters
  • Minimal silhouette of 0.3
  • Average silhouette of 0.6
  • Less than 10% of all examples are outliers

and just implemented a trivial grid search across the parameters of the clustering algorithm (eps and min_samples of DBSCAN, as well as different scaling weights for the features), until I found a clustering result that met all of my requirements.
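
Such a grid search can be sketched as follows; the criteria are the ones listed above, while the parameter grids, helper names and demo data are invented for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_samples

def acceptable(X, eps, min_samples):
    """Check one DBSCAN parameter combination against the criteria above."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clustered = labels != -1                 # DBSCAN marks outliers with -1
    n_clusters = len(set(labels[clustered]))
    if not (5 <= n_clusters <= 8):           # 5 to 8 clusters
        return False
    if clustered.mean() < 0.9:               # less than 10% outliers
        return False
    sil = silhouette_samples(X[clustered], labels[clustered])
    return sil.min() >= 0.3 and sil.mean() >= 0.6

def grid_search(X, eps_grid, min_samples_grid):
    """Trivial grid search: return the first parameter pair that qualifies."""
    for eps in eps_grid:
        for min_samples in min_samples_grid:
            if acceptable(X, eps, min_samples):
                return eps, min_samples
    return None

# Demo with made-up data: five tight blobs, far apart from each other.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]])
X_demo = np.vstack([c + rng.normal(scale=0.1, size=(10, 2)) for c in centers])
print(grid_search(X_demo, eps_grid=[0.5, 1.0], min_samples_grid=[3, 5]))
```

In my real search, the feature scaling weights were part of the grid as well; that simply adds one more loop around the two shown here.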

To my astonishment, the results corresponded very well to my prior intuitive expectations based on domain knowledge, and they also gave me a useful quantitative measure of that previously intuitive understanding.

All in all, unsupervised learning can be used to gain some benefit from the data, if you don't expect too much from it. I think that to gain more business value, we have to take the next step and start a project involving deep learning. In the USA and China, it seems that virtually everyone is doing deep learning (I wonder whether Bitcoin farms can be easily repurposed for that), but in Germany it is rather hard to find anyone publicly admitting to it. Although the self-driving cars of German manufacturers, which already exist as prototypes, would be impossible without some deep learning…

Why Should You Learn Machine Learning

At the end of the 80ies and in the early 90ies, fourth-generation programming languages and genetic algorithms were popular topics in the mass media. We read in magazines that software developers would become obsolete, because users would create their programs themselves using 4GL, or AI systems would soon be created that could extend themselves. By that time, I had learned my first programming languages and was about to choose my subject at university, and therefore had doubts about the job prospects in software development.

Fortunately (or not), Steve Jobs and Bill Gates popularized graphical user interfaces around that time, so this first AI wave calmed down (or returned to its academic roots), because software development became less about finding an answer to a question and more about displaying windows, buttons, menus and textboxes. Computer games' focus shifted from "what exactly you are doing" to "how cool it looks". The Internet changed from a source of scientific or personal information into an ingenious marketing tool and became a thing of pictures, graphic design and neuromarketing.

But if you are a software developer and have not yet realized that you need to teach yourself machine learning, you should be concerned about your job. Because machine learning is coming, and it is the next logical step in giving up full control over your software.

First, we lost control over the exact machine instructions in our programs and gave it up to compilers. Next, we lost control over memory management and gave it up to the garbage collector. Then we partially lost control over the order of execution and gave it up to event loops, multithreading, lambda expressions and other tools. With machine learning, we will lose control over the business logic.

Classic computer programming trained us for the situation where the desired business logic is exactly known and specified beforehand. Our task was to implement the specification as exactly as possible. And in the first decades of software development practice, there were enough useful problems that could be specified with more or less acceptable effort. Remember, the first computers were used for ballistic calculations. With all the formulae already derived by scientists, the programming task at hand had a perfect specification.

Now we want to go into areas where creating a specification is impossible, or too expensive, or simply not the optimal course of action.

Let's take fraud detection as an example. Say we have data about the payment transactions of some payment system and want to detect criminal activity.

A possible non-machine-learning approach would be to establish a set of fraud detection rules based on common sense. For example, some limit on the transfer amount, above which a transaction becomes suspicious. Likewise, transactions from different geographical locations within a short period of time are suspicious, etc.
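
A hand-written rule set of this kind might look like the following toy sketch; the thresholds, field names and the helper function are all invented:

```python
from datetime import timedelta

SUM_LIMIT = 10000.0              # invented threshold for the transfer amount
GEO_WINDOW = timedelta(hours=2)  # invented window for the geo rule

def is_suspicious(tx, previous_tx):
    """Common-sense rules: large amounts, or far-apart places in short time."""
    # Rule 1: amount above the limit is suspicious.
    if tx["amount"] > SUM_LIMIT:
        return True
    # Rule 2: different countries within a short time window is suspicious.
    if previous_tx is not None:
        same_place = tx["country"] == previous_tx["country"]
        soon_after = tx["time"] - previous_tx["time"] < GEO_WINDOW
        if not same_place and soon_after:
            return True
    return False
```

Even this tiny example already hints at the problem discussed below: every new rule multiplies the number of interactions a developer has to keep in mind.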

One obvious limitation of this approach is that the alarm thresholds are based on common sense, so the objective quality of the fraud detection depends heavily on how well the subjective common sense of its developers reflects reality.

Another obvious limitation of the common-sense approach is that such a rule system cannot be infinitely complex. Humans can comprehend only a limited number of rules at once; they usually stop after defining 5 or 7 rules, see a system with 20 rules as "very complex", and a system with 100 rules as "we need a whole new department to make sense of what is really going on here". For comparison, Square, Inc. is using a machine learning algorithm for fraud detection based on (my conservative guess) over 3000 rules (not to mention that they can re-tune these rules automatically every day or more often).

It is even harder for humans to comprehend the possible interplay between the rules. A typical geo-based rule should usually fire for distance D and time period T, but not in the public holiday season (as many people travel at that time); yet even in that season it must still fire if the amount is above M when the recipient is a registered merchant, or above P when the recipient is a private person; but it must not fire if the account holder had already made a similar transfer one year before and that transfer was not marked as fraud; but it must still fire if any automatic currency conversion is taking place… At some point, a classic software developer will throw up her hands and declare herself out of the game. Usually, she will then create a generic business rule engine and assert that the business guys will have to configure the system with all their chaotic business rules. Which doesn’t solve the problem, just shifts it from one department to another.

Now, remember the Shannon–Hartley theorem? Me neither, but the main thing about it was that there is a difference between information — the useful signal that is valued by the receiver — and mere data, the stream of zeros and ones in some particular format. Fraud detection can be seen as an information extraction problem. Somewhere in the transaction data, information signaling criminal activity is hidden from our eyes. We as humans have practical limits on extracting this information. Machine learning, done correctly, is a way to extract and evaluate more information from data.

Classifiers in machine learning are algorithms that, based on a set of features (or attributes) of some event or object, try to predict its class, for example “benign payment” or “fraud”.

No matter what algorithm is used, the procedure is roughly the same. First, we prepare a training set of several (often at least 1000; the more, the better) labeled events or objects, called examples. “Labeled” means that for each of these examples, we already know the right answer. Then we feed the examples to the classifier algorithm, and it trains itself. The particulars depend on the exact algorithm, but what all algorithms basically try to do is to recognize how exactly the features are related to the class, and to construct a mathematical model that can map any combination of input features to a class. Often the algorithms are not terribly complicated to understand: for example, they might count how often a feature appears in one class and then in the other; or they might start with a more or less random limit for a rule and then move it, each time counting the number of right predictions (the accuracy) and changing direction when the accuracy gets worse. Unfortunately, not a single algorithm author seems to care about the learning curve of their users, so most algorithm descriptions include some hardcore-looking math, even when it is not strictly necessary.

Finally, the trained classifier is ready. You can now pass unlabeled examples to it, and it will predict their classes. Some classifiers are nice and can even tell you how sure they are (say, a 30% chance it is a benign payment and a 70% chance it is fraud), so you can implement different warning levels depending on their confidence.
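The threshold-moving idea can be sketched in a few lines of plain Python. This is a toy single-feature classifier for illustration, not any particular library’s algorithm:

```python
# Toy classifier: learn a single threshold on one numeric feature
# ("amount") by trying candidate limits and keeping the one with the
# best training accuracy. An illustration, not a production algorithm.

def train_threshold(examples):
    """examples: list of (amount, label) with label 'benign' or 'fraud'.
    Returns the threshold with the highest training accuracy."""
    candidates = sorted(a for a, _ in examples)
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum(
            1 for amount, label in examples
            if (label == "fraud") == (amount > t)   # predict fraud above t
        )
        acc = correct / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict(threshold, amount):
    return "fraud" if amount > threshold else "benign"

training_set = [(10, "benign"), (50, "benign"), (90, "benign"),
                (800, "fraud"), (1200, "fraud"), (5000, "fraud")]
t = train_threshold(training_set)
print(t, predict(t, 30), predict(t, 2000))
```

Real algorithms do the same thing with hundreds of features and smarter search, which is exactly why their internals stop being human-readable.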

A huge disadvantage of machine learning (and welcome to the rant part of this post): only some classifiers can be logically understood by a human being. Often they are just probability distributions, or hundreds of decision trees. While it is theoretically possible, for a given input example, to work through the formulas with pen and paper and get the same result the classifier did, it would take a lot of time and won’t necessarily give you any deep understanding of its logic. Practically speaking, classifiers cannot be explained. This means that sometimes you pass the classifier an example where you as a human can clearly see it is fraud, you get the class “benign” back, and you go: “What the hell? Is this not obviously a fraud case? And now what? How can I fix it?”

I suppose one could try to train a second classifier, giving the wrongly predicted examples more weight in its training set, and then combine the results of both classifiers using some ensemble method, but I haven’t tried it yet. I haven’t found any solution to this problem in books or training courses. Currently, most of the time you have to accept that the world is imperfect and move on.

And generally, machine learning is still in a very half-baked state, at least in Python and R.

Another typical problem of contemporary machine learning: when you train classifiers and provide them with too many features, or features in the wrong format, the classifying algorithms can easily become fragile. Typically, they don’t even try to communicate that they are overwhelmed, because they can’t even detect it. Most of them still have academic software quality, so they lack the precondition checking, strong typing, proper error handling and reporting, proper logging and other things we are accustomed to in production-grade software. That’s why most machine learning experts agree that currently, most of the time is spent on a process they call feature engineering, and I call “random tinkering with the features until the black box of the classifying algorithm suddenly starts producing usable results.”

But well, with luck or, more likely, after investing a lot of time in feature engineering, you will get a well-trained algorithm capable of accurately classifying most of the examples from its training set. You calculate its accuracy and are very impressed by some high number, like 98% right predictions.

Then you deploy it to production, and are bummed by something like 60% accuracy under real conditions.

This is called overfitting, and it is a birthmark of many contemporary algorithms: they tend to believe that the (obviously limited) training set contains all possible combinations of values, and they underestimate combinations not present in the set. Statisticians have developed a procedure to counter this, called cross-validation. It increases the training time of your algorithm by a factor of 5 to 20, but in return gives you a more accurate accuracy estimate. In the example above, your algorithm would score something like 64% accuracy under cross-validation, so at least you are not unpleasantly surprised when running it in production.
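The core of cross-validation fits in a dozen lines of plain Python. Here is a minimal k-fold sketch, where `train` and `evaluate` stand in for any classifier’s fit and score functions (the toy model below just predicts the majority label):

```python
# Minimal k-fold cross-validation sketch in plain Python.

def k_fold_accuracy(examples, k, train, evaluate):
    """Split examples into k folds; train on k-1 folds, test on the
    held-out fold; return the mean held-out accuracy."""
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = train(train_folds)
        scores.append(evaluate(model, test_fold))
    return sum(scores) / k

# Toy "classifier": always predict the majority label of the training folds.
def train(examples):
    labels = [label for _, label in examples]
    return max(set(labels), key=labels.count)

def evaluate(model, examples):
    return sum(1 for _, label in examples if label == model) / len(examples)

data = [(x, "benign") for x in range(8)] + [(x, "fraud") for x in range(2)]
print(round(k_fold_accuracy(data, 5, train, evaluate), 2))
```

The k-fold factor is exactly where the 5x to 20x increase in training time comes from: the model is trained k times instead of once.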

Modern, improved algorithms such as random forests have built-in protection against overfitting, so I think this whole problem is a transient issue of a quickly developing technology, and we will forget about it in a year or so.

I also have the feeling that the authors of machine learning frameworks consider themselves done as soon as a trained classifier is created and evaluated. Preparing and using it in production is not considered a worthy task. As a result, my first rollout of a classifier produced predictions that were worse than random guessing. After weeks of lost time, the problem was found. To train the classifier, I had written an SQL query and stored my training set in a CSV file. This is obviously not acceptable for production, so I reimplemented the code in Python. Unfortunately, I reimplemented it in a subtly different way: one of the features was encoded in a different format than the one used during the training phase. The classifier did not produce any warnings and simply “predicted” garbage.
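This class of mistake can be avoided by having exactly one encoding function that both the training pipeline and the production code call. A minimal sketch, with invented feature names:

```python
# One shared encoding function used both at training time and in
# production. The feature names and country table are invented for
# illustration; the point is that there is exactly one place where a
# raw transaction becomes a numeric feature vector.

COUNTRY_CODES = {"DE": 0, "FR": 1, "US": 2}

def encode_features(tx):
    """Turn a raw transaction dict into the numeric feature vector
    the classifier was trained on."""
    return [
        float(tx["amount"]),
        COUNTRY_CODES.get(tx["country"], -1),   # unknown country -> -1
        1.0 if tx["is_merchant"] else 0.0,
    ]

# Both the SQL-export script and the production service call
# encode_features(), so a transaction is always encoded the same way:
tx = {"amount": "42.50", "country": "FR", "is_merchant": True}
print(encode_features(tx))
```

Had the training export and the production service shared such a function, the format drift between my CSV export and the Python reimplementation could not have happened.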

Another problem is that most algorithms cannot be trained incrementally. If you have 300 features, have spent weeks training your algorithm, and now want to add a 301st feature, you will have to re-train the classifier using all 301 features, even though the relationships between the first 300 features haven’t changed.

I think there are more rants about machine learning frameworks to come. But at the same time, things in this area change astonishingly rapidly. I don’t even have time to try out the new shiny interesting thing announced every week. It’s like riding a bicycle on an autobahn. Some very big players have been secretly working in this area for 8 years and more, and now that they are coming out, you realize a) how much more advanced they are compared to you, b) that all internet business will soon be divided into those who could implement and monetize big data and those who were left behind, and c) that machine learning will be implemented as built-in statements in mainstream languages within the next five years.

Summarizing, even the contemporary state of the art in machine learning has advantages too significant to ignore:

- the ability to extract more information from data than human-specified business logic can;
- as a pleasant consequence, any pre-existing data (initially collected for other purposes) can be repurposed and reused, meaning more business value extracted per bit;
- another pleasant consequence is the ability to handle data with a low signal-to-noise ratio (like user behavior data);
- and finally, if the legacy business logic didn’t have quality metrics, they will be introduced, because any kind of supervised machine learning involves measuring and knowing the quality metrics of the predictions (accuracy, precision, recall, F-scores).
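These quality metrics boil down to a few lines of arithmetic over the confusion-matrix counts. A minimal sketch:

```python
# Accuracy, precision, recall and F1 from raw confusion-matrix counts
# (tp/fp/fn/tn = true/false positives/negatives for the "fraud" class).

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)      # of all fraud alarms, how many were right
    recall = tp / (tp + fn)         # of all real frauds, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# 90 frauds caught, 10 false alarms, 30 frauds missed, 870 correct benigns:
print([round(m, 2) for m in metrics(90, 10, 30, 870)])
```

Note how the 96% accuracy hides a recall of only 75%; this is why accuracy alone is a poor metric for rare-event problems like fraud.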

In this post, I’ve only described supervised machine learning. There is also a big area called unsupervised machine learning. In December last year, on the last day before my vacation, I finished my first experiment with it, and it will be the topic of my next post.

And Big Data is so much more than just machine learning. It also includes architecting and deploying a heterogeneous database landscape, implementing high-performance processing of online and offline data, implementing recommendation engines, computational linguistics and text processing of all kinds, as well as analytics over huge amounts of poorly structured and ever-growing data.

If you are interested in working in our big data team, contact me and I will see what I can do (no promises!).

Beginning software architecture (for Yun)

Every programmer starts her career with something small. Implement a small function. Then implement a couple of functions talking to each other. Then implement a module, with dozens of functions, and maybe error handling and an API.

But sooner or later, we all want to move on and step up to a higher abstraction level. We want to oversee the whole software system. We want to learn how to design it — how to do software architecture. But because it is the first time we step up an abstraction level, it is often very hard to do. Where do I start? When am I finished? How do I know I’ve created the right architecture?

Teachers and universities often don’t help; instead, they make things even worse by overloading us with huge amounts of information and detailed requirements about the architecture.

Meanwhile, there is only one thing about software architecture that is really important.

Architecting software is like caring for your child.

You want your child to be safe and healthy, to be loved, and to have a long and happy life.

Safety. Your software might crash at run time or destroy valuable data. If it depends on its environment (other software or hardware) to run, teach your software how to recover when its environment fails. Teach your software how to protect itself against input from hackers and unprofessional users. Teach your software to change or produce data only when it is fully sure it is working correctly. Teach your software how to sacrifice one of its parts to protect the whole, and teach it to run without one of its parts.

Health. Obesity is the most important problem for software. Always try to implement the same functionality with less code. Do not implement functionality that nobody needs, but do prepare the software for the challenges it will definitely face in the future: plan for extensibility. Use refactoring to avoid code areas that nobody is able to understand and change, because these are the dead areas of the software body, limiting its flexibility.

Software is often created in teams. You want the other team members to love and care about the software as you do. Make sure that everyone writes code that can be read by anyone — enforce a uniform programming style if needed. Ensure that it is safe for team members to use each other’s code: no unexpected results, proper error handling, consistent conventions. Avoid code ownership, because you want to get a lovely software system, and not just a set of poorly interconnected moving parts.

For software to have a happy life, it must be loved and used by users. Ensure you not only understand the software requirements, but also why the users have these requirements. Work with the users to define even better requirements, ones that will make your software faster, slimmer or more robust. Come up with ideas for how to make your software even more lovable: successful software will get more loving and motivated hands to work on it, while unsuccessful software will be abandoned and die.

It is not easy to care for a child, nor is it easy to create a good software architecture. There are no rules equally suitable for all children; every time, you will have to find the proper answer, maybe by trial and error. But the results of a job done right might make you equally proud and your life fulfilled.

Being a happy bricklayer

“What are you doing?“
“I’m laying bricks,” said the first bricklayer.
“Feeding my family,” said the second bricklayer.
“I’m building a cathedral,” said the third bricklayer.

When I learned this story in primary school, I was shocked to see how shitty the lives of the first two bricklayers were. The first one didn’t even have any intrinsic motivation to do his job, so he was probably a slave, a prisoner or some other kind of forced workforce. And the financial situation of the second one was apparently so critical that he was forced to take a job — any job he could find — to feed his family, even though he wasn’t really interested in laying bricks, or perhaps even in construction work altogether.

I’m very happy to say that I was building a cathedral at every job I’ve taken so far. And frankly speaking, I don’t even see the point of doing it differently. A job takes 8 hours a day. And for a hobby we can find, perhaps, one hour per day on average? So by making your job your hobby, and your hobby your job, you increase the happy time of your life by 700%.

Another shocking aspect of that story is the missing loyalty of the first two workers. By my upbringing and education, I’m normally very loyal to my employer, at least as long as they are loyal to me. When my employer decides to hire me, they have some purpose in mind. It is a question of my loyalty, and of my integrity, to deliver on it. But the first two workers seemed to be absolutely ignorant of their purpose in their organization!

That’s why I don’t really know what to say every time I hear someone declaring that his or her purpose in the company is not related to money. I mean, come on, private companies have exactly one primary goal, one reason to exist: to earn money. Yes, they might have some cool vision, like not being evil or having a laser-sharp focus on perfect products, but these goals are all secondary. They are quickly forgotten when the primary goal is in danger. No company can survive for long unless it follows the primary goal.

Therefore, I really do think that the purpose of each and every employee should be to see how s/he can help the company earn or save money. If s/he is not okay with that, well, wouldn’t s/he be much happier working in a government agency, a non-governmental, scientific, military, or welfare organization?.. Just asking…

Great UX

HUK24.de has fascinating (and in places courageous) UX. Try it out yourself! What I liked:

1) They sell car insurance in exactly the way I want to buy it. There are no landing pages with happy people explaining the benefits to me. There are no testimonials. There are no oversized “Buy now” CTAs. Instead, they understand that when I come to huk24.de for the first time, I am still comparing which insurance company to choose, and so they give me exactly what I want: a quick, non-binding, uncomplicated way to calculate how much I would have to pay in my particular case.

2) But that’s not all. At the end of the calculation there is, of course, a “Sign up now” CTA. But if I leave the tab at this point, take a couple of days to check out the alternatives, and then come back, the site doesn’t throw a “Session expired” at me — it still knows everything I entered back then and is still ready to close the contract immediately! That alone is golden.

3) When I then reach the point in the ordering process where credentials are assigned, they only ask for a secret password. The user ID is generated automatically and displayed to me, so I can save my complete credentials in my KeePass. And if I’m not mistaken, the e-mail address is only requested later, at the point where I myself have an interest in providing it (e.g. so that I can receive my insurance code).

4) During the order, it is possible to enter a WerberID if another HUK24 customer has recommended them to me. But there is also a note saying that I can enter the WerberID later (even after the contract is signed) if I don’t have it at hand.

5) If I leave the site while logged in and later simply type www.huk24.de, I don’t get the start page, but a note that I was automatically logged out and can log in again. I can still continue browsing without logging in, but there is a soft nudge to log in. This way, HUK24 can understand me better and offer me personalized features.

6) After logging in, I get to the “Meine HUK24” area, where exactly the 6 most important functions I could ever need are displayed in the middle.

7) And many smaller UX touches that I find great — for example, buttons are never disabled; instead, clicking them brings up an overlay explaining what still needs to be done, and so on.

We’ll see. If their actual product (the insurance) works as well as the website, I have made the right decision.

Enterprise Innovation

Well, my Enterprise Seasons model was too simple. Actually, after creating their first successful “flywheel” product, some corporations proceed to create a second, third and further successful products, always remaining an innovative enterprise, at least in some of their parts. There are a lot of advantages to this:

- Risk diversification. If one product fails for whatever reason, the other products will keep the company afloat.

- The law of diminishing returns can be worked around. Instead of investing more and more creative power into smaller and smaller uplifts, one can enjoy a much higher ROI with a fresh new product.

- Linear scalability. Growing the company by growing the production and sales of a single product involves a lot of work with people, processes, and inevitable bureaucracy. Growing the company by creating a new product can be as simple as copying its existing structure.

- Several revenue sources allow for aggressive market policies: the company might let one of its products be intentionally unprofitable in order to gain market share.

- Finally, there might be synergy between different products; for example, ideas or methods from one product can be applied to another, or selling a combination of products might be easier.

Therefore, it is all the more interesting to me to understand why so many enterprises have problems with innovation. Why don’t some enterprises keep creating more and more products? So far, I’ve seen the following scenarios:

- Cultural incompatibility. Discovering a new product is anything but safe: 90% of new products fail. The traditional 19th-century world view of a safe lifetime workplace and a state welfare system eliminates the necessity of innovation. “We will work in the same safe market niche, and hopefully it will last until we leave the job and draw our pension; and if not, the welfare system will help us stay afloat and find a new job.”

- Ethical reasons. Growing a company can be seen as a consumerist, anti-ecological activity. In this case, the company not only doesn’t create new products; even continuous development of its primary product is almost non-existent — it is in maintenance mode.

- While the company invests most of its resources into establishing and developing its secondary product, its primary product is hit and almost destroyed by a sudden market shift; its development is frozen, and everyone keeps working to make the secondary product the new primary.

- Even though the primary product is running well, most of its revenue is paid out to foreign shareholders pursuing a short-term strategy. Innovation is barely possible, because there are not enough people and not enough money for it.

If some innovation nevertheless tries to happen, there are often cultural difficulties:

- The Sun-and-stars fallacy. The Sun is so much brighter than the stars that we don’t see stars during the day. The scale of the primary product is much larger than that of a new product; it always has more visitors, page views, registrations, orders, revenue and operational spending. “What? Your new product only generates X orders per month? What a fail, our primary product generates YYYYY orders! Let’s spend more on the primary product!” The trick is, if you don’t invest in the new product to grow it, it will also never reach maturity. The primary product was just as small in its initial stages.

- The no-fail mentality. When searching for a new product, everyone in the team (PM, designers, developers) must have a “fail fast, fail cheap” mentality. By contrast, when developing a mature product, the team must follow a “no fails allowed” principle. If you like test-driven development, run-time performance optimization, software security, source code commented and formatted to the style guide, creating comfortable in-house frameworks and planning several sprints ahead, you should develop a mature product. But if you like fast user feedback, discussions about usability and the minimum viable product, several releases each week, and your software works only in 80% of cases and your source code is dirty as hell, but you’d rather spend more time discussing one-pixel changes in the UI, then you should be in a team discovering a new feature or product. When companies ignore these differences and assign their “no-fail” developers to discover a new product, it only leads to everyone’s frustration.

- The additive development fallacy. Development of the primary product is often additive. Projects like “We expect X% more users, we have to scale our hardware and software”, “We need feature X due to law changes”, or “We need a more modern design, let’s do a relaunch”, once implemented, usually never need to be rolled back. The problems begin when new products or features are also implemented in this additive manner. Instead of starting with hypothesis verification and then a prototype, a complete product or feature is conceived, designed and implemented. Several months later, it rolls out, gets some less-than-moderate user attention, and starts to rot quietly in its tiny dark corner. Nobody has the balls to sunset the feature, because, well, the company’s culture is additive, and the months of development are perceived as an asset. In reality, such features are a debt, constantly draining team effort and energy for maintenance, support, porting, translation, and operation.

I’m not sure yet how enterprises create a new successful product. When observing enterprises with several products, I have the feeling that either

- a charismatic leader builds his very own small empire and creates a new product as a by-product (no pun intended),

- or the mergers and acquisitions department grabs a product together with its team and successfully integrates it into the company,

- or the company organizes its own startup incubator. The company then owns its new products only partially, and a lot of the existing infrastructure is not re-used, but at least the cultural issues are solved,

- or, in 0.00001% of cases, companies such as Valve have an innovation culture from the very beginning.

Please share your experience with innovation within an enterprise.

My Decision Theory

On my way to work, I usually take a bus. Once, I arrived at the bus stop a little late and had to wait for the next bus. I looked at the timetable and found out that the next bus would come in 12 minutes. My ride is two stops, which takes 4 minutes by bus, or 20 minutes on foot.

I decided to walk.

Now, mathematically, it was the wrong decision. Waiting 12 minutes and then riding the bus for 4 minutes gives 16 minutes, which is shorter than 20 minutes. But that day was very cold, so I figured I’d better walk and warm myself up than stand at the bus stop for 12 minutes, possibly catching a cold. So even if the decision was mathematically wrong, it was correct from the health point of view.

Several minutes into walking, I watched a bus drive past me. What I had forgotten while making my decision is that two different bus lines pass my stop, and I can take either of them to work. I had looked up just one timetable and forgotten about the second one.

As a consequence of this decision, I came to work several minutes later than I ought to have. Normally, this is not a very good thing. But I had worked a little longer the previous day, and I didn’t have any meetings scheduled, so it didn’t cause any major trouble. On the positive side, I walked for 20 minutes, which was better for my health.

So I made a decision that was wrong both mathematically (16 minutes is less than 20) and logically (there was another bus line), but it didn’t have any major negative consequences; indeed, it was even good for my health.

Crazy, but this is how the world is. We make wrong decisions, yet reap only positive consequences. Sometimes we make perfectly correct and elegant decisions that become a huge source of negative consequences.

I’m still trying to understand how to handle it.

And this, by the way, is why I always laugh when I hear CS academics speaking about “reasoning about your code” and “formal proofs of correctness”. They seem to think the biggest problem of the software industry is figuring out whether 16 is less than 20.

Four Weeks of Bugfixing

The hardest bug I’ve ever fixed took me 4 weeks to find. The bug report itself was pretty simple, but I have to give some context first.

I was one of the developers of Smart TV software, and the bug was related to the part of the software responsible for playing video files stored on a USB stick or drive. The CPU available for this task was a 750 MHz ARM chip, which clearly did not have enough power to decode video (let alone HD video) in software. Luckily, every digital TV set has a hardware H.264 decoder, and our SoC was flexible enough that we could use it programmatically. In this way, we were able to support H.264 video playback (too bad for you, DivX and VC-1 owners).

Technically, the SoC provided a number of building blocks, including a TS demuxer, an audio decoder, a video decoder, a scaler and multi-layer display device, and a DMA controller to transfer the data between the blocks. Some of the blocks were present more than once (for example, for the PIP feature you naturally need two video decoders), and the blocks could be dynamically and freely interconnected programmatically, building a hardware-based video processing pipeline. Theoretically, one could configure the pipeline by writing the proper bits and bytes into specified configuration registers of the corresponding devices. Practically, the chip manufacturer provided an SDK for the chip, so you only had to call a pretty well-designed set of C functions. The SDK was intended to run in kernel mode of a Linux kernel, and it came from the manufacturer together with all the build scripts needed to build the kernel.

Furthermore, this SDK was wrapped and extended by some more kernel-side code, first to avoid dependency on a particular SoC, and second to expose some devices to user mode, where the rest of the Smart TV software was running. So to play video programmatically, one needed to open a particular device from user mode as a file and write a TS stream containing video and audio data into it.

Sadly, there are many people out there who have invented a lot of container formats besides TS. Therefore, our software had to detect the container format of the file to be played, demux the elementary streams out of it, mux them again into a TS stream, and hand that over to the kernel-mode code. The kernel code would pass the TS bytes to the DMA device, which would feed the hardware TS demuxer, which would send the video elementary stream to the hardware video decoder, where it would finally be decoded and displayed.

For the user mode, we could have implemented all possible container formats ourselves (which would have meant job security for the next 10 years or so). Fortunately, the Smart TV software was architected very well, and the GStreamer framework was used (for you Windows developers, it is an open-source alternative to DirectShow). The framework is written in C (to be quick) and GLib (to be object-oriented) and provides a pipeline container into which you can put filters and interconnect them. Some filters read the data (sources), some process the data (e.g. mux or demux), some consume the data (sinks). When the pipeline starts playing, the filters agree on which one will drive the pipeline; the driver then pulls data from all filters before it in the pipeline and pushes data into all filters after it. Our typical pipeline looked like this (in simplified form): “filesrc ! qtdemux ! mpegtsmux ! our_sink”. As you would expect from such a framework, there is also a lot of stuff related to events and state machines, as well as memory management.
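Such a pipeline can be prototyped on the command line with gst-launch. Since “our_sink” was our proprietary element, the stock filesink is substituted here, and the file paths are made up:

```shell
# Prototype of the (simplified) playback pipeline on the command line.
# "our_sink" was a proprietary element; filesink is substituted so the
# command runs on a stock GStreamer installation.
gst-launch-1.0 filesrc location=movie.mp4 ! qtdemux ! mpegtsmux ! filesink location=out.ts
```

This is essentially the debugging pipeline described further below, which dumps the remuxed TS stream to a file instead of the hardware decoder.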

Now, back to the bug report. It looked like this: when playing a TS file from USB memory, you can seek forward and backward without limitation. When playing any other container format, you can seek forward, but you cannot seek backward. When seeking backward, the video freezes for several seconds, and then playback continues from the latest position.

This is the sort of bug where I think it might be fixed in a day or two. I mean, it works with TS, it doesn’t work with MP4, it is fully reproducible; just find out what is different in those two cases and you’ve caught it.

The GStreamer pipeline in the TS case looked like this: “filesrc ! our_sink”. So the culprit had to be either qtdemux or mpegtsmux. I built another MP4 demuxer and replaced qtdemux with it. Negative, the bug was still there. No wonder: it also appeared with other container formats. I couldn’t replace mpegtsmux, because I hadn’t found any alternatives. So the only thing I could do was use the pipeline “filesrc ! qtdemux ! mpegtsmux ! filesink”, write the output into a file, and then try to dump the TS structure and look for irregularities.

If you know the TS format, you surely sympathize with me already. TS is a very wicked and complicated format that repeats some meta-information every 188 bytes, so the dump of several seconds of video took megabytes. After reading it, I didn’t find anything suspicious. Then I converted my test MP4 video into a TS using some tool, dumped that TS, and compared. Well, there were some differences, in particular in how often the PCR was transmitted. Theoretically, the PCR is just a system clock and should not influence playback at all, but in practice we already knew about some hardware bugs in the decoder making it allergic to unclear PCR signaling. I spent some time trying to improve the PCR, but this didn’t help either.

I then played the dumped TS file, and I could see the seek backwards that I had done during the recording. This convinced me that mpegtsmux was also bug-free. The last filter I could suspect was our own sink. Implementing a GStreamer filter is not easy to get right the first time. So I went through all the functions, all the states, all the events, informed myself how a proper implementation should look, and found a lot of issues. Besides a number of memory leaks, we generated garbage during the seek. Specifically, GStreamer needs seeking to work in the following way:

1. The seek command arrives at the pipeline and a flush event is sent to all filters.

2. All filters are required to drop all buffered information to prepare themselves for the new data streamed from the new location.

3. When all filters have signaled to be flushed, the pipeline informs the pipeline driver to change playback location.

4. After the seek, the new bytes start flowing in the pipeline.

Our code conformed to this procedure somewhat, but did the cleanup prematurely, so that after the cleanup some more stale data polluted our buffers before the data from the new location arrived.

I couldn’t explain why it worked with TS but not with MP4, but I figured that fixing this would make our product better anyway, so I fixed it. As you can imagine, it didn’t solve the original problem.

At this point I realized that I had to go into the kernel. This was a sad prospect, because every time I changed anything in the kernel, I had to rebuild it, put the update on a USB stick, insert it into the TV set, upgrade to the new kernel by flashing the internal SoC memory, and then reboot the chip. And sometimes I broke the build process, the new kernel wouldn’t even boot, and I had to rescue the chip. But I had no other choice: I was out of ideas about what else I could do in user space, and I suspected that in kernel space we had a similar issue with garbage during the seek.

So I bravely read the implementation of the sink device and changed it so that it would explicitly receive a flush signal from user space, flush the internal buffer of the Linux device, and signal back to user space that it was ready; only then would I unlock the GStreamer pipeline and allow it to perform the seek and start streaming from the new location.

It didn’t help.

I went further and flushed the DMA device too. It didn’t help. Flushing the video decoder device didn’t help either.

At this point I started to experiment with the flush order. If I flushed the DMA first, the video decoder might starve from lack of data and get stuck. But if I flushed the decoder first, the DMA would immediately feed it some more stale data. So perhaps I had to disconnect the DMA from the video decoder first, then flush the decoder, then the DMA, and then reconnect them? Implemented that. Nope, it didn’t work.

Well, perhaps the video decoder was allergic to asynchronous flushes? I implemented some code that waited until the video decoder reported that it had just finished a video frame, and flushed it only then. Nope, this wasn’t it either.

In the next step, I subscribed to all hardware events of all devices and dumped them. Well, that was another pile of megabytes of logs to read. It didn’t help that video playback was a very fragile process per se: even when playing a video that looked perfectly fine on the screen, the decoder and the TS demuxer would routinely complain about being out of sync, losing sync, or being unable to decode a frame.

After some time of trying to see a pattern, the only thing I could tell was that after a seek forward, the video decoder would complain for some frames but eventually recover and start producing valid video frames. After a seek backward, the video decoder never recovered. Hmm, could it be something in the H.264 stream itself that prevented the decoder from working?

Usually, one doesn’t think about elementary streams in terms of a format. They are just BLOBs that somehow contain the picture. But of course they have an internal structure, and this structure is normally dealt with only by the authors of encoders and decoders. I went back to GStreamer and looked up, file by file, all the filters of the pipeline producing the bug. Finally, I found that mpegtsmux has a file with “h264” in its name, and this immediately rang an alarm bell in my head. Because, well, TS is one abstraction level above H.264; why the hell does mpegtsmux have to know about the existence of H.264?

It turned out that an H.264 bitstream has in its internal structure the so-called SPS/PPS (sequence and picture parameter sets), which are basically the configuration for the video decoder. Without the proper configuration, it cannot decode video. In most container formats, this configuration is stored once, somewhere in the header. The decoder normally reads the parameters once before playback starts and uses them to configure itself. Not so in TS. TS is by nature not a file format but a streaming format: it has been designed so that you can start playing from any position in the stream. This means that all important information has to be repeated every now and then. In particular, when an H.264 stream gets packed into the TS format, the SPS/PPS data also has to be repeated regularly.

This is the piece of code responsible for this repetition: http://cgit.freedesktop.org/gstreamer/gst-plugins-bad/tree/gst/mpegtsmux/mpegtsmux_h264.c?h=0.11#n232 As you can see, during normal playback it inserts the contents of h264_data->cached_es every SPS_PPS_PERIOD seconds. This works perfectly well until you seek. But look at how the diff is calculated in line 234, and how last_resync_ts is stored in line 241. GST_BUFFER_TIMESTAMP is, as you can imagine, the timestamp of the current video sample passing through the muxer. When we seek backwards, the next time we come into this function GST_BUFFER_TIMESTAMP will be much less than last_resync_ts, so the diff will be negative, and thus the SPS/PPS data won’t be sent again until playback reaches the original position from before the seek.

To fix the bug, one can either use the system time instead of the playback time, or reset last_resync_ts during the flush event. Either would be just a one-line change in the code.

Now, the careful reader might ask why the TS file I recorded with mpegtsmux at the beginning of this adventure could be played at all. The answer is simple. At the beginning of that file (i.e. before I did the seek), there is H.264 data with repeated SPS/PPS. At some point (when I sought during the recording) the SPS/PPS stop being sent, and some seconds later they appear again. Because this SPS/PPS data is the same for the whole file, already its first instance configures the video decoder properly. During the actual seek in MP4 playback, on the other hand, the video decoder is flushed, and therefore its SPS/PPS configuration is flushed too. This is the point where the video decoder relies on the repeated SPS/PPS in the TS stream to recover, and this is exactly the point where they stop coming from mpegtsmux.

Four weeks of searching. 8 hours a day, 5 days a week. Tons of information read and understood. Dozens of other, smaller bugs fixed along the way. All to find a single buggy line of code among 50 million lines of code in the source folder. A large haystack contains, by my estimate, 40 to 80 million individual straws, making this bug-fixing adventure literally the equivalent of finding a needle in a haystack.
