Do we really need a staging system for data science?

So you are an infrastructure guy (a DevOps engineer or an IT admin), and your data scientist or data engineer has told you they don’t need a staging system.

You know for a fact that any professional software development needs a development system, a staging system, and a production system, and that developing in production is utterly dangerous and unprofessional. So you think your data scientist or data engineer is incompetent in software engineering.

Are they really?

A step back

So why did web development come up with these three environments in the first place?

If we develop a web site in production, any changes become immediately visible not only to us, but also to the users. If there is a bug, users will see it. If there is a security vulnerability, hackers will use it. If there is some suboptimal communication in the texts, activists will make a shitstorm out of it. Data can be lost, customers can be lost, the whole company can be lost.

So we need a development system. When the developer makes a change, they can check how it works before releasing it to everybody else. If we have only one developer and one change per week, this would be enough, because the testers, product managers, marketing, legal etc. could just check the changes on the developer’s development system, while the developer drinks coffee and plays table football.

But if the developer is constantly working on changes, and even more so if there is more than one developer, it makes sense to establish a third system, staging, where all their changes can be integrated together, bundled into one release, tested, validated, and checked by internal stakeholders before they get rolled out to production.

All of this makes sense.

How is data science different?

Let’s say we want to create the next generation of the Netflix movie recommender. We want to extract every image frame and the soundtrack of each movie, send them to an LLM, and ask the model to generate microtags for each movie (a rough sketch of such a tagging call follows the list below), such as

  • main_character_wears_orange
  • more_than_three_main_characters
  • movie_happens_in_60ies_in_china
  • bechdel_wallace_test_passed
  • etc
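
To make this concrete, here is a minimal sketch of what such a tagging call could look like, assuming a hypothetical multimodal LLM client (the llm_client.generate call, its parameters and the prompt wording are invented for illustration; the real API and the frame/audio extraction will differ):

    import json

    MICROTAG_PROMPT = (
        "You are given frames and the audio transcript of a movie segment. "
        "Return a JSON list of short snake_case microtags describing it, "
        "for example main_character_wears_orange or bechdel_wallace_test_passed."
    )

    def generate_microtags(llm_client, frames, transcript):
        # Ask a multimodal LLM for microtags of one movie segment.
        # llm_client.generate is a placeholder for whatever LLM API is actually used.
        response = llm_client.generate(
            prompt=MICROTAG_PROMPT,
            images=frames,        # extracted video frames
            text=transcript,      # sound track transcribed to text
        )
        return set(json.loads(response))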

Then, for each user, we want to retrieve the movies they have watched and enjoyed before, so that we understand which microtags are important for them and can recommend movies with the same microtags.
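
As a rough illustration of the recommendation idea, one could score movies by how many of the user’s preferred microtags they share; the function names and the scoring below are assumptions for the sake of the sketch, not an actual Netflix algorithm:

    from collections import Counter

    def preferred_tags(watched_and_liked, movie_tags, top_n=20):
        # Count microtags across the movies the user enjoyed and keep the most frequent ones.
        counts = Counter(tag for movie in watched_and_liked for tag in movie_tags[movie])
        return {tag for tag, _ in counts.most_common(top_n)}

    def recommend(user_tags, movie_tags, already_watched, k=10):
        # Rank unwatched movies by how many of the user's preferred microtags they carry.
        scored = (
            (len(user_tags & tags), movie)
            for movie, tags in movie_tags.items()
            if movie not in already_watched
        )
        return [movie for score, movie in sorted(scored, reverse=True)[:k] if score > 0]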

So we download a small subsample of the dataset (e.g. 15 minutes of video) to our development workstation, code the script that reads the data and calls the LLM, and when it works, we want to execute it for a bigger subset of movies (e.g. 100 movies) and see how reasonable the generated microtags are. We can’t do that on the development workstation, because it would take years there; it has to be parallelized on some fat compute cluster so that we have a chance to finish this project before we retire.

So we deploy our training app to a compute cluster and call it the production environment. It is still internal, so we don’t need stakeholders to check it, and if it fails in the middle, we can fix the bugs and restart the process, and no external users will be affected. So it has the freedom of a development environment and the hardware costs of a production environment.

When we are done with that, we want to evaluate this model v0.1. To do this, we load the history of watched movies for some user, split it in half, use the first half to generate recommendations, and check whether the recommendations are present in the second half of their history. First, we develop this evaluation code on our development system and check it with one or two users. Then we deploy it to some fat compute cluster so that we can repeat the calculation for all users, or at least for many of them. So, again, this is not staging, because the system won’t contain a release candidate of the final recommender service, but rather some batch code constantly loading data and calculating the evaluation metrics.
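
The core of that evaluation is just a hit rate over the held-out half of the watch history. A minimal sketch, assuming chronologically ordered histories (the names and the 50/50 split are illustrative):

    def hit_rate(history, recommender, k=10):
        # Split one user's chronological watch history in half, recommend from the
        # first half, and measure which fraction of the recommendations shows up
        # in the second half.
        mid = len(history) // 2
        seen, held_out = history[:mid], set(history[mid:])
        recommendations = recommender(seen)[:k]
        if not recommendations:
            return 0.0
        hits = sum(1 for movie in recommendations if movie in held_out)
        return hits / len(recommendations)

    # Averaging hit_rate over all (or many) users gives the baseline metric mentioned below.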

The first evaluation will deliver a baseline for our quality metrics. We can now start the actual data science work: tweak some model parameters, use another model, do “prompt engineering”, pre-process our movie data with some filters or even some other AI models, etc. We will then iteratively deploy our model v0.2, v0.3, …, v0.42 to our compute cluster, train and then evaluate them. From time to time we will need some other training data to avoid overfitting.

At some point we will be satisfied with the model and want to productize it. For this, we will need two things. First, to train this model on all movies. For that, we need to scale out our compute cluster so that this training takes a month, not years. This will cost a lot that month, but only that month, and it will save us time to market.

And second, we need to develop the recommender microservice that can then be called from the frontend or backend of the movie player.

What happens if this microservice crashes in production? Well, we should configure the API gateway to forward the requests to a backup microservice in this case. This can be some stupidly simple recommender, for example one that always delivers the most popular movies of the day. So we don’t really need to test crashes of the service before going live. A couple of unit tests and a quick test on the development system would be good enough.
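
A minimal sketch of that fallback logic, written here as application code for clarity (the endpoint URLs are made up, and in practice the re-routing would live in the API gateway’s retry/fallback configuration rather than in Python):

    import requests

    ML_RECOMMENDER_URL = "http://recommender-ml.internal/recommendations"    # made-up endpoints
    MOST_POPULAR_URL = "http://recommender-popular.internal/top-of-the-day"

    def get_recommendations(user_id, timeout_s=0.5):
        # Try the ML recommender first; on any error or timeout fall back to the
        # dumb "most popular movies of the day" service.
        try:
            response = requests.get(ML_RECOMMENDER_URL, params={"user": user_id}, timeout=timeout_s)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            response = requests.get(MOST_POPULAR_URL, timeout=timeout_s)
            response.raise_for_status()
            return response.json()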

Can our microservice have a security bug? Well, the infra team should already have established a template for microservices that handles all the security things, so if we use this template, we should be reasonably safe too, given that this is an internal service behind the API gateway. It is vulnerable to DDoS, but the API gateway should have DDoS protection for all microservices anyway.

Our microservice doesn’t communicate with the end user, so there is no need for copywriters to check the spelling. Its legality is already defined before we start development, and we don’t need a staging system to test it. Also, if it loses data or starts recommending wrong movies, the worst thing that can happen is that users see some bad recommendations. But they are seeing bad recommendations all the time anyway.

Summary

  • Data science apps cannot be fully developed on a development workstation. They are trained, tested, and fine-tuned in production, with production-grade hardware, full production data and sometimes with real production users.
  • In most cases, data science apps cannot benefit from manual or automated QA, because their quality depends on the quality of the input data. Testers’ behavior is different / random compared with real users, so most probably they will only get random / garbage recommendations back.
  • QA can test whether a recommender service responds at all. But because the recommender service is computationally intensive, the most probable crash scenario is not when just one tester is calling it, but rather when a great load of user requests is generated. In this case, the API gateway should re-route to some other, simpler service anyway. Note that the API gateway is not something data science teams should be responsible for. It can be developed and tested (on staging) by the infra team.
  • Data science apps don’t have fixed texts that can be spell-checked or approved by the legal team. Either they don’t have texts at all, or (if it is a generative model) they will generate an infinite number of texts, so we can’t check them all in staging. We rather need to design our product in a way that makes us immune to spelling errors and legal issues. Yes, AI ethics engineering exists, but we mere mortals who don’t work at FAANG cannot afford it anyway.
