Who am I?

Ten years ago I was working as a product manager and had trouble explaining to my parents what exactly I was doing.

Now, as a Data Scientist, the challenge is still there.

No, I am not creating reports based on data; data analysts do that.

No, I am not inventing the next artificial intelligence; scientists do that.

No, I am not just training machine learning models, for three reasons: a) most of the available data cannot be used for training as it is, so I need to get involved in data engineering first to obtain usable data; b) I can’t just take a requirement from a customer or from product management and train a model that implements it, because machine learning is not magic and can’t do everything, so I first need to work with the customer or stakeholder to come up with a requirement that is both technically feasible and valuable for their business; and c) training LLMs is very expensive, and there are other ways to use them.

No, I am not a software developer. Software developers implement the user stories and tickets assigned to them; I decide myself what to implement and why. Besides, I don’t care about a proper software process: my goal is to discover a product feature that uses data in a way that brings in money. I usually don’t see any reason to create software documentation, write unit tests, implement CI/CD pipelines, or use branches, merge requests and code reviews, at least not before we have our first three paying customers.

No, I am not an infrastructure guy; the DevOps engineers are.

What would I like to be doing?

  • I want to make sure that data is available in the quality and quantity I need to start working.
  • I want to create product ideas that promise business value by using this data.
  • I want to test these product ideas by implementing a prototype and trying to sell it to customers.

Five mistakes interpreting data

We’re living in the age of data and everybody is making data-driven decisions by interpreting data.

As a data scientist, I see people making the same mistakes again and again. Here are the five most common ones.

Failing to relate the numbers

A data analysis shows that one of the KPIs is at 65%. The newspaper reports a reduction of carbon emissions by five tonnes per year. Your dieting app says you consumed 200,000 joules of energy with your last meal.

Surprisingly many people would conclude that the KPI is good and solid, that the carbon emissions were hugely reduced, and that they overate at their last meal. This happens because they relate these numbers to some average, implicit baseline, which may or may not be correct.

We often see percentages when we talk about revenue, interest rates or growth, and we have gotten used to revenue figures and interest rates running at around 0% to 20%, and to 20% growth being considered very good.

We usually weigh less than 100 kg and our car weighs 1.5 tonnes, so when we hear about 5 tonnes, it has to be a lot, right?

Whenever we see numbers like 200,000, we think of something really expensive, like a luxury car or a small piece of real estate. So 200,000 joules has to be a lot.

But if your performance indicator has been between 90% and 120% over the last ten years, 65% is rather bad news.

The worldwide yearly carbon emissions are 38,000,000,000 tonnes, so 5 tonnes are merely 0.000000013% of the total emissions.

And 200,000 joules are just about 50 kcal (the “calories” printed on food labels), something like one apple.

Establish a meaningful baseline and compare any analytics results with this baseline.

The most common baselines are values from other time periods (the KPI example), some global total (carbon emissions), or a conversion from a rarely used unit such as joules to a more intuitive one such as calories.
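The three baselines above boil down to very little arithmetic. Here is a minimal sketch of it; the helper names percent_of and joules_to_kcal are my own, the numbers are the ones from the examples, and I assume the usual conversion factor of 4184 joules per kilocalorie:

```rust
/// Joules in one kilocalorie (the "calorie" printed on food labels).
const JOULES_PER_KCAL: f64 = 4184.0;

/// Expresses `part` as a percentage of `total`.
fn percent_of(part: f64, total: f64) -> f64 {
    part / total * 100.0
}

/// Converts an energy value from joules to kilocalories.
fn joules_to_kcal(joules: f64) -> f64 {
    joules / JOULES_PER_KCAL
}

fn main() {
    // Baseline 1: the KPI's own history (range taken from the text).
    let (kpi, historic_low, historic_high) = (65.0, 90.0, 120.0);
    let below_range = kpi < historic_low;
    println!(
        "KPI {}% vs. historic range {}% to {}%: below range = {}",
        kpi, historic_low, historic_high, below_range
    );

    // Baseline 2: a global total (worldwide yearly emissions in tonnes).
    let share = percent_of(5.0, 38_000_000_000.0);
    println!("5 tonnes are {:.9}% of worldwide emissions", share);

    // Baseline 3: conversion to a more intuitive unit.
    let kcal = joules_to_kcal(200_000.0);
    println!("200,000 joules are about {:.0} kcal", kcal);
}
```

None of this is sophisticated, which is rather the point: the baseline is usually one division away.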

Failing to account for costs or consequences

Everything has advantages and disadvantages. Everything good comes at a cost. Looking at only one chart in the analytics report can lead to wrong decisions.

Yes, the KPI has fallen from 100% to 65%. But what if this has been achieved at only 10% of the cost?

Yes, five tonnes of CO2 per year is very little. But if the measure has negative costs (selling the car and using public transportation saves money), then 100 million people can be motivated to do it, and together they will reduce carbon emissions by 1.3%.

Yes, consuming just 50 kcal for a meal looks good if we want to lose weight, but eating just one apple for lunch might induce food cravings and so increase the risk of abandoning the diet.

Put the results of the analytics report in relation to their costs and consequences before labeling the numbers “good” or “bad”.
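To make the first two examples concrete, here is a small sketch of the arithmetic. The function names are hypothetical, and I assume that "10% of the costs" means going from 100 cost units to 10:

```rust
/// KPI percentage points obtained per unit of cost.
fn kpi_per_cost(kpi_percent: f64, cost: f64) -> f64 {
    kpi_percent / cost
}

/// Combined saving of many people as a percentage of the worldwide total.
fn combined_share_percent(people: f64, tonnes_each: f64, world_total_tonnes: f64) -> f64 {
    people * tonnes_each / world_total_tonnes * 100.0
}

fn main() {
    // KPI example: 100% at 100 cost units before, 65% at 10 cost units after.
    let before = kpi_per_cost(100.0, 100.0);
    let after = kpi_per_cost(65.0, 10.0);
    println!("KPI per cost unit: before {}, after {}", before, after);

    // Carbon example: 100 million people saving 5 tonnes each,
    // compared with 38 billion tonnes emitted worldwide per year.
    let share = combined_share_percent(100_000_000.0, 5.0, 38_000_000_000.0);
    println!("Combined reduction: {:.1}% of worldwide emissions", share);
}
```

Seen per unit of cost, the "bad" KPI of 65% is the more efficient one, and the "negligible" five tonnes add up once enough people join in.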

Failing to take data quality into account

Just because an analytical report looks authoritative with its tables of numbers and charts, there is no guarantee that the numbers have meaning at all.

Was our KPI really 65%? Or were some new processes and recent reorganizations in our company simply overlooked and not included in the KPI calculation? Is this year’s KPI comparable with last year’s at all?

How did they measure the 5 saved tonnes of carbon emissions? Did they attach a measurement device to the exhaust of the car? Or did they just take the carbon contained in the gasoline? Did they take the emissions from producing and scrapping the car into account? Can public transport be considered emission-free?

How did we measure the 200,000 joules of energy? Did we burn the food and measure the heat released? Or did we just point our smartphone camera at it and hope that the app can tell an apple from a brownie that has the shape and color of an apple?

Understand how the data has been produced or measured and take its limitations into account. Improve data quality if you are to make an important decision, or else make your decision tentatively and be ready to change it if some contradictory data emerges.

Failing to consider causality or confounders

Consider these two maps of the USA:

On the left is the frequency of UFO sightings [1], on the right is the population density [2].

The red dots on the left correlate with the red dots on the right, and we should consider at least three possible explanations for that:

1) More people means a higher chance that, at any given second, at least one of them is looking at the sky, so UFOs have a higher chance of being spotted.

2) Aliens are secretly living on Earth and use their secret political influence to support policies that lead to higher population density (agglomeration of attractive employers, zoning rules, etc.), because they feed on the stress that humans emit when living in overpopulated areas.

If we want to depict these ideas, we can use causation graphs like this:

This is also widely known as “Correlation is not Causation”. From the data alone we cannot determine the direction of the arrow.

Confounders are less widely known and they go like this:

3) There is an unknown field, distributed all over the world with varying strength. Physicists don’t know about this field yet. Let’s call it the Force. Some parts of Earth emanate the Force more strongly than others. The Force makes people more optimistic and happy, so people unconsciously tend to settle around sources of the Force. Aliens fly by to mine the Force, because they use it as a currency.

The Force is a confounder for the correlation between UFO sightings and population density:

So, which explanation is the correct one? From this data alone, it is impossible to tell.

Perform additional analysis to establish causality and control for confounders, or else avoid making causal statements based on the data report.
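If you want to see a confounder at work, a toy simulation helps. The sketch below is entirely made up: in each simulated region, a hidden "Force" value drives both population density and UFO sightings, and the two never influence each other directly. The noise level, the seed and the tiny random generator (used only to avoid external crates) are arbitrary assumptions of mine:

```rust
/// Tiny deterministic pseudo-random generator (a 64-bit LCG), used only so
/// that the sketch needs no external crate. Not suitable for real statistics.
struct Lcg(u64);

impl Lcg {
    /// Returns the next pseudo-random number in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use the top 53 bits, which are the best-mixed ones in an LCG.
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pearson correlation coefficient of two equally long samples.
fn pearson(xs: &[f64], ys: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;
    let (mut cov, mut var_x, mut var_y) = (0.0, 0.0, 0.0);
    for (x, y) in xs.iter().zip(ys) {
        cov += (x - mean_x) * (y - mean_y);
        var_x += (x - mean_x).powi(2);
        var_y += (y - mean_y).powi(2);
    }
    cov / (var_x * var_y).sqrt()
}

/// Simulates `n` regions. Returns the raw correlation between population and
/// sightings, and the correlation that remains once the Force is subtracted.
fn simulate(n: usize, seed: u64) -> (f64, f64) {
    let mut rng = Lcg(seed);
    let (mut population, mut sightings) = (Vec::new(), Vec::new());
    let (mut pop_rest, mut ufo_rest) = (Vec::new(), Vec::new());
    for _ in 0..n {
        let force = rng.next_f64(); // the hidden confounder
        let pop_noise = 0.3 * (rng.next_f64() - 0.5);
        let ufo_noise = 0.3 * (rng.next_f64() - 0.5);
        population.push(force + pop_noise); // Force -> population
        sightings.push(force + ufo_noise); // Force -> sightings
        pop_rest.push(pop_noise); // what is left after removing the Force
        ufo_rest.push(ufo_noise);
    }
    (pearson(&population, &sightings), pearson(&pop_rest, &ufo_rest))
}

fn main() {
    let (raw, controlled) = simulate(10_000, 42);
    println!("raw correlation: {:.2}", raw);
    println!("correlation after controlling for the Force: {:.2}", controlled);
}
```

The raw correlation comes out strong although neither variable causes the other, and it all but disappears once the confounder is accounted for. In real data, of course, you rarely get to subtract the Force that conveniently.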

Failing to be really data-driven

I have a strong opinion that Explanation 1 (more people lead to more UFO sightings) is the correct one. This opinion is based on my personal experience and common sense. If I were especially well trained in sociology and UFO data, you could call it “my expert opinion”.

Expert opinions are not data-driven. They are based on experience, common sense and intuition.

Another reason for not being data-driven, or not fully, is the pressure of responsibility typically found at the C-level of management or in politics. There are so many political and soft factors, such as marketing, influence, group dynamics, psychology, company culture, and big money, that discourage the search for truth. Data is only accepted when it confirms the VIP’s opinion.

Don’t hesitate to state that you are making a particular decision in a way that is not data-driven. Data is one way to make great decisions, but not the only one.

Do not pretend to be data-driven while making some or all of these five mistakes, just because being data-driven is “in” and gives your decision a scientific touch. Data scientists are watching you, and they will know the truth.

Data sources

[1] https://www.reddit.com/r/dataisbeautiful/comments/1cti1rz/do_ufo_sightings_happen_near_airports_best

[2] https://ecpmlangues.unistra.fr/civilization/geography/map-us-population-density-2021

Basic Error handling in Rust

In the first part of the series we’ve seen some first examples of error handling in Rust, using unwrapping or pattern matching. This post will try to show you how Rust is making error handling a little more convenient than this.

Let’s start with a simple program that reads this CSV file:

name,age,civil_status,married_when,married_where,license_plate
Alice,17,Single,,,
Bob,19,Single,,,
Charles,43,Married,2005-05-01,New York,NY-102
Dina,6,Single,,,
Eva,40,Married,2005-05-01,New York,NY-655
Faith,89,Widowed,,,

and calculates the average age of persons:

import pandas as pd

def calc_avg_age():
    df = pd.read_csv("persons.csv")
    return df["age"].mean()

print("Average age is %f" % calc_avg_age())

Oops, this is written in my primary language, Python, but I’ll let it stay for reference. Note how the error handling is implemented: no effort is required from the developer, and the program will terminate with an unhandled exception (the Python counterpart of a panic) if the file is not found, is not readable, has the wrong format, has no column named age, or in any other failure case.

I would argue that this is exactly what I want when I develop new code. I want to see it works first, as soon as possible, because remember, I am a troubled developer and I want to have my first small success ASAP to soothe my fears. In Python, I usually first make it work, then I make it right (this includes error handling and documentation) and then optionally I will make it fast.

Here is the code in Rust:

use csv::Reader;

fn get_avg_age() -> f64 {
    // open CSV reader
    let mut reader = Reader::from_path("persons.csv").unwrap(); 

    // find index of the age column
    let headers = reader.headers().unwrap();
    let index_of_age = headers
        .iter()
        .enumerate()
        .find(|&x| x.1 == "age")
        .unwrap()
        .0;

    // extract column age and convert it to f64
    let records = reader.records();
    let mut count_persons = 0;

    let person_ages = records.map(|x| {
        count_persons += 1;
        x.unwrap()
            .get(index_of_age)
            .unwrap()
            .parse::<f64>()
            .unwrap()
    });

    // calculate average person age
    person_ages.sum::<f64>() / (count_persons as f64)
}

fn main() {
    println!("Average age {}", get_avg_age());
}

The Rust code has the same error handling behavior: it will panic if something (anything!) goes wrong. Let’s imagine, for the sake of argument, that we want to move the get_avg_age function into a reusable library, and so we want to add proper error handling to it.

In Python, I would probably not change anything at all. The default exceptions are good enough for production, they carry all the necessary information, and they should be caught in the app (as a rule, we should almost never catch exceptions in the library, because it is up to the app to decide whether and how they will be handled or logged).

In Rust, if I want to give the app the chance to decide whether and how errors will be handled or logged, I can no longer use unwrap and have to forward all errors to the caller. To do that, I need to change the return type of the function to a Result:

fn get_avg_age() -> Result<f64, ???>

The first generic parameter of the result is the type of the return value if there were no errors, so it is f64 like in the previous code. The second parameter is the type of the error. What should we write there? The function can cause different types of errors: on the file level, from the csv parser, when parsing strings to floats etc.

In other languages, all errors inherit from the same parent class Exception or Error, so we could then surely just use this base class in Rust too? Well, yes and no. It is possible in Rust, but Rust definitely raises its eyebrow every time you do it. More on this later, but let’s just first explore another alternative: we can create our own error type, and just return this one error every time some other error happens inside of our function.

If we go this way and use pattern matching for proper error handling, this is the first version we might end up with:

use csv::Reader;

struct PersonLibError;

fn get_avg_age() -> Result<f64, PersonLibError> {
    // open CSV reader
    let mut reader = match Reader::from_path("persons.csv") {
        Ok(x) => x,
        Err(e) => return Err(PersonLibError {}),
    };

    // find index of the age column
    let headers = match reader.headers() {
        Ok(x) => x,
        Err(e) => return Err(PersonLibError {}),
    };

    let index_of_age = match headers.iter().enumerate().find(|&x| x.1 == "age") {
        Some(x) => x.0,
        None => return Err(PersonLibError {}),
    };

    // extract column age, convert it to f64, and calculate sum
    let mut count_persons = 0;
    let mut sum_ages: f64 = 0.0;

    for x in reader.records() {
        count_persons += 1;
        match x {
            Ok(record) => match record.get(index_of_age) {
                Some(age) => match age.parse::<f64>() {
                    Ok(age_num) => sum_ages += age_num,
                    Err(e) => return Err(PersonLibError {}),
                },
                None => return Err(PersonLibError {}),
            },
            Err(e) => return Err(PersonLibError {}),
        };
    }

    // calculate average person age
    Ok(sum_ages / (count_persons as f64))
}

fn main() {
    match get_avg_age() {
        Ok(r) => println!("Average age {}", r),
        Err(e) => println!("Error"),
    }
}

This version of the code will never panic, which is good if we want to use it as a library later. But as you can see, the important information about the error (its type and context) is lost, so this kind of error handling is not only very verbose but also highly unprofessional, and it makes the software hard to maintain.

Fortunately, Rust has supercharged enums, so that we can improve our error handling like this:

use csv::Reader;

enum PersonLibError {
    FileError,
    CsvParserError,
    NoColumnNamedAge,
    RecordHasNoValueInColumnAge,
    CannotParseAge(std::num::ParseFloatError),
}

fn get_avg_age() -> Result<f64, PersonLibError> {
    // open CSV reader
    let mut reader = match Reader::from_path("persons.csv") {
        Ok(x) => x,
        Err(e) => return Err(PersonLibError::FileError),
    };

    // find index of the age column
    let headers = match reader.headers() {
        Ok(x) => x,
        Err(e) => return Err(PersonLibError::CsvParserError),
    };

    let index_of_age = match headers.iter().enumerate().find(|&x| x.1 == "age") {
        Some(x) => x.0,
        None => return Err(PersonLibError::NoColumnNamedAge),
    };

    // extract column age and convert it to f64
    let records = reader.records();
    let mut count_persons = 0;

    let mut sum_ages: f64 = 0.0;

    for x in records {
        count_persons += 1;
        match x {
            Ok(record) => match record.get(index_of_age) {
                Some(age) => match age.parse::<f64>() {
                    Ok(age_num) => sum_ages += age_num,
                    Err(e) => return Err(PersonLibError::CannotParseAge(e)),
                },
                None => return Err(PersonLibError::RecordHasNoValueInColumnAge),
            },
            Err(e) => return Err(PersonLibError::CsvParserError),
        };
    }

    // calculate average person age
    Ok(sum_ages / (count_persons as f64))
}

Now in the caller we can match on different values of our enum and handle different errors accordingly. Note also how we have passed the original ParseFloatError into the PersonLibError::CannotParseAge value.
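What the caller’s side could look like is sketched below. To keep it runnable without the csv crate, I use a trimmed-down stand-in for the enum with only two values, and the helpers parse_age and describe are hypothetical:

```rust
use std::num::ParseFloatError;

// Trimmed-down stand-in for the enum above, reduced to two values.
enum PersonLibError {
    NoColumnNamedAge,
    CannotParseAge(ParseFloatError),
}

// Hypothetical helper: parses one age cell, or fails if the column is missing.
fn parse_age(cell: Option<&str>) -> Result<f64, PersonLibError> {
    match cell {
        None => Err(PersonLibError::NoColumnNamedAge),
        Some(text) => match text.trim().parse::<f64>() {
            Ok(age) => Ok(age),
            Err(e) => Err(PersonLibError::CannotParseAge(e)),
        },
    }
}

// The caller decides what each kind of error means for the application.
fn describe(result: Result<f64, PersonLibError>) -> String {
    match result {
        Ok(age) => format!("age is {}", age),
        Err(PersonLibError::NoColumnNamedAge) => {
            "wrong file layout, please export the file again".to_string()
        }
        // The wrapped ParseFloatError is still available for the message.
        Err(PersonLibError::CannotParseAge(inner)) => {
            format!("bad age value: {}", inner)
        }
    }
}

fn main() {
    println!("{}", describe(parse_age(Some("43"))));
    println!("{}", describe(parse_age(Some("forty"))));
    println!("{}", describe(parse_age(None)));
}
```

The compiler also checks that the match covers every value of the enum, so adding a new error kind later forces every caller to decide what to do about it.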

We can also derive Debug for our PersonLibError, so that errors can easily be printed:

#[derive(Debug)]
enum PersonLibError {
    FileError,
    CsvParserError,
    NoColumnNamedAge,
    RecordHasNoValueInColumnAge,
    CannotParseAge(std::num::ParseFloatError),
}
...
fn main() {
    match get_avg_age() {
        Ok(r) => println!("Average age {}", r),
        Err(e) => println!("Error {:?}", e),
    }
}

Our code now has better error handling, but it is still pretty verbose. Let’s finally explore the “?” operator. It tries to unwrap its operand. If it is an Ok, it just passes on the unwrapped value. If it is an Err, it tries to convert the error at hand to the error type defined in the Result of the containing function, and returns that error from the function immediately.

So, in our case, all errors inside of our function will be converted to PersonLibError. Basically, it automatically does this part:

Err(e) => return Err(PersonLibError::<some fitting matching value>),
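As a rough mental model (a simplification, not the exact expansion the compiler uses), the following two functions behave the same. The function names are made up, and the error type is deliberately the same on both sides so that no custom conversion is needed yet:

```rust
use std::num::ParseFloatError;

// Written with the "?" operator.
fn parse_with_question_mark(text: &str) -> Result<f64, ParseFloatError> {
    let value = text.parse::<f64>()?;
    Ok(value)
}

// Roughly what "?" stands for: pass the Ok value on, or convert the error
// with From::from and return early.
fn parse_with_match(text: &str) -> Result<f64, ParseFloatError> {
    let value = match text.parse::<f64>() {
        Ok(v) => v,
        Err(e) => return Err(From::from(e)),
    };
    Ok(value)
}

fn main() {
    for input in ["43", "forty"].iter() {
        println!(
            "{}: {:?} / {:?}",
            input,
            parse_with_question_mark(input),
            parse_with_match(input)
        );
    }
}
```

Here From::from converts the error into itself, which always works. It gets interesting when the error type of the function differs from the error at hand.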

Rust knows how to convert error types if the From trait is implemented for our error type (one implementation per source error type). There is a helpful crate called thiserror that will do it for us automatically. Here is the new source code:

use csv::Reader;

#[derive(Debug, thiserror::Error)]
enum PersonLibError {
    #[error("Csv parser error")]
    CsvParserError(#[from] csv::Error),
    #[error("No column named age")]
    NoColumnNamedAge,
    #[error("Record has no value in column age")]
    RecordHasNoValueInColumnAge,
    #[error("Cannot parse age")]
    CannotParseAge(#[from] std::num::ParseFloatError),
}

fn get_avg_age() -> Result<f64, PersonLibError> {
    // open CSV reader
    let mut reader = Reader::from_path("persons.csv")?;

    // find index of the age column
    let headers = reader.headers()?;

    let index_of_age = match headers.iter().enumerate().find(|&x| x.1 == "age") {
        Some(x) => x.0,
        None => return Err(PersonLibError::NoColumnNamedAge),
    };

    // extract column age and convert it to f64
    let records = reader.records();
    let mut count_persons = 0;

    let mut sum_ages: f64 = 0.0;

    for x in records {
        count_persons += 1;

        match x?.get(index_of_age) {
            Some(age) => {
                sum_ages += age.parse::<f64>()?;
            }
            None => return Err(PersonLibError::RecordHasNoValueInColumnAge),
        };
    }

    // calculate average person age
    Ok(sum_ages / (count_persons as f64))
}

There is a lot going on here:

  • we could get rid of three match expressions and replace them with the much more concise “?” operator.
  • unfortunately, we have also lost the ability to tell FileError and CsvParserError apart, because the underlying csv crate returns csv::Error in both cases. To keep the distinction, we would have to go back to pattern matching.
  • in a function that returns a Result, “?” cannot be applied directly to an Option, so we had to keep the fully written pattern matching there.
  • thiserror has forced us to give each error value a human-readable description. It is nice to read it later in the error output.
  • the #[from] attribute tells thiserror to implement the From trait for the corresponding error type, which is the conversion that “?” relies on.
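To take some of the magic out of the attributes, here is a hand-written sketch of roughly what they stand for, reduced to a single error value. It is my own approximation and not the code that thiserror actually generates; parse_age is a hypothetical helper:

```rust
use std::fmt;
use std::num::ParseFloatError;

#[derive(Debug)]
enum PersonLibError {
    CannotParseAge(ParseFloatError),
}

// Roughly what #[error("Cannot parse age")] stands for: a Display implementation.
impl fmt::Display for PersonLibError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PersonLibError::CannotParseAge(_) => write!(f, "Cannot parse age"),
        }
    }
}

// The derive also implements the Error trait; the wrapped error is its source.
impl std::error::Error for PersonLibError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            PersonLibError::CannotParseAge(inner) => Some(inner),
        }
    }
}

// Roughly what #[from] stands for: the conversion that "?" looks up.
impl From<ParseFloatError> for PersonLibError {
    fn from(inner: ParseFloatError) -> Self {
        PersonLibError::CannotParseAge(inner)
    }
}

// "?" converts the ParseFloatError into a PersonLibError via the From impl.
fn parse_age(text: &str) -> Result<f64, PersonLibError> {
    Ok(text.parse::<f64>()?)
}

fn main() {
    match parse_age("forty") {
        Ok(age) => println!("Age {}", age),
        Err(e) => println!("Error: {} ({:?})", e, e),
    }
}
```

This also shows why one source error type can feed only one enum value this way: there can be only one From<csv::Error> implementation per error type.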

All in all, by using “?” we have reduced the boilerplate a little, weakened our error handling by removing FileError, and, last but not least, introduced a lot of things that happen implicitly, which makes the code harder for novices to understand.

Another way to handle errors would be to use some base error type and return the underlying errors directly, without wrapping them in our PersonLibError. All error types implement the Error trait, so we can just write

Result<f64, Error>

can’t we? No, we can’t. Error is a trait that can be implemented by different structs, different structs can have different sizes, and Rust wants to know the size of every value at compile time. Since it cannot do that here, it raises its eyebrow and forces you to write it like this:

Result<f64, Box<dyn Error>>

As far as I can tell, this Box<dyn> stuff has no purpose other than to softly discourage you from using memory allocations that cannot be statically checked by Rust. Rust could just as well hide the difference between boxed and unboxed values from you as a software developer, but it prefers to make it very explicit, so that if at some point the developer runs into memory or performance issues, Rust can throw up its hands and say “here, you are using Box and you are using dyn. No wonder you have issues now”.

Please correct me if I am wrong.
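One thing that can be checked without much theory is the size argument. The sketch below prints the sizes of a few error types from the standard library and of the boxed version. The helper parse_small_number is made up; it can fail with two unrelated error types and still has a single return type. My understanding, stated with due caution, is that Box<dyn Error> is a pointer to the heap-allocated error plus a pointer to a method table, so its size is always the same no matter which error sits inside:

```rust
use std::convert::TryFrom;
use std::error::Error;
use std::mem::size_of;

// Hypothetical helper that can fail in two unrelated ways.
fn parse_small_number(text: &str) -> Result<u8, Box<dyn Error>> {
    let number: i64 = text.trim().parse()?; // may fail with ParseIntError
    let small = u8::try_from(number)?; // may fail with TryFromIntError
    Ok(small)
}

fn main() {
    // Concrete error types are free to have different sizes ...
    println!("ParseIntError:  {} bytes", size_of::<std::num::ParseIntError>());
    println!("io::Error:      {} bytes", size_of::<std::io::Error>());
    // ... while the boxed trait object always has the size of two pointers.
    println!("Box<dyn Error>: {} bytes", size_of::<Box<dyn Error>>());

    for input in ["7", "seven", "700"].iter() {
        match parse_small_number(input) {
            Ok(n) => println!("{} -> {}", input, n),
            Err(e) => println!("{} -> error: {}", input, e),
        }
    }
}
```

So the price of the uniform return type is one heap allocation per error and a dynamic method call when the error is inspected, and Rust makes you spell out both.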

Using the Box<dyn> syntax, we can re-implement the error handling like this:

use csv::Reader;

#[derive(Debug, thiserror::Error)]
enum PersonLibError {
    #[error("No column named age")]
    NoColumnNamedAge,
    #[error("Record has no value in column age")]
    RecordHasNoValueInColumnAge,
}

fn get_avg_age() -> Result<f64, Box<dyn std::error::Error>> {
    // open CSV reader
    let mut reader = Reader::from_path("persons.csv")?;

    // find index of the age column
    let headers = reader.headers()?;

    let index_of_age = match headers.iter().enumerate().find(|&x| x.1 == "age") {
        Some(x) => x.0,
        None => return Err(Box::new(PersonLibError::NoColumnNamedAge)),
    };

    // extract column age and convert it to f64
    let records = reader.records();
    let mut count_persons = 0;

    let mut sum_ages: f64 = 0.0;

    for x in records {
        count_persons += 1;

        match x?.get(index_of_age) {
            Some(age) => {
                sum_ages += age.parse::<f64>()?;
            }
            None => return Err(Box::new(PersonLibError::RecordHasNoValueInColumnAge)),
        };
    }

    // calculate average person age
    Ok(sum_ages / (count_persons as f64))
}

fn main() {
    match get_avg_age() {
        Ok(r) => println!("Average age {}", r),
        Err(e) => println!("Error {:?}", e),
    }
}

Note that we could get rid of some values in our PersonLibError, but we still need it to handle the cases where an Option is None. The rest of the code remains mostly unchanged.

Next, we could use Option::ok_or to further reduce some boilerplate (note that you can also use ok_or if you don’t use Box<dyn>):

fn get_avg_age() -> Result<f64, Box<dyn std::error::Error>> {
    // open CSV reader
    let mut reader = Reader::from_path("persons.csv")?;

    // find index of the age column
    let headers = reader.headers()?;

    let index_of_age = headers
        .iter()
        .enumerate()
        .find(|&x| x.1 == "age")
        .ok_or(PersonLibError::NoColumnNamedAge)? // "?" boxes the error for us
        .0;

    // extract column age and convert it to f64
    let mut count_persons = 0;
    let mut sum_ages: f64 = 0.0;

    for x in reader.records() {
        count_persons += 1;

        sum_ages += x?
            .get(index_of_age)
            .ok_or(PersonLibError::RecordHasNoValueInColumnAge)? // boxed by "?"
            .parse::<f64>()?;
    }

    // calculate average person age
    Ok(sum_ages / (count_persons as f64))
}
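Stripped of the CSV handling, the ok_or step looks like this. It is a minimal sketch with a made-up helper index_of_age that works on a plain slice of column names instead of csv headers:

```rust
#[derive(Debug, PartialEq)]
enum PersonLibError {
    NoColumnNamedAge,
}

// Finds the position of the "age" column in a header row.
fn index_of_age(headers: &[&str]) -> Result<usize, PersonLibError> {
    let index = headers
        .iter()
        .position(|&name| name == "age") // Option<usize>
        .ok_or(PersonLibError::NoColumnNamedAge)?; // Option -> Result, then "?"
    Ok(index)
}

fn main() {
    println!("{:?}", index_of_age(&["name", "age", "civil_status"]));
    println!("{:?}", index_of_age(&["name", "civil_status"]));
}
```

ok_or builds its error value even when the Option is Some. If creating the error is expensive, ok_or_else takes a closure instead and only calls it in the None case.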

And our final step would be restoring the full range of possible errors using map_err. At the same time we can ditch the Box<dyn> approach:

use csv::Reader;

#[derive(Debug, thiserror::Error)]
enum PersonLibError {
    #[error("File error")]
    FileError(#[source] csv::Error),
    #[error("CsvParserError")]
    CsvParserError(#[from] csv::Error),
    #[error("No column named age")]
    NoColumnNamedAge,
    #[error("Record has no value in column age")]
    RecordHasNoValueInColumnAge,
    #[error("Cannot parse age value")]
    CannotParseAge(#[from] std::num::ParseFloatError),
}

fn get_avg_age() -> Result<f64, PersonLibError> {
    // open CSV reader
    let mut reader = Reader::from_path("persons.csv").map_err(PersonLibError::FileError)?;

    // find index of the age column
    let headers = reader.headers()?;

    let index_of_age = headers
        .iter()
        .enumerate()
        .find(|&x| x.1 == "age")
        .ok_or(PersonLibError::NoColumnNamedAge)?
        .0;

    // extract column age and convert it to f64
    let mut count_persons = 0;
    let mut sum_ages: f64 = 0.0;

    for x in reader.records() {
        count_persons += 1;

        sum_ages += x?
            .get(index_of_age)
            .ok_or(PersonLibError::RecordHasNoValueInColumnAge)?
            .parse::<f64>()?;
    }

    // calculate average person age
    Ok(sum_ages / (count_persons as f64))
}

fn main() {
    match get_avg_age() {
        Ok(r) => println!("Average age {}", r),
        Err(e) => println!("Error {:?}", e),
    }
}

Would you agree that, with all these “?” at the ends of the lines, this code resembles Perl or APL? On the other hand, Rust consistently takes top places in run-time performance benchmarks, and elaborate error handling might be one of the costs we have to pay (or gladly want to pay) to achieve that.

Summary

The biggest challenge for me currently is this: just by looking at a function name, you never know whether it returns a value, a Result or an Option.

If you know the return type of a Rust function:

  • if it returns a plain value, you don’t need to do anything.
  • if it returns a Result, you add a “?” and check whether the returned Err type is compatible with the error type of the calling function. If it is not, you make it compatible by adding map_err in front of the “?”.
  • if it returns an Option, you add ok_or in front of the “?”.
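Here are the three rules side by side in one small function. Everything in this sketch is invented for illustration: the helper sum_ages, the input format "ages: 17, 19, 43" and the MissingPrefix error value:

```rust
use std::num::ParseFloatError;

#[derive(Debug)]
enum PersonLibError {
    MissingPrefix,
    CannotParseAge(ParseFloatError),
}

// Hypothetical helper: sums the ages in a line such as "ages: 17, 19, 43".
fn sum_ages(line: &str) -> Result<f64, PersonLibError> {
    // trim() returns a plain value: nothing to do.
    let line = line.trim();

    // strip_prefix() returns an Option: ok_or in front of "?".
    let list = line
        .strip_prefix("ages:")
        .ok_or(PersonLibError::MissingPrefix)?;

    let mut sum = 0.0;
    for cell in list.split(',') {
        // parse() returns a Result whose error type is not ours:
        // map_err in front of "?" makes it compatible.
        sum += cell
            .trim()
            .parse::<f64>()
            .map_err(PersonLibError::CannotParseAge)?;
    }
    Ok(sum)
}

fn main() {
    println!("{:?}", sum_ages("ages: 17, 19, 43"));
    println!("{:?}", sum_ages("17, 19, 43"));
    println!("{:?}", sum_ages("ages: 17, nineteen"));
}
```

If a From conversion exists for the error type (for example via #[from]), the map_err can be dropped and the bare “?” is enough.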

P.S.

Both Python and Rust versions still have one (and the same) unhandled error that is extremely probable in production and would cause the process to exit with panic. Can you find it?

P.P.S

It takes Rust 62 ms to execute the main() function 1000 times (in the --release configuration). Python needs 4 seconds for that. Q.E.D.

Rust as an untroubled programming language

Lately I had a chance to implement the same simple prototype in three languages: Python, Go, and Rust.

Python, being my primary programming language in the last 10 years, took me an hour. The version in Go took me a day. And I’ve spent a week (with pauses) to implement the same functionality in Rust.

Frankly speaking, only part of this huge time difference was due to the peculiarities of the Rust language itself. The rest went into working with AWS S3 and Parquet directly instead of using DuckDB (as I did in Python and Go), because DuckDB’s Rust binding doesn’t support the Map data type. So the comparison is not really fair.

Nevertheless, after spending some weeks with Rust (and falling in love with it) I would still say that Rust is an untroubled language: meaning it is a language for untroubled software developers.

You see, I’ve spent almost all of my career (except for one year at Metz) being a troubled developer: I was constantly under stress, trying to meet deadlines, trying to develop software more cheaply, trying to combine user requirements, the interests of the customer, business aspects and cutting-edge tech into one perfect product, trying to achieve something big and historically meaningful before I retire.

An untroubled developer is the opposite of all this. No, it doesn’t mean they don’t care about the business perspective, user satisfaction or deadlines. It means they are just not troubled by these things and take their time to do things right.

When I work on a ticket, I already start thinking about it in the shower and on the way to work, and I often code the first possible solution within hours of starting. This soothes my stress.

When an untroubled developer starts working on a ticket, they first tidy their desk, throw open the windows to let cool, fresh air in, straighten their monitors, sharpen their pencil, open the Moleskine to an empty page, and, first things first, start brewing a cup of coffee.

They don’t have this fear of wasting time.

Sometimes, it is because they are confident that they have enough time.

For me, only this kind of attitude explains why people would want static typing, and ever more static checks (Rust checks more things statically than C++ does). And this is what makes Rust attractive to me, if only out of respect for this consistent striving for statically correct programs (no matter how much more time they cost).

One of the fascinating features of Rust that I want to show today, and I mean it unironically, is the fact that you have to unwrap everything before you can start working with it. No, it is not because the Rust authors watch too many unwrapping videos on YouTube (those are out of fashion anyway, because TikTok won over YouTube and its videos are not so long).

Unwrapping is how Rust handles errors.

To make it usable, programmer-friendly and readable, they had to unify structs and enums, which is in itself a cool idea worth talking about.

Let’s start with the usual structs, tuples, and enums as we all know them from other languages:

#![allow(dead_code)]
#![allow(deprecated)]

struct Person {
    name: String,
    age: u8,
    civil_status: CivilStatus,
    married_date_city: (chrono::NaiveDate, String), //just for fun as tuple
}

enum CivilStatus {
    Single,
    Married,
    Widowed,
}

fn main() {
    let pete = Person {
        name: "Pete".to_string(),
        age: 34,
        civil_status: CivilStatus::Single,
        // we have to provide some "special" values, because there is
        // no null in Rust.
        married_date_city: (
            chrono::NaiveDate::from_ymd(0, 1, 1),
            "(unmarried)".to_string(),
        ),
    };

    println!("{}, we need to talk.", pete.name);
}

We can now make this code prettier, and also marry Pete, because in Rust we can associate tuples with enum members:

#![allow(dead_code)]
#![allow(deprecated)]

struct Person {
    name: String,
    age: u8,
    civil_status: CivilStatus,
}

enum CivilStatus {
    Single,
    Married(chrono::NaiveDate, String),
    Widowed,
}

fn main() {
    let pete = Person {
        name: "Pete".to_string(),
        age: 34,
        civil_status: CivilStatus::Married(
            chrono::NaiveDate::from_ymd(2010, 5, 12),
            "Paris".to_string(),
        ),
    };

    println!("Congratulations, {}!", pete.name);

}

and it gets even more beautiful, because we can also associate structs with enum members:

#![allow(dead_code)]
#![allow(deprecated)]

struct Person {
    name: String,
    age: u8,
    civil_status: CivilStatus,
}

enum CivilStatus {
    Single,
    Married {
        date: chrono::NaiveDate,
        city: String,
    },
    Widowed,
}

fn main() {
    let pete = Person {
        name: "Pete".to_string(),
        age: 34,
        civil_status: CivilStatus::Married {
            date: chrono::NaiveDate::from_ymd(2010, 5, 12),
            city: "Paris".to_string(),
        },
    };

    println!("Congratulations, {}!", pete.name);

}

We can now react differently to Pete’s civil status using pattern matching:

    match pete.civil_status {
        CivilStatus::Single => println!("{}, we must talk!", pete.name),
        CivilStatus::Widowed => println!("We are so sorry, dear {}", pete.name),
        CivilStatus::Married {
            date: petes_date,
            city: _,
        } => println!("{}, remember, you've married on {}", pete.name, petes_date),
    };

Now we have everything ready to see how NULL values and possible errors can be handled in Rust.

For starters, there is no NULL in Rust. We use Option instead, which is either something or nothing:

enum Option<T> { // it is just a generic enum
    Some(T),  // this is a value of type T associated with the enum member
    None
}

So, we can extend Person with a possibility to drive a car like this:


#![allow(dead_code)]
#![allow(deprecated)]

struct Person {
    name: String,
    age: u8,
    civil_status: CivilStatus,
    license_plate: Option<String>,
}

enum CivilStatus {
    Single,
    Married {
        date: chrono::NaiveDate,
        city: String,
    },
    Widowed,
}

fn main() {
    let pete = Person {
        name: "Pete".to_string(),
        age: 34,
        civil_status: CivilStatus::Married {
            date: chrono::NaiveDate::from_ymd(2010, 5, 12),
            city: "Paris".to_string(),
        },
        license_plate: Some("NY202".to_string()),
    };

    match &pete.license_plate {
        Option::None => println!("Hello {}", pete.name),
        Option::Some(car) => println!("Suspect drives car with license plate {}", car),
    }

    // alternatively
    if pete.license_plate.is_some() {
        println!(
            "Suspect drives car with license plate {}",
            pete.license_plate.unwrap()
        )
    }
}

Rust doesn’t take jokes when it comes to handling NULL cases. You have two choices. Either you use pattern matching to write down (explicitly, painfully clearly, in an untroubled state of mind) what has to be done if there is a None value in pete.license_plate. Or you do lazy, sloppy work in a troubled way and just call unwrap() on an Option. This will save you development time, but Rust doesn’t like it when programmers call unwrap() on its Options. So if Rust encounters a None at run time, Rust will kill you. Literally: the program panics and, unless you have taken special precautions, the whole process is terminated.

We can use the same idea for functions that can return errors, using a Result:

enum Result<T, E> {
  Ok(T),
  Err(E)
}

For example:

use std::fs::File;
use std::path::Path;


fn main() {
    let path = Path::new("hello.txt");
    let display = path.display();

    // Open the path in read-only mode, returns `io::Result<File>`
    let _file = match File::open(&path) {
        Err(why) => {
            println!("couldn't open {}: {}", display, why);
            return;
        },
        Ok(file) => file,
    };

    // Alternatively:
    let _file2 = File::open(&path).unwrap(); // be ready to be killed if file cannot be opened
}

Legends say that in the dark ages, the untroubled knights of light and computer science had only unwrap() and pattern matching as their weapons to handle errors in the first Rust editions.

Every single function call that can fail (which is practically every call) had to be unwrapped or pattern-matched.

Fascinatingly beautiful, clean, responsible and consistent: with no workaround left, with no land to retreat to, and with only one possible way, to attack the enemy, singing the fighting songs of the ancestors, and to win. Or to stay forever on the battlefield.

People are weaker in the modern day. In the modern day, people like to hide the frightening chaos of reality in small neat innocent-looking boxes. In the modern day, Rust developers have moved further into the rabbit hole and now we can use the “?” operator.

No, “?” is not a censor mark. The operator is literally called “?”.

More on this and on Traits in the next post. Stay tuned.

Data Scientists and Secret Management

“It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration.”
― Edsger Dijkstra

Already in 1975 it was common knowledge that software development tutorials, primers and courses can teach students bad habits. Nowadays, in my opinion, data science is the area with the biggest discrepancy between the teaching materials and the reality on the job.

It all starts with the tutorial code getting datasets from CSV files. When Data Scientists start their first job, they suddenly realize that the data is not stored in CSV, because that is too slow, too expensive and doesn’t comply with the security measures required by GDPR and internal company protocols.

So the Data Scientist will then retrieve the data using SQL from some database, somewhat like this:

from pyclickhouse import Connection

conn = Connection('db.company.com', user='datascience', password='cfhe8&4!dCM')
cur = conn.Cursor()
cur.select(".....")

Here is why it is a bad idea:

  • As soon as you save your script, the database credentials (which are a particular example of a broader category of secrets) will be stored in a plain, unprotected file somewhere on your disk. Every hacker who can read the contents of your disk will also be able to read all the (potentially) sensitive and protected data from your database. In the worst case, they would also be able to delete all the data from the database. It is not so hard to get access to your files: you might leave your laptop open while you get a new portion of coffee, you might open an email attachment (in reality a virus) from a sender who looks like your boss, and if you don’t install operating system updates as soon as they are available, your computer can be attacked remotely without you noticing anything.
  • Even worse, you could commit your script to a version control system such as git and then push it to GitLab, GitHub or some other publicly available source code repository. Now the hacker doesn’t even need to hack your computer: they just go to GitLab and conveniently read your public source code, without needing to break in anywhere.
  • Do not fool yourself with the idea “I just quickly type the secrets now, test if my script is working, and remove the secrets before I save the work”. No, you won’t. You will forget. Your editor might have an autosave function. Your file might need to be saved before the first run for technical reasons, etc.

First rule of secret management: never ever type your secrets in your source code.

Trying to avoid secrets in the source code, many Data Scientists come up with the idea of storing them in environment variables. Unfortunately, a lot of devops, software engineers and admins do the same and even recommend it. But beware: it is easy to do, but not so easy to do right.

The first naive implementation would be changing the script above like this:

import os
from pyclickhouse import Connection

conn = Connection('db.company.com', user=os.environ['DB_NAME'], password=os.environ['DB_PASSWORD'])
cur = conn.Cursor()
cur.select(".....")

and then run your script like this:

DB_NAME=datascience DB_PASSWORD='cfhe8&4!dCM' python myscript.py

This avoids publishing the secrets to GitLab or another public repository, but it still stores them in a file on your computer. How come? When you type commands in the terminal, most popular shells keep a history of them. You can easily return to the previous command by pressing the up arrow. This history is stored in the file .bash_history (if you use bash as your shell) in your home directory, so reading this file will reveal the secrets to the attacker.

Another naive and wrong solution would be to store the secrets in your .bashrc file. Data Scientists don’t want to type the secrets every time they use them, so they google for “how to set an environment variable permanently”, find the advice to write them into the .bashrc file, and do that. So now the hacker just needs to look at two files, .bash_history and .bashrc: somewhere in these two places they will find the secrets.
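To get a feeling for how exposed such inline assignments are, here is a minimal sketch that scans the text of a shell history for them. The function name find_leaked_assignments and the list of suspicious name fragments are my own assumptions for illustration, not part of any standard tool, and the sample history is made up.

```python
import re

# Name fragments that usually indicate a secret (an assumption of this sketch)
SUSPICIOUS = r"(?:PASSWORD|SECRET|TOKEN|KEY)"
ASSIGNMENT = re.compile(
    r"(?<![A-Za-z0-9_])([A-Z0-9_]*" + SUSPICIOUS + r"[A-Z0-9_]*)=\S+"
)


def find_leaked_assignments(history_text):
    """Return (line_number, variable_name) for every suspicious assignment."""
    findings = []
    for number, line in enumerate(history_text.splitlines(), start=1):
        for match in ASSIGNMENT.finditer(line):
            findings.append((number, match.group(1)))
    return findings


# Made-up history content; the password is a placeholder
sample_history = "\n".join([
    "ls -la",
    "DB_NAME=datascience DB_PASSWORD='not-a-real-password' python myscript.py",
    "export API_TOKEN=abc123",
])
print(find_leaked_assignments(sample_history))
```

To check your own machine, you could pass the contents of .bash_history or .bashrc from your home directory to the same function. An empty result does not prove that nothing has leaked; the sketch only knows the name fragments listed at the top.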

Proper secret management

The solution to the problem is not to store any secrets at all. You, as a Data Scientist, should never receive the secrets (via email, via chat etc.) and so never type them, neither in the source code nor in environment variables. I personally would also never read the secrets for especially sensitive data, so even if a hacker kidnapped and tortured me, I wouldn’t be able to reveal them.

Instead, your Devop will generate secrets for you and store them in a specially designed secret store, for example HashiCorp Vault. You can think of this store as a secure, encrypted key/value database. The keys can be names of your projects, like ‘datascience/RevenuePrediction/Summer2025’, and the values are the secrets needed for this Python script.

To use the secrets, you will change the script as follows:

import os
from getpass import getpass
from hvac import Client
from pyclickhouse import Connection

vault_client = Client(url='https://vault.company.com:8200')

# Authenticate against Vault with your operating system credentials
username = os.environ['USER']
password = getpass('Enter password:')

vault_client.auth.ldap.login(
    username=username,
    password=password,
    mount_point='datascience'
)

# A KV v2 response wraps the stored key/value pairs in ['data']['data']
response = vault_client.secrets.kv.v2.read_secret(
    path='RevenuePrediction/Summer2025',
    mount_point='datascience'
)
secrets = response['data']['data']

conn = Connection(secrets['host'], user=secrets['db_user'], password=secrets['db_password'])
cur = conn.Cursor()
cur.select(".....")

Here is how it works. When you start the script, it determines the username you used to log in to your operating system today. It also asks for your operating system password. This password is never stored persistently and is not displayed on screen. These operating system credentials are then used to authenticate you against Vault. Once this succeeds, Vault knows who you are and which secrets you are allowed to access. The next call reads the secrets from Vault: these secrets are also never permanently stored, but immediately used to authenticate you against the database. So all this time, no secrets are ever stored on your disk; they are only kept in your RAM for a short time.
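One detail that is easy to trip over: a read from Vault’s key/value engine version 2 does not return the stored pairs directly, but an envelope in which they sit under data, then data again, next to some metadata. The following sketch shows only this unpacking step, without talking to a real Vault. The helper name extract_kv2_secret and all values in the sample response are made up for illustration.

```python
def extract_kv2_secret(response):
    """Return the stored key/value pairs from a KV version 2 read response."""
    return response["data"]["data"]


# Trimmed, made-up example of the response shape; all values are placeholders
sample_response = {
    "data": {
        "data": {
            "host": "db.example.internal",
            "db_user": "placeholder-user",
            "db_password": "placeholder-password",
        },
        "metadata": {"version": 1},
    },
}

secrets = extract_kv2_secret(sample_response)
print(sorted(secrets))
```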

A hacker can still steal the secrets if they grab your open laptop with a running script while you are refilling your cup with a new portion of latte, then attach a debugger to your running Python script and read its variables. Nevertheless, you have improved your secret management to the minimal acceptable state-of-the-art (SOTA) level of security, so, well, congratulations.

So I have to enter my password every time I run a script? WTF?

Yep, unless your Devops and Admins provide you with a passwordless security infrastructure and hardware (like a FIDO2 stick), your best behavior would be to enter the password every time you run the script.

Most people don’t exercise their best behavior all the time.

So most people do some “good enough” approach and add the following one line to the script:

import os
from getpass import getpass
from hvac import Client
from pyclickhouse import Connection

vault_client = Client(url='https://vault.company.com:8200')

# Log in only if hvac has not already found a valid token
if not vault_client.is_authenticated():
    username = os.environ['USER']
    password = getpass('Enter password:')

    vault_client.auth.ldap.login(
        username=username,
        password=password,
        mount_point='datascience'
    )

# A KV v2 response wraps the stored key/value pairs in ['data']['data']
response = vault_client.secrets.kv.v2.read_secret(
    path='RevenuePrediction/Summer2025',
    mount_point='datascience'
)
secrets = response['data']['data']

conn = Connection(secrets['host'], user=secrets['db_user'], password=secrets['db_password'])
cur = conn.Cursor()
cur.select(".....")

This line skips authentication if the hvac library thinks you are already authenticated against Vault. And how can you be authenticated? Well, after you open your laptop at the beginning of the day, you just type this in your terminal:

vault login -method ldap username=$USER

And then enter your operating system password, once. After this, you will receive an access token from Vault allowing you to access it for a limited amount of time (like 8 hours). This token will be stored in the .vault-token file in your home directory. The hvac library will look for it there when running your script.

So what gives? I still have some secret stored on my local disk!

Yes, but this token doesn’t contain the secrets. It is only the right to read secrets from Vault, for a limited amount of time. First, if the hacker steals it at night, the token will already have expired. Second, if your laptop gets stolen, you will immediately report this to IT and they will be able to revoke the Vault token (similar to blocking a stolen credit card). Third, the token can be restricted by IP address, so if the hacker copies the token file to their computer, they won’t be able to use it, because it is only valid when used from your laptop, with your IP address.
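The three protections (expiry, revocation, IP binding) can be pictured as three checks that every request with a token has to pass. The sketch below is a toy model of that logic, not Vault’s actual implementation; the Token class, its field names and the addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from ipaddress import ip_address, ip_network


@dataclass
class Token:
    issued_at: datetime
    ttl: timedelta          # how long the token stays valid
    bound_cidr: str         # network the token may be used from
    revoked: bool = False   # set by IT after a theft report


def is_usable(token, now, client_ip):
    """Apply the three checks: not revoked, not expired, used from an allowed IP."""
    if token.revoked:
        return False
    if now >= token.issued_at + token.ttl:
        return False
    return ip_address(client_ip) in ip_network(token.bound_cidr)


morning = datetime(2025, 6, 2, 9, 0)
token = Token(issued_at=morning, ttl=timedelta(hours=8), bound_cidr="10.0.0.15/32")

print(is_usable(token, morning + timedelta(hours=3), "10.0.0.15"))     # your laptop at noon
print(is_usable(token, morning + timedelta(hours=14), "10.0.0.15"))    # at night, after expiry
print(is_usable(token, morning + timedelta(hours=3), "203.0.113.7"))   # copied to another machine
```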

Is this really SOTA?

Yes, this is the industry SOTA for minimal acceptable secret management. Organizations can improve on it by introducing MFA, hardware tokens, and other methods.

As a Data Scientist, all you need to do is to be aware of the secret management protocol of your organization and to follow it as closely as feasible. If your organization doesn’t have any secret management protocol in place, demand that IT or Devops create one.

Do we really need a staging system for data science?

So you are an infrastructure guy (a DevOp or an IT admin) and your data scientist or data engineer has told you they don’t need a staging system.

You know for a fact that any professional software development needs a development system, a staging system and a production system, and that developing in production is utterly dangerous and unprofessional, so you think your data scientist or data engineer is incompetent in software engineering.

Are they really?

A step back

So why did web development come up with these three environments in the first place?

If we develop a web site in production, any changes become immediately visible not only to us, but also to the users. If there is a bug, users will see it. If there is a security vulnerability, hackers will use it. If there is some suboptimal communication in the texts, activists will make a shitstorm out of it. Data can be lost, customers can be lost, the whole company can be lost.

So we need a development system. When the developer makes a change, they can check how it works before releasing it to everybody else. If we have only one developer and one change per week to do, this would be enough, because the testers, product managers, marketing, legal etc. would just check the changes on their development system, while the developer drinks coffee and plays table football.

But if the developer is constantly working on changes, and even more so if there is more than one developer, it makes sense to establish a third system, staging, where all their changes can be integrated, bundled into one release, tested, validated, and checked by internal stakeholders before they get rolled out to production.

All of this makes sense.

How is data science different?

Let’s say we want to create the next generation of the Netflix movie recommender. We want to extract every image frame and soundtrack of each movie, send them to an LLM and ask the model to generate microtags for each movie, such as

  • main_character_wears_orange
  • more_than_three_main_characters
  • movie_happens_in_60ies_in_china
  • bechdel_wallace_test_passed
  • etc

We then want to retrieve, for each user, the movies they have watched and enjoyed before, so that we understand which microtags are important for the user and can recommend them movies with the same microtags.
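To make the idea concrete, here is a minimal sketch of such a microtag-based recommender. It assumes the microtags have already been generated and are available as a dictionary from movie to a set of tags; the function names, the movie names and the simple scoring (count how often each tag appears among the liked movies) are my own assumptions, not a description of any real system.

```python
from collections import Counter


def build_profile(liked_movies, movie_tags):
    """Count how often each microtag occurs among the movies the user enjoyed."""
    profile = Counter()
    for movie in liked_movies:
        profile.update(movie_tags.get(movie, ()))
    return profile


def recommend(profile, movie_tags, already_seen, top_n=3):
    """Rank unseen movies by the summed profile weight of their microtags."""
    scores = {}
    for movie, tags in movie_tags.items():
        if movie in already_seen:
            continue
        score = sum(profile[tag] for tag in tags)
        if score > 0:
            scores[movie] = score
    # Highest score first; ties are broken alphabetically to keep results stable
    return sorted(scores, key=lambda movie: (-scores[movie], movie))[:top_n]


# Made-up catalogue using the microtags from the list above
movie_tags = {
    "movie_a": {"main_character_wears_orange", "bechdel_wallace_test_passed"},
    "movie_b": {"main_character_wears_orange"},
    "movie_c": {"more_than_three_main_characters"},
    "movie_d": {"main_character_wears_orange", "bechdel_wallace_test_passed"},
    "movie_e": {"movie_happens_in_60ies_in_china"},
}
liked = ["movie_a", "movie_b"]

profile = build_profile(liked, movie_tags)
print(recommend(profile, movie_tags, already_seen=set(liked)))
```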

So we download a small subsample of the dataset (e.g. 15 minutes of video) to our development workstation and code the script that reads the data and uses the LLM. When it works, we want to execute it on a bigger subset of movies (e.g. 100 movies) and see how reasonable the generated microtags are. We can’t do that on the development workstation, because it would take years there; it has to be parallelized on some fat compute cluster so that we have a chance to finish this project before we retire.

So we deploy our training app to a compute cluster and call it the production environment. It is still internal, so we don’t need stakeholders to check it, and if it fails in the middle, we can fix the bugs and restart the process, and no external users will be affected. So it has the freedom of a development environment and the hardware costs of a production environment.

When we finish with that, we want to evaluate this model v0.1. To do this, we load the history of watched movies of some user, split it in half, use the first half to generate recommendations and check whether the recommendations are present in the second half of their history. First, we develop this evaluation code on our development system and check it with one or two users. Then we deploy it to some fat compute cluster so that we can repeat the calculation for all, or at least for many, users. So, again, this is not staging, because the system won’t contain a release candidate of the final recommender service, but instead some batch code constantly loading data and calculating the evaluation metrics.
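The split-in-half evaluation described above can be sketched in a few lines. I assume here that a user’s history is a chronologically ordered list of movie identifiers and that the recommender is any function taking the first half and returning a ranked list; the metric (share of recommendations found in the second half) is one possible choice, and all names are hypothetical.

```python
def hit_rate(history, recommend_fn, top_n=10):
    """Share of the top recommendations that appear in the held-out half."""
    half = len(history) // 2
    seen, held_out = history[:half], set(history[half:])
    recommendations = recommend_fn(seen)[:top_n]
    if not recommendations:
        return 0.0
    hits = sum(1 for movie in recommendations if movie in held_out)
    return hits / len(recommendations)


def mean_hit_rate(histories, recommend_fn, top_n=10):
    """Average the per-user hit rate over many users."""
    rates = [hit_rate(history, recommend_fn, top_n) for history in histories]
    return sum(rates) / len(rates) if rates else 0.0


def toy_recommender(seen_movies):
    # Stand-in for the real model: always suggests the same two movies
    return ["m3", "m9"]


print(hit_rate(["m1", "m2", "m3", "m4"], toy_recommender))
print(mean_hit_rate([["m1", "m2", "m3", "m4"], ["m5", "m6", "m7", "m8"]], toy_recommender))
```

On the compute cluster, the same function would simply be mapped over the histories of many users in parallel.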

The first evaluation will deliver a baseline for our quality metrics. We can now start the actual data science work: tweak some model parameters, use another model, do “prompt engineering”, pre-process our movie data with some filters or even some other AI models, etc. We will then iteratively deploy our models v0.2, v0.3, …, v0.42 to our compute cluster, train them and then evaluate them. From time to time we will need some other training data to avoid overfitting.

At some point we will be satisfied with the model and want to productize it. For this, we need two things. First, we need to train this model on all movies. For this, we scale out our compute cluster so that the training takes a month, not years. This will cost a lot in that month, but only in that month, and it will shorten our time to market.

And second, we need to develop the recommender microservice that can then be called from the frontend or backend of the movie player.

What happens if this microservice crashes in production? Well, we should configure the API gateway to forward the requests to a backup microservice in this case. This can be some stupidly simple recommender, for example one that always delivers the most popular movies of the day. So we don’t really need to test crashes of the service before going live. A couple of unit tests and a quick test on the development system would be good enough.
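In the setup described here the re-routing lives in the API gateway, but the logic itself is tiny, and it can help to see it as code. The following sketch does the same thing in-process; the function names and the hard-coded list of popular movies are placeholders of mine.

```python
def most_popular_movies_of_the_day():
    # Hypothetical placeholder: a real backup service would read this from a cache
    return ["popular_movie_1", "popular_movie_2", "popular_movie_3"]


def recommend_with_fallback(user_id, primary, fallback=most_popular_movies_of_the_day):
    """Ask the primary recommender; on any failure, serve the simple fallback."""
    try:
        return primary(user_id)
    except Exception:
        # Any failure of the primary recommender triggers the fallback
        return fallback()


def crashing_recommender(user_id):
    # Simulates the smart recommender being down
    raise RuntimeError("model server is down")


def healthy_recommender(user_id):
    return ["personal_pick_for_" + user_id]


print(recommend_with_fallback("user-42", healthy_recommender))
print(recommend_with_fallback("user-42", crashing_recommender))
```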

Can our microservice have a security bug? Well, the infra team should establish a template for microservices that already handles all security matters, so if we use this template, we should be reasonably safe too, given that this is an internal service behind the API gateway. It is vulnerable to DDoS, but the API gateway should have DDoS protection for all microservices anyway.

Our microservice doesn’t communicate with the end user, so there is no need for copywriters to check the spelling. Its legality is settled before we start development and we don’t need a staging system to test it. Also, if it loses data or starts recommending wrong movies, the worst thing that can happen is that the users see some bad recommendations. But they are seeing bad recommendations all the time anyway.

Summary

  • Data science apps cannot be fully developed on a development workstation. They are trained, tested, and fine-tuned on production, with production-grade hardware, full production data and sometimes with real production users.
  • In most cases, data science apps cannot benefit from manual or automated QA, because their quality depends on the quality of the input data. Testers behave differently from real users (or simply randomly), so most probably they will only get random or garbage recommendations back.
  • QA can test whether a recommender service responds with anything at all. But because the recommender service is computationally intensive, the most probable crash scenario is not a single tester calling it, but rather a great load of user requests. In this case, the API gateway should re-route to some other, simple service anyway. Note that the API gateway is not something data science teams should be responsible for. It can be developed and tested (on staging) by the infra team.
  • Data science apps don’t have fixed texts that can be spell-checked or approved by the legal team. Either they don’t have texts at all, or (if it is a generative model) they will generate an infinite number of texts, so that we can’t check them all in staging. We rather need to design our product in a way that makes us immune to spelling errors and legal issues. Yes, AI ethics engineering exists, but we mere mortals who don’t work at FAANG cannot afford it anyway.

A three minute guide to Ansible for data engineers

Ansible is a DevOps tool. Data Engineers can be curious about the tech and tools used by DevOps, but rarely have more than 3 minutes to spend on a quick look over the fence.

Here is your 3-minute read.

To deploy something onto an empty server, we first need to install Python there. So we open our Terminal and use ssh to connect to the server:

localhost:~$ ssh myserver.myorg.com
maxim@myserver.myorg.com's password: **********

Welcome to Ubuntu 22.04.4 LTS (GNU/Linux 6.5.0-26-generic x86_64)



maxim@myserver.myorg.com:~$ 

Note that after the first connection, the DevOp will probably configure ssh client certificate as well as some sudoers magic, so that we don’t need to enter the password every time we use ssh and therefore we can use ssh in scripts doing something automatically. For example, if we need to install pip on the remote server, we can do

ssh -t myserver.myorg.com sudo apt install python3-pip

Now, let’s say we have received 10 brand-new empty servers to build our compute cluster, and we need to configure all of them in the same way. You cannot pass a list of servers to the ssh command above, but you can pass one to Ansible. First, we create a simple file hosts.yml storing our server names (our inventory):

my-compute-cluster:
  hosts:
    myserver-1.myorg.com:
    myserver-2.myorg.com:
    # ...

some-other-group-of-servers:
  hosts:
    database-01:
    database-02:

Now we can install pip on all compute servers (even in parallel!) by one command:

ansible -i hosts.yml my-compute-cluster -m shell -a "sudo apt install -y python3-pip"

This is already great, but executing commands like this has two drawbacks:

  • You need to know the state of each server in the my-compute-cluster group. For example, you cannot mount a disk partition before you format it, so you have to remember whether the partitions have already been formatted or not.
  • The state of all the servers has to be the same. If you have 5 old servers and one new one, you want to format and mount the disk partition on the new one, and under no circumstances do you want to format the partitions of the old servers.

To solve this, Ansible provides modules that do not blindly execute a command, but first check the current state and skip the execution if it is not necessary (so it is not “create”, it is “ensure”). For example, to ensure that the disk partition /dev/sdb is formatted with the ext4 file system, you call

ansible -i hosts.yml my-compute-cluster -m filesystem -a "fstype=ext4 dev=/dev/sdb"

This command won’t touch the old servers and will only do something on the new one.
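The difference between “create” and “ensure” is easy to show in a toy model. In the sketch below a server is just a dictionary that remembers which device carries which file system; ensure_filesystem is a hypothetical function of mine that mimics the check-first behaviour and reports whether it changed anything. Refusing to overwrite a different existing file system is a design choice of this sketch.

```python
def ensure_filesystem(server, dev, fstype):
    """Format dev with fstype only if needed; return True if something changed."""
    current = server["filesystems"].get(dev)
    if current == fstype:
        return False  # already in the desired state, nothing to do
    if current is not None:
        # Never silently destroy data on a device formatted differently
        raise RuntimeError(f"{dev} already contains {current}")
    server["filesystems"][dev] = fstype  # stands in for the real formatting
    return True


old_server = {"name": "myserver-1", "filesystems": {"/dev/sdb": "ext4"}}
new_server = {"name": "myserver-6", "filesystems": {}}

for server in (old_server, new_server):
    changed = ensure_filesystem(server, "/dev/sdb", "ext4")
    print(server["name"], "changed" if changed else "ok")
```

Running the loop a second time reports “ok” for both servers, which is exactly the property that makes it safe to apply the same command to old and new machines alike.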

Usually, when preparing a server to host your data pipeline, several configuration steps are required (OS patches need to be applied, software needs to be installed, security must be hardened, data partitions mounted, monitoring established). So instead of a bash script with commands such as the ones above, Ansible provides a comfortable and readable role format in YAML. The following role prepare-compute-server.yml will, for example, update the OS, install pip, and format and mount a filesystem:

- name: Upgrade OS
  apt:
    upgrade: yes

- name: Update apt cache and install python3 and pip
  apt:
    update_cache: yes
    pkg:
    - python3
    - python3-pip

- name: format data partition
  filesystem:
    fstype: ext4
    dev: /dev/sdb

- name: mount data partition
  mount:
    path: /opt/data
    src: /dev/sdb
    fstype: ext4
    state: mounted
 

Roles such as this are meant to be reusable building blocks and shouldn’t really depend on which rollout project you are currently doing. To facilitate this, you can use placeholders and pass parameters to the roles using Jinja2 syntax. You also have loops, conditional execution and error handling.

To do a particular rollout, you would usually write a playbook, where you specify which roles have to be executed on which servers:

- hosts: my-compute-cluster
  become: true    # indication to become root on the target servers
  roles:
    - prepare-compute-server

You can then commit the playbook to your favorite version control system, to keep track of who did what and when, and then execute it like this

ansible-playbook -i hosts.yml rollout-playbook.yml

Ansible has a huge ecosystem of modules that you can install from Ansible Galaxy (similar to PyPI), and many more features. The most notable one is that instead of a static inventory of your servers, you can write a script that fetches your machines from some API, for example the EC2 instances from your AWS account.
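Such an inventory script is just an executable that prints JSON when Ansible calls it with --list. Below is a minimal sketch of one. The fetch_hosts function is a hypothetical stand-in for the real API call (it returns hard-coded names here), and the host names are the ones from the static inventory above.

```python
#!/usr/bin/env python3
import argparse
import json


def fetch_hosts():
    # Hypothetical stand-in for a call to your cloud provider or CMDB API
    return {"my-compute-cluster": ["myserver-2.myorg.com", "myserver-1.myorg.com"]}


def build_inventory(groups):
    """Convert {group: [hosts]} into the JSON structure Ansible expects."""
    inventory = {name: {"hosts": sorted(hosts)} for name, hosts in groups.items()}
    # An empty _meta.hostvars tells Ansible it need not ask about each host separately
    inventory["_meta"] = {"hostvars": {}}
    return inventory


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--list", action="store_true")
    parser.add_argument("--host")
    args, _ = parser.parse_known_args(argv)  # ignore any other arguments
    if args.host:
        print(json.dumps({}))  # this sketch defines no per-host variables
    else:
        print(json.dumps(build_inventory(fetch_hosts()), indent=2))


if __name__ == "__main__":
    main()
```

Assuming the file is saved as inventory.py and made executable, it can be passed to the -i option in place of hosts.yml.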

Alternatives to Ansible are Terraform, Puppet, Chef and Salt.

How to make yourself to like YAML

In 2003, when I was employed at straightec GmbH, my boss was one of the most brilliant software engineers I’ve ever met. I learnt a lot from him and I still apply most of his philosophy and guiding principles in my daily work. To get an impression of him, it’s enough to know that we used Smalltalk to develop a real-world, successful commercial product.

One of his principles was, “if you need a special programming language to write your configuration, it means your main development language is crap”. He would usually proceed to demonstrate that a configuration file written in Smalltalk is at least no worse than a file written in INI format or in XML.

So, naturally, I was prejudiced against JSON and YAML, preferring to keep my configurations and my infrastructure scripts in my main programming language, in this case Python.

Alas, life forces you to abandon your principles from time to time, and the need to master and to use Ansible and Kubernetes has forced me to learn YAML.

Here is how you can learn YAML if you are against it in principle, but have to learn it anyway.

This is YAML for a string

Some string

and this is for a number

42

And this is a one-level dictionary with string as keys and strings or numbers as values

key1: value1
key2: 3.1415
key3: "500" # if you want to force the type to be string, use double quotes

Next is a one-dimensional array of strings

- item 1
- item 2
- third item
- строки могут быть utf-8

Nested dictionaries

data_center:
  rack_8:
    slot_0:
      used: 1
      power_consumption: 150  
    slot_1:
      used: 1
      power_consumption: 2  
    slot_2:
      used: 0

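Some tools that consume YAML (Spring Boot, for example) let you glue nested levels together as shown here. Note that this is a convention of the consuming tool: a plain YAML parser just sees one key with dots in its name.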

data_center.rack_8.slot_0:
      used: 1
      power_consumption: 150    
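A plain YAML parser reads the glued form as a dictionary with a single key named data_center.rack_8.slot_0. Tools that support the shortcut expand such keys themselves, roughly like the following sketch does. The function expand_dotted_keys is my own illustration, not part of any YAML library, and the input dictionary is written out by hand to stand for the parser’s output.

```python
def expand_dotted_keys(flat):
    """Turn {'a.b.c': value} into {'a': {'b': {'c': value}}}."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            # Walk down, creating intermediate dictionaries as needed
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# What a plain YAML parser would return for the glued example
parsed = {"data_center.rack_8.slot_0": {"used": 1, "power_consumption": 150}}
print(expand_dotted_keys(parsed))
```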

A dictionary having an array as some value

scalar_value: 123
array_value:
  - first
  - second
  - third
another_way_defining_array_value: ["first", "second", "third"]

Now something that I often got wrong (and still get wrong from time to time): an array of dictionaries. Each dictionary has two keys: “name” and “price”

- name: Chair
  price: 124€       # note that price is indented and stays directly under name. 
                    # any other position of price is incorrect.
- name: Table
  price: 800€       

# note that there is nothing special in the "name", you can use any key to be first:

- price: 300€
  name: Another table       

- price: 12€
  name: Plant pot
       

Just to be perfectly clear, the YAML above is equivalent to the following JSON

[
  {
    "name": "Chair",
    "price": "124€"
  },
  {
    "name": "Table",
    "price": "800€"
  },
  {
    "price": "300€",
    "name": "Another table"
  },
  {
    "price": "12€",
    "name": "Plant pot"
  }
]

Finally, you can put several objects into one file by delimiting them with ---

---
type: Container
path: some/url/here
replicas: 1
label: my_app
---
type: LoadBalancer
selector: my_app
port: 8080

This should cover about 80% of your needs when writing simple YAML.

If you want to continue learning YAML, I recommend reading about

  • how to write a string spanning several lines
  • how to reference one object from another inside the same YAML
  • and the Jinja2 templating syntax that is very often used together with YAML

How to stop fearing and start using Kubernetes

The KISS principle (keep it simple, stupid) is important for modern software development, and even more so in Data Engineering, where, due to big data and big costs, every additional system or layer without clear benefits can quickly generate waste and lose money.

Many data engineers are therefore wary when it comes to rolling out Kubernetes in their operational infrastructure. After all, 99.999% of the organizations out there are not Google, Meta, Netflix or OpenAI, and for their tiny gigabytes of data and two or three data science-related microservices running as prototypes internally on a single hardware node, just bare Docker (or at most, docker-compose) is more than adequate.

So, why Kubernetes?

Before answering this question, let me show you how flat the learning curve of modern Kubernetes is at the start.

First of all, we don’t need the original k8s; we can use the simple and reasonable k3s instead. To install a fully functional cluster, just log in to a Linux host and execute the following:

curl -sfL https://get.k3s.io | K3S_KUBECONFIG_MODE="644" sh -s - --secrets-encryption

You can then execute

kubectl get node

to check if the cluster is running.

Now, if you have a Docker image with a web service inside (for example implemented with Python and Flask) listening on port 5000, you only need to create the following YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-microservice
  labels:
    app: my-microservice
spec:
  selector:
    matchLabels:
      app: my-microservice
  replicas: 1
  template:
    metadata:
      labels:
        app: my-microservice
    spec:
      containers:
        - name: my-microservice
          image: my-artifactory.repo/path-to-docker-image
          ports:
            - containerPort: 5000

---

kind: Service 
apiVersion: v1 
metadata:
  name: my-microservice
spec:
  type: LoadBalancer
  selector:
    app: my-microservice
  ports:
    - port: 5000
      targetPort: 5000

Conceptually, Kubernetes manages the computing resources of the nodes belonging to the cluster in order to run Pods. A Pod is something akin to a Docker container. Usually, you don’t create Pods manually. Instead, you create a Deployment object, and it takes care of starting the defined number of Pods, watching their health and restarting them if necessary. So, in the first object defined above, of kind Deployment, we define a template, which is used whenever the Deployment needs to run yet another Pod. As you can see, inside the template you specify the path to the Docker image to run. There, you can also specify everything else necessary for Docker to run it: environment variables, volumes, command line, etc.
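The “watch and restart” behaviour boils down to a loop that compares the desired number of Pods with the healthy ones and fills the gap. The sketch below is a deliberately naive toy model of one pass of such a loop, not the way Kubernetes is actually implemented; the function reconcile, the dictionary layout and the generated Pod names are all made up.

```python
from itertools import count


def reconcile(desired_replicas, pods, new_names):
    """One pass of a toy control loop: drop unhealthy pods, start missing ones."""
    running = [pod for pod in pods if pod["healthy"]]
    while len(running) < desired_replicas:
        # "Starting" a pod from the template is modelled as adding a healthy entry
        running.append({"name": next(new_names), "healthy": True})
    return running[:desired_replicas]


names = (f"my-microservice-{i}" for i in count(1))
pods = [
    {"name": "my-microservice-a", "healthy": True},
    {"name": "my-microservice-b", "healthy": False},  # crashed
]

pods = reconcile(2, pods, names)
print([pod["name"] for pod in pods])
```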

A Kubernetes cluster assigns IP addresses from its very own IP network to the Pods running there, and because your company network usually doesn’t know how to route to this network, the microservices are not accessible by default. You make them accessible by creating another Kubernetes object of kind Service. There are different types of Services, but for now all you need to know is that if you set the type to LoadBalancer, k3s will expose your microservice to the rest of your corporate network: its built-in service load balancer (ServiceLB) opens the service port on the corporate network IP address of the node and forwards the communication to the corresponding Pod.

Now that we have our YAML file, we can roll out our tiny happy microservice to our Kubernetes cluster with

kubectl apply -f my-microservice.yaml

We can check whether it is running, watch its logs or get shell access to the running Docker container with

kubectl get pod
kubectl logs -f pod/my-pod-name-here
kubectl exec -it pod/my-pod-name-here -- bash

And if we don’t need our service any more, we just delete it with

kubectl delete -f my-microservice.yaml

Why Kubernetes?

So far, we haven’t seen any advantages compared to Docker, have we?

Well, yes, we have:

  • We’ve got a watchdog that monitors Pods and can (re)start them, for example after a server reboot or if they crash for any reason.
  • If we have two hardware nodes, we can deploy our Pods with “replicas: 2”, and because we already have a load balancer in front of them, we get high availability almost for free.
  • If the microservice supports scaling by running several instances in parallel, we already get a built-in, industrial-grade load balancer for scaling out.

Besides, hosting your services in Kubernetes has the following advantages:

  • If at some point you need to hand over your internal prototypes to a separate DevOps team for professional operations, they will hug you to death when they learn your service is already kubernetized.
  • If you need to move your services from on-premises into the cloud, the effort to migrate to, for example, Amazon ECS is much higher than the changes you need to make to go from k3s to Amazon EKS.
  • You can execute batch workflows on a time schedule with a CronJob object, without needing access to /etc/crontab on the hardware nodes.
  • You can define DAGs (directed acyclic graphs) for complicated workflows and pipelines using Airflow, Prefect, Flyte, Kubeflow or other Python frameworks that will deploy and host your workflow steps on Kubernetes for you.
  • You can deploy HashiCorp Vault or another secret manager to Kubernetes and manage your secrets in a professional, safer way.
  • If your microservices need some standard, off-the-shelf software like Apache Kafka, RabbitMQ, Postgres, MongoDB, Redis, ClickHouse, etc., they can all be installed into Kubernetes with one command, and deploying additional cluster nodes is just a matter of changing the number of replicas in the YAML file.
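
As an illustration of the CronJob point, here is a minimal sketch of such an object. It is again generated as JSON from Python; the name nightly-batch, the schedule, the image path and the command are hypothetical placeholders.

```python
import json

# A CronJob that starts a batch container every night at 02:00.
# Name, schedule, image path and command are hypothetical placeholders.
cron_job = {
    "apiVersion": "batch/v1",
    "kind": "CronJob",
    "metadata": {"name": "nightly-batch"},
    "spec": {
        # The same five time fields as in a crontab line:
        # minute, hour, day of month, month, day of week.
        "schedule": "0 2 * * *",
        "jobTemplate": {
            "spec": {
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "nightly-batch",
                                "image": "my-artifactory.repo/path-to-batch-image",
                                "command": ["python", "run_batch.py"],
                            }
                        ],
                        # Jobs cannot use the default restart policy "Always".
                        "restartPolicy": "OnFailure",
                    }
                }
            }
        },
    },
}

cron_job_json = json.dumps(cron_job, indent=2)
print(cron_job_json)
```

The printed manifest can be applied with kubectl apply -f like any other object.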

Summary

Even if you only need to host a couple of prototypes and microservices, Kubernetes will immediately improve their availability and, more importantly, will be a future-proof, secure, scalable and standardized foundation for coming operational challenges.

Now that you’ve seen how easy the entry into the world of Kubernetes is, you no longer have the “steep learning curve” as an excuse for not using Kubernetes today.

How to choose a database for data science

“I have five CSV files, 1 GB each, and loading them with Pandas is too slow. What database should I use instead?”

I often get similar questions from data scientists. Continue reading to obtain a simple conceptual framework that lets you answer this kind of question yourself.

Conceptual framework

The time a data operation takes depends on the amount of data and the throughput of the hardware, as well as on the number of operations you can do in parallel. Roughly:

Time ~ DataSize / (Throughput × Parallelization)

The throughput is the easiest part, because it is defined and limited by the electronics and their physical constraints. Here are the data sources from slow to fast:

  • Internet (for example, S3)
  • Local Network (for example, NAS, also hard drives attached to your local network)
  • (mechanical) hard drives inside of your computer
  • SSD inside of your computer
  • RAM
  • CPU Cache memory / GPU memory if you use GPU for training
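
As a back-of-the-envelope illustration, the sketch below estimates how long it would take to read the 5 GB from the opening question from different media. The throughput figures are assumed, rough orders of magnitude chosen for illustration only, not measurements, and the calculation (size divided by throughput times the number of parallel readers) is an idealization that ignores latency and overhead.

```python
# Assumed sequential read throughputs in MB/s. These are illustrative
# orders of magnitude, not benchmarks of any particular hardware.
ASSUMED_THROUGHPUT_MB_S = {
    "internet": 50,
    "local network": 100,
    "hard drive": 150,
    "SSD": 2_000,
    "RAM": 20_000,
}


def estimate_seconds(size_mb: float, throughput_mb_s: float, parallel: int = 1) -> float:
    """Idealized read time: data size / (throughput * degree of parallelism)."""
    return size_mb / (throughput_mb_s * parallel)


size_mb = 5 * 1000  # five CSV files of roughly 1 GB each

for medium, throughput in ASSUMED_THROUGHPUT_MB_S.items():
    seconds = estimate_seconds(size_mb, throughput)
    print(f"{medium:>13}: {seconds:8.2f} s")
```

Even with such crude numbers, the ranking makes it clear why moving data to a faster medium is usually the first thing to try.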

You can reduce the data processing time by moving the data to a faster medium. Here is how:

  • Download the data from the internet to your hard drive, or to a NAS in your local network if the data doesn’t fit on your machine.
  • As soon as you have read a file once, your OS keeps its data in RAM in the file cache for further reads. Note that the data will eventually be evicted from memory, because RAM is usually much smaller than the hard drive.
  • If you want to prevent eviction of particular files that fit in memory, you can create a RAM drive and copy the files there.

There are a couple of tricks to reduce data size:

  • Compress the data with any lossless compression (zip, gzip, etc.). You will still need to decompress it before working on it, but reading from a slow data source like a hard drive will be quicker because there is less data to read, and the decompression happens in fast RAM.
  • Partition the data, for example by month or by customer. If you only need to run a query related to one month, you skip reading the data for the other months. Even if you need the full data (for example, for ML training), you can spread your data partitions across different computers and process them simultaneously (see parallelization below).
  • Store the data not row by row, but column by column. That’s the same idea as partitioning, but column-wise. If a particular query doesn’t need some column, a column-based file format lets you skip reading it, which reduces the amount of data.
  • Store additional data (an index) that helps you find rows and columns inside your main file. For example, you can create a full-text index over the columns of your data that contain free English text. Then, if you only need the rows containing the word “dog”, the index lets you read from storage only the bytes of those rows, which reduces the amount of data to be read. This is where Computer Science puts its focus: computer scientists get most excited about inventing additional data structures for especially fast indexes. For Data Science, this is rarely helpful, because we often need to read all the data anyway (e.g. for training).
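
The first two tricks can be tried out with nothing but the Python standard library. The following is only a sketch on synthetic, made-up data (month, customer, amount); the file names and the month=... naming scheme are assumptions chosen for the demo.

```python
import csv
import gzip
import os
import tempfile

# Synthetic data: (month, customer, amount). Purely made up for the demo.
rows = [
    (f"2024-{m:02d}", f"customer-{c}", c * m)
    for m in range(1, 13)
    for c in range(1000)
]

workdir = tempfile.mkdtemp()

# 1) One uncompressed file versus one gzip-compressed file.
plain_path = os.path.join(workdir, "all.csv")
with open(plain_path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

gzip_path = os.path.join(workdir, "all.csv.gz")
with gzip.open(gzip_path, "wt", newline="") as f:
    csv.writer(f).writerows(rows)

print("plain bytes:", os.path.getsize(plain_path))
print("gzip bytes: ", os.path.getsize(gzip_path))

# 2) Partition by month: one compressed file per month.
by_month = {}
for row in rows:
    by_month.setdefault(row[0], []).append(row)

for month, month_rows in by_month.items():
    part_path = os.path.join(workdir, f"month={month}.csv.gz")
    with gzip.open(part_path, "wt", newline="") as f:
        csv.writer(f).writerows(month_rows)

# A query about March only opens the March partition and skips the rest.
march_path = os.path.join(workdir, "month=2024-03.csv.gz")
with gzip.open(march_path, "rt", newline="") as f:
    march_total = sum(int(amount) for _, _, amount in csv.reader(f))

print("March total:", march_total)
```

Column-based formats such as Parquet combine both ideas and add column skipping on top, so in practice you would rarely hand-roll this.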

As for parallelization, it is a solution for the situation where your existing hardware is under-used and you want to speed things up by fully using it, or where you can provision more hardware (and pay for it) to speed up your processes. At the same time, this is also the most complicated factor to optimize, so don’t go there before you have tried the stuff above.

  • Parallelization on the macro level: you can split your training set across several servers, read the data and train the model simultaneously on each of them, calculate weight updates with backpropagation, apply them to a central storage of weights and then redistribute the weights back to all servers.
  • Parallelization on the micro level: put several examples from your training set onto the GPU at once. Similar ideas are used on a small scale by databases and by frameworks like NumPy, where this is called vectorization.
  • Parallelization on the hardware level: you can attach several hard drives to your computer in a so-called RAID array, and when you read from them, the hardware ensures that it reads from all of the hard drives in parallel, multiplying your hardware throughput.
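
The macro-level idea can be sketched in a few lines. This is a toy simulation under loud assumptions: threads stand in for separate servers, the model is a one-dimensional linear regression instead of a neural network, and the gradient is written out by hand rather than obtained by backpropagation. All names and numbers are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy training set for y = 2x + 1, split into equally sized shards
# (one shard per hypothetical server).
data = [(k / 10, 2 * (k / 10) + 1) for k in range(40)]
num_servers = 4
shards = [data[i::num_servers] for i in range(num_servers)]


def shard_gradient(shard, w, b):
    """Gradient of the mean squared error on one shard (one 'server')."""
    n = len(shard)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b


w, b = 0.0, 0.0  # central copy of the weights
learning_rate = 0.05

with ThreadPoolExecutor(max_workers=num_servers) as pool:
    for step in range(1000):
        # Every server receives the current weights and works on its own shard.
        grads = list(pool.map(lambda s: shard_gradient(s, w, b), shards))
        # The central storage averages the updates and applies them;
        # the next iteration redistributes the new weights.
        w -= learning_rate * sum(g[0] for g in grads) / num_servers
        b -= learning_rate * sum(g[1] for g in grads) / num_servers

print(f"w={w:.3f}, b={b:.3f}")
```

Because the shards have equal size, averaging the per-shard gradients gives the same update as one pass over the full data set. The synchronization after every step is exactly the part that gets complicated and expensive on real clusters.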

Parallelization is sometimes useless and/or expensive and hard to get right, so this is your last resort.

Some practical solutions

A database is a combination of a smart compressed file format, an engine that can read and write it efficiently, and an API so that several people can work on it simultaneously, data can be replicated between database nodes, etc. When in doubt, use ClickHouse for data science, because it provides top performance for many different kinds of data and use cases.

But databases are the top-notch full package, and often you don’t need all of it, especially if you work alone on some particular dataset. Luckily, separate parts of databases are also available on the market. They go by various names like embedded database, database engine, in-memory analytics framework, etc.

An embedded database like DuckDB, or even just a file format like Parquet (Pandas can read it using fastparquet), is in my opinion the most interesting option and might already be enough.

Here is a short overview:

| | ClickHouse | DuckDB | Fastparquet |
|---|---|---|---|
| Data compression | Native compressed format with configurable codecs. Also some basic support of Parquet. | Native compressed format. Also supports Parquet. | Configurable compression codecs. |
| Partitioning | Native partitioning. Also basic partitioning support for Parquet. | Some partitioning support for Parquet. | Manual partitioning using different file names. |
| Column-based storage | Native and Parquet, as well as other file formats. | Native and Parquet. | Parquet is a column-based format. |
| Indexes | Primary and secondary indexes support for native format. No index support for Parquet (?). | Min-Max index for native format. No index support for Parquet (?). | Not supported by fastparquet to my knowledge, but Parquet as file format has indexes. |