“It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration.”
-- Edsger Dijkstra
Already in 1975 it was common knowledge that software development tutorials, primers, and courses can teach students bad habits. Nowadays, in my opinion, data science is the field with the widest gap between the teaching materials and the reality on the job.
It all starts with tutorial code loading datasets from CSV files. When Data Scientists start their first job, they suddenly realize that the data is not stored in CSV files, because CSV is too slow, too expensive, and doesn’t comply with the security measures required by GDPR and internal company protocols.
So the Data Scientist then retrieves the data with SQL from some database, something like this:
from pyclickhouse import Connection
conn = Connection('db.company.com', user='datascience', password='cfhe8&4!dCM')
cur = conn.Cursor()
cur.select(".....")
Here is why it is a bad idea:
- As soon as you save your script, the database credentials (one particular example of the broader category of secrets) are stored in a plain, unprotected file somewhere on your disk. Every hacker who can read the contents of your disk will also be able to read all the (potentially) sensitive and protected data in your database. In the worst case, they will also be able to delete all the data from the database. It is not so hard to get access to your files: you might leave your laptop open while you get another cup of coffee, you might open an email attachment (in reality a virus) from a sender who looks like your boss, and if you don’t install operating system updates as soon as they are available, your computer can be attacked remotely without you noticing anything.
- Even worse, you could commit your script to a version control system such as git and then push it to GitLab, GitHub, or some other publicly accessible source code repository. Now the hacker doesn’t even need to hack your computer; they just go to GitLab and conveniently read your public source code, without having to break in anywhere.
- Do not fool yourself with the idea “I’ll just quickly type in the secrets now, test whether my script works, and remove the secrets before I save my work”. No, you won’t. You will forget. Your editor might have an autosave feature. Your file might need to be saved before the first run for technical reasons, and so on.
First rule of secret management: never ever type your secrets in your source code.
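To make the rule concrete, here is a minimal sketch of how little effort it takes to find a hardcoded secret once someone can read your source. It walks the syntax tree of a script and reports string literals passed to credential-like keyword arguments. The function name `find_hardcoded_secrets` and the list of suspicious names are assumptions made for this illustration; this is not a complete secret scanner.

```python
import ast

# Keyword names treated as credential-like (an illustrative, incomplete list).
SUSPICIOUS_NAMES = {"password", "passwd", "secret", "token", "api_key"}


def find_hardcoded_secrets(source):
    """Return (line number, keyword name) for each call that passes a
    non-empty string literal to a keyword argument with a suspicious name."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            is_literal = (
                isinstance(kw.value, ast.Constant)
                and isinstance(kw.value.value, str)
            )
            if kw.arg in SUSPICIOUS_NAMES and is_literal and kw.value.value:
                findings.append((kw.value.lineno, kw.arg))
    return sorted(findings)


# The script is only parsed, never executed, so pyclickhouse is not needed here.
bad_script = '''
from pyclickhouse import Connection
conn = Connection('db.company.com', user='datascience', password='cfhe8&4!dCM')
'''
print(find_hardcoded_secrets(bad_script))  # [(3, 'password')]
```

An attacker does not even need this much: a plain text search for "password" across a cloned repository gets them most of the way.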
Trying to keep secrets out of the source code, many Data Scientists come up with the idea of storing them in environment variables. Unfortunately, a lot of DevOps engineers, software engineers, and admins do the same and even recommend it. But beware: it is easy to do, but not so easy to do right.
The first naive implementation would be changing the script above like this:
import os
from pyclickhouse import Connection
conn = Connection('db.company.com', user=os.environ['DB_NAME'], password=os.environ['DB_PASSWORD'])
cur = conn.Cursor()
cur.select(".....")
and then run your script like this:
DB_NAME=datascience DB_PASSWORD='cfhe8&4!dCM' python myscript.py
This avoids publishing the secrets to GitLab or another public repository, but it still stores them in a file on your computer. How come? When you type commands in the terminal, most popular shells keep a history of them, which is why you can return to the previous command by pressing the up arrow. With bash, this history is stored in the file .bash_history in your home directory, so reading this file reveals the secrets to the attacker.
Another naive and wrong solution is to store the secrets in your .bashrc file. Data Scientists don’t want to type the secrets every time they use them, so they google “how to set an environment variable permanently”, find the advice to write them into .bashrc, and follow it. Now the hacker just needs to look at two files, .bash_history and .bashrc; somewhere in these two places they will find the secrets.
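The following sketch shows why these two files are such easy targets: one regular expression is enough to pull credential-like assignments out of shell text. The pattern and the helper name `leaked_assignments` are hypothetical, and the sample strings stand in for the contents of .bash_history and .bashrc.

```python
import re

# NAME=value assignments whose variable name hints at a credential.
ASSIGNMENT = re.compile(
    r"\b(\w*(?:PASSWORD|PASSWD|SECRET|TOKEN)\w*)=(\S+)",
    re.IGNORECASE,
)


def leaked_assignments(text):
    """Return {variable name: value} for credential-like assignments."""
    found = {}
    for name, value in ASSIGNMENT.findall(text):
        # Drop surrounding shell quotes, if any.
        found[name] = value.strip("'\"")
    return found


# Sample contents standing in for ~/.bash_history and ~/.bashrc.
history = "ls -la\nDB_NAME=datascience DB_PASSWORD='cfhe8&4!dCM' python myscript.py\n"
bashrc = "export PATH=$HOME/bin:$PATH\nexport DB_PASSWORD='cfhe8&4!dCM'\n"

print(leaked_assignments(history))  # {'DB_PASSWORD': 'cfhe8&4!dCM'}
print(leaked_assignments(bashrc))   # {'DB_PASSWORD': 'cfhe8&4!dCM'}
```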
Proper secret management
The solution to the problem is not to store any secrets at all. You, as a Data Scientist, should never receive the secrets (via email, chat, etc.) and so never type them, neither in the source code nor in environment variables. For especially sensitive data, I personally would never even read the secrets, so even if a hacker kidnapped and tortured me, I wouldn’t be able to reveal them.
Instead, your DevOps engineer generates the secrets for you and stores them in a specially designed secret store, for example HashiCorp Vault. You can think of this store as a secure, encrypted key/value database. The keys can be the names of your projects, like ‘datascience/RevenuePrediction/Summer2025’, and the values are the secrets needed by this Python script.
To use the secrets, you will change the script as follows:
import os
from getpass import getpass
from hvac import Client
from pyclickhouse import Connection

vault_client = Client(url='https://vault.company.com:8200')

# Authenticate against Vault with your operating system (LDAP) credentials.
username = os.environ['USER']
password = getpass('Enter password:')  # typed input is not echoed to the screen
vault_client.auth.ldap.login(
    username=username,
    password=password,
    mount_point='datascience'
)

# KV v2 wraps the stored key/value pairs in response['data']['data'].
response = vault_client.secrets.kv.v2.read_secret(
    path='RevenuePrediction/Summer2025',
    mount_point='datascience'
)
secrets = response['data']['data']

conn = Connection(secrets['host'], user=secrets['db_user'], password=secrets['db_password'])
cur = conn.Cursor()
cur.select(".....")
Here is how it works. When you start the script, it determines the username you used to log in to your operating system today. It also asks for your operating system password. This password is neither stored persistently anywhere nor displayed on screen. These operating system credentials are then used to authenticate you against Vault. Once that succeeds, Vault knows who you are and which secrets you are allowed to access. The next call reads the secrets from Vault: these secrets are never permanently stored either, but are immediately used to authenticate you against the database. So all this time, no secrets are ever stored on your disk; they are only kept for a short time in your RAM.
A hacker can still steal the secrets if they grab your open laptop with a running script while you are refilling your cup of latte, attach a debugger to the running Python process, and read its variables. Nevertheless, you have raised your secret management to the minimal acceptable state-of-the-art (SOTA) security level, so, well, congratulations.
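The login-then-read flow described above can be wrapped in one small function, sketched here under some assumptions. The helper `fetch_kv_secrets` is a hypothetical name. It uses hvac's `read_secret_version`, the explicit KV v2 read call, and returns the payload found under `response['data']['data']`. So that the sketch runs without a Vault server, `make_fake_client` builds a stand-in object with the same attribute path as `hvac.Client`; it exists for demonstration only and is not part of hvac.

```python
from types import SimpleNamespace


def fetch_kv_secrets(client, username, password, path, mount_point):
    """Log in via LDAP, read one KV v2 secret, and return its key/value pairs.

    `client` is expected to behave like hvac.Client. Nothing is written to
    disk; the secrets only live in the returned dictionary.
    """
    client.auth.ldap.login(
        username=username, password=password, mount_point=mount_point
    )
    response = client.secrets.kv.v2.read_secret_version(
        path=path, mount_point=mount_point
    )
    # KV v2 nests the stored values under data.data (metadata is a sibling).
    return response["data"]["data"]


def make_fake_client(stored):
    """Stand-in for hvac.Client, for demonstration only (no network access)."""
    logins = []
    ldap = SimpleNamespace(
        login=lambda username, password, mount_point: logins.append(username)
    )
    v2 = SimpleNamespace(
        read_secret_version=lambda path, mount_point: {
            "data": {"data": stored[path], "metadata": {}}
        }
    )
    return SimpleNamespace(
        auth=SimpleNamespace(ldap=ldap),
        secrets=SimpleNamespace(kv=SimpleNamespace(v2=v2)),
        logins=logins,
    )


client = make_fake_client({
    "RevenuePrediction/Summer2025": {
        "host": "db.company.com",
        "db_user": "datascience",
        "db_password": "not-a-real-password",
    }
})
creds = fetch_kv_secrets(
    client, "alice", "os-password",
    path="RevenuePrediction/Summer2025", mount_point="datascience",
)
print(sorted(creds))  # ['db_password', 'db_user', 'host']
```

With a real server you would pass `Client(url=...)` from hvac instead of the fake, and take the password from `getpass` as in the script above.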
So I have to enter my password every time I run a script? WTF?
Yep. Unless your DevOps engineers and admins provide you with passwordless security infrastructure and hardware (like a FIDO2 stick), your best behavior is to enter the password every time you run the script.
Most people don’t exercise their best behavior all the time.
So most people settle for a “good enough” approach and add one line (the if check) to the script:
import os
from getpass import getpass
from hvac import Client
from pyclickhouse import Connection

vault_client = Client(url='https://vault.company.com:8200')

# Only ask for the password if there is no valid Vault token yet.
# is_authenticated is a method, so it must be called with parentheses.
if not vault_client.is_authenticated():
    username = os.environ['USER']
    password = getpass('Enter password:')
    vault_client.auth.ldap.login(
        username=username,
        password=password,
        mount_point='datascience'
    )

# KV v2 wraps the stored key/value pairs in response['data']['data'].
response = vault_client.secrets.kv.v2.read_secret(
    path='RevenuePrediction/Summer2025',
    mount_point='datascience'
)
secrets = response['data']['data']

conn = Connection(secrets['host'], user=secrets['db_user'], password=secrets['db_password'])
cur = conn.Cursor()
cur.select(".....")
This line skips authentication if the hvac library thinks you are already authenticated against Vault. And how do you get authenticated? Well, after you open your laptop at the beginning of the day, you just type this in your terminal:
vault login -method ldap username=$USER
And then enter your operating system password, once. After this, you receive an access token from Vault that allows you to access it for a limited amount of time (like 8 hours). This token is stored in the .vault-token file in your home directory. The hvac library looks for it there and finds it when your script runs.
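As a rough illustration of that lookup, the sketch below checks the `VAULT_TOKEN` environment variable first and then falls back to the `~/.vault-token` file, which is where the Vault CLI writes the token by default. The function `find_vault_token` is a hypothetical helper written for this article; it mirrors the lookup order as I understand it and is not hvac's own code. A temporary directory stands in for the home directory.

```python
import os
import tempfile
from pathlib import Path


def find_vault_token(environ=None, home=None):
    """Return a Vault token from VAULT_TOKEN or ~/.vault-token, else None."""
    environ = os.environ if environ is None else environ
    token = environ.get("VAULT_TOKEN")
    if token:
        return token
    home_dir = Path(home) if home else Path.home()
    token_file = home_dir / ".vault-token"
    if token_file.is_file():
        # An empty file counts as "no token".
        return token_file.read_text().strip() or None
    return None


with tempfile.TemporaryDirectory() as fake_home:
    # Before `vault login`: no token file, nothing in the environment.
    before_login = find_vault_token(environ={}, home=fake_home)
    # After `vault login`: the CLI has written the token file.
    Path(fake_home, ".vault-token").write_text("hvs.example-token\n")
    after_login = find_vault_token(environ={}, home=fake_home)
    # The environment variable takes precedence over the file.
    from_env = find_vault_token(
        environ={"VAULT_TOKEN": "hvs.from-env"}, home=fake_home
    )

print(before_login, after_login, from_env)
# None hvs.example-token hvs.from-env
```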
So what gives? I still have some secret stored on my local disk!
Yes, but this token doesn’t contain the secrets; it is only the right to read secrets from Vault for a limited amount of time. First, if the hacker steals the token at night, it will already have expired. Second, if your laptop gets stolen, you will immediately report this to IT, and they can revoke the Vault token (similar to blocking a stolen credit card). Third, the token can be restricted by IP address, so if the hacker copies the token file to their own computer, they won’t be able to use it, because it is only valid when used from your laptop, with your IP address.
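The three protections can be summarized in a toy model, sketched below. It is only an illustration of the checks a secret store applies before honoring a token, not Vault's actual implementation; the function `token_is_usable`, its parameters, and the example addresses are assumptions made for this sketch.

```python
from datetime import datetime, timedelta, timezone
from ipaddress import ip_address, ip_network


def token_is_usable(issued_at, ttl, now, client_ip, bound_cidrs=(), revoked=False):
    """Toy model: a token works only if it is not revoked, not expired,
    and (when bound to networks) presented from an allowed address."""
    if revoked:
        return False
    if now >= issued_at + ttl:
        return False
    if bound_cidrs and not any(
        ip_address(client_ip) in ip_network(cidr) for cidr in bound_cidrs
    ):
        return False
    return True


issued = datetime(2025, 6, 2, 9, 0, tzinfo=timezone.utc)  # morning login
ttl = timedelta(hours=8)
my_laptop = "10.0.0.5"
bound = ["10.0.0.5/32"]  # token bound to the laptop's address

noon = issued + timedelta(hours=3)
night = issued + timedelta(hours=14)

print(token_is_usable(issued, ttl, noon, my_laptop, bound))                 # True
print(token_is_usable(issued, ttl, night, my_laptop, bound))                # False
print(token_is_usable(issued, ttl, noon, "203.0.113.7", bound))             # False
print(token_is_usable(issued, ttl, noon, my_laptop, bound, revoked=True))   # False
```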
Is this really SOTA?
Yes, it is the industry state of the art for the minimum level of secure secret management. Organizations can improve on it by introducing MFA, hardware tokens, and other methods.
As a Data Scientist, all you need to do is be aware of your organization’s secret management protocol and follow it as closely as feasible. If your organization doesn’t have a secret management protocol in place, ask IT or DevOps to create one.