Building a proto-MLOps Platform on Saturn Cloud

Vishnu Deva
7 min read · Nov 8, 2022


An entry in a series of blogs written during the Vector Search Hackathon organized by the MLOps Community, Redis, and Saturn Cloud.

I’m a die-hard fan of deploying and self-hosting open-source software for production. But that said, I’ve found myself enjoying the freedom that Saturn Cloud provides. We started our project only three days ago, and when my teammate Chirag Bajaj and I had to double down and focus on the ML and Redis exploration, I was ecstatic not to have to think about infrastructure. Things just worked out of the box, which is a unique experience for someone like me, since I usually have to assemble the parts before getting started with anything. I can see why teams made up solely of ML engineers would gravitate here: zero thought goes into infrastructure, and the work still gets done on time.

Our approach in this hackathon was two-pronged, one on the MLOps side and one on the ML side.

MLOps

Machine Learning and Data Science tend to plateau after a point without solid tools backing them. There’s only so far you can go with manually naming your models, storing your scripts on disk, and using log files to understand how each training run progressed. So we made the conscious decision to devote our attention primarily to building a rudimentary MLOps platform. Both in the hackathon and in the real world, ML work crawls along at a snail’s pace while the platform is being set up; the uncertainty is excruciating and it feels like nothing is getting done. But once everything is up and running, the pace of research accelerates to levels that would never be possible without the platform.

ML

Our goal here was to extract relationships between papers in the arXiv dataset and explore the natural clusters that formed in the data using topic modelling. To this end, we used Top2Vec, an excellent Python library that implements the ideas explored in the Top2Vec paper by Dimo Angelov.

Let’s take a quick look at the mechanisms that Saturn Cloud offers to enable what we’re trying to do.

Saturn Cloud

There are three concepts relevant to us:

Jupyter Servers

Used as easy development environments backed by powerful machines. Accessible through the Jupyter UI on your browser or through SSH. The SSH connection also lets you connect to this server from the familiar comfort of your IDE, which is always a wonderful experience.

Deployments

These are like servers but use images, startup scripts and commands to automate the process of deploying an application. So once the data scientist finishes developing something with a Jupyter Server, it should be trivial for them to clone their environment and launch it as a deployment that will be permanently accessible.

Jobs

Similar to deployments, but these are meant to be ephemeral, dying once their task is complete. These are perfect for orchestrating jobs that have to be triggered on a periodic schedule or on a specific signal.

Notes

  • The community edition of Saturn Cloud exposes only one port: 8000. This applies to your deployments, and even to your Jupyter servers should you wish to use it. The port is automatically mapped to a URL that looks like this:
https://<subdomain you choose>.community.saturnenterprise.io
  • To protect the platform from the execution of malicious code, external images are disabled on the community version of Saturn Cloud Hosted. But since most sensible OSS projects have a simplified installation process, this doesn’t really affect us, as you’ll see below. We can just install whatever we need at runtime or build a custom image with their image builder tool.

We go into some detail on the components of our platform below.

Model Training

We needed a fair bit of compute to train our Top2Vec model. It had to chew through a couple of million papers, and the compute requirements were made worse by the fact that we weren’t going to be generating embeddings from a pre-trained model, but rather training a Doc2Vec model from scratch. And to add to this, Top2Vec doesn’t support GPU acceleration.

So we used the 4XLarge machine, with its 16 CPU cores and 128 GB of memory. Thankfully, the library was able to fire at 100% on all 16 cores and made it through the entire dataset in ~7 hours.
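For reference, here’s a minimal sketch of what that training call looks like. The documents variable, the speed setting, and the save path are our own illustrative choices, not something prescribed by the library:

from top2vec import Top2Vec

# documents: a list of arXiv abstracts as plain strings (loaded elsewhere).
# embedding_model="doc2vec" trains Doc2Vec from scratch rather than using a
# pre-trained embedding model, and workers=16 keeps all 16 cores busy.
model = Top2Vec(
    documents,
    embedding_model="doc2vec",
    speed="deep-learn",  # illustrative; "learn" or "fast-learn" also work
    workers=16,
)

# Persist the trained model so it can be pushed to object storage later.
model.save("top2vec.model")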

The auto-shutoff feature helps you save costs if you leave a server running without actively using it. But if you do have a lengthy training run in progress, make sure to connect to the server via SSH from a stable machine; the open SSH connection prevents the auto-shutoff from being triggered.

As noted earlier, the interesting part is that even though we developed an ad-hoc script and trained with a notebook on a Jupyter server, Saturn Cloud allows you to clone your environment and recreate it as a job or a deployment. Combine this with the feature that automatically pulls code from a GitHub repository and a simple Docker image builder to pre-package your dependencies, and you have a platform in Saturn Cloud that can take any bit of code a data scientist tests and plays around with and turn it into a periodic job with little effort. This is an inspired workflow.

Object Storage

Since we have a system that’s spread out across multiple components, having our files on disk doesn’t quite cut it for us. We deployed MinIO to act as an S3-compatible object store. I don’t think I’ll ever get over how well the S3 API has unified the landscape for object storage. It was a breeze to use MinIO with boto3.

Installation and startup

wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20221029062133.0.0_amd64.deb -O minio.deb
sudo dpkg -i minio.deb
mkdir -p ~/minio

# Startup command for the deployment.
minio server ~/minio --address :8000

Interfacing with MinIO from boto3

The only difference from using this with S3 is that we specify an endpoint URL as follows.

import boto3

s3_resource = boto3.resource(
    "s3",
    endpoint_url="https://xyz-minio.community.saturnenterprise.io",
    config=boto3.session.Config(signature_version="s3v4"),
)
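Once the resource is configured (boto3 picks up credentials from the usual AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables, which MinIO accepts as its access and secret keys), uploading and downloading artifacts works exactly as it would against S3. The bucket and key names below are just illustrative:

# Hypothetical bucket and key names for illustration.
bucket = s3_resource.Bucket("topvecsim")

# Push the trained model artifact to MinIO...
bucket.upload_file("top2vec.model", "models/top2vec.model")

# ...and pull it back down inside a deployment or job.
bucket.download_file("models/top2vec.model", "top2vec.model")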

Model Tracking

Behind every great model is a model registry rolling its eyes.

Mistakes are a core part of the data science journey, but mistakes only generate value if they are tracked. A model registry lets you do this and more.

We deployed MLFlow on Saturn Cloud to track our training runs. Though Top2Vec doesn’t quite generate metrics during training that we can track, we plan to create a metric from Topic Information Gain as described in equation 5 of the Top2Vec paper. It’s also useful to get an idea of the number of documents, words, and topics after each model is trained; we log these as params to MLFlow.

Installation and startup

# Startup command for the deployment.
mlflow ui --host 0.0.0.0 --port 8000 --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts

Interfacing with MLFlow from Python

It’s very straightforward; a consolidated sketch follows the steps below.

  • Set the Tracking URI.

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

  • When your training/task begins, start your run.

mlflow.start_run(run_name=run_name)

  • To log parameters that feed into the task or describe the task at a high level, use log_param or log_params. For example, num_topics, num_words, etc. are parameters.

mlflow.log_param(key="num_workers", value=workers)

  • When your training/task ends, end your run.

mlflow.end_run()
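Putting those pieces together, here’s a minimal sketch of a logged training run. It assumes a trained Top2Vec model and the documents list from earlier; the run name and parameter keys are our own choices, not anything mandated by MLFlow:

import os

import mlflow

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

mlflow.start_run(run_name="top2vec-arxiv")  # run name is arbitrary

# `model` is a trained Top2Vec instance; `documents` is the training corpus.
mlflow.log_params(
    {
        "num_documents": len(documents),
        "num_topics": model.get_num_topics(),
        "num_workers": 16,
    }
)

mlflow.end_run()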

Logging

Distributed anything is a dream for scalability, but it’s also a nightmare for logs. Ideally, you’d want a centralized logging interface so that you can make sense of everything. Log files sound good and do the job, but they’re hard to work with in practice when you want to get a sense of your system as a whole.

I swear by Loki for anything to do with logging, because its ability to process large volumes of logs and extract useful information and metrics from them is nothing short of magical.

Installation and startup

curl -O -L "https://github.com/grafana/loki/releases/download/v2.6.1/loki-linux-amd64.zip"
sudo apt update; sudo apt install unzip -y
unzip "loki-linux-amd64.zip"
chmod a+x "loki-linux-amd64"
wget https://xyz-minio.community.saturnenterprise.io/topvecsim/platform/loki/loki_config.yaml

# Startup Command for the deployment.
./loki-linux-amd64 --config.file loki_config.yaml

In the script above, you’ll note a link to xyz-minio. We store our Loki configuration file in our MinIO object store and download it whenever the Loki deployment starts up. Loki listens by default on port 3100, so we use the config file to redirect it to port 8000; specifically, we set http_listen_port: 8000.

Interfacing with Loki from Python

Loki has a simple API that’s easy to use, but a library called python-logging-loki makes the process even easier by providing a logging handler that plugs straight into Python’s standard logging module.

Here’s a sample:

import logging

import logging_loki

loki_handler = logging_loki.LokiHandler(
    url="https://xyz-loki.community.saturnenterprise.io/loki/api/v1/push",
    version="1",
)
logger = logging.getLogger("topvecsim")
logger.addHandler(loki_handler)
logger.setLevel(logging.INFO)
logger.info("Starting model training.")

The library also provides a Queue Handler which is well-suited for high log-volume situations.
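Here’s a minimal sketch of that queue-based variant, assuming the same push URL; the multiprocessing Queue decouples your application from the network call to Loki, and the tag is just an example:

import logging
from multiprocessing import Queue

import logging_loki

# Records are placed on the queue; a background listener ships them to Loki.
queue_handler = logging_loki.LokiQueueHandler(
    Queue(-1),
    url="https://xyz-loki.community.saturnenterprise.io/loki/api/v1/push",
    tags={"application": "topvecsim"},
    version="1",
)

logger = logging.getLogger("topvecsim")
logger.addHandler(queue_handler)
logger.setLevel(logging.INFO)
logger.info("Starting model training.")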

Monitoring

Grafana needs no introduction as an observability interface. It pairs perfectly with Loki and lets you manipulate logs into any number of intuitive visualizations.

Installation and startup

sudo apt-get update
sudo apt-get install -y adduser libfontconfig1
wget https://dl.grafana.com/oss/release/grafana_9.2.3_amd64.deb
sudo dpkg -i grafana_9.2.3_amd64.deb
# Change port 3000 to 8000.
sudo sed -i 's/3000/8000/g' /etc/grafana/grafana.ini
sudo sed -i 's/3000/8000/g' /usr/share/grafana/conf/defaults.ini
# Startup Command for the deployment.
sudo /usr/sbin/grafana-server --homepath='/usr/share/grafana' --config='/usr/share/grafana/conf/defaults.ini'

What was left out or still in progress for the hackathon?

Jaeger

Distributed tracing to help you track the performance of your APIs and even your Python code that’s deployed in different places.

Loki Metrics on Grafana Dashboards

Sending logs to Loki as JSON-encoded strings lets it parse them and extract values. This will let us plot any metric from our application on Grafana through LogQL queries.
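On the Python side, that could look something like the sketch below, reusing the logger configured earlier and assuming a trained model and an elapsed-time value from the training run; the field names are placeholders:

import json

# A JSON-encoded message lets LogQL's json parser extract fields such as
# num_topics and training_time_secs for Grafana panels.
logger.info(
    json.dumps(
        {
            "event": "training_complete",
            "num_topics": model.get_num_topics(),
            "training_time_secs": round(elapsed_seconds, 2),
        }
    )
)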

There are certain low-effort tasks you can take up that will make your life infinitely better. This is especially true when it comes to something as inherently chaotic and disorganized as ML/DS research and exploration. Invest in simple tools like the above for exponential workflow improvement.
