
how to (try to) deploy a custom AI model to the cloud

Boris Radulov
published on 2024-09-19

1. the problem

Recently, a lot of debate has gone on about serverless vs. hosted VPS and, regardless of which side you’re on, you have to admit serverless has one main benefit: you ONLY pay for what you use. For regular applications, this might not be a good enough selling point (or even result in cost savings), but for AI applications it’s crucial. A VPS with an RTX 4090 goes for $0.69/hr (heh), which is about $500 a month. That’s for an RTX 4090, a consumer GPU made for video games. Big boy GPUs go for multiple times that. I’m not paying that much unless I really have to.

2. the “solutions”

Thankfully, with AI going crazy these days, there’s a wide selection of platforms that offer serverless model deployments. After much searching and comparison, I narrowed it down to two options: AWS SageMaker and RunPod.

I’ll use this section to discuss my experience with both as someone who’s never used them before. The model I was working with is a custom analytics pipeline on top of a checkpoint for stardist. We (meaning LNLink, a small startup I’m involved in) use it to count cells with crazy accuracy and speed.

AWS SageMaker

aws meme by /u/donjuan26

As an important preface, this was the first time I’d ever used anything AWS. I hadn’t even used S3 before. I’ve seen a lot of memes about AWS’ complexity, but holy sh*t, nothing could have prepared me for what I was about to stumble on with SageMaker.

Firstly, you need to decide if you’re going to use a synchronous or asynchronous deployment. Synchronous (real-time) endpoints have tight limits: the request payload is capped at a few megabytes and invocations time out after about a minute.

Unfortunately for me, I had to choose asynchronous deployment, because we often process extremely high-resolution cell microscopy images that blow way past those limits. This is a problem because dealing with asynchronous endpoints is significantly harder.

Your input data AND YOUR MODEL need to be hosted on S3. You get notified about your job’s progress through Amazon SNS (a pub/sub notification service, roughly comparable to RabbitMQ), and you need to pick instances from the most incomprehensible pricing table known to mankind. Also, don’t forget to provision accounts through AWS IAM and authenticate via config files.
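
To give a feel for the moving parts, here’s a minimal sketch of what just invoking an async endpoint looks like with boto3. It assumes the endpoint (hypothetically named "my-async-endpoint") has already been created with an async inference config, and that the input file has already been uploaded to a bucket:

import boto3

# Assumes IAM credentials are already configured via config files / environment
smr = boto3.client("sagemaker-runtime")

# You don't send the data itself; you point at where it already lives on S3
response = smr.invoke_endpoint_async(
    EndpointName="my-async-endpoint",                      # hypothetical name
    InputLocation="s3://my-bucket/inputs/image-001.tiff",  # hypothetical bucket/key
    ContentType="application/octet-stream",
)

# The result isn't returned here, only where it will eventually land on S3;
# success/failure notifications arrive via the SNS topics configured on the endpoint
print(response["OutputLocation"], response["InferenceId"])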

There is an example notebook, which didn’t work for me out of the box, but you can give it a go. I realize my complaints about AWS can probably be chalked up to a skill issue, but, then again, AWS has created a whole market of startups that are just nicer AWS wrappers.

RunPod

RunPod takes this in a completely different direction. They give you a Python library where you just pass a function that gets called from a managed queue, and all you need to do is return data from it:

import runpod
import asyncio


async def async_generator_handler(job):
    for i in range(5):
        # Generate an asynchronous output token
        output = f"Generated async token output {i}"
        yield output

        # Simulate an asynchronous task, such as processing time for a large language model
        await asyncio.sleep(1)


# Configure and start the RunPod serverless function
runpod.serverless.start(
    {
        "handler": async_generator_handler,  # Required: Specify the async handler
        "return_aggregate_stream": True,  # Optional: Aggregate results are accessible via /run endpoint
    }
)

It’s literally this simple. Deployment is also insanely simple: just push a Docker container to Docker Hub and give RunPod your creds to pull it. They don’t really have great documentation on this part, so here’s the Dockerfile I use for TF-based ML models:

FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

# Define your working directory
ENV HOME="/"
ENV LC_ALL=C.UTF-8

# Install the correct Python version via pyenv
WORKDIR ${HOME}
RUN apt update && apt upgrade -y
RUN apt-get install -y git curl build-essential libssl-dev libffi-dev libncurses5-dev zlib1g zlib1g-dev libreadline-dev libbz2-dev libsqlite3-dev make gcc gfortran libopenblas-dev liblapack-dev
RUN git clone --depth=1 https://github.com/pyenv/pyenv.git .pyenv
ENV PYENV_ROOT="${HOME}/.pyenv"
ENV PATH="${PYENV_ROOT}/shims:${PYENV_ROOT}/bin:${PATH}"
ENV PYTHON_VERSION=3.9.18
RUN pyenv install ${PYTHON_VERSION} -v
RUN pyenv global ${PYTHON_VERSION}

# Install dependencies
RUN pip install csbdeep
RUN pip install stardist
RUN pip install pycocotools
RUN pip install gputools
RUN pip install tensorflow[and-cuda]
RUN pip install runpod

# Add the handler and model weights, then run the worker
COPY main.py .
COPY models ./models/
CMD [ "python", "-u", "/main.py" ]

Then you can make POST requests to the endpoint they give you to create new tasks, and GET requests to fetch status/results. If you submit multiple tasks, RunPod will automatically spin up new containers (up to some limit) to handle them in parallel.
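
As a rough illustration, submitting a task and polling for the result could look like this. The sketch assumes RunPod’s hosted /run and /status routes with a hypothetical endpoint ID and API key, plus the image_url input schema from the handler sketch above:

import time

import requests

API_KEY = "..."                                       # hypothetical RunPod API key
ENDPOINT = "https://api.runpod.ai/v2/<endpoint_id>"   # hypothetical endpoint ID
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Queue a new task
job = requests.post(
    f"{ENDPOINT}/run",
    headers=HEADERS,
    json={"input": {"image_url": "https://example.com/cells.tiff"}},
).json()

# Poll until a worker picks it up and finishes
while True:
    status = requests.get(f"{ENDPOINT}/status/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(2)

print(status.get("output"))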

runpod queue

Their servers must have some sort of Docker layer cache or just insanely good bandwidth, because a container with CUDA that weighs a few GBs gets deployed in seconds. In the example below, you can see the model took 10s to deploy and 40s to process a 14MiB cell microscopy image, while a second container task was running in parallel:

runpod speed

3. conclusion

This is not sponsored, but for all of you trying to build an AI startup, I can’t recommend RunPod enough. The developer velocity is crazy. You don’t have to deal with any complicated storage systems, message queues, or anything like that: just your Python file and a Docker container. Sure, once you get a lot of users it might be worth moving to AWS, but remember: premature optimization is the root of all evil.