MLOps - What and Why ⚙️


Hello Reader,

MLOps is gaining quite the buzz these days, given the rise of Gen AI. In today's newsletter, let's find out what MLOps is and why it is becoming so popular.

This edition of the newsletter is written by Vijay Kodam, Principal Engineer at Nokia, and AWS Community Builder.

MLOps is a set of practices that streamline and automate machine learning workflows, applying DevOps principles to machine learning operations.

Why do we need MLOps?

Most of the time, as part of the machine learning workflow, teams go through EDA, data preparation, model training, tuning, then model deployment and monitoring, only to find out that the model is not ready for production. They then have to repeat the whole process and retrain the model.

Since machine learning workflows were manual and several teams were involved at different stages, maintaining them took a lot of time and effort.

Streamlining and automating such manual processes speeds up time to market and reduces manual errors and risks. It makes managing and monitoring thousands of machine learning models scalable, and frees data scientists and engineers to focus on model development and innovation.

Key components of MLOps

The machine learning lifecycle has several interconnected stages, and together these key components make up MLOps. I have been going through MLOps guides from AWS, Google, IBM, and Databricks, and all of them mostly agree on the same key components.

Data Management

Data is the new oil. In machine learning, data makes or breaks a model: it is the backbone of any ML system. Fetching the right data, storing it, preprocessing it for model development, and versioning it are all critical.

Primarily, this stage consists of Exploratory Data Analysis (EDA), which means exploring and understanding the data. Data preparation and feature engineering are also part of this step, covering data collection and processing.

Feature engineering preprocesses raw data into a machine-readable format. It optimizes ML model performance by transforming and selecting relevant features.
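As a minimal illustration of that idea (pure Python, no ML libraries; the feature names are made up for the example), feature engineering might scale a numeric column and one-hot encode a categorical one into machine-readable vectors:

```python
# Minimal feature-engineering sketch: scale a numeric column and
# one-hot encode a categorical column into numeric feature vectors.

def min_max_scale(values):
    """Rescale numeric values into the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo or 1  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]

def one_hot(values):
    """Map each category to a binary indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

ages = [18, 35, 62]
plans = ["free", "pro", "free"]

# Each row: scaled age followed by the one-hot plan indicators.
features = [[a] + p for a, p in zip(min_max_scale(ages), one_hot(plans))]
print(features)
```

Real pipelines would use scikit-learn transformers or a feature store for this, but the transformation itself is exactly this kind of mapping from raw values to model-ready numbers.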

Some MLOps implementations separate EDA and Data preparation into two stages.

Here are some of the tools used for data management:

  • Data versioning: Data Version Control (DVC), Delta Lake, MLflow
  • Data storage and management: Amazon S3, Google Cloud Storage, Azure Blob Storage, Google BigQuery, Amazon Redshift, Snowflake
  • Data preparation: Apache Airflow, Databricks
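The core idea behind data-versioning tools like DVC can be sketched in a few lines: identify a dataset by a hash of its content, so any change to the data produces a new, reproducible version ID. This is a simplified illustration, not DVC's actual storage format:

```python
import hashlib
import json

def dataset_version(rows):
    """Derive a content-addressed version ID for a dataset --
    similar in spirit to how DVC tracks data by hash."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_version([{"user": 1, "clicks": 3}])
v2 = dataset_version([{"user": 1, "clicks": 4}])  # one value changed
print(v1 != v2)  # a changed dataset gets a new version ID
```

Because the ID is derived from the content, the same data always maps to the same version, which is what makes training runs reproducible.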

Model Development

This stage involves the design, training, tuning, and evaluation of machine learning models.

Here are some of the tools and services used as part of model development:

  • Model development frameworks: TensorFlow / Keras, PyTorch, Scikit-learn
  • Experiment tracking and management: MLflow
  • AutoML: Amazon SageMaker Autopilot, Google AutoML, Azure Machine Learning Studio
  • IDEs: Jupyter Notebooks, RStudio, VS Code, etc.
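To make experiment tracking concrete, here is a toy tracker in the spirit of MLflow: each training run records its parameters and metrics as JSON so runs can be compared later. The class and file layout are illustrative, not MLflow's API:

```python
import json
import time
import uuid
from pathlib import Path

class ExperimentTracker:
    """Toy experiment tracker: each run is saved as a JSON record."""

    def __init__(self, root="runs"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def log_run(self, params, metrics):
        """Record one training run's parameters and metrics."""
        run_id = uuid.uuid4().hex[:8]
        record = {"run_id": run_id, "time": time.time(),
                  "params": params, "metrics": metrics}
        (self.root / f"{run_id}.json").write_text(json.dumps(record))
        return run_id

    def best_run(self, metric):
        """Return the logged run with the highest value of `metric`."""
        runs = [json.loads(p.read_text()) for p in self.root.glob("*.json")]
        return max(runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"lr": 0.1}, {"accuracy": 0.81})
tracker.log_run({"lr": 0.01}, {"accuracy": 0.87})
print(tracker.best_run("accuracy")["params"])  # {'lr': 0.01}
```

MLflow adds artifact storage, a UI, and a model registry on top of this same "log every run, compare later" pattern.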

Model deployment

This stage focuses on packaging models, shipping them, and deploying them to production environments. It ensures the model is accessible via an API, microservice, or application.

Here are the tools and services used for model inferencing, serving and model deployment:

  • Containers and orchestration: KServe on Kubernetes platforms like Amazon EKS, GKE, and Azure Kubernetes Service (AKS)
  • Managed model deployment services: Amazon SageMaker, Google Vertex AI, Azure Machine Learning
  • Model Serving: Kubeflow, TorchServe, TensorFlow Serving
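At its core, deployment means packaging a trained model artifact and exposing a predict entry point. A minimal sketch with Python's standard library (the threshold "model" is a stand-in; real deployments would ship through TorchServe, KServe, or a managed service):

```python
import pickle

# A trivially "trained" model: flags churn when activity is below a cutoff.
model = {"kind": "threshold", "cutoff": 5}

# Package the model artifact, as a deployment pipeline would before shipping.
artifact = pickle.dumps(model)

def predict(artifact_bytes, activity):
    """Serving entry point: load the packaged model and score one input."""
    m = pickle.loads(artifact_bytes)
    return "churn" if activity < m["cutoff"] else "active"

print(predict(artifact, 2))   # churn
print(predict(artifact, 10))  # active
```

The separation matters: the artifact is produced once by training, while the serving side only loads and scores, so the two can be versioned and scaled independently.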

Model inference and serving

Model inference and serving involves making the deployed model available for use by applications and end users. It focuses on querying the deployed model to generate predictions.

Services like Amazon SageMaker Endpoints, Google Vertex AI Endpoints, Azure Machine Learning Endpoints, TensorFlow Serving, KServe and MLflow Models are used.
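Applications typically query such an endpoint over HTTP with a JSON payload. Here is a sketch of the client side; the payload and response shapes are illustrative, not any specific service's schema:

```python
import json

def build_inference_request(features):
    """Serialize model inputs into a JSON body for an endpoint
    (payload shape is illustrative, not a specific service's API)."""
    return json.dumps({"instances": [features]})

def parse_inference_response(body):
    """Extract predictions from a JSON response body."""
    return json.loads(body)["predictions"]

request_body = build_inference_request([0.4, 1, 0])
print(request_body)  # {"instances": [[0.4, 1, 0]]}

# A response the endpoint might return:
print(parse_inference_response('{"predictions": [0.92]}'))  # [0.92]
```

Managed endpoints like SageMaker or Vertex AI wrap this request/response cycle with authentication, autoscaling, and logging.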

Model Monitoring

After deployment, continuous monitoring is essential to ensure that models perform as expected and maintain their accuracy over time.

Prometheus + Grafana is a popular open-source monitoring stack and a good way to get started. Managed model monitoring services include Amazon SageMaker Model Monitor and Evidently. Custom monitoring solutions can also be built with Kubeflow Pipelines.
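One common monitoring check is data drift: comparing live inputs against the training distribution. A minimal sketch that compares feature means (production systems use proper statistical tests such as Kolmogorov-Smirnov or PSI; the tolerance here is arbitrary):

```python
def mean(xs):
    return sum(xs) / len(xs)

def drift_alert(train_values, live_values, tolerance=0.2):
    """Flag drift when the live feature mean moves more than
    `tolerance` (as a fraction of the training mean)."""
    base = mean(train_values)
    shift = abs(mean(live_values) - base) / abs(base)
    return shift > tolerance

train_ages = [30, 35, 40, 45]  # mean 37.5
live_ages = [55, 60, 65, 58]   # a much older live population

print(drift_alert(train_ages, live_ages))         # True: investigate/retrain
print(drift_alert(train_ages, [31, 36, 41, 44]))  # False: within tolerance
```

Alerts like this are what connect monitoring to the automated retraining stage below.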

Governance and Compliance

This key component ensures ML models are developed and deployed responsibly and ethically.

Model explainability can be achieved using Local Interpretable Model-agnostic Explanations (LIME) and SHAP (SHapley Additive exPlanations). MLflow supports audit and compliance workflows, while Amazon Macie helps with data security. Data and model lineage can be tracked using MLflow, Amazon SageMaker Model Registry, and Google Cloud Vertex AI Model Registry.

Automated model retraining

Automated model retraining involves retraining the ML model when its performance degrades or when new data becomes available. In this stage, retraining is triggered when specific conditions are met; the model is then retrained on the latest data, and the retrained model is evaluated before it replaces the current one.
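The trigger-retrain-evaluate loop described above can be sketched as follows; the thresholds and the stand-in train/evaluate functions are illustrative:

```python
def should_retrain(live_accuracy, min_accuracy=0.85,
                   new_samples=0, sample_trigger=10_000):
    """Trigger retraining when performance degrades or enough
    new data has accumulated."""
    return live_accuracy < min_accuracy or new_samples >= sample_trigger

def retraining_cycle(live_accuracy, new_samples, train, evaluate):
    """If triggered, retrain on the latest data and evaluate the
    candidate before promoting it over the current model."""
    if not should_retrain(live_accuracy, new_samples=new_samples):
        return "keep current model"
    candidate = train()
    return "promote" if evaluate(candidate) >= 0.85 else "keep current model"

# Toy stand-ins for the real training and evaluation jobs:
print(retraining_cycle(0.80, 0, train=lambda: "m2",
                       evaluate=lambda m: 0.90))   # promote
print(retraining_cycle(0.95, 500, train=lambda: "m2",
                       evaluate=lambda m: 0.90))   # keep current model
```

In practice an orchestrator such as Airflow or a SageMaker/Kubeflow pipeline runs this check on a schedule or in response to monitoring alerts.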

Conclusion

As the adoption of machine learning skyrockets, the importance of MLOps is higher than ever. MLOps helps automate and streamline machine learning operations. I have tried to list some of the tools used in each key component/stage of the machine learning workflow. Which tools or services you choose for MLOps depends on whether you are running on AWS, Google Cloud, Azure, Databricks, bare metal, or open source.

Vijay regularly posts about AWS, AI/ML, EKS, Kubernetes, and Cloud computing-related topics. Visit https://vijay.eu/posts/ for all his articles, and follow him on LinkedIn.


If you have found this newsletter helpful, and want to support me πŸ™:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://www.sabootcamp.com/​

Keep learning and keep rocking πŸš€,

Raj

Fast Track To Cloud

Free Cloud Interview Guide to crush your next interview. Plus, real-world answers for cloud interviews, and system design from a top Solutions Architect at AWS.
