Training custom models on Vertex AI

Google Cloud Tech
18 Aug 2022 · 08:51

TLDR: In this session, we explore the process of training custom models on Vertex AI, emphasizing the advantages of a managed training service over training directly in a notebook. The tutorial covers containerization with Docker, using pre-built containers for machine learning, and the steps to run a custom training job. It also discusses integrating Cloud Storage for data input and model output, highlighting the ease of use and scalability of Vertex AI for machine learning applications.

Takeaways

  • 📚 Custom training jobs on Vertex AI let you train models whose training takes a long time to complete, which is not always convenient to do in a notebook.
  • 🔄 As models evolve, it's important to retrain them to ensure they stay relevant and produce valuable results, making a managed training service like Vertex AI beneficial for long-term ML applications.
  • 🧱 Containers are used to package application code with its dependencies, allowing for increased portability and ease of dependency management.
  • 📦 Pre-built containers are available on Vertex AI, which can be used if they meet the requirements of your use case. Otherwise, you can use a custom image.
  • 🚀 To run a custom training job, you need to containerize your code with Docker, which is a tool for creating and managing containers.
  • 📂 Training jobs on Vertex AI can access data on cloud storage as if it were part of the local file system, providing high throughput for large file sequential reads.
  • 💾 Ensure that your trained model is saved to cloud storage for later access and deployment, rather than to a local path that may not be accessible after the job completes.
  • 🛠️ The process of running a custom training job involves exporting your notebook to a Python file, updating the code to read and save data to cloud storage, containerizing the code with Docker, and launching the job.
  • 🔗 The Dockerfile is a script that contains commands for building the container image, specifying the base image, copying the code, and setting the entry point for the training application.
  • 🚀 After building the container image, it should be pushed to an artifact registry for storage and later use in launching training jobs on Vertex AI.
  • 📊 The status and details of training jobs can be tracked in the Vertex AI console, where you can also view logs and saved model artifacts in cloud storage.

Q & A

  • Why is a managed training service like Vertex AI beneficial for long-term machine learning projects?

    -A managed training service is beneficial for long-term machine learning projects because it provides convenience for models that take a long time to train, allows for automation of model retraining to keep the model fresh and valuable, and offers additional features such as hyperparameter tuning, distributed training support, and integration with other parts of Vertex AI.

  • What is the role of containers in running custom training jobs on Vertex AI?

    -Containers package the application code along with its dependencies, ensuring that the training code runs consistently across different environments. They provide dependency management, increased portability, and the ability to run virtually anywhere, which is essential for cloud-based training services like Vertex AI.

  • How does Vertex AI Training handle dependencies for custom training jobs?

    -Vertex AI Training provides a set of pre-built containers with necessary libraries. If a pre-built container meets the needs of the training application, the user can provide the training code as a Python source distribution, and Vertex AI will manage the container. For custom dependencies or non-Python code, a custom image can be used.
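
    For illustration, here is a minimal sketch of how a pre-built container might be used through the Vertex AI Python SDK; the project ID, bucket, script name, and container URI below are placeholder values, not taken from the video:

```python
from google.cloud import aiplatform

# Placeholder project, region, and staging bucket.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-bucket",
)

# CustomTrainingJob packages a local training script and runs it
# inside one of Vertex AI's pre-built training containers.
job = aiplatform.CustomTrainingJob(
    display_name="flowers-prebuilt",
    script_path="task.py",  # the exported notebook code
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-8:latest",
    requirements=["pillow"],  # extra pip dependencies, if any
)

job.run(machine_type="n1-standard-8", replica_count=1)
```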

  • What is the process of exporting a Jupyter notebook to a Python file?

    -The process uses the `nbconvert` tool from the terminal: running `jupyter nbconvert --to python <notebook-name>.ipynb` converts the notebook into a Python file.

  • How does Vertex AI Training allow access to data stored in cloud storage during a training job?

    -Vertex AI Training exposes Cloud Storage data as files in the local file system. When a custom training job starts, it sees a directory at `/gcs` (Google Cloud Storage) that contains all Cloud Storage buckets as subdirectories, enabling high-throughput access to large files.
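
    As a concrete sketch of what this looks like in training code (the bucket and directory names below are hypothetical):

```python
import tensorflow as tf

# On Vertex AI, Cloud Storage buckets appear under /gcs, so a
# bucket can be read as if it were a local directory.
data_dir = "/gcs/my-flowers-bucket/flower_photos"  # hypothetical bucket

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    image_size=(180, 180),
    batch_size=32,
)
```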

  • What is the purpose of updating the model's save path to a cloud storage location?

    -Updating the model's save path to a cloud storage location ensures that the trained model is saved to a location that can be accessed later for deployment and predictions. This is important because the training job runs on a machine that won't be accessible after the job completes.
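
    A minimal sketch, again with a hypothetical bucket name:

```python
import tensorflow as tf

# Stand-in model so the snippet is self-contained.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Save to Cloud Storage (mounted at /gcs) instead of a local path;
# the training VM's disk is gone once the job completes.
model.save("/gcs/my-flowers-bucket/model_output")
```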

  • What steps are involved in containerizing training code with Docker?

    -The steps include exporting the notebook to a Python file, updating the code to read and save data to cloud storage, creating a Dockerfile that specifies the base image, copying the training code into the container, and setting the entry point for the training job. After that, the container is built and pushed to Artifact Registry.

  • What is the base container image used for the TensorFlow model in the provided script?

    -The base container image used is the TensorFlow Enterprise GPU Docker image, as it comes with all the necessary packages for the training code.

  • How can one launch a custom training job on Vertex AI using the provided container?

    -One can launch a custom training job by navigating to the training section of the Vertex AI console, selecting 'Create', choosing 'Custom training (advanced)' as the training method, naming the model, setting the container settings to use a custom container and providing the path to the container image in Artifact Registry, and finally clicking 'Start training'.

  • What are the advantages of using the Python SDK for launching training jobs on Vertex AI?

    -Using the Python SDK allows for programmatic and automated launching of training jobs, which can be integrated into CI/CD pipelines, making the process more efficient and less prone to human error compared to using the UI.
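
    A minimal sketch of launching the job programmatically with the SDK; the project, bucket, and Artifact Registry path are placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",             # placeholder project ID
    location="us-central1",
    staging_bucket="gs://my-bucket",  # placeholder staging bucket
)

# Point the job at the custom container pushed to Artifact Registry.
job = aiplatform.CustomContainerTrainingJob(
    display_name="flowers-custom-training",
    container_uri="us-central1-docker.pkg.dev/my-project/my-repo/flowers:latest",
)

job.run(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    replica_count=1,
)
```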

  • How can one track the status and details of a training job on Vertex AI?

    -Under the custom jobs tab in the Vertex AI console, one can track the status of the training job. Clicking on the job's name provides details on the configuration, and once finished, the logs and saved model artifacts can be viewed in the associated cloud storage bucket.

  • What is the next step after training the model on Vertex AI as discussed in the script?

    -The next step, as mentioned in the script, is to use the trained model to get low-latency predictions with the Vertex AI Prediction Service.

Outlines

00:00

🚀 From Notebook to Deployed Model on Vertex AI

This paragraph introduces the transition from experimenting with code in notebooks to running custom training jobs on Vertex AI. The speaker, Nikita, explains the advantages of using a managed training service for machine learning models, especially when it comes to long training times and the need to retrain models to maintain their performance. The paragraph highlights the convenience of Vertex AI's managed training options, which come with features like hyperparameter tuning and distributed training support. It also touches on the concept of containers and how they facilitate the management and portability of application code, with a focus on Docker as the tool for creating these containers.

05:01

🛠️ Containerization and Custom Training with Docker

The second paragraph delves into the specifics of containerizing code with Docker for custom training on Vertex AI. It outlines the process of exporting a notebook to a Python file, updating the code to read and save data from Cloud Storage, and launching a custom training job. The speaker guides the audience through creating a directory structure for the training code, writing a Dockerfile to build the container image, and pushing this image to Artifact Registry. The paragraph emphasizes the importance of understanding Docker basics for machine learning practitioners and provides a step-by-step approach to setting up a custom training job on Vertex AI, including the syntax and commands used in the Dockerfile.

Keywords

💡Vertex AI

Vertex AI is a managed machine learning platform provided by Google Cloud that enables users to train custom machine learning models at scale. In the context of the video, Vertex AI is used to run custom training jobs, which are essential for deploying models that can handle complex tasks and provide valuable insights. The platform offers additional features like hyperparameter tuning and distributed training support, making it easier to optimize and scale machine learning models for production use cases.

💡Custom Training Jobs

Custom training jobs refer to the process of training machine learning models using specific datasets and algorithms that are tailored to the user's requirements. In the video, custom training jobs on Vertex AI involve packaging the training code in containers and running these jobs on the platform to automate and scale the model training process. This is particularly useful for applications that require continuous retraining to stay up-to-date and maintain their performance.

💡Containers

Containers are lightweight, standalone, and executable software packages that include everything needed to run an application, including the code, runtime, system tools, libraries, and settings. In the context of the video, containers are used to package the training code and its dependencies, allowing the code to run consistently across different environments. This is crucial for machine learning applications as it ensures that the training environment remains consistent, regardless of where the code is executed.

💡Docker

Docker is an open-source platform that automates the deployment of applications by using containers. It allows developers to package an application with all of its dependencies into a single container image, which can then be run on any system that supports Docker. In the video, Docker is used to create containers for the training code, making it easy to deploy and run the code in various environments, including the Vertex AI platform.

💡Cloud Storage

Cloud Storage is a service that allows users to store and access data in the cloud. It provides high durability and availability, making it suitable for storing large amounts of data, such as datasets used for training machine learning models. In the video, Google Cloud Storage is used to store the training data and the model artifacts, allowing for easy access and management of these resources in a scalable and cost-effective manner.

💡Hyperparameter Tuning

Hyperparameter tuning is the process of optimizing the performance of a machine learning model by adjusting the hyperparameters, which are the configuration settings for the learning algorithm. This is a critical step in the model development process as it can significantly improve the accuracy and efficiency of the model. In the video, Vertex AI's hyperparameter tuning feature is mentioned as one of the additional benefits of using the platform for custom training jobs.

💡Distributed Training

Distributed training is a method of training machine learning models by dividing the workload across multiple devices or nodes. This approach allows for faster training times and the ability to handle larger datasets than would be possible with a single machine. In the context of the video, distributed training support on Vertex AI means that users can leverage the power of multiple machines to train their models more efficiently, which is particularly beneficial for large-scale and complex machine learning tasks.

💡Model Deployment

Model deployment refers to the process of making a trained machine learning model available for use in applications, typically by hosting the model on a server or in the cloud. In the video, the focus is on training models and saving them to cloud storage, which is a crucial step before deployment. The model can then be accessed and used to make predictions as part of an application or service.

💡TensorFlow

TensorFlow is an open-source software library for machine learning, developed by the Google Brain team. It provides a comprehensive, flexible ecosystem of tools, libraries, and community resources that enables researchers and developers to build and deploy machine learning applications. In the video, TensorFlow is mentioned as the library used for the model being trained, and a pre-built TensorFlow image is used as the base container image for the custom training job.

💡Google Cloud Platform (GCP)

Google Cloud Platform (GCP) is a suite of cloud computing services offered by Google, which includes a variety of hosting, computing, storage, and networking services. In the context of the video, GCP provides the infrastructure for Vertex AI, as well as other services like Cloud Storage and Artifact Registry, which are used for storing and managing the training data and container images.

💡Machine Learning (ML)

Machine Learning (ML) is a subset of artificial intelligence that gives systems the ability to learn from data and make decisions or predictions based on it. It involves the development of algorithms that allow computers to learn from and understand data, improving their performance on specific tasks without being explicitly programmed. In the video, ML is the central theme, with the focus on training custom ML models using Vertex AI and deploying them for applications that require continuous learning and retraining.

Highlights

The episode discusses running custom training jobs on Vertex AI, transitioning from notebook code to a deployed model in the cloud.

Nikita explains the benefits of using a training service for models that take a long time to train, as opposed to running them directly in a notebook.

The importance of retraining models over time to ensure they stay fresh and produce valuable results is emphasized.

A managed ML training option is introduced as a solution for automating experimentation at scale and retraining models for production applications.

Containers are introduced as packages of application code and dependencies, with Docker as the tool to create them.

Vertex AI Training offers pre-built containers, which can be used if they meet the needs of the training application.

Custom images are recommended when the training application requires specific libraries not included in the pre-built images.

The Flowers model uses TensorFlow, so the pre-built TensorFlow image can be used, but knowing how to build custom Docker images is a valuable skill for machine learning practitioners.

The steps to run a training job include exporting the notebook to a Python file, updating code to read and save data to cloud storage, containerizing the code with Docker, and launching the job.

The use of the Cloud Storage FUSE tool is demonstrated for accessing data directly from a Google Cloud Storage bucket.

The process of setting the data directory path to access images in the bucket and saving the trained model to cloud storage is detailed.

A directory structure for the training code is created, followed by the creation of a Dockerfile with specific commands.

The Dockerfile syntax is explained, including the base image, work directory, copying of code, and the entry point command.

Building the container and pushing it to the Artifact Registry is outlined, which is a repository for storing container images.

Launching a custom training job on Vertex AI involves specifying the training method, container settings, and compute, and then starting the training process.

Tracking the status of the training job and accessing the saved model artifacts in the cloud storage bucket is demonstrated.

The next episode will cover using the trained model for low-latency predictions with the Vertex AI Prediction Service.

A codelab is mentioned for detailed instructions on the topics covered, including launching training jobs using the Python SDK.