Training custom models on Vertex AI
TLDR
In this session, we explore the process of training custom models on Vertex AI, emphasizing the advantages of a managed training service over training directly in a notebook. The tutorial covers containerization with Docker, utilizing pre-built containers for machine learning, and the steps to run a custom training job. It also discusses integrating Cloud Storage for data input and model output, highlighting the ease of use and scalability of Vertex AI for machine learning applications.
Takeaways
- 📚 Custom training jobs on Vertex AI let you run training that takes a long time to complete, which is not always convenient to keep alive in a notebook.
- 🔄 Models go stale over time, so it's important to retrain them to keep them relevant and producing valuable results, making a managed training service like Vertex AI beneficial for long-term ML applications.
- 🧱 Containers are used to package application code with its dependencies, allowing for increased portability and ease of dependency management.
- 📦 Pre-built containers are available on Vertex AI, which can be used if they meet the requirements of your use case. Otherwise, you can use a custom image.
- 🚀 To run a custom training job, you need to containerize your code with Docker, which is a tool for creating and managing containers.
- 📂 Training jobs on Vertex AI can access data on cloud storage as if it were part of the local file system, providing high throughput for large file sequential reads.
- 💾 Ensure that your trained model is saved to cloud storage for later access and deployment, rather than to a local path that may not be accessible after the job completes.
- 🛠️ The process of running a custom training job involves exporting your notebook to a Python file, updating the code to read data from and save the model to Cloud Storage, containerizing the code with Docker, and launching the job.
- 🔗 The Dockerfile is a script that contains commands for building the container image, specifying the base image, copying the code, and setting the entry point for the training application.
- 🚀 After building the container image, it should be pushed to an artifact registry for storage and later use in launching training jobs on Vertex AI.
- 📊 The status and details of training jobs can be tracked in the Vertex AI console, where you can also view logs and saved model artifacts in cloud storage.
Q & A
Why is a managed training service like Vertex AI beneficial for long-term machine learning projects?
-A managed training service is beneficial for long-term machine learning projects because it provides convenience for models that take a long time to train, allows for automation of model retraining to keep the model fresh and valuable, and offers additional features such as hyperparameter tuning, distributed training support, and integration with other parts of Vertex AI.
What is the role of containers in running custom training jobs on Vertex AI?
-Containers package the application code along with its dependencies, ensuring that the training code runs consistently across different environments. They provide dependency management, increased portability, and the ability to run virtually anywhere, which is essential for cloud-based training services like Vertex AI.
How does Vertex AI Training handle dependencies for custom training jobs?
-Vertex AI Training provides a set of pre-built containers with necessary libraries. If a pre-built container meets the needs of the training application, the user can provide the training code as a Python source distribution, and Vertex AI will manage the container. For custom dependencies or non-Python code, a custom image can be used.
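A minimal sketch of the pre-built-container path using the Vertex AI Python SDK might look like the following; the project, bucket, package path, module name, and container tag are all hypothetical placeholders, so check the current list of pre-built training containers for a URI matching your framework version.

```python
from google.cloud import aiplatform

# Hypothetical project/region/bucket values; replace with your own.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

# The training code is packaged as a Python source distribution and
# uploaded to Cloud Storage; Vertex AI installs it into the pre-built
# container and runs the named module.
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name="flowers-prebuilt-train",
    python_package_gcs_uri="gs://my-bucket/trainer-0.1.tar.gz",
    python_module_name="trainer.task",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-8:latest",
)

job.run(replica_count=1, machine_type="n1-standard-8")
```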
What is the process of exporting a Jupyter notebook to a Python file?
-The process involves using the `nbconvert` tool from the terminal. The command `jupyter nbconvert --to python` followed by the name of the notebook is executed, which converts the notebook into a Python file.
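For example, with a hypothetical notebook named `flowers.ipynb`:

```bash
# Writes flowers.py alongside the notebook; the filename is a placeholder.
jupyter nbconvert --to python flowers.ipynb
```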
How does Vertex AI Training allow access to data stored in cloud storage during a training job?
-Vertex AI Training exposes Cloud Storage data as files in the local file system. When a custom training job starts, it sees a directory named `/gcs` (Google Cloud Storage) that contains all Cloud Storage buckets as subdirectories, enabling high-throughput sequential reads of large files.
What is the purpose of updating the model's save path to a cloud storage location?
-Updating the model's save path to a cloud storage location ensures that the trained model is saved to a location that can be accessed later for deployment and predictions. This is important because the training job runs on a machine that won't be accessible after the job completes.
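Put together, the two path changes might look like the sketch below, assuming a hypothetical bucket named `flowers-bucket` mounted at `/gcs/flowers-bucket` and a deliberately tiny stand-in model (the real Flowers model is larger):

```python
import tensorflow as tf

# Hypothetical bucket; when the job starts, Vertex AI mounts every
# Cloud Storage bucket under /gcs via Cloud Storage FUSE, so ordinary
# file paths work for both reading data and writing artifacts.
BUCKET = "/gcs/flowers-bucket"

# Read the training images directly from the mounted bucket path.
train_ds = tf.keras.utils.image_dataset_from_directory(
    f"{BUCKET}/flower_photos", image_size=(180, 180), batch_size=32
)

# Tiny placeholder model with 5 outputs (the flowers dataset has 5 classes).
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(180, 180, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(train_ds, epochs=1)

# Save to Cloud Storage, not a local path, so the model outlives the
# training machine that Vertex AI tears down when the job completes.
model.save(f"{BUCKET}/model_output")
```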
What steps are involved in containerizing training code with Docker?
-The steps include exporting the notebook to a Python file, updating the code to read data from and save the model to Cloud Storage, creating a Dockerfile that specifies the base image, copying the training code into the container, and setting the entry point for the training job. After that, the container is built and pushed to Artifact Registry, as in the sketch below.
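The build-and-push step might look like this; the region, project, repository, and image names are placeholders for your own Artifact Registry setup:

```bash
# One-time setup: let Docker authenticate to Artifact Registry in this region.
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build the image from the Dockerfile in the current directory.
IMAGE=us-central1-docker.pkg.dev/my-project/my-repo/flowers-train:latest
docker build -t "$IMAGE" .

# Push it so Vertex AI can pull it when the training job launches.
docker push "$IMAGE"
```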
What is the base container image used for the TensorFlow model in the provided script?
-The base container image used is the TensorFlow Enterprise GPU Docker image, as it comes with all the necessary packages for the training code.
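A minimal Dockerfile along those lines might look like the following, assuming the exported training script lives at `trainer/task.py`; the base-image tag is illustrative, so check the current Deep Learning Containers list for your TensorFlow version:

```dockerfile
# TensorFlow Enterprise GPU base image (tag is illustrative).
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-8

WORKDIR /

# Copy the training code into the container image.
COPY trainer /trainer

# Run the training application when the container starts.
ENTRYPOINT ["python", "-m", "trainer.task"]
```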
How can one launch a custom training job on Vertex AI using the provided container?
-One can launch a custom training job by navigating to the training section of the Vertex AI console, selecting 'Create', choosing 'Custom training (advanced)' as the training method, naming the model, setting the container settings to use a custom container with its path in Artifact Registry, and finally clicking 'Start training'.
What are the advantages of using the Python SDK for launching training jobs on Vertex AI?
-Using the Python SDK allows for programmatic and automated launching of training jobs, which can be integrated into CI/CD pipelines, making the process more efficient and less prone to human error compared to using the UI.
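A sketch of that SDK call, assuming the custom image from the earlier steps has already been pushed to Artifact Registry (project, bucket, image path, and machine settings are placeholders):

```python
from google.cloud import aiplatform

# Hypothetical project/region/bucket values; replace with your own.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

# Point the job at the custom container pushed to Artifact Registry.
job = aiplatform.CustomContainerTrainingJob(
    display_name="flowers-custom-train",
    container_uri="us-central1-docker.pkg.dev/my-project/my-repo/flowers-train:latest",
)

# Launch on a single GPU machine; this call blocks until the job finishes.
job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```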
How can one track the status and details of a training job on Vertex AI?
-Under the custom jobs tab in the Vertex AI console, one can track the status of the training job. Clicking on the job's name provides details on the configuration, and once finished, the logs and saved model artifacts can be viewed in the associated cloud storage bucket.
What is the next step after training the model on Vertex AI as discussed in the script?
-The next step, as mentioned in the script, is to use the trained model to get low-latency predictions with the Vertex AI Prediction Service.
Outlines
🚀 From Notebook to Deployed Model on Vertex AI
This paragraph introduces the transition from prototyping in notebooks to running custom training jobs on Vertex AI. The speaker, Nikita, explains the advantages of using a managed training service for machine learning models, especially when it comes to long training times and the need for retraining models to maintain their performance. The paragraph highlights the convenience of Vertex AI's managed training options, which come with features like hyperparameter tuning and distributed training support. It also touches on the concept of containers and how they facilitate the management and portability of application code, with a focus on Docker as the tool for creating these containers.
🛠️ Containerization and Custom Training with Docker
The second paragraph delves into the specifics of containerizing code with Docker for custom training on Vertex AI. It outlines the process of exporting a notebook to a Python file, updating the code to read data from and save the model to Cloud Storage, and launching a custom training job. The speaker guides the audience through creating a directory structure for the training code, writing a Dockerfile to build the container image, and pushing this image to Artifact Registry. The paragraph emphasizes the importance of understanding Docker basics for machine learning practitioners and provides a step-by-step approach to setting up a custom training job on Vertex AI, including the syntax and commands used in the Dockerfile.
Keywords
💡Vertex AI
💡Custom Training Jobs
💡Containers
💡Docker
💡Cloud Storage
💡Hyperparameter Tuning
💡Distributed Training
💡Model Deployment
💡TensorFlow
💡Google Cloud Platform (GCP)
💡Machine Learning (ML)
Highlights
The episode discusses running custom training jobs on Vertex AI, transitioning from notebook code to a deployed model in the cloud.
Nikita explains the benefits of using a training service for models that take a long time to train, as opposed to running them directly in a notebook.
The importance of retraining models over time to ensure they stay fresh and produce valuable results is emphasized.
A managed ML training option is introduced as a solution for automating experimentation at scale and retraining models for production applications.
Containers are introduced as packages of application code and dependencies, with Docker as the tool to create them.
Vertex AI training offers pre-built containers, which can be used if they meet the needs of the training application.
Custom images are recommended when the training application requires specific libraries not included in the pre-built images.
The Flowers model uses TensorFlow, so the pre-built TensorFlow image would suffice, but the episode builds a custom container anyway, since working with Docker is a valuable skill for machine learning practitioners.
The steps to run a training job include exporting the notebook to a Python file, updating the code to read data from and save the model to Cloud Storage, containerizing the code with Docker, and launching the job.
The use of the Cloud Storage FUSE tool is demonstrated for accessing data directly from a Google Cloud Storage bucket.
The process of setting the data directory path to access images in the bucket and saving the trained model to cloud storage is detailed.
A directory structure for the training code is created, followed by the creation of a Dockerfile with specific commands.
The Dockerfile syntax is explained, including the base image, work directory, copying of code, and the entry point command.
Building the container and pushing it to the Artifact Registry is outlined, which is a repository for storing container images.
Launching a custom training job on Vertex AI involves specifying the training method, container settings, compute, and starting the training process.
Tracking the status of the training job and accessing the saved model artifacts in the cloud storage bucket is demonstrated.
The next episode will cover using the trained model for low-latency predictions with the Vertex AI prediction service.
A code lab is mentioned for detailed instructions on the topics covered, including launching training jobs using the Python SDK.