Environment Setup - Vertex AI for ML Operations [notebook 00]

StatMike
4 Jan 2022 · 24:21

TLDR: In this video, Mike guides viewers through setting up their environment for a machine learning project using Vertex AI on Google Cloud. He covers creating a project, enabling the necessary APIs, and setting up a Jupyter notebook instance. Mike also demonstrates how to clone a GitHub repository, create a storage bucket in Google Cloud Storage, and extract data from BigQuery into a CSV file. The video concludes with a Q&A section addressing cost and cleanup of resources.

Takeaways

  • 📌 The video series is about setting up and running end-to-end machine learning workflows using Jupyter notebooks.
  • 🔧 The first step is to set up the project environment, which includes creating a Google Cloud project and enabling necessary APIs.
  • 📂 A new project is created in the Google Cloud Console, with a specific focus on keeping costs controlled and easy to delete after experimentation.
  • 🔄 Enabling APIs like Vertex AI and Workbench is crucial for using Google Cloud's machine learning services and running notebook instances.
  • 📊 The video demonstrates the creation of a Jupyter notebook instance, which is used to clone and work with the provided GitHub repository.
  • 👨‍💻 The speaker, Mike, uses his alias 'statmike' in the project name and emphasizes how easy projects are to set up and delete for cost management.
  • 🛠️ The video provides a detailed walkthrough of creating a storage bucket in Google Cloud Storage and extracting data from BigQuery into the bucket.
  • 📈 The video walks through installing the packages the workflow needs on the TensorFlow 2.3 instance, such as the Kubeflow Pipelines SDK and Google Cloud Pipeline Components.
  • 🚫 The video addresses potential concerns about charges for resources used, offering solutions for cost management and cleanup of resources.
  • 📋 The speaker encourages viewers to provide feedback and contribute to the GitHub repository for continuous improvement of the workflows.
  • 🎥 The video concludes with a Q&A section, answering common questions about resource management and cleanup.

Q & A

  • What is the main purpose of this video?

    -The main purpose of this video is to guide viewers through the process of setting up their project environment for a series of machine learning workflows using Jupyter notebooks.

  • What type of workflows are described in the video?

    -The workflows described are end-to-end machine learning processes that include grabbing data, preparing data, training a model, evaluating a model, and potentially automating the entire process.

  • What is the first step in setting up the project?

    -The first step is to create a new project in the Google Cloud environment.

  • How can one review the files directly?

    -One can review the files directly by opening them in GitHub and reading them.

  • What is the role of APIs in this setup process?

    -APIs play a crucial role as they need to be enabled for services like Vertex AI and Workbench, which are used for running notebooks and managing resources.
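
For reference, the same APIs can also be enabled from the command line rather than the console. A minimal sketch, run from a notebook cell or Cloud Shell, assuming the current Google Cloud service names:

```python
# Enable the services the series relies on; these service names are the
# current ones and are worth double-checking against the console.
!gcloud services enable aiplatform.googleapis.com   # Vertex AI
!gcloud services enable notebooks.googleapis.com    # Workbench / Notebooks
```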

  • What type of notebook instance is recommended for this tutorial?

    -A TensorFlow 2.3 notebook instance without GPUs is recommended because the modeling techniques in the series are not computationally demanding and the data is small, so training times stay short.
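
The video creates the instance through the console; a hypothetical gcloud equivalent is sketched below, where the instance name, image family, machine type, and zone are placeholder assumptions rather than values confirmed in the video:

```python
%%bash
# Create a CPU-only TensorFlow 2.3 notebook instance (placeholder values).
gcloud notebooks instances create my-instance \
  --vm-image-project=deeplearning-platform-release \
  --vm-image-family=tf2-2-3-cpu \
  --machine-type=n1-standard-4 \
  --location=us-central1-a
```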

  • How is the data extracted from BigQuery?

    -The data is extracted from BigQuery by creating a client, setting a destination as a bucket in Google Cloud Storage, and then creating an extraction job to move the data from the BigQuery table to the specified destination.
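
A minimal sketch of such an extraction job with the google-cloud-bigquery client; the project, bucket, and source table below are placeholders, not the exact names used in the video:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

PROJECT = "your-project-id"   # placeholder
BUCKET = "your-bucket-name"   # placeholder Cloud Storage bucket

client = bigquery.Client(project=PROJECT)

# Extract a BigQuery table to a CSV object in Cloud Storage.
# CSV is the default destination format for extract jobs.
source_table = "bigquery-public-data.ml_datasets.ulb_fraud_detection"  # assumed table
extract_job = client.extract_table(
    source_table,
    f"gs://{BUCKET}/data/extract.csv",
)
extract_job.result()  # block until the extract job finishes
```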

  • What are the main components of Vertex AI that are highlighted in the video?

    -The main components highlighted are data management, model training, model evaluation, and model deployment, which are all part of the machine learning operations journey.

  • How can one avoid charges after completing the experiments?

    -One can avoid charges by either deleting the entire project, which eliminates all resources created within it, or by individually deleting the resources, such as the Cloud Storage bucket and the endpoints created.
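
As a rough sketch of both cleanup paths (every resource name below is a placeholder; run from Cloud Shell or any gcloud-authenticated terminal):

```python
# Option 1: delete the whole project, which removes every resource in it.
!gcloud projects delete your-project-id --quiet

# Option 2: delete resources individually.
!gsutil -m rm -r gs://your-bucket-name  # the bucket and its contents
!gcloud ai endpoints delete ENDPOINT_ID --region=us-central1 --quiet
```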

  • What additional packages are installed for the workflow?

    -Additional packages installed include Kubeflow Pipelines for orchestration, Plotly for interactive graphing, and an updated google-cloud-aiplatform package for interacting with Vertex AI.
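
In a notebook cell, the installs might look like the sketch below; these are the current PyPI package names, which may differ from how the narration refers to them:

```python
!pip install kfp plotly                        # Kubeflow Pipelines SDK, interactive graphing
!pip install -U google-cloud-aiplatform        # Vertex AI client library
!pip install google-cloud-pipeline-components  # prebuilt Vertex pipeline steps
```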

Outlines

00:00

🚀 Project Setup and Introduction

The speaker, Mike, introduces himself as a statistician and Googler passionate about learning and sharing. He welcomes viewers to his office and explains that the video series will cover end-to-end machine learning workflows encapsulated in Jupyter notebooks. The workflows involve grabbing data, preparing it, training a model, evaluating it, and deploying it, potentially automating the entire process. Mike outlines the project's structure and encourages viewers to follow along by either reviewing the files on GitHub or creating a project on Google Cloud to run the notebooks. He guides viewers through the process of creating a project on Google Cloud, enabling the necessary APIs, and setting up a notebook instance to clone the repository and begin working through the notebooks.

05:02

📚 Notebook Instance Creation and Repository Cloning

In this paragraph, Mike explains the process of creating a notebook instance for running the notebooks and emphasizes the importance of selecting the right version of TensorFlow for the series. He details the creation of a new notebook instance without GPUs and the selection of a small machine type. Mike then demonstrates how to clone the repository into the notebook instance and prepare for the next steps. He also explains how to review the notebooks on GitHub and the benefits of running them in JupyterLab.
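
Cloning the repository from a notebook cell or the JupyterLab terminal is a one-liner; the URL below is assumed to be the series repository:

```python
!git clone https://github.com/statmike/vertex-ai-mlops.git
```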

10:03

🛠️ Setting Up the Data and Environment

Mike continues by discussing the next steps in setting up the environment, which include creating a storage bucket in Google Cloud Storage and using BigQuery to extract data into the bucket. He explains how to create a BigQuery client, set up a destination for the data, and execute an extraction job. Mike also covers installing Kubeflow Pipelines for orchestration and Plotly for graphing, and updating the google-cloud-aiplatform package for interacting with Vertex AI. He ensures that viewers understand how to manage and delete resources to avoid unnecessary costs.
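
A minimal sketch of the bucket-creation step with the google-cloud-storage client, assuming the common convention of naming the bucket after the project; the values are placeholders:

```python
from google.cloud import storage  # pip install google-cloud-storage

PROJECT = "your-project-id"  # placeholder
REGION = "us-central1"       # placeholder

gcs = storage.Client(project=PROJECT)
# Create the bucket only if it does not already exist.
if gcs.lookup_bucket(PROJECT) is None:
    gcs.create_bucket(PROJECT, location=REGION)
```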

15:05

💡 Cost Management and Clean Up

This paragraph focuses on cost management and the importance of cleaning up resources to avoid charges. Mike reassures viewers that the setup uses a small compute instance without GPUs to minimize costs. He explains how to delete the entire project to eliminate all associated costs quickly or how to remove individual resources using a dedicated notebook. Mike encourages viewers to provide feedback and suggests that they can contribute to improving the repository by submitting issues on GitHub.

20:08

🎉 Conclusion and Next Steps

Mike concludes the setup video by thanking viewers for their attention and enduring the setup process. He encourages viewers to like, subscribe, and use the notification bell for updates on future videos. Mike emphasizes the importance of collaboration and feedback, inviting viewers to contribute ideas, corrections, or improvements to the GitHub repository. He reiterates the goal of making AI and machine learning more accessible and collaborative for a broader audience.

Keywords

💡Environment Setup

The process of configuring the necessary software, tools, and services required for a project. In the context of the video, it refers to the initial steps taken to prepare for machine learning workflows, including setting up a Google Cloud project, enabling APIs, and creating a notebook instance. This is crucial as it lays the foundation for all subsequent development and execution of machine learning models.

💡Jupyter Notebooks

Jupyter Notebooks are interactive computing environments that allow users to create and share documents containing live code, equations, visualizations, and narrative text. They are widely used in data science and machine learning for prototyping, analysis, and demonstration of algorithms. In the video, Jupyter Notebooks are used to encapsulate machine learning workflows, making it easier to share, recreate, and iterate on the processes.

💡Machine Learning Workflows

A machine learning workflow refers to the series of steps or processes involved in developing, training, evaluating, and deploying machine learning models. These workflows typically include data acquisition, preprocessing, model selection, training, evaluation, and deployment. The video focuses on setting up an environment where these workflows can be executed end-to-end efficiently.

💡Google Cloud Platform (GCP)

Google Cloud Platform is a suite of cloud computing services offered by Google, which includes various tools and platforms for data storage, computing, and machine learning. In the video, GCP is used as the cloud environment where the machine learning workflows are set up and executed, leveraging its services like Google Cloud Storage and Vertex AI.

💡Vertex AI

Vertex AI is a fully managed cloud service by Google that makes it easier to build, deploy, and manage machine learning models. It provides various tools and features for end-to-end machine learning workflows, including data labeling, model training, and prediction. In the video, Vertex AI is enabled as part of the environment setup to facilitate the machine learning processes.
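
A typical first step with the Vertex AI Python SDK is a one-time init call so that later calls inherit the project and region; the values below are placeholders:

```python
from google.cloud import aiplatform  # pip install google-cloud-aiplatform

aiplatform.init(
    project="your-project-id",               # placeholder
    location="us-central1",                  # placeholder region
    staging_bucket="gs://your-bucket-name",  # placeholder bucket
)
```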

💡Workbench

Vertex AI Workbench is a Google Cloud service that provides a managed JupyterLab environment for Jupyter notebooks, allowing users to develop and deploy machine learning applications more easily. It is part of the Google Cloud Platform and integrates with other services like Vertex AI and BigQuery. In the video, Workbench hosts the Jupyter notebook instance where the machine learning workflows are executed.

💡TensorFlow

TensorFlow is an open-source software library for dataflow and differentiable programming across a range of tasks, particularly machine learning. It is used for both research and production, and provides a comprehensive ecosystem of tools, libraries, and community resources. In the video, a specific version of TensorFlow (2.3) is chosen for the machine learning workflows, indicating the use of this library in the development process.
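
A quick way to confirm the instance ships the expected library version, assuming a TF 2.3 image:

```python
import tensorflow as tf

print(tf.__version__)  # expect a 2.3.x version string on this instance
```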

💡BigQuery

Google BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data using standard SQL. In the video, BigQuery is the source of a public dataset, which is extracted to a CSV file in Cloud Storage for further processing in the machine learning workflows.
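
A hedged sketch of querying a public dataset into a DataFrame; the table name is illustrative, since this summary does not name the exact table used in the series:

```python
from google.cloud import bigquery  # to_dataframe() also needs pandas and db-dtypes

client = bigquery.Client(project="your-project-id")  # placeholder
sql = """
SELECT *
FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`
LIMIT 5
"""
df = client.query(sql).to_dataframe()
print(df.head())
```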

💡Cloud Storage

Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. It allows users to store and retrieve data easily and is integrated with other Google Cloud services. In the video, Cloud Storage is used to store a CSV file extracted from BigQuery, which will be used as input for the machine learning workflows.

💡APIs

APIs, or Application Programming Interfaces, are sets of protocols and tools for building software applications that specify how different software components should interact. In the context of the video, APIs refer to the services provided by Google Cloud Platform that allow the Jupyter Notebooks to interact with various Google Cloud services like Vertex AI and BigQuery.

Highlights

The video series focuses on end-to-end machine learning workflows using Jupyter notebooks.

The project involves grabbing data, preparing it, training a model, evaluating, deploying, and possibly automating the process.

The Google Cloud environment is used to recreate and run the notebooks for the tutorial.

A new Google Cloud project, named using Mike's 'statmike' alias, is created for the tutorial.

APIs are enabled for Vertex AI and Workbench, which are essential for the project.

A notebook instance is created without GPUs, using TensorFlow 2.3.

The repository is cloned into the notebook instance for practical work.

The project name and region are set within the Jupyter notebook for consistency.

Google Cloud Storage bucket is created and utilized for data storage.

Public dataset from BigQuery is extracted and saved as a CSV file in the cloud storage.

Package installations are updated for the workflow, including the Kubeflow Pipelines SDK, Google Cloud Pipeline Components, and Plotly.

The google-cloud-aiplatform package is updated for interacting with Vertex AI.

Costs are minimized by using a small compute instance without GPUs and creating small files.

Google Cloud provides free credits for new users to experiment with their services.

Projects can be deleted to avoid future charges, and there's a notebook dedicated to cleaning up resources.

The video series aims to make AI and ML more collaborative, accurate, and approachable.

Feedback and suggestions are encouraged through the GitHub repository for continuous improvement.

The video concludes with a Q&A section addressing questions about charges and resource management.