Creating Your Own Dataset In Hugging Face | Generative AI with Hugging Face | Ingenium Academy

Ingenium Academy
19 Sept 2023 · 10:09

TLDR: This video tutorial from Ingenium Academy guides viewers through working with datasets on Hugging Face, including creating and uploading a custom dataset to the Hugging Face Hub. It covers installing the necessary libraries, loading datasets, and basic preprocessing such as shuffling and splitting. The video also demonstrates how to extract data from the Reuters 21578 dataset, create train, validation, and test splits, and finally push the dataset to the Hugging Face Hub for sharing and future use.

Takeaways

  • 😀 The video teaches how to work with datasets from Hugging Face and create a custom dataset.
  • 🛠️ It's necessary to install libraries like `transformers`, `torch`, and `datasets` to access Hugging Face datasets.
  • 📚 Hugging Face allows loading datasets by simply calling `load_dataset` with the dataset's path.
  • 🔍 Each dataset in Hugging Face is a `DatasetDict` object, which may contain different splits like train, test, and validation.
  • 📈 Some datasets require additional packages to be installed, highlighting the variety of requirements across different datasets.
  • ✂️ Data preprocessing involves operations like shuffling, selecting subsets, and splitting into train and test sets.
  • 🌐 The video demonstrates creating a dataset from the Reuters 21578 dataset, showcasing how to handle raw data.
  • 🔗 The process includes downloading, extracting, and parsing the data using tools like `wget` and `BeautifulSoup`.
  • 📝 The script shows how to convert raw data into a structured format suitable for machine learning tasks.
  • 📊 After preprocessing, the data is split into train, validation, and test sets, which are saved as `.jsonl` files.
  • 🌟 The custom dataset can be uploaded to the Hugging Face Hub, allowing others to access and use it.

Q & A

  • What is the main focus of the video?

    -The main focus of the video is to teach viewers how to work with datasets from Hugging Face, create their own dataset, and push it to their Hugging Face Hub account.

  • Which libraries are required to be installed for working with Hugging Face datasets?

    -The required libraries for working with Hugging Face datasets are `transformers`, `torch`, and `datasets`.

  • How does Hugging Face allow users to load a dataset?

    -Hugging Face allows users to load a dataset by calling the 'load_dataset' function and providing the path to the dataset.
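
A minimal sketch of this step, assuming the libraries above are installed (`pip install transformers torch datasets`). The dataset path below is illustrative; judging by the 'act' and 'prompt' features mentioned in the Highlights, the video appears to use a prompts dataset such as `fka/awesome-chatgpt-prompts`:

```python
from datasets import load_dataset

# Load a dataset by its Hub path. The path here is illustrative; the video
# loads its datasets the same way, by passing the Hub identifier as a string.
dataset = load_dataset("fka/awesome-chatgpt-prompts")
print(dataset)  # prints the resulting DatasetDict with its splits and features
```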

  • What is a `DatasetDict` object and what does it contain?

    -A `DatasetDict` is Hugging Face's dictionary-like container that holds the different splits of a dataset, such as 'train', 'test', and 'validation', each with its respective features.
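
Continuing the sketch, a `DatasetDict` can be inspected like a plain dictionary (which splits exist varies by dataset):

```python
print(dataset.keys())             # e.g. dict_keys(['train'])
train_split = dataset["train"]    # each split is a Dataset object
print(train_split.features)       # column names and types
print(train_split[0])             # a single example, returned as a plain dict
```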

  • What is the significance of the SAMSum dataset mentioned in the video?

    -The SAMSum dataset (the 'Samsung dataset' referred to in the video) is a dialogue summarization dataset used to demonstrate how to load and work with different types of datasets in Hugging Face.

  • Why might some datasets require additional 'pip install' packages?

    -Some datasets have extra dependencies, such as archive-extraction or parsing libraries, that `load_dataset` needs in order to read their files; these packages must be installed with pip before the dataset will load.
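
For example, the SAMSum dataset is distributed as 7z archives, so loading it prompts for the `py7zr` package. A hedged sketch; exact requirements can change between `datasets` versions:

```python
# pip install py7zr   <- extra dependency needed to extract SAMSum's 7z archives
from datasets import load_dataset

# Note: recent versions of `datasets` may also require trust_remote_code=True
# for script-based datasets like this one.
samsum = load_dataset("samsum")
print(samsum["train"][0]["dialogue"][:200])  # peek at the first dialogue
```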

  • What is data preprocessing and why is it necessary?

    -Data preprocessing involves cleaning and organizing data to make it suitable for use in machine learning models. It is necessary to ensure the data is in the correct format and to improve the performance of the models.
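
A minimal sketch of those preprocessing steps, assuming `dataset` is a loaded `DatasetDict` with a 'train' split; the subset size and seeds are illustrative:

```python
shuffled = dataset["train"].shuffle(seed=42)   # remove any natural ordering
subset = shuffled.select(range(100))           # work with a small subset
splits = subset.train_test_split(train_size=0.8, seed=42)
print(splits)  # a new DatasetDict with "train" and "test" splits
```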

  • How does the video demonstrate creating a custom dataset?

    -The video demonstrates creating a custom dataset by using the Reuters 21578 dataset, extracting titles and bodies from articles, and then splitting them into train, validation, and test sets.
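
A hedged sketch of that extraction step, assuming the Reuters 21578 archive has already been downloaded and unpacked into a local directory; the path is illustrative, and the lowercase tag names follow how `html.parser` normalizes the .sgm markup:

```python
import glob
from bs4 import BeautifulSoup

articles = []
for path in glob.glob("reuters21578/*.sgm"):
    # The .sgm files are not valid UTF-8, so latin-1 is a safe decoding choice.
    with open(path, encoding="latin-1") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for item in soup.find_all("reuters"):
        title, body = item.find("title"), item.find("body")
        if title and body:  # skip articles missing either field
            articles.append({"title": title.text, "body": body.text})

print(f"collected {len(articles)} articles")
```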

  • What is the JSONL file format and how is it used in the video?

    -A JSONL ('JSON Lines') file stores one JSON object per line. In the video, the custom dataset's articles are saved as JSONL files, a format that Hugging Face's `load_dataset` can read directly.
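
A sketch of the JSONL round trip, assuming `train_articles`, `val_articles`, and `test_articles` are the lists of article dicts produced by the split step; the filenames are illustrative:

```python
import json

from datasets import load_dataset

def write_jsonl(filename, rows):
    """Write one JSON object per line."""
    with open(filename, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

write_jsonl("train.jsonl", train_articles)
write_jsonl("validation.jsonl", val_articles)
write_jsonl("test.jsonl", test_articles)

# Load the files back as a DatasetDict using the built-in "json" builder.
custom_dataset = load_dataset(
    "json",
    data_files={"train": "train.jsonl",
                "validation": "validation.jsonl",
                "test": "test.jsonl"},
)
```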

  • How can users share their custom dataset on the Hugging Face Hub?

    -Users can share their custom dataset on the Hugging Face Hub by using the 'huggingface_hub' library, logging in with an access token, and pushing the dataset to their Hub account.
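
A sketch of the upload step, assuming a Hugging Face account and a write-enabled access token; the repository name is illustrative:

```python
from huggingface_hub import login

login(token="hf_...")  # paste a token from Settings -> Access Tokens
# In a notebook, huggingface_hub.notebook_login() offers an interactive prompt.

custom_dataset.push_to_hub("your-username/reuters-articles")
```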

  • What is an access token in the context of Hugging Face, and why is it needed?

    -An access token in the context of Hugging Face is a security credential that allows users to authenticate their identity and perform actions such as pushing datasets or models to the Hugging Face Hub. It is needed to ensure that only authorized users can make changes to their repositories.

Outlines

00:00

😀 Introduction to Hugging Face Datasets

This segment introduces working with datasets from Hugging Face. The presenter explains how to install the necessary libraries (`transformers`, `torch`, and `datasets`) and access datasets directly from Hugging Face. They demonstrate loading a dataset with the `load_dataset` function and walk through the structure of the returned object, which can contain 'train', 'test', and 'validation' splits, each with its own features. The presenter also notes that certain datasets need additional packages installed, and shows how to shuffle a dataset and split it into train and test sets during preprocessing.

05:02

📚 Creating a Custom Dataset from the Reuters Archive

In this part, the presenter guides viewers through creating a custom dataset from the Reuters 21578 collection, downloaded from the UCI machine learning archive. They show how to download and decompress the archive, extract article titles and bodies with BeautifulSoup, and build a master list of articles. The articles are then split into training, validation, and test sets, saved in the JSONL file format, and loaded back into a Hugging Face dataset object. The video concludes with a demonstration of uploading the custom dataset to the Hugging Face Hub, which requires an access token for authentication.
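
A sketch of the three-way split described above, applied to the master list with plain Python; the 80/10/10 proportions and the seed are illustrative:

```python
import random

random.seed(42)
random.shuffle(articles)  # the master list of title/body dicts

n = len(articles)
train_articles = articles[: int(0.8 * n)]
val_articles = articles[int(0.8 * n) : int(0.9 * n)]
test_articles = articles[int(0.9 * n) :]
print(len(train_articles), len(val_articles), len(test_articles))
```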

10:02

🚀 Advanced Dataset Manipulation and Model Fine-Tuning

The final paragraph hints at future content where the presenter will delve into more advanced dataset manipulation techniques and model fine-tuning. While the details are not elaborated in this segment, it sets the stage for upcoming videos that will cover these topics in greater depth.

Keywords

💡Hugging Face

Hugging Face is a company that specializes in natural language processing (NLP) and provides a platform for developers to build, train, and deploy NLP models. In the context of the video, Hugging Face is used as a platform to work with datasets, create custom datasets, and push them to the Hugging Face Hub, which is a repository for machine learning models and datasets.

💡Dataset

A dataset in the context of machine learning and NLP is a collection of data used for training, validating, and testing models. The video discusses how to work with datasets from Hugging Face, create custom datasets, and manipulate them for tasks such as shuffling and splitting into training and test sets.

💡Transformers

Transformers are a type of deep learning model architecture that has become the standard for NLP tasks. The video mentions the installation of the 'transformers' library, which is a collection of pre-trained models and tools provided by Hugging Face for working with these models.

💡Torch

Torch here refers to PyTorch, an open-source machine learning library whose Python package is named `torch`. It is used for applications such as computer vision and NLP, and the video lists it as a required installation because Hugging Face's `transformers` models commonly run on top of it.

💡Data Preprocessing

Data preprocessing is the process of cleaning and organizing data to make it suitable for use in a machine learning model. In the video, data preprocessing involves shuffling the dataset to remove any natural ordering and splitting it into training and test sets to prepare it for model training.

💡Train-Test Split

A train-test split is a technique used in machine learning to divide a dataset into a training set and a test set. The video demonstrates how to create a train-test split with a specified train size and random seed to ensure that the model can be trained and evaluated properly.

💡JSON

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and write and easy for machines to parse and generate. The video shows how to save processed data into JSON files, which are then used to create a custom dataset in Hugging Face.

💡Reuters 21578 Dataset

The Reuters 21578 dataset is a collection of Reuters newswire articles widely used for text classification and clustering. In the video, it serves as the raw material for the custom dataset: the presenter extracts the title and body of each article and splits the results into training, validation, and test sets.

💡Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It is used in the video to parse the .sgm files from the Reuters dataset and extract the necessary information such as titles and bodies of the articles.

💡Hugging Face Hub

The Hugging Face Hub is a platform where users can share and discover machine learning models and datasets. The video demonstrates how to push a custom dataset to the Hugging Face Hub, making it accessible to others in the community.

Highlights

Learn how to work with datasets from Hugging Face and create your own dataset.

Install necessary libraries: Transformers, torch, datasets.

Load a dataset using `load_dataset` and access it like a dictionary.

Datasets in Hugging Face have different structures, some with train, validation, and test splits.

Some datasets require additional package installations.

Demonstration of data pre-processing, including shuffling and creating train-test splits.

Explanation of how to access and manipulate dataset features like 'act' and 'prompt'.

Introduction to the Reuters 21578 dataset and its significance.

Process of downloading and extracting the Reuters dataset.

Use of BeautifulSoup to parse .sgm files and extract article titles and bodies.

Transformation of parsed data into a master list of Reuters articles.

Splitting the dataset into train, validation, and test sets.

Conversion of dataset into JSONL format for Hugging Face compatibility.

Loading the custom dataset using `load_dataset` from JSON files.

Sharing the custom dataset to the Hugging Face Hub.

Explanation of how to obtain and use an access token for the Hugging Face Hub.

Demonstration of pushing the dataset to the Hugging Face Hub and its immediate availability.

Overview of advanced dataset operations for future model fine-tuning.