This ML Scientist reproduced Karpathy's GPT-2 for Audio!!!
TLDR
Machine learning engineer Shas Vah adapted Andrej Karpathy's GPT-2 reproduction for audio, enabling the model to accept audio input and produce audio output. Despite heavy overfitting, the model's ability to generate coherent audio sequences from a single small training set is remarkable. Vah discusses his process, the potential of natively multimodal models, and the value of experimenting with new modalities and architectures, emphasizing that more data and compute are needed to develop these models further.
Takeaways
- 😀 Shas Vah, a machine learning engineer, successfully reproduced Andrej Karpathy's GPT-2 for audio, enabling the model to take audio input and generate audio output.
- 🔍 Shas has a background in data science and machine learning, with experience working with large language models (LLMs) at Expedia, where he is part of the NLP team.
- 🎧 The project uses GPT-2 architecture with a focus on audio, inspired by the potential of multimodal models like GPT-4 and the idea of a singular model capable of reasoning across different modalities.
- 🛠️ Shas built an audio tokenizer based on the SNAC model, which converts audio into a sequence of discrete tokens that the GPT-2 model can process (see the sketch after this list).
- 📚 He used a public-domain dataset from LibriVox for training, which consists of audiobooks read by volunteers, and noted that the model overfit quickly because of the small dataset size.
- 🔢 Shas mentioned that training on a larger dataset, such as 10 to 40 billion tokens, could lead to more robust and realistic audio generation.
- 💡 The experiment demonstrated that with the right tokenizer, even an older model like GPT-2 can be adapted to work with new modalities like audio.
- 🚀 Shas is inspired by the potential of LLMs to extend beyond text to other modalities and is interested in exploring voice cloning and generating music with such models.
- 🌐 He encourages others to experiment with different modalities and datasets, emphasizing that it's not as difficult or resource-intensive as people might think.
- 💡 Shas highlighted the importance of compute resources in training models and the potential impact of more efficient models on the availability of such resources.
- 🔬 Shas is excited about the future of LLMs, especially regarding their efficiency and the possibility of running large models locally on devices like phones.
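For readers who have not worked with neural audio codecs, here is a rough sketch of what the tokenization step referenced above looks like in practice. It assumes the open-source `snac` package and its published `hubertsiuzdak/snac_24khz` checkpoint; the exact calls and checkpoint Shas used may differ.

```python
# Minimal sketch: waveform -> discrete SNAC tokens -> waveform.
# Assumes `pip install snac torchaudio`; method names follow the snac README.
import torch
import torchaudio
from snac import SNAC

codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

wav, sr = torchaudio.load("sample.wav")                 # (channels, samples)
wav = torchaudio.functional.resample(wav, sr, 24_000)   # this checkpoint expects 24 kHz
wav = wav.mean(dim=0, keepdim=True).unsqueeze(0)        # mono, shape (1, 1, samples)

with torch.inference_mode():
    codes = codec.encode(wav)    # list of integer-code tensors, one per hierarchy level
    recon = codec.decode(codes)  # reconstructed waveform, shape (1, 1, samples)

print([c.shape for c in codes])  # coarse levels have fewer tokens than fine levels
```

The key point is that after `encode`, the audio is just sequences of integer ids, which is exactly the kind of data a GPT-2-style model is built to predict.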
Q & A
What is the main achievement of Shas Vah in the field of machine learning as described in the transcript?
-Shas Vah, a machine learning engineer, successfully reproduced Andrej Karpathy's GPT-2 model for audio, enabling it to take audio input and generate audio output while leaving the GPT-2 architecture itself unchanged.
What is the significance of Shas Vah's project in the context of multimodal AI models?
-Shas Vah's project is significant as it demonstrates the potential for a single model to handle multiple modalities natively, without the need for separate heads for different types of data like text, images, or audio.
What is the current limitation of Shas Vah's audio GPT-2 model as mentioned in the transcript?
-The model currently overfits heavily on its training data, so it is not yet a polished model that others can use directly.
What is the educational background of Shas Vah in relation to machine learning?
-Shas Vah holds a degree and a Master's in data science from Warwick, giving him a strong background in machine learning and model building, which he has further developed working on language models at Expedia.
How does Shas Vah's model differ from previous attempts at multimodal AI models like Meta's Chameleon?
-Shas Vah's model uses a single architecture that is natively multimodal, rather than attaching separate model heads for different data types, as seen in some previous attempts like Meta's Chameleon.
What inspired Shas Vah to attempt the reproduction of GPT-2 for audio?
-Shas Vah was inspired to reproduce GPT-2 for audio after watching Andrej Karpathy's video, by his interest in exploring language model capabilities beyond text, and by the launch of GPT-4 with its native multimodality.
What is the role of the tokenizer in Shas Vah's audio GPT-2 model?
-The tokenizer in Shas Vah's model plays a crucial role in converting audio into a sequence of tokens that can be processed by the GPT-2 architecture, allowing the model to understand and generate audio.
What dataset did Shas Vah use to train his audio GPT-2 model?
-Shas Vah used a public-domain dataset from LibriVox to train his audio GPT-2 model, which he tokenized and formatted for training.
How long did it take for Shas Vah's model to start showing overfitting on the training data?
-The model started showing signs of overfitting within a few thousand training steps, specifically around 4,000 to 5,000 steps.
What is the potential next step for Shas Vah in improving his audio GPT-2 model?
-The next step for Shas Vah could be to build a larger and more diverse dataset to pre-train the model on, in order to improve its capabilities and reduce overfitting.
What are Shas Vah's thoughts on the future of large language models and their efficiency?
-Shas Vah is excited about the increasing efficiency of large language models, with the possibility of running them on devices like smartphones and the potential for local training of models like GPT-4 in the future.
Outlines
🤖 Machine Learning Engineer's Audio GPT-2 Adaptation
Shas, a machine learning engineer, discusses his project to adapt Andrej Karpathy's GPT-2 model for audio input and output. He explains that despite the model's overfitting, the successful adaptation is impressive in itself. Shas has a background in machine learning and NLP, and the project was inspired by the potential for multimodal reasoning in large language models. He details how he adapted the model to handle audio natively and his experience training it from scratch.
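To make "adapting the model" concrete, the sketch below shows how little would need to change if a nanoGPT-style `GPTConfig` is reused: the network stays the same and only the vocabulary size moves from GPT-2's BPE tokens to the flattened audio-codec tokens. The sizes here are placeholders for illustration, not the values from the video.

```python
# Sketch, assuming Karpathy's nanoGPT `model.py` is importable
# (https://github.com/karpathy/nanoGPT). The architecture is reused untouched;
# only the vocabulary changes from text BPE ids to audio-codec ids.
from model import GPTConfig

# Placeholder numbers: 3 codec levels of 4096 codes each, kept in disjoint id
# ranges after flattening, plus a few special tokens (e.g. end-of-audio).
num_levels, codes_per_level, num_special = 3, 4096, 4
audio_vocab_size = num_levels * codes_per_level + num_special  # 12292

text_cfg = GPTConfig(vocab_size=50257)               # original GPT-2 text setup
audio_cfg = GPTConfig(vocab_size=audio_vocab_size)   # same network, audio token ids
```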
🔊 Exploring Audio Tokenization and Model Training
This paragraph delves into the specifics of Shas's audio GPT-2 project, focusing on the use of the SNAC tokenizer to convert audio into a hierarchical sequence of tokens. Shas explains how the hierarchical tokens are flattened into a single sequence and how quickly the model learns that token format. He also discusses the limitations of his small LibriVox dataset and the model's rapid overfitting, highlighting the need for a larger and more varied dataset to improve results.
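Below is a minimal sketch of the flattening idea, assuming three codebook levels at a 1:2:4 temporal ratio and per-level id offsets so that GPT-2 sees one flat vocabulary; the exact interleaving order and codebook sizes Shas used may differ.

```python
# Sketch: interleave hierarchical codec tokens into one flat sequence and back.
# Assumes 3 levels where, per coarse step, level 1 has 1 token, level 2 has 2,
# and level 3 has 4 (a 1:2:4 ratio); each level is shifted into its own id range.
from typing import List

CODES_PER_LEVEL = 4096  # placeholder codebook size

def flatten_codes(levels: List[List[int]]) -> List[int]:
    """levels[0] is the coarse stream; levels[1] is 2x longer; levels[2] is 4x longer."""
    coarse, mid, fine = levels
    flat = []
    for t in range(len(coarse)):
        # One frame group: 1 coarse + 2 mid + 4 fine tokens, each with an id offset.
        flat.append(coarse[t])                                         # ids 0..4095
        flat.extend(c + CODES_PER_LEVEL for c in mid[2*t:2*t + 2])     # ids 4096..8191
        flat.extend(c + 2*CODES_PER_LEVEL for c in fine[4*t:4*t + 4])  # ids 8192..12287
    return flat

def unflatten_codes(flat: List[int]) -> List[List[int]]:
    """Invert flatten_codes so generated tokens can go back through the codec."""
    coarse, mid, fine = [], [], []
    for i in range(0, len(flat), 7):  # 7 tokens per frame group
        group = flat[i:i + 7]
        coarse.append(group[0])
        mid.extend(g - CODES_PER_LEVEL for g in group[1:3])
        fine.extend(g - 2*CODES_PER_LEVEL for g in group[3:7])
    return [coarse, mid, fine]

# Tiny example with fake codes: one coarse step expands to 7 flat tokens.
example = [[7], [11, 12], [21, 22, 23, 24]]
assert unflatten_codes(flatten_codes(example)) == example
```

Keeping each level in its own id range means a single embedding table covers the whole hierarchy, which is what lets an otherwise unmodified GPT-2 learn the structure.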
📈 Model Overfitting and the Quest for Realistic Audio Output
Shas talks about the model's learning process, noting that it quickly learned to format sequences in the SNAC layout but struggled to generate varied and realistic audio because of the limited dataset. He discusses the model's performance at different training stages and the challenge of producing diverse outputs. Shas emphasizes the importance of data variety and the potential of training at a larger scale to achieve higher-quality audio generation.
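For concreteness, here is a generic, hypothetical sampling loop over the flattened audio tokens with temperature and top-k controls; `model` stands for any GPT-2-style network that returns next-token logits, not Shas's actual checkpoint. On a heavily overfit model the outputs collapse toward the training data no matter how these knobs are set, which is why the fix he points to is more varied data rather than sampling tricks.

```python
# Hypothetical sketch: autoregressive sampling of flattened audio tokens.
import torch
import torch.nn.functional as F

@torch.inference_mode()
def sample_audio_tokens(model, prompt: torch.Tensor, max_new: int = 700,
                        temperature: float = 0.9, top_k: int = 50) -> torch.Tensor:
    tokens = prompt  # shape (1, T): e.g. a few frame groups of real audio tokens
    for _ in range(max_new):
        logits = model(tokens)[:, -1, :] / temperature    # last-position logits
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")   # keep only the top-k options
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens  # unflatten and decode with the codec to recover a waveform
```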
💡 Insights on Model Training and Future Directions
In this section, Shas reflects on the insights gained from the project, including the feasibility of adapting text-based models for audio and the potential for multimodal applications. He suggests that with more data and compute much better results are possible, and he considers experimenting with different voices and zero-shot voice cloning. Shas also notes that a larger, cleaner dataset would be needed to improve training.
🚀 The Potential of LLMs and Personal Experimentation
Shas expresses excitement about the future of large language models (LLMs), particularly their efficiency and the possibility of running them on devices like smartphones. He discusses the impact of recent announcements from tech giants and the potential for local model training. Shas encourages others to experiment with LLMs, suggesting that one could train a model on a small dataset within hours using free resources like Google Colab.
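To give a sense of what such a run could look like, the sketch below is a hypothetical minimal training loop over one pre-tokenized audio file on a single GPU; it assumes Karpathy's nanoGPT `model.py` for the network, and the file name and hyperparameters are placeholders rather than Shas's actual settings.

```python
# Hypothetical minimal training run on one GPU (e.g. a free Colab T4).
# Assumes tokens.bin holds flattened audio token ids as uint16, and that
# nanoGPT's model.py is importable (model(x, y) returns (logits, loss)).
import numpy as np
import torch
from model import GPT, GPTConfig  # https://github.com/karpathy/nanoGPT

device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.from_numpy(np.fromfile("tokens.bin", dtype=np.uint16).astype(np.int64))

block_size, batch_size = 1024, 8

def get_batch():
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix]).to(device)
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix]).to(device)
    return x, y

model = GPT(GPTConfig(vocab_size=12292, block_size=block_size)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

for step in range(5000):   # a tiny dataset starts overfitting well before this
    x, y = get_batch()
    _, loss = model(x, y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, round(loss.item(), 3))
```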
🔧 Experimentation with New Model Architectures
The conversation turns to Shas's interest in experimenting with newer model architectures like Mamba and the potential benefits of training with these models for audio tasks. He considers the possibility of faster training and better performance with these architectures and the importance of keeping up with the latest developments in the field.
🌐 Open Source Inspirations and Community Engagement
Shas shares his interest in open-source projects, particularly the MatMul-free LM work, which demonstrates the potential for training models with much lower memory usage. He discusses the implications of such projects for making large-scale model training more accessible. Shas also talks about his engagement on social media platforms like Twitter and LinkedIn, where he shares his work and invites collaboration.
📝 Conclusion and Call for Community Involvement
In the final paragraph, Shas concludes the discussion by reiterating the importance of community involvement and experimentation in the field of LLMs. He invites others to follow him on social media for updates on his work and expresses a desire to connect with others who share his interests, including those who might be able to provide access to GPUs for further research.
Keywords
💡Machine Learning Engineer
💡GPT-2
💡Overfitting
💡Multimodality
💡Tokenizer
💡Audio Dataset
💡Inference
💡NLP (Natural Language Processing)
💡Fine-tuning
💡Compute
💡Mamba
Highlights
Machine Learning Engineer Shas Vah has ported Andrej Karpathy's GPT-2 for audio, creating a model that takes audio input and outputs audio.
The model is based on GPT-2 architecture and demonstrates the potential of native audio processing without the need for separate heads for different modalities.
Although the model currently overfits, the fact that it works at all feels magical and is a significant insight in itself.
Shas Vah has a background in data science and machine learning, with experience in NLP and LLMs at Expedia.
The project uses a modified version of the original GPT-2 code, with changes primarily in the data and tokenization process.
Shas discusses the use of the SNAC tokenizer, which converts audio into a hierarchical structure of tokens.
The tokenizer flattens the hierarchical tokens into a sequence for input into the GPT-2 model.
The model was trained on a small dataset from LibriVox, leading to quick overfitting but demonstrating the model's ability to learn the format of the sequence.
The model's ability to generate audio is currently limited by the size of the dataset and the variety of data available.
Shas shares his process of training the model, including the use of a single GPU and the compute time required.
The potential for training larger models with more data is discussed, highlighting the need for more substantial datasets.
Shas considers the possibility of training the model on multiple voices and the potential for zero-shot voice cloning.
The conversation touches on the efficiency of LLMs and the potential for running models like Mixtral on mobile devices.
Shas is inspired by recent developments in LLMs, such as the ability to run models locally on devices and the potential for training larger models in the future.
The project's implications for multimodal AI and the potential for extending GPT-2's capabilities to other modalities are discussed.
Shas encourages others to experiment with different modalities and datasets, emphasizing that training models is not as difficult as people think.
The interview concludes with Shas sharing his thoughts on the future of LLMs and the importance of experimentation and innovation in the field.