* This blog post is a summary of this video.

Advancing AI: New Multimodal Models for Minecraft Agents and Text-to-Video

Table of Contents

* Jarvis 1: AI Agent for Complex Planning and Execution in Minecraft
* Emu Video: High-Fidelity 4-Second Video Generation
* Orca 2: Teaching Smaller Models to Reason
* Claude 2.1: 200k Context Size, Lower Errors
* OpenGPT: An Open-Source Alternative to OpenAI's GPTs
* Optimized Whisper Transcribes 150 Mins in 100 Seconds
* FAQ

Jarvis 1: AI Agent for Complex Planning and Execution in Minecraft

Jarvis 1 is a research AI agent designed to perform complex planning and execution of tasks within the game Minecraft. The agent uses a multimodal language model that can retrieve and add memories to formulate plans. These plans are then sent to a planner module and finally executed by a controller that interfaces with the game through keyboard and mouse inputs.

As shown in the examples, Jarvis 1 is able to break down complex crafting tasks into sequential sub-tasks, gather the required materials, and ultimately craft the target item. Over successive attempts, the agent is able to learn from failures, understand what resources or steps were missing, and improve its plans.

This represents a major advance in creating AI agents that can reason about tasks in simulated 3D environments. The complex planning and tight integration with the Minecraft engine push the boundaries of agent capabilities.

Multimodal Architecture for Planning and Execution

The key innovation enabling Jarvis 1's impressive performance is its multimodal architecture consisting of several interconnected components: First, the multimodal language model takes a natural language description of the task and can access external memory storage to formulate a conceptual plan. Next, the symbolic planner module converts this into a formal sequence of low-level game actions. Finally, the controller executes the plan by sending keyboard and mouse inputs to Minecraft. This clean separation of planning and execution, along with the ability to leverage large language models, gives Jarvis 1 an advantage over previous Minecraft agents.
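To make the division of labor concrete, here is a rough Python sketch of that three-stage loop. It is purely illustrative: the class and method names are invented for this post and are not taken from the Jarvis 1 codebase, and the language model and game interface are replaced with stubs.

```python
# Rough sketch of the three-stage pipeline: memory-augmented language model,
# planner, and controller. All names here are invented for illustration and are
# not taken from the Jarvis 1 codebase.

from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """External memory of past plans and outcomes that the language model can query."""
    entries: list = field(default_factory=list)

    def retrieve(self, task: str) -> list:
        # Naive keyword match; the real agent uses multimodal retrieval.
        return [e for e in self.entries if task in e["task"]]

    def add(self, task: str, plan: list, success: bool) -> None:
        self.entries.append({"task": task, "plan": plan, "success": success})


class MultimodalPlanner:
    """Stand-in for the language model: task + retrieved memories -> high-level plan."""

    def __init__(self, memory: MemoryStore):
        self.memory = memory

    def propose_plan(self, task: str) -> list:
        _past_attempts = self.memory.retrieve(task)
        # A real implementation would prompt the multimodal LLM with the task,
        # the current observation, and the retrieved memories.
        return ["gather wood", "craft workbench", "mine iron ore",
                "smelt iron ingots", "craft shears"]


class Controller:
    """Turns each sub-task into low-level keyboard and mouse actions."""

    def execute(self, step: str) -> bool:
        print(f"executing: {step}")
        return True  # stand-in for actually driving the game


if __name__ == "__main__":
    memory = MemoryStore()
    planner = MultimodalPlanner(memory)
    controller = Controller()
    plan = planner.propose_plan("craft shears")
    success = all(controller.execute(step) for step in plan)
    memory.add("craft shears", plan, success)  # store the outcome for future attempts
```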

Improving Over Multiple Game Epochs

An intriguing aspect of Jarvis 1 is its ability to learn over successive attempts at a task, referred to as game epochs. In the shears crafting example, Jarvis 1 failed in the first two epochs due to missing ingredients or tools. However, by epoch 3 the agent successfully completed the full sequence of mining, smelting, plank crafting, workbench crafting, and finally shears crafting. This demonstrates a key capability of autonomous self-improvement through experience. Each failure provides signal for the agent to adjust its knowledge and plan accordingly the next time.
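Conceptually, this boils down to a retry loop in which each failed epoch writes what was missing back into memory before replanning. The toy sketch below illustrates that loop in isolation; attempt_task and update_memory are hypothetical stand-ins, not Jarvis 1 code.

```python
# Toy illustration of the multi-epoch loop: each failure records what was missing
# so the next plan can account for it. attempt_task and update_memory are
# hypothetical hooks, not Jarvis 1 code.

known_prereqs = []  # stands in for the agent's memory of learned prerequisites


def attempt_task(task: str):
    """Pretend to attempt the task; fail if any prerequisite is still unknown."""
    needed = ["iron ingots", "workbench"]
    missing = [item for item in needed if item not in known_prereqs]
    return len(missing) == 0, missing


def update_memory(task: str, success: bool, missing: list) -> None:
    """Record one newly discovered missing prerequisite per failed epoch."""
    if missing:
        known_prereqs.append(missing[0])


def run_epochs(task: str, max_epochs: int = 3) -> bool:
    for epoch in range(1, max_epochs + 1):
        success, missing = attempt_task(task)
        update_memory(task, success, missing)
        if success:
            print(f"epoch {epoch}: {task} completed")
            return True
        print(f"epoch {epoch}: failed, missing {missing}; replanning with updated memory")
    return False


run_epochs("craft shears")
```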

Emu Video: High-Fidelity 4-Second Video Generation

Emu Video is a new text-to-video generation model from Meta AI that achieves unprecedented quality and coherence for 4-second video clips. As seen in the examples, Emu Video creates highly realistic and smooth motion, maintaining consistent style and physical plausibility across frames.

In formal human evaluations against competing models such as Pika, Gen-2, and Imagen Video, Emu Video attained win rates upwards of 80%, cementing its place as state-of-the-art in this rapidly advancing domain.

With further advances in duration, fidelity, and controllability, text-to-video models like Emu Video could become versatile tools for media creation or for augmenting simulations and games.

Generating Diverse 4-Second Video Clips

While today's top text-to-video models typically generate clips of around one second, Emu Video pushes duration to four full seconds with no loss of quality or coherence. The samples showcase smooth motion such as a robot walking, dynamic backgrounds like waving flags, particle effects like splashing water, and camera movement capturing multiple angles. Moreover, the style is remarkably photorealistic and consistent across frames, marking tremendous progress over earlier generations of text-to-video models.

Benchmark Results Across Datasets

Emu Video was evaluated against other state-of-the-art models on three labeled video datasets: Actuality, Kinetics, and Moments in Time. Across datasets, Emu Video attained win rates of 67-83% based on side-by-side human judgments. This quantitative benchmarking demonstrates Emu Video's superior performance across the full distribution of video types, not just cherry-picked qualitative examples.
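For readers curious how such side-by-side win rates are tallied, the snippet below shows the basic bookkeeping: count, per dataset, how often raters preferred one model in pairwise comparisons. The judgment records are invented for illustration, not real evaluation data.

```python
# Toy illustration of computing side-by-side win rates from pairwise human judgments.
# The judgment records below are invented for the example, not real evaluation data.

from collections import Counter

# Each record: which model the rater preferred for one prompt.
judgments = [
    {"dataset": "Kinetics", "preferred": "emu_video"},
    {"dataset": "Kinetics", "preferred": "baseline"},
    {"dataset": "Kinetics", "preferred": "emu_video"},
    {"dataset": "Moments in Time", "preferred": "emu_video"},
    {"dataset": "Moments in Time", "preferred": "emu_video"},
    {"dataset": "Moments in Time", "preferred": "baseline"},
]

totals = Counter(j["dataset"] for j in judgments)
wins = Counter(j["dataset"] for j in judgments if j["preferred"] == "emu_video")

for dataset in totals:
    win_rate = 100 * wins[dataset] / totals[dataset]
    print(f"{dataset}: {win_rate:.0f}% win rate over {totals[dataset]} judgments")
```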

Orca 2: Teaching Smaller Models to Reason

Orca 2 from Microsoft Research focuses on imbuing reasoning abilities in smaller language models by leveraging the capabilities of a larger teacher model, GPT-4.

By carefully generating synthetic training data using GPT-4, Orca 2 models with just 7B-13B parameters are able to match or exceed the performance of baseline models over 5x larger on logical reasoning benchmarks.

This demonstrates the viability of distilling complex reasoning skills into smaller, more deployable models by relying on a higher-capability teacher during training.

Using Synthetic GPT-4 Data

The key insight is that a much larger model like GPT-4 has strong inherent reasoning skills. By using GPT-4 to generate large-scale synthetic training data, those abilities can be distilled. Concretely, prompts are fed to GPT-4 and its responses are treated as new data for training the smaller Orca 2 model on question answering. This allows Orca 2 to learn robust reasoning patterns from GPT-4's demonstrations rather than relying solely on human-generated training sets.
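A minimal sketch of this kind of synthetic-data generation is shown below, assuming the OpenAI Python SDK's chat completions interface. The prompt wording, file format, and example questions are illustrative placeholders, not the actual Orca 2 data recipe.

```python
# Illustrative sketch of building a synthetic training set from a teacher model.
# Prompt wording, file format, and questions are assumptions, not Orca 2's recipe.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a careful reasoner. Think through the problem step by step, "
    "then give the final answer."
)

questions = [
    "If a train leaves at 3pm traveling 60 mph, how far has it gone by 5:30pm?",
    "Tom is taller than Sue, and Sue is taller than Ann. Who is shortest?",
]

with open("synthetic_reasoning_data.jsonl", "w") as f:
    for q in questions:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": q},
            ],
        )
        answer = resp.choices[0].message.content
        # Each (question, teacher answer) pair becomes a training example
        # for supervised fine-tuning of the smaller student model.
        f.write(json.dumps({"prompt": q, "completion": answer}) + "\n")
```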

Efficiency Through Distillation

Orca 2 models with just 7B-13B parameters were able to match or exceed baseline models 5-10x larger on complex reasoning, commonsense reasoning, and WinoGrande benchmarks. For example, Orca 2 with 13B parameters achieves 71.6% accuracy on WinoGrande compared to 70.9% for a LLaMA-2 baseline over 5x its size. This represents up to 10x greater parameter efficiency in encoding reasoning skills by learning from a higher-capability teacher model.

Claude 2.1: 200k Context Size, Lower Errors

Anthropic has released Claude 2.1 with notable upgrades, including a massive 200k token context window for long-form processing and significantly reduced hallucination rates.

With hallucination rates 2x lower than Claude 2.0, Claude 2.1 is better positioned to take on complex, open-ended conversations without spurious fabricated content.

Moreover, the 200k context window equips Claude 2.1 to tackle books, scientific papers, movie transcripts, and other long-form content within a single prompt, avoiding fragmented context.

Massive 200k Token Context

Claude 2.1 now supports a default context length of 200,000 tokens, among the largest of any publicly available conversational model. This enables retaining the complete context of entire books and papers, rather than small excerpts, when asking complex, context-dependent follow-up questions. With more competition emerging in long-context language models, Claude 2.1's expanded window and lowered hallucination risk give it an edge.
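As a rough illustration of what a 200k-token window allows, the sketch below sends an entire book to Claude 2.1 in one prompt using the anthropic Python SDK's Messages API. The file path and question are placeholders, and an ANTHROPIC_API_KEY is assumed to be set in the environment.

```python
# Minimal sketch of feeding a long document to Claude 2.1 in a single prompt.
# Assumes the anthropic Python SDK; the file path and question are placeholders.

import anthropic

client = anthropic.Anthropic()

with open("whole_book.txt") as f:
    book_text = f.read()  # should fit within the 200k-token window

response = client.messages.create(
    model="claude-2.1",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": (
                "Here is a full book:\n\n" + book_text +
                "\n\nSummarize how the protagonist changes between the first and last chapter."
            ),
        }
    ],
)

print(response.content[0].text)
```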

Reduced Hallucinations

A persistent issue for large language models is hallucination: generating responses not grounded in the actual context. The problem becomes more pronounced as context grows longer. Anthropic claims Claude 2.1 cuts hallucination rates by 2x compared to Claude 2.0 through improved training methods. Combined with the 200k token context window, the lower hallucination rate further strengthens Claude 2.1's capabilities for long-form dialogue.

OpenGPT: An Open-Source Alternative to OpenAI's GPTs

In the wake of OpenAI's closed-source GPTs, LangChain has introduced OpenGPT, an open-source alternative that provides additional transparency, customization, and extensibility.

Building on top of existing LangChain capabilities, OpenGPT enables adding custom knowledge sources and analytics as well as public sharing of model configurations.

For those wanting more flexibility from conversational models beyond what OpenAI offers, OpenGPT represents an intriguing alternative thanks to LangChain's modular open-source foundation.

Custom Tools and Analytics

Unlike OpenAI's GPTs, which come with a fixed set of built-in capabilities, OpenGPT allows connecting custom tools such as search, Q&A, and code execution. Developers can build on top of OpenGPT's capabilities or inspect its internal workings via interfaces like LangSmith. This transparency and customization make it possible to adapt OpenGPT to specialized domains where rigid, one-size-fits-all models may falter.
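As a toy example of what connecting a custom tool looks like in practice, the sketch below registers a trivial word-counting tool with a LangChain agent using the classic initialize_agent interface that was current when OpenGPT launched. It is not code from the OpenGPT repository, and newer LangChain releases expose different entry points.

```python
# Toy example of attaching a custom tool to a LangChain agent, in the spirit of
# how OpenGPT lets you register your own tools. Uses the classic late-2023
# initialize_agent interface; the word-count tool itself is deliberately trivial.

from langchain.agents import AgentType, initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.tools import Tool


def word_count(text: str) -> str:
    """Count the words in the given text."""
    return f"{len(text.split())} words"


tools = [
    Tool(
        name="word_counter",
        func=word_count,
        description="Counts the number of words in a piece of text.",
    )
]

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

print(agent.run("How many words are in the sentence 'open models invite open tools'?"))
```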

Shareable Configurations

Useful configurations tailored to specific topics like medicine or computer science can be shared publicly as reusable OpenGPT instances. With callable API endpoints, these could power focused virtual assistants accessible to anyone, rather than narrow proprietary chatbots. OpenGPT's custom interfaces, tools, and shareable instances underpin its potential as an open alternative.

Optimized Whisper Transcribes 150 Mins in 100 Seconds

A new multi-stage Whisper pipeline combines speaker segmentation, chunking, and enhanced transcription models to transcribe 150 minutes of audio in as little as 100 seconds.

Trading some accuracy for speed, the optimized Whisper pipeline reaches up to 100x real-time transcription on long-form audio such as podcasts and videos.

With the runtime-versus-accuracy tradeoff controlled by a few parameters, the approach can be tuned for either precision transcription or quick-and-dirty automated captions.

Multi-Model Speaker Diarization

The first innovation chunks long audio by detecting speaker changes, using a spectrogram U-Net for segmentation followed by wav2vec 2.0 for speaker embeddings. Chunking audio by speaker avoids wasting time transcribing irrelevant silence and allows chunks to be processed in parallel, multiplying the speedup. Speaker diarization paves the way for up to 2500x faster transcription compared to running Whisper sequentially.
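A simplified sketch of the chunk-then-transcribe idea is shown below using the openai-whisper package. The speaker segments are hard-coded stand-ins for the output of a diarization step, and the chunks are processed sequentially here, whereas the real pipeline runs them in parallel to achieve its speedups.

```python
# Simplified sketch: transcribe only the diarized speech segments of a long recording.
# Segment boundaries are hard-coded placeholders for a real diarization step.

import whisper

SAMPLE_RATE = 16000  # whisper loads audio at 16 kHz

# (start_sec, end_sec, speaker) segments, normally produced by speaker diarization
segments = [
    (0.0, 42.5, "speaker_1"),
    (45.0, 110.0, "speaker_2"),
    (112.0, 180.0, "speaker_1"),
]

model = whisper.load_model("base")
audio = whisper.load_audio("podcast.wav")

for start, end, speaker in segments:
    # Slicing to speech-only chunks skips the silence between speakers entirely.
    chunk = audio[int(start * SAMPLE_RATE): int(end * SAMPLE_RATE)]
    result = model.transcribe(chunk)
    print(f"[{speaker}] {result['text'].strip()}")
```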

Quality/Speed Configurability

There are inherent tradeoffs between transcription speed and accuracy based on parameters. At one extreme, 105 minutes can be transcribed in just 1 second - fast enough for live subtitling - but at only 27% accuracy. At the other end, an accuracy of over 97% can be reached at a still-impressive 100x speed, enabling practical applications.

FAQ

Q: How does Jarvis 1 differ from previous Minecraft agents?
A: Jarvis 1 features a multimodal architecture for complex planning and execution as well as built-in self-improvement capabilities.

Q: What benchmarks was Emu Video tested against?
A: Emu Video was tested against Pika, Gen-2, Imagen Video, and other recent text-to-video models, achieving state-of-the-art win rates in human evaluations.

Q: Why create small reasoning models like Orca 2?
A: Smaller reasoning models require far less compute while still providing capabilities useful for real-world applications.

Q: What causes the hallucinations in large language models?
A: As context length increases, language models tend to generate more text that is not grounded in the actual inputs provided.

Q: Can OpenGPT execute arbitrary code?
A: Yes, OpenGPT provides a Python REPL tool that allows code execution and interaction within conversations.

Q: Why optimize Whisper for fast transcription?
A: Very long audio can take hours to transcribe normally; optimized Whisper addresses this by providing order-of-magnitude speedups.