* This blog post is a summary of this video.

AI Chatbot Comparison - Evaluating Leading Language Models on Math, Summarization, Availability

Comparing Chatbot Math Ability with Simple Word Problem

This section evaluates the math ability of several leading chatbots using a simple word problem. Two people, Joe and John, are painting a house together. John believes that painting the house alone would take him 3 times as long as painting it together with Joe. Working together, they finished painting the house in 24 hours. The task is to work out how long each of them would take alone.

We will see how accurately each chatbot interprets and solves this math word problem, assessing their capability in mathematical reasoning.
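
For reference, the problem can be worked through under the reading that this post treats as correct: John alone takes 3 times the joint time. The following is a minimal sketch, assuming constant work rates that simply add when the two work together (an assumption the post does not spell out). The function name `solve_painting_problem` is hypothetical and used only for illustration.

```python
from fractions import Fraction


def solve_painting_problem(together_hours, john_multiplier):
    """Return (john_alone_hours, joe_alone_hours).

    Assumes constant work rates that add when both people work together.
    """
    if john_multiplier <= 1:
        raise ValueError("John alone must be slower than the pair working together")
    together = Fraction(together_hours)
    # John alone takes `john_multiplier` times the joint time.
    john_alone = together * john_multiplier
    # Rates add: 1/together = 1/john_alone + 1/joe_alone.
    joe_rate = 1 / together - 1 / john_alone
    joe_alone = 1 / joe_rate
    return john_alone, joe_alone


john_hours, joe_hours = solve_painting_problem(24, 3)
print(f"John alone: {john_hours} h, Joe alone: {joe_hours} h")
# prints "John alone: 72 h, Joe alone: 36 h"
```

Under these assumptions John would need 72 hours alone and Joe 36 hours alone, which gives a baseline for reading the chatbot answers below.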

ChatGPT Analysis

ChatGPT incorrectly interpreted the key statement that John believes it would take 3 times longer for him to paint the house alone compared to working with Joe. Instead, ChatGPT assumed Joe would take 3 times longer than John. With this flawed assumption, ChatGPT solved the problem inaccurately, stating John could paint the house alone in 32 hours and Joe in 96 hours.

Bing Chat Analysis

Like ChatGPT, Bing Chat misinterpreted the statement about how much longer John believes the job would take him alone, and its answers followed from that wrong assumption. It concluded that Joe could paint the house alone in 32 hours and John in 96 hours.

Perplexity AI Analysis

Perplexity AI made the inaccurate assumption that it would take Joe 3 times as long to paint the house alone compared to John. With this wrong interpretation, Perplexity AI also arrived at incorrect solutions, stating John could paint the house alone in just 12 hours while Joe would take 36 hours.

Claude Analysis

Claude correctly interpreted the initial statement that John believed it would take 3 times longer for him to paint the house alone compared to working with Joe. Claude accurately derived that if together they took 24 hours, then John alone would take 72 hours. However, Claude then contradicted its earlier correct interpretation and made an incorrect assumption to calculate Joe's time, concluding Joe could paint the house alone in 8 hours.

Google Bard Analysis

Google Bard set up its symbolic variables incorrectly from the start, assuming Joe's time was one third of John's, which conflicts with the word problem's description. Working from this inaccurate setup, Google Bard also derived incorrect solutions for how long each would take to paint the house alone.
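
The reported answers can be checked against the two conditions in the word problem. This is a rough sketch, assuming constant work rates that add when Joe and John work together. The helper names `combined_hours` and `check` are hypothetical, and Google Bard is omitted because no specific figures are given for it above.

```python
from fractions import Fraction

TOGETHER_HOURS = 24   # joint time stated in the word problem
JOHN_MULTIPLIER = 3   # John alone takes 3 times the joint time

# Solo times in hours as reported in the analyses above: (John, Joe).
reported = {
    "ChatGPT": (32, 96),
    "Bing Chat": (96, 32),
    "Perplexity AI": (12, 36),
    "Claude": (72, 8),
}


def combined_hours(john, joe):
    """Joint time implied by two solo times, assuming additive work rates."""
    return 1 / (Fraction(1, john) + Fraction(1, joe))


def check(john, joe):
    """Test a pair of solo times against both conditions of the problem."""
    return {
        "john_is_3x_together": john == JOHN_MULTIPLIER * TOGETHER_HOURS,
        "combined_is_24h": combined_hours(john, joe) == TOGETHER_HOURS,
    }


for name, (john, joe) in reported.items():
    print(name, check(john, joe))
```

Under these assumptions, the ChatGPT and Bing Chat figures are consistent with the 24-hour joint time but not with John's 3x statement, Claude's figures satisfy John's statement but not the joint time, and Perplexity AI's figures satisfy neither.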

Evaluating Chatbot Ability to Summarize Lengthy Content

In this section, we test the summarization skills of the chatbots by asking them to summarize a lengthy blog article. First, we attempt to directly input a long text excerpt to see if they can process it. For chatbots with lower text limits but internet access, we provide the URL to leverage their connectivity.

We assess their competence in distilling key details from a long input while maintaining accuracy.
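
The test procedure described above (paste the text if it fits, otherwise fall back to a URL when the chatbot can browse) can be sketched as a small decision function. This is only an illustration of the procedure, not any chatbot's API: the function name, the mode labels, and the internet-enabled example are hypothetical, and only the 30,000-character limit and the 352,000-character length are figures mentioned in this post.

```python
def choose_input_mode(text_length, char_limit, has_internet):
    """Pick how to hand a long article to a chatbot.

    text_length  -- length of the article in characters
    char_limit   -- maximum characters the chatbot accepts as pasted input
    has_internet -- whether the chatbot can fetch a URL itself
    """
    if text_length <= char_limit:
        return "paste_text"      # the excerpt fits, so paste it directly
    if has_internet:
        return "send_url"        # too long to paste, but the chatbot can browse
    return "cannot_process"      # too long and no web access


# A 352,000-character article against a 30,000-character limit, no web access.
print(choose_input_mode(352_000, 30_000, has_internet=False))  # prints "cannot_process"
# The same article for a hypothetical chatbot with the same limit but web access.
print(choose_input_mode(352_000, 30_000, has_internet=True))   # prints "send_url"
```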

ChatGPT Summarization Capability

When given a lengthy text excerpt that exceeded its input limit of 30,000 characters, ChatGPT returned an error saying the text was too long. When given only the URL of the blog article instead, ChatGPT stated that it does not have internet access and cannot process web content.

Bing Chat Summarization

Bing Chat also has an input text limit, but when given the URL of the lengthy blog article it accessed the content on the web and generated an accurate, concise summary of the key details.

Perplexity AI Summarization

Perplexity AI has internet access, so when given the URL of the long blog article it pulled the content from the web and summarized it competently, showing strong summarization of web-based content.

Claude Summarization

When given a lengthy 352,000-character text excerpt, Claude processed it and accurately summarized it into a concise overview of the key points, highlighting its large input capacity and its summarization ability.

Google Bard Summarization

Given the URL, Google Bard used its internet connectivity to find the lengthy blog article online and generated a solid summary covering the core details, showing that it can summarize web content.

Comparing Chatbot Availability and Accessibility

In this section we analyze the global availability and accessibility of the different chatbots across countries and regions.

Bing Chat and ChatGPT have the widest global availability and are currently accessible in over 160 countries. Country-level availability for Perplexity AI is unclear. Claude is available only in the US and UK, though it can be reached from elsewhere through a VPN workaround. Google Bard has the narrowest launch of the chatbots compared and is currently available only in the US.

Conclusion and Recommendations

In the mathematical reasoning evaluation, none of the chatbots solved the problem fully. Claude came closest, correctly deriving John's solo time before miscalculating Joe's. For summarizing lengthy content from web sources, Bing Chat, Perplexity AI, and Google Bard performed strongly, while Claude handled the full pasted text directly. In terms of accessibility, ChatGPT and Bing Chat have the widest country reach.

Based on these assessments, Bing Chat demonstrates the most robust all-round capabilities spanning math reasoning, web content summarization, and global availability. Claude shows promise in certain areas but has limitations. Continued evaluations are recommended as these AI chatbots rapidly evolve.

FAQ

Q: Which chatbot did the best on the math word problem?
A: None of the chatbots were able to solve the math word problem completely correctly. Claude came the closest by correctly determining how long it would take John to paint the house alone.

Q: Which chatbot was able to summarize the lengthy blog post?
A: Bing Chat, Perplexity AI, and Google Bard successfully summarized the lengthy blog post from its URL because they have internet access. Claude summarized it from the full text pasted directly as input. ChatGPT was unable to summarize the post.

Q: Which chatbot is available in the most countries?
A: Bing Chat and ChatGPT have the widest availability of the compared chatbots, each accessible in over 160 countries.