How to scrape the web for LLMs in 2024: Jina AI (Reader API), Mendable (firecrawl) and Scrapegraph-ai

LLMs for Devs
17 May 2024 · 20:22

TLDR: In 2024, startups like Jina AI and Mendable are revolutionizing web scraping for Large Language Models (LLMs). Jina AI offers a free 'reader API' for clean data extraction, while Mendable's 'firecrawl' uses LLMs for web scraping. Scrapegraph-ai is an open-source project for creating web scraping pipelines with AI. The video demonstrates scraping competitor pricing pages using these tools, comparing costs, and using OpenAI's GPT-4 for data extraction, highlighting the efficiency and cost-effectiveness of third-party scraping tools over traditional methods like BeautifulSoup.

Takeaways

  • 😀 In 2024, startups are pivoting into web scraping, likely due to the high interest in keeping Large Language Models (LLMs) up-to-date with current data.
  • 🤖 Mendable, the company behind the robot-icon chat widget found on many documentation sites, introduced 'firecrawl', a tool for scraping the web for LLMs.
  • 🚀 Jina AI offers a 'reader API' that lets users retrieve clean data from a website by prefixing its URL with 'https://r.jina.ai/', without needing an API key.
  • 📈 The open-source project 'scrapegraph-ai' facilitates web scraping with LLMs by composing different Python modules into graph-based scraping pipelines.
  • 💼 The speaker is using these tools for market research, specifically to scrape competitor pricing pages in the Learning and Development space.
  • 🔗 OpenAI's 'tiktoken' library is used to count the number of tokens generated from scraped content, helping to estimate the cost of feeding it to LLMs.
  • 📊 A comparison is made between BeautifulSoup, Jina AI, and Mendable in terms of token cost, with a focus on efficiency and cost-effectiveness.
  • 💬 The output from different scraping tools varies; some provide markdown, while others offer more human-readable text, affecting the ease of use for LLMs.
  • 💰 The cost analysis shows significant differences in token usage and potential expenses when using different web scraping tools with LLMs.
  • 🔍 Scrapegraph-ai is highlighted as a powerful open-source tool for orchestrating web scraping tasks, allowing for detailed control over the scraping process.

Q & A

  • What is the significance of the year 2024 in the context of web scraping for LLM?

    -In 2024, there is a notable shift with startups, particularly those from recent Y Combinator batches, pivoting into web scraping. This is likely due to the growing interest in scraping the web to provide up-to-date answers for LLMs (Large Language Models).

  • What is Mendable's 'firecrawl' and how does it relate to web scraping?

    -Mendable's 'firecrawl' is a tool designed for scraping the web for large language models. It crawls websites and returns their content as clean, markdown-style text, giving LLM applications a more reliable way to extract information from the web than raw HTML.

  • Can you explain the concept of 'embedding models' as mentioned in the context of Jina AI?

    -Embedding models in the context of Jina AI refer to a type of machine learning model that can convert input data, such as text, into a numerical format that can be used for various tasks like search and recommendation. Jina AI offers these models and allows users to try them without an API key, indicating a generous free tier for experimentation.

  • What is the Reader API by Jina AI and how does it simplify web scraping?

    -The Reader API by Jina AI simplifies web scraping by letting users prefix any URL with 'https://r.jina.ai/', which returns clean, structured data from that website. This reduces the complexity of web scraping by automating the extraction process and providing human-readable output (see the minimal sketch at the end of this Q&A section).

  • What is Scrapegraph-ai and how does it differ from other tools mentioned?

    -Scrapegraph-ai is an open-source project that composes different Python modules into data pipelines for web scraping. Unlike tools that only return clean input text, Scrapegraph-ai adds an LLM step at the end of the pipeline to answer questions about the scraped content, offering a more end-to-end scraping solution.

  • Why is the speaker interested in scraping competitors' pricing pages?

    -The speaker is interested in scraping competitors' pricing pages because they are building a product in the Learning and Development space. By understanding the pricing structures of competitors like Articulate 360, they can make informed decisions for their own product development.

  • What is 'tiktoken' and how is it used in the context of large language models?

    -tiktoken is the tokenization library OpenAI uses for its GPT models. In the context of large language models, it is used to count the number of tokens generated from the scraped content. This helps in estimating the cost of using these models, as they are charged based on the number of tokens processed.

  • How does the cost of web scraping using different tools compare in terms of tokens?

    -The cost of web scraping varies significantly depending on the tool used. For instance, using Beautiful Soup might result in a higher token count and thus higher costs, while tools like Mendable's 'firecrawl' or Jina AI's Reader API provide cleaner, more optimized outputs that require fewer tokens, potentially reducing costs.

  • What is the purpose of using 'prettytable' in the script?

    -'prettytable' is a Python library used to create readable tables in the terminal. In the script, it is used to organize and display the results of web scraping in a structured format, making it easier to compare the effectiveness and cost of different web scraping tools.

  • How does the output from different web scraping tools compare in terms of readability and usefulness for LLMs?

    -The output from different web scraping tools varies in readability and usefulness. Tools like Jina AI's Reader API produce very human-readable output, which is ideal for tasks that require reasoning. In contrast, firecrawl's output leans more heavily on markdown structure, which may suit large language models that prioritize clean data over readability.
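
For readers who want to try the prefix trick themselves, here is a minimal sketch in Python; only the 'https://r.jina.ai/' prefix comes from the video, and the target URL is illustrative:

```python
# Minimal sketch: fetch a page as clean, LLM-ready text via Jina's Reader API
# by prefixing the target URL with https://r.jina.ai/ (no API key required).
import requests

target = "https://example.com/pricing"        # illustrative URL
resp = requests.get("https://r.jina.ai/" + target, timeout=30)
resp.raise_for_status()

clean_text = resp.text                        # markdown-flavoured, human-readable
print(clean_text[:500])
```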

Outlines

00:00

🌐 Web Scraping Trends and Tools

The speaker discusses the emerging trend of startups, particularly those from Y Combinator, pivoting into web scraping. They highlight the growing interest in this field, possibly due to the need for up-to-date information for LLMs. They mention Mendable as an example, a platform that uses a robot icon for natural language search queries within documentation sites. The speaker introduces 'firecrawl', a tool for web scraping using large language models, and Jina AI, known for its embedding models that can be tested without an API key. They also touch on 'scrapegraph-ai', an open-source project that orchestrates different Python modules to create scraping pipelines. The speaker plans to use these tools to scrape competitor pricing pages for market research in the Learning and Development space.

05:01

🔍 Comparing Web Scraping Techniques

The speaker sets up a comparison between different web scraping tools: Beautiful Soup, Jina AI, and Mendable. They install Beautiful Soup, a straightforward but easily detectable scraping tool, and set up a function to scrape with it. Jina AI is described as simple to use, requiring only a string prefix before a URL to retrieve clean data. Mendable, best known for its documentation chatbots, has pivoted to also offer a scraping service that requires an API key. The speaker plans to run these tools simultaneously to scrape the same websites and compare their efficiency and output quality. They also mention the importance of tokenization in large language models, specifically using OpenAI's tiktoken to count tokens in the scraped content and estimate its cost.
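
The video's exact helper isn't reproduced here, but a minimal BeautifulSoup function along the lines described might look like this; the URL and headers are illustrative:

```python
# Minimal BeautifulSoup scraper: fetch a page and strip it down to visible text.
import requests
from bs4 import BeautifulSoup

def scrape_with_beautifulsoup(url: str) -> str:
    # A plain requests call like this is easy for sites to detect and block.
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # get_text() keeps all visible text but drops HTML structure, which can
    # still be noisier (and more token-hungry) than curated markdown output.
    return soup.get_text(separator="\n", strip=True)

print(scrape_with_beautifulsoup("https://example.com/pricing")[:500])
```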

10:03

💸 Cost Analysis of Scraping Tools

The speaker conducts a cost analysis of the web scraping tools by comparing the number of tokens generated by each tool when scraping content. They use the prettytable library in Python to organize the data in rows and columns for easy comparison. The analysis aims to determine which tool is the most cost-effective for the speaker's web scraping needs. The speaker also discusses the tokenization process and how it affects cost, mentioning that newer generations of models like GPT-4 can reduce costs due to more efficient tokenization. They run the scraping process and provide a cost comparison table, highlighting the differences in token usage and potential cost savings.
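
A hedged sketch of this kind of comparison table; the token counts and the per-token price below are placeholders, not the figures measured in the video:

```python
# Illustrative cost comparison: token counts and price are placeholders,
# not the numbers from the video.
from prettytable import PrettyTable

PRICE_PER_1K_INPUT_TOKENS = 0.01     # assumed USD price per 1,000 input tokens

token_counts = {                      # hypothetical token counts per tool
    "BeautifulSoup": 12_000,
    "Jina Reader": 2_500,
    "Firecrawl": 2_200,
}

table = PrettyTable()
table.field_names = ["Tool", "Tokens", "Approx. cost ($)"]
for tool, tokens in token_counts.items():
    cost = round(tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS, 4)
    table.add_row([tool, tokens, cost])

print(table)
```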

15:05

📊 Extracting and Analyzing Scraped Data

The speaker uses OpenAI's GPT-4 model to extract specific information from the scraped data, focusing on competitor pricing tiers and costs. They set up an OpenAI client and use a utility function to display the extracted content in a table. The speaker runs GPT-4 on the inputs from the previous scraping tools and requests JSON outputs with the pricing information. They find that while some tools provide accurate and usable data, others may not fully comply with the extraction requests, highlighting the importance of being specific with prompts when working with large language models. The speaker also introduces 'Scrapegraph-ai', an open-source tool for creating scraping pipelines with graph data structures.
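
A hedged sketch of that GPT-4 extraction step with the OpenAI Python client; the model name, prompt, and JSON shape are assumptions rather than the video's exact code:

```python
# Hedged sketch: ask GPT-4 to pull pricing tiers out of scraped text as JSON.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

scraped_text = "Starter $19/mo, Team $49/mo, Enterprise: contact us"  # illustrative

completion = client.chat.completions.create(
    model="gpt-4-turbo",                      # assumed model name
    response_format={"type": "json_object"},  # ask for a JSON reply
    messages=[
        {
            "role": "system",
            "content": (
                "Extract the pricing tiers from the user's text and reply as "
                "JSON with a 'tiers' list of objects having 'name' and 'price'."
            ),
        },
        {"role": "user", "content": scraped_text},
    ],
)

pricing = json.loads(completion.choices[0].message.content)
print(pricing)
```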

20:06

🛠️ Demonstrating Scrape Graph and User Interaction

In the final part, the speaker demonstrates Scrapegraph-ai by taking a live example where they scrape a website for electric unicycle models and their specifications. They interact with the audience to determine what information to extract and successfully retrieve the model names and speeds. The speaker also discusses the token consumption of Scrapegraph-ai and shares a method to calculate it, which they learned from the tool's community discussions. The session concludes with a Q&A where the speaker addresses any questions from the audience.

Keywords

💡Web Scraping

Web scraping is the process of extracting data from websites. It's a technique used by the presenter to gather information on competitor pricing from various Learning and Development platforms. In the script, web scraping is central to the demonstration of how different tools can be utilized to extract and process data for business intelligence purposes.

💡LLM (Large Language Models)

Large Language Models (LLMs) refer to advanced artificial intelligence models that can process and understand human language at scale. The video discusses how these models can be employed to analyze and interpret the scraped data. An example from the script is the use of LLMs to process markdown outputs from different scraping tools.

💡Jina AI

Jina AI is a company mentioned in the script that offers embedding models and tools for data extraction. The 'Reader API' from Jina AI is highlighted as a tool that simplifies the process of web scraping by returning clean data from URLs. It's portrayed as a user-friendly and efficient method for obtaining structured data from web pages.

💡Mendable

Mendable is introduced as a platform that has shifted its focus to include web scraping capabilities. Their 'firecrawl' feature is specifically designed for scraping the web using large language models. The script illustrates how firecrawl can be used to extract more structured and markdown-friendly data compared to traditional scraping methods.
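
As a hedged illustration of a single-page scrape with firecrawl, assuming the firecrawl-py SDK's FirecrawlApp.scrape_url interface and a markdown field in the response; the API key and URL are placeholders:

```python
# Hedged sketch: scrape one page with Mendable's firecrawl and keep the
# markdown output for an LLM. SDK interface and field names are assumptions.
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])  # placeholder key

result = app.scrape_url("https://example.com/pricing")  # illustrative URL
markdown = result.get("markdown", "")                    # assumed response field
print(markdown[:500])
```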

💡Scrapegraph-ai

Scrapegraph-ai is an open-source project mentioned in the script, which involves orchestrating different Python modules to create data graphs. It's used to build a pipeline for web scraping and incorporates AI to answer specific questions or perform tasks on the scraped data, such as extracting pricing tiers from a website.
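
A hedged sketch of such a pipeline, following Scrapegraph-ai's SmartScraperGraph quickstart as documented around that time; the model choice, config keys, and target URL are assumptions:

```python
# Hedged sketch of a Scrapegraph-ai pipeline: scrape a page and have an LLM
# answer a question about it in a single run. Config and model are assumptions.
import os
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": os.environ["OPENAI_API_KEY"],
        "model": "gpt-3.5-turbo",
    },
}

smart_scraper = SmartScraperGraph(
    prompt="List the pricing tiers and their monthly cost.",
    source="https://example.com/pricing",   # illustrative URL
    config=graph_config,
)

result = smart_scraper.run()                 # typically a dict-like answer
print(result)
```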

💡API Key

An API key is a code passed in by computer programs calling an API to identify the calling program, its developer, or its user. In the context of the video, Mendable requires an API key to use their web scraping services, which the presenter uses to demonstrate the capabilities of the tool.

💡Tokenization

Tokenization in the context of the video refers to the process of converting text into tokens, which are the elements that large language models use for processing. The presenter discusses how different models, like GPT-3 and GPT-4, have different tokenization schemes, affecting the cost of using these models for tasks like web scraping.
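
A minimal sketch of that counting step with tiktoken; the sample text and model name are illustrative:

```python
# Count how many tokens a piece of scraped text would consume for a GPT model.
import tiktoken

scraped_text = "Pro plan: $30 per user per month ..."   # illustrative content

encoding = tiktoken.encoding_for_model("gpt-4")          # model-specific encoding
num_tokens = len(encoding.encode(scraped_text))
print(f"{num_tokens} tokens")
```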

💡Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It's one of the tools compared in the script for web scraping. The video shows that while it can provide detailed HTML data, it may not be as optimized for use with large language models as other tools.

💡Markdown

Markdown is a lightweight markup language with plain-text-formatting syntax. The script discusses how some web scraping tools output data in markdown format, which is more suitable for processing with large language models. This format is highlighted as a feature of tools like firecrawl and Jina AI's Reader API.

💡Cost Analysis

Cost analysis in this video refers to the comparison of expenses associated with using different web scraping tools in conjunction with large language models. The presenter calculates the cost of tokens used by each tool to determine which is the most cost-effective for their scraping needs.

Highlights

In 2024, startups are pivoting into web scraping, especially those from recent Y Combinator batches.

Mendable introduces 'firecrawl', a tool for web scraping using large language models.

Jina AI offers a 'reader API' that cleans data from any URL without an API key.

Scrapegraph-ai is an open-source project for creating web scraping pipelines with AI.

The presenter is scraping competitor pricing pages for market research in the Learning and Development space.

tiktoken is used to count the number of tokens for large language models, affecting cost.

Beautiful Soup is a straightforward web scraping tool but can be easily detected.

Firecrawl outputs data in a markdown format that is more suitable for large language models.

Jina AI's reader API provides human-readable markdown, which is beneficial for reasoning tasks.

The cost of web scraping can vary significantly between tools, affecting the choice for businesses.

Scrapegraph-ai allows for complex orchestration of web scraping tasks using different Python modules.

The presenter demonstrates the use of GPT-4 for entity extraction from web scraping data.

Different web scraping tools have varying levels of success in extracting pricing information.

Scrapegraph-ai is shown to accurately extract model names and speeds from a website.

The importance of being specific with prompts when working with large language models is emphasized.

The presenter discusses the cost implications of using different web scraping tools with large language models.