How to scrape the web for LLM in 2024: Jina AI (Reader API), Mendable (firecrawl) and Scrapegraph-ai
TLDRIn 2024, startups like Jina AI and Mendable are revolutionizing web scraping for Large Language Models (LLMs). Jina AI offers a free 'reader API' for clean data extraction, while Mendable's 'firecrawl' uses LLMs for web scraping. Scrapegraph-ai is an open-source project for creating web scraping pipelines with AI. The video demonstrates scraping competitor pricing pages using these tools, comparing costs, and using OpenAI's GPT-4 for data extraction, highlighting the efficiency and cost-effectiveness of third-party scraping tools over traditional methods like BeautifulSoup.
Takeaways
- 😀 In 2024, startups are pivoting into web scraping, likely due to the high interest in keeping Large Language Models (LLMs) up-to-date with current data.
- 🤖 Mendable introduced 'firecrawl', a tool for web scraping using LLMs, which can be accessed through a simple robot icon on certain documentation sites.
- 🚀 Jina AI offers a 'reader API' that allows users to retrieve clean data from websites by prefixing a URL with 'aen.g.com', without needing an API key.
- 📈 The open-source project 'scrapegraph-ai' facilitates web scraping with LLMs by creating a pipeline of different Python modules to generate graphs.
- 💼 The speaker is using these tools for market research, specifically to scrape competitor pricing pages in the Learning and Development space.
- 🔗 The 'Tik token' from OpenAI is used to count the number of tokens generated from scraped content, helping to estimate costs associated with using LLMs.
- 📊 A comparison is made between BeautifulSoup, Jina AI, and Mendable in terms of token cost, with a focus on efficiency and cost-effectiveness.
- 💬 The output from different scraping tools varies; some provide markdown, while others offer more human-readable text, affecting the ease of use for LLMs.
- 💰 The cost analysis shows significant differences in token usage and potential expenses when using different web scraping tools with LLMs.
- 🔍 Scrapegraph-ai is highlighted as a powerful open-source tool for orchestrating web scraping tasks, allowing for detailed control over the scraping process.
Q & A
What is the significance of the year 2024 in the context of web scraping for LLM?
-In 2024, there is a notable shift with startups, particularly those from recent Y Combinator batches, pivoting into web scraping. This is likely due to the growing interest in scraping the web to provide up-to-date answers for LLMs (Large Language Models).
What is Mendable's 'firecrawl' and how does it relate to web scraping?
-Mendable's 'firecrawl' is a tool specifically designed for web scraping using large language models. It allows users to perform natural language search queries on documentation sites, providing a more intuitive and efficient way to extract information from the web.
Can you explain the concept of 'embedding models' as mentioned in the context of Jina AI?
-Embedding models in the context of Jina AI refer to a type of machine learning model that can convert input data, such as text, into a numerical format that can be used for various tasks like search and recommendation. Jina AI offers these models and allows users to try them without an API key, indicating a generous free tier for experimentation.
What is the Reader API by Jina AI and how does it simplify web scraping?
-The Reader API by Jina AI simplifies web scraping by allowing users to append 'api.jina.ai' to any URL, which returns clean, structured data from that website. This API reduces the complexity of web scraping by automating the extraction process and providing human-readable outputs.
What is Scrapegraph-ai and how does it differ from other tools mentioned?
-Scrapegraph-ai is an open-source project that uses different Python modules to create data pipelines for web scraping. Unlike other tools that provide clean inputs, Scrapegraph-ai incorporates AI to answer questions at the end of the scraping process, offering a more comprehensive scraping solution.
Why is the speaker interested in scraping competitor's pricing pages?
-The speaker is interested in scraping competitor's pricing pages because they are building a product in the Learning and Development space. By understanding the pricing structures of competitors like Articulate 360 and others, they can make informed decisions for their own product development.
What is 'Tik token' and how is it used in the context of large language models?
-Tik token is a tokenization library used by OpenAI for their GPT models. In the context of large language models, it is used to count the number of tokens generated from the scraped content. This helps in estimating the cost of using these models, as they are often charged based on the number of tokens processed.
How does the cost of web scraping using different tools compare in terms of tokens?
-The cost of web scraping varies significantly depending on the tool used. For instance, using Beautiful Soup might result in a higher token count and thus higher costs, while tools like Mendable's 'firecrawl' or Jina AI's Reader API provide cleaner, more optimized outputs that require fewer tokens, potentially reducing costs.
What is the purpose of using 'pretty table' in the script?
-The 'pretty table' is a Python library used to create readable tables in the terminal. In the script, it is used to organize and display the results of web scraping in a structured format, making it easier to compare the effectiveness and cost of different web scraping tools.
How does the output from different web scraping tools compare in terms of readability and usefulness for LLMs?
-The output from different web scraping tools varies in terms of readability and usefulness. Tools like Jina AI's Reader API provide very human-readable outputs, which are ideal for tasks that require reasoning. In contrast, 'firecrawl' outputs are more marked-down, which might be more suitable for large language models that prioritize clean data over readability.
Outlines
🌐 Web Scraping Trends and Tools
The speaker discusses the emerging trend of startups, particularly those from Y Combinator, pivoting into web scraping. They highlight the growing interest in this field, possibly due to the need for up-to-date information for platforms like LMS. They mention 'mendable' as an example, a platform that uses a robot icon for natural language search queries within documentation sites. The speaker introduces 'fire crawl', a tool for web scraping using large language models, and 'Gina AI', known for its embedding models that can be tested without an API key. They also touch on 'scrape graph AI', an open-source project that orchestrates different Python modules to create scraping pipelines. The speaker plans to use these tools to scrape competitor pricing pages for market research in the Learning and Development space.
🔍 Comparing Web Scraping Techniques
The speaker sets up a comparison between different web scraping tools: Beautiful Soup, Gina AI, and Mendable. They install Beautiful Soup, a straightforward but easily detectable scraping tool, and set up a function to scrape with it. Gina AI is described as simple to use, requiring only a string before a URL for clean data retrieval. Mendable, which recently pivoted to include documentation chatbots, also offers a scraping service that requires an API key. The speaker plans to run these tools simultaneously to scrape the same websites and compare their efficiency and output quality. They also mention the importance of tokenization in large language models, specifically using 'Tik token' from OpenAI to count and cost tokens based on scraped content.
💸 Cost Analysis of Scraping Tools
The speaker conducts a cost analysis of the web scraping tools by comparing the number of tokens generated by each tool when scraping content. They use a 'pretty table' library in Python to organize the data in rows and columns for easy comparison. The analysis aims to determine which tool is the most cost-effective for the speaker's web scraping needs. The speaker also discusses the tokenization process and how it affects the cost, mentioning that newer generations of models like GPT-4 can reduce costs due to more efficient tokenization. They run the scraping process and provide a cost comparison table, highlighting the differences in token usage and potential cost savings.
📊 Extracting and Analyzing Scraped Data
The speaker uses OpenAI's GPT-4 model to extract specific information from the scraped data, focusing on competitor pricing tiers and costs. They set up an OpenAI client and use a utility function to display the extracted content in a table. The speaker runs GPT-4 on the inputs from the previous scraping tools and requests JSON outputs with the pricing information. They find that while some tools provide accurate and usable data, others may not fully comply with the extraction requests, highlighting the importance of being specific with prompts when working with large language models. The speaker also introduces 'scrape graph', an open-source tool for creating scraping pipelines with graph data structures.
🛠️ Demonstrating Scrape Graph and User Interaction
In the final part, the speaker demonstrates the use of 'scrape graph' by taking a live example where they scrape a website for electric unicycle models and their specifications. They interact with the audience to determine what information to extract and successfully retrieve the model names and speeds. The speaker also discusses the token consumption of the scrape graph tool and shares a method to calculate it, which they learned from the tool's community discussions. The session concludes with a Q&A where the speaker addresses any questions from the audience.
Mindmap
Keywords
💡Web Scraping
💡LLM (Large Language Models)
💡Jina AI
💡Mendable
💡Scrapegraph-ai
💡API Key
💡Tokenization
💡Beautiful Soup
💡Markdown
💡Cost Analysis
Highlights
In 2024, startups are pivoting into web scraping, especially those from recent Y Combinator batches.
Mendable introduces 'firecrawl', a tool for web scraping using large language models.
Jina AI offers a 'reader API' that cleans data from any URL without an API key.
Scrapegraph-ai is an open-source project for creating web scraping pipelines with AI.
The presenter is scraping competitor pricing pages for market research in the Learning and Development space.
Tik token is used to count the number of tokens for large language models, affecting cost.
Beautiful Soup is a straightforward web scraping tool but can be easily detected.
Firecrawl outputs data in a markdown format that is more suitable for large language models.
Jina AI's reader API provides human-readable markdown, which is beneficial for reasoning tasks.
The cost of web scraping can vary significantly between tools, affecting the choice for businesses.
Scrapegraph-ai allows for complex orchestration of web scraping tasks using different Python modules.
The presenter demonstrates the use of GPT-4 for entity extraction from web scraping data.
Different web scraping tools have varying levels of success in extracting pricing information.
Scrapegraph-ai is shown to accurately extract model names and speeds from a website.
The importance of being specific with prompts when working with large language models is emphasized.
The presenter discusses the cost implications of using different web scraping tools with large language models.