Automated Content Quality Assurance for Crowdsourcing Educational Platforms - Brainly x deepsense.ai

Data Science Milan
15 Dec 2021, 57:25

TLDR: The Data Science Milan community hosted a meetup discussing Brainly's collaboration with deepsense.ai to develop an AI system for automated content quality assurance. The system, called AQA, uses machine learning to filter and maintain the quality of user-generated educational content, enhancing the learning experience for students.

Takeaways

  • 😀 The Data Science Milan community, founded in 2016, has grown to over 1,700 members and hosts monthly meetups on YouTube.
  • 📚 The community is part of the Italian Association for Machine Learning and is always looking for volunteers to join their organizing team.
  • 🌐 Their website (www.datasciencemilan.org) offers recordings of past talks, a blog with summaries, and a newsletter sign-up for community updates.
  • 🔍 Brainly is an educational platform with a community Q&A product that helps over 1.5 billion students worldwide with their learning challenges.
  • 🔎 Brainly's 'Snap to Solve' feature lets students take a picture of a question and get an answer from the knowledge base, or from the community if no answer is available yet.
  • 🤖 The AI department at Brainly aims to personalize the learning experience by constructing learning profiles for students based on their interactions.
  • 🎯 The goal of the AI is to provide predictive interventions, offering tailored learning paths and feedback to students to preemptively address educational struggles.
  • 🛠️ Traditional content moderation strategies have limitations, including bias, inconsistency, and scalability issues, necessitating the exploration of AI solutions.
  • 🔄 The AI system for automated content quality assurance uses machine learning to classify content into categories like spam, nonsense, and personal identifiers, improving moderation efficiency.
  • 👥 The development of the AI system involved a cross-functional team of data scientists, machine learning engineers, and project managers from Brainly and deepsense.ai.
  • 🔬 deepsense.ai is a data science company that provides AI solutions globally, with expertise in predictive analytics, computer vision, and NLP, collaborating with Brainly on the AI system.

Q & A

  • What is the main focus of the Data Science Milan community meetup?

    -The main focus of the Data Science Milan community meetup is to discuss and share insights on the application of AI and machine learning in various fields, with a specific presentation on the Automated Content Quality Assurance system developed for educational platforms like Brainly.

  • How was the Data Science Milan community founded and how often do they meet?

    -The Data Science Milan community was founded in February 2016 and typically meets once a month on YouTube, with aspirations to arrange physical venues as well.

  • What is the role of the AI department at Brainly?

    -The AI department at Brainly is tasked with personalizing the learning experience for both students and parents, providing tailored content and predictive interventions to address future educational struggles ahead of time.

  • Can you explain the 'Snap to Solve' feature of Brainly's product?

    -The 'Snap to Solve' feature allows users to point their phone's camera at an educational question, and the Brainly product will either find the answer from its knowledge base or ask a community member to provide an answer if it's not readily available.

  • What is the significance of the community question and answers product in Brainly's ecosystem?

    -The community question and answers product is the core of Brainly's ecosystem, serving as a knowledge base where students can ask questions and receive step-by-step explanations to guide them through their homework and learning process.

  • How does Brainly use AI to enhance the user experience over time?

    -Brainly uses AI to construct learning profiles for students based on their interactions, allowing the platform to provide increasingly relevant content and a tailored learning path, including predictive interventions for future educational needs.

  • What are the challenges faced by traditional content moderation strategies without machine learning?

    -Traditional content moderation strategies face challenges such as potential bias, inconsistency, delays due to manual approval, difficulty in policy changes, reliance on external moderation services which can be expensive and hard to scale, and the inability to provide explanations for content classification.

  • Can you describe the multi-label classification approach used in the automated content quality assurance system?

    -The multi-label classification approach groups different content labels into high risk, low risk, and a third group called 'request to fix'. High risk content requires immediate moderation, low risk content is safe but low quality, and 'request to fix' content can be improved by the user or the system.

  • What is the purpose of the blacklist feed in the automated content quality assurance system?

    -The blacklist feed serves as a dataset with pre-calculated predictions of low-quality content. It allows client applications consuming content from Brainly to filter out low-quality questions based on the attributes available in the blacklist feed.

  • How does the automated content quality assurance system handle the issue of mathematical expressions being flagged as nonsense by NLP models?

    -The system uses a workaround: it does not apply the non-English and gibberish detection models to mathematical expressions. The team also plans to develop an in-house mathematical expression detector to handle such content better.

  • What are some of the pre-trained state-of-the-art models and tools used in the automated content quality assurance system?

    -Some of the models and tools used include Detoxify for toxicity detection, Perspective API for toxicity scores, FastText and langid for language identification, Gibberish Detector for nonsensical content, and Microsoft Presidio for personally identifiable information detection.

Outlines

00:00

🎵 Technical Difficulties and Introduction to Data Science Milan

The script begins with a technical issue where the presenter is testing the audio, specifically the music stream for the audience. After resolving the audio issue, the presenter, Gianmario Spacagna, introduces the Data Science Milan community, highlighting its foundation in February 2016 and its growth to over 1,700 members. The community meets monthly on YouTube and is part of the Italian Association for Machine Learning. Gianmario also mentions the staff and volunteers behind the events and invites new volunteers, directing interested individuals to the community's website and Slack workspace. The community offers a newsletter, blog summaries, and recordings of past events.

05:00

🤖 Presenting the AI System for Automated Content Quality Assurance

Gianmario introduces himself and his role at Brainly as the Director of Artificial Intelligence. He also introduces Artur Zygadło, a Senior Data Scientist at deepsense.ai, who will be presenting alongside him. The presentation focuses on the automated content quality assurance system developed for educational platforms, specifically for Brainly, a Q&A platform for students. The system aims to personalize the learning experience and provide predictive interventions to assist students in their educational journey. The script discusses the importance of content quality and the strategy to achieve it through investments in data, user attributes, visual search, and curriculum analysis.

10:00

🏊‍♂️ Maintaining Content Quality: Traditional and AI-Augmented Strategies

The script explores traditional content moderation strategies such as community moderation, external moderation services, and shadow banning. It highlights the limitations of these methods, including bias, delays, and scalability issues. The introduction of AI into the moderation process is discussed, moving from a binary classification approach to a multi-label classification problem. The AI system categorizes content into high risk, low risk, and 'request to fix' groups, improving the moderation process by providing immediate action on harmful content and suggestions for improvement on lower-quality content.
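As a rough illustration of the grouping described above, the sketch below maps multi-label predictions to one of the three groups and returns the most severe one. The summary only names the groups, so the label names, their assignment to groups, the severity ordering, and the `triage` helper are all assumptions made for illustration.

```python
from enum import IntEnum


class RiskGroup(IntEnum):
    """The three groups from the talk plus OK; the severity order is an assumption."""
    OK = 0
    REQUEST_TO_FIX = 1
    LOW_RISK = 2
    HIGH_RISK = 3


# Hypothetical label-to-group mapping; the real taxonomy is not given in the summary.
LABEL_TO_GROUP = {
    "toxic": RiskGroup.HIGH_RISK,
    "personal_identifier": RiskGroup.HIGH_RISK,
    "spam": RiskGroup.LOW_RISK,
    "nonsense": RiskGroup.LOW_RISK,
    "incomplete_question": RiskGroup.REQUEST_TO_FIX,
}


def triage(labels):
    """Return the most severe group among the predicted labels (OK if none match)."""
    groups = [LABEL_TO_GROUP[label] for label in labels if label in LABEL_TO_GROUP]
    return max(groups, default=RiskGroup.OK)
```

Because a question can carry several labels at once, taking the maximum means a single high-risk label is enough to route the content to immediate moderation.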

15:02

🛠️ Demonstrating the AI-Powered Content Moderation System

A live demonstration of the AI content moderation system is provided, showcasing its capabilities in detecting various types of content issues such as spam, non-English content, and mathematical expressions that may be misinterpreted by natural language processing models. The system uses an ensemble of models and an assembly function to filter and prevent low-quality questions. It also maintains a blacklist feed for client applications to filter content. The demonstration highlights the system's ability to detect and suggest improvements on content based on pre-calculated predictions.
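The flow described above (several models, a function that combines them, and a feed of pre-calculated predictions that clients filter on) might look roughly like the sketch below. The two detectors are toy stand-ins for the real models, and every name, pattern, and threshold here is hypothetical.

```python
import re


# Toy stand-ins for the real models; each returns a score in [0, 1].
def spam_score(text):
    return 1.0 if re.search(r"https?://|free points", text, re.IGNORECASE) else 0.0


def too_short_score(text):
    return 1.0 if len(text.split()) < 3 else 0.0


DETECTORS = {"spam": spam_score, "too_short": too_short_score}
THRESHOLD = 0.5  # assumed cut-off, not from the talk


def predict_labels(text):
    """Combine step: run every detector and keep the labels at or above the threshold."""
    return sorted(name for name, fn in DETECTORS.items() if fn(text) >= THRESHOLD)


def build_feed(questions):
    """Pre-calculate predictions; only flagged questions enter the feed."""
    feed = {}
    for qid, text in questions.items():
        labels = predict_labels(text)
        if labels:
            feed[qid] = labels
    return feed


def visible_questions(questions, feed, blocked_labels):
    """Client side: hide questions whose feed labels overlap blocked_labels."""
    blocked = set(blocked_labels)
    return [qid for qid in questions if not blocked & set(feed.get(qid, []))]
```

Keeping the predictions in a separate feed lets each client application decide which labels to block without calling the models itself.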

20:05

📈 Insights from the Automated Content Quality Assurance System

Artur Zygadło from deepsense.ai shares lessons learned from developing the Automated Content Quality Assurance (AQA) system. The first lesson emphasizes the importance of manually annotating a data sample to understand the problem better. The team used a multi-label scheme for content quality taxonomy, which helped in setting up guidelines for future annotations and improving the quality of labels over time.

25:07

🔍 Analyzing Data Patterns and Utilizing NLP Solutions

The script discusses the importance of analyzing data patterns to identify potential issues and find quick solutions. It mentions the use of regular expressions to improve spam filtering and the challenges of mathematical expressions for NLP models. The presentation also highlights the use of state-of-the-art NLP solutions such as Detoxify, Perspective API, Gibberish Detector, and Microsoft Presidio for various content detection tasks.
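A minimal sketch of these two ideas, assuming simple heuristics: a couple of regular expressions for obvious spam, and a character-ratio check that decides whether a text looks like a mathematical expression so that language and gibberish checks can be skipped for it. The patterns, the 0.7 ratio, and all function names are illustrative assumptions, not the rules used in the actual system.

```python
import re

# Hypothetical spam patterns: links and one character repeated six or more times.
SPAM_PATTERNS = [
    re.compile(r"https?://\S+", re.IGNORECASE),
    re.compile(r"(.)\1{5,}"),
]

# Characters typical of formulas; x, y, z stand in for common variable names.
MATH_CHARS = set("0123456789+-*/=^()<>.,xyz")
MATH_RATIO = 0.7  # assumed threshold


def looks_like_spam(text):
    return any(pattern.search(text) for pattern in SPAM_PATTERNS)


def looks_like_math(text):
    """True if the text contains a digit and is mostly made of formula characters."""
    chars = [ch for ch in text.lower() if not ch.isspace()]
    if not any(ch.isdigit() for ch in chars):
        return False
    mathy = sum(ch in MATH_CHARS for ch in chars)
    return mathy / len(chars) >= MATH_RATIO


def checks_to_run(text):
    """Skip the language and gibberish checks for math-like input."""
    checks = ["spam", "toxicity"]
    if not looks_like_math(text):
        checks += ["language", "gibberish"]
    return checks
```

A ratio-based guard like this is deliberately crude: a word problem that mixes prose and a formula falls below the threshold and still goes through every check.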

30:09

📝 Breaking Down Complex Problems and Addressing Incomplete Questions

Artur explains how complex problems like incomplete question detection can be broken down into simpler sub-problems. The team identified four major sub-problems: missing context, missing questions, missing choices, and truncated contents. A baseline model combining heuristics and NLP is proposed for detecting incomplete questions, which involves checking for specific tokens, multiple-choice formats, and imperative verbs to identify missing elements in questions.
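To make the decomposition concrete, here is a heuristic-only sketch with one rule per sub-problem. The word lists, regular expressions, the word-count limit, and the `incompleteness_flags` function are all assumptions for illustration; the baseline described in the talk also uses NLP components that are not reproduced here.

```python
import re

# All word lists and patterns below are illustrative assumptions.
QUESTION_WORDS = {"what", "why", "how", "when", "where", "which", "who"}
IMPERATIVE_VERBS = {"solve", "find", "explain", "calculate", "describe", "simplify"}
DANGLING_WORDS = {"the", "a", "an", "of", "and", "or", "to", "is"}
CONTEXT_REFERENCE = re.compile(
    r"\b(the (passage|graph|table|figure|text)|shown (above|below))\b", re.IGNORECASE
)
CHOICE_PROMPT = re.compile(r"\b(which of the following|choose the (correct|best))\b", re.IGNORECASE)
CHOICE_ITEM = re.compile(r"(^|\s)[a-d][.)]\s", re.IGNORECASE)
MIN_WORDS_WITH_CONTEXT = 30  # shorter texts are assumed not to include the referenced material


def incompleteness_flags(text):
    """Return the list of sub-problems the heuristics detect in a question."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    words = text.split()
    last_word = words[-1].lower().strip(".,;:!?") if words else ""
    flags = []

    # Missing question: no question mark, question word, or imperative verb.
    if "?" not in text and not tokens & (QUESTION_WORDS | IMPERATIVE_VERBS):
        flags.append("missing_question")
    # Missing context: refers to a passage or figure that a short text cannot contain.
    if CONTEXT_REFERENCE.search(text) and len(words) < MIN_WORDS_WITH_CONTEXT:
        flags.append("missing_context")
    # Missing choices: multiple-choice wording without any "a)" or "b." items.
    if CHOICE_PROMPT.search(text) and not CHOICE_ITEM.search(text):
        flags.append("missing_choices")
    # Truncated content: ends with an ellipsis or a dangling function word.
    if text.rstrip().endswith("...") or last_word in DANGLING_WORDS:
        flags.append("truncated")
    return flags
```

Returning a list of flags rather than a single verdict keeps the sub-problems separate, so each one can later be replaced by a trained model without touching the others.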

35:09

🚀 Future Directions and Closing Remarks

The presentation concludes with future directions for the AQA system, including improving baseline models with more Brainly-related data, researching mathematical formula detection, subject classification, and using interpretability methods and weak supervision for smarter data annotation. The team also plans to expand the solution to other markets and languages. The script ends with an invitation for questions and contact information for those interested in career opportunities with the companies involved.

40:10

🔧 Handling Experiment Tracking and Final Q&A

The final part of the script addresses a question about experiment tracking, highlighting the use of neptune.ai for managing experiments efficiently. The presenter also mentions an upcoming case study on using neptune.ai for tracking computer vision models and encourages viewers to stay tuned for its release. The session concludes with holiday wishes and an invitation for further questions.

Keywords

💡Automated Content Quality Assurance

Automated Content Quality Assurance refers to the use of technology, such as artificial intelligence and machine learning, to evaluate and ensure the quality of content automatically. In the context of the video, this concept is central as it discusses the system developed by Brainly and deepsense.ai to maintain the quality of educational content on the Brainly platform. The system is designed to filter and clean the content pool, ensuring that only high-quality educational material is available to the users.

💡Crowdsourcing Educational Platforms

Crowdsourcing Educational Platforms are online environments where content is generated and curated by the community of users, rather than by a centralized authority. The script mentions Brainly as an example of such a platform, where students can ask questions and receive answers from other members of the community. The challenge is to ensure the accuracy and reliability of the content provided by a crowd, which is where automated content quality assurance comes into play.

💡Data Science Milan Community

The Data Science Milan Community is an independent group that brings together individuals interested in data science. Founded in February 2016, it has grown to over 1,700 members who meet monthly on platforms like YouTube. The community is part of a larger group, the Italian Association for Machine Learning, and organizes events and shares resources, such as talks and blog summaries, to foster learning and collaboration in the field of data science.

💡Machine Learning

Machine Learning is a subset of artificial intelligence that provides systems the ability to learn and improve from experience without being explicitly programmed. In the video script, machine learning is mentioned as a key technology used in developing the automated content quality assurance system. It is utilized for tasks such as extracting content attributes, enriching metadata, and improving user profiling to personalize the learning experience.

💡Personalized Learning Experience

A Personalized Learning Experience is an educational approach that tailors teaching methods and content to the individual needs, interests, and abilities of each student. The script discusses Brainly's mission to provide such an experience by using AI to understand user interactions and preferences, and then offer content and learning paths that are most relevant and engaging for the student.

💡Community Question and Answers

Community Question and Answers is a feature of the Brainly platform where users can post questions and receive answers from the community. It serves as a knowledge base for educational queries, allowing students to get help with their homework or learn new concepts. The script highlights this feature as the core product of Brainly, which facilitates learning by leveraging the collective knowledge of its user base.

💡Predictive Interventions

Predictive Interventions in the context of the video refer to the proactive use of AI to anticipate and address future educational challenges before they arise. The AI department at Brainly aims to move beyond reactive responses to user needs and instead provide tailored feedback and learning paths that help students succeed in their educational journey by predicting their future educational struggles.

💡Content Attributes

Content Attributes are the characteristics or features of the content that can be extracted and analyzed using machine learning. In the script, it is mentioned that Brainly uses machine learning to extract content attributes and enrich them with additional metadata. This process helps in understanding the nature of the content and its relevance to the user's learning needs.

💡User's Attributes

User's Attributes refer to the characteristics or features of a user that can be identified and utilized to personalize their experience. The script discusses how Brainly uses machine learning to understand user interactions and construct learning profiles, which in turn help in providing a personalized learning experience by showing relevant content based on the user's educational level, interests, and curriculum.

💡Visual Search

Visual Search is a functionality that allows users to search for content using images or visual input, rather than text. The script mentions 'snap to solve' as an example of visual search on the Brainly platform, where users can point their phone camera at a question and the app will find the answer or ask the community if the answer is not readily available.

💡Curriculum Analysis

Curriculum Analysis involves the examination of user sessions and patterns to provide recommender systems with insights into the content that is most relevant to the user's educational journey. The script discusses how Brainly uses this analysis to offer a tailored learning path and to make predictive interventions that enhance the educational experience.

Highlights

Introduction to the Data Science Milan community, an independent group founded in 2016 with over 1.7k members.

The community's monthly meetups are held on YouTube, with plans to arrange physical venues soon.

Community members can engage through a monthly digest newsletter and Slack workspace.

Gianmario Spacagna works at Brainly as Director of Artificial Intelligence.

Artur Zygadło is a Senior Data Scientist at deepsense.ai, collaborating with Brainly on automated content quality assurance.

Brainly's core product is a community Q&A platform helping over 1.5 billion students worldwide.

The platform uses AI to personalize the learning experience for students and parents.

AI strategies at Brainly include investments in content, user attributes, visual search, and curriculum analysis.

Traditional content moderation strategies have limitations such as bias, delays, and scalability issues.

AI can augment the moderation process by classifying content into high risk, low risk, and request to fix categories.

The multi-label classification approach groups different content labels for better moderation.

Demonstration of the automated content quality assurance system showcasing its capabilities.

Use of machine learning frameworks and tools such as scikit-learn, neptune.ai, and various NLP libraries.

The system is deployed on Amazon Web Services with a combination of serverless and microservices architectures.

deepsense.ai's expertise in predictive analytics, computer vision, and NLP contributes to the development of the AQA system.

Lessons learned from the project include the importance of manually annotating data samples and analyzing user behavior patterns.

The project utilizes zero-shot classification for identifying low-quality questions without needing labeled data.

State-of-the-art NLP solutions are integrated into the system for tasks like toxicity detection and language identification.

Future directions include improving baseline models, researching mathematical formula detection, and expanding to other markets and languages.