Automated Content Quality Assurance for Crowdsourcing Educational Platforms - Brainly x deepsense.ai
TLDR
The Data Science Milan community hosted a meetup discussing Brainly's collaboration with deepsense.ai to develop an AI system for automated content quality assurance. The system, called AQA, uses machine learning to filter and maintain the quality of user-generated educational content, enhancing the learning experience for students.
Takeaways
- 😀 The Data Science Milan community, founded in 2016, has grown to over 1,700 members and hosts monthly meetups on YouTube.
- 📚 The community is part of the Italian Association for Machine Learning and is always looking for volunteers to join their organizing team.
- 🌐 Their website (www.datasciencemilan.org) offers recordings of past talks, a blog with summaries, and a newsletter sign-up for community updates.
- 🔍 Brainly is an educational platform with a community Q&A product that helps over 1.5 billion students worldwide with their learning challenges.
- 🔎 Brainly's 'Snap to Solve' feature allows students to take a picture of a question; if no answer is immediately available, the question is passed to the community to answer.
- 🤖 The AI department at Brainly aims to personalize the learning experience by constructing learning profiles for students based on their interactions.
- 🎯 The goal of the AI is to provide predictive interventions, offering tailored learning paths and feedback to students to preemptively address educational struggles.
- 🛠️ Traditional content moderation strategies have limitations, including bias, inconsistency, and scalability issues, necessitating the exploration of AI solutions.
- 🔄 The AI system for automated content quality assurance uses machine learning to classify content into categories like spam, nonsense, and personal identifiers, improving moderation efficiency.
- 👥 The development of the AI system involved a cross-functional team of data scientists, machine learning engineers, and project managers from Brainly and deepsense.ai.
- 🔬 deepsense.ai is a data science company that provides AI solutions globally, with expertise in predictive analytics, computer vision, and NLP, collaborating with Brainly on the AI system.
Q & A
What is the main focus of the Data Science Milan community meetup?
- The main focus of the Data Science Milan community meetup is to discuss and share insights on the application of AI and machine learning in various fields, with a specific presentation on the Automated Content Quality Assurance system developed for educational platforms like Brainly.
How was the Data Science Milan community founded and how often do they meet?
- The Data Science Milan community was founded in February 2016 and typically meets once a month on YouTube, with aspirations to arrange physical venues as well.
What is the role of the AI department at Brainly?
- The AI department at Brainly is tasked with personalizing the learning experience for both students and parents, providing tailored content and predictive interventions to address future educational struggles ahead of time.
Can you explain the 'Snap to Solve' feature of Brainly's product?
- The 'Snap to Solve' feature allows users to point their phone's camera at an educational question, and the Brainly product will either find the answer from its knowledge base or ask a community member to provide an answer if it's not readily available.
What is the significance of the community question and answers product in Brainly's ecosystem?
- The community question and answers product is the core of Brainly's ecosystem, serving as a knowledge base where students can ask questions and receive step-by-step explanations to guide them through their homework and learning process.
How does Brainly use AI to enhance the user experience over time?
- Brainly uses AI to construct learning profiles for students based on their interactions, allowing the platform to provide increasingly relevant content and a tailored learning path, including predictive interventions for future educational needs.
What are the challenges faced by traditional content moderation strategies without machine learning?
- Traditional content moderation strategies face challenges such as potential bias, inconsistency, delays due to manual approval, difficulty in policy changes, reliance on external moderation services which can be expensive and hard to scale, and the inability to provide explanations for content classification.
Can you describe the multi-label classification approach used in the automated content quality assurance system?
- The multi-label classification approach groups different content labels into high risk, low risk, and a third group called 'request to fix'. High risk content requires immediate moderation, low risk content is safe but low quality, and 'request to fix' content can be improved by the user or the system.
What is the purpose of the blacklist feed in the automated content quality assurance system?
- The blacklist feed serves as a dataset with pre-calculated predictions of low-quality content. It allows client applications consuming content from Brainly to filter out low-quality questions based on the attributes available in the blacklist feed.
How does the automated content quality assurance system handle the issue of mathematical expressions being flagged as nonsense by NLP models?
- The system uses a workaround: it skips the non-English and nonsense (gibberish) detection models for mathematical expressions. The team also plans to develop an in-house mathematical expression detector to handle such content better.
What are some of the pre-trained state-of-the-art models and tools used in the automated content quality assurance system?
- Some of the models and tools used include Detoxify for toxicity detection, Perspective API for toxicity scores, FastText and langid for language identification, Gibberish Detector for nonsensical content, and Microsoft Presidio for personally identifiable information detection.
Outlines
🎵 Technical Difficulties and Introduction to Data Science Milan
The script begins with a technical issue where the presenter is testing the audio, specifically the music stream for the audience. After resolving the audio issue, the presenter, Gianmario Spacagna, introduces the Data Science Milan community, highlighting its foundation in February 2016 and its growth to over 1,700 members. The community meets monthly on YouTube and is part of the Italian Association for Machine Learning. Gianmario also mentions the staff and volunteers behind the events and invites new volunteers, directing interested individuals to their website and Slack workspace. The community offers a newsletter, blog summaries, and recordings of past events.
🤖 Presenting the AI System for Automated Content Quality Assurance
Gianmario introduces himself and his role at Brainly as the Director of Artificial Intelligence. He also introduces Artur Zygadło, a Senior Data Scientist at deepsense.ai, who will be presenting alongside him. The presentation focuses on the automated content quality assurance system developed for educational platforms, specifically for Brainly, a Q&A platform for students. The system aims to personalize the learning experience and provide predictive interventions to assist students in their educational journey. The script discusses the importance of content quality and the strategy to achieve it through investments in data, user attributes, visual search, and curriculum analysis.
🏊‍♂️ Maintaining Content Quality: Traditional and AI-Augmented Strategies
The script explores traditional content moderation strategies such as community moderation, external moderation services, and shadow banning. It highlights the limitations of these methods, including bias, delays, and scalability issues. The introduction of AI into the moderation process is discussed, moving from a binary classification approach to a multi-label classification problem. The AI system categorizes content into high risk, low risk, and 'request to fix' groups, improving the moderation process by providing immediate action on harmful content and suggestions for improvement on lower-quality content.
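The grouping described above can be sketched in a few lines of Python. This is only an illustration: the label names, the group each label is assigned to, the order of severity, and the 0.5 threshold are all assumptions, not details taken from the talk.

```python
from enum import Enum


class Action(Enum):
    HIGH_RISK = "moderate immediately"
    REQUEST_TO_FIX = "ask the user to improve the content"
    LOW_RISK = "safe but low quality"
    ACCEPT = "no issue detected"


# Hypothetical assignment of labels to groups; the real taxonomy may differ.
LABEL_GROUPS = {
    "toxic": Action.HIGH_RISK,
    "personal_identifiers": Action.HIGH_RISK,
    "spam": Action.HIGH_RISK,
    "incomplete": Action.REQUEST_TO_FIX,
    "non_english": Action.REQUEST_TO_FIX,
    "nonsense": Action.LOW_RISK,
}

# Assumed order of severity, most severe first.
PRIORITY = [Action.HIGH_RISK, Action.REQUEST_TO_FIX, Action.LOW_RISK]


def route(scores, threshold=0.5):
    """Return the most severe action triggered by any label scoring at or above the threshold."""
    triggered = {
        LABEL_GROUPS[label]
        for label, score in scores.items()
        if label in LABEL_GROUPS and score >= threshold
    }
    for action in PRIORITY:
        if action in triggered:
            return action
    return Action.ACCEPT
```

Because the problem is multi-label, a single question can trigger several labels at once; in this sketch the most severe group wins.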
🛠️ Demonstrating the AI-Powered Content Moderation System
A live demonstration of the AI content moderation system is provided, showcasing its capabilities in detecting various types of content issues such as spam, non-English content, and mathematical expressions that may be misinterpreted by natural language processing models. The system uses an ensemble of models and an assembly function to filter and prevent low-quality questions. It also maintains a blacklist feed for client applications to filter content. The demonstration highlights the system's ability to detect and suggest improvements on content based on pre-calculated predictions.
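A client application consuming such a blacklist feed might filter content along the following lines. This is a minimal sketch: the `FeedEntry` record, its fields, and the label names are hypothetical, since the summary does not describe the feed's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedEntry:
    """One pre-calculated prediction in a hypothetical blacklist feed."""
    question_id: int
    labels: frozenset  # predicted low-quality labels for this question


def build_blocklist(feed, blocked_labels):
    """Collect the ids of questions carrying at least one label the client wants to hide."""
    return {entry.question_id for entry in feed if entry.labels & blocked_labels}


def filter_questions(question_ids, blocklist):
    """Keep only the questions that are not on the blocklist, preserving order."""
    return [qid for qid in question_ids if qid not in blocklist]


feed = [
    FeedEntry(1, frozenset({"spam"})),
    FeedEntry(2, frozenset({"incomplete"})),
]
blocklist = build_blocklist(feed, frozenset({"spam"}))
visible = filter_questions([1, 2, 3], blocklist)
```

Since the predictions are pre-calculated, each client can choose which labels to block without calling the models itself.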
📈 Insights from the Automated Content Quality Assurance System
Artur Zygadło from deepsense.ai shares lessons learned from developing the Automated Content Quality Assurance (AQA) system. The first lesson emphasizes the importance of manually annotating a data sample to understand the problem better. The team used a multi-label scheme for content quality taxonomy, which helped in setting up guidelines for future annotations and improving the quality of labels over time.
🔍 Analyzing Data Patterns and Utilizing NLP Solutions
The script discusses the importance of analyzing data patterns to identify potential issues and find quick solutions. It mentions the use of regular expressions to improve spam filtering and the challenges of mathematical expressions for NLP models. The presentation also highlights the use of state-of-the-art NLP solutions such as Detoxify, Perspective API, Gibberish Detector, and Microsoft Presidio for various content detection tasks.
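One way to picture the workaround for mathematical expressions is a cheap guard that skips the text-based detectors when the input is mostly digits and operators. The sketch below is an assumption-heavy illustration: the character set, the 0.6 ratio, and the `detectors` mapping of names to scoring functions are invented here and do not reflect the team's actual implementation.

```python
import re

# Characters treated as "math-like" in this sketch: digits, operators, brackets, whitespace.
MATH_CHARS = re.compile(r"[0-9+\-*/=^()<>√π%.,\s]")


def looks_like_math(text, min_ratio=0.6):
    """Heuristic: true when at least min_ratio of the characters are math-like."""
    stripped = text.strip()
    if not stripped:
        return False
    math_like = sum(1 for ch in stripped if MATH_CHARS.match(ch))
    return math_like / len(stripped) >= min_ratio


def run_text_detectors(text, detectors):
    """Apply each detector unless the text looks like a mathematical expression."""
    if looks_like_math(text):
        return {}  # skip language and gibberish models for math input
    return {name: detect(text) for name, detect in detectors.items()}
```

For example, `looks_like_math("2x + 3 = 7")` is true under these settings, so a gibberish detector passed in `detectors` would never see that input.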
📝 Breaking Down Complex Problems and Addressing Incomplete Questions
Artur explains how complex problems like incomplete question detection can be broken down into simpler sub-problems. The team identified four major sub-problems: missing context, missing questions, missing choices, and truncated contents. A baseline model combining heuristics and NLP is proposed for detecting incomplete questions, which involves checking for specific tokens, multiple-choice formats, and imperative verbs to identify missing elements in questions.
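The four sub-problems can be illustrated with a small rule-based baseline. This is a sketch under stated assumptions rather than the team's model: the word lists, regular expressions, and the token limit for the missing-context rule are all hypothetical choices made for this example.

```python
import re

# Hypothetical cue lists; a real system would tune these on annotated data.
QUESTION_WORDS = {"what", "why", "how", "when", "where", "which", "who"}
IMPERATIVE_VERBS = {"solve", "find", "explain", "calculate", "describe", "simplify", "evaluate"}
CHOICE_CUE = re.compile(r"which of the following|choose the (?:correct|best)|select the", re.I)
CHOICE_ITEM = re.compile(r"(?:^|\s)[A-Da-d][.)]\s")
CONTEXT_CUE = re.compile(r"\b(?:the|this) (?:passage|text|story|graph|figure|table|diagram)\b", re.I)
TRUNCATED_END = re.compile(r"(?:\.\.\.|…|,|\b(?:and|or|the|of|to))$", re.I)


def incomplete_reasons(text, max_context_tokens=25):
    """Return the list of sub-problems that the heuristics detect in a question."""
    stripped = text.strip()
    tokens = re.findall(r"[a-z']+", stripped.lower())
    reasons = []

    # Missing question: no question mark, interrogative word, or imperative verb.
    asks = "?" in stripped or any(t in QUESTION_WORDS or t in IMPERATIVE_VERBS for t in tokens)
    if not asks:
        reasons.append("missing_question")

    # Missing choices: multiple-choice phrasing with fewer than two listed options.
    if CHOICE_CUE.search(stripped) and len(CHOICE_ITEM.findall(stripped)) < 2:
        reasons.append("missing_choices")

    # Missing context: refers to a passage or figure, but the text is too short to contain it.
    if CONTEXT_CUE.search(stripped) and len(tokens) < max_context_tokens:
        reasons.append("missing_context")

    # Truncated content: ends with an ellipsis, a comma, or a dangling function word.
    if TRUNCATED_END.search(stripped):
        reasons.append("truncated")

    return reasons
```

An empty list means none of the rules fired; a non-empty list could feed the 'request to fix' group so that the user is told what to add.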
🚀 Future Directions and Closing Remarks
The presentation concludes with future directions for the AQA system, including improving baseline models with more Brainly-related data, researching mathematical formula detection, subject classification, and using interpretability methods and weak supervision for smarter data annotation. The team also plans to expand the solution to other markets and languages. The script ends with an invitation for questions and contact information for those interested in career opportunities with the companies involved.
🔧 Handling Experiment Tracking and Final Q&A
The final part of the script addresses a question about experiment tracking, highlighting the use of neptune.ai for managing experiments efficiently. The presenter also mentions an upcoming case study on using neptune for tracking computer vision models and encourages viewers to stay tuned for its release. The session concludes with holiday wishes and an invitation for further questions.
Keywords
💡Automated Content Quality Assurance
💡Crowdsourcing Educational Platforms
💡Data Science Milan Community
💡Machine Learning
💡Personalized Learning Experience
💡Community Question and Answers
💡Predictive Interventions
💡Content Attributes
💡User's Attributes
💡Visual Search
💡Curriculum Analysis
Highlights
Introduction to the Data Science Milan community, an independent group founded in 2016 with over 1.7k members.
The community's monthly meetups are held on YouTube, with plans to arrange physical venues soon.
Community members can engage through a monthly digest newsletter and Slack workspace.
Gianmario Spacagna works at Brainly as Director of Artificial Intelligence.
Artur Zygadło is a Senior Data Scientist at deepsense.ai, collaborating with Brainly on automated content quality assurance.
Brainly's core product is a community Q&A platform helping over 1.5 billion students worldwide.
The platform uses AI to personalize the learning experience for students and parents.
AI strategies at Brainly include investments in content, user attributes, visual search, and curriculum analysis.
Traditional content moderation strategies have limitations such as bias, delays, and scalability issues.
AI can augment the moderation process by classifying content into high risk, low risk, and request to fix categories.
The multi-label classification approach groups different content labels for better moderation.
Demonstration of the automated content quality assurance system showcasing its capabilities.
Use of machine learning frameworks, libraries, and tools such as scikit-learn, neptune.ai, and various NLP tools.
The system is deployed on Amazon Web Services with a combination of serverless and microservices architectures.
deepsense.ai's expertise in predictive analytics, computer vision, and NLP contributes to the development of the AQA system.
Lessons learned from the project include the importance of manually annotating data samples and analyzing user behavior patterns.
The project utilizes zero-shot classification for identifying low-quality questions without needing labeled data.
State-of-the-art NLP solutions are integrated into the system for tasks like toxicity detection and language identification.
Future directions include improving baseline models, researching mathematical formula detection, and expanding to other markets and languages.