This project develops a robust platform to distinguish between human-generated and AI-generated text. As large language models such as GPT advance, they blur the line of textual authenticity, making tools that verify the origin of digital content increasingly necessary. Our project employs deep learning techniques to identify the subtle patterns and discrepancies that differentiate AI-generated text from human-written content.
The distinction between human and machine-generated text is increasingly challenging due to the capabilities of modern AI to mimic human style and thought processes. This project addresses the need for effective methodologies to ascertain the authenticity of information across digital platforms.
We utilize a dataset with over 100,000 instances, categorized into human-generated and AI-generated texts, spanning various genres and difficulty levels. The data collection sources include:
- Kaggle Dataset by Alejopaullier
- Kaggle Dataset by TheDrCat
- Kaggle LLM - Detect AI Generated Text Competition
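The sources above can be merged into a single labeled corpus before training. A minimal sketch using pandas, assuming each export provides `text` and `label` columns (the file names below are placeholders, not the actual Kaggle file names):

```python
import pandas as pd

# Placeholder paths; the real Kaggle exports may use different names.
SOURCES = ["alejopaullier.csv", "thedrcat.csv", "llm_competition.csv"]

def load_corpus(paths):
    """Concatenate labeled sources into one DataFrame of text/label pairs."""
    frames = [pd.read_csv(p, usecols=["text", "label"]) for p in paths]
    corpus = pd.concat(frames, ignore_index=True)
    # Drop exact duplicates so the same passage cannot leak across
    # overlapping datasets into both the train and test splits.
    return corpus.drop_duplicates(subset="text").reset_index(drop=True)
```

Deduplicating on the raw text matters here because the three Kaggle sources partially overlap.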
The project leverages open-source architectures such as BERT and RNNs, focusing on enhanced feature extraction and a robust classification layer. The models use contextual embeddings and self-attention mechanisms to achieve high detection accuracy.
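The shape of such a model can be sketched in PyTorch: token embeddings feed a self-attention encoder, whose pooled contextual representations pass through a binary classification layer. This is an illustrative skeleton only; in practice the encoder would be a pretrained BERT, and the vocabulary size and dimensions below are assumptions:

```python
import torch
import torch.nn as nn

class DetectorHead(nn.Module):
    """Sketch of the detection pipeline: embeddings -> self-attention
    encoder -> pooled representation -> human/AI classification layer.
    A real system would swap the encoder for a pretrained BERT."""

    def __init__(self, vocab_size=30522, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, 2)  # logits: human vs. AI

    def forward(self, token_ids):
        # Contextual embeddings via self-attention over the token sequence.
        h = self.encoder(self.embed(token_ids))
        # Mean-pool across tokens, then classify the whole text.
        return self.classifier(h.mean(dim=1))
```

Given a batch of token-ID tensors, the model returns one pair of logits per text, which a softmax turns into human/AI probabilities.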
Our aim is to develop a classification system capable of distinguishing AI from human-generated texts with high precision and recall. This tool will be crucial for upholding the integrity of digital content, with potential applications in academic integrity and beyond.
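Precision and recall can be computed directly with scikit-learn. A small illustration on made-up labels (1 = AI-generated, 0 = human-written; the numbers are examples, not project results):

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative ground truth and predictions for six texts.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1]

# Precision: of the texts flagged as AI, what fraction truly are AI.
precision = precision_score(y_true, y_pred)
# Recall: of the truly AI texts, what fraction the detector flagged.
recall = recall_score(y_true, y_pred)
```

For a detector like this, precision protects human authors from false accusations, while recall measures how much AI text slips through, so both are reported rather than accuracy alone.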
Contributions are welcome! For major changes, please open an issue first to discuss what you would like to change.
This project is distributed under the MIT License. See LICENSE for more information.