Skip to content

Latest commit

 

History

History
585 lines (451 loc) · 19.1 KB

README.md

File metadata and controls

585 lines (451 loc) · 19.1 KB

# Comprehensive Arabic NLP Data Processing and Cleaning Guide

This README provides an extensive guide for preprocessing and cleaning Arabic text data for Natural Language Processing (NLP) tasks, covering all aspects of data preparation.

Table of Contents

  1. Text Normalization for Arabic
  2. Arabic-Specific Noise Removal
  3. Arabic Tokenization
  4. Arabic Stop Word Removal
  5. Arabic Stemming and Lemmatization
  6. Handling Arabic Diacritics
  7. Handling Numbers and Special Characters in Arabic Text
  8. Arabic Text Segmentation
  9. Handling Arabic Dialects
  10. Normalization of Arabic User-Generated Content
  11. Handling Arabizi (Arabic Chat Alphabet)
  12. Arabic Word Disambiguation
  13. Handling Elongated Words in Arabic
  14. Arabic Text Correction
  15. Arabic Named Entity Recognition (NER)
  16. Arabic Text Classification
  17. Arabic Sentiment Analysis
  18. Handling Emojis and Emoticons in Arabic Text
  19. Data Augmentation for Arabic NLP
  20. Recent Advances in Arabic NLP

1. Text Normalization for Arabic

Expand on the previous normalization steps to include more cases:

import re

def normalize_arabic(text):
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    text = re.sub("ڤ", "ف", text)
    text = re.sub("چ", "ج", text)
    text = re.sub("پ", "ب", text)
    text = re.sub("ڜ", "ش", text)
    text = re.sub("ڪ", "ك", text)
    text = re.sub("ڧ", "ق", text)
    text = re.sub("ٱ", "ا", text)
    return text

# Example usage
raw_text = "هذا نص تجريبي يحتوي على أحرف مختلفة مثل إ و أ و آ و ى و ڤ و چ"
normalized_text = normalize_arabic(raw_text)
print(normalized_text)

2. Arabic-Specific Noise Removal

Extend noise removal to handle more cases:

import re

def remove_arabic_noise(text):
    # Remove diacritics
    text = re.sub(r'[\u0617-\u061A\u064B-\u0652]', '', text)
    # Remove tatweel
    text = re.sub(r'\u0640', '', text)
    # Remove non-Arabic characters
    text = re.sub(r'[^\u0600-\u06FF\s]', '', text)
    # Remove HTML tags
    text = re.sub('<.*?>', '', text)
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Example usage
noisy_text = "هَـــذا نَـــصّ <b>تَجْــرِيــبـِـيّ</b> مع   مسافات  زائدة"
clean_text = remove_arabic_noise(noisy_text)
print(clean_text)

3. Arabic Tokenization

Use more advanced tokenization techniques:

from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

def tokenize_arabic(text, method='simple'):
    if method == 'simple':
        return simple_word_tokenize(text)
    elif method == 'morphological':
        mt = MorphologicalTokenizer.pretrained()
        return mt.tokenize(text)

# Example usage
text = "هذا مثال على تقطيع النص العربي بطريقة متقدمة."
simple_tokens = tokenize_arabic(text, 'simple')
morphological_tokens = tokenize_arabic(text, 'morphological')
print("Simple tokenization:", simple_tokens)
print("Morphological tokenization:", morphological_tokens)

4. Arabic Stop Word Removal

Use multiple stop word lists and allow customization:

from camel_tools.utils.stopwords import STOPWORDS as CAMEL_STOPWORDS
from nltk.corpus import stopwords

NLTK_STOPWORDS = set(stopwords.words('arabic'))

def remove_arabic_stopwords(tokens, custom_stopwords=None, use_nltk=True, use_camel=True):
    stopword_set = set()
    if use_nltk:
        stopword_set.update(NLTK_STOPWORDS)
    if use_camel:
        stopword_set.update(CAMEL_STOPWORDS)
    if custom_stopwords:
        stopword_set.update(custom_stopwords)
    
    return [token for token in tokens if token not in stopword_set]

# Example usage
tokens = ["هذا", "مثال", "على", "إزالة", "كلمات", "التوقف", "بشكل", "متقدم"]
custom_stopwords = ["متقدم"]
filtered_tokens = remove_arabic_stopwords(tokens, custom_stopwords=custom_stopwords)
print(filtered_tokens)

5. Arabic Stemming and Lemmatization

Compare different stemming and lemmatization techniques:

from camel_tools.stem import CAMeLStemmer
from farasa.stemmer import FarasaStemmer
from tashaphyne.stemming import ArabicLightStemmer

camel_stemmer = CAMeLStemmer.pretrained('calima-msa-s31')
farasa_stemmer = FarasaStemmer()
light_stemmer = ArabicLightStemmer()

def process_arabic(text, method='stem', tool='camel'):
    if method == 'stem':
        if tool == 'camel':
            return camel_stemmer.stem(text)
        elif tool == 'farasa':
            return farasa_stemmer.stem(text)
        elif tool == 'light':
            return ' '.join([light_stemmer.light_stem(word) for word in text.split()])
    elif method == 'lemmatize':
        return camel_stemmer.lemmatize(text)

# Example usage
text = "الكتب المدرسية مفيدة للطلاب"
camel_stemmed = process_arabic(text, 'stem', 'camel')
farasa_stemmed = process_arabic(text, 'stem', 'farasa')
light_stemmed = process_arabic(text, 'stem', 'light')
lemmatized = process_arabic(text, 'lemmatize')

print("CAMeL stemmed:", camel_stemmed)
print("Farasa stemmed:", farasa_stemmed)
print("Light stemmed:", light_stemmed)
print("Lemmatized:", lemmatized)

6. Handling Arabic Diacritics

Provide options for diacritic handling:

import pyarabic.araby as araby

def handle_diacritics(text, method='remove'):
    if method == 'remove':
        return araby.strip_diacritics(text)
    elif method == 'keep':
        return text
    elif method == 'normalize':
        return araby.normalize_hamza(araby.strip_shadda(text))

# Example usage
text_with_diacritics = "اللُّغَةُ العَرَبِيَّةُ جَمِيلَةٌ"
removed_diacritics = handle_diacritics(text_with_diacritics, 'remove')
normalized_diacritics = handle_diacritics(text_with_diacritics, 'normalize')

print("Original:", text_with_diacritics)
print("Removed diacritics:", removed_diacritics)
print("Normalized diacritics:", normalized_diacritics)

7. Handling Numbers and Special Characters in Arabic Text

Process numbers and special characters:

import re

def handle_numbers_and_special_chars(text, mode='remove'):
    if mode == 'remove':
        # Remove numbers and special characters
        return re.sub(r'[^\u0600-\u06FF\s]', '', text)
    elif mode == 'normalize':
        # Normalize Arabic numbers to Hindi numbers
        number_map = {
            '٠': '0', '١': '1', '٢': '2', '٣': '3', '٤': '4',
            '٥': '5', '٦': '6', '٧': '7', '٨': '8', '٩': '9'
        }
        for arabic, hindi in number_map.items():
            text = text.replace(arabic, hindi)
        return text

# Example usage
text = "يوجد ٣ تفاحات و٥ برتقالات في السلة!"
removed_numbers = handle_numbers_and_special_chars(text, 'remove')
normalized_numbers = handle_numbers_and_special_chars(text, 'normalize')

print("Original:", text)
print("Removed numbers and special chars:", removed_numbers)
print("Normalized numbers:", normalized_numbers)

8. Arabic Text Segmentation

Implement text segmentation for Arabic:

from camel_tools.segmenters.word import MaxLikelihoodProbabilityModel

def segment_arabic_text(text):
    mlp_model = MaxLikelihoodProbabilityModel.pretrained()
    segmented = mlp_model.segment(text)
    return ' '.join(segmented)

# Example usage
text = "وقالمصدرإنهناكتحسنافيالوضع"
segmented_text = segment_arabic_text(text)
print("Original:", text)
print("Segmented:", segmented_text)

9. Handling Arabic Dialects

Process different Arabic dialects:

from camel_tools.dialectid import DialectIdentifier

def identify_dialect(text):
    did = DialectIdentifier.pretrained()
    dialect = did.predict(text)
    return dialect

def normalize_dialect(text, target_dialect='MSA'):
    # This is a placeholder function. In practice, you would use more sophisticated
    # methods to normalize dialects, which is an active area of research.
    return text

# Example usage
text = "شلونك حبيبي؟ شخبارك اليوم؟"
dialect = identify_dialect(text)
normalized_text = normalize_dialect(text)

print("Original text:", text)
print("Identified dialect:", dialect)
print("Normalized to MSA:", normalized_text)

10. Normalization of Arabic User-Generated Content

Handle common issues in user-generated content:

import re

def normalize_user_content(text):
    # Convert repeated characters to single occurrence
    text = re.sub(r'(.)\1+', r'\1', text)
    
    # Normalize common chat spellings
    chat_spellings = {
        'إنشالله': 'إن شاء الله',
        'يسلمو': 'يسلموا',
        'عشان': 'علشان',
    }
    for chat, formal in chat_spellings.items():
        text = text.replace(chat, formal)
    
    return text

# Example usage
user_text = "يااااا سلااااام!! إنشالله بكرة نتقابل عشان نروح السينما"
normalized_text = normalize_user_content(user_text)
print("Original:", user_text)
print("Normalized:", normalized_text)

11. Handling Arabizi (Arabic Chat Alphabet)

Convert Arabizi to Arabic script:

def arabizi_to_arabic(text):
    # This is a simplified conversion. A complete solution would be more complex.
    conversion_dict = {
        'a': 'ا', 'b': 'ب', 't': 'ت', 'th': 'ث', 'g': 'ج', '7': 'ح', 'kh': 'خ',
        'd': 'د', 'th': 'ذ', 'r': 'ر', 'z': 'ز', 's': 'س', 'sh': 'ش', '9': 'ص',
        '6': 'ط', '3': 'ع', 'gh': 'غ', 'f': 'ف', 'q': 'ق', 'k': 'ك', 'l': 'ل',
        'm': 'م', 'n': 'ن', 'h': 'ه', 'w': 'و', 'y': 'ي'
    }
    
    for latin, arabic in conversion_dict.items():
        text = text.replace(latin, arabic)
    
    return text

# Example usage
arabizi_text = "mar7aba, kayf 7alak?"
arabic_text = arabizi_to_arabic(arabizi_text)
print("Arabizi:", arabizi_text)
print("Arabic:", arabic_text)

12. Arabic Word Disambiguation

Implement word sense disambiguation for Arabic:

from camel_tools.disambig import CamelDisambiguator

def disambiguate_arabic(text):
    disambiguator = CamelDisambiguator.pretrained('calima-msa-r13')
    disambiguated = disambiguator.disambiguate(text.split())
    return [d.analyses[0].analysis['lex'] for d in disambiguated]

# Example usage
text = "ذهب الرجل إلى البنك"
disambiguated = disambiguate_arabic(text)
print("Original:", text)
print("Disambiguated:", ' '.join(disambiguated))

13. Handling Elongated Words in Arabic

Normalize elongated words:

import re

def normalize_elongated_words(text):
    # Remove elongation
    text = re.sub(r'(.)\1+', r'\1\1', text)
    return text

# Example usage
elongated_text = "يااااا سلاااام على هذا البرنااامج الراااائع"
normalized_text = normalize_elongated_words(elongated_text)
print("Elongated:", elongated_text)
print("Normalized:", normalized_text)

14. Arabic Text Correction

Implement basic text correction for common mistakes:

def correct_arabic_text(text):
    corrections = {
        'انشاء الله': 'إن شاء الله',
        'لاكن': 'لكن',
        'إنشالله': 'إن شاء الله',
        'الذي': 'الذي',
        'هاذا': 'هذا',
        'إنه': 'إنه',
    }
    
    for mistake, correction in corrections.items():
        text = text.replace(mistake, correction)
    
    return text

# Example usage
incorrect_text = "انشاء الله سوف اذهب الى المدرسه غدا لاكن هاذا يعتمد على الطقس"
corrected_text = correct_arabic_text(incorrect_text)
print("Incorrect:", incorrect_text)
print("Corrected:", corrected_text)

15. Arabic Named Entity Recognition (NER)

Use state-of-the-art models for Arabic NER:

from camel_tools.ner import NERecognizer

def recognize_entities(text):
    ner = NERecognizer.pretrained()
    labels = ner.predict_sentence(text)
    entities = []
    current_entity = []
    current_label = None
    
    for word, label in zip(text.split(), labels):
        if label.startswith('B-'):
            if current_entity:
                entities.append((' '.join(current_entity), current_label))
                current_entity = []
            current_entity.append(word)
            current_label = label[2:]
        elif label.startswith('I-') and current_entity:
            current_entity.append(word)
        else:
            if current_entity:
                entities.append((' '.join(current_entity), current_label))
                current_entity = []
                current_label = None
    
    if current_entity:
        entities.append((' '.join(current_entity), current_label))
    
    return entities

# Example usage
text = "يعيش محمد في القاهرة ويعمل في شركة جوجل."
entities = recognize_entities(text)
print("Text:", text)
print("Recognized entities:", entities)

16. Arabic Text Classification

Implement Arabic text classification using modern deep learning approaches:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def classify_arabic_text(text, model_name="aubmindlab/bert-base-arabertv2"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    return predictions.tolist()[0]

# Example usage
text = "هذا النص رائع ومفيد جداً"
classification = classify_arabic_text(text)
print(f"Text: {text}")
print(f"Classification probabilities: {classification}")

17. Arabic Sentiment Analysis

Perform sentiment analysis on Arabic text using specialized models:

from transformers import pipeline

def analyze_arabic_sentiment(text):
    sentiment_pipeline = pipeline("sentiment-analysis", model="CAMeL-Lab/bert-base-arabic-camelbert-msa-sentiment")
    result = sentiment_pipeline(text)[0]
    return result['label'], result['score']

# Example usage
text = "أنا سعيد جداً بهذا المنتج!"
sentiment, score = analyze_arabic_sentiment(text)
print(f"Text: {text}")
print(f"Sentiment: {sentiment}, Score: {score}")

18. Handling Emojis and Emoticons in Arabic Text

Process emojis and emoticons in Arabic text:

import emoji

def handle_emojis(text, mode='remove'):
    if mode == 'remove':
        return emoji.replace_emoji(text, '')
    elif mode == 'description':
        return emoji.demojize(text, language='ar')
    return text

# Example usage
text_with_emoji = "أنا أحب القراءة 📚 وأستمتع بها كثيراً 😊"
text_without_emoji = handle_emojis(text_with_emoji, 'remove')
text_with_descriptions = handle_emojis(text_with_emoji, 'description')

print("Original:", text_with_emoji)
print("Without emojis:", text_without_emoji)
print("With emoji descriptions:", text_with_descriptions)

19. Data Augmentation for Arabic NLP

Implement data augmentation techniques for Arabic:

import random
from camel_tools.morphology import analyzer

def augment_arabic_data(text, num_augmentations=1):
    morph = analyzer.pretrained_analyzer()
    words = text.split()
    augmented_texts = []

    for _ in range(num_augmentations):
        new_words = []
        for word in words:
            analysis = morph.analyze(word)
            if analysis:
                # Randomly choose a different form of the word
                new_word = random.choice(analysis).inflected
                new_words.append(new_word)
            else:
                new_words.append(word)
        augmented_texts.append(' '.join(new_words))

    return augmented_texts

# Example usage
original_text = "الكتاب مفيد للقراءة"
augmented_data = augment_arabic_data(original_text, num_augmentations=3)

print("Original:", original_text)
print("Augmented data:")
for i, text in enumerate(augmented_data, 1):
    print(f"{i}. {text}")

20. Recent Advances in Arabic NLP

Here are some recent advances in Arabic NLP:

  1. Large Language Models for Arabic:

    • AraBERT: A transformer-based model pre-trained on a large Arabic corpus.
    • AraGPT2: An Arabic version of GPT-2, capable of generating coherent Arabic text.
    • MARBERT: A large-scale pre-trained masked language model for Arabic.
  2. Multilingual Models:

    • mBERT and XLM-R have shown impressive performance on Arabic NLP tasks without being specifically trained on Arabic.
  3. Dialect-Specific Models:

    • MADAR: A comprehensive Arabic dialect identification system.
    • Multi-dialect BERT models: Pre-trained on various Arabic dialects for improved performance on dialectal Arabic.
  4. Cross-Lingual Transfer Learning:

    • Techniques to transfer knowledge from high-resource languages to improve Arabic NLP tasks.
  5. Arabic Text Summarization:

    • AraBERT-summarizer: Fine-tuned AraBERT model for Arabic text summarization.
  6. Improved Arabic Speech Recognition:

    • End-to-end models using transformers have significantly improved Arabic ASR accuracy.
  7. Arabic Question Answering:

    • Arabic-SQuAD: A large-scale dataset for Arabic question answering.
    • ArabicQA models: BERT-based models fine-tuned for Arabic question answering tasks.
  8. Neural Machine Translation:

    • Significant improvements in Arabic-English and Arabic-other languages translation using transformer-based models.
  9. Arabic Sentiment Analysis:

    • ASAD: A Twitter-based Arabic Sentiment Analysis Dataset.
    • AraSenTi-Tweet: A large-scale Arabic sentiment analysis dataset.
  10. Arabic Named Entity Recognition:

    • ANERcorp: A large-scale manually annotated Arabic NER corpus.
    • CAMeL Tools: A suite of Arabic NLP tools including state-of-the-art NER models.

To stay updated with the latest advances:

  • Regularly check conferences like ACL, EMNLP, and WANLP for Arabic NLP papers.
  • Follow research from institutions like QCRI, NYU Abu Dhabi, and Carnegie Mellon University in Qatar.
  • Monitor Arabic NLP-focused workshops and shared tasks in major NLP conferences.

Remember to adapt these techniques and code examples to your specific Arabic NLP task and dataset. Always validate your preprocessing pipeline to ensure it's not introducing unintended biases or errors in your data.