IqraEval.2 Challenge Interspeech 2026

Overview

This Challenge Interspeech 2026 is a shared task aimed at advancing automatic assessment of Modern Standard Arabic (MSA) pronunciation by leveraging computational methods to detect and diagnose pronunciation errors. The focus on MSA provides a standardized and well-defined context for evaluating Arabic pronunciation.

Participants will develop systems capable of detecting mispronunciations (e.g., substitution, deletion, or insertion of phonemes).

Timeline

1 December 2025: Registration opens
15 December 2025: Release of training data, evaluation set, Arabic phoneme set, and phonemiser
15 February 2026: Registration closes; leaderboard frozen
17 February 2026: Results announced
25 February 2026: Challenge paper submission deadline

Task Description: MSA Mispronunciation Detection System

Design a model to detect and provide detailed feedback on mispronunciations in MSA speech. Users read vowelized sentences; the model predicts the spoken phoneme sequence and flags deviations. Evaluation is on the MSA-Test dataset with human‐annotated errors.

Figure: Overview of the Mispronunciation Detection Workflow

1. Read the Sentence

System shows a Reference Sentence plus its Reference Phoneme Sequence.

Example:

Arabic: يَتَحَدَّثُ النَّاسُ اللُّغَةَ الْعَرَبِيَّةَ
Phoneme: < y a t a H a d d a v u n n aa s u l l u g h a t a l E a r a b i y y a t a

2. Save Recording

User speaks; system captures and stores the audio waveform.

3. Mispronunciation Detection

Model predicts the phoneme sequence—deviations from reference indicate mispronunciations.

Example of Mispronunciation:

Reference: < y a t a H a d d a v u n n aa s u l l u g h a t a l E a r a b i y y a t a
Predicted: < y a t a H a d d a s u n n aa s u l l u g h a t u l E a r a b i y y a t a

Here, v→s (substitution) represents a common pronunciation error.

Phoneme Set Description

The phoneme set employed in this work derives from a specialized phonetizer developed specifically for vowelized Modern Standard Arabic (MSA). It encompasses a comprehensive inventory of phonemes designed to capture essential phonetic and prosodic features, including stress, pausing, intonation, emphaticness, and gemination. Notably, gemination—the lengthening of consonant sounds—is explicitly represented by duplicating the consonant symbol (e.g., /b/ becomes /bb/). This approach ensures a detailed yet practical representation of speech sounds, which is critical for accurate mispronunciation detection.

To phonemize additional datasets or custom text using this standard, we provide the open-source tool at the MSA Phonetizer Repository. Important: This phonetizer requires the input Arabic text to be fully diacritized to ensure accurate phonetic transcription. For further details on the symbols used, please refer to the Phoneme Inventory.

Training Data Overview

To ensure robustness, our training strategy utilizes a mix of native speech (pseudo-labeled), synthetic mispronunciations, and real recorded errors.

1. Native Speech (Pseudo-Labeled)

Dataset: IqraEval/Iqra_train
Volume: ~79 hours (Train) + 3.4 hours (Dev)
This dataset consists of recordings from native MSA speakers. As these speakers are assumed to pronounce the text correctly, this subset is treated as "Golden" data using pseudo-labels.

Columns:

audio: The speech waveform.
sentence: The original raw text.
tashkeel_sentence: Fully diacritized text, generated using an internal SOTA diacritizer (assumed correct).
phoneme_ref: The reference canonical phoneme sequence.
phoneme_mis: The realized phoneme sequence.
Note: Since no errors are present, this is identical to phoneme_ref.

2. Synthetic Mispronunciations (TTS)

Dataset: IqraEval/Iqra_TTS
Volume: ~80 hours
To compensate for the lack of errors in the native set, we generated a synthetic dataset using various trained TTS systems. Mispronunciations were deliberately introduced into the input text before audio generation.

Columns:

audio: The synthesized waveform.
sentence_ref: The original correct text.
sentence_aug: The text containing augmented errors.
phoneme_ref: The canonical phoneme sequence of the correct text.
phoneme_mis: The phoneme sequence corresponding to the synthesized mispronunciation.
tashkeel_sentence: The fully diacritized version of the reference text.

3. Real Mispronunciations (Interspeech 2026)

Dataset: IqraEval/Iqra_Extra_IS26
Volume: ~2 hours
Moving beyond synthetic data, this subset contains real recordings of human mispronunciations collected specifically for Interspeech 2026.

Columns:

audio: The speech waveform.
sentence: The original text.
phoneme_ref: The target canonical phoneme sequence.
phoneme_mis: The actual realized phonemes containing human errors.

Evaluation Dataset

Dataset: IqraEval/QuranMB.v2
Currently, only the audio files are released for this evaluation set. It serves as a benchmark for detecting mispronunciations in a distinct domain.

Important Note on Data Leakage:
Strict measures were taken to ensure experimental integrity. We have verified that there is no overlap in speakers or content (sentences) between the training datasets (`Iqra_train`, `Iqra_TTS`, `Iqra_Extra_IS26`) and the evaluation datasets.

Submission Details (Draft)

Soon to be released

-->

Evaluation Criteria

🏆 Primary Metric

The Leaderboard is ranked primarily by the Phoneme-level F1-score. While other metrics (FRR, FAR, DER) are computed for analysis, F1 determines the final standing.

We use a hierarchical evaluation strategy (detection + diagnostic) based on the MDD Overview framework.

1. Input Definitions

What is said: The annotated phoneme sequence.
What is predicted: The output from your model.
What should have been said: The reference (target) sequence.

2. Confusion Matrix Components

From the inputs above, we compute the following counts:

TA (True Accept)	Correct phonemes properly accepted.
TR (True Reject)	Mispronunciations correctly detected.
FR (False Reject)	Correct phonemes incorrectly flagged as errors.
FA (False Accept)	Mispronunciations missed (labeled as correct).

3. Calculated Metrics

Detection Metrics (Leaderboard Ranking)

Precision: TR / (TR + FR)
Recall: TR / (TR + FA)
F1-Score: 2 · (Precision · Recall) / (Precision + Recall)

Diagnostic Rates (Auxiliary)

FRR (False Reject Rate): FR / (TA + FR)
FAR (False Accept Rate): FA / (FA + TR)
DER (Diagnostic Error Rate): DE / (CD + DE)

Suggested Research Directions

Advanced Mispronunciation Detection Models
Apply state-of-the-art self-supervised models (e.g., Wav2Vec2.0, HuBERT), using variants that are pre-trained/fine-tuned on Arabic speech. These models can then be fine-tuned on MSA datasets to improve phoneme-level accuracy.
Data Augmentation Strategies
Create synthetic mispronunciation examples using pipelines like SpeechBlender. Augmenting limited Arabic speech data helps mitigate data scarcity and improves model robustness.
Analysis of Common Mispronunciation Patterns
Perform statistical analysis on the MSA-Test dataset to identify prevalent errors (e.g., substituting similar phonemes, swapping vowels). These insights can drive targeted training and tailored feedback rules.

Registration

Teams and individual participants must register to gain access to the test set. Please complete the registration form using the link below:

Registration Form

Registration opens on December 1, 2025.

Future Updates

Further details on the open-set leaderboard submission will be posted on the shared task website (December 15, 2025). Stay tuned!

Contact and Support

For inquiries and support, reach out to the task coordinators.

References

El Kheir Y. et al., “SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation,” arXiv:2211.00923, 2022.
Aly S. A. et al., “ASMDD: Arabic Speech Mispronunciation Detection Dataset,” arXiv:2111.01136, 2021.
Moustafa A. & Aly S. A., “Efficient Voice Identification Using Wav2Vec2.0 and HuBERT…,” arXiv:2111.06331, 2021.
El Kheir Y. et al., “Automatic Pronunciation Assessment – A Review,” arXiv:2310.13974, 2021.