Data Sources
This project references external tools and datasets in README.md. The main sources used in the pipeline are:
- Recitation collection: https://github.com/obadx/prepare-quran-dataset
- Segmentation by pauses: https://github.com/obadx/recitations-segmenter
- Quran‑tuned Whisper model: https://huggingface.co/tarteel-ai/whisper-base-ar-quran
- Correction + script conversion: https://github.com/obadx/quran-transcript
How these sources fit together
- Raw audio is collected and curated.
- Segments are created using pause‑based splitting.
- Automatic transcription provides initial Imlaey text.
- Tasmeea correction improves transcription fidelity.
- Script conversion yields Uthmani text.
- Phonetizer generates phoneme + sifat labels.
Recommended metadata to track
For reproducibility, record these fields per segment:
- Reciter id / source
- Recitation style (murattal/mujawad/hadr)
- Audio format (sample rate, bitrate)
- Segment boundaries (start/end timestamps)
- Reference sura/ayah and word span
- Moshaf attributes used
Licensing note
Each external dataset or model has its own license. Record and cite the original source licenses in your dataset documentation.