Pipeline Steps

This section expands on the high‑level steps described in README.md. The goal is to make the flow of data and labels explicit, which is essential for research reproducibility.

Step‑by‑step flow (conceptual)

| Step | Input | Output | Notes |
|------|-------|--------|-------|
| 1. Collect recitations | Raw audio | Curated audio set | Prefer high‑quality reciters with metadata (reciter, style, speed). |
| 2. Segment by pauses | Raw audio | Segmented clips | Pause‑based segmentation is more stable than ayah‑level segmentation for training. |
| 3. Transcribe audio | Segmented clips | Imlaey text | A Quran‑tuned Whisper model is used for initial transcription. |
| 4. Correct transcripts | Imlaey text | Corrected Imlaey text | Use tasmeea alignment (quran-transcript) to fix mistakes. |
| 5. Convert scripts | Imlaey → Uthmani | Uthmani text | Mapping is handled by the Quran script map in quran-transcript. |
| 6. Phonetize | Uthmani text | Phoneme + sifat labels | quran_transcript.quran_phonetizer outputs phonemes and attributes. |
| 7. Train model | Audio + labels | Multi‑level CTC model | Wav2Vec2BERT + multiple CTC heads. |
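
To make step 2 concrete, the sketch below splits a waveform wherever the signal stays quiet for long enough. It is only an illustration of the pause-based idea: the function name `split_on_pauses`, the frame-energy threshold, and all default values are assumptions, and the actual pipeline may use a different detector (for example a trained VAD model).

```python
from typing import List, Tuple


def split_on_pauses(
    samples: List[float],
    sample_rate: int,
    frame_ms: int = 20,             # assumed frame size
    energy_threshold: float = 1e-3,  # assumed silence threshold
    min_pause_ms: int = 300,         # assumed minimum pause length
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) speech spans separated by long pauses."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_pause_frames = max(1, min_pause_ms // frame_ms)

    # Mark each frame as speech (True) or silence (False) by mean energy.
    voiced = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        voiced.append(energy > energy_threshold)

    segments = []
    start = None        # first frame of the current segment
    last_voiced = None  # most recent speech frame in the current segment
    silence_run = 0     # consecutive silent frames since last_voiced
    for idx, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = idx
            last_voiced = idx
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_pause_frames:
                # The pause is long enough: close the segment at the last speech frame.
                segments.append((start, last_voiced + 1))
                start = None
                silence_run = 0
    if start is not None:
        segments.append((start, last_voiced + 1))

    frame_sec = frame_len / sample_rate
    return [(s * frame_sec, e * frame_sec) for s, e in segments]
```

Pauses shorter than `min_pause_ms` stay inside a segment, so brief gaps within a phrase do not split a clip.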

Artifacts you should save

For reproducibility, store these artifacts in your data pipeline:

  • segments.jsonl – audio segment metadata (start/end, reciter, source).
  • transcripts_raw.jsonl – initial transcription.
  • transcripts_fixed.jsonl – corrected Imlaey text.
  • uthmani.jsonl – converted Uthmani text per segment.
  • phonetic_labels.jsonl – phoneme + sifat sequences.
  • train/valid/test splits with fixed random seeds.
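
One possible way to write these artifacts and produce repeatable splits is sketched below. The helper names (`write_jsonl`, `split_by_seed`), the record fields, and the split fractions are assumptions for illustration; they are not taken from the project code.

```python
import json
import random
from pathlib import Path
from typing import Dict, Iterable, List


def write_jsonl(path: Path, records: Iterable[dict]) -> None:
    """Write one JSON object per line; keep Arabic text readable (no ASCII escaping)."""
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def split_by_seed(
    ids: Iterable[str],
    seed: int = 42,            # assumed default seed
    valid_frac: float = 0.1,   # assumed split fractions
    test_frac: float = 0.1,
) -> Dict[str, List[str]]:
    """Split segment ids into train/valid/test deterministically."""
    ordered = sorted(ids)  # sort first so the input order does not affect the split
    random.Random(seed).shuffle(ordered)
    n_valid = int(len(ordered) * valid_frac)
    n_test = int(len(ordered) * test_frac)
    return {
        "valid": ordered[:n_valid],
        "test": ordered[n_valid:n_valid + n_test],
        "train": ordered[n_valid + n_test:],
    }
```

Storing the seed and the resulting id lists alongside the JSONL files lets a later run confirm that it is training on the same split.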

Label generation details

The core label generator is:

```python
from quran_transcript import quran_phonetizer
```

It produces:

  • phonemes: the phonetic script string
  • sifat: per‑phoneme attribute labels

These are then tokenized per level by MultiLevelTokenizer during training.
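
The idea of tokenizing "per level" can be illustrated with a toy example in which every level (the phonemes and each sifat attribute) has its own vocabulary and yields its own id sequence for a separate CTC head. This is not the MultiLevelTokenizer implementation: the helper names, the Latin placeholder symbols, the level name `hams_or_jahr`, and the choice of id 0 for the CTC blank are all assumptions.

```python
from typing import Dict, List


def build_vocab(symbols: List[str]) -> Dict[str, int]:
    """Map each distinct symbol to an id, reserving 0 for the CTC blank (assumed)."""
    vocab = {"<blank>": 0}
    for sym in sorted(set(symbols)):
        vocab[sym] = len(vocab)
    return vocab


def encode_levels(
    levels: Dict[str, List[str]],
    vocabs: Dict[str, Dict[str, int]],
) -> Dict[str, List[int]]:
    """Encode every level with its own vocabulary."""
    # Assumption: one attribute value per phoneme unit, so all levels align.
    # Relax this check if the real labels group phonemes differently.
    lengths = {len(seq) for seq in levels.values()}
    if len(lengths) > 1:
        raise ValueError(f"levels are not aligned: {sorted(lengths)}")
    return {
        name: [vocabs[name][sym] for sym in seq]
        for name, seq in levels.items()
    }


# Toy labels with placeholder symbols (not real phonetizer output).
levels = {
    "phonemes": ["b", "i", "s", "m"],
    "hams_or_jahr": ["jahr", "jahr", "hams", "jahr"],
}
vocabs = {name: build_vocab(seq) for name, seq in levels.items()}
encoded = encode_levels(levels, vocabs)
print(encoded)
```

Keeping a separate vocabulary per level means each CTC head predicts over a small label set of its own instead of one large combined label space.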

Known sensitivities

  • Segmentation errors propagate to alignment and degrade sifat quality.
  • Transcription noise in the Imlaey stage can cause the Imlaey → Uthmani mapping to fail.
  • Recitation speed changes the distribution of segment lengths and may call for curriculum learning or data augmentation.
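
A quick way to check the last point is to summarize segment durations per recitation speed before training. The sketch below assumes each segment record carries `start`, `end`, and a `speed` field; those field names and both helper names are hypothetical. It also shows the simplest form of a length-based curriculum, ordering segments from shortest to longest.

```python
import statistics
from collections import defaultdict
from typing import Dict, List


def duration_stats_by_speed(segments: List[dict]) -> Dict[str, dict]:
    """Group segment durations (seconds) by speed label and summarize them."""
    by_speed = defaultdict(list)
    for seg in segments:
        by_speed[seg["speed"]].append(seg["end"] - seg["start"])
    return {
        speed: {
            "count": len(durations),
            "median": statistics.median(durations),
            "max": max(durations),
        }
        for speed, durations in by_speed.items()
    }


def curriculum_order(segments: List[dict]) -> List[dict]:
    """Order segments shortest first, a simple length-based curriculum."""
    return sorted(segments, key=lambda seg: seg["end"] - seg["start"])
```

If the medians differ widely between speed labels, that is a signal to bucket batches by length or to rebalance the training set across speeds.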

Next step

See Evaluation and Metrics for recommended benchmarks and reporting.