Architecture
This project uses a multi‑level CTC architecture on top of Wav2Vec2BERT. Each level predicts a different target sequence:
- Phoneme sequence (primary)
- Sifat attributes (secondary heads; one head per attribute)
The implementation lives in src/quran_muaalem/modeling/.
High‑level flow
Audio (16 kHz)
→ Feature extractor (AutoFeatureExtractor)
→ Wav2Vec2BERT encoder
→ Multi‑level CTC heads
→ Greedy CTC decode + alignment
→ Phonemes + Sifat outputsMulti‑level CTC heads
Wav2Vec2BertForMultilevelCTC (in modeling_multi_level_ctc.py) creates a linear head per level:
phonemeshams_or_jahrshidda_or_rakhawatafkheem_or_taqeeqitbaqsafeerqalqlatikraartafashieistitalaghonna
Each head is trained with its own CTC loss. Loss weights are configurable via:
level_to_vocab_sizelevel_to_loss_weight
(see configuration_multi_level_ctc.py).
Tokenization per level
MultiLevelTokenizer builds a tokenizer per level using Wav2Vec2CTCTokenizer and the model’s vocab files. Phoneme tokens are Arabic phoneme symbols; sifat tokens are bracketed Arabic labels mapped from SifaOutput.
Decoding and alignment
At inference time:
- The model outputs per‑level logits.
- Each level is decoded with greedy CTC (
top‑1 + collapse + blank removal). - The phoneme sequence is chunked into phoneme groups.
- Sifat sequences are aligned to the phoneme groups and to the reference (if provided).
Alignment logic is implemented in src/quran_muaalem/decode.py.
Practical implications for researchers
- Primary signal: phoneme head is the most reliable output.
- Attribute heads can be sensitive to alignment quality; evaluation should measure both raw accuracy and alignment robustness.
- Loss weighting affects attribute precision/recall significantly and should be tuned against a validation set.