Architecture

This project uses a multi‑level CTC architecture on top of Wav2Vec2BERT. Each level predicts a different target sequence:

Phoneme sequence (primary)
Sifat attributes (secondary heads; one head per attribute)

The implementation lives in src/quran_muaalem/modeling/.

High‑level flow

Audio (16 kHz)
  → Feature extractor (AutoFeatureExtractor)
  → Wav2Vec2BERT encoder
  → Multi‑level CTC heads
  → Greedy CTC decode + alignment
  → Phonemes + Sifat outputs

Multi‑level CTC heads

Wav2Vec2BertForMultilevelCTC (in modeling_multi_level_ctc.py) creates a linear head per level:

phonemes
hams_or_jahr
shidda_or_rakhawa
tafkheem_or_taqeeq
itbaq
safeer
qalqla
tikraar
tafashie
istitala
ghonna

Each head is trained with its own CTC loss. Loss weights are configurable via:

level_to_vocab_size
level_to_loss_weight

(see configuration_multi_level_ctc.py).

Tokenization per level

MultiLevelTokenizer builds a tokenizer per level using Wav2Vec2CTCTokenizer and the model’s vocab files. Phoneme tokens are Arabic phoneme symbols; sifat tokens are bracketed Arabic labels mapped from SifaOutput.

Decoding and alignment

At inference time:

The model outputs per‑level logits.
Each level is decoded with greedy CTC (top‑1 + collapse + blank removal).
The phoneme sequence is chunked into phoneme groups.
Sifat sequences are aligned to the phoneme groups and to the reference (if provided).

Alignment logic is implemented in src/quran_muaalem/decode.py.

Practical implications for researchers

Primary signal: phoneme head is the most reliable output.
Attribute heads can be sensitive to alignment quality; evaluation should measure both raw accuracy and alignment robustness.
Loss weighting affects attribute precision/recall significantly and should be tuned against a validation set.

Architecture ​

High‑level flow ​

Multi‑level CTC heads ​

Tokenization per level ​