Evaluation and Metrics

This page defines recommended evaluation metrics for researchers. The goal is to make model comparisons consistent and reproducible.

1) Core metrics

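Compute the edit distance between the predicted phoneme sequence and the reference phoneme sequence.
Report the phoneme error rate, PER = (S + D + I) / N, where S, D, and I are the numbers of substitutions, deletions, and insertions in the minimum-edit alignment, and N is the number of phonemes in the reference.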
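
The PER formula above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names `edit_counts` and `per` are hypothetical, phonemes are assumed to be hashable tokens such as strings, and ties between equally cheap alignments are broken arbitrarily (the total S + D + I is unaffected).

```python
from typing import Sequence, Tuple


def edit_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (S, D, I) for one minimum-cost alignment of hyp against ref."""
    # Each cell holds (cost, S, D, I) for ref[:i] versus hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # empty ref: all insertions
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]  # empty hyp: all deletions
        for j in range(1, len(hyp) + 1):
            c, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, n)          # match, no cost
            else:
                diag = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = prev[j]
            dele = (c + 1, s, d + 1, n)      # reference phoneme missing in hyp
            c, s, d, n = curr[j - 1]
            ins = (c + 1, s, d, n + 1)       # extra phoneme in hyp
            curr.append(min(diag, dele, ins))  # compares cost first
        prev = curr
    _, s, d, n = prev[-1]
    return s, d, n


def per(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Phoneme error rate (S + D + I) / N for a single utterance."""
    if len(ref) == 0:
        raise ValueError("reference sequence must not be empty")
    s, d, n = edit_counts(ref, hyp)
    return (s + d + n) / len(ref)


ref = ["k", "a", "t", "a", "b"]
hyp = ["k", "a", "d", "a"]
print(edit_counts(ref, hyp), per(ref, hyp))
```

For a whole evaluation set, one common choice is to sum S, D, and I over all utterances and divide by the total reference length, rather than averaging per-utterance PER; state which one you report.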

Measure how often each predicted phoneme group aligns with the correct reference group.
Use alignment accuracy, or an IoU-style overlap if groups are expanded.
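
One way to make the IoU-style overlap concrete is sketched below. It assumes, purely for illustration, that each group is a half-open span `(start, end)` over phoneme indices and that predicted and reference groups are already paired one-to-one; `span_iou`, `group_accuracy`, and the 0.5 threshold are hypothetical choices, not part of any defined API.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) over phoneme indices (assumed)


def span_iou(a: Span, b: Span) -> float:
    """Intersection over union of two half-open index spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def group_accuracy(pred: List[Span], ref: List[Span], threshold: float = 0.5) -> float:
    """Fraction of paired groups whose IoU reaches the threshold."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must be paired one-to-one")
    if not ref:
        raise ValueError("no groups to score")
    hits = sum(1 for p, r in zip(pred, ref) if span_iou(p, r) >= threshold)
    return hits / len(ref)


print(span_iou((2, 5), (3, 7)))
print(group_accuracy([(0, 3), (3, 6)], [(0, 3), (5, 9)]))
```

If you report a thresholded accuracy like this, state the threshold alongside the number.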

If you report error types, measure how often the model highlights the correct phoneme group for a mistake.

Create evaluation sets by recording condition and recitation style:

Keep a fixed random seed for the train/valid/test split to ensure reproducibility.
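
A seeded split might look like the sketch below. The function name `split_ids`, the seed value, and the 80/10/10 proportions are illustrative assumptions; the points it demonstrates are sorting the IDs before shuffling, so the result does not depend on input order, and using a dedicated `random.Random` instance, so other code cannot disturb the sequence.

```python
import random
from typing import List, Sequence, Tuple


def split_ids(
    ids: Sequence[str],
    seed: int = 1234,          # assumed value; record whatever you use
    valid_frac: float = 0.1,   # assumed proportion
    test_frac: float = 0.1,    # assumed proportion
) -> Tuple[List[str], List[str], List[str]]:
    """Deterministically split utterance IDs into train/valid/test lists."""
    ordered = sorted(ids)       # remove dependence on input order
    rng = random.Random(seed)   # private generator, unaffected by global state
    rng.shuffle(ordered)
    n_test = int(len(ordered) * test_frac)
    n_valid = int(len(ordered) * valid_frac)
    test = ordered[:n_test]
    valid = ordered[n_test:n_test + n_valid]
    train = ordered[n_test + n_valid:]
    return train, valid, test


ids = [f"utt{i:03d}" for i in range(100)]
train, valid, test = split_ids(ids)
print(len(train), len(valid), len(test))
```

Saving the resulting ID lists to files and publishing them with the results is a more robust guarantee than the seed alone, since shuffle behaviour can differ across library versions.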

When publishing results, include:

Unit.probs are raw CTC softmax outputs and are not calibrated. If you use confidence thresholds: