Overview

The validation algorithm ensures that the IME engine transforms only valid Vietnamese syllables. It takes a whitelist-based approach: six rules run in sequence and check syllable structure, phonotactics, and spelling conventions.

Purpose

Validation occurs BEFORE transformation:

"duoc" + j → VALID   → transform → "được" ✓
"claus" + s → INVALID → keep original → "clauss" ✓
"http" + s → INVALID → keep original → "https" ✓
This protects text that should never be transformed:
  • Code identifiers (function, const)
  • Foreign names (John, Claude)
  • Loanwords (pizza, email)
  • URLs and email addresses

Syllable Structure

Formula

Syllable = (C₁)(G)V(C₂)

C₁ = Initial consonant (phụ âm đầu)  - optional
G  = Glide (âm đệm: o, u)            - optional
V  = Vowel nucleus (nguyên âm)       - REQUIRED
C₂ = Final consonant (âm cuối)       - optional

Examples

Input     C₁    G    V     C₂    Valid
a         -     -    a     -     ✓
ban       b     -    a     n     ✓
hoa       h     o    a     -     ✓
qua       qu    -    a     -     ✓
giau      gi    -    au    -     ✓
nghieng   ngh   -    ie    ng    ✓
duoc      d     -    uo    c     ✓
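To make the decomposition concrete, here is a minimal string-based sketch that reproduces the rows of the table. It is not the engine's parser: the `Parts` type, `split_syllable`, and in particular the glide rule (`o` before `a`/`e`, `u` before `e`/`y`) are assumptions made for illustration, and the real engine works on key codes rather than characters.

```rust
/// Simplified view of one parsed syllable (hypothetical, string-based).
#[derive(Debug, PartialEq)]
struct Parts {
    initial: String,     // C₁
    glide: Option<char>, // G
    vowel: String,       // V
    final_c: String,     // C₂
}

fn is_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u' | 'y')
}

/// Split unmarked ASCII input into (C₁)(G)V(C₂).
fn split_syllable(input: &str) -> Parts {
    let chars: Vec<char> = input.chars().collect();
    let mut i = 0;

    // C₁: every leading consonant.
    while i < chars.len() && !is_vowel(chars[i]) {
        i += 1;
    }
    let mut initial: String = chars[..i].iter().collect();

    // "qu" and "gi" count as initials when another vowel follows.
    if i + 1 < chars.len() && is_vowel(chars[i + 1]) {
        let joins = (initial == "q" && chars[i] == 'u') || (initial == "g" && chars[i] == 'i');
        if joins {
            initial.push(chars[i]);
            i += 1;
        }
    }

    // G: assumed rule, `o` before a/e or `u` before e/y.
    let mut glide = None;
    if i + 1 < chars.len() {
        let (a, b) = (chars[i], chars[i + 1]);
        if (a == 'o' && matches!(b, 'a' | 'e')) || (a == 'u' && matches!(b, 'e' | 'y')) {
            glide = Some(a);
            i += 1;
        }
    }

    // V: the following run of vowels. C₂: whatever is left.
    let start = i;
    while i < chars.len() && is_vowel(chars[i]) {
        i += 1;
    }
    Parts {
        initial,
        glide,
        vowel: chars[start..i].iter().collect(),
        final_c: chars[i..].iter().collect(),
    }
}
```

With these assumptions, `split_syllable("nghieng")` yields initial `ngh`, no glide, vowel `ie`, and final `ng`, matching the table row.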

Data Constants

Initial Consonants (C₁)

core/src/data/constants.rs
VALID_INITIALS_1: [b, c, d, g, h, k, l, m, n, p, q, r, s, t, v, x]

Final Consonants (C₂)

VALID_FINALS_1: [
  c, k, m, n, p, t,  // Consonants
  i, y, o, u         // Semi-vowels
]
k is included for ethnic minority names (Đắk Lắk, Đắk Nông)

Spelling Rules

Consonant   Invalid before    Should use
c           e, i, y           k
k           a, o, u           c
q           (always with u)   qu
SPELLING_RULES: [
  ([C], [E, I, Y], "use k instead"),
  ([K], [A, O, U], "use c instead"),
]
Consonant   Invalid before   Should use
g           e                gh
gh          a, o, u          g
SPELLING_RULES: [
  ([G], [E], "use gh instead"),
  ([G,H], [A, O, U], "use g instead"),
]
Consonant   Invalid before   Should use
ng          e, i             ngh
ngh         a, o, u          ng
SPELLING_RULES: [
  ([N,G], [E, I], "use ngh instead"),
  ([N,G,H], [A, O, U], "use ng instead"),
]
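The three rule groups above can be merged into one table and queried with a single lookup. The sketch below uses strings and characters instead of the engine's key codes; the combined `SPELLING_RULES` layout and the `spelling_violation` helper are hypothetical names for illustration only.

```rust
/// (initial consonant, vowels it must not precede, hint)
/// String-based stand-in for the key-code table in constants.rs.
const SPELLING_RULES: &[(&str, &[char], &str)] = &[
    ("c", &['e', 'i', 'y'], "use k instead"),
    ("k", &['a', 'o', 'u'], "use c instead"),
    ("g", &['e'], "use gh instead"),
    ("gh", &['a', 'o', 'u'], "use g instead"),
    ("ng", &['e', 'i'], "use ngh instead"),
    ("ngh", &['a', 'o', 'u'], "use ng instead"),
];

/// Return the hint of the first rule that the pair violates, if any.
fn spelling_violation(initial: &str, first_vowel: char) -> Option<&'static str> {
    SPELLING_RULES
        .iter()
        .find(|(cons, vowels, _)| *cons == initial && vowels.contains(&first_vowel))
        .map(|(_, _, msg)| *msg)
}
```

For example, `spelling_violation("c", 'e')` reports the "use k instead" hint, while `spelling_violation("k", 'e')` returns `None` because `ke` is spelled correctly.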

Valid Vowel Patterns (Whitelist)

The engine uses an inclusion approach: only patterns in this whitelist are valid. Any combination not listed is automatically rejected.
core/src/data/constants.rs
VALID_DIPHTHONGS: [
  // Standard Vietnamese diphthongs
  [A, I], [A, O], [A, U], [A, Y],  // ai, ao, au, ay
  [E, I], [E, O], [E, U],          // ei (Telex), eo, êu
  [I, A], [I, E], [I, U],          // ia, iê, iu
  [O, A], [O, E], [O, I],          // oa, oe, oi/ôi/ơi
  [U, A], [U, E], [U, I], [U, O], [U, Y], [U, U],  // ua/ưa, uê, ui/ưi, uo/uô/ươ, uy, ưu
  [Y, E],                          // yê
  
  // Telex intermediate states (for delayed transformations)
  [A, A], [E, E], [O, O],          // aa→â, ee→ê, oo→ô toggle
]
Why Inclusion over Exclusion?
Aspect        Inclusion (whitelist)                 Exclusion (blacklist)
Coverage      Comprehensive - catches all invalid   Only catches listed patterns
Maintenance   Need to add Telex states              Easy to miss edge cases
Risk          False negative (need Telex states)    False positive (miss invalid)
Patterns that are absent from the whitelist, with English words that contain them (for reference):
  • ea → sea, beach, teacher, search
  • ou → you, our, house, about, would
  • yo → yoke, York, your, beyond
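A whitelist check of this kind reduces to a membership test. The sketch below copies the diphthong list from above but stores characters rather than key codes; `is_valid_diphthong` is a hypothetical helper, not part of the engine's API.

```rust
/// Character-based copy of the whitelist above (the engine stores key codes).
const VALID_DIPHTHONGS: &[[char; 2]] = &[
    ['a', 'i'], ['a', 'o'], ['a', 'u'], ['a', 'y'],
    ['e', 'i'], ['e', 'o'], ['e', 'u'],
    ['i', 'a'], ['i', 'e'], ['i', 'u'],
    ['o', 'a'], ['o', 'e'], ['o', 'i'],
    ['u', 'a'], ['u', 'e'], ['u', 'i'], ['u', 'o'], ['u', 'y'], ['u', 'u'],
    ['y', 'e'],
    // Telex intermediate states
    ['a', 'a'], ['e', 'e'], ['o', 'o'],
];

/// Anything not listed is rejected; no blacklist is consulted.
fn is_valid_diphthong(first: char, second: char) -> bool {
    VALID_DIPHTHONGS.contains(&[first, second])
}
```

Note that order matters: `uo` is accepted while `ou` is rejected, which is what keeps words such as "house" from being transformed.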

Validation Rules

Rule Execution Order

core/src/engine/validation.rs
const RULES: &[Rule] = &[
    rule_has_vowel,           // Rule 1
    rule_valid_initial,       // Rule 2
    rule_all_chars_parsed,    // Rule 3
    rule_spelling,            // Rule 4
    rule_valid_final,         // Rule 5
    rule_valid_vowel_pattern, // Rule 6
];
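
A rule table like this is typically consumed by a loop that stops at the first rule returning a result. The following is a minimal sketch of that pattern under simplified assumptions: rules take a `&str` instead of `BufferSnapshot` and `Syllable`, only two toy rules are defined, and the body of `validate` is a guess at the shape, not the engine's actual code.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum ValidationResult {
    Valid,
    InvalidInitial,
    NoVowel,
}

/// Simplified rule signature: `None` means "no objection".
type Rule = fn(&str) -> Option<ValidationResult>;

fn rule_has_vowel(s: &str) -> Option<ValidationResult> {
    if s.chars().any(|c| "aeiouy".contains(c)) {
        None
    } else {
        Some(ValidationResult::NoVowel)
    }
}

/// Toy version: only rejects a few letters that never start a syllable.
fn rule_valid_initial(s: &str) -> Option<ValidationResult> {
    match s.chars().next() {
        Some('f' | 'j' | 'w' | 'z') => Some(ValidationResult::InvalidInitial),
        _ => None,
    }
}

const RULES: &[Rule] = &[rule_has_vowel, rule_valid_initial];

/// Run the rules in order and return the first failure, else `Valid`.
fn validate(s: &str) -> ValidationResult {
    RULES
        .iter()
        .find_map(|rule| rule(s))
        .unwrap_or(ValidationResult::Valid)
}
```

Because `find_map` short-circuits, an input such as "jzz" is reported as `NoVowel` by the first rule and the initial-consonant rule never runs.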

Rule 1: Has Vowel

fn rule_has_vowel(_snap: &BufferSnapshot, syllable: &Syllable) -> Option<ValidationResult> {
    if syllable.is_empty() {
        return Some(ValidationResult::NoVowel);
    }
    None
}
Examples:
  • Valid: a, em, an
  • Invalid: bcd, bcdfgh (no vowel)

Rule 2: Valid Initial

fn rule_valid_initial(snap: &BufferSnapshot, syllable: &Syllable) -> Option<ValidationResult> {
    if syllable.initial.is_empty() {
        return None;
    }

    let initial: Vec<u16> = syllable.initial.iter().map(|&i| snap.keys[i]).collect();

    let is_valid = match initial.len() {
        1 => constants::VALID_INITIALS_1.contains(&initial[0]),
        2 => constants::VALID_INITIALS_2
            .iter()
            .any(|p| p[0] == initial[0] && p[1] == initial[1]),
        3 => initial[0] == keys::N && initial[1] == keys::G && initial[2] == keys::H,
        _ => false,
    };

    if !is_valid {
        return Some(ValidationResult::InvalidInitial);
    }
    None
}
Examples:
  • Valid: ba, ca, nghe, nghieng
  • Invalid: clau, john, bla, string (invalid initial)
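
The length-based dispatch in Rule 2 can be tried out on plain strings. In this sketch `VALID_INITIALS_1` is copied from the constants section, while the contents of `VALID_INITIALS_2` are an assumption (the standard two-letter Vietnamese initials), since the source does not list them; `is_valid_initial` is a hypothetical helper.

```rust
const VALID_INITIALS_1: &[char] = &[
    'b', 'c', 'd', 'g', 'h', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'x',
];

/// Assumed contents: the source references this table but does not show it.
const VALID_INITIALS_2: &[&str] = &[
    "ch", "gh", "gi", "kh", "ng", "nh", "ph", "qu", "th", "tr",
];

/// Mirror of the match in rule_valid_initial, keyed on cluster length.
fn is_valid_initial(initial: &str) -> bool {
    let mut chars = initial.chars();
    match initial.chars().count() {
        0 => true, // a syllable may start with its vowel
        1 => VALID_INITIALS_1.contains(&chars.next().unwrap()),
        2 => VALID_INITIALS_2.contains(&initial),
        3 => initial == "ngh", // the only three-letter initial
        _ => false,
    }
}
```

Under these assumptions the clusters from the invalid examples (`cl`, `j`, `bl`, `str`) are all rejected, while `ngh` and `tr` pass.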

Rule 3: All Chars Parsed

fn rule_all_chars_parsed(snap: &BufferSnapshot, syllable: &Syllable) -> Option<ValidationResult> {
    let parsed = syllable.initial.len()
        + syllable.glide.map_or(0, |_| 1)
        + syllable.vowel.len()
        + syllable.final_c.len();

    if parsed != snap.keys.len() {
        return Some(ValidationResult::InvalidFinal);
    }
    None
}
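Purpose: ensures that no characters are left unparsed. Leftover characters indicate an invalid structure and are reported as InvalidFinal, since there is no dedicated result variant for this case.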

Rule 4: Spelling Rules

fn rule_spelling(snap: &BufferSnapshot, syllable: &Syllable) -> Option<ValidationResult> {
    if syllable.initial.is_empty() || syllable.vowel.is_empty() {
        return None;
    }

    let initial: Vec<u16> = syllable.initial.iter().map(|&i| snap.keys[i]).collect();
    let first_vowel = snap.keys[syllable.glide.unwrap_or(syllable.vowel[0])];

    for &(consonant, vowels, _msg) in constants::SPELLING_RULES {
        if initial == consonant && vowels.contains(&first_vowel) {
            return Some(ValidationResult::InvalidSpelling);
        }
    }

    None
}
Examples:
  • Valid: ca, ke, ghe, nghi
  • Invalid: ci, ce, cy, ka, ko, ku, ge, nge (spelling violations)

Rule 5: Valid Final

fn rule_valid_final(snap: &BufferSnapshot, syllable: &Syllable) -> Option<ValidationResult> {
    if syllable.final_c.is_empty() {
        return None;
    }

    let final_c: Vec<u16> = syllable.final_c.iter().map(|&i| snap.keys[i]).collect();

    let is_valid = match final_c.len() {
        1 => constants::VALID_FINALS_1.contains(&final_c[0]),
        2 => constants::VALID_FINALS_2
            .iter()
            .any(|p| p[0] == final_c[0] && p[1] == final_c[1]),
        _ => false,
    };

    if !is_valid {
        return Some(ValidationResult::InvalidFinal);
    }
    None
}
Examples:
  • Valid: an, em, ong, anh, ach
  • Invalid: gues (s is an invalid final)

Rule 6: Valid Vowel Pattern (Whitelist)

fn rule_valid_vowel_pattern(
    snap: &BufferSnapshot,
    syllable: &Syllable,
) -> Option<ValidationResult> {
    if syllable.vowel.len() < 2 {
        return None; // Single vowel always valid
    }

    let vowel_keys: Vec<u16> = syllable.vowel.iter().map(|&i| snap.keys[i]).collect();

    match vowel_keys.len() {
        2 => {
            let pair = [vowel_keys[0], vowel_keys[1]];
            if !constants::VALID_DIPHTHONGS.contains(&pair) {
                return Some(ValidationResult::InvalidVowelPattern);
            }
        }
        3 => {
            let triple = [vowel_keys[0], vowel_keys[1], vowel_keys[2]];
            if !constants::VALID_TRIPHTHONGS.contains(&triple) {
                return Some(ValidationResult::InvalidVowelPattern);
            }
        }
        _ => {
            return Some(ValidationResult::InvalidVowelPattern);
        }
    }

    None
}
Examples:
  • ai, ao, eo, ia, iu, oa, ui, uy (valid diphthongs)
  • iêu, oai, uôi, ươi (valid triphthongs)
  • ou, yo, ea, ae (not in whitelist)

Foreign Word Detection

Beyond validation, the engine detects foreign word patterns to skip transformation:
core/src/engine/validation.rs
pub fn is_foreign_word_pattern(
    buffer_keys: &[u16],
    buffer_tones: &[u8],
    modifier_key: u16,
) -> bool
Detection Patterns:

Invalid Vowel Patterns

  • ou (you, our, house)
  • yo (yoke, York)
  • ea (search, beach)

Consonant Clusters

  • T+R (metric, matrix)
  • P+R (spectrum)
  • C+R (across)
Special case: the check is skipped when the buffer already contains horn transforms (ư, ơ, ươ), because these indicate intentional Vietnamese input (e.g., “rượu”).
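
The detection patterns and the horn exception can be sketched as a scan over adjacent letter pairs. This is an illustration only: `looks_foreign` is a hypothetical function with a simplified signature (a string plus a `has_horn` flag instead of key, tone, and modifier arrays), and the choice to ignore a consonant cluster at the very start of the word (so that a Vietnamese initial such as `tr` is not flagged) is an assumption, not documented behaviour.

```rust
/// Rough stand-in for is_foreign_word_pattern, working on plain text.
fn looks_foreign(word: &str, has_horn: bool) -> bool {
    // Horn transforms signal intentional Vietnamese input: skip the check.
    if has_horn {
        return false;
    }

    let bad_vowels = [['o', 'u'], ['y', 'o'], ['e', 'a']];
    let bad_clusters = [['t', 'r'], ['p', 'r'], ['c', 'r']];

    let chars: Vec<char> = word.chars().collect();
    chars.windows(2).enumerate().any(|(i, w)| {
        let pair = [w[0], w[1]];
        // Assumption: clusters only count when they are not word-initial.
        bad_vowels.contains(&pair) || (i > 0 && bad_clusters.contains(&pair))
    })
}
```

With this sketch "house", "metric", and "across" are flagged, "truong" is not, and "ruou" is flagged only when `has_horn` is false.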

Validation API

core/src/engine/validation.rs
/// Validate buffer as Vietnamese syllable
pub fn validate(snap: &BufferSnapshot) -> ValidationResult

/// Quick check (keys only - legacy)
pub fn is_valid(buffer_keys: &[u16]) -> bool

/// Full validation with modifier info
pub fn is_valid_with_tones(keys: &[u16], tones: &[u8]) -> bool

/// Pre-transformation validation (excludes vowel pattern check)
pub fn is_valid_for_transform(buffer_keys: &[u16]) -> bool

/// Foreign word pattern detection
pub fn is_foreign_word_pattern(
    buffer_keys: &[u16],
    buffer_tones: &[u8],
    modifier_key: u16,
) -> bool

pub enum ValidationResult {
    Valid,
    InvalidInitial,
    InvalidFinal,
    InvalidSpelling,
    InvalidVowelPattern,
    NoVowel,
}

Integration with Engine

on_key(key)

├─ [is_modifier(key)?]
│  │
│  ├─ ★ VALIDATION: Before transform
│  │   └─ is_valid(buffer)?
│  │       ├─ NO  → return NONE (don't transform)
│  │       └─ YES → continue transform
│  │
│  └─ Apply transformation

└─ [is_letter(key)?] → push to buffer
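
The flow in the diagram can be approximated in a few lines. Everything here is a simplified assumption: `Engine`, `Action`, and `toy_is_valid` are hypothetical, only the five Telex tone keys are treated as modifiers, and a rejected modifier is pushed to the buffer as an ordinary letter so that the buffer keeps matching what appears on screen.

```rust
#[derive(Debug, PartialEq)]
enum Action {
    None,      // key passes through untouched
    Transform, // engine would rewrite the buffer
}

/// Telex tone keys only; the real engine handles more modifiers.
fn is_modifier(key: char) -> bool {
    matches!(key, 's' | 'f' | 'r' | 'x' | 'j')
}

/// Stand-in for validation::is_valid: accepts a tiny fixed word list.
fn toy_is_valid(buffer: &str) -> bool {
    ["duoc", "ba", "an"].contains(&buffer)
}

struct Engine {
    buffer: String,
}

impl Engine {
    fn on_key(&mut self, key: char) -> Action {
        // Validate BEFORE transforming: invalid buffers are left alone.
        if is_modifier(key) && toy_is_valid(&self.buffer) {
            return Action::Transform;
        }
        // Otherwise the key is treated as a plain letter.
        if key.is_ascii_alphabetic() {
            self.buffer.push(key);
        }
        Action::None
    }
}
```

Typing `d`, `u`, `o`, `c`, `j` ends in `Action::Transform`, whereas typing "claus" followed by `s` leaves the buffer as "clauss" and returns `Action::None` each time.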

Test Coverage

const VALID: &[&str] = &[
    "ba", "ca", "an", "em", "gi", "gia", "giau",
    "ke", "ki", "ky", "nghe", "nghi", "nghieng",
    "truong", "nguoi", "duoc",
];

Performance Considerations

Fast Failure

Rules execute sequentially, and the first failure returns immediately without running the remaining rules.

Constant Lookup

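All validation data lives in small const arrays. Each lookup is a linear scan over a fixed, small number of entries, so its cost is bounded by a constant.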

No Allocations

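Syllable components are stored as indices into the key buffer rather than as copied strings. The rule functions shown above do collect short Vec<u16> temporaries, so they are not strictly allocation-free as written.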

Zero-Copy

BufferSnapshot holds references to the original buffer instead of copying it.
