PRD-STD-015: Multilingual AI Quality & Safety
| Standard ID | Version | Status | Compliance Level | Effective Date | Last Reviewed |
|---|---|---|---|---|---|
| PRD-STD-015 | 1.0 | Active | Level 2 (Managed) | 2026-02-22 | 2026-02-22 |
This page is the normative source of requirements for this control area. Use it to define policy, evidence expectations, and audit/compliance criteria.
For implementation and rollout support:
- Execution plan: Apply-Ready Rollout Kit
- Adoption sequencing: Production Rollout Paths
- Hands-on tutorials: Production Tutorials & Starter Guides
- Runnable repos / apply paths: Reference Implementations
Use the Compliance Level metadata on this page to sequence adoption with other PRD-STDs.
1. Purpose
This standard defines mandatory quality and safety controls for AI products that operate across multiple languages, dialects, or scripts. AI models exhibit significant performance variance across languages — safety filters calibrated for English often fail for Arabic, code-switching inputs produce unpredictable outputs, and bias manifests differently across linguistic and cultural contexts.
Without explicit multilingual controls, organizations risk deploying AI features that are safe in one language but harmful, inaccurate, or unusable in others.
2. Scope
This standard applies to:
- Any AI product feature that supports more than one language, processes multilingual input, or serves users across linguistic communities
- Conversational AI, content generation, classification, moderation, search, and recommendation features operating in multilingual contexts
- Single-language products serving dialect-diverse populations (e.g., Arabic dialects: MSA, Egyptian, Gulf, Levantine, Maghrebi)
This standard does not replace PRD-STD-010 or PRD-STD-001. It adds language-specific controls required for multilingual AI product operation.
3. Definitions
| Term | Definition |
|---|---|
| Supported Language | A language for which the AI product claims functional coverage, including quality, safety, and performance guarantees |
| Language Coverage Matrix | A documented mapping of supported languages to evaluated quality metrics, safety test results, and known limitations per language |
| Code-Switching | The practice of alternating between two or more languages within a single conversation, sentence, or input — common in multilingual user populations |
| Dialect Variant | A regional or social variation of a language with distinct vocabulary, grammar, or pragmatic norms that may affect AI model performance |
| Cross-Language Parity | The degree to which AI product quality, safety, and fairness metrics are consistent across supported languages |
| Multilingual Safety Evaluation | Structured testing of harmful output, policy violations, and abuse patterns across all supported languages |
| Script Normalization | The process of standardizing text encoding, directionality (LTR/RTL), and character representations to ensure consistent AI processing |
4. Requirements
4.1 Multilingual Evaluation Standards
REQ-015-01: Every AI product MUST maintain a Language Coverage Matrix documenting all supported languages with evaluated quality benchmarks, safety test status, and known limitations.
REQ-015-02: Quality evaluation MUST be performed independently for each supported language. Aggregate cross-language metrics MUST NOT be used as the sole indicator of per-language quality.
REQ-015-03: Minimum evaluation coverage MUST include task accuracy, response relevance, fluency, and factual consistency per supported language.
REQ-015-04: Organizations SHOULD maintain language-specific evaluation datasets curated with native-speaker review, refreshed at least annually.
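The per-language gating in REQ-015-02 can be sketched in a few lines. This is an illustrative check, not a mandated implementation; the metric names and the 75-point threshold are assumptions for the example, not values set by this standard.

```python
# Sketch of REQ-015-02: evaluate each language independently and show why
# an aggregate score alone can mask a failing language. The threshold is
# a hypothetical value, not one mandated by this standard.

QUALITY_THRESHOLD = 75  # illustrative minimum per-language quality score


def evaluate_coverage(per_language_scores: dict[str, float]) -> dict:
    """Report per-language pass/fail alongside the aggregate."""
    aggregate = sum(per_language_scores.values()) / len(per_language_scores)
    failing = {lang: score for lang, score in per_language_scores.items()
               if score < QUALITY_THRESHOLD}
    return {
        "aggregate": round(aggregate, 1),
        "aggregate_passes": aggregate >= QUALITY_THRESHOLD,
        "failing_languages": failing,
        "all_languages_pass": not failing,
    }


# Scores taken from the example matrix in Section 5: the aggregate (82.3)
# clears the threshold, but Urdu does not -- the exact situation
# REQ-015-02 guards against.
report = evaluate_coverage({"en": 92, "ar-MSA": 87, "ur": 68})
```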
4.2 Cross-Language Safety Testing
REQ-015-05: Safety evaluation MUST be executed independently for every supported language before release. A feature MUST NOT launch in a language that has not passed safety evaluation.
REQ-015-06: Adversarial abuse testing MUST cover language-specific attack patterns, including culturally specific harmful content, script-based obfuscation, and transliteration-based policy evasion.
REQ-015-07: Cross-lingual transfer attacks — where harmful prompts in one language exploit model behavior in another — MUST be included in Tier 2 and Tier 3 safety evaluation.
REQ-015-08: Organizations SHOULD maintain per-language harmful content taxonomies that account for culturally specific sensitivities, taboo topics, and regulatory differences.
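The launch gate in REQ-015-05 reduces to a simple invariant: no supported language may ship without a passed safety evaluation. A minimal sketch, assuming a status map produced by an upstream evaluation harness (the status vocabulary is an assumption for illustration):

```python
# Sketch of the REQ-015-05 launch gate: a feature MUST NOT launch in any
# language whose safety evaluation has not passed. The status strings
# ("passed", "failed", "not_evaluated") are illustrative assumptions.

def launch_allowed(safety_results: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (allowed, blocking_languages) given per-language statuses."""
    blocked = [lang for lang, status in safety_results.items()
               if status != "passed"]
    return (not blocked, blocked)


# One unevaluated language blocks the whole multilingual launch.
ok, blocked = launch_allowed({
    "en": "passed",
    "ar-MSA": "passed",
    "ur": "not_evaluated",
})
```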
4.3 Dialect & Code-Switching Handling
REQ-015-09: When an AI product serves dialect-diverse populations, evaluation MUST include the major dialect variants relevant to the user population with documented coverage and known limitations.
REQ-015-10: AI features MUST handle code-switching input without producing errors, truncated responses, or language confusion. Graceful degradation to a dominant language is acceptable if documented.
REQ-015-11: Organizations SHOULD implement script normalization for languages with multiple encoding standards (e.g., Arabic Unicode normalization forms, CJK unified ideographs) to ensure consistent AI processing.
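The script normalization in REQ-015-11 is typically a Unicode normalization pass applied before any model call. A minimal sketch using the standard library: NFKC folds Arabic presentation forms and fullwidth Latin into their canonical letters, so visually identical inputs reach the model in one form. (Whether NFC or NFKC is appropriate depends on the product; NFKC is shown here as one common choice, not a requirement of this standard.)

```python
# Sketch of REQ-015-11: canonicalize text before AI processing so
# visually identical inputs map to a single encoding. NFKC is one common
# choice; pick the form deliberately per language and product.
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization to the input."""
    return unicodedata.normalize("NFKC", text)


# U+FEE1 (ARABIC LETTER MEEM ISOLATED FORM) folds to U+0645 (basic MEEM),
# and fullwidth "A" (U+FF21) folds to ASCII "A".
canonical_meem = normalize_text("\uFEE1")
canonical_a = normalize_text("\uFF21")
```

Note that NFKC does not touch directionality marks (e.g. U+200F) and deliberately preserves joiners such as ZWNJ, which carry meaning in Persian and Urdu; stripping those requires a separate, language-aware decision.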
4.4 Multilingual Bias & Fairness Assessment
REQ-015-12: Fairness evaluation MUST be conducted per supported language, not only on aggregated cross-language results.
REQ-015-13: AI products MUST test for and document cross-language quality parity gaps where performance in one supported language is materially worse than others.
REQ-015-14: When significant cross-language parity gaps are detected, the organization MUST either remediate before launch, restrict the affected language to a lower capability tier with user disclosure, or document the gap as a known limitation with a remediation timeline.
REQ-015-15: Organizations SHOULD evaluate demographic fairness within each supported language (e.g., gender bias in Arabic vs. English may manifest differently due to grammatical gender systems).
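The parity check in REQ-015-13/14 can be made concrete. Section 5 tracks the gap as a max/min quality ratio; the sketch below expresses the same quantity as the relative shortfall of the weakest language, (max − min) / max, so the 20% waiver floor from Section 6 applies directly. The scores are the illustrative values from the example matrix, and this particular gap formula is an assumption of the example, not a definition fixed by the standard.

```python
# Sketch of the REQ-015-13/14 parity check. Gap = relative shortfall of
# the weakest language against the strongest; the 0.20 threshold mirrors
# the waiver floor in Section 6. Formula choice is illustrative.

def parity_gap(scores: dict[str, float]) -> float:
    """Return (max - min) / max, e.g. 0.26 for a 26% gap."""
    best, worst = max(scores.values()), min(scores.values())
    return (best - worst) / best


# Illustrative scores from the Section 5 example matrix:
# (92 - 68) / 92 is roughly 0.26, which exceeds the 20% floor, so
# REQ-015-14 requires remediation, a restricted capability tier with
# disclosure, or a documented limitation with a remediation timeline.
gap = parity_gap({"en": 92, "ar-EG": 79, "ur": 68})
exceeds_waiver_floor = gap > 0.20
```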
4.5 Language-Specific Prompt Engineering
REQ-015-16: System prompts and safety instructions MUST be validated in each supported language. Direct translation of English-language prompts without validation is prohibited.
REQ-015-17: Prompt libraries MUST include language-specific variants where prompt effectiveness varies by language (e.g., instruction-following patterns, formatting conventions, politeness norms).
REQ-015-18: Organizations SHOULD implement language detection and routing to direct inputs to language-optimized model configurations or prompt variants.
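The detection-and-routing pattern in REQ-015-18 might look like the sketch below. The script-range heuristic is a deliberately naive stand-in for a production language identifier (e.g. a fastText or CLD3 model), it only distinguishes Arabic from Latin script, and the prompt variant names are hypothetical.

```python
# Sketch of REQ-015-18: detect the dominant script of an input and route
# it to a language-specific prompt variant. The two-way Arabic/Latin
# heuristic and the variant names are illustrative assumptions; a real
# system would use a trained language identifier.
import unicodedata

PROMPT_VARIANTS = {
    "Arabic": "system_prompt_ar",  # hypothetical variant identifiers
    "Latin": "system_prompt_en",
}
DEFAULT_VARIANT = "system_prompt_en"


def dominant_script(text: str) -> str:
    """Count alphabetic characters per script and return the majority."""
    counts: dict[str, int] = {}
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            script = "Arabic" if name.startswith("ARABIC") else "Latin"
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "Latin"


def route_prompt(text: str) -> str:
    """Select the prompt variant for the input's dominant script."""
    return PROMPT_VARIANTS.get(dominant_script(text), DEFAULT_VARIANT)
```

For code-switched input this majority vote degrades gracefully to the dominant script, in line with the documented-degradation allowance in REQ-015-10.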
5. Implementation Guidance
Minimum Multilingual Governance Pack
Teams SHOULD establish:
- Language Coverage Matrix template
- Per-language safety evaluation protocol
- Dialect coverage assessment for primary user populations
- Cross-language parity dashboard
- Language-specific prompt validation checklist
- Multilingual adversarial test suite
Example Language Coverage Matrix
| Language | Quality Score | Safety Status | Dialect Coverage | Known Limitations | Last Evaluated |
|---|---|---|---|---|---|
| English (en) | 92/100 | Passed | N/A | None | 2026-02-15 |
| Arabic (ar-MSA) | 87/100 | Passed | MSA baseline | Reduced accuracy for technical domains | 2026-02-15 |
| Arabic (ar-EG) | 79/100 | Passed | Egyptian dialect | Code-switching with English degrades quality by ~8% | 2026-02-15 |
| Arabic (ar-SA) | 81/100 | Passed | Gulf dialect | Limited Najdi sub-dialect coverage | 2026-02-15 |
| French (fr) | 85/100 | Passed | Metropolitan French | Quebec French not evaluated | 2026-02-15 |
| Urdu (ur) | 68/100 | Conditional | Standard Urdu | Script rendering issues; safety tests incomplete for 2 categories | 2026-01-30 |
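The matrix above is easiest to enforce when kept in machine-readable form. A minimal sketch with two rows from the example and a freshness check supporting the annual refresh expectation in REQ-015-04; the field names and record layout are illustrative assumptions, not a schema mandated by this standard.

```python
# Machine-readable sketch of the Language Coverage Matrix above, with a
# staleness check supporting the annual refresh in REQ-015-04. Field
# names are illustrative, not a mandated schema.
from datetime import date

MATRIX = [
    {"lang": "en", "quality": 92, "safety": "Passed",
     "evaluated": date(2026, 2, 15)},
    {"lang": "ur", "quality": 68, "safety": "Conditional",
     "evaluated": date(2026, 1, 30)},
]


def stale_entries(matrix: list[dict], today: date,
                  max_age_days: int = 365) -> list[str]:
    """Return languages whose last evaluation is older than max_age_days."""
    return [row["lang"] for row in matrix
            if (today - row["evaluated"]).days > max_age_days]
```

Keeping the matrix as data also lets the parity dashboard and launch gates from Section 4 read directly from a single source of truth.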
Minimum Operational Metrics
Track at least:
- per-language quality score trend
- cross-language parity gap (max/min quality ratio)
- per-language safety evaluation pass rate
- code-switching error rate
- dialect coverage percentage for primary markets
- language-specific user satisfaction scores
6. Exceptions & Waiver Process
Waivers are limited to non-safety procedural controls and MUST include:
- business justification
- compensating controls
- named approver
- expiration date (maximum 30 days)
No waivers are permitted for:
- launching in a language without safety evaluation
- ignoring cross-language parity gaps exceeding 20% without a documented remediation plan
- deploying untranslated English safety instructions in non-English language surfaces
7. Related Standards
- PRD-STD-001: Prompt Engineering Standards
- PRD-STD-010: AI Product Safety & Trust Controls
- PRD-STD-011: Model & Data Governance
- KSA Regulatory Profile
- Fairness & Bias Assessment
- AI Product Lifecycle
8. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-22 | AEEF Standards Committee | Initial release |