Training Data Governance
Training data quality directly determines model quality. Without governance, data provenance is lost, labeling quality degrades, bias goes undetected, and regulatory compliance becomes impossible. This page defines the practices organizations SHOULD adopt to govern training data throughout the AI product lifecycle.
Data Sourcing Requirements
All training data sources SHOULD have documented provenance before use in model development.
- Verify licensing and usage rights before ingestion.
- Categorize each source by type: first-party customer data, public datasets, purchased/licensed data, or synthetic data.
- Maintain a data source registry with the following fields:
| Field | Description |
|---|---|
| Source name | Human-readable identifier for the data source |
| License type | Open, commercial, internal, synthetic |
| Acquisition date | Date the data was obtained or generated |
| Permitted uses | Training, evaluation, fine-tuning, benchmarking |
| Expiration | License expiry or review date |
| Data subject consent status | Consent basis, jurisdiction, limitations |
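A registry entry matching the table above can be sketched as a small record type. This is an illustrative sketch, not a prescribed schema: the class name, field names, and the `is_use_permitted` helper are assumptions layered on the table's fields.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DataSourceRecord:
    """One row in the data source registry (fields mirror the registry table)."""
    source_name: str
    license_type: str            # "open" | "commercial" | "internal" | "synthetic"
    acquisition_date: date
    permitted_uses: tuple        # e.g. ("training", "evaluation")
    expiration: Optional[date]   # license expiry or review date, if any
    consent_status: str          # consent basis, jurisdiction, limitations

    def is_use_permitted(self, use: str, today: date) -> bool:
        """A use is allowed only if it is listed and the license has not expired."""
        if self.expiration is not None and today > self.expiration:
            return False
        return use in self.permitted_uses
```

Making the record frozen (immutable) means any correction to a registry entry produces a new record, which keeps the registry audit-friendly.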
When using customer data for training, consent and legal basis requirements per PRD-STD-014 apply. Data sourcing decisions SHOULD be reviewed by legal or compliance stakeholders for Tier 2 and Tier 3 use cases.
Labeling Quality Standards
Labeling quality SHOULD be systematically measured and enforced.
- Define annotator qualification requirements per task complexity (e.g., domain expertise for medical labeling, language fluency for multilingual tasks).
- Establish inter-annotator agreement (IAA) thresholds:
| Task Type | Recommended IAA Threshold (Cohen's Kappa) |
|---|---|
| Factual / objective tasks | >= 0.85 |
| Subjective / nuanced tasks | >= 0.70 |
| Safety-critical tasks | >= 0.90 |
- Implement label audit sampling: minimum 5% of labeled data SHOULD be reviewed by senior annotators.
- Document labeling guidelines with concrete examples for each label category, including edge cases.
- Track labeling quality metrics: IAA score per annotator pair, error rate by category, annotator consistency.
- Address bias in labeling through diverse annotator pools and explicit bias review protocols.
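The IAA thresholds above are stated in terms of Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal pure-Python computation, for illustration only (production pipelines would typically use a library implementation):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (observed agreement - expected agreement) / (1 - expected agreement)
    """
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of agreement given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators use a single label
    return (observed - expected) / (1 - expected)
```

An annotator pair can then be compared against the threshold for its task type, e.g. `cohens_kappa(a, b) >= 0.85` for factual tasks.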
Data Versioning & Lineage
Every training dataset SHOULD have a unique version identifier. Changes to training data SHOULD produce a new dataset version.
- Maintain data lineage records linking: dataset version → source data → transformations applied → resulting model version.
- Use dataset versioning tools (DVC, LakeFS, Delta Lake, or equivalent).
- Store dataset metadata alongside each version: row count, feature statistics, creation date, creator, purpose.
- Dataset versions used for production model training SHOULD be immutable — no retroactive modification without creating a new version.
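One way to make version identifiers immutable by construction is to derive them from dataset content, so any change to the data necessarily yields a new identifier. The sketch below assumes records are JSON-serializable dicts; the `ds-` prefix and metadata fields are illustrative, not a prescribed format.

```python
import hashlib
import json
from datetime import datetime, timezone


def dataset_version_id(records):
    """Deterministic version identifier derived from canonicalized content."""
    h = hashlib.sha256()
    for rec in records:
        # sort_keys makes the serialization order-independent per record.
        h.update(json.dumps(rec, sort_keys=True).encode("utf-8"))
    return "ds-" + h.hexdigest()[:12]


def version_metadata(records, creator, purpose):
    """Metadata stored alongside each dataset version (fields from this section)."""
    return {
        "version_id": dataset_version_id(records),
        "row_count": len(records),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "creator": creator,
        "purpose": purpose,
    }
```

Tools such as DVC or LakeFS provide this content-addressing natively; the point of the sketch is the lineage property, not the mechanism.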
Data Quality Assurance
Automated data quality checks SHOULD be integrated into the data pipeline before training.
Define and enforce quality checks for:
- Completeness — Missing value rates per feature, with maximum acceptable thresholds.
- Consistency — Conflicting labels for identical or near-identical inputs.
- Distribution analysis — Class imbalance detection with minimum representation thresholds.
- Duplication detection — Exact and near-duplicate identification.
- Outlier identification — Statistical outliers flagged for human review.
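Two of the checks above can be sketched in a few lines each. The threshold defaults here (5% missing, 1% duplicates) are assumptions for illustration, not values this standard mandates; rows are assumed to be dicts of hashable scalars.

```python
def completeness_check(rows, feature, max_missing_rate=0.05):
    """Missing-value rate for one feature against an assumed threshold."""
    missing = sum(1 for r in rows if r.get(feature) in (None, ""))
    rate = missing / len(rows)
    return {"missing_rate": rate, "passed": rate <= max_missing_rate}


def duplicate_check(rows, max_duplicate_rate=0.01):
    """Exact-duplicate rate based on canonicalized row content."""
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))  # order-independent row fingerprint
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    rate = dupes / len(rows)
    return {"duplicate_rate": rate, "passed": rate <= max_duplicate_rate}
```

Near-duplicate detection (e.g. MinHash or embedding similarity) needs more machinery than fits here; the exact-match check is the floor, not the ceiling.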
Implement automated data quality gates:

```
Data Quality Gate: PASS | WARN | FAIL
Dataset:       <name and version>
Completeness:  <pass/fail + missing rate>
Consistency:   <pass/fail + conflict count>
Class Balance: <pass/fail + imbalance ratio>
Duplicates:    <pass/fail + duplicate rate>
Outliers:      <count flagged for review>
Gate Decision: <proceed / review required / blocked>
```
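The gate decision can be reduced to a policy over per-check statuses. The mapping below (any failure blocks, any warning requires review) is an assumed policy that teams would tune, not a rule this standard fixes.

```python
def gate_decision(check_results):
    """Aggregate per-check statuses into the gate outcomes from the template.

    check_results maps check name -> "pass" | "warn" | "fail".
    """
    statuses = set(check_results.values())
    if "fail" in statuses:
        return "blocked"
    if "warn" in statuses:
        return "review required"
    return "proceed"
```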
When dataset composition changes significantly, run statistical tests (e.g., Kolmogorov-Smirnov, chi-squared) to detect distribution shifts relative to the previous version.
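For a numeric feature, the two-sample Kolmogorov-Smirnov statistic is the maximum gap between the empirical CDFs of the old and new versions. A pure-Python sketch for illustration (in practice one would use a library such as SciPy, which also returns a p-value):

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over observed values."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of observations <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in sorted(set(a) | set(b)))
```

A statistic near 0 means the feature's distribution is stable across versions; a value near 1 means the versions barely overlap and the shift SHOULD be investigated before training.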
Data Lifecycle Management
Training datasets SHOULD have defined retention periods aligned with model lifecycle and regulatory requirements.
| Dataset Type | Retention Guidance | Deletion Trigger |
|---|---|---|
| Active training set | Retain while model is in production | Model archived + retention expired |
| Evaluation benchmark | Retain indefinitely | Superseded by new benchmark |
| Fine-tuning dataset | Retain per regulatory + consent requirements | Consent withdrawal or regulatory deadline |
- Archive datasets when models are retired to support audit and compliance inquiries.
- Implement deletion workflows for training data subject to erasure requests per PRD-STD-014.
- Track which models were trained on data subject to erasure and assess retraining necessity.
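Tracking erasure exposure reduces to a join over lineage records: given which dataset versions contain an erased subject's data, find every model trained on any of them. The data shapes below are assumptions for illustration; real lineage would come from the versioning tooling.

```python
def models_affected_by_erasure(lineage, erased_dataset_versions):
    """Return model IDs trained on any dataset version containing erased data.

    lineage: {model_id: [dataset_version_id, ...]}  (assumed shape)
    erased_dataset_versions: set of dataset version IDs holding the subject's data.
    """
    return sorted(
        model_id
        for model_id, versions in lineage.items()
        if any(v in erased_dataset_versions for v in versions)
    )
```

The resulting model list is the input to the retraining-necessity assessment, not an automatic retraining trigger.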