
Training Data Governance

Training data quality directly determines model quality. Without governance, data provenance is lost, labeling quality degrades, bias goes undetected, and regulatory compliance becomes impossible. This page defines the practices organizations SHOULD adopt to govern training data throughout the AI product lifecycle.

Data Sourcing Requirements

All training data sources SHOULD have documented provenance before use in model development.

  • Verify licensing and usage rights before ingestion.
  • Categorize each source by type: first-party customer data, public datasets, purchased/licensed data, or synthetic data.
  • Maintain a data source registry with the following fields:
| Field | Description |
| --- | --- |
| Source name | Human-readable identifier for the data source |
| License type | Open, commercial, internal, synthetic |
| Acquisition date | Date the data was obtained or generated |
| Permitted uses | Training, evaluation, fine-tuning, benchmarking |
| Expiration | License expiry or review date |
| Data subject consent status | Consent basis, jurisdiction, limitations |

When using customer data for training, consent and legal basis requirements per PRD-STD-014 apply. Data sourcing decisions SHOULD be reviewed by legal or compliance stakeholders for Tier 2 and Tier 3 use cases.

Labeling Quality Standards

Labeling quality SHOULD be systematically measured and enforced.

  • Define annotator qualification requirements per task complexity (e.g., domain expertise for medical labeling, language fluency for multilingual tasks).
  • Establish inter-annotator agreement (IAA) thresholds:
| Task Type | Recommended IAA Threshold (Cohen's Kappa) |
| --- | --- |
| Factual / objective tasks | >= 0.85 |
| Subjective / nuanced tasks | >= 0.70 |
| Safety-critical tasks | >= 0.90 |
  • Implement label audit sampling: minimum 5% of labeled data SHOULD be reviewed by senior annotators.
  • Document labeling guidelines with concrete examples for each label category, including edge cases.
  • Track labeling quality metrics: IAA score per annotator pair, error rate by category, annotator consistency.
  • Address bias in labeling through diverse annotator pools and explicit bias review protocols.
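Cohen's kappa, used for the thresholds above, can be computed directly from two annotators' labels over the same items. The sketch below is a plain-Python implementation; the example labels are illustrative:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from marginal label rates.
    """
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    if p_e == 1.0:  # degenerate case: a single label everywhere
        return 1.0
    return (p_o - p_e) / (1 - p_e)

a = ["spam", "spam", "ham", "ham", "spam", "ham"]
b = ["spam", "ham", "ham", "ham", "spam", "ham"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

Note that a raw agreement rate of 5/6 here yields a kappa of only 0.667, below the 0.70 subjective-task threshold — kappa discounts agreement that chance alone would produce.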

Data Versioning & Lineage

Every training dataset SHOULD have a unique version identifier. Changes to training data SHOULD produce a new dataset version.

  • Maintain data lineage records linking: dataset version → source data → transformations applied → resulting model version.
  • Use dataset versioning tools (DVC, LakeFS, Delta Lake, or equivalent).
  • Store dataset metadata alongside each version: row count, feature statistics, creation date, creator, purpose.
  • Dataset versions used for production model training SHOULD be immutable — no retroactive modification without creating a new version.
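One common way to satisfy the immutability requirement is to derive the version identifier from a content hash, so any change to the data necessarily produces a new identifier. A minimal sketch, assuming rows are JSON-serializable dicts (the `ds-` prefix and 12-character truncation are illustrative choices):

```python
import hashlib
import json

def dataset_version_id(rows: list[dict]) -> str:
    """Derive a deterministic version id from a canonical serialization.

    Sorting keys and fixing separators make the serialization canonical,
    so identical content always hashes to the same id.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"ds-{digest[:12]}"

v1 = dataset_version_id([{"text": "hello", "label": "ham"}])
v2 = dataset_version_id([{"text": "hello", "label": "spam"}])
print(v1 != v2)  # a single changed label produces a new version id
```

Content-addressed identifiers of this kind are the same principle the listed tools (DVC, LakeFS, Delta Lake) apply at scale.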

Data Quality Assurance

Automated data quality checks SHOULD be integrated into the data pipeline before training.

Define and enforce quality checks for:

  • Completeness — Missing value rates per feature, with maximum acceptable thresholds.
  • Consistency — Conflicting labels for identical or near-identical inputs.
  • Distribution analysis — Class imbalance detection with minimum representation thresholds.
  • Duplication detection — Exact and near-duplicate identification.
  • Outlier identification — Statistical outliers flagged for human review.

Implement automated data quality gates:

Data Quality Gate: PASS | WARN | FAIL
Dataset: <name and version>
Completeness: <pass/fail + missing rate>
Consistency: <pass/fail + conflict count>
Class Balance: <pass/fail + imbalance ratio>
Duplicates: <pass/fail + duplicate rate>
Outliers: <count flagged for review>
Gate Decision: <proceed / review required / blocked>
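The gate template above can be sketched as a simple function. The thresholds below are illustrative placeholders, and the WARN/FAIL policy shown (a single class-balance failure warns, anything else blocks) is an assumption, not a prescribed rule:

```python
# Hypothetical sketch of an automated quality gate; thresholds are
# illustrative placeholders and should come from project configuration.
def quality_gate(missing_rate, conflict_count, imbalance_ratio, duplicate_rate):
    checks = {
        "completeness": missing_rate <= 0.02,      # max 2% missing values
        "consistency": conflict_count == 0,        # no conflicting labels
        "class_balance": imbalance_ratio <= 10.0,  # majority/minority ratio
        "duplicates": duplicate_rate <= 0.01,      # max 1% duplicates
    }
    failed = [name for name, ok in checks.items() if not ok]
    if not failed:
        return "PASS", []
    # Assumed policy: a lone class-balance failure warrants review,
    # any other failure blocks training.
    if failed == ["class_balance"]:
        return "WARN", failed
    return "FAIL", failed

decision, failed = quality_gate(0.01, 0, 4.2, 0.003)
print(decision)  # → PASS
```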

When dataset composition changes significantly, run statistical tests (e.g., Kolmogorov-Smirnov, chi-squared) to detect distribution shifts relative to the previous version.
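For a single numeric feature, the two-sample Kolmogorov-Smirnov check can be run without external dependencies. A minimal sketch, using the standard asymptotic critical-value approximation (the `shifted` helper and its default alpha are illustrative):

```python
import math

def ks_statistic(sample_a, sample_b):
    """Max absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        # Advance past all ties at x before comparing the CDFs.
        while i < na and a[i] == x:
            i += 1
        while j < nb and b[j] == x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d

def shifted(sample_a, sample_b, alpha=0.05):
    """Reject 'same distribution' at level alpha (asymptotic threshold)."""
    na, nb = len(sample_a), len(sample_b)
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return ks_statistic(sample_a, sample_b) > c * math.sqrt((na + nb) / (na * nb))

import random
random.seed(0)
old = [random.gauss(0, 1) for _ in range(500)]
new = [random.gauss(2, 1) for _ in range(500)]
print(shifted(old, new))  # a clear mean shift between versions is detected
```

In practice, `scipy.stats.ks_2samp` and `scipy.stats.chisquare` provide the same tests with exact p-values.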

Data Lifecycle Management

Training datasets SHOULD have defined retention periods aligned with model lifecycle and regulatory requirements.

| Dataset Type | Retention Guidance | Deletion Trigger |
| --- | --- | --- |
| Active training set | Retain while model is in production | Model archived + retention expired |
| Evaluation benchmark | Indefinite while the benchmark remains in use | Superseded by new benchmark |
| Fine-tuning dataset | Retain per regulatory + consent requirements | Consent withdrawal or regulatory deadline |

  • Archive datasets when models are retired to support audit and compliance inquiries.
  • Implement deletion workflows for training data subject to erasure requests per PRD-STD-014.
  • Track which models were trained on data subject to erasure and assess retraining necessity.
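The erasure-impact tracking described above reduces to two lineage maps: model version → dataset versions, and dataset version → data-subject identifiers. A minimal sketch; all identifiers below are hypothetical:

```python
# Hypothetical lineage records; in practice these come from the
# data lineage system described in "Data Versioning & Lineage".
model_lineage = {
    "model-v7": ["ds-2024-03", "ds-2024-05"],
    "model-v8": ["ds-2024-05"],
}
dataset_subjects = {
    "ds-2024-03": {"subj-001", "subj-002"},
    "ds-2024-05": {"subj-002", "subj-003"},
}

def models_affected_by_erasure(subject_id):
    """Models trained on any dataset version containing the erased subject."""
    tainted = {ds for ds, subjects in dataset_subjects.items()
               if subject_id in subjects}
    return sorted(m for m, datasets in model_lineage.items()
                  if tainted.intersection(datasets))

print(models_affected_by_erasure("subj-001"))  # → ['model-v7']
```

The returned list is the input to the retraining-necessity assessment: each affected model must be evaluated against the applicable erasure obligations.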

Cross-References