Training Data Lineage
Training data lineage is the record of which source data contributed to an AI model's training — a foundational requirement of the FINOS AI Governance Framework and an explicit obligation under the EU AI Act for general-purpose AI model providers. Institutions must maintain information tracking which source data contributed to AI-generated outputs, because a regulator investigating an incident after the fact needs to trace a model's behavior back to the data it was trained on.
The current wave of vendor AI data collection directly tests an institution's ability to maintain training data lineage. When a SaaS vendor uses customer content for model training, the institution's data becomes part of a model's training lineage — often without any ability for the institution to audit or reverse the contribution.
What Training Data Lineage Requires
Maintaining training data lineage is not a single artifact — it is a control chain across procurement, operations, and audit:
- Source data inventory — what data was available to each model during training
- Contribution records — which of that data was actually used, and in what training run
- Transformation history — de-identification, aggregation, embedding steps between source and model
- Retention tracking — how long lineage records are kept and how they are retrieved
- Third-party lineage — attestations from vendors whose models were trained on institution data
Why Vendor AI Data Policies Make This Hard
When a vendor unilaterally changes AI data policies — enabling training on customer content by default, with notification timelines shorter than compliance review cycles — the institution's training data lineage is compromised before it has a chance to assess the change. The data enters the training pipeline; the lineage record is whatever the vendor chooses to provide.
This is why the FINOS AIGF requires institutions to verify model provenance, not just accept vendor policy documents, and why CC4AI is being built as a machine-readable attestation format.
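To make the provenance-verification point concrete: a machine-readable attestation can be checked against institutional policy automatically, rather than read by a human during a compliance review. The sketch below shows such a check; since the source describes CC4AI only as a machine-readable attestation format, every field name here is hypothetical and will differ from the actual schema.

```python
# Hypothetical policy check over a machine-readable vendor attestation.
# Field names are illustrative assumptions, not the CC4AI schema.

def attestation_violations(attestation: dict) -> list[str]:
    """Return policy findings for a vendor's AI data-use attestation."""
    findings = []
    # Training on customer content by default, with no opt-in, is the
    # exact scenario the section above describes.
    if attestation.get("trains_on_customer_content") and not attestation.get("opt_in_required"):
        findings.append("trains on customer content by default (no opt-in)")
    # A notice period shorter than the compliance review cycle (assumed
    # here to be 90 days) means the change lands before it can be assessed.
    if attestation.get("notice_period_days", 0) < 90:
        findings.append("policy-change notice period shorter than review cycle")
    return findings

vendor = {
    "vendor": "ExampleSaaS",          # hypothetical vendor
    "trains_on_customer_content": True,
    "opt_in_required": False,
    "notice_period_days": 30,
}
print(attestation_violations(vendor))
```

Run against this example attestation, the check surfaces both findings, turning a vendor policy change from a document to be read into an event that can be flagged automatically.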
Training Data Lineage and the EU AI Act
General-purpose AI model providers under the EU AI Act must publish a summary of training datasets, maintain records of training data origins, and comply with the Copyright Directive. For financial institutions using third-party AI, the regulatory question extends to whether the institution can demonstrate it understood, consented to, and maintained oversight of how its operational data was used in any model's training pipeline.
How Reign Addresses Training Data Lineage
Reign's Evidence Engine integrates training data lineage tracking for both internal AI workloads and third-party AI services — mapping operational data flows, vendor attestations, and model-contribution records into AIGF-aligned evidence. The output is retrievable on demand, structured for regulator ingestion, and continuously updated as vendor practices change.
