Digital Sovereignty through AI-ready & Trusted Data for AI

Context

In an AI landscape increasingly shaped by proprietary black-box models, digital sovereignty is becoming a strategic priority for Switzerland. What matters is not only where a model is developed, but also which data it is based on, how this data is curated, and whether its use is legally, technically, and socially transparent and accountable.

Apertus demonstrates that AI can be built differently: as an open model developed in Switzerland, grounded in trustworthy and responsibly curated data. This requires AI-ready Data that is further developed into Trusted Data for AI: documented, curated, legally verified, and purpose-specific datasets that can be used for training, fine-tuning, or evaluating AI systems.

Problem Statement

Today, large AI models are often trained on publicly available data whose origin, licensing conditions, quality, and risk profile are difficult for the public to assess. This results in mistrust, compliance risks, and a lack of reproducibility, auditability, and responsible AI practices.

For Apertus, the central challenge is not simply to make more data available, but to systematically prepare data so that it is:

documented and discoverable
legally and ethically assessed
persistent and versioned for traceability
suitable for specific model functions
able to reduce compliance debt and privacy debt (see Key Concepts)
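
As an illustration only, the properties listed above could be captured in a small metadata record. The sketch below is not part of the challenge specification: the class DatasetRecord, the helper describe, all field names, and the example URL are hypothetical. It assumes that traceability is achieved by hashing the data and deriving a version identifier from the full record.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative assumptions."""
    name: str
    source: str            # where the data came from (documented, discoverable)
    license_id: str        # result of the legal assessment, e.g. an SPDX identifier
    ethics_reviewed: bool  # whether an ethical assessment took place
    intended_use: str      # model function: training, fine-tuning, or evaluation
    content_sha256: str    # hash of the data itself, for traceability

    def version_id(self) -> str:
        """Derive a short, stable version identifier from the whole record."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def describe(name: str, source: str, license_id: str, ethics_reviewed: bool,
             intended_use: str, data: bytes) -> DatasetRecord:
    """Build a record and hash the raw bytes so later changes become visible."""
    return DatasetRecord(
        name=name,
        source=source,
        license_id=license_id,
        ethics_reviewed=ethics_reviewed,
        intended_use=intended_use,
        content_sha256=hashlib.sha256(data).hexdigest(),
    )


record_v1 = describe("example-corpus", "https://example.org/corpus",
                     "CC-BY-4.0", True, "fine-tuning", b"first snapshot")
record_v2 = describe("example-corpus", "https://example.org/corpus",
                     "CC-BY-4.0", True, "fine-tuning", b"second snapshot")
print(record_v1.version_id(), record_v2.version_id())
```

Because the version identifier depends on the content hash, the two snapshots receive different identifiers, while re-describing identical data reproduces the same one.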

Challenge Framing

This challenge invites teams to develop solutions that demonstrate how a data space can function as a regulatory and technical membrane between data from diverse sources (e.g. web data such as Common Crawl, as well as data contributed by trusted actors under defined or restricted conditions) and trusted training data for AI (in this prototype: training data for Apertus).

At its core is a three-stage refinement process:

Inbound / Accessible

Raw data is ingested, documented, and enriched with minimum metadata.

Processing / Explorable

Data is technically, legally, and qualitatively assessed, cleaned, curated, and evaluated in terms of risk.

Outbound / Purpose-fit

Curated, versioned, and documented datasets are made available as Trusted Data for AI for Apertus or comparable AI systems, provided they meet the defined quality threshold within the prototype.
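
The three stages above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a reference design: the function names inbound, processing, and outbound, the licence allow-list, the toy quality heuristic, and the threshold of 0.8 are all hypothetical placeholders that a real prototype would replace with its own rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0"}  # assumption: example allow-list
QUALITY_THRESHOLD = 0.8                      # assumption: example threshold


@dataclass
class Item:
    text: str
    metadata: dict
    quality: Optional[float] = None
    risks: List[str] = field(default_factory=list)


def inbound(raw_text: str, source: str, license_id: str) -> Item:
    """Stage 1 (Inbound / Accessible): ingest and attach minimum metadata."""
    return Item(text=raw_text, metadata={"source": source, "license": license_id})


def processing(item: Item) -> Item:
    """Stage 2 (Processing / Explorable): clean, assess, and record risks."""
    item.text = " ".join(item.text.split())  # normalise whitespace
    if item.metadata["license"] not in ALLOWED_LICENSES:
        item.risks.append("license-not-cleared")
    # Toy quality score: share of characters that are letters or spaces.
    readable = sum(ch.isalpha() or ch.isspace() for ch in item.text)
    item.quality = readable / len(item.text) if item.text else 0.0
    return item


def outbound(items: List[Item]) -> List[Item]:
    """Stage 3 (Outbound / Purpose-fit): release only risk-free items above the threshold."""
    return [
        i for i in items
        if not i.risks and i.quality is not None and i.quality >= QUALITY_THRESHOLD
    ]


raw = [
    ("Bern   is the capital of Switzerland.", "https://example.org/a", "CC-BY-4.0"),
    ("$$$ ### 123", "https://example.org/b", "CC0-1.0"),
    ("A well-written but unlicensed paragraph.", "https://example.org/c", "unknown"),
]
trusted = outbound([processing(inbound(*r)) for r in raw])
print([item.metadata["source"] for item in trusted])
```

In this sketch only the first item passes: the second falls below the quality threshold and the third carries an uncleared licence, so neither crosses the membrane into the outbound set.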

Key Concepts

Digital sovereignty
AI-ready Data
Trusted Data for AI
Data commons and data spaces
Data stewardship
Responsible AI and data
Compliance debt / privacy debt
Transparency, reproducibility, and auditability

Expected Outputs

A functional technical and sociotechnical prototype
Inbound processes for data collection and metadata description
Mechanisms for automated data cleaning, licensing, and compliance checks
A framework to measure and reduce compliance debt
Documentation of assumptions, governance decisions, and learnings
A concept for scaling and real-world deployment
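
One possible starting point for the framework to measure and reduce compliance debt is sketched below. The brief does not fix a definition, so this sketch assumes that debt is the weighted sum of open issues per dataset, split into a compliance and a privacy category. The issue names, weights, dataset names, and the functions debt_by_category and resolve are hypothetical.

```python
# Assumption: each open issue has a category and a weight expressing its severity.
ISSUES = {
    "missing_license": ("compliance", 3),
    "missing_provenance": ("compliance", 2),
    "unreviewed_personal_data": ("privacy", 5),
}


def debt_by_category(datasets):
    """Sum the weights of all open issues, grouped by category."""
    totals = {"compliance": 0, "privacy": 0}
    for dataset in datasets:
        for issue in dataset["open_issues"]:
            category, weight = ISSUES[issue]
            totals[category] += weight
    return totals


def resolve(datasets, name, issue):
    """Return a new list in which one issue of one dataset is marked as resolved."""
    return [
        {
            **d,
            "open_issues": [
                i for i in d["open_issues"] if not (d["name"] == name and i == issue)
            ],
        }
        for d in datasets
    ]


datasets = [
    {"name": "crawl-sample", "open_issues": ["missing_license", "missing_provenance"]},
    {"name": "partner-corpus", "open_issues": ["unreviewed_personal_data"]},
]

before = debt_by_category(datasets)
after = debt_by_category(resolve(datasets, "crawl-sample", "missing_license"))
print(before, after)
```

Tracking the totals before and after each curation step gives a simple, auditable indicator of whether the data space is actually reducing debt over time.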

Note: We are aware that the requested outputs may remain largely at the prototype stage.

Challenge Partners

Swiss Data Alliance

Apertus