Data Provenance Initiative
The Data Provenance Initiative is a volunteer collective of AI researchers from around the world. We conduct large-scale audits of the massive datasets that power state-of-the-art AI models. We have audited over 4,000 popular text, speech, and video datasets, tracing them from origin to creation and cataloging data sources, licenses, creators, and other metadata, which researchers can examine using our Explorer tool. We recently analyzed 14,000 web domains to understand the evolving provenance and consent signals behind AI data. The purpose of this work is to map the landscape of AI data, improving transparency, documentation, and informed use of data.
Web Design
Visualization

The graphic on the left represents the geographical distribution of the creators of AI training data.

The graphic on the right displays the visualization without distortion. Flags are drawn in proportion to each country’s contribution to fine-tuning data.
