Disclaimer: This article is for informational purposes only and does not constitute legal advice. The law in this area is evolving rapidly. Consult a qualified intellectual property or technology attorney before finalising your data acquisition strategy.

The question teams should be asking in 2026 is not "is web scraping legal?" — it's "how do we build a training dataset that is defensible, documented, and actually good?" The difference matters enormously. A legally sound dataset is not just one that avoids lawsuits. It's one with clear provenance, the right licenses, consistent quality signals, and a curation pipeline you can explain to a judge, a regulator, or an enterprise customer asking about data lineage.

This guide is structured as a practical playbook. It covers the four ways to source training data (open datasets, web scraping, licensed data, and synthetic generation), how to assess and tier your sources by legal posture, the technical pipeline for scraping and curating responsibly, and the compliance documentation every AI team needs to maintain.

At a glance:
  • 4 primary sourcing methods covered
  • 30+ open & licensed data sources catalogued
  • 8 pipeline stages in the curation workflow

Licensing Tiers: Know Your Source Before You Touch It

Before anything else, every data source needs a license assessment. There are four tiers that determine how freely you can use content for LLM training. The goal of a well-built dataset is to maximise the share of Tier 1 and Tier 2 sources — and have documented legal rationale for anything in Tier 3.

Licensing Framework
Four Data Tiers for LLM Training — Legal Posture & Use Rules
Tier 1 — Open
CC0, CC-BY, MIT, Apache 2.0, public domain. No restrictions on use, modification, or commercial training. CC-BY requires attribution in documentation — not in model outputs. The gold standard. Maximize this tier. Examples: Wikipedia, Common Crawl (filtered), OpenWebText, GitHub open source repos, government datasets (data.gov, EU Open Data Portal), arXiv preprints (CC-BY).
Use Freely
Tier 2 — Research-OK
CC-BY-SA, CC-BY-NC, or explicit AI training permission. Sharealike requires derivative datasets to carry the same license — check whether your training corpus itself is a "derivative." NC (non-commercial) restricts commercial deployment of the trained model. Some publishers have signed explicit data licensing deals (AP, Reuters, Axel Springer with OpenAI/Google) — match against your own agreements. Statutory text and data mining exceptions in the EU and Japan also place content here for qualifying research contexts.
Use with Care
Tier 3 — Contested
All rights reserved, ToS-silent on AI, or ToS-ambiguous. Most of the open web falls here. Fair use may apply, but requires documented analysis of all four factors — purpose, nature, amount, and market harm. You can include Tier 3 sources, but you need written legal rationale for each significant source, and you should track volume carefully. The riskier the source (news publishers, creative works, educational content), the stronger your fair use memo needs to be. Courts in 2025 ruled training transformative but left market harm as an open contest.
Document First
Tier 4 — Avoid
Explicit opt-out, login-required, piracy sites, or personal data without consent. Content from sites that have posted machine-readable opt-out directives (e.g. a robots.txt entry of "User-agent: GPTBot" followed by "Disallow: /", or equivalent) must be excluded: this is legally required in the EU under DSM Art. 4, and best practice everywhere else. Scraped PII without a lawful basis (names, emails, photos) violates GDPR and CCPA regardless of public visibility. Piracy-sourced content (shadow libraries, torrented content) destroys any fair use argument and was the central issue in Bartz v. Anthropic.
Exclude

Tier assessment should happen at ingestion time, before content enters your pipeline. Retroactively removing Tier 4 content from an already-trained model is not reliably possible with current techniques; machine unlearning remains research-grade, so the only dependable remedy is retraining from a clean dataset.
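Tier assessment at ingestion is easy to automate as a first pass. Below is a minimal sketch of a provenance record created before any content is queued; the field names, license strings, and classification rules are illustrative assumptions, not a legal rulebook, and any automated Tier 3 assignment still needs human legal review.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative license allowlists — not legal advice; extend per your counsel.
TIER1_LICENSES = {"cc0", "cc-by", "mit", "apache-2.0", "public-domain"}
TIER2_LICENSES = {"cc-by-sa", "cc-by-nc"}

@dataclass
class SourceRecord:
    """Provenance record created before any content enters the pipeline."""
    domain: str
    license: str              # normalised license string, "" if unknown
    opted_out: bool           # robots.txt / ai.txt opt-out observed
    requires_login: bool
    tier: int = field(init=False)
    reviewed_at: str = field(init=False)

    def __post_init__(self):
        self.tier = self._classify()
        self.reviewed_at = datetime.now(timezone.utc).isoformat()

    def _classify(self) -> int:
        if self.opted_out or self.requires_login:
            return 4          # Tier 4: exclude outright
        lic = self.license.lower()
        if lic in TIER1_LICENSES:
            return 1
        if lic in TIER2_LICENSES:
            return 2
        return 3              # unknown or all-rights-reserved: contested

record = SourceRecord("example.org", "cc-by", opted_out=False, requires_login=False)
print(record.tier)  # 1
```

The key design choice is that unknown licenses default to Tier 3 (document first), never Tier 1, and any opt-out or login wall short-circuits to Tier 4 before license analysis even runs.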

Open Datasets: Your Tier 1 Foundation

A significant amount of high-quality, legally unambiguous training data already exists as open datasets. For most LLM training projects, open datasets should form the backbone — providing volume, domain coverage, and legal clarity — before any custom web scraping begins.

Dataset Catalogue
Key Open & Licensed Datasets for LLM Training (2026)
  • Common Crawl (General web; Open): Petabyte-scale monthly snapshots. Raw quality requires heavy filtering. Foundation of GPT, LLaMA, and most frontier models.
  • Wikipedia, all languages (Encyclopedic; CC-BY-SA): ~21GB for English. Clean, factual, multilingual. Check derivative licensing if publishing your dataset.
  • arXiv bulk access (Science / Research; licenses vary per paper): 2M+ papers with LaTeX source available. Filter to CC-BY/CC0 submissions; requires registration with the arXiv S3 bulk access program.
  • GitHub open source (Code; MIT / Apache): Filter to permissive licenses only. The Stack (BigCode) provides pre-filtered, license-verified code datasets.
  • The Pile, EleutherAI (Mixed; Open): 825GB curated from 22 datasets, with well-documented sourcing. Some sub-components have restrictions; check each subset's license.
  • ROOTS / BLOOM corpus (Multilingual; CC-BY): 1.61TB across 59 languages. Built by BigScience with explicit data governance documentation, a model for how to document a dataset.
  • RedPajama v2 (General web; Open): 30T tokens from Common Crawl with quality annotations. Ready to use, with quality signal metadata.
  • Dolma, AI2 (Mixed; Open): 3T tokens under Apache 2.0 from the Allen Institute, with full data documentation and ethics review published.
  • OpenWebText2 (Web, Reddit-curated; Open): Recreates GPT-2's WebText approach using upvoted Reddit links. Higher quality signal than raw crawl.
  • OSCAR, INRIA (Multilingual web; Open): Filtered from Common Crawl by language. Available via Hugging Face, with well-documented deduplication.
  • Project Gutenberg (Books / Literature; Public Domain): 70,000+ public domain works. Use PG-19 (the DeepMind subset) for pre-processed access. Not representative of modern language.
  • US Government / EU Open Data (Legal / Government; Public Domain): data.gov, regulations.gov, EUR-Lex, CourtListener court opinions. All public domain or CC0; excellent for legal and civic domains.
  • PubMed Central Open Access (Biomedical; CC-BY / CC0): Millions of full-text OA articles, with a subset available via API. Filter to CC-BY or CC0 only.
  • Multilingual LibriSpeech text (Books, speech-derived; CC-BY 4.0): Text transcripts from audiobooks, derived from LibriVox, which uses public domain texts.
  • C4, Colossal Clean Crawled Corpus (Filtered web; Open): T5's training corpus: 750GB of English web text after aggressive heuristic filtering. Widely benchmarked.

When using multi-component datasets like The Pile or ROOTS, document which sub-components you are using — license terms vary by subset. HuggingFace Datasets provides machine-readable license metadata for most listed datasets.
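Per-subset license checks are simple to enforce mechanically. The sketch below filters a list of subset metadata records against a license allowlist; the subset names and license strings are illustrative, not an authoritative inventory of any real dataset's components.

```python
# Hypothetical per-subset metadata in the shape dataset cards typically expose.
subsets = [
    {"name": "wikipedia-en", "license": "cc-by-sa-3.0"},
    {"name": "github-permissive", "license": "apache-2.0"},
    {"name": "unknown-forum-dump", "license": None},
]

ALLOWED = {"cc0-1.0", "cc-by-4.0", "cc-by-sa-3.0", "mit", "apache-2.0"}

def select_subsets(subsets, allowed=ALLOWED):
    """Keep only subsets whose license is explicitly on the allowlist.
    Anything unlabelled is excluded rather than assumed safe."""
    kept, rejected = [], []
    for s in subsets:
        lic = (s.get("license") or "").lower()
        (kept if lic in allowed else rejected).append(s["name"])
    return kept, rejected

kept, rejected = select_subsets(subsets)
print(kept)      # ['wikipedia-en', 'github-permissive']
print(rejected)  # ['unknown-forum-dump']
```

Note the default-deny posture: a missing license field lands in the rejected pile, mirroring the tiering principle that unknown provenance is never treated as open.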

Web Scraping for Training Data: Doing It Right

For domains not covered by existing open datasets — recent news, niche technical content, contemporary fiction, non-English web — custom web scraping is often the only way to get coverage. The key is not to avoid scraping, but to scrape in a way that is documented, respects opt-outs, and can withstand scrutiny.

"The goal is not to avoid scraping. It's to scrape in a way that a reasonable judge would see as acting in good faith." — General principle from hiQ v. LinkedIn, applied to AI training contexts
Scraping Protocol
Technical & Legal Best Practices for Training Data Web Crawls
Before You Crawl
  • Check robots.txt — respect all Disallow directives, especially User-agent: * and AI-specific agents
  • Check ToS for explicit scraping / AI training prohibitions. Log your review decision per domain
  • Identify whether the site requires login — if yes, stop. Authenticated scraping triggers CFAA risk
  • Assign a Tier (1–4) to the domain before queueing it
  • Check if the domain has explicitly opted out of AI crawling via ai.txt, meta tags, or HTTP headers
Crawler Configuration
  • Use an honest, identifiable User-Agent string (e.g. YourBot/1.0 (+https://yourco.com/bot))
  • Set Crawl-delay to at least 1–2 seconds per domain, more for smaller sites
  • Respect Crawl-delay and Request-rate directives in robots.txt
  • Do not use IP rotation to evade blocks — treat a block as an opt-out signal
  • Log every request with timestamp, URL, and HTTP response code
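The robots.txt rules above can be enforced with the standard library alone. This sketch parses a robots.txt body offline (in production you would fetch the live file and archive the raw bytes for provenance); the bot name and the example directives are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Parsed from text so the example runs offline; in production, fetch the
# live robots.txt and store a snapshot of it alongside your crawl logs.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

BOT = "YourBot/1.0"  # honest, identifiable user-agent (hypothetical name)

print(rp.can_fetch(BOT, "https://example.org/articles/1"))       # True
print(rp.can_fetch(BOT, "https://example.org/private/x"))        # False
print(rp.can_fetch("GPTBot", "https://example.org/articles/1"))  # False
print(rp.crawl_delay(BOT))                                       # 2
```

The third check matters for tiering: a site that blocks GPTBot but allows general crawlers has signalled an AI-training opt-out, so your pipeline should treat it as Tier 4 even though your own bot is technically permitted.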
During Crawl
  • Strip PII at extraction time: names, emails, phone numbers, addresses, user IDs
  • Check HTTP headers for X-Robots-Tag: noindex or noai — treat these as Tier 4
  • Do not follow redirect chains beyond 3 hops — can lead to gated content inadvertently
  • Capture the full canonical URL and fetch date for every document — critical for provenance records
  • Record the robots.txt state at crawl time (robots.txt changes retroactively affect the legal record)
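A per-request provenance line can capture all of the above in one place. This is a minimal sketch; the field names are illustrative, and hashing the robots.txt body is one simple way to freeze its crawl-time state, since the live file can change later.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_fetch(url: str, status: int, robots_txt: str) -> str:
    """Emit one JSONL provenance line per request: URL, timestamp, HTTP
    status, and a hash of the robots.txt body as seen at crawl time."""
    entry = {
        "url": url,
        "status": status,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "robots_sha256": hashlib.sha256(robots_txt.encode()).hexdigest(),
    }
    return json.dumps(entry, sort_keys=True)

line = log_fetch("https://example.org/a", 200, "User-agent: *\nDisallow:")
print(line)
```

Append-only JSONL keeps the log cheap to write at crawl scale and trivially replayable when you later need to demonstrate exactly what was fetched, when, and under which robots.txt state.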
Post-Crawl Processing
  • Run a second-pass PII scan across the full corpus before training ingestion
  • Apply near-deduplication (MinHash LSH or similar) to reduce memorisation risk
  • Filter by language confidence score — low-confidence multilingual content degrades quality
  • Run quality heuristics: remove boilerplate, nav text, cookie banners, and ad copy
  • Tag every document with source domain, crawl date, tier classification, and license field
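For the second-pass PII scan, a regex sweep catches the mechanical classes. The sketch below handles only emails and phone numbers; real pipelines use an NER-based tool such as presidio, because regexes alone miss names, addresses, and contextual identifiers.

```python
import re

# Two easy PII classes only — a floor, not a ceiling, for PII removal.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace matched PII spans with typed placeholders, preserving
    surrounding text so document structure survives redaction."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Contact jane.doe@example.com or +1 (555) 867-5309."))
# Contact [EMAIL] or [PHONE].
```

Typed placeholders (rather than deletion) are deliberate: they leave an auditable trace of what was removed and keep sentence structure intact for training.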

Never use third-party scraping-as-a-service tools that route through proxies or claim to bypass blocks — the Reddit v. Perplexity case (2025) introduced DMCA anti-circumvention claims specifically around this pattern.

Tools for Building Training Datasets

The ecosystem for dataset engineering has matured significantly. These are the core tools used by teams building production-grade training corpora.

Scrapy
Web Crawling
Python framework for large-scale crawls. Middleware support for rate limiting, robots.txt compliance, and custom user-agent management. Best for structured domain-specific crawls.
Playwright / Puppeteer
JS-rendered Pages
For JavaScript-heavy sites that Scrapy cannot render. Use sparingly — JS-rendered scraping is harder to rate-limit and easier to mis-classify as bypassing controls.
trafilatura
Text Extraction
Best-in-class HTML-to-text extractor. Removes boilerplate, navigation, ads, and cookie banners. Python library; outperforms BeautifulSoup for article extraction.
resiliparse
WARC / CC Processing
C++-backed Python library for processing Common Crawl WARC files at scale. Used by major OSCAR and RedPajama pipelines. Handles multi-TB corpora efficiently.
datatrove (HuggingFace)
Pipeline Orchestration
HuggingFace's framework for large-scale text dataset pipelines. Handles deduplication, filtering, and tokenisation at scale with built-in distributed execution.
MinHash / datasketch
Deduplication
Near-duplicate detection using Locality Sensitive Hashing. Essential for reducing memorisation in training corpora. Run at both document and paragraph level.
presidio (Microsoft)
PII Detection
Open source PII detection and anonymisation. Covers names, emails, phone numbers, IBANs, SSNs across multiple languages. Run as a post-processing step before ingestion.
fastText / langdetect
Language ID
Fast language identification at the document level. Critical for multilingual pipelines — low-confidence language assignments significantly degrade quality in non-English subsets.
Hugging Face Datasets
Dataset Hub
Central repository for published datasets. Machine-readable license metadata, version control, and streaming access. Start every search for existing data here before crawling.
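To make the extraction stage concrete, here is a deliberately toy stand-in for what trafilatura does, built on the standard library's html.parser. The keep/skip tag lists are crude assumptions; real boilerplate removal is far more involved, and you should reach for trafilatura in practice.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Toy boilerplate stripper: keeps text inside <p>/<h1>-<h3> and skips
    <script>, <style>, <nav>, <footer>, <aside>. Illustrative only."""
    KEEP = {"p", "h1", "h2", "h3"}
    SKIP = {"script", "style", "nav", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth_keep = 0
        self.depth_skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP: self.depth_keep += 1
        if tag in self.SKIP: self.depth_skip += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP: self.depth_keep = max(0, self.depth_keep - 1)
        if tag in self.SKIP: self.depth_skip = max(0, self.depth_skip - 1)

    def handle_data(self, data):
        if self.depth_keep and not self.depth_skip:
            text = data.strip()
            if text:
                self.chunks.append(text)

html = "<nav>Home | About</nav><h1>Title</h1><p>Body text.</p><script>x()</script>"
ex = TextExtractor()
ex.feed(html)
print(" ".join(ex.chunks))  # Title Body text.
```

Even this toy version illustrates the core idea: extraction is an allowlist over content-bearing elements, not a denylist over junk, because the junk vocabulary of the web is effectively unbounded.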

Licensed & Partnership Data

For high-value content domains — quality journalism, books, scientific literature, financial data — licensing agreements are increasingly the right answer. The deals OpenAI, Google, and Apple have signed with major publishers are not just PR moves. They are legal risk elimination for a category of content that would otherwise sit squarely in Tier 3.

Licensing Strategy

When to Pursue a Licensing Agreement vs Relying on Fair Use

Licensing makes sense when: (1) the content domain is central to your model's intended capabilities — e.g. a legal AI that needs clean case law, or a news AI that needs current articles; (2) the rights holder is likely to notice and object — large publishers actively monitor for LLM reproduction; (3) you are building a commercial product that will directly compete in the rights holder's market.

For large-scale web data licensing, the standard models are: bulk dataset licenses (one-time or annual fee for a snapshot), API access licenses (ongoing programmatic access, often with restrictions on model redistribution), and revenue-sharing agreements (emerging model where publishers receive a share of AI product revenue). AP, Reuters, and the Financial Times have all signed deals in the latter two categories.

For smaller publishers and individual creators, opt-in registries are emerging — similar to how music licensing works through ASCAP/BMI. The ai.txt standard and the Spawning API allow rights holders to signal preferences and negotiate terms programmatically. Building your crawler to check these signals now positions you for the licensing infrastructure that is coming.

Synthetic Data: When to Generate, Not Collect

Synthetic data generation — using an existing LLM to produce training examples — has become a mainstream component of dataset pipelines, particularly for instruction tuning and RLHF. It solves certain legal problems cleanly but introduces others. The rules here are different from scraping.

Synthetic Data
Where Synthetic Generation Fits — and Where It Doesn't
Use Freely
Instruction-tuning data. Generating (prompt, response) pairs for fine-tuning on a specific task — customer support, coding assistance, summarisation — is the most common use case. Output from open-weight models (LLaMA 3, Mistral, Falcon) can generally be used freely, subject to each model's license. Check whether the model license prohibits using outputs to train competing models (some do).
Strong Use Case
Caution
Outputs from closed API models (GPT-4, Claude, Gemini). OpenAI, Anthropic, and Google all prohibit using their API outputs to train competing models in their ToS. "Distillation", using a frontier model's outputs to train a smaller model, violates these terms unless the provider has granted an explicit exception for your use case. Always check the current ToS before building a synthetic pipeline on top of a closed API.
Check ToS
Avoid
Synthetic "reproductions" of real copyrighted content. Using a model to paraphrase or reconstruct copyrighted articles, books, or code as a way to launder the copyright is not a legal workaround — courts are increasingly sceptical of this approach, and it is likely to be tested directly in the NYT v. OpenAI case. Similarly, synthetic PII (realistic fake names, emails, addresses) carries its own GDPR risks if it could be used to re-identify real individuals.
Legal Risk

The most defensible synthetic data pipelines generate content that has no equivalent real-world copyright holder — e.g. novel reasoning traces, formatted instruction data, or domain-specific factual Q&A grounded in public domain sources.
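A minimal version of that defensible pattern is template-based generation over public domain facts. The facts and templates below are illustrative assumptions; a production pipeline would use an open-weight LLM (subject to its license) to vary phrasing, but the grounding principle is the same.

```python
# Synthetic (prompt, response) pairs grounded in public domain facts —
# no real-world copyright holder for either the question or the answer.
FACTS = {
    "the boiling point of water at sea level": "100 degrees Celsius",
    "the chemical symbol for gold": "Au",
}

TEMPLATES = [
    "What is {subject}?",
    "State {subject}.",
]

def generate_pairs(facts, templates):
    """Cross every fact with every prompt template."""
    pairs = []
    for subject, answer in facts.items():
        for t in templates:
            pairs.append({"prompt": t.format(subject=subject),
                          "response": answer})
    return pairs

pairs = generate_pairs(FACTS, TEMPLATES)
print(len(pairs))          # 4
print(pairs[0]["prompt"])  # What is the boiling point of water at sea level?
```

The provenance story is the point: every generated pair traces to a fact with no rights holder and a template you wrote, so the synthetic corpus inherits no one else's copyright.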

The Full Curation Pipeline

A legal training dataset is not just a legal data source — it's also a documented, reproducible pipeline. Here is the end-to-end workflow used by teams building production corpora, with notes on which steps are primarily engineering concerns and which have legal significance.

Engineering Workflow
8-Stage Dataset Curation Pipeline — From Source to Training-Ready
1
Source Inventory & Tier Classification
For each candidate source, assign a Tier (1–4), record the license or legal rationale, check robots.txt and ToS, and document the review decision. This creates your data provenance record — the single most important document if you ever face a legal challenge.
Legal Engineering
2
Collection (Crawl / Download / API)
Execute the crawl or download using rate-limited, identified bots. For open datasets, download from canonical sources (HuggingFace, the original repository). Log every document URL, fetch timestamp, and HTTP status. For licensed data, execute under the signed agreement terms.
Engineering
3
Extraction & Format Normalisation
Convert HTML/PDF/WARC to clean text using trafilatura or equivalent. Strip boilerplate, navigation, advertisements, cookie notices, and HTML artifacts. Normalise encoding to UTF-8. Extract metadata: title, author, date, URL, language.
Engineering
4
PII Detection & Removal
Run presidio or equivalent over the full extracted corpus. Flag and redact: names in identifying contexts, email addresses, phone numbers, physical addresses, government ID numbers, financial account numbers, and any health-related personal data. This step is legally required under GDPR for any corpus that may contain EU personal data. Document the tool version, recall/precision rates, and any manual spot-checking.
Legal (GDPR) Engineering
5
Quality Filtering
Apply heuristic filters to remove low-quality content: documents below minimum token count, documents with high symbol-to-word ratios, documents that are primarily lists of URLs or product codes, spam content, and adult content classifiers if required. Run language ID and filter to target language(s). Quality filtering also has a legal benefit: removing "junk" web content reduces the proportion of potentially problematic Tier 3 material in the final corpus.
Engineering
6
Near-Deduplication
Run MinHash LSH at both document and paragraph level. Deduplication reduces memorisation risk (models are less likely to reproduce exact content from the training set) and improves training efficiency. Use aggressive thresholds: Jaccard similarity above 0.8 is a common removal cutoff. Document your deduplication parameters; they become part of your dataset card.
Engineering Memorisation Risk
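The decision rule behind that threshold can be shown with plain shingle Jaccard. Real pipelines use MinHash LSH (e.g. via datasketch) so they never compare all pairs; this stdlib-only sketch computes the exact similarity the LSH approximates, with illustrative documents.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """k-word shingles, hashed so the sets stay compact."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(1, len(words) - k + 1))
    }

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank on a sunny morning"
doc2 = "the quick brown fox jumps over the lazy dog near the river bank on a sunny evening"
doc3 = "an entirely different sentence about training data provenance records"

s1, s2, s3 = shingles(doc1), shingles(doc2), shingles(doc3)
print(jaccard(s1, s2) > 0.8)  # True — near-duplicate, keep only one copy
print(jaccard(s1, s3))        # 0.0
```

MinHash replaces the full shingle sets with short signatures whose agreement rate estimates this same Jaccard value, which is what makes the >0.8 cutoff tractable across billions of documents.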
7
Opt-Out Compliance Pass
Before finalising, run a second pass against the current opt-out registry: check the latest robots.txt for each crawled domain (robots.txt can change after your initial crawl), check Spawning / Have I Been Trained / ai.txt registries, and remove any domains that have posted AI training opt-outs since your crawl date. In the EU, this step is legally required under DSM Art. 4. Everywhere else, it is best practice and establishes good faith.
Legal (EU Required) Engineering
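A cheap first step for this pass is change detection against your crawl-time robots.txt snapshots: any domain whose robots.txt differs from the stored hash gets re-reviewed. This sketch is only that triage step, not a full compliance pass, which must also consult the Spawning / ai.txt registries named above.

```python
import hashlib

def robots_changed(stored_hash: str, current_robots: str) -> bool:
    """Flag domains whose robots.txt no longer matches the crawl-time
    snapshot, so they can be re-reviewed (and dropped if an AI opt-out
    was added since the crawl)."""
    return hashlib.sha256(current_robots.encode()).hexdigest() != stored_hash

crawl_time = "User-agent: *\nDisallow:"
current = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow:"

stored = hashlib.sha256(crawl_time.encode()).hexdigest()
print(robots_changed(stored, current))     # True — re-review this domain
print(robots_changed(stored, crawl_time))  # False
```

Hash comparison deliberately over-triggers (any edit flags the domain), which is the right failure mode here: a false positive costs one re-review, a false negative costs a compliance gap.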
8
Dataset Card & Documentation
Publish a dataset card documenting: source inventory with tiers, collection dates, pipeline steps and tool versions, PII removal approach, deduplication parameters, known limitations, and legal rationale summary. This documentation serves multiple purposes: EU AI Act compliance (mandatory for general-purpose AI models), enterprise customer due diligence, internal audit trail, and fair use documentation. The ROOTS/BLOOM and Dolma dataset cards are the current best-practice benchmarks.
Legal (EU AI Act) Engineering
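The machine-readable core of a dataset card can be assembled directly from the pipeline metadata the earlier stages produce. The field names below follow this article's checklist and are illustrative, not a formal schema; HuggingFace dataset cards use YAML front matter with broadly similar fields.

```python
import json

# Minimal dataset card sketch built from pipeline metadata. All values
# here are placeholders for illustration.
card = {
    "name": "example-corpus-v1",
    "collection_window": {"start": "2026-01-02", "end": "2026-01-30"},
    "sources": [
        {"domain": "en.wikipedia.org", "tier": 1, "license": "cc-by-sa-3.0"},
        {"domain": "example-news.org", "tier": 3,
         "license": "all-rights-reserved",
         "legal_rationale": "fair-use-memo-2026-01"},
    ],
    "pii_treatment": {"tool": "presidio", "second_pass": True},
    "deduplication": {"method": "minhash-lsh", "jaccard_threshold": 0.8},
    "optout_pass": {"date": "2026-02-01",
                    "registries": ["robots.txt", "ai.txt", "spawning"]},
    "known_limitations": ["English-heavy", "news under-represented"],
}

print(json.dumps(card, indent=2)[:60])
```

Notice that every Tier 3 source carries a pointer to its legal rationale document: the card is only as defensible as the memos it references.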

Steps 4 and 7 have direct legal significance — their absence is the most common gap in AI team compliance practices. Step 8 (documentation) is now mandatory in the EU and is increasingly requested by enterprise customers as part of AI procurement due diligence.

Documentation: Your Legal & Compliance Record

The EU AI Act, which took effect for general-purpose AI models in 2025, requires providers to document their training data sources. But even outside Europe, dataset documentation has become a commercial necessity — enterprise customers are asking for data lineage as part of procurement, and the ability to produce clear documentation is increasingly a competitive differentiator.

At minimum, your dataset record should contain: a source inventory mapping every significant source to its tier, license, and review decision; a data collection log with dates, tools, and configurations; a PII treatment record documenting tools used, scan coverage, and removal rates; an opt-out compliance log showing when and how opt-out checks were run; and a dataset card in HuggingFace or equivalent format that synthesises all of the above.

This documentation does three things simultaneously. It satisfies regulatory requirements. It gives your legal team what they need to construct a fair use defense quickly if challenged. And it gives enterprise customers the data lineage assurance they need to deploy your model internally.

Appendix

Key Case Law Reference

The following cases are the primary legal authorities governing web scraping and LLM training data as of February 2026. For full analysis, see the companion article on legal risks.

2022 — Ninth Circuit
hiQ Labs v. LinkedIn
Scraping publicly accessible data does not violate the CFAA. Foundation case for all subsequent scraping law. Does not address copyright, ToS, or trespass claims.
June 23, 2025 — N.D. California
Bartz v. Anthropic PBC
Lawfully purchased books used for training = fair use. Training on pirated books = sent to trial. Source provenance is legally material. The "how you obtained it" question matters as much as "what you did with it."
June 25, 2025 — N.D. California
Kadrey v. Meta Platforms
Meta won on weak plaintiff arguments. Judge Chhabria warned that plaintiffs with strong market harm evidence will often win. Training is transformative, but market substitution at LLM scale is an unsettled question.
February 2025 — D. Delaware
Thomson Reuters v. ROSS Intelligence
Training a directly competitive AI product on a competitor's proprietary content is not fair use. Competitive substitution defeats the fair use defense.
Active — ongoing
New York Times v. OpenAI & Microsoft
Most consequential pending case. NYT has evidence of verbatim reproduction. A ruling against OpenAI would be the clearest signal yet that market-substituting LLM training is not fair use.
Active — filed Oct 2025
Reddit v. Perplexity AI
Introduces DMCA Section 1201 anti-circumvention claims against third-party scraping tools used to bypass platform blocks. If successful, significantly narrows what "public data" means for scraping purposes.