The question teams should be asking in 2026 is not "is web scraping legal?" — it's "how do we build a training dataset that is defensible, documented, and actually good?" The difference matters enormously. A legally sound dataset is not just one that avoids lawsuits. It's one with clear provenance, the right licenses, consistent quality signals, and a curation pipeline you can explain to a judge, a regulator, or an enterprise customer asking about data lineage.
This guide is structured as a practical playbook. It covers the four ways to source training data (open datasets, web scraping, licensed data, and synthetic generation), how to assess and tier your sources by legal posture, the technical pipeline for scraping and curating responsibly, and the compliance documentation every AI team needs to maintain.
## Licensing Tiers: Know Your Source Before You Touch It
Before anything else, every data source needs a license assessment. There are four tiers that determine how freely you can use content for LLM training. The goal of a well-built dataset is to maximise the share of Tier 1 and Tier 2 sources — and have documented legal rationale for anything in Tier 3.
Tier 4 is the do-not-train category. Content behind an explicit AI-crawler opt-out (`User-agent: GPTBot` with `Disallow: /` in `robots.txt`, or equivalent) must be excluded — legally required in the EU under DSM Art. 4, best practice everywhere else. Scraped PII without a lawful basis (names, emails, photos) violates GDPR and CCPA regardless of public visibility. Piracy-sourced content (shadow libraries, torrented material) destroys any fair use argument and was the central issue in Bartz v. Anthropic.
Tier assessment should happen at ingestion time — before content enters your pipeline. Retroactively removing Tier 4 content from an already-trained model is not currently possible; the only remedy is retraining from a clean dataset.
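Opt-out checks can be automated at ingestion time. A minimal sketch using Python's standard `urllib.robotparser`; the agent list is illustrative, not exhaustive — adjust it to the AI crawlers your pipeline actually checks for:

```python
from urllib import robotparser

# AI crawler user agents to check — illustrative, not exhaustive.
AI_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]

def is_opted_out(robots_txt: str, url: str = "https://example.com/") -> bool:
    """True if this robots.txt disallows `url` for any listed AI agent."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return any(not rp.can_fetch(agent, url) for agent in AI_AGENTS)

# A site that blocks GPTBot at the root is Tier 4 for training purposes:
blocked = is_opted_out("User-agent: GPTBot\nDisallow: /")
```

Running this check before a document enters the pipeline is what makes the exclusion auditable: the decision is recorded at ingestion, not reconstructed after a dispute.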
## Open Datasets: Your Tier 1 Foundation
A significant amount of high-quality, legally unambiguous training data already exists as open datasets. For most LLM training projects, open datasets should form the backbone — providing volume, domain coverage, and legal clarity — before any custom web scraping begins.
| Dataset | Domain | License | Size / Notes |
|---|---|---|---|
| Common Crawl | General web | Open | Petabyte-scale monthly snapshots. Raw quality — requires heavy filtering. Foundation of GPT, LLaMA, and most frontier models. |
| Wikipedia (all languages) | Encyclopedic | CC-BY-SA | ~21GB (English). Clean, factual, multilingual. Check derivative licensing if publishing your dataset. |
| arXiv bulk access | Science / Research | CC-BY | 2M+ papers. LaTeX source available. Requires registration with arXiv S3 bulk access program. |
| GitHub (open source) | Code | MIT / Apache | Filter to permissive licenses only. The Stack (BigCode) provides pre-filtered, license-verified code datasets. |
| The Pile (EleutherAI) | Mixed | Open | 825GB curated from 22 datasets. Well-documented sourcing. Some sub-components have restrictions — check per-subset license. |
| ROOTS / BLOOM corpus | Multilingual | CC-BY | 1.61TB, 59 languages. Built by BigScience with explicit data governance documentation — a model for how to document a dataset. |
| RedPajama v2 | General web | Open | 30T tokens from Common Crawl with quality annotations. Ready-to-use with quality signal metadata. |
| Dolma (AI2) | Mixed | Open | 3T tokens. Apache 2.0. From Allen Institute — full data documentation and ethics review published. |
| OpenWebText2 | Web (Reddit-curated) | Open | Recreates GPT-2's WebText approach using upvoted Reddit links. Higher quality signal than raw crawl. |
| OSCAR (INRIA) | Multilingual web | Open | Filtered from Common Crawl by language. Available via Hugging Face. Well-documented deduplication. |
| Project Gutenberg | Books / Literature | Public Domain | 70,000+ pre-1928 public domain works. Use PG-19 (DeepMind subset) for pre-processed access. Not representative of modern language. |
| US Government / EU Open Data | Legal / Government | Public Domain | data.gov, regulations.gov, EUR-Lex, CourtListener (court opinions). All public domain or CC0. Excellent for legal and civic domains. |
| PubMed Central (Open Access) | Biomedical | CC-BY / CC0 | Millions of full-text OA articles. Subset available via API. Filter to CC-BY or CC0 only. |
| Multilingual LibriSpeech (text) | Books (speech-derived) | CC-BY 4.0 | Text transcripts from audiobooks. Derived from LibriVox, which uses public domain texts. |
| C4 (Colossal Clean Crawled) | Web (filtered) | Open | T5's training corpus. 750GB of English web text after aggressive heuristic filtering. Widely benchmarked. |
When using multi-component datasets like The Pile or ROOTS, document which sub-components you are using — license terms vary by subset. HuggingFace Datasets provides machine-readable license metadata for most listed datasets.
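One way to enforce per-subset documentation is to explode composite datasets into a per-source license ledger before ingestion. A minimal sketch — the source names, parent labels, license strings, and allow-list are illustrative placeholders for your own policy, not authoritative metadata:

```python
# Licenses this (hypothetical) policy treats as permissive for training.
PERMISSIVE = {"public-domain", "cc0", "cc-by", "mit", "apache-2.0"}

# Composite datasets are tracked one row per sub-component.
sources = [
    {"name": "gutenberg", "parent": "the-pile", "license": "public-domain"},
    {"name": "arxiv",     "parent": "the-pile", "license": "cc-by"},
    {"name": "books3",    "parent": "the-pile", "license": "unclear"},
]

def usable(source: dict) -> bool:
    """Keep only sub-components whose license is on the allow-list."""
    return source["license"] in PERMISSIVE

kept = [s["name"] for s in sources if usable(s)]  # the "unclear" row is excluded
```

The ledger doubles as documentation: the rows you exclude, and why, are exactly what a regulator or enterprise customer will ask about.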
## Web Scraping for Training Data: Doing It Right
For domains not covered by existing open datasets — recent news, niche technical content, contemporary fiction, non-English web — custom web scraping is often the only way to get coverage. The key is not to avoid scraping, but to scrape in a way that is documented, respects opt-outs, and can withstand scrutiny.
"The goal is not to avoid scraping. It's to scrape in a way that a reasonable judge would see as acting in good faith." — General principle from hiQ v. LinkedIn, applied to AI training contexts
- Check `robots.txt` — respect all `Disallow` directives, especially `User-agent: *` and AI-specific agents
- Check ToS for explicit scraping / AI training prohibitions. Log your review decision per domain
- Identify whether the site requires login — if yes, stop. Authenticated scraping triggers CFAA risk
- Assign a Tier (1–4) to the domain before queueing it
- Check if the domain has explicitly opted out of AI crawling via `ai.txt`, meta tags, or HTTP headers
- Use an honest, identifiable `User-Agent` string (e.g. `YourBot/1.0 (+https://yourco.com/bot)`)
- Wait at least 1–2 seconds between requests per domain, more for smaller sites
- Respect `Crawl-delay` and `Request-rate` directives in `robots.txt`
- Do not use IP rotation to evade blocks — treat a block as an opt-out signal
- Log every request with timestamp, URL, and HTTP response code
- Strip PII at extraction time: names, emails, phone numbers, addresses, user IDs
- Check HTTP headers for `X-Robots-Tag: noindex` or `noai` — treat these as Tier 4
- Do not follow redirect chains beyond 3 hops — they can inadvertently lead into gated content
- Capture the full canonical URL and fetch date for every document — critical for provenance records
- Record the `robots.txt` state at crawl time (a site's robots.txt changes over time — the state at fetch time is what matters for the legal record)
- Run a second-pass PII scan across the full corpus before training ingestion
- Apply near-deduplication (MinHash LSH or similar) to reduce memorisation risk
- Filter by language confidence score — low-confidence multilingual content degrades quality
- Run quality heuristics: remove boilerplate, nav text, cookie banners, and ad copy
- Tag every document with source domain, crawl date, tier classification, and license field
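Several of the conduct items above — an honest `User-Agent`, a per-domain delay, request logging, treating an error as an opt-out signal — can be sketched in a single fetch loop. This is a simplified illustration using only the Python standard library; `YourBot` and `yourco.com` are the placeholder identity from the checklist, and a production crawler also needs robots.txt enforcement, retries, and a real politeness policy:

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

USER_AGENT = "YourBot/1.0 (+https://yourco.com/bot)"
CRAWL_DELAY = 2.0  # seconds between requests to the same domain

def build_request(url: str) -> urllib.request.Request:
    """An honest, identifiable request — no spoofed browser user agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def log_record(url: str, status: int) -> dict:
    """One provenance entry per request: timestamp, URL, response code."""
    return {
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "status": status,
    }

def crawl(urls: list[str]) -> list[dict]:
    """Fetch a single-domain queue politely, logging every request."""
    log = []
    for url in urls:
        try:
            with urllib.request.urlopen(build_request(url), timeout=10) as resp:
                log.append(log_record(url, resp.status))
        except urllib.error.HTTPError as err:
            # A 403 or 429 is an opt-out signal: record it and stop crawling
            # this domain — do not rotate IPs to get around it.
            log.append(log_record(url, err.code))
            break
        time.sleep(CRAWL_DELAY)
    return log
```

The request log produced here is not just debugging output — it is the provenance record your compliance documentation will later depend on.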
Never use third-party scraping-as-a-service tools that route through proxies or claim to bypass blocks — the Reddit v. Perplexity case (2025) introduced DMCA anti-circumvention claims specifically around this pattern.
## Tools for Building Training Datasets
The ecosystem for dataset engineering has matured significantly. These are the core tools used by teams building production-grade training corpora.
## Licensed & Partnership Data
For high-value content domains — quality journalism, books, scientific literature, financial data — licensing agreements are increasingly the right answer. The deals OpenAI, Google, and Apple have signed with major publishers are not just PR moves. They are legal risk elimination for a category of content that would otherwise sit squarely in Tier 3.
### When to Pursue a Licensing Agreement vs Relying on Fair Use
Licensing makes sense when: (1) the content domain is central to your model's intended capabilities — e.g. a legal AI that needs clean case law, or a news AI that needs current articles; (2) the rights holder is likely to notice and object — large publishers actively monitor for LLM reproduction; (3) you are building a commercial product that will directly compete in the rights holder's market.
For large-scale web data licensing, the standard models are: bulk dataset licenses (one-time or annual fee for a snapshot), API access licenses (ongoing programmatic access, often with restrictions on model redistribution), and revenue-sharing agreements (emerging model where publishers receive a share of AI product revenue). AP, Reuters, and the Financial Times have all signed deals in the latter two categories.
For smaller publishers and individual creators, opt-in registries are emerging — similar to how music licensing works through ASCAP/BMI. The ai.txt standard and the Spawning API allow rights holders to signal preferences and negotiate terms programmatically. Building your crawler to check these signals now positions you for the licensing infrastructure that is coming.
## Synthetic Data: When to Generate, Not Collect
Synthetic data generation — using an existing LLM to produce training examples — has become a mainstream component of dataset pipelines, particularly for instruction tuning and RLHF. It solves certain legal problems cleanly but introduces others. The rules here are different from scraping.
The most defensible synthetic data pipelines generate content that has no equivalent real-world copyright holder — e.g. novel reasoning traces, formatted instruction data, or domain-specific factual Q&A grounded in public domain sources.
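Provenance still matters for synthetic data: each generated example should record which model produced it and which public domain source grounded it. A minimal sketch — the field names, model name, and grounding ID are illustrative placeholders, not a standard schema:

```python
from datetime import date

def tag_synthetic(example: dict, generator: str, grounding_source: str) -> dict:
    """Attach the metadata that makes a synthetic example auditable."""
    return {
        **example,
        "synthetic": True,
        "generator_model": generator,          # which model produced it
        "grounding_source": grounding_source,  # the text it was grounded in
        "generated_on": date.today().isoformat(),
    }

record = tag_synthetic(
    {"question": "Who wrote Middlemarch?", "answer": "George Eliot"},
    generator="in-house-llm-v2",               # placeholder model name
    grounding_source="gutenberg:middlemarch",  # placeholder source ID
)
```

Tagging at generation time also lets you segregate synthetic data later — useful if the terms of the generator model, or the rules around model-generated training data, change.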
## The Full Curation Pipeline
A legal training dataset is not just a legal data source — it's also a documented, reproducible pipeline. Here is the end-to-end workflow used by teams building production corpora, with notes on which steps are primarily engineering concerns and which have legal significance.
Steps 4 and 7 have direct legal significance — their absence is the most common gap in AI team compliance practices. Step 8 (documentation) is now mandatory in the EU and is increasingly requested by enterprise customers as part of AI procurement due diligence.
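Near-deduplication is one pipeline step worth illustrating. A toy version using character-shingle Jaccard similarity — in production you would use MinHash LSH (e.g. via a library such as datasketch) to make the pairwise comparison scale, but the underlying quality signal is the same:

```python
def shingles(text: str, n: int = 3) -> set[str]:
    """Character n-grams of the whitespace-normalised text."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag pairs above the similarity threshold for removal or review."""
    return jaccard(a, b) >= threshold
```

Removing near-duplicates reduces memorisation risk: a passage repeated many times in the corpus is far more likely to be reproduced verbatim at inference time, which is exactly the behaviour copyright plaintiffs look for.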
## Documentation: Your Legal & Compliance Record
The EU AI Act, which took effect for general-purpose AI models in 2025, requires providers to document their training data sources. But even outside Europe, dataset documentation has become a commercial necessity — enterprise customers are asking for data lineage as part of procurement, and the ability to produce clear documentation is increasingly a competitive differentiator.
At minimum, your dataset record should contain: a source inventory mapping every significant source to its tier, license, and review decision; a data collection log with dates, tools, and configurations; a PII treatment record documenting tools used, scan coverage, and removal rates; an opt-out compliance log showing when and how opt-out checks were run; and a dataset card in HuggingFace or equivalent format that synthesises all of the above.
This documentation does three things simultaneously. It satisfies regulatory requirements. It gives your legal team what they need to construct a fair use defense quickly if challenged. And it gives enterprise customers the data lineage assurance they need to deploy your model internally.
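In practice the source inventory works best as machine-readable records rather than prose. A sketch of one entry — every field name and value here is an illustrative placeholder to adapt to your own dataset card format:

```python
import json

# One source inventory entry: tier, license, review decision, and the
# PII / opt-out evidence described above. All values are placeholders.
entry = {
    "source": "arxiv-bulk",
    "tier": 1,
    "license": "CC-BY",
    "review_decision": "approved",
    "reviewed_by": "legal-team",          # placeholder reviewer
    "reviewed_on": "2026-01-15",          # placeholder date
    "collection_tool": "arxiv-s3-sync",   # placeholder tool name
    "pii_scan": {"tool": "regex+ner", "coverage": 1.0, "removal_rate": 0.002},
    "opt_out_checked": True,
}

inventory_json = json.dumps(entry, indent=2)  # commit alongside the dataset card
```

Keeping the inventory as structured data means the dataset card, the regulator response, and the customer due-diligence answer can all be generated from the same source of truth.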
## Key Case Law Reference
The following cases are the primary legal authorities governing web scraping and LLM training data as of February 2026. For full analysis, see the companion article on legal risks.