Behind the scenes of an AI-Driven Web Scraping System
Introduction
We built an automated event aggregation platform, ingesting events from hundreds of partner venues, arts organizations, and ticketing platforms into a single searchable index. Every partner has a different website, a different structure, and a different set of rules for where event data lives.
We could not hand-craft scrapers for each one. So we built a system where an AI generates the scraping rules, validates its own output, and corrects itself when it gets things wrong.
This post covers what we learned: the real challenges of AI-driven scraping, the strategies that worked, and the ones that didn’t. It does not cover the platform’s architecture or system design.
Not All Pages Are Equal: Static HTML vs SPAs vs Lazy-Loaded Content
The first assumption that breaks is that a URL returns the content you need. There are actually three categories of page:
Static pages: the full HTML is in the HTTP response. curl works. These are straightforward.
Single-Page Applications (SPAs): the HTTP response is a skeleton with a <div id="app"></div> and a bundle of JavaScript. The actual content renders in the browser after the JS executes. React, Vue, and Angular sites fall here. A raw HTTP request returns nothing useful.
Server-rendered with lazy loading: the page delivers HTML, but some content (infinite scroll lists, tabs) loads only after user interaction or as the user scrolls. The initial response has partial content.
We had to move from simple HTTP requests to a headless browser (Playwright) that executes JavaScript and waits for content to stabilize before capturing the page.
The cost tradeoff is real: a headless browser fetch is slower than a raw HTTP request. A practical test before choosing your approach is to curl the URL and see if the content you need is in the response. If it isn’t, you need a browser.
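The curl test can be scripted. The sketch below is a minimal version using only the standard library; the function names and the idea of passing in a few text snippets you expect to see on the page (an event title, for example) are assumptions for illustration, not part of our pipeline.

```python
import urllib.request


def fetch_raw_html(url: str, timeout: float = 10.0) -> str:
    """Fetch the page the way curl would: one HTTP GET, no JavaScript executed."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def needs_browser(raw_html: str, expected_snippets: list[str]) -> bool:
    """Return True if any snippet we expect is missing from the raw response."""
    haystack = raw_html.lower()
    return any(snippet.lower() not in haystack for snippet in expected_snippets)


# An SPA skeleton contains no event text, so it would need a headless browser.
spa_skeleton = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'
print(needs_browser(spa_skeleton, ["Jazz Night"]))
```

In practice you would call `needs_browser(fetch_raw_html(url), [...])` once per partner site and only pay for a headless browser when the check returns True.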
The Libraries Doing the Work
These are the categories of tools involved. The specific libraries are interchangeable, but these are the ones we used:
- Headless browser automation: Playwright for executing JavaScript and capturing rendered HTML.
- Browser fingerprint spoofing: playwright-stealth to make headless Chromium look like a real browser. It fakes canvas fingerprint, WebGL, navigator properties, and timing APIs so the browser appears human.
- HTML parsing: BeautifulSoup + lxml for navigating the DOM, evaluating CSS selectors, and extracting text.
- Structured LLM output: Pydantic models that define the exact shape you expect from the LLM. If the model returns something malformed, Pydantic rejects it before it reaches your pipeline. Not strictly required, but it enforces a contract between the LLM and the rest of your pipeline that pays off quickly.
- Agent orchestration: LlamaIndex for building the agentic loops: tool-calling, retry logic, and multistep DOM exploration. There are lighter tools if all you need is model switching; LlamaIndex earns its weight when you need agent behavior.
Asking the LLM to Generate Its Own Selectors
The core idea: instead of writing a CSS selector for each site by hand, give the LLM the page HTML and ask it to figure out the pattern.
Most partner sites have two kinds of pages that matter. A listing page shows upcoming events as a grid or list; this is where pagination lives and where you discover the full set of events. Detail pages are individual event pages, one per event, with the full title, date, venue, price, and description.
In practice, you give the LLM the HTML of both: the listing page, so it can identify the selector for event cards and the URL pattern that links to detail pages; and the HTML of 2-3 sample detail pages, so it can map each field to a CSS selector or JSON-LD path.
The LLM outputs a scrape configuration: which CSS selector identifies event cards on the listing page, what URL pattern distinguishes event links from navigation links, and how to paginate, plus field mappings that specify how to extract each piece of data (title, date, venue, price) from the detail page.
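To make the shape of that configuration concrete, here is a rough sketch. It uses standard-library dataclasses as a stand-in for the Pydantic models mentioned earlier, and every field name (`event_card_selector`, `event_url_pattern`, `next_page_selector`, `json_ld_path`) is a hypothetical choice for illustration, not our actual schema.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldMapping:
    """How to extract one field from a detail page."""
    name: str
    css_selector: Optional[str] = None
    json_ld_path: Optional[str] = None

    def __post_init__(self) -> None:
        # Exactly one extraction strategy must be given per field.
        if bool(self.css_selector) == bool(self.json_ld_path):
            raise ValueError(f"{self.name}: set exactly one of css_selector or json_ld_path")


@dataclass
class ScrapeConfig:
    """Listing-page rules plus per-field mappings for detail pages."""
    event_card_selector: str
    event_url_pattern: str  # regular expression matched against link URLs
    next_page_selector: Optional[str] = None
    fields: List[FieldMapping] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.event_card_selector.strip():
            raise ValueError("event_card_selector must not be empty")
        re.compile(self.event_url_pattern)  # raises re.error on an invalid pattern

    def is_event_url(self, url: str) -> bool:
        return re.search(self.event_url_pattern, url) is not None


config = ScrapeConfig(
    event_card_selector="article.event-card",
    event_url_pattern=r"/events/[\w-]+$",
    next_page_selector="a[rel='next']",
    fields=[
        FieldMapping("title", json_ld_path="$.name"),
        FieldMapping("price", css_selector="span.price"),
    ],
)
```

Rejecting a malformed config at construction time means a bad LLM response fails loudly at the boundary instead of somewhere deep in the pipeline.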
The gap that will surprise you: LLMs produce selectors that look correct but don’t work. In our experience, first-attempt selectors failed to yield any data roughly 30-40% of the time. The LLM is good at pattern recognition but has no way to verify its output against the real DOM.
The solution is validation-first design: every generated selector gets tested against the actual page before anything is stored. If the selector yields zero results, the LLM gets diagnostic feedback and tries again with a fixed retry limit.
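A minimal sketch of that loop follows. The two callables are placeholders: `propose` stands for whatever calls your LLM (it receives the previous failure message, or None on the first attempt), and `count_matches` stands for whatever evaluates the selector against the real page. The default of three attempts is an assumption for this example.

```python
from typing import Callable, Optional


def generate_validated_selector(
    propose: Callable[[Optional[str]], str],
    count_matches: Callable[[str], int],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first proposed selector that matches something, or None."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        selector = propose(feedback)
        if count_matches(selector) > 0:
            return selector  # validated against the page, safe to store
        # Tell the model what went wrong instead of re-sending the same prompt.
        feedback = (
            f"Attempt {attempt}: selector {selector!r} matched 0 elements "
            "on the real page."
        )
    return None  # retry limit reached; the caller decides how to escalate
```

Nothing is stored unless the function returns a selector, so an unverified guess never reaches the database.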
The Cost of Sending Full HTML to an LLM
HTML is verbose. A single event listing page can easily be 200-400k characters. Depending on which LLM you’re using, sending that on every retry may be not just expensive but impossible, if the page exceeds the model’s context window.
A few things that help:
Pre-clean the HTML before sending. Strip <style>, <noscript>, <svg>, <meta>, and <iframe> tags. The LLM needs the DOM structure, not the CSS bundles or analytics tags. This typically reduces HTML size by 3-5x. Be careful with <script> tags, though: JSON-LD structured data lives in <script type="application/ld+json"> blocks, so stripping all scripts indiscriminately will remove content you may need. Strip inline JavaScript, but preserve script tags with a type attribute that isn’t text/javascript.
Know the context window limits of your LLM. For some pages, no amount of cleaning solves the problem: you either clean away content you need, or you never clean enough to fit the model’s context window. The fix for those cases was a DOM-exploration agent: instead of sending the full page, it downloads the HTML to a temporary file and calls targeted tools (count_selector, dom_excerpt, find_repeating_blocks) to answer its questions without loading the whole document into context.
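Here is one way the pre-cleaning step might look. This is a regex-based sketch using only the standard library, so it will miss edge cases (self-closing iframes, nested SVGs) that a real parser such as BeautifulSoup handles better. Treating `application/javascript` and `module` as JavaScript alongside `text/javascript` is an assumption added for this example.

```python
import re

# Paired tags whose whole contents can be dropped.
_PAIRED = re.compile(
    r"<(style|noscript|svg|iframe)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_META = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
_SCRIPT = re.compile(r"<script\b([^>]*)>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Matches a bare `type` attribute, not `data-type` or similar.
_TYPE_ATTR = re.compile(r"""(?<![\w-])type\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)

# Assumption: these type values all mean "executable JavaScript".
JS_TYPES = {"text/javascript", "application/javascript", "module"}


def _keep_data_scripts(match: re.Match) -> str:
    """Keep scripts with a non-JavaScript type (such as JSON-LD), drop the rest."""
    type_match = _TYPE_ATTR.search(match.group(1))
    if type_match and type_match.group(1).lower() not in JS_TYPES:
        return match.group(0)
    return ""


def preclean_html(html: str) -> str:
    html = _PAIRED.sub("", html)
    html = _META.sub("", html)
    return _SCRIPT.sub(_keep_data_scripts, html)
```

Scripts without any type attribute are treated as inline JavaScript and removed, which matches the rule described above.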
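To give a feel for what such tools can look like, here is a guess at two of them, written against the standard library. Only the tool names come from our system; the signatures, parameters, and behavior below are assumptions made for this sketch. `count_selector` is left out because it needs a CSS selector engine such as the one BeautifulSoup provides.

```python
from collections import Counter
from html.parser import HTMLParser


class _SignatureCounter(HTMLParser):
    """Counts how often each tag-plus-classes signature appears."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if classes:
            self.counts[tag + "." + ".".join(classes.split())] += 1


def find_repeating_blocks(html: str, min_repeats: int = 3, limit: int = 5):
    """Return up to `limit` (signature, count) pairs that repeat often enough
    to be candidate event cards."""
    parser = _SignatureCounter()
    parser.feed(html)
    repeated = [(sig, n) for sig, n in parser.counts.most_common() if n >= min_repeats]
    return repeated[:limit]


def dom_excerpt(html: str, needle: str, radius: int = 200) -> str:
    """Return a small window of HTML around the first occurrence of `needle`."""
    index = html.find(needle)
    if index == -1:
        return ""
    return html[max(0, index - radius): index + len(needle) + radius]
```

Each call returns a few hundred characters at most, so the agent can ask many questions about a very large page without the page itself ever entering the context window.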
Separate selector generation from data extraction. The LLM’s job is to produce selectors. The actual URL extraction and field parsing is done deterministically by your code. The LLM does not need to see every event URL or every piece of extracted data.
When HTML Structure Changes, Scrapers Break
This is the fundamental fragility of CSS-selector-based scraping, with or without AI.
The most common failure mode is hashed class names. Site builders and UI libraries such as Wix and Chakra UI generate class names that include a hash, and those names change with every build. A selector like div.qElViY works today and breaks tomorrow when the site redeploys.
The hierarchy of selector stability, from most to least durable:
- JSON-LD structured data: a declared schema contract (covered in the next section)
- data-testid attributes: added by developers explicitly for automation; rarely change
- id attributes: generally stable
- Semantic HTML elements: h1, address, time, article
- itemprop / schema.org attributes: part of a public SEO contract
- Named class prefixes: platform-specific stable classes like tn- (TheaterManager) or ot_ (OvationTix)
- Layout/visual classes: the first thing to break on a redesign
When building an AI-driven scraper, instruct the LLM explicitly to prefer selectors higher in this hierarchy. The more your selector relies on visual styling classes, the shorter its lifespan.
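Beyond prompting, the same hierarchy can be applied in code to rank candidate selectors. The sketch below is a rough heuristic, not a CSS parser: it labels a selector by its least durable ingredient, ignores tag names when an id or attribute is present, and treats any class without a known stable prefix as a layout class. The function names and the regex rules are assumptions for illustration. JSON-LD is left out because it is a JSON path rather than a CSS selector.

```python
import re

# Most durable first, following the hierarchy above.
TIER_ORDER = ["data-testid", "id", "semantic", "itemprop", "named-prefix", "layout-class"]
STABLE_CLASS_PREFIXES = ("tn-", "ot_")


def weakest_tier(selector: str) -> str:
    """Classify a selector by the least durable thing it depends on."""
    classes = re.findall(r"\.(-?[A-Za-z_][\w-]*)", selector)
    if any(not name.startswith(STABLE_CLASS_PREFIXES) for name in classes):
        return "layout-class"
    if classes:
        return "named-prefix"
    if re.search(r"\[\s*itemprop\b", selector):
        return "itemprop"
    has_id = "#" in selector
    has_testid = re.search(r"\[\s*data-testid\b", selector) is not None
    if not has_id and not has_testid:
        return "semantic"  # nothing left but tag names
    return "id" if has_id else "data-testid"


def most_durable(candidates: list[str]) -> str:
    """Pick the candidate whose weakest ingredient sits highest in the hierarchy."""
    return min(candidates, key=lambda s: TIER_ORDER.index(weakest_tier(s)))
```

A check like this is useful as a second line of defense: if the LLM ignores the instruction and returns `div.qElViY`, the score makes that visible before the selector is stored.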
JSON-LD: The Cheat Code You Should Always Check First
Many well-maintained sites embed structured data in their pages as JSON-LD, following the schema.org vocabulary. For events, it may look like:
<script type="application/ld+json">
{
"@type": "Event",
"name": "Jazz Night at the Garden",
"startDate": "2026-06-15T19:30:00",
"location": {
"name": "Rehman's Garden",
"address": { "streetAddress": "54th & Blvd" }
},
"offers": { "price": 25 }
}
</script>
When this is present, you don’t need CSS selectors for those fields at all. JSON path expressions ($.name, $.startDate, $.location.address.streetAddress) are stable, unambiguous, and immune to frontend redesigns.
The catch: two subtle failure modes.
First, some frameworks inject HTML comment markers (<!---->) inside <script> tags as part of their server-side rendering process. json.loads raises JSONDecodeError on these, and if you’re swallowing that exception, the failure is silent: the page simply appears to have no structured data. The fix is to strip HTML comments from the script content before parsing.
Second, the structured data might be incomplete or absent for some pages on the same site, so you need a CSS fallback for those cases.
Always check for JSON-LD before writing a selector. It’s a contract the site owner maintains for search engines, and it benefits you too.
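A minimal sketch of JSON-LD extraction that handles the comment-marker problem is shown below. It uses only the standard library; the function names are made up for this example, and matching on an exact "@type" of "Event" is a simplification (real pages may use subtypes such as MusicEvent, a list of types, or an @graph wrapper).

```python
import json
import re

_JSON_LD = re.compile(
    r"""<script\b[^>]*\btype\s*=\s*["']application/ld\+json["'][^>]*>(.*?)</script\s*>""",
    re.IGNORECASE | re.DOTALL,
)
_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def extract_json_ld(html: str):
    """Return (parsed_blocks, errors). Errors are collected, not swallowed."""
    blocks, errors = [], []
    for raw in _JSON_LD.findall(html):
        # Remove SSR comment markers such as <!----> before parsing.
        cleaned = _HTML_COMMENT.sub("", raw).strip()
        try:
            data = json.loads(cleaned)
        except json.JSONDecodeError as exc:
            errors.append(str(exc))
            continue
        # A block may hold a single object or a list of objects.
        blocks.extend(data if isinstance(data, list) else [data])
    return blocks, errors


def find_events(blocks):
    return [b for b in blocks if isinstance(b, dict) and b.get("@type") == "Event"]
```

Returning the parse errors alongside the data keeps the failure visible, so a page with broken JSON-LD can be told apart from a page with none and routed to the CSS fallback deliberately.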
Bot Detection: The Wall You Will Eventually Hit
There are two tiers of bot protection, and they require different responses.
Passive detection identifies bots by their browser fingerprint: missing or inconsistent values for canvas rendering, WebGL, navigator properties, and timing APIs. Headless Chromium out of the box has tells. playwright-stealth and similar libraries patch these, and they work well against passive detection.
Active detection (DataDome, Cloudflare Enterprise, PerimeterX) operates at the network and TLS level. It analyzes IP reputation, TLS fingerprint, request timing patterns, and mouse movement. Stealth plugins do not help here. The response is either a CAPTCHA iframe or an outright block.
No amount of selector tuning, wait strategy adjustments, or stealth improvements will make a difference, because the block is at the infrastructure level.
The practical responses when you hit an active block:
- Check whether the platform has a public API.
- Use a residential proxy service, which routes traffic through real ISP IPs that have better reputation scores.
- Establish a data partnership with the platform directly.
The architectural lesson: bot detection is an infrastructure problem, not a scraping logic problem. Recognize it early and don’t waste time on selector improvements that won’t help.
APIs Are More Reliable Than Scrapers
Before building a scraper for any site, spend some time checking whether an API or structured feed already exists.
For our aggregator, a few partners we initially planned to scrape turned out to have public APIs. Switching to the API gave us more data fields, no bot risk, and a feed that doesn’t break when someone updates the homepage design.
The checklist before scraping:
- Does the platform have documented API endpoints?
- Is there an RSS feed?
- Does the page embed an external widget (Tockify, Humanitix, Eventbrite) that exposes its own API?
- Does the page load its events from an internal JSON API? Even if undocumented, an agent can detect this from network requests and produce an API-based config, with no HTML parsing required.
- Does the page contain JSON-LD structured data?
Scraping is the right tool when the answer to all of these is no. Typically this occurs when a venue runs its own custom-built website with no third-party ticketing integration.
The Correction Loop: Human-in-the-Loop Validation
A one-shot LLM pass is not enough for production. The extraction needs a feedback mechanism.
The approach that worked for us:
- The LLM generates an initial scrape configuration
- The configuration is tested against the real site: does it extract the expected fields?
- A human reviewer sees the extracted data and marks what’s wrong
- The corrections (expected values) are fed back to the LLM with diagnostic context
- The LLM revises the configuration
- Repeat up to a fixed retry limit
The human review step is not a failure of automation. It’s where data quality is established. The LLM gets the pattern right 50-60% of the time on the first pass; the correction loop closes the gap.
The retry limit matters. Without a ceiling, the loop can run indefinitely in some cases. We set a limit of 3 correction attempts before escalating to a human with a structured failure report.
Diagnosing Extraction Failures: Don’t Just Retry Blindly
When the LLM’s selector fails, the worst response is to send the same prompt again and hope for a different answer. You need to tell the LLM why it failed.
Most extraction failures fall into a small number of recognizable categories:
- Extracted too little: the selector matched only a fragment of the expected value (the LLM was too specific)
- Extracted too much: the selector matched a container that includes the target plus surrounding noise
- Wrong DOM region: the selector matched a real element, but not the one that contains the event data (e.g., matched a venue name in the page footer instead of the event header)
- Attribute vs. text confusion: the selector extracted a URL from an href attribute when you needed the link’s visible text
- Template mismatch: the sample pages the LLM was given don’t represent the full range of page templates on the site
Classifying the failure before retrying allows you to give the LLM targeted guidance: “the selector is too narrow, find a parent element”, or “you matched the wrong section of the page, the event header is in the top portion”. Targeted feedback produces better corrections than “try again”.
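As an illustration, four of these categories can be approximated by comparing the extracted value with the value the reviewer expected. The sketch below is a simplified heuristic with made-up names, not our production classifier. Template mismatch is not covered, because it only shows up when you compare results across several pages.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    TOO_LITTLE = "the selector is too narrow, find a parent element"
    TOO_MUCH = "the selector matched a container, target a more specific child"
    ATTRIBUTE_VS_TEXT = "you extracted an attribute value, extract the visible text instead"
    WRONG_REGION = "you matched the wrong section of the page"


def _normalize(value: str) -> str:
    # Collapse whitespace and ignore case so cosmetic differences do not count.
    return " ".join(value.split()).lower()


def classify_failure(extracted: str, expected: str) -> Optional[Failure]:
    """Return the failure category, or None if the values already agree."""
    got, want = _normalize(extracted), _normalize(expected)
    url_prefixes = ("http://", "https://", "/")
    if got == want:
        return None
    if got.startswith(url_prefixes) and not want.startswith(url_prefixes):
        return Failure.ATTRIBUTE_VS_TEXT
    if got and got in want:
        return Failure.TOO_LITTLE
    if want and want in got:
        return Failure.TOO_MUCH
    # No overlap at all, including an empty extraction.
    return Failure.WRONG_REGION


failure = classify_failure("Jazz Night", "Jazz Night at the Garden")
hint = failure.value if failure else "no correction needed"
```

The `value` of each category doubles as the guidance text sent back to the LLM, so the diagnosis and the retry prompt stay in sync.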
Conclusion
AI-driven scraping works, but it works best as a structured system, not a single LLM call. The LLM’s role is pattern recognition and selector generation. Validation, diagnosis, retry logic, and cleaning are engineering problems that sit around the LLM, not inside it.
The system that shipped looked nothing like the system we initially designed. Every architectural decision in this post was forced by a real-world edge case we didn’t anticipate. That’s the nature of scraping at scale: the web is not a uniform surface, and the gaps between what you expect and what you find are where the real engineering happens.
If you’re unsure where AI fits into your systems, our team at OmbuLabs.ai can help. Let’s talk. 🤖