Rewriting agent search from scratch

2026-04-03 · 4 min read

Scraped agent databases go stale in roughly six weeks. We learned this the slow way: every time we shipped a sharper filter for agent search, the underlying data was already lying to us, and a better filter on bad data just produces wrong answers faster. So we tore out the scraped source and rebuilt it from the ground up.

Why agent data rots so fast

Literary agents are not a stable dataset. They open and close to queries by the season. They move between agencies. They take on a genre for two years and then quietly stop. An agent who was hungry for cozy fantasy in the spring may be closed to everything by the autumn, having just signed three. None of that shows up in a directory snapshot.

A scraper captures a moment: the agent's name, agency, a stale "wishlist" blurb, maybe a list of clients that hasn't been touched since the page was built. The day you scrape it, it's roughly right. Six weeks later it is confidently wrong in ways the writer can't see — and a confidently wrong recommendation is worse than no recommendation, because the writer acts on it. They spend a query, and a hopeful week, on an agent who closed to submissions a month ago or who hasn't repped this genre since 2019.

The thing a scraper can't capture

The deeper problem is that a directory describes what an agent says they want, and writers need to know what an agent actually represents. Those are different facts, and the gap between them is where most query mismatches live.

What an agent actually represents shows up in their recent deals: the books they have sold in the last year or two, to which imprints, in which categories. That signal is harder to assemble than a wishlist blurb, but it is the only one that reliably predicts whether your book is a fit. An agent who sold three upmarket book-club novels last year is a real lead for your upmarket book-club novel, regardless of what their bio says. An agent whose bio lists your genre but who hasn't sold a book in it for years is not.

What we rebuilt

So we stopped treating "the directory" as the source of truth and rebuilt around what an agent represents now. The new model is organized around three things that actually move: current submission status, stated interests, and — weighted most heavily — recent representation. We refresh on a cadence tuned to how fast each of those decays rather than re-scraping everything on a fixed clock, because status changes weekly while a representation history changes slowly.
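To make the idea of a decay-tuned cadence concrete, here is a minimal sketch. The field names, the `FieldRecord` type, and the intervals are all assumptions made up for illustration; the only thing taken from the text above is that submission status needs checking about weekly and representation history far less often.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical refresh intervals. Only the ordering (status decays fastest,
# representation history slowest) comes from the post; the numbers are guesses.
REFRESH_INTERVALS = {
    "submission_status": timedelta(days=7),
    "stated_interests": timedelta(days=30),
    "recent_representation": timedelta(days=90),
}


@dataclass
class FieldRecord:
    """One tracked field for one agent (hypothetical shape)."""
    name: str
    last_refreshed: datetime


def fields_due(records: list[FieldRecord], now: datetime) -> list[str]:
    """Return the names of fields whose last refresh is older than their interval."""
    return [
        r.name
        for r in records
        if now - r.last_refreshed >= REFRESH_INTERVALS[r.name]
    ]
```

With per-field intervals, a scheduler can re-check a fast-moving field many times before it touches a slow one, instead of re-fetching the whole record on one clock.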

That let us rebuild the search itself around fit rather than keyword soup. Instead of matching your "fantasy" against an agent's "fantasy" (a match that means almost nothing), the search reasons about what you wrote against what each agent has recently sold, the same way a thoughtful writer would if they had time to read every deal announcement. Comp-based matching, meaning matching your book against comparable titles an agent has sold, only works if the representation data underneath it is current; that's the whole reason the rebuild had to start at the data layer, not the filter.
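One way to picture scoring against recent sales rather than bio keywords is a recency-weighted count of an agent's deals. This is a rough sketch under stated assumptions, not the actual matching logic: the `Deal` type, the one-year half-life, and the exact category comparison are all hypothetical, and a real comp-based matcher would compare books far more richly than by a single category label.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deal:
    """A sold book, reduced to the two facts this sketch needs (hypothetical)."""
    category: str
    sold_on: date


def recency_weight(sold_on: date, today: date, half_life_days: float = 365.0) -> float:
    """Exponential decay: a deal loses half its weight every half_life_days."""
    age_days = max((today - sold_on).days, 0)
    return 0.5 ** (age_days / half_life_days)


def fit_score(manuscript_category: str, deals: list[Deal], today: date) -> float:
    """Sum the recency weights of deals in the manuscript's category.

    The agent's stated interests are deliberately not an input here:
    only what they have actually sold contributes to the score.
    """
    return sum(
        recency_weight(d.sold_on, today)
        for d in deals
        if d.category == manuscript_category
    )
```

Under this scheme an agent with three sales in your category last year outranks one whose only matching sale is several years old, whatever either bio claims.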

The tradeoff we chose

Freshness and coverage pull against each other. You can have a giant list that is mostly stale, or a smaller list you can stand behind. We chose the smaller, fresher one. A writer is far better served by twenty agents who genuinely fit and are genuinely open than by two hundred where most of the signal is a year out of date. The cost of a wrong recommendation in querying isn't a bad search result — it's a wasted submission and a longer road to representation.
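The choice of a smaller, fresher list can be sketched as a hard filter applied before ranking. Everything here is an assumption for illustration: the `Candidate` fields, the 14-day limit on how old a status check may be, and the cap of twenty (borrowed from the example above, not a documented setting).

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Candidate:
    """An agent as seen by the ranking step (hypothetical shape)."""
    name: str
    is_open: bool
    status_checked_on: date
    fit: float


def shortlist(
    candidates: list[Candidate],
    today: date,
    max_status_age: timedelta = timedelta(days=14),
    limit: int = 20,
) -> list[Candidate]:
    """Drop closed agents and agents with stale status, then keep the best fits."""
    fresh = [
        c
        for c in candidates
        if c.is_open and today - c.status_checked_on <= max_status_age
    ]
    return sorted(fresh, key=lambda c: c.fit, reverse=True)[:limit]
```

Note that an agent with a high fit score but an old status check is excluded rather than ranked lower, which is the coverage being traded away.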

The lesson

The thing we keep relearning, in agent search and everywhere else, is that the data model is the product. A beautiful interface on stale data is a liar with good manners. No amount of filter polish, ranking cleverness, or AI on top fixes a source that's describing a world six weeks gone. If the foundation isn't current, the right move isn't a better query box. It's to rebuild the foundation.
