Engineering

From URL to Production: Our Technical Journey

November 2025 · 15 min read · TEZELA Team

The first time we got TEZELA working end-to-end, it took 47 minutes to process a single URL. The output was technically correct but looked like someone had written React components in their sleep. We shipped it anyway—to exactly zero users—and learned more in the next two weeks than the previous two months combined.

This is the story of building a system that takes a URL as input and spits out production-ready code on the other end. It's about the things that broke, the assumptions that were wrong, and the weird problems you only discover when real users start submitting URLs to sites you'd never imagine existed.

The prototype that validated nothing

Our MVP was embarrassingly simple: fetch HTML with axios, parse it with cheerio, template out some React components. It worked great on the three hand-picked websites we tested against. It completely fell apart on site number four.

The problem? Modern websites don't work that way. A raw HTML fetch gets you the server-rendered shell, but most of the content lives in JavaScript land. We were trying to extract content from React apps by parsing the HTML they shipped before hydration. It's like trying to understand a movie by reading the LOADING screen.

We needed a real browser. So we reached for Playwright and spun up headless Chrome instances. Suddenly we could see the actual rendered content, but now we had a new problem: knowing when the page was "done."

Turns out, pages are never really done. Lazy-loaded images, infinite scroll, analytics pixels that fire ten seconds after load—there's always something happening. We started with networkidle events, moved to watching for DOM mutations, and eventually landed on a hybrid approach that's still imperfect but works well enough. The threshold we settled on: wait 2 seconds after the last meaningful network activity or DOM change, whichever comes last. It's arbitrary, but it catches 95% of content on 99% of sites.

The other 5% taught us humility.
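
For the curious, the settle check looks roughly like this. It's a simplified sketch using Playwright's sync Python API: the real version also filters out noise like analytics pings, and the 30-second cap here is just illustrative.

import time
from playwright.sync_api import sync_playwright

SETTLE_SECONDS = 2.0    # the arbitrary quiet window described above
MAX_WAIT_SECONDS = 30   # illustrative overall cap

def wait_for_settle(page):
    last_activity = {"t": time.monotonic()}

    def bump(*_args):
        last_activity["t"] = time.monotonic()

    # Any request or response counts as network activity in this sketch.
    page.on("request", bump)
    page.on("response", bump)

    # Count DOM mutations with a MutationObserver inside the page, then poll it.
    page.evaluate("""() => {
        window.__mutations = 0;
        new MutationObserver(() => { window.__mutations++; })
            .observe(document, { childList: true, subtree: true, attributes: true });
    }""")

    seen = 0
    deadline = time.monotonic() + MAX_WAIT_SECONDS
    while time.monotonic() < deadline:
        mutations = page.evaluate("() => window.__mutations")
        if mutations != seen:
            seen = mutations
            bump()
        if time.monotonic() - last_activity["t"] >= SETTLE_SECONDS:
            return True    # quiet long enough; call the page "done"
        time.sleep(0.25)
    return False           # hit the cap; extract whatever we have

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="domcontentloaded")
    wait_for_settle(page)
    html = page.content()
    browser.close()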

The semantic analysis problem

With content extracted, we hit the real challenge: figuring out what it means.

Here's what the raw DOM gives you: 47 div elements, some spans, a few images, maybe a heading or two if you're lucky. Here's what you need: "This is a hero section with a value proposition. This is a product grid with 6 items. This is a testimonials section with 4 customer quotes."

Semantic understanding is the hardest part of this entire system, and we've rewritten it three times.

Version one was pure heuristics. We wrote rules: "If it's near the top and has a large font size and a button, it's probably a hero." This worked until we encountered a site where the navigation had large text and buttons. And another where the hero section was actually in the middle because there was a banner at the top. And another where... you get the idea.

Version two added ML. We collected training data—hundreds of websites with hand-labeled sections—and trained a classifier. This worked better but had a different problem: model drift. As web design trends changed (remember when every site had a diagonal slice background in 2023?), our training data got stale.

Version three, the current system, combines both approaches with what we call "consistency scoring." The insight: most websites are internally consistent. If we see six similar content blocks with the same structure, they're probably all the same thing. We look for repeated patterns, then use our classifier to identify what those patterns represent.
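
Stripped down, the grouping step looks something like this. The node shape, signature function, and repeat threshold here are illustrative, not our production code:

from collections import defaultdict

def signature(node):
    # A crude structural fingerprint: the tag plus the multiset of child tags.
    # The real system also folds in depth, class patterns, and computed styles.
    child_tags = ",".join(sorted(child["tag"] for child in node.get("children", [])))
    return f'{node["tag"]}[{child_tags}]'

def find_repeated_patterns(nodes, min_repeats=3):
    groups = defaultdict(list)
    for node in nodes:
        groups[signature(node)].append(node)
    # Blocks that repeat with the same structure are probably "the same thing"
    # (cards, testimonials, CTAs); singletons go to the classifier on their own.
    return [group for group in groups.values() if len(group) >= min_repeats]

Each repeated group then goes to the classifier as one unit, so six oddly nested anchor-buttons get labeled together rather than one by one.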

This catches things we never anticipated. One user submitted a site where every "button" was actually an anchor tag styled to look like a button, wrapped in three nested divs, with the click handler attached to the grandparent. Our pattern recognition caught it: "These six weirdly nested structures all appear in similar contexts and have similar content, so they're probably CTAs."

Consistency scoring brought our semantic accuracy from ~73% to ~89%. That last 11% keeps us up at night.

When design generation became a constraint problem

You'd think generating designs would be the easy part. We have a component library, we have content, just map one to the other. Right?

Wrong. Design generation is a constraint satisfaction problem disguised as a simple mapping exercise.

Consider a hero section. We have HeroWithImage, HeroWithVideo, HeroMinimal, HeroSplit, and a dozen more variants. Content determines some of it—if there's a video, you probably want HeroWithVideo—but not all of it. You also need to consider:

  • What's below this section? If it's another image-heavy section, maybe don't use HeroWithImage
  • How much text is there? HeroMinimal breaks down with three paragraphs of copy
  • Is there a CTA? Multiple CTAs? How prominent should they be?
  • What's the overall page rhythm? Too many split layouts in a row looks repetitive

We initially tried to solve this with a scoring system. Each component got a score based on content fit, and we picked the highest score. This produced technically correct but aesthetically boring designs. Everything felt same-y.

The breakthrough came when we added randomness—not random selection, but weighted variation. Instead of always picking the highest scoring component, we sample from the top 3-5 candidates with probability proportional to their scores. Same content can produce meaningfully different designs.
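
In code, the weighted variation is only a few lines. A sketch, assuming each candidate already carries a fit score:

import random

def pick_component(scored_candidates, top_k=5):
    # scored_candidates: list of (component_name, score) pairs, higher is better.
    top = sorted(scored_candidates, key=lambda c: c[1], reverse=True)[:top_k]
    names = [name for name, _ in top]
    weights = [score for _, score in top]
    # Sample proportionally to score instead of always taking the argmax,
    # so the same content can produce meaningfully different designs.
    return random.choices(names, weights=weights, k=1)[0]

hero = pick_component([("HeroSplit", 0.81), ("HeroWithImage", 0.78),
                       ("HeroMinimal", 0.55), ("HeroWithVideo", 0.12)])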

This introduced a new problem: some users would get unlucky and draw a mediocre design. We solved it by generating three variations and using a quality model to pick the best one. The quality model is simple: it scores designs based on visual hierarchy, rhythm, white space, and consistency. It's not perfect, but it keeps us out of the uncanny valley.

Build time went from 47 minutes to 8 minutes. Still too slow.

Code generation is an AST problem

Our first code generator was string templates. It looked like PHP from 2006. It produced code like this:

function HeroSection() {
  return <div className="hero-section-container-wrapper-outer"><div className="hero-section-container-wrapper-inner"><div className="hero-section-content">...</div></div></div>;
}

Three problems with this approach:

  1. Escaping user content is a nightmare
  2. Type safety is impossible
  3. The output looks like machine-generated garbage

We rewrote it using AST manipulation. Instead of templating strings, we build an Abstract Syntax Tree and serialize it to code. This gives us proper TypeScript types, automatic escaping, and output that looks like a human wrote it.
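
The real generator builds TypeScript/JSX ASTs with proper tooling, but the principle fits in a toy Python sketch: you manipulate nodes instead of strings, so escaping and formatting live in exactly one place. The Element type and renderer below are purely illustrative:

from dataclasses import dataclass, field
from html import escape

@dataclass
class Element:
    tag: str
    props: dict = field(default_factory=dict)
    children: list = field(default_factory=list)   # Elements or raw strings

def render(node, indent=0):
    pad = "  " * indent
    if isinstance(node, str):
        return pad + escape(node)                  # user content escaped here, once
    props = "".join(f' {k}="{escape(str(v))}"' for k, v in node.props.items())
    kids = "\n".join(render(child, indent + 1) for child in node.children)
    return f"{pad}<{node.tag}{props}>\n{kids}\n{pad}</{node.tag}>"

hero = Element("section", {"className": "hero"}, [
    Element("h1", {}, ['Untrusted <user> "content" lands here safely']),
])
print(render(hero))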

Here's what surprised us: the generated code is often better than hand-written code. Not because we're amazing at codegen, but because consistency is easy for computers and hard for humans. Every component has proper TypeScript types. Every image has proper alt text. Every interactive element has proper ARIA labels. Humans forget things. Parsers don't.

The AST approach also made debugging infinitely easier. When something breaks, we can inspect the tree structure before it becomes code. String templates give you syntax errors. AST manipulation gives you semantic errors, which are much easier to reason about.

The deployment pipeline nobody asked about

Deployment should be boring. Generate code, run a build, upload to a CDN, done. It wasn't boring.

The first issue: Vite builds are fast, but not when you're running 50 of them concurrently. We were using a single build server and queuing jobs. Average wait time was 6 minutes. Peak wait time was 23 minutes. Unacceptable.

We moved to Kubernetes with auto-scaling. Spin up a pod for each build, tear it down when done. This worked great until we hit AWS instance limits. Turns out, rapidly creating and destroying pods triggers rate limits on EC2 APIs. Who knew?

The solution: keep a warm pool of build workers. Scale it based on queue depth with a 2-minute look-ahead. This keeps enough capacity ready without wasting money on idle workers. Current build time: 90th percentile is under 3 minutes.
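
The sizing rule itself is tiny. A sketch of the idea, with made-up numbers for per-worker throughput and the minimum pool:

import math

BUILDS_PER_WORKER_PER_2MIN = 1   # hypothetical: a worker finishes roughly one build per 2 minutes
MIN_WARM_WORKERS = 5             # hypothetical floor so there is always spare capacity

def desired_workers(queue_depth, arrivals_last_2min):
    # Look-ahead: assume the next 2 minutes look like the last 2 minutes,
    # then size the pool to drain the queue plus the expected new arrivals.
    expected_load = queue_depth + arrivals_last_2min
    return max(MIN_WARM_WORKERS,
               math.ceil(expected_load / BUILDS_PER_WORKER_PER_2MIN))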

The second issue: asset optimization. Early on, we just copied images directly into the build. File sizes were huge. We added ImageMagick to resize and compress. File sizes were better, but builds got slower—image processing is CPU intensive.

Now we do it asynchronously. First deploy uses original images. Background job optimizes them and updates the CDN cache. Users see their site live in 5 minutes, optimized assets appear a few minutes later. Nobody notices the swap.
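
The deferred pass is just a background job. A rough sketch, assuming a Celery worker with the ImageMagick CLI installed; the bucket handling, sizes, and quality setting are illustrative:

import subprocess

import boto3
from celery import shared_task

s3 = boto3.client("s3")

@shared_task
def optimize_image(bucket, key):
    local = "/tmp/" + key.replace("/", "_")
    s3.download_file(bucket, key, local)
    # Shrink anything larger than 1920px and recompress, then overwrite the
    # original object. The real job also refreshes the CDN cache for this path.
    subprocess.run(
        ["convert", local, "-resize", "1920x1920>", "-quality", "82", local],
        check=True,
    )
    s3.upload_file(local, bucket, key)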

The third issue: SSL certificates. We used Let's Encrypt with automated cert provisioning. This worked fine until we hit their rate limits: 50 certs per registered domain per week. We exceeded that on our second day of open beta.

The fix: wildcard certificates for our own subdomains, and on-demand provisioning for custom domains with better rate-limit handling. We also batch cert requests—if multiple users request certs for domains from the same registrar within a short window, we request them together.

Scaling is never where you think it'll be

We architected for build scaling from day one. Kubernetes, auto-scaling, the works. That part scaled fine. The bottleneck showed up somewhere completely different: PostgreSQL.

Specifically, the builds table. Every build creates a row. Every build status update writes to that row. We were doing pessimistic locking on build updates to prevent race conditions. At 100 concurrent builds, lock contention became the bottleneck. Database CPU was at 80% just waiting on locks.

The solution was embarrassingly simple: optimistic locking with row versioning. No more lock contention, but now we had to handle version conflicts. That required retries with exponential backoff. Standard stuff, but we should have done it from the start.
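
The pattern, sketched against a psycopg2-style connection. The builds table and version column come straight from the description above; the retry parameters are illustrative:

import random
import time

MAX_RETRIES = 5

def update_build_status(conn, build_id, new_status):
    for attempt in range(MAX_RETRIES):
        with conn.cursor() as cur:
            cur.execute("SELECT version FROM builds WHERE id = %s", (build_id,))
            (version,) = cur.fetchone()
            # Optimistic: only apply the write if nobody bumped the version since we read it.
            cur.execute(
                "UPDATE builds SET status = %s, version = version + 1 "
                "WHERE id = %s AND version = %s",
                (new_status, build_id, version),
            )
            if cur.rowcount == 1:
                conn.commit()
                return
            conn.rollback()
        # Someone else won the race; back off exponentially with jitter and retry.
        time.sleep((2 ** attempt) * 0.05 + random.uniform(0, 0.05))
    raise RuntimeError(f"build {build_id}: too many version conflicts")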

The second bottleneck: logs. We stored build logs in the database. Text fields can be large, and we were fetching them frequently for the frontend. This added dozens of megabytes of data transfer for each log view.

Now logs live in S3. Database only stores pointers. This cut database traffic by 60% and costs by 40% (S3 is cheaper than RDS traffic).
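
The write path is now two small steps, sketched with boto3. The bucket name, key layout, and log_key column are illustrative:

import boto3

s3 = boto3.client("s3")
LOG_BUCKET = "tezela-build-logs"   # hypothetical bucket name

def store_build_log(conn, build_id, log_text):
    key = f"builds/{build_id}/build.log"
    s3.put_object(Bucket=LOG_BUCKET, Key=key, Body=log_text.encode("utf-8"))
    # The database keeps only the pointer; the frontend fetches the log
    # from S3 later, e.g. via a presigned URL, when someone opens it.
    with conn.cursor() as cur:
        cur.execute("UPDATE builds SET log_key = %s WHERE id = %s", (key, build_id))
    conn.commit()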

Edge cases are 80% of the work

The prototype handled three websites perfectly. Production handles thousands of websites adequately.

Some favorites:

The Flash site from 2008 – Someone submitted a URL that was pure Flash. We extracted nothing useful. We now detect Flash and show a helpful error message. This has triggered exactly twice since we added it.

The 300MB homepage – One user submitted an e-commerce site where the homepage loaded 12,000 product images. Each image was 25KB. Do the math. Our extraction timed out after 5 minutes. We now have sanity checks: if a page loads more than 500 images, we bail and ask the user to specify a different page.

The infinite redirector – A site that redirected based on user agent. It sniffed our headless Chrome user agent and redirected to a "please use a supported browser" page. We rotate user agent strings now, randomly sampling from a list of real browser signatures.

The canvas-rendered text – Somebody built an entire website where all text was drawn on canvas elements. No DOM text nodes at all. We extracted nothing. The text itself is effectively unrecoverable (how do we know it's text and not a decorative graphic?), so we detect canvas-heavy pages and warn users upfront.

The page that was just a YouTube embed – Literally a full-screen YouTube player. Nothing else. We extracted the video URL and built a site around it. This actually worked well.

Each edge case became a test case. We have 2,400 test URLs now, covering everything from normal sites to absolute chaos. Every production error gets a test case. Our test suite takes 40 minutes to run. It's worth it.

The metrics that actually mattered

We tracked everything. Most metrics were vanity. Some mattered.

Build success rate – Obviously critical. Started at 67%, now at 94%. The missing 6% is split between genuinely broken sites (2%), sites that require authentication (3%), and bugs we haven't fixed yet (1%).

Content extraction accuracy – Harder to measure objectively. We sample builds and manually review them. Current score: 89% of content extracted correctly. That 11% is mostly edge cases: unusual technologies, heavily JavaScript-dependent rendering, or just weird HTML.

Design quality – Completely subjective, but we track it anyway. We show users two designs and ask which is better. We use this data to refine our quality model. Our model agrees with human preference 76% of the time, which is good enough to keep generating multiple options and picking the best.

Time to first deploy – This is the metric users care about. Started at 47 minutes. Currently averaging 2:35 (two minutes, thirty-five seconds) from URL submission to live site. The original goal was to get under 5 minutes.

Cost per build – Engineering cares about this more than users do. Compute costs matter at scale. Early on: $0.87 per build. Current: $0.23 per build. Combination of optimization, better resource pooling, and moving image processing to async.

What we got wrong

Trying to be perfect – Our initial goal was 99% accuracy on content extraction. We spent months optimizing for those last few percent. Turns out, users are happy with 89% accuracy if builds are fast and the failure modes are graceful. Perfection is expensive. Good enough is often good enough.

Over-architecting early – We built a sophisticated queuing system with priority lanes, backpressure handling, and circuit breakers before we had 10 users. Most of that complexity was unnecessary. Build the simplest thing that works, then scale when you need to.

Underestimating browser automation costs – Headless Chrome is resource-intensive. Memory usage spikes to 500MB-2GB per instance depending on the site. We initially tried to run 20 instances on a single machine. The OOM killer was not happy. Proper resource isolation and limits are non-negotiable.

Assuming generated code wouldn't need maintenance – We thought: generate code once, deploy it, done. Reality: design trends change, security best practices evolve, dependencies need updates. Generated code isn't fire-and-forget. We now have a system to regenerate user sites with updated templates periodically. Most users never notice, but it keeps their sites modern.

What we'd do differently

If we rebuilt TEZELA from scratch today:

Start with a narrower scope. Instead of "any website," target specific categories first (portfolios, then landing pages, then e-commerce). Become great at one thing before being mediocre at everything.

Invest in observability from day one. We added comprehensive logging and monitoring after problems appeared. We should have had it from the start. Distributed tracing, structured logging, proper metrics—all of it. Debugging is 10x easier when you can see what's happening.

Build the feedback loop earlier. We launched with no way for users to tell us what went wrong. Users encountered errors, gave up, and we never knew why. Now we have detailed error states, user feedback forms, and session replay. The data is invaluable.

The architecture, finally

Here's the actual system architecture, without the stage-by-stage enumeration:

Users submit URLs through a React frontend. The request hits our FastAPI backend, which queues a job in Redis. Celery workers pick up jobs and spawn Playwright instances in Kubernetes pods. Each pod is isolated with resource limits (2GB RAM, 1 CPU).

Extraction runs until DOM settles or timeout (whichever comes first). Results go through our semantic analysis pipeline: structural analysis, visual analysis (computed styles), content analysis (NLP), and ML classification. Output is a semantic tree.

Design generation takes the semantic tree and runs it through our constraint solver, generating 3 design candidates. Quality model picks the best one. AST generator produces TypeScript/React code.

Build worker compiles with Vite, optimizes assets (synchronously for critical items, async for images), and deploys to our CDN (Cloudflare). DNS updates happen automatically. SSL certs are provisioned on-demand.
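
Wired together, the pipeline reads like a chain of Celery tasks. A simplified sketch; the task names and broker URL are illustrative, not our actual module layout:

from celery import Celery, chain

app = Celery("tezela", broker="redis://localhost:6379/0")  # illustrative broker URL

@app.task
def extract(url):
    ...  # Playwright extraction in a resource-limited pod; returns raw content + computed styles

@app.task
def analyze(extraction):
    ...  # semantic analysis pipeline; returns the semantic tree

@app.task
def generate(semantic_tree):
    ...  # constraint solver, quality model, AST codegen; returns TypeScript/React source

@app.task
def build_and_deploy(code):
    ...  # Vite build, asset handling, CDN deploy; returns the live URL

def submit(url):
    # Each task's return value becomes the first argument of the next stage.
    return chain(extract.s(url), analyze.s(), generate.s(), build_and_deploy.s()).apply_async()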

The entire pipeline is instrumented with OpenTelemetry. We track success rates, timing, error patterns, and resource usage for every stage. Errors trigger alerts in Slack. Weekly reports summarize trends.

The tech stack

Backend: Python 3.11, FastAPI, Celery, Playwright
Frontend: React 18, TypeScript, Tailwind CSS, Vite
Infrastructure: Kubernetes (EKS), PostgreSQL (RDS), Redis (ElastiCache)
Deployment: AWS for compute, Cloudflare for CDN and DNS
Monitoring: Datadog for metrics, Sentry for errors, LogRocket for session replay

Nothing exotic. Boring technology that scales and has good documentation. The hard parts aren't the tools, they're the problems.

What's next

We're not done. Some problems we're actively working on:

Multi-page support – Current system handles single pages well. Multi-page sites with varied content types are harder. We need better navigation extraction, page relationship understanding, and cross-page consistency.

Custom components – Let users bring their own components. This is tricky—how do we ensure custom code works with generated code? Type safety? Build-time checks? Runtime sandboxing?

Better semantic understanding – That 11% we're missing bugs us. We're exploring LLM-based analysis to complement our current approach. Early experiments are promising but expensive.

Real-time collaboration – Multiple team members editing a generated site simultaneously. This is a completely different problem space (CRDTs, operational transforms) but users are asking for it.

Performance optimization – Generated sites are fast, but they could be faster. Automatic image lazy loading, critical CSS extraction, aggressive code splitting—there's room for improvement.

What we learned

Building TEZELA taught us that the hardest technical problems are rarely the ones you anticipate. We expected design generation to be hard and semantic analysis to be straightforward. Reality was inverted.

We learned that real-world data is humbling. Your carefully crafted algorithms work great on test data and fall apart on the weird websites real users submit. Edge cases aren't 20% of the work, they're 80% of the work.

We learned that user experience matters more than technical perfection. An 89% accurate system that's fast and has good error messages beats a 95% accurate system that's slow and opaque.

We learned that boring technology is good technology. Kubernetes, PostgreSQL, Redis, React—none of it is exciting. All of it is reliable, well-documented, and scalable. Boring is a feature, not a bug.

Most importantly, we learned that shipping is better than perfecting. That first demo that took 47 minutes taught us more than any amount of planning would have. Get something in front of users as fast as possible. The feedback is worth more than the perfect architecture.

We're still learning. Every user interaction reveals new edge cases, new opportunities, new challenges. That's the fun part.

If you're building something similar—automated content understanding, code generation, deployment infrastructure—our advice is simple: start small, ship early, and prepare for the problems you don't expect. They're the interesting ones anyway.

Ready to build your website?

Experience the power of automated web design. Create a professional website in minutes with TEZELA.