The Scraper That Doesn't Get Banned: Building a Scalable Web Scraping Pipeline in Go

Syed Moinuddin · 13 min read
Go · Backend

A Go web scraping pipeline that survives at scale: rate limits per domain, rotating proxies, retries, and the layered architecture that keeps you from getting banned.

A scraping script that works on your laptop and a pipeline that runs for months at scale are two completely different things. The gap between them is proxies, rate limiting, retries, and architecture — and Go is unusually good at all four.

On localhost, scraping feels deterministic. You request a URL, the HTML loads, your selectors resolve, you get data. It works almost every time. Then you point the same code at ten thousand URLs across real targets, and the illusion collapses: IPs get banned within the hour, requests time out, a parser error kills the whole run, and the site starts feeding you CAPTCHAs.

I build crawlers in Go for a living, and the lesson that took the longest to sink in is that a scraper isn't a script — it's a pipeline. Once you treat it like a system with decoupled stages, the hard problems (blocking, scale, reliability) become layers you can solve one at a time. This is how I build them.

Go earns its place here for concrete reasons: goroutines and channels give you high-throughput concurrency with almost no overhead, it compiles to a single static binary you can drop on any box, and on parsing-heavy work Go scrapers typically run several times faster than the Python equivalent. S1 For a pipeline whose whole job is doing thousands of things at once, that matters.

The mental model: four decoupled stages

Every scalable scraper is the same four stages, and the trick is to keep them separate so a failure in one doesn't take down the others:

  1. Frontier — the queue of URLs waiting to be fetched, deduplicated so you never crawl the same page twice.
  2. Fetch — workers that pull URLs, send HTTP requests through proxies and rate limits, and return raw responses.
  3. Parse — extract structured data from raw HTML, after fetching, never inside it.
  4. Store — write results somewhere durable.

The single most important architectural rule, learned the hard way: fetch raw first, parse later. Don't let a regex error on one weird page kill a network request that cost you a proxy rotation and three seconds. S7 Pull the bytes, store or queue them, and parse in a stage that's allowed to fail independently.
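Here is a minimal sketch of that rule, assuming an injected fetch function and a deliberately naive title parser (RawPage, Record, parseTitle, and runPipeline are names made up for this example). The fetch stage only moves bytes into a channel; the parse stage is the only place a bad page can fail, and that failure costs nothing on the network side.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// RawPage is what the fetch stage hands over: bytes only, no parsing.
type RawPage struct {
	URL  string
	Body []byte
}

// Record is the structured output of the parse stage.
type Record struct {
	URL   string
	Title string
}

// parseTitle is a naive stand-in for a real parser.
func parseTitle(p RawPage) (Record, error) {
	body := string(p.Body)
	start := strings.Index(body, "<title>")
	end := strings.Index(body, "</title>")
	if start < 0 || end < start {
		return Record{}, errors.New("no <title> in " + p.URL)
	}
	return Record{URL: p.URL, Title: body[start+len("<title>") : end]}, nil
}

// runPipeline wires fetch -> parse through a channel. fetch is injected so
// the sketch runs without network access.
func runPipeline(urls []string, fetch func(string) (RawPage, error)) (records []Record, parseFailures []string) {
	// Buffered so every fetcher can hand off its page without waiting.
	raw := make(chan RawPage, len(urls))

	// Fetch stage: knows nothing about selectors or parsing.
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			page, err := fetch(u)
			if err != nil {
				return // a real pipeline would hand this to the retry layer
			}
			raw <- page
		}(u)
	}
	wg.Wait()
	close(raw)

	// Parse stage: a bad page is recorded and skipped; nothing is re-fetched.
	for page := range raw {
		rec, err := parseTitle(page)
		if err != nil {
			parseFailures = append(parseFailures, page.URL)
			continue
		}
		records = append(records, rec)
	}
	return records, parseFailures
}

func main() {
	pages := map[string]string{
		"https://example.com/a": "<html><title>A</title></html>",
		"https://example.com/b": "<html>no title here</html>",
	}
	fetch := func(u string) (RawPage, error) {
		return RawPage{URL: u, Body: []byte(pages[u])}, nil
	}
	recs, failed := runPipeline([]string{"https://example.com/a", "https://example.com/b"}, fetch)
	fmt.Println(len(recs), "parsed,", len(failed), "failed to parse")
}
```

In a longer-running system the channel would more likely be a queue or a table of raw pages, but the shape is the same: the parser reads saved bytes and can be re-run without touching the network.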

Layer 1 — Fetch: pick the right tool, don't over-reach

The fetch layer is where most people reach for a headless browser by reflex. Don't. Headless Chrome is slow, memory-hungry, and usually unnecessary. Match the tool to the target: S1, S2

  • net/http + goquery — for static HTML. Fastest, lightest, most control. goquery gives you jQuery-style CSS selectors over the response body.
  • Colly — when you need crawling: link-following, cookie handling, request queues, and built-in concurrency, all out of the box.
  • chromedp or Rod: only when the data is rendered by JavaScript and there's no underlying API to hit. Rod adds stealth features; chromedp is the common choice. Reserve these for pages that genuinely need a browser. S1, S2

Most "I need a headless browser" situations are actually "I didn't check the network tab for the JSON endpoint the page is already calling." Check first.
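As a rough illustration of what "hit the endpoint instead" can look like, here is a sketch using only net/http and encoding/json. The /api/products path, the items field, and the Product shape are all hypothetical; the real ones are whatever the network tab shows for your target. A local httptest server stands in for the site so the example runs offline.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// Product mirrors the fields of a hypothetical JSON payload.
type Product struct {
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

// decodeProducts parses the body returned by the hypothetical endpoint.
func decodeProducts(body []byte) ([]Product, error) {
	var out struct {
		Items []Product `json:"items"`
	}
	if err := json.Unmarshal(body, &out); err != nil {
		return nil, fmt.Errorf("decode products: %w", err)
	}
	return out.Items, nil
}

// fetchProducts calls the endpoint directly instead of rendering the page.
func fetchProducts(client *http.Client, endpoint string) ([]Product, error) {
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	// Cap the read at 10 MiB so one oversized response can't exhaust memory.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 10<<20))
	if err != nil {
		return nil, err
	}
	return decodeProducts(body)
}

func main() {
	// Local stand-in for the site's API.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, `{"items":[{"name":"widget","price":9.5}]}`)
	}))
	defer srv.Close()

	products, err := fetchProducts(srv.Client(), srv.URL+"/api/products")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(products)
}
```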

Layer 2 — Rate limiting: rotate at human speed

The fastest way to get banned is to behave like a machine. A real visitor doesn't fire fifty requests per second from one address. S5 Your scraper shouldn't either, and the clean way to enforce that in Go is the official golang.org/x/time/rate package — a token-bucket limiter maintained alongside the standard library. S3

You create a limiter with a rate r and a burst size b, and you call Wait before each request. The key refinement for a real crawler: one limiter per domain, so you're polite to each target independently instead of throttling your whole pipeline to the slowest site.

import (
    "context"
    "sync"

    "golang.org/x/time/rate"
)

// DomainLimiter keeps one token-bucket limiter per host.
type DomainLimiter struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
    r        rate.Limit
    burst    int
}

func NewDomainLimiter(perSecond float64, burst int) *DomainLimiter {
    return &DomainLimiter{
        limiters: make(map[string]*rate.Limiter),
        r:        rate.Limit(perSecond),
        burst:    burst,
    }
}

// Wait blocks until this domain is allowed another request.
func (d *DomainLimiter) Wait(ctx context.Context, host string) error {
    // Hold the lock only for the map lookup, never while sleeping.
    d.mu.Lock()
    lim, ok := d.limiters[host]
    if !ok {
        lim = rate.NewLimiter(d.r, d.burst)
        d.limiters[host] = lim
    }
    d.mu.Unlock()
    return lim.Wait(ctx) // sleeps until a token frees up or ctx is done
}

Wait is the method to reach for in a scraper — it blocks until a token is available rather than rejecting the request, which is exactly the smoothing behavior you want. S3 And watch the responses: the moment you start seeing 429 Too Many Requests, back off immediately. The server is telling you the limit. S7
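One way to act on a 429 is to honor the Retry-After header when the server sends one. The sketch below is an assumption about how you might wire that in, not part of the rate package: retryAfterDelay and backoffFor429 are hypothetical helpers, and the fallback delay is yours to choose. Retry-After can be either a number of seconds or an HTTP date, so both forms are handled.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryAfterDelay interprets a Retry-After header value, which is either a
// number of seconds or an HTTP date. It returns fallback when the header is
// missing or unparseable.
func retryAfterDelay(header string, fallback time.Duration) time.Duration {
	if header == "" {
		return fallback
	}
	if secs, err := strconv.Atoi(header); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	if when, err := http.ParseTime(header); err == nil {
		if d := time.Until(when); d > 0 {
			return d
		}
		return 0 // the date is already in the past
	}
	return fallback
}

// backoffFor429 returns how long to pause a domain after resp, or 0 if the
// response is not a 429.
func backoffFor429(resp *http.Response, fallback time.Duration) time.Duration {
	if resp.StatusCode != http.StatusTooManyRequests {
		return 0
	}
	return retryAfterDelay(resp.Header.Get("Retry-After"), fallback)
}

func main() {
	resp := &http.Response{
		StatusCode: http.StatusTooManyRequests,
		Header:     http.Header{"Retry-After": []string{"30"}},
	}
	fmt.Println("pause this domain for", backoffFor429(resp, 5*time.Second))
}
```

What you do with the returned duration is a design choice: pausing that one domain's workers is the simplest option, and lowering the domain's limiter rate afterwards (rate.Limiter has a SetLimit method) keeps you from walking straight back into the same wall.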

Layer 3 — Proxies: the part that actually decides whether you get blocked

A single IP burns out fast against a serious target. After enough requests it gets flagged and your scraper stops working until you swap. S5 Proxies are how you spread load across many addresses, and choosing the right type matters more than the pool size on the marketing page: S4

  • Datacenter — fast and cheap, best for unprotected sites. Easiest to detect and block.
  • Residential — routed through real home ISP addresses, so requests look like they came from an apartment, not a server rack. Expensive (priced per GB), but they get through advanced bot management and geo-blocks.
  • ISP — static IPs with residential trust signals; ideal for sticky sessions like logged-in flows where rapid rotation triggers lockouts.
  • Mobile — extreme trust because blocking one risks blocking many real users behind carrier NAT. Reserve for hard edge cases.

The rotation decision is just as important as the type. Per-request rotation (a fresh IP every call) is right when each request is independent. Sticky sessions (one IP held for a few-minute window) are right when you need continuity — a cart, a login, a multi-step flow — because switching IPs mid-session breaks cookies and gets you bounced to a CAPTCHA. Most failed residential setups are using per-request rotation where they needed sticky sessions. S5

In Go, the clean pattern is one reusable http.Client per proxy, picked round-robin, so you're not rebuilding transports on every request:

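import (
    "errors"
    "net/http"
    "net/url"
    "sync/atomic"
    "time"
)

type ProxyPool struct {
    clients []*http.Client
    n       uint64
}

// NewProxyPool builds one reusable client per proxy URL.
func NewProxyPool(proxyURLs []string) (*ProxyPool, error) {
    if len(proxyURLs) == 0 {
        // An empty pool would make Next divide by zero.
        return nil, errors.New("proxy pool: no proxy URLs given")
    }
    clients := make([]*http.Client, 0, len(proxyURLs))
    for _, p := range proxyURLs {
        pu, err := url.Parse(p)
        if err != nil {
            return nil, err
        }
        clients = append(clients, &http.Client{
            Timeout:   30 * time.Second,
            Transport: &http.Transport{Proxy: http.ProxyURL(pu)},
        })
    }
    return &ProxyPool{clients: clients}, nil
}

// Next returns the next client in round-robin order. Safe for concurrent use.
func (p *ProxyPool) Next() *http.Client {
    i := atomic.AddUint64(&p.n, 1)
    return p.clients[i%uint64(len(p.clients))]
}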
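Next above hands out a different proxy on every call, which is per-request rotation. For sticky sessions you want the opposite: the same session always landing on the same proxy. One simple way to sketch that, assuming you have some stable session key of your own (an account name, a cart ID), is to hash the key into the pool. stickyIndex is a hypothetical helper, and this version has no time window; the session stays pinned for as long as the pool size doesn't change.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// stickyIndex maps a session key to a stable slot in a pool of n proxies.
// It returns -1 for an empty pool.
func stickyIndex(sessionKey string, n int) int {
	if n <= 0 {
		return -1
	}
	h := fnv.New32a()
	h.Write([]byte(sessionKey)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(n))
}

func main() {
	proxies := []string{"proxy-a", "proxy-b", "proxy-c"} // placeholder names
	for _, session := range []string{"cart-1", "cart-2", "cart-1"} {
		fmt.Println(session, "->", proxies[stickyIndex(session, len(proxies))])
	}
}
```

If your provider offers sticky sessions on its side, that is usually the better route, since the provider can also keep the exit IP stable behind a single gateway address. Check its documentation for how sessions are requested.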

One 2026 caveat worth knowing: proxy sourcing is now a real security and compliance question. Google Cloud's takedown of the IPIDEA network in January 2026 showed that where your residential IPs come from matters, and Cloudflare's "Pay Per Crawl" (HTTP 402) model is starting to shift parts of the web from hard blocking toward metered, paid access. S4 Vet your provider, and don't assume free or grey-market pools are safe.

Layer 4 — Looking human: headers, user agents, and TLS

Proxies hide your IP, but the request itself still carries fingerprints. Rotate your User-Agent and send a realistic, consistent set of headers — a real browser doesn't send a bare request with no Accept-Language. S8 For aggressively protected targets, the next layer down is TLS fingerprint spoofing (libraries like CycleTLS), because anti-bot systems increasingly fingerprint the TLS handshake itself, not just the headers. S1 Layer these defenses based on how hard the target pushes back — don't pay the cost of browser automation and TLS spoofing on a site that a plain net/http request walks straight into.
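A small sketch of the "consistent set" idea: group the headers that belong to one browser into a profile and always send them together, so a Chrome User-Agent never travels with Firefox-style Accept values. HeaderProfile and newRequest are hypothetical names, and the header strings are illustrative only; capture current values from a real browser before relying on them.

```go
package main

import (
	"fmt"
	"net/http"
)

// HeaderProfile groups headers that should always be sent together.
type HeaderProfile struct {
	UserAgent      string
	Accept         string
	AcceptLanguage string
}

// Illustrative values only, not guaranteed to match any current browser.
var profiles = []HeaderProfile{
	{
		UserAgent:      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
		Accept:         "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
		AcceptLanguage: "en-US,en;q=0.9",
	},
	{
		UserAgent:      "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
		Accept:         "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
		AcceptLanguage: "en-US,en;q=0.5",
	},
}

// Header builds the header set for this profile.
func (p HeaderProfile) Header() http.Header {
	h := http.Header{}
	h.Set("User-Agent", p.UserAgent)
	h.Set("Accept", p.Accept)
	h.Set("Accept-Language", p.AcceptLanguage)
	return h
}

// newRequest creates a GET request carrying every header of one profile.
func newRequest(target string, p HeaderProfile) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	req.Header = p.Header()
	return req, nil
}

func main() {
	// Each iteration stands in for a separate session with its own identity.
	for i, target := range []string{"https://example.com/a", "https://example.com/b"} {
		req, err := newRequest(target, profiles[i%len(profiles)])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(req.URL, "as", req.Header.Get("User-Agent"))
	}
}
```

Pairing a profile with a session (and with that session's proxy) tends to look more natural than picking a new User-Agent on every request from the same IP.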

Layer 5 — Resilience: failures are the normal case

Once you leave localhost, failure stops being an edge case and becomes an operating condition. Networks flake, proxies time out, targets get overloaded. S7 Build for it from the start:

  • Retry with exponential backoff and jitter. Don't retry instantly — that just hammers a struggling endpoint. Increase the delay each attempt, and add randomness so a thousand workers don't all retry in lockstep.
  • Know what's retryable. A 429 or 503 or a timeout deserves a retry. A 404 or 401 does not — retrying it wastes a proxy and a token.
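  • Cap attempts, then dead-letter. After a few failures, stop and route the URL to a dead-letter queue for manual review instead of looping forever. S6

import (
    "context"
    "fmt"
    "math/rand"
    "time"
)

// DoWithRetry calls attempt up to maxAttempts times, sleeping between tries.
// isRetryable is assumed to be defined elsewhere in the package.
func DoWithRetry(ctx context.Context, attempt func() error) error {
    const maxAttempts = 4
    base := 500 * time.Millisecond
    var err error
    for i := 0; i < maxAttempts; i++ {
        if err = attempt(); err == nil {
            return nil
        }
        if !isRetryable(err) {
            return err // don't retry 4xx (except 429)
        }
        if i == maxAttempts-1 {
            break // out of attempts: don't sleep before giving up
        }
        // exponential backoff with full jitter: sleep in [0, base * 2^i)
        window := base * (1 << i)
        sleep := time.Duration(rand.Int63n(int64(window)))
        select {
        case <-time.After(sleep):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    // Keep the last error so the dead-letter entry says why it failed.
    return fmt.Errorf("max retries exceeded: %w", err)
}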
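DoWithRetry above leans on an isRetryable helper that isn't shown. Here is one possible version, assuming the fetch layer reports non-2xx responses through an error type that carries the status code (StatusError is a hypothetical type for this sketch). It follows the rule from the list: 429, 5xx, and timeouts are retried; other 4xx responses are not.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
)

// StatusError is a hypothetical error the fetch layer could return for
// non-2xx responses, so the retry loop can inspect the status code.
type StatusError struct {
	Code int
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("unexpected HTTP status %d", e.Code)
}

// isRetryable reports whether err looks transient.
func isRetryable(err error) bool {
	if err == nil {
		return false
	}
	// The caller gave up; retrying would ignore their cancellation.
	if errors.Is(err, context.Canceled) {
		return false
	}
	var se *StatusError
	if errors.As(err, &se) {
		return se.Code == http.StatusTooManyRequests || se.Code >= 500
	}
	// Network-level timeouts (dial, TLS handshake, client timeout).
	var ne net.Error
	if errors.As(err, &ne) {
		return ne.Timeout()
	}
	// Anything else is treated as permanent in this sketch.
	return false
}

func main() {
	wrapped := fmt.Errorf("fetch https://example.com: %w", &StatusError{Code: 503})
	fmt.Println("503 retryable:", isRetryable(wrapped))
	fmt.Println("404 retryable:", isRetryable(&StatusError{Code: 404}))
}
```

Whether connection resets or proxy errors should also count as retryable depends on your proxy pool; this version is deliberately conservative and only widens from there.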

The mindset shift is accepting an error budget: at scale, 1–3% of requests will fail no matter what you do. Build the pipeline to retry and move on, not to demand perfection. S7

Layer 6 — Scaling out: workers, a shared frontier, and a queue

The in-process version of scale is the worker pool — a fixed set of goroutines all reading from one channel of URLs. This alone takes you a long way:

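import (
    "context"
    "sync"
)

// Crawl runs a fixed pool of workers over urls and closes results once every
// worker has exited. It blocks, so the caller must drain results from another
// goroutine and close urls to signal that no more work is coming.
// Result and fetch are assumed to be defined elsewhere in the package.
func Crawl(ctx context.Context, urls <-chan string, results chan<- Result, workers int) {
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for u := range urls { // ranges until the channel closes
                select {
                case results <- fetch(ctx, u):
                case <-ctx.Done():
                    return // don't block on a reader that has gone away
                }
            }
        }()
    }
    wg.Wait()
    close(results)
}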
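The part that trips people up with a pool like this is the wiring around it: something has to close the URL channel, and something has to read results while Crawl is still blocked in wg.Wait. A minimal harness, with a stubbed fetch and a copy of the pool so the example compiles on its own (crawlAll is a hypothetical helper name):

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Result is a placeholder for whatever the fetch layer returns.
type Result struct {
	URL string
	Err error
}

// fetch is a stub standing in for the real fetch layer.
func fetch(ctx context.Context, u string) Result {
	return Result{URL: u}
}

// Crawl has the same shape as the worker pool above.
func Crawl(ctx context.Context, urls <-chan string, results chan<- Result, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				select {
				case results <- fetch(ctx, u):
				case <-ctx.Done():
					return
				}
			}
		}()
	}
	wg.Wait()
	close(results)
}

// crawlAll feeds list into the pool and collects every result.
func crawlAll(list []string, workers int) []Result {
	ctx := context.Background()
	urls := make(chan string)
	results := make(chan Result)

	// Producer: closing urls is what lets the workers' range loops end.
	go func() {
		defer close(urls)
		for _, u := range list {
			urls <- u
		}
	}()

	// Crawl blocks until all workers finish, so it gets its own goroutine
	// while this one drains results.
	go Crawl(ctx, urls, results, workers)

	var out []Result
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	got := crawlAll([]string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}, 2)
	fmt.Println("fetched", len(got), "pages")
}
```

Calling Crawl directly on the main goroutine with unbuffered channels and no reader is the classic way to deadlock this pattern.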

When one machine isn't enough, you go distributed, and the components are well-established: a scheduler/queue holds the URL frontier, stateless workers pull from it across many nodes, and a store collects results. S6 The pieces that make it survive production:

  • Deduplicate the frontier. A shared store like Redis tracks seen URLs so workers across nodes never crawl the same page twice. S8
  • Use a real message queue (NATS, RabbitMQ, Kafka) for the URL frontier so work isn't lost when a worker crashes. S6
  • Separate retry queues with increasing delays, feeding a dead-letter queue after repeated failure — never push failed URLs straight back into the main queue, or you build an infinite loop that hammers broken endpoints. S6
  • Keep workers stateless so a crash loses no data, and restart them periodically, because long-lived workers parsing malformed HTML accumulate memory leaks. S6, S7
  • Score your workers. They won't all have the same proxy quality and network speed; route the hard jobs to your best performers. S6
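To make the deduplication bullet concrete, here is an in-process sketch of a seen-set with URL normalization. Frontier and normalize are hypothetical names, and the normalization rules (drop the fragment, lower-case the host, trim a trailing slash) are assumptions that suit many sites but not all of them. Across nodes the same check would live in a shared store, for example a Redis set where the add command's return value tells you whether the member was new.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// normalize reduces trivially different spellings of a URL to one key:
// lower-case scheme and host, no fragment, no trailing slash on the path.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

// Frontier is an in-process seen-set, safe for concurrent use.
type Frontier struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewFrontier() *Frontier {
	return &Frontier{seen: make(map[string]struct{})}
}

// Add reports whether raw is new. It returns false for duplicates and for
// URLs that fail to parse.
func (f *Frontier) Add(raw string) bool {
	key, err := normalize(raw)
	if err != nil {
		return false
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, dup := f.seen[key]; dup {
		return false
	}
	f.seen[key] = struct{}{}
	return true
}

func main() {
	f := NewFrontier()
	for _, u := range []string{
		"https://Example.com/a/",
		"https://example.com/a#top",
		"https://example.com/b",
	} {
		fmt.Println(u, "new:", f.Add(u))
	}
}
```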

Storage and observability

Store raw responses first (object storage or a raw_pages table), then run parsing as its own stage writing structured rows to Postgres or MongoDB. That separation means a parser bug is a re-run over saved bytes, not a re-crawl.

Then watch the system, because a scraper that can't see itself is a scraper that's quietly failing. The metrics that matter aren't vanity numbers. They're success rate, latency (if a target slows down, slow your scraper down to avoid a ban), and cost per successful request, which is the number that actually tells you whether a proxy provider is worth it. S4, S7 When you trial proxies, compare on cost-per-success on your targets, not on headline pool size. S4
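The arithmetic behind cost per successful request is simple enough to sketch. ProxyStats is a hypothetical struct, and the trial numbers below are made up purely to show why the cheaper-looking provider can lose once failures are counted.

```go
package main

import "fmt"

// ProxyStats accumulates counters for one provider on one target.
type ProxyStats struct {
	Requests  int
	Successes int
	CostUSD   float64 // what the provider billed for these requests
}

// SuccessRate is successes divided by requests, or 0 with no traffic.
func (s ProxyStats) SuccessRate() float64 {
	if s.Requests == 0 {
		return 0
	}
	return float64(s.Successes) / float64(s.Requests)
}

// CostPerSuccess is total spend divided by successful requests. The boolean
// is false when there were no successes, since the cost is then undefined.
func (s ProxyStats) CostPerSuccess() (float64, bool) {
	if s.Successes == 0 {
		return 0, false
	}
	return s.CostUSD / float64(s.Successes), true
}

func main() {
	// Made-up trial numbers for two hypothetical providers.
	trials := []struct {
		name  string
		stats ProxyStats
	}{
		{"cheap", ProxyStats{Requests: 1000, Successes: 400, CostUSD: 2.00}},
		{"pricey", ProxyStats{Requests: 1000, Successes: 950, CostUSD: 3.80}},
	}
	for _, t := range trials {
		cps, ok := t.stats.CostPerSuccess()
		if !ok {
			fmt.Println(t.name, "had no successful requests")
			continue
		}
		fmt.Printf("%-6s success=%.0f%% cost/success=$%.4f\n", t.name, t.stats.SuccessRate()*100, cps)
	}
}
```

With these invented figures the provider with the higher bill works out cheaper per usable page, which is the comparison headline pricing hides.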

This isn't legal advice — I'm an engineer, not a lawyer, and this area is actively being shaped in courtrooms — but every scraper builder needs the lay of the land. S9

The simplest test anyone has produced: open an incognito window, don't log in, and try to view the page. If you can see it without authenticating, it's public data, and under current US case law scraping it is likely defensible. If it's behind a login, you're in dangerous territory. Courts have effectively drawn the line at authentication. S9 The landmark hiQ Labs v. LinkedIn ruling held that accessing publicly visible pages likely doesn't violate the Computer Fraud and Abuse Act. The Ninth Circuit ruled this way in 2019 and reaffirmed it in 2022. S9

But "not a CFAA violation" isn't the whole story:

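  • Terms of Service can still bite. Meta v. Bright Data was fought as a breach-of-contract claim rather than a CFAA one. The court sided with Bright Data on logged-out scraping of public pages, but the reasoning turned on what the terms covered, so terms you accept by creating an account and logging in can still bind you even when the data looks public. S9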
  • Anti-circumvention is its own risk. Reddit v. Perplexity (filed late 2025, pending in early 2026) invokes the DMCA over allegedly circumventing rate limits and anti-bot measures — a different legal theory from the CFAA. S9
  • robots.txt is voluntary but meaningful. Ignoring it isn't automatically illegal, but respecting it reads as good faith, and disregarding it can be used as evidence of bad faith in a lawsuit. Read it. S9
  • Personal data triggers privacy law. GDPR and CCPA apply the moment you collect data about identifiable people, regardless of whether the page was public. Minimize and avoid PII unless you have a real basis to handle it. S9
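Reading robots.txt can be automated. The sketch below is a deliberately simplified reader, not a full implementation of the robots exclusion rules: it only looks at groups addressed to every crawler (User-agent: *), only understands Disallow, and treats each rule as a plain path prefix. Allow lines, wildcards, and agent-specific groups are ignored, and disallowedPrefixes and allowed are hypothetical helper names.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowedPrefixes collects the Disallow paths from the groups in a
// robots.txt body that apply to every crawler ("User-agent: *").
func disallowedPrefixes(robots string) []string {
	var prefixes []string
	applies := false  // does the current group apply to "*"?
	inAgents := false // are we still reading consecutive User-agent lines?
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i] // strip comments
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		value = strings.TrimSpace(value)
		switch key {
		case "user-agent":
			if !inAgents {
				applies = false // a new group starts here
			}
			inAgents = true
			if value == "*" {
				applies = true
			}
		case "disallow":
			inAgents = false
			if applies && value != "" {
				prefixes = append(prefixes, value)
			}
		default:
			inAgents = false
		}
	}
	return prefixes
}

// allowed reports whether path avoids every disallowed prefix.
func allowed(path string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

func main() {
	robots := "User-agent: *\nDisallow: /private/\n\nUser-agent: otherbot\nDisallow: /\n"
	rules := disallowedPrefixes(robots)
	for _, p := range []string{"/products/1", "/private/admin"} {
		fmt.Println(p, "allowed:", allowed(p, rules))
	}
}
```

For anything beyond a quick politeness check, a maintained robots.txt library that handles Allow precedence and wildcards is the safer choice.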

The honest summary: scrape public, non-personal data, respect robots.txt and rate limits, read the terms, and get real legal advice before any high-stakes commercial scraping.

The takeaway

A scraper that survives isn't the one with the cleverest selectors — it's the one built as a pipeline of independent layers. Fetch with the lightest tool that works. Meter yourself per domain and rotate at human speed. Pick proxy types and rotation modes to match the target. Assume failure and retry with backoff and jitter. Decouple fetch from parse so one bad page can't sink a run. Scale out with a deduplicated frontier, a durable queue, and stateless workers. And stay on the right side of the line while you do it.

Go gives you the concurrency and the single-binary simplicity to make all of this boringly reliable. The rest is discipline: build the layers, watch the metrics, and respect the sites you're pulling from.

FAQ

  1. Why use Go for web scraping instead of Python?

    Go compiles to a single static binary, handles massive concurrency through goroutines and channels with little overhead, and typically runs several times faster than Python on parsing-heavy work. For high-throughput pipelines, that throughput-per-dollar adds up fast.

  2. Colly or goquery — which should I use?

    Both, often together. Use net/http + goquery for parsing static HTML with full control; reach for Colly when you need crawling features like link-following, cookies, and built-in concurrency. Colly can even hand the response body to goquery inside a callback.

  3. When do I actually need a headless browser like chromedp or Rod?

    Only when the data is rendered by client-side JavaScript and there's no underlying API to call directly. Headless browsers are slow and memory-hungry — check the network tab for a JSON endpoint before reaching for one.

  4. How do I keep my scraper from getting IP-banned?

    Three things together: rate-limit per domain so you don't behave like a machine, rotate proxies so no single IP accumulates suspicious activity, and rotate user agents and headers so requests look like a real browser. Then back off the instant you see 429.

  5. Datacenter, residential, ISP, or mobile proxies — how do I choose?

    Datacenter for fast, unprotected sites; residential to get past advanced bot management and geo-blocks; ISP for sticky logged-in sessions; mobile only for extreme-trust edge cases. Match the proxy to how hard the target pushes back.

  6. Per-request rotation or sticky sessions?

    Per-request when each request is independent (spreads risk across many IPs). Sticky sessions when you need continuity like a cart or login — switching IPs mid-session breaks cookies and triggers CAPTCHAs. Picking the wrong one is the most common reason residential setups fail.

  7. How should I handle failures and retries?

    Retry transient errors (timeouts, 429, 5xx) with exponential backoff plus jitter, don't retry permanent ones (404, 401), cap the attempts, and route exhausted URLs to a dead-letter queue. Accept a 1–3% failure budget rather than chasing perfection.

  8. How do I scale from one machine to many?

    Move the URL frontier into a durable message queue (NATS, RabbitMQ, Kafka), deduplicate seen URLs in a shared store like Redis, run stateless workers that any node can execute, use separate retry queues feeding a dead-letter queue, and restart workers periodically to avoid memory leaks.

  9. Is web scraping legal?

    This isn't legal advice, but in the US, scraping publicly accessible data (visible without logging in) is generally defensible under hiQ v. LinkedIn. Risk rises sharply when you bypass authentication, ignore robots.txt, scrape personal data (GDPR/CCPA), or violate enforceable terms of service. Consult a lawyer for serious commercial work.

  10. How do I know if a proxy provider is worth it?

    Run a paid trial on your actual targets and compare on cost per successful request and success rate — not on advertised pool size. And in 2026, vet sourcing: proxy networks have been taken down over how their IPs were obtained.

Sources

  S1. Roundproxies, "Web scraping in Golang: 2026 step-by-step guide," updated December 2025, roundproxies.com
  S2. Olostep, "Web Scraping in Golang: Libraries, Anti-Blocking, and Scale," 2026, olostep.com
  S3. The Go Programming Language, "Go Wiki: Rate Limiting" and golang.org/x/time/rate package docs, go.dev/wiki · pkg.go.dev
  S4. Olostep, "Best Proxies for Web Scraping: What Actually Scales," 2026, olostep.com
  S5. Technology.org, "Rotating Residential Proxies for Web Scraping: When They Beat Datacenter Proxies," May 2026, technology.org
  S6. Bright Data, "Guide to Distributed Web Crawling: Scale Your Scraping," 2025, brightdata.com
  S7. OnlineProxy, "How to Scale Web Scraping: Architecture for 1 Million Requests per Day," April 2026, medium.com
  S8. DEV Community, "Building a High-Concurrency Web Crawler in Go: A Practical Guide," January 2026, dev.to
  S9. Legal landscape: PromptCloud, Tendem, and SociaVault compliance overviews (hiQ v. LinkedIn, Meta v. Bright Data, Reddit v. Perplexity, CFAA, robots.txt, GDPR/CCPA), 2026, promptcloud.com · tendem.ai · sociavault.com

Written by

Syed Moinuddin

Full Stack Engineer.

Notes on AI tooling, agentic systems, and building things that survive contact with production.
