Build a Rate Limiter in Go — Part 3: The Token Bucket Algorithm

Mar 15, 2026 · 7 min read
Go · Backend

In Part 1 we built a sliding window rate limiter. In Part 2 we scaled it with map sharding. Both posts used the same underlying algorithm — count requests within a time window, reject when over the limit.

In Part 3 we take a different approach entirely: the token bucket.

Series overview:

  • Part 1 — Sliding window rate limiter
  • Part 2 — Scaling with map sharding
  • Part 3 — Token bucket algorithm (you are here)
  • Part 4 — Redis-backed distributed rate limiting
  • Part 5 — Benchmarks: sliding window vs token bucket vs Redis

The Mental Model

Imagine a bucket that holds tokens. The bucket has a maximum capacity — say, 10 tokens. Tokens are added to the bucket at a fixed rate — say, 1 token per second. When a request comes in, it consumes one token. If the bucket is empty, the request is rejected.

Two properties fall out of this naturally:

Burst handling — if no requests have come in for a while, the bucket fills up. A user can then make several requests in quick succession, up to the bucket’s capacity. The sliding window doesn’t allow this — it counts strictly within the window regardless of prior activity.

Sustained rate — over time, the average request rate can’t exceed the refill rate. The burst headroom gets consumed and the user is back to one request per refill interval.

This makes the token bucket a better fit for APIs where occasional bursts are legitimate — think a user opening your app after several hours of inactivity — but you still want to protect against sustained abuse.


Sliding Window vs Token Bucket

Before the implementation, it’s worth understanding when to reach for each:

                           Sliding Window             Token Bucket
Burst handling             Strict — no burst allowed  Flexible — bursts up to capacity
Memory per user            Slice of timestamps        Two values (tokens + last refill time)
Implementation complexity  Moderate                   Simple
Best for                   Strict per-window limits   APIs with legitimate burst patterns

The token bucket is also cheaper in memory — instead of storing a slice of timestamps per user, we only need two values: the current token count and the last refill timestamp.


The Implementation

The Struct

type bucket struct {
	tokens     float64
	lastRefill time.Time
}

type RateLimiter struct {
	mu         sync.Mutex
	buckets    map[string]*bucket
	capacity   float64
	refillRate float64 // tokens per second
	done       chan struct{}
}

A few design decisions worth noting:

  • tokens is a float64 rather than an integer: tokens accrue fractionally between requests (0.3 s at 2 tokens/s is 0.6 tokens), and an integer counter would lose that accrual to truncation
  • refillRate is expressed in tokens per second, making the math straightforward regardless of the refill interval
  • capacity is the maximum number of tokens a bucket can hold — this is the burst ceiling

Creating a New Rate Limiter

func NewRateLimiter(capacity float64, refillRate float64) *RateLimiter {
	rl := &RateLimiter{
		buckets:    make(map[string]*bucket),
		capacity:   capacity,
		refillRate: refillRate,
		done:       make(chan struct{}),
	}
	go rl.cleanup()
	return rl
}

Example: NewRateLimiter(10, 1) gives each user a bucket that holds 10 tokens and refills at 1 token per second. A user can burst up to 10 requests immediately, then is limited to 1 request per second thereafter.

Refilling a Bucket

Instead of a background ticker that refills all buckets, we refill lazily — only when a request comes in for that user. This is more efficient and eliminates the need for a separate refill goroutine.

func (rl *RateLimiter) refill(b *bucket) {
	now := time.Now()
	elapsed := now.Sub(b.lastRefill).Seconds()
	b.tokens = min(rl.capacity, b.tokens+elapsed*rl.refillRate)
	b.lastRefill = now
}

We calculate how much time has passed since the last refill, multiply by the refill rate to get the number of new tokens, and cap at capacity. Simple and cheap. (The builtin min used here requires Go 1.21; on older versions, use math.Min.)
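To make the arithmetic concrete, a quick check with made-up numbers (the `refillTokens` helper is just the body of refill pulled out as a pure function):

```go
package main

import "fmt"

// refillTokens applies the refill formula: add elapsed*rate new
// tokens, capped at capacity.
func refillTokens(tokens, elapsed, rate, capacity float64) float64 {
	return min(capacity, tokens+elapsed*rate)
}

func main() {
	// 7 tokens left, 2.5 s elapsed at 2 tokens/s: 7 + 5 = 12,
	// capped at a capacity of 10.
	fmt.Println(refillTokens(7, 2.5, 2, 10)) // 10

	// From empty, 1.5 s at 2 tokens/s accrues 3 tokens.
	fmt.Println(refillTokens(0, 1.5, 2, 10)) // 3
}
```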

Checking If a Request Is Allowed

func (rl *RateLimiter) IsAllowed(userID string) bool {
	rl.mu.Lock()
	defer rl.mu.Unlock()

	b, ok := rl.buckets[userID]
	if !ok {
		b = &bucket{
			tokens:     rl.capacity,
			lastRefill: time.Now(),
		}
		rl.buckets[userID] = b
	}

	rl.refill(b)

	if b.tokens < 1 {
		return false
	}

	b.tokens--
	return true
}

New users start with a full bucket — their first interaction gets the full burst allowance. On every request we refill based on elapsed time, then check if there’s at least one token available.

The Cleanup Goroutine

Same problem as Part 1 — buckets for inactive users accumulate in memory. We still need a cleanup pass:

func (rl *RateLimiter) cleanup() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-rl.done:
			return
		case <-ticker.C:
			rl.mu.Lock()
			for userID, b := range rl.buckets {
				rl.refill(b)
				if b.tokens >= rl.capacity {
					delete(rl.buckets, userID)
				}
			}
			rl.mu.Unlock()
		}
	}
}

The cleanup heuristic here is different from Part 1: instead of checking for expired timestamps, we check whether the bucket is full. A full bucket means the user has been idle long enough for every token to refill (up to capacity/refillRate seconds from an empty bucket), so they're effectively inactive. Safe to evict.

Graceful Shutdown

func (rl *RateLimiter) Stop() {
	close(rl.done)
}

Putting It Together

// 10 token capacity, refills at 2 tokens per second
rl := NewRateLimiter(10, 2)
defer rl.Stop()

if rl.IsAllowed("user-123") {
    // handle request
} else {
    // return 429 Too Many Requests
}

When to Use Token Bucket vs Sliding Window

Reach for the sliding window when:

  • You need strict per-window enforcement — no exceptions for burst
  • Your API contract is expressed as “N requests per minute/hour”
  • Fairness between users is the primary concern

Reach for the token bucket when:

  • Bursts are a legitimate usage pattern for your API
  • You want lower memory overhead per user
  • Your rate limit is more naturally expressed as a sustained rate with burst headroom (e.g. webhooks, streaming endpoints)

Neither is universally better. The right choice depends on the semantics you want to enforce.


Honest Limitations

The same fundamental limitation from Parts 1 and 2 applies here — this is still in-memory and single-process. Multiple instances behind a load balancer means independent buckets per instance, and users can exceed the intended limit by spreading requests across instances.

The cleanup goroutine also still holds a global lock during its sweep. If you need the token bucket at high scale, the same sharding approach from Part 2 applies — each shard gets its own map of buckets and its own mutex.
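For reference, here's what the sharded variant might look like. The shard count and FNV-1a hashing are choices for illustration (Part 2's actual helpers may differ); each shard owns its own map and mutex, so a cleanup sweep or a hot user only contends within one shard:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"time"
)

const numShards = 16

type bucket struct {
	tokens     float64
	lastRefill time.Time
}

// shard holds its own buckets and lock.
type shard struct {
	mu      sync.Mutex
	buckets map[string]*bucket
}

type ShardedLimiter struct {
	shards     [numShards]*shard
	capacity   float64
	refillRate float64
}

func NewShardedLimiter(capacity, refillRate float64) *ShardedLimiter {
	sl := &ShardedLimiter{capacity: capacity, refillRate: refillRate}
	for i := range sl.shards {
		sl.shards[i] = &shard{buckets: make(map[string]*bucket)}
	}
	return sl
}

// shardFor picks a shard by hashing the user ID — FNV-1a is a
// cheap, well-distributed hash for string keys.
func (sl *ShardedLimiter) shardFor(userID string) *shard {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return sl.shards[h.Sum32()%numShards]
}

func (sl *ShardedLimiter) IsAllowed(userID string) bool {
	s := sl.shardFor(userID)
	s.mu.Lock()
	defer s.mu.Unlock()

	b, ok := s.buckets[userID]
	if !ok {
		b = &bucket{tokens: sl.capacity, lastRefill: time.Now()}
		s.buckets[userID] = b
	}
	// Lazy refill, exactly as in the unsharded version.
	now := time.Now()
	b.tokens = min(sl.capacity, b.tokens+now.Sub(b.lastRefill).Seconds()*sl.refillRate)
	b.lastRefill = now

	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	sl := NewShardedLimiter(2, 1)
	// Capacity 2: first two requests pass, the third is denied.
	fmt.Println(sl.IsAllowed("user-123"), sl.IsAllowed("user-123"), sl.IsAllowed("user-123"))
}
```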

Both problems get solved properly in Part 4 with Redis.


Full Implementation

package ratelimiter

import (
	"sync"
	"time"
)

type bucket struct {
	tokens     float64
	lastRefill time.Time
}

type RateLimiter struct {
	mu         sync.Mutex
	buckets    map[string]*bucket
	capacity   float64
	refillRate float64
	done       chan struct{}
}

func NewRateLimiter(capacity float64, refillRate float64) *RateLimiter {
	rl := &RateLimiter{
		buckets:    make(map[string]*bucket),
		capacity:   capacity,
		refillRate: refillRate,
		done:       make(chan struct{}),
	}
	go rl.cleanup()
	return rl
}

func (rl *RateLimiter) refill(b *bucket) {
	now := time.Now()
	elapsed := now.Sub(b.lastRefill).Seconds()
	b.tokens = min(rl.capacity, b.tokens+elapsed*rl.refillRate)
	b.lastRefill = now
}

func (rl *RateLimiter) IsAllowed(userID string) bool {
	rl.mu.Lock()
	defer rl.mu.Unlock()

	b, ok := rl.buckets[userID]
	if !ok {
		b = &bucket{
			tokens:     rl.capacity,
			lastRefill: time.Now(),
		}
		rl.buckets[userID] = b
	}

	rl.refill(b)

	if b.tokens < 1 {
		return false
	}

	b.tokens--
	return true
}

func (rl *RateLimiter) cleanup() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-rl.done:
			return
		case <-ticker.C:
			rl.mu.Lock()
			for userID, b := range rl.buckets {
				rl.refill(b)
				if b.tokens >= rl.capacity {
					delete(rl.buckets, userID)
				}
			}
			rl.mu.Unlock()
		}
	}
}

func (rl *RateLimiter) Stop() {
	close(rl.done)
}

What’s Next

In Part 4 we move beyond single-instance limitations entirely. We’ll build a Redis-backed rate limiter that works correctly across multiple instances — the last piece needed for a production-grade distributed system.

Follow along to catch the rest of the series.