
Our AWS Bill Went from $200 to $14,000 in One Week


A forgotten Lambda function with a recursive trigger. That’s all it took to generate a five-figure AWS bill in a single week. Here’s exactly what happened and how to make sure it never happens to you.

The email from AWS billing alerts hit at 7 AM on Monday: “Your estimated charges have exceeded $5,000.”

Our normal monthly bill was $200.

What happened

A Lambda function that processed image uploads had a bug. When it failed to process an image, it put the image back on the SQS queue for retry. The retry triggered the same Lambda. Which failed again. Which put it back on the queue.

An infinite loop, running at cloud scale.

The math

  • Each Lambda invocation: 3 seconds, 1024MB memory
  • Cost per invocation: ~$0.00005
  • Invocations per minute: ~2,000 (SQS kept feeding the Lambda)
  • Per hour: 120,000 invocations = $6
  • Per day: 2,880,000 invocations = $144
  • Per week: ~20 million invocations = $1,000
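The Lambda figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the per-GB-second and per-request prices are assumptions based on typical on-demand x86 pricing (check your region), and the function names are made up for illustration.

```python
# Assumed on-demand prices; not taken from our bill.
GB_SECOND_PRICE = 0.0000166667       # USD per GB-second of compute
REQUEST_PRICE = 0.20 / 1_000_000     # USD per invocation request


def cost_per_invocation(duration_s, memory_mb):
    """Compute charge plus request charge for a single invocation."""
    return duration_s * (memory_mb / 1024) * GB_SECOND_PRICE + REQUEST_PRICE


def lambda_cost(invocations_per_minute, minutes, per_invocation):
    """Return (invocation count, USD) for a sustained invocation rate."""
    invocations = invocations_per_minute * minutes
    return invocations, invocations * per_invocation


per_call = cost_per_invocation(duration_s=3, memory_mb=1024)
for label, minutes in [("hour", 60), ("day", 60 * 24), ("week", 60 * 24 * 7)]:
    count, usd = lambda_cost(2000, minutes, per_call)
    print(f"per {label}: {count:,} invocations = ${usd:,.2f}")
```

The results land within rounding distance of the list above, which is a useful sanity check that Lambda compute alone was the small part of the bill.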

But that’s just Lambda. The real cost was data transfer and SQS:

  • Each retry read the image from S3: $0.0004 per GET request × 20 million = $8,000
  • SQS messages: $0.40 per million × 20 million = $8
  • Data transfer: the images averaged 2MB each, and… yeah.

Total: ~$14,000 in 7 days.

Why nobody noticed for a week

  1. Billing alerts were set to $500. Our normal bill was $200, so $500 seemed generous. The bill blew past $500 on day 3 (Saturday). Nobody checks email on Saturday.
  2. No Lambda concurrency limits. AWS will happily run 1,000 concurrent Lambdas by default. We never set a limit.
  3. CloudWatch alarms were on the wrong metric. We monitored error rate (percentage), not error count. The error rate was 0.1%, because the retries counted as new invocations and diluted it.
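For the third point, an alarm on the absolute number of errors would have fired no matter how many invocations were diluting the percentage. A minimal sketch, assuming a function called image-processor and an SNS topic for alerts (both names, the ARN, and the threshold of 50 are placeholders, not our real setup):

```python
def error_count_alarm(function_name, sns_topic_arn, max_errors=50):
    """Build keyword arguments for CloudWatch's put_metric_alarm.

    Alarms when the Lambda reports more than max_errors errors
    (a count, not a percentage) within a 5-minute window.
    """
    return {
        "AlarmName": f"{function_name}-error-count",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",             # total errors, not an average or rate
        "Period": 300,                  # seconds
        "EvaluationPeriods": 1,
        "Threshold": max_errors,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Create the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder names; substitute your own function and topic.
params = error_count_alarm(
    "image-processor",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
```

Calling create_alarm(params) would register the alarm. An alarm on the Invocations metric, built the same way, would also have caught the runaway volume.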

The bug

import json
import logging

logger = logging.getLogger()

def handler(event, context):
    for record in event['Records']:
        try:
            image_url = json.loads(record['body'])['url']
            process_image(image_url)
        except Exception as e:
            logger.error(f"Failed to process: {e}")
            raise  # Fails the invocation, so SQS redelivers the message

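The raise at the end fails the invocation, so Lambda never deletes the message. Once the visibility timeout expires, SQS makes the message visible again and the same Lambda picks it up. For transient errors (network timeout), this is correct. For permanent errors (corrupt image), with no redrive policy to cap the number of receives, this creates an infinite loop.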

The fix

def handler(event, context):
    for record in event['Records']:
        try:
            image_url = json.loads(record['body'])['url']
            process_image(image_url)
        except TransientError:  # app-defined exception for timeouts and the like
            raise  # Retry for temporary failures
        except Exception as e:
            logger.error(f"Permanent failure, not retrying: {e}")
            # Don't raise: returning normally lets Lambda delete the message,
            # which ends the retry loop. A deleted message never reaches the
            # Dead Letter Queue on its own, so forward it there explicitly
            # if you want to investigate it later.
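One caveat with raising inside a batch: the whole batch gets redelivered, including records that already succeeded. A possible refinement is to report only the retryable records back to SQS. The sketch below assumes the event source mapping has ReportBatchItemFailures enabled; TransientError and the process_image stub are hypothetical stand-ins for the real code.

```python
import json
import logging

logger = logging.getLogger(__name__)


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def process_image(image_url):
    """Stand-in for the real processing step; fails based on the URL suffix."""
    if image_url.endswith(".timeout"):
        raise TransientError("upstream timed out")
    if image_url.endswith(".corrupt"):
        raise ValueError("cannot decode image")


def handler(event, context):
    retry = []
    for record in event["Records"]:
        try:
            process_image(json.loads(record["body"])["url"])
        except TransientError:
            # Only this message is handed back to SQS for another attempt.
            retry.append({"itemIdentifier": record["messageId"]})
        except Exception as exc:
            # Permanent failure: log it and let Lambda delete the message.
            logger.error("Permanent failure for %s: %s", record["messageId"], exc)
    return {"batchItemFailures": retry}
```

Without ReportBatchItemFailures on the mapping, Lambda ignores the return value and treats the whole batch as successful, so this handler only makes sense with that setting in place.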

Prevention measures

1. SQS Dead Letter Queue (should have been there from day 1)

{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:...:image-processing-dlq",
    "maxReceiveCount": 3
  }
}

After 3 failed attempts, the message goes to a DLQ instead of retrying forever.
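To see why the receive limit matters, here is a toy in-memory model of redelivery. It is not how SQS is implemented, just a sketch of the behavior: a message that always fails either lands in the DLQ after max_receive_count deliveries or keeps cycling until a safety cap (max_deliveries, added only so the demo terminates) stops it.

```python
from collections import deque


def drain(messages, handler, max_receive_count=3, max_deliveries=1000):
    """Deliver messages until the queue is empty or the safety cap is hit.

    Returns (total deliveries, messages moved to the dead letter queue).
    max_receive_count=None models a queue with no redrive policy.
    """
    queue = deque((body, 0) for body in messages)
    dlq = []
    deliveries = 0
    while queue and deliveries < max_deliveries:
        body, receives = queue.popleft()
        receives += 1
        deliveries += 1
        try:
            handler(body)
        except Exception:
            if max_receive_count is not None and receives >= max_receive_count:
                dlq.append(body)                 # give up and park it
            else:
                queue.append((body, receives))   # back on the queue
    return deliveries, dlq


def always_fails(body):
    raise ValueError("corrupt image")


with_dlq = drain(["bad.jpg"], always_fails)
without_dlq = drain(["bad.jpg"], always_fails, max_receive_count=None)
print(with_dlq)     # (3, ['bad.jpg'])
print(without_dlq)  # (1000, [])
```

With the redrive policy, one corrupt image costs three invocations. Without it, the only thing that stops the loop is the cap we added for the demo, and the real queue has no such cap.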

2. Lambda concurrency limit

{
  "ReservedConcurrentExecutions": 10
}

Even if the queue fills up, only 10 Lambdas run at once. This caps the blast radius.
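A quick way to size the blast radius, plus one way to apply the limit from code. The helper names and the image-processor function name are placeholders; put_function_concurrency is the boto3 Lambda client call for setting reserved concurrency, and the client is passed in so the sketch can be tried without AWS access.

```python
def max_invocations_per_minute(concurrency, duration_s):
    """Upper bound on invocations per minute at a given concurrency."""
    return concurrency * 60 / duration_s


def cap_concurrency(lambda_client, function_name, limit=10):
    """Set reserved concurrency on a function using a boto3 Lambda client."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


# With 3-second invocations, 10 concurrent executions top out at 200 per minute,
# a tenth of the ~2,000 per minute we were seeing.
print(max_invocations_per_minute(10, 3))  # 200.0

# Usage (requires boto3 and credentials):
#   import boto3
#   cap_concurrency(boto3.client("lambda"), "image-processor")
```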

3. Better billing alerts

  • $300 (50% above normal): Slack notification
  • $500: email + Slack
  • $1,000: PagerDuty
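The tiers can be written down as data so the escalation logic is easy to review. This is a sketch of the routing only; the channel names are labels, the cumulative behavior (higher tiers also notify the lower channels) is an assumption, and wiring each channel to a real integration is left out.

```python
# (threshold in USD, channels to notify once spend reaches it)
ALERT_TIERS = [
    (300, ["slack"]),
    (500, ["email", "slack"]),
    (1000, ["pagerduty"]),
]


def channels_for(spend_usd):
    """Return every channel whose threshold the current spend has reached."""
    notified = []
    for threshold, channels in ALERT_TIERS:
        if spend_usd >= threshold:
            for channel in channels:
                if channel not in notified:
                    notified.append(channel)
    return notified


print(channels_for(350))   # ['slack']
print(channels_for(1200))  # ['slack', 'email', 'pagerduty']
```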

4. AWS Budget Actions

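Set up an AWS Budget that automatically disables the Lambda’s trigger when spending exceeds a threshold. Budget Actions can natively apply an IAM policy or service control policy, or stop EC2 and RDS instances; disabling a Lambda trigger takes a little glue, such as a budget alert that publishes to SNS and invokes a small function to turn off the event source mapping. This is the nuclear option, but it prevents $14,000 surprises.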
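The kill switch itself can be small. A minimal sketch, assuming something (for example a budget alert delivered through SNS) calls this with a boto3 Lambda client; the function name disable_triggers is made up, and pagination of the mapping list is ignored for brevity.

```python
def disable_triggers(lambda_client, function_name):
    """Disable every event source mapping (e.g. the SQS trigger) on a function.

    Returns the UUIDs of the mappings that were disabled.
    """
    disabled = []
    response = lambda_client.list_event_source_mappings(FunctionName=function_name)
    for mapping in response["EventSourceMappings"]:
        lambda_client.update_event_source_mapping(
            UUID=mapping["UUID"],
            Enabled=False,
        )
        disabled.append(mapping["UUID"])
    return disabled


# Usage (requires boto3 and credentials):
#   import boto3
#   disable_triggers(boto3.client("lambda"), "image-processor")
```

Messages stay on the queue while the mapping is disabled, so nothing is lost; processing resumes when someone re-enables it after fixing the bug.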

Did AWS refund it?

We opened a support ticket explaining the situation. AWS credited back about $9,000 (first-time courtesy). We still paid ~$5,000 for the lesson.

The takeaway

Serverless scales automatically. Including your mistakes. Every Lambda needs:

  • A Dead Letter Queue
  • A concurrency limit
  • Billing alerts at reasonable thresholds
  • Distinction between retryable and permanent errors

The cloud doesn’t care if your code has a bug. It will happily run that bug a million times and send you the bill.
