
I have a Python Elastic Beanstalk load-balanced app. Here is the path a user request takes on its way into the Elastic Beanstalk app:

user -> Elastic Beanstalk ELB -> Elastic Beanstalk mod_wsgi

The problem:

The first ~2-4 requests from the user after an eb deploy of a new app version generate 504 errors from the ELB.

After these ~2-4 requests that generate 504s, everything is fine! 200s all around.

When the 504s happen, zero requests make it through to the Elastic Beanstalk mod_wsgi app according to /var/log/httpd/access_log. I only see the 200s after the ELB has decided to start working again.

Things I have tried that didn't work:

  1. I increased the Elastic Beanstalk ELB Idle Timeout to 300 seconds (roughly the change sketched below)
  2. I increased the Apache KeepAliveTimeout in the Elastic Beanstalk mod_wsgi setup to 300 seconds, as suggested here: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/ts-elb-error-message.html
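
For reference, item 1 amounts to something like the following, assuming a classic ELB and boto3 (the load balancer name is a placeholder, not my actual environment's):

    import boto3

    # Classic ELB client; Elastic Beanstalk environments of this vintage use classic ELBs.
    elb = boto3.client("elb")

    # Raise the idle timeout to 300 seconds -- the same change as item 1 above.
    # "awseb-e-placeholder" is not a real name; substitute your environment's ELB.
    elb.modify_load_balancer_attributes(
        LoadBalancerName="awseb-e-placeholder",
        LoadBalancerAttributes={"ConnectionSettings": {"IdleTimeout": 300}},
    )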

One might say, "just live with the 504s!"

However, the actual problem is that in my production setup I have CloudFlare between the user and the Elastic Beanstalk ELB. CloudFlare is set to aggressively cache .css and .js files, since I append md5 hashes to static file URLs. When requests for these important files fail with a 504, CloudFlare appears to cache the failures as 404s. Subsequent requests for these files then 404, breaking the visual styling of the site on every deploy.
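
For context, the cache-busting scheme is roughly the following, assuming a Flask app (the helper name and template usage here are illustrative, not my exact code):

    import hashlib
    import os

    from flask import Flask, url_for

    app = Flask(__name__)

    def hashed_static_url(filename):
        """Build a static URL carrying an md5 of the file, so CloudFlare can cache it hard."""
        path = os.path.join(app.static_folder, filename)
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        return url_for("static", filename=filename, v=digest)

    # Exposed to Jinja2 templates, used as e.g.:
    #   <link rel="stylesheet" href="{{ hashed_static_url('site.css') }}">
    app.jinja_env.globals["hashed_static_url"] = hashed_static_url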

Deploying the Elastic Beanstalk app again with the same app version fixes the CloudFlare 404 problem, but that is not a great solution. I want to keep using CloudFlare because it makes for an excellent transparent CDN, so getting rid of it is not a solution either.

It's hard to believe I'm the only one with this issue, but Google, Stack Overflow/Server Fault, and the AWS forums have not yielded any solutions, or even similar problem reports. I am hoping that my description of this behavior rings a bell with someone here. Thanks in advance.

  • What does the ELB health check configuration look like, in terms of healthy/unhealthy counts and request frequency? Tuning some of that might help mitigate the behavior you are seeing...
    – Castaglia
    Mar 27, 2016 at 18:38
  • What @Castaglia said. I'd look at your ELB health check settings and ensure you're getting it to only report healthy when it gets a 200 response. Set your health check frequency to be quite low (every 5 seconds or something), and it should wait until your application starts serving 200 responses before bringing the instance into service (and thus preventing your application from going down during deployment).
    – dannosaur
    Apr 27, 2018 at 17:45
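
A minimal sketch of the kind of health-check tuning these comments suggest, assuming a classic ELB managed with boto3 (the load balancer name and the /health path are placeholders):

    import boto3

    elb = boto3.client("elb")

    # Poll a real application endpoint every 5 seconds and require two consecutive
    # 200 responses before an instance is brought into service.
    elb.configure_health_check(
        LoadBalancerName="awseb-e-placeholder",   # substitute your environment's ELB
        HealthCheck={
            "Target": "HTTP:80/health",  # only an HTTP 200 from /health counts as healthy
            "Interval": 5,               # seconds between checks
            "Timeout": 3,                # must be shorter than Interval
            "UnhealthyThreshold": 2,
            "HealthyThreshold": 2,
        },
    )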

2 Answers


I had exactly the same problem which I really think is a bug with the Beanstalk deployer.

I was using a "Rolling" deployment policy with 2 instances and a batch size of 1, which in theory should give zero downtime. In reality, however, during a deployment there is still a period of about 10-15 seconds where the ELB responds with 504s.

Take a look at the "Update and Deployments" settings in your Beanstalk configuration. I found that changing to "Rolling with additional batch" with a batch size of 100% works well and gives zero downtime during an update.
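
If you'd rather script the change than click through the console, it corresponds to roughly the following option settings, assuming boto3 (the environment name is a placeholder):

    import boto3

    eb = boto3.client("elasticbeanstalk")

    # Switch to "Rolling with additional batch" with a 100% batch size,
    # the same change described above. "my-env" is a placeholder name.
    eb.update_environment(
        EnvironmentName="my-env",
        OptionSettings=[
            {"Namespace": "aws:elasticbeanstalk:command",
             "OptionName": "DeploymentPolicy",
             "Value": "RollingWithAdditionalBatch"},
            {"Namespace": "aws:elasticbeanstalk:command",
             "OptionName": "BatchSizeType",
             "Value": "Percentage"},
            {"Namespace": "aws:elasticbeanstalk:command",
             "OptionName": "BatchSize",
             "Value": "100"},
        ],
    )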

Update October 2018 - I don't know how long it's been working for, but Elastic Beanstalk rolling updates now work properly again with zero downtime for me.


For anyone else coming across this: I found that this issue may also crop up if you haven't configured your "health check" endpoint properly. EB will only rotate servers into the load balancer once it is getting "healthy" replies from the health-check endpoint, which by default I think just checks that your web server (nginx/Apache/other) is responsive, not that your application has properly started.

In my case, the actual web server was responsive before my Flask application was fully up, leading to servers being rotated in before they were ready. I added an endpoint in my Flask app that just returns 200 and a dummy JSON body, and pointed EB at that as the health-check. Everything has been smooth sailing since then.
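
A minimal sketch of that kind of endpoint in Flask (the route and response body are illustrative, not my exact code):

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/health")
    def health():
        # Only returns 200 once the Flask app itself is imported and serving,
        # so EB won't rotate the instance in while only the web server is up.
        return jsonify(status="ok"), 200

Point the EB health-check URL at /health and the instance won't report healthy until the application is actually up.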
