
I am seeing sharp drops and spikes in instance count every 30 minutes, although the request rate has been stable at 2.4k RPS for 2 hours. Periodically there is a burst of warmup requests after many instances are shut down at the same time. This also increases our operational costs because of the large number of idle instances.

  • App Engine Release: 1.8.1
  • Total number of instances: 235 total (15 Resident)
  • Average QPS: 9.143
  • Average Latency: 135.5 ms
  • Average Memory: 157.9 MBytes
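
As a rough sanity check on the dashboard figures above, Little's law (mean requests in flight = arrival rate × mean latency) can be applied per instance. This is only a sketch: it assumes the "Average QPS" figure is per instance, and the function names are made up for illustration.

```python
def in_flight_per_instance(qps_per_instance, latency_ms):
    """Little's law: mean requests in flight = arrival rate * mean latency."""
    return qps_per_instance * latency_ms / 1000.0


def implied_total_rps(instance_count, qps_per_instance):
    """Total request rate implied by a per-instance average (assumed per instance)."""
    return instance_count * qps_per_instance


if __name__ == "__main__":
    # Figures taken from the dashboard summary in the question.
    print(round(in_flight_per_instance(9.143, 135.5), 2))
    print(round(implied_total_rps(235, 9.143)))
```

With the numbers from the question this gives roughly 1.24 requests in flight per instance and about 2,150 RPS in total, somewhat below the 2.4k RPS being sent. The dashboard averages may cover different time windows, so treat the result as an estimate only.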

The app's performance settings are still at the defaults: F1 instances, with min/max pending latency and min/max idle instances set to automatic.

I will re-run the same test on F2 instances shortly. In the meantime:

  • Is this a known problem on GAE?
  • Is this caused by memory consumption that is too high for F1s?
  • What can I do to fix this problem except going to F2s?
  • How can the average memory be above 128 MB using F1 instances?

[Figures, F1: instance count; RPS; total memory usage in MB]

Update after running the test on F2 instances

During the first 2 hours of the test, instance churn was significantly reduced and the instance count was much more stable. In the last 2 hours, the instance count went up from 250 to 600 although the request rate was stable at 2.4k RPS.

[Figures, F1 vs F2: instance count; RPS; total memory usage in MB; milliseconds per request]

  • This is very hard to say without knowing any details of your app. From what I've read, instance scaling is based on CPU usage, so more CPU means more instances.
    – Nathan C
    Jul 9, 2013 at 13:10
  • Right, but if the average memory footprint per instance is around 160 MB, the CPU should be mostly busy garbage collecting anyway. F1 instances come with 128 MB of memory.
    – Ingo
    Jul 9, 2013 at 13:34

1 Answer


This information comes partly from speaking to Google and partly from my own experience; I am not a Google staffer.

I've found that Google's front-end instance memory limits are fuzzy targets rather than hard limits. Hard limits would cause constant GCs for most users, since most apps would probably exceed them. In my experience the actual limit is around 170 MB before an instance runs a real risk of being quietly shut down. I have occasionally seen instances reach around 200 MB, so I assume a periodic background reaper does this work, but this is a guess and I have no evidence for it. If an instance looked to be running away with memory and I owned the server, I know I'd be considering killing the process.

I would check how much memory most of your instances are actually using, as this may be what is causing instances to be killed en masse.
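
A minimal sketch of that check, assuming you have already collected per-instance memory readings in MB (for example by copying them from the Instances page of the admin console). The 170 MB and 200 MB thresholds are only the anecdotal figures from this answer, not documented limits, and the function names are hypothetical.

```python
# Anecdotal thresholds from this answer, not documented App Engine limits.
SOFT_LIMIT_MB = 170.0
HARD_LIMIT_MB = 200.0


def classify(memory_mb):
    """Bucket a single instance's memory reading against the assumed thresholds."""
    if memory_mb >= HARD_LIMIT_MB:
        return "likely_killed"
    if memory_mb >= SOFT_LIMIT_MB:
        return "at_risk"
    return "ok"


def summarize(readings_mb):
    """Count how many instances fall into each bucket."""
    counts = {"ok": 0, "at_risk": 0, "likely_killed": 0}
    for reading in readings_mb:
        counts[classify(reading)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical readings; replace with values observed for your own instances.
    print(summarize([120.0, 157.9, 175.0, 210.0]))
```

If a large share of instances lands in the "at_risk" or "likely_killed" buckets shortly before each drop in instance count, that would be consistent with memory being the trigger, although it would not prove it.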

An F2 can start up and process requests twice as fast as an F1, which results in fewer instances, and its higher memory ceiling means less chance of being killed (again, this is my opinion, but it tallies with my experience of running a number of enterprise-class apps).

Also note that Google is currently rolling out (or RC testing?) an update to its servers from GAE 1.8.1 to 1.8.2, and this may be affecting apps like ours. It is the reason I found your post: we're seeing random memcache and response latency of 5-20 seconds for responses served entirely from front-end memcache, something that would ordinarily complete in <10 milliseconds (<80 ms with network latency). During this rollout, don't forget that each VM/machine running instances will also need to be upgrading as well as serving other apps.

If this continues for more than a few hours, we'll be collecting evidence and claiming back the costs. I suggest others do the same; remember, Google prides itself on system reliability and treats it as a top priority.

  • Thanks! We are doing a lot more experiments with different local caches and instance types right now. I believe everybody using GAE has a lot to say about this. But I would like to see some best practices and docs about instance lifecycle from Google.
    – Ingo
    Jul 9, 2013 at 21:03
