Problems with Java Garbage Collection

asked 2018-12-19 08:29:06 -0600 by cartman1996

Hey! I am using puppetserver version 6 in a cluster of 4 servers behind a haproxy. I also have a puppetdb in use. The puppet CA is running on the server with the haproxy. After the servers have been running for some hours, they start to behave strangely. I have a health check configured for them in haproxy:

httpchk GET /production/status/no_key HTTP/1.1\r\nAccept:\ pson\r\nHost:\ xxxx.xxxx.de
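For context, the relevant backend in my haproxy config looks roughly like this (server names and addresses are placeholders, not my real hosts):

backend puppetservers
    option httpchk GET /production/status/no_key HTTP/1.1\r\nAccept:\ pson\r\nHost:\ xxxx.xxxx.de
    timeout check 1s
    server master01 10.0.0.11:8140 check ssl verify none
    server master02 10.0.0.12:8140 check ssl verify none
    server master03 10.0.0.13:8140 check ssl verify none
    server master04 10.0.0.14:8140 check ssl verify none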

When the servers have been running for some hours, they start to fail the health check, and the longer you wait, the more often the checks fail and the nodes go down in haproxy. The health check has a timeout of 1 second and it fails with a timeout. I think it is a problem with the garbage collection of the Java VM. I have read about "stop-the-world" pause times, and in my case these pauses seem to be way too long. I have tried different garbage collector algorithms and also different sizings of the Java VM heap, but nothing has helped yet. The master servers have 64 GB RAM and quad-core processors with hyper-threading enabled. I configured max-active-jruby-instances to 7.
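For reference, the JVM settings I am experimenting with are set via JAVA_ARGS in /etc/sysconfig/puppetserver (or /etc/default/puppetserver on Debian-based systems). Roughly like this; the heap sizes vary between my tests, the GC log path is just an example, and this is the Java 8 flag syntax:

JAVA_ARGS="-Xms16g -Xmx16g -XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:/var/log/puppetlabs/puppetserver/gc.log"

The JRuby pool size (what I meant by max-active-jruby-instances above) is set in the jruby-puppet section of puppetserver.conf:

jruby-puppet: {
    max-active-instances: 7
}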

No matter what Java VM heap size I choose, it gets filled up and after some time the garbage collection starts. From the values I have seen, it sometimes takes more than 120 seconds until a master server answers requests from the health check again. Then it works properly for a while and the same thing happens again. Each time, the DOWN phase becomes longer and the time between the DOWN phases becomes shorter. If I restart puppetserver, it works without any problems for some hours and then starts to struggle again.
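If it helps anyone reproduce this: the heap usage and accumulated GC time can be watched live while a node is marked DOWN with something like

jstat -gcutil <pid of the puppetserver JVM> 5s

(jstat comes with the JDK), or by tailing the GC log configured above.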

Has anyone here experienced similar problems? I have never had this kind of problem with puppetserver before, only since the update to Puppet 6; I was running Puppet 5 before that.
