How to make ActiveMQ more stable?

The Puppet environment (v3.2.4) that I manage has about 550 servers, and we have stability issues with ActiveMQ. In a test setup with 10 servers, MCollective worked great: mco ping reliably showed all servers, and with mco puppet I could trigger Puppet runs on all 10 servers. In our production setup, however, mco ping randomly shows only a few servers or none at all. The MCollective client frequently loses its connection to ActiveMQ, and so do the MCollective servers. The ActiveMQ log regularly shows "java.lang.OutOfMemoryError: GC overhead limit exceeded" messages.

I think your stability problems are the result of multiple causes.

  1. Timeouts. You might want to decrease the registerinterval from the recommended 300 to 60 or possibly lower, particularly if you have machines behind firewalls. I've found that connections could time out, and lowering the interval helped.

  2. More RAM. I'm using a 1024M heap for 100 machines, which has been stable over the last month. In our system 50 machines worked with 512M, but going to 100 in our busy periods would usually cause ActiveMQ to crash.

  3. Separate services from each other or perhaps just allocate more resources. We use ...
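
To put a rough number on point 2 for a 550-server setup, here is a back-of-the-envelope sketch in Python. It assumes heap demand grows roughly linearly with the number of machines, which is only an assumption drawn from my two data points (50 machines at 512M, 100 machines at 1024M) and not something I have measured at your scale. The function name estimate_heap_mb is made up for this example.

```python
# Data points from the answer above: (machines, heap in MB that was stable).
OBSERVED = [(50, 512), (100, 1024)]


def estimate_heap_mb(nodes, observed=OBSERVED):
    """Scale each observed data point linearly to `nodes` machines and
    return the largest result in MB, rounded up.

    ASSUMPTION: heap need is proportional to machine count. Real usage
    also depends on message volume, queue and topic settings, and so on.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    return max((heap * nodes + count - 1) // count for count, heap in observed)


if __name__ == "__main__":
    print(estimate_heap_mb(550))
```

Both data points work out to about 10 MB per machine, so treat the result as a starting point for testing and watch the ActiveMQ log for the OutOfMemoryError messages before settling on a value.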

