I ran into my first inexplicable crash that I eventually traced back to the ColdFusion Server Monitor. First off, this isn’t a problem or a bug with the Server Monitor; this is to be expected. The Server Monitor adds overhead to requests, and if you have an intensive process, it’s going to generate a lot of monitoring data. It’s possible that you might reach its limit.
I just wanted to let people know what a crash caused by the monitoring service looks like, because it doesn’t give you a message that says “You have left the monitoring service on in production!”
I had a long-running, complicated process crashing on my local workstation. It did work on our communal development server, so it wasn’t just the process itself. I thought maybe the problem was that my laptop wasn’t a server-class machine, but the virtual machine we were testing on wasn’t tremendously more powerful.
The browser session would error out with a message that said:
500
Java heap space
java.lang.OutOfMemoryError: Java heap space
After digging in the JRun logs for a while, I found this:
javax.servlet.ServletException: ROOT CAUSE:
java.lang.OutOfMemoryError: Java heap space
at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:70)
at coldfusion.bootstrap.BootstrapFilter.doFilter(BootstrapFilter.java:46)
at jrun.servlet.FilterChain.doFilter(FilterChain.java:94)
at jrun.servlet.FilterChain.service(FilterChain.java:101)
at jrun.servlet.ServletInvoker.invoke(ServletInvoker.java:106)
at jrun.servlet.JRunInvokerChain.invokeNext(JRunInvokerChain.java:42)
at jrun.servlet.JRunRequestDispatcher.invoke(JRunRequestDispatcher.java:284)
at jrun.servlet.ServletEngineService.dispatch(ServletEngineService.java:543)
at jrun.servlet.jrpp.JRunProxyService.invokeRunnable(JRunProxyService.java:203)
at jrunx.scheduler.ThreadPool$DownstreamMetrics.invokeRunnable(ThreadPool.java:320)
at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:428)
at jrunx.scheduler.ThreadPool$UpstreamMetrics.invokeRunnable(ThreadPool.java:266)
at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
java.lang.OutOfMemoryError: GC overhead limit exceeded
Of course I didn’t bother actually reading this error until just now when I copied and pasted it. It clearly indicates that the problem is in the Monitoring Servlet Filter. In any case, after much trial and error, I turned off memory tracking and then turned off profiling. Once I turned off profiling the error went away.
The fact that the OutOfMemoryError is thrown from the MonitoringServletFilter does not mean that monitoring is the root cause. The MonitoringServletFilter is the “perimeter” of the monitoring system – when exceptions are thrown from within CF, they’re caught there, logged by monitoring, and rethrown. I would suggest you trace down the logs some more, and you’ll probably find an entry beneath the one for this exception indicating the root cause exception. And do keep in mind that OutOfMemoryErrors occur when, well, the JVM is out of memory – is there any possibility that your application is creating objects, and not throwing them away, eating all the JVM memory? Also, as we’ve noted before, do not run production systems with Memory Tracking on – that can quickly bring a server to its knees. If neither of these is a potential root cause, do drop me a mail with more details, and we’ll look into it ASAP.
Actually, Ashwin, creating many, many objects and holding them over the course of one request was EXACTLY what I was doing. But with profiling and memory monitoring turned off, I was giving myself more rope?
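For the curious, the shape of the problem was roughly this (a minimal sketch with made-up component and variable names, not the actual code): a single request building and holding tens of thousands of CFC instances, which is exactly the kind of workload memory tracking has to account for object by object.
<cfscript>
    // Hypothetical sketch only: the OrderLine component and the loop size
    // are made up for illustration.
    orders = arrayNew( 1 );
    for ( i = 1; i LTE 50000; i = i + 1 ) {
        // every iteration creates an object and keeps a reference to it,
        // so nothing can be garbage collected until the request finishes
        arrayAppend( orders, createObject( "component", "OrderLine" ).init( i ) );
    }
</cfscript>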
In any case, my goal here wasn’t to snipe at CF monitoring. It was to point out what it looks like if you’re doing something crazy that pushes monitoring to the point where it breaks.
Yep, definitely plenty of rope there! 😉 Going by our testing, profiling is safe to use in production, but as I noted, memory tracking could kill a server, especially if the application is creating too many objects. Try your test with memory tracking turned off, and let us know what happens. I didn’t at all mean to suggest that you were sniping at CF monitoring; I was just providing the background so you know why the stack trace for the error looks the way it does.
I definitely tried it with memory tracking turned off, and profiling turned on. It still crashed.
What can I say? I was doing weird stuff.
How exactly do you turn off memory tracking and profiling? I have this exact problem on a clustered pair of 2x servers with 8GB of RAM each 😦
In the CF administrator:
Go to Server Monitoring
Launch Server Monitor
Up at the top there should be 3 options that say Stop Monitoring, Stop Profiling, Stop Memory Tracking.
Turn them off.
However, these are turned off by default, and if you have never turned them on, this error could be caused by something else, per Ashwin’s comment earlier in this thread.
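For what it’s worth, ColdFusion 8 also exposes the monitor through an Admin API web service, so the same switches can be flipped without clicking through the Administrator. The sketch below assumes the servermonitoring.cfc service and its stopMemoryTracking/stopProfiling methods as documented for CF8; the host, port, and credentials are placeholders, so verify all of it against your version’s Admin API documentation before relying on it.
<!--- Sketch only: the service location, credentials, and method names are
      assumptions to check against your ColdFusion version's Admin API docs. --->
<cfinvoke
    webservice="http://localhost:8500/CFIDE/adminapi/servermonitoring.cfc?wsdl"
    method="stopMemoryTracking"
    username="admin"
    password="your-admin-password" />
<cfinvoke
    webservice="http://localhost:8500/CFIDE/adminapi/servermonitoring.cfc?wsdl"
    method="stopProfiling"
    username="admin"
    password="your-admin-password" />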
Thanks for posting this. I have now found warnings to this effect buried in the user documentation, but it seems to me incumbent on Adobe to post this warning in big red letters on the Server Monitor screen so it’s clear to everyone that it should not be kept running on a production server. At CFUnited in June, there were a lot of Adobe people generating excitement about the Server Monitor, but no mention of its dangers. Of what use, exactly, is the Server Monitor if it can’t run on production? This is a blow to my confidence in Adobe products.
Well, there are a lot of things you can do in the CF Administrator to really screw up the server, and none of them get the same treatment. I think Adobe acts responsibly here in that they don’t install CF with monitoring running.
I do think these dangers should be mentioned in future documentation and in guides to setting up ColdFusion, but in reality the load burden of monitoring only comes up on heavily trafficked sites or, as in the case I describe above, on very complex ones.
We had a similar problem with Fusebox applications on our servers. We finally did several thread dumps and determined that there were locking issues. Turning off all server monitoring functions cleared the problems instantly.
If you have an object that creates a lot of objects in its variables scope, you may want to try this.
After you are done with that object, clean it up so it can be garbage collected, such as:
structDelete( variables, "objOrder" );
This, combined with turning off the monitoring as described above, solved our problem.
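In context, the pattern looks something like this (the Order component and its init/process methods are hypothetical names for illustration):
<cfscript>
    // build the heavyweight object and use it
    variables.objOrder = createObject( "component", "Order" ).init( 42 );
    result = variables.objOrder.process();

    // done with it: drop the reference from the variables scope so the
    // object becomes eligible for garbage collection before the request ends
    structDelete( variables, "objOrder" );
</cfscript>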