"IRIS can not become quiescent" error

Question

Igor Titarenko · Jul 30, 2020

#Business Process (BPL) #Interoperability #Testing #InterSystems IRIS

Did anyone run into this error when stopping a Production from Ens.Director?

Ens.Director::StopProduction => ERROR <Ens>ErrProductionNotQuiescent: IRIS can not become quiescent

It happens sporadically when an automated unit test from a class that extends %UnitTest.TestProduction runs a test on a Business Process. I already increased the parameter MAXWAIT to 30 seconds, but the error still happens.

Discussion (4)

Nigel Salm Jul 31, 2020 to Cristiano Silva

Hi

Is this running on a UNIX/Linux system?

If so, I have noticed that if the production timeout is set too low, Ensemble will spawn a lot of jobs to try to clear the queues before it shuts down. This can consume a lot of CPU and memory, and the system becomes unresponsive. Try setting the shutdown wait time to 60 seconds.

Yours

Nigel

score 0 · Answer 1 · 2020-07-30T16:28:56-04:00

Hi Igor,

This occurs because some Business Host is still processing a message or waiting for a response.

You can override the method StopProduction to force the production to stop.

Class br.cjs.test.TestProduction Extends %UnitTest.TestProduction
{ 

/// Class name of the production. It must contain the production class name.
Parameter PRODUCTION As %String = "HC.Production"; 

/// Parameter used to force Ens.Director to stop the production
Parameter FORCESTOPPRODUCTION As %Boolean = 1;

/// Code to run right after the production is started. Used, for example, to call a method that initiates the test.
/// If an error status is returned, the test will be aborted and failed and the production will be stopped.
/// So if a non fatal error occurs, you may invoke ..LogErrors(status,"OnAfterProductionStart()") and return $$$OK.
Method OnAfterProductionStart() As %Status
{
    #Dim exception As %Exception.General = ""
    #Dim statusCode As %Status = $System.Status.OK()
    Try
    {
        // Do your tests
    }
    Catch (exception)
    {
        Set statusCode = exception.AsStatus()
    }
    Return statusCode
} 

Method StopProduction() As %Boolean [ Internal, Private ]
{
    Do ..GetMacros(.Macro)
    Do $$$LogMessage("Stopping production '"_..#PRODUCTION_"'")
    Set r = $$$AssertStatusOK(##class(Ens.Director).StopProduction(..#MAXWAIT, ..#FORCESTOPPRODUCTION), "Invoking Ens.Director::StopProduction")
    If 'r Quit 0
    Set r = $$$AssertStatusOK(..WaitForState(Macro("eProductionStateStopped")), "Verifying Ensemble state is 'Stopped'")
    If 'r Quit 0
    Quit 1
} 

}

score 0 · Answer 2 · 2020-08-06T10:00:03-04:00

Igor Titarenko Aug 6, 2020 to Nigel Salm

Yes, it's a Linux system. Increasing the MAXWAIT parameter to 60 seconds seems to have resolved the issue.
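
For other readers, a minimal sketch of that change, assuming a test class that extends %UnitTest.TestProduction. The class name MyApp.Test.ProcessTest and the production name MyApp.Production are hypothetical placeholders; PRODUCTION and MAXWAIT are the parameters already referred to in this thread.

```objectscript
/// Hypothetical test class; replace the class and production names with your own.
Class MyApp.Test.ProcessTest Extends %UnitTest.TestProduction
{

/// Class name of the production under test (placeholder value).
Parameter PRODUCTION As %String = "MyApp.Production";

/// Seconds to wait for the production to stop; raised to 60 as suggested above.
Parameter MAXWAIT = 60;

}
```

Overriding only the parameter keeps the inherited StopProduction logic intact, so the production still gets the chance to become quiescent instead of being forced to stop.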

score 1 · Answer 3 · 2020-08-06T19:20:02-04:00

Hi Igor

That's great. I have a good understanding of how Ensemble works on Windows, and almost every Ensemble interface I have written has ended up running on some form of Linux. Most of those interfaces are based on third-party requests coming into the interface (lab or pharmacy orders, for example), and at some point the interface sends back the results. So although the internals of the interface may be quite complex, the quantity of data is not necessarily very high.

However, when I was writing the Ensemble engine for a prototype pharmacy dispensing robot, it had to interact with the underlying Java-based ROS (Robot Operating System). Every mechanical component of the robot, right down to the LED lights, motors, sensors and so on, was generating a massive stream of JSON event messages, which were grouped into queues with one or more business services handling each queue. As the Business Service OnRequest methods can only iterate at 1/10th of a second, I ended up writing infinite loops within each OnRequest method, to the point where each service was processing around 3,000 to 4,000 messages per second.

When we gave the robot the instruction to shut down, the ROS would start shutting down the mechanical parts, and I had to wait for the last messages from the components to ensure that they had all shut down correctly. The database was being journalled as well. We found that forcing the production to halt had all sorts of ramifications, some of which I mentioned in my first response. We couldn't leave data in the queues and pick up where we left off when we restarted the robot, so we saw this behaviour of lots of Ensemble processes firing up to help clear the queues. The WIJ file would grow very large and the system would ultimately freeze. That forced us to do a complete reboot, but Ensemble would then have to deal with rolling back the WIJ file, and it would take ages for the system to become responsive again.

I didn't have the option to throw more hard drives or more memory into the configuration. Eventually I got the Ubuntu guys to show me what was happening on the system during shutdown, and that is where I saw this behaviour, which was quite different from what I am used to on Windows. That is when I discovered that increasing the wait time for the production to stop did the trick. Just increasing it to 60 seconds made all the difference. I know this doesn't really add to my original reply, but I thought I would give some context to my recommendation for other developers who are faced with similar issues.

Nigel