TLA+ and PHP 05

April 5, 2022

Previously

This post is part of a series of posts where I'm exploring how to use TLA+ to specify, check and then implement a PHP project that deals with concurrent processes; here are the links to the first post, the second, the third one, and the fourth one.

Fast-failure support

I'm using TLA+ to model a Loop that will be the core of a yet-to-be-implemented testing framework.
One of the features I find very useful is support for fast failure: as soon as a test fails, break out of the Loop, gracefully shut down the other running tests, and exit with an error code.

Since each test "job", whatever that will end up representing, could be running when the first failure comes is, the challenges are the graceful shutdown of the Loop and correct closing of the currently running jobs.

One of the assumptions I had coded in the specification before this step is that all worker processes would run, thus making the worker processes fair; in TLA+ terms, each process is "weakly fair".
Fast failure introduces a drastic change in that paradigm: a worker process could not run if the Loop never starts it because a previous worker's job failed.
This means I should update the worker process by removing the fair keyword from it: while it will run most of the time, it might not run at all if a previous worker job failed triggering the fast-failure handling.
After the update, though, the model checking will fail with Stuttering:

Stuttering caused by unfair worker process

In the image above, without expanding the details of the states, this happened:

The Loop started with two workers
When the first worker completed the Loop started a new one (i.e., set the _startedRegister[J3] value to TRUE).
The Loop waits for updates from the workers.
The last worker process started, J3 stutters: it simply never executes because it is an "unfair" (i.e., not fair) process.

An unfair process simulates a process that might never run: it crashes or takes too much time to produce any noticeable side-effect.

This little experiment shows that making the worker processes not fair is not the right solution.

How should I translate the following from plain English to something that TLA+ would understand?

Beyond PlusCal

The language I've used to write and update the specification is called PlusCal. What I write in PlusCal is then translated into TLA+ specification language proper.
PlusCal is not the specification language; it's merely a comment when it comes to TLA+.
While PlusCal is powerful, it does not allow the kind of access I need to solve my problem: it's time to get my hands dirty.

Before applying my solution to the actual specification, I would like to experiment with a simpler and smaller specification with all the elements required by the real one, but in a more uncomplicated form.
There is a Loop process that will start worker processes, each process should run when activated, and there is the concept of the Loop completing before all worker processes have a chance to start.

Here is the heavily commented version of the PlusCal code, \* starts a comment:

----------------------------- MODULE test_goto -----------------------------
EXTENDS TLC, Integers, FiniteSets
CONSTANTS Loop, Workers, NULL

(*--algorithm test_goto
variables
    started = [x \in Workers |-> FALSE],
    startedCount = 0,
    workTally = 0;

define
NotStartedWorkers == {x \in Workers: started[x] = FALSE} 
end define;

fair process loop = Loop
variables proc = NULL;
begin
    Loop_StartWorker:
        proc := CHOOSE x \in NotStartedWorkers: TRUE; \* Pick the first not started worker.
        startedCount := startedCount + 1;
        started[proc] := TRUE; \* Mark this process started, to power NotStartedWorkers.
        if startedCount /= 2 then \* Purposefully not starting the 3rd worker.
            goto Loop_StartWorker;
        end if;
    Loop_WaitForWorkDone:
        await workTally = 2; \* Two workers are done, the 3rd one wil never start.
end process;

fair process worker \in Workers begin \* If not blocked, workers should execute.
    Worker_AutostartGuard: \* Always blocked action: workers cannot start on their own.
        await FALSE = TRUE;
    Worker_Work: \* The Loop will move workers directly here to start them.
        workTally := workTally + 1;
end process;

end algorithm;*)
=============================================================================

The model I'm testing the specification with, after translation, is set up as follows:

What is the behavior spec? - 'Temporal formula'
What to check?
- Deadlock
- Termination
What is the model?
- Loop <- [ model value ]
- Workers <- [ model value ] {W1, W2, W3}
- NULL <- [ model value ]

Translating and running the model will result in a deadlock:

Deadlock on first model check

The Loop will start two workers out of 3; both of them blocked by a guard in the following form:

await FALSE = TRUE;

That await will permanently block the worker processes on the Worker_AutostartGuard action.
Instead of setting a flag in a register shared by Loop and workers to coordinate and stimulate the Loop starting the processes, the _startedRegister variable in the specification, I want the Loop to start the workers by moving them to the Worker_Work action directly.

Before I show the code, I would like to make a note about how PlusCal code is translated.
In the TLA+ translation, there is no concept of "processes"; there are simply states the system will transit to that, for convenience of writing, PlusCal will wrap in processes. The TLA+ specification will be one giant state machine, and the model checker will move from state to state, respecting the available nodes.
When translated from PlusCal to the TLA+ language, the translation will add a pc variable that will map each process to its next state; it can be seen in the image above, initially set to this:

<Initial Predicate>
    pc
        W1 -> Worker_AutostartGuard
        W2 -> Worker_AutostartGuard
        W3 -> Worker_AutostartGuard
        Loop -> Loop_StartWorker

My thinking is that I could move a worker to the Worker_Work phase by setting its entry in the pc variable to Worker_Work to make it look something like this:

<Initial Predicate>
    pc
        W1 -> Worker_Work
        W2 -> Worker_AutostartGuard
        W3 -> Worker_AutostartGuard
        Loop -> Loop_StartWorker

I've updated my PlusCal code to do that:

----------------------------- MODULE test_goto -----------------------------
EXTENDS TLC, Integers, FiniteSets
CONSTANTS Loop, Workers, NULL

(*--algorithm test_goto
variables
    started = [x \in Workers |-> FALSE],
    startedCount = 0,
    workTally = 0;

define
NotStartedWorkers == {x \in Workers: started[x] = FALSE} 
end define;

fair process loop = Loop
variables proc = NULL;
begin
    Loop_StartWorker:
        proc := CHOOSE x \in NotStartedWorkers: TRUE; \* Pick the first not started worker.
        startedCount := startedCount + 1;
        started[proc] := TRUE; \* Mark this process started, to power NotStartedWorkers.
        pc[proc] := "Worker_Work"; \* Move the worker process to the "Worker_Work" action directly.
        if startedCount /= 2 then \* Purposefully not starting the 3rd worker.
            goto Loop_StartWorker;
        end if;
    Loop_WaitForWorkDone:
        await workTally = 2; \* Two workers are done, the 3rd one wil never start.
end process;


fair process worker \in Workers begin \* If not blocked, workers should execute.
    Worker_AutostartGuard: \* Always blocked action: workers cannot start on their own.
        await FALSE = TRUE;
    Worker_Work: \* The Loop will move workers directly here to start them.
        workTally := workTally + 1;
end process;

end algorithm;*)
=============================================================================

Except that is not allowed, and the translation will fail with the message:

Unrecoverable error:
-- Multiple assignments to pc

I've not done multiple assignments to pc, but the translation did.
Looking at the translated code, I can see the pc variable is modified in each state, its next value (indicated by pc') updated using EXCEPT. As an example, this is the translation of the Worker_Work action:

Worker_Work(self) == /\ pc[self] = "Worker_Work"
                     /\ workTally' = workTally + 1
                     /\ pc' = [pc EXCEPT ![self] = "Done"]
                     /\ UNCHANGED << started, startedCount, proc >>

The syntax means "when the model checker will run the Worker_Work action, then the next state will be this."; what the model checker will do next is check all the available actions and pick one whose pre-conditions are met and execute it.

In plain English, the above means:

if the action I (a worker) should execute is the "Worker_Work" one
then the next value of workTally will be workTally + 1
and the next value of pc (the program counter) for myself will be Done
and started, startedCount and proc will not change.

Remember /\ is a logic AND, var' = <value> is an assignment applied to the next state, and self means the process itself, used as a key.
The string pc' = [pc EXCEPT ![self] = "Done"] means "the next state of pc is like pc except for the key self that will have a value of Done"; the syntax is a bit weird.

The words "action" and "state" should ring a bell from store-pattern implementations like Redux. It's an excellent concept to keep in mind and wrap my head around the fact that each PlusCal label (e.g., the Worker_Work one) will become an action (an operator, a function in TLA+ terms) that will take a state as an input and return the next state as an output.

Since each translation of a PlusCal label will update the next state of the pc variable, I cannot do that in the context of PlusCal.
Why not allow updates of the variables more than once? I would not know for sure, I did not write the TLA+ model checker, but I can guess to avoid the next state being a function of itself.

Messing with the translation

I've commented the problematic parts of the PlusCal code and translated it to get the bulk of it set up.
In the translation, I've commented the original code out and added the updated version right after it.

----------------------------- MODULE test_goto -----------------------------
EXTENDS TLC, Integers, FiniteSets
CONSTANTS Loop, Workers, NULL

(*--algorithm test_goto
variables
    started = [x \in Workers |-> FALSE],
    startedCount = 0,
    workTally = 0;

define
NotStartedWorkers == {x \in Workers: started[x] = FALSE} 
end define;

fair process loop = Loop
variables proc = NULL;
begin
    Loop_StartWorker:
        proc := CHOOSE x \in NotStartedWorkers: TRUE; \* Pick the first not started worker.
        startedCount := startedCount + 1;
        started[proc] := TRUE; \* Mark this process started, to power NotStartedWorkers.
        \* pc[proc] := "Worker_Work"; >>> TODO IN THE TRANSLATION.
        if startedCount /= 2 then \* Purposefully not starting the 3rd worker.
            goto Loop_StartWorker;
        end if;
    Loop_WaitForWorkDone:
        await workTally = 2; \* Two workers are done, the 3rd one wil never start.
        \* pc := [x \in DOMAIN pc |-> "Done"] >>> TODO IN THE TRANSLATION.
end process;

fair process worker \in Workers begin \* If not blocked, workers should execute.
    Worker_AutostartGuard: \* Always blocked action: workers cannot start on their own.
        await FALSE = TRUE;
    Worker_Work: \* The Loop will move workers directly here to start them.
        workTally := workTally + 1;
end process;

end algorithm;*)
\* BEGIN TRANSLATION (chksum(pcal) \in STRING /\ chksum(tla) \in STRING)
VARIABLES started, startedCount, workTally, pc

(* define statement *)
NotStartedWorkers == {x \in Workers: started[x] = FALSE}

VARIABLE proc

vars == << started, startedCount, workTally, pc, proc >>

ProcSet == {Loop} \cup (Workers)

Init == (* Global variables *)
        /\ started = [x \in Workers |-> FALSE]
        /\ startedCount = 0
        /\ workTally = 0
        (* Process loop *)
        /\ proc = NULL
        /\ pc = [self \in ProcSet |-> CASE self = Loop -> "Loop_StartWorker"
                                        [] self \in Workers -> "Worker_AutostartGuard"]

Loop_StartWorker == /\ pc[Loop] = "Loop_StartWorker"
                    /\ proc' = (CHOOSE x \in NotStartedWorkers: TRUE)
                    /\ startedCount' = startedCount + 1
                    /\ started' = [started EXCEPT ![proc'] = TRUE]
                    /\ IF startedCount' /= 2
                        \*   THEN /\ pc' = [pc EXCEPT ![Loop] = "Loop_StartWorker"]
                        \*   ELSE /\ pc' = [pc EXCEPT ![Loop] = "Loop_WaitForWorkDone"]
                          THEN /\ pc' = [pc EXCEPT ![Loop] = "Loop_StartWorker", ![proc'] = "Worker_Work"]
                          ELSE /\ pc' = [pc EXCEPT ![Loop] = "Loop_WaitForWorkDone", ![proc'] = "Worker_Work"]
                    /\ UNCHANGED workTally

Loop_WaitForWorkDone == /\ pc[Loop] = "Loop_WaitForWorkDone"
                        /\ workTally = 2
                        \* /\ pc' = [pc EXCEPT ![Loop] = "Done"]
                        /\ pc' = [x \in DOMAIN pc |-> "Done"]
                        /\ UNCHANGED << started, startedCount, workTally, proc >>

loop == Loop_StartWorker \/ Loop_WaitForWorkDone

Worker_AutostartGuard(self) == /\ pc[self] = "Worker_AutostartGuard"
                               /\ FALSE = TRUE
                               /\ pc' = [pc EXCEPT ![self] = "Worker_Work"]
                               /\ UNCHANGED << started, startedCount, 
                                               workTally, proc >>

Worker_Work(self) == /\ pc[self] = "Worker_Work"
                     /\ workTally' = workTally + 1
                     /\ pc' = [pc EXCEPT ![self] = "Done"]
                     /\ UNCHANGED << started, startedCount, proc >>

worker(self) == Worker_AutostartGuard(self) \/ Worker_Work(self)

(* Allow infinite stuttering to prevent deadlock on termination. *)
Terminating == /\ \A self \in ProcSet: pc[self] = "Done"
               /\ UNCHANGED vars

Next == loop
           \/ (\E self \in Workers: worker(self))
           \/ Terminating

Spec == /\ Init /\ [][Next]_vars
        /\ WF_vars(loop)
        /\ \A self \in Workers : WF_vars(worker(self))

Termination == <>(\A self \in ProcSet: pc[self] = "Done")

\* END TRANSLATION 
=============================================================================

In the Loop_StartWorker and Loop_WaitForWorkDone actions, I'm updating more than one entry of the pc' map, modeling the idea that the Loop will "start" the worker processes correctly without using a register and avoiding unterminated worker processes by setting the next stage of any process to "Done" in the Loop_WaitForWorkDone phase.

Checking the model for Termination and Deadlock will pass this time:

Model checking the updated translation

In my next post, I will apply this newfound "tool" to the actual specification I care about and keep working on it to move it toward its final version.