Killing zombies (POSIX)

lucasart · Post by **lucasart** » Sun Nov 08, 2020 1:41 pm

hgm wrote: ↑Sun Nov 08, 2020 12:26 pm Users even complain the GUI is at fault when it allows e.p. capture. Or when it refuses to start an engine they do not have on their computer.

Ah, yes... Those are typical problems due to the interface between the keyboard and the chair.

bob · Post by **bob** » Mon Nov 09, 2020 6:17 am

The most elegant solution to this is to do the following:

(1) I assume you use either fork() or to spawn new processes. Either works.

(2) In the code that does the fork (the branch where fork() returns the PID of the child), you need to catch the SIGCHLD signal. When a process terminates in Unix/Linux/etc., the OS wants to deliver the termination status of the child back to the parent process. Until the parent collects this status, you get a zombie process. The way to get the status and dismiss the zombie is the waitpid() system call. Set the PID argument to -1 and pass the WNOHANG flag. You can then call it in a loop inside the handler that catches SIGCHLD: each call returns the PID and exit status of one terminated child, and once none are left, WNOHANG makes waitpid() return immediately with a value of zero (or -1 if there are no children at all) instead of blocking.

It is really a simple mechanism. Note that this is a very definite parent/child thing. The parent is the only process that gets the SIGCHLD signal, and only for the processes this parent creates. If the parent creates a child, and then that child spawns another process, the child will get the signal when its own child terminates, and the parent will get a signal when its child terminates.

None of the above applies to threads using the usual pthread library. They don't create zombies.
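A minimal sketch of the handler described in (2), assuming a single-threaded parent. The names `on_sigchld`, `install_sigchld_reaper` and `reaped_children` are made up for illustration; only `sigaction()` and `waitpid()` are the real POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical counter, only here so the effect of the handler is visible. */
static volatile sig_atomic_t reaped_children = 0;

/* SIGCHLD handler: reap every child that has already terminated. */
static void on_sigchld(int signo)
{
    (void)signo;
    int saved_errno = errno;   /* waitpid() may overwrite errno */

    /* pid == -1 means "any child". With WNOHANG, waitpid() returns the PID
       of a reaped child, 0 if children exist but none has terminated yet,
       and -1 if there are no children left. Only a PID keeps the loop going. */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped_children++;

    errno = saved_errno;
}

/* Install the handler. Returns 0 on success, -1 on error. */
static int install_sigchld_reaper(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    /* SA_RESTART: restart interrupted system calls where possible.
       SA_NOCLDSTOP: no SIGCHLD for children that are merely stopped. */
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

The loop matters because several children can terminate while only one SIGCHLD is delivered; calling waitpid() once per signal could leave zombies behind.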

lucasart · Post by **lucasart** » Mon Nov 09, 2020 7:11 am

bob wrote: ↑Mon Nov 09, 2020 6:17 am The most elegant solution to this is to do the following:

(1) I assume you use either fork() or to spawn new processes. Either works.

(2) In the code that does the fork (the branch where fork() returns the PID of the child), you need to catch the SIGCHLD signal. When a process terminates in Unix/Linux/etc., the OS wants to deliver the termination status of the child back to the parent process. Until the parent collects this status, you get a zombie process. The way to get the status and dismiss the zombie is the waitpid() system call. Set the PID argument to -1 and pass the WNOHANG flag. You can then call it in a loop inside the handler that catches SIGCHLD: each call returns the PID and exit status of one terminated child, and once none are left, WNOHANG makes waitpid() return immediately with a value of zero (or -1 if there are no children at all) instead of blocking.

It is really a simple mechanism. Note that this is a very definite parent/child thing. The parent is the only process that gets the SIGCHLD signal, and only for the processes this parent creates. If the parent creates a child, and then that child spawns another process, the child will get the signal when its own child terminates, and the parent will get a signal when its child terminates.

None of the above applies to threads using the usual pthread library. They don't create zombies.

That's the opposite of my problem. Terminating the parent when a child dies is even easier, thanks to the constant pipe communication in our use case. There's basically nothing to do other than exit() when fgets() returns NULL (EOF) while reading from the broken pipe.
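As a rough illustration of that EOF detection (not the actual c-chess-cli code), the sketch below forks a stand-in "engine" that prints one line and dies, and the parent notices the death simply because fgets() runs out of input. The function name `read_until_child_dies` and the message text are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spawn a stand-in child that writes one line and exits, then read its
   output until EOF. Returns the number of lines read, or -1 on error. */
static int read_until_child_dies(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                       /* child: the stand-in engine */
        static const char msg[] = "bestmove e2e4\n";
        close(fds[0]);
        if (write(fds[1], msg, sizeof msg - 1) < 0)
            _exit(1);
        _exit(0);                         /* exiting closes its write end */
    }

    /* The parent must close its own copy of the write end, otherwise the
       pipe never reports EOF, even after the child is gone. */
    close(fds[1]);

    FILE *from_child = fdopen(fds[0], "r");
    if (from_child == NULL) {
        close(fds[0]);
        waitpid(pid, NULL, 0);
        return -1;
    }

    char line[256];
    int lines = 0;
    while (fgets(line, sizeof line, from_child) != NULL)
        lines++;                          /* a real GUI would parse the line */

    /* fgets() returned NULL: every write end is closed, the child is dead. */
    fclose(from_child);
    waitpid(pid, NULL, 0);                /* collect the status: no zombie */
    return lines;
}
```

In a real program the point after the loop is where the parent would report the dead engine and call exit().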

The hard part is to guarantee that all children are terminated when the parent exits. There is no foolproof POSIX solution for this. Only Linux (hence Android) has prctl(), which delegates termination to the kernel (where it belongs).

hgm · Post by **hgm** » Mon Nov 09, 2020 8:58 am

I think there is a miscommunication here, because you used the term 'zombie' to refer to disconnected (in terms of stdin/stdout) processes that are still running, while in *nix systems this term has a very specific meaning: processes that have already terminated (e.g. because they received a SIGTERM, SIGKILL or SIGPIPE, or simply called exit(2) themselves), but for which the OS still keeps an entry in the process table (all other resources having been freed) to preserve the exit status until the parent requests it through a wait(2) call. Bob's posting refers to that official meaning.
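To make that definition concrete, here is a small sketch (the function name `collect_exit_status` is invented for the example). Between the child's `_exit()` and the parent's successful `waitpid()`, the child is a zombie in exactly this sense: nothing is left of it but the process-table entry holding the status.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, then collect that exit status.
   Returns the status reported by the kernel, or -1 on error. */
static int collect_exit_status(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);              /* child terminates immediately */

    /* From here until waitpid() succeeds, the terminated child exists only
       as a process-table entry that preserves its exit status. */
    int status;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0);
    } while (r < 0 && errno == EINTR);   /* retry if a signal interrupted us */

    if (r != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `collect_exit_status(42)` hands back the 42 the child passed to `_exit()`, and the zombie entry disappears at the same moment.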

lucasart · Post by **lucasart** » Wed Nov 11, 2020 6:09 am

hgm wrote: ↑Mon Nov 09, 2020 8:58 am I think there is a miscommunication here, because you used the term 'zombie' to refer to disconnected (in terms of stdin/stdout) processes that are still running, while in *nix systems this term has a very specific meaning: processes that have already terminated (e.g. because they received a SIGTERM, SIGKILL or SIGPIPE, or simply called exit(2) themselves), but for which the OS still keeps an entry in the process table (all other resources having been freed) to preserve the exit status until the parent requests it through a wait(2) call. Bob's posting refers to that official meaning.

I just read the documentation on wait(), and now I understand. Indeed, Cheng was not a zombie, because it was still alive. It was in fact an orphan that had been disconnected (from stdin/stdout) without being told so (via SIGHUP). So it got automatically adopted by the init process (PID 1). Of course, the init process does all the state-of-the-art wait() stuff, but only when Cheng dies and init receives SIGCHLD. Which it never did, because Cheng was trolling.

Fortunately, we almost never have to deal with all this crazy stuff, because init does it for us. When c-chess-cli dies (no matter how), the kernel sends SIGHUP to all children (thanks to prctl), which kills them. And when they die, init will catch SIGCHLD and do the wait stuff.

hgm · Post by **hgm** » Wed Nov 11, 2020 8:51 am

Indeed. This whole process 1 'foster parenting' is basically just a *nix kludge for allowing wait(2) to work also when it is called after the process that it is waiting for has already terminated. Logically process 1 should terminate after it has done its job of initializing the system. But instead it is assigned a new function, just calling wait(2) in an infinite loop to clear away orphaned zombies.

bob · Post by **bob** » Thu Nov 12, 2020 4:39 am

I am not sure why your example fails. I.e., when the parent dies, the kernel should change the PPID to 1, which lets "init" take over as the parent, where it will absorb those SIGCHLD signals and wait() to get their status and dismiss 'em.

Of course, you do have to make sure the processes terminate, otherwise they can wait indefinitely.
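The re-parenting itself is easy to observe. The sketch below (the function name `adopter_of_orphan` is invented) creates an orphan on purpose and reports who adopted it. On a classic system the answer is 1; on Linux it can also be a "subreaper" process such as a service manager, so the code makes no assumption about the exact value.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Create an orphan and return the PID of the process that adopted it.
   Returns -1 on error. */
static pid_t adopter_of_orphan(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t mid = fork();
    if (mid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (mid == 0) {                       /* intermediate parent */
        close(fds[0]);
        pid_t me = getpid();
        pid_t orphan = fork();
        if (orphan == 0) {                /* grandchild, about to be orphaned */
            const struct timespec tick = { 0, 1000000 };   /* 1 ms */
            pid_t ppid;
            /* Poll until the kernel has changed our parent. */
            while ((ppid = getppid()) == me)
                nanosleep(&tick, NULL);
            if (write(fds[1], &ppid, sizeof ppid) != (ssize_t)sizeof ppid)
                _exit(1);
            _exit(0);                     /* the adopter reaps this process */
        }
        _exit(orphan < 0 ? 1 : 0);        /* die at once, orphaning the grandchild */
    }

    close(fds[1]);
    pid_t adopter = -1;
    if (read(fds[0], &adopter, sizeof adopter) != (ssize_t)sizeof adopter)
        adopter = -1;
    close(fds[0]);
    waitpid(mid, NULL, 0);                /* we still have to reap our own child */
    return adopter;
}
```

Note that the orphan in this sketch terminates on its own after reporting; an orphan that keeps running, like the engine discussed above, stays alive under its new parent until something kills it.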
