Killing zombies (POSIX)

lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Killing zombies (POSIX)

Post by lucasart »

hgm wrote: Sun Nov 08, 2020 12:26 pm Users even complain the GUI is at fault when it allows e.p. capture. Or when it refuses to start an engine they do not have on their computer.
Ah, yes... Those are typical problems due to the interface between the keyboard and the chair :lol:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Killing zombies (POSIX)

Post by bob »

The most elegant solution to this is to do the following:

(1) I assume you use either fork() or to spawn new processes. Either works.

(2) In the code that does the fork (the branch where fork() returns the PID of the child), you need to catch the signal SIGCHLD. When a process terminates in Unix/Linux/etc., the O/S wants to deliver the termination status of the child back to the parent process. Until the parent grabs this status, you get a zombie process. The way to get the status and dismiss the zombie is to use the waitpid() system call. You need the PID set to -1 and use the WNOHANG flag. You can then use a loop in the handler that catches SIGCHLD signals: each call returns the PID and exit status of one terminated child, and once none are left, WNOHANG makes waitpid() return immediately with a value of zero (or -1 if there are no children at all) instead of blocking, which ends the loop. The loop matters because several children exiting close together may produce only one SIGCHLD.

It is really a simple mechanism. Note that this is a very definite parent/child thing. The parent is the only process that gets the SIGCHLD signal, and only for the processes this parent creates. If the parent creates a child, and then that child spawns another process, the child will get the signal when its own child terminates, and the parent will get a signal when its child terminates.

None of the above applies to threads using the usual pthread library. They don't create zombies.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Killing zombies (POSIX)

Post by lucasart »

bob wrote: Mon Nov 09, 2020 6:17 am The most elegant solution to this is to do the following:

(1) I assume you use either fork() or to spawn new processes. Either works.

(2) In the code that does the fork (the branch where fork() returns the PID of the child), you need to catch the signal SIGCHLD. When a process terminates in Unix/Linux/etc., the O/S wants to deliver the termination status of the child back to the parent process. Until the parent grabs this status, you get a zombie process. The way to get the status and dismiss the zombie is to use the waitpid() system call. You need the PID set to -1 and use the WNOHANG flag. You can then use a loop in the handler that catches SIGCHLD signals: each call returns the PID and exit status of one terminated child, and once none are left, WNOHANG makes waitpid() return immediately with a value of zero (or -1 if there are no children at all) instead of blocking, which ends the loop. The loop matters because several children exiting close together may produce only one SIGCHLD.

It is really a simple mechanism. Note that this is a very definite parent/child thing. The parent is the only process that gets the SIGCHLD signal, and only for the processes this parent creates. If the parent creates a child, and then that child spawns another process, the child will get the signal when its own child terminates, and the parent will get a signal when its child terminates.

None of the above applies to threads using the usual pthread library. They don't create zombies.
That's the opposite of my problem. Terminating the parent when a child dies is even easier, by taking advantage of the constant pipe communications in our use case. There's basically nothing to do other than exit() when fgets() returns NULL (end-of-file) reading from the closed pipe.

The hard part is to guarantee that all children are terminated when the parent exits. There is no foolproof POSIX solution for this. Only Linux (hence Android) has prctl(), which delegates termination to the kernel (where it belongs).
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Killing zombies (POSIX)

Post by hgm »

I think there is a miscommunication here, because you used the term 'zombie' to refer to disconnected (in terms of stdin/stdout) processes that are still running, while in *nix systems this term has a very specific meaning: processes that have already terminated (e.g. because they received a SIGTERM, SIGKILL or SIGPIPE, or simply called exit(3) themselves), but for which the OS still keeps an entry in the process table (all other resources having been freed), to preserve the exit status until the parent requests it through a wait(2) call. Bob's posting refers to that official meaning.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Killing zombies (POSIX)

Post by lucasart »

hgm wrote: Mon Nov 09, 2020 8:58 am I think there is a miscommunication here, because you used the term 'zombie' to refer to disconnected (in terms of stdin/stdout) processes that are still running, while in *nix systems this term has a very specific meaning: processes that have already terminated (e.g. because they received a SIGTERM, SIGKILL or SIGPIPE, or simply called exit(3) themselves), but for which the OS still keeps an entry in the process table (all other resources having been freed), to preserve the exit status until the parent requests it through a wait(2) call. Bob's posting refers to that official meaning.
I just read the documentation on wait(), and now I understand. Indeed Cheng was not a zombie, because it was still alive. It was in fact an orphan, that had been disconnected (from stdin/out) but without being told so (SIGHUP). So it got automatically adopted by the init process 1. Of course, the init process does all the state of the art wait() stuff, but only when Cheng dies, and init receives SIGCHLD. Which it never did because Cheng was trolling :lol:
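
The adoption of an orphan can be observed with a short C sketch like the one below. The function name ppid_after_orphaning is made up for illustration. On a classic setup the reported PPID is 1, but on Linux systems that use a "child subreaper" (some service managers and container runtimes do) it may be a different process, so the sketch only reports what it sees:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo. Creates a grandchild, lets its parent exit, and
   returns the PPID the orphaned grandchild sees afterwards (-1 on
   error). */
pid_t ppid_after_orphaning(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t middle = fork();
    if (middle == -1)
        return -1;

    if (middle == 0) {
        pid_t middle_pid = getpid();
        pid_t grandchild = fork();

        if (grandchild == 0) {
            close(fds[0]);
            /* Poll until the kernel has reparented us. */
            while (getppid() == middle_pid)
                usleep(1000);
            pid_t adopted_by = getppid();
            if (write(fds[1], &adopted_by, sizeof adopted_by)
                    != (ssize_t)sizeof adopted_by)
                _exit(1);
            _exit(0);
        }
        _exit(0);   /* middle process exits, orphaning the grandchild */
    }

    close(fds[1]);
    waitpid(middle, NULL, 0);           /* reap the middle process */

    pid_t result = -1;
    if (read(fds[0], &result, sizeof result) != (ssize_t)sizeof result)
        result = -1;
    close(fds[0]);
    return result;
}
```

When the grandchild itself exits, whichever process adopted it receives SIGCHLD and is expected to wait() for it.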

Fortunately, we almost never have to deal with all this crazy stuff, because init does it for us. When c-chess-cli dies (no matter how), the kernel sends SIGHUP to all children (thanks to prctl), which kills them. And when they die, init will catch SIGCHLD and do the wait stuff.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Killing zombies (POSIX)

Post by hgm »

Indeed. This whole process 1 'foster parenting' is basically just a *nix kludge for allowing wait(2) to work also when it is called after the process that it is waiting for has already terminated. Logically process 1 should terminate after it has done its job of initializing the system. But instead it is assigned a new function, just calling wait(2) in an infinite loop to clear away orphaned zombies.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Killing zombies (POSIX)

Post by bob »

I am not sure why your example fails. I.e., when the parent dies, the kernel should change the PPID to 1, which lets "init" take over as the parent, where it will absorb those SIGCHLD signals and wait() to get their status and dismiss 'em.

Of course, you do have to make sure the processes terminate, otherwise they can wait indefinitely.