If you’ve programmed much in Linux or Unix, you’ve probably run into the fork system call. A call to fork causes your existing process — everything about it — to suddenly split into two complete copies. But they run on the same CPU. [Tristan Hume] had an idea. He wanted to have a call, telefork, that would create the copy on a different machine in a Linux cluster. He couldn’t let the idea go, so he finally wrote the code to do it himself.
If you think about it, parts of the problem are easy while others are very difficult. For example, creating a copy of the process’s code and data isn’t that hard. Since the target is a cluster, the machines are mostly the same — it’s not as though you are trying to move a Linux process to a Windows machine.
However, a real fork does give the new process some things that are tricky like open TCP connections. [Tristan] sidesteps these for now, but has ideas of how to make things better in the future. He built on examples from other Open Source projects that do similar things, including Distributed Multithread Checkpointing (DMTCP). The task requires a pretty good understanding of how the operating system lays out a process.
In addition to making the telefork a bit more robust, [Tristan] has some “crazier” ideas such as sending data to multiple machines at once, or using virtual memory paging to only copy memory as needed. He even wants to allow a process to think that it has many threads, but that some of them are running on different CPUs. That means a program could “think” it had hundreds or thousands of cores. It seems as though there would be a lot of devil in the details, but it could work in theory.