
workers & isolation

spawn a whole process when a thread isn't enough.

actors vs workers

|                 | actors                         | workers                                      |
|-----------------|--------------------------------|----------------------------------------------|
| unit            | Tokio task (in same process)   | Python process (separate)                    |
| isolation       | shared memory                  | OS-level separation                          |
| fault tolerance | crash = uncaught panic         | crash = process exits, main survives         |
| overhead        | ~100 ns/msg                    | ~ms for IPC                                  |
| GIL             | Python on hot path → GIL-bound | each process has its own GIL                 |
| use for         | concurrency, state machines    | heavy CPU, untrusted code, crash containment |

the basic shape

worker = FX.StartWorkerProcess() | run

result = FX.SendCommandToWorkerProcess(
    process=worker,
    command=FX.Print(content='hello from worker!'),
) | run

FX.StopWorkerProcess(process=worker) | run

The command is an FX value. The worker receives it, runs it, sends back the result. All over a binary protocol — no JSON, no serialization tax.
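To make the "no serialization tax" point concrete: a length-prefixed binary frame is the classic way to ship raw bytes with no encoding overhead. The frame layout below is a generic illustration of the idea, not Zef's actual wire format:

```python
import struct

def frame(payload: bytes) -> bytes:
    # Length-prefixed frame: 4-byte big-endian length, then the raw payload.
    # (Illustrative only; Zef's real protocol is not documented here.)
    return struct.pack('>I', len(payload)) + payload

def unframe(data: bytes) -> bytes:
    # Read the 4-byte length, then slice out exactly that many payload bytes.
    (length,) = struct.unpack('>I', data[:4])
    return data[4:4 + length]

msg = b'\x01\x02hello'           # arbitrary bytes, no JSON escaping needed
assert unframe(frame(msg)) == msg
```

The receiver never parses text; it reads a fixed-size header and copies bytes, which is why binary framing stays cheap even for large payloads.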

worker = a python with a zef runtime
main process                  worker process
────────────                  ──────────────
FX.Start...      ───spawn──▶  (zef runtime)
FX.SendCommand   ───bytes──▶  run command
                 ◀──bytes───  send result
FX.Stop...       ──────────▶  exit

The worker is a full-fledged Python process that happens to have the Zef runtime loaded. Send it FX values; it runs them.

why it matters

fault containment

A segfault in the worker doesn't take down the main process. You get the crash info and can spawn a new worker.

worker = FX.StartWorkerProcess() | run

try:
    result = FX.SendCommandToWorkerProcess(
        process=worker,
        command=potentially_dangerous_fx,
    ) | run
except WorkerCrash as e:
    print(f'worker died: {e}')
    worker = FX.StartWorkerProcess() | run   # respawn
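The try/except/respawn pattern above generalizes into a small retry helper. A sketch in plain Python, where `spawn` and `send` are injected callables standing in for the FX calls (the helper name and signature are illustrative, not part of the Zef API):

```python
def send_with_respawn(worker, command, *, spawn, send, crash_exc=Exception, retries=1):
    """Send `command` to `worker`; on crash, respawn a fresh worker and retry.

    Returns (worker, result) so the caller keeps a handle to whichever
    worker ended up serving the command.
    """
    for attempt in range(retries + 1):
        try:
            return worker, send(worker, command)
        except crash_exc:
            if attempt == retries:
                raise          # out of retries: surface the crash
            worker = spawn()   # replace the dead worker and try again

# Demo with a send that dies once, then succeeds:
calls = {'n': 0}
def flaky_send(w, cmd):
    calls['n'] += 1
    if calls['n'] == 1:
        raise RuntimeError('worker died')
    return cmd.upper()

w, result = send_with_respawn('w0', 'hi', spawn=lambda: 'w1', send=flaky_send)
assert (w, result) == ('w1', 'HI')
```

With Zef, `spawn` would wrap `FX.StartWorkerProcess() | run` and `crash_exc` would be `WorkerCrash`; the control flow is the same.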

parallel CPU-bound work

Spawn several workers and fan the work out across them. Because each worker process has its own GIL, the chunks genuinely run in parallel.

workers = [FX.StartWorkerProcess() | run for _ in range(4)]

chunks = split_data_into_chunks(dataset, 4)

results = [
    FX.SendCommandToWorkerProcess(
        process=w,
        command=process_chunk(chunk),   # a zef_function returning an FX
    ) | run
    for w, chunk in zip(workers, chunks)
]
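`split_data_into_chunks` is assumed above but not defined; a minimal version that cuts the dataset into n roughly equal contiguous chunks could look like this (the name and chunking strategy are assumptions):

```python
def split_data_into_chunks(data, n):
    # Cut `data` into n contiguous chunks whose sizes differ by at most 1.
    k, r = divmod(len(data), n)   # base chunk size, and how many get one extra
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

split_data_into_chunks(list(range(10)), 4)
# -> [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Contiguous chunks keep each worker's slice cache-friendly; for work items with wildly varying cost, round-robin dealing spreads the load more evenly.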

untrusted code sandbox

Run code from users (plugins, user scripts) in a separate process with a restricted FX whitelist. (This is a planned/early-stage feature β€” check current capabilities.)

listing / stopping

# how many workers are alive?
FX.ListWorkerProcesses() | run

# stop one
FX.StopWorkerProcess(process=worker) | run

sending any FX value

The command can be any Zef value β€” typically an FX or a list of them:

FX.SendCommandToWorkerProcess(
    process=worker,
    command=[
        FX.Print(content='start'),
        FX.HTTPRequest(url='https://api.example.com/big-job'),
        FX.Print(content='done'),
    ],
) | run

when NOT to use workers

Use workers when you specifically need isolation. For everything else, actors are faster and simpler.

a concrete example β€” image processing

@zef_function
def resize_image(img_bytes: Bytes) -> Bytes:
    from PIL import Image
    import io
    im = Image.open(io.BytesIO(bytes(img_bytes)))
    im.thumbnail((200, 200))
    buf = io.BytesIO()
    im.convert('RGB').save(buf, 'JPEG')   # JPEG has no alpha; convert RGBA/P first
    return buf.getvalue()

# in main process β€” spawn a worker pool
POOL = [FX.StartWorkerProcess() | run for _ in range(4)]

def resize_batch(imgs):
    return [
        FX.SendCommandToWorkerProcess(
            process=POOL[i % len(POOL)],
            command=resize_image(img),
        ) | run
        for i, img in enumerate(imgs)
    ]

CPU-bound image resizing, 4x parallel, each in its own process. If PIL crashes on some weird jpeg, one worker dies and you restart it; the main process keeps serving other images.

remote workers β€” the next tier

Zef also supports workers on other machines via Remote Containers:

FX.InstallPodmanRemote(ip_address='5.223.58.246') | run
FX.BuildZefImageRemote(ip_address='5.223.58.246', wheel='...whl') | run
FX.RunZefContainerRemote(ip_address='5.223.58.246', session=True) | run

Same programming model: send commands, receive results. Just happens over the network.

effects as transport-independent values

Because effects are data, the same FX.* value can be run locally, in a worker, on another machine, or stored for later. The runtime picks the transport. You write the effect once.
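The effects-as-data idea can be sketched in plain Python: describe the effect as a value, and let separate runners decide where it executes. This is a conceptual sketch only, not Zef's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PrintEffect:
    # An effect is plain data: it describes WHAT to do, not how or where.
    content: str

def run_locally(effect: PrintEffect) -> str:
    return f'local: {effect.content}'

def run_in_worker(effect: PrintEffect) -> str:
    # Stand-in for serializing the same value and shipping it to a worker.
    return f'worker: {effect.content}'

eff = PrintEffect(content='hello')             # written once...
assert run_locally(eff) == 'local: hello'      # ...run here
assert run_in_worker(eff) == 'worker: hello'   # ...or there, unchanged
```

Because `eff` is an inert value, nothing happens until a runner interprets it, which is exactly what lets the same effect travel between process, worker, and machine.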

the three-tier model

┌───────────────────────────────────┐
│ TIER 1  pipelines & pure ops      │ ← your ordinary code
│         fast, shared memory,      │
│         no GIL                    │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│ TIER 2  actors                    │ ← concurrent workers
│         Tokio tasks, msg passing  │   same process
└───────────────────────────────────┘
┌───────────────────────────────────┐
│ TIER 3  worker processes          │ ← isolation, parallelism
│         OS-level separation       │   different processes
│         (local or remote)         │   (possibly different machines)
└───────────────────────────────────┘

Use the lightest tier that gets the job done. Jump to a heavier tier only when you need what it specifically provides.

reflection

Which tier would you use for each of these?

  1. Parsing 1,000,000 JSON strings
  2. Running untrusted user-submitted Python
  3. A chatroom with 100 clients
  4. Computing PI to 1M digits on 8 cores

answers

  1. Tier 1 — pure ops in a pipeline
  2. Tier 3 — untrusted code needs isolation
  3. Tier 2 — stateful per-client actors
  4. Tier 3 — without separate processes, the GIL serializes all 8 cores

Next up: the finale — an end-to-end LLM chat agent. →