
Threads, processes, and parallelism in Python

April 27, 2026

A walk through interpreters, the GIL, threads, and processes — and when each gives you real parallelism.

1. What is an interpreter?

When you run python foo.py, your operating system launches the Python interpreter. It reads your source file, compiles it to bytecode — low-level instructions like “load name x”, “call function”, “add two numbers” — and executes that bytecode one instruction at a time on a stack machine it maintains in memory.
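
You can watch this happen with the standard-library dis module, which disassembles a function's bytecode:

import dis

def add():
    x = 1 + 2   # CPython folds 1 + 2 into a single constant at compile time
    return x

dis.dis(add)    # prints instructions like LOAD_CONST, STORE_FAST, RETURN_VALUE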

Python the language is a specification. An interpreter is a piece of software that implements that specification. The most common is CPython, written in C; it is also the language's reference implementation.

How Python runs your code

[Diagram: foo.py (x = 1 + 2) → compile → bytecode (LOAD_CONST; ADD) → execute on the interpreter's stack machine → result (x = 3)]

Your .py file is source text — a human-readable program.

1.1 CPython, briefly

CPython is the interpreter you get when you install Python from python.org or your system package manager. Other implementations exist — PyPy (JIT-compiled), Jython (runs on the JVM), IronPython (on .NET) — but CPython is what almost everyone means when they say “Python.” Crucially, the GIL is a CPython implementation detail, not a language feature. It's why the rest of this post matters.
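
If you're ever unsure which implementation you're running on, the standard library can tell you:

import platform
import sys

print(platform.python_implementation())  # "CPython", "PyPy", "Jython", or "IronPython"
print(sys.version)                        # full version string of the running interpreter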

2. The GIL

The Global Interpreter Lock is a mutex inside CPython that guarantees only one thread executes Python bytecode at a time within a single interpreter. It exists because CPython's memory management — reference counting, garbage collection, object state — is not thread-safe without it. The GIL makes the interpreter safe at the cost of true in-process parallelism for Python code.

The GIL — one token, two threads

[Diagram: Thread A (running) holds the GIL token; Thread B (waiting). Only one thread holds the GIL at a time.]

Thread A holds the GIL and runs bytecode. Thread B waits.

Every so often (every 5 milliseconds, by default), the interpreter releases the GIL to give other threads a chance. When Python code calls into C — like NumPy array operations, or blocking I/O — the C code can release the GIL explicitly while it works. This is why NumPy-heavy or I/O-heavy multithreaded Python can still gain parallelism: the heavy work happens outside the lock.
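
That release cadence is the interpreter's switch interval, and it is inspectable (and tunable, though rarely worth touching) from Python:

import sys

print(sys.getswitchinterval())   # 0.005 by default: release the GIL every 5 ms
# sys.setswitchinterval(0.001)   # shorter slices mean more responsive threads,
                                 # at the cost of more context-switch overhead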

2.1 What the GIL means in practice

  • CPU-bound pure Python — threads give you no parallelism. Two threads doing arithmetic will together take roughly as long as one (the sketch after this list demonstrates it).
  • I/O-bound work — threads work fine. A thread blocked on a socket or a file releases the GIL, and others run.
  • Numeric / C-extension work — threads often work. NumPy, Torch, and many scientific libraries release the GIL for their heavy kernels.
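
A minimal benchmark sketch of the first point. Exact numbers will vary by machine, but on a GIL-ful CPython the two timings come out roughly equal:

import time
from concurrent.futures import ThreadPoolExecutor

def spin(n: int) -> int:
    total = 0
    for i in range(n):      # pure-Python arithmetic: holds the GIL throughout
        total += i
    return total

N = 10_000_000

t0 = time.perf_counter()
spin(N); spin(N)
print(f"sequential:  {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(spin, [N, N]))
print(f"two threads: {time.perf_counter() - t0:.2f}s")  # roughly the same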

3. Processes vs threads

A process is an isolated unit from the operating system's point of view: its own memory, its own file descriptors, its own interpreter if it's running Python. A thread is a unit of execution inside a process — it shares memory and the interpreter with its siblings.

Processes vs threads in memory

[Diagram: Process A owns one address space, containing the Python interpreter (running main()) and heap memory (objects, bindings).]

A single Python process owns one address space.

Threads are cheap to create and share memory by default, which makes communication trivial but synchronization tricky. Processes are more expensive and don't share memory by default, so you pay for communication (pickling, pipes, shared memory segments) but get isolation for free — including freedom from the GIL, because each process has its own GIL.
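
A tiny sketch of that difference: the same mutation is visible when made by a thread, invisible when made by a child process, because the child receives a pickled copy of the argument.

import multiprocessing
import threading

def appender(bucket: list) -> None:
    bucket.append("hello")

if __name__ == "__main__":
    data: list = []

    t = threading.Thread(target=appender, args=(data,))
    t.start(); t.join()
    print(data)   # ['hello']: the thread mutated the shared list

    p = multiprocessing.Process(target=appender, args=(data,))
    p.start(); p.join()
    print(data)   # still ['hello']: the child appended to its own copy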

3.1 Fork, spawn, and starting methods

On Linux, Python's default multiprocessing start method historically was fork: the child inherits the parent's memory via copy-on-write. Fast, but unsafe if the parent holds locks, threads, or native resources. Spawn (the default on macOS since Python 3.8, and on Windows always) launches a fresh interpreter and pickles the target function and arguments across. Slower to start, much safer.
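
If your code depends on one behavior or the other, it's safer to pin the start method explicitly than to rely on the platform default. A minimal sketch using a multiprocessing context:

import multiprocessing as mp

def work(x: int) -> int:
    return x * x

if __name__ == "__main__":
    ctx = mp.get_context("spawn")       # or "fork" / "forkserver" where available
    with ctx.Pool(processes=4) as pool:
        print(pool.map(work, range(8)))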

4. Sequential vs parallel execution

Two ways to run N units of work:

  • Sequentially — one at a time, in order. Predictable, trivial to debug, no coordination overhead. Total time ≈ sum of each unit's time.
  • In parallel — several at once, across threads or processes. Faster in wall-clock terms when the work is big enough to outweigh coordination cost.

The choice isn't always obvious. Parallelism has overhead: spawning workers, serializing arguments and results, coordinating completion. If each unit is short (milliseconds), the overhead can swamp the benefit — you'll finish slower in parallel than sequentially. The crossover comes where the per-unit work finally outweighs the cost of fanning it out.

4.1 Sequential — the baseline

results = []
for item in items:
    results.append(do_work(item))

Boring, correct, and often the right answer. Use this first. Only reach for parallelism when you have evidence it helps.

4.2 Parallel with threads

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(do_work, items))

Good when do_work is I/O-bound (HTTP calls, file reads, database queries) or calls into a C extension that releases the GIL. No memory copying. Shared state needs locks.

4.3 Parallel with processes

from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(do_work, items))

Good when do_work is CPU-bound pure Python. Each worker gets its own interpreter and its own GIL. Arguments and return values are pickled across process boundaries — keep them small. The if __name__ == "__main__" guard is not optional on spawn systems; without it, the child re-imports your module and tries to spin up more children, ad infinitum.

5. Known pitfalls

5.1 “Threads don't speed up my code”

If the work is CPU-bound pure Python, it won't — the GIL serializes it. Switch to processes, or move the hot path into a C extension.

5.2 “Processes are hanging on startup”

Usually one of: (a) missing if __name__ == "__main__" guard, (b) the target function isn't picklable (closures, lambdas, local functions), (c) CUDA or other native state was initialized in the parent before spawning — many libraries forbid forking after that.
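
Point (b) is easy to reproduce: a module-level function pickles fine, a lambda does not. A sketch:

from concurrent.futures import ProcessPoolExecutor

def square(x: int) -> int:   # module-level: picklable by reference
    return x * x

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(square, range(4))))      # [0, 1, 4, 9]
        # list(pool.map(lambda x: x * x, range(4)))  # PicklingError: can't pickle lambdas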

5.3 “My parallel code is slower than sequential”

Each unit is probably too small relative to coordination overhead. Batch the work (process N items per task), or use a lower-overhead mechanism — threads instead of processes, or the main thread with asyncio for I/O.
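
Executor.map already supports the batching fix through its chunksize parameter (honored by process pools). A sketch, with do_work and items as stand-ins for your real function and inputs:

from concurrent.futures import ProcessPoolExecutor

def do_work(item: int) -> int:   # stand-in for the real per-item function
    return item * item

items = range(10_000)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        # each task now pickles and dispatches 256 items at once,
        # amortizing the per-task coordination overhead across the batch
        results = list(pool.map(do_work, items, chunksize=256))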

5.4 “The shared dict got corrupted”

Even under the GIL, a sequence of bytecode instructions is not atomic — the interpreter can release the GIL between them. Use threading.Lock, queue.Queue, or thread-local storage. Don't assume “the GIL will protect me”; it protects the interpreter from itself, not your data structures from you.
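
A classic demonstration: counter += 1 is a read-modify-write sequence, so unsynchronized increments from several threads can lose updates. A lock makes it correct.

import threading

counter = 0
lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:           # without this, some increments can be lost
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 400000, every time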

6. When to pick what

A short decision tree:

  • One machine, I/O-bound — threads, or asyncio if the library supports it (a minimal sketch follows this list).
  • One machine, CPU-bound pure Python — processes.
  • One machine, CPU-bound via NumPy/Torch/etc. — threads often work, because the heavy lifting happens outside the GIL. Measure first.
  • One machine, tiny units of work — sequential. Seriously. Benchmark before reaching for parallelism.
  • Many machines — a distributed framework. Out of scope here, but the single-machine decisions still matter inside each node.
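
For the I/O-bound case, a minimal asyncio sketch, with asyncio.sleep standing in for a real awaitable network call:

import asyncio

async def fetch(i: int) -> int:
    await asyncio.sleep(0.1)    # placeholder for an async HTTP call
    return i

async def main() -> list[int]:
    # all ten coroutines wait concurrently on a single thread
    return await asyncio.gather(*(fetch(i) for i in range(10)))

print(asyncio.run(main()))      # finishes in ~0.1 s, not ~1 s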

The boring path is usually right: start sequential, measure, then pick the smallest jump (threads or processes) that buys you what you need. Parallelism is a cost you pay for throughput; pay it deliberately.