PyTorch DDP
Environment Setup Before We Begin
Before we start, you need to install PyTorch with CUDA support, which you can do by following the instructions on the official website. The standard PyTorch installation is enough, but I notice some people try to install CUDA manually. Personally, I do not recommend this: there are many settings to configure, especially on Windows. Instead, I strongly recommend creating a virtual environment using miniconda or venv. Below, I will briefly outline the method I usually use for setup.
I prefer Miniconda over Anaconda. It is much lighter because it installs only the essentials needed to create virtual environments.
1. Installing Miniconda:
Linux:
- Go to this Link and download Miniconda3-latest-Linux-x86_64.sh.
- Run bash Miniconda3-latest-Linux-x86_64.sh.
Windows:
- Go to this Link and download/run Miniconda3-latest-Windows-x86_64.exe.
2. Creating a Miniconda Virtual Environment:
Linux:
- In the terminal:
conda create -n your_own_env_name python=3.9
Windows:
- Run Anaconda Prompt from the Start menu.
- In the terminal:
conda create -n your_own_env_name python=3.9
You can set your_own_env_name to whatever name you prefer.
3. Installing Packages within Miniconda (Same for Linux & Windows)
- Run conda activate your_own_env_name in the terminal.
- Select your desired version and OS on this Link and run the command in the terminal (pip / conda doesn't matter).
- *You must run the command that includes CUDA (e.g., conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia).
Additionally, installing numpy nowadays sometimes pulls in version 2. If that causes problems, you can downgrade to version 1 (e.g., pip install numpy==1.26.etc.).
Now, PyTorch and CUDA are automatically installed within your virtual environment. If you wish to install additional CUDA environments like cuDNN, you can run conda install -c anaconda cudatoolkit==[desired version] and conda install -c anaconda cudnn. This will install the desired cudatoolkit version and the matching cudnn version.
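Once everything is installed, a quick sanity check can confirm that PyTorch sees your GPUs. This is a minimal sketch of my own, not part of the setup steps; it only assumes PyTorch is importable:

```python
import torch

def environment_report():
    """Collect basic version/device info to verify the install."""
    info = {
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }
    if info["cuda_available"]:
        # CUDA version PyTorch was built against (not the driver version)
        info["cuda_version"] = torch.version.cuda
    return info

print(environment_report())
```

If cuda_available comes back False, the CPU-only build was likely installed, and rerunning the CUDA-enabled install command above should fix it.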
Applying DDP
Now, let’s get straight to the point. We will look at applying DDP in two parts. The first is the terminal and script input method, and the second is the setting within the Python code.
Terminal and Script Input Method
First, assuming that DDP setup is complete within the Python code, you can input the following:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port 56789 main.py
Let’s look at this step by step.
- CUDA_VISIBLE_DEVICES=0,1,2,3: This specifies which of the available GPUs you will use. For example, if your local machine or server has 8 GPUs, this means you will only use GPUs 0 through 3. If you don't specify this, all GPUs are used.
- torchrun: Before PyTorch version 2.0 we used python -m torch.distributed.launch, but from version 2.0 onward, torchrun lets you launch DDP without typing python -m.
- --nproc_per_node=4: The number of processes to start on this node, one process per GPU. Simply match this to the number of GPUs. (In the example above, since we use 4 GPUs, it is set to 4; the node is the local machine or server.)
- --master_port 56789: This part is actually optional, but it sets the port number used when running DDP. Sometimes, when running multiple DDP jobs on one machine, execution fails because port numbers clash; in that case, just pick any free port number you like.
- main.py: The name of the file you want to execute. (Obviously…)
There are more arguments you can apply. For instance, --master_addr is used in multi-node setups, i.e., when training across multiple servers. However, since this post is intended for those new to DDP and most won't need multiple nodes, I will skip it.
Settings Within Python Code
Before we dive in, there is one short thing to mention. As you can see from arguments like --master_addr above, DDP launches a separate process for each GPU, each with its own environment. In other words, it is easier to understand if you imagine that each GPU runs the Python file in its own sandbox. During this process, environment variables are set for each process (os.environ['variable_name']). Let's briefly touch on two of them.
- os.environ['WORLD_SIZE']: The total number of processes launched by torchrun across all nodes. Since we are explaining based on 1 server (single node) with 4 GPUs, in our example this will be os.environ['WORLD_SIZE'] = 4.
- os.environ['LOCAL_RANK']: The index of each process launched by torchrun on its node. In other words, understand this as the GPU number (0 to 3 in the current example).
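To see what each process actually receives, here is a minimal sketch that reads these variables. The fallback defaults (world size 1, ranks 0) are my own assumption so that the snippet also runs as a plain python invocation without torchrun:

```python
import os

def ddp_env_info():
    """Read the per-process variables that torchrun exports."""
    return {
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total process count
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        "rank": int(os.environ.get("RANK", 0)),              # global process index
    }

print(ddp_env_info())
```

Under torchrun --nproc_per_node=4, each of the 4 processes would print a different local_rank (0 through 3) with the same world_size of 4.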
DDP Initialization
Based on this, let’s see how to set it up within the Python code. Everyone has their own coding style, but I usually write it as follows:
import os
import torch

args.device = 'cuda:0'
args.world_size = 1
args.rank = 0
args.local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index assigned by torchrun
torch.cuda.set_device(args.local_rank)                  # make this GPU the default device
torch.distributed.init_process_group(backend='nccl', init_method='env://')
args.world_size = torch.distributed.get_world_size()    # total number of processes
args.local_rank = torch.distributed.get_rank()          # global rank; equals local rank on one node
This is just my coding style; I often utilize args.xxx. (You could set variables separately, but keeping them inside arguments makes it convenient to use anywhere in the code.)
Let’s examine the code line by line.
- args.device = 'cuda:0' ~ args.rank = 0: You can understand this as the step that initializes the variables needed for DDP with default values.
- args.local_rank = int(os.environ.get("LOCAL_RANK", 0)): As mentioned above, os.environ['LOCAL_RANK'] holds the GPU number where this process is running. Therefore, understand this as reading that number into args.local_rank via os.environ.get() (defaulting to 0 when it is not set).
  - Example 1: If it corresponds to GPU 0, args.local_rank = 0
  - Example 2: If it corresponds to GPU 3, args.local_rank = 3
- torch.cuda.set_device(args.local_rank): Usually, there are two ways to move parameters to the GPU in PyTorch: params.to('cuda:#') and params.cuda(). I'll skip explaining the first method as it is widely used. The second method moves them to the current default GPU, which is normally the first GPU (index 0); set_device can be used to change this. In other words, this part designates the default GPU using the args.local_rank obtained above.
- torch.distributed.init_process_group(backend='nccl', init_method='env://'): Understand this as the part that finally initializes everything so that each GPU can operate correctly. Here, backend='nccl' selects NVIDIA's NCCL library for GPU-to-GPU communication, and init_method='env://' tells PyTorch to read the rendezvous settings from the environment variables torchrun created. Since most people use this form, there is no need to change it.
- args.world_size, args.local_rank: I typically use these two for the finally determined WORLD_SIZE and rank. For args.local_rank, understand it as initializing it once more for verification purposes (on a single node, the global rank equals the local rank).
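The initialization steps above can be wrapped in a small helper. This is only a sketch of the same logic: the gloo fallback (so it can also run on CPU-only machines) and the cleanup function are my additions, not part of the code above; real multi-GPU runs use nccl:

```python
import os
import torch
import torch.distributed as dist

def setup_ddp():
    """Initialize the process group as in the snippet above.

    Assumption: falls back to the CPU 'gloo' backend when CUDA is absent,
    so this sketch runs without GPUs; real DDP training uses 'nccl'.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)  # make this process's GPU the default
        backend = "nccl"
    else:
        backend = "gloo"
    dist.init_process_group(backend=backend, init_method="env://")
    return local_rank, dist.get_world_size(), dist.get_rank()

def cleanup_ddp():
    """Tear down the process group when training finishes."""
    dist.destroy_process_group()
```

With init_method='env://', torchrun supplies MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE; if you launch without torchrun, you must set those variables yourself.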
Applying DDP to the Model
Next, let’s look at how to apply DDP to the model and train it. (It’s very simple.)
from torch.nn.parallel import DistributedDataParallel as DDP

model = model.cuda()  # move the model to this process's default GPU first
model = DDP(model, device_ids=[args.local_rank])
...
logits = model(x)
loss = loss_fn(logits, labels)
loss.backward()  # gradients are synchronized across GPUs here
As you can see, the model is distributed using DistributedDataParallel provided by PyTorch. In this process, it is placed on each GPU using the args.local_rank designated earlier. (Note: device_ids=[args.local_rank] requires the brackets, because the argument is a list.)
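Putting the pieces together, here is a toy sketch of one training step. Everything in it is illustrative: the Linear model, SGD optimizer, random data, and the single-process CPU (gloo) group that lets it run without torchrun are my assumptions; under torchrun with GPUs you would instead use backend='nccl' and device_ids=[args.local_rank] as shown above:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_one_step():
    """One illustrative training step; model, optimizer, and data are toys."""
    # Single-process CPU (gloo) group so the sketch runs anywhere;
    # a real run launched by torchrun would get these variables for free.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29612")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo", init_method="env://")

    model = DDP(torch.nn.Linear(8, 2))  # stand-in for a real model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    x = torch.randn(4, 8)               # toy batch of 4 samples
    labels = torch.randint(0, 2, (4,))  # toy binary labels
    logits = model(x)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()   # DDP all-reduces gradients across processes here
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()

print(train_one_step())
```

The training loop itself is unchanged from single-GPU code; DDP hooks into loss.backward() to average gradients across processes.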
Additional Tip - export NCCL_P2P_DISABLE=1
Unfortunately, sometimes it doesn't work smoothly… It might fail for various reasons, like misconfiguration or version conflicts. Still, nowadays, thanks to LLMs (ChatGPT, Claude, Gemini, etc.), debugging has become much easier if you can capture the error message well. I highly recommend utilizing them properly if it's not an internal code issue.
However… there are times when no error message appears, and you get stuck loading indefinitely. This was my case: right when entering the DDP constructor, it would suddenly freeze and then hang forever. I don't know the exact cause, but it seems to be a situation where something goes wrong while the GPUs communicate during DDP's internal setup… (I've even tried debugging by printing everything inside the PyTorch framework.)
In such cases, you can try export NCCL_P2P_DISABLE=1, which tells NCCL not to use direct GPU-to-GPU (peer-to-peer) transfers. If you run this once in the terminal before running the code, it often works smoothly. Haha.
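If you prefer not to export it in the shell, the same variable can be set from inside Python. The flag is the real NCCL environment variable; placing it at the top of the script is just my suggested sketch, and the only requirement is that it runs before init_process_group:

```python
import os

# Must be set before torch.distributed.init_process_group()
# so that NCCL sees it when the communicator is created.
os.environ["NCCL_P2P_DISABLE"] = "1"
```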
Conclusion…
This was my first post about coding, and I hope this information is beneficial to those who are new to PyTorch and DDP. Haha. From my own experience, and from looking around, I see people struggling for quite a long time when trying DDP for the first time. (ㅠㅠ) There are times when blog posts don't help and LLMs are unkind, so I hope this post is a big help in those cases!