[Coding] Super Easy Guide to Applying PyTorch DDP

 
This is my first coding-related post. The topic is DDP. Recently, as model capacities have grown, using multiple GPUs has become essential. Consequently, utilizing DDP effectively has become very important. Therefore, in this post, I will share how to apply DDP. I will cut to the chase regarding the general mechanics and focus simply and clearly on the arguments. (I will proceed with the method I personally use!)

PyTorch DDP

Environment Setup Before We Begin

Before we start, you need to install PyTorch + CUDA, which you can do by following the official website. You can proceed with the PyTorch installation as is, but I notice some people try to install CUDA manually. Personally, I do not recommend this. There are many settings to configure, especially on Windows. Therefore, I strongly recommend creating a virtual environment using miniconda or venv. Below, I will briefly outline the method I usually use for setup.

I prefer utilizing miniconda over anaconda. It is much lighter because it installs only the absolutely essential elements for creating a virtual environment.

1. Installing Miniconda:

Linux:

  1. Go to this Link and download Miniconda3-latest-Linux-x86_64.sh.
  2. Run bash Miniconda3-latest-Linux-x86_64.sh.

Windows:

  1. Go to this Link and download/run Miniconda3-latest-Windows-x86_64.exe.

2. Creating a Miniconda Virtual Environment:

Linux:

  1. In the terminal: conda create -n your_own_env_name python=3.9

Windows:

  1. Run Anaconda Prompt from the Start menu.
  2. In the terminal: conda create -n your_own_env_name python=3.9

You can set your_own_env_name to whatever name you prefer.

3. Installing Packages within Miniconda (Same for Linux & Windows)

  1. Run conda activate your_own_env_name in the terminal.
  2. Select your desired version and OS on this Link and run the command in the terminal (pip / conda doesn’t matter).
  3. *You must run the command that includes CUDA (e.g., conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia)

Additionally, numpy version 2 is sometimes installed these days. If that causes compatibility errors, you can pin it back to version 1. (e.g., pip install "numpy<2")

Now, PyTorch and CUDA are automatically installed within your virtual environment. If you wish to install additional CUDA environments like cuDNN, you can run conda install -c anaconda cudatoolkit==[desired version] and conda install -c anaconda cudnn. This will install the desired cudatoolkit version and the matching cudnn version.
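Once the install finishes, here is a quick sanity check I suggest running inside the activated environment (this snippet is my own suggestion, not from any official guide) to confirm that PyTorch was built with CUDA support and can see your GPUs:

```python
import torch

# Report what was actually installed. cuda.is_available() returns False if
# you accidentally installed the CPU-only build or the driver is missing.
cuda_ok = torch.cuda.is_available()
print(torch.__version__)
print(cuda_ok)
if cuda_ok:
    # Number of visible GPUs and the name of the first one.
    print(torch.cuda.device_count(), torch.cuda.get_device_name(0))
```

If the second line prints False, revisit step 3 and make sure you ran the install command that includes CUDA.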

Applying DDP

Now, let’s get straight to the point. We will look at applying DDP in two parts. The first is the terminal and script input method, and the second is the setting within the Python code.

Terminal and Script Input Method

First, assuming that DDP setup is complete within the Python code, you can input the following:

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port 56789 main.py

Let’s look at this step by step.

  • CUDA_VISIBLE_DEVICES=0,1,2,3: This specifies which GPUs you will use among the ones available to you. For example, if your local machine or server has 8 GPUs, this means you will only use GPUs 0 through 3. If you don’t specify this separately, it will use all of them.
  • torchrun: The older launcher was python -m torch.distributed.launch, but torchrun (available since PyTorch 1.10 and the recommended entry point today) lets you execute DDP without needing to type python -m.
  • --nproc_per_node=4: This decides how many processes to launch on that node. In the standard DDP setup, each process drives one GPU, so simply match this to the number of GPUs.
    (In the example above, since we use 4 GPUs, it is set to 4; the node is the local machine or server).
  • --master_port 56789: This part is optional; it sets the port the processes use to rendezvous (the default is 29500). When running multiple DDP jobs on the same machine or server, execution can fail if port numbers overlap. In that case, you can input any free port number you like.
  • main.py: The name of the file you want to execute. (Obviously…)

There are more arguments you can apply. For instance, there is --master_addr, which is used when utilizing multi-nodes, i.e., multiple servers. However, since this post is intended for those new to DDP and most won’t need to use multi-nodes, I will skip this.
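Putting the pieces together, here are a couple of hedged variations of the launch command (main.py and the port numbers are placeholders; adjust them to your own script and environment):

```shell
# Use every GPU on the machine (CUDA_VISIBLE_DEVICES omitted):
torchrun --nproc_per_node=4 --master_port 56789 main.py

# Use only GPUs 4-7 of an 8-GPU server. Note that nproc_per_node is still 4,
# because four processes are launched, one per visible GPU:
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node=4 --master_port 56790 main.py

# Quick single-GPU sanity run of the same script:
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 main.py
```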

Settings Within Python Code

Before we dive in, there is one short thing to mention. As you can guess from arguments like --master_addr above, what torchrun actually does is launch a separate Python process for each GPU. In other words, it is easier to understand if you imagine each GPU running its own copy of the Python file. During this process, torchrun sets environment variables for each process (readable via os.environ['variable_name']). Let's briefly touch upon two of them.

  • os.environ['WORLD_SIZE']: The total number of processes launched across all nodes, i.e., (number of nodes) × (nproc_per_node). Since we are explaining based on 1 server (single node) with 4 GPUs, here os.environ['WORLD_SIZE'] is 4.
  • os.environ['LOCAL_RANK']: The index of each process within its node, assigned by torchrun. In other words, understand this as the GPU number (0 to 3 in the current example).
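To see what these variables look like in practice, here is a small stand-alone sketch (no GPU needed). It reads the variables the same way a DDP script would; the defaults are fallbacks so it also runs when launched with plain python instead of torchrun:

```python
import os

# torchrun exports these for every process it spawns; the defaults below
# let the same script run directly with `python` as well.
world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # process index on this node
rank = int(os.environ.get("RANK", 0))              # global process index

print(f"world_size={world_size} rank={rank} local_rank={local_rank}")
# Under `torchrun --nproc_per_node=4 main.py` on one machine, four copies of
# this script run, each printing world_size=4 with its own rank 0..3.
```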

DDP Initialization

Based on this, let’s see how to set it up within the Python code. Everyone has their own coding style, but I usually write it as follows:

import os
import torch
import torch.distributed

args.device = 'cuda:0'
args.world_size = 1
args.rank = 0
args.local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
args.world_size = torch.distributed.get_world_size()
args.local_rank = torch.distributed.get_rank()  # global rank; equal to the local rank on a single node

This is just my coding style; I often utilize args.xxx. (You could set variables separately, but keeping them inside arguments makes it convenient to use anywhere in the code.)

Let’s examine the code line by line.

  • args.device='cuda:0' ~ args.rank=0: You can understand this as the step to initialize variables needed for DDP.
  • args.local_rank = int(os.environ.get("LOCAL_RANK", 0)): As mentioned above, os.environ['LOCAL_RANK'] holds the GPU number where this process is running. The get() function reads it into args.local_rank, and the default of 0 lets the same script also run without torchrun.
    • Example 1: If it corresponds to GPU 0, args.local_rank = 0
    • Example 2: If it corresponds to GPU 3, args.local_rank = 3
  • torch.cuda.set_device(args.local_rank): Usually, there are two ways to move parameters to the GPU in PyTorch: params.to('cuda:#') and params.cuda(). I’ll skip explaining the first method as it is widely used. The second method moves them to the current default GPU. At this time, the default GPU is usually set to the first GPU (index 0). set_device can be used to change this. In other words, this part designates the default GPU using the args.local_rank specified above.
  • torch.distributed.init_process_group(backend='nccl', init_method='env://'): Understand this as the part that finally initializes everything so that each GPU process can talk to the others. Here, backend='nccl' selects NCCL, NVIDIA's collective-communication library, which is the right choice for CUDA GPUs, and init_method='env://' tells PyTorch to read the rendezvous information (address, port, rank, world size) from the environment variables torchrun set. Since most people use this form, there is no need to change it.
  • args.world_size, args.local_rank: I typically use these two for the finally designated WORLD_SIZE and LOCAL_RANK. Note that get_rank() actually returns the global rank, which equals the local rank when there is only a single node, so understand this line as initializing the value once more for verification purposes.
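For convenience, the steps above can be packaged into a helper. This is a hypothetical wrapper of my own (the function name and the fallback port 29611 are assumptions, not from PyTorch): when the script is started with plain python instead of torchrun, it fakes a one-process rendezvous and falls back to the CPU 'gloo' backend, which makes debugging easier.

```python
import os
import torch
import torch.distributed as dist

def setup_distributed():
    """Hypothetical helper: initialize the process group, falling back to a
    single-process CPU group when not launched via torchrun."""
    if "RANK" not in os.environ:
        # torchrun did not launch us: fake a 1-process rendezvous for debugging.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29611")  # any free port
        os.environ["RANK"] = "0"
        os.environ["WORLD_SIZE"] = "1"
        os.environ["LOCAL_RANK"] = "0"
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    local_rank = int(os.environ["LOCAL_RANK"])
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_world_size(), dist.get_rank(), local_rank

world_size, rank, local_rank = setup_distributed()
print(world_size, rank, local_rank)
dist.destroy_process_group()
```

Run directly with python, this initializes a trivial 1-process group; launched with torchrun as shown earlier, the same code picks up the real ranks.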

Applying DDP to the Model

Next, let’s look at how to apply DDP to the model and train it. (It’s very simple.)

from torch.nn.parallel import DistributedDataParallel as DDP

model = model.cuda()  # the model must be on this process's GPU before wrapping
model = DDP(model, device_ids=[args.local_rank])
...
logits = model(x)
loss = loss_fn(logits, labels)
loss.backward() 

As you can see, the model is wrapped with DistributedDataParallel provided by PyTorch. In this process, it is placed on each GPU using the args.local_rank designated earlier, and gradients are synchronized across processes automatically during loss.backward(). (Note: device_ids expects a list, which is why args.local_rank is wrapped in brackets.)
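One piece the snippet above does not show is the data side: each process should see a different shard of the dataset, which is what DistributedSampler is for. Here is a hedged stand-alone sketch (the toy TensorDataset is mine; in real code, num_replicas and rank are filled in automatically from the process group, and I pass them explicitly only so this runs without DDP initialized):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 100 samples; replace with your own Dataset in practice.
dataset = TensorDataset(torch.arange(100).float())

# With DDP initialized you can just write DistributedSampler(dataset);
# num_replicas=4, rank=0 here mimic the view of GPU 0 in our 4-GPU example.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(2):
    # Re-seed the sampler every epoch so all processes shuffle consistently.
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass  # forward / loss / backward as usual

print(len(sampler))  # each of the 4 processes sees 25 of the 100 samples
```

Forgetting set_epoch(epoch) is a classic mistake: without it, every epoch reuses the same shuffle order.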

Additional Tip - export NCCL_P2P_DISABLE=1

Unfortunately, sometimes it doesn’t work smoothly… It might fail due to various reasons like twisted configurations or conflicts. Still, nowadays, thanks to LLMs (ChatGPT, Claude, Gemini, etc.), debugging has become much easier if you can catch the error message well. I highly recommend utilizing them properly if it’s not an internal code issue.

However… there are times when no error message appears at all, and you get stuck in infinite loading. This was my case; right when entering the DDP wrapper, it would suddenly freeze and hang indefinitely. I don't know the exact cause, but it seems to be a situation where inter-GPU communication stalls while DDP is setting up the model internally… (I've even tried debugging by printing everything inside the PyTorch framework…)

In such cases, you can try export NCCL_P2P_DISABLE=1. This disables direct GPU-to-GPU (peer-to-peer) transfers, forcing NCCL to communicate through host memory instead; it is slightly slower, but it sidesteps hangs caused by broken P2P links. If you run this once in the terminal and then run the code, it often works smoothly. Haha.
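For what it's worth, here are the NCCL environment variables I reach for when hangs like this happen (set them in the terminal before launching; NCCL_DEBUG is an extra suggestion of mine on top of the tip above):

```shell
# Print NCCL's internal log so silent hangs at least leave a trace:
export NCCL_DEBUG=INFO

# Disable direct GPU-to-GPU (peer-to-peer) transfers; communication then
# goes through host memory. Slower, but it sidesteps broken P2P links:
export NCCL_P2P_DISABLE=1

torchrun --nproc_per_node=4 main.py
```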

Conclusion…

This was my first time posting about coding, and I hope this information is beneficial to those who are new to PyTorch and DDP. Haha. From my own experience, and from looking around, people often struggle for quite a long time when trying DDP for the first time. (ㅠㅠ) There are times when blog posts don't help and LLMs are unkind, so I hope this post is a big help in those cases!