11 GPU Programming Tips for Improving PyTorch Performance

Author: 手把手PythonAI編程 2024-10-25 15:48:21

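PyTorch is a widely used deep learning framework whose dynamic computation graphs make it well suited to rapid prototyping and research. As models and datasets grow, making full use of the GPU to speed up training becomes increasingly important. This article walks through 11 practical tips for optimizing the performance of PyTorch code.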

Tip 1: Use .to(device) to move data

In PyTorch, the .to(device) method moves tensors and models onto the GPU. This is the prerequisite for using the GPU's compute power: a model and its inputs must live on the same device.

Example code:

import torch

# Create the device object
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move a tensor to the GPU (floating point, to match the model's parameters)
x = torch.tensor([1.0, 2.0, 3.0]).to(device)
y = torch.tensor([4.0, 5.0, 6.0], device=device)  # create directly on the device

# Move the model to the GPU
model = torch.nn.Linear(3, 1).to(device)

print(x)
print(y)
print(next(model.parameters()).device)

Output (on a machine with a CUDA GPU):

tensor([1., 2., 3.], device='cuda:0')
tensor([4., 5., 6.], device='cuda:0')
cuda:0
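
One common refinement of the transfer step can be sketched as follows, assuming that batches start in ordinary CPU memory. Page-locked (pinned) host memory lets the copy to the GPU be issued asynchronously with non_blocking=True. The helper name to_device_async is hypothetical and not part of PyTorch; the sketch falls back to a plain copy when no CUDA device is present.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def to_device_async(batch: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Hypothetical helper: copy a CPU tensor to `device`, asynchronously if possible."""
    if device.type == "cuda":
        # Pinned (page-locked) memory is needed for a truly asynchronous copy
        batch = batch.pin_memory()
        return batch.to(device, non_blocking=True)
    return batch.to(device)

batch = torch.randn(4, 3)
model = torch.nn.Linear(3, 1).to(device)
output = model(to_device_async(batch, device))
print(output.shape, output.device.type)
```

In a real input pipeline the pinning is usually delegated to DataLoader(pin_memory=True) rather than done by hand for each batch.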

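Tip 2: Use torch.no_grad() to reduce memory consumption

During training, torch.autograd records every operation so that gradients can be computed. When evaluating a model, gradients are not needed, so wrapping the forward pass in torch.no_grad() skips that bookkeeping and reduces memory use.

Example code:

model.eval()  # switch layers such as dropout and batch norm to evaluation mode
sample = torch.randn(1, 3, device=device)

with torch.no_grad():
    predictions = model(sample)
    print(predictions.requires_grad)

Output:

False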
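
The effect on autograd can be checked directly. The following minimal sketch compares an ordinary forward pass with one under torch.no_grad() and one under torch.inference_mode(), a stricter variant available since PyTorch 1.9. The model and input sizes are arbitrary choices for illustration.

```python
import torch

model = torch.nn.Linear(3, 1)
sample = torch.randn(2, 3)

tracked = model(sample)          # autograd records this forward pass
with torch.no_grad():
    untracked = model(sample)    # no graph is built
with torch.inference_mode():
    inferred = model(sample)     # stricter variant of no_grad

print(tracked.requires_grad, untracked.requires_grad, inferred.requires_grad)
# True False False
```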

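Tip 3: Set torch.backends.cudnn.benchmark = True to speed up convolution layers

The cuDNN library provides highly optimized convolution implementations. With torch.backends.cudnn.benchmark = True, cuDNN times several convolution algorithms the first time it sees a given input configuration and reuses the fastest one afterwards. This pays off when input shapes stay fixed; if shapes change frequently, the repeated benchmarking can slow training down.

Example code: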

torch.backends.cudnn.benchmark = True

conv_layer = torch.nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1).to(device)
input_tensor = torch.randn(1, 3, 32, 32).to(device)

output = conv_layer(input_tensor)
print(output.shape)

Output:

torch.Size([1, 32, 32, 32])
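
Whether the flag helps on a given model is best measured. The sketch below times a convolution with the flag off and on; time_conv is a hypothetical helper, and the layer and input sizes are arbitrary. GPU kernels run asynchronously, so the code calls torch.cuda.synchronize() before reading the clock. On a CPU-only machine the flag has no effect and the two timings should be similar.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def time_conv(benchmark: bool, iters: int = 20) -> float:
    """Return the average forward time in milliseconds for a fixed input shape."""
    torch.backends.cudnn.benchmark = benchmark
    conv = torch.nn.Conv2d(3, 32, kernel_size=3, padding=1).to(device)
    inputs = torch.randn(8, 3, 64, 64, device=device)
    with torch.no_grad():
        for _ in range(3):            # warm-up; algorithm selection happens here
            conv(inputs)
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for queued kernels before timing
        start = time.perf_counter()
        for _ in range(iters):
            conv(inputs)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / iters

for flag in (False, True):
    print(f"benchmark={flag}: {time_conv(flag):.3f} ms per forward pass")
```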

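Tip 4: Load data in parallel with torch.utils.data.DataLoader

Data loading is often one of the bottlenecks in training. With num_workers greater than 0, DataLoader prepares batches in worker subprocesses while the GPU computes. The dataset should stay on the CPU, and pin_memory=True lets the copy to the GPU run asynchronously. On platforms that start workers with spawn (Windows and macOS by default), the loop must sit under an if __name__ == "__main__": guard.

Example code:

from torch.utils.data import DataLoader, TensorDataset

# Keep the dataset on the CPU; worker processes should not handle CUDA tensors
features = torch.randn(256, 3)
targets = torch.randn(256, 1)
dataset = TensorDataset(features, targets)
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)

for inputs, labels in data_loader:
    # Pinned memory allows the host-to-device copy to be asynchronous
    inputs = inputs.to(device, non_blocking=True)
    outputs = model(inputs)
    print(outputs.shape)

Output (one line per batch, 8 in total):

torch.Size([32, 1])
...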

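Tip 5: Use mixed-precision training

Mixed-precision training combines single-precision (FP32) and half-precision (FP16) arithmetic, which can substantially reduce memory use and speed up training on GPUs with hardware FP16 support. autocast chooses the precision for each operation, and GradScaler scales the loss so that small FP16 gradients do not underflow to zero. Recent PyTorch releases expose the same functionality as torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda"); the torch.cuda.amp names used here may emit a deprecation warning there.

Example code:

from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(3, 1).to(device)
scaler = GradScaler()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Floating-point inputs and targets with matching shapes
inputs = torch.randn(64, 3, device=device)
targets = torch.randn(64, 1, device=device)

for i in range(10):
    optimizer.zero_grad()

    with autocast():
        output = model(inputs)
        loss = torch.nn.functional.mse_loss(output, targets)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    print(f"Iteration {i + 1}: Loss = {loss.item():.4f}")

Output (the loss values depend on the random initialization and data):

Iteration 1: Loss = ...
Iteration 2: Loss = ...
Iteration 3: Loss = ...
...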
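
To see what autocast actually changes, the dtypes involved can be inspected. This is a minimal sketch using the device-agnostic torch.autocast context manager; the choice of bfloat16 on CPU is an assumption made so the example also runs without a GPU.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Assumption: float16 on CUDA, bfloat16 on CPU so the sketch runs without a GPU
low_precision = torch.float16 if device.type == "cuda" else torch.bfloat16

model = torch.nn.Linear(3, 1).to(device)
inputs = torch.randn(4, 3, device=device)
targets = torch.randn(4, 1, device=device)

with torch.autocast(device_type=device.type, dtype=low_precision):
    output = model(inputs)
    # Cast explicitly so the loss is computed in float32
    loss = torch.nn.functional.mse_loss(output.float(), targets)

print(model.weight.dtype)  # parameters stay in float32
print(output.dtype)        # the linear layer ran in reduced precision
print(loss.dtype)          # the loss is in float32
```

The parameters are never converted: autocast only casts the inputs of selected operations, which is why the optimizer still updates full-precision weights.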

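Tip 6: Use torch.compile to speed up model execution

Starting with PyTorch 2.0, torch.compile can compile a model into optimized kernels, which speeds up execution. Compilation happens on the first call, so that call is slow; the benefit appears on subsequent calls with the same input shapes.

Example code:

model = torch.nn.Linear(3, 1).to(device)
compiled_model = torch.compile(model)

inputs = torch.randn(8, 3, device=device)
output = compiled_model(inputs)  # the first call triggers compilation
print(output.shape)

Output:

torch.Size([8, 1])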

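Tip 7: Optimize models with torch.jit.trace or torch.jit.script

The TorchScript JIT compiler converts a model into a static graph representation that can run without the Python interpreter and may execute faster. torch.jit.trace records the operations executed for one example input, so data-dependent control flow is not captured; torch.jit.script compiles the Python source and preserves it.

Example code:

example = torch.randn(1, 3, device=device)

traced_model = torch.jit.trace(model, example)
scripted_model = torch.jit.script(model)

traced_output = traced_model(example)
scripted_output = scripted_model(example)

print(torch.allclose(traced_output, scripted_output))

Output:

True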
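
The practical difference between the two entry points shows up with data-dependent control flow. In the sketch below, shift is a hypothetical function with a branch on the input values: tracing records only the branch taken by the example input, while scripting keeps the condition.

```python
import warnings
import torch

def shift(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent branch: which path runs depends on the values in x
    if x.sum() > 0:
        return x + 10
    return x - 10

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # tracing warns about the tensor-to-bool conversion
    traced = torch.jit.trace(shift, torch.ones(3))
scripted = torch.jit.script(shift)

negative = -torch.ones(3)
print(traced(negative))    # follows the branch recorded during tracing: x + 10
print(scripted(negative))  # evaluates the condition at run time: x - 10
```

For a plain feed-forward model without such branches, both approaches produce the same graph, which is why the example above prints True.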

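Tip 8: Use torch.distributed for distributed training

For large models or datasets, training can be distributed across several GPUs or machines to shorten training time further. DistributedDataParallel (DDP) runs one process per GPU and averages gradients across processes. The snippet below is a single-process demonstration; real jobs are normally launched with torchrun, which sets the rank, world size, and rendezvous address for each process.

Example code:

import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process demo; torchrun normally sets these variables for each rank
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = torch.nn.Linear(3, 1).to(device)
model = DDP(model, device_ids=[0])

output = model(torch.randn(8, 3, device=device))
print(output.shape)

dist.destroy_process_group()

Output:

torch.Size([8, 1])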

Tip 9: Use torch.profiler for performance analysis

Performance analysis is a key step in optimizing code. torch.profiler helps identify bottlenecks so that optimization effort goes where it matters.

Example code:

from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(3, 1).to(device)
x = torch.randn(1000, 3).to(device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        output = model(x)

print(prof.key_averages().table(sort_by="cuda_time_total"))

Output:

The profiler prints a table with one row per recorded operation (model_inference, aten::linear, aten::addmm, and so on) and columns for self and total CPU time, self and total CUDA time, and the number of calls. The actual timings depend on the hardware and the PyTorch version.
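
A variant that also runs on machines without a GPU can be sketched as follows. It enables CUDA profiling only when a device is present, performs one warm-up call so that one-time initialization is not attributed to the measured region, and limits the table to a few rows. Sorting by cpu_time_total is a choice made here so the same key works on both CPU and GPU.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
activities = [ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)

model = torch.nn.Linear(3, 1).to(device)
inputs = torch.randn(1000, 3, device=device)

with torch.no_grad():
    model(inputs)  # warm-up so one-time initialization is not profiled
    with profile(activities=activities) as prof:
        with record_function("model_inference"):
            model(inputs)

# Show only the five most expensive rows by total CPU time
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```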

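Tip 10: Use torch.cuda.empty_cache() to release cached GPU memory

PyTorch's caching allocator holds on to GPU memory that tensors have freed so that later allocations are fast. torch.cuda.empty_cache() returns those unused cached blocks to the driver, which helps when other processes need the memory or when fragmentation is a problem. It does not free tensors that are still referenced, so delete them first.

Example code:

import torch

# Create a large tensor (about 400 MB in float32)
x = torch.randn(10000, 10000, device=device)

# Perform some operation
y = x * 2

# Drop the references, then release the cached memory
del x
del y
torch.cuda.empty_cache()

# Check GPU memory usage
print(torch.cuda.memory_allocated(device))
print(torch.cuda.memory_reserved(device))

Output (in a fresh process where no other tensors live on the GPU):

0
0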
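
The difference between allocated and reserved memory can be observed step by step. In this sketch, gpu_memory_mib is a hypothetical helper; on a machine without CUDA it simply reports zeros.

```python
import torch

def gpu_memory_mib() -> tuple:
    """Return (allocated, reserved) GPU memory in MiB; (0.0, 0.0) without CUDA."""
    if not torch.cuda.is_available():
        return (0.0, 0.0)
    return (torch.cuda.memory_allocated() / 2**20,
            torch.cuda.memory_reserved() / 2**20)

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # 4 MiB of float32
    print("after alloc:      ", gpu_memory_mib())
    del x                                       # tensor freed, block stays cached
    print("after del:        ", gpu_memory_mib())
    torch.cuda.empty_cache()                    # cached block returned to the driver
    print("after empty_cache:", gpu_memory_mib())
else:
    print("No CUDA device; nothing to measure:", gpu_memory_mib())
```

After del, the allocated figure drops while the reserved figure typically stays the same; only empty_cache() lowers the reserved figure.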

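Tip 11: Use torch.cuda.nvtx for fine-grained performance analysis

torch.cuda.nvtx inserts named ranges into the code. When the program runs under NVIDIA Nsight Systems or Nsight Compute, these ranges appear in the tool, which allows fine-grained performance analysis of individual code regions.

Example code: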

import torch
import torch.cuda.nvtx as nvtx

model = torch.nn.Linear(3, 1).to(device)
x = torch.randn(1000, 3).to(device)

nvtx.range_push("model_inference")
output = model(x)
nvtx.range_pop()

print(output.shape)

Output:

torch.Size([1000, 1])
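
The push and pop calls can also be written as a context manager, which guarantees that every range is closed. The sketch below uses torch.cuda.nvtx.range for this; nvtx_range is a hypothetical wrapper that becomes a no-op on machines without CUDA so the same script runs everywhere.

```python
import contextlib
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def nvtx_range(name: str):
    """Hypothetical helper: an NVTX range on CUDA machines, a no-op elsewhere."""
    if device.type == "cuda":
        return torch.cuda.nvtx.range(name)
    return contextlib.nullcontext()

model = torch.nn.Linear(3, 1).to(device)
inputs = torch.randn(1000, 3, device=device)

with nvtx_range("forward"):
    output = model(inputs)
with nvtx_range("reduce"):
    total = output.sum()

print(output.shape, total.item())
```

The ranges only become visible when the script is run under a profiler, for example with nsys profile followed by the usual python command.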

Case study: Optimizing an image classification model

Suppose we have a simple image classification task and train a ResNet-18 model on it. We apply the tips above to optimize training performance.

Case code:

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast, GradScaler
from torch.profiler import profile, record_function, ProfilerActivity

# Select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Input shapes are fixed at 224x224, so cuDNN autotuning is worthwhile
torch.backends.cudnn.benchmark = True

# Data preprocessing
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load the dataset
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4, pin_memory=True)

# Define the model (CIFAR-10 has 10 classes)
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=False, num_classes=10).to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scaler = GradScaler()

# Mixed-precision training
def train_one_epoch(profile_first_batch=False):
    model.train()
    for step, (inputs, labels) in enumerate(train_loader):
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        optimizer.zero_grad()

        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        # Profile one forward pass only; profiling every step would dominate the run time
        if profile_first_batch and step == 0:
            with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
                with record_function("model_inference"):
                    with torch.no_grad():
                        _ = model(inputs)

            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train_one_epoch(profile_first_batch=(epoch == 0))

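Case analysis:

Data loading: DataLoader with num_workers=4 loads and preprocesses batches in four worker processes, which speeds up data loading.
Mixed-precision training: autocast and GradScaler enable mixed-precision training, which reduces memory use and speeds up training.
Performance analysis: torch.profiler identifies bottlenecks in the training process. Profiling adds overhead of its own, so it is best limited to a few steps.
Memory management: torch.cuda.empty_cache() can be called after each epoch to return cached memory to the driver when other processes need it. It does not reduce the memory held by live tensors.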

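Summary

The 11 tips above can significantly improve the performance of PyTorch code, especially when training deep learning models on a GPU. They cover data transfer, memory management, mixed-precision training, compilation, distributed training, and profiling, and they help you make full use of the hardware and shorten training time. I hope you find them useful!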

Editor: 趙寧寧 Source: 手把手PythonAI編程
