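Meta's "Lightweight" KernelLLM Upends GPU Kernel Generation: 8B Parameters Outscore GPT-4o

Author: 新智元 (Xinzhiyuan), 2025-05-27 15:19:52

Meta has released KernelLLM, an 8B model fine-tuned from Llama 3.1 that automatically converts PyTorch code into efficient Triton GPU kernels. According to the reported measurements, its single-shot inference performance exceeds that of GPT-4o and DeepSeek V3, and its score rises sharply when multiple candidates are generated.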

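In AI, parameter count was long treated as the ceiling on performance.

Meta's newly released KernelLLM, at just 8B parameters, outscores the 200B-parameter GPT-4o on GPU kernel generation.

It is an 8B-parameter model fine-tuned from Llama 3.1 Instruct, designed to automatically convert PyTorch modules into efficient Triton GPU kernels.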

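KernelLLM is a practical tool for GPU kernel development: it achieves stronger performance with fewer parameters and is simple to use.

With only 8B parameters, it outperforms GPT-4o and DeepSeek V3 in single-shot inference on KernelBench-Triton Level 1.

With multiple inference passes, KernelLLM outperforms DeepSeek R1.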

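All of this comes from a model roughly one to two orders of magnitude smaller than its competitors (8B versus 200B and 671B parameters).

@Denis Kanonik quipped: "Was this trained on the test set again?"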

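KernelLLM makes kernel development more accessible

KernelLLM is an 8B model based on Llama 3.1 Instruct, trained specifically for the task of writing GPU kernels in Triton.

It aims to make GPU programming simpler by automating the generation of high-performance GPU kernels.

By automatically generating efficient Triton implementations, KernelLLM addresses the growing demand for high-performance GPU kernels.

As workloads grow and accelerator architectures diversify, the need for custom kernel solutions has increased significantly.

Many existing tools either optimize only at test time or tune exclusively for KernelBench problems, which makes it hard for them to cover broader scenarios.

KernelLLM is the first LLM fine-tuned on external (PyTorch, Triton) code pair data.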

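Triton kernel generation workflow

Given PyTorch code as input, KernelLLM generates candidate Triton kernels.

The candidates are then validated with unit tests that run them on random inputs and check the outputs. When several candidates are generated, they can be compared and the best one selected.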

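Figure: KernelLLM's Triton kernel generation workflow. KernelLLM translates PyTorch code into candidate Triton kernels. The generated code is validated with unit tests that run the kernel on random input data of known shapes. The workflow supports generating multiple candidates (evaluated with pass@k), raising the number of candidates to improve quality, and finally outputs the best Triton kernel implementation (the green part of the figure).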
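
The generate, verify, and select loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: the candidates are ordinary Python callables standing in for generated Triton kernels, and the names `reference`, `is_correct`, `runtime`, and `pick_best` are hypothetical, not part of KernelLLM.

```python
import math
import random
import time

def reference(xs):
    """Stand-in for the original PyTorch module: sum of squares."""
    return sum(x * x for x in xs)

# Hypothetical "generated candidates"; candidate_b is deliberately wrong.
def candidate_a(xs):
    return sum(x ** 2 for x in xs)

def candidate_b(xs):
    return sum(abs(x) for x in xs)  # bug: sums |x| instead of x^2

def candidate_c(xs):
    return math.fsum(x * x for x in xs)

def is_correct(candidate, ref, trials=20, size=64, tol=1e-9, seed=0):
    """Unit test: compare candidate and reference on random inputs of a known shape."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        if not math.isclose(candidate(xs), ref(xs), rel_tol=tol, abs_tol=tol):
            return False
    return True

def runtime(fn, size=4096, repeats=5, seed=1):
    """Best-of-N wall-clock time of fn on one fixed random input."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(xs)
        best = min(best, time.perf_counter() - start)
    return best

def pick_best(candidates, ref):
    """Keep only candidates that pass the unit test, then return the fastest one."""
    passing = [c for c in candidates if is_correct(c, ref)]
    if not passing:
        return None
    return min(passing, key=runtime)

best = pick_best([candidate_a, candidate_b, candidate_c], reference)
print(best.__name__ if best else "no candidate passed")
```

In a real setup the correctness check would compare tensors on the GPU and the timing would need device synchronization; the control flow, however, stays the same.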

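To train the model, the team used more than 25,000 (PyTorch, Triton) code example pairs, along with synthetic samples.

Part of the data comes from filtered code from TheStack, and part was generated with torch.compile() and prompting techniques.

The dataset, KernelBook, is available at: https://huggingface.co/datasets/GPUMODE/KernelBook

Training started from Llama3.1-8B-Instruct, which was supervised fine-tuned (SFT) on the custom dataset and then evaluated on KernelBench-Triton for its ability to generate correct Triton kernels and the code that calls them.

KernelBench-Triton is a variant of KernelBench [Ouyang et al. 2025] that focuses on Triton kernel generation.

During training and evaluation, the PyTorch code is wrapped in a prompt template that includes a format example and serves as the instruction.

The model was trained for 10 epochs with a batch size of 32 using standard SFT, with hyperparameters chosen by perplexity on the validation set.

Training used 16 GPUs and took 12 hours (192 GPU-hours); the reported results are validation results for the best checkpoint.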
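
As a quick sanity check on the training budget, the arithmetic can be written out. The GPU-hour figure follows directly from the article; the step count is only a rough estimate that assumes exactly 25,000 training examples and that 32 is the global batch size, neither of which the article confirms (it says "more than 25,000" pairs plus synthetic samples, so the real count is higher).

```python
import math

# Figures stated in the article.
num_gpus = 16
wall_clock_hours = 12
epochs = 10
batch_size = 32

gpu_hours = num_gpus * wall_clock_hours  # 16 * 12 = 192

# Assumption for illustration only: exactly 25,000 examples, global batch of 32.
num_examples = 25_000
steps_per_epoch = math.ceil(num_examples / batch_size)  # ceil(781.25) = 782
total_steps = steps_per_epoch * epochs                   # 782 * 10 = 7820

print(gpu_hours, steps_per_epoch, total_steps)  # prints: 192 782 7820
```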

Performance evaluation

Despite its small size, the model performs on par with state-of-the-art LLMs.

On KernelBench-Triton, the 8B-parameter KernelLLM scores 20.2 in single-shot inference, higher than the 671B-parameter DeepSeek V3 (16) and the 200B-parameter GPT-4o (15).

Generating more candidates raises the score further: 51.8 with 10 candidates and 57.1 with 20.
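
The article does not say how these pass@k scores were computed. A common choice in code-generation evaluation is the unbiased estimator from the Codex evaluation paper (Chen et al., 2021): with n samples per task of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; `benchmark_score` and the per-task counts are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples containing c correct ones, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws with failures
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_score(results, k):
    """Mean pass@k over tasks, as a percentage. `results` holds (n, c) per task."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Made-up counts: three tasks, 20 samples each, with 0, 2, and 10 correct samples.
results = [(20, 0), (20, 2), (20, 10)]
for k in (1, 10, 20):
    print(f"pass@{k}: {benchmark_score(results, k):.1f}")
```

Tasks with even a few correct samples contribute much more at larger k, which is why the score climbs as more candidates are generated.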

KernelLLM inference was run with temperature=1.0 and top_p=0.97.
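
To make these two sampling settings concrete, here is a minimal sketch of temperature scaling followed by nucleus (top-p) sampling over a list of logits. It illustrates the general technique only; the actual sampling is done by the inference library, and details such as boundary and tie handling differ between implementations. The function names are hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p=0.97):
    """Return the smallest set of token ids, taken in order of decreasing
    probability, whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    return kept

def sample_top_p(logits, temperature=1.0, top_p=0.97, rng=random):
    """Sample one token id from the nucleus, weighted by probability."""
    probs = softmax(logits, temperature)
    kept = nucleus(probs, top_p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Toy distribution over four tokens: probabilities 0.6, 0.3, 0.08, 0.02.
logits = [math.log(p) for p in (0.6, 0.3, 0.08, 0.02)]
rng = random.Random(0)
samples = [sample_top_p(logits, temperature=1.0, top_p=0.97, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With top_p=0.97 only the lowest-probability tail is cut off, so sampling stays diverse, which suits a workflow that generates many candidates and filters them afterwards.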

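The model was tested on KernelBench, an open-source benchmark for evaluating the ability of LLMs to write efficient GPU kernels.

It contains 250 carefully selected PyTorch modules organized by workload, ranging from simple single operations (such as Conv2D or Swish, Level 1) to complete model architectures (Level 3).

KernelLLM performs consistently across difficulty levels, handling both simple single operators and complex model architectures.

The benchmark measures both the correctness of the code (by comparing against reference PyTorch outputs) and its performance (as the speedup over a baseline implementation).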
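
The two measurements can be combined into a single number in several ways, and the article does not spell out the aggregate. The sketch below assumes one plausible form: the fraction of tasks whose kernel is both correct and faster than the baseline by more than a chosen threshold. The `TaskResult` type, the function names, and all timings are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    name: str
    correct: bool       # outputs matched the reference within tolerance
    baseline_ms: float  # runtime of the baseline implementation
    kernel_ms: float    # runtime of the generated kernel

def speedup(result):
    """Speedup over the baseline; values above 1.0 mean the kernel is faster."""
    return result.baseline_ms / result.kernel_ms

def fraction_correct_and_faster(results, threshold=1.0):
    """Share of tasks that are correct AND have speedup above the threshold."""
    hits = sum(1 for r in results if r.correct and speedup(r) > threshold)
    return hits / len(results)

# Made-up results for four tasks.
results = [
    TaskResult("conv2d", correct=True, baseline_ms=2.0, kernel_ms=1.0),   # correct, 2.0x
    TaskResult("swish", correct=True, baseline_ms=1.0, kernel_ms=2.0),    # correct, 0.5x
    TaskResult("matmul", correct=False, baseline_ms=3.0, kernel_ms=1.0),  # fast but wrong
    TaskResult("softmax", correct=True, baseline_ms=1.5, kernel_ms=1.0),  # correct, 1.5x
]

print(fraction_correct_and_faster(results, threshold=0.0))  # correctness only: 0.75
print(fraction_correct_and_faster(results, threshold=1.0))  # correct and faster: 0.5
```

A fast kernel that produces wrong outputs counts for nothing under this kind of metric, which is why correctness is checked first.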

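The team developed a new KernelBench-Triton variant specifically to evaluate the ability of LLMs to generate Triton kernels, which makes it a natural fit for testing KernelLLM.

All measurements were taken on NVIDIA H100 GPUs.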

Figure: KernelLLM shows approximately log-linear scaling behavior in pass@k.

How to use KernelLLM

First, install the dependencies:

pip install transformers accelerate torch triton

To use it, import the library and call the generate_triton function to produce optimized Triton code.

KernelLLM provides a simple interface for generating Triton kernels from PyTorch code.

from kernelllm import KernelLLM

# Initialize the model
model = KernelLLM()

# Define your PyTorch module
pytorch_code = '''
import torch
import torch.nn as nn

class Model(nn.Module):
    """
    A model that computes Hinge Loss for binary classification tasks.
    """
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, predictions, targets):
        return torch.mean(torch.clamp(1 - predictions * targets, min=0))

batch_size = 128
input_shape = (1,)

def get_inputs():
    return [torch.randn(batch_size, *input_shape), torch.randint(0, 2, (batch_size, 1)).float() * 2 - 1]

def get_init_inputs():
    return []
'''

# Generate optimized Triton code
optimized_code = model.generate_triton(pytorch_code, max_new_tokens=512)
print(optimized_code)
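
For readers less familiar with the example, the module passed to generate_triton computes the mean hinge loss, mean(max(0, 1 - prediction * target)), with targets in {-1, +1}. A plain-Python version of the same formula can serve as a tiny oracle when checking a generated kernel by hand. The function name `hinge_loss` and the sample values are illustrative and not part of KernelLLM.

```python
def hinge_loss(predictions, targets):
    """Mean of max(0, 1 - prediction * target), with targets in {-1, +1}."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    total = sum(max(0.0, 1.0 - p * t) for p, t in zip(predictions, targets))
    return total / len(predictions)

# Per-element losses: 0.0, 0.5, 0.0, 1.25
predictions = [2.0, 0.5, -1.0, -0.25]
targets = [1.0, 1.0, -1.0, 1.0]
print(hinge_loss(predictions, targets))  # prints: 0.4375
```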

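If you would rather not write a script, you can run python kernelllm.py directly to use the built-in REPL interface, an interactive session that shows results in real time.

kernelllm.py provides several ways to interact with the model.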

python kernelllm.py

KernelLLM offers several methods for customizing the generation process:

from kernelllm import KernelLLM
model = KernelLLM()
# Stream output in real-time
model.stream_raw("Your prompt here", max_new_tokens=2048)
# Generate raw text without the Triton-specific prompt template
raw_output = model.generate_raw("Your prompt here", temperature=1.0, max_new_tokens=2048)

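The model sometimes makes small mistakes, such as incorrect API references or syntax errors, and at times fails to follow instructions well enough to produce the desired kernel.

The structure of the generated code resembles compiler-generated output, and it is prone to problems with details such as variable naming, tensor shapes, type handling, and numerical precision.

References: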

https://x.com/reach_vb/status/1924478755898085552

https://huggingface.co/facebook/KernelLLM

Editor: 武曉燕. Source: 新智元.
