Fine-Tuning Qwen3 with Unsloth: A Hands-On Tutorial

2025-05-14 01:00:00


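Fine-tuning Qwen3 with Unsloth offers clear advantages: training is 2x faster, VRAM use drops by 70%, and the supported context is 8x longer. Qwen3-30B-A3B runs in just 17.5 GB of VRAM. Unsloth's Dynamic 2.0 quantization preserves high accuracy while supporting the native 128K context length. Qwen3 models have a thinking mode and a non-thinking mode, which suit tasks of different complexity. Fine-tuned models can be used for legal document analysis, custom knowledge bases, and similar work; they can handle domain-specific queries and retain context, which gives them an edge over pure retrieval systems. Unsloth supports 4-bit and 16-bit QLoRA/LoRA fine-tuning on NVIDIA GPUs released since 2018, making it an efficient way to customize models in resource-constrained environments.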


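Main scenarios for fine-tuning Qwen3 models

Unsloth supports fine-tuning Qwen3 models for scenarios such as:

Legal document assistance: fine-tune on legal texts (contracts, case law, statutes) for contract analysis, case-law research, or compliance support
Custom knowledge bases: embed specialized domain knowledge directly into the model so that it can answer domain-specific queries and summarize documents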

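Qwen3 models have two working modes, which makes the fine-tuned model more flexible:

Thinking mode: the model reasons step by step before giving its final answer, which suits complex problems that need deeper thought
Non-thinking mode: the model gives fast, near-instant answers, which suits simple questions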
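
As a minimal sketch of how the two modes might be driven at inference time, the helper below returns the sampling settings that this article's inference section attributes to the Qwen3 team for each mode. The function name `sampling_params` is hypothetical; it is not part of Unsloth or Transformers.

```python
def sampling_params(enable_thinking: bool) -> dict:
    """Return sampling settings for the chosen Qwen3 mode (hypothetical helper)."""
    if enable_thinking:
        # Thinking (reasoning) mode
        return {"temperature": 0.6, "top_p": 0.95, "top_k": 20}
    # Non-thinking (normal chat) mode
    return {"temperature": 0.7, "top_p": 0.8, "top_k": 20}


print("thinking:    ", sampling_params(True))
print("non-thinking:", sampling_params(False))
```

The returned dictionary could be unpacked into a `generate` call, with the same flag passed as `enable_thinking` to the chat template, which is what the inference code later in the article does by hand.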

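Main advantages of fine-tuning Qwen3 with Unsloth

Unsloth makes Qwen3 (8B) fine-tuning 2x faster, uses 70% less VRAM, and supports context lengths 8x longer than any setup that uses Flash Attention 2. With Unsloth, the Qwen3-30B-A3B model fits comfortably in just 17.5 GB of VRAM.

Unsloth provides its Dynamic 2.0 quantization method for Qwen3, which it reports as giving the best results on the 5-shot MMLU and KL-divergence benchmarks. This means quantized Qwen3 LLMs can be run and fine-tuned with minimal loss of accuracy. Unsloth has also uploaded Qwen3 versions that support the native 128K context length.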
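
To make the KL-divergence measure concrete, here is a small sketch of how the divergence between a full-precision model's next-token distribution and a quantized model's distribution could be computed. The probabilities are made up for illustration, and this is not Unsloth's benchmark code; it only shows the quantity being measured.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up next-token probabilities over a 4-token vocabulary
full_precision = [0.70, 0.20, 0.07, 0.03]
quantized = [0.66, 0.22, 0.08, 0.04]

print(f"KL(full || quantized) = {kl_divergence(full_precision, quantized):.5f}")
print(f"KL(full || full)      = {kl_divergence(full_precision, full_precision):.5f}")
```

A value close to zero means the quantized model's predictions stay close to those of the original; identical distributions give exactly zero.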

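Unsloth supports several fine-tuning techniques, including 4-bit and 16-bit QLoRA/LoRA fine-tuning. It speeds up training without any hardware changes by manually deriving all compute-heavy math steps and hand-writing the GPU kernels.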
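
LoRA keeps the pretrained weight matrix frozen and trains two small low-rank factors instead. The sketch below counts the trainable parameters for a single linear layer under that scheme. The layer size of 4096 is an illustrative assumption rather than a Qwen3 value, and `lora_trainable_params` is a hypothetical helper name.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


d_in = d_out = 4096  # illustrative layer size, not taken from Qwen3
rank = 32            # same rank as r = 32 in the fine-tuning code

full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank)
print(f"full weight matrix: {full:,} parameters")
print(f"LoRA factors:       {lora:,} parameters ({lora / full:.2%} of full)")
```

For this one layer the trainable share is a small percentage of the full matrix, which is in line with the 1 to 10% figure quoted in the fine-tuning code.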

Technical features and support

Unsloth provides several settings for tuning the fine-tuning process:

max_seq_length = 2048: controls the context length. Qwen3 supports 40960, but 2048 is recommended for testing. Unsloth also enables fine-tuning with 8x longer context
load_in_4bit = True: enables 4-bit quantization, cutting memory use during fine-tuning to a quarter, which suits 16 GB GPUs
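
The memory saving from 4-bit loading follows from simple arithmetic on the weights. The sketch below estimates the memory taken by the weights alone at different bit widths; it deliberately ignores activations, optimizer state, LoRA parameters, and quantization overhead such as scales, so real usage will be higher. The 14B parameter count is a round-number assumption.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 14_000_000_000  # roughly a 14B-parameter model (assumption)
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

Going from 16-bit to 4-bit divides the weight footprint by four, which is where the "quarter of the memory" figure comes from.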

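Unsloth has uploaded all versions of Qwen3 to Hugging Face, including Dynamic 2.0 GGUF and dynamic 4-bit formats. Unsloth also supports the Qwen3 MoE models, including 30B-A3B and 235B-A22B.

Unsloth's technical support covers:

NVIDIA GPUs released since 2018, with a minimum CUDA compute capability of 7.0
A range of Transformer-style models, including Phi-4 Reasoning, Mixtral, MoE, and Cohere
Any training algorithm, such as GRPO with VLMs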
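
The compute-capability requirement can be checked before installing anything heavy. The sketch below compares a `(major, minor)` capability tuple against the 7.0 minimum stated above; `meets_minimum` is a hypothetical helper, and the example GPUs are hard-coded so that the snippet runs without a GPU.

```python
MIN_CAPABILITY = (7, 0)  # minimum CUDA compute capability stated above


def meets_minimum(capability: tuple, minimum: tuple = MIN_CAPABILITY) -> bool:
    """Compare (major, minor) tuples; Python orders tuples element by element."""
    return tuple(capability) >= tuple(minimum)


# With PyTorch installed and a GPU present, the tuple could come from
# torch.cuda.get_device_capability(0). Known values are used here instead.
examples = {"GTX 1080": (6, 1), "V100": (7, 0), "T4": (7, 5), "A100": (8, 0)}
for name, cap in examples.items():
    verdict = "ok" if meets_minimum(cap) else "too old"
    print(f"{name}: {cap[0]}.{cap[1]} -> {verdict}")
```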

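Practical advantages

Compared with pure retrieval systems, fine-tuning offers several notable advantages:

Fine-tuning can do almost everything retrieval-augmented generation (RAG) can, but the reverse does not hold
During fine-tuning, external knowledge is embedded directly in the model, so it can handle domain-specific queries and retain context without relying on an external retrieval system
Even in a hybrid setup that uses both fine-tuning and RAG, the fine-tuned model provides a reliable fallback

In specific domains, such as visual question answering (VQA) in healthcare, fine-tuning helps the model grasp domain-specific nuances and improves its ability to give accurate, contextually relevant responses. Fine-tuned models clearly outperform zero-shot prediction on precision and recall.

For best results, curate a well-structured dataset, ideally as question-answer pairs. This improves learning, comprehension, and response accuracy.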
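
As an illustration of the question-answer format, the sketch below turns a list of Q&A pairs into user/assistant conversations, the same shape the fine-tuning code builds for its reasoning dataset. The helper name `qa_pairs_to_conversations` and the sample pairs are made up for this example.

```python
def qa_pairs_to_conversations(pairs):
    """Turn (question, answer) tuples into user/assistant message lists."""
    return [
        [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
        for question, answer in pairs
    ]


# Made-up examples for illustration only
pairs = [
    ("What does clause 4.2 require?", "Written notice 30 days before termination."),
    ("Which party bears shipping costs?", "The seller, under the delivery terms."),
]
conversations = qa_pairs_to_conversations(pairs)
print(conversations[0])
```

Each conversation can then be passed through the tokenizer's chat template to produce the training text.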

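Fine-tuning Qwen3 with Unsloth delivers faster training, lower memory requirements, and longer context support while maintaining high accuracy. This makes it possible to adapt the powerful Qwen3 models to domain-specific applications even in resource-constrained environments.

Complete fine-tuning code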

"""
**Capabilities the fine-tuned model gains:**
1. Dual-mode operation:
   - Normal chat mode: for everyday conversation
   - Thinking mode: for problems that require reasoning
2. Mathematical reasoning: solves math problems and shows the detailed reasoning, as in the example "Solve (x + 2)^2 = 0"
3. Retained conversational ability: still holds natural, fluent multi-turn conversations

Fine-tuning turns the model into a "dual-personality" assistant: it can chat casually, and it can switch to the more rigorous thinking mode when a complex problem, especially a math problem, calls for it.

### Installation
"""

# Commented out IPython magic to ensure Python compatibility.
# %%capture
# import os
# if "COLAB_" not in "".join(os.environ.keys()):
#     # Not running in Google Colab: a plain install of unsloth is enough
#     !pip install unsloth
# else:
#     # Special install procedure for Google Colab
#     # First install the dependencies without resolving their own dependencies (--no-deps)
#     !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
#     # Install common NLP and model-hosting tools
#     !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
#     # Finally install unsloth itself without dependencies (avoids version conflicts)
#     !pip install --no-deps unsloth

"""### Unsloth"""

from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen3 models, 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    token = None,            # Set to your Hugging Face token if using gated models
)

"""We now add LoRA adapters so we only need to update 1 to 10% of all parameters!"""

# Add LoRA adapters
# With LoRA, updating only 1-10% of the parameters is enough for effective fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # LoRA rank; suggested values: 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # LoRA alpha; suggested: rank or rank * 2
    lora_dropout = 0, # LoRA dropout; 0 is the optimized setting
    bias = "none",    # Bias handling; "none" is the optimized setting
    # [NEW] "unsloth" mode uses 30% less VRAM and fits 2x larger batch sizes
    use_gradient_checkpointing = "unsloth", # Gradient checkpointing, for long context
    random_state = 3407,  # Random seed
    use_rslora = False,   # Whether to use rank-stabilized LoRA
    loftq_config = None,  # LoftQ configuration
)

"""
### Data Prep
Qwen3 has both a reasoning and a non-reasoning mode. So, we should use 2 datasets:

1. We use the [Open Math Reasoning]() dataset which was used to win the [AIMO](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard) (AI Mathematical Olympiad - Progress Prize 2) challenge! We sample 10% of verifiable reasoning traces that used DeepSeek R1, and which got > 95% accuracy.
2. We also leverage [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. But we need to convert it to HuggingFace's normal multiturn format as well.
"""

# Data preparation
# Qwen3 has both reasoning and non-reasoning modes, so two datasets are used:
# 1. OpenMathReasoning: for mathematical reasoning
# 2. FineTome-100k: for general conversation
from datasets import load_dataset
# Load the math reasoning dataset (set token to your Hugging Face token if needed)
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot", token = None)
# Load the conversational dataset
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train", token = None)

"""Let's see the structure of both datasets:"""

# Inspect the reasoning dataset
reasoning_dataset

# Inspect the non-reasoning dataset
non_reasoning_dataset

"""We now convert the reasoning dataset into conversational format:"""

# Convert the reasoning dataset into conversational format
# Turn each math problem and its solution into a user/assistant exchange
# Args:
#     examples: a batch of samples containing problems and solutions
# Returns:
#     a dict holding the conversations
def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role" : "user",      "content" : problem},
            {"role" : "assistant", "content" : solution},
        ])
    return { "conversations": conversations, }

# Apply the chat template to the converted reasoning dataset
reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
    tokenize = False, # Apply the template only; do not tokenize
)

"""Let's see the first transformed row:"""

# Inspect the first converted sample
reasoning_conversations[0]

"""Next we take the non reasoning dataset and convert it to conversational format as well.

We have to use Unsloth's `standardize_sharegpt` function to fix up the format of the dataset first.
"""

# Process the non-reasoning dataset into the standard conversational format
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)

# Apply the chat template to the standardized non-reasoning dataset
non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize = False,
)

"""Let's see the first row"""

# Inspect the first converted non-reasoning sample
non_reasoning_conversations[0]

"""Now let's see how long both datasets are:"""

# Check the size of both datasets
print(len(reasoning_conversations))
print(len(non_reasoning_conversations))

"""The non reasoning dataset is much longer. Let's assume we want the model to retain some reasoning capabilities, but we specifically want a chat model.

Let's define a ratio of chat only data. The goal is to define some mixture of both sets of data.

Let's select 25% reasoning and 75% chat based:
"""

# Set the share of chat data
# Target mix: 25% reasoning data and 75% chat data
chat_percentage = 0.75

"""Let's sample the chat dataset so that it makes up `chat_percentage` of the combined data"""

# Sample from the non-reasoning dataset. To reach the target share we need
# len(reasoning) * chat / (1 - chat) chat rows, capped at the rows available.
# (Sampling only len(reasoning) * (1 - chat) rows would give a mix of about
# 80% reasoning and 20% chat, the opposite of the stated goal.)
import pandas as pd
non_reasoning_subset = pd.Series(non_reasoning_conversations)
n_chat = min(
    int(len(reasoning_conversations) * chat_percentage / (1.0 - chat_percentage)),
    len(non_reasoning_subset),
)
non_reasoning_subset = non_reasoning_subset.sample(n_chat, random_state = 2407)

"""Finally combine both datasets:"""

# Combine the two datasets
data = pd.concat([
    pd.Series(reasoning_conversations),    # Reasoning conversations
    pd.Series(non_reasoning_subset)        # Sampled non-reasoning conversations
])
data.name = "text" # Name the column "text"

# Convert the combined data into a Hugging Face Dataset
from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
# Shuffle the dataset
combined_dataset = combined_dataset.shuffle(seed = 3407)

# Show basic information about the dataset
print(combined_dataset)

# Show the first 10 records as a pandas DataFrame for easier reading
from IPython.display import display  # Built in to Colab/Jupyter; imported so the script also runs standalone
df = pd.DataFrame(combined_dataset[:10])
display(df)

"""
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and set `max_steps=None`.
"""

# Train with Hugging Face TRL's SFTTrainer
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,  # An evaluation dataset can be set here
    args = SFTConfig(
        dataset_text_field = "text",  # Text field in the dataset
        per_device_train_batch_size = 2,  # Training batch size per device
        gradient_accumulation_steps = 4,  # Gradient accumulation to simulate a larger batch
        warmup_steps = 5,  # Warmup steps
        # num_train_epochs = 1,  # Set to 1 for a full training run
        max_steps = 30,
        learning_rate = 2e-4,   # Learning rate (reduce to 2e-5 for long training runs)
        logging_steps = 1,  # Logging interval
        optim = "adamw_8bit",  # Optimizer
        weight_decay = 0.01,  # Weight decay
        lr_scheduler_type = "linear",  # Learning-rate schedule
        seed = 3407,  # Random seed
        report_to = "none",   # Set to "wandb" or similar for experiment tracking
    ),
)

# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

"""Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`"""

# Start training
# To resume a run, pass resume_from_checkpoint = True
trainer_stats = trainer.train()

# Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

"""
### Inference
Let's run the model via Unsloth native inference! According to the `Qwen-3` team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`.

For normal chat based inference, `temperature = 0.7, top_p = 0.8, top_k = 20`.
"""

# Model inference
# Test the model with Unsloth's native inference
# Settings recommended by the Qwen-3 team:
# - Reasoning mode: temperature=0.6, top_p=0.95, top_k=20
# - Normal chat mode: temperature=0.7, top_p=0.8, top_k=20

# Test a normal conversation with thinking mode disabled
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Required for generation
    enable_thinking = False,  # Disable thinking mode
)

# Generate with the normal chat settings
from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs
    temperature = 0.7, top_p = 0.8, top_k = 20, # Normal chat settings
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

# Test a reasoning conversation with thinking mode enabled
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,  # Required for generation
    enable_thinking = True, # Enable thinking mode
)

# Generate with the reasoning settings
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024,  # Increase for longer outputs
    temperature = 0.6, top_p = 0.95, top_k = 20, # Reasoning settings
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

"""
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
"""

# Saving the model
# Several ways to save the model follow

# Save the LoRA adapters (small; does not include the full model)
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("leo009/Qwen3-lora_model", token = "") # Upload to the Hugging Face Hub
# tokenizer.push_to_hub("leo009/Qwen3-lora_model", token = "") # Upload to the Hugging Face Hub

"""Now if you want to load the LoRA adapters we just saved for inference, set the condition below to `True`:"""

# Load the LoRA adapters just saved (for inference)
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model",  # The adapter directory saved above
        max_seq_length = 2048,
        load_in_4bit = True,
    )

"""
### Saving to float16 for VLLM
We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
"""

# Save as float16 (for vLLM)
# Supported save methods: merged_16bit (float16), merged_4bit (int4), or lora (adapters)

# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Upload to the Hugging Face Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Save at 4-bit precision
if True:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit_forced",) # Uses the _forced variant
if True: # Upload to the Hugging Face Hub (requires a write token)
    model.push_to_hub_merged("leo009/Qwen3-vLLM", tokenizer, save_method = "merged_4bit_forced", token = "") # Likewise the _forced variant

# Save only the LoRA adapters
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: # Upload to the Hugging Face Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

"""
### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
"""

# GGUF / llama.cpp format conversion
# Several quantization methods are supported, such as q8_0, q4_k_m, and q5_k_m

# F16 (float16) format
# Precision: half-precision floating point (16-bit)
# Memory: about 50% less storage than the original FP32 (32-bit floating point)
# Accuracy: retains relatively high numerical precision with little loss
# Inference speed: faster than FP32, slower than lower-bit quantized formats
# Use case: when memory use and model accuracy need to be balanced

# Q4_K_M format
# Precision: mixed 4-bit quantization (one of the GGUF quantization schemes)
# Memory: about 75% less storage than F16, about 87.5% less than the original FP32
# Strategy: different weights get different quantization
#   Q6_K for half of the attention WV and feed-forward W2 tensors
#   Q4_K for the remaining weights
# Accuracy vs. speed: trades some accuracy for smaller files and faster inference
# Use case: running models on resource-constrained devices such as PCs or mobile devices

# # Save to 8bit Q8_0
# if False:
#     model.save_pretrained_gguf("model", tokenizer,)
# # Remember to go to https://huggingface.co/settings/tokens for a token!
# # And change hf to your username!
# if False:
#     model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# # Save as 16-bit GGUF
# if False:
#     model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# if False: # Upload to the Hugging Face Hub
#     model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save as q4_k_m GGUF
if True:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: # Upload to the Hugging Face Hub (requires a write token)
    model.push_to_hub_gguf("leo009/Qwen3-GGUF", tokenizer, quantization_method = "q4_k_m", token = "")

# # Save several GGUF formats at once (batch export is more efficient)
# if False:
#     model.push_to_hub_gguf(
#         "hf/model", # Change hf to your username!
#         tokenizer,
#         quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
#         token = "", # Get a token at https://huggingface.co/settings/tokens
#     )

# Mount Google Drive (works only inside Google Colab)
from google.colab import drive
drive.mount('/content/gdrive')

# Save to Google Drive with q4_k_m quantization
if True:
    model.save_pretrained_gguf("/content/gdrive/MyDrive/MyModel/model",
                              tokenizer,
                              quantization_method = "q4_k_m")

"""Now, use the `model.gguf` file or `model-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM developments, or need help, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!
"""
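
The data preparation step in the script aims for a mix of 25% reasoning data and 75% chat data. As a sketch of the arithmetic behind such a mix: if the combined set contains N_r reasoning rows and chat rows should make up a fraction c of the total, the number of chat rows to sample is N_r * c / (1 - c). The helper name `chat_rows_needed` and the dataset size below are made up for illustration.

```python
def chat_rows_needed(n_reasoning: int, chat_fraction: float) -> int:
    """Chat rows to add so that chat makes up `chat_fraction` of the combined set."""
    if not 0.0 <= chat_fraction < 1.0:
        raise ValueError("chat_fraction must be in [0, 1)")
    return int(n_reasoning * chat_fraction / (1.0 - chat_fraction))


n_reasoning = 1000  # made-up size for illustration
n_chat = chat_rows_needed(n_reasoning, 0.75)
share = n_chat / (n_chat + n_reasoning)
print(f"{n_chat} chat rows + {n_reasoning} reasoning rows -> chat share {share:.0%}")
```

In practice the result should also be capped at the number of chat rows actually available, since sampling more rows than exist without replacement raises an error.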

Editor: Wu Xiaoyan. Source: 數據STUDIO (Data STUDIO)
