
A Hands-On Tutorial on Fine-Tuning Qwen3 with Unsloth


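Fine-tuning Qwen3 with Unsloth brings clear benefits: training is 2x faster, VRAM use drops by 70%, and context can be 8x longer. Qwen3-30B-A3B runs in just 17.5GB of VRAM. Unsloth's Dynamic 2.0 quantization preserves high accuracy and supports a native 128K context length. Qwen3 models have a thinking mode and a non-thinking mode, which suit tasks of differing complexity. Fine-tuned models can be used for legal document analysis, custom knowledge bases and similar areas; they can handle domain-specific queries and retain context, which gives them an edge over retrieval-only systems. Unsloth supports 4-bit and 16-bit QLoRA/LoRA fine-tuning on NVIDIA GPUs released since 2018, making it an efficient option for customizing models in resource-constrained environments.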


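Main use cases for fine-tuning Qwen3

Unsloth supports fine-tuning Qwen3 models for scenarios such as:

  1. Legal document assistance: fine-tune on legal text (contracts, case law, statutes) for contract analysis, case-law research or compliance support
  2. Custom knowledge bases: embed specialist domain knowledge directly in the model so that it can handle domain-specific queries and document summarization

Qwen3 itself has two working modes, which makes the fine-tuned model more flexible:

  1. Thinking Mode: the model reasons step by step before giving its final answer, which suits complex problems that need deep thought
  2. Non-Thinking Mode: the model gives fast, near-instant answers, which suits simple questions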
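
As a rough illustration of how the two modes are usually driven at inference time, the sketch below pairs each mode with the sampling settings recommended by the Qwen3 team (quoted in the inference part of this tutorial). The helper name `generation_settings` is hypothetical; `enable_thinking` is the chat-template flag used in the inference code further down.

```python
def generation_settings(thinking: bool) -> dict:
    """Return chat-template and sampling settings for one of Qwen3's two modes."""
    if thinking:
        # Thinking mode: step-by-step reasoning before the final answer
        return {"enable_thinking": True, "temperature": 0.6, "top_p": 0.95, "top_k": 20}
    # Non-thinking mode: quick, direct answers
    return {"enable_thinking": False, "temperature": 0.7, "top_p": 0.8, "top_k": 20}


for mode in (True, False):
    print(mode, generation_settings(mode))
```

Keeping the template flag and the sampling values in one place makes it harder to mix, say, thinking-mode prompts with chat-mode sampling.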

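Key advantages of fine-tuning Qwen3 with Unsloth

Unsloth makes Qwen3 (8B) fine-tuning 2x faster with 70% less VRAM, and supports context lengths 8x longer than any setup using Flash Attention 2. With Unsloth, the Qwen3-30B-A3B model fits comfortably in just 17.5GB of VRAM.

Unsloth provides its Dynamic 2.0 quantization method for Qwen3, which it reports as giving the best performance on 5-shot MMLU and KL divergence benchmarks. This means quantized Qwen3 LLMs can be run and fine-tuned with minimal accuracy loss. Unsloth has also uploaded Qwen3 versions that support a native 128K context length.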
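
For readers unfamiliar with the metric: KL divergence measures how far a quantized model's next-token distribution drifts from the full-precision one, with 0 meaning the two are identical. A minimal sketch with made-up probabilities (the two distributions are illustrative assumptions, not measurements of any Qwen3 model):

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


full_precision = [0.70, 0.20, 0.10]  # hypothetical next-token probabilities
quantized = [0.65, 0.23, 0.12]       # hypothetical probabilities after quantization

print(kl_divergence(full_precision, full_precision))  # identical distributions
print(kl_divergence(full_precision, quantized))       # slightly different distributions
```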

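Unsloth supports several fine-tuning techniques, including 4-bit and 16-bit QLoRA/LoRA fine-tuning. It speeds up training without any hardware changes by manually deriving all the compute-heavy math steps and hand-writing the GPU kernels.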
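
To see why LoRA is cheap, consider a single weight matrix of shape d_out x d_in. LoRA freezes it and trains two small rank-r matrices instead, which adds r * (d_in + d_out) trainable parameters. The sketch below works through that arithmetic; the 4096 x 4096 layer size is an illustrative assumption, not a Qwen3 dimension.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the LoRA factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


d_in = d_out = 4096   # hypothetical layer size
full = d_in * d_out   # parameters in the frozen weight matrix
for r in (8, 32, 128):
    extra = lora_trainable_params(d_in, d_out, r)
    print(f"r={r}: {extra} trainable params, {extra / full:.2%} of the full matrix")
```

Higher ranks train more parameters per layer, which is the trade-off behind the rank values suggested in the fine-tuning code.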

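Technical features and support

Unsloth offers several settings for tuning the fine-tuning process:

  • max_seq_length = 2048: controls the context length. Qwen3 supports up to 40960, but 2048 is recommended for testing. Unsloth enables fine-tuning with 8x longer context
  • load_in_4bit = True: enables 4-bit quantization, which cuts memory use during fine-tuning to a quarter of the original and suits 16GB GPUs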
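
The one-quarter figure follows from bit widths alone: weights stored in 4 bits take a quarter of the space of 16-bit weights. The following is a back-of-the-envelope sketch that counts weights only and ignores activations, LoRA parameters, optimizer state and quantization overhead, so real usage is higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 14e9  # a 14B-parameter model, such as the Qwen3-14B used in this tutorial
print(weight_memory_gb(n_params, 16))  # 16-bit weights
print(weight_memory_gb(n_params, 4))   # 4-bit weights
```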

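Unsloth has uploaded every Qwen3 version to Hugging Face, in formats including Dynamic 2.0 GGUF and dynamic 4-bit. Unsloth also supports the Qwen3 MoE models, including 30B-A3B and 235B-A22B.

Unsloth's technical support covers:

  • NVIDIA GPUs released since 2018, with a minimum CUDA compute capability of 7.0
  • A variety of Transformer-style models, including Phi-4 reasoning, Mixtral, MoE and Cohere
  • Any training algorithm, for example GRPO with VLMs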
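
A quick way to apply the GPU requirement is to compare a card's compute capability, expressed as a (major, minor) pair, against 7.0. The helper below is a hypothetical sketch; on a machine with PyTorch and a CUDA device, the pair can be read with `torch.cuda.get_device_capability()`.

```python
MIN_CAPABILITY = (7, 0)  # minimum CUDA compute capability named in the list above


def meets_min_capability(capability) -> bool:
    """Compare a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= MIN_CAPABILITY


print(meets_min_capability((7, 5)))  # e.g. a Turing-class GPU
print(meets_min_capability((6, 1)))  # e.g. a Pascal-class GPU
```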

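Practical advantages

Compared with retrieval-only systems, fine-tuning offers several notable advantages:

  1. Fine-tuning can do almost everything retrieval-augmented generation (RAG) can, but not the other way round
  2. During fine-tuning, external knowledge is embedded directly in the model, so it can handle domain-specific queries and retain context without relying on an external retrieval system
  3. Even in a hybrid setup that uses both fine-tuning and RAG, the fine-tuned model provides a reliable fallback

In specific domains, such as visual question answering (VQA) in healthcare, fine-tuning helps the model grasp domain-specific nuances and improves its ability to give accurate, contextually relevant answers. Fine-tuned models score markedly better on precision and recall than zero-shot prediction.

For best results, curate a well-structured dataset, ideally in the form of question-answer pairs. This improves learning, comprehension and response accuracy.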
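
A minimal sketch of what such a dataset can look like once converted to the user/assistant conversation format that the fine-tuning code in this tutorial works with. The two question-answer pairs and the helper name `qa_to_conversations` are made up for illustration.

```python
def qa_to_conversations(qa_pairs):
    """Turn (question, answer) pairs into user/assistant conversations."""
    return [
        [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
        for question, answer in qa_pairs
    ]


qa_pairs = [  # hypothetical domain examples
    ("What is a force majeure clause?",
     "A clause that excuses performance when extraordinary events occur."),
    ("What does 'indemnify' mean?",
     "To compensate another party for a loss or damage."),
]
conversations = qa_to_conversations(qa_pairs)
print(conversations[0])
```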

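Fine-tuning Qwen3 with Unsloth delivers faster training, lower memory requirements and longer context support while keeping accuracy high. That makes it possible to adapt the powerful Qwen3 models to domain-specific applications even in resource-constrained environments.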

Complete fine-tuning code

"""
**Capabilities the fine-tuned model gains:**
1. Dual-mode operation:
   - Normal chat mode: suited to everyday conversation
   - Thinking Mode: used for problems that require reasoning
2. Mathematical reasoning: it can solve math problems and show detailed reasoning, as in the example "Solve (x + 2)^2 = 0"
3. Retained conversational ability: it keeps its natural dialogue skills and can hold fluent multi-turn conversations

Fine-tuning turns the model into a "dual-personality" assistant: it can chat casually and, when needed, switch to a more rigorous thinking mode for complex problems, math problems in particular.

### Installation
"""
# Commented out IPython magic to ensure Python compatibility.
# %%capture
# import os
# if "COLAB_" not in "".join(os.environ.keys()):
#     # Outside Google Colab, a plain install of unsloth is enough
#     !pip install unsloth
# else:
#     # Inside Google Colab, use a staged install
#     # First install the dependencies without resolving their own dependencies (--no-deps)
#     !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
#     # Install common NLP and model-hosting tools
#     !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
#     # Finally install unsloth itself, again without dependencies (avoids version conflicts)
#     !pip install --no-deps unsloth

"""### Unsloth"""
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen3 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit", # [NEW] We support TTS models!
] # More models at <https://huggingface.co/unsloth>

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    token = None,            # pass your Hugging Face token string if using gated models
)

"""We now add LoRA adapters so we only need to update 1 to 10% of all parameters!"""
# Add LoRA adapters
# With LoRA, updating only 1-10% of the parameters is enough for effective fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # LoRA rank; suggested values are 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # LoRA alpha; best set to rank or rank*2
    lora_dropout = 0, # LoRA dropout; 0 is optimized
    bias = "none",    # Bias setting; "none" is optimized
    # [NEW] "unsloth" mode uses 30% less VRAM and fits 2x larger batch sizes
    use_gradient_checkpointing = "unsloth", # Gradient checkpointing, for long context
    random_state = 3407,  # Random seed
    use_rslora = False,   # Whether to use rank-stabilized LoRA
    loftq_config = None,  # LoftQ configuration
)

"""<a name="Data"></a>
### Data Prep
Qwen3 has both reasoning and a non reasoning mode. So, we should use 2 datasets:

1. We use the Open Math Reasoning dataset which was used to win the [AIMO](<https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard>) (AI Mathematical Olympiad - Progress Prize 2) challenge! We sample 10% of verifiable reasoning traces that used DeepSeek R1, and which got > 95% accuracy.
2. We also leverage [Maxime Labonne's FineTome-100k](<https://huggingface.co/datasets/mlabonne/FineTome-100k>) dataset in ShareGPT style. But we need to convert it to HuggingFace's normal multiturn format as well.
"""
# Data preparation
# Qwen3 has both reasoning and non-reasoning modes, so two datasets are used:
# 1. OpenMathReasoning - for mathematical reasoning
# 2. FineTome-100k - for general conversation
from datasets import load_dataset
# Load the math reasoning dataset
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot", token = None)
# Load the chat dataset
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train", token = None)

"""Let's see the structure of both datasets:"""
# Inspect the reasoning dataset structure
print(reasoning_dataset)
# Inspect the non-reasoning dataset structure
print(non_reasoning_dataset)

"""We now convert the reasoning dataset into conversational format:"""
# Convert the reasoning dataset into conversational format
# Turn each math problem and its solution into a user/assistant conversation
# Args:
#     examples: a batch of samples containing problems and solutions
# Returns:
#     a dict holding the conversations
def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role" : "user",      "content" : problem},
            {"role" : "assistant", "content" : solution},
        ])
    return { "conversations": conversations, }

# Apply the chat template to the converted reasoning dataset
# (list(...) keeps this working whether the column comes back as a list or a lazy column object)
reasoning_conversations = tokenizer.apply_chat_template(
    list(reasoning_dataset.map(generate_conversation, batched = True)["conversations"]),
    tokenize = False, # Only apply the template, do not tokenize
)

"""Let's see the first transformed row:"""
# Inspect the first converted sample
print(reasoning_conversations[0])

"""Next we take the non reasoning dataset and convert it to conversational format as well.

We have to use Unsloth's `standardize_sharegpt` function to fix up the format of the dataset first.
"""
# Process the non-reasoning dataset into the standard conversational format
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)

# Apply the chat template to the standardized non-reasoning dataset
non_reasoning_conversations = tokenizer.apply_chat_template(
    list(dataset["conversations"]),
    tokenize = False,
)

"""Let's see the first row"""
# Inspect the first converted non-reasoning sample
print(non_reasoning_conversations[0])

"""Now let's see how long both datasets are:"""
# Check the size of both datasets
print(len(reasoning_conversations))
print(len(non_reasoning_conversations))

"""The non reasoning dataset is much longer. Let's assume we want the model to retain some reasoning capabilities, but we specifically want a chat model.

Let's define a ratio of chat only data. The goal is to define some mixture of both sets of data.

Let's select 25% reasoning and 75% chat based:
"""
# Set the share of chat data
# Target mixture: 25% reasoning rows and 75% chat rows
chat_percentage = 0.75

"""Let's sample the chat (non reasoning) dataset so that it makes up `chat_percentage` of the combined data"""
# All reasoning rows are kept, so a chat share p needs len(reasoning) * p / (1 - p) chat rows
import pandas as pd
non_reasoning_subset = pd.Series(non_reasoning_conversations)
n_chat = int(len(reasoning_conversations) * chat_percentage / (1.0 - chat_percentage))
non_reasoning_subset = non_reasoning_subset.sample(
    min(n_chat, len(non_reasoning_subset)), # Cap at the number of chat rows available
    random_state = 2407,
)

"""Finally combine both datasets:"""
# Combine both datasets
data = pd.concat([
    pd.Series(reasoning_conversations),    # Reasoning conversations
    pd.Series(non_reasoning_subset)        # Sampled non-reasoning conversations
])
data.name = "text" # Name the data column "text"

# Convert the combined data to a HuggingFace Dataset
from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
# Shuffle the dataset
combined_dataset = combined_dataset.shuffle(seed = 3407)

# Print basic information about the dataset
print(combined_dataset)

# Show the first 10 records as a DataFrame for easier reading
df = pd.DataFrame(combined_dataset[:10])
print(df)

"""<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](<https://huggingface.co/docs/trl/sft_trainer>). We do 30 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`.
"""
# Train with HuggingFace TRL's SFTTrainer
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,  # An evaluation dataset can be set here
    args = SFTConfig(
        dataset_text_field = "text",  # Text field in the dataset
        per_device_train_batch_size = 2,  # Training batch size per device
        gradient_accumulation_steps = 4,  # Gradient accumulation to simulate a larger batch
        warmup_steps = 5,  # Warmup steps
        # num_train_epochs = 1,  # Set to 1 for a full training run
        max_steps = 30,
        learning_rate = 2e-4,   # Learning rate (reduce to 2e-5 for long runs)
        logging_steps = 1,  # Logging interval
        optim = "adamw_8bit",  # Optimizer
        weight_decay = 0.01,  # Weight decay
        lr_scheduler_type = "linear",  # Learning-rate schedule
        seed = 3407,  # Random seed
        report_to = "none",   # Set to "wandb" etc. for experiment tracking
    ),
)

# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

"""Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`"""
# Start training
# To resume training, pass resume_from_checkpoint = True
trainer_stats = trainer.train()

# Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

"""<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Qwen-3` team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`.

For normal chat based inference, `temperature = 0.7, top_p = 0.8, top_k = 20`.
"""
# Model inference
# Test the model with Unsloth native inference
# Recommended by the Qwen-3 team:
# - Reasoning mode: temperature=0.6, top_p=0.95, top_k=20
# - Normal chat mode: temperature=0.7, top_p=0.8, top_k=20

# Test a normal conversation with thinking mode disabled
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add the generation prompt
    enable_thinking = False,  # Disable thinking mode
)

# Generate text with the normal chat parameters
from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs
    temperature = 0.7, top_p = 0.8, top_k = 20, # Normal chat mode parameters
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

# Test a reasoning conversation with thinking mode enabled
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,  # Must add the generation prompt
    enable_thinking = True, # Enable thinking mode
)

# Generate text with the reasoning mode parameters
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024,  # Increase for longer outputs
    temperature = 0.6, top_p = 0.95, top_k = 20, # Reasoning mode parameters
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

"""<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
"""
# Saving the model
# Several ways to save the model follow

# Save the LoRA adapters (not the full model, so the files are small)
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("leo009/Qwen3-lora_model", token = "") # Upload to HuggingFace Hub
# tokenizer.push_to_hub("leo009/Qwen3-lora_model", token = "") # Upload to HuggingFace Hub

"""To load the LoRA adapters we just saved for inference, keep the flag below set to `True`:"""
# Load the LoRA adapters just saved (for inference)
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model",  # The model saved after training
        max_seq_length = 2048,
        load_in_4bit = True,
    )

"""### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to <https://huggingface.co/settings/tokens> for your personal tokens.
"""
# Save as float16 (for VLLM)
# Supported save methods: merged_16bit (float16), merged_4bit (int4) or lora (adapters)

# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Upload to HuggingFace Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Save as 4bit
if True:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit_forced",) # Uses the "_forced" variant
if True: # Upload to HuggingFace Hub (fill in your token first)
    model.push_to_hub_merged("leo009/Qwen3-vLLM", tokenizer, save_method = "merged_4bit_forced", token = "") # Also the "_forced" variant

# Save only the LoRA adapters
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: # Upload to HuggingFace Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

"""### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](<https://github.com/unslothai/unsloth/wiki#gguf-quantization-options>)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](<https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb>)
"""
# GGUF / llama.cpp conversion
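# Several quantization methods are supported, such as q8_0, q4_k_m and q5_k_m

# F16 (Float16) format
# Precision: half-precision floating point (16-bit)
# Memory: about 50% less storage than the original FP32 (32-bit floating point)
# Accuracy: keeps relatively high numerical precision with little loss
# Inference speed: faster than FP32, but slower than lower-bit quantized formats
# Use when: you need a balance between memory use and model accuracy

# Q4_K_M format
# Precision: a mixed 4-bit quantization format (one of the GGUF quantization schemes)
# Memory: nominally about 75% less storage than F16 and about 87.5% less than FP32
# Quantization strategy: different weights are quantized differently
#   - half of the attention WV and feed-forward W2 tensors use Q6_K
#   - all other weights use Q4_K
# Accuracy vs. speed: gives up some accuracy for smaller files and faster inference
# Use when: running on resource-constrained devices such as personal computers or mobile devices

# # Save to 8bit Q8_0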
# if False:
#     model.save_pretrained_gguf("model", tokenizer,)
# # Remember to go to <https://huggingface.co/settings/tokens> for a token!
# # And change hf to your username!
# if False:
#     model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# # Save as 16bit GGUF
# if False:
#     model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# if False: # Upload to HuggingFace Hub
#     model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save as q4_k_m GGUF
if True:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: # Upload to HuggingFace Hub (fill in your token first)
    model.push_to_hub_gguf("leo009/Qwen3-GGUF", tokenizer, quantization_method = "q4_k_m", token = "")

# # Save several GGUF formats at once (batch export is more efficient)
# if False:
#     model.push_to_hub_gguf(
#         "hf/model", # Change hf to your username!
#         tokenizer,
#         quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
#         token = "", # Get a token at <https://huggingface.co/settings/tokens>
#     )

from google.colab import drive
drive.mount('/content/gdrive')

# Save to Google Drive with q4_k_m quantization
if True:
    model.save_pretrained_gguf("/content/gdrive/MyDrive/MyModel/model",
                              tokenizer,
                              quantization_method = "q4_k_m")

"""Now, use the `model.gguf` file or `model-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](<https://github.com/janhq/jan>) and Open WebUI [here](<https://github.com/open-webui/open-webui>)

And we're done! If you have any questions on Unsloth, we have a [Discord](<https://discord.gg/unsloth>) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](<https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb>)
2. Saving finetunes to Ollama. [Free notebook](<https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb>)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](<https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb>)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](<https://docs.unsloth.ai/get-started/unsloth-notebooks>)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>  Join Discord if you need help + <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i>
</div>"""
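
The data-mixing step in the script can be checked with a little arithmetic. If all N reasoning rows are kept and chat rows should make up a fraction p of the combined set, the number of chat rows needed is N * p / (1 - p). The sketch below is illustrative only; the row count is a made-up example, not the size of the actual dataset, and `chat_rows_needed` is a hypothetical helper.

```python
def chat_rows_needed(n_reasoning: int, chat_fraction: float) -> int:
    """Chat rows required so that they form `chat_fraction` of the combined data."""
    return int(n_reasoning * chat_fraction / (1.0 - chat_fraction))


n_reasoning = 1000  # hypothetical number of reasoning rows
n_chat = chat_rows_needed(n_reasoning, 0.75)
share = n_chat / (n_chat + n_reasoning)
print(n_chat, share)
```

For a 75% chat share this gives three chat rows for every reasoning row.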

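
The training configuration in the script combines a per-device batch size of 2 with 4 gradient-accumulation steps. A short sketch of what that implies for a 30-step run; the single-GPU setting is an assumption.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 30
num_gpus = 1  # assumption: single-GPU training

# Sequences contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Sequences processed over the whole short run
samples_seen = effective_batch_size * max_steps
print(effective_batch_size, samples_seen)
```

A 30-step run therefore touches only a small slice of the combined dataset, which is why it is a quick demonstration and not a full training pass.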

Editor: Wu Xiaoyan. Source: 数据STUDIO