Fine-Tune OpenAI's gpt-oss-20b Model with Unsloth 🧠
In this tutorial we will explore the power of OpenAI's new gpt-oss-20b model by fine-tuning it with Unsloth. The model comes with a special feature called "Reasoning Effort", which lets you control the trade-off between the model's performance and its speed.
Step 1: Environment Setup (Installation) 🛠️
First, we need to install the required libraries. Unsloth makes this process quite easy.
%%capture
import os, importlib.util
!pip install --upgrade -qqq uv
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try: import numpy, PIL; get_numpy = f"numpy=={numpy.__version__}"; get_pil = f"pillow=={PIL.__version__}"
    except: get_numpy = "numpy"; get_pil = "pillow"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} {get_pil} torchvision bitsandbytes "transformers==4.56.2" \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo
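As a quick, optional sanity check, you can confirm the key libraries are importable and print their installed versions (uses only the standard importlib.metadata module):
import torch, transformers
from importlib.metadata import version
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("unsloth:", version("unsloth"))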
Step 2: Loading the Model (with Unsloth) 🚀
Now we will load the model using Unsloth's FastLanguageModel class. We will use 4-bit quantization to keep memory usage low.
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024
dtype = None
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    dtype = dtype,  # None for auto detection
    max_seq_length = max_seq_length,
    load_in_4bit = True,  # 4-bit quantization to save memory
    full_finetuning = False,
)
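As an optional check, you can see roughly how much GPU memory the 4-bit model occupies right after loading (this assumes a CUDA device is available):
print(f"GPU memory allocated after loading: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")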
Adding LoRA Adapters
We will add LoRA adapters for parameter-efficient fine-tuning (PEFT). This lets us fine-tune the whole model while training only a small percentage of its parameters.
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,  # Choose any of 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Uses 30% less VRAM
    random_state = 3407,
)
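To see how small the trainable footprint really is, you can count the trainable parameters yourself (a small optional check using plain PyTorch):
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")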
Step 3: Understanding Reasoning Effort 🤔
A unique feature of the gpt-oss models is "Reasoning Effort". It lets you control how much the model "thinks" before answering. There are three levels:
- Low: For fast responses where complex reasoning is not needed.
- Medium: A balance between performance and speed.
- High: The strongest reasoning performance, with a slightly longer response time.
Let's look at an example with `medium` effort:
from transformers import TextStreamer
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",  # **NEW!** Set the effort to low, medium, or high
).to("cuda")
_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))
You will notice that at `high` effort the model "thinks" for longer and gives a better answer.
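To compare all three levels yourself, you can run the same prompt at each setting; here is a minimal sketch that reuses the model, tokenizer, and messages from above:
for effort in ("low", "medium", "high"):
    print(f"\n--- reasoning_effort = {effort} ---")
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt = True,
        return_tensors = "pt",
        return_dict = True,
        reasoning_effort = effort,
    ).to("cuda")
    _ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))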
Step 4: Preparing the Dataset 📚
For fine-tuning we will use the HuggingFaceH4/Multilingual-Thinking dataset. It contains chain-of-thought reasoning examples that teach the model to reason in different languages.
We will format the dataset according to OpenAI's Harmony format.
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)
# Look at the first example
print(dataset[0]['text'])
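If you want to peek at the raw conversation structure before chat templating, the dataset object itself can be inspected (an optional check; the messages column comes straight from the dataset):
print(dataset)  # Number of rows and column names
print(dataset[0]["messages"][0])  # First turn of the first conversation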
Step 5: Fine-Tuning the Model ⚙️
Now we will train the model. For speed we only run 30 steps here, but you can set num_train_epochs=1 for a full training run.
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 30,  # For a full run, set this to None and use num_train_epochs = 1
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)
Train Only on the Assistant's Responses
Unsloth's train_on_responses_only method trains only on the assistant's replies, which improves accuracy.
from unsloth.chat_templates import train_on_responses_only
gpt_oss_kwargs = dict(instruction_part = "<|start|>user<|message|>", response_part="<|start|>assistant<|channel|>final<|message|>")
trainer = train_on_responses_only(
    trainer,
    **gpt_oss_kwargs,
)
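# Optional sanity check (a sketch, assuming the wrapped trainer exposes train_dataset with a "labels" column):
# tokens labelled -100 are ignored by the loss, so only the assistant's final reply should remain trainable.
sample = trainer.train_dataset[0]
trained_tokens = sum(1 for lab in sample["labels"] if lab != -100)
print(f"Tokens contributing to the loss: {trained_tokens} / {len(sample['labels'])}")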
trainer.train()
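If you want to see how much VRAM the run actually used, you can print the peak memory PyTorch reserved during training (optional; assumes a single CUDA device):
peak_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"Peak reserved GPU memory during training: {peak_gb:.2f} GB")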
Step 6: Running Inference with the Fine-Tuned Model ✅
After training, let's see whether our model can now reason in French.
messages = [
    {"role": "system", "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."},
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to("cuda")
from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))
After fine-tuning, the model has learned to "think" in French!
Step 7: Saving and Loading the Model 💾
You can save your fine-tuned model as LoRA adapters.
model.save_pretrained("finetuned_model") # Local save
# model.push_to_hub("your_username/finetuned_model", token = "...") # Online save
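It is also worth saving the tokenizer next to the adapters, so that the reload below can find everything in the same directory:
tokenizer.save_pretrained("finetuned_model")  # Local save
# tokenizer.push_to_hub("your_username/finetuned_model", token = "...")  # Online save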
To load the saved LoRA adapters:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "finetuned_model",
        max_seq_length = 1024,
        dtype = None,
        load_in_4bit = True,
    )
Saving the Model in float16
You can also save the model in float16 format for tools like vLLM.
# Choose only one option!
# Save locally in 16-bit
if False: model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")
# Export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/gpt-oss-finetune", tokenizer, save_method = "merged_16bit", token = "PUT_HERE")
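Once the merged 16-bit checkpoint is on the Hub, it can be served with tools like vLLM. A rough sketch of what that might look like (the model ID is a placeholder, and gpt-oss support depends on your vLLM version):
from vllm import LLM, SamplingParams
llm = LLM(model = "YOUR_USERNAME/gpt-oss-finetune")
outputs = llm.generate(["Solve x^5 + 3x^4 - 10 = 3."], SamplingParams(max_tokens = 64))
print(outputs[0].outputs[0].text)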
💡 Summary and Next Steps
Congratulations! You have successfully fine-tuned OpenAI's `gpt-oss-20b` model for a multilingual reasoning task. Unsloth makes this process much easier and more memory-efficient.
You can also explore Unsloth's other notebooks, such as Llama GRPO, DPO, and Vision model fine-tuning!