A few months ago I wanted to understand what actually happens when you fine-tune a language model. Not the theory. The actual process. The code, the wait, the output. Every guide I found assumed you had a beefy GPU and a cloud budget to match. I have a MacBook with an Intel chip and stubbornness.
So I built slm-forge, a minimal setup to fine-tune TinyLlama on my laptop, deploy it locally via Ollama, and actually use it. This post walks through everything I did, why I made each choice, and what I learned.
Why TinyLlama
TinyLlama is a 1.1 billion parameter language model trained on 3 trillion tokens. It's small enough to load and run on a laptop, but capable enough to follow instructions and generate coherent text. It's not GPT-4. But it is a real transformer model with a real attention mechanism, and fine-tuning it teaches you exactly the same concepts you'd apply to a 70B model.
Think of it like learning to drive in a small hatchback instead of a truck. The physics is the same, the controls are the same, you're just not risking as much on each turn.
LoRA vs QLoRA: What They Are and Why It Matters Here
Full fine-tuning means updating every single weight in the model. For a 1.1B model, that means computing gradients and optimizer state for a billion-plus parameters, several times the memory of the weights themselves. On a MacBook, that's not happening.
LoRA (Low-Rank Adaptation) freezes the original weights and injects small trainable low-rank matrices into the attention layers. Because these matrices are tiny, you only train a fraction of the model. In my setup only 1,126,400 parameters out of 1.1 billion are trained. That's about 0.1%.
Think of it like this. A doctor already knows medicine. You want them to specialize in cardiology. You don't send them back to medical school. You send them to a six-week fellowship. That's LoRA. Same base knowledge, just a small update on top.
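That 1,126,400 number isn't magic; you can reproduce it from TinyLlama's architecture. A quick sketch, assuming the config values noted in the comments (my reading of the model's published config, not something printed by the repo):

```python
# Where 1,126,400 comes from. Assumed architecture numbers for TinyLlama:
# hidden size 2048, 22 layers, 32 query heads and 4 key/value heads
# (grouped-query attention), head dimension 64.
hidden, layers, rank = 2048, 22, 8
q_out = 32 * 64   # q_proj maps 2048 -> 2048
v_out = 4 * 64    # v_proj maps 2048 -> 256

def lora_params(d_in, d_out, r):
    # LoRA adds A (r x d_in) and B (d_out x r) per target matrix; both are trained
    return r * d_in + d_out * r

per_layer = lora_params(hidden, q_out, rank) + lora_params(hidden, v_out, rank)
total = per_layer * layers
print(f"{total:,}")  # 1,126,400
```

Dividing by the full parameter count (1,101,174,784) gives the 0.1023% figure the training script prints.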
QLoRA takes this further by also compressing the base model to 4-bit precision before training, which cuts memory usage even more. Sounds great, but it depends on a library called bitsandbytes that only runs on NVIDIA GPUs via CUDA. No NVIDIA GPU means no QLoRA. On an Intel Mac you're stuck with plain LoRA, which is exactly what I used.
It was enough.
The Libraries
Here's every key library in the project and what it actually does:
| Library | Version | What it does |
|---|---|---|
| torch | 2.2.2 | The foundation. PyTorch handles all tensor operations, gradient tracking, and the training loop under the hood |
| transformers | 4.57.6 | Hugging Face's library for loading pretrained models and tokenizers. AutoModelForCausalLM loads TinyLlama in one line |
| peft | 0.19.1 | Parameter-Efficient Fine-Tuning. Provides LoraConfig and get_peft_model to inject LoRA adapters into the model |
| trl | 1.0.0 | Transformer Reinforcement Learning. The SFTTrainer handles supervised fine-tuning with minimal boilerplate |
| datasets | 4.8.4 | Hugging Face's dataset library. Loads the Alpaca dataset directly from the Hub and handles preprocessing |
| accelerate | 1.13.0 | Abstracts hardware-specific training details. Used here to keep things working on CPU without code changes |
| safetensors | 0.7.0 | Safe format for saving model weights. Faster and more secure than pickle-based .bin files |
The Setup
Clone the repo and install dependencies:
```shell
git clone https://github.com/murtaza-bagwala/slm-forge
cd slm-forge
python3 -m venv llm-finetune
source llm-finetune/bin/activate
pip install -r requirements.txt
```

The first time you run the training script, it will download TinyLlama from the Hugging Face Hub. It's about 2.2GB, so make sure you have the space.
The Training Data
We use the yahma/alpaca-cleaned dataset from Hugging Face. It's a cleaned version of Stanford's Alpaca dataset: 52,000 instruction-following examples generated with OpenAI's text-davinci-003 model. Each example has an instruction, an optional input, and an output.
A raw sample from the dataset looks like this:
```json
{
  "instruction": "Give three tips for staying healthy.",
  "input": "",
  "output": "1. Eat a balanced diet...\n2. Exercise regularly...\n3. Get enough sleep..."
}
```

We format it into a single text string before training:
```
### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced diet...
2. Exercise regularly...
3. Get enough sleep...
```

This structure teaches the model the exact pattern it needs to follow at inference time. When you later prompt it with ### Instruction:, it knows a ### Response: should come next. That's the whole trick behind instruction fine-tuning.
The Training Script
Here's the full train.py:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Cap to 4 cores — leaves the rest of your Mac responsive during training
torch.set_num_threads(4)

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print("Loading tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print("Loading dataset...")
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:500]")

def format_prompt(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_prompt)

config = SFTConfig(
    output_dir="./tinyllama-lora",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    fp16=False,
    bf16=False,
    use_cpu=True,
    dataloader_num_workers=0,
    logging_steps=10,
    save_steps=200,
    report_to="none",
    dataset_text_field="text",
    max_length=256,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
)

print("Starting training...")
trainer.train()
trainer.save_model("./tinyllama-lora")
tokenizer.save_pretrained("./tinyllama-lora")
print("Done! Adapter saved to ./tinyllama-lora")
```

A few things worth pointing out.
tokenizer.pad_token = tokenizer.eos_token is necessary because TinyLlama's tokenizer doesn't have a dedicated padding token. If you skip it, batching will fail with a cryptic error about pad token IDs.
The LoRA config targets q_proj and v_proj, the query and value projection matrices in the attention layers. These are where most of the model's instruction-following behavior comes from. If you want to train more of the model, you can also add k_proj, o_proj, and the MLP layers, but r=8 on just q and v gives solid results without burning too much CPU.
gradient_accumulation_steps=4 with per_device_train_batch_size=1 means the optimizer only updates weights every 4 steps, simulating an effective batch size of 4. On CPU where you can't fit large batches in memory, this is how you keep training stable without running out of RAM.
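The mechanics are easy to see in a stripped-down toy with a single scalar parameter instead of a network: gradients accumulate across four micro-batches, and the weight moves only once per group.

```python
# Toy gradient accumulation: a one-parameter "model" with loss (w - x)^2.
# Gradients from 4 micro-batches are summed, then applied as one update,
# mimicking per_device_train_batch_size=1 + gradient_accumulation_steps=4.
w, lr = 0.0, 0.1
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
accum_steps = 4
grad_sum, updates = 0.0, 0

for i, x in enumerate(data, start=1):
    grad_sum += 2 * (w - x)               # d/dw of (w - x)^2
    if i % accum_steps == 0:
        w -= lr * grad_sum / accum_steps  # average gradient: effective batch of 4
        grad_sum = 0.0
        updates += 1

print(updates)  # 2 optimizer updates for 8 micro-batches
```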
When I ran this, the terminal printed:
```
trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023
```

Each step took about 14 to 20 seconds on CPU. 125 steps total. Budget about 40 minutes, make some coffee, come back.
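The step count and wall-clock estimate fall straight out of the config:

```python
# Back-of-envelope for this training run (numbers from this post)
examples, epochs = 500, 1
batch_size, grad_accum = 1, 4
optimizer_steps = (examples * epochs) // (batch_size * grad_accum)
seconds_per_step = 17  # midpoint of the 14-20s per step I saw; CPU-dependent
minutes = round(optimizer_steps * seconds_per_step / 60)
print(optimizer_steps, minutes)  # 125 steps, ~35 minutes
```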
Merging the Adapter
After training, you have a base model and a separate LoRA adapter directory. To deploy the model, you need to merge them into a single set of weights. That's what merge.py does:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
LORA_PATH = "./tinyllama-lora"
OUTPUT_PATH = "./tinyllama-merged"

print("Loading base model...")
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

print("Loading LoRA adapter...")
model = PeftModel.from_pretrained(base, LORA_PATH)

print("Merging weights...")
merged = model.merge_and_unload()
merged.save_pretrained(OUTPUT_PATH)
AutoTokenizer.from_pretrained(LORA_PATH).save_pretrained(OUTPUT_PATH)
print(f"Merged model saved to {OUTPUT_PATH}")
```

merge_and_unload() computes W + BA for each adapted layer and gives you back a clean model with no extra overhead. The result in ./tinyllama-merged is a standard Hugging Face model folder that any tool can load directly.
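That W + BA identity is easy to check on a toy example. With a rank-1 adapter, running the base path plus the low-rank path gives exactly the same output as folding B @ A into the frozen weight (plain Python, using y = x @ W rather than PyTorch's transposed convention):

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def madd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight
B = [[0.5], [0.25]]            # trained LoRA factor, shape (2, 1)
A = [[2.0, 0.0]]               # trained LoRA factor, shape (1, 2)
x = [[1.0, 1.0]]               # one input row

# With the adapter attached: base path plus low-rank path
lora_out = madd(matmul(x, W), matmul(matmul(x, B), A))

# After merging: a single dense weight, no extra matmuls per forward pass
merged_out = matmul(x, madd(W, matmul(B, A)))

print(lora_out == merged_out)  # True: outputs are identical
```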
Run it:
```shell
python merge.py
```

What Is GGUF and Why Do We Need llama.cpp
After merging, the model lives as a Hugging Face checkpoint. That format is great for training but not great for running locally on a CPU. It stores weights as full 32-bit floats, which is memory-heavy and slow at inference time.
GGUF (GPT-Generated Unified Format) is a file format designed specifically for running LLMs on regular hardware. It stores the model weights in a compressed, quantized format inside a single binary file. Instead of a folder with 10+ files, you get one .gguf file that's fast to load and efficient to run. Ollama, LM Studio, and other local inference tools all use GGUF under the hood.
llama.cpp is the open-source project that makes this conversion possible. It's a C++ implementation of LLM inference that can run models on CPU, Apple Silicon, and NVIDIA GPUs. It also ships a Python script that converts a Hugging Face model folder into a GGUF file, plus a quantization tool that shrinks the weights.
Think of it as a model compressor. You take a 4GB training checkpoint, run it through llama.cpp's converter, and get a ~650MB GGUF file that runs fast on your laptop.
Here's how to do it:
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
python convert_hf_to_gguf.py ../tinyllama-merged --outfile ../tinyllama-f16.gguf --outtype f16
./llama-quantize ../tinyllama-f16.gguf ../tinyllama-ft.gguf q4_0
```

The conversion happens in two steps: convert_hf_to_gguf.py turns the checkpoint into a GGUF file at 16-bit precision, then llama-quantize compresses it to q4_0, which means 4-bit quantization. Each weight value that was stored as a 32-bit float gets compressed to 4 bits. You lose a tiny bit of precision, but the model goes from 4GB to under 700MB and runs noticeably faster. For a fine-tuned model running on CPU, this trade-off is absolutely worth it.
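Quantization itself is simple to picture. Here's a toy version in plain Python, a sketch of the idea rather than the real q4_0 bit layout (which packs blocks of 32 weights with a half-precision scale):

```python
# Toy block quantizer: one shared scale per block, each weight stored as a
# small integer. Illustrative only; not the actual q4_0 format.
def quantize(block, levels=7):
    scale = max(abs(w) for w in block) / levels
    return scale, [round(w / scale) for w in block]

def dequantize(scale, qs):
    return [scale * q for q in qs]

block = [0.62, -1.40, 0.05, 0.91, -0.33]
scale, qs = quantize(block)
restored = dequantize(scale, qs)

# Rounding error is bounded by half a quantization step
worst = max(abs(a - b) for a, b in zip(block, restored))
print(qs, round(worst, 3))
```

Each float becomes a small integer plus its share of one scale value, which is where the roughly 8x size reduction over fp32 comes from.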
Now create the Modelfile in your project root:
```
FROM ./tinyllama-ft.gguf
SYSTEM "You are a helpful assistant fine-tuned on Alpaca instructions."
PARAMETER temperature 0.7
PARAMETER stop "### Instruction:"
```

Load it into Ollama and run it:
```shell
ollama create tinyllama-ft -f Modelfile
ollama run tinyllama-ft
```

Here's what an actual conversation looks like:
```
>>> Explain recursion like I'm five.
Recursion is when a function calls itself. Imagine you have a box of chocolates,
and inside that box is another box of chocolates, and inside that one is another
box... You keep opening boxes until you find an empty one. That empty box is your
base case. Without it, you'd open boxes forever.
```

That response is coming from a model you trained, running entirely on your laptop, with no API call, no cloud, no subscription.
The Full Pipeline at a Glance
```
yahma/alpaca-cleaned (500 examples)
        |
        v
train.py (LoRA fine-tuning on CPU, ~40 min)
        |
        v
./tinyllama-lora (adapter weights only, ~17MB)
        |
        v
merge.py (merge adapter into base model)
        |
        v
./tinyllama-merged (full HuggingFace model, ~4GB)
        |
        v
llama.cpp convert (GGUF quantization)
        |
        v
tinyllama-ft.gguf (~650MB)
        |
        v
ollama create + run (local inference)
```

Running This on an Intel Mac: What to Expect
This whole project was built and run on an Intel Mac, so I want to be straight with you about what that actually means. It works, but there are real limits you should know before you start.
Training is slow. Each step took me 14 to 20 seconds. With 500 examples and 1 epoch, that's 125 steps, which means about 40 minutes of training. If you try to use more data or more epochs, that time goes up fast. On a GPU this would take under 2 minutes.
You can't use fp16 or bf16. These are faster number formats that GPU training relies on. Intel CPUs don't support them properly, so you have to run everything in full 32-bit precision. That makes training slower and uses more memory. The fp16=False, bf16=False flags in the config are not optional here.
No MPS acceleration. Apple Silicon Macs (M1, M2, M3) can use Metal Performance Shaders to speed up training. Intel Macs don't have that. You're on pure CPU.
RAM gets tight. Loading TinyLlama alone takes about 4GB of RAM. If you have an 8GB machine and other apps open, your Mac will start swapping memory to disk and training will slow down even more. I'd recommend closing everything else before you start.
You can't go much bigger than 1B parameters. TinyLlama at 1.1B is close to the limit for what's practical on an Intel CPU. A 3B model will be painful. A 7B model will probably run out of memory entirely or take hours per epoch.
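The RAM figures follow directly from parameter counts. A quick back-of-envelope helper (my own sketch, counting fp32 weights only; activations, gradients, and OS overhead come on top):

```python
def weights_gib(params_billion, bytes_per_param=4):
    # fp32 stores each parameter in 4 bytes; result in GiB
    return params_billion * 1e9 * bytes_per_param / 2**30

print(round(weights_gib(1.1), 1))  # ~4.1 GiB for TinyLlama's weights alone
print(round(weights_gib(7.0), 1))  # ~26.1 GiB for a 7B model: hopeless on 8-16GB
```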
Inference in Ollama is slow too. After deployment, expect around 3 to 8 tokens per second. It's usable, but don't expect instant responses like you get from a cloud API.
None of this stopped me from finishing the project and learning a lot from it. Just set the right expectations going in so you don't get discouraged halfway through.
What I Actually Learned
Fine-tuning a model feels intimidating from the outside. Once you do it, the mechanics are straightforward. The hard part is understanding which pieces are doing what, and why the numbers are what they are.
The 0.1% trainable parameters number is what I keep coming back to. LoRA is not a shortcut or a hack. It's a smart idea. You don't need to retrain everything, you just need to nudge what's already there in the right direction. Like giving a good chef a new recipe instead of sending them back to cooking school.