System Info
I was training google/gemma-3-270m by adding LoRA adapters on the attention layers and new embedding tokens. I am facing the following issues:
- On calling model.save_pretrained, I observed that the saved adapter has an abnormal size, surprisingly close to the size of the complete model. A quick comparison is shown below: if I train the complete embedding layer, the saved adapter is ~350 MB, but if I add ~30 new tokens and train only those using trainable_token_indices, the size becomes ~675 MB!
📊 Model Size Comparison:
Normal LoRA: 35.77 MB
Embedding LoRA (training the complete embedding layer): 355.77 MB
Embedding LoRA (training partial embedding layers): 100.12 MB
New Tokens LoRA (training newly added embedding layers): 675.96 MB
- The saved model cannot be loaded back directly; it throws an error because the newly added tokens are not accounted for. I have to load the base model, resize it by adding the new tokens again, and only then load the adapter, as sketched below. Is this the expected behavior?
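For reference, this is roughly the loading sequence I currently have to use (a minimal sketch; the paths match the test script output below, and the tokenizer saved next to the adapter already contains the added tokens):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

adapter_dir = "./test_model_sizes/new_tokens_lora"

# Load the original base model and the tokenizer saved alongside the adapter.
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

# Without this resize, loading the adapter fails because the checkpoint
# references embedding rows that do not exist in the base model.
base.resize_token_embeddings(len(tokenizer))

model = PeftModel.from_pretrained(base, adapter_dir)
```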
I created a gist to test this behavior. The configurations it compares and the full output of the script are below.
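The four setups look roughly like this (a minimal sketch: r/alpha/target_modules match the logs below, the token index ranges are illustrative, and the exact code is in the linked gist):

```python
from peft import LoraConfig

common = dict(r=16, lora_alpha=32, lora_dropout=0.05,
              target_modules=["q_proj", "v_proj"])

# Test 1: attention-only LoRA
normal_cfg = LoraConfig(**common)

# Test 2: additionally train (and save) the full embedding matrix
embedding_cfg = LoraConfig(**common, modules_to_save=["embed_tokens"])

# Test 3: train only a subset of the existing embedding rows
partial_cfg = LoraConfig(**common,
                         trainable_token_indices=list(range(235930, 262144)))

# Test 4: add new tokens, resize the embeddings, then train only the new rows
# tokenizer.add_tokens(new_tokens); model.resize_token_embeddings(len(tokenizer))
new_tokens_cfg = LoraConfig(**common,
                            trainable_token_indices=list(range(262145, 262169)))
```

The full script output follows.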
============================================================
MODEL SIZE TESTING SCRIPT
============================================================
🔍 Attempting to load: google/gemma-3-270m
✅ Successfully loaded: google/gemma-3-270m
✅ Loaded model: google/gemma-3-270m
Vocabulary size: 262,145
🧹 Cleaning existing output directory: ./test_model_sizes
✅ Removed existing directory
============================================================
TEST 1: NORMAL LoRA (Attention Layers Only)
============================================================
============================================================
Setting up NORMAL LoRA (attention layers only)
============================================================
LoRA Configuration:
r (rank): 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: {'q_proj', 'v_proj'}
trainable params: 737,280 || all params: 268,835,456 || trainable%: 0.2742
💾 Saving normal LoRA model to: ./test_model_sizes/normal_lora
============================================================
Model saved at: ./test_model_sizes/normal_lora
============================================================
📊 Model Statistics:
Total parameters: 268,835,456
Trainable parameters: 737,280
Vocabulary size: 262,145
💾 Saved Model Size:
Total size: 35.77 MB
📁 Files in saved directory:
adapter_model.safetensors: 2.82 MB
tokenizer_config.json: 1.10 MB
special_tokens_map.json: 662.00 B
tokenizer.json: 31.84 MB
README.md: 5.07 KB
adapter_config.json: 854.00 B
🔧 Trainable Parameters Breakdown:
base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
============================================================
============================================================
TEST 2: LoRA with EMBEDDING SPACE ADAPTER
============================================================
============================================================
Setting up LoRA with EMBEDDING SPACE ADAPTER (modules_to_save)
============================================================
LoRA Configuration:
r (rank): 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: {'q_proj', 'v_proj'}
modules_to_save: ['embed_tokens']
Original vocab size: 262,145
trainable params: 168,509,440 || all params: 436,607,616 || trainable%: 38.5952
💾 Saving embedding LoRA model to: ./test_model_sizes/embedding_lora
============================================================
Model saved at: ./test_model_sizes/embedding_lora
============================================================
📊 Model Statistics:
Total parameters: 436,607,616
Trainable parameters: 168,509,440
Vocabulary size: 262,145
💾 Saved Model Size:
Total size: 355.77 MB
📁 Files in saved directory:
adapter_model.safetensors: 322.82 MB
tokenizer_config.json: 1.10 MB
special_tokens_map.json: 662.00 B
tokenizer.json: 31.84 MB
README.md: 5.07 KB
adapter_config.json: 874.00 B
🔧 Trainable Parameters Breakdown:
base_model.model.model.embed_tokens.modules_to_save.default.weight: torch.Size([262144, 640]) (167,772,160 params)
base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
============================================================
============================================================
TEST 3: LoRA with TRAINABLE TOKEN INDICES
============================================================
============================================================
Setting up LoRA with EMBEDDING SPACE ADAPTER (trainable_token_indices)
============================================================
Vocabulary size: 262,145
Training tokens from index 235930 to 262144
Number of trainable token indices: 26,214
LoRA Configuration:
r (rank): 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: {'q_proj', 'v_proj'}
trainable_token_indices: 26,214 tokens
(indices 235930 to 262144)
trainable params: 17,514,240 || all params: 285,612,416 || trainable%: 6.1322
💾 Saving trainable indices LoRA model to: ./test_model_sizes/trainable_indices_lora
============================================================
Model saved at: ./test_model_sizes/trainable_indices_lora
============================================================
📊 Model Statistics:
Total parameters: 285,612,416
Trainable parameters: 17,514,240
Vocabulary size: 262,145
💾 Saved Model Size:
Total size: 100.12 MB
📁 Files in saved directory:
adapter_model.safetensors: 66.82 MB
tokenizer_config.json: 1.10 MB
special_tokens_map.json: 662.00 B
tokenizer.json: 31.84 MB
README.md: 5.07 KB
adapter_config.json: 359.26 KB
🔧 Trainable Parameters Breakdown:
base_model.model.model.embed_tokens.token_adapter.trainable_tokens_delta.default: torch.Size([26214, 640]) (16,776,960 params)
base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
============================================================
============================================================
TEST 4: LoRA with NEW TOKENS ADDED
============================================================
============================================================
Setting up LoRA with NEW TOKENS (train only new tokens)
============================================================
Original vocabulary size: 262,145
Added 24 new tokens to tokenizer
New vocabulary size: 262,169
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
Resized model embeddings to 262,169 tokens
New token indices: 262145 to 262168
Number of new token indices: 24
LoRA Configuration:
r (rank): 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: {'q_proj', 'v_proj'}
trainable_token_indices: 24 new tokens
(indices 262145 to 262168)
trainable params: 752,640 || all params: 268,866,816 || trainable%: 0.2799
💾 Saving new tokens LoRA model to: ./test_model_sizes/new_tokens_lora
/Users/.../Work/tts/gemma3_audio_codecs/.venv/lib/python3.9/site-packages/peft/utils/save_and_load.py:300: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
============================================================
Model saved at: ./test_model_sizes/new_tokens_lora
============================================================
📊 Model Statistics:
Total parameters: 268,866,816
Trainable parameters: 752,640
Vocabulary size: 262,169
💾 Saved Model Size:
Total size: 675.96 MB
📁 Files in saved directory:
adapter_model.safetensors: 643.00 MB
tokenizer_config.json: 1.11 MB
special_tokens_map.json: 662.00 B
tokenizer.json: 31.84 MB
README.md: 5.07 KB
adapter_config.json: 1.19 KB
🔧 Trainable Parameters Breakdown:
base_model.model.model.embed_tokens.token_adapter.trainable_tokens_delta.default: torch.Size([24, 640]) (15,360 params)
base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
============================================================
============================================================
SIZE COMPARISON
============================================================
📊 Model Size Comparison:
Normal LoRA: 35.77 MB
Embedding LoRA (modules_to_save): 355.77 MB
Embedding LoRA (trainable_indices): 100.12 MB
New Tokens LoRA: 675.96 MB
📈 Size Ranking (smallest to largest):
1. Normal LoRA: 35.77 MB
2. Embedding LoRA (trainable_indices): 100.12 MB
3. Embedding LoRA (modules_to_save): 355.77 MB
4. New Tokens LoRA: 675.96 MB
Embedding LoRA (modules_to_save) is 9.95x larger than Normal LoRA
Embedding LoRA (trainable_indices) is 2.80x larger than Normal LoRA
New Tokens LoRA is 18.90x larger than Normal LoRA
Embedding LoRA (modules_to_save) is 3.55x larger than trainable_indices
New Tokens LoRA is 6.75x larger than trainable_indices
============================================================
TESTING COMPLETE
============================================================
Who can help?
No response
Reproduction
https://gist.github.com/imohitmayank/cead7ad4a63c8770bbd5a8f48d25aeeb
Expected behavior
IMO the saved adapter should only contain the trainable parameters, and it should be easily loadable. A quick way to check what the adapter file actually contains is sketched below.
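A minimal sketch of such a check, using the path from the run above (it assumes the safetensors package that ships alongside transformers):

```python
from safetensors import safe_open

path = "./test_model_sizes/new_tokens_lora/adapter_model.safetensors"
total_bytes = 0
with safe_open(path, framework="pt") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)
        total_bytes += tensor.numel() * tensor.element_size()
        print(f"{key}: {tuple(tensor.shape)}")
print(f"Total tensor size: {total_bytes / 1024**2:.2f} MB")
```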