
LoRA adapter with new embedding tokens has size and load issue #2898

@imohitmayank

Description


System Info

I was training google/gemma-3-270m by adding LoRA adapters on the attention layers along with new embedding tokens. I am facing the following issues:

  • On calling model.save_pretrained, the saved adapter has an abnormal size: it is surprisingly around the size of the complete model. A quick comparison is shown below. If I train the complete embedding layer, the saved adapter is ~355 MB, but if I add ~30 new tokens and train only those using trainable_token_indices, the size becomes ~675 MB!
📊 Model Size Comparison:
  Normal LoRA:                    35.77 MB
  Embedding LoRA (training complete embedding layers): 355.77 MB
  Embedding LoRA (training partial embedding layers): 100.12 MB
  New Tokens LoRA (training newly added embedding layers): 675.96 MB
  • The saved model cannot be loaded back; it throws an error about the newly added tokens not being accounted for. I have to load the base model, resize it by adding the new tokens again, and only then load the adapter. Is this the expected behavior?
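For reference, the reload workaround currently looks like this. This is only a minimal sketch of the manual steps; load_adapter_with_new_tokens is a hypothetical helper name, and it assumes the tokenizer files saved alongside the adapter already contain the new tokens:

```python
def load_adapter_with_new_tokens(base_model_id, adapter_dir):
    """Workaround sketch (hypothetical helper): rebuild the resized base
    model before attaching the adapter, mirroring the manual steps above."""
    # Imports kept local so the sketch is self-contained.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # The tokenizer saved next to the adapter already carries the new tokens.
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    model = AutoModelForCausalLM.from_pretrained(base_model_id)
    # Resize the embedding matrix to the vocabulary the adapter was trained
    # with; skipping this step is what triggers the load error.
    model.resize_token_embeddings(len(tokenizer))
    model = PeftModel.from_pretrained(model, adapter_dir)
    return model, tokenizer
```

It would be convenient if PeftModel.from_pretrained could handle the resize itself, since the saved adapter directory already contains the updated tokenizer.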

I created a gist to test this behavior. The output of the script is as follows:


============================================================
MODEL SIZE TESTING SCRIPT
============================================================

🔍 Attempting to load: google/gemma-3-270m
✅ Successfully loaded: google/gemma-3-270m

✅ Loaded model: google/gemma-3-270m
   Vocabulary size: 262,145

🧹 Cleaning existing output directory: ./test_model_sizes
✅ Removed existing directory

============================================================
TEST 1: NORMAL LoRA (Attention Layers Only)
============================================================

============================================================
Setting up NORMAL LoRA (attention layers only)
============================================================

LoRA Configuration:
  r (rank): 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: {'q_proj', 'v_proj'}
trainable params: 737,280 || all params: 268,835,456 || trainable%: 0.2742

💾 Saving normal LoRA model to: ./test_model_sizes/normal_lora

============================================================
Model saved at: ./test_model_sizes/normal_lora
============================================================

📊 Model Statistics:
  Total parameters: 268,835,456
  Trainable parameters: 737,280
  Vocabulary size: 262,145

💾 Saved Model Size:
  Total size: 35.77 MB

📁 Files in saved directory:
    adapter_model.safetensors: 2.82 MB
    tokenizer_config.json: 1.10 MB
    special_tokens_map.json: 662.00 B
    tokenizer.json: 31.84 MB
    README.md: 5.07 KB
    adapter_config.json: 854.00 B

🔧 Trainable Parameters Breakdown:
    base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
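The breakdown above sums exactly to the reported count. A quick arithmetic check; the only assumption is that the safetensors file stores these weights in fp32, with the small remainder being file metadata:

```python
# Per-layer LoRA parameters for q_proj and v_proj, using the shapes above.
q_params = 16 * 640 + 1024 * 16   # lora_A + lora_B for q_proj
v_params = 16 * 640 + 256 * 16    # lora_A + lora_B for v_proj
n_layers = 18                     # layers 0..17 in the breakdown

trainable = n_layers * (q_params + v_params)
assert trainable == 737_280                 # reported trainable params

pct = 100 * trainable / 268_835_456
assert abs(pct - 0.2742) < 1e-3             # reported trainable%

size_mb = trainable * 4 / 2**20             # fp32 weights (assumed)
assert abs(size_mb - 2.82) < 0.05           # reported adapter_model.safetensors size
```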

============================================================

============================================================
TEST 2: LoRA with EMBEDDING SPACE ADAPTER
============================================================

============================================================
Setting up LoRA with EMBEDDING SPACE ADAPTER (modules_to_save)
============================================================

LoRA Configuration:
  r (rank): 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: {'q_proj', 'v_proj'}
  modules_to_save: ['embed_tokens']
  Original vocab size: 262,145
trainable params: 168,509,440 || all params: 436,607,616 || trainable%: 38.5952

💾 Saving embedding LoRA model to: ./test_model_sizes/embedding_lora

============================================================
Model saved at: ./test_model_sizes/embedding_lora
============================================================

📊 Model Statistics:
  Total parameters: 436,607,616
  Trainable parameters: 168,509,440
  Vocabulary size: 262,145

💾 Saved Model Size:
  Total size: 355.77 MB

📁 Files in saved directory:
    adapter_model.safetensors: 322.82 MB
    tokenizer_config.json: 1.10 MB
    special_tokens_map.json: 662.00 B
    tokenizer.json: 31.84 MB
    README.md: 5.07 KB
    adapter_config.json: 874.00 B

🔧 Trainable Parameters Breakdown:
    base_model.model.model.embed_tokens.modules_to_save.default.weight: torch.Size([262144, 640]) (167,772,160 params)
    base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
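The jump from 35.77 MB to 355.77 MB is accounted for by modules_to_save keeping a full trainable copy of the embedding matrix. A quick check; the storage types (bf16 for the embedding copy, fp32 for the LoRA weights) are my assumptions, inferred from the file size:

```python
embed_copy = 262_144 * 640   # full embed_tokens copy, from the breakdown above
lora = 737_280               # attention LoRA params, as in Test 1

assert embed_copy == 167_772_160
assert embed_copy + lora == 168_509_440            # reported trainable params
assert 268_835_456 + embed_copy == 436_607_616     # reported total params

# adapter_model.safetensors: bf16 embedding copy + fp32 LoRA weights (assumed)
size_mb = (embed_copy * 2 + lora * 4) / 2**20
assert abs(size_mb - 322.82) < 0.05                # reported 322.82 MB
```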

============================================================

============================================================
TEST 3: LoRA with TRAINABLE TOKEN INDICES
============================================================

============================================================
Setting up LoRA with EMBEDDING SPACE ADAPTER (trainable_token_indices)
============================================================

  Vocabulary size: 262,145
  Training tokens from index 235930 to 262144
  Number of trainable token indices: 26,214

LoRA Configuration:
  r (rank): 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: {'q_proj', 'v_proj'}
  trainable_token_indices: 26,214 tokens
    (indices 235930 to 262144)
trainable params: 17,514,240 || all params: 285,612,416 || trainable%: 6.1322

💾 Saving trainable indices LoRA model to: ./test_model_sizes/trainable_indices_lora

============================================================
Model saved at: ./test_model_sizes/trainable_indices_lora
============================================================

📊 Model Statistics:
  Total parameters: 285,612,416
  Trainable parameters: 17,514,240
  Vocabulary size: 262,145

💾 Saved Model Size:
  Total size: 100.12 MB

📁 Files in saved directory:
    adapter_model.safetensors: 66.82 MB
    tokenizer_config.json: 1.10 MB
    special_tokens_map.json: 662.00 B
    tokenizer.json: 31.84 MB
    README.md: 5.07 KB
    adapter_config.json: 359.26 KB

🔧 Trainable Parameters Breakdown:
    base_model.model.model.embed_tokens.token_adapter.trainable_tokens_delta.default: torch.Size([26214, 640]) (16,776,960 params)
    base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
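Interestingly, 66.82 MB is about twice what 16.8 M bf16 delta parameters would need, which suggests the delta matrix is stored twice, presumably once for embed_tokens and once for the tied lm_head. That duplication, like the bf16/fp32 storage types, is an assumption inferred from the arithmetic:

```python
delta = 26_214 * 640   # trainable_tokens_delta, from the breakdown above
lora = 737_280         # attention LoRA params, as in Test 1

assert delta == 16_776_960
assert delta + lora == 17_514_240            # reported trainable params
assert 268_835_456 + delta == 285_612_416    # reported total params

# The file size only works out if the delta is saved twice (embed + tied lm_head).
size_mb = (2 * delta * 2 + lora * 4) / 2**20   # bf16 deltas + fp32 LoRA (assumed)
assert abs(size_mb - 66.82) < 0.05             # reported 66.82 MB
```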

============================================================

============================================================
TEST 4: LoRA with NEW TOKENS ADDED
============================================================

============================================================
Setting up LoRA with NEW TOKENS (train only new tokens)
============================================================

  Original vocabulary size: 262,145
  Added 24 new tokens to tokenizer
  New vocabulary size: 262,169
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
  Resized model embeddings to 262,169 tokens
  New token indices: 262145 to 262168
  Number of new token indices: 24

LoRA Configuration:
  r (rank): 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: {'q_proj', 'v_proj'}
  trainable_token_indices: 24 new tokens
    (indices 262145 to 262168)
trainable params: 752,640 || all params: 268,866,816 || trainable%: 0.2799

💾 Saving new tokens LoRA model to: ./test_model_sizes/new_tokens_lora
/Users/.../Work/tts/gemma3_audio_codecs/.venv/lib/python3.9/site-packages/peft/utils/save_and_load.py:300: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
  warnings.warn(
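The warning shows PEFT flipping `save_embedding_layers` to `True` on its own because it detects the resize. A possible mitigation (untested here, and it means the resized embedding rows must be restored by other means at load time) is to override that detection explicitly; `peft_model` below stands for the wrapped model returned by `get_peft_model`:

```python
# Hypothetical mitigation: skip saving the full embedding matrix.
# The `save_embedding_layers` kwarg normally defaults to "auto".
peft_model.save_pretrained(
    "./test_model_sizes/new_tokens_lora",
    save_embedding_layers=False,
)
```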

============================================================
Model saved at: ./test_model_sizes/new_tokens_lora
============================================================

📊 Model Statistics:
  Total parameters: 268,866,816
  Trainable parameters: 752,640
  Vocabulary size: 262,169

💾 Saved Model Size:
  Total size: 675.96 MB

📁 Files in saved directory:
    adapter_model.safetensors: 643.00 MB
    tokenizer_config.json: 1.11 MB
    special_tokens_map.json: 662.00 B
    tokenizer.json: 31.84 MB
    README.md: 5.07 KB
    adapter_config.json: 1.19 KB
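A quick sanity check suggests the 643 MB `adapter_model.safetensors` is almost exactly the full resized embedding matrix plus the adapter weights. A minimal sketch, assuming fp32 storage and using the parameter counts printed above:

```python
# Full resized embedding matrix: 262,169 tokens x 640 hidden dims
embed_params = 262_169 * 640
# Trainable adapter params from the printout: 737,280 LoRA A/B weights
# plus 24 x 640 = 15,360 trainable_tokens_delta rows = 752,640 total
adapter_params = 752_640

total_mb = (embed_params + adapter_params) * 4 / 1024**2  # 4 bytes per fp32
print(f"{total_mb:.2f} MB")  # prints "642.93 MB", matching the 643.00 MB file
```

In other words, the file size is consistent with the entire resized embedding layer being serialized, not just the 24 trained rows.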

🔧 Trainable Parameters Breakdown:
    base_model.model.model.embed_tokens.token_adapter.trainable_tokens_delta.default: torch.Size([24, 640]) (15,360 params)
    base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)
    base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight: torch.Size([1024, 16]) (16,384 params)
    base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 640]) (10,240 params)
    base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 16]) (4,096 params)

============================================================

============================================================
SIZE COMPARISON
============================================================

📊 Model Size Comparison:
  Normal LoRA:                    35.77 MB
  Embedding LoRA (modules_to_save): 355.77 MB
  Embedding LoRA (trainable_indices): 100.12 MB
  New Tokens LoRA:                 675.96 MB

📈 Size Ranking (smallest to largest):
  1. Normal LoRA: 35.77 MB
  2. Embedding LoRA (trainable_indices): 100.12 MB
  3. Embedding LoRA (modules_to_save): 355.77 MB
  4. New Tokens LoRA: 675.96 MB

  Embedding LoRA (modules_to_save) is 9.95x larger than Normal LoRA
  Embedding LoRA (trainable_indices) is 2.80x larger than Normal LoRA
  New Tokens LoRA is 18.90x larger than Normal LoRA
  Embedding LoRA (modules_to_save) is 3.55x larger than trainable_indices
  New Tokens LoRA is 6.75x larger than trainable_indices

============================================================
TESTING COMPLETE
============================================================

Who can help?

No response

Reproduction

https://gist.github.com/imohitmayank/cead7ad4a63c8770bbd5a8f48d25aeeb

Expected behavior

IMO the saved adapter should contain only the trainable parameters (the LoRA weights plus the 24 new-token embedding rows), and it should load back without having to manually resize the base model first.
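For context, the reload sequence that currently works looks like this (the resize has to be repeated before attaching the adapter; paths are the ones from the script above, and this is a sketch, not the gist verbatim):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

adapter_dir = "./test_model_sizes/new_tokens_lora"

# The saved tokenizer already carries the 24 added tokens
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

# Loading the adapter directly onto the base model fails with a
# size-mismatch error, so the base must be resized first:
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
base.resize_token_embeddings(len(tokenizer))  # back to 262,169

model = PeftModel.from_pretrained(base, adapter_dir)
```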
