Fix bf16 GGUF deleted by cleanup when exported alongside quantized types#4937

Open
Shivamjohri247 wants to merge 8 commits into unslothai:main from Shivamjohri247:fix/bf16-gguf-cleanup-deletion

Conversation

@Shivamjohri247

Summary

Fixes #4932

When bf16 is passed alongside other quantization types (e.g. ["q4_k_m", "q6_k", "q8_0", "bf16"]), Unsloth's internal cleanup step deletes the bf16 GGUF file after using it as an intermediate for quantizing the other formats — even though bf16 was explicitly requested as an output.

Root Cause

In unsloth/save.py:1350-1353, the GGUF export logic:

  1. Creates a base_gguf file in first_conversion format (e.g. bf16) — this IS the bf16 output
  2. Loops through quantization_method, skipping any method that equals first_conversion (line 1300) — so bf16 is never re-created
  3. After quantization, unconditionally deletes base_gguf:
# BEFORE (broken)
if quants_created:
    all_saved_locations.remove(base_gguf)
    Path(base_gguf).unlink(missing_ok = True)

This means when bf16 is both the conversion intermediate AND a requested output, the file gets deleted. Only the quantized outputs survive. The log shows:

Unsloth: [1] Converting model into bf16 GGUF format.
Unsloth: [2] Converting GGUF bf16 into q4_k_m.
Unsloth: [2] Converting GGUF bf16 into q6_k.
Unsloth: [2] Converting GGUF bf16 into q8_0.
Unsloth: Model files cleanup...         ← BF16.gguf deleted here
Generated files: [Q8_0.gguf, Q6_K.gguf, Q4_K_M.gguf, BF16-mmproj.gguf]
                 ^ BF16.gguf missing

Fix

Guard the deletion so the intermediate file is only removed when it is NOT part of the user's requested quantization_method:

# AFTER (fixed)
if quants_created:
    if first_conversion not in quantization_method:
        all_saved_locations.remove(base_gguf)
        Path(base_gguf).unlink(missing_ok = True)

This also aligns with line 1363, which already correctly computes want_full_precision = first_conversion in frozenset(quantization_method); the cleanup step was simply not respecting that.
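The guard can be exercised in isolation. The sketch below is hypothetical (the cleanup function, temp files, and call sites are stand-ins, not the real save.py code path); only the guarded deletion mirrors the fix above.

```python
# Minimal sketch of the guarded cleanup; function and setup are hypothetical.
import os
import tempfile
from pathlib import Path

def cleanup(all_saved_locations, base_gguf, first_conversion,
            quantization_method, quants_created):
    # Delete the intermediate only when the user did not request it as output.
    if quants_created and first_conversion not in quantization_method:
        all_saved_locations.remove(base_gguf)
        Path(base_gguf).unlink(missing_ok = True)
    return all_saved_locations

tmp = tempfile.mkdtemp()
base = os.path.join(tmp, "model-BF16.gguf")
quant = os.path.join(tmp, "model-Q4_K_M.gguf")
for f in (base, quant):
    Path(f).touch()

# bf16 explicitly requested -> BF16.gguf survives
kept = cleanup([base, quant], base, "bf16", ["q4_k_m", "bf16"], True)
assert base in kept and Path(base).exists()

# bf16 only an intermediate -> BF16.gguf removed
dropped = cleanup([base, quant], base, "bf16", ["q4_k_m"], True)
assert base not in dropped and not Path(base).exists()
```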

Impact

  • Before: bf16 GGUF is silently deleted when exported alongside other quantized types
  • After: bf16 GGUF is preserved when explicitly requested; intermediate-only files are still cleaned up as before

Test plan

  • Export with mixed quant types: model.save_pretrained_gguf("./output", tokenizer, quantization_method=["q4_k_m", "q6_k", "q8_0", "bf16"]) — verify BF16.gguf is present in output
  • Export with only quant types (no bf16): model.save_pretrained_gguf("./output", tokenizer, quantization_method=["q4_k_m", "q8_0"]) — verify intermediate bf16 GGUF IS cleaned up
  • Export with bf16 only: model.save_pretrained_gguf("./output", tokenizer, quantization_method=["bf16"]) — verify BF16.gguf is present (existing behavior unchanged)
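A small helper can make the three checks above mechanical. Everything here is an assumption layered on the log output (the helper name, and the heuristic that each requested type's uppercased tag appears in exactly the matching filename, e.g. "q4_k_m" in Q4_K_M.gguf):

```python
# Hypothetical verifier for the test plan: returns the requested quant
# types with no matching GGUF file in the output directory.
from pathlib import Path

def check_outputs(output_dir, quantization_method):
    present = {p.name for p in Path(output_dir).glob("*.gguf")}
    # "q4_k_m" -> "Q4_K_M", "bf16" -> "BF16"; skip mmproj files so a
    # BF16-mmproj.gguf does not count as the bf16 text model.
    return [q for q in quantization_method
            if not any(q.upper() in name and "mmproj" not in name
                       for name in present)]

# Usage after an export: an empty list means every requested type exists.
# missing = check_outputs("./output", ["q4_k_m", "q6_k", "q8_0", "bf16"])
```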

When bf16 is both the intermediate conversion format and a requested
output, the cleanup step was unconditionally deleting the base GGUF
file. Guard the deletion so the intermediate is only removed when it
is not part of the user's requested quantization methods.

Closes unslothai#4932
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request modifies unsloth/save.py to prevent the deletion of the base GGUF file if it's not part of the quantization method. However, this change introduces a regression in the file ordering for Vision-Language Models (VLMs), causing the mmproj file to be incorrectly positioned after list reversal, which could break downstream logic. A suggestion is provided to correctly reorder the files for VLMs when the base model is kept.

Comment thread: unsloth/save.py
When bf16 is a requested output alongside other quantized types for
VLMs, the base GGUF (text model) stays in the list. The subsequent
reverse() would place text_model at the end instead of mmproj,
breaking downstream code that expects mmproj as the last element.

Swap base_gguf past mmproj before reversal so mmproj ends up at
the correct position after the list is flipped.

Credit: Gemini Code Assist review on PR unslothai#4937
@Shivamjohri247
Author

Thanks for the catch! You're right — for VLMs where initial_files = [text_model, mmproj], keeping the base GGUF and then calling reverse() would place text_model at the end instead of mmproj, breaking downstream code at line 2146 that expects mmproj as the last element.

Fixed in the latest push by swapping base_gguf past mmproj before reversal when it's a VLM (len(initial_files) > 1). Verified all 4 scenarios:

Scenario           | VLM | bf16 requested | Result
Non-VLM, no bf16   | No  | No             | Base deleted, reversed quantized files
Non-VLM, with bf16 | No  | Yes            | bf16 + quantized, correct order
VLM, no bf16       | Yes | No             | Base deleted, mmproj at end after reverse
VLM, with bf16     | Yes | Yes            | mmproj swapped to front before reverse, ends up last
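The ordering problem and the swap fix can be shown with plain lists (filenames below are stand-ins, not the actual variables in save.py):

```python
# Sketch of the VLM ordering issue: downstream code expects mmproj at
# index -1 after the trailing reverse().
initial_files = ["text_model.gguf", "mmproj.gguf"]   # VLM export
all_saved = initial_files + ["Q4_K_M.gguf"]          # base kept (bf16 requested)

naive = list(reversed(all_saved))
# naive == ["Q4_K_M.gguf", "mmproj.gguf", "text_model.gguf"]  # text last: wrong

fixed = list(all_saved)
fixed.insert(1, fixed.pop(0))    # swap base past mmproj before reversal
fixed.reverse()
# fixed == ["Q4_K_M.gguf", "text_model.gguf", "mmproj.gguf"]  # mmproj last
```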

@rolandtannous
Collaborator

Thanks for the fix! I'm going to push a couple of commits to address a few items.

rolandtannous and others added 3 commits April 13, 2026 17:45
The previous len(initial_files) > 1 guard would incorrectly fire for
sharded text-only models (convert_to_gguf extends shard files into
initial_files when max_shard_size splits a large model), reordering the
shard list as if it were a VLM.

is_vlm is already maintained from convert_to_gguf's return value and is
the semantically correct flag for "has mmproj that must stay last".
When convert_to_gguf shards the text model output, initial_files has
more than two entries ([text_shard1, text_shard2, ..., mmproj]). The
previous insert(1, pop(0)) only repositioned one text part, leaving
mmproj mid-list after the trailing reverse() and breaking the
"mmproj is last" invariant consumed at save.py:2145 where
all_file_locations[-1] is used for --mmproj.

Rebuild the list explicitly so that after reverse() the final order is
[quants_desc..., text_parts_in_order..., mmproj], correct for any
number of text shards including the common single-file case.
@rolandtannous
Collaborator

Nice catch on the root cause. The preservation logic is correct for the common case, but two edge cases gave me pause, so I pushed follow-ups (feel free to revert if you disagree):

316af41a — use is_vlm instead of len(initial_files) > 1

convert_to_gguf extends initial_files with shard files when max_shard_size splits a large non-VLM model. The length check would then fire the VLM-only reorder for a sharded text-only model and scramble the shard list. is_vlm (already set from convert_to_gguf's return) is the semantic flag we actually want.

c9666917 — handle sharded VLMs in the reorder

For a sharded VLM, initial_files = [text_shard1, text_shard2, …, mmproj]. insert(1, pop(0)) only moves one text element, leaving mmproj mid-list after reverse() — which breaks the all_file_locations[-1] mmproj assumption at save.py:2145. Rebuilt the list explicitly so any number of text shards is handled.
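The explicit rebuild can be sketched with stand-in filenames (this is a simplification under the assumption that mmproj is always the last entry of initial_files, as the commit message describes):

```python
# Sketch: rebuild so that after reverse() the order is
# [quants_desc..., text_parts_in_order..., mmproj].
initial_files = ["text-00001.gguf", "text-00002.gguf", "mmproj.gguf"]
quants = ["Q4_K_M.gguf", "Q8_0.gguf"]

n_text = len(initial_files) - 1              # mmproj excluded
# The pre-reverse list must be the mirror of the target order:
pre = [initial_files[-1]] + initial_files[:n_text][::-1] + quants
pre.reverse()
# pre == ["Q8_0.gguf", "Q4_K_M.gguf", "text-00001.gguf",
#         "text-00002.gguf", "mmproj.gguf"]
```

This handles any number of text shards, including the common single-file case (n_text = 1).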

@Shivamjohri247
Author

Thanks for the thorough review and the follow-up commits! Both fixes make total sense:

  • Using is_vlm is clearly the right call over the length check — I didn't account for sharded non-VLM models.
  • The explicit list rebuild for sharded VLMs is the correct approach too; pop(0)/insert(1) was only moving one shard.

Appreciate you catching these edge cases. Happy with the changes, let's merge whenever you're ready.

rolandtannous and others added 2 commits April 14, 2026 13:10
The delete branch of the GGUF cleanup only removed initial_files[0]
from the returned list and from disk. For sharded outputs (large text
models split by max_shard_size), the remaining shards were left:
  - on disk as orphans
  - in the returned list, pushing mmproj out of its [-1] position

Exposed by sharded VLM exports (gemma-3-27b-it, 2 text shards + mmproj)
with quantization_method=["q4_k_m"]: shard 1 was deleted, shard 2
survived as a stray, and all_file_locations[-1] pointed at shard 2
instead of mmproj when constructing the --mmproj CLI hint.

Iterate over initial_files[:n_text] so all text parts are cleaned up;
mmproj is excluded by the slice and keeps its place at the end.
@rolandtannous
Collaborator

Follow-up from E2E testing on unsloth/gemma-3-27b-it (sharded VLM, 2 text shards + mmproj). Testing exposed a pre-existing bug in the cleanup delete branch that the two earlier fixes adjacent to it did not address.

6bcb02f5 — delete all text intermediates, not just base_gguf

When first_conversion is not in quantization_method, the old code did:

all_saved_locations.remove(base_gguf)
Path(base_gguf).unlink(missing_ok = True)

base_gguf = initial_files[0]. For a sharded output, that's only shard 1 — shards 2..N are left orphaned on disk AND still in the returned list, which breaks the all_file_locations[-1] == mmproj invariant consumed at save.py:2145.

Observed with quantization_method=["q4_k_m"] on gemma-3-27b-it:

  • initial_files = [BF16-00001-of-00002, BF16-00002-of-00002, BF16-mmproj]
  • Returned list: [Q4_K_M, BF16-mmproj, BF16-00002-of-00002] ← shard 2 stray, mmproj not at [-1]
  • On disk: shard 1 gone, shard 2 (3.9G) still there as garbage

The fix iterates initial_files[:n_text] so every text part is both removed from the list and unlinked from disk. n_text excludes mmproj for VLMs so the existing mmproj preservation is untouched. Backwards compatible with the non-sharded single-file case (n_text=1, loop runs once, same behavior as before).
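The loop's effect can be reproduced with dummy files (a sketch: the temp-dir setup is hypothetical, but the variable names mirror the description above):

```python
# Sketch of the sharded delete branch: unlink every text shard, not just
# the first; the [:n_text] slice keeps mmproj out of the deletion loop.
import os
import tempfile
from pathlib import Path

tmp = tempfile.mkdtemp()
names = ("BF16-00001-of-00002.gguf", "BF16-00002-of-00002.gguf",
         "BF16-mmproj.gguf", "Q4_K_M.gguf")
paths = [os.path.join(tmp, n) for n in names]
for p in paths:
    Path(p).touch()

initial_files = paths[:3]              # two text shards + mmproj
all_saved_locations = list(paths)

is_vlm = True
n_text = len(initial_files) - 1 if is_vlm else len(initial_files)
for shard in initial_files[:n_text]:   # both shards; mmproj excluded
    all_saved_locations.remove(shard)
    Path(shard).unlink(missing_ok = True)

# mmproj stays listed and on disk; both text shards are gone
assert all_saved_locations == [paths[2], paths[3]]
```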

This is pre-existing — it was broken before this PR opened. Scoping it in because the PR is already the cleanup-logic PR and the gemma sharded test made it loud.

Successfully merging this pull request may close these issues.

[Bug] bf16 GGUF deleted by internal cleanup when exported alongside quantized types