Fix UnicodeDecodeError when reading packed-refs with non-UTF8 characters #2091
Draft
MirrorDNA-Reflection-Protocol wants to merge 3 commits into gitpython-developers:main from
Fixes gitpython-developers#2064

The packed-refs file can contain ref names that are not valid UTF-8 (e.g., Latin-1 encoded tag names created by older Git versions or non-UTF8 systems). Previously, opening the file with encoding='UTF-8' would raise UnicodeDecodeError.

Changes:
- Add errors='surrogateescape' to the open() call in _iter_packed_refs()
- This allows reading files with arbitrary byte sequences while still treating valid UTF-8 as text
- Add test that verifies non-UTF8 packed-refs can be read successfully

The 'surrogateescape' error handler is the standard Python approach for handling potentially non-UTF8 data in filesystem operations, as it preserves the original bytes in a reversible way.
Pull request overview
This PR fixes a UnicodeDecodeError that occurred when GitPython attempted to read packed-refs files containing ref names encoded with non-UTF-8 character encodings (e.g., Latin-1 encoded tag names from older Git versions). The fix uses Python's surrogateescape error handler, which is the standard approach for handling filesystem operations with potentially mixed or unknown encodings.
Key changes:
- Adds the `errors='surrogateescape'` parameter to file reading in the `_iter_packed_refs()` method
- Adds a comprehensive test that reproduces and verifies the fix for the Unicode decoding issue
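The effect of the error handler can be illustrated on raw bytes, without touching GitPython itself. This is only a sketch: the hash is a made-up placeholder and the tag name `caf\xe9` is an assumed example of a Latin-1 encoded ref name.

```python
# A packed-refs style line whose tag name holds a Latin-1 "é" (byte 0xE9),
# which is not valid UTF-8 on its own. The hash is a made-up placeholder.
raw_line = b"0123456789abcdef0123456789abcdef01234567 refs/tags/caf\xe9\n"

# Strict decoding (the default error mode) rejects the stray byte.
try:
    raw_line.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape the same bytes decode to a str instead of raising;
# the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
decoded = raw_line.decode("utf-8", errors="surrogateescape")
sha, ref_name = decoded.split()
```

Here `strict_ok` ends up `False`, while `ref_name` is an ordinary `str` whose last character stands in for the original byte.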
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| git/refs/symbolic.py | Adds errors='surrogateescape' to the packed-refs file reader to handle non-UTF8 encoded ref names gracefully |
| test/test_refs.py | Adds test case that creates a packed-refs file with Latin-1 encoded ref name and verifies it can be read without errors |
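A test along the lines described in the table might look roughly like the following. It is a standalone sketch, not the test added in `test/test_refs.py`: the helper `read_packed_ref_names` is hypothetical and only mimics opening the file with `errors='surrogateescape'`, and the hashes and tag names are invented.

```python
import os
import tempfile


def read_packed_ref_names(path):
    """Hypothetical helper: return ref names from a packed-refs style file."""
    names = []
    with open(path, "rt", encoding="UTF-8", errors="surrogateescape") as fp:
        for line in fp:
            line = line.strip()
            # Skip blanks, the header comment, and peeled-tag ("^") lines.
            if not line or line.startswith("#") or line.startswith("^"):
                continue
            _sha, name = line.split(" ", 1)
            names.append(name)
    return names


def test_reads_non_utf8_packed_refs():
    # The second tag name ends in byte 0xE9 (Latin-1 "é"), invalid as UTF-8.
    content = (
        b"# pack-refs with: peeled fully-peeled sorted\n"
        b"0123456789abcdef0123456789abcdef01234567 refs/tags/ascii\n"
        b"89abcdef0123456789abcdef0123456789abcdef refs/tags/caf\xe9\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "packed-refs")
        with open(path, "wb") as fp:
            fp.write(content)
        names = read_packed_ref_names(path)
    # The undecodable byte survives as the lone surrogate U+DCE9.
    assert names == ["refs/tags/ascii", "refs/tags/caf\udce9"]
    return names
```

Writing the fixture in binary mode matters here: it guarantees the file holds the exact non-UTF-8 byte rather than whatever the platform's default encoding would produce.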
Summary
Fixes #2064
The `packed-refs` file can contain ref names that are not valid UTF-8 (e.g., Latin-1 encoded tag names created by older Git versions or systems with different locale settings). Previously, GitPython would fail with `UnicodeDecodeError` when reading such files.
Reproduction
As described in #2064:
Before fix:
After fix: Successfully reads all 101 tags.
Changes
- Add `errors='surrogateescape'` to the `open()` call in `_iter_packed_refs()`
Technical Details
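The `surrogateescape` error handler is Python's standard approach for handling potentially non-UTF8 data in filesystem operations. It decodes valid UTF-8 as text, maps each undecodable byte to a lone surrogate code point (`\uDC80`-`\uDCFF`), and so preserves the original bytes in a reversible way.
This is the same approach used by Python's `os.fsdecode()` and is recommended for filesystem operations where encoding may be unknown or mixed.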
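The reversibility can be checked directly with a short sketch. The byte string below is an assumed example, not data from the issue.

```python
original = b"refs/tags/caf\xe9"  # 0xE9 is Latin-1 "é", invalid as UTF-8 here

# Decoding maps the undecodable byte into the U+DC80..U+DCFF range.
text = original.decode("utf-8", errors="surrogateescape")
escaped = text[-1]
in_surrogate_range = 0xDC80 <= ord(escaped) <= 0xDCFF

# Encoding with the same handler restores the exact original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Valid UTF-8 is unaffected by the handler.
valid = "café".encode("utf-8").decode("utf-8", errors="surrogateescape")

# A strict encode of the escaped string fails, so callers that write or
# print such ref names need to pass the handler again (or handle the error).
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

One consequence worth keeping in mind: ref names read this way are valid `str` objects for comparison and lookup, but they cannot be strictly encoded to UTF-8 without the same error handler.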