Bug Description
Some PDF documents contain control characters (ASCII 0-31) that cause JSON parsing failures during translation. When parsing fails, the fallback mechanism uses the raw paragraph.unicode, which leaves residual characters such as "th" and "ft" in the translated output.
Error Messages Observed
Error during automatic terms extract: Invalid control character at: line X column Y (char Z)
Error Illegal trailing comma before end of object: line X column Y (char Z)
Error Invalid \escape: line X column Y (char Z)
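The first error can be reproduced outside BabelDOC with Python's standard json module. This is a minimal sketch, not BabelDOC code: the bad_output string is a made-up stand-in for an LLM response that carries a raw control character inside a JSON string value.

```python
import json

# Hypothetical LLM output: a raw 0x02 character sits inside a JSON string value.
bad_output = '{"output": "sample\x02text"}'

try:
    json.loads(bad_output)
    error_text = ""
except json.JSONDecodeError as exc:
    # The strict decoder rejects unescaped control characters inside strings.
    error_text = str(exc)

print(error_text)
```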
Example
Original text: "This is a sample paragraph"
Translation result: "这是示例段落th" (note the residual "th")
Steps to Reproduce
- Use a PDF that contains control characters
- Run translation with debug logging enabled
- Observe JSON parsing errors in logs
- Check translated output for residual characters
Root Cause
In babeldoc/format/pdf/document_il/midend/il/il_translator_llm_only.py, the _clean_json_output method removes wrapper tags but does NOT remove control characters:
    def _clean_json_output(self, llm_output: str) -> str:
        llm_output = llm_output.strip()
        if llm_output.startswith("<json>"):
            llm_output = llm_output[6:]
        # ... more tag removal ...
        # Missing: control character removal!
        return llm_output.strip()
When JSON parsing fails due to control characters, the fallback mechanism is triggered:
    except Exception as e:
        error_message = f"Error {e} during translation. try fallback"
        for llm_translate_tracker in llm_translate_trackers:
            llm_translate_tracker.set_fallback_to_translate()
The fallback uses the raw paragraph.unicode, which still contains the problematic control characters.
Suggested Fix
Add control character removal to _clean_json_output:
    # Remove control characters (ASCII 0-31 except \n\t\r)
    llm_output = ''.join(
        char for char in llm_output
        if ord(char) >= 32 or char in '\n\t\r'
    )
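The same filter can be checked in isolation. The sketch below is a standalone version under stated assumptions: strip_control_chars is a hypothetical helper name (not part of BabelDOC), and the sample strings are invented. One caveat worth verifying: Python's strict JSON decoder also rejects raw \n, \t and \r inside string values, so because the filter keeps those three characters, parsing may additionally need strict=False (a real json.loads option) for output that contains them.

```python
import json


def strip_control_chars(text: str) -> str:
    """Drop ASCII control characters (0-31) except newline, tab and carriage return."""
    return "".join(ch for ch in text if ord(ch) >= 32 or ch in "\n\t\r")


# Invented sample: a 0x02 character embedded in the translated string.
raw = '{"output": "这是示例\x02段落"}'
parsed = json.loads(strip_control_chars(raw))

# A raw newline survives the filter; strict=False lets the decoder accept it.
with_newline = '{"a": "x\ny"}'
lenient = json.loads(strip_control_chars(with_newline), strict=False)

print(parsed, lenient)
```

Whether to keep \n, \t and \r or to escape them instead is a design choice for the maintainers; the sketch only shows how each option behaves with the standard decoder.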
Environment
- BabelDOC: v0.5.23
- pdf2zh-next: v2.8.2
- Python: 3.12+
Related
A previous attempt to fix this issue in the pdf-translation project via monkey patching was reverted due to test failures.