[Bug] Control characters in PDF text cause JSON parsing failure and residual characters in translations #577

@boyingliu01

Bug Description

Some PDF documents contain control characters (ASCII 0-31). When these characters reach the JSON parsing step during translation, parsing fails. The fallback mechanism then uses the raw paragraph.unicode, and residual characters such as "th" and "ft" appear in the translated output.

Error Messages Observed

Error during automatic terms extract: Invalid control character at: line X column Y (char Z)
Error Illegal trailing comma before end of object: line X column Y (char Z)
Error Invalid \escape: line X column Y (char Z)
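
For reference, two of these messages can be reproduced with Python's standard json module alone, without BabelDOC. This is a minimal sketch: the helper name parse_error and the sample payloads are made up for illustration and are not taken from the BabelDOC code base.

```python
import json
from typing import Optional


def parse_error(raw: str) -> Optional[str]:
    """Return the json.loads error message for raw, or None if it parses."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# A raw control character (0x01) inside a JSON string value.
control_char_error = parse_error('{"text": "sample\x01paragraph"}')

# A backslash followed by a character that is not a valid JSON escape.
bad_escape_error = parse_error('{"text": "sample\\xparagraph"}')

print(control_char_error)
print(bad_escape_error)
```

The exact wording of the trailing-comma message depends on the Python version, so it is left out of this sketch.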

Example

Original text: "This is a sample paragraph"
Translation result: "这是示例段落th" (note the residual "th")

Steps to Reproduce

  1. Use a PDF that contains control characters
  2. Run translation with debug logging enabled
  3. Observe JSON parsing errors in logs
  4. Check translated output for residual characters

Root Cause

In babeldoc/format/pdf/document_il/midend/il/il_translator_llm_only.py, the _clean_json_output method removes wrapper tags but does NOT remove control characters:

def _clean_json_output(self, llm_output: str) -> str:
    llm_output = llm_output.strip()
    if llm_output.startswith("<json>"):
        llm_output = llm_output[6:]
    # ... more tag removal ...
    # Missing: control character removal!
    return llm_output.strip()

When JSON parsing fails due to control characters, the fallback mechanism is triggered:

except Exception as e:
    error_message = f"Error {e} during translation. try fallback"
    for llm_translate_tracker in llm_translate_trackers:
        llm_translate_tracker.set_fallback_to_translate()

The fallback uses raw paragraph.unicode which contains the problematic control characters.

Suggested Fix

Add control character removal to _clean_json_output:

# Remove control characters (ASCII 0-31 except \n\t\r)
llm_output = ''.join(
    char for char in llm_output
    if ord(char) >= 32 or char in '\n\t\r'
)
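
As a self-contained check of the idea, the sketch below wraps the same filter in a standalone function and feeds the result to json.loads. The name strip_control_chars and the sample payload are hypothetical; how this would be wired into _clean_json_output is not shown here.

```python
import json

# Whitespace control characters that the suggested fix keeps.
_KEEP = "\n\t\r"


def strip_control_chars(text: str) -> str:
    """Drop ASCII control characters (0-31) except newline, tab, and CR."""
    return "".join(ch for ch in text if ord(ch) >= 32 or ch in _KEEP)


# Hypothetical LLM output with stray control characters inside a string value.
raw = '{"translation": "sample\x01 paragraph\x08"}\n'

cleaned = strip_control_chars(raw)
parsed = json.loads(cleaned)

# Caveat: a raw newline, tab, or CR *inside* a JSON string is still rejected
# by the default strict parser. json.loads(..., strict=False) accepts them.
lenient = json.loads('{"translation": "line one\nline two"}', strict=False)
```

Keeping \n, \t, and \r preserves formatting whitespace between JSON tokens, but the same characters inside a string value would still trigger "Invalid control character" under strict parsing. Whether to strip them too or to parse with strict=False is a design choice for the maintainers.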

Environment

  • BabelDOC: v0.5.23
  • pdf2zh-next: v2.8.2
  • Python: 3.12+

Related

A previous attempt to fix this issue in the pdf-translation project via monkey patching was reverted due to test failures.
