[Bug] Control characters in PDF text cause JSON parsing failure and residual characters in translations #577

## Bug Description

Some PDF documents contain control characters (ASCII 0-31) in their extracted text, and these cause JSON parsing failures during translation. When parsing fails, the fallback mechanism uses the raw `paragraph.unicode`, and residual characters such as "th" and "ft" appear in the translated output.

## Error Messages Observed

```
Error during automatic terms extract: Invalid control character at: line X column Y (char Z)
Error Illegal trailing comma before end of object: line X column Y (char Z)
Error Invalid \escape: line X column Y (char Z)
```
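The control-character and escape errors can be reproduced with the standard `json` module alone. This is a minimal sketch, assuming the translator parses the model output with `json.loads` in its default strict mode (the parsing call itself is not shown in this issue). `try_parse` is a hypothetical helper used only for illustration.

```python
import json


def try_parse(payload: str) -> str:
    """Return 'ok' if payload parses as JSON, else the decoder's error message."""
    try:
        json.loads(payload)
    except json.JSONDecodeError as exc:
        return str(exc)
    return "ok"


# A raw control character (here U+0001) inside a JSON string is rejected.
control_char_error = try_parse('{"text": "sample\x01paragraph"}')

# A backslash followed by a character that is not a valid JSON escape is rejected.
bad_escape_error = try_parse('{"text": "sample\\qparagraph"}')

# A trailing comma is rejected too; the exact wording depends on the Python version.
trailing_comma_error = try_parse('{"text": "sample paragraph",}')

print(control_char_error)
print(bad_escape_error)
print(trailing_comma_error)
```

Only the first of the three is caused by a control character. The trailing comma and invalid escape are separate malformations in the model output.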

## Example

- Original text: "This is a sample paragraph"
- Translation result: "这是示例段落th" (note the residual "th")

## Steps to Reproduce

1. Use a PDF that contains control characters
2. Run translation with debug logging enabled
3. Observe JSON parsing errors in logs
4. Check translated output for residual characters

## Root Cause

In `babeldoc/format/pdf/document_il/midend/il/il_translator_llm_only.py`, the `_clean_json_output` method removes wrapper tags but does NOT remove control characters:

```python
def _clean_json_output(self, llm_output: str) -> str:
    llm_output = llm_output.strip()
    if llm_output.startswith("<json>"):
        llm_output = llm_output[6:]
    # ... more tag removal ...
    # Missing: control character removal!
    return llm_output.strip()
```

When JSON parsing fails due to control characters, the fallback mechanism is triggered:

```python
except Exception as e:
    error_message = f"Error {e} during translation. try fallback"
    for llm_translate_tracker in llm_translate_trackers:
        llm_translate_tracker.set_fallback_to_translate()
```

The fallback uses the raw `paragraph.unicode`, which still contains the problematic control characters.
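If the control characters come from the extracted PDF text, stripping them before that text is used (in the prompt or as the fallback source) would cover the input side as well. This is only a sketch: `sanitize_source_text` is a hypothetical helper, not an existing BabelDOC function, and where it would be called is an assumption. It filters on the Unicode category `Cc`, which is slightly broader than ASCII 0-31 because it also covers U+007F and U+0080 to U+009F.

```python
import unicodedata

# Whitespace control characters that usually carry meaning in text.
KEEP = {"\n", "\t", "\r"}


def sanitize_source_text(text: str) -> str:
    """Drop Unicode control characters (category Cc) except newline, tab, CR."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )


raw = "This is a\x02 sample\x1f paragraph\n"
print(repr(sanitize_source_text(raw)))
```

Whether this also removes the residual "th" and "ft" would need to be checked against an affected PDF.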

## Suggested Fix

Add control character removal to `_clean_json_output`:

```python
# Remove control characters (ASCII 0-31 except \n\t\r)
llm_output = ''.join(
    char for char in llm_output
    if ord(char) >= 32 or char in '\n\t\r'
)
```
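As a standalone check of the snippet above, the sketch below wraps the same filter in a hypothetical function, `strip_control_chars`, and parses the result. One caveat is worth testing: because `\n`, `\t`, and `\r` are kept, a raw newline inside a JSON string value is still rejected by `json.loads` in its default strict mode. Passing `strict=False` is the standard-library option that tolerates it.

```python
import json


def strip_control_chars(llm_output: str) -> str:
    """Remove ASCII control characters (0-31) except newline, tab, and CR."""
    return "".join(
        char for char in llm_output
        if ord(char) >= 32 or char in "\n\t\r"
    )


# U+0001 inside the string value breaks parsing until it is stripped.
dirty = '{"output": "sample\x01 text"}'
cleaned = json.loads(strip_control_chars(dirty))

# A raw newline inside a string survives the filter and is still rejected
# in strict mode; strict=False accepts it.
with_newline = '{"output": "line one\nline two"}'
try:
    json.loads(strip_control_chars(with_newline))
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False
lenient = json.loads(strip_control_chars(with_newline), strict=False)
```

Depending on how often the model emits raw newlines inside string values, the fix may need `strict=False` or escaping in addition to the filter.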

## Environment

- BabelDOC: v0.5.23
- pdf2zh-next: v2.8.2
- Python: 3.12+

## Related

A previous attempt to fix this in the pdf-translation project via monkey patching was reverted due to test failures.