bodyfile: extend character escaping for special Unicode and non-Unicode characters #77

Certain file systems allow file names to contain characters that either have a special meaning in Unicode, such as the surrogate U+D800, or are not valid Unicode characters at all.

The [extended bodyfile 3 format](https://dfimagetools.readthedocs.io/en/latest/sources/Bodyfile-format.html#name-value) currently does not specify how to handle these characters. The proposal is to escape such characters as "\u####" or "\U########", preferring the short form over the long form where possible.
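A minimal sketch of the proposed short-form-first rule, for discussion only. The function name `format_escape` and the use of lowercase hexadecimal digits are assumptions, they are not part of the format specification.

```python
def format_escape(code_point: int) -> str:
    """Formats a code point using the shortest proposed escape form.

    Hypothetical helper: values that fit in 4 hexadecimal digits use the
    short form, everything else uses the 8-digit long form.
    """
    if not 0 <= code_point <= 0xFFFFFFFF:
        raise ValueError(f"code point out of range: {code_point!r}")

    if code_point <= 0xFFFF:
        # Short form: backslash, "u", 4 hexadecimal digits.
        return f"\\u{code_point:04x}"

    # Long form: backslash, "U", 8 hexadecimal digits.
    return f"\\U{code_point:08x}"
```

For example, `format_escape(0xD800)` yields the 6-character text `\ud800`, while `format_escape(0x1FFFE)` falls back to the long form.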

* [x] [Control characters](https://en.wikipedia.org/wiki/C0_and_C1_control_codes) U+1-U+8, U+B-U+C, U+E-U+1F, U+7F-U+84, U+86-U+9F (already covered)
* [x] Unicode surrogate characters U+D800-U+DFFF - https://github.com/log2timeline/dfimagetools/pull/78
* [x] Undefined Unicode characters - https://github.com/log2timeline/dfimagetools/pull/95
  * U+FDD0-U+FDDF
  * U+FFFE-U+FFFF
  * U+1FFFE-U+1FFFF
  * U+2FFFE-U+2FFFF
  * U+3FFFE-U+3FFFF
  * U+4FFFE-U+4FFFF
  * U+5FFFE-U+5FFFF
  * U+6FFFE-U+6FFFF
  * U+7FFFE-U+7FFFF
  * U+8FFFE-U+8FFFF
  * U+9FFFE-U+9FFFF
  * U+AFFFE-U+AFFFF
  * U+BFFFE-U+BFFFF
  * U+CFFFE-U+CFFFF
  * U+DFFFE-U+DFFFF
  * U+EFFFE-U+EFFFF
  * U+FFFFE-U+FFFFF
  * U+10FFFE-U+10FFFF
* [x] Other values observed to be not printable - https://github.com/log2timeline/dfimagetools/pull/95
  * U+2028, U+2029, U+E000, U+F8FF, U+F0000, U+FFFFD, U+100000, U+10FFFD
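The checklist above could be turned into a lookup along the following lines. This is a sketch, not the code from the linked pull requests: `needs_escape` and `escape_name` are hypothetical names, the lowercase hexadecimal output is an assumption, and control characters are deliberately left out because the list marks them as already covered.

```python
# Inclusive ranges copied from the checklist (surrogates and U+FDD0-U+FDDF).
_ESCAPE_RANGES = (
    (0xD800, 0xDFFF),
    (0xFDD0, 0xFDDF),
)

# Individual values listed as observed to be not printable.
_ESCAPE_VALUES = frozenset((
    0x2028, 0x2029, 0xE000, 0xF8FF, 0xF0000, 0xFFFFD, 0x100000, 0x10FFFD))


def needs_escape(code_point: int) -> bool:
    """Determines if a code point is on the checklist (hypothetical helper)."""
    if code_point in _ESCAPE_VALUES:
        return True

    # U+FFFE-U+FFFF, U+1FFFE-U+1FFFF, ... U+10FFFE-U+10FFFF: the last two
    # code points of every plane.
    if code_point <= 0x10FFFF and (code_point & 0xFFFF) >= 0xFFFE:
        return True

    return any(first <= code_point <= last for first, last in _ESCAPE_RANGES)


def escape_name(name: str) -> str:
    """Escapes checklist characters, preferring the short form."""
    parts = []
    for character in name:
        code_point = ord(character)
        if not needs_escape(code_point):
            parts.append(character)
        elif code_point <= 0xFFFF:
            parts.append(f"\\u{code_point:04x}")
        else:
            parts.append(f"\\U{code_point:08x}")
    return "".join(parts)
```

Note that a Python `str` can hold a lone surrogate such as `"\ud800"`, which is what makes it possible to feed such names through a function like this in the first place.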

Open questions
* What about "Unicode compatibility characters"?
* What about U+110000-U+FFFFFFFF?
* What if the original path uses a specific codepage (encoding) that is converted to Unicode, but the resulting string can be encoded back into multiple variations in the original encoding, e.g. encoding U+2252 to cp932? What if there are 2 paths that decode to the same string? How should the original path best be preserved?
* What if a file name contains a path segment separator (e.g. \ or /)? If it is not escaped, this leads to ambiguity, e.g. if / is the path segment separator, is 'test/1234' a single file name or a path?
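The codepage question can be illustrated with Python's built-in `cp932` codec. The sketch below assumes that both byte sequences `81 E0` and `87 90` are mapped to U+2252, as in Microsoft's cp932 table; the variable names are only for illustration.

```python
# Two distinct cp932 byte sequences that are assumed to represent U+2252.
jis_bytes = b"\x81\xe0"  # JIS X 0208 position
nec_bytes = b"\x87\x90"  # duplicate from the NEC extension row

decoded_jis = jis_bytes.decode("cp932")
decoded_nec = nec_bytes.decode("cp932")

# Both "paths" decode to the same Unicode string.
same_string = decoded_jis == decoded_nec == "\u2252"

# Encoding back can only produce one byte sequence, so at most one of the
# two original byte sequences survives a decode/encode round trip.
reencoded = "\u2252".encode("cp932")
round_trips = [
    raw for raw in (jis_bytes, nec_bytes)
    if raw.decode("cp932").encode("cp932") == raw]
```

Once both names have been decoded to Unicode, escaping at the Unicode level cannot tell them apart, which is why the question of how to preserve the original path remains open.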

A related discussion: https://github.com/dfxml-working-group/dfxml_schema/issues/34

Also consider whether the format should be extended with a header that specifies its encoding.
