Fix user_defined_symbols encoding for WORD model #1255
WORD tokenization emits pieces with the escaped whitespace prefix, so user_defined_symbols must include the prefixed form during training.
Verify that a user-defined symbol is encoded as its own piece instead of unk when training a word model.
Train a word model with a user-defined symbol and assert encode output uses the symbol piece rather than unk.
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
=================================== FAILURES ===================================
self = <test.sentencepiece_test.TestSentencepieceProcessor testMethod=test_word_model_user_defined_symbol>
E AssertionError: True is not false
/project/test/sentencepiece_test.py:656: AssertionError
Keep the end-to-end piece output assertion while the C++ regression test covers id-level behavior for user_defined_symbols in WORD models.
Thanks for the report and the re-run, @taku910.
Summary
Fixes #801.
When training with `model_type=word`, tokens listed in `user_defined_symbols` were encoded as `<unk>` even though the symbol appeared in the vocabulary. WORD models tokenize text into whitespace-prefixed pieces (for example `▁.`), but only the raw symbol (for example `.`) was registered during training.

This change registers both forms for WORD models when whitespace escaping is enabled.
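The mismatch described above can be illustrated with a small, self-contained sketch. This is a simplified model of the lookup, not SentencePiece's actual code: the helper names (`word_pieces`, `encode`, `register_symbols`) and the toy vocabulary are assumptions made for illustration only.

```python
# Simplified model of WORD-style piece lookup. Hypothetical helpers,
# not the SentencePiece implementation.
WS = "\u2581"  # the escaped-whitespace marker "▁"


def word_pieces(text):
    """Split on whitespace and prefix every word with the marker."""
    return [WS + word for word in text.split()]


def encode(text, vocab):
    """Return each piece if it is in the vocabulary, otherwise <unk>."""
    return [piece if piece in vocab else "<unk>" for piece in word_pieces(text)]


def register_symbols(vocab, symbols, add_prefixed):
    """Return a copy of vocab with the user-defined symbols added.

    With add_prefixed=True the whitespace-prefixed form is added as well,
    which mirrors the idea of registering both forms.
    """
    result = set(vocab)
    for symbol in symbols:
        result.add(symbol)
        if add_prefixed:
            result.add(WS + symbol)
    return result


base = {WS + "hello", WS + "world"}
before = register_symbols(base, ["."], add_prefixed=False)
after = register_symbols(base, ["."], add_prefixed=True)

print(encode("hello . world", before))  # ['▁hello', '<unk>', '▁world']
print(encode("hello . world", after))   # ['▁hello', '▁.', '▁world']
```

In the first call the vocabulary contains `.` but the tokenizer looks up `▁.`, so the symbol falls through to `<unk>`; once the prefixed form is also registered, the lookup succeeds.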
Changes
- Register the whitespace-prefixed form of `user_defined_symbols` during WORD model training (`src/trainer_interface.cc`).
- Add a C++ regression test in `src/word_model_trainer_test.cc`.
- Add a Python test in `python/test/sentencepiece_test.py`.

Test plan
- `cmake -B build -DSPM_BUILD_TEST=ON && cmake --build build`
- `ctest --test-dir build -R sentencepiece_test --output-on-failure`
- `test_word_model_user_defined_symbol` in `python/test/sentencepiece_test.py`