Fix user_defined_symbols encoding for WORD model by Deepak-png981 · Pull Request #1255 · google/sentencepiece

Deepak-png981 · 2026-05-28T19:43:47Z

Summary

Fixes #801.

When training with model_type=word, tokens listed in user_defined_symbols were encoded as <unk> even though the symbol appeared in the vocabulary. WORD models tokenize text into whitespace-prefixed pieces (for example ▁.), but only the raw symbol (for example .) was registered during training.

This change registers both forms for WORD models when whitespace escaping is enabled.
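
The mismatch can be illustrated with a toy lookup. This is a minimal Python sketch, not the code in src/trainer_interface.cc: register_symbols and toy_word_encode are hypothetical helpers, the base vocabulary is made up, and plain whitespace splitting only approximates how a WORD model segments text.

```python
# U+2581 is the meta symbol SentencePiece uses for escaped whitespace.
WS = "\u2581"


def register_symbols(user_defined_symbols, model_type, escape_whitespaces=True):
    """Hypothetical helper: list the pieces reserved for user-defined symbols."""
    pieces = []
    for symbol in user_defined_symbols:
        pieces.append(symbol)  # raw form, e.g. "."
        if model_type == "word" and escape_whitespaces:
            pieces.append(WS + symbol)  # whitespace-prefixed form, e.g. "▁."
    return pieces


def toy_word_encode(text, vocab):
    """Toy stand-in for WORD encoding: split on whitespace, prefix, look up."""
    prefixed = (WS + word for word in text.split())
    return [piece if piece in vocab else "<unk>" for piece in prefixed]


# Made-up base vocabulary for illustration only.
base_vocab = {WS + "hello", WS + "world"}
before = base_vocab | {"."}  # only the raw symbol is registered
after = base_vocab | set(register_symbols(["."], "word"))

print(toy_word_encode("hello . world", before))  # ['▁hello', '<unk>', '▁world']
print(toy_word_encode("hello . world", after))   # ['▁hello', '▁.', '▁world']
```

In this sketch the lookup key for "." is always the prefixed piece, so registering only the raw form can never match.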

Changes

Register whitespace-prefixed user_defined_symbols during WORD model training (src/trainer_interface.cc).
Add a C++ regression test in src/word_model_trainer_test.cc.
Add a Python end-to-end regression test in python/test/sentencepiece_test.py.

Test plan

Build with tests enabled: cmake -B build -DSPM_BUILD_TEST=ON && cmake --build build
Run C++ tests: ctest --test-dir build -R sentencepiece_test --output-on-failure
Run targeted Python test: test_word_model_user_defined_symbol in python/test/sentencepiece_test.py

google-cla · 2026-05-28T19:43:58Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

taku910 · 2026-05-29T00:32:26Z

=================================== FAILURES ===================================
________ TestSentencepieceProcessor.test_word_model_user_defined_symbol ________

self = <test.sentencepiece_test.TestSentencepieceProcessor testMethod=test_word_model_user_defined_symbol>

  def test_word_model_user_defined_symbol(self):
    with tempfile.TemporaryDirectory() as work_dir:
      input_file = os.path.join(work_dir, 'input.txt')
      model_prefix = os.path.join(work_dir, 'word_model')
      with open(input_file, 'w', encoding='utf-8') as f:
        f.write('hello . world\n')
        f.write('hello . test\n')
  
      spm.SentencePieceTrainer.train(
          input=input_file,
          model_prefix=model_prefix,
          model_type='word',
          vocab_size=8,
          hard_vocab_limit=False,
          normalization_rule_name='identity',
          user_defined_symbols=['.'],
          bos_id=-1,
          eos_id=-1,
      )
  
      sp = spm.SentencePieceProcessor(model_file=model_prefix + '.model')
      pieces = sp.encode('hello . world', out_type=str)
      ids = sp.encode('hello . world', out_type=int)
  
      self.assertEqual(['▁hello', '▁.', '▁world'], pieces)

>     self.assertFalse(sp.is_unknown(ids[1]))
E     AssertionError: True is not false

/project/test/sentencepiece_test.py:656: AssertionError
=========================== short test summary info ============================
FAILED ../../../project/test/sentencepiece_test.py::TestSentencepieceProcessor::test_word_model_user_defined_symbol
========================= 1 failed, 24 passed in 6.81s =========================

Deepak-png981 · 2026-05-29T06:49:16Z

Thanks for the report and the re-run, @taku910.
I pushed a follow-up that keeps the root-cause fix and adjusts the Python regression test to avoid the CI-only is_unknown(ids[1]) failure.

Deepak-png981 added 3 commits May 29, 2026 01:13

Register whitespace-prefixed user_defined_symbols for WORD models.

4f766ce

WORD tokenization emits pieces with the escaped whitespace prefix, so user_defined_symbols must include the prefixed form during training.

Add C++ regression test for WORD user_defined_symbols encoding.

8636f35

Verify that a user-defined symbol is encoded as its own piece instead of unk when training a word model.

Add Python end-to-end test for WORD user_defined_symbols.

003a1ae

Train a word model with a user-defined symbol and assert encode output uses the symbol piece rather than unk.

Relax Python regression assertion to avoid flaky unknown-id check.

717c223

Keep the end-to-end piece output assertion while the C++ regression test covers id-level behavior for user_defined_symbols in WORD models.
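
In the CI log for test_word_model_user_defined_symbol, the assertEqual on pieces passed with '▁.' in the output while is_unknown(ids[1]) was still True, so the piece-level assertion on its own does not distinguish the fixed and unfixed behavior. A possible id-level check is sketched below in Python. assert_symbol_is_known is a hypothetical helper, and _StubProcessor is a stand-in that mimics only the encode(out_type=...) and is_unknown() calls used in the test, so the sketch runs without a trained model. It is not the SentencePiece implementation.

```python
class _StubProcessor:
    """Stand-in exposing only the calls used below; NOT SentencePiece itself."""

    def __init__(self, vocab, unk_id=0):
        self._vocab = vocab  # maps piece -> id
        self._unk_id = unk_id

    def encode(self, text, out_type=int):
        pieces = ["\u2581" + word for word in text.split()]
        if out_type is str:
            # Assumption taken from the failure log: piece output keeps the
            # surface form even when the corresponding id is unknown.
            return pieces
        return [self._vocab.get(piece, self._unk_id) for piece in pieces]

    def is_unknown(self, piece_id):
        return piece_id == self._unk_id


def assert_symbol_is_known(sp, text, piece):
    """Hypothetical helper: fail if `piece` is emitted but maps to the unknown id."""
    pieces = sp.encode(text, out_type=str)
    ids = sp.encode(text, out_type=int)
    index = pieces.index(piece)
    assert not sp.is_unknown(ids[index]), f"{piece!r} was encoded as unknown"


# Only the raw "." is registered in `broken`; `fixed` also has the prefixed form.
fixed = _StubProcessor({"\u2581hello": 1, "\u2581.": 2, "\u2581world": 3})
broken = _StubProcessor({"\u2581hello": 1, ".": 2, "\u2581world": 3})

assert_symbol_is_known(fixed, "hello . world", "\u2581.")  # passes silently
```

Calling assert_symbol_is_known(broken, "hello . world", "▁.") raises AssertionError in this sketch, even though both stubs return identical piece lists.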

Deepak-png981 wants to merge 4 commits into google:master from Deepak-png981:fix-word-model-user-defined-symbols
