Fix user_defined_symbols encoding for WORD model #1255

Open
Deepak-png981 wants to merge 4 commits into google:master from Deepak-png981:fix-word-model-user-defined-symbols

Conversation

@Deepak-png981

Summary

Fixes #801.

When training with model_type=word, tokens listed in user_defined_symbols were encoded as <unk> even though the symbol appeared in the vocabulary. WORD models tokenize text into whitespace-prefixed pieces (for example ▁.), but only the raw symbol (for example .) was registered during training.

This change registers both forms for WORD models when whitespace escaping is enabled.
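
The registration step described above can be sketched in Python. This is only an illustration of the idea, not the actual C++ change in src/trainer_interface.cc: the helper name `expand_user_defined_symbols` and its parameters are hypothetical, and U+2581 is assumed to be the escaped whitespace marker.

```python
# Illustrative sketch only; the real change lives in C++ (src/trainer_interface.cc).
# `expand_user_defined_symbols` and its parameters are hypothetical names.
WS = "\u2581"  # escaped whitespace marker, rendered as "▁"


def expand_user_defined_symbols(symbols, model_type="word", escape_whitespaces=True):
    """Return the symbols to register, adding the whitespace-prefixed form
    for WORD models when whitespace escaping is enabled."""
    expanded = []
    seen = set()
    for sym in symbols:
        candidates = [sym]
        if model_type == "word" and escape_whitespaces and not sym.startswith(WS):
            candidates.append(WS + sym)
        for cand in candidates:
            if cand not in seen:  # keep order, drop duplicates
                seen.add(cand)
                expanded.append(cand)
    return expanded


registered = expand_user_defined_symbols(["."])
print(registered)  # ['.', '▁.']
```

For other model types the sketch leaves the list unchanged, which mirrors the stated scope of the fix (WORD models only).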

Changes

  1. Register whitespace-prefixed user_defined_symbols during WORD model training (src/trainer_interface.cc).
  2. Add a C++ regression test in src/word_model_trainer_test.cc.
  3. Add a Python end-to-end regression test in python/test/sentencepiece_test.py.

Test plan

  • Build with tests enabled: cmake -B build -DSPM_BUILD_TEST=ON && cmake --build build
  • Run C++ tests: ctest --test-dir build -R sentencepiece_test --output-on-failure
  • Run targeted Python test: test_word_model_user_defined_symbol in python/test/sentencepiece_test.py

WORD tokenization emits pieces with the escaped whitespace prefix, so
user_defined_symbols must include the prefixed form during training.

Verify that a user-defined symbol is encoded as its own piece instead of
unk when training a word model.

Train a word model with a user-defined symbol and assert encode output
uses the symbol piece rather than unk.
@google-cla

google-cla Bot commented May 28, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@taku910
Collaborator

taku910 commented May 29, 2026

=================================== FAILURES ===================================
________ TestSentencepieceProcessor.test_word_model_user_defined_symbol ________

self = <test.sentencepiece_test.TestSentencepieceProcessor testMethod=test_word_model_user_defined_symbol>

  def test_word_model_user_defined_symbol(self):
    with tempfile.TemporaryDirectory() as work_dir:
      input_file = os.path.join(work_dir, 'input.txt')
      model_prefix = os.path.join(work_dir, 'word_model')
      with open(input_file, 'w', encoding='utf-8') as f:
        f.write('hello . world\n')
        f.write('hello . test\n')
  
      spm.SentencePieceTrainer.train(
          input=input_file,
          model_prefix=model_prefix,
          model_type='word',
          vocab_size=8,
          hard_vocab_limit=False,
          normalization_rule_name='identity',
          user_defined_symbols=['.'],
          bos_id=-1,
          eos_id=-1,
      )
  
      sp = spm.SentencePieceProcessor(model_file=model_prefix + '.model')
      pieces = sp.encode('hello . world', out_type=str)
      ids = sp.encode('hello . world', out_type=int)
  
      self.assertEqual(['▁hello', '▁.', '▁world'], pieces)
      self.assertFalse(sp.is_unknown(ids[1]))

E AssertionError: True is not false

/project/test/sentencepiece_test.py:656: AssertionError
=========================== short test summary info ============================
FAILED ../../../project/test/sentencepiece_test.py::TestSentencepieceProcessor::test_word_model_user_defined_symbol
========================= 1 failed, 24 passed in 6.81s =========================

Keep the end-to-end piece output assertion while the C++ regression test
covers id-level behavior for user_defined_symbols in WORD models.
@Deepak-png981
Author

Thanks for the report and the re-run, @taku910.
I pushed a follow-up that keeps the root-cause fix and adjusts the Python regression to avoid the CI-only is_unknown(ids[1]) failure.
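
One possible reading of the failure log, offered as an assumption rather than a diagnosis: a piece-level assertion can pass even when the id is still unknown, because string output for an out-of-vocabulary word shows its surface form. The toy encoder below is not SentencePiece code; `toy_word_encode`, the vocabularies, and the id values are all made up for illustration.

```python
# Toy model of word-level encoding; NOT the SentencePiece implementation.
# All names, vocabularies, and ids here are hypothetical.
WS = "\u2581"  # escaped whitespace marker, rendered as "▁"
UNK_ID = 0


def toy_word_encode(text, vocab):
    """Split on whitespace, prefix each word with WS, and look up ids.
    Pieces keep their surface form even when the id falls back to UNK_ID."""
    pieces = [WS + word for word in text.split()]
    ids = [vocab.get(piece, UNK_ID) for piece in pieces]
    return pieces, ids


# Only the raw symbol "." is registered: the prefixed piece maps to unk.
vocab_raw_only = {"<unk>": 0, ".": 1, WS + "hello": 2, WS + "world": 3}
# Both forms registered: the prefixed piece gets its own id.
vocab_both = dict(vocab_raw_only)
vocab_both[WS + "."] = 4

pieces_raw, ids_raw = toy_word_encode("hello . world", vocab_raw_only)
pieces_both, ids_both = toy_word_encode("hello . world", vocab_both)

print(pieces_raw == pieces_both)  # True: piece output is identical
print(ids_raw)                    # [2, 0, 3]
print(ids_both)                   # [2, 4, 3]
```

Under this assumption, an id-level check is the one that distinguishes the fixed and unfixed cases, so dropping it from the Python test leaves that behavior covered only by the C++ regression test.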

Development

Successfully merging this pull request may close these issues.

tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type
