A simple text generator that learns patterns from input text and generates new text in a similar style.
This program implements a character-level statistical language model using k-grams. It analyzes transition frequencies between k-character sequences and their following characters, then uses these statistics to generate new text that mimics the style of the input data.
- Training: The model scans the input text and builds a frequency table of k-grams (sequences of k characters) and the characters that follow them
- Initialization: Text generation begins by selecting a random k-gram weighted by its frequency in the training data
- Generation: For each step:
  - Sample the next character from the current k-gram's learned distribution
  - Update the k-gram by removing its first character and appending the new one
  - If the new k-gram doesn't exist in the model, restart with a fresh random k-gram
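
The three phases above can be sketched in a few dozen lines of C++17. This is a minimal illustration, not the code in src/: the names `Model`, `train`, `pick_start`, `sample_next`, and `generate` are hypothetical. It also assumes that the initial k-gram is written to the output and that a restart only swaps the context without emitting the fresh k-gram; the actual implementation may differ on both points.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <vector>

// Hypothetical model type: k-gram -> (following character -> count).
using Model = std::map<std::string, std::map<char, int>>;

// Training: count which character follows each k-gram in the text.
Model train(const std::string& text, std::size_t k) {
    assert(k >= 1);
    Model model;
    for (std::size_t i = 0; i + k < text.size(); ++i) {
        ++model[text.substr(i, k)][text[i + k]];
    }
    return model;
}

// Initialization: pick a k-gram with probability proportional to its count.
// Precondition: the model is not empty.
std::string pick_start(const Model& model, std::mt19937& rng) {
    std::vector<std::string> grams;
    std::vector<int> weights;
    for (const auto& entry : model) {
        int total = 0;
        for (const auto& follower : entry.second) total += follower.second;
        grams.push_back(entry.first);
        weights.push_back(total);
    }
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return grams[static_cast<std::size_t>(dist(rng))];
}

// Sample one following character from a k-gram's learned distribution.
char sample_next(const std::map<char, int>& followers, std::mt19937& rng) {
    std::vector<char> chars;
    std::vector<int> weights;
    for (const auto& follower : followers) {
        chars.push_back(follower.first);
        weights.push_back(follower.second);
    }
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return chars[static_cast<std::size_t>(dist(rng))];
}

// Generation: slide the k-gram window, restarting when the k-gram is unseen.
std::string generate(const Model& model, std::size_t length, std::mt19937& rng) {
    if (model.empty()) return "";
    std::string gram = pick_start(model, rng);
    std::string out = gram;  // assumption: the starting k-gram is emitted
    while (out.size() < length) {
        auto it = model.find(gram);
        if (it == model.end()) {
            gram = pick_start(model, rng);  // restart with a fresh k-gram
            continue;
        }
        char next = sample_next(it->second, rng);
        out += next;
        gram = gram.substr(1) + next;  // drop first character, append new one
    }
    if (out.size() > length) out.resize(length);
    return out;
}
```

For instance, `train("abracadabra", 2)` records that "ab" is followed by 'r' twice and "ra" by 'c' once. Because every stored k-gram has at least one follower, the loop always makes progress after a restart.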
To produce the slm executable:

    make

To clean build artifacts:

    make clean

Usage:

    ./slm <k> <input_file> <generation_length>

Arguments:
- k: K-gram size (must be ≥ 1)
- input_file: Path to training text file
- generation_length: Number of characters to generate (must be ≥ k)
Example:

    ./slm 3 moby.txt 500

This trains a 3-gram model on moby.txt and generates 500 characters.
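
The argument constraints (k ≥ 1 and generation_length ≥ k) could be checked along the following lines. This is only a sketch under assumptions: `Args`, `parse_count`, and `parse_args` are hypothetical names, the 9-digit cap is an arbitrary guard against overflow, and the validation actually performed in src/main.cpp may be structured differently.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical container for the three command-line arguments.
struct Args {
    std::size_t k;
    std::string input_file;
    std::size_t length;
};

// Parse a non-negative decimal integer; reject anything else.
// The 9-digit limit is an arbitrary choice that keeps the value in range.
std::optional<std::size_t> parse_count(const std::string& s) {
    if (s.empty() || s.size() > 9) return std::nullopt;
    std::size_t value = 0;
    for (char c : s) {
        if (c < '0' || c > '9') return std::nullopt;
        value = value * 10 + static_cast<std::size_t>(c - '0');
    }
    return value;
}

// Validate: exactly three arguments, k >= 1, generation_length >= k.
std::optional<Args> parse_args(int argc, const char* const argv[]) {
    if (argc != 4) return std::nullopt;
    auto k = parse_count(argv[1]);
    auto length = parse_count(argv[3]);
    if (!k || !length) return std::nullopt;
    if (*k < 1 || *length < *k) return std::nullopt;
    return Args{*k, argv[2], *length};
}
```

A `main` function could call `parse_args(argc, argv)` and print the usage line when it returns an empty optional.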
- src/model.hpp/cpp: K-gram model training and sampling
- src/textgen.hpp/cpp: Text generation algorithm
- src/main.cpp: CLI interface and validation
- C++17 compatible compiler (g++, clang++)
- Standard library with <random>, <map>, <vector> support