TextBox is developed based on Python and PyTorch for reproducing and developing text generation algorithms in a unified, comprehensive and efficient framework for research purpose. Our library includes 16 text generation algorithms, covering two major tasks:
- Unconditional (input-free) Generation
- Sequence-to-Sequence (Seq2Seq) Generation, including Machine Translation and Summarization
We provide the support for 6 benchmark text generation datasets. A user can apply our library to process the original data copy, or simply download the processed datasets by our team.
View TextBox homepage for more information: https://github.com/RUCAIBox/TextBox.
View TextBox document for more implement details: https://textbox.readthedocs.io/en/latest/.
- Unified and modularized framework. TextBox is built upon PyTorch and designed to be highly modularized, by decoupling diverse models into a set of highly reusable modules.
- Comprehensive models, benchmark datasets and standardized evaluations. TextBox also contains a wide range of text generation models, covering the categories of VAE, GAN, RNN or Transformer based models, and pre-trained language models (PLM).
- Extensible and flexible framework. TextBox provides convenient interfaces of various common functions or modules in text generation models, RNN encoder-decoder, Transformer encoder-decoder and pre-trained language model.
- Easy and convenient to get started. TextBox provides flexible configuration files, which allows green hands to run experiments without modifying source code, and allows researchers to conduct qualitative analysis by modifying few configurations.
TextBox requires:
-
Python >= 3.6.2 -
torch >= 1.6.0. Please follow the official instructions to install the appropriate version according to your CUDA version and NVIDIA driver version. -
GCC >= 5.1.0
pip install textboxgit clone https://github.com/RUCAIBox/TextBox.git && cd TextBox
pip install -e . --verboseWith the source code, you can use the provided script for initial usage of our library:
python run_textbox.py --model=RNN --dataset=COCO --task_type=unconditionalThis script will run the RNN model on the COCO dataset.
If you want to change the parameters, such as rnn_type, max_vocab_size, just set the additional command parameters as you need:
python run_textbox.py --model=RNN --dataset=COCO --task_type=unconditional \
--rnn_type=lstm --max_vocab_size=4000We also support to modify YAML configuration files in corresponding dataset and model properties folders and include it in the command line.
If you want to change the model, the dataset or the task type, just run the script by modifying corresponding command parameters:
python run_textbox.py --model=[model_name] --dataset=[dataset_name] --task_type=[task_name]model_name is the model to be run, such as RNN and BART.
TextBox covers three major task types of text generation, namely unconditional, translation and summarization.
If TextBox is installed from pip, you can create a new python file, download the dataset, and write and run the following code:
from textbox.quick_start import run_textbox
run_textbox(config_dict={'model': 'RNN',
'dataset': 'COCO',
'data_path': './dataset',
'task_type': 'unconditional'})This will perform the training and test of the RNN model on the COCO dataset.
If you want to run different models, parameters or datasets, the operations are same with Start from source.
TextBox supports to apply part of pretrained language models (PLM) to conduct text generation. Take the GPT-2 for example, we will show you how to use PLMs to fine-tune.
-
Download the GPT-2 model provided from Hugging Face (https://huggingface.co/gpt2/tree/main), including
config.json,merges.txt,pytorch_model.bin,tokenizer.jsonandvocab.json. Then put them in a folder at the same level astextbox, such aspretrained_model/gpt2. -
After downloading, you just need to run the command:
python run_textbox.py --model=GPT2 --dataset=COCO --task_type=unconditional \
--pretrained_model_path=pretrained_model/gpt2TextBox is developed and maintained by AI Box.
TextBox uses MIT License.