Hey. In `finetune.py`, the tokenizer is configured to pad each batch to the length of its longest sequence, rather than to the maximum length across the entire dataset. Thanks.
PROBLEM:
```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding=True,  # pads only to the longest sequence in the current batch
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```
FIX:
```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding="max_length",  # fixed: pad every sequence to MAX_LEN, not just to the batch's longest
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```
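
For reference, here is a minimal standalone sketch of the difference between the two padding modes. The tokenizer name, `MAX_LEN` value, and input strings are stand-ins for illustration, not the actual values from finetune.py:

```python
from transformers import AutoTokenizer

MAX_LEN = 16  # stand-in for the MAX_LEN constant in finetune.py
# Stand-in tokenizer; the real script presumably has its own codon tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = ["ATG GCT TAA", "ATG GCT GCA GGT TAA"]  # hypothetical codon strings

# padding=True: pads only up to the longest sequence in this batch,
# so tensor width varies from batch to batch.
dynamic = tokenizer(batch, truncation=True, padding=True,
                    max_length=MAX_LEN, return_tensors="pt")

# padding="max_length": pads every sequence to MAX_LEN,
# so every batch has the same width.
fixed = tokenizer(batch, truncation=True, padding="max_length",
                  max_length=MAX_LEN, return_tensors="pt")

print(dynamic["input_ids"].shape)  # torch.Size([2, <longest in batch>])
print(fixed["input_ids"].shape)    # torch.Size([2, 16])
```

Note that dynamic padding (`padding=True`) is often preferred for speed, since short batches carry less padding; `padding="max_length"` trades that for identical tensor shapes across all batches, which matters if anything downstream assumes a fixed sequence length.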