Hey. In `finetune.py`, the tokenizer is configured to pad each batch to the length of its longest sequence, rather than to the maximum length across the entire dataset. Thanks.
PROBLEM:
```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding=True,  # pads only to the longest sequence in the current batch
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```
FIX:
```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding="max_length",  # fixed: pad every sequence to MAX_LEN, not just to the batch's longest
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```
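
For reference, here is a minimal standalone sketch of the difference between the two padding modes. The tokenizer name, `MAX_LEN` value, and input strings are stand-ins for illustration, not the actual values from finetune.py:

```python
from transformers import AutoTokenizer

MAX_LEN = 16  # stand-in for the MAX_LEN constant in finetune.py
# Stand-in tokenizer; the real script presumably has its own codon tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = ["ATG GCT TAA", "ATG GCT GCA GGT TAA"]  # hypothetical codon strings

# padding=True: pads only up to the longest sequence in this batch,
# so tensor width varies from batch to batch.
dynamic = tokenizer(batch, truncation=True, padding=True,
                    max_length=MAX_LEN, return_tensors="pt")

# padding="max_length": pads every sequence to MAX_LEN,
# so every batch has the same width.
fixed = tokenizer(batch, truncation=True, padding="max_length",
                  max_length=MAX_LEN, return_tensors="pt")

print(dynamic["input_ids"].shape)  # torch.Size([2, <longest in batch>])
print(fixed["input_ids"].shape)    # torch.Size([2, 16])
```

Note that dynamic padding (`padding=True`) is often preferred for speed, since short batches carry less padding; `padding="max_length"` trades that for identical tensor shapes across all batches, which matters if anything downstream assumes a fixed sequence length.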