Add-training-info-to-paper#18
Open
RashikShahjahan wants to merge 16 commits into
Refactor for our dataset
Update qlora config with unsloth optimizations
* updated file to be same as the one that ran on kaggle
* changed notebook to be same as the one that ran on kaggle
* baseline results raw with prompts and answer history
* results grid
* cleaned up main(), added api uploads to HF, replaced kaggle and colab paths with direct downloads from github, added dict NATIVE_LABEL_MAP to convert answers from native language
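The NATIVE_LABEL_MAP idea mentioned in this commit can be illustrated with a minimal sketch. The actual entries and helper names are not visible on this page, so the map contents, `CANONICAL_LABELS`, and `to_canonical_label` below are assumptions chosen for illustration only.

```python
from typing import Optional

# Assumed canonical label set (XNLI-style); the notebook's real set is not shown here.
CANONICAL_LABELS = ("entailment", "neutral", "contradiction")

# Hypothetical entries: the real NATIVE_LABEL_MAP in the notebooks may differ.
NATIVE_LABEL_MAP = {
    "implicación": "entailment",      # Spanish
    "contradicción": "contradiction",
    "蕴含": "entailment",              # Chinese
    "矛盾": "contradiction",
    "中立": "neutral",
}


def to_canonical_label(answer: str) -> Optional[str]:
    """Map a native-language answer to its English label, or None if unknown."""
    cleaned = answer.strip().casefold()
    if cleaned in CANONICAL_LABELS:
        return cleaned
    return NATIVE_LABEL_MAP.get(cleaned)
```

Returning `None` for unrecognized answers, rather than guessing a label, keeps unparseable model output distinguishable from a wrong prediction when accuracy is computed.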
add training metrics to hf
added missing file path
…#5)

Add reusable evaluation runner (eval_pipeline.py) that loads tiny-aya-base, merges PEFT/QLoRA adapters, and runs XNLI, XStoryCloze, TyDi QA, and MMLU benchmarks with per-benchmark timing. Includes batch mode for evaluating multiple adapters in sequence.

Also adds Kaggle notebook for running finetuned model benchmarks and updates the evaluation README with usage docs and output format.

* WIP: scaffold eval pipelines, load args and benchmarks, TODOs
* Scaffolding and loading evals
* implemented runners for benchmarks
* eval benchmarking notebook
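The per-benchmark timing and batch mode described in this commit could look roughly like the sketch below. It is not the code in eval_pipeline.py: model loading and adapter merging are left out, and `run_benchmarks`, `run_batch`, and the stub benchmark callables are hypothetical names used only to show the control flow.

```python
import time
from typing import Callable, Dict, Iterable

# A benchmark is modeled here as a zero-argument callable returning a score.
Benchmark = Callable[[], float]


def run_benchmarks(benchmarks: Dict[str, Benchmark]) -> Dict[str, Dict[str, float]]:
    """Run each benchmark once, recording its score and wall-clock seconds."""
    results: Dict[str, Dict[str, float]] = {}
    for name, run in benchmarks.items():
        start = time.perf_counter()
        score = run()
        results[name] = {"score": score, "seconds": time.perf_counter() - start}
    return results


def run_batch(
    adapters: Iterable[str],
    make_benchmarks: Callable[[str], Dict[str, Benchmark]],
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Evaluate several adapters in sequence, one result dict per adapter."""
    return {adapter: run_benchmarks(make_benchmarks(adapter)) for adapter in adapters}


if __name__ == "__main__":
    # Stub benchmarks stand in for real XNLI / MMLU runners.
    stubs = lambda adapter: {"xnli": lambda: 0.5, "mmlu": lambda: 0.25}
    print(run_batch(["adapter-a", "adapter-b"], stubs))
```

Keeping the timing in the runner rather than inside each benchmark means every benchmark is measured the same way, whatever it does internally.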
…ing [AYA-180] (#17)

* feat(eval): add English benchmark evaluation for catastrophic forgetting check [AYA-180]

Changes to both baseline and finetuned notebooks:
- Rename main_english() → eval_english_prompts()
- Rename main_language() → eval_native_prompts()
- Add English data loading (MGSM-en, XNLI-en, CSQA-en)
- Add English data eval to eval_english_prompts() (3 extra lines)
- Simplify save_results() to handle any result keys dynamically
- Document file schema in upload section comments

File schema after this change:
- english_prompt_results.json: 12 metrics (zh/es/ur/en data, English prompts)
- native_prompt_results.json: 9 metrics (zh/es/ur data, native prompts)

Note: Re-running existing conditions will produce english_prompt_results.json with 3 additional keys ({mgsm,xnli,csqa}_en_acc). Old results on HF only have the 9 zh/es/ur keys.

Also fixes in baseline notebook:
- Missing `seed = 42` (was undefined, would crash)
- n_samples=5 → None for full evaluation runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

* fix(eval): correct XNLI label extraction and add re-scoring script [AYA-88]

Fix XNLI label extraction in both benchmarking notebooks to use first-line-only parsing, preventing code leakage corruption (e.g. Legesher keywords like تصدیق(entailment) on line 2 overriding actual predictions). Expand native label map with Urdu paraphrases and add case-insensitive matching.

Add standalone rescore_xnli.py script that downloads existing results from HuggingFace, re-applies the corrected extraction, and optionally uploads fixed files back.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

* fix(eval): fix rescore script JSON structure handling and np.random.seed

- rescore_xnli.py: handle flat list structure (data["xnli_zh"] is a list, not a dict with "results" sub-key); fix summary key lookup to use "_acc" suffix matching actual JSON schema
- baseline_benchmarking.ipynb: fix np.seed=42 → np.random.seed(seed), remove duplicate random.seed(seed) call

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

* docs(eval): add detailed context comments to XNLI extraction logic [AYA-88]

Document why each fix was implemented with concrete examples, affected prediction counts, and references to the evaluation-summary analysis. Comments added to all three files: rescore_xnli.py (module docstring + function docstring + inline), baseline and finetuned notebooks (NATIVE_LABEL_MAP header + extract_xnli_label block comment with examples).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

* docs(eval): note rescore_xnli.py is a one-time correction script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

---------

Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
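The first-line-only extraction and the flat-list rescoring described in these commits can be sketched as follows. This is an illustration under stated assumptions, not the code in rescore_xnli.py or the notebooks: the record fields `raw_output` and `gold` are hypothetical, only English labels are matched, and "first line" is taken to mean the first line after leading whitespace is stripped.

```python
from typing import Dict, List, Optional

LABELS = ("entailment", "neutral", "contradiction")


def extract_xnli_label(output: str) -> Optional[str]:
    """Return the label found on the first line only, matched case-insensitively."""
    lines = output.strip().splitlines()
    if not lines:
        return None
    first = lines[0].casefold()
    # If several labels appear on the first line, keep the earliest one.
    hits = [(first.find(label), label) for label in LABELS if label in first]
    return min(hits)[1] if hits else None


def rescore(data: Dict[str, List[dict]]) -> Dict[str, float]:
    """Recompute accuracy per key; each value is a flat list of records."""
    summary: Dict[str, float] = {}
    for key, records in data.items():
        if not records:
            continue
        correct = sum(extract_xnli_label(r["raw_output"]) == r["gold"] for r in records)
        summary[f"{key}_acc"] = correct / len(records)  # "_acc" suffix, as in the commit message
    return summary


if __name__ == "__main__":
    data = {
        "xnli_zh": [
            # Line 2 contains a leaked keyword; only line 1 should count.
            {"raw_output": "Contradiction\nتصدیق(entailment)", "gold": "contradiction"},
            {"raw_output": "neutral", "gold": "entailment"},
        ]
    }
    print(rescore(data))
```

Restricting the match to the first line is what keeps a keyword leaked on a later line from overriding the model's actual answer.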
No description provided.