
Add-training-info-to-paper#18

Open
RashikShahjahan wants to merge 16 commits into add-research-paper-latex from add-training-info-to-paper

Conversation

@RashikShahjahan
Contributor

No description provided.

RashikShahjahan and others added 16 commits March 21, 2026 03:39
Update qlora config with unsloth optimizations
* updated file to be same as the one that ran on kaggle

* changed notebook to be same as the one that ran on kaggle

* baseline results raw with prompts and answer history

* results grid

* cleaned up main(), added API uploads to HF, replaced Kaggle and Colab paths with direct downloads from GitHub, added dict NATIVE_LABEL_MAP to convert answers from the native language
…#5)

Add reusable evaluation runner (eval_pipeline.py) that loads tiny-aya-base,
merges PEFT/QLoRA adapters, and runs XNLI, XStoryCloze, TyDi QA, and MMLU
benchmarks with per-benchmark timing. Includes batch mode for evaluating
multiple adapters in sequence.

Also adds Kaggle notebook for running finetuned model benchmarks and updates
the evaluation README with usage docs and output format.
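
A minimal sketch of the per-benchmark timing and batch mode described above. It is not the code in eval_pipeline.py: `run_benchmarks`, `run_batch`, and the dummy benchmark callables are hypothetical names, and model loading and adapter merging are left out.

```python
import time
from typing import Callable, Dict, Iterable

# Hypothetical signature: a benchmark takes an adapter name and returns a score.
BenchmarkFn = Callable[[str], float]


def run_benchmarks(adapter: str, benchmarks: Dict[str, BenchmarkFn]) -> Dict[str, dict]:
    """Run every benchmark for one adapter, recording score and wall-clock time."""
    results = {}
    for name, fn in benchmarks.items():
        start = time.perf_counter()
        score = fn(adapter)
        results[name] = {"score": score, "seconds": time.perf_counter() - start}
    return results


def run_batch(adapters: Iterable[str], benchmarks: Dict[str, BenchmarkFn]) -> Dict[str, dict]:
    """Batch mode: evaluate several adapters one after another."""
    return {adapter: run_benchmarks(adapter, benchmarks) for adapter in adapters}


if __name__ == "__main__":
    # Dummy benchmarks that always score 0.0; real runners would evaluate a model.
    dummy = {name: (lambda adapter: 0.0) for name in ("xnli", "xstorycloze", "tydiqa", "mmlu")}
    print(run_batch(["adapter-a", "adapter-b"], dummy))
```

Timing each benchmark separately makes it easier to see which one dominates a long batch run.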

* WIP: scaffold eval pipelines, load args and benchmarks, TODOs
* Scaffolding and loading evals
* implemented runners for benchmarks
* eval benchmarking notebook
…ing [AYA-180] (#17)

* feat(eval): add English benchmark evaluation for catastrophic forgetting check [AYA-180]

Changes to both baseline and finetuned notebooks:
- Rename main_english() → eval_english_prompts()
- Rename main_language() → eval_native_prompts()
- Add English data loading (MGSM-en, XNLI-en, CSQA-en)
- Add English data eval to eval_english_prompts() (3 extra lines)
- Simplify save_results() to handle any result keys dynamically
- Document file schema in upload section comments
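
One way a `save_results()` that handles any result keys could look, as a sketch only. The notebooks' real signature and output location are not shown in this PR; the `path` argument and the sorted-key layout are assumptions.

```python
import json
import tempfile
from pathlib import Path


def save_results(results: dict, path: str) -> None:
    """Write every key present in `results`; no hard-coded list of metric names."""
    ordered = {key: results[key] for key in sorted(results)}
    Path(path).write_text(
        json.dumps(ordered, indent=2, ensure_ascii=False), encoding="utf-8"
    )


if __name__ == "__main__":
    # Placeholder values, not real scores.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "english_prompt_results.json"
        save_results({"xnli_en_acc": 0.0, "mgsm_en_acc": 0.0}, str(target))
        print(target.read_text(encoding="utf-8"))
```

Because nothing is keyed by name, adding the three English metrics needs no change to the saving code.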

File schema after this change:
  english_prompt_results.json — 12 metrics (zh/es/ur/en data, English prompts)
  native_prompt_results.json  — 9 metrics (zh/es/ur data, native prompts)

Note: Re-running existing conditions will produce english_prompt_results.json
with 3 additional keys ({mgsm,xnli,csqa}_en_acc). Old results on HF only
have the 9 zh/es/ur keys.
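
The metric counts follow from 3 benchmarks times 4 or 3 data languages. The sketch below enumerates the keys, assuming the zh/es/ur keys use the same `{benchmark}_{lang}_acc` pattern as the English ones named above; `metric_keys` and `missing_keys` are hypothetical helpers.

```python
from itertools import product

BENCHMARKS = ("mgsm", "xnli", "csqa")


def metric_keys(languages):
    """Build one accuracy key per (benchmark, data language) pair."""
    return [f"{bench}_{lang}_acc" for bench, lang in product(BENCHMARKS, languages)]


def missing_keys(results: dict, languages) -> list:
    """List expected keys absent from a results dict, e.g. an old upload."""
    return [key for key in metric_keys(languages) if key not in results]


english_prompt_keys = metric_keys(("zh", "es", "ur", "en"))  # 12 keys
native_prompt_keys = metric_keys(("zh", "es", "ur"))         # 9 keys
new_keys = sorted(set(english_prompt_keys) - set(native_prompt_keys))
```

Running `missing_keys` on an older `english_prompt_results.json` would flag the three `_en_acc` keys that a re-run adds.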

Also fixes in baseline notebook:
- Missing `seed = 42` (was undefined, would crash)
- n_samples=5 → None for full evaluation runs
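
A sketch of the seeding pattern these fixes point to: define `seed` once, then pass it to each generator. The `set_seed` helper is a hypothetical name, and NumPy is treated as optional here only so the sketch runs without it.

```python
import random

try:
    import numpy as np
except ImportError:  # NumPy is optional for this sketch
    np = None

seed = 42  # must exist before anything below refers to it


def set_seed(value: int) -> None:
    """Seed Python's RNG and, if available, NumPy's global RNG."""
    random.seed(value)
    if np is not None:
        # np.random.seed is a function call; it cannot be set by assignment.
        np.random.seed(value)


set_seed(seed)
```

Calling `set_seed(seed)` once at the top of a run keeps any random sampling repeatable between runs.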

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

* fix(eval): correct XNLI label extraction and add re-scoring script [AYA-88]

Fix XNLI label extraction in both benchmarking notebooks to use
first-line-only parsing, preventing code leakage corruption (e.g.
Legesher keywords like تصدیق(entailment) on line 2 overriding actual
predictions). Expand native label map with Urdu paraphrases and add
case-insensitive matching.
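
A sketch of first-line-only, case-insensitive label extraction as described above. This is not the notebooks' `extract_xnli_label`: the map entries below are illustrative assumptions, and the real NATIVE_LABEL_MAP (including its Urdu paraphrases) may differ.

```python
from typing import Optional

ENGLISH_LABELS = ("entailment", "contradiction", "neutral")

# Illustrative entries only; not copied from the notebooks.
NATIVE_LABEL_MAP = {
    "蕴含": "entailment",
    "矛盾": "contradiction",
    "中立": "neutral",
    "implicación": "entailment",
    "contradicción": "contradiction",
}


def extract_xnli_label(output: str) -> Optional[str]:
    """Return the label found on the first line of the model output, if any."""
    lines = output.strip().splitlines()
    if not lines:
        return None
    # Later lines are ignored so leaked text cannot override the prediction.
    first_line = lines[0].casefold()
    for label in ENGLISH_LABELS:
        if label in first_line:
            return label
    for native, label in NATIVE_LABEL_MAP.items():
        if native.casefold() in first_line:
            return label
    return None
```

With this approach an output whose first line is `Neutral` and whose second line contains `تصدیق(entailment)` is scored as `neutral`, since only the first line is inspected.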

Add standalone rescore_xnli.py script that downloads existing results
from HuggingFace, re-applies the corrected extraction, and optionally
uploads fixed files back.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

* fix(eval): fix rescore script JSON structure handling and np.random.seed

- rescore_xnli.py: handle flat list structure (data["xnli_zh"] is a
  list, not a dict with "results" sub-key); fix summary key lookup to
  use "_acc" suffix matching actual JSON schema
- baseline_benchmarking.ipynb: fix np.seed=42 → np.random.seed(seed),
  remove duplicate random.seed(seed) call
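
A sketch of re-scoring over the flat structure described above, where `data["xnli_zh"]` is a list of records and the summary key carries an `_acc` suffix. The record fields `raw_output` and `gold`, and the `rescore` and `first_line_label` helpers, are hypothetical; the real JSON fields are not shown in this PR.

```python
from typing import Callable, Optional


def first_line_label(output: str) -> Optional[str]:
    """Toy extractor: the lower-cased first line of the output, if any."""
    lines = output.strip().splitlines()
    return lines[0].strip().lower() if lines else None


def rescore(data: dict, extract: Callable[[str], Optional[str]], prefix: str = "xnli_") -> dict:
    """Recompute accuracy for each flat list of prediction records."""
    summary = {}
    for key, records in data.items():
        # Skip existing summary values (floats) and unrelated or empty entries.
        if not key.startswith(prefix) or not isinstance(records, list) or not records:
            continue
        correct = sum(extract(rec["raw_output"]) == rec["gold"] for rec in records)
        summary[f"{key}_acc"] = correct / len(records)
    return summary


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = {"xnli_zh": [{"raw_output": "neutral\nentailment", "gold": "neutral"}]}
    print(rescore(sample, first_line_label))
```

Iterating the list directly, instead of looking for a `"results"` sub-key, matches the flat layout the fix describes.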

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

* docs(eval): add detailed context comments to XNLI extraction logic [AYA-88]

Document why each fix was implemented with concrete examples, affected
prediction counts, and references to the evaluation-summary analysis.
Comments added to all three files: rescore_xnli.py (module docstring +
function docstring + inline), baseline and finetuned notebooks (NATIVE_LABEL_MAP
header + extract_xnli_label block comment with examples).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

* docs(eval): note rescore_xnli.py is a one-time correction script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>

---------

Signed-off-by: Madison Edgar <7844510+madiedgar@users.noreply.github.com>
Signed-off-by: Madison (Pfaff) Edgar <7844510+madiedgar@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


3 participants