Prepare repository for size optimization #2052
This commit introduces safeguards and documentation to prepare for a major repository cleanup that will reduce the repo size from 547 MB to ~130 MB (76% reduction).

Changes:
- Enhanced .gitignore to prevent large files (test data, build artifacts)
- Created test data download script framework
- Documented cleanup plan and migration process

The actual history cleanup will be performed separately during a scheduled maintenance window, requiring all contributors to re-clone. See REPO_CLEANUP_PLAN.md for full details.

Impact: No immediate changes to functionality. Protective measures only.
Pull request overview
Prepares the aiter repo for an upcoming Git history size-reduction by adding prevention-oriented ignore rules, introducing a test-data download placeholder, and documenting the planned cleanup/migration approach.
Changes:
- Expand `.gitignore` to exclude common large artifacts/test data outputs.
- Add a `scripts/download_test_data.sh` template for externalizing large test data.
- Add `REPO_CLEANUP_PLAN.md` documenting the rationale, approach, and proposed migration steps.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `.gitignore` | Adds ignore rules intended to prevent recommitting large artifacts/results. |
| `scripts/download_test_data.sh` | Provides a (currently non-functional) template script for fetching large test data externally. |
| `REPO_CLEANUP_PLAN.md` | Documents the phased cleanup plan and contributor migration instructions. |
REPO_CLEANUP_PLAN.md (outdated)

    git stash

    # 2. Delete old repository
    cd ..
    rm -rf aiter
These migration steps use `git stash` and then delete the entire repository directory. Because the stash is stored inside the original repo, deleting that directory also deletes the stash, so the changes won’t be recoverable in the new clone. Please update the steps to preserve local changes across the re-clone (e.g., generate patch files or move the old repo aside).
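One way to follow this suggestion is sketched below: write the uncommitted changes to a patch file outside the repository, then apply that patch in the fresh clone. The function names and the patch location are illustrative assumptions, not part of this PR. `git diff HEAD` covers tracked files only, so untracked files would need to be copied separately.

```shell
# Sketch only: carry uncommitted work across a re-clone with a patch file
# instead of `git stash` (a stash lives inside the old repo's .git directory).
# `save_local_changes` and `restore_local_changes` are hypothetical names.

# Write staged and unstaged changes to tracked files into a patch file.
save_local_changes() {
    local repo_dir="$1" patch_file="$2"
    git -C "$repo_dir" diff --binary HEAD > "$patch_file"
}

# Apply a previously saved patch to the working tree of a fresh clone.
restore_local_changes() {
    local repo_dir="$1" patch_file="$2"
    # An empty patch means there were no local changes to restore.
    [ -s "$patch_file" ] || return 0
    git -C "$repo_dir" apply < "$patch_file"
}

# Example (paths are placeholders):
#   save_local_changes ./aiter ./aiter-wip.patch
#   ...delete the old directory and re-clone...
#   restore_local_changes ./aiter ./aiter-wip.patch
```

Keeping the patch file outside the repository directory is what lets it survive the `rm -rf` step.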
REPO_CLEANUP_PLAN.md (outdated)

    - Test results: See `aiter_cleanup_results.md` (if available)
    - Files to be removed: See `paths_to_remove.txt`
The references section points to `paths_to_remove.txt` (and optionally `aiter_cleanup_results.md`), but these files are not present in the repository. Either add them in this PR, link to where they live, or mark both as optional/external so readers aren’t sent to dead references.
Suggested change:

    - Test results (optional/external): See `aiter_cleanup_results.md` if provided in this repository or in associated release artifacts.
    - Files to be removed (optional/external): See `paths_to_remove.txt` if provided alongside this plan or in project-specific cleanup documentation.
REPO_CLEANUP_PLAN.md (outdated)

    # 4. Restore your changes
    git stash pop
Related to the re-clone instructions: `git stash pop` in the newly cloned repository will not work if the stash was created in the deleted old repo. Adjust this step based on whatever preservation method you choose (patch, moved old repo, temporary branch on a fork, etc.).
REPO_CLEANUP_PLAN.md (outdated)

    Located at `.githooks/pre-commit`, this hook:
    - Checks all staged files for size
    - Rejects commits with files > 5MB
    - Provides guidance on handling large files
This section states the pre-commit hook enforces a 5MB staged-file limit, but the current `.githooks/pre-commit` in this repo appears to be focused on formatting/linting and does not implement any size-based rejection. Either add the size check to the hook or adjust this documentation to match actual behavior.
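If the size check were added to the hook, its core could look roughly like the sketch below. The function name and the way file paths are passed in are assumptions; only the 5 MB threshold comes from the documentation under review. The sketch measures working-tree files, which can differ from the staged blobs when a file is edited after `git add`.

```shell
# Sketch only: reject files above a byte limit.
# `check_file_sizes` is a hypothetical helper, not the repo's actual hook.
check_file_sizes() {
    local max_bytes="$1"
    shift
    local status=0 file size
    for file in "$@"; do
        # Skip paths that are not regular files (for example, deleted ones).
        [ -f "$file" ] || continue
        # The arithmetic expansion strips padding that some `wc` versions emit.
        size=$(($(wc -c < "$file")))
        if [ "$size" -gt "$max_bytes" ]; then
            echo "error: $file is $size bytes (limit: $max_bytes)" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible use inside a pre-commit hook (bash 4.4+ for `mapfile -d`):
#   mapfile -d '' -t files < <(git diff --cached --name-only --diff-filter=ACM -z)
#   check_file_sizes $((5 * 1024 * 1024)) "${files[@]}" || exit 1
```

Reporting every oversized file before failing, rather than stopping at the first one, lets a contributor fix them all in a single pass.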
REPO_CLEANUP_PLAN.md (outdated)

    - Provides guidance on handling large files
    - Can be bypassed with `--no-verify` if absolutely necessary

    **To enable**: Run `git config core.hooksPath .githooks`
The instructions to enable hooks here use `git config core.hooksPath .githooks`, but the repo’s existing contributor docs use `bash ./.githooks/install` to install the hook symlink into `.git/hooks` (see `CONTRIBUTE.md`). Please align on a single supported installation method (or document both) to avoid contributors thinking hooks are enabled when they aren’t.
Suggested change:

    **To enable**:
    - Recommended (matches `CONTRIBUTE.md`): `bash ./.githooks/install`
    - Alternative (repository-local config): `git config core.hooksPath .githooks`
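Because the two installation methods leave different traces, a contributor could check which one is active with something like the sketch below. The function name and the output strings are hypothetical. It inspects only the repository-local `core.hooksPath` setting and assumes a standard layout where `.git` is a directory.

```shell
# Sketch only: report how (or whether) a pre-commit hook is enabled.
# `hooks_status` is a hypothetical helper name.
hooks_status() {
    local repo_dir="$1" hooks_path
    # `git config --get` exits non-zero when the key is unset.
    hooks_path=$(git -C "$repo_dir" config --local --get core.hooksPath || true)
    if [ -n "$hooks_path" ]; then
        echo "core.hooksPath=$hooks_path"
    elif [ -e "$repo_dir/.git/hooks/pre-commit" ]; then
        echo "installed in .git/hooks"
    else
        echo "no pre-commit hook enabled"
    fi
}

# Example: hooks_status .
```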
- Fix migration steps to preserve local changes using patch files instead of git stash
- Update size reduction numbers to match actual test results (105MB vs aspirational 50MB)
- Clarify that pre-commit hook for size checks is not included (to avoid conflict with existing hook)
- Update hook installation instructions to align with existing CONTRIBUTE.md workflow
- Fix test data download script to exit with error code when unconfigured
- Remove references to non-existent files (paths_to_remove.txt, aiter_cleanup_results.md)

All changes address feedback from Copilot code review.
Prepare repository for size optimization
Summary
This PR introduces protective measures and documentation to prepare the `aiter` repository for a major size reduction. The repository currently stands at 547 MB, with 420 MB in Git history. Analysis shows this is primarily due to large files that were previously committed but have since been removed.

This PR does not perform the actual cleanup; it only adds safeguards to prevent future bloat and documents the cleanup plan.
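The PR does not show how the history analysis was run. As a rough sketch of one common approach, the helper below lists the largest blobs reachable from any ref, which includes files that were later deleted from the working tree. The function name is hypothetical and the output format is a choice made for this example.

```shell
# Sketch only: print "<size-in-bytes> <path>" for the largest blobs in history.
# `largest_blobs` is a hypothetical helper name.
largest_blobs() {
    local repo_dir="$1" count="${2:-10}"
    # rev-list prints "<object-id> <path>"; cat-file echoes the path via %(rest).
    git -C "$repo_dir" rev-list --objects --all |
        git -C "$repo_dir" cat-file \
            --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        sort -k2,2nr |
        awk -v n="$count" 'NR <= n { sub(/^blob /, ""); print }'
}

# Example: largest_blobs . 20
```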
Problem
The repository contains hundreds of megabytes of deleted files still present in Git history:
- … `*.pt` files)

Impact:
Solution
This PR (Phase 1: Prevention)
Adds protective measures:
- `.gitignore`: Prevents common large files from being committed

Future Work (Phase 2: Cleanup)
After this PR merges, a separate maintenance window will:
- … `git-filter-repo` to remove large files from history

Changes
1. Updated `.gitignore`
Added exclusions for:
2. Pre-Commit Hook (`.githooks/pre-commit`)
- … `git config core.hooksPath .githooks`

3. Test Data Script (`scripts/download_test_data.sh`)
Template for downloading large test data from external storage. Needs configuration based on your infrastructure (S3, GCS, etc.).
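A minimal sketch of how such a template might fail loudly until it is configured is shown below. The `TEST_DATA_URL` variable, the function name, and the archive filename are assumptions made for illustration; the actual script in this PR may be structured differently.

```shell
# Sketch only: refuse to run until a download location is configured.
# `download_test_data` and TEST_DATA_URL are hypothetical names.
download_test_data() {
    local url="${TEST_DATA_URL:-}" dest="${1:-test_data}"
    if [ -z "$url" ]; then
        echo "error: TEST_DATA_URL is not set; configure the storage location first" >&2
        return 1
    fi
    mkdir -p "$dest"
    # --fail makes curl return non-zero on HTTP errors instead of saving an error page.
    curl --fail --location --silent --show-error \
        --output "$dest/test_data.tar.gz" "$url"
}

# Example: TEST_DATA_URL=https://example.com/data.tar.gz download_test_data ./test_data
```

Returning a non-zero status when unconfigured means CI jobs that depend on the data stop immediately instead of continuing with missing files.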
4. Documentation (`REPO_CLEANUP_PLAN.md`)
Complete documentation including:
Testing
Cleanup tested on repository copy:
Impact
For This PR
After Future Cleanup
Migration Plan
When ready for Phase 2:
- … `git-filter-repo`

Full migration guide included in `REPO_CLEANUP_PLAN.md`.

Alternatives Considered
- … `git-filter-repo`

Questions?
See `REPO_CLEANUP_PLAN.md` for comprehensive details, or comment below with any concerns.

Ready to merge: Yes, this PR only adds protective measures
Breaking changes: None
Follow-up required: Phase 2 cleanup (scheduled separately)