Fix issue to md#181
Conversation
97b763c to
2764dec
Compare
There was a problem hiding this comment.
Pull request overview
This PR implements image format validation for the issue-to-markdown automation workflow to address issue #144. The changes add comprehensive validation to ensure that only valid image files (JPG, JPEG, PNG, GIF, WebP, SVG) are uploaded to the image field and embedded in content fields.
Key Changes:
- Added image format validation with support for 6 image formats (JPG, JPEG, PNG, GIF, WebP, SVG)
- Implemented validation for both dedicated image fields and images embedded in content/description fields
- Added automated error reporting via issue comments when invalid files are detected
- Enhanced workflow to handle validation failures gracefully with proper error messaging
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
.github/workflows/issue-to-pr.yml |
Added error handling with continue-on-error and conditional PR creation based on validation success |
.github/issue-to-md/scripts/issue_to_md.py |
Implemented image validation functions, refactored main logic into helper functions, and added issue commenting for validation errors |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _download_and_save_image(url: str, issue_number: int) -> tuple[str, list[str]]: | ||
| """Download image and save to uploads directory.""" | ||
| errors = [] | ||
| try: | ||
| dprint("Downloading image:", url) | ||
| resp = requests.get(url, timeout=20) | ||
| resp.raise_for_status() | ||
| ctype = (resp.headers.get("Content-Type") or "").split(";")[0].strip() | ||
| dprint(f"Content-Type: {ctype}") | ||
|
|
||
| ext = CONTENT_TYPE_TO_EXT.get(ctype) or mimetypes.guess_extension(ctype) or Path(url).suffix or ".png" | ||
| dprint(f"Determined extension: {ext}") | ||
|
|
||
| if ext.lower() not in ALLOWED_IMAGE_FORMATS: | ||
| error_msg = ( | ||
| f"Invalid image type. Server returned content type '{ctype}' " | ||
| f"(extension: '{ext}'). Allowed formats: {', '.join(sorted(ALLOWED_IMAGE_FORMATS))}" | ||
| ) | ||
| errors.append(error_msg) | ||
| dprint(error_msg) | ||
| return "", errors | ||
|
|
||
| fname = f"{issue_number}_{datetime.now().strftime('%Y%m%d%H%M%S')}{ext}" | ||
| ensure_dirs(UPLOADS_DIR) | ||
| path = UPLOADS_DIR / fname | ||
| path.write_bytes(resp.content) | ||
| rel = f"uploads/{fname}" | ||
| dprint("Saved image to", path, "→", rel) | ||
| return rel, [] | ||
| except Exception as e: | ||
| error_msg = f"Failed to download image from {url}: {str(e)}" | ||
| errors.append(error_msg) | ||
| dprint(error_msg) | ||
| return "", errors |
There was a problem hiding this comment.
The function _download_and_save_image attempts to download images synchronously with a 20-second timeout, but there's no retry logic or handling for transient network failures. If a valid image URL temporarily fails (e.g., network hiccup, rate limiting), the entire workflow fails. Consider adding retry logic with exponential backoff for more robust handling of temporary failures.
| ctype = (resp.headers.get("Content-Type") or "").split(";")[0].strip() | ||
| dprint(f"Content-Type: {ctype}") | ||
|
|
||
| ext = CONTENT_TYPE_TO_EXT.get(ctype) or mimetypes.guess_extension(ctype) or Path(url).suffix or ".png" |
There was a problem hiding this comment.
The extension determination at line 289 has a potential issue where mimetypes.guess_extension() can return extensions not in ALLOWED_IMAGE_FORMATS. For example, for image/jpeg, it might return .jpe or .jfif instead of .jpg. This would then fail validation at line 292. The code should normalize the guessed extension or add all possible JPEG extensions to ALLOWED_IMAGE_FORMATS. Consider mapping the result of mimetypes.guess_extension() through CONTENT_TYPE_TO_EXT first, or normalizing extensions like .jpe and .jfif to .jpg.
| def find_all_images(text: str) -> list[str]: | ||
| """ | ||
| Find all image URLs in markdown or HTML text. | ||
| Returns list of URLs. | ||
| """ | ||
| if not text: | ||
| return [] | ||
| urls = [] | ||
| # Find markdown images | ||
| for match in IMG_MD_RE.finditer(text): | ||
| urls.append(match.group(1)) | ||
| # Find HTML images | ||
| for match in IMG_SRC_RE.finditer(text): | ||
| urls.append(match.group(1)) | ||
| return urls | ||
|
|
There was a problem hiding this comment.
The find_all_images function is defined but never used anywhere in the code. This creates dead code that should either be removed or utilized. If it was intended for future use, consider whether it's needed for the current implementation.
| def find_all_images(text: str) -> list[str]: | |
| """ | |
| Find all image URLs in markdown or HTML text. | |
| Returns list of URLs. | |
| """ | |
| if not text: | |
| return [] | |
| urls = [] | |
| # Find markdown images | |
| for match in IMG_MD_RE.finditer(text): | |
| urls.append(match.group(1)) | |
| # Find HTML images | |
| for match in IMG_SRC_RE.finditer(text): | |
| urls.append(match.group(1)) | |
| return urls |
| if not text: | ||
| return [] | ||
| # GitHub upload URLs pattern | ||
| file_url_pattern = re.compile(r'https?://(?:github\.com/[^/]+/[^/]+/files/\d+/[^)\s]+|github\.com/user-attachments/[^)\s]+)') |
There was a problem hiding this comment.
The regex pattern in find_uploaded_files has an issue with URL matching. It uses [^)\s]+ which stops at whitespace or closing parenthesis, but the regex is not enclosed in any delimiters. This could incorrectly match partial URLs or fail to match complete URLs. Additionally, URLs in markdown are typically in the format  or [text](url), so stopping at ) could truncate the URL. Consider using a more robust pattern like r'https?://(?:github\.com/[^/]+/[^/]+/files/\d+/[^\s)]+|github\.com/user-attachments/[^\s)]+)' and ensure proper testing with various URL formats.
| file_url_pattern = re.compile(r'https?://(?:github\.com/[^/]+/[^/]+/files/\d+/[^)\s]+|github\.com/user-attachments/[^)\s]+)') | |
| file_url_pattern = re.compile( | |
| r'https?://(?:github\.com/[^/]+/[^/]+/files/\d+/[^\s)]+|github\.com/user-attachments/[^\s)]+)' | |
| ) |
| errors.append( | ||
| f"Invalid file type '{ext}' uploaded to image field. " | ||
| f"Only image files are allowed: {', '.join(sorted(ALLOWED_IMAGE_FORMATS))}" |
There was a problem hiding this comment.
The error message at line 268 mentions "uploaded to image field" but this function is also called for content fields (short_description_en, short_description_tr, full_content_en, full_content_tr) as seen in _validate_content_fields. This creates a misleading error message. The error message should be more generic or the function should accept a field name parameter to provide accurate context.
| Returns (is_valid, error_message). | ||
| """ | ||
| # Extract file extension from URL | ||
| path = Path(url.split("?")[0]) # Remove query params |
There was a problem hiding this comment.
The URL parsing in validate_image_format at line 218 uses url.split("?")[0] to remove query parameters, but URLs can also contain fragment identifiers (e.g., #section). If a URL contains a fragment like image.jpg#ref, the fragment will remain in the path and could interfere with extension detection. Consider using urlparse from urllib.parse for more robust URL parsing, or split on both ? and #.
| dprint(error_msg) | ||
| return "", errors | ||
|
|
||
| def download_image_if_present(markdown_or_html: str, validate_only: bool = False) -> tuple[str, list[str]]: |
There was a problem hiding this comment.
The validate_only parameter in download_image_if_present is defined but only checked at line 351-352, after all validation steps have already occurred. This means validation always happens regardless of this flag, and the parameter only controls whether downloading occurs. The parameter name is misleading - consider renaming it to skip_download or removing it entirely since it's always passed as False in the single call site (line 498).
| if ext and ext not in ALLOWED_IMAGE_FORMATS: | ||
| errors.append( | ||
| f"Invalid file type '{ext}' uploaded to image field. " | ||
| f"Only image files are allowed: {', '.join(sorted(ALLOWED_IMAGE_FORMATS))}" | ||
| ) | ||
| dprint(f"Rejected non-image file: {ext}") |
There was a problem hiding this comment.
The validation logic in _validate_file_uploads (lines 257-272) and the dedicated image field validation could reject valid image URLs that lack extensions. For instance, many CDN URLs don't have file extensions in the path (e.g., https://example.com/image?id=123). While line 223 returns True for URLs without extensions, the _validate_file_uploads function at line 266 uses if ext which means it will skip files without extensions rather than validating them. This creates an inconsistency where some paths validate extension-less URLs while others skip them. Consider making the behavior consistent across all validation paths.
| if ext and ext not in ALLOWED_IMAGE_FORMATS: | |
| errors.append( | |
| f"Invalid file type '{ext}' uploaded to image field. " | |
| f"Only image files are allowed: {', '.join(sorted(ALLOWED_IMAGE_FORMATS))}" | |
| ) | |
| dprint(f"Rejected non-image file: {ext}") | |
| if ext: | |
| if ext not in ALLOWED_IMAGE_FORMATS: | |
| errors.append( | |
| f"Invalid file type '{ext}' uploaded to image field. " | |
| f"Only image files are allowed: {', '.join(sorted(ALLOWED_IMAGE_FORMATS))}" | |
| ) | |
| dprint(f"Rejected non-image file: {ext}") |
| dprint(error_msg) | ||
| return "", errors | ||
|
|
||
| fname = f"{issue_number}_{datetime.now().strftime('%Y%m%d%H%M%S')}{ext}" |
There was a problem hiding this comment.
The filename generation at line 301 uses datetime.now() without timezone information, which could lead to issues in distributed systems or when dealing with issues from different timezones. The file uses IST_TZ (Istanbul timezone) elsewhere in the code. Consider using datetime.now(IST_TZ) for consistency, or use UTC with datetime.now(timezone.utc) for better practices in file naming.
| fname = f"{issue_number}_{datetime.now().strftime('%Y%m%d%H%M%S')}{ext}" | |
| fname = f"{issue_number}_{datetime.now(IST_TZ).strftime('%Y%m%d%H%M%S')}{ext}" |
gokceuludogan
left a comment
There was a problem hiding this comment.
Looks good overall 👍
I am not sure we should allow images in fields other than the dedicated image field, especially in the short description. My preference would be to support images only in the dedicated image field and, optionally, the full content body. Allowing images elsewhere increases the risk of layout issues and makes validation harder.
Regarding image constraints: since images are rendered in multiple contexts with different aspect ratios (featured/hero view: ~2.1:1, news card view: ~4:3), it would be helpful to validate image dimensions and clearly warn users.
I have already revised the issue template accordingly (see #180), but adding proper validation would be ideal, especially given the related issue in #176.
|
Having body images or not is not a crucial problem. They can stay. But honestly we don't create news content that requires multiple images. So I want to downplay this full-page news feature more and more. |
gokceuludogan
left a comment
There was a problem hiding this comment.
The current status looks sufficient to me.
Having body images or not is not a crucial problem. They can stay. But honestly we don't create news content that requires multiple images. So I want to downplay this full-page news feature more and more.
I am not entirely sure what your longer-term plan is here, but for validating the current functionality, this seems more than adequate. How do you think we should move forward from this point? @doganulus
|
I see that you have approved the PR, but it hasn't been merged yet, and I am not authorized to do so. |



Issue #144 - Check Image and Content Validity in Issue-to-Md Feature is tried to be solved in this PR.
Below are the issue tests that is conducted on my forked repository:
Below are the Test PRs of the valid Issues.
Test 25 (right) and 26 (left) look like below in the news grid.
