Skip to content

fix(utils): prevent sitemap metadata leak across url without loc#3795

Merged
barjin merged 1 commit into
apify:masterfrom
anxkhn:loop/crawlee__003
Jun 30, 2026
Merged

fix(utils): prevent sitemap metadata leak across url without loc#3795
barjin merged 1 commit into
apify:masterfrom
anxkhn:loop/crawlee__003

Conversation

@anxkhn

@anxkhn anxkhn commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

What

SitemapXmlParser in packages/utils/src/internals/sitemap.ts could leak per-url metadata from a malformed <url> block into the next one, and emitted Invalid Date for unparseable <lastmod> values. This fixes both.

Why

  1. Metadata leak. A <url> with no <loc> is correctly dropped, but onCloseTag only reset the this.url buffer when loc was present. So a loc-less <url> that carried lastmod / priority / changefreq left those values in the buffer, and they bled onto the next emitted url:

    - if (name === 'url' && this.url.loc !== undefined) {
    -     this.push({ type: 'url', ...this.url, loc: this.url.loc } satisfies SitemapItem);
    + if (name === 'url') {
    +     if (this.url.loc !== undefined) {
    +         this.push({ type: 'url', ...this.url, loc: this.url.loc } satisfies SitemapItem);
    +     }
          this.url = {};
      }
  2. Invalid lastmod. this.url.lastmod = new Date(text) was set unconditionally, so junk like <lastmod>not-a-date</lastmod> produced an Invalid Date (its .getTime() is NaN and consumers calling .toISOString() throw). Now lastmod is only set when the parsed date is valid, mirroring how changefreq is already enum-guarded.

    - this.url.lastmod = new Date(text);
    + const lastmod = new Date(text);
    + if (!Number.isNaN(lastmod.getTime())) {
    +     this.url.lastmod = lastmod;
    + }

Well-formed sitemaps are unaffected; the buffer was already cleared for valid urls.

Testing

Added one offline regression test to packages/utils/test/sitemap.test.ts using a { type: 'raw' } urlset (no network): a url with an invalid lastmod, a loc-less url carrying lastmod/priority, then a bare url. It asserts only the two loc'd urls are emitted, the invalid lastmod is dropped, and no lastmod/priority leaks onto the following url. Fails on master, passes with the fix. Full sitemap suite: 30/30. yarn lint and tsc -p packages/utils --noEmit clean.

Note

This mirrors the same metadata-leak fix I opened against the sibling crawlee-python repo (apify/crawlee-python#1992); the TS parser additionally needed the lastmod validity guard. Happy to file a tracking issue if you'd prefer one.

A <url> with no <loc> is dropped, but SitemapXmlParser only reset its url buffer when loc was present, so its lastmod/priority/changefreq bled into the next emitted url. Reset the buffer on every </url>, and only set lastmod when the date is valid so junk values no longer emit Invalid Date.

@barjin barjin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you @anxkhn !

@barjin barjin merged commit 895ed62 into apify:master Jun 30, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants