Releases · GateNLP/ultimate-sitemap-parser

10 Sep 08:28

github-actions

1.6.0

79634e0

1.6.0 Latest

New Features

Added recurse_callback and recurse_list_callback parameters to usp.tree.sitemap_tree_for_homepage to filter which sub-sitemaps are recursed into (#106 by @nicolas-popsize)
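
A minimal sketch of what such filters might look like. The callback signatures are assumptions (a URL or list of URLs, the current recursion level, and the set of parent sitemap URLs) and should be checked against the library documentation; the function names and example.org URLs are hypothetical.

```python
from typing import List, Set
from urllib.parse import urlparse


def only_blog_sitemaps(url: str, recursion_level: int, parent_urls: Set[str]) -> bool:
    """Return True to recurse into a sub-sitemap, False to skip it."""
    return "/blog/" in urlparse(url).path


def first_ten(urls: List[str], recursion_level: int, parent_urls: Set[str]) -> List[str]:
    """Return the subset of sub-sitemap URLs that should be recursed into."""
    return urls[:10]


# Assumed usage (not executed here; needs the library and network access):
# from usp.tree import sitemap_tree_for_homepage
# tree = sitemap_tree_for_homepage(
#     "https://example.org/",
#     recurse_callback=only_blog_sitemaps,
# )
```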

Bug Fixes

If a FileNotFoundError is encountered when cleaning up a sitemap page temporary file, it will now be caught and logged as a warning. (#108)
- This resolves an error which we believe only occurs on Windows in complex environments (e.g. when running the full Pytest suite)
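
The pattern described above can be sketched with the standard library alone. This is an illustration, not the library's code; remove_temp_file is a hypothetical helper name.

```python
import logging
import os
import tempfile

log = logging.getLogger(__name__)


def remove_temp_file(path: str) -> bool:
    """Delete a temporary file, tolerating the case where it is already gone."""
    try:
        os.unlink(path)
        return True
    except FileNotFoundError:
        # Log a warning instead of letting the exception propagate.
        log.warning("Temporary file %s was already removed", path)
        return False
```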

Contributors

nicolas-popsize

11 Aug 10:54

github-actions

1.5.0

e61158e

1.5.0

Bug Fixes

Set separate connection and read timeouts for HTTP requests to reduce the maximum request duration. Instead of 60s for each, the timeouts are now 9.05s for connection and 60s for read. (#95)
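
For context, requests accepts a (connect, read) tuple as its timeout argument, and the read value limits the wait between bytes rather than the whole download. The sketch below is not the library's code; fetch and its session parameter are hypothetical, and session is assumed to behave like requests.Session.

```python
CONNECT_TIMEOUT = 9.05  # seconds allowed to establish the connection
READ_TIMEOUT = 60  # seconds allowed between bytes of the response


def fetch(session, url):
    """Issue a GET with separate connect and read timeouts."""
    # `session` is assumed to expose get(url, timeout=...) like requests.Session.
    return session.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
```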

23 Apr 10:51

github-actions

1.4.0

51d9479

1.4.0

New Features

Support parsing sitemaps when a proper XML namespace is not declared (#87)
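
One way to tolerate a missing namespace is to compare local tag names only. The following is a standard-library sketch of that idea, not the approach the library necessarily takes; local_name and page_urls are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def page_urls(xml_text: str) -> List[str]:
    """Collect <loc> values whether or not the sitemap namespace is declared."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "loc" and el.text
    ]
```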

Bug Fixes

Fix incorrect logic in gunzip behaviour which attempted to gunzip responses that were already gunzipped by requests (#89)
Change log output for gunzip failures to include the URL instead of request response object (#89)
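
A hedged sketch of the general idea: decompress only when the payload still looks gzipped, and otherwise pass it through unchanged. This is not the library's implementation; maybe_gunzip is a hypothetical helper.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(data: bytes) -> bytes:
    """Gunzip data only if it still carries the gzip magic bytes."""
    if not data.startswith(GZIP_MAGIC):
        # Already decompressed (for example by the HTTP client), so leave it alone.
        return data
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        # Looked like gzip but was not valid; fall back to the raw bytes.
        return data
```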

31 Mar 15:30

github-actions

1.3.1

b6cee1a

1.3.1

Bug Fixes

Fixed an issue with temporary file handling, which would cause USP to always crash on Windows (#84)

17 Mar 10:37

github-actions

1.3.0

3eda963

1.3.0

This release drops support for Python 3.8. The minimum supported version is now Python 3.9.

New Features

Recursive sitemaps are detected and will return an InvalidSitemap instead (#74)
Known sitemap paths will be skipped if they redirect to a sitemap already found (#77)
The reported URL of a sitemap will now be its actual URL after redirects (#74)
Log level in CLI can now be changed with the -v or -vv flags, and output to a file with -l (#76)
When fetching known sitemap paths, 404 errors are now logged at a lower level (#78)
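
Recursion detection of this kind can be sketched by tracking the chain of ancestor URLs while walking the tree. The example below uses an in-memory mapping as a stand-in for fetching sitemaps and only reports the offending links; it is an assumption-laden illustration, not the library's logic, and find_recursive_links is a hypothetical name.

```python
from typing import Dict, FrozenSet, List, Tuple


def find_recursive_links(
    url: str,
    children: Dict[str, List[str]],
    ancestors: FrozenSet[str] = frozenset(),
) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs where child is the parent or one of its ancestors."""
    found = []
    lineage = ancestors | {url}
    for child in children.get(url, []):
        if child in lineage:
            # Recursive reference: record it and do not descend.
            found.append((url, child))
        else:
            found.extend(find_recursive_links(child, children, lineage))
    return found
```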

Bug Fixes

Some logging at INFO level has been changed to DEBUG (#76)

API Changes

Added AbstractWebClient.url() method to return the actual URL fetched after redirects. Custom web clients will need to implement this method.

18 Feb 10:25

github-actions

1.2.0

761df8a

1.2.0

New Features

Support passing additional known sitemap paths to usp.tree.sitemap_tree_for_homepage (#69)
The requests web client now creates a session object for better performance, which can be overridden by the user (#70)

Documentation

Added improved documentation for customising the HTTP client.

29 Jan 12:14

github-actions

1.1.1

95b592f

1.1.1

Bug Fixes

Changed log level when a suspected gzipped sitemap can't be un-gzipped from error to warning, since parsing can usually continue (#62 by @redreceipt)
Line references in logs now reference the correct location instead of lines within the logging helper file (#63)

Contributors

redreceipt

20 Jan 14:12

github-actions

1.1.0

ae51209

1.1.0

New Features

Added support for alternate localised pages with hreflang.
If an HTTP error is encountered, the contents of the error page are logged at INFO level.
Added optional configurable wait time to HTTP request client.
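
For readers unfamiliar with the hreflang format, alternate localised pages appear in a sitemap as xhtml:link elements with rel="alternate". The sketch below reads them with the standard library; it does not show the library's own API, and alternates is a hypothetical function.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def alternates(xml_text: str) -> dict:
    """Map each page URL to its (hreflang, href) alternate links."""
    root = ET.fromstring(xml_text)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        result[loc] = [
            (link.get("hreflang"), link.get("href"))
            for link in url.findall("xhtml:link", NS)
            if link.get("rel") == "alternate"
        ]
    return result
```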

13 Jan 11:26

github-actions

1.0.0

91b343e

1.0.0

Ultimate Sitemap Parser is now maintained by the GATE Team at the School of Computer Science, University of Sheffield. We’d like to thank Linas Valiukas and Hal Roberts for their work on this package, and Paige Gulley for coordinating the transfer of the library.

Breaking Changes

Python v3.8 is now the lowest supported version of Python. Future releases will follow Python’s version support.

New Features

CLI tool to parse and list sitemaps on the command line (see CLI Reference)
All sitemap objects now implement a consistent interface, allowing traversal of the tree irrespective of type:
- All sitemaps now have pages and sub_sitemaps properties, returning their children of that type, or an empty list where not applicable
- Added all_sitemaps() method to iterate over all descendant sitemaps
Pickling page sitemaps now includes page data, which previously was not included as it was swapped to disk
Sitemaps and pages now implement to_dict() method to convert to dictionaries (requested in #18)
Added optional arguments to usp.tree.sitemap_tree_for_homepage() to disable robots.txt-based or known-path-based sitemap discovery. Default behaviour is still to use both.
Parse sitemaps from a string with Local Parsing (requested in #26)
Support for the Google Image sitemap extension
Add proxy support with RequestsWebClient.set_proxies() (#20 by @tgrandje)
Add additional sitemap discovery paths for news sitemaps (d3bdaae)
Add parameter to RequestsWebClient.__init__() to disable certificate verification (#37 by @japherwocky)
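
To illustrate the consistent interface mentioned in this list, the sketch below uses a stand-in Node class rather than the library's sitemap classes. Only the attribute and method names (pages, sub_sitemaps, all_sitemaps) come from the notes above; everything else is an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Stand-in for a sitemap object; not the library's class."""

    url: str
    pages: List[str] = field(default_factory=list)
    sub_sitemaps: List["Node"] = field(default_factory=list)

    def all_sitemaps(self) -> Iterator["Node"]:
        """Yield every descendant sitemap, depth first."""
        for sub in self.sub_sitemaps:
            yield sub
            yield from sub.all_sitemaps()
```

Because every node exposes both properties, calling code can walk the tree without checking whether a node is an index or a page sitemap.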

Performance

Improvement of parse performance by approximately 90%
Optimised lookup of page URLs when checking if duplicate
Optimised datetime parse in XML Sitemaps by trying full ISO8601 parsers before the general parser
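
The datetime optimisation follows a common pattern: try a strict, fast ISO 8601 parser first and fall back to a slower, more lenient one only on failure. The sketch below shows that pattern with standard-library parsers as stand-ins; the parsers the library actually uses are not stated here, and parse_lastmod is a hypothetical name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_lastmod(value: str) -> Optional[datetime]:
    """Parse a date string, preferring the fast ISO 8601 path."""
    try:
        return datetime.fromisoformat(value)  # fast path
    except ValueError:
        pass
    try:
        # Stand-in for a slower general-purpose parser (RFC 2822 dates here).
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable input yields None rather than an exception
```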

Bug Fixes

Invalid datetimes will be parsed as None instead of crashing (reported in #22, #31)
Invalid priorities will be set to the default (0.5) instead of crashing
Moved __version__ attribute into main class module
Robots.txt index sitemaps now count for the max recursion depth (reported in #29). The default maximum has been increased by 1 to compensate for this.
Remove log configuration so it can be specified at application level (reported in #25, #24 by @dsoprea/@antonialoytorrens-ikaue)
Resolve warnings caused by http.HTTPStatus usage (3867b6e)
Don’t add InvalidSitemap object if robots.txt is not found (#39 by @gbenson)
Fix incorrect lowercasing of URLs discovered in robots.txt (reported in #40, #35 by @ArthurMelin)
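
The priority fallback can be illustrated as follows. This is a sketch under assumptions, not the library's code: parse_priority is a hypothetical helper, and treating out-of-range values as invalid is an assumption based on the sitemap protocol's 0.0 to 1.0 range.

```python
from decimal import Decimal, InvalidOperation

DEFAULT_PRIORITY = Decimal("0.5")  # default named in the release notes


def parse_priority(raw: str) -> Decimal:
    """Parse a <priority> value, falling back to the default when invalid."""
    try:
        value = Decimal(raw.strip())
    except (InvalidOperation, AttributeError):
        return DEFAULT_PRIORITY
    # Assumption: NaN, infinities and values outside 0..1 count as invalid.
    if not value.is_finite() or not (Decimal(0) <= value <= Decimal(1)):
        return DEFAULT_PRIORITY
    return value
```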

18 Dec 11:44

github-actions

1.0.0rc1

a3b066b

1.0.0rc1 Pre-release

Release 1.0.0rc1
