Closed
Description
Per commoncrawl/cc-pyspark#56, the Common Crawl web graphs preserve the `www.` prefix in host names. A tool that converts a graph including the `www.` prefix into one without it would be useful for a 1:1 comparison of how this change affected the graph structure and derived properties, such as the centrality rankings.
The class HostToDomainGraph already supports two aggregation levels:
- (by default) registered domain - the domain name below the ICANN registry suffix defined by the public suffix list (PSL)
- "private" domain (command-line flag `--private-domains`) - the domain name below any public suffix, including those in the private section of the PSL.
Adding a third aggregation level, "host without `www.`", is simple:
- Implement the stripping of the `www.` prefix following how it was done in cc-pyspark (see "Host-level link extraction: preserve the `www.` prefix in host names", cc-pyspark#57).
- Expose this aggregation level via command-line options.
  - Because two mutually exclusive boolean flags are cumbersome, we might refactor the code to use the option `--aggregation-level` with three supported values: `registered-domain` (default), `private-domain`, `host-without-www`.
  - But ensure backward compatibility of the current options, so that scripts and documentation do not need to be adapted immediately.
- Update the Javadoc and command-line help accordingly. Add a note that the new aggregation level "stretches" the definition of a "domain".
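
The steps above could be sketched roughly as follows. This is only an illustration, not the actual HostToDomainGraph code: the class `AggregationLevelSketch`, the enum `Level`, and the methods `parseLevel` and `stripWww` are hypothetical names. The sketch also assumes plain (non-reversed) host names and assumes that a leading `www.` should only be stripped when the remainder still contains a dot; the exact rule should be checked against cc-pyspark#57.

```java
import java.util.Locale;

/** Hypothetical sketch; none of these names exist in HostToDomainGraph. */
class AggregationLevelSketch {

    /** The three aggregation levels proposed in this issue. */
    enum Level {
        REGISTERED_DOMAIN("registered-domain"),
        PRIVATE_DOMAIN("private-domain"),
        HOST_WITHOUT_WWW("host-without-www");

        final String optionValue;

        Level(String optionValue) {
            this.optionValue = optionValue;
        }

        /** Map a value of --aggregation-level to a level. */
        static Level fromOption(String value) {
            String normalized = value.toLowerCase(Locale.ROOT);
            for (Level level : values()) {
                if (level.optionValue.equals(normalized)) {
                    return level;
                }
            }
            throw new IllegalArgumentException("Unknown aggregation level: " + value);
        }
    }

    /**
     * Resolve the aggregation level from the command-line arguments.
     * The existing flag --private-domains is still accepted for backward
     * compatibility, but must not contradict an explicit --aggregation-level.
     * All other arguments are ignored by this sketch.
     */
    static Level parseLevel(String[] args) {
        Level explicit = null;
        boolean legacyPrivate = false;
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("--private-domains")) {
                legacyPrivate = true;
            } else if (args[i].equals("--aggregation-level")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("--aggregation-level requires a value");
                }
                explicit = Level.fromOption(args[++i]);
            }
        }
        if (legacyPrivate && explicit != null && explicit != Level.PRIVATE_DOMAIN) {
            throw new IllegalArgumentException(
                    "--private-domains conflicts with --aggregation-level " + explicit.optionValue);
        }
        if (explicit != null) {
            return explicit;
        }
        return legacyPrivate ? Level.PRIVATE_DOMAIN : Level.REGISTERED_DOMAIN;
    }

    /**
     * Strip a leading "www." from a non-reversed host name.
     * Assumption: the prefix is kept if the remainder has no dot,
     * so that "www.com" is not reduced to the bare suffix "com".
     */
    static String stripWww(String host) {
        String prefix = "www.";
        if (host.startsWith(prefix)) {
            String rest = host.substring(prefix.length());
            if (rest.indexOf('.') > 0) {
                return rest;
            }
        }
        return host;
    }
}
```

With this shape, existing invocations that pass `--private-domains` (or no flag at all) would keep their current meaning, while `--aggregation-level host-without-www` selects the new level. If the graph input stores host names in reversed notation, the prefix check would have to be applied to the trailing label instead.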