Extend HostToDomainGraph to fold host-level graphs stripping the www. prefix #29

@sebastian-nagel

Per commoncrawl/cc-pyspark#56, the Common Crawl web graphs preserve the www. prefix in host names. A tool that converts a graph including the www. prefix into one without it would be useful for a 1:1 comparison of how this change affects the graph structure and derived properties, such as the centrality rankings.

The class HostToDomainGraph already supports two aggregation levels:

  • (by default) registered domain - the domain name below the ICANN registry suffix defined by the public suffix list (PSL)
  • "private" domain (command-line flag --private-domains) - the domain name below any public suffix including those in the private section of the PSL.

Adding a third aggregation level, "host without www.", is simple:

  1. Implement the stripping of the www. prefix the same way it was done in cc-pyspark (see "Host-level link extraction: preserve the www. prefix in host names", cc-pyspark#57).
  2. Expose this aggregation level via command-line options.
    • Because two mutually exclusive boolean flags are cumbersome, we might refactor the code to use the option --aggregation-level with three supported values: registered-domain (default), private-domain, host-without-www.
    • But keep the current options backward-compatible, so that scripts and documentation do not need to be adapted immediately.
  3. Update the Javadoc and command-line help accordingly. Add a note that the new aggregation level "stretches" the definition of a "domain".
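
To make steps 1 and 2 more concrete, here is a minimal sketch of how the stripping and the option handling might look. It is not taken from the existing code: the class `AggregationLevelSketch`, the enum `AggregationLevel` and the methods `stripWwwPrefix` and `parseArgs` are hypothetical names, the argument parsing is hand-rolled rather than using whatever option parser HostToDomainGraph actually uses, and host names are assumed to be in regular (not reversed) notation. The rule "strip www. only if at least two labels remain" is my assumption of what cc-pyspark does and should be checked against cc-pyspark#57.

```java
public class AggregationLevelSketch {

    /** Hypothetical enum; the option values are the ones proposed in this issue. */
    public enum AggregationLevel {
        REGISTERED_DOMAIN("registered-domain"),
        PRIVATE_DOMAIN("private-domain"),
        HOST_WITHOUT_WWW("host-without-www");

        private final String optionValue;

        AggregationLevel(String optionValue) {
            this.optionValue = optionValue;
        }

        /** Map a command-line value to a level, rejecting unknown values. */
        public static AggregationLevel fromOption(String value) {
            for (AggregationLevel level : values()) {
                if (level.optionValue.equals(value)) {
                    return level;
                }
            }
            throw new IllegalArgumentException("Unknown aggregation level: " + value);
        }
    }

    /**
     * Strip a leading "www." label. Assumption: the prefix is only removed if
     * at least two labels remain, so "www.com" is returned unchanged.
     */
    public static String stripWwwPrefix(String host) {
        if (host.startsWith("www.") && host.indexOf('.', 4) > 4) {
            return host.substring(4);
        }
        return host;
    }

    /**
     * Hand-rolled option handling for illustration only. The legacy flag
     * --private-domains is still accepted for backward compatibility.
     */
    public static AggregationLevel parseArgs(String[] args) {
        AggregationLevel level = AggregationLevel.REGISTERED_DOMAIN; // default
        for (int i = 0; i < args.length; i++) {
            if ("--private-domains".equals(args[i])) {
                level = AggregationLevel.PRIVATE_DOMAIN;
            } else if ("--aggregation-level".equals(args[i]) && i + 1 < args.length) {
                level = AggregationLevel.fromOption(args[++i]);
            }
        }
        return level;
    }
}
```

With this sketch, `--private-domains` and `--aggregation-level private-domain` select the same level, so existing scripts would keep working. If both are given, the later argument wins; whether that or an error is preferable is open.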
