Extend HostToDomainGraph to fold host-level graphs stripping the www. prefix#30

Merged
lfoppiano merged 16 commits into main from feature/add-strip-www-host-folding
Mar 20, 2026

@lfoppiano
Contributor

As specified in #29

@lfoppiano lfoppiano marked this pull request as ready for review March 19, 2026 10:53
@lfoppiano
Contributor Author

@sebastian-nagel thanks for the comments. Just to be sure: in both of the places you pointed out, the variable host should hold the host name in the standard form (www.google.com), because it has already been reversed back from the input, which is in the reversed, SURT-like form (com.google.www). At least, this is what I have understood from the tests.

@sebastian-nagel
Contributor

host should hold the URL in the standard way (www.google.com)

Sorry, yes, of course. Also EffectiveTldFinder needs the unreversed form. That makes the code even simpler.

@lfoppiano
Contributor Author

lfoppiano commented Mar 19, 2026

Yes. AFAIK the variable host carries the unreversed form. So anything that comes after it should work left to right, unless I have overlooked something.

Contributor

@sebastian-nagel sebastian-nagel left a comment

Thanks, @lfoppiano!

Sorry, there was a conceptual misunderstanding: host-without-www is an aggregation level of its own, alongside registered domain and private domain.

The three nodes

0       com.example
1       com.example.www
2       com.example.xyz

are folded to (without counts of "subdomains"):

0       com.example       2
1       com.example.xyz       1
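
The folding in this example can be sketched in a few lines of Java. This is only an illustration and not the code in HostToDomainGraph: the class name StripWwwFolding and the helpers stripWww and fold are hypothetical, the input is assumed to be reversed host names as in the listing above, and edge cases (for example a host that consists only of a public suffix plus www) are ignored.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StripWwwFolding {

    /** Hypothetical helper: drops a trailing ".www" label from a reversed host name. */
    static String stripWww(String reversedHost) {
        String suffix = ".www";
        if (reversedHost.endsWith(suffix)) {
            return reversedHost.substring(0, reversedHost.length() - suffix.length());
        }
        return reversedHost;
    }

    /** Folds reversed host names; maps each folded name to the number of hosts folded into it. */
    static Map<String, Integer> fold(List<String> reversedHosts) {
        // LinkedHashMap keeps first-seen order, so new node ids can be assigned sequentially
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String host : reversedHosts) {
            counts.merge(stripWww(host), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> folded =
                fold(List.of("com.example", "com.example.www", "com.example.xyz"));
        int id = 0;
        for (Map.Entry<String, Integer> e : folded.entrySet()) {
            // columns: new node id, folded name, number of hosts
            System.out.println(id++ + "\t" + e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Note that com.example.xyz is not counted under com.example: only the www. host is folded into it, which matches the counts 2 and 1 in the example.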

I'll update the script src/script/host2domaingraph.sh to work with the option --aggregation-level requiring an option value.

System.err.println(" -c\tcount hosts per domain (additional column in <nodes_out>");
System.err.println(" --private-domains\tconvert to private domains (include suffixes from the");
System.err.println(" --private-domains\t(deprecated - use --aggregation-level)");
System.err.println(" \tconvert to private domains (include suffixes from the");

There are tabs in the whitespace. If the terminal renders tabs with a width of 8, the help text might look skewed:

 --private-domains      (deprecated - use --aggregation-level)
                                        convert to private domains (include suffixes from the
                        PRIVATE domains subdivision of the public suffix list,
                        see https://github.com/publicsuffix/list/wiki/Format#divisions)
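
One way to avoid the skew is to pad the option column with spaces instead of tabs. The following is a rough sketch under that assumption, not the change made in this pull request: the class HelpFormatting, the helper helpLine, and the column width of 22 are all made up for illustration.

```java
public class HelpFormatting {

    /** Hypothetical helper: option name left-aligned in a fixed-width column, then the text. */
    static String helpLine(String option, String description) {
        // %-22s pads with spaces, so the layout does not depend on the terminal's tab width
        return String.format(" %-22s %s", option, description);
    }

    public static void main(String[] args) {
        System.err.println(helpLine("-c", "count hosts per domain"));
        System.err.println(helpLine("--private-domains", "(deprecated - use --aggregation-level)"));
        // continuation lines use an empty option column
        System.err.println(helpLine("", "convert to private domains (include suffixes from the"));
        System.err.println(helpLine("", "PRIVATE domains subdivision of the public suffix list,"));
        System.err.println(helpLine("", "see https://github.com/publicsuffix/list/wiki/Format#divisions)"));
    }
}
```

With space padding, every description starts in the same column regardless of how the terminal expands tabs.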

For testing, you could run:

./src/script/host2domaingraph.sh -h 2>&1 | expand -8

@lfoppiano
Contributor Author

If I have understood correctly, I've made all the required changes.

…ateDomain=true AND stripwww=true, the CLI ensures that this condition never happens
Contributor

@sebastian-nagel sebastian-nagel left a comment

Thanks! Looks good to me. Tested with a small host-level web graph and all three aggregation levels.

Please squash the commits into a smaller number of meaningful commits.

@lfoppiano lfoppiano merged commit e867fe2 into main Mar 20, 2026
3 checks passed
@lfoppiano lfoppiano deleted the feature/add-strip-www-host-folding branch March 20, 2026 10:55