@AyanSinhaMahapatra AyanSinhaMahapatra commented Nov 17, 2025

This PR improves package scan performance by....

References:

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in a uniquely-named feature branch and have no merge conflicts 📁
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
Use multiregex with a cached mapping of regex path patterns to
datafile handlers to detect package datafiles faster.

Reference: #4064
Reference: #4061
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
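
As a rough illustration of the idea in this commit message, the sketch below compiles glob path patterns once and reuses the compiled index to look up handler IDs for a path. It uses only the standard library as a stand-in and is not the multiregex API; `HANDLERS_BY_GLOB`, `build_index` and `match_handlers` are hypothetical names, and the datasource IDs are illustrative values only.

```python
import fnmatch
import re

# Hypothetical glob path patterns mapped to datasource IDs (illustrative values).
HANDLERS_BY_GLOB = {
    "*/package.json": ["npm_package_json"],
    "*/Cargo.toml": ["cargo_toml"],
    "*.gemspec": ["gemspec"],
}


def build_index(handlers_by_glob):
    """Compile each glob once and return a list of (compiled_regex, handler_ids)."""
    return [
        (re.compile(fnmatch.translate(glob)), ids)
        for glob, ids in handlers_by_glob.items()
    ]


# Built once and reused for every scanned path (this is the "cached" part).
INDEX = build_index(HANDLERS_BY_GLOB)


def match_handlers(path, index=INDEX):
    """Return the handler IDs whose path pattern matches ``path``."""
    matched = []
    for regex, ids in index:
        if regex.match(path):
            matched.extend(ids)
    return matched
```

A multi-pattern matcher goes further by skipping regexes whose required literal substrings are absent from the path, which this sketch does not attempt.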
@AyanSinhaMahapatra AyanSinhaMahapatra marked this pull request as draft November 17, 2025 09:49
@AyanSinhaMahapatra AyanSinhaMahapatra changed the title from "Fast package scan" to "Improve package scan performance" Nov 17, 2025
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
@AyanSinhaMahapatra AyanSinhaMahapatra marked this pull request as ready for review November 19, 2025 09:47
Introduce a new option --binary-packages which looks for
package/dependency data in binaries.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
We do not need the license index in a --package-only scan,
as this is designed to be a fast, package-detection-only scan
that skips license detection. Since loading the license index
takes a couple of seconds each time, skipping it makes the
package-only scan much faster.
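
One way to picture this change is a lazily loaded index that is never touched on the package-only path. This is a minimal sketch under assumptions: `get_license_index` and `scan` here are hypothetical stand-ins, not the actual ScanCode functions, and the index content is a placeholder.

```python
from functools import lru_cache

load_calls = []  # records each expensive load, for demonstration only


@lru_cache(maxsize=None)
def get_license_index():
    """Stand-in for an expensive index load; runs at most once per process."""
    load_calls.append("loaded")
    return {"mit": "MIT License"}


def scan(path, package_only=False):
    """Hypothetical scan entry point that loads the index only when needed."""
    result = {"path": path, "packages": []}
    if not package_only:
        # Only the full scan pays the index loading cost.
        index = get_license_index()
        result["licenses"] = sorted(index)
    return result
```

With this shape, calling `scan(path, package_only=True)` never triggers the load, while repeated full scans share a single cached index.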

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
--system-package Scan ``<input>`` for installed system package
databases.

-b, --binary-package Scan <input> for package and dependency related

What about this:

--package-in-exec Scan compiled executable binaries such as ELF, WinPE and Mach-O files, looking for structured package and dependency metadata as found, for example, in Go and Rust binaries.


Or

--package-in-compiled Scan compiled executable binaries such as ELF, WinPE and Mach-O files, looking for structured package and dependency metadata as found, for example, in Go and Rust compiled binaries.


./configure --dev
venv/bin/scancode-reindex-licenses
venv/bin/scancode-cache-package-patterns

What about naming this venv/bin/scancode-reindex-package-patterns to be consistent?



# These handlers are special as they use filetype to
# detect these binaries instead of datafile path patterns

Suggested change
- # detect these binaries instead of datafile path patterns
+ # detect these compiled executable binaries instead of datafile path patterns

# This is the Pickle protocol we use, which was added in Python 3.4.
PICKLE_PROTOCOL = 4

PACKAGE_INDEX_LOCK_TIMEOUT = 60 * 6

360 seconds is a lot. 60 secs should be enough.
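
For context on the constants in this hunk, here is a minimal sketch of writing and reading back a cached index with a pinned pickle protocol. The helper names `dump_index`, `load_index` and `roundtrip` are hypothetical, and the sketch deliberately leaves out the lock file and its timeout.

```python
import os
import pickle
import tempfile

PICKLE_PROTOCOL = 4  # same value as in the hunk above


def dump_index(index, cache_file):
    """Write ``index`` to ``cache_file`` using the pinned pickle protocol."""
    with open(cache_file, "wb") as fn:
        pickle.dump(index, fn, protocol=PICKLE_PROTOCOL)


def load_index(cache_file):
    """Read a previously dumped index back from ``cache_file``."""
    with open(cache_file, "rb") as fn:
        return pickle.load(fn)


def roundtrip(index):
    """Dump and reload ``index`` in a temporary directory (demo helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        cache_file = os.path.join(tmp, "index_cache")
        dump_index(index, cache_file)
        return load_index(cache_file)
```

Pinning the protocol keeps the cache file format stable across the Python versions that support it.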

PACKAGE_INDEX_DIR = 'package_patterns_index'
PACKAGE_INDEX_FILENAME = 'index_cache'
PACKAGE_LOCKFILE_NAME = 'scancode_package_index_lockfile'
PACKAGE_CHECKSUM_FILE = 'scancode_package_index_tree_checksums'

This is not used anymore (also should be dropped from licensing)

Suggested change
- PACKAGE_CHECKSUM_FILE = 'scancode_package_index_tree_checksums'

@pombredanne pombredanne left a comment

Here are some nits for your consideration!

pickle.dump(self, fn, protocol=PICKLE_PROTOCOL)


def get_prematchers_from_glob_pattern(pattern):

Can you add a docstring and a few doctests?
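
To show the kind of docstring and doctests being asked for, here is a sketch of a similar helper. The PR's actual function body is not shown in this thread, so `prematchers_from_glob` is a hypothetical stand-in, and its behavior (returning the lowercased literal fragments of a glob) is an assumption about what a prematcher extractor might do.

```python
import re


def prematchers_from_glob(pattern):
    """
    Return a list of lowercased literal strings that any path matching the
    glob ``pattern`` must contain. Wildcards (``*``, ``?``) and bracket
    expressions split the pattern into literal fragments.

    >>> prematchers_from_glob("*/package.json")
    ['/package.json']
    >>> prematchers_from_glob("*/META-INF/*.MF")
    ['/meta-inf/', '.mf']
    >>> prematchers_from_glob("*")
    []
    """
    # Split on glob metacharacters; what remains are the required literals.
    literals = re.split(r"[*?]|\[[^\]]*\]", pattern)
    return [literal.lower() for literal in literals if literal]


if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

The doctests double as documentation of edge cases, such as a pattern with no literal part at all.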

Return a mapping of regex patterns to datafile handler IDs and
multiregex patterns consisting of regex patterns and prematchers.
"""
handler_by_regex = {}

This should be a defaultdict(list) maybe?
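
A small sketch of what the `defaultdict(list)` suggestion buys when several handlers share one regex. The `pairs` data is hypothetical and only illustrates the shape of the mapping.

```python
from collections import defaultdict

# Hypothetical (regex, datasource_id) pairs; two handlers share one regex.
pairs = [
    (r".*/package\.json$", "npm_package_json"),
    (r".*/Cargo\.toml$", "cargo_toml"),
    (r".*/package\.json$", "another_handler"),
]

handler_by_regex = defaultdict(list)
for regex, datasource_id in pairs:
    # No need to check whether the key exists before appending.
    handler_by_regex[regex].append(datasource_id)
```

A plain dict would need a membership check or `setdefault` to get the same accumulation.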

multiregex patterns consisting of regex patterns and prematchers.
"""
handler_by_regex = {}
multiregex_patterns = []

Suggested change
- multiregex_patterns = []
+ # stores tuples of ....????
+ multiregex_patterns = []


What about instead having a small attrs or dataclass object to use until a last minute conversion?

from dataclasses import dataclass

@dataclass
class AcceleratedPattern:
    regex: str  # regular expression string
    prematchers: list[str]  # list of prematcher strings for this regex
    handler_datasource_id: str  # datasource_id of the handler

Something more or less like that would help avoid building parallel lists until the end.
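
To make the "last minute conversion" concrete, here is a sketch that keeps one object per pattern and only derives the two parallel structures at the end. The field names follow the suggestion above; `to_index_structures` is a hypothetical helper, and the output shapes are an assumption based on the variable names in the diff.

```python
from dataclasses import dataclass, field


@dataclass
class AcceleratedPattern:
    regex: str  # regular expression string
    handler_datasource_id: str  # datasource_id of the handler
    prematchers: list = field(default_factory=list)  # prematcher strings


def to_index_structures(patterns):
    """
    Convert pattern objects into a regex-to-handler-IDs mapping and a list of
    (regex, prematchers) tuples, doing the conversion once at the end.
    """
    handler_by_regex = {}
    multiregex_patterns = []
    for pattern in patterns:
        if pattern.regex not in handler_by_regex:
            # First time this regex is seen: record its prematchers once.
            multiregex_patterns.append((pattern.regex, list(pattern.prematchers)))
        handler_by_regex.setdefault(pattern.regex, []).append(
            pattern.handler_datasource_id
        )
    return handler_by_regex, multiregex_patterns
```

Keeping the regex, prematchers and handler together in one object makes it harder for the two derived structures to drift out of sync.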

handler_by_regex[regex_pattern] = [handler.datasource_id]

for regex in handler_by_regex.keys():
    regex_and_prematcher = (regex, prematchers_by_regex.get(regex, []))

Also in keeping with the idea of using an object above, this could be a namedtuple

from typing import NamedTuple
class RegexPrematchers(NamedTuple):
    regex: str
    prematchers: list[str]

location=location,
application=application,
system=system,
binary=binary,

Maybe name this compiled?
