Improve package scan performance #4606
base: develop
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
Reference: https://github.com/Quantco/multiregex Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
Use multiregex with a cached mapping of regex path patterns to datafile handlers to detect package datafiles faster.
Reference: #4064
Reference: #4061
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
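The idea behind this commit can be sketched with the standard library alone. This is not the multiregex API and not the code in this PR: it is a plain-Python stand-in that assumes a pattern is only tried when at least one of its lowercase "prematcher" substrings occurs in the lowercased path. The glob patterns, handler IDs and the `match_handlers` helper below are illustrative assumptions.

```python
import fnmatch
import re

# Hypothetical (glob pattern, prematchers, handler ID) entries, for illustration only.
PATTERNS = [
    ("*/package.json", ["/package.json"], "npm_package_json"),
    ("*/Cargo.toml", ["/cargo.toml"], "cargo_toml"),
    ("*.pom", [".pom"], "maven_pom"),
]

# Compile each glob once; this compiled list is what a cache would store.
COMPILED = [
    (re.compile(fnmatch.translate(glob)), prematchers, handler_id)
    for glob, prematchers, handler_id in PATTERNS
]


def match_handlers(path):
    """Return the handler IDs whose path pattern matches ``path``."""
    lowered = path.lower()
    matched = []
    for regex, prematchers, handler_id in COMPILED:
        # Cheap substring check first: skip the regex when no prematcher occurs.
        if not any(prematcher in lowered for prematcher in prematchers):
            continue
        if regex.match(path):
            matched.append(handler_id)
    return matched
```

Most paths in a codebase fail the substring check, so the regexes only run on a handful of likely candidates.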
Introduce a new option --binary-packages which looks for package/dependency data in binaries. Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
Force-pushed from a99c4f8 to dde6bc9
We do not need the license index in a --package-only scan, as this is designed to be a fast, package-detection-only scan that skips license detection. Since loading the license index takes a couple of seconds in each case, skipping it makes the package-only scan much faster.
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
| --system-package Scan ``<input>`` for installed system package
| databases.
| -b, --binary-package Scan <input> for package and dependency related
What about this:
--package-in-exec Scan compiled executable binaries such as ELF, WinPE and Mach-O files, looking for structured package and dependency metadata as found, for example, in Go and Rust binaries.
Or
--package-in-compiled Scan compiled executable binaries such as ELF, WinPE and Mach-O files, looking for structured package and dependency metadata as found, for example, in Go and Rust compiled binaries.
| ./configure --dev
| venv/bin/scancode-reindex-licenses
| venv/bin/scancode-cache-package-patterns
What about naming this venv/bin/scancode-reindex-package-patterns to be consistent?
| # These handlers are special as they use filetype to
| # detect these binaries instead of datafile path patterns
Suggested change:
| - # detect these binaries instead of datafile path patterns
| + # detect these compiled executable binaries instead of datafile path patterns
| # This is the Pickle protocol we use, which was added in Python 3.4.
| PICKLE_PROTOCOL = 4
| PACKAGE_INDEX_LOCK_TIMEOUT = 60 * 6
360 seconds is a lot. 60 secs should be enough.
| PACKAGE_INDEX_DIR = 'package_patterns_index'
| PACKAGE_INDEX_FILENAME = 'index_cache'
| PACKAGE_LOCKFILE_NAME = 'scancode_package_index_lockfile'
| PACKAGE_CHECKSUM_FILE = 'scancode_package_index_tree_checksums'
This is not used anymore (also should be dropped from licensing)
Suggested change:
| - PACKAGE_CHECKSUM_FILE = 'scancode_package_index_tree_checksums'
pombredanne left a comment
Here are some nits for your consideration!
| pickle.dump(self, fn, protocol=PICKLE_PROTOCOL)
| def get_prematchers_from_glob_pattern(pattern):
Can you add a docstring and a few doctests?
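As an illustration of what such a docstring with doctests could look like: the body of `get_prematchers_from_glob_pattern` is not shown on this page, so the sketch below assumes it returns the lowercased literal parts of a glob pattern, splitting on `*`, `?` and `[...]` wildcards. That behavior is an assumption, not the PR's implementation.

```python
import re

# Assumption: the only glob wildcards are *, ? and [...] bracket expressions.
_GLOB_WILDCARDS = re.compile(r"\*+|\?|\[[^\]]*\]")


def get_prematchers_from_glob_pattern(pattern):
    """
    Return a list of lowercased literal substrings that any path matching
    the glob ``pattern`` must contain, for use as cheap prematchers.

    >>> get_prematchers_from_glob_pattern("*/package.json")
    ['/package.json']
    >>> get_prematchers_from_glob_pattern("*/META-INF/*.pom")
    ['/meta-inf/', '.pom']
    >>> get_prematchers_from_glob_pattern("*")
    []
    """
    # Split on wildcards and keep only the non-empty literal parts.
    return [part.lower() for part in _GLOB_WILDCARDS.split(pattern) if part]


if __name__ == "__main__":
    import doctest

    doctest.testmod()
```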
| Return a mapping of regex patterns to datafile handler IDs and
| multiregex patterns consisting of regex patterns and prematchers.
| """
| handler_by_regex = {}
This should be a defaultdict(list) maybe?
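A short sketch of the `defaultdict(list)` suggestion, using made-up regex patterns and handler IDs: several handlers can share one regex, and `defaultdict` removes the need to check whether a key already exists before appending.

```python
from collections import defaultdict

# Hypothetical (regex pattern, datasource ID) pairs, for illustration only.
pairs = [
    (r".*/package\.json$", "npm_package_json"),
    (r".*\.pom$", "maven_pom"),
    (r".*\.pom$", "maven_pom_alternate"),
]

handler_by_regex = defaultdict(list)
for regex_pattern, datasource_id in pairs:
    # A missing key is created as an empty list on first access.
    handler_by_regex[regex_pattern].append(datasource_id)
```

With a plain dict, assigning `[handler.datasource_id]` each time would overwrite earlier handlers that share the same regex.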
| multiregex patterns consisting of regex patterns and prematchers.
| """
| handler_by_regex = {}
| multiregex_patterns = []
Suggested change:
| - multiregex_patterns = []
| + # stores tuples of ....????
| + multiregex_patterns = []
What about instead having a small attrs or dataclass object to use until a last-minute conversion?

    from dataclasses import dataclass

    @dataclass
    class AcceleratedPattern:
        regex: str  # regular expression string
        prematchers: list[str]  # list of prematcher strings for this regex
        handler_datasource_id: str  # handler

Something more or less like that would help avoid creating parallel lists until the end.
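To make the "last-minute conversion" concrete, here is a hedged sketch built on the reviewer's `AcceleratedPattern` idea. The `to_multiregex_inputs` helper and the sample values are hypothetical; the output shapes follow the `handler_by_regex` mapping and `(regex, prematchers)` tuples that appear in the diff.

```python
from dataclasses import dataclass, field


@dataclass
class AcceleratedPattern:
    regex: str  # regular expression string
    prematchers: list[str] = field(default_factory=list)  # prematcher strings
    handler_datasource_id: str = ""  # ID of the datafile handler


def to_multiregex_inputs(patterns):
    """
    Convert AcceleratedPattern objects, at the last minute, into a mapping
    of regex -> handler IDs and a list of (regex, prematchers) tuples.
    """
    handler_by_regex = {}
    prematchers_by_regex = {}
    for pattern in patterns:
        handler_by_regex.setdefault(pattern.regex, []).append(
            pattern.handler_datasource_id
        )
        # The first prematchers seen for a regex win; duplicates are ignored.
        prematchers_by_regex.setdefault(pattern.regex, pattern.prematchers)
    multiregex_patterns = [
        (regex, prematchers_by_regex[regex]) for regex in handler_by_regex
    ]
    return handler_by_regex, multiregex_patterns
```

Everything upstream of the conversion works with one list of objects, so the regex, its prematchers and its handler cannot drift out of sync.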
| handler_by_regex[regex_pattern]= [handler.datasource_id]
| for regex in handler_by_regex.keys():
| regex_and_prematcher = (regex, prematchers_by_regex.get(regex, []))
Also, in keeping with the idea of using an object above, this could be a namedtuple:

    from typing import NamedTuple

    class RegexPrematchers(NamedTuple):
        regex: str
        prematchers: list[str]

| location=location,
| application=application,
| system=system,
| binary=binary,
Maybe compiled?
This PR improves package scan performance by....
References:
Tasks
Run tests locally to check for errors.