Improve pattern matching #133

edoardottt · 2023-10-16T09:35:19Z

edoardottt
Oct 16, 2023
Maintainer

I came across huge matches like:

[
  {
    "name":"PHP error",
    "match":"PHP error"
  },
  {
    "name":"MySQL error",
    "match":"warning_forbid_default_priv<MORE THAN 20000 LINES HERE>"
  }
]

which completely destroys my terminal 😄

So we might think about one or more of:

truncating the output when a regex matches, and maybe adding a CLI flag / env variable to control the truncation character limit
improving the regexes so that they don't match too much but only a few lines before / after the hit (maybe up to the next newline, though I'm not sure how that would work for e.g. Python tracebacks) -> I'm sure we can find a way to do better ;) Adding regex matching tests would probably help
adding which regex matched to the output - for instance, MySQL error is made up of multiple regexes, and it would help to know which one of them matched
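
A minimal sketch of the truncation idea in Go, under stated assumptions: the `CARIDDI_MAX_MATCH_LEN` variable, the default of 200 and the helper names are all hypothetical and not existing cariddi options. The limit is counted in runes so that multi-byte characters are never cut in half.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultMaxMatchLen is an arbitrary illustrative default, not a cariddi setting.
const defaultMaxMatchLen = 200

// truncateMatch shortens s to at most limit runes and appends a marker
// when something was cut. A limit <= 0 disables truncation.
func truncateMatch(s string, limit int) string {
	if limit <= 0 {
		return s
	}
	runes := []rune(s)
	if len(runes) <= limit {
		return s
	}
	return string(runes[:limit]) + "<truncated_output>"
}

// maxMatchLenFromEnv reads the hypothetical CARIDDI_MAX_MATCH_LEN variable
// and falls back to the default when it is unset or not a number.
func maxMatchLenFromEnv() int {
	v, err := strconv.Atoi(os.Getenv("CARIDDI_MAX_MATCH_LEN"))
	if err != nil {
		return defaultMaxMatchLen
	}
	return v
}

func main() {
	limit := maxMatchLenFromEnv()
	fmt.Println(truncateMatch("warning_forbid_default_priv and a lot more text", limit))
}
```

Truncating at output time keeps the regexes themselves untouched, so it could be combined with tighter regexes later.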

We could end up with a JSON format like:

[
  {
     "name": "MySQL Error",
     "results": [
        {
           "type": "Regex",
           "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "regex": "(?i)Warning.*?mysqli?", "location": "line 42", "source": "body"}
        }
     ]
   }
]
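
One possible way to produce this layout in Go is sketched below. The struct names, the `findFirst` helper and the second pattern in `mysqlPatterns` are assumptions made for illustration; only the first pattern and the field names come from the example above. The line number is derived by counting newlines before the match offset.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// Details, Result and Finding mirror the JSON layout proposed above.
type Details struct {
	Match    string `json:"match"`
	Regex    string `json:"regex"`
	Location string `json:"location"`
	Source   string `json:"source"`
}

type Result struct {
	Type    string  `json:"type"`
	Details Details `json:"details"`
}

type Finding struct {
	Name    string   `json:"name"`
	Results []Result `json:"results"`
}

// mysqlPatterns is illustrative: the first pattern is the one quoted above,
// the second is a made-up companion, not cariddi's actual list.
var mysqlPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)Warning.*?mysqli?`),
	regexp.MustCompile(`(?i)You have an error in your SQL syntax`),
}

// findFirst tries each pattern in order and reports which one matched,
// together with the matched text and its 1-based line number.
func findFirst(name, source, body string, patterns []*regexp.Regexp) (Finding, bool) {
	for _, re := range patterns {
		loc := re.FindStringIndex(body)
		if loc == nil {
			continue
		}
		line := 1 + strings.Count(body[:loc[0]], "\n")
		return Finding{
			Name: name,
			Results: []Result{{
				Type: "Regex",
				Details: Details{
					Match:    body[loc[0]:loc[1]],
					Regex:    re.String(),
					Location: fmt.Sprintf("line %d", line),
					Source:   source,
				},
			}},
		}, true
	}
	return Finding{}, false
}

func main() {
	body := "<html>\nWarning: mysqli_connect() failed\n</html>"
	if f, ok := findFirst("MySQL Error", "body", body, mysqlPatterns); ok {
		out, err := json.MarshalIndent([]Finding{f}, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```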

Additionally, regexes have their limits - ideally we want to go one step further and create some kind of pattern-recognition algorithms, or even use ML for this kind of task. It could be a good evolution for cariddi ;) The type key would be useful in that case to differentiate those matches from regex matches:

[
  {
    "type": "PatternFinder",
    "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "matcher": "error-finder", "version": "2.0.1"}
  },
  {
    "type": "ML",
    "details": {"model_name": "my-awesome-ml-model", "version": "0.0.1"}
  }
]
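
As a rough sketch of how the type key could be filled in, the Go code below puts a regex matcher and a non-regex matcher behind one interface. Everything here is hypothetical: the `Matcher` interface, the `traceback-finder` name and the heuristic itself (take the Python traceback header, the indented frame lines after it, and the final exception line) are assumptions, not an existing cariddi design.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// Result mirrors one entry of the JSON above.
type Result struct {
	Type    string            `json:"type"`
	Details map[string]string `json:"details"`
}

// Matcher is a hypothetical common interface for all kinds of matchers.
type Matcher interface {
	Find(body string) (Result, bool)
}

type regexMatcher struct{ re *regexp.Regexp }

func (m regexMatcher) Find(body string) (Result, bool) {
	s := m.re.FindString(body)
	if s == "" {
		return Result{}, false
	}
	return Result{Type: "Regex", Details: map[string]string{"match": s, "regex": m.re.String()}}, true
}

const tracebackHeader = "Traceback (most recent call last):"

// tracebackFinder is a toy non-regex matcher for Python tracebacks.
type tracebackFinder struct{}

func (tracebackFinder) Find(body string) (Result, bool) {
	start := strings.Index(body, tracebackHeader)
	if start < 0 {
		return Result{}, false
	}
	lines := strings.Split(body[start:], "\n")
	end := 1
	// Frame lines of a traceback are indented; stop at the first one that is not.
	for end < len(lines) && strings.HasPrefix(lines[end], " ") {
		end++
	}
	if end < len(lines) {
		end++ // include the final "ExceptionType: message" line
	}
	match := strings.Join(lines[:end], "\n")
	return Result{Type: "PatternFinder", Details: map[string]string{"match": match, "matcher": "traceback-finder"}}, true
}

// defaultMatchers is an illustrative set, not cariddi's real configuration.
var defaultMatchers = []Matcher{
	regexMatcher{re: regexp.MustCompile(`(?i)Warning.*?mysqli?`)},
	tracebackFinder{},
}

// scan runs every matcher and collects the results in matcher order.
func scan(body string, matchers []Matcher) []Result {
	var results []Result
	for _, m := range matchers {
		if r, ok := m.Find(body); ok {
			results = append(results, r)
		}
	}
	return results
}

func main() {
	body := "Warning: mysqli failed\n" + tracebackHeader +
		"\n  File \"app.py\", line 3, in <module>\n    1/0\nZeroDivisionError: division by zero\n"
	out, err := json.MarshalIndent(scan(body, defaultMatchers), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A bounded matcher like this is also one way to handle tracebacks without a greedy multi-line regex.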

There is also room to improve the findings by filtering them according to whether they are important or not, for instance:

an HTML comment containing "TODO / DO THIS LATER / PASSWORD / etc..." is important
an HTML comment containing a software version is important
an HTML comment like "" is not important
an email starting with licensing@<domain> or sales@<domain> is very common and not very sensitive
an error / exception with an actual traceback is very sensitive
etc...

Those "rules" could first be hardcoded by us on a case-by-case basis and then learned by ML at some point, and a severity field could be set for each finding.
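
The hardcoded first step could look roughly like the Go sketch below. The severity levels, the finding kinds (`html-comment`, `email`, `error`), the version regex and the mapping of "important" to `high` are all assumptions chosen for illustration; only the example rules themselves come from the list above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Severity is a hypothetical three-level scale.
type Severity string

const (
	Low    Severity = "low"
	Medium Severity = "medium"
	High   Severity = "high"
)

var (
	// Keywords that make an HTML comment worth reporting.
	commentKeywords = []string{"TODO", "DO THIS LATER", "PASSWORD"}
	// Very rough version detector such as "5.8.1" or "v2.3" (an assumption).
	versionRe = regexp.MustCompile(`\bv?\d+\.\d+(\.\d+)?\b`)
	// Mailbox prefixes considered common and not very sensitive.
	commonMailboxes = []string{"licensing@", "sales@"}
)

// classify applies the hardcoded rules to one finding.
func classify(kind, value string) Severity {
	switch kind {
	case "html-comment":
		upper := strings.ToUpper(value)
		for _, kw := range commentKeywords {
			if strings.Contains(upper, kw) {
				return High
			}
		}
		if versionRe.MatchString(value) {
			return High
		}
		return Low
	case "email":
		lower := strings.ToLower(value)
		for _, prefix := range commonMailboxes {
			if strings.HasPrefix(lower, prefix) {
				return Low
			}
		}
		return Medium
	case "error":
		// An actual traceback is treated as very sensitive.
		if strings.Contains(value, "Traceback (most recent call last):") {
			return High
		}
		return Medium
	}
	return Medium
}

func main() {
	fmt.Println(classify("html-comment", "<!-- TODO: remove debug login -->"))
	fmt.Println(classify("email", "sales@example.com"))
}
```

Keeping the rules in plain data (keyword and prefix lists) would make it easier to replace them later with learned rules.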

There might be a need to create separate issues for some of those points, since they're not directly linked to the JSON lines aggregation. Feel free to copy-paste some of my comments there.
