Cognee HotPotQA evaluation #1733

LStromann · 2025-11-04T15:00:51Z

LStromann
Nov 4, 2025
Collaborator

=Hi all, I'm trying to replicate the results in the Cognee paper (https://arxiv.org/pdf/2505.24478) on HotPotQA there is an evaluation script in the repo (https://github.com/topoteretes/cognee/tree/main/evals), but when I run it, I get significantly worse results. I'm guessing I need to apply the proper hyperparameters, but they are not stated anywhere. Does someone know if there is an issue with the evaluation script in the repo, or where are the hyperparameters? Thanks for the help.

This discussion was automatically pulled from Discord.

LStromann · 2025-11-04T15:01:03Z

LStromann
Nov 4, 2025
Collaborator Author

Hi Aleksis, one of the co-authors of the paper here. Thanks for taking an interest in our work! Really cool to see someone taking a deep, hands-on dive into the paper.

0 replies

LStromann · 2025-11-04T15:01:19Z

LStromann
Nov 4, 2025
Collaborator Author

Since you showed interest (and I love talking about it 🤓 ), here’s a bit of background: the experiments in the paper used an internal tool called Dreamify. As mentioned in the paper, Dreamify treated Cognee and its evaluation framework as a black box and ran hyperparameter optimization using TPE. The setup had a custom wrapper around an older version of the evaluation scripts you’re referring to.

The scripts that are now in the evals folder come from some later work. We made those public mainly to support repeated runs and to analyze distributions of evaluation results. We haven’t published the newer evaluation setup yet.

So you’re right: to reproduce the exact results from the paper, you’d need:

The older version of the evaluation script and how it was used in Dreamify
The hyperparameter values found through optimization
The version of Cognee we were using at the time

The first two were never released publicly and remain proprietary, though they influenced a lot of what we’ve built since then.

0 replies

LStromann · 2025-11-04T15:01:42Z

LStromann
Nov 4, 2025
Collaborator Author

Can you clarify what exactly you’re trying to reproduce?
Keep in mind that Cognee has evolved quite a bit since then! To be honest, I think reproducing the exact numbers from the paper without the proper hyperparameter optimization setup has limited value at this point. As we and others have discussed, multihop benchmarks like HotPotQA have well-known limitations. On top of that, they are not a great proxy for what our users’ AI memory needs are, so we don’t fine-tune Cognee’s defaults to perform best on any single benchmark.
If your goal is to experiment with the evaluation framework, I’d suggest trying a simple random search over hyperparameters. That’s something you can easily parallelize, and it might give you a good sense of how the current version of Cognee performs on HotPotQA or any of the other benchmarks, for that matter.

0 replies

LStromann · 2025-11-04T15:02:11Z

LStromann
Nov 4, 2025
Collaborator Author

Okay, that was a lot of info for you 😅 . Hope it helps. Let me know what you think about it, and please keep us updated on your Cognee eval research!

0 replies

LStromann · 2025-11-04T15:02:46Z

LStromann
Nov 4, 2025
Collaborator Author

Yes, thanks a lot. I'm trying to add Ontotext GraphDB as an option for a graph database. I want to reproduce the benchmark results with the standard implementation. After that, I want to reproduce them with Ontotext GraphDB to see that my implementation matches the result. And finally, I want to try to add some of the features of Ontotext GraphDB like ontology support, reasoning and sparql to see if I can even improve the results

0 replies

LStromann · 2025-11-04T15:03:26Z

LStromann
Nov 4, 2025
Collaborator Author

Could you maybe proved me the hyperparameters with which you achieved the best result also in the paper it's stated that you did 50 trials for each experiment is the final result the best or the average

0 replies

LStromann · 2025-11-04T15:04:13Z

LStromann
Nov 4, 2025
Collaborator Author

Thanks for the clarification! Good luck with the Ontotext GraphDB adapter development. We know how interesting (and challenging!) developing those can be. Please let us know how it performs once you’ve got it running, if you can! We have a community repo where you can share it if you’d like. And if you ever need deeper help building it, feel free to reach out so we can discuss what a funded collaboration might look like.

0 replies

LStromann · 2025-11-04T15:05:05Z

LStromann
Nov 4, 2025
Collaborator Author

Keep an eye on our updates , we’ll be sharing some of what’s currently proprietary in the future. For now, we can’t release additional details, but we’ll make sure the community hears about it once we can!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Topoteretes

Cognee HotPotQA evaluation #1733

Uh oh!

{{title}}

Uh oh!

Replies: 8 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Topoteretes

Cognee HotPotQA evaluation #1733

Uh oh!

LStromann Nov 4, 2025 Collaborator

Replies: 8 comments

Uh oh!

LStromann Nov 4, 2025 Collaborator Author

Uh oh!

LStromann Nov 4, 2025 Collaborator Author

Uh oh!

LStromann Nov 4, 2025 Collaborator Author

Uh oh!

LStromann Nov 4, 2025 Collaborator Author

Uh oh!

LStromann Nov 4, 2025 Collaborator Author

Uh oh!

LStromann Nov 4, 2025 Collaborator Author

Uh oh!

LStromann Nov 4, 2025 Collaborator Author

Uh oh!

LStromann Nov 4, 2025 Collaborator Author

LStromann
Nov 4, 2025
Collaborator

LStromann
Nov 4, 2025
Collaborator Author

LStromann
Nov 4, 2025
Collaborator Author

LStromann
Nov 4, 2025
Collaborator Author

LStromann
Nov 4, 2025
Collaborator Author

LStromann
Nov 4, 2025
Collaborator Author

LStromann
Nov 4, 2025
Collaborator Author

LStromann
Nov 4, 2025
Collaborator Author

LStromann
Nov 4, 2025
Collaborator Author