jfarmer (2mo ago)

Is it possible to score a full dataset run?

I'm wondering whether there is a best practice for evaluations that span multiple traces. My primary use case is running a prompt against a full dataset and evaluating the aggregate precision, recall, F1, etc. Right now I can score each dataset item, but I haven't found a good way to surface metrics in the UI that cover the full run. The alternative I've tested is wrapping the full run in a single trace and scoring that, but it seems a bit hacky.
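For illustration, the run-level metrics in question could be computed outside the platform from the per-item results, roughly as below. This is only a minimal sketch assuming a binary classification task; `ItemResult` and `run_level_metrics` are hypothetical names, and nothing here calls the Langfuse SDK.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    """One dataset item: the expected label and what the prompt predicted (hypothetical type)."""
    expected: bool
    predicted: bool


def run_level_metrics(results):
    """Aggregate precision, recall, and F1 over every item in a run."""
    tp = sum(r.expected and r.predicted for r in results)
    fp = sum((not r.expected) and r.predicted for r in results)
    fn = sum(r.expected and (not r.predicted) for r in results)

    # Guard against empty denominators so an all-negative run does not crash.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example data: 2 true positives, 1 false positive, 1 false negative.
results = [
    ItemResult(expected=True, predicted=True),
    ItemResult(expected=True, predicted=False),
    ItemResult(expected=False, predicted=True),
    ItemResult(expected=False, predicted=False),
    ItemResult(expected=True, predicted=True),
]
metrics = run_level_metrics(results)
```

These numbers depend on counts across the whole run, so they cannot be recovered by averaging per-item scores, which is why a run-level place to attach them is needed.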
Solution
Marc (2mo ago)
Can you add your +1 to this idea post that tracks this feature? https://github.com/orgs/langfuse/discussions/2511 Currently only averages are nicely supported, but we plan to look into run-level scores.
GitHub
Scoring dataset runs, e.g. precision, recall, f-value · langfuse · ...
Describe the feature or potential improvement In LangSmith, for example, there is a feature to get precision, recall, f-value in experiment feature on a dataset. https://docs.smith.langchain.com/ho...
jfarmer (2mo ago)
Done! Thanks for the quick response
Marc (2mo ago)
thank you! sure, happy to help