PP•3w ago


One suggestion for enhancing the Dataset Run feature involves improving prompt performance analysis on test cases within datasets.

- It would be beneficial to have a way to easily filter and identify mistakes from previous dataset runs. For instance, if we're using an LLM for text classification, it would be valuable to know the specific types of errors the model is making. Currently, there is no filtering UI available for this purpose in the Dataset Run feature.
- Additionally, while scoring is currently possible at the tracing level, it would be highly advantageous to enable scoring at the dataset run level. For example, in a text classification task with multiple classes, having precision and recall metrics for each class would help pinpoint where errors are occurring. Achieving this level of insight is challenging when scoring is only supported at the tracing level.
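To make the request concrete, here is a minimal sketch of the kind of run-level analysis described above. It assumes the results of a dataset run have already been exported as plain `(expected, predicted)` label pairs; the `run_results` data and the helper names `per_class_metrics` and `confusions` are hypothetical and are not part of the Langfuse API.

```python
from collections import Counter


def per_class_metrics(pairs):
    """Compute precision and recall per class from (expected, predicted) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, predicted in pairs:
        if expected == predicted:
            tp[expected] += 1
        else:
            fp[predicted] += 1  # predicted this class, but it was wrong
            fn[expected] += 1   # missed the class that was expected

    metrics = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / predicted_total if predicted_total else 0.0,
            "recall": tp[label] / actual_total if actual_total else 0.0,
        }
    return metrics


def confusions(pairs):
    """Count each (expected, predicted) error type, most frequent first."""
    return Counter((e, p) for e, p in pairs if e != p).most_common()


# Hypothetical results exported from one dataset run (illustrative data only).
run_results = [
    ("billing", "billing"),
    ("billing", "support"),
    ("support", "support"),
    ("support", "support"),
    ("sales", "billing"),
    ("sales", "sales"),
]

for label, scores in per_class_metrics(run_results).items():
    print(label, scores)
for (expected, predicted), count in confusions(run_results):
    print(f"expected={expected} predicted={predicted} count={count}")
```

Grouping errors by `(expected, predicted)` pair is one possible answer to "which types of errors is the model making", and the per-class numbers are what a run-level score could aggregate from the individual trace-level results.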


Marc•3w ago

Hi @PP, thanks for sharing!

> It would be beneficial to have a way to easily filter and identify mistakes from previous dataset runs. For instance, if we're using an LLM for text classification, it would be valuable to know the specific types of errors the model is making. Currently, there is no filtering UI available for this purpose in the Dataset Run feature.

Which kinds of filters would be useful here?

> Additionally, while scoring is currently possible at the tracing level, it would be highly advantageous to enable scoring at the dataset run level. For example, in a text classification task with multiple classes, having precision and recall metrics for each class would help pinpoint where errors are occurring. Achieving this level of insight is challenging when scoring is only supported at the tracing level.

Agreed! Please add your upvote and any thoughts you might have here: https://github.com/orgs/langfuse/discussions/2511

GitHub

Scoring dataset runs, e.g. precision, recall, f-value · langfuse · ...

Describe the feature or potential improvement: In LangSmith, for example, there is a feature to get precision, recall, f-value in experiment feature on a dataset. https://docs.smith.langchain.com/ho...