This tutorial shows how to classify documents as relevant or irrelevant to queries using benchmark datasets with ground-truth labels.

Key Points:
  • Download and prepare benchmark datasets for relevance classification
  • Compare LLMs (GPT-4, GPT-3.5, GPT-4 Turbo) on classification accuracy
  • Analyze results with confusion matrices and detailed reports
  • Get explanations for LLM classifications to understand decision-making
  • Measure retrieval quality using ranking metrics like precision@k
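
As a rough illustration of the last point, precision@k is the fraction of the top k retrieved documents that are relevant. The sketch below is plain Python and does not use any Phoenix API; the helper name `precision_at_k` and the toy input are hypothetical, and it assumes the relevance flags are already ordered by retrieval rank.

```python
from typing import Sequence


def precision_at_k(relevance_flags: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant.

    `relevance_flags` must be ordered by rank (best first). If fewer than
    k documents were retrieved, the missing positions count as misses,
    so the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = relevance_flags[:k]
    return sum(1 for flag in top_k if flag) / k


if __name__ == "__main__":
    # Toy ranking for one query: True means the document was judged relevant.
    ranked_flags = [True, False, True, True]
    for k in (1, 2, 3):
        print(f"precision@{k} = {precision_at_k(ranked_flags, k):.2f}")
```

In practice the flags for each query could come from the labels produced by the relevance evaluator, one list per query.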

Notebook Walkthrough

This page walks through the key code snippets. To follow the complete tutorial, open the full notebook in Google Colab (colab.research.google.com).

Download Benchmark Dataset

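# Assumed import path; adjust it if your installed phoenix.evals version exposes the helper elsewhere.
from phoenix.evals.utils import download_benchmark_dataset

df = download_benchmark_dataset(
    task="binary-relevance-classification",
    dataset_name="wiki_qa-train"
)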

Configure Evaluation

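# Evaluate a random sample rather than the full dataset to limit the number of LLM calls.
N_EVAL_SAMPLE_SIZE = 100
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)

# Rename the dataset columns to the "input" and "reference" names used for evaluation.
df_sample = df_sample.rename(columns={
    "query_text": "input",
    "document_text": "reference",
})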

Run Relevance Classification

from phoenix.evals import LLM, evaluate_dataframe
from phoenix.evals.metrics import DocumentRelevanceEvaluator

llm = LLM(provider="openai", model="gpt-4")
relevance_evaluator = DocumentRelevanceEvaluator(llm=llm)

evals_df = evaluate_dataframe(dataframe=df_sample, evaluators=[relevance_evaluator])
relevance_classifications = evals_df["document_relevance_score"].str["label"].tolist()
choices = relevance_evaluator.CHOICES

Evaluate Results

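import matplotlib.pyplot as plt
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

# Convert the boolean ground-truth column to the evaluator's label strings.
true_labels = df_sample["relevant"].map({True: "relevant", False: "unrelated"}).tolist()

# Per-class precision, recall, and F1.
print(classification_report(true_labels, relevance_classifications, labels=choices))

# Normalized confusion matrix of ground truth vs. predicted labels.
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=choices
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)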
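
To make the numbers in the classification report easier to interpret, the following sketch computes precision, recall, and F1 for a single positive class by hand. It is plain Python rather than the scikit-learn implementation; the function name `binary_report` and the toy labels are hypothetical, and the label strings "relevant" and "unrelated" are taken from the snippet above.

```python
from typing import Dict, Sequence


def binary_report(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    positive: str = "relevant",
) -> Dict[str, float]:
    """Precision, recall, and F1 for one class, treating `positive` as the target."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators (no predicted or no actual positives).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only; these are not benchmark results.
    truth = ["relevant", "relevant", "unrelated", "unrelated"]
    preds = ["relevant", "unrelated", "relevant", "unrelated"]
    print(binary_report(truth, preds))
```

Low precision means the model marks too many unrelated documents as relevant; low recall means it misses documents that are relevant.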

Get Explanations

relevance_classifications_df = evaluate_dataframe(
    dataframe=df_sample.sample(n=5),
    evaluators=[relevance_evaluator],
)
relevance_classifications_df["label"] = relevance_classifications_df["document_relevance_score"].str[
    "label"
]
relevance_classifications_df["explanation"] = relevance_classifications_df[
    "document_relevance_score"
].str["explanation"]
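
Explanations tend to be most useful on rows where the model disagrees with the ground truth. The sketch below shows one way to pull those rows out. It is plain Python over a list of dictionaries, not a Phoenix API; the function name `collect_disagreements` and the row layout (a `true_label` key plus a `score` dictionary with `label` and `explanation` keys, mirroring the fields read in the snippet above) are assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple


def collect_disagreements(
    rows: List[Dict[str, Any]],
) -> List[Tuple[int, str, str, str]]:
    """Return (row index, true label, predicted label, explanation) for each mismatch.

    Assumes each row looks like:
        {"true_label": "...", "score": {"label": "...", "explanation": "..."}}
    """
    disagreements = []
    for index, row in enumerate(rows):
        score = row["score"]
        if score["label"] != row["true_label"]:
            disagreements.append(
                (index, row["true_label"], score["label"], score["explanation"])
            )
    return disagreements


if __name__ == "__main__":
    # Hypothetical rows for illustration; the explanations are placeholders.
    sample_rows = [
        {
            "true_label": "relevant",
            "score": {"label": "relevant", "explanation": "placeholder A"},
        },
        {
            "true_label": "unrelated",
            "score": {"label": "relevant", "explanation": "placeholder B"},
        },
    ]
    for index, truth, predicted, explanation in collect_disagreements(sample_rows):
        print(f"row {index}: expected {truth}, got {predicted}: {explanation}")
```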

Compare Models

Run the same evaluation with different models by constructing an LLM for each one and passing it to a new DocumentRelevanceEvaluator:
# GPT-3.5
llm_gpt35 = LLM(provider="openai", model="gpt-3.5-turbo")

# GPT-4 Turbo
llm_gpt4turbo = LLM(provider="openai", model="gpt-4-turbo-preview")
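
Once each model has produced its list of predicted labels, the results can be compared side by side. The sketch below ranks models by accuracy against the ground-truth labels. It is plain Python and makes no Phoenix calls; the helper names `accuracy` and `rank_models`, the placeholder model names, and the toy labels are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(true_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Share of predictions that exactly match the ground-truth label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)


def rank_models(
    true_labels: Sequence[str],
    predictions_by_model: Dict[str, Sequence[str]],
) -> List[Tuple[str, float]]:
    """Return (model name, accuracy) pairs sorted from most to least accurate."""
    scores = {
        name: accuracy(true_labels, predictions)
        for name, predictions in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not measured results.
    truth = ["relevant", "unrelated", "relevant", "unrelated"]
    predictions = {
        "model_a": ["relevant", "unrelated", "relevant", "relevant"],
        "model_b": ["relevant", "relevant", "unrelated", "relevant"],
    }
    for name, score in rank_models(truth, predictions):
        print(f"{name}: accuracy = {score:.2f}")
```

Accuracy alone can hide class imbalance, so it is worth reading it alongside the per-class report and confusion matrix for each model.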