Model Evaluation Metrics in AI

Study Notes

Generative models such as GPT and LLaMA need to comprehend and align with human intent.

Human evaluation assesses model performance directly: human raters judge the model's outputs.
Automated evaluation uses algorithms to quantify the quality of model outputs.

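BLEU is a metric that evaluates generated text by comparing it to one or more reference texts.
It computes the n-gram precision of the generated text against the references, with each n-gram's count clipped to its maximum count in any reference, and applies a brevity penalty to outputs shorter than the reference.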
BLEU is commonly used in machine translation and text generation tasks.
BLEU's limitations: because it rewards exact n-gram matches, it can overlook semantic meaning and fluency, for example by scoring a valid paraphrase poorly.
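The BLEU computation described above can be sketched in a few lines. This is a minimal, unsmoothed sentence-level version for study purposes, assuming whitespace-tokenized input; the function names (`ngrams`, `modified_precision`, `simple_bleu`) are made up for these notes, and real evaluations normally use an established library instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against the references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def simple_bleu(candidate, references, max_n=4):
    """Geometric mean of 1..max_n precisions times a brevity penalty (no smoothing)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_avg)

reference = "the cat sat on the mat".split()
print(simple_bleu(reference, [reference]))  # identical text scores 1.0
# "the" appears twice in the reference, so only 2 of the 4 matches count: 0.5
print(modified_precision("the the the the".split(), [reference], 1))
```

The second call shows why counts are clipped: repeating a common word cannot inflate the score. It also illustrates the limitation noted above, since a paraphrase sharing no 4-gram with the reference scores 0 here.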

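ROUGE measures the overlap between the generated text and the reference text in terms of n-grams (ROUGE-N), longest common word subsequences (ROUGE-L), and word pairs in sentence order with gaps allowed (skip-bigrams, ROUGE-S).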
ROUGE is primarily used for summarization tasks.
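A rough sketch of two ROUGE variants follows: n-gram recall (in the style of ROUGE-N) and longest-common-subsequence recall (in the style of ROUGE-L). It assumes whitespace-tokenized input and reports recall only; the helper names are invented for these notes, and published ROUGE scores usually also report precision and F-measure.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref_counts = ngram_counts(reference, n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    cand_counts = ngram_counts(candidate, n)
    overlap = sum(min(count, cand_counts[gram])
                  for gram, count in ref_counts.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 reference bigrams matched
print(rouge_l_recall(candidate, reference))     # LCS "the cat on the mat" has length 5, out of 6
```

Dividing by the reference length makes these scores recall-oriented, which suits summarization: the question is how much of the reference content the summary covers.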

Distinct-n measures the ratio of unique n-grams to the total number of n-grams in the generated outputs.
Distinct-n is typically used for text generation tasks.
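The Distinct-n ratio is simple enough to compute directly. The sketch below assumes the outputs are already tokenized and pools n-grams across all outputs (a corpus-level variant); `distinct_n` is a name chosen for these notes.

```python
def distinct_n(outputs, n):
    """Unique n-grams divided by total n-grams, pooled over all outputs."""
    total = 0
    unique = set()
    for tokens in outputs:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

outputs = ["i do not know".split(), "i do not care".split()]
print(distinct_n(outputs, 1))  # 5 unique unigrams out of 8 -> 0.625
print(distinct_n(outputs, 2))  # 4 unique bigrams out of 6
```

A value near 1 means almost every n-gram is new; a value near 0 means the model keeps repeating itself.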

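Quality metrics like BLEU and ROUGE estimate the accuracy and appropriateness of generated text from its overlap with reference texts; they capture fluency only indirectly.
Quantity metrics like Distinct-n assess how many distinct n-grams the outputs contain, i.e. how diverse the outputs are rather than how correct.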

Strategies for addressing user queries effectively include iterative prompt refinement and few-shot or zero-shot learning.
Addressing queries effectively improves the relevance and accuracy of responses.

Prompts are instructions and context passed to a language model to achieve a desired task.
Prompt engineering is the practice of developing and optimizing prompts to efficiently use language models for a variety of applications.
Prompt engineering is a valuable skill for AI engineers and researchers who want to get better results from language models.

Standard prompts: Direct instructions or questions.
Chain-of-thought prompts: Eliciting intermediate reasoning steps before the final answer.
In-context learning: Demonstrating the task through examples within the prompt, so that the model learns from those examples.
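The three prompt types above differ only in how the prompt string is assembled. The sketch below builds one of each as plain text; the function names, the "Text:/Label:" layout, and the sample reviews are illustrative assumptions, and no particular model API is implied.

```python
def standard_prompt(question):
    """Standard prompt: a direct instruction or question."""
    return question

def chain_of_thought_prompt(question):
    """Chain-of-thought prompt: ask for intermediate reasoning before the answer."""
    return f"Question: {question}\nLet's think step by step."

def few_shot_prompt(instruction, examples, query):
    """In-context learning: show worked examples, then the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")  # left open for the model to complete
    return "\n".join(lines)

examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
print(few_shot_prompt("Classify the sentiment of each review.",
                      examples,
                      "Setup was quick and painless."))
```

With an empty `examples` list the same builder produces a zero-shot prompt, which makes it easy to compare the two settings on the same queries.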

Prompt engineering enhances AI model performance: it improves the accuracy and relevance of outputs.
Prompt engineering enables more precise and nuanced interactions with AI, which is crucial for specialized applications and tasks.

Defining what the LLM should do (classify, sort, filter, write, summarize, translate, etc.).

Text Summarization: Generating a concise summary of a given text.
Question Answering: Providing answers to questions based on given information.
Text Classification: Assigning a category or label to a piece of text.
Role Playing: Engaging in a dialogue by assuming a specific role.
Code Generation: Generating code in a specific programming language.
Reasoning: Performing logical reasoning and inference.
Text Generation: Generating creative or informative text.