AI Glossary
139 terms covering core AI concepts, techniques, and architectures
A/B Testing
An experimental methodology for comparing two or more model versions in production by randomly routing traffic to each variant and measuring performance differences. A/B testing enables data-driven decisions about model updates and improvements.
Activation Function
A mathematical function applied to a neuron's output in a neural network that introduces non-linearity, enabling the network to learn complex patterns. Common activation functions include ReLU (Rectified Linear Unit), sigmoid, and tanh. Without activation functions, neural networks would only be able to learn linear relationships. The choice of activation function affects training dynamics, gradient flow, and model expressiveness.
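As a minimal sketch, the common activation functions mentioned above can be written in a few lines of plain Python (real frameworks apply them element-wise to tensors):

```python
import math

def relu(x):
    # ReLU passes positive values through unchanged and zeroes out negatives
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.0), relu(3.0))     # 0.0 3.0
print(sigmoid(0.0))              # 0.5
print(math.tanh(0.0))            # 0.0 (tanh squashes into (-1, 1))
```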
Adversarial Examples
Carefully crafted inputs designed to fool machine learning models into making mistakes. Often these are imperceptible modifications to normal inputs (like adding noise to an image) that cause confident misclassifications. Adversarial examples expose vulnerabilities in AI systems and are studied to improve robustness. They are particularly concerning for security-critical applications like autonomous vehicles or content moderation.
Agentic AI
A paradigm where AI systems act as autonomous agents that can plan, reason, use tools, and take actions to accomplish complex goals. Unlike traditional chatbots that respond to single queries, agentic AI systems can break down tasks, execute multi-step workflows, interact with external services, and iterate until objectives are met.
AGI (Artificial General Intelligence)
A hypothetical form of artificial intelligence that possesses the ability to understand, learn, and apply knowledge across any intellectual task that a human can perform. Unlike narrow AI which excels at specific tasks, AGI would have flexible, general-purpose intelligence. AGI remains a theoretical goal and active area of research, with significant debate about when or if it will be achieved and what safety measures are needed.
AI Agent
An autonomous software entity that uses AI (typically an LLM) to perceive its environment, make decisions, and take actions to achieve specific goals. Agents can use tools, maintain memory, and operate in multi-step workflows.
AI Alignment
The research field focused on ensuring that AI systems behave in accordance with human intentions, values, and goals. Alignment aims to solve the problem of specifying what we actually want AI to do and ensuring it faithfully follows those specifications, especially as systems become more powerful and autonomous.
AI Safety
A broad research field dedicated to ensuring AI systems are beneficial and do not cause unintended harm. AI safety encompasses technical challenges like robustness and reliability, as well as societal concerns like misuse prevention, existential risk, and maintaining human oversight of powerful systems.
API (Application Programming Interface)
A set of rules and protocols that allows different software applications to communicate with each other. In AI, APIs enable developers to integrate AI capabilities into their applications without building models from scratch. Popular AI APIs include OpenAI API, Anthropic API, Google AI API, and Hugging Face Inference API. APIs abstract away complexity and provide simple interfaces for powerful AI functionality.
Attention Mechanism
A component in neural networks that allows models to focus on different parts of the input when producing output. Rather than treating all input equally, attention assigns different weights to different positions, enabling the model to "pay attention" to the most relevant information for each step of processing.
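A toy sketch of scaled dot-product attention, the form used in transformers, with a single 2-dimensional query (real models operate on batched matrices of learned query, key, and value projections):

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    # Score each key against the query (dot product), scale by sqrt(d),
    # softmax the scores into weights, then return the weighted sum of values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
# The query matches the first key more strongly, so the output leans
# toward the first value vector.
out = attention([1.0, 0.0], keys, values)
print([round(x, 2) for x in out])
```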
Automation
Using AI and software to perform tasks with minimal human intervention. In the AI context, this includes workflow automation, data processing, and decision-making pipelines.
AutoML
Automated Machine Learning: the process of automating the end-to-end pipeline of applying machine learning, including data preprocessing, feature engineering, model selection, hyperparameter tuning, and evaluation. AutoML makes machine learning accessible to non-experts and saves time for experts. Tools like Google AutoML, H2O.ai, and Auto-sklearn enable practitioners to build high-quality models with minimal manual intervention.
Autonomous AI
AI systems capable of operating independently to accomplish goals with minimal human oversight. Autonomous AI agents can plan, execute multi-step tasks, use tools, recover from errors, and adapt their approach based on results, acting as self-directed problem solvers rather than simple question-answer systems.
Backpropagation
The core algorithm for training neural networks. It works by computing the error between the model's prediction and the correct answer, then propagating that error backwards through the network layer by layer, calculating how much each weight contributed to the error. These error signals are used to update weights via gradient descent.
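For a single weight, the whole loop can be made concrete. This sketch assumes the simplest possible model, y = w * x, with a squared-error loss, so the chain rule reduces to one line:

```python
def forward(w, x):
    return w * x

# One training example: input x=2.0, target t=10.0, starting weight w=3.0
w, x, t = 3.0, 2.0, 10.0
y = forward(w, x)            # prediction: 6.0
loss = (y - t) ** 2          # squared error: 16.0
grad_w = 2 * (y - t) * x     # chain rule: dloss/dy * dy/dw = -16.0
w -= 0.1 * grad_w            # gradient-descent update using the error signal
print(y, loss, grad_w, round(w, 2))  # 6.0 16.0 -16.0 4.6
```

In a multi-layer network, the same chain-rule multiplication is repeated layer by layer, working backwards from the loss.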
Batch Normalization
A technique that normalizes the inputs of each layer across a mini-batch during training. Batch normalization stabilizes learning, allows higher learning rates, reduces sensitivity to initialization, and acts as a regularizer to reduce overfitting.
Batch Size
The number of training examples processed together in one forward and backward pass during model training. Larger batches provide more stable gradient estimates and better utilize parallel hardware but require more memory. Smaller batches introduce more noise but can lead to better generalization. Batch size affects training speed, memory usage, and final model quality. Common values range from 8 to 512 depending on the task and hardware.
Beam Search
A decoding algorithm that maintains multiple hypothesis sequences (beams) at each step, keeping only the top-k most probable candidates. Beam search balances exploration and exploitation to find high-quality outputs, commonly used in machine translation and text generation.
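A toy illustration of why keeping multiple beams helps, using hypothetical next-token distributions conditioned only on the previous token (real decoders score full contexts with the model):

```python
import math

# Hypothetical next-token distributions for this example.
NEXT = {
    None: {"a": 0.6, "b": 0.4},
    "a":  {"x": 0.5, "y": 0.3, "z": 0.2},
    "b":  {"x": 0.1, "y": 0.9},
}

def beam_search(length, beam_width=2):
    beams = [([], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            prev = seq[-1] if seq else None
            for tok, p in NEXT[prev].items():
                candidates.append((seq + [tok], score + math.log(p)))
        # Keep only the top-k scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Greedy decoding would commit to "a" first (0.6) and end at 0.6 * 0.5 = 0.30,
# but the beam keeps "b" alive and finds "b y" with 0.4 * 0.9 = 0.36.
print(beam_search(2))  # ['b', 'y']
```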
Benchmark
A standardized test or evaluation suite used to measure and compare AI model performance. Benchmarks provide consistent metrics across tasks like reasoning, coding, math, and language understanding, enabling objective comparison between different models and approaches.
BERT (Bidirectional Encoder Representations from Transformers)
A transformer-based language model introduced by Google in 2018 that reads text bidirectionally (considering context from both left and right). Unlike GPT which is autoregressive (predicts next token), BERT is trained to predict masked tokens in sentences. BERT excels at understanding tasks like question answering, sentiment analysis, and text classification. It revolutionized NLP by providing powerful pre-trained representations.
Bias in AI
Systematic errors in AI outputs that unfairly favor or disadvantage particular groups of people. AI bias typically originates from biased training data, flawed model design, or biased evaluation metrics. It can manifest as racial, gender, age, or socioeconomic discrimination in AI decisions and outputs.
BLEU
Bilingual Evaluation Understudy - a metric that measures the quality of machine-translated text by comparing n-gram overlap with reference translations. BLEU scores range from 0 to 1 (or 0-100), with higher scores indicating better translation quality.
Catastrophic Forgetting
The tendency of neural networks to abruptly forget previously learned information when trained on new tasks. This challenge is central to continual learning research, which aims to enable models to learn sequentially without forgetting.
Chain-of-Thought Prompting
A prompting technique that encourages language models to break down complex reasoning into intermediate steps before arriving at a final answer. By explicitly asking the model to "think step by step," it produces more accurate results on logic, math, and multi-step reasoning tasks.
Chunking
The process of breaking large documents or text into smaller, manageable pieces for processing by AI systems. Chunking is essential for RAG systems, where documents must fit within context windows, and for creating meaningful semantic units for embedding and retrieval. Effective chunking strategies balance preserving semantic coherence against keeping chunks an appropriate size. Chunking methods include fixed-size, sentence-based, paragraph-based, and semantic splitting.
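The simplest strategy, fixed-size chunking with overlap, can be sketched in a few lines (the sizes here are character counts for illustration; production systems typically chunk by tokens):

```python
def chunk_text(text, chunk_size=100, overlap=20):
    # Fixed-size chunking: overlapping windows so that content spanning a
    # chunk boundary still appears whole in at least one chunk.
    chunks = []
    step = chunk_size - overlap  # assumes overlap < chunk_size
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 250
chunks = chunk_text(doc, chunk_size=100, overlap=20)
print([len(c) for c in chunks])  # [100, 100, 90]
```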
CLIP
Contrastive Language-Image Pre-training - a multimodal model developed by OpenAI that learns visual concepts from natural language descriptions. CLIP can classify images, search for images using text, and perform zero-shot learning across vision tasks without task-specific training.
CNN (Convolutional Neural Network)
A type of neural network specifically designed for processing grid-like data such as images. CNNs use special layers called convolutional layers that apply filters to detect features like edges, textures, and shapes. They have been the backbone of computer vision for over a decade.
Compute
The computational resources (processing power, memory, time) required to train or run AI models. Training large models requires enormous compute, measured in petaflop-days or GPU-hours. The amount of compute used in training has grown exponentially, doubling roughly every 3-4 months for frontier models. Compute is a key bottleneck in AI progress and a major cost factor. Efficient use of compute through techniques like mixed-precision training and model parallelism is crucial.
Computer Vision
A field of AI that trains computers to interpret and understand visual information from the world, including images and videos. Computer vision systems can identify objects, faces, text, and scenes, and are used in applications from self-driving cars to medical imaging to augmented reality.
Constitutional AI
An AI safety approach developed by Anthropic that trains AI systems using explicit principles or a "constitution" that guides behavior. Rather than relying solely on human feedback for every decision, the AI learns to evaluate its own outputs against written principles covering helpfulness, harmlessness, and honesty. This approach aims to make AI behavior more transparent, controllable, and aligned with human values.
Context Window
The maximum amount of text (measured in tokens) that a language model can process at once, including both the input prompt and the generated output. The context window determines how much information the model can "see" and consider when generating responses. Larger context windows allow for longer documents and conversations.
Contrastive Learning
A self-supervised learning technique that learns representations by contrasting similar (positive) and dissimilar (negative) pairs. The model learns to pull together representations of similar samples while pushing apart dissimilar ones in the embedding space.
Cosine Similarity
A metric that measures the similarity between two vectors by computing the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical). In AI, cosine similarity is widely used to compare embeddings and determine semantic similarity between texts, images, or other data. It is scale-invariant, meaning it measures orientation rather than magnitude, making it ideal for high-dimensional vector comparisons in semantic search and recommendation systems.
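The definition translates directly into code. A minimal pure-Python version (libraries like NumPy or vector databases compute this over large embedding matrices):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Scale-invariance: [2, 4, 6] is just [1, 2, 3] doubled, so similarity is ~1.0
print(round(cosine_similarity([1, 2, 3], [2, 4, 6]), 6))   # 1.0
print(cosine_similarity([1, 0], [0, 1]))                    # 0.0 (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))                   # -1.0 (opposite)
```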
Cross-Validation
A technique for robustly evaluating model performance by dividing data into multiple subsets (folds), training on some folds while validating on others, then rotating which folds are used for each purpose. K-fold cross-validation uses k different train-validation splits, providing k performance estimates that can be averaged. This gives a more reliable performance estimate than a single train-test split, especially with limited data.
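A minimal sketch of generating k-fold splits by index (libraries like scikit-learn provide this with shuffling and stratification options):

```python
def kfold_splits(n_samples, k):
    # Yield (train_indices, val_indices) for each of the k folds.
    # Each sample lands in the validation set of exactly one fold.
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

for train, val in kfold_splits(10, 5):
    print(val)  # [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]
```

The k validation scores would then be averaged for the final performance estimate.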
Data Augmentation
Techniques for artificially expanding training datasets by creating modified versions of existing data. In computer vision, this includes rotations, flips, crops, and color adjustments. For text, it includes paraphrasing, back-translation, or synonym replacement. Data augmentation improves model robustness, reduces overfitting, and can help address data scarcity. It teaches models to be invariant to irrelevant variations while maintaining performance on the core task.
Data Drift
The phenomenon where the statistical properties of input data change over time in production, causing model performance to degrade. Monitoring for data drift is crucial for maintaining model reliability and knowing when retraining is needed.
Data Poisoning
An adversarial attack where malicious examples are injected into training data to manipulate model behavior. Data poisoning can cause models to make specific errors, introduce backdoors, or degrade general performance.
Data Science
An interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science combines statistics, mathematics, programming, and domain expertise to analyze data, build predictive models, and inform decision-making. It encompasses data collection, cleaning, exploration, visualization, and modeling.
Deep Learning
A subset of machine learning that uses neural networks with many layers (hence "deep") to learn complex patterns from large amounts of data. Deep learning powers most modern AI breakthroughs, from image recognition to language generation, and is the foundation behind technologies like GPT and DALL-E.
Diffusion Model
A type of generative AI model that creates data (typically images) by learning to reverse a gradual noising process. During training, noise is progressively added to data until it becomes random static, and the model learns to reverse each step. During generation, it starts with random noise and iteratively removes it to produce coherent outputs.
Dropout
A regularization technique that randomly deactivates a fraction of neurons during each training step, preventing the network from becoming overly dependent on specific neurons. Dropout forces the network to learn robust, redundant representations that work even when some neurons are missing. During inference, all neurons are active but their outputs are scaled accordingly. Dropout is one of the most effective methods for preventing overfitting in deep learning.
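A sketch of "inverted" dropout, the formulation used by most frameworks: survivors are scaled up during training so that no rescaling is needed at inference (the seeded generator here is just to make the toy reproducible):

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    # Inverted dropout: zero each activation with probability p during
    # training, and scale survivors by 1/(1-p) so the expected value
    # of each activation is unchanged.
    if not training:
        return list(activations)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

rng = random.Random(0)
print(dropout([1.0, 1.0, 1.0, 1.0], p=0.5, rng=rng))  # mix of 0.0s and 2.0s
print(dropout([1.0, 1.0], training=False))            # [1.0, 1.0] at inference
```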
Edge AI
Artificial intelligence processing that occurs locally on end-user devices (edge devices) rather than in the cloud. Edge AI enables real-time inference with low latency, improved privacy (data stays on device), reduced bandwidth costs, and operation without internet connectivity. Applications include smartphone AI features, smart home devices, autonomous vehicles, and IoT sensors. Edge AI typically requires model optimization techniques like quantization to fit models within device constraints.
Embedding Space
The mathematical space where embeddings live, representing concepts as points in high-dimensional space. Similar concepts cluster together, and relationships between concepts are preserved as directions and distances. This geometric representation of meaning enables machines to understand semantic relationships.
Embeddings
Numerical vector representations of data (text, images, etc.) that capture semantic meaning. Similar items have similar embeddings, enabling semantic search and comparison.
Emergent Abilities
Capabilities that suddenly appear in AI models at sufficient scale but are absent in smaller models. Examples include complex reasoning, arithmetic, and in-context learning that emerge when models reach certain parameter counts or training compute thresholds. Emergent abilities are unpredictable and not explicitly trained for, arising from the interaction of scale and training data. They represent both exciting potential and safety concerns as models grow larger.
Encoder-Decoder
A neural network architecture consisting of two main components: an encoder that processes input into a compressed representation, and a decoder that generates output from that representation. Originally popularized for sequence-to-sequence tasks like machine translation, this architecture is used in many modern AI systems. Transformers can use encoder-only (BERT), decoder-only (GPT), or full encoder-decoder (T5) configurations depending on the task.
Epoch
One complete pass through the entire training dataset during the model training process. Training typically involves multiple epochs, where the model sees the same data repeatedly but learns incrementally with each pass. The number of epochs is a key hyperparameter: too few and the model may underfit, too many and it may overfit. Training is often monitored across epochs to find the optimal stopping point.
Explainable AI (XAI)
A field focused on making AI decisions interpretable and understandable to humans. XAI techniques include attention visualization, feature importance scores, saliency maps, and natural language explanations. As AI systems are deployed in high-stakes domains like healthcare and finance, explainability becomes critical for trust, debugging, regulatory compliance, and identifying biases. The challenge is balancing model performance with interpretability.
F1 Score
The harmonic mean of precision and recall, providing a balanced measure of classification performance. F1 score is particularly useful when dealing with imbalanced datasets where accuracy alone can be misleading.
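Computed from raw prediction counts, the formula looks like this (a minimal sketch; evaluation libraries compute it from label arrays directly):

```python
def f1_from_counts(tp, fp, fn):
    # Precision: of everything flagged positive, how much was right?
    precision = tp / (tp + fp)
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn)
    # Harmonic mean penalizes imbalance between the two.
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall = 2/3, F1 = 8/11
print(round(f1_from_counts(8, 2, 4), 3))  # 0.727
```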
Fairness
The principle that AI systems should treat all individuals and groups equitably, without discriminating based on protected characteristics like race, gender, or age. Fairness in AI involves multiple concepts: demographic parity, equal opportunity, and individual fairness. Achieving fairness requires careful attention to training data, evaluation metrics, and deployment contexts. Fairness is a key ethical concern and increasingly a regulatory requirement for AI systems.
Federated Learning
A distributed machine learning approach where models are trained across multiple decentralized devices or servers holding local data, without exchanging the raw data itself. This enables collaborative learning while preserving privacy and reducing data transfer costs.
Few-Shot Learning
A technique where a model learns to perform a task from just a few examples provided in the prompt. Instead of fine-tuning on thousands of examples, you include 2-5 demonstrations of the desired input-output pattern directly in the prompt, and the model generalizes from these examples to handle new inputs.
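A sketch of what such a prompt might look like, here for a hypothetical sentiment-labeling task; the model is expected to continue the pattern for the final input:

```python
# Three in-context demonstrations of the desired input-output pattern.
examples = [
    ("I loved this movie!", "positive"),
    ("Terrible, a complete waste of time.", "negative"),
    ("The acting was superb.", "positive"),
]

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"

# The new input ends mid-pattern, inviting the model to fill in the label.
prompt += "Review: The plot made no sense.\nSentiment:"
print(prompt)
```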
Fine-Tuning
The process of further training a pre-trained AI model on a specific dataset to adapt it for a particular task or domain. This allows customization without training from scratch.
Fine-Tuning Dataset
A curated collection of examples used to adapt a pre-trained model to a specific task or domain. Fine-tuning datasets are typically much smaller than pre-training datasets (thousands to millions of examples vs. billions) but are carefully constructed to represent the target task. Quality matters more than quantity: clean, diverse, and representative fine-tuning data leads to better task performance and reduces the risk of overfitting or learning spurious patterns.
Foundation Model
A large AI model trained on broad, diverse data at scale that can be adapted to a wide range of downstream tasks. Foundation models serve as a base upon which many applications are built, similar to how a building's foundation supports multiple structures above it. Examples include GPT-4, Claude, Llama, and Gemini.
Function Calling
A capability that allows large language models to invoke external functions or APIs in a structured way. Rather than just generating text, the model can output formatted requests to execute actions like searching databases, calling APIs, performing calculations, or retrieving information. Function calling enables LLMs to interact with the real world and access capabilities beyond text generation, making them more useful as agents.
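The application-side half of the loop can be sketched as follows. The JSON shape here is illustrative, not any particular vendor's schema, and the model output is hard-coded for the example:

```python
import json

# A hypothetical tool registry the application exposes to the model.
TOOLS = {"add": lambda a, b: a + b}

# In a real system this structured call would come from the model's response.
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'

# Parse the structured request and dispatch to the named function.
call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # 5 -- typically sent back to the model for its final answer
```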
GAN (Generative Adversarial Network)
A generative model architecture consisting of two neural networks competing against each other: a generator that creates fake data and a discriminator that tries to distinguish fake from real. Through this adversarial training, the generator learns to produce increasingly realistic outputs.
Generative AI
A category of artificial intelligence systems that can create new content such as text, images, audio, video, and code. Unlike traditional AI that classifies or predicts, generative AI produces original outputs by learning patterns from training data. Major examples include ChatGPT for text, Midjourney for images, and Suno for music.
GPT (Generative Pre-trained Transformer)
A family of large language models developed by OpenAI based on the transformer architecture. GPT models are "pre-trained" on vast amounts of text data to predict the next word in a sequence, then fine-tuned for specific tasks. The series includes GPT-2, GPT-3, GPT-4, and beyond, each more capable than the last.
Gradient Clipping
A technique that prevents exploding gradients by limiting the magnitude of gradients during backpropagation. When gradients exceed a threshold, they are scaled down proportionally, stabilizing training especially in recurrent networks.
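A sketch of clipping by global L2 norm, the most common variant (frameworks apply this across all parameter gradients at once):

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    # If the gradient vector's L2 norm exceeds max_norm, scale the whole
    # vector down so its norm equals max_norm; direction is preserved.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # ~[0.6, 0.8] -- norm was 5.0
print(clip_by_norm([0.1, 0.2], max_norm=1.0))  # unchanged, already within bounds
```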
Gradient Descent
The optimization algorithm used to train most machine learning models. It works by iteratively adjusting model parameters in the direction that reduces the error (loss). Think of it as rolling a ball downhill on an error landscape, always moving toward the lowest point.
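The "rolling downhill" picture in one dimension, minimizing f(x) = (x - 3)^2 from a starting point of 0:

```python
def grad(x):
    # Derivative of f(x) = (x - 3)^2
    return 2 * (x - 3)

x, lr = 0.0, 0.1  # starting point and learning rate
for _ in range(50):
    x -= lr * grad(x)  # step downhill along the negative gradient

print(round(x, 3))  # 3.0 -- converged to the minimum
```

Training a real model is the same loop, but x becomes millions or billions of weights and the gradient comes from backpropagation.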
Greedy Decoding
The simplest decoding strategy that always selects the single most probable token at each generation step. While fast and deterministic, greedy decoding can produce repetitive or suboptimal outputs: committing to the locally best token at each step can miss sequences with higher overall probability.
Grounding
The practice of connecting AI-generated content to verifiable, factual sources or real-world data. Grounding reduces hallucinations by anchoring outputs to retrieved documents, databases, or APIs rather than relying solely on the model's parametric knowledge. It is a key component of RAG systems and is essential for building trustworthy AI applications in domains requiring factual accuracy. Grounding provides transparency and enables fact-checking of AI outputs.
Hallucination
When an AI model generates information that sounds plausible and confident but is factually incorrect or entirely fabricated. Hallucinations are a fundamental challenge with large language models because they are designed to produce fluent, probable-sounding text rather than verified facts.
Hugging Face
A leading AI platform and community that hosts thousands of open-source machine learning models, datasets, and tools. Hugging Face provides the Transformers library (the most popular framework for working with transformer models), a model hub for sharing and discovering models, datasets for training, and Spaces for hosting ML demos. It has become the central hub for open-source AI collaboration.
Hyperparameters
Configuration settings that control the learning process of a machine learning model, set before training begins. Unlike model parameters that are learned from data, hyperparameters are chosen by the practitioner. Examples include learning rate, batch size, number of layers, number of neurons per layer, and regularization strength. Tuning hyperparameters is crucial for optimal model performance.
In-Context Learning
The ability of large language models to learn and adapt to new tasks purely from examples provided in the prompt, without any parameter updates or fine-tuning. By seeing a few demonstrations of a task in context, the model can generalize to new instances. This emergent capability is a key advantage of large foundation models and enables few-shot and zero-shot learning. It demonstrates that LLMs can learn at inference time, not just training time.
Inference
The process of using a trained AI model to make predictions or generate outputs on new data. While training teaches the model, inference is when the model applies what it learned. In the context of LLMs, inference refers to generating text responses. Inference costs, speed, and efficiency are critical considerations for deploying AI at scale.
Inference Time
The time it takes for a trained model to process an input and generate an output. Inference time is critical for user experience and cost in production systems. It depends on model size, hardware, optimizations (quantization, pruning), and input length. Reducing inference time through techniques like caching, batching, and model compression enables real-time applications and reduces infrastructure costs. For LLMs, generation speed is often reported in tokens per second.
Instruction Tuning
A fine-tuning approach that trains language models to follow natural language instructions by training on diverse instruction-following tasks. Models are shown many examples of instructions paired with appropriate responses, teaching them to generalize to new instructions. Instruction tuning transforms base language models (which only predict next tokens) into helpful assistants that can perform tasks on demand. This is a key step in creating models like ChatGPT and Claude.
Jailbreak
Techniques used to bypass the safety guardrails and restrictions built into AI systems, causing them to generate content they were designed to refuse. Jailbreaks often use clever prompting, roleplay scenarios, or encoding tricks to circumvent safety measures. While some jailbreaks are used for security research, they highlight ongoing challenges in AI alignment and the difficulty of making safety measures robust against adversarial inputs.
Knowledge Distillation
A model compression technique where a smaller "student" model is trained to replicate the behavior of a larger "teacher" model. The student learns not just from the training data, but from the teacher's output probabilities, capturing subtle patterns that the teacher learned. This produces smaller, faster models that retain much of the teacher's capability.
KV Cache
Key-Value caching - an optimization for transformer inference that stores previously computed key and value vectors during autoregressive generation. This avoids redundant computation and significantly speeds up token generation for long sequences.
LangChain
An open-source framework for building applications powered by large language models. LangChain provides modular components for common LLM patterns like prompt templates, chains (sequences of operations), agents (autonomous decision-makers), memory management, and integrations with vector databases and tools. It simplifies developing complex LLM applications by providing high-level abstractions and reusable components.
Large Language Model (LLM)
A type of artificial intelligence model trained on vast amounts of text data to understand and generate human-like language. LLMs use transformer architecture and can perform tasks like text generation, translation, summarization, and question answering.
Latency
The delay between sending a request to an AI system and receiving the response. Low latency is critical for interactive applications like chatbots, voice assistants, and real-time recommendations. Latency includes inference time, network delays, and processing overhead. Achieving low latency requires optimized models, efficient serving infrastructure, edge deployment, and sometimes trade-offs with model quality. First-token latency (time to first word) is especially important for user experience.
Layer Normalization
A normalization technique that standardizes inputs across features within a single sample, unlike batch normalization which operates across the batch dimension. Layer norm is particularly effective in recurrent networks and transformers where batch statistics are less stable.
Learning Rate
A hyperparameter that controls how much to adjust model weights in response to the estimated error during training. A smaller learning rate makes training more stable but slower, while a larger learning rate speeds up training but risks overshooting optimal values or becoming unstable. Finding the right learning rate is critical for efficient and successful model training. Modern techniques like learning rate schedules and adaptive optimizers help manage this.
LoRA (Low-Rank Adaptation)
An efficient fine-tuning technique that trains only a small number of additional parameters instead of updating the entire model. LoRA adds small trainable matrices to the model's layers, reducing memory requirements and training time by 10-100x while achieving results comparable to full fine-tuning.
Loss Function
A mathematical function that measures how far a model's predictions are from the actual correct values. During training, the goal is to minimize this loss. Different tasks use different loss functions: mean squared error for regression, cross-entropy for classification, and specialized losses for tasks like object detection or language modeling. The loss function guides how the model updates its weights.
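The two workhorse losses mentioned above, in minimal form (frameworks compute these over batches of tensors, and cross-entropy usually operates on raw logits for stability):

```python
import math

def mse(preds, targets):
    # Mean squared error: average squared distance, used for regression.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, true_index):
    # Negative log-probability assigned to the correct class:
    # low when the model is confidently right, large when confidently wrong.
    return -math.log(probs[true_index])

print(mse([2.5, 0.0], [3.0, -0.5]))                 # 0.25
print(round(cross_entropy([0.7, 0.2, 0.1], 0), 3))  # 0.357 -- fairly confident, correct
print(round(cross_entropy([0.1, 0.2, 0.7], 0), 3))  # 2.303 -- confident but wrong
```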
LSTM (Long Short-Term Memory)
A type of recurrent neural network architecture designed to learn long-term dependencies in sequential data. LSTMs use a gating mechanism to control information flow, allowing them to remember important information over long sequences while forgetting irrelevant details.
Meta-Learning
Learning to learn - training models to adapt quickly to new tasks with minimal data by learning from a distribution of related tasks. Meta-learning enables few-shot learning where models can generalize to novel tasks after seeing only a few examples.
Mixture of Experts
A model architecture where multiple specialized sub-networks (experts) each handle different types of inputs, with a gating network that routes each input to the most relevant experts. This allows models to have enormous total parameter counts while only activating a fraction per input, improving efficiency.
MLOps
Machine Learning Operations (MLOps) is the practice of applying DevOps principles to machine learning systems. It encompasses the deployment, monitoring, maintenance, and continuous improvement of ML models in production. MLOps includes version control for models and data, automated testing, CI/CD pipelines for model deployment, performance monitoring, retraining workflows, and infrastructure management.
Model Card
A standardized documentation format that provides transparency about a machine learning model. Model cards describe the model's intended use, training data, performance metrics, limitations, biases, ethical considerations, and appropriate use cases. They help users understand what a model can and cannot do, promoting responsible AI deployment and preventing misuse. Model cards are increasingly required for AI transparency.
Model Collapse
A degradation phenomenon where generative models trained on synthetic data (including their own outputs) progressively lose diversity and quality. Model collapse is an emerging concern as AI-generated content becomes more prevalent on the internet.
Model Compression
Techniques to reduce model size and computational requirements while maintaining performance, including quantization, pruning, distillation, and low-rank factorization. Compressed models enable deployment on resource-constrained devices and reduce inference costs.
Model Deployment
The process of taking a trained machine learning model and making it available for use in production applications. Deployment involves packaging the model, setting up infrastructure for serving predictions, implementing APIs for access, monitoring performance, handling versioning, and ensuring reliability and scalability. Successful deployment bridges the gap between research and real-world impact.
Model Monitoring
The practice of tracking model performance, data quality, and system health in production. Monitoring includes tracking accuracy metrics, detecting data drift, identifying anomalies, and alerting when intervention is needed.
Model Serving
The infrastructure and techniques for deploying machine learning models in production environments to handle real-time predictions. Model serving includes load balancing, batching, caching, monitoring, and version management.
Multi-Agent Systems
Architectures where multiple AI agents collaborate or compete to solve complex problems. Each agent may have specialized roles, and they communicate to achieve a shared objective.
Multimodal AI
AI systems that can process and generate multiple types of data, such as text, images, audio, and video simultaneously. Multimodal models understand relationships across modalities, like describing what's in an image or generating images from text descriptions.
Natural Language Processing
A branch of artificial intelligence focused on enabling computers to understand, interpret, and generate human language. NLP combines computational linguistics, machine learning, and deep learning to process text and speech, powering applications like translation, sentiment analysis, chatbots, and search engines.
Neural Architecture Search (NAS)
An automated process for discovering optimal neural network architectures for specific tasks, rather than manually designing them. NAS algorithms explore the space of possible architectures, evaluating candidates and searching for designs that maximize performance while satisfying constraints like size or latency. While computationally expensive, NAS has discovered novel architectures that outperform human-designed networks. It represents a step toward automating ML pipeline design.
Neural Network
A computing system inspired by biological neural networks in the human brain. It consists of layers of interconnected nodes (neurons) that process information by passing signals through weighted connections. Neural networks learn by adjusting these weights based on training data, enabling them to recognize patterns, classify data, and make predictions.
Open-Source AI
AI models and tools whose code, weights, and training methodology are publicly available for anyone to use, modify, and distribute. Open-source AI democratizes access to powerful models, enabling innovation, transparency, and customization. Notable examples include Llama, Mistral, and Stable Diffusion.
Optimizer
An algorithm that adjusts neural network parameters during training to minimize the loss function. The optimizer determines how weights are updated based on computed gradients. Popular optimizers include SGD (Stochastic Gradient Descent), Adam, AdamW, and RMSprop. Each optimizer has different strategies for learning rate adaptation, momentum, and handling sparse gradients. The choice of optimizer significantly impacts training speed and final model quality.
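One optimizer update can be sketched in a few lines; the following is a minimal, illustrative SGD-with-momentum step, not tied to any particular framework:

```python
def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update: v <- momentum*v - lr*g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

Adaptive optimizers such as Adam extend this idea by also tracking per-parameter estimates of gradient magnitude to scale each update.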
Overfitting
A modeling error that occurs when a machine learning model learns the training data too well, including its noise and outliers, rather than the underlying patterns. An overfit model performs excellently on training data but poorly on new, unseen data because it has essentially memorized rather than generalized. Techniques to prevent overfitting include regularization, dropout, early stopping, and using more training data.
Parameters
The learned weights and biases inside a neural network that are adjusted during training to minimize loss. Parameters encode the knowledge the model has learned from training data. Model size is typically measured in parameter count: GPT-3 has 175 billion parameters, while smaller models may have millions to a few billion. More parameters generally enable greater capacity to learn complex patterns, but also require more compute and memory.
PEFT (Parameter-Efficient Fine-Tuning)
A category of techniques for adapting large pre-trained models to specific tasks while updating only a small fraction of parameters. PEFT methods like LoRA, prefix tuning, and adapter layers achieve results competitive with full fine-tuning while using 100-1000x less compute and memory. PEFT makes it practical to customize large models on consumer hardware and enables serving multiple task-specific models efficiently by sharing the base model.
Perplexity
A measurement of how well a language model predicts text. Lower perplexity means the model is less "surprised" by the text and better at predicting what comes next. It is calculated as the exponential of the average negative log-likelihood per token and serves as a key evaluation metric.
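The definition above translates directly into code: given the probability the model assigned to each observed token, perplexity is the exponential of the mean negative log-likelihood (pure-Python sketch):

```python
import math

def perplexity(token_probs):
    # Perplexity = exp of the average negative log-likelihood per token.
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))
```

A model that assigns probability 1/4 to every token has perplexity 4: it is as "surprised" as a uniform guess over four choices.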
Precision and Recall
Precision measures the proportion of positive predictions that are correct, while recall measures the proportion of actual positives that were correctly identified. These complementary metrics are fundamental to evaluating classification and information retrieval systems.
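Treating the predicted and actual positives as sets makes the two definitions concrete (illustrative sketch):

```python
def precision_recall(predicted, actual):
    # predicted/actual are collections of items labeled (or truly) positive.
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

Raising a classifier's decision threshold typically trades recall for precision, which is why the two are reported together (often combined into an F1 score).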
Prompt Caching
A technique that saves and reuses the computed representations of common prompt prefixes across multiple requests. This reduces computation costs and latency when the same system prompt or context is used repeatedly.
Prompt Engineering
The practice of crafting effective instructions (prompts) for AI models to elicit desired outputs. Includes techniques like few-shot prompting, chain-of-thought, and system prompts.
Prompt Injection
A security vulnerability where malicious input is crafted to manipulate an AI system into ignoring its instructions or performing unintended actions. Similar to SQL injection in databases, prompt injection exploits how LLMs process instructions embedded in user input. Attackers can potentially bypass safety filters, extract system prompts, or cause harmful outputs. Defending against prompt injection is an active area of AI security research.
Quantization
A technique for reducing the memory and computational requirements of AI models by representing their numerical parameters with lower precision. For example, converting 32-bit floating-point numbers to 8-bit integers. Quantization enables large models to run on consumer hardware with minimal quality loss.
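A minimal sketch of symmetric int8 quantization with one shared scale per tensor; real systems add per-channel scales, zero points, and calibration:

```python
def quantize_int8(values):
    # Map floats into [-127, 127] using a single scale factor.
    scale = max(abs(v) for v in values) / 127
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate float values; small rounding error is the quality cost.
    return [q * scale for q in quantized]
```

The round trip loses only rounding-level precision while cutting storage from 32 bits to 8 bits per value.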
RAG (Retrieval-Augmented Generation)
A technique that enhances LLM responses by first retrieving relevant information from external knowledge sources, then using that context to generate more accurate and up-to-date answers.
ReAct (Reasoning + Acting)
A prompting and agent framework that interleaves reasoning traces with actions, enabling AI systems to dynamically reason about tasks while taking actions and observing results. ReAct alternates between thinking steps (planning what to do next) and acting steps (using tools or taking actions), creating a synergistic loop. This approach improves task completion in complex, multi-step scenarios by combining chain-of-thought reasoning with interactive tool use.
Red Teaming
A security practice where experts deliberately try to break, manipulate, or find vulnerabilities in AI systems. Red teamers attempt to generate harmful outputs, bypass safety measures, extract training data, or cause other failures. This adversarial testing helps identify weaknesses before deployment, improve robustness, and inform safety measures. Red teaming is crucial for ensuring AI systems are safe and secure.
Regularization
Techniques used to prevent machine learning models from overfitting by adding constraints or penalties during training. Regularization encourages simpler models that generalize better to new data. Common methods include L1 and L2 weight penalties, dropout, early stopping, and data augmentation. Regularization is the balance between fitting training data well and maintaining the ability to perform on unseen data.
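L2 regularization, for example, simply adds a penalty proportional to the squared weights to the training loss (illustrative sketch):

```python
def l2_penalty(weights, lam):
    # Penalize large weights: lam * sum of squared weights.
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam=0.01):
    # Total objective = fit-to-data term + complexity penalty.
    return data_loss + l2_penalty(weights, lam)
```

The strength `lam` is a hyperparameter: larger values push weights toward zero and favor simpler models.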
Reinforcement Learning
A type of machine learning where an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties. The agent learns a strategy (policy) that maximizes cumulative reward over time through trial and error, without being explicitly told what the correct actions are.
RLHF (Reinforcement Learning from Human Feedback)
A training technique that uses human preferences to fine-tune language models. Human evaluators rank different model outputs for quality, helpfulness, and safety. These rankings train a reward model, which is then used with reinforcement learning to align the language model with human values and expectations.
RNN (Recurrent Neural Network)
A type of neural network designed for processing sequential data like text or time series. RNNs maintain a hidden state that acts as memory, allowing information from earlier in a sequence to influence processing of later elements. Largely superseded by transformers for NLP tasks.
ROUGE
Recall-Oriented Understudy for Gisting Evaluation: a set of metrics for evaluating automatic summarization and machine translation by measuring overlap of n-grams, word sequences, and word pairs between generated and reference texts.
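ROUGE-1 recall, the simplest variant, can be sketched as unigram overlap between generated and reference text:

```python
from collections import Counter

def rouge_1_recall(generated, reference):
    # Fraction of reference unigrams that also appear in the generated text.
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(gen_counts[w], c) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values())
```

Production implementations add stemming, ROUGE-2 (bigrams), and ROUGE-L (longest common subsequence) on top of this core idea.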
Scaling Laws
Mathematical relationships that describe how AI model performance improves with increases in model size, dataset size, and compute budget. Scaling laws have shown remarkably predictable patterns: doubling compute tends to produce consistent performance gains. These laws guide decisions about resource allocation and help predict the capabilities of future models. Understanding scaling laws is crucial for efficient AI development and forecasting progress.
Self-Attention
A mechanism that allows each position in a sequence to attend to all positions in the same sequence when computing representations. Self-attention is the core innovation in transformers, enabling the model to weigh the relevance of different parts of the input when processing each element. It captures long-range dependencies and relationships within sequences more effectively than recurrent architectures. Multi-head self-attention applies this mechanism multiple times in parallel.
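For a single head, the computation is softmax(QK^T / sqrt(d)) V; a pure-Python sketch over lists of vectors:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    # Each query attends over all keys; output is the attention-weighted sum of values.
    d = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs
```

In a transformer, Q, K, and V are learned linear projections of the same input sequence, which is what makes the attention "self" attention.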
Self-Supervised Learning
A learning approach where models learn from unlabeled data by solving pretext tasks that automatically generate labels from the data itself. Examples include predicting masked words (BERT) or next tokens (GPT), enabling models to learn rich representations without human annotation.
Semantic Search
A search approach that understands the meaning and intent behind a query rather than just matching keywords. Semantic search uses embeddings to convert text into numerical vectors and finds results that are conceptually similar, even if they use different words. This enables much more relevant search results than traditional keyword matching.
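Given embeddings from any embedding model, the core ranking step is cosine similarity; a minimal sketch:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, doc_vecs, top_n=3):
    # Rank documents by similarity of their embeddings to the query embedding.
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_n]
```

At scale, this brute-force scan is replaced by approximate nearest-neighbor indexes, which is what vector databases provide.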
Speculative Decoding
An inference optimization where a small, fast model generates multiple candidate tokens that are verified in parallel by the larger target model. This can speed up generation by 2-3x without changing the output distribution.
Stable Diffusion
An open-source text-to-image diffusion model that generates high-quality images from text descriptions. It uses a latent diffusion process that operates in a compressed latent space, making it more efficient than previous diffusion models while producing detailed, creative outputs.
Supervised Learning
A machine learning approach where models learn from labeled training data, with each example consisting of an input paired with the correct output. The model learns to map inputs to outputs by minimizing prediction errors. Common supervised learning tasks include classification (predicting categories) and regression (predicting numbers). Examples include spam detection, image recognition, and price prediction.
Synthetic Data
Artificially generated data that mimics real-world data patterns without containing actual real-world information. Synthetic data is used to train AI models when real data is scarce, expensive, private, or biased. It can be created through simulations, generative models, or rule-based systems.
System Prompt
A set of instructions given to a language model at the beginning of a conversation that defines its behavior, personality, capabilities, and constraints. System prompts are typically invisible to end users but guide how the AI responds throughout the interaction. They can specify the AI's role (e.g., helpful assistant, expert programmer), tone, safety guidelines, and what the AI should or should not do.
Temperature
A parameter that controls the randomness and creativity of an AI model's outputs. Low temperature (0.0-0.3) makes the model more deterministic and focused, strongly favoring the most probable next tokens. High temperature (0.7-1.5) increases randomness and variety, allowing less probable tokens to be selected more often.
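Mechanically, temperature divides the logits before the softmax; a sketch:

```python
import math

def apply_temperature(logits, temperature):
    # T < 1 sharpens the distribution; T > 1 flattens it toward uniform.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

At temperature 0, sampling degenerates to always picking the argmax token (greedy decoding).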
Throughput
The number of requests or data samples an AI system can process per unit of time. High throughput is important for serving many users simultaneously or processing large batches of data. Throughput can be increased through batching multiple requests together, parallelization across multiple GPUs, and efficient model architectures. There is often a trade-off between latency and throughput: optimizing for one may reduce the other.
Token
The fundamental unit of text that language models process. A token can be a whole word, part of a word, a number, or a punctuation mark. Models generate text token by token, and context window capacity is measured in tokens, making token counts essential for managing costs and context limits.
Tokenization
The process of breaking text into smaller units called tokens that a language model can process. Tokens can be words, parts of words, or individual characters depending on the tokenizer. Tokenization is a critical first step in NLP, as it determines how text is represented and processed by the model.
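A greedy longest-match tokenizer over a fixed vocabulary is a simplified stand-in for subword schemes like BPE; real tokenizers learn their vocabularies from data:

```python
def greedy_tokenize(text, vocab):
    # Repeatedly take the longest vocabulary entry matching at the current position.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens
```

The character fallback guarantees any input can be tokenized, mirroring the byte-level fallback used by modern subword tokenizers.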
Tool Use
The capability of AI systems to interact with external tools, APIs, and services to accomplish tasks beyond text generation. Tool use enables LLMs to perform calculations, search databases, query APIs, execute code, and interact with the digital world. This transforms language models from pure text generators into autonomous agents that can take actions. Tool use is implemented through function calling, plugins, or agent frameworks.
Top-k Sampling
A generation technique that randomly samples from only the k most likely next tokens at each step. This constrains the vocabulary to high-probability options while maintaining diversity, producing more focused and coherent text than pure random sampling.
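A sketch of top-k sampling over a probability vector (index-based; deterministic only when k=1):

```python
import random

def top_k_sample(probs, k, rng=random.random):
    # Keep the k most likely tokens, renormalize, then sample among them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    threshold = rng() * sum(probs[i] for i in ranked)
    cumulative = 0.0
    for i in ranked:
        cumulative += probs[i]
        if threshold <= cumulative:
            return i
    return ranked[-1]
```

With k=1 this reduces to greedy decoding; larger k admits more diversity at the cost of occasional low-quality tokens.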
Top-p Sampling (Nucleus Sampling)
A dynamic sampling method that selects from the smallest set of tokens whose cumulative probability exceeds p. Unlike top-k, nucleus sampling adapts the vocabulary size based on the probability distribution, allowing more diversity when the model is uncertain.
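Selecting the nucleus itself can be sketched as a cumulative scan over tokens sorted by probability:

```python
def nucleus(probs, p):
    # Smallest set of tokens, in descending probability, whose cumulative mass >= p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in ranked:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen
```

When the distribution is peaked the nucleus is tiny; when it is flat the nucleus grows, which is the adaptivity top-k lacks.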
Training Data
The dataset used to teach a machine learning model by showing it examples of inputs and (in supervised learning) their corresponding correct outputs. The quality, quantity, diversity, and representativeness of training data fundamentally determine what a model learns and how well it performs. Biases in training data lead to biased models. For large language models, training data consists of billions of text examples from the internet and other sources.
Transfer Learning
A machine learning technique where a model trained on one task is repurposed for a different but related task. Rather than training from scratch, transfer learning leverages knowledge already learned, dramatically reducing the amount of data and compute needed. This is the foundation of how modern LLMs work: pre-train broadly, then fine-tune for specifics.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel. The foundation of modern LLMs and many other AI systems.
Unsupervised Learning
A machine learning approach where models find patterns and structure in unlabeled data without explicit guidance about what to look for. The model discovers hidden patterns, groupings, or representations on its own. Common unsupervised learning tasks include clustering (grouping similar items), dimensionality reduction (simplifying data), and anomaly detection. Unlike supervised learning, there are no correct answers to learn from.
VAE (Variational Autoencoder)
A generative model that learns to encode data into a probabilistic latent space and decode it back. VAEs combine neural networks with variational inference to learn meaningful representations and generate new samples similar to the training data.
Validation Set
A portion of data held out from training and used to tune model hyperparameters and make decisions during development. Unlike the training set (used to learn) and test set (used for final evaluation), the validation set provides feedback during the model development process. It helps detect overfitting and guides choices like when to stop training, which model architecture to use, and optimal hyperparameter values.
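Carving out the three sets can be sketched as a shuffled split (the fractions here are illustrative; a fixed seed keeps the split reproducible):

```python
import random

def split_dataset(data, val_frac=0.1, test_frac=0.1, seed=0):
    # Shuffle, then carve off validation and test portions; the rest is training data.
    data = list(data)
    random.Random(seed).shuffle(data)
    n_val = int(len(data) * val_frac)
    n_test = int(len(data) * test_frac)
    return data[n_val + n_test:], data[:n_val], data[n_val:n_val + n_test]
```

Splitting before any preprocessing or tuning prevents information from the validation and test sets leaking into training decisions.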
Vector Database
A specialized database designed to store and efficiently query high-dimensional vector embeddings. Essential for semantic search, recommendation systems, and RAG pipelines.
ViT (Vision Transformer)
An architecture that applies the transformer model directly to image patches, treating them as sequences. ViT has achieved state-of-the-art results on image classification tasks and demonstrates that pure transformer architectures can compete with CNNs in computer vision.
Whisper
An automatic speech recognition (ASR) model developed by OpenAI, trained on 680,000 hours of multilingual data. Whisper demonstrates robust performance across languages, accents, and acoustic conditions, and can also perform speech translation and language identification.
Zero-Shot Learning
A technique where a model performs a task without any task-specific examples in the prompt. The model relies entirely on its pre-trained knowledge and the natural language description of the task. Zero-shot capabilities are a hallmark of powerful foundation models.