How AI Learns
An intuitive, math-free overview of how AI models are trained.
Welcome back to In & Out AI, where we break down complex AI concepts using intuition instead of equations, giving you the technical insights usually reserved for degree holders.
Last week, we learned that AI models are made up of interconnected numbers, and that those numbers are set by the training process. This week, we’ll take a look at how model training works. As usual, there won’t be any math involved; everything will be explained intuitively using examples and analogies. Let’s dive in!
Want to learn about AI but frustrated by materials that are either too generic or buried in math? Subscribe now for our always-free newsletter, designed specifically for business and product professionals like you to truly understand how AI works.
Model Training Overview
Now, discovering that the AI girlfriend you've poured your heart out to is essentially just a boatload of math... can sting. But after the tears subside, a more practical question comes up: How on earth does someone find that perfect set of numbers? Since a model has billions of parameters, manually tuning each one is just an absurd amount of work.

It’s infeasible for humans to dial in every number in a model.
Well, there actually isn't some hyper-caffeinated scientist manually dialing in billions of numbers. Instead, these numbers are iteratively updated by an algorithm, aptly named the optimizer, during the training process. And it is this training process that gives an AI model most of its perceived "intelligence".
Think of the optimizer's job like tuning an old analog radio, where the model's parameters are the radio's tuning knobs. Your goal when tuning a radio is to get a clear-sounding output. But initially, all the knobs are set randomly, producing mostly static, and you keep adjusting the knobs to eliminate that noise. This static noise, or more precisely, the difference between this noise and the ideal clear output, is analogous to the errors the AI makes. And it is the optimizer’s job to minimize those errors by adjusting the parameters, just like you adjust the radio knobs. In AI jargon, the measurement of this error - how far off the model's current output is from the desired output - is called the loss.

But how does the optimizer know how much loss there is, and which way to adjust the parameters? For that, it needs one crucial ingredient...
Data - The Most Crucial Thing
The optimizer's goal is clear: to minimize loss. But how does it quantify the loss? Through one simple yet crucial comparison: the optimizer feeds the model a specific input, calculates the output based on the current parameter values, and compares that output to the known correct output for the same input. The discrepancy between the two is the loss. Since all inputs and outputs are numbers, this numerical difference can always be calculated easily.
This reveals the essential prerequisite: training data - a dataset containing numerous example inputs, each paired with its verified correct output. In AI terminology, these correct outputs are known as labels (sometimes referred to as "ground truth").
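To make the comparison concrete, here's a minimal Python sketch. The toy one-parameter model, the tiny dataset, and the mean-squared-error formula are all illustrative assumptions, not how any real model is built:

```python
def mean_squared_error(predictions, labels):
    """Average of the squared differences between predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical training data: each input paired with its correct output (label).
inputs = [1.0, 2.0, 3.0]
labels = [2.0, 4.0, 6.0]  # ground truth: the output should be 2x the input

def model(x, parameter):
    """A toy one-parameter "model": it just multiplies the input by the parameter."""
    return parameter * x

# With the parameter set to 1.5, the outputs are off, so the loss is nonzero.
predictions = [model(x, parameter=1.5) for x in inputs]
print(mean_squared_error(predictions, labels))  # about 1.17 - the optimizer's job is to shrink this
```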
An Example
Consider training a model to identify cats:
Input: 🐱
Label: "That's a cat!"
Initially, the model processes the input image and perhaps outputs "That's a toaster!". Then, the optimizer compares this to the label "That's a cat!", calculates the resulting (substantial) loss, and registers that the model's parameters need adjustment. Without the correct label provided by the training data, the optimizer simply cannot improve the model's performance.
This reliance underscores why data quality is supremely important. If the training dataset mistakenly includes images of toasters labeled "That's a cat!", the optimizer will diligently adjust the model's parameters to minimize loss based on this false information. The result: a model that confidently identifies toasters as cats. This is the operational definition of "garbage in, garbage out" (GIGO) - a core tenet in AI research and development. Beyond such glaring errors, subtle biases or skewed representation within the data can also systematically mislead the training process. Thus, ensuring dataset quality at scale is one of the less glamorous, yet utterly vital tasks in AI development. After all, a model’s output is always representative of its training data.

Garbage in, garbage out.
While using labeled data (formally called supervised learning) is a cornerstone of AI training and is what we’re focusing on here, be aware that other training paradigms also exist, utilizing different types of data and/or feedback mechanisms.
Now that we understand how loss is calculated using data, we can tackle the next critical step: how the optimizer leverages this loss value to guide the adjustment of billions of parameters.
The Learning Mechanism - Mathematically Going Downhill
Going back to our previous example: our initial set of parameter values takes a cat picture as input and generates an output claiming it's actually a toaster, producing a very large loss for the optimizer. The loss tells us how wrong the model currently is, but it doesn't directly say, "Hey, tweak parameter #4,712,983 up by 0.05 and parameter #12,001,004 down by 0.02." So what does the optimizer do with the loss score to make it useful?
Ideally, there'd be a magic instruction manual telling the optimizer precisely how much to adjust each of the billions of parameters to instantly reach minimum loss. That manual, unfortunately, does not exist. However, what we do have is calculus (don't panic - we'll only introduce the intuition here; this is still an equation-free zone).
Calculus provides the optimizer with something almost as good as the exact answer: it reveals the direction to move the parameters that will most effectively decrease the loss from the current position. In other words, imagine we're trying to find our way to a destination: calculus doesn't give us the exact route to get there, but it does give us a compass pointing towards the next intersection en route. That direction can change as you move, so it's necessary to drive slowly and constantly keep an eye on where the compass is pointing (please don’t actually be distracted like this when driving irl).

The ideal information to get from the loss vs. the actual information we get.
Another common mental model for visualizing this process is an extremely simplified case with only two parameters. Imagine you’re trying to walk down a hill in the dark. Think of the loss value as your altitude, and the two parameter values as the latitude and longitude that define your current location. Calculus tells you which direction is the steepest downhill from your current spot. By taking small steps in that direction, which is constantly updated as you move, you will (ideally) eventually reach the bottom. The latitude and longitude (parameters) at the bottom give the coordinates of the lowest altitude (loss).

Gradient descent pointing to the steepest direction downhill at each move.
This leads to the core algorithm of most AI training: gradient descent:
Calculate the loss at the current parameter settings.
Use calculus to find the gradient (the direction of steepest loss decrease).
Take a small step (adjust the parameters slightly) in that downhill direction.
Go back to step 1 and repeat.
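Here's a minimal Python sketch of that loop on the same toy one-parameter setup from earlier. Real frameworks compute the gradient exactly with calculus rather than the crude numerical nudge used here, and they do it for billions of parameters at once; this is only meant to show the shape of the loop:

```python
inputs = [1.0, 2.0, 3.0]
labels = [2.0, 4.0, 6.0]  # correct outputs: 2x the input

def loss(parameter):
    """How far off the toy model (output = parameter * input) is from the labels."""
    predictions = [parameter * x for x in inputs]
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

parameter = 0.1       # start from an arbitrary initial value
learning_rate = 0.05  # how big a step to take each iteration

for step in range(100):
    # Steps 1-2: measure the loss and estimate its slope (the gradient) by
    # nudging the parameter a tiny bit and seeing how the loss changes.
    nudge = 1e-6
    gradient = (loss(parameter + nudge) - loss(parameter)) / nudge
    # Step 3: move the parameter a small step in the downhill direction
    # (opposite to the gradient).
    parameter -= learning_rate * gradient
    # Step 4: go back and repeat.

print(parameter)  # converges to roughly 2.0, the value that minimizes the loss
```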
Now, remember our model has billions of parameters. So while this seems very simple when there are only two parameters and the optimizer only needs to decide how far to move along two directions, it’s mind-bogglingly complex in billions of dimensions. It's impossible to visualize, but the principle remains the same. Calculus still finds the gradient - the direction of steepest descent across the billions of dimensions simultaneously - and the optimizer takes a small step that way. Billions of parameters, one collective nudge downhill (there are actually multiple separate nudges - one for each layer of the neural network - but it's fine to imagine them as one collective nudge for now).
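To make that "collective nudge" concrete, here is the same toy sketch extended to two parameters, again with made-up illustrative data: the gradient becomes a list with one slope per parameter, and a single update step nudges every parameter at once. With billions of parameters the idea is identical, just with a much longer list.

```python
inputs = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
labels = [8.0, 7.0, 15.0]  # made-up targets; they happen to fit output = 2*a + 3*b

def loss(params):
    predictions = [params[0] * x1 + params[1] * x2 for x1, x2 in inputs]
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

params = [0.0, 0.0]
learning_rate = 0.05
nudge = 1e-6

for step in range(2000):
    # One slope per parameter, estimated by nudging each parameter individually.
    gradient = []
    for i in range(len(params)):
        nudged = params.copy()
        nudged[i] += nudge
        gradient.append((loss(nudged) - loss(params)) / nudge)
    # One collective step: every parameter moves a little, all at the same time.
    params = [p - learning_rate * g for p, g in zip(params, gradient)]

print(params)  # ends up near [2.0, 3.0], the values that fit the made-up data
```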
Potential Problems When Implementing Gradient Descent
While I said you can get to the bottom of the hill using gradient descent in the previous example, there’s actually no guarantee you’ll arrive at the lowest spot (the minimum). To increase the likelihood of reaching it, a few practicalities need to be addressed when implementing gradient descent:
Step Size (a.k.a. Learning Rate): How big a step should you take when moving down the hill if you were capable of taking steps 1000x bigger than normal? There’s no physical restriction on how much you can change a parameter when training a model, after all. If the steps are too small, progress will be incredibly slow, which is of course bad. Since we want to reach our goal fast, wouldn’t that mean bigger steps are always better? Not quite: if the steps are too large, you risk leaping right over the minimum like Tony Hawk clearing a ramp. Repeat this, and the optimizer just bounces back and forth across the valley without ever settling at the bottom. So finding a good learning rate – often referred to as "tuning" this hyperparameter – is key, and it usually involves some experimentation to find that sweet spot.

Moving too fast risks missing the bottom, even if you’re moving in the right direction.
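Here's a tiny, purely illustrative demonstration of that behavior on a toy loss (the squared distance between a single parameter and its ideal value of 2.0); the specific numbers are made up for the demo:

```python
def run_gradient_descent(learning_rate, steps=20, start=0.1):
    """Gradient descent on the toy loss (parameter - 2.0)**2, whose slope is 2*(parameter - 2.0)."""
    parameter = start
    for _ in range(steps):
        gradient = 2 * (parameter - 2.0)
        parameter -= learning_rate * gradient
    return parameter

print(run_gradient_descent(0.001))  # too small: after 20 steps, barely moved toward 2.0
print(run_gradient_descent(0.3))    # reasonable: lands very close to 2.0
print(run_gradient_descent(1.5))    # too large: overshoots 2.0 and bounces further away each step
```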
Knowing When to Stop: How long does the optimizer keep adjusting the parameters? Common sense applies: it might stop when the loss isn't decreasing much anymore. Or sometimes you just stop after a predetermined number of steps or when the budget runs out. A more sophisticated approach is to monitor progress on a separate validation dataset (data similar to the training data, but not actually used to train the model). If performance on that set stops improving or gets worse, it's time to stop, even if the training loss is still dropping (this prevents the model from just memorizing the training examples).
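Here's a sketch of that validation-based stopping rule. The loss curves below are simulated numbers, not measurements from a real model; only the control flow matters:

```python
def simulated_losses(step):
    """Made-up curves: training loss keeps falling, while validation loss falls
    at first and then creeps back up once the model starts memorizing (overfitting)."""
    training_loss = 1.0 / (step + 1)
    validation_loss = 1.0 / (step + 1) + 0.002 * max(0, step - 50)
    return training_loss, validation_loss

best_validation_loss = float("inf")
steps_without_improvement = 0
patience = 5  # how many non-improving checks to tolerate before giving up

for step in range(10_000):
    # In real training: take one gradient descent step, then evaluate both losses.
    training_loss, validation_loss = simulated_losses(step)

    if validation_loss < best_validation_loss:
        best_validation_loss = validation_loss
        steps_without_improvement = 0
    else:
        steps_without_improvement += 1

    if steps_without_improvement >= patience:
        print(f"Stopping at step {step}: validation loss hasn't improved, "
              f"even though training loss ({training_loss:.4f}) is still falling.")
        break
```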
The Local Minima Problem: Gradient descent can also get stuck in a “local minimum”. In our “going downhill” example, this means we end up in a small dip and think we’ve reached the bottom of the valley. Modern techniques and optimizers often have clever ways to mitigate this, but it's an important limitation to be aware of.
Real-life Implications
Having an understanding of how models are trained not only gives you a talking point to impress your family at the next Thanksgiving dinner, but it also directly explains some real-world characteristics (and headaches) of modern AI.
Frontier Models are VERY Expensive to Train
Remember the more than 10²⁵ calculations needed to train models like GPT-4o and Grok 3, as mentioned in the last post? Now that we've seen how training works, we can see what drives that number: billions of parameters requiring adjustment; training datasets approaching the scale of the public internet; and the inherently iterative nature of gradient descent. There is also the non-trivial cost of failed training runs. Thus, training frontier AI models is an extremely expensive endeavor reserved for companies with very, very deep pockets.
Model Biases
As we've discussed, an optimizer's sole mandate is to minimize loss on the specific training data. But here’s the crucial catch: what guarantees the quality or objectivity of that data? Is there a committee or some sort of public oversight program ensuring the training data is truthful and unbiased? Unfortunately, the answer is no.
As we saw earlier, the model learns exactly what the data teaches it. This extends beyond simple mislabeling. If the training data - often scraped from the internet and refined by human labelers - reflects human biases, factual inaccuracies, or just general internet weirdness, the model will learn to reproduce these patterns.
Neither the internet nor human labelers are infallible sources of objective truth (surprise surprise). As a result, it's a mistake to treat AI outputs as unbiased fact. A healthy dose of skepticism towards AI-generated content isn't just recommended; it should be a requirement for sensible usage.
This reliance on data becomes particularly salient when you consider the potential for intentional societal influence. Whoever controls the data pipeline – selecting datasets, filtering content, directing human labelers, etc. – gains significant leverage. Consequently, for corporations seeking profit or governments seeking power, this presents an incredibly powerful tool. As AI becomes more and more ubiquitous, the ability to deliberately steer the training data effectively enables them to control the 'facts' perceived by the users.

What's Next
We've journeyed from the basic components of AI models through the core learning process. With this foundation laid, our next step is to zoom in on the type of model that has truly captured the world's attention and driven the recent AI explosion: Large Language Models (LLMs). We'll explore what makes these models unique, how they handle text, and the architectures that enable the abilities seen in systems like ChatGPT.
Stay tuned as we shift our focus from the general principles of AI to the specific mechanics behind AI that understands and generates human language.
This is just the beginning of our journey into the heart of AI. To keep unlocking these complex ideas without the headache of equations, subscribe for free and get future posts delivered directly.
If you've found this to be a valuable educational resource, please consider sharing it with friends and colleagues who might also benefit - it’s the best way to support this newsletter!