<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Descending Notebooks]]></title><description><![CDATA[Learn machine learning concepts with code examples]]></description><link>https://descendingnotebooks.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 10:59:14 GMT</lastBuildDate><atom:link href="https://descendingnotebooks.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Adam Optimizer]]></title><description><![CDATA[You’ve probably used Adam as your go to optimizer. But do you know why it works? In this article, we’ll unpack the Adam optimizer introduced in this paper. This post is for anyone who wants to deeply understand what is happening when they use Adam.
A...]]></description><link>https://descendingnotebooks.com/adam-optimizer</link><guid isPermaLink="true">https://descendingnotebooks.com/adam-optimizer</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[adam optimisation]]></category><category><![CDATA[Gradient-Descent ]]></category><category><![CDATA[pytorch]]></category><dc:creator><![CDATA[Jessen]]></dc:creator><pubDate>Mon, 14 Jul 2025 19:53:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752522763545/59c110fe-c8fe-423a-bb77-002fcf36cf7e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You’ve probably used Adam as your go-to optimizer. But do you know why it works? In this article, we’ll unpack the Adam optimizer introduced in this <a target="_blank" href="https://arxiv.org/abs/1412.6980">paper</a>. This post is for anyone who wants to deeply understand what is happening when they use Adam.</p>
<p>Adam is an optimizer that helps us efficiently converge on a set of parameters for a stochastic function that we want to minimize. In machine learning, we often use Adam to find the model parameters that minimize a loss function, where the loss is stochastic because it is evaluated on random mini-batches of data.</p>
<h2 id="heading-the-motivation">The motivation</h2>
<p>We already have many optimizers, so it is fair to ask why we need another. Here we will outline the core problems Adam addresses and the consequences of leaving them unsolved.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">❗</div>
<div data-node-type="callout-text"><strong>Problem 1: </strong>One Global Learning Rate (Across Parameters)</div>
</div>

<p>Classic optimizers such as Stochastic Gradient Descent (SGD) use a single learning rate. When a model has many parameters, the gradient of one parameter at a given point in time can differ vastly from the others, yet with one global learning rate every parameter’s update is scaled by the same step size regardless of its gradient. The gradient itself still scales the update, since it is multiplied by the learning rate, but the global rate must be small enough to suit the most sensitive parameters. This places an upper limit on how large it can be and holds back parameters that could safely receive larger, more confident updates.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">❗</div>
<div data-node-type="callout-text"><strong>Problem 2: </strong>Fixed Learning Rate Within a Parameter Over Time</div>
</div>

<p>Problem 1 highlights the desire for per-parameter learning rates, so that each parameter can update more confidently or more cautiously depending on its gradient at a point in time. A related issue arises over time for a single parameter. Its gradient changes over the training run, as expected, and we want to take large steps when the gradient is large and smaller steps when approaching a minimum. Per-parameter learning rates let the step size vary across the model, but a fixed rate does not vary over time within one parameter, and we still face an upper bound on each individual learning rate.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">❗</div>
<div data-node-type="callout-text"><strong>Problem 3: </strong>Manual Learning Rate Tuning</div>
</div>

<p>As previously noted, with optimizers like SGD we have to choose a learning rate, and its upper bound is dictated by the parameters that need the smallest steps. We typically select the learning rate by trial and error, grid search, or a learning rate schedule (a predetermined plan for adjusting the rate over time). All of these require manual effort and careful thought.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">❗</div>
<div data-node-type="callout-text"><strong>Problem 4: </strong>Small batches can be noisy</div>
</div>

<p>Due to memory constraints, we often cannot load an entire dataset at once and instead train on batches of data, as seen in SGD. However, the gradient computed from any single batch can be a noisy outlier rather than representative of the gradient over the full dataset.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">❗</div>
<div data-node-type="callout-text"><strong>Problem 5: </strong>Sparse gradients</div>
</div>

<p>Optimizers like Adagrad help with sparse gradients by retaining a high learning rate for infrequently updated parameters. We often encounter sparse gradients in NLP tasks, where rare words appear infrequently yet should still receive meaningful updates when they do.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">❗</div>
<div data-node-type="callout-text"><strong>Problem 6: </strong>Lack of momentum in flat regions</div>
</div>

<p>SGD slows down in flat regions: gradients are small, so updates are small, precisely when we might want to speed up and get through the region. Other optimizers adjust for this, such as SGD with momentum, which accumulates past gradients. Once we enter a flat region, the accumulated momentum of past gradients carries us through, rather than the update relying on the tiny current gradient alone.</p>
<h2 id="heading-step-by-step-inside-the-adam-optimizer">Step by step: inside the Adam optimizer</h2>
<p>Now that we've seen the key challenges, let's walk through how Adam actually works step by step using the original algorithm from the paper.</p>
<p>$$\begin{alignedat}{2} (1) \quad &amp; \textbf{Require: } \alpha \text{ (Stepsize)} \\ (2) \quad &amp; \textbf{Require: } \beta_1, \beta_2 \in [0, 1) \text{ (Exponential decay rates)} \\ (3) \quad &amp; \textbf{Require: } f(\theta) \text{ (Stochastic objective function)} \\ (4) \quad &amp; \textbf{Require: } \theta_0 \text{ (Initial parameter vector)} \\ (5) \quad &amp; m_0 \leftarrow 0 \quad \text{(Initialize 1st moment vector)} \\ (6) \quad &amp; v_0 \leftarrow 0 \quad \text{(Initialize 2nd moment vector)} \\ (7) \quad &amp; t \leftarrow 0 \quad \text{(Initialize timestep)} \\ (8) \quad &amp; \textbf{while } \theta_t \text{ not converged do} \\ (9) \quad &amp; \quad t \leftarrow t + 1 \\ (10) \quad &amp; \quad g_t \leftarrow \nabla_\theta f_t(\theta_{t-1}) \quad \text{(Compute gradients)} \\ (11) \quad &amp; \quad m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t \quad \text{(Update biased 1st moment)} \\ (12) \quad &amp; \quad v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 \quad \text{(Update biased 2nd moment)} \\ (13) \quad &amp; \quad \hat{m}_t \leftarrow m_t / (1 - \beta_1^t) \quad \text{(Bias-corrected 1st moment)} \\ (14) \quad &amp; \quad \hat{v}_t \leftarrow v_t / (1 - \beta_2^t) \quad \text{(Bias-corrected 2nd moment)} \\ (15) \quad &amp; \quad \theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) \quad \text{(Update parameters)} \\ (16) \quad &amp; \textbf{end while} \\ (17) \quad &amp; \textbf{return } \theta_t \quad \text{(Resulting parameters)} \end{alignedat}$$</p><h3 id="heading-initialization">Initialization</h3>
<p><strong>Lines 1-2</strong> simply set the hyperparameters of the algorithm. α is a normal learning rate as seen in SGD, and β1 and β2 are both values between 0 and 1. β1 controls how much past gradient information is retained at each update step, creating a weighted average; e.g. if β1 is set to 0.9, we retain 90% of the previous moment estimate and blend in 10% of the current gradient. β2 is a similar parameter, but it governs an accumulated value based on squared gradients that we will see later, along with a deeper explanation of the weighted average of the gradient.</p>
<p><strong>Line 3</strong> sets up the stochastic function we want to optimize, e.g. a loss such as mean squared error.</p>
<p><strong>Line 4</strong> initializes the parameter vector, just like other optimizers.</p>
<p><strong>Line 5</strong> creates a vector to store the first moment (the mean) of every parameter’s gradient, so the vector has the same length as the number of parameters we need to optimize. Each value is initialized to zero. When these values are updated, β1 scales how much of the previous running mean is retained versus the incoming gradient.</p>
<p><strong>Line 6</strong> creates a vector similar to the one in line 5, also initialized to zero, but it will store the second moment of each parameter as the algorithm progresses. The second moment is the running average of the squared gradient, which reflects the average (squared) magnitude of the gradient over time and helps assess the stability or noisiness of the gradient signal. As β1 is used to scale the first moment, β2 is used with this vector to scale the second moment at each update.</p>
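<p>As a quick sketch of these initialization steps (the parameter shapes here are made up for illustration), every parameter tensor gets a matching zero-filled moment buffer:</p>

```python
import torch

# Hypothetical parameters of a tiny model (shapes are illustrative only)
params = [torch.randn(4, 3, requires_grad=True), torch.randn(3, requires_grad=True)]

m = [torch.zeros_like(p) for p in params]  # first moment buffers (line 5)
v = [torch.zeros_like(p) for p in params]  # second moment buffers (line 6)

print([tuple(b.shape) for b in m])  # [(4, 3), (3,)] — one buffer per parameter tensor
```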
<h3 id="heading-training-loop">Training loop</h3>
<p><strong>Lines 7-9</strong> set up and start our training loop, incrementing the time step <strong><em>t</em></strong> on each iteration.</p>
<p><strong>Line 10</strong> computes the gradients of the loss function with respect to the model parameters θ at the previous time step <strong>t-1</strong>, storing them in the variable g.</p>
<h3 id="heading-moment-updates">Moment updates</h3>
<p><strong>Line 11</strong> updates our vector of first moments. We initially set this to zero, so for the first update we will be blending in 1-β1 of each parameter’s current gradient.</p>
<p>e.g. Given t=1, β1=0.9 and the gradient of the current parameter is 1.5.</p>
<p>We will keep 90% (when β1=0.9) of the previous gradient, and blend in 10% (1-β1=0.1) of our current gradient (1.5) resulting in 0.15 being stored in our vector for this parameter.</p>
<p>On the next loop, at t=2, the gradient of the current parameter is now 1.4. We keep 90% of the stored first moment (0.9 × 0.15 = 0.135) and blend in 10% of the current gradient (0.1 × 1.4 = 0.14). The result is 0.135 + 0.14 = 0.275.</p>
<p>This process of blending in previous gradients, rather than taking only the current gradient, builds an average gradient value and provides a signal of whether we are on a consistent gradient rather than a fluctuating one that might require us to back off from large updates.</p>
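<p>The worked example above can be reproduced in a few lines of Python:</p>

```python
beta1 = 0.9
m = 0.0  # first moment, initialized to zero (line 5)

# t = 1: gradient of the current parameter is 1.5
m = beta1 * m + (1 - beta1) * 1.5
print(round(m, 3))  # 0.15

# t = 2: gradient is now 1.4
m = beta1 * m + (1 - beta1) * 1.4
print(round(m, 3))  # 0.275
```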
<p><strong>Line 12</strong> uses a similar mechanism to line 11, but it stores the second moment: an exponential moving average of the squared gradients. Since squared gradients are always non-negative, the second moment keeps accumulating regardless of the gradient’s direction, acting as a penalty term. The first moment does not detect oscillating gradients well; as they flip around a local minimum, the negative and positive gradients cancel each other out. The second moment, however, keeps increasing as the gradient oscillates around a local minimum.</p>
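<p>A small simulation (with a made-up gradient that flips sign every step) shows the two moments diverging on an oscillating surface: the first moment cancels toward zero while the second moment accumulates, so the ratio used in the line 15 update becomes tiny:</p>

```python
beta1, beta2 = 0.9, 0.999
m = v = 0.0

for t in range(1, 101):
    g = 2.0 if t % 2 else -2.0          # gradient flips sign every step
    m = beta1 * m + (1 - beta1) * g     # first moment (line 11)
    v = beta2 * v + (1 - beta2) * g**2  # second moment (line 12)

m_hat = m / (1 - beta1 ** 100)  # bias correction (line 13)
v_hat = v / (1 - beta2 ** 100)  # bias correction (line 14)

# The update ratio is far below 1, so Adam takes only a fraction of the step size
print(abs(m_hat) / v_hat ** 0.5)
```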
<p><strong>Line 13-14</strong> are bias correction steps for the first and second moments. Since we initialize both moment vectors to zero, early values are biased toward zero, even if the true gradients are not.</p>
<p>To correct for this, we divide by a factor that compensates for how little history we've accumulated so far. In the first few steps, this correction has a large effect, later on it fades away as the moment estimates become more accurate on their own.</p>
<p>For example, at time step t = 1, the correction for the first moment is:</p>
<p>$$\frac{m_t}{1-\beta_1^1}$$</p><p>And at t = 10:</p>
<p>$$\frac{m_t}{1-\beta_1^{10}}$$</p><p>Since β1 is a number less than 1 (e.g., 0.9), raising it to higher powers brings it closer to 0, so the denominator grows closer to 1 over time. That means in early steps we divide by a small number (amplifying the estimate), and later we divide by something close to 1 (leaving it mostly unchanged).</p>
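<p>A quick check of this denominator for β1 = 0.9 shows how fast the correction fades:</p>

```python
beta1 = 0.9

for t in [1, 2, 10, 100]:
    correction = 1 - beta1 ** t  # the bias-correction denominator
    print(t, round(correction, 4))
# 1 0.1     -> early estimates are divided by 0.1 (amplified 10x)
# 10 0.6513 -> moderate amplification
# 100 1.0   -> the correction has essentially vanished
```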
<p><strong>Line 15</strong> is where we update our parameters. As in SGD, we scale by a learning rate (called the step size here), but with Adam the step is also scaled by the first and second moments. We can think of the first moment as our signal and the second moment as our noise: when we have a high signal-to-noise ratio, we are confident and can take a large step.</p>
<p>e.g. If our first moment = 5, second moment = 25</p>
<p>$$\frac{5}{\sqrt{25}}=1$$</p><p>This results in 1, which, when multiplied by our step size returns the full step size, therefore we take a large parameter update.</p>
<p>If our second moment is high compared to the first moment, this is a signal that we are not confident in the average gradient reported by the first moment. This could happen for a few reasons, such as being on an oscillating surface. If we frequently flip between negative and positive gradients, the second moment captures all of them as positive values and keeps building up, rather than negative values canceling out previous positive values as in the first moment.</p>
<p>e.g. If our first moment = 5, second moment = 100</p>
<p>$$\frac{5}{\sqrt{100}}=0.5$$</p><p>This halves our step size, signalling a lack of confidence, so we take smaller steps.</p>
<p>ε is a small constant added to prevent division by zero.</p>
<h3 id="heading-algorithm-overview">Algorithm Overview</h3>
<p>Zooming out from the pseudocode, we can see how the various steps help to address the problems listed earlier. First we track the first and second moment for every parameter we want to train. Whilst this does use more memory than one global learning rate, it should result in better use of our resources as we converge faster.</p>
<p>Because the first moment is an average of gradients, we become resistant to sudden gradient spikes, smoothing out convergence. On a stable downward path toward a local minimum, the gradients are initially large, and the first moment retains a high value. Setting the second moment aside, this large first moment does not scale the step size back much, signalling we are confident and can take these large steps. As we get closer to the local minimum, the first moment starts converging toward zero, which scales back our step size. This prevents overshooting the local minimum and oscillating around it.</p>
<p>However, not all regions of the loss surface are smooth, and in more chaotic areas the second moment plays a larger role in stabilizing updates. If the path repeatedly dips into small troughs, the gradient flips between negative and positive values. The first moment smooths these out, but large swings can still lead it to scale the step size erratically. The second moment, by contrast, grows regardless of the gradient’s sign, because squaring always yields a non-negative result for any real number. As the second moment grows, dividing by a larger value produces a smaller scaling factor, and the algorithm becomes more conservative due to low confidence.</p>
<p>Ultimately, Adam determines how much to update a parameter by looking at the ratio between the first moment (our directional signal) and the square root of the second moment (our measure of noise or instability). Adam adapts its learning rate dynamically, growing cautious in noisy or uncertain regions and moving decisively when gradients are stable.</p>
<p>Here’s how Adam addresses the challenges introduced earlier:</p>
<ul>
<li><p><strong>One global learning rate</strong><br />  Adam maintains per-parameter learning rates using first and second moment estimates, allowing different update magnitudes for each parameter.</p>
</li>
<li><p><strong>Fixed learning rate over time</strong><br />  Moment estimates adapt over time, allowing the step sizes to shrink or grow as gradients evolve.</p>
</li>
<li><p><strong>Manual learning rate tuning</strong><br />  Step sizes are adjusted automatically, often reducing the need for manual learning rate schedules or tuning.</p>
</li>
<li><p><strong>Small batches are noisy</strong><br />  Exponential moving averages smooth gradient estimates, helping Adam stay stable even with noisy mini-batch gradients.</p>
</li>
<li><p><strong>Sparse gradients</strong><br />  Like Adagrad, Adam reduces updates for frequently active parameters, while allowing relatively larger updates for rarely used ones, making it well-suited for sparse gradients.</p>
</li>
<li><p><strong>Flat regions and momentum</strong><br />  The first moment acts like momentum, helping push through flat or ambiguous regions of the loss surface.</p>
</li>
</ul>
<h3 id="heading-summary">Summary</h3>
<p>Optimizers preceding Adam used parts of the concepts it brings together. For example, SGD was extended with momentum, which averages gradients over time. This helps reduce oscillation and allows the optimizer to follow a more stable path, which is conceptually similar to Adam’s first moment estimate.</p>
<p>Adagrad introduced per-parameter learning rates by accumulating the square of past gradients. This allows large updates for infrequently updated parameters and smaller updates for frequently updated ones. However, because the accumulation grows without decay, the learning rate can become excessively small later in training.</p>
<p>RMSProp improved on this by using an exponential moving average of squared gradients instead of a cumulative sum. This enabled the learning rate to adapt more flexibly to recent gradient behavior rather than shrinking predictably over time.</p>
<p>Adam combines these ideas. It uses momentum-like first moment estimates and RMSProp-style second moment estimates to scale the step size based on both direction and the reliability of the gradients. Adam also introduces bias correction, which improves early training by compensating for the initial zero values in the moving averages.</p>
<p>Adam is widely used because of its robustness and adaptability. It often converges faster than optimizers like SGD. However, it does not always generalize as well. Because Adam closely follows the gradient signal for each parameter, especially in models with many parameters, it can overfit or settle into sharp minima. In contrast, SGD with momentum tends to average out gradient noise more effectively, which helps it find flatter minima that often lead to better generalization.</p>
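<p>For everyday use, PyTorch’s built-in implementation exposes exactly these hyperparameters; a minimal training step might look like this (the model and data are toy placeholders, and the values shown are the defaults from the paper):</p>

```python
import torch

model = torch.nn.Linear(10, 1)  # toy model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

# One standard training step on a random batch
x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()   # compute gradients (line 10)
optimizer.step()  # moment updates and parameter update (lines 11-15)
```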
<p>Below is a PyTorch implementation of Adam's core logic.</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="9344a3c96a22b7722485cde4441eecf1"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/jessenr/9344a3c96a22b7722485cde4441eecf1" class="embed-card">https://gist.github.com/jessenr/9344a3c96a22b7722485cde4441eecf1</a></div>]]></content:encoded></item><item><title><![CDATA[Positional Encoding from Sinusoidal to RoPE]]></title><description><![CDATA[Transformers process the tokens of a text input in parallel, but unlike sequential models they do not understand position and see the input as a set of tokens. However when we calculate attention for a sentence, words that are the same but in differe...]]></description><link>https://descendingnotebooks.com/positional-encoding-from-sinusoidal-to-rope</link><guid isPermaLink="true">https://descendingnotebooks.com/positional-encoding-from-sinusoidal-to-rope</guid><category><![CDATA[sinusoidal]]></category><category><![CDATA[positional encoding]]></category><category><![CDATA[rope]]></category><category><![CDATA[transformers]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Jessen]]></dc:creator><pubDate>Tue, 01 Apr 2025 20:06:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743363511472/43c47d68-cc50-4047-bfd9-31e5255d8f33.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Transformers process the tokens of a text input in parallel, but unlike sequential models they do not understand position and see the input as a set of tokens. However, when we calculate attention for a sentence, words that are the same but in different positions do receive different attention scores. If attention is a calculation between two embeddings, how can the same word, i.e. the same embedding, receive different scores when it is in a different position?
It comes down to positional encoding, but before we get into how positional encoding works, let’s run a test.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer, AutoModel
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns

model_name = <span class="hljs-string">"bert-base-uncased"</span>

<span class="hljs-comment"># define sentence</span>
sentence = <span class="hljs-string">"The brown dog chased the black dog"</span>

<span class="hljs-comment"># define tokenizer</span>
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer(sentence, return_tensors=<span class="hljs-string">"pt"</span>)[<span class="hljs-string">"input_ids"</span>]

<span class="hljs-comment"># define token embeddings</span>
model = AutoModel.from_pretrained(model_name)
token_embeddings = model.get_input_embeddings()(tokens)
embed_dim = token_embeddings.shape[<span class="hljs-number">-1</span>]

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SimpleAttention</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, embed_dim</span>):</span>
        super().__init__()
        self.w_query = nn.Linear(embed_dim, embed_dim)
        self.w_key = nn.Linear(embed_dim, embed_dim)
        self.w_value = nn.Linear(embed_dim, embed_dim)

        <span class="hljs-comment"># Initialize weights</span>
        nn.init.normal_(self.w_query.weight, mean=<span class="hljs-number">0.0</span>, std=<span class="hljs-number">0.8</span>)
        nn.init.normal_(self.w_key.weight, mean=<span class="hljs-number">0.0</span>, std=<span class="hljs-number">0.8</span>)
        nn.init.normal_(self.w_value.weight, mean=<span class="hljs-number">0.0</span>, std=<span class="hljs-number">0.8</span>)

        self.attention = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=<span class="hljs-number">1</span>, batch_first=<span class="hljs-literal">True</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>
        output, attention_weights = self.attention(self.w_query(x), self.w_key(x), self.w_value(x))
        <span class="hljs-keyword">return</span> output, attention_weights

<span class="hljs-comment"># Compute attention        </span>
attention_layer = SimpleAttention(embed_dim=embed_dim)
attention_output, attention_weights = attention_layer(token_embeddings)

<span class="hljs-comment"># Convert attention weights to numpy array and remove extra dimensions</span>
attention_matrix = attention_weights.squeeze().detach().numpy()

<span class="hljs-comment"># Get token labels for the axes</span>
tokens_text = tokenizer.convert_ids_to_tokens(tokens[<span class="hljs-number">0</span>])

<span class="hljs-comment"># Create heatmap</span>
plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">8</span>))
sns.heatmap(attention_matrix, 
            xticklabels=tokens_text,
            yticklabels=tokens_text,
            cmap=<span class="hljs-string">'YlOrRd'</span>,
            annot=<span class="hljs-literal">True</span>,
            fmt=<span class="hljs-string">'.2f'</span>)

plt.title(<span class="hljs-string">'Attention Weights Heatmap'</span>)
plt.xlabel(<span class="hljs-string">'Key Tokens'</span>)
plt.ylabel(<span class="hljs-string">'Query Tokens'</span>)
plt.tight_layout()
plt.show()
</code></pre>
<p>The code above creates a simple attention layer without positional encoding. This lets our test demonstrate how attention behaves when all tokens are treated purely on their semantic embedding, without any positional differentiation. The layer is initialised with random weights and has not undergone training; however, training shouldn’t matter, since identical tokens have the same weights applied to them and produce the same output.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741723814330/d5e555ff-dcf9-4a2a-9305-1aa11be95867.png" alt class="image--center mx-auto" /></p>
<p>The attention heat map above is the output of our code. I’ve highlighted the two dog tokens to demonstrate how they receive equivalent attention weights against all other keys. For attention heads to specialize, e.g. noticing verb-object pairs, they must be able to differentiate between the same word in different positions; otherwise, the same weights applied to the same token embedding will always produce the same result.</p>
<h2 id="heading-what-do-positional-encodings-enable">What do positional encodings enable</h2>
<p>In transformers we differentiate words at different positions with positional embeddings. This transformation perturbs the embedding based on the position of the word. For our example sentence in the code above, dog at position 3 would have a slightly different embedding from dog at position 7 after positional embeddings have been applied, resulting in a different attention result for each token. This allows attention heads to see different positions and to specialize, e.g. an attention head that has specialized in recognising verb-object pairs would have high attention between chased and the second dog, but not the first dog. However, if each dog did not have its embedding slightly altered, the specialized attention head could never notice the difference between the two dogs.</p>
<h2 id="heading-positional-encodings-step-by-step">Positional encodings step by step</h2>
<p>Before we start to think about how positional encodings might be implemented, let’s come up with some requirements.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Requirement 1 : </strong>Positional Encodings should alter the existing embeddings.</div>
</div>

<p>This is instead of passing in an additional feature, which would mean our neural network has to interpret extra information and spend extra computation. By combining position with the existing embedding, we need a way for the network to see a different embedding when the same word appears at different positions.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Requirement 2 : </strong>Positional Encodings must uniquely identify every word in a sequence</div>
</div>

<p>As shown earlier, each word needs to be uniquely identified even if the same word appears multiple times in a text sequence.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Requirement 3 : </strong>We must be able to generalise over positional encoding patterns</div>
</div>

<p>We want to create models that can generalise, so our positional encoding should form a pattern that can be recognised and accounted for in our attention matrices. We could assign random IDs to each position, and in theory this is learnable; however, we would use more learnable parameters than necessary, since the network would need to memorise each individual ID and the positional information it carries. We would essentially be forcing the transformer to memorise combinations of position and semantic information rather than allowing it to learn reusable patterns.</p>
<h3 id="heading-add-the-position-as-an-integer">Add the position as an integer</h3>
<p>An obvious thing to try is to simply add the position of the token to the embedding, e.g. if dog were my first word and its embedding was [0.10, 0.80, 0.45], adding 1 would give [1.10, 1.80, 1.45]. For the second word we would add 2, and so on.</p>
<p>However, there are a few problems with this. First, I could have many words in my input; say I reach the 100th word, adding 100 to each dimension of that token’s embedding creates very large numbers. In neural networks we want to keep inputs on a similar scale to allow for faster convergence, but with this implementation embedding values grow as we add more words to our input.</p>
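<p>A quick check (with made-up embedding values) shows how fast integer offsets swamp the embedding:</p>

```python
embedding = [0.10, 0.80, 0.45]  # hypothetical token embedding

for position in [1, 10, 100]:
    encoded = [round(x + position, 2) for x in embedding]
    print(position, encoded)
# By position 100 the original values are barely visible:
# 100 [100.1, 100.8, 100.45]
```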
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Requirement 4: </strong>Positional embeddings should be bounded</div>
</div>

<p>This requirement ensures that embeddings stay within reasonable limits and do not grow too large. Another problem presented by our growing integers is that the network will only be trained on the longest length it has seen, and therefore cannot generalise beyond it. Ideally we want our network to work at lengths it has not seen.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Requirement 5: </strong>Positional embeddings should work for any length, even for lengths not seen in training.</div>
</div>

<p>Lastly, the positional part of our embedding seems to greatly outweigh the original embedding. It might be hard to see what the original input was; in fact it seems to completely change the embedding, representing position far more strongly than semantic information.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742326183185/f55a5425-f976-45e7-a620-bc0d6521f420.png" alt class="image--center mx-auto" /></p>
<p>The graph above shows the embeddings of a random list of words, without positional information, in a reduced dimensional space. We can see clear semantic meaning, with words around the same concept clustering together.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742326327994/de21c116-2d4d-4bbc-995a-763c760dd649.png" alt class="image--center mx-auto" /></p>
<p>The graph above shows the same embeddings with integer positions added based on where each word appeared in the text sequence. They no longer cluster as before and appear to have different semantic meanings; they are dominated by an increasing positional component. Adding the position as an integer is therefore probably not going to work for us, and this leads us to a new requirement.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Requirement 6 : </strong>Embeddings must retain their semantic meaning</div>
</div>

<h3 id="heading-sinusoidal-encoding">Sinusoidal Encoding</h3>
<p>Based on our requirements so far, we’ve now come to the solution presented in the <a target="_blank" href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a> paper. However let’s build up our understanding of how this works.</p>
<p>To help meet our requirements, let’s use a sine function, where the positional encoding is sin(x), with x being our token position and the result added to our embedding element-wise.</p>
<p>For example, if our embedding was [0.1, 0.6, 0.8], the resulting embedding with positional information would be [0.1 + sin(0), 0.6 + sin(1), 0.8 + sin(2)].</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742240999829/37d7ba45-165d-4a95-9895-2ae0417a4d66.png" alt class="image--center mx-auto" /></p>
<p>If we take the x-axis to be our token position, we can see that the numbers produced are small, helping to retain the embedding’s semantic information when added element-wise. Requirement-wise it almost meets what we want: it can handle any sequence length, as sine can be calculated for any x value; it is bounded between -1 and 1; and it isn’t large enough to alter semantic information. However, it most likely does not meet requirement 2: positional encodings must uniquely identify every word in a sequence.</p>
<p>As we can see, sin(x) is periodic and will eventually repeat for some token positions, e.g. sin(2) and sin(353) return almost exactly the same value. For positions that return the same value, our attention mechanism will see them as the same position, and if they also contain the same word, we have the same problem as having no positional encoding at all.</p>
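<p>We can check this collision numerically:</p>

```python
import math

# sin is periodic with period 2*pi, so distant positions can collide:
# position 353 lands almost exactly on a point where sin equals sin(2)
print(round(math.sin(2), 4))    # 0.9093
print(round(math.sin(353), 4))  # 0.9093
```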
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742328342555/eb7c4018-8960-4bf2-8c96-e71e8ea041ee.png" alt class="image--center mx-auto" /></p>
<p>To help mitigate this we introduce cosine i.e. cos(x) as shown above and for each even index of our embedding we use sine and for odd indices we use cosine.</p>
<p>For example, if our embedding was [0.1, 0.6, 0.8], the resulting embedding with positional information would be [0.1 + sin(0), 0.6 + cos(1), 0.8 + sin(2)].</p>
<p>Using cosine introduces a phase shift, which greatly reduces the likelihood of two positions receiving the same value, or even values near each other. However, there is still a chance of repeats at longer sequence lengths, and with such a regular, high-frequency pattern it is hard to build relative patterns across long distances. This means we might not be meeting requirement 3: we must be able to generalise over positional encoding patterns. With this short wavelength it will be hard to generalise long-distance patterns.</p>
<p>We can improve on both generalising over long-range patterns and reducing the chance of results collapsing into the same position by introducing lower frequencies, i.e. we can create a general function which is <strong>sin(x/i)</strong> for even indices and <strong>cos(x/i)</strong> for odd indices, where <strong>i</strong> is the index of our embedding.</p>
<p>For example, if our embedding was [0.1, 0.6, 0.8, 0.9] and the token was at position 1, the resulting embedding with positional information would be [0.1 + sin(1/1), 0.6 + cos(1/2), 0.8 + sin(1/3), 0.9 + cos(1/4)].</p>
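<p>A minimal sketch of this per-index scheme (the <code>add_position</code> helper name is my own, and I use 1-based indices to avoid dividing by zero):</p>

```python
import math

def add_position(embedding, pos):
    """Add sin(pos/i) to even indices and cos(pos/i) to odd indices,
    where i is the 1-based index of the embedding dimension."""
    out = []
    for idx, value in enumerate(embedding):
        i = idx + 1  # 1-based index so we never divide by zero
        if idx % 2 == 0:
            out.append(value + math.sin(pos / i))
        else:
            out.append(value + math.cos(pos / i))
    return out

# Token at position 1:
print(add_position([0.1, 0.6, 0.8, 0.9], pos=1))
# i.e. [0.1 + sin(1/1), 0.6 + cos(1/2), 0.8 + sin(1/3), 0.9 + cos(1/4)]
```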
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742329614922/afe61e71-be4a-4d48-a19e-94fc1f4bfa0a.png" alt class="image--center mx-auto" /></p>
<p>The graph has been expanded on the x-axis compared to previous graphs, to show how dividing by the i-th index decreases the frequency. We do this for all dimensions of the embedding; if our embedding size were 768, we would have 768 individual values, with the frequency decreasing as we move towards the end of the embedding.</p>
<p>The decreased frequency can be used to generalise over long distances, and the extra dimensions highlight short, medium and long distances, allowing our attention matrices to learn parameters that discard certain ranges and focus on what each attention head needs.</p>
<p>With our current solution, since we only scale each embedding index linearly by increasing i, we could have many dimensions that don’t convey enough distinct information and therefore carry redundant positional information.</p>
<p>Our current formulas are:</p>
<p>$$PE(pos, 2i) = \sin\left(\frac{pos}{2i}\right)$$</p><p>$$PE(pos, 2i+1) = \cos\left(\frac{pos}{2i}\right)$$</p><p>Above, PE represents our positional encoding function, pos the position of the token we are encoding, and i our embedding index.</p>
<p>In the Attention Is All You Need paper, the formulas used are</p>
<p>$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$</p><p>$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$</p><p>Here, rather than dividing by i, we have i in the exponent of 10000, divided by d, which represents the embedding size. This lets the frequencies of the different dimensions scale geometrically rather than linearly, covering a wider range of frequencies and reducing redundant positional information. Using 10000 as the base with an exponent is what allows the frequencies to scale this way, and d controls the scaling; without d, the frequencies would change far too quickly and we might miss important positional information. The base of 10000 was chosen after experimentation in the Attention paper and provides a balance between scaling up and capturing enough information at different ranges. This wide range of frequencies allows the model to generalise better, and gives attention heads more options to focus on the ranges relevant to their specialisation.</p>
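<p>Putting the paper’s formulas into code, this is one common way the encoding is implemented (a sketch; the function name and shapes are my own choices, and d is assumed even):</p>

```python
import numpy as np

def sinusoidal_encoding(seq_len, d):
    """Sinusoidal positional encodings as in 'Attention Is All You Need'."""
    pe = np.zeros((seq_len, d))
    pos = np.arange(seq_len).reshape(-1, 1)     # token positions, column vector
    i = np.arange(0, d, 2)                      # even dimension indices 0, 2, 4, ...
    angle = pos / np.power(10000.0, i / d)      # pos / 10000^(2i/d)
    pe[:, 0::2] = np.sin(angle)                 # sine on even dimensions
    pe[:, 1::2] = np.cos(angle)                 # cosine on odd dimensions
    return pe

pe = sinusoidal_encoding(seq_len=50, d=16)
print(pe.shape)  # (50, 16)
```

Note the result is bounded in [-1, 1] (requirement 4) and defined for any position (requirement 5).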
<h2 id="heading-absolute-vs-relative-positional-encoding">Absolute vs Relative Positional Encoding</h2>
<p>So we’ve built up positional encoding in the same way it was designed in the Attention Is All You Need paper, and this method has been used in many models. However, it has since been improved upon, and newer transformer models use the improved schemes.</p>
<p>The problem with sinusoidal encoding is that it falls into a category known as absolute positional encoding. Absolute encoding adds a particular value or ID to each position: position 1 receives a certain value, position 2 another, etc. Take the sentences “A man threw a ball” and “In the garden a man threw a ball”. I have “man threw” at positions 2, 3 in the first sentence and at positions 5, 6 in the second. With sinusoidal encoding, “man” and “threw” receive different positional encodings in each sentence. However, in language, what creates meaning is less the absolute positions of related words than the relative position between them. If an attention head specialized in detecting subject and verb, it would have to learn the values for “man” and “threw” differently for the two sentences. By providing absolute values for each position, our learnt parameters have less chance to generalise. We do get some implicit relative positioning with sinusoidal encoding, due to the periodicity of the sine and cosine waves, but it is hard to extract, and models are not forced to use it when absolute positioning is available.</p>
<p>With relative positioning, the same sentences would receive values that highlight the relative positions between words rather than absolute values. This forces the model to use relative positions and helps it generalise more, a property we want for our NLP models. In the “man threw” example, our model could extract the fact that the words are next to each other, without having to memorise exact embedding-plus-positional values for different situations.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Requirement 7 : </strong>Positional encoding should model relative distances</div>
</div>

<h2 id="heading-rope-rotary-positional-embedding">RoPE (Rotary Positional Embedding)</h2>
<p>One way that has been used in many models to provide relative positional embeddings is RoPE.</p>
<p>Let’s imagine we have sentence 1 “The mouse ate some cheese” and sentence 2 “In the house a cat ate some cheese”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743278855230/a05e6275-bc36-4c20-a5c1-ae33db0daee7.png" alt class="image--center mx-auto" /></p>
<p>The graphs on the top represent simple 2D embeddings for each word in our sentences. The graphs below show each vector rotated based on its position, using the transformation pθ, where p is the word position and θ is 25°; e.g. “some” in the first sentence, at position 4, is rotated by 4θ, which is 100°. In the rotated embeddings, “ate” and “cheese”, which have the same relative position in each sentence, end up in different places due to their absolute positions, but the angle between them is the same because the relative distance is the same in each sentence. This is what RoPE does: it takes an embedding and rotates it by some θ based on its position.</p>
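<p>We can sketch this rotation numerically. The 2D word embeddings below are made up, but the check shows the key property: the absolute rotations differ, yet the angle between the rotated pair depends only on the relative distance between the words:</p>

```python
import numpy as np

def rotate(v, position, theta_deg=25.0):
    """Rotate a 2D embedding by position * theta (theta = 25 degrees here)."""
    a = np.radians(position * theta_deg)
    r = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return r @ v

def angle_between(u, v):
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

ate, cheese = np.array([0.9, 0.3]), np.array([0.2, 0.8])  # hypothetical embeddings

# Sentence 1: "ate" at position 2, "cheese" at position 4
# Sentence 2: "ate" at position 5, "cheese" at position 7
pair1 = rotate(ate, 2), rotate(cheese, 4)
pair2 = rotate(ate, 5), rotate(cheese, 7)

# Same relative distance (2 positions) -> same angle between the vectors
print(angle_between(*pair1))
print(angle_between(*pair2))  # prints the same angle as the line above
```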
<h3 id="heading-rope-and-attention-math">RoPE and Attention Math</h3>
<p>You can also see that every embedding has its magnitude maintained: no matter how far we rotate, the magnitude stays the same. We can see the importance of this by looking at how attention scores are calculated.</p>
<p>$$q⋅k=∥q∥∥k∥cos(θ)$$</p><p>We can see our attention scores rely on the norms of our vectors and the angle between them. Since we have only rotated our vectors, their norms stay the same; therefore any influence RoPE has on attention scores must come through angle changes. Including the RoPE angle difference, our attention score becomes the following, where d is the angle introduced by RoPE.</p>
<p>$$∥q∥∥k∥cos(θ+d)$$</p><p>This means that if an attention head wants q and k to be highly aligned but they are far apart, the model must learn to increase the magnitude of the original embeddings when projecting into q and k, or project them into a space where they are much closer, despite RoPE moving them far apart.</p>
<p>We’ve seen how rotating embeddings helps transformers identify tokens at different positions by providing another lever to modulate: the angle between tokens. To see how relative position can be picked up by the model, we need to break down our attention score using transposition.</p>
<p>$$RoPE(q,i)=R(i)q$$</p><p>The above formula defines a RoPE transformation, where the first parameter is our vector, the second is its position in the text sequence, and R is a rotation matrix. Our attention score is therefore:</p>
<p>$$RoPE(q,i).RoPE(k,j) = R(i)q.R(j)k$$</p><p>Rewriting the dot products as matrix products and applying the transpose product rule:</p>
<p>$$(R(i)q)^\top(R(j)k) = q^\top R(i)^\top R(j)k$$</p><p>Since R is a rotation matrix, its transpose is a rotation in the opposite direction, i.e. R(θ)^⊤ = R(−θ), so we can simplify to</p>
<p>$$q^\top R(j-i)k$$</p><p>where j and i are simply positions. The attention calculation has cleanly extracted the difference in positions, allowing it to use relative distance rather than having position intertwined with the semantic content of the embeddings. This lets the model modulate what it wants, such as direction or angle, to achieve the desired result. With sinusoidal encoding, the model had to learn to extract position from the semantic information.</p>
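<p>A quick numeric sanity check of this identity (θ and the positions below are arbitrary choices):</p>

```python
import numpy as np

def R(angle):
    """2D rotation matrix for a given angle in radians."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

theta = 0.25   # rotation per position (arbitrary for this check)
i, j = 3, 7    # query and key positions

lhs = R(i * theta).T @ R(j * theta)   # R(i)^T R(j)
rhs = R((j - i) * theta)              # R(j - i): only the offset survives
print(np.allclose(lhs, rhs))          # True
```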
<h3 id="heading-rope-frequencies">RoPE frequencies</h3>
<p>As with the sinusoidal method discussed earlier, you might be wondering whether the rotation eventually repeats, so that the same token at different positions appears at the same angle relative to another token. Like the sinusoidal method, RoPE introduces different frequencies across the embedding. So far we have only considered 2D embeddings. In RoPE, within one embedding we take the indices pairwise, e.g. indices 1 &amp; 2, indices 3 &amp; 4, etc., and rotate each pair by some θ depending on the token position and the index of the pair. As we move further along the embedding, the rotation becomes smaller and smaller, giving a lower frequency towards the end of the embedding, as in the sinusoidal method. This lets us create unique signatures for identical tokens in long text sequences and gives the model data to focus on what it needs, such as long-range or short-range dependencies.</p>
<p>To rotate each pair we can use a rotation matrix</p>
<p>$$\begin{bmatrix} \cos(m\theta_i) &amp; -\sin(m\theta_i) \\ \sin(m\theta_i) &amp; \cos(m\theta_i) \end{bmatrix}$$</p><p>m is our token position and i-th θ is defined as</p>
<p>$$\theta_i = \frac{1}{10000^{\frac{2i}{d}}}$$</p><p>This is similar to our sinusoidal formula and produces similar frequencies, with i being the index of the pair and θ the rotation angle defined for each pair in an embedding based on its index.</p>
<p>e.g. if our embedding was [0.1, 0.4, 0.5, 0.2], after splitting into pairs we have [[0.1, 0.4], [0.5, 0.2]]; our 1st pair is [0.1, 0.4] and our 2nd pair [0.5, 0.2]. We then calculate θ for the 1st and 2nd pairs using i = 0 and i = 1, matching θ₀ and θ₁ in the matrix below.</p>
<p>$$\begin{bmatrix} \cos(m\theta_0) &amp; -\sin(m\theta_0) &amp; 0 &amp; 0 &amp; \cdots &amp; 0 &amp; 0 \\ \sin(m\theta_0) &amp; \cos(m\theta_0) &amp; 0 &amp; 0 &amp; \cdots &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; \cos(m\theta_1) &amp; -\sin(m\theta_1) &amp; \cdots &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; \sin(m\theta_1) &amp; \cos(m\theta_1) &amp; \cdots &amp; 0 &amp; 0 \\ \vdots &amp; \vdots &amp; \vdots &amp; \vdots &amp; \ddots &amp; \vdots &amp; \vdots \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; \cdots &amp; \cos(m\theta_{d/2-1}) &amp; -\sin(m\theta_{d/2-1}) \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; \cdots &amp; \sin(m\theta_{d/2-1}) &amp; \cos(m\theta_{d/2-1}) \end{bmatrix}$$</p><p>Using the above sparse matrix we can rotate each pair of the embedding. We only calculate θ up to d/2 because we rotate in pairs, so we only need angles for half the embedding size.</p>
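<p>A direct, loop-based sketch of this pairwise rotation (the function name and test vector are my own):</p>

```python
import numpy as np

def rope_rotate(x, m):
    """Rotate consecutive pairs of x by m * theta_i, theta_i = 10000^(-2i/d)."""
    d = x.shape[0]                      # embedding size, assumed even
    out = np.empty_like(x, dtype=float)
    for i in range(d // 2):
        theta = 1.0 / (10000.0 ** (2 * i / d))
        c, s = np.cos(m * theta), np.sin(m * theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x1 * c - x2 * s    # same as applying the 2x2 rotation block
        out[2 * i + 1] = x1 * s + x2 * c
    return out

x = np.array([0.1, 0.4, 0.5, 0.2])
print(rope_rotate(x, m=3))  # rotated embedding for token position 3
```

Each pair keeps its norm under the rotation, so only angles carry the positional signal.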
<h3 id="heading-efficient-implementation-of-rope">Efficient Implementation of RoPE</h3>
<p>The sparse matrix can get quite large. Rather than holding it in memory and performing the accompanying matrix multiply, we can compute the same rotation more efficiently by splitting it into two element-wise products.</p>
<p>If we have a 2 dimensional embedding at position 1 rotating this would involve the calculation</p>
<p>$$\begin{bmatrix} cos(\theta) &amp; -sin(\theta)\\ sin(\theta) &amp; cos(\theta)\\ \end{bmatrix} \begin{bmatrix}x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} x_1cos(\theta)-x_2sin(\theta)\\ x_1sin(\theta) + x_2cos(\theta)\\ \end{bmatrix}$$</p><p>We can break this down into the following components.</p>
<p>$$\begin{bmatrix}x_1\\ x_2 \end{bmatrix} \otimes \begin{bmatrix} cos(\theta)\\ cos(\theta) \end{bmatrix} = \begin{bmatrix} x_1cos(\theta)\\ x_2cos(\theta) \end{bmatrix}$$</p><p>First we use element-wise multiplication to cover the cos part of our rotation matrix. This leaves −x₂sin(θ) on the top row and x₁sin(θ) on the bottom. Since the negated x₂ now sits on top, for every pair we swap the two elements and negate the first.</p>
<p>$$\begin{bmatrix}-x_2\\ x_1 \end{bmatrix} \otimes \begin{bmatrix} sin(\theta)\\ sin(\theta) \end{bmatrix} = \begin{bmatrix} -x_2sin(\theta)\\ x_1sin(\theta) \end{bmatrix}$$</p><p>We then simply add the two results together element-wise to achieve the same result as the sparse matrix rotation, with a method that doesn’t require constructing large matrices and is more easily vectorized.</p>
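<p>In code, this element-wise formulation might look like the following (a sketch, with my own function name and test vector; the final addition of the cos and sin products is the last line):</p>

```python
import numpy as np

def rope_fast(x, m):
    """Element-wise RoPE: x * cos + (swapped, negated pairs of x) * sin."""
    d = x.shape[0]                                       # assumed even
    theta = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))
    angles = np.repeat(m * theta, 2)                     # [mθ0, mθ0, mθ1, mθ1, ...]
    # Build [-x2, x1, -x4, x3, ...]: swap each pair and negate the first element
    flipped = x.reshape(-1, 2)[:, ::-1].copy()
    flipped[:, 0] *= -1
    flipped = flipped.reshape(-1)
    return x * np.cos(angles) + flipped * np.sin(angles)

x = np.array([0.1, 0.4, 0.5, 0.2])
print(rope_fast(x, m=3))
```

This vectorizes over the whole embedding at once instead of materialising the sparse rotation matrix.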
<p>$$\begin{bmatrix} x_1cos(\theta)\\ x_2cos(\theta) \end{bmatrix} + \begin{bmatrix} -x_2sin(\theta)\\ x_1sin(\theta) \end{bmatrix}$$</p><h2 id="heading-sources-amp-further-reading">Sources &amp; Further Reading</h2>
<p>- <a target="_blank" href="https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding">You could have designed state of the art Positional Encoding</a> – This post was heavily inspired by this original blog post.</p>
<p>- <a target="_blank" href="https://arxiv.org/abs/2104.09864">RoFormer: Enhanced Transformer with Rotary Position Embedding</a> – Introduces RoPE (Rotary Positional Embedding), a technique for modeling relative positions through rotation.</p>
<p>- <a target="_blank" href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a> – The original Transformer paper that introduced sinusoidal positional encodings and the self-attention mechanism.</p>
]]></content:encoded></item><item><title><![CDATA[Road to ChatGPT - Part 1: Understanding the Basics of Linear Regression]]></title><description><![CDATA[Yo, AI’s a mind that don’t ever sleep, stackin’ patterns and data, runnin’ deep.It don’t hustle like us, but it’s sharp on the grind, tech so smooth, it’ll blow your mind.

Have you ever wondered how ChatGPT can talk about AI in a Snoop Dogg like sty...]]></description><link>https://descendingnotebooks.com/road-to-chatgpt-part-1-understanding-the-basics-of-linear-regression</link><guid isPermaLink="true">https://descendingnotebooks.com/road-to-chatgpt-part-1-understanding-the-basics-of-linear-regression</guid><category><![CDATA[Linear Regression]]></category><category><![CDATA[chatgpt]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Primer]]></category><category><![CDATA[mean squared error]]></category><category><![CDATA[Machine Learning algorithm]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Jessen]]></dc:creator><pubDate>Sun, 05 Jan 2025 19:59:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734984788432/63978ec2-379a-45e3-b7a6-32055e09af43.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Yo, AI’s a mind that don’t ever sleep, stackin’ patterns and data, runnin’ deep.<br />It don’t hustle like us, but it’s sharp on the grind, tech so smooth, it’ll blow your mind.</p>
</blockquote>
<p>Have you ever wondered how ChatGPT can talk about AI in a Snoop Dogg-like style, as in the example above? In this series, we’ll start with basic machine learning to build our intuition, before implementing the models that power ChatGPT. Along the way, we’ll learn the basics of PyTorch, a popular machine learning library, to simplify our code and enable easier implementations of more complex models.</p>
<h3 id="heading-why-use-machine-learning">Why Use Machine Learning?</h3>
<p>When a computer is required to perform a task, we might write a function mapping an input to an output. For instance, to predict house value based on the number of bedrooms, we might write a function such as</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_house_value</span>(<span class="hljs-params">bedrooms</span>):</span>
    <span class="hljs-keyword">return</span> bedrooms*<span class="hljs-number">100000</span>
</code></pre>
<p>This function takes in the number of bedrooms, multiplies it by a constant, and outputs a house value. As a predictor of house value it’s pretty limited; we could take in more input parameters to calculate a more accurate value.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_house_value_v2</span>(<span class="hljs-params">bedrooms, city, garden</span>):</span>
    <span class="hljs-comment"># Adjust house value based on whether there is a garden</span>
    garden_increase = <span class="hljs-number">2000</span> <span class="hljs-keyword">if</span> garden <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>

    <span class="hljs-comment"># Adjust house value based on city</span>
    city_multipliers = {
        <span class="hljs-string">"London"</span>: <span class="hljs-number">1.2</span>,
        <span class="hljs-string">"New York"</span>: <span class="hljs-number">1.5</span>,
        <span class="hljs-string">"Tokyo"</span>: <span class="hljs-number">1.8</span>
    }
    city_multiple = city_multipliers.get(city, <span class="hljs-number">1.0</span>)

    <span class="hljs-keyword">return</span> garden_increase + city_multiple * <span class="hljs-number">100000</span> * bedrooms
</code></pre>
<p>The function is gradually refining its output, returning house values that better reflect the diverse range of values we encounter. However, it remains highly inaccurate. To work well it would need additional parameters, or would have to be restricted to specific regions, and it requires extensive domain knowledge to understand how each input influences the output. The algorithm also potentially misses interplay between the three input parameters; for example, gardens could be worth more in some cities than others. You could capture these relationships with if statements ANDing input parameters together, but try to imagine doing that with 50 cities and/or more input parameters.</p>
<p>Machine Learning allows us to take preexisting data and create the mappings represented by our functions for us. For example in the house value example, we can use data on various houses with features such as the number of bedrooms, proximity to schools, location and a house value associated with it.</p>
<p>We then select an appropriate machine learning model to train. During the training phase, we take each house one at a time and load its features and the dependent variable, which in this case is the house value. While training, the model learns to predict house values from the features it has seen by adjusting its internal parameters. After training, we can infer the values of houses not seen in the training data by supplying different combinations of features, e.g. 5 bedrooms, not near a school, New York, etc., and the model will predict the house value, as it has learned to map input features to predicted house values.</p>
<p>To help us understand how models learn and why their internal adjustments create a desired output we'll start with linear regression. Later we will explore how we can extend concepts in linear regression to more complex architectures such as neural networks, which can approximate a variety of computational tasks and finally transformers which will allow you to create Snoop-Dogg style text like at the top of the article.</p>
<h2 id="heading-linear-regression">Linear Regression</h2>
<p>Linear Regression is a method used to model the relationship between a dependent variable and one or more features. The dependent variable should be a continuous value when using linear regression. For instance, if we want to predict the value of a house based on its size, we can create a linear model where house value is the dependent variable and house size is the feature. This model would help us to understand and predict how changes in the size of a house might influence its value, allowing us to make predictions about future house values based on size.</p>
<p>We can use a 2D plot with house size on the x-axis and house value on the y-axis, to help us visualize the model.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735851196905/a770401a-d223-4cf3-95dd-faad42cc539f.png" alt="Scatter plot showing a positive correlation between house size and house value. The data points are upward sloping, indicating larger houses tend to have higher values." class="image--center mx-auto" /></p>
<p>Plotting the data shows a clear correlation between house value and house size. Drawing a line through the data will help us to approximate other house values based on their size. With this line, we can pick a house size on the x-axis, and see what y value intersects with x on the line. This line would be our model.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735850966150/35e86963-a5f3-4353-b0cf-e096f5c712c1.png" alt="Scatter plot showing house size vs. house value with a red trend line indicating positive correlation." class="image--center mx-auto" /></p>
<p>The line drawn above is one we could draw by hand and intuitively know it is a near optimal line to predict house values based on the data. This is based on it going through the centre of the plotted data.</p>
<p>Training a linear regression model allows us to find that line programmatically. But if we can easily draw the line, why bother? Well, we can have many input features, and each feature is another axis on our plot. That becomes harder to visualize in 3D and impossible above 3 dimensions. In real projects we will most likely deal with multiple features and therefore need a way to determine this line. For now, though, we’ll stick with our 2 dimensions, house value and house size, while learning.</p>
<h3 id="heading-model-line">Model line</h3>
<p>The line drawn through our data represents our model. It allows us to infer values for inputs not seen in the dataset. Any such line can be represented with the following equation</p>
<p>$$y=w_1 x_1+b$$</p><ul>
<li><p><strong>y</strong> is our dependent variable, house value, the value we are trying to predict</p>
</li>
<li><p><strong>x₁</strong> is an input feature, house size. We use subscript 1, because we can have many input features which would be labeled x₂, x₃… etc. These other features could represent other house properties such as number of floors, or proximity to schools. However for our simple case we have one input feature.</p>
</li>
<li><p><strong>w₁</strong> is a weight. This is a constant value that is determined during training. Again this has a subscript 1 because there will be as many weights as there are features i.e. there is a weight for each feature category.</p>
</li>
<li><p><strong>b</strong> represents a bias value. This is another constant value determined through training. It is not multiplied by any feature but simply added onto the final result.</p>
</li>
</ul>
<p>w₁ and b are constants learnt through training, and these values determine the gradient and intercept of our model’s line. x₁ is a variable supplied for each calculation, in this case the house size.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735937355296/2dcf87e1-ac2c-44b7-8070-1db9de19c3a5.png" alt="Three scatter plots showing data points with different linear regression lines: the first with slope 1 and intercept 0, the second with slope 1 and intercept 1, and the third with slope 2 and intercept 0." class="image--center mx-auto" /></p>
<p>From the above plots we can see how different values for w₁ and b produce different models, some clearly better than others.</p>
<p>We can see w₁ affects the gradient, because it is a weight applied to an input value such as house size. The greater the weight, the more influence the feature has on the dependent variable y, such as house value, and the steeper the slope. When the weight is negative the slope points downward, as the feature has a negative effect on the dependent variable.</p>
<p>When we change b, the line intersects the y-axis at a different point. For example, when b is 0 it intersects where y=0, and when b is 1 it intersects where y=1. Without this bias value the model would have less flexibility, as y would always have to be zero when x is zero.</p>
<h3 id="heading-what-makes-a-good-model">What makes a good model</h3>
<p>Now that we know what parts make up a linear regression model, how do we know we have a good model? Visually we can see that some models are better than others. The best model is the one that minimizes the difference between predicted values and actual values.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735938562421/1f5ff6d6-7247-47ce-a2bf-bce072c50863.png" alt class="image--center mx-auto" /></p>
<p>From the above image, we can see there is a difference between what a model predicts for a house size and what the actual house value is for a given house size. If our model was the perfect predictor and predictions were equal to the actual value, the sum of all the differences would be zero. As predictions start to deviate from actual values the sum of all absolute differences increases and our model becomes potentially less accurate.</p>
<h3 id="heading-the-loss-function">The loss function</h3>
<p>We can use the <strong>MSE</strong> (Mean Squared Error) formula to represent those differences. This is called a loss function, which aggregates the differences between all actual and predicted values. Each individual difference is an error value.</p>
<p>$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$</p><p>The MSE will give us an overall loss value, where a lower loss value represents a better model. i.e. a loss of zero is a perfect predictor, meaning all predictions perfectly match the actual values.</p>
<ul>
<li><p><strong>n</strong> is the number of instances we have e.g if we have data for 100 house, n is 100</p>
</li>
<li><p><strong>i</strong> is a specific instance e.g. in our house data, if we took the 2nd example i would be 2</p>
</li>
<li><p><strong>ŷ</strong> (pronounced y hat) is our predicted value e.g. the first house in our dataset might have a predicted value of $100k in our model</p>
</li>
<li><p><strong>y</strong> is our actual value e.g. the first house in our dataset might have an actual value of $120k</p>
</li>
</ul>
<p>The MSE calculates the difference between all predicted and actual values for all n instances and squares each one. Squaring the result adds a greater penalty to predictions further away from actual values, whilst also making each value positive.</p>
<p>Making each value positive ensures minus values can’t cancel out other values when summing the values.</p>
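<p>As a concrete example, here is the MSE computed for three hypothetical houses (the values are made up for illustration):</p>

```python
import numpy as np

y_actual = np.array([120_000.0, 250_000.0, 310_000.0])  # actual house values
y_pred   = np.array([100_000.0, 260_000.0, 305_000.0])  # model predictions

# Mean of the squared differences between actual and predicted values
mse = np.mean((y_actual - y_pred) ** 2)
print(mse)  # 175000000.0
```

Note how the $20k error contributes 400 million to the sum while the $5k error contributes only 25 million: squaring penalises large misses far more heavily.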
<h3 id="heading-how-do-we-train-a-good-model">How do we train a good model</h3>
<p>To find values for our weights and bias that minimize the MSE loss, we must train our model on a dataset. From now on we’ll refer to the weights and bias collectively as parameters.</p>
<p>We train with the following steps</p>
<ul>
<li><p>initialize our parameters (weights and bias) to random values</p>
</li>
<li><p>calculate predictions for every sample using the current parameter values, then calculate the MSE over the whole set of samples</p>
</li>
<li><p>based on the MSE value, adjust our parameters in a way that favors a lower loss value (we'll cover this in more depth later)</p>
</li>
<li><p>with our new parameter values, calculate predictions again, calculate the MSE, and adjust the parameters once more</p>
</li>
<li><p>continue the steps above, adjusting parameters and calculating the MSE, until the loss is acceptably small, the improvement in the MSE slows down, or a set number of iterations is reached</p>
</li>
</ul>
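<p>The steps above can be sketched as a minimal gradient-descent loop on toy data (everything here is made up for illustration; the gradient update is the parameter-adjustment step we'll cover in more depth later):</p>

```python
import numpy as np

# Toy data: house size (feature) and house value, generated so that y = 3x + 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * x + 5.0

w, b = np.random.randn(), np.random.randn()  # step 1: random parameter values
lr = 0.05                                     # learning rate, chosen for this toy data

for _ in range(2000):
    y_hat = w * x + b                         # step 2: predictions for every sample
    error = y_hat - y
    mse = np.mean(error ** 2)                 # step 2: the MSE loss
    # step 3: adjust parameters using the gradient of the MSE
    w -= lr * np.mean(2 * error * x)
    b -= lr * np.mean(2 * error)

print(round(w, 2), round(b, 2))  # approaches w=3, b=5
```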
<p>We'll go through these steps in code below with real data using the California Housing dataset. In this dataset, each row represents a district in California, and the target variable (the output) is the median value of owner-occupied homes, expressed in hundreds of thousands of dollars. As there is more than one feature, we’ll have more than 2 dimensions, and therefore won’t be able to visualize the model with graphs as before.</p>
<h2 id="heading-putting-it-all-together">Putting it all together</h2>
<h3 id="heading-data-importing-and-exploration">Data importing and exploration</h3>
<p>Below we'll use linear regression on a real dataset. The California Housing dataset collects data on housing per district in California. Each district contains features such as population, average number of bedrooms etc. Using linear regression, we'll predict the Median House Value per district based on the other fields in the dataset.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Import libraries that allow us to download the California housing dataset</span>
<span class="hljs-comment"># and view the data</span>
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> fetch_california_housing
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Load the dataset</span>
california_housing = fetch_california_housing()

<span class="hljs-comment"># Convert to a DataFrame, this will allow us to view and access data easily</span>
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)

<span class="hljs-comment"># Create MedHouseVal column which represents the Median House Value per district</span>
<span class="hljs-comment"># This is the y value we will try to predict</span>
data[<span class="hljs-string">'MedHouseVal'</span>] = california_housing.target

<span class="hljs-comment"># Display the first few rows of the dataset</span>
print(data.head())
</code></pre>
<p>The output from data.head() allows us to quickly inspect the dataset so we can get a feel for what the data looks like.</p>
<pre><code class="lang-bash">MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  MedHouseVal  
0    -122.23        4.526  
1    -122.22        3.585  
2    -122.24        3.521  
3    -122.25        3.413  
4    -122.25        3.422
</code></pre>
<h3 id="heading-scaling-features">Scaling features</h3>
<p>Next, we need to preprocess our data by scaling the features. Scaling adjusts the range of each feature to be similar, which prevents any single feature from dominating the training process due to its large scale. This also aids in faster convergence of the model during training.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># We'll create our input feature dataset X and a corresponding Y dataset </span>
<span class="hljs-comment"># for the data we want to predict</span>
<span class="hljs-comment"># We'll also convert the datasets to numpy which is better for processing, </span>
<span class="hljs-comment"># where Pandas dataframes are good for data exploration</span>
x_unprocessed = data.drop(<span class="hljs-string">"MedHouseVal"</span>, axis=<span class="hljs-number">1</span>).to_numpy()
y = data[<span class="hljs-string">"MedHouseVal"</span>].to_numpy()

<span class="hljs-comment"># Scale features in X using Z-score normalization</span>
<span class="hljs-comment"># This preserves the distribution of the data and ensures all</span>
<span class="hljs-comment"># features have a mean of 0 and a standard deviation of 1</span>
x_mean = np.mean(x_unprocessed, axis=<span class="hljs-number">0</span>)
x_std = np.std(x_unprocessed, axis=<span class="hljs-number">0</span>)
x = (x_unprocessed - x_mean)/x_std
</code></pre>
<p>The above scaling, known as standardization or Z-score normalization, ensures that features with large ranges do not overshadow others in influencing the model's predictions. Without scaling, features that have large ranges might end up with smaller weights, as their input values already greatly influence the dependent value due to their size relative to other features. Over time, training will still reach the correct weights; however, unscaled features can cause inefficient optimization paths, leading to slower convergence.</p>
<p>If all of our features are on similar scales (e.g., 1-100), then we can leave scaling out, as no feature will dominate due to scale.</p>
<p>Note we did not scale y. It is the value we are trying to predict, so no weight is assigned to it, and leaving it unscaled means predictions stay in the original units.</p>
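<p>A quick sanity check (on random data, not the housing set) confirms what standardization does to each column:</p>
<pre><code class="lang-python">import numpy as np

# two hypothetical features on very different scales,
# e.g. house prices and room counts
rng = np.random.default_rng(0)
x_raw = np.column_stack([
    rng.normal(200000, 50000, 1000),
    rng.normal(5, 2, 1000),
])

# Z-score normalization, as above
x_scaled = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)

# every column now has mean ~0 and standard deviation ~1
print(x_scaled.mean(axis=0), x_scaled.std(axis=0))
</code></pre>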
<h3 id="heading-test-and-validation-datasets">Test and validation datasets</h3>
<p>After exploring the data and pre-processing it, we need to split the dataset into a train and validation set. A typical split would include 80% of the data in the training set and the remaining 20% in the validation set. This split is commonly used because it provides enough data to train the model effectively whilst also keeping a sufficient portion to evaluate its performance. However, this ratio might change depending on the size of the dataset or the requirements of the specific task. Larger datasets might allow for smaller validation sets, such as 90/10 splits.</p>
<p>The training set is responsible for training the model. When we calculate the MSE at the end of a training iteration, ideally, we want as low a number as possible. However, if we achieve zero MSE and therefore perfect accuracy, there is a high chance our model has learned to predict the data in the training set but has not generalized to unseen data. This is also known as the model overfitting the data.</p>
<p>To test that our model generalizes to data outside of the training set, we hold back some data from the initial dataset to create a validation set. After we have trained the model, we calculate the MSE on this validation set. If the loss is low, indicating predictions are close to the observed data in the validation set, it is likely we have a generalized model, one that performs well on unseen data and not just the training set. However, a low validation error alone does not guarantee generalization and might require additional testing on unseen data to confirm the model's robustness.</p>
<pre><code class="lang-python"><span class="hljs-comment"># optional seed, setting this makes our random calculations reproducible</span>
np.random.seed(<span class="hljs-number">42</span>)
<span class="hljs-comment"># get the number of samples, shape returns the dimensions of the matrix</span>
<span class="hljs-comment"># specifying 0, fetches only the number of rows</span>
<span class="hljs-comment"># a random order is generated, in case the order of our data </span>
<span class="hljs-comment"># skews the distribution in some way</span>
n_samples = x.shape[<span class="hljs-number">0</span>]

<span class="hljs-comment"># this will generate an array of numbers from 0 to the number of rows minus 1, in a random order</span>
<span class="hljs-comment"># we'll use this to select rows from our datasets with a random order that is set once</span>
indices = np.random.permutation(n_samples)

<span class="hljs-comment"># define the percentage of samples that should be used in a training set</span>
train_ratio = <span class="hljs-number">0.8</span>
n_train = int(train_ratio * n_samples)

<span class="hljs-comment"># create the training and validation datasets</span>
train_indices = indices[:n_train]
val_indices = indices[n_train:]
x_train, y_train = x[train_indices], y[train_indices]
x_val, y_val = x[val_indices], y[val_indices]

print(<span class="hljs-string">f"Training set size: <span class="hljs-subst">{x_train.shape}</span>"</span>)
print(<span class="hljs-string">f"Validation set size: <span class="hljs-subst">{x_val.shape}</span>"</span>)
</code></pre>
<pre><code class="lang-bash">Training <span class="hljs-built_in">set</span> size: (16512, 8)
Validation <span class="hljs-built_in">set</span> size: (4128, 8)
</code></pre>
<h3 id="heading-weight-initialisation">Weight initialisation</h3>
<p>Next we create a weight vector to represent the weights applied to each feature. We need as many weights as there are features and an additional bias value, which allows the model to adjust the output independently of the feature values. Random small values are used for weight initialization.</p>
<p>We choose small initial values because large ones can cause a feature to contribute too much to predictions, causing large changes to the weights across iterations. These large changes can become unstable, oscillating around the optimum value: instead of gently moving towards an optimum point, the weight values might swing wildly around it, slowing convergence.</p>
<pre><code class="lang-python"><span class="hljs-comment"># initialise an array of random weights for each feature</span>
<span class="hljs-comment"># x_train.shape[1] gives us the number of columns, which represent features</span>
<span class="hljs-comment"># we add 1 to the number of features to account for the bias term that will be trained</span>
n_weight = x_train.shape[<span class="hljs-number">1</span>]+<span class="hljs-number">1</span>
weights = np.random.randn(n_weight) * <span class="hljs-number">0.01</span>
print(<span class="hljs-string">"Initial weights:"</span>, weights)

<span class="hljs-comment"># we will also add a feature column of ones, which will be used for the bias value later</span>
<span class="hljs-comment"># np.c_ will horizontally concatenate 2 matrices e.g. x_train and a generated 1 column matrix of 1s</span>
x_train = np.c_[x_train, np.ones(x_train.shape[<span class="hljs-number">0</span>])]
x_val = np.c_[x_val, np.ones(x_val.shape[<span class="hljs-number">0</span>])]
</code></pre>
<p>Example output, yours will be different due to random initialization.</p>
<pre><code class="lang-bash">Initial weights: [-0.00454508  0.00186784 -0.01296609  0.00670622 -0.00369122  0.00874647
 -0.00619931 -0.00674114 -0.00945463]
</code></pre>
<h3 id="heading-the-training-loop">The training loop</h3>
<p>Now we begin the training loop, an iterative process aimed at finding the optimal weights. In the training loop we will calculate predictions for each sample using our random weights, then calculate the loss for all predictions.</p>
<p>To visualize the loss, we can plot a graph where y is the loss value, and each other axis corresponds to a weight and its current value. This graph provides insight into how different weight values influence the loss, helping to identify whether certain weights significantly reduce or increase it, and guiding adjustments toward optimal values. For all weight values, there is an intersection point where y reaches its minimum. The minimum is where the weight values produce predictions for all samples that are the closest possible to the actual values. This is the point we aim to find.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735939962388/491e64aa-441d-4f99-a50c-582eadd37620.png" alt class="image--center mx-auto" /></p>
<p>The above shows what that graph might look like with one weight, as two dimensions are easier to reason with. Visualizing beyond three dimensions is not possible because our spatial intuition is limited to three dimensions, making it challenging to interpret higher dimensional spaces. As shown in the graph, there's an optimal value for w₁ where y is at its lowest. The lower y is, the smaller the difference between the predicted values and actual values for each sample. Here, y refers to the loss value, which quantifies these differences and serves as the key metric in the optimization process. By minimizing the loss value, we iteratively adjust the weights to improve the model's accuracy in predicting outcomes.</p>
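<p>We can trace that curve numerically for a one-weight model. Below is a sketch on synthetic data (not the housing set) where the data is generated with a true weight of 3; evaluating the MSE over a grid of candidate weights shows the loss is smallest near w = 3:</p>
<pre><code class="lang-python">import numpy as np

# synthetic data generated by y = 3x, so the optimal weight is 3
x = np.linspace(-1, 1, 50)
y = 3 * x

# evaluate the MSE at each candidate weight value on a grid
candidate_w = np.linspace(-5, 10, 301)
losses = np.array([np.mean((w * x - y) ** 2) for w in candidate_w])

# the weight with the smallest loss sits at the bottom of the curve
best_w = candidate_w[np.argmin(losses)]
print(best_w)
</code></pre>
<p>Plotting losses against candidate_w would produce exactly the kind of bowl-shaped curve shown in the graph above.</p>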
<h3 id="heading-gradient-descent">Gradient Descent</h3>
<p>Above we've initialised our parameters with random values to provide a starting point.</p>
<p>To move these random initializations to the optimal value, we need to move the weights towards the point where the loss is at its minimum. To compute this, we can calculate the partial derivative of the MSE function with respect to w₁. The partial derivative tells us the gradient for our weight at a particular loss value, i.e. how much the loss is affected as we adjust the weight.</p>
<details><summary>What’s a partial derivative</summary><div data-type="detailsContent">If you're unfamiliar with derivatives, you can explore them at <a target="_blank" href="https://www.mathsisfun.com/calculus/derivatives-introduction.html"><strong>Math is Fun</strong></a>. However, it's perfectly fine if you don't know them. Essentially, derivatives help us determine the steepness and direction of a graph at a specific point. A larger derivative value indicates a steeper gradient, with negative values indicating a downward slope and positive values an upward slope. In the context of machine learning, we use derivatives to calculate how much the Mean Square Error (MSE) changes concerning one of its parameters, like a weight. This is done by deriving a formula for the partial derivative, which helps us adjust weights to minimize the MSE. In this article, we'll provide the formulas for calculating partial derivatives for each weight, but in future articles, machine learning frameworks will handle these calculations automatically. So, it's okay to skip over the details of derivatives for now.</div></details>
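<p>As a sketch of what the partial derivative gives us, we can check the analytic MSE gradient for a single weight against a finite-difference approximation on synthetic data (the names here are illustrative, not from the training code below):</p>
<pre><code class="lang-python">import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.1, size=100)
w = 0.5  # current weight value

def mse(w_val):
    return np.mean((w_val * x - y) ** 2)

# analytic partial derivative of the MSE with respect to w
analytic = np.mean(2 * x * (w * x - y))

# finite-difference approximation: (MSE(w+h) - MSE(w-h)) / 2h
h = 1e-6
numeric = (mse(w + h) - mse(w - h)) / (2 * h)

print(analytic, numeric)  # the two agree closely
</code></pre>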

<p>In the code below we update the weight by subtracting the gradient for the current value of w₁ (and likewise for all other parameters).</p>
<pre><code class="lang-python">w1 -= w1_gradient
</code></pre>
<p>This allows the weight to move towards the minimum whether the gradient is downward or upward. This process of iteratively updating the weights using calculated gradients is called <strong>Gradient Descent</strong>. Below shows our initial point and several iterative updates. You'll see the jumps are quite large and can overshoot the optimum value, causing the weight to oscillate around the optimum and slowing convergence, or preventing convergence altogether.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736025494865/6510c7f8-3291-4a49-b3eb-cbcc8c4ae9d3.png" alt class="image--center mx-auto" /></p>
<p>To smooth out these updates, we use a hyperparameter called the learning rate represented by <strong>α</strong> (alpha).</p>
<details><summary>What’s a hyperparameter</summary><div data-type="detailsContent">A hyperparameter is a meta-parameter that controls how the learning algorithm learns. For example, setting the learning rate (α) to a higher value might result in faster initial convergence to weight values, but as the algorithm approaches the optimum, it can become unstable. On the other hand, setting it to a lower value can make the model learn too slowly, potentially wasting computational resources.</div></details>

<p>Let’s replace our previous code line with the one below, taking into account the learning rate.</p>
<pre><code class="lang-python">learning_rate = <span class="hljs-number">0.01</span>
w1 -= w1_gradient * learning_rate
</code></pre>
<p>Above we simply multiply the gradient by a learning rate to decrease the step size and smooth out the update, as seen on the graph below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736086241913/b2f3e1d1-1472-45be-a43b-9a7a2ed3f249.png" alt class="image--center mx-auto" /></p>
<p>However, setting α too low means more updates are needed to reach the optimum weight, resulting in slower convergence and extra compute. Setting it too large brings back the problems we saw without a learning rate to scale the gradient update.</p>
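<p>The effect of α is easy to see on a one-weight toy problem (a sketch with synthetic data): a small α converges steadily towards the optimum weight of 2, while an overly large α makes each step overshoot by more than the last, so the weight diverges:</p>
<pre><code class="lang-python">import numpy as np

# synthetic one-feature problem where the optimal weight is 2
x = np.linspace(-1, 1, 50)
y = 2 * x

def descend(lr, steps=100):
    w = 0.0
    for _ in range(steps):
        gradient = np.mean(2 * x * (w * x - y))
        w -= lr * gradient
    return w

w_small = descend(lr=0.1)  # converges towards 2
w_large = descend(lr=5.0)  # overshoots further every step
print(w_small, w_large)
</code></pre>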
<p>A few sections back we scaled our features ensuring they were within similar ranges with a mean of 0 and a standard deviation of 1. Without scaling, weight updates for features with large ranges such as house values would have far greater jumps compared to a feature with a smaller range such as number of rooms. As the learning rate is the same for all features, scaling our features to be within similar ranges allows them to take similarly scaled weight updates resulting in faster convergence.</p>
<p>Features with a large scale will also dominate the learning process, as the feature would contribute significantly more than other smaller scale features, due to the larger numbers. You might think it's fine if larger features contribute more to the weight updates early on, since they’ll eventually take smaller steps as the gradient shrinks. However, without feature scaling, the optimization process becomes uneven. Large features dominate the updates initially, but smaller features may take much longer to make meaningful progress. This imbalance can lead to slower convergence overall, as the model takes a less efficient path towards the optimal solution. By scaling all features to similar ranges, we ensure that all features contribute to the weight updates more evenly, leading to a faster and smoother convergence.</p>
<p>Once we've updated our weights, we repeat the process: calculate predictions, calculate the loss, then update the weights using partial derivatives. We can carry on for a set number of iterations, known as epochs, or until the improvement in the loss slows and stabilizes across epochs, signaling that our model has stopped learning.</p>
<p>This process of iteratively updating the parameters proportional to the gradient of a loss function is called Gradient Descent. It is a foundational optimization method in machine learning, enabling efficient minimization of loss functions, and it will play a crucial role in more advanced algorithms we explore later.</p>
<pre><code class="lang-python"><span class="hljs-comment"># set the number of epochs to train for, 100 is good enough for this simple</span>
<span class="hljs-comment"># example, however a better approach would be to stop learning once we see</span>
<span class="hljs-comment"># little change in the error metrics between epochs</span>
n_epochs = <span class="hljs-number">100</span>
<span class="hljs-comment"># set the learning rate, 0.01 is generally a good starting place, however</span>
<span class="hljs-comment"># later examples will adjust this during the learning phase to achieve</span>
<span class="hljs-comment"># faster convergence </span>
lr = <span class="hljs-number">0.01</span>

<span class="hljs-comment"># the main training loop</span>
<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(n_epochs):
    <span class="hljs-comment"># calculate predictions using the dot product of two matrices, </span>
    <span class="hljs-comment"># the feature matrix and the weight vector</span>
    predictions_train = np.dot(x_train, weights)
    error = predictions_train - y_train
    gradients = <span class="hljs-number">2</span> * (x_train * error[:, np.newaxis]).mean(axis=<span class="hljs-number">0</span>)
    weights -= lr * gradients

    <span class="hljs-comment"># calculate the mean squared error</span>
    mse_train = np.mean(error ** <span class="hljs-number">2</span>)

    <span class="hljs-comment"># calculate predictions on validation set</span>
    predictions_val = np.dot(x_val, weights)
    mse_val = np.mean((predictions_val - y_val) ** <span class="hljs-number">2</span>)

    print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch}</span>: Training Error = <span class="hljs-subst">{mse_train:<span class="hljs-number">.6</span>f}</span>, Validation Error = <span class="hljs-subst">{mse_val:<span class="hljs-number">.6</span>f}</span> "</span>)
</code></pre>
<pre><code class="lang-bash">Epoch 0: Training Error = 5.637618, Validation Error = 5.468886 
Epoch 1: Training Error = 5.438583, Validation Error = 5.277734 
Epoch 2: Training Error = 5.247518, Validation Error = 5.094238 
Epoch 3: Training Error = 5.064100, Validation Error = 4.918087 
Epoch 4: Training Error = 4.888018, Validation Error = 4.748984 
Epoch 5: Training Error = 4.718977, Validation Error = 4.586642 
........................
........................
Epoch 95: Training Error = 0.726476, Validation Error = 0.753565 
Epoch 96: Training Error = 0.721906, Validation Error = 0.749195 
Epoch 97: Training Error = 0.717503, Validation Error = 0.744986 
Epoch 98: Training Error = 0.713263, Validation Error = 0.740932 
Epoch 99: Training Error = 0.709177, Validation Error = 0.737027
</code></pre>
<p>The above output is an example of what we might see from the training loop. At the start we have a high error value, but training over 100 epochs brings it down considerably. While improvement slows towards the end of the training phase, we are still seeing gains epoch to epoch, and we could eke out better accuracy by continuing to train. However, there is a trade-off to consider: additional training may lead to marginal accuracy improvements, but it also increases the risk of overfitting, where the model performs well on training data but poorly on unseen data.</p>
<h3 id="heading-vectorization">Vectorization</h3>
<p>The training loop also makes use of vectorized NumPy methods, leveraging broadcasting when arrays have compatible but differing shapes, and element-wise operations when arrays have the same shape, to create concise code which is fast to run. Vectorized methods allow for SIMD (single instruction, multiple data), where entire arrays are processed in parallel by a single CPU instruction, significantly reducing control overhead compared to traditional loops. Instead of iterating through each element individually, SIMD enables one instruction to operate on multiple data points simultaneously, maximizing parallelism and improving performance.</p>
<p>Normally we iterate through an array which requires control instructions to loop through each element, then an instruction that must be applied to each element individually. Vectorized methods remove the need for many control instructions. Memory fetches are also minimized as instructions are performed on chunks of data in registers rather than single elements of an array.</p>
<p>Vectorized functions, especially in NumPy, are written in optimized compiled C code rather than Python, further enhancing performance. In our case, rather than iterating through our weight vector and multiplying each feature for each sample, we can calculate them all at once. For this we use the dot product.</p>
<details><summary>What is the dot product</summary><div data-type="detailsContent">The dot product is an operation which takes two vectors, and multiplies them element wise, then sums them to create a single scalar. In linear regression this is used to quickly take input features, multiply by the trained weights then sum them to create the output value.</div></details>

<p>Below compares the use of iterative methods vs vectorized methods.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">iterative_predictions</span>(<span class="hljs-params">x, weights</span>):</span>
    predictions = np.zeros(x.shape[<span class="hljs-number">0</span>])
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(x.shape[<span class="hljs-number">0</span>]):
        prediction = <span class="hljs-number">0.0</span>
        <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> range(x.shape[<span class="hljs-number">1</span>]):
            prediction += x[i][j] * weights[j]
        predictions[i] = prediction
    <span class="hljs-keyword">return</span> predictions

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">vectorized_predictions</span>(<span class="hljs-params">x, weights</span>):</span>
    <span class="hljs-keyword">return</span> np.dot(x, weights)

<span class="hljs-comment"># Time the iterative method</span>
start_time = time.time()
predictions_iterative = iterative_predictions(x_train, weights)
iterative_time = time.time() - start_time
print(<span class="hljs-string">f"Iterative method took <span class="hljs-subst">{iterative_time:<span class="hljs-number">.6</span>f}</span> seconds"</span>)

<span class="hljs-comment"># Time the vectorized method</span>
start_time = time.time()
predictions_vectorized = vectorized_predictions(x_train, weights)
vectorized_time = time.time() - start_time
print(<span class="hljs-string">f"Vectorized method took <span class="hljs-subst">{vectorized_time:<span class="hljs-number">.6</span>f}</span> seconds"</span>)
</code></pre>
<pre><code class="lang-bash">Iterative method took 0.031959 seconds
Vectorized method took 0.000339 seconds
</code></pre>
<p>As we can see above, the vectorized dot product is orders of magnitude faster than the iterative method.</p>
<p>As well as providing highly optimized methods such as the dot product, NumPy applies element-wise operations on arrays using SIMD.</p>
<pre><code class="lang-python">error = predictions_train - y_train
</code></pre>
<p>When performing a calculation such as the one above between two arrays of the same size, NumPy automatically performs the operation element-wise. This means each element of one array is matched with its counterpart in the other array and the calculation is carried out as if they were single scalar values. For example, with predictions_train and y_train, which are the same size, NumPy subtracts the corresponding elements (the first element of predictions_train minus the first element of y_train, and so on), ultimately returning a new array of results.</p>
<p>NumPy can also handle arrays or matrices of differing sizes using a method called Broadcasting if they meet certain rules to be "broadcast-compatible."</p>
<p>These rules are as follows:</p>
<ul>
<li><p><strong>Rank Compatibility:</strong> The arrays need not have the same number of dimensions; NumPy pads the smaller array's shape with 1s on the left until the ranks match.</p>
</li>
<li><p><strong>Dimension Compatibility:</strong> Dimensions are compared from right to left. If the sizes in each dimension are the same, or if one of them is 1, the dimension is considered compatible and the operation can proceed. Where one dimension is 1, that value is broadcast across the larger dimension.</p>
</li>
</ul>
<p>For example, if we have an array with shape (3, 1) and another array with shape (3, 4), NumPy will broadcast the single value across the dimension where 1 occurs to perform the element-wise calculation, effectively treating the first array as if it had the shape (3, 4).</p>
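<p>A minimal example of this case:</p>
<pre><code class="lang-python">import numpy as np

a = np.arange(3).reshape(3, 1)  # shape (3, 1): [[0], [1], [2]]
b = np.ones((3, 4))             # shape (3, 4)

# comparing right to left: 1 vs 4 broadcasts, 3 vs 3 matches,
# so each value in a is stretched across its row
result = a + b
print(result.shape)  # (3, 4)
print(result)
</code></pre>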
<p>Later in this series, we'll see how we can push these calculations to a GPU, which is designed specifically for massive parallel computations, allowing us to achieve even faster processing.</p>
<h2 id="heading-limitations-of-linear-regression">Limitations of Linear Regression</h2>
<p>As the name suggests, linear regression is linear, which means it struggles when we need to model non-linear relationships. We can only scale each feature by a magnitude, so its ability to model complex relationships is limited; e.g. if we could apply exponents or polynomials to an input feature, we could start to capture non-linear relationships.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736105240774/63c94b09-0931-4234-abb3-be2647d588f3.png" alt="A scatter plot with points distributed in an overall U shape, with a dense cluster in the top left and another in the lower right, forming a curve that dips in the middle." class="image--center mx-auto" /></p>
<p>For example, in the dataset shown above, a straight line from our linear model wouldn’t be able to capture the complex shape we see. While terms like exponents would allow for more complex curves, a linear model has no mechanism to learn those terms or combine several of them. To model non-linear relationships there are more appropriate models which do not require hand-crafting these extra components.</p>
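<p>A sketch with synthetic U-shaped data makes the limitation concrete: a plain linear fit leaves a large error, while hand-adding an x² column (a manually engineered feature, not something the linear model discovers itself) lets the same machinery fit the curve. Here np.linalg.lstsq stands in for our gradient descent loop, as both minimize the MSE:</p>
<pre><code class="lang-python">import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x ** 2 + rng.normal(scale=0.1, size=200)  # U-shaped relationship

ones = np.ones_like(x)

# plain linear model: y = w*x + b
linear_design = np.column_stack([x, ones])
w_lin, *_ = np.linalg.lstsq(linear_design, y, rcond=None)
mse_linear = np.mean((linear_design @ w_lin - y) ** 2)

# same machinery with a hand-added x squared feature
quad_design = np.column_stack([x ** 2, x, ones])
w_quad, *_ = np.linalg.lstsq(quad_design, y, rcond=None)
mse_quad = np.mean((quad_design @ w_quad - y) ** 2)

print(mse_linear, mse_quad)  # the quadratic fit's error is far smaller
</code></pre>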
<p>Linear models are also limited in their ability to model interactions between input features. For instance, in our California Housing dataset, if very old 1 bedroom houses significantly increased the median house value, a linear model would struggle to capture this relationship, as it treats each feature independently and multiplies it by a weight.</p>
<h2 id="heading-summary">Summary</h2>
<p>Linear regression might be less popular than Transformers, Neural Networks, and other advanced models, but if a simple linear model is sufficient, it’s quick and easy to train.</p>
<p>You might then be wondering why we covered linear regression in the first place. The reason is that linear regression provides a solid foundation for understanding key concepts such as training with gradient descent, loss functions, and how weights influence a model's output. In linear regression, we use a mathematical function that takes some input features and has adjustable parameters known as weights. We apply a loss function to the model's predictions and then use gradient descent to adjust the weights until the loss function is appropriately minimized.</p>
<p>This approach to training can be extended to more complex models; the core idea remains the same: we adjust parameters to minimize a loss function. The difference lies in the mathematical function we use. Linear regression is simple, but we can replace it with more flexible models that handle complex relationships and predict different types of data.</p>
<p>The next model we’ll introduce is the Neural Network as our new mathematical function. Neural Networks allow us to model non-linear relationships and dependencies between features, enabling us to tackle much more complex datasets.</p>
<h2 id="heading-complete-code">Complete Code</h2>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="6667aa0b41a5c03e91c96866f4c884fb"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/jessenr/6667aa0b41a5c03e91c96866f4c884fb" class="embed-card">https://gist.github.com/jessenr/6667aa0b41a5c03e91c96866f4c884fb</a></div>]]></content:encoded></item></channel></rss>