Demystifying RNNs: A Deep Dive into Dimensions and Parameters
Understanding what really happens inside Recurrent Neural Networks
When learning about Recurrent Neural Networks (RNNs), many tutorials focus on the high-level concept of “memory” but gloss over the practical details of how they actually work. As someone who struggled with these details, I want to share the insights that finally made RNNs click for me.
The Core RNN Equations
Let’s start with the fundamental RNN equations that everyone shows:
$$ h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{hx} \cdot x_t + b_h) \\ y_t = W_{ho} \cdot h_t + b_v $$

These equations look simple enough, but the devil is in the dimensions. Let's break them down with a concrete example.
A Concrete Example
Let’s define our dimensions:
- Input dimension ($d_{in}$): 4 (each input is a 4D vector)
- Hidden dimension ($d_h$): 3 (the size of our RNN's "memory")
- Output dimension ($d_{out}$): 2 (e.g., binary classification)
Now let’s look at what each component actually contains:
The Vectors (Changing States)
| Component | Shape | Description |
|---|---|---|
| $x_t$ | (4,) | Input at time t (e.g., a word embedding) |
| $h_{t-1}$ | (3,) | Previous hidden state (the "memory" so far) |
| $h_t$ | (3,) | New hidden state (updated memory) |
| $y_t$ | (2,) | Output at time t |
Key Insight: $h_t$ and $x_t$ do NOT have the same dimension! The hidden dimension is a design choice, while the input dimension is determined by your data.
The Parameters (Learned Weights)
| Component | Shape | Purpose |
|---|---|---|
| $W_{hx}$ | (3, 4) | Transforms input to hidden space |
| $W_{hh}$ | (3, 3) | Transforms previous hidden state |
| $b_h$ | (3,) | Hidden layer bias |
| $W_{ho}$ | (2, 3) | Transforms hidden state to output |
| $b_v$ | (2,) | Output bias |
Key Insight: The weight matrices are the “bridges” that make different dimensions compatible. They’re the actual parameters learned during training.
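If you want to see these shapes in a real framework, here is a minimal sketch using PyTorch's nn.RNN plus a separate linear readout (the choice of framework is mine, not something the formulation above requires). One caveat: PyTorch splits our single hidden bias $b_h$ into two vectors, bias_ih_l0 and bias_hh_l0, whose sum plays the same role.

```python
import torch.nn as nn

# A minimal shape check with PyTorch (my choice of framework, not the article's).
rnn = nn.RNN(input_size=4, hidden_size=3)   # holds W_hx, W_hh and the hidden biases
readout = nn.Linear(3, 2)                   # holds W_ho and b_v

for name, p in list(rnn.named_parameters()) + list(readout.named_parameters()):
    print(name, tuple(p.shape))
# weight_ih_l0 (3, 4)  <- W_hx
# weight_hh_l0 (3, 3)  <- W_hh
# bias_ih_l0   (3,)    \  PyTorch stores two hidden biases;
# bias_hh_l0   (3,)    /  their sum plays the role of b_h
# weight       (2, 3)  <- W_ho
# bias         (2,)    <- b_v
```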
Dimensionality Check: Why It All Works
Let’s verify the math works dimensionally:
import numpy as np

# All operations are dimensionally compatible (zeros used only as placeholders):
x_t, h_prev = np.zeros(4), np.zeros(3)                             # (4,), (3,)
W_hx, W_hh, b_h = np.zeros((3, 4)), np.zeros((3, 3)), np.zeros(3)
W_ho, b_v = np.zeros((2, 3)), np.zeros(2)

h_t = np.tanh(W_hh @ h_prev   # (3,3) @ (3,) → (3,)
              + W_hx @ x_t    # (3,4) @ (4,) → (3,)
              + b_h)          # + (3,) → (3,); h_t is born!
y_t = W_ho @ h_t + b_v        # (2,3) @ (3,) + (2,) → (2,); y_t is ready!
print(h_t.shape, y_t.shape)   # (3,) (2,)
Common Questions Answered
1. Is $h_t$ a vector or matrix?
In the fundamental formulation, $h_t$ is a vector. However, during batch processing (which we almost always do), it becomes a matrix where each row is the $h_t$ for one sequence in the batch.
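Here is a minimal sketch of what that looks like, assuming a batch of B = 5 sequences (the batch size is just an example value I picked). The matrix-vector products become matrix-matrix products, usually written with transposed weights so the batch dimension stays in the rows:

```python
import numpy as np

B = 5                       # batch size -- my own example value
X_t    = np.zeros((B, 4))   # one input vector per sequence in the batch
H_prev = np.zeros((B, 3))   # one hidden state per sequence

W_hx, W_hh, b_h = np.zeros((3, 4)), np.zeros((3, 3)), np.zeros(3)

# Transposing the weights keeps the batch dimension in the rows,
# and b_h broadcasts across rows.
H_t = np.tanh(H_prev @ W_hh.T + X_t @ W_hx.T + b_h)
print(H_t.shape)   # (5, 3) -- each row is the h_t of one sequence
```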
2. How is $h_0$ initialized?
Typically with zeros: $h_0 = [0, 0, 0, …, 0]$. This provides a neutral starting point.
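As a tiny sketch for our hidden dimension of 3 (frameworks such as PyTorch's nn.RNN also fall back to a zero initial state when you don't pass one in):

```python
import numpy as np

h_0 = np.zeros(3)   # array([0., 0., 0.]) -- a neutral starting "memory"
```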
3. What’s actually being “learned”?
The weight matrices and biases $(W_{hh}, W_{hx}, W_{ho}, b_h, b_v)$ are the learned parameters. The hidden state $h_t$ is the result of computation, not a parameter.
4. Why can RNNs handle variable-length sequences?
Because the same parameters (weights) are reused at each time step, and the hidden state dimension remains constant regardless of sequence length.
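A quick sketch of that weight reuse (the step helper and the random sequence lengths below are my own illustration): the same parameters process sequences of length 2, 7, and 31, and the hidden state stays (3,) throughout.

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed set of parameters, reused at every time step (random values here
# just so the loop does something; the shapes are what matters).
W_hx = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(3, 3))
b_h  = np.zeros(3)

def step(x_t, h_prev):
    """One application of h_t = tanh(W_hh h_{t-1} + W_hx x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_hx @ x_t + b_h)

for T in (2, 7, 31):                      # sequences of different lengths
    h = np.zeros(3)                       # h_0
    for x_t in rng.normal(size=(T, 4)):   # iterate over the T input vectors
        h = step(x_t, h)                  # the very same weights, every step
    print(T, h.shape)                     # the hidden state is always (3,)
```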
Parameter Counting
In our example:
- $W_{hx}$: $3 \times 4 = 12$ parameters
- $W_{hh}$: $3 \times 3 = 9$ parameters
- $b_h$: $3$ parameters
- $W_{ho}$: $2 \times 3 = 6$ parameters
- $b_v$: $2$ parameters
Total: 32 parameters (regardless of sequence length!)
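Written as a formula (my own rearrangement of the sums above), the recurrent block contributes $d_h \cdot (d_{in} + d_h + 1)$ parameters and the readout contributes $d_{out} \cdot (d_h + 1)$; a quick sketch:

```python
def rnn_param_count(d_in, d_h, d_out):
    hidden = d_h * (d_in + d_h + 1)    # W_hx, W_hh and b_h
    output = d_out * (d_h + 1)         # W_ho and b_v
    return hidden + output

print(rnn_param_count(4, 3, 2))   # 32
```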
The Achilles’ Heel: Vanishing and Exploding Gradients
Despite their elegant design, RNNs suffer from a fundamental limitation: they struggle to learn long-term dependencies. This occurs due to the vanishing and exploding gradient problem.
During training, gradients are calculated and propagated backward through time. At each step, the gradient is multiplied once more by (the transpose of) the same weight matrix $W_{hh}$, scaled by the derivative of the tanh. The behavior of this repeated multiplication is governed by the eigenvalues of $W_{hh}$.
What are eigenvalues? Think of them as the matrix’s “scaling factors” – they tell you how much a vector gets stretched or compressed when multiplied by the matrix.
- If the largest eigenvalue is less than 1: Gradients shrink exponentially as they backpropagate through time, eventually vanishing to near-zero. The network loses its ability to learn from distant time steps.
- If the largest eigenvalue is greater than 1: Gradients grow exponentially, exploding to enormous values and making training unstable.
This fragility stems from RNNs’ sequential structure: each hidden state depends solely on its immediate predecessor. The result is a brittle information chain where long-range dependencies vanish, limiting the model’s ability to capture context across extended sequences.
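To make the eigenvalue intuition concrete, here is a toy sketch. The diagonal matrices are my own simplification so the eigenvalues can be read straight off the diagonal; a real $W_{hh}$ is dense, and the tanh derivative shrinks gradients even further, but the repeated-multiplication effect is the same.

```python
import numpy as np

shrinking = np.diag([0.9, 0.5, 0.3])   # largest eigenvalue 0.9 < 1
growing   = np.diag([1.1, 0.5, 0.3])   # largest eigenvalue 1.1 > 1

g = np.ones(3)   # stand-in for a gradient flowing back through time
for steps in (10, 50, 100):
    vanish  = np.linalg.norm(np.linalg.matrix_power(shrinking, steps) @ g)
    explode = np.linalg.norm(np.linalg.matrix_power(growing, steps) @ g)
    print(steps, vanish, explode)   # vanish heads to ~0, explode blows up
```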
The Big Picture
Think of an RNN as a function: $h_t = f(x_t, h_{t-1})$
- Parameters = The fixed "knobs" of the function (weights and biases)
- Hidden state = The changing "memory" that gets passed between function calls
- The magic = The same function f is called repeatedly, each time updating the memory based on new input
Why This Matters
Understanding these dimensional relationships is crucial because:
- It helps debug shape errors when implementing RNNs
- It clarifies what the model is actually learning
- It provides intuition for more advanced architectures (LSTMs, GRUs, Transformers)
- It explains why RNNs can handle sequences of any length
The next time you see RNN equations, remember: the dimensions tell the real story! The matrices are the bridges that make everything connect, and the hidden state is the messenger carrying information through time.