Transforms for Video Compression, Part 1: Vectors, the Dot Product and Orthonormal Bases

The use of transforms in data compression algorithms is at least 40 years old. The goal of this three-part series of posts is to provide the mathematical background necessary for understanding transforms and to explain why they are a valuable part of many compression algorithms.

I’ll focus on video since that’s my particular interest. Part 1 (this post) will review vectors, the dot product and orthonormal bases. Part 2 introduces the use of matrices for describing both one and two-dimensional transforms. Finally, Part 3 gives an example and explains intuitively why transforms are a valuable part of video compression algorithms.

Remember back in high school when you learned the “dot product” between two vectors? It turns out that this operation is at the heart of a powerful way of thinking about transforms such as the DFT, the DCT, and the integer transform used in H.264. In this post, part 1 of a three-part series, I review some key concepts we’ll need, including vectors, the dot product and orthonormal bases. So let’s get started!

An n dimensional vector is just a collection of n numbers. By convention the numbers are usually written in a column, like this:

$\underline{x}=\begin{pmatrix} x_{0} \\ x_{1} \\ x_{2} \\ \vdots \\ x_{n-1} \end{pmatrix}$

Vector concepts are valuable in signal processing because a discrete, or digital, signal is nothing more than a collection of numbers. In other words, a discrete signal (of finite length) is a vector. The numbers in a vector can be samples of a music waveform, pixels in an image or samples from any signal that tickles your fancy. We call the numbers that make up a vector the components of the vector. Any operation that can be performed on a vector can be thought of as an operation performed on a discrete signal.

Now, an n-dimensional vector can be viewed in two ways:

as a point in n-dimensional space (i.e., the components of the vector are the coordinates of a point); or
as a directed line segment with its tail at the origin and its head at the point specified by its coordinates (this is the way we normally think of a vector — as an “arrow” with a definite direction).

The Pythagorean Theorem tells us how long a vector is as a function of its components:

$\| \underline{x} \| = \sqrt{\displaystyle\sum_{i=0}^{n-1}x_{i}^{2}}$

If you have two vectors, then the dot product between them is computed by multiplying their corresponding components together and summing up the products. Mathematically we express this as

$\underline{x} \cdot \underline{y} = \displaystyle\sum_{i=0}^{n-1}x_{i}y_{i}$

We see immediately that

$\| \underline{x} \| = \sqrt{ \underline{x} \cdot \underline{x}}$

It can also be shown that

$\underline{a} \cdot \underline{b} = \| \underline{a} \|\| \underline{b} \| \cos \theta$

This is just the vector formulation of the famous law of cosines you learned in trigonometry class, where θ is the angle between the vectors a and b. Now, recall that the cosine of a 90 degree angle is zero — so from this equation, it is clear that a and b are orthogonal, i.e., the angle between them is 90 degrees, if their dot product is zero.

If a vector has length one, we call it a unit vector. We can always form a unit vector that points in the same direction as an arbitrary vector b by applying the formula

$\underline{u} = \dfrac{\underline{b}}{\| \underline{b} \|}$

(Recall that multiplying or dividing a vector by a scalar means multiplying or dividing each component of the vector by the scalar.)

Finally, we can find the projection of a vector a in the direction of a unit vector u by computing the dot product between them:

$\underline{a} \cdot \underline{u} = \| \underline{a} \| \cos \theta$

Let’s consider an example for the case of three-dimensional vectors. We note that (1, 0, 0), (0, 1, 0), and (0, 0, 1) are all unit vectors; furthermore, they are all orthogonal to each other because the dot product between any pair of them is zero. Of course (1, 0, 0) points in the direction of the x‑axis, (0, 1, 0) points in the direction of the y‑axis, and (0, 0, 1) points in the direction of the z‑axis. Because these three unit vectors are all orthogonal to each other we say these vectors are orthonormal to each other (here “normal” refers to the fact that the vectors have a length of 1—they’ve been normalized).

Now, let x^T = (x₀, x₁, x₂) be an arbitrary vector in 3D space. (The superscript “T” means transpose; when we write a vector horizontally we refer to it as the transpose of the vertical vector. But I may be sloppy at times and let the context make clear whether a vector is horizontally or vertically oriented.) The dot product of x with (1, 0, 0) is x₀, the dot product of x with (0, 1, 0) is x₁, and the dot product of x with (0, 0, 1) is x₂. In other words, x₀, x₁, and x₂ are the projections of the vector x in the directions of the x, y, and z axes.

Furthermore, the unit vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1) form an orthonormal basis for the set of all 3D vectors because any 3D vector can be written as a scaled sum of these three vectors. In particular:

$(x_{0},x_{1},x_{2})=x_{0}(1,0,0)+x_{1}(0,1,0)+x_{2}(0,0,1)$

Now we’re ready for a key insight into transforms. The vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1) are not the only orthonormal basis vectors. For example, one easy way to get another set of orthonormal vectors is to rotate the unit vectors that point in the x-axis and y‑axis directions by 45 degrees, while leaving the (0, 0, 1) vector unchanged. Doing this we get the three unit vectors (0.707, 0.707, 0), (-0.707, 0.707, 0), and (0, 0, 1). It is easy to compute the dot product of every pair of these vectors and show that they are mutually orthogonal; it is also easy to see that all the vectors have a length of 1. What if we take the projection of (x₀, x₁, x₂) onto each of these new vectors? We get

$y_{0} = (0.707,0.707,0) \cdot (x_{0},x_{1},x_{2})=0.707x_{0}+0.707x_{1} \newline y_{1} = (-0.707,0.707,0) \cdot (x_{0},x_{1},x_{2})=-0.707x_{0}+0.707x_{1} \newline y_{2} = (0,0,1) \cdot (x_{0},x_{1},x_{2})=x_{2}$

Now we can express the vector x^T as a sum of these three new unit vectors as follows:

$(x_{0},x_{1},x_{2})= y_{0}(0.707,0.707,0)+y_{1}(-0.707,0.707,0)+y_{2}(0,0,1)$

In other words, given the three projections y₀, y₁ and y₂, we can reconstruct the numbers x₀, x₁ and x₂ from them by using them to scale their corresponding unit vectors and then adding up the scaled vectors.

In fact, for any set of three orthonormal vectors, we can compute the projection of x onto each of the vectors, and given these projections, we can reconstruct x exactly as demonstrated above. Conceptually, we “transform” x into y by computing the projections onto the unit vectors of our orthonormal basis (eq. 1); given y, we “inverse transform” back to x by using the components of y to scale the basis vectors and then add them up (eq. 2). Every set of orthonormal basis vectors defines a transform.

When the goal is to compress the vector (i.e., the discrete signal) x, then it may be easier to first transform x into a new vector y and compress y instead. Of course, given y, the receiver can apply the inverse transform to recover the original vector x. If the right orthonormal vectors are chosen — in other words, if the right transform is used — the benefit can be dramatic! — but more on that in Part 3. First, we need to develop a more efficient formulation of the above concepts using matrix notation. That will be the topic of Part 2.