Recently I've been working a lot on coding models and functions 'from scratch'. The point is to force myself to understand not only how each model/method/algorithm works, but also the details of how it is actually implemented.
One of the problems I've hit with a few different statistical models is singular matrices. In this post I'm going to talk about what singular matrices are, why they cause problems, how to detect them, and what to do when you run into one.
A singular (or degenerate) matrix is a matrix that cannot be inverted. A lot of this post will discuss the bridge between theory and practice, so I'll further specify that a singular matrix is a matrix that can't theoretically be inverted.
Of course, practically we can do all sorts of silly things. There are many functions one can run on a computer that will try and 'succeed' at inverting a singular matrix. I put succeed in quotes because any function that returns a result when trying to invert a singular matrix is returning nonsense. This is why it is important not to blindly trust functions from random packages you find on the internet! On the other hand, there are matrices that are theoretically invertible but impossible to invert in practice for a variety of reasons that I'll discuss in a bit.
First, let's review the definition of 'invertible'. A square matrix $A$ is invertible if there exists a square ($n \times n$) matrix $B$ such that $AB = BA = I$, where $I$ is the identity matrix. Matrix inversion is the process of finding the matrix $B$ that satisfies this equation for a given matrix $A$.
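For instance, here's a quick numerical check of that definition with numpy (the 2x2 matrix is just a made-up, clearly invertible example):
import numpy as np

# A small, clearly non-singular matrix, chosen for illustration
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.linalg.inv(A)  # candidate inverse

# Both products should equal the identity (up to floating point error)
print(np.allclose(A @ B, np.eye(2)))  # True
print(np.allclose(B @ A, np.eye(2)))  # True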
Technical Aside: I'd like to burrow a little deeper into what a singular matrix is, but it's a bit mathy. Feel free to skip this if you aren't a hardcore math nerd. One can show that a non-singular coefficient matrix leads to a unique solution for every vector of constants one could choose. Singular matrices, on the other hand, have non-trivial null spaces (see theorem NMTNS at the bottom). For a vector of constants $b$, the system $\mathcal{LS}(A,b)$ could be inconsistent (i.e. have no solution). However, if $\mathcal{LS}(A,b)$ has at least one solution $w$, then it has infinitely many solutions (see theorem PSPHS)! A system of equations with a singular coefficient matrix will never have a unique solution.
We'll also note that for a singular matrix there is always a way to write at least one row of the matrix as a linear combination of the other rows (and the same is true for the columns).
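For example (a made-up 3x3 case), if the third row is the sum of the first two, the rows are linearly dependent and the matrix is singular:
import numpy as np

C = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [5.0, 7.0, 9.0]   # row 3 = row 1 + row 2
])
print(np.linalg.matrix_rank(C))  # 2, less than 3: the matrix is singular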
Why are singular matrices a problem? Well, as it turns out, we often need to invert matrices. For example, what if we want to evaluate the probability density function of a multivariate Gaussian distribution?
$$ p(x;\mu,\Sigma)= \frac{1}{(2\pi)^{\frac{n}{2}} \left\lvert \Sigma \right\rvert ^\frac{1}{2}} \exp\biggl( \frac{-1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \biggr) $$
We would need to find $\Sigma^{-1}$, the inverse of the covariance matrix. Or what if we wanted to evaluate the posterior of a Gaussian Process model?
$$ \overline{f}_* = k^T_*(K+\sigma^2_n I)^{-1}y $$
$$ \mathbb{V}[f_*] = k(x_*,x_*) - k^T_*(K+\sigma^2_n I)^{-1}k_* $$
I borrowed the notation above from Gaussian Processes for Machine Learning, Eq. 2.25-2.26. I could go on listing important equations that require matrix inversions, but I think you get the point.
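To make that concrete, here's a minimal sketch of the posterior mean and variance equations above, assuming a squared-exponential kernel, a single test point, and some made-up 1-D training data (all of those choices are mine, not from the book):
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel; the kernel choice is an assumption for this sketch
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

# Made-up 1-D training data and a single test input
x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([0.5])
sigma_n = 0.1  # assumed observation noise

K = rbf_kernel(x_train, x_train)               # K
k_star = rbf_kernel(x_train, x_test)           # k_*
A = K + sigma_n ** 2 * np.eye(len(x_train))    # K + sigma_n^2 I

# Posterior mean and variance (cf. Eq. 2.25-2.26); both need to 'invert' A.
# Using solve() rather than forming the inverse explicitly is already a good habit.
mean = k_star.T @ np.linalg.solve(A, y_train)
var = rbf_kernel(x_test, x_test) - k_star.T @ np.linalg.solve(A, k_star)
print(mean, var)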
The problem is, if we ever need to invert a singular matrix we are in big trouble!
In many classrooms we teach that the simplest way to find out whether a matrix is singular is to take its determinant: if the determinant is zero, the matrix is singular. This would work fine if theory and practice always went hand in hand, but in the real world things can go terribly wrong when using the determinant to decide whether a matrix is singular.
Here is a good example (courtesy of anonymous user 85109 on Stack Exchange). Let's take the determinant of the matrix below.
import numpy as np
arr = np.array([
[16, 2, 3, 13],
[5, 11, 10, 8],
[9, 7, 6, 12],
[4, 14, 15, 1]
])
np.linalg.det(arr)  # not exactly zero, even though this matrix is singular
Well, that's not zero! Awesome, so the matrix must be non-singular, right? Nope. We can see that there is a way to write a row of this matrix as a linear combination of the other rows (and the same for the columns). This implies that the matrix is singular!
Let's check the symbolic determinant to get a second opinion.
import sympy as sym
M = sym.Matrix(arr)
M.det()  # exact symbolic arithmetic gives the true determinant: 0
Wow! The symbolic determinant is exactly what we expect for a singular matrix (zero). So why did numpy give us a different answer?
Well, computing the determinant of a large matrix exactly is very inefficient. A nice approximation that is commonly leveraged by packages like numpy is to take the product of the diagonal elements of a particular factorization of the array (an LU factorization, as of numpy version 1.15). Let's look at this factorization below.
import scipy.linalg as la
P, L, U = la.lu(arr)  # P: permutation, L: unit lower triangular, U: upper triangular
print(L)
print(U)
print(P)
The diagonal of the lower triangle ($L$) is all ones and the diagonal of the upper triangle ($U$) is all non-zero! This makes for nice, easy math when writing statistical/scientific computing packages: we can take the product of the diagonal of the upper triangle (up to a sign from the permutation $P$) to approximate the determinant of the original matrix.
np.prod(np.diag(U))
We got the same answer as when we called the determinant function from numpy! Neat. Now, this LU decomposition technique is super fast, but it relies on floating point arithmetic. The product of the diagonal of the upper triangle is not quite zero, as we would expect it to be for a singular matrix. This is why using standard determinant functions to identify singular matrices is a bad idea.
Here are a few other weird examples where using the determinant misleads us! Now, the identity matrix is NOT singular,
np.linalg.det(np.eye(100))
But by scaling the identity by a modest constant (0.1 here), we suddenly see a determinant value that is WAY closer to zero than the determinant we got for the singular matrix above!
np.linalg.det(0.1*np.eye(100))
Now, this matrix is NOT singular (for any nonzero constant $c$ and identity matrix $I$, the matrix $D = cI$ is non-singular just like $I$), but with a determinant of $10^{-100}$ we might easily be fooled into thinking that it is... and it gets worse. Look at the example below.
np.linalg.det(.0001*np.eye(100))
The determinant should just be the determinant of $I$ scaled by $0.0001^{100} = 10^{-400}$... but numpy can't represent that number! Instead, the number underflows and becomes zero, tricking us into thinking that this matrix is singular. In reality we could easily invert this matrix and get an accurate inverse. We can try the same trick with a large constant to get overflow issues (at least this time numpy warns us!).
np.linalg.det(10000*np.eye(100))
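If you really do need a determinant for a big or badly scaled matrix, one way to sidestep both underflow and overflow is np.linalg.slogdet, which returns the sign and the natural log of the determinant rather than the determinant itself:
sign, logdet = np.linalg.slogdet(.0001 * np.eye(100))
print(sign, logdet)   # sign 1.0, logdet = 100 * log(0.0001): no underflow

sign, logdet = np.linalg.slogdet(10000 * np.eye(100))
print(sign, logdet)   # sign 1.0, logdet = 100 * log(10000): no overflow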
What other tests might we try for identifying whether a matrix is singular? One common tool is the matrix rank. If the rank of an $N \times M$ matrix is less than $\min(N, M)$, the matrix is rank deficient; for a square matrix, that means it is singular.
The rank of a matrix is defined as either 1) the maximum number of linearly independent column vectors in the matrix or 2) the maximum number of linearly independent row vectors in the matrix. Both definitions are equivalent.
A = .0001*np.eye(100)
rank = np.linalg.matrix_rank(A)
size_M = A.shape[0]
det = np.linalg.det(A)
print("rank {} = dimension {}".format(rank, size_M))
print("determinant {}".format(det))
The scaled identity matrix from above still fails the determinant test (due to underflow), but it passes the rank test. We can try this for our original array as well!
rank = np.linalg.matrix_rank(arr)
size_M = arr.shape[0]
det = np.linalg.det(arr)
print("rank {} != dimension {}".format(rank, size_M))
print("determinant {}".format(det))
This array passes the determinant test (even though it is singular), but fails to pass the rank test.
Another test we can try is the condition number. The condition number of a matrix can be thought of as a measure of how easy the matrix is to invert. The best possible condition number is one. The higher the condition number, the harder a matrix is to invert and the more error may propagate through to the inverted matrix. This is nice because it not only gives us a clue as to whether a matrix is singular, but also whether the matrix is close enough to singular that we can expect errors when computing the inverse on a computer (due to floating point error and whatnot).
The condition number is technically the norm of a matrix times the norm of its 'inverse' (or whatever matrix the computer gets when it tries to invert the matrix). If this product is large (meaning the inverse 'blows up' relative to the original matrix), we say that the matrix is poorly (or ill) conditioned, and the condition number will be high.
Now, the computer may still invert ill-conditioned matrices. In fact, Gaussian elimination takes the same number of steps no matter the condition. However, the inverses of ill-conditioned matrices will contain large errors (even to the point of being completely useless). The condition number acts as a kind of error multiplier.
When solving the linear system $Ax = b$, you might expect that a small error in $b$ would result in a small error in $x$. That's true if $A$ is well-conditioned. But small changes in $b$ could result in large changes in $x$ if $A$ is ill-conditioned. Any error (like measurement error from real-world observations) will be multiplied by poor conditioning (not just floating point errors).
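To see this error amplification in action, here's a tiny demonstration with a nearly singular 2x2 system (the matrix and right-hand sides are made up for illustration):
# A nearly singular (ill-conditioned) matrix, chosen for illustration
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])

x = np.linalg.solve(A, b)                   # exact solution is [1, 1]
b_perturbed = b + np.array([0.0, 0.0001])   # a tiny change in b...
x_perturbed = np.linalg.solve(A, b_perturbed)

print(np.linalg.cond(A))    # condition number around 4e4
print(x)                    # ~[1, 1]
print(x_perturbed)          # ~[0, 2]: a huge change in x from a tiny change in b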
As a rule of thumb, in double precision a condition number greater than about $10^{15}$ is really bad.
np.linalg.cond(arr)
Our original matrix (the tricky singular one) has a HUGE condition number; based on the condition number alone, we could reasonably guess that it is singular. Obviously it would be a bad idea to try to invert this matrix without taking the proper precautions.
One good check is to see whether the reciprocal of the condition number is larger than machine epsilon. If it is close to epsilon, you are bound to run into some issues.
import sys
1.0 / np.linalg.cond(arr) >= sys.float_info.epsilon
Finally, there is the SVD (singular value decomposition). This is what the rank and condition number are based on! When any of the singular values of a matrix are small compared to the largest singular value... beware!
np.set_printoptions(suppress=True)
singular_values = np.linalg.svd(arr)[1]  # svd returns (U, S, Vh); index 1 is the singular values
max_sv = np.max(singular_values)
min_sv = np.min(singular_values)
min_sv / max_sv
We notice that the ratio of the smallest to the largest singular value is REALLY small... that's a bad sign. The SVD can tell us whether a matrix is close to singular, and if multiple singular values are really small, it can also tell us about the matrix rank.
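The singular values we just computed also give us a numerical rank directly: count how many of them sit above a small tolerance relative to the largest one (roughly what np.linalg.matrix_rank does by default):
tol = max(arr.shape) * np.finfo(singular_values.dtype).eps * max_sv
numerical_rank = int(np.sum(singular_values > tol))
print(numerical_rank)  # 3: one of the four singular values is effectively zero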
All of the tools above are easy to use and pretty efficient. A careful scientific coder should always check whether their matrices are invertible.
So what do we do if we find a singular matrix?!
Singular matrices are, as it turns out, a very small subset of the space of all possible square matrices. In fact, if you were to fill matrices with random uniform samples, you would almost NEVER get a singular matrix.
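You can convince yourself of this empirically; a matrix filled with uniform random samples essentially always comes out full rank:
rng = np.random.default_rng(0)   # seeded generator, just for reproducibility
random_matrix = rng.uniform(size=(100, 100))
print(np.linalg.matrix_rank(random_matrix))  # 100: full rank, i.e. non-singular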
So the easiest trick for working with a singular matrix is to add a very small value to the diagonal of the matrix to 'nudge' it out of the singular subset.
arr_nudged = arr + np.eye(arr.shape[0])*1e-10
print("Original Matrix Condition: {}".format(np.linalg.cond(arr)))
print("Nudged Matrix Condition: {}".format(np.linalg.cond(arr_nudged)))
The condition number of our nudged matrix is still really big... but not NEARLY as bad as its original condition number! Adding a tiny value like 1e-10 to the diagonal of a covariance matrix (for example) might not change the matrix in any meaningful way from a scientific standpoint, but it can mean many fewer errors when calculating the matrix's inverse.
Another good piece of advice is to look at different methods of inverting matrices. Instead of using Cramer's formula or standard functions like np.linalg.inv, try methods based on the SVD or an LU decomposition. You can even find some very nice, numerically stable methods leveraging the Cholesky decomposition (a favorite for Gaussian Process models).
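As a sketch of that last suggestion (the matrix and right-hand side here are made up): for a symmetric positive definite matrix, like a well-behaved covariance matrix, scipy's cho_factor/cho_solve let you solve a system against the Cholesky factor instead of ever forming the inverse.
from scipy.linalg import cho_factor, cho_solve

# A small symmetric positive definite matrix, chosen for illustration
S = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

# Solve S x = b via the Cholesky factorization rather than computing inv(S) @ b
c, low = cho_factor(S)
x = cho_solve((c, low), b)
print(x, np.allclose(S @ x, b))  # the solution satisfies the original system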
Author's Note: The observant reader may note that earlier I said that singular matrices are very rare... so why worry about them? Well, they are rare in the sense that you are unlikely to stumble across one when randomly sampling from many common distributions in the exponential family. However, there are good reasons why we run across them often in the real world. Covariance matrices, for example, are often built from multiple samples in a data set, and many data points/samples may be identical or very close together, resulting in rows/columns of the matrix that are identical or nearly identical. This is why we regularize the matrix we want to invert by adding a very small number to the 'ridge' or 'principal diagonal' of the matrix (just like in ridge regression), in the same way that we might add a noise term in the noisy case of Gaussian Process regression! In layman's terms: this is why we add a small number to the matrix diagonal. If you'd like to read more about this in the case of Gaussian Processes, check out equation 3.26 on page 45 of Gaussian Processes for Machine Learning.
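As a tiny made-up illustration of that: duplicating an input sample produces a kernel/covariance matrix with two identical rows and columns, and a small 'ridge' term on the diagonal rescues the conditioning.
# Made-up 1-D inputs with one duplicated sample (the repeated 0.0)
x = np.array([-1.0, 0.0, 0.0, 1.0])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # squared-exponential kernel matrix

print(np.linalg.cond(K))                            # enormous (or inf): two rows/columns are identical
print(np.linalg.cond(K + 1e-8 * np.eye(len(x))))    # adding a small ridge/jitter term tames it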
Well, now you know how to find hot singular matrices in your area and even how to work around them! My advice is to always check your matrices before you try to invert them and to have a plan for how to treat a matrix that turns out to be poorly conditioned.
I credit these proofs to 'A First Course in Linear Algebra' by Robert Beezer. I thank Robert for releasing this great reference for free under the GNU open source license!
Theorem NMTNS: Nonsingular Matrices have Trivial Null Spaces.
Suppose that $A$ is a square matrix. Then $A$ is nonsingular if and only if the null space of $A$ is the set containing only the zero vector, i.e. $\mathcal{N}(A)=\{0\}$.
Proof: The null space of a square matrix, $A$, is equal to the set of solutions to the homogeneous system, $\mathcal{LS}(A,0)$. A matrix is nonsingular if and only if the set of solutions to the homogeneous system, $\mathcal{LS}(A,0)$, has only a trivial solution. These two observations may be chained together to construct the two proofs necessary for each half of this theorem.
Theorem PSPHS: Particular Solution Plus Homogeneous Solutions.
Suppose that $w$ is one solution to the linear system of equations $\mathcal{LS}(A,b)$. Then $y$ is a solution to $\mathcal{LS}(A,b)$ if and only if $y = w + z$ for some vector $z \in \mathcal{N}(A)$.
Proof: See the PSPHS proof in 'A First Course in Linear Algebra'.