Non-Negative Matrix Factorization (NMF) is a technique to find an approximate matrix decomposition of the form V ≈ WH, with the constraint that all coefficients must be non-negative. The r columns of W form the basis. Each column of H is called an encoding, and it is in one-to-one correspondence with a column of V. r is usually chosen so that (n+m)r < nm, creating a reduced representation of V in feature space.
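As a concrete illustration, here is a minimal sketch of the shapes involved; the sizes n, m, r and the random data are purely illustrative assumptions, not taken from the text:

```python
import numpy as np

# Minimal NMF setup sketch: V is n x m, W is n x r (basis), H is r x m (encodings),
# following the V ~ WH convention described above.
n, m, r = 6, 8, 2              # r chosen so that (n + m) * r < n * m
rng = np.random.default_rng(0)

V = rng.random((n, m))          # non-negative data matrix
W = rng.random((n, r))          # basis: r columns of length n
H = rng.random((r, m))          # encodings: one column of H per column of V

approx = W @ H                  # rank-r approximation of V
print(approx.shape)             # (6, 8) -- same shape as V
print((n + m) * r, "<", n * m)  # 28 < 48: a compressed representation
```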
A neural network is made of a number of nodes (neurons) and connections. The nodes can be subdivided into input nodes, output nodes, and hidden nodes. Nodes are grouped into layers, and have weighted interconnections that relate the output of a node to the input of another. A function called the activation rule combines the inputs of various nodes to produce an output signal, called the activation. A neural network can be trained to learn an association between input and output sets by adjusting the weights using a learning rule. The learning rule tries to adjust the weights to minimize an error function.
The activation function is rarely linear, since a combination of linear functions is also linear, which severely limits the class of functions (or patterns) that can be learned. The sigmoid (σ(x) = 1/(1 + e^-x)) or tanh(x) functions are commonly used as activation rules.
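As a small illustration of how a node combines weighted inputs through a non-linear activation rule, here is a sketch; the specific weights and bias are made up for the example:

```python
import numpy as np

def sigmoid(x):
    # The logistic sigmoid, a common non-linear activation: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# A single hidden node combining three weighted inputs into one activation.
inputs = np.array([0.5, -1.0, 2.0])   # outputs of the previous layer
weights = np.array([0.1, 0.4, -0.3])  # weighted interconnections
bias = 0.05

activation = sigmoid(inputs @ weights + bias)
print(activation)  # a value in (0, 1)
```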
Given enough nodes, connections, and time, a neural network can learn to approximate any function. For example, a neural network can learn to perform the SVD computation by defining an appropriate error function to minimize (recall that the rank-k SVD of a matrix A is the best least-squares rank-k approximation of A).
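The parenthetical claim about SVD can be checked directly; the following sketch (with illustrative matrix sizes) builds the rank-k approximation from the SVD and measures its least-squares (Frobenius) error:

```python
import numpy as np

# Rank-k SVD: keep the k largest singular values/vectors to get the best
# least-squares rank-k approximation of A.
rng = np.random.default_rng(1)
A = rng.random((5, 4))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation

# The Frobenius-norm error of A_k is the smallest achievable by any rank-k matrix.
print(np.linalg.norm(A - A_k, "fro"))
```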
Problem: the general problem of training a NN optimally is NP-complete (proof by Blum and Rivest), so we settle for an approximation.
Normally, greedy gradient-based algorithms such as backpropagation are used. Such algorithms typically come with almost no guarantees. Alternatively, for specific formulations of the error criterion, approximation methods such as the Expectation-Maximization algorithm can be used.
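For concreteness, here is a minimal backpropagation sketch: plain gradient descent on a squared-error function for a tiny one-hidden-layer network. The sizes, learning rate, and toy targets are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((20, 3))                                   # 20 examples, 3 features
y = (X.sum(axis=1, keepdims=True) > 1.5).astype(float)   # toy targets

W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))  # two weight layers
lr = 0.5                                                    # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(1000):
    # Forward pass through hidden and output layers
    hidden = sigmoid(X @ W1)
    out = sigmoid(hidden @ W2)
    # Backward pass: gradients of the squared error w.r.t. each layer's inputs
    d_out = (out - y) * out * (1 - out)
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    # Greedy (gradient) update of the weights
    W2 -= lr * hidden.T @ d_out
    W1 -= lr * X.T @ d_hidden

print(np.mean((out - y) ** 2))  # error after training; no optimality guarantee
```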
NNs are useful for devising an approximation to a function whose closed form we don't know, or for finding its global maximum or minimum; usually we only know pairs (v, f(v)). Given that the learning rule is an approximation, there is no guarantee that it won't stop in a local maximum (or minimum, depending on what we look for).
NMF can be represented as a NN of 3 layers: one node for every cell of V, one for every cell of W, and another for every cell of H. Since in NMF each cell Vi,j ≈ Σa Wi,a Ha,j, we connect Vi,j with all Wi,a, for a=1..r, and each Wi,a with all Ha,j. In a NN, given an unknown layer, it is possible to approximate its values by training. The problem here is that there are two unknown layers (H and W). The way the authors find an approximation is to devise a two-stage method (E and M). Specifically, they maximize an objective function (the authors give a detailed discussion of why they chose this objective function).
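As a rough sketch, and assuming the log-likelihood-style objective F = Σi,j [Vi,j log (WH)i,j − (WH)i,j] from the original NMF paper (this formula is my reading of that paper, not quoted from this text), the objective can be evaluated as:

```python
import numpy as np

# Assumed objective to maximize: F = sum_ij [ V_ij * log((WH)_ij) - (WH)_ij ]
def objective(V, W, H, eps=1e-12):
    WH = W @ H
    return np.sum(V * np.log(WH + eps) - WH)   # eps guards against log(0)

rng = np.random.default_rng(0)
V, W, H = rng.random((4, 5)), rng.random((4, 2)), rng.random((2, 5))
print(objective(V, W, H))
```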
The authors devise an iterative method to update the values of H and W, similar to EM (Expectation-Maximization). The authors also explain in detail how the update rules were derived.
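Here is a hedged sketch of such an iterative scheme, using the multiplicative update rules as I understand them from the original paper; the variable names, iteration count, and test matrix are illustrative assumptions:

```python
import numpy as np

# EM-like multiplicative updates for NMF (sketch, not the authors' exact code).
def nmf(V, r, iters=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(iters):
        # Update W given H, then renormalize each basis column
        W *= (V / (W @ H + eps)) @ H.T
        W /= W.sum(axis=0, keepdims=True)
        # Update H given the (normalized) W
        H *= W.T @ (V / (W @ H + eps))
    return W, H

V = np.random.default_rng(1).random((6, 8))
W, H = nmf(V, r=2)
print(np.linalg.norm(V - W @ H))   # reconstruction error after the updates
```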
The authors claim that forcing all the coefficients to be non-negative creates bases
that represent features, or parts, of the elements in V. An example of an
application of NMF to IR claims that once NMF is calculated on a reduced
space, it can capture not only the existence of synonyms, but also different
meanings of the same word in different contexts.