Non-Negative Matrix Factorization (NMF) is a technique to find an approximate matrix decomposition of the form V ≈ WH, with the constraint that all coefficients must be non-negative. The r columns of W form the basis. Each column of H is called an encoding, and it is in one-to-one correspondence with a column of V. r is usually chosen so that (n+m)r < nm, creating a reduced representation of V in feature space.
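As a concrete illustration, here is a minimal sketch of the shapes involved; the sizes n, m, r and the random data are purely illustrative assumptions, not taken from the text:

```python
import numpy as np

# Minimal NMF setup sketch: V is n x m, W is n x r (basis), H is r x m (encodings),
# following the V ~ WH convention described above.
n, m, r = 6, 8, 2              # r chosen so that (n + m) * r < n * m
rng = np.random.default_rng(0)

V = rng.random((n, m))          # non-negative data matrix
W = rng.random((n, r))          # basis: r columns of length n
H = rng.random((r, m))          # encodings: one column of H per column of V

approx = W @ H                  # rank-r approximation of V
print(approx.shape)             # (6, 8) -- same shape as V
print((n + m) * r, "<", n * m)  # 28 < 48: a compressed representation
```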
A neural network is made of a number of nodes (neurons) and connections. The nodes can be subdivided into input nodes, output nodes, and hidden nodes. Nodes are grouped into layers, and have weighted interconnections that relate the output of a node to the input of another. A function called the activation rule combines the inputs of various nodes to produce an output signal, called the activation. A neural network can be trained to learn an association between input and output sets by adjusting the weights using a learning rule. The learning rule tries to adjust the weights to minimize an error function.
The activation function is rarely linear, since a combination of linear functions is also linear, which severely limits the class of functions (or patterns) that can be learned. The sigmoid (σ(x) = 1/(1 + e^-x)) or tanh(x) functions are commonly used as activation rules.
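As a small illustration of how a node combines weighted inputs through a non-linear activation rule, here is a sketch; the specific weights and bias are made up for the example:

```python
import numpy as np

def sigmoid(x):
    # The logistic sigmoid, a common non-linear activation: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# A single hidden node combining three weighted inputs into one activation.
inputs = np.array([0.5, -1.0, 2.0])   # outputs of the previous layer
weights = np.array([0.1, 0.4, -0.3])  # weighted interconnections
bias = 0.05

activation = sigmoid(inputs @ weights + bias)
print(activation)  # a value in (0, 1)
```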
Given enough nodes, connections, and time, a neural network can learn to approximate any function. For example, a neural network can learn to perform the SVD computation by defining an appropriate error function to minimize (recall that the rank-k SVD of a matrix A is the best least-squares rank-k approximation of A).
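The parenthetical claim about SVD can be checked directly; the following sketch (with illustrative matrix sizes) builds the rank-k approximation from the SVD and measures its least-squares (Frobenius) error:

```python
import numpy as np

# Rank-k SVD: keep the k largest singular values/vectors to get the best
# least-squares rank-k approximation of A.
rng = np.random.default_rng(1)
A = rng.random((5, 4))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation

# The Frobenius-norm error of A_k is the smallest achievable by any rank-k matrix.
print(np.linalg.norm(A - A_k, "fro"))
```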
Problem: the general problem of training a NN optimally is NP-complete (proof by Blum and Rivest), so we settle for an approximation.
Normally, greedy gradient-based algorithms such as backpropagation are used. Such algorithms typically come with almost no guarantees. Alternatively, for specific formulations of the error criterion, approximation methods such as the Expectation-Maximization algorithm can be used.
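For concreteness, here is a minimal backpropagation sketch: plain gradient descent on a squared-error function for a tiny one-hidden-layer network. The sizes, learning rate, and toy targets are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((20, 3))                                   # 20 examples, 3 features
y = (X.sum(axis=1, keepdims=True) > 1.5).astype(float)   # toy targets

W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))  # two weight layers
lr = 0.5                                                    # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(1000):
    # Forward pass through hidden and output layers
    hidden = sigmoid(X @ W1)
    out = sigmoid(hidden @ W2)
    # Backward pass: gradients of the squared error w.r.t. each layer's inputs
    d_out = (out - y) * out * (1 - out)
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    # Greedy (gradient) update of the weights
    W2 -= lr * hidden.T @ d_out
    W1 -= lr * X.T @ d_hidden

print(np.mean((out - y) ** 2))  # error after training; no optimality guarantee
```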
NNs are useful for devising an approximation to a function whose closed form we don't know, or for finding its global maximum or minimum; usually we only know pairs (v, f(v)). Given that the learning rule is an approximation, there is no guarantee that it won't stop in a local maximum (or minimum, depending on what we look for).
NMF can be represented as a NN of 3 layers: one node for every cell of V, one for every cell of W, and another for every cell of H. Since in NMF each cell Vi,j ≈ Σa Wi,a Ha,j, we connect Vi,j with all Wi,a, for a=1..r, and each Wi,a with all Ha,j. In a NN, given an unknown layer, it is possible to approximate its values by training. The problem here is that there are two unknown layers (H and W). The way the authors find an approximation is to devise a two-stage method (E and M). Specifically, they maximize an objective function (the authors give a detailed discussion of why they chose this objective function).
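As a rough sketch, and assuming the log-likelihood-style objective F = Σi,j [Vi,j log (WH)i,j − (WH)i,j] from the original NMF paper (this formula is my reading of that paper, not quoted from this text), the objective can be evaluated as:

```python
import numpy as np

# Assumed objective to maximize: F = sum_ij [ V_ij * log((WH)_ij) - (WH)_ij ]
def objective(V, W, H, eps=1e-12):
    WH = W @ H
    return np.sum(V * np.log(WH + eps) - WH)   # eps guards against log(0)

rng = np.random.default_rng(0)
V, W, H = rng.random((4, 5)), rng.random((4, 2)), rng.random((2, 5))
print(objective(V, W, H))
```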
The authors devise an iterative method to update the values of H and W, similar to EM (Expectation-Maximization). The authors also explain in detail how the update rules were derived.
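Here is a hedged sketch of such an iterative scheme, using the multiplicative update rules as I understand them from the original paper; the variable names, iteration count, and test matrix are illustrative assumptions:

```python
import numpy as np

# EM-like multiplicative updates for NMF (sketch, not the authors' exact code).
def nmf(V, r, iters=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(iters):
        # Update W given H, then renormalize each basis column
        W *= (V / (W @ H + eps)) @ H.T
        W /= W.sum(axis=0, keepdims=True)
        # Update H given the (normalized) W
        H *= W.T @ (V / (W @ H + eps))
    return W, H

V = np.random.default_rng(1).random((6, 8))
W, H = nmf(V, r=2)
print(np.linalg.norm(V - W @ H))   # reconstruction error after the updates
```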
The authors claim that forcing all the coefficients to be non-negative creates bases
that represent features, or parts, of the elements in V. An example of an
application of NMF to IR claims that once NMF is calculated on a reduced
space, it can capture not only the existence of synonyms, but also different
meanings of the same word in different contexts.