I'm participating in a summer school at Oxford University learning about neural networks and connectionist modelling. This is an archive/scratchpad as I go through some of the connectionist materials (specifically "Exercises in Rethinking Innateness" from MIT Press by Plunkett and Elman). The book is likely available at
if you have an account.
Some general information on neural networks
The 1998 version of the material I'm working with is at:
My collection of tlearn project files
Click below to download tlearn!
Powerpoint Lecture Slides for summer 2003
(short, but helps)
Some links on artificial intelligence
The weight-limit parameter sets the range of the initial random weights for the network: weights are drawn from within +/- 1/2 of that limit (i.e. a limit of 4 means +/- 2).
The range of random weight limits is important to optimize the training time of the network. We want random weights as close as possible to the expected final outcome (yet randomized). Thus, if we expect a network to have very high variation in the final trained state, a high random-weight range is desired. +/-0.5 seems to generally work well.
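As a rough illustration of the two notes above, here is a minimal Python sketch of drawing initial weights uniformly from +/- half of the weight limit. The function name `init_weights` and its arguments are hypothetical and are not taken from tlearn; the sketch only assumes the limit is interpreted the way these notes describe it.

```python
import random

def init_weights(n_weights, weight_limit, seed=None):
    """Draw n_weights values uniformly from [-weight_limit/2, +weight_limit/2].

    Assumption: this mirrors the interpretation in the notes (a limit of 4
    gives weights in +/- 2); it is not taken from tlearn's source.
    """
    rng = random.Random(seed)
    half = weight_limit / 2.0
    return [rng.uniform(-half, half) for _ in range(n_weights)]

# A limit of 1.0 gives the +/- 0.5 range mentioned above.
weights = init_weights(6, weight_limit=1.0, seed=0)
```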
The weight-change constant (eta) can generally be set high unless dealing with a very context-sensitive network in which there are small 'slivers' of correct behavior that would be likely to be jumped over using a high weight-change constant.
The 'cluster analysis' tool groups things in a hierarchy by how close the vectors are to each other in N-dimensional space, where N = # of input nodes. Quite often these relations are not useful.
The Bias node has an output of 1. This can be interpreted as not being sigmoided, or an infinite activation. Inputs are not sigmoided. Outputs are sigmoided.
An 'epoch' is defined as a number of sweeps equal to the number of training patterns. If the training patterns are presented in random order (with replacement), an epoch can be a deceptive measurement: one epoch in random order is unlikely to have tried all of the patterns. Unchecking the 'with replacement' option in tlearn makes sure each input pattern is presented once per epoch.
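To make the 'with replacement' point concrete, here is a small Python sketch (my own illustration, not tlearn code; the function names are hypothetical). It compares which patterns get presented in one epoch under the two sampling schemes, and computes the expected fraction of distinct patterns seen when sampling with replacement, which is 1 - (1 - 1/n)^n for n patterns.

```python
import random

def patterns_seen(n_patterns, with_replacement, seed=None):
    """Return the set of pattern indices presented during one epoch
    (one epoch = n_patterns sweeps)."""
    rng = random.Random(seed)
    if with_replacement:
        # Each sweep picks any pattern independently; repeats are possible.
        order = [rng.randrange(n_patterns) for _ in range(n_patterns)]
    else:
        # A shuffled pass: every pattern appears exactly once.
        order = rng.sample(range(n_patterns), n_patterns)
    return set(order)

def expected_coverage(n_patterns):
    """Expected fraction of distinct patterns seen in one epoch when
    sampling with replacement: 1 - (1 - 1/n)^n."""
    return 1.0 - (1.0 - 1.0 / n_patterns) ** n_patterns

coverage_100 = expected_coverage(100)  # roughly 0.63
seen_without = patterns_seen(100, with_replacement=False, seed=0)
seen_with = patterns_seen(100, with_replacement=True, seed=0)
```

So with 100 patterns, an 'epoch' with replacement is expected to touch only about two thirds of them.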
# hidden units >= (log #output states)/(log 2) (if dealing with only binary activations of the hidden/output units).
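The bound above is just the number of bits needed to give every output state its own binary code. A minimal sketch, with a hypothetical helper name, that computes it using integer arithmetic only:

```python
def min_hidden_units(n_output_states):
    """Smallest number of binary hidden units that can give each of
    n_output_states its own distinct code: ceil(log2(n))."""
    if n_output_states < 1:
        raise ValueError("need at least one output state")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, with no float rounding.
    return (n_output_states - 1).bit_length()

needed_for_8 = min_hidden_units(8)   # 8 states fit exactly into 3 binary units
needed_for_9 = min_hidden_units(9)   # a 9th state forces a 4th unit
```

This is only a lower bound under the binary-activation assumption; a network may need more hidden units in practice to actually learn the mapping.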
The 'bias node' allows for a change in the threshold of the activation function: the weight on the bias connection shifts the threshold, so one can do this instead of making additional modifications to each neuron's activation function. It is basically just a connectionist-model hack to make things simpler, and definitely not biologically plausible.
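A small sketch of why the bias node stands in for a threshold, assuming a standard logistic (sigmoid) unit. The function names are mine, for illustration only. A unit with an explicit threshold gives the same activation as a unit with an extra input fixed at 1 whose weight is the negative of that threshold.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit_with_threshold(inputs, weights, threshold):
    """Logistic unit whose activation function is shifted by an explicit threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(net - threshold)

def unit_with_bias(inputs, weights, bias_weight):
    """Same unit, but the threshold is folded into a weight on a bias node
    whose output is always 1."""
    net = sum(w * x for w, x in zip(weights + [bias_weight], inputs + [1.0]))
    return sigmoid(net)

inputs, weights = [0.0, 1.0], [0.8, -0.3]
a = unit_with_threshold(inputs, weights, threshold=0.5)
b = unit_with_bias(inputs, weights, bias_weight=-0.5)  # bias weight = -threshold
```

Because the bias weight is an ordinary weight, the learning rule adjusts the threshold without any special handling.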
Momentum is useful when sequential training patterns need to be treated similarly (especially when the patterns are presented in a fixed sequential order), or on problems that you expect to be solved in a strongly piecemeal fashion. It prevents the network from stopping when only a single piece has been solved (a local minimum).
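For reference, a minimal sketch of the usual textbook form of a gradient-descent update with momentum, where each weight change is the current gradient step plus a fraction of the previous change. The names `momentum_step`, `eta` and `alpha` are illustrative, the toy error function is my own choice, and I am assuming (not confirming) that tlearn's rule has this shape.

```python
def momentum_step(weight, gradient, prev_delta, eta=0.1, alpha=0.9):
    """One gradient-descent update with momentum.

    delta = -eta * gradient + alpha * prev_delta
    eta is the weight-change constant (learning rate), alpha the momentum.
    """
    delta = -eta * gradient + alpha * prev_delta
    return weight + delta, delta

# Toy example: minimise E(w) = w**2, whose gradient is 2*w.
w, delta = 1.0, 0.0
for _ in range(200):
    w, delta = momentum_step(w, 2.0 * w, delta)
```

The carried-over `prev_delta` term is what lets the weight keep moving through flat or shallow regions of the error surface instead of settling there.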
Heuristics for Modelling Methods
The # of hidden nodes is constrained by two things. There cannot be too few, or the network will not have enough cognitive ability to devise an inner representation (extracting the common essence) of the input data. If you have too many, it will be completely within the network's power to simply 'memorize' the proper responses for the input data and not generalize. Thus we want to use the smallest number of hidden nodes possible in order to get the greatest generalization.
The subset of input patterns used for training must properly sample the entire space of future input patterns; otherwise the network won't have any chance of generalizing properly.
Always try the 'perceptron' learning algorithm first. The perceptron rule is guaranteed to find a solution in a finite number of updates (usually a relatively short amount of time) if the output space is linearly separable. If the perceptron method is unsuccessful, we can always try other algorithms that can solve problems which are not linearly separable.
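A small sketch of the classic perceptron learning rule, to show the 'try it first' heuristic in action. This is my own illustration, not tlearn's implementation, and the function name and parameters are hypothetical. It learns AND (linearly separable) and gives up on XOR (not linearly separable).

```python
def train_perceptron(patterns, max_epochs=100, eta=1.0):
    """Classic perceptron rule on (inputs, target) pairs with 0/1 targets.

    Returns (weights, bias, converged). `converged` is True once a full
    pass over the patterns produces no errors.
    """
    n_inputs = len(patterns[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in patterns:
            net = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if net > 0 else 0
            error = target - output
            if error != 0:
                errors += 1
                weights = [w + eta * error * x for w, x in zip(weights, inputs)]
                bias += eta * error
        if errors == 0:
            return weights, bias, True
    return weights, bias, False

AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

_, _, and_converged = train_perceptron(AND)  # linearly separable
_, _, xor_converged = train_perceptron(XOR)  # not linearly separable
```

If the perceptron fails to converge within a generous epoch budget, that is a hint (not a proof) that the problem needs hidden units.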
Every added hidden node partitions the input space. If dealing with purely binary output, one can derive the needed number of input units by the dimensions of the space required to confine/define the answer.
Localist networks generally have only binary outputs on a few nodes, which allow the input pattern to be uniquely determined. Distributed networks often have many more output nodes activated for any output pattern, and have partial activations.
Values closest to a binary relationship, 2^#hidden nodes = #output nodes, will be the fastest to solve. This is a product of the fact that the number of distinct outputs (and thus inputs) must somehow be compressed into a smaller 'waist' of hidden units. These hidden units must find a way to encode all of that information within a small number of nodes. Binary is a very efficient manner for this compression.
Pattern redundancy is a property of a set of patterns in which the activity of a single bit can accurately predict the value of a bit in another pattern.
For backpropagation and related algorithms, a solution perfectly mapping the input data to the desired output is only guaranteed to exist given an infinite number of nodes, or infinite precision on a single node. We will never have either of these, and even then there's no guarantee we'll ever find the perfect solution on the error surface.
Superposition of networks
Meta-learning algorithm to resolve problems in getting a good simulation. (an algorithm that will try other values as parameters for you intelligently)
Network 1 teaching network 2 to add
Network 1 making network 2 a copy of itself (Perhaps with localist connections between these)
A chaotic network that provides a non-repeating sequence for the output
A crypto-system implemented via a neural network
Do a recursive fibonacci sequence in a neural network
A circular architecture in which both networks are the student/teacher. Perhaps able to simulate non-linear dynamic systems.
bi-neural networks connecting in middle to perform a transitive property of themselves.
Constructing a network to look for _relative_ patterns in the outputs, not absolute. This would make it far more sensitive to patterns akin to human perception. One would, in principle, make dimensions interchangeable.
Why I make all of these damned homonym spelling errors.
Introduction to the Theory of Neural Computation (0201515601)
Modeling Brain Function : The World of Attractor Neural Networks
Self-Organizing Maps (3540679219)
Neural and Adaptive Systems: Fundamentals through Simulations
-p93 why network solves XOR problem ~2000 sweeps
-what is the 'with replacement' option with the input patterns presented in random order?
-when to add more layers
You don't know. Generally don't go above 1-2 layers of hidden units.
-Besides translation invariance, when do we decrease the number of connections?
-Isn't there anything better we can do with translation invariance?
-Init bias offset?
-'giving away the secret'
-what to do with the network's ability to find solutions varying so much on the encoding of the input/outputs.
Consult current psychological theory of how these objects are represented within the brain. If no strong/accepted theory of these processes exists, make it harder on yourself by making all the different input patterns orthogonal to each other (i.e. 1 dimension for each input, or equal spacing along a lower number of dimensions), so as not to give the network any notion of pre-defined relationships.
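A minimal sketch of the 'one dimension per input' option, using a one-hot (localist) code; the helper names are hypothetical. Every pair of distinct codes has a dot product of zero, so no pattern starts out looking more similar to one pattern than to any other.

```python
def one_hot(index, n_items):
    """Localist code: item `index` gets its own dimension."""
    return [1 if i == index else 0 for i in range(n_items)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

codes = [one_hot(i, 4) for i in range(4)]
# Every pair of distinct codes has dot product 0 (orthogonal).
all_orthogonal = all(
    dot(codes[i], codes[j]) == 0
    for i in range(4) for j in range(4) if i != j
)
```

The cost is one input node per item, which is why the note also mentions equal spacing along a lower number of dimensions as an alternative.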