Neural Network Playground

Build a neural network from scratch, choose a dataset pattern, configure hidden layers and activations, then train it. Watch the decision boundary form in real time on the canvas. No data leaves your browser.

How Neural Networks Work

A neural network is a function approximator composed of layers of interconnected neurons. Each neuron computes a weighted sum of its inputs, adds a bias term, and passes the result through a nonlinear activation function. By stacking layers, networks can represent very complex mappings from inputs to outputs. The universal approximation theorem formalizes part of this: a network with a single, sufficiently wide hidden layer and a suitable nonlinear activation can approximate any continuous function on a compact domain to arbitrary precision. The theorem guarantees that such a network exists, not that training will find it.
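The per-neuron computation described above can be sketched in a few lines. This is a minimal illustration, not the playground's code; the function name `neuron` and the input, weight, and bias values are made up for the example.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through a nonlinearity."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only (assumed, not taken from the playground).
out = neuron([0.5, -1.0], [0.8, 0.2], bias=0.1)
print(out)  # tanh(0.8*0.5 + 0.2*(-1.0) + 0.1) = tanh(0.3)
```

Swapping `activation` for another function changes the neuron's nonlinearity without touching the weighted sum.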

In the playground above, the network takes two input features (X1 and X2) representing coordinates on a 2D plane, processes them through one to three hidden layers, and outputs a single value that determines the classification. The decision boundary is the curve in the input space where the network's output transitions from one class to another.
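One way to picture a decision boundary is to evaluate a model on a grid of (X1, X2) points and threshold its output at 0.5. The sketch below assumes a hand-written stand-in function `model` instead of a trained network, so the boundary is known in advance (the unit circle); `classify_grid` and its parameters are hypothetical names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x1, x2):
    """Stand-in for a trained network: output is at least 0.5 inside the unit circle."""
    return sigmoid(4.0 * (1.0 - x1 * x1 - x2 * x2))

def classify_grid(predict, lo=-2.0, hi=2.0, steps=9):
    """Evaluate predict on a steps x steps grid and threshold at 0.5."""
    coords = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return [[1 if predict(x1, x2) >= 0.5 else 0 for x1 in coords]
            for x2 in coords]

grid = classify_grid(model)
for row in grid:
    # '#' marks class 1, '.' marks class 0; the boundary lies between them.
    print("".join("#" if c else "." for c in row))
```

A finer grid gives a smoother picture of the same boundary.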

Layers and Neurons

A neural network is organized into three types of layers. The input layer receives the raw features. In our playground, this is always two neurons (X1 and X2). Hidden layers perform the computation that transforms inputs into useful representations. Each hidden layer contains a configurable number of neurons, and each neuron is connected to every neuron in the previous layer (fully connected or dense layer). The output layer produces the final prediction, a single neuron with sigmoid activation for binary classification.
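The layer structure described above can be sketched as a forward pass through fully connected layers. This is an assumption-laden illustration: the helper names (`init_layer`, `dense`, `forward`), the uniform random initialization, and the choice of ReLU for hidden layers are all picked for the example and are not the playground's implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def init_layer(n_in, n_out, rng):
    """One dense layer: a row of weights and a bias for each output neuron."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(inputs, weights, biases, activation):
    """Every output neuron sees every input (fully connected)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Hidden layers use ReLU here; the single output neuron uses sigmoid."""
    a = x
    for weights, biases in layers[:-1]:
        a = dense(a, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(a, weights, biases, sigmoid)[0]

rng = random.Random(0)
sizes = [2, 4, 4, 1]  # X1, X2 -> two hidden layers of 4 neurons -> one output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
p = forward([0.3, -0.7], layers)
print(p)  # a value between 0 and 1, read as the probability of class 1
```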

The number of hidden layers and neurons per layer defines the network's capacity. A network with zero hidden layers is equivalent to logistic regression and can only learn linear decision boundaries. One hidden layer already allows nonlinear boundaries, and with enough neurons it can in principle approximate any decision region, including non-convex and disconnected ones. Additional hidden layers can often represent complex boundaries with fewer neurons. However, more capacity also means more parameters to train and a higher risk of overfitting.
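To make "more parameters" concrete, the count for a fully connected network can be worked out from the layer sizes alone: each layer has one weight per input-output pair plus one bias per output neuron. The function name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

no_hidden = count_parameters([2, 1])        # logistic regression: 2 weights + 1 bias
small = count_parameters([2, 4, 1])         # (2*4 + 4) + (4*1 + 1)
large = count_parameters([2, 8, 8, 8, 1])   # three hidden layers of 8 neurons
print(no_hidden, small, large)              # 3 17 177
```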

Activation Functions

Activation functions introduce nonlinearity into the network. Without them, stacking multiple layers would be equivalent to a single linear transformation, regardless of depth. Three common activations are available in this playground:

ReLU (Rectified Linear Unit): Outputs max(0, x). It is computationally efficient, does not saturate for positive values, and produces sparse activations. ReLU is the default choice for most hidden layers. Its main drawback is the "dying ReLU" problem where neurons can permanently output zero if they enter a regime where the input is always negative.
Sigmoid: Outputs 1 / (1 + exp(-x)), squashing values to the range (0, 1). Useful for output layers in binary classification, but problematic in hidden layers because gradients vanish for large positive or large negative inputs (saturation). As a result, training deep sigmoid networks is slow.
Tanh: Outputs (exp(x) - exp(-x)) / (exp(x) + exp(-x)), squashing to (-1, 1). Zero-centered output can improve convergence compared to sigmoid, but it still suffers from vanishing gradients at saturation. A reasonable choice when you need bounded activations in hidden layers.
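The three activations and their derivatives can be written out directly, which makes the saturation behaviour easy to see. This is a plain-Python sketch using the formulas above; the function names are chosen for the example, and the simple `sigmoid` form would overflow for very large negative inputs.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

for x in (-10.0, 0.0, 10.0):
    # Sigmoid and tanh gradients shrink toward zero at both extremes;
    # the ReLU gradient stays at 1 for any positive input.
    print(x, relu_grad(x), sigmoid_grad(x), tanh_grad(x))
```

At x = 0 the sigmoid gradient peaks at 0.25 and the tanh gradient at 1, which is one reason tanh can pass larger gradients than sigmoid near the origin.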

Backpropagation

Backpropagation is the algorithm that makes neural network training possible. It efficiently computes the gradient of the loss function with respect to every weight in the network by applying the chain rule of calculus backwards through the computation graph. The process has two phases:

Forward pass: Input data flows through the network layer by layer, producing activations at each layer and a final prediction at the output.
Backward pass: The error at the output is computed (using a loss function like binary cross-entropy). This error signal is then propagated backwards, computing the gradient of the loss with respect to each weight. Each weight is then adjusted in the direction opposite to its gradient, scaled by the learning rate.

Backpropagation has computational complexity linear in the number of parameters, making it practical even for networks with millions of weights. It requires storing the activations from the forward pass, which creates a memory-computation tradeoff.
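The two phases can be sketched for a tiny network with two inputs, two tanh hidden neurons, and one sigmoid output trained with binary cross-entropy. Everything here is an assumption made for illustration: the architecture, the parameter values, the learning rate of 0.1, and the helper names are not taken from the playground.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """Forward pass; returns the hidden activations and the output probability."""
    W1, b1, w2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    p = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, p

def loss(x, y, params):
    """Binary cross-entropy for a single example."""
    _, p = forward(x, params)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def backward(x, y, params):
    """Chain rule from output to input, reusing the activations h."""
    W1, b1, w2, b2 = params
    h, p = forward(x, params)
    dz2 = p - y                                  # dL/dz at the output (sigmoid + BCE)
    dw2 = [dz2 * hi for hi in h]
    db2 = dz2
    dz1 = [dz2 * w * (1.0 - hi * hi)             # tanh'(z) = 1 - tanh(z)^2
           for w, hi in zip(w2, h)]
    dW1 = [[d * xi for xi in x] for d in dz1]
    db1 = list(dz1)
    return dW1, db1, dw2, db2

def sgd_step(x, y, params, lr):
    """One gradient descent update using the backpropagated gradients."""
    W1, b1, w2, b2 = params
    dW1, db1, dw2, db2 = backward(x, y, params)
    W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, dW1)]
    b1 = [b - lr * g for b, g in zip(b1, db1)]
    w2 = [w - lr * g for w, g in zip(w2, dw2)]
    return W1, b1, w2, b2 - lr * db2

# Illustrative parameters and a single training example (assumed values).
params = ([[0.5, -0.3], [0.2, 0.8]], [0.0, 0.1], [0.7, -0.4], 0.05)
x, y = [1.0, 2.0], 1
before = loss(x, y, params)
after = loss(x, y, sgd_step(x, y, params, lr=0.1))
print(before, after)  # the loss on this example drops after one step
```

The backward pass needs `h` from the forward pass, which is the stored-activation cost mentioned above.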

Overfitting vs. Underfitting in Neural Networks

Overfitting occurs when the network memorizes the training data instead of learning the underlying pattern. Signs include: training loss continues to drop while validation loss increases, near-perfect training accuracy but poor test accuracy, and highly irregular decision boundaries that trace around individual data points. Causes include too many parameters relative to the amount of training data, training for too many epochs, and insufficient regularization.

Underfitting occurs when the network fails to capture the underlying pattern, typically because it lacks capacity. Signs include: both training and validation loss remain high, the decision boundary is too simple to separate the classes, and increasing training epochs does not improve performance. Causes include too few neurons or layers, excessive regularization, and a learning rate that is too low.

Common strategies to prevent overfitting include: dropout (randomly zeroing neurons during training), weight decay (L2 regularization), early stopping (halting training when validation loss starts to rise), data augmentation (generating synthetic training examples), and reducing network size. The last of these is easy to try in the playground above: compare a network with 8-8-8 neurons to one with 4-0-0 on the circle dataset. The larger network may reach slightly better training accuracy but tends to show a more irregular decision boundary.
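Early stopping, one of the strategies listed above, can be sketched as a check over a history of validation losses. The function name, the `patience` parameter, and the loss values below are invented for illustration; real training code would run this check once per epoch instead of over a finished list.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, value in enumerate(val_losses):
        if value < best:
            best, best_epoch = value, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improving at first, then rising.
history = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.50, 0.56]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In practice the weights from `best_epoch` are usually the ones kept, since later epochs have already started to overfit.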

Dataset Patterns Explained

The four datasets available in this playground illustrate different classification challenges:

Circle: Points inside a circle are one class, points outside are another. Requires at least one hidden layer to learn the radial decision boundary. A good starting test for neural networks.
XOR: The classic non-linearly-separable problem. Points in opposite quadrants share a class. A single perceptron cannot solve XOR, but one hidden layer with two or more neurons can. This is historically significant as it demonstrated the need for multi-layer networks.
Spiral: Two interleaved spirals, one per class. This is the hardest pattern, requiring multiple hidden layers or many neurons to learn the complex, winding decision boundary. A good stress test for network capacity.
Linear: A linearly separable dataset. Even a network with zero hidden layers (logistic regression) can solve this. Useful as a sanity check and baseline comparison.
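For readers who want to reproduce similar patterns offline, the four datasets can be approximated with simple generators. These are guesses at plausible constructions, not the playground's actual data code: the function names, the coordinate range of [-1, 1], the circle radius, and the spiral's turn count and noise level are all assumptions.

```python
import math
import random

def _uniform_point(rng):
    return rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)

def make_circle(n, rng, radius=0.6):
    """Class 1 inside the circle, class 0 outside."""
    points = []
    for _ in range(n):
        x1, x2 = _uniform_point(rng)
        points.append((x1, x2, 1 if math.hypot(x1, x2) < radius else 0))
    return points

def make_xor(n, rng):
    """Opposite quadrants share a class: the label is the sign of x1 * x2."""
    points = []
    for _ in range(n):
        x1, x2 = _uniform_point(rng)
        points.append((x1, x2, 1 if x1 * x2 > 0 else 0))
    return points

def make_linear(n, rng):
    """Separable by the straight line x1 + x2 = 0."""
    points = []
    for _ in range(n):
        x1, x2 = _uniform_point(rng)
        points.append((x1, x2, 1 if x1 + x2 > 0 else 0))
    return points

def make_spiral(n, rng, turns=1.5, noise=0.02):
    """Two interleaved arms; the second is the first rotated by 180 degrees."""
    points = []
    for i in range(n):
        label = i % 2
        t = rng.random()                      # position along the arm, 0 to 1
        angle = t * turns * 2.0 * math.pi + label * math.pi
        x1 = t * math.cos(angle) + rng.gauss(0.0, noise)
        x2 = t * math.sin(angle) + rng.gauss(0.0, noise)
        points.append((x1, x2, label))
    return points

rng = random.Random(42)
datasets = {
    "circle": make_circle(200, rng),
    "xor": make_xor(200, rng),
    "spiral": make_spiral(200, rng),
    "linear": make_linear(200, rng),
}
```

Each dataset is a list of `(x1, x2, label)` tuples with labels 0 or 1.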

Practical Tips for Neural Networks

Always start with a simple architecture and increase complexity only if needed.
Use ReLU for hidden layers, and sigmoid (binary) or softmax (multi-class) for classification output layers.
Initialize weights with He initialization (for ReLU) or Xavier initialization (for tanh/sigmoid).
Monitor both training and validation loss. A widening gap between them indicates overfitting.
Use batch normalization to stabilize training in deeper networks.
Learning rate is the most important hyperparameter. Use a scheduler or finder to tune it.
When stuck, try changing the architecture before extensively tuning hyperparameters.
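The two initialization schemes named in the tips above differ only in how they scale the random weights. The sketch below uses the common formulations (a Gaussian with standard deviation sqrt(2 / fan_in) for He, a uniform range of plus or minus sqrt(6 / (fan_in + fan_out)) for Xavier); the function names and layer sizes are chosen for the example.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """He initialization: zero-mean Gaussian with std sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) initialization: uniform in [-limit, limit]."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
relu_layer = he_normal(fan_in=2, fan_out=8, rng=rng)        # for a ReLU layer
tanh_layer = xavier_uniform(fan_in=8, fan_out=8, rng=rng)   # for a tanh or sigmoid layer
```

Both schemes aim to keep the scale of activations roughly stable from layer to layer at the start of training.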

Frequently Asked Questions

What is a neural network?

A neural network is a computational model inspired by biological neurons. It consists of layers of interconnected nodes that transform input data through weighted connections and nonlinear activation functions. By adjusting weights during training via backpropagation, neural networks learn to approximate complex functions for classification, regression, and other tasks.

What do hidden layers do in a neural network?

Hidden layers enable the network to learn hierarchical, nonlinear representations of the input data. The first hidden layer learns simple features while deeper layers combine these into increasingly abstract representations. More hidden layers allow the network to model more complex functions, but also increase the risk of overfitting.

What is the difference between ReLU, sigmoid, and tanh activations?

ReLU outputs max(0, x) and is the most widely used because it trains faster and avoids vanishing gradients for positive values. Sigmoid squashes outputs to (0, 1) and is used for binary output layers. Tanh squashes to (-1, 1) with zero-centered output. Sigmoid and tanh both suffer from vanishing gradients for extreme values.

What is backpropagation?

Backpropagation is the algorithm used to compute gradients of the loss function with respect to each weight in the network. It applies the chain rule of calculus backwards through the network layers, computing how much each weight contributed to the error. These gradients are then used to update weights via gradient descent.

How many hidden layers and neurons should I use?

Start simple: one hidden layer with a number of neurons between the input and output sizes. For most tabular data, 1-2 hidden layers with 32-128 neurons each work well. Use validation performance to guide decisions. Too many neurons can lead to overfitting; too few can lead to underfitting.

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026