A Minimal ISP Walkthrough: From Bayer RAW to Display Image

If you strip a camera pipeline to the essentials, a basic Image Signal Processor (ISP) can be viewed as a sequence of simple transformations:

Sample light with a Bayer color filter array (CFA)
Convert analog sensor signal with an ADC
Apply black level correction
Process in the linear RAW domain
Apply white-balance gains
Demosaic to full RGB
Apply a basic tone mapper
Gamma-encode for display

Minimal ISP pipeline from sensor to display

Bayer data: what the sensor actually captures

Most sensors do not measure full RGB at each pixel. Instead, each photosite sits under one color filter (typically RGGB):

red sample
green sample
green sample
blue sample

This means each pixel location contains only one color channel at first.

RGGB Bayer pattern example

Why two greens? Human vision is more sensitive to luminance detail, and luminance is strongly represented by green.

ADC: analog to digital conversion

Sensor output starts as an analog charge/current and is converted by an ADC (Analog-to-Digital Converter) to integer code values.

For a 12-bit ADC, values are usually in $[0, 4095]$ before calibration. Higher bit depth allows finer quantization.

At this stage, the signal is still typically considered RAW and linear with respect to scene radiance.

Black level correction

Many pipelines apply black level correction immediately after ADC (or as part of RAW normalization):

$$ x_{\text{blc}} = \max(0, x_{\text{adc}} - b) $$

where $b$ is the black-level offset estimated from optical-black pixels or calibration tables.
This removes the sensor/electronics offset so that “no light” maps close to zero.

Linear domain data

In the linear domain, doubling incoming light roughly doubles signal value.

That linearity is important for physically meaningful operations:

scaling channels for white balance,
some denoising/statistics steps,
HDR merges and exposure fusion,
camera calibration math.

A common beginner mistake is to do physically based operations after gamma; for foundational ISP steps, linear is usually the safest place.

White balancing (basic gain model)

A simple white balance model is per-channel gain:

$$ R’ = g_R R,\quad G’ = g_G G,\quad B’ = g_B B $$

You can interpret this as compensating the scene illuminant (for example, warm indoor light that pushes red/yellow).

In a minimal ISP, gains are often computed from:

metadata from auto white balance (AWB), or
a simple gray-world assumption.

Gray-world in a little more detail

The gray-world idea assumes that, averaged over a sufficiently diverse scene, the mean color should be achromatic (roughly equal channel means).

Let channel means in linear RAW be $\mu_R, \mu_G, \mu_B$. A common target is to match all channels to the green mean:

$$ g_R = \frac{\mu_G}{\mu_R},\quad g_G = 1,\quad g_B = \frac{\mu_G}{\mu_B} $$

Then apply gains to each corresponding Bayer sample (or post-demosaic channel, depending on pipeline design).

To make gray-world more robust, many implementations ignore saturated pixels (and often near-black noisy pixels) when computing $\mu_R, \mu_G, \mu_B$.
For example, skip samples above a threshold like:

$$ x \ge \tau_{\text{sat}} \approx (2^N - 1) - m $$

for an $N$-bit sensor and margin $m$. Saturated values are clipped and no longer represent true scene chromaticity, so including them biases gain estimates.

This step is usually done before demosaicing or early in the pipeline depending on implementation details.

Demosaicing

Because Bayer gives one channel per pixel location, we must reconstruct the missing two channels to get full RGB at every output pixel. That process is demosaicing.

Basic options:

nearest neighbor (very fast, lower quality),
bilinear interpolation (classic baseline),
edge-aware methods (better quality, more compute).

For a “most basic ISP,” bilinear interpolation is a good conceptual and practical baseline.

Basic tone mapping

Linear camera data has wide dynamic range relative to display range. Tone mapping compresses highlights and shapes contrast to look natural on SDR displays.

Simple tone mappers often used for teaching:

Reinhard global: $f(x) = \frac{x}{1+x}$
filmic-like curves (piecewise or polynomial)
simple shoulder roll-off with optional exposure scaling

Even a simple global operator helps preserve highlight detail better than hard clipping.

Gamma correction for display

Displays and human perception are non-linear. To store/render efficiently for standard displays, we apply a gamma-like transfer (or sRGB OETF approximation).

Conceptually, a simplified encoding form is:

$$ \text{out} = \text{in}^{1/\gamma},\quad \gamma \approx 2.2 $$

Linear vs gamma-encoded relationship

After gamma encoding, mid-tones receive more code precision where human vision is more sensitive.

Putting it all together

A minimal end-to-end chain looks like:

Bayer capture -> ADC -> linear RAW -> white balance -> demosaic -> tone map -> gamma encode

That is enough to understand the core ISP ideas before adding advanced blocks like denoise, sharpening, color correction matrices, local tone mapping, temporal processing, and HDR-specific logic.