If you strip a camera pipeline to the essentials, a basic Image Signal Processor (ISP) can be viewed as a sequence of simple transformations:
- Sample light with a Bayer color filter array (CFA)
- Convert analog sensor signal with an ADC
- Apply black level correction
- Process in the linear RAW domain
- Apply white-balance gains
- Demosaic to full RGB
- Apply a basic tone mapper
- Gamma-encode for display
Bayer data: what the sensor actually captures
Most sensors do not measure full RGB at each pixel. Instead, each photosite sits under a single color filter, and the filters repeat in a 2x2 tile (typically RGGB):
- red sample (top-left)
- green sample (top-right)
- green sample (bottom-left)
- blue sample (bottom-right)
This means each pixel location initially contains only one color channel.
Why two greens? Human vision is more sensitive to luminance detail than to color detail, and the green channel contributes most to perceived luminance.
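To make the sampling pattern concrete, here is a minimal sketch that maps a pixel position to the channel it samples and builds a mosaic from a full RGB image. The function names `bayer_channel` and `mosaic` are hypothetical, the layout is assumed to be RGGB with the red sample at the top-left, and plain Python lists are used only to keep the example dependency-free.

```python
# Hypothetical helpers; RGGB layout assumed, with (0, 0) at the top-left.
def bayer_channel(row, col):
    """Return which color an RGGB photosite at (row, col) samples."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def mosaic(rgb):
    """Keep one channel per pixel from a full RGB image (rows of (r, g, b) tuples)."""
    index = {"R": 0, "G": 1, "B": 2}
    return [
        [pixel[index[bayer_channel(r, c)]] for c, pixel in enumerate(row)]
        for r, row in enumerate(rgb)
    ]


# A flat 2x2 patch where every pixel is (r, g, b) = (10, 20, 30).
patch = [[(10, 20, 30)] * 2 for _ in range(2)]
print(mosaic(patch))  # [[10, 20], [20, 30]]
```

Half of the samples in any even-sized region are green, which is the "two greens" trade-off described above.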
ADC: analog to digital conversion
Each photosite accumulates charge, which is read out as an analog voltage and converted by an ADC (Analog-to-Digital Converter) to integer code values.
For a 12-bit ADC, values are usually in $[0, 4095]$ before calibration. Higher bit depth allows finer quantization.
At this stage, the signal is still typically considered RAW and, apart from the black-level offset, linear with respect to scene radiance.
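As a rough illustration of quantization, the sketch below models an idealized ADC: it is perfectly linear, has no noise, and clips out-of-range input. The name `adc_quantize` and the `full_scale` parameter are assumptions made for this example, not part of any real sensor interface.

```python
def adc_quantize(signal, full_scale=1.0, bits=12):
    """Map an analog level in [0, full_scale] to an integer code in [0, 2**bits - 1]."""
    max_code = (1 << bits) - 1
    code = round(signal / full_scale * max_code)
    return min(max(code, 0), max_code)  # clip anything outside the input range


print(adc_quantize(0.0))           # 0
print(adc_quantize(1.0))           # 4095
print(adc_quantize(1.5))           # 4095 (saturated)
print(adc_quantize(1.0, bits=10))  # 1023
```

Going from 10 to 12 bits quadruples the number of available codes over the same input range, which is what "finer quantization" means here.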
Black level correction
Many pipelines apply black level correction immediately after ADC (or as part of RAW normalization):
$$ x_{\text{blc}} = \max(0, x_{\text{adc}} - b) $$
where $b$ is the black-level offset estimated from optical-black pixels or calibration tables.
This removes the sensor/electronics offset so that “no light” maps close to zero.
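The formula above translates almost directly into code. This is a minimal sketch with a single global black level (real sensors often have one offset per Bayer channel); `black_level_correct` and `normalize` are hypothetical names, and the sample values are made up.

```python
def black_level_correct(raw, black_level):
    """Subtract the black-level offset and clamp at zero, per the formula above."""
    return [[max(0, x - black_level) for x in row] for row in raw]


def normalize(raw_blc, black_level, bits=12):
    """Optional: scale corrected codes to [0, 1] using the remaining code range."""
    white = (1 << bits) - 1 - black_level
    return [[x / white for x in row] for row in raw_blc]


raw = [[60, 64, 1088], [4095, 70, 64]]  # made-up 12-bit codes
blc = black_level_correct(raw, black_level=64)
print(blc)  # [[0, 0, 1024], [4031, 6, 0]]
```

Note that the clamp discards noise below the black level; some pipelines keep signed values until after denoising to avoid biasing dark regions.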
Linear domain data
In the linear domain, doubling incoming light roughly doubles the signal value (after black level subtraction and below saturation).
That linearity is important for physically meaningful operations:
- scaling channels for white balance,
- some denoising/statistics steps,
- HDR merges and exposure fusion,
- camera calibration math.
A common beginner mistake is to apply physically based operations to gamma-encoded data; for foundational ISP steps, linear is usually the safest place.
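A small numeric experiment shows why the domain matters. The sketch below averages a black pixel and a white pixel (as a downscaler might) in both domains, assuming a simplified pure power-law gamma of 2.2; the helper names are hypothetical.

```python
GAMMA = 2.2  # simplified display gamma, an assumption for this demo


def encode(linear):
    return linear ** (1.0 / GAMMA)


def decode(encoded):
    return encoded ** GAMMA


dark, bright = 0.0, 1.0  # linear light values

# Physically meaningful: average the light, then encode for display.
avg_linear = encode((dark + bright) / 2)

# Common mistake: average the already-encoded values.
avg_encoded = (encode(dark) + encode(bright)) / 2

print(round(avg_linear, 3))           # 0.73
print(round(avg_encoded, 3))          # 0.5
print(round(decode(avg_encoded), 3))  # 0.218
```

Averaging the encoded values yields a pixel that represents only about 22% of full light instead of the expected 50%, so the result looks too dark.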
White balancing (basic gain model)
A simple white balance model is per-channel gain:
$$ R' = g_R R,\quad G' = g_G G,\quad B' = g_B B $$
You can interpret this as compensating for the scene illuminant (for example, warm indoor light that shifts the image toward red/yellow).
In a minimal ISP, gains are often computed from:
- metadata from auto white balance (AWB), or
- a simple gray-world assumption.
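Applying the gain model directly to a Bayer mosaic only requires knowing which channel each sample belongs to. The following is a minimal sketch, assuming an RGGB layout and gains that are already known; `apply_wb_bayer` is a hypothetical name and the input values are made up.

```python
def apply_wb_bayer(raw, g_r, g_g, g_b):
    """Scale each RGGB sample by the gain of the channel it belongs to."""
    gains = ((g_r, g_g), (g_g, g_b))  # gain lookup for the 2x2 RGGB tile
    return [
        [x * gains[r % 2][c % 2] for c, x in enumerate(row)]
        for r, row in enumerate(raw)
    ]


raw = [[100, 200], [200, 400]]  # made-up linear values: R, G / G, B
print(apply_wb_bayer(raw, g_r=2.0, g_g=1.0, g_b=0.5))
# [[200.0, 200.0], [200.0, 200.0]]
```

A practical implementation would also clip the result to the white level, since gains above 1 can push bright samples out of range.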
Gray-world in a little more detail
The gray-world idea assumes that, averaged over a sufficiently diverse scene, the mean color should be achromatic (roughly equal channel means).
Let channel means in linear RAW be $\mu_R, \mu_G, \mu_B$. A common target is to match all channels to the green mean:
$$ g_R = \frac{\mu_G}{\mu_R},\quad g_G = 1,\quad g_B = \frac{\mu_G}{\mu_B} $$
Then apply gains to each corresponding Bayer sample (or post-demosaic channel, depending on pipeline design).
To make gray-world more robust, many implementations ignore saturated pixels (and often near-black noisy pixels) when computing $\mu_R, \mu_G, \mu_B$.
For example, skip samples above a threshold like:
$$ x \ge \tau_{\text{sat}} \approx (2^N - 1) - m $$
for an $N$-bit sensor and margin $m$. Saturated values are clipped and no longer represent true scene chromaticity, so including them biases gain estimates.
White balance (both gain estimation and application) is usually done before demosaicing, or otherwise early in the pipeline, depending on implementation details.
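The gray-world estimate with a saturation mask can be sketched as follows. This assumes an RGGB mosaic of black-level-corrected integer codes; `gray_world_gains`, `sat_margin`, and `dark_floor` are hypothetical names, and the margin of 16 codes is an arbitrary illustrative choice rather than a recommended value.

```python
def gray_world_gains(raw, bits=12, sat_margin=16, dark_floor=0):
    """Estimate (g_R, g_G, g_B) from an RGGB mosaic, skipping unreliable samples."""
    sat_threshold = (1 << bits) - 1 - sat_margin
    sums = {"R": 0.0, "G": 0.0, "B": 0.0}
    counts = {"R": 0, "G": 0, "B": 0}
    tile = (("R", "G"), ("G", "B"))
    for r, row in enumerate(raw):
        for c, x in enumerate(row):
            if x >= sat_threshold or x <= dark_floor:
                continue  # clipped or near-black: not trustworthy
            ch = tile[r % 2][c % 2]
            sums[ch] += x
            counts[ch] += 1
    if min(counts.values()) == 0:
        return 1.0, 1.0, 1.0  # some channel has no usable samples: fall back to unity
    mu = {ch: sums[ch] / counts[ch] for ch in sums}
    return mu["G"] / mu["R"], 1.0, mu["G"] / mu["B"]


# The 4095 red sample is saturated and is excluded from the red mean.
raw = [[100, 200, 4095, 200], [200, 400, 200, 400]]
print(gray_world_gains(raw))  # (2.0, 1.0, 0.5)
```

Without the mask, the clipped red sample would raise the red mean and make the red gain far too small, which is the bias described above.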
Demosaicing
Because Bayer gives one channel per pixel location, we must reconstruct the missing two channels to get full RGB at every output pixel. That process is demosaicing.
Basic options:
- nearest neighbor (very fast, lower quality),
- bilinear interpolation (classic baseline),
- edge-aware methods (better quality, more compute).
For a “most basic ISP,” bilinear interpolation is a good conceptual and practical baseline.
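A bilinear baseline can be written compactly by keeping each measured sample and filling the two missing channels with the average of the same-channel samples among the eight neighbors. The sketch below assumes an RGGB layout and an image of at least 2x2 pixels; at the borders it simply averages whatever neighbors exist. `demosaic_bilinear` is a hypothetical name, and the nested loops favor clarity over speed.

```python
TILE = (("R", "G"), ("G", "B"))  # RGGB layout assumed
CHANNELS = ("R", "G", "B")


def demosaic_bilinear(raw):
    """Return rows of (R, G, B) tuples reconstructed from an RGGB mosaic."""
    height, width = len(raw), len(raw[0])
    out = []
    for r in range(height):
        out_row = []
        for c in range(width):
            pixel = []
            for ch in CHANNELS:
                if TILE[r % 2][c % 2] == ch:
                    pixel.append(float(raw[r][c]))  # measured sample: keep as-is
                    continue
                # Missing channel: average same-channel samples in the 3x3 window.
                neighbors = [
                    raw[rr][cc]
                    for rr in range(max(r - 1, 0), min(r + 2, height))
                    for cc in range(max(c - 1, 0), min(c + 2, width))
                    if TILE[rr % 2][cc % 2] == ch
                ]
                pixel.append(sum(neighbors) / len(neighbors))
            out_row.append(tuple(pixel))
        out.append(out_row)
    return out


# Mosaic of a flat color (R, G, B) = (10, 20, 30): every output pixel recovers it.
flat = [[10, 20, 10, 20], [20, 30, 20, 30]] * 2
print(demosaic_bilinear(flat)[1][2])  # (10.0, 20.0, 30.0)
```

On flat regions this reconstruction is exact; its weaknesses (color fringing and zippering) appear at edges, which is what edge-aware methods address.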
Basic tone mapping
Linear camera data typically spans a wider dynamic range than an SDR display can reproduce. Tone mapping compresses highlights and shapes contrast so the result looks natural on SDR displays.
Simple tone mappers often used for teaching:
- Reinhard global: $f(x) = \frac{x}{1+x}$
- filmic-like curves (piecewise or polynomial)
- simple shoulder roll-off with optional exposure scaling
Even a simple global operator helps preserve highlight detail better than hard clipping.
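To see the difference from hard clipping, the sketch below compares both on a few linear values. It uses the Reinhard global form given above with an optional exposure multiplier; the `exposure` parameter and the function names are assumptions for illustration.

```python
def reinhard(x, exposure=1.0):
    """Reinhard global operator on a non-negative linear value, with exposure scaling."""
    x = max(x, 0.0) * exposure
    return x / (1.0 + x)


def hard_clip(x):
    return min(max(x, 0.0), 1.0)


for value in (0.25, 1.0, 4.0, 16.0):
    print(value, hard_clip(value), round(reinhard(value), 3))
# 0.25 0.25 0.2
# 1.0 1.0 0.5
# 4.0 1.0 0.8
# 16.0 1.0 0.941
```

Hard clipping maps 1.0, 4.0, and 16.0 to the same output, while the Reinhard curve keeps them distinguishable, at the cost of darkening mid-tones (1.0 maps to 0.5).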
Gamma correction for display
Display response and human brightness perception are both non-linear. To store and render images efficiently for standard displays, we apply a gamma-like transfer function (the sRGB OETF or a power-law approximation of it).
Conceptually, a simplified encoding form is:
$$ \text{out} = \text{in}^{1/\gamma},\quad \gamma \approx 2.2 $$
After gamma encoding, shadows and mid-tones receive more code values, which is where human vision is more sensitive to relative brightness differences.
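Here is a minimal sketch of both options: the simplified power law from the formula above and the piecewise sRGB encoding curve (a short linear segment near black followed by a 1/2.4 power segment). The function names are hypothetical, and input is assumed to be linear and already tone-mapped into [0, 1].

```python
def gamma_encode(x, gamma=2.2):
    """Simplified power-law encoding of a linear value in [0, 1]."""
    x = min(max(x, 0.0), 1.0)
    return x ** (1.0 / gamma)


def srgb_encode(x):
    """Piecewise sRGB encoding: linear segment near black, then a 1/2.4 power segment."""
    x = min(max(x, 0.0), 1.0)
    if x <= 0.0031308:
        return 12.92 * x
    return 1.055 * x ** (1.0 / 2.4) - 0.055


def to_8bit(v):
    return round(v * 255)


# 18% linear gray lands near the middle of the 8-bit range with either curve.
print(to_8bit(gamma_encode(0.18)), to_8bit(srgb_encode(0.18)))  # 117 118
```

The two curves agree closely in the mid-tones and differ mainly near black, where the sRGB linear segment avoids the infinite slope of a pure power law at zero.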
Putting it all together
A minimal end-to-end chain looks like:
Bayer capture -> ADC -> black level correction -> linear RAW -> white balance -> demosaic -> tone map -> gamma encode
That is enough to understand the core ISP ideas before adding advanced blocks like denoise, sharpening, color correction matrices, local tone mapping, temporal processing, and HDR-specific logic.
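As a recap, the whole chain fits in one short function. This is a toy sketch under several assumptions, not a reference implementation: an RGGB mosaic of 12-bit codes, one global black level of 64, gray-world white balance with a simple clipping mask, bilinear demosaicing, Reinhard tone mapping, and a pure power-law gamma. The name `minimal_isp` and all parameter defaults are hypothetical.

```python
TILE = (("R", "G"), ("G", "B"))  # RGGB layout assumed


def minimal_isp(raw, black_level=64, bits=12, gamma=2.2):
    """Run a toy RAW-to-display chain on an RGGB mosaic of integer ADC codes."""
    height, width = len(raw), len(raw[0])
    white = (1 << bits) - 1 - black_level

    # 1. Black level correction, then normalization to linear [0, 1].
    lin = [[max(0, x - black_level) / white for x in row] for row in raw]

    # 2. Gray-world white balance, ignoring zero and near-clipped samples.
    sums = {"R": 0.0, "G": 0.0, "B": 0.0}
    counts = {"R": 0, "G": 0, "B": 0}
    for r in range(height):
        for c in range(width):
            if 0.0 < lin[r][c] < 0.98:
                ch = TILE[r % 2][c % 2]
                sums[ch] += lin[r][c]
                counts[ch] += 1
    if min(counts.values()) > 0:
        mu = {ch: sums[ch] / counts[ch] for ch in sums}
        gain = {"R": mu["G"] / mu["R"], "G": 1.0, "B": mu["G"] / mu["B"]}
    else:
        gain = {"R": 1.0, "G": 1.0, "B": 1.0}  # no usable statistics
    lin = [[lin[r][c] * gain[TILE[r % 2][c % 2]] for c in range(width)]
           for r in range(height)]

    # 3. Bilinear demosaic: keep measured samples, average neighbors otherwise.
    def channel_at(r, c, ch):
        if TILE[r % 2][c % 2] == ch:
            return lin[r][c]
        vals = [lin[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, height))
                for cc in range(max(c - 1, 0), min(c + 2, width))
                if TILE[rr % 2][cc % 2] == ch]
        return sum(vals) / len(vals)

    # 4. Reinhard tone map, gamma encode, and quantize to 8 bits.
    def display(x):
        toned = x / (1.0 + x)
        return round(255 * toned ** (1.0 / gamma))

    return [[tuple(display(channel_at(r, c, ch)) for ch in "RGB")
             for c in range(width)] for r in range(height)]


# A flat, warm-tinted 4x4 scene (made-up codes): R > G > B before white balance.
raw = [[1088, 576] * 2, [576, 320] * 2] * 2
out = minimal_isp(raw)
print(out[0][0])  # a neutral gray triple (R == G == B) is expected
```

Because the test scene is uniform, gray-world removes the warm cast completely; on real images it is only an approximation, which is one reason production pipelines rely on AWB metadata and color correction matrices.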