Density Plot from a Single Random Variable

Kernel Density Estimation (KDE) is a powerful method for estimating the probability density function (PDF) of a random variable. As a non-parametric approach, KDE lets us visualize the distribution of a dataset without assuming that the data follow a particular parametric distribution. This provides a clear picture of how data points are spread out across a range.

For example, consider a set of data points from a random variable spread across a continuous range. To gain a higher-level understanding of how these data points are distributed, we can create a density plot using KDE. This technique generates a smooth curve whose height represents how likely data points are to occur near any given point in the range, offering a more refined view than a simple histogram.

In simpler terms, KDE smooths out the distribution of data points over a continuous range. Instead of only counting how many data points fall into each bin (as a histogram does), it provides a continuous curve that shows where data points are most likely to appear.

Density plots are versatile tools in data analysis: they provide insight into how data are distributed, help identify outliers, and reveal patterns and underlying structure. For instance, they can guide how a variable is quantized for a heatmap, helping to create color quantiles that visually represent data concentration.

Let’s first look at the final code and then we will go through the details:

Data

  • You can download the data here
<!-- DensityPlot.svelte -->
<script>
  import { scaleLinear } from 'd3-scale';
  import { line, curveBasis } from 'd3-shape';
  import { csv } from 'd3-fetch';
  import { mean, max } from 'd3-array';
  import AxisLeft from './AxisLeftV5.svelte';
  import AxisBottom from './AxisBottomV5.svelte';
  import { onMount } from 'svelte';

  // State variables
  let data = $state([]);
  let width = $state(0);
  const height = 400;
  const margin = { top: 30, right: 30, bottom: 30, left: 80 };

  // Load data (d3-fetch's csv parses every field as a string)
  onMount(() => {
    csv('/data/price.csv').then((csvData) => {
      data = csvData;
    });
  });

  // Helper functions for kernel density estimation
  function kernelDensityEstimator(kernel, X) {
    return function (V) {
      return X.map(function (x) {
        return [
          x,
          mean(V, function (v) {
            return kernel(x - v);
          })
        ];
      });
    };
  }

  function Epanechnikov(bandwidth) {
    return function (v) {
      return Math.abs((v /= bandwidth)) <= 1
        ? (0.75 * (1 - v * v)) / bandwidth
        : 0;
    };
  }

  // Computed values
  let x = $derived(
    width
      ? scaleLinear()
          .domain([0, 1000])
          .range([margin.left, width - margin.right])
      : null
  );

  let density = $derived(
    data.length && x
      ? kernelDensityEstimator(
          Epanechnikov(7),
          x.ticks(40)
        )(data.map((d) => +d.price))
      : []
  );

  let y = $derived(
    density.length
      ? scaleLinear()
          .range([height - margin.bottom, margin.top])
          .domain([0, max(density, (d) => d[1])])
      : null
  );

  let lineGenerator = $derived(
    x && y
      ? line()
          .curve(curveBasis)
          .x((d) => x(d[0]))
          .y((d) => y(d[1]))
      : null
  );
</script>

<div class="wrapper" bind:clientWidth={width}>
  {#if data && width && x && y && density.length > 0 && lineGenerator}
    <svg {width} {height}>
      <!-- x and y axes -->
      <AxisLeft {height} {margin} yScale={y} />
      <AxisBottom {width} {height} {margin} xScale={x} ticksNumber={10} />

      <!-- Density Plot -->
      <path
        d={lineGenerator(density)}
        fill="#fcd34d"
        opacity="0.8"
        stroke="#000"
        stroke-width="1"
        stroke-linejoin="round" />
    </svg>
  {/if}
</div>

Let’s look at the two most important functions used here: kernelDensityEstimator and Epanechnikov. Together they convert a single variable, such as the price data, into a density estimate, which is then drawn as a density plot.

kernelDensityEstimator function

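This function generates the density estimates by applying a kernel function over the data points. It takes a kernel and an array of evaluation points (X), and returns a function that, when given a dataset (V), returns an array of [x, y] pairs, where x is a point in the domain (e.g., price values) and y is the estimated density at that point. For each x, the estimate is the mean of the kernel weights kernel(x - v) over all data values v.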

function kernelDensityEstimator(kernel, X) {
  return function (V) {
    return X.map(function (x) {
      return [
        x,
        mean(V, function (v) {
          return kernel(x - v);
        })
      ];
    });
  };
}
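
To make the return value concrete, here is a minimal sketch that runs the estimator on a tiny dataset. The mean helper is a simplified stand-in for the one imported from d3-array so the snippet runs without D3, and boxKernel and the sample values are hypothetical, chosen only to keep the arithmetic easy to follow.

```javascript
// Simplified stand-in for d3-array's mean(values, accessor); assumption for this sketch only.
function mean(values, accessor) {
  let sum = 0;
  for (const v of values) sum += accessor(v);
  return values.length ? sum / values.length : undefined;
}

function kernelDensityEstimator(kernel, X) {
  return function (V) {
    return X.map((x) => [x, mean(V, (v) => kernel(x - v))]);
  };
}

// Hypothetical box kernel: weight 0.5 within 1 unit of the evaluation point, 0 elsewhere.
const boxKernel = (u) => (Math.abs(u) <= 1 ? 0.5 : 0);

// Evaluate the density at x = 0, 2 and 4 for the data values 0, 0 and 4.
const estimate = kernelDensityEstimator(boxKernel, [0, 2, 4]);
const result = estimate([0, 0, 4]);

// Each entry is an [x, density] pair. Two of the three values sit at x = 0,
// one sits at x = 4, and none is within 1 unit of x = 2.
console.log(result);
```

The density is highest at x = 0, where two of the three values sit, and zero at x = 2, which no value is close to.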

Epanechnikov function

The Epanechnikov function defines a specific type of kernel called the Epanechnikov kernel. A kernel is a function that determines the weight given to each data point when estimating the density at a particular point x. The function takes a bandwidth parameter that controls the smoothness of the resulting curve: smaller bandwidth values result in a sharper curve, while larger values produce a smoother one.

function Epanechnikov(bandwidth) {
  return function (v) {
    return Math.abs((v /= bandwidth)) <= 1
      ? (0.75 * (1 - v * v)) / bandwidth
      : 0;
  };
}
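
The effect of the bandwidth can be checked directly on the kernel. The sketch below compares two bandwidths (7 is the value used in this chart, 50 is an arbitrary wider value picked for contrast) and approximates the area under each kernel with a midpoint sum. The areaUnder helper is hypothetical and exists only for this illustration.

```javascript
function Epanechnikov(bandwidth) {
  return function (v) {
    return Math.abs((v /= bandwidth)) <= 1
      ? (0.75 * (1 - v * v)) / bandwidth
      : 0;
  };
}

// Hypothetical helper: midpoint Riemann sum of a kernel over [-limit, limit].
function areaUnder(kernel, limit, steps = 10000) {
  const dx = (2 * limit) / steps;
  let area = 0;
  for (let i = 0; i < steps; i++) {
    area += kernel(-limit + (i + 0.5) * dx) * dx;
  }
  return area;
}

const narrow = Epanechnikov(7);
const wide = Epanechnikov(50);

// Peak height is 0.75 / bandwidth, so the narrow kernel is taller.
console.log(narrow(0), wide(0));

// A data point 7.5 units away gets no weight from the narrow kernel,
// but still gets some weight from the wide one.
console.log(narrow(7.5), wide(7.5));

// Dividing by the bandwidth keeps the area under each kernel close to 1.
console.log(areaUnder(narrow, 100), areaUnder(wide, 100));
```

A narrow kernel is tall and only reaches nearby data points, which gives a sharper curve. A wide kernel is low and reaches far, which gives a smoother one.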

Here is how we use these functions in our code:

// Computed values
let x = $derived(
  width
    ? scaleLinear()
        .domain([0, 1000])
        .range([margin.left, width - margin.right])
    : null
);

let density = $derived(
  data.length && x
    ? kernelDensityEstimator(
        Epanechnikov(7),
        x.ticks(40)
      )(data.map((d) => +d.price))
    : []
);

let y = $derived(
  density.length
    ? scaleLinear()
        .range([height - margin.bottom, margin.top])
        .domain([0, max(density, (d) => d[1])])
    : null
);

let lineGenerator = $derived(
  x && y
    ? line()
        .curve(curveBasis)
        .x((d) => x(d[0]))
        .y((d) => y(d[1]))
    : null
);

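The x.ticks(40) part is a method call on a D3 scale (x) that returns an array of “ticks”: evenly spaced, nicely rounded values across the domain of the scale. The count is a hint rather than an exact number, so for the domain [0, 1000] the ticks come out 20 apart (0, 20, 40, and so on up to 1000).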
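
One thing that may be worth checking with this setup is how the spacing of the evaluation points compares with the bandwidth. In the logged density array the points are 20 apart, while the kernel with bandwidth 7 gives zero weight to anything 7 or more units away. The sketch below counts how many evaluation points "see" a given data value. The grid and pointsReached helpers are hypothetical, and grid is only a plain-JS stand-in for the tick array.

```javascript
function Epanechnikov(bandwidth) {
  return function (v) {
    return Math.abs((v /= bandwidth)) <= 1
      ? (0.75 * (1 - v * v)) / bandwidth
      : 0;
  };
}

// Hypothetical stand-in for the tick array: evenly spaced evaluation points.
function grid(start, stop, step) {
  const points = [];
  for (let p = start; p <= stop; p += step) points.push(p);
  return points;
}

// Count the evaluation points at which a data value receives a non-zero weight.
function pointsReached(kernel, evaluationPoints, value) {
  return evaluationPoints.filter((p) => kernel(p - value) > 0).length;
}

const kernel = Epanechnikov(7);
const coarse = grid(0, 1000, 20); // spacing seen in the logged output
const fine = grid(0, 1000, 5); // assumed finer spacing, for comparison

console.log(pointsReached(kernel, coarse, 30)); // 0: 30 is 10 units from both 20 and 40
console.log(pointsReached(kernel, coarse, 42)); // 1: only the point at 40 is close enough
console.log(pointsReached(kernel, fine, 30)); // 3: the points at 25, 30 and 35
```

Under these assumptions, a value that falls midway between two coarse evaluation points does not contribute to the curve at all. If that matters for your data, a larger bandwidth or a finer set of evaluation points are two possible adjustments.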

The function returned by kernelDensityEstimator, when supplied with an array of data, returns an array of [x, density] pairs, where each x is a price point from the specified ticks, and density is the estimated density at that point.

If you were to console.log the density array created by the kernelDensityEstimator function, you would see an array of two-element arrays. In this array:

  • The first number is a point in the range of the price data (these points are generated by x.ticks(40)).
  • The second number is the estimated density (not the exact probability) at that specific price point. This density value represents how densely packed the data points are around that specific value in the price range.

// $effect( () => console.log(density))
[
  [0, 0.00004286142900004286],
  [20, 0.0003520760239289236],
  [40, 0.003966650309345785],
  [60, 0.00884563675026403],
  [80, 0.00855085508550852],
  [100, 0.0068547671093640095],
  //
  [880, 0.000014432930173483822],
  [900, 0.000005248338244903207],
  [920, 0.00017122703523996712],
  [940, 0.000009840634209193514],
  [960, 0.000013776887892870924],
  [980, 0.00012049309887256947],
  [1000, 0]
];
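
Because the second number is a density rather than a probability, probabilities come from areas under the curve, not from the individual values. The sketch below approximates such an area with the trapezoid rule over [x, density] pairs. The areaUnder helper and the pairs array are hypothetical, with values invented purely so the arithmetic is easy to check by hand. They are not taken from the price data.

```javascript
// Hypothetical helper: trapezoid-rule area under [x, density] pairs.
// Only segments that lie entirely inside [from, to] are counted.
function areaUnder(pairs, from = -Infinity, to = Infinity) {
  let area = 0;
  for (let i = 1; i < pairs.length; i++) {
    const [x0, y0] = pairs[i - 1];
    const [x1, y1] = pairs[i];
    if (x0 >= from && x1 <= to) {
      area += ((y0 + y1) / 2) * (x1 - x0);
    }
  }
  return area;
}

// Invented example pairs, in the same [x, density] shape as the density array.
const pairs = [
  [0, 0],
  [10, 0.02],
  [20, 0.06],
  [30, 0.02],
  [40, 0]
];

const total = areaUnder(pairs);
const middle = areaUnder(pairs, 10, 30);

console.log(total.toFixed(3)); // "1.000": the whole curve
console.log(middle.toFixed(3)); // "0.800": estimated share of values between 10 and 30
```

For a real density array, the total area should come out close to 1 only if the evaluation points are spaced finely enough relative to the bandwidth, so a total far from 1 can be a hint that the grid is too coarse.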
