Density Plot from a Single Random Variable

Kernel Density Estimation (KDE) is a powerful method for estimating the probability density function (PDF) of a random variable. As a non-parametric approach, KDE lets us visualize the distribution of a dataset without assuming that it follows a particular parametric form, such as a normal distribution. This provides a clear picture of how data points are spread out across a range.

For example, consider a set of data points from a random variable spread across a continuous range. To gain a higher-level understanding of how these data points are distributed, we can create a density plot using KDE. This technique generates a smooth curve that represents the likelihood of data points occurring at any given point within the range, offering a more refined view than a simple histogram.


In simpler terms, KDE smooths out the distribution of data points over a continuous range. Instead of just counting how many data points fall into each bin (as a histogram does), it provides a continuous curve that visually captures where data points are most likely to appear.

Density plots are versatile tools in data analysis: they provide insight into how data is distributed, help identify outliers, and reveal patterns and underlying structures. For instance, they can be used for quantizing a variable in a heatmap, helping to create color quantiles that visually represent data concentration.

Let’s first look at the final code and then we will go through the details:

<!-- DensityPlot.svelte -->
<script>
  import { onMount } from 'svelte';
  import { scaleLinear } from 'd3-scale';
  import { line, curveBasis } from 'd3-shape';
  import { csv } from 'd3-fetch';
  import { mean } from 'd3-array';
  import Axis from './Axis.svelte';

  // Setting up margins and dimensions for the SVG container
  let margin = { top: 30, right: 30, bottom: 30, left: 80 };
  let width;
  let height = 400;
  let data, x, y, kde;
  let density = [];

  // We will use csv from d3 to fetch the data
  // download the data from: https://datavisualizationwithsvelte.com/price.csv
  onMount(async () => {
    data = await csv('/data/price.csv');
  });

  // Reactive block that re-runs whenever 'data' (or 'width') changes
  $: if (data) {
    x = scaleLinear()
      .domain([0, 1000])
      .range([margin.left, width - margin.right]);
    y = scaleLinear()
      .range([height - margin.bottom, margin.top])
      .domain([0, 0.01]);

    kde = kernelDensityEstimator(Epanechnikov(7), x.ticks(40));
    density = kde(data.map((d) => +d.price));
  }

  function kernelDensityEstimator(kernel, X) {
    return function (V) {
      return X.map(function (x) {
        return [
          x,
          mean(V, function (v) {
            return kernel(x - v);
          })
        ];
      });
    };
  }

  function Epanechnikov(bandwidth) {
    return function (v) {
      return Math.abs((v /= bandwidth)) <= 1
        ? (0.75 * (1 - v * v)) / bandwidth
        : 0;
    };
  }

  $: lineGenerator = line()
    .curve(curveBasis)
    .x((d) => x(d[0]))
    .y((d) => y(d[1]));
</script>

{#if data}
  <div bind:clientWidth={width}>
    <svg {width} {height}>
      <!-- x and y axes -->
      <Axis {width} {height} {margin} xScale={x} ticks={10} />
      <Axis {width} {margin} yScale={y} />

      <!-- Density Plot -->
      {#if density.length > 0}
        <path
          d={lineGenerator(density)}
          fill="#fcd34d"
          opacity="0.8"
          stroke="#000"
          stroke-width="1"
          stroke-linejoin="round" />
      {/if}
    </svg>
  </div>
{/if}

Let’s look at the two most important functions that we are using here: kernelDensityEstimator and Epanechnikov. Together, they convert a single variable, such as price data, into a density estimate, which is then used to draw the density plot.

kernelDensityEstimator function

This function generates the density estimates by applying a kernel function over the data points. It returns a function that, when given a dataset (V), returns an array of [x, y] pairs, where x is a point in the domain (e.g., price values) and y is the estimated density at that point.
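
Written as a formula, and assuming the Epanechnikov kernel used in the code, the estimate can be sketched as follows. The symbols n (the number of values in V), v_i (the individual values), and h (the bandwidth) are notation introduced here for illustration; they do not appear in the original code.

```latex
\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - v_i),
\qquad
K_h(u) =
\begin{cases}
  \dfrac{3}{4h}\left(1 - \left(\dfrac{u}{h}\right)^{2}\right) & \text{if } |u| \le h, \\[6pt]
  0 & \text{otherwise.}
\end{cases}
```

The call to mean in the code corresponds to the factor 1/n and the sum, and kernel(x - v) corresponds to K_h(x - v_i).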

function kernelDensityEstimator(kernel, X) {
  return function (V) {
    return X.map(function (x) {
      return [
        x,
        mean(V, function (v) {
          return kernel(x - v);
        })
      ];
    });
  };
}
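
To see what the estimator returns, here is a minimal sketch that runs it on a tiny, made-up dataset in plain JavaScript. The prices and grid values are hypothetical, and the small mean helper is only a stand-in for the mean function from d3-array so that the snippet runs without D3.

```javascript
// Stand-in for d3-array's mean(values, accessor), so the sketch runs without D3.
function mean(values, accessor) {
  let sum = 0;
  for (const v of values) sum += accessor(v);
  return values.length ? sum / values.length : undefined;
}

// Same logic as in the component: for each grid point x,
// average the kernel weights of all data points v.
function kernelDensityEstimator(kernel, X) {
  return (V) => X.map((x) => [x, mean(V, (v) => kernel(x - v))]);
}

// Epanechnikov kernel with the given bandwidth.
function Epanechnikov(bandwidth) {
  return (v) => {
    const u = v / bandwidth;
    return Math.abs(u) <= 1 ? (0.75 * (1 - u * u)) / bandwidth : 0;
  };
}

// Hypothetical prices and evaluation points, chosen only for illustration.
const prices = [48, 50, 52, 90];
const grid = [40, 50, 60, 90];

const kde = kernelDensityEstimator(Epanechnikov(7), grid);
const density = kde(prices);

console.log(density);
```

With a bandwidth of 7, the density at 40 and 60 is 0 because no price lies within 7 units of those points. The density at 50 is higher than at 90 because three prices sit close to 50 while only one sits at 90.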

Epanechnikov function

The Epanechnikov function defines a specific type of kernel called the Epanechnikov kernel. The kernel is a function that determines the weight given to each data point when estimating the density at a particular point x. Its bandwidth parameter controls the smoothness of the resulting estimate: each data point only contributes to the density at x if it lies within one bandwidth of x. Smaller bandwidth values result in a sharper, more jagged curve, while larger values produce a smoother curve.

function Epanechnikov(bandwidth) {
  return function (v) {
    return Math.abs((v /= bandwidth)) <= 1
      ? (0.75 * (1 - v * v)) / bandwidth
      : 0;
  };
}
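
The effect of the bandwidth can be checked numerically. The sketch below compares three bandwidths (3, 7, and 20, picked only for illustration) and uses a hypothetical integrate helper based on the midpoint rule; neither the helper nor the chosen values come from the original code.

```javascript
// Epanechnikov kernel with the given bandwidth.
function Epanechnikov(bandwidth) {
  return (v) => {
    const u = v / bandwidth;
    return Math.abs(u) <= 1 ? (0.75 * (1 - u * u)) / bandwidth : 0;
  };
}

// Hypothetical helper: midpoint-rule integral of f over [a, b] using n slices.
function integrate(f, a, b, n = 10000) {
  const step = (b - a) / n;
  let total = 0;
  for (let i = 0; i < n; i++) {
    total += f(a + (i + 0.5) * step) * step;
  }
  return total;
}

// For each bandwidth, record the kernel's height at 0 and its total area.
const results = [3, 7, 20].map((h) => {
  const kernel = Epanechnikov(h);
  return { bandwidth: h, peak: kernel(0), area: integrate(kernel, -h, h) };
});

console.log(results);
```

A smaller bandwidth gives a taller, narrower kernel (the peak is 0.75 divided by the bandwidth), while the area under the kernel stays at approximately 1 in every case. That is why the resulting curve remains a valid density regardless of the bandwidth.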

Here is how we use these functions in our code:

$: if (data) {
  x = scaleLinear()
    .domain([0, 1000])
    .range([margin.left, width - margin.right]);
  y = scaleLinear()
    .range([height - margin.bottom, margin.top])
    .domain([0, 0.01]);

  kde = kernelDensityEstimator(Epanechnikov(7), x.ticks(40));
  density = kde(data.map((d) => +d.price));
}

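The x.ticks(40) part is a method call on a D3 scale (x) that returns an array of “ticks”: evenly spaced, nicely rounded values across the domain of the scale. The argument is a hint rather than an exact count, so the array holds approximately 40 values. These are the points at which the density is evaluated.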
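
If you prefer to control the evaluation points exactly, you can build the grid yourself instead of relying on x.ticks. The sketch below uses a hypothetical helper, evenGrid, which is not part of D3 or the original code. With 50 steps over [0, 1000] it produces points spaced 20 apart, the same spacing seen in the logged density values.

```javascript
// Hypothetical helper: n + 1 evenly spaced values from start to stop, inclusive.
function evenGrid(start, stop, n) {
  const step = (stop - start) / n;
  return Array.from({ length: n + 1 }, (_, i) => start + i * step);
}

// 51 evaluation points: 0, 20, 40, ..., 1000.
const grid = evenGrid(0, 1000, 50);

console.log(grid.length, grid.slice(0, 3), grid[grid.length - 1]);
```

Such an array could be passed to kernelDensityEstimator in place of x.ticks(40). A finer grid gives the line generator more points to draw through, at the cost of more kernel evaluations.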

This kde function, when supplied with an array of data, returns an array of [x, density] pairs, where each x is a price point from the specified ticks, and density is the estimated density at that point.

If you console.log the density array created by the kernelDensityEstimator function, you will see an array of two-element arrays. In each pair:

The first number is a point in the range of the price data (these points are generated by x.ticks(40)).
The second number is the estimated density (not the exact probability) at that price point. It represents how densely packed the data points are around that value in the price range.

// $: console.log(density)
[
  [0, 0.00004286142900004286],
  [20, 0.0003520760239289236],
  [40, 0.003966650309345785],
  [60, 0.00884563675026403],
  [80, 0.00855085508550852],
  [100, 0.0068547671093640095],
  //
  [880, 0.000014432930173483822],
  [900, 0.000005248338244903207],
  [920, 0.00017122703523996712],
  [940, 0.000009840634209193514],
  [960, 0.000013776887892870924],
  [980, 0.00012049309887256947],
  [1000, 0]
];
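
Because the second number is a density rather than a probability, it only becomes a probability once it is integrated over a range of x values. The sketch below illustrates this with two hypothetical helpers, areaUnderCurve and peak, applied to a small made-up array that has the same [x, density] shape as the real output. The values are not taken from the price data.

```javascript
// Hypothetical [x, density] pairs, shaped like the output of kde(...).
const sample = [
  [0, 0],
  [20, 0.01],
  [40, 0.03],
  [60, 0.01],
  [80, 0]
];

// Trapezoidal rule: sums the area of each segment between neighboring points.
function areaUnderCurve(pairs) {
  let area = 0;
  for (let i = 1; i < pairs.length; i++) {
    const [x0, y0] = pairs[i - 1];
    const [x1, y1] = pairs[i];
    area += ((y0 + y1) / 2) * (x1 - x0);
  }
  return area;
}

// Returns the pair with the highest density, i.e. the mode of the estimate.
function peak(pairs) {
  return pairs.reduce((best, p) => (p[1] > best[1] ? p : best));
}

const total = areaUnderCurve(sample); // close to 1 for these made-up values
const between20And60 = areaUnderCurve(sample.slice(1, 4)); // share of the area from x = 20 to x = 60
const mode = peak(sample);

console.log(total, between20And60, mode);
```

For a well-behaved estimate, the total area should come out close to 1. Individual density values can be small, as in the logged output, simply because the x range is wide.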

Thanks for reading! In case you have a question or comment please join us on Discord!