$\def\nR{\mathbb{R}}\def\rmC{{\mathrm C}}\def\rmu{{\rm u}}\def\sS{{\Sigma}}\def\sF{{\mathcal{F}}}\def\l{\left}\def\r{\right}\def\di{d}\def\sinc{\operatorname{sinc}}\def\ltag#1{\tag{#1}\label{#1}}$
We speak of signals as time functions $x:\nR\rightarrow\nR$ (defined on the whole time axis).
Actually, they are not just any real functions but must have some nice properties (such as square integrability, else we had difficulties with convolution).
We do not go deeper into that here. We just collect all admissible signals into a signal set $\sS$.
A system $S$ maps input signals $x$ uniquely to output signals $y$, we write $y=S(x)$.
We concentrate here on the special class of time-invariant linear systems.
Linear means not only that for two signals $x_1,x_2\in \sS$ and a scalar $\lambda\in\nR$ there holds $S(x_1+\lambda x_2)=S(x_1)+\lambda S(x_2)$ but also if we have a parameterized set of signals $\l(x_\tau\r)_{\tau\in\nR}\in\sS$ and a coefficient function $\lambda(\tau)$ then
$$
S\l(\int_\nR \lambda_\tau x_\tau \di\tau\r) = \int_\nR \lambda_\tau S(x_\tau)\di\tau
$$
For time invariance we need the notion of shifted time signals. For a given time-shift $\tau\in\nR$ and a signal $x\in\sS$ we denote with $x(\bullet-\tau)$ the shifted signal defined by $x(\bullet-\tau)(t)=x(t-\tau)$ for all $t\in\nR$.
Time invariant systems respond to a shifted input signal with a shifted output signal, i.e., $S(x(\bullet-\tau)) = S(x)(\bullet-\tau)$.
Beside linearity this time invariance is the special property of
models for physical systems that makes convolution such a useful tool.
And the set of linear time-invariant systems is large. All systems
that can be described through linear ordinary and partial differential
equations with constant coefficients belong to this set.
Now, suppose we have a family of base functions $\l(\delta(\bullet-\tau)\r)_{\tau\in\nR}$ for our signal space $\sS$
which just differ by a shift.
Just to recap: Base functions of $\sS$ means that we can represent every signal $x\in\sS$ as linear combination
\begin{align}
\ltag{base}
x(t) &= \int_{-\infty}^\infty \lambda_\tau \delta(t-\tau) d\tau
\end{align}
of the base functions $\delta(\bullet-\tau)$ with a corresponding family of coefficients $\lambda_\tau$ which represent the signal $x$ uniquely.
The nice thing with base functions which just differ by shift is that
because of the time-invariance of the considered physical systems $S$
the system responses to these base functions also just differ by a
shift:
$$
S(\delta(\bullet-\tau))= S(\delta)(\bullet-\tau)
$$
Let us denote the system repsonse to our unshifted base function $\delta$ as $g:=S(\delta)$.
To calculate the answer of the linear operator we just need the mapped
base functions (just like in linear algebra).
\begin{align}
y=S(x) &= S\l(\int_\nR \lambda_\tau \delta(\bullet-\tau) d\tau \r)\\
&= \int_\nR\lambda_\tau S(\delta(\bullet-\tau)) d\tau&&\text{ just using linearity}\\
&= \int_\nR\lambda_\tau S(\delta)(\bullet-\tau) d\tau&&\text{ using time invariance}\\
&= \int_\nR\lambda_\tau g(\bullet-\tau) d\tau
\end{align}
The system response at a certain time $t$ can be computed as
\begin{align}
\ltag{impulseResponse}
y(t) &= \int_\nR \lambda_\tau g(t-\tau)d\tau.
\end{align}
Now, the only problem remains to find the appropriate base
functions. Hm..., we would have to extend our signal space
$\sS$ to a space of distributions for that purpose and we would have to define the linear system
$S$ on this extended signal space.
We do not want to go into these details here.
Instead we switch to Rieman-Stieltjes integrals in \eqref{base}. That
means we use a family of base functions $\bar h(\bullet-\tau)$ (left-continuous, with
bounded variation) that only differ in a shift such that we can
represent our signal as
\begin{align}
\ltag{baseStieltjes}
x(t) &= \int_{\tau=-\infty}^\infty \lambda_\tau d \bar h(t-\tau).
\end{align}
with a family of coefficients $\lambda_\tau$ continuously depending on $\tau\in\nR$.
Riemann Stieltjes integrals are known from distribution functions in
statistics. Especially, discontinuities in the integrator function $h$
result to sum terms
\begin{align}
x(t) &= \sum_{\scriptsize\begin{matrix}\tau\in\nR\\h(t-(\tau-))\\\neq h(t-(\tau+))\end{matrix}} \lambda_\tau\cdot (\bar h(t-(\tau+))-\bar h(t-(\tau-)) +\\
\ltag{sumStieltjes}
&\phantom{= \sum_{\scriptsize\begin{matrix}\tau\in\nR\\h(t-(\tau-))\\\neq h(t-(\tau+))\end{matrix}}\lambda_\tau\cdot \bar h(t-(\tau+))
}+ \int_{\tau=-\infty}^\infty \lambda_\tau (-\bar h'_\rmC(t-\tau))d\tau.
\end{align}
where $h_\rmC$ is the continuous part of $h$, i.e.,
\begin{align}
\bar h_\rmC(t)&=\bar h(t)-\sum_{\scriptsize\begin{matrix}\tau\leq t\\h(\tau-)\\\neq h(\tau+)\end{matrix}}\bar h(\tau+)-\bar h(\tau-).
\end{align}
Note, that the differentiation of $\bar h_\rmC$ in \label{sumStieltjes} is
carried out w.r.t. $\tau$. This explains the negative sign.
With the help of \label{sumStieltjes} it is easy to construct nice
base functions. We just choose $\bar h'_\rmC=0$, i.e. a piecewise constant
function and let $h$ have one jump of height -1 from 0 to -1 for $\tau=t$.
Such a signal would be the negative Heaviside function
\begin{align}
\bar h(t) &=\l\{\begin{matrix}
0&\text{ for }t<0\\
-1&\text{ for }t\geq 0
\end{matrix}
\r\} = -h(t)
\end{align}
with the Heaviside function $h$.
This way we get just $\lambda_\tau = x(\tau)$ and the
representation \label{baseStieltjes} reads as
\begin{align}
x(t) =\int_{\tau=-\infty}^\infty x(\tau)\cdot d\bar h(t-\tau).
\end{align}
Applying the linear time invariant system to this signal means
\begin{align}
y(t) &= S\l(\int_{\tau=-\infty}^{\infty} x(\tau)d\bar h(\bullet-\tau)\r)(t)\\
&=\int_{\tau=-\infty}^{\infty} x(\tau)d S\l(\bar h(\bullet-\tau)\r)(t)&&\text{ using linearity}\\
&=\int_{\tau=-\infty}^{\infty} x(\tau)d S\l(\bar h\r)(t-\tau)&&\text{ using time invariance}
\end{align}
We have again one signal $\bar G:= S(\bar h)$ that
characterizses the system completely.
Often the system has smoothing properties such that $\bar G$ is
continuous even with the discontinuous input $\bar h$.
In that case one can derive the integrator function in the
Riemann-Stieltjes integral w.r.t. $\tau$ and obtains
\begin{align}
y(t) &= \int_{\tau=-\infty}^\infty x(\tau) (-\bar G'(t-\tau))d\tau.
\end{align}
There is the connection to \label{impulseResponse}.
We can choose
\begin{align}
\lambda_\tau &= x(\tau)\\
g(t) &=-\bar G'(t)\\
&= - S(-h)'(t) = S(h)'(t)
\end{align}
and write
\begin{align}
y(t) &= \int_{\tau=-\infty}^\infty x(\tau)g(t-\tau)d\tau.
\end{align}
Mark Leeds asked some good questions in the comments to this Answer which I want to address here:
- At the point where you introduce the stieltjes integral, what was the integral called before that point? I know reimann and lebesgue but that's it.
- If you had to give intuition for why it's the second term is g(t−τ) rather than just g(t), what would it be?
Answer to 1.: One could at first try Lebesgue integrals. But, pityingly there is
no family of Lebesgue integrable base functions $\delta(\bullet-\tau)$
$(\tau\in\nR)$ which only differ by shifts. The problem is that the
convolution integral
$y(t)=\int_{-\infty}^{+\infty}x(\tau)\delta(t-\tau)d\tau$ "smears" the
support of the integrands. This leads relatively fast to the
conclusion that the base functions $\delta(\bullet-\tau)$ should have
a single point as support to satisfy $y(t)=x(t)$. This in turn would
mean that $\delta$ would be the zero function in the Lebesgue
sense. Therefore, there are no such base functions in $L^2$. I have
shown one relatively simple way around this difficulty with
Stieltjes-Riemann integrals.
Another general way would be to start with the space of piecewise
smooth L2-functions which form a ring with the convolution product as
multiplication. The algebraic structure of this set is like that one
of the ring of
integers. This ring
can be extended to a field. This method is known from the extension of
the ring of integers to the field rational numbers). The unit element
in that field is called Dirac-delta distribution. You find this
approach in the book Differential Algebraic Equations of Kunkel and
Mehrmann.
A further alternative would be to reduce the signal space to the space of
frequency-limited signals. This way even works with classical
integrals. I think you know the Fourier transformation $X(\omega) =
\sF(x)(\omega) = \int_{-\infty}^\infty x(t) e^{-i\omega t} dt$ from
statistics or at least from the
video.
Furthermore, you know that convolution in the time-domain
$(x*y)(t)=\int_{\tau=-\infty}^\infty x(\tau)y(t-\tau)d\tau$ translates
to multiplication $\sF(x*y)(\omega) = X(\omega)\cdot Y(\omega)$ in the
frequency domain.
We are looking for the decomposition of signals $x$ in linear combinations $x(t) = \int_{-\infty}^\infty x(\tau) \delta(t-\tau) d\tau = (x*\delta)(t)$.
That means that $\delta$ should retain the signal under
convolution. In the frequency domain the spectrum
$\Delta:=\sF(\delta)$ of $\delta$ should retain the spectrum $X$ of
the signal $x$ under multiplication. If our signal space $\sS$
contains signals $x$ whose spectra $X$ have unbounded support this implies that the spectrum of $\Delta$ must be the unit function, i.e., $\Delta(\omega)=1$ for all $\omega\in\nR$. But, if there exists an upper limit frequency $\omega_\rmu>0$ such that the spectra $X$ for all signals $x\in\sS$ are zero outside of $[-\omega_\rmu,\omega_\rmu]$ then $\Delta(\omega)$ must only be 1 for $\omega\in[-\omega_\rmu,\omega_\rmu]$. We can construct an appropriate generator for our base as inverse Fourier transformed of
\begin{align}
\Delta(\omega) &=
\begin{cases}
1 &\text{ for } \omega \in [-\omega_\rmu,\omega_\rmu]\\
0 &\text{ else }
\end{cases}
\end{align}
\begin{align}
\delta(t) &:= \frac1{2\pi}\int_{-\infty}^\infty \Delta(\omega)e^{i\omega t} d\omega\\
&= \frac1{2\pi}\int_{-\omega_\rmu}^{\omega_\rmu} e^{i\omega t} d\omega\\
&= \begin{cases}
\frac{\omega_\rmu}{\pi} &\text{ for } t=0\\
\frac1{t\pi} \sin(\omega_\rmu t)&\text{ for } t\neq 0
\end{cases}\\
&= \frac{\omega_\rmu}{\pi} \sinc(\omega_\rmu t)
\end{align}
with the sinc function.
So, for frequency-bounded signals you can use normal Lebesgue
integration and you have an explicit representation of the base functions $\delta(\bullet-\tau) = \frac{\omega_\rmu}{\pi} \sinc(\omega_\rmu (\bullet-\tau))$.
Pityingly, this is a rather restricted signal space only suited for educational purposes. You do not have a step signal since this is not frequency limited and not $L^2$. Furthermore, all those signals are acausal since their support is unlimited in both directions.
Note, that for increasing upper frequency bound $\omega_\rmu$ the
sequence $\delta_{\omega_\rmu}$ $(\omega_\rmu\rightarrow\infty)$
represents in the limit the Dirac delta distribution. This limit is
one way to understand the Dirac delta distribution.
Answer to 2.: The intuitive picture is already contained in the text. The LTI
system responses to a time-shifted input signal by a time-shifted
output signal. That is the reason why we represent our signal $x$ as
superposition of time-shifted basis signals
$x(t)=\int_{\tau=-\infty}^\infty x(\tau)\delta(t-\tau) d\tau.$ Under
the integral only $\delta$ is dependent on the real time $t$ and $x$
only appears as a weighting function which ist "indexed" through
$\tau$. Thus, the system $S$ only acts on the shifted time signals
$\delta(t-\tau)$ with the shifted system response $g(t-\tau)$. One
obtains the system response on the input signal $x$ as superposition
$y(t) = S(x)(t) = \int_{-\infty}^\infty x(\tau)g(t-\tau) d\tau$ of the
system responses $g(t-\tau)$ on the shifted $\delta(t-\tau)$. Notably,
the impulse response starting at time $\tau$ is weighted with the
value $x(\tau)$ of the input signal at time $\tau$. It contributes to
the output $y(t)$ at time $t$ with the $x(\tau)$-weighted impulse response of
the system after the elapsed time $t-\tau$ of the time interval from $\tau$ to $t$.
You compose your input signal as a sequence of shifted "special signals" and look what the system delivers for the shifted special signals after the elapsed time from the start time of the special signal to the current time.