1

I have a set of (x, y) points which can be connected to form a graph; my goal is to detect the dynamic parts of this graph. By dynamic I mean ranges where the values are not stable but go up and down, forming something other than a straight y = C line.

I tried looking at the slopes and picking out the parts where slope > EPSILON. But I noticed that this approach depends on the density of the points: if 2 sets of points depict the same graph and one is denser, its slope values will be lower (the change between 2 consecutive points is no longer noticeable).

How can I detect such areas without depending on the number of points used to build the graph?

Here is an example of the data I am processing :

enter image description here

I want to be able to detect the dynamic ranges in this graph without depending on the density of the points given to describe the same graph (the more points there are, the smaller the difference between the "y" values of 2 consecutive points).

In this graph we can see that a static part prevails at the beginning and at the end, and in the middle there is a good dynamic range...

  • 1
    Hi, welcome to Math.SE! Thanks for your question, it looks interesting but I could not really comprehend what you are asking. Could you consider giving an example? – gt6989b Jan 02 '14 at 13:44

2 Answers

1

To be sure I'm answering what you're asking:

Your $x_i$ and $y_i$ values might be, say, dates and temperatures on those days, but you only get to read the thermometer now and then, so that (calling Jan 1 = 1, Jan 2 = 2, Feb 1 = 32, etc.), you have data like

(1, 12)

(2, 11)

(15, 13)

(22, 14)

(29, 13)

(33, 18)

(39, 25)

and so on. You'd like to identify most of January ($x = 1$ through $x = 29$) as "constantly low temp" but February as a warming trend.

Assuming that you've ordered the points so that you have $(x_1, y_1), (x_2, y_2), \ldots$, where the $x_i$ are increasing (as I did above), and then drawn the graph, you're looking at some point $(x_i, y_i)$ and asking "is the graph increasing faster than $\epsilon$ here?" One decent way to remove the dependence on the spacing of the $x_i$ is this. Let's make it concrete and say you're looking at $(x_4, y_4)$. You compute $y_5 - y_4$ and find it's larger than $\epsilon$, but then realize that this is true only because $x_5$ is much larger than $x_4$. The usual solution is to instead compute $$ d_4 = \frac{y_5 - y_4}{x_5 - x_4}, $$ which could be called the "forward difference estimate of the derivative." The "backward difference estimate" would look at the prior rather than the next point: $$ b_4 = \frac{y_4 - y_3}{x_4 - x_3}. $$ The average of the two also makes some sense, as does a "symmetric" version, where you ignore $x_4$ and $y_4$ and instead look at $$ s_4 = \frac{y_5 - y_3}{x_5 - x_3}. $$

I'd suggest looking at each of these in the context you're examining and see how things look.
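A minimal Python sketch of the three estimates, applied to the temperature data above. The function name `difference_slopes` is made up for illustration, and the code assumes the $x$ values are all distinct.

```python
def difference_slopes(points):
    """Forward, backward and symmetric slope estimates at each interior point.

    Assumes all x values are distinct (illustrative helper, not a library call).
    Returns a list of (x, forward, backward, symmetric) tuples.
    """
    pts = sorted(points)  # order by increasing x
    rows = []
    for i in range(1, len(pts) - 1):
        (x0, y0), (x1, y1), (x2, y2) = pts[i - 1], pts[i], pts[i + 1]
        forward = (y2 - y1) / (x2 - x1)    # d_i: uses the next point
        backward = (y1 - y0) / (x1 - x0)   # b_i: uses the previous point
        symmetric = (y2 - y0) / (x2 - x0)  # s_i: skips the point itself
        rows.append((x1, forward, backward, symmetric))
    return rows


temps = [(1, 12), (2, 11), (15, 13), (22, 14), (29, 13), (33, 18), (39, 25)]
for x, fwd, bwd, sym in difference_slopes(temps):
    print(f"x={x:>2}  forward={fwd:+.3f}  backward={bwd:+.3f}  symmetric={sym:+.3f}")
```

On this data the January points give slopes well below 0.2 in absolute value, while at $x = 33$ all three estimates are above 1, so a threshold somewhere in between separates the two regimes.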

John Hughes
  • Thanks for the reply, you got the general idea. If I understand correctly, you propose using a larger window when calculating the slope. 2 points about this approach:
    1. Isn't it still dependent on the density of the points (the denser the points, the larger your window is)?
    2. If we look at the first and the last value in a "window" (in your example pt_5 and pt_3), you might have:

    pt_3: (x, y), pt_4: (x+1, y+2), pt_5: (x+2, y), so in this case pt_4 causes a bit of dynamic behavior, but you missed it. It is not necessary to use the slopes approach; maybe standard deviation/average are good too

    – Foad Rezek Jan 02 '14 at 14:16
  • An example: (1, 3), (2, 4), (5, 3), (6, 2), (9, 3), (11, 5), (13, 7). The slopes (by forward differences) at $x = 1$ and $x = 11$ are $(4-3)/(2-1) = 1$ and $(7-5)/(13-11) = 1$. Even though the density of points is half as great, the slopes come out the same. If we'd had $(13, 6)$ instead of $(13, 7)$, the slope would have been $1/2$ instead (i.e., the same jump of 1 in y-values would have produced different slopes). Division by $x_{i+1} - x_i$ exactly handles the differing density of points. For item "2", forward differences would still find the jump at $pt_4$, but the $s_4$ method would not. – John Hughes Jan 02 '14 at 14:28
  • Maybe my problem is similar and I can't do the mapping from your solution; here are some details: I have a graph of time (x axis) and power (y axis); assume that the graph has a total length of 500 time_units. Someone chooses a sampling frequency and sends me the sampled points. E.g.: freq of 0.2 pt/time_unit --> I will receive 100 points, freq of 4 pts/time_unit --> I will have 2000 points. It is important that the points are equally spaced on the x-axis, but in each case the spacing stands for a different number of time units (freq=0.2 --> spacing is 5 time_units, freq=4 --> spacing is 0.25 time_unit) – Foad Rezek Jan 02 '14 at 15:00
0

Given Foad's response to my earlier answer, I'm going to try to re-state and answer his question here in a second response, rather than extended comments.

Problem: There's an unknown function $F : [0, 500] \to \mathbb R$; we are given samples $y_i = F(t_i)$ of $F$ at times $t_0 = 0, t_1 = b, t_2 = 2b, \ldots, t_i = i\cdot b, \ldots, t_n$, where $n$ is approximately $500/b$. We may assume that non-constancies of $F$ occur at a scale substantially larger than $b$. We'd like to identify points $t_i$ at which $F$ is near-constant, independent of the value $b$.


From the number of samples, you can get a decent estimate of $b$ (namely, $b \approx 500/n$). (It's possible, too, that in your context you're actually given the value of $b$.)

Then compute, for instance, $$ d_i = \frac{y_{i+1} - y_{i}}{b} $$ When $d_i$ is small (less than some constant $\epsilon$ that you choose), we can say $F$ is nearly constant; when $d_i$ is large, $F$ is varying. This is pretty crude, but it is, at least, more or less independent of the spacing, $b$.
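To see why dividing by $b$ matters, here is a small Python sketch. The test signal `F` (flat, then a ramp of slope 0.5, then flat) is purely hypothetical, as are the helper names; the two spacings $b = 5$ and $b = 0.25$ are the ones from the comments.

```python
def F(t):
    """Hypothetical signal: constant, a ramp of slope 0.5 on [200, 300], constant."""
    if t < 200:
        return 10.0
    if t > 300:
        return 60.0
    return 10.0 + 0.5 * (t - 200)


def sample(b, total=500.0):
    """Samples y_i = F(i * b) for i = 0..n, with n = total / b."""
    n = int(round(total / b))
    return [F(i * b) for i in range(n + 1)]


def scaled_differences(ys, b):
    """d_i = (y_{i+1} - y_i) / b for every consecutive pair of samples."""
    return [(ys[i + 1] - ys[i]) / b for i in range(len(ys) - 1)]


for b in (5.0, 0.25):
    ys = sample(b)
    largest_raw = max(abs(ys[i + 1] - ys[i]) for i in range(len(ys) - 1))
    largest_d = max(abs(d) for d in scaled_differences(ys, b))
    print(f"b={b}: largest raw difference={largest_raw}, largest d_i={largest_d}")
```

The largest raw difference shrinks as the sampling gets denser, whereas the largest $d_i$ stays at the slope of the ramp for both spacings, so one $\epsilon$ can serve both data sets.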

Another possible choice is "Compute the variance $v_i$ of the numbers $y_{i-k}, \ldots, y_{i+k}$ for some small $k$" (the "window size" is then $2k+1$), but this has the disadvantage that if you double the number of samples (i.e., cut the value $b$ in half), you end up examining a different period of time.

Better by far is to pick a time-span $\Delta t$, and compute $k = \dfrac{\Delta t}{2b}$, and then look at the variance of the samples $y_{i-k}, \ldots, y_{i+k}$. If $b$ doubles, $k$ will be half as large, and you'll end up looking at fewer samples, but they'll correspond to (approximately) the same time interval as your previous ones. Even if you take this latter approach, the results for different $b$ values will not be identical: the sample variance for $20$ samples will be different from the sample variance for $10$...but my guess is that in your particular problem, this will not be significant.
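As a worked illustration, take the two spacings from the comments ($b = 5$ and $b = 0.25$) and assume, just for the example, a time-span of $\Delta t = 10$ time units: $$ b = 5: \quad k = \frac{10}{2 \cdot 5} = 1, \quad 2k + 1 = 3 \text{ samples}, \quad 2kb = 10 \text{ time units}; $$ $$ b = 0.25: \quad k = \frac{10}{2 \cdot 0.25} = 20, \quad 2k + 1 = 41 \text{ samples}, \quad 2kb = 10 \text{ time units}. $$ The number of samples in the window changes with the sampling rate, but the stretch of time it covers does not.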

To summarize:

First, and once and for all, pick a time-interval over which you expect variation to be significant. Let's say that's $\Delta t = 5$ ms.

Estimate $b \approx 500 / n$ (or use $b$ if it's given to you).

Compute $k = \dfrac{\Delta t}{2b}$, rounded to an integer.

For each $i$ with $k \le i \le n - k$,

(i) Let $m_i = \dfrac{1}{2k+1} \sum_{j = i-k}^{i+k} y_j$.

(ii) Let $v_i = \dfrac{1}{2k+1} \sum_{j = i-k}^{i+k} (y_j - m_i)^2$.

(iii) If $v_i$ is larger than some threshold, report that $F$ is varying at $t_i$; else report that it's approximately constant.
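The steps above can be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the function name `detect_varying`, the test data and the threshold value are invented for illustration, and $v_i$ is taken to be the plain variance (mean squared deviation) of the window.

```python
def detect_varying(ys, total_time, delta_t, threshold):
    """Flag the sample times at which the windowed variance exceeds `threshold`.

    ys         : samples y_0..y_n taken at equally spaced times t_i = i * b
    total_time : length of the sampled interval (500 in the question)
    delta_t    : time-span over which variation is considered significant
    Returns a list of (t_i, is_varying) pairs for k <= i <= n - k.
    """
    n = len(ys) - 1
    b = total_time / n                       # estimated sample spacing
    k = max(1, round(delta_t / (2 * b)))     # half-window size in samples
    flags = []
    for i in range(k, n - k + 1):
        window = ys[i - k:i + k + 1]         # the 2k + 1 samples around y_i
        m = sum(window) / len(window)        # window mean m_i
        v = sum((y - m) ** 2 for y in window) / len(window)  # variance v_i
        flags.append((i * b, v > threshold))
    return flags


# Hypothetical data: flat at 10, a ramp from 10 to 60, then flat at 60 (b = 5).
ys = [10.0] * 40 + [10.0 + 2.5 * j for j in range(21)] + [60.0] * 40
flags = detect_varying(ys, total_time=500.0, delta_t=10.0, threshold=1.0)
varying = [t for t, is_varying in flags if is_varying]
print(f"varying from t={varying[0]} to t={varying[-1]}")
```

The threshold still has to be tuned to the noise level of the real data; the point of the sketch is only that the window is fixed in time units rather than in number of samples.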

John Hughes