2

It's my first time on here and my maths is poor, so please be kind. I am working on a Master's dissertation focused on document clustering methods, in which I would like to apply a weight based on the time interval between two documents.

I am looking for some help coming up with a function that expresses a time interval as a value between 0 and 1. The reason I want to map the results to a maximum value of 1 is that this will be applied as a weight to a cosine similarity metric, where identical articles receive a cosine similarity of 1.

Example 1, the date difference between 31/05/2015 and 20/06/2015 is 9 days. Example 2, the date difference between 31/05/2015 and 20/01/2015 is 129 days.

I would like to apply a function whereby example 1 has a higher value (towards the 1 end of the scale) and example 2 has a lower value (towards the 0 end of the scale). If the date difference was only 1, the value of 1 should apply.

I hope this makes sense. Any help anyone can offer me would be greatly appreciated.

Thank You

Claire

  • Welcome to Math.SE! Right now there are a lot of functions that satisfy your requirements. Do you happen to want the value of example $1$ to be $129/9$ times as high as that of example $2$? That would nail it down to just $1$ function. – Hrodelbert Jun 05 '15 at 11:59
  • Thank you so much for your reply. The examples given are really just arbitrary. I am looking for some sort of weight that is higher if the documents are close together in terms of publish date; example 2 should have a lower score, but not necessarily tied to being 129/9 times less relevant. Hope that makes sense. Thanks again! – Claire McMahon Jun 05 '15 at 13:05

3 Answers

1

Do you have a maximum and minimum date? If so, I suggest you simply divide the number by the number of days between the maximum and the minimum.
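A minimal sketch of this normalization, assuming the largest possible gap in the corpus is known in advance. Dividing alone gives a value that grows with the gap, so the sketch subtracts the ratio from 1 to make closer dates score higher; that flip is an assumption based on the question, not something stated in this answer. The names `linear_weight` and `max_span_days` are made up for illustration.

```python
from datetime import date


def linear_weight(a: date, b: date, max_span_days: int) -> float:
    """Map the gap between two dates to [0, 1], with 1 meaning the same day."""
    if max_span_days <= 0:
        raise ValueError("max_span_days must be positive")
    gap = abs((a - b).days)
    # The ratio is 0 for identical dates and 1 at the largest possible gap.
    ratio = min(gap / max_span_days, 1.0)
    # Assumption: flip the ratio so that closer dates receive the higher weight.
    return 1.0 - ratio
```

Unlike the reciprocal and exponential options, this weight reaches exactly 0 at the maximum gap.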

5xum
1

The simplest approach, in my opinion, is to consider the function
$$ w(d) = \frac{1}{d}, $$
where $d$ is the difference in days. Clearly $w(1) = 1$ and $0 < w(d) \leq 1$ for all $d \geq 1$, as you wanted. By coincidence, this function also has the property that $w(9) = \frac{129}{9} w(129)$, or more generally
$$ w(d_1) = \frac{d_2}{d_1} w(d_2). $$

You can generalize this function if you want:
$$ w_A(d) = \frac{A+1}{d+A} $$
also satisfies all your requirements for any positive constant $A$. Note that this is not the most general function satisfying your constraints. Even after replacing $d$ with some monotonically increasing function of $d$, $w$ will still not be the most general function.
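As a rough illustration, both formulas fit in one small Python function. The function name is made up here, and the value `A=30` in the loop is an arbitrary choice, used only to show how a larger $A$ flattens the decay.

```python
def reciprocal_weight(d: float, A: float = 0.0) -> float:
    """Compute w_A(d) = (A + 1) / (d + A); A = 0 reduces to w(d) = 1/d."""
    if d < 1:
        raise ValueError("d must be at least 1 so that the weight stays in (0, 1]")
    if A < 0:
        raise ValueError("A must be non-negative")
    return (A + 1) / (d + A)


# Compare the plain reciprocal with a flatter variant (A = 30 is arbitrary).
for d in (1, 9, 129):
    print(d, reciprocal_weight(d), reciprocal_weight(d, A=30))
```

With $A = 0$ the weight drops quickly (a gap of 9 already gives about 0.11), so a larger $A$ may suit cases where documents a few days apart should still count as close.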

Hrodelbert
  • Thank you kindly for the time taken to respond to my question. I have opted to go with a variant of your simplest approach, $w(d)=1/d$. Rather than days, I have opted to use the difference in weeks. Any idea how I could express the following as a formula? – Claire McMahon Jun 09 '15 at 12:42
  • cosine similarity (new value) = cosine similarity score (original) + (20% of cosine similarity score (original) * 1/date_difference(months)) – Claire McMahon Jun 09 '15 at 12:43
  • @ClaireMcMahon If $d$ is now the difference in weeks and you again want $w(1) = 1$, you can use the same formula; this only works if the smallest possible difference is $1$. Also, in your second comment you seem to be using months rather than weeks. Which is it? – Hrodelbert Jun 09 '15 at 12:46
  • @ClaireMcMahon Might I also draw your attention to the fact that you can upvote answers that you found useful. This is how the site filters good answers. If there is an answer that completely solved your question, you can accept it by clicking the check mark next to that answer. – Hrodelbert Jun 09 '15 at 12:48
  • Ooops sorry, I have been playing around with two different figures, it should be weeks rather than months. – Claire McMahon Jun 09 '15 at 13:15
  • Then, if the smallest possible difference in terms of weeks is $1$, nothing at all changes compared to the case when we treated days. – Hrodelbert Jun 09 '15 at 13:20
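
One way to write the adjustment from the comments above as a formula, taking $s$ as the original cosine similarity and $d \geq 1$ as the date difference in weeks (both symbols are introduced here for illustration):
$$ s_{\text{new}} = s + 0.2\, s \cdot w(d) = s\left(1 + \frac{0.2}{d}\right), \qquad w(d) = \frac{1}{d}. $$
Since $w(d) \leq 1$, this gives $s_{\text{new}} \leq 1.2\, s$, so the adjusted score can exceed $1$ (for example at $d = 1$ whenever $s > 5/6$). Dividing by $1.2$ would be one way to keep it in $[0, 1]$, if that matters for the clustering.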
0

One standard approach in this kind of situation is to choose a discount factor $0<d<1$ such that the value of an example decreases by this factor for each time unit that has passed, i.e. for an example with initial value $v_0$ and age $t$, you would use the value $v = v_0 \cdot d^t$.
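
A short sketch of this exponential discounting in Python. The function name and the default factor `d=0.95` are assumptions chosen for illustration; the answer does not suggest a particular value.

```python
def discounted_value(v0: float, t: float, d: float = 0.95) -> float:
    """Return v0 * d**t, the value after t time units with discount factor d."""
    if not 0 < d < 1:
        raise ValueError("discount factor d must lie strictly between 0 and 1")
    if t < 0:
        raise ValueError("age t must be non-negative")
    return v0 * d ** t


# Example: a cosine similarity of 0.8 between documents published 3 weeks apart.
print(discounted_value(0.8, 3))
```

An age of $t = 0$ leaves the value unchanged, so, unlike $1/d$, this form needs no special handling for documents published on the same day.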