0

I have a problem that I've formulated as follows. Given a finite target set $T$, and a set-generating function $F(x_i) = C_i$ that also produces finite sets, I'd like to find the set $C_i$ that has maximum overlap with $T$. As far as I've been able to figure out, I'm basically trying to minimize the Jaccard distance between the sets, i.e.

$J(A,B) = 1-\frac{|A\cap B|}{|A\cup B|}$

This seems vaguely similar to set covering problems, but I haven't stumbled across a formulation quite like it. Could anyone direct me to any references or algorithms geared towards solving this sort of optimization problem?

Note: I've tried to frame it in the most general way possible, but for my particular application, these sets are composed of non-negative integer triples of the form $(x,y,z)$. The integers in the triples are also restricted to fall within a certain interval.

EDIT 1:

Okay, let's get more specific. The domain for the members of the sets is essentially a fixed-size voxel grid. Let's even make it a cube, so $n_v$ is the number of voxels along a given direction.

I'd like to examine different types of target sets and generating functions, but the most basic useful function, $F_{rect}$ will take two triples as parameters, representing opposite corners of a rectangular prism. So the members of the set returned by $F_{rect}(x_0,y_0,z_0,x_1,y_1,z_1)$ will be all triples $(x,y,z)$ that satisfy the inequalities:
$x_0\leq x \leq x_1$,
$y_0\leq y\leq y_1$,
$z_0\leq z \leq z_1$.

So this function has some definite structure, and it's quite simple to enumerate all of the possible sets given the bounds of the domain under consideration. All of my potential generating functions will share the properties of being dense and easily enumerated.

Although I can cycle through all the sets and check them in linear time, that will take a very long time. I'd like to use a branch-and-bound scheme to prune the possibilities, but I'm a total novice in optimization techniques. Has this specific problem been worked on before to anyone's knowledge?

  • It will be hard to give an algorithm answer unless either your sets are all given as the input, or you can specify your function $F$ that generates the $C_i$ and the set of all possible inputs $x_i$ for it. If the sets $C_i$ are simply listed in the input, then there is an optimal linear time algorithm: Simply compute Jaccard for every one of your input sets $C_i$. If there is no known structure in $F$ that can be exploited, then you have to use all inputs $x_i$ and compute Jaccard for all possible $C_i$. – user2566092 Aug 27 '15 at 17:39
  • @user2566092 This is a good point. I've got no experience in discrete optimization, so I didn't really know how to frame the question properly. I've added a specific example of a generating function to provide more details. There will always be a known structure to $F$ that can be exploited, and I would think that somebody has already worked on a clever scheme for this type of problem. – voxelite Aug 27 '15 at 19:17

1 Answers1

0

One thing that might help, at least compared to brute force, is to build a so-called "range tree" data structure (you can look up e.g. on Wikipedia), which would very quickly (in about $(\log n)^2$ time) allow you to compute the number of points in your original set $T$ (where $|T| = n$) that are contained in a given query box (one of your $C_i$ sets with sides in between any 6 end-points you want). The range tree can be built in about $n (\log n )^2$ time and takes the same amount of space, which is not bad compared to just storing and going through the points in $T$ (which would take $n$ time and space)

Note that for a given query box $C_i$, if $|T| = n$ is large then it would take much longer, linear $n$ time, to compute the size of the intersection. If you sort 3 copies of $T$, according to $x,y,z$ coordinates separately, this would triple your memory usage and require about $n \log n$ preprocessing time, but at least then each time you change a dimension endpoint of $C_i$ up or down by 1, you can get the running time down to update the size of the intersection but it can still be pretty bad, e.g. around $n^{2/3}$ if your set $T$ is pretty dense.

So, as you can see, building a range tree that quickly returns counts of points in $T$ that are in any given query box, you can substantially reduce your running time if $n$ is large, at least if you are performing brute force search over all possible $C_i$.

Also, if you are looking for alternatives to brute force search, you can look up simulated annealing which tries to repeatedly randomly change the ranges for $C_i$ by small amounts to find nearby solutions, trying to find better and better solutions, and occasionally moving to a worse solution in order to avoid getting stuck in local maxima. (The probability of moving to a worse solution depends on how much worse it is.) The downside to simulated annealing is that there's no way to know whether the best solution you find is the true global optimal solution. Range trees again will speed up simulated annealing a lot if $n$ Is large.

Anyway, given the structure of your problem (finding the best range query, i.e. box) you might have better luck getting an answer that finds the optimal solution while avoiding brute force search if you post on CS.SE (the computer science sibling of this site). As I mentioned, a range tree can help and maybe someone can figure out how to use it even better to quickly find the optimal solution.

user2566092
  • 26,142