2

Background: There are n entities. Each entity has the same m attributes. Each attribute can vary between several values.

Problem: Select the smallest subset of entities such that you have at least one entity for each value of each attribute.

Example: Say this is your data, where each row is an entity, and each column is an attribute:

Attribute A Attribute B Attribute C
Entity A 1 1 2
Entity B 1 3 1
Entity C 1 2 1
Entity D 1 2 1
Entity E 2 1 2
Entity F 2 1 1

Steps you could take to arrive at the solution:

  • Entity B must be selected; it is the only instance of attribute B being 3
  • One of Entities C or D must be selected; they are the only instances of attribute B being 2
  • So far, we only have instances of attributes A and C being 1. If we add entity E, we now have an instance where both are 2.

So in this example, the optimal solution is either B, C and E or B, D, and E.

Question: Is this a known optimization problem? If so, is it solvable? I would also be interested in heurestics that get close to the optimal solution.

1 Answers1

3

This is related to several NP-complete problems.

3-DIMENSIONAL MATCHING is the problem we'd get if there are $3$ attributes (as in the example), each attribute has $q$ values, and we want to know if we can get the exact minimum number of entities: $q$ entities with no values of any attribute in common.

Since this is a special case of the problem in the question, that problem is also NP-complete.

SET COVER is a more general problem: given a family of sets whose union is $U$, we want find the smallest subfamily whose union is still $U$. We can reduce the problem in the question to a set cover problem if we think of each entity as the set of attribute values it has: for example, entity A would be the set $\{a_1, b_1, c_2\}$ representing a value of 1 for attribute A, 1 for attribute B, 2 for attribute C. Entity B would be the set $\{a_1, b_3, c_1\}$.

In the set cover problem, there's no requirement that all sets have the same size, or that the elements can be classified by attribute, so it's more general. However, the ideas used to solve the set cover problem might be useful to approximate the problem in the question.

For example, the greedy algorithm that repeatedly chooses the entity with the most uncovered elements (attribute values) achieves an approximation ratio of $H(s) = 1 + \frac12 + \frac13 + \cdots + \frac1s$, where $s$ is the number of elements to cover. (In the example, $s= 2 + 3 + 2 = 7$, since attributes A and C have $2$ possible values, and attribute B has $3$ possible values.) Asymptotically, $H(s) \sim \ln s$.

The Wikipedia article also mentions that if each element (that is, each value of each attribute) appears on at most $f$ entities, then taking an LP relaxation of this problem (and rounding up) can be used to find a solution within a factor of $f$ of the optimal solution.

Misha Lavrov
  • 142,276