Background: There are n entities. Each entity has the same m attributes. Each attribute can vary between several values.
Problem: Select the smallest subset of entities such that you have at least one entity for each value of each attribute.
Example: Say this is your data, where each row is an entity, and each column is an attribute:
| Attribute A | Attribute B | Attribute C | |
|---|---|---|---|
| Entity A | 1 | 1 | 2 |
| Entity B | 1 | 3 | 1 |
| Entity C | 1 | 2 | 1 |
| Entity D | 1 | 2 | 1 |
| Entity E | 2 | 1 | 2 |
| Entity F | 2 | 1 | 1 |
Steps you could take to arrive at the solution:
- Entity B must be selected; it is the only instance of attribute B being 3
- One of Entities C or D must be selected; they are the only instances of attribute B being 2
- So far, we only have instances of attributes A and C being 1. If we add entity E, we now have an instance where both are 2.
So in this example, the optimal solution is either B, C and E or B, D, and E.
Question: Is this a known optimization problem? If so, is it solvable? I would also be interested in heurestics that get close to the optimal solution.