0

I have a problem I'd like to code. I have a process which generates numbers 0 through n-1 and I want to stop it when it generates the first repeated element.* I'm looking for a data structure that makes this fast. In particular, adding a new element and testing if an element is in the structure need to be fast. The expected number of insertions is around $\sqrt n$ or actually a bit worse (say $\sqrt{2n}$) because the process slightly favors unique values.

Some kind of self-balancing binary tree seems like the right approach, but maybe there's a better way? For small $n$ I think a bit array would be superior but I'm looking at $n\approx10^9$ which is too large for that to be practical I think.

* Actually, it doesn't need to stop right away -- if it's more efficient you can generate elements in blocks and check every now and then.


This is a CS question but too basic for cstheory.stackexchange and probably insufficiently applied for stackoverflow.

Charles
  • 32,122
  • 1
    The simplest solution would be to use your language's version of hash tables, or barring that, a set class. –  Nov 05 '13 at 06:59
  • @Hurkyl: Probably. What if I'm interested in maximizing speed? Testing to $10^7$ took 13 hours with a naive implementation and so $10^9$ would take in the ballpark of 25,000 hours with same, more than I care to spend. – Charles Nov 05 '13 at 07:10
  • Why is a bit set not practical? It will only take 125 MB of memory to store 10^9 bits. Looks very practical to me. – Dan Shved Nov 05 '13 at 07:20
  • It won't get swapped or anything. Should work extremely fast, if you implement it carefully in a language like C++. If your language is more high-level (think Ruby), then it depends. – Dan Shved Nov 05 '13 at 07:25
  • @Dan: Random access to main memory is a lot slower than random access to cache memory. –  Nov 05 '13 at 07:57
  • @Charles: are you sure bottleneck is your table lookups, rather than your compute work? –  Nov 05 '13 at 07:58
  • @Hurkyl: Yes, the data structure is the bottleneck. (And you're right, main memory latency is what makes a bitset impractical here.) – Charles Nov 05 '13 at 08:38
  • @Hurkyl I know, but it should still be fast enough to do $\sqrt{10^9}$ operations extremely fast. Even $10^9$ should be tolerable. – Dan Shved Nov 05 '13 at 08:38
  • @DanShved: I'm considering languages as low as assembly and as high as C++ though I'll probably end up using the happy medium of C. I wouldn't consider anything like Ruby for a compute-intensive task like this. – Charles Nov 05 '13 at 08:40
  • Memory latency is a thing to consider, but when you replace $O(1)$ with something asymptotically worse, it can very much outweigh the memory latency thing. I wonder if you've really tried it, or just assume it will be slow. – Dan Shved Nov 05 '13 at 08:40
  • @DanShved: I'll have to do about $10^9$ of the $\sqrt{10^9}$ operations, which makes me care quite a bit about performance. Scaling up earlier efforts would take 3-10 core years which is unworkable. – Charles Nov 05 '13 at 08:41
  • I'm sorry, the question says something about $\sqrt{n}$ and doesn't mention $n\sqrt{n}$. Where does that come from? Also, when you got 13 hours of work for $n=10^7$, what was the implementation? Was it the bit set? – Dan Shved Nov 05 '13 at 08:44
  • @DanShved: I'm doing the main calculation ~ n times. I left that out since it's irrelevant to the optimization itself -- no work can be shared between invocations. The implementation was very naive, generating results in batches and sorting. Anything would be better than that. – Charles Nov 05 '13 at 08:47
  • OK. But I think the bit set might still do the job, although I cannot be sure without experimenting. $3 \cdot 10^{13}$ random memory operations might still do it in several hours/days. I doubt it will be years, although I'm not sure. And I don't think you'll find anything faster anyway. Why don't you try to do $10^9$ random memory accesses, measure the time, and then decide whether or not it is worth it? – Dan Shved Nov 05 '13 at 08:50
  • I mean, any computer nowadays should at least do $10^8$ memory operations per second. That makes it a total of $3\cdot10^5$ seconds, which is less than $100$ hours. Give or take. – Dan Shved Nov 05 '13 at 08:53
  • 1
    A hash table might also do it. But I can't say which will turn out to be faster in practice, you'll have to experiment. – Dan Shved Nov 05 '13 at 08:57
  • @DanShved: You're talking about random access to main memory, so more like 100 ns -> 1000 hours or about a month. But yes, I certainly hope to improve greatly on the original. With a hash set I hope to work in L1 mostly to get that down further. – Charles Nov 05 '13 at 09:08
  • @Charles Well yes, 1000 hours is much worse than 100. That's why I'm not sure, I don't really remember what that time constant is. If it really is that slow - go with the hash table ) – Dan Shved Nov 05 '13 at 09:23
  • 1
    I think that you ought to try stackoverflow. The advice given here seems sound, but I'd expect more/different ideas/contributors over there. – bubba Nov 05 '13 at 12:24
  • Have you considered a B-tree? Also cs.stackexchange.com is still in beta, but would probably be a good place to ask. – MasterOfBinary Nov 05 '13 at 18:49

0 Answers0