Discussion:
Optimization problem
Robert AH Prins
2010-05-16 18:28:26 UTC
Hi all,

Can anyone give me some hints as to how to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version?

Problem description:

The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.

Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.

An example: suppose the array contains only 56 elements, in this case
with, for clarity, values a..j:

         1         2         3         4         5    5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja

in this case the program should find the following sub-arrays:

len  aperture  from  to  value
  3         3    40  42  bef
  3         3    44  46  fgh
  7         7    50  56  gihbfja
  8        16    41  56  efffghggggihbfja, containing efghibja
  9        23    34  56  cbbbbbbefffghggggihbfja, containing cbefghija
 10        25    32  56  dccbbbbbbefffghggggihbfja, containing dcbefghija

It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
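
For reference, the "aperture" figures above can be checked with a
straightforward two-pointer scan, one pass per distinct count. A sketch in
C (the function name is hypothetical; note it only finds the minimal window
length per distinct count, it does not implement the exclusion rule for
sub-arrays nested inside runs of all-distinct values):

```c
#include <assert.h>

/* Shortest window (aperture) containing at least k distinct values,
   or 0 if no window reaches k distinct values.  O(N) two-pointer scan:
   any window with more than k distinct values contains a no-longer
   sub-window with exactly k, so "at least k" gives the same minimum.
   Requires k >= 1. */
static unsigned min_aperture(const unsigned char *data, unsigned n,
                             unsigned k)
{
    unsigned count[256] = {0};
    unsigned distinct = 0, best = 0;
    unsigned left = 0;

    for (unsigned right = 0; right < n; right++) {
        if (count[data[right]]++ == 0)       /* new value enters window */
            distinct++;
        while (distinct >= k) {              /* window [left,right] works */
            unsigned len = right - left + 1;
            if (best == 0 || len < best)
                best = len;
            if (--count[data[left++]] == 0)  /* shrink from the left */
                distinct--;
        }
    }
    return best;
}
```

On the 56-element example above this reproduces the aperture column:
3 for k=3, 7 for k=7, 16 for k=8, 23 for k=9, and 25 for k=10.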

My AD 1996 program used to slide a window that started with three
elements (plus a sentinel on either side) over the big array and spit
out the position of the first element if it found three distinct
elements in the window *and* the two sentinels also contained any of the
values in the window. The program included some fairly minor
optimizations, such as immediately sliding the window so that it started
at the last of a series of equal values: in the example above, the
three-position-wide window would start by covering pos 6..8, and next it
would move to 11..13, etc. If at any moment the program found that all
elements in the window *and* the sentinel on the right were distinct,
the window size was increased by one and the scan continued.

Taking the above example, a window with an aperture of four elements
would start its slide at position 6..9. Once it reached position 49,
with "gihb" visible and "g" and "f" in the two sentinels, the window
would be widened to five characters as the "f" is distinct from the four
"visible" characters, with a new sentinel of "j". Given that this is
again distinct from the now five characters in the window, the process
would repeat itself, eventually resulting in the string of seven
distinct characters starting at position 50.

Once the string of seven distinct characters has been found, the window
is widened to *nine* (if there had been a string of 8-in-8 it would have
been found by the previous slide!) characters and the process restarts
at position 6..14, but in the end it fails to find a series of eight
distinct characters and so the process is repeated with windows of 10,
11, 12, 13, 14, 15 and finally 16 characters, when a string of eight
distinct characters is finally found.

The problem is, for relatively low values of N and C, the process may
not be overly efficient, but it works. However, once N and C increase,
(N is theoretically unbounded, C has an upper limit of around 210), the
process becomes horribly inefficient: using the old IBM V2.3.0 OS PL/I
compiler, which had a statement count facility, this method required the
execution of 556,379,518 PL/I statements (and a number of actual machine
instructions at least one order of magnitude greater) for the
current values of N(2329) and C(66).

As for some figures on required restarts, the final series of 66 is
found in a sub-array with a length of 1950 elements and with a series of
65 in a sub-array of only 1889 elements; this means that the slide had
to be restarted 61 times! Even more restarts, 312(!), are needed when
going from 61 to 62 elements.

In 1998 I posted the problem to comp.lang.pascal.borland, the post
should be on Google, but Google refuses to show more than the first
9,960 (out of 28,517) posts in that group.

Three people claimed that there was a much faster way, and two of them
backed this up with programs, Brent Beach and Paul Green, with Paul's
solution being the fastest by a fair margin. The results of his code
matched my output, I replaced my code with his code, and that seemed to
be the end of the story...

Until a few months back...

... when I started converting the original Pascal code to assembler.
While doing so I decided that it would be interesting to use this
procedure in another part of the code. However, the results were not
what I expected: result rows were missing. I managed to track down a
paper copy of the original PG program, typed that in, had it checked for
typos by a friend, and it turned out to give the same erroneous results.
PG's email address is no longer valid, and the floppy that contained a
copy of my email conversation with him and Brent, including a basic
explanation of the algorithm he used, is unreadable after having been
stored in a cold loft for the best part of the last decade. The only
thing I remember about his explanation is the words "clear as mud..."
near the end.

I have spent the last two weeks trying to figure out how it works,
staring at the Virtual Pascal IDE "Watch" window, and I've finally
decided to ask for help.

So, if anyone can give me any clues as to how I could perform the above
described process in an efficient way, I'd be very grateful. *I'm quite
capable of writing the code myself*, I would just like to get any clues
towards a more efficient algorithm. If that means you want to see the
current Pascal (or PL/I) code and the input data that leads to the
erroneous results, drop me a line, "robert(a)prino(d)org".

Thanks,

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
glen herrmannsfeldt
2010-05-17 00:29:19 UTC
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
For an alphabet size of 32 or less, that is, values 0..31, and if I
did it in C, I would probably use one-bit-per-value storage in a 32-bit
word, which makes testing fairly easy using & and |.
As far as I know, PL/I BIT(32) doesn't have the optimization
that uses a 32 bit word, and UNSPEC is likely even worse.
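
Something like this, say (an untested sketch; the function names are made
up, and the values must already be normalized to 0..31):

```c
#include <assert.h>
#include <stdint.h>

/* For an alphabet of at most 32 values (0..31), track membership as one
   bit per value in a 32-bit word; set/merge/test are then |, & and ==. */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Number of distinct values in data[0..len-1]; values must be 0..31. */
static unsigned distinct32(const unsigned char *data, unsigned len)
{
    uint32_t seen = 0;
    for (unsigned i = 0; i < len; i++)
        seen |= (uint32_t)1 << data[i];
    return popcount32(seen);
}
```

With a compiler that has a popcount builtin (or a POPCNT instruction) the
bit count becomes a single operation.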
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
I wonder if this can be done using dynamic programming.
DP can speed up many such problems, but it isn't always
so obvious.

For the longer subarrays, it might be that hash tables could
speed up the comparisons. That would, for example, get you
an O(N) test for a value being in a subarray being considered.
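
Since the values are already normalized to 1..T (at most 255), a plain
count array can play the role of the hash table, and maintaining it
incrementally makes each one-step slide of a fixed-width window O(1)
instead of a full recount. A sketch in C, with a hypothetical function
name:

```c
#include <assert.h>

/* Slide a window of width w across data[0..n-1], maintaining the number
   of distinct values incrementally via a per-value counter: O(n) total
   work instead of recounting every window from scratch.  Returns the
   maximum distinct count seen in any single window. */
static unsigned max_distinct_in_window(const unsigned char *data,
                                       unsigned n, unsigned w)
{
    unsigned count[256] = {0};
    unsigned distinct = 0, best = 0;

    for (unsigned i = 0; i < n; i++) {
        if (count[data[i]]++ == 0)               /* value enters window */
            distinct++;
        if (i >= w && --count[data[i - w]] == 0) /* value leaves window */
            distinct--;
        if (i + 1 >= w && distinct > best)       /* window is full */
            best = distinct;
    }
    return best;
}
```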
Post by Robert AH Prins
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
My AD 1996 program used to slide a window that started with three
elements (plus a sentinel on either side) over the big array and spit
out the position of the first element if it found three distinct
elements in the window *and* the two sentinels also contained any of the
values in the window. The program included some fairly minor
optimizations, such as immediately sliding the window to the last of a
series of equal values, in the example above, the three position wide
window would start by covering pos 6..8, and next it would move to
11..13, etc. If at any moment the program would find that all elements
in the window *and the sentinel on the right* would be distinct, the
window size would be increased by one and the scan would continue.
The optimization works if it is faster than a direct test.
For short subarrays, an unrolled compare loop should be pretty fast.
For longer ones, as I said above, a hash table might be fast.
Post by Robert AH Prins
Taking the above example, a window with an aperture of four elements
would start its slide at position 6..9. Once it reached position 49,
with "gihb" visible and "g" and "f" in the two sentinels, the window
would be widened to five characters as the "f" is distinct from the four
"visible" characters, with a new sentinel of "j". Given that this is
again distinct from the now five characters in the window, the process
would repeat itself, eventually resulting in the string of seven
distinct characters starting at position 50.
Once the string of seven distinct characters has been found, the window
is widened to *nine* (if there had been a string of 8-in-8 it would have
been found by the previous slide!) characters and the process restarts
at position 6..14, but in the end it fails to find a series of eight
distinct characters and so the process is repeated with windows of 10,
11, 12, 13, 14, 15 and finally 16 characters, when a string of eight
distinct characters is finally found.
The problem is, for relatively low values of N and C, the process may
not be overly efficient, but it works. However, once N and C increase,
(N is theoretically unbounded, C has an upper limit of around 210), the
process becomes horribly inefficient: using the old IBM V2.3.0 OS PL/I
compiler, which had a statement count facility, this method required the
execution of 556,379,518 PL/I statements (and a number of actual machine
instruction that is at least one order of magnitude greater) for the
current values of N(2329) and C(66).
As for some figures on required restarts, the final series of 66 is
found in a sub-array with a length of 1950 elements and with a series of
65 in a sub-array of only 1889 elements, this means that the slide had
to be restarted 61 times! Even more restarts, 312(!), are needed when
going from 61 to 62 elements.
In 1998 I posted the problem to comp.lang.pascal.borland, the post
should be on Google, but Google refuses to show more than the first
9,960 (out of 28,517) posts in that group.
Three people claimed that there was a much faster way, and two of them
backed this up with programs, Brent Beach and Paul Green, with Paul's
solution being the fastest by a fair margin. The results of his code
matched my output, I replaced my code with his code, and that seemed to
be the end of the story...
Until a few months back...
... when I started converting the original Pascal code in assembler.
Can you post some code so we can see it? It might give some good,
or not so good, ideas on how to do it better.
Post by Robert AH Prins
While doing so I decided that it would be interesting to use this
procedure in another part of the code. However, the results were not
what I expected: result rows were missing.
-- glen
Robert AH Prins
2010-05-17 09:29:55 UTC
Post by glen herrmannsfeldt
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
For alphabetsize of 32 or less, that is, values 0..31, and if I
did it in C, I would probably use the one bit per 32 bit word
storage, which makes testing fairly easy using & and |.
As far as I know, PL/I BIT(32) doesn't have the optimization
that uses a 32 bit word, and UNSPEC is likely even worse.
Bit operations, especially on aligned bitstrings, were never very bad,
even in the old V2.3.0 compiler, and BIT(1) ALIGNED already uses one bit
per byte.
Post by glen herrmannsfeldt
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
I wonder if this can be done using dynamic programming.
DP can speed up many such problems, but it isn't always
so obvious.
For the longer subarrays, it might be that hash tables could
speed up the comparisons. That would, for example, get you
an O(N) test for a value being in a subarray being considered.
Never thought of using a hash table, always focused on the sliding
window and tried to do that as optimal as possible.

<snip>
Post by glen herrmannsfeldt
Post by Robert AH Prins
Until a few months back...
... when I started converting the original Pascal code in assembler.
Can you post some code so we can see it? It might give some good,
or not so good, ideas on how to do it better.
I'll pull the code out of the program; it probably needs a bit of
massaging to make it runnable. I probably still have the (or at least an
iteration of the) sliding window code somewhere, but I may not be able
to post it this week as I will be away for a short break.
Post by glen herrmannsfeldt
Post by Robert AH Prins
While doing so I decided that it would be interesting to use this
procedure in another part of the code. However, the results were not
what I expected: result rows were missing.
Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-17 09:03:15 UTC
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
The maximum array size of 4096 is interesting; it means that we can
afford to use quite a bit of lookup table space per entry.

Otoh it also means that we cannot afford to use a lot of time to
initialize those tables, since that would take just as long or longer
than the obvious window scanning algorithm.

I would still start by determining the first and last position of each
unique byte, then use that to determine the shortest substring
which contains all the values: This becomes the limit for a window scan
that starts with aperture 3 and goes up.

/* Assumed context: typedef unsigned char byte; data[] holds the values,
   length is the number of filled cells, limit comes from the prescan
   above, save() records a result, and <string.h> is included for
   memset(). */

byte not_found[256];         // Contains 1 for byte values not seen so far

unsigned current_limit = 3;  // Minimum len!
unsigned next_aperture = 0;  // Wait until we find the first substring

for (unsigned aperture = 3; aperture < limit; aperture++) {
    if (aperture == next_aperture)
        current_limit++;

    for (unsigned start = 0; start <= length - aperture; start++) {
        /* This memset might turn out to be the single most expensive
           part of the algorithm, at least for short lengths/apertures.
           Even using 16 SSE writes, it will take at least 16 cycles
           just to get to L1 cache.

           One possible optimization is to note that at least half
           of all bus write cycles will be idle during the main loop
           below, so we could use two independent not_found buffers
           and alternate between them: use one and zero the other
           at the same time!
        */
        memset(not_found, 1, sizeof(not_found));

        unsigned count = 0;

        /* The core of the scanning algorithm:
           Count the number of unique byte values by adding together
           all those values not seen previously, while clearing each
           entry in the not_found[] array as its value is seen:
        */
        for (unsigned i = 0; i < aperture; i++) {
            byte c = data[start + i];
            count += not_found[c];
            not_found[c] = 0;
        }
        /* The running time for this loop should be ~3 cycles/iteration
           as long as the loop branch is correctly predicted, and at
           least for the shorter apertures the modern multi-level
           predictors should do quite well here.
        */

        /* This next part will be very rarely executed except for
           the shortest apertures/lengths, so we can assume it to
           be skipped:
        */
        if (count >= current_limit) {
            if (count > current_limit)
                current_limit = count;
            save(start, aperture, count);
            next_aperture = aperture + 1;
        }
    }
}

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-17 15:24:18 UTC
Post by Terje Mathisen
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
The maximum array size of 4096 is interesting, it means that we can
afford to use quite a bit of lookup table space per entry.
Memory is cheap, but cache is of course limited.
Post by Terje Mathisen
Otoh is also means that we cannot afford to use a lot of time to
initialize those tables, since that would take just as long or longer
than the obvious window scanning algorithm.
The problem is, the sliding window algorithm is fundamentally wrong. I
spent a bit of time converting the current PL/I version of the program
back to something that compiles with the old (AD late 1980s, early
1990s) compiler and ran the program with the count option. The old
sliding window method resulted, as mentioned before, in the execution of
*556,379,518* PL/I statements. The Paul Green method gets the same
result by executing a mere *79,386* PL/I statements, a rather staggering
difference...

I'm now going to dig through about a dozen boxes with papers that were
never unpacked after we moved. I have a faint hope that there may be
printouts of my email exchange with him (PG) in one of them. Hope I have
the time to do it today or tomorrow, as I'm off to the other side of
Europe later this week, provided the Eyjafjallajökull allows me...
Post by Terje Mathisen
I would still start by determining the first and last position of each
unique byte, then use that to pick the determine the shortest substring
which contains all the values: This becomes the limit for a window scan
that starts with aperture 3 and goes up.
That string may contain all values, but strings to either side of it may
contain shorter substrings. In the example I gave, all values occur in
the substring starting at position 32, but if there had been additional
values b and c at positions "0" and "-1", your window scan would, if I
read the above correctly, not pick them up as part of a 3-string.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-17 14:50:20 UTC
Post by Robert AH Prins
Post by Terje Mathisen
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
The maximum array size of 4096 is interesting, it means that we can
afford to use quite a bit of lookup table space per entry.
Memory is cheap, but cache is of course limited.
Post by Terje Mathisen
Otoh is also means that we cannot afford to use a lot of time to
initialize those tables, since that would take just as long or longer
than the obvious window scanning algorithm.
The problem is, the sliding window algorithm is fundamentally wrong. I
spend a bit of time converting the current Pl/I version of the program
back to something that compiles with the old (AD late 1980's, early
1990's compiler) and ran the program with the count option. The old
sliding window method resulted, as mentioned before, in the execution of
*556,379,518* PL/I statements. The Paul Green method gets the same
result by executing a mere *79,386* PL/I statements, a rather staggering
difference...
That sounds like the difference between an O(n*n) and O(n*log(n))
algorithm. :-)
Post by Robert AH Prins
I'm now going to dig through about a dozen boxes with papers that were
never unpacked after we moved. I have a faint hope that there may be
printouts of my email exchange with him (PG) in one of them. Hope I have
the time to do it today or tomorrow, as I'm off to the other side of
Europe later this week, provided the Eyjafjallajökull allows me...
Britain and several other western parts of Europe seem quite iffy
tomorrow morning, according to
http://weatheronline.co.uk/cgi-app/volcanic?LANG=en&ART=3
Post by Robert AH Prins
Post by Terje Mathisen
I would still start by determining the first and last position of each
unique byte, then use that to pick the determine the shortest substring
which contains all the values: This becomes the limit for a window scan
that starts with aperture 3 and goes up.
That string may contain all values, but strings to either side of it may
contain shorter substrings. In the example I gave, all values occur in
the substring starting at position 32, but if there had been additional
values b and c at positions "0" and "-1", your window scan would, if I
read the above correctly, not pick them up as part of a 3-string.
Right, I did realize that a little bit later.

This is an intriguing problem, I'll have to think some more...
:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
robin
2010-05-17 10:01:26 UTC
"glen herrmannsfeldt" <***@ugcs.caltech.edu> wrote in message news:hsq2kv$4gt$***@speranza.aioe.org...

| For alphabetsize of 32 or less, that is, values 0..31, and if I
| did it in C, I would probably use the one bit per 32 bit word
| storage, which makes testing fairly easy using & and |.
| As far as I know, PL/I BIT(32) doesn't have the optimization
| that uses a 32 bit word,

It does if it's unioned with a 31-bit FIXED BINARY.

| and UNSPEC is likely even worse.

Did you try it?
Dick Wesseling
2010-05-17 04:57:52 UTC
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
Are you saying that if we tabulate the results, then both the "len" and
"aperture" columns should have increasing values?

If my understanding of the requirements is correct then it may be better
to start with a large window and _decrease_ it in each pass. (As finding
the longest length and corresponding aperture is trivial).
For example, after finding len=10 aperture=25 one should only consider
substrings of 25 or less when searching for sequences of less than 10
distinct values.
Robert AH Prins
2010-05-17 09:12:12 UTC
Post by Dick Wesseling
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
Are you saying that if we tabulate the results, then both the "len" and
"aperture" columns should have increasing values?
If they are temporarily stored in an array or list, then it's trivial to
sort them in any required order before further processing.
Post by Dick Wesseling
If my understanding of the requirements is correct then it may be better
to start with a large window and _decrease_ it in each pass. (As finding
the longest length and corresponding aperture is trivial).
Even if there may be more than one longest interval? This *is* pretty
unlikely. However, the current set of data contains two sub-arrays with
a length of 1879 that contain 64 distinct values.
Post by Dick Wesseling
For example, after finding len=10 aperture=25 one should only consider
substrings of 25 or less when searching for sequences of less than 10
distinct values.
That is pretty obvious, but why would reducing the aperture from 24 down
to potentially 9 to find substrings with 9 distinct values be any more
efficient than starting with an aperture of 9 and increasing it to 24?
If the substring of 9 elements were to be found with an aperture of 16
or 17, i.e. dead in the middle, starting with the smaller aperture would
involve less testing...

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-17 08:09:13 UTC
Post by Dick Wesseling
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
I haven't thought this all the way through, but my immediate idea was to
start by making a full inverted index, i.e. sort all possible substrings
into alphabetical order.

However, on second consideration this probably doesn't really help. :-(
Post by Dick Wesseling
Post by Robert AH Prins
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
Are you saying that if we tabulate the results, then both the "len" and
"aperture" columns should have increasing values?
If my understanding of the requirements is correct then it may be better
to start with a large window and _decrease_ it in each pass. (As finding
the longest length and corresponding aperture is trivial).
For example, after finding len=10 aperture=25 one should only consider
substrings of 25 or less when searching for sequences of less than 10
distinct values.
This sounds like a good idea!

If we start with a full scan that indexes the locations of each possible
byte value, then we know that the longest sequence will contain each of
the unique byte values found, right?

We can then start with the full input array, then from each end remove
bytes until we get to the only remaining entry for a byte value: This
will be the shortest substring which contains all the unique values.

As you note, this is also the longest substring we need to consider...
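
A sketch in C of the trim just described (the function name is made up).
One caveat worth hedging: pinning each end at the last remaining
occurrence of its value yields *a* substring containing every distinct
value, and hence an upper limit for the scan, but greedy end-trimming is
not guaranteed to find the shortest such substring in general; on the
posted example the two happen to coincide (the 10-in-25 row, starting at
position 32).

```c
#include <assert.h>

/* Trim from both ends of data[0..n-1] until each end holds the last
   remaining occurrence of its value.  The result is a substring that
   still contains every distinct value present; its length bounds the
   window scan.  Returns the length; stores the 0-based start in *start. */
static unsigned full_cover(const unsigned char *data, unsigned n,
                           unsigned *start)
{
    unsigned count[256] = {0};
    unsigned lo = 0, hi = n - 1;

    for (unsigned i = 0; i < n; i++)
        count[data[i]]++;

    while (count[data[lo]] > 1)   /* left value occurs again later */
        count[data[lo++]]--;
    while (count[data[hi]] > 1)   /* right value occurs again in range */
        count[data[hi--]]--;

    *start = lo;
    return hi - lo + 1;
}
```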

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
robin
2010-05-17 09:59:14 UTC
"Robert AH Prins" <***@prino.org> wrote in message news:***@mid.individual.net...
| Hi all,
|
| Can anyone give me some hints as to solve the following problem,
| preferably in a way that is faster than the way I used to do it, and
| without the bug in the current version;
|
| Problem description:
|
| The program processes an array of N cells, with N up to about 4096. The
| array is filled from the start, it may not be completely full but there
| are no empty cells between full cells. A cell contains a value of
| 1..255, so logically every cell would be a byte, but if processing
| efficiency requires each cell to be a word or even a double-word, that
| would be no problem.

In general, byte values are quicker to manipulate than word,
halfword, and doubleword values.

| The values are normalized, they are effectively
| indexes into a table, so if the table contains just T entries, the
| values would go from 1 to T.
|
| Given the above setup, the problem is to find *all* sub-arrays that
| contain 3..T distinct elements and are as short as possible, i.e. if
| there are 42 sub-arrays of three elements containing three distinct
| values, there is no need to find sub-arrays of four elements containing
| three distinct values. Also, a shorter sub-array may *never* be a part
| of a longer sub-array containing *only* distinct values.
|
| An example, suppose only the array contains 56 elements, in this case
| with, for clarity, values a..j:
|
| 1 2 3 4 5 5
| ....5....0....5....0....5....0....5....0....5....0....56
| aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
|
| in this case the program should find the following sub-arrays:
|
| len aperture from to value
| 3 3 40 42 bef
| 3 3 44 46 fgh
| 7 7 50 56 gihbfja
| 8 16 41 56 efffghggggihbfja, containing efghibja
| 9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
| 10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
|
| It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
| they are either longer than the series of 7-in-7 (12-25 contain abcd),
| or are part of the 7-in-7 series and as such they should not be included!
|
| My AD 1996 program used to slide a window that started with three
| elements (plus a sentinel on either side) over the big array and spit
| out the position of the first element if it found three distinct
| elements in the window *and* the two sentinels also contained any of the
| values in the window.

This sounds like something adaptable to using INDEX,
especially as your data are byte entries (or maybe I'm
thinking of the wrong part of the search...).
Dick Wesseling
2010-05-17 23:06:23 UTC
Permalink
Post by Robert AH Prins
Post by Dick Wesseling
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
Are you saying that if we tabulate the results, then both the "len" and
"aperture" columns should have increasing values?
If they are temporarily stored in an array or list, then it's trivial to
sort them in any required order before further processing.
That's not what I meant. I wanted to make sure that I understood the
problem statement before attempting a solution.
Post by Robert AH Prins
Post by Dick Wesseling
If my understanding of the requirements is correct then it may be better
to start with a large window and _decrease_ it in each pass. (As finding
the longest length and corresponding aperture is trivial).
Even if there may be more than one longest interval? This *is* pretty
unlikely. However, the current set of data contains two sub-arrays with
a length of 1879 that contain 64 distinct values.
Post by Dick Wesseling
For example, after finding len=10 aperture=25 one should only consider
substrings of 25 or less when searching for sequences of less than 10
distinct values.
That is pretty obvious, but why would reducing the aperture from 24 down
to potentially 9 to find substrings with 9 distinct values be any more
efficient than starting with an aperture of 9 and increasing it to 24?
If the substring of 9 elements were to be found with an aperture of 16
or 17, i.e. dead in the middle, starting with the smaller aperture would
involve less testing...
The solution that I have in mind is O(N*C) where N is the length of
the input and C is either the size of the alphabet (current version) or
- with a few improvements that are left as an exercise - the number of
different lengths (which is 5 for {3,7,8,9,10} in your example).

My solution involves an exclusion list that is searched linearly,
this may ruin the performance, but it should be possible to use a
binary search tree instead. Also left as an exercise.

Anyway, here is what I had in mind. Compiles with gcc.


#include <stdio.h>
#include <string.h>
#include <strings.h> /* bzero() */


int verbose=0;

#if 1
// original example

#define TESTSTR "aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja"

#else

/*
That string may contain all values, but strings to either side of it may
contain shorter substrings. In the example I gave, all values occur in
the substring starting at position 32, but if there had been additional
values b and c at positions "0" and "-1", your window scan would, if I
read the above correctly, not pick them up as part of a 3-string.
*/

#define TESTSTR "bcaaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja"
#endif
unsigned char input [] = TESTSTR ;
unsigned N = sizeof(TESTSTR)-1;



#define C 256 // Size of alphabet
#define ONE 1 /* Pascal array bias convention */


void findseqs(unsigned char *data, unsigned len) {
unsigned i;
unsigned uniq = 0; // Nr unique symbols in window
unsigned apub = ~0; // Aperture upper bound

unsigned nrfound = /* to keep gcc happy */ 0;
unsigned found[len];

unsigned freq[C]; // Frequency count
bzero(freq, sizeof(freq));

#define inc(c) do { if(!freq[c]) uniq++; freq[c]++; } while(0)
#define dec(c) do { freq[c]--; if(!freq[c]) uniq--; } while(0)


#define WINSIZE ((r-l)+1)

// THIS NEEDS TO BE IMPROVED:
struct {
unsigned start;
unsigned end; // Exclusive end
} excluded [len];
int nrexcluded=0;


// Start with largest possible sequence: the entire input string

unsigned l = 0; // Start of window
unsigned r = len-1; // Inclusive end of window

for (i=0; i<len; i++) {
inc(data[i]);
}

// Trim left & right edges

while (freq[data[r]]>1) { dec(data[r]); r--; }
while (freq[data[l]]>1) { dec(data[l]); l++; }

// From here on we want to find only sequences that are shorter or
// equal to what we have found above:

unsigned goal = uniq;

haveseq:
if (WINSIZE <= apub ) { // Short enough?

// A candidate. May be premature if we have shorter sequences
// with the same nr of unique symbols.

if (verbose) {
printf("%s\n", data);
for (i=0; i<l; i++) printf(" ");
for ( ; i<=r; i++) printf("-");
printf("\n");
printf("Possible sequence: len %d aper %d from %d to %d\n",
uniq, WINSIZE, l, r);
}

// Part of exclusion?
// For now brute force. todo: some sort of binary search

for (i=0; i<nrexcluded; i++) {
if ( l >= excluded[i].start &&
r < excluded[i].end) {
if (verbose) printf("Skip excluded\n");
goto skip;
}
}
if (WINSIZE < apub) {
nrfound = 0; // Dump old candidates
apub = WINSIZE; // Next matches must be <= this one
}
found[nrfound] = l; // Add to list of tentative results
nrfound++;
skip: ;
}

// Slide window

dec(data[l]); l++;
while (freq[data[l]]>1) { dec(data[l]); l++; }
while (uniq < goal && r<len-1) { r++; inc(data[r]); }

// Hit right edge?
if (uniq == goal) goto haveseq;


// No more matches of this length. Print results

for (i=0; i<nrfound; i++) {

unsigned j;
printf("Found len %2d aperture %2d pos %d\t", goal, apub, ONE+found[i]);
for (j=found[i]; j< apub+found[i]; j++) printf("%c", data[j]);
printf("\n");

if (apub == goal) { // *only* distinct values?
if (verbose) printf("Add exclude\n");
excluded[nrexcluded].start = found[i];
excluded[nrexcluded].end = found[i]+goal;
nrexcluded++;
}
}

// Now the next shorter sequence.
//
// If my understanding of the problem is correct it must have
// a smaller aperture than the one found above. So in order to
// find the length and a first estimate of the aperture we slide
// a fixed size window over the input data

#define RESET \
bzero(freq, sizeof(freq)); \
uniq = l = r = 0; /* reset everything */

RESET

apub--; // Must be shorter than the previous

for (r=0; r<apub-1; r++) {
inc(data[r]);
} inc(data[r]);

unsigned int start = 0; // Where to start next iteration
unsigned int end = 0; // of main loop

goal=0;

// match fixed window at all positions
while (r < (len-1)) {
if (uniq > goal ) {

// Q: does it pay to test exclusion list at this point?
// Yes! That saves quite a few iterations of the main
// loop. E.g. sample data is now O(10*N) whereas
// O(5*N) can be achieved. This is left as an exercise
// for the reader.

goal = uniq;
start = l;
end = r;
}
dec(data[l]); l++;
r++; inc(data[r]);
}

if (goal < 3) return;

RESET
l=start;
for (r=l; r<end; r++) {
inc(data[r]);
} inc(data[r]);

while (freq[data[l]]>1) { dec(data[l]); l++; }
apub=WINSIZE;
nrfound=0;

if (verbose) {
printf("Next goal %d start %d end %d uniq %d \n",
goal, start, end, uniq );
}
goto haveseq;
}


int main(int argc, char **argv) {
if (argc >1) {
findseqs((unsigned char *)argv[1], strlen(argv[1]));
} else {
findseqs(input, N);
}
return 0;
}
James J. Weinkam
2010-05-18 08:12:04 UTC
Permalink
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
...
I believe the program below does what you want and should be a couple of orders of magnitude faster
than the method you outlined (based on your statement about how many "PL/I statements" are executed
on an input of length 2329 with 66 distinct values). For illustrative purposes, the program uses the
character representation you used in your example for input and output, but the algorithm itself
will work for values of c up to 32765 and v up to 32764 if you have enough memory and patience.

The program uses four arrays: y(c), a character array into which the input is read; x(c), an integer
array giving the encoded input with a=1, b=2, etc; a(0:c,v+1), described below; and l(v), with l(j)
the length of the shortest subsequence with j distinct values.

The basic idea is to build an array, a, with a row for each cell and a column for each value. The
i,j element of the array gives the relative cell number in the input array starting from cell i of
the first occurrence of value j, if any, otherwise a(i,j) is greater than max(c,v+1). The array is
built starting from the last row, c, and working back to row 1. Row c is first set to max(c+1,v+2).
Then a(c,x(c)) is set to 1. Working backward for i=c-1 to 1, each element of row i is set to the
minimum of 32766 and the corresponding element of row i+1 plus 1 and then element a(i,x(i)) is set to 1.

Once this construction is complete, it is evident that there is one and only one instance of the
integer 1 in each row, and all values in each row less than c+1 are unique.

Next each row is sorted. After sorting, the a(i,j) element contains the length of the shortest
subsequence starting at position i that contains j distinct values unless there is no such
subsequence, in which case a(i,j) is greater than c. It is also evident that if a(i,j)=j then
a(i,k)=k for 1<=k<=j. Now if a(i,j)=j the subsequence of length j starting at position i has j
distinct values, but it should be excluded from the solution set for j distinct values if it is a
proper subsequence of a longer subsequence of distinct values. This will be so iff a(i,j+1)=j+1 or
a(i-1,j+1)=j+1.
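If I read this construction correctly, it can be sketched in C roughly as follows (row-major table, 0-based indices, BIG standing in for the 32766 sentinel; the function and parameter names are my own):

```c
#include <stdlib.h>

#define BIG 32766u  /* sentinel larger than any real distance */

static int cmp_u(const void *p, const void *q)
{
    unsigned a = *(const unsigned *)p, b = *(const unsigned *)q;
    return (a > b) - (a < b);
}

/* a is an n-by-t matrix (row-major).  After the call, a[i*t + k-1] is
 * the length of the shortest subsequence starting at cell i (0-based)
 * that contains k distinct values, or >= BIG if none exists.
 * x[i] holds values 1..t. */
static void build_table(const unsigned char *x, unsigned n, unsigned t,
                        unsigned *a)
{
    unsigned i, j;
    for (j = 0; j < t; j++)              /* last row: nothing seen yet */
        a[(n - 1) * t + j] = BIG;
    a[(n - 1) * t + (x[n - 1] - 1)] = 1;

    for (i = n - 1; i-- > 0; ) {         /* work backward over the rows */
        for (j = 0; j < t; j++) {
            unsigned d = a[(i + 1) * t + j];
            a[i * t + j] = d >= BIG ? BIG : d + 1;
        }
        a[i * t + (x[i] - 1)] = 1;       /* value at cell i: distance 1 */
    }
    for (i = 0; i < n; i++)              /* sort each row ascending */
        qsort(a + i * t, t, sizeof *a, cmp_u);
}
```

After sorting, each row reads off the shortest j-distinct subsequence lengths starting at that cell, exactly as described above.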

Based on these considerations, the solution sets for each j are constructed by scanning each column
starting with j=3 and proceeding to j=v. For this purpose the array a actually has an additional
row, 0, to contain the subscript of the first cell of the first minimal subsequence for each j, and
an extra column, v+1, containing values greater than c to ensure that the test for inclusion can be
carried out for j=v. For each j, l(j) is initialized to c+1 and k (the previous element in the list
of minimal subsequences) to 0. The scan proceeds as follows: if a(i,j) is less than l(j) the current
subsequence is shorter than the shortest found so far, so l(j) is updated and k reset to 0. If a(i,j)
is equal to l(j) the inclusion test is applied and if it passes the current subsequence is added to
the list. Once the entire column has been processed the list is terminated by setting a(k,j)=0.
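A hedged C rendering of this column scan (my own names; it assumes a sorted row-major n-by-t table where a[i*t + j-1] is the shortest length of a j-distinct subsequence starting at cell i, with values above n meaning "none", matching the construction described above):

```c
/* For a given j (3 <= j <= t), scan column j of the sorted table and
 * collect the 0-based start cells of the shortest minimal
 * j-subsequences, skipping any length-j all-distinct window that sits
 * inside a length-(j+1) all-distinct window (the inclusion test).
 * Returns the shortest length, or n+1 if no j-subsequence exists. */
static unsigned column_scan(const unsigned *a, unsigned n, unsigned t,
                            unsigned j, unsigned *starts, unsigned *ns)
{
    unsigned best = n + 1, i;
    *ns = 0;
    for (i = 0; i < n; i++) {
        unsigned len = a[i * t + (j - 1)];
        if (len > n) continue;          /* no j-distinct subsequence here */
        if (len < best) { best = len; *ns = 0; }
        if (len == best) {
            if (len == j && j < t) {    /* shortest possible: inclusion test */
                if (a[i * t + j] == j + 1) continue;           /* inside longer */
                if (i > 0 && a[(i - 1) * t + j] == j + 1) continue;
            }
            starts[(*ns)++] = i;
        }
    }
    return best;
}
```

For j = t the inclusion test is skipped, mirroring the extra column v+1 of large values in the original program.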

It is now a simple matter to read off the solution subsequences. The program utilizes sorta to sort
the rows. call sorta(addr(first element),n,w,c) sorts an array of n elements of width w in place
according the comparison function c. c(u,v) returns '1'b (bit(1) aligned) if the u-th element may
precede the v-th element, i.e., x(u)<=x(v).


Source:

%process mar(2,100) offset;
subsets: proc options(main) reorder;
dcl
(c,v,i,j,k,m,ar init((rank('a')-1))) bin fixed,
(x(c),a(0:c,v+1),l(v)) bin fixed ctl,
y(c) char(1) ctl,
sorta entry(ptr,bin fixed(31),bin fixed(31),
entry(bin fixed(31),bin fixed(31)) returns(bit(1) aligned)),
vfmt entry(bin float(53),bin fixed(15),bin fixed(15)) returns(char(50) var),
sysin file input,
sysprint file print,
(addr,max,min,rank) builtin;
get file(sysin) list(c,v);
put file(sysprint) edit('c: ',vfmt(c,10,0),', v: ',vfmt(v,10,0))(col(1),4 a);
allocate x,a,l,y;
get file(sysin) edit(y)(col(1),(c)a(1));
put file(sysprint) edit('Input: ',y)(col(1),a,(c)a(1));
do i=1 to c; x(i)=rank(y(i))-ar; end;
a(c,*)=max(c+1,v+2); a(c,x(c))=1;
do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;
do i=1 to c; call sorta(addr(a(i,1)),v,2,comp); end;
do j=3 to v; l(j)=c+1; k=0; m=j+1; a(0,m)=0;
do i=1 to c;
if a(i,j)<l(j) then do; l(j)=a(i,j); k=0; end;
if a(i,j)=l(j) then if a(i-1,m)~=m then if a(i,m)~=m then do; a(k,j),k=i; end;
end;
a(k,j)=0;
if a(0,j)>0 then do;
put file(sysprint) edit('Distinct: ',vfmt(j,10,0),', length: ',vfmt(l(j),10,0))(col(1),4 a);
i=a(0,j); k=0; do while(i~=0); k+=1;
put file(sysprint) edit(k,': (',vfmt(i,10,0),') ',(y(i+m) do m=0 to l(j)-1))
(col(1),f(10),3 a,(l(j))a(1));
i=a(i,j);
end;
end;
end;
comp: proc(u,v) returns(bit(1) aligned) reorder;
dcl (u,v) bin fixed(31);
return(a(i,u)<=a(i,v));
end comp;
end subsets;

Input:

56 10
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja

Output:

c: 56, v: 10
Input: aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
Distinct: 3, length: 3
1: (40) bef
2: (44) fgh
Distinct: 7, length: 7
1: (50) gihbfja
Distinct: 8, length: 16
1: (41) efffghggggihbfja
Distinct: 9, length: 23
1: (34) cbbbbbbefffghggggihbfja
Distinct: 10, length: 25
1: (32) dccbbbbbbefffghggggihbfja

Here is the output from another example:

c: 48, v: 10
Input: abcabcabcabcdefgabcabccdabcdeabcaaabbbcccdefghij
Distinct: 3, length: 3
1: (1) abc
2: (2) bca
3: (3) cab
4: (4) abc
5: (5) bca
6: (6) cab
7: (7) abc
8: (8) bca
9: (9) cab
10: (18) bca
11: (19) cab
12: (20) abc
13: (31) bca
Distinct: 4, length: 4
1: (23) cdab
2: (24) dabc
Distinct: 5, length: 5
1: (25) abcde
2: (26) bcdea
3: (27) cdeab
4: (28) deabc
Distinct: 7, length: 7
1: (10) abcdefg
2: (11) bcdefga
3: (12) cdefgab
4: (13) defgabc
Distinct: 8, length: 8
1: (41) cdefghij
Distinct: 9, length: 11
1: (38) bcccdefghij
Distinct: 10, length: 14
1: (35) abbbcccdefghij
robin
2010-05-19 12:54:53 UTC
Permalink
"James J. Weinkam" <***@nospicedham.cs.sfu.ca> wrote in message news:oZrIn.4426$z%***@edtnps83...

| I believe the program below does what you want and should be a couple of orders of magnitude faster
| than the method you outlined (based on your statement about how many "PL/I statements" are executed
| on an input of length 2329 with 66 distinct values). For illustrative purposes, the program uses the
| character representation you used in your example for input and output, but the algorithm itself
| will work for values of c up to 32765 and v up to 32764 if you have enough memory and patience.

Assuming that it works, a very impressive effort!
Terje Mathisen
2010-05-19 18:15:32 UTC
Permalink
Post by robin
| I believe the program below does what you want and should be a couple of orders of magnitude faster
| than the method you outlined (based on your statement about how many "PL/I statements" are executed
| on an input of length 2329 with 66 distinct values). For illustrative purposes, the program uses the
| character representation you used in your example for input and output, but the algorithm itself
| will work for values of c up to 32765 and v up to 32764 if you have enough memory and patience.
Assuming that it works, a very impressive effort!
Indeed.

I have not tried to read the code (or explanation) except to note that
it must be far better than O(n*n). :-)

With that given, how about a simple approach:

Scan through the input string from beginning to end, while maintaining a
set of lookup tables, one for each previous starting position. In each
lookup table we will have two things: The values seen so far, and the
total number of different characters.

For each possible length we remember the shortest string that so far has
resulted in this many unique values, updating it when we find a shorter
string.

This way the processing would consist of loading data[i], then for each
of the previous bytes taken as the starting position check if this value
hasn't been seen before, and if so, update the count and check if this
count/length is equal or better to the best seen so far:

for (i = 2; i < data_len; i++) {
unsigned char b = data[i];
for (j = 0; j < i-2; j++) {
if (!seen[j][b]) {
seen[j][b] = 1;
unique_values = ++count[j];
len = i-j+1;
if (len <= shortest[unique_values]) {
save(unique_values, len, j);
shortest[unique_values] = len;
}
}
}
}

The save function will generate a list of all possible "winning"
candidates, so it must be sorted and pruned before the final output but
that is trivial.

The main (only?) problem with the program above is that it is still
O(n*n), with data_len = 4096, the total number of iterations will be
1+2+3+4+...+4093 which is ~8M.

However, since the maximum possible number of unique values is the
character set size, even using all 256 possible byte values would mean
that for the longer sequences, starting early in the input, 15/16 of all
bytes will have been seen previously so the only code executed will be the

if (!seen[j][b]) {

test & branch.

With 4K 256-byte lookup tables we'll need 1 MB of cache, using bits
instead would compress this to 128K, but that is still too large to fit
in L1 cache, so probably not worth it.

I expect the running time to be ~ 10-20 cycles/inner loop iteration, or
80-160M cycles for the 4K input, letting the program finish in less than
a tenth of a second on any currently existing PC.

Is that fast enough?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Dick Wesseling
2010-05-19 19:34:56 UTC
Permalink
Post by Terje Mathisen
I have not tried to read the code (or explanation) except to note that
it must be far better than O(n*n). :-)
I have studied the code and the explanation. I too am impressed by its
elegance, and now I finally understand the problem statement.
I think that there is a solution with equal or better O() and a smaller
memory requirement.
Post by Terje Mathisen
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible,
It took me a while to understand what is meant by "as short as possible".
Post by Terje Mathisen
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
-------------- -------
4-in-14 7-in-7


I did not fully understand why 12-25 was to be rejected, after all
"abcd" is not a substring of "gihbfja".
Now I understand. 7-in-7 contains substrings 6-in-6, 5-in-5, 4-in-4
and so on. These are not to be printed because they are proper
substrings of "gihbfja". However, they define "as short as possible"
for lengths <= 7.

In other words, once you know the longest n-in-n sequence all shorter
sequences must consist entirely of unique symbols.

This suggests two different search strategies, one for long sequences and
one for short sequences.

Plan A:

Starting with the longest possible sequence search for n-in-m
sequences until n=m. This requires multiple passes over the input,
one for each unique value of n with m>=n (Or two passes actually, see
my first attempt).

Plan B:

Once the longest n-in-n sequence is known perform ONE more pass
over the input, searching for remaining maximal sequences with
3 <= length <n that consist of unique symbols only.

Finishing touch:
Sort the results if so desired.
Dick Wesseling
2010-05-21 01:46:27 UTC
Permalink
/*
Here is the final version of my solution.

Let: N number of cells, size of the input
V number of values, size of symbol set.
M the number of distinct values in the largest sequence.
T the size of the largest n-in-n sequence, i.e. the
largest sequence consisting of distinct values only.

As explained in my previous message <4bf43d60$0$22939$***@news6.xs4all.nl>
the algorithm uses two different search strategies:

Plan A

Find the largest sequence with size M

for n from M downwards search n-in-m sequences, keeping only sequences
with minimal value of m.
Print sequence if n<m. (If n=m the same sequence will be found again
in the next step).

Now n=m=T.

Plan B
Prints all maximal n-in-n sequences.

Plan A scans the input 1+(M-T) .. 2+(M-T) times, depending on the
input position of the longest sequence.
Plan B scans the input once.

The running time is therefore bounded by O(N * (3+(M-T))), which
is typically better than or equal to O(N*C).
For small values of N the cost of erasing the frequency array should
also be accounted for.

Two arrays are used:
found[N] Intermediate results of plan A.
freq[C] Frequency count for all symbols in the sliding window


Finding sequences is done using a straightforward sliding window
scan of the input. The values in freq[] are updated when the window
edges move and the number of unique symbols is updated when a
frequency count changes between zero and non-zero.


The output is unsorted.

*/

#include <stdio.h>
#include <string.h>
#include <strings.h> /* bzero() */

#define C 256 // Size of alphabet
#define ONE 1 // Pascal array bias convention

void printres (unsigned char *data, unsigned pos, unsigned l, unsigned w)
{
unsigned i;
printf("len %2d aperture %2d pos %2d\t", l, w, ONE+pos);
for (i=pos; i<pos+w; i++) printf("%c", data[i]);
printf("\n");
}

void findseqs (unsigned char *data, unsigned len)
{
// printf("%s\n", data);
unsigned i;
unsigned uniq = 0; // Nr unique symbols in window
unsigned apub = ~0; // Aperture upper bound (aka "m")

unsigned nrfound = /* to keep gcc happy */ 0;
unsigned found[len]; // intermediate results

unsigned freq[C]; // Frequency count
bzero(freq, sizeof(freq));

# define inc(c) do { if(!freq[c]) uniq++; freq[c]++; } while(0)
# define dec(c) do { freq[c]--; if(!freq[c]) uniq--; } while(0)
# define WINSIZE ((r-l)+1)

// Start with largest possible sequence: the entire input string

unsigned l = 0; // Start of window
unsigned r = len-1; // Inclusive end of window
unsigned lim = len-1; // inclusive end of data

for (i=0; i<len; i++) {
inc(data[i]);
}
if (uniq < 3) return;

// Trim left & right edges

while (freq[data[r]]>1) { dec(data[r]); r--; }
while (freq[data[l]]>1) { dec(data[l]); l++; }

// From here on we want to find only sequences that are shorter or
// equal to what we have found above:

unsigned goal = uniq;

while(1) {

// data[l..r] contains "goal" unique symbols
// freq[] number of occurrences of each symbol in data[l..r]
// data[l] is a unique symbol
// data[r] is a unique symbol

if (WINSIZE <= apub) { // Short enough?

// A candidate. May be premature if we have shorter sequences
// with the same nr of unique symbols.

if (WINSIZE < apub) {
nrfound = 0; // Dump old candidates
apub = WINSIZE; // Next matches must be <= this one
}
found[nrfound] = l; // Add to list of tentative results
nrfound++;
}

// Slide window to next match.

dec(data[l]); l++;
while (freq[data[l]]>1) { dec(data[l]); l++; }
while (uniq < goal && r<lim) { r++; inc(data[r]); }

if (uniq == goal) continue;

// We've hit the right edge

bzero(freq, sizeof(freq)); // Reset window & statistics
uniq = l = r = 0;

if (apub == goal) break; // If *only* distinct values proceed
// with plan B.

for (i=0; i<nrfound; i++) { // Print the results
printres(data, found[i], goal, apub);
}

// Now the next shorter sequence.

goal--;

inc(data[l]);
do {
r++;
inc(data[r]);
} while (uniq != goal);

while (freq[data[l]] != 1) {
dec(data[l]);
l++;
}
// Now the precondition holds. Also, the window size is smaller
// than the aperture upper bound from the previous iteration.
}

// Plan B
// Now we consider only sequences with all symbols distinct.
// The freq[] array therefore behaves like a bitmap with values 0&1 only.

freq[data[l]] = 1;

while (r<lim) {
unsigned char cl, cr;
cl = data[l];
cr = data[r+1];

if (freq[cr] != 0) { // Maximal sequence?
if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
do { // Move left edge until
cl = data[l]; // we have all symbols unique.
freq[cl] = 0;
l++;
} while (cl != cr);
}
r++;
freq[data[r]]=1;
}

// Final sequence.

if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
return;
}

int main(int argc, char **argv)
{
if (argc > 2) {
fprintf(stderr, "Usage: %s {string}\n", argv[0]);
return 1;
}
if (argc == 1) {
// read from standard input
unsigned char buf[4096];
size_t inlen = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
findseqs( buf, (unsigned) inlen);
}
if (argc == 2) {
findseqs((unsigned char *)argv[1], strlen(argv[1]));
}

return 0;
}
James J. Weinkam
2010-05-27 00:43:25 UTC
Permalink
This message was originally posted on 2010/05/21 at 23:09 PDT.
It has not yet appeared on any of the newsgroups; numerous other messages in this thread have
appeared since then.

When I tried to repost it just now, a message box appeared saying "Sending of message failed. You
may only send to one newsgroup at a time."

Accordingly, I am now posting it to each group separately.
Post by Dick Wesseling
/*
Here is the final version of my solution.
Let: N number of cells, size of the input
V number of values, size of symbol set.
M the number of distinct values in the largest sequence.
T the size of the largest n-in-n sequence, i.e. the
largest sequence consisting of distinct values only.
Plan A
Find the largest sequence with size M
for n from M downwards search n-in-m sequences, keeping only sequences
with minimal value of m.
Print sequence if n<m. (If n=m the same sequence will be found again
in the next step).
Now n=m=T.
Plan B
Prints all maximal n-in-n sequences.
Plan A scans the input 1+(M-T) .. 2+(M-T) times, depending on the
input position of the longest sequence.
Plan B scans the input once.
The running time is therefore bounded by O(N * (3+(M-T))), which
is typically better than or equal to O(N*C).
For small values of N the cost of erasing the frequency array should
also be accounted for.
found[N] Intermediate results of plan A.
freq[C] Frequency count for all symbols in the sliding window
Finding sequences is done using a straightforward sliding window
scan of the input. The values in freq[] are updated when the window
edges move and the number of unique symbols is updated when a
frequency count changes between zero and non-zero.
The output is unsorted.
...
When I did my original solution a few days ago, I didn't even consider a moving window approach
because the OP had said that it's performance was terrible. He also mentioned complexities such as
having to restart the scan multiple times, etc.

Seeing your sliding window approach inspired me to give that approach a try.

I have come up with a streamlined version of your algorithm that I believe meets the OP's
specifications. All of us who have participated in discussing this problem have been bandying about
terms such as unique, distinct, minimal, shortest, without making the intended interpretation
completely clear. I had to do a bit of reading between the lines to come up with what I believe are
the OP's intended specifications, which is why I only say that I believe my solution meets his
specifications. I also inadvertently added to the confusion: by the time I wrote my first solution,
I did not have the original post in front of me; I remembered that the OP had referred to cells and
values so I used C for the number of cells and V for the number of values. Only later did I discover
that he had used N and T for these quantities. I have renamed some of the variables in my program to
make it consistent with the OP's notation.

Let me suggest the following definitions and nomenclature:

The data consist of a sequence of n cells stored in an array, x, with subscripts running from 1 to n.

Each cell contains one of a set of t values which are integers between 1 and t inclusive.

A subsequence of length l consists of l consecutive cells from the array x.

A k-subsequence is a subsequence of length l>=k which contains exactly k of the t possible values.

A minimal k-subsequence is one for which the removal of the first or last cell would reduce it to a
(k-1)-subsequence.

A shortest k-subsequence is a minimal k-subsequence whose length is less than or equal to that of
any other minimal k-subsequence in the array.

A shortest possible k-subsequence is a k-subsequence of length k.

If a minimal k-subsequence is not shortest possible, it contains at least one repeated value. Since
every minimal 1- or 2-subsequence is shortest possible, shortest k-subsequences are only of interest
for k>=3. The array may but need not contain a shortest possible k-subsequence for 3<=k<=t.
If the array contains a shortest possible k-subsequence, it also contains shortest possible
j-subsequences for 3<=j<k.

Assuming that each of the t values appears in the array at least once, the array contains at least
one shortest k-subsequence for 1<=k<=t.

Within this framework, here is my understanding of the OP's requirements: "For 3<=k<=t, list all
shortest k-subsequences except, for k<t, omit any shortest possible k-subsequences which are
contained within a shortest possible (k+1)-subsequence."

A shortest possible k-subsequence, z, is not contained within a shortest possible k+1-subsequence
iff each of the immediately preceding and following cells (if it exists) contains a value already
present in z. I shall refer to this as the inclusion test. Note that the inclusion test always
passes for k=t.
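A small C sketch of this inclusion test (names are mine; a 256-entry membership table stands in for the frequency array, and 0-based indices replace the sentinel cells):

```c
#include <stdbool.h>

/* Inclusion test for a length-k window of all-distinct values starting
 * at 0-based index i: the window survives iff the cell before it and
 * the cell after it (where they exist) each repeat a value already
 * inside the window.  If either neighbour brings a new value, the
 * window is a proper part of a longer all-distinct subsequence and is
 * dropped from the result set. */
static bool passes_inclusion(const unsigned char *x, unsigned n,
                             unsigned i, unsigned k)
{
    bool inside[256] = { false };
    unsigned j;
    for (j = i; j < i + k; j++)
        inside[x[j]] = true;
    if (i > 0 && !inside[x[i - 1]])
        return false;        /* extends left into an all-distinct (k+1)-window */
    if (i + k < n && !inside[x[i + k]])
        return false;        /* extends right into an all-distinct (k+1)-window */
    return true;
}
```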

In my new program, I have abandoned the use of characters for the external representation and
switched to integers between 1 and t. I have also changed the data representation from bin fixed(15)
to bin fixed(16) unsigned. This allows values of n and t up to 65534. Finally, I added timing of the
input, processing, and output.

To facilitate application of the inclusion test, the data array, x, has sentinel cells x(0) and
x(n+1) both set equal to the non existent value 0. The frequency table, f, has an extra entry f(0)
set equal to 1.

For each k, 3<=k<=t, a window delimited by the subscripts i on the left and j on the right is moved
across the entire data array. The width of the window is w=j-i+1. Initially, the window is at cell 1
(i=1) and is empty (w=0). The relationship w=j-i+1 gives j=0. Since no subsequence can be longer
than n, l, the length of the shortest minimal k-subsequence found so far is set to n+1; the number
of shortest minimal k-subsequences found so far, s, is set to 0; the frequency table, f, is
initialized to 0 except for f(0) which is set to 1 as previously explained; finally the number of
distinct values within the window, d, is set to 0.

The scan of the data consists of repeatedly finding the next minimal k-subsequence and deciding
whether or not to remember it. This is done as follows: first the window is increased on the right
one cell at a time until the number of distinct values is equal to k or the end of the data is
reached, in which case there are no more minimal k-subsequences so the scan is terminated. At this
point, the number of instances of the value at position j is necessarily 1. Next, values are removed
from the window on the left until the leftmost value has only 1 instance in the window. At this point
the window contains the leftmost minimal k-subsequence not already examined. Next the width of the
window, w, is computed and the disposition of the subsequence decided. If w>l, the subsequence is
skipped; if w<l, a shorter minimal k-subsequence has been found, so the previous set of
k-subsequences is thrown away and the subsequence becomes the first element in a new set if it
passes the inclusion test, otherwise the set is left empty; otherwise w=l, so the subsequence is
added to the set of remembered k-subsequences if it passes the inclusion test, which is only applied
if w=k.

Finally the first cell is deleted from the window leaving a not necessarily minimal
(k-1)-subsequence and the loop repeats.

Here is the PL/I source code:

%process mar(2,100) offset;
subsets: proc options(main) reorder;
dcl
(n,t,i,j,k,v,d,w,l,s) bin fixed(16) unsigned,
(x(0:n+1),a(n),f(0:t)) bin fixed(16) unsigned ctl,
(ti,to,tc init(0)) bin float(53),
vfmt entry(bin float(53),bin fixed(15),bin fixed(15)) returns(char(50) var),
uttime0 entry, uttime entry returns(bin float(53)),
sysin file input,
output file output stream;
call uttime0;
get file(sysin) list(n,t);
allocate x,a,f;
get file(sysin) list((x(i) do i=1 to n)); x(0),x(n+1)=0;
ti=uttime;
open file(output) title('/stdout:,type(crlf),recsize(32000)');
put file(output) edit('n: ',vfmt(n,10,0),', t: ',vfmt(t,10,0))(col(1),4 a);
put file(output) edit('Input: ',(vfmt(x(i),10,0) do i=1 to n))(col(1),a,(n)(a,x(1)));
to=uttime;
do k=3 to t; l=n+1; f(0)=1; do i=1 to t; f(i)=0; end; d=0; i=1; j=0; s=0;
scan: do forever;
do j=j+1 to n until(d=k); v=x(j); f(v)+=1; if f(v)=1 then d+=1; end; if d<k then leave scan;
v=x(i); do while(f(v)>1); f(v)-=1; i+=1; v=x(i); end; w=j-i+1;
if w>l then;
else if w<l then do; l=w; s=0;
if w>k then do; s=1; a(1)=i; end;
else if f(x(i-1))>0 then if f(x(j+1))>0 then do; s=1; a(1)=i; end;
end;
else do;
if w>k then do; s+=1; a(s)=i; end;
else if f(x(i-1))>0 then if f(x(j+1))>0 then do; s+=1; a(s)=i; end;
end;
f(x(i))=0; d-=1; i+=1;
end;
tc+=uttime;
if s>0 then do;
put file(output) edit('Distinct: ',vfmt(k,10,0),', length: ',vfmt(l,10,0))(col(1),4 a);
do j=1 to s; i=a(j);
put file(output) edit(j,': (',vfmt(i,10,0),') ',(vfmt(x(i+v),10,0) do v=0 to l-1))
(col(1),f(6),3 a,(l)(a,x(1)));
end;
end;
to+=uttime;
end;
put file(output) skip(2) edit('Elapsed time: input ',vfmt(ti,15,8),', processing ',
vfmt(tc,15,8),', output ',vfmt(to,15,8),'.')(col(1),7 a);
end subsets;

Here is the output from one small example of randomly generated data:

n: 64, t: 12
Input: 11 12 10 9 3 7 10 8 6 7 6 3 5 12 10 10 11 2 12 9 6 4 5 4 5 6 7 3 12 5 4 1 1 8 12 6 7 11 3 8 3
11 11 7 11 6 9 8 9 1 3 11 1 6 9 10 2 7 10 8 4 9 7 5
Distinct: 3, length: 3
1: (40) 8 3 11
Distinct: 4, length: 4
1: (7) 10 8 6 7
Distinct: 5, length: 5
1: (44) 7 11 6 9 8
2: (48) 8 9 1 3 11
Distinct: 6, length: 6
1: (1) 11 12 10 9 3 7
2: (4) 9 3 7 10 8 6
3: (10) 7 6 3 5 12 10
4: (24) 4 5 6 7 3 12
5: (35) 12 6 7 11 3 8
6: (57) 2 7 10 8 4 9
7: (59) 10 8 4 9 7 5
Distinct: 7, length: 7
1: (26) 6 7 3 12 5 4 1
2: (33) 1 8 12 6 7 11 3
Distinct: 8, length: 8
1: (16) 10 11 2 12 9 6 4 5
2: (51) 3 11 1 6 9 10 2 7
Distinct: 9, length: 10
1: (30) 5 4 1 1 8 12 6 7 11 3
2: (51) 3 11 1 6 9 10 2 7 10 8
3: (52) 11 1 6 9 10 2 7 10 8 4
Distinct: 10, length: 11
1: (51) 3 11 1 6 9 10 2 7 10 8 4
Distinct: 11, length: 14
1: (51) 3 11 1 6 9 10 2 7 10 8 4 9 7 5
Distinct: 12, length: 19
1: (16) 10 11 2 12 9 6 4 5 4 5 6 7 3 12 5 4 1 1 8

Elapsed time: input 0, processing 0, output 0.

The table below summarizes the results of some larger examples, a couple of which go way beyond the
limits indicated by the OP.

n      t     i    c     o      max shortest possible  shortest minimal t-subsequence
64     12    0    0     0      8(2)                   19(1)
4096   256   .01  .03   .19    54(1)                  1181(1)
8192   250   .02  .06   .19    54(1)                  994(1)
16384  256   .03  .13   .2     64(2)                  981(1)
32000  250   .06  .21   .31    64(1)                  931(1)
60000  800   .13  1.18  1.85   117(1)                 4043(1)
65534  4096  .15  7.42  42.68  220(1)                 29213(1)

All times are in seconds. The numbers in parentheses are the number of instances; the first example
has 2 shortest possible minimal 8-subsequences and the third example has 2 shortest possible minimal
64-subsequences.

The program is very fast. The complexity is O(n*t). For n=4096 and t=256, which is the maximum size
contemplated by the OP, the computational portion of the program requires only 30ms.
Dick Wesseling
2010-05-27 19:32:51 UTC
Permalink
Post by James J. Weinkam
%process mar(2,100) offset;
subsets: proc options(main) reorder;
This source code really needs some indenting and white space!
As it is, it is almost unreadable. :-(
So is your reply, unless you put comp.lang.pl1 back in the newsgroups
list.
Post by James J. Weinkam
The program is vary fast. The complexity is O(n*t). For n=4096 and t=256
Since my approach terminates each scan on the first collision, it is
O(n*sqrt(t)), while keeping the working set very low.
Post by James J. Weinkam
which is the maximum size
contemplated by the OP, the computational portion of the program
requires only 30ms.
You need to run your program on all of the same data sets given by the
OP, verifying both the results and the runtime!
As I wrote yesterday, my current program has done the 2300 char set in
less than a ms total, both scanning and sorting/filtering.
OTOH, all of the competing solutions we've come up with are _far_ faster
than what could be needed. :-)
Terje
Robert AH Prins
2010-05-19 22:19:27 UTC
Permalink
Post by Terje Mathisen
Post by robin
| I believe the program below does what you want and should be a
couple of orders of magnitude faster
| than the method you outlined (based on your statement about how many
"PL/I statements" are executed
| on an input of length 2329 with 66 distinct values). For
illustrative purposes, the program uses the
| character representation you used in your example for input and
output, but the algorithm itself
| will work for values of c up to 32765 and v up to 32764 if you have
enough memory and patience.
Assuming that it works, a very impressive effort!
Indeed.
I have not tried to read the code (or explanation) except to note that
it must be far better than O(n*n). :-)
But it does contain a sort...
Post by Terje Mathisen
Scan through the input string from beginning to end, while maintaining a
set of lookup tables, one for each previous starting position. In each
lookup table we will have two things: The values seen so far, and the
total number of different characters.
For each possible length we remember the shortest string that so far has
resulted in this many unique values, updating it when we find a shorter
string.
This way the processing would consist of loading data[i], then for each
of the previous bytes taken as the starting position check if this value
hasn't been seen before, and if so, update the count and check if this
for (i = 2; i < data_len; i++) {
byte b = data[i];
for (j = 0; j < i-2; j++) {
if (!seen[j][b]) {
seen[j][b] = 1;
unique_values = ++count[j];
len = i-j+1;
if (len <= shortest[unique_values]) {
save(unique_values, len, j);
shortest[unique_values] = len;
}
}
}
}
The save function will generate a list of all possible "winning"
candidates, so it must be sorted and pruned before the final output but
that is trivial.
The main (only?) problem with the program above is that it is still
O(n*n): with data_len = 4096, the total number of iterations will be
1+2+3+4+...+4093, which is ~8M.
However, since the maximum possible number of unique values is the
character set size, even using all 256 possible byte values would mean
that for the longer sequences, starting early in the input, 15/16 of all
bytes will have been seen previously so the only code executed will be the
if (!seen[j][b]) {
test & branch.
With 4K 256-byte lookup tables we'll need 1 MB of cache, using bits
instead would compress this to 128K, but that is still too large to fit
in L1 cache, so probably not worth it.
I expect the running time to be ~ 10-20 cycles/inner loop iteration, or
80-160M cycles for the 4K input, letting the program finish in less than
a tenth of a second on any currently existing PC.
Is that fast enough?
You seem to have come up with a carbon copy of the approach Paul Green
used in 1998 - I did finally find a printout of our email conversation
from 1998 in one of the unopened boxes on the loft. I will scan and
(hopefully) OCR(ish) it when I come back from Vilnius next week.

His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to the why.

If there is interest, I can also post my original sliding window code.

FWIW, the number of statements executed is obviously a rather crude way
of assessing the efficiency of a program. However, I also ran the PL/I
program with Strobe, a mainframe performance analyzing tool. The result
was even more devastating, *one single loop* to see if a next value is
already in the series was responsible for 95% of the CPU time of the
program. Ouch!

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
James J. Weinkam
2010-05-21 01:38:15 UTC
Permalink
Post by Robert AH Prins
His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to the why.
Can you send a copy of the data set that fails? Also, if it isn't prying into your business, would
you describe the application this problem is abstracted from? You indicated in your original post
that the values (characters in the posted example) were actually indices into a table in the real
application.

Finally, the code I posted has two bugs that I know of so far; but they didn't prevent the algorithm
from getting the correct answer. Has anyone spotted them?
Terje Mathisen
2010-05-21 06:56:52 UTC
Permalink
Post by James J. Weinkam
Post by Robert AH Prins
His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to the why.
Can you send a copy of the data set that fails? Also, if it isn't prying
into your business, would you describe the application this problem is
abstracted from? You indicated in your original post that the values
(characters in the posted esample) were actually indices into a table in
the real application.
I would also love to see that, I really do believe that my current
approach is sound.
Post by James J. Weinkam
Finally, the code I posted has two bugs that I know of so far; but they
didn't prevent the algorithm from getting the correct answer. Has anyone
spotted them?
No, we have all been too busy debugging our own algorithms! :-)

The raw output (not filtered to only keep the best solution) is as
follows for the sample input:

The rows are in order:

Nr of unique symbols
String length
Starting position
The found string

3 5 11 abbbc
4 14 11 abbbcbcccccccd
3 4 31 dccb
4 10 31 dccbbbbbbe
5 30 11 abbbcbcccccccddddddddccbbbbbbe
3 3 39 bef
4 9 33 cbbbbbbef
5 11 31 dccbbbbbbef
6 31 11 abbbcbcccccccddddddddccbbbbbbef
4 6 39 befffg
6 14 31 dccbbbbbbefffg
7 34 11 abbbcbcccccccddddddddccbbbbbbefffg
3 3 43 fgh
4 6 40 efffgh
5 7 39 befffgh
6 13 33 cbbbbbbefffgh
7 15 31 dccbbbbbbefffgh
8 35 11 abbbcbcccccccddddddddccbbbbbbefffgh
6 12 39 befffghggggi
8 20 31 dccbbbbbbefffghggggi
9 40 11 abbbcbcccccccddddddddccbbbbbbefffghggggi
3 3 49 gih
3 3 50 ihb
4 4 49 gihb
3 3 51 hbf
4 4 50 ihbf
5 5 49 gihbf
3 3 52 bfj
4 4 51 hbfj
5 5 50 ihbfj
6 6 49 gihbfj
7 15 40 efffghggggihbfj
9 24 31 dccbbbbbbefffghggggihbfj
10 44 11 abbbcbcccccccddddddddccbbbbbbefffghggggihbfj
3 3 53 fja
4 4 52 bfja
5 5 51 hbfja
6 6 50 ihbfja
7 7 49 gihbfja
8 16 40 efffghggggihbfja
9 23 33 cbbbbbbefffghggggihbfja
10 25 31 dccbbbbbbefffghggggihbfja

Filtering is trivial:

First remove all non-shortest strings for a given # of unique symbols
Next remove all strings that either start or end at the same position as
another string while being exactly one character shorter. I.e. remove
all pure substrings.

Doing the first stage manually on the set above result in:

3 3 39 bef
3 3 43 fgh
3 3 49 gih
3 3 50 ihb
4 4 49 gihb
3 3 51 hbf
4 4 50 ihbf
5 5 49 gihbf
3 3 52 bfj
4 4 51 hbfj
5 5 50 ihbfj
6 6 49 gihbfj
3 3 53 fja
4 4 52 bfja
5 5 51 hbfja
6 6 50 ihbfja
7 7 49 gihbfja
8 16 40 efffghggggihbfja
9 23 33 cbbbbbbefffghggggihbfja
10 25 31 dccbbbbbbefffghggggihbfja

The second stage removes the proper substrings, starting from the shortest:

3 3 39 bef
3 3 43 fgh
7 7 49 gihbfja
8 16 40 efffghggggihbfja
9 23 33 cbbbbbbefffghggggihbfja
10 25 31 dccbbbbbbefffghggggihbfja

Except for the use of zero-based counting of starting position, this
looks identical to the original request. :-)

Running time is still O(N*sqrt(C)) where N is the input length and C is
the character size, and total memory use is O(N), i.e. the same as the
data itself.

With 4K inputs total data size would be 12KB, but only 4.5 KB would be
within the working set at any given time, so L1 hit rates are very close
to 100%.

int scan(byte *data, int data_len)
{
int c = 0;
for (int i = 2; i < data_len; i++) {
byte b = data[i];
for (int j = i-1; j >= 0; j--) {
// A previously seen character cannot end a new longest sequence!
// It also cannot add to the sequence count for this or any
// longer string, so stop scanning!
if (data[j] == b) break;

int len = i-j+1;
// Increment the count of unique values seen when starting
// from position [j]
int unique_values = ++count[j];
if (len <= shortest[unique_values]) {
if (len >= 3)
save(unique_values, len, j);
shortest[unique_values] = len;
c++;
}
}
}
return c;
}


Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Jerome H. Fine
2010-05-21 13:08:31 UTC
Permalink
[Snip]
With 4K inputs total data size would be 12KB, but only 4.5 KB would be within the working set at
any given time, so L1 hit rates are very close to 100%.
int scan(byte *data, int data_len)
{
int c = 0;
for (int i = 2; i < data_len; i++) {
byte b = data[i];
for (int j = i-1; j >= 0; j--) {
// A previously seen character cannot end a new longest sequence!
// It also cannot add to the sequence count for this or any
// longer string, so stop scanning!
if (data[j] == b) break;
int len = i-j+1;
// Increment the count of unique values seen when starting
// from position [j]
int unique_values = ++count[j];
if (len <= shortest[unique_values]) {
if (len >= 3)
save(unique_values, len, j);
shortest[unique_values] = len;
c++;
}
}
}
return c;
}
As Terje has pointed out a number of times, taking into account
the hit rate for L1 and L2 cache will have a significant impact
on how fast the code runs. While many of the other contributions
to this thread have ignored this aspect, a program which requires
only L1 cache for most of its data can run many times faster,
perhaps up to ten times as fast based on some of the tests I did
about a year ago.

Jerome Fine
Terje Mathisen
2010-05-21 15:21:35 UTC
Permalink
[Snip]
With 4K inputs total data size would be 12KB, but only 4.5 KB would
be within the working set at any given time, so L1 hit rates are very
close to 100%.
int scan(byte *data, int data_len)
{
int c = 0;
for (int i = 2; i < data_len; i++) {
byte b = data[i];
for (int j = i-1; j >= 0; j--) {
// A previously seen character cannot end a new longest sequence!
// It also cannot add to the sequence count for this or any
// longer string, so stop scanning!
if (data[j] == b) break;
int len = i-j+1;
// Increment the count of unique values seen when starting
// from position [j]
int unique_values = ++count[j];
if (len <= shortest[unique_values]) {
if (len >= 3)
save(unique_values, len, j);
shortest[unique_values] = len;
c++;
}
}
}
return c;
}
As Terje has pointed out a number of times, taking into account
the hit rate for L1 and L2 cache will have a significant impact
on how fast the code runs. While many of the other contributions
to this thread have ignored this aspect, a program which requires
only L1 cache for most of its data can run many times faster,
perhaps up to ten times as fast based on some of the tests I did
about a year ago.
I have now timed my code:

The full scan for potential solutions with the given 56-byte input
string took about 8K cycles, while my big-O estimate said O(n*sqrt(c)),
which in our case is 56*sqrt(10) ~= 180, i.e. each iteration took a
little over 40 cycles.

Since the number of potential solutions was quite high (40), the part of
the inner loop that saved each of them used significant amounts of time,
including a significant number of mis-predicted branches.

Still, running at 130+ cycles per input byte is fast enough to
handle significant amounts of data.

I would love to have a copy of the "problematic" 4K input which the
original poster have alluded to!

Terje

PS. The final filter operation runs in O(n*log(n)) so it should be much
faster than the scan, at least for large (n). With the 56-byte input
however, the filter took about twice as long as the scan, i.e. the
constant terms were significantly higher!

This meant that the total processing time was actually about 25K cycles,
or 400+ cycles/byte. On my laptop (2.2GHz) this corresponds to "only"
5.5 MB/s. :-(
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
prino
2010-05-21 16:34:24 UTC
Permalink
On 21 May, 18:21, Terje Mathisen <"terje.mathisen at
Post by Terje Mathisen
 > [Snip]
 > With 4K inputs total data size would be 12KB, but only 4.5 KB would
be within the working set at any given time, so L1 hit rates are very
close to 100%.
 > int scan(byte *data, int data_len)
 > {
 > int c = 0;
 > for (int i = 2; i < data_len; i++) {
 > byte b = data[i];
 > for (int j = i-1; j >= 0; j--) {
 > // A previously seen character cannot end a new longest sequence!
 > // It also cannot add to the sequence count for this or any
 > // longer string, so stop scanning!
 > if (data[j] == b) break;
 > int len = i-j+1;
 > // Increment the count of unique values seen when starting
 > // from position [j]
 > int unique_values = ++count[j];
 > if (len <= shortest[unique_values]) {
 > if (len >= 3)
 > save(unique_values, len, j);
 > shortest[unique_values] = len;
 > c++;
 > }
 > }
 > }
 > return c;
 > }
As Terje has pointed out a number of times, taking into account
the hit rate for L1 and L2 cache will have a significant impact
on how fast the code runs. While many of the other contributions
to this thread have ignored this aspect, a program which requires
only L1 cache for most of its data can run many times faster,
perhaps up to ten times as fast based on some of the tests I did
about a year ago.
The full scan for potential solutions with the given 56-byte input
string took about 8K cycles, while my big-O estimate said O(n*sqrt(c))
which in our case is O(56*sqrt(10)) ~= 180, i.e. each iteration took a
little over 40 cycles.
Since the number of potential solutions was quite high (40), the part of
the inner loop that saved each of them used significant amounts of time,
including a significant number of mis-predicted branches.
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
I would love to have a copy of the "problematic" 4K input which the
original poster have alluded to!
The current set of input data comprises "just" 2329 values, and
right now it might potentially increase to around 3K. The current set
is processed without problems; the problem arose when I started using
the code for subsets of this. The output for three of them was plain
wrong, and this could easily be verified manually - I am right now in
Lithuania and didn't really want to take it, but the shortest
erroneous input set has, IIRC, *only six or seven elements*.

As for the 4k limit, there is a chance that some users may pool their
data, and that might increase the "strings" beyond 4k values. The
values themselves will never exceed 255.

I will post the three problematic strings, and the output that results
from them, on Wednesday when I'm back in Belgium - program will be in
Pascal or PL/I, take your pick. :)
Post by Terje Mathisen
PS. The final filter operation runs in O(n*log(n)) so it should be much
faster than the scan, at least for large (n). With the 56-byte input
however, the filter took about twice as long as the scan, i.e. the
constant terms were significantly higher!
This meant that the total processing time was actually about 25K cycles,
or 400+ cycles/byte. On my laptop (2.2GHz) this corresponds to "only"
5.5 MB/s. :-(
Robert
Dick Wesseling
2010-05-22 04:13:17 UTC
Permalink
Post by Terje Mathisen
The full scan for potential solutions with the given 56-byte input
string took about 8K cycles, while my big-O estimate said O(n*sqrt(c))
which in our case is O(56*sqrt(10)) ~= 180, i.e. each iteration took a
little over 40 cycles.
That sounds reasonable. I've timed my solution also, and depending on
the CPU and compiler version and switches I get values from 7200 to
9999 cycles.
Post by Terje Mathisen
Since the number of potential solutions was quite high (40), the part of
the inner loop that saved each of them used significant amounts of time,
including a significant number of mis-predicted branches.
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
Terje Mathisen
2010-05-23 08:37:59 UTC
Permalink
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
It is worse (or better, seen from my point of view):

With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)

The problem was that with totally random data, there are a _lot_ of both
potential and real solutions: I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.

The filter process here took a _lot_ of time though: 8M cycles!

This means that it needed 2000 cycles/input byte. :-(

This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Dick Wesseling
2010-05-23 18:14:47 UTC
Permalink
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
Are you using my data set? I completed and fixed (or broke?) your
code as follows:

#define C 0x100
#define OSIZ 409600
unsigned outpos;

struct {
unsigned len;
unsigned pos;
unsigned w;
} output[OSIZ];

static
void save(unsigned l, unsigned w, unsigned pos)
{
#if 1
output[outpos].len = l ;
output[outpos].pos = pos;
output[outpos].w = w ;
#endif
outpos++;
}


int scan(byte *data, int data_len)
{
unsigned count[data_len];
int shortest[C+1];
int i;
for (i=0; i<C+1; i++) shortest[i] = 0x7FFFFFFF;
for (i=0; i<data_len; i++) count[i] = 1;

int c = 0;
for (i = 1; // not 2: else "abc" won't work


This version took about 15M cycles on my athlon, 10M cycles if I
just count the solutions without actually saving them.
Post by Terje Mathisen
The problem was that with totally random data, there are a _lot_ of both
potential and real solutions, I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.
Then we're definitely not using the same data set. When I ran your
program I got 78059 potential solutions instead. My program found
831 solutions, all of which were in your unfiltered output.
Post by Terje Mathisen
The filter process here took a _lot_ of time though: 8M cycles!
This means that it needed 2000 cycles/input byte. :-(
This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...
A slightly optimized version of my program - see below - took
23M cycles on the same data set. Since I only timed the scan phase
of your program and not the filter phase it appears our approaches
are comparable for random data, not as bad as I feared.

Here's my optimized inner loop

#if 0 // 30M cycles for 4K random data
// Note: previous "final" version posted still had a bug here

dec(data[l]); l++;
while (uniq < goal && r<lim) { r++; inc(data[r]); }
while (freq[data[l]]>1) { dec(data[l]); l++; }
#else
// 23M cycles for 4K random data
// The code generated by gcc still hurts my eyes, but this
// simplified loop is a good starting point for asm.


freq[data[l]]=0; l++; uniq--; // data[l] is unique

while (r<lim) { // optimize controlling expression of
unsigned char cr; // loop
r++;
cr = data[r];
unsigned frc = freq[cr];
freq[cr] = frc+1;
if (frc == 0) {
uniq++;
break;
}
}
while (freq[data[l]]>1) { // "uniq" does not change
freq[data[l]]--;
l++;
}
#endif

Once I get rid of the variable "uniq", which is not essential, the
remaining variables data[], freq[], l and r will fit into the x86
register set and the inner loop becomes 14 instructions per input
byte per iteration of the outermost loop.
Terje Mathisen
2010-05-23 20:11:57 UTC
Permalink
Post by Dick Wesseling
Post by Terje Mathisen
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
Are you using my data set? I completed and fixed (or broke?) your
Definitely not the same data set, I am simply using
(rand() & 63) + 33;
to generate random (but printable) bytes.

With rand & 255 I get slightly different counts.
Post by Dick Wesseling
int scan(byte *data, int data_len)
{
unsigned count[data_len];
int shortest[C+1];
int i;
for (i=0; i<C+1; i++) shortest[i] = 0x7FFFFFFF;
for (i=0; i<data_len; i++) count[i] = 1;
int c = 0;
for (i = 1; // not 2: else "abc" won't work
You're absolutely right, in order to increment the count to 3 the scan
has to start with the second char. Thanks!
Post by Dick Wesseling
Once I get rid of the variable "uniq", which is not essential, the
remaining variables data[], freq[], l and r will fit into the x86
register set and the inner loop becomes 14 instructions per input
byte per iteration of the outermost loop.
OK, I'll have to compare that to some hand-optimised asm. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Jerome H. Fine
2010-05-24 00:07:17 UTC
Permalink
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
Are you using my data set? I completed and fixed (or broke?) your
Definitely not the same data set, I am simply using
rand() & 63) + 33;
to generate random (but printable) bytes.
With rand & 255 I get slightly different counts.
Post by Dick Wesseling
int scan(byte *data, int data_len)
{
unsigned count[data_len];
int shortest[C+1];
int i;
for (i=0; i<C+1; i++) shortest[i] = 0x7FFFFFFF;
for (i=0; i<data_len; i++) count[i] = 1;
int c = 0;
for (i = 1; // not 2: else "abc" won't work
You're absolutely right, in order to increment the count to 3 the scan
has to start with the second char. Thanks!
Post by Dick Wesseling
Once I get rid of the variable "uniq", which is not essential, the
remaining variables data[], freq[], l and r will fit into the x86
register set and the inner loop becomes 14 instructions per input
byte per iteration of the outermost loop.
OK, I'll have to compare that to some hand-optimised asm. :-)
I am attempting to follow the code for this problem, however, it becomes
a bit difficult to follow as it changes so often. This is not a complaint
as I very much appreciate some real examples.

However, it would be extremely helpful if both Terje and Dick could
include all of the code and some examples, as was done the first few
times, approximately every fifth time the code changes. This will enable
the rest of us who want to follow the discussion to put it into context.
The problem is very instructive and I really enjoy the description of
what you fellows are doing.

Jerome Fine
Dick Wesseling
2010-05-24 02:04:07 UTC
Permalink
Post by Jerome H. Fine
I am attempting to follow the code for this problem, however, it becomes
a bit difficult to follow as it changes so often.
I know. Never say "final version"....
Post by Jerome H. Fine
This is not a complaint as I very much appreciate some real examples.
I'd be happy to provide some real examples. Tomorrow is a
holiday here, so I can afford to spend time on this puzzle without
feeling guilty about it. Depending on how things are going - I am
now coding the final asm version - I may even do so later tonight/today.
I don't have to bring my offspring to school, so I can stay up as late
as I want to.
Jerome H. Fine
2010-05-24 03:44:51 UTC
Permalink
Post by Dick Wesseling
Post by Jerome H. Fine
I am attempting to follow the code for this problem, however, it becomes
a bit difficult to follow as it changes so often.
I know. Never say "final version"....
Having started on an IBM 650 in SOAP around 1960, I fully realize that aspect.
Post by Dick Wesseling
Post by Jerome H. Fine
This is not a complaint as I very much appreciate some real examples.
I'd be happy to provide some real examples. Tomorrow is a
holiday here, so I can afford to spend time on this puzzle without
feeling guilty about it. Depending on how things are going - I am
now coding the final asm version - I may even do so later tonight/today.
I don't have to bring my offspring to school, so I can stay up as late
as I want to.
The C code is even more instructive in its own way as it provides an
overview that assembly code is unable to do. The current C code
along with a few examples as you did with your first post would be
VERY appreciated! It also provides a comparison with the original
and shows the steps taken to make the code faster and more efficient.

As Terje notes, the efficient use of L1 cache is able to improve the
execution speed, but is rarely considered when an algorithm is
evaluated. The C code might include a few comments as to just
how much storage is required at any point in the process. As far
as I can determine, the L1 instruction cache will not be a problem
for this algorithm. Likewise the L1 data cache if care is taken even
though the total storage requirements are much larger.

Perhaps having started with a system that also had an effective L1
cache of about 100 bytes helps to remember how to optimize the
code. I think that the total memory including the disk which also
held the code and the data was about 20 KBytes.

Jerome Fine
Dick Wesseling
2010-05-24 07:26:22 UTC
Permalink
The basic idea is to create a simple algorithm that performs
one sequential scan over the input for each sequence length.
There is no backtracking. Instead intermediate results are
stored in a table "found", and if a shorter string with
the same number of unique symbols is found the table is reset.

If all of the data fits into the L1 cache then performing
multiple passes over the data should perform well. Apart from
the input data the program needs 2 extra arrays:

- freq[C], counts the frequency of each symbol in the window
- found[?], as explained above this array contains tentative results.

freq is heavily used. It adds 1K to the amount of data that
needs to be in the L1 cache.
The cache usage of "found" is more complicated. In theory it
can become quite large, but my intuition is that it should
behave quite modestly in practice.

During each pass the following variables keep track of the state
of the sliding window:
- l, the left margin
- r, the right margin (inclusive)
- uniq, the number of unique symbols within the window.

When the right margin moves, we bring a new symbol into the window,
and freq[new_symbol] must be incremented. If the new symbol does
not yet occur within the window then "uniq" must be incremented
as well:

# define inc(c) do { if(!freq[c]) uniq++; freq[c]++; } while(0)

When the left margin moves, a symbol falls out of the window and
the opposite happens:

# define dec(c) do { freq[c]--; if(!freq[c]) uniq--; } while(0)

The macros inc() and dec() and the variable uniq are just a convenience
for bootstrapping the algorithm. With careful coding they can be
optimized away from the final version, sacrificing clarity for raw
speed.

Sliding the window needs to maintain a loop invariant. The invariant
is:
- freq[data[l]] must be 1. If the leftmost symbol occurred elsewhere
in the window then this could not possibly be the shortest string
for the current goal.
- freq[data[r]] must be 1 for the same reason.
- uniq equals the goal of the current pass. Remember that each pass
searches for one particular sequence length only.
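The three conditions above can also be written down mechanically as a
debugging aid. The checker below is a sketch that is not part of the
posted program; it merely reuses the names data, freq, l, r, uniq and
goal from the code that follows:

```c
#include <assert.h>

/* Debugging sketch (not in the posted program): assert the loop
   invariant stated above. freq[] holds per-symbol occurrence counts
   for the current window data[l..r]. */
static void check_invariant(const unsigned char *data, const unsigned *freq,
                            unsigned l, unsigned r,
                            unsigned uniq, unsigned goal)
{
    assert(freq[data[l]] == 1); /* leftmost symbol occurs once in the window */
    assert(freq[data[r]] == 1); /* rightmost symbol occurs once in the window */
    assert(uniq == goal);       /* exactly "goal" distinct symbols in window */
}
```

Sprinkling a call to this next to each TRACE() makes it easy to catch a
slide step that breaks the invariant.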

That being said, the outline of the program is:

Select a large initial window such that the loop invariant holds.
goal = number of unique symbols in this window.

do
do
slide window from left to right maintaining the invariant
while not at end of input (i.e. invariant still true)
if window size < previous
dump the found array
and update apub: the current best value for the window
if window size <= apub
add to found array
od
print results for sequence length "goal".
while goal not minimal (see below)
goal = goal -1
select an initial window such that the loop invariant holds.
od

The choice for looping from a large goal downwards is arbitrary,
it seemed the easiest way to tackle the problem. If you start
with the window being the entire input stream then you just need
to shrink the margins until freq[data[l]]=freq[data[r]]=1. Setting goal=uniq
completes the loop invariant.

The important step is:
slide window from left to right maintaining the invariant

This is done as follows:

#if 0
1) dec(data[l]); l++;
2) while (uniq < goal && r<lim) { r++; inc(data[r]); }
3) while (freq[data[l]]>1) { dec(data[l]); l++; }
#else
...

1) The window must slide by at least 1 symbol.
We know that the leftmost symbol is unique. If we increment l
the number of unique symbols in the window therefore decreases
by 1.
2) We therefore move the right margin until a new unique symbol
moves into the window. Now uniq==goal holds again.
3) However, in step 2) we may have skipped over instances of the
leftmost symbol, which is now no longer unique:

before aabcdbx
---- uniq=goal=4 aperture=4
after 1) aabcdbx
--- uniq=3 goal=4 aperture=3
after 2) aabcdbx
----- uniq=goal=4 aperture=5

now "b" is no longer unique. We must therefore move the
left margin even further:

after 3) aabcdbx
---- uniq=goal=4 aperture=4

Now the loop invariant holds again, both margins are
at a unique symbol.

This is the inner loop of the program. Since we know that
data[l] and data[r] are unique we can get rid of the inc/dec
macros and expand and optimize these three statements into:

..
#else
1) freq[data[l]]=0; l++; uniq--; // data[l] is unique

2) while (r<lim) { // optimize controlling expression of
unsigned char cr; // loop
r++;
cr = data[r];
unsigned frc = freq[cr];
freq[cr] = frc+1;
if (frc == 0) {
uniq++;
break; // See below
}
}
3) while (freq[data[l]]>1) { // no reason to update "uniq"
freq[data[l]]--;
l++;
}
#endif

This is better than the first version. Since only one unique
symbol moved out of the window in step 1) we can break out of
step 2) as soon as we find the first new symbol, i.e. a symbol
whose frequency was zero prior to incrementing.
We can still improve upon this version. The variable uniq is
incremented exactly once and decremented exactly once. We therefore
need not change it at all. (If we hit the end of data with r==lim
then "uniq" no longer serves a purpose).

For the purpose of designing an x86 asm program we now have everything
in place; the inner loop now uses 4 variables: l, r, freq[] and data[].
This will fit into the x86 register set, leaving 2 or 3 other registers
for holding intermediate results. Now we're all set to try and beat
the C compiler, which is harder than it seems.

Next we need to implement the remainder of the algorithm outline above:

print results for sequence length "goal".
while goal not minimal (see below)
goal = goal -1
select an initial window such that the loop invariant holds.
od

printing the results is straightforward. Initializing the window
for the next iteration of the main loop is similar to sliding the
window: initialize l to 0, move r to the right until we have
enough unique symbols, and move l in case the leftmost symbol
is no longer unique:

// Now the next shorter sequence.

goal--;

1) inc(data[l]);
2) do {
r++;
inc(data[r]);
} while (uniq != goal);

3) while (freq[data[l]] != 1) {
dec(data[l]);
l++;
}

Using the same input data as in the previous example:

after 1) aabcdbx
- uniq=1 goal=3 aperture=1
after 2) aabcdbx
---- uniq=goal=3 aperture=4
after 3) aabcdbx
--- uniq=goal=3 aperture=3

Again, this loop can be heavily optimized to get rid of inc/dec and
uniq, but I'm not doing that in the C prototype; after all, this is
not the inner loop.



So far so good, but the reader may ask:

How about the requirement that no sequences that are
substrings of longer unique sequences must be printed?

Good question! This is another reason why looping downwards seemed
the natural thing to do. Let me quote from my previous posting:

<EUREKA>
Post by Robert AH Prins
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
-------------- -------
4-in-14 7-in-7


I did not fully understand why 12-25 was to be rejected, after all
"abcd" is not a substring of "gihbfja".
Now I understand. 7-in-7 contains substrings 6-in-6, 5-in-5, 4-in-4
and so on. These are not to be printed because they are proper
substrings of "gihbfja". However, they define "as short as possible"
for lengths <= 7.

IN OTHER WORDS, ONCE YOU KNOW THE LONGEST N-IN-N SEQUENCE ALL SHORTER
SEQUENCES MUST CONSIST ENTIRELY OF UNIQUE SYMBOLS.

This suggests two different search strategies, one for long sequences and
one for short sequences.

</EUREKA>

In other words, as long as you do not have an n-in-n sequence yet you
need not test for substrings, and once you have an n-in-n sequence
you can switch to a different algorithm that needs to handle n-in-n
sequences only, which is much easier than the general n-in-m loop.

Enter plan B. Plan B performs ONE scan of the input to find the
remaining n-in-n sequences. Again a sliding window is used, but
this time the invariant is simply:

- All symbols in the window must be unique.

This is true for l=r. It remains true as long as the symbol just
beyond the right margin does not occur within the window.
Otherwise print the current window if >= 3, and move l until
all symbols are unique again.

1) // l=r=0

while (r<lim) {
unsigned char cl, cr;
cl = data[l];
cr = data[r+1]; // Look ahead

if (freq[cr] != 0) { // Maximal sequence?
if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
do { // Move left edge until
cl = data[l]; // we have all symbols unique.
freq[cl] = 0;
2) l++;
} while (cl != cr);
}
3) r++;
freq[cr]=1;
}

Example:
after 1) abcdabcd
-
after 3) abcdabcd
--
after 3) abcdabcd
--
after 3) abcdabcd
---
after 3) abcdabcd
----
Look ahead, print "abcd" & proceed:

after 2) abcdabcd
----

Another example:
after 1) abcdca
-
after 3) abcdca
--
after 3) abcdca
---
after 3) abcdca
----

Look ahead, print "abcd" & proceed:

after 2) abcdca
--

As for the asm, this C program translates into +- 240 x86 instructions.
The inner loop is 15 instructions. Unfortunately most of these are
tightly interlocked, and there are 6 memory references. The L1
hit rate should be high, but there's not much room for improvement.
Processing the sample went from 9000 to 7000 cycles, but for random
data the gain went down to 2% compared to gcc's code on one CPU,
and -2% on another.
I may play with the asm code some time later, but for now I think
it is not worth bothering. Perhaps Terje is faring better than I.

Here is the next final final version. A TRACE macro is provided.
If you don't have an ansi c++ compiler handy then a trace for the
sample data can be found at http://www.fi.uu.nl/~ftu/trace.txt




-----------------------------------------------------------------------

#include <portos/profile.h>
/*
Here is the final version of my solution.

Let: N number of cells, size of the input
C number of values, size of symbol set.
M the number of distinct values in the largest sequence.
T the size of the largest n-in-n sequence, i.e. the
largest sequence consisting of distinct values only.

The algorithm uses two different search strategies:

Plan A

Find the largest sequence with size M

for n from M downwards search n-in-m sequences, keeping only sequences
with minimal value of m.
Print sequence if n<m. (If n=m the same sequence will be found again
in the next step).

Now n=m=T.

Plan B
Print all maximal n-in-n sequences.

Plan A scans the input 1+(M-T) .. 2+(M-T) times, depending on the
input position of the longest sequence.
Plan B scans the input once.

The running time is therefore bounded by O(N * (3+(M-T))), which
is typically better than or equal to O(N*C).
For small values of N the cost of erasing the frequency array should
also be accounted for.

Two arrays are used:
found[N] Intermediate results of plan A.
freq[C] Frequency count for all symbols in the sliding window


Finding sequences is done using a straightforward sliding window
scan of the input. The values in freq[] are updated when the window
edges move and the number of unique symbols is updated when a
frequency count changes between zero and non-zero.


The output is unsorted.


*/

#include <stdio.h>
#include <string.h>

#define C 256 // Size of alphabet
//#define ONE 1 // Pascal array bias convention
#define ONE 0 // C array bias convention

struct {
unsigned len;
unsigned pos;
unsigned w;
} output[4096];
unsigned outpos;

static
void printres (unsigned char *data, unsigned pos, unsigned l, unsigned w)
{
#if 0
unsigned i;
printf("len %2d aperture %2d pos %2d\t", l, w, ONE+pos);
for (i=pos; i<pos+w; i++) printf("%c", data[i]);
printf("\n");
#else
output[outpos].len = l ;
output[outpos].pos = pos;
output[outpos].w = w ;
outpos++;
#endif
}

#if 1
void trace(char *comment, unsigned char *data,
unsigned l, unsigned r, unsigned uniq, unsigned *freq)
{
unsigned i;
printf("%s: %d-%d aperture %d unique %d\n", comment, l, r, (r-l)+1, uniq);
printf("%s\t", data);
for (i=0; i<C; i++) {
if (freq[i]) {
printf("%c=%d ", i, freq[i]);
}
}
printf("\n");

for (i=0; i<l; i++) printf(" ");
for ( ; i<=r; i++) printf("-");
printf("\n");
}
#define TRACE(x) trace((x), data, l, r, uniq, freq)
#else
#define TRACE(x)
#endif

void findseqs (unsigned char *data, unsigned len)
{
// printf("%s\n", data);
outpos=0;
unsigned i;
unsigned uniq = 0; // Nr unique symbols in window
unsigned apub = ~0; // Aperture upper bound (aka "m")

unsigned nrfound = /* to keep gcc happy */ 0;
unsigned found[len]; // intermediate results

unsigned freq[C]; // Frequency count
bzero(freq, sizeof(freq));

# define inc(c) do { if(!freq[c]) uniq++; freq[c]++; } while(0)
# define dec(c) do { freq[c]--; if(!freq[c]) uniq--; } while(0)
# define WINSIZE ((r-l)+1)

// Start with largest possible sequence: the entire input string

unsigned l = 0; // Start of window
unsigned r = len-1; // Inclusive end of window
unsigned lim = len-1; // Inclusive end of data

for (i=0; i<len; i++) {
inc(data[i]);
}
if (uniq < 3) return;

// Trim left & right edges

TRACE("Entire input");
while (freq[data[r]]>1) { dec(data[r]); r--; }
TRACE("Initial trim r");
while (freq[data[l]]>1) { dec(data[l]); l++; }
TRACE("Initial trim l");

// From here on we want to find only sequences that are shorter or
// equal to what we have found above:

unsigned goal = uniq;

while (1) {

// data[l..r] contains "goal" unique symbols
// freq[] number of occurrences of each symbol in data[l..r]
// data[l] is a unique symbol
// data[r] is a unique symbol

TRACE("Main loop");

if (WINSIZE <= apub) { // Short enough?

// A candidate. May be premature if we have shorter sequences
// with the same nr of unique symbols.

if (WINSIZE < apub) {
nrfound = 0; // Dump old candidates
apub = WINSIZE; // Next matches must be <= this one
}
found[nrfound] = l; // Add to list of tentative results
nrfound++;
}

// Slide window to next match.

#if 0 // 30M cycles for 4K random data
// Note: previous "final" version posted still had a bug here

dec(data[l]); l++;
while (uniq < goal && r<lim) { r++; inc(data[r]); }
while (freq[data[l]]>1) { dec(data[l]); l++; }
#else
// 23M cycles for 4K random data
// The code generated by gcc still hurts my eyes, but this
// simplified loop is a good starting point for asm.


freq[data[l]]=0; l++; uniq--; // data[l] is unique
TRACE("After slide step 1");


while (r<lim) { // optimize controlling expression of
unsigned char cr; // loop
r++;
cr = data[r];
unsigned frc = freq[cr];
freq[cr] = frc+1;
if (frc == 0) {
uniq++;
break;
}
}
TRACE("After slide step 2");
while (freq[data[l]]>1) { // no reason to update "uniq"
freq[data[l]]--;
l++;
}
TRACE("After slide step 3");

#endif

if (uniq == goal) continue;

// We've hit the right edge

bzero(freq, sizeof(freq)); // Reset window & statistics
uniq = l = r = 0;

if (apub == goal) break; // If *only* distinct values proceed
// with plan B.

for (i=0; i<nrfound; i++) { // Print the results
printres(data, found[i], goal, apub);
}

// Now the next shorter sequence.

goal--;

inc(data[l]);
do {
r++;
inc(data[r]);
} while (uniq != goal);
TRACE("Find shorter, after slideR");

while (freq[data[l]] != 1) {
dec(data[l]);
l++;
}
TRACE("Find shorter, after slideL");

// Now the precondition holds. Also, the window size is smaller
// than the aperture upper bound from the previous iteration.
}

// Plan B
// Now we consider only sequences with all symbols distinct.
// The freq[] array therefore behaves like a bitmap with values 0&1 only.

freq[data[l]] = 1;

while (r<lim) {
unsigned char cl, cr;
cl = data[l];
cr = data[r+1]; // Look ahead

if (freq[cr] != 0) { // Maximal sequence?
if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
do { // Move left edge until
cl = data[l]; // we have all symbols unique.
freq[cl] = 0;
l++;
} while (cl != cr);
}
r++;
freq[cr]=1;
}

// Final sequence.

if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
return;
}

void printresults() {
// .. qsort "output" if desired
unsigned i;
for (i=0; i<outpos; i++) {
printf("len %3d aperture %3d pos %3d\t",
output[i].len,
output[i].w,
ONE+ output[i].pos);
printf("\n");
}
}

int main(int argc, char **argv)
{
if (argc > 2) {
fprintf(stderr, "Usage: %s {string}\n", argv[0]);
return 1;
}
if (argc == 1) {
// read from standard input
unsigned char buf[4096];
size_t inlen = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
findseqs( buf, (unsigned) inlen);
}
if (argc == 2) {
size_t l = strlen(argv[1]);
findseqs( (unsigned char *)argv[1], l );
}
//printresults();
return 0;
}
Jerome H. Fine
2010-05-24 10:42:08 UTC
Permalink
Post by Dick Wesseling
The basic idea is to create a simple algorithm that performs
one sequential scan over the input for each sequence length.
There is no backtracking. Instead intermediate results are
stored in a table "found", and if a shorter string with
the same number of unique symbols is found the table is reset.
If all of the data fits into the L1 cache then performing
multiple passes over the data should perform well. Apart from
[Snip]
I just want to say THANK YOU to both Dick and Terje for making the
complete program (the current version that is) available. The discussion
of L1 cache usage also provides valuable insight to what your concepts
are and how they impact the code. While the percentage of individuals
who actually consider cache usage when an algorithm is being designed
is certainly higher than usual with the followers of newsgroups, I doubt
that it is always a high priority. It would be interesting to develop some
tools to measure hit rates. In that regard, could Intel and AMD add
any microcode which would help the user? Does anyone have any idea
how much extra time (if any) it might take to maintain a hit rate register
for cache hits and misses as part of the microcode for the CPU, just as
it is possible to obtain the number of cycles?

Just a thought, but nothing gets done if the questions are not asked.

Jerome Fine
Terje Mathisen
2010-05-24 13:48:05 UTC
Permalink
I just want to say THANK YOU to both Dick and Terje for making the
complete program (the current version that is) available. The discussion
of L1 cache usage also provides valuable insight to what your concepts
are and how they impact the code. While the percentage of individuals
who actually consider cache usage when an algorithm is being designed
is certainly higher than usual with the followers of newsgroups, I doubt
that it is always a high priority. It would be interesting to develop some
tools to measure hit rates. In that regard, could Intel and AMD add
any microcode which would help the user? Does anyone have any idea
how much extra time (if any) it might take to maintain a hit rate register
for cache hits and misses as part of the microcode for the CPU, just as
it is possible to obtain the number of cycles?
This has existed since the original Pentium, in the form of the EMON
counters!

These were "secret" back then, but I reverse engineered them and wrote a
BYTE article which was published in the summer of 1994. On all later x86
cpu models the EMON counters, which are different for pretty much every
cpu generation, have been documented in the cpu manuals.

And Yes, L1 hit/miss rates are one of the most basic counters, included
on every single version since the original one. :-)

Today the easiest way to access them under Win* is to use Intel's VTune,
but there are similar (and free!) tools available under Linux.
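(As an aside, not from the thread: one of the free Linux tools alluded
to here is presumably perf, in mainline kernels since 2.6.31, which
exposes the L1 hit/miss counters through generalized event names. A
quick check might look like the following, assuming the findseqs
program above has been built as ./findseqs:)

```shell
# Count L1 data-cache loads and misses for one run of the program.
# Event availability and the underlying EMON encodings vary per CPU.
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./findseqs < random.bin
```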

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
glen herrmannsfeldt
2010-05-24 16:12:30 UTC
Permalink
In comp.lang.pl1 Jerome H. Fine <***@nospicedham.dp3knospamcompsys.to.dotsrc.org> wrote:
(snip)
Post by Jerome H. Fine
I just want to say THANK YOU to both Dick and Terje for making the
complete program (the current version that is) available. The discussion
of L1 cache usage also provides valuable insight to what your concepts
are and how they impact the code. While the percentage of individuals
who actually consider cache usage when an algorithm is being designed
is certainly higher than usual with the followers of newsgroups, I doubt
that it is always a high priority.
The only place I know it commonly done is matrix multiply and
matrix transpose. In both cases, one matrix is traversed in the
"wrong" direction (for the cache), and so the effect is easily seen.
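To make the effect concrete, here is a small sketch (my own, not from
the thread) of the two classic loop orders for matrix multiply. In the
i-j-k order, b[k][j] is walked down a column, touching a new cache line
on almost every access; swapping to i-k-j makes every inner-loop access
stride-1. Both orders add the k-terms in the same sequence, so the
results are bitwise identical:

```c
#include <stddef.h>
#include <assert.h>

enum { N = 64 }; /* small here; the cache effect grows with N */

/* i-j-k order: b[k][j] is traversed column-wise ("wrong" for the cache). */
void mul_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];  /* column walk through b */
            c[i][j] = sum;
        }
}

/* i-k-j order: the inner loop reads b[k][j] and writes c[i][j] stride-1. */
void mul_ikj(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++) {
            double aik = a[i][k];          /* invariant for inner loop */
            for (size_t j = 0; j < N; j++)
                c[i][j] += aik * b[k][j];  /* row walk through b and c */
        }
}
```

With N in the thousands the i-k-j version is often several times faster
on cached machines, purely from the traversal direction.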
Post by Jerome H. Fine
It would be interesting to develop some
tools to measure hit rates. In that regard, could Intel and AMD add
any microcode which would help the user? Does anyone have any idea
how much extra time (if any) it might take to maintain a hit rate register
for cache hits and misses as part of the microcode for the CPU, just as
it is possible to obtain the number of cycles?
-- glen
Terje Mathisen
2010-05-24 06:09:43 UTC
Permalink
Post by Jerome H. Fine
I am attempting to follow the code for this problem, however, it
becomes a bit difficult to follow as it changes so often. This is not
a complaint as I very much appreciate some real examples.
OK, here is my complete program, as it currently exists:
(I finally realized that my potential solution can explode when working
with random data, so I have increased the buffer to store them to the
maximum possible count!)

// unique-strings.cpp : Defines the entry point for the console application.
//

//#include "stdafx.h"
#include <stdio.h>
#include <tchar.h>
#include <stdlib.h>

typedef unsigned char byte;

#define MAX_DATA 4096
int data_len;

// Data to be scanned
byte data[MAX_DATA+1];

// Remember the shortest sequence with n unique symbols
short shortest[256]; // Alphabet size entries
// Running count of unique symbols seen, from [n] to current pos
short count[MAX_DATA];

typedef struct _solution_t {
int uniq;
int len;
int start;
} solution_t;

solution_t solution[MAX_DATA*256]; // Maximum possible count of possible solutions
int solutions = 0;

void init(void)
{
for (int i = 0; i < 256; i++) // Note: shortest[] has only 256 entries
shortest[i] = 32767;

for (int i = 0; i < MAX_DATA; i++)
count[i] = 1;

for (int i = 0; i < MAX_DATA; i++)
data[i] = (rand() & 63) + 33;
data_len = MAX_DATA;

// Here are a couple of small sets with known solutions:
//memcpy(data, "aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja", 56);
//data_len = 56;
//memcpy(data, "abc", 3);
//data_len = 3;
}

void save(int unique, int len, int start)
{
solution[solutions].uniq = unique;
solution[solutions].len = len;
solution[solutions].start = start;
solutions++;
}

int solsort(const void *a, const void *b)
{
solution_t *as = (solution_t *) a;
solution_t *bs = (solution_t *) b;
int u = bs->uniq - as->uniq; // Return longest unique sets first
if (!u) {
u = as->len - bs->len; // Return shortest string first for each set length
if (!u)
u = as->start - bs->start; // Finally order by starting position
}
return u;
}

int filter(void)
{
int prev_uniq = 0x7fffffff; // Max 32-bit int
int prev_len = 0x7fffffff; // Max 32-bit int

// First get rid of all non-optimal strings for each unique set length:
// Sort reverse by set length, forward by string length, then by starting position
qsort(solution, solutions, sizeof(solution[0]), solsort);

int sols = 0;
for (int s = 0; s < solutions; s++) {
if (solution[s].uniq == prev_uniq) {
if (solution[s].len > prev_len) { // Filter this entry away!
solution[s].uniq = 0; // Mark as removed
}
else { // Keep this, move down:
solution[sols++] = solution[s];
}
}
else { // (solution[s].uniq < prev_uniq) This is the first entry for a shorter set:
prev_uniq = solution[s].uniq;
prev_len = solution[s].len;
solution[sols++] = solution[s];
}
}
solutions = sols;

// For each remaining solution, check if any other overlap-1:

for (int s = 0; s < solutions; s++) {
for (int k = s+1; k < solutions; k++) {
if ((solution[s].len == solution[k].len+1) &&
((solution[s].start == solution[k].start) ||
(solution[s].start+1 == solution[k].start)))
solution[k].uniq = 0; // Mark as removed
if (solution[s].len > solution[k].len+1) break;
}
}
// Compact the final solution set:
sols = 0;
for (int s = 0; s < solutions; s++) {
if (solution[s].uniq)
solution[sols++] = solution[s];
}
solutions = sols;

return solutions;
}

int scan(byte *data, int data_len)
{
int c = 0; // # of candidate solutions

// The first 3-byte solution cannot end before the third entry, but
// the loop below starts at i = 1 so that count[j] stays up to date
// for every starting position j:
byte b = data[i];
for (int j = i-1; j >= 0; j--) {
// A previously seen character cannot end a new longest sequence!
// It also cannot add to the sequence count for this or any
// longer string, so stop scanning!
if (data[j] == b) break;

// Increment the count of unique values seen when starting from
position [j]
int unique_values = ++count[j];
if (unique_values >= 3) {
int len = i-j+1;
if (len <= shortest[unique_values]) {
save(unique_values, len, j);
shortest[unique_values] = len;
c++;
}
}
}
}
return c;
}

typedef __int64 int64_t;

int64_t rdtsc(void)
{
__asm rdtsc
}

int _tmain(int argc, _TCHAR* argv[])
{
init();
int64_t t0 = rdtsc();
int pot = scan(data, data_len);
int64_t t1 = rdtsc();
int cnt = filter();
int64_t t2 = rdtsc();
// Print the final solutions in opposite order, from shortest to longest:
printf("len aperture from to value\n");
for (int s = cnt-1; s >= 0; s--) {
int start = solution[s].start;
int len = solution[s].len;
printf("%3d %6d %5d %3d ", solution[s].uniq, len, start+1, start+len);
if (len > 20) len = 20;
for (int s = start; s < start+len; s++) {
byte b = data[s];
if ((b >= ' ') && (b < 127))
printf("%c", b);
else
printf("\\%03o", b);
}
printf("\n");
}

printf("Time to scan: %12I64d (%d potential solutions)\nTime to filter: %10I64d (%d filtered solutions)\n",
t1-t0, pot, t2-t1, cnt);

return 0;
}
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-25 23:04:58 UTC
Permalink
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
The problem was that with totally random data, there are a _lot_ of both
potential and real solutions, I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.
The filter process here took a _lot_ of time though: 8M cycles!
This means that it needed 2000 cycles/input byte. :-(
This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...
The data is in fact not very random, but you'll be hard pressed to
figure out what it represents. Here is the current full set, represented
by 66

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-/*

printable chars:

aaabbbbbbbccccadddddddddddddddddddddddddddddddeaaaaaaaabbbcacddddeeeee
eeeebbeebeeeeeefeeegggdggggggggggggggddddddddddddddddccbbbbhbbbbibjjjj
kbbabaaaaaacbcccaddddddcccbaaaaaaaabbbbablmkkkbnnnnnnonnnnbkkkkkkllibb
ibaaaaaaapbbbbccddadddaddddgccccccccccaaaaaaaaaabbbbbabbbdbdddddddcccc
ccccddbbbbbbqlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhln
nnnsinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucg
gdbhhhhhhhhhhhhhhhhohhhhhhhhrhhhhhhhhhhhhbhhhinjnnnnaaaaaaaiiiiiioojoa
aaaooooojjjjiiiiiioviiiaaaaaabbibrrrrrrrrrrrrrrrnnnnnbrrrrbbaaaiiiiiii
iiiiiiiiiiiiiiiiiiwiiiiiiiiixyiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
iihiiiiiiziiiiiiiiiiiooaaaaaaabbbbfbbbrrrrgginnralarabbaaaiiiiiiiiAiii
iiibiiiiiiiiiiiiiiiiiiiiiiiibiiiiiiiiiiiiiiiiiiiiiiiiiiabbbblbrrairrbr
lnnnjnirrrblbbjaaaaabbbbbaiiiiiiiiiiiiiiiBiiiiiiiiiiriiiiiiiiiiiiiiiii
iiiiiiiiiiiiiiiiiiiiiiiCiiiiiiiiiiiiiiiriiiiiiiiiiiiibiiiiiiiiiiiiiiii
aaaDbbbbbbbrrriaiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiriiiiiii
iiiiiiiiiiiiiiiiiiiiiiiiiiiiviiiiiiiiiiiiiiiiiiiiiiabhbiblelrrribnbbbo
rrrrabEbbbadcdddddbbbbbbbbBbBbbbaaaaaaiiiiiiijiiiioiiiiaaFahbbbbbbbbbb
bbbbbbbbbbbbbbbGbbbbbbbbbbrrrrrnlrHcbbrbbbaiiiiiiIiiiicoaaaaabbbbbibab
bbbybbbabbbbbbooooaiiiiiiiiiiiiiiiiiiaaaabalbbbbbbblbnrrrrcbbbaiiiiixi
ziiiiiiiiaaaaaabbbabrJbinnaiiiiaaaaaabbGJKLLLLLMLdddddddaaaaaaioiiiiii
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiNiiiiiiiiiiiiiifiiibbhcbbacccbccO
bbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgdd
dgSdddddjcccccccccjaaaaaaafabbNGGLLLLLLLLGGGrbbbaoiiiiiiiiiiiiiiiaaiii
iibbbbbGGizzzziiiTiiiiijbNbLaGGGGKLLLLLLLLLLDDDDDDDDDDDDDDDDaaaaaabbbG
GGGGGGGLLLLGbaaaaaoooobiaiaoiiiiiiiiiiibaMbbUbGGGLLLLLLGGGGGGNbbooiiii
iiiiiiijibmabbGLLLLLLVbbbioiiiiiiiiiiiiiiiiiiiWiiiiiiXiaaaoooiaiiiiiii
iaaaaaGaVaYGZLLLLLLL0KGGGGGGGGbbooooiaiiiiii1iooaabbGaGGGGGGLLLLLLLLbG
GGGGGbaaaaajwiiiiiiiiiihaoGCGLLLLLLLLLLKbaoiiiiiiioabGLLLLLLLLEGGGbbob
ooon2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaa
aaGGLLLLLLLLGGGGGGGGVGGVGbbbaaNbooiaiiiii0ooaYLLLLKGGGGGGLihiiiiijjoog
5zabbbbGLLLiiiiiiiiziiiiiiooaaaabbbGbooboooommmmoooooooaaoaoiooowooabm
mmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Yoooooo
iGbbbbbaaaaao+oobGbbbLGLLLLLbbbjiiooobbGbbmo-ooaoJbbbbVbbbbbb/baoooooo
o*bbbbbbbbbbbbbbbbbbaotoommooojaoooo-ooahbbbbbbbaaa3ooooobbllbbrbbbbb*
bmaoooooobobbbbbooo (2329 characters)

The top-3 account for 58.3 % of all values, the top-10 for 85.2%.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-26 08:43:31 UTC
Permalink
Post by Robert AH Prins
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
The problem was that with totally random data, there are a _lot_ of both
potential and real solutions, I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.
The filter process here took a _lot_ of time though: 8M cycles!
This means that it needed 2000 cycles/input byte. :-(
This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...
The data is in fact not very random, but you'll be hard pressed to
figure out what it represents. Here is the current full set, represented
by these 66 characters:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-/*
aaabbbbbbbccccadddddddddddddddddddddddddddddddeaaaaaaaabbbcacddddeeeee
eeeebbeebeeeeeefeeegggdggggggggggggggddddddddddddddddccbbbbhbbbbibjjjj
kbbabaaaaaacbcccaddddddcccbaaaaaaaabbbbablmkkkbnnnnnnonnnnbkkkkkkllibb
ibaaaaaaapbbbbccddadddaddddgccccccccccaaaaaaaaaabbbbbabbbdbdddddddcccc
ccccddbbbbbbqlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhln
nnnsinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucg
gdbhhhhhhhhhhhhhhhhohhhhhhhhrhhhhhhhhhhhhbhhhinjnnnnaaaaaaaiiiiiioojoa
aaaooooojjjjiiiiiioviiiaaaaaabbibrrrrrrrrrrrrrrrnnnnnbrrrrbbaaaiiiiiii
iiiiiiiiiiiiiiiiiiwiiiiiiiiixyiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
iihiiiiiiziiiiiiiiiiiooaaaaaaabbbbfbbbrrrrgginnralarabbaaaiiiiiiiiAiii
iiibiiiiiiiiiiiiiiiiiiiiiiiibiiiiiiiiiiiiiiiiiiiiiiiiiiabbbblbrrairrbr
lnnnjnirrrblbbjaaaaabbbbbaiiiiiiiiiiiiiiiBiiiiiiiiiiriiiiiiiiiiiiiiiii
iiiiiiiiiiiiiiiiiiiiiiiCiiiiiiiiiiiiiiiriiiiiiiiiiiiibiiiiiiiiiiiiiiii
aaaDbbbbbbbrrriaiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiriiiiiii
iiiiiiiiiiiiiiiiiiiiiiiiiiiiviiiiiiiiiiiiiiiiiiiiiiabhbiblelrrribnbbbo
rrrrabEbbbadcdddddbbbbbbbbBbBbbbaaaaaaiiiiiiijiiiioiiiiaaFahbbbbbbbbbb
bbbbbbbbbbbbbbbGbbbbbbbbbbrrrrrnlrHcbbrbbbaiiiiiiIiiiicoaaaaabbbbbibab
bbbybbbabbbbbbooooaiiiiiiiiiiiiiiiiiiaaaabalbbbbbbblbnrrrrcbbbaiiiiixi
ziiiiiiiiaaaaaabbbabrJbinnaiiiiaaaaaabbGJKLLLLLMLdddddddaaaaaaioiiiiii
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiNiiiiiiiiiiiiiifiiibbhcbbacccbccO
bbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgdd
dgSdddddjcccccccccjaaaaaaafabbNGGLLLLLLLLGGGrbbbaoiiiiiiiiiiiiiiiaaiii
iibbbbbGGizzzziiiTiiiiijbNbLaGGGGKLLLLLLLLLLDDDDDDDDDDDDDDDDaaaaaabbbG
GGGGGGGLLLLGbaaaaaoooobiaiaoiiiiiiiiiiibaMbbUbGGGLLLLLLGGGGGGNbbooiiii
iiiiiiijibmabbGLLLLLLVbbbioiiiiiiiiiiiiiiiiiiiWiiiiiiXiaaaoooiaiiiiiii
iaaaaaGaVaYGZLLLLLLL0KGGGGGGGGbbooooiaiiiiii1iooaabbGaGGGGGGLLLLLLLLbG
GGGGGbaaaaajwiiiiiiiiiihaoGCGLLLLLLLLLLKbaoiiiiiiioabGLLLLLLLLEGGGbbob
ooon2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaa
aaGGLLLLLLLLGGGGGGGGVGGVGbbbaaNbooiaiiiii0ooaYLLLLKGGGGGGLihiiiiijjoog
5zabbbbGLLLiiiiiiiiziiiiiiooaaaabbbGbooboooommmmoooooooaaoaoiooowooabm
mmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Yoooooo
iGbbbbbaaaaao+oobGbbbLGLLLLLbbbjiiooobbGbbmo-ooaoJbbbbVbbbbbb/baoooooo
o*bbbbbbbbbbbbbbbbbbaotoommooojaoooo-ooahbbbbbbbaaa3ooooobbllbbrbbbbb*
bmaoooooobobbbbbooo (2329 characters)
The top-3 account for 58.3% of all values, the top-10 for 85.2%.
Here are my results, with the longer result strings truncated:
len aperture from to value
3 3 2291 2293 a3o
3 3 2271 2273 jao
3 3 2270 2272 oja
3 3 2241 2243 o*b
3 3 2201 2203 bji
3 3 2191 2193 bLG
3 3 2186 2188 obG
3 3 2182 2184 ao+
3 3 2159 2161 oMm
3 3 2158 2160 moM
3 3 2117 2119 IjL
3 3 2089 2091 aoi
3 3 2066 2068 Gbo
3 3 2037 2039 bGL
3 3 2010 2012 LKG
3 3 2001 2003 i0o
3 3 1994 1996 oia
3 3 1984 1986 VGb
3 3 1955 1957 iba
3 3 1939 1941 oai
3 3 1935 1937 4ba
3 3 1915 1917 brH
3 3 1906 1908 oVb
3 3 1903 1905 jio
3 3 1882 1884 LEG
3 3 1848 1850 CGL
3 3 1825 1827 Gba
3 3 1818 1820 LbG
3 3 1802 1804 bGa
3 3 1795 1797 1io
3 3 1786 1788 oia
3 3 1757 1759 GaV
3 3 1741 1743 oia
3 3 1734 1736 Xia
3 3 1705 1707 bio
3 3 1701 1703 LVb
3 3 1694 1696 bGL
3 3 1691 1693 mab
3 3 1671 1673 GNb
3 3 1655 1657 UbG
3 3 1651 1653 aMb
3 3 1637 1639 aoi
3 3 1636 1638 iao
3 3 1573 1575 GKL
3 3 1549 1551 Giz
3 3 1514 1516 Grb
3 3 1500 1502 bNG
3 3 1497 1499 fab
3 3 1488 1490 cja
3 3 1478 1480 djc
3 3 1472 1474 gSd
3 3 1471 1473 dgS
3 3 1455 1457 biG
3 3 1437 1439 bao
3 3 1399 1401 cOb
3 3 1392 1394 bac
3 3 1389 1391 hcb
3 3 1388 1390 bhc
3 3 1322 1324 aio
3 3 1308 1310 MLd
3 3 1286 1288 nai
3 3 1259 1261 xiz
3 3 1252 1254 bai
3 3 1248 1250 rcb
3 3 1233 1235 alb
3 3 1232 1234 bal
3 3 1208 1210 oai
3 3 1187 1189 iba
3 3 1162 1164 bai
3 3 1151 1153 rnl
3 3 1049 1051 bor
3 3 1039 1041 elr
3 3 1034 1036 hbi
3 3 924 926 ria
3 3 913 915 aDb
3 3 795 797 bai
3 3 784 786 bja
3 3 780 782 rbl
3 3 765 767 air
3 3 764 766 rai
3 3 761 763 lbr
3 3 755 757 iab
3 3 682 684 rab
3 3 680 682 lar
3 3 674 676 gin
3 3 589 591 xyi
3 3 588 590 ixy
3 3 543 545 nbr
3 3 522 524 ibr
3 3 509 511 ovi
3 3 508 510 iov
3 3 488 490 joa
3 3 400 402 knb
3 3 377 379 gda
3 3 354 356 sin
3 3 353 355 nsi
3 3 348 350 hln
3 3 341 343 alb
3 3 318 320 lkn
3 3 309 311 abp
3 3 296 298 lnh
3 3 292 294 bql
3 3 237 239 dgc
3 3 219 221 apb
3 3 211 213 iba
3 3 207 209 lib
3 3 198 200 nbk
3 3 186 188 kbn
3 3 166 168 cba
3 3 156 158 cad
3 3 151 153 acb
3 3 140 142 jkb
3 3 135 137 ibj
3 3 60 62 acd
3 3 58 60 bca
3 3 46 48 dea
3 3 14 16 cad
4 4 2279 2282 oahb
4 4 2260 2263 baot
4 4 2232 2235 /bao
4 4 2218 2221 aoJb
4 4 2212 2215 bmo-
4 4 2170 2173 oiGb
4 4 2162 2165 m9Yo
4 4 2149 2152 Fb0o
4 4 2121 2124 oC7j
4 4 2112 2115 BI6j
4 4 2097 2100 oabm
4 4 2017 2020 GLih
4 4 2004 2007 oaYL
4 4 1990 1993 aNbo
4 4 1909 1912 Vbr3
4 4 1893 1896 on2i
4 4 1831 1834 ajwi
4 4 1770 1773 L0KG
4 4 1649 1652 ibaM
4 4 1632 1635 obia
4 4 1621 1624 LGba
4 4 1563 1566 ijbN
4 4 1518 1521 baoi
4 4 1446 1449 iFra
4 4 1438 1441 aobd
4 4 1279 1282 abrJ
4 4 1242 1245 lbnr
4 4 1174 1177 icoa
4 4 1108 1111 Fahb
4 4 1060 1063 badc
4 4 1054 1057 rabE
4 4 1043 1046 ribn
4 4 1036 1039 ible
4 4 1031 1034 iabh
4 4 775 778 jnir
4 4 769 772 brln
4 4 677 680 nral
4 4 465 468 hinj
4 4 421 424 gdbh
4 4 417 420 bucg
4 4 374 377 cbdg
4 4 363 366 hpbl
4 4 336 339 bnka
5 5 2310 2314 *bmao
5 5 2145 2149 0G8bF
5 5 2125 2129 jbmoa
5 5 2119 2123 LjoC7
5 5 1920 1924 alhGb
5 5 1688 1692 jibma
5 5 1566 1570 NbLaG
5 5 1465 1469 LMKgd
5 5 1443 1447 bRriF
5 5 1406 1410 PQBub
5 5 1405 1409 bPQBu
5 5 1299 1303 bGJKL
5 5 1281 1285 rJbin
5 5 389 393 fbtkn
5 5 180 184 ablmk
6 6 2029 2034 og5zab
6 6 1918 1923 HbalhG
6 6 1870 1875 ioabGL
6 6 1859 1864 LKbaoi
6 6 1843 1848 ihaoGC
6 6 1759 1764 VaYGZL
6 6 1447 1452 Frainb
6 6 1152 1157 nlrHcb
7 7 302 308 nrhblka
8 10 2025 2034 ijjoog5zab
8 10 1438 1447 aobdbbRriF
8 10 302 311 nrhblkaabp
9 13 2117 2129 IjLjoC7jjbmoa
9 13 2027 2039 joog5zabbbbGL
9 13 1439 1451 obdbbRriFrain
10 15 2025 2039 ijjoog5zabbbbGL
11 18 2112 2129 BI6jjIjLjoC7jjbmoa
11 18 2017 2034 GLihiiiiijjoog5zab
12 21 1903 1923 jiooVbVbr3rrbrHHbalhG
13 25 2112 2136 BI6jjIjLjoC7jjbmoaoooiooE
14 28 2112 2139 BI6jjIjLjoC7jjbmoaoooiooEEE0
15 33 2117 2149 IjLjoC7jjbmoaoooiooEEE0000000G8bF
16 36 2114 2149 6jjIjLjoC7jjbmoaoooiooEEE0000000G8bF
16 36 2112 2147 BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8
17 38 2112 2149 BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bF
18 48 2117 2164 IjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
19 51 2114 2164 6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
20 53 2112 2164 BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
21 70 2095 2164
wooabmmmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
22 81 1400 1480
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
22 81 1399 1479
cObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKg
23 91 1389 1479
hcbbacccbccObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGG
23 91 1383 1473
fiiibbhcbbacccbccObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainb
24 97 1383 1479
fiiibbhcbbacccbccObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainb
25 112 1368 1479
NiiiiiiiiiiiiiifiiibbhcbbacccbccObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibb
26 131 2112 2242
BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9YooooooiGbbbbbaaaa
27 148 2095 2242
wooabmmmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
28 169 2095 2263
wooabmmmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
29 181 2112 2292
BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9YooooooiGbbbbbaaaa
30 189 2112 2300
BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9YooooooiGbbbbbaaaa
31 193 2112 2304
BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9YooooooiGbbbbbaaaa
32 210 2095 2304
wooabmmmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
33 252 1991 2242
Nbooiaiiiii0ooaYLLLLKGGGGGGLihiiiiijjoog5zabbbbGLLLiiiiiiiiziiiiiiooaa
33 252 1912 2163
3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaGGLLLLLLLLGGGGGGGGV
34 267 1894 2160
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
35 270 1894 2163
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
36 291 1894 2184
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
37 314 1991 2304
Nbooiaiiiii0ooaYLLLLKGGGGGGLihiiiiijjoog5zabbbbGLLLiiiiiiiiziiiiiiooaa
38 327 1894 2220
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
39 339 1894 2232
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
40 349 1894 2242
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
41 370 1894 2263
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
42 469 1795 2263
1iooaabbGaGGGGGGLLLLLLLLbGGGGGGbaaaaajwiiiiiiiiiihaoGCGLLLLLLLLLLKbaoi
43 501 1763 2263
ZLLLLLLL0KGGGGGGGGbbooooiaiiiiii1iooaabbGaGGGGGGLLLLLLLLbGGGGGGbaaaaaj
44 516 1727 2242
WiiiiiiXiaaaoooiaiiiiiiiiaaaaaGaVaYGZLLLLLLL0KGGGGGGGGbbooooiaiiiiii1i
45 536 1400 1935
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
46 609 1655 2263
UbGGGLLLLLLGGGGGGNbbooiiiiiiiiiiijibmabbGLLLLLLVbbbioiiiiiiiiiiiiiiiii
47 664 1600 2263
DaaaaaabbbGGGGGGGGLLLLGbaaaaaoooobiaiaoiiiiiiiiiiibaMbbUbGGGLLLLLLGGGG
48 706 1558 2263
TiiiiijbNbLaGGGGKLLLLLLLLLLDDDDDDDDDDDDDDDDaaaaaabbbGGGGGGGGLLLLGbaaaa
49 724 1400 2123
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
50 748 1400 2147
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
51 764 1400 2163
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
52 785 1400 2184
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
53 815 1406 2220
PQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgdddgSdd
54 821 1400 2220
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
55 833 1400 2232
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
56 843 1400 2242
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
57 864 1400 2263
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
58 1005 1259 2263
xiziiiiiiiiaaaaaabbbabrJbinnaiiiiaaaaaabbGJKLLLLLMLdddddddaaaaaaioiiii
59 1070 1194 2263
ybbbabbbbbbooooaiiiiiiiiiiiiiiiiiiaaaabalbbbbbbblbnrrrrcbbbaiiiiixizii
60 1225 1039 2263
elrrribnbbborrrrabEbbbadcdddddbbbbbbbbBbBbbbaaaaaaiiiiiiijiiiioiiiiaaF
61 1255 1009 2263
viiiiiiiiiiiiiiiiiiiiiiabhbiblelrrribnbbborrrrabEbbbadcdddddbbbbbbbbBb
62 1567 697 2263
Aiiiiiibiiiiiiiiiiiiiiiiiiiiiiiibiiiiiiiiiiiiiiiiiiiiiiiiiiabbbblbrrai
63 1852 391 2242
tknnnnnnnknbbbbbaaaaaaabbbbucggdbhhhhhhhhhhhhhhhhohhhhhhhhrhhhhhhhhhhh
64 1879 364 2242
pblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucggdbhhhhhhhhhh
64 1879 354 2232
sinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucggdb
65 1889 354 2242
sinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucggdb
66 1950 293 2242
qlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhlnnnnsinnkkkkh

Time to scan: 2285008 (1200 potential solutions)
Time to filter: 2459754 (253 filtered solutions)

Does this seem reasonable?

The total solution time is about 2.2 ms...

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-26 12:54:58 UTC
Permalink
Post by Robert AH Prins
Post by Robert AH Prins
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
The problem was that with totally random data, there are a _lot_ of both
potential and real solutions, I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.
The filter process here took a _lot_ of time though: 8M cycles!
This means that it needed 2000 cycles/input byte. :-(
This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...
The data is in fact not very random, but you'll be hard pressed to
figure out what it represents. Here is the current full set, represented
by these 66 characters:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-/*
aaabbbbbbbccccadddddddddddddddddddddddddddddddeaaaaaaaabbbcacddddeeeee
<SNIP>
Post by Robert AH Prins
Post by Robert AH Prins
bmaoooooobobbbbbooo (2329 characters)
The top-3 account for 58.3% of all values, the top-10 for 85.2%.
len aperture from to value
3 3 2291 2293 a3o
<SNIP>
Post by Robert AH Prins
66 1950 293 2242
qlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhlnnnnsinnkkkkh
A very quick look at the results tells me they are OK. My version
performs one other minor massage of the output, removing duplicates,
so to take an example

15 33 2117 2149 IjLjoC7jjbmoaoooiooEEE0000000G8bF

would come out as

15 33 2117 2149 IjLoC7bmaiE0G8F

Or, even more correctly, as the 4-character strings indexed by the values
of these characters. Nit-picking...
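[Editorial note: the duplicate removal Robert describes amounts to keeping only the first occurrence of each value, in order of first appearance. A minimal sketch; `strip_repeats` is a hypothetical helper, not code from either program.]

```c
#include <string.h>

/* Copy src to dst, keeping only the first occurrence of each byte value
   and preserving the order of first appearance. dst must have room for
   strlen(src)+1 bytes. */
static void strip_repeats(const char *src, char *dst)
{
    int seen[256] = {0};
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (!seen[c]) {
            seen[c] = 1;
            *dst++ = (char)c;
        }
    }
    *dst = '\0';
}
```

Applied to "IjLjoC7jjbmoaoooiooEEE0000000G8bF" this yields "IjLoC7bmaiE0G8F", the stripped form shown above.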
Post by Robert AH Prins
Time to scan: 2285008 (1200 potential solutions)
Time to filter: 2459754 (253 filtered solutions)
Does this seem reasonable?
The total solution time is about 2.2 ms...
I think my very first Pascal (Turbo Pascal 3.01a) solution using a
sliding window without any optimizations ran for around 26 *minutes*
(OK, on a 16MHz 386) on a far smaller subset (less than half) of this
data, so I think 2.2 ms is pretty reasonable. ;)

Still cutting the Pascal - the PL/I would be easier, but PL/I compilers
are a bit thin on the ground...

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-26 11:32:00 UTC
Permalink
Post by Robert AH Prins
A very quick look at the results tells me they are OK. My version
performs one other minor massage of the output, removing duplicates,
so to take an example
15 33 2117 2149 IjLjoC7jjbmoaoooiooEEE0000000G8bF
would come out as
15 33 2117 2149 IjLoC7bmaiE0G8F
Or, even more correctly, as the 4-character strings indexed by the values
of these characters. Nit-picking...
Post by Robert AH Prins
Time to scan: 2285008 (1200 potential solutions)
Time to filter: 2459754 (253 filtered solutions)
Does this seem reasonable?
The total solution time is about 2.2 ms...
I think my very first Pascal (Turbo Pascal 3.01a) solution using a
sliding window without any optimizations ran for around 26 *minutes*
(OK, on a 16MHz 386) on a far smaller subset (less than half) of this
data, so I think 2.2 ms is pretty reasonable. ;)
My first usable compiler (on a PC) was TP 1.0. :-)

Currently my best runtime has been:

65 1889 354 2242
sinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucggdb
66 1950 293 2242
qlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhlnnnnsinnkkkkh
Time to scan: 1247257 (1200 potential solutions)
Time to filter: 1226621 (253 filtered solutions)

I.e. 2.47M cycles on a 2.2GHz cpu, for a total of about 1.1 ms.

I have also implemented the promised improved filter function; using
it I got:

Time to scan: 1048432 (1200 potential solutions)
Time to filter: 982135 (237 filtered solutions)

I.e. 0.9 ms, but I'm missing 16 of the shorter solutions. :-(

I'll have to recheck my logic...

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje Mathisen
2010-05-26 11:38:07 UTC
Permalink
Post by Robert AH Prins
A very quick look at the results tells me they are OK. My version
performs one other minor massage of the output, removing duplicates,
so to take an example
15 33 2117 2149 IjLjoC7jjbmoaoooiooEEE0000000G8bF
would come out as
15 33 2117 2149 IjLoC7bmaiE0G8F
Here it is with the same stripping of repeated result bytes:

len aperture from to value
3 3 2291 2293 a3o
3 3 2271 2273 jao
3 3 2270 2272 oja
3 3 2241 2243 o*b
3 3 2201 2203 bji
3 3 2191 2193 bLG
3 3 2186 2188 obG
3 3 2182 2184 ao+
3 3 2159 2161 oMm
3 3 2158 2160 moM
3 3 2117 2119 IjL
3 3 2089 2091 aoi
3 3 2066 2068 Gbo
3 3 2037 2039 bGL
3 3 2010 2012 LKG
3 3 2001 2003 i0o
3 3 1994 1996 oia
3 3 1984 1986 VGb
3 3 1955 1957 iba
3 3 1939 1941 oai
3 3 1935 1937 4ba
3 3 1915 1917 brH
3 3 1906 1908 oVb
3 3 1903 1905 jio
3 3 1882 1884 LEG
3 3 1848 1850 CGL
3 3 1825 1827 Gba
3 3 1818 1820 LbG
3 3 1802 1804 bGa
3 3 1795 1797 1io
3 3 1786 1788 oia
3 3 1757 1759 GaV
3 3 1741 1743 oia
3 3 1734 1736 Xia
3 3 1705 1707 bio
3 3 1701 1703 LVb
3 3 1694 1696 bGL
3 3 1691 1693 mab
3 3 1671 1673 GNb
3 3 1655 1657 UbG
3 3 1651 1653 aMb
3 3 1637 1639 aoi
3 3 1636 1638 iao
3 3 1573 1575 GKL
3 3 1549 1551 Giz
3 3 1514 1516 Grb
3 3 1500 1502 bNG
3 3 1497 1499 fab
3 3 1488 1490 cja
3 3 1478 1480 djc
3 3 1472 1474 gSd
3 3 1471 1473 dgS
3 3 1455 1457 biG
3 3 1437 1439 bao
3 3 1399 1401 cOb
3 3 1392 1394 bac
3 3 1389 1391 hcb
3 3 1388 1390 bhc
3 3 1322 1324 aio
3 3 1308 1310 MLd
3 3 1286 1288 nai
3 3 1259 1261 xiz
3 3 1252 1254 bai
3 3 1248 1250 rcb
3 3 1233 1235 alb
3 3 1232 1234 bal
3 3 1208 1210 oai
3 3 1187 1189 iba
3 3 1162 1164 bai
3 3 1151 1153 rnl
3 3 1049 1051 bor
3 3 1039 1041 elr
3 3 1034 1036 hbi
3 3 924 926 ria
3 3 913 915 aDb
3 3 795 797 bai
3 3 784 786 bja
3 3 780 782 rbl
3 3 765 767 air
3 3 764 766 rai
3 3 761 763 lbr
3 3 755 757 iab
3 3 682 684 rab
3 3 680 682 lar
3 3 674 676 gin
3 3 589 591 xyi
3 3 588 590 ixy
3 3 543 545 nbr
3 3 522 524 ibr
3 3 509 511 ovi
3 3 508 510 iov
3 3 488 490 joa
3 3 400 402 knb
3 3 377 379 gda
3 3 354 356 sin
3 3 353 355 nsi
3 3 348 350 hln
3 3 341 343 alb
3 3 318 320 lkn
3 3 309 311 abp
3 3 296 298 lnh
3 3 292 294 bql
3 3 237 239 dgc
3 3 219 221 apb
3 3 211 213 iba
3 3 207 209 lib
3 3 198 200 nbk
3 3 186 188 kbn
3 3 166 168 cba
3 3 156 158 cad
3 3 151 153 acb
3 3 140 142 jkb
3 3 135 137 ibj
3 3 60 62 acd
3 3 58 60 bca
3 3 46 48 dea
3 3 14 16 cad
4 4 2279 2282 oahb
4 4 2260 2263 baot
4 4 2232 2235 /bao
4 4 2218 2221 aoJb
4 4 2212 2215 bmo-
4 4 2170 2173 oiGb
4 4 2162 2165 m9Yo
4 4 2149 2152 Fb0o
4 4 2121 2124 oC7j
4 4 2112 2115 BI6j
4 4 2097 2100 oabm
4 4 2017 2020 GLih
4 4 2004 2007 oaYL
4 4 1990 1993 aNbo
4 4 1909 1912 Vbr3
4 4 1893 1896 on2i
4 4 1831 1834 ajwi
4 4 1770 1773 L0KG
4 4 1649 1652 ibaM
4 4 1632 1635 obia
4 4 1621 1624 LGba
4 4 1563 1566 ijbN
4 4 1518 1521 baoi
4 4 1446 1449 iFra
4 4 1438 1441 aobd
4 4 1279 1282 abrJ
4 4 1242 1245 lbnr
4 4 1174 1177 icoa
4 4 1108 1111 Fahb
4 4 1060 1063 badc
4 4 1054 1057 rabE
4 4 1043 1046 ribn
4 4 1036 1039 ible
4 4 1031 1034 iabh
4 4 775 778 jnir
4 4 769 772 brln
4 4 677 680 nral
4 4 465 468 hinj
4 4 421 424 gdbh
4 4 417 420 bucg
4 4 374 377 cbdg
4 4 363 366 hpbl
4 4 336 339 bnka
5 5 2310 2314 *bmao
5 5 2145 2149 0G8bF
5 5 2125 2129 jbmoa
5 5 2119 2123 LjoC7
5 5 1920 1924 alhGb
5 5 1688 1692 jibma
5 5 1566 1570 NbLaG
5 5 1465 1469 LMKgd
5 5 1443 1447 bRriF
5 5 1406 1410 PQBub
5 5 1405 1409 bPQBu
5 5 1299 1303 bGJKL
5 5 1281 1285 rJbin
5 5 389 393 fbtkn
5 5 180 184 ablmk
6 6 2029 2034 og5zab
6 6 1918 1923 HbalhG
6 6 1870 1875 ioabGL
6 6 1859 1864 LKbaoi
6 6 1843 1848 ihaoGC
6 6 1759 1764 VaYGZL
6 6 1447 1452 Frainb
6 6 1152 1157 nlrHcb
7 7 302 308 nrhblka
8 10 2025 2034 ijog5zab
8 10 1438 1447 aobdRriF
8 10 302 311 nrhblkap
9 13 2117 2129 IjLoC7bma
9 13 2027 2039 jog5zabGL
9 13 1439 1451 obdRriFan
10 15 2025 2039 ijog5zabGL
11 18 2112 2129 BI6jLoC7bma
11 18 2017 2034 GLihjog5zab
12 21 1903 1923 jioVbr3HalhG
13 25 2112 2136 BI6jLoC7bmaiE
14 28 2112 2139 BI6jLoC7bmaiE0
15 33 2117 2149 IjLoC7bmaiE0G8F
16 36 2114 2149 6jILoC7bmaiE0G8F
16 36 2112 2147 BI6jLoC7bmaiE0G8
17 38 2112 2149 BI6jLoC7bmaiE0G8F
18 48 2117 2164 IjLoC7bmaiE0G8FM9Y
19 51 2114 2164 6jILoC7bmaiE0G8FM9Y
20 53 2112 2164 BI6jLoC7bmaiE0G8FM9Y
21 70 2095 2164 woabmBI6jLC7iE0G8FM9Y
22 81 1400 1480 ObPQBuiKaodRrFnGLMgSjc
22 81 1399 1479 cObPQBuiKaodRrFnGLMgSj
23 91 1389 1479 hcbaOPQBuiKodRrFnGLMgSj
23 91 1383 1473 fibhcaOPQBuKodRrFnGLMgS
24 97 1383 1479 fibhcaOPQBuKodRrFnGLMgSj
25 112 1368 1479 NifbhcaOPQBuKodRrFnGLMgSj
26 131 2112 2242 BI6jLoC7bmaiE0G8FM9Y+-JV/*
27 148 2095 2242 woabmBI6jLC7iE0G8FM9Y+-JV/*
28 169 2095 2263 woabmBI6jLC7iE0G8FM9Y+-JV/*t
29 181 2112 2292 BI6jLoC7bmaiE0G8FM9Y+-JV/*th3
30 189 2112 2300 BI6jLoC7bmaiE0G8FM9Y+-JV/*th3l
31 193 2112 2304 BI6jLoC7bmaiE0G8FM9Y+-JV/*th3lr
32 210 2095 2304 woabmBI6jLC7iE0G8FM9Y+-JV/*th3lr
33 252 1991 2242 Nboia0YLKGhjg5zmwBI6C7E8FM9+-JV/*
33 252 1912 2163 3rbHalhGL4oiwYVN0Kjg5zmBI6C7E8FM9
34 267 1894 2160 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM
35 270 1894 2163 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9
36 291 1894 2184 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+
37 314 1991 2304 Nboia0YLKGhjg5zmwBI6C7E8FM9+-JV/*t3lr
38 327 1894 2220 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+-J
39 339 1894 2232 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+-J/
40 349 1894 2242 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+-J/*
41 370 1894 2263 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+-J/*t
42 469 1795 2263 1ioabGLjwhCKEn2Vr3Hl4YN0g5zmBI678FM9+-J/*t
43 501 1763 2263 ZL0KGboia1jwhCEn2Vr3Hl4YNg5zmBI678FM9+-J/*t
44 516 1727 2242 WiXaoGVYZL0Kb1jwhCEn2r3Hl4Ng5zmBI678FM9+-J/*
45 536 1400 1935 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl4
46 609 1655 2263 UbGLNoijmaVWXYZ0K1whCEn2r3Hl4g5zBI678FM9+-J/*t
47 664 1600 2263 DabGLoiMUNjmVWXYZ0K1whCEn2r3Hl4g5zBI678F9+-J/*t
48 706 1558 2263 TijbNLaGKDoMUmVWXYZ01whCEn2r3Hl4g5zBI678F9+-J/*t
49 724 1400 2123 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I67
50 748 1400 2147 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I678
51 764 1400 2163 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789
52 785 1400 2184 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+
53 815 1406 2220 PQBubiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J
54 821 1400 2220
ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J
55 833 1400 2232
ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J/
56 843 1400 2242
ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J/*
57 864 1400 2263
ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J/*t
58 1005 1259 2263
xizabrJnGKLMdoNfhcOPQBuRFgSjTDUmVWXYZ01wCE23Hl45I6789+-/*t
59 1070 1194 2263
ybaoilnrcxzJGKLMdNfhOPQBuRFgSjTDUmVWXYZ01wCE23H45I6789+-/*t
60 1225 1039 2263
elribnoaEdcBjFhGHIyxzJKLMNfOPQuRgSTDUmVWXYZ01wC23456789+-/*t
61 1255 1009 2263
viabhlernoEdcBjFGHIyxzJKLMNfOPQuRgSTDUmVWXYZ01wC23456789+-/*t
62 1567 697 2263
AibalrnjBCDvheoEdcFGHIyxzJKLMNfOPQuRgSTUmVWXYZ01w23456789+-/*t
63 1852 391 2242
tknbaucgdhorijvwxyzflABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/*
64 1879 364 2242
pblcdgaftknuhorijvwxyzABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/*
64 1879 354 2232
sinkhpblcdgaftuorjvwxyzABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/
65 1889 354 2242
sinkhpblcdgaftuorjvwxyzABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/*
66 1950 293 2242
qlnhrbkapsicdgftuojvwxyzABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/*
Time to scan: 853094 (1200 potential solutions)
Time to filter: 874588 (253 filtered solutions)

1.7M cycles, 0.75ms total time.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Dick Wesseling
2010-05-26 13:23:39 UTC
Permalink
In article <***@mid.individual.net>,
Robert AH Prins <***@prino.org> writes:
[snip]
Post by Robert AH Prins
Post by Robert AH Prins
Time to scan: 2285008 (1200 potential solutions)
Time to filter: 2459754 (253 filtered solutions)
Does this seem reasonable?
The total solution time is about 2.2 ms...
I think my very first Pascal (Turbo Pascal 3.01a) solution using a
sliding window without any optimizations ran for around 26 *minutes*
(OK, on a 16MHz 386) on a far smaller subset (less than half) of this
data, so I think 2.2 ms is pretty reasonable. ;)
Still cutting the Pascal - the PL/I would be easier, but PL/I compilers
are a bit thin on the ground...
My results:

2165682 cycles 253 solutions.

Using 2K of random data instead takes 6 times longer.
Robert AH Prins
2010-05-25 22:35:24 UTC
Permalink
Post by Terje Mathisen
Post by James J. Weinkam
Post by Robert AH Prins
His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out, as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to why.
Can you send a copy of the data set that fails? Also, if it isn't prying
into your business, would you describe the application this problem is
abstracted from? You indicated in your original post that the values
(characters in the posted example) were actually indices into a table in
the real application.
I would also love to see that, I really do believe that my current
approach is sound.
Group 61 - the x-marked 3-string (FEG) is not found by the Paul Green code:

61 1 A
61 2 A
61 3 B
61 4 C
61 5 D
61 6 E
61 7 F x
61 8 E x
61 9 G x
61 10 G
61 11 G
61 12 G
61 13 G
61 14 G
61 15 G
61 16 G
61 17 G
61 18 G
61 19 H
61 20 I
61 21 C
61 22 D
61 23 A
61 24 A
61 25 A
61 26 A
61 27 A

Group 69, again the x-marked 3-string (HIJ) is not found

69 1 A
69 2 A
69 3 A
69 4 A
69 5 A
69 6 B
69 7 B
69 8 C
69 9 C
69 10 D
69 11 E
69 12 F
69 13 G
69 14 H
69 15 H
69 16 H
69 17 H x
69 18 I x
69 19 J x
69 20 J
69 21 J

Will extract the code from the Pascal program and get it working RSN;
I'm a bit tired from travel, as it takes as long to get from Vilnius to
Charleroi as from Charleroi to Oostende.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Dick Wesseling
2010-05-26 03:48:27 UTC
Permalink
What a relief, mine works:

5641 cycles 6 solutions
AABCDEFEGGGGGGGGGGHICDAAAAA
len win pos
9 19 1 ABCDEFEGGGGGGGGGGHI
8 17 6 FEGGGGGGGGGGHICDA
7 8 1 ABCDEFEG
6 6 1 ABCDEF
3 3 6 FEG <there it is!
6 6 17 GHICDA
Post by Robert AH Prins
Group 69, again the x-marked 3-string (HIJ) is not found
5630 cycles 6 solutions
AAAAABBCCDEFGHHHHIJJJ
len win pos
10 15 4 ABBCCDEFGHHHHIJ
9 13 6 BCCDEFGHHHHIJ
8 10 4 ABBCCDEFGH
7 8 6 BCCDEFGH
6 6 8 CDEFGH
3 3 16 HIJ <there it is!
Post by Robert AH Prins
Will extract the code from the Pascal program and get it working RSN;
I'm a bit tired from travel, as it takes as long to get from Vilnius to
Charleroi as from Charleroi to Oostende.
If it isn't asking too much, once you've taken some rest would you mind
posting one of the longer (2K or so) data sets?
Earlier I had to disappoint the readers of clax by saying that it wasn't
worth bothering about an asm version because it outperformed the C
version by a small margin only. However, that was for random data.
Random data suffocates branch prediction causing _any_ version to come
to a grinding halt (*).

For real data such as group 61 and 69 and the original example the asm
outperformed the high level version by 30% or so. Still not something to
get excited about, but I'm curious how it would perform on data sets
that are both real and large.

(*) I expect Professor Weinkam's solution to perform much better for
random data. If my understanding is correct, most of the time is spent

do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;

a branchless implementation of min() should work fine, and even a
branching one should have a 99% branch prediction success rate
because overflow will happen only for very large data sets.
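[Editorial note: the branchless min() Dick mentions can be written, for example, as follows. A generic sketch; `min_branchless` is a hypothetical name, and the trick is only valid when a - b does not overflow.]

```c
/* Branchless minimum of two ints: (a < b) evaluates to 0 or 1, so the
   mask -(a < b) is either all-ones (select a) or all-zeros (select b).
   Valid whenever a - b does not overflow. */
static int min_branchless(int a, int b)
{
    return b + ((a - b) & -(a < b));
}
```

In the PL/I loop above, min(32766, ...) acts as a saturating clamp so the distance counters cannot wrap; the same mask trick applies there.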
Terje Mathisen
2010-05-26 06:23:32 UTC
Permalink
Post by Dick Wesseling
5641 cycles 6 solutions
AABCDEFEGGGGGGGGGGHICDAAAAA
len win pos
9 19 1 ABCDEFEGGGGGGGGGGHI
8 17 6 FEGGGGGGGGGGHICDA
7 8 1 ABCDEFEG
6 6 1 ABCDEF
3 3 6 FEG <there it is!
6 6 17 GHICDA
I got these:
C:\>\c2\unique-strings\Release\unique-strings.exe
len aperture from to value
3 3 7 9 FEG
6 6 18 23 GHICDA
6 6 2 7 ABCDEF
7 8 2 9 ABCDEFEG
8 17 7 23 FEGGGGGGGGGGHICDA
9 19 2 20 ABCDEFEGGGGGGGGGGHI
Time to scan: 3278 (26 potential solutions)
Time to filter: 10736 (6 filtered solutions)
Post by Dick Wesseling
Post by Robert AH Prins
Group 69, again the x-marked 3-string (HIJ) is not found
5630 cycles 6 solutions
AAAAABBCCDEFGHHHHIJJJ
len win pos
10 15 4 ABBCCDEFGHHHHIJ
9 13 6 BCCDEFGHHHHIJ
8 10 4 ABBCCDEFGH
7 8 6 BCCDEFGH
6 6 8 CDEFGH
3 3 16 HIJ <there it is!
and so did I:

len aperture from to value
3 3 17 19 HIJ
6 6 9 14 CDEFGH
7 8 7 14 BCCDEFGH
8 10 5 14 ABBCCDEFGH
9 13 7 19 BCCDEFGHHHHIJ
10 15 5 19 ABBCCDEFGHHHHIJ
Time to scan: 2453 (25 potential solutions)
Time to filter: 9735 (6 filtered solutions)


I have, however, found a _much_ faster way to filter my potential
solutions; I believe I can make it run in close to linear time. :-)
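[Editorial note: as an illustration of what the filter stage has to do - my reading of the thread, not Terje's actual criteria, which also exclude windows contained in longer all-distinct runs - one pass can find, per distinct count, the minimal aperture, and a second pass keeps only candidates achieving it. `filter_minimal` and `struct cand` are hypothetical names.]

```c
#include <limits.h>

/* A candidate window found by the scan stage:
   d distinct values occurring within positions [from..to]. */
struct cand { int d, from, to; };

/* Keep, for each distinct count d, only the candidates whose aperture
   (to - from + 1) is minimal for that d. Compacts the array in place
   and returns the new count. Assumes 0 <= d < 256 for all candidates. */
static int filter_minimal(struct cand *c, int n)
{
    int best[256];
    for (int d = 0; d < 256; d++)
        best[d] = INT_MAX;

    /* First pass: minimal aperture per distinct count. */
    for (int i = 0; i < n; i++) {
        int ap = c[i].to - c[i].from + 1;
        if (ap < best[c[i].d])
            best[c[i].d] = ap;
    }

    /* Second pass: keep only candidates achieving that minimum. */
    int m = 0;
    for (int i = 0; i < n; i++) {
        int ap = c[i].to - c[i].from + 1;
        if (ap == best[c[i].d])
            c[m++] = c[i];
    }
    return m;
}
```

Both passes are linear in the number of candidates, which is consistent with the close-to-linear filtering time mentioned above.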

Terje
Post by Dick Wesseling
Post by Robert AH Prins
Will extract the code from the Pascal program and get it working RSN;
I'm a bit tired from travel, as it takes as long to get from Vilnius to
Charleroi as from Charleroi to Oostende.
If it isn't asking too much, once you've taken some rest would you mind
posting one of the longer (2K or so) data sets?
Earlier I had to disappoint the readers of clax by saying that it wasn't
worth bothering about an asm version because it outperformed the C
version by a small margin only. However, that was for random data.
Random data suffocates branch prediction causing _any_ version to come
to a grinding halt (*).
For real data such as group 61 and 69 and the original example the asm
outperformed the high level version by 30% or so. Still not something to
get excited about, but I'm curious how it would perform on data sets
that are both real and large.
(*) I expect Professor Weinkam's solution to perform much better for
random data. If my understanding is correct, most of the time is spent in:
do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;
a branchless implementation of min() should work fine, and even a
branching one should have a 99% branch prediction success rate
because overflow will happen only for very large data sets.
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-26 13:16:20 UTC
Permalink
Post by Dick Wesseling
5641 cycles 6 solutions
AABCDEFEGGGGGGGGGGHICDAAAAA
len win pos
9 19 1 ABCDEFEGGGGGGGGGGHI
8 17 6 FEGGGGGGGGGGHICDA
7 8 1 ABCDEFEG
6 6 1 ABCDEF
3 3 6 FEG <there it is!
6 6 17 GHICDA
Post by Robert AH Prins
Group 69, again the x-marked 3-string (HIJ) is not found
5630 cycles 6 solutions
AAAAABBCCDEFGHHHHIJJJ
len win pos
10 15 4 ABBCCDEFGHHHHIJ
9 13 6 BCCDEFGHHHHIJ
8 10 4 ABBCCDEFGH
7 8 6 BCCDEFGH
6 6 8 CDEFGH
3 3 16 HIJ <there it is!
Post by Robert AH Prins
Will extract the code from the Pascal program and get it working RSN;
I'm a bit tired from travel, as it takes as long to get from Vilnius to
Charleroi as from Charleroi to Oostende.
If it isn't asking too much, once you've taken some rest would you mind
posting one of the longer (2K or so) data sets?
Earlier I had to disappoint the readers of clax by saying that it wasn't
worth bothering about an asm version because it outperformed the C
version by a small margin only. However, that was for random data.
Random data suffocates branch prediction causing _any_ version to come
to a grinding halt (*).
I'm afraid the 2329-entry set of input data, built up over 29 years, is
all that I have at the moment. It's likely to grow a bit more in the
coming years, at a rate of 40 to 80 entries per year (in 2008 there were
138 new values, but only 5 of them had never been seen; in 2009 there
were 56, with 3 previously unseen ones; and so far this year there have
been 67 new entries, but all of them have occurred before). Most of the
growth will come from the top-10 values, with possibly minor additions
from the values in the remainder of the top-20.

The amount of data from other users is likely to be rather a lot smaller
and more likely to contain far fewer unique values; I'll see if I can
get any.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-26 11:45:50 UTC
Permalink
Post by Robert AH Prins
I'm afraid the 2329 set of input data, built up over 29 years, is all
that I have at the moment. It's likely to grow a bit more in the coming
years, at a rate of 40 to 80 entries per year (in 2008 there were 138
new values, but only 5 of them had never been seen, in 2009 there were
56 with 3 previously unseen ones, and so far this year there have been
67 new entries, but all of them have occurred before). Most of the growth
will come from the top-10 values with possibly minor additions from the
values in the remainder of the top-20.
The amount of data from other users is likely to be rather a lot smaller
and more likely to contain far fewer unique values, I'll see if I can
get any.
In that case, and with my current code running at sub-ms speed, it
really seems like this is a completely solved problem.

I really can't see how you could need to re-run the scanner 1000
times/second, every second, for almost a week while waiting for the next
addition to your data set!
:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-06-08 21:11:27 UTC
Permalink
Post by James J. Weinkam
Post by Robert AH Prins
His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to the why.
Can you send a copy of the data set that fails? Also, if it isn't prying
into your business, would you describe the application this problem is
abstracted from? You indicated in your original post that the values
(characters in the posted example) were actually indices into a table in
the real application.
Somehow I missed this post, apologies!

The values are indices into a table of "DISTINGUISHING SIGNS USED ON
VEHICLES IN INTERNATIONAL TRAFFIC", or in more understandable language,
"international licence plate country code" signs - the current full
official list can be found at

http://www.unece.org/trans/conventn/Distsigns.pdf

one also containing informal ones can be found on Wikipedia,

http://en.wikipedia.org/wiki/List_of_international_license_plate_codes

The upper limit of 255 comes from here.

As for the application itself?

Ahum, it's from a program that processes hitchhike data...

The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.

The program is used by a few other hitch-hikers and one of them asked me
if I could change the code to also produce this table for the trips done
in a single year, and it was this change that resulted in the discovery
of the erroneous results for several sets of data.

The hitchhiking angle also explains why the data grows at a rather
leisurely pace, I'm no longer a spring-chicken and I do not hitchhike as
much as I used to do.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-06-09 06:11:06 UTC
Permalink
Post by Robert AH Prins
Ahum, it's from a program that processes hitchhike data...
<BG>
Post by Robert AH Prins
The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.
The program is used by a few other hitch-hikers and one of them asked me
if I could change the code to also produce this table for the trips done
in a single year, and it was this change that resulted in the discovery
of the erroneous results for several sets of data.
So in reality, this program only has to keep up with the rate of new
rides, comparing the current string with the previous records, right?
Post by Robert AH Prins
The hitchhiking angle also explains why the data grows at a rather
leisurely pace, I'm no longer a spring-chicken and I do not hitchhike as
much as I used to do.
I think I stopped doing that nearly 25 years ago: My wife & I tramped
around the US for two months on our honeymoon, flying all the
long-distance stretches, hitching rides on many shorter stretches.

In the US you'd have to check state licence plates instead, I guess.:-)

Anyway, I love the idea: Getting a large group of hotshot programmers to
optimize a program that could be handled with a small set of
handwritten/colored cards. :-)

Terje
Post by Robert AH Prins
Robert
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-06-09 10:42:18 UTC
Permalink
Post by Terje Mathisen
Post by Robert AH Prins
Ahum, it's from a program that processes hitchhike data...
<BG>
Post by Robert AH Prins
The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.
The program is used by a few other hitch-hikers and one of them asked me
if I could change the code to also produce this table for the trips done
in a single year, and it was this change that resulted in the discovery
of the erroneous results for several sets of data.
So in reality, this program only has to keep up with the rate of new
rides, comparing the current string with the previous records, right?
"Yes", he replied with a blush...
Post by Terje Mathisen
Post by Robert AH Prins
The hitchhiking angle also explains why the data grows at a rather
leisurely pace, I'm no longer a spring-chicken and I do not hitchhike as
much as I used to do.
I think I stopped doing that nearly 25 years ago: My wife & I tramped
around the US for two months on our honeymoon, flying all the
long-distance stretches, hitching rides on many shorter stretches.
My wife (I met her at the 4th International Hitch-Hikers Congress in
2000) and I did a bit of hitchhiking on our honeymoon in Japan, but only
short stretches.
Post by Terje Mathisen
In the US you'd have to check state licence plates instead, I guess.:-)
Your guess is as good as mine, I don't know if any Americans are using
the program.
Post by Terje Mathisen
Anyway, I love the idea: Getting a large group of hotshot programmers to
optimize a program that could be handled with a small set of
handwritten/colored cards. :-)
The very first posting started with "Can anyone give me some *hints* as
to solve the following problem, preferably in a way that is faster than
the way I used to do it, and without the bug in the current version;"
(emphasis added)

How could I possibly have known that a large group of hotshot
programmers would jump on something one of them referred to as "This is
an intriguing problem, I'll have to think some more... :-)" ;)

A few years ago I came across "CSCRX2HT.txt" on
https://sites.google.com/site/schlabb/home/code-snippets/rexx-to-html
and found that it had a raft of bugs. I managed to eliminate them,
thought that it would be fun to write similar routines for a bunch of
other "legacy" languages, even spent some of my own money on the
Javascript that produces the z/OS ISPF-like scrolling HTML and in the
end I made it freely available under the provisions of the GPL V3. See
http://www.cbttape.org/ftp/cbt/CBT769.zip - the contents are in TSO XMIT
format, use XMIT Manager, http://www.cbttape.org/njw/index.html to
extract the REXX code.

I've got a quote somewhere from a Dutch computer scientist, W.L. (Willem
Louis) van der Poel: "The intrinsic value of software is nil"

I may have cheated a little (or a lot, take your pick) by not telling
what the actual code was used for - if I had done so, the thread might
never have developed into the current pretty high quality-for-Usenet one.

The value of my hitchhike program is nil (or at least pretty close to
it), but to quote another contributor, Jerome Fine, "The problem is
very instructive and I really enjoy the description of what you fellows
are doing."

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Dr J R Stockton
2010-06-09 20:54:36 UTC
Permalink
Post by Robert AH Prins
The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.
That's a more understandable description than before, IMHO.

It does indeed seem a close match to what I wrote after and because of
reading that earlier description - JavaScript to find the shortest
interval with (all) 35 different Easter Sunday dates.

My script has a sliding window; the front of the window moves forward
until the content of the window becomes interesting, then the back moves
forward until the content is about to become uninteresting, and reports
if necessary. Passes are so fast that they can be repeated for
different "interesting".

See <URL:http://www.merlyn.demon.co.uk/estrcons.HTM#ESP>, and "Pop Code"
- the code is essentially pascal-compatible. Sets are not used; an
array of as many integer elements as there are possible elements is
needed.

FYI, the minimum span is 72 years, for both Gregorian and Julian; the
Julians are in the middle of such a period, but the Gregorians will have
to wait a few millennia for a shortest span.
--
(c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05.
Web <URL:http://www.merlyn.demon.co.uk/> - w. FAQish topics, links, acronyms
PAS EXE etc : <URL:http://www.merlyn.demon.co.uk/programs/> - see 00index.htm
Dates - miscdate.htm estrdate.htm js-dates.htm pas-time.htm critdate.htm etc.
Terje Mathisen
2010-06-10 05:48:13 UTC
Permalink
Post by Dr J R Stockton
Post by Robert AH Prins
The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.
That's a more understandable description than before, IMHO.
It does indeed seem a close match to what I wrote after and because of
reading that earlier description - JavaScript to find the shortest
interval with (all) 35 different Easter Sunday dates.
Nice problem...
Post by Dr J R Stockton
My script has a sliding window; the front of the window moves forward
until the content of the window becomes interesting, then the back moves
forward until the content is about to become uninteresting, and reports
if necessary. Passes are so fast that they can be repeated for
different "interesting".
I assume you run the scan from any given starting year, then cross out
the various dates as you get to them?

As soon as you've found all possible dates, you optimize by running the
back end forward until you get to the first date which isn't repeated
anywhere later in the interval.

This is very easy to do by using a counting array to cross out each date
as you pass it: Increment when the front end finds a date, decrement
when the back end passes it.

Increment the total count each time a counter goes from 0->1, decrement
for 1->0.

(Actually, in asm you could do a little trick, by using -1 as the
initial count for each date and -35 as the initial day count:

;; Forward scan, for each year
;; EAX = easter_sunday(year)

add day_count[EAX*4],1 ; Generates a carry for -1 -> 0
adc EBX,0 ; Total number of different days
jc Found_35_Front

Similarly for the tail end:

sub day_count[EAX*4],1
sbb EBX,0
jc Found_35_Tail

This makes the code totally branchless except for each time you actually
find an edge of a new interval.
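The same counting scheme can be rendered in portable C as a classic two-pointer window (the carry trick above is an asm-only micro-optimization); min_cover and its parameters are my own illustrative names, not code from this thread:

```c
/* Shortest window of data[0..n-1] containing every value in 0..m-1.
   Returns the window length, or -1 if some value never occurs.
   Sketch only: values are assumed to lie in 0..255. */
static int min_cover(const int *data, int n, int m)
{
    int cnt[256] = {0};   /* occurrences of each value inside the window */
    int have = 0;         /* distinct values currently in the window     */
    int best = -1;
    int back = 0;

    for (int front = 0; front < n; front++) {
        if (cnt[data[front]]++ == 0)      /* 0 -> 1: new distinct value */
            have++;
        while (have == m) {               /* window covers all m values */
            int len = front - back + 1;
            if (best < 0 || len < best)
                best = len;
            if (--cnt[data[back++]] == 0) /* 1 -> 0: value dropped out  */
                have--;
        }
    }
    return best;
}
```

For the Easter-date problem, m would be 35 and data[] the per-year date codes; each time have reaches m, the back end advances exactly as described above.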
Post by Dr J R Stockton
See<URL:http://www.merlyn.demon.co.uk/estrcons.HTM#ESP>, and "Pop Code"
- the code is essentially pascal-compatible. Sets are not used; an
array of as many integer elements as there are possible elements is
needed.
FYI, the minimum span is 72 years, for both Gregorian and Julian; the
Julians are in the middle of such a period, but the Gregorians will have
to wait a few millennia for a shortest span.
:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Dr J R Stockton
2010-06-11 21:28:06 UTC
Permalink
In comp.lang.pascal.borland message <vta6e7-***@ntp.tmsw.no>, Thu,
10 Jun 2010 07:48:13, Terje Mathisen <"terje.mathisen at tmsw.no"> posted:
Post by Terje Mathisen
Post by Dr J R Stockton
It does indeed seem a close match to what I wrote after and because of
reading that earlier description - JavaScript to find the shortest
interval with (all) 35 different Easter Sunday dates.
Nice problem...
Post by Dr J R Stockton
My script has a sliding window; the front of the window moves forward
until the content of the window becomes interesting, then the back moves
forward until the content is about to become uninteresting, and reports
if necessary. Passes are so fast that they can be repeated for
different "interesting".
I assume you run the scan from any given starting year, then cross out
the various dates as you get to them?
You assume incorrectly, to judge by that sentence. But read on ...
Post by Terje Mathisen
As soon as you've found all possible dates, you optimize by running the
back end forward until you get to the first date which isn't repeated
anywhere later in the interval.
Yes.
Post by Terje Mathisen
This is very easy to do by using a counting array to cross out each
I don't cross out.
Post by Terje Mathisen
Increment when the front end finds a date, decrement when the back end
passes it.
Yes.
Post by Terje Mathisen
Increment the total count each time a counter goes from 0->1, decrement
for 1->0.
Yes.
Post by Terje Mathisen
(Actually, in asm you could do a little trick, by using -1 as the
It's fast enough in JavaScript, even for the 5700000 year Gregorian
cycle. There should be no need to optimise ASM code, fun though that
can be.
Post by Terje Mathisen
Post by Dr J R Stockton
See<URL:http://www.merlyn.demon.co.uk/estrcons.HTM#ESP>, and "Pop Code"
- the code is essentially pascal-compatible. Sets are not used; an
array of as many integer elements as there are possible elements is
needed.
FYI, the minimum span is 72 years, for both Gregorian and Julian; the
Julians are in the middle of such a period, but the Gregorians will have
to wait a few millennia for a shortest span.
I did think of adapting it for the maximum span not containing all 35
dates, but that must be given by the longest interval between date
repeats, which I've already done.

Aficionados of ISO 8601 will realise, with a little thought, that the
number of different dates is actually (probably; provably) 35 + 36 + 6.

My page <URL:http://www.merlyn.demon.co.uk/estrdate.htm> links to
various JavaScript full-range algorithms for Easter Sunday in
estralgs.txt - could you beat
<http://www.hugi.scene.org/compo/compoold.htm#compo20> in ASM while
retaining traceability to the formal standards (the Papal Bull, the
British Calendar Act, and whatever Norway relies on?!?)? The rules for
compo20 give Oudin's algorithm, which I can beat.
--
(c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05.
Web <URL:http://www.merlyn.demon.co.uk/> - w. FAQish topics, links, acronyms
PAS EXE etc : <URL:http://www.merlyn.demon.co.uk/programs/> - see 00index.htm
Dates - miscdate.htm estrdate.htm js-dates.htm pas-time.htm critdate.htm etc.
Terje Mathisen
2010-05-20 08:07:47 UTC
Permalink
Post by Terje Mathisen
Post by robin
| I believe the program below does what you want and should be a couple
| of orders of magnitude faster than the method you outlined (based on
| your statement about how many "PL/I statements" are executed on an
| input of length 2329 with 66 distinct values). For illustrative
| purposes, the program uses the character representation you used in
| your example for input and output, but the algorithm itself will work
| for values of c up to 32765 and v up to 32764 if you have enough
| memory and patience.
Assuming that it works, a very impressive effort!
Indeed.
I have not tried to read the code (or explanation) except to note that
it must be far better than O(n*n). :-)
Scan through the input string from beginning to end, while maintaining a
set of lookup tables, one for each previous starting position. In each
lookup table we will have two things: The values seen so far, and the
total number of different characters.
For each possible length we remember the shortest string that so far has
resulted in this many unique values, updating it when we find a shorter
string.
This way the processing would consist of loading data[i], then for each
of the previous bytes taken as the starting position check if this value
hasn't been seen before, and if so, update the count and check if this
gives a new shortest window:

for (i = 2; i < data_len; i++) {
  byte b = data[i];
  for (j = 0; j < i-2; j++) {
    if (!seen[j][b]) {
      seen[j][b] = 1;
      unique_values = ++count[j];
      len = i-j+1;
      if (len <= shortest[unique_values]) {
        save(unique_values, len, j);
        shortest[unique_values] = len;
      }
    }
  }
}
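Filled out into compilable C, the table-per-start sketch might look like the following. The array sizes, the scan wrapper and the global shortest[] are my own scaffolding, and the inner loop runs j up to i inclusive so that length-1 and length-2 windows are counted as well:

```c
#include <limits.h>
#include <string.h>

#define MAXN  4096   /* maximum input length, per the problem statement */
#define ALPHA 256    /* alphabet size */

/* shortest[u] = length of the shortest window containing u distinct values */
static int shortest[ALPHA + 1];

static void scan(const unsigned char *data, int n)  /* requires n <= MAXN */
{
    static unsigned char seen[MAXN][ALPHA]; /* seen[j][v]: v already in window from j */
    static int count[MAXN];                 /* distinct values in window starting at j */

    memset(seen, 0, sizeof seen);
    memset(count, 0, sizeof count);
    for (int u = 0; u <= ALPHA; u++)
        shortest[u] = INT_MAX;

    for (int i = 0; i < n; i++) {
        unsigned char b = data[i];
        for (int j = 0; j <= i; j++) {      /* every earlier start position */
            if (!seen[j][b]) {              /* b is new for the window [j..i] */
                seen[j][b] = 1;
                int u = ++count[j];
                int len = i - j + 1;
                if (len < shortest[u])
                    shortest[u] = len;      /* new minimal window, u distinct values */
            }
        }
    }
}
```

On the 56-character example that opened the thread this reproduces the minimal lengths 3, 7, 16, 23 and 25 for 3, 7, 8, 9 and 10 distinct values; the save()/pruning step that lists all winning candidates is left out.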
I just realized that the actual lookup tables can be skipped, while
making the program significantly faster:

Simply scan backwards through the input string, starting from the
previous position:

For each such position increment the count of unique values seen, until
we reach the first byte, or a duplicate of the current input: At this
point the scan can terminate at once, which will significantly reduce
the expected number of iterations, while getting rid of all the big
lookup tables:

for (i = 2; i < data_len; i++) {
  byte b = data[i];
  for (j = i-1; j >= 0; j--) {
    if (data[j] == b) break;
    len = i-j+1;
    unique_values = ++count[len];
    if (len <= shortest[unique_values]) {
      save(unique_values, len, j);
      shortest[unique_values] = len;
    }
  }
}
Post by Terje Mathisen
The save function will generate a list of all possible "winning"
candidates, so it must be sorted and pruned before the final output but
that is trivial.
This is still the same.
Post by Terje Mathisen
The main (only?) problem with the program above is that it is still
O(n*n), with data_len = 4096, the total number of iterations will be
1+2+3+4+...+4093 which is ~8M.
According to the birthday paradox, for random inputs the expected number
of samples needed to find a collision (i.e. a non-unique value) in a set
of size n is ~sqrt(n), so in one fell swoop the rewrite above has
reduced the algorithm from O(n*n) to O(n*sqrt(m)) where (n) is the input
array length and (m) is the alphabet size.

I.e. for the case of n=4096 and m=256, the expected iteration count is
now 4096*16=64K, and everything will fit in L1 cache.

Running time should be _well_ below a single ms.

The inner loop can never iterate over more than the last (m) characters,
so we have 4kB of input and 512 bytes (with 16-bit counters) for the
active part of the shortest[] table.

Going to huge inputs (more than 64k of data and a 32-bit (utf8)
alphabet) we end up with n*4 bytes of input and n*4 bytes of shortest[]
table space, for a total of 8*n.

Assuming a 4GB input array consisting of 1G 32-bit random values, the
expected iteration count becomes ~1G * sqrt(1G) = 1G*32k or 32e12.

With 5 instructions in the inner loop, say 10 cycles (all accesses are
sequential so we get maximum use of L1 and L2!), this would take 3-6 hours.
Post by Terje Mathisen
I expect the running time to be ~ 10-20 cycles/inner loop iteration, or
80-160M cycles for the 4K input, letting the program finish in less than
a tenth of a second on any currently existing PC.
Is that fast enough?
Obviously not. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
James J. Weinkam
2010-05-18 08:27:44 UTC
Permalink
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
...
I believe the program below does what you want and should be a couple of orders of magnitude faster
than the method you outlined (based on your statement about how many "PL/I statements" are executed
on an input of length 2329 with 66 distinct values). For illustrative purposes, the program uses the
character representation you used in your example for input and output, but the algorithm itself
will work for values of c up to 32765 and v up to 32764 if you have enough memory and patience.

The program uses four arrays: y(c), a character array into which the input is read; x(c), an integer
array giving the encoded input with a=1, b=2, etc; a(0:c,v+1), described below; and l(v), with l(j)
the length of the shortest subsequence with j distinct values.

The basic idea is to build an array, a, with a row for each cell and a column for each value. The
i,j element of the array gives the relative cell number in the input array starting from cell i of
the first occurrence of value j, if any, otherwise a(i,j) is greater than max(c,v+1). The array is
built starting from the last row, c, and working back to row 1. Row c is first set to max(c+1,v+2).
Then a(c,x(c)) is set to 1. Working backward for i=c-1 to 1, each element of row i is set to the
minimum of 32766 and the corresponding element of row i+1 plus 1 and then element a(i,x(i)) is set to 1.

Once this construction is complete, it is evident that there is one and only one instance of the
integer 1 in each row, and all values in each row less than c+1 are unique.

Next each row is sorted. After sorting, the a(i,j) element contains the length of the shortest
subsequence starting at position i that contains j distinct values unless there is no such
subsequence, in which case a(i,j) is greater than c. It is also evident that if a(i,j)=j then
a(i,k)=k for 1<=k<=j. Now if a(i,j)=j the subsequence of length j starting at position i has j
distinct values, but it should be excluded from the solution set for j distinct values if it is a
proper subsequence of a longer subsequence of distinct values. This will be so iff a(i,j+1)=j+1 or
a(i-1,j+1)=j+1.
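The row construction and per-row sort can be sketched in C (a different language than the PL/I source below; scan_rows and shortest_len are my own names). One backward pass maintains the first-occurrence index of every value, so the distances for start i are exactly the a(i,j) of the description, and after sorting the j-th smallest is the length of the shortest subsequence starting at i with j distinct values. The exclusion test and the output scan are omitted:

```c
#include <limits.h>
#include <stdlib.h>

#define ALPHA 256   /* alphabet size; the description's v */

/* shortest_len[j] = global minimum length of a subsequence with j distinct values */
static int shortest_len[ALPHA + 1];

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static void scan_rows(const unsigned char *data, int n)
{
    int next[ALPHA];   /* index of the first occurrence of each value at or after i */
    int row[ALPHA];

    for (int v = 0; v < ALPHA; v++)
        next[v] = n;                        /* n means "does not occur" */
    for (int j = 0; j <= ALPHA; j++)
        shortest_len[j] = INT_MAX;

    for (int i = n - 1; i >= 0; i--) {      /* build row i from row i+1 */
        next[data[i]] = i;
        int m = 0;
        for (int v = 0; v < ALPHA; v++)
            if (next[v] < n)
                row[m++] = next[v] - i + 1; /* relative cell of first occurrence */
        qsort(row, m, sizeof row[0], cmp_int);
        for (int j = 1; j <= m; j++)        /* row[j-1] = shortest length, j distinct */
            if (row[j - 1] < shortest_len[j])
                shortest_len[j] = row[j - 1];
    }
}
```

On the thread's 56-character example this yields the same minimal lengths as the full program (3, 7, 16, 23, 25 for 3, 7, 8, 9, 10 distinct values), at O(n * v log v) rather than O(n*n) work.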

Based on these considerations, the solution sets for each j are constructed by scanning each column
starting with j=3 and proceeding to j=v. For this purpose the array a actually has an additional
row, 0, to contain the subscript of the first cell of the first minimal subsequence for each j, and
an extra column, v+1, containing values greater than c to ensure that the test for inclusion can be
carried out for j=v. For each j, l(j) is initialized to c+1 and k (the previous element in the list
of minimal subsequences) to 0. The scan proceeds as follows: if a(i,j) is less than l(j) the current
subsequence is shorter that the shortest found so far so l(j) is updated and k reset to 0. If a(i,j)
is equal to l(j) the inclusion test is applied and if it passes the current subsequence is added to
the list. Once the entire column has been processed the list is terminated by setting a(k,j)=0.

It is now a simple matter to read off the solution subsequences. The program utilizes sorta to sort
the rows. call sorta(addr(first element),n,w,c) sorts an array of n elements of width w in place
according to the comparison function c. c(u,v) returns '1'b (bit(1) aligned) if the u-th element may
precede the v-th element, i.e., x(u)<=x(v).


Source:

%process mar(2,100) offset;
subsets: proc options(main) reorder;
dcl
(c,v,i,j,k,m,ar init((rank('a')-1))) bin fixed,
(x(c),a(0:c,v+1),l(v)) bin fixed ctl,
y(c) char(1) ctl,
sorta entry(ptr,bin fixed(31),bin fixed(31),
entry(bin fixed(31),bin fixed(31)) returns(bit(1) aligned)),
vfmt entry(bin float(53),bin fixed(15),bin fixed(15)) returns(char(50) var),
sysin file input,
sysprint file print,
(addr,max,min,rank) builtin;
get file(sysin) list(c,v);
put file(sysprint) edit('c: ',vfmt(c,10,0),', v: ',vfmt(v,10,0))(col(1),4 a);
allocate x,a,l,y;
get file(sysin) edit(y)(col(1),(c)a(1));
put file(sysprint) edit('Input: ',y)(col(1),a,(c)a(1));
do i=1 to c; x(i)=rank(y(i))-ar; end;
a(c,*)=max(c+1,v+2); a(c,x(c))=1;
do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;
do i=1 to c; call sorta(addr(a(i,1)),v,2,comp); end;
do j=3 to v; l(j)=c+1; k=0; m=j+1; a(0,m)=0;
do i=1 to c;
if a(i,j)<l(j) then do; l(j)=a(i,j); k=0; end;
if a(i,j)=l(j) then if a(i-1,m)~=m then if a(i,m)~=m then do; a(k,j),k=i; end;
end;
a(k,j)=0;
if a(0,j)>0 then do;
put file(sysprint) edit('Distinct: ',vfmt(j,10,0),', length: ',vfmt(l(j),10,0))(col(1),4 a);
i=a(0,j); k=0; do while(i~=0); k+=1;
put file(sysprint) edit(k,': (',vfmt(i,10,0),') ',(y(i+m) do m=0 to l(j)-1))
(col(1),f(10),3 a,(l(j))a(1));
i=a(i,j);
end;
end;
end;
comp: proc(u,v) returns(bit(1) aligned) reorder;
dcl (u,v) bin fixed(31);
return(a(i,u)<=a(i,v));
end comp;
end subsets;

Input:

56 10
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja

Output:

c: 56, v: 10
Input: aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
Distinct: 3, length: 3
1: (40) bef
2: (44) fgh
Distinct: 7, length: 7
1: (50) gihbfja
Distinct: 8, length: 16
1: (41) efffghggggihbfja
Distinct: 9, length: 23
1: (34) cbbbbbbefffghggggihbfja
Distinct: 10, length: 25
1: (32) dccbbbbbbefffghggggihbfja

Here is the output from another example:

c: 48, v: 10
Input: abcabcabcabcdefgabcabccdabcdeabcaaabbbcccdefghij
Distinct: 3, length: 3
1: (1) abc
2: (2) bca
3: (3) cab
4: (4) abc
5: (5) bca
6: (6) cab
7: (7) abc
8: (8) bca
9: (9) cab
10: (18) bca
11: (19) cab
12: (20) abc
13: (31) bca
Distinct: 4, length: 4
1: (23) cdab
2: (24) dabc
Distinct: 5, length: 5
1: (25) abcde
2: (26) bcdea
3: (27) cdeab
4: (28) deabc
Distinct: 7, length: 7
1: (10) abcdefg
2: (11) bcdefga
3: (12) cdefgab
4: (13) defgabc
Distinct: 8, length: 8
1: (41) cdefghij
Distinct: 9, length: 11
1: (38) bcccdefghij
Distinct: 10, length: 14
1: (35) abbbcccdefghij
James J. Weinkam
2010-05-18 17:56:36 UTC
Permalink
Sorry for the double posting.

When I sent the message the first time, a message from my outgoing mail server appeared saying
something to the effect that "this account is valid for email only; postings to newsgroups are ignored."

I looked on the newsgroup and sure enough the message was not there. I messed around a bit with the
copy in my sent folder and tried again. I did not get the message from the isp this time but the
message did not show up on the newsgroup either. I decided to give up and go to bed and try again in
the morning. By that time both messages had appeared.

Can anyone explain what was going on? I have never seen this behavior before.
Frank Kotler
2010-05-18 18:40:54 UTC
Permalink
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing mail
server appeared saying something to the effect that "this account is
valid for email only; postings to newsgroups are ignored."
Hi James,

The NG comp.lang.asm.x86 is moderated. What this means is that your NNTP
server, instead of posting the article as usual, sends it - by mail - to
the moderator for approval. I've never seen that message, but I suspect
that's where it's from. When the moderator (or the "moderator's
apprentice", me) approves the message, we re-post it to an NNTP server.
This time, bearing the highly arcane moderator's incantation, it gets
posted. In the case of a cross-posted message, this delays posting to
all groups, even the unmoderated ones. We try to keep on top of it, but
there is inevitably some delay.

I strongly suspected I was approving the same message twice, but would
rather do that than risk missing some minor edit. If anyone is terribly
bothered by the double-posting, my abject apologies on behalf of the
moderation team - and my personal advice: "get a life!" :)

Sorry 'bout that - we appreciate your posting!

Best,
Frank
James J. Weinkam
2010-05-20 02:17:08 UTC
Permalink
Post by Frank Kotler
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing
mail server appeared saying something to the effect that "this account
is valid for email only; postings to newsgroups are ignored."
Hi James,
The NG comp.lang.asm.x86 is moderated. What this means is that your NNTP
server, instead of posting the article as usual, sends it - by mail - to
the moderator for approval. I've never seen that message, but I suspect
that's where it's from. When the moderator (or the "moderator's
apprentice", me) approves the message, we re-post it to an NNTP server.
This time, bearing the highly arcane moderator's incantation, it gets
posted. In the case of a cross-posted message, this delays posting to
all groups, even the unmoderated ones. We try to keep on top of it, but
there is inevitably some delay.
I strongly suspected I was approving the same message twice, but would
rather do that than risk missing some minor edit. If anyone is terribly
bothered by the double-posting, my abject apologies on behalf of the
moderation team - and my personal advice: "get a life!" :)
Sorry 'bout that - we appreciate your posting!
Best,
Frank
Well, this just goes to show that you learn something new every day.

Looking at the subsequent posts in this thread, it strikes me that the process being used to deal
with messages posted to multiple news groups, some of which are moderated, is ill-conceived. The
outcome should be the same as if the message were posted separately to each group. This could easily
be accomplished by sending the message directly to each unmoderated group and directly to the
moderator of each moderated group. Why should there be any interaction? How is the ordering decided?
Just the order the poster happened to choose?

Anyway, thanks for the explanation. However, it still doesn't explain the pop-up I got when I first
posted the message. Surely, in many years of posting to news groups, I must have posted a message to
multiple groups, some of which were moderated, before. Nevertheless, I have never seen that message
from the mail server that the "account is only valid for email and newsgroups will be ignored."
Moreover, that isn't what eventually happened.
Nathan
2010-05-20 05:02:50 UTC
Permalink
Post by James J. Weinkam
Post by Frank Kotler
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing
mail server appeared saying something to the effect that "this account
is valid for email only; postings to newsgroups are ignored."
Hi James,
The NG comp.lang.asm.x86 is moderated. What this means is that your NNTP
server, instead of posting the article as usual, sends it - by mail - to
the moderator for approval. I've never seen that message, but I suspect
that's where it's from. When the moderator (or the "moderator's
apprentice", me) approves the message, we re-post it to an NNTP server.
This time, bearing the highly arcane moderator's incantation, it gets
posted. In the case of a cross-posted message, this delays posting to
all groups, even the unmoderated ones. We try to keep on top of it, but
there is inevitably some delay.
I strongly suspected I was approving the same message twice, but would
rather do that than risk missing some minor edit. If anyone is terribly
bothered by the double-posting, my abject apologies on behalf of the
moderation team - and my personal advice: "get a life!" :)
Sorry 'bout that - we appreciate your posting!
Best,
Frank
Well this just goes to show that you learn something new every day.
Looking at the subsequent posts in this thread, it strikes me that the process being used to deal
with messages posted to multiple news groups some of which are moderated is ill conceived. The
outcome should be the same as if the message were posted separately to each group. This could easily
be accomplished by sending the message directly to each unmoderated group and directly to the
moderator of each moderated group. Why should there be any interaction? How is the ordering decided?
Just the order the poster happened to choose?
Well, I am pretty sure these questions were asked when Usenet was
quite young. I imagine that policies and 'actual practice' have long
been not-quite-in-sync and surely both have evolved much over time.
If you are interested in the details, there are resources specifically
dedicated to these topics:

The Big-8 Management Board
http://www.big-8.org/dokuwiki/doku.php

Internet Systems Consortium
http://www.isc.org/community/reference

And some newsgroups where the above folks 'hang out' to post
references and answer questions:

news.admin.announce
news.announce.newsgroups
news.groups.questions
news.answers

The status of Usenet as a whole seems that it may not be in very good
health: http://news.slashdot.org/story/10/05/18/2342241/Duke-To-Shut-Down-Usenet-Server?art_pos=24

Maybe the day will come when we'll need to set up our own News
Servers?

Heck, maybe people would see the value of it if there was a Facebook
app that allowed them to read/post to Usenet?? I dunno.. maybe it'll
just follow the gopher? :(
Post by James J. Weinkam
Anyway, thanks for the explanation. However it still doesn't explain the pop up I got when I first
posted the message. Surely in many years of posting to news groups, I must have posted a message to
multiple groups before, some of which were moderated. Nevertheless, I have never seen that message
from the mail server that the "account is only valid for email and newsgroups will be ignored."
Moreover that isn't what eventually happened.
Probably just a quirk of your Newsreader -- it is the only item that
can actually produce a 'pop up' on your machine.

Nathan.
Shmuel (Seymour J.) Metz
2010-05-21 00:55:44 UTC
Permalink
Post by James J. Weinkam
Looking at the subsequent posts in this thread, it strikes me
that the process being used to deal with messages posted to
multiple news groups some of which are moderated is ill
conceived.
That's the result of taking a superficial look at a complex problem.
Post by James J. Weinkam
The outcome should be the same as if the message were posted
separately to each group.
That would be fundamentally wrong, for reasons that have been hashed
out ad infinitum.
--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to ***@library.lspace.org
Dr J R Stockton
2010-05-22 16:52:46 UTC
Permalink
Post by James J. Weinkam
Looking at the subsequent posts in this thread, it strikes me that the
process being used to deal with messages posted to multiple news groups
some of which are moderated is ill conceived.
No. It is well-defined and appropriate. All moderators should
understand it. All moderating software should implement it.
Post by James J. Weinkam
The outcome should be the same as if the message were posted
separately to each group.
No. If you want to do that :
(a) you are one or more of : foolish, ignorant, or inconsiderate,
(b) you can do it directly and easily enough.

You should have permission from SFU (commonly considered, IIRC, a
reputable organisation) to cause spam to be directed to that address?
--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
Grandson-Of-RFC1036 is released. RFC 5536 Netnews Article Format is a
subset of Internet Message Format which is described in RFC 5322. The
RFCs are read together to determine standard Netnews article format.
Nathan Baker
2010-05-23 05:06:58 UTC
Permalink
Post by Dr J R Stockton
Post by James J. Weinkam
Looking at the subsequent posts in this thread, it strikes me that the
process being used to deal with messages posted to multiple news groups
some of which are moderated is ill conceived.
No. It is well-defined and appropriate. All moderators should
understand it. All moderating software should implement it.
The word "should" implies a *desired* state. In practice, well...

For instance, one would expect all such software to pass-along the MIME
headers. When we were using PyModerator last year, we discovered that it
did not do this. So we patched it.
Post by Dr J R Stockton
Post by James J. Weinkam
The outcome should be the same as if the message were posted
separately to each group.
(a) you are one or more of : foolish, ignorant, or inconsiderate,
I'm pretty sure we don't do that here at CLAX. But, it is possible we might
not be following all of the traditional practice in other areas. Mainly
because our tree-house is still a bit in an "under construction" state.
However, we are none of the above 'colorful words' -- we are simply
pragmatic. :)

I am currently in the process of porting our moderation client software from
Linux to Windows (ASM code just "wants" to be OS agnostic!!), so I intend to
revisit all of the related "handbooks", RFCs, etc. during this time.
Post by Dr J R Stockton
You should have permission from SFU (commonly considered, IIRC, a
reputable organisation) to cause spam to be directed to that address?
A good point. But I bet that Mr. Weinkam wants to tell you that it "isn't
any of your business" and while doing so, he might drop the leading 'S' from
his organisation's acronym. :)

Nathan.
mecej4
2010-05-19 02:02:31 UTC
Permalink
Post by James J. Weinkam
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
...
I believe the program below does what you want and should be a couple
of orders of magnitude faster than the method you outlined (based on
your statement about how many "PL/I statements" are executed on an
input of length 2329 with 66 distinct values). For illustrative
purposes, the program uses the character representation you used in
your example for input and output, but the algorithm itself will work
for values of c up to 32765 and v up to 32764 if you have enough
memory and patience.

The program uses four arrays: y(c), a character array into which the
input is read; x(c), an integer array giving the encoded input with
a=1, b=2, etc; a(0:c,v+1), described below; and l(v), with l(j) the
length of the shortest subsequence with j distinct values.

The basic idea is to build an array, a, with a row for each cell and a
column for each value. The i,j element of the array gives the relative
cell number in the input array, starting from cell i, of the first
occurrence of value j, if any; otherwise a(i,j) is greater than
max(c,v+1). The array is built starting from the last row, c, and
working back to row 1. Row c is first set to max(c+1,v+2). Then
a(c,x(c)) is set to 1. Working backward for i=c-1 to 1, each element
of row i is set to the minimum of 32766 and the corresponding element
of row i+1 plus 1, and then element a(i,x(i)) is set to 1. Once this
construction is complete, it is evident that there is one and only one
instance of the integer 1 in each row, and all values in each row less
than c+1 are unique.

Next each row is sorted. After sorting, the a(i,j) element contains
the length of the shortest subsequence starting at position i that
contains j distinct values, unless there is no such subsequence, in
which case a(i,j) is greater than c. It is also evident that if
a(i,j)=j then a(i,k)=k for 1<=k<=j. Now if a(i,j)=j the subsequence of
length j starting at position i has j distinct values, but it should
be excluded from the solution set for j distinct values if it is a
proper subsequence of a longer subsequence of distinct values. This
will be so iff a(i,j+1)=j+1 or a(i-1,j+1)=j+1.

Based on these considerations, the solution sets for each j are
constructed by scanning each column, starting with j=3 and proceeding
to j=v. For this purpose the array a actually has an additional row,
0, to contain the subscript of the first cell of the first minimal
subsequence for each j, and an extra column, v+1, containing values
greater than c to ensure that the test for inclusion can be carried
out for j=v. For each j, l(j) is initialized to c+1 and k (the
previous element in the list of minimal subsequences) to 0. The scan
proceeds as follows: if a(i,j) is less than l(j), the current
subsequence is shorter than the shortest found so far, so l(j) is
updated and k is reset to 0. If a(i,j) is equal to l(j), the inclusion
test is applied, and if it passes, the current subsequence is added to
the list. Once the entire column has been processed, the list is
terminated by setting a(k,j)=0.

It is now a simple matter to read off the solution subsequences. The
program utilizes sorta to sort the rows. call sorta(addr(first
element),n,w,c) sorts an array of n elements of width w in place
according to the comparison function c. c(u,v) returns '1'b (bit(1)
aligned) if the u-th element may precede the v-th element, i.e.,
x(u)<=x(v).
%process mar(2,100) offset;
subsets: proc options(main) reorder;
dcl
(c,v,i,j,k,m,ar init((rank('a')-1))) bin fixed,
(x(c),a(0:c,v+1),l(v)) bin fixed ctl,
y(c) char(1) ctl,
sorta entry(ptr,bin fixed(31),bin fixed(31),
entry(bin fixed(31),bin fixed(31)) returns(bit(1) aligned)),
vfmt entry(bin float(53),bin fixed(15),bin fixed(15)) returns(char(50) var),
sysin file input,
sysprint file print,
(addr,max,min,rank) builtin;
get file(sysin) list(c,v);
put file(sysprint) edit('c: ',vfmt(c,10,0),', v: ',vfmt(v,10,0))(col(1),4 a);
allocate x,a,l,y;
get file(sysin) edit(y)(col(1),(c)a(1));
put file(sysprint) edit('Input: ',y)(col(1),a,(c)a(1));
do i=1 to c; x(i)=rank(y(i))-ar; end;
a(c,*)=max(c+1,v+2); a(c,x(c))=1;
do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;
do i=1 to c; call sorta(addr(a(i,1)),v,2,comp); end;
do j=3 to v; l(j)=c+1; k=0; m=j+1; a(0,m)=0;
do i=1 to c;
if a(i,j)<l(j) then do; l(j)=a(i,j); k=0; end;
if a(i,j)=l(j) then if a(i-1,m)~=m then if a(i,m)~=m then do;
a(k,j),k=i; end;
end;
a(k,j)=0;
if a(0,j)>0 then do;
put file(sysprint) edit('Distinct: ',vfmt(j,10,0),', length: ',vfmt(l(j),10,0))(col(1),4 a);
i=a(0,j); k=0; do while(i~=0); k+=1;
put file(sysprint) edit(k,': (',vfmt(i,10,0),') ',(y(i+m) do m=0 to l(j)-1))
(col(1),f(10),3 a,(l(j))a(1));
i=a(i,j);
end;
end;
end;
comp: proc(u,v) returns(bit(1) aligned) reorder;
dcl (u,v) bin fixed(31);
return(a(i,u)<=a(i,v));
end comp;
end subsets;
56 10
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
c: 56, v: 10
Input: aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
Distinct: 3, length: 3
1: (40) bef
2: (44) fgh
Distinct: 7, length: 7
1: (50) gihbfja
Distinct: 8, length: 16
1: (41) efffghggggihbfja
Distinct: 9, length: 23
1: (34) cbbbbbbefffghggggihbfja
Distinct: 10, length: 25
1: (32) dccbbbbbbefffghggggihbfja
c: 48, v: 10
Input: abcabcabcabcdefgabcabccdabcdeabcaaabbbcccdefghij
Distinct: 3, length: 3
1: (1) abc
2: (2) bca
3: (3) cab
4: (4) abc
5: (5) bca
6: (6) cab
7: (7) abc
8: (8) bca
9: (9) cab
10: (18) bca
11: (19) cab
12: (20) abc
13: (31) bca
Distinct: 4, length: 4
1: (23) cdab
2: (24) dabc
Distinct: 5, length: 5
1: (25) abcde
2: (26) bcdea
3: (27) cdeab
4: (28) deabc
Distinct: 7, length: 7
1: (10) abcdefg
2: (11) bcdefga
3: (12) cdefgab
4: (13) defgabc
Distinct: 8, length: 8
1: (41) cdefghij
Distinct: 9, length: 11
1: (38) bcccdefghij
Distinct: 10, length: 14
1: (35) abbbcccdefghij
Professor Weinkam's elegant algorithm solves the problem using c.v.lg v
operations (his notation), i.e., N.T.lg T (RAH Prins' notation).
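As a cross-check, the row-table algorithm Professor Weinkam describes
can be sketched in Python (an illustrative translation of his prose,
not his PL/I; the function and variable names are mine). It builds the
backward distance rows, sorts each row, and applies his
a(i,j+1)/a(i-1,j+1) exclusion test; on the thread's 56-character
example it reproduces the posted listing:

```python
def minimal_windows(x, v):
    """x: list of values in 1..v; positions reported 1-based, as in
    the thread. Returns {j: (length, [start positions])} for j=3..v."""
    c = len(x)
    INF = c + v + 2                        # plays the role of max(c+1,v+2)
    cur = [INF] * (v + 1)                  # cur[j]: distance from i to next j
    rows = [None] * (c + 1)
    for i in range(c, 0, -1):              # build backwards, as in the PL/I
        for j in range(1, v + 1):
            cur[j] = min(INF, cur[j] + 1)  # a(i,j) = min(INF, a(i+1,j)+1)
        cur[x[i - 1]] = 1                  # a(i,x(i)) = 1
        rows[i] = sorted(cur[1:]) + [INF]  # pad so column j+1 exists for j=v
    rows[0] = [INF] * (v + 1)              # extra row 0 for the i-1 test
    result = {}
    for j in range(3, v + 1):
        best, starts = INF, []
        for i in range(1, c + 1):
            d = rows[i][j - 1]             # shortest j-distinct window at i
            if d < best:
                best, starts = d, []
            # exclusion: drop windows inside a longer all-distinct run
            if d == best and rows[i][j] != j + 1 and rows[i - 1][j] != j + 1:
                starts.append(i)
        if starts and best <= c:
            result[j] = (best, starts)
    return result
```

Running it on "aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja"
with v=10 yields the same six solution sets as the sample output, and
nothing for 4, 5, or 6 distinct values.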

In the PL/I code that he gave, Professor Weinkam used two procedures
that are not included in IBM's VAPLI: SORTA and VFMT. However, it is
easy to supply the missing functions. In place of SORTA one may use
quicksort or mergesort, for which PL/I code is available at
http://rosettacode.org/wiki/Sorting_algorithms/Quicksort#PL.2FI and
http://rosettacode.org/wiki/Sort_an_integer_array#PL.2FI. For VFMT, the
following will do, although it allocates memory that is freed only by
exiting the program:

vfmt: proc(v) returns(char(10) var);
dcl (l,v) fixed bin;
dcl str char(10) var ctl;
allocate str;
if v < 10 then l=1;
else if v < 100 then l=2;
else if v < 1000 then l=3;
else l=4;
put string (str) edit (v) (F(l));
return (str);
end vfmt;

-- mecej4
James J. Weinkam
2010-05-20 22:18:30 UTC
Permalink
....
Post by mecej4
In the PL/I code that he gave, Professor Weinkam used two procedures
that are not included in IBM's VAPLI: SORTA and VFMT. However, it is
easy to supply the missing functions. In place of SORTA one may use
quicksort or mergesort, for which PL/I code is available at
http://rosettacode.org/wiki/Sorting_algorithms/Quicksort#PL.2FI and
http://rosettacode.org/wiki/Sort_an_integer_array#PL.2FI. For VFMT, the
following will do, although it allocates memory that is freed only by
vfmt: proc(v) returns(char(10) var);
dcl (l,v) fixed bin;
dcl str char(10) var ctl;
allocate str;
if v < 10 then l=1;
else if v < 100 then l=2;
else if v < 1000 then l=3;
else l=4;
put string (str) edit (v) (F(l));
return (str);
end vfmt;
Since the OP indicated he could handle programming details, I didn't think it necessary to include
the sort subroutine or the formatting routine, which is purely cosmetic in any case. I will post the
routines upon request.
Post by mecej4
-- mecej4
Dick Wesseling
2010-05-18 18:58:43 UTC
Permalink
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing mail server appeared saying
something to the effect that "this account is valid for email only; postings to newsgroups are ignored."
I looked on the newsgroup and sure enough the message was not there.
Which newsgroup? I am reading comp.lang.asm.x86 which is moderated,
so it is not unusual for a message to not show up immediately. I don't know
if the other two groups - comp.lang.pascal.borland and comp.lang.pl1 - are
moderated. What happens when posting to multiple groups when one is moderated
and the others are not?
Post by James J. Weinkam
I messed around a bit with the
copy in my sent folder and tried again. I did not get the message from the isp this time but the
message did not show up on the newsgroup either. I decided to give up and go to bed and try again in
the morning. By that time both messages had appeared.
Can anyone explain what was going on? I have never seen this behavior before.
I noticed some strange things too.
My message <4bf1cbef$0$22934$***@news6.xs4all.nl> references three
other messages:

References: <***@mid.individual.net>
<4bf0ccd0$0$22945$***@news6.xs4all.nl>
<***@mid.individual.net>

But the message that appeared in clax86 contained only the first
reference. Did the moderation software remove the other two? Never
seen that before.
glen herrmannsfeldt
2010-05-18 19:09:38 UTC
Permalink
In comp.lang.pl1 Dick Wesseling <***@nospicedham.securityaudit.val.newsbank.net> wrote:
(snip)
Post by Dick Wesseling
Which newsgroup? I am reading comp.lang.asm.x86 which is moderated,
(snip)
Post by Dick Wesseling
What happens when posting to multiple groups when one is moderated
and the others are not?
Last I knew, it wasn't posted to any until the moderator approved it.

-- glen
Frank Kotler
2010-05-18 21:35:30 UTC
Permalink
Post by Dick Wesseling
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing mail server appeared saying
something to the effect that "this account is valid for email only; postings to newsgroups are ignored."
I looked on the newsgroup and sure enough the message was not there.
Which newsgroup? I am reading comp.lang.asm.x86 which is moderated,
so it is not unusual for a message to not show up immediately. I don't know
if the other two groups - comp.lang.pascal.borland and comp.lang.pl1 - are
moderated. What happens when posting to multiple groups when one is moderated
and the others are not?
All postings are delayed (I'm not sure why, exactly...) A more(?)
interesting question is, "What if more than one of the groups is
moderated?" I understand that it is considered "impolite" to approve a
message to another group, but I *think* if it's approved for one, it's
approved for all. It is not obvious whether other groups are moderated
or not. Apologies to any other moderators whose toes are being stepped on!
Post by Dick Wesseling
Post by James J. Weinkam
copy in my sent folder and tried again. I did not get the message from the isp this time but the
message did not show up on the newsgroup either. I decided to give up and go to bed and try again in
the morning. By that time both messages had appeared.
Can anyone explain what was going on? I have never seen this behavior before.
I noticed some strange things too.
But the message that appeared in clax86 contained only the first
reference. Did the moderation software remove the other two? Never
seen that before.
Everything, as received by the "submission address" is "supposed" to be
dumped to "clax.log". As shown there, your header looks like...

+OK 3319 octets
Return-path: <***@news1.news.xs4all.nl>
Envelope-to: clax86-***@inspiretomorrow.net
Delivery-date: Sun, 16 May 2010 23:58:00 -0500
Received: from moderators.individual.net ([130.133.4.7])
by srv16.hosting24.com with esmtp (Exim 4.69)
(envelope-from <***@news1.news.xs4all.nl>)
id 1ODsOu-0000DM-Az
for clax86-***@inspiretomorrow.net; Sun, 16 May 2010 23:58:00 -0500
Received: from news1.news.xs4all.nl ([194.109.133.242])
by moderators.individual.net (Exim 4.69)
for comp-lang-asm-***@moderators.isc.org with esmtp
(envelope-from <***@news1.news.xs4all.nl>)
id <1ODsOo-00024O-57>; Mon, 17 May 2010 06:57:59 +0200
Received: (from ***@localhost)
by news1.news.xs4all.nl (8.11.6/8.11.6) id o4H4vqL01125;
Mon, 17 May 2010 06:57:52 +0200 (CEST)
(envelope-from news)
To: comp-lang-asm-***@moderators.isc.org
Mime-Version: 1.0
X-Newsreader: knews 1.0c.0
References: <***@mid.individual.net>
From: ***@securityaudit.val.newsbank.net (Dick Wesseling)
Subject: Re: Optimization problem
Newsgroups: comp.lang.pascal.borland,comp.lang.asm.x86,comp.lang.pl1
Content-Type: text/plain; charset=us-ascii
Date: 17 May 2010 04:57:52 GMT
Lines: 40
Message-ID: <4bf0ccd0$0$22945$***@news6.xs4all.nl>
NNTP-Posting-Host: 2001:888:11d7:0:210:a7ff:fe0a:d33
X-Trace: 1274072272 news6.xs4all.nl 22945
[2001:888:11d7:0:210:a7ff:fe0a:d33]:40744
X-Complaints-To: ***@xs4all.nl

As you can see, only one reference showing. We do not intentionally
remove any "References:" lines. We *do* remove "Received:" lines, and a
couple others which, some people feel, reveal too much information about
the poster (sorry about "outing" you). Possible we're not getting this
right... it is homemade software...

Again, sorry for the confusion. I suspect "moderation issues" are not of
interest to the other groups, so further discussion should probably go
in another thread...

Best,
Frank
Dr J R Stockton
2010-05-19 17:47:09 UTC
Permalink
Post by Frank Kotler
All postings are delayed (I'm not sure why, exactly...) A more(?)
interesting question is, "What if more than one of the groups is
moderated?" I understand that it is considered "impolite" to approve a
message to another group, but I *think* if it's approved for one, it's
approved for all. It is not obvious whether other groups are moderated
or not. Apologies to any other moderators whose toes are being stepped on!
The initial submission should be mailed to the moderation address of the
first moderated newsgroup in the Newsgroups line. Each moderator is
expected either to reject it or to mail it to the moderation address of
the next listed moderated newsgroup, except that the moderator address
of the last listed moderated newsgroup either rejects or posts.

A moderator is expected to consider suitability only for his own group,
but might be allowed to reject to protect his group (the Newsgroups
header might include a troll-haunt such as
alt.palin.a.nutter.for.half.a.century).


Moderation team members should know how the system is meant to work,
even if their software actually handles the matter.
--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Proper <= 4-line sig. separator as above, a line exactly "-- " (RFCs 5536/7)
Do not Mail News to me. Before a reply, quote with ">" or "> " (RFCs 5536/7)
Nathan
2010-05-19 03:36:54 UTC
Permalink
Post by Dick Wesseling
I noticed some strange things too.
But the message that appeared in clax86 contained only the first
reference. Did the moderation software remove the other two? Never
seen that before.
Wow! Thanks for helping to debug our "paper mache" moderation
software. Who writes that kind of thing in ASM? We must be 'high' or
crazy or something... {shifty eyes, pretends bewilderment} ...
yeah, we're nuts. Bring on the straitjacket! :)

Nathan.
http://clax.inspiretomorrow.net/index.php
Dr J R Stockton
2010-05-18 17:13:08 UTC
Permalink
In comp.lang.pascal.borland message <***@mid.individual.net>,
Sun, 16 May 2010 18:28:26, Robert AH Prins <***@prino.org> posted:

(to clpb only)
Post by Robert AH Prins
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
I can generate a circular list of 532 items of values 0 to 34 (or 22 to
56), to which such an algorithm might be applied. A related list has
5,700,000 items of the same values. There is a constraint which means
that the gaps between consecutive instances of any value can only be of
certain sizes.
--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/&c., FAQqy topics & links;
<URL:http://www.merlyn.demon.co.uk/clpb-faq.txt> RAH Prins : c.l.p.b mFAQ;
<URL:ftp://garbo.uwasa.fi/pc/link/tsfaqp.zip> Timo Salmi's Turbo Pascal FAQ.
Dr J R Stockton
2010-05-21 21:09:59 UTC
Permalink
In comp.lang.pascal.borland message <***@invalid.uk.co.demo
n.merlyn.invalid>, Tue, 18 May 2010 18:13:08, Dr J R Stockton
Post by Dr J R Stockton
I can generate a circular list of 532 items of values 0 to 34 (or 22 to
56), to which such an algorithm might be applied. A related list has
5,700,000 items of the same values. There is a constraint which means
that the gaps between consecutive instances of any value can only be of
certain sizes.
That appears to be working, in JavaScript, at
<URL:http://www.merlyn.demon.co.uk/estrcons.htm#ESP>; there is nothing
specifically JavaScript about the code.

A variable-length window, initially of zero size, slides along the
array. When its front moves one step, the character is read. A counter
array element, indexed by the character, is incremented, and if it was
zero a count of different characters is incremented. When the window
tail moves one step, the reverse happens.

When the count of characters is insufficient, the front moves, otherwise
the rear moves. When the count ceases to be sufficient, the previous
window contained what was wanted, and both the first and last position
of that previous window were needed.

When that window is smaller than the previous best, it is reported.

Actually, to save thinking about an array of size 5700000, there is no
array; instead, a sufficiently fast function is called to determine the
character. And the characters are really numbers 22 to 56. The full
data set can take about a minute to process.

Instead of calling the function twice, one could use a circular array
(i.e., indexed by Y mod 1024) of adequate but modest size to store what
is at any time in the window.

A subsequent process checks the reported results.

That might have some bearing on the question asked.
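The variable-length window described above might be rendered in Python
roughly as follows (a sketch of the prose, not the JavaScript at the
URL; unlike the original, which reports every improvement, this sketch
returns only the single smallest window for one target count k):

```python
from collections import defaultdict

def shortest_window_with_k_distinct(seq, k):
    """Front advances while the window holds fewer than k distinct
    values; the tail advances otherwise. Returns (length, start, end)
    of the smallest window with at least k distinct values (0-based,
    end exclusive), or None if no window qualifies."""
    counts = defaultdict(int)   # per-character counter array
    distinct = 0                # count of different characters in window
    best = None
    tail = 0
    for front, ch in enumerate(seq):
        counts[ch] += 1
        if counts[ch] == 1:     # counter was zero: one more distinct char
            distinct += 1
        while distinct >= k:    # window sufficient: move the tail
            size = front + 1 - tail
            if best is None or size < best[0]:
                best = (size, tail, front + 1)
            counts[seq[tail]] -= 1
            if counts[seq[tail]] == 0:
                distinct -= 1
            tail += 1
    return best
```

On Prins's 56-character example this finds the 7-in-7 window "gihbfja"
at 0-based offset 49 (1-based position 50).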
--
(c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05.
Web <URL:http://www.merlyn.demon.co.uk/> - w. FAQish topics, links, acronyms
PAS EXE etc : <URL:http://www.merlyn.demon.co.uk/programs/> - see 00index.htm
Dates - miscdate.htm estrdate.htm js-dates.htm pas-time.htm critdate.htm etc.
io_x
2010-05-22 10:36:48 UTC
Permalink
Post by Robert AH Prins
Hi all,
For an answer, you use OT code in the asm group.
As for the problem, from the little I understand,
it seems to me to be about DNA string searching
or string editing; I'm against all this.
Bernhard Schornak
2010-05-23 17:29:56 UTC
Permalink
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose the array contains only 56 elements, in this case
with, for clarity, values a..j:
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
My AD 1996 program used to slide a window that started with three
elements (plus a sentinel on either side) over the big array and spit
out the position of the first element if it found three distinct
elements in the window *and* the two sentinels also contained any of the
values in the window. The program included some fairly minor
optimizations, such as immediately sliding the window to the last of a
series of equal values, in the example above, the three position wide
window would start by covering pos 6..8, and next it would move to
11..13, etc. If at any moment the program would find that all elements
in the window *and the sentinel on the right* would be distinct, the
window size would be increased by one and the scan would continue.
Taking the above example, a window with an aperture of four elements
would start its slide at position 6..9. Once it reached position 49,
with "gihb" visible and "g" and "f" in the two sentinels, the window
would be widened to five characters as the "f" is distinct from the four
"visible" characters, with a new sentinel of "j". Given that this is
again distinct from the now five characters in the window, the process
would repeat itself, eventually resulting in the string of seven
distinct characters starting at position 50.
Once the string of seven distinct characters has been found, the window
is widened to *nine* (if there had been a string of 8-in-8 it would have
been found by the previous slide!) characters and the process restarts
at position 6..14, but in the end it fails to find a series of eight
distinct characters and so the process is repeated with windows of 10,
11, 12, 13, 14, 15 and finally 16 characters, when a string of eight
distinct characters is finally found.
The problem is, for relatively low values of N and C, the process may
not be overly efficient, but it works. However, once N and C increase,
(N is theoretically unbounded, C has an upper limit of around 210), the
process becomes horribly inefficient: using the old IBM V2.3.0 OS PL/I
compiler, which had a statement count facility, this method required the
execution of 556,379,518 PL/I statements (and a number of actual machine
instructions that is at least one order of magnitude greater) for the
current values of N(2329) and C(66).
As for some figures on required restarts, the final series of 66 is
found in a sub-array with a length of 1950 elements and with a series of
65 in a sub-array of only 1889 elements, this means that the slide had
to be restarted 61 times! Even more restarts, 312(!), are needed when
going from 61 to 62 elements.
In 1998 I posted the problem to comp.lang.pascal.borland, the post
should be on Google, but Google refuses to show more than the first
9,960 (out of 28,517) posts in that group.
Three people claimed that there was a much faster way, and two of them
backed this up with programs, Brent Beach and Paul Green, with Paul's
solution being the fastest by a fair margin. The results of his code
matched my output, I replaced my code with his code, and that seemed to
be the end of the story...
Until a few months back...
... when I started converting the original Pascal code into assembler.
While doing so I decided that it would be interesting to use this
procedure in another part of the code. However, the results were not
what I expected: result rows were missing.
If you want to do this in assembler, you might
consider using an alternative approach:

1. Allocate 4,352 byte of memory.
2. Scan through the input:

EBP = address input
ESI = address input
EDI = address array (1st byte)
ECX = input size

Store EDI, ESI and EBP on stack or another lo-
cation for reloading. The array must be zeroed
before it is used (2,048 byte).

...
add EBP,ECX
0:
dec EBP
movzx EAX,BYTE 00[ESI]
movzx EBX,BYTE 00[EBP]
cmp DWORD 00[EDI + EAX * 8],0
jne 1f
mov DWORD 00[EDI + EAX * 8],ESI
1:
cmp DWORD 04[EDI + EBX * 8],0
jne 2f
mov DWORD 04[EDI + EBX * 8],EBP
2:
inc ESI
dec ECX
jne 0b
...

Time: ~131,072 clocks or better.

This one-pass function stores the addresses of
the 1st and last occurrence of each char. The
array is organised as follows:

0000 1st_00, last_00, 1st_01, last_01
0010 1st_02, last_02, 1st_03, last_03
...
07F0 1st_FE, last_FE, 1st_FF, last_FF

Array entries with a value of zero tell us the
corresponding character was not found. If both
entries are equal, the corresponding character
was found only once.
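In plain C, that one-pass first/last scan
might look as follows (a sketch only: it
records array indices and uses -1 as the
'not found' marker, where the code above
stores addresses and uses 0):

```c
/* One pass records, for every byte value, the index of its first and
   last occurrence (-1 = value not present). The forward index i and
   the backward index j walk the input simultaneously, mirroring the
   ESI/EBP pair in the register loop above. */
static void first_last(const unsigned char *data, int n,
                       int first[256], int last[256])
{
    for (int v = 0; v < 256; v++)
        first[v] = last[v] = -1;

    for (int i = 0, j = n - 1; i < n; i++, j--) {
        if (first[data[i]] < 0)   /* forward scan: keep first hit   */
            first[data[i]] = i;
        if (last[data[j]] < 0)    /* backward scan: first hit seen  */
            last[data[j]] = j;    /* is the last occurrence         */
    }
}
```

If first[v] == last[v] and both are
non-negative, v occurs exactly once,
matching the equal-entries rule above.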

Next, sort out all non-existing characters and
create three work tables:

EDI = address array (1st byte)
EDI + 0x0800 + (n * 4) = 1st occurrence
EDI + 0x0C00 + (n * 4) = last occurrence
EDI + 0x1000 + (n * 1) = associated char

...
xor EAX,EAX
mov EBX,254
0:
mov ESI,DWORD 0x00[EDI + EBX * 8]
mov EBP,DWORD 0x04[EDI + EBX * 8]
mov ECX,DWORD 0x08[EDI + EBX * 8]
mov EDX,DWORD 0x0C[EDI + EBX * 8]
inc EBX # BL = char of the 0x08/0x0C pair
test ECX,ECX
je 1f
mov DWORD 0x0800[EDI + EAX * 4],ECX
mov DWORD 0x0C00[EDI + EAX * 4],EDX
mov BYTE 0x1000[EDI + EAX],BL
inc EAX
1:
dec EBX # BL = char of the 0x00/0x04 pair
test ESI,ESI
je 2f
mov DWORD 0x0800[EDI + EAX * 4],ESI
mov DWORD 0x0C00[EDI + EAX * 4],EBP
mov BYTE 0x1000[EDI + EAX],BL
inc EAX
2:
sub EBX,2 # next pair of characters
jns 0b
...

EAX holds the number of found chars on exit.

Time: ~5,120 clocks or better.

Tables 1 and 2 contain lists with addresses of
found characters (1st and last), table 3 holds
the associated characters in descending order.
Both addresses and the associated char are
accessed via indices to get address/char pairs
for the output. If you move any address to
another index (e.g. when sorting), you *have*
*to* move the other address and the associated
character to the same index. Otherwise, the
relationship between them will be corrupted!

Searching non-unique patterns is done by
comparing addresses in both tables. Every char
is included at least once in the area between
its 1st and last address, even if they are
equal. A character included in *any*
non-unique pattern must 'sit' at an address
equal to or above the currently processed
character. A found pattern ends at the highest
last address.

I do not know which sequences are of interest,
so I cannot provide code for that. If you look
for sequences m...n, the corresponding
substring starts at m(first) and ends at
n(last). Only a copy operation (from m(first)
up to n(last)) is required to retrieve
non-unique substrings for a print-out (the
address calculation is trivial).

To search for unique patterns - all characters
differ from each other - the entire input must
be scanned one time:

EBP = address input
ESI = input_size

...
0:
xor EDX,EDX
mov AL,BYTE 00[EBP + EDX]
inc EDX
1:
mov BL,BYTE 0x00[EBP + EDX]
mov CL,BYTE 0x01[EBP + EDX]
cmp AL,BL
je 4f
inc EDX
cmp BL,CL
je 4f
cmp AL,CL
je 4f
inc EDX
cmp EDX,ESI
jae 4f
push EDX
sub EDX,2
je 3f
2:
cmp BL,BYTE 0x00[EBP + EDX]
je 3f
cmp CL,BYTE 0x00[EBP + EDX]
je 3f
dec EDX
jne 2b
3:
mov EAX,EDX
pop EDX
test EAX,EAX
je 1b
4:
cmp EDX,3
jb 5f
... # store EBP (address 1st char)
... # store EDX as pattern length
... # (store EDX chars as found pattern)
5:
add EBP,EDX
sub ESI,EDX
jbe done
jmp 0b

done:
...

With the largest patterns (255 unique chars in
a row), about 3,000,000 clock cycles are quite
realistic. The smaller the found patterns, the
less time is spent - with any pattern below 32
chars, it probably takes less than half a
million clocks (as a rough estimate). A char
that has passed the inner loop is never
touched again. If you search for one specific
pattern, time can be reduced further (by using
a '1st' address from the table and subtracting
the skipped bytes from ESI, decreasing overall
iterations).

The given clock counts are for 4,096-byte
arrays (they grow linearly with increasing
size, except for the unique pattern search).

Clock counts are for AMD Athlon, Family 8+. In
64 bit code, execution times can be reduced to
about 60...75 percent of the given ones. Eight
additional registers are snake oil for complex
functions... ;)


Greetings from Augsburg

Bernhard Schornak
Dick Wesseling
2010-05-23 21:06:01 UTC
Permalink
Post by Bernhard Schornak
If you want to do this in assembler, you might
1. Allocate 4,352 byte of memory.
[snip]

I haven't studied your table construction and search algorithm
yet,
Post by Bernhard Schornak
Searching non-unique patterns is done via com-
paring addresses in both tables. Every char is
included at least once in the area between its
1st and last address, even if they're equal. A
character included in *any* non-unique pattern
must 'sit' at an address equal to or above the
currently processed character. A found pattern
ends at the highest last address.
I do not know which sequences are of interest,
so I cannot provide code for that. If you look
for sequences m...n, a corresponding substring
starts at m(first) and ends at n(last). Only a
copy operation (from n(start) up to m(end)) is
required to retrieve non-unique substrings for
a print-out (address calculation is trivial).
To search for unique patterns - all characters
differ from each other - the entire input must
EBP = address input
ESI = input_size
...
0:
xor EDX,EDX
mov AL,BYTE 00[EBP + EDX]
inc EDX
1:
mov BL,BYTE 0x00[EBP + EDX]
mov CL,BYTE 0x01[EBP + EDX]
cmp AL,BL
je 4f
inc EDX
cmp BL,CL
je 4f
cmp AL,CL
je 4f
inc EDX
cmp EDX,ESI
jae 4f
push EDX
sub EDX,2
je 3f
2:
cmp BL,BYTE 0x00[EBP + EDX]
je 3f
cmp CL,BYTE 0x00[EBP + EDX]
je 3f
dec EDX
jne 2b
3:
mov EAX,EDX
pop EDX
test EAX,EAX
je 1b
When label 1 is reached from 0, AL contains the first input
character. When label 1 is reached via "je 1b", AL is always
zero.
Bernhard Schornak
2010-05-24 03:01:27 UTC
Permalink
Dick Wesseling wrote:

<snip>
<snip>
Post by Dick Wesseling
Post by Bernhard Schornak
EBP = address input
ESI = input_size
...
0:
xor EDX,EDX
mov AL,BYTE 00[EBP + EDX]
inc EDX
1:
mov BL,BYTE 0x00[EBP + EDX]
mov CL,BYTE 0x01[EBP + EDX]
cmp AL,BL
je 4f
inc EDX
cmp BL,CL
je 4f
cmp AL,CL
je 4f
inc EDX
cmp EDX,ESI
jae 4f
push EDX
sub EDX,2
je 3f
mov AL,CL # AL = char at -1[EBP + EDX]
......... # for the next iteration
Post by Dick Wesseling
Post by Bernhard Schornak
2:
cmp BL,BYTE 0x00[EBP + EDX]
je 3f
cmp CL,BYTE 0x00[EBP + EDX]
je 3f
dec EDX
jne 2b
3:
pop EDX
test ECX,ECX
Post by Dick Wesseling
Post by Bernhard Schornak
je 1b
When label 1 is reached from 0 AL contains the first input
character, When label 1 is reached via "je 1b" AL is always
zero.
Sorry! (I optimised the most important instruction out
of my code.) AL must contain the last character in the
already processed pattern, of course.

The table stuff is not as complicated as it
seems. All substrings can be determined with
the addresses stored in it (no extra scans
required). It uses a fast method, scanning
through an input array in both directions
simultaneously. Getting the first and last
position of each char in a single pass is
(probably) faster than a repeated scan
through the entire array - the addresses are
sufficient to calculate all substrings rather
than to search for them.


Greetings from Augsburg

Bernhard Schornak
Branimir Maksimovic
2010-05-24 20:12:29 UTC
Permalink
On Sun, 16 May 2010 18:28:26 +0000
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
That depends on whether you are using a hyperthreaded CPU (the cheaper
variant) or a real multicore.

Greets!
--
http://maxa.homedns.org/

Sometimes online sometimes not

Everyone is "allowed" to be an idiot and a
creep, but only some choose to be,
Robert AH Prins
2010-05-27 15:09:59 UTC
Permalink
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
Follow-up to the very first post, with compilable source extracted from
the current Pascal program:

{
This was Paul Green's literal comment to the code:

The method I used was as follows. As I read the data file, I keep a
record of the last occurrence of each string, in a list sorted by the
line numbers of the occurrences. For each line, I read the string and
find the last occurrence. Then I check each possible series ending on
the penultimate line. Then I update the list and continue. Now let's see
if I can explain this.

The reason this works is because the shortest series, as you describe
it, would both start and end with a string that occurs only once in
the series. If you have a sequence like C A B C D E for example, C is
in it twice; you can shorten the sequence by dropping the C. So that
means you only have to check the length of series that begin with the
last previous occurrence of some string. And we have a list of those
lines. In order to check that the string previous to that is in the
series, we check the next last occurrence to make sure it is at least
two lines before the first string in the series. In order to make sure
the current line is in the series, you skip series that start after
the last occurrence of the current line, because that string must be
included in the series. Finally, I put in one extra test. You don't
have to check a series that extends past the penultimate occurrence of
the penultimate line, since that means the last element of the series is
in the series twice. Well that's clear as mud. Does it sound like the
right method though?

My remarks:

Paul's code was fitted into the existing program, which used a linked
list due to the unknown number of records being read in. To simulate
the reading in, an array is allocated and that is later used in the
scanner. Going backwards through the array turned out to be necessary
to get the output in the correct order. Also, his original routine
expected four-character strings; for speed these are cast into
longints. The real program uses longints, and only at the stage where
the data is printed does it use them as an index into the array with
the 4-character strings.

Below are the two sets of data that fail to return the 3-in-3 strings.
I've been tearing my hair out trying to find out why, but without any
result; it's probably a case of Edgar Allan Poe's "The Purloined
Letter", with a solution so obvious that one misses it by staring at
the problem too long. (Wishful thinking?)

The two input files that fail are:

A 61 1
A 61 2
B 61 3
C 61 4
D 61 5
E 61 6
F 61 7
E 61 8
G 61 9
G 61 10
G 61 11
G 61 12
G 61 13
G 61 14
G 61 15
G 61 16
G 61 17
G 61 18
H 61 19
I 61 20
C 61 21
D 61 22
A 61 23
A 61 24
A 61 25
A 61 26
A 61 27

Result:

| 6 | 6 | 2 | A B C D E F |
| | | 18 | G H I C D A |
| 7 | 8 | 2 | A B C D E F G |
| 8 | 17 | 7 | F E G H I C D A |
| 9 | 19 | 2 | A B C D E F G H I |

Missing:

| 3 | 3 | 7 | F E G |

and

A 69 1
A 69 2
A 69 3
A 69 4
A 69 5
B 69 6
B 69 7
C 69 8
C 69 9
D 69 10
E 69 11
F 69 12
G 69 13
H 69 14
H 69 15
H 69 16
H 69 17
I 69 18
J 69 19
J 69 20
J 69 21

Result:

| 6 | 6 | 9 | C D E F G H |
| 7 | 8 | 7 | B C D E F G H |
| 8 | 10 | 5 | A B C D E F G H |
| 9 | 13 | 7 | B C D E F G H I J |
| 10 | 15 | 5 | A B C D E F G H I J |

Missing:

| 3 | 3 | 17 | H I J |

}
type
tycona = array [1..4] of char;
scanptr = ^scan_list;
runptr = ^run_list;

scan_list = record {s 16}
scan_nxt : scanptr; { 4}
id : longint; { 4}
sca : tycona; { 4}
fil : array [1..4] of char; { 4}
end;


spot_rec = record {s 16}
sca : tycona; { 4}
atl : longint; { 4}
id : longint; { 4}
fil : array [1..4] of char; { 4}
end;

run_list = record {s1044}
run_nxt: runptr; { 4}
run_prv: runptr; { 4}
rlen : longint; { 4}
scas : longint; { 4}
id : longint; { 4}
run : array [1..256] of tycona; {1024}
end;

run_data = record {s 16}
run_top: runptr; { 4}
run_end: runptr; { 4}
run_cnt: longint; { 4}
fil : array [1..4] of char; { 4}
end;

_sc_ptr = ^_sc_tab;
_sc_tab = array [1..65536 div sizeof(scanptr)] of scanptr;

_spot_ptr = ^_spot_arr;
_spot_arr = array [0..65536 div sizeof(spot_rec) - 1] of spot_rec;

_run_ptr = ^_run_arr;
_run_arr = array [0..65536 div sizeof(run_data) - 1] of run_data;

const sca_skel : string[24] = ' | | | | ';
const run_ptr : runptr = nil;
const sc_tab : _sc_ptr = nil;
const spot_arr : _spot_ptr = nil;
const run_arr : _run_ptr = nil;
const _atline : longint = 0;
const _lnls : longint = 0;
const _lnls0 : longint = 0;
const _lsat : longint = 0;
const _i : longint = 0;
const _nct : longint = 0;
const _nreq : longint = 0;
const _scount : longint = 0;
const
scan_ptr: scanptr = nil;
scan_top: scanptr = nil;
scan_end: scanptr = nil;
const
_line : string[255] = '';
_linex : array[1..4] of char = #0#0#0#0;

var w_tycona : tycona;
var print : string;
var scanin : text;

procedure add_integer_2_line4(_l: longint);
begin
str(_l:4, print);
_line:= _line + print + ' | ';
end; {add_integer_2_line4}

procedure dispose_run(_i: longint);
var nxt_run: runptr;

begin
run_ptr:= run_arr^[_i].run_top;

while run_ptr <> nil do
begin
nxt_run:= run_ptr^.run_nxt;
dispose(run_ptr);
run_ptr:= nxt_run;
end;

run_arr^[_i].run_top:= nil;
run_arr^[_i].run_cnt:= 0;
end; {dispose_run}

procedure setup_run(_n: longint);
var _in: longint;

begin
new(run_ptr);
fillchar(run_ptr^.run_nxt, sizeof(run_ptr^), #0);
fillchar(run_ptr^.run , sizeof(run_ptr^.run), ' ');

if run_arr^[_n].run_top = nil then
run_arr^[_n].run_top:= run_ptr
else
begin
run_arr^[_n].run_end^.run_nxt:= run_ptr;
run_ptr^.run_prv := run_arr^[_n].run_end;
end;

run_arr^[_n].run_end:= run_ptr;

inc(run_arr^[_n].run_cnt);

run_ptr^.rlen := _atline - spot_arr^[_n].atl;
run_ptr^.scas := _n;
run_ptr^.id := spot_arr^[1].id;

for _in:= 1 to _n do
run_ptr^.run[_in]:= spot_arr^[_in].sca;
end; {setup_run}

procedure check_runs;
var _j: longint;
var _k: longint;
var _l: longint;

begin
_k:= 3;
if _k < _lsat then
_k:= _lsat;

for _j:= _k to _nct do
begin
_l:= spot_arr^[_j].atl;

if _lnls0 <= _l then
if (_j = _nct) or
(_l > spot_arr^[_j + 1].atl + 1) then
begin
if (run_arr^[_j].run_top = nil) or
(run_arr^[_j].run_top^.rlen > _atline - _l) then
begin
dispose_run(_j);
setup_run(_j);
end
else
if (run_arr^[_j].run_top <> nil) and
(run_arr^[_j].run_top^.rlen = _atline - _l) then
setup_run(_j);
end;
end;
end; {check_run}

{
* Scanner:
*
* Part of the code in this procedure was originally written by Paul
* Green as a result of a posting in comp.lang.pascal.borland, asking
* for a faster method of performing scan.
}
procedure scanner;
var _i : longint;
var _k : longint;
var _id : longint;

var scn_in : tycona;

begin;
_i:= (_scount + 2) * sizeof(run_data);
getmem(run_arr, _i);
fillchar(run_arr^, _i, #0);

_i:= (_scount + 2) * sizeof(spot_rec);
getmem(spot_arr, _i);
fillchar(spot_arr^, _i, #0);

writeln('Scanning');

for _i:= _scount downto 1 do
begin
move(sc_tab^[_i]^.sca[1], w_tycona, sizeof(w_tycona));
_id:= sc_tab^[_i]^.id;

inc(_atline);

_lsat:= 1;
while (_lsat <= _nct) and
(longint(w_tycona) <> longint(spot_arr^[_lsat].sca)) do
inc(_lsat);

if _lsat > _nct then
_lsat:= 0;

_lnls:= 0;
if _lsat <> 0 then
begin
_lnls:= spot_arr^[_lsat].atl;

if _nct >= 3 then
check_runs;
end;

_lnls0:= _lnls;

if _lsat = 0 then
begin
inc(_nct);
_lsat:= _nct;
end;

if _lsat > 1 then
move(spot_arr^[1], spot_arr^[2], sizeof(spot_rec) * pred(_lsat));

spot_arr^[1].sca := w_tycona;
spot_arr^[1].id := _id;
spot_arr^[1].atl := _atline;
end;

inc(_atline);
_lsat:= 1;
_lnls:= _atline + 1;
check_runs;

for _lnls:= 3 to _nct do
begin
run_ptr:= run_arr^[_lnls].run_end;

if (run_ptr <> nil) and
((_lnls = _nct) or
(_lnls < _nct) and
(run_arr^[succ(_lnls)].run_end <> nil) and
(run_ptr^.rlen < run_arr^[succ(_lnls)].run_end^.rlen)) then
begin
_line:= sca_skel;
_line[0]:= #3;

add_integer_2_line4(_lnls);
add_integer_2_line4(run_ptr^.rlen);
add_integer_2_line4(run_ptr^.id);
_line[0]:= #24;

while run_ptr <> nil do
begin
_k:= 1;

repeat
move(run_ptr^.run[_k], _line[25], 40);
inc(_line[0], 39);
_line:= _line + ' | ';
writeln(_line);

_line:= sca_skel;
inc(_k, 10);
until run_ptr^.run[_k][1] = ' ';

run_ptr:= run_ptr^.run_prv;

_line[0]:= #17;
if run_ptr <> nil then
add_integer_2_line4(run_ptr^.id);
_line[0]:= #24;
end;
end;

dispose_run(_lnls);
end;
end; {scanner}

begin
assign(scanin, 'scanin.txt');
reset (scanin);

_scount:= 0;
repeat
_line:= ' ';
readln(scanin, _line);
new(scan_ptr);

if scan_top = nil then
scan_top:= scan_ptr
else
scan_end^.scan_nxt:= scan_ptr;

scan_end:= scan_ptr;

inc(_scount);
scan_ptr^.scan_nxt:= nil;
scan_ptr^.id:= _scount;
move(_line[1], scan_ptr^.sca, 4);
until eof(scanin);

getmem(sc_tab, _scount * sizeof(scan_ptr));

scan_ptr:= scan_top;
_i := 0;

while (scan_ptr <> nil) do
begin
inc(_i);
sc_tab^[_i]:= scan_ptr;
scan_ptr := scan_ptr^.scan_nxt;
end;

scanner;
end.
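
As a cross-check on the two failing inputs above (flattened to one
character per element), the shortest window containing at least k
distinct values can be computed with a standard two-pointer scan. This
is a reference sketch only - it is not the program's algorithm, and the
helper name is made up:

```c
/* Shortest window of data[0..n) containing at least k distinct values;
   returns its length, or -1 when the whole input has fewer than k
   distinct values. A plain O(n) two-pointer scan, used as a check. */
static int shortest_k_distinct(const unsigned char *data, int n, int k)
{
    int cnt[256] = {0};
    int distinct = 0, lo = 0, best = -1;

    for (int hi = 0; hi < n; hi++) {
        if (cnt[data[hi]]++ == 0)         /* new value enters window */
            distinct++;
        while (distinct >= k) {           /* shrink while still valid */
            if (best < 0 || hi - lo + 1 < best)
                best = hi - lo + 1;
            if (--cnt[data[lo++]] == 0)   /* value leaves the window */
                distinct--;
        }
    }
    return best;
}
```

On the first input this returns 3 for k = 3 (a 3-in-3 series exists, as
the missing F E G row says) and 19 for k = 9, matching the printed
| 9 | 19 | row; on the second it returns 3 for k = 3 and 15 for k = 10.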

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-27 16:48:21 UTC
Permalink
Post by Robert AH Prins
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
Follow-up to the very first post, with compilable source extracted from
{
The method I used was as follows. As I read the data file, I keep a
record of the last occurrence of each string, in a list sorted by the
line numbers of the occurrences. For each line, I read the string and
find the last occurrence. Then I check each possible series ending on
the penultimate line. Then I update the list and continue. Now let's see
if I can explain this.
Robert, there are two critical issues here:

a) Do you have to run this Pascal version of the program? I.e. is there
anything that stops you from simply switching to one of the two fast &
working versions we have written as a result of your OP?

b) How fast is this Pascal algorithm?

If it is significantly faster, then it might be worthwhile to figure out
where it fails and see if a fix is possible that doesn't make it too slow!

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-28 00:45:02 UTC
Permalink
Post by Terje Mathisen
Post by Robert AH Prins
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
Follow-up to the very first post, with compilable source extracted from
{
The method I used was as follows. As I read the data file, I keep a
record of the last occurrence of each string, in a list sorted by the
line numbers of the occurrences. For each line, I read the string and
find the last occurrence. Then I check each possible series ending on
the penultimate line. Then I update the list and continue. Now let's see
if I can explain this.
a) Do you have to run this Pascal version of the program? I.e. is there
anything that stops you from simply switching to one of the two fast &
working versions we have written as a result of your OP?
No, I can (and quite likely will) change the code any way I like, but
the code will remain Pascal (and PL/I), as this routine is part of a
much larger program - this routine produces just one of the many tables
- and the post-processing logic, a few more Pascal programs on the PC
and a bunch of ISPF edit macros on z/OS, relies on the current order of
the tables.

The "problem" is that I have to translate the two fast solutions to
Pascal and as far as yours is concerned... I've got some fairly basic
understanding of C, but even your source is already using language
constructions above my rather modest knowledge of the language.

BTW, I see (in your 2010-05-24 06:09 posting) that you use a qsort. The
PG solution does not need any sorts... Or maybe your as yet unpublished
fastest version has also dispensed with it?
Post by Terje Mathisen
b) How fast is this Pascal algorithm?
If it is significantly faster, then it might be worthwhile to figure out
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.

Sadly it cannot handle two cases and although I'm currently out of work,
there is more to life than sitting behind a screen for up to 10 hours a
day, certainly now that Mrs Prins has also returned from Vilnius...

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-28 08:16:53 UTC
Permalink
Post by Robert AH Prins
Post by Terje Mathisen
a) Do you have to run this Pascal version of the program? I.e. is there
anything that stops you from simply switching to one of the two fast &
working versions we have written as a result of your OP?
No, I can (and quite likely will) change the code any way I like, but
the code will remain Pascal (and PL/I), as this routine is part of a
much larger program - this routine produces just one of the many tables
and the post-processing logic, a few more Pascal programs on the PC and
a bunch of ISPF edit macro's on z/OS rely on the current order of the
tables.
OK. I assume you for some reason cannot link in functions written in C
either?
Post by Robert AH Prins
The "problem" is that I have to translate the two fast solutions to
Pascal and as far as yours is concerned... I've got some fairly basic
understanding of C, but even your source is already using language
constructions above my rather modest knowledge of the language.
That's easy: There's no "funky" C at all in my code, it is really an asm
algorithm written in C shorthand.
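
As a rough illustration, the backward-scan idea reads like this in
plain C - a sketch of the idea only, not the actual code: the seeding
of count[] to 1 and the solution buffer are assumptions here, since
their setup is outside the asm itself:

```c
#define MAXN 4096

struct sol { int len, uniq, start; };

/* For each end position i, walk backwards until the previous
   occurrence of data[i]; every start j passed gains one distinct
   value, so count[j] always equals the number of distinct values in
   data[j..i]. Windows with >= 3 distinct values that match or beat
   the shortest length seen for that distinct-count are kept as
   potential solutions for the final filtering pass. */
static int scan(const unsigned char *data, int n, struct sol *sol)
{
    int count[MAXN], shortest[257];
    int nsol = 0;

    for (int v = 0; v < 257; v++) shortest[v] = n + 1;
    for (int j = 0; j < n; j++)   count[j] = 1;   /* data[j] itself */

    for (int i = 1; i < n; i++) {
        for (int j = i - 1; j >= 0; j--) {
            if (data[j] == data[i])   /* data[i] not new below here */
                break;
            int uniq = ++count[j];    /* window data[j..i] grew by 1 */
            int len  = i - j + 1;
            if (uniq >= 3 && len <= shortest[uniq]) {
                shortest[uniq]  = len;
                sol[nsol].len   = len;
                sol[nsol].uniq  = uniq;
                sol[nsol].start = j;
                nsol++;
            }
        }
    }
    return nsol;     /* number of potential solutions collected */
}
```

The inner loop stops at the previous occurrence of data[i], because
below that point data[i] adds nothing new to any window.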
Post by Robert AH Prins
BTW, I see (in your 2010-05-24 06:09 posting) that you use a qsort. The
PG solution does not need any sorts... Or maybe your as yet unpublished
fastest version has also dispensed with it?
The qsort is actually slower than a tiny custom (sort) routine to get
the potential solutions into the order I need for the final filtering.

I have written many such routines in my Pascal days. :-)
Post by Robert AH Prins
Post by Terje Mathisen
b) How fast is this Pascal algorithm?
If it is significantly faster, then it might be worthwhile to figure out
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
OK, so it is 4-10 times slower than my best (working) code, and it
doesn't actually work.
Post by Robert AH Prins
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
So maybe an inline asm of my C code would be the easiest? Then you would
just have to change the function definition. :-?

int scan(byte *data, int data_len)
{
/* esi -> data
ecx = data_len
*/
__asm {
mov esi,[data]
mov ecx,[data_len]
push ebp
lea ebp,[solution] // solutions = 0

add ecx,esi // Points one beyond the end of data array
inc esi // Skipping the first byte

cmp esi,ecx // If we had less than 3 chars, then there
jae outer_done // cannot be any solutions!

outer_loop: // Scanning from i=1 to data_len
mov bl,[esi] // Current byte
lea edi,[esi-1] // Start at prev byte
inc esi // Get ready for next iteration

inner_loop: // Scanning backwards from i-1 to 0
cmp bl,[edi] // Same as byte we started with?
je found_equal // Yes! (Not a new unique char, so abort)

// Increment count of unique values for this starting position:
mov edx,count[edi*4]
mov eax,esi
inc edx
sub eax,edi // String length
mov count[edi*4],edx
cmp edx,3 // At least 3 unique values?
jb not_shortest_string // fewer than 3: just keep scanning

cmp eax,shortest[edx*4]
ja not_shortest_string

// Save as potential solution:
mov shortest[edx*4],eax
mov [ebp.len],eax
mov [ebp.uniq],edx
mov [ebp.start],edi
add ebp,16 // sizeof(solution[0])

not_shortest_string:
sub edi,1
jnc inner_loop

found_equal:
cmp esi,ecx
jb outer_loop
outer_done:
lea eax,[ebp - offset solution]
pop ebp
shr eax,4 // EAX has potential solution count
}
}

Better?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Marco van de Voort
2010-05-28 09:03:53 UTC
Permalink
Post by Terje Mathisen
So maybe an inline asm of my C code would be the easiest? Then you would
just have to change the function definition. :-?
(and assumptions about calling conventions and other ABI stuff).

But it is more readable than PL/I, and that's what counts :-)

/me ducks for cover.
Robert AH Prins
2010-05-28 19:27:45 UTC
Permalink
Post by Terje Mathisen
Post by Robert AH Prins
Post by Terje Mathisen
a) Do you have to run this Pascal version of the program? I.e. is there
anything that stops you from simply switching to one of the two fast &
working versions we have written as a result of your OP?
No, I can (and quite likely will) change the code any way I like, but
the code will remain Pascal (and PL/I), as this routine is part of a
much larger program - this routine produces just one of the many tables
and the post-processing logic, a few more Pascal programs on the PC and
a bunch of ISPF edit macro's on z/OS rely on the current order of the
tables.
OK. I assume you for some reason cannot link in functions written in C
either?
Post by Robert AH Prins
The "problem" is that I have to translate the two fast solutions to
Pascal and as far as yours is concerned... I've got some fairly basic
understanding of C, but even your source is already using language
constructions above my rather modest knowledge of the language.
That's easy: There's no "funky" C at all in my code, it is really an asm
algorithm written in C shorthand.
Post by Robert AH Prins
BTW, I see (in your 2010-05-24 06:09 posting) that you use a qsort. The
PG solution does not need any sorts... Or maybe your as yet unpublished
fastest version has also dispensed with it?
The qsort is actually slower than a tiny custom (sort) routine to get
the potential solutions into the order I need for the final filtering.
I have written many such routines in my Pascal days. :-)
Post by Robert AH Prins
Post by Terje Mathisen
b) How fast is this Pascal algorithm?
If it is significantly faster, then it might be worthwhile to figure out
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
OK, so it is 4-10 times slower than my best (working) code, and it
doesn't actually work.
Post by Robert AH Prins
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
So maybe an inline asm of my C code would be the easiest? Then you would
just have to change the function definition. :-?
[snip]
Better?
No! ;) (But thank you anyway!)

I should have done it myself.

Anyway, I need to get back to the C code, as I also need a PL/I version!
But before doing anything, I will spend one more day trying to find the
bug in the existing code.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Marco van de Voort
2010-05-28 09:05:23 UTC
Permalink
Post by Robert AH Prins
Post by Terje Mathisen
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
Any chance to repeat the timing with Delphi or FPC? It would make a great
table.
Robert AH Prins
2010-05-28 19:31:54 UTC
Permalink
Post by Marco van de Voort
Post by Robert AH Prins
Post by Terje Mathisen
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
Any chance to repeat the timing with Delphi or FPC? It would make a great
table.
I only have an uninstalled version of Turbo Delphi somewhere and after
having found a full blown PL/I compiler for Windows, equivalent to
V3.9.0 on z/OS, I'm not sure if I want to install FPC, when VP does, at
least at the moment, everything I need.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Marco van de Voort
2010-05-29 15:17:06 UTC
Permalink
["Followup-To:" header set to comp.lang.pascal.borland.]
Post by Robert AH Prins
Post by Marco van de Voort
Post by Robert AH Prins
Post by Terje Mathisen
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
Any chance to repeat the timing with Delphi or FPC? It would make a great
table.
I only have an uninstalled version of Turbo Delphi somewhere and after
having found a full blown PL/I compiler for Windows, equivalent to
V3.9.0 on z/OS, I'm not sure if I want to install FPC, when VP does, at
least at the moment, everything I need.
FPC is relatively self-contained (the worst that can happen is that the
installer adds the directory to the path, which can be undone easily)

The question was not as much meant to convince you to swap compilers, just
that it would be interesting to see how the compilers fare relative to
each other (and maybe pit it against a GCC implementation of the same
algorithm).