Discussion:
Optimization problem
Robert AH Prins
2010-05-16 18:28:26 UTC
Hi all,

Can anyone give me some hints as to how to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version?

Problem description:

The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.

Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.

An example: suppose the array contains only 56 elements, in this case
with, for clarity, values a..j:

         1         2         3         4         5    5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja

in this case the program should find the following sub-arrays:

len  aperture  from  to  value
  3         3    40  42  bef
  3         3    44  46  fgh
  7         7    50  56  gihbfja
  8        16    41  56  efffghggggihbfja, containing efghibja
  9        23    34  56  cbbbbbbefffghggggihbfja, containing cbefghija
 10        25    32  56  dccbbbbbbefffghggggihbfja, containing dcbefghija

It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
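
For reference, the "aperture" figures above can be checked with a
straightforward two-pointer scan, one pass per distinct count. A sketch in
C (the function name is hypothetical; note it only finds the minimal window
length per distinct count, it does not implement the exclusion rule for
sub-arrays nested inside runs of all-distinct values):

```c
#include <assert.h>

/* Shortest window (aperture) containing at least k distinct values,
   or 0 if no window reaches k distinct values.  O(N) two-pointer scan:
   any window with more than k distinct values contains a no-longer
   sub-window with exactly k, so "at least k" gives the same minimum.
   Requires k >= 1. */
static unsigned min_aperture(const unsigned char *data, unsigned n,
                             unsigned k)
{
    unsigned count[256] = {0};
    unsigned distinct = 0, best = 0;
    unsigned left = 0;

    for (unsigned right = 0; right < n; right++) {
        if (count[data[right]]++ == 0)       /* new value enters window */
            distinct++;
        while (distinct >= k) {              /* window [left,right] works */
            unsigned len = right - left + 1;
            if (best == 0 || len < best)
                best = len;
            if (--count[data[left++]] == 0)  /* shrink from the left */
                distinct--;
        }
    }
    return best;
}
```

On the 56-element example above this reproduces the aperture column:
3 for k=3, 7 for k=7, 16 for k=8, 23 for k=9, and 25 for k=10.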

My AD 1996 program used to slide a window that started with three
elements (plus a sentinel on either side) over the big array and spit
out the position of the first element if it found three distinct
elements in the window *and* the two sentinels also contained any of the
values in the window. The program included some fairly minor
optimizations, such as immediately sliding the window so that it started
at the last of a series of equal values: in the example above, the
three-position-wide window would start by covering pos 6..8, and next it
would move to 11..13, etc. If at any moment the program found that all
elements in the window *and* the sentinel on the right were distinct,
the window size was increased by one and the scan continued.

Taking the above example, a window with an aperture of four elements
would start its slide at position 6..9. Once it reached position 49,
with "gihb" visible and "g" and "f" in the two sentinels, the window
would be widened to five characters as the "f" is distinct from the four
"visible" characters, with a new sentinel of "j". Given that this is
again distinct from the now five characters in the window, the process
would repeat itself, eventually resulting in the string of seven
distinct characters starting at position 50.

Once the string of seven distinct characters has been found, the window
is widened to *nine* (if there had been a string of 8-in-8 it would have
been found by the previous slide!) characters and the process restarts
at position 6..14, but in the end it fails to find a series of eight
distinct characters and so the process is repeated with windows of 10,
11, 12, 13, 14, 15 and finally 16 characters, when a string of eight
distinct characters is finally found.

The problem is, for relatively low values of N and C, the process may
not be overly efficient, but it works. However, once N and C increase,
(N is theoretically unbounded, C has an upper limit of around 210), the
process becomes horribly inefficient: using the old IBM V2.3.0 OS PL/I
compiler, which had a statement count facility, this method required the
execution of 556,379,518 PL/I statements (and a number of actual machine
instructions at least one order of magnitude greater) for the
current values of N(2329) and C(66).

As for some figures on required restarts, the final series of 66 is
found in a sub-array with a length of 1950 elements and with a series of
65 in a sub-array of only 1889 elements; this means that the slide had
to be restarted 61 times! Even more restarts, 312(!), are needed when
going from 61 to 62 elements.

In 1998 I posted the problem to comp.lang.pascal.borland, the post
should be on Google, but Google refuses to show more than the first
9,960 (out of 28,517) posts in that group.

Three people claimed that there was a much faster way, and two of them
backed this up with programs, Brent Beach and Paul Green, with Paul's
solution being the fastest by a fair margin. The results of his code
matched my output, I replaced my code with his code, and that seemed to
be the end of the story...

Until a few months back...

... when I started converting the original Pascal code to assembler.
While doing so I decided that it would be interesting to use this
procedure in another part of the code. However, the results were not
what I expected: result rows were missing. I managed to track down a
paper copy of the original PG program, typed that in, had it checked for
typos by a friend, and it turned out to give the same erroneous results.
PG's email address is no longer valid, and the floppy that contained a
copy of my email conversation with him and Brent, including a basic
explanation of the algorithm he used, is unreadable after having been
stored in a cold loft for the best part of the last decade. The only
thing I remember about his explanation is the words "clear as mud..."
near the end.

I have spent the last two weeks trying to figure out how it works,
staring at the Virtual Pascal IDE "Watch" window, and I've finally
decided to ask for help.

So, if anyone can give me any clues as to how I could perform the above
described process in an efficient way, I'd be very grateful. *I'm quite
capable of writing the code myself*, I would just like to get any clues
towards a more efficient algorithm. If that means you want to see the
current Pascal (or PL/I) code and the input data that leads to the
erroneous results, drop me a line, "robert(a)prino(d)org".

Thanks,

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
glen herrmannsfeldt
2010-05-17 00:29:19 UTC
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
For an alphabet size of 32 or less, that is, values 0..31, and if I
did it in C, I would probably use one-bit-per-value storage in a 32-bit
word, which makes testing fairly easy using & and |.
As far as I know, PL/I BIT(32) doesn't have the optimization
that uses a 32 bit word, and UNSPEC is likely even worse.
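
Something like this, say (an untested sketch; the function names are made
up, and the values must already be normalized to 0..31):

```c
#include <assert.h>
#include <stdint.h>

/* For an alphabet of at most 32 values (0..31), track membership as one
   bit per value in a 32-bit word; set/merge/test are then |, & and ==. */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Number of distinct values in data[0..len-1]; values must be 0..31. */
static unsigned distinct32(const unsigned char *data, unsigned len)
{
    uint32_t seen = 0;
    for (unsigned i = 0; i < len; i++)
        seen |= (uint32_t)1 << data[i];
    return popcount32(seen);
}
```

With a compiler that has a popcount builtin (or a POPCNT instruction) the
bit count becomes a single operation.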
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
I wonder if this can be done using dynamic programming.
DP can speed up many such problems, but it isn't always
so obvious.

For the longer subarrays, it might be that hash tables could
speed up the comparisons. That would, for example, get you
an O(N) test for a value being in a subarray being considered.
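
Since the values are already normalized to 1..T (at most 255), a plain
count array can play the role of the hash table, and maintaining it
incrementally makes each one-step slide of a fixed-width window O(1)
instead of a full recount. A sketch in C, with a hypothetical function
name:

```c
#include <assert.h>

/* Slide a window of width w across data[0..n-1], maintaining the number
   of distinct values incrementally via a per-value counter: O(n) total
   work instead of recounting every window from scratch.  Returns the
   maximum distinct count seen in any single window. */
static unsigned max_distinct_in_window(const unsigned char *data,
                                       unsigned n, unsigned w)
{
    unsigned count[256] = {0};
    unsigned distinct = 0, best = 0;

    for (unsigned i = 0; i < n; i++) {
        if (count[data[i]]++ == 0)               /* value enters window */
            distinct++;
        if (i >= w && --count[data[i - w]] == 0) /* value leaves window */
            distinct--;
        if (i + 1 >= w && distinct > best)       /* window is full */
            best = distinct;
    }
    return best;
}
```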
Post by Robert AH Prins
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
My AD 1996 program used to slide a window that started with three
elements (plus a sentinel on either side) over the big array and spit
out the position of the first element if it found three distinct
elements in the window *and* the two sentinels also contained any of the
values in the window. The program included some fairly minor
optimizations, such as immediately sliding the window to the last of a
series of equal values, in the example above, the three position wide
window would start by covering pos 6..8, and next it would move to
11..13, etc. If at any moment the program would find that all elements
in the window *and the sentinel on the right* would be distinct, the
window size would be increased by one and the scan would continue.
The optimization works if it is faster than a direct test.
For short subarrays, an unrolled compare loop should be pretty fast.
For longer ones, as I said above, a hash table might be fast.
Post by Robert AH Prins
Taking the above example, a window with an aperture of four elements
would start its slide at position 6..9. Once it reached position 49,
with "gihb" visible and "g" and "f" in the two sentinels, the window
would be widened to five characters as the "f" is distinct from the four
"visible" characters, with a new sentinel of "j". Given that this is
again distinct from the now five characters in the window, the process
would repeat itself, eventually resulting in the string of seven
distinct characters starting at position 50.
Once the string of seven distinct characters has been found, the window
is widened to *nine* (if there had been a string of 8-in-8 it would have
been found by the previous slide!) characters and the process restarts
at position 6..14, but in the end it fails to find a series of eight
distinct characters and so the process is repeated with windows of 10,
11, 12, 13, 14, 15 and finally 16 characters, when a string of eight
distinct characters is finally found.
The problem is, for relatively low values of N and C, the process may
not be overly efficient, but it works. However, once N and C increase,
(N is theoretically unbounded, C has an upper limit of around 210), the
process becomes horribly inefficient: using the old IBM V2.3.0 OS PL/I
compiler, which had a statement count facility, this method required the
execution of 556,379,518 PL/I statements (and a number of actual machine
instruction that is at least one order of magnitude greater) for the
current values of N(2329) and C(66).
As for some figures on required restarts, the final series of 66 is
found in a sub-array with a length of 1950 elements and with a series of
65 in a sub-array of only 1889 elements, this means that the slide had
to be restarted 61 times! Even more restarts, 312(!), are needed when
going from 61 to 62 elements.
In 1998 I posted the problem to comp.lang.pascal.borland, the post
should be on Google, but Google refuses to show more than the first
9,960 (out of 28,517) posts in that group.
Three people claimed that there was a much faster way, and two of them
backed this up with programs, Brent Beach and Paul Green, with Paul's
solution being the fastest by a fair margin. The results of his code
matched my output, I replaced my code with his code, and that seemed to
be the end of the story...
Until a few months back...
... when I started converting the original Pascal code in assembler.
Can you post some code so we can see it? It might give some good,
or not so good, ideas on how to do it better.
Post by Robert AH Prins
While doing so I decided that it would be interesting to use this
procedure in another part of the code. However, the results were not
what I expected: result rows were missing.
-- glen
Robert AH Prins
2010-05-17 09:29:55 UTC
Post by glen herrmannsfeldt
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
For alphabetsize of 32 or less, that is, values 0..31, and if I
did it in C, I would probably use the one bit per 32 bit word
storage, which makes testing fairly easy using & and |.
As far as I know, PL/I BIT(32) doesn't have the optimization
that uses a 32 bit word, and UNSPEC is likely even worse.
Bit operations, especially on aligned bitstrings, were never very bad,
even in the old V2.3.0 compiler, and BIT(1) ALIGNED already uses one bit
per byte.
Post by glen herrmannsfeldt
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
I wonder if this can be done using dynamic programming.
DP can speed up many such problems, but it isn't always
so obvious.
For the longer subarrays, it might be that hash tables could
speed up the comparisons. That would, for example, get you
an O(N) test for a value being in a subarray being considered.
Never thought of using a hash table, always focused on the sliding
window and tried to do that as optimal as possible.

<snip>
Post by glen herrmannsfeldt
Post by Robert AH Prins
Until a few months back...
... when I started converting the original Pascal code in assembler.
Can you post some code so we can see it? It might give some good,
or not so good, ideas on how to do it better.
I'll pull the code out of the program; it probably needs a bit of
massaging to make it runnable. I probably still have the (or at least an
iteration of the) sliding window code somewhere, but I may not be able
to post it this week as I will be away for a short break.
Post by glen herrmannsfeldt
Post by Robert AH Prins
While doing so I decided that it would be interesting to use this
procedure in another part of the code. However, the results were not
what I expected: result rows were missing.
Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-17 09:03:15 UTC
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
The maximum array size of 4096 is interesting; it means that we can
afford to use quite a bit of lookup table space per entry.

Otoh it also means that we cannot afford to use a lot of time to
initialize those tables, since that would take just as long or longer
than the obvious window scanning algorithm.

I would still start by determining the first and last position of each
unique byte, then use that to determine the shortest substring
which contains all the values: This becomes the limit for a window scan
that starts with aperture 3 and goes up.

/* Assumed context: typedef unsigned char byte; data[] holds the values,
   length is the number of filled cells, limit comes from the prescan
   above, save() records a result, and <string.h> is included for
   memset(). */

byte not_found[256];         // Contains 1 for byte values not seen so far

unsigned current_limit = 3;  // Minimum len!
unsigned next_aperture = 0;  // Wait until we find the first substring

for (unsigned aperture = 3; aperture < limit; aperture++) {
    if (aperture == next_aperture)
        current_limit++;

    for (unsigned start = 0; start <= length - aperture; start++) {
        /* This memset might turn out to be the single most expensive
           part of the algorithm, at least for short lengths/apertures.
           Even using 16 SSE writes, it will take at least 16 cycles
           just to get to L1 cache.

           One possible optimization is to note that at least half
           of all bus write cycles will be idle during the main loop
           below, so we could use two independent not_found buffers
           and alternate between them: use one and zero the other
           at the same time!
        */
        memset(not_found, 1, sizeof(not_found));

        unsigned count = 0;

        /* The core of the scanning algorithm:
           Count the number of unique byte values by adding together
           all those values not seen previously, while clearing each
           entry in the not_found[] array as its value is seen:
        */
        for (unsigned i = 0; i < aperture; i++) {
            byte c = data[start + i];
            count += not_found[c];
            not_found[c] = 0;
        }
        /* The running time for this loop should be ~3 cycles/iteration
           as long as the loop branch is correctly predicted, and at
           least for the shorter apertures the modern multi-level
           predictors should do quite well here.
        */

        /* This next part will be very rarely executed except for
           the shortest apertures/lengths, so we can assume it to
           be skipped:
        */
        if (count >= current_limit) {
            if (count > current_limit)
                current_limit = count;
            save(start, aperture, count);
            next_aperture = aperture + 1;
        }
    }
}

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-17 15:24:18 UTC
Post by Terje Mathisen
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
The maximum array size of 4096 is interesting, it means that we can
afford to use quite a bit of lookup table space per entry.
Memory is cheap, but cache is of course limited.
Post by Terje Mathisen
Otoh is also means that we cannot afford to use a lot of time to
initialize those tables, since that would take just as long or longer
than the obvious window scanning algorithm.
The problem is, the sliding window algorithm is fundamentally wrong. I
spent a bit of time converting the current PL/I version of the program
back to something that compiles with the old (AD late 1980s, early
1990s) compiler and ran the program with the count option. The old
sliding window method resulted, as mentioned before, in the execution of
*556,379,518* PL/I statements. The Paul Green method gets the same
result by executing a mere *79,386* PL/I statements, a rather staggering
difference...

I'm now going to dig through about a dozen boxes with papers that were
never unpacked after we moved. I have a faint hope that there may be
printouts of my email exchange with him (PG) in one of them. Hope I have
the time to do it today or tomorrow, as I'm off to the other side of
Europe later this week, provided the Eyjafjallajökull allows me...
Post by Terje Mathisen
I would still start by determining the first and last position of each
unique byte, then use that to pick the determine the shortest substring
which contains all the values: This becomes the limit for a window scan
that starts with aperture 3 and goes up.
That string may contain all values, but strings to either side of it may
contain shorter substrings. In the example I gave, all values occur in
the substring starting at position 32, but if there had been additional
values b and c at positions "0" and "-1", your window scan would, if I
read the above correctly, not pick them up as part of a 3-string.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-17 14:50:20 UTC
Post by Robert AH Prins
Post by Terje Mathisen
Post by Robert AH Prins
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
The maximum array size of 4096 is interesting, it means that we can
afford to use quite a bit of lookup table space per entry.
Memory is cheap, but cache is of course limited.
Post by Terje Mathisen
Otoh is also means that we cannot afford to use a lot of time to
initialize those tables, since that would take just as long or longer
than the obvious window scanning algorithm.
The problem is, the sliding window algorithm is fundamentally wrong. I
spend a bit of time converting the current Pl/I version of the program
back to something that compiles with the old (AD late 1980's, early
1990's compiler) and ran the program with the count option. The old
sliding window method resulted, as mentioned before, in the execution of
*556,379,518* PL/I statements. The Paul Green method gets the same
result by executing a mere *79,386* PL/I statements, a rather staggering
difference...
That sounds like the difference between an O(n*n) and O(n*log(n))
algorithm. :-)
Post by Robert AH Prins
I'm now going to dig through about a dozen boxes with papers that were
never unpacked after we moved. I have a faint hope that there may be
printouts of my email exchange with him (PG) in one of them. Hope I have
the time to do it today or tomorrow, as I'm off to the other side of
Europe later this week, provided the Eyjafjallajökull allows me...
Britain and several other western parts of Europe seem quite iffy
tomorrow morning, according to
http://weatheronline.co.uk/cgi-app/volcanic?LANG=en&ART=3
Post by Robert AH Prins
Post by Terje Mathisen
I would still start by determining the first and last position of each
unique byte, then use that to pick the determine the shortest substring
which contains all the values: This becomes the limit for a window scan
that starts with aperture 3 and goes up.
That string may contain all values, but strings to either side of it may
contain shorter substrings. In the example I gave, all values occur in
the substring starting at position 32, but if there had been additional
values b and c at positions "0" and "-1", your window scan would, if I
read the above correctly, not pick them up as part of a 3-string.
Right, I did realize that a little bit later.

This is an intriguing problem, I'll have to think some more...
:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
robin
2010-05-17 10:01:26 UTC
"glen herrmannsfeldt" <***@ugcs.caltech.edu> wrote in message news:hsq2kv$4gt$***@speranza.aioe.org...

| For alphabetsize of 32 or less, that is, values 0..31, and if I
| did it in C, I would probably use the one bit per 32 bit word
| storage, which makes testing fairly easy using & and |.
| As far as I know, PL/I BIT(32) doesn't have the optimization
| that uses a 32 bit word,

It does if it's unioned with a 31-bit FIXED BINARY.

| and UNSPEC is likely even worse.

Did you try it?
Dick Wesseling
2010-05-17 04:57:52 UTC
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
Are you saying that if we tabulate the results, then both the "len" and
"aperture" columns should have increasing values?

If my understanding of the requirements is correct then it may be better
to start with a large window and _decrease_ it in each pass. (As finding
the longest length and corresponding aperture is trivial).
For example, after finding len=10 aperture=25 one should only consider
substrings of 25 or less when searching for sequences of less than 10
distinct values.
Robert AH Prins
2010-05-17 09:12:12 UTC
Post by Dick Wesseling
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
Are you saying that if we tabulate the results, then both the "len" and
"aperture" columns should have increasing values?
If they are temporarily stored in an array or list, then it's trivial to
sort them in any required order before further processing.
Post by Dick Wesseling
If my understanding of the requirements is correct then it may be better
to start with a large window and _decrease_ it in each pass. (As finding
the longest length and corresponding aperture is trivial).
Even if there may be more than one longest interval? This *is* pretty
unlikely. However, the current set of data contains two sub-arrays with
a length of 1879 that contain 64 distinct values.
Post by Dick Wesseling
For example, after finding len=10 aperture=25 one should only consider
substrings of 25 or less when searching for sequences of less than 10
distinct values.
That is pretty obvious, but why would reducing the aperture from 24 down
to potentially 9 to find substrings with 9 distinct values be any more
efficient than starting with an aperture of 9 and increasing it to 24?
If the substring of 9 elements were to be found with an aperture of 16
or 17, i.e. dead in the middle, starting with the smaller aperture would
involve less testing...

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-17 08:09:13 UTC
Post by Dick Wesseling
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
I haven't thought this all the way through, but my immediate idea was to
start by making a full inverted index, i.e. sort all possible substrings
into alphabetical order.

However, on second consideration this probably doesn't really help. :-(
Post by Dick Wesseling
Post by Robert AH Prins
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
Are you saying that if we tabulate the results, then both the "len" and
"aperture" columns should have increasing values?
If my understanding of the requirements is correct then it may be better
to start with a large window and _decrease_ it in each pass. (As finding
the longest length and corresponding aperture is trivial).
For example, after finding len=10 aperture=25 one should only consider
substrings of 25 or less when searching for sequences of less than 10
distinct values.
This sounds like a good idea!

If we start with a full scan that indexes the locations of each possible
byte value, then we know that the longest sequence will contain each of
the unique byte values found, right?

We can then start with the full input array, then from each end remove
bytes until we get to the only remaining entry for a byte value: This
will be the shortest substring which contains all the unique values.

As you note, this is also the longest substring we need to consider...
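
A sketch in C of the trim just described (the function name is made up).
One caveat worth hedging: pinning each end at the last remaining
occurrence of its value yields *a* substring containing every distinct
value, and hence an upper limit for the scan, but greedy end-trimming is
not guaranteed to find the shortest such substring in general; on the
posted example the two happen to coincide (the 10-in-25 row, starting at
position 32).

```c
#include <assert.h>

/* Trim from both ends of data[0..n-1] until each end holds the last
   remaining occurrence of its value.  The result is a substring that
   still contains every distinct value present; its length bounds the
   window scan.  Returns the length; stores the 0-based start in *start. */
static unsigned full_cover(const unsigned char *data, unsigned n,
                           unsigned *start)
{
    unsigned count[256] = {0};
    unsigned lo = 0, hi = n - 1;

    for (unsigned i = 0; i < n; i++)
        count[data[i]]++;

    while (count[data[lo]] > 1)   /* left value occurs again later */
        count[data[lo++]]--;
    while (count[data[hi]] > 1)   /* right value occurs again in range */
        count[data[hi--]]--;

    *start = lo;
    return hi - lo + 1;
}
```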

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
robin
2010-05-17 09:59:14 UTC
"Robert AH Prins" <***@prino.org> wrote in message news:***@mid.individual.net...
| Hi all,
|
| Can anyone give me some hints as to solve the following problem,
| preferably in a way that is faster than the way I used to do it, and
| without the bug in the current version;
|
| Problem description:
|
| The program processes an array of N cells, with N up to about 4096. The
| array is filled from the start, it may not be completely full but there
| are no empty cells between full cells. A cell contains a value of
| 1..255, so logically every cell would be a byte, but if processing
| efficiency requires each cell to be a word or even a double-word, that
| would be no problem.

In general, byte values are quicker to manipulate than word,
halfword, and doubleword values.

| The values are normalized, they are effectively
| indexes into a table, so if the table contains just T entries, the
| values would go from 1 to T.
|
| Given the above setup, the problem is to find *all* sub-arrays that
| contain 3..T distinct elements and are as short as possible, i.e. if
| there are 42 sub-arrays of three elements containing three distinct
| values, there is no need to find sub-arrays of four elements containing
| three distinct values. Also, a shorter sub-array may *never* be a part
| of a longer sub-array containing *only* distinct values.
|
| An example, suppose only the array contains 56 elements, in this case
| with, for clarity, values a..j:
|
| 1 2 3 4 5 5
| ....5....0....5....0....5....0....5....0....5....0....56
| aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
|
| in this case the program should find the following sub-arrays:
|
| len aperture from to value
| 3 3 40 42 bef
| 3 3 44 46 fgh
| 7 7 50 56 gihbfja
| 8 16 41 56 efffghggggihbfja, containing efghibja
| 9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
| 10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
|
| It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
| they are either longer than the series of 7-in-7 (12-25 contain abcd),
| or are part of the 7-in-7 series and as such they should not be included!
|
| My AD 1996 program used to slide a window that started with three
| elements (plus a sentinel on either side) over the big array and spit
| out the position of the first element if it found three distinct
| elements in the window *and* the two sentinels also contained any of the
| values in the window.

This sounds like something adaptable to using INDEX,
especially as your data are byte entries (or maybe I'm
thinking of the wrong part of the search...).
Dick Wesseling
2010-05-17 23:06:23 UTC
Permalink
Post by Robert AH Prins
Post by Dick Wesseling
Post by Robert AH Prins
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose only the array contains 56 elements, in this case
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
Are you saying that if we tabulate the results, then both the "len" and
"aperture" columns should have increasing values?
If they are temporarily stored in an array or list, then it's trivial to
sort them in any required order before further processing.
That's not what I meant. I wanted to make sure that I understood the
problem statement before attempting a solution.
Post by Robert AH Prins
Post by Dick Wesseling
If my understanding of the requirements is correct then it may be better
to start with a large window and _decrease_ it in each pass. (As finding
the longest length and corresponding aperture is trivial).
Even if there may be more than one longest interval? This *is* pretty
unlikely. However, the current set of data contains two sub-arrays with
a length of 1879 that contain 64 distinct values.
Post by Dick Wesseling
For example, after finding len=10 aperture=25 one should only consider
substrings of 25 or less when searching for sequences of less than 10
distinct values.
That is pretty obvious, but why would reducing the aperture from 24 down
to potentially 9 to find substrings with 9 distinct values be any more
efficient than starting with an aperture of 9 and increasing it to 24?
If the substring of 9 elements were to be found with an aperture of 16
or 17, i.e. dead in the middle, starting with the smaller aperture would
involve less testing...
The solution that I have in mind is O(N*C) where N is the length of
the input and C is either the size of the alphabet (current version) or
- with a few improvements that are left as an exercise - the number of
different lengths (which is 5 for {3,7,8,9,10} in your example).

My solution involves an exclusion list that is searched linearly,
this may ruin the performance, but it should be possible to use a
binary search tree instead. Also left as an exercise.

Anyway, here is what I had in mind. Compiles with gcc.


#include <stdio.h>
#include <string.h>
#include <strings.h> /* bzero() */


int verbose=0;

#if 1
// original example

#define TESTSTR "aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja"

#else

/*
That string may contain all values, but strings to either side of it may
contain shorter substrings. In the example I gave, all values occur in
the substring starting at position 32, but if there had been additional
values b and c at positions "0" and "-1", your window scan would, if I
read the above correctly, not pick them up as part of a 3-string.
*/

#define TESTSTR "bcaaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja"
#endif
unsigned char input [] = TESTSTR ;
unsigned N = sizeof(TESTSTR)-1;



#define C 256 // Size of alphabet
#define ONE 1 /* Pascal array bias convention */


void findseqs(unsigned char *data, unsigned len) {
unsigned i;
unsigned uniq = 0; // Nr unique symbols in window
unsigned apub = ~0; // Aperture upper bound

unsigned nrfound = /* to keep gcc happy */ 0;
unsigned found[len];

unsigned freq[C]; // Frequency count
bzero(freq, sizeof(freq));

#define inc(c) do { if(!freq[c]) uniq++; freq[c]++; } while(0)
#define dec(c) do { freq[c]--; if(!freq[c]) uniq--; } while(0)


#define WINSIZE ((r-l)+1)

// THIS NEEDS TO BE IMPROVED:
struct {
unsigned start;
unsigned end; // Exclusive end
} excluded [len];
int nrexcluded=0;


// Start with largest possible sequence: the entire input string

unsigned l = 0; // Start of window
unsigned r = len-1; // Inclusive end of window

for (i=0; i<len; i++) {
inc(data[i]);
}

// Trim left & right edges

while (freq[data[r]]>1) { dec(data[r]); r--; }
while (freq[data[l]]>1) { dec(data[l]); l++; }

// From here on we want to find only sequences that are shorter or
// equal to what we have found above:

unsigned goal = uniq;

haveseq:
if (WINSIZE <= apub ) { // Short enough?

// A candidate. May be premature if we have shorter sequences
// with the same nr of unique symbols.

if (verbose) {
printf("%s\n", data);
for (i=0; i<l; i++) printf(" ");
for ( ; i<=r; i++) printf("-");
printf("\n");
printf("Possible sequence: len %d aper %d from %d to %d\n",
uniq, WINSIZE, l, r);
}

// Part of exclusion?
// For now brute force. todo: some sort of binary search

for (i=0; i<nrexcluded; i++) {
if ( l >= excluded[i].start &&
r < excluded[i].end) {
if (verbose) printf("Skip excluded\n");
goto skip;
}
}
if (WINSIZE < apub) {
nrfound = 0; // Dump old candidates
apub = WINSIZE; // Next matches must be <= this one
}
found[nrfound] = l; // Add to list of tentative results
nrfound++;
skip: ;
}

// Slide window

dec(data[l]); l++;
while (freq[data[l]]>1) { dec(data[l]); l++; }
while (uniq < goal && r<len-1) { r++; inc(data[r]); }

// Hit right edge?
if (uniq == goal) goto haveseq;


// No more matches of this length. Print results

for (i=0; i<nrfound; i++) {

unsigned j;
printf("Found len %2d aperture %2d pos %d\t", goal, apub, ONE+found[i]);
for (j=found[i]; j< apub+found[i]; j++) printf("%c", data[j]);
printf("\n");

if (apub == goal) { // *only* distinct values?
if (verbose) printf("Add exclude\n");
excluded[nrexcluded].start = found[i];
excluded[nrexcluded].end = found[i]+goal;
nrexcluded++;
}
}

// Now the next shorter sequence.
//
// If my understanding of the problem is correct it must have
// a smaller aperture than the one found above. So in order to
// find the length and a first estimate of the aperture we slide
// a fixed size window over the input data

#define RESET \
bzero(freq, sizeof(freq)); \
uniq = l = r = 0; /* reset everything */

RESET

apub--; // Must be shorter than the previous

for (r=0; r<apub-1; r++) {
inc(data[r]);
} inc(data[r]);

unsigned int start = 0; // Where to start next iteration
unsigned int end = 0; // of main loop

goal=0;

// match fixed window at all positions
while (r < (len-1)) {
if (uniq > goal ) {

// Q: does it pay to test exclusion list at this point?
// Yes! That saves quite a few iterations of the main
// loop. E.g. sample data is now O(10*N) whereas
// O(5*N) can be achieved. This is left as an exercise
// for the reader.

goal = uniq;
start = l;
end = r;
}
dec(data[l]); l++;
r++; inc(data[r]);
}

if (goal < 3) return;

RESET
l=start;
for (r=l; r<end; r++) {
inc(data[r]);
} inc(data[r]);

while (freq[data[l]]>1) { dec(data[l]); l++; }
apub=WINSIZE;
nrfound=0;

if (verbose) {
printf("Next goal %d start %d end %d uniq %d \n",
goal, start, end, uniq );
}
goto haveseq;
}


int main(int argc, char **argv) {
if (argc >1) {
findseqs((unsigned char *)argv[1], strlen(argv[1]));
} else {
findseqs(input, N);
}
return 0;
}
James J. Weinkam
2010-05-18 08:12:04 UTC
Permalink
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
...
I believe the program below does what you want and should be a couple of orders of magnitude faster
than the method you outlined (based on your statement about how many "PL/I statements" are executed
on an input of length 2329 with 66 distinct values). For illustrative purposes, the program uses the
character representation you used in your example for input and output, but the algorithm itself
will work for values of c up to 32765 and v up to 32764 if you have enough memory and patience.

The program uses four arrays: y(c), a character array into which the input is read; x(c), an integer
array giving the encoded input with a=1, b=2, etc; a(0:c,v+1), described below; and l(v), with l(j)
the length of the shortest subsequence with j distinct values.

The basic idea is to build an array, a, with a row for each cell and a column for each value. The
i,j element of the array gives the relative cell number in the input array starting from cell i of
the first occurrence of value j, if any, otherwise a(i,j) is greater than max(c,v+1). The array is
built starting from the last row, c, and working back to row 1. Row c is first set to max(c+1,v+2).
Then a(c,x(c)) is set to 1. Working backward for i=c-1 to 1, each element of row i is set to the
minimum of 32766 and the corresponding element of row i+1 plus 1 and then element a(i,x(i)) is set to 1.

Once this construction is complete, it is evident that there is one and only one instance of the
integer 1 in each row, and all values in each row less than c+1 are unique.

Next each row is sorted. After sorting, the a(i,j) element contains the length of the shortest
subsequence starting at position i that contains j distinct values unless there is no such
subsequence, in which case a(i,j) is greater than c. It is also evident that if a(i,j)=j then
a(i,k)=k for 1<=k<=j. Now if a(i,j)=j the subsequence of length j starting at position i has j
distinct values, but it should be excluded from the solution set for j distinct values if it is a
proper subsequence of a longer subsequence of distinct values. This will be so iff a(i,j+1)=j+1 or
a(i-1,j+1)=j+1.
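If I read this construction correctly, it can be sketched in C roughly as follows (row-major table, 0-based indices, BIG standing in for the 32766 sentinel; the function and parameter names are my own):

```c
#include <stdlib.h>

#define BIG 32766u  /* sentinel larger than any real distance */

static int cmp_u(const void *p, const void *q)
{
    unsigned a = *(const unsigned *)p, b = *(const unsigned *)q;
    return (a > b) - (a < b);
}

/* a is an n-by-t matrix (row-major).  After the call, a[i*t + k-1] is
 * the length of the shortest subsequence starting at cell i (0-based)
 * that contains k distinct values, or >= BIG if none exists.
 * x[i] holds values 1..t. */
static void build_table(const unsigned char *x, unsigned n, unsigned t,
                        unsigned *a)
{
    unsigned i, j;
    for (j = 0; j < t; j++)              /* last row: nothing seen yet */
        a[(n - 1) * t + j] = BIG;
    a[(n - 1) * t + (x[n - 1] - 1)] = 1;

    for (i = n - 1; i-- > 0; ) {         /* work backward over the rows */
        for (j = 0; j < t; j++) {
            unsigned d = a[(i + 1) * t + j];
            a[i * t + j] = d >= BIG ? BIG : d + 1;
        }
        a[i * t + (x[i] - 1)] = 1;       /* value at cell i: distance 1 */
    }
    for (i = 0; i < n; i++)              /* sort each row ascending */
        qsort(a + i * t, t, sizeof *a, cmp_u);
}
```

After sorting, each row reads off the shortest j-distinct subsequence lengths starting at that cell, exactly as described above.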

Based on these considerations, the solution sets for each j are constructed by scanning each column
starting with j=3 and proceeding to j=v. For this purpose the array a actually has an additional
row, 0, to contain the subscript of the first cell of the first minimal subsequence for each j, and
an extra column, v+1, containing values greater than c to ensure that the test for inclusion can be
carried out for j=v. For each j, l(j) is initialized to c+1 and k (the previous element in the list
of minimal subsequences) to 0. The scan proceeds as follows: if a(i,j) is less than l(j) the current
subsequence is shorter than the shortest found so far, so l(j) is updated and k reset to 0. If a(i,j)
is equal to l(j) the inclusion test is applied and if it passes the current subsequence is added to
the list. Once the entire column has been processed the list is terminated by setting a(k,j)=0.
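A hedged C rendering of this column scan (my own names; it assumes a sorted row-major n-by-t table where a[i*t + j-1] is the shortest length of a j-distinct subsequence starting at cell i, with values above n meaning "none", matching the construction described above):

```c
/* For a given j (3 <= j <= t), scan column j of the sorted table and
 * collect the 0-based start cells of the shortest minimal
 * j-subsequences, skipping any length-j all-distinct window that sits
 * inside a length-(j+1) all-distinct window (the inclusion test).
 * Returns the shortest length, or n+1 if no j-subsequence exists. */
static unsigned column_scan(const unsigned *a, unsigned n, unsigned t,
                            unsigned j, unsigned *starts, unsigned *ns)
{
    unsigned best = n + 1, i;
    *ns = 0;
    for (i = 0; i < n; i++) {
        unsigned len = a[i * t + (j - 1)];
        if (len > n) continue;          /* no j-distinct subsequence here */
        if (len < best) { best = len; *ns = 0; }
        if (len == best) {
            if (len == j && j < t) {    /* shortest possible: inclusion test */
                if (a[i * t + j] == j + 1) continue;           /* inside longer */
                if (i > 0 && a[(i - 1) * t + j] == j + 1) continue;
            }
            starts[(*ns)++] = i;
        }
    }
    return best;
}
```

For j = t the inclusion test is skipped, mirroring the extra column v+1 of large values in the original program.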

It is now a simple matter to read off the solution subsequences. The program utilizes sorta to sort
the rows. call sorta(addr(first element),n,w,c) sorts an array of n elements of width w in place
according the comparison function c. c(u,v) returns '1'b (bit(1) aligned) if the u-th element may
precede the v-th element, i.e., x(u)<=x(v).


Source:

%process mar(2,100) offset;
subsets: proc options(main) reorder;
dcl
(c,v,i,j,k,m,ar init((rank('a')-1))) bin fixed,
(x(c),a(0:c,v+1),l(v)) bin fixed ctl,
y(c) char(1) ctl,
sorta entry(ptr,bin fixed(31),bin fixed(31),
entry(bin fixed(31),bin fixed(31)) returns(bit(1) aligned)),
vfmt entry(bin float(53),bin fixed(15),bin fixed(15)) returns(char(50) var),
sysin file input,
sysprint file print,
(addr,max,min,rank) builtin;
get file(sysin) list(c,v);
put file(sysprint) edit('c: ',vfmt(c,10,0),', v: ',vfmt(v,10,0))(col(1),4 a);
allocate x,a,l,y;
get file(sysin) edit(y)(col(1),(c)a(1));
put file(sysprint) edit('Input: ',y)(col(1),a,(c)a(1));
do i=1 to c; x(i)=rank(y(i))-ar; end;
a(c,*)=max(c+1,v+2); a(c,x(c))=1;
do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;
do i=1 to c; call sorta(addr(a(i,1)),v,2,comp); end;
do j=3 to v; l(j)=c+1; k=0; m=j+1; a(0,m)=0;
do i=1 to c;
if a(i,j)<l(j) then do; l(j)=a(i,j); k=0; end;
if a(i,j)=l(j) then if a(i-1,m)~=m then if a(i,m)~=m then do; a(k,j),k=i; end;
end;
a(k,j)=0;
if a(0,j)>0 then do;
put file(sysprint) edit('Distinct: ',vfmt(j,10,0),', length: ',vfmt(l(j),10,0))(col(1),4 a);
i=a(0,j); k=0; do while(i~=0); k+=1;
put file(sysprint) edit(k,': (',vfmt(i,10,0),') ',(y(i+m) do m=0 to l(j)-1))
(col(1),f(10),3 a,(l(j))a(1));
i=a(i,j);
end;
end;
end;
comp: proc(u,v) returns(bit(1) aligned) reorder;
dcl (u,v) bin fixed(31);
return(a(i,u)<=a(i,v));
end comp;
end subsets;

Input:

56 10
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja

Output:

c: 56, v: 10
Input: aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
Distinct: 3, length: 3
1: (40) bef
2: (44) fgh
Distinct: 7, length: 7
1: (50) gihbfja
Distinct: 8, length: 16
1: (41) efffghggggihbfja
Distinct: 9, length: 23
1: (34) cbbbbbbefffghggggihbfja
Distinct: 10, length: 25
1: (32) dccbbbbbbefffghggggihbfja

Here is the output from another example:

c: 48, v: 10
Input: abcabcabcabcdefgabcabccdabcdeabcaaabbbcccdefghij
Distinct: 3, length: 3
1: (1) abc
2: (2) bca
3: (3) cab
4: (4) abc
5: (5) bca
6: (6) cab
7: (7) abc
8: (8) bca
9: (9) cab
10: (18) bca
11: (19) cab
12: (20) abc
13: (31) bca
Distinct: 4, length: 4
1: (23) cdab
2: (24) dabc
Distinct: 5, length: 5
1: (25) abcde
2: (26) bcdea
3: (27) cdeab
4: (28) deabc
Distinct: 7, length: 7
1: (10) abcdefg
2: (11) bcdefga
3: (12) cdefgab
4: (13) defgabc
Distinct: 8, length: 8
1: (41) cdefghij
Distinct: 9, length: 11
1: (38) bcccdefghij
Distinct: 10, length: 14
1: (35) abbbcccdefghij
robin
2010-05-19 12:54:53 UTC
Permalink
"James J. Weinkam" <***@nospicedham.cs.sfu.ca> wrote in message news:oZrIn.4426$z%***@edtnps83...

| I believe the program below does what you want and should be a couple of orders of magnitude faster
| than the method you outlined (based on your statement about how many "PL/I statements" are executed
| on an input of length 2329 with 66 distinct values). For illustrative purposes, the program uses the
| character representation you used in your example for input and output, but the algorithm itself
| will work for values of c up to 32765 and v up to 32764 if you have enough memory and patience.

Assuming that it works, a very impressive effort!
Terje Mathisen
2010-05-19 18:15:32 UTC
Permalink
Post by robin
| I believe the program below does what you want and should be a couple of orders of magnitude faster
| than the method you outlined (based on your statement about how many "PL/I statements" are executed
| on an input of length 2329 with 66 distinct values). For illustrative purposes, the program uses the
| character representation you used in your example for input and output, but the algorithm itself
| will work for values of c up to 32765 and v up to 32764 if you have enough memory and patience.
Assuming that it works, a very impressive effort!
Indeed.

I have not tried to read the code (or explanation) except to note that
it must be far better than O(n*n). :-)

With that given, how about a simple approach:

Scan through the input string from beginning to end, while maintaining a
set of lookup tables, one for each previous starting position. In each
lookup table we will have two things: The values seen so far, and the
total number of different characters.

For each possible length we remember the shortest string that so far has
resulted in this many unique values, updating it when we find a shorter
string.

This way the processing would consist of loading data[i], then for each
of the previous bytes taken as the starting position check if this value
hasn't been seen before, and if so, update the count and check if this
count/length is equal or better to the best seen so far:

for (i = 2; i < data_len; i++) {
unsigned char b = data[i];
for (j = 0; j < i-2; j++) {
if (!seen[j][b]) {
seen[j][b] = 1;
unique_values = ++count[j];
len = i-j+1;
if (len <= shortest[unique_values]) {
save(unique_values, len, j);
shortest[unique_values] = len;
}
}
}
}

The save function will generate a list of all possible "winning"
candidates, so it must be sorted and pruned before the final output but
that is trivial.

The main (only?) problem with the program above is that it is still
O(n*n), with data_len = 4096, the total number of iterations will be
1+2+3+4+...+4093 which is ~8M.

However, since the maximum possible number of unique values is the
character set size, even using all 256 possible byte values would mean
that for the longer sequences, starting early in the input, 15/16 of all
bytes will have been seen previously so the only code executed will be the

if (!seen[j][b]) {

test & branch.

With 4K 256-byte lookup tables we'll need 1 MB of cache, using bits
instead would compress this to 128K, but that is still too large to fit
in L1 cache, so probably not worth it.

I expect the running time to be ~ 10-20 cycles/inner loop iteration, or
80-160M cycles for the 4K input, letting the program finish in less than
a tenth of a second on any currently existing PC.

Is that fast enough?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Dick Wesseling
2010-05-19 19:34:56 UTC
Permalink
Post by Terje Mathisen
I have not tried to read the code (or explanation) except to note that
it must be far better than O(n*n). :-)
I have studied the code and the explanation. I too am impressed by its
elegance, and now I finally understand the problem statement.
I think that there is a solution with equal or better O() and a smaller
memory requirement.
Post by Terje Mathisen
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible,
It took me a while to understand what is meant by "as short as possible".
Post by Terje Mathisen
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
-------------- -------
4-in-14 7-in-7


I did not fully understand why 12-25 was to be rejected, after all
"abcd" is not a substring of "gihbfja".
Now I understand. 7-in-7 contains substrings 6-in-6, 5-in-5, 4-in-4
and so on. These are not to be printed because they are proper
substrings of "gihbfja". However, they define "as short as possible"
for lengths <= 7.

In other words, once you know the longest n-in-n sequence all shorter
sequences must consist entirely of unique symbols.

This suggests two different search strategies, one for long sequences and
one for short sequences.

Plan A:

Starting with the longest possible sequence search for n-in-m
sequences until n=m. This requires multiple passes over the input,
one for each unique value of n with m>=n (Or two passes actually, see
my first attempt).

Plan B:

Once the longest n-in-n sequence is known perform ONE more pass
over the input, searching for remaining maximal sequences with
3 <= length <n that consist of unique symbols only.

Finishing touch:
Sort the results if so desired.
Dick Wesseling
2010-05-21 01:46:27 UTC
Permalink
/*
Here is the final version of my solution.

Let: N number of cells, size of the input
V number of values, size of symbol set.
M the number of distinct values in the largest sequence.
T the size of the largest n-in-n sequence, i.e. the
largest sequence consisting of distinct values only.

As explained in my previous message <4bf43d60$0$22939$***@news6.xs4all.nl>
the algorithm uses two different search strategies:

Plan A

Find the largest sequence with size M

for n from M downwards search n-in-m sequences, keeping only sequences
with minimal value of m.
Print sequence if n<m. (If n=m the same sequence will be found again
in the next step).

Now n=m=T.

Plan B
Prints all maximal n-in-n sequences.

Plan A scans the input 1+(M-T) .. 2+(M-T) times, depending on the
input position of the longest sequence.
Plan B scans the input once.

The running time is therefore bounded by O(N * (3+(M-T))), which
is typically better than or equal to O(N*C).
For small values of N the cost of erasing the frequency array should
also be accounted for.

Two arrays are used:
found[N] Intermediate results of plan A.
freq[C] Frequency count for all symbols in the sliding window


Finding sequences is done using a straightforward sliding window
scan of the input. The values in freq[] are updated when the window
edges move and the number of unique symbols is updated when a
frequency count changes between zero and non-zero.


The output is unsorted.

*/

#include <stdio.h>
#include <string.h>
#include <strings.h> /* bzero() */

#define C 256 // Size of alphabet
#define ONE 1 // Pascal array bias convention

void printres (unsigned char *data, unsigned pos, unsigned l, unsigned w)
{
unsigned i;
printf("len %2d aperture %2d pos %2d\t", l, w, ONE+pos);
for (i=pos; i<pos+w; i++) printf("%c", data[i]);
printf("\n");
}

void findseqs (unsigned char *data, unsigned len)
{
// printf("%s\n", data);
unsigned i;
unsigned uniq = 0; // Nr unique symbols in window
unsigned apub = ~0; // Aperture upper bound (aka "m")

unsigned nrfound = /* to keep gcc happy */ 0;
unsigned found[len]; // intermediate results

unsigned freq[C]; // Frequency count
bzero(freq, sizeof(freq));

# define inc(c) do { if(!freq[c]) uniq++; freq[c]++; } while(0)
# define dec(c) do { freq[c]--; if(!freq[c]) uniq--; } while(0)
# define WINSIZE ((r-l)+1)

// Start with largest possible sequence: the entire input string

unsigned l = 0; // Start of window
unsigned r = len-1; // Inclusive end of window
unsigned lim = len-1; // inclusive end of data

for (i=0; i<len; i++) {
inc(data[i]);
}
if (uniq < 3) return;

// Trim left & right edges

while (freq[data[r]]>1) { dec(data[r]); r--; }
while (freq[data[l]]>1) { dec(data[l]); l++; }

// From here on we want to find only sequences that are shorter or
// equal to what we have found above:

unsigned goal = uniq;

while(1) {

// data[l..r] contains "goal" unique symbols
// freq[] number of occurrences of each symbol in data[l..r]
// data[l] is a unique symbol
// data[r] is a unique symbol

if (WINSIZE <= apub) { // Short enough?

// A candidate. May be premature if we have shorter sequences
// with the same nr of unique symbols.

if (WINSIZE < apub) {
nrfound = 0; // Dump old candidates
apub = WINSIZE; // Next matches must be <= this one
}
found[nrfound] = l; // Add to list of tentative results
nrfound++;
}

// Slide window to next match.

dec(data[l]); l++;
while (freq[data[l]]>1) { dec(data[l]); l++; }
while (uniq < goal && r<lim) { r++; inc(data[r]); }

if (uniq == goal) continue;

// We've hit the right edge

bzero(freq, sizeof(freq)); // Reset window & statistics
uniq = l = r = 0;

if (apub == goal) break; // If *only* distinct values proceed
// with plan B.

for (i=0; i<nrfound; i++) { // Print the results
printres(data, found[i], goal, apub);
}

// Now the next shorter sequence.

goal--;

inc(data[l]);
do {
r++;
inc(data[r]);
} while (uniq != goal);

while (freq[data[l]] != 1) {
dec(data[l]);
l++;
}
// Now the precondition holds. Also, the window size is smaller
// than the aperture upper bound from the previous iteration.
}

// Plan B
// Now we consider only sequences with all symbols distinct.
// The freq[] array therefore behaves like a bitmap with values 0&1 only.

freq[data[l]] = 1;

while (r<lim) {
unsigned char cl, cr;
cl = data[l];
cr = data[r+1];

if (freq[cr] != 0) { // Maximal sequence?
if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
do { // Move left edge until
cl = data[l]; // we have all symbols unique.
freq[cl] = 0;
l++;
} while (cl != cr);
}
r++;
freq[data[r]]=1;
}

// Final sequence.

if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
return;
}

int main(int argc, char **argv)
{
if (argc > 2) {
fprintf(stderr, "Usage: %s {string}\n", argv[0]);
return 1;
}
if (argc == 1) {
// read from standard input
unsigned char buf[4096];
size_t inlen = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
findseqs( buf, (unsigned) inlen);
}
if (argc == 2) {
findseqs((unsigned char *)argv[1], strlen(argv[1]));
}

return 0;
}
James J. Weinkam
2010-05-27 00:43:25 UTC
Permalink
This message was originally posted on 2010/05/21 at 23:09 PDT.
It has not yet appeared on any of the newsgroups; numerous other messages in this thread have
appeared since then.

When I tried to repost it just now, a message box appeared saying "Sending of message failed. You
may only send to one newsgroup at a time."

Accordingly, I am now posting it to each group separately.
Post by Dick Wesseling
/*
Here is the final version of my solution.
Let: N number of cells, size of the input
V number of values, size of symbol set.
M the number of distinct values in the largest sequence.
T the size of the largest n-in-n sequence, i.e. the
largest sequence consisting of distinct values only.
Plan A
Find the largest sequence with size M
for n from M downwards search n-in-m sequences, keeping only sequences
with minimal value of m.
Print sequence if n<m. (If n=m the same sequence will be found again
in the next step).
Now n=m=T.
Plan B
Prints all maximal n-in-n sequences.
Plan A scans the input 1+(M-T) .. 2+(M-T) times, depending on the
input position of the longest sequence.
Plan B scans the input once.
The running time is therefore bounded by O(N * (3+(M-T))), which
is typically better than or equal to O(N*C).
For small values of N the cost of erasing the frequency array should
also be accounted for.
found[N] Intermediate results of plan A.
freq[C] Frequency count for all symbols in the sliding window
Finding sequences is done using a straightforward sliding window
scan of the input. The values in freq[] are updated when the window
edges move and the number of unique symbols is updated when a
frequency count changes between zero and non-zero.
The output is unsorted.
...
When I did my original solution a few days ago, I didn't even consider a moving window approach
because the OP had said that it's performance was terrible. He also mentioned complexities such as
having to restart the scan multiple times, etc.

Seeing your sliding window approach inspired me to give that approach a try.

I have come up with a streamlined version of your algorithm that I believe meets the OP's
specifications. All of us who have participated in discussing this problem have been bandying about
terms such as unique, distinct, minimal, shortest, without making the intended interpretation
completely clear. I had to do a bit of reading between the lines to come up with what I believe are
the OP's intended specifications, which is why I only say that I believe my solution meets his
specifications. I also inadvertently added to the confusion: by the time I wrote my first solution,
I did not have the original post in front of me; I remembered that the OP had referred to cells and
values so I used C for the number of cells and V for the number of values. Only later did I discover
that he had used N and T for these quantities. I have renamed some of the variables in my program to
make it consistent with the OP's notation.

Let me suggest the following definitions and nomenclature:

The data consist of a sequence of n cells stored in an array, x, with subscripts running from 1 to n.

Each cell contains one of a set of t values which are integers between 1 and t inclusive.

A subsequence of length l consists of l consecutive cells from the array x.

A k-subsequence is a subsequence of length l>=k which contains exactly k of the t possible values.

A minimal k-subsequence is one for which the removal of the first or last cell would reduce it to a
(k-1)-subsequence.

A shortest k-subsequence is a minimal k-subsequence whose length is less than or equal to that of
any other minimal k-subsequence in the array.

A shortest possible k-subsequence is a k-subsequence of length k.

If a minimal k-subsequence is not shortest possible, it contains at least one repeated value. Since
every minimal 1- or 2-subsequence is shortest possible, shortest k-subsequences are only of interest
for k>=3. The array may but need not contain a shortest possible k-subsequence for 3<=k<=t.
If the array contains a shortest possible k-subsequence, it also contains shortest possible
j-subsequences for 3<=j<k.

Assuming that each of the t values appears in the array at least once, the array contains at least
one shortest k-subsequence for 1<=k<=t.

Within this framework, here is my understanding of the OP's requirements: "For 3<=k<=t, list all
shortest k-subsequences except, for k<t, omit any shortest possible k-subsequences which are
contained within a shortest possible (k+1)-subsequence."

A shortest possible k-subsequence, z, is not contained within a shortest possible k+1-subsequence
iff each of the immediately preceding and following cells (if it exists) contains a value already
present in z. I shall refer to this as the inclusion test. Note that the inclusion test always
passes for k=t.
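A small C sketch of this inclusion test (names are mine; a 256-entry membership table stands in for the frequency array, and 0-based indices replace the sentinel cells):

```c
#include <stdbool.h>

/* Inclusion test for a length-k window of all-distinct values starting
 * at 0-based index i: the window survives iff the cell before it and
 * the cell after it (where they exist) each repeat a value already
 * inside the window.  If either neighbour brings a new value, the
 * window is a proper part of a longer all-distinct subsequence and is
 * dropped from the result set. */
static bool passes_inclusion(const unsigned char *x, unsigned n,
                             unsigned i, unsigned k)
{
    bool inside[256] = { false };
    unsigned j;
    for (j = i; j < i + k; j++)
        inside[x[j]] = true;
    if (i > 0 && !inside[x[i - 1]])
        return false;        /* extends left into an all-distinct (k+1)-window */
    if (i + k < n && !inside[x[i + k]])
        return false;        /* extends right into an all-distinct (k+1)-window */
    return true;
}
```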

In my new program, I have abandoned the use of characters for the external representation and
switched to integers between 1 and t. I have also changed the data representation from bin fixed(15)
to bin fixed(16) unsigned. This allows values of n and t up to 65534. Finally, I added timing of the
input, processing, and output.

To facilitate application of the inclusion test, the data array, x, has sentinel cells x(0) and
x(n+1) both set equal to the non existent value 0. The frequency table, f, has an extra entry f(0)
set equal to 1.

For each k, 3<=k<=t, a window delimited by the subscripts i on the left and j on the right is moved
across the entire data array. The width of the window is w=j-i+1. Initially, the window is at cell 1
(i=1) and is empty (w=0). The relationship w=j-i+1 gives j=0. Since no subsequence can be longer
than n, l, the length of the shortest minimal k-subsequence found so far is set to n+1; the number
of shortest minimal k-subsequences found so far, s, is set to 0; the frequency table, f, is
initialized to 0 except for f(0) which is set to 1 as previously explained; finally the number of
distinct values within the window, d, is set to 0.

The scan of the data consists of repeatedly finding the next minimal k-subsequence and deciding
whether or not to remember it. This is done as follows: first the window is increased on the right
one cell at a time until the number of distinct values is equal to k or the end of the data is
reached, in which case there are no more minimal k-subsequences so the scan is terminated. At this
point, the number of instances of the value at position j is necessarily 1. Next, values are removed
from the window on the left until the leftmost value has only 1 instance in the window. At this point
the window contains the leftmost minimal k-subsequence not already examined. Next the width of the
window, w, is computed and the disposition of the subsequence decided. If w>l, the subsequence is
skipped; if w<l, a shorter minimal k-subsequence has been found, so the previous set of
k-subsequences is thrown away and the subsequence becomes the first element in a new set if it
passes the inclusion test, otherwise the set is left empty; otherwise w=l, so the subsequence is
added to the set of remembered k-subsequences if it passes the inclusion test, which is only applied
if w=k.

Finally the first cell is deleted from the window leaving a not necessarily minimal
(k-1)-subsequence and the loop repeats.

Here is the PL/I source code:

%process mar(2,100) offset;
subsets: proc options(main) reorder;
dcl
(n,t,i,j,k,v,d,w,l,s) bin fixed(16) unsigned,
(x(0:n+1),a(n),f(0:t)) bin fixed(16) unsigned ctl,
(ti,to,tc init(0)) bin float(53),
vfmt entry(bin float(53),bin fixed(15),bin fixed(15)) returns(char(50) var),
uttime0 entry, uttime entry returns(bin float(53)),
sysin file input,
output file output stream;
call uttime0;
get file(sysin) list(n,t);
allocate x,a,f;
get file(sysin) list((x(i) do i=1 to n)); x(0),x(n+1)=0;
ti=uttime;
open file(output) title('/stdout:,type(crlf),recsize(32000)');
put file(output) edit('n: ',vfmt(n,10,0),', t: ',vfmt(t,10,0))(col(1),4 a);
put file(output) edit('Input: ',(vfmt(x(i),10,0) do i=1 to n))(col(1),a,(n)(a,x(1)));
to=uttime;
do k=3 to t; l=n+1; f(0)=1; do i=1 to t; f(i)=0; end; d=0; i=1; j=0; s=0;
scan: do forever;
do j=j+1 to n until(d=k); v=x(j); f(v)+=1; if f(v)=1 then d+=1; end; if d<k then leave scan;
v=x(i); do while(f(v)>1); f(v)-=1; i+=1; v=x(i); end; w=j-i+1;
if w>l then;
else if w<l then do; l=w; s=0;
if w>k then do; s=1; a(1)=i; end;
else if f(x(i-1))>0 then if f(x(j+1))>0 then do; s=1; a(1)=i; end;
end;
else do;
if w>k then do; s+=1; a(s)=i; end;
else if f(x(i-1))>0 then if f(x(j+1))>0 then do; s+=1; a(s)=i; end;
end;
f(x(i))=0; d-=1; i+=1;
end;
tc+=uttime;
if s>0 then do;
put file(output) edit('Distinct: ',vfmt(k,10,0),', length: ',vfmt(l,10,0))(col(1),4 a);
do j=1 to s; i=a(j);
put file(output) edit(j,': (',vfmt(i,10,0),') ',(vfmt(x(i+v),10,0) do v=0 to l-1))
(col(1),f(6),3 a,(l)(a,x(1)));
end;
end;
to+=uttime;
end;
put file(output) skip(2) edit('Elapsed time: input ',vfmt(ti,15,8),', processing ',
vfmt(tc,15,8),', output ',vfmt(to,15,8),'.')(col(1),7 a);
end subsets;

Here is the output from one small example of randomly generated data:

n: 64, t: 12
Input: 11 12 10 9 3 7 10 8 6 7 6 3 5 12 10 10 11 2 12 9 6 4 5 4 5 6 7 3 12 5 4 1 1 8 12 6 7 11 3 8 3
11 11 7 11 6 9 8 9 1 3 11 1 6 9 10 2 7 10 8 4 9 7 5
Distinct: 3, length: 3
1: (40) 8 3 11
Distinct: 4, length: 4
1: (7) 10 8 6 7
Distinct: 5, length: 5
1: (44) 7 11 6 9 8
2: (48) 8 9 1 3 11
Distinct: 6, length: 6
1: (1) 11 12 10 9 3 7
2: (4) 9 3 7 10 8 6
3: (10) 7 6 3 5 12 10
4: (24) 4 5 6 7 3 12
5: (35) 12 6 7 11 3 8
6: (57) 2 7 10 8 4 9
7: (59) 10 8 4 9 7 5
Distinct: 7, length: 7
1: (26) 6 7 3 12 5 4 1
2: (33) 1 8 12 6 7 11 3
Distinct: 8, length: 8
1: (16) 10 11 2 12 9 6 4 5
2: (51) 3 11 1 6 9 10 2 7
Distinct: 9, length: 10
1: (30) 5 4 1 1 8 12 6 7 11 3
2: (51) 3 11 1 6 9 10 2 7 10 8
3: (52) 11 1 6 9 10 2 7 10 8 4
Distinct: 10, length: 11
1: (51) 3 11 1 6 9 10 2 7 10 8 4
Distinct: 11, length: 14
1: (51) 3 11 1 6 9 10 2 7 10 8 4 9 7 5
Distinct: 12, length: 19
1: (16) 10 11 2 12 9 6 4 5 4 5 6 7 3 12 5 4 1 1 8

Elapsed time: input 0, processing 0, output 0.

The table below summarizes the results of some larger examples, a couple of which go way beyond the
limits indicated by the OP.

n      t     i    c     o      max shortest possible  shortest minimal t-subsequence
64     12    0    0     0      8(2)                   19(1)
4096   256   .01  .03   .19    54(1)                  1181(1)
8192   250   .02  .06   .19    54(1)                  994(1)
16384  256   .03  .13   .2     64(2)                  981(1)
32000  250   .06  .21   .31    64(1)                  931(1)
60000  800   .13  1.18  1.85   117(1)                 4043(1)
65534  4096  .15  7.42  42.68  220(1)                 29213(1)

All times are in seconds. The numbers in parentheses are the number of instances; the first example
has 2 shortest possible minimal 8-subsequences and the third example has 2 shortest possible minimal
64-subsequences.

The program is very fast. The complexity is O(n*t). For n=4096 and t=256, which is the maximum size
contemplated by the OP, the computational portion of the program requires only 30ms.
Dick Wesseling
2010-05-27 19:32:51 UTC
Permalink
Post by James J. Weinkam
%process mar(2,100) offset;
subsets: proc options(main) reorder;
This source code really needs some indenting and white space!
As it is, it is almost unreadable. :-(
So is your reply, unless you put comp.lang.pl1 back in the newsgroups
list.
Post by James J. Weinkam
The program is vary fast. The complexity is O(n*t). For n=4096 and t=256
Since my approach terminates each scan on the first collision, it is
O(n*sqrt(t)), while keeping the working set very low.
Post by James J. Weinkam
which is the maximum size
contemplated by the OP, the computational portion of the program
requires only 30ms.
You need to run your program on all of the same data sets given by the
OP, verifying both the results and the runtime!
As I wrote yesterday, my current program has done the 2300 char set in
less than a ms total, both scanning and sorting/filtering.
OTOH, all of the competing solutions we've come up with are _far_ faster
than what could be needed. :-)
Terje
Robert AH Prins
2010-05-19 22:19:27 UTC
Permalink
Post by Terje Mathisen
Post by robin
| I believe the program below does what you want and should be a
couple of orders of magnitude faster
| than the method you outlined (based on your statement about how many
"PL/I statements" are executed
| on an input of length 2329 with 66 distinct values). For
illustrative purposes, the program uses the
| character representation you used in your example for input and
output, but the algorithm itself
| will work for values of c up to 32765 and v up to 32764 if you have
enough memory and patience.
Assuming that it works, a very impressive effort!
Indeed.
I have not tried to read the code (or explanation) except to note that
it must be far better than O(n*n). :-)
But it does contain a sort...
Post by Terje Mathisen
Scan through the input string from beginning to end, while maintaining a
set of lookup tables, one for each previous starting position. In each
lookup table we will have two things: The values seen so far, and the
total number of different characters.
For each possible length we remember the shortest string that so far has
resulted in this many unique values, updating it when we find a shorter
string.
This way the processing would consist of loading data[i], then for each
of the previous bytes taken as the starting position check if this value
hasn't been seen before, and if so, update the count and check if this
for (i = 2; i < data_len; i++) {
byte b = data[i];
for (j = 0; j < i-2; j++) {
if (!seen[j][b]) {
seen[j][b] = 1;
unique_values = ++count[j];
len = i-j+1;
if (len <= shortest[unique_values]) {
save(unique_values, len, j);
shortest[unique_values] = len;
}
}
}
}
The save function will generate a list of all possible "winning"
candidates, so it must be sorted and pruned before the final output but
that is trivial.
The main (only?) problem with the program above is that it is still
O(n*n): with data_len = 4096, the total number of iterations will be
1+2+3+4+...+4093, which is ~8M.
However, since the maximum possible number of unique values is the
character set size, even using all 256 possible byte values would mean
that for the longer sequences, starting early in the input, 15/16 of all
bytes will have been seen previously so the only code executed will be the
if (!seen[j][b]) {
test & branch.
With 4K 256-byte lookup tables we'll need 1 MB of cache, using bits
instead would compress this to 128K, but that is still too large to fit
in L1 cache, so probably not worth it.
I expect the running time to be ~ 10-20 cycles/inner loop iteration, or
80-160M cycles for the 4K input, letting the program finish in less than
a tenth of a second on any currently existing PC.
Is that fast enough?
You seem to have come up with a carbon copy of the approach Paul Green
used in 1998 - I did finally find a printout of our email conversation
from 1998 in one of the unopened boxes on the loft. I will scan and
(hopefully) OCR(ish) it when I come back from Vilnius next week.

His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to the why.

If there is interest, I can also post my original sliding window code.

FWIW, the number of statements executed is obviously a rather crude way
of assessing the efficiency of a program. However, I also ran the PL/I
program with Strobe, a mainframe performance analyzing tool. The result
was even more devastating, *one single loop* to see if a next value is
already in the series was responsible for 95% of the CPU time of the
program. Ouch!

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
James J. Weinkam
2010-05-21 01:38:15 UTC
Permalink
Post by Robert AH Prins
His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to the why.
Can you send a copy of the data set that fails? Also, if it isn't prying into your business, would
you describe the application this problem is abstracted from? You indicated in your original post
that the values (characters in the posted example) were actually indices into a table in the real
application.

Finally, the code I posted has two bugs that I know of so far; but they didn't prevent the algorithm
from getting the correct answer. Has anyone spotted them?
Terje Mathisen
2010-05-21 06:56:52 UTC
Permalink
Post by James J. Weinkam
Post by Robert AH Prins
His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to the why.
Can you send a copy of the data set that fails? Also, if it isn't prying
into your business, would you describe the application this problem is
abstracted from? You indicated in your original post that the values
(characters in the posted esample) were actually indices into a table in
the real application.
I would also love to see that, I really do believe that my current
approach is sound.
Post by James J. Weinkam
Finally, the code I posted has two bugs that I know of so far; but they
didn't prevent the algorithm from getting the correct answer. Has anyone
spotted them?
No, we have all been too busy debugging our own algorithms! :-)

The raw output (not filtered to only keep the best solution) is as
follows for the sample input:

The rows are in order:

Nr of unique symbols
String length
Starting position
The found string

3 5 11 abbbc
4 14 11 abbbcbcccccccd
3 4 31 dccb
4 10 31 dccbbbbbbe
5 30 11 abbbcbcccccccddddddddccbbbbbbe
3 3 39 bef
4 9 33 cbbbbbbef
5 11 31 dccbbbbbbef
6 31 11 abbbcbcccccccddddddddccbbbbbbef
4 6 39 befffg
6 14 31 dccbbbbbbefffg
7 34 11 abbbcbcccccccddddddddccbbbbbbefffg
3 3 43 fgh
4 6 40 efffgh
5 7 39 befffgh
6 13 33 cbbbbbbefffgh
7 15 31 dccbbbbbbefffgh
8 35 11 abbbcbcccccccddddddddccbbbbbbefffgh
6 12 39 befffghggggi
8 20 31 dccbbbbbbefffghggggi
9 40 11 abbbcbcccccccddddddddccbbbbbbefffghggggi
3 3 49 gih
3 3 50 ihb
4 4 49 gihb
3 3 51 hbf
4 4 50 ihbf
5 5 49 gihbf
3 3 52 bfj
4 4 51 hbfj
5 5 50 ihbfj
6 6 49 gihbfj
7 15 40 efffghggggihbfj
9 24 31 dccbbbbbbefffghggggihbfj
10 44 11 abbbcbcccccccddddddddccbbbbbbefffghggggihbfj
3 3 53 fja
4 4 52 bfja
5 5 51 hbfja
6 6 50 ihbfja
7 7 49 gihbfja
8 16 40 efffghggggihbfja
9 23 33 cbbbbbbefffghggggihbfja
10 25 31 dccbbbbbbefffghggggihbfja

Filtering is trivial:

First remove all non-shortest strings for a given # of unique symbols
Next remove all strings that either start or end at the same position as
another string while being exactly one character shorter. I.e. remove
all pure substrings.

Doing the first stage manually on the set above result in:

3 3 39 bef
3 3 43 fgh
3 3 49 gih
3 3 50 ihb
4 4 49 gihb
3 3 51 hbf
4 4 50 ihbf
5 5 49 gihbf
3 3 52 bfj
4 4 51 hbfj
5 5 50 ihbfj
6 6 49 gihbfj
3 3 53 fja
4 4 52 bfja
5 5 51 hbfja
6 6 50 ihbfja
7 7 49 gihbfja
8 16 40 efffghggggihbfja
9 23 33 cbbbbbbefffghggggihbfja
10 25 31 dccbbbbbbefffghggggihbfja

The second stage removes the proper substrings, starting from the shortest:

3 3 39 bef
3 3 43 fgh
7 7 49 gihbfja
8 16 40 efffghggggihbfja
9 23 33 cbbbbbbefffghggggihbfja
10 25 31 dccbbbbbbefffghggggihbfja

Except for the use of zero-based counting of starting position, this
looks identical to the original request. :-)

Running time is still O(N*sqrt(C)) where N is the input length and C is
the character size, and total memory use is O(N), i.e. the same as the
data itself.

With 4K inputs total data size would be 12KB, but only 4.5 KB would be
within the working set at any given time, so L1 hit rates are very close
to 100%.

int scan(byte *data, int data_len)
{
int c = 0;
for (int i = 2; i < data_len; i++) {
byte b = data[i];
for (int j = i-1; j >= 0; j--) {
// A previously seen character cannot end a new longest sequence!
// It also cannot add to the sequence count for this or any
// longer string, so stop scanning!
if (data[j] == b) break;

int len = i-j+1;
// Increment the count of unique values seen when starting
// from position [j]
int unique_values = ++count[j];
if (len <= shortest[unique_values]) {
if (len >= 3)
save(unique_values, len, j);
shortest[unique_values] = len;
c++;
}
}
}
return c;
}


Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Jerome H. Fine
2010-05-21 13:08:31 UTC
Permalink
[Snip]
With 4K inputs total data size would be 12KB, but only 4.5 KB would be within the working set at
any given time, so L1 hit rates are very close to 100%.
int scan(byte *data, int data_len)
{
int c = 0;
for (int i = 2; i < data_len; i++) {
byte b = data[i];
for (int j = i-1; j >= 0; j--) {
// A previously seen character cannot end a new longest sequence!
// It also cannot add to the sequence count for this or any
// longer string, so stop scanning!
if (data[j] == b) break;
int len = i-j+1;
// Increment the count of unique values seen when starting
// from position [j]
int unique_values = ++count[j];
if (len <= shortest[unique_values]) {
if (len >= 3)
save(unique_values, len, j);
shortest[unique_values] = len;
c++;
}
}
}
return c;
}
As Terje has pointed out a number of times, taking into account
the hit rate for L1 and L2 cache will have a significant impact
on how fast the code runs. While many of the other contributions
to this thread have ignored this aspect, a program which requires
only L1 cache for most of its data can run many times faster,
perhaps up to ten times as fast based on some of the tests I did
about a year ago.

Jerome Fine
Terje Mathisen
2010-05-21 15:21:35 UTC
Permalink
[Snip]
With 4K inputs total data size would be 12KB, but only 4.5 KB would
be within the working set at any given time, so L1 hit rates are very
close to 100%.
int scan(byte *data, int data_len)
{
int c = 0;
for (int i = 2; i < data_len; i++) {
byte b = data[i];
for (int j = i-1; j >= 0; j--) {
// A previously seen character cannot end a new longest sequence!
// It also cannot add to the sequence count for this or any
// longer string, so stop scanning!
if (data[j] == b) break;
int len = i-j+1;
// Increment the count of unique values seen when starting
// from position [j]
int unique_values = ++count[j];
if (len <= shortest[unique_values]) {
if (len >= 3)
save(unique_values, len, j);
shortest[unique_values] = len;
c++;
}
}
}
return c;
}
As Terje has pointed out a number of times, taking into account
the hit rate for L1 and L2 cache will have a significant impact
on how fast the code runs. While many of the other contributions
to this thread have ignored this aspect, a program which requires
only L1 cache for most of its data can run many times faster,
perhaps up to ten times as fast based on some of the tests I did
about a year ago.
I have now timed my code:

The full scan for potential solutions with the given 56-byte input
string took about 8K cycles, while my big-O estimate said O(n*sqrt(c)),
which in our case is 56*sqrt(10) ~= 180, i.e. each iteration took a
little over 40 cycles.

Since the number of potential solutions was quite high (40), the part of
the inner loop that saved each of them used significant amounts of time,
including a significant number of mis-predicted branches.

Still, running at 130+ cycles per input byte is fast enough to
handle significant amounts of data.

I would love to have a copy of the "problematic" 4K input which the
original poster have alluded to!

Terje

PS. The final filter operation runs in O(n*log(n)) so it should be much
faster than the scan, at least for large (n). With the 56-byte input
however, the filter took about twice as long as the scan, i.e. the
constant terms were significantly higher!

This meant that the total processing time was actually about 25K cycles,
or 400+ cycles/byte. On my laptop (2.2GHz) this corresponds to "only"
5.5 MB/s. :-(
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
prino
2010-05-21 16:34:24 UTC
Permalink
On 21 May, 18:21, Terje Mathisen <"terje.mathisen at
Post by Terje Mathisen
 > [Snip]
 > With 4K inputs total data size would be 12KB, but only 4.5 KB would
be within the working set at any given time, so L1 hit rates are very
close to 100%.
 > int scan(byte *data, int data_len)
 > {
 > int c = 0;
 > for (int i = 2; i < data_len; i++) {
 > byte b = data[i];
 > for (int j = i-1; j >= 0; j--) {
 > // A previously seen character cannot end a new longest sequence!
 > // It also cannot add to the sequence count for this or any
 > // longer string, so stop scanning!
 > if (data[j] == b) break;
 > int len = i-j+1;
 > // Increment the count of unique values seen when starting
 > // from position [j]
 > int unique_values = ++count[j];
 > if (len <= shortest[unique_values]) {
 > if (len >= 3)
 > save(unique_values, len, j);
 > shortest[unique_values] = len;
 > c++;
 > }
 > }
 > }
 > return c;
 > }
As Terje has pointed out a number of times, taking into account
the hit rate for L1 and L2 cache will have a significant impact
on how fast the code runs. While many of the other contributions
to this thread have ignored this aspect, a program which requires
only L1 cache for most of its data can run many times faster,
perhaps up to ten times as fast based on some of the tests I did
about a year ago.
The full scan for potential solutions with the given 56-byte input
string took about 8K cycles, while my big-O estimate said O(n*sqrt(c))
which in our case is O(56*sqrt(10)) ~= 180, i.e. each iteration took a
little over 40 cycles.
Since the number of potential solutions was quite high (40), the part of
the inner loop that saved each of them used significant amounts of time,
including a significant number of mis-predicted branches.
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
I would love to have a copy of the "problematic" 4K input which the
original poster have alluded to!
The current set of input data comprises "just" 2329 values, and
right now it might potentially increase to around 3K. The current set
is processed without problems; the problem arose when I started using
the code for subsets of this. The output for three of them was plain
wrong, and this could easily be verified manually - I am right now in
Lithuania and didn't really want to take it, but the shortest
erroneous input set has, IIRC, *only six or seven elements*.

As for the 4k limit, there is a chance that some users may pool their
data, and that might increase the "strings" beyond 4k values. The
values themselves will never exceed 255.

I will post the three problematic strings, and the output that results
from them, on Wednesday when I'm back in Belgium - program will be in
Pascal or PL/I, take your pick. :)
Post by Terje Mathisen
PS. The final filter operation runs in O(n*log(n)) so it should be much
faster than the scan, at least for large (n). With the 56-byte input
however, the filter took about twice as long as the scan, i.e. the
constant terms were significantly higher!
This meant that the total processing time was actually about 25K cycles,
or 400+ cycles/byte. On my laptop (2.2GHz) this corresponds to "only"
5.5 MB/s. :-(
Robert
Dick Wesseling
2010-05-22 04:13:17 UTC
Permalink
Post by Terje Mathisen
The full scan for potential solutions with the given 56-byte input
string took about 8K cycles, while my big-O estimate said O(n*sqrt(c))
which in our case is O(56*sqrt(10)) ~= 180, i.e. each iteration took a
little over 40 cycles.
That sounds reasonable. I've timed my solution also, and depending on
the CPU and compiler version and switches I get values from 7200 to
9999 cycles.
Post by Terje Mathisen
Since the number of potential solutions was quite high (40), the part of
the inner loop that saved each of them used significant amounts of time,
including a significant number of mis-predicted branches.
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
Terje Mathisen
2010-05-23 08:37:59 UTC
Permalink
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
It is worse (or better, seen from my point of view):

With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)

The problem was that with totally random data, there are a _lot_ of both
potential and real solutions: I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.

The filter process here took a _lot_ of time though: 8M cycles!

This means that it needed 2000 cycles/input byte. :-(

This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Dick Wesseling
2010-05-23 18:14:47 UTC
Permalink
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
Are you using my data set? I completed and fixed (or broke?) your
code as follows:

#define C 0x100
#define OSIZ 409600
unsigned outpos;

struct {
unsigned len;
unsigned pos;
unsigned w;
} output[OSIZ];

static
void save(unsigned l, unsigned w, unsigned pos)
{
#if 1
output[outpos].len = l ;
output[outpos].pos = pos;
output[outpos].w = w ;
#endif
outpos++;
}


int scan(byte *data, int data_len)
{
unsigned count[data_len];
int shortest[C+1];
int i;
for (i=0; i<C+1; i++) shortest[i] = 0x7FFFFFFF;
for (i=0; i<data_len; i++) count[i] = 1;

int c = 0;
for (i = 1; // not 2: else "abc" won't work


This version took about 15M cycles on my athlon, 10M cycles if I
just count the solutions without actually saving them.
Post by Terje Mathisen
The problem was that with totally random data, there are a _lot_ of both
potential and real solutions, I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.
Then we're definitely not using the same data set. When I ran your
program I got 78059 potential solutions instead. My program found
831 solutions, all of which were in your unfiltered output.
Post by Terje Mathisen
The filter process here took a _lot_ of time though: 8M cycles!
This means that it needed 2000 cycles/input byte. :-(
This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...
A slightly optimized version of my program - see below - took
23M cycles on the same data set. Since I only timed the scan phase
of your program and not the filter phase it appears our approaches
are comparable for random data, not as bad as I feared.

Here's my optimized inner loop

#if 0 // 30M cycles for 4K random data
// Note: previous "final" version posted still had a bug here

dec(data[l]); l++;
while (uniq < goal && r<lim) { r++; inc(data[r]); }
while (freq[data[l]]>1) { dec(data[l]); l++; }
#else
// 23M cycles for 4K random data
// The code generated by gcc still hurts my eyes, but this
// simplified loop is a good starting point for asm.


freq[data[l]]=0; l++; uniq--; // data[l] is unique

while (r<lim) { // optimize controlling expression of
unsigned char cr; // loop
r++;
cr = data[r];
unsigned frc = freq[cr];
freq[cr] = frc+1;
if (frc == 0) {
uniq++;
break;
}
}
while (freq[data[l]]>1) { // "uniq" does not change
freq[data[l]]--;
l++;
}
#endif

Once I get rid of the variable "uniq", which is not essential, the
remaining variables data[], freq[], l and r will fit into the x86
register set and the inner loop becomes 14 instructions per input
byte per iteration of the outermost loop.
Terje Mathisen
2010-05-23 20:11:57 UTC
Permalink
Post by Dick Wesseling
Post by Terje Mathisen
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
Are you using my data set? I completed and fixed (or broke?) your
Definitely not the same data set, I am simply using
(rand() & 63) + 33;
to generate random (but printable) bytes.

With rand & 255 I get slightly different counts.
Post by Dick Wesseling
int scan(byte *data, int data_len)
{
unsigned count[data_len];
int shortest[C+1];
int i;
for (i=0; i<C+1; i++) shortest[i] = 0x7FFFFFFF;
for (i=0; i<data_len; i++) count[i] = 1;
int c = 0;
for (i = 1; // not 2: else "abc" won't work
You're absolutely right, in order to increment the count to 3 the scan
has to start with the second char. Thanks!
Post by Dick Wesseling
Once I get rid of the variable "uniq", which is not essential, the
remaining variables data[], freq[], l and r will fit into the x86
register set and the inner loop becomes 14 instructions per input
byte per iteration of the outermost loop.
OK, I'll have to compare that to some hand-optimised asm. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Jerome H. Fine
2010-05-24 00:07:17 UTC
Permalink
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
Are you using my data set? I completed and fixed (or broke?) your
Definitely not the same data set, I am simply using
rand() & 63) + 33;
to generate random (but printable) bytes.
With rand & 255 I get slightly different counts.
Post by Dick Wesseling
int scan(byte *data, int data_len)
{
unsigned count[data_len];
int shortest[C+1];
int i;
for (i=0; i<C+1; i++) shortest[i] = 0x7FFFFFFF;
for (i=0; i<data_len; i++) count[i] = 1;
int c = 0;
for (i = 1; // not 2: else "abc" won't work
You're absolutely right, in order to increment the count to 3 the scan
has to start with the second char. Thanks!
Post by Dick Wesseling
Once I get rid of the variable "uniq", which is not essential, the
remaining variables data[], freq[], l and r will fit into the x86
register set and the inner loop becomes 14 instructions per input
byte per iteration of the outermost loop.
OK, I'll have to compare that to some hand-optimised asm. :-)
I am attempting to follow the code for this problem, however, it becomes
a bit difficult to follow as it changes so often. This is not a complaint
as I very much appreciate some real examples.

However, it would be extremely helpful if both Terje and Dick could
include all of the code and some examples, as was done the first few
times, approximately every fifth time the code changes. This will enable
the rest of us who want to follow the discussion to put it into context.
The problem is very instructive and I really enjoy the description of
what you fellows are doing.

Jerome Fine
Dick Wesseling
2010-05-24 02:04:07 UTC
Permalink
Post by Jerome H. Fine
I am attempting to follow the code for this problem, however, it becomes
a bit difficult to follow as it changes so often.
I know. Never say "final version"....
Post by Jerome H. Fine
This is not a complaint as I very much appreciate some real examples.
I'd be happy to provide some real examples. Tomorrow is a
holiday here, so I can afford to spend time on this puzzle without
feeling guilty about it. Depending on how things are going - I am
now coding the final asm version - I may even do so later tonight/today.
I don't have to bring my offspring to school, so I can stay up as late
as I want to.
Jerome H. Fine
2010-05-24 03:44:51 UTC
Permalink
Post by Dick Wesseling
Post by Jerome H. Fine
I am attempting to follow the code for this problem, however, it becomes
a bit difficult to follow as it changes so often.
I know. Never say "final version"....
Having started on an IBM 650 in SOAP around 1960, I fully realize that aspect.
Post by Dick Wesseling
Post by Jerome H. Fine
This is not a complaint as I very much appreciate some real examples.
I'd be happy to provide some real examples. Tomorrow is a
holiday here, so I can afford to spend time on this puzzle without
feeling guilty about it. Depending on how things are going - I am
now coding the final asm version - I may even do so later tonight/today.
I don't have to bring my offspring to school, so I can stay up as late
as I want to.
The C code is even more instructive in its own way as it provides an
overview that assembly code is unable to do. The current C code
along with a few examples as you did with your first post would be
VERY appreciated! It also provides a comparison with the original
and shows the steps taken to make the code faster and more efficient.

As Terje notes, the efficient use of L1 cache is able to improve the
execution speed, but is rarely considered when an algorithm is
evaluated. The C code might include a few comments as to just
how much storage is required at any point in the process. As far
as I can determine, the L1 instruction cache will not be a problem
for this algorithm. Likewise the L1 data cache if care is taken even
though the total storage requirements are much larger.

Perhaps having started with a system that also had an effective L1
cache of about 100 bytes helps to remember how to optimize the
code. I think that the total memory including the disk which also
held the code and the data was about 20 KBytes.

Jerome Fine
Dick Wesseling
2010-05-24 07:26:22 UTC
Permalink
The basic idea is to create a simple algorithm that performs
one sequential scan over the input for each sequence length.
There is no backtracking. Instead intermediate results are
stored in a table "found", and if a shorter string with
the same number of unique symbols is found the table is reset.

If all of the data fits into the L1 cache then performing
multiple passes over the data should perform well. Apart from
the input data the program needs 2 extra arrays:

- freq[C], counts the frequency of each symbol in the window
- found[?], as explained above this array contains tentative results.

freq is heavily used. It adds 1K to the amount of data that
needs to be in the L1 cache.
The cache usage of "found" is more complicated. In theory it
can become quite large, but my intuition is that it should
behave quite modestly in practice.

During each pass the following variables keep track of the state
of the sliding window:
- l, the left margin
- r, the right margin (inclusive)
- uniq, the number of unique symbols within the window.

When the right margin moves, we bring a new symbol into the window,
and freq[new_symbol] must be incremented. If the new symbol does
not yet occur within the window then "uniq" must be incremented
as well:

# define inc(c) do { if(!freq[c]) uniq++; freq[c]++; } while(0)

When the left margin moves, a symbol falls out of the window and
the opposite happens:

# define dec(c) do { freq[c]--; if(!freq[c]) uniq--; } while(0)

The macros inc() and dec() and the variable uniq are just a convenience
for bootstrapping the algorithm. With careful coding they can be
optimized away from the final version, sacrificing clarity for raw
speed.

Sliding the window needs to maintain a loop invariant. The invariant
is:
- freq[data[l]] must be 1. If the leftmost symbol occurred elsewhere
in the window then this could not possibly be the shortest string
for the current goal.
- freq[data[r]] must be 1 for the same reason.
- uniq equals the goal of the current pass. Remember that each pass
searches for one particular sequence length only.
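The three conditions above can also be written down mechanically as a
debugging aid. The checker below is a sketch that is not part of the
posted program; it merely reuses the names data, freq, l, r, uniq and
goal from the code that follows:

```c
#include <assert.h>

/* Debugging sketch (not in the posted program): assert the loop
   invariant stated above. freq[] holds per-symbol occurrence counts
   for the current window data[l..r]. */
static void check_invariant(const unsigned char *data, const unsigned *freq,
                            unsigned l, unsigned r,
                            unsigned uniq, unsigned goal)
{
    assert(freq[data[l]] == 1); /* leftmost symbol occurs once in the window */
    assert(freq[data[r]] == 1); /* rightmost symbol occurs once in the window */
    assert(uniq == goal);       /* exactly "goal" distinct symbols in window */
}
```

Sprinkling a call to this next to each TRACE() makes it easy to catch a
slide step that breaks the invariant.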

That being said, the outline of the program is:

Select a large initial window such that the loop invariant holds.
goal = number of unique symbols in this window.

do
do
slide window from left to right maintaining the invariant
while not at end of input (i.e. invariant still true)
if window size < previous
dump the found array
and update apub: the current best value for the window
if window size <= apub
add to found array
od
print results for sequence length "goal".
while goal not minimal (see below)
goal = goal -1
select an initial window such that the loop invariant holds.
od

The choice for looping from a large goal downwards is arbitrary,
it seemed the easiest way to tackle the problem. If you start
with the window being the entire input stream then you just need
to shrink the margins until freq[data[l]]=freq[data[r]]=1. Setting goal=uniq
completes the loop invariant.

The important step is:
slide window from left to right maintaining the invariant

This is done as follows:

#if 0
1) dec(data[l]); l++;
2) while (uniq < goal && r<lim) { r++; inc(data[r]); }
3) while (freq[data[l]]>1) { dec(data[l]); l++; }
#else
...

1) The window must slide by at least 1 symbol.
We know that the leftmost symbol is unique. If we increment l
the number of unique symbols in the window therefore decreases
by 1.
2) We therefore move the right margin until a new unique symbol
moves into the window. Now uniq==goal holds again.
3) However, in step 2) we may have skipped over instances of the
leftmost symbol, which is now no longer unique:

before aabcdbx
---- uniq=goal=4 aperture=4
after 1) aabcdbx
--- uniq=3 goal=4 aperture=3
after 2) aabcdbx
----- uniq=goal=4 aperture=5

now "b" is no longer unique. We must therefore move the
left margin even further:

after 3) aabcdbx
---- uniq=goal=4 aperture=4

Now the loop invariant holds again, both margins are
at a unique symbol.

This is the inner loop of the program. Since we know that
data[l] and data[r] are unique we can get rid of the inc/dec
macros and expand and optimize these three statements into:

..
#else
1) freq[data[l]]=0; l++; uniq--; // data[l] is unique

2) while (r<lim) { // optimize controlling expression of
unsigned char cr; // loop
r++;
cr = data[r];
unsigned frc = freq[cr];
freq[cr] = frc+1;
if (frc == 0) {
uniq++;
break; // See below
}
}
3) while (freq[data[l]]>1) { // no reason to update "uniq"
freq[data[l]]--;
l++;
}
#endif

This is better than the first version. Since only one unique
symbol moved out of the window in step 1) we can break out of
step 2) as soon as we find the first new symbol, i.e. a symbol
whose frequency was zero prior to incrementing.
We can still improve upon this version. The variable uniq is
incremented exactly once and decremented exactly once. We therefore
need not change it at all. (If we hit the end of data with r==lim
then "uniq" no longer serves a purpose).

For the purpose of designing an x86 asm program we now have everything
in place; the inner loop now uses 4 variables: l, r, freq[] and data[].
This will fit into the x86 register set, leaving 2 or 3 other registers
for holding intermediate results. Now we're all set to try and beat
the C compiler, which is harder than it seems.

Next we need to implement the remainder of the algorithm outline above:

print results for sequence length "goal".
while goal not minimal (see below)
goal = goal -1
select an initial window such that the loop invariant holds.
od

printing the results is straightforward. Initializing the window
for the next iteration of the main loop is similar to sliding the
window: initialize l to 0, move r to the right until we have
enough unique symbols, and move l in case the leftmost symbol
is no longer unique:

// Now the next shorter sequence.

goal--;

1) inc(data[l]);
2) do {
r++;
inc(data[r]);
} while (uniq != goal);

3) while (freq[data[l]] != 1) {
dec(data[l]);
l++;
}

Using the same input data as in the previous example:

after 1) aabcdbx
- uniq=1 goal=3 aperture=1
after 2) aabcdbx
---- uniq=goal=3 aperture=4
after 3) aabcdbx
--- uniq=goal=3 aperture=3

Again, this loop can be heavily optimized to get rid of inc/dec and
uniq, but I'm not doing that in the C prototype; after all, this is
not the inner loop.



So far so good, but the reader may ask:

How about the requirement that no sequences that are
substrings of longer unique sequences must be printed?

Good question! This is another reason why looping downwards seemed
the natural thing to do. Let me quote from my previous posting:

<EUREKA>
Post by Robert AH Prins
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
-------------- -------
4-in-14 7-in-7


I did not fully understand why 12-25 was to be rejected, after all
"abcd" is not a substring of "gihbfja".
Now I understand. 7-in-7 contains substrings 6-in-6, 5-in-5, 4-in-4
and so on. These are not to be printed because they are proper
substrings of "gihbfja". However, they define "as short as possible"
for lengths <= 7.

IN OTHER WORDS, ONCE YOU KNOW THE LONGEST N-IN-N SEQUENCE ALL SHORTER
SEQUENCES MUST CONSIST ENTIRELY OF UNIQUE SYMBOLS.

This suggests two different search strategies, one for long sequences and
one for short sequences.

</EUREKA>

In other words, as long as you do not have an n-in-n sequence yet you
need not test for substrings, and once you have an n-in-n sequence
you can switch to a different algorithm that needs to handle n-in-n
sequences only, which is much easier than the general n-in-m loop.

Enter plan B. Plan B performs ONE scan of the input to find the
remaining n-in-n sequences. Again a sliding window is used, but
this time the invariant is simply:

- All symbols in the window must be unique.

This is true for l=r. It remains true as long as the symbol just
beyond the right margin does not occur within the window.
Otherwise print the current window if >= 3, and move l until
all symbols are unique again.

1) // l=r=0

while (r<lim) {
unsigned char cl, cr;
cl = data[l];
cr = data[r+1]; // Look ahead

if (freq[cr] != 0) { // Maximal sequence?
if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
do { // Move left edge until
cl = data[l]; // we have all symbols unique.
freq[cl] = 0;
2) l++;
} while (cl != cr);
}
3) r++;
freq[cr]=1;
}

Example:
after 1) abcdabcd
-
after 3) abcdabcd
--
after 3) abcdabcd
--
after 3) abcdabcd
---
after 3) abcdabcd
----
Look ahead, print "abcd" & proceed:

after 2) abcdabcd
----

Another example:
after 1) abcdca
-
after 3) abcdca
--
after 3) abcdca
---
after 3) abcdca
----

Look ahead, print "abcd" & proceed:

after 2) abcdca
--

As for the asm, this C program translates into +- 240 x86 instructions.
The inner loop is 15 instructions. Unfortunately most of these are
tightly interlocked, and there are 6 memory references. The L1
hit rate should be high, but there's not much room for improvement.
Processing the sample went from 9000 to 7000 cycles, but for random
data the gain went down to 2% compared to gcc's code on one CPU,
and -2% on another.
I may play with the asm code some time later, but for now I think
it is not worth bothering. Perhaps Terje is faring better than I.

Here is the next final final version. A TRACE macro is provided.
If you don't have an ansi c++ compiler handy then a trace for the
sample data can be found at http://www.fi.uu.nl/~ftu/trace.txt




-----------------------------------------------------------------------

#include <portos/profile.h>
/*
Here is the final version of my solution.

Let: N number of cells, size of the input
C number of values, size of symbol set.
M the number of distinct values in the largest sequence.
T the size of the largest n-in-n sequence, i.e. the
largest sequence consisting of distinct values only.

The algorithm uses two different search strategies:

Plan A

Find the largest sequence with size M

for n from M downwards search n-in-m sequences, keeping only sequences
with minimal value of m.
Print sequence if n<m. (If n=m the same sequence will be found again
in the next step).

Now n=m=T.

Plan B
Print all maximal n-in-n sequences.

Plan A scans the input 1+(M-T) .. 2+(M-T) times, depending on the
input position of the longest sequence.
Plan B scans the input once.

The running time is therefore bounded by O(N * (3+(M-T))), which
is typically better than or equal to O(N*C).
For small values of N the cost of erasing the frequency array should
also be accounted for.

Two arrays are used:
found[N] Intermediate results of plan A.
freq[C] Frequency count for all symbols in the sliding window


Finding sequences is done using a straightforward sliding window
scan of the input. The values in freq[] are updated when the window
edges move and the number of unique symbols is updated when a
frequency count changes between zero and non-zero.


The output is unsorted.


*/

#include <stdio.h>
#include <string.h>

#define C 256 // Size of alphabet
//#define ONE 1 // Pascal array bias convention
#define ONE 0 // C array bias convention

struct {
unsigned len;
unsigned pos;
unsigned w;
} output[4096];
unsigned outpos;

static
void printres (unsigned char *data, unsigned pos, unsigned l, unsigned w)
{
#if 0
unsigned i;
printf("len %2d aperture %2d pos %2d\t", l, w, ONE+pos);
for (i=pos; i<pos+w; i++) printf("%c", data[i]);
printf("\n");
#else
output[outpos].len = l ;
output[outpos].pos = pos;
output[outpos].w = w ;
outpos++;
#endif
}

#if 1
void trace(char *comment, unsigned char *data,
unsigned l, unsigned r, unsigned uniq, unsigned *freq)
{
unsigned i;
printf("%s: %d-%d aperture %d unique %d\n", comment, l, r, (r-l)+1, uniq);
printf("%s\t", data);
for (i=0; i<C; i++) {
if (freq[i]) {
printf("%c=%d ", i, freq[i]);
}
}
printf("\n");

for (i=0; i<l; i++) printf(" ");
for ( ; i<=r; i++) printf("-");
printf("\n");
}
#define TRACE(x) trace((x), data, l, r, uniq, freq)
#else
#define TRACE(x)
#endif

void findseqs (unsigned char *data, unsigned len)
{
// printf("%s\n", data);
outpos=0;
unsigned i;
unsigned uniq = 0; // Nr unique symbols in window
unsigned apub = ~0; // Aperture upper bound (aka "m")

unsigned nrfound = /* to keep gcc happy */ 0;
unsigned found[len]; // intermediate results

unsigned freq[C]; // Frequency count
bzero(freq, sizeof(freq));

# define inc(c) do { if(!freq[c]) uniq++; freq[c]++; } while(0)
# define dec(c) do { freq[c]--; if(!freq[c]) uniq--; } while(0)
# define WINSIZE ((r-l)+1)

// Start with largest possible sequence: the entire input string

unsigned l = 0; // Start of window
unsigned r = len-1; // Inclusive end of window
unsigned lim = len-1; // Inclusive end of data

for (i=0; i<len; i++) {
inc(data[i]);
}
if (uniq < 3) return;

// Trim left & right edges

TRACE("Entire input");
while (freq[data[r]]>1) { dec(data[r]); r--; }
TRACE("Initial trim r");
while (freq[data[l]]>1) { dec(data[l]); l++; }
TRACE("Initial trim l");

// From here on we want to find only sequences that are shorter or
// equal to what we have found above:

unsigned goal = uniq;

while (1) {

// data[l..r] contains "goal" unique symbols
// freq[] number of occurrences of each symbol in data[l..r]
// data[l] is a unique symbol
// data[r] is a unique symbol

TRACE("Main loop");

if (WINSIZE <= apub) { // Short enough?

// A candidate. May be premature if we have shorter sequences
// with the same nr of unique symbols.

if (WINSIZE < apub) {
nrfound = 0; // Dump old candidates
apub = WINSIZE; // Next matches must be <= this one
}
found[nrfound] = l; // Add to list of tentative results
nrfound++;
}

// Slide window to next match.

#if 0 // 30M cycles for 4K random data
// Note: previous "final" version posted still had a bug here

dec(data[l]); l++;
while (uniq < goal && r<lim) { r++; inc(data[r]); }
while (freq[data[l]]>1) { dec(data[l]); l++; }
#else
// 23M cycles for 4K random data
// The code generated by gcc still hurts my eyes, but this
// simplified loop is a good starting point for asm.


freq[data[l]]=0; l++; uniq--; // data[l] is unique
TRACE("After slide step 1");


while (r<lim) { // optimize controlling expression of
unsigned char cr; // loop
r++;
cr = data[r];
unsigned frc = freq[cr];
freq[cr] = frc+1;
if (frc == 0) {
uniq++;
break;
}
}
TRACE("After slide step 2");
while (freq[data[l]]>1) { // no reason to update "uniq"
freq[data[l]]--;
l++;
}
TRACE("After slide step 3");

#endif

if (uniq == goal) continue;

// We've hit the right edge

bzero(freq, sizeof(freq)); // Reset window & statistics
uniq = l = r = 0;

if (apub == goal) break; // If *only* distinct values proceed
// with plan B.

for (i=0; i<nrfound; i++) { // Print the results
printres(data, found[i], goal, apub);
}

// Now the next shorter sequence.

goal--;

inc(data[l]);
do {
r++;
inc(data[r]);
} while (uniq != goal);
TRACE("Find shorter, after slideR");

while (freq[data[l]] != 1) {
dec(data[l]);
l++;
}
TRACE("Find shorter, after slideL");

// Now the precondition holds. Also, the window size is smaller
// than the aperture upper bound from the previous iteration.
}

// Plan B
// Now we consider only sequences with all symbols distinct.
// The freq[] array therefore behaves like a bitmap with values 0&1 only.

freq[data[l]] = 1;

while (r<lim) {
unsigned char cl, cr;
cl = data[l];
cr = data[r+1]; // Look ahead

if (freq[cr] != 0) { // Maximal sequence?
if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
do { // Move left edge until
cl = data[l]; // we have all symbols unique.
freq[cl] = 0;
l++;
} while (cl != cr);
}
r++;
freq[cr]=1;
}

// Final sequence.

if (WINSIZE >= 3) {
printres(data, l, WINSIZE, WINSIZE);
}
return;
}

void printresults() {
// .. qsort "output" if desired
unsigned i;
for (i=0; i<outpos; i++) {
printf("len %3d aperture %3d pos %3d\t",
output[i].len,
output[i].w,
ONE+ output[i].pos);
printf("\n");
}
}

int main(int argc, char **argv)
{
if (argc > 2) {
fprintf(stderr, "Usage: %s {string}\n", argv[0]);
return 1;
}
if (argc == 1) {
// read from standard input
unsigned char buf[4096];
size_t inlen = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
findseqs( buf, (unsigned) inlen);
}
if (argc == 2) {
size_t l = strlen(argv[1]);
findseqs( (unsigned char *)argv[1], l );
}
//printresults();
return 0;
}
Jerome H. Fine
2010-05-24 10:42:08 UTC
Permalink
Post by Dick Wesseling
The basic idea is to create a simple algorithm that performs
one sequential scan over the input for each sequence length.
There is no backtracking. Instead intermediate results are
stored in a table "found", and if a shorter string with
the same number of unique symbols is found the table is reset.
If all of the data fits into the L1 cache then performing
multiple passes over the data should perform well. Apart from
[Snip]
I just want to say THANK YOU to both Dick and Terje for making the
complete program (the current version that is) available. The discussion
of L1 cache usage also provides valuable insight to what your concepts
are and how they impact the code. While the percentage of individuals
who actually consider cache usage when an algorithm is being designed
is certainly higher than usual with the followers of newsgroups, I doubt
that it is always a high priority. It would be interesting to develop some
tools to measure hit rates. In that regard, could Intel and AMD add
any microcode which would help the user? Does anyone have any idea
how much extra time (if any) it might take to maintain a hit rate register
for cache hits and misses as part of the microcode for the CPU, just as
it is possible to obtain the number of cycles?

Just a thought, but nothing gets done if the questions are not asked.

Jerome Fine
Terje Mathisen
2010-05-24 13:48:05 UTC
Permalink
I just want to say THANK YOU to both Dick and Terje for making the
complete program (the current version that is) available. The discussion
of L1 cache usage also provides valuable insight to what your concepts
are and how they impact the code. While the percentage of individuals
who actually consider cache usage when an algorithm is being designed
is certainly higher than usual with the followers of newsgroups, I doubt
that it is always a high priority. It would be interesting to develop some
tools to measure hit rates. In that regard, could Intel and AMD add
any microcode which would help the user? Does anyone have any idea
how much extra time (if any) it might take to maintain a hit rate register
for cache hits and misses as part of the microcode for the CPU, just as
it is possible to obtain the number of cycles?
This has existed since the original Pentium, in the form of the EMON
counters!

These were "secret" back then, but I reverse engineered them and wrote a
BYTE article which was published in the summer of 1994. On all later x86
cpu models the EMON counters, which are different for pretty much every
cpu generation, have been documented in the cpu manuals.

And Yes, L1 hit/miss rates are one of the most basic counters, included
on every single version since the original one. :-)

Today the easiest way to access them under Win* is to use Intel's VTune,
but there are similar (and free!) tools available under Linux.
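(As an aside, not from the thread: one of the free Linux tools alluded
to here is presumably perf, in mainline kernels since 2.6.31, which
exposes the L1 hit/miss counters through generalized event names. A
quick check might look like the following, assuming the findseqs
program above has been built as ./findseqs:)

```shell
# Count L1 data-cache loads and misses for one run of the program.
# Event availability and the underlying EMON encodings vary per CPU.
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./findseqs < random.bin
```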

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
glen herrmannsfeldt
2010-05-24 16:12:30 UTC
Permalink
In comp.lang.pl1 Jerome H. Fine <***@nospicedham.dp3knospamcompsys.to.dotsrc.org> wrote:
(snip)
Post by Jerome H. Fine
I just want to say THANK YOU to both Dick and Terje for making the
complete program (the current version that is) available. The discussion
of L1 cache usage also provides valuable insight to what your concepts
are and how they impact the code. While the percentage of individuals
who actually consider cache usage when an algorithm is being designed
is certainly higher than usual with the followers of newsgroups, I doubt
that it is always a high priority.
The only place I know it commonly done is matrix multiply and
matrix transpose. In both cases, one matrix is traversed in the
"wrong" direction (for the cache), and so the effect is easily seen.
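To make the effect concrete, here is a small sketch (my own, not from
the thread) of the two classic loop orders for matrix multiply. In the
i-j-k order, b[k][j] is walked down a column, touching a new cache line
on almost every access; swapping to i-k-j makes every inner-loop access
stride-1. Both orders add the k-terms in the same sequence, so the
results are bitwise identical:

```c
#include <stddef.h>
#include <assert.h>

enum { N = 64 }; /* small here; the cache effect grows with N */

/* i-j-k order: b[k][j] is traversed column-wise ("wrong" for the cache). */
void mul_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];  /* column walk through b */
            c[i][j] = sum;
        }
}

/* i-k-j order: the inner loop reads b[k][j] and writes c[i][j] stride-1. */
void mul_ikj(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++) {
            double aik = a[i][k];          /* invariant for inner loop */
            for (size_t j = 0; j < N; j++)
                c[i][j] += aik * b[k][j];  /* row walk through b and c */
        }
}
```

With N in the thousands the i-k-j version is often several times faster
on cached machines, purely from the traversal direction.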
Post by Jerome H. Fine
It would be interesting to develop some
tools to measure hit rates. In that regard, could Intel and AMD add
any microcode which would help the user? Does anyone have any idea
how much extra time (if any) it might take to maintain a hit rate register
for cache hits and misses as part of the microcode for the CPU, just as
it is possible to obtain the number of cycles?
-- glen
Terje Mathisen
2010-05-24 06:09:43 UTC
Permalink
Post by Jerome H. Fine
I am attempting to follow the code for this problem, however, it
becomes a bit difficult to follow as it changes so often. This is not
a complaint as I very much appreciate some real examples.
OK, here is my complete program, as it currently exists:
(I finally realized that my potential solution can explode when working
with random data, so I have increased the buffer to store them to the
maximum possible count!)

// unique-strings.cpp : Defines the entry point for the console application.
//

//#include "stdafx.h"
#include <stdio.h>
#include <tchar.h>
#include <stdlib.h>

typedef unsigned char byte;

#define MAX_DATA 4096
int data_len;

// Data to be scanned
byte data[MAX_DATA+1];

// Remember the shortest sequence with n unique symbols
short shortest[256]; // Alphabet size entries
// Running count of unique symbols seen, from [n] to current pos
short count[MAX_DATA];

typedef struct _solution_t {
int uniq;
int len;
int start;
} solution_t;

solution_t solution[MAX_DATA*256]; // Maximum possible count of possible solutions
int solutions = 0;

void init(void)
{
for (int i = 0; i < 256; i++) // Note: shortest[] has only 256 entries
shortest[i] = 32767;

for (int i = 0; i < MAX_DATA; i++)
count[i] = 1;

for (int i = 0; i < MAX_DATA; i++)
data[i] = (rand() & 63) + 33;
data_len = MAX_DATA;

// Here are a couple of small sets with known solutions:
//memcpy(data, "aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja", 56);
//data_len = 56;
//memcpy(data, "abc", 3);
//data_len = 3;
}

void save(int unique, int len, int start)
{
solution[solutions].uniq = unique;
solution[solutions].len = len;
solution[solutions].start = start;
solutions++;
}

int solsort(const void *a, const void *b)
{
solution_t *as = (solution_t *) a;
solution_t *bs = (solution_t *) b;
int u = bs->uniq - as->uniq; // Return longest unique sets first
if (!u) {
u = as->len - bs->len; // Return shortest string first for each set length
if (!u)
u = as->start - bs->start; // Finally order by starting position
}
return u;
}

int filter(void)
{
int prev_uniq = 0x7fffffff; // Max 32-bit int
int prev_len = 0x7fffffff; // Max 32-bit int

// First get rid of all non-optimal strings for each unique set length:
// Sort reverse by set length, forward by string length, then by starting position
qsort(solution, solutions, sizeof(solution[0]), solsort);

int sols = 0;
for (int s = 0; s < solutions; s++) {
if (solution[s].uniq == prev_uniq) {
if (solution[s].len > prev_len) { // Filter this entry away!
solution[s].uniq = 0; // Mark as removed
}
else { // Keep this, move down:
solution[sols++] = solution[s];
}
}
else { // (solution[s].uniq < prev_uniq) This is the first entry for a shorter set:
prev_uniq = solution[s].uniq;
prev_len = solution[s].len;
solution[sols++] = solution[s];
}
}
solutions = sols;

// For each remaining solution, check if any other overlap-1:

for (int s = 0; s < solutions; s++) {
for (int k = s+1; k < solutions; k++) {
if ((solution[s].len == solution[k].len+1) &&
((solution[s].start == solution[k].start) ||
(solution[s].start+1 == solution[k].start)))
solution[k].uniq = 0; // Mark as removed
if (solution[s].len > solution[k].len+1) break;
}
}
// Compact the final solution set:
sols = 0;
for (int s = 0; s < solutions; s++) {
if (solution[s].uniq)
solution[sols++] = solution[s];
}
solutions = sols;

return solutions;
}

int scan(byte *data, int data_len)
{
int c = 0; // # of candidate solutions

// The first 3-byte solution cannot end before the third entry, but
// the loop below starts at i = 1 so that count[j] stays up to date
// for every starting position j:
byte b = data[i];
for (int j = i-1; j >= 0; j--) {
// A previously seen character cannot end a new longest sequence!
// It also cannot add to the sequence count for this or any
// longer string, so stop scanning!
if (data[j] == b) break;

// Increment the count of unique values seen when starting from
position [j]
int unique_values = ++count[j];
if (unique_values >= 3) {
int len = i-j+1;
if (len <= shortest[unique_values]) {
save(unique_values, len, j);
shortest[unique_values] = len;
c++;
}
}
}
}
return c;
}

typedef __int64 int64_t;

int64_t rdtsc(void)
{
__asm rdtsc
}

int _tmain(int argc, _TCHAR* argv[])
{
init();
int64_t t0 = rdtsc();
int pot = scan(data, data_len);
int64_t t1 = rdtsc();
int cnt = filter();
int64_t t2 = rdtsc();
// Print the final solutions in opposite order, from shortest to longest:
printf("len aperture from to value\n");
for (int s = cnt-1; s >= 0; s--) {
int start = solution[s].start;
int len = solution[s].len;
printf("%3d %6d %5d %3d ", solution[s].uniq, len, start+1, start+len);
if (len > 20) len = 20;
for (int s = start; s < start+len; s++) {
byte b = data[s];
if ((b >= ' ') && (b < 127))
printf("%c", b);
else
printf("\\%03o", b);
}
printf("\n");
}

printf("Time to scan: %12I64d (%d potential solutions)\nTime to filter: %10I64d (%d filtered solutions)\n",
t1-t0, pot, t2-t1, cnt);

return 0;
}
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-25 23:04:58 UTC
Permalink
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
The problem was that with totally random data, there are a _lot_ of both
potential and real solutions, I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.
The filter process here took a _lot_ of time though: 8M cycles!
This means that it needed 2000 cycles/input byte. :-(
This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...
The data is in fact not very random, but you'll be hard pressed to
figure out what it represents. Here is the current full set, represented
by 66

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-/*

printable chars:

aaabbbbbbbccccadddddddddddddddddddddddddddddddeaaaaaaaabbbcacddddeeeee
eeeebbeebeeeeeefeeegggdggggggggggggggddddddddddddddddccbbbbhbbbbibjjjj
kbbabaaaaaacbcccaddddddcccbaaaaaaaabbbbablmkkkbnnnnnnonnnnbkkkkkkllibb
ibaaaaaaapbbbbccddadddaddddgccccccccccaaaaaaaaaabbbbbabbbdbdddddddcccc
ccccddbbbbbbqlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhln
nnnsinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucg
gdbhhhhhhhhhhhhhhhhohhhhhhhhrhhhhhhhhhhhhbhhhinjnnnnaaaaaaaiiiiiioojoa
aaaooooojjjjiiiiiioviiiaaaaaabbibrrrrrrrrrrrrrrrnnnnnbrrrrbbaaaiiiiiii
iiiiiiiiiiiiiiiiiiwiiiiiiiiixyiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
iihiiiiiiziiiiiiiiiiiooaaaaaaabbbbfbbbrrrrgginnralarabbaaaiiiiiiiiAiii
iiibiiiiiiiiiiiiiiiiiiiiiiiibiiiiiiiiiiiiiiiiiiiiiiiiiiabbbblbrrairrbr
lnnnjnirrrblbbjaaaaabbbbbaiiiiiiiiiiiiiiiBiiiiiiiiiiriiiiiiiiiiiiiiiii
iiiiiiiiiiiiiiiiiiiiiiiCiiiiiiiiiiiiiiiriiiiiiiiiiiiibiiiiiiiiiiiiiiii
aaaDbbbbbbbrrriaiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiriiiiiii
iiiiiiiiiiiiiiiiiiiiiiiiiiiiviiiiiiiiiiiiiiiiiiiiiiabhbiblelrrribnbbbo
rrrrabEbbbadcdddddbbbbbbbbBbBbbbaaaaaaiiiiiiijiiiioiiiiaaFahbbbbbbbbbb
bbbbbbbbbbbbbbbGbbbbbbbbbbrrrrrnlrHcbbrbbbaiiiiiiIiiiicoaaaaabbbbbibab
bbbybbbabbbbbbooooaiiiiiiiiiiiiiiiiiiaaaabalbbbbbbblbnrrrrcbbbaiiiiixi
ziiiiiiiiaaaaaabbbabrJbinnaiiiiaaaaaabbGJKLLLLLMLdddddddaaaaaaioiiiiii
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiNiiiiiiiiiiiiiifiiibbhcbbacccbccO
bbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgdd
dgSdddddjcccccccccjaaaaaaafabbNGGLLLLLLLLGGGrbbbaoiiiiiiiiiiiiiiiaaiii
iibbbbbGGizzzziiiTiiiiijbNbLaGGGGKLLLLLLLLLLDDDDDDDDDDDDDDDDaaaaaabbbG
GGGGGGGLLLLGbaaaaaoooobiaiaoiiiiiiiiiiibaMbbUbGGGLLLLLLGGGGGGNbbooiiii
iiiiiiijibmabbGLLLLLLVbbbioiiiiiiiiiiiiiiiiiiiWiiiiiiXiaaaoooiaiiiiiii
iaaaaaGaVaYGZLLLLLLL0KGGGGGGGGbbooooiaiiiiii1iooaabbGaGGGGGGLLLLLLLLbG
GGGGGbaaaaajwiiiiiiiiiihaoGCGLLLLLLLLLLKbaoiiiiiiioabGLLLLLLLLEGGGbbob
ooon2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaa
aaGGLLLLLLLLGGGGGGGGVGGVGbbbaaNbooiaiiiii0ooaYLLLLKGGGGGGLihiiiiijjoog
5zabbbbGLLLiiiiiiiiziiiiiiooaaaabbbGbooboooommmmoooooooaaoaoiooowooabm
mmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Yoooooo
iGbbbbbaaaaao+oobGbbbLGLLLLLbbbjiiooobbGbbmo-ooaoJbbbbVbbbbbb/baoooooo
o*bbbbbbbbbbbbbbbbbbaotoommooojaoooo-ooahbbbbbbbaaa3ooooobbllbbrbbbbb*
bmaoooooobobbbbbooo (2329 characters)

The top-3 account for 58.3 % of all values, the top-10 for 85.2%.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-26 08:43:31 UTC
Permalink
Post by Robert AH Prins
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is still fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
The problem was that with totally random data, there are a _lot_ of both
potential and real solutions, I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.
The filter process here took a _lot_ of time though: 8M cycles!
This means that it needed 2000 cycles/input byte. :-(
This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...
The data is in fact not very random, but you'll be hard pressed to
figure out what it represents. Here is the current full set, represented
by these 66 characters:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-/*
aaabbbbbbbccccadddddddddddddddddddddddddddddddeaaaaaaaabbbcacddddeeeee
eeeebbeebeeeeeefeeegggdggggggggggggggddddddddddddddddccbbbbhbbbbibjjjj
kbbabaaaaaacbcccaddddddcccbaaaaaaaabbbbablmkkkbnnnnnnonnnnbkkkkkkllibb
ibaaaaaaapbbbbccddadddaddddgccccccccccaaaaaaaaaabbbbbabbbdbdddddddcccc
ccccddbbbbbbqlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhln
nnnsinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucg
gdbhhhhhhhhhhhhhhhhohhhhhhhhrhhhhhhhhhhhhbhhhinjnnnnaaaaaaaiiiiiioojoa
aaaooooojjjjiiiiiioviiiaaaaaabbibrrrrrrrrrrrrrrrnnnnnbrrrrbbaaaiiiiiii
iiiiiiiiiiiiiiiiiiwiiiiiiiiixyiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
iihiiiiiiziiiiiiiiiiiooaaaaaaabbbbfbbbrrrrgginnralarabbaaaiiiiiiiiAiii
iiibiiiiiiiiiiiiiiiiiiiiiiiibiiiiiiiiiiiiiiiiiiiiiiiiiiabbbblbrrairrbr
lnnnjnirrrblbbjaaaaabbbbbaiiiiiiiiiiiiiiiBiiiiiiiiiiriiiiiiiiiiiiiiiii
iiiiiiiiiiiiiiiiiiiiiiiCiiiiiiiiiiiiiiiriiiiiiiiiiiiibiiiiiiiiiiiiiiii
aaaDbbbbbbbrrriaiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiriiiiiii
iiiiiiiiiiiiiiiiiiiiiiiiiiiiviiiiiiiiiiiiiiiiiiiiiiabhbiblelrrribnbbbo
rrrrabEbbbadcdddddbbbbbbbbBbBbbbaaaaaaiiiiiiijiiiioiiiiaaFahbbbbbbbbbb
bbbbbbbbbbbbbbbGbbbbbbbbbbrrrrrnlrHcbbrbbbaiiiiiiIiiiicoaaaaabbbbbibab
bbbybbbabbbbbbooooaiiiiiiiiiiiiiiiiiiaaaabalbbbbbbblbnrrrrcbbbaiiiiixi
ziiiiiiiiaaaaaabbbabrJbinnaiiiiaaaaaabbGJKLLLLLMLdddddddaaaaaaioiiiiii
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiNiiiiiiiiiiiiiifiiibbhcbbacccbccO
bbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgdd
dgSdddddjcccccccccjaaaaaaafabbNGGLLLLLLLLGGGrbbbaoiiiiiiiiiiiiiiiaaiii
iibbbbbGGizzzziiiTiiiiijbNbLaGGGGKLLLLLLLLLLDDDDDDDDDDDDDDDDaaaaaabbbG
GGGGGGGLLLLGbaaaaaoooobiaiaoiiiiiiiiiiibaMbbUbGGGLLLLLLGGGGGGNbbooiiii
iiiiiiijibmabbGLLLLLLVbbbioiiiiiiiiiiiiiiiiiiiWiiiiiiXiaaaoooiaiiiiiii
iaaaaaGaVaYGZLLLLLLL0KGGGGGGGGbbooooiaiiiiii1iooaabbGaGGGGGGLLLLLLLLbG
GGGGGbaaaaajwiiiiiiiiiihaoGCGLLLLLLLLLLKbaoiiiiiiioabGLLLLLLLLEGGGbbob
ooon2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaa
aaGGLLLLLLLLGGGGGGGGVGGVGbbbaaNbooiaiiiii0ooaYLLLLKGGGGGGLihiiiiijjoog
5zabbbbGLLLiiiiiiiiziiiiiiooaaaabbbGbooboooommmmoooooooaaoaoiooowooabm
mmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Yoooooo
iGbbbbbaaaaao+oobGbbbLGLLLLLbbbjiiooobbGbbmo-ooaoJbbbbVbbbbbb/baoooooo
o*bbbbbbbbbbbbbbbbbbaotoommooojaoooo-ooahbbbbbbbaaa3ooooobbllbbrbbbbb*
bmaoooooobobbbbbooo (2329 characters)
The top-3 account for 58.3% of all values, the top-10 for 85.2%.
Here are my results, with the longer result strings truncated:
len aperture from to value
3 3 2291 2293 a3o
3 3 2271 2273 jao
3 3 2270 2272 oja
3 3 2241 2243 o*b
3 3 2201 2203 bji
3 3 2191 2193 bLG
3 3 2186 2188 obG
3 3 2182 2184 ao+
3 3 2159 2161 oMm
3 3 2158 2160 moM
3 3 2117 2119 IjL
3 3 2089 2091 aoi
3 3 2066 2068 Gbo
3 3 2037 2039 bGL
3 3 2010 2012 LKG
3 3 2001 2003 i0o
3 3 1994 1996 oia
3 3 1984 1986 VGb
3 3 1955 1957 iba
3 3 1939 1941 oai
3 3 1935 1937 4ba
3 3 1915 1917 brH
3 3 1906 1908 oVb
3 3 1903 1905 jio
3 3 1882 1884 LEG
3 3 1848 1850 CGL
3 3 1825 1827 Gba
3 3 1818 1820 LbG
3 3 1802 1804 bGa
3 3 1795 1797 1io
3 3 1786 1788 oia
3 3 1757 1759 GaV
3 3 1741 1743 oia
3 3 1734 1736 Xia
3 3 1705 1707 bio
3 3 1701 1703 LVb
3 3 1694 1696 bGL
3 3 1691 1693 mab
3 3 1671 1673 GNb
3 3 1655 1657 UbG
3 3 1651 1653 aMb
3 3 1637 1639 aoi
3 3 1636 1638 iao
3 3 1573 1575 GKL
3 3 1549 1551 Giz
3 3 1514 1516 Grb
3 3 1500 1502 bNG
3 3 1497 1499 fab
3 3 1488 1490 cja
3 3 1478 1480 djc
3 3 1472 1474 gSd
3 3 1471 1473 dgS
3 3 1455 1457 biG
3 3 1437 1439 bao
3 3 1399 1401 cOb
3 3 1392 1394 bac
3 3 1389 1391 hcb
3 3 1388 1390 bhc
3 3 1322 1324 aio
3 3 1308 1310 MLd
3 3 1286 1288 nai
3 3 1259 1261 xiz
3 3 1252 1254 bai
3 3 1248 1250 rcb
3 3 1233 1235 alb
3 3 1232 1234 bal
3 3 1208 1210 oai
3 3 1187 1189 iba
3 3 1162 1164 bai
3 3 1151 1153 rnl
3 3 1049 1051 bor
3 3 1039 1041 elr
3 3 1034 1036 hbi
3 3 924 926 ria
3 3 913 915 aDb
3 3 795 797 bai
3 3 784 786 bja
3 3 780 782 rbl
3 3 765 767 air
3 3 764 766 rai
3 3 761 763 lbr
3 3 755 757 iab
3 3 682 684 rab
3 3 680 682 lar
3 3 674 676 gin
3 3 589 591 xyi
3 3 588 590 ixy
3 3 543 545 nbr
3 3 522 524 ibr
3 3 509 511 ovi
3 3 508 510 iov
3 3 488 490 joa
3 3 400 402 knb
3 3 377 379 gda
3 3 354 356 sin
3 3 353 355 nsi
3 3 348 350 hln
3 3 341 343 alb
3 3 318 320 lkn
3 3 309 311 abp
3 3 296 298 lnh
3 3 292 294 bql
3 3 237 239 dgc
3 3 219 221 apb
3 3 211 213 iba
3 3 207 209 lib
3 3 198 200 nbk
3 3 186 188 kbn
3 3 166 168 cba
3 3 156 158 cad
3 3 151 153 acb
3 3 140 142 jkb
3 3 135 137 ibj
3 3 60 62 acd
3 3 58 60 bca
3 3 46 48 dea
3 3 14 16 cad
4 4 2279 2282 oahb
4 4 2260 2263 baot
4 4 2232 2235 /bao
4 4 2218 2221 aoJb
4 4 2212 2215 bmo-
4 4 2170 2173 oiGb
4 4 2162 2165 m9Yo
4 4 2149 2152 Fb0o
4 4 2121 2124 oC7j
4 4 2112 2115 BI6j
4 4 2097 2100 oabm
4 4 2017 2020 GLih
4 4 2004 2007 oaYL
4 4 1990 1993 aNbo
4 4 1909 1912 Vbr3
4 4 1893 1896 on2i
4 4 1831 1834 ajwi
4 4 1770 1773 L0KG
4 4 1649 1652 ibaM
4 4 1632 1635 obia
4 4 1621 1624 LGba
4 4 1563 1566 ijbN
4 4 1518 1521 baoi
4 4 1446 1449 iFra
4 4 1438 1441 aobd
4 4 1279 1282 abrJ
4 4 1242 1245 lbnr
4 4 1174 1177 icoa
4 4 1108 1111 Fahb
4 4 1060 1063 badc
4 4 1054 1057 rabE
4 4 1043 1046 ribn
4 4 1036 1039 ible
4 4 1031 1034 iabh
4 4 775 778 jnir
4 4 769 772 brln
4 4 677 680 nral
4 4 465 468 hinj
4 4 421 424 gdbh
4 4 417 420 bucg
4 4 374 377 cbdg
4 4 363 366 hpbl
4 4 336 339 bnka
5 5 2310 2314 *bmao
5 5 2145 2149 0G8bF
5 5 2125 2129 jbmoa
5 5 2119 2123 LjoC7
5 5 1920 1924 alhGb
5 5 1688 1692 jibma
5 5 1566 1570 NbLaG
5 5 1465 1469 LMKgd
5 5 1443 1447 bRriF
5 5 1406 1410 PQBub
5 5 1405 1409 bPQBu
5 5 1299 1303 bGJKL
5 5 1281 1285 rJbin
5 5 389 393 fbtkn
5 5 180 184 ablmk
6 6 2029 2034 og5zab
6 6 1918 1923 HbalhG
6 6 1870 1875 ioabGL
6 6 1859 1864 LKbaoi
6 6 1843 1848 ihaoGC
6 6 1759 1764 VaYGZL
6 6 1447 1452 Frainb
6 6 1152 1157 nlrHcb
7 7 302 308 nrhblka
8 10 2025 2034 ijjoog5zab
8 10 1438 1447 aobdbbRriF
8 10 302 311 nrhblkaabp
9 13 2117 2129 IjLjoC7jjbmoa
9 13 2027 2039 joog5zabbbbGL
9 13 1439 1451 obdbbRriFrain
10 15 2025 2039 ijjoog5zabbbbGL
11 18 2112 2129 BI6jjIjLjoC7jjbmoa
11 18 2017 2034 GLihiiiiijjoog5zab
12 21 1903 1923 jiooVbVbr3rrbrHHbalhG
13 25 2112 2136 BI6jjIjLjoC7jjbmoaoooiooE
14 28 2112 2139 BI6jjIjLjoC7jjbmoaoooiooEEE0
15 33 2117 2149 IjLjoC7jjbmoaoooiooEEE0000000G8bF
16 36 2114 2149 6jjIjLjoC7jjbmoaoooiooEEE0000000G8bF
16 36 2112 2147 BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8
17 38 2112 2149 BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bF
18 48 2117 2164 IjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
19 51 2114 2164 6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
20 53 2112 2164 BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
21 70 2095 2164
wooabmmmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
22 81 1400 1480
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
22 81 1399 1479
cObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKg
23 91 1389 1479
hcbbacccbccObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGG
23 91 1383 1473
fiiibbhcbbacccbccObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainb
24 97 1383 1479
fiiibbhcbbacccbccObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainb
25 112 1368 1479
NiiiiiiiiiiiiiifiiibbhcbbacccbccObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibb
26 131 2112 2242
BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9YooooooiGbbbbbaaaa
27 148 2095 2242
wooabmmmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
28 169 2095 2263
wooabmmmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
29 181 2112 2292
BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9YooooooiGbbbbbaaaa
30 189 2112 2300
BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9YooooooiGbbbbbaaaa
31 193 2112 2304
BI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9YooooooiGbbbbbaaaa
32 210 2095 2304
wooabmmmmBBBBBBBBBI6jjIjLjoC7jjbmoaoooiooEEE0000000G8bFb0oooommmoMmm9Y
33 252 1991 2242
Nbooiaiiiii0ooaYLLLLKGGGGGGLihiiiiijjoog5zabbbbGLLLiiiiiiiiziiiiiiooaa
33 252 1912 2163
3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaGGLLLLLLLLGGGGGGGGV
34 267 1894 2160
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
35 270 1894 2163
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
36 291 1894 2184
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
37 314 1991 2304
Nbooiaiiiii0ooaYLLLLKGGGGGGLihiiiiijjoog5zabbbbGLLLiiiiiiiiziiiiiiooaa
38 327 1894 2220
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
39 339 1894 2232
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
40 349 1894 2242
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
41 370 1894 2263
n2iiiiijjjiooVbVbr3rrbrHHbalhGbbGGGLLLbbb4baaoaiiiiwiiYiiiiiiibaaaaaaG
42 469 1795 2263
1iooaabbGaGGGGGGLLLLLLLLbGGGGGGbaaaaajwiiiiiiiiiihaoGCGLLLLLLLLLLKbaoi
43 501 1763 2263
ZLLLLLLL0KGGGGGGGGbbooooiaiiiiii1iooaabbGaGGGGGGLLLLLLLLbGGGGGGbaaaaaj
44 516 1727 2242
WiiiiiiXiaaaoooiaiiiiiiiiaaaaaGaVaYGZLLLLLLL0KGGGGGGGGbbooooiaiiiiii1i
45 536 1400 1935
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
46 609 1655 2263
UbGGGLLLLLLGGGGGGNbbooiiiiiiiiiiijibmabbGLLLLLLVbbbioiiiiiiiiiiiiiiiii
47 664 1600 2263
DaaaaaabbbGGGGGGGGLLLLGbaaaaaoooobiaiaoiiiiiiiiiiibaMbbUbGGGLLLLLLGGGG
48 706 1558 2263
TiiiiijbNbLaGGGGKLLLLLLLLLLDDDDDDDDDDDDDDDDaaaaaabbbGGGGGGGGLLLLGbaaaa
49 724 1400 2123
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
50 748 1400 2147
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
51 764 1400 2163
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
52 785 1400 2184
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
53 815 1406 2220
PQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgdddgSdd
54 821 1400 2220
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
55 833 1400 2232
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
56 843 1400 2242
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
57 864 1400 2263
ObbbbbPQBubbbiiiiiiiiiiiiiiiiKiiiiiibbaobdbbRriFrainbbbbiGGGLLLLLLMKgd
58 1005 1259 2263
xiziiiiiiiiaaaaaabbbabrJbinnaiiiiaaaaaabbGJKLLLLLMLdddddddaaaaaaioiiii
59 1070 1194 2263
ybbbabbbbbbooooaiiiiiiiiiiiiiiiiiiaaaabalbbbbbbblbnrrrrcbbbaiiiiixizii
60 1225 1039 2263
elrrribnbbborrrrabEbbbadcdddddbbbbbbbbBbBbbbaaaaaaiiiiiiijiiiioiiiiaaF
61 1255 1009 2263
viiiiiiiiiiiiiiiiiiiiiiabhbiblelrrribnbbborrrrabEbbbadcdddddbbbbbbbbBb
62 1567 697 2263
Aiiiiiibiiiiiiiiiiiiiiiiiiiiiiiibiiiiiiiiiiiiiiiiiiiiiiiiiiabbbblbrrai
63 1852 391 2242
tknnnnnnnknbbbbbaaaaaaabbbbucggdbhhhhhhhhhhhhhhhhohhhhhhhhrhhhhhhhhhhh
64 1879 364 2242
pblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucggdbhhhhhhhhhh
64 1879 354 2232
sinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucggdb
65 1889 354 2242
sinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucggdb
66 1950 293 2242
qlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhlnnnnsinnkkkkh

Time to scan: 2285008 (1200 potential solutions)
Time to filter: 2459754 (253 filtered solutions)

Does this seem reasonable?

The total solution time is about 2.2 ms...

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-26 12:54:58 UTC
Permalink
Post by Robert AH Prins
Post by Robert AH Prins
Post by Terje Mathisen
Post by Dick Wesseling
Post by Terje Mathisen
Still, running at 130+ cycles per input byte is fast enough to
handle significant amounts of data.
Sounds like you've beaten me. My O() depends on the size of the longest
sequence minus the size of the longest unique sequence, which is good
for the reference data. If I time 4K of random data I get 7250 cycles per
input byte. Since sqrt(C) is 16 for that dataset you ought to get about
800 cycles/byte ( http://www.fi.uu.nl/~ftu/random.bin )
With 4K random input the scan took 252K cycles, which means that I
averaged about 83 cycles/byte. (I.e. very fast!)
The problem was that with totally random data, there are a _lot_ of both
potential and real solutions, I found 4099 candidates (i.e. just over
one per starting byte!) and 121 after filtering.
The filter process here took a _lot_ of time though: 8M cycles!
This means that it needed 2000 cycles/input byte. :-(
This simply shows that the problem space (and therefore required
algorithm as well) almost certainly won't contain totally random data,
and this will most probably help my code...
The data is in fact not very random, but you'll be hard pressed to
figure out what it represents. Here is the current full set, represented
by these 66 characters:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-/*
aaabbbbbbbccccadddddddddddddddddddddddddddddddeaaaaaaaabbbcacddddeeeee
<SNIP>
Post by Robert AH Prins
Post by Robert AH Prins
bmaoooooobobbbbbooo (2329 characters)
The top-3 account for 58.3% of all values, the top-10 for 85.2%.
len aperture from to value
3 3 2291 2293 a3o
<SNIP>
Post by Robert AH Prins
66 1950 293 2242
qlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhlnnnnsinnkkkkh
A very quick look at the results tells me they are OK. My version
performs one other minor massage of the output, removing duplicates,
so to take an example

15 33 2117 2149 IjLjoC7jjbmoaoooiooEEE0000000G8bF

would come out as

15 33 2117 2149 IjLoC7bmaiE0G8F

Or, even more correctly, as the 4-character strings indexed by the values
of these characters. Nit-picking...
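[Editorial note: the duplicate removal Robert describes amounts to keeping only the first occurrence of each value, in order of first appearance. A minimal sketch; `strip_repeats` is a hypothetical helper, not code from either program.]

```c
#include <string.h>

/* Copy src to dst, keeping only the first occurrence of each byte value
   and preserving the order of first appearance. dst must have room for
   strlen(src)+1 bytes. */
static void strip_repeats(const char *src, char *dst)
{
    int seen[256] = {0};
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (!seen[c]) {
            seen[c] = 1;
            *dst++ = (char)c;
        }
    }
    *dst = '\0';
}
```

Applied to "IjLjoC7jjbmoaoooiooEEE0000000G8bF" this yields "IjLoC7bmaiE0G8F", the stripped form shown above.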
Post by Robert AH Prins
Time to scan: 2285008 (1200 potential solutions)
Time to filter: 2459754 (253 filtered solutions)
Does this seem reasonable?
The total solution time is about 2.2 ms...
I think my very first Pascal (Turbo Pascal 3.01a) solution using a
sliding window without any optimizations ran for around 26 *minutes*
(OK, on a 16MHz 386) on a far smaller subset (less than half) of this
data, so I think 2.2 ms is pretty reasonable. ;)

Still cutting the Pascal - the PL/I would be easier, but PL/I compilers
are a bit thin on the ground...

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-26 11:32:00 UTC
Permalink
Post by Robert AH Prins
A very quick look at the results tells me they are OK. My version
performs one other minor massage of the output, removing duplicates,
so to take an example
15 33 2117 2149 IjLjoC7jjbmoaoooiooEEE0000000G8bF
would come out as
15 33 2117 2149 IjLoC7bmaiE0G8F
Or, even more correctly, as the 4-character strings indexed by the values
of these characters. Nit-picking...
Post by Robert AH Prins
Time to scan: 2285008 (1200 potential solutions)
Time to filter: 2459754 (253 filtered solutions)
Does this seem reasonable?
The total solution time is about 2.2 ms...
I think my very first Pascal (Turbo Pascal 3.01a) solution using a
sliding window without any optimizations ran for around 26 *minutes*
(OK, on a 16MHz 386) on a far smaller subset (less than half) of this
data, so I think 2.2 ms is pretty reasonable. ;)
My first usable compiler (on a PC) was TP 1.0. :-)

Currently my best runtime has been:

65 1889 354 2242
sinnkkkkhhpblbbbbbcccbdgdaaaaabbbbbfbtknnnnnnnknbbbbbaaaaaaabbbbucggdb
66 1950 293 2242
qlllnhnnnnrhblkaabpbbbbbllknnnnnbbaaaaaabbbbnkaaalbllllhlnnnnsinnkkkkh
Time to scan: 1247257 (1200 potential solutions)
Time to filter: 1226621 (253 filtered solutions)

I.e. 2.47M cycles on a 2.2GHz cpu, for a total of about 1.1 ms.

I have also implemented the promised improved filter function; using
it I got:

Time to scan: 1048432 (1200 potential solutions)
Time to filter: 982135 (237 filtered solutions)

I.e. 0.9 ms, but I'm missing 16 of the shorter solutions. :-(

I'll have to recheck my logic...

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje Mathisen
2010-05-26 11:38:07 UTC
Permalink
Post by Robert AH Prins
A very quick look at the results tells me they are OK. My version
performs one other minor massage of the output, removing duplicates,
so to take an example
15 33 2117 2149 IjLjoC7jjbmoaoooiooEEE0000000G8bF
would come out as
15 33 2117 2149 IjLoC7bmaiE0G8F
Here it is with the same stripping of repeated result bytes:

len aperture from to value
3 3 2291 2293 a3o
3 3 2271 2273 jao
3 3 2270 2272 oja
3 3 2241 2243 o*b
3 3 2201 2203 bji
3 3 2191 2193 bLG
3 3 2186 2188 obG
3 3 2182 2184 ao+
3 3 2159 2161 oMm
3 3 2158 2160 moM
3 3 2117 2119 IjL
3 3 2089 2091 aoi
3 3 2066 2068 Gbo
3 3 2037 2039 bGL
3 3 2010 2012 LKG
3 3 2001 2003 i0o
3 3 1994 1996 oia
3 3 1984 1986 VGb
3 3 1955 1957 iba
3 3 1939 1941 oai
3 3 1935 1937 4ba
3 3 1915 1917 brH
3 3 1906 1908 oVb
3 3 1903 1905 jio
3 3 1882 1884 LEG
3 3 1848 1850 CGL
3 3 1825 1827 Gba
3 3 1818 1820 LbG
3 3 1802 1804 bGa
3 3 1795 1797 1io
3 3 1786 1788 oia
3 3 1757 1759 GaV
3 3 1741 1743 oia
3 3 1734 1736 Xia
3 3 1705 1707 bio
3 3 1701 1703 LVb
3 3 1694 1696 bGL
3 3 1691 1693 mab
3 3 1671 1673 GNb
3 3 1655 1657 UbG
3 3 1651 1653 aMb
3 3 1637 1639 aoi
3 3 1636 1638 iao
3 3 1573 1575 GKL
3 3 1549 1551 Giz
3 3 1514 1516 Grb
3 3 1500 1502 bNG
3 3 1497 1499 fab
3 3 1488 1490 cja
3 3 1478 1480 djc
3 3 1472 1474 gSd
3 3 1471 1473 dgS
3 3 1455 1457 biG
3 3 1437 1439 bao
3 3 1399 1401 cOb
3 3 1392 1394 bac
3 3 1389 1391 hcb
3 3 1388 1390 bhc
3 3 1322 1324 aio
3 3 1308 1310 MLd
3 3 1286 1288 nai
3 3 1259 1261 xiz
3 3 1252 1254 bai
3 3 1248 1250 rcb
3 3 1233 1235 alb
3 3 1232 1234 bal
3 3 1208 1210 oai
3 3 1187 1189 iba
3 3 1162 1164 bai
3 3 1151 1153 rnl
3 3 1049 1051 bor
3 3 1039 1041 elr
3 3 1034 1036 hbi
3 3 924 926 ria
3 3 913 915 aDb
3 3 795 797 bai
3 3 784 786 bja
3 3 780 782 rbl
3 3 765 767 air
3 3 764 766 rai
3 3 761 763 lbr
3 3 755 757 iab
3 3 682 684 rab
3 3 680 682 lar
3 3 674 676 gin
3 3 589 591 xyi
3 3 588 590 ixy
3 3 543 545 nbr
3 3 522 524 ibr
3 3 509 511 ovi
3 3 508 510 iov
3 3 488 490 joa
3 3 400 402 knb
3 3 377 379 gda
3 3 354 356 sin
3 3 353 355 nsi
3 3 348 350 hln
3 3 341 343 alb
3 3 318 320 lkn
3 3 309 311 abp
3 3 296 298 lnh
3 3 292 294 bql
3 3 237 239 dgc
3 3 219 221 apb
3 3 211 213 iba
3 3 207 209 lib
3 3 198 200 nbk
3 3 186 188 kbn
3 3 166 168 cba
3 3 156 158 cad
3 3 151 153 acb
3 3 140 142 jkb
3 3 135 137 ibj
3 3 60 62 acd
3 3 58 60 bca
3 3 46 48 dea
3 3 14 16 cad
4 4 2279 2282 oahb
4 4 2260 2263 baot
4 4 2232 2235 /bao
4 4 2218 2221 aoJb
4 4 2212 2215 bmo-
4 4 2170 2173 oiGb
4 4 2162 2165 m9Yo
4 4 2149 2152 Fb0o
4 4 2121 2124 oC7j
4 4 2112 2115 BI6j
4 4 2097 2100 oabm
4 4 2017 2020 GLih
4 4 2004 2007 oaYL
4 4 1990 1993 aNbo
4 4 1909 1912 Vbr3
4 4 1893 1896 on2i
4 4 1831 1834 ajwi
4 4 1770 1773 L0KG
4 4 1649 1652 ibaM
4 4 1632 1635 obia
4 4 1621 1624 LGba
4 4 1563 1566 ijbN
4 4 1518 1521 baoi
4 4 1446 1449 iFra
4 4 1438 1441 aobd
4 4 1279 1282 abrJ
4 4 1242 1245 lbnr
4 4 1174 1177 icoa
4 4 1108 1111 Fahb
4 4 1060 1063 badc
4 4 1054 1057 rabE
4 4 1043 1046 ribn
4 4 1036 1039 ible
4 4 1031 1034 iabh
4 4 775 778 jnir
4 4 769 772 brln
4 4 677 680 nral
4 4 465 468 hinj
4 4 421 424 gdbh
4 4 417 420 bucg
4 4 374 377 cbdg
4 4 363 366 hpbl
4 4 336 339 bnka
5 5 2310 2314 *bmao
5 5 2145 2149 0G8bF
5 5 2125 2129 jbmoa
5 5 2119 2123 LjoC7
5 5 1920 1924 alhGb
5 5 1688 1692 jibma
5 5 1566 1570 NbLaG
5 5 1465 1469 LMKgd
5 5 1443 1447 bRriF
5 5 1406 1410 PQBub
5 5 1405 1409 bPQBu
5 5 1299 1303 bGJKL
5 5 1281 1285 rJbin
5 5 389 393 fbtkn
5 5 180 184 ablmk
6 6 2029 2034 og5zab
6 6 1918 1923 HbalhG
6 6 1870 1875 ioabGL
6 6 1859 1864 LKbaoi
6 6 1843 1848 ihaoGC
6 6 1759 1764 VaYGZL
6 6 1447 1452 Frainb
6 6 1152 1157 nlrHcb
7 7 302 308 nrhblka
8 10 2025 2034 ijog5zab
8 10 1438 1447 aobdRriF
8 10 302 311 nrhblkap
9 13 2117 2129 IjLoC7bma
9 13 2027 2039 jog5zabGL
9 13 1439 1451 obdRriFan
10 15 2025 2039 ijog5zabGL
11 18 2112 2129 BI6jLoC7bma
11 18 2017 2034 GLihjog5zab
12 21 1903 1923 jioVbr3HalhG
13 25 2112 2136 BI6jLoC7bmaiE
14 28 2112 2139 BI6jLoC7bmaiE0
15 33 2117 2149 IjLoC7bmaiE0G8F
16 36 2114 2149 6jILoC7bmaiE0G8F
16 36 2112 2147 BI6jLoC7bmaiE0G8
17 38 2112 2149 BI6jLoC7bmaiE0G8F
18 48 2117 2164 IjLoC7bmaiE0G8FM9Y
19 51 2114 2164 6jILoC7bmaiE0G8FM9Y
20 53 2112 2164 BI6jLoC7bmaiE0G8FM9Y
21 70 2095 2164 woabmBI6jLC7iE0G8FM9Y
22 81 1400 1480 ObPQBuiKaodRrFnGLMgSjc
22 81 1399 1479 cObPQBuiKaodRrFnGLMgSj
23 91 1389 1479 hcbaOPQBuiKodRrFnGLMgSj
23 91 1383 1473 fibhcaOPQBuKodRrFnGLMgS
24 97 1383 1479 fibhcaOPQBuKodRrFnGLMgSj
25 112 1368 1479 NifbhcaOPQBuKodRrFnGLMgSj
26 131 2112 2242 BI6jLoC7bmaiE0G8FM9Y+-JV/*
27 148 2095 2242 woabmBI6jLC7iE0G8FM9Y+-JV/*
28 169 2095 2263 woabmBI6jLC7iE0G8FM9Y+-JV/*t
29 181 2112 2292 BI6jLoC7bmaiE0G8FM9Y+-JV/*th3
30 189 2112 2300 BI6jLoC7bmaiE0G8FM9Y+-JV/*th3l
31 193 2112 2304 BI6jLoC7bmaiE0G8FM9Y+-JV/*th3lr
32 210 2095 2304 woabmBI6jLC7iE0G8FM9Y+-JV/*th3lr
33 252 1991 2242 Nboia0YLKGhjg5zmwBI6C7E8FM9+-JV/*
33 252 1912 2163 3rbHalhGL4oiwYVN0Kjg5zmBI6C7E8FM9
34 267 1894 2160 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM
35 270 1894 2163 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9
36 291 1894 2184 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+
37 314 1991 2304 Nboia0YLKGhjg5zmwBI6C7E8FM9+-JV/*t3lr
38 327 1894 2220 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+-J
39 339 1894 2232 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+-J/
40 349 1894 2242 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+-J/*
41 370 1894 2263 n2ijoVbr3HalhGL4wYN0Kg5zmBI6C7E8FM9+-J/*t
42 469 1795 2263 1ioabGLjwhCKEn2Vr3Hl4YN0g5zmBI678FM9+-J/*t
43 501 1763 2263 ZL0KGboia1jwhCEn2Vr3Hl4YNg5zmBI678FM9+-J/*t
44 516 1727 2242 WiXaoGVYZL0Kb1jwhCEn2r3Hl4Ng5zmBI678FM9+-J/*
45 536 1400 1935 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl4
46 609 1655 2263 UbGLNoijmaVWXYZ0K1whCEn2r3Hl4g5zBI678FM9+-J/*t
47 664 1600 2263 DabGLoiMUNjmVWXYZ0K1whCEn2r3Hl4g5zBI678F9+-J/*t
48 706 1558 2263 TijbNLaGKDoMUmVWXYZ01whCEn2r3Hl4g5zBI678F9+-J/*t
49 724 1400 2123 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I67
50 748 1400 2147 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I678
51 764 1400 2163 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789
52 785 1400 2184 ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+
53 815 1406 2220 PQBubiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J
54 821 1400 2220
ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J
55 833 1400 2232
ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J/
56 843 1400 2242
ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J/*
57 864 1400 2263
ObPQBuiKaodRrFnGLMgSjcfNzTDUmVWXYZ01whCE23Hl45I6789+-J/*t
58 1005 1259 2263
xizabrJnGKLMdoNfhcOPQBuRFgSjTDUmVWXYZ01wCE23Hl45I6789+-/*t
59 1070 1194 2263
ybaoilnrcxzJGKLMdNfhOPQBuRFgSjTDUmVWXYZ01wCE23H45I6789+-/*t
60 1225 1039 2263
elribnoaEdcBjFhGHIyxzJKLMNfOPQuRgSTDUmVWXYZ01wC23456789+-/*t
61 1255 1009 2263
viabhlernoEdcBjFGHIyxzJKLMNfOPQuRgSTDUmVWXYZ01wC23456789+-/*t
62 1567 697 2263
AibalrnjBCDvheoEdcFGHIyxzJKLMNfOPQuRgSTUmVWXYZ01w23456789+-/*t
63 1852 391 2242
tknbaucgdhorijvwxyzflABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/*
64 1879 364 2242
pblcdgaftknuhorijvwxyzABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/*
64 1879 354 2232
sinkhpblcdgaftuorjvwxyzABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/
65 1889 354 2242
sinkhpblcdgaftuorjvwxyzABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/*
66 1950 293 2242
qlnhrbkapsicdgftuojvwxyzABCDeEFGHIJKLMNOPQRSTUmVWXYZ0123456789+-/*
Time to scan: 853094 (1200 potential solutions)
Time to filter: 874588 (253 filtered solutions)

1.7M cycles, 0.75ms total time.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Dick Wesseling
2010-05-26 13:23:39 UTC
Permalink
In article <***@mid.individual.net>,
Robert AH Prins <***@prino.org> writes:
[snip]
Post by Robert AH Prins
Post by Robert AH Prins
Time to scan: 2285008 (1200 potential solutions)
Time to filter: 2459754 (253 filtered solutions)
Does this seem reasonable?
The total solution time is about 2.2 ms...
I think my very first Pascal (Turbo Pascal 3.01a) solution using a
sliding window without any optimizations ran for around 26 *minutes*
(OK, on a 16MHz 386) on a far smaller subset (less than half) of this
data, so I think 2.2 ms is pretty reasonable. ;)
Still cutting the Pascal - the PL/I would be easier, but PL/I compilers
are a bit thin on the ground...
My results:

2165682 cycles 253 solutions.

Using 2K of random data instead takes 6 times longer.
Robert AH Prins
2010-05-25 22:35:24 UTC
Permalink
Post by Terje Mathisen
Post by James J. Weinkam
Post by Robert AH Prins
His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out, as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to why.
Can you send a copy of the data set that fails? Also, if it isn't prying
into your business, would you describe the application this problem is
abstracted from? You indicated in your original post that the values
(characters in the posted example) were actually indices into a table in
the real application.
I would also love to see that, I really do believe that my current
approach is sound.
Group 61 - the x-marked 3-string (FEG) is not found by the Paul Green code:

61 1 A
61 2 A
61 3 B
61 4 C
61 5 D
61 6 E
61 7 F x
61 8 E x
61 9 G x
61 10 G
61 11 G
61 12 G
61 13 G
61 14 G
61 15 G
61 16 G
61 17 G
61 18 G
61 19 H
61 20 I
61 21 C
61 22 D
61 23 A
61 24 A
61 25 A
61 26 A
61 27 A

Group 69, again the x-marked 3-string (HIJ) is not found

69 1 A
69 2 A
69 3 A
69 4 A
69 5 A
69 6 B
69 7 B
69 8 C
69 9 C
69 10 D
69 11 E
69 12 F
69 13 G
69 14 H
69 15 H
69 16 H
69 17 H x
69 18 I x
69 19 J x
69 20 J
69 21 J

Will extract the code from the Pascal program and get it working RSN;
I'm a bit tired from travel, as it takes as long to get from Vilnius to
Charleroi as from Charleroi to Oostende.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Dick Wesseling
2010-05-26 03:48:27 UTC
Permalink
What a relief, mine works:

5641 cycles 6 solutions
AABCDEFEGGGGGGGGGGHICDAAAAA
len win pos
9 19 1 ABCDEFEGGGGGGGGGGHI
8 17 6 FEGGGGGGGGGGHICDA
7 8 1 ABCDEFEG
6 6 1 ABCDEF
3 3 6 FEG <there it is!
6 6 17 GHICDA
Post by Robert AH Prins
Group 69, again the x-marked 3-string (HIJ) is not found
5630 cycles 6 solutions
AAAAABBCCDEFGHHHHIJJJ
len win pos
10 15 4 ABBCCDEFGHHHHIJ
9 13 6 BCCDEFGHHHHIJ
8 10 4 ABBCCDEFGH
7 8 6 BCCDEFGH
6 6 8 CDEFGH
3 3 16 HIJ <there it is!
Post by Robert AH Prins
Will extract the code from the Pascal program and get it working RSN;
I'm a bit tired from travel, as it takes as long to get from Vilnius to
Charleroi as from Charleroi to Oostende.
If it isn't asking too much, once you've taken some rest would you mind
posting one of the longer (2K or so) data sets?
Earlier I had to disappoint the readers of clax by saying that it wasn't
worth bothering about an asm version because it outperformed the C
version by a small margin only. However, that was for random data.
Random data suffocates branch prediction causing _any_ version to come
to a grinding halt (*).

For real data such as group 61 and 69 and the original example the asm
outperformed the high level version by 30% or so. Still not something to
get excited about, but I'm curious how it would perform on data sets
that are both real and large.

(*) I expect Professor Weinkam's solution to perform much better for
random data. If my understanding is correct, most of the time is spent

do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;

a branchless implementation of min() should work fine, and even a
branching one should have a 99% branch prediction success rate
because overflow will happen only for very large data sets.
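[Editorial note: the branchless min() Dick mentions can be written, for example, as follows. A generic sketch; `min_branchless` is a hypothetical name, and the trick is only valid when a - b does not overflow.]

```c
/* Branchless minimum of two ints: (a < b) evaluates to 0 or 1, so the
   mask -(a < b) is either all-ones (select a) or all-zeros (select b).
   Valid whenever a - b does not overflow. */
static int min_branchless(int a, int b)
{
    return b + ((a - b) & -(a < b));
}
```

In the PL/I loop above, min(32766, ...) acts as a saturating clamp so the distance counters cannot wrap; the same mask trick applies there.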
Terje Mathisen
2010-05-26 06:23:32 UTC
Permalink
Post by Dick Wesseling
5641 cycles 6 solutions
AABCDEFEGGGGGGGGGGHICDAAAAA
len win pos
9 19 1 ABCDEFEGGGGGGGGGGHI
8 17 6 FEGGGGGGGGGGHICDA
7 8 1 ABCDEFEG
6 6 1 ABCDEF
3 3 6 FEG <there it is!
6 6 17 GHICDA
I got these:
C:\>\c2\unique-strings\Release\unique-strings.exe
len aperture from to value
3 3 7 9 FEG
6 6 18 23 GHICDA
6 6 2 7 ABCDEF
7 8 2 9 ABCDEFEG
8 17 7 23 FEGGGGGGGGGGHICDA
9 19 2 20 ABCDEFEGGGGGGGGGGHI
Time to scan: 3278 (26 potential solutions)
Time to filter: 10736 (6 filtered solutions)
Post by Dick Wesseling
Post by Robert AH Prins
Group 69, again the x-marked 3-string (HIJ) is not found
5630 cycles 6 solutions
AAAAABBCCDEFGHHHHIJJJ
len win pos
10 15 4 ABBCCDEFGHHHHIJ
9 13 6 BCCDEFGHHHHIJ
8 10 4 ABBCCDEFGH
7 8 6 BCCDEFGH
6 6 8 CDEFGH
3 3 16 HIJ <there it is!
and so did I:

len aperture from to value
3 3 17 19 HIJ
6 6 9 14 CDEFGH
7 8 7 14 BCCDEFGH
8 10 5 14 ABBCCDEFGH
9 13 7 19 BCCDEFGHHHHIJ
10 15 5 19 ABBCCDEFGHHHHIJ
Time to scan: 2453 (25 potential solutions)
Time to filter: 9735 (6 filtered solutions)


I have, however, found a _much_ faster way to filter my potential
solutions; I believe I can make it run in close to linear time. :-)
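[Editorial note: as an illustration of what the filter stage has to do - my reading of the thread, not Terje's actual criteria, which also exclude windows contained in longer all-distinct runs - one pass can find, per distinct count, the minimal aperture, and a second pass keeps only candidates achieving it. `filter_minimal` and `struct cand` are hypothetical names.]

```c
#include <limits.h>

/* A candidate window found by the scan stage:
   d distinct values occurring within positions [from..to]. */
struct cand { int d, from, to; };

/* Keep, for each distinct count d, only the candidates whose aperture
   (to - from + 1) is minimal for that d. Compacts the array in place
   and returns the new count. Assumes 0 <= d < 256 for all candidates. */
static int filter_minimal(struct cand *c, int n)
{
    int best[256];
    for (int d = 0; d < 256; d++)
        best[d] = INT_MAX;

    /* First pass: minimal aperture per distinct count. */
    for (int i = 0; i < n; i++) {
        int ap = c[i].to - c[i].from + 1;
        if (ap < best[c[i].d])
            best[c[i].d] = ap;
    }

    /* Second pass: keep only candidates achieving that minimum. */
    int m = 0;
    for (int i = 0; i < n; i++) {
        int ap = c[i].to - c[i].from + 1;
        if (ap == best[c[i].d])
            c[m++] = c[i];
    }
    return m;
}
```

Both passes are linear in the number of candidates, which is consistent with the close-to-linear filtering time mentioned above.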

Terje
Post by Dick Wesseling
Post by Robert AH Prins
Will extract the code from the Pascal program and get it working RSN;
I'm a bit tired from travel, as it takes as long to get from Vilnius to
Charleroi as from Charleroi to Oostende.
If it isn't asking too much, once you've taken some rest would you mind
posting one of the longer (2K or so) data sets?
Earlier I had to disappoint the readers of clax by saying that it wasn't
worth bothering about an asm version because it outperformed the C
version by a small margin only. However, that was for random data.
Random data suffocates branch prediction causing _any_ version to come
to a grinding halt (*).
For real data such as group 61 and 69 and the original example the asm
outperformed the high level version by 30% or so. Still not something to
get excited about, but I'm curious how it would perform on data sets
that are both real and large.
(*) I expect Professor Weinkam's solution to perform much better for
random data. If my understanding is correct, most of the time is spent in:
do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;
a branchless implementation of min() should work fine, and even a
branching one should have a 99% branch prediction success rate
because overflow will happen only for very large data sets.
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-26 13:16:20 UTC
Permalink
Post by Dick Wesseling
5641 cycles 6 solutions
AABCDEFEGGGGGGGGGGHICDAAAAA
len win pos
9 19 1 ABCDEFEGGGGGGGGGGHI
8 17 6 FEGGGGGGGGGGHICDA
7 8 1 ABCDEFEG
6 6 1 ABCDEF
3 3 6 FEG <there it is!
6 6 17 GHICDA
Post by Robert AH Prins
Group 69, again the x-marked 3-string (HIJ) is not found
5630 cycles 6 solutions
AAAAABBCCDEFGHHHHIJJJ
len win pos
10 15 4 ABBCCDEFGHHHHIJ
9 13 6 BCCDEFGHHHHIJ
8 10 4 ABBCCDEFGH
7 8 6 BCCDEFGH
6 6 8 CDEFGH
3 3 16 HIJ <there it is!
Post by Robert AH Prins
Will extract the code from the Pascal program and get it working RSN;
I'm a bit tired from travel, as it takes as long to get from Vilnius to
Charleroi as from Charleroi to Oostende.
If it isn't asking too much, once you've taken some rest would you mind
posting one of the longer (2K or so) data sets?
Earlier I had to disappoint the readers of clax by saying that it wasn't
worth bothering about an asm version because it outperformed the C
version by a small margin only. However, that was for random data.
Random data suffocates branch prediction causing _any_ version to come
to a grinding halt (*).
I'm afraid the 2329-entry set of input data, built up over 29 years, is
all that I have at the moment. It's likely to grow a bit more in the
coming years, at a rate of 40 to 80 entries per year (in 2008 there were
138 new values, but only 5 of them had never been seen; in 2009 there
were 56, with 3 previously unseen ones; and so far this year there have
been 67 new entries, but all of them have occurred before). Most of the
growth will come from the top-10 values, with possibly minor additions
from the values in the remainder of the top-20.

The amount of data from other users is likely to be rather a lot smaller
and more likely to contain far fewer unique values; I'll see if I can
get any.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-26 11:45:50 UTC
Permalink
Post by Robert AH Prins
I'm afraid the 2329 set of input data, built up over 29 years, is all
that I have at the moment. It's likely to grow a bit more in the coming
years, at a rate of 40 to 80 entries per year (in 2008 there were 138
new values, but only 5 of them had never been seen, in 2009 there were
56 with 3 previously unseen ones, and so far this year there have been
67 new entries, but all of them have occurred before). Most of the growth
will come from the top-10 values with possibly minor additions from the
values in the remainder of the top-20.
The amount of data from other users is likely to be rather a lot smaller
and more likely to contain far fewer unique values, I'll see if I can
get any.
In that case, and with my current code running at sub-ms speed, it
really seems like this is a completely solved problem.

I really can't see how you could need to re-run the scanner 1000
times/second, every second, for almost a week while waiting for the next
addition to your data set!
:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-06-08 21:11:27 UTC
Permalink
Post by James J. Weinkam
Post by Robert AH Prins
His code is in Pascal, and until I started running it for parts of the
input string I had had blind faith in it. Wrong, as it turns out as
there are three strings that it does not process correctly, and all my
debugging hasn't given me any clues as to the why.
Can you send a copy of the data set that fails? Also, if it isn't prying
into your business, would you describe the application this problem is
abstracted from? You indicated in your original post that the values
(characters in the posted example) were actually indices into a table in
the real application.
Somehow I missed this post, apologies!

The values are indices into a table of "DISTINGUISHING SIGNS USED ON
VEHICLES IN INTERNATIONAL TRAFFIC", or in more understandable language,
"international licence plate country code" signs - the current full
official list can be found at

http://www.unece.org/trans/conventn/Distsigns.pdf

one also containing informal ones can be found on Wikipedia,

http://en.wikipedia.org/wiki/List_of_international_license_plate_codes

The upper limit of 255 comes from here.

As for the application itself?

Ahum, it's from a program that processes hitchhike data...

The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.

The program is used by a few other hitch-hikers and one of them asked me
if I could change the code to also produce this table for the trips done
in a single year, and it was this change that resulted in the discovery
of the erroneous results for several sets of data.

The hitchhiking angle also explains why the data grows at a rather
leisurely pace, I'm no longer a spring-chicken and I do not hitchhike as
much as I used to do.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-06-09 06:11:06 UTC
Permalink
Post by Robert AH Prins
Ahum, it's from a program that processes hitchhike data...
<BG>
Post by Robert AH Prins
The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.
The program is used by a few other hitch-hikers and one of them asked me
if I could change the code to also produce this table for the trips done
in a single year, and it was this change that resulted in the discovery
of the erroneous results for several sets of data.
So in reality, this program only has to keep up with the rate of new
rides, comparing the current string with the previous records, right?
Post by Robert AH Prins
The hitchhiking angle also explains why the data grows at a rather
leisurely pace, I'm no longer a spring-chicken and I do not hitchhike as
much as I used to do.
I think I stopped doing that nearly 25 years ago: My wife & I tramped
around the US for two months on our honeymoon, flying all the
long-distance stretches, hitching rides on many shorter stretches.

In the US you'd have to check state licence plates instead, I guess.:-)

Anyway, I love the idea: Getting a large group of hotshot programmers to
optimize a program that could be handled with a small set of
handwritten/colored cards. :-)

Terje
Post by Robert AH Prins
Robert
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-06-09 10:42:18 UTC
Permalink
Post by Terje Mathisen
Post by Robert AH Prins
Ahum, it's from a program that processes hitchhike data...
<BG>
Post by Robert AH Prins
The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.
The program is used by a few other hitch-hikers and one of them asked me
if I could change the code to also produce this table for the trips done
in a single year, and it was this change that resulted in the discovery
of the erroneous results for several sets of data.
So in reality, this program only has to keep up with the rate of new
rides, comparing the current string with the previous records, right?
"Yes", he replied with a blush...
Post by Terje Mathisen
Post by Robert AH Prins
The hitchhiking angle also explains why the data grows at a rather
leisurely pace, I'm no longer a spring-chicken and I do not hitchhike as
much as I used to do.
I think I stopped doing that nearly 25 years ago: My wife & I tramped
around the US for two months on our honeymoon, flying all the
long-distance stretches, hitching rides on many shorter stretches.
My wife (I met her at the 4th International Hitch-Hikers Congress in
2000) and I did a bit of hitchhiking on our honeymoon in Japan, but only
short stretches.
Post by Terje Mathisen
In the US you'd have to check state licence plates instead, I guess.:-)
Your guess is as good as mine, I don't know if any Americans are using
the program.
Post by Terje Mathisen
Anyway, I love the idea: Getting a large group of hotshot programmers to
optimize a program that could be handled with a small set of
handwritten/colored cards. :-)
The very first posting started with "Can anyone give me some *hints* as
to solve the following problem, preferably in a way that is faster than
the way I used to do it, and without the bug in the current version;"
(emphasis added)

How could I possibly have known that a large group of hotshot
programmers would jump on something one of them referred to as "This is
an intriguing problem, I'll have to think some more... :-)" ;)

A few years ago I came across "CSCRX2HT.txt" on
https://sites.google.com/site/schlabb/home/code-snippets/rexx-to-html
and found that it had a raft of bugs. I managed to eliminate them,
thought that it would be fun to write similar routines for a bunch of
other "legacy" languages, even spent some of my own money on the
Javascript that produces the z/OS ISPF-like scrolling HTML and in the
end I made it freely available under the provisions of the GPL V3. See
http://www.cbttape.org/ftp/cbt/CBT769.zip - the contents are in TSO XMIT
format, use XMIT Manager, http://www.cbttape.org/njw/index.html to
extract the REXX code.

I've got a quote somewhere from a Dutch computer scientist, W.L. (Willem
Louis) van der Poel: "The intrinsic value of software is nil"

I may have cheated a little (or a lot, take your pick) by not telling
what the actual code was used for - if I had done so, the thread might
never have developed into the current pretty high quality-for-Usenet one.

The value of my hitchhike program is nil (or at least pretty close to
it), but to quote another contributor, Jerome Fine, "The problem is
very instructive and I really enjoy the description of what you fellows
are doing."

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Dr J R Stockton
2010-06-09 20:54:36 UTC
Permalink
Post by Robert AH Prins
The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.
That's a more understandable description than before, IMHO.

It does indeed seem a close match to what I wrote after and because of
reading that earlier description - JavaScript to find the shortest
interval with (all) 35 different Easter Sunday dates.

My script has a sliding window; the front of the window moves forward
until the content of the window becomes interesting, then the back moves
forward until the content is about to become uninteresting, and reports
if necessary. Passes are so fast that they can be repeated for
different "interesting".

See <URL:http://www.merlyn.demon.co.uk/estrcons.HTM#ESP>, and "Pop Code"
- the code is essentially pascal-compatible. Sets are not used; an
array of as many integer elements as there are possible elements is
needed.

FYI, the minimum span is 72 years, for both Gregorian and Julian; the
Julians are in the middle of such a period, but the Gregorians will have
to wait a few millennia for a shortest span.
--
(c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05.
Web <URL:http://www.merlyn.demon.co.uk/> - w. FAQish topics, links, acronyms
PAS EXE etc : <URL:http://www.merlyn.demon.co.uk/programs/> - see 00index.htm
Dates - miscdate.htm estrdate.htm js-dates.htm pas-time.htm critdate.htm etc.
Terje Mathisen
2010-06-10 05:48:13 UTC
Permalink
Post by Dr J R Stockton
Post by Robert AH Prins
The addition of this particular table (one of about 60 or so) was
inspired in 1995 by a ride with Danish truck driver who took me from
somewhere around Jönköping to Stockholm. He was the fourth consecutive
driver with the fourth distinct nationality, D-NL-S-DK. When I told him
so, he asked me if this had happened before and I told him I did not
have a clue, but the eventual addition of the table produced by my
sliding window code told me that it had in fact happened on 13 previous
occasions.
That's a more understandable description than before, IMHO.
It does indeed seem a close match to what I wrote after and because of
reading that earlier description - JavaScript to find the shortest
interval with (all) 35 different Easter Sunday dates.
Nice problem...
Post by Dr J R Stockton
My script has a sliding window; the front of the window moves forward
until the content of the window becomes interesting, then the back moves
forward until the content is about to become uninteresting, and reports
if necessary. Passes are so fast that they can be repeated for
different "interesting".
I assume you run the scan from any given starting year, then cross out
the various dates as you get to them?

As soon as you've found all possible dates, you optimize by running the
back end forward until you get to the first date which isn't repeated
anywhere later in the interval.

This is very easy to do by using a counting array to cross out each date
as you pass it: Increment when the front end finds a date, decrement
when the back end passes it.

Increment the total count each time a counter goes from 0->1, decrement
for 1->0.

(Actually, in asm you could do a little trick, by using -1 as the
initial count for each date and -35 as the initial day count:

;; Forward scan, for each year
;; EAX = easter_sunday(year)

add day_count[EAX*4],1 ; Generates a carry for -1 -> 0
adc EBX,0 ; Total number of different days
jc Found_35_Front

Similarly for the tail end:

sub day_count[EAX*4],1
sbb EBX,0
jc Found_35_Tail

This makes the code totally branchless except for each time you actually
find an edge of a new interval.
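The same counting scheme can be rendered in portable C as a classic two-pointer window (the carry trick above is an asm-only micro-optimization); min_cover and its parameters are my own illustrative names, not code from this thread:

```c
/* Shortest window of data[0..n-1] containing every value in 0..m-1.
   Returns the window length, or -1 if some value never occurs.
   Sketch only: values are assumed to lie in 0..255. */
static int min_cover(const int *data, int n, int m)
{
    int cnt[256] = {0};   /* occurrences of each value inside the window */
    int have = 0;         /* distinct values currently in the window     */
    int best = -1;
    int back = 0;

    for (int front = 0; front < n; front++) {
        if (cnt[data[front]]++ == 0)      /* 0 -> 1: new distinct value */
            have++;
        while (have == m) {               /* window covers all m values */
            int len = front - back + 1;
            if (best < 0 || len < best)
                best = len;
            if (--cnt[data[back++]] == 0) /* 1 -> 0: value dropped out  */
                have--;
        }
    }
    return best;
}
```

For the Easter-date problem, m would be 35 and data[] the per-year date codes; each time have reaches m, the back end advances exactly as described above.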
Post by Dr J R Stockton
See<URL:http://www.merlyn.demon.co.uk/estrcons.HTM#ESP>, and "Pop Code"
- the code is essentially pascal-compatible. Sets are not used; an
array of as many integer elements as there are possible elements is
needed.
FYI, the minimum span is 72 years, for both Gregorian and Julian; the
Julians are in the middle of such a period, but the Gregorians will have
to wait a few millennia for a shortest span.
:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Dr J R Stockton
2010-06-11 21:28:06 UTC
Permalink
In comp.lang.pascal.borland message <vta6e7-***@ntp.tmsw.no>, Thu,
10 Jun 2010 07:48:13, Terje Mathisen <"terje.mathisen at tmsw.no"> posted:
Post by Terje Mathisen
Post by Dr J R Stockton
It does indeed seem a close match to what I wrote after and because of
reading that earlier description - JavaScript to find the shortest
interval with (all) 35 different Easter Sunday dates.
Nice problem...
Post by Dr J R Stockton
My script has a sliding window; the front of the window moves forward
until the content of the window becomes interesting, then the back moves
forward until the content is about to become uninteresting, and reports
if necessary. Passes are so fast that they can be repeated for
different "interesting".
I assume you run the scan from any given starting year, then cross out
the various dates as you get to them?
You assume incorrectly, to judge by that sentence. But read on ...
Post by Terje Mathisen
As soon as you've found all possible dates, you optimize by running the
back end forward until you get to the first date which isn't repeated
anywhere later in the interval.
Yes.
Post by Terje Mathisen
This is very easy to do by using a counting array to cross out each
I don't cross out.
Post by Terje Mathisen
Increment when the front end finds a date, decrement when the back end
passes it.
Yes.
Post by Terje Mathisen
Increment the total count each time a counter goes from 0->1, decrement
for 1->0.
Yes.
Post by Terje Mathisen
(Actually, in asm you could do a little trick, by using -1 as the
It's fast enough in JavaScript, even for the 5700000 year Gregorian
cycle. There should be no need to optimise ASM code, fun though that
can be.
Post by Terje Mathisen
Post by Dr J R Stockton
See<URL:http://www.merlyn.demon.co.uk/estrcons.HTM#ESP>, and "Pop Code"
- the code is essentially pascal-compatible. Sets are not used; an
array of as many integer elements as there are possible elements is
needed.
FYI, the minimum span is 72 years, for both Gregorian and Julian; the
Julians are in the middle of such a period, but the Gregorians will have
to wait a few millennia for a shortest span.
I did think of adapting it for the maximum span not containing all 35
dates, but that must be given by the longest interval between date
repeats, which I've already done.

Aficionados of ISO 8601 will realise, with a little thought, that the
number of different dates is actually (probably; provably) 35 + 36 + 6.

My page <URL:http://www.merlyn.demon.co.uk/estrdate.htm> links to
various JavaScript full-range algorithms for Easter Sunday in
estralgs.txt - could you beat
<http://www.hugi.scene.org/compo/compoold.htm#compo20> in ASM while
retaining traceability to the formal standards (the Papal Bull, the
British Calendar Act, and whatever Norway relies on?!?)? The rules for
compo20 give Oudin's algorithm, which I can beat.
--
(c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05.
Web <URL:http://www.merlyn.demon.co.uk/> - w. FAQish topics, links, acronyms
PAS EXE etc : <URL:http://www.merlyn.demon.co.uk/programs/> - see 00index.htm
Dates - miscdate.htm estrdate.htm js-dates.htm pas-time.htm critdate.htm etc.
Terje Mathisen
2010-05-20 08:07:47 UTC
Permalink
Post by Terje Mathisen
Post by robin
| I believe the program below does what you want and should be a couple
| of orders of magnitude faster than the method you outlined (based on
| your statement about how many "PL/I statements" are executed on an
| input of length 2329 with 66 distinct values). For illustrative
| purposes, the program uses the character representation you used in
| your example for input and output, but the algorithm itself will work
| for values of c up to 32765 and v up to 32764 if you have enough
| memory and patience.
Assuming that it works, a very impressive effort!
Indeed.
I have not tried to read the code (or explanation) except to note that
it must be far better than O(n*n). :-)
Scan through the input string from beginning to end, while maintaining a
set of lookup tables, one for each previous starting position. In each
lookup table we will have two things: The values seen so far, and the
total number of different characters.
For each possible length we remember the shortest string that so far has
resulted in this many unique values, updating it when we find a shorter
string.
This way the processing would consist of loading data[i], then for each
of the previous bytes taken as the starting position check if this value
hasn't been seen before, and if so, update the count and check if this
gives a new shortest window:

for (i = 2; i < data_len; i++) {
  byte b = data[i];
  for (j = 0; j < i-2; j++) {
    if (!seen[j][b]) {
      seen[j][b] = 1;
      unique_values = ++count[j];
      len = i-j+1;
      if (len <= shortest[unique_values]) {
        save(unique_values, len, j);
        shortest[unique_values] = len;
      }
    }
  }
}
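Filled out into compilable C, the table-per-start sketch might look like the following. The array sizes, the scan wrapper and the global shortest[] are my own scaffolding, and the inner loop runs j up to i inclusive so that length-1 and length-2 windows are counted as well:

```c
#include <limits.h>
#include <string.h>

#define MAXN  4096   /* maximum input length, per the problem statement */
#define ALPHA 256    /* alphabet size */

/* shortest[u] = length of the shortest window containing u distinct values */
static int shortest[ALPHA + 1];

static void scan(const unsigned char *data, int n)  /* requires n <= MAXN */
{
    static unsigned char seen[MAXN][ALPHA]; /* seen[j][v]: v already in window from j */
    static int count[MAXN];                 /* distinct values in window starting at j */

    memset(seen, 0, sizeof seen);
    memset(count, 0, sizeof count);
    for (int u = 0; u <= ALPHA; u++)
        shortest[u] = INT_MAX;

    for (int i = 0; i < n; i++) {
        unsigned char b = data[i];
        for (int j = 0; j <= i; j++) {      /* every earlier start position */
            if (!seen[j][b]) {              /* b is new for the window [j..i] */
                seen[j][b] = 1;
                int u = ++count[j];
                int len = i - j + 1;
                if (len < shortest[u])
                    shortest[u] = len;      /* new minimal window, u distinct values */
            }
        }
    }
}
```

On the 56-character example that opened the thread this reproduces the minimal lengths 3, 7, 16, 23 and 25 for 3, 7, 8, 9 and 10 distinct values; the save()/pruning step that lists all winning candidates is left out.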
I just realized that the actual lookup tables can be skipped, while
making the program significantly faster:

Simply scan backwards through the input string, starting from the
previous position:

For each such position increment the count of unique values seen, until
we reach the first byte, or a duplicate of the current input: At this
point the scan can terminate at once, which will significantly reduce
the expected number of iterations, while getting rid of all the big
lookup tables:

for (i = 2; i < data_len; i++) {
  byte b = data[i];
  for (j = i-1; j >= 0; j--) {
    if (data[j] == b) break;
    len = i-j+1;
    unique_values = ++count[len];
    if (len <= shortest[unique_values]) {
      save(unique_values, len, j);
      shortest[unique_values] = len;
    }
  }
}
Post by Terje Mathisen
The save function will generate a list of all possible "winning"
candidates, so it must be sorted and pruned before the final output but
that is trivial.
This is still the same.
Post by Terje Mathisen
The main (only?) problem with the program above is that it is still
O(n*n), with data_len = 4096, the total number of iterations will be
1+2+3+4+...+4093 which is ~8M.
According to the birthday paradox, for random inputs the expected number
of samples needed to find a collision (i.e. a non-unique value) in a set
of size n is ~sqrt(n), so in one fell swoop the rewrite above has
reduced the algorithm from O(n*n) to O(n*sqrt(m)) where (n) is the input
array length and (m) is the alphabet size.

I.e. for the case of n=4096 and m=256, the expected iteration count is
now 4096*16=64K, and everything will fit in L1 cache.

Running time should be _well_ below a single ms.

The inner loop can never iterate over more than the last (m) characters,
so we have 4kB of input and 512 bytes (with 16-bit counters) for the
active part of the shortest[] table.

Going to huge inputs (more than 64k of data and a 32-bit (utf8)
alphabet) we end up with n*4 bytes of input and n*4 bytes of shortest[]
table space, for a total of 8*n.

Assuming a 4GB input array consisting of 1G 32-bit random values, the
expected iteration count becomes ~1G * sqrt(1G) = 1G*32k or 32e12.

With 5 instructions in the inner loop, say 10 cycles (all accesses are
sequential so we get maximum use of L1 and L2!), this would take 3-6 hours.
Post by Terje Mathisen
I expect the running time to be ~ 10-20 cycles/inner loop iteration, or
80-160M cycles for the 4K input, letting the program finish in less than
a tenth of a second on any currently existing PC.
Is that fast enough?
Obviously not. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
James J. Weinkam
2010-05-18 08:27:44 UTC
Permalink
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
...
I believe the program below does what you want and should be a couple of orders of magnitude faster
than the method you outlined (based on your statement about how many "PL/I statements" are executed
on an input of length 2329 with 66 distinct values). For illustrative purposes, the program uses the
character representation you used in your example for input and output, but the algorithm itself
will work for values of c up to 32765 and v up to 32764 if you have enough memory and patience.

The program uses four arrays: y(c), a character array into which the input is read; x(c), an integer
array giving the encoded input with a=1, b=2, etc; a(0:c,v+1), described below; and l(v), with l(j)
the length of the shortest subsequence with j distinct values.

The basic idea is to build an array, a, with a row for each cell and a column for each value. The
i,j element of the array gives the relative cell number in the input array starting from cell i of
the first occurrence of value j, if any, otherwise a(i,j) is greater than max(c,v+1). The array is
built starting from the last row, c, and working back to row 1. Row c is first set to max(c+1,v+2).
Then a(c,x(c)) is set to 1. Working backward for i=c-1 to 1, each element of row i is set to the
minimum of 32766 and the corresponding element of row i+1 plus 1 and then element a(i,x(i)) is set to 1.

Once this construction is complete, it is evident that there is one and only one instance of the
integer 1 in each row, and all values in each row less than c+1 are unique.

Next each row is sorted. After sorting, the a(i,j) element contains the length of the shortest
subsequence starting at position i that contains j distinct values unless there is no such
subsequence, in which case a(i,j) is greater than c. It is also evident that if a(i,j)=j then
a(i,k)=k for 1<=k<=j. Now if a(i,j)=j the subsequence of length j starting at position i has j
distinct values, but it should be excluded from the solution set for j distinct values if it is a
proper subsequence of a longer subsequence of distinct values. This will be so iff a(i,j+1)=j+1 or
a(i-1,j+1)=j+1.
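The row construction and per-row sort can be sketched in C (a different language than the PL/I source below; scan_rows and shortest_len are my own names). One backward pass maintains the first-occurrence index of every value, so the distances for start i are exactly the a(i,j) of the description, and after sorting the j-th smallest is the length of the shortest subsequence starting at i with j distinct values. The exclusion test and the output scan are omitted:

```c
#include <limits.h>
#include <stdlib.h>

#define ALPHA 256   /* alphabet size; the description's v */

/* shortest_len[j] = global minimum length of a subsequence with j distinct values */
static int shortest_len[ALPHA + 1];

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static void scan_rows(const unsigned char *data, int n)
{
    int next[ALPHA];   /* index of the first occurrence of each value at or after i */
    int row[ALPHA];

    for (int v = 0; v < ALPHA; v++)
        next[v] = n;                        /* n means "does not occur" */
    for (int j = 0; j <= ALPHA; j++)
        shortest_len[j] = INT_MAX;

    for (int i = n - 1; i >= 0; i--) {      /* build row i from row i+1 */
        next[data[i]] = i;
        int m = 0;
        for (int v = 0; v < ALPHA; v++)
            if (next[v] < n)
                row[m++] = next[v] - i + 1; /* relative cell of first occurrence */
        qsort(row, m, sizeof row[0], cmp_int);
        for (int j = 1; j <= m; j++)        /* row[j-1] = shortest length, j distinct */
            if (row[j - 1] < shortest_len[j])
                shortest_len[j] = row[j - 1];
    }
}
```

On the thread's 56-character example this yields the same minimal lengths as the full program (3, 7, 16, 23, 25 for 3, 7, 8, 9, 10 distinct values), at O(n * v log v) rather than O(n*n) work.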

Based on these considerations, the solution sets for each j are constructed by scanning each column
starting with j=3 and proceeding to j=v. For this purpose the array a actually has an additional
row, 0, to contain the subscript of the first cell of the first minimal subsequence for each j, and
an extra column, v+1, containing values greater than c to ensure that the test for inclusion can be
carried out for j=v. For each j, l(j) is initialized to c+1 and k (the previous element in the list
of minimal subsequences) to 0. The scan proceeds as follows: if a(i,j) is less than l(j) the current
subsequence is shorter that the shortest found so far so l(j) is updated and k reset to 0. If a(i,j)
is equal to l(j) the inclusion test is applied and if it passes the current subsequence is added to
the list. Once the entire column has been processed the list is terminated by setting a(k,j)=0.

It is now a simple matter to read off the solution subsequences. The program utilizes sorta to sort
the rows. call sorta(addr(first element),n,w,c) sorts an array of n elements of width w in place
according to the comparison function c. c(u,v) returns '1'b (bit(1) aligned) if the u-th element may
precede the v-th element, i.e., x(u)<=x(v).


Source:

%process mar(2,100) offset;
subsets: proc options(main) reorder;
dcl
(c,v,i,j,k,m,ar init((rank('a')-1))) bin fixed,
(x(c),a(0:c,v+1),l(v)) bin fixed ctl,
y(c) char(1) ctl,
sorta entry(ptr,bin fixed(31),bin fixed(31),
entry(bin fixed(31),bin fixed(31)) returns(bit(1) aligned)),
vfmt entry(bin float(53),bin fixed(15),bin fixed(15)) returns(char(50) var),
sysin file input,
sysprint file print,
(addr,max,min,rank) builtin;
get file(sysin) list(c,v);
put file(sysprint) edit('c: ',vfmt(c,10,0),', v: ',vfmt(v,10,0))(col(1),4 a);
allocate x,a,l,y;
get file(sysin) edit(y)(col(1),(c)a(1));
put file(sysprint) edit('Input: ',y)(col(1),a,(c)a(1));
do i=1 to c; x(i)=rank(y(i))-ar; end;
a(c,*)=max(c+1,v+2); a(c,x(c))=1;
do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;
do i=1 to c; call sorta(addr(a(i,1)),v,2,comp); end;
do j=3 to v; l(j)=c+1; k=0; m=j+1; a(0,m)=0;
do i=1 to c;
if a(i,j)<l(j) then do; l(j)=a(i,j); k=0; end;
if a(i,j)=l(j) then if a(i-1,m)~=m then if a(i,m)~=m then do; a(k,j),k=i; end;
end;
a(k,j)=0;
if a(0,j)>0 then do;
put file(sysprint) edit('Distinct: ',vfmt(j,10,0),', length: ',vfmt(l(j),10,0))(col(1),4 a);
i=a(0,j); k=0; do while(i~=0); k+=1;
put file(sysprint) edit(k,': (',vfmt(i,10,0),') ',(y(i+m) do m=0 to l(j)-1))
(col(1),f(10),3 a,(l(j))a(1));
i=a(i,j);
end;
end;
end;
comp: proc(u,v) returns(bit(1) aligned) reorder;
dcl (u,v) bin fixed(31);
return(a(i,u)<=a(i,v));
end comp;
end subsets;

Input:

56 10
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja

Output:

c: 56, v: 10
Input: aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
Distinct: 3, length: 3
1: (40) bef
2: (44) fgh
Distinct: 7, length: 7
1: (50) gihbfja
Distinct: 8, length: 16
1: (41) efffghggggihbfja
Distinct: 9, length: 23
1: (34) cbbbbbbefffghggggihbfja
Distinct: 10, length: 25
1: (32) dccbbbbbbefffghggggihbfja

Here is the output from another example:

c: 48, v: 10
Input: abcabcabcabcdefgabcabccdabcdeabcaaabbbcccdefghij
Distinct: 3, length: 3
1: (1) abc
2: (2) bca
3: (3) cab
4: (4) abc
5: (5) bca
6: (6) cab
7: (7) abc
8: (8) bca
9: (9) cab
10: (18) bca
11: (19) cab
12: (20) abc
13: (31) bca
Distinct: 4, length: 4
1: (23) cdab
2: (24) dabc
Distinct: 5, length: 5
1: (25) abcde
2: (26) bcdea
3: (27) cdeab
4: (28) deabc
Distinct: 7, length: 7
1: (10) abcdefg
2: (11) bcdefga
3: (12) cdefgab
4: (13) defgabc
Distinct: 8, length: 8
1: (41) cdefghij
Distinct: 9, length: 11
1: (38) bcccdefghij
Distinct: 10, length: 14
1: (35) abbbcccdefghij
James J. Weinkam
2010-05-18 17:56:36 UTC
Permalink
Sorry for the double posting.

When I sent the message the first time, a message from my outgoing mail server appeared saying
something to the effect that "this account is valid for email only; postings to newsgroups are ignored."

I looked on the newsgroup and sure enough the message was not there. I messed around a bit with the
copy in my sent folder and tried again. I did not get the message from the isp this time but the
message did not show up on the newsgroup either. I decided to give up and go to bed and try again in
the morning. By that time both messages had appeared.

Can anyone explain what was going on? I have never seen this behavior before.
Frank Kotler
2010-05-18 18:40:54 UTC
Permalink
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing mail
server appeared saying something to the effect that "this account is
valid for email only; postings to newsgroups are ignored."
Hi James,

The NG comp.lang.asm.x86 is moderated. What this means is that your NNTP
server, instead of posting the article as usual, sends it - by mail - to
the moderator for approval. I've never seen that message, but I suspect
that's where it's from. When the moderator (or the "moderator's
apprentice", me) approves the message, we re-post it to an NNTP server.
This time, bearing the highly arcane moderator's incantation, it gets
posted. In the case of a cross-posted message, this delays posting to
all groups, even the unmoderated ones. We try to keep on top of it, but
there is inevitably some delay.

I strongly suspected I was approving the same message twice, but would
rather do that than risk missing some minor edit. If anyone is terribly
bothered by the double-posting, my abject apologies on behalf of the
moderation team - and my personal advice: "get a life!" :)

Sorry 'bout that - we appreciate your posting!

Best,
Frank
James J. Weinkam
2010-05-20 02:17:08 UTC
Permalink
Post by Frank Kotler
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing
mail server appeared saying something to the effect that "this account
is valid for email only; postings to newsgroups are ignored."
Hi James,
The NG comp.lang.asm.x86 is moderated. What this means is that your NNTP
server, instead of posting the article as usual, sends it - by mail - to
the moderator for approval. I've never seen that message, but I suspect
that's where it's from. When the moderator (or the "moderator's
apprentice", me) approves the message, we re-post it to an NNTP server.
This time, bearing the highly arcane moderator's incantation, it gets
posted. In the case of a cross-posted message, this delays posting to
all groups, even the unmoderated ones. We try to keep on top of it, but
there is inevitably some delay.
I strongly suspected I was approving the same message twice, but would
rather do that than risk missing some minor edit. If anyone is terribly
bothered by the double-posting, my abject apologies on behalf of the
moderation team - and my personal advice: "get a life!" :)
Sorry 'bout that - we appreciate your posting!
Best,
Frank
Well, this just goes to show that you learn something new every day.

Looking at the subsequent posts in this thread, it strikes me that the process being used to deal
with messages posted to multiple news groups, some of which are moderated, is ill-conceived. The
outcome should be the same as if the message were posted separately to each group. This could easily
be accomplished by sending the message directly to each unmoderated group and directly to the
moderator of each moderated group. Why should there be any interaction? How is the ordering decided?
Just the order the poster happened to choose?

Anyway, thanks for the explanation. However, it still doesn't explain the pop-up I got when I first
posted the message. Surely, in many years of posting to news groups, I must have posted a message to
multiple groups, some of which were moderated, before. Nevertheless, I have never seen that message
from the mail server that the "account is only valid for email and newsgroups will be ignored."
Moreover, that isn't what eventually happened.
Nathan
2010-05-20 05:02:50 UTC
Permalink
Post by James J. Weinkam
Post by Frank Kotler
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing
mail server appeared saying something to the effect that "this account
is valid for email only; postings to newsgroups are ignored."
Hi James,
The NG comp.lang.asm.x86 is moderated. What this means is that your NNTP
server, instead of posting the article as usual, sends it - by mail - to
the moderator for approval. I've never seen that message, but I suspect
that's where it's from. When the moderator (or the "moderator's
apprentice", me) approves the message, we re-post it to an NNTP server.
This time, bearing the highly arcane moderator's incantation, it gets
posted. In the case of a cross-posted message, this delays posting to
all groups, even the unmoderated ones. We try to keep on top of it, but
there is inevitably some delay.
I strongly suspected I was approving the same message twice, but would
rather do that than risk missing some minor edit. If anyone is terribly
bothered by the double-posting, my abject apologies on behalf of the
moderation team - and my personal advice: "get a life!" :)
Sorry 'bout that - we appreciate your posting!
Best,
Frank
Well this just goes to show that you learn something new every day.
Looking at the subsequent posts in this thread, it strikes me that the process being used to deal
with messages posted to multiple news groups some of which are moderated is ill conceived. The
outcome should be the same as if the message were posted separately to each group. This could easily
be accomplished by sending the message directly to each unmoderated group and directly to the
moderator of each moderated group. Why should there be any interaction? How is the ordering decided?
Just the order the poster happened to choose?
Well, I am pretty sure these questions were asked when Usenet was
quite young. I imagine that policies and 'actual practice' have long
been not-quite-in-sync and surely both have evolved much over time.
If you are interested in the details, there are resources specifically
dedicated to these topics:

The Big-8 Management Board
http://www.big-8.org/dokuwiki/doku.php

Internet Systems Consortium
http://www.isc.org/community/reference

And some newsgroups where the above folks 'hang out' to post
references and answer questions:

news.admin.announce
news.announce.newsgroups
news.groups.questions
news.answers

The status of Usenet as a whole seems that it may not be in very good
health: http://news.slashdot.org/story/10/05/18/2342241/Duke-To-Shut-Down-Usenet-Server?art_pos=24

Maybe the day will come when we'll need to set up our own News
Servers?

Heck, maybe people would see the value of it if there was a Facebook
app that allowed them to read/post to Usenet?? I dunno.. maybe it'll
just follow the gopher? :(
Post by James J. Weinkam
Anyway, thanks for the explanation. However it still doesn't explain the pop up I got when I first
posted the message. Surely in many years of posting to news groups, I must have posted a message to
multiple groups before, some of which were moderated. Nevertheless, I have never seen that message
from the mail server that the "account is only valid for email and newsgroups will be ignored."
Moreover that isn't what eventually happened.
Probably just a quirk of your Newsreader -- it is the only item that
can actually produce a 'pop up' on your machine.

Nathan.
Shmuel (Seymour J.) Metz
2010-05-21 00:55:44 UTC
Permalink
Post by James J. Weinkam
Looking at the subsequent posts in this thread, it strikes me
that the process being used to deal with messages posted to
multiple news groups some of which are moderated is ill
conceived.
That's the result of taking a superficial look at a complex problem.
Post by James J. Weinkam
The outcome should be the same as if the message were posted
separately to each group.
That would be fundamentally wrong, for reasons that have been hashed
out ad infinitum.
--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to ***@library.lspace.org
Dr J R Stockton
2010-05-22 16:52:46 UTC
Permalink
Post by James J. Weinkam
Looking at the subsequent posts in this thread, it strikes me that the
process being used to deal with messages posted to multiple news groups
some of which are moderated is ill conceived.
No. It is well-defined and appropriate. All moderators should
understand it. All moderating software should implement it.
Post by James J. Weinkam
The outcome should be the same as if the message were posted
separately to each group.
No. If you want to do that :
(a) you are one or more of : foolish, ignorant, or inconsiderate,
(b) you can do it directly and easily enough.

You should have permission from SFU (commonly considered, IIRC, a
reputable organisation) to cause spam to be directed to that address?
--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
Grandson-Of-RFC1036 is released. RFC 5536 Netnews Article Format is a
subset of Internet Message Format which is described in RFC 5322. The
RFCs are read together to determine standard Netnews article format.
Nathan Baker
2010-05-23 05:06:58 UTC
Permalink
Post by Dr J R Stockton
Post by James J. Weinkam
Looking at the subsequent posts in this thread, it strikes me that the
process being used to deal with messages posted to multiple news groups
some of which are moderated is ill conceived.
No. It is well-defined and appropriate. All moderators should
understand it. All moderating software should implement it.
The word "should" implies a *desired* state. In practice, well...

For instance, one would expect all such software to pass-along the MIME
headers. When we were using PyModerator last year, we discovered that it
did not do this. So we patched it.
Post by Dr J R Stockton
Post by James J. Weinkam
The outcome should be the same as if the message were posted
separately to each group.
(a) you are one or more of : foolish, ignorant, or inconsiderate,
I'm pretty sure we don't do that here at CLAX. But, it is possible we might
not be following all of the traditional practice in other areas. Mainly
because our tree-house is still a bit in an "under construction" state.
However, we are none of the above 'colorful words' -- we are simply
pragmatic. :)

I am currently in the process of porting our moderation client software from
Linux to Windows (ASM code just "wants" to be OS agnostic!!), so I intend to
revisit all of the related "handbooks", RFCs, etc. during this time.
Post by Dr J R Stockton
You should have permission from SFU (commonly considered, IIRC, a
reputable organisation) to cause spam to be directed to that address?
A good point. But I bet that Mr. Weinkam wants to tell you that it "isn't
any of your business" and while doing so, he might drop the leading 'S' from
his organisation's acronym. :)

Nathan.
mecej4
2010-05-19 02:02:31 UTC
Permalink
Post by James J. Weinkam
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
...
I believe the program below does what you want and should be a couple
of orders of magnitude faster than the method you outlined (based on
your statement about how many "PL/I statements" are executed on an
input of length 2329 with 66 distinct values). For illustrative
purposes, the program uses the character representation you used in
your example for input and output, but the algorithm itself will work
for values of c up to 32765 and v up to 32764 if you have enough
memory and patience.

The program uses four arrays: y(c), a character array into which the
input is read; x(c), an integer array giving the encoded input with
a=1, b=2, etc; a(0:c,v+1), described below; and l(v), with l(j) the
length of the shortest subsequence with j distinct values.

The basic idea is to build an array, a, with a row for each cell and a
column for each value. The i,j element of the array gives the relative
cell number in the input array, starting from cell i, of the first
occurrence of value j, if any; otherwise a(i,j) is greater than
max(c,v+1). The array is built starting from the last row, c, and
working back to row 1. Row c is first set to max(c+1,v+2). Then
a(c,x(c)) is set to 1. Working backward for i=c-1 to 1, each element
of row i is set to the minimum of 32766 and the corresponding element
of row i+1 plus 1, and then element a(i,x(i)) is set to 1. Once this
construction is complete, it is evident that there is one and only one
instance of the integer 1 in each row, and all values in each row less
than c+1 are unique.

Next each row is sorted. After sorting, the a(i,j) element contains
the length of the shortest subsequence starting at position i that
contains j distinct values, unless there is no such subsequence, in
which case a(i,j) is greater than c. It is also evident that if
a(i,j)=j then a(i,k)=k for 1<=k<=j. Now if a(i,j)=j the subsequence of
length j starting at position i has j distinct values, but it should
be excluded from the solution set for j distinct values if it is a
proper subsequence of a longer subsequence of distinct values. This
will be so iff a(i,j+1)=j+1 or a(i-1,j+1)=j+1.

Based on these considerations, the solution sets for each j are
constructed by scanning each column, starting with j=3 and proceeding
to j=v. For this purpose the array a actually has an additional row,
0, to contain the subscript of the first cell of the first minimal
subsequence for each j, and an extra column, v+1, containing values
greater than c to ensure that the test for inclusion can be carried
out for j=v. For each j, l(j) is initialized to c+1 and k (the
previous element in the list of minimal subsequences) to 0. The scan
proceeds as follows: if a(i,j) is less than l(j), the current
subsequence is shorter than the shortest found so far, so l(j) is
updated and k is reset to 0. If a(i,j) is equal to l(j), the inclusion
test is applied, and if it passes, the current subsequence is added to
the list. Once the entire column has been processed, the list is
terminated by setting a(k,j)=0.

It is now a simple matter to read off the solution subsequences. The
program utilizes sorta to sort the rows. call sorta(addr(first
element),n,w,c) sorts an array of n elements of width w in place
according to the comparison function c. c(u,v) returns '1'b (bit(1)
aligned) if the u-th element may precede the v-th element, i.e.,
x(u)<=x(v).
%process mar(2,100) offset;
subsets: proc options(main) reorder;
dcl
(c,v,i,j,k,m,ar init((rank('a')-1))) bin fixed,
(x(c),a(0:c,v+1),l(v)) bin fixed ctl,
y(c) char(1) ctl,
sorta entry(ptr,bin fixed(31),bin fixed(31),
entry(bin fixed(31),bin fixed(31)) returns(bit(1) aligned)),
vfmt entry(bin float(53),bin fixed(15),bin fixed(15)) returns(char(50) var),
sysin file input,
sysprint file print,
(addr,max,min,rank) builtin;
get file(sysin) list(c,v);
put file(sysprint) edit('c: ',vfmt(c,10,0),', v: ',vfmt(v,10,0))(col(1),4 a);
allocate x,a,l,y;
get file(sysin) edit(y)(col(1),(c)a(1));
put file(sysprint) edit('Input: ',y)(col(1),a,(c)a(1));
do i=1 to c; x(i)=rank(y(i))-ar; end;
a(c,*)=max(c+1,v+2); a(c,x(c))=1;
do i=c-1 to 1 by -1; a(i,*)=min(32766,a(i+1,*)+1); a(i,x(i))=1; end;
do i=1 to c; call sorta(addr(a(i,1)),v,2,comp); end;
do j=3 to v; l(j)=c+1; k=0; m=j+1; a(0,m)=0;
do i=1 to c;
if a(i,j)<l(j) then do; l(j)=a(i,j); k=0; end;
if a(i,j)=l(j) then if a(i-1,m)~=m then if a(i,m)~=m then do;
a(k,j),k=i; end;
end;
a(k,j)=0;
if a(0,j)>0 then do;
put file(sysprint) edit('Distinct: ',vfmt(j,10,0),', length: ',vfmt(l(j),10,0))(col(1),4 a);
i=a(0,j); k=0; do while(i~=0); k+=1;
put file(sysprint) edit(k,': (',vfmt(i,10,0),') ',(y(i+m) do m=0 to l(j)-1))
(col(1),f(10),3 a,(l(j))a(1));
i=a(i,j);
end;
end;
end;
comp: proc(u,v) returns(bit(1) aligned) reorder;
dcl (u,v) bin fixed(31);
return(a(i,u)<=a(i,v));
end comp;
end subsets;
56 10
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
c: 56, v: 10
Input: aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
Distinct: 3, length: 3
1: (40) bef
2: (44) fgh
Distinct: 7, length: 7
1: (50) gihbfja
Distinct: 8, length: 16
1: (41) efffghggggihbfja
Distinct: 9, length: 23
1: (34) cbbbbbbefffghggggihbfja
Distinct: 10, length: 25
1: (32) dccbbbbbbefffghggggihbfja
c: 48, v: 10
Input: abcabcabcabcdefgabcabccdabcdeabcaaabbbcccdefghij
Distinct: 3, length: 3
1: (1) abc
2: (2) bca
3: (3) cab
4: (4) abc
5: (5) bca
6: (6) cab
7: (7) abc
8: (8) bca
9: (9) cab
10: (18) bca
11: (19) cab
12: (20) abc
13: (31) bca
Distinct: 4, length: 4
1: (23) cdab
2: (24) dabc
Distinct: 5, length: 5
1: (25) abcde
2: (26) bcdea
3: (27) cdeab
4: (28) deabc
Distinct: 7, length: 7
1: (10) abcdefg
2: (11) bcdefga
3: (12) cdefgab
4: (13) defgabc
Distinct: 8, length: 8
1: (41) cdefghij
Distinct: 9, length: 11
1: (38) bcccdefghij
Distinct: 10, length: 14
1: (35) abbbcccdefghij
Professor Weinkam's elegant algorithm solves the problem using c.v.lg v
operations (his notation), i.e., N.T.lg T (RAH Prins' notation).
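As a cross-check, the row-table algorithm Professor Weinkam describes
can be sketched in Python (an illustrative translation of his prose,
not his PL/I; the function and variable names are mine). It builds the
backward distance rows, sorts each row, and applies his
a(i,j+1)/a(i-1,j+1) exclusion test; on the thread's 56-character
example it reproduces the posted listing:

```python
def minimal_windows(x, v):
    """x: list of values in 1..v; positions reported 1-based, as in
    the thread. Returns {j: (length, [start positions])} for j=3..v."""
    c = len(x)
    INF = c + v + 2                        # plays the role of max(c+1,v+2)
    cur = [INF] * (v + 1)                  # cur[j]: distance from i to next j
    rows = [None] * (c + 1)
    for i in range(c, 0, -1):              # build backwards, as in the PL/I
        for j in range(1, v + 1):
            cur[j] = min(INF, cur[j] + 1)  # a(i,j) = min(INF, a(i+1,j)+1)
        cur[x[i - 1]] = 1                  # a(i,x(i)) = 1
        rows[i] = sorted(cur[1:]) + [INF]  # pad so column j+1 exists for j=v
    rows[0] = [INF] * (v + 1)              # extra row 0 for the i-1 test
    result = {}
    for j in range(3, v + 1):
        best, starts = INF, []
        for i in range(1, c + 1):
            d = rows[i][j - 1]             # shortest j-distinct window at i
            if d < best:
                best, starts = d, []
            # exclusion: drop windows inside a longer all-distinct run
            if d == best and rows[i][j] != j + 1 and rows[i - 1][j] != j + 1:
                starts.append(i)
        if starts and best <= c:
            result[j] = (best, starts)
    return result
```

Running it on "aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja"
with v=10 yields the same six solution sets as the sample output, and
nothing for 4, 5, or 6 distinct values.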

In the PL/I code that he gave, Professor Weinkam used two procedures
that are not included in IBM's VAPLI: SORTA and VFMT. However, it is
easy to supply the missing functions. In place of SORTA one may use
quicksort or mergesort, for which PL/I code is available at
http://rosettacode.org/wiki/Sorting_algorithms/Quicksort#PL.2FI and
http://rosettacode.org/wiki/Sort_an_integer_array#PL.2FI. For VFMT, the
following will do, although it allocates memory that is freed only by
exiting the program:

vfmt: proc(v) returns(char(10) var);
dcl (l,v) fixed bin;
dcl str char(10) var ctl;
allocate str;
if v < 10 then l=1;
else if v < 100 then l=2;
else if v < 1000 then l=3;
else l=4;
put string (str) edit (v) (F(l));
return (str);
end vfmt;

-- mecej4
James J. Weinkam
2010-05-20 22:18:30 UTC
Permalink
....
Post by mecej4
In the PL/I code that he gave, Professor Weinkam used two procedures
that are not included in IBM's VAPLI: SORTA and VFMT. However, it is
easy to supply the missing functions. In place of SORTA one may use
quicksort or mergesort, for which PL/I code is available at
http://rosettacode.org/wiki/Sorting_algorithms/Quicksort#PL.2FI and
http://rosettacode.org/wiki/Sort_an_integer_array#PL.2FI. For VFMT, the
following will do, although it allocates memory that is freed only by
vfmt: proc(v) returns(char(10) var);
dcl (l,v) fixed bin;
dcl str char(10) var ctl;
allocate str;
if v < 10 then l=1;
else if v < 100 then l=2;
else if v < 1000 then l=3;
else l=4;
put string (str) edit (v) (F(l));
return (str);
end vfmt;
Since the OP indicated he could handle programming details, I didn't think it necessary to include
the sort subroutine or the formatting routine, which is purely cosmetic in any case. I will post the
routines upon request.
Post by mecej4
-- mecej4
Dick Wesseling
2010-05-18 18:58:43 UTC
Permalink
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing mail server appeared saying
something to the effect that "this account is valid for email only; postings to newsgroups are ignored."
I looked on the newsgroup and sure enough the message was not there.
Which newsgroup? I am reading comp.lang.asm.x86 which is moderated,
so it is not unusual for a message to not show up immediately. I don't know
if the other two groups - comp.lang.pascal.borland and comp.lang.pl1 - are
moderated. What happens when posting to multiple groups when one is moderated
and the others are not?
Post by James J. Weinkam
I messed around a bit with the
copy in my sent folder and tried again. I did not get the message from the isp this time but the
message did not show up on the newsgroup either. I decided to give up and go to bed and try again in
the morning. By that time both messages had appeared.
Can anyone explain what was going on? I have never seen this behavior before.
I noticed some strange things too.
My message <4bf1cbef$0$22934$***@news6.xs4all.nl> references three
other messages:

References: <***@mid.individual.net>
<4bf0ccd0$0$22945$***@news6.xs4all.nl>
<***@mid.individual.net>

But the message that appeared in clax86 contained only the first
reference. Did the moderation software remove the other two? Never
seen that before.
glen herrmannsfeldt
2010-05-18 19:09:38 UTC
Permalink
In comp.lang.pl1 Dick Wesseling <***@nospicedham.securityaudit.val.newsbank.net> wrote:
(snip)
Post by Dick Wesseling
Which newsgroup? I am reading comp.lang.asm.x86 which is moderated,
(snip)
Post by Dick Wesseling
What happens when posting to multiple groups when one is moderated
and the others are not?
Last I knew, it wasn't posted to any until the moderator approved it.

-- glen
Frank Kotler
2010-05-18 21:35:30 UTC
Permalink
Post by Dick Wesseling
Post by James J. Weinkam
Sorry for the double posting.
When I sent the message the first time, a message from my outgoing mail server appeared saying
something to the effect that "this account is valid for email only; postings to newsgroups are ignored."
I looked on the newsgroup and sure enough the message was not there.
Which newsgroup? I am reading comp.lang.asm.x86 which is moderated,
so it is not unusual for a message to not show up immediately. I don't know
if the other two groups - comp.lang.pascal.borland and comp.lang.pl1 - are
moderated. What happens when posting to multiple groups when one is moderated
and the others are not?
All postings are delayed (I'm not sure why, exactly...) A more(?)
interesting question is, "What if more than one of the groups is
moderated?" I understand that it is considered "impolite" to approve a
message to another group, but I *think* if it's approved for one, it's
approved for all. It is not obvious whether other groups are moderated
or not. Apologies to any other moderators whose toes are being stepped on!
Post by Dick Wesseling
Post by James J. Weinkam
copy in my sent folder and tried again. I did not get the message from the isp this time but the
message did not show up on the newsgroup either. I decided to give up and go to bed and try again in
the morning. By that time both messages had appeared.
Can anyone explain what was going on? I have never seen this behavior before.
I noticed some strange things too.
But the message that appeared in clax86 contained only the first
reference. Did the moderation software remove the other two? Never
seen that before.
Everything, as received by the "submission address" is "supposed" to be
dumped to "clax.log". As shown there, your header looks like...

+OK 3319 octets
Return-path: <***@news1.news.xs4all.nl>
Envelope-to: clax86-***@inspiretomorrow.net
Delivery-date: Sun, 16 May 2010 23:58:00 -0500
Received: from moderators.individual.net ([130.133.4.7])
by srv16.hosting24.com with esmtp (Exim 4.69)
(envelope-from <***@news1.news.xs4all.nl>)
id 1ODsOu-0000DM-Az
for clax86-***@inspiretomorrow.net; Sun, 16 May 2010 23:58:00 -0500
Received: from news1.news.xs4all.nl ([194.109.133.242])
by moderators.individual.net (Exim 4.69)
for comp-lang-asm-***@moderators.isc.org with esmtp
(envelope-from <***@news1.news.xs4all.nl>)
id <1ODsOo-00024O-57>; Mon, 17 May 2010 06:57:59 +0200
Received: (from ***@localhost)
by news1.news.xs4all.nl (8.11.6/8.11.6) id o4H4vqL01125;
Mon, 17 May 2010 06:57:52 +0200 (CEST)
(envelope-from news)
To: comp-lang-asm-***@moderators.isc.org
Mime-Version: 1.0
X-Newsreader: knews 1.0c.0
References: <***@mid.individual.net>
From: ***@securityaudit.val.newsbank.net (Dick Wesseling)
Subject: Re: Optimization problem
Newsgroups: comp.lang.pascal.borland,comp.lang.asm.x86,comp.lang.pl1
Content-Type: text/plain; charset=us-ascii
Date: 17 May 2010 04:57:52 GMT
Lines: 40
Message-ID: <4bf0ccd0$0$22945$***@news6.xs4all.nl>
NNTP-Posting-Host: 2001:888:11d7:0:210:a7ff:fe0a:d33
X-Trace: 1274072272 news6.xs4all.nl 22945
[2001:888:11d7:0:210:a7ff:fe0a:d33]:40744
X-Complaints-To: ***@xs4all.nl

As you can see, only one reference showing. We do not intentionally
remove any "References:" lines. We *do* remove "Received:" lines, and a
couple others which, some people feel, reveal too much information about
the poster (sorry about "outing" you). Possible we're not getting this
right... it is homemade software...

Again, sorry for the confusion. I suspect "moderation issues" are not of
interest to the other groups, so further discussion should probably go
in another thread...

Best,
Frank
Dr J R Stockton
2010-05-19 17:47:09 UTC
Permalink
Post by Frank Kotler
All postings are delayed (I'm not sure why, exactly...) A more(?)
interesting question is, "What if more than one of the groups is
moderated?" I understand that it is considered "impolite" to approve a
message to another group, but I *think* if it's approved for one, it's
approved for all. It is not obvious whether other groups are moderated
or not. Apologies to any other moderators whose toes are being stepped on!
The initial submission should be mailed to the moderation address of the
first moderated newsgroup in the Newsgroups line. Each moderator is
expected either to reject it or to mail it to the moderation address of
the next listed moderated newsgroup, except that the moderator address
of the last listed moderated newsgroup either rejects or posts.

A moderator is expected to consider suitability only for his own group,
but might be allowed to reject to protect his group (the Newsgroups
header might include a troll-haunt such as
alt.palin.a.nutter.for.half.a.century).


Moderation team members should know how the system is meant to work,
even if their software actually handles the matter.
--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Proper <= 4-line sig. separator as above, a line exactly "-- " (RFCs 5536/7)
Do not Mail News to me. Before a reply, quote with ">" or "> " (RFCs 5536/7)
Nathan
2010-05-19 03:36:54 UTC
Permalink
Post by Dick Wesseling
I noticed some strange things too.
But the message that appeared in clax86 contained only the first
reference. Did the moderation software remove the other two? Never
seen that before.
Wow! Thanks for helping to debug our "paper mache" moderation
software. Who writes that kind of thing in ASM? We must be 'high' or
crazy or something... {shifty eyes, pretends bewilderment} ...
yeah, we're nuts. Bring on the straitjacket! :)

Nathan.
http://clax.inspiretomorrow.net/index.php
Dr J R Stockton
2010-05-18 17:13:08 UTC
Permalink
In comp.lang.pascal.borland message <***@mid.individual.net>,
Sun, 16 May 2010 18:28:26, Robert AH Prins <***@prino.org> posted:

(to clpb only)
Post by Robert AH Prins
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
I can generate a circular list of 532 items of values 0 to 34 (or 22 to
56), to which such an algorithm might be applied. A related list has
5,700,000 items of the same values. There is a constraint which means
that the gaps between consecutive instances of any value can only be of
certain sizes.
--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/&c., FAQqy topics & links;
<URL:http://www.merlyn.demon.co.uk/clpb-faq.txt> RAH Prins : c.l.p.b mFAQ;
<URL:ftp://garbo.uwasa.fi/pc/link/tsfaqp.zip> Timo Salmi's Turbo Pascal FAQ.
Dr J R Stockton
2010-05-21 21:09:59 UTC
Permalink
In comp.lang.pascal.borland message <***@invalid.uk.co.demo
n.merlyn.invalid>, Tue, 18 May 2010 18:13:08, Dr J R Stockton
Post by Dr J R Stockton
I can generate a circular list of 532 items of values 0 to 34 (or 22 to
56), to which such an algorithm might be applied. A related list has
5,700,000 items of the same values. There is a constraint which means
that the gaps between consecutive instances of any value can only be of
certain sizes.
That appears to be working, in JavaScript, at
<URL:http://www.merlyn.demon.co.uk/estrcons.htm#ESP>; there is nothing
specifically JavaScript about the code.

A variable-length window, initially of zero size, slides along the
array. When its front moves one step, the character is read. A counter
array element, indexed by the character, is incremented, and if it was
zero a count of different characters is incremented. When the window
tail moves one step, the reverse happens.

When the count of characters is insufficient, the front moves, otherwise
the rear moves. When the count ceases to be sufficient, the previous
window contained what was wanted, and both the first and last position
of that previous window were needed.

When that window is smaller than the previous best, it is reported.

Actually, to save thinking about an array of size 5700000, there is no
array; instead, a sufficiently fast function is called to determine the
character. And the characters are really numbers 22 to 56. The full
data set can take about a minute to process.

Instead of calling the function twice, one could use a circular array
(i.e., indexed by Y mod 1024) of adequate but modest size to store what
is at any time in the window.

A subsequent process checks the reported results.

That might have some bearing on the question asked.
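The variable-length window described above might be rendered in Python
roughly as follows (a sketch of the prose, not the JavaScript at the
URL; unlike the original, which reports every improvement, this sketch
returns only the single smallest window for one target count k):

```python
from collections import defaultdict

def shortest_window_with_k_distinct(seq, k):
    """Front advances while the window holds fewer than k distinct
    values; the tail advances otherwise. Returns (length, start, end)
    of the smallest window with at least k distinct values (0-based,
    end exclusive), or None if no window qualifies."""
    counts = defaultdict(int)   # per-character counter array
    distinct = 0                # count of different characters in window
    best = None
    tail = 0
    for front, ch in enumerate(seq):
        counts[ch] += 1
        if counts[ch] == 1:     # counter was zero: one more distinct char
            distinct += 1
        while distinct >= k:    # window sufficient: move the tail
            size = front + 1 - tail
            if best is None or size < best[0]:
                best = (size, tail, front + 1)
            counts[seq[tail]] -= 1
            if counts[seq[tail]] == 0:
                distinct -= 1
            tail += 1
    return best
```

On Prins's 56-character example this finds the 7-in-7 window "gihbfja"
at 0-based offset 49 (1-based position 50).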
--
(c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05.
Web <URL:http://www.merlyn.demon.co.uk/> - w. FAQish topics, links, acronyms
PAS EXE etc : <URL:http://www.merlyn.demon.co.uk/programs/> - see 00index.htm
Dates - miscdate.htm estrdate.htm js-dates.htm pas-time.htm critdate.htm etc.
io_x
2010-05-22 10:36:48 UTC
Permalink
Post by Robert AH Prins
Hi all,
For an answer, you use OT code in the asm group.
As for the problem, from the little I understand,
it seems to me to be about DNA string searching
or string editing; I'm against all this.
Bernhard Schornak
2010-05-23 17:29:56 UTC
Permalink
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
The program processes an array of N cells, with N up to about 4096. The
array is filled from the start, it may not be completely full but there
are no empty cells between full cells. A cell contains a value of
1..255, so logically every cell would be a byte, but if processing
efficiency requires each cell to be a word or even a double-word, that
would be no problem. The values are normalized, they are effectively
indexes into a table, so if the table contains just T entries, the
values would go from 1 to T.
Given the above setup, the problem is to find *all* sub-arrays that
contain 3..T distinct elements and are as short as possible, i.e. if
there are 42 sub-arrays of three elements containing three distinct
values, there is no need to find sub-arrays of four elements containing
three distinct values. Also, a shorter sub-array may *never* be a part
of a longer sub-array containing *only* distinct values.
An example, suppose the array contains only 56 elements, in this case
with, for clarity, values a..j:
1 2 3 4 5 5
....5....0....5....0....5....0....5....0....5....0....56
aaaaaabbbbbabbbcbcccccccddddddddccbbbbbbefffghggggihbfja
len aperture from to value
3 3 40 42 bef
3 3 44 46 fgh
7 7 50 56 gihbfja
8 16 41 56 efffghggggihbfja, containing efghibja
9 23 34 56 cbbbbbbefffghggggihbfja, containing cbefghija
10 25 32 56 dccbbbbbbefffghggggihbfja, containing dcbefghija
It is possible to find sub-arrays with 4, 5, or 6 distinct values, but
they are either longer than the series of 7-in-7 (12-25 contain abcd),
or are part of the 7-in-7 series and as such they should not be included!
My AD 1996 program used to slide a window that started with three
elements (plus a sentinel on either side) over the big array and spit
out the position of the first element if it found three distinct
elements in the window *and* the two sentinels also contained any of the
values in the window. The program included some fairly minor
optimizations, such as immediately sliding the window to the last of a
series of equal values, in the example above, the three position wide
window would start by covering pos 6..8, and next it would move to
11..13, etc. If at any moment the program would find that all elements
in the window *and the sentinel on the right* would be distinct, the
window size would be increased by one and the scan would continue.
Taking the above example, a window with an aperture of four elements
would start its slide at position 6..9. Once it reached position 49,
with "gihb" visible and "g" and "f" in the two sentinels, the window
would be widened to five characters as the "f" is distinct from the four
"visible" characters, with a new sentinel of "j". Given that this is
again distinct from the now five characters in the window, the process
would repeat itself, eventually resulting in the string of seven
distinct characters starting at position 50.
Once the string of seven distinct characters has been found, the window
is widened to *nine* (if there had been a string of 8-in-8 it would have
been found by the previous slide!) characters and the process restarts
at position 6..14, but in the end it fails to find a series of eight
distinct characters and so the process is repeated with windows of 10,
11, 12, 13, 14, 15 and finally 16 characters, when a string of eight
distinct characters is finally found.
The problem is, for relatively low values of N and C, the process may
not be overly efficient, but it works. However, once N and C increase,
(N is theoretically unbounded, C has an upper limit of around 210), the
process becomes horribly inefficient: using the old IBM V2.3.0 OS PL/I
compiler, which had a statement count facility, this method required the
execution of 556,379,518 PL/I statements (and a number of actual machine
instructions that is at least one order of magnitude greater) for the
current values of N(2329) and C(66).
As for some figures on required restarts, the final series of 66 is
found in a sub-array with a length of 1950 elements and with a series of
65 in a sub-array of only 1889 elements, this means that the slide had
to be restarted 61 times! Even more restarts, 312(!), are needed when
going from 61 to 62 elements.
In 1998 I posted the problem to comp.lang.pascal.borland, the post
should be on Google, but Google refuses to show more than the first
9,960 (out of 28,517) posts in that group.
Three people claimed that there was a much faster way, and two of them
backed this up with programs, Brent Beach and Paul Green, with Paul's
solution being the fastest by a fair margin. The results of his code
matched my output, I replaced my code with his code, and that seemed to
be the end of the story...
Until a few months back...
... when I started converting the original Pascal code into assembler.
While doing so I decided that it would be interesting to use this
procedure in another part of the code. However, the results were not
what I expected: result rows were missing.
If you want to do this in assembler, you might
consider using an alternative approach:

1. Allocate 4,352 byte of memory.
2. Scan through the input:

EBP = address input
ESI = address input
EDI = address array (1st byte)
ECX = input size

Store EDI, ESI and EBP on stack or another lo-
cation for reloading. The array must be zeroed
before it is used (2,048 byte).

...
add EBP,ECX
0:
dec EBP
movzx EAX,BYTE 00[ESI]
movzx EBX,BYTE 00[EBP]
cmp DWORD 00[EDI + EAX * 8],0
jne 1f
mov DWORD 00[EDI + EAX * 8],ESI
1:
cmp DWORD 04[EDI + EBX * 8],0
jne 2f
mov DWORD 04[EDI + EBX * 8],EBP
2:
inc ESI
dec ECX
jne 0b
...

Time: ~131,072 clocks or better.

This one-pass function stores the addresses of
the 1st and last occurrence of each char. The
array is organised as follows:

0000 1st_00, last_00, 1st_01, last_01
0010 1st_02, last_02, 1st_03, last_03
...
07F0 1st_FE, last_FE, 1st_FF, last_FF

Array entries with a value of zero tell us the
corresponding character was not found. If both
entries are equal, the corresponding character
was found only once.
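In plain C, that one-pass first/last scan
might look as follows (a sketch only: it
records array indices and uses -1 as the
'not found' marker, where the code above
stores addresses and uses 0):

```c
/* One pass records, for every byte value, the index of its first and
   last occurrence (-1 = value not present). The forward index i and
   the backward index j walk the input simultaneously, mirroring the
   ESI/EBP pair in the register loop above. */
static void first_last(const unsigned char *data, int n,
                       int first[256], int last[256])
{
    for (int v = 0; v < 256; v++)
        first[v] = last[v] = -1;

    for (int i = 0, j = n - 1; i < n; i++, j--) {
        if (first[data[i]] < 0)   /* forward scan: keep first hit   */
            first[data[i]] = i;
        if (last[data[j]] < 0)    /* backward scan: first hit seen  */
            last[data[j]] = j;    /* is the last occurrence         */
    }
}
```

If first[v] == last[v] and both are
non-negative, v occurs exactly once,
matching the equal-entries rule above.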

Next, sort out all non-existing characters and
create three work tables:

EDI = address array (1st byte)
EDI + 0x0800 + (n * 4) = 1st occurrence
EDI + 0x0C00 + (n * 4) = last occurrence
EDI + 0x1000 + (n * 1) = associated char

...
xor EAX,EAX
mov EBX,254
0:
mov ESI,DWORD 0x00[EDI + EBX * 8]
mov EBP,DWORD 0x04[EDI + EBX * 8]
mov ECX,DWORD 0x08[EDI + EBX * 8]
mov EDX,DWORD 0x0C[EDI + EBX * 8]
inc EBX # BL = char of the 0x08/0x0C pair
test ECX,ECX
je 1f
mov DWORD 0x0800[EDI + EAX * 4],ECX
mov DWORD 0x0C00[EDI + EAX * 4],EDX
mov BYTE 0x1000[EDI + EAX],BL
inc EAX
1:
dec EBX # BL = char of the 0x00/0x04 pair
test ESI,ESI
je 2f
mov DWORD 0x0800[EDI + EAX * 4],ESI
mov DWORD 0x0C00[EDI + EAX * 4],EBP
mov BYTE 0x1000[EDI + EAX],BL
inc EAX
2:
sub EBX,2 # next pair of characters
jns 0b
...

EAX holds the number of found chars on exit.

Time: ~5,120 clocks or better.

Tables 1 and 2 contain lists with addresses of
found characters (1st and last), table 3 holds
the associated characters in descending order.
Both addresses and the associated char are
accessed via indices to get address/char pairs
for the output. If you move any address to
another index (e.g. when sorting), you *have*
*to* move the other address and the associated
character to the same index. Otherwise, the
relationship between them will be corrupted!

Searching non-unique patterns is done by
comparing addresses in both tables. Every char
is included at least once in the area between
its 1st and last address, even if they are
equal. A character included in *any*
non-unique pattern must 'sit' at an address
equal to or above the currently processed
character. A found pattern ends at the highest
last address.

I do not know which sequences are of interest,
so I cannot provide code for that. If you look
for sequences m...n, the corresponding
substring starts at m(first) and ends at
n(last). Only a copy operation (from m(first)
up to n(last)) is required to retrieve
non-unique substrings for a print-out (the
address calculation is trivial).

To search for unique patterns - all characters
differ from each other - the entire input must
be scanned one time:

EBP = address input
ESI = input_size

...
0:
xor EDX,EDX
mov AL,BYTE 00[EBP + EDX]
inc EDX
1:
mov BL,BYTE 0x00[EBP + EDX]
mov CL,BYTE 0x01[EBP + EDX]
cmp AL,BL
je 4f
inc EDX
cmp BL,CL
je 4f
cmp AL,CL
je 4f
inc EDX
cmp EDX,ESI
jae 4f
push EDX
sub EDX,2
je 3f
2:
cmp BL,BYTE 0x00[EBP + EDX]
je 3f
cmp CL,BYTE 0x00[EBP + EDX]
je 3f
dec EDX
jne 2b
3:
mov EAX,EDX
pop EDX
test EAX,EAX
je 1b
4:
cmp EDX,3
jb 5f
... # store EBP (address 1st char)
... # store EDX as pattern length
... # (store EDX chars as found pattern)
5:
add EBP,EDX
sub ESI,EDX
jbe done
jmp 0b

done:
...

With the largest patterns (255 unique chars in
a row), about 3,000,000 clock cycles are quite
realistic. The smaller the found patterns, the
less time is spent - with any pattern below 32
chars, it probably takes less than half a
million clocks (as a rough estimate). A char
that has passed the inner loop is never
touched again. If you search for one specific
pattern, time can be reduced further (by using
a '1st' address from the table and subtracting
the skipped bytes from ESI, decreasing overall
iterations).

The given clock counts are for 4,096-byte
arrays (they grow linearly with increasing
size, except for the unique pattern search).

Clock counts are for AMD Athlon, Family 8+. In
64 bit code, execution times can be reduced to
about 60...75 percent of the given ones. Eight
additional registers are snake oil for complex
functions... ;)


Greetings from Augsburg

Bernhard Schornak
Dick Wesseling
2010-05-23 21:06:01 UTC
Permalink
Post by Bernhard Schornak
If you want to do this in assembler, you might
1. Allocate 4,352 byte of memory.
[snip]

I haven't studied your table construction and search algorithm
yet,
Post by Bernhard Schornak
Searching non-unique patterns is done via com-
paring addresses in both tables. Every char is
included at least once in the area between its
1st and last address, even if they're equal. A
character included in *any* non-unique pattern
must 'sit' at an address equal to or above the
currently processed character. A found pattern
ends at the highest last address.
I do not know which sequences are of interest,
so I cannot provide code for that. If you look
for sequences m...n, a corresponding substring
starts at m(first) and ends at n(last). Only a
copy operation (from n(start) up to m(end)) is
required to retrieve non-unique substrings for
a print-out (address calculation is trivial).
To search for unique patterns - all characters
differ from each other - the entire input must
EBP = address input
ESI = input_size
...
0:
xor EDX,EDX
mov AL,BYTE 00[EBP + EDX]
inc EDX
1:
mov BL,BYTE 0x00[EBP + EDX]
mov CL,BYTE 0x01[EBP + EDX]
cmp AL,BL
je 4f
inc EDX
cmp BL,CL
je 4f
cmp AL,CL
je 4f
inc EDX
cmp EDX,ESI
jae 4f
push EDX
sub EDX,2
je 3f
2:
cmp BL,BYTE 0x00[EBP + EDX]
je 3f
cmp CL,BYTE 0x00[EBP + EDX]
je 3f
dec EDX
jne 2b
3:
mov EAX,EDX
pop EDX
test EAX,EAX
je 1b
When label 1 is reached from 0, AL contains the first input
character. When label 1 is reached via "je 1b", AL is always
zero.
Bernhard Schornak
2010-05-24 03:01:27 UTC
Permalink
Dick Wesseling wrote:

<snip>
<snip>
Post by Dick Wesseling
Post by Bernhard Schornak
EBP = address input
ESI = input_size
...
0:
xor EDX,EDX
mov AL,BYTE 00[EBP + EDX]
inc EDX
1:
mov BL,BYTE 0x00[EBP + EDX]
mov CL,BYTE 0x01[EBP + EDX]
cmp AL,BL
je 4f
inc EDX
cmp BL,CL
je 4f
cmp AL,CL
je 4f
inc EDX
cmp EDX,ESI
jae 4f
push EDX
sub EDX,2
je 3f
mov AL,CL # AL = char at -1[EBP + EDX]
......... # for the next iteration
Post by Dick Wesseling
Post by Bernhard Schornak
2:
cmp BL,BYTE 0x00[EBP + EDX]
je 3f
cmp CL,BYTE 0x00[EBP + EDX]
je 3f
dec EDX
jne 2b
3:
pop EDX
test ECX,ECX
Post by Dick Wesseling
Post by Bernhard Schornak
je 1b
When label 1 is reached from 0 AL contains the first input
character, When label 1 is reached via "je 1b" AL is always
zero.
Sorry! (I optimised the most important instruction out
of my code.) AL must contain the last character in the
already processed pattern, of course.

The table stuff is not as complicated as it
seems. All substrings can be determined with
the addresses stored in it (no extra scans
required). It uses a fast method, scanning
through an input array in both directions
simultaneously. Getting the first and last
position of each char in a single pass is
(probably) faster than a repeated scan
through the entire array - the addresses are
sufficient to calculate all substrings rather
than to search for them.


Greetings from Augsburg

Bernhard Schornak
Branimir Maksimovic
2010-05-24 20:12:29 UTC
Permalink
On Sun, 16 May 2010 18:28:26 +0000
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
That depends on whether you are using a hyperthreaded CPU (the cheaper
variant) or a real multicore.

Greets!
--
http://maxa.homedns.org/

Sometimes online sometimes not

Everyone is "allowed" to be an idiot and a
creep, but only some choose to be,
Robert AH Prins
2010-05-27 15:09:59 UTC
Permalink
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
Follow-up to the very first post, with compilable source extracted from
the current Pascal program:

{
This was Paul Green's literal comment to the code:

The method I used was as follows. As I read the data file, I keep a
record of the last occurrence of each string, in a list sorted by the
line numbers of the occurrences. For each line, I read the string and
find the last occurrence. Then I check each possible series ending on
the penultimate line. Then I update the list and continue. Now let's see
if I can explain this.

The reason this works is because the shortest series, as you describe
it, would both start and end with a string that occurs only once in
the series. If you have a sequence like C A B C D E for example, C is
in it twice; you can shorten the sequence by dropping the C. So that
means you only have to check the length of series that begin with the
last previous occurrence of some string. And we have a list of those
lines. In order to check that the string previous to that is in the
series, we check the next last occurrence to make sure it is at least
two lines before the first string in the series. In order to make sure
the current line is in the series, you skip series that start after
the last occurrence of the current line, because that string must be
included in the series. Finally, I put in one extra test. You don't
have to check a series that extends past the penultimate occurrence of
the penultimate line, since that means the last element of the series is
in the series twice. Well that's clear as mud. Does it sound like the
right method though?

My remarks:

Paul's code was fitted into the existing program, which used a linked
list due to the unknown number of records being read in. To simulate
the reading in, an array is allocated and that is later used in the
scanner. Going backwards through the array turned out to be necessary
to get the output in the correct order. Also, his original routine
expected four-character strings; for speed these are cast into
longints. The real program uses longints, and only at the stage where
the data is printed does it use them as an index into the array with
the 4-character strings.

Below are the two sets of data that fail to return the 3-in-3 strings.
I've been tearing my hair out trying to find out why, but without any
result; it's probably a case of Edgar Allan Poe's "The Purloined
Letter", with a solution so obvious that one misses it by staring at
the problem too long. (Wishful thinking?)

The two input files that fail are:

A 61 1
A 61 2
B 61 3
C 61 4
D 61 5
E 61 6
F 61 7
E 61 8
G 61 9
G 61 10
G 61 11
G 61 12
G 61 13
G 61 14
G 61 15
G 61 16
G 61 17
G 61 18
H 61 19
I 61 20
C 61 21
D 61 22
A 61 23
A 61 24
A 61 25
A 61 26
A 61 27

Result:

| 6 | 6 | 2 | A B C D E F |
| | | 18 | G H I C D A |
| 7 | 8 | 2 | A B C D E F G |
| 8 | 17 | 7 | F E G H I C D A |
| 9 | 19 | 2 | A B C D E F G H I |

Missing:

| 3 | 3 | 7 | F E G |

and

A 69 1
A 69 2
A 69 3
A 69 4
A 69 5
B 69 6
B 69 7
C 69 8
C 69 9
D 69 10
E 69 11
F 69 12
G 69 13
H 69 14
H 69 15
H 69 16
H 69 17
I 69 18
J 69 19
J 69 20
J 69 21

Result:

| 6 | 6 | 9 | C D E F G H |
| 7 | 8 | 7 | B C D E F G H |
| 8 | 10 | 5 | A B C D E F G H |
| 9 | 13 | 7 | B C D E F G H I J |
| 10 | 15 | 5 | A B C D E F G H I J |

Missing:

| 3 | 3 | 17 | H I J |

}
type
tycona = array [1..4] of char;
scanptr = ^scan_list;
runptr = ^run_list;

scan_list = record {s 16}
scan_nxt : scanptr; { 4}
id : longint; { 4}
sca : tycona; { 4}
fil : array [1..4] of char; { 4}
end;


spot_rec = record {s 16}
sca : tycona; { 4}
atl : longint; { 4}
id : longint; { 4}
fil : array [1..4] of char; { 4}
end;

run_list = record {s1044}
run_nxt: runptr; { 4}
run_prv: runptr; { 4}
rlen : longint; { 4}
scas : longint; { 4}
id : longint; { 4}
run : array [1..256] of tycona; {1024}
end;

run_data = record {s 16}
run_top: runptr; { 4}
run_end: runptr; { 4}
run_cnt: longint; { 4}
fil : array [1..4] of char; { 4}
end;

_sc_ptr = ^_sc_tab;
_sc_tab = array [1..65536 div sizeof(scanptr)] of scanptr;

_spot_ptr = ^_spot_arr;
_spot_arr = array [0..65536 div sizeof(spot_rec) - 1] of spot_rec;

_run_ptr = ^_run_arr;
_run_arr = array [0..65536 div sizeof(run_data) - 1] of run_data;

const sca_skel : string[24] = ' | | | | ';
const run_ptr : runptr = nil;
const sc_tab : _sc_ptr = nil;
const spot_arr : _spot_ptr = nil;
const run_arr : _run_ptr = nil;
const _atline : longint = 0;
const _lnls : longint = 0;
const _lnls0 : longint = 0;
const _lsat : longint = 0;
const _i : longint = 0;
const _nct : longint = 0;
const _nreq : longint = 0;
const _scount : longint = 0;
const
scan_ptr: scanptr = nil;
scan_top: scanptr = nil;
scan_end: scanptr = nil;
const
_line : string[255] = '';
_linex : array[1..4] of char = #0#0#0#0;

var w_tycona : tycona;
var print : string;
var scanin : text;

procedure add_integer_2_line4(_l: longint);
begin
str(_l:4, print);
_line:= _line + print + ' | ';
end; {add_integer_2_line4}

procedure dispose_run(_i: longint);
var nxt_run: runptr;

begin
run_ptr:= run_arr^[_i].run_top;

while run_ptr <> nil do
begin
nxt_run:= run_ptr^.run_nxt;
dispose(run_ptr);
run_ptr:= nxt_run;
end;

run_arr^[_i].run_top:= nil;
run_arr^[_i].run_cnt:= 0;
end; {dispose_run}

procedure setup_run(_n: longint);
var _in: longint;

begin
new(run_ptr);
fillchar(run_ptr^.run_nxt, sizeof(run_ptr^), #0);
fillchar(run_ptr^.run , sizeof(run_ptr^.run), ' ');

if run_arr^[_n].run_top = nil then
run_arr^[_n].run_top:= run_ptr
else
begin
run_arr^[_n].run_end^.run_nxt:= run_ptr;
run_ptr^.run_prv := run_arr^[_n].run_end;
end;

run_arr^[_n].run_end:= run_ptr;

inc(run_arr^[_n].run_cnt);

run_ptr^.rlen := _atline - spot_arr^[_n].atl;
run_ptr^.scas := _n;
run_ptr^.id := spot_arr^[1].id;

for _in:= 1 to _n do
run_ptr^.run[_in]:= spot_arr^[_in].sca;
end; {setup_run}

procedure check_runs;
var _j: longint;
var _k: longint;
var _l: longint;

begin
_k:= 3;
if _k < _lsat then
_k:= _lsat;

for _j:= _k to _nct do
begin
_l:= spot_arr^[_j].atl;

if _lnls0 <= _l then
if (_j = _nct) or
(_l > spot_arr^[_j + 1].atl + 1) then
begin
if (run_arr^[_j].run_top = nil) or
(run_arr^[_j].run_top^.rlen > _atline - _l) then
begin
dispose_run(_j);
setup_run(_j);
end
else
if (run_arr^[_j].run_top <> nil) and
(run_arr^[_j].run_top^.rlen = _atline - _l) then
setup_run(_j);
end;
end;
end; {check_run}

{
* Scanner:
*
* Part of the code in this procedure was originally written by Paul
* Green as a result of a posting in comp.lang.pascal.borland, asking
* for a faster method of performing scan.
}
procedure scanner;
var _i : longint;
var _k : longint;
var _id : longint;

var scn_in : tycona;

begin;
_i:= (_scount + 2) * sizeof(run_data);
getmem(run_arr, _i);
fillchar(run_arr^, _i, #0);

_i:= (_scount + 2) * sizeof(spot_rec);
getmem(spot_arr, _i);
fillchar(spot_arr^, _i, #0);

writeln('Scanning');

for _i:= _scount downto 1 do
begin
move(sc_tab^[_i]^.sca[1], w_tycona, sizeof(w_tycona));
_id:= sc_tab^[_i]^.id;

inc(_atline);

_lsat:= 1;
while (_lsat <= _nct) and
(longint(w_tycona) <> longint(spot_arr^[_lsat].sca)) do
inc(_lsat);

if _lsat > _nct then
_lsat:= 0;

_lnls:= 0;
if _lsat <> 0 then
begin
_lnls:= spot_arr^[_lsat].atl;

if _nct >= 3 then
check_runs;
end;

_lnls0:= _lnls;

if _lsat = 0 then
begin
inc(_nct);
_lsat:= _nct;
end;

if _lsat > 1 then
move(spot_arr^[1], spot_arr^[2], sizeof(spot_rec) * pred(_lsat));

spot_arr^[1].sca := w_tycona;
spot_arr^[1].id := _id;
spot_arr^[1].atl := _atline;
end;

inc(_atline);
_lsat:= 1;
_lnls:= _atline + 1;
check_runs;

for _lnls:= 3 to _nct do
begin
run_ptr:= run_arr^[_lnls].run_end;

if (run_ptr <> nil) and
((_lnls = _nct) or
(_lnls < _nct) and
(run_arr^[succ(_lnls)].run_end <> nil) and
(run_ptr^.rlen < run_arr^[succ(_lnls)].run_end^.rlen)) then
begin
_line:= sca_skel;
_line[0]:= #3;

add_integer_2_line4(_lnls);
add_integer_2_line4(run_ptr^.rlen);
add_integer_2_line4(run_ptr^.id);
_line[0]:= #24;

while run_ptr <> nil do
begin
_k:= 1;

repeat
move(run_ptr^.run[_k], _line[25], 40);
inc(_line[0], 39);
_line:= _line + ' | ';
writeln(_line);

_line:= sca_skel;
inc(_k, 10);
until run_ptr^.run[_k][1] = ' ';

run_ptr:= run_ptr^.run_prv;

_line[0]:= #17;
if run_ptr <> nil then
add_integer_2_line4(run_ptr^.id);
_line[0]:= #24;
end;
end;

dispose_run(_lnls);
end;
end; {scanner}

begin
assign(scanin, 'scanin.txt');
reset (scanin);

_scount:= 0;
repeat
_line:= ' ';
readln(scanin, _line);
new(scan_ptr);

if scan_top = nil then
scan_top:= scan_ptr
else
scan_end^.scan_nxt:= scan_ptr;

scan_end:= scan_ptr;

inc(_scount);
scan_ptr^.scan_nxt:= nil;
scan_ptr^.id:= _scount;
move(_line[1], scan_ptr^.sca, 4);
until eof(scanin);

getmem(sc_tab, _scount * sizeof(scan_ptr));

scan_ptr:= scan_top;
_i := 0;

while (scan_ptr <> nil) do
begin
inc(_i);
sc_tab^[_i]:= scan_ptr;
scan_ptr := scan_ptr^.scan_nxt;
end;

scanner;
end.
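
As a cross-check on the two failing inputs above (flattened to one
character per element), the shortest window containing at least k
distinct values can be computed with a standard two-pointer scan. This
is a reference sketch only - it is not the program's algorithm, and the
helper name is made up:

```c
/* Shortest window of data[0..n) containing at least k distinct values;
   returns its length, or -1 when the whole input has fewer than k
   distinct values. A plain O(n) two-pointer scan, used as a check. */
static int shortest_k_distinct(const unsigned char *data, int n, int k)
{
    int cnt[256] = {0};
    int distinct = 0, lo = 0, best = -1;

    for (int hi = 0; hi < n; hi++) {
        if (cnt[data[hi]]++ == 0)         /* new value enters window */
            distinct++;
        while (distinct >= k) {           /* shrink while still valid */
            if (best < 0 || hi - lo + 1 < best)
                best = hi - lo + 1;
            if (--cnt[data[lo++]] == 0)   /* value leaves the window */
                distinct--;
        }
    }
    return best;
}
```

On the first input this returns 3 for k = 3 (a 3-in-3 series exists, as
the missing F E G row says) and 19 for k = 9, matching the printed
| 9 | 19 | row; on the second it returns 3 for k = 3 and 15 for k = 10.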

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-27 16:48:21 UTC
Permalink
Post by Robert AH Prins
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
Follow-up to the very first post, with compilable source extracted from
{
The method I used was as follows. As I read the data file, I keep a
record of the last occurrence of each string, in a list sorted by the
line numbers of the occurrences. For each line, I read the string and
find the last occurrence. Then I check each possible series ending on
the penultimate line. Then I update the list and continue. Now let's see
if I can explain this.
Robert, there are two critical issues here:

a) Do you have to run this Pascal version of the program? I.e. is there
anything that stops you from simply switching to one of the two fast &
working versions we have written as a result of your OP?

b) How fast is this Pascal algorithm?

If it is significantly faster, then it might be worthwhile to figure out
where it fails and see if a fix is possible that doesn't make it too slow!

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert AH Prins
2010-05-28 00:45:02 UTC
Permalink
Post by Terje Mathisen
Post by Robert AH Prins
Post by Robert AH Prins
Hi all,
Can anyone give me some hints as to solve the following problem,
preferably in a way that is faster than the way I used to do it, and
without the bug in the current version;
Follow-up to the very first post, with compilable source extracted from
{
The method I used was as follows. As I read the data file, I keep a
record of the last occurrence of each string, in a list sorted by the
line numbers of the occurrences. For each line, I read the string and
find the last occurrence. Then I check each possible series ending on
the penultimate line. Then I update the list and continue. Now let's see
if I can explain this.
a) Do you have to run this Pascal version of the program? I.e. is there
anything that stops you from simply switching to one of the two fast &
working versions we have written as a result of your OP?
No, I can (and quite likely will) change the code any way I like, but
the code will remain Pascal (and PL/I), as this routine is part of a
much larger program - this routine produces just one of the many tables
- and the post-processing logic, a few more Pascal programs on the PC
and a bunch of ISPF edit macros on z/OS, relies on the current order of
the tables.

The "problem" is that I have to translate the two fast solutions to
Pascal and as far as yours is concerned... I've got some fairly basic
understanding of C, but even your source is already using language
constructions above my rather modest knowledge of the language.

BTW, I see (in your 2010-05-24 06:09 posting) that you use a qsort. The
PG solution does not need any sorts... Or maybe your as yet unpublished
fastest version has also dispensed with it?
Post by Terje Mathisen
b) How fast is this Pascal algorithm?
If it is significantly faster, then it might be worthwhile to figure out
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.

Sadly it cannot handle two cases and although I'm currently out of work,
there is more to life than sitting behind a screen for up to 10 hours a
day, certainly now that Mrs Prins has also returned from Vilnius...

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Terje Mathisen
2010-05-28 08:16:53 UTC
Permalink
Post by Robert AH Prins
Post by Terje Mathisen
a) Do you have to run this Pascal version of the program? I.e. is there
anything that stops you from simply switching to one of the two fast &
working versions we have written as a result of your OP?
No, I can (and quite likely will) change the code any way I like, but
the code will remain Pascal (and PL/I), as this routine is part of a
much larger program - this routine produces just one of the many tables
and the post-processing logic, a few more Pascal programs on the PC and
a bunch of ISPF edit macro's on z/OS rely on the current order of the
tables.
OK. I assume you for some reason cannot link in functions written in C
either?
Post by Robert AH Prins
The "problem" is that I have to translate the two fast solutions to
Pascal and as far as yours is concerned... I've got some fairly basic
understanding of C, but even your source is already using language
constructions above my rather modest knowledge of the language.
That's easy: There's no "funky" C at all in my code, it is really an asm
algorithm written in C shorthand.
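
As a rough illustration, the backward-scan idea reads like this in
plain C - a sketch of the idea only, not the actual code: the seeding
of count[] to 1 and the solution buffer are assumptions here, since
their setup is outside the asm itself:

```c
#define MAXN 4096

struct sol { int len, uniq, start; };

/* For each end position i, walk backwards until the previous
   occurrence of data[i]; every start j passed gains one distinct
   value, so count[j] always equals the number of distinct values in
   data[j..i]. Windows with >= 3 distinct values that match or beat
   the shortest length seen for that distinct-count are kept as
   potential solutions for the final filtering pass. */
static int scan(const unsigned char *data, int n, struct sol *sol)
{
    int count[MAXN], shortest[257];
    int nsol = 0;

    for (int v = 0; v < 257; v++) shortest[v] = n + 1;
    for (int j = 0; j < n; j++)   count[j] = 1;   /* data[j] itself */

    for (int i = 1; i < n; i++) {
        for (int j = i - 1; j >= 0; j--) {
            if (data[j] == data[i])   /* data[i] not new below here */
                break;
            int uniq = ++count[j];    /* window data[j..i] grew by 1 */
            int len  = i - j + 1;
            if (uniq >= 3 && len <= shortest[uniq]) {
                shortest[uniq]  = len;
                sol[nsol].len   = len;
                sol[nsol].uniq  = uniq;
                sol[nsol].start = j;
                nsol++;
            }
        }
    }
    return nsol;     /* number of potential solutions collected */
}
```

The inner loop stops at the previous occurrence of data[i], because
below that point data[i] adds nothing new to any window.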
Post by Robert AH Prins
BTW, I see (in your 2010-05-24 06:09 posting) that you use a qsort. The
PG solution does not need any sorts... Or maybe your as yet unpublished
fastest version has also dispensed with it?
The qsort is actually slower than a tiny custom (sort) routine to get
the potential solutions into the order I need for the final filtering.

I have written many such routines in my Pascal days. :-)
Post by Robert AH Prins
Post by Terje Mathisen
b) How fast is this Pascal algorithm?
If it is significantly faster, then it might be worthwhile to figure out
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
OK, so it is 4-10 times slower than my best (working) code, and it
doesn't actually work.
Post by Robert AH Prins
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
So maybe an inline asm of my C code would be the easiest? Then you would
just have to change the function definition. :-?

int scan(byte *data, int data_len)
{
/* esi -> data
ecx = data_len
*/
__asm {
mov esi,[data]
mov ecx,[data_len]
push ebp
lea ebp,[solution] // solutions = 0

add ecx,esi // Points one beyond the end of data array
inc esi // Skipping the first byte

cmp esi,ecx // If we had less than 3 chars, then there
jae outer_done // cannot be any solutions!

outer_loop: // Scanning from i=1 to data_len
mov bl,[esi] // Current byte
lea edi,[esi-1] // Start at prev byte
inc esi // Get ready for next iteration

inner_loop: // Scanning backwards from i-1 to 0
cmp bl,[edi] // Same as byte we started with?
je found_equal // Yes! (Not a new unique char, so abort)

// Increment count of unique values for this starting position:
mov edx,count[edi*4]
mov eax,esi
inc edx
sub eax,edi // String length
mov count[edi*4],edx
cmp edx,3 // At least 3 unique values?
jb not_shortest_string // fewer than 3: just keep scanning

cmp eax,shortest[edx*4]
ja not_shortest_string

// Save as potential solution:
mov shortest[edx*4],eax
mov [ebp.len],eax
mov [ebp.uniq],edx
mov [ebp.start],edi
add ebp,16 // sizeof(solution[0])

not_shortest_string:
sub edi,1
jnc inner_loop

found_equal:
cmp esi,ecx
jb outer_loop
outer_done:
lea eax,[ebp - offset solution]
pop ebp
shr eax,4 // EAX has potential solution count
}
}

Better?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Marco van de Voort
2010-05-28 09:03:53 UTC
Permalink
Post by Terje Mathisen
So maybe an inline asm of my C code would be the easiest? Then you would
just have to change the function definition. :-?
(and assumptions about calling conventions and other ABI stuff).

But it is more readable than PL/I, and that's what counts :-)

/me ducks for cover.
Robert AH Prins
2010-05-28 19:27:45 UTC
Permalink
Post by Terje Mathisen
Post by Robert AH Prins
Post by Terje Mathisen
a) Do you have to run this Pascal version of the program? I.e. is there
anything that stops you from simply switching to one of the two fast &
working versions we have written as a result of your OP?
No, I can (and quite likely will) change the code any way I like, but
the code will remain Pascal (and PL/I), as this routine is part of a
much larger program - this routine produces just one of the many tables
and the post-processing logic, a few more Pascal programs on the PC and
a bunch of ISPF edit macro's on z/OS rely on the current order of the
tables.
OK. I assume you for some reason cannot link in functions written in C
either?
Post by Robert AH Prins
The "problem" is that I have to translate the two fast solutions to
Pascal and as far as yours is concerned... I've got some fairly basic
understanding of C, but even your source is already using language
constructions above my rather modest knowledge of the language.
That's easy: There's no "funky" C at all in my code, it is really an asm
algorithm written in C shorthand.
Post by Robert AH Prins
BTW, I see (in your 2010-05-24 06:09 posting) that you use a qsort. The
PG solution does not need any sorts... Or maybe your as yet unpublished
fastest version has also dispensed with it?
The qsort is actually slower than a tiny custom (sort) routine to get
the potential solutions into the order I need for the final filtering.
I have written many such routines in my Pascal days. :-)
Post by Robert AH Prins
Post by Terje Mathisen
b) How fast is this Pascal algorithm?
If it is significantly faster, then it might be worthwhile to figure out
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
OK, so it is 4-10 times slower than my best (working) code, and it
doesn't actually work.
Post by Robert AH Prins
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
So maybe an inline asm of my C code would be the easiest? Then you would
just have to change the function definition. :-?
[snip]
Better?
No! ;) (But thank you anyway!)

I should have done it myself.

Anyway, I need to get back to the C code, as I also need a PL/I version!
But before doing anything, I will spend one more day trying to find the
bug in the existing code.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Marco van de Voort
2010-05-28 09:05:23 UTC
Permalink
Post by Robert AH Prins
Post by Terje Mathisen
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
Any chance to repeat the timing with Delphi or FPC? It would make a great
table.
Robert AH Prins
2010-05-28 19:31:54 UTC
Permalink
Post by Marco van de Voort
Post by Robert AH Prins
Post by Terje Mathisen
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
Any chance to repeat the timing with Delphi or FPC? It would make a great
table.
I only have an uninstalled version of Turbo Delphi somewhere and after
having found a full blown PL/I compiler for Windows, equivalent to
V3.9.0 on z/OS, I'm not sure if I want to install FPC, when VP does, at
least at the moment, everything I need.

Robert
--
Robert AH Prins
spamtrap(a)prino(d)org
Marco van de Voort
2010-05-29 15:17:06 UTC
Permalink
["Followup-To:" header set to comp.lang.pascal.borland.]
Post by Robert AH Prins
Post by Marco van de Voort
Post by Robert AH Prins
Post by Terje Mathisen
where it fails and see if a fix is possible that doesn't make it too slow!
Just timed it via RDTSC. Actual scan time is around 2.7M cycles, and the
somewhat fancy printout adds around another 4M cycles. Source is
compiled with Virtual Pascal 2.1 build 279, and VP isn't really, to
express it politely, a very optimizing compiler, so I guess it's pretty
good - using the assembler output option of VP it would also be pretty
easy to convert it into in-line assembler, giving considerable scope for
optimizations.
Any chance to repeat the timing with Delphi or FPC? It would make a great
table.
I only have an uninstalled version of Turbo Delphi somewhere and after
having found a full blown PL/I compiler for Windows, equivalent to
V3.9.0 on z/OS, I'm not sure if I want to install FPC, when VP does, at
least at the moment, everything I need.
FPC is relatively self-contained (the worst that can happen is that the
installer adds the directory to the path, which can be undone easily)

The question was not as much meant to convince you to swap compilers, just
that it would be interesting to see how the compilers fare relative to
each other (and maybe pit it against a GCC implementation of the same
algorithm).