Van Eck, US States and Raku

by Arne Sommer

Published 29 June 2019.

Perl 6 → Raku

This article has been moved from «perl6.eu» and updated to reflect the language rename in 2019.

This is my response to the Perl Weekly Challenge #14.

Challenge #14.1

Write a script to generate Van Eck’s sequence that starts with 0. For more information, please check out the Wikipedia page. This challenge was proposed by team member Andrezgz.


The first value in the Van Eck sequence is zero. Each subsequent value is calculated like this:

  1. Look for the previous occurrence of the last value in the list (searching from the end)
  2. If we don't find it, the next value is zero
  3. If we do find it, the next value is the distance from the end
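The three steps above can be sketched as a compact loop. This is just an illustration, and «van-eck-first» is a name of my own choosing:

```raku
sub van-eck-first (Int $n)
{
  my @seq = 0;                    # The sequence starts with zero.

  while @seq.elems < $n
  {
    # Look for the previous occurrence of the last value, from the end.
    my $idx = @seq[0 ..^ @seq.end].first(@seq[*-1], :k, :end);

    @seq.push: $idx.defined ?? @seq.end - $idx !! 0;  # The distance, or zero.
  }

  return @seq;
}

say van-eck-first(10).join(", ");  # 0, 0, 1, 0, 2, 0, 2, 2, 1, 6
```

The «:k» adverb makes «first» return the index of the match, and «:end» makes it search from the end of the list.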

A walk-through of the first ten values may make it easier to understand:

Sequence                        Search for   Found   The next value
0                               0            No      0
0, 0                            0            Yes     1
0, 0, 1                         1            No      0
0, 0, 1, 0                      0            Yes     2
0, 0, 1, 0, 2                   2            No      0
0, 0, 1, 0, 2, 0                0            Yes     2
0, 0, 1, 0, 2, 0, 2             2            Yes     2
0, 0, 1, 0, 2, 0, 2, 2          2            Yes     1
0, 0, 1, 0, 2, 0, 2, 2, 1       1            Yes     6
0, 0, 1, 0, 2, 0, 2, 2, 1, 6

I have chosen to delegate the actual calculation to a helper procedure «van-help»:

File: van-eck-list
unit sub MAIN (Int $limit = 10);     # [1]

my @van-eck = (0);                   # [2]
  
@van-eck.push(van-help(@van-eck)) for ^($limit -1);  # [3]

say @van-eck.join(", ");             # [4]

sub van-help(@array is copy)         # [5]
{
  my $last = @array.pop;             # [6]

  if any(@array) == $last            # [7]
  {
    for @array.reverse               # [8]
    {
      state $delta++;                # [9]
      return $delta if $_ == $last;  # [10]
    }
  }
  return 0;                          # [11]
}

[1] The sequence is infinite, so we have to stop at an explicit location. This will give us the first 10 numbers by default.

[2] The first number in the sequence is zero.

[3] Add the next number, one at a time, nine times by default. Combined with the initial one (from [2]), we get ten.

[4] Print them.

[5] This helper function takes a partial Van Eck sequence as input, and returns the next number. The input is set up with «is copy» so that we can change it (in [6]).

[6] Remove the last element from the sequence.

[7] If any of the remaining elements are equal to the one we removed,

[8] Loop through the values, from the end.

[9] • Keep a count of the number of iterations in the loop.

[10] • Return the distance between the two values when we find the first match.

[11] If the test in [7] fails, we return 0 (meaning that it is the first time the number occurs in the sequence).

A variable declared with state gives us a state variable. The assignment in the declaration (the initialization) is only done once, so we get a counter. See docs.raku.org/syntax/state for more information about «state».
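A tiny, stand-alone illustration of «state», with a made-up «next-id» sub:

```raku
sub next-id
{
  state $id = 0;   # Initialized once, on the first call only.
  return ++$id;    # The value survives between calls.
}

say next-id;  # 1
say next-id;  # 2
say next-id;  # 3
```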

Running it:

$ raku van-eck-list 
0, 0, 1, 0, 2, 0, 2, 2, 1, 6

$ raku van-eck-list 20
0, 0, 1, 0, 2, 0, 2, 2, 1, 6, 0, 5, 0, 2, 6, 5, 4, 0, 5, 3

And that's it, really: the sequence as a list, just as shown in the Wikipedia article the challenge refers to.

But this list form is extremely unhelpful if we dig a little deeper and discover https://oeis.org/A181391/b181391.txt, a file with the first 100,000 elements of the sequence, and we want to check that we have gotten e.g. the 100th value right.

Adding the indices is straightforward(ish):

File: van-eck-verbose (changes only)
unit sub MAIN (Int $limit = 10, :$verbose = False);

my @van-eck = (0);
  
@van-eck.push(van-help(@van-eck)) for ^($limit -1);

if $verbose
{
  my $count = 1;
  say "{ $count++ }: $_" for @van-eck;
}
else
{
  say @van-eck.join(", ");
}

The program behaves as before, unless we use the «--verbose» flag:

$ van-eck-verbose --verbose 20
1: 0
2: 0
3: 1
4: 0
5: 2
6: 0
7: 2
8: 2
9: 1
10: 6
11: 0
12: 5
13: 0
14: 2
15: 6
16: 5
17: 4
18: 0
19: 5
20: 3

Testing, testing, testing

This makes it easier to check that I got it right, but it does require manual work. That is tedious. So let us make the program do it for us.

We should let the program download the sequence file, and compare the values with the ones from our sequence generator one by one.

But before doing that, let us see how the program handles long sequences. I'm not optimistic (and will show why in the next section).

$ van-eck-verbose --verbose 10000
$ van-eck-verbose --verbose 100000

The first one takes 9 minutes, and the second one takes 13 hours (and some minutes). I have shown the commands only, not the output.

There are several reasons:

  • The helper function «van-help» makes a copy of the array each time a new value is calculated. The longer the array, the longer this takes
  • I use «any» to look for a value in the array, which is essentially the same as checking every single value when the value we are looking for isn't there. (And we have no way of knowing how efficient «any» is. It probably doesn't search from the end, but that is just a guess.)

The first problem is solved by inlining the helper function. The second is fixed by storing the last position of each value in the sequence in a hash (keyed on the number, with the position in the array as the value). Then it is a simple hash lookup instead of a search, and hashes are optimized for fast lookup.
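The bookkeeping idea in miniature, on a made-up five element sequence:

```raku
my @seq = 0, 0, 1, 0, 2;
my %seen;

%seen{@seq[$_]} = $_ for ^@seq.elems;  # The last position wins.

say %seen{0};  # 3 (the last zero is at index 3)
say %seen{2};  # 4
```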

File: van-eck
unit sub MAIN (Int $limit where 2 <= $limit <= 100000 = 10, # [1]
   	       :$verbose = False, :$test = False);          # [1]

my @van-eck = (0);                                          # [2]

my %seen;                                                   # [3]

for ^($limit -1) -> $pos                                    # [4]
{
  %seen{@van-eck[*-1]}.defined                              # [5]
   ?? @van-eck.push: $pos - %seen{@van-eck[*-1]}            # [5a]
   !! @van-eck.push: 0;                                     # [5b]

  %seen{@van-eck[*-2]} = $pos;                              # [6]
}

if $test                                                    # [7]
{
  use LWP::Simple;                                          # [8]
 
  my $ok    = 0;
  my $error = 0;
  my $index = 1;

  for LWP::Simple.get('https://oeis.org/A181391/b181391.txt').lines -> $line
  {                                                        # [8]
    last if $index > $limit;                               # [9]
    
    my $my  = @van-eck[$index -1];                         # [10]
    my $def = $line.words[1];                              # [10a]

    if $my == $def                                         # [10b]
    {
      say "$index: $my == $def (OK)";
      $ok++;
    }
    else
    {
      say "$index: $my != $def (Error)";
      $error++;
    }
    
    $index++;
  }
      
  say "\nOK: $ok";                                         # [11]
  say "Error: $error";                                     # [11]
}

elsif $verbose                                             # [7]
{
  my $count = 1;
  say "{ $count++ }: $_" for @van-eck;
}

else                                                       # [7]
{
  say @van-eck.join(", ");
}

[1] The sequence file has 100,000 entries, so I have set that up as the limit. Invoke the test with the new «--test» argument.

[2] The first value in the sequence.

[3] This hash keeps the last position of every value we have in the sequence as we generate values one at a time.

[4] If the limit is 10, we get the values 0 .. 8 in this loop.

[5] If we have seen the value in question,

[5a] • use the distance to the current position as the new value,

[5b] • and if not, use 0.

[6] Update the position of the last value. It must not be used in the distance calculation in step [5a], so we do it here, after the fact.

[7] I have used «if/elsif/else» instead of multiple dispatch this time.

[8] I use the «LWP::Simple» module to fetch the sequence. (On my Ubuntu/Linux machine this failed because of a missing system library; installing the «libssl-dev» package resolved the problem.) Iterate through the values, one at a time.

[9] Stop when we have reached the specified limit.

[10] Get the value from my sequence, and compare it with the one we downloaded [10a, 10b]. Count the correct values and errors, and report them as we go along.

[11] Print the result.
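Each line in the OEIS «b-file» is simply «index value», so «.words[1]» picks out the value:

```raku
my $line = "10 6";     # The shape of a line in b181391.txt.

say $line.words;       # (10 6)
say $line.words[1];    # 6
```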

Running it:

$ raku van-eck --test
1: 0 == 0 (OK)
2: 0 == 0 (OK)
3: 1 == 1 (OK)
4: 0 == 0 (OK)
5: 2 == 2 (OK)
6: 0 == 0 (OK)
7: 2 == 2 (OK)
8: 2 == 2 (OK)
9: 1 == 1 (OK)
10: 6 == 6 (OK)

OK: 10
Error: 0

Running it, and testing the full set of 100000 values (without showing the values):

$ raku van-eck --test 100000
...
OK: 100000
Error: 0

So I think we can conclude that my algorithm generates the correct values. (Both of them, actually, but disregard the slow one.) The program took about 5 seconds to run on my machine.

Running the minimalistic program is of course faster (as it doesn't download the file from the Internet, and has less output). This takes about 1.5 seconds:

$ raku van-eck 100000

Lesson Learned: A poorly designed algorithm can cause incredibly inefficient code.

Challenge #14.2

Using only the official postal (2-letter) abbreviations for the 50 U.S. states, write a script to find the longest English word you can spell. Here is the list of U.S. state abbreviations, as per the Wikipedia page. This challenge was proposed by team member Neil Bowers.

For example,
Pennsylvania + Connecticut = PACT
Wisconsin + North Dakota = WIND
Maine + Alabama = MEAL
California + Louisiana + Massachusetts + Rhode Island = Calamari

The first task is obtaining the states, as a data structure (a hash). This page https://www.perlmonks.org/?node_id=60745 gives it in Perl.

I have modified it for Raku:

File: us-states
my %states = 
       "AK  Alaska           LA  Louisiana        OH  Ohio
        AL  Alabama          MA  Massachusetts    OK  Oklahoma
        AR  Arkansas         MD  Maryland         OR  Oregon
        AZ  Arizona          ME  Maine            PA  Pennsylvania
        CA  California       MI  Michigan         RI  Rhode Island
        CO  Colorado         MN  Minnesota        SC  South Carolina
        CT  Connecticut      MO  Missouri         SD  South Dakota
        DE  Delaware         MS  Mississippi      TN  Tennessee
        FL  Florida          MT  Montana          TX  Texas
        GA  Georgia          NC  North Carolina   UT  Utah
        HI  Hawaii           ND  North Dakota     VA  Virginia
        IA  Iowa             NE  Nebraska         VT  Vermont
        ID  Idaho            NH  New Hampshire    WA  Washington
        IL  Illinois         NJ  New Jersey       WI  Wisconsin
        IN  Indiana          NM  New Mexico       WV  West Virginia
        KS  Kansas           NV  Nevada           WY  Wyoming
        KY  Kentucky         NY  New York".split(/\s\s+/).hash;

say %states.keys.sort;

We split on two or more spaces, so that the single space inside a two-word state name is ignored by «split».
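The effect of splitting on two or more whitespace characters, shown on a short excerpt:

```raku
say "RI  Rhode Island     SC  South Carolina".split(/\s\s+/);
# (RI Rhode Island SC South Carolina)
```

The single space inside «Rhode Island» and «South Carolina» survives, while the runs of two or more spaces act as separators.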

Running it (with a newline inserted to make it fit the article width):

$ raku us-states
(AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS \
MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY)

Then we need a dictionary, and we set one up back in Challenge #5.1 (see my «Aruk Masgnara» (or «Rakugrams») article).

The «dictionary-lookup2» program is almost what we want (look it up in the article if you want to know more), and it is easy to modify it to fit the new requirements.

This time we don't need a hash (or Set, actually). The challenge wanted the longest word, so we sort the dictionary by word size:

File: dictionary-list-by-length
my $dictionary = "/usr/share/dict/british-english"; # [1]

.say for get-dictionary($dictionary);               # [2]

sub get-dictionary ($file where $file.IO.r)         # [3]
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.uc.sort({ $^b.chars cmp $^a.chars });
}        # 4 ########## # 5 ########### # 6 # # 7 ############################

[1] The dictionary file. Change the file name if you want to use another file.

[2] Load the dictionary (in [3]) and print it, one word at a time.

[3] Note the «where» clause, which gives a better error message if the file doesn't exist or isn't readable.

[4] Read the file, one line at a time.

[5] • only use lines with nothing but letters.

[6] • turn the words to uppercase, as the state codes are in uppercase.

[7] • sort them, by the length. We have swapped the placeholder variables «$^a» and «$^b» around so that the longest word comes first.
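The swapped placeholders in isolation, on a made-up word list:

```raku
say <ox bee wasp>.sort({ $^a.chars cmp $^b.chars });  # (ox bee wasp)  - shortest first
say <ox bee wasp>.sort({ $^b.chars cmp $^a.chars });  # (wasp bee ox)  - longest first
```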

Running it (and showing the first 10 lines only):

$ raku dictionary-list-by-length
COUNTERREVOLUTIONARIES
ELECTROENCEPHALOGRAPHS
ELECTROENCEPHALOGRAMS
ELECTROENCEPHALOGRAPH
ANDRIANAMPOINIMERINA
COUNTERREVOLUTIONARY
ELECTROENCEPHALOGRAM
UNCHARACTERISTICALLY
CHLOROFLUOROCARBONS
COUNTERINTELLIGENCE
...

The full program

Now we can do the full program:

File: state-words
unit sub MAIN ($dictionary where $dictionary.IO.e && $dictionary.IO.r  # [1]
    = "/usr/share/dict/british-english", :$all);                    

my %states = 
       "AK  Alaska           LA  Louisiana        OH  Ohio
        AL  Alabama          MA  Massachusetts    OK  Oklahoma
        AR  Arkansas         MD  Maryland         OR  Oregon
        AZ  Arizona          ME  Maine            PA  Pennsylvania
        CA  California       MI  Michigan         RI  Rhode Island
        CO  Colorado         MN  Minnesota        SC  South Carolina
        CT  Connecticut      MO  Missouri         SD  South Dakota
        DE  Delaware         MS  Mississippi      TN  Tennessee
        FL  Florida          MT  Montana          TX  Texas
        GA  Georgia          NC  North Carolina   UT  Utah
        HI  Hawaii           ND  North Dakota     VA  Virginia
        IA  Iowa             NE  Nebraska         VT  Vermont
        ID  Idaho            NH  New Hampshire    WA  Washington
        IL  Illinois         NJ  New Jersey       WI  Wisconsin
        IN  Indiana          NM  New Mexico       WV  West Virginia
        KS  Kansas           NV  Nevada           WY  Wyoming
        KY  Kentucky         NY  New York".split(/\s\s+/).hash;

sub get-dictionary ($file where $file.IO.r)            # [2]
{
  return $file.IO.lines>>.uc                           # [2a]
    .grep({ .chars %% 2 })                             # [2b]
    .grep(* !~~ /\W/)                                  # [2c]
    .sort({ $^b.chars cmp $^a.chars });                # [2d]
}

my $found = 0;

for get-dictionary($dictionary) -> $word               # [3]
{ 
  last if $word.chars < $found;                        # [3a]
  
  check-word($word);                                   # [4]

  last if $found && !$all;                             # [5]
}

say "\nWord length: $found.";                          # [6]

sub check-word ($word)                                 # [4]
{
  my @parts = $word.comb(2);                           # [7]

  return unless %states{$_} for @parts;                # [8]
  
  say "{ @parts.map({ %states{$_} }).join(" + ") } = $word";  # [9]
  
  $found = $word.chars;                               # [10]
}

[1] I use an English dictionary file by default, but specify another file if needed. Use the «--all» flag to get all the matches (of the same length), and not just the first one.

[2] Read the dictionary file, force the words to uppercase [2a], remove words with an odd number of characters [2b] (as the state codes are two characters long, so any combination of them has an even length), remove any word containing a non-letter character [2c], and finally sort them by size, with the longest first [2d].

[3] Iterate through the words in the dictionary (starting with the longest word), and abort if the current word length is smaller than the length of the first match [3a].

[4] «check-word» does the job (and sets «$found»).

[5] Abort if we have shown the first match, and the user hasn't asked for all («--all»).

[6] Report the length of the word(s) as well.

[7] Split the word into strings of two characters each with comb(2). We have already ensured that the word length is even, so all the parts will have the correct length.

[8] Iterate through the two-character parts of the word, and return if the current one doesn't match a state identifier.

[9] If they all match, we have a matching word. Print it.

[10] and record the word length.
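The even-length filter from [2b], in isolation (the «%%» divisibility operator does the job):

```raku
say <PACT WIND MEALS CALAMARI>.grep({ .chars %% 2 });  # (PACT WIND CALAMARI)
```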

See docs.raku.org/routine/comb for more information about «comb».
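A quick demonstration of «comb» with a size argument:

```raku
say "CONCORDE".comb(2);  # (CO NC OR DE)
```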

Running it:

$ raku state-words
Colorado + North Carolina + Oregon + Delaware = CONCORDE

Word length: 8.
$ raku state-words --all
Colorado + North Carolina + Oregon + Delaware = CONCORDE
Georgia + New York + Maine + Delaware = GANYMEDE
Massachusetts + North Dakota + Arkansas + Indiana = MANDARIN
California + Louisiana + Michigan + Nebraska = CALAMINE
Massachusetts + Indiana + Louisiana + North Dakota = MAINLAND
Massachusetts + Louisiana + Rhode Island + Alabama = MALARIAL
Massachusetts + North Dakota + Arkansas + Indiana = MANDARIN
Maine + Missouri + Rhode Island + Alabama = MEMORIAL
Missouri + Oregon + Louisiana + North Dakota = MOORLAND

Word length: 8.

The duplicate hits for «MANDARIN» are caused by two entries in the dictionary: «Mandarin» and «mandarin».

The easiest way of removing duplicate values from a list is with the unique method. The change is trivial:

File: state-words (partial)
sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines>>.uc.grep({ .chars %% 2 })
    .grep(* !~~ /\W/).unique
    .sort({ $^b.chars cmp $^a.chars });
}

See docs.raku.org/routine/unique for more information about «unique».
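«unique» keeps the first occurrence of each value:

```raku
say <MANDARIN CONCORDE MANDARIN>.unique;  # (MANDARIN CONCORDE)
```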

The dictionary I use doesn't have the word «Calamari» (which is shown exactly like this in the challenge, inconsistent with the other example words, which are in upper case letters only).

The actual result(s) depend on the dictionary file.

Other Languages

The challenge asked for «the longest English word», but we can have a go at an American dictionary as well:

$ raku state-words --all  "/usr/share/dict/american-english"

It gives exactly the same result as the English one.

The French dictionary:

$ raku state-words -all /usr/share/dict/french 
Arkansas + Missouri + Rhode Island + California + Indiana = ARMORICAIN
Massachusetts + North Dakota + Arkansas + Indiana + Alabama = MANDARINAL

Word length: 10.

Longer words than in English.

The German dictionary:

$ raku state-words --all  "/usr/share/dict/ngerman"
Colorado + North Carolina + Oregon + Delaware = CONCORDE
Massachusetts + North Dakota + Arkansas + Indiana = MANDARIN
Massachusetts + South Carolina + Hawaii + Nebraska = MASCHINE
Virginia + North Dakota + Alabama + Indiana = VANDALIN

Word length: 8.

The Italian dictionary:

$ raku state-words -all /usr/share/dict/italian 
Rhode Island + Colorado + Michigan + North Carolina + Iowa + Virginia \
  + Missouri = RICOMINCIAVAMO
Rhode Island + Colorado + North Carolina + Illinois + Iowa + Virginia \
  + Missouri = RICONCILIAVAMO

Word length: 14.

14 characters is quite impressive. It would have been even more impressive if I had recognized them as words...