United States of Anagrams with Raku - Part 1: Permutations

[200.1] Published 11. September 2022.

This first part is basically based on Rakugrams, my response to the Perl Weekly Challenge #5 from April 2019.

The Dictionary

My Linux system has a US English dictionary file that we can use:

File: /usr/share/dict/american-english (first 10 lines only)

A
A's
AMD
AMD's
AOL
AOL's
AWS
AWS's
Aachen
Aachen's

The procedure reading the dictionary looks like this:

sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;  # [1]
}

[1] We do not want any words ending with 's, and grep( * !~~ /\W/) will remove any word with one or more non-word characters. The remaining words are coerced to lowercase (with >>.lc) for easier comparison. Then we coerce the list of words into a Set, which is a hash like structure. A real hash would have required a value for each key (which typically would be «1» in Perl or «True» in Raku).

See docs.raku.org/type/Set for more information about the Set type, and docs.raku.org/routine/Set for more information about the Set method.

The Words

We can use the 50 states of USA as the words (subjects), and this csv file saves us the job of typing them. Note that it includes Washington DC, even though it is not a state.

File: states.csv (first 10 lines only)

"State","Abbreviation"
"Alabama","AL"
"Alaska","AK"
"Arizona","AZ"
"Arkansas","AR"
"California","CA"
"Colorado","CO"
"Connecticut","CT"
"Delaware","DE"
"District of Columbia","DC"

Here is a procedure reading this file, returning a list of states (without the quotes). Note that the header line (the first line in the state file) will result in «State» in the list of states. We'll get rid of that later.

sub get-states ($file where $file.IO.r)
{
  return $file.IO.lines.map({ /\" (.*!) \"\,/ && $0.Str });
}

The Algorithm

The basic idea is quite straightforward. Start with a word (the subject of the anagram), generate all the possible permutations (by shuffling the letters) and check each one towards the dictionary. We get the permutations with the permutations method:

> say "abc".comb.permutations>>.join.join(", ");
abc, acb, bac, bca, cab, cba

> say "abcd".comb.permutations>>.join.join(", ");
abcd, abdc, acbd, acdb, adbc, adcb, bacd, badc, bcad, bcda, bdac, bdca, cabd,\
cadb, cbad, cbda, cdab, cdba, dabc, dacb, dbac, dbca, dcab, dcba

A lot of gibberish, so using a dictionary as filter is essential.

See docs.raku.org/routine/permutations for more information about permutations.

Single Word Anagrams Only

This approach will only detect single word anagrams. «Washington» has the multi word anagram «not washing», but this program will not find it. I'll get back to how to approach this task in part 3.

(«West Virginia» can give «west virgin i a» - which are all legal words, but they do not make sense as a sentence, regardless of how we reorder the words. So this is in fact not a legal anagram.)

This is brute force, but it works. The number of letters in the subject is an issue, as the number of permutations increases exponentially:

> say "abc".comb.permutations>>.join.elems;     # ->   6
> say "abcd".comb.permutations>>.join.elems;    # ->  24
> say "abcde".comb.permutations>>.join.elems;   # -> 120
> say "abcdef".comb.permutations>>.join.elems;  # -> 720

We'll get back to this issue when we run the program. Hope for the best, expect the worst. (Note that the numbers above are the faculty of the number of characters. Raku does not have a faculty operator, but it is easy to define one. On the other hand, you can type it as this [*] 1..6 (in the case of length 6) - instead of actually generating all the permutations as I did above.)

Note that the number of unique combinations will be lower if we have repeating characters. E.g.:

> say "AAcdef".comb.permutations>>.join.unique.elems;  # -> 360
> say "AACCef".comb.permutations>>.join.unique.elems;  # -> 180

The Program

File: usa-permutations-unique

#! /usr/bin/env raku

unit sub MAIN
(
 :c(:$csv)        where $csv.IO.r        = 'states.csv',
 :d(:$dictionary) where $dictionary.IO.r = '/usr/share/dict/american-english',
 :v(:$verbose)
);

my $dict   = get-dictionary($dictionary);                      # [1]
my @states = get-states($csv)[1..Inf];                         # [2]

say ":States: { @states.join(", ")} " if $verbose;

for @states -> $state
{
  my $compact = $state.lc.split(" ").join;                     # [3]
  my @anagrams;
  
  say ":State: $state ($compact)" if $verbose;
 
  for $compact.comb.permutations>>.join.unique -> $candidate  # [4]
  {
    next if $candidate eq $compact;                           # [5]
    @anagrams.push: $candidate if $dict{$candidate};          # [6]
  }

  say "$state ($compact): { @anagrams.elems } { @anagrams.elems ?? \
     "({ @anagrams.join(", ") })" !! "" }";                   # [7]
}

sub get-states ($file where $file.IO.r)
{
  return $file.IO.lines.map({ /\" (.*!) \"\,/ && $0.Str });
}

sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

[1] Load the dictionary.

[2] Load the list of states. Note the starting index of «1», so that we skip the «State» from the header line.

[3] Coece the word(s) to lower case, and remove spaces. (There are more efficient ways of removing spaces, but this is probably efficient enough.)

[4] We use .unique to get rid of duplicate candidates, which we will get if we have duplicate letters in the input word.

[5] Skip the anagram if it is the word itself.

[6] A legal word (in the dictionary)? if so, add it to the list of anagrams.

[7] Print the state name, and the list of anagrams - if any.

Running it:

$ ./usa-permutations-unique
- Alabama (alabama)
- Alaska (alaska)
- Arizona (arizona)
- Arkansas (arkansas)
- California (california)
- Colorado (colorado)
^C

«California» took some time, and I killed the program when it got to «Connecticut» (which has one more letter than «California»; 11 vs 10). This means that «Connecticut» has 11 times more permututations for the program to go through (if we disregard repeating letters). Even if we had the patience to wait for «Connecticut», we would run out of it when we come to «District of Columbia» (at 18 characters when we remove the spaces).

The program clearly needs a cut off point, so that we can get some of the states at least. Let us make the program a little bit faster as well, while we are at it:

File: usa-permutations-squish

#! /usr/bin/env raku

unit sub MAIN
(
 :c(:$csv)        where $csv.IO.r        = 'states.csv',
 :d(:$dictionary) where $dictionary.IO.r = '/usr/share/dict/american-english',
 :v(:$verbose),
 :l(:$limit) = 10
);

my $dict   = get-dictionary($dictionary);
my @states = get-states($csv)[1..Inf];

say ":States: { @states.join(", ")} " if $verbose;

for @states -> $state
{
  my $compact = $state.lc.split(" ").join;
  my @anagrams;

  if $state.chars > $limit
  {
    say "? $state ($compact) skipped";
    next;
  }
 
  for $compact.comb.sort.permutations>>.join.squish -> $candidate  # [1]
  {
    next if $candidate eq $compact;
    @anagrams.push: $candidate if $dict{$candidate};
  }

  @anagrams = @anagrams.unique;

  say @anagrams.elems
    ?? "+ $state ($compact): { @anagrams.join(", ") }"
    !! "- $state ($compact)";
}

sub get-states ($file where $file.IO.r)
{
  return $file.IO.lines.map({ /\" (.*!) \"\,/ && $0.Str });
}

sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

[1] The problem with unique, as used in the original program, is that it has to store the values as they occur, so that duplicates can be removed. The list can grow pretty large. It does not affect the status as a Sequence, as shown here:

> say (1...Inf).unique.head(10);  # -> (1 2 3 4 5 6 7 8 9 10)

If we sort the input, the output will also be sorted - and any duplicates will come sequentially. Then we can use squish instead to get rid of them.

See docs.raku.org/routine/unique for more information about unique.

See docs.raku.org/routine/squish for more information about squish.

Running it:

$ ./usa-permutations-squish
- Alabama (alabama)
- Alaska (alaska)
- Arizona (arizona)
- Arkansas (arkansas)
- California (california)
- Colorado (colorado)
? Connecticut (connecticut) skipped
- Delaware (delaware)
? District of Columbia (districtofcolumbia) skipped
- Florida (florida)
- Georgia (georgia)
- Hawaii (hawaii)
- Idaho (idaho)
- Illinois (illinois)
- Indiana (indiana)
- Iowa (iowa)
- Kansas (kansas)
- Kentucky (kentucky)
- Louisiana (louisiana)
+ Maine (maine): anime
- Montana (montana)
- Nebraska (nebraska)
- Nevada (nevada)
? New Hampshire (newhampshire) skipped
- New Jersey (newjersey)
- New Mexico (newmexico)
- New York (newyork)
? North Carolina (northcarolina) skipped
? North Dakota (northdakota) skipped
- Ohio (ohio)
- Oklahoma (oklahoma)
- Oregon (oregon)
- Maryland (maryland)
? Massachusetts (massachusetts) skipped
- Michigan (michigan)
+ Minnesota (minnesota): nominates
? Mississippi (mississippi) skipped
- Missouri (missouri)
? Pennsylvania (pennsylvania) skipped
? Rhode Island (rhodeisland) skipped
? South Carolina (southcarolina) skipped
? South Dakota (southdakota) skipped
- Tennessee (tennessee)
+ Texas (texas): taxes
- Utah (utah)
- Vermont (vermont)
- Virginia (virginia)
- Washington (washington)
? West Virginia (westvirginia) skipped
- Wisconsin (wisconsin)
- Wyoming (wyoming)

We got 3 anagrams, highlighted manully with white text. (Maine => anime, Minnesota => nominates, Texas => taxes). The program skipped 12 states. Those states may - or may not - result in an anagram, if we choose to have a go at them (and have the patience to wait for the result). But for now, we simply do not know now.

Permutations are way too time consuming when we have many letters in the subject of the anagram. But we can do it much more efficiently, with a radically different approach... And we will do just that in the next part.

Links

About me
Keyword Index
Raku Courses
Beginning Raku (book)
RSS Feed
Code (zip)
Reddit Comments