United States of Anagrams with Raku - Part 2: Canonical

by Arne Sommer

[200.2] Published 11. September 2022.

[ Index | Permutations | Canonical | Multigrams | The Program | On My Wig ]

A Radically Different Approach

In «usa-permutations-squish» we sorted the letters in the subject of the anagram before computing the permutations. That gave us a canonical version of the word. We can do the same with the dictionary, and simply look up the canonical version of the word in this version of the dictionary. That will tell us if we have a valid word, but not the actual word. We can turn the dictionary into a hash, with the canonical versions as the keys, and the actual (original) word - or words - as the values.

An illustration may help:

We start with the 7-letter word «aligned», which has two anagrams: «dealing» and «leading». The number of permutations is 7! (or 5040). The original program generated all these permutations, and looked up each and every one of them in the dictionary.
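We can verify that count in the REPL, using the reduction metaoperator «[*]» to compute the factorial:

> say [*] 1..7;
5040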

If we instead convert all the words in the dictionary to canonical form, all we have to do is look up the canonical form of the input.
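The word and both its anagrams collapse to the same canonical form, as a quick REPL session shows:

> say "aligned".comb.sort.join;
adegiln
> say "dealing".comb.sort.join;
adegiln
> say "leading".comb.sort.join;
adegiln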

The program is quite straightforward:

File: usa-canonical
#! /usr/bin/env raku

unit sub MAIN
(
 :c(:$csv)        where $csv.IO.r        = 'states.csv',
 :d(:$dictionary) where $dictionary.IO.r = '/usr/share/dict/american-english',
 :v(:$verbose),
 :l(:$limit) = 10
);

my $dict   = get-dictionary($dictionary);
my @states = get-states($csv)[1..Inf];

say ":States: { @states.join(", ")} " if $verbose;

my %anagram;

for keys $dict -> $word
{
  %anagram{$word.comb.sort.join}.push: $word;                            # [1]
}

for @states -> $state
{
  my $compact  = $state.lc.split(" ").join;
  my $anagrams = %anagram{$compact.comb.sort.join};                      # [2]
  my @anagrams = $anagrams ?? @$anagrams.grep({ $_ ne $compact }) !! (); # [3]

  say @anagrams.elems
    ?? "+ $state ($compact): { @anagrams.join(", ") }"
    !! "- $state ($compact)";
}

sub get-states ($file where $file.IO.r)
{
  return $file.IO.lines.map({ /\" (.*!) \"\,/ && $0.Str });
}

sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

[1] Set up the dictionary in canonical form, i.e. with the canonical form of each word as the key, and the actual word (or words) as the value (a list, courtesy of push).

[2] Look up the state (in canonical form) in the canonical dictionary.

[3] Get rid of the state name itself, if it is present in the dictionary.
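To see the pieces in action, here is a short REPL session; the three sample words stand in for a full dictionary:

> my %anagram; %anagram{ .comb.sort.join }.push: $_ for <aligned dealing leading>; say %anagram;
{adegiln => [aligned dealing leading]}
> say "New York".lc.split(" ").join;
newyork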

Running it (the three states with anagrams are marked with «+»):

$ ./usa-canonical
- Alabama (alabama)
- Alaska (alaska)
- Arizona (arizona)
- Arkansas (arkansas)
- California (california)
- Colorado (colorado)
- Connecticut (connecticut)
- Delaware (delaware)
- District of Columbia (districtofcolumbia)
- Florida (florida)
- Georgia (georgia)
- Hawaii (hawaii)
- Idaho (idaho)
- Illinois (illinois)
- Indiana (indiana)
- Iowa (iowa)
- Kansas (kansas)
- Kentucky (kentucky)
- Louisiana (louisiana)
+ Maine (maine): anime
- Maryland (maryland)
- Massachusetts (massachusetts)
- Michigan (michigan)
+ Minnesota (minnesota): nominates
- Mississippi (mississippi)
- Missouri (missouri)
- Montana (montana)
- Nebraska (nebraska)
- Nevada (nevada)
- New Hampshire (newhampshire)
- New Jersey (newjersey)
- New Mexico (newmexico)
- New York (newyork)
- North Carolina (northcarolina)
- North Dakota (northdakota)
- Ohio (ohio)
- Oklahoma (oklahoma)
- Oregon (oregon)
- Pennsylvania (pennsylvania)
- Rhode Island (rhodeisland)
- South Carolina (southcarolina)
- South Dakota (southdakota)
- Tennessee (tennessee)
+ Texas (texas): taxes
- Utah (utah)
- Vermont (vermont)
- Virginia (virginia)
- Washington (washington)
- West Virginia (westvirginia)
- Wisconsin (wisconsin)
- Wyoming (wyoming)

This program is fast. It took about 2 seconds on my PC.

It is actually possible to speed it up even more:

  • Combine the dictionary loading with the setting up of the hash.
  • Write a preprocessor that gives us a dictionary in the format we need: lowercase words only, and without those pesky «'s» words.

The second part is easy, as we already have a program doing that: «mkdictionary» from Challenge #005. Here is a version that has been extensively rewritten:

File: mkdictionary
#! /usr/bin/env raku

my %source =                                                           # [1]
   UK => "/usr/share/dict/british-english",
   US => "/usr/share/dict/american-english";

unit sub MAIN (Str $language where %source{$language}.defined);

my @lines = %source{$language}.IO.lines.grep(* ~~ / ^ <[a..z]> ** 2..* $/);
                                                                       # [2]

@lines.push: 'a';                                                      # [3]
@lines.push: 'i';

if "ignore-$language.txt".IO.r                                         # [4]
{
  my %ignore = "ignore-$language.txt".IO.lines.Set;                    # [5]

  spurt "dict-$language.txt",
    @lines.sort.squish.grep( { ! %ignore{$_} } ).join("\n") ~ "\n";    # [6]
}
else
{
  spurt "dict-$language.txt", @lines.sort.squish.map( { "$_\n" });     # [7]
}

[1] Start with either the British or the American dictionary.

[2] This line gets rid of the strange one-letter words that the dictionaries contain (all 26 letters, in both lower and upper case):

> say "/usr/share/dict/british-english".IO.lines.grep({.chars == 1});
(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \
 a b c d e f g h i j k l m n o p q r s t u v w x y z)

> say "/usr/share/dict/american-english".IO.lines.grep({.chars == 1});
(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \
 a b c d e f g h i j k l m n o p q r s t u v w x y z)

[2 continued] It also removes words containing any non-letter character. «**» is a Regex quantifier, used here to specify that the term to the left of it must match at least 2 times.

See docs.raku.org/language/regexes#General_quantifier:_**_min..max for more information about the Regex quantifier **.
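A quick REPL check of what the pattern accepts and rejects:

> say so "ab" ~~ / ^ <[a..z]> ** 2..* $ /;
True
> say so "a" ~~ / ^ <[a..z]> ** 2..* $ /;
False
> say so "it's" ~~ / ^ <[a..z]> ** 2..* $ /;
False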

[3] Add back «a» and «i», as those are real words.

[4] Several of the words in the dictionaries are strange, and we can add them to this file to have them ignored.

[5] A Set works out nicely for this list of words.
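Membership testing in a Set is a plain hash lookup; here with a couple of made-up ignore words:

> my %ignore = <aas zzz>.Set; say so %ignore<aas>;
True
> say so %ignore<cat>;
False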

[6] Save the dictionary file (with spurt), after getting rid of the ignorable words (with grep).

[7] Without an ignore file, save the sorted (and deduplicated) word list directly.
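Note that squish only removes adjacent duplicates (unlike unique), which is why the list is sorted first:

> say <b a c a a>.squish;
(b a c a)
> say <b a c a a>.sort.squish;
(a b c)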

Then we can use this new dictionary:

File: usa-canonical-faster
#! /usr/bin/env raku

unit sub MAIN
(
  :c(:$csv)        where $csv.IO.r        = 'states.csv',
  :d(:$dictionary) where $dictionary.IO.r = 'dict-US.txt',
  :v(:$verbose),
  :l(:$limit) = 10
);

my %anagram;

for $dictionary.IO.lines -> $word
{
  %anagram{$word.comb.sort.join}.push: $word;
}

my @states = get-states($csv)[1..Inf];

say ":States: { @states.join(", ")} " if $verbose;

for @states -> $state
{
  my $compact  = $state.lc.split(" ").join;
  my $anagrams = %anagram{$compact.comb.sort.join};
  my @anagrams = $anagrams ?? @$anagrams.grep({ $_ ne $compact }) !! ();

  say @anagrams.elems
    ?? "+ $state ($compact): { @anagrams.join(", ") }"
    !! "- $state ($compact)";
}

sub get-states ($file where $file.IO.r)
{
  return $file.IO.lines.map({ /\" (.*!) \"\,/ && $0.Str });
}

Running it gives the same result as the previous version - but it takes about 1.1 seconds to do so on my PC, compared with 1.9 seconds for the previous version. That is quite an improvement.
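The timings are informal; one way to compare the two versions yourself is the shell's «time» command:

$ time ./usa-canonical
$ time ./usa-canonical-faster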

Remember to create the dictionary first:

$ ./mkdictionary US

[ Index | Permutations | Canonical | Multigrams | The Program | On My Wig ]