[ Index | Permutations | Canonical | Multigrams | The Program | On My Wig ]
In «usa-permutations-squish» we sorted the letters in the subject of the anagram before computing the permutations. That gave us a canonical version of the word. We can do the same with the dictionary, and simply look up the canonical version of the word in this version of the dictionary. That will tell us if we have a valid word, but not which word. We can turn the dictionary into a hash, with the canonical versions as the keys, and the actual (original) word - or words - as the values.
An illustration may help:
We start with the 7-letter word «aligned», which has two anagrams: «dealing» and «leading». The number of permutations is 7! (or 5040). The original program generated all of these permutations, and looked up each and every one of them in the dictionary.
If we instead convert all the words in the dictionary, shown here with a very limited selection, to canonical form, all we have to do is look up the canonical form of the input.
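The idea can be sketched in a few lines of Raku, here with a tiny made-up word list standing in for the real dictionary:

```raku
# Build a hash with the canonical (sorted-letters) form as key,
# and the original word (or words) as the values.
my @words = <aligned dealing leading gallery>;
my %anagram;
%anagram{.comb.sort.join}.push: $_ for @words;

# «aligned», «dealing» and «leading» all share the key «adegiln»,
# so a single lookup finds every anagram:
say %anagram{'aligned'.comb.sort.join};  # [aligned dealing leading]
```

A single hash lookup replaces the 5040 permutation lookups for a seven-letter word.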
The program is quite straightforward:
File: usa-canonical
#! /usr/bin/env raku
unit sub MAIN
(
:c(:$csv) where $csv.IO.r = 'states.csv',
:d(:$dictionary) where $dictionary.IO.r = '/usr/share/dict/american-english',
:v(:$verbose),
:l(:$limit) = 10
);
my $dict = get-dictionary($dictionary);
my @states = get-states($csv)[1..Inf];
say ":States: { @states.join(", ")} " if $verbose;
my %anagram;
for keys $dict -> $word
{
%anagram{$word.comb.sort.join}.push: $word; # [1]
}
for @states -> $state
{
my $compact = $state.lc.split(" ").join;
my $anagrams = %anagram{$compact.comb.sort.join}; # [2]
my @anagrams = $anagrams ?? @$anagrams.grep({ $_ ne $compact }) !! (); # [3]
say @anagrams.elems
?? "+ $state ($compact): { @anagrams.join(", ") }"
!! "- $state ($compact)";
}
sub get-states ($file where $file.IO.r)
{
return $file.IO.lines.map({ /\" (.*!) \"\,/ && $0.Str });
}
sub get-dictionary ($file where $file.IO.r)
{
return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}
[1] Set up the dictionary in canonical form; i.e. the canonical form of the words as key, and the actual word (or words) as the values (a list, courtesy of push).
[2] Look up the state (in canonical form) in the canonical dictionary.
[3] Get rid of the state name itself, if present in the dictionary.
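Footnotes [2] and [3] in action, with a one-entry dictionary hash (the entry is just for illustration; «maine» itself is in the dictionary, so [3] filters it out of the answer):

```raku
# The canonical form of both «maine» and «anime» is «aeimn».
my %anagram = aeimn => [<anime maine>];

my $compact  = 'maine';
my $anagrams = %anagram{$compact.comb.sort.join};                        # [2]
my @anagrams = $anagrams ?? @$anagrams.grep({ $_ ne $compact }) !! ();   # [3]
say @anagrams;  # [anime]
```

Note that a state without anagrams gives an undefined lookup in [2], which the ternary in [3] turns into an empty list.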
Running it (the 3 anagrams are the lines marked with «+»):
$ ./usa-canonical
- Alabama (alabama)
- Alaska (alaska)
- Arizona (arizona)
- Arkansas (arkansas)
- California (california)
- Colorado (colorado)
- Connecticut (connecticut)
- Delaware (delaware)
- District of Columbia (districtofcolumbia)
- Florida (florida)
- Georgia (georgia)
- Hawaii (hawaii)
- Idaho (idaho)
- Illinois (illinois)
- Indiana (indiana)
- Iowa (iowa)
- Kansas (kansas)
- Kentucky (kentucky)
- Louisiana (louisiana)
+ Maine (maine): anime
- Montana (montana)
- Nebraska (nebraska)
- Nevada (nevada)
- New Hampshire (newhampshire)
- New Jersey (newjersey)
- New Mexico (newmexico)
- New York (newyork)
- North Carolina (northcarolina)
- North Dakota (northdakota)
- Ohio (ohio)
- Oklahoma (oklahoma)
- Oregon (oregon)
- Maryland (maryland)
- Massachusetts (massachusetts)
- Michigan (michigan)
+ Minnesota (minnesota): nominates
- Mississippi (mississippi)
- Missouri (missouri)
- Pennsylvania (pennsylvania)
- Rhode Island (rhodeisland)
- South Carolina (southcarolina)
- South Dakota (southdakota)
- Tennessee (tennessee)
+ Texas (texas): taxes
- Utah (utah)
- Vermont (vermont)
- Virginia (virginia)
- Washington (washington)
- West Virginia (westvirginia)
- Wisconsin (wisconsin)
- Wyoming (wyoming)
This program is fast. It took about 2 seconds on my PC.
It is actually possible to speed it up even more, by using a smaller, cleaned-up dictionary. That is easy, as we already have a program doing just that: «mkdictionary» from Challenge #005. Here is a version that has been extensively rewritten:
File: mkdictionary
#! /usr/bin/env raku
my %source = # [1]
  UK => "/usr/share/dict/british-english",
  US => "/usr/share/dict/american-english";
unit sub MAIN (Str $language where %source{$language}.defined);
my @lines = %source{$language}.IO.lines.grep(* ~~ / ^ <[a..z]> ** 2..* $/);
# [2]
@lines.push: 'a'; # [3]
@lines.push: 'i';
if "ignore-$language.txt".IO.r # [4]
{
my %ignore = "ignore-$language.txt".IO.lines.Set; # [5]
spurt "dict-$language.txt",
@lines.sort.squish.grep( { ! %ignore{$_} } ).join("\n") ~ "\n"; # [6]
}
else
{
spurt "dict-$language.txt", @lines.sort.squish.join("\n") ~ "\n"; # [7]
}
[1] Start with either the British or American dictionary.
[2] This line gets rid of the strange one-letter words (except A and I) that the dictionaries contain, in both lower and upper case:
> say "/usr/share/dict/british-english".IO.lines.grep({.chars == 1});
(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \
a b c d e f g h i j k l m n o p q r s t u v w x y z)
> say "/usr/share/dict/american-english".IO.lines.grep({.chars == 1});
(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \
a b c d e f g h i j k l m n o p q r s t u v w x y z)
[2 continued] It also gets rid of words containing any non-letter character. ** is a regex quantifier, used here to specify that the term to the left of it must be repeated at least 2 times. See docs.raku.org/language/regexes#General_quantifier:_**_min..max for more information about the ** quantifier.
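A few quick checks of the pattern (the test words are arbitrary examples):

```raku
# The anchored pattern accepts only words of two or more
# lowercase letters, and nothing else.
my $two-or-more = / ^ <[a..z]> ** 2..* $ /;

say so 'ab'    ~~ $two-or-more;  # True  (two lowercase letters)
say so 'a'     ~~ $two-or-more;  # False (only one letter)
say so "don't" ~~ $two-or-more;  # False (contains a non-letter)
say so 'Abc'   ~~ $two-or-more;  # False (upper case letter)
```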
[3] Add back «a» and «i», as those are real words.
[4] Several of the words in the dictionaries are strange, and we can add them to this file to have them ignored.
[5] A Set works out nicely for this list of words.
[6] Save the dictionary file (with spurt), after getting rid of ignorable words (with grep).
[7] If there is no ignore file, save the sorted word list as it is.
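The ignore-file mechanism of [5] and [6] can be sketched like this, with a made-up ignore list in place of «ignore-US.txt»:

```raku
# Assigning a Set to a hash gives True values for the members,
# so %ignore{$_} is a fast membership test.
my %ignore = <aah aargh>.Set;           # hypothetical ignore list
my @lines  = <aah apple aargh banana>;

say @lines.sort.squish.grep({ ! %ignore{$_} });  # (apple banana)
```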
Then we can use this new dictionary:
File: usa-canonical-faster
#! /usr/bin/env raku
unit sub MAIN
(
:c(:$csv) where $csv.IO.r = 'states.csv',
:d(:$dictionary) where $dictionary.IO.r = 'dict-US.txt',
:v(:$verbose),
:l(:$limit) = 10
);
my %anagram;
for $dictionary.IO.lines -> $word
{
%anagram{$word.comb.sort.join}.push: $word;
}
my @states = get-states($csv)[1..Inf];
say ":States: { @states.join(", ")} " if $verbose;
for @states -> $state
{
my $compact = $state.lc.split(" ").join;
my $anagrams = %anagram{$compact.comb.sort.join};
my @anagrams = $anagrams ?? @$anagrams.grep({ $_ ne $compact }) !! ();
say @anagrams.elems
?? "+ $state ($compact): { @anagrams.join(", ") }"
!! "- $state ($compact)";
}
sub get-states ($file where $file.IO.r)
{
return $file.IO.lines.map({ /\" (.*!) \"\,/ && $0.Str });
}
Running it gives the same result as the previous version - but it takes about 1.1 seconds to do so on my PC, compared with 1.9 seconds. That is quite an improvement.
Remember to create the dictionary first:
$ ./mkdictionary US