This article has been moved from «perl6.eu» and updated to reflect the language rename in 2019.
This is my response to the Perl Weekly Challenge #5.
Write a program which prints out all anagrams for a given word. For more information about Anagram, please check this wikipedia page. |
We need a dictionary file, as random letter combinations doesn't cut it. Happily I already found three such files while writing the Raku P(i)ermutations article. (Scroll down to the «Oops?» section.) I'll use the english one; «/usr/share/dict/british-english».
unit sub MAIN (Str $word where $word !~~ /\W/); # [1]
my $dict = get-dictionary("/usr/share/dict/british-english");
say $dict{$word} # [2]
?? "$word: Is a valid word"
!! "$word: Not a valid word";
sub get-dictionary ($file where $file.IO.r) # [3]
{
my %hash;
$file.IO.lines.grep(* !~~ /\W/).map({ %hash{$_} = True; });
# [4] ######## # [5] ########## # [6] ####################
return %hash;
}
[1] We allow words only, no spaces (or any non-letter characters)
[2] Did we find the word, or not.
[3] We get a nice error message if we pass it the name of a non-readable file.
[4] This gives all the lines in the dictionary, one word on each.
[5] For almost every word, this dictionary file has an additional entry with «'s» appended. That is no good here, so we skip words with non-letters in them.
[6] We use «map
» to set an entry in the hash for the word (with True
as value).
We should probably do a case insensitive match. Adding
unit sub MAIN (Str $word is copy where $word !~~ /\W/);
$word .= lc;
my %dict = get-dictionary("/usr/share/dict/british-english");
say %dict{$word}
?? "$word: Is a valid word"
!! "$word: Not a valid word";
sub get-dictionary ($file where $file.IO.r)
{
my %hash;
$file.IO.lines.grep(* !~~ /\W/).map({ %hash{.lc} = True; });
return %hash;
}
See docs.raku.org/routine/lc for more information about «lc».
I have used a couple of short forms: «$word .= lc
» instead of
«$word = $word.lc;
», and «.lc
» instead of «$_.lc
».
All arguments to a procedure are read only by default. We can get a
writable local copy by appending «is copy
» in the procedure signature.
Note that lowercase/uppercase doesn't always roundtrip. A good example is the German character «ß» which is written as «SS» in uppercase:
> say "Straße".uc; # -> STRASSE
> say "Straße".uc.tclc; # -> Strasse
We can shorten «get-dictionary» considerably, if we use a «Set» instead of
a hash. A «Set» is a variant of hash, where the values can only be True
or
False
. Or rather, it looks that way from the outside. Only positive values
(the keys) are stored in a «Set», so it is quite compact. We get the value True
on lookup if the given key is present, and False
otherwise.
sub get-dictionary ($file where $file.IO.r)
{
return $file.IO.lines.grep(* !~~ /\W/).Set;
}
The «grep» gave us a list, and coercing that list to a «Set» by applying the «.Set
»
method on it gives a «Set». No need for «map», as we did with the hash version.
If you find the name «Set» familiar in a mathematical sense, you are right. They are the same, and Raku even has built in Set operators that you can use on them. See docs.raku.org/type/Set for more information about «Set».
The conversion of the dictionary to lowercase got lost as I got rid
of the «map». We cannot just put the «.lc
» back on after the «grep
»,
as we have a list - and «.lc
» works on a single string. The result would be the
entire list coerced to a single (and very long) string with lowercase letters. Try it in REPL:
> say <a b c d>.raku; # -> ("a", "b", "c", "d")
> say <a b c d>.lc.raku; # -> "a b c d"
See docs.raku.org/routine/raku for more information about «raku», which I used here to get a clearer view of the variable types.
The raku
method is also
available as perl
for legacy reasons. It will probably be deprecated, so you are
advised to use raku
.
We can tell Raku to work on all the elements in a list with the «>>.
»
Hyper Operator.
See docs.raku.org/language/operators#Hyper_operators for more information about Hyper Operators.
File: dictionary-lookup2
unit sub MAIN (Str $word is copy where $word !~~ /\W/,
:$dictionary where $dictionary.IO.r = "/usr/share/dict/british-english");
$word .= lc;
my $dict = get-dictionary($dictionary);
say $dict{$word}
?? "$word: Is a valid word"
!! "$word: Not a valid word";
sub get-dictionary ($file where $file.IO.r)
{
return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}
Note the «$
» in
«$dict
» this time. If we had kept the «%
» we would
have coerced the Set to a hash on assignment. (It is possible to tell
«%dict
» to be a Set, like this: «my %dict is Set
».
But the assignment will not work out quite as expected, as shown in the third
line:
> my $h = <A B C>.Set; # -> set(A B C)
> my %h = <A B C>.Set; # -> {A => True, B => True, C => True}
> my %h is Set = <A B C>.Set; # -> set(set(A B C))
> my %h is Set = <A B C>; # -> set(A B C)
I have added support for additional dictionary files, with the «--dictionary» argument:
$ raku dictionary-lookup2 defence
defence: Is a valid word
$ raku dictionary-lookup2 --dictionary=/usr/share/dict/american-english defence
defence: Not a valid word
$ raku dictionary-lookup2 defense
defense: Not a valid word
$ raku dictionary-lookup2 --dictionary=/usr/share/dict/american-english defense
defense: Is a valid word
$ raku dictionary-lookup2 --dictionary=/usr/share/dict/ngerman börse
börse: Is a valid word
The «permutations
» method (described in my
Raku P(i)ermutations article) gives us all the
possible permutations of the elements in a list. So we'll turn the word
into a list of single characters (with «.comb
» and apply
«.permutations
» on it.
Except that «permutations
» gives us a list of
lists (with the single characters):
> say "abc".comb.permutations;
((a b c) (a c b) (b a c) (b c a) (c a b) (c b a))
I prefer to deal with strings, so we can use the familiar «>>.
»
Hyper Operator to combine the inner lists into strings (potential words):
> say "abc".comb.permutations>>.join;
(abc acb bac bca cab cba)
No duplicates, but only because we didn't have duplicate
letters in the input string. The «.unique
» method fixes that:
> say "abb".comb.permutations>>.join;
(abb abb bab bba bab bba)
> say "abb".comb.permutations>>.join.unique;
(abb bab bba)
See docs.raku.org/routine/comb for more information about «comb».
See docs.raku.org/routine/unique for more information about «unique».
And here it is, the full program:
File: anagrams
unit sub MAIN (Str $word is copy where $word !~~ /\W/,
:$dictionary where $dictionary.IO.r = "/usr/share/dict/british-english");
$word .= lc;
my $dict = get-dictionary($dictionary);
print "Anagrams:";
for $word.comb.permutations>>.join.unique -> $candidate
{
# next if $candidate eq $word; # [1]
print " $candidate" if $dict{$candidate}; # [2]
}
print "\n";
sub get-dictionary ($file where $file.IO.r) is export
{
return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}
[1] Skip the input word itself. I have commented it out, as it is perhaps ok to report it. That way we can see if it is a legal word.
[2] Print the canditate, if we find it in the dictionary.
Testing it:
$ raku anagrams Elvis
Anagrams: elvis evils lives veils
$ raku anagrams Elvi
Anagrams: evil levi live veil vile
$ raku anagrams Elviz
Anagrams:
The challenge did specify «for a given word», so we can ignore multiple word input.
But multiple word solutions should be handled. E.g:
> say "/usr/share/dict/british-english".IO.lines.grep({.chars == 1});
(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \
a b c d e f g h i j k l m n o p q r s t u v w x y z)
> "/usr/share/dict/american-english".IO.lines.grep({.chars == 1});
(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \
a b c d e f g h i j k l m n o p q r s t u v w x y z)
That doesn't work out with multiple words anagrams, as we'd drown in single letter words. So I wrote a program to set up copies of the dictionary files without words with non-letters, and getting rid of the one letter words (except «A» and «I») in the english lists:
File: mkdictionary
my %source =
<UK> => "/usr/share/dict/british-english", # [1]
<US> => "/usr/share/dict/american-english",
<DE> => "/usr/share/dict/ngerman";
unit sub MAIN (Str $language where %source{$language}.defined); # [2]
my @lines = %source{$language}.IO.lines.grep(* !~~ /\W/);
spurt "dict-$language.txt", $language eq "DE"
?? @lines.join("\n") ~ "\n" # [3]
!! "A\nI\n" ~ @lines.grep( {.chars > 1 } ).join("\n") ~ "\n"; # [4]
[1] We set up the three languages and the corresponding dictionary files.
[2] The language must be defined in the hash in [1].
[3] German? No special filtering required.
[4] English? Get rid of the one letters words, but add «A» and «I» back again.
unit sub MAIN (Str $word is copy,
:$dictionary where $dictionary.IO.r = "dict-UK.txt"); # [1]
$word = $word.trans(" " => "", :delete).lc; # [2]
my $dict = get-dictionary($dictionary); # [3]
my @permutations = $word.comb.permutations>>.join.unique; # [4]
my SetHash $seen; # [5]
check-anagram("", $_) for @permutations; # [6]
say "Anagrams: { $seen.keys.elems }"; # [7]
.say for $seen.keys.sort; # [7]
sub get-dictionary ($file where $file.IO.r) # [3]
{
return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}
sub check-anagram ($base is copy, $candidate is copy) # [8]
{
if $dict{$candidate} # [9]
{
$seen{"$base $candidate".trim-leading} = True; # [9]
# The first character is a space.
return; # [9]
}
for 1 .. $candidate.chars # [10]
{
my $new-base = $candidate.substr(0, $_); # [11]
my $new-candidate = $candidate.substr($_); # [11]
check-anagram("$base $new-base", $new-candidate) if $dict{$new-base};
# [12] ###### # [13] ########## # [14] ######## # [12] #############
}
}
[1] I have removed the «where» clause on «$word»; see [2] for why.
[2] This line makes it possible to support multiple words as input. We simply remove the spaces with «trans».
[3] Load the dictionary.
[4] A list of candidates, without repetionsn and without spaces.
[5] Used to store the anagrams.
[6] And off we go. Note the recursive call. I'll explain it below.
[7] Report the result.
[8] A recursive procedure, called with the whole word the first time.
[9] If the candidate word is a word, take a note and we are finished.
[10] We loop through the first X number of letters in the candidate word,
[11] taking the letters as the new base and the rest as the new candidate.
[12] Then we do a recursive call, if the new base is a valis word,
[13] with the new base added to the old one (forming several words),
[14] and the new candidate.
See docs.raku.org/routine/trans for more information about «trans».
Running it takes slightly over 1 second on my pc, and the result shows that the english word list has too many «words» that shouldn't really be considered words. I have highlighted some of them, and abridged the ouput (which has 306 lines):
$ raku multigrams "real fun"
Anagrams: 306
a elf run
a elf urn
a flue rn
a fr le nu
a fr nu le
...
Getting rid of the redundant «words» is hard, but I added the «--log-words» command line option to the program. It writes the words it found to a separate file:
File: multigrams (changes only)
unit sub MAIN (Str $word is copy,
:$dictionary where $dictionary.IO.r = "dict-UK.txt", :$log-words);
my SetHash $seen;
my SetHash $word-list;
.say for $seen.keys.sort;
spurt "wordlog.txt", $word-list.keys.sort.join("\n") ~ "\n" if $log-words;
$word-list{$candidate} = True if $log-words;
$seen{"$base $candidate".trim-leading} = True;
$ raku multigrams --log-words "real fun"
$ mv wordlog.txt english.txt
$ emacs english.txt
The file has 61 lines (and so called words). I commented out (remove with undoability) any word that didn't make sense to me. Then I ran the «multigram» program again, with that dictionary:
$ raku multigrams --dictionary=english.txt "real fun"
Anagrams: 41
a elf run
a elf urn
a run elf
a urn elf
earl fun
earn flu
elf a run
elf a urn
elf run a
elf urn a
fan lure
fan rule
flea run
flea urn
flu earn
flu near
fun earl
fun lear
fun real
funeral
fur lane
fur lean
fur neal
lane fur
leaf run
leaf urn
lean fur
lear fun
lure fan
neal fur
near flu
real fun
rule fan
run a elf
run elf a
run flea
run leaf
urn a elf
urn elf a
urn flea
urn leaf
From 306 to 41 anagrams. That is much better.
We could consider the anagrams with the same words, but in a different order, as (almost) identical and show them on the same line. E.g:
fun real |real fun
a elf urn | a urn elf | elf a urn | elf urn a | urn a elf | urn elf a
That should reduce the list significantly, and it reads better as well. Changes only:
File: multigrams (changes only)
unit sub MAIN (Str $word is copy,
:$dictionary where $dictionary.IO.r = "dict-UK.txt",
:$log-words, :$tabular);
if $tabular
{
my %shown;
for $seen.keys.sort
{
unless /\s/ { .say; next; }
my @w = .words.sort;
my $w = @w.join(" ");
next if %shown{$w};
%shown{$w} = True;
print $w unless @w;
print @w.permutations.unique.join(" | ");
print "\n";
}
}
else
{
.say for $seen.keys.sort;
}
$ raku multigrams --dictionary=english.txt --tabular "real fun"
Anagrams: 41
a elf run | a run elf | elf a run | elf run a | run a elf | run elf a
a elf urn | a urn elf | elf a urn | elf urn a | urn a elf | urn elf a
earl fun | fun earl
earn flu | flu earn
fan lure | lure fan
fan rule | rule fan
flea run | run flea
flea urn | urn flea
flu near | near flu
fun lear | lear fun
fun real | real fun
funeral
fur lane | lane fur
fur lean | lean fur
fur neal | neal fur
leaf run | run leaf
leaf urn | urn leaf
Multi word anagrams should have meaning, and that is almost impossible to get a computer to figure out. The tabular output is the best I can do, but the reader must do the rest of the work by sorting out the meaningless candidates.
Write a program to find the sequence of characters that has the most anagrams. |
It will probably take quite some time to run, though... But what the heck, I'll program it anyway:
File: maxigrams-error
unit sub MAIN (Str :$dictionary where $dictionary.IO.r = "dict-UK.txt");
my $dict = get-dictionary($dictionary);
my %count;
for $dict.keys.sort( { $^b.chars <=> $^a.chars } ) -> $word
{
# next if $word.chars > 20;
last if %count.values.max > $word.chars;
%count{$word} = count-anagrams($word);
}
for %count.keys.sort( { %count{$^b} <=> %count{$^a} } )
{
say "$_ : ", %count{$_};
}
sub get-dictionary ($file where $file.IO.r)
{
return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}
sub count-anagrams ($word)
{
my $count = 0;
$count++ if $dict{$_} for $word.comb.permutations>>.join.unique;
say "$word: $count";
return $count;
}
Running it:
$ raku maxigrams-error
Cowardly refusing to permutate more than 20 elements, tried 22
in sub count-anagrams at ./maxigrams line 27
in sub MAIN at ./maxigrams line 13
in block at ./maxigrams line 1
Adding a workaround line (shown commented out above) to avoid permutations of more than 20 characters makes it work, but any word longer than 20 characters will be ignored. That may be a problem.
Note that by «makes it work» I am not quite honest. It takes forever to run. This is a
fundamental problem with the approach, as it applies «.permutations
» on the
words. A 20 character word has 20! (or 2432902008176640000) permutations (if all the
letters are different, which they probably aren't. But that doesn't really matter
that much, as it is still a very large number even if we get rid of some digits). That
is a lot to go through.
Also note that we already know the potential anagrams; the other words in the dictionary. So instead of all this permutation nonsense, we can simply rearrange the letters in the words (alphabetically is an obvious choice) and count how many times they occur in the dictionary. The highest count is the winner. We will not get a legal word, but we were not asked for it so that is ok.
File: maxigrams
unit sub MAIN (Str :$dictionary where $dictionary.IO.r = "dict-UK.txt");
my $dict = get-dictionary($dictionary);
my %count;
%count{ .comb.sort.join }++ for $dict.keys; # [1]
my $max = 0; # [2]
for %count.keys.sort( { %count{$^b} <=> %count{$^a} } ) # [3]
{
$max = %count{$_} if %count{$_} > $max; # [4]
last if %count{$_} < $max; # [5]
say "$_: ", %count{$_}; # [6]
}
sub get-dictionary ($file where $file.IO.r)
{
return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}
[1] We rearrange the letters in the words (alphabetically), and increment the counter.
[2] The maximum number of anagrams.
[3] We sort the hash by the key size (number of anagrams); largest first.
[4] Set the maximum count the first iteration.
[5] Finish if we go below that count. This allows several words with the same number of anagrams.
[6] print the «word».
«$^a
»
and «$^b
» are a Placeholder Variables that magically turn into
existence when we use them.
See
docs.raku.org/language/variables#The_^_twigil for more information.
Running it:
$ raku maxigrams
aelst: 8
$ raku maxigrams --dictionary=dict-US.txt
aelst: 8
$ raku maxigrams --dictionary=dict-DE.txt
ceehinrst: 11
The english versions ran in about 4-5 seconds, and the german took almost 40 seconds to run (as the dictionary is much larger, and we got a hit on a longer word).
Now, if you want to know the actual words, look them up with «anagram»:
$raku anagrams aelst
Anagrams: least slate stael stale steal tales teals tesla
$ raku anagrams --dictionary=dict-US.txt aelst
Anagrams: least slate stael stale steal tales teals tesla
$ raku anagrams --dictionary=dict-DE.txt ceehinrst
Anagrams: enterichs entsicher entsichre erscheint erschient reichsten \
scheitern schreiten sicherten streichen tierchens
But it isn't that hard to add it to the program, while we're at it:
File: maxigrams2
unit sub MAIN (Str :$dictionary where $dictionary.IO.r = "dict-UK.txt");
my $dict = get-dictionary($dictionary);
my %count;
%count{ .comb.sort.join }++ for $dict.keys;
my $max = 0;
for %count.keys.sort( { %count{$^b} <=> %count{$^a} } )
{
$max = %count{$_} if %count{$_} > $max;
last if %count{$_} < $max;
say "$_: ", %count{$_}, " ", anagrams($_);
}
sub get-dictionary ($file where $file.IO.r)
{
return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}
sub anagrams ($word)
{
$word.comb.permutations>>.join.unique.grep( { $dict{$_} } );
}
The last part is quite clever, or perhaps too clever. We start with a list of all
the possible permutations, and «.grep
» gets rid of those that are not
valid words (not present in the dictionary). Sit back, and enjoy that one line. Quite
a lot of activity for a single line of code...
$ raku maxigrams2
aelst: 8 (least slate stael stale steal tales teals tesla)
Note that we have lost the possibility to get this result:
And that's it.