Rakugrams

by Arne Sommer

Aruk Masgnara

[9] Published 28. April 2019

Perl 6 → Raku

This article has been moved from «perl6.eu» and updated to reflect the language rename in 2019.

This is my response to the Perl Weekly Challenge #5.

Challenge #5.1: All Anagrams

Write a program which prints out all anagrams for a given word. For more information about Anagram, please check this wikipedia page.

We need a dictionary file, as random letter combinations doesn't cut it. Happily I already found three such files while writing the Raku P(i)ermutations article. (Scroll down to the «Oops?» section.) I'll use the english one; «/usr/share/dict/british-english».

Dictionary Lookup

I'll start by writing a program that looks up a user specified word in the dictionary. We load the dictionary into a hash, and the rest is pretty straightforward:

File: dictionary-lookup (without fixes)
unit sub MAIN (Str $word where $word !~~ /\W/); # [1]

my $dict = get-dictionary("/usr/share/dict/british-english");

say $dict{$word}                                # [2]
  ?? "$word: Is a valid word"
  !! "$word: Not a valid word";

sub get-dictionary ($file where $file.IO.r)     # [3]
{
  my %hash;
  $file.IO.lines.grep(* !~~ /\W/).map({ %hash{$_} = True; });
  # [4] ######## # [5] ########## # [6] ####################
  return %hash;
}

[1] We allow words only, no spaces (or any non-letter characters)

[2] Did we find the word, or not.

[3] We get a nice error message if we pass it the name of a non-readable file.

[4] This gives all the lines in the dictionary, one word on each.

[5] For almost every word, this dictionary file has an additional entry with «'s» appended. That is no good here, so we skip words with non-letters in them.

[6] We use «map» to set an entry in the hash for the word (with True as value).

We should probably do a case insensitive match. Adding .lc to turn the string to lowercase is the best choice as most of the letters (in the dictionary) are lowercase anyway. I have highlighted the changes  like this :

File: dictionary-lookup
unit sub MAIN (Str $word is copy  where $word !~~ /\W/);

$word .= lc; 

my %dict = get-dictionary("/usr/share/dict/british-english");

say %dict{$word}
  ?? "$word: Is a valid word"
  !! "$word: Not a valid word";

sub get-dictionary ($file where $file.IO.r)
{
  my %hash;
  $file.IO.lines.grep(* !~~ /\W/).map({ %hash{.lc} = True; });
  return %hash;
}

See docs.raku.org/routine/lc for more information about «lc».

I have used a couple of short forms: «$word .= lc» instead of «$word = $word.lc;», and «.lc» instead of «$_.lc».

All arguments to a procedure are read only by default. We can get a writable local copy by appending «is copy» in the procedure signature.

Note that lowercase/uppercase doesn't always roundtrip. A good example is the German character «ß» which is written as «SS» in uppercase:

> say "Straße".uc;       # -> STRASSE
> say "Straße".uc.tclc;  # -> Strasse

We can shorten «get-dictionary» considerably, if we use a «Set» instead of a hash. A «Set» is a variant of hash, where the values can only be True or False. Or rather, it looks that way from the outside. Only positive values (the keys) are stored in a «Set», so it is quite compact. We get the value True on lookup if the given key is present, and False otherwise.

File: dictionary-lookup2 (partial)
sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines.grep(* !~~ /\W/).Set;
}

The «grep» gave us a list, and coercing that list to a «Set» by applying the «.Set» method on it gives a «Set». No need for «map», as we did with the hash version.

If you find the name «Set» familiar in a mathematical sense, you are right. They are the same, and Raku even has built in Set operators that you can use on them. See docs.raku.org/type/Set for more information about «Set».

The conversion of the dictionary to lowercase got lost as I got rid of the «map». We cannot just put the «.lc» back on after the «grep», as we have a list - and «.lc» works on a single string. The result would be the entire list coerced to a single (and very long) string with lowercase letters. Try it in REPL:

> say <a b c d>.raku;     # -> ("a", "b", "c", "d")
> say <a b c d>.lc.raku;  # -> "a b c d"

See docs.raku.org/routine/raku for more information about «raku», which I used here to get a clearer view of the variable types.

The raku method is also available as perl for legacy reasons. It will probably be deprecated, so you are advised to use raku.

We can tell Raku to work on all the elements in a list with the «>>.» Hyper Operator.

See docs.raku.org/language/operators#Hyper_operators for more information about Hyper Operators.

File: dictionary-lookup2
unit sub MAIN (Str $word is copy where $word !~~ /\W/,
  :$dictionary where $dictionary.IO.r = "/usr/share/dict/british-english");

$word .= lc;

my $dict = get-dictionary($dictionary);

say $dict{$word}
  ?? "$word: Is a valid word"
  !! "$word: Not a valid word";

sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

Note the «$» in «$dict» this time. If we had kept the «%» we would have coerced the Set to a hash on assignment. (It is possible to tell «%dict» to be a Set, like this: «my %dict is Set». But the assignment will not work out quite as expected, as shown in the third line:

> my $h = <A B C>.Set;         # -> set(A B C)
> my %h = <A B C>.Set;         # -> {A => True, B => True, C => True}
> my %h is Set = <A B C>.Set;  # -> set(set(A B C))
> my %h is Set = <A B C>;      # -> set(A B C)

I have added support for additional dictionary files, with the «--dictionary» argument:

$ raku dictionary-lookup2 defence
defence: Is a valid word

$ raku dictionary-lookup2 --dictionary=/usr/share/dict/american-english defence
defence: Not a valid word

$ raku dictionary-lookup2 defense
defense: Not a valid word

$ raku dictionary-lookup2 --dictionary=/usr/share/dict/american-english defense
defense: Is a valid word

$ raku dictionary-lookup2 --dictionary=/usr/share/dict/ngerman börse
börse: Is a valid word

Anagrams

The «permutations» method (described in my Raku P(i)ermutations article) gives us all the possible permutations of the elements in a list. So we'll turn the word into a list of single characters (with «.comb» and apply «.permutations» on it.

Except that «permutations» gives us a list of lists (with the single characters):

> say "abc".comb.permutations;
((a b c) (a c b) (b a c) (b c a) (c a b) (c b a))

I prefer to deal with strings, so we can use the familiar «>>.» Hyper Operator to combine the inner lists into strings (potential words):

> say "abc".comb.permutations>>.join;
(abc acb bac bca cab cba)

No duplicates, but only because we didn't have duplicate letters in the input string. The «.unique» method fixes that:

> say "abb".comb.permutations>>.join;
(abb abb bab bba bab bba)

> say "abb".comb.permutations>>.join.unique;
(abb bab bba)

See docs.raku.org/routine/comb for more information about «comb».

See docs.raku.org/routine/unique for more information about «unique».

And here it is, the full program:

File: anagrams
unit sub MAIN (Str $word is copy where $word !~~ /\W/,
  :$dictionary where $dictionary.IO.r = "/usr/share/dict/british-english");

$word .= lc;

my $dict = get-dictionary($dictionary);
 
print "Anagrams:";

for $word.comb.permutations>>.join.unique -> $candidate
{
  # next if $candidate eq $word;                # [1]
  print " $candidate" if $dict{$candidate};   # [2]
}
print "\n";

sub get-dictionary ($file where $file.IO.r) is export
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

[1] Skip the input word itself. I have commented it out, as it is perhaps ok to report it. That way we can see if it is a legal word.

[2] Print the canditate, if we find it in the dictionary.

Testing it:

$ raku anagrams Elvis
Anagrams: elvis evils lives veils

$ raku anagrams Elvi
Anagrams: evil levi live veil vile

$ raku anagrams Elviz
Anagrams:

Multi Word Anagrams?

Anagrams can also include several words. Here are two examples from the Wikipedia article:
  • «rail safety» = «fairy tales»
  • «eleven plus two» = «twelve plus one»

The challenge did specify «for a given word», so we can ignore multiple word input.

But multiple word solutions should be handled. E.g:

  • «funeral» = «real fun»

Word List Problem

In addition to the duplicate words (as mentioned above), the english files (both UK and US) have entries for every single letter, both upper- and lowercase:

> say "/usr/share/dict/british-english".IO.lines.grep({.chars == 1});
(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \
 a b c d e f g h i j k l m n o p q r s t u v w x y z)

> "/usr/share/dict/american-english".IO.lines.grep({.chars == 1});
(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \
 a b c d e f g h i j k l m n o p q r s t u v w x y z)

That doesn't work out with multiple words anagrams, as we'd drown in single letter words. So I wrote a program to set up copies of the dictionary files without words with non-letters, and getting rid of the one letter words (except «A» and «I») in the english lists:

File: mkdictionary
my %source =
  <UK> => "/usr/share/dict/british-english",                   # [1]
  <US> => "/usr/share/dict/american-english",
  <DE> => "/usr/share/dict/ngerman";

unit sub MAIN (Str $language where %source{$language}.defined);  # [2]

my @lines = %source{$language}.IO.lines.grep(* !~~ /\W/);

spurt "dict-$language.txt", $language eq "DE"
 ?? @lines.join("\n") ~ "\n"                                     # [3]
 !! "A\nI\n" ~ @lines.grep( {.chars > 1 } ).join("\n") ~ "\n";   # [4]

[1] We set up the three languages and the corresponding dictionary files.

[2] The language must be defined in the hash in [1].

[3] German? No special filtering required.

[4] English? Get rid of the one letters words, but add «A» and «I» back again.

Multi Word Anagrams!

This looks like a recursive problem to me, and it played out quite nicely:

File: multigrams
unit sub MAIN (Str $word is copy,
  :$dictionary where $dictionary.IO.r = "dict-UK.txt");    # [1]

$word = $word.trans(" " => "", :delete).lc;                # [2]

my $dict = get-dictionary($dictionary);                    # [3]

my @permutations = $word.comb.permutations>>.join.unique;  # [4]
  
my SetHash $seen;                                          # [5]

check-anagram("", $_) for @permutations;                   # [6]

say "Anagrams: { $seen.keys.elems }";                      # [7]

.say for $seen.keys.sort;                                  # [7]

sub get-dictionary ($file where $file.IO.r)                # [3]
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

sub check-anagram ($base is copy, $candidate is copy)   # [8]   
{
  if $dict{$candidate}                                  # [9]   
  {
    $seen{"$base $candidate".trim-leading} = True;      # [9]   
      # The first character is a space. 
    return;                                             # [9]
  }

  for 1 .. $candidate.chars                             # [10]
  {
    my $new-base      = $candidate.substr(0, $_);       # [11]
    my $new-candidate = $candidate.substr($_);          # [11]
    check-anagram("$base $new-base", $new-candidate) if $dict{$new-base};
    # [12] ###### # [13] ##########  # [14] ######## # [12] #############
  }
}

[1] I have removed the «where» clause on «$word»; see [2] for why.

[2] This line makes it possible to support multiple words as input. We simply remove the spaces with «trans».

[3] Load the dictionary.

[4] A list of candidates, without repetionsn and without spaces.

[5] Used to store the anagrams.

[6] And off we go. Note the recursive call. I'll explain it below.

[7] Report the result.

[8] A recursive procedure, called with the whole word the first time.

[9] If the candidate word is a word, take a note and we are finished.

[10] We loop through the first X number of letters in the candidate word,

[11] taking the letters as the new base and the rest as the new candidate.

[12] Then we do a recursive call, if the new base is a valis word,

[13] with the new base added to the old one (forming several words),

[14] and the new candidate.

See docs.raku.org/routine/trans for more information about «trans».

Recursion Explained

Let's say that we start with the string «He is on it». We get a list of permutations (without the spaces) in [6]. When we get to the "heisonit" candidate, the call is «check-anagram("", "heisonit")». The procedure checks if it is a valid word, which is isn't, and goes on. The «for» loop does not call «check-anagram(" h", "eisonit")», as «h» isn't a legal word. The next iteration does call «check-anagram(" he", "isonit")», as «he» is a legal word. The procedure checks if "isonit" is a valid word. It isn't so it goes on to the «for» loop, where "i" is a valid word and «check-anagram(" he i", "sonit")» is called. That call leads to another match "so", but the rest "nit" doesn't work out as word(s), so the next iteration tries «check-anagram(" he is", "onit")». The "o" isn't a word, but "on" is, so «check-anagram(" he is on", "it")» is called, and it works out (on the second iteration as "i" is ok, but "t" isn't) as "it" is a valid word.

Running it takes slightly over 1 second on my pc, and the result shows that the english word list has too many «words» that shouldn't really be considered words. I have highlighted some of them, and abridged the ouput (which has 306 lines):

$ raku multigrams "real fun"
Anagrams: 306
a elf run
a elf urn
a flue rn
a fr le nu
a fr nu le
...

Getting rid of the redundant «words» is hard, but I added the «--log-words» command line option to the program. It writes the words it found to a separate file:

File: multigrams (changes only)
unit sub MAIN (Str $word is copy,
  :$dictionary where $dictionary.IO.r = "dict-UK.txt", :$log-words);
  
my SetHash $seen;
my SetHash $word-list;

.say for $seen.keys.sort;
spurt "wordlog.txt", $word-list.keys.sort.join("\n") ~ "\n" if $log-words;

    $word-list{$candidate} = True if $log-words;
    $seen{"$base $candidate".trim-leading} = True;
$ raku multigrams --log-words "real fun"
$ mv wordlog.txt english.txt
$ emacs english.txt

The file has 61 lines (and so called words). I commented out (remove with undoability) any word that didn't make sense to me. Then I ran the «multigram» program again, with that dictionary:

$ raku multigrams --dictionary=english.txt "real fun"
Anagrams: 41
a elf run
a elf urn
a run elf
a urn elf
earl fun
earn flu
elf a run
elf a urn
elf run a
elf urn a
fan lure
fan rule
flea run
flea urn
flu earn
flu near
fun earl
fun lear
fun real
funeral
fur lane
fur lean
fur neal
lane fur
leaf run
leaf urn
lean fur
lear fun
lure fan
neal fur
near flu
real fun
rule fan
run a elf
run elf a
run flea
run leaf
urn a elf
urn elf a
urn flea
urn leaf

From 306 to 41 anagrams. That is much better.

We could consider the anagrams with the same words, but in a different order, as (almost) identical and show them on the same line. E.g:

fun real  |real fun
a elf urn | a urn elf | elf a urn | elf urn a | urn a elf | urn elf a

That should reduce the list significantly, and it reads better as well. Changes only:

File: multigrams (changes only)
unit sub MAIN (Str $word is copy,
  :$dictionary where $dictionary.IO.r = "dict-UK.txt",
  :$log-words, :$tabular);

if $tabular
{
  my %shown;
  for $seen.keys.sort
  {
    unless /\s/ { .say; next; } 

    my @w = .words.sort;
    my $w = @w.join(" ");

    next if %shown{$w};

    %shown{$w} = True;
    print $w unless @w;

    print @w.permutations.unique.join(" | ");
    print "\n";
  }
}
else
{
  .say for $seen.keys.sort;
}
$ raku multigrams --dictionary=english.txt --tabular "real fun"
Anagrams: 41
a elf run | a run elf | elf a run | elf run a | run a elf | run elf a
a elf urn | a urn elf | elf a urn | elf urn a | urn a elf | urn elf a
earl fun | fun earl
earn flu | flu earn
fan lure | lure fan
fan rule | rule fan
flea run | run flea
flea urn | urn flea
flu near | near flu
fun lear | lear fun
fun real | real fun
funeral
fur lane | lane fur
fur lean | lean fur
fur neal | neal fur
leaf run | run leaf
leaf urn | urn leaf

Multi word anagrams should have meaning, and that is almost impossible to get a computer to figure out. The tabular output is the best I can do, but the reader must do the rest of the work by sorting out the meaningless candidates.

Challenge #5.2: Anagram Sequence

Write a program to find the sequence of characters that has the most anagrams.

Single Word Anagrams

This is rather easy. Load the dictionary, sort the words by length (largest first), and compute the number of anagrams for each of them. Stop when the number of letters in the words fall below the maximum number of anagrams already computed (as we clearly cannot do better). Then print the largest value.

It will probably take quite some time to run, though... But what the heck, I'll program it anyway:

File: maxigrams-error
unit sub MAIN (Str :$dictionary where $dictionary.IO.r = "dict-UK.txt");

my $dict = get-dictionary($dictionary);

my %count;

for $dict.keys.sort( { $^b.chars <=>  $^a.chars } ) -> $word
{
  # next if $word.chars > 20;
  
  last if %count.values.max > $word.chars;
  
  %count{$word} = count-anagrams($word);
}

for %count.keys.sort( { %count{$^b} <=> %count{$^a} } )
{
  say "$_ : ", %count{$_};
}


sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

sub count-anagrams ($word)
{
  my $count = 0;
  
  $count++ if $dict{$_} for $word.comb.permutations>>.join.unique;

  say "$word: $count";
  return $count;
} 

Running it:

$ raku maxigrams-error
Cowardly refusing to permutate more than 20 elements, tried 22
  in sub count-anagrams at ./maxigrams line 27
  in sub MAIN at ./maxigrams line 13
  in block  at ./maxigrams line 1

Adding a workaround line (shown commented out above) to avoid permutations of more than 20 characters makes it work, but any word longer than 20 characters will be ignored. That may be a problem.

Note that by «makes it work» I am not quite honest. It takes forever to run. This is a fundamental problem with the approach, as it applies «.permutations» on the words. A 20 character word has 20! (or 2432902008176640000) permutations (if all the letters are different, which they probably aren't. But that doesn't really matter that much, as it is still a very large number even if we get rid of some digits). That is a lot to go through.

Also note that we already know the potential anagrams; the other words in the dictionary. So instead of all this permutation nonsense, we can simply rearrange the letters in the words (alphabetically is an obvious choice) and count how many times they occur in the dictionary. The highest count is the winner. We will not get a legal word, but we were not asked for it so that is ok.

File: maxigrams
unit sub MAIN (Str :$dictionary where $dictionary.IO.r = "dict-UK.txt");

my $dict = get-dictionary($dictionary);

my %count;

%count{ .comb.sort.join }++ for $dict.keys;             # [1]

my $max = 0;                                            # [2]

for %count.keys.sort( { %count{$^b} <=> %count{$^a} } ) # [3]
{
  $max = %count{$_} if %count{$_} > $max;               # [4]

  last if %count{$_} < $max;                            # [5]
  
  say "$_: ", %count{$_};                               # [6]
}

sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

[1] We rearrange the letters in the words (alphabetically), and increment the counter.

[2] The maximum number of anagrams.

[3] We sort the hash by the key size (number of anagrams); largest first.

[4] Set the maximum count the first iteration.

[5] Finish if we go below that count. This allows several words with the same number of anagrams.

[6] print the «word».

«$^a» and «$^b» are a Placeholder Variables that magically turn into existence when we use them. See docs.raku.org/language/variables#The_^_twigil for more information.

Running it:

$ raku maxigrams
aelst: 8

$ raku maxigrams --dictionary=dict-US.txt
aelst: 8

$ raku maxigrams --dictionary=dict-DE.txt
ceehinrst: 11

The english versions ran in about 4-5 seconds, and the german took almost 40 seconds to run (as the dictionary is much larger, and we got a hit on a longer word).

Now, if you want to know the actual words, look them up with «anagram»:

$raku anagrams aelst
Anagrams: least slate stael stale steal tales teals tesla

$ raku anagrams --dictionary=dict-US.txt aelst
Anagrams: least slate stael stale steal tales teals tesla

$ raku anagrams --dictionary=dict-DE.txt ceehinrst
Anagrams: enterichs entsicher entsichre erscheint erschient reichsten \
scheitern schreiten sicherten streichen tierchens

But it isn't that hard to add it to the program, while we're at it:

File: maxigrams2
unit sub MAIN (Str :$dictionary where $dictionary.IO.r = "dict-UK.txt");

my $dict = get-dictionary($dictionary);

my %count;

%count{ .comb.sort.join }++ for $dict.keys;

my $max = 0;

for %count.keys.sort( { %count{$^b} <=> %count{$^a} } )
{
  $max = %count{$_} if %count{$_} > $max;

  last if  %count{$_} < $max;
  
  say "$_: ", %count{$_}, " ", anagrams($_);
}

sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

sub anagrams ($word)
{
  $word.comb.permutations>>.join.unique.grep( { $dict{$_} } );
}

The last part is quite clever, or perhaps too clever. We start with a list of all the possible permutations, and «.grep» gets rid of those that are not valid words (not present in the dictionary). Sit back, and enjoy that one line. Quite a lot of activity for a single line of code...

$ raku maxigrams2
aelst: 8 (least slate stael stale steal tales teals tesla)

Multi Word Anagrams

This doesn't make sense, as any sentence whatsoever could give us anagrams simply by rearranging the order of the words. Four words (e.g. «He is on it») give 24 permuatations, clearly more than 8 (the english dictionaries) or 11 (the german dictionary), and adding even more words increases the number of permutations.

Note that we have lost the possibility to get this result:

  • «funeral» = «real fun»
Fixing that would involve running «multigrams» (after implementing the multi word support from «anagrams») and wait forever for it to finish. That clearly isn't doable.

And that's it.