Common Frequency with Raku

[96] Published 11. October 2020.

This is my response to the Perl Weekly Challenge #081.

Challenge #081.1: Common Base String

You are given 2 strings, $A and $B.

Write a script to find out common base strings in $A and $B.

A substring of a string $S is called base string if repeated concatenation of the substring results in the string.

Example 1

Input:
    $A = "abcdabcd"
    $B = "abcdabcdabcdabcd"

Output:
    ("abcd", "abcdabcd")

Example 2

Input:
    $A = "aaa"
    $B = "aa"

Output:
    ("a")

I'll dive straight in:

File: common-base-string

#! /usr/bin/env raku

unit sub MAIN (Str $A where $A.chars > 0, Str $B where $B.chars > 0,  # [1]
    :v(:$verbose));

die "Different characters in A and B"
  unless $A.comb.unique.sort.join eq $B.comb.unique.sort.join;        # [2]

my $a = $A.chars;                                                     # [3]
my $b = $B.chars;                                                     # [3a]

my $unique = $A.comb.unique.elems;                                    # [4]

my @cbs;                                                              # [5]

for $unique .. min($a, $b) -> $length                                 # [6]
{
  my $c = $A.substr(0, $length);                                      # [7]
  say ": Length: $length -> $c" if $verbose;

  @cbs.push: $c if $c x ($a / $length) eq $A && $c x ($b / $length) eq $B;
}                                                                     # [8]

say '(' ~ @cbs.map({ '"' ~ $_ ~ '"' }).join(", ") ~ ')' if @cbs;      # [9]

[1] The where clauses are there to prevent empty strings.

[2] Get a list of unique characters in each input string, sorted, and joined back into a string. They must be equal.

[3] The number of characters in the input strings.

[4] The number of unique characters.

[5] We are going to store the result here.

[6] The length of a common base string can be as low as 1, but no larger than the length of the shorter of the two input strings. It cannot be as low as 1 if there is more than one unique character in the strings, so we start at the number of unique characters. This is possibly too small (e.g. «AAB» and «AABAABAAB» gives 2, but 3 is the correct lower limit), but it is better than starting at 1. Iterate over the values.

[7] • Get the specified number of characters from one of the strings (from the start).

[8] • Add the substring to the result if it is a common base string of both input strings. Note the usage of the string repetition operator x.

[9] Print the result, in the requested format. Note that nothing is printed if there is no result. The map places the values in(side) double quotes, and the join adds the commas between them.

See docs.raku.org/routine/x for more information about the string repetition operator x.
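
The comparison in [8] also copes with lengths that do not divide the string length evenly, because x coerces a fractional repetition count to an integer. Here is a minimal sketch of that behaviour (the variable names are mine, chosen for the illustration):

```raku
my $A = "abcdabcd";                   # 8 characters
my $c = $A.substr(0, 3);              # "abc"; 3 does not divide 8 evenly

my $repeated = $c x ($A.chars / 3);   # 8 / 3 is a Rat; x truncates it to 2
say $repeated;                        # abcabc
say $repeated eq $A;                  # False, so "abc" is rejected
```

The repeated string ends up shorter than the input, so the eq test fails without any explicit divisibility check.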

Running it:

$ ./common-base-string abcdabcd abcdabcdabcdabcd
("abcd", "abcdabcd")

$ ./common-base-string aaa aa
("a")

$ ./common-base-string abababab abababab
("ab", "abab", "abababab")

Looking good.

We can simplify (and dumbify) the program by removing the $unique smartness. Simply replace the variable with the value 1 in [6] (and delete line [4]).
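
Going the other way, the lower limit problem mentioned in [6] goes away if we only try lengths that divide both string lengths, i.e. the divisors of their greatest common divisor. This is a sketch of that idea, not the program above; the sub name common-base-strings is hypothetical, and it assumes non-empty input strings:

```raku
sub common-base-strings (Str $A, Str $B) {
    # A common base string must fit a whole number of times in both strings,
    # so its length must divide the gcd of the two lengths.
    my $gcd = $A.chars gcd $B.chars;
    my @cbs;

    for (1 .. $gcd).grep({ $gcd %% $_ }) -> $length {
        my $c = $A.substr(0, $length);
        if $c x ($A.chars div $length) eq $A && $c x ($B.chars div $length) eq $B {
            @cbs.push: $c;
        }
    }

    return @cbs;
}

say common-base-strings("abcdabcd", "abcdabcdabcdabcd");  # [abcd abcdabcd]
say common-base-strings("aaa", "aa");                     # [a]
say common-base-strings("AAB", "AABAABAAB");              # [AAB]
```

It gives the same result as the original loop, but skips the lengths that cannot possibly work.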

Challenge #081.2: Frequency Sort

You are given file named input.

Write a script to find the frequency of all the words.

It should print the result as first column of each line should be the frequency of the word followed by all the words of that frequency arranged in lexicographical order. Also sort the words in the ascending order of frequency.

INPUT file

West Side Story

The award-winning adaptation of the classic romantic tragedy "Romeo and
Juliet". The feuding families become two warring New York City gangs,
the white Jets led by Riff and the Latino Sharks, led by Bernardo. Their
hatred escalates to a point where neither can coexist with any form of
understanding. But when Riff's best friend (and former Jet) Tony and
Bernardo's younger sister Maria meet at a dance, no one can do anything
to stop their love. Maria and Tony begin meeting in secret, planning to
run away. Then the Sharks and Jets plan a rumble under the
highway--whoever wins gains control of the streets. Maria sends Tony to
stop it, hoping it can end the violence. It goes terribly wrong, and
before the lovers know what's happened, tragedy strikes and doesn't
stop until the climactic and heartbreaking ending.

NOTE
For the sake of this task, please ignore the following in the input file:

. " ( ) , 's --

OUTPUT

1 But City It Jet Juliet Latino New Romeo Side Story Their Then West York
adaptation any anything at award-winning away become before begin best
classic climactic coexist control dance do doesn't end ending escalates
families feuding form former friend gains gangs goes happened hatred
heartbreaking highway hoping in know love lovers meet meeting neither no
one plan planning point romantic rumble run secret sends sister streets
strikes terribly their two under understanding until violence warring
what when where white whoever wins with wrong younger

2 Bernardo Jets Riff Sharks The by it led tragedy

3 Maria Tony a can of stop

4 to

9 and the

We can use words to split the text into words, after reading in the file with IO.slurp, like this:

> "input.txt".IO.slurp.words.raku
("West", "Side", "Story", "The", "award-winning", ..., "\"Romeo", "and", \
  "Juliet\".", ..., "(and", ..., "secret,", ..., "highway--whoever", ..., \
  "ending.").Seq

To slurp() or .slurp ?

Using slurp as a method works fine, but we must access it through an IO object. We can also use it as a function, without the IO object, like this:

slurp "input.txt"

But precedence rules will cause problems, so we have to use parens if we want to stack on more method calls:

(slurp "input.txt").words.raku

I like the parens-free version better.
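
The precedence problem is that a method call binds tighter than the function call, so without parens the words method would be applied to the filename string and not to the file content. Putting the parens directly after the function name (without a space) is a third option. Here is a small sketch, using a throwaway file so that it can be run as is (the file name slurp-demo.txt is made up for the illustration):

```raku
# Create a small temporary file so that the example is self-contained.
my $file = $*TMPDIR.add("slurp-demo.txt");
$file.spurt("West Side Story\n");

my @method   = $file.slurp.words;     # method form, on an IO object
my @function = slurp($file).words;    # function form, parens right after the name

say @method;                          # [West Side Story]
say @function eqv @method;            # True

$file.unlink;                         # clean up
```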

See docs.raku.org/routine/words for more information about words.

See docs.raku.org/routine/slurp for more information about slurp.

Note that words will only split on whitespace (the \s regex). So we must take care of the punctuation characters et al manually. (I have used the raku method to make it easier to see the word boundaries.) The output has been abridged.

Fixing the input before calling words is probably the best approach.

File: frequency-sort

#! /usr/bin/env raku

unit sub MAIN ($file where $file.IO.e && $file.IO.r = "input.txt"); # [1]

my $content = $file.IO.slurp                                        # [2]
      .trans(/<[."(),]>/ => ' ')                                    # [3]
      .subst("'s", " ", :global)                                    # [4]
      .subst("--", " ", :global);                                   # [5]

my %freq = $content.words.Bag;                                      # [6]

my @freq;                                                           # [7]

for %freq.keys.sort -> $word                                        # [8]
{
  @freq[%freq{$word}] ~= $word ~ " "                                # [8a]
}

for @freq.keys -> $freq                                             # [9]
{
  say "$freq " ~  @freq[$freq] if @freq[$freq];                     # [9a]
}

[1] I have chosen to give the input file a «.txt» filename extension, and made it possible for the user to specify another file.

[2] Read the entire file in one go.

[3] Use the trans method to get rid of the single character sequences that we should ignore. We replace the offending characters with a space, so that we do not mess up the word boundaries by accident. (There are no problems with the given text, but it is nevertheless a good idea to take extra care here.)

[4] We can use the subst (substitute) method to replace strings. Here we get rid of 's. Note the use of the :global flag to ensure that all occurrences will be replaced; three in this file. The default is the first one only.

[5] As above, but for --. Here it is actually important to replace it with a space, so that we end up with two words.

[6] We turn the modified string into a list of words, and then coerce that list into a Bag (with Bag). A Bag is a specialised version of a hash, where the keys are the values in the list we apply it on here, and the values are the number of times they occur. Exactly what we want, albeit in a form not quite suitable for the output we want.

[7] We are going to build up the word lists for the different frequencies here. I could have added the words as a list (giving a two-dimensional array), but it is easier to work with strings, as that is what we are going to print anyway.

[8] Iterate over the words, in lexicographically sorted order, and add them to the correct entry according to their frequency [8a].

[9] Iterate over the array of frequencies, and print the words if any [9a]. (If we insert a value at e.g. @freq[9], all the entries below 9 will magically pop into existence (with an undefined value). So we have to skip the unused ones.)
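
The holes described in [9] are easy to see in isolation. This is a minimal sketch with a handful of made-up entries instead of the real word list:

```raku
my @freq;

@freq[3] ~= "Maria ";    # ~= appends; an undefined entry starts out as an empty string
@freq[3] ~= "Tony ";
@freq[1] ~= "West ";

say @freq.elems;                        # 4
say @freq.keys;                         # (0 1 2 3)
say @freq[0].defined;                   # False
say @freq.keys.grep({ @freq[$_] });     # (1 3)
```

The entries at index 0 and 2 exist, but are undefined, so the Boolean test weeds them out.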

See docs.raku.org/routine/trans for more information about trans.

See docs.raku.org/routine/subst for more information about subst.

See docs.raku.org/routine/Bag for more information about the Bag method and docs.raku.org/type/Bag for more information about the Bag type.
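
The cleanup steps and the Bag coercion can be tried out on a short string before letting them loose on the whole file. This is a sketch only; the sub name clean-text and the sample string are mine, put together from fragments of the input text:

```raku
sub clean-text (Str $text) {
    # The same three steps as in the program: single characters first,
    # then the two-character sequences.
    $text.trans(/<[."(),]>/ => ' ')
         .subst("'s", " ", :global)
         .subst("--", " ", :global);
}

my $sample = q{"Romeo and Juliet". Riff's friend (and Tony) highway--whoever and doesn't};

my %freq = clean-text($sample).words.Bag;

say %freq<and>;          # 3
say %freq<Riff>;         # 1
say %freq<highway>;      # 1
say %freq<doesn't>;      # 1
```

Note that «doesn't» survives intact, as only 's is removed.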

Running it:

$ ./frequency-sort
1 But City It Jet Juliet Latino New Romeo Side Story Their Then West York
adaptation any anything at award-winning away become before begin best
classic climactic coexist control dance do doesn't end ending escalates 
families feuding form former friend gains gangs goes happened hatred 
heartbreaking highway hoping in know love lovers meet meeting neither no 
one plan planning point romantic rumble run secret sends sister streets 
strikes terribly their two under understanding until violence warring 
what when where white whoever wins with wrong younger 
2 Bernardo Jets Riff Sharks The by it led tragedy 
3 Maria Tony a can of stop 
4 to 
9 and the

Looking good.

We can replace the two for loops with more compact code. Still for loops, but postfix versions:

File: frequency-sort-postfix (partial)

@freq[%freq{$_}] ~= $_ ~ " " for %freq.keys.sort;

say "$_ { @freq[$_] }" for @freq.keys.grep({ @freq[$_] });

It is possible to use map instead:

File: frequency-sort-map

#! /usr/bin/env raku

unit sub MAIN ($file where $file.IO.e && $file.IO.r = "input.txt");

my %freq = $file.IO.slurp
  .trans(/<[."(),]>/ => ' ')
  .subst("'s", " ", :global)
  .subst("--", " ", :global)
  .words.Bag;

my @freq;

%freq.keys.sort.map({ @freq[%freq{$_}] ~= $_ ~ " " });

@freq.keys.grep({ @freq[$_] }).map({ say "$_ { @freq[$_] }" });

I combined the first two my lines as well. The third one (now the second one) really stands out, but there is no easy way of getting rid of that one.

map is basically a for loop in disguise, and in my view a little harder to read. But that really is a matter of taste, and exposure.
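
One possible way of avoiding the @freq array altogether is to group the words by frequency with classify. This is an alternative sketch, not one of the programs above; the sub name frequency-lines is hypothetical, and it takes the already cleaned text as a string instead of reading a file:

```raku
sub frequency-lines (Str $text) {
    my %freq = $text.words.Bag;

    # Group the words by their frequency: frequency => list of words.
    my %by-freq = %freq.keys.classify(-> $word { %freq{$word} });

    # Numeric sort of the frequencies, lexicographic sort of the words.
    return %by-freq.keys.sort(+*).map(-> $f { "$f " ~ %by-freq{$f}.sort.join(" ") });
}

.say for frequency-lines("the cat and the hat and the bat");
# 1 bat cat hat
# 2 and
# 3 the
```

There are no unused entries to skip here, as the hash only has keys for frequencies that actually occur.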

And that's it.
