by Arne Sommer

# Common Frequency with Raku

[96] Published 11. October 2020.

This is my response to the Perl Weekly Challenge #081.

## Challenge #081.1: Common Base String

You are given 2 strings, `$A` and `$B`.

Write a script to find out common base strings in `$A` and `$B`.

A substring of a string `$S` is called a base string if repeated concatenation of the substring results in the string.

Example 1
```
Input: $A = "abcdabcd"
       $B = "abcdabcdabcdabcd"
Output: ("abcd", "abcdabcd")
```

Example 2
```
Input: $A = "aaa"
       $B = "aa"
Output: ("a")
```

I'll dive straight in:

File: common-base-string
```
unit sub MAIN (Str $A where $A.chars > 0,
               Str $B where $B.chars > 0,            # [1]
               :v(:$verbose));

die "Different characters in A and B"
  unless $A.comb.unique.sort.join eq $B.comb.unique.sort.join;  # [2]

my $a = $A.chars;                                    # [3]
my $b = $B.chars;                                    # [3a]

my $unique = $A.comb.unique.elems;                   # [4]

my @cbs;                                             # [5]

for $unique .. min($a, $b) -> $length                # [6]
{
  my $c = $A.substr(0, $length);                     # [7]

  say ": Length: $length -> $c" if $verbose;

  @cbs.push: $c if $c x ($a / $length) eq $A
                && $c x ($b / $length) eq $B;        # [8]
}

say '(' ~ @cbs.map({ '"' ~ $_ ~ '"' }).join(", ") ~ ')' if @cbs;  # [9]
```

[1] The `where` clauses are there to prevent empty strings.

[2] Get a list of the unique characters in each input string, sorted and joined back into a string. They must be equal.

[3] The number of characters in the input strings.

[4] The number of unique characters.

[5] We are going to store the result here.

[6] The length of the common base string can be as low as 1, but no larger than the length of the shorter of the two input strings. It cannot be as low as 1 if there is more than one unique character in the strings, so we start at the number of unique characters. This lower limit is possibly too small (e.g. «AAB» and «AABAABAAB» give 2, but 3 is the correct lower limit), but it is better than starting at 1. Iterate over the values.

[7] • Get the specified number of characters from one of the strings (from the start).

[8] • Add the substring to the result if it is a common base string of both input strings. Note the use of the string repetition operator `x`.

[9] Print the result, in the requested format. Note that nothing is printed if there is no result. The `map` wraps each value in double quotes, and the `join` adds the commas between them.

See docs.raku.org/routine/x for more information about the string repetition operator `x`.
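As a small illustration, the base-string test in [8] boils down to one repetition and one comparison. (The helper `is-base-string` is my own name for it, not part of the program above.)

```raku
# Hypothetical helper: is $candidate a base string of $S?
sub is-base-string (Str $candidate, Str $S) {
    # How many whole copies of the candidate fit into $S:
    my $count = $S.chars div $candidate.chars;

    # Repeat the candidate that many times; a base string rebuilds $S exactly
    return $candidate x $count eq $S;
}

say is-base-string("abcd", "abcdabcdabcdabcd");  # True
say is-base-string("abc",  "abcdabcd");          # False
```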

Running it:

```
$ ./common-base-string abcdabcd abcdabcdabcdabcd
("abcd", "abcdabcd")

$ ./common-base-string aaa aa
("a")

$ ./common-base-string abababab abababab
("ab", "abab", "abababab")
```

Looking good.

We can simplify (and dumbify) the program by removing the `$unique` smartness. Simply replace the variable with the value `1` in [6] (and delete line [4]).

## Challenge #081.2: Frequency Sort

You are given a file named `input`.

Write a script to find the frequency of all the words.

It should print the result so that the first column of each line is the frequency of the word, followed by all the words of that frequency arranged in lexicographical order. Also sort the lines in ascending order of frequency.

INPUT file
```
West Side Story

The award-winning adaptation of the classic romantic tragedy "Romeo and
Juliet". The feuding families become two warring New York City gangs, the
white Jets led by Riff and the Latino Sharks, led by Bernardo. Their hatred
escalates to a point where neither can coexist with any form of
understanding. But when Riff's best friend (and former Jet) Tony and
Bernardo's younger sister Maria meet at a dance, no one can do anything to
stop their love. Maria and Tony begin meeting in secret, planning to run
away. Then the Sharks and Jets plan a rumble under the highway--whoever wins
gains control of the streets. Maria sends Tony to stop it, hoping it can end
the violence. It goes terribly wrong, and before the lovers know what's
happened, tragedy strikes and doesn't stop until the climactic and
heartbreaking ending.
```

NOTE
For the sake of this task, please ignore the following in the input file:
```
. " ( ) , 's --
```

OUTPUT
```
1 But City It Jet Juliet Latino New Romeo Side Story Their Then West York adaptation any anything at award-winning away become before begin best classic climactic coexist control dance do doesn't end ending escalates families feuding form former friend gains gangs goes happened hatred heartbreaking highway hoping in know love lovers meet meeting neither no one plan planning point romantic rumble run secret sends sister streets strikes terribly their two under understanding until violence warring what when where white whoever wins with wrong younger
2 Bernardo Jets Riff Sharks The by it led tragedy
3 Maria Tony a can of stop
4 to
9 and the
```

We can use `words` to split the text into words, after reading in the file with `IO.slurp`, like this:

```
> "input.txt".IO.slurp.words.raku
("West", "Side", "Story", "The", "award-winning", ..., "\"Romeo", "and",
 "Juliet\".", ..., "(and", ..., "secret,", ..., "highway--whoever", ...,
 "ending.").Seq
```

#### To slurp() or .slurp ?

Using `slurp` as a method works fine, but we must access it through an `IO` object. We can also use it as a function, without the `IO` object, like this:

```
slurp "input.txt"
```

But precedence rules will cause problems, so we have to use parens if we want to stack on more method calls:

```
(slurp "input.txt").words.raku
```

I like the parens-free version better.

See docs.raku.org/routine/words for more information about `words`.

See docs.raku.org/routine/slurp for more information about `slurp`.

Note that `words` will only split on whitespace (the `\s` regex). So we must take care of the punctuation characters et al manually. (I have used the `raku` method to make it easier to see the word boundaries.) The output has been abridged.
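To see the attached punctuation for ourselves, here is a tiny sketch (the sample string is my own, taken from the text above):

```raku
# words splits on whitespace only, so punctuation stays glued to the words
say '"Romeo and Juliet".'.words.raku;  # ("\"Romeo", "and", "Juliet\".").Seq
```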

Fixing the input before calling `words` is probably the best approach.

File: frequency-sort
```
#! /usr/bin/env raku

unit sub MAIN ($file where $file.IO.e && $file.IO.r = "input.txt");  # [1]

my $content = $file.IO.slurp                 # [2]
  .trans(/<[."(),]>/ => ' ')                 # [3]
  .subst("'s", " ", :global)                 # [4]
  .subst("--", " ", :global);                # [5]

my %freq = $content.words.Bag;               # [6]

my @freq;                                    # [7]

for %freq.keys.sort -> $word                 # [8]
{
  @freq[%freq{$word}] ~= $word ~ " "         # [8a]
}

for @freq.keys -> $freq                      # [9]
{
  say "$freq " ~ @freq[$freq] if @freq[$freq];  # [9a]
}
```

[1] I have chosen to give the input file a «.txt» filename extension, and made it possible for the user to specify another file.

[2] Read the entire file in one go.

[3] Use the `trans` method to get rid of the single-character sequences that we should ignore. We replace the offending characters with a space, so that we do not mess up the word boundaries by accident. (There are no problems with the given text, but it is nevertheless a good idea to take extra care here.)

[4] We can use the `subst` (substitute) method to replace strings. Here we get rid of `'s`. Note the use of the `:global` flag to ensure that all occurrences will be replaced; three in this file. The default is the first one only.

[5] As above, but for `--`. Here it is actually important to replace it with a space, so that we end up with two words.

[6] We turn the modified string into a list of words, and then coerce that list into a Bag (with `Bag`). A Bag is a specialised version of a hash, where the keys are the values in the list we apply it to, and the values are the number of times they occur. Exactly what we want, albeit in a form not quite suitable for the output we want.

[7] So we are going to build up the word lists for the different frequencies here. I could have added the words as a list (giving a two-dimensional array), but it is easier to work with strings, as that is what we are going to print anyway.

[8] Iterate over the words, in lexicographically sorted order, and add each one to the correct entry according to its frequency [8a].

[9] Iterate over the array of frequencies, and print the words, if any [9a]. (If we insert a value at e.g. `@freq[9]`, all the entries below 9 will magically pop into existence, with an undefined value. So we have to skip the unused ones.)
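A quick sketch of that auto-extension (the inserted value is made up):

```raku
my @freq;
@freq[9] = "and the ";               # entries 0..8 now exist, but are undefined
say @freq.elems;                     # 10
say @freq.keys.grep({ @freq[$_] });  # (9) - only the defined, non-empty entry
```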

See docs.raku.org/routine/trans for more information about `trans`.

See docs.raku.org/routine/subst for more information about `subst`.
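The `trans` and `subst` steps can be tried out on a small string (the sample sentence is my own, pieced together from the input text):

```raku
my $raw = "the Sharks, led by Bernardo's gang -- whoever wins.";

my $clean = $raw.trans(/<[."(),]>/ => ' ')   # single characters -> space
                .subst("'s", " ", :global)   # drop the 's
                .subst("--", " ", :global);  # drop the double dash

say $clean.words;  # (the Sharks led by Bernardo gang whoever wins)
```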

See docs.raku.org/routine/Bag for more information about the `Bag` method, and docs.raku.org/type/Bag for more information about the `Bag` type.
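A minimal Bag sketch (the word list is made up):

```raku
# Coerce a list into a Bag, then copy it into an ordinary hash of counts
my %freq = <the cat and the hat and the bat>.Bag;

say %freq<the>;  # 3
say %freq<and>;  # 2
say %freq<cat>;  # 1
```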

Running it:

```
$ ./frequency-sort
1 But City It Jet Juliet Latino New Romeo Side Story Their Then West York adaptation any anything at award-winning away become before begin best classic climactic coexist control dance do doesn't end ending escalates families feuding form former friend gains gangs goes happened hatred heartbreaking highway hoping in know love lovers meet meeting neither no one plan planning point romantic rumble run secret sends sister streets strikes terribly their two under understanding until violence warring what when where white whoever wins with wrong younger
2 Bernardo Jets Riff Sharks The by it led tragedy
3 Maria Tony a can of stop
4 to
9 and the
```

Looking good.

We can replace the two `for` loops with more compact code. Still `for` loops, but postfix versions:

File: frequency-sort-postfix (partial)
```
@freq[%freq{$_}] ~= $_ ~ " " for %freq.keys.sort;

say "$_ { @freq[$_] }" for @freq.keys.grep({ @freq[$_] });
```

It is possible to use `map` instead:

File: frequency-sort-map
```
#! /usr/bin/env raku

unit sub MAIN ($file where $file.IO.e && $file.IO.r = "input.txt");

my %freq = $file.IO.slurp
  .trans(/<[."(),]>/ => ' ')
  .subst("'s", " ", :global)
  .subst("--", " ", :global)
  .words.Bag;

my @freq;

%freq.keys.sort.map({ @freq[%freq{$_}] ~= $_ ~ " " });

@freq.keys.grep({ @freq[$_] }).map({ say "$_ { @freq[$_] }" });
```

I combined the first two `my` lines as well. The third one (now the second one) really stands out, but there is no easy way of getting rid of it.

`map` is basically a `for` loop in disguise, and in my view a little harder to read. But that really is a matter of taste, and exposure.

And that's it.