Raku Binary Clock

[35] Published 6. October 2019

Perl 6 → Raku

This article has been moved from «perl6.eu» and updated to reflect the language rename in 2019.

This is my response to the Perl Weekly Challenge #028.

Challenge #28.1: File Content

Write a script to check the file content without explicitly reading the content. It should accept ile name with path as command line argument and print “The file content is binary.” or else “The file content is ascii.” accordingly.

Filename Extension

If we assume that «without explicitly reading» means «without reading», we have only one way forward. And that is inspecting the file extension, if any, and look that up in a database.

The problem is that there is an awful lot of filename extensions out there, and we'd have to look into them one by one and decide if the extension prescribes a binary or ascii file. The question is how can we decide that? A further complication comes from the fact that several filename extensions (as e.g. «ADT») have several meanings, so that one may be binary and another one ascii.

So this approach is too hard, and will give too many errors.

Implicit Reading

We could argue that «without explicitly reading» could imply «implicit reading», which simply means to use a module or built-in operator that hides the reading behind a higher level construction.

A lawyer would probably love this interpretation, but I'll discard it.

Reading without Understanding

I choose to interpret reading in the human sense of the word, where we are trying to understand the text. That leaves us with (for lack of a more suitable word) reading the file, one byte at a time, looking for binary data, without looking for meaning.

And this must be what the challenge wants us to do.

7 bit Ascii

Is ascii to be understood as the (original) 7-bit version (also called US-ascii), or the extened 8-bit version?

If we assume 7-bit, the program is quite simple:

File: binary-ascii

sub MAIN ($file where $file.IO.e && $file.IO.r)  # [1]
{
  my $fh = open $file, :bin;                     # [2]

  while my $blob = $fh.read(1024)                # [3]
  {
    for @$blob -> $byte                          # [4]
    {
      if $byte > 127                             # [5]
      {
        say "The file content is binary";        # [5a]
	$fh.close;                               # [5b]
	return;                                  # [5c]
      }
    }
  }
  say "The file content is ascii";               # [6]
  $fh.close;                                     # [6a]
}

[1] The program fails if we pass it something that isn't a valid filename, and a file that we are able to read.

[2] Open the file in binary mode, where we read bytes instead of characters.

[3] Read 1024 bytes at a time. The chosen size is quite common, but feel free to use other values.

[4] We get the bytes as a list (hidden in the scalar), so iterate over them.

[5] 7-bit ascii has 127 as the highest value. Beyond that, we have the great unknown - or binary data. If we find a higher value, say so and exit.

[6] If we have read the entire file without finding bytes with a value larger than 127, we have an ascii file.

This program is (loosely) based on my «hex-dump» program (presented as a bonus in my answer to Challenge 24).

8 bit Ascii

It feels natural to me (as a Norwegian) to understand ascii as the extended 8-bit version, as it supports the Norwegian characters æ, ø and å.

The 8-bit version is also newer (or rather, less old) than the 7-bit version, and choosing the newest version of a standard is surely the right thing...

Jonathan Worthington's «Data::TextOrBinary» module can be used for this:

File: binary-utf

use Data::TextOrBinary;          # [1]

sub MAIN ($file)                 # [2]
{
  if $file.IO.d                  # [3]
  {
    say "Directory.";            # [3]
  }
  elsif $file.IO.e               # [4]
  {
    say is-text($file.IO)        # [4]
      ?? "Text file."
      !! "Binary file.";
  }
  else                           # [5]
  {
    say "File doesn't exist.";   # [5]
  }
}

[1] You must install the module, with «zef».

[2] I have chosen to write this program without the file test clauses on the argument to MAIN as done with «binary-ascii». This time we must do the checks manually, but the benefit is the error messages for directories and non-existing files. Which style do you like best?

[3] Is it a directory?

[4] Or a file, in which case thest for binary content.

[5] If none of the above, it doesn't exist.

Running it on itself:

$ raku binary-utf binary-utf
Text file.

PDF files can be tricky, as shown on one of my own files:

$ raku binary-utf Explained.pdf 
Text file.

That is wrong...

The module works by reading the first 4096 bytes from the file, and then looking for characters that doesn’t appear in text files. We can override the number of bytes with the «test-bytes» argument, like this:

my $text = is-text($filename.IO, test-bytes => 8192);

File: binary-utf-fixed (with the changes highlighted)

use Data::TextOrBinary;

sub MAIN ($file, :$test-bytes = 4096)
{
  if $file.IO.d
  {
    say "Directory.";
  }
  elsif $file.IO.e
  {
    say is-text($file.IO, :$test-bytes)
      ?? "Text file."
      !! "Binary file.";
  }
  else
  {
    say "File doesn't exist.";
  }
}

My pdf file is reported as binary if I double the byte count:

$ raku binary-utf-fixed --test-bytes=8192 Explained.pdf 
Binary file.

Conclusision: The result is an educated guess. Increasing the number of bytes makes the guessing better, but it is still a guess. (The 7-bit ascii version gives a 100% correct answer, but will read the entire file if it is ascii. The program assumes ascii, and looks for falsification. One wrong byte means binary.)

Unicode vs Ascii

The chosen file names («binary-utf» and «binary-utf-fixed») should have given a clue that the program may have a problem. In Raku everything is unicode, and so also with this module. The challenge explicitly asked for ascii, so we'll look into the difference.

The relevant line in the module code defines the printable characters (i.e. bytes that don't indicate a binary file) like this:

@table[flat ords("\t\b\o33\o14"), 32..126, 128..255] = 1 xx *;

Some knowledge of the Ascii Table (and Unicode) enables me to conclude that the difference isn't a problem. The program will recognise both ascii and unicode files as text.

But the challenge asked for ascii only, so we may have to consider unicode as binary...

Running the «hex-dump» program (presented earlier) on a unicode (or rather utf-8) text file with Norwegian characters (æ, ø and å) - and written in Norwegian and thus unreadable for most of you - show us that they are encoded like this:

Character	Encoding
æ	C3 A6
ø	C3 B8
å	C3 A5

Hexadecimal values are great for printing (as in the hex dump program), as each byte takes exactly two characters. But they are not listed in the extended (8-bit) Ascii table, so we must translate them to decimal ourselves. Asking Raku is a good idea; e.g. in REPL:

$ raku
> "FF".parse-base(16);  # -> 255  ## Just testing
> "C3".parse-base(16);  # -> 195
> "A6".parse-base(16);  # -> 166
> "B8".parse-base(16);  # -> 184
> "A5".parse-base(16);  # -> 165

All of these decimal values belong to legal ascii characters (in the extended 8-bit part). They are graphical border symbols, which hopefully are not very much used anymore. But we can use the fact that C3 (195) occurs for each letter to count the occurence of the values in the 128-255 range. If the count for 195 is higher than the rest (or equal to the sum), we have utf-8.

I have just made an assumption about utf-8 based on three characters (æ, ø and å). It is a good idea to check the specification, before writing a program based on what may be a false assumption.

The utf8 table shows that utf-8 uses both C2 (194) and C3 (195) as multi byte starter. The next byte is in the range 80 (127) to BF (191), and these extended values are only allowed after C2 or C3. Now we can code:

File: bytecount

multi sub MAIN ($file where $file.IO.e && $file.IO.r)
{
  my $fh = open $file, :bin;
  count-blob($fh.read);
  $fh.close;
}

sub count-blob ($blob)
{
  my @count;

  @count[$_]++ for @$blob;  # [1]

  for ^@count -> $index     # [2]
  {
    say "$index: { @count[$index] }" if @count[$index];
  }
}

[1] Counting the number of each byte. The byte (ascii value) is the index.

[2] Iterate over the ascii values, and print the number of occurences. We skip unused values

Running it on the same Norwegian utf-8 encoded file gives this result (abridged):

$ raku bytecount /srv/www/bbop.org/htdocs/stories/min.html
10: 124
32: 1159
33: 16
34: 158
...
121: 33
133: 1
165: 63
166: 4
184: 28
195: 96

If we add up the values between 127 and 191, we get 96 (1 + 63 + 4 + 28). Which just happens to be the count for 195, and we have way of detecting utf-8 files.

I'll leave it as en excerise to the reader to write a program doing the math to decide if a file us in utf-8 or not.

We could plug this not-yet-written code into our previous program, and report utf-8 files as binary. And that is perhaps ok. But what about utf-16, and utf-32? And other encodings (as EBCDIC)?

I'll leave it at that.

Note that a text file which only contains characters from the 7 bit ascii table, is identical in utf-8 and ascii.

Challenge #28.2: Digital Clock

Write a script to display Digital Clock. Feel free to be as creative as you can when displaying digits. We expect bare minimum something like “14:10:11”.

Here it is, the bare minimum:

File: digiclock-simple

say DateTime.now.hh-mm-ss;

Running it gives the time in the current timezone, assuming the timezone has been set up correctly:

$ raku digiclock-simple
22:23:46

See docs.raku.org/type/DateTime for more information about «DateTime».

A version of the program with an infinite loop, printing the current time ad infinitum:

File: digiclock-loop

loop
{
  say DateTime.now.hh-mm-ss;
  sleep 1;
}

Running it, and pressing Control-C to abort it:

$ raku digiclock-loop
22:30:43
22:30:44
^C

The time between the iterations is exactly 1 second. The «say» line takes some time to execute, so the time between each execution of the loop will be slightly longer. If you run the program long enough you will see a second beeing lost.

Trying to subtract the time the «say» line takes from the full second (to fix the sleep value) is not a good idea, as the execution time will vary from computer to computer.

The solution is a Promise:

File: digiclock-promise

show-time;

sleep;

sub show-time
{
  Promise.in(1).then:
  {
    show-time;
    say DateTime.now.hh-mm-ss;
  }
}

The program sets off the show by calling «show-time» and then goes to sleep indefinitely. «show-time» kicks off a Promise that is executed 1 second later. The promise calls «show-time» (to set up a new 1 second later Promise), and print the time.

The «loop»-version is much easier to undestand, and may be good enough in practice.

The «Promise»-version has some overhead as well, so will also skip a second after a while (a very long while). But calling a procedure is faster than writing to the screen, so the problem is smaller.

Note that the «Promise»-version has a serious design flaw, as it uses recursive calls that never returns. The memory usage of this program will be quite a lot if we run it for a long time.

We can fix that, with «await»:

File: digiclock-await

loop
{
  await Promise.in(1);
  say DateTime.now.hh-mm-ss;
}

But we are back where we started, as await Promise.in(1); basically is the same as sleep(1);.

The challenge asked for creativity «when displaying digits», and so far my creativity has been spent on the code. What about a clock that updates itself:

File: digiclock-overwrite

loop
{
  print DateTime.now.hh-mm-ss;
  sleep 1;
  print "\b \b" x 8;       # [1]
}

[1] This will erase the last 8 characters, so that we can print the new time on the same line. I have used «print» instead of «say» to avoid newlines.

Note that we erase the whole time string, even if most of it is unchanged. We should probably rewrite it to only erase changed parts of the string. (And feel free to do so.)

The cursor is annoying, and we can turn it off with the NCurses module (which you'll probably have to install; «zef install NCurses»):

File: digiclock-ncurses

use NCurses;

my $stdscr = initscr() or die "Could not initialize curses";

curs_set(0);  # [1]

loop
{
  print DateTime.now.hh-mm-ss;
  sleep 1;
  print "\b \b" x 8;
}

LEAVE
{
  delwin($stdscr) if $stdscr;
}

[1] Hide the cursor.

The program doesn't just hide the cursor, it clears the teminal as well. But the old content is returned when you exit the program (courtesy of the «LEAVE» phaser that cleans up).

See docs.raku.org/syntax/LEAVE for more information about the LEAVE phaser.

See docs.raku.org/language/phasers for more information about the phasers.

Ascii Art

What about a version using Ascii Art?

The problem is that (as far as I can tell) there are no Raku modules for this, and using an online service is a waste of Internet resources. So I'll write it myself.

This program uses the «Readline» module. Install it (with «zef install Readline») if you don't have it installed already.

File: digiclock-art

use Readline;

my %a;

%a<0> = q:to/END/;
   ,a8888a,     
 ,8P"'  `"Y8,   
,8P        Y8,  
88          88  
88          88  
`8b        d8'  
 `8ba,  ,ad8'   
   "Y8888P"     
END

%a<1> =q:to/END/;
    88  
  ,d88  
888888  
    88  
    88  
    88  
    88  
    88  
END

%a<2> =q:to/END/;
 ad888888b,  
d8"     "88  
        a8P  
     ,d8P"   
   a8P"      
 a8P'        
d8"          
88888888888  
END

%a<3> =q:to/END/;
 ad888888b,  
d8"     "88  
        a8P  
     aad8"   
     ""Y8,   
        "8b  
Y8,     a88  
 "Y888888P'  
END

%a<4> =q:to/END/;
        ,d8    
      ,d888    
    ,d8" 88    
  ,d8"   88    
,d8"     88    
8888888888888  
         88    
         88    
END

%a<5> =q:to/END/;
8888888888   
88           
88  ____     
88a8PPPP8b,  
PP"     `8b  
         d8  
Y8a     a8P  
 "Y88888P"   
END

%a<6> =q:to/END/;
  ad8888ba,  
 8P'    "Y8  
d8           
88,dd888bb,  
88P'    `8b  
88       d8  
88a     a8P  
 "Y88888P"   
END

%a<7> =q:to/END/;
888888888888  
        ,8P'  
       d8"    
     ,8P'     
    d8"       
  ,8P'        
 d8"          
8P'           
END

%a<8> =q:to/END/;
 ad88888ba   
d8"     "8b  
Y8a     a8P  
 "Y8aaa8P"   
 ,d8"""8b,   
d8"     "8b  
Y8a     a8P  
 "Y88888P"   
END

%a<9> =q:to/END/;
 ad88888ba   
d8"     "88  
8P       88  
Y8,    ,d88  
 "PPPPPP"88  
         8P  
8b,    a8P   
`"Y8888P'    
END

%a<:> =q:to/END/;
     
888  
888  
     
     
888  
888  
     
END

my %b;

for 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ":" -> $id
{
  for %a{$id}.split("\n") -> $line
  {
    %b{$id}.push($line);
  }
}

loop
{
  shell 'clear';
			  
  my @chars = DateTime.now.hh-mm-ss.comb;

  for 0 .. 7 -> $line
  {
    for @chars -> $char
    {
      print @(%b{$char})[$line];
    }
    say "";
  }

  sleep 1;
}

I used this Interactive ASCII Text Generator to generate the digits. It didn't have the colon, so I made that myself.

Note the spaces after the characters, so that all the lines for a singe character has the same length.

The «%a» hash contains a string (with embedded newlines) for each character. The «%b» hash contains an array of 8 lines for each character, without newlines.

Running it:

$ raku digiclock-art
   ,a8888a,         88           88  888888888888               ,d8        88  
 ,8P"'  `"Y8,     ,d88  888    ,d88          ,8P'  888        ,d888      ,d88  
,8P        Y8,  888888  888  888888         d8"    888      ,d8" 88    888888  
88          88      88           88       ,8P'            ,d8"   88        88  
88          88      88           88      d8"            ,d8"     88        88  
`8b        d8'      88  888      88    ,8P'        888  8888888888888      88  
 `8ba,  ,ad8'       88  888      88   d8"          888           88        88  
   "Y8888P"         88           88  8P'                         88        88

The output is generated by the «for 0 .. 7» loop. It prints each line, with the corresponding line for the character we want.

This program uses the «Readline» module's «shell 'clear'» to clear the screen before we print the time, as the backspace trick wouldn't work out here. (There are a lot of characters, and we don't know the exact number; «1» has a width of 4, but «4» has a width of 13.) The cursor is back, but is less annoying this time as the text is much larger.

And that's it.

Links

About me
Keyword Index
Raku Courses
Beginning Raku (book)
RSS Feed
Reddit Comments
Code (zip)