This article has been moved from «perl6.eu» and updated to reflect the language rename in 2019.
This is my response to the Perl Weekly Challenge #028.
The problem is that there is an awful lot of filename extensions out there, and we'd have to look into them one by one and decide if the extension prescribes a binary or ascii file. The question is how can we decide that? A further complication comes from the fact that several filename extensions (as e.g. «ADT») have several meanings, so that one may be binary and another one ascii.
So this approach is too hard, and will give too many errors.
A lawyer would probably love this interpretation, but I'll discard it.
And this must be what the challenge wants us to do.
If we assume 7-bit, the program is quite simple:
File: binary-ascii
sub MAIN ($file where $file.IO.e && $file.IO.r) # [1]
{
my $fh = open $file, :bin; # [2]
while my $blob = $fh.read(1024) # [3]
{
for @$blob -> $byte # [4]
{
if $byte > 127 # [5]
{
say "The file content is binary"; # [5a]
$fh.close; # [5b]
return; # [5c]
}
}
}
say "The file content is ascii"; # [6]
$fh.close; # [6a]
}
[1] The program fails if we pass it something that isn't a valid filename, and a file that we are able to read.
[2] Open the file in binary mode, where we read bytes instead of characters.
[3] Read 1024 bytes at a time. The chosen size is quite common, but feel free to use other values.
[4] We get the bytes as a list (hidden in the scalar), so iterate over them.
[5] 7-bit ascii has 127 as the highest value. Beyond that, we have the great unknown - or binary data. If we find a higher value, say so and exit.
[6] If we have read the entire file without finding bytes with a value larger than 127, we have an ascii file.
This program is (loosely) based on my «hex-dump» program (presented as a bonus in my answer to Challenge 24).
The 8-bit version is also newer (or rather, less old) than the 7-bit version, and choosing the newest version of a standard is surely the right thing...
Jonathan Worthington's «Data::TextOrBinary» module can be used for this:
File: binary-utf
use Data::TextOrBinary; # [1]
sub MAIN ($file) # [2]
{
if $file.IO.d # [3]
{
say "Directory."; # [3]
}
elsif $file.IO.e # [4]
{
say is-text($file.IO) # [4]
?? "Text file."
!! "Binary file.";
}
else # [5]
{
say "File doesn't exist."; # [5]
}
}
[1] You must install the module, with «zef».
[2] I have chosen to write this program without the file test clauses on the argument to MAIN as done with «binary-ascii». This time we must do the checks manually, but the benefit is the error messages for directories and non-existing files. Which style do you like best?
[3] Is it a directory?
[4] Or a file, in which case thest for binary content.
[5] If none of the above, it doesn't exist.
Running it on itself:
$ raku binary-utf binary-utf
Text file.
PDF files can be tricky, as shown on one of my own files:
$ raku binary-utf Explained.pdf
Text file.
That is wrong...
The module works by reading the first 4096 bytes from the file, and then looking for characters that doesn’t appear in text files. We can override the number of bytes with the «test-bytes» argument, like this:
my $text = is-text($filename.IO, test-bytes => 8192);
File: binary-utf-fixed (with the changes highlighted)
use Data::TextOrBinary;
sub MAIN ($file, :$test-bytes = 4096)
{
if $file.IO.d
{
say "Directory.";
}
elsif $file.IO.e
{
say is-text($file.IO, :$test-bytes)
?? "Text file."
!! "Binary file.";
}
else
{
say "File doesn't exist.";
}
}
My pdf file is reported as binary if I double the byte count:
$ raku binary-utf-fixed --test-bytes=8192 Explained.pdf
Binary file.
Conclusision: The result is an educated guess. Increasing the number of bytes makes the guessing better, but it is still a guess. (The 7-bit ascii version gives a 100% correct answer, but will read the entire file if it is ascii. The program assumes ascii, and looks for falsification. One wrong byte means binary.)
The relevant line in the module code defines the printable characters (i.e. bytes that don't indicate a binary file) like this:
@table[flat ords("\t\b\o33\o14"), 32..126, 128..255] = 1 xx *;
Some knowledge of the Ascii Table (and Unicode) enables me to conclude that the difference isn't a problem. The program will recognise both ascii and unicode files as text.
But the challenge asked for ascii only, so we may have to consider unicode as binary...
Running the «hex-dump» program (presented earlier) on a unicode (or rather utf-8) text file with Norwegian characters (æ, ø and å) - and written in Norwegian and thus unreadable for most of you - show us that they are encoded like this:
Character | Encoding |
æ | C3 A6 |
ø | C3 B8 |
å | C3 A5 |
Hexadecimal values are great for printing (as in the hex dump program), as each byte takes exactly two characters. But they are not listed in the extended (8-bit) Ascii table, so we must translate them to decimal ourselves. Asking Raku is a good idea; e.g. in REPL:
$ raku
> "FF".parse-base(16); # -> 255 ## Just testing
> "C3".parse-base(16); # -> 195
> "A6".parse-base(16); # -> 166
> "B8".parse-base(16); # -> 184
> "A5".parse-base(16); # -> 165
All of these decimal values belong to legal ascii characters (in the extended 8-bit part). They are graphical border symbols, which hopefully are not very much used anymore. But we can use the fact that C3 (195) occurs for each letter to count the occurence of the values in the 128-255 range. If the count for 195 is higher than the rest (or equal to the sum), we have utf-8.
I have just made an assumption about utf-8 based on three characters (æ, ø and å). It is a good idea to check the specification, before writing a program based on what may be a false assumption.
The utf8 table shows that utf-8 uses both C2 (194) and C3 (195) as multi byte starter. The next byte is in the range 80 (127) to BF (191), and these extended values are only allowed after C2 or C3. Now we can code:
File: bytecount
multi sub MAIN ($file where $file.IO.e && $file.IO.r)
{
my $fh = open $file, :bin;
count-blob($fh.read);
$fh.close;
}
sub count-blob ($blob)
{
my @count;
@count[$_]++ for @$blob; # [1]
for ^@count -> $index # [2]
{
say "$index: { @count[$index] }" if @count[$index];
}
}
[1] Counting the number of each byte. The byte (ascii value) is the index.
[2] Iterate over the ascii values, and print the number of occurences. We skip unused values
Running it on the same Norwegian utf-8 encoded file gives this result (abridged):
$ raku bytecount /srv/www/bbop.org/htdocs/stories/min.html
10: 124
32: 1159
33: 16
34: 158
...
121: 33
133: 1
165: 63
166: 4
184: 28
195: 96
If we add up the values between 127 and 191, we get 96 (1 + 63 + 4 + 28). Which just happens to be the count for 195, and we have way of detecting utf-8 files.
I'll leave it as en excerise to the reader to write a program doing the math to decide if a file us in utf-8 or not.
We could plug this not-yet-written code into our previous program, and report utf-8 files as binary. And that is perhaps ok. But what about utf-16, and utf-32? And other encodings (as EBCDIC)?
I'll leave it at that.
Note that a text file which only contains characters from the 7 bit ascii table, is identical in utf-8 and ascii.
say DateTime.now.hh-mm-ss;
Running it gives the time in the current timezone, assuming the timezone has been set up correctly:
$ raku digiclock-simple
22:23:46
See docs.raku.org/type/DateTime for more information about «DateTime».
A version of the program with an infinite loop, printing the current time ad infinitum:
File: digiclock-loop
loop
{
say DateTime.now.hh-mm-ss;
sleep 1;
}
Running it, and pressing Control-C to abort it:
$ raku digiclock-loop
22:30:43
22:30:44
^C
The time between the iterations is exactly 1 second. The «say» line takes some time to execute, so the time between each execution of the loop will be slightly longer. If you run the program long enough you will see a second beeing lost.
Trying to subtract the time the «say» line takes from the full second (to fix the sleep value) is not a good idea, as the execution time will vary from computer to computer.
File: digiclock-promise
show-time;
sleep;
sub show-time
{
Promise.in(1).then:
{
show-time;
say DateTime.now.hh-mm-ss;
}
}
The program sets off the show by calling «show-time» and then goes to sleep indefinitely. «show-time» kicks off a Promise that is executed 1 second later. The promise calls «show-time» (to set up a new 1 second later Promise), and print the time.
The «loop»-version is much easier to undestand, and may be good enough in practice.
The «Promise»-version has some overhead as well, so will also skip a second after a while (a very long while). But calling a procedure is faster than writing to the screen, so the problem is smaller.
Note that the «Promise»-version has a serious design flaw, as it uses recursive calls that never returns. The memory usage of this program will be quite a lot if we run it for a long time.
We can fix that, with «await»:
File: digiclock-await
loop
{
await Promise.in(1);
say DateTime.now.hh-mm-ss;
}
But we are back where we started, as await Promise.in(1);
basically is
the same as sleep(1);
.
The challenge asked for creativity «when displaying digits», and so far my creativity has been spent on the code. What about a clock that updates itself:
File: digiclock-overwrite
loop
{
print DateTime.now.hh-mm-ss;
sleep 1;
print "\b \b" x 8; # [1]
}
[1] This will erase the last 8 characters, so that we can print the new time on the same line. I have used «print» instead of «say» to avoid newlines.
Note that we erase the whole time string, even if most of it is unchanged. We should probably rewrite it to only erase changed parts of the string. (And feel free to do so.)
The cursor is annoying, and we can turn it off with the NCurses module (which you'll probably have to install; «zef install NCurses»):
File: digiclock-ncurses
use NCurses;
my $stdscr = initscr() or die "Could not initialize curses";
curs_set(0); # [1]
loop
{
print DateTime.now.hh-mm-ss;
sleep 1;
print "\b \b" x 8;
}
LEAVE
{
delwin($stdscr) if $stdscr;
}
[1] Hide the cursor.
The program doesn't just hide the cursor, it clears the teminal as well. But the old content is returned when you exit the program (courtesy of the «LEAVE» phaser that cleans up).
See
docs.raku.org/syntax/LEAVE
for more information about the LEAVE
phaser.
See docs.raku.org/language/phasers for more information about the phasers.
The problem is that (as far as I can tell) there are no Raku modules for this, and using an online service is a waste of Internet resources. So I'll write it myself.
This program uses the «Readline» module. Install it (with «zef install Readline») if you don't have it installed already.
File: digiclock-art
use Readline;
my %a;
%a<0> = q:to/END/;
,a8888a,
,8P"' `"Y8,
,8P Y8,
88 88
88 88
`8b d8'
`8ba, ,ad8'
"Y8888P"
END
%a<1> =q:to/END/;
88
,d88
888888
88
88
88
88
88
END
%a<2> =q:to/END/;
ad888888b,
d8" "88
a8P
,d8P"
a8P"
a8P'
d8"
88888888888
END
%a<3> =q:to/END/;
ad888888b,
d8" "88
a8P
aad8"
""Y8,
"8b
Y8, a88
"Y888888P'
END
%a<4> =q:to/END/;
,d8
,d888
,d8" 88
,d8" 88
,d8" 88
8888888888888
88
88
END
%a<5> =q:to/END/;
8888888888
88
88 ____
88a8PPPP8b,
PP" `8b
d8
Y8a a8P
"Y88888P"
END
%a<6> =q:to/END/;
ad8888ba,
8P' "Y8
d8
88,dd888bb,
88P' `8b
88 d8
88a a8P
"Y88888P"
END
%a<7> =q:to/END/;
888888888888
,8P'
d8"
,8P'
d8"
,8P'
d8"
8P'
END
%a<8> =q:to/END/;
ad88888ba
d8" "8b
Y8a a8P
"Y8aaa8P"
,d8"""8b,
d8" "8b
Y8a a8P
"Y88888P"
END
%a<9> =q:to/END/;
ad88888ba
d8" "88
8P 88
Y8, ,d88
"PPPPPP"88
8P
8b, a8P
`"Y8888P'
END
%a<:> =q:to/END/;
888
888
888
888
END
my %b;
for 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ":" -> $id
{
for %a{$id}.split("\n") -> $line
{
%b{$id}.push($line);
}
}
loop
{
shell 'clear';
my @chars = DateTime.now.hh-mm-ss.comb;
for 0 .. 7 -> $line
{
for @chars -> $char
{
print @(%b{$char})[$line];
}
say "";
}
sleep 1;
}
I used this Interactive ASCII Text Generator to generate the digits. It didn't have the colon, so I made that myself.
Note the spaces after the characters, so that all the lines for a singe character has the same length.
The «%a
» hash contains a string (with embedded newlines) for each
character. The «%b
» hash contains an array of 8 lines for each
character, without newlines.
Running it:
$ raku digiclock-art
,a8888a, 88 88 888888888888 ,d8 88
,8P"' `"Y8, ,d88 888 ,d88 ,8P' 888 ,d888 ,d88
,8P Y8, 888888 888 888888 d8" 888 ,d8" 88 888888
88 88 88 88 ,8P' ,d8" 88 88
88 88 88 88 d8" ,d8" 88 88
`8b d8' 88 888 88 ,8P' 888 8888888888888 88
`8ba, ,ad8' 88 888 88 d8" 888 88 88
"Y8888P" 88 88 8P' 88 88
The output is generated by the «for 0 .. 7
» loop. It prints each line,
with the corresponding line for the character we want.
This program uses the «Readline» module's «shell 'clear'
» to clear
the screen before we print the time, as the backspace trick wouldn't work out
here. (There are a lot of characters, and we don't know the exact number; «1»
has a width of 4, but «4» has a width of 13.) The cursor is back, but is less
annoying this time as the text is much larger.
And that's it.