BSD Unix hack: adding conditional preprocessing capabilities to calendar

One of the most useful features of the GNU Compiler Collection is the -D option to cpp, which allows you to define macros at the command line. This in combination with #ifdefs and #ifndefs in the C/C++ source files allows for very versatile conditional compilation, because it allows you to set certain parameters that the program uses without having to edit the original source code. You can use this to, say, compile a debugging version of a program, among other things.

It is widely known that the C/C++ languages use CPP for preprocessing. What is less well-known is the fact that calendar, the default BSD reminder program, also uses CPP. This is one advantage that calendar has over the newer and more feature-rich remind program, which, as far as I know, doesn’t use preprocessing. calendar is available for all major BSD variants, including macOS. It has also been ported to Linux; Debian, for example, ships it as the calendar package.

calendar uses CPP to allow for the conditional inclusion of several libraries of pre-written reminders or events – from the standard run-of-the-mill dates like US holidays and birthdays of famous people, to more exotic things like important events in the history of computing, and important dates in the Lord of the Rings timeline. This is done in the obvious way:


# include <calendar.usholiday>
# include <calendar.birthday>
# include <calendar.computer>
# include <calendar.lotr>

You can also do #defines, with or without parameters:


#define PHYSICAL( TIME ) Appointment with -NAME REMOVED- for yearly physical at TIME
#define PSYCH_APPT( TIME ) Appointment with -NAME REMOVED- at TIME
#define THERAPY( TIME ) Therapy appointment with -NAME REMOVED- at TIME

...

Jul 20  PSYCH_APPT( 3:00 PM )
Nov 15  PHYSICAL( 11:00 AM )
Nov 29  THERAPY( 2:00 PM )
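To see what cpp actually does with a parameterized macro, you can feed it a couple of lines by hand. This is only a sketch: the expand_demo function and the sample appointment text are made up for illustration, and it assumes a cpp binary is on your PATH.

```shell
#!/usr/bin/env bash
# Sketch: expand a function-like macro the same way the calendar file does.
# expand_demo and the sample text are hypothetical, not part of calendar.

expand_demo() {
        # -P drops the "# 1 ..." line markers; grep removes the blank lines
        # that cpp leaves where the #define used to be.
        cpp -P 2>/dev/null <<'EOF' | grep -v '^[[:space:]]*$'
#define THERAPY( TIME ) Therapy appointment at TIME
Nov 29 THERAPY( 2:00 PM )
EOF
}

expand_demo
```

The macro name should disappear from the output and the time should be substituted into the sentence. Note that cpp is free to rearrange whitespace between tokens, which matters for calendar because it expects a tab after the date.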

Unfortunately, that’s about all you can do with CPP in the default calendar program. I decided I wanted to be able to include certain libraries conditionally, so that if I want to just view reminders for things I have to do in my own life, I have that option, and if I want to also check on upcoming holidays, or events in Tolkien’s universe, I can manipulate those options with a simple command line flag. The CPP code in my calendar file would then look something like this:


#ifdef _HOL_
# include <calendar.usholiday>
#endif

#ifdef _BDAY_
# include <calendar.birthday>
#endif

#ifdef _COMP_
# include <calendar.computer>
#endif

#ifdef _LOTR_
# include <calendar.lotr>
#endif

… And I would manipulate these options from the command line using a parameter like -D_COMP_.
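The underlying mechanism is easy to try in isolation. The following is a minimal sketch, assuming cpp is installed; toggle_demo and its two sample entries are hypothetical and stand in for a real calendar file.

```shell
#!/usr/bin/env bash
# Sketch: toggle an #ifdef block from the command line with -D.
# toggle_demo and the two sample entries are hypothetical.

toggle_demo() {
        # Any arguments (such as -D_HOL_) are passed straight through to cpp.
        cpp -P "$@" 2>/dev/null <<'EOF' | grep -v '^[[:space:]]*$'
#ifdef _HOL_
Jan 01 Holiday entries would be included here
#endif
Jul 20 Personal entry that is always kept
EOF
}

echo "Without -D_HOL_:"
toggle_demo
echo "With -D_HOL_:"
toggle_demo -D_HOL_
```

The holiday line should only show up in the second run, while the personal entry appears in both.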

So I got to work writing a frontend for calendar that adds that capability. Here is the result, written in bash and sed:


#!/usr/bin/env bash
# This script is a frontend for the calendar program that adds the
# full power of cpp to calendar.  Namely, it can do conditional
# preprocessing and #includes based on arguments given by the -D
# option to CPP.

declare -i DAYS=10

# Parse command line options:
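while [[ $# -gt 0 ]]
do
        case "$1" in
                -W ) shift; let DAYS=$1; shift;;
                -B ) shift; let DAYS=-$1; shift;;
                -D ) shift; break;; # Define CPP macros; the rest of the arguments are macro names
                -* ) shift; shift;; # Skip any other option together with its argument
                * ) shift;; # Skip stray non-option arguments
        esac
done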

# Debugging info:
#echo "DAYS=$DAYS"
#echo "\$1=$1"
#echo "\$2=$2"

# Preprocess calendar file and run it through a sed script that performs necessary edits
cpp -D"${1:-NULL1}" -D"${2:-NULL2}" -D"${3:-NULL3}" -D"${4:-NULL4}" -I /usr/share/calendar ~/.calendar/calendar 2>/dev/null | sed -f ~/Scripts/calendar.sed >| ~/calendar.tmp


# Print output of calendar
if [[ $DAYS -gt 0 ]]
then
        # Print forward
        command calendar -f ~/calendar.tmp -W $DAYS
elif [[ $DAYS -lt 0 ]]
then
        # Print backward
        let DAYS=-$DAYS
        command calendar -f ~/calendar.tmp -B $DAYS
fi

# Cleanup
rm ~/calendar.tmp
unset arg DAYS

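The accompanying sed script deletes the # line markers that cpp emits and puts a tab back between each date and its description, which calendar requires (the whitespace after each backslash in the substitutions is a literal tab):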


#!/usr/bin/env sed

/^#/d
/^[0-9][0-9]*\/[0-9][0-9]* /s/ /\       /
/^[A-Z][a-z]* [0-9][0-9]* /{
        s/ /\   /
        s/ /\   /
        s/\     / /
}
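If you want to check the effect of that cleanup step without touching your real calendar file, something like the following works. It is a sketch rather than a copy of my script: fix_tabs is a hypothetical name, and it condenses the logic into one substitution per date format, with the tab supplied through a shell variable so it survives copy and paste.

```shell
#!/usr/bin/env bash
# Sketch: put a tab back between the date and the description.
# fix_tabs is a hypothetical name; it condenses the sed script into one
# substitution per date format instead of reproducing it exactly.

tab=$'\t'

fix_tabs() {
        sed -e '/^#/d' \
            -e "s/^\([0-9][0-9]*\/[0-9][0-9]*\) /\1${tab}/" \
            -e "s/^\([A-Z][a-z]* [0-9][0-9]*\) /\1${tab}/"
}

# "sed -n l" prints each line unambiguously, so the tab is visible as \t
printf '# 1 "calendar"\nJul 20 Appointment at 3:00 PM\n07/20 Appointment at 3:00 PM\n' \
        | fix_tabs | sed -n l
```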

After this I set an alias in my .bashrc file to have the calendar command run this script, rather than running the calendar program directly.
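For reference, the alias and a few invocations look roughly like this. The script path is an assumption (use wherever you saved the frontend), and as the script is written, macro names go after a standalone -D, separated by spaces, with at most four of them.

```shell
# In ~/.bashrc. The script path below is hypothetical.
alias calendar='~/Scripts/calendar.sh'

# Example invocations:
#   calendar                        # personal reminders only, next 10 days
#   calendar -W 7 -D _HOL_ _LOTR_   # next 7 days, plus holidays and Tolkien dates
#   calendar -B 3 -D _COMP_         # previous 3 days, plus computing history
```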

There are some problems with this script, the main one being that it is extremely slow, sometimes taking as long as 10-15 seconds to do the preprocessing. If I rewrote this program in C, I could speed it up by a few orders of magnitude, not only because C is inherently faster, but also because it exposes more of the underlying details of how everything is implemented, which allows you to program more intelligently and optimize your program for the hardware.

For example, I don’t know for sure whether comparing two integers is faster than comparing two strings in the bash shell (a problem I ran into here when trying to decide whether to just use a “true”/“false” string to determine whether to use -B or -W; the bash shell doesn’t have Boolean types), because I don’t understand the underlying implementation. I would have to spend days studying the source code for the shell to get a sense of how to optimize everything. All I know is that all shell variables are essentially strings, so it’s not the same as C, where comparing two integers is much faster than going through two character arrays and comparing each pair of characters one by one. In the bash shell, you have to convert the numerical strings to numbers, perform an arithmetic or comparison operation on them, and then convert them back to strings. Both methods are extremely inefficient. This is what I don’t like about extremely high-level scripting languages such as Unix shell, Python, Ruby, PHP, etc. But hey, they allow you to write programs a hell of a lot faster than C, which is better if you just want a quick-and-dirty solution to a programming problem.
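Short of reading the shell source, one way to get a rough answer is simply to time both kinds of comparison. The sketch below does that; the function names and the iteration count N are arbitrary choices of mine, and the numbers it prints will vary by machine and bash version, so treat them as a hint rather than a benchmark.

```shell
#!/usr/bin/env bash
# Sketch: time N integer comparisons against N string comparisons in bash.
# int_compare, str_compare and N are hypothetical names chosen for this test.

N=${N:-50000}   # iteration count; an arbitrary choice

int_compare() {
        local -i i days=10 hits=0
        for (( i = 0; i < N; i++ )); do
                (( days > 0 )) && (( hits += 1 ))      # arithmetic comparison
        done
        echo "$hits"
}

str_compare() {
        local -i i hits=0
        local forward="true"
        for (( i = 0; i < N; i++ )); do
                [[ $forward == "true" ]] && (( hits += 1 ))   # string comparison
        done
        echo "$hits"
}

time int_compare
time str_compare
```

Either way, both loops are dwarfed by the cost of starting cpp, sed, and calendar, so this choice is unlikely to be what makes the frontend slow.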


Unix error detection – cksum, md5sum, and shasum commands

This post will be a tutorial on Unix error checking. These commands are available in all Unix systems (that I have tested), though they have slightly different forms in each one.

Error detection codes are used to detect errors caused by either disturbances in a noisy communication channel, or deliberate modification by malicious users. There are many different methods of error checking. One form is called a checksum or hash. This is when some operation with even probability distribution is applied to the bytes, blocks, whatever, of a file at both ends of the channel to produce a much shorter value, called the sum. The sums at both ends are compared (usually by humans rather than by computers) to see if they match. If they don’t match, something has gone wrong in the transmission, and the values need to be retransmitted.

The first error detection command we will look at is the cksum command. cksum produces a checksum of a file by computing a cyclic redundancy check (CRC), which is the remainder of a polynomial division of its contents. It then prints the checksum and the size of the file side-by-side.

bash-3.2$ cksum hackers.png
1967572667 13865 hackers.png

Here the first value is the checksum and the second value is the size of the file, followed of course by the name of the file.

We now compare these values to the values supplied by the site we downloaded the file from. If they match, we know the file probably downloaded correctly (it’s not an absolute guarantee, but the probability of an error is now known to be negligible).
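Since comparing long numbers by eye is error-prone, the comparison can be scripted. Here is a minimal sketch; verify_cksum is a hypothetical helper of my own, not a standard command, and the expected values are whatever the download site publishes.

```shell
#!/usr/bin/env bash
# Sketch: compare cksum output with published values.
# verify_cksum is a hypothetical helper, not a standard command.

verify_cksum() {
        local file=$1 want_sum=$2 want_size=$3
        local sum size
        # cksum prints "<checksum> <size> <name>"; reading from stdin omits the name
        read -r sum size _ < <(cksum < "$file")
        if [[ $sum == "$want_sum" && $size == "$want_size" ]]; then
                echo "$file: OK"
        else
                echo "$file: MISMATCH (got $sum $size)" >&2
                return 1
        fi
}

# Example with the values from the sample output above
# (only meaningful if you have that same hackers.png):
# verify_cksum hackers.png 1967572667 13865
```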

cksum is used because it is fast and simple to implement. It has a couple of problems, though. First, implementations may vary. When using cksum you have to make sure you and the supplier of the file are using the same CRC code.

The second problem is that while CRCs protect a file from accidental modification by noise, they do not protect it from deliberate modification by malicious users. This is because multiple files can have the same checksum, and a middleman can easily inject a fake file into the network that has the same checksum as the one we are trying to download, despite being different. To guard against this sort of attack, we need what is called a one-way hash, which is a hash that is computationally very difficult to reverse, so that an attacker cannot feasibly construct a different file with the same hash.

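There are a couple of these algorithms that are in common use on Unix systems. One is the MD5 algorithm, and the other is the SHA series of algorithms (in Unix the default is SHA-1). Be aware that practical collision attacks are now known for both MD5 and SHA-1, so against a deliberate attacker the SHA-2 family (for example sha256sum, or shasum -a 256) is the safer choice. Each of these message digest algorithms, as they are called, has at least two forms: the form used in BSD systems (including Mac OS X) and the form used in Linux. Here is a table showing each of them: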

Algorithm   Linux form   BSD form
MD5         md5sum       md5
SHA-1       sha1sum      shasum
Let’s look at a sample command in Mac OS X:

bash-3.2$ md5 hackers.png
MD5 (hackers.png) = d28903b7e06cd169f5e4ff59be348fb6

Unlike the cksum output, which is in decimal, the MD5/SHA-1 output is in hexadecimal. It is also much longer. Again, we compare this value to the one given by the providers of the file to make sure it is correct.
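Because the command names and output formats differ between systems, a small wrapper makes the comparison less tedious. The following is a sketch under the assumption that at least one form from the table is installed; md5_of, sha1_of and check_digest are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: print only the hex digest, using whichever command form exists.
# md5_of, sha1_of and check_digest are hypothetical helper names.

md5_of() {
        if command -v md5sum >/dev/null 2>&1; then
                md5sum < "$1" | awk '{print $1}'     # Linux form
        else
                md5 -q "$1"                          # BSD / Mac OS X form
        fi
}

sha1_of() {
        if command -v sha1sum >/dev/null 2>&1; then
                sha1sum < "$1" | awk '{print $1}'    # Linux form
        else
                shasum < "$1" | awk '{print $1}'     # Mac OS X form (defaults to SHA-1)
        fi
}

check_digest() {
        # usage: check_digest md5|sha1 FILE EXPECTED_HEX
        local got
        got=$("${1}_of" "$2") || return 2
        [[ $got == "$3" ]]
}

# Example with the digest from the sample output above:
# check_digest md5 hackers.png d28903b7e06cd169f5e4ff59be348fb6 && echo "MD5 matches"
```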

EDIT (November 11, 2016): Man, I’ve been putting off correcting this for too long.  I said something really stupid in this post, which was that a one-way hash has only one file mapped onto each hash.  Obviously, this is mathematically impossible, because a one-to-one function can only exist between two sets of the same size.  I have corrected it now.  God, this is a cringeworthy mistake, and I’m totally embarrassed by it.