Wednesday, September 11, 2013

Stone Knives and Bear Skins

In "The City on the Edge of Forever", Spock explains to Edith Keeler that he's trying to make a mnemonic circuit using only stone knives and bear skins.

Today I'd like to show one way to build a simple make-like utility, using the Unix equivalent of stone knives and bear skins.

This is not a "code golf" exercise, in that I'm not trying for the smallest possible solution (however inscrutable). Instead, I want to illustrate how little machinery it takes to construct a tool that approximates what make(1) does.

Here are the ingredients:

  1. a main shell script (make.sh), which needs
    • a modern Bourne shell, with shell functions
  2. a set of shell functions, which need
    • cat
    • sum
    • awk
    • touch
    • /bin/test (aka "[")
    • a compiler (for today, I'll use 'gfortran' on linux)

and that's it. With some more cleverness, we could probably figure out how to use fewer unix tools than these, but this set of programs (sh, rm, cat, sum, awk, touch, and [ ) are commonly found on Unix hosts. We could substitute another compiler without much difficulty.

For concreteness, the target program I want to compile has two small files: main.f90 and print.f90. They are written in Fortran 90. Don't worry if you don't know Fortran -- the target code is just an example. You could substitute different code with the appropriate compiler, in a straightforward way.

main.f90 is the initial procedure of the program:

---cut here---
program main 
    implicit none
    integer, parameter :: limit = 5
    call print_table(limit)
end program
---cut here---

print.f90 contains the PRINT_TABLE subroutine:

---cut here---
subroutine print_table(limit)
    implicit none
    integer :: limit 
    integer :: i
    do i = 1, limit 
        write(*,*) i, i**2, sqrt(real(i))
    end do     
end subroutine
---cut here---

(A subroutine in Fortran is roughly like a function returning void in C.)

The finished program writes out a short table of squares and square roots. The first column is an integer, the second column is the square of the integer, and the third column is the square root of the integer.

$ ./main
           1           1   1.0000000    
           2           4   1.4142135    
           3           9   1.7320508    
           4          16   2.0000000    
           5          25   2.2360680    

Now to work. The aim is to write a shell script that acts like make(1) for compiling this program. Performing the compile and link steps one file at a time, our utility should do:

# compile source files into objects

$ gfortran -c -o main.o main.f90
$ gfortran -c -o print.o print.f90

# link objects into executable

$ gfortran -o main main.o print.o -lm 

If a source file changes, we want to recompile just that file. If any object file changes, we want to re-link the main executable.

For make(1) and its cousins, there are different ideas about what constitutes "change" in a file. Every useful make-like tool doesn't actually compute an absolutely true answer to the question "has file XYZ changed?" - instead, each tool computes an approximate answer.

In the case of make(1), a file is considered "changed" if the file's mtime has changed. That's an approximation that's fast and easy to compute: do a stat(2) on the file, and read the file's mtime. There are other approximations for "has file XYZ changed?" that you could use - the file's length, a checksum of its contents, etc. Each approximation has its pros and cons, and its fans and detractors.

For the tool I'm writing about today, I will use a checksum computed by sum(1).

The logic of the main build script "make.sh" looks like this:

        for each object file
                if this object or its source file have changed
                        recompile to make this object

        if any of the objects have changed
                re-link the objects to make the executable

Here's the source of the build script "make.sh" :

---cut here---
#!/bin/dash

export PATH=$PATH:. 

. ./funs   # source our helper shell functions

target=main

objs="main.o print.o"

for o in $objs ; do 
        dep=${o%.*}.f90
        if [ $(changed $o $dep) ] ; then 
                do_compile $o $dep 
                record_sum $o $dep
        fi
done

if [ $(changed $target $objs) ] ; then 
        do_link $target "$objs"
        record_sum $target "$objs"
fi
---cut here---

The remaining magic lies in the shell functions we called: 'changed', 'do_compile', 'do_link', and 'record_sum', which are all in a helper file of shell functions, here named "funs" :

---cut here---
# helper shell functions

checksum(){
        cat $* | sum | awk '{ print $1 }' 
}

record_sum(){
        checksum $* > $1.sum
}

changed(){
        local new old rv
        if [ -f $1 ] ; then
                new=$(checksum $*)
        else
                touch $1
                new=$(checksum $*)
        fi
        old=x
        [ -f $1.sum ] && old=$(cat $1.sum)
        rv=""
        [ $new != $old ] && rv=1
        echo $rv
}

do_compile(){
        if [ -f $1.sh ] ; then
                # compile using a specific build script
                $1.sh $*
        else
                # compile using generic logic
                echo gfortran -c -o $1 $2
                gfortran -c -o $1 $2
        fi
}

do_link(){
        local targ
        if [ -f $1.sh ] ; then
                # link using a specific build script
                $1.sh $*
        else
                # link using generic logic
                targ=$1
                shift
                echo gfortran -o $targ $*  -lm 
                gfortran -o $targ $*  -lm 
        fi
}
---cut here---

Let's look at each of these helper shell functions in turn, and work out what they're doing, and why they're doing it.

First, we have a shell function to compute a checksum of a set of files given as arguments:

checksum(){
        cat $* | sum | awk '{ print $1 }' 
}

This shell function contains all the magic needed to figure out if a file has (approximately) changed. You could substitute 'cksum' for 'sum', if you wanted to have a stronger approximation for "file changed"; or you could use 'md5sum' for an even stronger approximation.

Next, we want a way to record the computed checksums, to see what's changed from one invocation of "make.sh" to the next:

record_sum(){
        checksum $* > $1.sum
}

The 'changed' shell function is the real work-horse. A file is considered changed if either (a) the file doesn't exist, or (b) the file or any of its prerequisites' checksums have changed.

changed(){
        local new old rv
        if [ -f $1 ] ; then
                new=$(checksum $*)
        else
                touch $1
                new=$(checksum $*)
        fi
        old=x
        [ -f $1.sum ] && old=$(cat $1.sum)
        rv=""
        [ $new != $old ] && rv=1
        echo $rv
}

Finally, there are shell functions to do the actual compilation and linking:

do_compile(){
        if [ -f $1.sh ] ; then
                # compile using a specific build script
                $1.sh $*
        else
                # compile using generic logic
                echo gfortran -c -o $1 $2
                gfortran -c -o $1 $2
        fi
}

do_link(){
        local targ
        if [ -f $1.sh ] ; then
                # link using a specific build script
                $1.sh $*
        else
                # link using generic logic
                targ=$1
                shift
                echo gfortran -o $targ $*  -lm 
                gfortran -o $targ $*  -lm 
        fi
}

The logic in 'do_compile' and 'do_link' merit further explanation. In do_compile, we do one of two concrete things: either we compile the target using some generic logic, or we use a target-specific shell script (if this script exists) to compile the target in a special way.

Suppose we have a file called foo.f90. The corresponding object file is foo.o. The generic logic that turns a .f90 into a .o, looks like

gfortran -c -o foo.o foo.f90 

But if we wanted to override this generic logic, for just target foo.o -- i.e., we want to use different compiler options -- we make an executable shell script called 'foo.o.sh', that contains the specific command we want to use to compile foo.f90 into foo.o :

---example foo.o.sh---
#!/bin/dash
gfortran -O -c -o foo.o foo.f90 
---cut here---

The logic of 'do_link' is like that of 'do_compile' -- by default we do a default link action, but if a special link script exists, we run that to perform the link in a non-default way.

That's it. We have the main make script "make.sh", and a file of helper shell functions "funs". To make the executable, we run make.sh:

$ ./make.sh
gfortran -c -o main.o main.f90
gfortran -c -o print.o print.f90
gfortran -o main main.o print.o -lm

We could build additional scripts to implement other 'make targets', like a 'clean' target that cleans everything up:

---example clean.sh---
#!/bin/dash
rm *.o *.sum main 
---cut here---

Although this set of scripts (make.sh + funs + clean.sh) works for this exact program, you would have to rewrite parts of make.sh + clean.sh to work for other programs or languages that you use. That may sound like a big bother, but really these scripts have much stereotypical content that you can copy from project to project, and they're not that big (less than 30 lines total).

The trio of scripts (make.sh + funs + clean.sh) weighs in at just under 100 lines of code. Gnu Make, which is very featureful, has around 31 KSLOC.

The build tool we've made works in a top-down way, unlike make(1) which works in a bottom-up way. make(1) is "fed" with a declarative specification of what needs to happen, whereas "make.sh" has a very imperative way of doing the same basic task - building an executable from one or more source files.

Armed with this blog post, should you now abandon make(1) or your current build tool of choice? Probably not. Any build tool you currently use is likely to give you faster builds than the approach I've sketched above. That said, it's neat that a very small amount of simple shell scripting can do a useful subset of what make(1) can do - rebuild a software project from source, recompiling just the files needed.

I developed these scripts after reading about apenwarr's "redo" utility, and after reading Alan Grosskurth's master's thesis, "Purely top-down software rebuilding", which is referenced at apenwarr's redo site: https://github.com/apenwarr/redo .

No comments:

Post a Comment