Thursday, August 8, 2024

Nothing is Sacred

Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong. – H. L. Mencken

“Half of all programmers are below average ability.”


This saying - or something close to it - is repeated from time to time.  It sounds plausible.  But you must consider the probability distribution involved.  If the population of programmers is measured by ability (however one might do that), and that population is unimodal and symmetric about its mean, then the mean coincides with the median, and the saying "half of everyone is below average" is not just plausible, but correct.


But it seems to me that the population implied by the saying could very well be greatly skewed one way or the other.


Say you are in a workgroup with one really experienced programmer, one mid-experience programmer, and a dozen relatively inexperienced programmers.  In this sample, many more than half of the programmers will be below the arithmetic mean level of experience.  The outliers will drag the mean above the median.
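Here's a made-up sketch of that workgroup in Python (ability scored on an invented 0-100 scale - the exact numbers don't matter, only the skew):

from statistics import mean, median

# One veteran, one mid-career programmer, and a dozen beginners.
abilities = [95, 60] + [20] * 12

print(mean(abilities))                              # about 28.2
print(median(abilities))                            # 20.0
print(sum(a < mean(abilities) for a in abilities))  # 12 of the 14 are below the mean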


Probably the saying should be revised to "half of all programmers have less than the median level of ability."  That's true by definition, and it works with whatever sample or population you have explicitly identified, or implicitly intended.


. . . not that this bit of statistical nit-picking will change anyone’s mind.  To complain that the mean is not universally appropriate for sayings like this, but that the median is, is to invite eye-rolls along with “you know what I meant, you raisin-crapper”.


There is a pestilence upon this land! Nothing is sacred. – Roger the Shrubber, Monty Python and the Holy Grail

Nothing is sacred, including a grasp of basic statistics and probability…



Wednesday, August 7, 2024

No Plan B, and other lessons

Lately, I've been re-reading an old book, Japanese Aircraft of the Pacific War, by René J. Francillon.  It's about the aircraft the Japanese built during WW2.  It's been a number of years since I last read the book, and I now find in it familiar failure modes that I've encountered while doing software development.

No Plan B

Production of Japanese aircraft dropped steeply between 1944 and 1945.  The reasons for this drop were many, including Allied bombing, earthquakes, and other factors.  A plan was underway to disperse production away from large centralized plants (easy to hit with bombers) to a more distributed manufacturing base.

Both the Army and the Navy had decisive battles to win.  The Imperial Japanese Navy considered the decisive battle to be coming in June 1944 north of New Guinea; the Imperial Japanese Army thought their decisive battle would be in August 1944 in the Philippines.  Until this was accomplished, dispersion [of production] was secondary.  The Japanese disregarded all plans for the year and shoved everything towards production.   After the peak was achieved (and battles not won), the employees required rest, the machinery was worn out and had to be repaired, parts and supplies were exhausted, and readjustments had to be made. (Francillon, p. 16)

There was no Plan B.  When Plan A (win the decisive battles) didn't work out, aircraft production sagged.   In software development, we've all experienced what happens after "crunch mode" - whether or not people physically go to work, they are definitely in "not working very hard" mode.  People get worn out.

One Size Fits All - the Tachikawa Ki-9

In 1934, the Japanese Army Air Headquarters (Koku Hombu) asked Tachikawa (a manufacturer) to build a single aircraft to serve in two distinct roles - a primary trainer in which students would get their first flight experience, and an intermediate trainer that would help a student transition to combat aircraft.  The idea was that you build one airframe, then install a 150hp engine to get a primary trainer, or a 350hp engine to get an intermediate trainer.  Tachikawa was not in favor of this dual-use design, but Koku Hombu noted that various European aircraft designers had solved the problems of the dual-use requirement, so the Tachikawa effort went ahead.

The result?  As a primary trainer, the prototype 150hp Ki-9 didn't work well, so it wasn't put into production.

As an intermediate trainer, the 350hp Ki-9 performed a bit better, so Koku Hombu had Tachikawa produce some 2,000 over the 1934-1944 period.  The problem was that the center of gravity was too far aft in the 150hp model (engine not heavy enough), and too far forward in the 350hp model (engine too heavy).  In a way, Tachikawa was fortunate to get even one of the two models to fly acceptably well - the design was clearly a compromise that had to span a large range of weights, and it flew reasonably well in the 350hp model and poorly in the 150hp model.

As a software person, there are a number of wrong conclusions one could draw from the Ki-9 story: (a) don't compromise, and (b) "the customer is always right".  Sometimes compromise is needed - and sometimes the customer isn't right.  In the case of the Ki-9, was the problem that the base requirements were too broad?  Was the problem that Tachikawa couldn't solve the dual-use problem (but someone else could)?  After all these years, it's hard to say.  Part of the difficulty for the Ki-9 was that Koku Hombu could (and often did) select a particular manufacturer to design and build a particular plane, without the use of competitive bids from other manufacturers.  A result of this acquisition process is that Koku Hombu got whatever the manufacturer built, even if it wasn't fit for purpose.  The 150hp model of the Ki-9 was an example where this process failed to yield a good product.


Sunday, January 4, 2015

Effectiveness

Various notes to myself, about being effective.

1.  allow for some slop in the machinery.

Being intolerant of error or imprecision in others (or in one's self) lowers effectiveness.  To err is human.

2.  strive for rightness, but don't beat people over the head with it

Related - being right doesn't always mean you're effective.  And being right in a way that pisses off people around you (e.g. right in a pedantic sense) lowers your effectiveness when people disengage from you.  Often as geeks we assume that our individual contribution is really all that matters; but on most projects, our contribution as a member of a larger hive is as important (maybe more so, on some projects).

3.  furious activity is not a substitute for genuine progress.

Charging full speed down the wrong road doesn't lead to being effective.  The cure (if there is one) is to understand the project's goals well.

4.  but at some point you need to stop staring at your navel.

The flip side of [3] is that you also must realize when you need to actually deliver something (as opposed to spending more time in "understanding the problem" better).

5.  consistently avoiding risk is a net loss.
 
If you're in a position where covering your ass (reducing project risks that you could be blamed for) is more important than delivering running code on time, it's probably a good time to find a different project to work on.

6.  listen to customers.

It's a good thing to listen to customers.  To listen, of course, implies you have a relationship with customers in the first place.  "I don't do customer facing stuff" is a good way to tell everybody "I'm obsolete."

7.  ideas are good.  implementable ideas are vastly better.


No one will pay me to come up with ideas that I have little or no chance of implementing.  Work out a rough sense of what can actually be built, or demo'd, and find customers who would be willing to build on that.

Thursday, April 24, 2014

The Classic Blunders

I was thinking today about The Classic Blunders, as described by Vizzini in The Princess Bride.

1.  Turnover isn't free.  So much knowledge never gets written down; it's domain knowledge or experience that only exists in the mind of the person.  When the person walks out the door, all that know-how goes with him.   (And that's just the smallest cost.  Pour encourager les autres - "to encourage the others" - is a great way to utterly demoralize les autres.)

2.  Prickliness isn't free.  When you have prickly people you have to deal with, day in and day out, it wears on the soul.

3.  The Mission Isn't Everything.  When The Mission Becomes Everything, it means people are willing to sacrifice all the essential parts of a reasonably happy and healthy life - like health, family, a sense of balance - In Order To Make The Mission Happen.


Saturday, March 15, 2014

A tale of two distances (part 2)

Last time, I wrote about using cosine similarity as a way to tell how alike or different two trigram language models are.  This post describes a second distance-like measure, from Cavnar and Trenkle's 1994 paper, "N-Gram-Based Text Categorization", which can be found on the web.  Cavnar and Trenkle (C&T) compute the distance between models not via cosine similarity, but by using the ranks of the sorted trigram frequencies.

Here's a simple example of C&T's method: suppose you have two lists of trigrams, and you want a distance measure that will tell you how alike the lists are.

list X:    RAN CAT CAT THE THE THE RED
list Y:    THE CAT RAN RED RAN CAT RED


We compute the frequencies of the trigrams in the lists, sort the lists in descending frequency, rank the resulting lists, and compute the sum of the absolute differences of the ranks.  As worked out:

X freqs            RAN=1 CAT=2 THE=3 RED=1
sort down by freq  THE   CAT   RAN   RED
rank               1     2     3     4

Y freqs            THE=1 CAT=2 RAN=2 RED=2
sort down by freq  CAT   RAN   RED   THE
rank               1     2     3     4

rank diffs         THE(1-4) CAT(2-1) RAN(3-2) RED(4-3)
absolute diffs     3        1        1        1
sum of abs diffs   6 = 3+1+1+1


Here, lists X and Y are lists of trigrams.  We compute the differences between the ranks of the lists as sorted in descending order by frequency, take the absolute values of these differences, and sum these absolute values.  The resulting sum (6 in this example) is a measure of the mismatch between the two lists.

Cavnar and Trenkle call this measure of mismatch an "out-of-place measure".  In this posting, I call the measure of mismatch the rank distance between the two lists.

When an element is in both lists, the magnitude of the rank difference for that element tells us how far out of place that element is.  When an element is not in both lists, we arbitrarily set the rank difference to N (the length of the lists).  The values of rank distance vary from 0 (both lists have the same elements with the same frequency rankings) to N*N (the two lists share no common elements).
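Here's a minimal Python sketch of the rank distance (the function names and the tie-breaking between equal frequencies are my own choices; as it happens, Counter's insertion order reproduces the rankings in the worked example above):

from collections import Counter

def rank_distance(xs, ys):
    # Rank each distinct trigram by descending frequency (rank 1 = most frequent).
    def ranks(items):
        counts = Counter(items)
        ordered = sorted(counts, key=counts.get, reverse=True)
        return {gram: r for r, gram in enumerate(ordered, start=1)}
    rx, ry = ranks(xs), ranks(ys)
    penalty = max(len(rx), len(ry))  # out-of-place penalty for an element missing from one list
    return sum(abs(rx[g] - ry[g]) if g in rx and g in ry else penalty
               for g in set(rx) | set(ry))

x = "RAN CAT CAT THE THE THE RED".split()
y = "THE CAT RAN RED RAN CAT RED".split()
print(rank_distance(x, y))  # 6, matching the worked example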

For two trigram models (from a reference text and a sample text), we can use the Cavnar & Trenkle out-of-place measure as a distance-like way to test the similarity between two models -- lower rank distance implies greater similarity.

The process described above is not exactly what Cavnar & Trenkle proposed.  C&T collect frequencies for all N-grams (N=1,2,3,4,5), not just trigrams, and take the top 400 most-frequent N-grams in descending frequency order as their language models.  I've tested a simplified variant of C&T (use only trigrams, take the top 300 for a language model) against the Python Recipes code, and the quality of classification results is similar, when used just for language detection.
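For completeness, a rough sketch of the fuller C&T profile (character N-grams only; the paper also pads tokens at word boundaries and has other tokenization details that this sketch glosses over):

from collections import Counter

def ct_profile(text, max_n=5, top_k=400):
    # Count all character N-grams for N = 1..max_n, and keep the
    # top_k most frequent, in descending frequency order.
    counts = Counter(text[i:i+n]
                     for n in range(1, max_n + 1)
                     for i in range(len(text) - n + 1))
    return [gram for gram, freq in counts.most_common(top_k)]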

For reference, the C&T 1994 paper is available from

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.162.2994

Friday, March 14, 2014

A tale of two distances

Suppose you have a blob of foreign text, and you want to identify what language the text is written in. At the Python Recipes web site, the URL

http://code.activestate.com/recipes/326576/

gives Python2 code for doing this.  The idea is simple.

For each language you want to detect, you build a reference model for that language:
  1. get a reasonably large sample of text in this language.
  2. from this sample, make a frequency table of the 3-grams in the text.
  3. save this frequency table as a model for this language.
For a new sample text you want to classify:
  1. make a frequency table of 3-grams (a new language model)
  2. compute the distance between this model and those of all the reference models you already have.
  3. classify this sample text as the language of the reference model with the least distance.
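Step 2 of both procedures builds a trigram frequency table.  Here's a minimal sketch of that step, using a flat table of character 3-grams (the recipe itself stores its tables in a nested form, described below):

from collections import Counter

def trigram_model(text):
    # Frequency table of every 3-character window in the sample.
    return Counter(text[i:i+3] for i in range(len(text) - 2))

print(trigram_model("the cat ran over the red mat").most_common(3))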
What do we mean by distance in the context of trigram language models?  

In high school, we were introduced to Euclidean distance - a way of measuring how far two points in the (x,y) plane are from each other:

distance(x,y) = sqrt((x1-y1)**2 + (x2-y2)**2) where x = (x1,x2) and y = (y1,y2)

So our garden-variety idea of distance is a function that takes two points (two ordered pairs), and returns a nonnegative number.   Math people call a function like this a metric (or distance function, or just a distance) if the function has the following properties:
  1. coincidence.   d(x,y) = 0 if and only if x = y.
  2. symmetry.  d(x,y) = d(y,x).
  3. triangle inequality.   d(x,z) <= d(x,y) + d(y,z).
The Python Recipe code uses cosine similarity instead of a distance - if two language models are the same, they get a similarity value of 1.0, and if they are entirely unalike, they get a similarity of 0.  In the following description, A and B are frequency tables whose indexes are trigrams, and whose values are the frequencies of the trigrams in two text samples.  (Suppose A is one of the reference language models you built to test against, and B is the model from the sample text you want to classify.)

import math

def scalar_length(X):
    # Square root of the sum of squares of all trigram frequencies in X.
    return math.sqrt(sum(f * f for f in X.values()))

def similarity(A, B):
    # Dot product over the trigrams present in both models, normalized
    # by the scalar lengths of both frequency vectors.
    total = sum(A[t] * B[t] for t in A if t in B)
    return total / (scalar_length(A) * scalar_length(B))

The recipe's Python code implements roughly the code above; its trigram frequency tables are stored as dictionaries keyed by 2-grams, whose values are dictionaries keyed by 1-grams, whose values are frequencies.

Rather than provide a distance, the Python code defines a difference between models, defined as

difference(A, B) = 1 - similarity(A,B)

This difference function acts vaguely like a distance, but it is not one in the precise mathematical sense (see the Wikipedia article on "cosine similarity" for the reasons why).  In practice, though, the difference function acts the way we'd like a distance function to work: two trigram-identical language samples yield a difference value of 0, and two entirely dissimilar samples yield a difference value of 1.
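Putting the pieces together, classification is just "take the reference model with the smallest difference".  A hypothetical sketch, assuming models maps language names to reference frequency tables and sample is the frequency table of the text to classify:

def difference(A, B):
    return 1 - similarity(A, B)

def classify(sample, models):
    # The language whose reference model differs least from the sample.
    return min(models, key=lambda lang: difference(models[lang], sample))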

Next time, I'll describe a different method for calculating a distance between two models, based on a nonparametric approach called rank distance.

Sunday, September 29, 2013

When correctness is not enough

Met the spec, but not done yet

Unknown nonfunctional requirements are among the banes of a working programmer's existence.  ("Heck," I can hear you say, "lack of any requirements...")

I'm focusing on nonfunctional requirements because it's been more typical in my experience to have the customers give a few vague verbal clues which can be generously described as functional requirements, while the nonfunctional aspects of what the customer needed are just glossed over.  Functional requirements give us some warm fuzzy feelings as programmers, because we associate them with correctness, and the correctness of an artifact w.r.t. a spec is something we can actually establish (even if imperfectly).

The usual thing people mean when they say "this software module is correct" is "given the appropriate inputs, this module produces the expected outputs."   This everyday meaning corresponds directly with the functional requirements of the module.   Alas, software may have nonfunctional requirements as well.

When software meets the user's functional requirements but fails to meet the user's nonfunctional requirements, correctness alone is not enough. [+]   The software I build can produce the right results all of the time, and still be useless if it can't run in the customer's environment, or within the time constraints the customer hasn't told me about.

It's ok to say "well, that never happens to me."  Just don't be surprised if some of your customers are reluctant to continue paying you.

When people want software to solve a problem, I think there are a few quantum levels of capability that "plain old non-programmer people" can imagine:
  1. I have zero software to help me do this thing I want to do.  If I can do this thing at all, I do it using a manual process.
  2. I have some software to help me do this task, and the speed and resource consumption of the software don't matter.
  3. I have software to help me do this task, and the speed and resource consumption of the software matter greatly, because I have real-world limits on how much time or resources I can throw at my problem.
The quantum leaps happen when a customer goes from one capability level up to the next one.   The moment when "correctness is not enough" is when you have correct software which is nonetheless useless, because the software doesn't meet the user's nonfunctional requirements - e.g. it runs too slowly or consumes too many resources.

The First Leap : Basic Automation

When you (non-programmer paying customer) go from Capability Level 1 to Level 2, this is generally a Big Deal.  A manual process (tedious, boring, error-prone) is now an automated process (still may be error-filled, but likely a lot less tedious and boring for the human user).  A lot of everyday business process automation deals with this jump from Level 1 to Level 2.  These are the sorts of problems where simply having a computer able to do the task at all represents a big cost savings for the customer.

Ideally, your Level 2 software process correctly does what you need it to do.  Part of the immense work required in moving from Level 1 to Level 2 is that, possibly for the first time, the actual Level 1 process has to be formally described - completely, clearly, and correctly.  This formalization may appear in part as requirements documents, but from the working programmer's seat, the real deal is the code that describes exactly what the computer is going to do.

But a lot of Level 2 software customers can live with some measure of incorrectness, as long as the presented functionality still brings a net gain in economic value.  (Clearly some customers want provable correctness guarantees, and are willing to pay for them.  I think these represent a minority of paying customers.)

The Second Leap : When Resources Matter

When you go from Level 2 (resources don't matter) to Level 3 (resources do matter), this is another quantum leap.    Before now, you (paying customer) had some software that did some useful task for you, but something has changed in the resources you need to run your software.

Some possible causes -  
  • you acquired lots of new customers.
  • the acceptable level of risk associated with your software has changed.
  • the regulatory regime changed, which means you have to produce a lot of new metadata or audit trail info that you didn't have to produce before.
  • the competition changed, and now has something that competes with your original automated process, but offers a better economic value to your customers.
Whatever the cause, resource consumption of your software now matters in ways it didn't before.  (There are different degrees of "matters" - this is not an all-or-nothing proposition.)  Software to solve this problem now has to (a) be sufficiently correct, and (b) operate within the allocated time or resource limits.

Software in a correctness vs constraints space

One way to look at the relationship between correctness and resource constraints is to picture software as a point in a two-dimensional space, with the need for correctness on one axis and the strength of resource constraints on the other.

In this space, points A and C represent software with minimal resource constraints and varying needs for correctness.  Points B and D represent software with strong resource constraints (e.g. speed or memory) and varying needs for correctness.  If you are at point C, and you need to be at point D, then simply being correct will not make your software actually usable by the customer.  It's a truism that "performance doesn't matter, until it does", but for some customers, when performance matters, it really matters.

[+] To say "correctness is not enough" is not to say correctness doesn't matter at all -- of course it does.  Even for customers who use somewhat incorrect software (most of us, most of the time), at some point of perceived incorrectness a paying customer who wants your software to solve his problem stops paying for it, because the incorrectness of the proposed solution is too expensive to live with.