Monday, December 10, 2007

Basic Programming Elements

What we observe is not nature itself, but nature exposed to our method of questioning.

—Werner Heisenberg

Code reading is in many cases a bottom-up activity. In this chapter we review the basic code elements that comprise programs and outline how to read and reason about them. In Section 2.1 we dissect a simple program to demonstrate the type of reasoning necessary for code reading. We will also have the first opportunity to identify common traps and pitfalls that we should watch for when reading or writing code, as well as idioms that can be useful for understanding its meaning. Sections 2.2 and onward build on our understanding by examining the functions, control structures, and expressions that make up a program. Again, we will reason about a specific program while at the same time examining the (common) control constructs of C, C++, Java, and Perl. Our first two complete examples are C programs mainly because realistic self-standing Java or C++ programs are orders of magnitude larger. However, most of the concepts and structures we introduce here apply to programs written in any of the languages derived from C such as C++, C#, Java, Perl, and PHP. We end this chapter with a section detailing how to reason about a program's flow of control at an abstract level, extracting semantic meaning out of its code elements.

2.1 A Complete Program
A very simple yet useful program available on Unix systems is echo, which prints its arguments on the standard output (typically the screen). It is often used to display information to the user as in:

echo "Cool! Let's get to it..."

in the NetBSD upgrade script.[1] Figure 2.1 contains the complete source code of echo.[2]

[1] netbsdsrc/distrib/miniroot/upgrade.sh:98

[2] netbsdsrc/bin/echo/echo.c:3–80

As you can see, more than half of the program code consists of legal and administrative information such as copyrights, licensing information, and program version identifiers. The provision of such information, together with a summary of the specific program or module functionality, is a common characteristic in large, organized systems. When reusing source code from open-source initiatives, pay attention to the licensing requirements imposed by the copyright notice (Figure 2.1:1).

C and C++ programs need to include header files (Figure 2.1:2) in order to correctly use library functions. The library documentation typically lists the header files needed for each function. The use of library functions without the proper header files often generates only warnings from the C compiler yet can cause programs to fail at runtime. Therefore, a part of your arsenal of code-reading procedures will be to run the code through the compiler looking for warning messages (see Section 10.6).



Standard C, C++, and Java programs begin their execution from the function (method in Java) called main (Figure 2.1:3). When examining a program for the first time, main can be a good starting point. Keep in mind that some operating environments such as Microsoft Windows, Java applet and servlet hosts, palmtop PCs, and embedded systems may use another function as the program's entry point, for example, WinMain or init.

In C/C++ programs, two arguments of the main function (customarily named argc and argv) are used to pass information from the operating system to the program about the specified command-line arguments. The argc variable contains the number of program arguments, while argv is an array of strings containing all the actual arguments (including the name of the program in position 0). The argv array is terminated with a NULL element, allowing two different ways to process arguments: either by counting based on argc or by going through argv and comparing each value against NULL. In Java programs you will find the argv String array and its length field used for the same purpose, while in Perl code the equivalent constructs you will see are the @ARGV array and the $#ARGV scalar (the index of the array's last element). Unlike C's argv, neither the Java nor the Perl array includes the program name.
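The two ways of walking the argument list can be sketched as follows. This is a minimal illustration, not code from echo: the function names total_length_by_argc and total_length_by_null are hypothetical, and both simply add up the lengths of the arguments that follow the program name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: sum the lengths of the arguments after the
 * program name, using argc as the loop bound. */
size_t
total_length_by_argc(int argc, char *argv[])
{
    size_t total = 0;
    int i;

    for (i = 1; i < argc; i++)
        total += strlen(argv[i]);
    return total;
}

/* Hypothetical helper: the same sum, found by walking argv until
 * its terminating NULL element; argc is not needed. */
size_t
total_length_by_null(char *argv[])
{
    size_t total = 0;
    char **p;

    if (*argv == NULL)          /* no program name, nothing to skip */
        return 0;
    for (p = argv + 1; *p != NULL; p++)
        total += strlen(*p);
    return total;
}
```

Called from main as total_length_by_argc(argc, argv) and total_length_by_null(argv), the two functions should return the same value for any argument list.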

Figure 2.1 The Unix echo program.

/*
* Copyright (c) 1989, 1993
* The Regents of the University of California. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. All advertising materials mentioning features or use of this software
* must display the following acknowledgement:
* This product includes software developed by the University of
* California, Berkeley and its contributors.
* 4. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ''AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/

#include <sys/cdefs.h>
#ifndef lint <-- a
__COPYRIGHT(
"@(#) Copyright (c) 1989, 1993\n\
The Regents of the University of California. All rights reserved.\n");

__RCSID("$NetBSD: echo.c,v 1.7 1997/07/20 06:07:03 thorpej Exp $");
#endif /* not lint */

#include <stdio.h> <-- b
#include <stdlib.h> <-- c
#include <string.h> <-- d

int main __P((int, char *[])); <-- e

int
main(argc, argv)
    int argc;
    char *argv[];
{
    int nflag;

    /* This utility may NOT do getopt(3) option parsing. */
    if (*++argv && !strcmp(*argv, "-n")) {
        ++argv;
        nflag = 1;
    }
    else
        nflag = 0;

    while (*argv) { <-- f
        (void)printf("%s", *argv);
        if (*++argv) <-- g
            putchar(' '); <-- h
    }
    if (!nflag) <-- i
        putchar('\n');
    exit(0); <-- j
}



Comment (copyright and distribution license), ignored by the compiler. This license appears on most programs of this collection. It will not be shown again.

(a) Copyright and program version identifiers that will appear as strings in the executable program

Standard library headers for:

(b) printf

(c) exit

(d) strcmp

(e) Function declaration with macro to hide arguments for pre-ANSI compilers

The program starts with this function

Number of arguments to the program

The actual arguments (starting with the program name, terminated with NULL)

When true output will not be terminated with a newline

The first argument is -n

Skip the argument and set nflag

(f) There are arguments to process

Print the argument

(g) Is there a next argument? (Advance argv)

(h) Print the separating space

(i) Terminate output with newline unless -n was given

(j) Exit program indicating success



The declaration of argc and argv in our example (Figure 2.1:4) is somewhat unusual. The typical C/C++ definition of main is[3]

[3] netbsdsrc/usr.bin/elf2aout/elf2aout.c:72–73

int
main(int argc, char **argv)

while the corresponding Java class method definition is[4]

[4] jt4/catalina/src/share/org/apache/catalina/startup/Catalina.java:161

public static void main(String args[]) {

The definition in Figure 2.1:4 uses the old-style (pre-ANSI) C syntax, also known as K&R C. You may come across such function definitions in older programs; keep in mind that there are subtle differences in the ways arguments are passed and in the checks that a compiler will make, depending on the style of the function definition.



When examining command-line programs you will find arguments processed by using either handcrafted code or, in POSIX environments, the getopt function. Java programs may be using the GNU gnu.getopt package[5] for the same purpose.

[5] http://www.gnu.org/software/java/packages.html

The standard definition of the echo command is not compatible with the getopt behavior; the single -n argument specifying that the output is not to be terminated with a newline is therefore processed by handcrafted code (Figure 2.1:6). The comparison starts by advancing argv to the first argument of echo (remember that position 0 contains the program name) and verifying that such an argument exists. Only then is strcmp called to compare the argument against -n. The sequence of a check to see if the argument is valid, followed by a use of that argument, combined with the Boolean AND (&&) operator, is a common idiom. It works because the && operator will not evaluate its right-hand operand if its left-hand operand evaluates to false. Calling strcmp or any other string function and passing it a NULL value instead of a pointer to actual character data will cause a program to crash in many operating environments.
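The check-then-use idiom can be isolated in a small function. This is a sketch under stated assumptions: the name first_arg_is_n is hypothetical, and the function assumes that argv[0] (the program name) is present, as it is in echo.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if the first argument after the
 * program name is exactly "-n", 0 otherwise.
 * Assumes argv[0] is a valid (non-NULL) program name. */
int
first_arg_is_n(char **argv)
{
    /*
     * The left-hand operand guards the right-hand one: when argv[1]
     * is NULL, && yields 0 without ever calling strcmp.
     */
    return argv[1] != NULL && strcmp(argv[1], "-n") == 0;
}
```

Reversing the two operands would compile without complaint, yet would pass NULL to strcmp whenever echo is run without arguments.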



Note the nonintuitive return value of strcmp when it is used for comparing two strings for equality. When the strings compare equal it returns 0, the C value of false. For this reason you will see that many C programs define a macro STREQ to return true when two strings compare equal, often optimizing the comparison by checking the first character of each string inline before calling strcmp:[6]

[6] netbsdsrc/usr.bin/file/ascmagic.c:45



#define STREQ(a, b) (*(a) == *(b) && strcmp((a), (b)) == 0)
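A short sketch of how such a macro reads at the point of use. The macro is the one quoted above; the function is_debug_option and the option string it tests are hypothetical, chosen only for illustration.

```c
#include <string.h>

/* The macro quoted above: nonzero when a and b hold equal strings. */
#define STREQ(a, b) (*(a) == *(b) && strcmp((a), (b)) == 0)

/* Hypothetical helper: return 1 if arg is the string "-debug". */
int
is_debug_option(const char *arg)
{
    /*
     * strcmp returns 0 on equality, so "if (strcmp(x, y))" is true
     * when the strings differ. STREQ restores the intuitive sense.
     */
    return STREQ(arg, "-debug");
}
```

Because the macro mentions each of its arguments twice, an argument with side effects (for example *p++) would be evaluated more than once; this is one of the costs to weigh against the saved function call.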

Fortunately the behavior of the Java equals method results in a more intuitive reading:[7]

[7] jt4/catalina/src/share/org/apache/catalina/startup/CatalinaService.java:136–143

if (isConfig) {
configFile = args[i];
isConfig = false;
} else if (args[i].equals("-config")) {
isConfig = true;
} else if (args[i].equals("-debug")) {
debug = true;
} else if (args[i].equals("-nonaming")) {

The above sequence also introduces an alternative way of formatting the indentation of cascading if statements to express a selection. Read a cascading if-else if-...-else sequence as a selection of mutually exclusive choices.

An important aspect of our if statement that checks for the -n flag is that nflag will always be assigned a value: 0 or 1. nflag is not given a value when it is defined (Figure 2.1:5). Therefore, until it is assigned, its value is undefined: it is whatever happened to be in the memory location where the variable is stored. Using uninitialized variables is a common cause of problems. When inspecting code, always check that all program control paths will correctly initialize variables before these are used. Some compilers may detect some of these errors, but you should not rely on it.



The part of the program that loops over all remaining arguments and prints them separated by a space character is relatively straightforward. A subtle pitfall is avoided by using printf with a string-formatting specification to print each argument (Figure 2.1:7). The printf function will always print its first argument, the format specification. You might therefore find a sequence that directly prints string variables through the format specification argument:[8]

[8] netbsdsrc/sys/arch/mvme68k/mvme68k/machdep.c:347

printf(version);

Printing arbitrary strings by passing them as the format specification to printf will produce incorrect results when these strings contain conversion specifications (for example, an SCCS revision control identifier containing the % character in the case above).
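The safe form can be demonstrated with a small sketch. To make the result easy to inspect, it uses snprintf (which formats into a buffer) in place of printf; the format-handling rules are the same. The function name format_safely is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: copy an arbitrary string s into buf by passing
 * it as the data for a %s conversion, never as the format itself.
 * Returns what snprintf returns: the length of the formatted text. */
int
format_safely(char *buf, size_t size, const char *s)
{
    /*
     * The hazardous form would be snprintf(buf, size, s): any '%'
     * inside s would be treated as the start of a conversion
     * specification whose argument does not exist.
     */
    return snprintf(buf, size, "%s", s);
}
```

With this form a string such as "100%d done" is copied unchanged, percent sign included.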



Even so, the use of printf and putchar is not entirely correct. Note how the return value of printf is cast to void. printf returns the number of characters that were actually printed, or a negative value if an output error occurred; the cast to void is intended to inform us that this result is intentionally ignored. Similarly, putchar will return EOF if it fails to write the character. All output functions, in particular when the program's standard output is redirected to a file, can fail for a number of reasons.



The device where the output is stored can run out of free space.

The user's quota of space on the device can be exhausted.

The process may attempt to write a file that exceeds the process's or the system's maximum file size.

A hardware error can occur on the output device.

The file descriptor or stream associated with the standard output may not be valid for writing.

Not checking the result of output operations can cause a program to silently fail, losing output without any warning. Checking the result of each and every output operation can be inconvenient. A practical compromise you may encounter is to check for errors on the standard output stream before the program terminates. This can be done in Java programs by using the checkError method (we have yet to see this used in practice on the standard output stream; even some JDK programs will fail without an error when running out of space on their output device); in C++ programs by using a stream's fail, good, or bad methods; and in C code by using the ferror function:[9]

[9] netbsdsrc/bin/cat/cat.c:213–214



if (ferror(stdout))
err(1, "stdout");
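A portable variant of the same end-of-program check might look like the following sketch. The function name check_output is hypothetical. It flushes the stream before testing it, because buffered output may not reach the device, and so may not fail, until the buffer is written out.

```c
#include <stdio.h>

/* Hypothetical helper: flush fp and report whether the flush or any
 * earlier write on the stream failed.
 * Returns 0 on success, -1 on error. */
int
check_output(FILE *fp)
{
    if (fflush(fp) == EOF || ferror(fp))
        return -1;
    return 0;
}
```

A program could call check_output(stdout) just before terminating and exit with a nonzero status when it returns -1, instead of unconditionally reporting success.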

After terminating its output with a newline, echo calls exit to terminate the program indicating success (0). You will also often find the same result obtained by returning 0 from the function main.

Exercise 2.1 Experiment to find out how your C, C++, and Java compilers deal with uninitialized variables. Outline your results and propose an inspection procedure for locating uninitialized variables.

Exercise 2.2 (Suggested by Dave Thomas.) Why can't the echo program use the getopt function?

Exercise 2.3 Discuss the advantages and disadvantages of defining a macro like STREQ. Consider how the C compiler could optimize strcmp calls.

Exercise 2.4 Look in your environment or on the book's CD-ROM for programs that do not verify the result of library calls. Propose practical fixes.

Exercise 2.5 Sometimes executing a program can be a more expedient way to understand an aspect of its functionality than reading its source code. Devise a testing procedure or framework to examine how programs behave on write errors on their standard output. Try it on a number of character-based Java and C programs (such as the command-line version of your compiler) and report your results.

Exercise 2.6 Identify the header files that are needed for using the library functions sscanf, qsort, strchr, setjmp, adjacent_find, open, FormatMessage, and XtOwnSelection. The last three functions are operating environment–specific and may not exist in your environment.
