Monday, December 10, 2007

Read Code

1.1 Why and How to Read Code
You may find yourself reading code because you have to, such as when fixing, inspecting, or improving existing code. You may also sometimes read code to learn how something works, in the manner we engineers tend to examine the innards of anything with a cover that can be opened. You may read code to scavenge for material to reuse or (rarely, but more commonly after reading this book, we hope) purely for your own pleasure, as literature. Code reading for each one of these reasons has its own set of techniques, emphasizing different aspects of your skills.[1]

[1] I am indebted to Dave Thomas for suggesting this section.

1.1.1 Code as Literature
Dick Gabriel makes the point that ours is one of the few creative professions in which writers are not allowed to read each other's work [GG00].

The effect of ownership imperatives has caused there to be no body of software as literature. It is as if all writers had their own private companies and only people in the Melville company could read Moby-Dick and only those in Hemingway's could read The Sun Also Rises. Can you imagine developing a rich literature under these circumstances? Under such conditions, there could be neither a curriculum in literature nor a way of teaching writing. And we expect people to learn to program in this exact context?

Open-source software (OSS) has changed that: we now have access to millions of lines of code (of variable quality), which we can read, critique, and improve and from which we can learn. In fact, many of the social processes that have contributed to the success of mathematical theorems as a scientific communication vehicle apply to open-source software as well. Most open-source software programs have been

Documented, published, and reviewed in source code form

Discussed, internalized, generalized, and paraphrased

Used for solving real problems, often in conjunction with other programs

Make it a habit to spend time reading high-quality code that others have written. Just as reading high-quality prose will enrich your vocabulary, trigger your imagination, and expand your mind, examining the innards of a well-designed software system will teach you new architectural patterns, data structures, coding methods, algorithms, style and documentation conventions, application programming interfaces (APIs), or even a new computer language. Reading high-quality code is also likely to raise your standards regarding the code you produce.

In your code-reading adventures you will inevitably encounter code that is best treated as an example of practices to avoid. Being able to rapidly differentiate good code from bad code is a valuable skill; exposure to some sound coding counterexamples may help you develop the skill. You can easily discern code of low quality by the following signs:

An inconsistent coding style

A gratuitously complicated or unreadable structure

Obvious logical errors or omissions

Overuse of nonportable constructs

Lack of maintenance

You should not, however, expect to learn sound programming from poorly written code; if you are reading such code as literature, you are wasting your time, especially considering the amount of available high-quality code you can now access.

Ask yourself: Is the code I am reading really the best of breed? One of the advantages of the open-source movement is that successful software projects and ideas inspire competition to improve on their structure and functionality. We often have the luxury to see a second or third iteration over a software design; in most cases (but not always) the later design is significantly improved over the earlier versions. A search on the Web with keywords based on the functionality you are looking for will easily guide you toward the competing implementations.

Read code selectively and with a goal in mind. Are you trying to learn new patterns, a coding style, a way to satisfy some requirements? Alternatively, you may find yourself browsing code to pick up random gems. In that case, be ready to study in detail interesting parts you don't know: language features (even if you know a language in depth, modern languages evolve with new features), APIs, algorithms, data structures, architectures, and design patterns.

Notice and appreciate the code's particular nonfunctional requirements that might give rise to a specific implementation style. Requirements for portability, time or space efficiency, readability, or even obfuscation can result in code with very peculiar characteristics.

We have seen code using six-letter external identifiers to remain portable with old-generation linkers.

There are efficient algorithms that have (in terms of source code lines) an implementation that is two orders of magnitude more complex than their naive counterparts.

Code for embedded or restricted-space applications (consider the various GNU/Linux or FreeBSD on-a-floppy distributions) can go to great lengths to save a few bytes of space.

Code written to demonstrate the functioning of an algorithm may use identifiers that may be impractically long.

Some application domains, like copy-protection schemes, may require code to be unreadable, in an (often vain) attempt to hinder reverse engineering efforts.

When you read code that falls in the above categories, keep in mind the specific nonfunctional requirements to see how your colleague satisfied them.

Sometimes you may find yourself reading code from an environment completely foreign to you (computer language, operating system, or API). Given a basic familiarity with programming and the underlying computer science concepts, you can in many cases use source code as a way to teach yourself the basics of the new environment. However, start your reading with small programs; do not immediately dive into the study of a large system. Build the programs you study and run them. This will provide you with both immediate feedback on the way the code is supposed to work and a sense of achievement. The next step involves actively changing the code to test your understanding. Again, begin with small changes and gradually increase their scope. Your active involvement with real code can quickly teach you the basics of the new environment. Once you think you have mastered them, consider investing some effort (and possibly some cash) to learn the environment in a more structured way. Read related books, documentation, or manual pages, or attend training courses; the two methods of learning complement each other.

One other way in which you can actively read existing code as literature entails improving it. In contrast to other literary works, software code is a live artifact that is constantly improved. If the code is valuable to you or your community, think about how you could improve it. This can involve using a better design or algorithm, documenting some code parts, or adding functionality. Open-source code is often not well documented; consider reinvesting your understanding of the code in improved documentation. When working on existing code, coordinate your efforts with the authors or maintainers to avoid duplication of work or bad feelings. If your changes are likely to be substantial, think about becoming a concurrent versions system (CVS) committer, that is, an individual with the authority to directly commit code to a project's source base. Consider the benefits you receive from open-source software to be a loan; look for ways to repay it by contributing back to the open-source community.

1.1.2 Code as Exemplar
There are cases where you might find yourself wondering how a specific functionality is realized. For some application classes you may be able to find an answer to your questions in standard textbooks or specialized publications and research articles. However, in many cases if you want to know "how'd they do that" there's no better way than reading the code. Code reading is also likely to be the most reliable way to create software compatible with a given implementation.

The key concept when you are using code as exemplar is to be flexible. Be prepared to use a number of different strategies and approaches to understand how the code works. Start with any documentation you might find (see Chapter 8). A formal software design document would be ideal, but even user documentation can be helpful. Actually use the system to get a feeling of its external interfaces. Understand what exactly you are looking for: a system call, an algorithm, a code sequence, an architecture? Devise a strategy that will uncover your target. Different search strategies are effective for different purposes. You may need to trace through the instruction execution sequence, run the program and place a breakpoint in a strategic location, or textually search through the code to find some specific code or data elements. Tools (see Chapter 10) will help you here, but do not let one of them monopolize your attention. If a strategy does not quickly produce the results you want, drop it and try something different. Remember, the code you are looking for is there; you just have to locate it.

Once you locate the desired code, study it, ignoring irrelevant elements. This is a skill you will have to learn. Many exercises in this book will ask you to perform exactly this task. If you find it difficult to understand the code in its original context, copy it into a temporary file and remove all irrelevant parts. The formal name of this procedure is slicing (see Section 9.1.6), but you can get the idea by examining how we have informally applied it in the book's annotated code examples.
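The idea of removing irrelevant parts can be illustrated with a small sketch. The function summarize and the structure stats below are hypothetical and do not come from the book's source collection; slice_max shows what remains when we keep only the statements that can influence the one value we care about, the maximum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical original code: one loop computing three unrelated results. */
struct stats {
    long sum;
    int max;
    size_t negatives;
};

struct stats summarize(const int *v, size_t n)
{
    struct stats s = { 0, 0, 0 };

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        s.sum += v[i];
        if (i == 0 || v[i] > s.max)
            s.max = v[i];
        if (v[i] < 0)
            s.negatives++;
    }
    return s;
}

/*
 * Slice of summarize with respect to max: the statements that only
 * affect sum and negatives have been deleted; everything that can
 * change max is kept unchanged.
 */
int slice_max(const int *v, size_t n)
{
    int max = 0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (i == 0 || v[i] > max)
            max = v[i];
    return max;
}
```

The slice is shorter than the original and computes the same maximum, which is what makes it easier to study in isolation.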

1.1.3 Maintenance
In other cases code, rather than being an exemplar, may actually need fixing. If you think you have found a bug in a large system, you need strategies and tactics to let you read the code at increasing levels of detail until you have found the problem. The key concept in this case is to use tools. Use the debugger, the compiler's warnings or symbolic code output, a system call tracer, your database's Structured Query Language (SQL) logging facility, packet dump tools, and Windows message spy programs to pinpoint a bug's location. (Read more in Chapter 10 about how tools will help your code reading.) Examine the code from the problem manifestation to the problem source. Do not follow unrelated paths. Compile the program with debugging support and use the debugger's stack trace facility, single stepping, and data and code breakpoints to narrow down your search.

If the debugger is not cooperating (the debugging of programs that run in the background such as daemons and Windows services, C++ template-based code, servlets, and multithreaded code is sometimes notoriously difficult), consider adding print statements in strategic locations of the program's execution path. When examining Java code consider using AspectJ to insert into the program code elements that will execute only under specific circumstances. If the problem has to do with operating system interfaces, a system call tracing facility will often guide you very near the problem.
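As a minimal sketch of print statements placed in strategic locations, the following C fragment defines a trace macro and uses it at the entry, exit, and rejection points of a function. The macro TRACE, the NTRACE switch, and the function parse_port are hypothetical names chosen for illustration. Writing to stderr is also an assumption; a daemon that has closed its standard streams would need a log file or the system's logging facility instead.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical trace macro: prefixes each message with the source
 * file, line, and function. Compiling with -DNTRACE removes all
 * trace output without touching the instrumented code.
 */
#ifndef NTRACE
#define TRACE(...) \
    do { \
        fprintf(stderr, "%s:%d %s: ", __FILE__, __LINE__, __func__); \
        fprintf(stderr, __VA_ARGS__); \
        fputc('\n', stderr); \
    } while (0)
#else
#define TRACE(...) do { } while (0)
#endif

/* Hypothetical function under investigation: parse a TCP port number. */
int parse_port(const char *text)
{
    int port = 0;

    assert(text != NULL);
    TRACE("enter, text=\"%s\"", text);
    if (*text == '\0') {
        TRACE("reject: empty string");
        return -1;
    }
    for (const char *p = text; *p != '\0'; p++) {
        if (*p < '0' || *p > '9') {
            TRACE("reject: non-digit '%c'", *p);
            return -1;
        }
        port = port * 10 + (*p - '0');
        if (port > 65535) {
            TRACE("reject: out of range");
            return -1;
        }
    }
    TRACE("exit, port=%d", port);
    return port;
}
```

Tracing each exit separately shows not only that the function failed but also which path it took, which is often the missing piece when a debugger cannot be attached.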

1.1.4 Evolution
In most situations (more than 80% of your time by some measurements) you will be reading code not to repair a fault but to add new functionality, modify its existing features, adapt it to new environments and requirements, or refactor it to enhance its nonfunctional qualities. The key concept in these cases is to be selective in the extent of the code you are examining; in most situations you will actually have to understand a very small percentage of the overall system's implementation. You can in practice modify a million-line system (such as a typical kernel or window system) by selectively understanding and changing one or two files; the exhilarating feeling that follows the success of such an operation is something I urge you to strive to experience. The strategy for selectively dealing with parts of a large system is outlined below.

Locate the code parts that interest you.

Understand the specific parts in isolation.

Infer the code excerpt's relationship with the rest of the code.

When adding new functionality to a system your first task is to find the implementation of a similar feature to use as a template for the one you will be implementing. Similarly, when modifying an existing feature you first need to locate the underlying code. To go from a feature's functional specification to the code implementation, follow the string messages, or search the code using keywords. As an example, to locate the user authentication code of the ftp command you would search the code for the Password string:[2]

[2] netbsdsrc/usr.bin/ftp/util.c:265–267

if (pass == NULL)
    pass = getpass("Password:");
n = command("PASS %s", pass);
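In practice such a search is usually done with an existing tool such as grep. To make the technique concrete, here is a minimal sketch of what a keyword search over one source file involves; the function search_file and its fixed line buffer are assumptions made for this illustration, not part of the ftp source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Print every line of the file at path that contains needle, prefixed
 * with the file name and line number. Returns the number of matching
 * lines, or -1 if the file cannot be opened.
 * Simplification: a line longer than the buffer is read in several
 * pieces, so its pieces are counted and matched separately.
 */
int search_file(const char *path, const char *needle)
{
    char line[4096];
    int lineno = 0;
    int matches = 0;
    FILE *fp;

    assert(path != NULL && needle != NULL);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        lineno++;
        if (strstr(line, needle) != NULL) {
            printf("%s:%d: %s", path, lineno, line);
            matches++;
        }
    }
    fclose(fp);
    return matches;
}
```

Calling search_file on each file of a source tree with the string Password would report the lines where the prompt is issued, and from there you can follow the calls to the surrounding authentication code.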

Once you have located the feature, study its implementation (following any code parts you consider relevant), design the new feature or addition, and locate its impact area—the other code parts that will interact with your new code. In most cases, these are the only code parts you will need to thoroughly understand.

Adapting code to new environments is a different task calling for another set of strategies. There are cases where the two environments offer similar capabilities: you may be porting code from Sun Solaris to GNU/Linux or from a Unix system to Microsoft Windows. In these situations the compiler can be your most valuable friend. Right from the beginning, assume you have finished the task and attempt to compile the system. Methodically modify the code as directed by compilation and linking errors until you end with a clean build cycle, then verify the system's functionality. You will find that this approach dramatically lessens the amount of code you will need to read. You can follow a similar strategy after you modify the interface of a function, class, template, or data structure. In many cases, instead of manually locating your change's impact, you follow the compiler's error or warning messages to locate the trouble spots. Fixes to those areas will often generate new errors; through this process the compiler will uncover for you the code location influenced by your code.

When the code's new environment is completely different from the old one (for example, as is the case when you are porting a command-line tool to a graphical windowing environment) you will have to follow a different approach. Here your only hope for minimizing your code-reading efforts is to focus on the points where the interfaces between the old code and the new environment will differ. In the example we outlined, this would mean concentrating on the user interaction code and completely ignoring all the system's algorithmic aspects.

A completely different class of code evolution changes concerns refactoring. These changes are becoming increasingly important as some types of development efforts adopt extreme programming and agile programming methodologies. Refactoring involves a change to the system that leaves its static external behavior unchanged but enhances some of its nonfunctional qualities, such as its simplicity, flexibility, understandability, or performance. Refactoring has a common attribute with cosmetic surgery. When refactoring you start with a working system and you want to ensure that you will end up with a working one. A suite of pertinent test cases will help you satisfy this obligation, so you should start by writing them. One type of refactoring concerns fixing a known trouble spot. Here you have to understand the old code part (which is what this book is about), design the new implementation, study its impact on the code that interfaces with your code (in many cases the new code will be a drop-in replacement), and realize the change.
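The advice to write the test cases first can be sketched as follows. Both word-counting functions and the checker check_count_words are hypothetical examples, not code from any system discussed here. The same assertions are run against the old and the refactored implementation, so any change in external behavior shows up as a failed assertion.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical code before refactoring: explicit state flag, nested ifs. */
size_t count_words_old(const char *s)
{
    size_t count = 0;
    int in_word = 0;

    while (*s != '\0') {
        if (isspace((unsigned char)*s)) {
            in_word = 0;
        } else {
            if (in_word == 0) {
                count = count + 1;
                in_word = 1;
            }
        }
        s++;
    }
    return count;
}

/* Refactored version: a word starts wherever a non-space follows a space. */
size_t count_words_new(const char *s)
{
    size_t count = 0;
    int prev_space = 1;    /* the start of the string acts like a space */

    for (; *s != '\0'; s++) {
        int space = isspace((unsigned char)*s) != 0;

        count += prev_space && !space;
        prev_space = space;
    }
    return count;
}

/* Test cases written before the change; they accept either version. */
void check_count_words(size_t (*count_words)(const char *))
{
    assert(count_words("") == 0);
    assert(count_words("   ") == 0);
    assert(count_words("one") == 1);
    assert(count_words("  two  words ") == 2);
    assert(count_words("tabs\tand\nnewlines") == 3);
}
```

Running check_count_words(count_words_old) before the change and check_count_words(count_words_new) after it gives some assurance that the refactoring was a drop-in replacement, at least for the cases the tests cover.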

A different type of refactoring involves spending some "quality time" with your software system, actively looking for code that can be improved. This is one of the few cases where you will need an overall picture of the system's design and architecture; refactoring in-the-large is likely to deliver more benefits than refactoring in-the-small. Chapter 6 discusses ways to untangle large systems, while Chapter 9 outlines how to move from code to the system's architecture. When reading code to search for refactoring opportunities, you can maximize your return on investment by starting from the system's architecture and moving downward to look at increasing levels of detail.

1.1.5 Reuse
You might also find yourself reading code to look for elements to reuse. The key concept here is to limit your expectations. Code reusability is a tempting but elusive concept; limit your expectations and you will not be disappointed. It is very hard to write reusable code. Over the years comparatively little software has survived the test of time and been reused in multiple and different situations. Software parts will typically become reuse candidates after they have been gracefully extended and iteratively adapted to work on two or three different systems; this is seldom the case in ad-hoc developed software. In fact, according to the COCOMO II software cost model [BCH+95], crafting reusable software can add as much as 50% to the development effort.

When looking for code to reuse in a specific problem you are facing, first isolate the code that will solve your problem. A keyword-based search through the system's code will in most cases guide you to the implementation. If the code you want to reuse is intractable, difficult to understand and isolate, look at larger granularity packages or different code. As an example, instead of fighting to understand the intricate relation of a code piece with its surrounding elements, consider using the whole library, component, process, or even system where the code resides.

One other reuse activity involves proactively examining code to mine reusable nuggets. Here your best bet is to look for code that is already reused, probably within the system you are examining. Positive signs indicating reusable code include the use of a suitable packaging method (see Section 9.3) or a configuration mechanism.

1.1.6 Inspections
Finally, in some work settings, the task of code reading may be part of your job description. A number of software development methodologies use technical reviews such as walkthroughs, inspections, round-robin reviews, and other types of technical assessments as an integral part of the development process. Moreover, when practicing pair programming while applying the extreme programming methodology you will often find yourself reading code as your partner writes it. Code reading under these circumstances requires a different level of understanding, appreciation, and alertness. Here you need to be thorough. Examine code to uncover errors in function and logic. The various elements we have marked in the margin as dangerous are some of the things you should be wary of. In addition, you should be ready to discuss things you fail to see; verify that the code meets all its requirements.



Nonfunctional issues of the code should absorb an equal part of your attention. Does the code fit with your organization's development standards and style guides? Is there an opportunity to refactor? Can a part be coded more readably or more efficiently? Can some elements reuse an existing library or component? While reviewing a software system, keep in mind that it consists of more elements than executable statements. Examine the file and directory structure, the build and configuration process, the user interface, and the system's documentation.

Software inspections and related activities involve a lot of human interaction. Use software reviews as a chance to learn, teach, lend a hand, and receive assistance.
