Bruce W. Weide Computer and Information Science The Ohio State University Columbus, OH 43210-1277 USA weide.1@osu.edu Phone: +1 614 292 1517 Fax: +1 614 292 2911 URL: www.cis.ohio-state.edu/~weide |
Scott M. Pike Computer and Information Science The Ohio State University Columbus, OH 43210-1277 USA pike@cis.ohio-state.edu Phone: +1 614 292 8234 Fax: +1 614 292 2911 URL: www.cis.ohio-state.edu/~pike |
Wayne D. Heym Computer and Information Science The Ohio State University Columbus, OH 43210-1277 USA w.heym@ieee.org Phone: +1 614 297 6803 Fax: +1 614 292 2911 URL: www.cis.ohio-state.edu/~heym |
An analysis of the pros and cons of all options available for the built-in data movement operator in imperative languages shows that the swap operator is the best choice, while the assignment operator is the worst.
Keywords
Assignment statement, copy, data movement, parameter passing, swap, swapping paradigm
Paper Category: technical paper
Emphasis: research
For years we have advocated swapping (i.e., exchanging values) as an alternative to assignment (i.e., copying either references or values) as the built-in operator for data movement in imperative programs [Harms91, Hollingsworth00]. But these are not the only possible choices, as both we and others (e.g., [Minsky96]) have noted. Is swapping really the best way to move data between variables?
The contribution of this paper is that it maps out all rational options for a data movement operator and a set of criteria through which they can be compared, and thereby explains two important conclusions:
We begin in Section 2 with a brief description of the data movement problem and our previously recommended solution: swapping. This section is adapted from the appendices of [Hollingsworth00, Weide01]. In Section 3 we map out the entire solution space for the data movement question and suggest several evaluation criteria to distinguish among solution points. Section 4 uses this setup to show why swap dominates all other possible data movement operators.
The issue of how to achieve movement of data between variables is a technical problem that needs to be faced by all designers of imperative programming languages, by all software engineering disciplines that concern themselves with component-level design details, and therefore by all software engineers. We pose the problem as the data movement question:
How do you make some variable (say, y) get the value of another variable (say, x)?
Before answering this question, many people legitimately ask another: Why you would ever want to "make some variable (say, y) get the value of another variable (say, x)". In other words, if you need the value of x at some point in a program, why not just use x directly, rather than first making y get that value and then using y? There are a few important reasons, including:
The obvious answer to the original question (which is implicit in case 1 above, and explicit in cases 2, 3, and 4) is that you use the assignment operator:
y := x;
This is the answer to the data movement question in all popular imperative languages, e.g., C, Fortran, Java, C++, C#, etc. (All these languages daftly use "=" rather than ":=" as the notation for assignment, but that's another story entirely.) Operationally, at the instruction-set level, assignment simply copies the data stored in location x into location y, replacing the data previously stored in location y. This is something that all computer hardware is designed to do efficiently, so it's no surprise to see a direct high-level-language manifestation of such an important low-level instruction. The formal semantic connection between assignment and parameter passing [Landin66] has been termed "the correspondence principle" and has even been suggested as a "design guide" for programming languages [Tennent77].
Everything about the assignment operator is not quite so simple, though. The operator has two related interpretations to modern high-level language programmers, depending on whether x and y are variables of a value type or variables of a reference type. If x and y are variables of a value type, e.g., int, then the copying leaves x and y independent of each other, so all subsequent changes to x do not affect y, and vice versa. If x and y are variables of a reference type, then the references x and y subsequently can be changed independently, i.e., without affecting each other. But the objects referenced by x and y are not independent because subsequent method calls that change the object referenced by x also change the object referenced by y, and vice versa. This difference in effect between copying values and copying references is evident in the two common parameter-passing mechanisms of call-by-value (actually, this is more properly termed call-by-value-copying) and call-by-reference (call-by-reference-copying).
An additional implication in the case of reference types is that the assignment operator for references creates an alias, which (if observable) dramatically complicates formal specification [Weide01] and modular reasoning about program behavior [Hogg91]. Note also that when reference types are involved, after assignment there is generally one more reference to some object and one less reference to another object, so there are storage management implications that either the programmer in a language such as C++, or the garbage collection mechanism in a language such as Java or C#, is obliged to deal with.
The question of the "observability" of aliasing arises because an object that is referenced might be either mutable (capable of having its value changed) or immutable (having a fixed value upon creation). The idea is that no reasoning problem arises from aliases to an immutable object because that object's value can't be changed through any of those references; you might as well have had one copy of the object for each reference, and there's no way to tell that you don't. In other words, a reference type where the referrent is immutable can be thought of as a value type. There may or may not be performance and storage management differences associated with the distinction, depending on the referrent type, but there are no reasoning differences.
Despite the potential for reasoning problems, both the assignment operator and the value-reference type dichotomy have been codified into modern commercial software technologies, including C++, Java, and the .NET framework. This is simply terrible from the software engineering standpoint. One reason is that the programmer now must be aware that variables of some types have ordinary values while variables of other types hold references to objects (it's the objects that have the values). For parameterized components this creates a special problem. Inside a component that is parameterized by a type Item, there is no way to know before instantiation time whether an assignment of one Item to another will assign a value or a reference. Of course, this can be "fixed" as it is in Java, by introducing otherwise-redundant reference types to immutable objects, such as Integer, that correspond to value types such as int. Actual template parameters can then be limited to reference types; but so much for parsimony. The technique known in .NET as "boxing" is a cosmetic improvement in terms of syntax but fails to remove the value-reference dichotomy or the need to understand its ramifications for program behavior.
In summary, then, going the direction of making everything a reference rather than a value only exacerbates the complications for specification and modular reasoning that are caused by potentially aliased references. It is true that the reasoning problems created by aliasing could, in principle, ultimately be avoided by requiring that all reference types should refer to immutable objects. This would be tantamount to mandating pure functional programming in a nominally imperative language, and to our knowledge no one is seriously proposing this as the light at the end of the tunnel on the moving-toward-references track.
Facing up to this unfortunate situation, a decade ago we wrote about a different approach to data movement that was originally proposed and used by Bill Ogden in the early 1980s: the swap operator [Harms91]. The idea is to make a language's built-in data movement operator not assignment but swap, so the answer to the original data movement question becomes:
y :=: x;
This means that the values of x and y are exchanged. For instance, if x = -42 and y = 97 before swapping, then x = 97 and y = -42 afterward. Swapping x and y is, crucially, safe with respect to modular reasoning. Moreover, it is also efficient in execution because the compiler can (for data representations larger than some threshhold) simply use one level of hidden indirection and swap pointers to the representation data structures for x and y in order to effect the logical swapping of x and y. And--surprisingly to most people--the resulting swapping paradigm for software design and implementation demands remarkably few changes in how most programmers write imperative programs [Hollingsworth00].
We are often asked about interactions between the swapping paradigm and other programming language and software engineering issues, such as the assignment of function results to variables, how parameter passing works, what happens in a case where you really need two copies of a value, etc. These issues are discussed in [Harms91, Hollingsworth00]. One question we are often asked that is not discussed elsewhere, however, is whether swapping is the only way to address the data movement question that is both efficient and safe with respect to modular reasoning. Here we explain why the answer is "no"--but why the swapping solution dominates all the alternatives in other dimensions, and why the assignment operator is the worst of all possible solutions.
First, we need to enumerate all the ways we might define a (binary) data movement operator. Let's use "<-" to denote this operator--in English we might pronounce it "gets"--so by definition the answer to the data movement question is:
y <- x;
The key to identifying and classifying all possible meanings for <- is to note that <- is required to leave the new y with the old value of x in order to solve the stated problem; but there is complete flexibility in how <- selects a new value for x. However, there are really only five possibilities for the new value of x after the execution of "y <- x;":
These exhaust the possibilities that make any sense. There are only two particular values in sight as the problem is stated: the old value of x and the old value of y. Other than these, we can select the new value of x without regard to the old values of x and y, from among all the values of its type, in one of three ways: we can give x no value (the "undefined" option), give it one particular value, or give it one value chosen arbitrarily (non-deterministically) from among two or more values. The last case could be further partitioned in obvious ways, but it turns out in the end that there is no point in doing so. Non-deterministic choice could also be replaced by probabilistic choice of some kind, but again it turns out that this does not affect any of our conclusions. Hence we do not further discuss these variants.
The above five possibilities apply both for values and references. First, suppose the type of x and y is int, a value type. If beforehand, y = 21 and x = 13, then after "y <- x;" we must have y = 13. But we could leave x undefined (i.e., unusable in subsequent computations until it somehow gets a value of type int); we could leave it with a particular known value (e.g., a natural choice for type int might be 0); we could leave it with one of a number of known values (e.g., any int value, or any even value, or any prime value, or either 1 or 69); we could leave it with its old value (i.e., 13); or we could leave it with y's old value (i.e., 21).
Now suppose the type of x and y is a reference to objects of type T. If beforehand, y = ref-to-B and x = ref-to-A (where A and B are of type T), then after "y <- x;" we must have y = ref-to-A. But we could leave x undefined (i.e., unusable in subsequent computations until it somehow gets a value of a reference to type T); we could leave it with a particular known value (e.g., a natural choice for all reference types might be null); we could leave it with one of a number of known values (e.g., ref-to-O where O is any object of type T, or ref-to-O where O is any object of type T for which some predicate P(O) holds); we could leave it with its old value (i.e., ref-to-A); or we could leave it with y's old value (i.e., ref-to-B). Technically, there are many more possibilities now that references are in play! If T is a value type, then there are five additional cases, one corresponding to each of the possibilities for value types. We could leave x = ref-to-Z (where Z is a new object whose value is undefined), or x = ref-to-Z (where Z is a new object whose value is some statically specified value of type T), and so on through x = ref-to-Z (where Z is a new object whose value equals B). And if T is a reference to a value type then we can iterate the same idea one level deeper, and so on. Doing this turns out not to affect any of our conclusions, other than to make them even more obvious because reasoning about behavior gets ever more complex and/or performance ever more dismal for each of the possible data movement operators except--you guessed it--the swap operator.
Before commencing an analysis of all the possible data movement operators, we need to list the criteria that might bear on the choice of a "best" one. The most important is that the solution should not break modular reasoning about software behavior, because this is required for any scalable software engineering discipline [Weide95]. Other than this there are no absolute requirements, but different solutions might be better or worse in terms of other important qualities. In Section 4, we consider the following desiderata that are obviously of general interest and that could differ from one data movement operator to another.
In the following summary table, we use a yes-no (Y-N) scale for support of modular reasoning, and a good-deficient (G-D) scale for each of the other factors. For each of the criteria, there are two entries: one for what happens with value types (val), the other for reference types (ref).
New value of x after "y <- x;" | Supports modular reasoning? | Efficiency of <- | Ease of storage management | Ease of specification and reasoning | Maximization of client knowledge | |
---|---|---|---|---|---|---|
1 | undefined | val: Y ref: Y |
val: G ref: G |
val: G ref: G |
val: D ref: D |
val: G ref: G |
2 | unique statically specified value of its type | val: Y ref: Y |
val: D ref: G |
val: G ref: G |
val: G ref: G |
val: G ref: G |
3 | some value of its type, arbitrarily chosen from among a statically specified set of two or more possible values | val: Y ref: Y |
val: G ref: G |
val: G ref: G |
val: G ref: G |
val: D ref: D |
4 (assignment operator) |
x's old value | val: Y ref: N |
val: D ref: G |
val: G ref: D |
val: G ref: D |
val: G ref: G |
5 (swap operator) |
y's old value | val: Y ref: Y |
val: G ref: G |
val: G ref: G |
val: G ref: G |
val: G ref: G |
Looking across the rows, we find deficiencies in all but #5 (i.e., the swap operator). Notably, the row with the most difficulties is #4, the one where x's new value equals its old value (i.e., the assignment operator). All the other options could be considered better approaches than the assignment operator, which, in a nasty quirk of fate, has been hardwired into every commercial software technology.
Let's examine, row by row, the most significant table entries for each of the possible solutions to the data movement problem:
p /= null and ...and/or postconditions start looking like
if #p = null then ... else ...The same style arises when variables are allowed to be undefined. It is possible to make the problem disappear syntactically simply by making it an implicit proof obligation at each point in the program that, say, none of the variables involved in an expression or a call is undefined (except possibly y in a situation like "y <- ...;"). But it remains incumbent upon the programmer to add this to his/her informal reasoning about program behavior; it's not merely a formality needed in program proofs. Reasoning is easier if every variable always has a legal value of its type, i.e., if it is always "defined", even if its value at some point is arbitrary or not uniquely specified.
A careful reader may wonder whether the criteria were "rigged" to produce this outcome. Doesn't swapping have any problems? An early review suggested that programmers who are used to assignment might have to learn a new paradigm of programming in order to use swapping [Hogg92]. However, subsequent experience with swapping (both in the classroom and in building commercial software) suggests that not much changes for the programmer except the quality of the resulting software [Hollingsworth00].
One other potential problem with swapping has been suggested [Minsky96]: Because of its symmetry, swapping doesn't mesh well with the inherently asymmetric notion of subtyping in object-oriented languages. Suppose x is of type S and y is of type T, where S is a behavioral subtype [Leavens90] of T. (The actual objection to swapping on the grounds of its interaction with subtyping fails to note that if we're not talking about behavioral subtypes, then not even assignment should be allowed.) The claimed problem is that OO programmers want to be able to assign x to y in this case, but not vice versa--so swapping cannot be permitted.
We do not give a complete analysis of the issue here, but just explain one part of our response. Indeed, swapping cannot be permitted where the variables' types don't match, if the situation involves an explicit swap statement. However, the first and by far most common use of data movement is for parameter passing, and here swapping can be used even in the presence of subtyping. The declared type of the actual parameter (i.e., in the example, S) must be a subtype of the declared type (i.e., T) of the formal parameter. If it is, then for purposes of swapping to pass this parameter for this call, you simply consider the type of the formal parameter to be the same as the type of the actual parameter (i.e., S). Now the actual and formal are swappable. This doesn't affect the soundness of the separate reasoning about the correctness of the method body because that reasoning uses T as the type of the formal, and by assumption S is a behavioral subtype of T.
Even if incompatibility with subtyping were a significant objection to swapping, any of the first three "transfer" or "move" operators would still be better than the assignment operator, and they are not subject to the symmetry objection. The data movement operators of rows #2 and #3 are the best choices for those who prefer asymmetry in their data movement operator in the presence of subtyping.
Since the assignment operator was introduced at a time when languages had only value types with small representations, it was a perfectly good data movement operator for its day. With the advent of languages having user-defined types and reference types, though, it has become sub-optimal for general use; i.e., it should not be the built-in data movement operator in modern languages. But assignment is deeply woven into the fabric of computing for most software engineers, whose early computing education typically involved programming only with built-in value types having small representations. Languages could still allow assignment statements in such situations.
Swapping has many advantages over assignment as the built-in data movement operator that is available for every type. Most importantly, it doesn't interfere with modular reasoning. It is efficiently implementable for all value and reference types, regardless of the sizes of their data representations. It dramatically simplifies storage management because there is only one reference to any object; hence, the clean-up discipline is that you reclaim resources when a variable goes out of scope. Adoption of the swap operator also simplifies specification and reasoning because it unifies values and references in a fundamental way: There is no logical difference to the client between swapping two object references and swapping the values of the referenced objects. (There was a hint of this property in Section 3, where we mentioned without justification that only the swap operator does not get progressively messier to deal with in the face of references to references to ...) This means that introduction of the swap operator in place of the assignment operator--and in this case in preference to the other three possible data movement operators as well--facilitates a move toward a uniform value semantics. This is precisely the opposite direction taken by Java and .NET. Some details and consequences of such a move are discussed in [Weide01]. (Of course, this is one of the main features of the RESOLVE language and discipline.)
If you use any commercial software technology, then you face the data movement question all the time. You might not know it because the assignment operator is so ingrained in all of us that to many people it seems inseparable from the very idea of imperative-language programming. But it has been demonstrated [Hollingsworth00] that:
Not only is nothing sacred about the assignment operator, it is the worst choice among all possible data movement operators. Swapping dominates assignment when data movement operators are compared along several important dimensions. In fact, swapping dominates all the other possibilities as well.
This work is supported in part by the National Science Foundation under grant number CCR-0081596, and in part by a grant from Lucent Technologies.