Abstract Data Types and the Uniform Referent Principle I: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()

[This article was first published on R – Jocelyn Ireson-Paine's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

“What’s the Uniform Referent Principle?” my colleague asked me on reading my last post. I think I first came across it in Jean Sammet’s famous book Programming Languages: History and Fundamentals. In a description of Douglas Ross’s AED-0 language, she pointed out a feature that she thought particularly noteworthy: the notation for getting information out of a piece of data is the same regardless of how the data is stored. Or as Ross puts it: There must be a single, uniform method of referencing operands regardless of their detailed implementation. He considered this so important that he gave it a name: the Uniform Referent Principle. Unfortunately, it’s a principle universally ignored by R programmers. I’m going to explain why that is bad, and how it affects my coding of the tax and benefit calculations in our economic model.

Sammet doesn’t actually name the principle, but Ross himself does, as is evident from the title of his paper “Uniform Referents: An Essential Property for a Software Engineering Language” by Douglas T. Ross, in Software Engineering: Proceedings of the Third Symposium on Computer and Information Sciences held in Miami Beach, Florida, December, 1969 (Volume 1, edited by Julius Tou). It’s worth asking why it concerned him. His answer is as relevant now as it was in 1969. Here’s the start to his introduction:

The term software engineering is new and has yet to achieve a well-defined meaning. Most present-day software has not been engineered at all, but instead is a product of the individual experience of its creators and primarily ad hoc choices among various alternatives. The effects of this unstructured approach to software are quite obvious in terms of poor performance, missed schedules, and uncontrolled costs.

Ross continues by saying that although software engineering is one of the most challenging problems facing humanity, we are beginning to systematise it and to recognise the fundamental issues. The most fundamental of these involves:

[…] one basic criterion which a general-purpose programming language must meet if it is to be used as the expressive vehicle for software engineering activities regardless of the style of practice of these activities: There must be a single, uniform method of referencing operands regardless of their detailed implementation.

Why is this so important? Ross relates it to his ideas on how to design programming languages:

Programming language features for software engineering must be carefully selected; not any old programming language features will do. An unstructured blur of assembly language depending in turn upon an ad hoc collection of machine hardware features which just happen to have been brought together for some particular computer design has a low probability of matching the real world in any profound way. Similarly, most “high-level” languages invariably will be either not complete or not concise enough to serve as a basis for software engineering, for they have a similar ad hoc ancestry with roots in scientific computation or business data processing areas, which omit many important aspects of general software construction.

Heaven forbid that R’s ancestry be deemed in any way ad hoc. But let Ross continue:

Careful inspection and experimentation with various software engineering activities discloses a few fundamental features that somehow must be present in a software engineering language. These features can be informally derived directly from high-level considerations of the software engineering design process itself. The purpose of this paper is to indicate such a derivation of one basic principle without attempting an exhaustive enumeration of the consequences.

Ross then goes on to those promised high-level considerations. Quoting an excerpt from his report “Computer-Aided Design: A Statement of Objectives” (Tech. Mem. 8436-TM-4, Defense Documentation Center No. AD252060, M.I.T. Electron. Systems Lab., 1960), he writes that:

The manner of stating problems is also of primary importance. You must be able to state problems from outside in, not inside out. The normal practice of being explicit about a problem is to build a description of the solution procedure in small steps, i.e. to code the problem…. Stating a problem step by step from the inside out in this manner is satisfactory if the problem is well understood to begin with. But there is an alternative procedure which is frequently used among people, but is virtually non-existent as regards computers. In this alternate procedure, which we call stating the problem from the outside in, we state the problem originally in general terms and then explain further and in more and more detail what we mean…. It is quite apparent that this procedure of stating a problem from the outside in by means of further and further refinements is the only feasible way to attack grand problems.

This way of stating problems is also called stepwise refinement. One of the computer scientists who popularised the name was Niklaus Wirth, the designer of Pascal, in his paper “Program development by stepwise refinement” (Communications of the ACM, April 1971, Volume 14, Number 4). In it, he writes:

A guideline in the process of stepwise refinement should be the principle to decompose decisions as much as possible, to untangle aspects which are only seemingly interdependent, and to defer those decisions which concern details of representation as long as possible. This will result in programs which are easier to adapt to different environments (languages and computers), where different representations may be required.

Why is it important to defer decisions which concern details of representation? Because undoing these decisions is costly. One has to rewrite code, retest it, and redocument it. The modern technique for minimising this cost is to use abstract data types. To quote the beginning of the Wikipedia page:

In computer science, an abstract data type (ADT) is a mathematical model for data types, where a data type is defined by its behavior (semantics) from the point of view of a user of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user.

So one centralises these operations inside a module, and makes everything that depends on the concrete representation private, inaccessible from outside. The only things that aren’t inaccessible are the interface functions. Then, no matter how much the representation is changed, the interface won’t be, and so neither will code that calls it.

Ross’s reasoning is similar:

The crucial feature about outside-in design as the underlying methodology for software engineering is that […] proper treatment of interface mechanisms will allow the higher level to remain unchanged regardless of changes in the details of the inner level. In other words all of the variations can be accommodated in the interface changes, so that the outer level in completely unaffected.

[…]

The criterion is this: A single uniform referent mechanism must be used for all references to operands. […] In order for the programs of an outer level to remain completely unchanged as interface programs change to accommodate various forms of inter-level detail, the manner of referencing these operands must be the same for all possible forms.

There is one difference, however. Ross was writing many years ago, before our modern views on abstract data types. I think it’s fair to say that he sees things in slightly more representational terms than we would today. If we were implementing a data structure as an array, we’d make our interface hide the fact that it’s an array, by exporting functions that do the subscripting. Ross, however, would write the subscripting operations themselves in the interface, but in a notation that looks the same as function calls. Which is why later on in his paper, Ross writes the following table:

The notation A(B) in any syntactic context always means

Property A of thing B

AED-0 declarations allow choices:

A: ARRAY B: INDEX
A: COMPONENT POINTER
A: FUNCTION ARGUMENT
A: MACRO ITEM-STRING

With the .INSERT STATEMENT:

Program fileDeclaration files
BEGIN7094 version
.INSERT DECL $ 360 version
BODY STATEMENTS $1108 version
END FINI

The program file never changes when any declaration is used

(Continued.)

To leave a comment for the author, please follow the link and comment on their blog: R – Jocelyn Ireson-Paine's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)