Notes on Formal Language Design
by Crutcher Dunnavant
current status "Work in Progress"
last updated 2005-04-14

Abstract
A great deal of work has been done in Linguistics, Semiotics, Semantics, Computer Science, and Mathematics towards developing methods for analyzing the articulations of formal languages, describing their semantic fields and the relationships between them, and producing translators and interpreters by which these languages may be given impetus to affect the world. By contrast, very little work has been done to provide teachable techniques for the design and development of the semantic fields and articulations of formal languages. This work attempts to address some of the issues in this space. Given the importance of formal languages in a technical society, any addition to the field of language design would be of immense value.

Preface
This document is much more than a draft; it is the kernel of my doctoral dissertation. As such, it will undergo many, many revisions during the next several years. Owing to positive past experiences, I hold continuous review in exceedingly high esteem, so the most current version of this document will always be online. Sections will frequently change, or be woefully incomplete, and spell checking, which I view as a cleanup process, will be done with incredible infrequency. I welcome your feedback, just not on the spelling.
1. Introduction
1.1. What is a Formal Language?
In which are described Natural and Formal Language, and in which the decision process of "Does a Formal Semantics exist for a language?" is given to distinguish between the two.
1.2. Why Formal Language Design is Needed
In which Formal Language Design is placed in context with other fields.
1.3. Who Should Read This Book
In which the educational background of the assumed reader is described, along with needs which this book can meet for different individuals.
2. Do you need a new Formal Language?
Overview: A discussion which will guide the reader through a cost/benefit analysis of the language the reader wishes to implement. Note that this discussion must be biased towards "No, you don't need a new language"; most situations do not require a new language (implementation and education costs will never be free) and the act of proving that a new language is needed, by listing points of inadequacy of existing languages, will have the side effect of pre-labeling most of the points of articulation around which the new language should be structured (though the reader may not realize this while they construct their proof).
2.1. A Language Creation Decision Process
In which is presented a process by which can be decided if a new language is indicated. This process will generate a collection of needs which will be used in later stages.
3. Linguistics for Formal Language Design
3.1. History
In which is presented a framework for understanding the history of linguistics up to the present, with citations and references to major works and influential writers. Including: de Saussure [saussure:linguistics], Chomsky, Derrida, Lyons, etc.
3.2. Signification
In which is presented a sketch of Semiotics.
3.3. Dialog
In which is presented a sketch of Semantics.
3.4. Differentiation and Analogical Reasoning
In which are presented the forces of differentiation and analogical reasoning on language, and the dynamic balance which exists between them. Including an attempt to characterize languages which are in balance (good, effective, efficient) against those which are not (verbose, ambiguous).
3.5. A Linguistic Framework
In which are presented Semiotics, Semantics, Grammar, and Transformation as a framework for the analysis and understanding of language.
4. Semiotics for Formal Language Design
Overview: A presentation of Semiotics, followed by a discussion which guides the reader through an evolution of their proof of need to a description of the paradigmatic fields of their language. The presentation of Semiotics should be grounded in theory (with appropriate references for deeper study), but extremely light on history.
4.1. The Structure of the Sign
Overview: an introduction to the internal structure of the sign, denotation, connotation, and paradigmatic structural connotation.
4.2. The Arbitrariness of the Sign
Overview: an introduction to the Arbitrariness of the Sign, motivated and un-motivated sign choice, and more and less relatively motivated languages.
4.3. Oppositional Relationships
Overview: introduction to semantic fields / paradigms.
4.4. Positional Relationships
Overview: introduction to syntactic relationships.
4.5. Lexical Fields
The sense of a lexeme is therefore a conceptual area within a conceptual field, and any conceptual area that is associated with a lexeme, as its sense, is a concept. [lyons:semantics1 pp. 254]

Additionally, the set of lexemes which collectively cover a conceptual field makes up the covering lexical field.

So, to use the canonical example, if we wish to discuss color in a language, then the collection of all conceptual understandings of color makes up the conceptual field of color; and we break this field up into various conceptual areas, each of which we associate with a lexeme. The set of these lexemes makes up the lexical field of color in the language.

5. Semantics for Formal Language Design
Overview: A presentation of Semantics, followed by a discussion which guides the reader through an evolution of their proof of need and paradigmatic fields to a description of the syntactic relationships of their language. The presentation of Semantics should be grounded in theory (with appropriate references for deeper study), but extremely light on history.
5.1. A Pattern Language for Computer Language Structure
In which is presented a Pattern Language for Computer Language Structure, as an extension of the language structure work in A Pattern Language for Language Implementation. This is to guide language semantics within structures which are common in Computing.
6. Grammar for Formal Language Design
Overview: A presentation of Grammar, followed by a discussion which guides the reader through an evolution of their proof of need and paradigmatic and syntactic fields to a developed formal grammar in a known family of parsable languages. The presentation of Grammar should be grounded in theory (with appropriate references for deeper study), but extremely light on history.
7. Transformation for Formal Language Design
Overview: A presentation of Language Transformation, followed by a discussion which guides the reader through an evolution of their proof of need, paradigmatic and syntactic fields, and developed formal grammar to a description of a transformation system for giving impetus to their language's semantics. The presentation of Transformation should be grounded in theory (with appropriate references for deeper study), but extremely light on history. The presentation of Transformation should mention non-deterministic transformation, but should focus on deterministic transformation. Interpreters shall be considered a form of transformation, as the target language is syntactically structured in time, rather than space.
8. Testing Formal Languages
Overview: The development of a new Formal Language should not stop with the completion of a transformation environment. The main purpose of developing a new Formal Language is to provide good semantic compression for a given domain, so it now becomes necessary to test the language.

This section details how one tests and debugs a new language in parallel with its implementation. Techniques include:
  • write a number of non-trivial documents in the language
  • frequency count the tokens in these documents
  • frequency count consecutive token strings
Developers should consider distorting a language to provide shorter versions of extremely common tokens, and special grammar cases for extremely common token strings. (See teratology)
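The two counting techniques listed above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the whitespace tokenizer, the function names token_counts and ngram_counts, and the sample documents are all invented for illustration, and a real language would substitute its own lexer.

```python
from collections import Counter

def tokenize(text):
    # Assumption: tokens are whitespace-separated; swap in the language's real lexer.
    return text.split()

def token_counts(documents):
    """Frequency count of individual tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def ngram_counts(documents, n=2):
    """Frequency count of consecutive token strings of length n."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample documents written in an invented language.
docs = [
    "let x = 1 ; let y = 2 ;",
    "let x = x + y ;",
]
common_tokens = token_counts(docs).most_common(3)
common_pairs = ngram_counts(docs, 2).most_common(2)
```

Tokens and token strings near the top of these counts are the candidates for shorter spellings or special grammar cases.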
9. A Language Design Process
Overview: A description of a full design process for Formal Language design and maintenance.
9.1. The Language Waterfall
In which is presented the Waterfall process, explicitly tuned for the needs of language design.
9.2. The Language Lifecycle
In which is presented a lifecycle of language evolution, from initial design, rapid prototyping, and deployment, through refactoring, performance tuning, and mature feature integration; all the way to graceful obsolescence and replacement integration.
10. Language Design in the Software Development Process
Overview: A speculative discussion of how to integrate language design into the software development lifecycle.
10.1. Semantic Abstraction
The basic idea: strong programmers / architects write compilers for application specific languages (ASLs), everyone else writes the application in the ASLs. Benefit: strong programmers act as a multiplier on everyone else's talent.
10.2. Semantic Compression
One stage of the development cycle is added which attempts to maintain a feature set while increasing documentation and reducing total token count. (Done by adding generation layers in cheap languages.)
11. Glossary
Overview: A complete glossary of all technical terms used in the document, with references to their point of introduction.
Appendix A. Normative Linguistics

Part of the evolution of modern linguistics has been a deliberate movement away from Normative Linguistics towards Positive Linguistics, a democratization in the comparative study of language, built upon a de-emphasis of the importance of a culture's economically and educationally preferred proper and literary languages (that language which evolves as the proper written variant of a culture's language). It has been a principle of comparative linguistics that languages are not better or worse, but only different.

While this process has produced cleaner discussions of social sub-groups and class systems, and has greatly aided the teaching of language (and, indeed, other subjects, as teaching material is now sometimes modified for various dialects), it leaves the modern language designer lacking a basic vocabulary for making value judgments while comparing languages, a problem which we seek to address in the development of an art of language design.

1. Efficiency

Our first concept will therefore be efficiency. The efficiency of a language varies inversely with the expected length of a statement of that language. Notionally, we can define the expected length as the sum, over every possible statement in a language, of the statement's length multiplied by the statement's probability of occurring; practically, as many languages are capable of producing an infinite number of statements, this is not a metric which we are likely to ever calculate, so we will settle for estimators of the expected length. All other things being equal, we prefer more efficient languages to less efficient ones. When comparing two languages covering the same domain, the language with the lower expected length is more efficient.
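As one possible estimator, the expected length can be approximated by the mean statement length over a sample of statements, treating each observed statement as a draw from the language's unknown statement distribution. The sketch below assumes that length is measured in whitespace-separated tokens and that the sample is representative; the function name and both sample languages are hypothetical.

```python
def estimated_expected_length(statements, length=lambda s: len(s.split())):
    """Estimate the sum over s of P(s) * length(s) by the sample mean of length(s)."""
    if not statements:
        raise ValueError("need at least one sample statement")
    return sum(length(s) for s in statements) / len(statements)

# Hypothetical samples: the same three requests written in two invented languages.
lang_a = ["add 1 2", "add 3 4", "mul 2 5"]
lang_b = [
    "compute the sum of 1 and 2",
    "compute the sum of 3 and 4",
    "compute the product of 2 and 5",
]

# The language with the lower estimate is judged more efficient on this sample.
more_efficient = "A" if estimated_expected_length(lang_a) < estimated_expected_length(lang_b) else "B"
```

A different length measure (characters, syntax tree nodes) can be passed as the length argument; the comparison is only meaningful when both languages are measured the same way over the same domain.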

2. Balance

Our second concept will be balance, but this will require that the ancillary concepts of linguistic distance and semantic distance be defined.

The first ancillary concept is linguistic distance. We shall say that the linguistic distance between two utterances is the edit distance between, not their lexical representation (which is linear), but their structural representation (the concrete syntax tree for a given expression). While it would be possible to mathematically describe the edit distance between two statements in this way, it will not be necessary for our purposes.

Note: The concept of edit distance is much discussed in the field of computer science as it applies to strings, and we abstract it here to a general form - the edit distance between two statements is the minimal number of edit operations (often given as replacement, addition, and subtraction) which need to be applied to one statement in order to produce the other.
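For the familiar string case mentioned in this note, the edit distance can be computed with the standard dynamic-programming (Levenshtein) recurrence. The sketch below works on any pair of sequences, so it applies to characters or to token lists; it assumes a unit cost for each replacement, addition, and subtraction, and it does not cover the tree-structured case used for linguistic distance.

```python
def edit_distance(a, b):
    """Minimal number of replacements, additions, and subtractions turning a into b."""
    # At the start of row i, prev[j] is the distance between the first
    # i - 1 items of a and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning i items into nothing takes i subtractions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # subtraction: drop x
                curr[j - 1] + 1,         # addition: insert y
                prev[j - 1] + (x != y),  # replacement, free when items match
            ))
        prev = curr
    return prev[-1]

# Works on strings and on token lists alike.
by_character = edit_distance("kitten", "sitting")
by_token = edit_distance(["let", "x", "=", "1"], ["let", "y", "=", "1"])
```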

The second ancillary concept is semantic distance. We shall say that the semantic distance between two statements is the edit distance between their meaning, deep structure, or model form. Unfortunately, this will always be an ambiguous definition, but in any given formal language context, it should be possible to roughly describe the model form (the non-serialized, multi-dimensional structural form which the language models) which statements in such languages communicate.

Now, given linguistic distance and semantic distance, we are ready to discuss balance. A language is balanced to the extent that the expected linguistic distance between two statements is proportional to the expected semantic distance.
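One rough way to probe balance empirically, offered here as an assumption and not as part of the definition above, is to sample pairs of statements, measure both distances for each pair, and check how close the two lists of measurements are to a fixed ratio. The sketch below uses cosine similarity for that check, since it reaches 1.0 exactly when one list is a positive multiple of the other. The function name and the sample figures are hypothetical, and obtaining the distances themselves is left open.

```python
import math

def balance_score(linguistic, semantic):
    """Cosine similarity of paired distance lists; 1.0 means exactly proportional."""
    if len(linguistic) != len(semantic) or not linguistic:
        raise ValueError("need two non-empty lists of equal length")
    dot = sum(l * s for l, s in zip(linguistic, semantic))
    norm = math.sqrt(sum(l * l for l in linguistic)) * math.sqrt(sum(s * s for s in semantic))
    if norm == 0:
        raise ValueError("all distances are zero; proportionality is undefined")
    return dot / norm

# Hypothetical measurements for four sampled statement pairs.
proportional = balance_score([2, 4, 6, 8], [1, 2, 3, 4])  # linguistic = 2 * semantic
skewed = balance_score([1, 9, 1, 9], [5, 1, 5, 1])        # small meaning changes need large edits
```

A score computed this way describes only the sampled pairs; it is an estimator of balance in the same sense that a sample mean is an estimator of expected length.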

3. Putting it Together

Therefore, we desire balanced, efficient languages; and when comparing two languages for a given domain, we will prefer the language which is more balanced and efficient, though we must make judgment calls when one language is more balanced and the other is more efficient.

Appendix B. General Properties of Language

In discussing the question, "What is Language?", linguists frequently resort to an attempt to describe the characteristic features which any language must possess. While many features have been proposed as general properties, only four are accepted by all schools of linguistics. These four properties are: Arbitrariness, Duality, Productivity, and Discreteness [lyons:semantics1 pp. 70-79].

1. Arbitrariness
The arbitrariness of the sign [saussure:linguistics pp. ??], upon which the study of Semiotics is based, is a core feature of language. This is a complex concept, and needs further development.
2. Duality
Duality, or double-articulation, is the property whereby the discrete elements of a language expression themselves make up second level language elements. In text, this would be the lexical and grammatical levels. Duality greatly enables Productivity. [lyons:semantics1 pp. 71-72]
3. Productivity
By productivity, as we shall employ the term, is meant that property of the language-system which enables native speakers to construct and understand an indefinitely large number of utterances, including utterances that they have never previously encountered. [lyons:semantics1 pp. 78]
4. Discreteness
The term discreteness applies to the signal-elements of a semiotic system. If the elements are discrete, in the sense that the difference between them is absolute and does not admit of gradation in terms of more or less, the system is said to be discrete; otherwise it is continuous. [lyons:semantics1 pp. 78]