Wild Grammar

This article is part of a series entitled Language Games. See also:
1. Wild Grammar; 2. Combinatorial Grammar; 3. Pragmatic Grammar

“Isn’t it true that example-sentences that people that you know produce are more likely to be accepted?” – De Roeck et al., 1982 [1]

“The man the dog the cat scratched bit died.” – Dan Scherlis, a former linguistics classmate of my mother

An Investigation

Chomsky first articulated the distinction between grammaticality and what he called performance. Making a grammatical sentence is one thing. Transmitting it successfully is another, and many potential obstacles – from distracting noise to the capacity of the human mind – can get in our way.

In particular, certain sentences are grammatical, but effectively incomprehensible. These sentences are typically complex, and they might contain intricately nested clauses and phrases. The capacity of our minds is limited. Language’s capacity for recursion is not. Who could be surprised that space eventually runs out? (The two sentences above contain double center embeddings, which are notoriously difficult to parse.)

Some sentences, though, feature an inscrutability difficult to explain on account of their complexity alone. Within a collection of sentences similar in length, complexity and meaning – but different in organization – certain sentences can emerge as particularly difficult to understand. Further, these arcane sentences share distinctive commonalities. (The second sentence, though perhaps much simpler than the first, is typically found to be less comprehensible.)

Linguists seek to describe these commonalities. Which grammatical characteristics make a sentence more difficult to parse than we should expect it to be on account of its semantic complexity alone? The enumeration of these characteristics is a central task of psycholinguistics. Linguists have developed precise technical criteria which purport to predict when a sentence – despite its grammaticality – is liable to baffle the mind.

These criteria are as fascinating as they are technical. Edward Gibson’s seminal work introduced Processing Load Units (PLUs) – which represent units of mental parsing difficulty – and described grammatical constructions which induce the accumulation (or reduction) of PLUs. When, and only when, the present tally of PLUs exceeds four, Gibson found empirically, our mental parsers simply fold. Gibson’s system proved incredibly predictive. James David Thomas explains the grammatical constructions which generate Gibson’s PLUs:

Associate a PLU to each lexical requirement position that is obligatory in the current structure, but is unsatisfied.

Associate a PLU to each semantically null C-node category in a position that can receive a thematic role, but whose lexical requirement is currently unsatisfied. [2]

These constructions produce convoluted – and often amusing – example sentences. Again, quoting from Thomas’s work:

Claim 1: That embedding a relative clause inside a sentential complement is easier than the opposite embedding.

  1. The hunch that the serial killer who the waitress had trusted might hide the body frightened the FBI agent into action.
  2. The FBI agent who the hunch that the serial killer might hide the body had frightened into action had trusted the waitress.

Claim 2: That embedding a relative clause inside a sentential subject is easier than the opposite embedding.

  1. Whether the serial killer who the waitress had trusted might hide the body frightened the FBI agent into action.
  2. The FBI agent who whether the serial killer might hide the body had frightened into action had trusted the waitress. [2]

Welcome to the far-flung edge of the grammatical universe: sentences which are grammatical, but indecipherable; which feature regular grammatical structure, but organize it in such a ridiculous way that we lose all hope of comprehension.

Fun and Games

We use this setting to manufacture sentences with systematically absurd structure. How (little) are we constrained by the requirements of grammaticality? Though the sentences we produce won’t be comprehensible, they’ll be grammatical. I’ll pose a few games to the reader.

Game 1: Construct a family of sentences 1, …, n, … such that for some word, in the nth sentence this word appears n times consecutively.

Solution: Consider the family of sentences (I’ve bracketed embeddings for clarity):

  1. Proposition P is true.
  2. That [proposition P is true] is obvious.
  3. That [that [proposition P is true] is obvious] is obvious.
  4. That [that [that [proposition P is true] is obvious] is obvious] is obvious.
  n. That [… [that [proposition P is true] is obvious] …] is obvious.

Proceeding in this manner, we can construct grammatical sentences in which the word that appears repeatedly in consecutive sequences of arbitrary length. (Similar constructions can be achieved with other subordinating conjunctions, such as whether.) Each of these sentences is grammatical; given enough time, each could be understood.

This example works because the subordinator that repeatedly serves to embed an entire sentence into the subject of the next, larger sentence. The subject of any given sentence contains a descending chain of smaller, nested “copies” of itself.

These sentences would surely be reported incomprehensible after the second or third.

The sentences in this family feature syntax trees which are skewed heavily towards the left (subject). Each sentence’s tree features a large subject consisting in a long chain of subordinations; to each of these links (as well as to the root node representing the sentence itself), we also attach a copy of the small verb phrase is obvious. The size of the tree then – to use the language of computer science – grows linearly, or on the order of n.
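The construction in Game 1 can be sketched in a few lines of Python. This is only an illustration: the function names `that_sentence` and `longest_run` are invented here, and the brackets are omitted. Note that, with the numbering used in the list above, sentence n carries a run of n − 1 consecutive occurrences of that, since sentence 1 contains none.

```python
def that_sentence(n: int) -> str:
    """Return the nth sentence of the Game 1 family (n >= 1), without brackets."""
    if n < 1:
        raise ValueError("n must be at least 1")
    sentence = "proposition P is true"
    # Each step embeds the previous sentence as the subject of a new one.
    for _ in range(n - 1):
        sentence = f"that {sentence} is obvious"
    return sentence[0].upper() + sentence[1:] + "."


def longest_run(sentence: str, word: str = "that") -> int:
    """Length of the longest run of consecutive occurrences of `word`."""
    best = current = 0
    for token in sentence.lower().rstrip(".").split():
        current = current + 1 if token == word else 0
        best = max(best, current)
    return best


for n in range(1, 5):
    s = that_sentence(n)
    # Columns: index, longest run of "that", word count, sentence.
    print(n, longest_run(s), len(s.split()), s)
```

Each embedding adds exactly three words (that, is, obvious), so the word count is 4 + 3(n − 1), which is the linear growth described above.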

Game 2: Construct a family of sentences 1, …, n, … such that for some word, in the nth sentence this word appears n times consecutively on two separate occasions, and such that the size of the sentences’ syntax trees grows exponentially, on the order of 2^n.

Solution: We use the coordinating conjunction and to join two copies of the earlier phrase within each successive embedding. Consider the family of sentences (embeddings are bracketed):

  1. Proposition P1 is true.
  2. That [proposition P1 is true] is obvious and that [proposition P2 is true] is obvious.
  3. That [that [proposition P1 is true] is obvious and that [proposition P2 is true] is obvious] is obvious, and that [that [proposition P3 is true] is obvious and that [proposition P4 is true] is obvious] is obvious.
  4. That [that [that [proposition P1 is true] is obvious and that [proposition P2 is true] is obvious] is obvious, and that [that [proposition P3 is true] is obvious and that [proposition P4 is true] is obvious] is obvious] is obvious, and that [that [that [proposition P5 is true] is obvious and that [proposition P6 is true] is obvious] is obvious, and that [that [proposition P7 is true] is obvious and that [proposition P8 is true] is obvious] is obvious] is obvious.
  n. I’ll leave this one out for your sake and mine. It will contain 2^(n-1) propositions.

The syntax tree of the nth sentence, for any n, resembles a balanced binary tree of height n. Each new sentence embeds the previous sentence twice, in two separate clauses joined by the conjunction and. The words “that” and “is obvious” flank each embedding. The two longest consecutive sequences of the word that occupy, respectively, the leftmost path of the entire tree and the leftmost path of the root’s right subtree. (Bonus question: what rule describes the comma placement?)
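The doubling construction of Game 2 can be sketched the same way. Again this is an illustration under stated assumptions: `and_sentence` and `that_runs` are names invented here, and the sketch leaves out brackets, capitalization, the final period and the commas (so the bonus question stays open).

```python
from itertools import count, groupby


def and_sentence(n: int, labels=None) -> str:
    """Body of the nth Game 2 sentence (n >= 1), without commas or brackets."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if labels is None:
        labels = count(1)  # numbers the propositions P1, P2, ... left to right
    if n == 1:
        return f"proposition P{next(labels)} is true"
    # Embed two copies of the previous level, joined by "and".
    left = and_sentence(n - 1, labels)
    right = and_sentence(n - 1, labels)
    return f"that {left} is obvious and that {right} is obvious"


def that_runs(sentence: str) -> list[int]:
    """Lengths of every maximal run of consecutive 'that', in order."""
    tokens = sentence.split()
    return [
        len(list(group))
        for is_that, group in groupby(tokens, key=lambda t: t == "that")
        if is_that
    ]


for n in range(1, 5):
    s = and_sentence(n)
    # Columns: index, number of propositions, word count, run lengths.
    print(n, s.count("proposition"), len(s.split()), that_runs(s))
```

In this sketch the number of propositions doubles at each step (1, 2, 4, 8, …) and the word count follows W(n) = 2·W(n − 1) + 7, so the sentences grow exponentially. As in Game 1, the two longest runs of that in sentence n have length n − 1 under the numbering used above.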

This exercise surely seems ridiculous. But behind the investigation and the games, an important point stands: grammaticality is quite a different condition from comprehensibility. Past that small region enclosed by the demands of comprehensibility, a much larger realm lies, where – as the complexity mounts – grammar continues to operate.

This world of the grammatical marches off far past our minds’ horizon.

  1. De Roeck et al. provide counter-examples to the “myth” that native speakers reject double center embeddings.
  2. James David Thomas’s excellent master’s thesis, “Center-embedding and Self-embedding in Human Language Processing”, was an invaluable resource in the writing of this post.

7 comments on “Wild Grammar”

  1. Ben says:

    My mother Nancy and I have constructed another interesting class of systematic embeddings.

    0. The man’s house is white.
    1. The man the dog bit’s house is white.
    2. The man the dog’s fangs bit’s house is white.
    3. The man the dog the cat’s claw scratched’s fangs bit’s house is white.
    4. The man the dog the cat the flea inhabits’ claw scratched’s fangs bit’s house is white.

    For any n, the nth sentence’s subject consists of a nested collection of n concentric center-embeddings followed by an apostrophe-s — which places the entire collection in the genitive case — and a nominative noun to which the entire genitive phrase is linked. Finally, a verb phrase completes the sentence.

    The pattern pervades the recursion tree. Each inner embedded sub-phrase’s subject consists of k < n center-embeddings and an apostrophe-s — which places the sub-collection in the genitive case — followed by a nominative noun to which the genitive phrase is linked. Finally, a verb links the subject to the larger enclosing subject one level higher.

    The result is that the sentence contains n nested layers of the genitive case, each a strict subset of the last.

    I should remark on a broader point. Languages (including, of course, those in which case marking is more robust) often feature sentences with nested cases. Interestingly, the case markers reflect only the innermost case. For example, in the sentence “I like the smell of those roses,” the entire direct object “the smell of those roses” is in the accusative case, while “those roses” is itself in the genitive case and also the accusative case. The genitive is the innermost case, and in languages with case-marking affixes, the case ostensibly attached to “those roses” will be the genitive one.

    This practice is not arbitrary. The presence of the accusative case has already been signaled in the affixes on “the smell”, and the outer layer may be maintained implicitly as the inner layer is introduced.

    This is surely another instance of that phenomenon whereby speech forces us to translate multi-layered linguistic constructions into a more “linear” form.

  2. Rachel says:

    This is a fun post! It really dances around the Chomsky hierarchy of formal languages (http://en.wikipedia.org/wiki/Chomsky_hierarchy). Your center embedding examples all demonstrate how English is (arguably) not a “regular language,” i.e., English sentences can’t be decided by a finite state acceptor. A language of the form a^nb^n (all strings with n instances of string a followed by n instances of string b) is not regular. (See: pumping lemma for regular languages.)
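    A minimal Python sketch of the a^nb^n point, with the function name `is_anbn` invented for illustration: the recognizer below needs a counter that can grow without bound, which is exactly the resource a finite state acceptor lacks.

```python
def is_anbn(s: str) -> bool:
    """True if s is n copies of 'a' followed by n copies of 'b' (n >= 0)."""
    depth = 0        # number of 'a's not yet matched by a 'b'; unbounded
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:          # an 'a' after a 'b' breaks the pattern
                return False
            depth += 1
        elif ch == "b":
            seen_b = True
            depth -= 1
            if depth < 0:       # more 'b's than 'a's so far
                return False
        else:
            return False
    return depth == 0


print(is_anbn("aaabbb"), is_anbn("aab"), is_anbn("abab"))
```

    Capping `depth` at any fixed value would make the recognizer finite-state, but it would then reject some sufficiently long strings of the right form.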

    There’s a lot more debate about whether natural languages are in the class of context-free languages (the next level above regular languages in Chomsky’s hierarchy). In languages with (arbitrary length) cross-serial dependencies (http://en.wikipedia.org/wiki/Cross-serial_dependencies), you can construct interesting “families” of sentences that are non-context-free. The famous paper by Shieber arguing that natural languages are non-context-free, based on an example from Swiss German: http://www.eecs.harvard.edu/shieber/Biblio/Papers/shieber85.pdf

    Of course, if you maintain that human memory is finite and so these constructions are bounded, then natural languages have to be regular. But that’s really no fun at all. :-)

  3. Hi Ben,

    Just, after a 2-year hiatus, reading your well-argued stuff again. Perhaps you might care to answer the most basic question about grammar. Wiki says: “grammar (From Greek: γραμματική) is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language.” So, can you give a single **rule** from any linguist’s grammar that tells a mind of any kind, conscious and/or unconscious, how to construct a single sentence on any subject? (Doesn’t matter what the quality – just a grammar rule from anyone – that actually dictates how to compose a sentence on a subject). A “rule” would have to be a rule such as obtains in logical or algorithmic systems – not what people often confuse with it – a “principle” like “Write with economy.” The rule must specify words and order if not punctuation.

    I’m wondering whether the “rules” of grammar are rather like *evolutionary algorithms* – something of which a great many scientists talk about a great deal as unquestionable realities, but of which no one could even begin to give an example.

    If it helps you may take the subject as a cat sitting on a mat – or Trump having a conversation with Kushner.

    • Ben says:

      Hi Rafa, how’s it going. A good question, and I’ll try to suggest what I can.

      When you ask about a “rule… how to construct a single sentence” the idea of generative grammar comes to mind. The idea here is to begin with an abstract sentence and then to proceed using a hierarchical process of replacements. For example, this sentence could be replaced by a noun phrase followed by a verb phrase. This noun phrase could then in turn be replaced by a determiner phrase and a noun, while the verb phrase could be replaced by a verb and then a prepositional phrase. Filling in the categories with concrete words, we finally end up with sentences like “The cat sits on the mat” or “Trump has a conversation with Kushner”.
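      The replacement process described above can be sketched as a toy context-free grammar in Python. Everything here is an assumption made for illustration: the rule table `RULES`, the category names and the tiny vocabulary are invented, and the extra rule VP → V NP PP is added only so that the second example sentence can be derived.

```python
from itertools import product

# Toy rewrite rules (hypothetical, for illustration only).
# Each category maps to a list of possible replacements.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "PP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "Det": [["the"], ["a"]],
    "N": [["cat"], ["mat"], ["conversation"]],
    "Name": [["Trump"], ["Kushner"]],
    "V": [["sits"], ["has"]],
    "P": [["on"], ["with"]],
}


def language(symbol):
    """Yield every word sequence derivable from `symbol`.

    This terminates because the toy grammar has no recursive rules.
    """
    if symbol not in RULES:          # a concrete word: nothing to replace
        yield (symbol,)
        return
    for expansion in RULES[symbol]:
        # Combine every derivation of each part of the replacement.
        for parts in product(*(list(language(s)) for s in expansion)):
            yield tuple(word for part in parts for word in part)


sentences = {" ".join(words) for words in language("S")}
print("the cat sits on the mat" in sentences)
print("Trump has a conversation with Kushner" in sentences)
print("a mat has Kushner with the cat" in sentences)
```

      All three checks print True: the rules license the two example sentences, but they equally license strings that are well-formed and meaningless, since nothing in the rule table refers to a subject matter.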

      The theory of generative grammar states, essentially, that if we could ever manage to describe in enough detail what sorts of replacements are possible, then the resulting process would give us the rule that states exactly which sentences are grammatical, and also how to construct them. This was essentially the work of Chomsky in the 1950s; he also connected these ideas with context-free grammars, tools from computer science that allow us to more formally discuss exactly what these generative grammars are doing. So, in short, there are very much rules about how to construct grammatical sentences.

      On the other hand, you ask “how to construct a single sentence on any subject?” (italics mine), and from this it seems like you’d like to fix a subject matter beforehand and then construct sentences about that subject. This seems trickier, and can’t be done using the theory of context-free grammars alone—which constructs only grammatical sentences, but not ones on particular subject (or that have any meaning at all).

      That said, though I’m not familiar with the details here, I’m pretty sure people working on neural networks are trying to do just that. Though I’m guessing the rules are complicated, you can see an example of the results here – HAHAHA!

      • Very helpful answer, which essentially says – correctly I imagine – “no one has thought of any rules that would apply to real language use (real language used by real people in the real world), only about fantasies of an abstract, logical alternative to language.”

        In fact – is it fair to say? – generative grammar and the like (possibly including cog. ling. grammars) can only be applied anyway to the reading and consumption of language/sentences, and not to the writing and production of language/sentences at all – and then only to limited sets of “toy sentences”.

        “Toy sentences” are sentences specially chosen for their apparently logic-like construction (except that even then they are fundamentally different). And linguistics’ reliance on these is comparable to AGI’s reliance on “toy blocks worlds” as testing grounds for its would-be human/animal machines. In both cases, regular, repeated routine sentences and blocks in toy worlds, are chosen because they are the only ones that logic and algorithms can deal with.

        These are certainly parts of both language and the world, but mainly language and conceptual systems are designed to perceive, conceive and achieve in the real, creative world which continually produces new irregular, nonroutine forms and language usage.

      • Ben says:

        Hi Rafa,

        While you’re very right that organic language “continually produces new irregular, nonroutine forms and language usage”, I’d argue that your suggestion that today’s generative grammars and computational-linguistic tools produce only “toy sentences” is a bit unfair. For the term “toy sentences” implies—by design—that the sentences so produced are rigid, awkward, simplistic and unrealistic. And yet it seems that today’s technology has taken us well beyond this stage.

        To name a particular sort of example that comes to mind, it famously happens quite often that “academic” writing generated by computers is sufficiently convincing to be accepted into (disreputable) scholarly journals. The Google search “computer generated paper peer reviewed journal” yields dozens and dozens of results of this kind; one example is available here. You can try the “paper” generator yourself here.

        These are hardly toy sentences! And I’m sure there is plenty of even more sophisticated computer generated language out there.

        Of course, all of this depends on our definition of “toy sentence”—at the end of the day, it is a subjective judgment. Yet my feelings as to this judgment are that the term doesn’t capture the sophistication of the computational language and speech processing technology of today.

  4. Hi Ben, I’m aware of this sort of thing, though thanks for these particular examples. This is different from the linguistics theory I’m talking about. These are algorithmic analyses and reproduction/variation of a given body of texts of a certain linguistic complexity – like lit/sci papers or, say, a limited genre of fairy tales. They tend to superficial plausibility – but, in the final analysis, nonsense. They don’t really prove anything other than that there *is* a high degree of consistency and repetition within *parts* of real texts/passages/streams of language – sufficient that you can create initially plausible simulations as above, or, possibly, successfully automate certain basic language tasks, like, say, weather reports or simple football/sports/news reports etc. Similarly, NLP is a wonderful achievement that is built on analysing these consistencies within texts and across languages. But it is only really a language aid, not real language use; it can’t deal with the whole of language texts, only parts.

    I’m talking about the sentences which form the basis of linguistics theorising and analytical philosophy theorising for the last 100 years (and actually many of your own examples in articles here). They are all essentially “Jack and Jill went up the hill” sentences, and reading through any of these books, both linguistic and philosophical, has a weird feeling because they feel so artificial.

    These are all logical or quasi-logical sentences that, even if unconsciously, start from the idea that language is logical, even mathematical and algorithmic.

    The initial thing that’s so artificial about these sentences (which are mainly of course perfectly real occurrences) is that there are no obviously non-logical, not-really-“grammatical” sentences or word combinations, which also occur in great abundance in real world language. Like say:

    “Lolita, light of my life, fire of my loins. My sin, my soul. Lo-lee-ta: the tip of the tongue taking a trip of three steps down the palate to tap, at three, on the teeth. Lo. Lee. Ta.”

    Or scan the headlines and articles on any Goog. News page


    Or look at Powerpoint presentations. Or broken conversations. Or disjointed streams of thought.

    Well-formed, well-spelled, grammatical sentences are not the whole of language.

    The main thing that is quite simply preposterous about ALL language theorising of all kinds is that it is *sentence* based as opposed to the natural form of language which is multi-sentence & language units, i.e. TEXTS, passages, speeches, streams of thought or inner monologue, powerpoint pages etc etc.. We have a grammar of sentences, but no grammar of texts

    And if you don’t study language whole-as-well-as-parts you miss the PRIMARY nature of language which is that it is CREATIVE and as such the opposite of rational, i.e. the opposite of lawful-and-formulaic, as in logic, maths and algorithms. It *includes* rational, formulaic units, but *as a whole* it is the diametrical opposite.

    Take any text in any genre you like. A tec story. A news article on Trump/politics. A speech on the weather in conversation… What’s the first sentence you have to utter? Make that easier, if say the first sentence is “Trump has shown definitively that he cannot be trusted.” what’s the 2nd sentence, and then the third ad finem? There are none. You can repeat this exercise for any text under the sun, incl. logic & sci papers..

    Language is a creative, informal genre – the diametrical opposite of any rational genre, like a logical argument or a mathematical computation, which are formal, and fully definable. There are no set words or topics for any text, no definable structure of sentence or language-unit construction, no set order, no set quantities. There are normally constraints, models, (conflicting) conventions, some much but not always used parts, but no laws and no formulae or, therefore, algos for language. “Hell, there are no rules here, we’re trying to get something expressed”.

    Put that formally/informally in a form any rational thinker should and must understand (though many will still have difficulty absorbing). Language is a patchwork genre – the opposite of the pattern genres that unite logic, maths and algos. Patterns always stay the same – are variations on the same elements. Collections of patchworks never stop introducing new elements. Rationalists want language to be logical & mathematical like any 22 x 44 computation. Actually language is like number collages – continually changing.

    There are no rules for what parts you put in a patchwork or collage. There are constraints in that it is normally accepted that it should be somewhat similar to other related patchworks. But there are no rules. The only “rule” here, as in all forms of language-based art forms and all arts period, is that you must include new and different elements, must be at least somewhat surprising and non-formulaic and unpredictable.

    And all forms of real world language like all forms of arts very successfully conform to those nonconformist principles. They really are creative and evernew and the diametrical opposite of, say, basic mathematical computations and logical inferences which are everold and the same.

    This isn’t empirically arguable. No one is going to begin to make out that they can predict the first or nth sentence of any language text, any more than s.o. will make out they can explain a perpetual motion machine. They won’t get to first base/sentence.

    No one and nothing can formularise or predict the new, only the old.

    And so you can see why the argument that you and others make that you have to start linguistics analysis somewhere solid, and that is enough for now – just isn’t valid.

    You also have simultaneously to look at language as whole texts, because if you don’t, you miss, like linguistics and APhil do, that language is a) creative as a whole, and b) contains many creative, non-logical parts as well (like single word sentences/units.)

    Formally, you could say, language is like a *patterned patchwork*:

    It is partly rational with partly patterned parts, but as a whole it is mainly creative with creative, patchy parts too. It is creative/rational.

    The only way you can delude yourself it is purely rational and lawful, is as linguists and AP-ers have done, by studying only rational, logic-like parts/sentences.

    The moment you study texts that pretence falls apart immediately. It’s as absurd as behaviourism’s refusal to study the mind.

    P.S. My initial question to you began from a supplementary and important realisation. I had long ago realised that real world texts of any kind are creative, informal and unpredictable as to any sentence or unit.

    What I have also started to realise is that *even if you take a sentence IN ISOLATION* – as linguistics has done – you STILL cannot predict it. IOW if you ask how a sentence on a given subject is to be *composed* – e.g. the classic subject of cat sitting on mat – that sentence is first and foremost creative and unpredictable. There is no necessary form it must take, from any POV. The writer/composer may choose to give it a routine form, but is under no obligation.


    and an infinity of other diversifications, may all be also acceptable.

    As for the most basic elements of language – concepts – they are absolutely creative and the opposite of all rational systems. But that, you may be glad to hear, is another story, for another time.
