The Now-or-Never Bottleneck: A Fundamental Constraint on Language
Morten H. Christiansen and Nick Chater
BEHAVIORAL AND BRAIN SCIENCES (2016), Page 1 of 72, doi:10.1017/S0140525X1500031X, e62

The Now-or-Never bottleneck: A fundamental constraint on language

Morten H. Christiansen
Department of Psychology, Cornell University, Ithaca, NY 14853; The Interacting Minds Centre, Aarhus University, 8000 Aarhus C, Denmark; Haskins Laboratories, New Haven, CT 06511
[email protected]

Nick Chater
Behavioural Science Group, Warwick Business School, University of Warwick, Coventry, CV4 7AL, United Kingdom
[email protected]

Abstract: Memory is fleeting. New material rapidly obliterates previous material. How, then, can the brain deal successfully with the continual deluge of linguistic input? We argue that, to deal with this "Now-or-Never" bottleneck, the brain must compress and recode linguistic input as rapidly as possible. This observation has strong implications for the nature of language processing: (1) the language system must "eagerly" recode and compress linguistic input; (2) as the bottleneck recurs at each new representational level, the language system must build a multilevel linguistic representation; and (3) the language system must deploy all available information predictively to ensure that local linguistic ambiguities are dealt with "Right-First-Time"; once the original input is lost, there is no way for the language system to recover. This is "Chunk-and-Pass" processing. Similarly, language learning must also occur in the here and now, which implies that language acquisition is learning to process, rather than inducing, a grammar. Moreover, this perspective provides a cognitive foundation for grammaticalization and other aspects of language change. Chunk-and-Pass processing also helps explain a variety of core properties of language, including its multilevel representational structure and duality of patterning. This approach promises to create a direct relationship between psycholinguistics and linguistic theory. More generally, we outline a framework within which to integrate often disconnected inquiries into language processing, language acquisition, and language change and evolution.

Keywords: chunking; grammaticalization; incremental interpretation; language acquisition; language evolution; language processing; online learning; prediction; processing bottleneck; psycholinguistics

MORTEN H. CHRISTIANSEN is Professor of Psychology and Co-Director of the Cognitive Science Program at Cornell University as well as Senior Scientist at the Haskins Labs and Professor of Child Language at the Interacting Minds Centre at Aarhus University. He is the author of more than 170 scientific papers and has written or edited five books. His research focuses on the interaction of biological and environmental constraints in the processing, acquisition, and evolution of language, using a combination of computational, behavioral, and cognitive neuroscience methods. He is a Fellow of the Association for Psychological Science, and he delivered the 2009 Nijmegen Lectures.

NICK CHATER is Professor of Behavioural Science at Warwick Business School, United Kingdom. He is the author of more than 250 scientific publications in psychology, philosophy, linguistics, and cognitive science.

1. Introduction

Language is fleeting. As we hear a sentence unfold, we rapidly lose our memory for preceding material. Speakers, too, soon lose track of the details of what they have just said. Language processing is therefore "Now-or-Never": If linguistic information is not processed rapidly, that information is lost for good. Importantly, though, while fundamentally shaping language, the Now-or-Never bottleneck1 is not specific to language but instead arises from general principles of perceptuo-motor processing and memory.

The existence of a Now-or-Never bottleneck is relatively uncontroversial, although its precise character may be debated. However, in this article we argue that the consequences of this constraint for language are remarkably far-reaching, touching on the following issues:

1. The multilevel organization of language into sound-based units, lexical and phrasal units, and beyond;
2. The prevalence of local linguistic relations (e.g., in phonology and syntax);
3. The incrementality of language processing;
4. The use of prediction in language interpretation and production;
5. The nature of what is learned during language acquisition;
6. The degree to which language acquisition involves item-based generalization;
7. The degree to which language change proceeds item-by-item;
8. The connection between grammar and lexical knowledge;
9. The relationships between syntax, semantics, and pragmatics.

Thus, we argue that the Now-or-Never bottleneck has fundamental implications for key questions in the language sciences. The consequences of this constraint are, moreover, incompatible with many theoretical positions in linguistic, psycholinguistic, and language acquisition research. Note, however, that arguing that a phenomenon arises from the Now-or-Never bottleneck does not necessarily undermine alternative explanations of that phenomenon (although it may). Many phenomena in language may simply be overdetermined. For example, we argue that incrementality (point 3, above) follows from the Now-or-Never bottleneck. But it is also possible that, irrespective of memory constraints, language understanding would still be incremental on functional grounds, to extract the linguistic message as rapidly as possible. Such counterfactuals are, of course, difficult to evaluate. By contrast, the properties of the Now-or-Never bottleneck arise from basic information processing limitations that are directly testable by experiment. Moreover, the Now-or-Never bottleneck should, we suggest, have methodological priority to the extent that it provides an integrated framework for explaining many aspects of language structure, acquisition, processing, and evolution that have previously been treated separately.

In Figure 1, we illustrate the overall structure of the argument in this article. We begin, in the next section, by briefly making the case for the Now-or-Never bottleneck as a general constraint on perception and action. We then discuss the implications of this constraint for language processing, arguing that both comprehension and production involve what we call "Chunk-and-Pass" processing: incrementally building chunks at all levels of linguistic structure as rapidly as possible, using all available information predictively to process current input before new information arrives (sect. 3). From this perspective, language acquisition involves learning to process: that is, learning rapidly to create and use chunks appropriately for the language being learned (sect. 4). Consequently, short-term language change and longer-term processes of language evolution arise through variation in the system of chunks and their composition, suggesting an item-based theory of language change (sect. 5). This approach points to a processing-based interpretation of construction grammar, in which constructions correspond to chunks, and where grammatical structure is fundamentally the history of language processing operations within the individual speaker/hearer (sect. 6). We conclude by briefly summarizing the main points of our argument.

2. The Now-or-Never bottleneck

Language input is highly transient. Speech sounds, like other auditory signals, are short-lived. Classic speech perception studies have shown that very little of the auditory trace remains after 100 ms (Elliott 1962), with more recent studies indicating that much acoustic information already is lost after just 50 ms (Remez et al. 2010). Similarly, and of relevance for the perception of sign language, studies of visual change detection suggest that the ability to maintain visual information beyond 60–70 ms is very limited (Pashler 1988). Thus, sensory memory for language input is quickly overwritten, or interfered with, by new incoming information, unless the perceiver in some way processes what is heard or seen.

The problem of the rapid loss of the speech or sign signal is further exacerbated by the sheer speed of the incoming linguistic input. At a normal speech rate, speakers produce about 10–15 phonemes per second, corresponding to roughly 5–6 syllables every second or 150 words per minute (Studdert-Kennedy 1986). However, the resolution of the human auditory system for discrete auditory events is only about 10 sounds per second, beyond which the sounds fuse into a continuous buzz (Miller & Taylor 1948). Consequently, even at normal rates of speech, the language system needs to work beyond the limits of auditory temporal resolution for nonspeech stimuli. Remarkably, listeners can learn to process speech in their native language at up to twice the normal rate without much decrement in comprehension (Orr et al. 1965). Although the production of signs appears to be slower than the production of speech (at least when comparing the production of ASL signs and spoken English; Bellugi & Fischer 1972), signed words are still very brief visual events, with the duration of an ASL syllable being about a quarter of a second (Wilbur & Nolen 1986).2

Making matters even worse, our memory for sequences of auditory input is also very limited. For example, it has been known for more than four decades that naïve listeners are unable to correctly recall the temporal order of just four distinct sounds – for example, hisses, buzzes, and tones – even when they are perfectly able to recognize and label each individual sound in isolation (Warren et al. 1969). Our ability to recall well-known auditory stimuli is not substantially better, ranging from 7 ± 2 (Miller 1956) to 4 ± 1 (Cowan 2000). A similar limitation applies to visual memory for sign language (Wilson & Emmorey 2006). The poor memory for auditory and visual information, combined with the fast and fleeting nature of linguistic input, imposes a fundamental constraint on the language system: the Now-or-Never bottleneck. If the input is not processed immediately, new information will quickly overwrite it.

Importantly, the Now-or-Never bottleneck is not unique to language but applies to other aspects of perception and action as well.
Sensory memory is rich in detail but decays science, and he has written or edited ten books. He rapidly unless it is further processed (e.g., Cherry 1953; has served as Associate Editor for Cognitive Science, Coltheart 1980; Sperling 1960). Likewise, short-term Psychological Review, Psychological Science, and Man- memory for auditory, visual, and haptic information is agement Science. His research explores the cognitive also limited and subject to interference from new input and social foundations of human rationality, focusing on formal models of inference, choice, and language. (e.g., Gallace et al. 2006; Haber 1983; Pavani & Turatto He is a Fellow of the Cognitive Science Society, the 2008). Moreover, our cognitive ability to respond to Association for Psychological Science, and the British sensory input is further constrained in a serial (Sigman & Academy. Dehaene 2005) or near-serial (Navon & Miller 2002) manner, severely restricting our capacity for processing 2 BEHAVIORAL AND BRAIN SCIENCES, 39 (2016) https://doi.org/10.1017/S0140525X1500031X Published online by Cambridge University Press Christiansen & Chater: The Now-or-Never bottleneck: A fundamental constraint on language Figure 1. The structure of our argument, in which implicational relations between claims are denoted by arrows. The Now-or-Never bottleneck provides a fundamental constraint on perception and action that is independent of its application to the language system (and hence outside the diamond in the figure). Specific implications for language (indicated inside the diamond) stem from the Now-or-Never bottleneck’s necessitating of Chunk-and-Pass language processing, with key consequences for language acquisition. The impact of the Now-or-Never bottleneck on both processing and acquisition together further shapes language change. All three of these interlinked claims concerning Chunk-and-Pass processing, acquisition as processing, and item-based language change (grouped together in the shaded upper triangle) combine to shape the structure of language itself. multiple inputs arriving in quick succession. Similar limita- face-to-face, was surreptitiously exchanged for a complete- tions apply to the production of behavior: The cognitive ly different person (Simons & Levin 1998). Information not system cannot plan detailed sequences of movements – a encoded in the short amount of time during which the long sequence of commands planned far in advance sensory information is available will be lost. would lead to severe interference and be forgotten Second, because memory limitations also apply to before it could be carried out (Cooper & Shallice 2006; recoded representations, the cognitive system further Miller et al. 1960). However, the cognitive system adopts chunks the compressed encodings into multiple levels of several processing strategies to ameliorate the effects of representation of increasing abstraction in perception, the Now-or-Never bottleneck on perception and action. and decreasing levels of abstraction in action. Consider, First, the cognitive system engages in eager processing: for example, memory for serially ordered symbolic infor- It must recode the rich perceptual input as it arrives to mation, such as sequences of digits. Typically, people are capture the key elements of the sensory information as eco- quickly overloaded and can recall accurately only the last nomically, and as distinctively, as possible (e.g., Brown three or four items in a sequence (e.g., Murdock 1968). et al. 
2007; Crowder & Neath 1991); and it must do so But it is possible to learn to rapidly encode, and recall, rapidly, before new input overwrites or interferes with long random sequences of digits, by successively chunking the sensory information. This notion is a traditional one, such sequences into larger units, chunking those chunks dating back to early work on attention and sensory into still larger units, and so on. Indeed, an extended memory (e.g., Broadbent 1958; Coltheart 1980; Haber study of a single individual, SF (Ericsson et al. 1980), 1983; Sperling 1960; Treisman 1964). The resulting com- showed that repeated chunking in this manner makes it pressed representations are lossy: They provide only an ab- possible to recall with high accuracy sequences containing stract summary of the input, from which the rich sensory as many as 79 digits. But, crucially, this strategy requires input cannot be recovered (e.g., Pani 2000). Evidence learning to encode the input into multiple, successive, from the phenomena of change and inattentional blindness and distinct levels of representations – each sequence of suggests that these compressed representations can be very chunks at one level must be shifted as a single chunk to a selective (see Jensen et al. 2011 for a review), as exempli- higher level before more chunks interfere with or overwrite fied by a study in which half of the participants failed to the initial chunks. Indeed, SF chunked sequences of three notice that someone to whom they were giving directions, or four digits, the natural chunk size in human memory BEHAVIORAL AND BRAIN SCIENCES, 39 (2016) 3 https://doi.org/10.1017/S0140525X1500031X Published online by Cambridge University Press Christiansen & Chater: The Now-or-Never bottleneck: A fundamental constraint on language (Cowan 2000), into a single unit (corresponding to running object requires anticipating the grip force required to times, dates, or human ages), and then grouped sequences deal with the loads generated by the accelerations of the of three to four of those chunks into larger chunks. Inter- object. Grip force is adjusted too rapidly during the manip- estingly, SF also verbally produced items in overtly discern- ulation of an object to rely on sensory feedback (Flanagan ible chunks, interleaved with pauses, indicating how action & Wing 1997). Indeed, the rapid prediction of the sensory also follows the reverse process (e.g., Lashley 1951; Miller consequences of actions (e.g., Poulet & Hedwig 2006) sug- 1956). The case of SF further demonstrates that low-level gests the existence of so-called forward models, which allow information is far better recalled when organized into the brain to predict the consequence of its actions in real higher-level structures than merely coded as an unorga- time. Many have argued (e.g., Wolpert et al. 2011; see nized stream. Note, though, that lower-level information also Clark 2013; Pickering & Garrod 2013a) that forward is typically forgotten; it seems unlikely that even SF could models are a ubiquitous feature of the computational ma- recall the specific visual details of the digits with which chinery of motor control and more broadly of cognition. he was presented. 
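To make the chunking strategy concrete, the following Python sketch recodes a digit sequence into chunks of a few items, and then into chunks of chunks, so that only a handful of units need to be held at any one level. It is illustrative only: the chunk size, number of levels, and example sequence are assumptions for the demonstration, not data from the article, and SF's actual recodings (running times, dates, ages) were far richer.

```python
# Illustrative sketch only: hierarchical chunking of a digit sequence,
# in the spirit of SF's strategy. Chunk size, level count, and the example
# sequence are assumptions for the demo.
def chunk(seq, size=4):
    """Group a flat sequence into chunks of at most `size` adjacent items."""
    return [tuple(seq[i:i + size]) for i in range(0, len(seq), size)]

def chunk_hierarchy(items, size=4, levels=2):
    """Return the sequence recoded at each of `levels` successive levels."""
    coded = [list(items)]
    for _ in range(levels):
        coded.append(chunk(coded[-1], size))
    return coded

digits = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
for level, units in enumerate(chunk_hierarchy(digits)):
    # Each recoding leaves fewer, larger units to maintain at that level.
    print(f"level {level}: {len(units)} units")
```

The point of the sketch is simply that each recoding step trades many small, interference-prone units for a few larger ones, which is what allows long sequences to survive a severely limited within-level memory.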
More generally, the notion that percep- The three processing strategies we mention here – eager tion and action involve representational recoding at a suc- processing, computing multiple representational levels, cession of distinct representational levels also fits with a and anticipation – provide the cognitive system with impor- long tradition of theoretical and computational models in tant means to cope with the Now-or-Never bottleneck. cognitive science and computer vision (e.g., Bregman Next, we argue that the language system implements 1990; Marr 1982; Miller et al. 1960; Zhu et al. 2010; see similar strategies for dealing with the here-and-now Gobet et al. 2001 for a review). Our perspective on repeat- nature of linguistic input and output, with wide-reaching ed multilevel compression is also consistent with data from and fundamental implications for language processing, ac- functional magnetic resonance imaging (fMRI) and intra- quisition and change as well as for the structure of language cranial recordings, suggesting cortical hierarchies across itself. Specifically, we propose that our ability to deal with vision and audition – from low-level sensory to high-level sequences of linguistic information is the result of what perceptual and cognitive areas – integrating information we call “Chunk-and-Pass” processing, by which the lan- at progressively longer temporal windows (Hasson et al. guage system can ameliorate the effects of the Now-or- 2008; Honey et al. 2012; Lerner et al. 2011). Never bottleneck. More generally, our perspective offers Third, to facilitate speedy chunking and hierarchical a framework within which to approach language compre- compression, the cognitive system employs anticipation, hension and production. Table 1 summarizes the impact using prior information to constrain the recoding of of the Now-or-Never bottleneck on perception/action and current perceptual input (for reviews see Bar 2007; Clark language. 2013). For example, people see the exact same collection The style of explanation outlined here, focusing on pro- of pixels either as a hair dryer (when viewed as part of a cessing limitations, contrasts with a widespread interest in bathroom scene) or as a drill (when embedded in a rational, rather processing-based, explanations in cognitive picture of a workbench) (Bar 2004). Therefore, using science (e.g., Anderson 1990; Chater et al. 2006 Griffiths & prior information to predict future input is likely to be es- Tenenbaum 2009; Oaksford & Chater 1998; 2007; Tenen- sential to successfully encoding that future input (as well baum et al. 2011), including language processing (Gibson as helping us to react faster to such input). Anticipation et al. 2013; Hale 2001; 2006; Piantadosi et al. 2011). allows faster, and hence more effective, recoding when on- Given the fundamental nature of the Now-or-Never bottle- coming information creates considerable time urgency. neck, we suggest that such explanations will be relevant Such predictive processing will be most effective to the only for explaining language use insofar as they incorporate extent that the greatest possible amount of available infor- processing constraints. For example, in the spirit of rational mation (across different types and levels of abstraction) is analysis (Anderson 1990) and bounded rationality (Simon integrated as fast as possible. Similarly, anticipation is im- 1982), it is natural to view aspects of language processing portant for action as well. 
For example, manipulating an and structure, as described below, as “optimal” responses Table 1. Summary of the Now-or-Never bottleneck’s implications for perception/action and language Strategies Mechanisms Perception and action Language Eager processing Lossy chunking Chunking in memory and action (Lashley Incremental interpretation (Bever 1970) 1951; Miller 1956); lossy descriptions and production (Meyer 1996); multiple (Pani 2000) constraints satisfaction (MacDonald et al. 1994) Multiple levels of Hierarchical compression Hierarchical memory (Ericsson et al. Multiple levels of linguistic structure representation 1980), action (Miller et al. 1960), (e.g., sound-based, lexical, phrasal, problem solving (Gobet et al. 2001) discourse); local dependencies (Hawkins 2004) Anticipation Predictive processing Fast, top-down visual processing (Bar Syntactic prediction (Jurafsky 1996); 2004); forward models in motor multiple-cue integration (Farmer et al. control (Wolpert et al. 2011); 2006); visual world (Altmann & predictive coding (Clark 2013) Kamide 1999) 4 BEHAVIORAL AND BRAIN SCIENCES, 39 (2016) https://doi.org/10.1017/S0140525X1500031X Published online by Cambridge University Press Christiansen & Chater: The Now-or-Never bottleneck: A fundamental constraint on language to specific processing limitations, such as the Now-or- decreasing linguistic abstraction until the system arrives Never bottleneck (for this style of approach, see, e.g., at chunks with sufficient information to drive the articula- Chater et al. 1998; Levy 2008). Here, though, our focus tors (either the vocal apparatus or the hands). As in com- is primarily on mechanism rather than rationality. prehension, memory is limited within a given level of representation, resulting in potential interference between the items to be produced (e.g., Dell et al. 1997). 3. Chunk-and-Pass language processing Thus, higher-level chunks tend to be passed down immedi- ately to the level below as soon as they are “ready,” leading The fleeting nature of linguistic input, in combination with to a bias toward producing easy-to-retrieve utterance com- the impressive speed with which words and signs are pro- ponents before harder-to-retrieve ones (e.g., Bock 1982; duced, imposes a severe constraint on the language MacDonald 2013). For example, if there is a competition system: the Now-or-Never bottleneck. Each new incoming between two possible words to describe an object, the word or sign will quickly interfere with previous heard and word that is retrieved more fluently will immediately be seen input, providing a naturalistic version of the masking passed on to lower-level articulatory processes. To further used in psychophysical experiments. How, then, is language facilitate production, speakers often reuse chunks from comprehension possible? Why doesn’t interference the ongoing conversation, and those will be particularly between successive sounds (or signs) obliterate linguistic rapidly available from memory. This phenomenon is re- input before it can be understood? The answer, we flected by the evidence for lexical (e.g., Meyer & Schvane- suggest, is that our language system rapidly recodes this veldt 1971) and structural priming (e.g., Bock 1986; Bock & input into chunks, which are immediately passed to a Loebell 1990; Pickering & Branigan 1998; Potter & Lom- higher level of linguistic representation. 
The chunks at this bardi 1998) within individuals as well as alignment across higher level are then themselves subject to the same conversational partners (Branigan et al. 2000; Pickering & Chunk-and-Pass procedure, resulting in progressively Garrod 2004); priming is also extensively observed in text larger chunks of increasing linguistic abstraction. Crucially, corpora (Hoey 2005). As noted by MacDonald (2013), given that the chunks recode increasingly larger stretches of these memory-related factors provide key constraints on input from lower levels of representation, the chunking the production of language and contribute to cross-linguis- process enables input to be maintained over ever-larger tic patterns of language use.4 temporal windows. It is this repeated chunking of lower- A useful analogy for language production is the notion of level information that makes it possible for the language “just-in-time”5 stock control, in which stock inventories are system to deal with the continuous deluge of input that, if kept to a bare minimum during the manufacturing process not recoded, is rapidly lost. This chunking process is also (Ohno & Mito 1988). Similarly, the Now-or-Never bottle- what allows us to perceive speech at a much faster rate neck requires that, for example, low-level phonetic or artic- than nonspeech sounds (Warren et al. 1969): We have ulatory decisions not be made and stored far in advance and learned to chunk the speech stream. Indeed, we can easily then reeled off during speech production, because any understand (and sometimes even repeat back) sentences buffer in which such decisions can safely be stored would consisting of many tens of phonemes, despite our severe quickly be subject to interference from subsequent materi- memory limitations for sequences of nonspeech sounds. al. So the Now-or-Never bottleneck requires that once de- What we are proposing is that during comprehension, tailed production information has been assembled, it be the language system – similar to SF – must keep on chunk- executed straightaway, before it can be obliterated by the ing the incoming information into increasingly abstract oncoming stream of later low-level decisions, similar to levels of representation to avoid being overwhelmed by what has been suggested for motor planning (Norman & the input. That is, the language system engages in eager Shallice 1986; see also MacDonald 2013). We call this pro- processing when creating chunks. Chunks must be built posal Just-in-Time language production. right away, or memory for the input will be obliterated by interference from subsequent material. If a phoneme or 3.1. Implications of Strategy 1: Incremental processing syllable is recognized, then it is recoded as a chunk and passed to a higher level of linguistic abstraction. And Chunk-and-Pass processing has important implications for once recoded, the information is no longer subject to inter- comprehension and production: It requires that both take ference from further auditory input. A general principle of place incrementally. In incremental processing, representa- perception and memory is that interference arises primarily tions are built up as rapidly as possible as the input is en- between overlapping representations (Crowder & Neath countered. By contrast, one might, for example, imagine 1991; Treisman & Schmidt 1982); crucially, recoding a parser that waits until the end of a sentence before begin- avoids such overlap. 
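A minimal sketch of this idea is given below; it is illustrative only, and the toy lexicon, phrase inventory, and buffer limit are assumptions rather than claims from the article. Sound-level units are recoded into word-level chunks as soon as a word is recognized, the sound-level detail is then discarded, and word-level chunks are in turn passed up as phrase-level chunks, so that unrecoded material at any level is quickly lost.

```python
# Illustrative sketch only: a Chunk-and-Pass style pipeline. A small buffer at
# each level is recoded into a higher-level chunk as soon as one is recognized;
# the lower-level detail is then discarded (lossy). Lexicon, phrase inventory,
# and buffer size are toy assumptions.
LEXICON = {("th", "e"): "the", ("d", "o", "g"): "dog", ("r", "a", "n"): "ran"}
PHRASES = {("the", "dog"): "NP[the dog]", ("ran",): "VP[ran]"}

def chunk_and_pass(phonemes, max_buffer=4):
    sound_buf, word_buf, phrase_out = [], [], []
    for ph in phonemes:                        # input arrives one unit at a time
        sound_buf.append(ph)
        for n in range(len(sound_buf), 0, -1):
            key = tuple(sound_buf[-n:])
            if key in LEXICON:                 # recognize a word-level chunk ...
                word_buf.append(LEXICON[key])
                sound_buf.clear()              # ... and discard the sound detail
                break
        if len(sound_buf) > max_buffer:        # unrecoded material is lost
            sound_buf.pop(0)
        for n in range(len(word_buf), 0, -1):
            key = tuple(word_buf[-n:])
            if key in PHRASES:                 # pass word chunks up to phrases
                phrase_out.append(PHRASES[key])
                del word_buf[-n:]
                break
    return phrase_out

print(chunk_and_pass(["th", "e", "d", "o", "g", "r", "a", "n"]))
# -> ['NP[the dog]', 'VP[ran]']
```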
For example, phonemes interfere ning syntactic analysis, or that meaning is computed only with each other, but phonemes interfere very little with once syntax has been established. However, such process- words. At each level of chunking, information from the pre- ing would require storing a stream of information at a vious level(s) is compressed and passed up as chunks to the single level of representation, and processing it later; but next level of linguistic representation, from sound-based given the Now-or-Never bottleneck, this is not possible chunks up to complex discourse elements.3 As a conse- because of severe interference between such representa- quence, the rich detail of the original input can no longer tions. Therefore, incremental interpretation and produc- be recovered from the chunks, although some key informa- tion follow directly from the Now-or-Never constraint on tion remains (e.g., certain speaker characteristics; Nygaard language. et al. 1994; Remez et al. 1997). To get a sense of the implications of Chunk-and-Pass In production, the process is reversed: Discourse-level processing, it is interesting to relate this perspective to spe- chunks are recursively broken down into subchunks of cific computational principles and models. How, for BEHAVIORAL AND BRAIN SCIENCES, 39 (2016) 5 https://doi.org/10.1017/S0140525X1500031X Published online by Cambridge University Press Christiansen & Chater: The Now-or-Never bottleneck: A fundamental constraint on language example, do classic models of parsing fit within this frame- over considerably longer periods of time than planning at work? A wide range of psychologically inspired models in- the syllabic level. Similarly, processes of reduction to facil- volves some degree of incrementality of syntactic analysis, itate production (e.g., modifying the speech signal to make which can potentially support incremental interpretation it easier to produce, such as reducing a vowel to a schwa, or (e.g., Phillips 1996; 2003; Winograd 1972). For example, shortening or eliminating phonemes) can be observed the sausage machine parsing model (Frazier & Fodor across different levels of linguistic representation, from in- 1978) proposes that a preliminary syntactic analysis is dividual words (e.g., Gahl & Garnsey 2004; Jurafsky et al. carried out phrase-by-phrase, but in complete isolation 2001) to frequent multiword sequences (e.g., Arnon & from semantic or pragmatic factors. But for a right-branch- Cohen Priva 2013; Bybee & Schiebman 1999). ing language such as English, chunks cannot be built left- Some may object that the Chunk-and-Pass perspective’s to-right, because the leftmost chunks are incomplete until strict notion of incremental interpretation and production later material has been encountered. Frameworks from leaves the language system vulnerable to the rather sub- Kimball (1973) onward imply “stacking up” incomplete stantial ambiguity that exists across many levels of linguistic constituents that may then all be resolved at the end of representation (e.g., lexical, syntactic, pragmatic). So-called the clause. This approach runs counter to the memory con- garden path sentences such as the famous “The horse raced straints imposed by the Now-or-Never bottleneck. 
Recon- past the barn fell” (Bever 1970) show that people are vul- ciling right-branching with incremental chunking and nerable to at least some local ambiguities: They invite com- processing is one motivation for the flexible constituency prehenders to take the wrong interpretive path by treating of combinatory categorial grammar (e.g., Steedman 1987; raced as the main verb, which leads them to a dead end. 2000; see also Johnson-Laird 1983). Only when the final word, fell, is encountered does it With respect to comprehension, considerable evidence become clear that something is wrong: raced should be in- supports incremental interpretation, going back more terpreted as a past participle that begins a reduced relative than four decades (e.g., Bever 1970; Marslen-Wilson clause (i.e., the horse [that was] raced past the barn fell). 1975). The language system uses all available information The difficulty of recovery in such garden path sentences in- to rapidly integrate incoming information as quickly as pos- dicates how strongly the language system is geared toward sible to update the current interpretation of what has been incremental interpretation. said so far. This process includes not only sentence-internal Viewed as a processing problem, garden paths occur information about lexical and structural biases (e.g., when the language system resolves an ambiguity incorrect- Farmer et al. 2006; MacDonald 1994; Trueswell et al. ly. But in many cases, it is possible for an underspecified 1993), but also extra-sentential cues from the referential representation to be constructed online, and for the ambi- and pragmatic context (e.g., Altmann & Steedman 1988; guity to be resolved later when further linguistic input Thornton et al. 1999) as well as the visual environment arrives. This type of case is consistent with Marr’s (1976) and world knowledge (e.g., Altmann & Kamide 1999; proposal of the “principle of least commitment,” that the Tanenhaus et al. 1995). As the incoming acoustic informa- perceptual system resolves ambiguous perceptual input tion is chunked, it is rapidly integrated with contextual in- only when it has sufficient data to make it unlikely that formation to recognize words, consistent with a variety of such decisions will subsequently have to be reversed. data on spoken word recognition (e.g., Marslen-Wilson Given the ubiquity of local ambiguity in language, such 1975; van den Brink et al. 2001). These words are then, underspecification may be used very widely in language in turn, chunked into larger multiword units, as evidenced processing. Note, however, that because of the severe con- by recent studies showing sensitivity to multiword sequenc- straints the Now-or-Never bottleneck imposes, the lan- es in online processing (e.g., Arnon & Snider 2010; Reali & guage system cannot adopt broad parallelism to further Christiansen 2007b; Siyanova-Chanturia et al. 2011; Trem- minimize the effect of ambiguity (as in many current prob- blay & Baayen 2010; Tremblay et al. 2011), and subse- abilistic theories of parsing, e.g., Hale 2006; Jurafsky 1996; quently further integrated with pragmatic context into Levy 2008). Rather, within the Chunk-and-Pass account, discourse-level structures. 
the sole role for parallelism in the processing system is in Turning to production, we start by noting the powerful deciding how the input should be chunked; only when con- intuition that we speak “into the void” – that is, that we flicts concerning chunking are resolved can the input be plan only a short distance ahead. Indeed, experimental passed on to a higher-level representation. In particular, studies suggest that, for example, when producing an utter- we suggest that competing higher-level codes cannot be ac- ance involving several noun phrases, people plan just one tivated in parallel. This picture is analogous to Marr’s prin- (Smith & Wheeldon 1999), or perhaps two, noun phrases ciple of least commitment of vision: Although there might ahead (Konopka 2012), and they can modify a message be temporary parallelism to resolve conflicts about, say, during production in the light of new perceptual input correspondence between dots in a random-dot stereogram, (Brown-Schmidt & Konopka 2015). Moreover, speech- it is not possible to create two conflicting three-dimensional error data (e.g., Cutler 1982) reveal that, across representa- surfaces in parallel, and whereas there may be parallelism tional levels, errors tend to be highly local: Phonological, over the interpretation of lines and dots in an image, it is morphemic, and syntactic errors apply to neighboring not possible to see something as both a duck and a rabbit chunks within each level (where material may be moved, simultaneously. More broadly, higher-level representations swapped, or deleted). Consequently, speech planning are constructed only when sufficient evidence has accrued appears to involve just a small number of chunks – the that they are unlikely later to need to be replaced (for number of which may be similar across linguistic levels – stimuli outside the psychological laboratory, at least). but which covers different amounts of time depending on Maintaining, and later resolving, an underspecified rep- the linguistic level in question. For example, planning in- resentation will create local memory and processing volving chunks at the level of intonational bursts stretches demands that may slow down processing, as is observed, 6 BEHAVIORAL AND BRAIN SCIENCES, 39 (2016) https://doi.org/10.1017/S0140525X1500031X Published online by Cambridge University Press Christiansen & Chater: The Now-or-Never bottleneck: A fundamental constraint on language for example, by increased reading times (e.g., Trueswell the occurrence of an ambiguous verb to specify the correct et al. 1994) and distinctive patterns of brain activity (as interpretation of that verb. Moreover, eye-tracking studies measured by ERPs; Swaab et al. 2003). Accordingly, have demonstrated that dialogue partners exploit both con- when the input is ambiguous, the language system may versational context and task demands to constrain interpre- require later input to recognize previous elements of the tations to the appropriate referents, thereby side-stepping speech stream successfully. The Now-or-Never bottleneck effects of phonological and referential competitors requires that such online “right-context effects” be highly (Brown-Schmidt & Konopka 2011) that have otherwise local because raw perceptual input will be lost if it is not been shown to impede language processing (e.g., Allo- rapidly identified (e.g., Dahan 2010). Right-context penna et al. 1998). 
These dialogue-based constraints also effects may arise where the language system can delay res- mitigate syntactic ambiguities that might otherwise olution of ambiguity or use underspecified representations disrupt processing (Brown-Schmidt & Tanenhaus 2008). that do not require resolving the ambiguity right away. Sim- This information may be further combined with other ilarly, cataphora, in which, for example, a referential probabilistic sources of information such as prosody (e.g., pronoun occurs before its referent (e.g., “He is a nice Kraljic & Brennan 2005; Snedeker & Trueswell 2003) to guy, that John”) require the creation of an underspecified resolve potential ambiguities within a minimal temporal entity (male, animate) when he is encountered, which is re- window. Finally, it is not clear that undetected garden solved to be coreferential with John only later in the sen- path errors are costly in normal language use, because if tence (e.g., van Gompel & Liversedge 2003). Overall, the communication appears to break down, the listener can Now-or-Never bottleneck implies that the processing repair the communication by requesting clarification from system will build the most abstract and complete represen- the dialogue partner. tation that is justified, given the linguistic input.6 Of course, outside of experimental studies, background knowledge, visual context, and prior discourse will 3.2. Implications of Strategy 2: Multiple levels of linguistic provide powerful cues to help resolve ambiguities in the structure signal, allowing the system rapidly to resolve many apparent The Now-or-Never bottleneck forces the language system ambiguities without incurring a substantial danger of to compress input into increasingly abstract chunks that “garden-pathing.” Indeed, although syntactic and lexical cover progressively longer temporal intervals. As an ambiguities have been much studied in psycholinguistics, example, consider the chunking of the input illustrated in increasing evidence indicates that garden paths are not a Figure 2. The acoustic signal is first chunked into higher- major source of processing difficulty in practice (e.g., Fer- level sound units at the phonological level. To avoid reira 2008; Jaeger 2010; Wasow & Arnold 2003).7 For interference between local sound-based units, such as pho- example, Roland et al. (2006) reported corpus analyses nemes or syllables, these units are further recoded as showing that, in naturally occurring language, there is gen- rapidly as possible into higher-level units such as mor- erally sufficient information in the sentential context before phemes or words. The same phenomenon occurs at the Figure 2. Chunk-and-Pass processing across a variety of linguistic levels in spoken language. As input is chunked and passed up to increasingly abstract levels of linguistic representations in comprehension, from acoustics to discourse, the temporal window over which information can be maintained increases, as indicated by the shaded portion of the bars associated with each linguistic level. This process is reversed in production planning, in which chunks are broken down into sequences of increasingly short and concrete units, from a discourse-level message to the motor commands for producing a specific articulatory output. More-abstract representations correspond to longer chunks of linguistic material, with greater look-ahead in production at higher levels of abstraction. 
Production processes may further serve as the basis for predictions to facilitate comprehension and thus provide top- down information in comprehension. (Note that the names and number of levels are for illustrative purposes only.) BEHAVIORAL AND BRAIN SCIENCES, 39 (2016) 7 https://doi.org/10.1017/S0140525X1500031X Published online by Cambridge University Press Christiansen & Chater: The Now-or-Never bottleneck: A fundamental constraint on language next level up: Local groups of words must be chunked into Such representational locality is exemplified across differ- larger units, possibly phrases or other forms of multiword ent linguistic levels by the local nature of phonological pro- sequences. Subsequent chunking then recodes these repre- cesses from reduction, assimilation, and fronting, including sentations into higher-level discourse structures (that may more elaborate phenomena such as vowel harmony (e.g., themselves be chunked further into even more abstract Nevins 2010), speech errors (e.g., Cutler 1982), the imme- representational structures beyond that). Similarly, produc- diate proximity of inflectional morphemes and the verbs to tion requires running the process in reverse, starting with which they apply, and the vast literature on the processing the intended message and gradually decoding it into in- difficulties associated with non-local dependencies in sen- creasingly more specific chunks, eventually resulting in tence comprehension (e.g., Gibson 1998; Hawkins 2004). the motor programs necessary for producing the relevant As noted earlier, the higher the level of linguistic represen- speech or sign output. As we discuss in section 3.3, the pro- tation, the longer the limited time window within which in- duction process may further serve as the basis for predic- formation can be chunked. Whereas dealing with just two tion during comprehension (allowing higher-level center-embeddings at the sentential level is prohibitively information to influence the processing of current input). difficult (e.g., de Vries et al. 2011; Karlsson 2007), we are More generally, our account is agnostic with respect to able to deal with up to four to six embeddings at the the specific characterization of the various levels of linguis- multi-utterance discourse level (Levinson 2013). This is tic representation8 (e.g., whether sound-based chunks take because chunking takes place at a much longer time the form of phonemes, syllables, etc.). What is central for course at the discourse level compared with the sentence the Chunk-and-Pass account: some form of sound-based level, providing more time to resolve the relevant depend- level of chunking (or visual-based in the case of sign lan- ency relations before they are subject to interference. guage), and a sequence of increasingly abstract levels of Finally, as indicated by Figure 2, processing within each chunked representations into which the input is continually level of linguistic representation takes place in parallel – but recoded. with a clear temporal component – as chunks are passed A key theoretical implication of Chunk-and-Pass pro- between levels. Note that, in the Chunk-and-Pass frame- cessing is that the multiple levels of linguistic representa- work, it is entirely possible that linguistic input can simulta- tion, typically assumed in the language sciences, are a neously, and perhaps redundantly, be chunked in more necessary by-product of the Now-or-Never bottleneck. than one way. 
For example, syntactic chunks and intona- Only by compressing the input into chunks and passing tional contours may be somewhat independent (Jackendoff them to increasingly abstract levels of linguistic representa- 2007). Moreover, we should expect further chunking across tion can the language system deal with the rapid onslaught different “channels” of communication, including visual of incoming information. Crucially, though, our perspective input such as gesture and facial expressions. also suggests that the different levels of linguistic represen- The Chunk-and-Pass perspective is compatible with a tations do not have a true part–whole relationship with one number of recent theoretical models of sentence compre- another. Unlike in the case of SF, who learned strategies to hension, including constraint-based approaches (e.g., Mac- perfectly unpack chunks from within chunks to reproduce Donald et al. 1994; Trueswell & Tanenhaus 1994) and the original string of digits, language comprehension typi- certain generative accounts (e.g., Jackendoff’s paral- cally employs lossy compression to chunk the input. That lel architecture). Intriguingly, fMRI data from adults is, higher-level chunks will not in general contain complete (Dehaene-Lambertz et al. 2006a) and infants (Dehaene- copies of lower-level chunks. Indeed, as speech input is Lambertz et al. 2006b) indicate that activation responses encoded into ever more abstract chunks, increasing to a single sentence systematically slows down when amounts of low-level information will typically be lost. moving away from the primary auditory cortex, either Instead, as in perception (e.g., Haber 1983), there is back toward Wernicke’s area or forward toward Broca’s greater representational underspecification with higher area, consistent with increasing temporal windows for levels of representation because of the repeated process chunking when moving from phonemes to words to of lossy compression.9 Thus, we would expect a growing in- phrases. Indeed, the cortical circuits processing auditory volvement of extralinguistic information, such as perceptu- input, from lower (sensory) to higher (cognitive) areas, al input and world knowledge, in processing higher levels of follow different temporal windows, sensitive to more and linguistic representation (see, e.g., Altmann & Kamide more abstract levels of linguistic information, from pho- 2009). nemes and words to sentences and discourse (Lerner Whereas our account proposes a lossy hierarchy across et al. 2011; Stephens et al. 2013). Similarly, the reverse levels of linguistic representation, only a very small process, going from a discourse-level representation of number of chunks are represented within a level: other- the intended message to the production of speech (or wise, information is rapidly lost due to interference. This sign) across parallel linguistic levels, is compatible with has the crucial implication that chunks within a given several current models of language production (e.g., level can interact only locally. For example, acoustic infor- Chang et al. 2006; Dell et al. 1997; Levelt 2001). Data mation must rapidly be coded in a non-acoustic form, say, from intracranial recordings during language production in terms of phonemes; but this is only possible if phonemes are consistent with different temporal windows for chunk correspond to local chunks of acoustic input. 
The process- decoding at the word, morphemic, and phonological ing bottleneck therefore enforces a strong pressure toward levels, separated by just over a tenth of a second (Sahin local dependencies within a given linguistic level. Impor- et al. 2009). These results are compatible with our proposal tantly, though, this does not imply that linguistic relations that incremental processing in comprehension and produc- are restricted only to adjacent elements but, instead, that tion takes place in parallel across multiple levels of linguis- they may be formed between any of the small number of tic representation, each with a characteristic temporal elements maintained at a given level of representation. window. 8 BEHAVIORAL AND BRAIN SCIENCES, 39 (2016) https://doi.org/10.1017/S0140525X1500031X Published online by Cambridge University Press Christiansen & Chater: The Now-or-Never bottleneck: A fundamental constraint on language 3.3. Implications of Strategy 3: Predictive language It also parallels Marr’s (1976) principle of least commit- processing ment, as we mentioned earlier, according to which the per- ceptual system should, as far as possible, only resolve We have already noted that, to be able to chunk incoming perceptual ambiguities when sufficiently confident that information as fast and as accurately as possible, the lan- they will not need to be undone. Moreover, it is compatible guage system exploits multiple constraints in parallel with the fine-grained weakly parallel interactive model across the different levels of linguistic representation. (Altmann & Steedman 1988) in which possible chunks Such cues may be used not only to help disambiguate pre- are proposed, word-by-word, by an autonomous parser vious input, but also to generate expectations for what may and one is rapidly chosen using top-down information. come next, potentially further speeding up Chunk-and-Pass To facilitate chunking across multiple levels of represen- processing. Computational considerations indicate that tation, prediction takes place in parallel across the different simple statistical information gleaned from sentences levels but at varying timescales. Predictions for higher-level provides powerful predictive constraints on language com- chunks may run ahead of those for lower-level chunks. For prehension and can explain many human processing results example, most people simply answer “two” in response to (e.g., Christiansen & Chater 1999; Christiansen & the question “How many animals of each kind did Moses MacDonald 2009; Elman 1990; Hale 2006; Jurafsky 1996; take on the Ark?” – failing to notice the semantic Levy 2008; Padó et al. 2009). Similarly, eye-tracking data anomaly (i.e., it was Noah’s Ark, not Moses’ Ark) even in suggest that comprehenders routinely use a variety of the absence of time pressure and when made aware that sources of probabilistic information – from phonological the sentence may be anomalous (Erickson & Matteson cues to syntactic context and real-world knowledge – to an- 1981). That is, anticipatory pragmatic and communicative ticipate the processing of upcoming words (e.g., Altmann & considerations relating to the required response appear to Kamide 1999; Farmer et al. 2011; Staub & Clifton 2006). trump lexical semantics. 
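As a concrete illustration of how even very simple statistical information can support anticipation of upcoming input, the sketch below implements a toy incremental bigram predictor. It is illustrative only: the miniature corpus and the choice of bigram statistics are assumptions for the demonstration, not a model endorsed in the article, which cites far richer predictive mechanisms.

```python
# Illustrative sketch only: an incremental bigram predictor of upcoming words.
# The toy corpus and the bigram statistic are assumptions for the demo.
from collections import Counter, defaultdict

class BigramPredictor:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, prev_word, next_word):
        """Online update: learn from each adjacent word pair as it arrives."""
        self.counts[prev_word][next_word] += 1

    def predict(self, prev_word, k=2):
        """Return the k most expected continuations given the current word."""
        following = self.counts[prev_word]
        total = sum(following.values())
        return [(w, c / total) for w, c in following.most_common(k)] if total else []

predictor = BigramPredictor()
for sentence in [["the", "dog", "ran"], ["the", "dog", "barked"], ["the", "cat", "ran"]]:
    for prev, nxt in zip(sentence, sentence[1:]):
        predictor.update(prev, nxt)

print(predictor.predict("the"))   # e.g., [('dog', 0.67), ('cat', 0.33)]
print(predictor.predict("dog"))   # e.g., [('ran', 0.5), ('barked', 0.5)]
```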
More generally, the time course Results from event-related potential experiments indicate of normal conversation may lead to an emphasis on more that rather specific predictions are made for upcoming temporally extended higher-level predictions over lower- input, including its lexical category (Hinojosa et al. 2005), level ones. This may facilitate the rapid turn-taking that grammatical gender (Van Berkum et al. 2005; Wicha has been observed cross-culturally (Stivers et al. 2009) et al. 2004), and even its onset phoneme (DeLong et al. and which seems to require that listeners make quite spe- 2005) and visual form (Dikker et al. 2010). Accordingly, cific predictions about when the speaker’s current turn there is a growing body of evidence for a substantial role will finish (Magyari & De Ruiter 2012), as well as being of prediction in language processing (for reviews, see, able to quickly adapt their expectations to specific linguistic e.g., Federmeier 2007; Hagoort 2009; Kamide 2008; environments (Fine et al. 2013). Kutas et al. 2014; Pickering & Garrod 2007) and evidence We view the anticipation of turn-taking as one instance that such language prediction occurs in children as young as of the broader alignment that takes place between dialogue 2 years of age (Mani & Huettig 2012). Importantly, as well partners across all levels of linguistic representation (for a as exploiting statistical relations within a representational review, see Pickering & Garrod 2004). This dovetails with level, predictive processing allows top-down information fMRI analyses indicating that although there are some from higher levels of linguistic representation to rapidly comprehension- and production-specific brain areas, spa- constrain the processing of the input at lower levels.10 tiotemporal patterns of brain activity are in general From the viewpoint of the Now-or-Never bottleneck, closely coupled between speakers and listeners (e.g., prediction provides an opportunity to begin Chunk-and- Silbert et al. 2014). In particular, Stephens et al. (2010) ob- Pass processing as early as possible: to constrain represen- served close synchrony between neural activations in speak- tations of new linguistic material as it is encountered, and ers and listeners in early auditory areas. Speaker activations even incrementally to begin recoding predictable linguistic preceded those of listeners in posterior brain regions (in- input before it arrives. This viewpoint is consistent with cluding parts of Wernicke’s area), whereas listener activa- recent suggestions that the production system may be tions preceded those of speakers in the striatum and pressed into service to anticipate upcoming input (e.g., anterior frontal areas. In the Chunk-and-Pass framework, Pickering & Garrod 2007; 2013a). Chunk-and-Pass pro- the listener lag primarily derives from delays caused by cessing implies that there is practically no possibility for the chunking process across the various levels of linguistic going back once a chunk is created because such backtrack- representation, whereas the speaker lag predominantly re- ing tends to derail processing (e.g., as in the classic garden flects the listener’s anticipation of upcoming input, espe- path phenomena mentioned above). This imposes a Right- cially at the higher levels of representation (e.g., First-Time pressure on the language system in the face of pragmatics and discourse). 
Strikingly, the extent of the lis- linguistic input that is highly locally ambiguous.11 The con- tener’s anticipatory brain responses were strongly correlat- tribution of predictive modeling to comprehension is that it ed with successful comprehension, further underscoring facilitates local ambiguity resolution while the stimulus is the importance of prediction-based alignment for language still available. Only by recruiting multiple cues and integrat- processing. Indeed, analyses of real-time interactions show ing these with predictive modeling is it possible to resolve that alignment increases when the communicative task local ambiguities quickly and correctly. becomes more difficult (Louwerse et al. 2012). By decreas- Right-First-Time parsing fits with proposals such as that ing the impact of potential ambiguities, alignment thus by Marcus (1980), where local ambiguity resolution is makes processing as well as production easier in the face delayed until later disambiguating information arrives, of the Now-or-Never bottleneck. and models in which aspects of syntactic structure may We have suggested that only an incremental, predictive be underspecified, therefore not requiring the ambiguity language system, continually building and passing on new to be resolved (e.g., Gorrell 1995; Sturt & Crocker 1996). chunks of linguistic material, encoded at increasingly BEHAVIORAL AND BRAIN SCIENCES, 39 (2016) 9 https://doi.org/10.1017/S0140525X1500031X Published online by Cambridge University Press Christiansen & Chater: The Now-or-Never bottleneck: A fundamental constraint on language abstract levels of representation, can deal with the on- 1986). Whatever the appropriate computational frame- slaught of linguistic input in the face of the severe work, the Now-or-Never bottleneck requires that language memory constraints of the Now-or-Never bottleneck. We acquisition be viewed as a type of skill learning, such as suggest that a productive line of future work is to consider learning to drive, juggle, play the violin, or play chess. the extent to which existing models of language are compat- Such skills appear to be learned through practicing the ible with these constraints, and to use these properties to skill, using online feedback during the practice itself, al- guide the creation of new theories of language processing. though the consolidation of learning occurs subsequently (Schmidt & Wrisberg 2004). The challenge of language ac- quisition is to learn a dazzling sequence of rapid processing 4. Acquisition is learning to process operations, rather than conjecturing a correct “linguistic theory.” If speaking and understanding language involves Chunk- and-Pass processing, then acquiring a language requires 4.1. Implications of Strategy 1: Online learning learning how to create and integrate the right chunks rapidly, before current information is overwritten by new The Now-or-Never bottleneck implies that learning can input. Indeed, the ability to quickly process linguistic depend only on material currently being processed. As input – which has been proposed as an indicator of chunk- we have seen, this implication requires a processing strat- ing ability (Jones 2012) – is a strong predictor of language egy according to which modification to current representa- acquisition outcomes from infancy to middle childhood tions (in this context, learning) occurs right away; in (Marchman & Fernald 2008). The importance of this machine-learning terminology, learning is online. 
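The contrast between online and batch (offline) learning can be sketched as follows. This is an illustrative toy under stated assumptions: the utterances are made up, and simple adjacent-word chunk counts stand in for whatever statistics the learner actually tracks; the point is only that the online learner updates its model from each utterance as it is processed and never stores the raw input, whereas the batch learner must keep the whole corpus before estimating anything.

```python
# Illustrative sketch only: online versus batch learning over an utterance
# stream. The utterances and the chunk statistic are toy assumptions.
from collections import Counter

def learn_online(utterance_stream):
    chunk_counts = Counter()
    for utterance in utterance_stream:               # each utterance seen once
        words = utterance.split()
        chunk_counts.update(zip(words, words[1:]))   # update chunk stats now
        # the utterance itself is not stored; only the recoded statistics persist
    return chunk_counts

def learn_batch(corpus):
    # requires holding every utterance in memory before estimation,
    # which the Now-or-Never bottleneck rules out for human learners
    stored = list(corpus)
    chunk_counts = Counter()
    for utterance in stored:
        words = utterance.split()
        chunk_counts.update(zip(words, words[1:]))
    return chunk_counts

stream = iter(["the dog ran", "the cat ran", "the dog barked"])
print(learn_online(stream).most_common(2))
```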
If linguistic input is available only fleetingly, then any learning must occur while that information is present; that is, learning must occur in real time, as the Chunk-and-Pass process takes place. Accordingly, any modifications to the learner's cognitive system in light of processing must, according to the Now-or-Never bottleneck, occur at the time of processing. The learner must learn to chunk the input appropriately – to learn to recode the input at successively more abstract linguistic levels; and to do this requires, of course, learning the structure of the language being spoken. But how is this structure learned?

We suggest that, in language acquisition, as in other areas of perceptual-motor learning, people learn by processing, and that past processing leaves traces that can facilitate future processing. What, then, is retained, so that language processing gradually improves? We can consider various possibilities: For example, the weights of a connectionist network can be updated online in the light of current processing (Rumelhart et al. 1986a); in an exemplar-based model, traces of past examples can be reused in the future (e.g., Hintzman 1988; Logan 1988; Nosofsky 1986). Whatever the appropriate computational framework, the Now-or-Never bottleneck requires that language acquisition be viewed as a type of skill learning, such as learning to drive, juggle, play the violin, or play chess. Such skills appear to be learned through practicing the skill, using online feedback during the practice itself, although the consolidation of learning occurs subsequently (Schmidt & Wrisberg 2004). The challenge of language acquisition is to learn a dazzling sequence of rapid processing operations, rather than conjecturing a correct "linguistic theory."

4.1. Implications of Strategy 1: Online learning

The Now-or-Never bottleneck implies that learning can depend only on material currently being processed. As we have seen, this implication requires a processing strategy according to which modification to current representations (in this context, learning) occurs right away; in machine-learning terminology, learning is online. If learning does not occur at the time of processing, the representation of linguistic material will be obliterated, and the opportunity for learning will be gone forever. To facilitate such online learning, the child must learn to use all available information to help constrain processing. The integration of multiple constraints – or cues – is a fundamental component of many current theories of language acquisition (see, e.g., contributions in Golinkoff et al. 2000; Morgan & Demuth 1996; Weissenborn & Höhle 2001; for a review, see Monaghan & Christiansen 2008). For example, second-graders' initial guesses about whether a novel word refers to an object or an action are affected by that word's phonological properties (Fitneva et al. 2009); 7-year-olds use visual context to constrain online sentence interpretation (Trueswell et al. 1999); and preschoolers' language production and comprehension is constrained by pragmatic factors (Nadig & Sedivy 2002). Thus, children learn rapidly to apply the multiple constraints used in incremental adult processing (Borovsky et al. 2012).

Nonetheless, online learning contrasts with traditional approaches in which the structure of the language is learned offline by the cognitive system acquiring a corpus of past linguistic inputs and choosing the grammar or other model of the language that best fits with those inputs. For example, in both mathematical and theoretical analysis (e.g., Gold 1967; Hsu et al. 2011; Pinker 1984) and in grammar-induction algorithms in machine learning and cognitive science, it is typically assumed that a corpus of language can be held in memory, and that the candidate grammar is successively adjusted to fit the corpus as well as possible (e.g., Manning & Schütze 1999; Pereira & Schabes 1992; Redington et al. 1998). However, this approach involves learning linguistic regularities (at, say, the morphological level) by storing and later surveying relevant linguistic input at a lower level of analysis (e.g., involving strings of phonemes), and then attempting to determine which higher-level regularities best fit the database of lower-level examples. There are a number of difficulties with this type of proposal – for example, that only a very rich lower-level representation (perhaps combined with annotations concerning relevant syntactic and semantic context) is likely to be a useful basis for later analysis. But more fundamentally, the Now-or-Never bottleneck requires that information be retained only if it is recoded at processing time: Phonological information that is not chunked at the morphological level and beyond will be obliterated by oncoming phonological material.12
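The contrast can be stated in a few lines of code. The sketch below is our own purely illustrative comparison (the "model" is nothing more than bigram counts) between a batch learner, which presupposes a stored, re-surveyable corpus, and an online learner that updates its statistics as each utterance is processed and then lets the utterance go.

    from collections import Counter

    def batch_learner(corpus):
        """Batch learning: assumes the entire corpus is stored and can be
        surveyed at leisure, exactly what the Now-or-Never bottleneck rules out."""
        counts = Counter()
        for utterance in corpus:                       # corpus held in memory
            words = utterance.split()
            counts.update(zip(words, words[1:]))
        return counts

    class OnlineLearner:
        """Online learning: each utterance updates the model at processing time
        and is then lost; only the recoded statistics survive."""
        def __init__(self):
            self.counts = Counter()

        def process(self, utterance):
            words = utterance.split()
            self.counts.update(zip(words, words[1:]))  # learn now ...
            del words, utterance                       # ... because the input is gone

    corpus = ["the dog barked", "the dog slept", "the cat slept"]
    online = OnlineLearner()
    for u in corpus:
        online.process(u)
    assert online.counts == batch_learner(corpus)

The two toy learners end up with the same counts on this tiny input; the point is only that the online learner never needs the corpus as a whole, which is the regime the Now-or-Never bottleneck enforces.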
So, if learning is shaped by the Now-or-Never bottleneck, then linguistic input must, when it is encountered, be recoded successively at increasingly abstract linguistic levels if it is to be retained at all – a constraint imposed, we argue, by basic principles of memory. Crucially, such information is not, therefore, in a suitably "neutral" format to allow for the discovery of previously unsuspected linguistic regularities. In a nutshell, the lossy compression of the linguistic input is achieved by applying the learner's current model of the language. But information that would point toward a better model of the language (if examined in retrospect) will typically be lost (or, at best, badly obscured) by this compression, precisely because those regularities are not captured by the current model of the language.

Suppose, for example, that we create a lossy encoding of a language using a simple, context-free phrase structure grammar that cannot handle, say, noun-verb agreement. The lossy encoding of the linguistic input produced using this grammar will provide a poor basis for learning a more sophisticated grammar that includes agreement – precisely because agreement information will have been thrown away. So the Now-or-Never bottleneck rules out the possibility that the learner can survey a neutral database of linguistic material to optimize its model of the language.
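A minimal sketch of that scenario, under our own invented assumptions (the mini-lexicon and category labels are made up, and the "grammar" is reduced to a part-of-speech lookup), shows how the retained encoding can erase exactly the evidence a later, better grammar would need:

    # Toy illustration: recoding the input with a grammar that ignores number marking.
    LEXICON = {                     # invented mini-lexicon for the example
        "the": "Det",
        "dog": "N", "dogs": "N", "cat": "N", "cats": "N",
        "barks": "V", "bark": "V", "sleeps": "V", "sleep": "V",
    }

    def lossy_recode(sentence):
        """Retain only what the current model encodes: bare category labels.
        Singular/plural distinctions are thrown away at processing time."""
        return [LEXICON[w] for w in sentence.split()]

    retained = [lossy_recode(s) for s in ["the dog barks", "the dogs bark"]]
    print(retained)   # [['Det', 'N', 'V'], ['Det', 'N', 'V']]
    # Both sentences now have identical retained encodings, so a learner surveying
    # only this material could never discover noun-verb agreement from it.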
The emphasis on online learning does not, of course, rule out the possibility that any linguistic material that is remembered may subsequently be used to inform learning. But according to the present viewpoint, any further learning requires reprocessing that material. So if a child comes to learn a poem, song, or story verbatim, the child might extract more structure from that material by mental rehearsal (or, indeed, by saying it aloud). The online learning constraint is that material is learned only when it is being processed – ruling out any putative learning processes that involve carrying out linguistic analyses or compiling statistics over a stored corpus of linguistic material.

If this general picture of acquisition as learning-to-process is correct, then we should expect the exploitation of memory to require "replaying" learned material, so that it can be re-processed. Thus, the application of memory itself requires passing through the Now-or-Never bottleneck – there is no way of directly interrogating an internal database of past experience; indeed, this viewpoint fits with our subjective sense that we need to "bring to mind" past experiences or rehearse verbal material to process it further. Interestingly, there is now also substantial neuroscientific evidence that replay does occur (e.g., in rat spatial learning, Carr et al. 2011). Moreover, it has long been suggested that dreaming may have a related function (here using "reverse" learning over "fictional" input to eliminate spurious relationships identified by the brain, Crick & Mitchison 1983; see Hinton & Sejnowski 1986, for a closely related computational model). Deficits in the ability to replay material would, in this view, lead to consequent deficits in memory and inference; consistent with this viewpoint, Martin and colleagues have argued that rehearsal deficits for phonological and semantic information may lead to difficulties in the long-term acquisition and retention of word forms and word meanings, respectively, and in their use in language processing (e.g., Martin & He 2004; Martin et al. 1994). In summary, then, language acquisition involves learning to process, and generalizations can only be made over past processing episodes.

4.2. Implications of Strategy 2: Local learning

Online learning faces a particularly acute version of a general learning problem: the stability-plasticity dilemma (e.g., Mermillod et al. 2013). How can new information be acquired without interfering with prior information? The problem is especially challenging because reviewing prior information is typically difficult (because recalling earlier information interferes with new input) or impossible (where prior input has been forgotten). Thus, to a good approximation, the learner can only update its model of the language in a way that responds to current linguistic input, without being able to review whether any updates are inconsistent with prior input. Specifically, if the learner has a global model of the entire language (e.g., a traditional grammar), the learner runs the risk of overfitting that model to capture regularities in the momentary linguistic input at the expense of damaging the match with past linguistic input.

Avoiding this problem, we suggest, requires that learning be highly local, consisting of learning about specific relationships between particular linguistic representations. New items can be acquired, with implications for later processing of similar items; but learning current items does not thereby create changes to the entire model of the language that could interfere with what was learned from past input. One way to learn in a local fashion is to store individual examples (this requires, in our framework, that those examples have been abstractly recoded by successive Chunk-and-Pass operations, of course), and then to generalize, piecemeal, from these examples. This standpoint is consistent with the idea that the "priority of the specific," as observed in other areas of cognition (e.g., Jacoby et al. 1989), also applies to language acquisition. For example, children seem to be highly sensitive to multiword chunks (Arnon & Clark 2011; Bannard & Matthews 2008; see Arnon & Christiansen, submitted, for a review13). More generally, learning based on past traces of processing will typically be sensitive to details of that processing, as is observed across phonetics, phonology, lexical access, syntax, and semantics (e.g., Bybee 2006; Goldinger 1998; Pierrehumbert 2002; Tomasello 1992).
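One way to picture such local, piecemeal generalization is as a store of recoded chunks that is consulted by similarity, so that adding a new item changes nothing beyond how similar items are later processed. The sketch below is our own schematic illustration (the chunks, similarity measure, and threshold are all invented), not an implementation of any particular exemplar or usage-based model.

    class LocalChunkStore:
        """Schematic exemplar-style store: each chunk is added locally, and
        generalization is piecemeal, via similarity to stored chunks only."""
        def __init__(self):
            self.chunks = []                        # grows item by item

        def learn(self, chunk):
            self.chunks.append(tuple(chunk))        # local update: nothing else changes

        def similarity(self, a, b):
            shared = len(set(a) & set(b))
            return shared / max(len(a), len(b))

        def support(self, chunk, threshold=0.5):
            """How many stored exemplars back up this new combination?"""
            return sum(1 for c in self.chunks
                       if self.similarity(tuple(chunk), c) >= threshold)

    store = LocalChunkStore()
    for c in [("all", "gone", "milk"), ("all", "gone", "juice"), ("more", "milk")]:
        store.learn(c)
    print(store.support(("all", "gone", "water")))   # 2: backed by similar stored chunks

Learning here is a single append; nothing resembling a global grammar is re-estimated, which is what makes the update local.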
That learning is local provides a powerful constraint, incompatible with typical computational models of how the child might infer the grammar of the language – because these models typically do not operate incrementally but range across the input corpus, evaluating alternative grammatical hypotheses (so-called batch learning). But, given the Now-or-Never bottleneck, the "unprocessed" corpus, so readily available to the linguistic theorist or to a computer model, is lost to the human learner almost as soon as it is encountered. Where such information has been memorized (as in the case of SF's encoding of streams of digits), recall and processing are slow and effortful. Moreover, because information is encoded in terms of the current encoding, it becomes difficult to neutrally review that input to create a better encoding, and to cross-check past data to test wide-ranging grammatical hypotheses.14 So, as we have already noted, the Now-or-Never bottleneck seems incompatible with the view of a child as a mini-linguist.

By contrast, the principle of local learning is respected by other approaches. For example, item-based (Tomasello 2003), connectionist (e.g., Chang et al. 1999; Elman 1990; MacDonald & Christiansen 2002),15 exemplar-based (e.g., Bod 2009), and other usage-based (e.g., Arnon & Snider 2010; Bybee 2006) accounts of language acquisition tie learning and processing together – and assume that language is acquired piecemeal, in the absence of an underlying Bauplan. Such accounts, based on local learning, provide a possible explanation of the frequency effects that are found at all levels of language processing and acquisition (e.g., Bybee 2007; Bybee & Hopper 2001; Ellis 2002; Tomasello 2003), analogous to exemplar-based theories of how performance speeds up with practice (Logan 1988).

The local nature of learning need not, though, imply that language has no integrated structure. Just as in perception and action, local chunks can be defined at many different levels of abstraction, including highly abstract patterns, for example, governing subject, verb, and object; and generalizations from past processing to present processing will operate across all of these levels. Therefore, in generating […]

[…] which embodies these principles is the simple recurrent network (Altmann 2002; Christiansen & Chater 1999; Elman 1990), which learns to map from the current input on to the next element in a continuous sequence of linguistic (or other) input; and which learns, online, by adjusting its parameters (the "weights" of the network) to reduce the observed prediction error, using the back-propagation learning algorithm.
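For concreteness, the sketch below is a deliberately minimal, self-contained illustration of a simple-recurrent-network-style next-element predictor trained online. It is not the Elman (1990) or Christiansen and Chater (1999) implementation; the toy corpus, layer sizes, learning rate, and one-step truncation of back-propagation are all our own simplifying assumptions.

    import numpy as np

    # Minimal sketch of an SRN-style next-element predictor trained online:
    # one input at a time, weights nudged after every prediction. Back-propagation
    # is truncated to a single time step for brevity.
    rng = np.random.default_rng(0)
    corpus = "the dog barks the cat meows the dog sleeps".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    V, H = len(vocab), 8

    Wxh = rng.normal(0, 0.1, (H, V))   # input -> hidden
    Whh = rng.normal(0, 0.1, (H, H))   # hidden -> hidden (the recurrent context)
    Why = rng.normal(0, 0.1, (V, H))   # hidden -> output (next-element prediction)
    lr = 0.1

    def one_hot(i):
        v = np.zeros(V)
        v[i] = 1.0
        return v

    def predict(x, h_prev):
        h = np.tanh(Wxh @ x + Whh @ h_prev)      # current input plus prior context
        logits = Why @ h
        p = np.exp(logits - logits.max())
        return h, p / p.sum()

    for epoch in range(200):
        h_prev = np.zeros(H)
        for cur, nxt in zip(corpus, corpus[1:]):
            x, target = one_hot(idx[cur]), one_hot(idx[nxt])
            h, p = predict(x, h_prev)
            err = p - target                     # prediction error, observed now
            dh = (Why.T @ err) * (1 - h ** 2)
            # Online update: adjust the weights immediately to reduce that error.
            Why -= lr * np.outer(err, h)
            Wxh -= lr * np.outer(dh, x)
            Whh -= lr * np.outer(dh, h_prev)
            h_prev = h

    # After "the", the network should favour the nouns it has seen follow "the".
    h, p = predict(one_hot(idx["the"]), np.zeros(H))
    print(sorted(zip(p, vocab), reverse=True)[:2])

The property this sketch shares with the models cited is that processing and learning are the same pass over the input: each element is predicted from the current input plus the recurrent context, and the weights are adjusted at once to reduce the observed prediction error.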
Using a very different framework, in the spirit of construction grammar (e.g., Croft 2001; Goldberg 2006), McCauley and Christiansen (2011) recently developed a psychologically based, online chunking model of incremental language acquisition and processing, incorporating prediction to generalize to new chunk combinations. Exemplar-based analogical models of language acquisition and processing may also be constructed, which build and predict language structure online, by incrementally creating a database of possible structures, and dynamically using online computation of similarity to recruit these structures to process and predict new linguistic input. Importantly, prediction allows for top-down information to influence current processing across different levels of linguistic representation, fr