Time Is, Time Was, Time Is Past
In the Famous Historie of Fryer Bacon, the anonymous author (c. 1555) describes how the philosopher-monk Roger Bacon (1220 - 1292) consulted with the Devil to build a mechanical head out of brass capable of answering any question put to it. Bacon’s intention was to create a kind of wall-mounted oracle, dispensing counsel and advice, aiding the philosopher by reciting news, recipes, formulas, and spells for him. Variations on the legend of the brazen head (sometimes a brazen man) went back centuries, often replacing Bacon with other suspiciously cunning men: Robert Grosseteste, Albertus Magnus, Pope Sylvester.
In his comic retelling of the myth, the Elizabethan playwright Robert Greene (the man who called Shakespeare an “upstart Crow”) has Bacon asleep onstage when the head, after seven years of labor, finally comes awake and speaks seven cryptic words: “Time is, time was, time is past.” Then, in Greene’s stage directions, “A lightning flashes forth, and a hand appears that breaks down the Head with a hammer.” Greene’s ending followed the common arc of the legend, which in every version ends with the brazen head destroyed, either struck down by God or else shattered by a pious Christian. There was something disturbing, to the medieval mind, about an artificial voice.
The Clockwork Book
In a book that contains no fewer than six different methods to exit the earth’s atmosphere using common household materials, the earliest known description of a mobile home, a theological discourse on the personhood of plants and animals delivered by a talking cabbage, and a bold prediction that the sun will one day exhaust all its fuel and consume all the planets around it (all of this written in the 1650s), the most striking invention proposed in Cyrano de Bergerac’s Comical History of the States and Empires of the Moon is the book of the lunar world. These books are built like small metal boxes, fitting comfortably in the hand. “At the opening of the box,” Cyrano writes,
“I found something in metal almost similar to our clocks, filled with an infinite number of little springs and imperceptible machines. It is a book indeed, but a miraculous book without pages or letters; in fine, it is a book to learn from which eyes are useless, only ears are needed. When someone wishes to read he winds up the machine with a large number of all sorts of keys; then he turns the pointer towards the chapter he wishes to hear, and immediately, as if from a man's mouth or a musical instrument, this machine gives out all the distinct and different sounds which serve as the expression of speech between the noble Moon-dwellers.”
The advantages of this technology over our own printed books, Cyrano writes, are manifold: citizens of the Moon are able to “read” as soon as they can speak, and by the age of eighteen have read as much as an Earthling at eighty. With their hands and eyes free, they can use these books anywhere they please: while resting at home, practicing their various arts and crafts, or traveling around the country on horseback, with a dozen books jangling on their saddlebags.
Cyrano uses the language of clocks and clockwork, the cutting-edge technology of his time and an emerging metaphor for the nature of existence in a mechanical universe, just as computer programmers today speak of our world as a software simulation. Cyrano, a bibliophile, couldn’t help but speculate on how the marvelous new machinery of pocket watches might be used for literature.
The Talking Machine
In 1769, engineer, scientist, and inventor Wolfgang von Kempelen (1734 - 1804) began development of his speaking machine. Instead of a brazen head, Kempelen used simple kitchen bellows on a wooden frame, blowing air through a reed. By restricting the airflow with his hand, Kempelen was able to make his machine articulate enough vowels and consonants to form basic words and phrases in French and Italian, though Kempelen’s own German, with its thickets of hard consonants, was harder to reproduce.
The final version of Kempelen’s speaking machine spoke in a rasping, wheezy monotone somewhere between a duck and a crying baby. You can hear reproductions for yourself on YouTube, which follow the exact schematics published in Kempelen’s 1791 The Mechanism of Human Speech, with a Description of a Speaking Machine. Kempelen’s speaking machine, with some improvements to the design made fifty years later by the English inventor Charles Wheatstone, is the most accurate non-electronic speech synthesizer ever made.
Whatever his legitimate accomplishments, however, Kempelen remains forever associated with his other great project of 1769: a turban-clad automaton capable of playing chess at a grandmaster level. The Mechanical Turk, as it came to be called, received international acclaim, toured Europe and the United States for half a century, and defeated, among others, Napoleon Bonaparte and Benjamin Franklin. Edgar Allan Poe, who watched it play, wrote an essay arguing that a man must be hidden inside. Only after the Turk perished in a Philadelphia fire did its operators come forward and admit that the machine was a fraud, concealing a chess player beneath the table who controlled the Turk’s every movement.
The Talking Book
London, 1892. A group of gentlemen, top hats in hand, have just come to their favorite restaurant following a lecture by Sir William Thomson on the distant future of humanity. Still buzzing with excitement, they order drinks and make predictions about the year 2000. In the future, they claim, we will have artificial meat, take nutritional supplements as pills, abandon figurative art for abstraction, and receive our news and stories through motion pictures.
The gentlemen and their conversation are fictitious, appearing in an 1894 short story written by Octave Uzanne (1851 - 1931) though many of their predictions are not: artificial food fills our pantries, screens dominate our homes, and modern art hangs in our galleries. The bulk of the story, though, is given over to the narrator, a “worthy Bibliophile” asked to present his views on the future of the book.
The Bibliophile is blunt: “If by books,” he says, “you are to be understood as referring to our innumerable collections of paper, printed, sewed, and bound in a cover announcing the title of the work, I own to you frankly that I do not believe (and the progress of electricity and modern mechanism forbids me to believe) that Gutenberg's invention can do otherwise than sooner or later fall into desuetude.” The damage books cause to the eyes and spine, he says, cries out for a technological fix.
In the future, he says, we will listen to our books, carrying them everywhere we go. “There will be registering cylinders,” he says, “as light as celluloid penholders, capable of containing five or six hundred words and working upon very tenuous axles, and occupying not more than five square inches; all the vibrations of the voice will be reproduced in them; we shall attain to perfection in this apparatus as surely as we have obtained precision in the smallest and most ornamental watches.” The reader of the future will listen to these “pocket phono-operagraphs” as he saunters through town or takes excursions into the wilderness.
Authors will go from writers to narrators: “Certain Narrators,” he says, “will be sought out for their fine address, their contagious sympathy, their thrilling warmth, and the perfect accuracy, the fine punctuation of their voice,” with all their productions protected by copyright law as the author’s sole property. Journalists will record their news and read it as daily dispatches printed on cheap, disposable cylinders or else delivered over telephone-powered loudspeakers–just like the theatrophone has done for drama, his friends observe.
The conversation finished, the gentlemen raise their glasses to the Bibliophile’s speech and toast: “The printed book is about to disappear. After us the last of books, gentlemen!”
The Book of the Future
For $100 per hour, you can now hire the acclaimed actor and presenter Edward Herrmann to narrate your writing. As the voice of countless History Channel programs and Nova science documentaries and narrator for books by Stephen King, Doris Kearns Goodwin, David McCullough, Ayn Rand, Ron Chernow, and more, Herrmann is one of America’s most beloved narrators. Audible alone lists 102 books narrated by Herrmann. And unlike other celebrity narrators, who charge up to $1,000 for every hour of performance and must be booked months in advance, Edward Herrmann is available at any time, for only a tenth of the price per finished hour (PFH). This is, by any measure, an extraordinary deal. There’s just one catch: Edward Herrmann is dead. The Herrmann you can hire today is an AI-generated text-to-speech program.
I won’t pretend to understand exactly how DeepZen, the current licensee of Herrmann’s voice, works. From what I can tell, the process depends on feeding an AI program thousands of hours of Herrmann performances. DeepZen’s software scans it all for patterns of intonation, breath, rhythm, and pronunciation, then matches it all to their source texts. This builds up a model of Edward Herrmann’s voice, complete down to the last phoneme. By teaching the program enough examples of Herrmann reading ordinary words and phrases like “Good morning,” “Dog,” and “Bumblebee,” you can get a decent imitation of things Herrmann never said, like “Transalpine cumquat rodeo” or “the exquisite corpse shall drink the new wine.”
The result is far from perfect. Much of your $100 per PFH goes to human editors who smooth out inconsistencies, mispronunciations, and mistakes that inevitably come when a computer is guessing at sounds that it doesn’t really understand. The synthetic reader also isn’t especially good at fiction, with its multiple voices and varying levels of irony, pace, and style. You can hear AI-Herrmann trying to read A Christmas Carol on Spotify and judge for yourself. But even in nonfiction samples from history or biography, the kind of steady, stately material that the original Edward Herrmann excelled at reading, there’s still a slight uncanniness to the AI narrator.
DeepZen, the licensee of Herrmann’s voice–or rather, of certain phonemic and intonational patterns that constitute the publicly-recognized voice of the late Edward Herrmann–is betting big on synthetic voices as the future of the audiobook market. This is no small niche, either: while the last decade has seen print and digital book sales mostly stagnant, audiobooks have had double-digit growth rates every year. The biggest problem in audiobook recording is no longer cost or publication, but production: there simply aren’t enough professional audiobook studios to meet the rising demand. Automation promises an easy, cheap fix. DeepZen and their competitors are promising that fix, and their software is getting better all the time.
Whatever attachment listeners have to human narrators–and despite Audible’s current policy of banning all synthetic narrators–it’s going to be hard for the audiobook industry to fight the economic benefits of automation. By the end of the decade, we very well might have a majority of audiobooks read by synthetic narrators. This will surely change how we read and write.
As the technology becomes better and cheaper–and like most AI applications, it should keep improving with more listeners and more feedback–it might become trivial to write a book, run it through an app, and have a high-quality audiobook narrated by any one of a huge list of licensed narrators. Stephen Fry’s lawyers probably already have the paperwork drafted for auctioning off his voice like Herrmann’s, but we don’t have to stop with the living or even the recently dead. Just as a minor industry has sprung up around celebrity holograms, it’s possible we’ll soon have the means to reproduce any popular figure of the last century. (I would pay good money for an Orson Welles narrator app. What voice would you spring for?)
Corners of the book world with little audiobook market presence, like academic works and books in translation, would leap into the field and open up new revenue streams and use cases with extremely low costs. Imagine learning a language with an AI narrator who speeds up, slows down, and enunciates on command, or pauses after each sentence for you to catch up, or having your college textbook read itself to you.
Authors might use certain text markers for the software to read like a musical score, making suggestions about pacing and tone (“use sarcasm here”), dialogue (“interrupt this line with the next one”), and character (“Jed speaks slowly, Sarah speaks quickly”). Our culture’s turn towards secondary orality will be supercharged, with all prose written to be spoken and heard.
But books, as I’ve covered in this space, are becoming the least common and least significant kind of reading that we do. Most of our engagement with text now happens in articles, reports, posts, tweets, and captions, so that’s where the real money is with synthetic readers. If we do come to prefer listening to our books, and listening to synthetic readers, it’ll be our internet browsing that leads to this preference.
The companies that give us access to all this–Amazon, Microsoft, Apple, Google, and so on–all have massive investments in text-to-speech and speech-to-text software. This software is, after all, how their digital assistants, like Siri and Alexa, work. If the tech giants don’t already have synthetic narrators as good as DeepZen’s, then they’re probably waiting to buy them out once the technology matures.
And at that point, it’s not just books, but our entire relationship to text that changes: consumers could have access to the software itself, not just recordings of DeepZen or whatever in action. Morgan Freeman could read your emails; Jeff Goldblum could run you through your schedule; Scarlett Johansson could read Wikipedia to you–or flirt with you, as in the movie Her. It will also almost certainly be used to produce unauthorized copies of the voice of any public figure with enough recorded audio to imitate, leading to countless deepfakes, hoaxes, and frauds. This is heady, sci-fi stuff, and it is only beginning to unfold.
Like everything else with digital technology in the 2020s, we’ll have to take the good (increased convenience, lower costs, more accessibility, loads of creative potential) with the bad (increasing fraud, the erosion of reality as we know it). There is something magical, something strange, and something slightly wrong about the synthetic voice, but as long as it holds out the promise of convenience, utility, and profit, the idea will always be pursued. And now, with artificial intelligence, we will finally have our brazen heads. Whether or not they must be struck down with a flash of lightning and a hammer remains to be seen.