Who do you write like?

I pasted a longish blog post into I Write Like, and it said:

George Orwell

While I appreciate the compliment, I wish it were more specific about how it arrived at that assessment. I can make a few guesses.

It seems obvious that this uses some kind of nearest-neighbour search. Take a corpus of authors, break their works into good-sized chunks, and then find the closest match for whatever the user gives you.
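Here’s a minimal sketch of that guess in Python. It is not how I Write Like actually works. The names (`chunks`, `word_counts`, `cosine`, `nearest_author`), the 200-word chunk size, and the choice of cosine similarity are all assumptions made for illustration. The feature function is pluggable, so plain word counts are only a stand-in.

```python
import math
import re
from collections import Counter


def chunks(text, size=200):
    """Split a text into chunks of roughly `size` words (size is an arbitrary choice)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def word_counts(text):
    """Placeholder features: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_author(sample, corpus, featurize=word_counts, size=200):
    """Return the author whose closest chunk best matches the sample.

    `corpus` maps an author name to the text of their work.
    """
    target = featurize(sample)
    best_author, best_score = None, -1.0
    for author, text in corpus.items():
        for chunk in chunks(text, size):
            score = cosine(target, featurize(chunk))
            if score > best_score:
                best_author, best_score = author, score
    return best_author
```

Calling `nearest_author(my_post, {"Orwell": orwell_text, "Dickens": dickens_text})` returns whichever name owns the closest chunk. Everything interesting hides inside `featurize`.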

But what constitutes a match? We could use n-grams (words, and strings of words), as we do in many computational language tasks, but just matching the words in a book doesn’t mean you write like the author. Sure, Steinbeck and Faulkner wrote different words in their books just because of the topics they treated, but that’s not what we mean by writing style.

My guess is that writing style is more about patterns of words, especially function words like prepositions and conjunctions. (You may have noticed I start a lot of sentences with conjunctions like ‘but’ and ‘and’.) I’d try running all the words through a part-of-speech tagger, and see what matches that data best. Just a guess though.
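To make that guess concrete without pulling in a real part-of-speech tagger, here’s a rough sketch that swaps in a short hand-picked list of function words. The list, the function names, and the naive sentence splitter are all assumptions of mine, and a proper tagger would do this better.

```python
import re
from collections import Counter

# A small hand-picked list for illustration, not a standard inventory.
FUNCTION_WORDS = ["and", "but", "or", "the", "a", "of", "in", "on", "to", "with", "that", "which"]


def function_word_profile(text):
    """Relative frequency of each listed function word among all words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {w: counts[w] / total for w in FUNCTION_WORDS}


def conjunction_opener_rate(text, openers=("and", "but")):
    """Share of sentences whose first word is one of `openers`.

    Sentences are split naively after '.', '!' or '?', and a sentence that
    opens with a quotation mark or a digit is counted as a non-match.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    hits = 0
    for sentence in sentences:
        first = re.match(r"[A-Za-z]+", sentence)
        if first and first.group().lower() in openers:
            hits += 1
    return hits / len(sentences)
```

Two writers covering completely different topics could still end up with similar profiles here, which is the point: these numbers mostly ignore what the text is about.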

I wonder if Orwell writes like Orwell. Here are three adjacent passages from Orwell’s Down and Out in Paris and London, with the computer’s assessment.

Or there was Henri, who worked in the sewers. He was a tall, melancholy man with curly hair, rather romantic-looking in his long, sewer-man’s boots. Henri’s peculiarity was that he did not speak, except for the purposes of work, literally for days together. Only a year before he had been a chauffeur in good employ and saving money. One day he fell in love, and when the girl refused him he lost his temper and kicked her. On being kicked the girl fell desperately in love with Henri, and for a fortnight they lived together and spent a thousand francs of Henri’s money. Then the girl was unfaithful; Henri planted a knife in her upper arm and was sent to prison for six months. As soon as she had been stabbed the girl fell more in love with Henri than ever, and the two made up their quarrel and agreed that when Henri came out of jail he should buy a taxi and they would marry and settle down. But a fortnight later the girl was unfaithful again, and when Henri came out she was with child. Henri did not stab her again. He drew out all his savings and went on a drinking-bout that ended in another month’s imprisonment; after that he went to work in the sewers. Nothing would induce Henri to talk. If you asked him why he worked in the sewers he never answered, but simply crossed his wrists to signify handcuffs, and jerked his head southward, towards the prison. Bad luck seemed to have turned him half-witted in a single day.

H. P. Lovecraft

Or there was R., an Englishman, who lived six months of the year in Putney with his parents and six months in France. During his time in France he drank four litres of wine a day, and six litres on Saturdays; he had once travelled as far as the Azores, because the wine there is cheaper than anywhere in Europe. He was a gentle, domesticated creature, never rowdy or quarrelsome, and never sober. He would lie in bed till midday, and from then till midnight he was in his corner of the bistro, quietly and methodically soaking. While he soaked he talked, in a refined, womanish voice, about antique furniture. Except myself, R. was the only Englishman in the quarter.

Charles Dickens

There were plenty of other people who lived lives just as eccentric as these: Monsieur Jules, the Roumanian, who had a glass eye and would not admit it, Furex the Limousin stonemason, Roucolle the miser — he died before my time, though — old Laurent the rag-merchant, who used to copy his signature from a slip of paper he carried in his pocket. It would be fun to write some of their biographies, if one had time. I am trying to describe the people in our quarter, not for the mere curiosity, but because they are all part of the story. Poverty is what I am writing about, and I had my first contact with poverty in this slum. The slum, with its dirt and its queer lives, was first an object-lesson in poverty, and then the background of my own experiences. It is for that reason that I try to give some idea of what life was like there.

No wonder Orwell had writer’s block: schizophrenia.

UPDATE: Thanks to Kuri for that link in comments. It seems the author used

vocabulary (use of words), number of words, commas, and semicolons in sentences, number of sentences with quotation marks and dashes (direct speech).

I’d say this could be smartened up considerably. Just including some simple features would help, like the ratio of singletons (words appearing once) to other words, appearance of conjunctions, or ranking all the words by frequency and comparing lists.
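As a rough sketch of two of those features: the code below reads ‘ratio of singletons’ as the share of distinct words that occur exactly once, and stands in for ‘comparing lists’ with the overlap between two texts’ top-n most frequent words. Both readings, along with the function names and the default n of 50, are assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def singleton_ratio(text):
    """Share of distinct words that occur exactly once in the text."""
    counts = Counter(tokenize(text))
    if not counts:
        return 0.0
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(counts)


def top_words(text, n=50):
    """The n most frequent words, most frequent first."""
    return [word for word, _ in Counter(tokenize(text)).most_common(n)]


def top_word_overlap(text_a, text_b, n=50):
    """Jaccard overlap of the two texts' top-n word sets.

    A crude stand-in for comparing ranked lists: it ignores the order
    of words within the top n.
    """
    a, b = set(top_words(text_a, n)), set(top_words(text_b, n))
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Either number could be added to the feature vector for a chunk of text alongside the comma and semicolon counts.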

This kind of makes me want to try building a better system. I won’t (for lack of time), but I think I will keep in mind that if you can take interesting work in natural language processing and make a simple web implementation, people will think it is interesting. You can also have a lot of English major hotheads sniping at you because you snubbed Toni Morrison. Wouldn’t that be fun!


The writing is all right.

Every once in a while, I hear people complain about those rotten kids who are wantonly ruining English with their electronic gizmos and their internets. It’s a myth that’s been taken apart in various ways.

One fact that I don’t see mentioned as frequently in this discussion is that people in this generation are communicating in writing much more than previous generations. Blogs, Facebook, email, Twitter. It all adds up. So it’s nice to see this fact mentioned in this Wired article.

Lunsford is a professor of writing and rhetoric at Stanford University, where she has organized a mammoth project called the Stanford Study of Writing to scrutinize college students’ prose. From 2001 to 2006, she collected 14,672 student writing samples—everything from in-class assignments, formal essays, and journal entries to emails, blog posts, and chat sessions. Her conclusions are stirring.

“I think we’re in the midst of a literacy revolution the likes of which we haven’t seen since Greek civilization,” she says. For Lunsford, technology isn’t killing our ability to write. It’s reviving it—and pushing our literacy in bold new directions.

The first thing she found is that young people today write far more than any generation before them. That’s because so much socializing takes place online, and it almost always involves text. Of all the writing that the Stanford students did, a stunning 38 percent of it took place out of the classroom—life writing, as Lunsford calls it. Those Twitter updates and lists of 25 things about yourself add up.

It’s almost hard to remember how big a paradigm shift this is. Before the Internet came along, most Americans never wrote anything, ever, that wasn’t a school assignment. Unless they got a job that required producing text (like in law, advertising, or media), they’d leave school and virtually never construct a paragraph again.

People used to phone. Now they’re writing. And the writing isn’t half bad, possibly because the entire world is reading, ready to correct you if your logic or your spelling is faulty.

You can listen to me talking more about this on an RTRFM radio interview (about three-quarters through the stream).

Portuguese spelling changes

I’m late on this story, but I’m going with it anyway, just because as an American-Australian I think it’s nice to remember that anglophones aren’t the only ones with transnational spelling issues.

This time it’s lusophones. That is, speakers of Portuguese.

Brazilians start 2009 facing the task of learning new spelling rules that have just come into effect.

The spelling reforms have been agreed by Portuguese-speaking nations, but the language seems set to have different written forms for some time to come.

In Portugal, there has been fierce resistance in some quarters to the changes because many of the changes are to spell words the Brazilian way.

Isn’t that always the way? The European colonisers get alarmed by these American upstarts taking over the language.

Any language with an alphabetic writing system will eventually have this kind of trouble because every language undergoes sound change, making old spellings archaic. And when dialects of a language diverge, there are bound to be struggles over whose dialect gets represented in the writing system.

So what are the changes?

  • Silent consonants are getting dropped, like the silent ‘c’ in ‘actualmente’ (currently) or the ‘p’ in ‘óptimo’ (great). A tip: if you’re a Portuguese consonant, don’t hang out before ‘t’. There’s no future in it.
  • Some accent marks are being discontinued: the diphthongs ‘éi’ and ‘ói’ will lose their accents in words stressed on the second-to-last syllable, so ‘idéia’ becomes ‘ideia’.
  • The letters k, y, and w are being officially added, though they’ve been in use unofficially.

As a linguist, I usually think of language change as slow, like two tectonic plates sliding past each other. And usually it is. But, as with land masses, when the language gets locked into place in the form of writing, we can expect periodic earthquakes. This is one of those cases.

UPDATE: If you’re curious about this issue, here’s a nutty little article about it, written in ‘Simplified Spelling’ English. It’s cute. You can imagine yourself as a citizen of Parallel England, or you can imagine that you’re hanging out with Simplified English advocates like George Bernard Shaw and Mark Twain. And here’s a challenge — try reading it aloud without unconsciously affecting a dopey overbite accent, like Cousin Floyd from the country. It’s harder than you’d think.

The War on Writing Systems suffers a setback

How are we going to defend ourselves against terrorism if we’re not allowed to discriminate against different-looking people with weird writing on their shirts?

An air passenger forced to cover his T-shirt because it displayed Arabic script has been awarded a payout of $240,000 (£163,000), his lawyers say.

Two Transportation Security Authority officials and JetBlue Airways will be forced to make the payout.

Raed Jarrar, a US resident, had accused them of illegally discriminating against him based on his ethnicity and the Arabic writing on his T-shirt.

The payout is the largest of its kind since the 9/11 terror attacks.

Here’s the shirt.

Okay, I have to admit that this is not the least threatening t-shirt I have ever seen in an airport. Vaguely militant slogan plus Arabic script. I would probably think twice about wearing that for a flight.

And yet, isn’t that the lesson of this whole thing? These officials are in the business of creating a security state. It’s hard to monitor everyone all the time, so it’s useful to them if they can get individuals to do a lot of self-monitoring — to make lots of little decisions not to wear this, or not to say that, to censor themselves in a hundred ways just so they won’t fall afoul of some arbitrary and unwritten code of conduct.

And so Raed’s question that day was very appropriate:

I once again asked the three of them: “How come you are asking me to change my t-shirt? Isn’t this my constitutional right to wear it? I am ready to change it if you tell me why I should. Do you have an order against Arabic t-shirts? Is there such a law against Arabic script?”

No, there is not. The good guys won this time.

