Good Reason

It's okay to be wrong. It's not okay to stay wrong.


Google’s contextual spell checker is cool

It was a Great Moment in Tech Support. The caller asked me how he could remove a word from his WordPerfect dictionary.

It was an unusual request, but we got a lot of those. “What word do you want to remove?” I asked.

He stumbled. “Um… ‘pubic’?”

I knew immediately what had happened. ‘Pubic’ is a real word, of course, but he hadn’t meant to use it in his document, and there he’d gone and given a presentation on ‘pubic works’ and how ‘pubic libraries’ operate for the ‘pubic good’. These things can happen when you speak in pubic.

His dumb spell checker had failed him. Spell checkers have been around so long that we’re used to their limitations, and one of them is that they’re insensitive to context. Well, no more. Google’s Docs Blog (via Lifehacker) has announced that it’s rolled its “do you mean” spell checker into Google Docs.

1. Suggestions are contextual. For example, the spell checker is now smart enough to know what you mean if you type “Icland is an icland.”
2. Contextual suggestions are made even if the misspelled word is in the dictionary. If you write “Let’s meat tomorrow morning for coffee” you’ll see a suggestion to change “meat” to “meet.”
3. Suggestions are constantly evolving. As Google crawls the web, we see new words, and if those new words become popular enough they’ll automatically be included in our spell checker—even pop culture terms, like Skrillex.

How do ordinary spell checkers work?

Spell checkers work by taking words that don’t appear in the dictionary (sometimes known as ‘out-of-vocabulary’ words, or OOV), and comparing the string to a list of known words in the dictionary. To figure out the most likely suggestion, they calculate an ‘edit distance’, or how many changes it would take to go from the malformed word to a known word.

So how do you calculate the edit distance? One easy measure is the Levenshtein distance. It’s pretty intuitive. Ask yourself: How many changes would it take to go from ‘pubic’ to ‘public’? Just one: add an ‘l’. So the edit distance is 1. But the computer calculates this using a grid. This is the cool part.

Start by putting the two words in a grid like so; one word down and the other across. Also, fill the second row and column with numbers. (This will make sense in a minute.)

      p  u  b  i  c
   0  1  2  3  4  5
p  1
u  2
b  3
l  4
i  5
c  6

Now, fill each of the inner boxes with one of three numbers, whichever is lowest:

  1. The number above plus 1
  2. The number to the left plus 1, or
  3. The number to the upper left, plus 1 if the two letters don’t match (that’s called the “cost”), or plus 0 if the two letters do match.

For our example, ‘p’ matches ‘p’, so the smallest number would be the 0 to the upper left. No cost.

      p  u  b  i  c
   0  1  2  3  4  5
p  1  0
u  2
b  3
l  4
i  5
c  6

On we go, down the column. None of the other letters are a ‘p’, so the lowest number for each box would be the one just above it, plus one. Notice how the numbers keep stacking up.

      p  u  b  i  c
   0  1  2  3  4  5
p  1  0
u  2  1
b  3  2
l  4  3
i  5  4
c  6  5

We start again at the next column. The ‘p’ and the ‘u’ aren’t a match, so we give it a 0 + 1 from the left, but the ‘u’ and the ‘u’ are a match, so that box gets a cost-free ‘0’ from the upper-left.

      p  u  b  i  c
   0  1  2  3  4  5
p  1  0  1
u  2  1  0
b  3  2
l  4  3
i  5  4
c  6  5

You can work out the rest of the table if you’re keen, but here it is in full.

      p  u  b  i  c
   0  1  2  3  4  5
p  1  0  1  2  3  4
u  2  1  0  1  2  3
b  3  2  1  0  1  2
l  4  3  2  1  1  2
i  5  4  3  2  1  2
c  6  5  4  3  2  1

Notice how everything’s going smoothly until the ‘l’ row, where the first real mismatch is. But the number to watch out for is that last one in the lower right. When the whole table is filled out, that’s where your answer is. So the words ‘pubic’ and ‘public’ have a Levenshtein distance of 1, which matches our intuition about the number of changes we’d have to make to go from one to the other.

You can try this with any two words, either on paper, or using this handy website here. Having a play with it is a good way of getting a grip on this algorithm.
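If you’d rather let the computer fill in the grid, here’s a minimal Python sketch of the three rules above. The function name and variable names are my own choices for illustration, and this is the plain textbook version with every change costing 1.

```python
def levenshtein(source: str, target: str) -> int:
    """Edit distance between two words, using the grid method."""
    # One column per letter of `source` and one row per letter of `target`,
    # plus the extra numbered row and column at the top and left.
    rows, cols = len(target) + 1, len(source) + 1
    grid = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        grid[0][c] = c  # the numbered top row: 0 1 2 3 ...
    for r in range(rows):
        grid[r][0] = r  # the numbered left column: 0 1 2 3 ...

    for r in range(1, rows):
        for c in range(1, cols):
            # The "cost" is 0 if the two letters match, 1 if they don't.
            cost = 0 if target[r - 1] == source[c - 1] else 1
            grid[r][c] = min(
                grid[r - 1][c] + 1,         # the number above, plus 1
                grid[r][c - 1] + 1,         # the number to the left, plus 1
                grid[r - 1][c - 1] + cost,  # the upper left, plus the cost
            )
    # The answer is the number in the lower right.
    return grid[-1][-1]


print(levenshtein("pubic", "public"))  # 1
```

Note that ‘form’ and ‘from’ come out at a distance of 2, since this version has no special rule for swapped letters.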

There are lots of ways we can tweak this spell-checker. We can adjust the cost so that near keys (and therefore more plausible typing mistakes) cost less than farther-away keys. We could adjust for frequency so that more common words float to the top of our suggestion list. But what we can’t do is look at nearby words to see what’s likely. That means the classic ‘form/from’ problem is beyond the reach of our spell checker.

But not the one from Google Docs. It will flag words, even if they’re real words. Behold:

Note how throwing in a related word (‘pelvis’) in that last example is enough to calm the spell checker down.

How does it do it? It looks like it works by calculating the probability of other words appearing nearby. Articles like ‘the’ and ‘a’ are likely to appear before ‘island’, less likely before ‘Iceland’. The whole thing could be modelled with n-grams (nearby words) using a sufficiently large language corpus, which Google certainly has. And that huge corpus ensures that lots of words will be in the dictionary, including low-frequency or brand new terms.
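To make the n-gram idea concrete, here’s a toy sketch in Python. It is only my guess at the general approach, not Google’s actual method: the tiny corpus, the scoring rule, and the function names are all invented for illustration. It counts pairs of adjacent words (bigrams), then picks whichever candidate has been seen most often next to its neighbours.

```python
from collections import Counter

# A toy corpus standing in for a web-sized one (an assumption for illustration).
CORPUS = (
    "let's meet tomorrow . we meet for coffee . "
    "i eat meat for dinner . the meat was fresh . "
    "shall we meet tomorrow morning ."
).split()

# Count every pair of adjacent words.
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))


def context_score(prev_word, word, next_word):
    """How often `word` was seen next to these two neighbours."""
    return BIGRAMS[(prev_word, word)] + BIGRAMS[(word, next_word)]


def best_in_context(prev_word, candidates, next_word):
    """Pick the candidate that fits best between its neighbours."""
    return max(candidates, key=lambda w: context_score(prev_word, w, next_word))


print(best_in_context("let's", ["meat", "meet"], "tomorrow"))  # meet
print(best_in_context("eat", ["meat", "meet"], "for"))         # meat
```

A real system would need smoothing for pairs it has never seen, and a far bigger corpus, but the principle is the same: both words are in the dictionary, and only the neighbours decide between them.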

It’s good to know that people are still adding to a technology that’s so seemingly mundane.

Milk: Lost in translation

If you use Google Translate to translate “Got milk?” into Spanish, and burrow into the ‘alternate translations’ it offers you, one of the choices is “bigote de leche”, which means “milk moustache”. I’m leaving it as an exercise for the reader to figure out how it arrived at that translation.

Of course, this is better than the tagline they went with in Spanish-speaking countries: “¿Tiene leche?”, which sounds plausible enough to a non-native speaker, but which carries maternal associations, something along the lines of “Are you lactating?”

Steve Jobs

I’ve been a Mac guy ever since 1984, which is when my Dad bought a Mac Classic for his office. He thought it was the greatest, and he was right. Man, how many hours I spent at my Dad’s office on that cute little white box! (No hard drive, two disc drives.) I was drawing, writing, pointing, clicking. I was computing, and it was easy and fun. I was hooked.

In the 1990s, I had to confront the strange paradox that puzzled every Mac user: Despite the obvious superiority of the Mac (System 7 at that stage), and its ease of use, PCs still existed. How could this be? Mass delusion? I bored legions of friends with my Mac evangelism… and fretted about Apple’s predicted demise.

Then came the Return of Steve Jobs. He showed us how to turn a company around. And he did it by doing something unexpected — by me at least. While I thought the salvation of the Mac would happen when people saw what great software the OS was, Jobs went at it from the hardware end. He oversaw and (I think it’s fair to say) designed great-looking and great-working products that people couldn’t wait to get their hands on.

Steve Jobs brought us insanely great computers. Computers look and work the way they do today because of choices he made. They have mouses. They have trackpads. (Macs were the first to have them.) They have sophisticated font capabilities because Steve loved typography (like I do).

I love my Mac, and my iPod. I use them both every day. (No iPhone yet.) I love what these smart little things bring to my life. I hope Apple keeps making great things, now that Steve Jobs is gone. I feel like we all owe him a lot for what he brought to computing.

Gamers for science

This was exciting to see: Learning the structure of an AIDS-like virus stumped scientists for 15 years. FoldIt gamers cracked it in ten days.

“This is one small piece of the puzzle in being able to help with AIDS,” Firas Khatib, a biochemist at the University of Washington, told me. Khatib is the lead author of a research paper on the project, published today by Nature Structural & Molecular Biology.
The feat, which was accomplished using a collaborative online game called Foldit, is also one giant leap for citizen science — a burgeoning field that enlists Internet users to look for alien planets, decipher ancient texts and do other scientific tasks that sheer computer power can’t accomplish as easily.

“People have spatial reasoning skills, something computers are not yet good at,” Seth Cooper, a UW computer scientist who is Foldit’s lead designer and developer, explained in a news release. “Games provide a framework for bringing together the strengths of computers and humans.”

I’ve done work on crowdsourcing annotation in language tasks, so it’s good to see it working in this domain. I love the idea of people putting their heads together and solving problems. For all our computing might, nothing can match human brains on some tasks.

‘is’ v ‘has’

Back in the 1600s, people used auxiliary ‘be’ + some verbs of motion, where today we’d use ‘have’.

Shakespeare did it with ‘is fled’.

LENNOX
’Tis two or three, my lord, that bring you word
Macduff is fled to England.

And ‘is come’.

LUCILIUS
He is at hand; and Pindarus is come
To do you salutation from his master.

I thought it would be fun to check it in Google Books Ngram Viewer, and see when ‘has X-en’ became more popular than ‘is X-en’. But you can’t do it with any old verb like ‘make’; it has to be intransitive. Otherwise, you’re scooping up ‘is made’ constructions like “That’s how rubber is made.” Those are still okay now. I want the ones where ‘is X-en’ has been replaced by ‘has X-en’. And the pattern seems particularly common with verbs of motion.

Here’s ‘fled’. Notice that the crossover happens around 1830ish.

And ‘come’. They cross over at about the same time: 1840ish.

‘Arrive’ arrives early, about 1810ish.

Here’s ‘depart’, right on the button: 1830 again.

And we see more or less the same pattern with other verbs like ‘land’ and ‘become’.

What was happening in English in 1800–1840?

When I brought this to the attention of fellow linguist Mark Ellison, he suggested twiddling the ‘Corpus’ menu between ‘British’ and ‘American’. This revealed that the stodgy conservative British books held onto the old usage longer. Perhaps the Americans were at the head of this ‘is – has’ innovation, and the rise we see in the corpus was partially due to more books being published in the Colonies.

I’ll have to do some looking around to see if anyone knows more about it. Luckily, I have two experts on present perfect in my very own department. Meanwhile, I think it’s cool that I can search centuries of language patterns in seconds.

Markov Everything!

Someone on Twitter has created Markov Bible.

We’ve had fun with Markov chains on the blog before. They’re really quite simple: just take a big file full of text, and pick any two adjacent words at random (let’s say it’s ‘in the’). Then, find every occurrence of the words ‘in the’, and make a list of every word that occurs right after them. Pick one of those at random, and that’s word number 3. Now repeat with your word number 2 and 3 to get a word 4, and so on for as long as you want.
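Here’s a minimal Python sketch of that recipe, if you want to try it on your own text. The function names, the sample sentence, and the `seed` parameter are my own additions for illustration; swap in any big text file split into words.

```python
import random
from collections import defaultdict


def build_chain(words):
    """Map each pair of adjacent words to every word seen right after it."""
    chain = defaultdict(list)
    for first, second, third in zip(words, words[1:], words[2:]):
        chain[(first, second)].append(third)
    return chain


def generate(words, length=30, seed=None):
    """Generate up to `length` words by repeatedly following the chain."""
    rng = random.Random(seed)
    chain = build_chain(words)
    # Pick any two adjacent words at random to start with.
    start = rng.randrange(len(words) - 1)
    output = [words[start], words[start + 1]]
    while len(output) < length:
        followers = chain.get((output[-2], output[-1]))
        if not followers:
            # This pair only shows up at the very end of the text.
            break
        # Pick one of the following words at random, then repeat.
        output.append(rng.choice(followers))
    return " ".join(output)


text = (
    "in the beginning was the word and the word was in the book "
    "and the book was in the beginning"
)
print(generate(text.split(), length=15, seed=1))
```

For a mashup, just join the two texts into one word list before building the chain. The output hops between sources whenever they share a pair of words.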

It’s fun to mess around with the Bible, but my favourite thing is to do mashups. Here’s the Bible combined with George Orwell’s 1984.

They say unto you, Ye shall worship at his saying, and nipped off to Canada like cattle. They could do nothing against the children of the same: but the one end of three years old when he would have cast upon a pole, and it was too late–no such thought occurred to me, and on the north corner, he made windows of agates, and thy master’s son? And Ziba said unto Onan, Go in this book.

That last part is funnier if you know who Onan is.

And here’s some of Alice’s Adventures in Wonderland, mixed in with Franz Kafka’s The Penal Colony.

CHAPTER XI Who Stole the Tarts? The King laid his head and, feeling behind him with his face to the Bed. First, I’ll describe the apparatus to you.” The Traveller acted as if a dish or kettle had been asked to attend the execution of a bottle. They all came different!’ the Mock Turtle replied; ‘and then the execution is a very grave voice, ‘until all the players, except the King, who had been sleeping on duty. For his task is to give the prizes?’ quite a chorus of ‘There goes Bill!’ then the other, and making quite a long time together.’ ‘Which is just the case might be, if he had neglected to look down and appeared peaceful. The Soldier showed the Traveller and laid his head sadly. ‘Do I look like one, but it is.’

The longer the sentence, the less likely it is to be coherent, since Markov chaining doesn’t preserve the long-range structure of a sentence. But still, it’s surprising when it works.

Spam spotting

Here’s an interesting application for Natural Language Processing. How do you know if a review for a hotel or a business is real, or just a spammy one written by someone who’s never even seen the place?

Computers can do it. Just get people to write a ton of spammy reviews, then get some reviews that (you suspect) are real, and compare the patterns.

Can’t read the text? Fake reviews were more likely to use “I” and “me”, adverbs like “really”, and exclamation points.
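Just to show what counting those cues might look like, here’s a rough Python sketch. To be clear, this is not the authors’ classifier: the function name, the choice of features, and the per-word rates are my own assumptions, and a real system would feed numbers like these into a trained model.

```python
import re

# The first-person words mentioned above.
FIRST_PERSON = {"i", "me"}


def review_features(text):
    """Rates of a few spam cues, per word of the review."""
    words = re.findall(r"[a-z']+", text.lower())
    total = max(len(words), 1)  # avoid dividing by zero on empty reviews
    return {
        "first_person": sum(w in FIRST_PERSON for w in words) / total,
        "really": words.count("really") / total,
        "exclamations": text.count("!") / total,
    }


print(review_features("I really loved it! The staff treated me so well!"))
```

Run that over a pile of known-fake and probably-real reviews, and the differences in these rates are the kind of pattern a computer can pick up on.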

Here’s a PDF of the authors’ ACL presentation.

30-Day Blog September

I’m starting something new, and I’m calling it 30-Day Blog September. Every day in the month of September, I am going to blog something. It may be the most interesting news article I found that day, a thought I had, or a longer piece, but it will be something, and it will be every day.

You could try it too, if you have a blog. Maybe it will shake us both out of Blog Lethargy, and help us realise that not every post needs to be a Serious Thought Piece. Want to join me?

UPDATE: I has a graphic.

If you’re up for 30-Day Blog September, slap this graphic somewhere on your blog, and link to this post.

I give it one star

You’ve got to give the LDS Church credit for working the Internet. One of their latest suggestions for members eager to share that gospel message is here (h/t Chino):

Google Reviews for LDS Chapels

This task involves submitting a review of your local meetinghouse to Google. Doing so will help make our local meetinghouses more visible in Google searches for people who are looking for a church to attend.

People can submit Google reviews for churches? Sounds like fun!

You may find a visit here to be pleasant enough. If you decide to investigate the church more in-depth, you will be presented with an escalating series of commitments. At first, it’s going to 3-hour church meetings and reading the Book of Mormon. Eventually, you’ll have promised to give the church 10% of your income and even more of your time. They offer no evidence for their many outlandish claims, including God living near a star named Kolob, or ancient Hebrews building boats and sailing to America. You’re meant to accept all this based on feelings, which are no substitute for evidence. Mormons are generally nice people, but you probably have better things to do.

Try writing one for your local meetinghouse. It’s hard to be concise, but the real trick is to sound sensible and well-reasoned. If you start raving about underwear, then you sound like the crazy one. It’s so unfair.

Talk the Talk: Now on iTunes!

This is exciting: my podcast “Talk the Talk” is now on iTunes! Yes, you can now hear a fresh dose of linguistics news every week, in convenient podcast form.

Head over to the link by clicking on the nifty graphic below. Subscribing is free, of course.

While I’m thinking about iTunes — I’m still trying to figure out the profanity guidelines for their titles. I noticed that ‘shit’ turns into ‘sh*t’, which is fine. But ‘WTF’ comes out ‘W*F’. They starred the T? Shouldn’t they have starred the F?

WT*?


© 2024 Good Reason
