Good Reason

It's okay to be wrong. It's not okay to stay wrong.

The Google web corpus

Google is releasing its lists of n-grams. What’s an n-gram, you ask? An n-gram is n words in a row. For example, ‘how are you’ would be a 3-gram.

Computational linguists use n-grams all the time because they’re a cheap, easy, and effective way of showing what’s in a sentence or a document.
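
To make that concrete, here's a minimal sketch in Python of pulling the n-grams out of a sentence. It assumes words are simply separated by whitespace, and the function name `ngrams` is just mine for illustration; a real tokenizer would also have to deal with punctuation and such.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()  # assumption: words are whitespace-separated
    # Slide a window of width n across the word list.
    # If the text has fewer than n words, the range is empty and so is the result.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("how are you", 3))        # the single 3-gram from the example above
print(ngrams("how are you today", 2))  # three overlapping 2-grams
```

Note that the windows overlap: a sentence of k words yields k - n + 1 n-grams, which is part of why the counts get so big so fast.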

So, um, how many n-gram types are in the Google trillion-word corpus?

We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times.

I was also interested in this tidbit:

There are 13,653,070 unique words, after discarding words that appear less than 200 times.

That seems to mean that in all the language on all the Web that Google processed, there are about 13.6 million word types that show up at least 200 times. I wonder how many of those are English.
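
If the difference between running words (tokens) and unique words (types) is fuzzy, here's a rough sketch of the kind of counting involved. The toy sentence, the cutoff of 2 (standing in for 200), and the function name `frequent_types` are all my own assumptions for illustration, not anything from Google's actual pipeline.

```python
from collections import Counter


def frequent_types(tokens, min_count):
    """Count each word type and keep those seen at least min_count times."""
    counts = Counter(tokens)
    # Discard types that appear fewer than min_count times.
    return {word: c for word, c in counts.items() if c >= min_count}


# Toy corpus: whitespace-separated tokens (an assumption for this sketch).
tokens = "the cat saw the dog and the dog saw the cat".split()
kept = frequent_types(tokens, min_count=2)

print(len(tokens))       # number of running words (tokens)
print(len(set(tokens)))  # number of unique words (types)
print(sorted(kept))      # types that survive the cutoff
```

Here 'and' shows up only once, so it gets dropped, the same way rare words fall out of the 13-million figure.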

2 Comments

  1. Dude,

    You should go work for Google. Techy and linguisticky.

    Besides, the future isn’t discovering things or making things. It’s keeping track of all the things that have already been made or discovered.

    And categorizing things! Putting them in boxes with items of like category, marking the box appropriately, and moving it to a new house so you can forget again that you ever had it. Then you’ll find it again at 2am the morning before the truck comes, when you’re so tired that the item itself has no meaning. Only the category and appropriate marking are important.

    I need some sleep.

  2. Dobbins:
    When you are feeling a bit more rested you should stop by a restaurant/bar called “The Crush.” You will be waited on by the most lovely drag queens and relax right into Portland. Ask for the owner, John “Woody” Clark, and tell him Jeff sent you. Should be good for at least one free drink :). I hope that after the move blues wear off you start to enjoy Portland; it’s a wonderful place.


© 2024 Good Reason
