Google & Natural Language Processing

So, I was going to write about unemployment and how the job market has changed, but I got scooped by an amazing article by Drake Bennett called “The end of the office…and the future of work.”  It is a great look into the phenomenon of structural unemployment.  The analysis is very timely, but it could go much deeper.  Drake, if you plan on writing a book, here’s your calling.  There are lots of good stories already written on this subject by giants such as Jeremy Rifkin, John Seely Brown, Kevin Kelly, and Marshall Brain.

While reeling from the scoop, depressed and doing some preliminary market research, I happened upon a gem of a blog post by none other than our favorite search company, Google.  Before you go any further in my post, I recommend that you read the blog post by Steve Baker, Software Engineer @ Google.  I think he does an excellent job of describing the problems Google is currently having and why they need such a powerful search quality team.

Here’s what I got from the blog post:  Google, though they really want them, cannot have fully automated quality algorithms.  They need human intervention…and A LOT OF IT.  The question is, why?  Why does a company with all of the resources, power, and money that Google has still need to hire humans to watch over search quality?  Why have they, in all of their genius, not created a program that can do this?

Because Google might be using methods which sterilize away meaning out of the gate.

Strangely enough, it may be that Google’s core engineering mindset is holding them back…

We can write a computer program to beat the very best human chess players, but we can’t write a program to identify objects in a photo or understand a sentence with anywhere near the precision of even a child.

This is an engineer speaking, for sure.  But I ask you:  What child do we really program?  Are children precise?  My son falls over every time he turns around too quickly…

The goal of a search engine is to return the best results for your search, and understanding language is crucial to returning the best results. A key part of this is our system for understanding synonyms.

We use many techniques to extract synonyms, that we’ve blogged about before. Our systems analyze petabytes of web documents and historical search data to build an intricate understanding of what words can mean in different contexts.

Google does this using massive dictionary-like databases.  They can only achieve this because of the sheer size and processing power of their server farms of computing devices.  Not to take away from Google’s great achievements, but Syntience’s experimental systems have been running “synthetic synonyms” since our earliest versions.  We have no dictionaries and no distributed supercomputers.

As a nomenclatural [sic] note, even obvious term variants like “pictures” (plural) and “picture” (singular) would be treated as different search terms by a dumb computer, so we also include these types of relationships within our umbrella of synonyms.

Here’s the way this works, super-simplified:  there are separate “storage containers” for “picture”, “pictures”, “pic”, “pix”, “twitpix”, etc., all in their own neat little boxes.  This separation removes the very thing Google is seeking: meaning in their data.  That’s why their approach doesn’t seem to make much sense to me for this particular application.

An engineer’s job here is to write code that, in a sense, tells the computer to create a new little box and put the new word in a list of associated words.  Shouldn’t the computer have some sort of continuous, flowing process that lets it break out of the little boxes and free-associate?  Well, the answer is “Not using Google’s methods.”
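To make the “little boxes” picture concrete, here is a minimal Python sketch of it.  This is my own super-simplified illustration, not Google’s actual code; the class name `SynonymBoxes`, its methods, and the group label `"PICTURE"` are all hypothetical.  It only shows the shape of the idea: every word sits in an explicitly assigned box, and a word nobody has filed yet matches nothing but itself.

```python
from collections import defaultdict


class SynonymBoxes:
    """Hypothetical toy model: each term is filed by hand into one named box."""

    def __init__(self):
        self._group_of = {}                # term -> name of the box it was filed in
        self._members = defaultdict(set)   # box name -> all terms filed in that box

    def add(self, term, group):
        # Someone (or some rule) has to decide which box a word belongs in.
        self._group_of[term] = group
        self._members[group].add(term)

    def expand(self, term):
        # Return every term in the same box; an unfiled term only matches itself.
        group = self._group_of.get(term)
        if group is None:
            return {term}
        return set(self._members[group])


boxes = SynonymBoxes()
for word in ("picture", "pictures", "pic", "pix"):
    boxes.add(word, "PICTURE")

print(sorted(boxes.expand("pix")))      # every term filed under "PICTURE"
print(sorted(boxes.expand("twitpix")))  # never filed, so it stands alone
```

In this toy version, “twitpix” stays disconnected from “pix” until somebody writes the line that files it, which is exactly the constraint I am complaining about.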

You see, Google models the data to make it easily controllable…actually for that and for many, MANY other reasons.  But by doing so, they have put themselves in an intellectually mired position.  Monica Anderson does a great analysis of this in a talk on the Syntience Site called “Models vs. Patterns”.

So, simply and if you please, rhetorically:

How can computer scientists ever expect a computer to do anything novel with data when there is someone (or some rule or piece of code) telling it precisely what to do all the time?

Kind of constraining…I guess that’s why they always start coding at the “command line”.

  1. abdulQahhar
    January 22, 2010 at 6:00 AM

    I’ve long maintained that data only becomes information when attention collapses its “meaning vector”.

    Can you turn data into information by removing entropy? In an ultra-low entropy Bose-Einstein Condensate, identity smears between (formerly) individual atoms.

    Will an AN be able to pay attention in this context?

  2. January 22, 2010 at 5:32 PM

    Wow…I hate to duck out on that one, Michael but maybe we should ask Monica.
