Friday, February 15, 2008

One fish, two fish...

kw: musings, business, searching

For your translation needs, there's BabelFish*.

For everything else, there's the GoogleFish!

*BabelFish is a product of altavista™.

This editorial cartoon by Daryl Cagle showed up in the morning paper today, and I ran right down and scanned it. This is a de-halftoned, small version of the scan. You can also see an original color image (as long as the link lasts).

I think Mr. Cagle has captured the essence of just how important searching is. During the past decade, I've moved professionally from info tech to info science. Whatever IS was called in the distant past, it has been around since Mesopotamian herders began to use marks on clay to take inventory ("Where'd I leave the tablet with Sam's receipt?"). We originated as hunters and gatherers: searchers. It is still what we do best and most obsessively. Ya gotta search to get work, food you can afford, a mate, a dwelling, transportation, and on and on...

Searching is big business. I work for an industrial company that does a lot of research. The library services division does mainly two things: catalog things so it will be easy to find them, and find things, particularly things that aren't well cataloged. Most of such "things" are books and smaller documents. I am now the main custodian of a thesaurus...not the book by Roget, but a large collection of keywords and key phrases ("terms" in the profession), arranged in a hierarchy, also called a "controlled vocabulary". The cataloging tasks aim to attach appropriate key terms to things like tech reports, so someone can later find documents that are useful to them. In this context, the cataloging is called "conceptual analysis," and employs a number of rather costly experts who can quickly read a document and extract concepts that they render as key terms from the thesaurus, anywhere from five to a couple dozen per document.
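For the curious, here is a minimal Python sketch of how a hierarchical controlled vocabulary can drive a search. Everything in it is hypothetical: the terms, the report IDs, and the function names are made up for illustration, and the idea that a search on a broad term also picks up its narrower terms is my assumption about one common use of the hierarchy, not a description of any particular system.

```python
# Hypothetical mini-thesaurus: each term points to its broader (parent) term.
BROADER = {
    "materials": None,
    "polymers": "materials",
    "nylon": "polymers",
    "coatings": "materials",
}

# Hypothetical catalog: key terms a cataloger attached to each report.
CATALOG = {
    "report-001": {"nylon"},
    "report-002": {"coatings"},
    "report-003": {"polymers"},
}


def narrower_terms(term, broader=BROADER):
    """Return the term itself plus every term beneath it in the hierarchy."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in broader.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def find_documents(term, catalog=CATALOG):
    """Return the reports tagged with the term or any narrower term."""
    wanted = narrower_terms(term)
    return sorted(doc for doc, terms in catalog.items() if terms & wanted)
```

With this toy data, `find_documents("polymers")` picks up the report tagged "nylon" as well as the one tagged "polymers", which is the payoff of arranging the terms in a hierarchy.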

This kind of cataloging is quite a bit different from putting a Dewey Decimal number on a library book. There, each book gets one key, and one only. The numbers keep the keys short. For example, DD numbers in the 800s are "Literature". 8n1 (n = a digit 1-8) means "Poetry", where 811 is "American poetry in English", 861 is "Spanish poetry" and so forth. Digits after a decimal further refine the subject. But in the poetry section, I have seldom seen digits after the 811 or 821 ("British poetry"); poets seldom stick to a narrow subject beyond a single poem. You'll see strings of three or four digits (after the decimal) for narrowly focused books. Longer strings are for multiple, discernible foci; but this is rare. If you see a humongous DD number, ask your librarian to explain it.
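The digit-by-digit structure of a DD number can be sketched in a few lines of Python. This is only a toy: the lookup table holds just the three poetry classes named above, the function name is my own, and nothing here attempts to interpret the digits after the decimal point.

```python
# Illustrative subset only: the middle digit of an 8n1 "poetry" number.
# These three labels are the ones mentioned in the text; the real
# Dewey schedules are far larger.
POETRY_BY_MIDDLE_DIGIT = {
    "1": "American poetry in English",
    "2": "British poetry",
    "6": "Spanish poetry",
}


def describe_dewey(number):
    """Give a rough description of a Dewey number in the 800s."""
    main, _, refinement = number.partition(".")
    if len(main) != 3 or not main.isdigit():
        raise ValueError(f"not a Dewey class number: {number!r}")
    if main[0] != "8":
        return "outside the 800s (not covered by this sketch)"
    label = "Literature"
    if main[2] == "1" and main[1] in POETRY_BY_MIDDLE_DIGIT:
        label = POETRY_BY_MIDDLE_DIGIT[main[1]]
    if refinement:
        # Digits after the decimal narrow the subject; we only note them.
        label += f", refined by .{refinement}"
    return label
```

So `describe_dewey("861")` gives "Spanish poetry", while an unrecognized number in the 800s falls back to plain "Literature".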

Of course, the people who search my company's document collection don't only use key term searches. They can also search for documents that contain any word(s) they like. Now that everything is automated and on disk, searching is getting simpler all the time. However, "free text" searches tend to return a lot of clutter. Thus, large companies find professional cataloging and indexing worth the cost.

What about the rest of us? We have Google. Google is so big in searching now, we use the new verb "to google" even when we are searching via altavista or Yahoo! or DogPile. By the way, for the opening page that Yahoo! used to display, they employed professional indexers to categorize the first set of links you'd get when you clicked on a category. That has become rather unwieldy, with millions of new web pages daily, and I don't think they do it any more. What does Google do, to retrieve pages we find useful? Two things.

Firstly, the thing they get all the press for: the "popularity ranking" method. When they track down web pages, they keep track of how many other pages link TO each page. A web page with lots of links to it gets a higher score than one with a few or only one. Of course, many pages (possibly most) have no inbound links at all. When you get a small number of hits, the screwy ones in the last half of the list are likely these poor orphans.
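The counting idea is simple enough to sketch in Python. Keep in mind this is just the simplified "count the inbound links" picture described above, not Google's actual ranking code; the page names and function names are invented for illustration.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many distinct other pages link TO each page.

    `links` maps each page to the list of pages it links out to.
    """
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # count each linking page only once
            if target != source:      # ignore a page linking to itself
                counts[target] += 1
    return counts


def rank_by_popularity(links):
    """Order pages from most to fewest inbound links (ties by name)."""
    counts = inbound_link_counts(links)
    return sorted(counts, key=lambda page: (-counts[page], page))
```

Given four hypothetical pages where "a", "b" and "d" all link to "c", and "a" also links to "b", the ranking puts "c" first, then "b", with the two orphans "a" and "d" at the bottom.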

But secondly, they set up a smarter sorting method for searches against more than one word. Search for peach and only the first method can be used. Peach fruit, peach trees, and Peachtree software are ranked only by popularity (I just got 58 million hits on this word). Enter peach tree, and you'll get a much smaller number (297,000 just now). Now try tree peach (889,000). What happened?

In the past, with anyone but Google, peach tree and tree peach would get the same number of hits, perhaps not in the same order, and a much larger number than for peach alone. That's because you'd get every document that contained either peach or tree, in addition to the ones you wanted that contained both words, and if possible, the phrase peach tree. The Google method assumes you want every "significant" word in your search phrase to be in all your hits. The "significant" words are words besides "and", "the", "you", "that" and so forth; you can search for "dogs that are larger than forty pounds", and the search will really be done for "dogs+larger+forty+pounds".
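The difference between the old "either word" style and the "every significant word" style can be sketched in Python. The stop-word list here is my own guess, filled out just enough to reproduce the "dogs" example; the real list a search engine uses is not something I know. The three sample documents are made up.

```python
# Assumed stop-word list: the four words named in the text, plus "are"
# and "than" (implied by the dogs example) and a couple of other
# common little words.
STOP_WORDS = {"and", "the", "you", "that", "are", "than", "a", "of"}


def significant_words(query):
    """Drop the little words and keep the rest, in order."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]


def words_in(doc):
    return set(doc.lower().split())


def search_any(query, docs):
    """Old style: a document matches if it has ANY significant word."""
    wanted = set(significant_words(query))
    return [d for d in docs if wanted & words_in(d)]


def search_all(query, docs):
    """Newer style: a document must contain EVERY significant word."""
    wanted = set(significant_words(query))
    return [d for d in docs if wanted <= words_in(d)]
```

Against the documents "peach pie recipe", "tree house plans" and "how to prune a peach tree", `search_any("peach tree", ...)` returns all three, while `search_all` returns only the last one.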

Finally, the multiple words have a certain order, and phrases are given priority. If the exact phrase shows up in a lot of documents, Google won't give you documents without the phrase. Otherwise, it gives you all the documents that contain all the significant words in any order.
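Here is a rough Python sketch of that phrase-first rule. It is my reading of the observed behavior, not anything Google has published: the `min_phrase_hits` threshold is a made-up stand-in for "a lot", the sample documents are invented, and stop words are ignored to keep the example short.

```python
def tokens(text):
    return text.lower().split()


def contains_phrase(doc_tokens, phrase_tokens):
    """True if the query words appear side by side, in order."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def search_with_phrase_priority(query, docs, min_phrase_hits=2):
    """Prefer exact-phrase matches; fall back to all words in any order.

    min_phrase_hits is a hypothetical cutoff for "the phrase shows up
    a lot"; the real rule, if there is one, is unknown to me.
    """
    query_tokens = tokens(query)
    all_words = [d for d in docs if set(query_tokens) <= set(tokens(d))]
    phrase_hits = [d for d in all_words
                   if contains_phrase(tokens(d), query_tokens)]
    if len(phrase_hits) >= min_phrase_hits:
        return phrase_hits
    return all_words
```

With three documents, two containing "peach tree" as a phrase and one with the words apart, searching peach tree returns the two phrase matches, while tree peach (a phrase nobody wrote) falls back to all three. That is the same pattern as the hit counts above: the reversed query gets more hits, not fewer.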

Why go to so much trouble? People really care about what they find. Enough to buy lots of stuff based on their searches. Retailers and service companies pay Google a lot for help getting their web sites found more easily, and for priority placement (pay Google, and you go right to the top!).

That simple business model has made Google too big for Microsoft to eat whole, so it tried to eat Yahoo! So far, Yahoo! has slipped from its grasp. We'll see.
