From:                     Kendra Smith

Sent:                      Wednesday, January 12, 2000 1:14 AM

To:                         Microsoft Research Tech Talk, Sem. Notice

Cc:                         Kendra Smith

Subject:                 UW-CSE Colloq / 1-18-2000 / Burrows / Compaq SRC / The AltaVista Indexing and Search Engine


 

*NOTE* This lecture will be broadcast live via the Internet. See

http://www.cs.washington.edu/news/colloq.info.html for more information.

 

UNIVERSITY OF WASHINGTON

Seattle, Washington 98195

 

Department of Computer Science and Engineering

Box 352350

(206) 543-1695

 

COLLOQUIUM

 

SPEAKER:      Mike Burrows, Compaq SRC

 

TITLE:          The AltaVista Indexing and Search Engine

 

DATE:           Tuesday, January 18, 2000

 

TIME:           3:30 pm

 

PLACE:          134 Sieg Hall

 

HOST:           Hank Levy

 

ABSTRACT:

 

I'll motivate the talk with an overview of how a web search engine is

organized.  I'll then describe in more depth a key component of the

AltaVista search engine: its indexing library.  The library manages a set

of inverted files, and provides mechanisms to construct and optimize

complex queries on those inverted files.  It is a low-level library; it

does not perform high-level functions such as parsing queries, parsing

text to be indexed, or computing ranking scores. Instead it supplies the

interface to allow these operations to be implemented.  The design goals

were to enable efficient queries on bodies of text up to a few hundred

gigabytes in size (e.g. AltaVista) without sacrificing too much

generality, and without giving up on small applications (e.g. mail

directories).  The key design choices covered include:

        - the use of flat inverted files, and

          the techniques to allow their efficient update.

        - the byte-level format of the inverted files, and the sequence

          of instructions used to parse that format.

        - the internal abstractions used to construct complex queries.

 

At the end of the talk, I'll describe some security- and

failure-related issues with the original AltaVista Web site.

 

Refreshments to follow.

 

Email: talk-info@cs.washington.edu

Info: http://www.cs.washington.edu