WNILS Working Group					Chris Weider
INTERNET-DRAFT						Merit Network, Inc.
							Jim Fullton
							UNC Chapel Hill
							Simon Spero
11/10/92						UNC Chapel Hill


	Architecture of the Whois++ Index Service

Status of this memo:

The authors describe an architecture for indexing in distributed databases,
and apply this to the WHOIS++ protocol.


        This document is an Internet Draft.  Internet Drafts are working 
        documents of the Internet Engineering Task Force (IETF), its Areas, 
        and its Working Groups. Note that other groups may also distribute
        working documents as Internet Drafts. 

        Internet Drafts are draft documents valid for a maximum of six 
        months. Internet Drafts may be updated, replaced, or obsoleted
        by other documents at any time.  It is not appropriate to use
        Internet Drafts as reference material or to cite them other than
        as a "working draft" or "work in progress."

        Please check the I-D abstract listing contained in each Internet 
        Draft directory to learn the current status of this or any 
        other Internet Draft.

	This Internet Draft expires May 10, 1993.

1. Purpose:

The WHOIS++ directory service [GDS, 1992] is intended to provide
a simple, extensible directory service predicated on a template-based
information model and a flexible query language. This document describes
an architecture designed to link together many of these WHOIS++ servers
into a distributed, searchable wide area directory service.

2. Scope:

This document details a distributed, easily maintained architecture for
providing a unified index to a large number of distributed WHOIS++
servers. This architecture can be used with systems other than WHOIS++ to
provide a distributed directory service which is also searchable.

3. Motivation and Introduction:

It seems clear that with the vast amount of directory information potentially
available on the Internet, it is simply infeasible to build a centralized
directory to serve all this information. Therefore, we should look at building
a distributed directory service. If we are to distribute the directory service,
the easiest (although not necessarily the best) way of building the directory
service is to build a hierarchy of directory information collection agents.
In this architecture, a directory query is delivered to a certain agent
in the tree, and then handed up or down, as appropriate, so that the query
is delivered to the agent which holds the information which fills the query.
This approach has been tried before, most notably in some implementations of
the X.500 standard. However, there are two major flaws with the approach 
as it has been taken. This new Index Service is designed to fix these flaws.

3.1 The search problem

Current implementations of this hierarchical architecture require that a search
query issued at a certain location in the directory agent tree be replicated 
to _all_ subtrees, because there is no way to tell which subtrees might 
contain the desired information. It is obvious that this has rather extreme
scaling problems, and in fact the search facility has been turned off in the
X.500 architecture because of this problem. Our new WHOIS++ architecture
solves this problem by having a set of 'forward information' at each level
of the tree. That is, each level of the tree has some idea of where to look
lower in the tree to find the requested information. Consequently, the
search tree can be pruned enormously, making search feasible at all levels
of the tree. We have chosen a certain set of information to hand up the
tree as forward information; this may or may not be exactly the set of 
information required to build a truly searchable directory. However, it seems
clear that without some sort of forward information, the search problem
becomes intractable.

3.2 The location problem

Current implementations of this hierarchical architecture also encode details
about the directory agent hierarchy in the location information for a specific
entry. With search turned off, this requires a user to know exactly how
the hierarchy of servers is laid out and how they are named, which leads to
acrimonious debate about the shape of the name space and really massive
headaches whenever it becomes apparent that the current namespace is unsuited
to the current usages and must be changed. The new Index Service gets around
this by a) not enforcing a true hierarchy on the directory agents, b) 
dissociating the directory service from the information served, and c)
allowing new hierarchies to be built whenever necessary, without destroying
the hierarchies already in place. Thus a user does not need to know in 
advance where in the hierarchy the information served is contained, and the
information a user enters to guide the search does not ever have to explicitly
show up in the hierarchy. Although there are provisions in the WHOIS++ 
query syntax to watch the directory service as it hands the query around, and
consequently to divine the structure of the directory service hierarchy,
it really is not relevant to the user, and does not ever have to be taken
into consideration.

3.3 The Yellow Pages problem

Current implementations of this hierarchical architecture have also been
unsuited to solving the Yellow Pages problem; that is, the problem of 
easily and flexibly building special-purpose directories (say of 
molecular biologists) and of automatically maintaining these directories
once they have been built. In particular, with the current systems one has
to build into the name space the attributes appropriate to the new directory. 
Since our new Index Service very easily allows directory servers to pick and
choose between information proffered by a given entry server, and because we
have an architecture which allows for automatic polling of data, Yellow 
Pages capabilities fall very naturally out of the design. Although the 
ability to search all levels of the tree(s) gets us a long way towards the
Yellow Pages, it is this capacity to locate, gather, and maintain information
in a distributed and selective way that really solves the problem.


4. Components of the Index Service:

4.1 WHOIS++ servers

The whois++ service is described in [GDS, 1992]. As that service specifies
only the query language, the information model, and the server responses,
whois++ services can be provided by a wide variety of databases and directory
services. However, to participate in the Index Service, that underlying
database must also be able to generate a 'centroid' for the data it serves.

4.2 Centroids as forward knowledge

The centroid of a server consists of a list of the templates and 
attributes used by that server, and a word list for each attribute.
The word list for a given attribute contains one occurrence of every 
word which appears at least once in that attribute in some record in that 
server's data, and nothing else.

For example, if a whois++ server contains exactly three records, as follows:

Record 1			Record 2
Template: User			Template: User
First Name: John 		First Name: Joe
Last Name: Smith		Last Name: Smith
Favourite Drink: Labatt Beer    Favourite Drink: Molson Beer

Record 3
Template: Domain
Domain Name: foo.edu
Contact Name: Mike Foobar

the centroid for this server would be

Template: 	  User
First Name: 	  Joe
		  John
Last Name: 	  Smith
Favourite Drink:  Beer
		  Labatt
		  Molson

Template:	  Domain
Domain Name:      foo.edu
Contact Name:     Mike
		  Foobar
		  
It is this information which is handed up the tree to provide forward knowledge.
As we mention above, this may not turn out to be the ideal solution for
forward knowledge, and we suspect that there may be a number of different
sets of forward knowledge used in the Index Service. However, the directory
architecture is in a very real sense independent of what types of forward
knowledge are handed around, and it is entirely possible to build a 
unified directory which uses many types of forward knowledge.
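
The construction above can be sketched in a few lines of Python. This is
only an illustration, not part of the protocol: the function name
build_centroid and the dictionary layout are hypothetical, and splitting
attribute values on whitespace is an assumption, since this document does
not define what counts as a "word".

```python
from collections import defaultdict


def build_centroid(records):
    """Return {template: {attribute: sorted list of unique words}}."""
    words = defaultdict(lambda: defaultdict(set))
    for record in records:
        template = record["Template"]
        for attribute, value in record.items():
            if attribute == "Template":
                continue
            # Assumption: words are whitespace-separated tokens.
            words[template][attribute].update(value.split())
    return {
        template: {attr: sorted(ws) for attr, ws in attrs.items()}
        for template, attrs in words.items()
    }


# The three records from the example above.
records = [
    {"Template": "User", "First Name": "John", "Last Name": "Smith",
     "Favourite Drink": "Labatt Beer"},
    {"Template": "User", "First Name": "Joe", "Last Name": "Smith",
     "Favourite Drink": "Molson Beer"},
    {"Template": "Domain", "Domain Name": "foo.edu",
     "Contact Name": "Mike Foobar"},
]
centroid = build_centroid(records)
```

Each word is stored once per attribute no matter how many records contain
it, so "Smith" and "Beer" appear a single time in the result.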
 		

4.3 Index servers and index server architecture

A whois++ index server collects and collates the centroids (or other forward 
knowledge) of either a number of whois++ servers or of a number of other index
servers. An index server must be able to generate a centroid for the
information it contains.

4.3.1 Queries to index servers

An index server will take a query in standard whois++ format, search its
collections of centroids, determine which servers hold records which may fill
that query, and then forward the query to the appropriate servers.
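
A minimal sketch of the selection step, assuming centroids are held in
memory as nested dictionaries and that a query is reduced to one word per
attribute. The function name candidate_servers, the server handles "A" and
"B", and exact case-sensitive matching are all assumptions made for this
illustration.

```python
def candidate_servers(centroids, template, terms):
    """Return handles of servers whose centroid may satisfy the query.

    centroids: {server handle: {template: {attribute: set of words}}}
    terms:     {attribute: word} taken from the query
    """
    matches = []
    for handle, centroid in centroids.items():
        attrs = centroid.get(template, {})
        # Every queried word must appear in the word list of its attribute.
        if all(word in attrs.get(attribute, ())
               for attribute, word in terms.items()):
            matches.append(handle)
    return sorted(matches)


centroids = {
    "A": {"User": {"First Name": {"Joe", "John"},
                   "Last Name": {"Smith"},
                   "Favourite Drink": {"Beer", "Labatt", "Molson"}}},
    "B": {"User": {"First Name": {"Mary"},
                   "Last Name": {"Jones"}}},
}

by_last_name = candidate_servers(centroids, "User", {"Last Name": "Smith"})
mixed = candidate_servers(
    centroids, "User", {"First Name": "Joe", "Favourite Drink": "Labatt"})
nobody = candidate_servers(centroids, "User", {"Last Name": "Brown"})
```

The second query selects server A even if no single record on A has both
"Joe" and "Labatt": a centroid records which words occur, not which words
occur together. This is why a centroid match means only that a server may
hold a matching record.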

4.3.2 Index server distribution model and centroid propagation

The diagram below illustrates how a tree of index servers is created for
a set of whois++ servers.

  whois++		index			index
  servers		servers			servers
			for			for
  _______		whois++			lower-level
 |       |              servers			index servers
 |   A   |__
 |_______|  \            _______
	     \----------|       |
  _______               |   D   |__             ______
 |       |   /----------|_______|  \           |      |
 |   B   |__/                       \----------|      |
 |_______|                                     |  F   |
				    /----------|______|
				   /
  _______                _______  /
 |       |              |       |-
 |   C   |--------------|   E   |
 |_______|              |_______|


In the portion of the index tree shown above, whois++ servers A and B hand their
centroids up to index server D, whois++ server C hands its centroid up to
index server E, and index servers D and E hand their centroids up to index 
server F. 
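
One way an index server such as D might generate its own centroid is to
take the union of the word lists it has collected. The sketch below assumes
that interpretation; this document does not prescribe the method. The
function name merge_centroids and the sample words are hypothetical.

```python
def merge_centroids(centroids):
    """Union several centroids of the form {template: {attribute: set}}."""
    merged = {}
    for centroid in centroids:
        for template, attrs in centroid.items():
            for attribute, words in attrs.items():
                target = merged.setdefault(template, {})
                target.setdefault(attribute, set()).update(words)
    return merged


# Hypothetical centroids handed up by whois++ servers A, B and C.
centroid_a = {"User": {"Last Name": {"Smith"}}}
centroid_b = {"User": {"Last Name": {"Jones", "Smith"}}}
centroid_c = {"Domain": {"Domain Name": {"foo.edu"}}}

centroid_d = merge_centroids([centroid_a, centroid_b])  # D indexes A and B
centroid_e = merge_centroids([centroid_c])              # E indexes C
centroid_f = merge_centroids([centroid_d, centroid_e])  # F indexes D and E
```

Under this assumption the centroid at F covers every word known anywhere
below it, while F itself only needs to remember which of D and E supplied
each word.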

The number of levels of index servers, and the number of index servers at each
level, will depend on the number of whois++ servers deployed, and the response
time of individual layers of the server tree. These numbers will have to 
be determined in the field.

4.3.4 Centroid propagation and changes to centroids

Centroid propagation is initiated by an authenticated POLL command (sec. 5.2).
The format of the POLL command allows the poller to request the centroid of
any or all templates and attributes held by the polled server. After the
polled server has authenticated the poller, it determines which of the 
requested centroids the poller is allowed to request, and then issues a
CENTROID-CHANGES report (sec. 5.3) to transmit the data. When the poller
receives the CENTROID-CHANGES report, it can authenticate the pollee to
determine whether to add the centroid changes to its data. Additionally, if
a given pollee knows what pollers hold centroids from the pollee, it can
signal to those pollers the fact that its centroid has changed by issuing
a DATA-CHANGED command. The poller can then determine if and when to 
issue a new POLL request to get the updated information. The DATA-CHANGED
command is included in this protocol to allow 'interactive' updating of
critical information.
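
How a poller reacts to DATA-CHANGED is left to the implementation. The
sketch below shows one possible policy, assuming the poller remembers the
End-time of its last successful POLL: it polls again only if the announced
change is newer, and asks for the interval it has not yet seen. The
function name next_poll_window is hypothetical.

```python
from datetime import datetime, timezone


def next_poll_window(last_poll_end, latest_change, now):
    """Return (start, end) for a new POLL, or None if nothing is new.

    last_poll_end: End-time of the previous POLL
    latest_change: Time-of-latest-centroid-change from DATA-CHANGED
    """
    if latest_change <= last_poll_end:
        return None  # the poller already holds this change
    return last_poll_end, now


# All times are GMT, as the message templates require.
last_poll_end = datetime(1992, 11, 10, 12, 0, tzinfo=timezone.utc)
changed_at = datetime(1992, 11, 10, 15, 30, tzinfo=timezone.utc)
now = datetime(1992, 11, 10, 16, 0, tzinfo=timezone.utc)

window = next_poll_window(last_poll_end, changed_at, now)
```

A real poller might also delay the request until the Best-time-to-poll
value announced by the pollee.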

4.3.5 Query handling and passing algorithm

When an index server receives a query, it searches its collection of centroids,
and determines which servers hold records which may fill that query. As
whois++ becomes widely deployed, it is expected that some index servers
may specialize in indexing certain whois++ templates or perhaps even
certain fields within those templates. If an index server obtains a match
with the query _for those template fields and attributes the server indexes_,
it is to be considered a match for the purpose of forwarding the query.
When the index server has completed its search to match the query to a 
server, it then forwards the request as shown in 5.4.
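
The matching rule for specialized index servers can be sketched as follows.
This is one reading of the rule, not a definitive one: query terms on
fields the index server does not index are ignored, and a query with no
indexed fields at all is treated as a non-match, which is an assumption
this document does not settle. The name matches_for_forwarding is
hypothetical.

```python
def matches_for_forwarding(indexed_fields, centroid, template, terms):
    """Decide whether a query matches a centroid for forwarding purposes.

    indexed_fields: set of (template, attribute) pairs this server indexes
    centroid:       {template: {attribute: set of words}} for one server
    terms:          {attribute: word} taken from the query
    """
    relevant = {attribute: word for attribute, word in terms.items()
                if (template, attribute) in indexed_fields}
    if not relevant:
        return False  # assumption: nothing this server can compare
    attrs = centroid.get(template, {})
    return all(word in attrs.get(attribute, ())
               for attribute, word in relevant.items())


# An index server that indexes only the Last Name field of User templates.
indexed = {("User", "Last Name")}
centroid = {"User": {"Last Name": {"Smith"}}}

hit = matches_for_forwarding(
    indexed, centroid, "User",
    {"Last Name": "Smith", "Favourite Drink": "Molson"})
miss = matches_for_forwarding(
    indexed, centroid, "User", {"Last Name": "Jones"})
unindexed = matches_for_forwarding(
    indexed, centroid, "User", {"Favourite Drink": "Molson"})
```

In the first call the Favourite Drink term is skipped because this server
does not index that field, so the Last Name match alone is enough to
forward the query.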

Each server in the chain can then use the authentication information
included in the FORWARDED-QUERY command to determine whether to continue
forwarding the query.

Also, a whois++ query can specify the 'trace' option, which sends to
the user a string containing the IANA handle and an identification
string for each index server the query is handed to.

5. Syntax for operations of the Index Service:

5.1 Data changed syntax

The data changed template looks like this:

DATA-CHANGED:
   Version-number: // version number of index service software, used to ensure
		   // compatibility
   Time-of-latest-centroid-change: // time stamp of latest centroid change, GMT
   Time-of-message-generation: // time when this message was generated, GMT
   Server-handle: // IANA unique identifier for this server
   Best-time-to-poll: // For heavily used servers, this will identify when
		      // the server is likely to be lightly loaded
		      // so that response to the poll will be speedy, GMT
   Authentication-type: // Type of authentication used by server, or NONE
   Authentication-data: // data for authentication 
END DATA-CHANGED // This line must be used to terminate the data changed 
		 // message

5.2 Polling syntax

POLL:
   Version-number: // version number of poller's index software, used to
		   // ensure compatibility
   Start-time: // give me all the centroid changes starting at this time, GMT
   End-time: // ending at this time, GMT
   Template: // a standard whois++ template name, or the keyword ALL, for a
	     // full update.
   Field:    // used to limit centroid update information to specific fields,
	     // is either a specific field name, a list of field names, 
             // or the keyword ALL
   Server-handle: // IANA unique identifier for the polling server. 
		  // this handle may optionally be cached by the polled
		  // server to announce future changes
   Authentication-type: // Type of authentication used by poller, or NONE
   Authentication-data: // Data for authentication
END POLL     // This line must be used to terminate the poll message
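
As an illustration, a POLL message could be assembled like this. The
helper format_message is hypothetical, and several details are assumptions
because the template above does not fix them: the three-space indent,
cr/lf line endings, the time-stamp format, the version string "1.0", and
the server handle "example-index-01".

```python
def format_message(name, fields):
    """Render a message as 'NAME:', indented field lines, 'END NAME'."""
    lines = [f"{name}:"]
    lines += [f"   {key}: {value}" for key, value in fields]
    lines.append(f"END {name}")
    # Assumption: lines are terminated with cr/lf.
    return "\r\n".join(lines) + "\r\n"


poll = format_message("POLL", [
    ("Version-number", "1.0"),               # hypothetical version string
    ("Start-time", "1992-11-10 12:00:00"),   # hypothetical time format, GMT
    ("End-time", "1992-11-10 16:00:00"),
    ("Template", "ALL"),
    ("Field", "ALL"),
    ("Server-handle", "example-index-01"),   # hypothetical handle
    ("Authentication-type", "NONE"),
])
```

The same helper would serve for the other message types, since they share
the "NAME:" and "END NAME" framing.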

5.3 Centroid change report

CENTROID-CHANGES:
   Version-number: // version number of pollee's index software, used to
		   // ensure compatibility
   Start-time: // change list starting time, GMT
   End-time: // change list ending time, GMT
   Server-handle: // IANA unique identifier of the responding server
   Authentication-type: // Type of authentication used by pollee, or NONE
   Authentication-data: // Data for authentication
   Compression-type: // Type of compression used on the data, or NONE
   Size-of-compressed-data: // size of compressed data if compression is used
   Operation: // One of 3 keywords: ADD, DELETE, FULL
	      // ADD - add these entries to the centroid for this server
              // DELETE - delete these entries from the centroid of this
              // server
	      // FULL - the full centroid as of end-time follows
Multiple occurrences of the following block of fields:
    Template: // a standard whois++ template name
    Field: // a field name within that template
    Data: // the word list itself, one per line, cr/lf terminated
end of multiply repeated block
END CENTROID-CHANGES // This line must be used to terminate the centroid
		     // change report
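
A sketch of how a poller might apply a parsed CENTROID-CHANGES report to
the centroid it holds for the polled server. It assumes the report has
already been authenticated, decompressed, and parsed into (template, field,
words) blocks; the function name apply_changes and the in-memory layout are
hypothetical.

```python
def apply_changes(stored, operation, blocks):
    """Apply ADD, DELETE or FULL to one server's stored centroid.

    stored: {template: {field: set of words}}, modified in place
    blocks: iterable of (template, field, words) from the report
    """
    if operation not in ("ADD", "DELETE", "FULL"):
        raise ValueError(f"unknown operation: {operation}")
    if operation == "FULL":
        stored.clear()  # the report replaces everything held so far
    for template, field, words in blocks:
        if operation == "DELETE":
            # Words for a field we never stored are ignored.
            stored.get(template, {}).get(field, set()).difference_update(words)
        else:
            stored.setdefault(template, {}).setdefault(field, set()).update(words)
    return stored


stored = {"User": {"First Name": {"Joe", "John"}}}
apply_changes(stored, "ADD", [("User", "First Name", ["Jane"])])
after_add = {t: {f: set(w) for f, w in a.items()} for t, a in stored.items()}
apply_changes(stored, "DELETE", [("User", "First Name", ["John"])])
after_delete = {t: {f: set(w) for f, w in a.items()} for t, a in stored.items()}
apply_changes(stored, "FULL", [("Domain", "Domain Name", ["foo.edu"])])
```

After the FULL report the stored centroid contains only the Domain
template, because FULL carries the complete centroid as of End-time.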

5.4 Forwarded query

FORWARDED-QUERY:
   Version-number: // version number of forwarder's index software, used to 
		   // ensure compatibility
   Forwarded-From: // IANA unique identifier of the server forwarding query 
   Forwarded-time: // time this query forwarded, GMT (used for debugging)
   Trace-option: // YES if query has 'trace' option listed, NO if not.
		 // used at message reception time to generate trace information
   Query-origination-address: // address of origin of query
   Body-of-Query: // The original query goes here
   Authentication-type: // Type of authentication used by queryer
   Authentication-data: // Data for authentication
END FORWARDED-QUERY // This line must be used to terminate the body of the
		    // query

6. Authors' Addresses:

Chris Weider
clw@merit.edu
Industrial Technology Institute, Pod G
2901 Hubbard Rd, 
Ann Arbor, MI 48105
O: (313) 747-2730
F: (313) 747-3185

Jim Fullton
fullton@mdewey.ga.unc.edu
310 Wilson Library CB #3460
University of North Carolina
Chapel Hill, NC 27599-3460
O: (919) 962-9107
F: (919) 962-5604

Simon Spero
ses@sunsite.unc.edu
310 Wilson Library CB #3460
University of North Carolina
Chapel Hill, NC 27599-3460
O: (919) 962-9107
F: (919) 962-5604