A VIRTUAL MEMORY DISTRIBUTED FILE SYSTEM

Daniel L. Murphy
9-March-1989

ABSTRACT

This paper describes the practical extension of an existing operating system into a multiprocessor configuration using a high-speed serial interconnect. A major part of this effort was to distribute the management of the file system such that all file objects would be accessible from all processors. The original file system used virtual memory mapping with extensive sharing capabilities as its basic file access abstraction, and the multiprocessor extensions were required to maintain and support this abstraction.

CONTENTS

1   INTRODUCTION - INTERCONNECT TECHNOLOGIES
2   DISTRIBUTABILITY OF SYSTEM COMPONENTS
3   VIRTUAL MEMORY FILE ABSTRACTION
4   PAGE AND FILE LOCK MANAGEMENT
5   OTHER SYSTEM FACILITIES AND SERVICES
6   RELIABILITY, FAILSOFT
7   PERFORMANCE ISSUES
8   IMPLEMENTATION EXPERIENCE
9   SUMMARY
10  ACKNOWLEDGEMENTS
11  REFERENCES

1 INTRODUCTION - INTERCONNECT TECHNOLOGIES

During the early 1980's, innovations in the area of high-speed interconnects led to new topologies for operating systems. One of these, known as "Clusters", is implemented by two systems from Digital Equipment Corp. -- VAX/VMS and DECSYSTEM-20. These two systems share much of the cluster technology at the hardware level, and some common techniques are also used in several areas of the operating system. In one particular area, however, the original architectures of the systems were quite different, and this led to rather different solutions in the extension of the architectures to clusters.
The design and implementation of Clusters on VAX/VMS systems has been extensively discussed elsewhere [3, 4]. This paper will discuss those aspects of operating system design which are unique to the DECSYSTEM-20 implementation, i.e., the file system and virtual memory architecture. Both systems are based on a "closely-coupled" structure using a high-speed, serial, message-oriented interconnect known as the "CI". Relative to the Ethernet, the CI has the following characteristics:

1. Faster (70 Mbit/sec)

2. Shorter (100 meters)

3. Intended for homogeneous configurations of cooperating operating systems and servers; more tightly coupled than typical network objects.

4. Small number of nodes (originally 16).

Unlike the multi-drop design of Ethernet, the CI uses a central coupler with cables radiating out to each node. The CI presented the opportunity to configure systems which would be more closely connected than on the typical wide- or local-area network, yet not so closely connected as with a shared-memory multiprocessor. Hence, the term "closely-coupled" system is used. In addition to connecting CPUs, the CI also serves to connect mass storage disk and tape servers to the CPUs. The following shows a basic CI-linked configuration.

        DISKS ...              DISKS ...
           ||                     ||
      ------------           ------------
     |    HSC     |         |    HSC     |
     |    DISK    |         |    DISK    |
     |   SERVER   |         |   SERVER   |
      ------------           ------------
           ||                     ||
 CI ==================================================
           ||           ||            ||
      ------------  ------------  ------------
     |            ||            ||            |
     |  CPU&MEM   ||  CPU&MEM   ||  CPU&MEM   |
      ------------  ------------  ------------
           ||
        MASSBUS
           ||
      ------------
     |   LOCAL    |
     |   DISKS    |
      ------------

2 DISTRIBUTABILITY OF SYSTEM COMPONENTS

The TOPS-20 operating system includes a rich set of system services, including a number of services which explicitly or implicitly provide interprocess communication, and such services are the focus for extending the system as a whole into a clustered environment. A basic entity of TOPS-20 is the "job", where each job is owned by a particular user and has access capabilities dependent on that user. A job has multiple processes organized in a hierarchy, and each process has a unique address space and a unique state represented by a PC, processor registers, and other software state information. In a shared-memory multiprocessor system, these resources would be distributed among the processors almost automatically inasmuch as they are represented by structures in memory, and the memory is accessed in common by the processors. Without shared memory, an active process context cannot exist simultaneously on two or more processors. It could be moved by copying, but such a move would take a prohibitively long time for larger processes, even at CI speeds. Hence, we concluded early in our design effort that dynamic movement of processes among processors would not be a goal of the cluster implementation.

We also considered whether the multiple processes of one job could be distributed among the processors of the cluster.
There is a sizable job-common database which the processes use, and, although it appears possible to implement a distributed form of this database and operate it with acceptable performance characteristics, that was a larger effort than we wanted to attempt at the time. Hence, we decided to assume that any particular job would exist entirely on one processor, and that assignment would be static for the life of the job. With each job residing on a particular processor, any intra-job communication works as it does on single-processor systems. Inter-job communication services (implicit or explicit) include:

1. IPCF - Message-oriented interprocess communication facility.

2. ENQ/DEQ - Queueing and interlocking oriented to files and data bases.

3. DECNET - Networking services to processes on the local system as well as foreign systems.

4. FILE SYSTEM - The general disk-based file system, which provides dynamic sharing of file data as well as shared directories.

5. MISCELLANEOUS SYSTEM SERVICES - Job information tables, etc., accessible to users.

IO peripherals such as line printers and card readers might also be considered in this category, but they were already allocated to and controlled by dedicated jobs, so further distributed access was unnecessary.

Of the above services, the file system is clearly the most important. Information is communicated and shared through the file system continuously whenever the system is in use. Users access a large number of common directories for system programs, documentation, project data bases, etc. The mail system is merely a form of communication through the file system, with senders writing into receivers' mailbox files through a system server. Further, the file system provides simultaneous access to databases by cooperating processes, possibly in different jobs. A primary objective of cluster support was to make the file system as effective for jobs on different processors as it was for jobs on the same processor.
DECNET networking was already an inherently distributed facility. The other facilities were not distributed in the existing TOPS-20 design and have been considered separately. The capability of using a single logical file system from multiple CI-linked processors is called the Common File System (CFS) in TOPS-20, although it is also fair to think of it as a file system distributed among several closely cooperating nodes. Design and implementation of CFS is discussed in the balance of this paper.

3 VIRTUAL MEMORY FILE ABSTRACTION

The architecture for the disk file system in TOPS-20 is rather different from that of many other operating systems, including VMS. A salient difference is that the most basic form of file IO is not read/write transfer primitives but rather the mapping of files into process address space. TOPS-20 does provide system services for read/write transfer IO, record IO, etc., but all of these are layered on top of the basic mapped file abstraction. The mapped file abstraction implemented by TOPS-20 provides some characteristics that are not implicit in transfer-type file access. Foremost among these is that the mapping of a file gives the appearance of allowing the program to directly reference the actual current file data, not a copy of the data. That is, the semantics of the operation of mapping a portion of a file into an area of process address space do not imply a copying of file data into local memory; rather, the process address space becomes a window through which the process "sees" the actual information currently in the file. Given appropriate access privileges, two or more processes may map the same file data. Should one process modify the data via a write reference, the modification is immediately visible to the others. Of course, the operating system implements this abstraction by copying (reading) data from the disk into memory.
However, a single copy of a page of file data in memory is sufficient for all processes that have that page mapped, and the structure of page tables allows the operating system to move the page between memory, swapping storage, and its home disk location transparently to all processes using it. Simultaneous write access presents no difficulty in a single-processor system, since all processes are referencing the same page of physical main memory. The shared-file capabilities are used by the operating system itself, e.g., to access file directories. As is typical of many systems, a directory is an ordinary file containing name information and some kind of link to a physical file structure. A TOPS-20 file directory consists of linked nodes organized to provide efficient lookup of three independent file name fields (name, extension, generation number), including recognition and on-demand completion of flexible abbreviations. Storage for file nodes is allocated from a heap in each directory, and each node contains several dozen items of information about the file. Any directory of interest is mapped into the operating system's address space whenever some access to the directory is required. Disk allocation tables are also organized as files on the disk and mapped into system address space when needed. Other capabilities visible to user programs include "copy-on-write" access and some basic file access interlock mechanisms. "Copy-on-write" is a mapping mode which is initially equivalent to shared-read, but if a write or modify reference is made to any such page, a copy is automatically created in place. This is used extensively for initialized variables in shared programs, and it also has some application in data base management. The file interlock mechanisms allow common utility programs to obtain consistent access to files by detecting and preventing inadvertent simultaneous write.
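The copy-on-write behavior described above can be illustrated with a short sketch (plain Python, not TOPS-20 code; the class names are invented for illustration): each mapping shares one page until its first write reference, at which point a private copy is created in place, leaving all other mappings unaffected.

```python
# A minimal sketch of copy-on-write page mapping (illustrative only).

class Page:
    """A single page of file data."""
    def __init__(self, data):
        self.data = data

class CowMapping:
    """One process's copy-on-write view of a shared page."""
    def __init__(self, shared_page):
        self.page = shared_page      # initially the shared copy
        self.private = False

    def read(self):
        return self.page.data

    def write(self, data):
        if not self.private:
            # First write reference: clone the page in place so the
            # other mappings continue to see the original data.
            self.page = Page(self.page.data)
            self.private = True
        self.page.data = data

shared = Page("initialized variables")
a = CowMapping(shared)
b = CowMapping(shared)
a.write("a's modifications")
assert b.read() == "initialized variables"   # b still sees the original
assert a.read() == "a's modifications"
```

This is exactly the property that makes the mode useful for initialized variables in shared programs: every process starts from the same page, and only writers pay for a copy.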
Also, TOPS-20 implements the concept of a "file structure", which is a named unit consisting of multiple directories with their files. A running system typically has several structures available, and individual structures may be added to or removed from the running system. A structure may reside on a physically removable medium such as a disk pack, but is more often just a logical entity on a fixed-media disk device. The goal of the Common File System design was to extend the underlying page handling mechanisms so that these same abstractions would be maintained. The major problem is the fact that all processes using a particular file page are not necessarily on the same processor and therefore are not referencing the same physical page. By contrast, a read/write transfer-based file system begins with the assumption that the user program sees a copy of the data, and therefore the primary requirement is just to get the bits to the desired place. In VMS, for example, the IO primitives provide no guarantees regarding concurrent reads and writes; rather, it is a layered locking mechanism that controls concurrent access.

4 PAGE AND FILE LOCK MANAGEMENT

As noted, a page cannot actually be written simultaneously by two processors. If two processors are sharing a page for read-only access, there may be a copy in the memory of each. However, if one copy is modified, the other is immediately rendered stale. Our initial thinking about this problem was based on a "write token" which would be passed from processor to processor to indicate that writing to the page would be allowed. If a processor did not hold the write token, no process on that processor could write the page, and the paging mechanisms would have to trap and prevent any such access. Upon further thought, it became evident that the "write token" model was insufficient to control access. A finite state machine model was developed, shown below in fig. 1, which uses several state variables.
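As a rough illustration of the kind of multi-variable state machine described (the state names below are hypothetical, not those of fig. 1), each node tracks what access it currently holds on a segment and can decide locally whether a reference may proceed or whether the other nodes must first be polled:

```python
# A minimal sketch of a per-segment access state machine, richer than a
# single "write token": local state determines what can be granted
# without cluster communication.  State names are invented.

NONE, READ, WRITE = "none", "read", "write"

class SegmentState:
    """One node's local view of a file segment's access state."""
    def __init__(self):
        self.held = NONE            # access this node currently holds

    def can_grant_locally(self, want):
        # Reads are covered by READ or WRITE access; writes only by WRITE.
        if want == READ:
            return self.held in (READ, WRITE)
        return self.held == WRITE

    def upgrade(self, want):
        # The real system would poll the other nodes at this point; the
        # sketch simply records the outcome of a successful poll.
        self.held = want

s = SegmentState()
assert not s.can_grant_locally(READ)    # no access yet: a poll is needed
s.upgrade(READ)
assert s.can_grant_locally(READ)        # subsequent reads proceed locally
assert not s.can_grant_locally(WRITE)   # writing needs a further upgrade
```

The point of the extra states, as against a single write token, is that a node holding read access can satisfy further local reads without any CI traffic at all.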
We had certain desires with regard to the distributed file system data base which this design seemed to support. We felt that a symmetric distributed data base, with no master or central control point, was desirable in keeping with the general symmetry of the CI configuration. With the limited number of hosts in CI-linked configurations and the high speed and reliability of the CI, this seemed a realizable goal. Such a structure would also have advantages in reliability and failsoft characteristics. The 512-word* page, being the unit of protection and mapping granularity, might be considered the obvious unit for interlocking and access control. However, we quickly realized that this would involve excessive data and communication overhead. Rather, the next larger unit, the 512-page "file segment", appeared to provide the right balance between granularity and efficiency. In the TOPS-20 file structure, the 512-page file segment is a basic entity handled by the paging mechanisms. Files larger than 512 pages are supported by one additional level in the tree (providing 512 segments of 512 pages each), but this is realized in the file system logic above the level of the paging mechanisms. Thus, the 512-page file segment has a resident database which is accessible to low-level communication code and could serve as the base for distributed locking and access control. Some analysis of statistics and experience further indicated that little would be gained by interlocking at the page level. Only a small fraction of file access in TOPS-20 involves active simultaneous write, and of these cases, only those where different pages are being written within the segment would benefit from per-page interlocking. In cases where all processes accessing a file have read or copy-on-write access, separate copies can exist on the different processors and no access interlocking is needed.
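The granularity arithmetic behind the segment choice is easy to sketch (the function below is illustrative, not part of TOPS-20): with 512-word pages and 512-page file segments, a file page number determines a segment and an offset within it, and one additional index level covers 512 segments.

```python
# Granularity arithmetic for 512-word pages and 512-page file segments.

WORDS_PER_PAGE = 512
PAGES_PER_SEGMENT = 512

def segment_of(file_page):
    """Return (segment, offset) -- the unit of interlocking a page falls in."""
    return divmod(file_page, PAGES_PER_SEGMENT)

assert segment_of(0) == (0, 0)
assert segment_of(511) == (0, 511)     # last page of the first segment
assert segment_of(512) == (1, 0)       # first page of the second segment

# Maximum file size addressable with the one additional tree level:
max_pages = PAGES_PER_SEGMENT * PAGES_PER_SEGMENT    # 512 segments of 512 pages
max_words = max_pages * WORDS_PER_PAGE
assert max_words == 512 ** 3
```

One distributed lock per segment thus covers 512 pages, which is the source of the reduced data and communication overhead mentioned above.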
The implications of locking at the segment level are that the state and possibly the location of a number of pages may have to change as the result of a reference by this or one of the other processors. The cases are:

1. If this processor wants to read one or more pages of a file segment, other processors may continue to read it, but none may write it.

2. If this processor wants to write a page of a file segment, other processors may not continue reading any of the pages of it.

3. For other processors to resume reading pages of a file segment after one has written it, three things must happen:

   a. The reading processors must discard any pages of the file segment that may be in memory or on swapping storage, as they may be stale.

   b. The writing processor must write any modified pages back to their home location on the disk.

   c. The reading processors may then read new copies of pages on demand from their home location on the disk.

- - - - - - - - - - - - -
* The DECSYSTEM-20 architecture is based on 36-bit words. Some might consider this archaic. We consider it merely out of current fashion.

This incorporates the usual readers/writers interlock, but goes beyond that by allowing the operating system to make dynamic changes of state. The model generated by this approach is that the data base on each processor indicates whether a particular kind of access (read, write) can be made locally. If a desired access cannot be made, a processor will communicate with other processors to request the access. Because the data base is distributed and there is no master, this implies a "broadcast" type of request. The CI does not support broadcast directly, but each host maintains a list of all hosts currently in operation, so a broadcast can be easily simulated. A response of some sort must be received from all other hosts in order for a decision to be made. Some hosts may have no local knowledge of the file segment in question because no process on that host is using it.
Those hosts will respond with approval of the request and will otherwise ignore it. If a host is using a file segment and the request does not conflict with continued use, the host will also respond with approval. Otherwise, the host will respond in a way that indicates that the request will be approved but some cleanup action must be completed first, such as writing modified pages to disk or invalidating local copies. Processes on different processors may be making conflicting simultaneous requests. In this case, timers are used to control the changes of state so that each processor gets the access it needs for some fraction of the time. The time interval is set large enough to prevent thrashing and excess CI traffic. Again, the processors are assumed to be cooperating, and, in fact, are running the same code, so consistent decisions are made and no central arbitrating authority is needed. Simultaneous access, either read or write, may be prevented by the mode in which the file is opened. In this case, subsequent open requests on the same or other processors will be checked and refused if inconsistent. Here, no timeouts are needed because the restriction is set by the user program logic, not the underlying operating system mechanics. The foregoing covers most of the logic necessary for proper management of pages to maintain the virtual memory abstraction and its assumptions. To support this, the low-level paging mechanisms must provide:

1. Read-only access per page;

2. Read/write access per page;

3. Cancel write access for physical page;

4. Cancel all access for physical page;

5. Note if page modified.

The existing paging mechanisms provided all of these except #3, which was added by a microcode change to the processor during development.
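The polling behavior described above can be sketched as follows (a much-simplified, illustrative model, not TOPS-20 code: message passing is reduced to direct calls, the "conditional yes" cleanup is assumed to complete instantly, and the open-mode interlock that produces a "no" vote is omitted):

```python
# A minimal sketch of the broadcast-poll protocol: a node needing
# access asks every other node and proceeds only when no node objects.

YES, NO, COND_YES = "yes", "no", "conditional-yes"

class Node:
    def __init__(self, name):
        self.name = name
        self.segments = {}          # segment -> access held ("read"/"write")

    def poll(self, segment, want):
        """Answer another node's broadcast request for access."""
        held = self.segments.get(segment)
        if held is None:
            return YES              # unknown resource: "whatever you want"
        if held == "read" and want == "read":
            return YES              # concurrent readers need no interlock
        # Housekeeping is required first: a writer must flush modified
        # pages, a reader must invalidate copies about to become stale.
        del self.segments[segment]  # simulate the cleanup
        return COND_YES

def request_access(requester, peers, segment, want):
    """Broadcast poll: proceed only if no peer answers "no"."""
    votes = [peer.poll(segment, want) for peer in peers]
    if NO in votes:
        return False                # an open-mode interlock is held
    # "Conditional yes" votes are cleared by completion messages from
    # the responding nodes; here the cleanup is assumed already done.
    requester.segments[segment] = want
    return True

a, b, c = Node("a"), Node("b"), Node("c")
b.segments["seg-1"] = "read"
assert request_access(a, [b, c], "seg-1", "read")    # readers coexist
assert request_access(c, [a, b], "seg-1", "write")   # readers invalidate first
assert "seg-1" not in a.segments and "seg-1" not in b.segments
```

Note how a node with no knowledge of the segment votes "yes" unconditionally, which is what allows nodes to keep only the resources they actually use in their local data bases.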
5 OTHER SYSTEM FACILITIES AND SERVICES Once virtual memory file mapping works transparently across the cluster, the operating system facilities that use it, including directory logic and disk allocation table management, can also be distributed without major restructuring. As one might expect, those facilities already have locking mechanisms separate from the virtual memory logic to maintain consistent access to the respective structures. That is, each file directory has a lock (mutex) that is set while a process is performing a file lookup or create operation so that other processes will not see the directory in an inconsistent state. Similarly, the allocation tables are locked while the allocator code is selecting a page to allocate. To extend these locks across the cluster, much of the same logic is used as that which supports the page locking and control. A general interface was implemented for the management of locks in a manner required by the various higher level facilities. The locks represent resources in the system, such as a directory, a file, an allocation table, etc., but the lock manager need not be aware of the mapping of locks to resources. Again, the mechanism is based on a polling of other nodes whenever a lock needs to be upgraded, and one 'no' response means that another node has the resource locked. To summarize the procedures used to maintain the distributed file and lock data base: 1. The paging mechanisms trap any reference to a mapped file page which does not already allow that type of reference. The kernel then checks the local data base to determine if that access is consistent with the known state of the resource, and if so, the page is made accessible and the program continues. System services needing to lock a resource first attempt to obtain the lock locally. If successful, they continue. 2. If the local state is insufficient, the kernel sends a polling request to all other known nodes on the CI. 
Each is guaranteed to respond within a few milliseconds under normal circumstances. The response will be, in effect, "yes", "no", or "conditional yes". The last means that some housekeeping must be completed before a "yes" can be acted on, and the "conditional yes" allows the requesting processor to know that the request will soon be granted. If a "no" is received, then an interlock is being held by the other processor indefinitely (i.e., for a duration determined by higher-level code). When all responses have been received, and all conditional "yes" responses cleared, the requesting processor may proceed as in #1. The requesting processor will have updated its local data base so that a subsequent trap to a different page in the same file segment will not have to repeat the polling procedure.

3. Processors may receive polling requests at any time. They are received at interrupt level and handled immediately so that latency does not accumulate. If the request is for a resource which is unknown to this processor, a "yes" response is generated, meaning "whatever you want, it's ok with me". This means that processors do not have to maintain resources in their data base that they are not using.

4. If the resource is in use locally, it will be known. Its local state and the incoming request are compared, and an appropriate response is generated. If the "conditional yes" response is generated, the appropriate housekeeping action is also queued for a local process to perform. At the completion of that, the local process will send a complementary message which removes the condition.

5. Resources are not necessarily removed from the data base when local use is completed; rather, they are left in an inactive state. This allows a subsequent local request for the resource to be granted immediately rather than requiring a polling cycle. This saves much time in the rather frequent case that a particular resource is used repeatedly on one host and is not being requested by other hosts.

6.
Downgrading or releasing a lock does not require communication with other nodes. The local data base is adjusted, and subsequent requests received from other nodes will generate a response based on the new state.

6 RELIABILITY, FAILSOFT

A primary reason for choosing a symmetric distributed data base is its failsoft characteristics. There is no information unique to a node which is lost if that node fails. Of course, failure of a node may cause loss of recently modified pages, directories, etc., but this must be handled by higher levels of software in single-processor systems also. With regard to the locking mechanisms:

1. A node may fail and not respond while a polling cycle is in progress. The rules are that all nodes must respond, so the requesting node will not proceed. The lower-level CI service will detect a node going offline (through periodic polling) and notify all clients. The lock manager considers this as one of several events that require the polling to be restarted. Note that the request/response rules result in idempotent requests -- any request may be repeated without causing a problem at the responding nodes.

2. A node may fail while holding any number of resources. When another node issues a polling request for one of those resources, the failed node will not be polled, having already been noted as down by the lower-level CI service. Hence, no "no" votes will be received, and the polling node will assume the resource.

The foregoing means that no special operation is necessary to update data bases when a node goes on or off line, other than updating the list of known nodes. If the node holding a lock fails, then by definition the node can no longer be using the resource. Upon failure, a node will not respond to requests or will not be polled, and other nodes will see no impediment to acquiring the resource. In practice, there is usually no noticeable impact on the remaining nodes when one node fails.
Particular users may notice a pause in some response if the operation happens to require a resource that was held by the failing system. It may take a number of seconds for the timers to act and remove the failing node from the poll list. This design requires a strict discipline in bringing nodes on line. In particular, once a node has failed, it cannot be brought back on line except by being reinitialized. Normal startup of a node implies that it holds no locks or resources, and its local database so indicates. This is consistent with any possible state of the nodes already in operation, so the new node can come on line and begin polling for any resources it needs. A failed node has implicitly lost some or all of the resources it held, so it cannot be brought back on line without completely resetting its local data base, including aborting any processes which were using the resources. Considerable interest was expressed in the possibility of supporting some sort of rejoin capability as a matter of principle. In practice, failure of a node is usually fatal and there is no reason for it to rejoin without reinitializing. Another failure mode which must be considered is loss of CI communication not involving the failure of any node. Because of the redundant and highly reliable design of the CI, this is unlikely. It must be considered, however, if for no other reason than that service personnel might sometime disconnect the wrong cable. To handle this possibility, we implemented an algorithm whereby each node can determine if it has been separated from the rest of the cluster. Specifically, a node which can "see" the central star connector (through a hardware loopback message) will consider itself still part of the cluster, and a node which cannot see the star will consider itself segregated. A segregated node will immediately cease using any physical resources (e.g. 
disks) that have been used by other nodes, and will refuse to rejoin the cluster even if the connection should be restored. Its action upon seeing the connection restored is to crash and reinitialize. Several potential race conditions can arise where two or more nodes are coming on line or requesting the same resource at the same time. The symmetric nature of all the algorithms would tend to result in multiple connections being opened, a deadlock over resources, etc. These conditions can all be detected and are resolved based on processor serial number. The processor serial number is unique to each processor and, by favoring the higher-numbered processor, the symmetry is broken just enough to avoid the deadlock. The frequency of such occurrences is too low to raise any fairness issues with this solution. It is a useful design principle that infrequently exercised software logic will tend to develop bugs, and where such logic is involved in failure handling, cascading failures are likely to result. We believe the above approach minimizes the amount of logic that is invoked only on rare failures. Although the CI segregation logic is in fact rarely exercised, it is very simple and makes conservative decisions. That is, it will crash the local node if there is any doubt rather than risk corruption of system file structures through lock mismanagement. Most importantly, it does not attempt complex recovery strategies; its purpose is to detect certain problems and prevent blundering ahead.

7 PERFORMANCE ISSUES

The performance of this system depends on the speed of the lock communication and on the frequency of lock collisions among different nodes. Before implementation began, we considered the sources of possible lock collisions on file data. Statistics gathered over many years indicated that simultaneous shared write was used a very small fraction of the time by user programs.
Certain applications were known to use it, such as DBMS, but these applications generally had other characteristics and interlocking mechanisms which would also have to be modified for effective cluster operation. On the other hand, simultaneous write was used by the system itself for file directories and disk allocation tables, and we needed to understand more about that use. To develop this information, we instrumented the system to gather statistics and traces of directory locking. For worst-case analysis, we needed to know how often a directory would be used by one job shortly before or after being used by another. While the jobs may often be on the same node in practice, a worst-case assumption would be that successive accesses would be from different nodes. From this experiment, we found that most directories had an insignificant frequency of conflict. Certain directories showed a high collision rate -- those that held the commonly used system utility programs and files. Fortunately, most accesses to these directories are read-only, and, as we have seen, read accesses do not incur latency or bus traffic. Hence, we were confident that system use of directories would not incur unacceptable performance penalties. We also gave special consideration to the handling of disk allocation tables. We determined that there would be a much higher frequency of conflict because there is only one such resource for each disk structure, and because most accesses do modify the data. We could see no way to reduce the conflict rate in this case without some additional logic. We chose to add some caching of disk blocks so as to reduce the number of accesses to the actual allocation tables. Rather than allocating one page at a time for a file, a set of pages is pre-allocated. Subsequent growth of the file is taken from the pre-allocated pages.
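The pre-allocation strategy can be sketched as follows (an illustrative model with invented interfaces, not the TOPS-20 allocator): by taking a batch of pages under one acquisition of the allocation-table lock, the per-page cost of cluster lock traffic is amortized across the batch.

```python
# A minimal sketch of batching disk page allocation to reduce
# contention on the shared allocation table.

class AllocationTable:
    """The shared, cluster-locked resource: one per disk structure."""
    def __init__(self, total_pages):
        self.free = list(range(total_pages))
        self.lock_acquisitions = 0      # stands in for cluster lock traffic

    def allocate(self, count):
        self.lock_acquisitions += 1     # one locked access per batch
        batch, self.free = self.free[:count], self.free[count:]
        return batch

class FileAllocator:
    """Per-node front end that pre-allocates a batch of pages."""
    BATCH = 8                           # batch size is an arbitrary example

    def __init__(self, table):
        self.table = table
        self.cache = []                 # pre-allocated pages not yet used

    def next_page(self):
        if not self.cache:
            self.cache = self.table.allocate(self.BATCH)
        return self.cache.pop(0)        # file growth comes from the cache

table = AllocationTable(total_pages=64)
alloc = FileAllocator(table)
pages = [alloc.next_page() for _ in range(10)]
assert pages == list(range(10))
assert table.lock_acquisitions == 2    # 10 pages cost only 2 locked accesses
```

A real allocator must also return unused pre-allocated pages when a file is closed; the sketch omits that bookkeeping.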
8 IMPLEMENTATION EXPERIENCE Of the set of interjob communication facilities listed earlier which were not already distributed, only the file system and part of ENQ/DEQ have been extended to support the distributed cluster environment. Hence, some utilities and services at higher layers which use the unextended facilities also do not support the environment. In many cases, the layered services are being modified to use DECNET in place of some other service, reflecting what we believe is a more extensible facility. Despite the lack of these other facilities, the Common File System facilities alone have proved enormously useful within our development environment. The processors of a cluster are separate nodes as far as the external network (DECNET, Ethernet) is concerned since each processor has its own connection to the network. However, the users of the cluster prefer to view it as a single computing resource and usually don't care to make any distinction among the processors. A typical step we took to make a cluster look more like a single system was to set up mail files on a common file structure and modify the mail programs to use the same file for any particular user regardless of the node on which the mail arrives. Hence, users can read and send mail while on any node. Over a relatively short time, most user files migrated to common structures and, as a result, users can log into any node and find the same environment. As is typical of multiprocessor development, debugging the common file system code presented some challenges. Fortunately, TOPS-20 has long had fairly good symbolic debugging facilities, even at the lowest levels of the kernel, including breakpoints, tracing, patching, etc. Of course, hitting breakpoints runs afoul of the various timeouts in the lock managers described above and tends to cause the breaking node to be summarily exorcised from the cluster. To handle this, a piece of protocol is used to declare debugging mode to the other nodes. 
This effectively extends the timeouts to infinity, so the other nodes will wait as long as necessary for a poll response.

9 SUMMARY

The measure of an operating system is in the abstractions it provides. Do they enhance programming effectiveness? Do they foster effective use of the hardware resources? One test of an abstraction can only be made over time: does it remain viable in new environments not contemplated in the original design? Does it provide transparency and, thereby, portability to layered software when the underlying configuration changes?

Virtual memory mapping as the basic file access abstraction is not present in many operating systems, and some people believe it to be an obsolete technique not compatible with distributed system environments. Our experience with TOPS-20 is that the file system has been extended into a distributed configuration with complete transparency and with entirely acceptable performance. There is literally no assumption built into user programs regarding the file system which is not supported in CFS; therefore, user programs have not had to change to take significant advantage of the new configurations. We believe that the cluster implementation is not limited, relative to a single system, in any way except with respect to simultaneous shared write. And even here, the basic assumptions still work, although at a reduced bandwidth. We understand, however, that the cluster is based on certain assumptions not necessarily true in other distributed environments. A very high-speed interconnect and a homogeneous operating system environment are key elements of this design. The virtual memory abstraction is well suited to a high-bandwidth, high-performance, multi-process environment, and in such environments, it provides a great deal of power to the clients. It also "keeps secrets" about the gritty details of particular interconnects, disk controller characteristics, queueing requirements, memory management, etc.
An interconnect of typical wide-area network speed, or a heterogeneous environment involving different file formats, data formats, and access semantics, would not be a good basis for a virtual memory abstraction. The power implicitly provided to the client would not in fact be available, and performance would fall short of expectations. On the other hand, it seems likely that the 10 Mbit Ethernet would be a wholly satisfactory interconnect for this purpose, but only between compatible and cooperating operating system types.

10 ACKNOWLEDGEMENTS

TOPS-20 was originally based on the architecture of TENEX[1], and started with a version of TENEX circa 1973. The base virtual memory architecture discussed herein is much as it was in TENEX and is further described in [2]. If the locking rules for CFS seem relatively simple and straightforward in retrospect, the total effort to implement clusters certainly involved many other areas. Design and implementation of protocols for basic CI service, for mass storage data transfer, port hardware, etc., occupied dozens of engineers within Digital. Within the DECSYSTEM-20 group, Peter Hurley proposed many of the original Loosely Coupled Systems concepts. Arnold Miller did the original implementation of the CFS lock manager. Many other members of the TOPS-20 group were involved in completing, testing, and polishing the implementation, including Clair Grant, Judy Hall, Ron McLean, Dave Lomatire, Tom Moser, and others.

11 REFERENCES

1. "TENEX, A Paged Time Sharing System for the PDP-10" - D. G. Bobrow, J. D. Burchfiel, D. L. Murphy, R. S. Tomlinson. CACM, Vol. 15, No. 3, March 1972.

2. "Storage Organization and Management in TENEX" - Daniel L. Murphy. AFIPS Conference Proceedings, Vol. 41, 1972.

3. "VAXclusters: A Closely-Coupled Distributed System" - Nancy P. Kronenberg, et al. ACM Transactions on Computer Systems, Vol. 4, No. 2, May 1986.

4. Digital Technical Journal, Number 5, September 1987. Digital Equipment Corp.