PAPER		Linux-VServer Technology

      (C)2004 Herbert Pötzl
      
      All Rights Reserved.
      
      Permission is granted to copy, distribute and/or 
      modify this document under the terms of the
      GNU Free Documentation License, Version 1.2 or any 
      later version published by the Free Software Foundation. 


CHAPTER		Abstract

      A soft partitioning concept based on 'Security Contexts'
      which allows to create many independent Virtual Private
      Servers (VPS), similar to normal Linux Servers, which can
      be run simultaneously on one box at full speed, sharing
      the hardware resources.

      All services, such as ssh, mail, Web and databases, can
      be started on such a VPS, without (or in special cases
      with only minimal) modification, just like on any real server.

      Each virtual server has its own user account database
      and root password and doesn't interfere with other
      virtual servers, except for the fact that they share
      the same hardware resources.


CHAPTER		Introduction

      Over the years, computers have become sufficiently powerful
      to use virtualization for giving the illusion of many smaller
      virtual machines, each running a separate operating system
      instance.

      There are several kinds of Virtual Machines (VMs) which are 
      providing similar features and which only differ in the 
      degree of abstraction and the methods used for virtualization.

      Most of them accomplish what they do by "emulating" some real
      or fictional hardware, which in turn requires "real" resources
      from the Host (the machine running the VMs). This approach, used
      by most System Emulators (like QEMU[U1], Bochs[U2], ...), 
      allows to run an arbitrary Guest Operating System, even for 
      a different Architecture (CPU and Hardware) without any 
      modifications, because the Guest OS isn't aware of the fact 
      that it isn't running on real hardware.

      Some of them require small modifications or specialized
      drivers to be added to Host or Guest to improve performance 
      and minimize the overhead required for the hardware emulation.
      Although this improves a lot on efficiency, there are 
      still large amounts of resources being wasted in caches and 
      mediation between Guest and Host (examples for this approach 
      are UML[U3] and Xen[U4]).

      But suppose you do not want to run many different Operating
      Systems simultaneously on a single box? Most applications
      running on a server do not require hardware access or kernel
      level code, and could easily share a machine with others,
      if they could be separated and secured ...


CHAPTER		The Concept

      Basically a Linux Server consists of three building
      blocks: Hardware, Kernel and Applications. The Hardware
      usually depends on the provider or system maintainer and,
      while it has big influence on the overall performance,
      cannot be changed that easily, but for sure will differ
      from one setup to another.

      The main purpose of the Kernel is to build an abstraction
      layer on top of the hardware to allow processes (Applications) 
      to work with and operate on resources (Data) without
      knowing the details of the underlying hardware.
      Ideally those processes would be completely hardware agnostic, 
      by being written in an interpreted language and therefore not 
      requiring any hardware specific knowledge.

      Given that a system has enough resources to drive
      ten times the number of applications a single Linux
      server would usually require, why not put ten servers on that
      box, who will then share the resources efficiently? 

      Most applications will assume that they are the only one
      providing this service, and usually they will also assume
      a certain filesystem layout and environment. This requires
      that similar or identical services, for example only differing
      in their addresses, have to be coordinated. 
      This usually requires a great deal of administrative work 
      which usually reduces overall stability and security.

      So the basic idea is to separate the user-space environment
      into distinct units (sometimes called Virtual Private Server)
      in such way that each VPS looks and feels like a real 
      server to the processes contained within.

      Although different Linux Distributions use (sometimes
      heavily) patched kernels to provide special support for
      unusual hardware or extra functionality, most Linux
      Distributions are not tied to a special kernel, but they
      bring their own set of tools, and applications.

      Linux-VServer uses this fact to allow several distributions
      to be run simultaneously on a single, shared kernel,
      without direct access to the hardware, and share the
      resources in a very efficient way.


CHAPTER		Existing Infrastructure

      Recent Linux Kernels already provide many security features
      which are utilized by Linux-VServer to do its work. 
      Especially the Linux Capability System, the Resource Limits,
      File Attributes and the Change Root Environment are used
      by Linux-VServer. The following sections will give a short
      overview about each of these.

SECTION		Linux Capability System

      In computer science, a capability is a token used by a
      process to prove that it is allowed to do an operation
      on an object. But there is something different, called
      "POSIX Capabilities" which was designed to split up the
      all powerful root privilege into a set of distinct
      privileges.
    
SUBSECT		POSIX Capabilities

      A process has three sets of bitmaps called the inheritable(I),
      permitted(P), and effective(E) capabilities.  Each capability
      is implemented as a bit in each of these bitmaps which is
      either set or unset.
 
      When a process tries to do a privileged operation, the
      operating system will check the appropriate bit in the effective
      set of the process (instead of checking whether the effective
      uid of the process is 0 as is normally done).
 
      For example, when a process tries to set the clock, the Linux
      kernel will check that the process has the CAP_SYS_TIME bit
      (which is currently bit 25) set in its effective set.
 
      The permitted set of the process indicates the capabilities the
      process can use.  The process can have capabilities set in the
      permitted set that are not in the effective set.
 
      This indicates that the process has temporarily disabled this
      capability.  A process is allowed to set a bit in its effective
      set only if it is available in the permitted set.
      The distinction between effective and permitted exists so that
      processes can "bracket" operations that need privilege.

      The inheritable capabilities are the capabilities of the current
      process that should be inherited by a program executed by the
      current process.  The permitted set of a process is masked
      against the inheritable set during exec().
      Nothing special happens during fork() or clone().
      Child processes and threads are given an exact copy of
      the capabilities of the parent process.

      Although this concept is intriguing, the implementation in Linux
      stopped at this point, whereas POSIX Capabilities[U5] 
      would require to add capability sets to files too, 
      to replace the SUID flag (at least for executables)

SUBSECT		Capability Overview
	
      The list of POSIX Capabilities used with Linux is long, and
      the 32 available bits are almost used up. While the detailed
      list of all capabilities can be found in
      /usr/include/linux/capability.h on most Linux systems, an
      overview of "considered important" capabilities is given here.

LLIST		Linux Capabilities
	ITEM	[0] CAP_CHOWN
		change file ownership and group.
		
	ITEM	[5] CAP_KILL
		send a signal to a process with a different
		real or effective user ID
		
	ITEM	[6] CAP_SETGID
		permit setgid(2), setgroups(2), and forged gids
		on socket credentials passing
		
	ITEM	[7] CAP_SETUID
		permit set*uid(2), and forged uids
		on socket credentials passing
		
	ITEM	[8] CAP_SETPCAP
		transfer/remove any capability in permitted
		set to/from any pid
		
	ITEM	[9] CAP_LINUX_IMMUTABLE
		allow modification of S_IMMUTABLE and S_APPEND
		file attributes
		
	ITEM	[11] CAP_NET_BROADCAST
		permit broadcasting and listening to multicast
		
	ITEM	[12] CAP_NET_ADMIN
		permit interface configuration, IP firewall,
		masquerading, accounting, socket debugging,
		routing tables, bind to any address, enter
		promiscuous mode, multicasting, ...
		
	ITEM	[13] CAP_NET_RAW
		permit usage of RAW and PACKET sockets

	ITEM	[16] CAP_SYS_MODULE
		insert and remove kernel modules
		
	ITEM	[18] CAP_SYS_CHROOT
		permit chroot(2)
		
	ITEM	[19] CAP_SYS_PTRACE
		permit ptrace() of any process
		
	ITEM	[21] CAP_SYS_ADMIN
		this list would be too long, it basically
		allows to do everything else, not mentioned
		in another capability.

	ITEM	[22] CAP_SYS_BOOT
		permit reboot(2)
		
	ITEM	[23] CAP_SYS_NICE
		allow raising priority and setting priority
		on other processes, modify scheduling
		
	ITEM	[24] CAP_SYS_RESOURCE
		override resource limits, quota, reserved
		space on fs, ...
		
	ITEM	[27] CAP_MKNOD
		permit the privileged aspects of mknod(2)

EOLIST

      Also see Examples [E01],[E02], and [E03].

SECTION		Resource Limits
 
      Resources for each process can be limited by specifying
      a Resource Limit. Similar to the Linux Capabilities,
      there are two different limits, a Soft Limit and a
      Hard Limit.

      The soft limit is the value that the kernel enforces
      for the corresponding resource. The hard limit acts as
      a ceiling for the soft limit: an unprivileged process
      may only set its soft limit to a value in the range
      from zero up to the hard limit, and (irreversibly) lower
      its hard limit. A privileged process may make arbitrary
      changes to either limit value, as long as the soft limit
      stays below the hard limit.

SUBSECT		Limit-able Resource Overview
	
      The list of all defined resource limits can be found in
      /usr/include/asm/resource.h on most Linux systems, an
      overview of "relevant" resource limits is given here.

LLIST		Resource Limits
	ITEM	[0] RLIMIT_CPU
		CPU time in seconds.
		process is sent a SIGXCPU signal after
		reaching the soft limit, and SIGKILL on
		hard limit.
		
	ITEM	[4] RLIMIT_CORE
		maximum size of core files generated
		
	ITEM	[5] RLIMIT_RSS
		number of pages the process's resident set
		can consume (the number of virtual pages
		resident in RAM)
		
	ITEM	[6] RLIMIT_NPROC
		The maximum number of processes that can be
		created for the real user ID of the calling process.

	ITEM	[7] RLIMIT_NOFILE
		Specifies a value one greater than the maximum
		file descriptor number that can be opened by
		this process.
		
	ITEM	[8] RLIMIT_MEMLOCK
		The maximum number of virtual memory pages
		that may be locked into RAM using mlock()
		and mlockall().
		
	ITEM	[9] RLIMIT_AS
		The maximum number of virtual memory pages
		available to the process (address space limit).
EOLIST

      Also see Examples [E11], and [E12].


SECTION		File Attributes
	
      While this feature started as applicable for ext2 only, all major
      filesystem now implement a basic set of File Attributes
      which allow to change certain properties. Here again
      a short overview of the possible attributes, and what
      they mean.

LLIST		File Attributes
	ITEM	's' SECRM
		When a file with this attribute set is deleted,
		its blocks are zeroed and written back to the disk.
		
	ITEM	'u' UNRM
		When a file with this attribute set is deleted,
		its contents are saved.      

	ITEM	'c' COMPR
		files marked with this attribute are automatically
		compressed on write and uncompressed on read.
		(not implemented yet)      

	ITEM	'S' SYNC
		updates to the file contents are done synchronously

	ITEM	'i' IMMUTABLE
		A file with this attribute cannot be modified:
		it cannot be deleted or renamed, no link can be
		created to this file and no data can be written
		to the file.      

	ITEM	'a' APPEND
		files with this attribute set can only be opened
		in append mode for writing.      

	ITEM	'd' NODUMP
		if this flag is set, the file is not candidate
		for backup with the dump utility.
		      

	ITEM	'A' NOATIME
		prevents updating the atime record on files when
		they are accessed or modified.      

	ITEM	't' NOTAIL
		A file with the 't' attribute will not have a
		partial block fragment at the end of the file
		merged with other files.      

	ITEM	'D' DIRSYNC
		changes to a directory having this attribute set
		will be done synchronously.      

EOLIST

      Also see Examples [E13] and [E14].

SECTION		The chroot(1) Command

      chroot allows you to run a command with a different
      root directory. This means that all filesystem
      lookups are done with '/' referring to the new
      root directory and not to the original one.

      While the Linux chroot implementation isn't very
      secure, it increases the isolation of processes
      regarding the filesystem, and if used properly
      allows to create a filesystem jail for a single
      process or a restricted user, daemon or service.
      
      See Example [E15]


CHAPTER		Required Modifications
      
      This chapter will describe the essential 
      modifications to implement something like
      Linux-VServer.

SECTION		Context Separation
      
      The separation mentioned in the Concepts section
      requires some modifications to the kernel to
      allow for the notion of Contexts.

      The purpose of this "Context" is to hide all
      processes outside of its scope, and prohibit
      any unwanted interaction between a process
      inside the context and a process belonging
      to another context.
 
      This separation requires the extension of some 
      existing data structures in order for them to become 
      aware of the newly introduced context, and to 
      allow to have the same uid in different contexts, 
      and still be able to differentiate between them.
 
      It also requires to define a 'default' context
      which is used when the host system is booted,
      and to work around the issues resulting from
      some false assumptions made by some user-space
      tools (like pstree) that the 'init' process has
      to exist and be running under id '1'
 
      To simplify administration while simultaneously
      allowing for a process overview, the Host Context
      is basically treated like the other contexts,
      at least regarding the isolation, and a special
      'Spectator' context is used for looking at all
      processes at once.
	
      See Examples [E21],[E22] and [E23]

SECTION		Network Separation
      
      While the Context Separation is sufficient to
      isolate groups of processes, a different kind 
      of separation or better limitation is required 
      to confine processes to a subset of available
      network addresses.
      
      Several issues have to be considered when doing
      so, for example the fact that bindings to
      special addresses like IPADDR_ANY or the local
      host address have to be handled in a very 
      special way.
      
      Note, that Linux-VServer doesn't make use of 
      virtual network devices (yet) to avoid the
      resulting overhead, so socket binding and
      packet transmission has to be adapted.
      
      See Example [E24]

SECTION		The Chroot Barrier
      
      One major problem of the chroot() system used
      in Linux, which modifies the 'current' root
      directory for the process, lies within the fact 
      that this information is volatile, and will be 
      changed on the 'next' chroot() Syscall.
      
      One simple method to escape from a chroot-ed
      environment, is to create or open a file and 
      keep the file-descriptor, then chroot into a
      subdirectory at equal or lower level with regards to the file. 
      This causes the 'root' to be moved 'down' in the 
      filesystem, then use fchdir() on the file 
      descriptor to escape from that 'new' root, and
      consequently from the 'old' one as well, as this
      was lost in the last chroot() Syscall.
      
      While early Linux-VServer versions tried to
      fix this by funny methods, the recent version
      use a special marking, known as the Chroot
      Barrier, on each 'root' directory, which 
      prevents unauthorized modification and escape
      from the confinement.

SECTION		Upper Bound for Caps
      
      Because the current Linux Capability system
      does not provide the filesystem part, which
      is required to make set uid and set gid
      executables secure, and because it is much
      safer to have a secure upper bound for
      all processes within a context, an additional
      per context capability mask has been added
      to limit all processes belonging to that
      context to this mask.
      
      The meaning of the capability bound mask
      is exactly the same as the permitted capability
      set, but for all processes, and all other
      capability masks.

SECTION		Resource Isolation
      
      Most resources are somewhat shared among the
      different Contexts, but some of them require
      additional Isolation, either to avoid security
      issues or to allow for improved accounting.
      
      Those resources are:
      
      * shared memory, IPC
      * user and process IDs
      * file xid tagging
      * Unix ptys
      * sockets

SECTION		Filesystem XID Tagging
      
      Although it can be disabled completely, this
      modification is required for filesystem level
      security and context isolation. It also is 
      mandatory for Context Disk Limits and Per
      Context Quota Support on a shared partition.
      
      While the idea, to add the context id (xid)
      to each file, to make the context ownership
      persistent, sounds simple, the actual 
      implementation is non trivial - mainly because
      adding this information either requires to 
      change the on disk representation of the 
      filesystem or to use some tricks.
      
      One non intrusive approach to avoid modification 
      of the underlying filesystem is to use the
      upper, mostly unused bits of existing fields, 
      like those for UID and GID to store the 
      additional XID.

      Having context information available with 
      each inode, it seems logical to extend the 
      access controls to check against context too. 
      
      Currently all inode access restrictions are 
      extended to check for the context, with 
      special exceptions for the Host Context and
      the Spectator Context.
      
      Untagged files belong to the Host Context
      and are silently treated as if they belong 
      to the current context, which is required
      for Unification. If such a file is modified
      from inside a Context, it silently migrates 
      to the new one, changing it's xid. 

      The following Tagging Methods are implemented:

LLIST
	ITEM	UID32/GID32 or EXTERNAL
		This format uses, up to now
		unused space within the disk inode to store the
		context information, this is currently only
		defined for ext2/ext3 but will be also defined
		for xfs, reiserfs, and jfs as soon as possible.
		Advantage: you'll have full 32bit uid/gid values.

	ITEM	UID32/GID16 
		This format uses the upper half of the group id
		to store the context information. This is done
		transparently, so you'll never notice, except if
		you change the format without prior file conversion.
		Advantage: works on all 32bit U/GID FSs.
		Drawback: GID is reduced to 16 bit.

	ITEM	UID24/GID24
		This format uses the upper quarter of user and
		group id to store the context information, again
		transparently. you'll end up with 16 million
		user and group ids, which should suffice for the
		majority of all applications.
		Advantage: works on all 32bit U/GID FSs.
		Drawback: UID and GID is reduced to 24 bit.
EOLIST

      See Examples [E31] and [E32]


CHAPTER		Additional Modifications

      In addition to the bare minimum, there are 
      a number of modifications, not really required
      but extremely useful, and very nice to have, 
      so they were added over time.

SECTION		Context Flags

      It was very soon discovered that some features
      require a flag, a kind of switch to turn them
      on and off separately for each Linux-VServer, so a
      simple flag-word was added. 
      
      Nowadays this flag word supports quite a number 
      of flags, a flag word mask, which allows to
      tell what flags are available, and a special 
      trigger mechanism, providing one-time flags, 
      set on startup, that can only be cleared once,
      usually causing a special action or event.

      Here a list of planned and mostly implemented
      Context Flags, available in the development
      branch, together with a short description.

LLIST
	ITEM	[0] VXF_INFO_LOCK (legacy, obsoleted)
		
	ITEM	[1] VXF_INFO_SCHED (legacy, obsoleted)
		schedule all processes in a context 
		as if they where one.
		
	ITEM	[2] VXF_INFO_NPROC (legacy, obsoleted)
		limit the number of processes in a 
		context to the initial NPROC value.
		
	ITEM	[3] VXF_INFO_PRIVATE (legacy)
		do not allow to join this context
		from outside.

	ITEM	[4] VXF_INFO_INIT (legacy)
		show the 'init' process with pid '1'
	
	ITEM	[5] VXF_INFO_HIDE (legacy, obsoleted)
	
	ITEM	[6] VXF_INFO_ULIMIT (legacy, obsoleted)
	
	ITEM	[7] VXF_INFO_NSPACE (legacy, obsoleted)

	ITEM	[8] VXF_SCHED_HARD
		activate the Hard CPU scheduling
		
	ITEM	[9] VXF_SCHED_PRIO
		use the context token bucket for 
		calculating the process priorities
		
	ITEM	[10] VXF_SCHED_PAUSE
		put all processes in this context
		on the hold queue, not scheduling
		them any longer

	ITEM	[16] VXF_VIRT_MEM
		virtualize the memory information
		so that the VM and RSS limits are
		used for meminfo and friends
	
	ITEM	[17] VXF_VIRT_UPTIME
		virtualize the uptime, beginning
		with the time of context creation
		
	ITEM	[18] VXF_VIRT_CPU

	ITEM	[24] VXF_HIDE_MOUNT
		show empty proc/{pid}/mounts
		
	ITEM	[25] VXF_HIDE_NETIF
		hide network interfaces and addresses
		not permitted by the network context
EOLIST
      
      See Example [E4]

SECTION		Context Capabilities

      As the Linux Capabilities have almost reached 
      the maximum number that is possible
      without heavy modifications to the kernel, 
      it came naturally that
      adding a simplified version of context
      capabilities to each context would ease a lot
      of things.
      
      This capability set doesn't need to be visible 
      to the processes within a context, as they 
      would not know how to modify or verify it.
      Instead they act like some kind of fine tuning
      to existing capabilities.
      
      In general there are two ways to use those
      capabilities: 
      
NLIST
       ITEM
	    require one or a number of context capabilities
	    to be set in addition to a given Linux
	    capability, each one controlling a distinct
	    part of the functionality

	    for example the CAP_NET_ADMIN could be
	    split into RAW and PACKET sockets, so
	    you could take away each of them separately
	    by not providing the required context
	    capability

       ITEM
	    consider the context capability sufficient
	    for a specified functionality, even if the
	    Linux Capability says something different

	    for example mount() requires CAP_SYS_ADMIN
	    which adds a dozen other things, we do not
	    want, so we define a CCAP_MOUNT to allow
	    mounts for certain contexts.
EOLIST

      The difference between the Context Flags and the
      Context Caps is more an abstract logical 
      separation than a functional one, because they
      are handled very similar.

      Again a list of the Context Capabilities and
      their purpose.

ULIST
	ITEM	[0] VXC_SET_UTSNAME
		allow the Context to change the host
		and domain name with the appropriate
		kernel Syscall
		
	ITEM	[1] VXC_SET_RLIMIT
		allow the Context to modify the resource
		limits (within the vserver limits).
		
	ITEM	[16] VXC_SECURE_MOUNT
		permit 'secure' mounts, which at the
		moment means that the 'nodev' mount
		option is added. 
EOLIST


SECTION		Context Accounting

      Some properties of a Context are useful to
      the admin, either for keeping an overview
      of the resources, to get a feeling for the
      capacity of the host, or for billing them in
      some way to the customer.
      
      There are two different kinds of accountable 
      properties, those having a current value 
      which represents the 'state' of the system
      (for example the speed of a vehicle), and 
      those which monotonic increase over time
      (like the mileage).
      
      Most of the 'state' type of properties, also
      qualify for applying some limits, so they
      are handled special, which is described in
      the next section. 

      Good candidates for Context accounting are:

      * Amount of CPU Time spent
      * Number of Forks done
      * Socket Messages by Type
      * Network Packets Transmitted and Received

      See Example [E41]

SECTION		Context Limits

      Most properties related to system resources,
      might it be the memory consumption, the number 
      of processes or file-handles, or the current
      network bandwidth, qualify for imposing limits
      on them.
      
      To provide a general framework for all kinds
      of limits, the Context Limits allow to set
      three different values for each limit-able 
      resource: the minimum, a soft limit and a
      hard limit (maximum).
      
      At the time this is written, only the hard
      limits are supported and not all of them
      are actually enforced, but here is a list
      of current and planned Context Limits:
      
      * process limits
      * scheduler limits
      * memory limits
      * per context disk limits
      * per context user/group quota

     See Example [E42]

SECTION		Virtualization
      
      One major difference between the Linux-VServer
      approach and Virtual Machines is that you
      do not have the 'virtualization' part as a
      side-effect, so you have to do that 'by hand'
      where it makes sense.
      
      For example, a Virtual Machine does not need 
      to think about uptime, because naturally the 
      running OS was started somewhere in the past
      and will not have any problem to tell the time
      it 'thinks' it is running.
      
      A Context can also store the time when it was
      created, but that will be different from the
      systems uptime, so in additions, there has to
      be some function, which 'adjusts' the values
      passed from kernel to user-space depending on
      the context the process belongs to.
      
      This is what for Linux-VServer is known as
      Virtualization (actually it's more faking some
      values passed to and from the kernel, to make
      the processes 'think' that they are on a 
      different machine).
      
      Currently modified for the purpose of 
      Virtualization are:

      * System Uptime
      * Host and Domain Name
      * Machine Type and Kernel Version
      * Context Memory Availability
      * Context Disk Space

      See Example [E43]

SECTION		Improved Security

      Proc-FS Security provides a mechanism to protect 
      dynamic entries in the proc filesystem from being 
      seen in every context.

      The system consists of three flags for each
      Proc-FS entry: Admin, Watch and Hide. 
      
      The Hide flag enables or disables the entire 
      feature, so any combination with the Hide flag
      cleared will mean total visibility.
      
      The Admin and Watch flags determine where the
      'hidden' entry remains visible, so for example 
      if Admin and Hidden are set, the Host Context will
      be the only one able to see this specific entry.

      See Example [E44] and [E45]

SECTION		Kernel Helper
      
      For some purposes, it makes sense to have
      an user-space tool to act on behalf of the
      kernel, when a process inside a context 
      requests something usually available on a
      real server, but naturally not available 
      inside a context.
      
      The best, and currently only example for this
      is the Reboot Helper, which allows to handle
      the reboot() system call, invoked from inside
      a Context, from Host side user-space, and to
      take appropriate actions - either reboot or
      just shutdown (halt) the specified Context.
      
      While the helper is designed to be flexible
      and handle different things in a similar way
      there are no other users of this helper at
      the moment, and it might be replaced by an
      event interface in near future.
	
      See Example [XX]


CHAPTER		Features and Bonus Material

SECTION		Unification
      
      Because one of the central objectives for
      Linux-VServer is to reduce the overall resource
      usage wherever possible, a truly great idea
      was born how to 'share' files between different
      Contexts without interfering with the usual
      administrative tasks or reducing the level
      of security created by the isolation.
      
      Files common to more than one Context, which
      are not very likely going to change, like
      libraries or binaries, can be hard linked on
      a shared filesystem, thus reducing the amount
      of disk space, inode caches, and even memory
      mappings for shared libraries.
      
      The only drawback is, that without additional
      measures, a malicious Contexts would be able
      to deliberately or accidentally destroy or 
      modify such shared files, which in turn would
      harm the other Contexts.
      
      One step is to make the shared files immutable 
      (by using the Immutable File Attribute and 
      removing the Linux Capability required to modify 
      this attribute). However an additional attribute
      is required to allow removal of such immutable 
      shared files, to allow for updates of libraries 
      or executables from inside a Context.
      
      Such hard linked, immutable but unlink-able files
      belonging to more than one Context are called
      'unified' and the process of finding common
      files and preparing them in this way is called
      Unification.
      
      The reason for doing this is reduced resource
      consumption, not simplified administration.
      While a typical Linux Server install will 
      consume about 500MB of disk space, 10 unified
      servers will only need about 700MB and as a 
      bonus use less memory for caching.
       
      See Example [XX]

SECTION		Private Namespaces
      
      A recent addition to the Linux-VServer branch 
      was the introduction of Private Namespaces.
      This uses the already existing Virtual Filesystem
      Layer of the Linux kernel to create a separate
      'view' of the filesystem for the processes
      belonging to a Context.
      
      The major advantage over the shared namespace
      used by default is, that any modifications to
      the namespace layout (like mounts) do not
      affect other Contexts, not even the Host Context. 
      
      Obviously the drawback of that approach is, that
      entering such a Private Namespace isn't that
      trivial as changing the root directory, but 
      with proper kernel support this will completely
      replace the chroot() in the future.

SECTION		The Linux-VServer Proc-FS 
      
      A structured, dynamically generated subtree of the 
      well known Proc-FS - actually two of them - has been 
      created to allow for inspecting the different
      values of Security and Network Contexts.
      
      ;   /proc/virtual
      ;     .../info
      ;
      ;   /proc/virtual/<pid>
      ;     .../info
      ;     .../status
      ;     .../sched
      ;     .../cvirt
      ;     .../cacct
      ;     .../limit

SECTION		Token Bucket Extensions
      
      While the basic idea of Linux-VServer is a
      peaceful coexistence of all Contexts, sharing 
      the common resources in a respectful way, it
      is sometimes useful to control the resource
      distribution for resource hungry processes.
      
      The basic principle of a Token Bucket is not
      very new, but it is given here as example for
      the Hard CPU Limit, but it also applies to
      Scheduler Priorities, Network Bandwidth 
      limitation and resource control in general.
      
      The Hard CPU Limit uses this mechanism in the
      following way: you have a bucket of a certain 
      size S which is filled with a specified amount 
      of tokens R each interval T, until a maximum 
      M is reached - excess tokens are spilled. 
      At each timer tick, a running process consumes
      exactly one token from the bucket, unless the 
      bucket is empty - in which case the process is 
      put on a hold queue until the bucket has been 
      refilled with a minimum N of tokens. 
      The process is then rescheduled.
      
      A major advantage of a Token Bucket is, that
      a certain amount of tokens can be accumulated
      in times of quiescence, which later can be
      used to burst, when resources are required.
      
      Where a per process Token Bucket would allow
      for a CPU resource limitation of a single
      process, a Context Token Bucket allows to
      control the CPU usage of all confined processes.
      
      Another approach, which is also done, is to
      use the current fill level of the bucket to
      adjust the process priority, thus reducing the
      priority of processes belonging to excessive
      Contexts.

      See Example [XX]

SECTION		Context Disk Limits
      
      This Feature requires the use of XID Tagged
      Files, and allows for independent Disk Limits
      for different Contexts on a shared partition.
      
      The number of inodes and blocks for each 
      filesystem is accounted, if an XID-Hash was
      added for the Context-Filesystem combo.
      
      Those values, including current usage, maximum
      and reserved space, will be shown for filesystem
      queries, creating the illusion that the shared
      filesystem has a different usage and size,
      for each Context.
	
SECTION		Per Context Quota

      Similar to the Context Disk Limits, Per
      Context Quota allows to have separate quota
      hashes for different Contexts on a shared
      filesystem. This is not required to allow
      for Linux-VServer quota on separate partitions.

SECTION		The VRoot Proxy Device

      Quota operations (ioctls) require some access to
      the block device, which usually is, for
      security reasons, not available inside a VPS


SECTION		Stealth
      
      For some applications, for example the 
      preparation of a honey-pot or an especially 
      realistic imitation of a real server for 
      educational purposes, it can make sense to
      make the Context indistinguishable from
      a real server. 
      
      However, since other freely available alternatives like 
      QEMU or UML are much better at this, and require 
      much less effort, this is not a central 
      issue in Linux-VServer development.

CHAPTER		Linux-VServer Security
      
      Now that we know what the Linux-VServer framework
      provides and how some features work, a word on
      security, because you should not rely on the
      framework to be secure per definition, instead 
      you should exactly know what you are doing.

SECTION		Secure Capabilities

      Currently the following Linux Capabilities are
      considered secure, if you add others to them, you
      will probably open some security hole.
      
      * CAP_CHOWN
      * CAP_DAC_OVERRIDE
      * CAP_DAC_READ_SEARCH
      * CAP_FOWNER
      * CAP_FSETID
      * CAP_KILL
      * CAP_SETGID
      * CAP_SETUID
      * CAP_NET_BIND_SERVICE
      * CAP_SYS_CHROOT
      * CAP_SYS_PTRACE
      * CAP_SYS_BOOT
      * CAP_SYS_TTY_CONFIG
      * CAP_LEASE
      
      CAP_NET_RAW for example is not considered secure
      although it is often used to allow the 'broken'
      ping command to work, although there are better
      alternatives like the hping[U6] command or 
      poink[U7].
      

SECTION		The Chroot Barrier
      
      Ensuring that the Barrier flag is set on the
      root directory of each VPS is vital if you
      do not want VPS root to escape from the
      confinement and walk your Host's root filesystem.
      
SECTION		Secure Device Nodes

      The /dev directory of a VPS should not contain
      more than the following devices and the one 
      directory for the unix pts tree. 

      * c 1 7  full
      * c 1 3  null
      * c 5 2  ptmx
      * c 1 8  random
      * c 5 0  tty
      * c 1 9  urandom
      * c 1 5  zero
      * d      pts

      Of course you may add other device nodes like
      console, mem and kmem, even block and character
      devices, but you should know exactly what you 
      are doing.

SECTION		Secure Proc-FS Entries

      There has been no detailed evaluation of secure
      and unsecure entries in the proc filesystem,
      but there have been some incidents where
      unprotected (not protected via Linux Capabilites)
      writable proc entries caused mayhem.
      
      For example /proc/sysrq-trigger is something
      which should not be accessible inside a VPS
      without a very good reason.


CHAPTER		Field of Application

      The primary goal of this project is to create
      virtual servers sharing the same machine.
      A virtual server operate like a normal Linux
      server. It runs normal services such as telnet,
      mail servers, web servers, SQL servers.

SECTION		Administrative Separation

      Of course this allows a clever provider to sell
      something called Virtual Private Server, which
      uses less resources than other virtualization
      techniques, which in turn allows to put more
      units on a single machine.
 
      The list of providers doing so is relatively
      long, and so it is rightfully considered the
      main area of application.

SECTION		Service Separation
       
      Separating different or similar services
      which otherwise would interfere with each
      other, either because they are poorly designed
      or because they are simply not capable of
      peaceful coexistence, for whatever reason,
      can be easily done with Linux-VServer.
 
      But even on the old fashioned real server
      machines, putting some extremely exposed
      or untrusted, because unknown or proprietary,
      services into some kind of jail can improve
      maintainability and security a lot.

SECTION		Enhancing Security
       
      While it can be interesting to run several
      virtual servers in one box, there is one concept
      potentially more generally useful.
      Imagine a physical server running a single
      virtual server.
      The goal is isolate the main environment from
      any service, any network.
      You boot in the main environment, start very
      few services and then continue in the
      virtual server.
 
      The service in the main environment would be

ULIST
	ITEM	
		unreachable from the network.
	ITEM
    		able to log messages from the virtual server 
		in a secure way. the virtual server would be 
		unable to change/erase the logs. 
		
		So even a cracked virtual server would not be 
		able the edit the log.

	ITEM 
		able to run intrusion detection facilities, 
		potentially spying the state of the virtual 
		server without being accessible or noticed. 
		
		For example tripwire could run there and it 
		would be impossible to circumvent its operation 
		or trick it.
EOLIST

      Another option is to put the firewall in a virtual
      server, and pull in the DMZ, containing each service
      in a separate VPS. On proper configuration, this
      setup can reduce the number of required machines
      drastically, without impacting performance.
      

SECTION		Easy Maintenance

      One key feature of a virtual server is the independence
      from the actual hardware. Most hardware issues are
      irrelevant for the virtual server installation.

      The main server acts as a host and takes care of 
      all the details. The virtual server is just a client 
      and ignores all the details. As such, the client can 
      be moved to another physical server with very few 
      manipulations. 
      
      For example, to move the virtual server from one 
      physical computer to another, it sufficient to do
      the following:
      
      * shutdown the running server
      * copy it over to the other machine
      * copy the configuration
      * start the virtual server on the new machine

      No adjustments to user setup, password database
      or hardware configuration are required, as long as
      both machines are binary compatible.

SECTION		Fail-over Scenarios

      Pushing the limit a little further, using some
      replication technology to keep two versions of
      the same Virtual Server, one running the other
      dormant, but up to the minute, would allow to
      do a very fast fail-over if the running server
      goes offline for whatever reason.
      
      All the known methods to accomplish this, starting
      with network replication via rsync or drbd, via
      network devices or shared disk arrays, to 
      distributed filesystems, can be utilized to
      reduce the down-time and improve overall efficiency.

SECTION		For Testing

      Consider a software tool or package which should
      be built for several versions of a specific
      distribution (Mandrake 8.2, 9.0, 9.1, 9.2, 10.0)
      or even for different distributions.
      
      This is easily solved with Linux-VServer, and
      given there is plenty of disk space, the different
      distributions can be installed and running side
      by side, which simplifies switching from one to
      the other.
      
      Of course this can be accomplished by chroot()
      alone, but with Linux-VServer it's much more 
      realistic.


CHAPTER		Non Intel i386 Hardware
      
      Linux-VServer was designed to be mostly 
      architecture agnostic, therefore only a small part, the 
      syscall definition itself, is architecture
      specific. Nevertheless some architectures have 
      private copies of basically architecture 
      independent code for whatever reason, and
      therefore often small modifications are 
      required.
      
      Basically the following architectures are
      supported and some of them are even tested.
      
      * alpha
      * ia32 / ia64
      * mips / mips64
      * hppa / hppa64
      * ppc / ppc64
      * sparc / sparc64
      * s390
      * x86_64 (AMD64)
      * uml

      Adding a new architecture is relatively simple
      although extensive testing is required to
      make sure that every feature is working as
      expected (and of course, the hardware ;)


CHAPTER		Linux Kernel Intro

      While almost all of the described features reside
      in the Linux Kernel, nifty Userspace Tools are
      required to activate and control the new functionality.
      
      Those Userspace Tools in general communicate with 
      the Linux Kernel via so called System Calls 
      (or short Syscall).

      This chapter will give a short overview how Linux
      Kernel and User Space is organized and how Syscalls,
      a simple method of communication between processes
      and kernel, work.

SECTION		Kernel and User Space
 
      In Linux and similar Operating Systems, User and
      Kernel Space is separated, and address space is
      divided into two parts. Kernel space is where the
      kernel code resides, and user space is where the user
      programs live. Of course, a given user program can't
      write to kernel memory or to another program's
      memory area.

      Unfortunately, this is also the case for kernel code.
      Kernel code can't write to user space either. What
      does this mean? Well, when a given hardware driver
      wants to write data bytes to a program in user memory,
      it can't do it directly, but rather it must use specific
      kernel functions instead. Also, when parameters are
      passed by address to a kernel function, the kernel
      function can not read the parameters directly.
      It must use other kernel functions to read each byte
      of the parameters.

      Of course there are some helpers which do the
      transfer to and from user space.

      copy_to_user(void *to, const void *from, long n);
      copy_from_user(void *to, const void *from, long n);

      get_user() and put_user()
      Get or put the given byte, word, or long from or to
      user memory. This is a macro, and it relies on the
      type of the argument to determine the number of bytes
      to transfer.

SECTION		Linux Syscalls
 
      Most libc calls rely on system calls, which are the
      simplest kernel functions a user program can call.

      These system calls are implemented in the kernel itself
      or in loadable kernel modules, which are little chunks
      of dynamically link-able kernel code.

      Linux system calls are implemented through a multiplexor
      called with a given maskable interrupt.
      In Linux, this interrupt is int 0x80.
      When the 'int 0x80' instruction is executed, control is
      given to the kernel (or, more accurately, to the
      _system_call() function), and the actual demultiplexing
      process occurs.

      How does _system_call() work ?

      First, all registers are saved and the content of the
      %eax register is checked against the global system
      calls table, which enumerates all system calls and
      their addresses.

      This table can be accessed with the
      extern void *sys_call_table[] variable.
      A given number and memory address in this table
      corresponds to each system call.
 
      System call numbers can be found in
      /usr/include/sys/syscall.h.
 
      They are of the form SYS_systemcallname.
      If the system call is not implemented, the
      corresponding cell in the sys_call_table is 0,
      and an error is returned.
 
      Otherwise, the system call exists and the corresponding
      entry in the table is the memory address of the
      system call code.


CHAPTER		Kernel Side Implementation

      While this chapter is mainly of interest to kernel
      developers it might be funny to take a small peek 
      behind the curtain to get a glimpse how everything 
      really works.
      
SECTION		The Syscall Command Switch
 
      For a long time Linux-VServer used a few different
      Syscalls to accomplish different aspects of the
      work, but very soon the number of required commands
      grew large, and the Syscalls started to have 
      'magic' values, selecting the desired behavior.
      
      Not too long ago, a single syscall was reserved 
      for Linux-VServer, and while the opinion on that 
      might differ from developer to developer, it was
      generally considered a good decision not to have 
      more than one syscall.
      
      The advantage of different Syscalls would be simpler
      handling of the Syscalls on different architectures
      but this hasn't been any problem so far, as the
      data passed to and from the kernel has strong typed
      fields conforming to the C99 types.
      
      Anyway the availability of one system call required
      the creation of a multiplexor, which decides, based
      on some selector, what specific command is to be
      executed, and then passes on the remaining arguments
      to that command, which does the actual work.
      
      ; extern asmlinkage long
      ; sys_vserver(uint32_t cmd, uint32_t id, void __user *data)

      The Linux-VServer syscall is passed three arguments
      regardless of what actual command is specified:
      a command (cmd), a number (id), and a user-space
      data-structure of yet unknown size.
      
      To allow some structuring for debugging purposes 
      and some kind of command versioning, the 'cmd'
      is split into three parts: the lower 12 bit 
      contain a version number, then 4 bit are reserved
      the upper 16 bit are divided into 8 bit command
      and 6 bit category, again reserving 2 bit for
      the future.
      
      So there are 64 Categories with up to 256 commands
      in each category, allowing for 4096 revisions of
      each command, which to be honest is more than will
      ever be required.
      
      Here is an overview of the categories already
      defined, and their numerical value:
      
      ; Syscall Matrix V2.6
      ;
      ;        |VERSION|CREATE |MODIFY |MIGRATE|CONTROL|EXPERIM| |SPECIAL|SPECIAL|
      ;        |STATS  |DESTROY|ALTER  |CHANGE |LIMIT  |TEST   | |	 |	 |
      ;        |INFO   |SETUP  |       |MOVE   |       |       | |	 |	 |
      ; -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
      ; SYSTEM |VERSION|VSETUP |VHOST  |       |       |       | |DEVICES|	 |
      ; HOST   |     00|     01|     02|     03|     04|     05| |     06|     07|
      ; -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
      ; CPU    |       |VPROC  |PROCALT|PROCMIG|PROCTRL|       | |SCHED. |	 |
      ; PROCESS|     08|     09|     10|     11|     12|     13| |     14|     15|
      ; -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
      ; MEMORY |       |       |       |       |       |       | |SWAP   |	 |
      ;        |     16|     17|     18|     19|     20|     21| |     22|     23|
      ; -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
      ; NETWORK|       |VNET   |NETALT |NETMIG |NETCTL |       | |SERIAL |	 |
      ;        |     24|     25|     26|     27|     28|     29| |     30|     31|
      ; -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
      ; DISK   |       |       |       |       |       |       | |INODE  |	 |
      ; VFS    |     32|     33|     34|     35|     36|     37| |     38|     39|
      ; -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
      ; OTHER  |       |       |       |       |       |       | |VINFO  |	 |
      ;        |     40|     41|     42|     43|     44|     45| |     46|     47|
      ; =======+=======+=======+=======+=======+=======+=======+ +=======+=======+
      ; SPECIAL|       |       |       |       |FLAGS  |       | |	 |	 |
      ;        |     48|     49|     50|     51|     52|     53| |     54|     55|
      ; -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
      ; SPECIAL|       |       |       |       |RLIMIT |SYSCALL| |	 |COMPAT |
      ;        |     56|     57|     58|     59|     60|TEST 61| |     62|     63|
      ; -------+-------+-------+-------+-------+-------+-------+ +-------+-------+

      The definition of those Commands is simplified
      by some macros, so for example the commands to 
      get and set the Context Flags are defined like 
      this: 

      ; #define VCMD_get_cflags         VC_CMD(FLAGS, 1, 0)
      ; #define VCMD_set_cflags         VC_CMD(FLAGS, 2, 0)
      ; 
      ; extern int vc_get_cflags(uint32_t, void __user *);
      ; extern int vc_set_cflags(uint32_t, void __user *);

      Note that the command itself is not passed to
      the actual command implementation, only the id
      and the pointer to user-space data.

SECTION		Utilized Data Structures

      There are many different data structures used
      by different parts of the implementation, and
      only a few examples are given here, but of course 
      all utilized structures can be found in the source. 

SUBSECT		The Context Data Structure

      The Context Data Structure consists of a few
      fields required to manage the contexts, and 
      handle context destruction, as well as future
      hierarchical contexts.
      
      Logically separated sections of that structure
      like for the scheduler or the context limits
      are defined in separate structures, and 
      incorporated into the main one.

      ; struct vx_info {
      ;         struct list_head vx_list;         /* linked list of contexts */
      ;         xid_t vx_id;                      /* context id */
      ;         atomic_t vx_refcount;             /* refcount */
      ;         struct vx_info *vx_parent;        /* parent context */
      ; 
      ;         struct namespace *vx_namespace;   /* private namespace */
      ;         struct fs_struct *vx_fs;          /* private namespace fs */
      ;         uint64_t vx_flags;                /* context flags */
      ;         uint64_t vx_bcaps;                /* bounding caps (system) */
      ;         uint64_t vx_ccaps;                /* context caps (vserver) */
      ; 
      ;         pid_t vx_initpid;                 /* PID of fake init process */
      ; 
      ;         struct _vx_limit limit;           /* vserver limits */
      ;         struct _vx_sched sched;           /* vserver scheduler */
      ;         struct _vx_cvirt cvirt;           /* virtual/bias stuff */
      ;         struct _vx_cacct cacct;           /* context accounting */
      ; 
      ;         char vx_name[65];                 /* vserver name */
      ; };

      Here as example the Scheduler Substructure:

      ; struct _vx_sched {
      ;         spinlock_t tokens_lock; /* lock for this structure */
      ; 
      ;         int fill_rate;          /* Fill rate:      add X tokens ... */
      ;         int interval;           /* Divisor:      ... each Y jiffies */
      ;         atomic_t tokens;        /*         current number of tokens */
      ;         int tokens_min;         /* Limit:        minimum for unhold */
      ;         int tokens_max;         /* Limit:     no more than N tokens */
      ;         uint32_t jiffies;       /* bias:     integral multiple of Y */
      ; 
      ;         uint64_t ticks;         /* token tick events */
      ;         cpumask_t cpus_allowed; /* cpu mask for context */
      ; };

      The main idea behind this separation is that
      each substructure belongs to a logically
      distinct part of the implementation which
      provides an init and cleanup function for this
      structure, thus simplifying maintainability 
      and readability of those structures.

SUBSECT		The Scheduler Command Data

      As an example for the data structure used to
      control a specific part of the context from
      user-space, here a scheduler command and the
      utilized data structure to set the properties: 

      ; #define VCMD_set_sched          VC_CMD(SCHED, 1, 2)
      ;
      ; struct  vcmd_set_sched_v2 {
      ;         int32_t fill_rate;      /* Fill rate:      add X tokens ... */
      ;         int32_t interval;       /* Divisor:      ... each Y jiffies */
      ;         int32_t tokens;         /*         current number of tokens */
      ;         int32_t tokens_min;     /* Limit:        minimum for unhold */
      ;         int32_t tokens_max;     /* Limit:     no more than N tokens */
      ;         uint64_t cpu_mask;      /* Mask:               allowed cpus */
      ; };


SUBSECT		Example Accounting: Sockets

      Basically all the accounting and limit stuff
      is defined as macros or inline functions capable
      of handling the different resources, and hiding
      the underlying implementation wherever possible.

      ; #define vx_acc_sock(v,f,p,s) \
      ;         __vx_acc_sock((v), (f), (p), (s), __FILE__, __LINE__)
      ;
      ; static inline void __vx_acc_sock(struct vx_info *vxi,
      ;         int family, int pos, int size, char *file, int line)
      ; {
      ;         if (vxi) {
      ;                 int type = vx_sock_type(family);
      ;
      ;                 atomic_inc(&vxi->cacct.sock[type][pos].count);
      ;                 atomic_add(size, &vxi->cacct.sock[type][pos].total);
      ;         }
      ; }
      ;
      ; #define vx_sock_recv(sk,s) \
      ;         vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 0, (s))
      ; #define vx_sock_send(sk,s) \
      ;         vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 1, (s))
      ; #define vx_sock_fail(sk,s) \
      ;         vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 2, (s))

      And this general definition is then used 
      where appropriate, for example in the
      __sock_sendmsg() function like this:

      ;         len = sock->ops->sendmsg(iocb, sock, msg, size);
      ;         if (sock->sk) {
      ;                 if (len == size)
      ;                         vx_sock_send(sock->sk, size);
      ;                 else
      ;                         vx_sock_fail(sock->sk, size);
      ;         }

SUBSECT		Example Limits: Virtual Memory

      ; #define vx_pages_avail(m, p, r) \
      ;         __vx_pages_avail((m)->mm_vx_info, (r), (p), __FILE__, __LINE__)
      ;
      ; static inline int __vx_pages_avail(struct vx_info *vxi,
      ;                 int res, int pages, char *file, int line)
      ; {
      ;         if (!vxi)
      ;                 return 1;
      ;         if (vxi->limit.rlim[res] == RLIM_INFINITY)
      ;                 return 1;
      ;         if (atomic_read(&vxi->limit.res[res]) + 
      ;                 pages < vxi->limit.rlim[res])
      ;                 return 1;
      ;         return 0;
      ; }
      ;
      ; #define vx_vmpages_avail(m,p)  vx_pages_avail(m, p, RLIMIT_AS)
      ; #define vx_vmlocked_avail(m,p) vx_pages_avail(m, p, RLIMIT_MEMLOCK)
      ; #define vx_rsspages_avail(m,p) vx_pages_avail(m, p, RLIMIT_RSS)

      And again the test against those limits
      at certain places, for example here in 
      copy_process()

      ;         /* check vserver memory */
      ;         if (p->mm && !(clone_flags & CLONE_VM)) {
      ;                 if (vx_vmpages_avail(p->mm, p->mm->total_vm))
      ;                         vx_pages_add(p->mm->mm_vx_info, 
      ;                                 RLIMIT_AS, p->mm->total_vm);
      ;                 else
      ;                         goto bad_fork_free;
      ;         }

SUBSECT		Example Virtualization: Uptime

      ; void vx_vsi_uptime(struct timespec *uptime)
      ; {
      ;         struct vx_info *vxi = current->vx_info;
      ;
      ;         set_normalized_timespec(uptime,
      ;                 uptime->tv_sec - vxi->cvirt.bias_tp.tv_sec,
      ;                 uptime->tv_nsec - vxi->cvirt.bias_tp.tv_nsec);
      ;         return;
      ; }

      ;         if (vx_flags(VXF_VIRT_UPTIME, 0))
      ;                 vx_vsi_uptime(&uptime, &idle);


CHAPTER		Future Directions

SECTION		Hierarchical Contexts

SECTION		Security Branch

SECTION		Stealth Branch


CHAPTER		Step by Step Examples

      The following examples can be reproduced on 
      recent Linux systems without any modifications.
      

SECTION		[E01] Linux Caps Example

; # grep Cap /proc/self/status
;
;   CapInh: 0000000000000000
;   CapPrm: 00000000fffffeff
;   CapEff: 00000000fffffeff

SECTION		[E02] Manipulating Linux Caps

; # lcap -z
; # grep Cap /proc/self/status
;
;   CapInh: 0000000000000000
;   CapPrm: 0000000000000000
;   CapEff: 0000000000000000

SECTION		[E03] CAP_MKNOD Example

; # lcap -z CAP_MKNOD
; # mknod /tmp/null b 0 0
;
;   mknod: /tmp/null: Operation not permitted


SECTION		[E11] Resource Limits via BASH

; bash-2.05# ulimit -H -a
; 
;   core file size (blocks)     unlimited
;   data seg size (kbytes)      unlimited
;   file size (blocks)          unlimited
;   max locked memory (kbytes)  unlimited
;   max memory size (kbytes)    unlimited
;   open files                  1024
;   pipe size (512 bytes)       8
;   stack size (kbytes)         unlimited
;   cpu time (seconds)          unlimited
;   max user processes          256
;   virtual memory (kbytes)     unlimited

SECTION		[E12] Example CPU Time Limit

; bash-2.05# ulimit -t 10
; bash-2.05# /tmp/cpuhog
;
;   Killed

SECTION		[E13] Normal File without Attributes

; # touch /tmp/file
; # lsattr /tmp/file
;
;   ------------- /tmp/file
;
; # rm -f /tmp/file

SECTION		[E14] Immutable File Attribute

; # touch /tmp/file
; # chattr +i /tmp/file 
; # lsattr /tmp/file 
;
;   ----i-------- /tmp/file
;
; # rm -f /tmp/file
;
;   rm: unable to remove `/tmp/file': Operation not permitted
;
; # chattr -i /tmp/file 
; # rm -f /tmp/file 

SECTION		[E15] Chroot Example

; # mkdir /tmp/root
; # mkdir /tmp/root/bin /tmp/root/lib
; # cp -a /lib/* /tmp/root/lib/ 
; # cp -a /bin/bash /tmp/root/bin/
; # cp -a /bin/ls /tmp/root/bin/  
;
; # chroot /tmp/root /bin/bash
;
; bash-2.05# ls / 
;
; bin  lib


CLOSE

      The now following examples require a Linux-VServer
      kernel and recent util-vserver tools.

SECTION		[E21] Process Isolation

; # chcontext ps auxw
;
;   New security context is 49152
;   USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
;   root        37 11.5  1.7  2408  500 tts/0    R    02:36   0:00 ps auxw

SECTION		[E22] Dual Context Isolation 

; # chcontext --ctx 100 sleep 111 &
; # chcontext --ctx 200 sleep 222 &
; # chcontext --ctx 100 ps auxw
;
;   New security context is 100
;   USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
;   root        23  0.3  1.5  2748  436 tts/0    S    18:47   0:00 sleep 111
;   root        26 28.0  2.2  2476  664 tts/0    R    18:48   0:00 ps auxw

SECTION		[E23] Fake Init Process

; # chcontext --flag fakeinit ps auxw
;
;   New security context is 49153
;   USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
;   root         1  0.0  0.4  2756  552 ?        S    19:09   0:04 init

SECTION		[E24] Network Isolation

; # ifconfig lo 127.0.0.1
; # ifconfig eth0 10.0.0.2
; # ping -c 1 10.0.0.1
;
;   PING 10.0.0.1 (10.0.0.1) from 10.0.0.2 : 56(84) bytes of data.
;   64 bytes from 10.0.0.1: icmp_seq=0 ttl=64 time=4.814 msec
;   
; # chbind --ip 127.0.0.1 ping -c 1 10.0.0.1
;
;   ipv4root is now 127.0.0.1
;   connect: Invalid argument


SECTION		[E31] Context File Tagging

; # mount -o tagxid /dev/hd1/part1 /mnt/
; # mkdir /mnt/test
; # touch /mnt/test/file
; # lsctx /mnt/test/file
;   
;   #0   /mnt/test/file
;
; # chcontext --ctx 100 touch /mnt/test/file_100
; # lsctx -R /mnt/test
;
;   #0   /mnt/test/file
;   #100 /mnt/test/file_100

SECTION		[E32] XID File Permissions

; # mount -o tagxid /dev/hd1/part1 /mnt/
; # chcontext --ctx 100 mkdir /mnt/100
;
;   New security context is 100
; # chcontext --ctx 200 touch /mnt/100/file
;
;   New security context is 200
;   touch: /mnt/100/file: Permission denied


SECTION		[E33] Context Flags

; # chcontext --flag ^24 cat /proc/mounts 
;   New security context is 49154

SECTION		[E34] Context Capabilities

; # chcontext --flag ^24 cat /proc/mounts 
;   New security context is 49154

SECTION		[E41] Context Accounting

; # cat /proc/virtual/1001/cacct 
;
;   UNSPEC:        0/0                 0/0                  0/0
;   UNIX:         25/1626             25/1626               0/0
;   INET:         12/531              48/1653              55/1625
;   INET6:         0/0                 0/0                  0/0
;   OTHER:         0/0                 0/0                  0/0

SECTION		[E42] Context Limits

; # cat /proc/virtual/1001/limit 
;
;   PROC:         16/-1
;   VM:        14313/-1
;   VML:           0/-1
;   RSS:        5519/-1
;   FILES:       140/-1

SECTION		[E43] Uptime Virtualization

; # uptime
;
;   20:31:11 up 21 min, load average: 0.20, 0.05, 0.01
;
; # chcontext --ctx 100 --flag ^17 bash -c "sleep 100; uptime"
;
;   New security context is 100
;   20:33:13 up 1 min, load average: 0.02, 0.03, 0.00

SECTION		[E44] ProcFS Guest Security

; # setattr --hide /proc/*
; # chcontext --ctx 100 ls /proc
;
;   New security context is 100
;   527  self


SECTION		[E45] ProcFS Host Security

; # setattr --hide --~admin /proc/*
; # ls /proc
;
;   1  10  12  13  16  2  3  4  5  532  6  7  8  9   self


CHAPTER		Help and References

      Linux-VServer is a Community Project.
      This means that all development and documentation
      is done with and by the community, and let me say
      it's a good and friendly community.
      
      Most Documentation for Linux-VServer is available
      on the web site http://linux-vserver.org, which is
      a Wiki, so you can add your stuff there and improve
      the site this way.
      
      There is also an IRC channel on irc.oftc.net called
      #vserver, where you always will receive a warm
      welcome ...
      
LLIST
	ITEM	[U1] QEMU CPU Emulator
		http://fabrice.bellard.free.fr/qemu/
	
	ITEM	[U2] Bochs IA-32 Emulator 
		http://bochs.sourceforge.net/

	ITEM	[U3] User-mode Linux Kernel 
		http://user-mode-linux.sourceforge.net/

	ITEM	[U4] The Xen virtual machine monitor
		http://www.cl.cam.ac.uk/Research/SRG/netos/xen/

	ITEM	[U5] POSIX.1e Reference 
		http://wt.xpilot.org/publications/posix.1e/

	ITEM	[U6] hping tool
		http://www.hping.org/

	ITEM	[U7] poink tool 
		http://www.gnu.org/directory/security/system/poink.html
EOLIST