MIT6828-Lab4-抢占式多线程

剪不断,理还乱

介绍

In this lab you will implement preemptive multitasking among multiple simultaneously active user-mode environments.

In part A you will add multiprocessor support to JOS, implement round-robin scheduling, and add basic environment management system calls (calls that create and destroy environments, and allocate/map memory).

In part B, you will implement a Unix-like fork(), which allows a user-mode environment to create copies of itself.

Finally, in part C you will add support for inter-process communication (IPC), allowing different user-mode environments to communicate and synchronize with each other explicitly. You will also add support for hardware clock interrupts and preemption.

开始

athena% cd ~/6.828/lab
athena% add git
athena% git pull
Already up-to-date.
athena% git checkout -b lab4 origin/lab4
Branch lab4 set up to track remote branch refs/remotes/origin/lab4.
Switched to a new branch "lab4"
athena% git merge lab3
Merge made by recursive.
...
athena%

首先切换至lab4分支,再将lab3合并进来。在lab4中,你需要参考及阅读的源码如下:

inc/
  env.h: Public definitions for user-mode environments
  trap.h: Public definitions for trap handling
  syscall.h: Public definitions for system calls from user environments to the kernel
  lib.h: Public definitions for the user-mode support library
kern/
  env.h: Kernel-private definitions for user-mode environments
  env.c: Kernel code implementing user-mode environments
  trap.h: Kernel-private trap handling definitions
  trap.c: Trap handling code
  trapentry.S: Assembly-language trap handler entry-points
  syscall.h: Kernel-private definitions for system call handling
  syscall.c: System call implementation code
lib/
  Makefrag: Makefile fragment to build user-mode library, obj/lib/libjos.a
  entry.S: Assembly-language entry-point for user environments
  libmain.c: User-mode library setup code called from entry.S
  syscall.c: User-mode system call stub functions
  console.c: User-mode implementations of putchar and getchar, providing console I/O
  exit.c: User-mode implementation of exit
  panic.c: User-mode implementation of panic
user/
  *: Various test programs to check kernel lab 3 code

实验需求

本实验分为三个部分,每个部分约需一周时间。你需要完成所有的练习(exercise),以及至少一个挑战(challenge)。

第一部分:多处理器支持及协作多任务

在本实验的第一部分,你需要为JOS扩展多处理器支持,并实现一些新的系统调用,使用户程序可以创建新的用户环境。你还需要实现协作式的round-robin(轮转)调度算法,使内核在当前进程主动让出CPU时切换到另一个进程。在Part C,你将实现抢占式调度,使内核可以从用户进程手中重新获得CPU控制权。

多处理器支持

We are going to make JOS support “symmetric multiprocessing” (SMP), a multiprocessor model in which all CPUs have equivalent access to system resources such as memory and I/O buses. While all CPUs are functionally identical in SMP, during the boot process they can be classified into two types: the bootstrap processor (BSP) is responsible for initializing the system and for booting the operating system; and the application processors (APs) are activated by the BSP only after the operating system is up and running. Which processor is the BSP is determined by the hardware and the BIOS. Up to this point, all your existing JOS code has been running on the BSP. (BSP负责启动系统)

In an SMP system, each CPU has an accompanying local APIC (本地可编程中断控制器) unit. The LAPIC units are responsible for delivering interrupts throughout the system. The LAPIC also provides its connected CPU with a unique identifier. In this lab, we make use of the following basic functionality of the LAPIC unit (in kern/lapic.c):

  • Reading the LAPIC identifier (APIC ID) to tell which CPU our code is currently running on (see cpunum()).
  • Sending the STARTUP interprocessor interrupt (IPI) from the BSP to the APs to bring up other CPUs (see lapic_startap()).
  • In part C, we program LAPIC’s built-in timer to trigger clock interrupts to support preemptive multitasking (see apic_init()).

一个处理器通过内存映射IO(MMIO)来访问其LAPIC。在MMIO中,a portion of physical memory is hardwired to the registers of some I/O devices, so the same load/store instructions typically used to access memory can be used to access device registers. You’ve already seen one IO hole at physical address 0xA0000 (we use this to write to the VGA display buffer). The LAPIC lives in a hole starting at physical address 0xFE000000 (32MB short of 4GB), so it’s too high for us to access using our usual direct map at KERNBASE. The JOS virtual memory map leaves a 4MB gap at MMIOBASE so we have a place to map devices like this. Since later labs introduce more MMIO regions, you’ll write a simple function to allocate space from this region and map device memory to it.

练习1:在kern/pmap.c中实现mmio_map_region。要弄清它是如何被使用的,可以看一下kern/lapic.c中lapic_init的开始部分。你还需要完成下一个练习,mmio_map_region才能正常工作。

函数原型

void *
mmio_map_region(physaddr_t pa, size_t size)
{
static uintptr_t base = MMIOBASE;
return (void*)0;
}

函数功能

这个函数从MMIO区域(始于MMIOBASE)中保留size字节的虚拟地址空间,并将其映射至物理地址[pa, pa+size),返回所保留区域的起始地址。程序不难,照着注释就能写完。

函数实现

//
// Reserve size bytes in the MMIO region and map [pa,pa+size) at this
// location. Return the base of the reserved region. size does *not*
// have to be multiple of PGSIZE.
//
void *
mmio_map_region(physaddr_t pa, size_t size)
{
// Where to start the next region. Initially, this is the
// beginning of the MMIO region. Because this is static, its
// value will be preserved between calls to mmio_map_region
// (just like nextfree in boot_alloc).
static uintptr_t base = MMIOBASE;


size = (size_t) ROUNDUP((uint32_t)(size), PGSIZE); // Round size up to a multiple of PGSIZE
// Handle if this reservation would overflow MMIOLIM
if(base + size > MMIOLIM) panic("mmio_map_region: reservation overflows MMIOLIM");

// Reserve size bytes of virtual memory starting at base and
// map physical pages [pa,pa+size) to virtual addresses
// [base,base+size). Since this is device memory and not
// regular DRAM, you'll have to tell the CPU that it isn't
// safe to cache access to this memory.
//
// PTE_PCD: Cache disable
// PTE_PWT: write through
//
// Should I write PTE_P?
boot_map_region(kern_pgdir, base, size, pa, (PTE_P | PTE_W | PTE_PCD | PTE_PWT));
return (void*)base;
}

调试

上面的函数无法通过check_kern_pgdir,注意这里的base用的是static,根据其中的注释,我们每次调用mmio_map_region,都需要更新base,因此我们在函数中使用一个old_base记录更新前的值,然后令base = base + size。代码如下:

uintptr_t old_base = base;
base = base + size;
return (void*)old_base;

应用处理器启动

在启动APs之前,BSP会收集多处理器系统的信息,例如CPU总数、各CPU的APIC ID以及LAPIC单元的MMIO地址。kern/mpconfig.c中的mp_init()函数通过读取位于BIOS内存区域的多处理器配置表获取这些信息。

boot_aps()函数(位于kern/init.c)驱动了应用处理器的启动过程。APs于实模式下启动,这和bootloader在boot/boot.S中的启动过程类似,因此boot_aps()将AP的入口(entry)代码(位于kern/mpentry.S)复制到了实模式下可以寻址的一段内存空间中。不同于bootloader,我们可以控制AP从哪里开始执行代码;这里我们将入口代码复制到了0x7000 (MPENTRY_PADDR)处。

在此之后,boot_aps()依次激活各个应用处理器:它向对应处理器的LAPIC单元发送STARTUP处理器间中断(IPI),并附带一个初始的CS:IP地址(即MPENTRY_PADDR),应用处理器将从该地址开始执行入口代码。kern/mpentry.S中的入口代码和boot/boot.S中的很类似。经过一些简单的设置后,它令AP进入开启分页的保护模式,然后调用mp_main()进行配置(also in kern/init.c)。在唤醒下一个AP之前,boot_aps()会等待当前AP发出CPU_STARTED信号,该信号通过结构体struct CpuInfo的cpu_status字段传递。

练习 2. Read boot_aps() and mp_main() in kern/init.c, and the assembly code in kern/mpentry.S. Make sure you understand the control flow transfer during the bootstrap of APs. Then modify your implementation of page_init() in kern/pmap.c to avoid adding the page at MPENTRY_PADDR to the free list, so that we can safely copy and run AP bootstrap code at that physical address. Your code should pass the updated check_page_free_list() test (but might fail the updated check_kern_pgdir() test, which we will fix soon).

阅读boot_apsmp_main以及mpentry.S

boot_aps的源代码如下:

static void
boot_aps(void)
{
extern unsigned char mpentry_start[], mpentry_end[]; // 这两个地址位于`mpentry.S`中
void *code;
struct CpuInfo *c;

// Write entry code to unused memory at MPENTRY_PADDR
code = KADDR(MPENTRY_PADDR);
memmove(code, mpentry_start, mpentry_end - mpentry_start);

// Boot each AP one at a time
for (c = cpus; c < cpus + ncpu; c++) {
// 跳过当前正在执行 boot_aps 的 CPU(即 BSP),它已经启动
if (c == cpus + cpunum()) // We've started already.
continue;

// Tell mpentry.S what stack to use
mpentry_kstack = percpu_kstacks[c - cpus] + KSTKSIZE;
// Start the CPU at mpentry_start
lapic_startap(c->cpu_id, PADDR(code));
// Wait for the CPU to finish some basic setup in mp_main()
while(c->cpu_status != CPU_STARTED)
;
}
}

boot_aps的基本过程为:首先将入口代码拷贝到MPENTRY_PADDR,然后遍历每个CPU;如果是当前CPU(cpunum返回正在执行boot_aps的CPU,即BSP),则跳过;否则设置该AP的内核栈顶指针mpentry_kstack(栈空间已在percpu_kstacks中静态预留,每个大小为KSTKSIZE,即32768字节),然后启动AP并等待其完成基本配置。

mp_main的源码如下:

// Setup code for APs
void
mp_main(void)
{
// We are in high EIP now, safe to switch to kern_pgdir
lcr3(PADDR(kern_pgdir));
cprintf("SMP: CPU %d starting\n", cpunum());

lapic_init(); // 初始化local apic
env_init_percpu(); // 为当前CPU加载GDT和段寄存器
trap_init_percpu(); // 初始化当前CPU的TSS和IDT
xchg(&thiscpu->cpu_status, CPU_STARTED); // tell boot_aps() we're up

// Now that we have finished some basic setup, call sched_yield()
// to start running processes on this CPU. But make sure that
// only one CPU can enter the scheduler at a time!
//
// Your code here:

// Remove this after you finish Exercise 6
for (;;);
}

mp_main主要做了一些初始化工作,具体内容我们会在后面进行讲解。至于mpentry.S,其基本功能是令AP进入保护模式、开启分页并设置好栈,然后调用mp_main。所以AP的初始化过程可以总结为:boot_aps将mpentry.S的代码拷贝到给定内存空间中,AP从该处开始执行,mpentry.S完成基本的模式切换后调用mp_main进行CPU初始化工作。

MPENTRY_PADDR(0x7000)对应的物理页从内存freelist中剔除

我们之前以链表的形式将内存中的空闲页面串联了起来,现在需要将0x7000对应的物理页剔除。只要找到0x7000对应的页在pages数组中的下标,并调整链表节点的连接关系即可,比较简单,代码如下:

// LAB 4:
// Change your code to mark the physical page at MPENTRY_PADDR
// as in use
i = MPENTRY_PADDR/PGSIZE;
pages[i].pp_ref = 1;
pages[i].pp_link = NULL;
// page_init 采用头插法建链,空闲链表按页号降序排列,
// 因此 pages[i+1].pp_link 原本指向 pages[i],这里改为绕过它
pages[i+1].pp_link = &pages[i-1];

单个CPU的状态及初始化

编写多处理器OS时,最重要的一点是区分每个CPU私有的状态以及整个系统共享的全局状态。kern/cpu.h定义了大部分每CPU状态,包括struct CpuInfo,它保存了每个CPU的状态变量。cpunum()总是返回调用它的CPU的序号,因此可以用作CPU数组cpus的索引。同样地,宏thiscpu指向当前CPU的struct CpuInfo。

这里列举了你应当了解的每个CPU的状态:

  • Per-CPU kernel stack(每个CPU的内核栈)
    每个CPU可能同时陷入内核态,我们需要为每个处理器设置单独的内核栈,确保他们之间不会相互打扰。The array percpu_kstacks[NCPU][KSTKSIZE] reserves space for NCPU’s worth of kernel stacks.(独立的内核栈确保互不打扰)

    In Lab 2, you mapped the physical memory that bootstack refers to as the BSP’s kernel stack just below KSTACKTOP. Similarly, in this lab, you will map each CPU’s kernel stack into this region with guard pages acting as a buffer between them. CPU 0’s stack will still grow down from KSTACKTOP; CPU 1’s stack will start KSTKGAP bytes below the bottom of CPU 0’s stack, and so on. inc/memlayout.h shows the mapping layout.

  • Per-CPU TSS and TSS descriptor(每个CPU的TSS和TSSD)
    A per-CPU task state segment (TSS) is also needed in order to specify where each CPU’s kernel stack lives. The TSS for CPU i is stored in cpus[i].cpu_ts, and the corresponding TSS descriptor is defined in the GDT entry gdt[(GD_TSS0 >> 3) + i]. The global ts variable defined in kern/trap.c will no longer be useful.

  • Per-CPU current environment pointer(每个CPU的当前环境指针)
    Since each CPU can run different user process simultaneously, we redefined the symbol curenv to refer to cpus[cpunum()].cpu_env (or thiscpu->cpu_env), which points to the environment currently executing on the current CPU (the CPU on which the code is running).

  • Per-CPU system registers(每个CPU的系统寄存器)
    All registers, including system registers, are private to a CPU. Therefore, instructions that initialize these registers, such as lcr3(), ltr(), lgdt(), lidt(), etc., must be executed once on each CPU. Functions env_init_percpu() and trap_init_percpu() are defined for this purpose.

  • In addition to this, if you have added any extra per-CPU state or performed any additional CPU-specific initialization (by say, setting new bits in the CPU registers) in your solutions to challenge problems in earlier labs, be sure to replicate them on each CPU here!

Exercise 3. Modify mem_init_mp() (in kern/pmap.c) to map per-CPU stacks starting at KSTACKTOP, as shown in inc/memlayout.h. The size of each stack is KSTKSIZE bytes plus KSTKGAP bytes of unmapped guard pages. Your code should pass the new check in check_kern_pgdir().

函数功能

为每个CPU分配栈空间,栈空间分为两部分,一部分为栈区,大小为KSTKSIZE,另一部分为缓冲区,大小为KSTKGAP,缓冲区的作用是判断栈溢出,如果进入缓冲区,说明溢出,要触发异常,栈示意图如下:

(图:每个CPU的内核栈与保护页(guard page)布局示意图)

函数实现

根据函数功能,我们需要将栈地址与对应的物理地址进行映射,每个栈的栈顶为kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP),栈底为kstacktop_i-KSTKSIZE,大小为KSTKSIZE。不难写出映射代码如下:

// Modify mappings in kern_pgdir to support SMP
// - Map the per-CPU stacks in the region [KSTACKTOP-PTSIZE, KSTACKTOP)
//
static void
mem_init_mp(void)
{
// For CPU i, use the physical memory that 'percpu_kstacks[i]' refers
// to as its kernel stack. CPU i's kernel stack grows down from virtual
// address kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP), and is
// divided into two pieces, just like the single stack you set up in
// mem_init:
// * [kstacktop_i - KSTKSIZE, kstacktop_i)
// -- backed by physical memory
// * [kstacktop_i - (KSTKSIZE + KSTKGAP), kstacktop_i - KSTKSIZE)
// -- not backed; so if the kernel overflows its stack,
// it will fault rather than overwrite another CPU's stack.
// Known as a "guard page".
// Permissions: kernel RW, user NONE
for(int i = 0; i < NCPU; ++i){
uintptr_t kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP);
boot_map_region(kern_pgdir, kstacktop_i - KSTKSIZE, KSTKSIZE, PADDR(percpu_kstacks[i]), (PTE_W | PTE_P));
}
}

完成后,代码应当能够通过check_kern_pgdir

Exercise 4. The code in trap_init_percpu() (kern/trap.c) initializes the TSS and TSS descriptor for the BSP. It worked in Lab 3, but is incorrect when running on other CPUs. Change the code so that it can work on all CPUs. (Note: your new code should not use the global ts variable any more.)

TSS和TSS descriptor

为了解决这个问题,我们首先要搞清楚任务状态段(Task State Segment, TSS)和TSS描述符(TSS descriptor),关于这些内容请参考关于进程的讲解。练习四要求我们修改trap_init_percpu,从而为每个CPU初始化TSS和TSSD。

原始版本的trap_init_percpu

void
trap_init_percpu(void)
{
ts.ts_esp0 = KSTACKTOP;
ts.ts_ss0 = GD_KD;
ts.ts_iomb = sizeof(struct Taskstate);

gdt[GD_TSS0 >> 3] = SEG16(STS_T32A, (uint32_t) (&ts),
sizeof(struct Taskstate) - 1, 0);
gdt[GD_TSS0 >> 3].sd_s = 0;

ltr(GD_TSS0);
lidt(&idt_pd);
}

这段代码配置了CPU0(Boot CPU)的TSS和TSSD,但是并不通用,我们要将其修改,令其可以自适应地配置每个CPU的TSS和TSSD。

修改内容

根据注释,我们需要修改的内容如下:

  • 首先使用thiscpu->cpu_ts即当前CPU的TSS段代替ts
  • 将TSS段中的esp0指向当前CPU内核栈栈顶(栈顶计算过程参考练习3)
  • 修改当前CPU的TSSD,即gdt[(GD_TSS0 >> 3) + cpunum()]
  • 修改当前CPU的TSS selector

前三个任务根据注释很好完成,主要注意当前CPU的TSS selector。根据注释,selector的低三位是特殊位(RPL和TI标志),保持为0;从第3位开始才是GDT索引字段,用来区分具体的CPU。因此每个CPU的TSS selector可以表示为:GD_TSS0 + (cpunum() << 3),相当于把CPU编号移入索引字段。

改进后的trap_init_percpu

根据上面的分析及注释,可以写出改进的trap_init_percpu如下:

// Initialize and load TSS, TSS descriptor and IDT for all CPUs
void
trap_init_percpu(void)
{
// - Calculate current CPU's kern stack top (esp0), thiscpu points to the current CPU's
// struct CpuInfo, and cpunum() returns current CPU's id
// - Use "thiscpu->cpu_ts" as the TSS for the current CPU,
// rather than the global "ts" variable;
uintptr_t cur_cpu_kstacktop = KSTACKTOP - cpunum() * (KSTKSIZE + KSTKGAP);
thiscpu->cpu_ts.ts_esp0 = cur_cpu_kstacktop;
thiscpu->cpu_ts.ts_ss0 = GD_KD;

// - Initialize cpu_ts.ts_iomb to prevent unauthorized environments
// from doing IO (0 is not the correct value!)
thiscpu->cpu_ts.ts_iomb = sizeof(struct Taskstate);

// Setup a TSS so that we get the right stack when we trap to the kernel.
// Initialize the TSS slot of the gdt. USE gdt[(GD_TSS0 >> 3) + i] for CPU i's TSS descriptor;
gdt[(GD_TSS0 >> 3) + cpunum()] = SEG16(STS_T32A, (uint32_t) (&(thiscpu->cpu_ts)),
sizeof(struct Taskstate) - 1, 0);
gdt[(GD_TSS0 >> 3) + cpunum()].sd_s = 0;

// ltr sets a 'busy' flag in the TSS selector, so if you
// accidentally load the same TSS on more than one CPU, you'll
// get a triple fault. If you set up an individual CPU's TSS
// wrong, you may not get a fault until you try to return from
// user space on that CPU.
// Load the TSS selector (like other segment selectors, the
// bottom three bits are special; we leave them 0)
ltr(GD_TSS0 + (cpunum() << 3));

// Load the IDT
lidt(&idt_pd);
}

完成上述内容后,以4核CPU在QEMU中启动JOS(make qemu CPUS=4),可以得到如下输出结果:

...
Physical memory: 66556K available, base = 640K, extended = 65532K
check_page_alloc() succeeded!
check_page() succeeded!
check_kern_pgdir() succeeded!
check_page_installed_pgdir() succeeded!
SMP: CPU 0 found 4 CPU(s)
enabled interrupts: 1 2
SMP: CPU 1 starting
SMP: CPU 2 starting
SMP: CPU 3 starting
[00000000] new env 00001000
EAX=00000000 EBX=00000000 ECX=ef804000 EDX=f022a094
... #中间省略一堆寄存器的信息
EFER=0000000000000000
Triple fault. Halting for inspection via QEMU monitor.

可以看到CPU都启动了,但新启动的进程遇到了一个Triple fault,原因见上一个实验。

我们现在的代码将在mp_main()初始化AP后陷入死循环。为了让AP能够继续运行,我们首先要解决多CPU同时运行内核代码时的竞态条件。最简单的解决方式是使用一个大的全局内核锁,当有进程进入内核态后,需要对内核进行上锁;从内核态返回用户态时,需要解锁。在这个模型中,用户态进程可以在任意CPU上并发执行,但是只有一个进程能够运行于内核态,任何一个其他的想要进入内核态的进程必须等待。

kern/spinlock.h declares the big kernel lock, namely kernel_lock. It also provides lock_kernel() and unlock_kernel(), shortcuts to acquire and release the lock. You should apply the big kernel lock at four locations:

  • In i386_init(), acquire the lock before the BSP wakes up the other CPUs. 避免其他CPU执行内核代码。
  • In mp_main(), acquire the lock after initializing the AP, and then call sched_yield() to start running environments on this AP. 在该CPU上执行用户态进程
  • In trap(), acquire the lock when trapped from user mode. To determine whether a trap happened in user mode or in kernel mode, check the low bits of the tf_cs. 从用户态进入内核态,占有内核锁。
  • In env_run(), release the lock right before switching to user mode. Do not do that too early or too late, otherwise you will experience races or deadlocks. 从内核态进入用户态,释放内核锁

从上面的过程中可以看出,我们有三处位置进行上锁,一处位置进行解锁。实际上,我们在所有可能发生内核态到用户态以及用户态到内核态转变的过程中,都对内核锁进行了操作。

Exercise 5. Apply the big kernel lock as described above, by calling lock_kernel() and unlock_kernel() at the proper locations.

在指定位置为内核上锁/解锁

根据上面的提示,我们需要在特定位置上对内核锁进行加锁与解锁的操作。代码并不难,就不赘述了。目前我们尚无法验证上锁是否正确,但是随着后面练习的推进,我们会验证锁的正确性

问题:使用一个大的内核锁保证了在一个时刻只能有一个CPU运行内核代码。为何我们还需要对每个CPU单独设置内核栈?Describe a scenario in which using a shared kernel stack will go wrong, even with the protection of the big kernel lock.

在从用户态陷入内核,到trap()成功获得内核锁之间的这段时间里,硬件和_alltraps已经把中断帧(关键寄存器的值)压入了该CPU的内核栈。如果所有CPU共享同一个内核栈,两个CPU同时陷入内核时,各自压入的中断帧就会相互覆盖,导致返回用户态时恢复错误的现场。因此即使有大内核锁,也必须为每个CPU设置独立的内核栈。

Challenge!

粗颗粒锁简单易用,但是限制了内核态的所有并发,效率较低。现代操作系统使用细颗粒锁来保护共享态中的不同的部分。细颗粒锁提高了并发性能,但是更困难,同时也更容易出错。请尝试设计合理的细颗粒锁,实现对于内核更加精细的控制。

提示:你可以考虑针对如下的共享部分设计细颗粒锁,并实现内核的访问控制:

  • The page allocator.
  • The console driver.
  • The scheduler.
  • The inter-process communication (IPC) state that you will implement in the part C.

定义细颗粒锁以及加锁/解锁函数

根据上面的提示,我们首先要针对共享部分,定义新的细颗粒锁。

Round-Robin 调度(轮转调度)

下一个任务是实施轮转调度,令内核轮转调度多个进程,调度过程如下:

  • 函数sched_yield() (在新的kern/sched.c中)选择一个新的要运行的进程。该函数在进程队列envs[]中环形顺序搜索, 搜索位置开始于前一个运行的进程之后(如果没有前一时刻的运行进程,那么就从进程队列起始位置搜索),并选择第一个状态为 ENV_RUNNABLE (see inc/env.h)的进程,并调用env_run()执行该进程
  • sched_yield() 永远不应于同一时刻在两个CPU上运行同一个进程。根据进程状态ENV_RUNNING,我们可以判定当前进程是否在某个CPU(或当前CPU)上运行.
  • 我们定义了一个新系统调用sys_yield(),用户可以通过该调用来调用内核sched_yield()函数,从而主动将CPU让给另一个进程。

Exercise 6. Implement round-robin scheduling in sched_yield() as described above. Don’t forget to modify syscall() to dispatch sys_yield().

mp_main中调用sched_yield(),修改kern/init.c 并创建三个以上的进程,这些进程都运行程序 user/yield.c。现在,运行make qemu CPUS=2,你应当能看到多个进程交互运行,每个执行了五次,然后退出

...
Hello, I am environment 00001000.
Hello, I am environment 00001001.
Hello, I am environment 00001002.
Back in environment 00001000, iteration 0.
Back in environment 00001001, iteration 0.
Back in environment 00001002, iteration 0.
Back in environment 00001000, iteration 1.
Back in environment 00001001, iteration 1.
Back in environment 00001002, iteration 1.
...

After the yield programs exit, there will be no runnable environment in the system, the scheduler should invoke the JOS kernel monitor. If any of this does not happen, then fix your code before proceeding.

框架编写

练习6主要是实现轮转调度,然后在合适的地方调用该函数,找到下一个要调度的进程。我们先完成框架的编写,在mp_main中调用sched_yield(),改进后的mp_main为:

// Setup code for APs
void
mp_main(void)
{
// We are in high EIP now, safe to switch to kern_pgdir
lcr3(PADDR(kern_pgdir));
cprintf("SMP: CPU %d starting\n", cpunum());

lapic_init(); // 初始化local apic
env_init_percpu(); // 为当前CPU加载GDT和段寄存器
trap_init_percpu(); // 初始化当前CPU的TSS和IDT
xchg(&thiscpu->cpu_status, CPU_STARTED); // tell boot_aps() we're up

// Now that we have finished some basic setup, call sched_yield()
// to start running processes on this CPU. But make sure that
// only one CPU can enter the scheduler at a time!
lock_kernel();
sched_yield();
}

然后修改kern/syscall.c,在syscall()中添加对sys_yield的响应。

轮转调度算法实现

为了实现轮转调度算法,我们需要解决如下问题:

  • 数组遍历:算法要求我们从当前进程之后开始,对可调度的进程进行环形搜索,因此需要从envs中间某个位置开始遍历。简洁的写法为:envs[(next_envid + i) % NENV]
  • 找不到合适的envs
    • 如果当前CPU的进程仍然在运行,可以选择当前进程继续执行
    • 如果当前CPU上的进程也结束了,也找不到合适的进程,那么就挂起CPU

据此可以写出轮转调度算法如下:

void
sched_yield(void)
{
struct Env *idle;

// Search through 'envs' for an ENV_RUNNABLE environment in
// circular fashion starting just after the env this CPU was
// last running. Switch to the first such environment found.
int next_envid = curenv ? ENVX(curenv->env_id) + 1 : 0;

for (int i = 0; i < NENV; i++) {
idle = &envs[(next_envid + i) % NENV];
if (idle->env_status == ENV_RUNNABLE){

// env_run 一旦成功就不会返回,因此无需修改进程 id
env_run(idle);
}
}

// If no envs are runnable, but the environment previously
// running on this CPU is still ENV_RUNNING, it's okay to
// choose that environment.
if (curenv && curenv->env_status == ENV_RUNNING){
env_run(curenv);
}

// Never choose an environment that's currently running on
// another CPU (env_status == ENV_RUNNING). If there are
// no runnable environments, simply drop through to the code
// below to halt the cpu.
sched_halt();
}

Question

  1. env_run()中,你调用了lcr3()实现页目录的切换,而在这条语句之后,代码都使用了变量e。在将进程页目录载入cr3寄存器后,地址映射发生了变化,为何还能通过e的虚拟地址找到e所在的物理地址呢
  2. Whenever the kernel switches from one environment to another, it must ensure the old environment’s registers are saved so they can be restored properly later. Why? Where does this happen?
  • 第一个问题:如果JOS能够正确运行,那说明e的虚拟地址实际上是没有变化的,也就是在内核页目录和用户页目录中,e的映射是相同的。实际也确实如此,因为用户页目录本身就从内核页目录中复制过来,只有一部分发生了变动,而e所在的区域envs并未发生改变,在每个进程页目录中都相同,所以映射依然有效。
  • 第二个问题:旧进程的寄存器需要保存,是为了保存现场,确保下一次调度该进程时,能够从上一次暂停的地方继续执行。保存发生在_alltraps中,即从用户态陷入内核态时,寄存器被压入内核栈构成Trapframe。

Challenge! 为内核添加一个不那么普通的(trivial)调度策略,例如固定优先级策略, such as a fixed-priority scheduler that allows each environment to be assigned a priority and ensures that higher-priority environments are always chosen in preference to lower-priority environments. If you’re feeling really adventurous, try implementing a Unix-style adjustable-priority scheduler or even a lottery or stride scheduler. (Look up “lottery scheduling” and “stride scheduling” in Google.)

Write a test program or two that verifies that your scheduling algorithm is working correctly (i.e., the right environments get run in the right order). It may be easier to write these test programs once you have implemented fork() and IPC in parts B and C of this lab.

暂时未解决,先了解一下静态优先级调度(花半天时间看看)

Challenge! The JOS kernel currently does not allow applications to use the x86 processor’s x87 floating-point unit (FPU), MMX instructions, or Streaming SIMD Extensions (SSE). Extend the Env structure to provide a save area for the processor’s floating point state, and extend the context switching code to save and restore this state properly when switching from one environment to another. The FXSAVE and FXRSTOR instructions may be useful, but note that these are not in the old i386 user’s manual because they were introduced in more recent processors. Write a user-level test program that does something cool with floating-point.

未解决

用于进程创建的系统调用

目前JOS只能运行内核初始化时设置的进程,你需要编写对应的系统调用,使得用户进程能够创建并开始其他的用户进程。

Unix provides the fork() system call as its process creation primitive. Unix fork() copies the entire address space of calling process (the parent) to create a new process (the child). The only differences between the two observable from user space are their process IDs and parent process IDs (as returned by getpid and getppid). 在父进程中,fork()返回子进程ID,而在子进程中 fork()返回0。默认地,每个进程拥有其私有的地址空间,并对对方不可见。

你需要提供一组用于创建新用户进程的系统调用。利用这些系统调用,你能够在用户态执行fork,新的系统调用如下:

  • sys_exofork:

    这个系统调用创建了一个几乎空白状态的进程:用户地址空间尚未被映射,同时这个进程也不能运行。新的进程和父进程拥有相同的寄存器状态。在父进程中,sys_exofork返回新创建进程的envid_t(如果进程分配错误,返回负错误码)。在子进程中,返回0(由于子进程被标注为不能运行态,因此sys_exofork不会真的返回子进程,直到父进程使用sys_env_set_status显式修改子进程状态为runnable)

  • sys_env_set_status:

将特定进程的状态设置为ENV_RUNNABLE或ENV_NOT_RUNNABLE。当一个新进程的地址空间和寄存器状态初始化完成后,这个系统调用用来将新进程标记为可运行的。

  • sys_page_alloc:

    分配一个页的物理内存,并将其映射到给定进程地址空间的虚拟地址上。

  • sys_page_map:

    将一个进程的地址空间映射关系拷贝到另一个进程,新旧进程的虚拟地址将指向相同的物理内存(共享内存)。

  • sys_page_unmap:

    解除给定进程的一个给定虚拟地址的映射关系。

对于上面所有的接收一个进程ID的系统调用,JOS使用0表示当前进程。这个转换由函数envid2env()(in kern/env.c)实现。

We have provided a very primitive implementation of a Unix-like fork() in the test program user/dumbfork.c. This test program uses the above system calls to create and run a child environment with a copy of its own address space. The two environments then switch back and forth using sys_yield as in the previous exercise. The parent exits after 10 iterations, whereas the child exits after 20.

Exercise 7. 实现上述系统调用,并确保syscall调用他们。你需要使用kern/pmap.ckern/env.c中的几个函数,特别是envid2env(). For now, whenever you call envid2env(), pass 1 in the checkperm parameter. Be sure you check for any invalid system call arguments, returning -E_INVAL in that case. Test your JOS kernel with user/dumbfork and make sure it works before proceeding.

代码中已经给出了系统调用函数的框架,我们只需要在kern/syscall.c中实现对应的系统调用即可。

sys_exofork实现

函数设计

在该函数中,我们主要需要完成下面两个任务:

  • 使用env_alloc创建一个新的进程
  • 对新的进程进行初始化,主要是对tf进行拷贝,同时将tf_regsreg_eax字段设置为0,保证系统调用返回值为0

调试

一开始我没有理解这个函数的功能:创建进程后,直接调用env_run运行了新进程。这样做的后果是,进程A调用sys_exofork创建新进程B后,立刻开始运行进程B,即由进程A进入系统调用,返回时却已是进程B,这显然是有问题的。正确的做法是将新进程的状态设为ENV_NOT_RUNNABLE,并在之后的某个时机(由父进程调用sys_env_set_status)将其修改为ENV_RUNNABLE,令其可以被sched_yield调度。

函数实现

// Allocate a new environment.
// Returns envid of new environment,
static envid_t
sys_exofork(void)
{
// Create the new environment with env_alloc(), from kern/env.c.
int r;
struct Env* e;
r = env_alloc(&e, curenv->env_id);

// Return < 0 on error. Errors are:
// -E_NO_FREE_ENV if no free environment is available.
// -E_NO_MEM on memory exhaustion.
if(r < 0){
return r;
}

e->env_tf = curenv->env_tf; // 将父进程的寄存器值拷贝给子进程
e->env_tf.tf_regs.reg_eax = 0; // 保存子进程系统调用后的返回值。
e->env_status = ENV_NOT_RUNNABLE;
// It should be left as env_alloc created it, except that
// status is set to ENV_NOT_RUNNABLE, and the register set is copied
// from the current environment -- but tweaked so sys_exofork
// will appear to return 0.

return e->env_id;
}

函数解释

这个函数里比较难的一点是,子进程和父进程会得到不同的返回值,这是依靠修改子进程Trapframe中eax寄存器(存放系统调用返回值)实现的。父进程则正常返回子进程的id。

剩余几个系统调用的实现

剩下的几个系统调用都不难,主要是对输入参数有效性进行了一些判断,这里直接给出源码。

函数实现

sys_env_set_status的实现如下:

// Set envid's env_status to status, which must be ENV_RUNNABLE
// or ENV_NOT_RUNNABLE.
// Return 0 for success and < 0 on error. Errors are:
// -E_BAD_ENV if environment envid doesn't currently exist,
// or the caller doesn't have permission to change envid.
// -E_INVAL if status is not a valid status for an environment.

static int
sys_env_set_status(envid_t envid, int status)
{
// Hint: Use the 'envid2env' function from kern/env.c to translate an
// envid to a struct Env.
// You should set envid2env's third argument to 1, which will
// check whether the current environment has permission to set
// envid's status.
if(status != ENV_RUNNABLE && status != ENV_NOT_RUNNABLE){
return -E_INVAL;
}

struct Env* e;
int r;
r = envid2env(envid, &e, 1); // get env from envid

if(r < 0){
return r;
}

e->env_status = status;

// Returns 0 on success.
return 0;
}

sys_page_alloc实现如下:

// Allocate a page of memory and map it at 'va' with permission
// 'perm' in the address space of 'envid'.
// The page's contents are set to 0.
// If a page is already mapped at 'va', that page is unmapped as a
// side effect.
//
// perm -- PTE_U | PTE_P must be set, PTE_AVAIL | PTE_W may or may not be set,
// but no other bits may be set. See PTE_SYSCALL in inc/mmu.h.
//
// Return 0 on success, < 0 on error. Errors are:
// -E_BAD_ENV if environment envid doesn't currently exist,
// or the caller doesn't have permission to change envid.
// -E_INVAL if va >= UTOP, or va is not page-aligned.
// -E_INVAL if perm is inappropriate (see above).
// -E_NO_MEM if there's no memory to allocate the new page,
// or to allocate any necessary page tables.
static int
sys_page_alloc(envid_t envid, void *va, int perm)
{
// Hint: This function is a wrapper around page_alloc() and
// page_insert() from kern/pmap.c.
// Most of the new code you write should be to check the
// parameters for correctness.
// If page_insert() fails, remember to free the page you
// allocated!
struct Env* e;
int r;
r = envid2env(envid, &e, 1);

if(r < 0){
cprintf("sys_page_alloc: wrong envid %d!\n", envid);
return r;
}

if((uint32_t)va >= UTOP || (uint32_t)va != ROUNDUP((uint32_t)va, PGSIZE)){
cprintf("sys_page_alloc: wrong virtual address!\n");
return -E_INVAL;
}

if( (perm & (PTE_U | PTE_P)) != (PTE_U | PTE_P) || (perm & ~PTE_SYSCALL) ){
cprintf("sys_page_alloc: wrong perm!\n");
return -E_INVAL;
}

struct PageInfo* pp = page_alloc(1);
if(pp == NULL){
return -E_NO_MEM;
}

r = page_insert(e->env_pgdir, pp, va, perm);
if(r < 0){
page_free(pp);
return r;
}

return 0;
}

sys_page_map实现如下:

// Map the page of memory at 'srcva' in srcenvid's address space
// at 'dstva' in dstenvid's address space with permission 'perm'.
// Perm has the same restrictions as in sys_page_alloc, except
// that it also must not grant write access to a read-only
// page.
//
// Return 0 on success, < 0 on error. Errors are:
// -E_BAD_ENV if srcenvid and/or dstenvid doesn't currently exist,
// or the caller doesn't have permission to change one of them.
// -E_INVAL if srcva >= UTOP or srcva is not page-aligned,
// or dstva >= UTOP or dstva is not page-aligned.
// -E_INVAL is srcva is not mapped in srcenvid's address space.
// -E_INVAL if perm is inappropriate (see sys_page_alloc).
// -E_INVAL if (perm & PTE_W), but srcva is read-only in srcenvid's
// address space.
// -E_NO_MEM if there's no memory to allocate any necessary page tables.
static int
sys_page_map(envid_t srcenvid, void *srcva,
envid_t dstenvid, void *dstva, int perm)
{
// Hint: This function is a wrapper around page_lookup() and
// page_insert() from kern/pmap.c.
// Again, most of the new code you write should be to check the
// parameters for correctness.
// Use the third argument to page_lookup() to
// check the current permissions on the page.
struct Env *srce, *dste;
if(envid2env(srcenvid, &srce, 1) < 0 || envid2env(dstenvid, &dste, 1) < 0){
return -E_BAD_ENV;
}

if((uint32_t)srcva >= UTOP || (uint32_t)srcva != ROUNDUP((uint32_t)srcva, PGSIZE) ||\
(uint32_t)dstva >= UTOP || (uint32_t)dstva != ROUNDUP((uint32_t)dstva, PGSIZE)){
cprintf("sys_page_map: wrong virtual address!\n");
return -E_INVAL;
}

pte_t *pte_store;
struct PageInfo* pp = page_lookup(srce->env_pgdir, srcva, &pte_store);
if(pp == NULL){
cprintf("sys_page_map: srcva not mapped!\n");
return -E_INVAL;
}

if( (perm & PTE_W) && !(*pte_store & PTE_W )){
cprintf("sys_page_map: srcva is read only, but perm want write!\n");
return -E_INVAL;
}

if( (perm & (PTE_U | PTE_P)) != (PTE_U | PTE_P) || (perm & ~PTE_SYSCALL) ){
cprintf("sys_page_map: wrong perm!\n");
return -E_INVAL;
}

if(page_insert(dste->env_pgdir, pp, dstva, perm) < 0){
cprintf("sys_page_map: no memory!\n");
return -E_NO_MEM;
}

return 0;
}

The implementation of sys_page_unmap is as follows:

// Unmap the page of memory at 'va' in the address space of 'envid'.
// If no page is mapped, the function silently succeeds.
//
// Return 0 on success, < 0 on error. Errors are:
// -E_BAD_ENV if environment envid doesn't currently exist,
// or the caller doesn't have permission to change envid.
// -E_INVAL if va >= UTOP, or va is not page-aligned.
static int
sys_page_unmap(envid_t envid, void *va)
{
	struct Env *e;
	if (envid2env(envid, &e, 1) != 0)
		return -E_BAD_ENV;

	if ((uint32_t)va >= UTOP || (uint32_t)va % PGSIZE != 0) {
		cprintf("sys_page_unmap: bad virtual address!\n");
		return -E_INVAL;
	}

	// Hint: This function is a wrapper around page_remove().
	page_remove(e->env_pgdir, va);
	return 0;
}

Challenge! Add the additional system calls necessary to read all of the vital state of an existing environment as well as set it up. Then implement a user mode program that forks off a child environment, runs it for a while (e.g., a few iterations of sys_yield()), then takes a complete snapshot or checkpoint of the child environment, runs the child for a while longer, and finally restores the child environment to the state it was in at the checkpoint and continues it from there. Thus, you are effectively “replaying” the execution of the child environment from an intermediate state. Make the child environment perform some interaction with the user using sys_cgetc() or readline() so that the user can view and mutate its internal state, and verify that with your checkpoint/restart you can give the child environment a case of selective amnesia, making it “forget” everything that happened beyond a certain point.

This completes Part A. Run make grade and make sure you pass all of the Part A tests. If you are trying to figure out why a particular test case is failing, run ./grade-lab4 -v, which will show you the output of the kernel builds and QEMU runs for each test, until a test fails. When a test fails, the script will stop, and then you can inspect jos.out to see what the kernel actually printed.

Part B: Copy-on-Write Fork

Unix provides the fork() system call as its primary process creation primitive. The fork() system call copies the address space of the calling process (the parent) into the new process (the child). xv6 Unix implements fork() by copying all of the parent's pages into the child, which is essentially the same approach that dumbfork() takes. The copying of the parent's address space into the child is the most expensive part of the fork() operation.

However, a call to fork() is frequently followed almost immediately by a call to exec() in the child process, which replaces the child’s memory with a new program. This is what the shell typically does, for example. In this case, the time spent copying the parent’s address space is largely wasted, because the child process will use very little of its memory before calling exec().

Later versions of Unix let the parent and child share the memory mappings until one of the processes actually modifies the address space. This technique is known as copy-on-write. To implement it, on fork() the kernel copies the address space mappings from the parent to the child instead of the mapped page contents, and at the same time marks the now-shared pages read-only. When one of the two processes tries to write to one of these shared pages, the process takes a page fault. At this point, the kernel allocates the faulting process a new, private, writable copy of the page. Thus, a page is not actually copied until its contents are written. This optimization makes a fork() followed by an exec() in the child much cheaper: the child will probably only need to copy one page (the current page of its stack) before it calls exec().

In the next part of this lab, you will implement a "proper" Unix-like fork() with copy-on-write, as a user-space library routine. Implementing fork() and copy-on-write support in user space has the benefit that the kernel remains much simpler and thus more likely to be correct. It also lets individual user-mode programs define their own semantics for fork(). A program that wants a slightly different implementation (for example, the expensive always-copy version like dumbfork(), or one in which the parent and child actually share memory afterward) can easily provide its own.

User-Level Page Fault Handling

A user-level copy-on-write fork() needs to know about page faults on write-protected pages. Copy-on-write is only one of many possible uses of user-level page fault handling.

It’s common to set up an address space so that page faults indicate when some action needs to take place. For example, most Unix kernels initially map only a single page in a new process’s stack region, and allocate and map additional stack pages later “on demand” as the process’s stack consumption increases and causes page faults on stack addresses that are not yet mapped. A typical Unix kernel must keep track of what action to take when a page fault occurs in each region of a process’s space. For example, a fault in the stack region will typically allocate and map a new page of physical memory. A fault in the program’s BSS region will typically allocate a new page, fill it with zeroes, and map it. In systems with demand-paged executables, a fault in the text region will read the corresponding page of the binary off of disk and then map it.

This is a lot of information for the kernel to keep track of. Instead of taking the traditional Unix approach, you will decide what to do about each page fault in user space, where bugs are less damaging. This design has the added benefit of allowing programs great flexibility in defining their memory regions; you’ll use user-level page fault handling later for mapping and accessing files on a disk-based file system.

Setting the Page Fault Handler

To handle its own page faults, a user environment needs to register a page fault handler with the JOS kernel. The environment sets its page fault entrypoint with the sys_env_set_pgfault_upcall system call. We add a new member, env_pgfault_upcall, to struct Env to record this information.

Exercise 8. Implement the sys_env_set_pgfault_upcall system call. Be sure to enable permission checking when looking up the environment ID of the target environment, since this is a “dangerous” system call.

Implementation

This function is straightforward: it simply sets the environment's page fault handler to the given function entrypoint:

// Set the page fault upcall for 'envid' by modifying the corresponding struct
// Env's 'env_pgfault_upcall' field. When 'envid' causes a page fault, the
// kernel will push a fault record onto the exception stack, then branch to
// 'func'.
//
// Returns 0 on success, < 0 on error. Errors are:
// -E_BAD_ENV if environment envid doesn't currently exist,
// or the caller doesn't have permission to change envid.
static int
sys_env_set_pgfault_upcall(envid_t envid, void *func)
{
	struct Env *e;

	if (envid2env(envid, &e, 1) < 0)
		return -E_BAD_ENV;

	e->env_pgfault_upcall = func;	// Set page fault entrypoint for e
	return 0;
}

Normal and Exception Stacks in User Environments

During normal execution, a user environment runs on the regular user stack: its ESP register starts out pointing at USTACKTOP, and the stack data it pushes resides on the page in the range [USTACKTOP-PGSIZE, USTACKTOP-1]. When a page fault occurs in user mode, however, the kernel will restart the environment running a designated user-level page fault handler on a different stack, the user exception stack. We will make the JOS kernel perform this stack switch automatically on behalf of the user environment, much as the x86 hardware switches from the user stack to the kernel stack on behalf of JOS when transferring from user mode to kernel mode.

The user exception stack is also one page in size; its top is at virtual address UXSTACKTOP, so its valid range is [UXSTACKTOP-PGSIZE, UXSTACKTOP-1]. While running on this stack, the user-level page fault handler can use JOS's regular system calls to map new pages or adjust mappings so as to fix whatever problem caused the page fault. The handler then returns, via an assembly-language stub, to the faulting code on the original stack.

Each user environment that wants to support user-level page fault handling needs to allocate memory for its own exception stack, using sys_page_alloc().

Invoking the User Page Fault Handler

You now need to change the page fault handling code in kern/trap.c to handle page faults from user mode. We will call the state of the user environment at the time of the fault the trap-time state.

If no page fault handler is registered, the kernel destroys the user environment. Otherwise, the kernel sets up a trap frame on the exception stack that looks like a struct UTrapframe from inc/trap.h:

                    <-- UXSTACKTOP
trap-time esp
trap-time eflags
trap-time eip
trap-time eax start of struct PushRegs
trap-time ecx
trap-time edx
trap-time ebx
trap-time esp
trap-time ebp
trap-time esi
trap-time edi end of struct PushRegs
tf_err (error code)
fault_va <-- %esp when handler is run

The kernel then arranges for the user environment to resume execution with the page fault handler running on the exception stack with this stack frame; you must figure out how to make this happen. fault_va is the virtual address that caused the page fault.

If the environment is already running on the user exception stack when the exception occurs, then the page fault handler itself has faulted (a nested fault). In this case, you should start the new stack frame just under the current tf->tf_esp rather than at UXSTACKTOP: first push an empty 32-bit word, then a struct UTrapframe. To test whether tf->tf_esp is already on the user exception stack, check whether it lies in the range between UXSTACKTOP-PGSIZE and UXSTACKTOP-1, inclusive.

Exercise 9. Implement the code in page_fault_handler in kern/trap.c required to dispatch page faults to the user-mode handler. Be sure to take appropriate precautions when writing into the exception stack. (What happens if the user environment runs out of space on the exception stack?)

Purpose

page_fault_handler's job, once the fault has trapped into the kernel, is to adjust the trapframe so that the current environment resumes execution on the exception stack, running the registered page fault handler.

Design

First, check whether a page fault upcall has been registered; if not, destroy the environment. Next, choose where the UTrapframe goes based on the trap-time stack pointer, reserve space for it, and verify that the environment may write to that memory. Then fill in the UTrapframe, saving the key fields of the trap-time tf. Finally, point tf's esp and eip at the exception stack and the upcall, and resume the environment so that it runs the handler on the exception stack.

Implementation

/* 
* page_fault (interrupt 14) handler, Handles the page fault
*
* This handler function is invoked by the interrupt 14 to
* deal with page fault in both the kernel mode and user mode
*
* INPUT: - tf trapframe of current env
*
* OUTPUT: void
*
* WARNINGS: none
*
* HISTORY:
* - 2021-7-2 created by Shiqi Duan (sqduan@mail.ustc.edu.cn)
*/
void
page_fault_handler(struct Trapframe *tf)
{
uint32_t fault_va;

// Read processor's CR2 register to find the faulting address
fault_va = rcr2();

// Handle kernel-mode page faults.
if(tf->tf_cs == GD_KT){
panic("Kernel mode page faults\n");
}
// We've already handled kernel-mode exceptions, so if we get here,
// the page fault happened in user mode.

// If the env_pgfault_upcall function ptr isn't registered, simply
// destroy the environment that caused the fault.
if(!curenv->env_pgfault_upcall){
cprintf("[%08x] user fault va %08x ip %08x\n",
curenv->env_id, fault_va, tf->tf_eip);
print_trapframe(tf);
env_destroy(curenv);
}

struct UTrapframe *utf;
uint32_t utf_size = sizeof(struct UTrapframe);
// Set up a page fault stack frame on the user exception stack
// which range is [UXSTACKTOP-PGSIZE, UXSTACKTOP). Remember to
// check the validation of the memory.
if( tf->tf_esp >= UXSTACKTOP - PGSIZE && tf->tf_esp < UXSTACKTOP ){
// Current stack is already the exception stack, create a UTrapframe
// under the current esp. Remember to first push an empty 32-bit word.

// It is convenient for our code which returns from a page fault
// (lib/pfentry.S) to have one word of scratch space at the top of the
// trap-time stack; it allows us to more easily restore the eip/esp. In
// the non-recursive case, we don't have to worry about this because
// the top of the regular user stack is free. In the recursive case,
// this means we have to leave an extra word between the current top of
// the exception stack and the new stack frame because the exception
// stack _is_ the trap-time stack.
utf = (struct UTrapframe*)( tf->tf_esp - sizeof(uint32_t) - utf_size);
}
else{
// Current stack is in user normal stack, create a UTrapframe
// under UXSTACKTOP.
utf = (struct UTrapframe*)(UXSTACKTOP - utf_size);
}
// user_mem_assert destroys the environment if the check fails;
// user_mem_check alone would silently ignore the failure.
user_mem_assert(curenv, (void*)utf, utf_size, PTE_U | PTE_W);

// Now we set the content of the utf, the utf here is actually
// saving the tf, since we need to switch the context.
// When return from the upcall, we will restore the context to the original tf
utf->utf_fault_va = fault_va;
utf->utf_err = tf->tf_err;
utf->utf_regs = tf->tf_regs;
utf->utf_eip = tf->tf_eip;
utf->utf_eflags = tf->tf_eflags;
utf->utf_esp = tf->tf_esp;

// replace tf's eip with the upcall, and the stack is now the ustack
tf->tf_esp = (uintptr_t)utf;
tf->tf_eip = (uintptr_t)curenv->env_pgfault_upcall;

// release the env, when return from kernel to user, it will use the utf as
// its tf, and run the env_pgfault_upcall.
env_run(curenv);
}

From this code we can draw the layout of, and the relationships among, the kernel stack, the regular user stack, and the user exception stack during page fault handling:

(figure: kernel stack, user stack, and user exception stack during page fault handling)

As the figure shows, handling nested UTrapframes is really just function recursion. Keep this firmly in mind, because Exercise 10 below depends on it.

User-Mode Page Fault Entrypoint

Next, you need to implement the assembly routine that calls the C page fault handler and then resumes execution at the original faulting instruction. This assembly routine is what gets registered with the kernel via sys_env_set_pgfault_upcall().

Exercise 10. Implement the _pgfault_upcall routine in lib/pfentry.S. The interesting part is returning to the original point in the user code that caused the page fault. You’ll return directly there, without going back through the kernel. The hard part is simultaneously switching stacks and re-loading the EIP.

Analysis

Exercise 10 may be the hardest one in this lab, and perhaps in the whole course. To get it right you need to understand:

  • how call and ret work in assembly, in particular how each changes the stack
  • the layouts of the kernel stack, the user stack, and the user exception stack
  • careful, word-level manipulation of memory

I couldn't crack it on my own; I only understood it after puzzling over other people's code for quite a while.

Design

Following the comments in the skeleton, the flow of this routine is: first call _pgfault_handler; once the fault has been handled, pop the exception stack frame and return to the trap-time state.

Implementation
// Page fault upcall entrypoint.

// This is where we ask the kernel to redirect us to whenever we cause
// a page fault in user space (see the call to sys_set_pgfault_handler
// in pgfault.c).
//
// When a page fault actually occurs, the kernel switches our ESP to
// point to the user exception stack if we're not already on the user
// exception stack, and then it pushes a UTrapframe onto our user
// exception stack:
//
// trap-time esp
// trap-time eflags
// trap-time eip
// utf_regs.reg_eax
// ...
// utf_regs.reg_esi
// utf_regs.reg_edi
// utf_err (error code)
// utf_fault_va <-- %esp
//
// If this is a recursive fault, the kernel will reserve for us a
// blank word above the trap-time esp for scratch work when we unwind
// the recursive call.
//
// We then have call up to the appropriate page fault handler in C
// code, pointed to by the global variable '_pgfault_handler'.

.text
.globl _pgfault_upcall
_pgfault_upcall:
// Call the C page fault handler.
pushl %esp // function argument: pointer to UTF
movl _pgfault_handler, %eax
call *%eax
addl $4, %esp // pop function argument
// After this instruction, %esp will
// points to the last utf

// Now the C page fault handler has returned and you must return
// to the trap time state.
// Push trap-time %eip onto the trap-time stack.
// Restore tf from utf

//
// Explanation:
// We must prepare the trap-time stack for our eventual return to
// re-execute the instruction that faulted.
// Unfortunately, we can't return directly from the exception stack:
// We can't call 'jmp', since that requires that we load the address
// into a register, and all registers must have their trap-time
// values after the return.
// We can't call 'ret' from the exception stack either, since if we
// did, %esp would have the wrong value.
// So instead, we push the trap-time %eip onto the *trap-time* stack!
// Below we'll switch to that stack and call 'ret', which will
// restore %eip to its pre-fault value.
//
// In the case of a recursive fault on the exception stack,
// note that the word we're pushing now will fit in the
// blank word that the kernel reserved for us.
//
// Throughout the remaining code, think carefully about what
// registers are available for intermediate calculations. You
// may find that you have to rearrange your code in non-obvious
// ways as registers become unavailable as scratch space.
//
// LAB 4: Your code here.

movl 40(%esp), %ebx  // %ebx = trap-time eip
movl %esp, %eax      // save the pointer to the UTrapframe
movl 48(%esp), %esp  // switch to the trap-time stack
pushl %ebx           // push trap-time eip onto the trap-time stack

movl %eax, %esp      // switch back to the exception stack
subl $4, 48(%esp)    // account for the pushed eip in the saved esp

// Restore the trap-time registers. After you do this, you
// can no longer modify any general-purpose registers.
// LAB 4: Your code here.
addl $8, %esp
popal

// Restore eflags from the stack. After you do this, you can
// no longer use arithmetic operations or anything else that
// modifies eflags.
// LAB 4: Your code here.
addl $4, %esp
popfl

// Switch back to the adjusted trap-time stack.
// LAB 4: Your code here.
popl %esp

// Return to re-execute the instruction that faulted.
// LAB 4: Your code here.
ret

Finally, you need to implement the C user library side of the user-level page fault handling mechanism, i.e. Exercise 11.

Exercise 11. Finish set_pgfault_handler() in lib/pgfault.c.

Purpose

This function registers the user-specified page fault handler and tells the kernel to invoke the assembly-language page fault upcall, _pgfault_upcall, whenever a page fault occurs.

Design

:warning: Note: user mode must use system calls to interact with the kernel!

Implementation

From the function's purpose and the surrounding comments, the implementation is:

//
// Set the page fault handler function.
// If there isn't one yet, _pgfault_handler will be 0.
// The first time we register a handler, we need to
// allocate an exception stack (one page of memory with its top
// at UXSTACKTOP), and tell the kernel to call the assembly-language
// _pgfault_upcall routine when a page fault occurs.
//
void
set_pgfault_handler(void (*handler)(struct UTrapframe *utf))
{
int r;

	if (_pgfault_handler == 0) {
		// First time through!
		// LAB 4: Your code here.
		// Allocate an exception stack and check for failure.
		r = sys_page_alloc(thisenv->env_id, (void*)(UXSTACKTOP-PGSIZE),
				   PTE_U | PTE_P | PTE_W);
		if (r < 0)
			panic("set_pgfault_handler: %e", r);
	}

	// Save handler pointer for assembly to call.
	_pgfault_handler = handler;
	if ((r = sys_env_set_pgfault_upcall(thisenv->env_id, _pgfault_upcall)) < 0)
		panic("set_pgfault_handler: %e", r);
}

We used two system calls: one to allocate a page for the exception stack, and one to set the pgfault upcall. Remember that once you are done, the new system call must also be dispatched in the kernel:

int32_t
syscall(uint32_t syscallno, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
{
	// LAB 3: Your code here.
	int r = 0;
	switch (syscallno) {
	case SYS_cgetc:
		r = sys_cgetc();
		break;
	...
	case SYS_env_set_pgfault_upcall:
		r = sys_env_set_pgfault_upcall((envid_t)a1, (void*)a2);
		break;
	default:
		return -E_INVAL;
	}
	return r;
}

Testing

Run user/faultread (make run-faultread). You should see:

...
[00000000] new env 00001000
[00001000] user fault va 00000000 ip 0080003a
TRAP frame ...
[00001000] free env 00001000

The program above tries to read address 0x0, which faults; since no page fault handler is registered, the kernel prints user fault va 00000000 ip 0080003a and destroys the environment. Run user/faultdie. You should see:

...
[00000000] new env 00001000
i faulted at va deadbeef, err 6
[00001000] exiting gracefully
[00001000] free env 00001000

This program faults because it tries to write to address 0xdeadbeef; this time the registered handler lets it exit gracefully.

Run user/faultalloc. You should see:

...
[00000000] new env 00001000
fault deadbeef
this string was faulted in at deadbeef
fault cafebffe
fault cafec000 // a recursive fault: the string written at cafebffe spills into the next page
this string was faulted in at cafebffe
[00001000] exiting gracefully
[00001000] free env 00001000

If you see only the first “this string” line, it means you are not handling recursive page faults properly.

Run user/faultallocbad. You should see:

...
[00000000] new env 00001000
[00001000] user_mem_check assertion failure for va deadbeef
[00001000] free env 00001000

Make sure you understand why user/faultalloc and user/faultallocbad behave differently. In faultallocbad, the bad address is passed straight to the sys_cputs system call, so the kernel's user_mem_check detects it and destroys the environment before any page fault ever reaches the user-level handler; in faultalloc, the faulting access happens in user mode, so the user-level handler gets a chance to map the page.

Challenge! Extend your kernel so that not only page faults, but all types of processor exceptions that code running in user space can generate, can be redirected to a user-mode exception handler. Write user-mode test programs to exercise user-mode handling of various exceptions, such as divide-by-zero, general protection fault, and illegal opcode.

Implementing Copy-on-Write Fork

You now have the kernel facilities to implement copy-on-write fork() entirely in user space. We have provided a skeleton for fork() in lib/fork.c. Like dumbfork(), fork() should create a new environment, then scan through the parent environment's entire address space and set up corresponding page mappings in the child. The key difference is that, while dumbfork() copied pages, fork() will initially copy only page mappings. fork() copies a page only when one of the environments tries to write to it.

The basic control flow for fork() is as follows:

  • The parent installs pgfault() as the C-level page fault handler, using set_pgfault_handler().

  • The parent calls sys_exofork() to create a child environment.

  • For each writable or copy-on-write page in its address space below UTOP, the parent calls duppage, which maps the page copy-on-write into the address space of the child and then remaps its own copy copy-on-write as well. [Note the order here: map the page into the child as COW first, then remap the parent's page as COW. This is very important. Why? Think about what could go wrong if the order were reversed.] duppage sets both PTEs so that the page is not writable, and contains PTE_COW in the "avail" field to distinguish copy-on-write pages from genuine read-only pages.

    The exception stack is not remapped this way, however; you need to allocate a fresh page in the child for the exception stack. Since the page fault handler does the actual copying, and the page fault handler runs on the exception stack, the exception stack cannot be made copy-on-write: and besides, who would copy it?

    fork() also needs to handle pages that are present (PTE_P) but not writable or copy-on-write.

  • The parent sets the user page fault entrypoint for the child to look like its own.

  • The child is now ready to run, so the parent marks it runnable.

Each time one of the environments writes a copy-on-write page that it hasn't yet written, it will take a page fault. Here's the control flow for the user page fault handler:

  1. The kernel propagates the page fault to _pgfault_upcall, which calls fork()'s pgfault() handler.
  2. pgfault() checks that the fault is a write (check for FEC_WR in the error code) and that the PTE for the page is marked PTE_COW. If either check fails, panic.
  3. pgfault() allocates a new page mapped at a temporary location and copies the contents of the faulting page into it. Then the fault handler maps the new page at the appropriate address with read/write permissions, in place of the old read-only mapping.

The user-level lib/fork.c code must consult the environment’s page tables for several of the operations above (e.g., to check that the PTE for a page is marked PTE_COW). The kernel maps the environment’s page tables at UVPT exactly for this purpose. It uses a clever mapping trick to make it easy to look up PTEs for user code. lib/entry.S sets up uvpt and uvpd so that you can easily look up page-table information in lib/fork.c.

Exercise 12. Implement fork, duppage and pgfault in lib/fork.c.

Design

The complete flow of fork is as follows:

Pseudocode

The pseudocode for fork is:

Function Fork
    Set the page fault handler (set_pgfault_handler(pgfault))
    envid = sys_exofork()
    If envid == 0                      // child
        Fix thisenv; Return 0
    For each mapped page below UTOP, except the exception stack
        duppage(envid, page number)
    Allocate a fresh exception stack page for the child
    Set the child's page fault upcall to _pgfault_upcall
    Mark the child ENV_RUNNABLE
    Return envid
EndFunction

Test your implementation with the forktree program. It should produce the following messages, with interspersed ‘new env’, ‘free env’, and ‘exiting gracefully’ messages. The messages may not appear in this order, and the environment IDs may be different.

1000: I am ''
1001: I am '0'
2000: I am '00'
2001: I am '000'
1002: I am '1'
3000: I am '11'
3001: I am '10'
4000: I am '100'
1003: I am '01'
5000: I am '010'
4001: I am '011'
2002: I am '110'
1004: I am '001'
1005: I am '111'
1006: I am '101'

Challenge! Implement a shared-memory fork() called sfork(). This version should have the parent and child share all their memory pages (so writes in one environment appear in the other) except for pages in the stack area, which should be treated in the usual copy-on-write manner. Modify user/forktree.c to use sfork() instead of regular fork(). Also, once you have finished implementing IPC in part C, use your sfork() to run user/pingpongs. You will have to find a new way to provide the functionality of the global thisenv pointer.

Challenge! Your implementation of fork makes a huge number of system calls. On the x86, switching into the kernel using interrupts has non-trivial cost. Augment the system call interface so that it is possible to send a batch of system calls at once. Then change fork to use this interface.

How much faster is your new fork than the original?

You can answer this (roughly) by using analytical arguments to estimate how much of an improvement batching system calls will make to the performance of your fork: How expensive is an int 0x30 instruction? How many times do you execute int 0x30 in your fork? Is accessing the TSS stack switch also expensive? And so on…

Alternatively, you can boot your kernel on real hardware and really benchmark your code. See the RDTSC (read time-stamp counter) instruction, defined in the IA32 manual, which counts the number of clock cycles that have elapsed since the last processor reset. QEMU doesn’t emulate this instruction faithfully (it can either count the number of virtual instructions executed or use the host TSC, neither of which reflects the number of cycles a real CPU would require).

This ends Part B. Make sure you pass all of the Part B tests. As usual, you can hand in your submission with make handin.

Part C: Preemptive Multitasking and Inter-Process Communication (IPC)

In Part C you will modify the kernel to preempt uncooperative environments and to allow environments to pass messages to each other explicitly.

Clock Interrupts and Preemption

Run the user/spin test program. This test program forks off a child environment, which simply spins forever in a tight loop once it receives control of the CPU. Neither the parent environment nor the kernel ever regains the CPU. This is obviously not an ideal situation, because any user-mode environment can bring the whole system to a halt simply by getting into an infinite loop and never giving back the CPU. In order to allow the kernel to preempt a running environment, forcefully retaking control of the CPU from it, we must extend the JOS kernel to support external hardware interrupts from the clock hardware.

Interrupt Discipline

External interrupts (i.e., device interrupts) are referred to as IRQs. There are 16 possible IRQs, numbered 0 through 15. The mapping from IRQ number to IDT entry is not fixed. pic_init in picirq.c maps IRQs 0-15 to IDT entries IRQ_OFFSET through IRQ_OFFSET+15 (i.e., 32-47):

outb(IO_PIC1+1, IRQ_OFFSET);          // 0-7
outb(IO_PIC2+1, IRQ_OFFSET + 8); // 8-15

In inc/trap.h, IRQ_OFFSET is defined to be decimal 32. Thus IDT entries 32-47 correspond to IRQs 0-15. For example, the clock interrupt is IRQ 0, so IDT[IRQ_OFFSET+0] (that is, IDT[32]) contains the address of the clock's interrupt handler routine in the kernel. IRQ_OFFSET is chosen so that the device interrupts do not overlap with the processor exceptions.

In JOS we make a key simplification compared to xv6: external device interrupts are always disabled when in the kernel (and, like xv6, enabled in user mode). External interrupts are controlled by the FL_IF flag bit of the %eflags register; interrupts are enabled when FL_IF is 1. While the bit can be modified in several ways, because of our simplification we only need to save and restore it as we enter and leave user mode.

You have to ensure that the FL_IF flag is set in user environments when they run, so that when an interrupt arrives, it gets passed through to the processor and handled by your interrupt code. Otherwise, interrupts are masked, or ignored, until interrupts are re-enabled. We masked interrupts with the very first instruction of the bootloader, and so far we have never gotten around to re-enabling them.

Exercise 13. Modify kern/trapentry.S and kern/trap.c to initialize the appropriate entries in the IDT and provide handlers for IRQs 0 through 15. Then modify the code in env_alloc() in kern/env.c to ensure that user environments are always run with interrupts enabled.

Also uncomment the sti instruction in sched_halt() so that idle CPUs stop masking interrupts. Since in lab 3 I used a script, vectors.py, to auto-generate the interrupt entrypoints, here I need to modify vectors.py.

Modifying vectors.py

First modify vectors.py: the processor does not push an error code for vectors 32 through 47, so the generated stub must push a dummy 0 in its place to keep the Trapframe layout consistent:

.globl vector32
.type vector32,@function
.align 2
vector32:
	pushl $0	// dummy error code (the hardware pushes none for IRQs)
	pushl $32
	jmp _alltraps

Then set up the gate descriptors for the interrupts in trap_init() in kern/trap.c:

// Set gates for the IRQs
for (i = IRQ_OFFSET; i <= IRQ_OFFSET + 15; i++) {
	// SETGATE(idt[i], is_trap, selector, function, dpl)
	SETGATE(idt[i], 0, GD_KT, vectors[i], 0);
}

Modifying env_alloc

e->env_tf.tf_eflags |= FL_IF;

Note that the processor never pushes an error code when invoking a hardware interrupt handler. You might want to re-read section 9.2 of the 80386 Reference Manual, or section 5.8 of the IA-32 Intel Architecture Software Developer’s Manual, Volume 3, at this time.

After doing this exercise, if you run your kernel with any test program that runs for a non-trivial length of time (e.g., spin), you should see the kernel print trap frames for hardware interrupts. While interrupts are now enabled in the processor, JOS isn’t yet handling them, so you should see it misattribute each interrupt to the currently running user environment and destroy it. Eventually it should run out of environments to destroy and drop into the monitor.

Handling Clock Interrupts

In the user/spin program, after the child environment was first run, it just spun in a loop, and the kernel never got control back. We need to program the hardware to generate clock interrupts periodically, which will force control back to the kernel where we can switch control to a different user environment.

The calls to lapic_init and pic_init (from i386_init in init.c), which we have written for you, set up the clock and the interrupt controller to generate interrupts. You now need to write the code to handle these interrupts.

Exercise 14. Modify the kernel’s trap_dispatch() function so that it calls sched_yield() to find and run a different environment whenever a clock interrupt takes place.

You should now be able to get the user/spin test to work: the parent environment should fork off the child, sys_yield() to it a couple times but in each case regain control of the CPU after one time slice, and finally kill the child environment and terminate gracefully.

This is a great time to do some regression testing. Make sure that you haven’t broken any earlier part of that lab that used to work (e.g. forktree) by enabling interrupts. Also, try running with multiple CPUs using make CPUS=2 target. You should also be able to pass stresssched now. Run make grade to see for sure. You should now get a total score of 65/80 points on this lab.

Inter-Process communication (IPC)

(Technically in JOS this is “inter-environment communication” or “IEC”, but everyone else calls it IPC, so we’ll use the standard term.)

We’ve been focusing on the isolation aspects of the operating system, the ways it provides the illusion that each program has a machine all to itself. Another important service of an operating system is to allow programs to communicate with each other when they want to. It can be quite powerful to let programs interact with other programs. The Unix pipe model is the canonical example.

There are many models for interprocess communication. Even today there are still debates about which models are best. We won’t get into that debate. Instead, we’ll implement a simple IPC mechanism and then try it out.

IPC in JOS

You will implement a few additional JOS kernel system calls that collectively provide a simple interprocess communication mechanism. You will implement two system calls, sys_ipc_recv and sys_ipc_try_send. Then you will implement two library wrappers ipc_recv and ipc_send.

The “messages” that user environments can send to each other using JOS’s IPC mechanism consist of two components: a single 32-bit value, and optionally a single page mapping. Allowing environments to pass page mappings in messages provides an efficient way to transfer more data than will fit into a single 32-bit integer, and also allows environments to set up shared memory arrangements easily.

Sending and Receiving Messages

To receive a message, an environment calls sys_ipc_recv. This system call de-schedules the current environment and does not run it again until a message has been received. When an environment is waiting to receive a message, any other environment can send it a message - not just a particular environment, and not just environments that have a parent/child arrangement with the receiving environment. In other words, the permission checking that you implemented in Part A will not apply to IPC, because the IPC system calls are carefully designed so as to be “safe”: an environment cannot cause another environment to malfunction simply by sending it messages (unless the target environment is also buggy).

To try to send a value, an environment calls sys_ipc_try_send with both the receiver’s environment id and the value to be sent. If the named environment is actually receiving (it has called sys_ipc_recv and not gotten a value yet), then the send delivers the message and returns 0. Otherwise the send returns -E_IPC_NOT_RECV to indicate that the target environment is not currently expecting to receive a value.

A library function ipc_recv in user space will take care of calling sys_ipc_recv and then looking up the information about the received values in the current environment’s struct Env.

Similarly, a library function ipc_send will take care of repeatedly calling sys_ipc_try_send until the send succeeds.

Transferring Pages

When an environment calls sys_ipc_recv with a valid dstva parameter (below UTOP), the environment is stating that it is willing to receive a page mapping. If the sender sends a page, then that page should be mapped at dstva in the receiver’s address space. If the receiver already had a page mapped at dstva, then that previous page is unmapped.

When an environment calls sys_ipc_try_send with a valid srcva (below UTOP), it means the sender wants to send the page currently mapped at srcva to the receiver, with permissions perm. After a successful IPC, the sender keeps its original mapping for the page at srcva in its address space, but the receiver also obtains a mapping for this same physical page at the dstva originally specified by the receiver, in the receiver’s address space. As a result this page becomes shared between the sender and receiver.

If either the sender or the receiver does not indicate that a page should be transferred, then no page is transferred. After any IPC the kernel sets the new field env_ipc_perm in the receiver’s Env structure to the permissions of the page received, or zero if no page was received.

Implementing IPC

Exercise 15. Implement sys_ipc_recv and sys_ipc_try_send in kern/syscall.c. Read the comments on both before implementing them, since they have to work together. When you call envid2env in these routines, you should set the checkperm flag to 0, meaning that any environment is allowed to send IPC messages to any other environment, and the kernel does no special permission checking other than verifying that the target envid is valid.

Then implement the ipc_recv and ipc_send functions in lib/ipc.c.

Use the user/pingpong and user/primes functions to test your IPC mechanism. user/primes will generate for each prime number a new environment until JOS runs out of environments. You might find it interesting to read user/primes.c to see all the forking and IPC going on behind the scenes.

Challenge! Why does ipc_send have to loop? Change the system call interface so it doesn’t have to. Make sure you can handle multiple environments trying to send to one environment at the same time.

Challenge! The prime sieve is only one neat use of message passing between a large number of concurrent programs. Read C. A. R. Hoare, ``Communicating Sequential Processes,’’ Communications of the ACM 21(8) (August 1978), 666-677, and implement the matrix multiplication example.

Challenge! One of the most impressive examples of the power of message passing is Doug McIlroy’s power series calculator, described in M. Douglas McIlroy, ``Squinting at Power Series,’’ Software—Practice and Experience, 20(7) (July 1990), 661-683. Implement his power series calculator and compute the power series for sin(x+x^3).

Challenge! Make JOS’s IPC mechanism more efficient by applying some of the techniques from Liedtke’s paper, Improving IPC by Kernel Design, or any other tricks you may think of. Feel free to modify the kernel’s system call API for this purpose, as long as your code is backwards compatible with what our grading scripts expect.

This ends part C. Make sure you pass all of the make grade tests and don’t forget to write up your answers to the questions and a description of your challenge exercise solution in answers-lab4.txt.

Before handing in, use git status and git diff to examine your changes and don’t forget to git add answers-lab4.txt. When you’re ready, commit your changes with git commit -am ‘my solutions to lab 4’, then make handin and follow the directions.
