Cut, Yet Unbroken; Sorted, Yet Tangled
Introduction
In this lab you will implement preemptive multitasking among multiple simultaneously active user-mode environments.
In part A you will add multiprocessor support to JOS, implement round-robin scheduling, and add basic environment management system calls (calls that create and destroy environments, and allocate/map memory).
In part B you will implement a Unix-like `fork()`, which allows a user-mode environment to create copies of itself.
Finally, in part C you will add support for inter-process communication (IPC), allowing different user-mode environments to communicate and synchronize with each other explicitly. You will also add support for hardware clock interrupts and preemption.
Getting Started
```
athena% cd ~/6.828/lab
```
First merge your lab 3 solutions, then switch to the lab4 branch. The source files you will need to read or reference in lab 4 are listed below:
Directory | File | Purpose
---|---|---
`inc/` | `env.h` | Public definitions for user-mode environments
 | `trap.h` | Public definitions for trap handling
 | `syscall.h` | Public definitions for system calls from user environments to the kernel
 | `lib.h` | Public definitions for the user-mode support library
`kern/` | `env.h` | Kernel-private definitions for user-mode environments
 | `env.c` | Kernel code implementing user-mode environments
 | `trap.h` | Kernel-private trap handling definitions
 | `trap.c` | Trap handling code
 | `trapentry.S` | Assembly-language trap handler entry-points
 | `syscall.h` | Kernel-private definitions for system call handling
 | `syscall.c` | System call implementation code
`lib/` | `Makefrag` | Makefile fragment to build user-mode library, `obj/lib/libjos.a`
 | `entry.S` | Assembly-language entry-point for user environments
 | `libmain.c` | User-mode library setup code called from `entry.S`
 | `syscall.c` | User-mode system call stub functions
 | `console.c` | User-mode implementations of `putchar` and `getchar`, providing console I/O
 | `exit.c` | User-mode implementation of `exit`
 | `panic.c` | User-mode implementation of `panic`
`user/` | `*` | Various test programs to check kernel lab 3 code
Lab Requirements
This lab is divided into three parts, each of which is intended to take about a week. You must complete all of the regular exercises and at least one challenge problem.
Part A: Multiprocessor Support and Cooperative Multitasking
In the first part of this lab, you will extend JOS to run on a multiprocessor system and implement new system calls that allow user programs to create new user environments. You will also implement cooperative round-robin scheduling, under which the kernel switches from one environment to another. Later, in Part C, you will implement preemptive scheduling, which lets the kernel take back control of the CPU.
多处理器支持
We are going to make JOS support "symmetric multiprocessing" (SMP), a multiprocessor model in which all CPUs have equivalent access to system resources such as memory and I/O buses. While all CPUs are functionally identical in SMP, during the boot process they can be classified into two types: the bootstrap processor (BSP) is responsible for initializing the system and for booting the operating system; and the application processors (APs) are activated by the BSP only after the operating system is up and running. Which processor is the BSP is determined by the hardware and the BIOS. Up to this point, all your existing JOS code has been running on the BSP.
In an SMP system, each CPU has an accompanying local APIC (LAPIC) unit. The LAPIC units are responsible for delivering interrupts throughout the system. The LAPIC also provides its connected CPU with a unique identifier. In this lab, we make use of the following basic functionality of the LAPIC unit (in `kern/lapic.c`):
- Reading the LAPIC identifier (APIC ID) to tell which CPU our code is currently running on (see `cpunum()`).
- Sending the `STARTUP` interprocessor interrupt (IPI) from the BSP to the APs to bring up other CPUs (see `lapic_startap()`).
- In part C, we program the LAPIC's built-in timer to trigger clock interrupts to support preemptive multitasking (see `apic_init()`).
A processor accesses its LAPIC using memory-mapped I/O (MMIO). In MMIO, a portion of physical memory is hardwired to the registers of some I/O devices, so the same load/store instructions typically used to access memory can be used to access device registers. You've already seen one I/O hole at physical address `0xA0000` (we use this to write to the VGA display buffer). The LAPIC lives in a hole starting at physical address `0xFE000000` (32MB short of 4GB), so it's too high for us to access using our usual direct map at `KERNBASE`. The JOS virtual memory map leaves a 4MB gap at `MMIOBASE` so we have a place to map devices like this. Since later labs introduce more MMIO regions, you'll write a simple function to allocate space from this region and map device memory to it.
Exercise 1. Implement `mmio_map_region` in `kern/pmap.c`. To see how it is used, look at the beginning of `lapic_init` in `kern/lapic.c`. You will have to complete the next exercise, too, before `mmio_map_region` will work properly.
Function prototype
```c
void *
mmio_map_region(physaddr_t pa, size_t size)
```
Function behavior

This function reserves the next `size` bytes of the MMIO region, starting at `base` (which begins at `MMIOBASE`), and maps that range to the physical addresses `[pa, pa+size)`. The function is not hard to write; following the comments is enough.
Implementation
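A minimal sketch of a first version, assuming the usual JOS helpers `boot_map_region` and `ROUNDUP`; note that it never advances `base`, which is exactly the bug diagnosed in the next subsection:

```c
void *
mmio_map_region(physaddr_t pa, size_t size)
{
	// Where to start the next region; initially the base of MMIO.
	static uintptr_t base = MMIOBASE;

	// Round size up to a whole number of pages.
	size = ROUNDUP(size, PGSIZE);
	if (base + size > MMIOLIM)
		panic("mmio_map_region: reservation overflows MMIOLIM");

	// Map [pa, pa+size) at base, cache-disabled and write-through.
	boot_map_region(kern_pgdir, base, size, pa,
			PTE_PCD | PTE_PWT | PTE_W);

	// Bug: base is never advanced, so the next call hands out
	// the same virtual range again.
	return (void *) base;
}
```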
Debugging

The version above fails `check_kern_pgdir`. Note that `base` is declared `static`; according to the comments, every call to `mmio_map_region` must advance `base` past the region it just reserved. We therefore record the old value in `old_base`, set `base = base + size`, and return the old value:
```c
uintptr_t old_base = base;
base += size;
return (void *) old_base;
```
Application Processor Bootstrap
Before booting the APs, the BSP first collects information about the multiprocessor system, such as the total number of CPUs, their APIC IDs, and the MMIO address of the LAPIC unit. The `mp_init()` function in `kern/mpconfig.c` retrieves this information by reading the MP configuration table that resides in the BIOS region of memory.
The `boot_aps()` function (in `kern/init.c`) drives the AP bootstrap process. APs start in real mode, much like the bootloader started in `boot/boot.S`, so `boot_aps()` copies the AP entry code (`kern/mpentry.S`) to a memory location that is addressable in real mode. Unlike with the bootloader, we have some control over where the AP will start executing code; we copy the entry code to `0x7000` (`MPENTRY_PADDR`).
After that, `boot_aps()` activates the APs one after another by sending `STARTUP` IPIs to the LAPIC unit of the corresponding AP, along with an initial `CS:IP` address at which the AP should start running its entry code (`MPENTRY_PADDR` in our case). The entry code in `kern/mpentry.S` is quite similar to that of `boot/boot.S`. After some brief setup, it puts the AP into protected mode with paging enabled, and then calls the C setup routine `mp_main()` (also in `kern/init.c`). Before waking up the next AP, `boot_aps()` waits for the current AP to signal a `CPU_STARTED` flag in the `cpu_status` field of its `struct CpuInfo`.
Exercise 2. Read `boot_aps()` and `mp_main()` in `kern/init.c`, and the assembly code in `kern/mpentry.S`. Make sure you understand the control flow transfer during the bootstrap of APs. Then modify your implementation of `page_init()` in `kern/pmap.c` to avoid adding the page at `MPENTRY_PADDR` to the free list, so that we can safely copy and run AP bootstrap code at that physical address. Your code should pass the updated `check_page_free_list()` test (but might fail the updated `check_kern_pgdir()` test, which we will fix soon).
Reading `boot_aps`, `mp_main`, and `mpentry.S`

The source of `boot_aps` is as follows:
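(Reproduced here from the JOS lab 4 skeleton from memory; your tree may differ in minor details.)

```c
static void
boot_aps(void)
{
	extern unsigned char mpentry_start[], mpentry_end[];
	void *code;
	struct CpuInfo *c;

	// Write entry code to unused memory at MPENTRY_PADDR
	code = KADDR(MPENTRY_PADDR);
	memmove(code, mpentry_start, mpentry_end - mpentry_start);

	// Boot each AP one at a time
	for (c = cpus; c < cpus + ncpu; c++) {
		if (c == cpus + cpunum())  // We've started already.
			continue;

		// Tell mpentry.S what stack to use
		mpentry_kstack = percpu_kstacks[c - cpus] + KSTKSIZE;
		// Start the CPU at mpentry_start
		lapic_startap(c->cpu_id, PADDR(code));
		// Wait for the CPU to finish basic setup in mp_main()
		while (c->cpu_status != CPU_STARTED)
			;
	}
}
```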
`boot_aps` proceeds as follows: it first copies the entry code into place, then iterates over every CPU. If a CPU has already started (i.e., it is the CPU that `cpunum()` says we are currently running on), it is skipped; otherwise `boot_aps` assigns the AP its kernel stack (`KSTKSIZE` = 32768 bytes), starts the AP, and waits for it to finish its setup.
The source of `mp_main` is as follows:
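(Again reproduced from the skeleton from memory; check your own tree.)

```c
// Setup code for APs
void
mp_main(void)
{
	// We are in high EIP now, safe to switch to kern_pgdir
	lcr3(PADDR(kern_pgdir));
	cprintf("SMP: CPU %d starting\n", cpunum());

	lapic_init();
	env_init_percpu();
	trap_init_percpu();
	xchg(&thiscpu->cpu_status, CPU_STARTED); // tell boot_aps() we're up

	// Remove this after you finish Exercise 6
	for (;;);
}
```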
`mp_main` mainly performs per-CPU initialization; we will look at the details later. As for `mpentry.S`, its basic job is to set up a few registers and then call `mp_main`. The AP bootstrap can therefore be summarized as: `boot_aps` copies the `mpentry.S` code into a known memory location, the AP executes it, and `mpentry.S` calls `mp_main` to initialize the CPU.
Removing the physical page at `MPENTRY_PADDR` (0x7000) from the free list
We previously linked all free physical pages into a free list. Now we need to remove the page backing 0x7000 from that list: find the index of that page in the `pages` array and adjust the list links around it. This is straightforward:
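A minimal sketch of the change inside `page_init()`, assuming the free list is built with the usual lab 2 loop (the exact loop structure in your `page_init` may differ):

```c
// The page holding the AP bootstrap code must not be handed
// out, so keep it off the free list.
size_t mp_idx = MPENTRY_PADDR / PGSIZE;   // index of page 0x7000

for (i = 1; i < npages_basemem; i++) {
	if (i == mp_idx) {
		pages[i].pp_ref = 1;      // mark as in use
		pages[i].pp_link = NULL;  // and keep it unlinked
		continue;
	}
	pages[i].pp_ref = 0;
	pages[i].pp_link = page_free_list;
	page_free_list = &pages[i];
}
```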
Per-CPU State and Initialization
When writing a multiprocessor OS, it is important to distinguish between per-CPU state that is private to each processor and global state that the whole system shares. `kern/cpu.h` defines most of the per-CPU state, including `struct CpuInfo`, which stores per-CPU variables. `cpunum()` always returns the ID of the CPU that calls it, so it can be used as an index into arrays like `cpus`. Likewise, the macro `thiscpu` evaluates to the current CPU's `struct CpuInfo`.
Here is the per-CPU state you should be aware of:
- **Per-CPU kernel stack.** Because multiple CPUs can trap into the kernel simultaneously, we need a separate kernel stack for each processor so that they do not interfere with each other. The array `percpu_kstacks[NCPU][KSTKSIZE]` reserves space for NCPU's worth of kernel stacks. In Lab 2, you mapped the physical memory that `bootstack` refers to as the BSP's kernel stack just below `KSTACKTOP`. Similarly, in this lab, you will map each CPU's kernel stack into this region, with guard pages acting as a buffer between them. CPU 0's stack will still grow down from `KSTACKTOP`; CPU 1's stack will start `KSTKGAP` bytes below the bottom of CPU 0's stack, and so on. `inc/memlayout.h` shows the mapping layout.

- **Per-CPU TSS and TSS descriptor.** A per-CPU task state segment (TSS) is also needed in order to specify where each CPU's kernel stack lives. The TSS for CPU i is stored in `cpus[i].cpu_ts`, and the corresponding TSS descriptor is defined in the GDT entry `gdt[(GD_TSS0 >> 3) + i]`. The global `ts` variable defined in `kern/trap.c` will no longer be useful.

- **Per-CPU current environment pointer.** Since each CPU can run a different user process simultaneously, we redefined the symbol `curenv` to refer to `cpus[cpunum()].cpu_env` (or `thiscpu->cpu_env`), which points to the environment currently executing on the current CPU (the CPU on which the code is running).

- **Per-CPU system registers.** All registers, including system registers, are private to a CPU. Therefore, instructions that initialize these registers, such as `lcr3()`, `ltr()`, `lgdt()`, `lidt()`, etc., must be executed once on each CPU. The functions `env_init_percpu()` and `trap_init_percpu()` are defined for this purpose.

In addition to this, if you have added any extra per-CPU state or performed any additional CPU-specific initialization (by say, setting new bits in the CPU registers) in your solutions to challenge problems in earlier labs, be sure to replicate them on each CPU here!
Exercise 3. Modify `mem_init_mp()` (in `kern/pmap.c`) to map per-CPU stacks starting at `KSTACKTOP`, as shown in `inc/memlayout.h`. The size of each stack is `KSTKSIZE` bytes plus `KSTKGAP` bytes of unmapped guard pages. Your code should pass the new check in `check_kern_pgdir()`.
Function behavior

Allocate stack space for each CPU. Each CPU's region has two parts: the stack proper, of size `KSTKSIZE`, and a guard region of size `KSTKGAP`. The guard region catches stack overflow: it is left unmapped, so if a stack grows into it, the access faults instead of silently corrupting the next CPU's stack.
Implementation

We need to map each stack's virtual range to physical memory. The top of the i-th stack is `kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP)`, and the mapped portion covers `[kstacktop_i - KSTKSIZE, kstacktop_i)`, i.e. `KSTKSIZE` bytes. The mapping code follows:
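A minimal sketch, assuming `boot_map_region` and `percpu_kstacks` as above:

```c
// Modify mappings in kern_pgdir to support SMP
static void
mem_init_mp(void)
{
	int i;

	// Map the [kstacktop_i - KSTKSIZE, kstacktop_i) slot for each
	// CPU, leaving KSTKGAP bytes below each stack unmapped as a guard.
	for (i = 0; i < NCPU; i++) {
		uintptr_t kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP);
		boot_map_region(kern_pgdir, kstacktop_i - KSTKSIZE,
				KSTKSIZE, PADDR(percpu_kstacks[i]), PTE_W);
	}
}
```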
Once this is done, the code should pass `check_kern_pgdir`.
Exercise 4. The code in `trap_init_percpu()` (`kern/trap.c`) initializes the TSS and TSS descriptor for the BSP. It worked in Lab 3, but is incorrect when running on other CPUs. Change the code so that it can work on all CPUs. (Note: your new code should not use the global `ts` variable any more.)
TSS and TSS descriptor

To solve this problem, we first need to understand the Task State Segment (TSS) and the TSS descriptor; see the earlier discussion of processes for background. Exercise 4 asks us to modify `trap_init_percpu` so that it initializes the TSS and TSS descriptor correctly for every CPU.
The original `trap_init_percpu`
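(Reproduced from the lab skeleton from memory; minor details may differ in your tree.)

```c
void
trap_init_percpu(void)
{
	// Setup a TSS so that we get the right stack
	// when we trap to the kernel.
	ts.ts_esp0 = KSTACKTOP;
	ts.ts_ss0 = GD_KD;
	ts.ts_iomb = sizeof(struct Taskstate);

	// Initialize the TSS slot of the gdt.
	gdt[GD_TSS0 >> 3] = SEG16(STS_T32A, (uint32_t) (&ts),
				  sizeof(struct Taskstate) - 1, 0);
	gdt[GD_TSS0 >> 3].sd_s = 0;

	// Load the TSS selector (the bottom three bits are special;
	// we leave them 0).
	ltr(GD_TSS0);

	// Load the IDT.
	lidt(&idt_pd);
}
```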
This code sets up the TSS and TSS descriptor for CPU 0 (the boot CPU) only. We need to generalize it so that it configures the TSS and TSS descriptor of whichever CPU runs it.
What to change

Following the comments, the required changes are:
- Use `thiscpu->cpu_ts`, the current CPU's TSS, instead of the global `ts`.
- Point `esp0` in that TSS at the top of the current CPU's kernel stack (computed as in Exercise 3).
- Initialize the current CPU's TSS descriptor, i.e. `gdt[(GD_TSS0 >> 3) + cpunum()]`.
- Load the current CPU's TSS selector.

The first three items follow directly from the comments; the TSS selector deserves a note. The bottom three bits of a segment selector are not part of the descriptor index (they hold the table-indicator and privilege bits), so moving to the next GDT entry adds 8 to the selector value. CPU i's TSS selector is therefore `GD_TSS0 + (cpunum() << 3)`.
The revised `trap_init_percpu`

From the analysis above and the comments, the revised `trap_init_percpu` looks like this:
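A minimal sketch, following the per-CPU changes listed above:

```c
// Initialize and load the TSS, TSS descriptor and IDT on every CPU
void
trap_init_percpu(void)
{
	int cid = cpunum();

	// Use this CPU's TSS instead of the global ts.
	thiscpu->cpu_ts.ts_esp0 = KSTACKTOP - cid * (KSTKSIZE + KSTKGAP);
	thiscpu->cpu_ts.ts_ss0 = GD_KD;
	thiscpu->cpu_ts.ts_iomb = sizeof(struct Taskstate);

	// Initialize this CPU's TSS descriptor in the GDT.
	gdt[(GD_TSS0 >> 3) + cid] =
		SEG16(STS_T32A, (uint32_t) (&thiscpu->cpu_ts),
		      sizeof(struct Taskstate) - 1, 0);
	gdt[(GD_TSS0 >> 3) + cid].sd_s = 0;

	// Load this CPU's TSS selector; descriptors are 8 bytes apart.
	ltr(GD_TSS0 + (cid << 3));

	// Load the IDT (shared by all CPUs).
	lidt(&idt_pd);
}
```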
With this in place, boot JOS in QEMU with four CPUs (`make qemu CPUS=4`) and you should see output like:
```
...
```
All CPUs come up, but each newly started environment then hits a triple fault; the cause was covered in the previous lab.
Locking
Our current code spins in `mp_main()` after initializing the AP. Before letting the APs go any further, we must first address race conditions when multiple CPUs run kernel code simultaneously. The simplest way is to use a big kernel lock: a single global lock that is held whenever an environment enters kernel mode, and released when the environment returns to user mode. In this model, environments in user mode can run concurrently on any available CPU, but no more than one environment can run in kernel mode; any other environment that tries to enter kernel mode is forced to wait.
`kern/spinlock.h` declares the big kernel lock, namely `kernel_lock`. It also provides `lock_kernel()` and `unlock_kernel()`, shortcuts to acquire and release the lock. You should apply the big kernel lock at four locations:
- In `i386_init()`, acquire the lock before the BSP wakes up the other CPUs. This prevents the other CPUs from running kernel code prematurely.
- In `mp_main()`, acquire the lock after initializing the AP, and then call `sched_yield()` to start running environments on this AP.
- In `trap()`, acquire the lock when trapped from user mode. To determine whether a trap happened in user mode or in kernel mode, check the low bits of `tf_cs`. This covers the transition from user mode into the kernel.
- In `env_run()`, release the lock right before switching to user mode. Do not do that too early or too late, otherwise you will experience races or deadlocks. This covers the transition from the kernel back to user mode.
Notice that we acquire the lock in three places and release it in one: we touch the kernel lock at every point where a CPU can cross between user mode and kernel mode.
Exercise 5. Apply the big kernel lock as described above, by calling `lock_kernel()` and `unlock_kernel()` at the proper locations.
Locking and unlocking the kernel at the right places

Following the hints above, we add the lock and unlock calls at the indicated locations; a sketch is shown below. We cannot fully verify the locking yet, but the following exercises will exercise it.
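A minimal sketch of the four call sites (fragments of the named functions, assuming the standard JOS code around them):

```c
// i386_init(): take the lock before waking the APs.
lock_kernel();
boot_aps();

// mp_main(): take the lock, then start scheduling on this AP.
lock_kernel();
sched_yield();

// trap(): if we trapped from user mode, grab the lock.
if ((tf->tf_cs & 3) == 3) {
	lock_kernel();
	assert(curenv);
}

// env_run(): drop the lock just before entering user mode.
lcr3(PADDR(e->env_pgdir));
unlock_kernel();
env_pop_tf(&e->env_tf);
```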
Question: the big kernel lock guarantees that only one CPU at a time can run kernel code. Why do we still need separate kernel stacks for each CPU? Describe a scenario in which using a shared kernel stack will go wrong, even with the protection of the big kernel lock.
Between the moment a system call or exception enters the kernel and the moment `trap()` acquires the lock, the kernel stack is unprotected: the hardware and `_alltraps` have already pushed trap-time register state onto it. If two CPUs trapped at the same time onto a single shared stack, their pushes would interleave and corrupt each other's trap frames. Hence each CPU must have its own kernel stack.
Challenge! The big kernel lock is simple and easy to use, but it eliminates all concurrency in kernel mode and is therefore inefficient. Modern operating systems use fine-grained locks to protect different pieces of shared state. Fine-grained locking improves concurrency, but is harder to implement and more error-prone. Try designing sensible fine-grained locks that give the kernel more precise access control.
Hint: you can consider designing fine-grained locks for the following shared components, to control access to the kernel:
- The page allocator.
- The console driver.
- The scheduler.
- The inter-process communication (IPC) state that you will implement in part C.
Defining fine-grained locks and their lock/unlock functions

Following the hints above, the first step is to define a new fine-grained lock for each shared component.
Round-Robin Scheduling
Your next task is to implement round-robin scheduling, letting the kernel rotate among multiple environments as follows:
- The function `sched_yield()` (in the new `kern/sched.c`) selects the next environment to run. It searches the `envs[]` array circularly, starting just after the previously running environment (or at the beginning of the array if there was no previously running environment), picks the first environment whose status is `ENV_RUNNABLE` (see `inc/env.h`), and calls `env_run()` to run it.
- `sched_yield()` must never run the same environment on two CPUs at the same time. The status `ENV_RUNNING` tells us that an environment is currently running on some CPU (possibly the current one).
- We define a new system call, `sys_yield()`, which user environments can invoke to trigger the kernel's `sched_yield()` and thereby voluntarily give up the CPU to another environment.
Exercise 6. Implement round-robin scheduling in `sched_yield()` as described above. Don't forget to modify `syscall()` to dispatch `sys_yield()`.
Call `sched_yield()` in `mp_main`. Modify `kern/init.c` to create three or more environments that all run the program `user/yield.c`. Then run `make qemu CPUS=2`. You should see the environments switch back and forth between each other, each running five times before exiting:
```
...
```
After the `yield` programs exit, there will be no runnable environments in the system; the scheduler should then invoke the JOS kernel monitor. If any of this does not happen, fix your code before proceeding.
Framework

Exercise 6 mainly asks us to implement round-robin scheduling and call it in the right places to find the next environment to run. We first put the framework in place by calling `sched_yield()` in `mp_main`; the revised `mp_main` is:
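A sketch, assuming the Exercise 5 locking is in place:

```c
// Setup code for APs
void
mp_main(void)
{
	lcr3(PADDR(kern_pgdir));
	cprintf("SMP: CPU %d starting\n", cpunum());

	lapic_init();
	env_init_percpu();
	trap_init_percpu();
	xchg(&thiscpu->cpu_status, CPU_STARTED); // tell boot_aps() we're up

	// Take the big kernel lock, then start scheduling
	// environments on this CPU; sched_yield() never returns.
	lock_kernel();
	sched_yield();
}
```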
Then modify `syscall()` in `kern/syscall.c` to dispatch `sys_yield`.
Implementing the round-robin algorithm

To implement round-robin scheduling, we need to solve the following problems:

- Circular traversal: the algorithm starts searching just after the current environment, so we need to walk `envs` starting from an arbitrary position. A concise way to write this is `envs[(next_envid + i) % NENV]`.
- No suitable environment found:
  - If the environment on the current CPU is still running, keep running it.
  - If the current CPU's environment has also finished and no candidate exists, halt the CPU.
This leads to the following implementation:
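A minimal sketch, assuming `ENVX` and `sched_halt` from the skeleton:

```c
void
sched_yield(void)
{
	struct Env *idle;
	int i, start = 0;

	// Start searching just after the previously running environment.
	if (curenv)
		start = ENVX(curenv->env_id) + 1;

	// Circular scan of envs[] for the first runnable environment.
	for (i = 0; i < NENV; i++) {
		idle = &envs[(start + i) % NENV];
		if (idle->env_status == ENV_RUNNABLE)
			env_run(idle);          // does not return
	}

	// No other candidate: keep running the current environment
	// if it is still running on this CPU.
	if (curenv && curenv->env_status == ENV_RUNNING)
		env_run(curenv);

	// Nothing to do: halt this CPU.
	sched_halt();
}
```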
Questions

- In `env_run()` you called `lcr3()` to switch page directories, yet the code keeps using the variable `e` after that instruction. Once the environment's page directory is loaded into `cr3`, the address mapping changes; why can `e`'s virtual address still be dereferenced to reach the physical memory where `e` lives?
- Whenever the kernel switches from one environment to another, it must ensure the old environment's registers are saved so they can be restored properly later. Why? Where does this happen?
- First question: if JOS runs correctly, then `e`'s virtual address simply never changes; that is, `e` maps to the same physical address in the kernel page directory and in every environment's page directory. This is indeed the case: each environment's page directory is copied from the kernel page directory, with only part of it modified, and the region containing `e` (the `envs` array) is identical in every page directory, so the mapping remains valid.
- Second question: the old environment's registers must be saved so that its context can be restored and execution can resume where it left off the next time it is scheduled. This happens in `_alltraps`, i.e., right after the transition from user mode into the kernel.
Challenge! Add a less trivial scheduling policy to the kernel, such as a fixed-priority scheduler that allows each environment to be assigned a priority and ensures that higher-priority environments are always chosen in preference to lower-priority environments. If you're feeling really adventurous, try implementing a Unix-style adjustable-priority scheduler or even a lottery or stride scheduler. (Look up "lottery scheduling" and "stride scheduling" in Google.)

Write a test program or two that verifies that your scheduling algorithm is working correctly (i.e., the right environments get run in the right order). It may be easier to write these test programs once you have implemented `fork()` and IPC in parts B and C of this lab.
Not solved yet; I plan to spend half a day reading up on static-priority scheduling first.
Challenge! The JOS kernel currently does not allow applications to use the x86 processor's x87 floating-point unit (FPU), MMX instructions, or Streaming SIMD Extensions (SSE). Extend the `Env` structure to provide a save area for the processor's floating point state, and extend the context switching code to save and restore this state properly when switching from one environment to another. The `FXSAVE` and `FXRSTOR` instructions may be useful, but note that these are not in the old i386 user's manual because they were introduced in more recent processors. Write a user-level test program that does something cool with floating-point.
Not solved.
System Calls for Environment Creation

So far JOS can only run the environments that the kernel sets up at initialization. You will now implement the system calls that allow user environments to create and start other user environments.
Unix provides the `fork()` system call as its process creation primitive. Unix `fork()` copies the entire address space of the calling process (the parent) to create a new process (the child). The only differences between the two observable from user space are their process IDs and parent process IDs (as returned by `getpid` and `getppid`). In the parent, `fork()` returns the child's process ID, while in the child it returns 0. By default, each process gets its own private address space, and neither process's memory modifications are visible to the other.
You will provide a set of system calls for creating new user environments. With these calls you will be able to implement `fork` in user mode. The new system calls are:
- `sys_exofork`: creates a new environment with an almost blank slate: nothing is mapped in the user portion of its address space, and it is not runnable. The new environment has the same register state as the parent at the time of the call. In the parent, `sys_exofork` returns the `envid_t` of the newly created environment (or a negative error code if the allocation failed). In the child, it returns 0. (Since the child starts out marked `ENV_NOT_RUNNABLE`, `sys_exofork` will not actually return in the child until the parent has explicitly marked it runnable with `sys_env_set_status`.)
- `sys_env_set_status`: sets the status of a given environment to `ENV_RUNNABLE` or `ENV_NOT_RUNNABLE`. It is typically used to mark a new environment ready to run once its address space and register state have been fully initialized.
- `sys_page_alloc`: allocates a page of physical memory and maps it at a given virtual address in a given environment's address space.
- `sys_page_map`: copies a page mapping (not the contents of the page!) from one environment's address space to another, so that the two environments' virtual addresses refer to the same physical page (shared memory).
- `sys_page_unmap`: unmaps a page at a given virtual address in a given environment.
For all of the system calls above that accept environment IDs, JOS uses the convention that a value of 0 means "the current environment." This convention is implemented by `envid2env()` in `kern/env.c`.
We have provided a very primitive implementation of a Unix-like `fork()` in the test program `user/dumbfork.c`. This test program uses the above system calls to create and run a child environment with a copy of its own address space. The two environments then switch back and forth using `sys_yield` as in the previous exercise. The parent exits after 10 iterations, whereas the child exits after 20.
Exercise 7. Implement the system calls described above and make sure `syscall` dispatches to them. You will need to use various functions in `kern/pmap.c` and `kern/env.c`, particularly `envid2env()`. For now, whenever you call `envid2env()`, pass 1 in the `checkperm` parameter. Be sure you check for any invalid system call arguments, returning `-E_INVAL` in that case. Test your JOS kernel with `user/dumbfork` and make sure it works before proceeding.
The skeletons of these system calls are already in place; we only need to fill in their bodies in `kern/syscall.c`.
Implementing `sys_exofork`

Design
The function has two main tasks:

- Create a new environment with `env_alloc`.
- Initialize the new environment: copy the parent's `tf`, and set the `reg_eax` field of `tf_regs` to 0 so that the system call returns 0 in the child.
Debugging

At first I misunderstood what this function should do: after creating the environment I immediately called `env_run` on it. The consequence is that when environment A calls `sys_exofork` and creates child B, the kernel immediately starts running B; we entered the system call from A but returned into B, which is clearly wrong. The correct approach is to mark the new environment `ENV_NOT_RUNNABLE` and change it to `ENV_RUNNABLE` at some later point, so that `sched_yield` can schedule it.
Implementation
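A minimal sketch, assuming `env_alloc` from `kern/env.c`:

```c
// Allocate a new environment.
static envid_t
sys_exofork(void)
{
	struct Env *e;
	int r;

	if ((r = env_alloc(&e, curenv->env_id)) < 0)
		return r;

	// Not runnable until the parent finishes setting it up.
	e->env_status = ENV_NOT_RUNNABLE;
	// The child inherits the parent's register state...
	e->env_tf = curenv->env_tf;
	// ...except that sys_exofork appears to return 0 in the child.
	e->env_tf.tf_regs.reg_eax = 0;

	return e->env_id;
}
```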
Notes

The subtle point here is that the child and the parent return different values from the same call. This is achieved by patching the child's saved trap frame: setting its `eax` to 0 makes the call appear to return 0 in the child, while the parent receives the child's `envid`.
The remaining system calls

The remaining system calls are straightforward; they mostly validate their input arguments. Their implementations follow.
Implementations

`sys_env_set_status` is implemented as follows:
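A minimal sketch:

```c
// Set envid's env_status to status, which must be ENV_RUNNABLE
// or ENV_NOT_RUNNABLE.
static int
sys_env_set_status(envid_t envid, int status)
{
	struct Env *e;

	if (status != ENV_RUNNABLE && status != ENV_NOT_RUNNABLE)
		return -E_INVAL;
	if (envid2env(envid, &e, 1) < 0)
		return -E_BAD_ENV;

	e->env_status = status;
	return 0;
}
```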
`sys_page_alloc`:
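A sketch, assuming `PTE_SYSCALL` from `inc/mmu.h`:

```c
// Allocate a page of memory and map it at 'va' with permission
// 'perm' in the address space of 'envid'.
static int
sys_page_alloc(envid_t envid, void *va, int perm)
{
	struct Env *e;
	struct PageInfo *pp;

	if (envid2env(envid, &e, 1) < 0)
		return -E_BAD_ENV;
	if ((uintptr_t) va >= UTOP || PGOFF(va))
		return -E_INVAL;
	if ((perm & ~PTE_SYSCALL) ||
	    (perm & (PTE_U | PTE_P)) != (PTE_U | PTE_P))
		return -E_INVAL;
	if (!(pp = page_alloc(ALLOC_ZERO)))
		return -E_NO_MEM;
	if (page_insert(e->env_pgdir, pp, va, perm) < 0) {
		page_free(pp);
		return -E_NO_MEM;
	}
	return 0;
}
```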
`sys_page_map`:
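A sketch with the usual validity checks:

```c
// Map the page of memory at 'srcva' in srcenvid's address space
// at 'dstva' in dstenvid's address space with permission 'perm'.
static int
sys_page_map(envid_t srcenvid, void *srcva,
	     envid_t dstenvid, void *dstva, int perm)
{
	struct Env *srce, *dste;
	struct PageInfo *pp;
	pte_t *pte;

	if (envid2env(srcenvid, &srce, 1) < 0 ||
	    envid2env(dstenvid, &dste, 1) < 0)
		return -E_BAD_ENV;
	if ((uintptr_t) srcva >= UTOP || PGOFF(srcva) ||
	    (uintptr_t) dstva >= UTOP || PGOFF(dstva))
		return -E_INVAL;
	if (!(pp = page_lookup(srce->env_pgdir, srcva, &pte)))
		return -E_INVAL;
	if ((perm & ~PTE_SYSCALL) ||
	    (perm & (PTE_U | PTE_P)) != (PTE_U | PTE_P))
		return -E_INVAL;
	// Cannot grant write access to a read-only page.
	if ((perm & PTE_W) && !(*pte & PTE_W))
		return -E_INVAL;
	if (page_insert(dste->env_pgdir, pp, dstva, perm) < 0)
		return -E_NO_MEM;
	return 0;
}
```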
`sys_page_unmap`:
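A sketch:

```c
// Unmap the page of memory at 'va' in the address space of 'envid'.
static int
sys_page_unmap(envid_t envid, void *va)
{
	struct Env *e;

	if (envid2env(envid, &e, 1) < 0)
		return -E_BAD_ENV;
	if ((uintptr_t) va >= UTOP || PGOFF(va))
		return -E_INVAL;
	page_remove(e->env_pgdir, va);
	return 0;
}
```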
Challenge! Add the additional system calls necessary to read all of the vital state of an existing environment as well as set it up. Then implement a user mode program that forks off a child environment, runs it for a while (e.g., a few iterations of `sys_yield()`), then takes a complete snapshot or checkpoint of the child environment, runs the child for a while longer, and finally restores the child environment to the state it was in at the checkpoint and continues it from there. Thus, you are effectively "replaying" the execution of the child environment from an intermediate state. Make the child environment perform some interaction with the user using `sys_cgetc()` or `readline()` so that the user can view and mutate its internal state, and verify that with your checkpoint/restart you can give the child environment a case of selective amnesia, making it "forget" everything that happened beyond a certain point.
This completes Part A: run `make grade` and make sure you pass all of the Part A tests. If you are trying to figure out why a particular test case is failing, run `./grade-lab4 -v`, which will show you the output of the kernel builds and QEMU runs for each test, until a test fails. When a test fails, the script will stop, and then you can inspect `jos.out` to see what the kernel actually printed.
Part B: Copy-on-Write Fork
Unix provides the `fork()` system call as its primary process creation primitive. The `fork()` system call copies the address space of the calling process into a new child process. xv6 Unix implements `fork()` by copying all of the parent's pages into the child, which is essentially the same approach that `dumbfork()` takes. The copying of the parent's address space into the child is the most expensive part of the `fork()` operation.
However, a call to `fork()` is frequently followed almost immediately by a call to `exec()` in the child process, which replaces the child's memory with a new program. This is what the shell typically does, for example. In this case, the time spent copying the parent's address space is largely wasted, because the child process will use very little of its memory before calling `exec()`.
Later versions of Unix therefore let the parent and child share their memory mappings until one of the processes actually modifies the address space. This technique is known as copy-on-write. To implement it, on `fork` the kernel copies the address space mappings rather than the contents of the mapped pages, and at the same time marks the now-shared pages read-only. When one of the two processes tries to write to a shared page, it takes a page fault, at which point the kernel allocates a new, private, writable copy of the page for the faulting process. Thus a page's contents are not actually copied until they are actually written. This optimization makes a `fork()` followed by an `exec()` in the child much cheaper: the child will probably only need to copy one page (the current page of its stack) before it calls `exec()`.
In the next piece of this lab, you will implement a Unix-like copy-on-write `fork()` as a user-level library routine. Implementing `fork()` and copy-on-write support in user space has the benefit that the kernel remains much simpler and thus more likely to be correct. It also lets individual user-mode programs define their own semantics for `fork()`. A program that wants a slightly different implementation (for example, the expensive always-copy version like `dumbfork()`, or one in which the parent and child actually share memory afterward) can easily provide its own.
User-Level Page Fault Handling
A user-level copy-on-write `fork()` needs to know about page faults on write-protected pages, and copy-on-write is only one of many possible uses of user-level page fault handling.
It’s common to set up an address space so that page faults indicate when some action needs to take place. For example, most Unix kernels initially map only a single page in a new process’s stack region, and allocate and map additional stack pages later “on demand” as the process’s stack consumption increases and causes page faults on stack addresses that are not yet mapped. A typical Unix kernel must keep track of what action to take when a page fault occurs in each region of a process’s space. For example, a fault in the stack region will typically allocate and map new page of physical memory. A fault in the program’s BSS region will typically allocate a new page, fill it with zeroes, and map it. In systems with demand-paged executables, a fault in the text region will read the corresponding page of the binary off of disk and then map it.
This is a lot of information for the kernel to keep track of. Instead of taking the traditional Unix approach, you will decide what to do about each page fault in user space, where bugs are less damaging. This design has the added benefit of allowing programs great flexibility in defining their memory regions; you’ll use user-level page fault handling later for mapping and accessing files on a disk-based file system.
Setting Up the Page Fault Handler
In order to handle its own page faults, a user environment needs to register a page fault handler with the JOS kernel. The environment registers its page fault entrypoint via the `sys_env_set_pgfault_upcall` system call. We add a new member, `env_pgfault_upcall`, to the `Env` structure to record this information.
Exercise 8. Implement the `sys_env_set_pgfault_upcall` system call. Be sure to enable permission checking when looking up the environment ID of the target environment, since this is a "dangerous" system call.
Implementation

This one is easy: it simply sets the environment's page fault upcall to the given function entrypoint:
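A minimal sketch:

```c
// Set the page fault upcall for 'envid' by modifying the
// corresponding struct Env's 'env_pgfault_upcall' field.
static int
sys_env_set_pgfault_upcall(envid_t envid, void *func)
{
	struct Env *e;

	// checkperm = 1: only the caller or its children may be changed.
	if (envid2env(envid, &e, 1) < 0)
		return -E_BAD_ENV;
	e->env_pgfault_upcall = func;
	return 0;
}
```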
Normal and Exception Stacks in User Environments

During normal execution, a user environment runs on the normal user stack: its `ESP` register starts out pointing at `USTACKTOP`, and the stack data it pushes resides on the page between `USTACKTOP-PGSIZE` and `USTACKTOP-1` inclusive. When a page fault occurs in user mode, however, the kernel will restart the user environment running a designated user-level page fault handler on a different stack, namely the user exception stack. In essence, we will make the JOS kernel implement automatic stack switching on behalf of the user environment, in much the same way that the x86 processor implements stack switching on behalf of JOS when transferring from user mode to kernel mode.

The JOS user exception stack is also one page in size; its top is at virtual address `UXSTACKTOP`, so its valid bytes range from `UXSTACKTOP-PGSIZE` through `UXSTACKTOP-1` inclusive. While running on this exception stack, the user-level page fault handler can use JOS's regular system calls to map new pages or adjust mappings so as to fix whatever problem originally caused the page fault. Then the user-level page fault handler returns, via an assembly language stub, to the faulting code on the original stack.
Each user environment that wants to support user-level page fault handling needs to allocate memory for its exception stack, using the `sys_page_alloc()` system call introduced in Part A.
Invoking the User Page Fault Handler
You now need to change the page fault handling code in `kern/trap.c` to handle page faults from user mode. We will call the state of the user environment at the time of the fault the trap-time state.

If there is no page fault handler registered, the kernel destroys the user environment. Otherwise, the kernel sets up a trap frame on the exception stack that looks like a `struct UTrapframe` (from `inc/trap.h`):
```
                    <-- UXSTACKTOP
trap-time esp
trap-time eflags
trap-time eip
trap-time eax       start of struct PushRegs
trap-time ecx
trap-time edx
trap-time ebx
trap-time esp
trap-time ebp
trap-time esi
trap-time edi       end of struct PushRegs
tf_err (error code)
fault_va            <-- %esp when handler is run
```
The kernel then arranges for the user environment to resume execution with the page fault handler running on the exception stack with this trap frame; you must figure out how to make this happen. The `fault_va` is the virtual address that caused the page fault.
If the user environment is already running on the exception stack when the exception occurs, then the page fault handler itself has faulted (a nested exception). In this case, you should start the new stack frame just under the current `tf->tf_esp` rather than at `UXSTACKTOP`: first push an empty 32-bit word, then a `struct UTrapframe`. To test whether `tf->tf_esp` is already on the user exception stack, check whether it lies between `UXSTACKTOP-PGSIZE` and `UXSTACKTOP-1`, inclusive.
Exercise 9. Implement the code in `page_fault_handler` in `kern/trap.c` required to dispatch page faults to the user-mode handler. Be sure to take appropriate precautions when writing into the exception stack. (What happens if the user environment runs out of space on the exception stack?)
Function behavior

After the fault enters the kernel, `page_fault_handler` adjusts the trap frame so that the current environment resumes on the exception stack, executing the registered page fault handler.

Design

First, check whether a page fault upcall has been registered; if not, destroy the environment. Next, choose the location of the `UTrapframe` based on where the faulting stack pointer is (normal case vs. nested case), and verify that the environment may write to that memory. Then fill in the `UTrapframe`, saving the essential parts of `tf`. Finally, redirect `tf`'s `esp` and `eip` and run the environment, so that it executes the handler on the exception stack.

Implementation
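A sketch following the design above, assuming `rcr2()` and `user_mem_assert` from earlier labs:

```c
void
page_fault_handler(struct Trapframe *tf)
{
	uint32_t fault_va = rcr2();
	struct UTrapframe *utf;
	uintptr_t utf_top;

	// Page faults in kernel mode are fatal.
	if ((tf->tf_cs & 3) == 0)
		panic("kernel page fault at %08x", fault_va);

	if (curenv->env_pgfault_upcall) {
		// Nested fault? Leave a 4-byte scratch word and push
		// the new UTrapframe below the trap-time esp.
		if (tf->tf_esp >= UXSTACKTOP - PGSIZE && tf->tf_esp < UXSTACKTOP)
			utf_top = tf->tf_esp - 4;
		else
			utf_top = UXSTACKTOP;
		utf = (struct UTrapframe *) (utf_top - sizeof(struct UTrapframe));

		// Destroys the environment if the exception stack
		// overflows or was never allocated.
		user_mem_assert(curenv, (void *) utf,
				sizeof(struct UTrapframe), PTE_W);

		// Save the trap-time state.
		utf->utf_fault_va = fault_va;
		utf->utf_err = tf->tf_err;
		utf->utf_regs = tf->tf_regs;
		utf->utf_eip = tf->tf_eip;
		utf->utf_eflags = tf->tf_eflags;
		utf->utf_esp = tf->tf_esp;

		// Resume in the upcall, on the exception stack.
		tf->tf_eip = (uintptr_t) curenv->env_pgfault_upcall;
		tf->tf_esp = (uintptr_t) utf;
		env_run(curenv);
	}

	// No handler registered: destroy the environment.
	cprintf("[%08x] user fault va %08x ip %08x\n",
		curenv->env_id, fault_va, tf->tf_eip);
	print_trapframe(tf);
	env_destroy(curenv);
}
```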
From this code we can work out how the kernel stack, the user normal stack, and the user exception stack relate during page fault handling. The key observation is that nested `UTrapframe`s behave exactly like recursive function calls. Keep this firmly in mind: Exercise 10 depends on it.
User-Mode Page Fault Entrypoint

Next, you need to implement the assembly routine that calls the C page fault handler and then resumes execution at the original faulting instruction. This assembly routine is the handler that gets registered with the kernel via `sys_env_set_pgfault_upcall()`.
Exercise 10. Implement the `_pgfault_upcall` routine in `lib/pfentry.S`. The interesting part is returning to the original point in the user code that caused the page fault. You'll return directly there, without going back through the kernel. The hard part is simultaneously switching stacks and re-loading the EIP.
Analysis

Exercise 10 may be the hardest in this lab, perhaps in the whole course. To do it well you need to understand:

- how `call` and `ret` work in assembly, in particular how they change the stack;
- the layout of the kernel stack, the user stack, and the user exception stack;
- fine-grained manipulation of memory.

I could not crack it on my own; I only understood it after puzzling over someone else's code for quite a while.
Design

From the comments in the skeleton, the control flow is: first call `_pgfault_handler`; once the fault has been handled, pop the exception stack frame and resume at the faulting instruction.
Implementation
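A sketch of the canonical approach; the offsets 0x28 and 0x30 are the positions of `utf_eip` and `utf_esp` within `struct UTrapframe`:

```asm
// Page fault upcall entrypoint.
.globl _pgfault_upcall
_pgfault_upcall:
	// Call the C handler; %esp points at the UTrapframe.
	pushl %esp			// function argument: pointer to UTF
	movl _pgfault_handler, %eax
	call *%eax
	addl $4, %esp			// pop function argument

	// Stash trap-time eip in the scratch word just below the
	// trap-time stack, and point utf_esp at it.
	movl 0x28(%esp), %ebx		// trap-time eip
	subl $4, 0x30(%esp)		// grow trap-time stack by one word
	movl 0x30(%esp), %eax
	movl %ebx, (%eax)		// write eip onto trap-time stack

	// Restore trap-time registers; after this we must not
	// modify any general-purpose registers.
	addl $8, %esp			// skip fault_va and tf_err
	popal

	// Restore eflags; after this, no arithmetic instructions.
	addl $4, %esp			// skip eip
	popfl

	// Switch back to the adjusted trap-time stack.
	popl %esp

	// "Return" to re-execute the faulting instruction.
	ret
```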
Finally, complete the C user-library side of the user-level page fault handling mechanism, i.e. Exercise 11.
Exercise 11. Finish `set_pgfault_handler()` in `lib/pgfault.c`.
Function behavior

This function registers the user-specified page fault handler in `_pgfault_handler` and tells the kernel to invoke the assembly-language upcall `_pgfault_upcall` whenever a page fault occurs.
Design

:warning: Note: user-mode code must interact with the kernel through system calls!
Implementation

From the function's purpose and the comments, the implementation is:
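A minimal sketch:

```c
// Set the page fault handler function.  On the first call,
// allocate the exception stack and register the upcall.
void
set_pgfault_handler(void (*handler)(struct UTrapframe *utf))
{
	int r;

	if (_pgfault_handler == 0) {
		// First time: allocate one page for the exception stack
		// and tell the kernel about the assembly upcall.
		if ((r = sys_page_alloc(0, (void *) (UXSTACKTOP - PGSIZE),
					PTE_P | PTE_U | PTE_W)) < 0)
			panic("set_pgfault_handler: %e", r);
		if ((r = sys_env_set_pgfault_upcall(0, _pgfault_upcall)) < 0)
			panic("set_pgfault_handler: %e", r);
	}

	// Save the handler pointer for the assembly routine to call.
	_pgfault_handler = handler;
}
```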
We used two system calls: one to allocate a page as the exception stack, and one to set the page fault upcall. Once these are written, remember to add the corresponding dispatch entries to `syscall()`:
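A sketch of the new arms in the `int32_t syscall(...)` dispatcher in `kern/syscall.c`; the exact casts depend on how you named the arguments:

```c
case SYS_page_alloc:
	return sys_page_alloc((envid_t) a1, (void *) a2, (int) a3);
case SYS_env_set_pgfault_upcall:
	return sys_env_set_pgfault_upcall((envid_t) a1, (void *) a2);
```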
Testing

Run `user/faultread` (`make run-faultread`). You should see:
```
...
```
This program tries to read address 0x0, which faults; since no page fault handler is registered, the kernel prints `user fault va 00000000 ip 0080003a` and destroys the environment. Run `user/faultdie`. You should see:
```
...
```
This program tries to write to address `0xdeadbeef`, which is what triggers the fault.
Run `user/faultalloc`. You should see:
```
...
```
If you see only the first “this string” line, it means you are not handling recursive page faults properly.
Run `user/faultallocbad`. You should see:
```
...
```
Make sure you understand why `user/faultalloc` and `user/faultallocbad` behave differently. Both touch `0xdeadbeef`; the difference is where the fault happens. In `faultalloc`, `cprintf` dereferences the bad address in user mode, so the user-level page fault handler runs and can allocate the page. In `faultallocbad`, the bad address is passed directly to the `sys_cputs` system call, and the kernel's user-memory check destroys the environment before any user-level handler gets a chance to run.
Challenge! Extend your kernel so that not only page faults, but all types of processor exceptions that code running in user space can generate, can be redirected to a user-mode exception handler. Write user-mode test programs to test user-mode handling of various exceptions, such as divide-by-zero, general protection fault, and illegal opcode.
Implementing Copy-on-Write Fork
You now have all the kernel facilities needed to implement copy-on-write `fork()` entirely in user space. We have provided a skeleton for `fork()` in `lib/fork.c`. Like `dumbfork()`, `fork()` should create a new environment, then scan through the parent environment's entire address space and set up corresponding page mappings in the child. The key difference is that, while `dumbfork()` copied pages, `fork()` initially copies only page mappings. `fork()` copies a page only when one of the environments tries to write it.
The basic control flow for `fork()` is as follows:

1. The parent installs `pgfault()` as the C-level page fault handler, using `set_pgfault_handler()`.
2. The parent calls `sys_exofork()` to create a child environment.
3. For each writable or copy-on-write page in its address space below UTOP, the parent calls `duppage`, which maps the page copy-on-write into the address space of the child and then remaps it copy-on-write in its own address space. [Note the order: the child's mapping is made COW first, and only then the parent's. This matters; think about what could go wrong if the order were reversed.] `duppage` sets both PTEs so that the page is not writable, and marks the "avail" field with `PTE_COW` to distinguish copy-on-write pages from genuine read-only pages.

   The exception stack is not remapped this way, however; instead you need to allocate a fresh page in the child for the exception stack. Since the page fault handler performs the actual copying, and it runs on the exception stack, the exception stack cannot be made copy-on-write: besides, who would copy it?

   `fork()` also needs to handle pages that are present (PTE_P) but not writable and not copy-on-write.
4. The parent sets the user page fault entrypoint for the child.
5. The child is now ready to run, so the parent marks it runnable.
Each time one of the environments writes a copy-on-write page that it hasn't yet written, it takes a page fault. Here's the control flow for the user page fault handler:
1. The kernel propagates the page fault to `_pgfault_upcall`, which calls `fork()`'s `pgfault()` handler.
2. `pgfault()` checks that the fault is a write (check for `FEC_WR` in the error code) and that the PTE for the page is marked `PTE_COW`. If not, panic.
3. `pgfault()` allocates a new page mapped at a temporary location and copies the contents of the faulting page into it. Then the fault handler maps the new page at the appropriate address with read/write permissions, in place of the old read-only mapping.
The user-level `lib/fork.c` code must consult the environment's page tables for several of the operations above (e.g., to check that the PTE for a page is marked `PTE_COW`). The kernel maps the environment's page tables at `UVPT` exactly for this purpose. It uses a clever mapping trick to make it easy to look up PTEs for user code. `lib/entry.S` sets up `uvpt` and `uvpd` so that you can easily look up page-table information in `lib/fork.c`.
Exercise 12. Implement `fork`, `duppage` and `pgfault` in `lib/fork.c`.
Design

`fork` follows the control flow listed above; a sketch of `fork`, `duppage`, and `pgfault` is shown below:
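A minimal sketch of the three functions, assuming the canonical JOS helpers (`uvpt`, `uvpd`, `PFTEMP`, `PGNUM`, `PDX`); the details of your version may differ:

```c
static void
pgfault(struct UTrapframe *utf)
{
	void *addr = (void *) utf->utf_fault_va;
	int r;

	// Only handle writes to copy-on-write pages.
	if (!(utf->utf_err & FEC_WR) || !(uvpt[PGNUM(addr)] & PTE_COW))
		panic("pgfault: not a write to a COW page");

	// Copy the faulting page to a temporary location, then
	// remap it read/write at the original address.
	addr = ROUNDDOWN(addr, PGSIZE);
	if ((r = sys_page_alloc(0, PFTEMP, PTE_P | PTE_U | PTE_W)) < 0)
		panic("sys_page_alloc: %e", r);
	memmove(PFTEMP, addr, PGSIZE);
	if ((r = sys_page_map(0, PFTEMP, 0, addr, PTE_P | PTE_U | PTE_W)) < 0)
		panic("sys_page_map: %e", r);
	if ((r = sys_page_unmap(0, PFTEMP)) < 0)
		panic("sys_page_unmap: %e", r);
}

static int
duppage(envid_t envid, unsigned pn)
{
	void *va = (void *) (pn * PGSIZE);
	int r;

	if (uvpt[pn] & (PTE_W | PTE_COW)) {
		// Map into the child COW first, then remap ourselves COW.
		if ((r = sys_page_map(0, va, envid, va,
				      PTE_P | PTE_U | PTE_COW)) < 0)
			return r;
		if ((r = sys_page_map(0, va, 0, va,
				      PTE_P | PTE_U | PTE_COW)) < 0)
			return r;
	} else {
		// Genuinely read-only pages are simply shared.
		if ((r = sys_page_map(0, va, envid, va, PTE_P | PTE_U)) < 0)
			return r;
	}
	return 0;
}

envid_t
fork(void)
{
	extern void _pgfault_upcall(void);
	envid_t child;
	uintptr_t va;
	int r;

	set_pgfault_handler(pgfault);

	if ((child = sys_exofork()) < 0)
		return child;
	if (child == 0) {
		// Child: fix thisenv and return 0.
		thisenv = &envs[ENVX(sys_getenvid())];
		return 0;
	}

	// Copy mappings below the stack top; the exception stack
	// (above USTACKTOP) is deliberately skipped.
	for (va = 0; va < USTACKTOP; va += PGSIZE)
		if ((uvpd[PDX(va)] & PTE_P) && (uvpt[PGNUM(va)] & PTE_P) &&
		    (uvpt[PGNUM(va)] & PTE_U))
			duppage(child, PGNUM(va));

	// Give the child its own exception stack and upcall.
	if ((r = sys_page_alloc(child, (void *) (UXSTACKTOP - PGSIZE),
				PTE_P | PTE_U | PTE_W)) < 0)
		panic("fork: %e", r);
	if ((r = sys_env_set_pgfault_upcall(child, _pgfault_upcall)) < 0)
		panic("fork: %e", r);

	if ((r = sys_env_set_status(child, ENV_RUNNABLE)) < 0)
		panic("fork: %e", r);
	return child;
}
```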
Test your code with the `forktree` program. It should produce the following messages, with interspersed 'new env', 'free env', and 'exiting gracefully' messages. The messages may not appear in this order, and the environment IDs may be different.
```
1000: I am ''
...
```
Challenge! Implement a shared-memory `fork()` called `sfork()`. This version should have the parent and child share all their memory pages (so writes in one environment appear in the other) except for pages in the stack area, which should be treated in the usual copy-on-write manner. Modify `user/forktree.c` to use `sfork()` instead of regular `fork()`. Also, once you have finished implementing IPC in part C, use your `sfork()` to run `user/pingpongs`. You will have to find a new way to provide the functionality of the global `thisenv` pointer.
Challenge! Your implementation of `fork` makes a huge number of system calls. On the x86, switching into the kernel using interrupts has non-trivial cost. Augment the system call interface so that it is possible to send a batch of system calls at once. Then change `fork` to use this interface.

How much faster is your new `fork` than the original?
You can answer this (roughly) by using analytical arguments to estimate how much of an improvement batching system calls will make to the performance of your `fork`: How expensive is an `int 0x30` instruction? How many times do you execute `int 0x30` in your `fork`? Is accessing the `TSS` stack switch also expensive? And so on…
Alternatively, you can boot your kernel on real hardware and really benchmark your code. See the `RDTSC` (read time-stamp counter) instruction, defined in the IA32 manual, which counts the number of clock cycles that have elapsed since the last processor reset. QEMU doesn't emulate this instruction faithfully (it can either count the number of virtual instructions executed or use the host TSC, neither of which reflects the number of cycles a real CPU would require).
This ends Part B. Make sure you pass all of the Part B tests. As usual, you can hand in your submission with `make handin`.
Part C: Preemptive Multitasking and Inter-Process Communication (IPC)
In Part C you will modify the kernel to preempt uncooperative environments and to allow environments to communicate with each other explicitly.
Clock Interrupts and Preemption
Run the `user/spin` test program. It forks off a child environment, which simply spins forever in a tight loop once it receives control of the CPU. Neither the parent environment nor the kernel ever gets the CPU back. This is obviously not ideal: any user environment can bring the whole system to a halt simply by getting into an infinite loop and never giving back the CPU. In order to allow the kernel to preempt a running environment, forcefully retaking control of the CPU, we must extend the JOS kernel to support external hardware interrupts from the clock hardware.
Interrupt Discipline
External interrupts (i.e., device interrupts) are referred to as IRQs. There are 16 possible IRQs, numbered 0 through 15. The mapping from IRQ number to IDT entry is not fixed; `pic_init` in `picirq.c` maps IRQs 0-15 to IDT entries `IRQ_OFFSET` through `IRQ_OFFSET+15` (i.e., 32-47):
```c
outb(IO_PIC1+1, IRQ_OFFSET);	// 0-7
```
In `inc/trap.h`, `IRQ_OFFSET` is defined to be decimal 32. Thus IDT entries 32-47 correspond to IRQs 0-15. For example, the clock interrupt is IRQ 0, so IDT[IRQ_OFFSET+0] (i.e., IDT[32]) contains the address of the kernel's clock interrupt handler. `IRQ_OFFSET` is chosen so that the device interrupts do not overlap the processor's exception vectors.
Compared to xv6, we make a key simplification in JOS: external device interrupts are always disabled in kernel mode (and, as in xv6, enabled in user mode). External interrupts are controlled by the `FL_IF` flag bit of the `%eflags` register; interrupts are enabled when `FL_IF` is 1. Although the bit can be modified in several ways, our simplification means we only need to save and restore it when entering and leaving user mode.
You have to ensure that the `FL_IF` flag is set when user environments run, so that when an interrupt arrives, it is passed through to the processor and handled by your interrupt code. Otherwise, interrupts are masked, i.e. ignored until they are re-enabled. We masked interrupts with the very first instruction of the bootloader and have never re-enabled them since.
Exercise 13. Modify `kern/trapentry.S` and `kern/trap.c` to initialize the appropriate entries in the IDT and provide handlers for IRQs 0 through 15. Then modify the code in `env_alloc()` in `kern/env.c` to ensure that user environments are always run with interrupts enabled.

Also uncomment the `sti` instruction in `sched_halt()` so that idle CPUs unmask interrupts. Since in lab 3 I used the script `vectors.py` to generate the interrupt entry stubs automatically, here I need to modify `vectors.py`.
Modifying vectors.py

First, modify `vectors.py`: the processor pushes no error code for vectors 32 through 47, so their stubs must push a 0 placeholder themselves:
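A sketch of one generated stub; the `0` keeps the trap-frame layout uniform with the exception vectors that do carry an error code:

```asm
.globl vector32
vector32:
	pushl $0		# placeholder error code (hardware pushes none)
	pushl $32		# trap number (IRQ_OFFSET + 0, the timer)
	jmp _alltraps
```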
Then set up the IDT gates for the IRQs in `trap_init` (in `kern/trap.c`):
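A sketch, assuming `vectors.py` emits a `vectors[]` table of stub addresses as in lab 3:

```c
// Set gates for the IRQs: interrupt gates (istrap == 0) so that
// FL_IF is cleared on entry, not callable from user mode (dpl == 0).
for (i = IRQ_OFFSET; i < IRQ_OFFSET + 16; i++)
	SETGATE(idt[i], 0, GD_KT, vectors[i], 0);
```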
Modifying env_alloc
```c
e->env_tf.tf_eflags |= FL_IF;
```
Note that the processor never pushes an error code when invoking a hardware interrupt handler. You might want to re-read section 9.2 of the 80386 Reference Manual, or section 5.8 of the IA-32 Intel Architecture Software Developer's Manual, Volume 3, at this time.
After doing this exercise, if you run your kernel with any test program that runs for a non-trivial length of time (e.g., `spin`), you should see the kernel print trap frames for hardware interrupts. While interrupts are now enabled in the processor, JOS isn't yet handling them, so you should see it misattribute each interrupt to the currently running user environment and destroy it. Eventually it should run out of environments to destroy and drop into the monitor.
Handling Clock Interrupts
In the `user/spin` program, after the child environment was first run, it just spun in a loop, and the kernel never got control back. We need to program the hardware to generate clock interrupts periodically, which will force control back to the kernel where we can switch control to a different user environment.
The calls to `lapic_init` and `pic_init` (from `i386_init` in `init.c`), which we have written for you, set up the clock and the interrupt controller to generate interrupts. You now need to write the code to handle these interrupts.
Exercise 14. Modify the kernel’s trap_dispatch()
function so that it calls sched_yield()
to find and run a different environment whenever a clock interrupt takes place.
You should now be able to get the `user/spin` test to work: the parent environment should fork off the child, `sys_yield()` to it a couple times but in each case regain control of the CPU after one time slice, and finally kill the child environment and terminate gracefully. A sketch of the timer clause is shown below.
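A minimal sketch of the clause in `trap_dispatch()`, assuming `IRQ_TIMER` from `inc/trap.h` and `lapic_eoi` from `kern/lapic.c`:

```c
// Handle clock interrupts: acknowledge the LAPIC, then reschedule.
if (tf->tf_trapno == IRQ_OFFSET + IRQ_TIMER) {
	lapic_eoi();		// acknowledge the interrupt
	sched_yield();		// pick another environment; never returns
}
```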
This is a great time to do some regression testing. Make sure that you haven't broken any earlier part of the lab that used to work (e.g., `forktree`) by enabling interrupts. Also, try running with multiple CPUs using `make CPUS=2 target`. You should also be able to pass `stresssched` now. Run `make grade` to see for sure. You should now get a total score of 65/80 points on this lab.
Inter-Process communication (IPC)
(Technically in JOS this is “inter-environment communication” or “IEC”, but everyone else calls it IPC, so we’ll use the standard term.)
We’ve been focusing on the isolation aspects of the operating system, the ways it provides the illusion that each program has a machine all to itself. Another important service of an operating system is to allow programs to communicate with each other when they want to. It can be quite powerful to let programs interact with other programs. The Unix pipe model is the canonical example.
There are many models for interprocess communication. Even today there are still debates about which models are best. We won’t get into that debate. Instead, we’ll implement a simple IPC mechanism and then try it out.
IPC in JOS
You will implement a few additional JOS kernel system calls that collectively provide a simple interprocess communication mechanism. You will implement two system calls, `sys_ipc_recv` and `sys_ipc_try_send`. Then you will implement two library wrappers, `ipc_recv` and `ipc_send`.
The “messages” that user environments can send to each other using JOS’s IPC mechanism consist of two components: a single 32-bit value, and optionally a single page mapping. Allowing environments to pass page mappings in messages provides an efficient way to transfer more data than will fit into a single 32-bit integer, and also allows environments to set up shared memory arrangements easily.
Sending and Receiving Messages
To receive a message, an environment calls `sys_ipc_recv`. This system call de-schedules the current environment and does not run it again until a message has been received. When an environment is waiting to receive a message, any other environment can send it a message - not just a particular environment, and not just environments that have a parent/child arrangement with the receiving environment. In other words, the permission checking that you implemented in Part A will not apply to IPC, because the IPC system calls are carefully designed so as to be "safe": an environment cannot cause another environment to malfunction simply by sending it messages (unless the target environment is also buggy).
To try to send a value, an environment calls `sys_ipc_try_send` with both the receiver's environment id and the value to be sent. If the named environment is actually receiving (it has called `sys_ipc_recv` and not gotten a value yet), then the send delivers the message and returns 0. Otherwise the send returns `-E_IPC_NOT_RECV` to indicate that the target environment is not currently expecting to receive a value.
A library function `ipc_recv` in user space will take care of calling `sys_ipc_recv` and then looking up the information about the received values in the current environment's `struct Env`.
Similarly, a library function `ipc_send` will take care of repeatedly calling `sys_ipc_try_send` until the send succeeds.
Transferring Pages
When an environment calls `sys_ipc_recv` with a valid `dstva` parameter (below `UTOP`), the environment is stating that it is willing to receive a page mapping. If the sender sends a page, then that page should be mapped at `dstva` in the receiver's address space. If the receiver already had a page mapped at `dstva`, then that previous page is unmapped.
When an environment calls `sys_ipc_try_send` with a valid `srcva` (below `UTOP`), it means the sender wants to send the page currently mapped at `srcva` to the receiver, with permissions `perm`. After a successful IPC, the sender keeps its original mapping for the page at `srcva` in its address space, but the receiver also obtains a mapping for this same physical page at the `dstva` originally specified by the receiver, in the receiver's address space. As a result this page becomes shared between the sender and receiver.
If either the sender or the receiver does not indicate that a page should be transferred, then no page is transferred. After any IPC the kernel sets the new field `env_ipc_perm` in the receiver's `Env` structure to the permissions of the page received, or zero if no page was received.
Implementing IPC
Exercise 15. Implement `sys_ipc_recv` and `sys_ipc_try_send` in `kern/syscall.c`. Read the comments on both before implementing them, since they have to work together. When you call `envid2env` in these routines, you should set the `checkperm` flag to 0, meaning that any environment is allowed to send IPC messages to any other environment, and the kernel does no special permission checking other than verifying that the target envid is valid.

Then implement the `ipc_recv` and `ipc_send` functions in `lib/ipc.c`.
Use the `user/pingpong` and `user/primes` programs to test your IPC mechanism. `user/primes` will generate for each prime number a new environment until JOS runs out of environments. You might find it interesting to read `user/primes.c` to see all the forking and IPC going on behind the scenes.
Challenge! Why does `ipc_send` have to loop? Change the system call interface so it doesn't have to. Make sure you can handle multiple environments trying to send to one environment at the same time.
Challenge! The prime sieve is only one neat use of message passing between a large number of concurrent programs. Read C. A. R. Hoare, "Communicating Sequential Processes," Communications of the ACM 21(8) (August 1978), 666-677, and implement the matrix multiplication example.
Challenge! One of the most impressive examples of the power of message passing is Doug McIlroy's power series calculator, described in M. Douglas McIlroy, "Squinting at Power Series," Software—Practice and Experience, 20(7) (July 1990), 661-683. Implement his power series calculator and compute the power series for sin(x+x^3).
Challenge! Make JOS’s IPC mechanism more efficient by applying some of the techniques from Liedtke’s paper, Improving IPC by Kernel Design, or any other tricks you may think of. Feel free to modify the kernel’s system call API for this purpose, as long as your code is backwards compatible with what our grading scripts expect.
This ends part C. Make sure you pass all of the `make grade` tests, and don't forget to write up your answers to the questions and a description of your challenge exercise solution in `answers-lab4.txt`.
Before handing in, use `git status` and `git diff` to examine your changes and don't forget to `git add answers-lab4.txt`. When you're ready, commit your changes with `git commit -am 'my solutions to lab 4'`, then `make handin` and follow the directions.