
          A One-Article Guide | How the Linux Kernel Reverse Mapping Mechanism Works


          2022-11-01 19:47

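          When the kernel needs to reclaim an allocated page, it must first remove that page's page-table mappings before the page can be freed. In other words, the kernel needs to know which process virtual address spaces the physical page is mapped into, and that is why the reverse mapping (rmap) mechanism exists. Reverse mapping is generally divided into anonymous-page mapping and file-page mapping; this article covers anonymous-page reverse mapping first.

          Basic data structures

          Members of struct page that are involved in reverse mapping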

          struct page {
              ...
              union {
                  struct address_space *mapping;  /* If low bit clear, points to
                                                   * inode address_space, or NULL.
                                                   * If page mapped as anonymous
                                                   * memory, low bit is set, and
                                                   * it points to anon_vma object:
                                                   * see PAGE_MAPPING_ANON below.
                                                   */
                  ...
                  /* page_deferred_list().next -- second tail page */
              };

              /* Second double word */
              union {
                  pgoff_t index;                  /* Our offset within mapping. */
                  ...
              };
              union {
                  ...
                  atomic_t _mapcount;
                  ...
              };
              ...
          };

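          • mapping
            Because the pointer is at least 4-byte aligned, its lowest two bits are always zero and can be used to tell different kinds of mapping apart. For an anonymous mapping, the lowest bit is set (PAGE_MAPPING_ANON) and the pointer refers to an anon_vma structure; each anonymous page corresponds to exactly one anon_vma. For a file mapping, it points to an address_space structure.

          • index
            The page offset. For an anonymous mapping, index is the page's offset within the virtual memory area described by its vm_area_struct; for a file mapping, index is the page offset within the file of the data held in the physical page.

          • _mapcount
            Records how many times the page is mapped, that is, how many vm_area_struct virtual memory areas it is mapped into. Do not confuse it with map_count in mm_struct, which is the number of vm_area_struct areas in that mm_struct.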
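          The low-bit trick used by mapping can be sketched in user space as follows. This is only an illustration: the demo_* types and helpers are hypothetical stand-ins, not kernel APIs, and only the tag value mirrors PAGE_MAPPING_ANON.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Tag bit for anonymous mappings; mirrors the kernel's PAGE_MAPPING_ANON (0x1). */
#define DEMO_MAPPING_ANON ((uintptr_t)1)

/* Hypothetical stand-ins for the kernel structures; only alignment matters here. */
struct demo_anon_vma { long dummy; };
struct demo_page { void *mapping; };

/* Store an anon_vma pointer in page->mapping with the low bit set. */
static void demo_set_anon(struct demo_page *page, struct demo_anon_vma *av)
{
	/* The trick only works if the low bit of the pointer is free. */
	assert(((uintptr_t)av & DEMO_MAPPING_ANON) == 0);
	page->mapping = (void *)((uintptr_t)av | DEMO_MAPPING_ANON);
}

/* A page is anonymous when the tag bit is set in mapping. */
static bool demo_page_is_anon(const struct demo_page *page)
{
	return ((uintptr_t)page->mapping & DEMO_MAPPING_ANON) != 0;
}

/* Strip the tag bit to recover the real anon_vma pointer, or NULL if not anonymous. */
static struct demo_anon_vma *demo_page_anon_vma(const struct demo_page *page)
{
	if (!demo_page_is_anon(page))
		return NULL;
	return (struct demo_anon_vma *)((uintptr_t)page->mapping & ~DEMO_MAPPING_ANON);
}
```

          Calling demo_set_anon(&page, &av) and then demo_page_anon_vma(&page) gives back &av, while a page whose mapping is NULL is reported as not anonymous.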

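          By convention, struct anon_vma is abbreviated AV, struct anon_vma_chain AVC, and struct vm_area_struct VMA. The path from a page to its VMAs is generally page->AV->AVC->VMA, with the AVC acting as the bridge. As for why the AVC is needed, the main consideration is lookup efficiency when a parent process and many child processes share the same pages; compare with the implementation in the 2.6 kernels for details.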

          struct anon_vma

          struct anon_vma {
              struct anon_vma *root;          /* Root of this anon_vma tree */
              struct rw_semaphore rwsem;      /* W: modification, R: walking the list */
              /*
               * The refcount is taken on an anon_vma when there is no
               * guarantee that the vma of page tables will exist for
               * the duration of the operation. A caller that takes
               * the reference is responsible for clearing up the
               * anon_vma if they are the last user on release
               */
              atomic_t refcount;

              /*
               * Count of child anon_vmas and VMAs which points to this anon_vma.
               *
               * This counter is used for making decision about reusing anon_vma
               * instead of forking new one. See comments in function anon_vma_clone.
               */
              unsigned degree;

              struct anon_vma *parent;        /* Parent of this anon_vma */

              /*
               * NOTE: the LSB of the rb_root.rb_node is set by
               * mm_take_all_locks() _after_ taking the above lock. So the
               * rb_root must only be read/written after taking the above lock
               * to be sure to see a valid next pointer. The LSB bit itself
               * is serialized by a system wide lock only visible to
               * mm_take_all_locks() (mm_all_locks_mutex).
               */
              struct rb_root rb_root;         /* Interval tree of private "related" vmas */
          };

          struct anon_vma_chain

          struct anon_vma_chain {
              struct vm_area_struct *vma;
              struct anon_vma *anon_vma;
              struct list_head same_vma;      /* locked by mmap_sem & page_table_lock */
              struct rb_node rb;              /* locked by anon_vma->rwsem */
              unsigned long rb_subtree_last;
          #ifdef CONFIG_DEBUG_VM_RB
              unsigned long cached_vma_start, cached_vma_last;
          #endif
          };

          Relevant members of struct vm_area_struct

          struct vm_area_struct {
              ...
              struct list_head anon_vma_chain; /* Serialized by mmap_sem &
                                                * page_table_lock */
              struct anon_vma *anon_vma;       /* Serialized by page_table_lock */
              ...
          };

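          The relationship among the structures above is roughly as follows:

          A page finds its AV through mapping. The AV walks the red-black tree rb_root that it manages and visits every AVC node on the tree, and each AVC reaches the corresponding VMA through its vma member pointer. That completes the lookup of the page's mappings. A few points to note:

          1. The VMA also has a list, anon_vma_chain, that manages its AVCs. It is mainly used for bookkeeping between parent and child processes, as described in detail below.

          2. The VMA has a member pointer anon_vma, and the AVC also has a member pointer anon_vma. The AVC acts as the bridge, so it points to both the VMA and the AV; why, then, does the VMA also need to point to the AV directly? During process creation the usual flow is to create a new AV, then the AVC and the VMA, and then call anon_vma_chain_link to tie the three together. But when a VMA has no pages yet and a page fault is triggered, vma->anon_vma gives a quick way to tell whether the VMA has already been set up for anonymous pages, that is, whether an AV is already attached.

          Common interfaces

          • anon_vma_chain_link
            1. Point the vma and anon_vma members of the AVC at the VMA and the AV respectively;
            2. Add the AVC to the VMA's anon_vma_chain list;
            3. Add the AVC to the AV's rb_root red-black tree; all AVCs are normally found by walking this tree.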

          static void anon_vma_chain_link(struct vm_area_struct *vma,
                                          struct anon_vma_chain *avc,
                                          struct anon_vma *anon_vma)
          {
              avc->vma = vma;
              avc->anon_vma = anon_vma;
              list_add(&avc->same_vma, &vma->anon_vma_chain);
              anon_vma_interval_tree_insert(avc, &anon_vma->rb_root);
          }
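          To make the three links concrete, here is a simplified user-space model of the same operation. It is a sketch under stated assumptions: the demo_* names are hypothetical, and plain singly linked lists stand in for both the kernel's list_head and its interval tree.

```c
#include <assert.h>
#include <stddef.h>

struct demo_vma;
struct demo_av;

/* One AVC links exactly one VMA to exactly one AV. */
struct demo_avc {
	struct demo_vma *vma;
	struct demo_av *av;
	struct demo_avc *next_same_vma; /* stands in for the same_vma list */
	struct demo_avc *next_in_av;    /* stands in for the rb_root interval tree */
};

struct demo_vma {
	struct demo_av *anon_vma;
	struct demo_avc *chain;         /* stands in for vma->anon_vma_chain */
};

struct demo_av {
	struct demo_avc *avcs;          /* stands in for anon_vma->rb_root */
};

/* Model of anon_vma_chain_link: fill in the AVC, then hook it into both sides. */
static void demo_chain_link(struct demo_vma *vma, struct demo_avc *avc,
			    struct demo_av *av)
{
	assert(vma != NULL && avc != NULL && av != NULL);
	avc->vma = vma;                 /* step 1: AVC -> VMA and AVC -> AV */
	avc->av = av;
	avc->next_same_vma = vma->chain; /* step 2: add to the VMA's chain */
	vma->chain = avc;
	avc->next_in_av = av->avcs;      /* step 3: add to the AV's collection */
	av->avcs = avc;
}

/* Count the VMAs reachable from an AV by walking its AVCs. */
static int demo_count_vmas(const struct demo_av *av)
{
	int n = 0;

	for (const struct demo_avc *avc = av->avcs; avc != NULL; avc = avc->next_in_av)
		n++;
	return n;
}
```

          The lists lose the interval tree's ability to skip VMAs that cannot contain a given page offset, which is the reason the kernel uses a tree; the linkage itself is the same.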

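          Code implementation

          Reverse mapping is closely tied to copy-on-write between parent and child processes, so we start with how the AV, AVC, and VMA are created when parent and child processes are created.

          1. The parent process creates an anonymous page

          When a page fault is triggered, execution reaches handle_pte_fault. anon_vma_prepare is responsible for creating the AVC and the AV and establishing the links between them; the step that actually associates the newly created page with the AV is done in __page_set_anon_rmap. At that point the reverse mapping for a page newly created by the parent process is complete.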

          int anon_vma_prepare(struct vm_area_struct *vma)
          {
              struct anon_vma *anon_vma = vma->anon_vma;
              struct anon_vma_chain *avc;
              ...
              if (unlikely(!anon_vma)) {
                  struct mm_struct *mm = vma->vm_mm;
                  struct anon_vma *allocated;

                  avc = anon_vma_chain_alloc(GFP_KERNEL);
                  ...
                  anon_vma = find_mergeable_anon_vma(vma);
                  allocated = NULL;
                  if (!anon_vma) {
                      anon_vma = anon_vma_alloc();
                      ...
                      allocated = anon_vma;
                  }

                  anon_vma_lock_write(anon_vma);
                  /* page_table_lock to protect against threads */
                  spin_lock(&mm->page_table_lock);
                  if (likely(!vma->anon_vma)) {
                      vma->anon_vma = anon_vma;
                      anon_vma_chain_link(vma, avc, anon_vma);
                      /* vma reference or self-parent link for new root */
                      anon_vma->degree++;
                      allocated = NULL;
                      avc = NULL;
                  }
                  spin_unlock(&mm->page_table_lock);
                  anon_vma_unlock_write(anon_vma);
                  ...
              }
              return 0;
              ...
          }

          static void __page_set_anon_rmap(struct page *page,
              struct vm_area_struct *vma, unsigned long address, int exclusive)
          {
              struct anon_vma *anon_vma = vma->anon_vma;
              ...
              anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
              page->mapping = (struct address_space *) anon_vma;
              page->index = linear_page_index(vma, address);
          }

          The meaning of index should be clear from the implementation of linear_page_index.

          static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
                                                  unsigned long address)
          {
              pgoff_t pgoff;

              if (unlikely(is_vm_hugetlb_page(vma)))
                  return linear_hugepage_index(vma, address);
              pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
              pgoff += vma->vm_pgoff;
              return pgoff;
          }
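          As a worked example of the arithmetic, the sketch below reimplements the non-hugetlb path in user space together with its inverse, which is the calculation a reverse walk needs to get from a page's index back to a virtual address. The demo_* names and the 4 KiB page size are assumptions made for illustration.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical cut-down VMA: only the fields the arithmetic needs. */
struct demo_vma {
	unsigned long vm_start; /* first address of the area */
	unsigned long vm_end;   /* first address past the area */
	unsigned long vm_pgoff; /* page offset assigned to vm_start */
};

/* Forward direction: virtual address -> page index (non-hugetlb path only). */
static unsigned long demo_linear_page_index(const struct demo_vma *vma,
					    unsigned long address)
{
	assert(address >= vma->vm_start && address < vma->vm_end);
	return ((address - vma->vm_start) >> DEMO_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Inverse direction: page index -> virtual address inside this VMA. */
static unsigned long demo_vma_address(const struct demo_vma *vma,
				      unsigned long pgoff)
{
	assert(pgoff >= vma->vm_pgoff);
	return vma->vm_start + ((pgoff - vma->vm_pgoff) << DEMO_PAGE_SHIFT);
}
```

          With vm_start = 0x400000 and vm_pgoff = 0x400, the address 0x403000 lies three pages into the area, so its index is 0x400 + 3 = 0x403, and converting 0x403 back yields 0x403000 again.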

          2. The parent process creates a child process

          When a parent process creates a child, the child copies the parent's VMAs as its own address space, and parent and child share the same pages until the child writes to its address space. This is copy-on-write (COW). Two things have to happen in this case: 1. the child inherits the parent's AVC, AV, and VMA and the relationships among them; 2. the child creates its own AV, AVC, and VMA.

          This flow is implemented in dup_mm->dup_mmap->anon_vma_fork.

          dup_mmap creates the child's VMAs one by one and copies the information from the corresponding parent VMA.

          anon_vma_clone allocates new AVCs and links the child's VMA to the parent's AV, so the rb tree of the parent's AV now holds the child's AVC, and walking that tree reaches the child's VMA. A VMA can contain many pages, but all pages in the area need only one AV for reverse mapping.

          The anon_vma_clone code is as follows:

          int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
          {
              struct anon_vma_chain *avc, *pavc;
              struct anon_vma *root = NULL;

              list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
                  struct anon_vma *anon_vma;

                  avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
                  if (unlikely(!avc)) {
                      unlock_anon_vma_root(root);
                      root = NULL;
                      avc = anon_vma_chain_alloc(GFP_KERNEL);
                      if (!avc)
                          goto enomem_failure;
                  }
                  anon_vma = pavc->anon_vma;
                  root = lock_anon_vma_root(root, anon_vma);
                  anon_vma_chain_link(dst, avc, anon_vma);

                  /*
                   * Reuse existing anon_vma if its degree lower than two,
                   * that means it has no vma and only one anon_vma child.
                   *
                   * Do not chose parent anon_vma, otherwise first child
                   * will always reuse it. Root anon_vma is never reused:
                   * it has self-parent reference and at least one child.
                   */
                  if (!dst->anon_vma && anon_vma != src->anon_vma &&
                          anon_vma->degree < 2)
                      dst->anon_vma = anon_vma;
              }
              if (dst->anon_vma)
                  dst->anon_vma->degree++;
              unlock_anon_vma_root(root);
              return 0;
              ...
          }
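          The effect of fork on the AV/AVC/VMA graph can be modelled in user space roughly as follows. This is a deliberately reduced sketch: the demo_* names are hypothetical, linked lists replace the interval tree, a fixed pool replaces the slab allocator, and locking, the degree counter, and AV reuse are all left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_POOL_SIZE 8

struct demo_vma;
struct demo_av;

struct demo_avc {
	struct demo_vma *vma;
	struct demo_av *av;
	struct demo_avc *next_same_vma; /* stands in for the same_vma list */
	struct demo_avc *next_in_av;    /* stands in for the AV's rb tree */
};

struct demo_vma { struct demo_av *anon_vma; struct demo_avc *chain; };
struct demo_av { struct demo_avc *avcs; };

/* Fixed-size AVC pool so the sketch needs no dynamic allocation. */
struct demo_pool { struct demo_avc slots[DEMO_POOL_SIZE]; int used; };

/* Take an AVC from the pool and link vma <-> av through it. */
static int demo_link(struct demo_pool *pool, struct demo_vma *vma, struct demo_av *av)
{
	struct demo_avc *avc;

	if (pool->used >= DEMO_POOL_SIZE)
		return -1; /* models the -ENOMEM path */
	avc = &pool->slots[pool->used++];
	avc->vma = vma;
	avc->av = av;
	avc->next_same_vma = vma->chain;
	vma->chain = avc;
	avc->next_in_av = av->avcs;
	av->avcs = avc;
	return 0;
}

/* Model of anon_vma_clone: attach dst to every AV that src is attached to. */
static int demo_clone(struct demo_pool *pool, struct demo_vma *dst,
		      const struct demo_vma *src)
{
	for (const struct demo_avc *p = src->chain; p != NULL; p = p->next_same_vma)
		if (demo_link(pool, dst, p->av) != 0)
			return -1;
	return 0;
}

/* Model of anon_vma_fork: inherit the parent's AVs, then attach the child's own AV. */
static int demo_fork(struct demo_pool *pool, struct demo_vma *child,
		     const struct demo_vma *parent, struct demo_av *child_av)
{
	assert(child->chain == NULL); /* the child VMA must start unlinked */
	if (demo_clone(pool, child, parent) != 0)
		return -1;
	child->anon_vma = child_av;
	return demo_link(pool, child, child_av);
}

/* Does walking this AV reach the given VMA? */
static int demo_av_reaches(const struct demo_av *av, const struct demo_vma *vma)
{
	for (const struct demo_avc *avc = av->avcs; avc != NULL; avc = avc->next_in_av)
		if (avc->vma == vma)
			return 1;
	return 0;
}
```

          After one fork in this model, the parent's AV reaches both the parent's and the child's VMA, while the child's AV reaches only the child's VMA. That asymmetry is what lets a page shared since fork be found in both processes, and a page created by the child after COW be found only in the child.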

          3. The child triggers COW and creates its own anonymous page

          When the newly created child process writes data, a page fault is triggered and wp_page_copy allocates a new page. That page is managed through the child's own AV and AVC, which track the child's own VMA. The write fault is detected in handle_pte_fault:

          if (fe->flags & FAULT_FLAG_WRITE) {
              if (!pte_write(entry))
                  return do_wp_page(fe, entry);
              entry = pte_mkdirty(entry);
          }

          4. Page reclaim: removing the mappings

          When a physical page is reclaimed, try_to_unmap is called to remove the page-table mappings of that page. For an anonymous page, the path is try_to_unmap->rmap_walk->rmap_walk_anon.

          int rmap_walk(struct page *page, struct rmap_walk_control *rwc)
          {
              if (unlikely(PageKsm(page)))
                  return rmap_walk_ksm(page, rwc);
              else if (PageAnon(page))
                  return rmap_walk_anon(page, rwc, false);
              else
                  return rmap_walk_file(page, rwc, false);
          }

          static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
                                    bool locked)
          {
              struct anon_vma *anon_vma;
              pgoff_t pgoff;
              struct anon_vma_chain *avc;
              int ret = SWAP_AGAIN;

              if (locked) {
                  anon_vma = page_anon_vma(page);
                  /* anon_vma disappear under us? */
                  VM_BUG_ON_PAGE(!anon_vma, page);
              } else {
                  anon_vma = rmap_walk_anon_lock(page, rwc);
              }
              ...
              pgoff = page_to_pgoff(page);
              /* Walk the AV's red-black tree to find every AVC that covers pgoff */
              anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
                  /* Reach the VMA through the AVC */
                  struct vm_area_struct *vma = avc->vma;
                  /* address is the virtual address of this page inside the VMA */
                  unsigned long address = vma_address(page, vma);
                  ...
                  ret = rwc->rmap_one(page, vma, address, rwc->arg);
                  ...
              }
              ...
          }
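          The shape of this walk, a range-filtered iteration that hands each matching VMA to a callback, can be sketched in user space like this. The demo_* names, the linked list in place of the interval tree, and the 4 KiB page size are assumptions for illustration; the real walk also deals with locking and with the return codes of rmap_one.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

struct demo_vma { unsigned long vm_start, vm_end, vm_pgoff; };
struct demo_avc { struct demo_vma *vma; struct demo_avc *next; };
struct demo_av { struct demo_avc *avcs; };

/* Callback type, playing the role of rwc->rmap_one; nonzero stops the walk. */
typedef int (*demo_rmap_one_fn)(struct demo_vma *vma, unsigned long address,
				void *arg);

/* The filter the interval tree provides: does this VMA cover page offset pgoff? */
static int demo_vma_covers(const struct demo_vma *vma, unsigned long pgoff)
{
	unsigned long npages = (vma->vm_end - vma->vm_start) >> DEMO_PAGE_SHIFT;

	return pgoff >= vma->vm_pgoff && pgoff - vma->vm_pgoff < npages;
}

/* Visit every VMA on the AV that covers pgoff; returns how many were visited. */
static int demo_rmap_walk_anon(const struct demo_av *av, unsigned long pgoff,
			       demo_rmap_one_fn rmap_one, void *arg)
{
	int visited = 0;

	assert(rmap_one != NULL);
	for (const struct demo_avc *avc = av->avcs; avc != NULL; avc = avc->next) {
		struct demo_vma *vma = avc->vma;
		unsigned long address;

		if (!demo_vma_covers(vma, pgoff))
			continue;
		/* Same arithmetic as turning page->index back into an address. */
		address = vma->vm_start +
			  ((pgoff - vma->vm_pgoff) << DEMO_PAGE_SHIFT);
		visited++;
		if (rmap_one(vma, address, arg) != 0)
			break;
	}
	return visited;
}

/* Example callback: remember the last address it was called with. */
static int demo_record(struct demo_vma *vma, unsigned long address, void *arg)
{
	(void)vma;
	*(unsigned long *)arg = address;
	return 0;
}
```

          With two VMAs on the AV, one covering page offsets 0x400 to 0x403 and one covering 0x500 to 0x501, a walk for offset 0x402 invokes the callback once, for the first VMA only.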

          rmap_one points to try_to_unmap_one. That function is fairly complex, so only the part that clears the page-table entry is excerpted here.

          static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                                      unsigned long address, void *arg)
          {
              struct mm_struct *mm = vma->vm_mm;
              pte_t *pte;
              pte_t pteval;
              spinlock_t *ptl;
              int ret = SWAP_AGAIN;
              struct rmap_private *rp = arg;
              enum ttu_flags flags = rp->flags;

              pte = page_check_address(page, mm, address, &ptl,
                                       PageTransCompound(page));
              ...
              /* Nuke the page table entry. */
              flush_cache_page(vma, address, page_to_pfn(page));
              if (should_defer_flush(mm, flags)) {
                  /*
                   * We clear the PTE but do not flush so potentially a remote
                   * CPU could still be writing to the page. If the entry was
                   * previously clean then the architecture must guarantee that
                   * a clear->dirty transition on a cached TLB entry is written
                   * through and traps if the PTE is unmapped.
                   */
                  pteval = ptep_get_and_clear(mm, address, pte);

                  set_tlb_ubc_flush_pending(mm, page, pte_dirty(pteval));
              } else {
                  pteval = ptep_clear_flush(vma, address, pte);
              }
              ...
          }

          References

          Understanding the Linux Kernel (《深入理解linux內核》)

          Linux kernel 4.9

          Original article: https://zhuanlan.zhihu.com/p/361173109
