
KMP Memory Allocation and GC Optimization: Analysis and Practice


Introduction

The K/N memory manager and GC are broadly similar to those of mainstream virtual machines. The main features are:

  • K/N uses its own custom memory allocator, and each thread has its own TLAB
  • The default garbage collector is a stop-the-world mark and concurrent sweep collector, and the heap is not split into generations
  • Only weak references are currently supported; once the mark phase completes, the GC processes weak references and invalidates references that point to unmarked objects (a small sketch follows this list)
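
As a quick illustration of the weak-reference behaviour in the last bullet, here is a minimal Kotlin sketch. It assumes kotlin.native.ref.WeakReference and the kotlin.native.runtime.GC API (behind @NativeRuntimeApi) available in recent Kotlin/Native versions; whether the referent is actually reclaimed on the first collection depends on reachability and the compiler's analysis.

Code block:

import kotlin.native.ref.WeakReference
import kotlin.native.runtime.GC
import kotlin.native.runtime.NativeRuntimeApi

class Payload(val data: IntArray = IntArray(1024))

// Build the referent in a separate function so no local variable keeps it alive.
fun makeWeak(): WeakReference<Payload> = WeakReference(Payload())

@OptIn(NativeRuntimeApi::class)
fun main() {
    val weak = makeWeak()
    println("before GC: ${weak.get() != null}") // likely true
    GC.collect()                                // request and wait for a collection
    // After the mark phase, references to unmarked objects are invalidated.
    println("after GC: ${weak.get() != null}")  // likely false once Payload is unreachable
}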

To monitor GC performance, set the following compiler option in your Gradle build script.

Code block:

-Xruntime-logs=gc=info

To improve GC performance, you can enable the CMS garbage collector in the Gradle build configuration; it runs the marking of live objects concurrently with the application threads and shortens GC pauses.

Code block:

kotlin.native.binary.gc=cms
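
For reference, one way to wire this up in a Gradle Kotlin DSL build script is sketched below. This is an assumption about the project layout rather than something from the original article; the target selector and where the option is attached may differ in your build, and the CMS switch itself is usually the gradle.properties line shown above.

Code block:

// build.gradle.kts (sketch): turn on GC logging for all Kotlin/Native binaries.
import org.jetbrains.kotlin.gradle.plugin.mpp.KotlinNativeTarget

kotlin {
    targets.withType<KotlinNativeTarget>().configureEach {
        binaries.all {
            // Same flag as above: log GC phases and pauses at info level.
            freeCompilerArgs += "-Xruntime-logs=gc=info"
        }
    }
}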

Judging from the documentation, the memory allocator is already fairly mature, but GC performance lags: the default collector is stop-the-world, and CMS still has to be enabled manually. Let's look at the code.

Runtime

From a captured KMP trace we can see the runtime entry point.

  • On HarmonyOS the linker is ld-musl-aarch64.so; it loads libbenchmark.so, which is the KMP build artifact
  • It then runs the workerRoutine function, which is the Runtime entry point

The Douyin repository uses Kotlin 2.0.20; the workerRoutine code lives in Worker.cpp in the kotlin-native project.

  • It first calls Kotlin_initRuntimeIfNeeded to initialize the Runtime
  • It then calls processQueueElement in a do/while loop to process jobs, much like a message loop

Code block:

void* workerRoutine(void* argument){
  Worker* worker = reinterpret_cast<Worker*>(argument);


// Kotlin_initRuntimeIfNeeded calls WorkerInit that needs
// to see there's already a worker created for this thread.
  ::g_worker = worker;
  Kotlin_initRuntimeIfNeeded();


// Only run this routine in the runnable state. The moment between this routine exiting and thread
// destructors running will be spent in the native state. `Kotlin_deinitRuntimeCallback` ensures
// that runtime deinitialization switches back to the runnable state.
kotlin::ThreadStateGuard guard(worker->memoryState(), ThreadState::kRunnable);


do {
if (worker->processQueueElement(true) == JOB_TERMINATE) break;
  } while (true);


  return nullptr;
}

Kotlin_initRuntimeIfNeeded calls initRuntime. Each thread has its own runtimeState variable, and checking its state prevents initRuntime from being called more than once.

Code block:

RUNTIME_NOTHROW void Kotlin_initRuntimeIfNeeded() {
  if (!isValidRuntime()) {
    initRuntime();
    // Register runtime deinit function at thread cleanup.
    konan::onThreadExit(Kotlin_deinitRuntimeCallback, runtimeState);
  }
}


THREAD_LOCAL_VARIABLE RuntimeState* runtimeState = kInvalidRuntime;
inline bool isValidRuntime() {
  return ::runtimeState != kInvalidRuntime;
}

initRuntime does the following:

  • SetKonanTerminateHandler installs an exception handler for the thread so that Kotlin exceptions can be caught
  • Sets runtimeState
  • initializeGlobalRuntimeIfNeeded initializes the global runtime state
  • InitMemory initializes the thread's memory allocator
  • WorkerInit initializes the worker

Code block:

RuntimeState* initRuntime(){
  SetKonanTerminateHandler();


  RuntimeState* result = new RuntimeState();
  if (!result) return kInvalidRuntime;
  ::runtimeState = result;


  bool firstRuntime = initializeGlobalRuntimeIfNeeded();
  result->memoryState = InitMemory();
  // Switch thread state because worker and globals inits require the runnable state.
  // This call may block if GC requested suspending threads.
  ThreadStateGuard stateGuard(result->memoryState, kotlin::ThreadState::kRunnable);
  result->worker = WorkerInit(result->memoryState);
  result->status = RuntimeStatus::kRunning;


  return result;
}

The initRuntime flow is shown in the figure below; we analyze each step in turn.

ExceptionHandler

SetKonanTerminateHandler installs TerminateHandler, which calls std::set_terminate to register kotlinHandler for exception handling.

Code block:

// Use one public function to limit access to the class declaration
void SetKonanTerminateHandler() {
  TerminateHandler::install();
}


/// Use machinery like Meyers singleton to provide thread safety
TerminateHandler()
  : queuedHandler_((QH)std::set_terminate(kotlinHandler)) {}

GlobalData

initializeGlobalRuntimeIfNeeded calls initGlobalMemory to initialize GlobalData. GlobalData holds the allocator_ memory allocator, the gc_ garbage collector, the threadRegistry_ thread list, and more. GlobalData is a global object shared by all threads; ThreadData, covered below, is per-thread.

Code block:

void kotlin::initGlobalMemory() noexcept {
    mm::GlobalData::init();
}


// Global (de)initialization is undefined in C++. Use single global singleton to define it for simplicity.
class GlobalData : private Pinned {
public:
    ThreadRegistry& threadRegistry() noexcept { return threadRegistry_; }
    GlobalsRegistry& globalsRegistry() noexcept { return globalsRegistry_; }
    SpecialRefRegistry& specialRefRegistry() noexcept { return specialRefRegistry_; }
    gcScheduler::GCScheduler& gcScheduler() noexcept { return gcScheduler_; }
    alloc::Allocator& allocator() noexcept { return allocator_; }
    gc::GC& gc() noexcept { return gc_; }

ThreadData

InitMemory calls RegisterCurrentThread on the global ThreadRegistry described above, which creates a ThreadData and registers it in the list_ so that the GC can reach the GC roots held in each ThreadData. currentThreadDataNode_ is a thread-local variable, so every thread has its own copy.

Code block:

extern "C" MemoryState* InitMemory() {
    mm::GlobalData::waitInitialized();
    return mm::ToMemoryState(mm::ThreadRegistry::Instance().RegisterCurrentThread());
}


mm::ThreadRegistry::Node* mm::ThreadRegistry::RegisterCurrentThread() noexcept {
    auto lock = list_.LockForIter();
    auto* threadDataNode = list_.Emplace(konan::currentThreadId());
    Node*& currentDataNode = currentThreadDataNode_;
    currentDataNode = threadDataNode;
    threadDataNode->Get()->gc().onThreadRegistration();
    return threadDataNode;
}
// static
THREAD_LOCAL_VARIABLE mm::ThreadRegistry::Node* mm::ThreadRegistry::currentThreadDataNode_ = nullptr;

ThreadData holds threadId_, allocator_, gc_, and so on. There is one instance per thread, and because allocator_ is per-thread this effectively implements a TLAB.

Code block:

// `ThreadData` is supposed to be thread local singleton.
// Pin it in memory to prevent accidental copying.
class ThreadData final : private Pinned {
public:
    explicit ThreadData(int threadId) noexcept :
        threadId_(threadId),
        globalsThreadQueue_(GlobalsRegistry::Instance()),
        specialRefRegistry_(SpecialRefRegistry::instance()),
        gcScheduler_(GlobalData::Instance().gcScheduler(), *this),
        allocator_(GlobalData::Instance().allocator()),
        gc_(GlobalData::Instance().gc(), *this),
        suspensionData_(ThreadState::kNative, *this){}

To summarize, ThreadData gives each thread its own memory allocator and GC state; the allocator is analyzed below.

WorkerInit

WorkerInit sets the Worker's thread_ field to the current thread. workerRoutine runs on a new thread_ created with pthread_create. Threads are created from Kotlin or C++ code, and once a thread is created it calls initRuntime to initialize itself.

Code block:

Worker* WorkerInit(MemoryState* memoryState){
  Worker* worker;
  if (::g_worker != nullptr) {
      worker = ::g_worker;
  } else {
      worker = theState()->addWorkerUnlocked(workerExceptionHandling(), nullptr, WorkerKind::kOther);
      ::g_worker = worker;
  }
  worker->setThread(pthread_self());
  worker->setMemoryState(memoryState);
  return worker;
}


void Worker::startEventLoop() {
  kotlin::ThreadStateGuard guard(ThreadState::kNative);
  pthread_create(&thread_, nullptr, workerRoutine, this);
}

One question remains: workerRoutine initializes the runtime for worker threads, but where does the Runtime actually get initialized in general?

The CodeGenerator lowers each function's Kotlin IR to LLVM IR and inserts a call to initRuntimeIfNeeded along the way, so initRuntimeIfNeeded runs before each function body executes.

Code block:

if (needsRuntimeInit || switchToRunnable) {
    check(!forbidRuntime) { "Attempt to init runtime where runtime usage is forbidden" }
    call(llvm.initRuntimeIfNeeded, emptyList())
}

That concludes the Runtime; next we look at the allocator_ memory allocator.

Memory allocation

K/N has three memory allocators:

  • Custom: K/N's own allocator, and the default one
  • Std: the standard library allocator, which on HarmonyOS is jemalloc
  • Mimalloc: mimalloc, Microsoft's open-source native allocator

Each allocator implements an Allocator::ThreadData::Impl class; for example, CustomAllocator backs the Custom allocator. This is how allocator_ is bound to a concrete allocator implementation.

Code block:

class Allocator::ThreadData::Impl : private Pinned {
public:
    explicit Impl(Allocator::Impl& allocator) noexcept : alloc_(allocator.heap()) {}


    alloc::CustomAllocator& alloc() noexcept { return alloc_; }


private:
    CustomAllocator alloc_;
};




ALWAYS_INLINE ObjHeader* alloc::Allocator::ThreadData::allocateObject(const TypeInfo* typeInfo) noexcept {
    return impl_->alloc().CreateObject(typeInfo);
}

We focus on the Custom allocator. Each thread has its own ThreadData, which creates its own allocator_. The allocator_ requests a page from the heap (for example 256 KiB for small and medium objects), and that page then serves allocations inside the thread. Let's look at the code.

Memory creation

SafeAlloc in GCApi.cpp calls mmap to create virtual memory.

  • allocatedBytesCounter tracks the total number of bytes allocated
  • OnMemoryAllocation checks whether an allocation-triggered GC is needed

Code block:

void* SafeAlloc(uint64_t size)noexcept{
    void* memory;
    bool error;
    if (compiler::disableMmap()) {
        memory = calloc(size, 1);
        error = memory == nullptr;
    } else {
#if KONAN_WINDOWS
        RuntimeFail("mmap is not available on mingw");
#elif KONAN_LINUX || KONAN_OHOS
        memory = mmap(nullptr, size, PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE | MAP_POPULATE, -1, 0);
        error = memory == MAP_FAILED;
#else
        memory = mmap(nullptr, size, PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE, -1, 0);
        error = memory == MAP_FAILED;
#endif
    }
    if (error) {
        konan::consoleErrorf("Out of memory trying to allocate %" PRIu64 "bytes: %s. Aborting.\n", size, strerror(errno));
        std::abort();
    }
    auto previousSize = allocatedBytesCounter.fetch_add(static_cast<size_t>(size), std::memory_order_relaxed);
    OnMemoryAllocation(previousSize + static_cast<size_t>(size));
    return memory;
}

OnMemoryAllocation uses HeapGrowthController's boundaryForHeapSize to check whether totalAllocatedBytes has crossed the GC threshold; this is analyzed later.

Code block:

void kotlin::OnMemoryAllocation(size_t totalAllocatedBytes) noexcept {
    mm::GlobalData::Instance().gcScheduler().setAllocatedBytes(totalAllocatedBytes);
}


void setAllocatedBytes(size_t bytes) noexcept {
    // Still checking allocations: with a long running loop all safepoints
    // might be "met", so that's the only trigger to not run out of memory.
    auto boundary = heapGrowthController_.boundaryForHeapSize(bytes);
    switch (boundary) {
        case HeapGrowthController::MemoryBoundary::kNone:
            safePoint();
            return;
        case HeapGrowthController::MemoryBoundary::kTrigger:
            RuntimeLogDebug({kTagGC}, "Scheduling GC by allocation");
            scheduleGC_.scheduleNextEpochIfNotInProgress();
            return;
        case HeapGrowthController::MemoryBoundary::kTarget:
            RuntimeLogDebug({kTagGC}, "Scheduling GC by allocation");
            auto epoch = scheduleGC_.scheduleNextEpochIfNotInProgress();
            RuntimeLogWarning({kTagGC}, "Pausing the mutators");
            mutatorAssists_.requestAssists(epoch);
            return;
    }
}

The Custom allocator allocates memory through CreateObject and CreateArray.

  • CreateObject allocates an object; if the class (typeInfo) carries the TF_HAS_FINALIZER flag, an extraObject is created that adds a weak reference to the object so that finalize can be called after GC (analyzed later)
  • CreateArray allocates an array

Code block:

ObjHeader* CustomAllocator::CreateObject(const TypeInfo* typeInfo)noexcept{
    RuntimeAssert(!typeInfo->IsArray(), "Must not be an array");
    auto descriptor = HeapObject::make_descriptor(typeInfo);
    auto& heapObject = *descriptor.construct(Allocate(descriptor.size()));
    ObjHeader* object = heapObject.header(descriptor).object();
    if (typeInfo->flags_ & TF_HAS_FINALIZER) {
        auto* extraObject = CreateExtraObject();
        object->typeInfoOrMeta_ = reinterpret_cast<TypeInfo*>(new (extraObject) mm::ExtraObjectData(object, typeInfo));
    } else {
        object->typeInfoOrMeta_ = const_cast<TypeInfo*>(typeInfo);
    }
    return object;
}


ArrayHeader* CustomAllocator::CreateArray(const TypeInfo* typeInfo, uint32_t count)noexcept{
    RuntimeAssert(typeInfo->IsArray(), "Must be an array");
    auto descriptor = HeapArray::make_descriptor(typeInfo, count);
    CustomAllocDebug("CustomAllocator@%p::CreateArray(%d), total size:%ld", this ,count, (long)descriptor.size());
    auto& heapArray = *descriptor.construct(Allocate(descriptor.size()));
    ArrayHeader* array = heapArray.header(descriptor).array();
    array->typeInfoOrMeta_ = const_cast<TypeInfo*>(typeInfo);
    array->count_ = count;
    return array;
}

Object size is computed via HeapObject, which consists of three parts: ObjectData, ObjHeader, and ObjectBody.

Code block:

struct HeapObjHeader {
    using descriptor = type_layout::Composite<HeapObjHeader, gc::GC::ObjectData, ObjHeader>;
struct HeapObject {
    using descriptor = type_layout::Composite<HeapObject, HeapObjHeader, ObjectBody>;

Arrays are computed via HeapArray, which consists of ObjectData, ArrayHeader, and ArrayBody.

Code block:

struct HeapArrayHeader {
    using descriptor = type_layout::Composite<HeapArrayHeader, gc::GC::ObjectData, ArrayHeader>;
    // Header of value type array objects. Keep layout in sync with that of object header.
struct ArrayHeader {
  TypeInfo* typeInfoOrMeta_;


  // Elements count. Element size is stored in instanceSize_ field of TypeInfo, negated.
  uint32_t count_;
};
struct HeapArray {
    using descriptor = type_layout::Composite<HeapArray, HeapArrayHeader, ArrayBody>;

The concrete layout is shown in the figure below:

Finally, the Allocate method decides which kind of page to use; we analyze this next:

Code block:

uint8_t* CustomAllocator::Allocate(uint64_t size)noexcept{
    RuntimeAssert(size, "CustomAllocator::Allocate cannot allocate 0 bytes");
    //CustomAllocDebug("CustomAllocator::Allocate(%" PRIu64 ")", size);
    uint64_t cellCount = (size + sizeof(Cell) - 1) / sizeof(Cell);
    if (cellCount <= FIXED_BLOCK_PAGE_MAX_BLOCK_SIZE) {
        return AllocateInFixedBlockPage(cellCount);
    } else if (cellCount > NEXT_FIT_PAGE_MAX_BLOCK_SIZE) {
        return AllocateInSingleObjectPage(cellCount);
    } else {
        return AllocateInNextFitPage(cellCount);
    }
}
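
To make the cell arithmetic concrete, here is a small illustrative Kotlin model of the selection above (not the runtime's actual code). The 8-byte cell and the 128-cell fixed-block limit come from this article; the exact NextFitPage limit is not quoted here, so it is written as an assumed constant close to the 256 KiB page capacity.

Code block:

// Illustrative model of CustomAllocator::Allocate's page selection.
private const val CELL_BYTES = 8                             // one cell is 8 bytes
private const val FIXED_BLOCK_PAGE_MAX_BLOCK_SIZE = 128      // up to 128 cells (1 KiB)
// Assumption: roughly the 256 KiB page capacity in cells, minus headers.
private const val NEXT_FIT_PAGE_MAX_BLOCK_SIZE = 256 * 1024 / CELL_BYTES - 2

enum class PageKind { FIXED_BLOCK, NEXT_FIT, SINGLE_OBJECT }

fun pageKindFor(sizeBytes: Long): PageKind {
    val cellCount = (sizeBytes + CELL_BYTES - 1) / CELL_BYTES // round up to whole cells
    return when {
        cellCount <= FIXED_BLOCK_PAGE_MAX_BLOCK_SIZE -> PageKind.FIXED_BLOCK // small: 8 B .. 1 KiB
        cellCount > NEXT_FIT_PAGE_MAX_BLOCK_SIZE -> PageKind.SINGLE_OBJECT   // large: > ~256 KiB
        else -> PageKind.NEXT_FIT                                            // medium: 1 KiB .. ~256 KiB
    }
}

fun main() {
    println(pageKindFor(64L))          // FIXED_BLOCK
    println(pageKindFor(4L * 1024))    // NEXT_FIT
    println(pageKindFor(512L * 1024))  // SINGLE_OBJECT
}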

Small object allocation

This path serves objects from 8 bytes to 1 KiB. FIXED_BLOCK_PAGE_MAX_BLOCK_SIZE = 128, so whenever the number of cells needed (one cell is 8 bytes) is at most 128, a FixedBlockPage is used; each page is 256 KiB by default.

Code block:

FixedBlockPage* FixedBlockPage::Create(uint32_t blockSize)noexcept{
    CustomAllocInfo("FixedBlockPage::Create(%u)", blockSize);
    RuntimeAssert(blockSize <= FIXED_BLOCK_PAGE_MAX_BLOCK_SIZE, "blockSize too large for FixedBlockPage");
    return new (SafeAlloc(FIXED_BLOCK_PAGE_SIZE)) FixedBlockPage(blockSize);
}
inline constexpr const size_t FIXED_BLOCK_PAGE_SIZE = (256 * KiB);

blockSize is the size of each block, between 1 and 128 cells.

Code block:

FixedBlockPage::FixedBlockPage(uint32_t blockSize) noexcept : blockSize_(blockSize) {
    CustomAllocInfo("FixedBlockPage(%p)::FixedBlockPage(%u)", this, blockSize);
    nextFree_.first = 0;
    nextFree_.last = FIXED_BLOCK_PAGE_CELL_COUNT / blockSize * blockSize;
    end_ = FIXED_BLOCK_PAGE_CELL_COUNT / blockSize * blockSize;
}

TryAllocate returns a block of a fixed size (blockSize cells, between 1 and 128) on each call.

Code block:

uint8_t* FixedBlockPage::TryAllocate() noexcept {
    uint32_t next = nextFree_.first;
    if (next < nextFree_.last) {
        nextFree_.first += blockSize_;
        return cells_[next].data;
    }
    if (next >= end_) return nullptr;
    nextFree_ = cells_[next].nextFree;
    memset(&cells_[next], 0, sizeof(cells_[next]));
    return cells_[next].data;
}

Medium object allocation

This path serves objects from 1 KiB to 256 KiB. NextFitPage differs from FixedBlockPage: it also allocates 256 KiB of memory, but one page can hold blocks with different cell counts, whereas a FixedBlockPage only hands out blocks of one fixed cell count.

Code block:

NextFitPage* NextFitPage::Create(uint32_t cellCount) noexcept {
    CustomAllocInfo("NextFitPage::Create(%u)", cellCount);
    RuntimeAssert(cellCount < NEXT_FIT_PAGE_CELL_COUNT, "cellCount is too large for NextFitPage");
    return new (SafeAlloc(NEXT_FIT_PAGE_SIZE)) NextFitPage(cellCount);
}
inline constexpr const size_t NEXT_FIT_PAGE_SIZE = (256 * KiB);

cells_ holds the page's cells, numbered from 0 to cellCount - 1.

Code block:

NextFitPage::NextFitPage(uint32_t cellCount) noexcept : curBlock_(cells_) {
    cells_[0] = Cell(0); // Size 0 ensures any actual use would break
    cells_[1] = Cell(NEXT_FIT_PAGE_CELL_COUNT - 1);
}

Each allocation carves blockSize cells out of curBlock_; if the current block is too small, a new current block is chosen for that blockSize.

Code block:

uint8_t* NextFitPage::TryAllocate(uint32_t blockSize)noexcept{
    CustomAllocDebug("NextFitPage@%p::TryAllocate(%u)", this, blockSize);
    // +1 accounts for header, since cell->size also includes header cell
    uint32_t cellsNeeded = blockSize + 1;
    uint8_t* block = curBlock_->TryAllocate(cellsNeeded);
    if (block) return block;
    UpdateCurBlock(cellsNeeded);
    return curBlock_->TryAllocate(cellsNeeded);
}

Large object allocation

A SingleObjectPage holds exactly one object of the requested size; it is used mainly for large objects over 256 KiB.

Code block:

SingleObjectPage* SingleObjectPage::Create(uint64_t cellCount)noexcept{
    CustomAllocInfo("SingleObjectPage::Create(%" PRIu64 ")", cellCount);
    RuntimeAssert(cellCount > NEXT_FIT_PAGE_MAX_BLOCK_SIZE, "blockSize too small for SingleObjectPage");
    uint64_t size = sizeof(SingleObjectPage) + cellCount * sizeof(uint64_t);
    return new (SafeAlloc(size)) SingleObjectPage(size);
}

Finalizable objects

Regardless of the object type, if it needs finalization, CreateObject allocates a 24-byte ExtraObjectData through the extra-object allocator.

An ExtraObjectPage allocates 64 KiB of memory.

Code block:

ExtraObjectPage* ExtraObjectPage::Create(uint32_t ignored)noexcept{
    CustomAllocInfo("ExtraObjectPage::Create()");
    return new (SafeAlloc(EXTRA_OBJECT_PAGE_SIZE)) ExtraObjectPage();
}


// Optional data that's lazily allocated only for objects that need it.
class ExtraObjectData : private Pinned {
private:
    // Must be first to match `TypeInfo` layout.
    const TypeInfo* typeInfo_;
    std::atomic<uint32_t> flags_ = 0;
    std::atomic<ObjHeader*> weakReferenceOrBaseObject_;

nextFree_ stores the address of the next free cell, and the page is initialized with EXTRA_OBJECT_COUNT cells linked together.

Code block:

ExtraObjectPage::ExtraObjectPage() noexcept {
    nextFree_.store(cells_, std::memory_order_relaxed);
    ExtraObjectCell* end = cells_ + EXTRA_OBJECT_COUNT;
    for (ExtraObjectCell* cell = cells_; cell < end; cell = cell->next_.load(std::memory_order_relaxed)) {
        cell->next_.store(cell + 1, std::memory_order_relaxed);
    }
}

TryAllocate hands out one cell per call.

Code block:

mm::ExtraObjectData* ExtraObjectPage::TryAllocate()noexcept{
    auto* next = nextFree_.load(std::memory_order_relaxed);
    if (next >= cells_ + EXTRA_OBJECT_COUNT) {
        return nullptr;
    }
    ExtraObjectCell* freeBlock = next;
    nextFree_.store(freeBlock->next_.load(std::memory_order_relaxed), std::memory_order_relaxed);
    CustomAllocDebug("ExtraObjectPage(%p)::TryAllocate() = %p", this, freeBlock->Data());
    return freeBlock->Data();
}

FinalizerQueue stores objects that need finalization; after GC the queue is traversed and each object's finalize method is called.

Code block:

class CustomAllocator {
private:
    uint8_t* Allocate(uint64_t cellCount)noexcept;
    uint8_t* AllocateInSingleObjectPage(uint64_t cellCount)noexcept;
    uint8_t* AllocateInNextFitPage(uint32_t cellCount)noexcept;
    uint8_t* AllocateInFixedBlockPage(uint32_t cellCount)noexcept;


    Heap& heap_;
    NextFitPage* nextFitPage_;
    FixedBlockPage* fixedBlockPages_[FIXED_BLOCK_PAGE_MAX_BLOCK_SIZE + 1];
    ExtraObjectPage* extraObjectPage_;
    FinalizerQueue finalizerQueue_;

To summarize, the custom allocator has four allocation paths: FixedBlockPage and NextFitPage for small and medium objects, SingleObjectPage for large objects, and ExtraObjectPage for the extra data of objects that need finalization.

A brief summary is shown in the figure below:

FixedBlockPage, SingleObjectPage, and NextFitPage all define a Sweep method used to reclaim memory during GC, and the different GC algorithms call the same sweep methods. Let's move on to the GC.

GC

There are three GC variants; the default is pmcs, and cms has to be enabled manually.

  • cms marks concurrently and only pauses threads while collecting GC roots, so it performs best
  • stwms stops the world for the whole collection and performs poorly
  • the default, pmcs, supports parallel (multi-threaded) marking but still stops the world

stwms is the earliest collector and cms is the newest; let's look at each in the code.

stwms

GCImpl.cpp is the interface that every garbage collector implements. It has a few parts:

  • SameThreadMarkAndSweep gc_: the collection itself is done by SameThreadMarkAndSweep
  • gcScheduler: the scheduling policy, analyzed later

Code block:

class GC::Impl : private Pinned {
public:
    explicit Impl(alloc::Allocator& allocator, gcScheduler::GCScheduler& gcScheduler) noexcept : gc_(allocator, gcScheduler) {}


    SameThreadMarkAndSweep& gc() noexcept { return gc_; }


private:
    SameThreadMarkAndSweep gc_;
};

The SameThreadMarkAndSweep constructor creates the "GC thread" and uses state_.waitScheduled in a loop to decide whether to call PerformFullGC; state_ is a GCStateHolder.

Code block:

gc::SameThreadMarkAndSweep::SameThreadMarkAndSweep(alloc::Allocator& allocator, gcScheduler::GCScheduler& gcScheduler) noexcept :


    allocator_(allocator), gcScheduler_(gcScheduler), finalizerProcessor_([this](int64_t epoch) noexcept {
        GCHandle::getByEpoch(epoch).finalizersDone();
        state_.finalized(epoch);
    }) {
    gcThread_ = ScopedThread(ScopedThread::attributes().name("GC thread"), [this] {
        while (true) {
            auto epoch = state_.waitScheduled();
            if (epoch.has_value()) {
                PerformFullGC(*epoch);
            } else {
                break;
            }
        }
    });
}

PerformFullGC mainly does the following:

  • stopTheWorld suspends all threads
  • collectRootSet collects the GC roots
  • Mark marks live objects starting from the GC roots
  • processWeaks handles weak references
  • prepareForGC tells each thread's CustomAllocator to drop its page references, preparing for the sweep of live objects
  • heap.Sweep frees the objects that are no longer alive
  • resumeTheWorld resumes the threads
  • finalizerProcessor calls the objects' finalize methods; the finalizer queues of all threads are merged beforehand

Code block:

void gc::SameThreadMarkAndSweep::PerformFullGC(int64_t epoch) noexcept {
    stopTheWorld(gcHandle, "GC stop the world");
    
    gc::collectRootSet<internal::MarkTraits>(gcHandle, markQueue_, [](mm::ThreadData&) { return true; });
    gc::Mark<internal::MarkTraits>(gcHandle, markQueue_);
    gc::processWeaks<DefaultProcessWeaksTraits>(gcHandle, mm::SpecialRefRegistry::instance());


    // This should really be done by each individual thread while waiting
    int threadCount = 0;
    for (auto& thread : kotlin::mm::ThreadRegistry::Instance().LockForIter()) {
        thread.allocator().prepareForGC();
        ++threadCount;
    }
    allocator_.prepareForGC();


    // also sweeps extraObjects
    auto finalizerQueue = allocator_.impl().heap().Sweep(gcHandle);
    for (auto& thread : kotlin::mm::ThreadRegistry::Instance().LockForIter()) {
        finalizerQueue.mergeFrom(thread.allocator().impl().alloc().ExtractFinalizerQueue());
    }
    finalizerQueue.mergeFrom(allocator_.impl().heap().ExtractFinalizerQueue());


    resumeTheWorld(gcHandle);
    
    finalizerProcessor_.ScheduleTasks(std::move(finalizerQueue.regular), epoch);
    mainThreadFinalizerProcessor_.schedule(std::move(finalizerQueue.mainThread), epoch);
}

The overall flow is shown in the figure below:

collectRootSet uses collectRootSetForThread to gather GC roots from each thread's stack and TLS, and collectRootSetGlobals to gather the roots from statics and JNI-style calls; everything ends up in markQueue.

Code block:

// TODO: This needs some tests now.
template <typename Traits, typename F>
void collectRootSet(GCHandle handle, typename Traits::MarkQueue& markQueue, F&& filter) noexcept {
    Traits::clear(markQueue);
    for (auto& thread : mm::GlobalData::Instance().threadRegistry().LockForIter()) {
        if (!filter(thread))
            continue;
        thread.Publish();
        collectRootSetForThread<Traits>(handle, markQueue, thread);
    }
    collectRootSetGlobals<Traits>(handle, markQueue);
}

Mark pops live objects from markQueue and calls processInMark to process their fields.

Code block:

template <typename Traits>
void Mark(GCHandle::GCMarkScope& markHandle, typename Traits::MarkQueue& markQueue) noexcept {
    while (ObjHeader* top = Traits::tryDequeue(markQueue)) {
        markHandle.addObject();


        Traits::processInMark(markQueue, top);


        // TODO: Consider moving it before processInMark to make the latter something of a tail call.
        if (auto* extraObjectData = mm::ExtraObjectData::Get(top)) {
            internal::processExtraObjectData<Traits>(markHandle, markQueue, *extraObjectData, top);
        }
    }
}

Unlike on Android, KMP uses static code analysis to decide whether an object is allocated on the stack or on the heap.

Objects allocated on the stack can be reclaimed when the call returns. field->heap() tells whether a variable lives on the heap or on the stack, and stack objects do not need to be pushed onto markQueue.

Code block:

template <typename Traits>
void processFieldInMark(void* state, ObjHeader* object, ObjHeader* field) noexcept {
    auto& markQueue = *static_cast<typename Traits::MarkQueue*>(state);
    if (field->heap()) {
        Traits::tryEnqueue(markQueue, field);
    }
    if constexpr (!Traits::kAllowHeapToStackRefs) {
        if (object->heap()) {
            RuntimeAssert(!field->local(), "Heap object %p references stack object %p[typeInfo=%p]", object, field, field->type_info());
        }
    }
}

tryEnqueue takes the object's ObjectData (as analyzed above, the 8 bytes at the start of every object) and pushes it onto the queue via tryPush.

Code block:

static ALWAYS_INLINE bool tryEnqueue(AnyQueue& queue, ObjHeader* object) noexcept {
    auto& objectData = alloc::objectDataForObject(object);
    bool pushed = queue.tryPush(objectData);
    return pushed;
}

The queue is implemented as an intrusive linked list whose links are the next_ field inside ObjectData. If an object's next_ is already set, it has been marked and the push returns immediately; during sweep, objects whose next_ is set are not freed.

Code block:

std::optional<iterator> try_insert_after(iterator pos, reference value) noexcept {
    RuntimeAssert(pos != end(), "Attempted to try_insert_after end()");
    RuntimeAssert(pos != iterator(), "Attempted to try_insert_after empty iterator");
    if (!trySetNext(&value, next(pos.node_))) {
        return std::nullopt;
    }
    setNext(pos.node_, &value);
    return iterator(&value);
}


void setNext(ObjectData* next) noexcept {
    RuntimeAssert(next, "next cannot be nullptr");
    next_.store(next, std::memory_order_relaxed);
}
bool trySetNext(ObjectData* next) noexcept {
    RuntimeAssert(next, "next cannot be nullptr");
    ObjectData* expected = nullptr;
    return next_.compare_exchange_strong(expected, next, std::memory_order_relaxed);
}

The overall logic is shown in the figure below:

From the code, the stwms logic is very complete, but stopping the world pauses threads and hurts performance; pmcs is implemented much the same way.

Let's now look at how cms removes the stop-the-world pause.

cms

From the code, cms only stops the world while traversing the GC roots; the main logic is in markDispatcher_.runMainInSTW.

Code block:

void gc::ConcurrentMarkAndSweep::PerformFullGC(int64_t epoch) noexcept {
    std::unique_lock mainGCLock(gcMutex);
    auto gcHandle = GCHandle::create(epoch);


    stopTheWorld(gcHandle, "GC stop the world #1: collect root set");


    auto& scheduler = gcScheduler_;
    scheduler.onGCStart();


    state_.start(epoch);


    markDispatcher_.runMainInSTW();

After completeMutatorsRootSet has collected the GC roots, resumeTheWorld wakes the threads up, so the subsequent Mark phase no longer pauses them. Objects created during the Mark phase are treated as live.

Code block:

void gc::mark::ConcurrentMark::runMainInSTW() {
    ParallelProcessor::Worker mainWorker(*parallelProcessor_);


    // create mutator mark queues
    for (auto& thread : *lockedMutatorsList_) {
        thread.gc().impl().gc().mark().markQueue().construct(*parallelProcessor_);
    }
    completeMutatorsRootSet(mainWorker);


    // global root set must be collected after all the mutator's global data have been published
    collectRootSetGlobals<MarkTraits>(gcHandle(), mainWorker);
    
    barriers::enableBarriers(gcHandle().getEpoch());
    resumeTheWorld(gcHandle());

The flow is shown in the diagram below:

GCScheduler

The default mode is adaptive: the "GC Timer thread" triggers GC periodically while the app is in the foreground, at the interval given by config_.regularGcInterval (10 s by default).

Code block:

class GCSchedulerDataAdaptive {
public:
    GCSchedulerDataAdaptive(GCSchedulerConfig& config, std::function<int64_t()> scheduleGC) noexcept :
        config_(config),
        scheduleGC_(std::move(scheduleGC)),
        appStateTracking_(mm::GlobalData::Instance().appStateTracking()),
        heapGrowthController_(config),
        regularIntervalPacer_(config),
        timer_("GC Timer thread", config_.regularGcInterval(), [this] {
            if (appStateTracking_.state() == mm::AppStateTracking::State::kBackground) {
                return;
            }
            if (regularIntervalPacer_.NeedsGC()) {
                RuntimeLogDebug({kTagGC}, "Scheduling GC by timer");
                scheduleGC_.scheduleNextEpochIfNotInProgress();
            }
        }) {
    }

GC can also be triggered by allocation: when boundaryForHeapSize returns kTrigger, a GC is scheduled. During allocation, after SafeAlloc maps memory with mmap it calls setAllocatedBytes to decide whether a GC is needed.

Code block:

void setAllocatedBytes(size_t bytes) noexcept {
    auto boundary = heapGrowthController_.boundaryForHeapSize(bytes);
    switch (boundary) {
        case HeapGrowthController::MemoryBoundary::kNone:
            return;
        case HeapGrowthController::MemoryBoundary::kTrigger:
            scheduleGC_.scheduleNextEpochIfNotInProgress();
            return;
        case HeapGrowthController::MemoryBoundary::kTarget:
            mutatorAssists_.requestAssists(epoch);
            return;
    }
}

The condition is that the allocated memory, totalAllocatedBytes, has reached the thresholds derived from targetHeapBytes (10 MB by default).

Code block:

// Can be called by any thread.
MemoryBoundary boundaryForHeapSize(size_t totalAllocatedBytes)noexcept{
    if (totalAllocatedBytes >= targetHeapBytes_) {
        return config_.mutatorAssists() ? MemoryBoundary::kTarget : MemoryBoundary::kTrigger;
    } else if (totalAllocatedBytes >= triggerHeapBytes_) {
        return MemoryBoundary::kTrigger;
    } else {
        return MemoryBoundary::kNone;
    }
}

After each GC, updateBoundaries recomputes targetHeapBytes using heapTriggerCoefficient (0.9 by default) and targetHeapUtilization (0.5 by default, per the config below); both can be tuned.

Code block:

// Called by the GC thread.
void updateBoundaries(size_t aliveBytes) noexcept {
    if (config_.autoTune.load()) {
        double targetHeapBytes = static_cast<double>(aliveBytes) / config_.targetHeapUtilization;
        if (!std::isfinite(targetHeapBytes)) {
            // This shouldn't happen in practice: targetHeapUtilization is in (0, 1]. But in case it does, don't touch anything.
            return;
        }
        double minHeapBytes = static_cast<double>(config_.minHeapBytes.load(std::memory_order_relaxed));
        double maxHeapBytes = static_cast<double>(config_.maxHeapBytes.load(std::memory_order_relaxed));
        targetHeapBytes = std::min(std::max(targetHeapBytes, minHeapBytes), maxHeapBytes);
        triggerHeapBytes_ = static_cast<size_t>(targetHeapBytes * config_.heapTriggerCoefficient.load(std::memory_order_relaxed));
        config_.targetHeapBytes.store(static_cast<int64_t>(targetHeapBytes), std::memory_order_relaxed);
        targetHeapBytes_ = static_cast<size_t>(targetHeapBytes);
    } else {
        targetHeapBytes_ = config_.targetHeapBytes.load(std::memory_order_relaxed);
    }
}

The aggressive mode only triggers allocation-based GC; it does not trigger GC on a timer.

Summary of current issues

  1. The std allocator has a small memory footprint, but in practice switching to it causes frequent allocation-triggered GC, and performance is much worse than custom
  2. cms does not pause threads during the mark phase and performs better, but the default is pmcs
  3. GCScheduler defaults to adaptive mode; the periodic GC (every 10 s by default) and the default 10 MB heap target lead to frequent GC
  4. The GC is not generational, so traversing all objects on every cycle is expensive
  5. The custom allocator allocates per thread, much like Android's TLAB. In practice, however, physical memory easily exceeds 200 MB because there is no compaction, so we have to implement it ourselves

We addressed these issues with optimizations that have shipped in Douyin.

Optimizations in practice

Heap configuration tuning

From the updateBoundaries analysis, the next GC is mainly governed by targetHeapBytes. Its default is 10 MB, so GC is triggered once the heap reaches heapTriggerCoefficient × 10 MB = 9 MB; after a GC, targetHeapBytes = live object bytes / targetHeapUtilization (0.5).

Code block:

std::atomic<int64_t> regularGcIntervalMicroseconds = 10 * 1000 * 1000;
// GC will try to keep object bytes under this amount. If object bytes have
// become bigger than this value, and `mutatorAssists` are enabled the GC will
// stop the world and wait until current epoch finishes.
// Adapts after each GC epoch when `autoTune = true`.
std::atomic<int64_t> targetHeapBytes = 10 * 1024 * 1024;
// The rate at which `targetHeapBytes` changes when `autoTune = true`. Concretely: if after the collection
// `N` object bytes remain in the heap, the next `targetHeapBytes` will be `N / targetHeapUtilization` capped
// between `minHeapBytes` and `maxHeapBytes`.
std::atomic<double> targetHeapUtilization = 0.5;
// GC will be triggered when object bytes reach `heapTriggerCoefficient * targetHeapBytes`.
std::atomic<double> heapTriggerCoefficient = 0.9;
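
A short worked example of these defaults, written as an illustrative Kotlin model of the recomputation (not the runtime's code): if 6 MB of objects survive a collection with autoTune on, the next targetHeapBytes becomes 6 MB / 0.5 = 12 MB and the next trigger becomes 0.9 × 12 MB = 10.8 MB.

Code block:

import kotlin.math.max
import kotlin.math.min

// Model of HeapGrowthController::updateBoundaries with the defaults quoted above.
data class HeapBoundaries(val targetHeapBytes: Long, val triggerHeapBytes: Long)

fun nextBoundaries(
    aliveBytes: Long,
    targetHeapUtilization: Double = 0.5,   // default from the config above
    heapTriggerCoefficient: Double = 0.9,  // default from the config above
    minHeapBytes: Long = 1L * 1024 * 1024, // assumption: a floor just for this sketch
    maxHeapBytes: Long = Long.MAX_VALUE,
): HeapBoundaries {
    var target = aliveBytes / targetHeapUtilization
    target = min(max(target, minHeapBytes.toDouble()), maxHeapBytes.toDouble())
    return HeapBoundaries(
        targetHeapBytes = target.toLong(),
        triggerHeapBytes = (target * heapTriggerCoefficient).toLong(),
    )
}

fun main() {
    // 6 MB survive a GC -> target 12 MB, trigger 10.8 MB for the next epoch.
    println(nextBoundaries(aliveBytes = 6L * 1024 * 1024))
}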

In practice, allocation-triggered GC fires quite often, so these variables are worth tuning. The periodic GC driven by regularGcIntervalMicroseconds (10 s) also consumes CPU while scrolling; the interval can be increased during scrolling first, and later GC can be driven purely by heap size.

Take the Toutiao Following feed as an example: with the default memory parameters, scrolling triggers GC frequently and the frame rate drops.

With the default parameters, the GC interval while scrolling is only about 200 ms.

At the application layer, these parameters can be adjusted directly through the kotlin.native.runtime.GC properties.

Raising the GC heap threshold

After the adjustment, the GC interval while scrolling returns to the default 10 s.
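
A hedged sketch of such an adjustment from Kotlin code, assuming the kotlin.native.runtime.GC object (behind @NativeRuntimeApi) exposes targetHeapBytes, targetHeapUtilization, and regularGCInterval in the Kotlin version you ship; verify the exact property set against your version.

Code block:

import kotlin.native.runtime.GC
import kotlin.native.runtime.NativeRuntimeApi
import kotlin.time.Duration.Companion.seconds

@OptIn(NativeRuntimeApi::class)
fun tuneGcThresholds() {
    // Raise the heap target so allocation-triggered GC fires less often (default 10 MiB).
    GC.targetHeapBytes = 64L * 1024 * 1024
    // Keep more headroom after each collection (default 0.5; target = alive / utilization).
    GC.targetHeapUtilization = 0.25
    // Periodic GC interval (default 10 s).
    GC.regularGCInterval = 30.seconds
    // Note: with autotuning enabled, the target adapts again after each collection.
}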

GC suppression while scrolling

The current kotlin-native GC runs on a timer; if a GC happens to fire while the user is scrolling, it can cause jank, so the runtime should not GC during scrolling. The approach is to use GC.regularGCInterval to raise the GC interval to a relatively long value, say 1 minute, while scrolling, and to restore it when scrolling ends.
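
A minimal sketch of that idea, assuming the same kotlin.native.runtime.GC API; the scroll callbacks are hypothetical names for wherever the UI layer reports scroll start and end.

Code block:

import kotlin.native.runtime.GC
import kotlin.native.runtime.NativeRuntimeApi
import kotlin.time.Duration
import kotlin.time.Duration.Companion.minutes

@OptIn(NativeRuntimeApi::class)
object ScrollGcSuppressor {
    private var savedInterval: Duration? = null

    // Hypothetical hook: call when scrolling starts.
    fun onScrollStart() {
        if (savedInterval == null) {
            savedInterval = GC.regularGCInterval
            GC.regularGCInterval = 1.minutes  // push the periodic GC outside the scroll window
        }
    }

    // Hypothetical hook: call when scrolling ends.
    fun onScrollEnd() {
        savedInterval?.let { GC.regularGCInterval = it }  // restore the original interval
        savedInterval = null
    }
}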

GC configuration tuning

The default is pmcs; switching to cms reduces thread pause time. In most cases the pmcs stop-the-world pause is around 5 ms. At 120 fps a frame has 8.33 ms, which leaves only about 3 ms for business logic, and in practice image-heavy scrolling scenes dropped frames consistently. The cms stop-the-world pause is around 0.2 ms, a full order of magnitude lower.

Pause time during GC with the default pmcs.

Pause time during GC after switching to cms.

In our tests, with the three optimizations above in place, the frame rate of the Toutiao profile page scrolling scenario improved from 110 fps to 117 fps.

Memory fragmentation optimization

  1. Adjust the FixedBlockPage count and cell size. Each thread has its own fixedBlockPages array, which in the worst case reaches 256 KiB × 128 = 32 MiB; without compaction after GC, the memory holes are large. We now set FIXED_BLOCK_PAGE_SIZE to 64 KiB and FIXED_BLOCK_PAGE_MAX_BLOCK_SIZE to 16, so one thread occupies at most 1 MiB (a quick calculation follows the code block below).

Code block:

class CustomAllocator {
private:
    Heap& heap_;
    NextFitPage* nextFitPage_;
    FixedBlockPage* fixedBlockPages_[FIXED_BLOCK_PAGE_MAX_BLOCK_SIZE + 1];
    ExtraObjectPage* extraObjectPage_;
    FinalizerQueue finalizerQueue_;
    
inline constexpr const size_t FIXED_BLOCK_PAGE_SIZE = (256 * KiB);
inline constexpr const int FIXED_BLOCK_PAGE_MAX_BLOCK_SIZE = 128;
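
The quick worst-case footprint arithmetic behind those numbers, as a tiny Kotlin sketch (illustration only; it assumes at most one page per size class per thread):

Code block:

fun fixedBlockFootprintBytes(pageSizeBytes: Long, maxBlockSizeCells: Int): Long =
    // One FixedBlockPage per size class (1..maxBlockSizeCells cells) per thread.
    pageSizeBytes * maxBlockSizeCells

fun main() {
    val kib = 1024L
    val mib = 1024 * 1024L
    println(fixedBlockFootprintBytes(256 * kib, 128) / mib) // default: 32 MiB per thread
    println(fixedBlockFootprintBytes(64 * kib, 16) / mib)   // tuned:    1 MiB per thread
}
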
  2. Release hole memory page by page

During Sweep, when memory should be reclaimed it is only zeroed with memset; the physical memory is not released.

Code block:

bool FixedBlockPage::Sweep(GCSweepScope& sweepHandle, FinalizerQueue& finalizerQueue) noexcept {
    for (uint32_t cell = 0 ; cell < end_ ; cell += blockSize_) {
        // Go through the occupied cells.
        for (; cell < nextFree.first ; cell += blockSize_) {
            if (!SweepObject(cells_[cell].data, finalizerQueue, sweepHandle)) {
                // We should null this cell out, but we will do so in batch later.
                continue;
            }
            if (prevLive + blockSize_ < cell) {
                // We found an alive cell that ended a run of swept cells or a known unoccupied range.
                uint32_t prevCell = cell - blockSize_;
                // Nulling in batch.
                memset(&cells_[prevLive + blockSize_], 0, (prevCell - prevLive) * sizeof(FixedBlockCell));
             }
        }

We change the memset into madvise so that memory is released back to the OS page by page.

Code block:

#ifndef KONAN_WINDOWS
static size_t kPageSize = sysconf(_SC_PAGESIZE);
#endif
void ZeroAndReleasePages(void* address, size_t length) noexcept {
#ifdef KONAN_WINDOWS
#else
    if (length <= 0) {
        return;
    }
    uint8_t* const mem_begin = reinterpret_cast<uint8_t*>(address);
    uint8_t* const mem_end = mem_begin + length;
    uint8_t* const page_begin = reinterpret_cast<uint8_t*>(RoundUp(reinterpret_cast<uintptr_t>(mem_begin), kPageSize));
    uint8_t* const page_end = reinterpret_cast<uint8_t*>(RoundDown(reinterpret_cast<uintptr_t>(mem_end), kPageSize));
    if (page_begin >= page_end) {
        // No possible area to madvise.
    } else {
        madvise(page_begin, page_end - page_begin, MADV_DONTNEED);
    }
#endif
}
//#endif

In testing, during long scrolling sessions on the Toutiao Following feed, the fragmentation optimization saved about 200 MB of memory.

  3. Remove the MAP_POPULATE flag from mmap

The Runtime allocates pages with mmap, as follows:

Code block:

void* SafeAlloc(uint64_t size)noexcept{
//...... 
#if KONAN_WINDOWS
        RuntimeFail("mmap is not available on mingw");
#elif KONAN_LINUX
        memory = mmap(nullptr, size, PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE | MAP_POPULATE, -1, 0);
        error = memory == MAP_FAILED;
        //......
}

The call passes a MAP_POPULATE flag, whose main effect is to prefault the page tables of the mapped region.

In a standard mmap call, the system only reserves a region in the process's virtual address space and sets up the mapping between the virtual addresses and the file (or anonymous memory); it does not allocate physical memory immediately. Physical memory is allocated lazily, through the page-fault mechanism, when the CPU first touches the virtual memory.

With MAP_POPULATE, the system populates the page tables during the mmap call itself, and for file mappings it also triggers read-ahead on the file. Removing this flag reduces physical memory usage.

VMA reuse optimization

During sweep, the CMS GC collects the empty pages:

Code block:

T* SweepSingle(GCSweepScope& sweepHandle, T* page, AtomicStack<T>& from, AtomicStack<T>& to, FinalizerQueue& finalizerQueue)noexcept{
        if (!page) {
            return nullptr;
        }
        do {
            if (page->Sweep(sweepHandle, finalizerQueue)) {
                to.Push(page);
                return page;
            }
            empty_.Push(page);
         } while ((page = from.Pop()));
         return nullptr;
}

During the second STW of the next GC, the empty pages are released back to the OS with munmap:

Code block:

void PrepareForGC() noexcept {
    unswept_.TransferAllFrom(std::move(ready_));
    unswept_.TransferAllFrom(std::move(used_));
    T* page;
    // Destroy() releases the VMA via munmap
    while ((page = empty_.Pop())) page->Destroy();
}

But when there are many empty pages, this makes the STW noticeably longer and hurts performance.

So we added a VMA reuse optimization: when an empty page is collected, madvise(MADV_DONTNEED) is used to release its physical memory, which greatly reduces the time of the second STW.

Generational GC

When sweep calls ObjectData's tryResetMark, a sticky (young) collection marks the object kStickyMark; on the next GC the object is still seen as marked, so it is neither freed nor added to markQueue.

Code block:

bool tryResetMark() noexcept {
    if (!isSticky) {
        unMarkSticky();
    }


    if (next() == nullptr) return false;
    markUncontendedSticky();
    markSticky();
    return true;
}


void markSticky() noexcept {
    auto nextVal = reinterpret_cast<ObjectData*>(kStickyMark);
    next_.store(nextVal, std::memory_order_relaxed);
}


bool unMarkSticky() {
    auto expected = reinterpret_cast<ObjectData*>(kStickyMark);
    return next_.compare_exchange_strong(expected, nullptr, std::memory_order_relaxed);
}

When not in sticky mode, tryEnqueue calls unMarkSticky to clear the sticky mark so the object gets marked afresh.

Code block:

static ALWAYS_INLINE bool tryEnqueue(AnyQueue& queue, ObjHeader* object) noexcept {
    auto& objectData = alloc::objectDataForObject(object);
    if (!GC::ObjectData::isSticky) {
        objectData.unMarkSticky();
    }


    bool pushed = queue.tryPush(objectData);
    return pushed;
}

Generational GC does not shorten the pause time, but it reduces the total GC thread time by 10-30 ms; because memory is released less promptly, however, it can also increase memory usage.

Object escape analysis

Static code analysis decides whether a variable is allocated on the heap or on the stack; stack-allocated objects can be released as soon as the function call ends. In our tests the ratio of stack objects to heap objects was about 1:8, so business code should try to increase the share of stack objects (a sketch follows the list below):

  • Prefer allocating objects in local variables inside methods rather than in class member fields
  • Use less polymorphism, to increase the chance that an object is recognized as stack-allocatable
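
An illustrative Kotlin sketch of those two guidelines. Whether the compiler actually stack-allocates in either case depends on its escape analysis; this only shows the shape of code that gives it a chance.

Code block:

import kotlin.math.sqrt

// A small value type used only inside one call.
class Point(val x: Int, val y: Int)

class Renderer {
    // Less friendly: the temporary is stored in a member field, so it escapes the call.
    private var scratch: Point? = null

    fun distanceViaField(x: Int, y: Int): Double {
        scratch = Point(x, y)
        val p = scratch!!
        return sqrt((p.x * p.x + p.y * p.y).toDouble())
    }

    // Friendlier: a local, monomorphic object that never leaves the function,
    // the pattern the static analysis can keep on the stack.
    fun distanceViaLocal(x: Int, y: Int): Double {
        val p = Point(x, y)
        return sqrt((p.x * p.x + p.y * p.y).toDouble())
    }
}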

Memory compaction

Because stack variables are not refreshed to the new object address through a loadSlot call, two problems remain to be solved.

  • Compaction itself is stop-the-world
  • Stack-referenced variables are not relocated

Part of the implementation is shown below: if the object has already been copied, the new object address is read from the old object; otherwise the object is copied with memcpy.

Code block:

if (gc::isCopied(object)) {
    UpdateStackRef(newObjAddr, gc::copyObj(object));
    return;
}


// CAS so that only one thread starts copying this object
if (!gc::isCopying(object)) {
    gc::trySetCopyObj(object, reinterpret_cast<ObjHeader*>(gc::kObjectCopy));
} else {
    // otherwise wait for the copy to finish
    while (gc::isCopying(object)) {}
    if (gc::isCopied(object)) {
        UpdateStackRef(newObjAddr, gc::copyObj(object));
    }
    return;
}


newObj = threadData->allocator().allocateObject(typeInfo);
// Prevents unsafe class publication (see KT-58995).
// Also important in case of the concurrent GC mark phase.
std::atomic_thread_fence(std::memory_order_release);
size = computeObjectSize(typeInfo);
std::memcpy(reinterpret_cast<int8_t *>(newObj) + sizeof(ObjHeader), reinterpret_cast<int8_t *>(object) + sizeof(ObjHeader),
            size - sizeof(ObjHeader));
gc::trySetCopyObj(object, newObj);
UpdateStackRef(newObjAddr, newObj);

An online experiment in Douyin showed roughly a 10% memory improvement.

Future plans

  1. Use LLVM stackmaps for compaction, and move the GC threads from stop-the-world to concurrent
  2. Pointer compression: shrink object field and array element pointers from 64 bits to 32 bits, which can save 10%+ memory
  3. Large and small objects currently share one heap; placing them in separate heaps can reduce the number of GCs.