Why you still need SmartHeap in 32-bit Windows
Note: This document is taken from the SmartHeap Technical Specification.
SmartHeap and HeapAgent are trademarks of MicroQuill Software Publishing, Inc.
Bounds-Checker is a trademark of Nu-Mega Technologies, Inc.
Microsoft and Windows are registered trademarks, and Visual C++, Win95, Win32s, and Windows NT are trademarks of Microsoft Corporation.
Borland C++ is a trademark of Borland International, Inc.
Watcom C/C++ is a trademark of Watcom International Corp.
All other trademarks are the property of their respective holders.
© 1993-1997 MicroQuill Software Publishing, Inc.
The problem with memory allocators
The fundamental problem for a memory allocator is this: how to allocate, free, and reference thousands of objects, in random order, in a fragmented heap whose size exceeds physical memory - and to do so with a minimum of swapping.
Applications must manage large numbers of objects
Today's applications, especially those written in C++, tend to be more memory intensive than ever before, often allocating, freeing, and referencing hundreds of thousands, or even millions, of small blocks. Their allocation patterns are random, with calls to new interspersed with calls to delete. As a result, the heap quickly evolves into a fragmented, chaotic jumble.
This fragmentation causes many commercial allocators to "hit the wall" when trying to manage even modestly large heaps; the performance of the allocator degrades geometrically as the heap size grows. A simple test case bears this out. On a 4 MB, 33 MHz 386 running Enhanced Mode Windows 3.1, a single call to malloc in an empty heap takes 110-116 microseconds for both the Microsoft and Borland allocators - both very fast. However, if you first make 49,999 random calls to malloc, realloc, and free with blocks varying randomly between 8 and 100 bytes, and then measure the time for a single additional call to malloc, that call is dramatically slower: 6.1 seconds for Microsoft and 9.6 seconds for Borland - over 50,000X and 80,000X slower, respectively.
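A minimal harness along these lines shows the effect (a hypothetical sketch, not the original test program; the slot count and timing method are arbitrary choices):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SLOTS 2000   /* live blocks kept at any one time (arbitrary) */
    #define CALLS 49999  /* random malloc/realloc/free calls, per the test */

    static size_t random_size(void) { return 8 + rand() % 93; }  /* 8..100 bytes */

    int main(void)
    {
        static void *slot[SLOTS];
        for (long i = 0; i < CALLS; ++i) {   /* churn: fragment the heap */
            int s = rand() % SLOTS;
            switch (rand() % 3) {
            case 0: free(slot[s]); slot[s] = malloc(random_size()); break;
            case 1: slot[s] = realloc(slot[s], random_size());      break;
            case 2: free(slot[s]); slot[s] = NULL;                  break;
            }
        }
        clock_t t0 = clock();                /* now time one more malloc */
        void *p = malloc(random_size());
        clock_t t1 = clock();
        printf("malloc after churn: %.0f microseconds\n",
               (t1 - t0) * 1e6 / CLOCKS_PER_SEC);
        free(p);
        return 0;
    }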
Large-footprint operating systems compete with your application for precious RAM
32-bit virtual memory operating systems provide the advantage of a huge address space. However, these operating systems themselves are huge, at least relative to the typical RAM configuration. As a result, they compete with your application for precious RAM and force your application's heap to swap far more frequently.
Windows for Workgroups 3.11, for example, has a footprint of about 3 MB. On a typical 8 MB configuration, the operating system itself will typically reside entirely in RAM, leaving a full 5 MB of RAM for an application and its heap. The typical 16-bit application is reasonably snappy on this configuration, since the entire application fits within physical memory (provided other applications are not running at the same time). The application doesn't thrash up to this point because all of its calls to allocate or free memory involve RAM-resident pages. Only when the heap size exceeds available physical memory do some of these calls invoke performance-killing disk hits, which are 20,000 times more expensive than an access to a RAM-resident page.
Windows 95, however, has a footprint of some 14 MB - several times the size of Microsoft's suggested 4 MB minimum memory configuration. (Windows NT and most UNIX implementations are even larger.) On that same 8 MB platform, the sheer size of Win95 guarantees that your application's heap will always be at least partially non-resident, so each call to allocate, free, or even reference memory is far more likely to invoke agonizingly slow disk hits.
Pre-emptive multi-tasking and multi-threading operating systems increase swapping frequency
Pre-emptive multi-tasking in Win95, NT, OS/2, and UNIX further degrades performance. For example, your application (or one of its threads) may be in the middle of traversing a data structure when the operating system turns the processor over to another application or another thread. When your application (or that thread) gets its next slice of processor time, its data will often have been swapped to disk.
Multiple threads exacerbate the problem still further. In multi-threaded environments, allocations are normally serialized so that only one thread can be active in the heap at a time. This makes the heap a real bottleneck for multi-threaded applications and affects performance:
- Serialization (semaphore locking) itself is slow.
- Blocking on semaphores results in additional context switching because a thread that blocks waiting for the heap gives up its processor timeslice prematurely. This additional context switching, expensive in its own right, causes still more swapping.
On symmetric multi-processing (SMP) systems, heap serialization can also lead to one or more processors being idle if several threads are trying to access the heap at the same time.
The solution
SmartHeap was designed to perform well both in physical memory and in situations where the total size of all objects exceeds the amount of physical memory currently available. (See the benchmarks later in this section.) Several innovative design characteristics drive SmartHeap's enhanced performance.
SmartHeap's unique page table algorithm maintains better locality of reference and reduces swapping
SmartHeap is smarter than other malloc and new implementations. Conventional memory management algorithms effectively treat the heap as one large region, maintaining a single free list that spans the entire heap. Over time, as blocks are continually allocated and freed, the free list ultimately degenerates into a random path of pages in the heap. This causes the allocator to jump from page to page as it traverses the free list. The heap in these conventional implementations exhibits poor page locality: data that is referenced consecutively (in this case, the heap free list) isn't stored in the same page of the heap.
The free list's lack of data locality wouldn't be a big problem if each free block were always large enough to satisfy each subsequent allocation request and if the heap were always entirely resident in physical memory. However, the same cycle of allocating and freeing that randomizes the free list also fragments the heap. This causes an ever-shrinking average block size, which, in turn, lessens the likelihood that "the next" free block in the list will be large enough to fulfill the current request. Moreover, as we discussed above, most 32-bit applications don't run purely in physical memory. As a result, a call to malloc or new often touches multiple pages while looking for a free block large enough, and some of these touches invoke performance-killing disk hits. When a heap is fragmented and in tight memory conditions, a single allocation call can take a second or more as the allocator thrashes while traversing the free list.
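In outline, the conventional scheme reduces to a first-fit walk of a single heap-wide list - something like this simplified sketch (not any particular vendor's code):

    #include <stddef.h>

    // One free list spans the entire heap.  Each link can live on a
    // different page, so every step of the walk risks a page fault.
    struct FreeBlock {
        size_t     size;
        FreeBlock *next;   // next free block, anywhere in the heap
    };

    static FreeBlock *free_list;

    void *conventional_alloc(size_t size)
    {
        for (FreeBlock **p = &free_list; *p; p = &(*p)->next) {
            if ((*p)->size >= size) {   // first block big enough wins
                FreeBlock *b = *p;
                *p = b->next;           // unlink (block splitting omitted)
                return b;
            }
        }
        return 0;                       // would ask the OS for more memory
    }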
The malloc and new implementations in all the compiler-supplied runtime libraries in DOS, Win16, OS/2, and the Mac use the conventional algorithm described above. Most UNIX allocators and Microsoft's NT allocator improve on this by storing the free list and associated header information in a memory area separate from the blocks of data. Because the heap headers are smaller (usually eight bytes each) than the actual data, the free list can be stored more compactly, so data locality improves and swapping is reduced.
However, separating the free list from the data still doesn't eliminate swapping. For large heaps, the free list continues to span a large number of pages, so traversing it can still touch multiple pages. In addition, for heaps with a small median block size (common in C++), the free list isn't that much smaller, and swapping still occurs.
SmartHeap uses a much smarter algorithm that virtually eliminates swapping. Like most other allocators, SmartHeap goes directly to the operating system for large blocks of memory and then sub-allocates. But while other allocators treat the heap effectively as one large region, SmartHeap divides the heap into discrete pages that correspond with (and are perfectly aligned with) the pages of the underlying operating system virtual memory manager. And, also like the operating system, SmartHeap maintains a compact page table that keeps track of all of the pages in the heap.
For each page in the heap, SmartHeap's page table stores the size of the largest free block in that page. This page table is much smaller than the compiler allocator's free list because the page table has just one entry per page, rather than one entry per free block in the heap. Rather than searching one long list of free blocks (and touching many pages in the process), SmartHeap quickly scans its much smaller page table for a page that it knows has space for the current allocation request. SmartHeap's actual free list is contained inside each page - since the free list doesn't reference any other pages, only a single heap page is referenced during a SmartHeap allocation. With this technique, SmartHeap virtually eliminates swapping during allocation calls.
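In outline, the technique looks something like the following sketch. It illustrates only the idea as described - SmartHeap's real page table records the largest free block per page and keeps a free list inside each page, whereas this sketch simplifies each page to a bump region:

    #include <stddef.h>

    #define PAGE_SIZE 4096
    #define NUM_PAGES 1024

    static size_t bytes_free[NUM_PAGES];         // compact page table
    static size_t bytes_used[NUM_PAGES];
    static char   heap[NUM_PAGES][PAGE_SIZE];    // OS-page-aligned pages

    void heap_init(void)
    {
        for (int i = 0; i < NUM_PAGES; ++i) {
            bytes_free[i] = PAGE_SIZE;
            bytes_used[i] = 0;
        }
    }

    void *page_table_alloc(size_t size)
    {
        // Scan only the small page table, not the heap itself.
        for (int i = 0; i < NUM_PAGES; ++i) {
            if (bytes_free[i] >= size) {
                void *p = heap[i] + bytes_used[i];  // touch exactly one page
                bytes_used[i] += size;
                bytes_free[i] -= size;
                return p;
            }
        }
        return 0;   // grow the heap: request fresh pages from the OS
    }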
Allocation speed is one clear benefit of SmartHeap's page-based allocation algorithm, but there is a more subtle benefit that can have an even greater impact on your application's overall performance.
We mentioned earlier how the free list in traditional allocators follows a random path through the heap. A consequence of this is that each block allocated by your application will lie on a random page within the heap. SmartHeap, on the other hand, with its page-centric free list, always tries to allocate consecutive objects from the same page. The result is that the data referenced by your application has better locality. Applications often reference (and free) memory in the same pattern in which they allocate it. For example, elements successively inserted into a list will be allocated and referenced in the order of the list links. Therefore, referencing this data will involve accessing fewer pages, which further minimizes swapping.
SmartHeap coalesces free blocks on every free, without the overhead of a doubly-linked list or a full heap traversal
De-allocation (free/delete) is also extremely fast because SmartHeap maintains, in the block headers, special bits that indicate whether adjacent blocks are free. With each free/delete, SmartHeap checks this local header information and immediately coalesces any adjacent free blocks. This technique is constant-time; it is not affected by the size of the heap. As a result, it is faster and more memory efficient than the compiler-library alternatives: maintaining a doubly-linked list of free blocks, traversing the free list, or, worst of all, leaving free blocks uncoalesced so that they fragment the heap.
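The classic way to get constant-time coalescing from such header bits is with boundary tags, sketched below (an illustration of the general technique, not SmartHeap's actual header layout; it assumes an allocated sentinel block terminates each region):

    #include <stdint.h>

    struct Header {
        uint32_t size;          // payload size, a multiple of 8
        uint32_t free      : 1; // this block is free
        uint32_t prev_free : 1; // the physically preceding block is free
    };

    static Header *next_hdr(Header *h)    // physically next block
    { return (Header *)((char *)(h + 1) + h->size); }

    static uint32_t *footer(Header *h)    // last word of a free block
    { return (uint32_t *)((char *)(h + 1) + h->size) - 1; }

    static Header *prev_hdr(Header *h)    // valid only when h->prev_free
    {
        uint32_t prev_size = *((uint32_t *)h - 1);  // preceding footer
        return (Header *)((char *)h - prev_size) - 1;
    }

    void free_block(Header *h)            // O(1): no list traversal
    {
        Header *n = next_hdr(h);
        if (n->free)                      // merge with the next block
            h->size += sizeof(Header) + n->size;
        if (h->prev_free) {               // merge with the previous block
            Header *p = prev_hdr(h);
            p->size += sizeof(Header) + h->size;
            h = p;
        }
        h->free = 1;
        *footer(h) = h->size;             // footer enables future merges
        next_hdr(h)->prev_free = 1;       // tell the neighbor we are free
    }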
SmartHeap minimizes per-allocation overhead to reduce swapping
With SmartHeap, per-allocation overhead is quite small. Depending on the platform, overhead for fixed-size allocations is zero to two bytes, and overhead for variable-size allocations is two to eight bytes. (For Win16, Win32, and OS/2, SmartHeap's overhead per malloc/new is two bytes, while compiler malloc/new overhead per allocation is as high as sixteen bytes.) The extra overhead of the compiler allocator increases the size of the heap. This wastes memory and causes extra swapping by increasing the size of the working set required for your application.
SmartHeap's fixed-size allocator is faster still
In addition to malloc/new, SmartHeap provides a fixed-size allocator. Fixed-size allocators reduce memory management to simple free-list management, which is extremely fast. Rather than searching the heap for the best fit, the fixed-size allocator can simply pick the head of the free list. If you're implementing linked lists, trees, or other dynamic data structures in C or C++, SmartHeap's fixed-size allocator will allocate memory for these purposes faster than any other memory manager available.
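A fixed-size allocator can be this simple (an illustrative C++ sketch, not SmartHeap's fixed-size API; the chunk size of 64 blocks is arbitrary):

    #include <stdlib.h>

    class FixedPool {
        struct Node { Node *next; };
        size_t block_size;
        Node  *free_head;

        void refill() {                   // carve one chunk into blocks
            const size_t count = 64;
            char *chunk = (char *)malloc(block_size * count);
            if (!chunk) return;
            for (size_t i = 0; i < count; ++i) {
                Node *n = (Node *)(chunk + i * block_size);
                n->next = free_head;      // thread blocks onto the list
                free_head = n;
            }
        }
    public:
        FixedPool(size_t size)
            : block_size(size < sizeof(Node) ? sizeof(Node) : size),
              free_head(0) {}

        void *allocate() {                // O(1): pop the head, no search
            if (!free_head) refill();
            if (!free_head) return 0;
            Node *n = free_head;
            free_head = n->next;
            return n;
        }
        void deallocate(void *p) {        // O(1): push back on the list
            Node *n = (Node *)p;
            n->next = free_head;
            free_head = n;
        }
    };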
Get the advantages of a fixed-size allocator without changing your code
Using SmartHeap's malloc/new, you can allocate both fixed- and variable-size blocks, without changing your code! Just specify a fixed-size threshold in SmartHeap. Thereafter, when you call the regular new or malloc, allocations smaller than the threshold are routed to the fixed-size allocator, and allocations larger than the threshold go to a variable-size allocator. You get the benefits of an extremely fast fixed-size allocator for small allocations and the efficiency of a variable-size allocator for large requests, without any code changes.
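The dispatch itself amounts to a size test. Here is a hypothetical wrapper, reusing the FixedPool sketch above; the threshold value is invented, and SmartHeap performs this routing inside malloc/new itself:

    #include <stdlib.h>

    static const size_t kThreshold = 64;      // invented threshold value
    static FixedPool small_pool(kThreshold);  // from the sketch above

    void *routed_malloc(size_t size)
    {
        if (size <= kThreshold)
            return small_pool.allocate();     // fast fixed-size path
        return malloc(size);                  // variable-size path
    }
    // A real implementation also tags each block so that free() can
    // route it back to the correct allocator.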
Use SmartHeap's multiple pools to achieve even greater performance gains, especially in referencing data
SmartHeap provides seamless support for multiple pools, so you can further improve the performance of your application with minimal coding effort. (To allocate from a particular pool, you simply override the definition of new for a given class, as sketched in the example after this list.) Multiple pools let you partition your data into discrete "mini-heaps," which produces a number of benefits:
- All objects allocated by a given class are contained in the fewest possible pages of the heap (for improved locality and, therefore, less swapping).
- Allocating each additional object is faster because the heap space that the allocator must search (one pool) is a subset of the entire heap.
- Deleting objects is faster because you can delete all of the objects in one pool with a single call - this is much faster than traversing your entire data structure to free every individual element.
- Referencing data is dramatically faster because all of the objects you need to reference are on a small number of pages and not spread throughout a much larger heap space (again, because better locality means less swapping).
- You can assign each thread in your application its own SmartHeap pool. This eliminates heap serialization overhead and allows multiple threads to concurrently access their own heaps. Moreover, on symmetric multi-processing (SMP) systems, multiple pools allow true concurrency of heap operations, which can dramatically improve overall performance by maximizing processor utilization.
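Here is what the per-class override can look like in C++, reusing the FixedPool sketch from earlier in place of a real SmartHeap pool (the actual SmartHeap pool API is not shown):

    #include <stddef.h>

    class ListNode {
    public:
        // Route every ListNode through its own pool; callers still
        // just write "new ListNode" and "delete node".
        void *operator new(size_t)     { return pool().allocate(); }
        void  operator delete(void *p) { pool().deallocate(p); }

        ListNode *next;
        int       value;

    private:
        static FixedPool &pool() {     // one pool shared by all ListNodes
            static FixedPool p(sizeof(ListNode));
            return p;
        }
    };

With this in place, every node the list class allocates lands in that pool's pages, giving the locality benefits described above.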
SmartHeap's handle-based allocator gives you speed and space
If you need a handle-based allocator, SmartHeap implements double-indirection handles. The memory is moveable, so fragmentation is eliminated. However, you can still access the memory very efficiently by de-referencing the handle as a pointer, which eliminates a performance-degrading function call. This is the best of both worlds (speed and space) in a feature that was previously available only on the Macintosh.
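The idea, in sketch form (an illustration of double indirection, not SmartHeap's implementation):

    #include <stdlib.h>
    #include <string.h>

    typedef void **Handle;   // a handle points at a master pointer

    // The allocator owns the master pointer, so it can move the block
    // (e.g. to compact the heap) without invalidating the handle:
    void move_block(Handle h, size_t size)
    {
        void *dest = malloc(size);      // block's new location
        memcpy(dest, *h, size);
        free(*h);
        *h = dest;                      // the handle still works
    }

    // The application reaches the data with one extra dereference --
    // no lock/unlock function calls:
    int first_int(Handle h) { return *(int *)*h; }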
Benchmarks: blazingly fast performance
How important is malloc/free speed?
Consider a typical application that takes 10 minutes to run and spends 40% of that time - 4 minutes - managing memory. The table below shows how a faster memory management library affects this application's overall run time.
    If malloc/new is     then malloc/new takes   the app takes       and the entire app is
    this much faster...  this much time...       this much time...   this much faster
    -------------------------------------------------------------------------------------
    no change (1X)       4.00 minutes            10.00 minutes        0%
    1.5X                 2.67 minutes             8.67 minutes       13.3%
    2X                   2.00 minutes             8.00 minutes       20%
    4X                   1.00 minutes             7.00 minutes       30%
    10X                  0.40 minutes             6.40 minutes       36%
    100X                 0.04 minutes             6.04 minutes       39.6%
Note that even a 4X improvement in malloc can result in a 30% overall application performance improvement - and remember that SmartHeap is generally a minimum of 4X faster than other commercial allocators and requires just a relink to implement.
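The arithmetic behind the table: if a fraction f of total run time T is spent in the allocator and the allocator becomes s times faster, the new run time is (1 - f)T + fT/s. With f = 0.4 and T = 10 minutes, a 4X speedup gives 6 + 4/4 = 7 minutes - the 30% improvement shown above. Note that the overall gain can never exceed f itself (here 40%), which the 100X row approaches.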
Benchmarks
These benchmark graphs (included with the full technical specification) compare SmartHeap to the Microsoft Visual C++ version 5.0 malloc/new libraries on various versions of Windows.
Note We also have benchmarks for Borland C++ on Win NT, Win 95, and WFW 3.11. In addition, we have benchmarks for the Macintosh, OS/2, SunOS, and HP 7xx. For more information, please call us at 425-827-7200 or email us at info@microquill.com.
Our benchmark test program randomly calls operators new and delete (in a ratio of 3:1) to create objects that randomly vary in size from 8 to 128 bytes until the heap reaches the specified size. The program then deletes all of the objects.
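In outline, the test procedure corresponds to something like this (a sketch of the procedure as described, not the shipped benchmark program; the 32 MB target is an arbitrary example):

    #include <cstdlib>
    #include <utility>
    #include <vector>

    int main()
    {
        const size_t target = 32 * 1024 * 1024;       // example heap size
        std::vector<std::pair<char *, size_t> > live; // live blocks
        size_t heap_size = 0;

        while (heap_size < target) {
            if (std::rand() % 4 != 0 || live.empty()) {   // 3 of 4: new
                size_t size = 8 + std::rand() % 121;      // 8..128 bytes
                live.push_back(std::make_pair(new char[size], size));
                heap_size += size;
            } else {                                      // 1 of 4: delete
                size_t i = std::rand() % live.size();
                heap_size -= live[i].second;
                delete[] live[i].first;
                live[i] = live.back();
                live.pop_back();
            }
        }
        for (size_t i = 0; i < live.size(); ++i)          // final teardown
            delete[] live[i].first;
        return 0;
    }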
Note Applications that initially allocate all of their memory and do little or no subsequent allocation will not see substantial performance improvements because traditional new implementations are fast when allocating into a totally empty heap.