home current projects technical training more about us contact us
 

There are three new papers - Comments welcome!

Data Stream Processing: Parallelism and Performance

PLINQ, C#/.NET, and Native Code

Download Version 1.8, February 14, 2011. The file contains the source code, Visual Studio 2010 projects, and a spreadsheet with the raw data used to produce the figures in the article. The full article is in MSDN Magazine (January, 2011).

Introduction

File and data stream processing is a fundamental computing task, and numerous practical problems are amenable to parallel processing to improve performance and throughput on multi-CPU systems. This article will compare several distinct methods supported by Windows to solve problems with a high degree of data parallelism.

Leveraging and extending results in two previous articles [the two others on this page], I'll compare alternative solutions to a file search problem from [T. Magennis, LINQ to Objects Using C# 4.0, Addison-Wesley 2010, Chapter 9] using:

  • PLINQ  ("Parallel Language Integrated Query") and C#, with and without enhancements to Magennis's Chapter 9 code.
  • Windows native code using C, the Windows API, threads, and memory mapped files
  • Windows C# and .NET with threads and memory mapped files

Results Summary

  • The C#/.NET and native code solutions are fastest, about twice as fast as the best PLINQ solution. This is consistent with results in the other two articles on this page involving large file input and output.
  • The original PLINQ solution [Magennis] is slower by a factor of nearly 10, and it does not scale well  beyond two CPUs, whereas the other solutions scale well up to four CPUs (the maximum tested). However, code enhancements improve the original solution significantly.
  • The PLINQ code is the simplest and most elegant in all respects because LINQ provides a declarative query capability for memory-resident as well as external data. The native code is ungainly, and the C#/.NET code is considerably better.
  • PLINQ performance can be competitive with native code performance, but only if you preprocess the data into an indexed, in-memory structure. The preprocessing is time-consuming and is only worthwhile if the cost can be amortized over multiple queries.
  • All methods scale well with file size up to the limits of the test system physical memory.
  • Performance scaling, for this series of benchmarks, was best on systems without hyper-threading.

Sequential File Processing: A Fast Path from Native Code to .NET

with Visits to Parallelism and Memory Mapped Files

Download Version 1.07, December 10, 2010 (This was temporarily under repair but is now available). The file contains the full article (.doc format), source code, Visual Studio projects, and spreadsheets with the raw data used to produce the figures in the article.

Background: Windows System Programming (all editions) compares the performance of different implementation strategies for the same task, and Appendix C summarizes the results. For example, the "cci" ("Caesar Cipher") examples ("atou" in Editions 1-3) all deal with sequential file processing. After working with C# and .NET, I became curious about performance and programming convenience relative to Windows System Programming's native code implementations. I was also curious about .NET 4.0's new MemoryMappedFile class, parallel programming and multithreaded support, and the "Asynchronous Programming Model" (APM). This paper is the result, and the abstract below summarizes the principal findings.

Abstract: How can we make sequential file processing of large files fast? This may seem like a boring question, but it is still essential in any number of application areas, including pattern searching, compression, encryption, hashing, and deduplication,. The summary results (backed up with code, graphs, and spreadsheets of test results) are:

  • Many sequential file processing problems are instances of simple patterns that are amenable to parallelism.

  • Memory mapped files and threading are not sufficient by themselves for significant performance enhancement.

  • The combination of MMFs and threading yield very good performance (as much as x3 compared to other implementations) in some common circumstances (described) and are as good as file stream performance in all tested circumstances.

  • 64-bit muti-CPU systems are essential for top performance, especially with large files.

  • The .NET 4.0 MemoryMappedFile class is not useful; mixed managed and native code is necessary for best results

  • The .NET asynchronous programming model (APM) was not useful for the applications used in this paper. APM may be very useful in other situations.

  • The best implementation can be further enhanced with some simple manual code optimization and tuning.

  • These results apply to Windows Vista and Window 7 (both NT Kernel 6.x based). Windows XP and Windows Server 2003 give somewhat different results.

Windows Parallelism, Fast File Searching, and Speculative Processing

Windows Parallelism, Fast File Searching, and Speculative Processing is posted on the InformIt Web site.

The benchmark problem solved in this paper involves speculative processing in which: 

         We need to search data streams for a pattern or specific condition.

         It's required to identify the first occurrence(s) of the condition.

         Once an occurrence has been detected, other search tasks terminate as soon as possible in order to maximize performance.

More generally, Toub (Patterns of Parallel Programming - Understanding and Applying Parallel Patterns with the .NET Framework 4 and Visual C# (p. 94) describes speculative processing this way:

"Speculation is the pattern of doing something that may not be needed in case it actually is needed. This is increasingly relevant to parallel computing, where we can take advantage of multiple cores to do more things in advance of their actually being needed. Speculation trades off throughput for reduced latency, by utilizing resources to do more work in case that extra work could pay dividends."

This article's specific benchmark problem is to implement a fast version of the Windows COMP command (called compMP), which compares two files, byte by byte, and displays the mismatches in order of occurrence. compMP differs from the COMP command in several ways which are useful but challenging to implement: 

The search terminates after 8 mismatches (an arbitrary limit) have been identified.

         Optional command line parameters specify a block size (default: 4,096) and the number of concurrent threads (default: the number of processors).

         If the number of threads is negative, the program uses conventional file processing with ReadFile Windows calls, reading block size blocks for each read operation. This allows for comparison with both the fast memory mapped/parallel implementation and with the real Windows COMP command.

         After a file compare, there is no prompt for the next pair of files.

The implementation challenges occur because mismatches can be identified at any time in different tasks (threads) and may not be found in the order in which they occur in the two files. Therefore, the solution needs to keep track of the earliest mismatches and, when the maximum mismatch count (8) is reached, other search tasks should terminate if they are searching past the last mismatch location.

The performance results are much faster than Windows for large files on commodity test systems (as much as 10x or more in some cases). A strategy for very large files is also described.

The Visual Studio project and source code are included in the current WSP4 examples code, and the COMP for large files on commodity test systems. The source code has benefited greatly from suggestions by Michael Bruestle.

 

footer