There are three new papers. Comments are welcome!
Data Stream Processing: Parallelism and Performance
PLINQ, C#/.NET, and Native Code
Download Version 1.8, February 14, 2011.
The file contains the source code, Visual Studio
2010 projects, and a spreadsheet with the raw data used to produce the figures
in the article. The full article is in
MSDN Magazine
(January 2011).
Introduction
File and data stream processing is a fundamental computing task, and numerous
practical problems are amenable to parallel processing to improve performance
and throughput on multi-CPU systems. This article will compare several distinct
methods supported by Windows to solve problems with a high degree of data
parallelism.
Leveraging and extending results from two previous articles [the two others on
this page], I'll compare alternative solutions to a file search problem from [T.
Magennis, LINQ to Objects Using C# 4.0,
Addison-Wesley 2010, Chapter 9] using:
- PLINQ ("Parallel Language Integrated Query") and C#, with and without enhancements to Magennis's Chapter 9 code (a minimal sketch follows this list).
- Windows native code using C, the Windows API, threads, and memory mapped files.
- Windows C# and .NET with threads and memory mapped files.
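For readers unfamiliar with PLINQ, here is a minimal sketch of the kind of query being compared. It is an illustration under assumed inputs (directory and target string supplied on the command line), not Magennis's Chapter 9 code: it searches every file under a directory, in parallel, for lines containing a target string.

```csharp
using System;
using System.IO;
using System.Linq;

class PlinqSearchSketch
{
    static void Main(string[] args)
    {
        string directory = args[0];   // root of the file tree to search
        string target = args[1];      // string to look for

        // AsParallel() turns the file enumeration into a PLINQ query,
        // so files are read and filtered on multiple cores.
        var matches =
            from file in Directory.EnumerateFiles(
                directory, "*.*", SearchOption.AllDirectories).AsParallel()
            from line in File.ReadLines(file)
            where line.Contains(target)
            select file + ": " + line;

        foreach (string m in matches)
            Console.WriteLine(m);
    }
}
```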
Results Summary
- The C#/.NET and native code solutions are fastest, about twice
as fast as the best PLINQ solution. This is
consistent with results in the other two articles on this page involving large
file input and output.
- The original PLINQ solution [Magennis] is slower by a factor
of nearly 10, and it does not scale well beyond
two CPUs, whereas the other solutions scale well up to four CPUs (the maximum
tested). However, code enhancements improve the original solution significantly.
- The PLINQ code is the simplest and most elegant in all
respects because LINQ provides a declarative query capability for
memory-resident as well as external data. The native code is ungainly, and the
C#/.NET code is considerably better.
- PLINQ performance can be competitive with native code
performance, but only if you preprocess the data into an indexed, in-memory
structure (see the sketch after this summary). The preprocessing is time-consuming and is only worthwhile if the
cost can be amortized over multiple queries.
- All methods scale well with file size up to the limits of the
test system physical memory.
- Performance scaling, for this series of benchmarks, was best
on systems without hyper-threading.
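The indexed preprocessing point can be made concrete with a small sketch (an assumed example, not code from the article): build an in-memory word index once, a comparatively expensive step, so that each later query is a cheap lookup rather than a full scan.

```csharp
using System;
using System.IO;
using System.Linq;

class IndexedSearchSketch
{
    static void Main(string[] args)
    {
        string directory = args[0];

        // One-time, time-consuming preprocessing: index every line of
        // every file under the directory by the words it contains.
        var index =
            (from file in Directory.EnumerateFiles(directory).AsParallel()
             from line in File.ReadLines(file)
             from word in line.Split(' ')
             select new { word, file, line })
            .ToLookup(x => x.word, x => x.file + ": " + x.line);

        // Each query is now a hash lookup; the preprocessing cost is
        // amortized only if many such queries are run.
        foreach (string hit in index["pattern"])
            Console.WriteLine(hit);
    }
}
```

The index is built only once, so the up-front cost pays off only across many queries, which is the amortization point made above.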
Sequential File Processing: A Fast Path from Native Code to .NET
with Visits to Parallelism and Memory Mapped Files
Download Version 1.07, December 10, 2010
(This was temporarily under repair but is now available).
The file contains the full article (.doc format), source code, Visual Studio projects, and spreadsheets with the
raw data used to produce the figures in the article.
Background: Windows System Programming (all editions) compares the
performance of different implementation strategies for the same task, and
Appendix C summarizes the results. For example, the "cci" ("Caesar Cipher")
examples ("atou" in Editions 1-3) all deal with sequential file processing.
After working with C# and .NET, I became curious about performance and
programming convenience relative to Windows System Programming's native code
implementations. I was also curious about .NET 4.0's new
MemoryMappedFile class,
parallel programming and multithreaded support, and the "Asynchronous
Programming Model" (APM). This paper is the result, and the abstract below
summarizes the principal findings.
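As a point of reference, here is a minimal sketch (my illustration, not code from the paper) of the .NET 4.0 MemoryMappedFile class in its simplest form: map a file and touch every byte through a view accessor.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfSketch
{
    static void Main(string[] args)
    {
        string path = args[0];
        long length = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(path))
        using (var view = mmf.CreateViewAccessor(0, length))
        {
            long sum = 0;

            // Every access goes through an accessor method call rather
            // than a raw pointer; that per-byte overhead is a plausible
            // contributor to the managed class's disappointing results
            // reported in the abstract below.
            for (long i = 0; i < length; i++)
                sum += view.ReadByte(i);

            Console.WriteLine("Byte sum: {0}", sum);
        }
    }
}
```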
Abstract: How can we make sequential file processing of
large files fast? This may seem like a boring question, but it is still
essential in any number of application areas, including
pattern searching, compression, encryption, hashing, and deduplication. The
summary results (backed up with code, graphs, and spreadsheets of test results)
are:
- Many sequential file processing problems are instances of simple patterns that are amenable to parallelism.
- Memory mapped files and threading are not sufficient by themselves for significant performance enhancement.
- The combination of MMFs and threading yields very good performance (as much as 3x compared to other implementations) in some common circumstances (described in the paper) and is as good as file stream performance in all tested circumstances (a sketch of the pattern follows this list).
- 64-bit multi-CPU systems are essential for top performance, especially with large files.
- The .NET 4.0 MemoryMappedFile class is not useful; mixed managed and native code is necessary for best results.
- The .NET asynchronous programming model (APM) was not useful for the applications used in this paper, although it may be very useful in other situations.
- The best implementation can be further enhanced with some simple manual code optimization and tuning.
- These results apply to Windows Vista and Windows 7 (both NT kernel 6.x based); Windows XP and Windows Server 2003 give somewhat different results.
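To make the MMF-plus-threading bullet concrete, here is an illustrative sketch (not the paper's implementation) that partitions a memory-mapped file across threads, each with a view over its own slice:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading;

class ParallelMmfSketch
{
    static void Main(string[] args)
    {
        string path = args[0];
        long length = new FileInfo(path).Length;
        int nThreads = Environment.ProcessorCount;
        long[] counts = new long[nThreads];
        var threads = new Thread[nThreads];

        using (var mmf = MemoryMappedFile.CreateFromFile(path))
        {
            for (int t = 0; t < nThreads; t++)
            {
                int id = t;
                long offset = length * id / nThreads;
                long size = length * (id + 1) / nThreads - offset;
                if (size == 0) continue;   // tiny file, nothing for this thread

                // Each thread gets a view over its own slice only.
                var view = mmf.CreateViewAccessor(offset, size);
                threads[id] = new Thread(() =>
                {
                    using (view)
                    {
                        long local = 0;
                        for (long i = 0; i < size; i++)
                            if (view.ReadByte(i) == (byte)'\n')
                                local++;   // sample work: count newlines
                        counts[id] = local;
                    }
                });
                threads[id].Start();
            }

            foreach (var th in threads)
                if (th != null) th.Join();
        }

        long total = 0;
        foreach (long c in counts) total += c;
        Console.WriteLine("Newlines: {0}", total);
    }
}
```

Each thread writes only its own slot of the counts array, so no locking is needed until the final single-threaded reduction.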
Windows Parallelism, Fast File Searching, and Speculative Processing
Windows Parallelism, Fast File Searching, and Speculative Processing
is posted on the InformIt Web site.
The benchmark problem solved in this paper involves
speculative processing in which:
- We need to search data streams for a pattern or specific condition.
- It's required to identify the first occurrence(s) of the condition.
- Once an occurrence has been detected, other search tasks terminate as soon as possible in order to maximize performance.
More generally, Toub (Patterns of Parallel Programming:
Understanding and Applying Parallel Patterns with the .NET Framework 4 and
Visual C#, p. 94) describes speculative processing this way:
- "Speculation is the pattern of doing something that may not be needed in case it actually is needed. This is increasingly relevant to parallel computing, where we can take advantage of multiple cores to do more things in advance of their actually being needed. Speculation trades off throughput for reduced latency, by utilizing resources to do more work in case that extra work could pay dividends."
This article's specific benchmark problem is to implement a fast
version of the Windows
COMP
command (called compMP), which compares two files, byte by byte, and displays the mismatches in order of occurrence. compMP
differs from the COMP command in several ways that are useful but challenging to implement:
- The search terminates after 8 mismatches (an arbitrary limit) have been identified.
- Optional command line parameters specify a block size (default: 4,096) and the number of concurrent threads (default: the number of processors).
- If the number of threads is negative, the program uses conventional file processing with ReadFile Windows calls, reading one block of the specified block size per read operation. This allows for comparison with both the fast memory mapped/parallel implementation and with the real Windows COMP command.
- After a file compare, there is no prompt for the next pair of files.
The implementation challenges occur because mismatches can be
identified at any time in different tasks (threads) and may not be found in the
order in which they occur in the two files. Therefore, the solution must
keep track of the earliest mismatches, and, once the maximum mismatch count (8)
is reached, any search task working past the last recorded mismatch location
should terminate. A sketch of one possible bookkeeping structure follows.
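One way to do this bookkeeping (an assumed design sketch, not the published compMP source) is a small lock-protected table of the earliest mismatch offsets that workers consult between blocks:

```csharp
using System.Collections.Generic;

// Shared table of the earliest mismatch offsets. Worker threads call
// Report() when they find a mismatch and poll ShouldStop() between
// blocks to see whether their position can no longer matter.
class MismatchTable
{
    const int MaxMismatches = 8;
    readonly List<long> offsets = new List<long>();  // kept sorted
    readonly object gate = new object();

    public void Report(long offset)
    {
        lock (gate)
        {
            int pos = offsets.BinarySearch(offset);
            if (pos < 0)
                offsets.Insert(~pos, offset);
            // Keep only the MaxMismatches earliest offsets.
            if (offsets.Count > MaxMismatches)
                offsets.RemoveAt(offsets.Count - 1);
        }
    }

    public bool ShouldStop(long offset)
    {
        lock (gate)
        {
            // Stop once 8 mismatches are known and all precede this worker.
            return offsets.Count == MaxMismatches &&
                   offsets[MaxMismatches - 1] < offset;
        }
    }
}
```

In this sketch, a worker would compare one block at a time, call ShouldStop with its current offset, and abandon its region once the table already holds eight earlier mismatches; the main thread would then report the surviving offsets in order.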
The performance results are much faster than the Windows
COMP command for large files on commodity test systems (as
much as 10x or more in some cases). A strategy for very large files is also
described.
The Visual Studio project and source code are included in the current WSP4
examples code. The source code has benefited greatly from suggestions by Michael Bruestle.