John Melville's profileJohn Melville's Little S...BlogListsNetwork Tools Help

John Melville

Occupation
Location
Interests
As a physician and computer scientist in bush Alaska, I spend most of my free time building computer tools for medicine.

John Melville's Little Space

Musings of a WPF developer in bush Alaska
July 05

Lying API's Part 2: An Unpleasant Surprise In SQL Server Compact 3.5

I can't say how much I love SQL Server Compact.  Its a simple in-process server, which gives you SQL queries, transactions, constraints, and all the fundamentals of a relational database, in process in your app.  No service to initialize and no administrator privileges needed ever.  A SQL CE database is a file, and there is nothing special about the file.  This makes SQL CE an attractive alternative to serialize custom file formats in a robust manner.

The App I am working on today uses some big files.  Because I am a pediatrician, one of my volunteer opportunities is examining children who may have been victims of child abuse.  Since I am also a geek, I developed some software that keeps track of all my photos and evidence.  It makes sure that my manifest matches my evidence envelopes, that all my photos get tagged with all the right information about where they came from, and so forth.  The computer takes care of all the nit-picky details that can be a real embarrassment on the witness stand.  I collect a lot of evidence --sometimes as much as 500 megabytes per case!

My old file format, based on the Open Document Specification and its .Net Implementation in System.IO.Packaging was just too brittle.  Any problems at all with how the app closed would create an unreadable ZIP file.  After my lovely software made mincemeat out of four hours of work I had put into a forensic report last Friday, I decided it was time for a change.  (Luckily I had the actual photos backed up on the camera, but that didn't make me much happier.)

My approach was simple.  A database with a single table, having two fields.  1 field is the file name, and the other is an Image representing the file data.  This approach violates every tenant of database design that I know about; but its virtue is I could write a drop in replacement for the other format object and the database would give me true transactions on windows XP.

The problem I noticed immediately was that the performance in writing the stream to the database absolutely stinks.  I discovered why, and I think it's another interesting discussion of an API that tells a lie.

The API to write data into an Image field in SqlCE looks like this.

updatableRecord.SetBytes(fieldOrdinal, DestinationStartPosition, buffer, bufferStartPosition, count)

The analogy to the .NET stream implementation is to plain to ignore.  This is an API designed to let you drop blocks on data into your field in little chunks.  I still don't know why they didn't finish the job and wrap this method in a good implementation of Stream but they didn't.  That being said, the stream wrapper is so simple it almost writes itself.

So far this is a good API.  Its simple, easy to use, and immediatly understandable to anyone who has written a Stream -- which should be just about anyone who is about to load multi-megabyte blobs into database, like I do.

However, if you adopt the stream model, you ought to adopt the stream performance model.  Users of streams expect roughly order N performance -- pushing 100K into the stream should take about 10 times as long as 10K, and the kb/sec should be roughly constant.  As the graphs below show, this is not the case with the SetBytes API.

I set up a small perf harness that would create a DB with a single table.  It would repeatedly copy a single buffer of zeros into the database to make fields of the appropriate size and time the results.  Because I used a single source buffer, there is no readin step to contaminate of these results.  The graphs below show total "wall clock time" verus field size.  The second graph shows average transfer rate versus field size.  For right now look at the blue lines.

Both of these are log-log graphs.  Any polynomial function would be a straight line in these graphs.  An upward curvature represents greater than polynomial growth.

Transfer time as a function of transfer size.
TimeGraph

Average Transfer Speed as a Function of Transfer Size

TimeGraph

The blue lines show why my performance was so poor.  The SetBytes method, when call with repeated chunks displays not only not linear, but at least N^2 if not exponential with the total size of the data to be transferred.

The red lines show a simple performance tweak.  Prior to writing the data out, I wrote a single zero in the last position of the array I intended to fill.  This forces Sql Server to allocate enough space to hold the whole stream.

Even more interesting is that the transfer rate actually increases as total size increases.  This suggests that until the exponential growth catches up with us, the actual cost of streaming data into the database is pretty cheep.  The fixed costs of making a row dominate the time in the early elements.

The concerning point is that this optimization doesn't fix the problem, it just delays it.  If you look like the right end of both graphs, the exponential growth returns at larger sizes.  Fortunately, this hack made the system fast enough for my uses, so my inquiry ended here.

I think this behavior is a represents two separate bugs.  The firist is that I can't store large objects without a hack to declare their size first.  It wouldn't be called a binary LARGE object if it weren't inviting you to put lots of data into it.  Giving me worse-than-polynomial performance hits, especially hits at a realtively small 2mb factor, is unacceptable.

It could be that this limitation is inherent in the design of SQL Compact.  I'm NOT using SQL Compact for its intended purpose: to store small bits of connected data,  I have to admit that the documentation does recommend against using BLOBs for very large amounts of data, and suggests storing the data in the file system with a link in the database.  That design would not have worked for me -- I needed a single file, transactional store.  But point taken anyway.

Even if the first bug were unavoidable given other design decisions; the performance of the API lies to me.  It looks just like a streaming API, but wildly breaks my performance expectation for streaming.  To make the API work well, I have to declare the total size of the stream, but the API not only didn't push that fact into my face -- I had to come up with a hack to get it done at all.

I still think its a bug -- what do you think?

March 15

In Defense of Meta-Programming

Back when I was in college and medical school I did almost all my programming in C++.  C++, and C, have a very controversial language feature: the pre-processor.  The C preprocessor runs before code is compiled, and allows source level transformations on the code prior to compilation.

The most maligned of the C Preprocessor's features, macro expansion, is the subject of this post.  The marco feature lets the programmer define textual macros like this:

#define Add(a,b) ((a)+(b))

After this definition upon encountering "Add(3,5)" the preprocessor would replace "((3)+(5))".

Defining macros in this way has a host of gottchas and problems that I will not detail here.  In extreme cases bad programmers have "macroed" up the language to the point that its no longer recognizable as C.  There are subtle ordering bugs.  The bugs get worse if a macro is used with expressions having side effects or if a lazy programmer forgets some of the parentheses in the expression above.  For a host of these reasons, Anders Hejisberg explained in a TechEd chalk talk session that a macro processor is one of the features considered and rejected for the C# language.

This is unfortunate because I think it throws out a little bit of baby with a whole lot of bathwater.  My critics will rightly point out that modern language features like generics and compiler level inlining do a much better and safer job of a lot of things that were once done with the preprocessor when it was designed 40 years ago.  I have already admitted that macros have a habit of really biting some programmers who don't fully understand how text expansion can hurt you.

The redeeming grace of macro programming is seen in classes like Enumerable in the System.Linq namespace.  Enumerable contains 20 different overloads of the Average method, differing only by various types of the arguments.  Microsoft paid someone to write all 20 overloads of the function to do identical things, probably with identical code.  It pays a tester to verify in each new drop of this assembly that all these functions do exactly the same thing.  If a bug is ever discovered in the Average code, it will have to pay someone to fix all 20 overloads.  Microsoft is paying through the nose for the lack of metaprogramming in C#.

In a hypothetical C# with macros the code would look like this (using the antiquated C macro syntax that could use an update.)

#define Average (itemtype, getnumber) {  \

public itemtype Average(this IEnumerable<ItemType> source) { \

    Average(source, i=>i); \

}\

public itemtype Average<TSource>(this IEnumerable<TSource> source, Func<TSource, itemtype> func) { \

      //implement average here in terms of getnumber for each element.\

}
#define declareAverage(type) \

Average(type, item) \

Average(type/**/?, item.Value)

//Now to generate all 20 overloads

declareAverage(decimal)

declareAverage(float)

declareAverage(double)

declareAverage(int)

declareAverage(long)

 

In the above hypothetical code sample we know that the implementation of several average methods will remain consistent for each of the number types.  Changes to the average method method will automatically be reflected across all types that various overloads accept.  Futhermore we can extend the class with a Median method by simply writing the two methods, which will automatically get expanded into the 20 methods needed to keep the API consistent.  I do still have to test the entire API surface (there could be differences in how the number types are implemented, which could break specific overloads,) but the regression testing will be automated and the simpler code will result in fewer regressions (at least in the uniformity bugs -- it I screw up now all 20 overloads will be screwed up, but they'll be uniformly screwed up.)

It is interesting that in Visual Studio 2008, the Visual Studio team seems to have trumped Mr. Hejisberg by including T4, the Text Template TransformationToolKit, into visual studio.  T4 can be best described as macros on steroids, no make that steroids and crack cocaine.  T4 lets you write C# or VB code to generate code.  The blog I referenced has several examples of metaprogramming: running code to write code, in a variety of ways that make sense.

As I step back, metaprogramming seems to be the natural extension of a rule of programming I have recently begun to appreciate and honor.  The Extreme Programmers say: code once and only once.  The more dynamic language crowd seems to remind themselves: Don't Repeat Yourself (DRY principle.)  The concept is the same express each algorithm once -- rearrange the code so similar operations reuse the same code path.  It results in less code and eliminates a difficult class of bugs when parallel implementations are not kept in sync.

Metaprogramming extends DRY from algorithms to declarations.  If I really want to declare 20 overloads of a function that do the same things with different types of arguments, I really ought to only write the function once.  I should not repeat myself 20 times just because I want the function to have 20 overloads with small, systematic, and predictable differences.  (Dynamic programmers need not comment that their languages do this automatically.  I've already picked my static typed poison and now I have to live with it.  This blog is about how I do that.)  Using the templating engine lets me produce verbose, usable interfaces from template code that succinctly expresses my intent.

Try out the T4 engine.  See if it makes you productive like it does for me.  Metaprogramming has gotten a bad name because it can be abused -- and the results can be terrible.  The best professionals, however, have deep toolboxes filled with a variety of sharp and dangerous tools.  It is the selection and skillful use of the entire toolset allows them to produce some real works of art.

February 16

Helper Functions that hurt

Wildprarie noticed an "evil" anomaly in the implementation of BitmapImage's loading algorithm.  While attempting to load images in the background, he switched from using a URI to loading the image directly from a filestream.  This is where he noticed a 5-10x decrease in performance.

OK everyone, its time to pull out your psychic debugging hats and figure this one out.  Its not only obvious, but it exposes what I consider to be a fascinating "design bug" in the BitmapImage API, and perhaps in the .Net Framework.

If your psychic skills are one the blink right now, see the proposed fix listed in the same article.

// Notice this whole method runs in the background -- you can tell from the useless parameter in the signature

   private void DoLoadImage(object o)
        {
            if (!_loading) { return; }

            byte[] buffer = File.ReadAllBytes(_filename);
            // by reading the data into an in-memory buffer, we prevent the file from being read in the UI thread -- which speeds up access dramatically!
            MemoryStream mem = new MemoryStream(buffer)
            // . . .  snip
            BitmapImage bi = new BitmapImage();
           // . . . snip
           //bi.UriSource = new Uri(_filename); slow slow slow
            bi.StreamSource = mem;
            bi.EndInit();
            bi.Freeze();
             // . . . Snip           
            // if you dispose of the memory stream here, the image will be toast (burnt toast)
            // (as the dispatcher won't have run yet).

        }

The simple answer, of course, is buffering.  When you hand a FileStream to the BitmapImage, it reads from the stream in small bits, each of which requires actual, physical motion on the hard disk, and takes forever.  A much simpler, and more memory efficient, implementation is to use bi.StreamSouce = new BufferedStream(new FileStream(...), 10240) which will allocate a 10 k buffer instead of an unbounded amount of memory for the entire image.  We still get the benefit of not having to move the drive head with every byte that gets read and still get the performance boost. (I did actual tests and I did see the speed improvement he describes.)

I do not, however, describe this bug simply to cast stones at a fellow blogger, especially one who so clearly admits his confusion on the issue.  I don't think the confusion is his fault.

We usually think of API's in layers, objects "lower in the stack" have more "computery" interfaces with fewer services.  These lower layers are the foundation for more feature rich layers built on top of them. 

The BitmapImage breaks this model.  It is a fairly low-level object -- it parsers image files and turns them into bitmaps.  The problem is that BitmapImage features several "convenience" constructor that are very high level -- one can even download the image from a remote web host, handing the HTTP protocool, network delays, and cancelations.  In object oriented terms, the took a simple object that did one thing well and adorned it with "helper methods" so it could do many many different things.  (My suspicion is that most of these helper methods just wrap the source stream in a variety of stream adaptors already existing in the framework.  Some of these have buffers, HTTP streams, for example, and required by specification to be buffered.)

To understand BitmapImage's real place in the stack, look at the StreamSource member.  This member can be any Stream descendent, buffered, nonbuffered, encrypted, verifying, or anything else.  BitmapImage knows nothing about the performance characteristics of the stream it will get handed.  BitmapInfo must work even with Stream decendants that didn't exist when BitmapInfo was created; this is a fundamental tennent of object oriented design.  Clearly the creators of BitmapImage, like all skilled computer scientists, know that the stream would need to be buffered to get reasonable performance off the disk, but buffering many other streams could be detrimental to performance (buffering a memory stream just increases your working set,) or correctness (buffering a decrypting string leaves plain text in memory that has to be carefully managed.)  Since BitmapInfo can't decide whether to buffer or not buffer the input, it doesn't even try.  In my opinion this is the correct behavior.  Write a low level API, and if your caller's need a buffer, they can type the 20 some characters above to get one.

The problem her is that nothing in visual studio, c#, .NET, or the WPF documentation makes it clear that some of BitmapInfo's methods are very high level, with all the details covered for you, and other methods of the same object are low-level, use at your own risk kind of methods, that expect you to bring your own buffer.  The (wrong) comment I bolded above makes it all clear -- while shifting work to another thread, the programmer unwittingly changed from a high level api to a low level API, and ascribes the performance hit to the changes he was trying to make.  In effect the API lied, it gave us a couple different sources from which we could load images, some of which are overlapping, without pointing out that they had very different contracts.

The problem for me is I do this all the time.  I have a high level public api, with low level private implementation.  Eventually the low level API becomes useful, and it makes sense to make it public.  How should the API telegraph the fact that its methods describe different levels of abstraction, with different contracts?  What could the .Net framwork do to make these differences more obvious?

November 10

Xunit Theory Testing: The String Fuzzer

This is the last (for now) of a series of posts on a new data generation framework for the XUnit unit test framework.  The previous post is here.

This post introduces the a single extension method ValueSource<T>.Fuzz().  The Fuzz method creates an infinite ValueSource of "Fuzzed outputs" that represent mutations of the input strings.

Fuzz testing has become quite the rage recently, but just in case you have been living under a log for the past 3 years the idea is really simple.  In 1990 Barton P. Miller, Lars Fredriksen, and Bryan So fed completely random input into several unix utilities.  A frightning number of them crashed, or went into infinite loops, a few of them even brought the operating system down with them.  Since then Fuzz testing has advanced from simply random inputs to "almost right" inputs which have much more bug finding power.  Microsoft estimates that over half of the security bugs requiring critical patches in the past two years have been found through fuzz testing.

The beauty of fuzz testing is that the random number generator does stupid stuff that no human would ever try, and mahem results.  Unfortunatly just throwing random strings at a well tested program is about as effective as waiting for monkeys to type out Shakespeare.  There is just the right mix of random and nonrandom to cover all the code, but not lose the magic of random testing.

Traditional Fuzz testing just feeds input into the programs main input.  Complicated protocool generators to generate almost right packets and recompute checksums to bypass error checking subroutines.  I think fuzz testing should be a unit test technique.  It makes it easier to get high coverage with lots of randomness.  Inject the fuzz directly into the code you want to cover.  It also lets us test defense in depth because we don't have to disable the outer defenses or craft a packet that will slip past them.  Again, just inject the fuzz into the code you want to test.

 

The Fuzz method is a simple mutational fuzzer.  You hand it a ValueSource<string> representing valid strings.  The fuzzer uses the following algorithms to create new, probably invalid inputs, based on the given, valid ones.

  1. Retrieve a string and truncate it randomly.
  2. Retrieve a string, split it randomly, return the last part.
  3. Retrieve two strings, split each randomly.  Append the beginning of the first to the end of the second.
  4. Insert one string into a random position within another string.
  5. Reverse the characters in a random substring of a string.
  6. retrieve a string and reverse it.
  7. Remove a random number of characters from the middle of a string.
  8. Generate a completely random string.
  9. Return a string unmodified.

 

The Fuzz(int) method calls fuzz followed by the sample method, creating a finite data source with the given number of fuzzed strings.

 

Let me repeat the fuzz test I used in an earlier post:

   [TheoryMethod]
    public IEnumerable<ITestCommand> FuzzCompiler() {
      return TestData(CodeSamplesData().PreProcess(i => i[1].ToString()).Fuzz(1000000)).TestTheory<string>(
        input => {
          try {
            LightweightCompiler.Compile(input);
          } catch (ChamObjectCreationException) {
          }
        }
        );
    }

This fuzz test ensures that a compiler I am writing will not crash for any input.  I set this test up and ran 1,000,000 iterations.  On the first run I had 11 failiures.  1 had managed to expose a known limitation of the testing enviornment.  The other 10 errors represented two significant compiler bugs that exposed my own lack of understanding of an obscure corner case was supposed to work.  These two bugs were in code written using TDD and with extensive unit tests and 100% coverage.  I fixed the bugs and added the randomly generated, failing cases to my unit test suite, ensuring that the bug will never be repeated.

I was quite impressed.  This compiler is some of the oldest code in my project.  It is beginning to mature, and I am developing what I hope is deserved confidence in it.  I was surprised that I was able to uncover two significant bugs by throwing randomized data at a parser.

 

This is all I intend to write about my data generation framework for right now.  I hope you find it interesting.  I am going to go put up a link on the xUnit discussion board that I hope will push some traffic here.

I hope this code will eventually be accepted into the planned XUnit extension project because I have found it useful.  Otherwise the code is not yet publically availaible, by I will email it to anyone who asks.  I am currently using Beta 1 of XUnit, but hope to go to beta 2 soon.

I am still extending this framework.  I am working right now to generate random XML documents that conform to a given definition.  It will be similar to the other data generators already described.  I welcome feedback and criticism, especially on my API design, as I am very new at designing APIs for other programmers.  I am also curious about better ways to "fuzz" the strings, it turns out all the bugs I found in my use of it were from algorithm # 3 above.

Let me know what you think!

November 06

XUnit Theory Testing: Data Generation Methods

This post continues the series on a data generation framework for the xUnit unit test framework.  The previous post is here.

Ok the last two posts were a little bit long.  Now for a simpler post.  In this post I will document the classes that implement data generators.  Data Generators are the first classes we have run into that do not originate in TheoryGenerationFacade.  The data generators exist in appropriately named static classes.  DataGenerators are simple, they generate data.  All of the described methods return some specialization or descendant of ValueSource<T>.

DoubleGenerator

The double generator generates random, infinite sequences of double valued numbers.  It has the following methods:

  • DoubleGenerator.Range(low, high) returns an infinite sequence of doubles between low and high, both inclusive.
  • DoubleGenerator.Any() returns an infinite sequence of any double values.

 

IntGenerator

IntGenerator is a bit more complicated.  The ValueSource<T> returned is not infinite, because there is a finite number of values in every range.  This ValueSource has additional code so that when it is used as an infinite value source (which is the common practice) it does not generate the entire set of included values.  IntegerValueSource, the internal ValueSource<Int> descendant that these methods return, implements two separate algorithms optimized for infinite or finite value stream cases.  This is important because both cases are common.  IntGenerator has two methods, identical to DoubleGenerator.

  • IntGenerator.Range(low, high) returns a finite sequence of integers between low and high, both inclusive.
  • IntGenerator.Any() returns a finite sequence of integers.

 

StringGenerator


StringGenerator returns infinite ValueSource<Strings>.  Every StringGenerator method has two varieties.  The method(max) variety generates strings of the described characters with lengths between 1 and the max.  The method(min,max) generates strings with lengths between min and max. both inclusive.  To create fixed length strings use the syntax method(x,x).

Following are the methods of the StringGenerator class, listing only the method(max) forms:

  • Random(max, string sourceChars) create random strings using characters from the given string.
  • Random(max) create a random string (contains unicode points 1-1000)
  • Alpha(max) creates strings of upper and lower case alphabetic characters.
  • Digits(max) creates strings of the digits 0-9
  • AlphaNumeric(max) creates strings with upper and lower case letters and digits.
  • KeyboardChars(max) creates strings with all the characters I could easily create with my US english keyboard.
  • PrintableCharacters(max) creates strings of characters at unicode points 32-1000

 

The Closure extension method on ValueSource<String> also helps in producing strings. Closure(min, max) or Closure(max) concatenates an appropriate number of values from the given string source.  For example "abc".Closure(3,3) would produce "abcabcabc".  Closure is named after the Kleeny Closure, the mathematical operator that does the same thing.

The  TheoryGenerationFacade.Concat() method takes multiple ValueSource<String> or string argument, and concatenates all combinations of values produced by the argument.  A legal C identifier is a letter followed by a sequence of letters or numbers. I could generate them with this. TheoryMethodFacade.Concat(StringGenerator.Alpha(1,1), StringGenerator.AlphaNumeric(0,127)).  Even though Concat applies only to strings, it is not a

Combine these two methods with the TheoryMethodFacade.Options, which works on many types including strings, and you will find that you have all the primatives you need to generate strings representing any context free grammar.  Virtually every computer language every written is either context free or almost context free, so this should be adequate.  If I don't get tired to soon, a future post in this series will illustrate this process.

That's all for tonight.  Its been a long day, I need some time off.  Next time I feel like posting I will describe the ValueSource<String>.Fuzz methods.