Recent Posts

  • New Site!

    Finally! My new website is ready.

    What’s New?

    A new, minimal design! No more WordPress. The layout is responsive and looks good on everything from the 4” iPhones to a desktop.

    The RSS feed and Feedly subscription link work again.

    Discussions on Hacker News and Reddit are automatically discovered, linked from each post, and indexed via hackernews and reddit tags.

    Pages load in under 100 milliseconds from near the data center and half a second from Europe.

    Sadly, a casualty of migrating to a static site is losing the ability to post comments.

    Most importantly, now that I’m proud of the site, I am motivated to start writing again. That’s the theory anyway!

    History

    My first website was hand-authored in Netscape Composer and Notepad. Maroon text on a tiled wood texture, hosted by our local dial-up ISP.

    My second was a cgi-bin program written in C that parsed text files and produced HTML, hosted by a friend’s university server. I don’t recommend making websites in C.

    My third was split. Project pages and release notes were rendered with some custom PHP and hosted from my apartment on the domain aegisknight.org. Essays and personal updates went on my LiveJournal.

    As IMVU and I became better known, my posts started to appear on social media, and LiveJournal became embarrassing. So I bought chadaustin.me, did a proper site design, and set up WordPress.

    Ten years later, with a house and three kids, I just can’t deal with the maintenance overhead of WordPress, PHP, and MySQL. Even beyond maintenance, WordPress started to feel stifling. Pages loaded slowly. The layout and style were hard to tweak. The latest themes were meh. The spam detection was flaky. I couldn’t easily batch updates across all posts. It looked like crap on mobile. The RSS feed had stopped updating and I couldn’t figure out why.

    So here we are! Back to a static site. This time powered by Jekyll and a bunch of Python scripts. Time is a river and history repeats.

    The Migration

    I wrote a Python script to dump the WordPress database into a JSON file.
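
    Roughly speaking, such a dump script only needs to select from the standard WordPress tables and serialize the rows. Here is a minimal sketch of the idea, using PyMySQL; the credentials, column list, and library choice are placeholders, not the actual script:

    import json
    import pymysql

    # Minimal sketch: dump published WordPress posts to a JSON file.
    conn = pymysql.connect(
        host="localhost",
        user="wordpress",
        password="...",
        database="wordpress",
        cursorclass=pymysql.cursors.DictCursor,
    )
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT ID, post_date, post_name, post_title, post_content "
            "FROM wp_posts WHERE post_status = 'publish'"
        )
        posts = cursor.fetchall()

    with open("wordpress-dump.json", "w") as f:
        json.dump(posts, f, default=str)  # default=str handles datetime columns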

    I then wrote a one-time migration script which took all of the WordPress posts and pages and converted them to Jekyll with appropriate YAML front matter. WordPress comments are placed in a list in the front matter which the Liquid template renders.

    Afterwards, on every build of the website, a post-processing script searches Hacker News and Reddit for links to each post and places them in a list in the post’s front matter, which the corresponding Liquid template can render.
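
    For the Hacker News half, the public hn.algolia.com search API makes this straightforward. A rough sketch of the idea follows; the query parameters and result fields here are my own guesses at a reasonable approach, not the site’s actual script. Reddit has a similar URL-lookup endpoint.

    import json
    import urllib.parse
    import urllib.request

    def find_hn_discussions(post_url):
        # Search Algolia's Hacker News index for submissions of this URL.
        query = urllib.parse.urlencode({
            "query": post_url,
            "restrictSearchableAttributes": "url",
        })
        with urllib.request.urlopen(
            "https://hn.algolia.com/api/v1/search?" + query
        ) as response:
            hits = json.load(response)["hits"]
        # Keep just enough for the Liquid template to render a link.
        return [
            {
                "url": "https://news.ycombinator.com/item?id=" + hit["objectID"],
                "points": hit.get("points", 0),
            }
            for hit in hits
        ]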

    One challenge with manipulating the front matter is preserving ordering on the keys (or at least avoiding non-determinism). It’s possible to override pyyaml so it uses OrderedDict for mappings.
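
    Something like the well-known pyyaml recipe for ordered mappings works here; this is a sketch of the standard approach, not necessarily the exact code used for the site:

    import collections
    import yaml

    # Build OrderedDicts when loading and emit them like plain mappings when
    # dumping, so front matter key order round-trips deterministically.
    def ordered_load(stream):
        class OrderedLoader(yaml.SafeLoader):
            pass

        def construct_mapping(loader, node):
            loader.flatten_mapping(node)
            return collections.OrderedDict(loader.construct_pairs(node))

        OrderedLoader.add_constructor(
            yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, construct_mapping
        )
        return yaml.load(stream, OrderedLoader)

    def ordered_dump(data, stream=None):
        class OrderedDumper(yaml.SafeDumper):
            pass

        def represent_ordereddict(dumper, value):
            return dumper.represent_mapping(
                yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, list(value.items())
            )

        OrderedDumper.add_representer(collections.OrderedDict, represent_ordereddict)
        return yaml.dump(data, stream, OrderedDumper, default_flow_style=False)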

    I wrote a pile of automated tests to verify none of the WordPress links were broken, driven primarily by top hits to the access log and external links known to Google’s Search Console.

    Writing these tests was worth it - some links had broken because of inconsistent time zone handling in the source data, moving a post’s permalink to the next or previous day. I told Jekyll to pretend the site is specified in the America/Chicago time zone, which fixed everything. :P

    Finally, sudo apt remove libapache2-mod-php mysql-server php-mysql. 🎉

    I’m sure there are still bugs. If you notice anything, please let me know!

  • Timestamps

    Working on Eden's timestamp storage.

    Hm, I wonder what ext4's possible timestamp range is?

    Time to read https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Inode_Timestamps.

    That's a cute way to extend the legacy timestamp format. One extra 32-bit value gives nanosecond precision and extends the range out to year 2446.
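
    My reading of that layout, as a sketch (field handling simplified; see the wiki page for the authoritative description):

    # The extra 32-bit field packs 2 high bits of the seconds value and
    # 30 bits of nanoseconds (per the Ext4_Disk_Layout wiki page).
    def decode_ext4_time(seconds_lo, extra):
        if seconds_lo >= 0x80000000:
            seconds_lo -= 1 << 32          # legacy field is a signed 32-bit value
        epoch_bits = extra & 0x3           # 2 extra high bits of the seconds
        nanoseconds = extra >> 2           # remaining 30 bits are nanoseconds
        return seconds_lo + (epoch_bits << 32), nanoseconds

    # Maximum representable time: epoch bits = 3, seconds_lo = 0x7FFFFFFF
    # -> 15,032,385,535 seconds past the Unix epoch, i.e. the year 2446.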

    What happens if we exceed the range?

    dev:~/tmp/timestamps $ df -T .
    Filesystem     Type    ...
    /dev/vda3      ext4    ...
    dev:~/tmp/timestamps $ touch -d '1900-01-01 01:02' a
    dev:~/tmp/timestamps $ ls -l
    total 0
    -rw-r--r--. 1 chadaustin users 0 Jan  1  1900 a
    

    Huh, that's weird. ext4 shouldn't be able to represent 1900.

    dev:~/tmp/timestamps $ touch -d '2800-01-01 01:02' b
    dev:~/tmp/timestamps $ ls -l
    total 0
    -rw-r--r--. 1 chadaustin users 0 Jan  1  1900 a
    -rw-r--r--. 1 chadaustin users 0 Jan  1  2800 b
    

    And it definitely shouldn't be able to represent year 2800...

    dev:~/tmp/timestamps $ sudo bash -c 'echo 2 > /proc/sys/vm/drop_caches'
    dev:~/tmp/timestamps $ ls -l
    total 0
    -rw-r--r--. 1 chadaustin users 0 May 29  2444 a
    -rw-r--r--. 1 chadaustin users 0 Aug  5  2255 b
    

    ... oh.

  • Digitizing Home Videos

    Several years back, my father started a project to digitize our home videos. He purchased an old computer from the IMVU automated builds cluster, bought an AVerMedia C027 capture card, digitized a few tapes… and then his digitization workstation sat there for years, untouched.

    Sadly, he passed away last year, so I picked up the project. There were four types of analog media to digitize: VHS tapes, 8mm tapes, audio cassettes, and old Super 8 film reels.

    Super 8

    The Super 8 film went to a service in Redwood City. I don’t have any relevant equipment to play it and they do a good job – they clean the film, take high-resolution photos of each frame, and then apply color adjustments to correct for any age-related fading, overexposure, or underexposure. The output format is up to you. I selected an MP4 movie file and a 1080p JPEG for every captured frame. (30 GB of 1080p JPEGs for 20 minutes of video!)

    The service worked out pretty well. My only complaint was that I gave them seven individually labeled 3” film reels but, presumably to make it easier to process, they taped six of the reels into one larger 6” reel, so I had to split the files back up. Avidemux made lossless splitting on the I-frame boundaries trivial.

    Audio Cassettes

    The audio was similarly easy. ION makes an inexpensive tape deck that advertises itself as a stereo USB microphone. You can capture the audio straight into Audacity and clip, process, and encode as needed.

    VHS and 8mm

    The bulk of the project was VHS and 8mm: we had two medium-sized moving boxes plus a shoebox of VHS tapes and a medium-sized box of 8mm. Probably close to 100 tapes in all.

    Home videos are not worth much if nobody can watch them, so my primary goal was to make the video conveniently accessible to family. I also wanted to minimize unnecessary re-encodes and quality loss. The film and VHS had already degraded over time. Some quality loss, unfortunately, is inevitable without spending $$$ on dedicated equipment that captures frames from the tape.

    My parents happened to own a very high-quality VCR that’s still in great shape. The capture sequence ended up something like this:

    Video Cassette -> VCR -> Composite Cables -> Capture Card -> MPEG-2

    Since each tape contained a hodgepodge of home videos (sometimes interleaved with TV recordings!), they had to be split up. The excellent, open source dvbcut software is perfect for this: it has a quadratic slider for frame-accurate scrubbing and it only recompresses frames when your splits don’t line up precisely with I-frames. I recommend doing your dvbcut work on an SSD. Scrubbing is painful on a spinny disk.

    Converting the 8mm tapes was similar, except the VCR is replaced with the (again, still in great shape) Sony camcorder in playback mode. Also, since the 8mm tapes are mono but the capture card always records in stereo, you have an option: either run a post-split ffmpeg -map_channel step to convert the stereo MPEG-2 files into mono (this has to happen after splitting because dvbcut can’t read videos after ffmpeg processes them, for some reason), or tell HandBrake to mix down the audio to mono from the right channel only. The latter avoids an audio re-encode, but it’s easier to forget when setting up long HandBrake jobs.

    Finally, because the captured MPEG-2 files are large (4 GB per hour of video), I recompressed in HandBrake to H.264. I don’t notice a material quality difference (besides some “free” noise reduction), and the H.264 MP4 files are smaller and have more responsive seeking.

    In the end, the steps that involve quality loss are:

    1. Real-time playback. Tracking glitches, for example, result in a few missed frames. But, like I mentioned, it would take $$$ to do a precise, frame-accurate digitization of each VHS frame.
    2. Composite cables instead of S-Video. I couldn’t find a VCR on Craigslist that supported S-Video output.
    3. Capturing in MPEG-2. I’m not convinced the real-time AVerMedia MPEG-2 encoder is very good - I’d occasionally notice strips of artifacty blocks in high-frequency regions like tree lines.
    4. A few frames of dvbcut’s re-encoding at the beginning and end of every split.
    5. YouTube / HandBrake. Might be slightly better to upload the split MPEG-2 into YouTube and let it recompress, but uploading 2 TB of video to YouTube didn’t seem very fun.

    The bulk of the time in this project went towards capturing the video. It has to play in real time. Each 8mm cassette was 2 hours, and VHS tapes ranged between 2 and 8 hours.

    The bulk of the effort, on the other hand, went into splitting, labeling, and organizing. I had to rely on clues to figure out when and where some videos were set. There were many duplicate recordings, too, so I had to determine which was higher quality.

    Now that all that’s done, I plan to upload everything to YouTube and make a Google Doc to share with family members, in case anyone wants to write stories about the videos or tag people in them.

  • JSON Never Dies - An Efficient, Queryable Binary Encoding

    A familiar story: Team starts project. Team uses JSON for interchange; it's easy and ubiquitous and debuggable. Project grows. Entity types multiply. Data sizes increase. Finally, team notices hundreds of kilobytes -- or even megabytes -- of JSON is being generated, transferred, and parsed.

    "Just use a binary format!" the Internet cries.

    But by the time a project gets to this stage, it's often a nontrivial amount of work to switch to Protocol Buffers or Cap'n Proto or Thrift or whatever. There might be thousands of lines of code for mapping model objects to and from JSON (that is, arrays, objects, numbers, strings, and booleans). And if you're talking about some hand-rolled binary format, it's even worse: you need implementations for all of your languages and to make sure they're fuzzed and secure.

    The fact is, the activation energy required to switch from JSON to something else is high enough that it rarely happens. The JSON data model tends to stick. However, if we could insert a single function into the network layer to represent JSON more efficiently, then that could be an easy, meaningful win.

    "Another binary JSON? We already have CBOR, MsgPack, BSON, and UBSON!" the Internet cries.

    True, but, at first glance, none of those formats actually seem all that great. JSON documents tend to repeat the same strings over and over again, but those formats don't support reusing string values or object shapes.

    This led me to perform some experiments. What might a tighter binary representation of JSON look like?

    What's in a format?

    First, what are some interesting properties of a file format?

    • Flexibility and Evolvability: Well, we're talking about representing JSON here, so they're all going to be similar. However, some of the binary JSON replacements also have support for dates, binary blobs, and 64-bit integers.
    • Size: How efficiently is the information encoded in memory? Uncompressed size matters because it affects peak memory usage and it's how much data the post-decompression parser has to touch.
    • Compressibility: Since you often get some sort of LZ compression "for free" in the network or storage interface, there's value in the representation being amenable to those compression algorithms.
    • Code Size: How simple are the encoders and decoders? Beyond the code size benefits, simple encoders and decoders are easier to audit for correctness, resource consumption bounds, and security vulnerabilities.
    • Decoder Speed: How quickly can the entire file be scanned or processed? For comparison, JSON can be parsed at a rate of hundreds of MB/s.
    • Queryability: Often we only want a subset of the data given to us. Does the format allow O(1) or O(lg N) path queries? Can we read the format without first parsing it into memory?

    Size and parse-less queryability were my primary goals with JND. My hypothesis was that, since many JSON documents have repeating common structures (including string keys), storing strings and object shapes in a table would result in significant size wins.

    Quickly glancing at the mainstream binary JSON encodings...

    MsgPack

    Each value starts with a tag byte followed by its payload. e.g. "0xdc indicates array, followed by a 16-bit length, followed by N values". Big-endian integers.

    Must be parsed before querying.

    BSON

    Per spec, BSON does not actually seem to be a superset of JSON? It disallows NUL bytes in object keys and does not support arrays as root elements, as far as I can tell.

    Otherwise, similar encoding as MsgPack. Each value has a tag byte followed by a payload. At least it uses little-endian integers.

    UBJSON

    Same idea. One-byte tags for each value, followed by a payload. Notably, lists and objects have terminators, and may not have an explicit length. Kind of an odd decision IMO.

    Big-endian again. Weird.

    CBOR

    IETF standard. Has a very intentional specification with documented rationales. Supports arrays of known sizes and arrays that terminate with a special "break" element. Smart 3-bit major tag followed by 5-bit length with special values for "length follows in next 4 bytes", etc.

    Big-endian again... Endianness doesn't matter all that much, but it's kind of weird to see formats using the endianness that's less common these days.

    CBOR does not support string or object shape tables, but at first glance it does not seem like CBOR sucks. I can imagine legitimate technical reasons to use it, though it is a quite complicated specification.

    JND!

    Okay! All of those formats have roughly the same shape: one-byte prefixes on every value, with value payloads inline (and thus values are variable-width).

    Now it's time to look at the format I sketched out.

    The file consists of a simple header marking the locations and sizes of three tables: values, strings, and object shapes.

    The string table consists of raw UTF-8 string data.

    In the value table, every value starts with a tag byte (sound familiar?). The high nibble encodes the type. The low nibble contains two 2-bit values, k and n.

    Consider strings:

    0011 kknn [1<<k bytes] [1<<n bytes] - string: offset, length into string table

    0011 indicates this is a string value. k and n are "size tags" which indicate how many bytes encode the integers. The string offset is 1 << k bytes (little-endian) and the string length is 1 << n bytes. Once the size tags are decoded, the actual offset and length values are read following the tag byte, and the resulting indices are used to retrieve the UTF-8 text at the given offset and length from the string table.
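
    As a sketch of what decoding such a value might look like (illustrative only; the function and variable names here are mine, not the actual JND code):

    # value_table and string_table are bytes objects; offset points at a tag byte.
    def decode_string_value(value_table, string_table, offset):
        tag = value_table[offset]
        assert tag >> 4 == 0b0011, "not a string value"
        k = (tag >> 2) & 0b11              # size tag for the string offset
        n = tag & 0b11                     # size tag for the string length
        pos = offset + 1
        str_offset = int.from_bytes(value_table[pos:pos + (1 << k)], "little")
        pos += 1 << k
        str_length = int.from_bytes(value_table[pos:pos + (1 << n)], "little")
        return string_table[str_offset:str_offset + str_length].decode("utf-8")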

    Now let's look at objects:

    1000 kknn [1<<k bytes] - object: object shape index, then <length> value indices, each 1<<n bytes

    The following 1 << k bytes encode the index into the object shape table, which holds the number and sorted list of object keys. Afterwards is a simple list of indices into the value table, each of size 1 << n bytes. The values are matched up with the keys in the object shape table.

    Arrays are similar, except that instead of using an object index, they simply store their length.

    This encoding has the property that lookups into an array are O(1) and lookups into an object are O(lg N), giving efficient support for path-based queries.
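
    To make that concrete, here is a hypothetical sketch of the query side, assuming the object's shape keys and value indices have already been decoded as described above:

    import bisect

    # shape_keys: sorted list of key strings for this object shape
    # value_indices: value-table indices, one per key, in the same order
    def lookup_object_field(shape_keys, value_indices, key):
        i = bisect.bisect_left(shape_keys, key)       # O(lg N) search
        if i == len(shape_keys) or shape_keys[i] != key:
            return None                               # key not present
        return value_indices[i]

    # Array element lookup is just indexing into the fixed-width list: O(1).
    def lookup_array_element(value_indices, index):
        return value_indices[index]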

    But there's a pretty big downside relative to MsgPack, CBOR, and the like. The cost of efficient random access is that the elements of arrays and objects must have a known size. Thus, the (variable-width) values themselves cannot be stored directly into the array's element list. Instead, arrays and objects have a list of fixed-width numeric offsets into the value table. This adds a level of indirection, and thus overhead, to JND. The payoff is that once a particular value is written (like a string or double-precision float), its index can be reused and referenced multiple times.

    So how does this play out in practice? Net win or loss?

    Size Benchmarks

    I used sajson's standard corpus of test JSON documents.

    • apache_builds: Root object, largely consists of an array containing many three-element objects.
    • github_events: Mix of objects, arrays, and long URL strings.
    • get_initial_state: I can't share the contents of this document as it's from a project at work, but it's 1.7 MB of various entity definitions, where each entity type is an object with maybe a dozen or two fields.
    • instruments: A huge number of repeated structures and data -- very compressible.
    • mesh: 3D geometry. Basically a giant list of floating point and integer numbers.
    • svg_menu: Only 600 bytes - used to test startup and base overhead costs.
    • twitter: List of fairly large objects, many long strings.
    • update-center: From Jenkins. Mostly consists of an object representing a mapping from plugin name to plugin description.

    Conclusions

    We can draw a few conclusions from these results.

    • As a wire replacement for the JSON data model, there's no apparent reason to use BSON or UBJSON. I'd probably default to MsgPack because it's simple, but CBOR might be okay too.
    • The current crop of binary formats doesn't compress any better than JSON itself, so if size over the wire is your primary concern, you might as well stick with JSON.
    • Except for small documents, where JND's increased representation overhead dominates, JND is pretty much always smaller uncompressed. As predicted, reusing strings and object shapes is a big win in practice.
    • LZ-style compression algorithms don't like JND much. Not a big surprise, since they don't do a good job with sequences of numeric values. I expect delta coding value offsets in arrays and objects would help a lot, at the cost of needing to do a linear pass from delta to absolute at encode and decode time.

    JND's disadvantage is clear in the above graphs: while it's smaller uncompressed, it does not compress as well as JSON or MsgPack. (Except in cases where its uncompressed size is dramatically smaller because of huge amounts of string or object shape reuse.)

    Where would something like JND be useful? JND's big advantage is that it can be read directly without allocating or parsing into another data structure. In addition, the uncompressed data is relatively compact. So I could imagine it being used if IO is relatively cheap but tight memory bounds are required.

    Another advantage of JND is a bit less obvious. In my experience using sajson from other languages, the parse itself is cheaper than converting the parsed AST into values in the host language. For example, constructing Swift String objects from binary UTF-8 data was an order of magnitude slower than the parse itself. In JND, every unique string would naturally be translated into host language String objects once.

    If you'd like to play with my experiments, they're on GitHub at https://github.com/chadaustin/jnd.

    Future Work?

    It was more work than I could justify for this experiment, but I'd love to see how Thrift or Protocol Buffers compare to JND. JND is a self-describing format, so it's going to lose on small messages, but when data sizes get big I wouldn't be surprised if it at least ties protobufs.

    Update: Forgot to mention decode times. I didn't have time to set up a proper benchmark with high-performance implementations of each of these formats (rather than whatever pip gave me), but I think we can safely assume CBOR, MsgPack, BSON, and UBJSON all have similar parse times, since the main structure of the decoder loop would be the same. The biggest question would be: how does that compare to JSON? And how does JND compare to both?

  • Writing a Fast JSON Parser

    Several holidays ago, I got a bee in my bonnet and wrote a fast JSON parser whose parsed AST fits in a single contiguous block of memory. The code was small and simple and the performance was generally on-par with RapidJSON, so I stopped and moved on with my life.

    Well, at the end of 2016, Rich Geldreich shared that he'd also written a high-performance JSON parser.

    I dropped his pjson into my benchmarking harness and discovered it was twice as fast as both RapidJSON and sajson! Coupled with some major JSON parsing performance problems in a project at work, I was nerd-sniped again.

    I started reading his code but nothing looked particularly extraordinary. In fact, several things looked like they ought to be slower than sajson... Oh wait, what's this?

    do {
        c = pStr[0]; if (globals::s_parse_flags[c] & 1)
            { ++pStr; break; }
        c = pStr[1]; if (globals::s_parse_flags[c] & 1)
            { pStr += 2; break; }
        c = pStr[2]; if (globals::s_parse_flags[c] & 1)
            { pStr += 3; break; }
        c = pStr[3];
        pStr += 4;
    } while (!(globals::s_parse_flags[c] & 1));
    

    Why is unrolling that loop a win? How is the parse flags look-up (even L1 is 3-4 cycle latency) better than some independent comparisons?

    Guess it was time to do some instruction dependency analysis!

    Fast String Parsing

    The problem at hand: given a pointer to the first byte after the opening quote of a string, find the first special byte, where special bytes are ", \, <0x20, or >0x7f.

    The loop:

    for (;;) {
        if (*p < 0x20 || *p > 127 || *p == '"' || *p == '\\') {
            break;
        }
        ++p;
    }
    

    roughly breaks down into:

    begin_loop:
        load c, [p]
        compare c, 0x20
        jump-if-less exit_loop
        compare c, 0x7f
        jump-if-greater exit_loop
        compare c, '"'
        jump-if-equal exit_loop
        compare c, '\\'
        jump-if-equal exit_loop
        increment p
        jump begin_loop
    exit_loop:
    

    How do we think about the performance of this loop? Remember that mainstream CPUs are out-of-order and can execute four (or more!) instructions in parallel. So an approximate mental model is that the frontend decodes multiple instructions per cycle and feeds them into an execution engine that can execute multiple instructions per cycle. Instructions will execute simultaneously if they're independent of each other. If an instruction depends on the result of another, it must wait N cycles, where N is the first instruction's latency.

    So, assuming all branches are correctly predicted (branch predictors operate in the frontend and don't count towards dependency chains), let's do a rough estimate of the cost of the loop above:

    load c, [p] # depends on value of p
    # assuming L1, ~4 cycle latency
    # all four comparisons can be done in parallel
    # assumption: jumps are all predicted
    increment p
    # jump to beginning
    

    increment p is the only instruction on the critical path - it carries a dependency across iterations of the loop. Is there enough work to satisfy the execution resources during the increment? Well, the comparisons are independent so they can all be done in parallel, and there are four of them, so we can probably keep the execution units busy. But it does mean that, at most, we can only check one byte per cycle. In reality, we need to issue the load and increment too, so we're looking at a loop overhead of about 2-3 cycles per byte.

    Now let's look more closely at Rich's code.

    Replacing the comparisons with a lookup table increases comparison latency (an L1 load takes 3-4 cycles) but also increases comparison throughput (multiple loads can be issued per cycle).

    So let's spell out the instructions for Rich's code (reordered for clarity):

    begin_loop:
        load c0, [p]
        load c1, [p+1]
        load c2, [p+2]
        load c3, [p+3]
        load t0, [is_plain_string+c0]
        load t1, [is_plain_string+c1]
        load t2, [is_plain_string+c2]
        load t3, [is_plain_string+c3]
        compare t0, 0
        jump-if-equal exit_loop
        compare t1, 0
        jump-if-equal increment_1_and_exit_loop
        compare t2, 0
        jump-if-equal increment_2_and_exit_loop
        compare t3, 0
        jump-if-equal increment_3_and_exit_loop
        add p, 4
        jump begin_loop
    increment_1_and_exit_loop:
        add p, 1
        jump exit_loop
    increment_2_and_exit_loop:
        add p, 2
        jump exit_loop
    increment_3_and_exit_loop:
        add p, 3
    exit_loop:
    
    • Load four bytes per iteration [not on critical execution path]
    • Load four LUT entries [not on critical execution path]
    • Issue four comparisons per cycle [not on critical execution path]
    • add p, 4

    Again, the critical path is only add p, 4, but we still need to issue the other instructions. The difference is that now all of the loads happen in parallel and the comparisons for 4 bytes happen in parallel too, rather than doing four comparisons per byte.

    It's still hard to say if this is a win on paper -- Haswell can only issue 2 loads per cycle, though the loads can overlap with the comparisons from previous bytes. However, we still have to issue all of these instructions. So maybe we're looking at something closer to 2 cycles per byte?

    Empirically, at least on my Broadwell, replacing four comparisons with a LUT was definitely a win. 55cc213

    But is unrolling the loop necessary? If I take out the unrolling but leave the LUT, clang gets slower but gcc stays the same. I checked - neither gcc nor clang do any automatic unrolling here. What's going on? Branch predictor effects? Also, Intel's Architecture Code Analyzer (IACA) says that the unrolled and non-unrolled loops are both frontend-bound and have the same loop throughput.

    I'm not yet convinced that a tight dependency chain is why unrolling is a win. More investigation is required here. But the important thing to keep in mind is to pay attention to the critical path latency through a loop body as well as the number of independent operations that can soak up spare execution bandwidth.

    Lead Bullets

    After playing around with LUTs and unrolling, I started implementing a bunch of small optimizations that I'd long known were available but didn't consider to be that important.

    Well, as it often turns out in performance projects, a sequence of 2% gains adds up to significant wins over time! If you've read Ben Horowitz's lead bullets story or how SQLite achieved a 50% speed up, this will sound familiar.

    Here are the optimizations that mattered:

    • Moved the input and output pointers into locals instead of members, which helps VC++ and Clang understand that they can be placed in registers. (gcc was already performing that optimization.) 71078d3 4a07c77
    • Replaced the entire parse loop with a goto state machine. Surprisingly, not only was this a performance win, but it actually made the code clearer. 3828769 05b3ec8 c02cb31
    • Changed an 8-entry enum to a uint8_t instead of the pointer-sized value it was defaulting to. 799c55f
    • Duplicated a bit of code to avoid needing to call a function pointer. 44ec0df
    • Tiny microoptimizations like avoiding branches and unnecessary data flow. 235d330 002ba31 193b183 c23fa23
    • Stored the tag bits at the bottom of the element index instead of the top, which avoids a shift on 64-bit. e7f2351

    Static Branch Prediction

    I also spent a bit of time on static branch prediction. It's a questionable optimization; in theory, you should just use profile-guided optimization (PGO) and/or rely on the CPU's branch predictors, but in practice few people actually bother to set up PGO. Plus, even though the CPU will quickly learn which branches are taken, the compiler doesn't know. Thus, by using static branch prediction hints, the compiler can line up the instructions so the hot path flows in a straight line and all the cold paths are off somewhere else, sometimes saving register motion in the hot path.

    For some examples, look at these uses of SAJSON_LIKELY and SAJSON_UNLIKELY.

    I can't recommend spending a lot of time on annotating your branches, but it does show up as a small but measurable win in benchmarks, especially on smaller and simpler CPUs.

    Things I Didn't Do

    • Unlike Rich's parser and RapidJSON, I chose not to optimize whitespace skipping. Why? Not worth the code size increase - the first thing someone who cares about JSON parsing performance does is minify the JSON. b05082b
    • I haven't yet optimized number parsing. Both RapidJSON and Rich's parser are measurably faster there, and it would be straightforward to apply the techniques. But the documents I regularly deal with are strings and objects and rarely contain numbers.

    Benchmark Results

    I tested on four devices: my desktop (high-clocked Haswell), laptop (low-clocked Broadwell), iPhone SE, and Atom-based home server.

    Benchmark charts (per device): Desktop, Dell XPS 13, iPhone SE, Atom D2700.

    The charts aren't very attractive, but if you look closely, you'll notice a few things:

    • Parsing JSON on modern CPUs can be done at a rate of hundreds of megabytes per second.
    • gcc does a much better job with RapidJSON than either clang or MSVC.
    • JSON parsing benefits from x64 - it's not a memory-bound or cache-bound problem, and the extra registers help a lot.
    • The iPhone SE is not much slower than my laptop's Broadwell. :)

    The Remainder of the Delta

    As you can see in the charts above, sajson is often faster than RapidJSON, but still not as fast as Rich's pjson. Here are the reasons why:

    1. sajson does not require the input buffer to be null-terminated, meaning that every pointer increment requires a comparison with the end of the buffer (to detect eof) in addition to the byte comparison itself. I've thought about changing this policy (or even adding a compile-time option) but I really like the idea that I can take a buffer straight from a disk mmap or database and pass it straight to sajson without copying. On the other hand, I measured about a 10% performance boost from avoiding length checks.

    2. sajson sorts object keys so that object lookup takes logarithmic rather than linear time. The other high-performance parsers have linear-time object lookup by key. This is an optimization that, while not necessary for most use cases, avoids any accidental worst-case quadratic-time usage.

    3. sajson's contiguous AST design requires, for every array, shifting N words in memory where N is the number of elements in the array. The alternative would be to use growable arrays in the AST (requiring that they get shifted as the array is realloc'd). Hard to say how much this matters.

    Aside: The "You Can't Beat the Compiler" Myth

    There's this persistent idea that compilers are smart and will magically turn your code into something that maps efficiently to the machine. That's only approximately true. It's really hard for compilers to prove the safety (or profitability) of certain transformations and, glancing through the produced code for sajson, I frequently noticed just plain dumb code generation. Like, instead of writing a constant into a known memory address, it would load a constant into a register, then jump to another location to OR it with another constant, and then jump somewhere else to write it to memory.

    Also, just look at the charts - there are sometimes significant differences between the compilers on the same code. Compilers aren't perfect and they appreciate all the help you can give!

    Benchmarking

    Measuring the effect of microoptimizations on modern computers can be tricky. With dynamic clock frequencies and all of today's background tasks, the easiest way to get stable performance numbers is to take the fastest time from all the runs. Run your function a couple thousand times and record the minimum. Even tiny optimizations will show up this way.
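
    In Python terms, the idea is just this (a sketch, not the actual benchmark harness):

    import time

    # Take the fastest of many runs: the minimum is far more stable than
    # the mean on a machine with variable clocks and background noise.
    def best_time(fn, runs=2000):
        best = float("inf")
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        return best

    # e.g. report throughput in MB/s for a given document:
    # mbps = len(document) / best_time(lambda: parse(document)) / 1e6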

    (If a library that uses statistical techniques to automatically achieve precise results is readily available, consider using that.)

    I also had my JSON benchmarks report MB/s, which normalizes the differences between test files of different sizes. It also helps with understanding parser startup cost (when testing small files) and the differences in parse speed between files with lots of numbers, strings, huge objects, etc.

    Swift Bindings

    Dropbox (primarily @aeidelson) contributed high-quality Swift bindings for sajson. The challenge in writing these bindings was finding a way to efficiently expose the sajson parse tree to Swift. It turns out that constructing Swift arrays and objects is REALLY expensive; we once benchmarked 10 ms in sajson's parser and 400 ms of Swift data structure construction.

    Fortunately, Swift has decent APIs for unsafely manipulating pointers and memory, and with those we implemented the ability to access sajson AST nodes through a close-to-natural ValueReader type.

    Converting sajson strings (e.g. ranges of UTF-8 memory) into Swift Strings is still expensive, but Swift 4 might improve the situation.

    An iOS team at Dropbox replaced JSONSerialization with sajson and cut their initial load times by two thirds!

    Summary

    I used to think JSON parsing was not something you ever wanted in your application's critical path. It's certainly not the kind of algorithm that modern computers love (read byte, branch, read byte, branch). That said, this problem has been beaten to death. We now have multiple parsers that can parse data at hundreds of megabytes per second -- around the same rate as SHA-256! If we relaxed some of the constraints on sajson, it could even go faster.

    So how fast was Rich's parser, after all? When compiled with Clang and MSVC, quite a bit faster, actually. But when compiled with GCC, RapidJSON, sajson, and pjson were (and remain) very similar. Many of the differences come down to naive, manually-inlined code, which we know the compiler can reliably convert to a fast sequence of instructions. It's annoying to eschew abstractions and even duplicate logic across several locations, but, in the real world, people use real compilers. For critical infrastructure, it's worth taking on some manual optimization so your users don't have to care.

    Update

    Arseny Kapoulkine pointed out an awesome trick to me: replace the last byte of the buffer with 0, which lets you avoid comparing against the end of the buffer. When you come across a 0 byte, if it's the last one, treat it as if it's the byte that got replaced. This would be a material performance win in sajson _and_ avoid the need for specifying null-terminated buffers.