Learn Span<T> by Implementing a High-Performance CSV Parser

Ever since I first heard about Span<T>, I’ve been wanting to play around with it. It’s a ref struct, so the semantics of using this type and the restrictions that go along with it are best understood by actually trying to use it. So I decided to build a simple CSV parser and see how its performance compared to other mainstream CSV parsers.

You can find the full code on GitHub and reference the package from NuGet.

As the main goal of this project was to exercise Span<T> and, as we’ll later see, Memory<T>, for the initial version I decided to ignore some of the details of CSV parsing such as handling quotes, trimming values, and handling comments.

Initial Implementation

For my initial implementation of a parser, I simply created a class which took a ReadOnlyMemory<char> as the content and two ReadOnlySpan<char>s as the value and record delimiters, for example a comma and a newline respectively for a CSV. In the constructor I would walk the entire content, saving the index and length of each value in each record as a List<List<long>>. Finally, there are members on the class for getting the number of records, the number of values in a particular record, and a particular value. Already you may have some questions, so let’s dig into the details.
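
To make that shape concrete, here is a rough sketch of what the public surface looks like. The type and member names here are illustrative, not the actual package’s API:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the parser's shape; names and details are hypothetical.
public sealed class DelimitedTextParser
{
    private readonly ReadOnlyMemory<char> content;
    private readonly List<List<long>> records = new List<List<long>>();

    // The delimiters can be spans since they're only needed while the constructor runs.
    public DelimitedTextParser(
        ReadOnlyMemory<char> content,
        ReadOnlySpan<char> valueDelimiter,
        ReadOnlySpan<char> recordDelimiter)
    {
        this.content = content;
        // Walk the entire content once here, recording a packed (index, length)
        // entry for every value in every record (see the packing details below).
    }

    public int RecordCount => this.records.Count;

    public int GetValueCount(int recordIndex) => this.records[recordIndex].Count;

    public ReadOnlyMemory<char> GetValue(int recordIndex, int valueIndex)
    {
        long packed = this.records[recordIndex][valueIndex];
        int index = (int)(packed >> 32);
        int length = (int)packed;
        return this.content.Slice(index, length);   // a view over the content, not a copy
    }
}
```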

Firstly, I needed to use a ReadOnlyMemory<char> to store the content, as ref structs like Span<T> are stack-only and so can only be used in method parameters and local variables, while Memory<T> is a “regular” struct which can be stored on the heap. ReadOnlyMemory<T> can be sliced just as Span<T> can, without allocating or copying buffers around.
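
As a trivial illustration (not code from the parser), slicing never copies the underlying characters:

```csharp
using System;

// Slicing ReadOnlyMemory<char> produces a new view over the same underlying buffer;
// no characters are copied and nothing new is allocated on the heap.
ReadOnlyMemory<char> content = "a,b,c\nd,e,f".AsMemory();
ReadOnlyMemory<char> firstRecord = content.Slice(0, 5);   // "a,b,c"
ReadOnlySpan<char> span = firstRecord.Span;               // span view for parsing
Console.WriteLine(span.IndexOf(','));                     // prints 1
```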

I wanted the parser to allow random access into the values of the CSV, so I needed to store the index where each value started in the content, as well as its length. To save on memory, I decided to store this data in a List<List<long>>, where the outer list represents the records, the inner list represents the values within a record, and the long (Int64) stored for each value is constructed by fusing the index and length, both ints (Int32), using bit shifting.
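
The packing itself is just a couple of shifts, something along these lines (a sketch of the idea rather than the exact code):

```csharp
// Fuse a 32-bit index and 32-bit length into a single 64-bit value so that each
// value needs only one entry in the inner list.
static long Pack(int index, int length) => ((long)index << 32) | (uint)length;

// Split the fused value back into its index and length parts.
static (int Index, int Length) Unpack(long packed) => ((int)(packed >> 32), (int)packed);

// Round trip: index == 1234, length == 56.
long packed = Pack(1234, 56);
(int index, int length) = Unpack(packed);
```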

After running some benchmarks, I found that this implementation was already better than the other libraries, both in terms of speed and allocations.

Benchmark results after iteration 1:

| Method             | Data Size | Mean           | Scaled | Allocated    |
|--------------------|-----------|----------------|--------|--------------|
| My Parser          | Large     | 1,221.2610 us  | 0.59   | 16111.39 KB  |
| StringSplit        | Large     | 93,149.630 us  | 1.00   | 85937.55 KB  |
| TinyCsvParser      | Large     | 169,713.472 us | 1.82   | 25072.83 KB  |
| FastCsvParser      | Large     | 63,043.752 us  | 0.68   | 47227.2 KB   |
| CsvTextFieldParser | Large     | 109,165.716 us | 1.17   | 182031.39 KB |
| CsvHelper          | Large     | 206,209.245 us | 2.21   | 173448.95 KB |
| FileHelpers        | Large     | 278,066.185 us | 2.99   | 83321.9 KB   |
| My Parser          | Medium    | 307.973 us     | 0.76   | 157.07 KB    |
| StringSplit        | Medium    | 404.962 us     | 1.00   | 859.36 KB    |
| TinyCsvParser      | Medium    | 2,406.140 us   | 5.94   | 322.31 KB    |
| FastCsvParser      | Medium    | 717.625 us     | 1.77   | 796.22 KB    |
| CsvTextFieldParser | Medium    | 1,066.085 us   | 2.63   | 1820.45 KB   |
| CsvHelper          | Medium    | 2,050.169 us   | 5.06   | 1747.09 KB   |
| FileHelpers        | Medium    | 2,069.213 us   | 5.11   | 851.4 KB     |
| My Parser          | Small     | 3.269 us       | 0.84   | 1.96 KB      |
| StringSplit        | Small     | 3.890 us       | 1.00   | 8.58 KB      |
| TinyCsvParser      | Small     | 1,059.621 us   | 272.45 | 74.74 KB     |
| FastCsvParser      | Small     | 22.811 us      | 5.87   | 203.89 KB    |
| CsvTextFieldParser | Small     | 10.769 us      | 2.77   | 18.34 KB     |
| CsvHelper          | Small     | 22.715 us      | 5.84   | 28.7 KB      |
| FileHelpers        | Small     | 687.064 us     | 176.66 | 31.07 KB     |

Splitting Out a No-Allocation Reader

Next I decided to extract the reader from the parser so that scenarios which didn’t need random access to the data didn’t incur the allocations for the index data. Basically, the reader would advance through the memory as the consumer read it, much like an IEnumerator<T> or TextReader. Additionally, the parser was simplified, as it could use the reader to enumerate the values and simply worry about building up the index and length data for random access.
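
To illustrate the pattern, here is a minimal sketch with a single-character delimiter and no record handling; this is not the library’s actual reader:

```csharp
using System;

// Minimal "advance as you read" reader: a ref struct that holds a span and
// slices it forward on every read, returning values as views over the content.
public ref struct ValueReader
{
    private ReadOnlySpan<char> remaining;
    private readonly char delimiter;

    public ValueReader(ReadOnlySpan<char> content, char delimiter)
    {
        this.remaining = content;
        this.delimiter = delimiter;
    }

    // Returns the next value as a slice of the original content; no strings are allocated.
    public bool TryReadNext(out ReadOnlySpan<char> value)
    {
        if (this.remaining.IsEmpty)
        {
            value = default;
            return false;
        }

        int index = this.remaining.IndexOf(this.delimiter);
        if (index < 0)
        {
            value = this.remaining;
            this.remaining = ReadOnlySpan<char>.Empty;
        }
        else
        {
            value = this.remaining.Slice(0, index);
            this.remaining = this.remaining.Slice(index + 1);
        }

        return true;
    }
}
```

The real reader also has to distinguish value delimiters from record delimiters and support multi-character delimiters, but the shape is the same: the only state is a span and a few scalars, which is why reading allocates nothing.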

After categorizing the benchmarks into readers vs. parsers, the results showed that the reader implementation was far and away better than the other libraries which expose reader-like APIs. The closest contender was FastCsvParser on the large data set, which only has a couple hundred downloads on NuGet.org as of this writing and still took 3.5x longer to read the data. My reader requires zero allocations, as it simply stores and advances a span along with a few other scalar fields.

Benchmark results after iteration 2:

| Method             | Data Size | Mean           | Scaled | Allocated   |
|--------------------|-----------|----------------|--------|-------------|
| My Reader          | Large     | 16,349.969 us  | 1.00   | 0 B         |
| FastCsvParser      | Large     | 57,649.310 us  | 3.53   | 48360656 B  |
| CsvTextFieldParser | Large     | 103,398.356 us | 6.32   | 186400144 B |
| CsvHelper          | Large     | 194,618.607 us | 11.90  | 177611728 B |
| My Parser          | Large     | 53,862.981 us  | 1.00   | 16498940 B  |
| StringSplit        | Large     | 87,535.705 us  | 1.63   | 88000046 B  |
| TinyCsvParser      | Large     | 157,429.493 us | 2.92   | 25674400 B  |
| FileHelpers        | Large     | 259,745.145 us | 4.82   | 85322480 B  |
| My Reader          | Medium    | 161.729 us     | 1.00   | 0 B         |
| FastCsvParser      | Medium    | 640.133 us     | 3.96   | 815328 B    |
| CsvTextFieldParser | Medium    | 1,024.653 us   | 6.34   | 1864144 B   |
| CsvHelper          | Medium    | 1,953.924 us   | 12.08  | 1789016 B   |
| My Parser          | Medium    | 321.522 us     | 1.00   | 160840 B    |
| StringSplit        | Medium    | 382.312 us     | 1.19   | 879984 B    |
| TinyCsvParser      | Medium    | 2,249.190 us   | 7.00   | 330045 B    |
| FileHelpers        | Medium    | 1,901.365 us   | 5.91   | 871837 B    |
| My Reader          | Small     | 1.647 us       | 1.00   | 0 B         |
| FastCsvParser      | Small     | 22.443 us      | 13.62  | 208784 B    |
| CsvTextFieldParser | Small     | 10.202 us      | 6.19   | 18784 B     |
| CsvHelper          | Small     | 21.441 us      | 13.02  | 29392 B     |
| My Parser          | Small     | 3.385 us       | 1.00   | 2008 B      |
| StringSplit        | Small     | 3.703 us       | 1.09   | 8784 B      |
| TinyCsvParser      | Small     | 1,018.881 us   | 301.01 | 76532 B     |
| FileHelpers        | Small     | 664.814 us     | 196.41 | 31812 B     |

To ref struct or not to ref struct

I found that using the reader API was fairly difficult, and I was unable to write convenience subclasses (e.g. a CSV reader or a TSV reader) because the reader was defined as a ref struct so that it could hold other ref structs like ReadOnlySpan<char>. Because of this, the reader couldn’t be stored as a field in a class, used in async APIs, or put on the heap in any other way. I decided that instead it could just hold a ReadOnlyMemory<char> and become a class for easier usage.
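
To make the trade-off concrete, here is roughly what the restriction looks like in code (an illustrative snippet, not the library’s types):

```csharp
using System;
using System.Threading.Tasks;

public ref struct SpanBasedReader
{
    private ReadOnlySpan<char> remaining;   // legal: a ref struct may hold other ref structs

    public SpanBasedReader(ReadOnlySpan<char> content) => this.remaining = content;
}

public class CsvReaderWrapper
{
    // private SpanBasedReader reader;      // compile error: a ref struct can't be a field of a class
}

public class MemoryBasedReader
{
    private ReadOnlyMemory<char> remaining; // fine: Memory<T> is an ordinary struct

    public MemoryBasedReader(ReadOnlyMemory<char> content) => this.remaining = content;

    // Classes holding Memory<T> also work in async methods, unlike ref structs.
    public async Task<int> RemainingLengthAsync()
    {
        await Task.Yield();
        return this.remaining.Length;   // placeholder logic
    }
}
```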

Unfortunately, based on the benchmark results, this affected performance non-trivially and slowed down the reader by ~50%. The reader was still faster than the other implementations I tested against, but it was interesting to see just how optimized Span<T> and ref structs in general are.

Because of this, I may consider bringing back the ref struct implementation for usages which aren’t limited by it.

Benchmark results after iteration 3:

| Method             | Data Size | Mean           | Scaled | Allocated   |
|--------------------|-----------|----------------|--------|-------------|
| My Reader          | Large     | 24,929.372 us  | 1.00   | 0 B         |
| FastCsvParser      | Large     | 58,659.486 us  | 2.35   | 48360656 B  |
| CsvTextFieldParser | Large     | 105,256.008 us | 4.22   | 186400144 B |
| CsvHelper          | Large     | 196,541.906 us | 7.89   | 177611728 B |
| My Parser          | Large     | 59,266.792 us  | 1.00   | 16499337 B  |
| StringSplit        | Large     | 89,254.220 us  | 1.51   | 88000057 B  |
| TinyCsvParser      | Large     | 160,947.603 us | 2.72   | 25674926 B  |
| FileHelpers        | Large     | 265,223.622 us | 4.48   | 85322864 B  |
| My Reader          | Medium    | 244.256 us     | 1.00   | 96 B        |
| FastCsvParser      | Medium    | 686.830 us     | 2.81   | 815328 B    |
| CsvTextFieldParser | Medium    | 1,050.241 us   | 4.30   | 1864144 B   |
| CsvHelper          | Medium    | 1,957.324 us   | 8.01   | 1789016 B   |
| My Parser          | Medium    | 375.222 us     | 1.00   | 160936 B    |
| StringSplit        | Medium    | 394.238 us     | 1.05   | 879984 B    |
| TinyCsvParser      | Medium    | 2,311.955 us   | 6.16   | 330044 B    |
| FileHelpers        | Medium    | 2,014.732 us   | 5.37   | 871837 B    |
| My Reader          | Small     | 2.494 us       | 1.00   | 96 B        |
| FastCsvParser      | Small     | 22.896 us      | 9.18   | 208784 B    |
| CsvTextFieldParser | Small     | 10.506 us      | 4.21   | 18784 B     |
| CsvHelper          | Small     | 21.624 us      | 8.67   | 29392 B     |
| My Parser          | Small     | 3.985 us       | 1.00   | 2104 B      |
| StringSplit        | Small     | 3.825 us       | 0.96   | 8784 B      |
| TinyCsvParser      | Small     | 1,050.156 us   | 263.60 | 76532 B     |
| FileHelpers        | Small     | 680.903 us     | 170.91 | 31812 B     |

Next Steps

While analyzing these benchmarks, I was surprised to find that the simple and obvious string.Split implementation was better than most of the other libraries I tested against. One explanation could be that those libraries are optimized for usability rather than raw performance, for example mapping the records to a class. However, at least in some cases this can be explained by “streamability”. In my benchmarks, I loaded the entire content into memory as one big string prior to running each benchmark. I suspect that in real-world conditions, like parsing the content from disk or the network, some of these libraries would perform much better. In fact, my implementation suffers from the same problem, since it requires the entire buffer up front.
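
For reference, the StringSplit baseline in these benchmarks is essentially the naive approach, something like the following (the actual benchmark code may differ slightly):

```csharp
// Naive baseline: split the content into records, then each record into values.
// Every record and every value becomes a freshly allocated string.
string content = "a,b,c\nd,e,f";
foreach (string record in content.Split('\n'))
{
    string[] values = record.Split(',');
    // ... consume values ...
}
```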

So the next step is to continue this exploration using System.IO.Pipelines. Stay tuned for a follow up post about that!

A Tale of Performance and Bad SQL Usage

The other day I was looking at telemetry for one of my websites to finally figure out why I was running into my telemetry cap, something I had been putting off for a while. I’m on the free Application Insights tier, so I only get 1 GB a month and was having to sample down to only 4% of traffic to remain under the cap, which didn’t seem right considering the traffic that site gets.

What I found was that the dependency telemetry was being spammed with similar SQL calls.

Based on the behavior of the website, I knew that all of those calls were part of a single web request and SQL transaction. At the time I originally implemented this, it seems I was lazy and just put a bunch of separate INSERT INTO statements in one transaction. Note that I manually craft the SQL queries in my app (I know, the horror!) because Entity Framework tends to have pretty bad performance despite being a pretty good programming interface, and stored procedures felt a bit awkward to use when the inputs are tables. Judgment of this practice may still be warranted, but it’s what I have right now.

In any case, I ended up combining the INSERT INTO statements for similar tables, which cut the number of SQL calls from 40 to 7 for this sort of request and drastically reduced the number of telemetry entries.

This actually was enough of a cut for me to increase my sampling to about 25%. That’s still lower than I’d like, but it’s a great improvement. The majority of the remaining excess is the SQL calls made by Entity Framework since I’m using it for Identity and Authentication.

One very interesting and initially unexpected side effect of this change is that each of the individual SQL calls seems to take approximately the same amount of time as before, about 40 or so milliseconds per call (with some outliers where I need to use HOLDLOCK), so this had the additional benefit of cutting the end-to-end latency of the request to about one third of what it previously was.

I’m considering taking this even further and composing one massive SQL query, including the transaction statements, to have just a single SQL call per request. The moral of the story, though, is to combine calls when it’s practical to do so. In my case, it was straightforward and made good sense to combine a loop of INSERT INTO statements into a single statement, as INSERT INTO easily allows for that.
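
As a sketch of that idea (the table and column names are made up, and I’m using Microsoft.Data.SqlClient here, though System.Data.SqlClient looks the same):

```csharp
using System.Collections.Generic;
using System.Text;
using Microsoft.Data.SqlClient;

// Build a single multi-row INSERT instead of issuing one INSERT per item in a loop.
// One command means one round trip (and one telemetry entry) instead of rows.Count of them.
// Note: SQL Server caps a single VALUES list at 1000 rows and a command at ~2100 parameters.
static SqlCommand BuildCombinedInsert(SqlConnection connection, IReadOnlyList<(int GameId, int Score)> rows)
{
    var sql = new StringBuilder("INSERT INTO dbo.Scores (GameId, Score) VALUES ");
    var command = new SqlCommand { Connection = connection };

    for (int i = 0; i < rows.Count; i++)
    {
        sql.Append(i == 0 ? "" : ", ").Append($"(@g{i}, @s{i})");
        command.Parameters.AddWithValue($"@g{i}", rows[i].GameId);
        command.Parameters.AddWithValue($"@s{i}", rows[i].Score);
    }

    command.CommandText = sql.ToString();
    return command;
}
```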

Stored procedures are the obvious solution to this problem, which is one of the reasons they’re highly recommended by pretty much anyone who knows anything about SQL, including some readers of this article who I’m sure are livid at my avoidance of them. But I personally find the awkwardness of using them, combined with the fact that my business logic ends up split between my app and my database, unattractive enough to pass them up. This investigation showed me, however, that when choosing that path you need to be very careful when crafting your SQL queries and be cognizant of the number of individual calls being made to avoid performance pitfalls.

Trimming Unnecessary Dependencies from Projects

When you’re working in large repositories with hundreds of MsBuild projects, you’re bound to have fairly complex build graphs. Over time, these can devolve, and you may end up with lots of dependencies between projects which are no longer needed. This can cause builds to slow down as they become less parallelizable, and the developer experience can suffer as you unnecessarily rebuild libraries which falsely depend on libraries you changed.

Luckily, the Roslyn compiler is smart enough to record in an assembly’s metadata only the references which are actually used. So you can compare the references which are passed to the compiler against the references that actually make it into the compiled assembly.

You can figure out which references make it into the assembly using a tool like ILSpy or dotPeek, or even use the reflection method Assembly.GetReferencedAssemblies.

There are three different ways that references can be included in your project: Reference, ProjectReference, and PackageReference.

References and Project References

References and Project References are both fairly straightforward. Each reference gets resolved, usually by its HintPath if it isn’t a framework assembly, and passed to the compiler. Similarly for project references, MsBuild does an inner build to determine each project reference’s target assembly and passes that to the compiler for the outer build. For both of these, it’s pretty straightforward to just compare each reference passed to the compiler and check whether it made it into the assembly’s metadata.

Transitivity

But wait, what if an indirect dependency is required at run time, but not at compile time? This can happen if you have a project with a dependency which itself has a dependency which isn’t directly required by the original project. To build the project, the indirect dependency may not be needed, however if the project produces something runnable like an exe or a unit test assembly, certain code paths may require the indirect dependency to get loaded into the App Domain.

So for projects which produce something runnable, we have to consider all transitive references, not just the references the main assembly has.
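
A rough way to gather that transitive closure with reflection looks like this (a simplified sketch; real resolution also has to cope with assemblies that fail to load):

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Walks the transitive closure of referenced assemblies starting from a root assembly.
// Simplified: assumes every referenced assembly can be loaded by name.
static HashSet<string> GetTransitiveReferences(Assembly root)
{
    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    var queue = new Queue<Assembly>();
    queue.Enqueue(root);

    while (queue.Count > 0)
    {
        foreach (AssemblyName reference in queue.Dequeue().GetReferencedAssemblies())
        {
            if (seen.Add(reference.Name))
            {
                queue.Enqueue(Assembly.Load(reference));
            }
        }
    }

    return seen;
}
```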

Package References

Package references are where things get a little complicated. In the new Sdk-style projects, MsBuild and NuGet work together to form the entire dependency graph for the package references you specified, collect all the assemblies for all of those packages, and pass every single one of them to the compiler. With the packaging of the framework assemblies themselves (i.e. netstandard2.0 and netcoreapp2.0), this can get fairly huge. As an example, for a fairly simple web application I have, a whopping 366 /reference parameters are passed to csc.exe. As if things weren’t complicated enough, some packages like Microsoft.AspNetCore.All are really just meta-packages which don’t contain any assemblies themselves but instead depend on a number of packages which do contain assemblies to be referenced. And then packages like Microsoft.Net.Compilers don’t add any references at all but instead provide additional build tooling by way of MsBuild props and targets.

So for package references, we have to account for any assembly in the package or any packages it depends on, and also for cases where they don’t provide references at all.

Introducing ReferenceTrimmer

If all of this seems overwhelming enough to make you not want to bother cleaning up your projects, fear not. I’ve created a little tool called ReferenceTrimmer which does all the work for you.

When you run it, it accounts for each of the things discussed above and prints out what it believes are unnecessary references, project references, and package references.

I’m sure it doesn’t cover all cases yet, and it likely reports false positives and false negatives, but it was even able to find some issues in some of my smaller projects. Contributions are always welcome if you find that it could do better!