I built an ARM64 assembly lexer (well, I generated one from my own parser generator, but this post is not about that) that processes Dart code 2x faster than the official scanner, a result I verified using statistical methods that can reliably measure small performance differences. Then I benchmarked it on 104,000 files and discovered my lexer was not the bottleneck. I/O was. This is the story of how I accidentally learned why pub.dev stores packages as tar.gz files.

The setup

I wanted to benchmark my lexer against the official Dart scanner. The pub cache on my machine had 104,000 Dart files totaling 1.13 GB, a perfect test corpus. I wrote a benchmark that:

  1. Reads each file from disk
  2. Lexes it
  3. Measures time separately for I/O and lexing

Simple enough.
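
In outline, the loop looked something like this. This is a minimal sketch, not the actual harness; listDartFiles and lex are stand-ins for the real corpus walk and lexer entry point:

// Minimal sketch of the benchmark loop, timing I/O and lexing separately.
import 'dart:io';

void main() {
  final ioWatch = Stopwatch();
  final lexWatch = Stopwatch();

  for (final path in listDartFiles('pub-cache')) {
    ioWatch.start();
    final bytes = File(path).readAsBytesSync(); // one open/read/close per file
    ioWatch.stop();

    lexWatch.start();
    lex(bytes);
    lexWatch.stop();
  }

  print('I/O: ${ioWatch.elapsedMilliseconds} ms');
  print('Lex: ${lexWatch.elapsedMilliseconds} ms');
}

// Stand-ins so the sketch is self-contained.
Iterable<String> listDartFiles(String root) => Directory(root)
    .listSync(recursive: true)
    .whereType<File>()
    .map((f) => f.path)
    .where((p) => p.endsWith('.dart'));

void lex(List<int> bytes) {/* the lexer under test */}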

The first surprise: lexing is fast

Here are the results:

Metric            ASM Lexer    Official Dart
Lex time          2,807 ms     6,087 ms
Lex throughput    402 MB/s     185 MB/s

My lexer was 2.17x faster. Success! But wait:

Metric           ASM Lexer    Official Dart
I/O time         14,126 ms    14,606 ms
Total time       16,933 ms    20,693 ms
Total speedup    1.22x        -

The total speedup was only 1.22x. My 2.17x lexer improvement was being swallowed by I/O. Reading files took 5x longer than lexing them.

The second surprise: the SSD is not the bottleneck

My MacBook has an NVMe SSD that can read at 5-7 GB/s. I was getting 80 MB/s. That is roughly 1.5% of the theoretical maximum.

The problem was not the disk. It was the syscalls.

For 104,000 files, the operating system had to execute:

  • 104,000 open() calls
  • 104,000 read() calls
  • 104,000 close() calls

That is over 300,000 syscalls. Each syscall involves:

  • A context switch from user space to kernel space
  • Kernel bookkeeping and permission checks
  • A context switch back to user space

Each syscall costs roughly 1-5 microseconds. Multiply that by 300,000 and you get 0.3-1.5 seconds of pure overhead, before any actual disk I/O happens. Add filesystem metadata lookups and directory traversal on top, and you can see where the time goes.

I tried a few things that did not help much. Memory-mapping the files made things worse due to the per-file mmap/munmap overhead. Replacing Dart's file reading with direct FFI syscalls (open/read/close) only gave a 5% improvement. The problem was not Dart's I/O layer; it was the sheer number of syscalls.
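
For reference, the FFI variant looked roughly like this. This is a sketch of the direct-syscall approach via dart:ffi, not my exact code; the 64 KiB buffer and error handling are simplified:

// Rough sketch: reading a file via direct POSIX syscalls through dart:ffi.
import 'dart:ffi';
import 'dart:io' show BytesBuilder, FileSystemException;
import 'dart:typed_data';
import 'package:ffi/ffi.dart';

final _libc = DynamicLibrary.process();
final _open = _libc.lookupFunction<Int32 Function(Pointer<Utf8>, Int32),
    int Function(Pointer<Utf8>, int)>('open');
final _read = _libc.lookupFunction<
    IntPtr Function(Int32, Pointer<Uint8>, IntPtr),
    int Function(int, Pointer<Uint8>, int)>('read');
final _close =
    _libc.lookupFunction<Int32 Function(Int32), int Function(int)>('close');

Uint8List readWholeFile(String path) {
  final cPath = path.toNativeUtf8();
  final fd = _open(cPath, 0); // 0 == O_RDONLY on macOS and Linux
  malloc.free(cPath);
  if (fd < 0) throw FileSystemException('open failed', path);
  final buf = malloc<Uint8>(64 * 1024);
  final out = BytesBuilder(copy: true);
  var n = 0;
  while ((n = _read(fd, buf, 64 * 1024)) > 0) {
    out.add(buf.asTypedList(n)); // copy out of the native buffer
  }
  malloc.free(buf);
  _close(fd);
  return out.takeBytes();
}

Fewer layers, same three syscalls per file, which is why it barely moved the needle.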

The hypothesis

I have mirrored pub.dev several times in the past and noticed that all packages are stored as tar.gz archives. I never really understood why, but this problem reminded me of that fact. If syscalls are the problem, the solution is fewer syscalls. What if instead of 104,000 files, I had 1,351 files (one per package)?

I wrote a script to package each cached package into a tar.gz archive:

104,000 individual files -> 1,351 tar.gz archives
1.13 GB uncompressed     -> 169 MB compressed (6.66x ratio)
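
The script itself is nothing special. A sketch of it, assuming the standard hosted/pub.dev cache layout and shelling out to tar; the paths are illustrative:

// Sketch of the repacking script: one tar.gz per cached package.
import 'dart:io';

Future<void> main() async {
  final cache = Directory('/Users/me/.pub-cache/hosted/pub.dev'); // illustrative
  final out = Directory('archives')..createSync(recursive: true);

  await for (final entry in cache.list()) {
    if (entry is! Directory) continue;
    final name = entry.path.split('/').last; // e.g. flame-1.34.0
    final result = await Process.run(
        'tar', ['-czf', '${out.path}/$name.tar.gz', '-C', entry.path, '.']);
    if (result.exitCode != 0) stderr.writeln(result.stderr);
  }
}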

The results

Metric             Individual Files    tar.gz Archives
Files/Archives     104,000             1,351
Data on disk       1.13 GB             169 MB
I/O time           14,525 ms           339 ms
Decompress time    -                   4,507 ms
Lex time           2,968 ms            2,867 ms
Total time         17,493 ms           7,713 ms

The I/O speedup was 42.85x. Reading 1,351 archives sequentially instead of 104,000 files scattered across the filesystem reduced I/O from 14.5 seconds to 339 milliseconds.

The total speedup was 2.27x. Even with decompression overhead, the archive approach was more than twice as fast.

Breaking down the numbers

I/O: 14,525 ms to 339 ms

This is the syscall overhead in action. Going from 300,000+ syscalls to roughly 4,000 syscalls (open/read/close for 1,351 archives) eliminated most of the overhead.

Additionally, reading 1,351 files sequentially is far more cache-friendly than reading 104,000 files scattered across the filesystem. The OS can prefetch effectively, the SSD can batch operations, and the page cache stays warm.

Decompression: 4,507 ms

gzip decompression ran at about 250 MB/s using the archive package from pub.dev. This is now the new bottleneck. I did not put much effort into optimizing decompression; an FFI-based solution using native zlib could be significantly faster. Modern alternatives like lz4 or zstd might also help.
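
The decompression path looks roughly like this. A sketch using package:archive's TarDecoder (3.x-style API, which differs a bit between major versions) and dart:io's gzip codec; lexBytes is a stand-in for the lexer entry point:

// Sketch of the archive path: one read per package, then in-memory
// gunzip + untar + lex.
import 'dart:io';
import 'package:archive/archive.dart';

void processArchive(String path) {
  final gzBytes = File(path).readAsBytesSync(); // 3 syscalls per package
  final tarBytes = gzip.decode(gzBytes);        // the new ~250 MB/s bottleneck
  for (final file in TarDecoder().decodeBytes(tarBytes)) {
    if (file.isFile && file.name.endsWith('.dart')) {
      lexBytes(file.content as List<int>);
    }
  }
}

void lexBytes(List<int> bytes) {/* stand-in for the lexer entry point */}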

Compression ratio: 6.66x

Source code compresses well. The 1.13 GB of Dart code compressed to 169 MB. This means less data to read from disk, which helps even on fast SSDs.

Why pub.dev uses tar.gz

[Screenshot: the pub.dev versions page, with a download button for flame-1.34.0.tar.gz]

This experiment accidentally explains the pub.dev package format. When you run dart pub get, you download tar.gz archives, not individual files. The reasons are now obvious:

  1. Fewer HTTP requests. One request per package instead of hundreds.
  2. Bandwidth savings. 6-7x smaller downloads.
  3. Faster extraction. Sequential writes beat random writes.
  4. Reduced syscall overhead. Both on the server (fewer files to serve) and the client (fewer files to write).
  5. Atomicity. A package is either fully downloaded or not. No partial states.

The same principles apply to npm (tar.gz), Maven (JAR/ZIP), PyPI (wheel/tar.gz), and virtually every package manager.

The broader lesson

Modern storage is fast. NVMe SSDs can sustain gigabytes per second. But that speed is only accessible for sequential access to large files. The moment you introduce thousands of small files, syscall overhead dominates.

This matters for:

  • Build systems. Compiling a project with 10,000 source files? The filesystem overhead might exceed the compilation time.
  • Log processing. Millions of small log files? Concatenate them. Claude uses JSONL for this reason.
  • Backup systems. This is why rsync and tar exist.

What I would do differently

If I were optimizing this further:

  1. Use zstd instead of gzip. 4-5x faster decompression with similar compression ratios.
  2. Use uncompressed tar for local caching. Skip decompression entirely, still get the syscall reduction.
  3. Parallelize with isolates. Multiple cores decompressing multiple archives simultaneously (see the sketch below).
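
Point 3 could be as simple as wrapping the per-archive work in Isolate.run (available since Dart 2.19). A sketch, with decompressAndLex standing in for the gunzip + untar + lex pipeline above:

// Sketch: one isolate task per archive; the VM schedules them across cores.
// In practice you would cap concurrency at roughly the core count.
import 'dart:isolate';

Future<void> processAll(List<String> archivePaths) => Future.wait(
    archivePaths.map((path) => Isolate.run(() => decompressAndLex(path))));

void decompressAndLex(String path) {/* gunzip + untar + lex, as above */}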

Conclusion

I set out to benchmark a lexer and ended up learning about syscall overhead. The lexer was 2x faster. The I/O optimization was 43x faster.


Addendum: Reader Suggestions

Linux-Specific Optimizations

servermeta_net pointed out two Linux-specific approaches: disabling speculative execution mitigations (which could improve performance in syscall-heavy scenarios) and using io_uring for asynchronous I/O. I ran these benchmarks on macOS, which does not support io_uring, but these Linux capabilities are intriguing. A follow-up post exploring how I/O performance can be optimized on Linux may be in order.

SQLite as an Alternative

tsanderdev mentioned that this is also why SQLite can be much faster than a directory with lots of small files. I had completely forgotten about SQLite as an option. Storing file contents in a SQLite database would eliminate the syscall overhead while providing random access to individual files, something tar.gz does not offer.
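
As a thought experiment, a SQLite-backed cache might look like this. A hypothetical sketch using package:sqlite3; the schema and names are mine, not something pub or Apple actually does:

// Hypothetical sketch: file contents as BLOBs in one SQLite database.
// One database file on disk, random access per path, no per-file syscalls.
import 'dart:convert';
import 'package:sqlite3/sqlite3.dart';

void main() {
  final db = sqlite3.open('file_cache.db');
  db.execute(
      'CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB)');

  final insert =
      db.prepare('INSERT OR REPLACE INTO files (path, data) VALUES (?, ?)');
  insert.execute(['lib/main.dart', utf8.encode('void main() {}')]);
  insert.dispose();

  // Random access to a single file, which a tar.gz stream cannot offer.
  final rows =
      db.select('SELECT data FROM files WHERE path = ?', ['lib/main.dart']);
  print(utf8.decode(rows.first['data'] as List<int>));
  db.dispose();
}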

This also explains something I have heard multiple times: Apple uses SQLite extensively in its applications, essentially simulating a filesystem inside SQLite databases. If reading 100,000 files takes 14 seconds on a modern Mac with NVMe storage, imagine what it was like on older, slower machines. The syscall overhead would have been even more punishing. Using SQLite instead of the filesystem lets them avoid those syscalls entirely. This is worth exploring.


Discuss on r/ProgrammingLanguages