```go
	tasqueue.NewJob("email", payload),
}
tasqueue.NewChain(chain, tasqueue.ChainOpts{})
```
The first sequence in our chained jobs workflow begins with a generator worker processing various CSV files to create templates for PDF files. This is then pushed into the queue as a job for the next worker, say the PDF generator, to pick up. Upon picking up their designated jobs, the different kinds of workers retrieve the relevant file from S3, process it, and upload the output back to S3. So, for a user's contract note PDF to be delivered to them via e-mail, their data passes through four workers (process data -> generate PDF -> sign PDF -> e-mail PDF), where each worker, after doing its job, dumps the resultant file to S3 for the next worker to pick up.

The global job states are stored in a Redis instance, which serves a dual role in this architecture: as a backend broker facilitating the distribution of jobs among workers and as a storage medium for the status of each job. By querying Redis, we can track the number of jobs processed and identify any failures. For jobs that failed or were not processed, targeted retries are initiated for the affected users.
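To make the chain concrete, here is a minimal sketch of what one such worker stage could look like. The S3 helpers, key layout, and handler signature below are hypothetical stand-ins for illustration, not our actual code.

```go
package main

import (
	"fmt"
	"log"
)

// fetchFromS3 and uploadToS3 are hypothetical stand-ins for whatever S3
// client a worker might use; they are not a real library API.
func fetchFromS3(key string) ([]byte, error) { return []byte("#set page(...)"), nil }

func uploadToS3(key string, body []byte) error { return nil }

// renderPDF stands in for the actual markup-to-PDF step.
func renderPDF(markup []byte) ([]byte, error) {
	return append([]byte("%PDF-1.7\n"), markup...), nil
}

// generatePDF sketches one stage of the chain: fetch the input file written
// by the previous worker, process it, and upload the output for the next
// worker (the signer) to pick up.
func generatePDF(userID string) error {
	markup, err := fetchFromS3("markup/" + userID + ".typ")
	if err != nil {
		return fmt.Errorf("fetch: %w", err)
	}
	pdf, err := renderPDF(markup)
	if err != nil {
		return fmt.Errorf("render: %w", err)
	}
	return uploadToS3("pdf/"+userID+".pdf", pdf)
}

func main() {
	if err := generatePDF("AB1234"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("PDF stage done; signer worker picks up next")
}
```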
## Generating PDFs

Initially, PDFs were generated from HTML using Puppeteer, which involved spawning headless instances of Chrome. This, of course, grew to be slow and quite resource-intensive as our volume grew.

After several experiments, including benchmarking PDF generation with complex layouts using native libraries in different programming languages, we had a breakthrough: using LaTeX instead of HTML and generating PDFs with `pdflatex`. By converting our HTML templates into TeX and using `pdflatex` for PDF generation, we observed a 10X increase in speed compared to the original Puppeteer approach. Moreover, `pdflatex` required significantly fewer resources, making it a much leaner solution for our PDF generation needs.

### Problems with LaTeX

We rewrote our PDF generation pipeline to use `pdflatex` and ran it in production for a couple of months. While we were happy with the significant performance gains from LaTeX, this process had its own challenges:

- **Memory constraints with pdflatex**: For some of our prolific users, the PDF contract note extends to thousands of pages, as mentioned earlier. Generating such large documents leads to significantly higher memory usage. `pdflatex` lacks support for dynamic memory allocation, and despite attempts to tweak its parameters, we continued to face memory limitations.
- **Switch to lualatex**: In search of a better solution, we transitioned to `lualatex`, another TeX-to-PDF converter known for its ability to dynamically allocate more memory. `lualatex` resolved the memory issue to a considerable extent for large reports. However, while rendering such large tables, it sometimes broke and produced indecipherable stack traces that were very challenging to understand and debug.
- **Docker image size**: We were using Docker images for these tools, but they were quite large thanks to the TeX libraries required by both `pdflatex` and `lualatex`, along with their various third-party packages. Even with caching layers, the large size of these images delayed instance startup on a fresh boot before our workers could run.

In our search for a better solution, we stumbled upon [Typst](https://typst.app/) and began to evaluate it as a potential replacement for LaTeX.

## Typst - a modern typesetting system

Typst is a single-binary application written in Rust that offers several advantages over LaTeX:

- **Ease of use**: Typst offers a more user-friendly developer experience compared to LaTeX, with simpler and more consistent styling and without the need for numerous third-party packages.
- **Error handling**: It provides better error messages, making it significantly easier to debug issues caused by bad input data.
- **Performance**: In our benchmarks, Typst performed 2 to 3 times faster than LaTeX when compiling small files to PDF. For larger documents, such as those with tables extending over thousands of pages, Typst dramatically outperforms `lualatex`. A 2000-page contract note takes approximately 1 minute to compile with Typst, in stark contrast to `lualatex`'s 18 minutes.
- **Reduced Docker image size**: Since Typst is a small, statically linked binary, the Docker image size reduced significantly compared to bundling `lualatex`/`pdflatex`, improving the startup time of our worker servers.
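One way to drive Typst from a Go worker is to shell out to its CLI (`typst compile input.typ output.pdf`). The sketch below illustrates that approach; the file names and timeout are chosen for the example and are not our production values.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os/exec"
	"time"
)

// compileTypst shells out to the Typst CLI to turn a .typ markup file into a
// PDF. A context with a timeout guards against pathological documents.
func compileTypst(ctx context.Context, in, out string) error {
	cmd := exec.CommandContext(ctx, "typst", "compile", in, out)
	if output, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("typst compile %s: %v: %s", in, err, output)
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	if err := compileTypst(ctx, "contract-note.typ", "contract-note.pdf"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("compiled contract-note.pdf")
}
```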
## Digitally signing PDFs

Regulations mandate that contract note PDFs be digitally signed and encrypted. At the time of writing, we have not found any performance-focused FOSS libraries that can batch-sign PDFs, thanks to the immense complexity of PDF digital signatures. We ended up writing a [small HTTP wrapper](https://github.com/zerodha/jpdfsigner) on top of the Java [OpenPDF](https://github.com/LibrePDF/OpenPDF) library, so that a single JVM boots once and then handles signing requests concurrently. We deploy this server as a sidecar alongside each of our signer workers.

## Generating and storing files at high throughput

The distributed contract note generation workflow processes client and transaction data, generating PDF files typically ranging from 100 KB to 200 KB per user, though the PDFs can run into megabytes for certain users. On an average day, this comes to some 7 million ephemeral files generated and accessed throughout the workflow (Typst markup files, PDFs, signed PDFs, etc.). Each job's execution is distributed across an arbitrary number of EC2 instances and requires access to temporary input data from the preceding process. Shared storage allows each process to write its output to a common storage area, enabling subsequent processes to retrieve these files and eliminating the need to transfer files back and forth between job workers over the network.

We evaluated AWS' EFS (Elastic File System) in [two different modes](https://docs.aws.amazon.com/efs/latest/ug/performance.html#performancemodes): General Purpose mode and Max I/O mode. Initially, our tests revealed limited throughput because our benchmark data was relatively small; without provisioned throughput, EFS imposes a throughput limit based on data size. Consequently, we adjusted our benchmark setup and set the provisioned throughput to 512 Mb/s.

Our benchmark involved concurrently reading and writing 10,000 files, each sized between 100 KB and 200 KB.

- In General Purpose mode, we hit the EFS file-operation limit (35k ops/sec, with reads counting as 1 and writes as 5) and saw latencies that resulted in 4-5 seconds to write these files to EFS.
- Performance deteriorated further in Max I/O mode, taking 17-18 seconds due to increased latency. We dismissed Max I/O mode from our considerations because of this.

Given the large number of small files, EFS seemed wholly unsuitable for our purpose. For comparison, performing the same task on EBS took approximately 400 milliseconds.

We revised our benchmark setup and experimented with storing the files on S3, which took around 4-5 seconds for a similar number of files. We also considered the cost differences between EFS and S3: with 1 TB of storage and 512 Mb/s provisioned throughput, S3's pricing was significantly lower. Consequently, we opted to store our files on S3 rather than EFS, given its cost-effectiveness and the operational limitations of EFS.

We also consulted the AWS Storage team, who recommended exploring FSx as an alternative. FSx offers various file storage solutions, particularly [FSx for Lustre](https://aws.amazon.com/fsx/lustre/), which is commonly used in HPC environments. However, FSx was complicated to set up and unavailable in the ap-south-1 region at the time of our experimentation, and our operations are restricted to this region, so we opted for S3 for its ease of management.

We rewrote our storage interface to use S3 (using the zero-dependency, lightweight [simples3](https://github.com/rhnvrm/simples3) library we had developed in the past), but hit another challenge this time: S3 rate limits. S3's distributed architecture imposes request rate limits to ensure fair resource distribution among users. The limits are:

- PUT/COPY/POST/DELETE requests: up to 3,500 requests per second per prefix.
- GET/HEAD requests: up to 5,500 requests per second per prefix.

When these limits are exceeded, S3 returns 503 Slow Down errors. While these errors can be retried, the sporadic and bursty nature of our workload meant that we frequently hit the rate limits even with a retry strategy of 10 attempts. In a trial run, we processed approximately 1.61 million requests within a 5-minute span, averaging around 5.3k requests per second, with an error rate of about 0.13%. According to the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html), the bucket can be organized using unique prefixes to distribute the load.
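For illustration, a retry loop with exponential backoff for 503 Slow Down responses might look like the sketch below. The `put` function, attempt count, and backoff values are hypothetical and not taken from our codebase.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"math/rand"
	"time"
)

// errSlowDown stands in for S3's 503 Slow Down response.
var errSlowDown = errors.New("503 SlowDown")

// put is a hypothetical upload call that sometimes gets throttled.
func put(key string) error {
	if rand.Intn(3) == 0 {
		return errSlowDown
	}
	return nil
}

// putWithRetry retries throttled uploads with exponential backoff and jitter,
// similar in spirit to the 10-attempt retry strategy described above.
func putWithRetry(key string, attempts int) error {
	backoff := 100 * time.Millisecond
	for i := 0; i < attempts; i++ {
		err := put(key)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errSlowDown) {
			return err // not a throttle; don't retry
		}
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
	return fmt.Errorf("giving up on %s after %d attempts", key, attempts)
}

func main() {
	if err := putWithRetry("notes/AB1234/contract.pdf", 10); err != nil {
		log.Fatal(err)
	}
	fmt.Println("uploaded")
}
```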
Initially, for each customer's contract note, we generated a unique [ksuid](https://github.com/segmentio/ksuid). These are not only sortable but also share a common prefix. E.g.:

```
bucket/2CTgQKodxGCCXxXQ2XlNyrVSFIV
bucket/2CTgQKtyZ2O9NGvt1gSo75h5N9V/
bucket/2CTgQKwBQasuWHaz1QnOIWDNrtc/
bucket/2CTgQKyEGGhk006bgTXErNyu0NE/
```
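As a quick illustration of why this happens, KSUIDs generated close together in time sort together and therefore share a leading prefix, so all such objects fall under one S3 prefix and its rate limit. The snippet below uses the segmentio/ksuid library to generate keys; the `bucket/` layout is an assumption for the example, not our actual key scheme.

```go
package main

import (
	"fmt"

	"github.com/segmentio/ksuid"
)

func main() {
	// KSUIDs created around the same moment share their leading characters,
	// which concentrates all of these keys on a single S3 prefix.
	for i := 0; i < 4; i++ {
		fmt.Printf("bucket/%s/\n", ksuid.New().String())
	}
}
```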