There are several strategies for handling large jobs or reducing the overall memory consumption of a document conversion. Which strategy works best depends heavily on your use case and infrastructure. In this article, we will look at the most common strategies; you will have to decide which is most appropriate for your case.
TABLE OF CONTENTS
- Increasing Memory
- Reducing Concurrent Conversions
- Save Memory Mode
- Segmentation
- Fast Tables
- Asynchronous Conversions
- Streaming
The following list gives you an overview of the different strategies you can use:
- Increasing Memory: If your server has enough memory, increasing the amount of memory available to PDFreactor is beneficial for almost all use cases
- Reducing Concurrent Conversions: Useful if you convert few but large documents
- Save Memory Mode: Useful if your document uses many images
- Segmentation: Useful for documents consisting of 1000+ pages
- Fast Tables: Useful for documents with large uniform data tables
- Asynchronous Conversions: Useful to avoid network timeout issues for long conversions (only applies to PDFreactor Web Service)
- Streaming: Useful for large output data
Increasing Memory

To convert very large or complex documents, you need to make sure that PDFreactor has sufficient memory available. To increase the memory of Java applications (such as PDFreactor), you can specify an appropriate "-Xmx" argument, such as "-Xmx4g" for 4GB of memory. To configure the memory when using the PDFreactor Web Service or the Docker image, please refer to their respective documentation.
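For example, a standalone Java application embedding PDFreactor could be started with an increased maximum heap like this (the jar name is a placeholder for your own integration):

```shell
# Start a Java application with a 4 GB maximum heap ("-Xmx4g").
# "my-pdfreactor-app.jar" is a placeholder for your PDFreactor integration.
java -Xmx4g -jar my-pdfreactor-app.jar
```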
One symptom of insufficient memory is log entries or errors that read like this: "OutOfMemoryError". When such errors occur, you should consider increasing the amount of memory available to PDFreactor. Alternatively, you could limit the number of concurrent conversions so that fewer conversions share the available memory.
Reducing Concurrent Conversions
Each PDFreactor CPU license only allows a certain number of conversions to be processed at the same time. Additional conversions are queued until a slot frees up. This means that under high load, PDFreactor may process multiple conversions in parallel, with each conversion consuming memory separately. To reduce memory consumption, you can limit the number of concurrent conversions. The following examples set the maximum number of concurrent conversions to 2.
PDFreactor Web Service
This is most relevant when using the PDFreactor Web Service, since it automatically works with multiple threads. To adjust the number of concurrent conversions, use the "threadPoolSize" server parameter. This can e.g. be configured by adding the following line to the "PDFreactor/start.d/main.ini" file:
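A plausible form of that line is shown below, assuming the parameter is passed as a Java system property (a common convention for Jetty-style "start.d" files); please verify the exact syntax against the PDFreactor Web Service documentation:

```ini
# Limit PDFreactor to 2 concurrent conversions
# (assumed system-property syntax; confirm in the PDFreactor Web Service docs)
-DthreadPoolSize=2
```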
Docker Image

When using the Docker image, the parameter can be set by placing a file mapped to "/ro/config/pdfreactorwebservice.config" with the following content:
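A sketch of such a file is shown below; the key/value syntax is an assumption based on the "threadPoolSize" parameter name, so please confirm the exact format in the PDFreactor Docker image documentation:

```
threadPoolSize=2
```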
Save Memory Mode
PDFreactor automatically caches images used in the document in-memory. This enhances performance, as images potentially need to be accessed multiple times during the conversion. However, in documents that use a large number of images, this can lead to high memory requirements. To disable the image cache, you can enable "saveMemoryMode" in the API. Please refer to the PDFreactor manual here on how to enable this mode.
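As a sketch, enabling the mode from the Java API might look like the following; the setter name is assumed from the "saveMemoryMode" property above, and the document URL is a placeholder, so verify both against the manual:

```java
import com.realobjects.pdfreactor.Configuration;
import com.realobjects.pdfreactor.PDFreactor;
import com.realobjects.pdfreactor.Result;

public class SaveMemoryExample {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        config.setDocument("https://example.com/image-heavy.html"); // placeholder URL
        config.setSaveMemoryMode(true); // assumed setter: disables the in-memory image cache

        PDFreactor pdfReactor = new PDFreactor();
        Result result = pdfReactor.convert(config);
        byte[] pdf = result.getDocument();
    }
}
```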
Segmentation

Segmentation allows PDFreactor to internally split conversions into multiple parts, drastically reducing the amount of memory required for large documents. This imposes a couple of limitations on the document, and the benefits are usually only noticeable if the document has 1000+ pages. More information about segmentation and how to enable it can be found in the PDFreactor manual here.
Fast Tables

The Fast Tables feature is useful for documents containing large uniform data tables with thousands of rows. Declaring tables as "Fast Tables" simplifies their layout process: only minimal styling is applied, but performance is significantly better and memory requirements are lower. Please refer to the PDFreactor manual here for more information and a list of restrictions.
Asynchronous Conversions

This strategy only applies to the PDFreactor Web Service. The PDFreactor Web Service can convert documents asynchronously, meaning that the client is not required to keep an HTTP connection open to the server until the conversion is finished. While this is usually negligible when converting small documents, synchronous conversions can be very detrimental to the user experience when converting large or complex documents, since HTTP connections that are open for a long time are prone to timeouts of clients or proxy servers beyond PDFreactor's control. Asynchronous conversions do not have this issue, since their connections are only open for a very short amount of time. For more information on asynchronous conversions, please refer to the PDFreactor manual here.
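The typical asynchronous pattern (start the conversion, poll for progress, then fetch the result) could be sketched with the Java Web Service client as follows; the method and field names are approximations of the client API and the service URL is a placeholder, so check the API documentation of your integration language:

```java
import com.realobjects.pdfreactor.webservice.client.Configuration;
import com.realobjects.pdfreactor.webservice.client.PDFreactor;
import com.realobjects.pdfreactor.webservice.client.Progress;
import com.realobjects.pdfreactor.webservice.client.Result;

public class AsyncExample {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL; use your own PDFreactor Web Service endpoint.
        PDFreactor pdfReactor = new PDFreactor("http://localhost:9423/service/rest");
        Configuration config = new Configuration();
        config.setDocument("https://example.com/large-report.html"); // placeholder URL

        // Start the conversion; the server responds immediately with a document ID,
        // so no long-lived HTTP connection is needed.
        String documentId = pdfReactor.convertAsync(config);

        // Poll in short requests until the conversion has finished.
        Progress progress;
        do {
            Thread.sleep(1000);
            progress = pdfReactor.getProgress(documentId);
        } while (!progress.finished); // assumed field name

        // Retrieve the finished PDF.
        Result result = pdfReactor.getDocument(documentId);
        byte[] pdf = result.document; // assumed field name
    }
}
```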
Streaming

When the PDF result is very large, it is more efficient to stream the result data instead of caching it in-memory first.
PDFreactor Java Library
Use a "convert" method that takes an "OutputStream" as a parameter.
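A minimal sketch of this with the Java library is shown below; the document URL and output file name are placeholders, and the exact "convert" overload should be verified against the API documentation:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import com.realobjects.pdfreactor.Configuration;
import com.realobjects.pdfreactor.PDFreactor;

public class StreamingExample {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        config.setDocument("https://example.com/large-document.html"); // placeholder URL

        PDFreactor pdfReactor = new PDFreactor();
        // Stream the PDF directly to a file instead of buffering it in memory.
        try (OutputStream out = new FileOutputStream("result.pdf")) {
            pdfReactor.convert(config, out);
        }
    }
}
```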
PDFreactor Web Service
Use the "convertAsBinary" (sync) or "getDocumentAsBinary" (async) methods. When these are used, the PDFreactor Web Service streams the resulting bytes directly to the client. On the client, you can also optionally pass a stream (or your language's equivalent) to these methods, so that the result bytes continue to be streamed rather than buffered. Please refer to the API documentation of your respective integration language.
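With the Java Web Service client, this might look like the following sketch; the method signature taking an "OutputStream" is an assumption based on the description above, and the service and document URLs are placeholders:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import com.realobjects.pdfreactor.webservice.client.Configuration;
import com.realobjects.pdfreactor.webservice.client.PDFreactor;

public class BinaryStreamingExample {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL; use your own PDFreactor Web Service endpoint.
        PDFreactor pdfReactor = new PDFreactor("http://localhost:9423/service/rest");
        Configuration config = new Configuration();
        config.setDocument("https://example.com/large-document.html"); // placeholder URL

        // The service streams the PDF bytes directly into the provided stream,
        // so the full result is never held in client memory.
        try (OutputStream out = new FileOutputStream("result.pdf")) {
            pdfReactor.convertAsBinary(config, out);
        }
    }
}
```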
You now know several strategies designed to accommodate special conversion scenarios, either by increasing PDFreactor's available system resources or by optimizing PDFreactor's usage of existing system resources.