Performance Optimization
This section provides guidance and best practices for optimizing performance when you use File Content Extraction.
API Usage
Initializing and shutting down a Filter session takes time. For best performance, initialize a session once before you process any files, and shut it down only when you have finished processing files on that thread.
You can perform filter operations on two or more files simultaneously by calling the Filter SDK from two or more threads simultaneously. In this case, you must provide each thread with a dedicated Filter session, because the individual sessions are not thread-safe. If you use multiple threads, be aware that the total memory that File Content Extraction uses scales linearly with the number of threads.
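The session-per-thread pattern described above can be sketched as follows. This is a minimal illustration only: `HypotheticalSession` and its `extract_text` and `shutdown` methods are stand-ins invented for this example, not the Filter SDK API. Substitute the real session type and calls from your SDK binding.

```python
import queue
import threading


class HypotheticalSession:
    """Stand-in for a Filter session. NOT the real SDK class."""

    created = 0
    closed = 0
    _lock = threading.Lock()

    def __init__(self):
        # Represents the expensive initialization step.
        with HypotheticalSession._lock:
            HypotheticalSession.created += 1

    def extract_text(self, path):
        # Placeholder for a real filter operation.
        return f"text of {path}"

    def shutdown(self):
        with HypotheticalSession._lock:
            HypotheticalSession.closed += 1


def worker(tasks, results):
    # One dedicated session per thread, created once before any file.
    session = HypotheticalSession()
    try:
        while True:
            try:
                path = tasks.get_nowait()
            except queue.Empty:
                break
            results.put((path, session.extract_text(path)))
    finally:
        # Shut down only after this thread has finished all its files.
        session.shutdown()


def run(paths, workers=4):
    tasks, results = queue.Queue(), queue.Queue()
    for path in paths:
        tasks.put(path)
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    output = {}
    while not results.empty():
        path, text = results.get()
        output[path] = text
    return output
```

Each worker pays the initialization cost once, however many files it processes, and no session object is ever shared between threads.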
If you provide a custom input stream to File Content Extraction, be aware that File Content Extraction might use the seek function to move backwards and forwards through the input file. If your implementation is slow to seek and read from another location, this can severely impact performance.
You can read filtered text in chunks by using Document.text.
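If the underlying source is slow to seek, one possible mitigation is to copy it once, sequentially, into a local buffer that seeks quickly, and pass that buffer on instead. The sketch below uses only the Python standard library; `make_fast_seekable` and its `max_memory_bytes` parameter are names chosen for this example. How you pass the resulting stream to the SDK depends on your API binding and is not shown.

```python
import io
import shutil
import tempfile


def make_fast_seekable(slow_stream, max_memory_bytes=64 * 1024 * 1024):
    """Copy a slow stream into a buffer that supports fast seeks.

    Data stays in memory up to max_memory_bytes, then spills to a
    temporary file on local disk.
    """
    spooled = tempfile.SpooledTemporaryFile(max_size=max_memory_bytes)
    # One sequential pass over the slow source.
    shutil.copyfileobj(slow_stream, spooled)
    spooled.seek(0)
    return spooled


# Usage: io.BytesIO stands in for the slow source here.
source = io.BytesIO(b"header" + b"x" * 1000 + b"trailer")
fast = make_fast_seekable(source)
fast.seek(-7, io.SEEK_END)   # backward and forward seeks are now cheap
tail = fast.read()
fast.seek(0)
head = fast.read(6)
```

The trade-off is one full read of the input up front, plus memory or temporary disk space for the copy, so this helps only when seeks on the original source are expensive.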
Filter SDK Configuration
You can speed up the filtering of image formats by switching OCR off. If you enable OCR, there are various options you can configure to optimize OCR performance. See Optimize OCR Performance.
NOTE: This suggestion applies only if you have OCR enabled in your license.
If you do not need to extract images, turning off the Extract Images option can improve extraction performance.
The legacy pipe-streaming method (see Configure Legacy Pipe-Streaming) is slower. For best performance, do not enable it.
Some PDFs contain very small images that are time-consuming to extract. If this is an issue for you, see Improve Performance for PDFs with Many Small Images.
You can set a timeout for filtering and extraction to abort processing for any file that takes an abnormally long time. Setting this can limit the time taken on slow files. However, using a very low timeout can adversely impact performance, because each time a timeout occurs, the session is automatically reinitialized.
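The timeout trade-off can be illustrated with a rough cost model. This is a simplified sketch, not a measurement: the per-file times and the reinitialization cost below are hypothetical numbers, and the model assumes that a file that times out costs the full timeout plus one session reinitialization.

```python
def estimate(file_times, timeout, reinit_cost):
    """Return (total_seconds, files_timed_out) under a simple model."""
    total = 0.0
    timed_out = 0
    for seconds in file_times:
        if seconds > timeout:
            # File is abandoned at the timeout, then the session restarts.
            total += timeout + reinit_cost
            timed_out += 1
        else:
            total += seconds
    return total, timed_out


# Hypothetical workload: ten 3-second files and one 120-second file.
times = [3.0] * 10 + [120.0]

generous = estimate(times, timeout=30.0, reinit_cost=5.0)
too_low = estimate(times, timeout=2.0, reinit_cost=5.0)
# generous == (65.0, 1): only the abnormally slow file is cut off.
# too_low  == (77.0, 11): every file times out, and the run is slower.
```

Under these assumed numbers, the very low timeout both takes longer overall and produces no output, because every file triggers a reinitialization.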
File Content Extraction Environment
File Content Extraction creates and reads temporary files as part of its normal operation. The speed of the disk storing those temporary files can impact performance.
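To compare candidate locations for temporary files, a rough write-throughput check such as the following may help. It uses only the Python standard library; the function name and the default size are choices made for this example. Pass the directory that your deployment actually uses for temporary files, or leave it unset to test the system default.

```python
import os
import tempfile
import time


def temp_write_mb_per_s(directory=None, size_mb=8):
    """Write size_mb of data to a temporary file and return MB/s."""
    chunk = b"\0" * (1024 * 1024)
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(size_mb):
            handle.write(chunk)
        handle.flush()
        # Force the data to the storage device, not just the OS cache.
        os.fsync(handle.fileno())
    elapsed = time.perf_counter() - start
    return size_mb / max(elapsed, 1e-9)


rate = temp_write_mb_per_s(size_mb=1)
```

A single run is noisy, so treat the result as a coarse comparison between disks rather than a benchmark.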
On Microsoft Windows platforms, anti-malware software often scans temporary files as they are written, which slows down the application that writes them (including File Content Extraction). This behavior is another reason to stick with the default interprocess communication mechanism, which reduces the number of temporary files created.
As with any software, if you provide insufficient memory, it can cause performance to drop significantly as data is swapped out to disk. File Content Extraction typically uses little memory when called from a single thread, but the memory usage adds up when you use multiple threads.
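Because memory use grows with the number of threads, one way to choose a thread count is to cap it by both the CPU count and a memory budget. The sketch below is illustrative: the per-session memory figure is an assumption that you would need to measure for your own documents and configuration.

```python
def max_threads_for_memory(budget_mb, per_session_mb, cpu_count):
    """Largest thread count that fits the memory budget, at least 1."""
    if per_session_mb <= 0:
        raise ValueError("per_session_mb must be positive")
    by_memory = int(budget_mb // per_session_mb)
    return max(1, min(cpu_count, by_memory))


# Hypothetical figures: 4096 MB budget, 512 MB per session.
threads_16_cpu = max_threads_for_memory(4096, 512, cpu_count=16)  # 8
threads_4_cpu = max_threads_for_memory(4096, 512, cpu_count=4)    # 4
```

With these example figures, memory is the limit on the 16-core machine and the CPU count is the limit on the 4-core machine.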