Extracting data, especially tables, from complex PDFs used to be a challenge. With the launch of LlamaParse by LlamaIndex, that period is over.
Originally published on LinkedIn.
Note for developers doing the conversion themselves with Python/JS scripts:
- Calling the API directly is faster than using the Python package.
- Chunking the file before parsing speeds up the conversion.
- Currently, around 50 pages seems to be the optimal chunk size.
- Parsing is faster in 50-page chunks than with the full file at once, even for, say, a 100-page report.
- I tested chunk sizes from 25 to 100 pages; chunks both smaller and larger than 50 pages increased the conversion time.
- However, all of this can change rapidly, as LlamaParse is evolving quickly. For example, just a few days ago they increased the file size limit from 200 to 700 pages.
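
The chunking step described in the notes above can be sketched roughly as follows. This is a minimal illustration, not the script used for the timings: the helper names `chunk_ranges` and `split_pdf` are hypothetical, and the use of the `pypdf` library for splitting is an assumption (the original post does not say which tool was used). The 50-page default mirrors the chunk size reported above and may need retuning as LlamaParse changes. Sending each chunk to LlamaParse is deliberately left out.

```python
from pathlib import Path


def chunk_ranges(total_pages: int, chunk_size: int = 50) -> list[tuple[int, int]]:
    """Return half-open (start, end) page-index ranges that cover the document."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src: str, out_dir: str, chunk_size: int = 50) -> list[Path]:
    """Write one PDF per chunk and return the paths (hypothetical helper).

    Assumption: pypdf is installed. It is imported lazily so that
    chunk_ranges() stays usable without it.
    """
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    paths = []
    for start, end in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. report_p1-50.pdf
        path = target / f"{Path(src).stem}_p{start + 1}-{end}.pdf"
        with open(path, "wb") as handle:
            writer.write(handle)
        paths.append(path)
    return paths
```

For a 100-page report, `chunk_ranges(100)` yields two ranges of 50 pages each, so `split_pdf` would write two files that can then be parsed separately and the results concatenated in page order.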