junkyfert.blogg.se - Pdfwriter module python pypi

Patrick Maupin created a package he called pdfrw and released it back in 2012. The pdfrw package is a pure-Python library that you can use to read and write PDF files. At the time of writing, pdfrw was at version 0.4. With that version, it supports subsetting, merging, rotating and modifying data in PDFs. The pdfrw package has been used by the rst2pdf package (see chapter 18) since 2010 because pdfrw can "faithfully reproduce vector formats without rasterization". You can also use pdfrw in conjunction with ReportLab to re-use portions of existing PDFs in new PDFs that you create with ReportLab.

In this article, we will learn how to do the following:

Extract certain types of information from a PDF.

Combine the use of pdfrw and ReportLab.

Note: This article is based on my book, ReportLab: PDF Processing with Python. Code can be found on GitHub.

As you might expect, you can install pdfrw using pip. Let's get that done so we can start using pdfrw: python -m pip install pdfrw

Now that we have pdfrw installed, let's learn how to extract some information from our PDFs. The pdfrw package does not extract data in quite the same way that PyPDF2 does. If you have used PyPDF2 in the past, then you may recall that PyPDF2 lets you extract a document information object that you can use to pull out information like author, title, etc. While pdfrw does let you get the Info object, it displays it in a less friendly way.

Note: I am using the standard W9 form from the IRS for this example.

If you run the extraction against the reportlab-sample.pdf file that I also included in the source code for this article, you will find that the author name that is returned ends up being '' instead of "Michael Driscoll". I haven't figured out exactly why that is, but I am assuming that PyPDF2 does some extra data massaging on the PDF trailer information that pdfrw currently does not do.

You can also use pdfrw to split a PDF up. For example, maybe you want to take the cover off of a book for some reason or you just want to extract the chapters of a book into multiple PDFs instead of storing them in one file. For this example, we will use my ReportLab book's sample chapter PDF that you can download on Leanpub. The script, splitter.py, defines split(path, number_of_pages, output) and ends by calling split('reportlab-sample.pdf', 10, 'subset.pdf').

Here we create a function called split that takes an input PDF file path, the number of pages that you want to extract and the output path. Then we open up the file using pdfrw's PdfReader class and grab the total number of pages from the input PDF. Then we create a PdfWriter object and loop over the range of pages that we passed in. In each iteration, we attempt to extract a page from the input PDF and add that page to our writer object. Finally we write the extracted pages to disk.

The pdfrw package makes merging multiple PDFs together very easy. Let's write up a simple example that demonstrates how to do it. The script, concatenator.py, starts with from pdfrw import PdfReader, PdfWriter, IndirectPdfDict. In this example, we create a function called concatenate that accepts a list of paths to PDFs that we want to concatenate together and the output path. Then we iterate over those paths, open each file and add all of its pages to the writer object via the writer's addpages method. Just for fun, we also import IndirectPdfDict, which allows us to add some trailer information to our PDF. In this case, we add the title, author, subject and creator script information to the PDF. Then we write out the concatenated PDF to disk.

The pdfrw package also supports rotating the pages of a PDF. So if you happen to have a PDF that was saved in a weird way or an intern that scanned in some documents upside down, then you can use pdfrw (or PyPDF2) to fix the PDFs. Note that in pdfrw you must rotate clockwise in increments that are divisible by 90 degrees.

For this example, I created a function in rotator.py that will extract all the odd pages from the input PDF and rotate them 90 degrees. It is called as rotate_odd('reportlab-sample.pdf', 'rotate_odd.pdf'). Here we just open up the target PDF and create a writer object. Then we grab all the pages and iterate over them. If the page is an odd numbered page, we rotate it and then add that page to our writer object. This code ran pretty fast on my machine and the output is what you would expect.

You can use pdfrw to watermark your PDF with some kind of information. For example, you might want to watermark a PDF with your buyer's email address or with your logo. You can also overlay one PDF on top of another PDF. We will actually use the overlay technique for filling in PDF forms in chapter 17.
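As a rough illustration of the Info lookup discussed in this article, the sketch below opens a PDF with pdfrw's PdfReader and returns the raw Info entries together with the page count. The helper names summarize_info and get_pdf_info are made up for this example, and the pdfrw import is placed inside the function only so that the plain-Python helper can be tried without pdfrw installed.

```python
def summarize_info(info):
    """Turn a dict-like Info object into a plain {str: str} dict.

    pdfrw gives back None when a PDF has no Info entry, so that
    case is mapped to an empty dict.
    """
    if info is None:
        return {}
    return {str(key): str(value) for key, value in info.items()}


def get_pdf_info(path):
    """Return (info_dict, page_count) for the PDF at `path`."""
    # Third-party dependency: python -m pip install pdfrw
    from pdfrw import PdfReader

    pdf = PdfReader(path)
    # Keys are PDF names such as '/Author'; values are left exactly
    # as pdfrw returns them, with no extra decoding applied.
    return summarize_info(pdf.Info), len(pdf.pages)
```

Nothing here post-processes the values, which is one way to see the "less friendly" form that pdfrw hands back compared with PyPDF2.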
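The split walkthrough in this article can be sketched roughly as follows. This is a reconstruction under assumptions, not the original splitter.py listing: the helper pages_to_extract is hypothetical and simply clamps the requested count to the number of pages that exist, and pdfrw is imported inside split so the helper can be tried on its own.

```python
def pages_to_extract(number_of_pages, total_pages):
    """Zero-based page indexes to copy, never more than the input has."""
    return range(max(0, min(number_of_pages, total_pages)))


def split(path, number_of_pages, output):
    """Copy the first `number_of_pages` pages of `path` into `output`."""
    # Third-party dependency: python -m pip install pdfrw
    from pdfrw import PdfReader, PdfWriter

    reader = PdfReader(path)
    total_pages = len(reader.pages)

    writer = PdfWriter()
    for index in pages_to_extract(number_of_pages, total_pages):
        writer.addpage(reader.pages[index])

    # Write the extracted pages to disk
    writer.write(output)
```

A call such as split('reportlab-sample.pdf', 10, 'subset.pdf') mirrors the invocation mentioned in the article; asking for more pages than the input holds just copies everything.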
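The concatenation steps described in this article might look something like the sketch below. It is an approximation rather than the original concatenator.py: make_trailer_info is a hypothetical helper, the metadata values are left to the caller, and the pdfrw import lives inside concatenate so the helper works without pdfrw installed.

```python
def make_trailer_info(title=None, author=None, subject=None, creator=None):
    """Collect the non-empty metadata fields for the PDF trailer."""
    fields = {
        "Title": title,
        "Author": author,
        "Subject": subject,
        "Creator": creator,
    }
    return {key: value for key, value in fields.items() if value}


def concatenate(paths, output, info=None):
    """Append every page of each PDF in `paths` and write `output`."""
    # Third-party dependency: python -m pip install pdfrw
    from pdfrw import PdfReader, PdfWriter, IndirectPdfDict

    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(path)
        # addpages takes the whole page list in one call
        writer.addpages(reader.pages)

    if info:
        # Attach title/author/subject/creator to the trailer
        writer.trailer.Info = IndirectPdfDict(**info)

    writer.write(output)
```

Passing info=make_trailer_info(title='Combined PDF', author='Michael Driscoll') would set only those two fields; the title string here is an invented placeholder.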
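For the rotation example described in this article, a minimal sketch could look like this. Two assumptions are worth flagging: "odd" is taken to mean odd zero-based indexes (the 2nd, 4th, ... pages), and any rotation a page already has is added to rather than overwritten. The helpers odd_indexes and new_rotation are invented for this illustration.

```python
def odd_indexes(total_pages):
    """Odd zero-based indexes, i.e. the 2nd, 4th, ... pages (assumption)."""
    return list(range(1, total_pages, 2))


def new_rotation(current, degrees):
    """Add `degrees` to an existing rotation, which may be None or a string."""
    if degrees % 90:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return (int(current or 0) + degrees) % 360


def rotate_odd(path, output, degrees=90):
    """Write only the odd-indexed pages of `path`, rotated clockwise."""
    # Third-party dependency: python -m pip install pdfrw
    from pdfrw import PdfReader, PdfWriter

    reader = PdfReader(path)
    writer = PdfWriter()
    pages = reader.pages

    for index in odd_indexes(len(pages)):
        page = pages[index]
        # A page can inherit /Rotate from its parent, so read it from there
        page.Rotate = new_rotation(page.inheritable.Rotate, degrees)
        writer.addpage(page)

    writer.write(output)
```

Rejecting values that are not multiples of 90 up front keeps the restriction mentioned in the article visible in the code instead of leaving it to the PDF viewer.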