Wednesday, July 2, 2014

2014-07-02 An ode to the "Margin Police," or how I learned to love LaTeX margins

To the great Margin Police:

"You lay down rules for all that approach you,

One and a half on the left-hand edge,

One on all the other edges,

Page numbers one half down from the top.

These are your words.

And we are grateful for your guidance and direction.

Lo, you lead us in the ways of professionalism and consistency.

We, the unwashed, are grateful."

But I have one question:

Why doesn't the LaTeX style file help me achieve these goals??

And so the exploration begins.

Sometimes we use LaTeX to write and submit papers and reports for publication.  Often the publishers provide a style file for us to use that dictates things like margins, number of columns per page, headers, footers, and other formatting directives.  Other times, guidance comes from "instructions to authors" and we are expected to meet those requirements ourselves.  What follows is how to see what the current margins are, how to set the margins, and how to check whether your document stays within them.  (LaTeX has environments that "float" and will sometimes ignore the margins.)  Hold on while we wander through the great and beautiful world of LaTeX margins.

LaTeX "thinks" of sheets of paper as a collection of "boxes."  It fills the boxes with text and whatnot.  At first glance, the location and description of these boxes is arcane and really without much apparent rhyme or reason.  What it comes down to is that a box is defined to start relative to where other boxes end, and each box has a height and width.  Defining locations like this allows an entire set of boxes to be moved by changing the reference starting point for the beginning box.

[Figure: A sample layout result page.]
You can see what LaTeX thinks the current box settings are by including the package "layout" and then, inside your document, executing the \layout command.

The \layout command will inject a new page showing the boxes on the page and their dimensions.  The dimensions are expressed in points (1 inch = 72.27 TeX points).  Because the \layout command injects a page into your document, you won't want to use the command in your final document.

Once you have your arms (sort of) around the idea of a layout, the next question is: how do I affect the layout?  One way is to set the various lengths that LaTeX uses by executing \setlength commands (see the definition of the command MyPageSetup).  The values in MyPageSetup result in margins of 1.5 inches on the left and 1.0 inch on the other three edges of US letter paper, as our "Margin Police" example dictated.  (Changing the \textwidth value to 4.0in will result in a very wide right margin because the text box is now much narrower.)
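
The arithmetic behind such values can be sketched in a few lines of Python.  This is not the definition of MyPageSetup; the names page_lengths and as_setlength and all of the defaults are hypothetical.  The sketch assumes US letter paper, \hoffset and \voffset left at zero (so LaTeX adds its built-in one inch to \oddsidemargin and \topmargin), and a header of zero height, so it covers only the main text box and not the page number placement.

```python
# A sketch only: derive LaTeX page lengths (in inches) from desired margins.
# Assumes \hoffset = \voffset = 0, so LaTeX adds its built-in 1in offset
# to \oddsidemargin and \topmargin.

def page_lengths(paper_w=8.5, paper_h=11.0,
                 left=1.5, right=1.0, top=1.0, bottom=1.0,
                 headheight=0.0, headsep=0.0):
    """Return a dict mapping LaTeX length names to values in inches."""
    side = left - 1.0  # remove the built-in 1in offset
    return {
        "oddsidemargin": side,
        "evensidemargin": side,  # keep the same left margin on even pages
        "textwidth": paper_w - left - right,
        "topmargin": top - 1.0 - headheight - headsep,
        "headheight": headheight,
        "headsep": headsep,
        "textheight": paper_h - top - bottom,
    }


def as_setlength(lengths):
    """Format each length as a \\setlength line with two decimals."""
    return ["\\setlength{\\" + name + "}{" + format(value, ".2f") + "in}"
            for name, value in lengths.items()]


if __name__ == "__main__":
    for line in as_setlength(page_lengths()):
        print(line)
```

With the defaults, the text box comes out 6.0 inches wide and 9.0 inches tall, which is what is left of an 8.5 by 11 inch sheet after the margins are taken away.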

One of the "flies in the ointment" with the above approach is that not everything obeys LaTeX margins and stays within its "box."  Some specific examples are figure and table environments that "float" on the page.  So, now that we have told the regular text where its boxes are, and we assume that LaTeX will honor those boxes, how do we identify those times when the "floating" things don't honor the margins?  Conceptually the answer is fairly simple: put a template that matches the margins on all the pages and then tell us which pages have things that are outside the margins.  Simply stated; not so simply answered.

Beware, gory details ahead!!  There is a make file (Listing 1), a LaTeX document (Listing 2), and an image analysis report (Listing 3).

Because I like make, the make file has a couple of targets that taken together answer the question: which page has something that violates the margins??

First the target: margins.  Here we:

1.  Define, remove and create a temporary directory.

[Figure: The redacting mask.]
2.  Copy the PDF document that we want to check into the temporary directory.

3.  In the temporary directory, we use pdftk to split the large PDF into a collection of small PDFs with one page per file.  (Several other commands will do the same job; I happened to choose pdftk.)

4.  Create a couple of skyblue redacting images (the images are sized to match page numbers, and the main body of text).

5.  For every page in the PDF (from step 3 above), overlay on that page the two redacting images (from step 4 above), and create a new file with the word "redacted" in the file name.

6.  It is always nice to tell the user that something is happening.

7.  Gather up all the redacted pages into a single large PDF.

At this point, we have two files of interest: the original large PDF, and a second PDF where all the text should be redacted.  If the files are small (10 pages or fewer), it is a simple matter to manually flip through the redacted file and see if there is anything that hasn't been redacted.  If the redacted file is large (100 pages or more), manually flipping pages is troublesome.

Now the target: checkColors.  Here we:

1.  Return to the temporary directory where we created the redacted pages.

2.  Define a "magic" number (to be explained later).
[Figure: A redacted page with offending text.]

3.  Define a "threshold" number (can be tweaked as desired).

4.  For all the redacted PDF files, figure out how many pixels are skyblue, how many are white, and compare that number with the threshold.  If the number of pixels that are neither skyblue nor white exceeds the threshold then alert the user and record that offending data.
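
The comparison in step 4 can be sketched in Python.  This is an illustration of the logic only, not the make file's recipe: the function names, the threshold of 100, and the per-page color counts are all hypothetical, and the sketch assumes the color counts for each page have already been gathered into a dictionary.

```python
# A sketch only: flag a redacted page when too many pixels are neither
# the mask color (skyblue) nor the paper color (white).

def exposed_pixels(histogram, total_pixels):
    """histogram maps a color name to its pixel count for one page."""
    covered = histogram.get("skyblue", 0) + histogram.get("white", 0)
    return total_pixels - covered


def violates_margins(histogram, total_pixels, threshold):
    """True when the exposed pixel count exceeds the threshold."""
    return exposed_pixels(histogram, total_pixels) > threshold


if __name__ == "__main__":
    TOTAL = 484704    # pixels in one page image (the "magic" number)
    THRESHOLD = 100   # hypothetical value; tweak as desired
    # Hypothetical per-page color counts, not real measurements.
    pages = {
        1: {"skyblue": 300000, "white": 184704},
        2: {"skyblue": 300000, "white": 184000, "black": 704},
    }
    for number, histogram in pages.items():
        if violates_margins(histogram, TOTAL, THRESHOLD):
            count = exposed_pixels(histogram, TOTAL)
            print(f"page {number}: {count} exposed pixels")
```

In this made-up data only page 2 is reported, because 704 of its pixels are neither skyblue nor white.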

The magic number 484704 is the number of pixels in the PDF image of a single US letter size page.  Changing the pixel density or page size means that you will have to use a different magic number.  The number comes from adding the number of pixels of different colors as returned by the convert command.
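
The same number falls out of simple arithmetic: 484704 = 612 x 792, which is an 8.5 by 11 inch page at 72 pixels per inch.  A minimal Python sketch for working out a replacement magic number follows; the function name page_pixels is hypothetical, and the sketch assumes the renderer rounds each page dimension to a whole number of pixels, so it is worth confirming the result against what convert actually reports.

```python
# A sketch only: pixel count of a page image at a given density.

def page_pixels(width_in=8.5, height_in=11.0, density=72):
    """Pixels in a page of the given size (inches) at `density` pixels per inch."""
    return round(width_in * density) * round(height_in * density)


if __name__ == "__main__":
    print(page_pixels())               # US letter at 72 pixels per inch
    print(page_pixels(density=150))    # same page at a higher density
    print(page_pixels(8.27, 11.69))    # A4, approximate size in inches
```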

At the end of this processing you have:
  • The original and unchanged PDF
  • A page for page version of the original PDF that has been redacted
  • A report listing which redacted pages exposed more pixels than the threshold value allowed
Using these files you can do the following:
  1. Open the original PDF in a PDF viewer,
  2. Open the redacted PDF in a PDF viewer,
  3. Open the report in a TXT viewer,
  4. For each "bad" page in the report, go to that page in the redacted PDF and see what the problem is.  If the problem is severe enough (you have to decide what severe enough means), go to that page in the original and correct the problem.
Using the Listing 2 LaTeX document and Listing 3 report, page 1 exposes too many pixels, but they are the page number in the footer.  Formatting the headers and footers will correct this problem.  Page 2 exposes the right hand margin of the table.  Reading the LaTeX source document does not indicate that this should be a problem, but the LaTeX processing created it.  Correction of this problem is a little more challenging.

Is all the above processing worth the time and effort??  For a small document (fewer than 10 pages), perhaps not.  For a largish document (more than 100 pages), yes.  Processing is fairly quick: a 17 MB (~500 page) PDF takes about 9 minutes to process, less time than it would take to manually flip through each page, even in a print preview mode.

And so we say to you, great Margin Police:

"We have heeded thy words on the margins,

We have checked and rechecked our margins,

And they are good.

We beseech thee, Oh Great Margin Police,

Let us pass and we will be enlightened all the days of our lives."

-- Chuck Cartledge

Listing 1: The make file.



Listing 2: The LaTeX file.



Listing 3: The results.txt file.

