Final Degree Project report available

The report on my experience collaborating with the GNU PDF project is available for download here (look for the link named 73459.pdf). It contains useful information for newcomers, as well as a complete log of the development work, experiences and interactions from this period.

Presentation slides (in Catalan!) are available here.

The Final Degree Project is done, but GNU PDF goes on, so stay tuned for news!

PDF standards comparison: PDF/A vs PDF 1.4


The PDF file format, far from being a single language definition, gathers an extensive family of sublanguages (which we affectionately call dialects) and specifications. These specifications make the format richer and allow a better fit between the language restrictions and the requirements of real scenarios. On the other hand, the library has to consider all these sublanguages and specifications to read and write PDF files correctly.

Thus, these specifications must be studied, compared and formalised, so that the GNU PDF Library can provide an API capable of processing syntax and semantics accordingly.

This study is contained in the GNU PDF Knowledge base, available here. So far, only the comparison between PDF/A and PDF 1.4 is available, but more sublanguages of the PDF family are coming in the near future. Stay tuned!
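
To give a flavour of why dialects matter when reading a file, here is a minimal sketch in C that reads the version from the "%PDF-M.m" header line that every PDF file starts with. The function name sniff_pdf_version is made up for this post and is not part of the GNU PDF API. Note also that the header alone cannot tell a PDF/A file from a plain PDF 1.4 one, since PDF/A conformance is declared in the document metadata.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the GNU PDF API): reads the
   "%PDF-M.m" header at the start of a buffer and stores the major and
   minor version numbers.  Returns 0 on success, -1 if the header is
   missing or malformed.  */
int
sniff_pdf_version (const char *buf, size_t len, int *major, int *minor)
{
  static const char magic[] = "%PDF-";
  const size_t magic_len = sizeof magic - 1;
  char digits[8];
  size_t n;

  assert (buf != NULL && major != NULL && minor != NULL);

  /* We need the magic plus at least "M.m" after it.  */
  if (len < magic_len + 3 || memcmp (buf, magic, magic_len) != 0)
    return -1;

  /* Copy a few header bytes so sscanf works on a terminated string.  */
  n = len - magic_len;
  if (n > sizeof digits - 1)
    n = sizeof digits - 1;
  memcpy (digits, buf + magic_len, n);
  digits[n] = '\0';

  if (sscanf (digits, "%d.%d", major, minor) != 2)
    return -1;
  return 0;
}
```

Called on a buffer beginning with "%PDF-1.4", the function stores 1 and 4 in major and minor. A real reader would go on from there to inspect the metadata before deciding which rules to apply.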

1,000 visitors, thank you!


In its first 5 months of life, Portable Document Features has reached 1,000 visitors. Most of you have been coming from or, but others have reached the blog by searching for how to install the GNU PDF Library on your systems, how to get started hacking on the lib, or just to join some discussion of the lib’s issues. The goal for the next months, as I finish my FDP, is to continue reporting library hacks and to log a more low-level trace of exactly what I’m hacking :)

Thank you very much for your support!

Dealing with errors in the GNU PDF Library


The most effective debugging tool is still careful thought, coupled with judiciously placed print statements (Brian Kernighan)

The GNU PDF project takes testing very seriously. Despite this, when programming functionality that relies on underlying modules, a return value reporting whether something went wrong in a call is very valuable. So, in this post I am going to explain how to print typical errors to the stderr stream when calling almost any function in the pdf.h interface.

The key data type here is pdf_status_t. As you can see in the reference manual, almost every function returns a pdf_status_t. For instance, suppose that we are creating a new token reader (or tokeniser) for a given stream; the prototype is

pdf_status_t pdf_token_reader_new (pdf_stm_t stm, pdf_token_reader_t *reader)

The return value can typically be compared with a PDF status value like PDF_OK or PDF_ERROR, this way

  if (pdf_token_reader_new (stream, &reader) != PDF_OK)

in order to handle error flows. But sometimes we want to know what kind of error happened, rather than just whether one occurred. To achieve this, we need a function that outputs the corresponding error message to stderr:

void pdf_perror (const pdf_status_t status, const char *str)

Calling it, we get a description of the status held in a given pdf_status_t value. So, in our example, this code

  pdf_status_t token_stat;
  pdf_token_reader_t reader;

  token_stat = pdf_token_reader_new (stream, &reader);

  if (token_stat != PDF_OK)
    {
      printf ("ERROR creating tokeniser\n");
      pdf_perror (token_stat, NULL);
    }

easily helps us find out what happened if something went wrong. Note that pdf_perror takes, as its second parameter, a string to concatenate to the error message.
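
The same pattern can be tried without the library installed. The following is a self-contained sketch, not GNU PDF code: demo_status_t, demo_strerror, demo_perror and demo_buffer_new are stand-ins invented for this post, and the "prefix: message" output format is borrowed from the C library's perror, which I am only assuming pdf_perror resembles.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in status type (hypothetical; the real one is pdf_status_t).  */
typedef enum
{
  DEMO_OK = 0,
  DEMO_ERROR,
  DEMO_ENOMEM,
  DEMO_STATUS_COUNT
} demo_status_t;

/* Maps a status value to a human-readable description.  */
const char *
demo_strerror (demo_status_t status)
{
  static const char *const messages[DEMO_STATUS_COUNT] = {
    "no error",
    "generic error",
    "not enough memory"
  };

  if ((int) status < 0 || (int) status >= DEMO_STATUS_COUNT)
    return "unknown status";
  return messages[status];
}

/* Prints "str: description" to stderr, or just the description when
   str is NULL or empty (the convention of the C library's perror).  */
void
demo_perror (demo_status_t status, const char *str)
{
  if (str != NULL && str[0] != '\0')
    fprintf (stderr, "%s: ", str);
  fprintf (stderr, "%s\n", demo_strerror (status));
}

/* Allocates size bytes and reports failure through the return value,
   leaving *buffer set to NULL when nothing was allocated.  */
demo_status_t
demo_buffer_new (size_t size, void **buffer)
{
  assert (buffer != NULL);

  *buffer = NULL;
  if (size == 0)
    return DEMO_ERROR;

  *buffer = malloc (size);
  return (*buffer != NULL) ? DEMO_OK : DEMO_ENOMEM;
}
```

A caller checks the returned status and, on failure, hands it to demo_perror with a short prefix. For a zero-sized request, demo_perror (status, "creating buffer") prints "creating buffer: generic error" to stderr.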

IMPORTANT UPDATE: Please read Aleksander’s comments here, containing very useful and updated information about new error reporting / printing procedures.

Security and Privacy Issues in the PDF Document Format

Researchers at Universidad Politecnica de Madrid (UPM) recently conducted a study examining security and privacy threats related to digital document publishing. The study focused on the PDF document format and addressed publisher-related information that is leaked once the document is distributed over the Internet. The UPM researchers developed several tools that extract information from PDF documents. The researchers say that users can be in danger every time a digital document is downloaded. For example, the study notes that metadata information such as the user name or the last day the document was edited can lead to privacy breaches since most document authors are not aware that the information remains available once the document is published. Meanwhile, the researchers found that poor document format design is responsible for leaking other potentially sensitive information. For example, the researchers note that when a paragraph is deleted, PDF authoring applications do not remove the text and instead mark it as invisible. As a result, the data can be read by malicious users that know what to look for. The researchers’ main goal is to make users aware of the risks associated with publishing a document on the Internet and to provide effective guidelines to minimize the leakage of sensitive information.

Full article here.
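
To make the metadata point concrete: a PDF file can carry a document information dictionary with entries such as /Author and /ModDate, and when that dictionary is stored uncompressed, its values can be found with a plain text search. The sketch below is my own illustration, not one of the UPM tools. find_info_string is a hypothetical helper that handles only simple literal strings, without escaped or nested parentheses, and matches keys by prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: looks for "key (value)" in a NUL-terminated
   buffer and copies value into out, truncating it to fit.  Returns the
   number of bytes copied, or -1 if the key or a literal string after
   it is not found.  */
int
find_info_string (const char *buf, const char *key,
                  char *out, size_t out_size)
{
  const char *p, *open, *close;
  size_t n;

  assert (buf != NULL && key != NULL && out != NULL && out_size > 0);

  p = strstr (buf, key);
  if (p == NULL)
    return -1;

  /* Skip the key and any spaces before the opening parenthesis.  */
  p += strlen (key);
  while (*p == ' ')
    p++;
  if (*p != '(')
    return -1;

  open = p + 1;
  close = strchr (open, ')');
  if (close == NULL)
    return -1;

  n = (size_t) (close - open);
  if (n >= out_size)
    n = out_size - 1;
  memcpy (out, open, n);
  out[n] = '\0';
  return (int) n;
}
```

Run against the text "<< /Author (jdoe) /ModDate (D:20110301120000) >>", asking for /Author yields "jdoe". That is exactly the kind of detail an author may not realise is being published along with the document.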

Make GNU PDF manuals: converting texi files to html or pdf


All documentation in GNU PDF, as in many other free software projects, is written using Texinfo. Texinfo is a documentation format by Richard Stallman and Bob Chassell, which aims to keep all the project documentation in a single source and then produce any desired output format automatically and transparently. As described on the official page:

Texinfo uses a single source file to produce output in a number of formats, both online and printed (dvi, html, info, pdf, xml, etc.). This means that instead of writing different documents for online information and another for a printed manual, you need write only one document. And when the work is revised, you need revise only that one document.

Several tools and scripts are available from the usual repositories to transform texi files to html or pdf files. On Debian-like systems, they can be installed by issuing:

  sudo apt-get install texi2html texinfo

This should make the texi2html and texi2pdf binaries available on your system, which you can use to convert texi files into html or pdf files:

  cd ~/trunk/doc/
  texi2html *.texi
  texi2pdf *.texi

Now you can read all GNU PDF manuals in your favourite format. Hope this helps!

First hacking session with GNU PDF library


In this short session we’ll get a minimal piece of C code running, using types and functions of the types module in the base layer of the GNU PDF library. For more information about the GNU PDF library architecture, please go here.

First of all, create a directory to store these little hacking things:

  mkdir gnupdfhack
  cd gnupdfhack/

Then, create a C test file, and open it with your favorite editor:

  touch test.c
  emacs test.c

This is a first approximation to a trivial unit test, which simply:

  • checks the types declared in the library specification, and
  • tries to call some functions of the library implementation

The test.c file goes like this:

  #include <stdio.h>
  #include "../trunk/src/pdf.h"

  int main (void)
  {
    printf ("GNU PDF hack test\n");

    pdf_size_t size = 128;
    pdf_error_t *error = NULL;
    pdf_buffer_t *buf = pdf_buffer_new (size, &error);

    if (buf == NULL)
      {
        printf ("PDF buffer creation failed\n");
        /* do some more error analysis here … */

        return 1;
      }
    else
      {
        pdf_buffer_destroy (buf);
        printf ("PDF buffer created and destroyed successfully\n");
      }

    return 0;
  }

The type check consists of declaring variables of GNU PDF types (pdf_size_t, pdf_error_t and pdf_buffer_t), while the library calls simply allocate space for a buffer and then destroy it.

To get it working:

  gcc -Wall test.c -o test -L/usr/local/lib -lgnupdf
  ./test

This works under the assumption that the GNU PDF library is installed under /usr/local/lib. Please see this previous post here for a guide on how to install the GNU PDF library on your system. If everything goes well, you’ll get this output:

  GNU PDF hack test
  PDF buffer created and destroyed successfully

In further sessions we’ll dig deeper into the actual implementation to make some improvements that the types module needs. Please feel free to read the base layer interfaces, and happy hacking! Hope this helps!

Install GNU PDF library from source

First of all, this post is a clarification of the official GNU PDF library newcomers guide, found here, and a more explicit explanation of the steps than the one found in the INSTALL and README files of the source trunk. The intention here is not to replace those sources but to explain their contents better, and I encourage all readers installing the GNU PDF library to refer to them.

All GNU PDF library source is managed with the Bazaar version control system, so the first step is to install the bzr package. You can install it from source, from a pre-compiled package, or from your preferred repository. For the latter, and assuming a Debian-style system, you can install it by just typing in your terminal (make sure you have sufficient permissions to install packages):

  apt-get install bzr

Answering yes to APT will install the package and all the dependencies needed. Now it is time to retrieve the source:

  bzr branch bzr://

Wait a while, and the source will be downloaded to ./trunk. Step inside this directory:

  cd trunk

The script will do the work, but it depends on the autoconf and libtool packages, so we install them and then we bootstrap the library:

  apt-get install autoconf libtool
  sh

After some messages from libtool, the source is ready to configure, but usually some dependencies are not fulfilled at this point: zlib, libgpg-error, libgcrypt, uuid-dev and libcheck. Except for libcheck, the required libraries are available in the Debian/Ubuntu repos:

  apt-get install zlib1g-dev libgpg-error-dev libgcrypt11-dev uuid-dev

The GNU PDF library requires the SVN source of libcheck to ensure the latest version of this library. So we need the subversion package on our system, and then we retrieve the sources, configure, compile and install (as root):

  apt-get install subversion
  cd ~
  svn co check
  cd check/trunk/
  autoreconf -i
  ./configure
  make
  make install

At this point, all GNU PDF library requirements are met, so we go for it:

  cd ~/trunk/
  ./configure
  make
  make install

This will install the GNU PDF library in the default location. In most cases, you can see the compiled library objects by issuing:

  ls /usr/local/lib

In further posts we’ll explain how to use the generated dynamic library for a first hacking session with some of the actual library features. Hope this helps!

Welcome to PDF — a blog about GNU PDF developing

As some may know, this is the final year of my degree at Facultat d’Informàtica de Barcelona (FIB, UPC). If everything goes as it should (and, believe me, it will), sometime at the end of June 2011 I will be a Computer Engineer, Computer Scientist, Computer Science Engineer or put_here_your_favourite_computing_degree_title.

As some others (fewer, I guess) may know, to achieve that one must pass the final degree project or PFC. My PFC will be a one-year collaboration with the GNU PDF project. This is something I have wished for since I began my studies, and my chance to give something valuable to the free software community.

This blog is meant to serve the following purposes:

  • To keep a log of my work; this way, my director Toni Soto and I will be able to track everything done,
  • To help any other developer who may encounter problems similar to those described here, and
  • To let you know a little more about the GNU PDF project

So, let’s do it, and hope this helps!