Wednesday, May 6, 2009

Converting JPEGs into PDFs with ImageMagick in Ubuntu 9.04

Before the recent update to Ubuntu Jaunty 9.04, the default install of ImageMagick (sudo apt-get install imagemagick) would convert a 300dpi scanned jpeg image into a PDF using the command line:
convert -page A4 image.jpg out.pdf
The resulting PDF would simply embed the jpeg image, making it only a few KB larger. However, now with v6.5.4 of ImageMagick the default behaviour with this command is to uncompress the jpeg and store it in a lossless format. In my case, a scanned A4 page jumped from 150 KB to 7 MB.

Given the lack of documentation on ImageMagick's output format settings, it took a bit of experimenting to find that the new command to embed a jpeg is:
convert -page A4 -compress jpeg image.jpg out.pdf
The additional -compress jpeg option reproduces the original results.
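A minimal sketch of how the size jump described above could be caught automatically. The helper names (file_size, is_bloated, jpeg_to_pdf) and the factor-of-two threshold are assumptions made for illustration; only the convert command itself comes from this post.

```shell
#!/bin/sh
# Sketch only: helper names and the 2x threshold are illustrative assumptions.

# Print the size of a file in bytes.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Succeed if the second file is more than twice the size of the first.
is_bloated() {
    src_size=$(file_size "$1")
    out_size=$(file_size "$2")
    [ "$out_size" -gt $((src_size * 2)) ]
}

# Convert one jpeg to an A4 PDF and warn if the PDF looks uncompressed.
jpeg_to_pdf() {
    convert -page A4 -compress jpeg "$1" "$2" || return 1
    if is_bloated "$1" "$2"; then
        echo "warning: $2 is more than twice the size of $1" >&2
    fi
}
```

Called as jpeg_to_pdf image.jpg out.pdf, this should print nothing when the jpeg is embedded compressed, and a warning when the PDF has ballooned the way the 150 KB scan did.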

This is part of a larger custom bash shell script that automates "one click" scanning of sequential paper pages into a PDF on Ubuntu. Simple, fixed settings, no messing around. Post a comment if you're interested.

Update: I have posted this script here.

12 comments:

guilherme said...

Thank you VERY MUCH! It was exactly what I needed!

5 jpeg images with about 800 KB each were becoming a 123 MB pdf file! LOL

Arkadiusz said...

Now, how do I convert a few jpeg files into a single pdf?

Rob said...

I use pdftk to join multiple pdf's into one file. In its most basic usage:

pdftk *.pdf cat output outfile.pdf

Matthew said...

To convert multiple Jpeg files:

try
convert *.jpeg test.pdf
or
convert *.jpg test.pdf

Rob said...

@Matthew, that works okay for a couple of pages. Last time I tried that with 20 jpeg pages, imagemagick jammed up trying to request a few GB of RAM.

pdftk adds some great additional functionality that's worth investigating: http://www.accesspdf.com/pdftk
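One way to combine the two suggestions above is to convert each page on its own, so imagemagick only holds one image at a time, and then join the results with pdftk. This is a rough sketch, not the script from the post: join_scans and page_name are hypothetical names, and it assumes both convert and pdftk are installed.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Zero-padded page name, so that shell glob order matches page order.
page_name() {
    printf 'page-%04d.pdf\n' "$1"
}

# Usage: join_scans outfile.pdf page1.jpg page2.jpg ...
join_scans() {
    out="$1"
    shift
    tmpdir=$(mktemp -d) || return 1
    n=0
    for jpg in "$@"; do
        n=$((n + 1))
        # One convert call per page keeps memory use to a single image.
        if ! convert -page A4 -compress jpeg "$jpg" "$tmpdir/$(page_name "$n")"; then
            rm -rf "$tmpdir"
            return 1
        fi
    done
    # Join the single-page PDFs in numeric order.
    pdftk "$tmpdir"/*.pdf cat output "$out"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

For example, join_scans scan.pdf *.jpg would take the pages in the order the shell expands the glob.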

Manuel said...

Hi Rob, I would like to see your bash script. I am interested.

Rob said...

@Manuel, I have posted the complete script I have been using at http://www.rrfx.net/2009/11/batch-scanning-paper-documents-to-pdf.html ...let me know if it helps you out!

Orbis said...

Hello,
I've observed that:

convert -compress jpeg in.jpg out.pdf

won't simply put the JPEG image into the output document, but it will instead *recompress* it, thereby losing data.

Is there a way around this?

Orbis said...

Now this is odd:

tlon:~/pdf-jpeg-test$ convert -compress jpeg original.jpg original.pdf
tlon:~/pdf-jpeg-test$ v
total 304
-rw-r--r-- 1 orbis tertius 186761 2009-11-25 17:59 original.jpg
-rw-r--r-- 1 orbis tertius 113360 2009-11-25 18:01 original.pdf

See the PDF file is smaller than the JPEG. Extracting the JPEG with pdfimages -j and then comparing it with the original one shows visible differences.


On the other hand, (re)compressing the JPEG picture before "converting" it into PDF results in the PDF containing the unmodified JPEG data:
tlon:~/pdf-jpeg-test$ convert -quality 99 original.jpg 99original.jpg
tlon:~/pdf-jpeg-test$ convert -compress jpeg 99original.jpg 99original.pdf
tlon:~/pdf-jpeg-test$ v 99*
-rw-r--r-- 1 orbis tertius 201099 2009-11-25 18:01 99original.jpg
-rw-r--r-- 1 orbis tertius 207282 2009-11-25 18:02 99original.pdf

tlon:~/pdf-jpeg-test$ convert -quality 50 original.jpg 50original.jpg
tlon:~/pdf-jpeg-test$ convert -compress jpeg 50original.jpg 50original.pdf
tlon:~/pdf-jpeg-test$ v 50*
-rw-r--r-- 1 orbis tertius 76878 2009-11-25 18:02 50original.jpg
-rw-r--r-- 1 orbis tertius 79395 2009-11-25 18:02 50original.pdf

Rob said...

Hi Orbis, I was about to (re)post a long reply to that effect. Unfortunately Firefox 3.5.5 is a buggy piece of crap and it crashed while I was waiting for Kdiff3.

A binary diff between an original test jpeg (8MB) and the one extracted from a pdf with "pdfimages -j" was identical for the first 40% of the file, and completely different for the other 60%. Odd, but it makes sense that a single bit difference would then make the rest of the two jpegs differ.

I remember doing tests like this way back when I first set myself up for scanning paper documents. Enough tests to be convinced that the jpeg was as good as being stored. Progressive scan jpegs were converted to baseline first.

It seems like imagemagick stores the quality level in the jpeg. I've noticed the same behaviour in Gimp when you hit "save as" on a jpeg, close it, reopen it and hit "save as" again. However, Gimp doesn't pick up the quality level that Imagemagick seems to have written to the file.

Given I'm using imagemagick for all my postprocessing, I've not found it to be a problem. Cheers.

Rob said...

For anyone wanting to test this:

#convert -quality 66 dsc07857.jpg test.jpg
#convert -compress jpeg test.jpg test.pdf
#pdfimages -j test.pdf out

#ls -l (reordered source->jpg->pdf->extracted jpg)
-rw-r--r-- 1 rob rob 52352 2008-04-05 14:02 dsc07857e800.jpg
-rw-r--r-- 1 rob rob 29499 2009-11-26 01:46 test.jpg
-rw-r--r-- 1 rob rob 32418 2009-11-26 01:46 test.pdf
-rw-r--r-- 1 rob rob 29481 2009-11-26 01:47 out-000.jpg

The extracted jpeg is almost the same file size. Binary diff:
#kdiff3 test.jpg out-000.jpg
In this case it shows the first 20% of the binary jpeg data to be the same.

To verify the jpeg data is the same, convert to a bitmap and binary diff **:
#convert test.jpg test.bmp
#convert out-000.jpg out-000.bmp
#kdiff3 test.bmp out-000.bmp
Here the bitmap header is different, however the image data is identical.

**don't try this on large jpeg files
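As a possible alternative to the bitmap diff, ImageMagick's compare tool can compare the decoded pixels directly, which sidesteps the differing bitmap headers. This is a sketch under assumptions: verdict and check_extracted are hypothetical names, and it relies on compare's documented exit status (0 when the images are similar, 1 when they are dissimilar, 2 on error), which may behave differently across versions.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Map compare's exit status to a short message.
verdict() {
    case "$1" in
        0) echo "pixel data identical" ;;
        1) echo "pixel data differs" ;;
        *) echo "comparison failed" ;;
    esac
}

# Usage: check_extracted test.jpg out-000.jpg
check_extracted() {
    # -metric AE counts differing pixels; null: discards the difference image.
    compare -metric AE "$1" "$2" null: >/dev/null 2>&1
    verdict "$?"
}
```

Like the bitmap approach, this decodes both images fully, so it may still be slow on large jpeg files.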

test said...

Thanks a lot. I use a GNOME Nautilus script with Ubuntu:
#!/bin/bash
IFS='
'
convert -page a4 -quality 50 -compress jpeg $NAUTILUS_SCRIPT_SELECTED_FILE_PATHS photos.pdf
