Extracting text from PDFs

Published 11 January, 2010 in Applications , Programming - 0 Comments

Unwanted line breaks in text copied from PDF
Unwanted line breaks in text copied from PDF

Anybody working with information sooner or later have to copy and paste text from PDF-files. And anybody who has done that knows what a pain in the a** that is! You get actual line breaks from the visual line breaks in the PDF. The other day I began a job where I have to copy and paste text from a whole bunch of PDF files and it didn’t take long before I almost exploded with anger ;)

So I thought: Why not make a simple application that extracts the text from the PDF and – to the most possible degree – normalizes the unwanted line breaks.

And then there was Textifyer

So I fired up Visual C# Express and started hacking. I soon found the PDFbox component – using IKVM.NET – and it didn’t take long before I had some code that actually extracted the text from a PDF file. (a PDF extraction in C# howto)

I figured out how to detect unwanted line breaks: Each line with an unwanted line break ends with a space character. Lines with a wanted line break doesn’t (in 99% of the cases). So it is just a matter of of looping over the lines and if it ends with a space skip adding a line break and just append it to the previous text buffer.

Unwanted line breaks removed
Unwanted line breaks removed

So now I just have to clean up the interface and bug test the program – which will happen automatically since I’m copy and paste from a whole bunch of PDFs at the moment. When I feel it’s working alright I will release the program. It’s really nothing hardcore about it anyway ;)

Textifyer: Drag-n-drop enabled
Of course there’s drag-n-drop support!