File header Size

Oct 8, 2019 at 8:32pm

Hi,
I want to change the header of some document files (PDF, MS Word, LibreOffice) programmatically. I know that I have to use byte-level functions like putc(..) and getc(). But I don't know the header size of the above-mentioned file formats. I saw a list of file formats on Wikipedia, https://en.wikipedia.org/wiki/File_format
but I can't see information about the header size. For instance, I know the header size of a BMP file is 54 bytes. Can somebody please point me to a link that gives this information?

Zulfi.

Oct 8, 2019 at 8:54pm

Ganado (6834)

PDFs, MS Word docs, and LibreOffice docs are three different formats with different specifications, so you're going to have to handle each of them separately.

.docx documents, for instance, are thinly veiled zip files, see: https://en.wikipedia.org/wiki/Office_Open_XML
They may contain multiple files and folders, including multimedia. I have not used it, but libarchive apparently can handle the compression/decompression of these zip files. I am sure there are other libraries that can do the equivalent: http://www.libarchive.org/

Instead of directly reading the files yourself, I would suggest finding a library that can already do it, as I'm sure some exist. But nevertheless, if you want to know how a particular file format is structured, you have to look up its specification.

For zip files, the header information can be seen here:
https://en.wikipedia.org/wiki/Zip_(file_format)#Local_file_header
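To make the zip layout a bit more concrete, here is a minimal sketch (not a replacement for a real zip library) that reads the fixed 30-byte part of a local file header from a byte buffer. The names ZipLocalHeader and parse_zip_local_header are my own for illustration; the field offsets follow the layout described on the page linked above. Note that the header is not a fixed size overall: the file name and extra field that follow the 30 bytes have variable length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

typedef std::vector<unsigned char> Bytes;

// Little-endian readers; the caller must make sure the bytes exist.
std::uint16_t read_u16(const Bytes& b, std::size_t pos) {
    assert(pos + 2 <= b.size());
    return static_cast<std::uint16_t>(b[pos] | (b[pos + 1] << 8));
}

std::uint32_t read_u32(const Bytes& b, std::size_t pos) {
    assert(pos + 4 <= b.size());
    return static_cast<std::uint32_t>(b[pos]) |
           (static_cast<std::uint32_t>(b[pos + 1]) << 8) |
           (static_cast<std::uint32_t>(b[pos + 2]) << 16) |
           (static_cast<std::uint32_t>(b[pos + 3]) << 24);
}

// Illustrative struct (hypothetical name): a few fields of a local file header.
struct ZipLocalHeader {
    std::uint16_t compression_method;  // 0 = stored, 8 = deflate
    std::uint32_t compressed_size;
    std::uint32_t uncompressed_size;
    std::string file_name;
    std::size_t data_offset;           // where this entry's data starts in buf
};

// Returns false if buf does not start with a complete local file header.
bool parse_zip_local_header(const Bytes& buf, ZipLocalHeader& out) {
    const std::size_t kFixedSize = 30;
    if (buf.size() < kFixedSize) return false;
    if (read_u32(buf, 0) != 0x04034b50u) return false;  // bytes "PK\x03\x04"

    const std::size_t name_len = read_u16(buf, 26);
    const std::size_t extra_len = read_u16(buf, 28);
    if (buf.size() < kFixedSize + name_len) return false;

    out.compression_method = read_u16(buf, 8);
    out.compressed_size = read_u32(buf, 18);
    out.uncompressed_size = read_u32(buf, 22);
    out.file_name.assign(buf.begin() + kFixedSize,
                         buf.begin() + kFixedSize + name_len);
    out.data_offset = kFixedSize + name_len + extra_len;
    return true;
}
```

You would fill the buffer with the first bytes of the .docx or .odt file (opened in binary mode) and pass it to parse_zip_local_header; a false return means the file does not begin with a zip entry.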

In a PDF document, the header is %PDF-1.7 (or 1.6, etc.) and simply defines that the document is using the PDF 1.7 format.
https://lotabout.me/orgwiki/pdf.html
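As a rough sketch of what checking that header could look like, the function below (parse_pdf_version is a name I made up) looks for %PDF- at the very start of a string and pulls out the two version digits. It assumes the header sits at byte 0 and that the version is a single digit, a dot, and a single digit, as in 1.7.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Parses a leading "%PDF-M.N" header. On success, stores M and N and
// returns true; otherwise leaves the outputs untouched and returns false.
bool parse_pdf_version(const std::string& head, int& ver_major, int& ver_minor) {
    const std::string magic = "%PDF-";
    if (head.size() < magic.size() + 3) return false;
    if (head.compare(0, magic.size(), magic) != 0) return false;

    const unsigned char major_ch = static_cast<unsigned char>(head[5]);
    const unsigned char minor_ch = static_cast<unsigned char>(head[7]);
    if (!std::isdigit(major_ch) || head[6] != '.' || !std::isdigit(minor_ch)) {
        return false;
    }
    ver_major = major_ch - '0';
    ver_minor = minor_ch - '0';
    return true;
}
```

Since the header is just a line of text, there is no fixed "header size" to skip the way there is for a BMP; the rest of the file has its own structure that the specification describes.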

It looks like LibreOffice documents are also zipped, similar to MS Word documents.
https://help.libreoffice.org/Common/XML_File_Formats

For generally reading binary files, you need to make sure you have the file opened in binary mode. Here's an article that talks about reading a binary file: http://www.cplusplus.com/articles/DzywvCM9/
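A minimal sketch of the binary-mode part, assuming you only want to look at the first few bytes of a file. read_prefix is a hypothetical helper name, not something from the linked article.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Reads up to `count` bytes from the start of the file at `path`.
// Returns fewer bytes if the file is shorter, and an empty vector if
// the file cannot be opened.
std::vector<unsigned char> read_prefix(const std::string& path, std::size_t count) {
    std::vector<unsigned char> buf;
    // std::ios::binary stops the runtime from translating line endings,
    // which would otherwise corrupt byte values on some platforms.
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return buf;

    buf.resize(count);
    in.read(reinterpret_cast<char*>(buf.data()),
            static_cast<std::streamsize>(count));
    // gcount() reports how many bytes the last read actually produced.
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}
```

For example, read_prefix("report.pdf", 8) (the file name is just a placeholder) would give you the bytes you need to compare against a format's signature.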

Try opening files in hex editors or programs like Notepad++ if you're curious about what's inside them.

Last edited on Oct 8, 2019 at 9:02pm

Oct 8, 2019 at 9:15pm

poteto (525)

ignore this: A bmp's header isn't 54 bytes, what are you talking about? bmp headers can have a variable length.

correction: the MS-Windows BMPv3 format has a 54-byte header (a 14-byte file header plus a 40-byte info header), and it is the most common form of BMP since MS Paint still doesn't support transparency. I was reading about the V5 version on Wikipedia, which has a ton of variable-sized parts in the header. The only way to error-check is to see if the size of the info header is equal to 40. I hope you don't want to read the image data, since BMP pads each row of pixels to a multiple of 4 bytes, which is a good reason to just use a library to read a BMP.
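Here is a rough sketch of that check, assuming the whole 54-byte header is already in a buffer. BmpInfo and parse_bmp_v3_header are names I invented; the offsets are the standard ones for the 14-byte file header followed by the 40-byte info header. It also computes the padded row size, since each row is rounded up to a multiple of 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<unsigned char> Bytes;

// Little-endian readers; the caller must make sure the bytes exist.
std::uint16_t read_u16(const Bytes& b, std::size_t pos) {
    assert(pos + 2 <= b.size());
    return static_cast<std::uint16_t>(b[pos] | (b[pos + 1] << 8));
}

std::uint32_t read_u32(const Bytes& b, std::size_t pos) {
    assert(pos + 4 <= b.size());
    return static_cast<std::uint32_t>(b[pos]) |
           (static_cast<std::uint32_t>(b[pos + 1]) << 8) |
           (static_cast<std::uint32_t>(b[pos + 2]) << 16) |
           (static_cast<std::uint32_t>(b[pos + 3]) << 24);
}

// Illustrative struct (hypothetical name) with the fields needed to find pixels.
struct BmpInfo {
    std::uint32_t pixel_offset;    // file offset of the pixel data
    std::int32_t width;
    std::int32_t height;           // negative means rows are stored top-down
    std::uint16_t bits_per_pixel;
    std::uint32_t row_stride;      // bytes per row including padding
};

// Accepts only the 54-byte layout: "BM" signature and a 40-byte info header.
bool parse_bmp_v3_header(const Bytes& buf, BmpInfo& out) {
    if (buf.size() < 54) return false;
    if (buf[0] != 'B' || buf[1] != 'M') return false;
    if (read_u32(buf, 14) != 40) return false;  // other header versions differ

    out.pixel_offset = read_u32(buf, 10);
    out.width = static_cast<std::int32_t>(read_u32(buf, 18));
    out.height = static_cast<std::int32_t>(read_u32(buf, 22));
    out.bits_per_pixel = read_u16(buf, 28);
    if (out.width <= 0 || out.bits_per_pixel == 0) return false;

    // Round the row's bit count up to a whole number of 32-bit words.
    const std::uint64_t bits =
        static_cast<std::uint64_t>(out.width) * out.bits_per_pixel;
    out.row_stride = static_cast<std::uint32_t>(((bits + 31) / 32) * 4);
    return true;
}
```

For a 3-pixel-wide, 24-bit image the raw row is 9 bytes, so the stride comes out as 12.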

The best solution from my perspective is to use C# with a third-party library called PDFsharp, which has a feature to modify metadata. If you are using Windows, it can be really easy to turn C# into a DLL that can communicate with C++ through an exported function.

From a minute of skimming, C/C++ doesn't seem to have a popular library for modifying PDFs.

This is the C# example: https://stackoverflow.com/questions/1465434/edit-metadata-of-pdf-file-with-c-sharp

Other than that, it seems very easy to modify metadata using a GUI application (programs that can create PDF files, like Adobe Acrobat and others). I don't really see the point of programmatically modifying the metadata in batches. Maybe if you gave more information about your problem, there could be a more helpful answer.

Last edited on Oct 9, 2019 at 1:04am

Oct 8, 2019 at 9:27pm

Duthomhas (13277)

There is absolutely no reason to recommend people to other languages or their libraries to do something in C or C++.

For a BMP there is EasyBMP, which is a very nice little library.

Everything that exists probably has a C or C++ library to handle it, and if not, the technical specification makes looking up how a thing is encoded very easy.

Oct 8, 2019 at 9:54pm

poteto (525)

I would love to agree with you, but I can't find a similar library for PDFs.

Also, calling native functions exported from a C# DLL may be difficult for a beginner to figure out. It may be easier to make a command-line application that takes its input as terminal arguments (the C# equivalent of argv/argc is string[] args) and call the exe from C++ with system().
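A very rough sketch of the C++ side of that idea. Everything specific here is made up for illustration: the tool name, the argument order (input file, then new title), and the helper names build_command and run_metadata_tool. The quoting is deliberately naive, so it assumes the tool path has no spaces and the arguments contain no double quotes.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Builds a command line such as: tool "file.pdf" "New Title"
// Assumption: `tool` contains no spaces and the arguments contain no
// double-quote characters; real code would need proper escaping.
std::string build_command(const std::string& tool,
                          const std::string& pdf_path,
                          const std::string& title) {
    assert(pdf_path.find('"') == std::string::npos);
    assert(title.find('"') == std::string::npos);
    return tool + " \"" + pdf_path + "\" \"" + title + "\"";
}

// Runs the (hypothetical) helper program and returns whatever
// std::system reports; the meaning of that value is platform-specific.
int run_metadata_tool(const std::string& tool,
                      const std::string& pdf_path,
                      const std::string& title) {
    const std::string cmd = build_command(tool, pdf_path, title);
    return std::system(cmd.c_str());
}
```

The C# program would then read the file path and title from string[] args and do the actual metadata edit.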

Last edited on Oct 8, 2019 at 10:42pm

Oct 8, 2019 at 11:19pm

zak100 (67)

Hi,
I tried with an editor and then matched my result with a simple C program. It worked with BMP because I knew the header size, but what about other document files?

Does the notepad have any fixed header size?

Zulfi.

Oct 9, 2019 at 2:43am

Duthomhas (13277)

but I can't find a similar library for PDFs

Funny, I found plenty. The very first one is Adobe’s.
https://www.google.com/search?q=libpdf

Regrettably, Adobe no longer offers an OS solution.
