Linux file structures

Hi guys

I'll start off by saying I'm pretty new to Linux. On Windows, all files have a signature or magic number, such as 89 50 4E 47 for PNG files. If you open a file in a hex editor you will also find more information about it; eventually the data starts and ends, and the data is normally followed by a trailer too.

But when I do a simple Google search for file formats on a Linux system ("Linux file structure"), all I get is articles and editorials about the Linux file system and not the actual format types. I'm trying to make a simple parser that recognizes the file type by finding its magic number.

I also notice that when I open a normal .txt file in a hex editor (wxHexEditor), there is absolutely no metadata: the first byte of the file is just text, and again there are no trailers, it's just ASCII text data. So how do text editors such as nano and vim open this type of file when there is no info that even identifies it as a text file?

Thanks!
Well a PNG file on Linux has the same format as it does on Windows.

> I'm trying to make a simple parser that recognizes the file type by finding its magic number.
Like this :)
https://www.man7.org/linux/man-pages/man1/file.1.html
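If you want to roll your own, here is a minimal sketch of the idea in C++. The names identify and read_head and the five-entry table are made up for illustration; the real file(1) tool consults a far larger database and does much more than a prefix compare.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// One known signature: the leading bytes and a human-readable name.
struct Signature {
    std::vector<unsigned char> magic;
    std::string name;
};

// Hypothetical helper: compare the start of `data` against a small table
// of well-known signatures. The table is illustrative, not complete.
std::string identify(const std::vector<unsigned char>& data) {
    static const std::vector<Signature> table = {
        {{0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A}, "PNG image"},
        {{'G', 'I', 'F', '8'}, "GIF image"},
        {{0xFF, 0xD8, 0xFF}, "JPEG image"},
        {{0x7F, 'E', 'L', 'F'}, "ELF binary"},
        {{'P', 'K', 0x03, 0x04}, "ZIP archive"},
    };
    for (const auto& sig : table) {
        if (data.size() >= sig.magic.size() &&
            std::equal(sig.magic.begin(), sig.magic.end(), data.begin())) {
            return sig.name;
        }
    }
    return "unknown";
}

// Hypothetical helper: read up to `count` leading bytes of a file.
// Returns fewer bytes (possibly none) if the file is short or unreadable.
std::vector<unsigned char> read_head(const std::string& path, std::size_t count = 16) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> buf(count);
    in.read(reinterpret_cast<char*>(buf.data()), static_cast<std::streamsize>(count));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}
```

Calling identify(read_head("picture.png")) on a real PNG should give "PNG image"; a plain .txt file falls through to "unknown" because it has no signature at all.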

> So how do text editors such as nano,vim and other text editors open this type of
> text file when there is no info that even identifies them as one(text file)?
They don't.
Which is why in simpler text editors, when you open a binary file like a PNG by mistake, all sorts of fun can happen.
But text editors aimed at programmers in particular recognise that programmers want to look at all sorts of files, so some have specific 'hex dump' modes to allow you to look at the raw bytes.
You have a misconception.
A binary file can use all 256 values (0-255) for any given byte.
A text file uses a subset of those: the printable characters, plus a few whitespace codes such as newline and tab.
So what happens?
Say you had the unsigned 64-bit integer 9223372036854775808 and wrote it to a file.
In a text file, you write "9223372036854775808",
which is 19 bytes of data in the file. In a hex editor, you see the ASCII code in hex for each digit, e.g. 0x39 0x32, displayed as 39 32 32 ...
In binary, though, you see the 8 bytes of the integer (64 bits): 80 00 00 00 00 00 00 00 (or the endian reverse of this). So binary is harder to read in the hex editor, but saves 11 bytes per integer in this case. On the flip side, "0" still takes 8 bytes in binary and only one in text. More often than not binary is more efficient, though: even for images, each RGB channel value may take 3 bytes in text and only 1 in binary, with 246 of the 256 possible values taking more than 1 byte as text.
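To see the size difference for yourself, here is a small C++ sketch of the two encodings. The function names as_text and as_big_endian are invented for this example, and the binary form is produced most significant byte first; a raw memory dump on a little-endian machine would show the reverse order.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Text form: one ASCII byte per decimal digit.
std::string as_text(std::uint64_t value) {
    return std::to_string(value);
}

// Binary form: always 8 bytes, written most significant byte first
// (big-endian), independent of the machine's own byte order.
std::vector<unsigned char> as_big_endian(std::uint64_t value) {
    std::vector<unsigned char> bytes(8);
    for (int i = 0; i < 8; ++i) {
        bytes[static_cast<std::size_t>(i)] =
            static_cast<unsigned char>(value >> (56 - 8 * i));
    }
    return bytes;
}
```

For 9223372036854775808 the text form is 19 bytes long and the binary form is the 8 bytes 80 00 00 00 00 00 00 00; for 0 the text form is a single byte while the binary form is still 8.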

Now, all that aside: every file format is different. There is no universal 'magic number' scheme -- the leading bytes you see are just integers or doubles or whatever values that the format defines to mean something: could be a signature, the file's size, or the size of the data portion for the image, or the date, or what file version it is (many file types have had revisions and the format varies a little), and so on.
There isn't any universal way to detect what the file type is from its first few bytes. You can do a few -- like how email virus scanners can recognize most compressed file types -- but nothing all-encompassing. The extensions on files (which Linux handles poorly, leaving extensions off many types) should tell you something too: .png, .jpg, .bmp all mean something unless the file was created by someone trying to fool you or someone clueless as to common extensions. So your first clue should be the extension if it has one, and it should for most types. A lot of the email and other scanners are so dumb that if you rename virus.exe to image.jpg it will pass right through the email checker and land on the target system. Getting it renamed back on the target as an attack is nontrivial, but it certainly gets past the 'stop mailing your co-workers exe files' problem that many coders have to fight against.

Finally: TEXT editors do not know what to do with BINARY files. That is why we have hex editors in the first place: to open this type of file, because text editors cannot. The unprintable special characters in binary files can be skipped, shown as junk, or even break the data (some text editors read certain bytes as an end-of-file marker and stop at whatever random location had those bytes). You get a mess, and if you print it to the console, it even beeps, because the hidden 'make noise' character (BEL, 0x07) is in there too at random.
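Since a text file carries no marker, a parser can only guess. A rough sketch of one such guess in C++ follows; looks_like_text is a made-up name, and the rule (no control bytes except common whitespace) is an assumption for illustration, not how any particular editor or tool decides.

```cpp
#include <string>

// Hypothetical heuristic: treat the data as text if it contains no NUL
// bytes and no control characters other than tab, newline, carriage
// return and form feed. Bytes >= 0x80 are allowed, since they may belong
// to UTF-8 or another extended encoding.
bool looks_like_text(const std::string& data) {
    for (unsigned char c : data) {
        if (c == '\t' || c == '\n' || c == '\r' || c == '\f') continue;
        if (c < 0x20 || c == 0x7F) return false;
    }
    return true;
}
```

A PNG header fails this check right away because of its 0x1A byte, and anything containing the BEL character fails too. Note that UTF-16 text would also be rejected, because it is full of zero bytes.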
Datamining: it can be very useful to write a simple C++ program that dumps a binary file to a text file with only the text remaining and everything else removed. Then you can search that result for things and unravel some mysteries of the file format; hackers, dataminers, modders, etc. use this as one of many techniques. Unicode or other non-ASCII text is often readable too; in many cases, it's just s p r e a d o u t l i k e t h i s in the text file.
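A minimal sketch of that kind of dump in C++ is below. The helper name printable_runs and the minimum run length of 4 are choices made for this example; the strings utility found on most Linux systems does the same job with more options.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collect runs of printable ASCII (0x20..0x7E) that
// are at least `min_len` bytes long; everything else is discarded.
std::vector<std::string> printable_runs(const std::string& data,
                                        std::size_t min_len = 4) {
    std::vector<std::string> runs;
    std::string current;
    for (unsigned char c : data) {
        if (c >= 0x20 && c <= 0x7E) {
            current.push_back(static_cast<char>(c));
        } else {
            // A non-printable byte ends the current run.
            if (current.size() >= min_len) runs.push_back(current);
            current.clear();
        }
    }
    if (current.size() >= min_len) runs.push_back(current);
    return runs;
}
```

Read the whole file into a std::string (opened in binary mode), pass it in, and write each run on its own line. The minimum length filters out the short accidental matches that random binary data always produces. Text stored two bytes per character would need min_len lowered to 1 to show up, since each letter is separated by a zero byte.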