Linux file structures

Hi guys

I'll start off by saying I'm pretty new to Linux. On Windows all files have a signature or magic number, such as 47 49 46 38 for png files. If you open a hex editor you will also find out more information about that file; eventually the data will start and end, and the data is normally followed by trailers too.

But when I do a simple Google search for file formats on a Linux system (Linux file structure), all I get is articles and editorials about the Linux file system and not the actual format types. I'm trying to make a simple parser that recognizes the file type by finding its magic number.

I also notice that when I open a normal .txt file in a hex editor (wxHexEditor), there is absolutely no metadata: the first byte of the file is just text, and again there are no trailers, it's just ASCII text data. So how do text editors such as nano, vim and other text editors open this type of text file when there is no info that even identifies them as one (text file)?

Thanks!
Well a PNG file on Linux has the same format as it does on Windows.

> I'm trying to make a simple parser that recognizes the file type by finding its magic number.
Like this :)
https://www.man7.org/linux/man-pages/man1/file.1.html

> So how do text editors such as nano, vim and other text editors open this type of
> text file when there is no info that even identifies them as one (text file)?
They don't.
Which is why in simpler text editors, when you open a binary file like a PNG by mistake, all sorts of fun can happen.
But text editors aimed at programmers in particular recognise that programmers want to look at all sorts of files, so some have specific 'hex dump' modes to allow you to look at the raw bytes.
You have a misconception.
A binary file can use all 256 values (0-255) for any given byte.
A text file uses a subset of those: the printable characters, plus whitespace such as newlines.
So what happens?
Say you had this unsigned 64-bit integer and wrote it to a file: 9223372036854775808
In a text file, you write "9223372036854775808",
which is 19 bytes of data in the file. In a hex editor, you see the ASCII code of each digit in hex, e.g. 0x39 0x32, displayed as 39 32 32 ...
In binary, though, you see the 8 bytes of the integer (64 bits): 80 00 00 00 00 00 00 00 (or the endian reverse of this). So binary is harder to read in the hex editor, but saves 11 bytes per integer in this case. On the flip side, "0" still takes 8 bytes in binary and only one in text. More often than not binary is more efficient, though: even for images, RGB values may take up to 3 bytes each in text and only 1 in binary, with 246 of the 256 possible values taking more than 1 byte as text.
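To make the byte counting above concrete, here is a minimal C++ sketch. The helper names (text_size, binary_size, hex_dump_be) are made up for illustration, and the dump assumes the big-endian layout used in the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes needed to store the value as decimal ASCII text (one byte per digit).
std::size_t text_size(std::uint64_t value) {
    return std::to_string(value).size();
}

// Bytes needed to store the value as a raw 64-bit integer: always 8.
constexpr std::size_t binary_size() {
    return sizeof(std::uint64_t);
}

// Big-endian byte image of the value, formatted the way a hex editor shows it.
std::string hex_dump_be(std::uint64_t value) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (int shift = 56; shift >= 0; shift -= 8) {
        const unsigned byte = static_cast<unsigned>((value >> shift) & 0xFF);
        if (!out.empty()) out += ' ';
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

For 9223372036854775808 this gives a text size of 19, a binary size of 8, and the dump 80 00 00 00 00 00 00 00; for 0 the text size drops to 1 while the binary size stays at 8.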

Now, all that aside: every file format is different. There are no 'magic numbers' -- these things you see are just integers or doubles or whatever values that mean something: could be the file's size, or the size of the data portion for the image, or the date, or what file version it is (many file types have had revisions and the format varies a little), and so on.
There isn't any universal way to detect what the file type is from its first few bytes. You can do it for a few -- like how email virus scanners can recognize most compressed file types -- but nothing all-encompassing. The extensions on files (which Linux handles poorly, leaving extensions off many types) should tell you something too: .png, .jpg, .bmp all mean something unless the file was created by someone trying to fool you or someone clueless as to common extensions. So your first clue should be the extension if it has one, and it should for most types. A lot of email and other scanners are so dumb that if you rename virus.exe to image.jpg it will pass right through the email checker and land on the target system. Getting it renamed back as part of an attack is nontrivial, but renaming certainly gets past the 'stop mailing your co-workers exe files' problem that many coders have to fight against.

Finally: TEXT editors do not know what to do with BINARY files. That is why we have hex editors in the first place: to open this type of file, because text editors cannot. The unprintable special characters in binary files can be skipped, shown as junk, or even break the data (some text editors read certain bytes as an end of file and stop at whatever random location had those byte codes). You get a mess, and if you print it to the console, it even beeps, because the hidden 'make noise' (bell) character is in there too at random.
Datamining: it can be very useful to write a simple C++ program that dumps a binary file to a text file with only the text remaining and all else removed. Then you can search that result for things and unravel some mysteries of the file format; hackers, dataminers, modders, etc. use this as one of many techniques. Unicode or other non-ASCII text is often readable too; in many cases, it's just s p r e a d o u t l i k e t h i s in the text file.
If you have the vim text editor, it is very helpful for reading files. I use it a lot, and it will read and edit almost anything I try to open with it. I know it will at least read .csv, .txt, .dat, .bin, .rtf, .pages, and .doc. It will probably also read much more than that. You can download it here:
https://www.vim.org/download.php

Bonus: the Vim project supports children in Uganda, so if you donate you're helping to provide for them!

Info about vim:
https://en.wikipedia.org/wiki/Vim_(text_editor)

Good luck,
max
For PNG format:

https://www.w3.org/TR/PNG-Structure.html

3.1. PNG file signature
The first eight bytes of a PNG file always contain the following (decimal) values:

137 80 78 71 13 10 26 10

This signature indicates that the remainder of the file contains a single PNG image, consisting of a series of chunks beginning with an IHDR chunk and ending with an IEND chunk.


12.11. PNG file signature
The first eight bytes of a PNG file always contain the following values:

(decimal) 137 80 78 71 13 10 26 10
(hexadecimal) 89 50 4e 47 0d 0a 1a 0a
(ASCII C notation) \211 P N G \r \n \032 \n



So, yes, some files have a header or magic number at the beginning. That's just so an interpreter can scan the file quickly and see what format it claims to be. And this has absolutely nothing to do with Windows vs. Linux.

Text files don't have that. They are just ASCII data. And binary files do not necessarily have a signature either; it depends on the format that the file is trying to meet.
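Since a plain text file carries no signature, about the best a program can do is guess from the bytes themselves. Here is one possible heuristic, assuming strict 7-bit ASCII; the name looks_like_ascii_text is made up, and UTF-8 or other encodings would need a looser rule.

```cpp
#include <string>

// Heuristic: treat a buffer as plain ASCII text if every byte is printable
// ASCII or common whitespace (tab, line feed, carriage return).
// An empty buffer counts as text.
bool looks_like_ascii_text(const std::string& bytes) {
    for (unsigned char byte : bytes) {
        const bool printable = byte >= 0x20 && byte <= 0x7E;
        const bool whitespace = byte == '\t' || byte == '\n' || byte == '\r';
        if (!printable && !whitespace) return false;
    }
    return true;
}
```

A real tool would typically look at only the first few kilobytes rather than the whole file, but the idea is the same: text is recognised by the absence of non-text bytes, not by a header.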


Adam2016: "I also notice that when I open a normal .txt file in a hex editor (wxHexEditor), there is absolutely no metadata: the first byte of the file is just text, and again there are no trailers, it's just ASCII text data. So how do text editors such as nano, vim and other text editors open this type of text file when there is no info that even identifies them as one (text file)?"


Most modern OS's have done a lot of work to kind of hide from the users what file formats really are; most users expect files to act a certain way without wanting to know how it's done.

Much of the work is done trusting in the file extensions themselves. A txt file is expected to have no header and no added tail (which is why you can use Notepad on Windows to write html files and trust that no extra data will be added that will mess up its loading by a browser).

A png file is binary data with a header that describes how to access that data, and the magic number is required to identify the format. All of this is expected to be accessed in binary mode as described by jonnin.

For the most part formatting is an attempt to make the data OS independent/shareable, and to help figure out what the file actually should do, because you can trust that somewhere out there is a user who just modified that extension thinking they are actually changing an mp3 file into an mkv (because again, Windows and other modern OS's hide what data actually is, and the user doesn't always understand or care about what hides inside).

Vim is not exactly a smart program. It has many ways to extend it and is extremely useful, but if you open a png file with vim it will try to read all available text. It will generally look like garbage because it's binary data rather than text data. But you will see the letters PNG near the top, since three bytes of the magic number are the ASCII codes for P, N and G.

If you want to represent the data in a png as hex using vim, then check out the answer here: https://vi.stackexchange.com/questions/2232/how-can-i-use-vim-as-a-hex-editor
That won't actually be very helpful because a png usually is compressed as well, and the extraction method is as described in the aforesaid header for each png. And as jonnin points out, hex isn't quite useful in this instance anyway. Things get even more fun when you try to read a pdf by the way, since chunks of data in a pdf can be under different forms of compression and encryption, and the chunks aren't necessarily in the same order as they are displayed on screen.

There is a program available on Linux called "file". Try reading the man page for file, since it describes how it works, and it sounds like it is just about what you are trying to create...

man file
file Picture.png


Here's the official git repo for the "file" program: https://github.com/file/file

This is a good route to study; it's very helpful to be able to ensure that the file you are reading is in fact in the format that you are expecting.
Good luck.



P.S.
Here's someone's example of writing a simple bmp without external libraries https://stackoverflow.com/questions/2654480/writing-bmp-image-in-pure-c-c-without-other-libraries/47785639#47785639


"The good thing about standards is that there are so many to choose from." - Andrew S. Tanenbaum
@doug4,
Hehe, that's only if it's an actual PNG file. If you run this command in a Bash terminal:
$ echo "This is a PNG file" >test.png

it will create a file with the extension .png, but the file will not actually contain those eight signature bytes. If you open it with Vim, you will only see that line of text, nothing that looks like PNG data. And if you try to open it with a standard image viewing application, like Preview on a Macintosh, it won't open it because it's not actually an image, it's text.

I believe some systems aren't fooled by the extension at the end of a filename, but I'm not sure which ones. Mine automatically tries to open, say, a .png file with Preview, but it doesn't work because Preview doesn't recognize the format.

Edit:
Removed the pipe in the shell command, pointed out by @uplime. Thanks, @uplime!

If you run this command in a Bash terminal:


Bash is a shell, not a terminal :p

And if you try to open it with a standard image viewing application, like Preview on a Macintosh, it won't open it because it's not actually an image, it's text.


Actually, it's not text either, since that command creates an empty file called "test.png". You would need to remove the pipe operator:

~/tmp 🍀 ll
~/tmp 🍀 /bin/bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin19)
Copyright (C) 2007 Free Software Foundation, Inc.
~/tmp 🍀 /bin/bash -c 'echo "This is a PNG file" | >test.png'
~/tmp 🍀 ll
total 0
-rw-r--r--  1 nickchambers  admin  0 Mar 15 06:15 test.png
~/tmp 🍀 rm test.png
~/tmp 🍀 /usr/local/bin/bash --version
GNU bash, version 5.0.16(1)-release (x86_64-apple-darwin19.3.0)
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
~/tmp 🍀 /usr/local/bin/bash -c 'echo "This is a PNG file" | >test.png'
~/tmp 🍀 ll
total 0
-rw-r--r--  1 nickchambers  admin  0 Mar 15 06:15 test.png
~/tmp 🍀 /usr/local/bin/bash -c 'echo "This is a PNG file" >test.png'
~/tmp 🍀 ll
total 8
-rw-r--r--  1 nickchambers  admin  19 Mar 15 06:16 test.png
@uplime,
Oops, my bad. You're right, it's Bash shell, not terminal. The application I use is called Terminal which is probably what I meant to say. Actually it might be Unix, not Bash? I'm not sure. It looks like this:

Last login: Mon Mar 15 11:24:24 on console
prandtl:~ agentmax$ cd /tmp
prandtl:tmp agentmax$ /bin/bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin17)
Copyright (C) 2007 Free Software Foundation, Inc.
prandtl:tmp agentmax$ 


And thanks for pointing out the pipe. I copied and pasted the command from a shell script which I've been using, and I apparently forgot to remove the pipe. I will edit my post and fix that.

Just out of curiosity, what shell are you using there? It doesn't look like anything I've ever seen before; mainly the four leaf clover...
@agent max:
It is possible to customize the prompt in shells.
For bash: https://wiki.archlinux.org/index.php/Bash/Prompt_customization

Another example: if you create a Python virtual environment and activate it, that probably modifies your prompt too (to hint that you are in that special environment).
The snippet you showed looks like your terminal launching a shell (and perhaps a login process before that as well). I use Bash for my shell, but have code to change my prompt in my .bashrc to change it based on the holiday: https://github.com/Lime-Farms/Andromeda/blob/master/roles/init/files/etc/skel/.bashrc#L69-L99
@keskiverto,
Ah. Ok, that makes sense. I thought it was a weird Linux command-line or something like that. Or something else I'm not familiar with (as in anything other than Unix).

I don't use Python virtual environments, mainly because I don't know Python...lol...but I do use the vim text editor and the lldb debugger, which both modify the prompt (I'm pretty sure).

lldb:
prandtl:Desktop agentmax$ lldb a.out
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) run
Process 15129 launched: '/Users/agentmax/Desktop/a.out' (x86_64)
This is a test
Process 15129 exited with status = 0 (0x00000000) 
(lldb) quit


@uplime,
Wow...ok, that is cool. I won't pretend to understand it at all, but it looks "legit," like the kids these days would say.

Using lldb, clang, vim, and making/changing directories is about the limit of my knowledge of Unix. Not like I need any more than that to compile/run/edit C++ programs ;) but I do know a guy at work who actually does all of his programming (C++, scripting, HTML, CSS, etc) from a Bash shell!
This is getting a bit offtopic, but to be honest I'm a bit anti-shell anyways. I don't think you should do much from it, but I'm a hypocrite and almost exclusively work from a shell (but I'm also a sysadmin, not a developer)
@agent max,

Hehe, that's only if it's an actual PNG file. If you run this command...


Of course. Writing to a file and naming it ".png" does not make it a PNG file--I never meant to imply that it did.

Additionally, opening a file and writing the PNG signature and additional random text into it does not make it a PNG file either.

My post was directed at @jonnin's post in which he said
Now, all that aside: every file format is different. There are no 'magic numbers' -- these things you see are just integers or doubles or whatever values that mean something...


A file that is in correct PNG format contains a magic number, also called a signature. The OP stated that "all" files have a signature or magic number, then proceeded to post an incorrect signature for PNG. I was just trying to clarify this issue.

These are rules about file signatures (magic numbers).
A signature applies to a specific file type.
Not all file types have an associated signature (like raw text files, frequently named .txt)
For file types with signatures

- If a file does not contain the signature, it IS NOT of that type

- If a file does contain the signature, this is an indication that the file MIGHT BE of that type

--- A file IS OF THE TYPE only if it is formatted according to the type's specification, and that includes the signature.

- Using file name extensions to determine whether a file is of a particular type is a short-cut that some programs use, but it is incomplete.

All files (even those with signatures) can be treated as "raw" files, allowing them to be edited/viewed in text editors or hex editors, etc.
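The rules above could be sketched in C++ roughly like this. The table is a tiny hand-picked sample and the names (Signature, known_signatures, guess_type) are hypothetical; the file program mentioned earlier knows far more formats and does far more checking.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Signature {
    std::string type;   // label reported to the caller
    std::string bytes;  // exact bytes expected at offset 0
};

// A small illustrative table of well-known signatures.
const std::vector<Signature>& known_signatures() {
    static const std::vector<Signature> table = {
        {"PNG", std::string("\x89" "PNG\r\n\x1a\n", 8)},
        {"GIF", "GIF8"},  // 47 49 46 38, the bytes quoted in the first post
        {"PDF", "%PDF-"},
        {"ZIP", std::string("PK\x03\x04", 4)},
    };
    return table;
}

// Returns the type whose signature matches the start of head, or an empty
// string if none matches. A match only means the data MIGHT BE of that type;
// confirming it requires parsing the rest according to the specification.
std::string guess_type(const std::string& head) {
    for (const Signature& sig : known_signatures()) {
        if (head.compare(0, sig.bytes.size(), sig.bytes) == 0) return sig.type;
    }
    return "";
}
```

Feeding it the first few bytes of the test.png created with echo earlier in the thread returns an empty string, whatever the extension says. A buffer beginning with 47 49 46 38 comes back as GIF, not PNG.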
@uplime,
I do a lot in shells too, except for editing code, I do that in TextMate. But a guy I know at work does all of his programming stuff through a shell. But I think he's also a bit weird...(and I wish I could do that too).

@doug4,
Quite right. Some systems (like mine) only look at the filename extension and try to open it in an application designed to open files with that extension (like Word opening a .doc file). But if the file contains some weird binary code, without the "magic number" and associated formatting, then it usually gives you some kind of message like:

This file contains unknown characters. They may be read incorrectly.  Proceed?


@agent max:
lldb does not modify shell's prompt. It has its own prompt. Normal I/O.
Ah. I didn't know there was a difference, but thank you for correcting me. I defer to your superior knowledge, as I am but a novice in el arte de programar. (I forget what that quote is from, and I probably butchered it).
@agent max,

I forget what that quote is from, and I probably butchered it


Were you possibly trying to remember this quote from The Princess Bride?

Truly you have a dizzying intellect.
doug4,
Yes, that must have been it. Although I've watched that movie enough times that I should have known ;)
Wtf...why are you posting that in a C++ forum?

Edit:
I was replying to a post that seems to have been removed. Sorry if I confused anyone.