Is this code machine code?

And this is why we have C.
I have professionally written disassemblers for over a dozen different processor families (IA32, PPC, 68K/ColdFire, ARM, MIPS, StarCore, Sparc, ...), mostly for use with debuggers and simulators.

First, a definition: a disassembler is a program/process which takes a program image (typically an executable file or a memory image) and converts it to assembly instructions, much like what would be given to an assembler. In some cases the disassembler output could be re-assembled, but not always.
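
To make the definition concrete, here is a minimal C++ sketch of the core loop, using a made-up two-instruction ISA (the opcodes and mnemonics are invented for illustration): fetch an opcode, emit a mnemonic, and advance the program counter by the instruction's length.

#include <cstdint>
#include <cstdio>
#include <vector>

// Toy ISA (invented): 0x01 = "nop"; 0x02 imm8 = "load r1,#imm".
// Anything else is treated as data and dumped as a .byte directive.
void disassemble(const std::vector<std::uint8_t>& image, std::uint32_t base) {
    std::size_t pc = 0;
    while (pc < image.size()) {
        std::uint32_t addr = base + static_cast<std::uint32_t>(pc);
        switch (image[pc]) {
        case 0x01:
            std::printf("%08x: nop\n", addr);
            pc += 1;
            break;
        case 0x02:
            if (pc + 1 >= image.size()) {            // operand runs off the image
                std::printf("%08x: <truncated>\n", addr);
                return;
            }
            std::printf("%08x: load r1,#0x%02x\n", addr, image[pc + 1]);
            pc += 2;
            break;
        default:                                     // unknown bit pattern
            std::printf("%08x: .byte 0x%02x\n", addr, image[pc]);
            pc += 1;
            break;
        }
    }
}

int main() {
    disassemble({0x01, 0x02, 0x2a, 0xff}, 0x1000);   // nop; load r1,#0x2a; .byte 0xff
}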

Writing a simple disassembler is, well, simple. Like most programming projects, it is the cool features added to the disassembler which can make the project difficult. Features such as:
* Handling relocations (an object file has relocations which need to be resolved before disassembling, otherwise all of your jumps may look like they are jumping to address 0).
* Handling symbols ("load $r1,myage" implies a lot more than "load $r1,0x14732").
* Handling values which are split across multiple instructions ("loadhi $r1,0x12345678@hi" and "ori $r1,$r1,0x12345678@lo" is clearer than "loadhi $r1,0x1234" and "ori $r1,$r1,0x5678", especially when 0x12345678 could be replaced with a symbol).
* Avoiding symbols when they aren't useful ("load r1,0(r3)" instead of "load r1,MEMORY_BASE(r3)" for structure pointer dereferencing).
* Simplified instruction mnemonics ("beq label" instead of "bcc cr1,3,1,label").
* Handling invalid instruction bit patterns.
* Recognizing non-instructions (program data, strings, jump tables, etc). This is easier with an object file, but sometimes preceding instructions will indicate where data is. It is even nicer if program data can be displayed in the correct size/format (a 16-bit unsigned value, a 32-bit signed value, a 32-bit floating point value, etc).
* Recognizing instruction set changes (such as between 32-bit ARM instructions and 16-bit THUMB instructions).
* Instruction re-alignment (i.e. detecting the start of an instruction after data in a variable-length instruction set).
* Automatic symbol generation (i.e. create new labels such as ".L1", ".L2", etc. for branch destinations within a function instead of displaying absolute addresses; a sketch follows this list).
* Automatic end-of-function location to let the user disassemble a single function with a single command.
* Automatic start-of-function location (given any address, find the start of the function containing that address; useful for variable-length instruction alignment and for letting users disassemble a function with one command). Obviously this is easier with an object file, but sometimes there is no symbolic information.
* Thunk identification. Sometimes a subroutine is called indirectly through a thunk. When a function calls a thunk, it is nicer to display "call func1$thunk" instead of "call 0x4273980".
* Helper-routine identification. On some architectures, it is common for function prologues and epilogues to call common code for saving registers on the stack. Nicer for the prologue to display "call __save_gpr_23_31" instead of "call 0x47291840".
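
As a sketch of the automatic symbol generation item above, in C++, with the branch destinations assumed to come from a prior decode pass: collect the targets, then number them in address order so the listing reads ".L1", ".L2" instead of raw addresses.

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Assign ".L1", ".L2", ... to branch destinations, numbered in address order.
// The target addresses are assumed inputs from a first decode pass.
std::map<std::uint32_t, std::string> makeLocalLabels(const std::vector<std::uint32_t>& targets) {
    std::map<std::uint32_t, std::string> labels;
    for (std::uint32_t t : targets)
        labels[t];                              // insert once per unique target
    int n = 0;
    for (auto& [addr, name] : labels)
        name = ".L" + std::to_string(++n);      // map iteration is address order
    return labels;
}

int main() {
    auto labels = makeLocalLabels({0x1010, 0x1004, 0x1010});
    for (const auto& [addr, name] : labels)
        std::printf("%08x: %s\n", addr, name.c_str());  // 1004 -> .L1, 1010 -> .L2
}

A std::map keeps the targets sorted by address, so the label numbers follow address order automatically and duplicates collapse for free.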


In actual practice, we had a custom disassembler for each processor family. However, each disassembler was driven by several data tables which could just as well have been implemented as XML files. The data would include information on how to identify an instruction (which bits must be set, and which bits must be clear), the instruction name, the number of bytes, the argument types (and which bits in the instruction are used for each argument), and restrictions (args 1 and 2 must use different registers, args 2 and 3 must use the same base register, the instruction is not supported by these processor variants, this instruction may not appear in delay slots, the next 4 instructions may not reference memory or branch, etc.). Sometimes we would even have data tables describing registers (register name, alternate names, supported processors, etc.; a register could be an "ITLBADDR" on one processor and a "RAMBAR" on another) or even memory locations mapped to I/O registers.
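
For illustration, a table row and the matching decode loop might look something like the C++ sketch below. The entries are loosely modeled on PowerPC-style mask/match encodings, but the exact fields and values are invented for this example, not taken from any real table.

#include <cstdint>
#include <cstdio>
#include <vector>

// One row of a hypothetical instruction table: a word matches an entry when
// (word & mask) == match. Real rows also carry operand fields, instruction
// size, and processor-variant restrictions.
struct InsnEntry {
    std::uint32_t mask;
    std::uint32_t match;
    const char* mnemonic;
};

const char* decode(std::uint32_t word, const std::vector<InsnEntry>& table) {
    for (const auto& e : table)
        if ((word & e.mask) == e.match)
            return e.mnemonic;
    return nullptr;                             // invalid bit pattern
}

int main() {
    // Mask/match values invented for this example.
    std::vector<InsnEntry> table = {
        {0xfc000000, 0x48000000, "b"},          // top 6 bits select the opcode
        {0xfc0007fe, 0x7c000214, "add"},        // opcode plus extended-opcode bits
    };
    const char* m = decode(0x48000010, table);
    std::printf("%s\n", m ? m : "<invalid>");   // prints "b"
}
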
A lot of code went into validating the data. Surprisingly, a typo in a data table could lead to one bit pattern matching multiple instructions, although in some cases that is intentional, such as having "clear $r1" and "load $r1,#0" both use the same bit pattern (sometimes the documentation would call this out as a simplified mnemonic, sometimes not). I have found a lot of typos in various user manuals over the years. And more code went into optimizing the instruction decoding (converting the instruction table into one or more fast, but large, lookup tables, mainly for simulation execution speed).
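
One such validation check is cheap to express: assuming each entry's match bits lie within its mask, two table entries can both match some word exactly when they agree on every bit that both masks test. Flagging those pairs catches typos early, and the intentional aliases (simplified mnemonics) can then be whitelisted. A minimal C++ sketch, with invented 16-bit encodings:

#include <cstdint>
#include <cstdio>
#include <vector>

struct InsnEntry {
    std::uint32_t mask;
    std::uint32_t match;
    const char* mnemonic;
};

// Assuming each entry's match bits lie within its mask, two entries can both
// match some word exactly when they agree on every bit both masks test.
bool overlaps(const InsnEntry& a, const InsnEntry& b) {
    std::uint32_t common = a.mask & b.mask;
    return (a.match & common) == (b.match & common);
}

void checkTable(const std::vector<InsnEntry>& table) {
    for (std::size_t i = 0; i < table.size(); ++i)
        for (std::size_t j = i + 1; j < table.size(); ++j)
            if (overlaps(table[i], table[j]))
                std::printf("warning: '%s' and '%s' share a bit pattern\n",
                            table[i].mnemonic, table[j].mnemonic);
}

int main() {
    // Invented encodings: an intentional alias pair, as some manuals document.
    std::vector<InsnEntry> table = {
        {0xff00, 0x4100, "clear $r1"},
        {0xff00, 0x4100, "load $r1,#0"},
    };
    checkTable(table);                          // prints one warning
}
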
@Hamsterman,
I guess you're right, that is more complex than I thought.

I thought it was one byte that told the CPU the size of the operands (byte, word, dword, qword or oword), one byte for the instruction (including how many parameters there are and what they are, e.g., reg->reg, reg->mem, etc.), and n * size for the parameters.