From NetSec
<m4> here
<m4> yes
<m4> hello
<m4> hi
<m4> ok
<m4> so
<m4> before I begin
<m4> most of what i'm saying here can be found for reference at the wiki article
<m4> it's still a work in progress
<m4> anyway
<m4> so
<m4> basically
<m4> when we talk about "machine code" or "binary code"
<m4> what we're talking about are opcodes
<m4> which are numerical instructions
<m4> interpreted by the processor
<m4> now these can be, and usually are, represented in hexadecimal form for convenience
<m4> e.g. "0xcd"
<m4> now
<m4> as a result of "0xcd 0x80" not exactly being easy to read
<m4> we have assembly
<m4> which is very low-level
<m4> in essence, all assembly actually is
<m4> is a mnemonic system for opcodes
<m4> so instead of saying "0xcd 0x80" we can just say "int $0x80"
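To make the "mnemonic for opcodes" idea concrete, here is a minimal Python sketch (not part of the class). The table `OPCODES` and the helpers `assemble` and `as_hex` are hypothetical names made up for illustration, and the table holds only the one encoding quoted above.

```python
# Toy "assembler": a lookup table from mnemonic text to raw opcode bytes.
# Only the encoding quoted in the class (int $0x80 -> 0xcd 0x80) is included.
OPCODES = {
    "int $0x80": bytes([0xCD, 0x80]),
}

def assemble(mnemonic: str) -> bytes:
    """Return the machine-code bytes for a known mnemonic."""
    return OPCODES[mnemonic]

def as_hex(code: bytes) -> str:
    """Format bytes the way the class writes them, e.g. '0xcd 0x80'."""
    return " ".join(f"0x{b:02x}" for b in code)

print(as_hex(assemble("int $0x80")))  # prints: 0xcd 0x80
```

A real assembler does the same kind of translation, only with a far larger table and with operands encoded into the bytes as well.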
<m4> kso
<m4> when you are talking about assembly code
<m4> there's 2 ways it can vary
<m4> one is architecture
<m4> the other is syntax
<m4> as you all no doubt know, you've got 64-bit and 32-bit systems
<m4> what this refers to is the size of the register (which is like a native, tiny, fixed variable)
<m4> 64-bit systems have 8-byte (64-bit) registers
<m4> whereas 32-bit systems have 4-byte registers
<m4> again,
<m4> a register is basically a location inside the processor where a small amount of data can be stored
* lighthouse has quit (client exited: Leaving.)
<m4> it's similar to a variable
<m4> but
<m4> unlike a variable,
<m4> they are fixed in size, have a limited number, are native, and also many of them have specific purposes
<m4> other consideration is syntax
<m4> there are two syntaxes that are widely used
<m4> AT&T syntax and Intel syntax
<m4> neither of them are "right" or "wrong"
<m4> keeping in mind that assembly is a big mnemonic for machine code, they are two ways of representing it
<m4> AT&T afaik tends to be used more for linux
<m4> and Intel for windows
<m4> the annoying thing about the two syntaxes
<m4> is that the way they represent an operation is completely opposite to each other
<m4> if you want to move the value 8 into the register eax
<m4> AT&T:
<m4> movl $8, %eax
<m4> Intel:
<m4> mov eax, 8h
<m4> AT&T uses the format (source, destination) whereas Intel uses the format (destination, source)
<m4> which can be annoying
<m4> i'll be using AT&T because I'm familiar with it, but remember that neither of them are wrong or right, and both are widely used
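The operand-order difference can be sketched in a few lines of Python. This is only an illustration under narrow assumptions: the hypothetical helper `att_mov_to_intel` handles a `mov` with a plain register or immediate operand on each side, and nothing else (no memory operands, no other instructions).

```python
# Convert a simple AT&T-style mov into Intel-style operand order.
# Assumption: operands are only immediates ($8) or registers (%eax).
SIZE_SUFFIXED_MOVS = ("movb", "movw", "movl", "movq")

def att_mov_to_intel(line: str) -> str:
    mnemonic, operands = line.split(None, 1)
    source, destination = [op.strip() for op in operands.split(",")]
    if mnemonic in SIZE_SUFFIXED_MOVS:
        mnemonic = "mov"  # Intel syntax has no size suffix on the mnemonic
    # Drop the AT&T sigils: '$' marks an immediate, '%' marks a register.
    source = source.lstrip("$%")
    destination = destination.lstrip("$%")
    # Intel order is (destination, source), the reverse of AT&T.
    return f"{mnemonic} {destination}, {source}"

print(att_mov_to_intel("movl $8, %eax"))  # prints: mov eax, 8
```

The output writes the immediate as `8` rather than `8h`; both denote the same value.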
<m4> so
<m4> more detail about registers
<m4> as i said
<m4> the number of registers is fixed.
<m4> furthermore, many of them have specific purposes
<m4> so
<m4> you've got your general purpose registers
<m4> these are eax, ebx, ecx, and edx
<m4> these are "free use", you can basically use them as small variables
<m4> each of them is 4 bytes since they're the 32-bit registers
<m4> i'll get onto 64-bit in a sec
<m4> the general purpose registers you can pretty much do whatever you want with
<m4> they're also used to hold arguments for system calls
* ackit has joined #CSIII
<m4> a system call is basically a function that is native to your OS
<m4> so, if I wanted to call the exit system call
<m4> i would move the value 1 into eax, since 1=exit
<m4> i would move an error code from 0 to 255 into ebx
<m4> and i would call the kernel interrupt
<m4> with int $0x80
<m4> so you see that the general purpose registers can hold arguments for a system call
<m4> however
<m4> obviously
<m4> you don't always want to work with an entire 4-byte register
<m4> each register is actually split up into sub-registers
<m4> for example,
<m4> eax is 4 bytes. but the least significant 2 bytes of eax form their own register, called ax
<m4> for those of you who don't know, "least significant" means "right-most" when the value is written out as a number
<m4> how those bytes are actually ordered in memory is a separate matter; you can look up little-endian for more info on that if you don't know
<m4> furthermore
<m4> ax is split further
<m4> the most significant byte of ax is a subregister called ah
<m4> and the least significant byte of ax is a subregister called al
<m4> both are 1 byte in size
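The relationship between eax, ax, ah and al can be modelled with bit masks. The Python below is a rough sketch rather than real register access; `sub_registers` and `write_al` are hypothetical helper names, and the value 0x12345678 is just an arbitrary example.

```python
def sub_registers(eax: int) -> dict:
    """Split a 32-bit eax value into the sub-registers described above."""
    ax = eax & 0xFFFF                 # least significant 2 bytes of eax
    return {
        "ax": ax,
        "ah": (ax >> 8) & 0xFF,       # most significant byte of ax
        "al": ax & 0xFF,              # least significant byte of ax
    }

def write_al(eax: int, value: int) -> int:
    """Model a write to al: only the lowest byte of eax changes."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)

regs = sub_registers(0x12345678)
print({name: hex(v) for name, v in regs.items()})
# prints: {'ax': '0x5678', 'ah': '0x56', 'al': '0x78'}
print(hex(write_al(0x12345678, 0x01)))  # prints: 0x12345601
```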
<m4> being able to refer to only one byte is particularly useful when designing shellcode
<m4> because
<m4> writing to a whole 4-byte register
<m4> can create nulls
<m4> and nulls are bad because they terminate strings
<m4> so instead of copying to the whole register with "movl $1, %eax"
<m4> you can just copy to the lowest byte
<m4> "movb $1, %al"
<m4> note that the ending of the mov instruction changed there
<m4> in the AT&T syntax, you specify what size data you're working with using a suffix to your instruction
<m4> -b = byte
<m4> -w = 2-byte word
<m4> -l = 4-byte long
<m4> -q = 8-byte quadword, that's only used with 64-bit for obvious reasons
<m4> no suffix = the assembler works the size out from the register operand, if there is one
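The point about nulls is easier to see from the encoded bytes of "movl $1, %eax" versus "movb $1, %al". The sketch below builds the two encodings by hand in Python; the helper names are hypothetical, and the opcodes used (0xb8 for a 4-byte immediate into eax, 0xb0 for a 1-byte immediate into al) are the standard x86 ones, stated here as an addition to what the class covers.

```python
import struct

def encode_mov_eax_imm32(value: int) -> bytes:
    """movl $value, %eax: opcode 0xb8, then a 4-byte little-endian immediate."""
    return b"\xb8" + struct.pack("<I", value)

def encode_mov_al_imm8(value: int) -> bytes:
    """movb $value, %al: opcode 0xb0, then a 1-byte immediate."""
    return b"\xb0" + struct.pack("<B", value)

def show(code: bytes) -> str:
    return " ".join(f"{b:02x}" for b in code)

full = encode_mov_eax_imm32(1)
low = encode_mov_al_imm8(1)
print(show(full))                        # prints: b8 01 00 00 00
print(show(low))                         # prints: b0 01
print(b"\x00" in full, b"\x00" in low)   # prints: True False
```

The 4-byte immediate pads the value 1 with three zero bytes, which is where the nulls come from. Note that writing only al leaves the upper three bytes of eax untouched, so the rest of the register has to be cleared some other way if it matters.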
<m4> NOW
<m4> with 64-bit registers
<m4> what they basically do is
<m4> make everything one step bigger
<m4> so, with 32-bit eax is the biggest register and it has a bunch of subregisters
<m4> with 64-bit, eax IS a subregister
<m4> so you've got this 8-byte register
<m4> called rax
<m4> the least significant 4 bytes of rax
<m4> are called eax
<m4> the least significant 2 bytes of eax 
<m4> are called ax
<m4> and so on
<m4> so. that's a register
<moot[GAR]> Least significant?
<m4> right-most
<moot[GAR]> Cool
<moot[GAR]> Just making sure
<m4> sec
<m4> see:
<m4> ok so as for the different types of registers
<m4> as I said, the number of them is fixed
<m4> so eax ebx ecx edx are the general purpose
<m4> (or, for 64-bit, rax rbx rcx rdx)
<m4> then you've got some others
<m4> which have specific functions.
<m4> esi and edi
<m4> are the source index and destination index
<m4> they're used for efficiently copying data
<m4> eip is very important; it's the instruction pointer
<m4> that means
<m4> that eip holds the memory address of the next instruction to be executed
<m4> you deal with eip with Buffer Overflows etc.
<m4> esp is the stack pointer - it points to the top of the stack
<m4> whenever you push or pop data to or from the stack (pushing and popping is basically adding and removing data from the stack, they are instructions), the value of esp is either added to or subtracted from to reflect the new top of the stack
<m4> so
<m4> say i do this
<m4> pushl %eax
<m4> i've pushed 4 bytes onto the stack
<m4> so 4 bytes will be subtracted from esp to reflect the new top of the stack
<m4> you don't have to do this manually, it's part of the push instruction
<m4> the reason that adding new data to the stack causes the value of esp to decrease
<m4> is because the stack grows downwards
<m4> so the bottom of the stack is at the highest memory location the stack uses
<m4> and the top is at the lowest
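A downward-growing stack can be modelled in a few lines of Python. This is a toy model, not how memory is really accessed; the class name `Stack` and the starting address 0x1000 are arbitrary assumptions chosen for the example.

```python
class Stack:
    """Toy 32-bit stack: esp moves down on push and up on pop."""

    def __init__(self, bottom: int = 0x1000):
        self.esp = bottom      # stack pointer starts at the bottom (highest address)
        self.memory = {}       # address -> 4-byte value

    def pushl(self, value: int) -> None:
        self.esp -= 4                          # make room: the stack grows downwards
        self.memory[self.esp] = value & 0xFFFFFFFF

    def popl(self) -> int:
        value = self.memory.pop(self.esp)      # read the value at the top
        self.esp += 4                          # the top moves back up
        return value

stack = Stack()
stack.pushl(0xDEADBEEF)
print(hex(stack.esp))     # prints: 0xffc
print(hex(stack.popl()))  # prints: 0xdeadbeef
print(hex(stack.esp))     # prints: 0x1000
```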
<m4> for those of you who don't know, the stack is a data structure used in programming for a variety of reasons, notably for function handling
<m4> you can check wikipedia for that
<m4> final register
<m4> ebp
<m4> base pointer
<m4> it's used in the c calling convention (the method that C uses to execute functions) in order to store the value of esp before the stack was altered, so as to be able to restore it when you exit the function and return to the main execution
<m4> right so
<m4> quick example
<m4> so
<m4> let's say I wish to write a very basic program in assembly
<m4> one thing I haven't mentioned is the use of sections
<m4> sections or segments are structures used by the assembler that tell it which parts of your code are actual code to be executed
<m4> and which parts contain other stuff
<m4> there are various segments or sections
<m4> but the two main ones you look at are the "text" and "data" sections
<m4> so if I wanted to write some simple code that calls the exit system call
<m4> which, as I said earlier, is like a function built into your OS
<m4> note that this is linux specific, as windows handles this differently
<m4> so you'd do
<m4> .section .text
<m4> #that defines that this section contains code to be executed
<m4> .globl _start
<m4> #this defines the _start global symbol. the linker recognises _start as the default place to begin executing
<m4> _start:
<m4> movl $1, %eax
<m4> #move 1 into eax - "exit" is system call 1. you can look up a list of syscalls for linux online, there are various ones
<m4> movl $255, %ebx
<m4> #move 255 into ebx, which holds the first argument to the system call; for exit, this is the error code it will exit with
<m4> int $0x80
<m4> #interrupt, using the interrupt instruction with hexadecimal value 80 will initiate a kernel interrupt on linux
<m4> so what this would do
<m4> is fuck all basically
<m4> because it's incredibly simple
<m4> but it'd exit with the error code 255
<m4> note that we use movl
<m4> because we're working with eax, which is 4 bytes
<m4> so we need to use the 4-byte word suffix
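<m4> if we had used movb or movw or movq
<m4> the assembler would have thrown an error, because the suffix wouldn't match the size of eax (plain mov with no suffix would also work here, since the size can be worked out from %eax)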
<m4> so, once you have written a program, it needs to be assembled and linked
<m4> assembling and linking sounds scary, but in reality it's exactly what your compiler does for a higher-level language
<m4> the higher-level language simply adds another layer of functionality (in the form of libraries) and readability to it
<m4> at the end of the day, it's all a matter of interpreting a complex mnemonic and turning it into machine code
<m4> so when you write a program in C, for example
<m4> what's happening behind the scenes is this
<m4> the compiler takes your code
<m4> and it interprets it according to the C syntax
<m4> and converts it into assembly, which is then assembled into an object file
<m4> the linker then links these object files with any libraries you reference - such as libc - which are referred to as "shared objects"
<m4> hence why libc ends with the extension .so
<m4> when you're coding in pure assembly, you don't use a compiler
<m4> instead you assemble and link your code separately
<m4> a common assembler and linker on most linux machines are as and ld
<m4> so, for example
<m4> as exit.s -o exit.o
<m4> what that does
<m4> is it takes your assembly code
<m4> it translates the mnemonics into machine code, leaving any references to symbols it can't resolve yet for the linker to fill in
<m4> and outputs an object file
<foo> use nasm much?
<m4> I tried it once
<foo> I may be dating my assembly knowledge with that question :-)
<m4> :>
<m4> I didn't like it :L
<foo> k, sorry to interrupt. great class so far. thank you. 
<m4> what your linker does is, it links your object files together and returns an executable binary
<m4> if your program is incredibly simple
<m4> such as this one, which literally only uses system calls native to the OS
<m4> it may not even link it _with_ any other object files
<m4> for that program we just assembled
<m4> all you would need to do is
<m4> ld exit.o -o exit
<m4> and it would create an executable, which would execute our code and make the exit syscall when executed
<m4> however
<m4> in many cases, you also wish to link libraries
<m4> a compiler does this automatically - after it converts your C code into assembly, it then proceeds to link it with standard libraries
<m4> such as libc
<m4> as well as any libraries you include
<m4> so if you #include <string.h>
<m4> the functions it declares live in libc, so the relevant shared object gets linked too
<m4> obviously, when using ld you have to do this yourself
<m4> if your assembly contains a reference to a function in libc
<m4> but has no way of accessing the libc library
<m4> it ain't gonna work
<m4> so
<m4> in that case
<m4> you might end up calling ld with something like
<m4> ld -dynamic-linker /lib/ -o file file.o -lc
<m4> if you intend to link your source code with libc, for example
<m4> you could then call functions from libc
<m4> e.g.:
<m4> .section .data
<m4> str: #str is a label 
<m4> .ascii "hello, world!\n\0"
<m4> .section .text
<m4> .globl _start
<m4> _start:
<m4> push $str
<m4> call printf
<m4> #exit syscall goes here, won't bother writing it out
<m4> ok
<m4> so
<m4> i'm running out of time
<m4> so a couple other notes on things you'd use
<m4> as you may have noticed
<m4> your code can include labels
<m4> a label is similar to a strange amalgamation of a function and those weird GOTO statements in BASIC
<m4> in truth, it's really just a pointer in memory
<m4> so if I want to use a string
<m4> i can create a label
<m4> .section .data
<m4> str:
<m4> .ascii "hello world\n\0"
<m4> then later on
<m4> reference $str
<m4> and it will read from that label str onwards
<m4> as you can see
<m4> there are several datatypes you can define in the .data section
<m4> .byte is a value from 0 to 255
<m4> .short takes up 2 storage locations and is a value from 0 to 65535
<m4> .int and .long each take up 4 storage locations and are a value from 0 to 4294967295
<m4> .ascii takes up 1 storage location per character, including newlines and null terminators
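The sizes and ranges of the data directives can be checked with a little arithmetic. The Python sketch below assumes GNU as on x86, where `.byte`, `.short` and `.long` are 1, 2 and 4 bytes wide; the table `DIRECTIVES` and the helper `size_and_max` are hypothetical names used only for this example.

```python
import struct

# Assumed widths for GNU as on x86, expressed as struct format codes.
DIRECTIVES = {".byte": "<B", ".short": "<H", ".long": "<I"}

def size_and_max(directive: str) -> tuple:
    """Return (size in bytes, largest unsigned value) for a directive."""
    size = struct.calcsize(DIRECTIVES[directive])
    return size, 2 ** (8 * size) - 1

for name in DIRECTIVES:
    print(name, size_and_max(name))
# prints:
# .byte (1, 255)
# .short (2, 65535)
# .long (4, 4294967295)

# .ascii uses one byte per character, including the newline and the null.
print(len(b"hello world\n\0"))  # prints: 13
```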
<m4> ok
<m4> so
<m4> i won't really have time to explain the c calling convention in great detail
<m4> but i mentioned it earlier
<m4> so i will give an overview of what it is
<m4> when you call a function in C
<m4> you're displacing the execution of your program
<m4> this is where the stack comes in
<m4> if halfway through main()
<m4> i decided to call printf("%s", str)
<m4> the way this function call actually gets translated when it all turns into assembly
<m4> is:
<m4> push a pointer to the value contained in variable "str" to the stack
<m4> push a pointer to the format string "%s" to the stack
<m4> call the function printf
<m4> this is the c calling convention
<m4> when you call a function
<m4> you actually transfer execution to a function, which acts as a special type of label
<m4> the instruction for this is the aptly-named "call" instruction
<m4> as seen earlier
<m4> so if I have "call printf"
<m4> in some assembly code
<m4> what that's actually doing is
<m4> changing the value of eip to point to the start of the function symbol "printf"
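<m4> the call instruction does nothing more than that, apart from first pushing the return address (the address of the instruction after the call) onto the stack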
<m4> it's the function's responsibility to make sure it returns execution back to main() when it's done
<m4> and it's also the function's responsibility to ensure that any changes made to the stack while the function is going are restored when the function restores execution to main()
<m4> otherwise, it could fuck everything up
<m4> and that's where ebp comes in handy
<m4> under the C calling convention, the first thing a function does once it's called
<m4> is push the value of ebp to the stack
<m4> (so it's not lost forever)
<m4> then copy the value of esp into ebp
<m4> that way, no matter what we do to the stack
<m4> once we're done
<m4> we can always just move the value of ebp into esp
<m4> and voila, esp points to the original top of the stack
<m4> then just pop back into ebp, and ebp is restored
<m4> and you're ready to restore execution
<m4> so anyway, no time really to go into the specifics of the c calling convention and how it's done in detail
<m4> but that's the general idea behind what happens when a c-style function is called
<m4> from the assembly perspective
<m4> arguments are pushed in reverse order, execution is transferred, position of stack is 'saved', function executes, stack is restored, execution is restored
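The sequence summarised above can be walked through with a toy model. The Python below is only a sketch of a 32-bit C-style call: `Machine` and `call_function` are hypothetical names, the addresses are arbitrary, and two details are additions not spelled out in the class (the call pushing a return address, and the caller removing its arguments afterwards).

```python
class Machine:
    """Toy CPU state: just esp, ebp and a dictionary standing in for the stack."""

    def __init__(self):
        self.esp = 0x1000   # arbitrary starting top of stack
        self.ebp = 0x2000   # arbitrary frame pointer belonging to the caller
        self.mem = {}

    def push(self, value):
        self.esp -= 4
        self.mem[self.esp] = value

    def pop(self):
        value = self.mem.pop(self.esp)
        self.esp += 4
        return value

def call_function(m, args, return_address, locals_size):
    for arg in reversed(args):      # arguments are pushed in reverse order
        m.push(arg)
    m.push(return_address)          # "call": save where to come back to
    m.push(m.ebp)                   # prologue: save the caller's ebp
    m.ebp = m.esp                   # prologue: remember the top of the stack
    m.esp -= locals_size            # the body may use the stack freely
    first_arg = m.mem[m.ebp + 8]    # first argument sits above saved ebp and return address
    m.esp = m.ebp                   # epilogue: restore the saved top of stack
    m.ebp = m.pop()                 # epilogue: restore the caller's ebp
    returned_to = m.pop()           # "ret": pop the return address
    m.esp += 4 * len(args)          # assumption: the caller discards its arguments
    return first_arg, returned_to

m = Machine()
first, returned_to = call_function(m, ["%s", "str"], 0x8048123, 16)
print(first, hex(returned_to))   # prints: %s 0x8048123
print(hex(m.esp), hex(m.ebp))    # prints: 0x1000 0x2000
```

After the call, esp and ebp hold the values they had before it, which is the whole point of the save and restore steps.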
<m4> so, there you have it
<m4> a basic introduction to the concepts of assembly language and how it interfaces with higher-level languages and machine code
<m4> if you're looking for something more tutorial-ish
<ackit> esp into ebp? (ENTER/LEAVE) inst
<m4> again
<hatter> hm
<m4> check out the wiki page
<m4> even if it is still in progress
<m4> the info is good