Factor out irrelevant footnotes and parentheses
parent
c5f1f8615b
commit
67b25ca038
1 changed files with 91 additions and 99 deletions
|
@ -80,15 +80,15 @@ restored before returning, the function's return address and local variables.
|
||||||
|
|
||||||
On the x86\_64 platform, with which this report is mostly concerned, the
|
On the x86\_64 platform, with which this report is mostly concerned, the
|
||||||
calling convention that is followed is defined in the System V
|
calling convention that is followed is defined in the System V
|
||||||
ABI~\cite{systemVabi} for the Unix-like operating systems (among which Linux).
|
ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux.
|
||||||
Under this calling convention, the first six arguments of a function are passed
|
Under this calling convention, the first six arguments of a function are passed
|
||||||
in the registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8},
|
in the registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8},
|
||||||
\reg{r9}, while additional arguments are pushed onto the stack. It also defines
|
\reg{r9}, while additional arguments are pushed onto the stack. It also defines
|
||||||
which registers may be overwritten by the callee, and which registers must be
|
which registers may be overwritten by the callee, and which registers must be
|
||||||
restored before returning (which most of the time is done by pushing the
|
restored before returning. This restoration, most of the time, is done by
|
||||||
register value onto the stack in the function prelude, and restoring it just
|
pushing the register value onto the stack in the function prelude, and
|
||||||
before returning). Those preserved registers are \reg{rbx}, \reg{rsp},
|
restoring it just before returning. Those preserved registers are \reg{rbx},
|
||||||
\reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
|
\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
|
||||||
|
|
||||||
\begin{wrapfigure}{r}{0.4\textwidth}
|
\begin{wrapfigure}{r}{0.4\textwidth}
|
||||||
\centering
|
\centering
|
||||||
|
@ -98,11 +98,8 @@ before returning). Those preserved registers are \reg{rbx}, \reg{rsp},
|
||||||
\end{wrapfigure}
|
\end{wrapfigure}
|
||||||
|
|
||||||
The register \reg{rsp} is supposed to always point to the last used memory cell
|
The register \reg{rsp} is supposed to always point to the last used memory cell
|
||||||
in the stack, thus, when the process just enters a new function, \reg{rsp}
|
in the stack. Thus, when the process just enters a new function, \reg{rsp}
|
||||||
points right to the location of the return address\footnote{Remember that since
|
points right to the location of the return address. Then, the compiler might
|
||||||
the stack grows \emph{downwards} in memory, the arrow of \reg{rsp} points
|
|
||||||
\emph{below} the RA cell in the figure, and yet the memory cell indexed is the
|
|
||||||
one \emph{above} in the drawing, that is, the RA.}. Then, the compiler might
|
|
||||||
use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing
|
use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing
|
||||||
the old value of \reg{rbp} just below the return address on the stack, then
|
the old value of \reg{rbp} just below the return address on the stack, then
|
||||||
copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
|
copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
|
||||||
|
@ -148,8 +145,8 @@ Left apart analyzing the assembly code produced, there is no way to find where
|
||||||
the return address is stored, relative to \reg{rsp}, at some arbitrary point
|
the return address is stored, relative to \reg{rsp}, at some arbitrary point
|
||||||
of the function. Even when \reg{rbp} is used, there is no easy way to guess
|
of the function. Even when \reg{rbp} is used, there is no easy way to guess
|
||||||
where each callee-saved register is stored in the stack frame, and worse, which
|
where each callee-saved register is stored in the stack frame, and worse, which
|
||||||
callee-saved registers were saved (since it is not necessary to save a register
|
callee-saved registers were saved, since it is optional to save a register
|
||||||
that the function never touches).
|
that the function never touches.
|
||||||
|
|
||||||
With this example, it seems pretty clear that it is often necessary to have
|
With this example, it seems pretty clear that it is often necessary to have
|
||||||
additional data to perform stack unwinding. This data is often stored among the
|
additional data to perform stack unwinding. This data is often stored among the
|
||||||
|
@ -171,11 +168,11 @@ context, by unwinding \lstc{fct_b}'s frame.
|
||||||
|
|
||||||
\medskip
|
\medskip
|
||||||
|
|
||||||
Yet, stack unwinding (and thus debugging data) \emph{is not limited to
|
Yet, stack unwinding, and thus, debugging data, \emph{is not limited to
|
||||||
debugging}.
|
debugging}.
|
||||||
|
|
||||||
Another common usage is profiling. A profiling tool, such as \prog{perf} under
|
Another common usage is profiling. A profiling tool, such as \prog{perf} under
|
||||||
Linux -- see Section~\ref{ssec:perf} --, is used to measure and analyze in
|
Linux --~see Section~\ref{ssec:perf}~--, is used to measure and analyze in
|
||||||
which functions a program spends its time, identify bottlenecks and find out
|
which functions a program spends its time, identify bottlenecks and find out
|
||||||
which parts are critical to optimize. To do so, modern profilers pause the
|
which parts are critical to optimize. To do so, modern profilers pause the
|
||||||
traced program at regular, short intervals, inspect their stack, and determine
|
traced program at regular, short intervals, inspect their stack, and determine
|
||||||
|
@ -202,8 +199,8 @@ trigger the destructors of stack-allocated objects. Furthermore, this is often
|
||||||
undesirable: \lstc{setjmp} has quite a large overhead, which is introduced
|
undesirable: \lstc{setjmp} has quite a large overhead, which is introduced
|
||||||
whenever a \lstc{try} block is encountered. Instead, it is often preferred to
|
whenever a \lstc{try} block is encountered. Instead, it is often preferred to
|
||||||
have strictly no overhead when no exception happens, at the cost of a greater
|
have strictly no overhead when no exception happens, at the cost of a greater
|
||||||
overhead when an exception is actually fired (after all, they are supposed to
|
overhead when an exception is actually fired --~after all, they are supposed to
|
||||||
be \emph{exceptional}). For more details on C++ exception handling,
|
be \emph{exceptional}. For more details on C++ exception handling,
|
||||||
see~\cite{koening1990exception} (especially Section~16.5). Possible
|
see~\cite{koening1990exception} (especially Section~16.5). Possible
|
||||||
implementation mechanisms are also presented in~\cite{dinechin2000exn}.
|
implementation mechanisms are also presented in~\cite{dinechin2000exn}.
|
||||||
|
|
||||||
|
@ -237,8 +234,8 @@ the previous paragraph, in an ELF section originally called
|
||||||
For any binary, debugging information can easily get quite large if no
|
For any binary, debugging information can easily get quite large if no
|
||||||
attention is paid to keeping it as compact as possible. In this matter, DWARF
|
attention is paid to keeping it as compact as possible. In this matter, DWARF
|
||||||
does an excellent job, and everything is stored in a very compact way. This,
|
does an excellent job, and everything is stored in a very compact way. This,
|
||||||
however, as we will see, makes it both difficult to parse correctly (with \eg{}
|
however, as we will see, makes it both difficult to parse correctly and quite
|
||||||
variable-length integers) and quite slow to interpret.
|
slow to interpret.
|
||||||
|
|
||||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
\subsection{DWARF unwinding data}
|
\subsection{DWARF unwinding data}
|
||||||
|
@ -259,19 +256,15 @@ a table, that has a range of IPs on which it has authority. Most often, but not
|
||||||
necessarily, it corresponds to a single function in the original source code.
|
necessarily, it corresponds to a single function in the original source code.
|
||||||
Each column of the table is a register (\eg{} \reg{rsp}), with two additional
|
Each column of the table is a register (\eg{} \reg{rsp}), with two additional
|
||||||
special registers, CFA (Canonical Frame Address) and RA (Return Address),
|
special registers, CFA (Canonical Frame Address) and RA (Return Address),
|
||||||
containing respectively the base pointer of the current stack
|
containing respectively the base pointer of the current stack frame and the
|
||||||
frame\footnote{The CFA is most commonly thought of as the base pointer of the
|
return address of the current function. For instance, on an x86\_64
|
||||||
frame, yet this is not enforced by DWARF\@. The CFA is used as an address from
|
architecture, RA would contain the unwound value of \reg{rip}, the instruction
|
||||||
which other registers will be deduced as offsets, and although it is supposed
|
pointer. Each row has a certain validity interval, on which it describes
|
||||||
to be the actual base pointer, it can be anything as long as it is close enough
|
accurate unwinding data. This range starts at the instruction pointer it is
|
||||||
to the addresses that will be deduced from it.} and the return address of the
|
associated with, and ends at the start IP of the next table row (or the end IP
|
||||||
current function (\ie{} for x86\_64, the unwound value of \reg{rip}, the
|
of the current FDE if it was the last row). In particular, there can be no ``IP
|
||||||
instruction pointer). Each row has a certain validity interval, on which it
|
hole'' within an FDE --~unlike FDEs themselves, which can leave holes between
|
||||||
describes accurate unwinding data. This range starts at the instruction pointer
|
them.
|
||||||
it is associated with, and ends at the start IP of the next table row (or the
|
|
||||||
end IP of the current FDE if it was the last row). In particular, there can be
|
|
||||||
no ``IP hole'' within an FDE --~unlike FDEs themselves, which can leave holes
|
|
||||||
between them.
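
The validity intervals described above amount to a lookup: given an instruction pointer, find the last row whose start IP is not greater than it, provided the IP lies inside the FDE. The following is only a minimal sketch of that lookup, not the DWARF storage format nor the implementation discussed in this report. The types \lstc{fde_row} and \lstc{fde}, their fields, and \lstc{fde_find_row} are hypothetical names introduced for illustration, and each row is reduced to a single CFA offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified table row: a start IP plus one illustrative
 * payload (the offset such that CFA = rsp + cfa_offset). */
struct fde_row {
    uintptr_t start_ip;   /* first IP at which this row is valid */
    int64_t   cfa_offset;
};

/* Hypothetical FDE: rows sorted by start_ip; the FDE has authority over
 * the half-open range [rows[0].start_ip, end_ip). */
struct fde {
    const struct fde_row *rows;
    size_t                n_rows;
    uintptr_t             end_ip;
};

/* Return the row valid at ip, or NULL if ip is outside this FDE.
 * Inside the FDE there is no "IP hole": each row is valid until the
 * next row starts, and the last row is valid until end_ip. */
const struct fde_row *fde_find_row(const struct fde *f, uintptr_t ip)
{
    if (f->n_rows == 0 || ip < f->rows[0].start_ip || ip >= f->end_ip)
        return NULL;

    /* Invariant: rows[lo].start_ip <= ip, and either hi == n_rows
     * or rows[hi].start_ip > ip. */
    size_t lo = 0, hi = f->n_rows;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (f->rows[mid].start_ip <= ip)
            lo = mid;
        else
            hi = mid;
    }
    return &f->rows[lo];
}
```

A \lstc{NULL} result corresponds to an IP falling in a hole between FDEs, in which case another FDE, if any, has to be consulted.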
|
|
||||||
|
|
||||||
\begin{figure}[h]
|
\begin{figure}[h]
|
||||||
\begin{minipage}{0.45\textwidth}
|
\begin{minipage}{0.45\textwidth}
|
||||||
|
@ -329,17 +322,17 @@ how the stack frame is constructed. When interpreting the generated \ehframe{}
|
||||||
with \lstbash{readelf -wF}, we obtain the (slightly edited)
|
with \lstbash{readelf -wF}, we obtain the (slightly edited)
|
||||||
Listing~\ref{lst:ex1_dw}. During the function prelude, \ie{} for $\mhex{615}
|
Listing~\ref{lst:ex1_dw}. During the function prelude, \ie{} for $\mhex{615}
|
||||||
\leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address,
|
\leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address,
|
||||||
thus the CFA is 8 bytes above \reg{rsp} (which was the value of \reg{rsp}
|
thus the CFA is 8 bytes above \reg{rsp}, and the return address is precisely at
|
||||||
before the call, and is the topmost value of used space for this stack frame),
|
\reg{rsp} --~that is, stored between \reg{rsp} and $\reg{rsp} + 8$. Then, the
|
||||||
and the return address is precisely at \reg{rsp} --~that is, stored between
|
contents of \lstc{fibo}, 8 integers of 4 bytes each, are allocated on the
|
||||||
\reg{rsp} and $\reg{rsp} + 8$. Then, 8 integers of 4 bytes each (for
|
stack, which puts the CFA 32 bytes above \reg{rsp}; the return address still
|
||||||
\lstc{fibo}, \lstc{pos} being optimized out) are allocated on the stack, which
|
being 8 bytes below the CFA\@. The variable \lstc{pos} is optimized out in the
|
||||||
puts the CFA 32 bytes above \reg{rsp}, and the return address still 8 bytes
|
generated assembly code, thus no stack space is allocated for it. Yet,
|
||||||
below the CFA\@. Yet, \prog{gcc} decided to allocate a total space of 48 bytes
|
\prog{gcc} decided to allocate a total space of 48 bytes for the stack frame
|
||||||
for the stack frame for memory alignment reasons, which means subtracting 40
|
for memory alignment reasons, which means subtracting 40 bytes from \reg{rsp}
|
||||||
bytes from \reg{rsp} (address $\mhex{615}$ in the assembly). Then, by the end of
|
(address $\mhex{615}$ in the assembly). Then, by the end of the function, the
|
||||||
the function, the local variables are discarded and \reg{rsp} is reset to its
|
local variables are discarded and \reg{rsp} is reset to its value from the
|
||||||
value from the first row.
|
first row.
|
||||||
|
|
||||||
However, DWARF data isn't actually stored as a table in the binary files, but
|
However, DWARF data isn't actually stored as a table in the binary files, but
|
||||||
is instead stored as in Listing~\ref{lst:ex1_dwraw}. The first row has the
|
is instead stored as in Listing~\ref{lst:ex1_dwraw}. The first row has the
|
||||||
|
@ -425,9 +418,9 @@ These are the DWARF instructions used for CFI description, that is, the
|
||||||
instructions that contain the stack unwinding table information. The following
|
instructions that contain the stack unwinding table information. The following
|
||||||
list is an exhaustive list of instructions from the DWARF5
|
list is an exhaustive list of instructions from the DWARF5
|
||||||
specification~\cite{dwarf5std} concerning CFI, with reworded descriptions for
|
specification~\cite{dwarf5std} concerning CFI, with reworded descriptions for
|
||||||
brevity and clarity. All these instructions are listed up to variants (most
|
brevity and clarity. All these instructions are listed up to variants --~most
|
||||||
instructions exist in multiple formats to handle various operands formatting,
|
instructions exist in multiple formats to handle various operands formatting,
|
||||||
to optimize space). Since we won't be talking about the underlying file format
|
to optimize space. Since we won't be talking about the underlying file format
|
||||||
here, those variations between \eg{} \dwcfa{advance\_loc1} and
|
here, those variations between \eg{} \dwcfa{advance\_loc1} and
|
||||||
\dwcfa{advance\_loc2} --~which differ only on the number of bytes of their
|
\dwcfa{advance\_loc2} --~which differ only on the number of bytes of their
|
||||||
operand~-- are irrelevant and will be elided.
|
operand~-- are irrelevant and will be elided.
|
||||||
|
@ -517,10 +510,10 @@ only handled as register identifiers, so we can safely state that $\reg{reg}
|
||||||
|
|
||||||
A value can then be undefined, stored at memory address $x$ or be directly a
|
A value can then be undefined, stored at memory address $x$ or be directly a
|
||||||
value $x$, $x$ being here a simple expression consisting of $\reg{reg} +
|
value $x$, $x$ being here a simple expression consisting of $\reg{reg} +
|
||||||
\textit{offset}$. The CFA is considered a simple register here. For instance, to
|
\textit{offset}$. The CFA is considered a simple register here. For instance,
|
||||||
define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA, we
|
to define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA,
|
||||||
would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$ (for the stack grows
|
we would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack
|
||||||
downwards).
|
grows downwards.
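
To make the three cases concrete (undefined, stored at an address, or directly a value), here is a minimal sketch of how such a rule could be evaluated once the CFA is known. It is an illustration under assumptions, not the semantics formalized in this report: the names \lstc{rule_kind}, \lstc{reg_rule} and \lstc{eval_rule} are hypothetical, and the base register is fixed to the CFA for simplicity.

```c
#include <stdint.h>

/* Hypothetical encoding of the three kinds of register rules. */
enum rule_kind {
    RULE_UNDEFINED, /* the register cannot be recovered */
    RULE_AT_ADDR,   /* the value is stored in memory at CFA + offset */
    RULE_VALUE      /* the value is CFA + offset itself */
};

struct reg_rule {
    enum rule_kind kind;
    int64_t        offset; /* signed offset from the CFA */
};

/* Evaluate a rule relative to a known CFA.
 * Returns 1 and writes the recovered value to *out on success,
 * or returns 0 when the register is undefined. */
int eval_rule(struct reg_rule rule, uintptr_t cfa, uintptr_t *out)
{
    switch (rule.kind) {
    case RULE_AT_ADDR:
        /* Read one machine word from the unwound process's own memory. */
        *out = *(const uintptr_t *)(cfa + (uintptr_t)rule.offset);
        return 1;
    case RULE_VALUE:
        *out = cfa + (uintptr_t)rule.offset;
        return 1;
    default:
        return 0;
    }
}
```

With this encoding, the example above would be written as a rule of kind \lstc{RULE_AT_ADDR} with offset $-16$.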
|
||||||
|
|
||||||
\subsection{Target language: a C function body}
|
\subsection{Target language: a C function body}
|
||||||
|
|
||||||
|
@ -533,10 +526,10 @@ execution stack, or even the heap.
|
||||||
This function takes as arguments an instruction pointer --~supposedly
|
This function takes as arguments an instruction pointer --~supposedly
|
||||||
extracted from $\reg{rip}$~-- and an array of register values; and returns a
|
extracted from $\reg{rip}$~-- and an array of register values; and returns a
|
||||||
fresh array of register values after unwinding this call frame. The function is
|
fresh array of register values after unwinding this call frame. The function is
|
||||||
compositional\footnote{up to technicalities: the IP obtained after unwinding the
|
compositional: it can be called twice in a row to unwind two stack frames,
|
||||||
first frame might be handled in a different dynamically loaded object, and this
|
unless the IP obtained after the first unwinding comes from another shared
|
||||||
would require inspecting the DWARF located in another file}: it can be called
|
object file, for instance a call to \prog{libc}. In this case, unwinding the
|
||||||
twice in a row to unwind two stack frames.
|
second frame will require loading the corresponding DWARF information.
|
||||||
|
|
||||||
The function is the following:
|
The function is the following:
|
||||||
|
|
||||||
|
@ -636,8 +629,8 @@ $F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$.
|
||||||
\semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\
|
\semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\
|
||||||
\end{align*}
|
\end{align*}
|
||||||
|
|
||||||
(The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If
|
The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If
|
||||||
we omit those two operations, we can plainly remove the stack).
|
we omit those two operations, we can plainly remove the stack.
|
||||||
|
|
||||||
|
|
||||||
\subsection{From $\intermedlang$ to C}
|
\subsection{From $\intermedlang$ to C}
|
||||||
|
@ -694,8 +687,9 @@ machine code on the x86\_64 platform.
|
||||||
The rough idea of the compilation is to produce, out of the \ehframe{} section
|
The rough idea of the compilation is to produce, out of the \ehframe{} section
|
||||||
of a binary, C code that resembles the code shown in the DWARF semantics from
|
of a binary, C code that resembles the code shown in the DWARF semantics from
|
||||||
Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
|
Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
|
||||||
\lstbash{-O2} mode\footnote{Compiling in \lstbash{-O3} takes way too much
|
\lstbash{-O2} mode, since it already provides a good level of optimization and
|
||||||
time.}, providing for free all the optimization passes of a modern compiler.
|
compiling in \lstbash{-O3} takes way too much time. This saves us the trouble
|
||||||
|
of optimizing the generated C code whenever GCC does that by itself.
|
||||||
|
|
||||||
The generated code consists of a single monolithic function, \lstc{_eh_elf},
|
The generated code consists of a single monolithic function, \lstc{_eh_elf},
|
||||||
taking as arguments an instruction pointer and a memory context (\ie{} the
|
taking as arguments an instruction pointer and a memory context (\ie{} the
|
||||||
|
@ -715,18 +709,18 @@ return it.
|
||||||
|
|
||||||
A setting of the compiler also optionally enables another parameter to the
|
A setting of the compiler also optionally enables another parameter to the
|
||||||
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
|
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
|
||||||
\lstc{deref} function, when enabled, replaces everywhere the dereferencing
|
\lstc{deref} function, when present, replaces everywhere the dereferencing
|
||||||
\lstc{*} operator, and can be used to generate \ehelfs{} that will work on
|
\lstc{*} operator, and can be used to generate \ehelfs{} that will work on
|
||||||
remote address spaces (\ie{} whenever the unwinding is not done on the process
|
remote address spaces, that is, whenever the unwinding is not done on the
|
||||||
reading the \ehelf{} itself, but some other process, or even on a stack dump of
|
process reading the \ehelf{} itself, but some other process, or even on a stack
|
||||||
a long-terminated process).
|
dump of a long-terminated process.
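
As an illustration of why a \lstc{deref} indirection is enough to support remote address spaces, the sketch below reads a return address through a function pointer, once from the current process and once from a saved stack dump. It only assumes what is stated above, namely that \lstc{deref} replaces the \lstc{*} operator. The exact signature, and the names \lstc{deref_func_t}, \lstc{deref_local}, \lstc{deref_dump} and \lstc{read_return_address}, are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature: read one machine word at address addr
 * in the address space being unwound. */
typedef uintptr_t (*deref_func_t)(uintptr_t addr);

/* Local unwinding: the target memory is our own. */
uintptr_t deref_local(uintptr_t addr)
{
    return *(const uintptr_t *)addr;
}

/* Stack dump of another (possibly dead) process: dump_bytes holds
 * dump_size bytes that were located at address dump_base there. */
static const unsigned char *dump_bytes;
static uintptr_t            dump_base;
static size_t               dump_size;

uintptr_t deref_dump(uintptr_t addr)
{
    uintptr_t word = 0;
    size_t off = (size_t)(addr - dump_base);
    /* Simplification: out-of-range reads yield 0 instead of an error. */
    if (addr < dump_base || dump_size < sizeof word
            || off > dump_size - sizeof word)
        return 0;
    memcpy(&word, dump_bytes + off, sizeof word);
    return word;
}

/* Read the return address stored one word below the CFA, whatever
 * the underlying address space is. */
uintptr_t read_return_address(uintptr_t cfa, deref_func_t deref)
{
    return deref(cfa - sizeof(uintptr_t));
}
```

The unwinding logic itself is identical in both cases; only the function passed as \lstc{deref} changes.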
|
||||||
|
|
||||||
Unlike in the \ehframe, and unlike what should be done in a release,
|
Unlike in the \ehframe, and unlike what should be done in a release,
|
||||||
real-world-proof version of the \ehelfs, the choice was made to keep this
|
real-world-proof version of the \ehelfs, the choice was made to keep this
|
||||||
prototype simple, and only handle the few registers that were needed to simply
|
prototype simple, and only handle the few registers that were needed to simply
|
||||||
unwind the stack. Thus, the only registers handled in \ehelfs{} are \reg{rip},
|
unwind the stack. Thus, the only registers handled in \ehelfs{} are \reg{rip},
|
||||||
\reg{rbp}, \reg{rsp} and \reg{rbx} (the latter being used quite often in
|
\reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used quite often in
|
||||||
\prog{libc} to hold the CFA address). This is enough to unwind the stack, but
|
\prog{libc} to hold the CFA address. This is enough to unwind the stack, but
|
||||||
is not sufficient to analyze every stack frame as \prog{gdb} would do after a
|
is not sufficient to analyze every stack frame as \prog{gdb} would do after a
|
||||||
\lstbash{frame n} command.
|
\lstbash{frame n} command.
|
||||||
|
|
||||||
|
@ -736,10 +730,9 @@ is not sufficient to analyze every stack frame as \prog{gdb} would do after a
|
||||||
In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
|
In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
|
||||||
\lstc{uintptr_t} are the values of the corresponding registers, and
|
\lstc{uintptr_t} are the values of the corresponding registers, and
|
||||||
\lstc{flags} is an 8-bit value, indicating for each register whether it is
|
\lstc{flags} is an 8-bit value, indicating for each register whether it is
|
||||||
present or not in this context (\ie{} if the \lstc{rbx} bit is not set, the
|
present or not in this context, plus an error bit, indicating whether an error
|
||||||
value of \lstc{rbx} in the structure isn't meaningful), plus an error bit,
|
occurred during unwinding. Such errors can be due \eg{} to an unsupported
|
||||||
indicating whether an error occurred during unwinding (which can be due \eg{}
|
operation in the original DWARF\@.
|
||||||
to an unsupported operation in the original DWARF, thus compiled to an error).
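
The role of the presence bits and of the error bit can be sketched as follows. This is not the structure from Listing~\ref{lst:unw_ctx}: the field order, the bit assignments and the names \lstc{unwind_ctx}, \lstc{CTX_HAS_RBX}, \lstc{CTX_ERROR} and \lstc{ctx_get_rbx} are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical bit assignment for the 8-bit flags field. */
enum {
    CTX_HAS_RIP = 1 << 0,
    CTX_HAS_RSP = 1 << 1,
    CTX_HAS_RBP = 1 << 2,
    CTX_HAS_RBX = 1 << 3,
    CTX_ERROR   = 1 << 7  /* an error occurred during unwinding */
};

/* Hypothetical unwind context: one flags byte, then register values. */
struct unwind_ctx {
    uint8_t   flags;
    uintptr_t rip, rsp, rbp, rbx;
};

/* Read rbx only if the context is valid and the rbx bit is set.
 * Returns 1 and writes *out on success, 0 otherwise. */
int ctx_get_rbx(const struct unwind_ctx *ctx, uintptr_t *out)
{
    if ((ctx->flags & CTX_ERROR) || !(ctx->flags & CTX_HAS_RBX))
        return 0;
    *out = ctx->rbx;
    return 1;
}
```

A caller is expected to check the flags before trusting any register value, since a cleared bit means the corresponding field is not meaningful.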
|
|
||||||
|
|
||||||
This generated data is stored in separate shared object files, which we call
|
This generated data is stored in separate shared object files, which we call
|
||||||
\ehelfs. It would have been possible to alter the original ELF file to embed
|
\ehelfs. It would have been possible to alter the original ELF file to embed
|
||||||
|
@ -827,12 +820,12 @@ made in order to shrink the \ehelfs.
|
||||||
The major optimization that most reduced the output size was to use an if/else
|
The major optimization that most reduced the output size was to use an if/else
|
||||||
tree implementing a binary search on the program counter relevant intervals,
|
tree implementing a binary search on the program counter relevant intervals,
|
||||||
instead of a huge switch. In the process, we also \emph{outline} a lot of code,
|
instead of a huge switch. In the process, we also \emph{outline} a lot of code,
|
||||||
that is, find out identical ``switch cases'' bodies (which are not switch cases
|
that is, find out identical ``switch cases'' bodies --~which are not switch
|
||||||
anymore, but if bodies), move them outside of the if/else tree, identify them
|
cases anymore, but if bodies~--, move them outside of the if/else tree,
|
||||||
by a label, and jump to them using a \lstc{goto}, which de-duplicates a lot of
|
identify them by a label, and jump to them using a \lstc{goto}, which
|
||||||
code and contributes greatly to the shrinking. In the process, we noticed that
|
de-duplicates a lot of code and contributes greatly to the shrinking. In the
|
||||||
the vast majority of FDE rows are actually taken among very few ``common'' FDE
|
process, we noticed that the vast majority of FDE rows are actually taken among
|
||||||
rows.
|
very few ``common'' FDE rows.
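
The shape of the code produced by this optimization can be sketched as below: an if/else tree performs the binary search on the instruction pointer, and identical row bodies are outlined behind labels reached with \lstc{goto}. This is a hand-written illustration, not actual generated output. The bounds $\mhex{615}$ and $\mhex{619}$ come from the example above, while the other bounds, the offsets, and the names \lstc{row_result} and \lstc{lookup_row} are made up for the sketch.

```c
#include <stdint.h>

/* Hypothetical, reduced result: whether a row was found, and the
 * offset such that CFA = rsp + cfa_offset. */
struct row_result {
    int     found;
    int64_t cfa_offset;
};

struct row_result lookup_row(uintptr_t ip)
{
    struct row_result res = {0, 0};

    /* If/else tree: binary search on the IP intervals. */
    if (ip < 0x619) {
        if (ip < 0x615)
            goto not_found;
        goto cfa_rsp_8;          /* 0x615 <= ip < 0x619 */
    } else {
        if (ip < 0x659)
            goto cfa_rsp_48;     /* 0x619 <= ip < 0x659 (made-up bound) */
        if (ip < 0x65a)
            goto cfa_rsp_8;      /* same body as the first interval */
        goto not_found;
    }

    /* Outlined bodies: each distinct row is emitted only once. */
cfa_rsp_8:
    res.found = 1;
    res.cfa_offset = 8;
    return res;
cfa_rsp_48:
    res.found = 1;
    res.cfa_offset = 48;
    return res;
not_found:
    return res;
}
```

Here two distinct IP intervals share the \lstc{cfa_rsp_8} body, which is the de-duplication effect described above.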
|
||||||
|
|
||||||
This makes this optimization really efficient, as seen later in
|
This makes this optimization really efficient, as seen later in
|
||||||
Section~\ref{ssec:results_size}, but also makes it an interesting question
|
Section~\ref{ssec:results_size}, but also makes it an interesting question
|
||||||
|
@ -886,13 +879,12 @@ Listing~\ref{lst:ex1_dw}, etc.
|
||||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
\subsection{Presentation of \prog{perf}}\label{ssec:perf}
|
\subsection{Presentation of \prog{perf}}\label{ssec:perf}
|
||||||
|
|
||||||
\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem (actually,
|
\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem, and is
|
||||||
\prog{perf} is developed within the Linux kernel source tree). A profiler is an
|
even developed within the Linux kernel source tree. A profiler is an important
|
||||||
important tool from the developer's toolbox that analyzes the performance of
|
tool from the developer's toolbox that analyzes the performance of programs by
|
||||||
programs by recording the time spent in each function, including within nested
|
recording the time spent in each function, including within nested calls. This
|
||||||
calls. This analysis often enables programmers to optimize critical paths and
|
analysis often enables programmers to optimize critical paths and functions in
|
||||||
functions in their programs, while leaving unoptimized functions that are
|
their programs, while leaving unoptimized functions that are seldom traversed.
|
||||||
seldom traversed.
|
|
||||||
|
|
||||||
For this purpose, the basic idea is to stop the traced program at regular
|
For this purpose, the basic idea is to stop the traced program at regular
|
||||||
intervals, unwind its stack, write down the current nested function calls, and
|
intervals, unwind its stack, write down the current nested function calls, and
|
||||||
|
@ -924,16 +916,16 @@ activity, be linked against \prog{libc} and \prog{pthread}, and be very light.
|
||||||
Interfacing \ehelfs{} with \prog{perf} first required forking
|
Interfacing \ehelfs{} with \prog{perf} first required forking
|
||||||
\prog{libunwind} and implementing \ehelfs{} support for it. In the process, it
|
\prog{libunwind} and implementing \ehelfs{} support for it. In the process, it
|
||||||
turned out necessary to slightly modify \prog{libunwind}'s interface to add a
|
turned out necessary to slightly modify \prog{libunwind}'s interface to add a
|
||||||
parameter to a function, since \prog{libunwind} is made to be agnostic of the
|
parameter to an initialization function, since \prog{libunwind} is made to be
|
||||||
system and process as much as possible, to be able to unwind in any context.
|
agnostic of the system and process as much as possible, to be able to unwind in
|
||||||
This very restricted information lacked a memory map (a table indicating which
|
any context. This very restricted information lacked a \emph{memory map}, a
|
||||||
shared object is mapped at which address in memory) in order to use \ehelfs.
|
table indicating which shared object is mapped at which address in memory, in
|
||||||
Apart from this, the modified version of \prog{libunwind} produced is entirely
|
order to use \ehelfs. Apart from this, the modified version of \prog{libunwind}
|
||||||
compatible with the vanilla version, meaning that the only modifications
|
produced is entirely compatible with the vanilla version. This means that the
|
||||||
required to use \ehelfs{} within any project using \prog{libunwind} should be
|
only modifications required to use \ehelfs{} within any project using
|
||||||
modifying one line of code (this function call, which is a setup function) and
|
\prog{libunwind} should be changing one line of code to add one parameter to a
|
||||||
linking against the modified version of \prog{libunwind} instead of the system
|
function call and linking against the modified version of \prog{libunwind}
|
||||||
version.
|
instead of the system version.
|
||||||
|
|
||||||
Once this was done, plugging it into \prog{perf} was a matter of a few lines of
|
Once this was done, plugging it into \prog{perf} was a matter of a few lines of
|
||||||
code only, apart from the benchmarking code. The major problem encountered was
|
code only, apart from the benchmarking code. The major problem encountered was
|
||||||
|
@ -984,9 +976,9 @@ swapping.
|
||||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
\subsection{Measured time performance}
|
\subsection{Measured time performance}
|
||||||
|
|
||||||
The benchmarking, as described in Section~\ref{ssec:bench_perf}, of \ehelfs{}
|
A benchmarking of \ehelfs{} against the vanilla \prog{libunwind} was made using
|
||||||
against the vanilla \prog{libunwind} (using the same methodology, only linking
|
the exact same methodology as in Section~\ref{ssec:bench_perf}, only linking
|
||||||
\prog{perf} against the vanilla \prog{libunwind}), gives the results in
|
\prog{perf} against the vanilla \prog{libunwind}. It yields the results in
|
||||||
Table~\ref{table:bench_time}.
|
Table~\ref{table:bench_time}.
|
||||||
|
|
||||||
\begin{table}[h]
|
\begin{table}[h]
|
||||||
|
@ -1036,11 +1028,11 @@ instruction, however, would not slow down at all the implementation, since
|
||||||
every instruction would simply be compiled to x86\_64 without affecting the
|
every instruction would simply be compiled to x86\_64 without affecting the
|
||||||
already supported code.
|
already supported code.
|
||||||
|
|
||||||
It is also worth noting that on the machine described in
|
It is also worth noting that the compilation time of \ehelfs{} is
|
||||||
Section~\ref{ssec:bench_hw}, the compilation of the \ehelfs{} at a level of
|
reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and
|
||||||
\lstc{-O2} needed to run \prog{hackbench}, that is, \prog{hackbench},
|
without using multiple cores to compile, the various shared objects needed to
|
||||||
\prog{libc}, \prog{ld}, and \prog{libpthread} takes an overall time of $25.28$
|
run \prog{hackbench} --~that is, \prog{hackbench}, \prog{libc}, \prog{ld} and
|
||||||
seconds (using only a single core).
|
\prog{libpthread}~-- are compiled in an overall time of $25.28$ seconds.
|
||||||
|
|
||||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
\subsection{Measured compactness}\label{ssec:results_size}
|
\subsection{Measured compactness}\label{ssec:results_size}
|
||||||
|
@ -1189,8 +1181,8 @@ only concerned about the columns CFA, \reg{rip}, \reg{rsp}, \reg{rbp} and
|
||||||
second row analyzes all the columns that were encountered, no matter whether
|
second row analyzes all the columns that were encountered, no matter whether
|
||||||
supported or not.
|
supported or not.
|
||||||
|
|
||||||
Table~\ref{table:instr_types} analyzes the proportion of each command (\ie\
|
Table~\ref{table:instr_types} analyzes the proportion of each command
|
||||||
the formal way a register is set) for non-CFA columns in the sampled data. For
|
--~the formal way a register is set~-- for non-CFA columns in the sampled data. For
|
||||||
a brief explanation, \texttt{Offset} means stored at offset from CFA,
|
a brief explanation, \texttt{Offset} means stored at offset from CFA,
|
||||||
\texttt{Register} means the value from a machine register, \texttt{Expression}
|
\texttt{Register} means the value from a machine register, \texttt{Expression}
|
||||||
means stored at the address of an expression's result, and the \texttt{Val\_}
|
means stored at the address of an expression's result, and the \texttt{Val\_}
|
||||||
|
|