┌───────────────────────┐
                                                       ▄▄▄▄▄ ▄▄▄▄▄ ▄▄▄▄▄       │
                                                       │ █   █ █ █ █   █       │
                                                       │ █   █ █ █ █▀▀▀▀       │
                                                       │ █   █   █ █     ▄     │
                                                       │                 ▄▄▄▄▄ │
                                                       │                 █   █ │
                                                       │                 █   █ │
                                                       │                 █▄▄▄█ │
                                                       │                 ▄   ▄ │
                                                       │                 █   █ │
                                                       │                 █   █ │
                                                       │                 █▄▄▄█ │
                                                       │                 ▄▄▄▄▄ │
Introducing SHELF Loading                              │                   █   │
The Nexus between Static and Position Independent Code │                   █   │
~ @ulexec and @Anonymous_                              └───────────────────█ ──┘

1. Introduction

Over the last several years there have been numerous enhancements in Linux
offensive tooling in terms of sophistication and complexity. Linux malware has
become increasingly popular, judging by the higher number of public reports
documenting Linux threats. These include government-backed Linux implants such
as APT28's VPNFilter, Drovorub, or Winnti's wide range of Linux malware.

However, this increase in popularity does not seem to have had much of an impact
on the overall sophistication of the current Linux threat landscape just yet.
It's a fairly young ecosystem, where cybercriminals have not been able to
identify reliable targets for monetization apart from cryptocurrency mining,
DDoS, and more recently, ransomware operations.

In today's Linux threat landscape, even the smallest refinement or introduction
of complexity often results in AV evasion, and therefore Linux malware authors
do not tend to invest unnecessary resources in making their implants more
sophisticated. There are various reasons why this phenomenon occurs, and they
are open to debate. The Linux ecosystem, in contrast to other popular spheres
such as Windows and macOS, is more dynamic and diverse, stemming from the many
flavors of ELF files for different architectures, the fact that ELF binaries can
be valid in many different forms, and that the visibility of Linux threats is
often quite poor.

Due to these issues, AV vendors face a completely different set of challenges 
detecting these threats. Oftentimes this disproportionate failure to detect
simple/unsophisticated threats leaves an implicit impression that Linux malware
is by nature not complex. This statement couldn't be further from the truth, and
those familiar with the ELF file format know that there is quite a lot of room 
for innovation with ELF files that other file formats are not able to provide 
due to their lack of flexibility, even if we have not seen it abused as much 
over the years.

In this article we are going to discuss a technique that achieves a capability
uncommon among file formats: generically converting full executables to
shellcode. It is yet another example of how ELF binaries can be manipulated to
achieve offensive innovation that is hard or impossible to replicate in other
file formats.


2. A Primer On ELF Reflective Loading

In order to understand the technique, we must first give a contextual background
on previously known ELF techniques upon which this one is based, with a 
comparison of the benefits and tradeoffs.

Most ELF packers, or any application implementing any form of ELF binary 
loading, are primarily based on what's known as User-Land-Exec.

User-Land-Exec is a method first documented by @thegrugq, in which an ELF binary
can be loaded without using any of the execve family of system calls, and hence 
its name.

For the sake of simplicity, the steps to implement an ordinary User-Land-Exec
with support for ET_EXEC and ET_DYN ELF binaries are illustrated in the
following diagram, showcasing the UPX packer's implementation for ELF
binaries [10]:

  [IMG] https://tmpout.sh/1/10/10.1.png

As we can observe, this technique has the following requirements (by @thegrugq):

  1. Clean out the address space.
  2. If the binary is dynamically linked, load the dynamic linker.
  3. Load the binary.
  4. Initialize the stack.
  5. Determine the entry point (i.e. the dynamic linker or the main executable).
  6. Transfer execution to the entry point.

On a more technical level, we come up with the following requirements:
 
  1. Set up the stack of the embedded executable with its corresponding
     Auxiliary Vector.
  2. Parse the PHDRs and identify whether there is a PT_INTERP segment, denoting
     that the file is a dynamically linked executable.
  3. LOAD the interpreter if PT_INTERP is present.
  4. LOAD the target embedded executable.
  5. Pivot to the mapped e_entry of the target executable or of the interpreter,
     depending on whether the target executable is dynamically linked.
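
As a rough illustration of step 2, the following is a minimal sketch (not the
code of any particular packer) that walks the Program Header Table of a 64-bit
ELF image held in memory and reports whether a PT_INTERP segment is present.
The helper names find_phdr and needs_interpreter are our own, hypothetical
choices; only the structures and constants from <elf.h> are real.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the first program header of type `type` into `out`.
 * Returns 1 if found, 0 if absent or if the headers are malformed.
 * On a 0 return the contents of `out` are unspecified. */
static int find_phdr(const uint8_t *image, size_t len, uint32_t type,
                     Elf64_Phdr *out)
{
    Elf64_Ehdr eh;

    if (len < sizeof eh)
        return 0;
    memcpy(&eh, image, sizeof eh);          /* memcpy avoids alignment traps */
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_phentsize != sizeof(Elf64_Phdr))
        return 0;
    /* Reject a table that would run past the end of the buffer. */
    if (eh.e_phoff > len ||
        (len - eh.e_phoff) / sizeof(Elf64_Phdr) < eh.e_phnum)
        return 0;

    for (size_t i = 0; i < eh.e_phnum; i++) {
        memcpy(out, image + eh.e_phoff + i * sizeof *out, sizeof *out);
        if (out->p_type == type)
            return 1;
    }
    return 0;
}

/* Step 2 of the list: PT_INTERP marks a dynamically linked executable. */
static int needs_interpreter(const uint8_t *image, size_t len)
{
    Elf64_Phdr ph;
    return find_phdr(image, len, PT_INTERP, &ph);
}
```

A loader would branch on needs_interpreter() to decide whether step 3 applies
and which e_entry to pivot to in step 5.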

For a more in-depth explanation, we suggest reading @thegrugq's comprehensive 
paper on the matter [9]. 

One of the capabilities of conventional User-Land-Exec is the evasion of an
execve footprint as previously mentioned, in contrast with other techniques
such as memfd_create/execveat, which are also widely used to load and execute
a target ELF file. Since the loader maps and loads the target executable, the
embedded executable has the flexibility of having a non-conventional structure.
This has the side benefit of being useful for evasion and anti-forensics 
purposes.

On the other hand, since there are a lot of critical artifacts involved in the
loading process, it can be easy for reverse-engineers to recognize, and it is
somewhat fragile because the technique depends heavily on these components. For
this reason, writing User-Land-Exec based loaders has been somewhat tedious. As
more features get added to the ELF file format, the technique has had to mature
over time, thereby increasing its complexity.

The new technique that we will be covering in this paper relies on implementing
a generic User-Land-Exec loader with a reduced set of constraints, supporting a
hybrid of PIE and statically linked ELF binaries that, to our knowledge, has yet
to be reported.

We believe this technique represents a drastic improvement over previous
versions of User-Land-Exec loaders: given the lack of technical implementation
constraints and the nature of this new hybrid static/PIE ELF flavor, the range
of capabilities it can provide is wider and more evasive than with previous
User-Land-Exec variants.


3. Internals Of Static PIE Executable Generation

3.1 Background

In July of 2017 H. J. Lu posted a patch to a GCC Bugzilla entry titled 'Support
creating static PIE'. This patch mentioned the implementation of static PIE in
his glibc branch hjl/pie/static, in which Lu documented that by supplying the
-static and -pie flags to the linker along with PIE versions of crt*.o as input,
static PIE ELF executables could be generated. It is important to note that at
the time of this patch, generation of fully statically linked PIE binaries was
not possible.[1]

In August, Lu submitted a second patch[2] to the GCC driver, adding the
-static-pie flag to support the static PIE files that he had demonstrated in his
previous patch. The patch was accepted in trunk[3], and this feature was
released in GCC v8.

Moreover, in December of 2017 a commit was made in glibc[4] adding the option
--enable-static-pie. This patch made it possible to embed the needed parts of
ld.so to produce standalone static PIE executables.

The major change in glibc to allow static PIE was the addition of the function
_dl_relocate_static_pie which gets called by __libc_start_main. This function is
used to locate the run-time load address, read the dynamic segment, and perform 
dynamic relocations before initialization, then transfer control flow of 
execution to the subject application.
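
To make the "perform dynamic relocations" step more concrete, here is a minimal
sketch of what applying relative relocations amounts to on x86_64. This is not
glibc's code: apply_relative_relocs is a hypothetical helper of ours, it only
handles R_X86_64_RELATIVE entries of a RELA table, and it assumes the caller
has already located that table (glibc finds it through the dynamic segment).

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Apply R_X86_64_RELATIVE entries to an image mapped at `base`.
 * Each entry stores (base + addend) at (base + r_offset).
 * Returns the number of entries applied, or -1 on an unsupported
 * relocation type or an offset outside the image. */
static long apply_relative_relocs(uint8_t *base, size_t image_size,
                                  const Elf64_Rela *rela, size_t count)
{
    long applied = 0;

    for (size_t i = 0; i < count; i++) {
        uint64_t value;

        if (ELF64_R_TYPE(rela[i].r_info) != R_X86_64_RELATIVE)
            return -1;                 /* this sketch handles one type only */
        if (image_size < sizeof value ||
            rela[i].r_offset > image_size - sizeof value)
            return -1;                 /* write would leave the image */

        value = (uint64_t)(uintptr_t)base + (uint64_t)rela[i].r_addend;
        memcpy(base + rela[i].r_offset, &value, sizeof value);
        applied++;
    }
    return applied;
}
```

Because every patched value is simply the load address plus a constant, the
same image works at any base, which is what makes the static binary position
independent.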

To find out which flags and compilation/linking stages were needed to generate
static PIE executables, we passed the flags -static-pie -v to GCC. However, we
soon realized that the driver emitted a plethora of flags and calls to internal
tools. As an example, the linking phase is handled by the tool
/usr/lib/gcc/x86_64-linux-gnu/9/collect2 and compilation proper is performed by
/usr/lib/gcc/x86_64-linux-gnu/9/cc1. Nevertheless, we managed to remove the
irrelevant flags and we ended up with the following steps:

  [IMG] https://tmpout.sh/1/10/10.2.png

These steps are in fact the same ones provided by Lu: supplying the linker with
input files compiled with -fpie, along with the flags -static, -pie, -z text,
and --no-dynamic-linker. In particular, the most relevant artifacts for static
PIE creation are rcrt1.o, libc.a, and our own supplied input file, test.o. The
rcrt1.o object contains the _start code, which performs the setup required to
correctly load the application before executing its entry point by calling the
corresponding libc startup code contained in __libc_start_main:

  [IMG] https://tmpout.sh/1/10/10.3.png

As previously mentioned, __libc_start_main will call the newly added function
_dl_relocate_static_pie (defined in the elf/dl-reloc-static-pie.c file of the
glibc source). The primary steps performed by this function are commented in the
source:

  [IMG] https://tmpout.sh/1/10/10.4.png

With the help of these features, GCC is capable of generating static executables
which can be loaded at any arbitrary address. 

We can observe that _dl_relocate_static_pie will handle the needed dynamic 
relocations. One noticeable difference of rcrt1.o from conventional crt1.o is 
that all contained code is position independent. Inspecting what the generated 
binaries look like we see the following:

  [IMG] https://tmpout.sh/1/10/10.5.png

At first glance they seem to be common dynamically linked PIE executables, based
on the ET_DYN executable type retrieved from the ELF header. However, upon
closer examination of the segments, we observe that there is no PT_INTERP
segment, which usually denotes the path to the interpreter in dynamically linked
executables, and that there is a PT_TLS segment, usually contained only in
statically linked executables.

  [IMG] https://tmpout.sh/1/10/10.6.png
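
The three observations above (ET_DYN type, no PT_INTERP, PT_TLS present) can be
turned into a quick heuristic. The sketch below is only that: the struct and
function names are hypothetical, and the check is not conclusive on its own,
since for instance a shared object can also be ET_DYN without PT_INTERP.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Traits discussed in the text, gathered from a 64-bit ELF image. */
struct elf_traits {
    int et_dyn;      /* e_type == ET_DYN          */
    int has_interp;  /* a PT_INTERP segment exists */
    int has_tls;     /* a PT_TLS segment exists    */
};

/* Fill `t` from the ELF header and program headers.
 * Returns 0 on success, -1 if the headers are malformed. */
static int probe_elf(const uint8_t *image, size_t len, struct elf_traits *t)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;

    memset(t, 0, sizeof *t);
    if (len < sizeof eh)
        return -1;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_phentsize != sizeof ph || eh.e_phoff > len ||
        (len - eh.e_phoff) / sizeof ph < eh.e_phnum)
        return -1;

    t->et_dyn = (eh.e_type == ET_DYN);
    for (size_t i = 0; i < eh.e_phnum; i++) {
        memcpy(&ph, image + eh.e_phoff + i * sizeof ph, sizeof ph);
        if (ph.p_type == PT_INTERP)
            t->has_interp = 1;
        else if (ph.p_type == PT_TLS)
            t->has_tls = 1;
    }
    return 0;
}

/* Heuristic only: matches the static PIE layout described in the text. */
static int looks_like_static_pie(const struct elf_traits *t)
{
    return t->et_dyn && !t->has_interp && t->has_tls;
}
```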

If we check what the dynamic linker identifies the subject executable as, we 
will see it identifies the file type correctly:

  [IMG] https://tmpout.sh/1/10/10.7.png

In order to load this file, all we would need to do is map all the PT_LOAD 
segments to memory, set up the process stack with the corresponding Auxiliary
Vector entries, and then pivot to the mapped executable's entry point. We do 
not need to be concerned about mapping the RTLD since we don't have any external
dependencies or link time address restrictions.
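
The mapping step can be sketched as follows. This is a simplified illustration
under several assumptions of ours: the whole file is already in a buffer, the
PT_LOAD virtual addresses start near zero (as in a PIE), and the region is left
read/write, whereas a real loader would mprotect each segment according to its
p_flags before jumping to it. map_load_segments and SKETCH_LIMIT are
hypothetical names.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std=c11 */
#endif
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SKETCH_LIMIT (1ull << 30)   /* arbitrary sanity cap for this sketch */

/* Copy every PT_LOAD segment of `file` into one anonymous mapping.
 * Returns the base address and stores the mapping size in *span_out,
 * or returns MAP_FAILED on malformed input or mmap failure. */
static void *map_load_segments(const uint8_t *file, size_t len,
                               size_t *span_out)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    uint64_t top = 0;

    if (len < sizeof eh)
        return MAP_FAILED;
    memcpy(&eh, file, sizeof eh);
    if (eh.e_phentsize != sizeof ph || eh.e_phoff > len ||
        (len - eh.e_phoff) / sizeof ph < eh.e_phnum)
        return MAP_FAILED;

    /* Pass 1: validate segments and find the highest address reached. */
    for (size_t i = 0; i < eh.e_phnum; i++) {
        memcpy(&ph, file + eh.e_phoff + i * sizeof ph, sizeof ph);
        if (ph.p_type != PT_LOAD)
            continue;
        if (ph.p_vaddr > SKETCH_LIMIT || ph.p_memsz > SKETCH_LIMIT ||
            ph.p_filesz > ph.p_memsz || ph.p_offset > len ||
            ph.p_filesz > len - ph.p_offset)
            return MAP_FAILED;
        if (ph.p_vaddr + ph.p_memsz > top)
            top = ph.p_vaddr + ph.p_memsz;
    }
    if (top == 0)
        return MAP_FAILED;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t span = ((size_t)top + page - 1) & ~(page - 1);
    uint8_t *base = mmap(NULL, span, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    /* Pass 2: copy file-backed bytes; the tail up to p_memsz stays zero. */
    for (size_t i = 0; i < eh.e_phnum; i++) {
        memcpy(&ph, file + eh.e_phoff + i * sizeof ph, sizeof ph);
        if (ph.p_type == PT_LOAD)
            memcpy(base + ph.p_vaddr, file + ph.p_offset, ph.p_filesz);
    }
    *span_out = span;
    return base;
}
```

Since the kernel picks the address of the anonymous mapping, the returned base
is effectively arbitrary, which a static PIE tolerates by design.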

As we can observe, we have four loadable segments, as commonly seen in SCOP ELF
binaries. However, for the sake of easier deployment, it would be very helpful
if we could merge all those segments into one, as is usually done with ELF disk
injection into a foreign executable. We can do just this by using the -N linker
flag to merge data and text within a single segment.

3.2 Non-compatibility of GCC's -N and -static-pie flags

If we pass the -static-pie and -N flags together to GCC we see that it generates
the following executable:

  [IMG] https://tmpout.sh/1/10/10.8.png

The first thing we noticed is that the ELF generated when using -static-pie
alone had a type of ET_DYN, whereas together with -N it results in an ET_EXEC.

In addition, if we take a closer look at the segments' virtual addresses, we see
that the generated binary is not a Position Independent Executable: the virtual
addresses are absolute addresses rather than relative ones. To understand why
our program was not being linked as expected, we inspected the linker script
that was being used.

As we are using the ld linker from binutils, we took a look at how ld selects
the linker script; this is done in the ld/ldmain.c code at line 345 [8]:

  [IMG] https://tmpout.sh/1/10/10.9.png

The ldfile_open_default_command_file call is in fact an indirect call to an
architecture-specific function generated at compile time that contains a set
of internal linker scripts to be selected depending upon the flags passed to ld.
Because we are using the x86_64 architecture, the generated source will be
ld/elf_x86_64.c, and the function which is called to select the script is
gldelf_x86_64_get_script, which is simply a chain of if-else-if statements that
selects the internal linker script. The -N option sets the config.text_read_only
variable to false, which forces the selection function to use an internal script
which does not produce PIC, as can be seen below:

  [IMG] https://tmpout.sh/1/10/10.10.png

This way of selecting the default script makes the -static-pie and -N flags
incompatible, because the test that selects the script based on -N is evaluated
before the one for -static-pie.
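
The ordering problem can be shown with a toy model. The code below is not the
generated binutils source: the struct, the enum, and pick_script are all
hypothetical names, and the real function has many more branches. It only
mirrors the one property described above, namely that the text_read_only test
comes before any PIE test.

```c
#include <stdbool.h>

/* Toy stand-ins for the linker state consulted during script selection. */
struct link_opts {
    bool relocatable;     /* -r                                   */
    bool text_read_only;  /* cleared by -N                        */
    bool pie;             /* set when linking a PIE (-pie)        */
};

enum script_kind {
    SCRIPT_RELOCATABLE,
    SCRIPT_OMAGIC,        /* the -N script, which is not PIC      */
    SCRIPT_PIE,
    SCRIPT_DEFAULT
};

/* First matching test wins, so -N shadows the PIE branch. */
static enum script_kind pick_script(const struct link_opts *o)
{
    if (o->relocatable)
        return SCRIPT_RELOCATABLE;
    else if (!o->text_read_only)
        return SCRIPT_OMAGIC;
    else if (o->pie)
        return SCRIPT_PIE;
    else
        return SCRIPT_DEFAULT;
}
```

With both -N and -pie in effect, this model returns SCRIPT_OMAGIC and never
reaches the PIE branch, which matches the ET_EXEC output we observed.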

3.3 Circumvention via custom Linker Script

The incompatibility between the -N, -static, and -pie flags led us to a dead
end, and we were forced to think of different ways to overcome this barrier.
What we attempted was to provide a custom script to drive the linker. As we
essentially needed to merge the behavior of two separate linker scripts, our
approach was to choose one of the scripts and adapt it to generate the desired
outcome with features of the remaining script.

We chose the default script of -static-pie over the one used with -N because in
our case it was easier to modify than changing the -N default script to support
PIE generation.

To accomplish this goal, we would need to change the definition of the segments,
which are controlled by the PHDRS [5] command in the linker script. If the
command is not used, the linker will provide program headers generated by
default. However, if we use it in the linker script, the linker will not create
any additional program headers and will strictly follow the ones defined in the
subject linker script.

Taking into account the details discussed above, we added a PHDRS command to the
default linker script, starting with all the original segments which are created
by default when using -static-pie:

  [IMG] https://tmpout.sh/1/10/10.11.png

After this we need to know how each section maps to each segment, and for this
we can use readelf as shown below:

  [IMG] https://tmpout.sh/1/10/10.12.png

With knowledge of the mappings, we just needed to change the output section
definitions in the linker script, adding the appropriate segment name at the
end of each output section definition [6], as shown in the following example:

  [IMG] https://tmpout.sh/1/10/10.13.png

Here, the .tdata and .tbss sections are being assigned to the segments that get
mapped in the same order that we saw in the output of the readelf -l command.
Eventually, we ended up with a working script that reassigns every section
previously mapped in the data segment to the text segment:

  [IMG] https://tmpout.sh/1/10/10.14.png

If we compile our subject test file with this linker script, we see the 
following generated executable:

  [IMG] https://tmpout.sh/1/10/10.15.png

We now have a static-pie with just one loadable segment. The same approach can 
be repeated to remove other irrelevant segments, keeping only critical segments
necessary for the execution of the binary. As an example, the following is a 
static-pie executable instance with minimal program headers needed to run:

  [IMG] https://tmpout.sh/1/10/10.16.png

The following is the final output of our desired ELF structure, having only one
PT_LOAD segment generated by a linker script with the PHDRS command configured 
as in the screenshot below:

  [IMG] https://tmpout.sh/1/10/10.17.png


4. SHELF Loading

This generated ELF flavor gives us some interesting capabilities that other ELF
types are not able to provide. For the sake of simplicity, we have labelled this
type of ELF binary as SHELF, and we will be referencing it throughout the rest 
of this paper. The following is an updated diagram of the loading stages needed
for SHELF loading:

  [IMG] https://tmpout.sh/1/10/10.18.png

As we can see in the diagram above, the process of loading SHELF files is highly
reduced in complexity compared to conventional ELF loading schemes. 

To illustrate the reduced set of constraints to load these types of files, a 
snippet of a minimalistic SHELF User-Land-Exec approach is as follows:

  [IMG] https://tmpout.sh/1/10/10.19.png

By using this approach, a subject SHELF file would look as follows in memory and
on disk:

  [IMG] https://tmpout.sh/1/10/10.20.png

As we can observe, the ELF header and Program Headers are missing from the 
process image. This is a feature that this flavor of ELF enables us to implement
and is discussed in the following section.

4.1 Anti-Forensic Capabilities

This new approach to User-Land-Exec also has two optional stages useful for
anti-forensic purposes. Since the _dl_relocate_static_pie function will obtain
all of the required fields for relocation from the Auxiliary Vector, this leaves
us room to play with how the subject SHELF file structure may look in memory and
on disk.

Removing the ELF header will directly impact reconstruction capabilities, 
because most Linux-based scanners will scan process memory for existing ELF 
images by first identifying ELF headers. The ELF header will be parsed and will
contain further information on where to locate the Program Header Table and 
consequently the rest of the mapped artifacts of the file.

Removal of the ELF header is trivial since this artifact is not really needed by
the loader: all required information in the subject file will be retrieved from
the Auxiliary Vector as previously mentioned.
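
For reference, a loader-built initial stack carrying such an Auxiliary Vector
could be laid out as in the sketch below. It writes the x86_64 System V layout
(argc, argv, envp, then auxv pairs) into a caller-provided array. The function
name is hypothetical, and the particular set of AT_* entries is our assumption:
which ones a given glibc build actually consumes should be checked against its
startup code.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `stack` (lowest address first) with argc, argv[], envp[] and a
 * minimal auxiliary vector. Returns the number of 64-bit slots written,
 * or 0 if `slots` is too small. */
static size_t build_initial_stack(uint64_t *stack, size_t slots,
                                  const char *argv0, uint64_t phdr_addr,
                                  uint64_t phnum, uint64_t entry,
                                  const void *random16)
{
    const uint64_t aux[][2] = {
        { AT_PHDR,   phdr_addr },                  /* where the PHT lives  */
        { AT_PHNUM,  phnum },                      /* number of PHT entries */
        { AT_PHENT,  sizeof(Elf64_Phdr) },         /* size of one entry     */
        { AT_ENTRY,  entry },                      /* mapped e_entry        */
        { AT_RANDOM, (uint64_t)(uintptr_t)random16 }, /* 16 random bytes    */
        { AT_NULL,   0 },                          /* terminator            */
    };
    const size_t n_aux = sizeof aux / sizeof aux[0];
    const size_t need = 4 + 2 * n_aux;  /* argc, argv0, NULL, envp NULL */
    size_t i = 0;

    if (slots < need)
        return 0;

    stack[i++] = 1;                               /* argc           */
    stack[i++] = (uint64_t)(uintptr_t)argv0;      /* argv[0]        */
    stack[i++] = 0;                               /* argv terminator */
    stack[i++] = 0;                               /* empty envp     */
    for (size_t k = 0; k < n_aux; k++) {
        stack[i++] = aux[k][0];
        stack[i++] = aux[k][1];
    }
    return i;
}
```

Nothing in this layout points at the ELF header itself, which is why the header
can be dropped from the image without affecting startup.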

An additional artifact that can be hidden is the Program Header Table. This is a
slightly different case when compared with the ELF Header. The Auxiliary Vector
needs to point to the Program Header Table in order for the RTLD to successfully
load the file by applying the needed runtime relocations. Regardless, there are
many approaches to obfuscating the PHT. The simplest approach is to remove the
Program Header Table from its original location and relocate it somewhere in
the file that is only known through the Auxiliary Vector.

  [IMG] https://tmpout.sh/1/10/10.21.png
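
A minimal sketch of this relocation, operating on a file image in memory, could
look as follows. relocate_pht is a hypothetical helper; choosing a destination
offset that does not clobber live code or data is left to the caller, and
e_phoff is deliberately not updated because the ELF header is expected to be
removed afterwards.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Move the Program Header Table to `new_off` and zero its old location.
 * Returns 0 on success, -1 on malformed input or overlapping ranges. */
static int relocate_pht(uint8_t *image, size_t len, size_t new_off)
{
    Elf64_Ehdr eh;
    size_t size;

    if (len < sizeof eh)
        return -1;
    memcpy(&eh, image, sizeof eh);
    size = (size_t)eh.e_phnum * sizeof(Elf64_Phdr);

    if (eh.e_phentsize != sizeof(Elf64_Phdr) || size == 0 ||
        eh.e_phoff > len || size > len - eh.e_phoff ||
        new_off > len || size > len - new_off)
        return -1;
    /* Refuse overlapping source and destination ranges. */
    if (new_off < eh.e_phoff + size && eh.e_phoff < new_off + size)
        return -1;

    memcpy(image + new_off, image + eh.e_phoff, size);
    memset(image + eh.e_phoff, 0, size);   /* erase the original table */
    return 0;
}
```

The loader then has to pass the new location, rebased to the load address, as
the AT_PHDR value.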

We can precompute the location of each of the Auxiliary Vector entries and 
define each entry as a macro in an include file, tailoring our loader to every
subject SHELF file at compile-time. The following is an example of how these 
macros can be generated:

  [IMG] https://tmpout.sh/1/10/10.22.png

As we can observe, we have parsed the subject SHELF file for its e_entry and
e_phnum fields, creating corresponding macros to hold those values. We also have
to choose a random base address at which to load the image. Finally, we locate
the PHT and convert it to an array, then remove it from its original location.
Applying these modifications allows us to completely remove the ELF header and
change the default location of the subject SHELF file PHT both on disk and in
memory(!).
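
As a sketch of the same idea in C rather than as a shell one-liner, the
following hypothetical helper formats the header fields into #define lines that
a build step could write to an include file. The macro names (SHELF_BASE,
SHELF_ENTRY, SHELF_PHNUM) are our own placeholders, not the names used in the
screenshot.

```c
#include <elf.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write macro definitions for the chosen base, e_entry and e_phnum of the
 * ELF image into `out`. Returns the number of characters written, or -1
 * if the image is not an ELF file or `out` is too small. */
static int emit_shelf_macros(const uint8_t *image, size_t len, uint64_t base,
                             char *out, size_t out_len)
{
    Elf64_Ehdr eh;
    int n;

    if (len < sizeof eh)
        return -1;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0)
        return -1;

    n = snprintf(out, out_len,
                 "#define SHELF_BASE  0x%" PRIx64 "\n"
                 "#define SHELF_ENTRY 0x%" PRIx64 "\n"
                 "#define SHELF_PHNUM %u\n",
                 base, (uint64_t)eh.e_entry, (unsigned)eh.e_phnum);
    return (n < 0 || (size_t)n >= out_len) ? -1 : n;
}
```

Because these values are baked into the loader at compile time, the header they
came from no longer needs to exist in the deployed file.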

Without successful retrieval of the Program Header Table, reconstruction 
capabilities may be strictly limited and further heuristics will have to be 
applied for successful process image reconstruction.

An additional approach to make the reconstruction of the Program Header Table
much harder is to instrument the way glibc implements the resolution of the
Auxiliary Vector fields.

4.2 Obscuring SHELF features by PT_TLS patching

Even after modifying the default location of the Program Header Table by 
choosing a new arbitrary location when crafting the Auxiliary Vector, the 
Program Header Table would still reside in memory and could be found with some
effort. To obscure ourselves even further we can take over the way the startup
code reads the Auxiliary Vector fields.

The code that does this is in elf/dl-support.c in the function _dl_aux_init. In
short, the code iterates over all the auxv_t entries, and each of these
entries initializes an internal glibc variable:

  [IMG] https://tmpout.sh/1/10/10.23.png

The only reason the Auxiliary Vector is required is to initialize internal _dl_*
variables. Knowing this, we can bypass the creation of the Auxiliary Vector 
entirely and do the same job that _dl_aux_init would do before passing control
of execution to the subject SHELF file.
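
The equivalence can be illustrated with a small model. The struct dl_state and
both functions below are hypothetical stand-ins of ours, not glibc code: the
first mimics the shape of an auxv-driven initialization loop, the second sets
the same fields directly, which is what a loader does when it skips the
Auxiliary Vector.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the handful of internal variables discussed in the text. */
struct dl_state {
    const void *phdr;    /* analogous to _dl_phdr   */
    size_t      phnum;   /* analogous to _dl_phnum  */
    const void *random;  /* analogous to _dl_random */
};

/* Auxv-driven initialization: walk entries until AT_NULL. */
static void aux_init_sketch(const Elf64_auxv_t *av, struct dl_state *st)
{
    for (; av->a_type != AT_NULL; av++) {
        switch (av->a_type) {
        case AT_PHDR:
            st->phdr = (const void *)(uintptr_t)av->a_un.a_val;
            break;
        case AT_PHNUM:
            st->phnum = (size_t)av->a_un.a_val;
            break;
        case AT_RANDOM:
            st->random = (const void *)(uintptr_t)av->a_un.a_val;
            break;
        default:
            break;       /* every other entry is ignored in this sketch */
        }
    }
}

/* Loader-driven initialization: same end state, no auxv involved. */
static void init_without_auxv(struct dl_state *st, const void *phdr,
                              size_t phnum, const void *random)
{
    st->phdr = phdr;
    st->phnum = phnum;
    st->random = random;
}
```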

The only entries which are critical are AT_PHDR, AT_PHNUM, and AT_RANDOM. 
Therefore, we only need to patch the respective _dl_* variables that depend on
these fields. As an example of how to retrieve these values, we can use the 
following one-liner to generate an include file with precomputed macros holding
the offset to every _dl_* variable:

  [IMG] https://tmpout.sh/1/10/10.24.png

With the offset to these variables located, we only need to patch them in the
same way the original startup code would do so using the Auxiliary Vector. As a
way to illustrate this technique, the following code will initialize the 
addresses of the Program Headers to new_address, and the number of program 
headers to the correct number:

  [IMG] https://tmpout.sh/1/10/10.25.png
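
A comparable patch routine can be sketched as below. The helper names are
hypothetical, the offsets are parameters standing in for the precomputed
macros, and we assume both variables are 8 bytes wide, as is the case for a
pointer and a size_t on x86_64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value at `off` inside the mapped image, with bounds check.
 * Returns 0 on success, -1 if the write would leave the image. */
static int patch_u64(uint8_t *image, size_t size, size_t off, uint64_t value)
{
    if (size < sizeof value || off > size - sizeof value)
        return -1;
    memcpy(image + off, &value, sizeof value);
    return 0;
}

/* Patch the two variables the text singles out: the address of the
 * Program Headers and their count. `phdr_off` and `phnum_off` are the
 * offsets of the corresponding variables within the image. */
static int patch_dl_vars(uint8_t *image, size_t size,
                         size_t phdr_off, size_t phnum_off,
                         uint64_t new_address, uint64_t phnum)
{
    if (patch_u64(image, size, phdr_off, new_address) != 0)
        return -1;
    return patch_u64(image, size, phnum_off, phnum);
}
```

The patch must run after the image is mapped and before control is transferred
to its entry point.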

At this point we have a working program without supplying the Auxiliary Vector.
Because the subject binary is statically linked, and the code that will load the
SHELF file is our loader, we can neglect every other segment referenced through
the Auxiliary Vector's AT_PHDR and AT_PHNUM, or through _dl_phdr and _dl_phnum
respectively. The one exception is the PT_TLS segment, which is the interface
through which Thread Local Storage is implemented in the ELF file format [7].

The following code, which resides in csu/libc-tls.c in the function
__libc_setup_tls, shows the type of information that gets retrieved from the
PT_TLS segment:

  [IMG] https://tmpout.sh/1/10/10.26.png

In the code snippet above, we can see that TLS initialization relies on the 
presence of the PT_TLS segment. We have several approaches that can obfuscate 
this artifact, such as patching the __libc_setup_tls function to just return and
then initialize the TLS with our own code. Here, we'll choose to implement a 
quick patch to glibc instead as a PoC.

To avoid the need for the PT_TLS Program Header we have added a global variable
to hold the values from PT_TLS and set the values inside __libc_setup_tls, 
reading from our global variable instead of the subject SHELF file Program 
Header Table. With this small change we finally strip all the program headers:

  [IMG] https://tmpout.sh/1/10/10.27.png
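
To give an idea of what has to be preserved before the headers are stripped,
the sketch below extracts the PT_TLS fields that describe the TLS
initialization image. The struct tls_template and capture_tls are hypothetical
names of ours; the exact set of fields the patched glibc needs is an
assumption based on what PT_TLS carries.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values taken from PT_TLS that would be stored in the global variable. */
struct tls_template {
    uint64_t vaddr;   /* p_vaddr:  start of the initialization image  */
    uint64_t filesz;  /* p_filesz: size of the initialized part        */
    uint64_t memsz;   /* p_memsz:  total size of the TLS block         */
    uint64_t align;   /* p_align:  required alignment                  */
};

/* Returns 0 and fills `out` if a PT_TLS segment exists, -1 otherwise. */
static int capture_tls(const uint8_t *image, size_t len,
                       struct tls_template *out)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;

    if (len < sizeof eh)
        return -1;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_phentsize != sizeof ph || eh.e_phoff > len ||
        (len - eh.e_phoff) / sizeof ph < eh.e_phnum)
        return -1;

    for (size_t i = 0; i < eh.e_phnum; i++) {
        memcpy(&ph, image + eh.e_phoff + i * sizeof ph, sizeof ph);
        if (ph.p_type != PT_TLS)
            continue;
        out->vaddr  = ph.p_vaddr;
        out->filesz = ph.p_filesz;
        out->memsz  = ph.p_memsz;
        out->align  = ph.p_align;
        return 0;
    }
    return -1;
}
```

This has to run on the original file, before its ELF header and Program Header
Table are removed.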

Using the following script to generate _phdr.h:

  [IMG] https://tmpout.sh/1/10/10.28.png

We can apply our patches in the following way after including _phdr.h:

  [IMG] https://tmpout.sh/1/10/10.29.png

Applying the methodology shown above, we gain a high level of evasiveness by 
loading and executing our SHELF file without an ELF header, Program Header 
Table, and Auxiliary Vector, just as shellcode gets loaded. The following
diagram illustrates how straightforward the loading process of SHELF files is:

  [IMG] https://tmpout.sh/1/10/10.30.png


5. Conclusion

We have covered the internals of Reflective Loading of ELF files, explaining 
previous implementations of User-Land-Exec, along with its benefits and 
drawbacks. We then explained the latest patches in the GCC code base that 
implemented support for static-pie binaries, discussing our desired outcome, 
and the approaches we followed to achieve the generation of static-pie ELF files
with one single PT_LOAD segment. Finally, we discussed the anti-forensic 
features that SHELF loading can provide, which we think to be a considerable 
enhancement when compared with previous versions of ELF Reflective Loading.

We think this could be the next generation of ELF Reflective Loading, and it may
benefit readers to understand the extent of offensive capabilities that the ELF
file format can provide. If you would like access to the source code, contact 
@sblip or @ulexec.


6. References

[1] (support static pie) 
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81498 
[2] (first patch gcc)
    https://gcc.gnu.org/ml/gcc-patches/2017-08/msg00638.html
[3] (gcc patch)
    https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=252034
[4] (glibc --enable-static-pie)
    https://sourceware.org/git/?p=glibc.git;a=commit; \
      h=9d7a3741c9e59eba87fb3ca6b9f979befce07826 
[5] (ldscript doc)
    https://sourceware.org/binutils/docs/ld/PHDRS.html#PHDRS 
[6] (ldscript output section phdr doc)
    https://sourceware.org/binutils/docs/ld/ \
      Output-Section-Phdr.html#Output-Section-Phdr
[7] (ELF Handling For Thread-Local Storage)
    https://www.akkadia.org/drepper/tls.pdf
[8] (why ld doesn't allow -static -pie -N)
    https://sourceware.org/git \
      /gitweb.cgi?p=binutils-gdb.git;a=blob;f=ld/ldmain.c; \
      h=c4af10f4e9121949b1b66df6428e95e66ce3eed4;hb=HEAD#l345 
[9] (grugq ul_exec paper)
    https://grugq.github.io/docs/ul_exec.txt 
[10] (ELF UPX internals)
     https://ulexec.github.io/ulexec.github.io/article \
       /2017/11/17/UnPacking_a_Linux_Tsunami_Sample.html