This document describes the Generalized Head-Related Transfer Function (GHRTF), an HRTF adjusted using machine learning. The adjustment improves sound quality by ensuring that the signal from the front direction is transferred transparently. Training on HRTFs from numerous subjects also yields a typical GHRTF. Finally, the document evaluates the GHRTF and shows that it provides a clearer sense of direction and higher sound quality.
1. Introduction
The Head-Related Transfer Function (HRTF) is a key technology for three-dimensional sound. However, issues regarding sound quality and HRTF personalization must be resolved for this technology to be adopted more widely.
When applying HRTF to music production, sound quality issues may arise. Fig. 1 shows the frequency responses in the horizontal and vertical planes of subject H4 in the SADIE II database [2]. As the figure shows, these responses do not meet the required quality level for music production.

Because HRTFs vary significantly between individuals, a personalized HRTF, one measured for or customized to the individual, is desirable. However, personalization increases cost. Wider adoption therefore requires a typical HRTF that works reasonably well for everyone.
SoundObject version 4.1 proposes a Generalized HRTF (GHRTF) based on machine learning that provides music-production-level sound quality and serves as a typical HRTF. This document describes the definition of the GHRTF, how it is estimated using machine learning, and its evaluation results.
2. Coordinate system in this document
This document uses a coordinate system based on the Spatially Oriented Format for Acoustics (SOFA), as defined in the AES69-2020 specification [1]. The head is located at the origin. The x-, y-, and z-axis directions are the front, leftward, and upward directions, respectively. In the polar coordinate system, the radius r is the distance from the origin to the sound source, the elevation angle θ [−90, 90] deg. is measured from the x-y plane, and the azimuth angle φ [0, 360) deg. is measured counterclockwise from the x-axis. Furthermore, to unify the elevation angle in the vertical plane, this document defines θp as follows.
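For reference, the spherical-to-Cartesian conversion implied by this coordinate system can be sketched as follows. This is an illustrative helper, not part of SoundObject; the function name is hypothetical.

```python
import math

def sofa_to_cartesian(r, elev_deg, azim_deg):
    """Convert AES69/SOFA spherical coordinates to Cartesian.

    r        : distance from the origin to the sound source
    elev_deg : elevation theta in [-90, 90] deg, measured from the x-y plane
    azim_deg : azimuth phi in [0, 360) deg, counterclockwise from the x-axis
    Returns (x, y, z): the front, leftward, and upward components.
    """
    theta = math.radians(elev_deg)
    phi = math.radians(azim_deg)
    x = r * math.cos(theta) * math.cos(phi)  # front
    y = r * math.cos(theta) * math.sin(phi)  # left
    z = r * math.sin(theta)                  # up
    return (x, y, z)

# A source 1.5 m directly in front of the listener (theta = 0, phi = 0):
print(sofa_to_cartesian(1.5, 0.0, 0.0))  # (1.5, 0.0, 0.0)
```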
Note: The definition of the elevation angle in AES69-2020 differs from the standard definition.
3. Generalized Head-Related Transfer Function
3.1 Head-Related Transfer Function
The HRTF describes the transfer characteristics from the sound source to the ear. As shown in Fig. 2, Y (θ, φ, z) = H (θ, φ, z) X (z), where X (z) is the Z-transform of the sound source signal x [n], H (θ, φ, z) is the HRTF for the sound source direction (θ, φ), h [θ, φ, n] is the impulse response of H (θ, φ, z), and Y (θ, φ, z) is the Z-transform of the observed signal y [θ, φ, n]. Note that the Z-transform of the observed signal from the front direction (θ = 0, φ = 0) is Y (0, 0, z) = H (0, 0, z) X (z). The HRTF impulse response, h [θ, φ, n], can be obtained directly from the SOFA format file.
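The relation Y (θ, φ, z) = H (θ, φ, z) X (z) corresponds to time-domain convolution of the source signal x [n] with the impulse response h [θ, φ, n]. A minimal sketch verifying this equivalence, using random stand-in signals rather than actual HRTF data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)   # stand-in source signal x[n]
h = rng.standard_normal(128)   # stand-in HRTF impulse response h[theta, phi, n]

# Time domain: y[n] = (h * x)[n]
y_time = np.convolve(h, x)

# Z/frequency domain: Y = H X, evaluated on the unit circle via the FFT
# (zero-padded to the full linear-convolution length)
n_fft = len(x) + len(h) - 1
y_freq = np.fft.irfft(np.fft.rfft(h, n_fft) * np.fft.rfft(x, n_fft), n_fft)

assert np.allclose(y_time, y_freq)
```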

3.2 Proposed Generalized HRTF
The Generalized HRTF G (θ, φ, z) is defined as follows.

G (θ, φ, z) = H (θ, φ, z) / H (0, 0, z)    (Eq. 2)

As shown in Fig. 3, it is the HRTF adjusted relative to the front-direction HRTF, so that the signal from the front is transferred transparently, i.e., G (0, 0, z) = 1. An improvement in sound quality is therefore expected. Note that, since this merely adjusts sound color, it does not affect sound localization. However, the impulse response of G (θ, φ, z) cannot be derived analytically from H (θ, φ, z): in general, the division by H (0, 0, z) does not yield a stable, finite impulse response.

3.3 GHRTF estimation using machine learning
Equation 2 can be transformed as follows.

H (θ, φ, z) = G (θ, φ, z) H (0, 0, z)    (Eq. 3)

Since a system is fully characterized by its impulse response, Eq. 3 can be viewed as a system whose transfer function is G (θ, φ, z): when h [0, 0, n] is input to this system, the output is h [θ, φ, n], the impulse response of H (θ, φ, z).
Estimating a transfer function from its input and output signals is known as system identification. The simplest system identification method is multiple regression. The multiple regression model for an N-th order FIR filter is shown below.

h [θ, φ, n] = Σ g [θ, φ, k] h [0, 0, n − k],  summed over k = 0, …, N    (Eq. 4)

Here, g [θ, φ, k] is the impulse response of G (θ, φ, z), which can be estimated by multiple regression, i.e., machine learning. The target variable is h [θ, φ, n], and the explanatory variables are h [0, 0, n − k].
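The multiple regression above amounts to an ordinary least-squares problem: stack the delayed explanatory signals h [0, 0, n − k] into a matrix and solve for the FIR coefficients. A minimal sketch with synthetic stand-in data; the function name and array shapes are illustrative, not SoundObject's implementation.

```python
import numpy as np

def estimate_ghrtf(h_front, h_target, order):
    """Least-squares (multiple regression) estimate of g[theta, phi, k].

    h_front  : h[0, 0, n], the front-direction HRTF impulse response (input)
    h_target : h[theta, phi, n], the HRTF impulse response (target variable)
    order    : number of FIR taps
    """
    n = len(h_target)
    # Column k holds h_front delayed by k samples (zeros before time 0),
    # i.e., the explanatory variable h[0, 0, n - k].
    X = np.zeros((n, order))
    for k in range(order):
        X[k:, k] = h_front[: n - k]
    g, *_ = np.linalg.lstsq(X, h_target, rcond=None)
    return g

# Sanity check with a known system: if h_target is exactly h_front filtered
# by some FIR g_true, the regression recovers g_true.
rng = np.random.default_rng(1)
h_front = rng.standard_normal(256)
g_true = rng.standard_normal(32)
h_target = np.convolve(h_front, g_true)[:256]
g_hat = estimate_ghrtf(h_front, h_target, 32)
assert np.allclose(g_hat, g_true)
```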
3.4 Examples of GHRTF estimation
The following GHRTF estimation uses 48 kHz sampling data from subject H4 in the SADIE II database [2]. Figures 4 and 5 show the estimated impulse responses in the horizontal and vertical planes, respectively.


The left heat maps show h [θ, φ, n], the HRTF impulse responses stored in the SOFA format file. The heat maps on the right show g [θ, φ, k], the GHRTF impulse responses estimated using machine learning. The center heat maps show the HRTF impulse responses inferred from h [0, 0, n] and the trained g [θ, φ, k]; they closely reproduce the originals.
Next, Figures 6 and 7 show the frequency responses of the aforementioned HRTF and GHRTF, respectively.


The plots on the left are the HRTF frequency responses, and the plots on the right are the estimated GHRTF frequency responses. In both the horizontal and vertical planes, the variance of the GHRTF frequency responses is reduced, thus improving sound quality.
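The variance reduction described above can be quantified, for instance, as the variance of the magnitude responses across directions, averaged over frequency bins. A sketch assuming the impulse responses are stacked in a (directions, samples) array; the function and array names are hypothetical.

```python
import numpy as np

def magnitude_db(irs, n_fft=512):
    """Magnitude responses in dB for a (directions, samples) array of IRs."""
    spec = np.abs(np.fft.rfft(irs, n_fft, axis=1))
    return 20.0 * np.log10(np.maximum(spec, 1e-12))  # floor avoids log10(0)

def response_variance(irs, n_fft=512):
    """Variance across directions, averaged over frequency bins."""
    return float(np.mean(np.var(magnitude_db(irs, n_fft), axis=0)))

# hrtf_irs and ghrtf_irs would be (directions, samples) arrays of h and g;
# a smaller response_variance(ghrtf_irs) indicates flatter, more uniform
# coloration across directions.
```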
4. Typical GHRTF
4.1 Typical GHRTF estimation using machine learning
GHRTF training data need not be limited to a single subject. Training on data from several subjects is expected to yield a typical GHRTF.
Note, however, that when estimating the GHRTF impulse response g [θ, φ, k] for each subject, the sample with the maximum response varies with the subject and with (θ, φ). Simply pooling multiple subjects' data therefore does not produce the expected result.
The typical GHRTF is instead estimated as follows. First, estimate the GHRTF impulse response g [θ, φ, k] for each subject. Next, identify the sample m with the maximum response in each impulse response, and compute the average of m across subjects for each (θ, φ). Then, for each subject, compute the difference between its m and this average, and shift n in Eq. 4 by that difference so that the maximum-response samples are aligned across subjects. Finally, estimate the typical GHRTF impulse response from the aligned data of all subjects.
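The peak-alignment step above can be sketched as follows. The helper name is hypothetical; it returns, for one direction (θ, φ), the integer per-subject shifts that align the maximum-response samples.

```python
import numpy as np

def align_shifts(g_per_subject):
    """Per-subject sample shifts that align the maximum-response samples.

    g_per_subject : (subjects, taps) array of g[theta, phi, k] estimates
                    for one direction (theta, phi).
    Returns integer shifts: (mean peak position) - (subject's peak position),
    rounded to the nearest sample.
    """
    peaks = np.argmax(np.abs(g_per_subject), axis=1)  # sample m per subject
    mean_peak = peaks.mean()                          # average of m
    return np.rint(mean_peak - peaks).astype(int)

# Example: three subjects whose peaks sit at samples 4, 6 and 8.
g = np.zeros((3, 16))
g[0, 4] = 1.0
g[1, 6] = 1.0
g[2, 8] = 1.0
print(align_shifts(g))  # [ 2  0 -2]
```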
4.2 Examples of typical GHRTF estimation
The following typical GHRTF was estimated by training on HRTF data from numerous subjects. Figures 8 and 9 show the impulse and frequency responses in the vertical plane, respectively, as used by SoundObject. Compared with the single-subject frequency responses above, the variance of the typical GHRTF frequency responses is reduced even further.


5. Evaluation
SoundObject version 4.1 provides a sense of up-down and front-back direction using the aforementioned typical GHRTF. As shown in the following video, it achieves quality suitable for music production, along with a sufficient sense of up-down and front-back direction. The front-back sense of direction in particular has improved compared with the prior version, which used impulse responses extracted from the KU 100 and KEMAR dummy heads.
References
[1] Audio Engineering Society, "AES standard for file exchange - Spatial acoustic data file format," AES69-2020, December 2020.
[2] SADIE project, "SADIE II Database," University of York, EPSRC, June 2018. https://www.york.ac.uk/sadie-project/database.html
[3] M. Kohnen, F. Denk, J. Llorca-Bofi, B. Kollmeier, and M. Vorländer, "COAT - Cross-site Oldenburg-Aachener transfer-functions," Zenodo, November 2020. https://zenodo.org/records/4556707
[4] M. Warnecke, S. Clapp, Z. Ben-Hur, D. Lou Alon, S. V. Amengual Garí, and P. Calamia, "Sound Sphere 2: A High-resolution HRTF Database," 2024 AES 5th International Conference on Audio for Virtual and Augmented Reality, August 2024. https://facebookresearch.github.io/SS2_HRTF/






