Exploring Developments in
Immersive 3D Media

by Dave Zimmerman

Posted on May 17, 2018 at 1:30 PM

Virtual Zen Garden

This paper explores the impact of immersive media on the capture, processing and transport of content. It reviews methods of producing immersive content, High Efficiency Video Coding (HEVC, also known as H.265) and spatial audio using the MPEG-H format. It evaluates tools and techniques for working with immersive content. Finally, it considers the constraints of current technology and the impact of 5G networks. Through a review of research reports, industry-related articles and manufacturers’ websites, I aim to understand immersive content production, processing and transmission. Through self-evaluation of consumer products, I intend to capture content and develop skills in the creation of immersive experiences.



The release of free high-definition 3D game development software, affordable 360 cameras and spatial audio microphones has made the production of immersive content more accessible to hobbyists and professionals alike. With a motion-tracking headset, immersive content can create a sense of presence by delivering not only the content the user is experiencing, but also content the user may not experience because of their position and direction at any given moment. More efficient encoding methods are being used to transmit immersive content, but its high bandwidth requirements, along with the advent of Internet of Things (IoT) systems, are placing greater strain on current telecommunication networks. As the proliferation of mobile and smart devices increases the amount of data in transit, 5G networks are being developed to deliver time-sensitive content rapidly to a growing number of devices and to support the transition from existing networks to new network technologies.


Immersive Video Content Production

Although 3D video and virtual reality have popularised immersive content, its development is not new or exclusive to digital technology. Since the early days of photography, photographers have been experimenting with ways to bring the viewer deeper into the experience of a particular location, creating a sense of immersion. Wide-angle zoom lenses capture a wider field of view than standard zoom lenses, but they distort the image, bending edges that should be straight.

Typically, the human eye can receive nearly 180 degrees of visual information. A technique known as “stitching” links the edges of multiple standard zoom lens photos, combining several separate images into one seamless image that greatly exceeds the nearly 180 degrees of the viewer’s field of vision, thereby creating a sense of immersion. The photograph below is an early example of photo stitching.


View of Madison, Ind. Indiana Madison United States (Gorgas & Mulvey, ca. 1866)


To understand more about immersive content development, I took a course on VR and 360 Video Production with www.Coursera.org. There, I learned how 360 video is used in VR, the VR preproduction pipeline, 360 video production, postproduction and publishing. After completing the course, I conducted an evaluation of a 360 camera. With a limited budget, I was able to purchase a minimal amount of equipment to create a 360 photosphere, including…

  • PIXPRO SP360 Camera and Accessories Pack

  • Telescopic Monopod

  • Triple Axis Spirit Level

PIXPRO SP360 Camera

Monopod, Spirit Level & Camera Housing

All Components Assembled


Using the monopod, I maintained the height of the camera. With the spirit level, I checked the horizontal and vertical alignment before capturing the image. After capturing the first 180 degrees, I rotated the monopod 180 degrees to capture the other side. To offset the gap made in the stitching process, I repositioned the camera, compensating for the distance of the camera to the monopod created by the camera mount.

The stitching process is essential for creating fully immersive visual content. The two hemispherical images were transferred from the camera to a computer. Using stitching software, I was able to combine the two images into one 360 panorama.


PIXPRO Stitch software


The PIXPRO camera comes with PIXPRO Stitch software. The software takes the two images and allows the designer to adjust their positioning to reduce the effect of gaps in 360 image stitching. I repositioned both images for minimal disparity between the edges. Once the images are positioned to a satisfactory degree, they are saved together as a single equirectangular image. This image can be viewed in 360 image viewers and modern web browsers, and imported into 3D game development software.
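As a rough illustration of what an equirectangular image stores, the sketch below maps a viewing direction (yaw and pitch, in degrees) to a pixel position. The function name, the axis conventions and the image size are assumptions made for illustration; this is not a description of how PIXPRO Stitch works internally.

```python
def direction_to_pixel(yaw_deg, pitch_deg, width, height):
    """Map a viewing direction to (x, y) in an equirectangular image.

    Assumed convention: yaw runs from -180 to 180 with 0 at the image
    centre, pitch runs from -90 to 90 with +90 (straight up) at the top row.
    """
    # Longitude is spread linearly across the full width (360 degrees).
    x = (yaw_deg + 180.0) / 360.0 * width
    # Latitude is spread linearly across the full height (180 degrees).
    y = (90.0 - pitch_deg) / 180.0 * height
    # Clamp so the far edges still land on a valid pixel index.
    px = min(int(x), width - 1)
    py = min(int(y), height - 1)
    return px, py
```

Because the image covers 360 degrees horizontally and 180 degrees vertically, equirectangular images normally have a 2:1 aspect ratio, for example 4096 by 2048 pixels.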


360 equirectangular image of a Zen Garden


Even though I compensated for the offset created by the camera rig, I still found some artefacts within my image. The overlap of the image is visible on the left side of the wooden stand and in the floor below it. Perfect calibration of the two images was not possible without photo editing software.

Example of Stitching Artefact


Digital photography has made the capturing of panoramic images extremely simple, reducing labour, cost and development time. Portable 360 cameras such as the Kodak PIXPRO have brought immersive visual content production to the masses. Usually, these devices consist of a single fisheye lens, which records one hemisphere of content. To create a full 3D photo sphere, a multi-camera mount is needed. The multi-camera mount ensures that the images captured are consistently and evenly positioned, making the images easier to stitch together.


Kodak PixPro SP360 4K – Dual Pack


With immersive content technology developing rapidly, newer, higher resolution cameras are coming onto the market regularly. The trend in immersive content production is to make live streaming of high-definition audio/video content as portable and user friendly as possible. To that end, new 360 cameras are being developed with stitching software built into them. For example, the Pilot Era Professional 8K VR panoramic camera has four lenses set at 90-degree intervals, which record overlapping high-definition (4K, 6K and 8K) images, and it records ambisonic sound using four microphones. The device stitches, encodes to H.265 or H.264 in the MP4 file format, and streams over 4G at 2 Mbps to 15 Mbps in near real time.


Pilot Era


Immersive Content Processing Methods

Efficient methods of transmitting content are required for live broadcast and streaming services. With respect to 3D and VR, 4K is the standard display resolution. Service-compatible transmission of stereoscopic video is the ideal method of delivery, since there is no downsampling (combining and compressing). The drawback of this method is that the decoder must run at twice the frame rate to avoid an impact on the user’s quality of service (QoS).

Because of the high bitrate that 4K transmission requires, service-compatible transmission may not be practical for users with limited data connections. To accommodate this, a frame-compatible approach is usually employed for the transport of immersive content. The frame-compatible method downsamples the left and right images into a single frame for transport. When the frame is decoded, the media player upsamples (decompresses and separates) the image and sends the appropriate image to the intended view. The negative impact on QoS is low for normal viewing, but because 4K VR displays sit so close to the eye, the user might detect blurriness in the image. With 8K resolution, two 4K images can be transferred with the frame-compatible method to a VR headset displaying one 4K image to each eye.
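The frame-compatible idea can be sketched in a few lines. The example below assumes a side-by-side layout, frames stored as lists of pixel rows with an even width, and the crudest possible resampling (dropping and repeating columns). The function names are hypothetical, and real encoders filter the image before decimating.

```python
def pack_side_by_side(left, right):
    """Downsample two equal-sized frames horizontally and pack them into one.

    Each frame is a list of rows; each row is a list of pixel values.
    Downsampling here simply keeps every second column.
    """
    packed = []
    for left_row, right_row in zip(left, right):
        packed.append(left_row[::2] + right_row[::2])
    return packed


def unpack_side_by_side(packed):
    """Split a packed frame and upsample each half back to full width.

    Upsampling repeats every pixel twice (nearest neighbour), so detail
    discarded during packing is not recovered.
    """
    left, right = [], []
    for row in packed:
        half = len(row) // 2
        left.append([p for p in row[:half] for _ in range(2)])
        right.append([p for p in row[half:] for _ in range(2)])
    return left, right
```

Packing the rows [1, 2, 3, 4] and [5, 6, 7, 8] gives [1, 3, 5, 7], and unpacking returns [1, 1, 3, 3] and [5, 5, 7, 7]. The lost columns are the source of the blurriness a viewer may notice on a display held close to the eye.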

Immersive content should make use of High Efficiency Video Coding (HEVC), also known as H.265. While H.264 made OTT services such as Netflix possible and revolutionised YouTube, H.265 can deliver image quality equal to or greater than that of H.264 at roughly half the bitrate. As Encoding.com explains…

Source video, consisting of video frames, is encoded or compressed by an HEVC video encoder to create a compressed video bitstream. Each individual frame is first broken up into blocks of pixels… These are encoded via motion vectors that predict qualities of the given block on the next frame..

Encoding Intelligence, https://www.encoding.com/h-265/

This higher compression would result in approximately a 35% reduction in bandwidth, allowing higher quality video to be displayed on devices with lower bitrate connections. This higher efficiency would also have an impact on IP transport for OTT services, reducing overall costs of transport. With the demands that immersive content puts on networks, the reduced bitrate and higher image quality of HEVC make it the right solution.
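The saving can be worked through with simple arithmetic. The sketch below applies the 35% figure quoted above to a hypothetical 15 Mbps H.264 stream; the function name and the 15 Mbps value are illustrative assumptions, not measurements.

```python
def reduced_bitrate(original_mbps, reduction):
    """Return the bitrate left after a fractional reduction (0.35 means 35%)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return original_mbps * (1.0 - reduction)


# Hypothetical 15 Mbps H.264 stream with the 35% reduction quoted above.
hevc_mbps = reduced_bitrate(15.0, 0.35)
print(f"{hevc_mbps:.2f} Mbps")  # prints "9.75 Mbps"
```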


Immersive Audio Content Production

Audio is essential for deeply immersive content. Rendering systems for audio include channel-based, object-based and scene-based systems. Each system works with audio signals and output devices differently. When used together, each system can contribute to the production of immersive spatial audio.

Channel-based audio is produced when the audio signals are mixed and processed using a Digital Audio Workstation (DAW) with the user speaker setup clearly defined. This is a well-established method of sound production, widely used for 5.1 surround sound and stereo recordings.



Object-based audio comprises virtual audio sources which can change position independently of any channel. This approach is used in 3D video games, where attenuation, delay and occlusion are applied to sound objects based on the position and rotation of a virtual sound receiver. It is also used in multi-track recording, where spatial positioning is determined through panning software before the tracks are mixed down to stereo or surround sound.
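A minimal sketch of how a game engine might derive attenuation and delay for one sound object is shown below. It assumes a simple inverse-distance law and a fixed speed of sound. The function name and the min_distance parameter are hypothetical, and occlusion and listener rotation are left out.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at 20 degrees C


def object_gain_and_delay(source_pos, listener_pos, min_distance=1.0):
    """Return (gain, delay_seconds) for a sound object heard by a listener.

    Positions are (x, y, z) tuples in metres. Gain follows an inverse
    distance law and is clamped to 1.0 inside min_distance.
    """
    distance = math.dist(source_pos, listener_pos)
    gain = min_distance / max(distance, min_distance)
    delay = distance / SPEED_OF_SOUND_M_S
    return gain, delay
```

For example, a source 5 metres from the listener is played at one fifth of its original amplitude and arrives roughly 15 milliseconds late.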



Scene-based audio uses ambisonic spatial audio. Spatial audio microphones use coincident pairs of directional microphones to record quadraphonic sound across Mid-Side (MS) and the XY axes (Forward/Backward/Left/Right) and encode them to render in separate audio channels. This form of audio can take advantage of surround sound hardware and software to create 3D audio sound fields.
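To show what a scene-based representation holds, the sketch below encodes a single mono sample into the four first-order ambisonic components (W, X, Y, Z). It assumes the traditional B-format weighting and angle conventions stated in the docstring. It is an illustration of the general technique, not a description of how any particular microphone encodes its capsules.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Encode a mono sample into first-order B-format (W, X, Y, Z).

    Assumed convention: traditional B-format weighting, azimuth measured
    anticlockwise from straight ahead, elevation measured upwards.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)               # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)  # front/back component
    y = sample * math.sin(az) * math.cos(el)  # left/right component
    z = sample * math.sin(el)                 # up/down component
    return w, x, y, z
```

A source straight ahead contributes only to W and X, while a source directly to the left contributes only to W and Y.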



To understand more about immersive content development, I conducted an evaluation of a ZOOM H2n spatial audio microphone. The ZOOM H2n field recorder is the industry standard for scene-based ambisonic recording. It uses unidirectional and bidirectional microphones to record 2-channel and 4-channel audio. Although the device was not originally intended for spatial audio recording, its firmware can be updated to support it.



Spatial Audio Mode


Spatial audio was recorded using the Zoom H2n, placing the microphone where I took the 360 images. The sample was saved for further editing in Reaper, with the Ambisonic Toolkit plug-ins.

Audio recorded in the spatial audio setting can be imported into DAWs such as Reaper, and with the Ambisonic Toolkit plug-ins, the audio can be further modified to suit the needs of the user. Reaper is used for Wave Field Synthesis (WFS), to observe the impact of gain and speaker positioning on a sound-field and make modifications to suit the project.


Wave Field Synthesis using Reaper and the Ambisonic Toolkit


For deeper immersion, spatial audio can give the user a greater sense of sound localisation within a virtual environment. The MPEG-H format allows for the rendering of audio signals across a wide range of speaker configurations, from stereo to 22.2 surround and greater. This allows speaker arrangements to include not only horizontal positions around the listener but also positions above and below, making a more realistic 3D soundscape.

The diagram below shows the system architecture of the MPEG-H Audio Decoder. The MPEG-H format uses Unified Speech and Audio Coding (USAC), a codec with high compression for speech, music and any mixture thereof, to decode for channel-based, object-based, Spatial Audio Object Coding (SAOC) and Higher Order Ambisonic (HOA) scene-based material.


Top level block diagram of MPEG-H 3D Audio decoder.


From the MP4 file format, the MPEG-H layer is processed, de-multiplexing the signal into audio material that can be further modified by the user. This has direct application within immersive audio, since MPEG-H can render the different audio signals with a user interface which adds interactivity to the playback of the audio, using the positioning of a VR headset to determine sound source localisation before mixing and output to loudspeakers or headphones. This allows for deeply immersive experiences and compelling 3D video content.
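The role of head tracking can be illustrated with one small calculation: the direction of a sound source relative to where the listener is currently facing. The function below is a hypothetical helper, not part of any MPEG-H renderer interface, and it considers rotation about the vertical axis only.

```python
def relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Return the source direction relative to where the head is facing.

    Both angles are in degrees and measured in the same direction around
    the vertical axis. The result is wrapped to the range [-180, 180).
    """
    return (source_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0
```

When the listener turns to face the source, the relative azimuth becomes zero, and the renderer can place the sound directly in front for output to loudspeakers or headphones.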


Immersive Content Transport Methods

While more efficient codecs have improved compression of content and alleviated some of the congestion on networks, the number of devices connecting to networks is increasing at a staggering rate. With IoT devices like smart home appliances, driverless cars and wearable technology, unmanaged systems will rely heavily on 5G networks to accommodate the increased need for connectivity. To meet this need, a range of technologies will be implemented in 5G, including edge computing, millimetre wave (MMW) systems and heterogeneous networks. For 5G to work, Dat et al. state that it should be able to perform as follows…

It should be capable of supporting a higher number of simultaneously connected devices, better coverage, higher spectral efficiency, lower battery consumption, lower outage probability, lower latencies, lower infrastructure deployment costs, and higher reliability of communications.

Dat, P. T., Kanno, A. & Yamamoto, N., 2015. 5G Transport and Broadband Access Networks: The Need for New Technologies and Standards. n.p., ITU, pp. 175 - 182.

Edge computing allows data to be processed closer to the device, at the edge of the network, so that it does not need to travel to the cloud. This will reduce latency, especially in places with high volumes of wireless devices, such as a stadium full of smartphone users recording, uploading and streaming in real time. Edge computing can take many bandwidth-heavy processes off the cloud and make time-sensitive procedures like real-time data processing and transport more reliable. In the case of driverless cars, the resulting reduction in latency could be life saving, as the car’s AI makes split-second decisions.
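The latency benefit can be estimated from propagation delay alone. The sketch below assumes signals travel through optical fibre at roughly two thirds of the speed of light and ignores queuing, processing and radio access delays. The two distances are hypothetical values chosen for illustration.

```python
SPEED_IN_FIBRE_M_S = 2.0e8  # roughly two thirds of the speed of light in vacuum


def round_trip_ms(distance_km):
    """Propagation-only round-trip time in milliseconds over optical fibre."""
    distance_m = distance_km * 1000.0
    return 2.0 * distance_m / SPEED_IN_FIBRE_M_S * 1000.0


edge_ms = round_trip_ms(10.0)     # hypothetical edge node 10 km away
cloud_ms = round_trip_ms(1000.0)  # hypothetical data centre 1000 km away
print(f"edge: {edge_ms:.1f} ms, cloud: {cloud_ms:.1f} ms")
```

Under these assumptions the edge node answers in about 0.1 ms of propagation time, compared with about 10 ms for the distant data centre.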

MMW systems would increase bandwidth by opening up frequencies in the 30 GHz to 300 GHz range. While this increase in bandwidth is welcome, it does have limitations. The waves have a shorter range and are attenuated by atmospheric moisture, which reduces the range even further.

The short range can be a positive feature for processes that use edge computing such as IoT devices. The data does not need to travel long distances to core networks, so it is less likely to be intercepted, thereby increasing security.
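The shorter range of millimetre waves can be illustrated with two standard formulas: wavelength equals the speed of light divided by frequency, and free-space path loss grows with both distance and frequency. The sketch below is a simplified model that ignores atmospheric moisture and obstacles; the function names are chosen for illustration.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def wavelength_mm(frequency_hz):
    """Wavelength in millimetres for a given frequency."""
    return SPEED_OF_LIGHT_M_S / frequency_hz * 1000.0


def free_space_path_loss_db(distance_m, frequency_hz):
    """Free-space path loss in dB, ignoring moisture and obstacles."""
    ratio = 4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT_M_S
    return 20.0 * math.log10(ratio)
```

At 30 GHz the wavelength is just under 10 mm and at 300 GHz it is just under 1 mm, which is where the name comes from. Over the same distance, a 30 GHz signal suffers 20 dB more free-space loss than a 3 GHz signal.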

5G will solve many problems that come from congested data streams. With heterogeneous networks, intelligent Wi-Fi offloading can balance out data streams over legacy cellular and Wi-Fi bands as well as MMW bands. These networks will manage the handover of information to different networks based on context. This distribution process aims to make the transport of data more efficient and reliable, while maintaining high QoS. To this point, Dat et al. recommend that it provide the following services.

It should… be able to provide a variety of services with various features, including machine to machine, Internet of Things (IoT), delay-sensitive services such as real-time 8K ultra high-definition video, and other new services.

Dat, P. T., Kanno, A. & Yamamoto, N., 2015. 5G Transport and Broadband Access Networks: The Need for New Technologies and Standards. n.p., ITU, pp. 175 - 182.

The next generation of mobile devices will transform the way we work and live. For example, large, densely packed masses of people will be able to live stream 8K immersive audio/video while riding in driverless cars. Further innovation in this field will impact automation, with AI managing many IoT devices; systems that are currently managed by people will soon not require such management. With this in mind, it is unknown how much 5G networks will affect employees, but if impacted employers do not adapt, they could suffer unforeseen losses.



In exploring the capture, processing and transport of immersive content, I have discovered factors that are driving innovation. From the limitations of the current systems and the demands of industry and the market, to human curiosity and the pursuit of more authentic simulations of physical phenomena, immersive content technology continues to grow and mature. My evaluation has shown that although the technology is widely available, there is an immense amount of knowledge and skill involved in the creation of this content. Through practical application of the concepts and technology explored, I have also gained a greater appreciation for spatial audio and 360 video production.


Google Pixel, Headset and Controller


To deepen my understanding, and further my skills and abilities, I purchased a Google Pixel smartphone, headset and controller. With the Pixel’s Photosphere camera feature and its native stitching software, I am able to create equirectangular images and view them in an immersive setting anywhere at any time. I look forward to using this technology to create and share unique immersive experiences.