May 29, 2018

Organic Nature

Impossible Audio in #SecondLife


When it comes to creating an immersive virtual environment, the goal is to allow the participant to suspend disbelief. Unfortunately, in places like Second Life (and SANSAR), even the best-designed sims always seem to forget the basics.

You know what I’m talking about, and you’ve likely experienced this for yourself (or rather, didn’t). You teleport to some highly recommended sim only to find the music blaring 24 hours a day. When you turn the music off, you begin to understand why:

The whole place is eerily dead silent.

So what gives?

Audio immersion is a fundamental design staple of virtual worlds, and there are a few ways that designers either go about it or decidedly do not.

On the low end, we have the typical sim that forgoes immersive audio altogether in favor of sticking a music stream on the parcel. More often than not, you’ll find that these locations are also quite sparse in their overall design, amounting to a shopping mall with no real coherent planning.

Let’s say we happen across a sim that actually took some time to plan out their audioscape. Even here we’ll find that the audio often seems flat and repetitive.

There are plenty of options on the Marketplace when it comes to ambient audio, with the most popular being SoundScapes. What I am about to say is in no way an indication of whether Hastur Piersterson has done a great job with his product. Given the circumstances, I believe the SoundScapes series of ambient audio is just fine for everyday use in your builds, and I’d definitely recommend it.

One thing you’ll notice, however, is that the quality across the SoundScapes lineup seems to be entirely hit or miss. Of course, this is totally subjective, and I’m merely looking at the reactions of customers who give certain audio cubes a 3-star rating while others get 5 stars.

This situation isn’t limited to SoundScapes; it persists across most (if not all) ambient audio systems in Second Life.

Much of the problem stems from the limitations Second Life imposes on audio itself:

  • 44.1 kHz (44,100 Hz) sample rate
  • Mono channel only
  • 10 seconds or less per clip

That doesn’t seem like a lot to work with up front, but if you understand how audio works, it is more than enough for a virtual world, especially once you understand why Second Life insists on mono uploads only.

The first thing we need to understand is that mono tracks are required for Second Life to properly pan audio sources and apply doppler. Well, not entirely… but this is the main stated reason, because this is how Second Life’s audio engine works.

The problem is that when you build ambient audio systems in Second Life, you take these awesome stereo tracks and effectively crush them down to mono, losing what is called “side information” in the process. A mono downmix effectively averages the left and right channels into a single mid channel.

Mono mixes will always sound different to stereo ones, and there is little that you can do about that. On a technical level, the mono mix contains only the 'mid' information whereas the stereo mix has both 'mid' and 'side' information.

The reason a stereo mix 'sounds massive' is because of the quantity and nature of the side signal. If there is a lot of out-of-phase information in the stereo mix it will tend to sound very big, but this information will largely be lost when listening to the mid signal only.

This is why most audio in Second Life sounds “flat” or low quality. We’ve simply stripped out the side information and uploaded the equivalent of an average.
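The mid/side relationship described above is simple arithmetic. Here’s a minimal Python sketch (sample values are made up for illustration) showing exactly what a mono downmix keeps and what it throws away:

```python
# Toy stereo signal: the left and right channels differ, so the
# track carries "side" information. (Values chosen for illustration.)
left  = [0.75, 0.25, -0.5]
right = [0.25, 0.75,  0.5]

# Mid = the per-sample average: this is all a mono downmix keeps.
mid  = [(l + r) / 2 for l, r in zip(left, right)]

# Side = the per-sample difference: this is what mono playback discards.
side = [(l - r) / 2 for l, r in zip(left, right)]

# The original channels can only be rebuilt if BOTH parts survive:
assert left  == [m + s for m, s in zip(mid, side)]
assert right == [m - s for m, s in zip(mid, side)]

print("mid: ", mid)   # [0.5, 0.5, 0.0] -- the "flat" average
print("side:", side)  # [0.25, -0.25, -0.5] -- the depth that gets lost
```

Notice that wherever the channels move in opposite directions, the mid average heads toward zero while the side signal is large; that is the “big” out-of-phase content that vanishes in a mono upload.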

Now, you can get away with some tricks here in Mono and we’ll get to that in a moment. I want to address the 10 second limit first.

When we step up our game (pun intended) and start using ambient audio, maybe with a looped player in a cube (the most common approach), you find that the 10-second loop sounds annoying as hell and fake. It’s simply too short to feel organic or introduce any randomness; you can tell exactly when it loops.

This destroys the immersion.

Years ago when I was in ActiveWorlds, the AWGate world had a looping track of forest and birds. The problem was that this looped every 60 seconds or so. Those bird calls became predictable and ultimately annoying.

We could, of course, implement cubes that play some clips at random to break it up. We get the random crow in the distance or whatever. But let’s stay with our baseline for now.

Ok, so let’s say we step up our game again… this time we’re chaining together multiple 10 second clips to extend that loop further.

Excellent… we’re onto something now. But again, most stop around 30 seconds, or 1-2 minutes on the high end. At the very least, we should be shooting for a 2-minute loop.

But what of the sound quality?

We’re still stuck with this crushed mono track, right? We’ve lost the side information and kept the mid-channel average, and that still makes our audio sound flat.

Curse you Linden Lab!

Hold on… there is a light at the end of this tunnel.

We know that Second Life doesn’t allow stereo files, and that we have to upload mono only, at a maximum length of 10 seconds. But that isn’t necessarily a limitation if you understand audio editing.

So let’s say that we understand now that combining a stereo track into a mono track will effectively lose the side information and flatten our audio.

But what if we isolated the Left and Right channels of a stereo track, and saved them separately as Mono tracks?

Of course, we split them into clips of 10 seconds or less to chain together in-world.

Now we’re sitting with split stereo audio, which Second Life will gladly accept for upload. Those two channels, saved individually, also retain their side information, making them sound bigger when played back together in sync. There’s more depth and spatialization to it – far more than you’d get out of a mono track.
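The editing step above can be automated. Here’s a sketch using Python’s standard wave module that splits a 16-bit stereo WAV into paired mono clips of 10 seconds or less; the function name and the left/right file-naming scheme are my own invention for illustration:

```python
import wave

CLIP_SECONDS = 10  # Second Life's per-sound length limit

def split_stereo(path, out_prefix):
    """Split a 16-bit stereo WAV into paired <=10 s mono clips.

    Writes out_prefix_left_NN.wav / out_prefix_right_NN.wav pairs that
    a controller object can later trigger in sync in-world.
    """
    with wave.open(path, "rb") as src:
        assert src.getnchannels() == 2 and src.getsampwidth() == 2
        rate = src.getframerate()
        frames_per_clip = rate * CLIP_SECONDS
        clip_no = 0
        while True:
            data = src.readframes(frames_per_clip)
            if not data:
                break
            # Each stereo frame is 4 bytes: 2 for left, then 2 for right.
            left  = b"".join(data[i:i + 2] for i in range(0, len(data), 4))
            right = b"".join(data[i + 2:i + 4] for i in range(0, len(data), 4))
            for name, chan in (("left", left), ("right", right)):
                out = f"{out_prefix}_{name}_{clip_no:02d}.wav"
                with wave.open(out, "wb") as dst:
                    dst.setnchannels(1)     # mono, as SL requires
                    dst.setsampwidth(2)     # 16-bit samples
                    dst.setframerate(rate)  # keep the 44.1 kHz rate
                    dst.writeframes(chan)
            clip_no += 1
```

The numbered suffix matters: a clip pair with the same index holds the left and right halves of the same slice of the original stereo track, which is what makes in-world synchronization possible later.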

Now we’re stuck with solving the problem of how to play them simultaneously in Second Life. This is where the solution gets a little more complex, because we can’t just code a single cube and let it go. I mean, you could, but that would be kind of a nightmare, and limiting, because you’d be dumping everything into a single object’s contents.

So let’s say we have our main cube, a controller cube. Its entire purpose is to orchestrate the left and right channels of audio, which live in two other objects, each running a clone of our audio script and listening on an internal channel for the controller to tell them what to do.

You have the left channel audio files in one object, and the right channel audio files in the other object, both listening for the controller cube to tell them when to start and stop.

This should sound eerily similar to a Day/Night system, except instead of two separate loops for day and night, we’re treating the two internal objects like left and right speakers, playing simultaneously with specialized audio that syncs together.
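The orchestration logic is simple enough to sketch. In-world this would be LSL scripts talking over llListen channels; here it’s plain Python purely to show the message flow (the class names, commands, and clip names are all invented for illustration):

```python
class Speaker:
    """One channel object holding either the left or right mono clips."""
    def __init__(self, channel_name, clips):
        self.channel_name = channel_name  # "left" or "right"
        self.clips = clips                # ordered <=10 s clip names
        self.playing = None

    def on_message(self, command, index):
        # Stand-in for an LSL listen() event on the internal channel.
        if command == "PLAY":
            self.playing = self.clips[index]
        elif command == "STOP":
            self.playing = None

class Controller:
    """The controller cube: tells both speakers to start the same clip
    index at the same moment, so the pair stays in sync."""
    def __init__(self, left, right):
        self.speakers = (left, right)
        self.index = 0

    def tick(self):
        # Advance to the next 10 s slice of the chained loop.
        for s in self.speakers:
            s.on_message("PLAY", self.index)
        self.index = (self.index + 1) % len(self.speakers[0].clips)

left  = Speaker("left",  ["amb_left_00",  "amb_left_01"])
right = Speaker("right", ["amb_right_00", "amb_right_01"])
ctrl = Controller(left, right)
ctrl.tick()
# Both speakers now play matching halves of the same stereo slice.
```

The key design point is that the speakers never decide anything themselves; only the controller holds the clip index, so the two channels can’t drift apart logically (network timing in-world is another matter).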

By doing it like this, we side-step the mono midrange problem and retain our side information, making the combined audio seem “bigger” and more robust. It just sounds more natural.

Of course, Second Life will still pan those sounds and apply doppler, because it treats them as two separate mono tracks and doesn’t know they’re correlated.

The purpose of doing it this way is predominantly to retain the side information that gives our audio depth. Now we have something that sounds more natural, and whether Second Life pans them is irrelevant, because they are panned in relation to each other, and that is what counts for retaining the symmetry of the audio. We are effectively doubling the audio information being played back in-world, which to our ears sounds better.

These are what we can refer to as baseline ambient audio systems. Our “first layer” foundation. We build from here to create a totally immersive environment. Once we’ve sorted out the original audio information limitation and solved it, it becomes easier as we build with it.

Yes, we can have multiple cubes synced up, but I’ll be the first to tell you that you actually don’t want to do that. If you have multiple cubes like this playing out of sync, it by default makes your environment seem organic and “random”.

As you move around the environment, they cue up out of sync with each other, but each in sync with itself (if that makes sense).

We can apply some more tricks in the audio editing phase of such a project. What if we applied stereo widening to the track before splitting it up? In-world, it would sound richer and more organic (within reason).
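One common way to widen a stereo track is to boost the side signal relative to the mid before re-deriving the channels. A minimal sketch (the function and sample values are my own, and a real tool would also guard against clipping when samples exceed the valid range):

```python
def widen(left, right, width=1.5):
    """Widen a stereo image by scaling the side signal.

    width = 1.0 leaves the track unchanged; values above 1.0
    exaggerate the left/right differences before splitting.
    """
    mid  = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    wide_l = [m + width * s for m, s in zip(mid, side)]
    wide_r = [m - width * s for m, s in zip(mid, side)]
    return wide_l, wide_r

# Doubling the side signal pushes the channels further apart:
l, r = widen([0.5, 0.25], [0.25, 0.5], width=2.0)
# mid = [0.375, 0.375], side = [0.125, -0.125]
# l   = [0.625, 0.125], r    = [0.125, 0.625]
```

Since the widened left and right tracks are what we then split and upload as the synchronized mono pair, the extra side energy survives into the in-world playback.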

Once we understand how the audio system works, and a bit of audio theory for editing, we should be able to figure out how to get around the limits within reason.

I wouldn’t suggest that we can nail down true binaural audio in Second Life this way. But if we approached it slightly differently, then yes, we actually could.

Let’s say we applied the same technique to a pair of virtual headphones in Second Life. The headphones have two objects, one each for the left and right channels, with the headphones acting as their controller. We apply the same stereo splitting and synchronization as above, but now from a fixed position in relation to the listener.

With this setup, we could replicate full binaural audio in Second Life, albeit in a highly controlled manner. You wouldn’t get real-time panning this way, but in the bigger picture, maybe you could invent a pair of ASMR headphones for anxiety relief that play back a fifteen-minute head massage in binaural?

At this point you should understand that there is a bit of a downside to this, depending on what you’re trying to accomplish.

  • Synchronized Stereo Costs Twice As Much

Developing such a system in Second Life also means your sound cube systems now cost roughly twice as much to make, simply because you have to upload double the audio files.

You therefore wouldn’t want to apply this technique to everything, but instead figure out where this technique would best be suited.

There is also another cost factor when doing this for a Day/Night system. Instead of two sets of mono audio, you’re now using two stereo sets, so the cost of a Day/Night ambient system would run quadruple what a single mono loop would have.
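The arithmetic is easy to sanity-check. Assuming L$10 per sound upload and a 2-minute loop made of twelve 10-second clips (both numbers here are illustrative assumptions), the costs stack up like this:

```python
UPLOAD_FEE = 10       # assumed L$ per sound file uploaded
CLIPS_PER_LOOP = 12   # a 2-minute loop = 12 x 10-second clips

mono_loop      = CLIPS_PER_LOOP * UPLOAD_FEE          # one mono channel
stereo_loop    = CLIPS_PER_LOOP * 2 * UPLOAD_FEE      # left + right pair
day_night_hd   = CLIPS_PER_LOOP * 2 * 2 * UPLOAD_FEE  # two stereo sets

print(mono_loop, stereo_loop, day_night_hd)  # 120 240 480

# A Day/Night stereo system costs four times the single mono loop:
assert day_night_hd == 4 * mono_loop
```

Whatever the actual fee is at upload time, the ratios hold: synchronized stereo doubles the cost, and a Day/Night stereo system quadruples it.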

I suppose in the bigger picture, we’re talking about that up-front cost and investment, and whether the benefits outweigh the costs. I’d imagine we would have to charge a slight premium for these HD Audio systems, but as long as it was still reasonable I think the end-user would still pay for it.

For me, the benefits definitely outweigh the costs. Audio using such a system sounds far better in Second Life than the typical flat mono loops we’re used to. It has better dynamic range, and it sounds less flat and more robust – even though Second Life pans the audio around as you move, that side information is still there and (interestingly) complements it (I’ll get to that in a moment).

One could most definitely add the ability to “widen” the audio field in-world by allowing the end-user to expand or contract the distance between the left and right channels – which is just a fancy way of saying move those two internal cubes farther apart or closer together.

The Cetera Algorithm & HRTF 

The Cetera Algorithm is a reference to Starkey Labs and their hearing aid technology which makes the hearing aid seem invisible to the brain. Cetera removes the barrier between sound and the brain’s ability to process signals, and helps retain the subtle differences in arrival time between left and right ears so that your brain can process positional audio.

In the world of virtual reality, we also refer to this as understanding the acoustic properties of the Head-Related Transfer Function (HRTF), which models 3D sound in both the room and how it arrives at your ears. That pattern of information is processed unconsciously, but it means a lot to the brain when determining position, whether something sounds “real”, and so on.

Let’s take an audio journey as an example:

Synthetic HRTF Audio Test

Of course, this is a massive oversimplification. As far as Second Life is concerned, yes, it can be done, but it likely will not be anytime soon. We’re not talking about simple panning of left and right channels with a mono track, but a panning stereo track, and even then one recorded in a very specific manner. Yes, we can effectively fake it in Second Life to a degree, under very controlled circumstances, but for our purposes here we’re discussing how to at least up the ante with stereo and the extended information at that level.

Suffice it to say, when somebody says that the difference between CD audio and Vinyl is “all in your head”, they don’t quite seem to understand how right they are (for all the wrong reasons).

While we may not get a full HRTF Model in Second Life, we can approximate things a bit. We can also take this information and subtle cues approach to help us further our understanding on how to approach and apply audio in the virtual world, even with our current limitations.

If we know that the extra information is paramount for our brain to process the audio better, then we can look for ways to reasonably retain that information, and those higher frequencies, whenever possible for a more natural listening experience.

Planning Your Scene

Even if we know all these crazy details about human hearing and perception, and create a tool which exploits how Second Life works and effectively doubles the perceptual audio resolution, there is still the understanding that the best tools are only as effective as the person using them.

For instance, we don’t actually want all of these ambient cubes synchronized to each other. With themselves, yes (and for obvious reason). But because of the nature of Second Life itself, and because such a system would invariably have a delay for preloading and so on, it’s not a big deal. It’s actually preferable to have the cubes not synced together, because then they are offset around your sim, playing out of sync with each other and effectively randomizing the soundscape based on where the end-user is located and moving.

The next thing to understand is that we aren’t using these individual cubes as the end-all, be-all. We have to plan a soundscape ahead of time, and include those little details that layer things beyond the baseline.

A random cube that plays crows during the day and owls at night, or woodpeckers, or whatever. That’s a good addition to the baseline.

The real trick here is walking around the sim as you’re building and asking yourself:

What does this sound like?

There’s really no such thing as “silence”. Whether it’s a soda machine’s compressor hum, distant walla in a city (like white noise), a door opening, a bell ringing as you enter a store, whatever… things make noise on their own or when interacted with.

It all adds up.

AES Audio

A lot of this post comes out of a long-term project and R&D from AMG (Andromeda Media Group) in Second Life. One of those projects has been improving audio by thinking of things these systems could really use that we weren’t happy with out of the box.

There is, of course, more to it than “We’ve doubled the audio resolution”, however impressive that may be. Things like Dynamic Crosstalk Suppression (DCS) are also included in our current prototypes.

Should another audio brand in Second Life wish to upgrade their own systems with this information, I wouldn’t mind. Whatever makes the experience in Second Life better overall is a win for everyone.

That being said, I’m not going to explain how we’re pulling off Dynamic Crosstalk Suppression. That’s our little secret.

As a final note, let’s recap how to upgrade our ambient audio systems in Second Life:

  • Stereo Synchronization
  • Understanding Audio Information
  • Using loops measured in minutes, not seconds
  • Optional Stereo Widening before splitting
  • Understanding the circumstances of how it will be heard
  • Dynamic Crosstalk Suppression
  • Optional User Defined Channel Widening
  • Using additional randomized audio to break it up