November 12, 2014

Nimble Streamer performance tuning

Nimble Streamer was built with high performance in mind. As a native Linux application, Nimble is fast and consumes few resources. This is why it is chosen for robust high-availability infrastructures and high-speed transmission use cases.

Our customers often ask which hardware configuration best fits Nimble Streamer. They also want to know what their existing hardware is capable of when running Nimble.

The description below covers these and other aspects of Nimble Streamer performance tuning. We will refer to the Nimble config and its parameters, which are described in the Nimble Streamer configuration article.

Calculating RAM amount for live streaming cache


The parameter that most often influences the streaming process is the amount of RAM available for caching. This memory is used to store HLS chunks.

For each stream, Nimble stores 4 chunks in cache. Once a chunk drops out of the playlist, it is kept for a timeout of 45 seconds. So the cache additionally stores several more chunks, and their number depends on the chunk duration (rounded down). With 6-second chunks this is 4 + 45/6 = 4 + 7 = 11 chunks. With 10-second chunks it is 4 + 45/10 = 4 + 4 = 8 chunks.
The memory consumed by those chunks depends on the bitrate:

RAM size (bytes) = number of chunks * chunk duration (seconds) * bitrate (bits per second) / 8

For example, a 1Mbps stream with 6-second chunks needs 8.25MB. If you have an ABR HLS stream with 512Kbps, 1.5Mbps and 2Mbps renditions and 10-second chunks, your cache would be around 40MB.

Notice that this number does not depend on the number of simultaneous viewers.
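The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula, not Nimble code: the function names are made up for this example, and it assumes decimal units (1 Kbps = 1000 bps, 1 MB = 10^6 bytes) and rounding down of the timeout-to-duration ratio, as in the numbers above.

```python
def cached_chunks(chunk_duration_s, playlist_chunks=4, timeout_s=45):
    """Chunks kept per stream: playlist chunks plus those still within the timeout."""
    # Integer division mirrors the article's 45/6 -> 7 and 45/10 -> 4.
    return playlist_chunks + timeout_s // chunk_duration_s


def cache_bytes(bitrates_bps, chunk_duration_s):
    """RAM size (bytes) = chunks * chunk duration * bitrate / 8, summed over renditions."""
    chunks = cached_chunks(chunk_duration_s)
    return sum(chunks * chunk_duration_s * bps / 8 for bps in bitrates_bps)


single = cache_bytes([1_000_000], 6)                   # 1Mbps, 6-second chunks
abr = cache_bytes([512_000, 1_500_000, 2_000_000], 10)  # ABR, 10-second chunks
print(f"{single / 1e6:.2f} MB")  # 8.25 MB
print(f"{abr / 1e6:.2f} MB")     # 40.12 MB
```

The viewer count does not appear anywhere in the function arguments, which matches the note above.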

In addition to the cache size, you also need to consider the RAM which your OS will take for network processing. As you can see from this real-life example, the OS itself may take a lot more than the Nimble Streamer instance.

The RAM cache size parameter is set via the web UI.

CPU-related tuning for high load


Nimble's CPU consumption is very low. However, streaming at high speeds requires additional CPU resources. This is why we introduced the worker_threads parameter in the config, which defines how many threads handle the streaming.

On average, a single core can handle up to 4 Gbps. If you need more bandwidth, increase the number of workers accordingly. The number of worker threads should not be larger than the total number of processor cores. Our advice is to keep it as low as possible for the expected bandwidth. E.g. if you have a 10 Gbps NIC, you should set workers to 3. If you won't have more than 4 Gbps, keep the default value.
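As a rough illustration of this rule of thumb, the sketch below picks a worker count from the expected bandwidth. The function name and the constant are hypothetical, the 4 Gbps per core figure is the average quoted above rather than a guarantee, and the example assumes that one worker is the minimum.

```python
import math

GBPS_PER_WORKER = 4  # average per-core figure from the text above


def suggested_worker_threads(expected_gbps, cpu_cores):
    """Lowest worker count that covers the expected bandwidth, capped by core count."""
    needed = max(1, math.ceil(expected_gbps / GBPS_PER_WORKER))
    return min(needed, cpu_cores)


print(suggested_worker_threads(10, 8))  # 3
print(suggested_worker_threads(4, 8))   # 1
```

If the cap by core count kicks in, the server simply does not have enough cores for the expected bandwidth under this estimate.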

In addition, make sure your server can process all available bandwidth by dividing IO interrupts between cores. This is not part of Nimble itself but a matter of server hardware configuration.

VOD streaming cache control


VOD streaming also requires cache settings. As with live streaming, the cache is located in RAM by default. In addition, when the RAM cache is full, the VOD cache starts using the file system. So if you're doing VOD, use the "Max disk cache size" parameter to set the maximum size of the VOD cache. This is done via the web UI. You may also need the vod_cache_timeout and vod_cache_min_storage_time parameters in the config file to control the TTL of cached chunks.

Please read the "Improving cache control for VOD re-streaming" article to learn how to control the VOD streaming cache.

Examples


Let's look at some common questions from our customers.

How many connections can be handled on a 2 CPU, 512MB RAM server with a 1.2Mbps stream and a 1Gbps network?
A 1.2Mbps stream requires about 10MB of RAM cache. Also, 1Gbps does not require changing the worker threads parameter or adding an extra CPU. So the only limit in this case is the network speed. For the given bitrate, the channel will handle about 830 connections.
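The connection limit here is just the link speed divided by the stream bitrate. A minimal sketch, using a hypothetical helper name, decimal units, and ignoring protocol overhead:

```python
def max_connections(link_bps, stream_bps):
    """Upper bound on simultaneous viewers when the network is the only limit."""
    return link_bps // stream_bps


print(max_connections(1_000_000_000, 1_200_000))  # 833
```

The exact quotient is 833, which the answer above rounds to about 830.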

Handling 10 streams at 512Kbps each for approximately 10K viewers: what hardware do I need for that?
From the RAM cache perspective, this will take around 42MB. However, 10000 viewers will produce 5.1 Gbps of bandwidth (i.e. transmission speed). This definitely needs an additional CPU core (making it 2 CPU cores in total) and the worker_threads parameter set to "2".
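The numbers in this answer can be reproduced with the sketch below. The function name is hypothetical, units are decimal, and the 6-second chunk duration is an assumption: the question does not state it, but it is the duration for which the formula gives roughly 42MB.

```python
import math


def size_for_audience(viewers, streams, bitrate_bps, chunk_s):
    """Return (egress in Gbps, worker threads, cache in MB) for identical streams."""
    egress_gbps = viewers * bitrate_bps / 1e9
    workers = max(1, math.ceil(egress_gbps / 4))  # about 4 Gbps per core on average
    chunks = 4 + 45 // chunk_s                    # playlist chunks + timed-out chunks
    cache_mb = streams * chunks * chunk_s * bitrate_bps / 8 / 1e6
    return egress_gbps, workers, cache_mb


egress, workers, cache = size_for_audience(10_000, 10, 512_000, chunk_s=6)
print(f"{egress:.2f} Gbps, {workers} workers, {cache:.2f} MB cache")
# 5.12 Gbps, 2 workers, 42.24 MB cache
```

Note that the viewer count drives bandwidth and worker threads, while the cache size depends only on the streams themselves.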

The ABR stream has 10-second chunks for 240, 360, 480 and 512 Kbps bitrates. What is the cache size for that?
The sum of bitrates is about 1.6Mbps. 8 chunks of 10 seconds each at that bitrate, divided by 8 bits per byte, make 16MB of cache. You can have 10 streams like that and still not reach the limit of the cheapest 512MB RAM virtual machine. With an average bitrate of around 400Kbps, a 10Gbps network will allow around 25000 simultaneous connections, but that will depend on the popularity of each bitrate.
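To see how the popularity of bitrates changes the connection estimate, here is a small sketch that weights each rendition by its share of viewers. The function name and both viewer mixes are hypothetical and only serve as an illustration; protocol overhead is ignored.

```python
def connections_for_mix(link_bps, weight_by_bitrate):
    """Estimate simultaneous connections for a given rendition popularity.

    weight_by_bitrate maps a bitrate in bps to a relative viewer weight.
    """
    total_weight = sum(weight_by_bitrate.values())
    avg_bps = sum(bps * w for bps, w in weight_by_bitrate.items()) / total_weight
    return int(link_bps // avg_bps)


even = {240_000: 1, 360_000: 1, 480_000: 1, 512_000: 1}
top_heavy = {240_000: 1, 360_000: 1, 480_000: 3, 512_000: 5}

print(connections_for_mix(10_000_000_000, even))       # 25125
print(connections_for_mix(10_000_000_000, top_heavy))  # 21739
```

With an even split the average bitrate is 398Kbps, which gives the figure of around 25000. When most viewers pick the higher bitrates, the same link serves fewer connections.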


Contact us if you have any further questions or need some advice on Nimble Streamer usage.

Related documentation


Live Streaming scenarios in Nimble, Build streaming infrastructure with Nimble Streamer, Utilizing 10Gbps bandwidth with a single Nimble instance, Nimble Streamer API, Nimble Streamer configuration, VOD cache control, Live Transcoder for Nimble Streamer

3 comments:

  1. Hi, in your last example, where you mentioned ABR streaming, I am a bit confused, unless it's a typo on your part. You mentioned that for an average 400Kbps bitrate, 10Gbps would serve 2500 concurrent connections, whilst calculation shows it should be 25000; in other words, for 2500 concurrent connections 1Gbps would be sufficient.
    Please correct me if I am wrong.

    Replies
    1. Hello!
      That's right - it was a typo and it's been fixed, thank you very much!

    2. Sigh! Good to read that it's rectified, and on top of that your immediate response is much appreciated. Shows your level of customer support ;-)
