I started to optimize the subband synthesis with some mild success. If i use psyco and a mono mp3 at 128kps, it 10x slower than realtime. I think I’m going to have to resort to numpy tricks. A short term goal is 100% python realtime decode of an mp3 (super common python modules are fine).
Decoder first pass complete!!
I finished the decoder and it can produce wav file output from an mp3. The good news is that is works, the bad news is that its 34 times to slow to play the audio runtime. (I have a core2 duo running at 1.83ghz.)
Since the spec pdf I found is out of date, I had to resort to looking at a couple mp3 decoders source to get the tables and algorithms. As such the python code is very close in form to the originals. I hate when it feels like im just porting code to python, but the exercise did help me understand the mp3 spec more. I plan to refactor alot of the code that was based on others decoders to make it myown and more python friendly.
Time to start cleaning up the code for readability and speed. Maybe there is enough fat to trim to get the decoding fast enough for direct playback.
Here’s some profile data:
Wed Aug 25 23:19:54 2010 stats.dat 5725333 function calls (5724461 primitive calls) in 139.562 CPU seconds Ordered by: cumulative time List reduced from 413 to 20 due to restriction ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 139.564 139.564 :1() 1 0.021 0.021 139.564 139.564 {execfile} 1 0.003 0.003 139.543 139.543 ./pyMP3.py:10() 1 0.017 0.017 122.649 122.649 ./pyMP3.py:149(decode) 100 0.832 0.008 61.701 0.617 ./pyMP3.py:587(polyphase_synthesis) 7200 60.602 0.008 60.719 0.008 ./pyMP3.py:609(subBandSynthesis) 100 1.555 0.016 29.090 0.291 ./pyMP3.py:543(hybrid_synthesis) 12800 27.348 0.002 27.535 0.002 ./pyMP3.py:563(inv_mdct) 100 17.204 0.172 23.309 0.233 ./pyMP3.py:205(dequantize_samples) 1 0.000 0.000 16.655 16.655 ./pyMP3.py:78(__init__) 1 8.702 8.702 16.589 16.589 ./bitbfr.py:2(__init__) 4052179 7.888 0.000 7.888 0.000 {ord} 739584 5.587 0.000 5.587 0.000 {pow} 100 0.101 0.001 4.206 0.042 ./pyMP3.py:637(decode_main_info) 400 1.438 0.004 4.034 0.010 ./pyMP3.py:672(decode_main_data_huffman) 65073 1.410 0.000 2.558 0.000 ./pyMP3.py:714(huffman_decoder) 100 1.471 0.015 1.471 0.015 ./pyMP3.py:510(antialias_samples) 295978 1.296 0.000 1.296 0.000 ./bitbfr.py:18(read_bits) 1 0.486 0.486 1.005 1.005 ./pyMP3.py:179(write_wav) 100 0.898 0.009 0.898 0.009 ./pyMP3.py:273(process_stereo) Wed Aug 25 23:19:54 2010 stats.dat 5725333 function calls (5724461 primitive calls) in 139.562 CPU seconds Ordered by: internal time List reduced from 413 to 20 due to restriction ncalls tottime percall cumtime percall filename:lineno(function) 7200 60.602 0.008 60.719 0.008 ./pyMP3.py:609(subBandSynthesis) 12800 27.348 0.002 27.535 0.002 ./pyMP3.py:563(inv_mdct) 100 17.204 0.172 23.309 0.233 ./pyMP3.py:205(dequantize_samples) 1 8.702 8.702 16.589 16.589 ./bitbfr.py:2(__init__) 4052179 7.888 0.000 7.888 0.000 {ord} 739584 5.587 0.000 5.587 0.000 {pow} 100 1.555 0.016 29.090 0.291 ./pyMP3.py:543(hybrid_synthesis) 100 1.471 0.015 1.471 0.015 ./pyMP3.py:510(antialias_samples) 400 1.438 0.004 4.034 0.010 ./pyMP3.py:672(decode_main_data_huffman) 65073 1.410 0.000 2.558 0.000 ./pyMP3.py:714(huffman_decoder) 295978 1.296 0.000 1.296 0.000 ./bitbfr.py:18(read_bits) 100 0.898 0.009 0.898 0.009 ./pyMP3.py:273(process_stereo) 100 0.832 0.008 61.701 0.617 ./pyMP3.py:587(polyphase_synthesis) 100 0.797 0.008 0.797 0.008 ./pyMP3.py:460(reorder_samples) 230400 0.518 0.000 0.518 0.000 {abs} 1 0.486 0.486 1.005 1.005 ./pyMP3.py:179(write_wav) 35205 0.318 0.000 0.318 0.000 {numpy.core.multiarray.zeros} 115213 0.281 0.000 0.281 0.000 {method 'write' of 'file' objects} 115209 0.237 0.000 0.237 0.000 {_struct.pack} 7230 0.150 0.000 0.150 0.000 {method 'extend' of 'list' objects}
Main data decoding complete
The scaling factor decoding was fairly straight forward, but the huffman tables gave me fits. I had an error in one of the lookup tables and it was causing the decoder to be off by a couple bits. Argh. I think it is all worked out now. The code is up on bitbucket (the first post). The bit reservoir is still not implemented yet.
The bit reservoir is a cool concept to put more bits where they are needed in the bit stream. This is different from Variable Bitrate. Basically, if a frame completed encoding its data and there is space left over, it can be donated to future frames. The reservoir is only used for the main data section.
The main data is split between scaling factors and coefficients. The coefficients are in the frequency domain and are split into regions ( big value region, one’s region, and zeros) Different coding schemes are used for each region. When a sound signal is converted to the frequency domain, it tends to have most of its energy in the lower bands. Since the values are larger towards the bottom, more bits are used to encode them. After the big value region, most of the values are -1, 0, or 1. That region is called the one’s region and can be represented with few bits. After that, all of the coefficients are 0.
Main data decoding and the joys of bad documentation
I got tripped up trying to make a first pass at decoding the main data section. The MP3 spec doc i got is outdated and is missing information. So the info I have for the main data section comes from random pdfs and mp3 decoder sources.
MP3 frames are composed of a syncword, header, sidedata, and then main data. The main data has 2 parts called granules. Each granule starts with a set of scaling factors and then is followed by huffman encoded coefficients.
To make sure that python code is correct, I started adding printfs to the reference decoder. So far, so good.
Eureka!!
I was trying to dig up more info on the mp3 format and i ran across a great reference site: www.mp3-tech.org One of the docs there is the offical MP3 spec. Now I can code against the reference instead of looking at various decoder implementations
Stage II: Sidedata parsing
The next step is to parse the sidedata. The sidedata has the parameters that are needed to decode the main data that is to follow. Since I don’t have the ISO doc, this is where things get dicey. I have the source of a couple open source mp3 decoders and some random docs on mp3s, but I can’t really vouch for the correctness.
If the protection bit was set in the header, there is a 16 bit CRC word before the start of the sidedata.
pseudo code of sidedata format (primarily based fraunhofer ref source):
main_data_begin = getbits(9) if mono private_bits = getbits(5) else private_bits = getbits(3) for ch from 0 upto num_channels for i from 0 upto 4 scfsi[ch][i] = getbits(1) for gr from 0 upto 2 for ch from upto num_channels part2_3_length[ch][gr] = getbits(12) big_values[ch][gr] = getbits(9) global_gain[ch][gr] = getbits(8) scalefac_compress[ch][gr] = getbits(4) window_switching_flag[ch][gr] = getbits(1) if window_switching_flag[ch][gr] block_type[ch][gr] = getbits(2) mixed_block_flag[ch][gr] = getbits(1) for i from 0 upto 2 table_select[ch][gr][i] = getbits(5) for i from 0 upto 3 subblock_gain[ch][gr][i] = getbits(3) if block_type[ch][gr] == 2 and mixed_block_flag[ch][gr] == 0 region0_count[ch][gr] = 8 else region0_count[ch][gr] = 7 region1_count[ch][gr] = 20 - region0_count[ch][gr] else for i from 0 upto 3 table_select[ch][gr][i] = getbits(5) region0_count[ch][gr] = getbits(4) region1_count[ch][gr] = getbits(3) block_type[ch][gr] = 0 preflag[ch][gr] = getbits(1) scalefac_scale[ch][gr] = getbits(1) count1table_select[ch][gr] = getbits(1)
gr = granule. Each frame has 2 granules each with 18*32 subband samples
main_data_end = negative offset in bytes from next frames header to the end of main data
private_bits = standard has no meaning for these.
scfsi= scale factor selection info. ‘0’ means there are scalefactors for each granule, ‘1’ means one set of scalefactors
part2_3_length=number of bits used for scalefactors and huffman data
big_values= size of big value section
global_gain=quantize step size.
scalefac_compress = used to lookup number of bits in scalefactors
window_switching_flag = true means a block other than normal(0) is used
block_type = used to select window type (len and style)
mixed_block_flag= true if different windows for high and low freq bands
table_select = which huffman table to use.
subblock_gain=offset from global gain for each short block
region0_count = number of scalefactors in region 0 minus one
region1_count = number of scalefactors in region 1 minus one
preflag = if set, a table of values are added to the scalefactors
scalefac_scale = chooses step size of scalefactor quantizer
count1table_select =huffman table selection for “ones” region
Later today I’m going to post python code that decodes the sidedata and also descriptions of each of the values parsed.
Stage I: Sync bits and Headers
Time to get roll’n. First a standard disclaimer. The code is meant to be readable and written quickly. I’ll go back later and optimize. Also I make mistakes, but once discovered I’ll correct them as fast as possible. whew…
MP3s are composed of a series of frames. And each frame starts with a sync word. The sync word is 12 1’s (0xFFF). If reading an MP3 stream, one would wait for the sync word and then try to decode the rest of the header and maybe wait for the next frame to make sure that the syncword that was found was real and not just a random occurrence.
Bit start bit pos | bit length | Description | ||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12 | Frame sync word 0xFFF | ||||||||||||||||||||||||||||||||||
12 | 1 | MPEG version, 0=v2 1=v1 | ||||||||||||||||||||||||||||||||||
13 | 2 | Layer, 00=reserved 01=layer III 10=layer II 11=layer I | ||||||||||||||||||||||||||||||||||
15 | 1 | Protection bit, 0=has 16bit CRC after header 1=no protection | ||||||||||||||||||||||||||||||||||
16 | 4 | Bitrate index
|
||||||||||||||||||||||||||||||||||
20 | 2 | Sampling rate 00=44100 01=48000 10=32000 11=reserved | ||||||||||||||||||||||||||||||||||
22 | 1 | Padding bit 0=no pad 1=padded | ||||||||||||||||||||||||||||||||||
23 | 1 | Private bit | ||||||||||||||||||||||||||||||||||
24 | 2 | Channel mode 00=stereo 01=joint stereo 10=dual channel 11= single | ||||||||||||||||||||||||||||||||||
26 | 2 | Mode extention (joint stereo) | ||||||||||||||||||||||||||||||||||
28 | 1 | Copyrighted? 1=true | ||||||||||||||||||||||||||||||||||
29 | 1 | Original? 1=true | ||||||||||||||||||||||||||||||||||
30 | 2 | Emphasis 00=none 01=50/15ms 10=reserved 11=CCIT J.17 |
I whipped up some code to find the sync and parse the header from a standard MP3 and created a new bitbucket repo.
First blog post ever
Well I decided to take the plunge and try this blogging action. I thought it would be cool to document some of the side projects I’ve been working on. Currently, I’m working on a subSonic client for my N900. To speed up development, all the code has been in python. Now that I’ve got the REST stuff going, I need to play the MP3 stream. This is where the blog comes in. I’m going to write a python based MP3 decoder. Yeah, its a bad idea from a performance standpoint, but from a learning/documenting perspective, it is quick and easy. I finished a h.264 decoder in python about 3 months ago (code up soon), so I have a lot of building blocks ready to go.
OK, time to get started. The MP3 spec 11172-3 cost money. (which makes no sense! its a open spec). So unlike the h264 stuff, I dont have a solid reference. Instead I’m going to break apart the Fraunhofer code and document the decode process.