Content dump from a previous post about natural language processing systems (bonus at the end: listen to a cloned Boba Esponja [Spanish SpongeBob] voice in Latin American Spanish)
Phonology is the study of the sound system of a language. It is concerned with the way sounds are produced, transmitted, and received, and with the way they are combined to form words and utterances. Deep learning text-to-speech systems make use of phonological knowledge to convert text into speech. The goal of speech synthesis is to produce speech that sounds natural and intelligible. To do this, the system must be able to generate the correct sounds for the words and utterances it is asked to produce. This is learned from training data: high-quality audio recordings paired with their text transcripts (the same kind of paired data used for speech-to-text).
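As a concrete (hypothetical) example of what that paired data looks like, here is a minimal sketch that reads an LJSpeech-style dataset: a `wavs/` folder plus a pipe-separated `metadata.csv` mapping each clip to its transcript. The dataset folder name is a placeholder.

```python
from pathlib import Path

DATA_DIR = Path("my_voice_dataset")  # hypothetical dataset folder

def load_pairs(data_dir):
    """Yield (wav_path, transcript) pairs from an LJSpeech-style metadata.csv.

    Each line looks like: file_id|raw transcript|normalized transcript
    """
    for line in (data_dir / "metadata.csv").read_text(encoding="utf-8").splitlines():
        file_id, _raw, normalized = line.split("|")
        yield data_dir / "wavs" / f"{file_id}.wav", normalized

for wav_path, text in load_pairs(DATA_DIR):
    print(wav_path, "->", text)
```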
At a very abstract level, the Tacotron 2 model first converts text into a sequence of phonemes (basic units of sound). The phoneme sequence goes into an encoding function, which compresses it into a 'special representation code'; that code then goes into a decoding function, which attempts to reconstruct it in a new form [a predicted spectrogram]. Then, WaveNet simply maps that spectrogram to sound data. Its underlying theory lies in the joint probability distribution of a waveform, i.e., learning the probability distribution of data points in order to generate new data points, where 'joint' means each new data point depends on the previous ones (e.g., 'the quick brown fox jumped over the lazy __' [it is likely the __ is 'dog']). WaveNet does exactly that, but with audio samples instead of words.
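To make the encode/decode step concrete, here is a toy sketch of the idea (not the real Tacotron 2, which also uses an attention mechanism and is far larger): phoneme IDs are compressed into a hidden representation, and a decoder unrolls that representation one spectrogram frame at a time. All layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

N_PHONEMES, EMB, HID, N_MELS = 60, 64, 128, 80  # illustrative sizes

class ToyTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_PHONEMES, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)    # text -> 'representation code'
        self.decoder = nn.GRU(N_MELS, HID, batch_first=True)  # unrolls frames from that code
        self.to_mel = nn.Linear(HID, N_MELS)

    def forward(self, phoneme_ids, n_frames):
        # Encoding function: compress the phoneme sequence into a hidden state
        _, state = self.encoder(self.embed(phoneme_ids))
        # Decoding function: start from a silent frame and predict one frame at a time
        frame = torch.zeros(phoneme_ids.size(0), 1, N_MELS)
        frames = []
        for _ in range(n_frames):
            out, state = self.decoder(frame, state)
            frame = self.to_mel(out)
            frames.append(frame)
        return torch.cat(frames, dim=1)  # predicted spectrogram: (batch, n_frames, N_MELS)

model = ToyTTS()
mel = model(torch.randint(0, N_PHONEMES, (1, 12)), n_frames=50)
print(mel.shape)  # torch.Size([1, 50, 80])
```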
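And here is a minimal sketch of WaveNet's core idea, the joint probability factorization p(x) = p(x_1) · p(x_2 | x_1) · ... · p(x_T | x_1, ..., x_{T-1}): stacked dilated causal convolutions look only at past samples and output a probability distribution over the next one. The layer sizes are illustrative assumptions (the real model quantizes audio to 256 levels with mu-law companding, but is much deeper and also conditions on the spectrogram).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

Q = 256  # audio quantized to 256 levels, as in the WaveNet paper

class CausalConv(nn.Conv1d):
    """1D convolution that only sees past samples (via left-padding)."""
    def forward(self, x):
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(F.pad(x, (pad, 0)))

class ToyWaveNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(Q, 32)
        # Stacked dilations widen the window of past samples each output can see
        self.layers = nn.ModuleList(
            CausalConv(32, 32, kernel_size=2, dilation=d) for d in (1, 2, 4, 8)
        )
        self.to_logits = nn.Conv1d(32, Q, kernel_size=1)

    def forward(self, samples):                  # samples: (batch, T) int64
        h = self.embed(samples).transpose(1, 2)  # -> (batch, 32, T)
        for layer in self.layers:
            h = torch.relu(layer(h))
        # Logits at step t give p(next sample | samples up to t)
        return self.to_logits(h)                 # (batch, Q, T)

model = ToyWaveNet()
x = torch.randint(0, Q, (1, 16))     # 16 past audio samples
logits = model(x)[:, :, -1]          # distribution over the *next* sample
next_sample = torch.distributions.Categorical(logits=logits).sample()
print(next_sample.item())
```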
Bonus: Since everything is just statistics and probability, the same approach can replicate any voice. A cool thing about statistical natural language processing is that it's very dynamic: if you have English data, it speaks English; if you have Spanish data, it speaks Spanish (i.e., if you have data in language X, it speaks X). I demonstrate this with the English and Latin American Spanish versions of SpongeBob.