Below is information on how VOCALOID voicebanks are developed, based on what the studios have revealed about the software. Note that this is not a guide on how to create an actual VOCALOID. Please consider alternatives such as UTAU.
- For a list of all vocals released and in development, see status.
The VOCALOID project was an international effort and is considered the brainchild of Kenmochi Hideki, also known as the "father" of VOCALOID. In Japan, in 2000, he proposed the initial ideas that founded VOCALOID. Much of the research into the software came from the Pompeu Fabra University in Spain, in a project led by Mr. Kenmochi. It was purely collaborative research; selling a product based on it was not being considered at the time. At first, VOCALOID could only produce vowel sounds such as "ai" (love). Four months later, VOCALOID's first real word was "asa" (morning). The original aim of VOCALOID was to act as a replacement for a real vocalist. Many reviewers at the time of LEON and LOLA's release noted that VOCALOID was a bold effort, as human speech is complex to recreate. VOCALOID was regarded as the first of its kind to tackle singing vocals.
Both an English and a Japanese version were developed alongside each other. The first studio on board was Crypton Future Media, which was hired to find English studios to support an English version. Its efforts met with mostly negative responses, and the only studio to enter development was Zero-G.
The VOCALOID singing synthesizer technology is categorized as concatenative synthesis, which splices and processes, in the frequency domain, vocal fragments extracted from human singing voices. In singing synthesis, the system produces realistic voices by adding vocal expression information, such as vibrato, to the score information. The VOCALOID synthesis technology was initially called "Frequency-domain Singing Articulation Splicing and Shaping" (周波数ドメイン歌唱アーティキュレーション接続法 Shūhasū-domain Kashō Articulation Setsuzoku-hō), although YAMAHA no longer uses this name on its websites. "Singing Articulation" is explained as "vocal expressions" such as vibrato, and vocal fragments necessary for singing. The VOCALOID and VOCALOID2 synthesis engines are designed for singing, not reading text aloud. They also cannot naturally replicate singing expressions like hoarse voices or shouts.
The main parts of the VOCALOID2 system are the Score Editor (VOCALOID2 Editor), the Singer Library, and the Synthesis Engine. The Synthesis Engine receives score information from the Score Editor, selects appropriate samples from the Singer Library, and concatenates them to output synthesized voices. The Score Editor and the Synthesis Engine provided by YAMAHA differ only minimally among VOCALOID2 products. If a VOCALOID2 product is already installed, the user can enable another VOCALOID2 product by adding its library. The system originally supported two languages, Japanese and English; with the release of VOCALOID3, support for Korean, Spanish, and Chinese was also included. Other languages may become available in the future. The system works standalone (playback and export to WAV) and as a ReWire application or Virtual Studio Technology instrument (VSTi) accessible from a Digital Audio Workstation (DAW).
The Score Editor uses a piano roll style editor to input notes, lyrics, and some expressions. For a Japanese Singer Library, the user can input gojūon lyrics in hiragana, katakana, or romaji. For an English library, the editor automatically converts the lyrics into IPA phonetic symbols using the built-in pronunciation dictionary. The user can directly edit the phonetic symbols of unregistered words. A Japanese library and an English library differ in the lyrics input method, but share the same platform. Therefore, the Japanese editor can load an English library and vice versa. Because the lyric input method is library-dependent, the Japanese and English editors differ only in their menus. The Score Editor offers various parameters for adding expression to singing voices. The user is expected to adjust these parameters to best fit the tune being synthesized. The editor supports ReWire and can be synchronized with a DAW. Real-time "playback" of songs with predefined lyrics using a MIDI keyboard is also supported.
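The dictionary lookup described above for English libraries can be pictured with a minimal sketch. This is not YAMAHA's implementation: the `PRONUNCIATIONS` table, the `lyric_to_phonemes` function, and the `user_overrides` argument are hypothetical names introduced for illustration, and the only entry is the "sing" ([sIN]) example used elsewhere in this article.

```python
# Hypothetical miniature pronunciation dictionary (illustration only).
# The real built-in dictionary and its symbol set are not reproduced here.
PRONUNCIATIONS = {"sing": "s I N"}


def lyric_to_phonemes(word, user_overrides=None):
    """Return the phonetic symbols for a lyric word, or None if unregistered.

    user_overrides stands in for symbols the user typed in by hand;
    they take priority over the dictionary entry.
    """
    overrides = user_overrides or {}
    key = word.lower()
    if key in overrides:
        return overrides[key].split()
    if key in PRONUNCIATIONS:
        return PRONUNCIATIONS[key].split()
    return None  # unregistered word: the user must enter the symbols manually


print(lyric_to_phonemes("Sing"))    # ['s', 'I', 'N']
print(lyric_to_phonemes("asa"))     # None
```

Returning `None` for an unregistered word mirrors the point in the text that such words need their phonetic symbols edited directly by the user.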
A VOCALOID studio will first have to approach YAMAHA and acquire a license to produce a VOCALOID.
The price of licensing varies per circumstance. Overseas studios such as Zero-G, PowerFX, and Voctro Labs pay more for their VOCALOIDs because of the rates for export outside Japan, while Japanese and other Asian studios pay less. Joffrey of Voxwave confirmed that many groups approach YAMAHA to create a VOCALOID but are rejected; he even suggested that the majority of attempted VOCALOIDs do not make it past this stage.
The cost of producing a VOCALOID is unknown, though a few figures have been disclosed:
- ¥5,000,000 (approx. $50,000+ USD) was raised for the development of Tohoku Zunko's voicebank.
- €7,000 ($9,300+ USD) was originally estimated as the cost of recording the samples for ALYS, including hiring Poucet and obtaining a studio. This does not include other costs required to produce the product.
- $10,000 to $12,000 was given by PowerFX as an estimate of the cost of creating a new voicebank, once a voice provider, artist, and plan were in place.
Once an agreement to release the VOCALOID is reached, the VOCALOID belongs to the studio: it holds the license, pays the fees, and distributes and sells the product through its website and other channels. Each studio is sent a construction kit that guides it through the production of each VOCALOID. Once everything needed to begin work is in place, the process moves on to selecting the singer.
Extra vocals and updates of old vocals incur additional licensing fees. This is why several studios focus on selling new VOCALOIDs rather than updating older ones. If a VOCALOID sells well, however, some studios may consider updating its vocals even if they have never updated any of their vocals in the past.
If the vocal is an update of an old vocal, the older vocal has to be analysed and its bugs found and fixed. Studios often rely on user feedback to find these bugs.
Singer Library
Each VOCALOID licensee develops the Singer Library, or a database of vocal fragments sampled from real people. The database must have all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary. For example, the voice corresponding to the word "sing" ([sIN]) can be synthesized by concatenating the sequence of diphones "#-s, s-I, I-N, N-#" (# indicating a voiceless phoneme) with the sustained vowel ī. The VOCALOID system changes the pitch of these fragments so that it fits the melody.
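The "sing" example above can be reproduced with a short sketch that pads a phoneme sequence with the `#` marker and pairs up neighbours. The function name `to_diphones` is hypothetical, and the sketch covers only the diphone chain; sustained vowels, polyphones, and pitch shifting are left out.

```python
def to_diphones(phonemes, boundary="#"):
    """Return the chain of diphones for a phoneme sequence.

    The sequence is padded with the boundary marker on both sides, then
    every pair of neighbouring symbols becomes one diphone.
    """
    padded = [boundary, *phonemes, boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["s", "I", "N"]))  # ['#-s', 's-I', 'I-N', 'N-#']
```

A sequence of n phonemes always yields n + 1 diphones under this scheme, which is part of why a library needs a fragment for every phoneme pair that can occur in the language.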
Hiring/Selecting the Vocalist
Zero-G's singer selection process during the VOCALOID1 era began by looking at what was missing in the VOCALOID range. It was decided that there was a gap for a classical soprano voice. This voice type was decided on during Miriam's release, along with a type of voice suited to choir music. Prima's voice provider was a singer who answered an ad put up on a music academy website. After testing some samples in the VOCALOID software, the studio decided to go ahead and record her voice for the VOCALOID2 software.
Internet Co. wanted to use the voice of a singer to create a VOCALOID, but felt it would be difficult to get a singer to agree. They consulted Dwango Co., Ltd., which managed Nico Nico Douga, and Dwango suggested Gackt (神威 楽斗 Camui Gackt), a singer and actor, as he had previously provided his voice for Dwango's cell phone services. He lent his voice and named the VOCALOID Gackpoid.
Not all VOCALOIDs have professional singers behind their vocals as Prima and Miriam did. According to Crypton, professional female singers refused to provide voice samples for fear that the software might create clones of their singing voices. In response, Crypton changed its focus from imitating certain singers to creating characteristic vocals. This change of focus led to sampling the voices of voice actors. The Japanese voice actor agency Arts Vision supported their development. Similar concerns about vocal clones have been expressed at other studios using VOCALOID, with Zero-G refusing to release the names of its providers. Miriam Stockley (who provided the voice for Miriam) remains the only known Zero-G voice provider. PowerFX only hinted at Sweet Ann's voice provider; only Big AL's and YOHIOloid's are known. AH-Software named the voice providers for Miki, Kiyoteru, Yukari, and Zunko, but for legal reasons cannot name Kaai Yuki's, as minors were the subject of the recordings.
For Aoki Lapis, a voice recording competition was held to find the voice provider, in which entrants uploaded the song they thought best suited her. In early August 2012, i-Style Project started open recruitment for the voice provider of Merli, the follow-up project to Aoki Lapis. The recruitment deadline was set to September 10, 2012.
The Recording Process
All VOCALOIDs have a similar recording process. The recording sessions begin with the voice provider singing the phonetic sounds needed for the vocal library. Those who work on this part of the process have nicknamed it a "spell" for its almost chant-like sound. Originally, the "spell" consisted of nonsense words, but it has been adjusted over time to make obtaining voice samples easier. The length of the recording session varies. A Japanese voicebank may be recorded within four hours, as was the case with Gackpoid, while an English voicebank can take from one week to a month to record, due to the size of the vocal library required. A second recording session may take place to give a better result.
To get more natural sounds, three or four different pitch ranges must be stored in the library. Japanese requires 500 diphones per pitch, whereas English requires 2,500. Japanese has fewer diphones because it has fewer phonemes, and most of its syllables are open syllables ending in a vowel. In Japanese, there are three patterns of diphones containing a consonant: voiceless-consonant, vowel-consonant, and consonant-vowel. English, on the other hand, has many closed syllables ending in a consonant, and so has consonant-consonant and consonant-voiceless diphones as well. Thus, more diphones need to be recorded for an English library than for a Japanese one. Due to this linguistic difference, a Japanese library is not suitable for singing in English. Every other language goes through a similar process, although the number and type of sounds that need to be recorded differ. Some languages require more vowel variations, others more consonants, and others focus on certain tonal sounds.
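The figures above imply a rough size for each library. The sketch below simply multiplies the per-pitch diphone counts quoted in the text by three and four pitch ranges; the constant and function names are hypothetical, and real libraries also hold sustained vowels and polyphones, which are not counted here.

```python
# Per-pitch diphone counts as quoted in the text above.
DIPHONES_PER_PITCH = {"Japanese": 500, "English": 2500}
PITCH_RANGES = (3, 4)  # "three or four different pitch ranges"


def total_diphones(language, pitch_ranges):
    """Diphone recordings needed for one language at a given number of pitch ranges."""
    return DIPHONES_PER_PITCH[language] * pitch_ranges


for language in DIPHONES_PER_PITCH:
    low, high = (total_diphones(language, n) for n in PITCH_RANGES)
    print(f"{language}: {low} to {high} diphone recordings")
# Japanese: 1500 to 2000 diphone recordings
# English: 7500 to 10000 diphone recordings
```

On these numbers an English library needs five times as many diphone recordings as a Japanese one, which is consistent with the much longer recording times reported for English voicebanks.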
The construction of a voicebank relies mostly on the trial, error, and experience of those who work on it. Each sample has to be assembled into the singer library piece by piece. A particular sample may turn out to be incorrect, so editing may be needed for the production team to get the best from the samples provided. Some sounds cannot be recorded in isolation because they only occur within a cluster of other sounds, so they have to be cut out from among the others. This is most commonly the case in European languages like Spanish and English.
Many studios, such as Zero-G and PowerFX, do not have an in-house production team, whereas others, like Crypton Future Media and Internet Co., Ltd., do. As a result, people like Anders Sodergren often work with more than one studio, and the same workers may be present on the construction teams of several studios. Another example is Luo Tianyi, who was worked on by team members with experience from making VY1 and/or VY2.
Yamaha's Final Approval
YAMAHA inspects each vocal and checks for issues that may require repair, which is often the reason a vocal is delayed or suspended in production. If a bug is found, the vocal is sent back to the development team for repairs. Even if only one minor fix is needed, the entire vocal must be resubmitted for YAMAHA to inspect from scratch. The process is the same regardless of the language or studio involved. Once testing is finished, the voice is compressed, ready for release. At this point, the vocal may be released.
A VOCALOID cannot be released without YAMAHA's permission. As seen with OLIVER, the process may take over a month to complete. Other VOCALOIDs confirmed to have been held up by this process include YOHIOloid, Ruby, DEX, and DAINA. DEX and DAINA took approximately two months to reach the release stage, spending roughly their final two weeks waiting for compression to take place.
Besides YAMAHA, the distributors that studios partner with to sell their vocals can also reject certain illustrations, as seen with DEX and DAINA.
After the release of a VOCALOID voicebank, planning begins for the release of the next VOCALOID. Some studios, like Zero-G, are only given enough funding to cover the cost of making a single vocal, and are therefore limited in what they can do within a year. The sales of one VOCALOID can easily affect the production of the next, so studios watch closely to see how well each VOCALOID fares. According to Crypton Future Media, speaking about the success of Hatsune Miku, a VOCALOID needs to ship 1,000 units to be declared successful; Miku sold 40,000+ units in her first year and went on to sell 60,000+ over her lifetime. However, most VOCALOIDs do not sell this well: SeeU reportedly failed to meet sales expectations, and LEON and LOLA failed to make an impact in America.
KAITO's initial commercial failure led to an overall smaller demand for Japanese male VOCALOIDs, and is considered part of the reason why less production goes into masculine vocals. However, VOCALOIDs are subject to producer trends and interests, and this may be turned around by a sudden rise in the popularity of a particular VOCALOID, as was the case with KAITO himself.