Overview


High Level Parameter Speech Synthesis System - $500

  • Uses 13 higher-level parameters to produce more natural-sounding speech
  • Uses your Windows-compatible audio card – no other hardware necessary
  • Runs under Windows® 2000/XP/Vista
  • Developed and used every day by speech researchers at Sensimetrics
  • Volume discounts are available

Speech Synthesis for Research

HLsyn is a parametric synthesizer designed and produced by Sensimetrics Corporation. Since its original release in 1995, numerous improvements have been made, and all are now incorporated in the present version 2.2. The most important advance has been the addition of three new control parameters for aerodynamic and source control. Improvements have also been made in the algorithms employed and in user documentation, which is extensive.

Background

HLsyn may reasonably be termed a quasi-articulatory synthesizer. Its design is based on the observation that the values of the 40-odd parameters used to control SenSyn and similar Klatt-type synthesizers are not independent, but are subject to inter-parameter constraints. The constraints arise because speech production, as a physical process, permits only certain combinations of synthesis parameters, and also limits the rates at which the parameter values can change with time. To express these constraints, a small set of 10 high-level (HL) parameters was originally proposed. This set has now been expanded to include 13 HL parameters, providing improved aerodynamic and source control. The HL parameters are more closely related to the actual states and articulatory movements in the vocal tract than are lower-level (Klatt) parameters. The principle employed in HLsyn is simple: a set of mapping relations within HLsyn transforms the HL parameters into the values of the corresponding lower-level parameters that control SenSyn.


The HL parameters

There are thirteen HL parameters. These include five constriction areas within the vocal tract:

  • an, the cross-sectional area of the opening to the nasal cavity;
  • ag, the average area of the glottal opening bounded by the membranous portion of the glottis;
  • ap, the area of the glottal opening bounded by the cartilaginous portion of the glottis;
  • al, the cross-sectional area of a constriction formed by the lips; and
  • ab, the cross-sectional area of a constriction formed by the tongue blade.

Parameters al and ab come into play only during the production of consonants.

Another HL parameter is ue, the active rate of change of vocal-tract volume during obstruent consonants. This parameter can act to expand the vocal-tract volume in order to facilitate glottal vibration during obstruent consonants, or to prevent the expansion of the vocal-tract volume in order to inhibit glottal vibration. The dc parameter controls the compliance of the vocal-tract walls and the vocal folds. This parameter can simulate compliance changes, such as those observed during unvoiced consonants. Subglottal pressure, ps, is a parameter that can be employed to control source amplitudes, which is important for emphasis in running speech.

Five of the HL parameters are similar (but not necessarily identical) to lower-level (Klatt) parameters. These are the fundamental frequency f0 and the four formant frequencies (f1, f2, f3, and f4). The latter specify the natural frequencies of the vocal tract, without accounting for acoustic coupling of the vocal tract to the trachea or nasal cavity, provided that there is no localized constriction formed by the tongue blade or the lips.

The time-varying HL formant-frequency parameters, in effect, specify how the shape of the vocal tract changes with time, independent of any nasal or tracheal coupling or local constriction. If there is nasal or tracheal coupling or a local constriction (as specified by the parameters an, ag, ap, al, and ab), then the mapping relations automatically modify the HL formant parameters to yield the lower-level formant parameters used to control the synthesizer. The HL parameters f1, f2, f3, and f4 actually describe the aspects of vocal-tract shape that are determined by tongue-body position, jaw position, pharyngeal shape, and possible lip rounding.
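For readers who want to handle the thirteen HL parameters in their own scripts, the grouping described above can be sketched as a simple record. This is only an illustration: the class name HLFrame and the list HL_PARAMETER_NAMES are hypothetical, the field names simply reuse the parameter symbols from the text, and nothing here reflects HLsyn's internal data format or units.

```python
from dataclasses import dataclass, fields


@dataclass
class HLFrame:
    """One snapshot of the 13 HL parameters (hypothetical container, not HLsyn's format)."""

    # Five constriction areas within the vocal tract
    an: float  # opening to the nasal cavity
    ag: float  # glottal opening, membranous portion
    ap: float  # glottal opening, cartilaginous portion
    al: float  # constriction formed by the lips
    ab: float  # constriction formed by the tongue blade

    # Aerodynamic and source control
    ue: float  # active rate of change of vocal-tract volume
    dc: float  # compliance of the vocal-tract walls and vocal folds
    ps: float  # subglottal pressure

    # Parameters similar to lower-level (Klatt) parameters
    f0: float  # fundamental frequency
    f1: float  # first formant frequency
    f2: float  # second formant frequency
    f3: float  # third formant frequency
    f4: float  # fourth formant frequency


# Parameter names in declaration order, derived from the dataclass itself
HL_PARAMETER_NAMES = [f.name for f in fields(HLFrame)]
```

Keeping the three groups together in one record makes it easy to check that a script supplies all thirteen values before passing them on.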

HLsyn includes a facility for manually entering time-varying parameters, for displaying the parameters in graphical or numerical form, for playing back the results of the synthesis and for displaying the synthesized sound pattern in the form of a spectrogram. Speaker-dependent constraints can be specified by the user to enable the synthesizer to simulate different female and male voices.
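One common way to represent a manually entered time-varying parameter is as a list of (time, value) breakpoints with linear interpolation between them. The sketch below assumes that representation; it is not a description of how HLsyn stores or interpolates its parameter tracks. The function name interpolate_track, the millisecond time unit, and the f0 values are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def interpolate_track(breakpoints, t):
    """Linearly interpolate a parameter track at time t.

    breakpoints: list of (time, value) pairs sorted by increasing time.
    Times outside the track return the first or last value (held constant).
    """
    times = [time for time, _ in breakpoints]
    if t <= times[0]:
        return breakpoints[0][1]
    if t >= times[-1]:
        return breakpoints[-1][1]
    # Index of the first breakpoint strictly later than t
    i = bisect_right(times, t)
    t0, v0 = breakpoints[i - 1]
    t1, v1 = breakpoints[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical f0 contour: times in milliseconds, values in Hz
f0_track = [(0, 120), (100, 140), (200, 100)]
f0_at_50 = interpolate_track(f0_track, 50)    # halfway up the rise
f0_at_150 = interpolate_track(f0_track, 150)  # halfway down the fall
```

The same function could be applied to each of the thirteen HL parameters in turn, one track per parameter.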

Current research at Sensimetrics includes the formulation of rules for generating the time-varying HL parameters from a phonetic string, so that HLsyn can be incorporated into a complete text-to-speech system.


HLsyn Synthesis Examples

The following are examples of copy synthesis made with HLsyn. "The Raven" has been fine-tuned by adjusting HLsyn's lower-level Klatt parameters.

The Raven - .wav 164KB
The Raven - .aif 164KB

"She gave the kitten..." - .wav 33KB
"She gave the kitten..." - .aif 33KB

"Five women..." - .wav 31KB
"Five women..." - .aif 31KB