Speech recognition applications are part of every modern smartphone, including older iPhone models. They take instructions from the user and perform basic operations such as placing calls, playing songs and launching other apps. But what makes Siri so unique, and what is the basic architecture of a speech recognition system?
What is Siri?
Siri is the smart personal assistant that helps you get things done just by asking. Not only does it perform the operations requested by the user, it also interacts with the user, because it can speak back. For example, ask Siri about the weather, and it will respond out loud with a short summary of the day's weather report and on-screen with a snapshot of the five-day forecast. Tell Siri that you need to schedule an appointment for 2:30 p.m. on Wednesday, and it will add an item to your calendar, and then confirm verbally that it has done so. Say you receive an incoming text message but can't devote your eyes to the screen to read it; you can command Siri to read it aloud to you.
One of the main differences between Siri and its counterpart speech recognition applications is that it processes commands on a remote server rather than on the local device, so you have to be connected to Wi-Fi or a 3G signal. You can hold a general conversation with Siri as if you were talking to a person. It will answer some casual questions, such as “What is your name?” or “What is your gender?” This is what sets Siri apart from other similar applications.
Speech Recognition – brief history:
Speech recognition first appeared as a primitive technology in the 1950s, as little more than a curiosity. In the early 1960s, IBM's Shoebox device could recognize 16 spoken words and could respond to simple mathematical requests, such as "three plus four total."
DragonDictate by Dragon Systems was probably the first speech-recognition program for the PC, released in the early 1980s for DOS computers. It could recognize only individual words, spoken one at a time. It evolved over time into the product Dragon NaturallySpeaking, which can transcribe text spoken in a normal conversational voice and speed.
In the early 2000s, despite their constraints, mobile phones were programmed to recognize voices for dialing digit-by-digit, and to some extent to recognize names. The main issue was memory, so most of these phones could recognize only up to 10 or so names at a time. The feature saw relatively little use, however, possibly due to poor marketing on the part of handset makers.
In 2005, Samsung added speech-to-text dictation as well as voice-activated dialing to one of its smartphones. This was the first phone released on the market with true speech-recognition technology.
Another key advance has been the speed of the network. The rising tide of faster wireless networks has raised a great many boats, including the most recent generation of speech-processing technologies, by making it possible to offload the work onto a remote server. This remote-based voice recognition technology is used in Google apps and in Siri on the iPhone.
Architecture:
To convert speech into a command, the system has to go through several complex steps. When you speak, you create vibrations in the air. The analog-to-digital converter (ADC) translates this analog wave into digital data that the system can understand. To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to remove unwanted noise, and sometimes to separate it into different bands of frequency. It also normalizes the sound, or adjusts it to a constant volume level. Because people do not always speak at the same speed, the sound may also have to be temporally aligned with the reference samples it is compared against.
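The sampling and normalization steps described above can be sketched in a few lines of Python. This is only an illustration, not how Siri or any particular product does it: the 16 kHz sample rate, the 16-bit depth, the 440 Hz test tone and all function names are assumptions chosen for the example.

```python
import math

def sample_wave(freq_hz, duration_ms, sample_rate_hz=16000, amplitude=0.3):
    """Measure a sine wave at regular intervals, as an ADC samples a microphone signal."""
    n = sample_rate_hz * duration_ms // 1000  # number of measurements taken
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n)]

def normalize_peak(samples, target_peak=1.0):
    """Scale the signal so its loudest sample reaches target_peak (constant volume level)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

def quantize_16bit(samples):
    """Map floats in [-1.0, 1.0] to signed 16-bit integers, clamping anything outside."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]

# 10 ms of a quiet 440 Hz tone (assumed test input)
samples = sample_wave(440.0, 10)
normalized = normalize_peak(samples)
digital = quantize_16bit(normalized)
```

A quiet recording and a loud recording of the same tone end up with the same peak level after `normalize_peak`, which is the point of the normalization step.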
Next the signal is divided into small segments as short as a few hundredths of a second, or even thousandths in the case of plosive consonant sounds -- consonant stops produced by obstructing airflow in the vocal tract -- like "p" or "t." The program then matches these segments to known phonemes in the appropriate language. A phoneme is the smallest element of a language -- a representation of the sounds we make and put together to form meaningful expressions. There are roughly 40 phonemes in the English language (different linguists have different opinions on the exact number), while other languages have more or fewer phonemes.
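As a rough sketch of the segmentation step, the digitized signal can be cut into short, overlapping frames. The 25 ms frame length and 10 ms step used here are common choices in speech processing but are assumptions for this example, as are the function names; the text above only says that segments last a few hundredths of a second.

```python
import math

def split_into_frames(samples, sample_rate_hz=16000, frame_ms=25, hop_ms=10):
    """Cut a list of samples into overlapping fixed-length frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    hop_len = sample_rate_hz * hop_ms // 1000      # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

def frame_energy(frame):
    """Average squared amplitude, a simple per-frame feature."""
    return sum(s * s for s in frame) / len(frame)

# One second of a 200 Hz tone sampled at 16 kHz (assumed test input)
signal = [math.sin(2 * math.pi * 200 * i / 16000) for i in range(16000)]
frames = split_into_frames(signal)
energies = [frame_energy(f) for f in frames]
```

Each frame would then be compared against models of the known phonemes. Real systems compute richer features than the energy shown here.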
The next step seems simple, but it is actually the most difficult to accomplish and it is the focus of most speech recognition research. The program examines phonemes in the context of the other phonemes around them. It runs the contextual phoneme plot through a complex statistical model and compares it to a large library of known words, phrases and sentences. The program then determines what the user was probably saying and issues a command.
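To give a feel for the "compare against a library of known words" idea, here is a deliberately simplified sketch. It replaces the complex statistical model with a plain edit distance between phoneme sequences, which is a swapped-in technique for illustration only. The four-word lexicon, the phoneme spellings and the function names are all hypothetical.

```python
# Toy pronunciation library (hypothetical entries, for illustration only)
LEXICON = {
    "call": ["K", "AO", "L"],
    "tall": ["T", "AO", "L"],
    "play": ["P", "L", "EY"],
    "weather": ["W", "EH", "DH", "ER"],
}

def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def best_word(phonemes, lexicon=LEXICON):
    """Return the library word whose pronunciation is closest to the heard phonemes."""
    return min(lexicon, key=lambda w: (edit_distance(phonemes, lexicon[w]), w))

exact = best_word(["K", "AO", "L"])
noisy = best_word(["W", "EH", "D", "ER"])  # one phoneme misheard
```

Even with one misheard phoneme, the closest library entry is still found. A production recognizer scores probabilities rather than counting edits, and it considers neighbouring words as well.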
Drawbacks in the speech-recognition system:
No speech recognition system is 100 percent perfect; several factors can reduce accuracy. Some of these factors are issues that continue to improve as the technology improves.
Siri's biggest limitation right now is language support. Currently, you can only use Siri to its full potential when you use American English and you're physically in the United States. Siri does have options for U.K. English, Australian English, French (France), and German, but if you use the program in any of those languages, you can't search for businesses or locations on a map.
The program needs to "hear" the words spoken distinctly, and any extra noise introduced into the sound will interfere with this. The noise can come from a number of sources, including loud background noise in an office environment. Users should work in a quiet room with a quality microphone positioned as close to their mouths as possible. Current systems have difficulty separating simultaneous speech from multiple users.
Homophones are words that are spelled differently and have different meanings but sound the same. "There" and "their," "air" and "heir," "be" and "bee" are all examples. There is no way for a speech recognition program to tell the difference between these words based on sound alone. However, extensive training of systems and statistical models that take into account word context has greatly improved their performance.
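A minimal sketch of how word context can settle such a choice: for each sound-alike word, pick the spelling that most often appears next to its neighbours. The counts below are made up for the example, and the names `BIGRAM_COUNTS`, `HOMOPHONE_SETS` and `disambiguate` are hypothetical; a real system would learn such statistics from very large amounts of text.

```python
# Groups of words that sound the same (taken from the examples above)
HOMOPHONE_SETS = [{"there", "their"}, {"air", "heir"}, {"be", "bee"}]

# Made-up counts of how often one word follows another (illustrative only)
BIGRAM_COUNTS = {
    ("over", "there"): 50, ("over", "their"): 2,
    ("their", "house"): 40, ("there", "house"): 1,
    ("fresh", "air"): 30,
    ("the", "heir"): 12, ("the", "air"): 25,
    ("heir", "apparent"): 20,
}

def candidates(word):
    """All spellings that sound like the given word, in a fixed order."""
    for group in HOMOPHONE_SETS:
        if word in group:
            return sorted(group)
    return [word]

def score(prev_word, word, next_word):
    """Product of add-one smoothed counts with the left and right neighbours."""
    left = BIGRAM_COUNTS.get((prev_word, word), 0) + 1
    right = BIGRAM_COUNTS.get((word, next_word), 0) + 1
    return left * right

def disambiguate(words):
    """Replace each word with its best-scoring sound-alike, working left to right."""
    result = []
    for i, word in enumerate(words):
        prev_word = result[-1] if result else None
        next_word = words[i + 1] if i + 1 < len(words) else None
        result.append(max(candidates(word),
                          key=lambda c: score(prev_word, c, next_word)))
    return result

fixed_1 = disambiguate(["over", "their"])
fixed_2 = disambiguate(["there", "house"])
fixed_3 = disambiguate(["the", "air", "apparent"])
```

With these counts, "over their" becomes "over there" and "there house" becomes "their house", even though the sound alone gives no hint.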
Conclusion:
Do you know how many possible queries the Google voice search database has to cover? A gazillion (an informal term for an astronomically large number, not an exact figure).
English vocabulary contains roughly a million different words, and it evolves over time as new words and new names enter the language, so the word list has to be refreshed and extended from time to time. Those words can then be put together in any imaginable order and into word strings of any length. A 10-word query picked at random from those million words can take about 10^60 different forms, which is an astronomically large number. However, by using this kind of statistical language model and training it on hundreds of billions of queries, it becomes possible to build a highly accurate speech recognition system that copes with a very wide range of words and accents.
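The size of that number is easy to check. The short calculation below assumes a vocabulary of exactly one million words and queries of exactly ten words, with every ordering allowed; the function name is hypothetical.

```python
def count_word_strings(vocabulary_size, length):
    """Number of distinct word strings of a given length, with repetition allowed."""
    return vocabulary_size ** length

ten_word_queries = count_word_strings(1_000_000, 10)
digits = len(str(ten_word_queries))  # how many digits the count has
```

The result is a 1 followed by 60 zeroes, far more strings than could ever be stored one by one, which is why a statistical model is used instead of a lookup table.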
One of the striking features of Siri is that it will improve the more people use it. Siri will be collecting data from users, like their regional accents and dialects and common phrases that people use, and analyzing that information to improve. Additionally, the more you specifically use it, the better it understands your particular accent and characteristics of your speech.