On Speech Recognition: Web App Integration, Pointers for Newbies, & Lessons Learned from a failed startup

Piece of advice

For all of those thinking of integrating speech recognition into their apps I have a word of advice for you: Don’t.

But you’re implementing the Star Trek computer you say..?
Well in that case it’s absolutely necessary to the core functionality of your solution! Starship software definitely needs damn good speech rec for its interface. How the hell else is it going to get your Earl Grey the right temperature..!? Well at least read this article before you embark on your journey so you have a better idea of where to start.

Just so we’re straight:
Speech rec discussed in this article is the kind that understands short phrases and/or commands with no training required. It’s not free flowing dictation like that found in Dragon software. That requires training and still has a long ways to go before being viable in a web 2.0 kind of way.

Engage:

Three main ways to integrate speech rec into a web app:

  1. Telephony
  2. Web Services
  3. Embedded

1. Telephony & VoIP:

Why do we consider telephony..? Why not. There are more phones in the world than computers. They all have built in microphones and speakers and they all speak the same crappy language designed by the evil telcos years ago. This means that anyone anywhere who has access to a phone can interface with your app, international restrictions aside. All you need to do is sign up for a DID. DID stands for Direct Inward Dialing, and a DID can be any kind of phone number like 555-555-5555. When anyone picks up a phone and dials your DID they’ll be routed to your server where the call can then be managed by telephony software.

Asterisk & FreeSwitch are the two big players here. Technically they’re soft switches: software created to handle and direct calls, either between two VoIP endpoints (Skype style) or to and from the POTS network, otherwise known as the Plain Old Telephone Service.

Both are open source but FS is definitely more so. You could say Asterisk is like MySQL and FS is like Postgres. The former has a massive ecosystem, lots of commercial support, books, and a giant user base. FreeSwitch, on the other hand, is supported and developed by a smaller, more cohesive group of talented and dedicated devs. By extension FS has a smaller but more dedicated user base, much in the same way Postgres does.

Performance and stability wise FS easily wins the battle. However, support and availability for 3rd party extensions, tools, and dev resources definitely favor Asterisk.

For my solution I stuck with Asterisk, though not because I liked it better. You’ll see why later on in the article..

So in summary telephony is great because you can operate free of operating system quirks, platform differences, and having to support several different devices.. but the real question is: does it scale..? Well yes it does, but not easily. There are per minute charges for the DIDs. It’s cheap, i.e. around 2 cents a minute, but imagine supporting 1000 simultaneous calls to your server.. That’s quite possible performance wise as long as you offload the speech rec to another server; it’s just that you’re paying per minute every time someone calls in to interact with your app. That could get expensive pretty quickly. There is one way to get around all of this headache and that leads us to the next section of course.
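To put a rough number on that per minute charge, here is a back-of-the-envelope sketch. It assumes the 2 cents a minute figure above and that every line stays busy for the whole hour; the function name is made up for illustration and real DID pricing will vary by provider.

```python
def did_cost_per_hour(concurrent_calls: int, cents_per_minute: int = 2) -> float:
    """Hourly DID bill in dollars, assuming every line is busy for the full hour."""
    total_cents = concurrent_calls * cents_per_minute * 60
    return total_cents / 100


# 1000 simultaneous callers at 2 cents a minute
print(did_cost_per_hour(1000))  # prints 1200.0
```

So a sustained 1000 caller load would run about $1,200 an hour in per minute charges alone, before you’ve paid for a single server.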

2. Web Services:
The web services setup is simple architecturally speaking: a voice sample is recorded either on a device, like an iPhone for example, or from a computer mic and sent to the server as a file. The voice file can be compressed before it’s sent, most likely using Speex or GSM, then POSTed to the server RESTfully, where it’s decompressed and decoded by your speech rec software. The server can then send back an interpretation to the client.
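Here’s a minimal sketch of what the server side of that flow could look like, using nothing but Python’s standard library. Everything speech related is a placeholder: `decompress_audio` and `recognize` are hypothetical stand-ins for a real Speex/GSM decoder and a real speech rec engine, and the JSON response shape is just an assumption for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def decompress_audio(payload: bytes) -> bytes:
    # Placeholder: a real service would run a Speex or GSM decoder here.
    return payload


def recognize(audio: bytes) -> str:
    # Placeholder for the speech rec engine: returns a canned phrase.
    return "tea earl grey hot" if audio else ""


def interpret(payload: bytes) -> bytes:
    """Turn an uploaded voice sample into a JSON response body."""
    text = recognize(decompress_audio(payload))
    return json.dumps({"interpretation": text}).encode("utf-8")


class SpeechHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the uploaded voice file from the request body.
        length = int(self.headers.get("Content-Length", 0))
        body = interpret(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> HTTPServer:
    # Bind to localhost only; this is a sketch, not a production setup.
    return HTTPServer(("127.0.0.1", port), SpeechHandler)
```

Calling `make_server().serve_forever()` would start it listening; a client then POSTs the recorded bytes and reads back the interpretation. Since each request is independent, you can scale this kind of thing by putting more boxes behind a load balancer.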

The nice thing about this solution is that it scales! As long as you have some decent power server side you can easily decode thousands of requests a minute, a request being a spoken command or phrase no longer than 10-20 seconds in length. We’re talking voice commands here, not dictation, which is much much harder to do. It’s certainly possible but you’d need a lot more power.. This kind of tech just isn’t ready for prime time IMHO. If it were we might indeed have a Star Trek computer available. We’d still need to work out the AI though. Even if the dictation speech rec were 100% accurate you’d need one behemoth of a program to analyze the meaning of it all.. Not yet available folks.

The one issue of course with going the web services route is that you have to implement a client side program for each device you want to work with. This particularly sucks in the mobile world since you’re dealing with so many device quirks and platforms.. If only working with devices was like working with browsers.. Never mind IE not supporting events the right way.. the iPod touch doesn’t even include a friggin mic..! Welcome to the world of mobile devices. And you thought browser quirks were bad…!

3. Embedded
Honestly, I just put this option up there to warn you. Embedded speech rec just makes all speech rec look bad. Don’t bother. Sure it will rec a few hundred names and/or song titles.. In general though serious speech rec should be done in the cloud b/c that’s where the power is and good speech rec requires a good deal of power.

Available Speech Rec solutions:

You didn’t seriously think you were going to implement this part did you..? =) Companies who deal with this tech often have several PhDs on board and there’s a good reason for that: Recognizing human language in a meaningful way is really hard. At least by today’s standards. With all the nuances, intonations, accents, and signal distortion, e.g. wind, a weak signal, that annoying guy on the cell phone etc.. it’s amazing it works at all.. A real credit to the science and researchers behind it all.

That said it’s so far away from what the stupidest human you’ve ever met can do it’s almost humiliating we’re not further along.

At any rate it’s not like you’re trying to build the next starship voice interface. That’s currently impossible anyway so don’t try it. What you can do is build some basic speech rec into your app much in the same way major banks, airlines, and mobile operators have: automated voice attendants that understand a constrained set of responses, lists of names or places, and/or strings of numbers.. Here are the viable 3rd party options for doing just that:

Commercially available:

  1. Lumenvox (San Diego, CA)
  2. Nuance (Burlington, MA)

Open Source:

  1. Sphinx (Carnegie Mellon University)
  2. Julius/Julian (Kyoto University, Japan)
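Whichever engine you pick, the "constrained set of responses" idea is what makes this workable. As I understand it, real engines apply the constraint during decoding via a grammar; the sketch below only approximates the effect after the fact by snapping a raw hypothesis to the closest allowed command. The command list and the `match_command` helper are made up for illustration.

```python
import difflib

# Hypothetical command set for a bank style voice attendant.
ALLOWED_COMMANDS = ["check balance", "transfer funds", "speak to an agent"]


def match_command(hypothesis: str, cutoff: float = 0.6):
    """Return the closest allowed command, or None if nothing is close enough."""
    cleaned = hypothesis.lower().strip()
    matches = difflib.get_close_matches(cleaned, ALLOWED_COMMANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_command("chek balance"))  # prints check balance
```

Anything that isn’t close to one of the allowed phrases comes back as `None`, which is your cue to ask the caller to repeat themselves.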

The real deal:
I know there are others available like IBM, AT&T, Microsoft, Acapela, Loquendo etc… but they all kind of suck. Trust me. The suits at AT&T only care about BOA and AA contracts. IBM doesn’t even know it still has a speech rec department. Microsoft made some valiant efforts in the field but I’m not a Windows dev, and Loquendo/Acapela are in Europe so perhaps that’s why they were hard to get a hold of. Even though I speak French fluently I couldn’t get these guys to return my calls or emails. They did put me on their mailing list though. =)

Also it’s not like any of them are superior to what the other guys have or else we’d all know. Everyone is basically using the same kind of tech at their core. There is no one company with superior proprietary speech recognition. At least not yet. Maybe Google will be that company at some point but for now it doesn’t exist.

So your best bet is Lumenvox or Nuance. In particular Lumenvox is your best bet because they are the most developer friendly and the nicest to work with. Nuance is alright but it’s a Windows shop and they are a lot pricier.. They also have a ton of products, some of which may even compete with what you’re trying to do..! It’s easy to get lost on their site.

Open source dreams:
Accurate open source speech recognition is still but a dream unfortunately. I have the utmost respect and admiration for both the Sphinx and Julius teams. I mean holy shit.. They’re literally giving away speech recognition tech for free so all of us can tinker with it. The only real differences between the open source and commercially available solutions lie in what’s called their Acoustic Models. AMs for speech rec are like gold. A good AM is produced from several thousand hours of good audio samples. It’s what gives speech recognition its ability to recognize. Think stats: weights, averages, means, standard deviations.. etc.. that kind of thing…
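To make the "think stats" bit concrete, here’s a deliberately tiny toy. It fits one Gaussian (a mean and a standard deviation) per sound from a handful of training samples, then labels a new sample by whichever Gaussian scores it highest. The numbers are invented, the feature is a single made-up value per sample, and real acoustic models are far bigger (engines like these typically use hidden Markov models with Gaussian mixtures over multi-dimensional features), so treat this as an illustration of the idea only.

```python
import math
import statistics


def train(samples_by_label):
    """Fit one (mean, stdev) pair per label from 1-D training samples."""
    return {
        label: (statistics.mean(xs), statistics.stdev(xs))
        for label, xs in samples_by_label.items()
    }


def log_likelihood(x, mean, stdev):
    """Log of the Gaussian density at x."""
    return -math.log(stdev * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * stdev ** 2)


def classify(model, x):
    """Pick the label whose Gaussian gives x the highest likelihood."""
    return max(model, key=lambda label: log_likelihood(x, *model[label]))


# Invented training data: one feature value per recorded sample.
training = {
    "ah": [700.0, 720.0, 690.0, 710.0],
    "ee": [280.0, 300.0, 310.0, 290.0],
}
model = train(training)
print(classify(model, 650.0))  # prints ah
```

More audio means better estimates of those means and deviations, which is exactly why thousands of hours of good samples are worth so much.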

The commercial products do indeed spend most of their time building up good AMs by enlisting large groups of people to help ‘train’ the system. The open source guys just can’t afford to do this for obvious reasons. In fact Lumenvox is based in no small part on the Sphinx project. So you might say that Lumenvox is Sphinx with a really good AM and some polish. Some well intentioned efforts are being made over at Voxforge.org to provide better AMs for Sphinx and Julius but they still have a ways to go.

Wrapping Up:

Stick with Asterisk for now if you go the telephony route:
The primary reason I stuck with Asterisk was that FS dropped Lumenvox support. Understandably so.. Lumenvox attempted to charge them for dev licenses. I’m not sure that was a wise move on Lumen’s part but that’s politics I guess. FS is really becoming quite a contender in the world of Telephony & VoIP so I hope they work it out. For me it was a deal breaker.

Eventually though I went the web services route. It made sense because using this approach we were able to scale out the solution web 2.0 style. Telephony becomes too expensive cost and performance wise. You can still get good performance from the telephony route, just not nearly what you can attain using web services.

Anyway, I should have just listened to the advice at the beginning of this article in the first place: forget about speech rec unless it’s absolutely central to your solution. I learned the hard way and I’m writing this so hopefully you won’t have to.

I may write some more on the subject later on. It really was an enlightening experience to submerge myself in this whole other universe.

At any rate this article should have at least moved you a little closer to building your very own Star Trek computer. If you want to hear more on the subject or have questions just submit a comment and I’ll do my best to answer.

That’s a wrap

10 Responses to “On Speech Recognition: Web App Integration, Pointers for Newbies, & Lessons Learned from a failed startup”

  3. [...] On Speech Recognition: Web App Integration, Pointers for Newbies, & Lessons Learned from a faile… http://www.teabuzzed.com/2009/08/on-speech-recognition-web-app-integration-pointers-for-newbies-lessons-learned-from-a-failed-startup – view page – cached Article on technology, django, web design, .net, startups, software, code and coding, javascript, linux and more — From the page [...]

  5. Diego Viola says:

    FreeSWITCH is the best!

  7. Sylver says:

    Thanks for the data. Quite an interesting article, although I would suggest cleaning up a bit the language, which I found a distraction from the valuable infos your are providing.

    • Fred says:

      Ha. Thanks. I did ‘clean’ it up a bit per your suggestion. I got a bit carried away..
      I love to hate on Speech Rec but in fact it’s a really exciting technology which holds a lot of promise.. I just think that promise is still quite a ways away.. =)

  8. Stephen says:

    Hi Fred — sorry to hear about the failure. I know a bunch of us here at LumenVox were impressed with your app while it was active.

    I hope if you ever find a project where speech recognition is more than just a shiny addon/distraction, you consider LumenVox again.

    • Fred says:

      Thanks Stephen.

      I appreciate that. It was certainly a lot of fun working with the Lumen team and you guys provided excellent support.

      What a journey into the unknown that was for me. I only hope this article sheds a little more light on the speech rec/telephony world since I feel as though guys coming from a web 2.0 background like me usually understand so little about it.

      Thanks again and hope all is well at Lumen HQ in San Diego.
