Accessing Google Speech API / Chrome 11
Like this article? Follow me on Twitter @mikepultz for more updates.
Just yesterday, Google pushed version 11 of their Chrome browser into beta, and along with it, one really interesting new feature: support for the HTML5 speech input API. This means that you'll be able to talk to your computer, and Chrome will be able to interpret it. This feature has been available for a while on Android devices, so many of you will already be used to it, and will welcome the new feature.
If you're running Chrome version 11, you can test out the new speech capabilities by going to their simple test page on the html5rocks.com site:
http://slides.html5rocks.com/#speech-input
Genius! But how does it work? I started digging around in the Chromium source code to find out whether the speech recognition is implemented as a library built into Chrome, or whether it sends the audio back to Google to process. I know I've seen the Sphinx libraries in the Android build, but I was sure the latter was the case: the speech recognition was really good, and that's really hard to do without really good language models, which is not something you'd be able to build into a browser.
I found the files I was looking for in the chromium source repo:
http://src.chromium.org/viewvc/chrome/trunk/src/content/browser/speech/
It looks like the audio is collected from the mic, and then passed via an HTTPS POST to a Google web service, which responds with a JSON object with the results. Looking through their audio encoder code, the audio can be either FLAC or Speex, but the Speex appears to be some sort of specially modified version. I'm not sure what the modification is, but it just didn't look quite right.
If that's the case, there should be no reason why I can't just POST something to it myself?
The URL listed in speech_recognition_request.cc is:
https://www.google.com/speech-api/v1/recognize
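So a quick few lines of Perl (or PHP, or just use wget on the command line):
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
my $url = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US";
my $audio = "";
# read the FLAC file given on the command line as raw bytes
open(FILE, "<" . $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
binmode(FILE);
while(<FILE>)
{
    $audio .= $_;
}
close(FILE);
# POST the audio as the request body; the rate must match the recording
my $ua = LWP::UserAgent->new;
my $response = $ua->post($url, Content_Type => "audio/x-flac; rate=16000", Content => $audio);
if ($response->is_success)
{
    print $response->content;
}
else
{
    # report HTTP failures instead of exiting silently
    print STDERR $response->status_line, "\n";
}
1;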
This quick Perl script uses LWP::UserAgent to POST the binary audio from my audio clip; I recorded a quick WAV file, and then converted it to FLAC on the command line (see SoX for more info).
To run it, just do:
[root@prague mike]# ./speech i_like_pickles.flac
The response is pretty straightforward JSON:
{
"status": 0,
"id": "b3447b5d98c5653e0067f35b32c0a8ca-1",
"hypotheses": [
{
"utterance": "i like pickles",
"confidence": 0.9012539
},
{
"utterance": "i like pickle"
}]
}
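For anyone scripting against this, a minimal Python sketch of pulling the top result out of that JSON might look like the following. The function name `best_hypothesis` is my own invention, and the sketch assumes the response always has the shape shown above: a `hypotheses` list whose entries carry an `utterance` and, only sometimes, a `confidence`.

```python
import json


def best_hypothesis(body):
    """Return (utterance, confidence) for the first hypothesis, or None.

    Assumes the response shape shown above; `confidence` may be missing,
    as it is for the second hypothesis in the sample.
    """
    data = json.loads(body)
    hypotheses = data.get("hypotheses", [])
    if not hypotheses:
        return None
    top = hypotheses[0]
    return top["utterance"], top.get("confidence")


# The sample response from this article, reused as test input.
sample = '''{"status": 0, "id": "b3447b5d98c5653e0067f35b32c0a8ca-1",
 "hypotheses": [{"utterance": "i like pickles", "confidence": 0.9012539},
                {"utterance": "i like pickle"}]}'''

print(best_hypothesis(sample))
```

Returning None for an empty `hypotheses` list is a design choice of the sketch, so that callers have to handle the "nothing recognized" case explicitly.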
I'm not sure if Google is intending this to be a public, usable web service API, but it works, and it has all sorts of possibilities!

March 24th, 2011 - 08:38
And I was just starting to go down this path, and sure enough you beat me to the punch! Thank you, this is really amazing information!
March 25th, 2011 - 13:43
Thanks! I’ve been watching Speechrecognition API thread for a while now and hoping that someone would add some new information. I saw the HTML5 speech input announcement for Chrome 11 and was hoping that it could be accessed somehow…you just proved it can…let’s hope it stays that way!
March 25th, 2011 - 23:37
Well,
Thank you, you just made my day
Wojtek
March 29th, 2011 - 06:32
Hi!
I’ve also been “playing” with google STT engine, just with wget.
It works fine, but I’ve not been able to add a link to a grammar. Have you tried something similar?
Looking in the code, it's supposed to expect the variable lm as the URL to a standard SRGS grammar (grxml), for example:
wget --post-file flacs/pieles_de_naranja.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "http://www.google.com/speech-api/v1/recognize?lang=es&lm=http://www.naradarobotics.com/testGrammar.grxml"
The response is the same I get without grammar (as free text).
Do you have any idea on how to deal with this issue?
Thanks in advance!
March 29th, 2011 - 22:59
Hey Kaik,
I haven't tried passing in grammar settings. I didn't see much in the Chrome code about grammar, other than it simply passing the value through the URL.
Did you find anything that indicated that it expected the format to be an SRGS XML file?
Mike
March 30th, 2011 - 23:08
I’ve been trying to get this to work. It fails for me. Has google found out we’re using it (what would be their problem with that?)?
What I did:
wget -U "Mozilla/5.0" --post-file=recording.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "http://www.google.com/speech-api/v1/recognize?lang=en-us&client=chromium"
What I got:
HTTP request sent, awaiting response… 500 Internal Server Error
2011-03-30 17:11:54 ERROR 500: Internal Server Error.
Any thoughts?
March 31st, 2011 - 09:43
I don't think so. I tried it again this morning and it worked fine; what I did notice, though, is that if you send it a large amount of audio, it will time out or throw an error. Unfortunately, it's tough to tell from this side of the fence.
That being said, I don't really see how they can block this service. If they want it to work for Chrome, then they're going to have to leave it open; even if they force it to require certain HTTP headers or a key or something, it's all visible via the Chrome source code, and therefore reproducible.
Mike
April 1st, 2011 - 15:39
Hey Mike,
Thanks for the tip. You’re right: it’s very important not to make the sound file too large. I have been encoding to flac. Perhaps if I used the custom speex codec Android uses (assumedly) I could get more out of it. But I just tested it with a short clip and got a response (lol, incorrect though it may be).
As for blocking the service, you’re probably right. I think it’s designed to work this way. They’d have to change chrome to handshake somehow to avoid people using it out of the browser. Since it’s a free service for everyone in the world anyway, why would they bother to control that?
tl;dr: Don’t post long sound files to the service.
April 4th, 2011 - 10:52
Thank you very much. I hope Google will open the STT API soon, because until it is official nobody can be sure that the API will keep working. BTW, that JSON looks like Google Translate's JSON reply. They're definitely going to open the STT API.
P.S. Speech to Text… F*ck yeah!
April 4th, 2011 - 10:54
P.P.S. What's with the headers' fonts? They are not anti-aliased.
April 5th, 2011 - 11:32
Hi Mike,
I've been working on the issue of the file size that Google will allow, and came up with this solution: https://github.com/taf2/audiosplit. It is very much a work in progress, but the idea is to detect minor silences in an audio stream and cut the stream into smaller ~10 second chunks. It also makes it easier to send a more arbitrary wave file, using ffmpeg, flac and some libsndfile code to chunk the wave files. Combining the results sort of works… Today I am working on merging smaller chunks…
-Todd
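The chunking idea can be sketched in a few lines of Python. To be clear, this is not how audiosplit works internally (I haven't traced its code); it is a toy version under my own assumptions: the audio is already decoded into a list of signed PCM sample values, "silence" means a run of samples whose absolute value is below a made-up `threshold`, and the function name and all parameter defaults are hypothetical.

```python
def split_at_silence(samples, rate, max_seconds=10.0, threshold=500,
                     min_silence=0.2):
    """Split PCM samples into chunks no longer than max_seconds,
    cutting at the end of a quiet stretch when one is available."""
    max_len = int(rate * max_seconds)
    min_run = max(1, int(rate * min_silence))

    # quiet_run[i] = number of consecutive quiet samples ending at index i
    quiet_run = []
    run = 0
    for s in samples:
        run = run + 1 if abs(s) < threshold else 0
        quiet_run.append(run)

    chunks = []
    start = 0
    while len(samples) - start > max_len:
        cut = start + max_len  # fallback: hard cut at the size limit
        # search backwards so each chunk is as long as possible
        for i in range(start + max_len - 1, start, -1):
            if quiet_run[i] >= min_run:
                cut = i + 1
                break
        chunks.append(samples[start:cut])
        start = cut
    if start < len(samples):
        chunks.append(samples[start:])
    return chunks


# Toy input: 6 loud samples, 3 quiet ones, then 8 loud ones, at "10 Hz".
toy = [1000] * 6 + [0] * 3 + [1000] * 8
print([len(c) for c in split_at_silence(toy, rate=10, max_seconds=1.0)])
```

Each chunk would then be encoded to FLAC and posted separately. The hard-cut fallback can still split a word in half, which is presumably part of why combining the results only "sort of works".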
April 5th, 2011 - 22:20
Great Article.
I tried curl, and it seemed to work (I got something back from the Google Server, but it was NOT close to being a good representation of the sound):
size=33036
date_time=Apr 06 03:10
file=recording3.flac
curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" -F "myfile=@recording3.flac"
{"status":0,"id":"b4078d226d77e4593700df906c81fb34-1","hypotheses":[{"utterance":"what is this","confidence":0.4514247}]}
I used a free prog to ‘save’ the file as MP3 (recording3.mp3), and another (VLC) to save it as FLAC (recording3.flac).
April 5th, 2011 - 22:47
There might be a sound quality issue converting first to mp3 then to flac?
Also- if your audio isn’t 16khz (as specified in the Content-Type header), then it’s going to be pretty off too.
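A quick way to catch that mismatch before uploading is to read the rate straight out of the WAV header before converting to FLAC. The sketch below uses Python's standard `wave` module; the helper names `content_type_for` and `make_wav` are mine, and `make_wav` only exists to generate a silent test clip.

```python
import io
import wave


def content_type_for(wav_bytes, expected_rate=16000):
    """Read the sample rate from a WAV header and build the matching
    Content-Type value; raise if it differs from expected_rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        rate = w.getframerate()
    if rate != expected_rate:
        raise ValueError(
            f"audio is {rate} Hz, expected {expected_rate} Hz; resample first")
    return f"audio/x-flac; rate={rate}"


def make_wav(rate, seconds=0.1):
    """Build a short silent mono 16-bit WAV in memory (test input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()


print(content_type_for(make_wav(16000)))
```

A 44.1kHz clip raises a ValueError here instead of being sent with a header that lies about its rate.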
April 5th, 2011 - 23:41
Mike,
You are correct! Hope this is of value (and use) by others…. file=recording4.flac
…and that is my way of ‘giving back’.
I have been desperate to find a way to make our videos Section 508 compliant; this helps to that end.
What would be really valuable is to have 'longer' audio transcribed as well, so that it has 'text equivalents' too.
PS: Google Voice allows recordings of up to 3 minutes, and makes its transcribed text available to you (for free). The past several times that I have tested it, I had to wait under 10 minutes to get the text back (with some editing still needed).
WORKS as expected (now), using curl on new file=recording4.flac:
$ curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" -F "myfile=@recording4.flac"
Results are spit out (% is close to 97% now):
{"status":0,"id":"916c8a10b3c3f0e8e57573b2f98038ae-1","hypotheses":[{"utterance":"this is another test","confidence":0.96688336}]}
Anyhow,……. this time, I paid attention to my batch file, running on WinXP, using VLC
(Convert__mp3_to_FLAC__WORKS.bat)—– note: VLC is available for Mac, and Linux as well:
@ECHO OFF
SET VLC_EXE="c:\program files\videolan\vlc\vlc.exe"
SET file_name=recording4
SET file_path=c:\ahmad\music
SET SRC_File=%file_path%\%file_name%.mp3
SET DST_File=%file_path%\%file_name%.flac
SET transcode_options=vcodec=none,acodec=flac,ab=16,channels=1,samplerate=16000
::---- HIDE the VLC interface & WORK !!!
%VLC_EXE% --file-caching=300 "%SRC_File%" --sout #transcode{%transcode_options%}:file{dst='%DST_File%'} -I dummy vlc://quit
PAUSE
::exit
April 13th, 2011 - 05:35
An example with PHP + FFMPEG receiving an MP3 or other audio file, converting it with FFMPEG to FLAC, posting it to the API and returning the response.
$url = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US";
$target = "uploads/";
$target = $target.basename( $_FILES['file']['name']);
// move the temporary file
if(move_uploaded_file($_FILES['file']['tmp_name'], $target)) {
    $file = $_FILES['file']['name'];
    $fileflac = substr($file,0,-3).'flac';
    // convert the audio file to flac
    exec('ffmpeg -i uploads/'.$file.' -ab 96 -ar 44000 uploads/'.$fileflac);
    // make the request
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('file'=>'@uploads/'.$fileflac));
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: audio/x-flac; rate=44000'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $response = curl_exec($ch);
    curl_close($ch);
    // return the response
    print_r($response);
    // delete the audio files
    unlink('uploads/'.$file);
    unlink('uploads/'.$fileflac);
}
April 13th, 2011 - 23:33
To Anonymouse:
I like your approach; in fact you can do mp3-to-flac conversion in Windows as well
(if you have ffmpeg.exe from one of the Open Source downloads):
ffmpeg -i uploads/recording4.mp3 -ab 96 -ar 44000 uploads/recording4.flac
Curious, have you got the PHP code working then? I fiddled with it for a few minutes, with empty $response ! Would you have the code zipped up some place?
PS: Your code could probably be changed a little, to run from CMD line as well (in case ‘upload’ is disallowed on a web server):
/path/to/php audio_to_flac.php uploads/recording22.mp3 > recording22.txt
April 15th, 2011 - 08:25
Thank you for the topic!
Take a look at http://habrahabr.ru/blogs/macosx/117570/ (it is in Russian, but no matter). It is a Speech Translator for Mac OS X (based on the Google speech recognition API). Source: https://github.com/Kyrie1965/SpeechTranslator
May 3rd, 2011 - 16:51
Looking at the code, the only change to the speex encoding is that they put a length byte in front of every speex frame. I guess this is so they can easily use speex in VBR mode where the frames are of variable length. It should be pretty easy to create an encoder for that.
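Going only by the description in this comment (I have not checked it against the Chromium source), the framing would amount to something like the Python sketch below. The function names are hypothetical, the frames are placeholder bytes rather than real Speex encoder output, and a single unsigned length byte per frame is the assumption being illustrated.

```python
def prefix_frames(frames):
    """Prepend a one-byte length to each encoded frame and concatenate."""
    out = bytearray()
    for frame in frames:
        if len(frame) > 255:
            raise ValueError("frame too long for a one-byte length prefix")
        out.append(len(frame))
        out += frame
    return bytes(out)


def split_frames(data):
    """Inverse of prefix_frames: recover the list of frames."""
    frames = []
    pos = 0
    while pos < len(data):
        n = data[pos]
        frames.append(data[pos + 1:pos + 1 + n])
        pos += 1 + n
    return frames


# Placeholder "frames" of different sizes, as VBR output would have.
frames = [b"\x01\x02\x03", b"\xff" * 5]
packed = prefix_frames(frames)
print(packed.hex())
```

The length prefix is what lets a decoder walk a stream of variable-size frames without parsing the Speex bitstream itself, which fits the VBR guess above.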
May 4th, 2011 - 11:28
Thanks for the info – I’ve used it to hack up a replacement for non-chrome browsers – java applet based I’m afraid…
http://api.phonefromhere.com/stt/test.html
May 5th, 2011 - 00:10
Hi guys
Please test my release
SpeechRecognition v. 1.0 Beta
http://programmer.uz/?action=comments&id=714
May 5th, 2011 - 10:35
Can you tell us more about your java example? Or release the code?
May 12th, 2011 - 22:48
To: Florian
It’s pretty simple.
The audio is encoded with the library at http://javaflacencoder.sourceforge.net/, and Google's answer is deserialized with the gson library.
Feel free to ask if you are interested in anything else.
Sorry for my English.
May 25th, 2011 - 20:09
To: Horsekiller
I just downloaded your SpeechRecognition.jar file. Is it possible for me to run this and send text to Processing?
If it works, could you briefly tell me the procedure?
May 31st, 2011 - 06:37
to Tim Panton,
I have tried to run recognition for Speex audio codec (http://jspeex.sourceforge.net/),
can you give more details about Google's changes to this codec compared to the original?
As I understand it, the change is a frame length at the start of each frame? For 8 kHz Speex audio the frame size is 160 bytes,
so should I add 160 as an int (long) before each 160-byte buffer array?
June 13th, 2011 - 19:46
Anyone figured out the grammar question ?
June 17th, 2011 - 11:12
Has anyone had much success with speex-with-header-byte encoding?
I get back different (terrible) results from the server from the same file, which is something I don’t see happening when posting a flac encoded file. It occasionally returns the correct transcription which leads me to believe I didn’t completely foobar the modified speex encoding.
July 1st, 2011 - 15:29
Anyone have a clue as to when this will be able to be used commercially?
July 17th, 2011 - 12:15
I have been using SoX to record .flac files and this system to translate them to text. While the service is definitely working (returning valid JSON objects), the accuracy is atrocious, especially compared to using the service from a Google site. Could it be related to the audio format? Are there adjustments I could make to improve the accuracy?
Right now I use 'sox -d file.flac silence -l 1 0 5% 1 2.5 1%' to record.
July 17th, 2011 - 12:31
Worked it out: this API requires the audio to be uploaded at 16kHz, and SoX defaults to 44.1kHz, so just add a rate conversion to the record command. Below are a few lines of code you can add just after "my $audio" to make this Perl script handle voice commands (assuming SoX is installed).
my $file = '.flac';
my $record = "sox -d $file silence -l 1 0 1% 1 2.5 1% rate 16k";
print "Speak Now \n";
`$record`;
print "Processing \n";
open (FILE, "<".$file);
July 19th, 2011 - 03:40
Can somebody help with implementing this on the iPhone?
Is there any code or tutorial using this API?
July 21st, 2011 - 08:20
I am facing this issue when I use the same call on the iPhone. The code I am using is:
ASIFormDataRequest *request = [ASIFormDataRequest requestWithURL:url];
NSString *filePath = [[NSBundle mainBundle] pathForResource:@"can_you_keep_a_secret" ofType:@"flac"];
NSData *myData = [NSData dataWithContentsOfFile:filePath];
[request addPostValue:myData forKey:@"Content"];
[request addPostValue:@"audio/x-flac; rate=16000" forKey:@"Content-Type"];
The response I am getting is "Content-Type media type is not audio". The status code is 400.
Can anyone tell me what the error in the code is? Why this response? I am passing a .flac file which plays perfectly in my VLC player.
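A likely cause, going by the error text: the code above sends the audio and the "Content-Type" value as form fields, so the HTTP request's own Content-Type header describes a form rather than audio. The Perl script in the article instead sends the FLAC bytes as the raw request body with the header set directly. Here is a Python sketch of that request shape, built but not sent so that it runs offline; `build_request` and `SPEECH_URL` are names I made up, and the body is a placeholder, not real FLAC data.

```python
import urllib.request

# URL and query parameters copied from the article's Perl script.
SPEECH_URL = ("https://www.google.com/speech-api/v1/recognize"
              "?xjerr=1&client=chromium&lang=en-US")


def build_request(flac_bytes, rate=16000, url=SPEECH_URL):
    """Build (but do not send) a POST whose body is the raw FLAC data
    and whose Content-Type header declares the audio format and rate."""
    return urllib.request.Request(
        url,
        data=flac_bytes,
        headers={"Content-Type": f"audio/x-flac; rate={rate}"},
        method="POST",
    )


req = build_request(b"placeholder, not real FLAC data")
# urllib stores header names capitalized as "Content-type".
print(req.get_method(), req.get_header("Content-type"))
```

The same idea should carry over to any HTTP client: set the body to the file's bytes and set the Content-Type header on the request itself, rather than adding either one as a POST field.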
August 2nd, 2011 - 00:06
Great article!
May I ask you a question?
The result shows only one word (or sentence).
Can I get n-best (multiple) results?
August 3rd, 2011 - 15:07
Not sure if this is free for Web, but it’s free for mobile. The quality seems decent:
http://techcrunch.com/2011/08/03/ispeech-launches-free-mobile-sdk-to-bring-speech-recognition-to-ios-android-apps/
August 7th, 2011 - 11:08
Pretty late but I made a Java / Processing library some time ago using the techniques mentioned here: http://stt.getflourish.com
August 13th, 2011 - 23:09
I wrote a Java based program that exploits the Speech API and TTS API to make an interactive desktop assistant
check out the code here
https://www.github.com/samirahmed
or read about it here
http://www.samir-ahmed.com/iris.html
August 16th, 2011 - 06:14
Hey man! You saved my graduation final project!!!! I had almost given up, and then I found this post!!! Really, thanks!!!!!!
September 1st, 2011 - 08:01
Hi. I'm trying to get it working with the Flash Speex codec. If someone feels lucky, try here: http://stackoverflow.com/questions/7270619/flash-speex-codec-coversion-for-google-speech-api-a-challenge
September 8th, 2011 - 09:24
Thanks for the info, plan on using this myself.
September 20th, 2011 - 07:43
Hello, I'm trying the Perl script and it returns nothing. If I use:
$ curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.es/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en_US" -F "myfile=@hello.flac"
{"status":0,"id":"b86fd217bf576032ca03746da9c534f9-1","hypotheses":[{"utterance":"hello good morning","confidence":0.8036038}]}
it works fine!! But I need to recognize Spanish, so I recorded "hola, buenas tardes" in a FLAC file and:
$ curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.es/speech-api/v1/recognize?xjerr=1&client=chromium&lang=es_ES" -F "myfile=@hola_es.flac"
{"status":0,"id":"2e1f7b23562ccb95e72af513e3f243a0-1","hypotheses":[{"utterance":"ull","confidence":0.20765519}]}
ull????? What's ull?? If I use the HTML5 example, it works fine too, but I need it on the command line. Any ideas?
September 25th, 2011 - 23:30
I tried this snippet of C# code but I am getting back the line below. Any suggestion on what is going wrong would be greatly appreciated.
{"status":5,"id":"186c0611e33571e187bbd85c0bbd1f85-1","hypotheses":[]}
————————
string uploadUrl = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US";
FileStream rdr = new FileStream("C:\\test.wav", FileMode.Open);
WebRequest request = WebRequest.Create(uploadUrl);
request.Method = "POST";
request.ContentType = "audio/x-flac; rate=16000";
byte[] byteArray = new byte[rdr.Length];
int bytesRead = rdr.Read(byteArray, 0, byteArray.Length);
request.ContentLength = byteArray.Length;
using (Stream dataStream = request.GetRequestStream())
{
    dataStream.Write(byteArray, 0, byteArray.Length);
}
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
string strText = reader.ReadToEnd();
reader.Close();
September 25th, 2011 - 23:53
Please ignore previous message. Forgot to convert file to flac from wav.
September 26th, 2011 - 19:50
for those interested, here is a php script i’m using currently that works quite well.
=====================
September 30th, 2011 - 20:59
Were you able to access the force alignment of speech and text as well?
October 3rd, 2011 - 10:25
not sure why the code didn’t post the first time.
but here it is again.
October 4th, 2011 - 10:58
[code]
<?php
$filename = '/path/to/flac/file/my_soundfile.flac';
// read the FLAC file as raw bytes
$handle = fopen($filename, "rb");
$XPost = fread($handle, filesize($filename));
fclose($handle);
$url = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US";
$ch = curl_init(); // initialize curl handle
curl_setopt($ch, CURLOPT_URL, $url); // set url to post to
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return into a variable
curl_setopt($ch, CURLOPT_HTTPHEADER, Array("Content-Type: audio/x-flac; rate=16000"));
curl_setopt($ch, CURLOPT_TIMEOUT, 300); // times out after 300s
curl_setopt($ch, CURLOPT_POSTFIELDS, $XPost); // raw audio as the POST body
curl_setopt($ch, CURLOPT_POST, 1);
$str = curl_exec($ch); // run the whole process
curl_close($ch);
//decode returned json to associative array
$objs = json_decode($str, true);
//extract the data we need
$converted_text = $objs["hypotheses"][0]["utterance"];
$score = round($objs["hypotheses"][0]["confidence"], 2);
?>
[/code]
October 11th, 2011 - 07:24
I am trying to use "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=es" for Spanish recognition and it doesn't work. Any suggestions?
Regards!
October 15th, 2011 - 15:20
Because you need to change the language first… take a look at the parameter lang=en-US; you will probably need to use something like es-ES, I think… or not??
October 21st, 2011 - 14:48
Hi,
found this by pure chance – great thread !
Unfortunately, I do not get a result. wget returns: "Authority of issuer of the certificate cannot be checked locally."
This surprises me, since I called wget with the option --no-check-certificate
Any ideas ?
Thanks a lot in advance for your help
Joshi
November 9th, 2011 - 09:53
I want to support multiple languages with this link, and I know it’s possible, because lang=es-mx for example, recognizes spanish.
My questions are:
* How do I know which languages are supported?
* Can I use a flag to indicate “auto-recognize”?
December 1st, 2011 - 11:54
Did anyone try with WAV or g711 files? Without conversion… I mean with something like:
Content-Type: audio/x-wav; rate=16000 ?