Accessing Google Speech API / Chrome 11

I’ve posted an updated version of this article here, using the new full-duplex streaming API.

Just yesterday, Google pushed version 11 of their Chrome browser into beta, and along with it, one really interesting new feature: support for the HTML5 speech input API. This means that you’ll be able to talk to your computer, and Chrome will be able to interpret it. This feature has been available for a while on Android devices, so many of you will already be used to it and will welcome the new feature.

If you’re running Chrome version 11, you can test out the new speech capabilities by going to their simple test page on the html5rocks.com site:

http://slides.html5rocks.com/#speech-input

Genius! But how does it work? I started digging around in the Chromium source code to find out whether the speech recognition is implemented as a library built into Chrome, or whether it sends the audio back to Google to process. I know I’ve seen the Sphinx libraries in the Android build, but I was sure the latter was the case: the speech recognition was really good, and that’s really hard to do without really good language models, which are not something you’d be able to build into a browser.

I found the files I was looking for in the chromium source repo:

http://src.chromium.org/viewvc/chrome/trunk/src/content/browser/speech/

It looks like the audio is collected from the mic, and then passed via an HTTPS POST to a Google web service, which responds with a JSON object with the results. Looking through their audio encoder code, it looks like the audio can be either FLAC or Speex, but the Speex appears to be some sort of specially modified version. I’m not sure what it is, but it just didn’t look quite right.

If that’s the case, there should be no reason why I can’t just POST something to it myself?

The URL listed in speech_recognition_request.cc is:

https://www.google.com/speech-api/v1/recognize

So a quick few lines of Perl (or PHP, or just use wget on the command line):

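#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $url = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US";
my $audio = "";

# Read the FLAC file named on the command line as raw bytes
open(FILE, "<" . $ARGV[0]) or die "Cannot open audio file: $!";
binmode(FILE);
while(<FILE>)
{
    $audio .= $_;
}
close(FILE);

my $ua = LWP::UserAgent->new;

# POST the audio as the request body; rate must match the file's sample rate
my $response = $ua->post($url, Content_Type => "audio/x-flac; rate=16000", Content => $audio);
if ($response->is_success)
{
    print $response->content;
}
else
{
    # Report HTTP failures instead of exiting silently
    die $response->status_line . "\n";
}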

This quick Perl script uses LWP::UserAgent to POST the binary audio from my audio clip. I recorded a quick WAV file and then converted it to FLAC on the command line (see SoX for more info).

To run it, just do:

[root@prague mike]# ./speech i_like_pickles.flac
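For comparison, here is a minimal Python sketch of the same request. It only mirrors what the Perl script above does: the endpoint and query parameters are the ones from the post, while the function names `build_request` and `recognize` are made up for this example. Nothing here is an official client, and the service may reject or change any of it.

```python
import urllib.parse
import urllib.request

# Endpoint taken from speech_recognition_request.cc, as quoted in the post.
ENDPOINT = "https://www.google.com/speech-api/v1/recognize"


def build_request(audio, lang="en-US", rate=16000):
    """Build (but do not send) a POST request carrying raw FLAC bytes.

    `rate` must match the real sample rate of the audio.
    """
    query = urllib.parse.urlencode(
        {"xjerr": 1, "client": "chromium", "lang": lang}
    )
    return urllib.request.Request(
        ENDPOINT + "?" + query,
        data=audio,  # the audio is the whole request body, not a form field
        headers={"Content-Type": "audio/x-flac; rate=%d" % rate},
        method="POST",
    )


def recognize(path, lang="en-US", rate=16000):
    """Send a FLAC file to the service and return the raw JSON text."""
    with open(path, "rb") as f:  # binary mode, FLAC is not text
        request = build_request(f.read(), lang=lang, rate=rate)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `recognize("i_like_pickles.flac")` would be the equivalent of the command line above.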

The response is pretty straightforward JSON:

{
    "status": 0,
    "id": "b3447b5d98c5653e0067f35b32c0a8ca-1",
    "hypotheses": [
    {
        "utterance": "i like pickles",
        "confidence": 0.9012539
    },
    {
        "utterance": "i like pickle"
    }]
}
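A reply like this is easy to pick apart. The sketch below is Python and rests on two assumptions drawn only from the sample above, not from any documentation: that `status` 0 means success, and that the first entry in `hypotheses` is the best guess. The helper name `best_hypothesis` is hypothetical.

```python
import json


def best_hypothesis(body):
    """Return (utterance, confidence) for the first hypothesis, or None.

    Assumes status 0 means success and that hypotheses are ordered
    best-first; both are guesses based on the sample reply.
    """
    reply = json.loads(body)
    hypotheses = reply.get("hypotheses") or []
    if reply.get("status") != 0 or not hypotheses:
        return None
    top = hypotheses[0]
    # Lower-ranked hypotheses may have no confidence value at all.
    return top["utterance"], top.get("confidence")
```

Using `.get("confidence")` matters because, as the second hypothesis in the sample shows, not every entry carries a confidence score.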

I’m not sure if Google is intending this to be a public, usable web service API, but it works, and it has all sorts of possibilities!

265 thoughts on “Accessing Google Speech API / Chrome 11”

  1. todd

    And i was just starting to go down this path and sure enough you beat me to the punch! Thank you this is really amazing information!

  2. Pingback: Introducing Speech 2 Text API by Google | Captico

  3. Milton

    Thanks! I’ve been watching Speechrecognition API thread for a while now and hoping that someone would add some new information. I saw the HTML5 speech input announcement for Chrome 11 and was hoping that it could be accessed somehow…you just proved it can…let’s hope it stays that way!

  4. KaiK

    Hi!

    I’ve also been “playing” with google STT engine, just with wget.
    It works fine, but I’ve not been able to add a link to a grammar. Have you tried something similar?
    Looking in the code, it’s supposed to expect the variable lm as the URL to a standard SRGS grammar (grxml), for example:
    wget --post-file flacs/pieles_de_naranja.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "http://www.google.com/speech-api/v1/recognize?lang=es&lm=http://www.naradarobotics.com/testGrammar.grxml"

    The response is the same I get without grammar (as free text).

    Do you have any idea on how to deal with this issue?

    Thanks in advance!

  5. mike Post author

    Hey Kaik,

    I haven’t tried passing in grammar settings. I didn’t see much in the Chrome code about grammar, other than it simply passing the value through the URL.

    Did you find anything that indicated that it expected the format to be an SRGS XML file?

    Mike

  6. hesperaux

    I’ve been trying to get this to work. It fails for me. Has google found out we’re using it (what would be their problem with that?)?

    What I did:
    wget -U "Mozilla/5.0" --post-file=recording.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "http://www.google.com/speech-api/v1/recognize?lang=en-us&client=chromium"

    What I got:

    HTTP request sent, awaiting response… 500 Internal Server Error
    2011-03-30 17:11:54 ERROR 500: Internal Server Error.

    Any thoughts?

  7. mike Post author

    I don’t think so. I tried it again this morning and it worked fine. What I did notice, though, is that if you send it a large amount of audio, it will time out or throw an error. Unfortunately, it’s tough to tell from this side of the fence.

    That being said, I don’t really see how they can block this service. If they want it to work for Chrome, then they’re going to have to leave it open; even if they force it to require certain HTTP headers or a key or something, it’s all visible via the Chrome source code, and therefore reproducible.

    Mike

  8. hesperaux

    Hey Mike,

    Thanks for the tip. You’re right: it’s very important not to make the sound file too large. I have been encoding to flac. Perhaps if I used the custom speex codec Android uses (assumedly) I could get more out of it. But I just tested it with a short clip and got a response (lol, incorrect though it may be).
    As for blocking the service, you’re probably right. I think it’s designed to work this way. They’d have to change chrome to handshake somehow to avoid people using it out of the browser. Since it’s a free service for everyone in the world anyway, why would they bother to control that?

    tl;dr: Don’t post long sound files to the service.

  9. Oxygen

    Thank you very much. I hope Google will open STT API soon, because until they are official nobody can be sure that API will work. BTW, that JSON looks like Google Translate’s JSON reply. They’re definitely going to open STT API.

    P.S. Speech to Text… F*ck yeah!

  10. todd

    Hi Mike,

    I’ve been working on the issue of the file size that Google will allow and have come up with this solution: https://github.com/taf2/audiosplit. It is very much a work in progress, but the idea is to detect minor silences in an audio stream and cut the stream into smaller ~10 second chunks. It also makes it easy to send a somewhat more arbitrary wave file, using ffmpeg, flac and some libsndfile code to chunk the wave files. Combining the results sort of works… Today I am working on merging smaller chunks…

    -Todd
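The chunking idea described above can be sketched roughly as follows. This is not the audiosplit code; it is a minimal Python illustration, and the function name `split_points`, the amplitude threshold and the window size are all assumptions picked for the example. It works on a list of signed 16-bit PCM samples and only decides where to cut. Encoding each chunk to FLAC and posting it is left out.

```python
def split_points(samples, rate, max_seconds=10.0, threshold=500, window=0.02):
    """Return (start, end) sample index pairs, each at most max_seconds long.

    Cuts are placed in the latest quiet window found before the length
    limit; if no window is quiet, the chunk is cut hard at the limit.
    threshold and window are illustrative values, not tuned ones.
    """
    max_len = max(1, int(max_seconds * rate))
    win = max(1, int(window * rate))
    total = len(samples)
    chunks = []
    start = 0
    while total - start > max_len:
        limit = start + max_len
        cut = limit  # fallback: hard cut when no silence is found
        pos = limit - win
        # Walk backwards from the limit looking for a quiet window.
        while pos > start:
            if max(abs(s) for s in samples[pos:pos + win]) < threshold:
                cut = pos + win // 2
                break
            pos -= win
        chunks.append((start, cut))
        start = cut
    if start < total:
        chunks.append((start, total))
    return chunks
```

Each `(start, end)` pair could then be written out as its own short clip, keeping every upload well under the size at which the service starts timing out.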

  11. Ahmad

    Great Article.
    I tried curl, and it seemed to work (I got something back from the Google Server, but it was NOT close to being a good representation of the sound):

    size=33036
    date_time=Apr 06 03:10
    file=recording3.flac

    curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" -F "myfile=@recording3.flac"

    {"status":0,"id":"b4078d226d77e4593700df906c81fb34-1","hypotheses":[{"utterance":"what is this","confidence":0.4514247}]}

    I used a free prog to ‘save’ the file as MP3 (recording3.mp3), and another (VLC) to save it as FLAC (recording3.flac).

  12. mike Post author

    There might be a sound quality issue from converting first to MP3 and then to FLAC?

    Also, if your audio isn’t 16 kHz (as specified in the Content-Type header), then the results are going to be pretty far off too.
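One way to avoid a mismatch between the file and the header is to read the sample rate from the FLAC file itself. The Python sketch below relies on the FLAC container layout (the `fLaC` marker, then a STREAMINFO block whose 20-bit sample rate starts at byte 18); the helper names `flac_sample_rate` and `content_type_for` are made up for this example. It only checks the rate, it does not resample anything.

```python
def flac_sample_rate(header):
    """Read the sample rate in Hz from the first bytes of a FLAC stream."""
    if len(header) < 21 or header[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    if header[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 8-17 hold the block and frame sizes; the sample rate is the
    # next 20 bits: all of bytes 18 and 19 plus the top half of byte 20.
    return (header[18] << 12) | (header[19] << 4) | (header[20] >> 4)


def content_type_for(path):
    """Build a Content-Type value that matches the file's real rate."""
    with open(path, "rb") as f:
        rate = flac_sample_rate(f.read(42))
    return "audio/x-flac; rate=%d" % rate
```

If this reports something other than 16000 for a clip that recognizes badly, the rate in the header (or the recording itself) is the first thing to fix.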

  13. ahmad

    Mike,
    You are correct! Hope this is of value (and use) by others…. file=recording4.flac
    …and that is my way of ‘giving back’.

    I have been desperate to find a way to make our videos section 508 compliant; this helps to that end.
    What would be really valuable, is to have ‘longer’ audios be transcribed as well; have ‘text equivalents’ too.

    PS: Google-Voice, allows recordings of up to 3 minutes, and makes its transcribed text available to you (for free). The past several times that I have tested it, I had to wait under 10 minutes to get the text back (with some editing, still needed).

    WORKS as expected (now), using CURL on new file=recording4.flac:
    $ curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" -F "myfile=@recording4.flac"

    Results are spit out (% is close to 97% now):
    {"status":0,"id":"916c8a10b3c3f0e8e57573b2f98038ae-1","hypotheses":[{"utterance":"this is another test","confidence":0.96688336}]}

    Anyhow,……. this time, I paid attention to my batch file, running on WinXP, using VLC
    (Convert__mp3_to_FLAC__WORKS.bat)—– note: VLC is available for Mac, and Linux as well:

    @ECHO OFF
    SET VLC_EXE="c:\program files\videolan\vlc\vlc.exe"

    SET file_name=recording4

    SET file_path=c:\ahmad\music

    SET SRC_File=%file_path%\%file_name%.mp3

    SET DST_File=%file_path%\%file_name%.flac

    SET transcode_options=vcodec=none,acodec=flac,ab=16,channels=1,samplerate=16000

    :: ---- HIDE the VLC interface & WORK !!!
    %VLC_EXE% --file-caching=300 "%SRC_File%" --sout #transcode{%transcode_options%}:file{dst='%DST_File%'} -I dummy vlc://quit

    PAUSE
    ::exit

  14. Anonymouse

    An example with PHP + FFMPEG receiving an MP3 or other audio file, converting it with FFMPEG to FLAC, posting it to the API and returning the response.

    $url = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US";
    $target = "uploads/";
    $target = $target.basename( $_FILES['file']['name']);

    // move the temporary file
    if(move_uploaded_file($_FILES['file']['tmp_name'], $target)) {
        $file = $_FILES['file']['name'];
        $fileflac = substr($file,0,-3).'flac';

        // convert the audio file to flac
        exec('ffmpeg -i uploads/'.$file.' -ab 96 -ar 44000 uploads/'.$fileflac);

        // make the request
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_POSTFIELDS, array('file'=>'@uploads/'.$fileflac));
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: audio/x-flac; rate=44000'));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $response = curl_exec($ch);
        curl_close($ch);

        // return the response
        print_r($response);

        // delete the audio files
        unlink('uploads/'.$file);
        unlink('uploads/'.$fileflac);
    }

  15. Ahmad

    To Anonymouse:
    I like your approach; in fact you can do mp3-to-flac conversion in Windows as well
    (if you have ffmpeg.exe from one of the Open Source downloads):

    ffmpeg -i uploads/recording4.mp3 -ab 96 -ar 44000 uploads/recording4.flac

    Curious, have you got the PHP code working then? I fiddled with it for a few minutes, with empty $response ! Would you have the code zipped up some place?

    PS: Your code could probably be changed a little, to run from CMD line as well (in case ‘upload’ is disallowed on a web server):
    /path/to/php audio_to_flac.php uploads/recording22.mp3 > recording22.txt

  16. Tim Panton

    Looking at the code, the only change to the speex encoding is that they put a length byte in front of every speex frame. I guess this is so they can easily use speex in VBR mode where the frames are of variable length. It should be pretty easy to create an encoder for that.
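The framing described in the comment above might look something like this in Python. It is based only on that description (one length byte in front of every encoded Speex frame) and has not been checked against the Chromium encoder or the service. The function name `frame_with_length_bytes` is hypothetical, and the frames are assumed to come from a Speex encoder that is not shown here.

```python
def frame_with_length_bytes(frames):
    """Prefix each encoded Speex frame with a single byte giving its length.

    `frames` is an iterable of bytes objects, one per encoded frame.
    A one-byte prefix can only describe frames of 1 to 255 bytes.
    """
    out = bytearray()
    for frame in frames:
        if not 0 < len(frame) <= 255:
            raise ValueError("frame length does not fit in one byte")
        out.append(len(frame))  # the length byte
        out += frame            # the encoded frame, unchanged
    return bytes(out)
```

Note that the prefix is the size of the encoded frame in bytes, which in VBR mode varies from frame to frame; it is not the number of audio samples the frame represents.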

  17. Q

    To: Horsekiller
    I just downloaded your SpeechRecognition.jar file. Is it possible for me to run this and send text to Processing?
    If it works, could you briefly tell me the procedure?

  18. Pingback: Playing around with speech-to-text « load,buffer,play

  19. Ilya

    to Tim Panton,
    I have tried to run recognition with the Speex audio codec (http://jspeex.sourceforge.net/).
    Can you give more details about Google’s changes to this codec compared to the original?
    As I understand it, the change is a frame length at the start of each frame? For 8 kHz Speex audio the frame size is 160 bytes,
    so should I add 160 as an int (long) before each 160-byte buffer array?

  20. Pingback: How to add a full-vocabulary-sized english language model? « Support Forums

  21. ShamblingMound

    Has anyone had much success with speex-with-header-byte encoding?

    I get back different (terrible) results from the server from the same file, which is something I don’t see happening when posting a flac encoded file. It occasionally returns the correct transcription which leads me to believe I didn’t completely foobar the modified speex encoding.

  22. Pingback: Accessing Google Speech API / Chrome 11 « don’t_panic « marcusjpotter

  23. Anonymous

    Anyone have a clue as to when this will be able to be used commercially?

  24. Henry

    I have been using SoX to record .flac files and this system to transcribe them to text. While the service is definitely working (returning valid JSON objects), the accuracy is atrocious, especially compared to using the service from a Google site. Could it be related to the audio format? Are there adjustments I could make to improve the accuracy?

    Right now I use 'sox -d file.flac silence -l 1 0 5% 1 2.5 1%' to record.

  25. Henry

    Worked it out. This API requires the audio to be uploaded at 16 kHz, and SoX defaults to 44.1 kHz, so just add a rate conversion to the record command. Below are a few lines of code you can add just after “my $audio” to make this Perl script handle voice commands (assuming SoX is installed).

    my $file = '.flac';
    my $record = "sox -d $file silence -l 1 0 1% 1 2.5 1% rate 16k";
    print "Speak Now \n";
    `$record`;
    print "Processing \n";
    open (FILE, "<".$file);

  26. Luke

    Can somebody help with implementing this on the iPhone?
    Any code or tutorial using this API would be appreciated.

  27. Luke

    I am facing this issue when I use the same call on the iPhone. The code I am using is:

    ASIFormDataRequest *request = [ASIFormDataRequest requestWithURL:url];
    NSString *filePath = [[NSBundle mainBundle] pathForResource:@"can_you_keep_a_secret" ofType:@"flac"];
    NSData *myData = [NSData dataWithContentsOfFile:filePath];

    [request addPostValue:myData forKey:@"Content"];
    [request addPostValue:@"audio/x-flac; rate=16000" forKey:@"Content-Type"];

    The response I am getting is “Content-Type media type is not audio”. The status code is 400.
    Can anyone let me know what the error in the code is? Why this response? I am passing a .flac file which plays perfectly in my VLC player.

  28. Ted Kim

    great article!

    May I ask you a question?

    the result shows only one word(or sentence).
    Can I get nbest(multiple) results?

  29. Pingback: “free” Google Speech Recognition API | Intelligible Babble

  30. Pingback: How to Add Speech Recognition To Website? HTML5 Tips | Globinch

  31. Pingback: Flash SPEEX codec coversion for Google Speech API – a challenge | Technical support, Computer, programming issue, issue tracking, quality assurance

  32. Pingback: Speech Recognition for the Web

  33. juanmol

    Hello, I’m trying the Perl script and it returns nothing. If I use:

    $ curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.es/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en_US" -F "myfile=@hello.flac"
    {"status":0,"id":"b86fd217bf576032ca03746da9c534f9-1","hypotheses":[{"utterance":"hello good morning","confidence":0.8036038}]}

    it works fine!! But I need to recognize Spanish, so I recorded “hola, buenas tardes” in a FLAC file and:

    $ curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.es/speech-api/v1/recognize?xjerr=1&client=chromium&lang=es_ES" -F "myfile=@hola_es.flac"
    {"status":0,"id":"2e1f7b23562ccb95e72af513e3f243a0-1","hypotheses":[{"utterance":"ull","confidence":0.20765519}]}

    ull????? What’s ull?? If I use the HTML5 example, it works fine too, but I need it on the command line. Any ideas?

  34. Raza

    I tried this snippet of c# code but getting back the line below. Any suggestion on what is going wrong would be greatly appreciated.

    {"status":5,"id":"186c0611e33571e187bbd85c0bbd1f85-1","hypotheses":[]}

    ------------------------
    string uploadUrl = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US";

    FileStream rdr = new FileStream("C:\\test.wav", FileMode.Open);

    WebRequest request = WebRequest.Create(uploadUrl);
    request.Method = "POST";
    request.ContentType = "audio/x-flac; rate=16000";
    byte[] byteArray = new byte[rdr.Length];
    int bytesRead = rdr.Read(byteArray, 0, byteArray.Length);
    request.ContentLength = byteArray.Length;
    using (Stream dataStream = request.GetRequestStream())
    {
    dataStream.Write(byteArray, 0, byteArray.Length);
    }

    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StreamReader reader = new StreamReader(response.GetResponseStream());
    string strText = reader.ReadToEnd();
    reader.Close();

  35. slm32006

    for those interested, here is a php script i’m using currently that works quite well.

    =====================
