How To Mine Twitter Streams from PHP in Real Time
Need to mine Twitter for tweets related to certain keywords?
Twitter provides a simple streaming interface to the onslaught of tweets it receives: you specify the keywords you want to track, and matching tweets are delivered to you in real time.
To do this, I created a simple PHP class that can run in the background, collecting tweets for certain keywords:
ctwitter_stream.php
<?php
class ctwitter_stream
{
    private $m_username;
    private $m_password;
    public function __construct()
    {
        //
        // remove the execution time limit, since the collector runs indefinitely
        //
        set_time_limit(0);
    }
    //
    // set the login details
    //
    public function login($_username, $_password)
    {
        $this->m_username = $_username;
        $this->m_password = $_password;
    }
    //
    // process a tweet object from the stream
    //
    private function process_tweet(array $_data)
    {
        print_r($_data);
        return true;
    }
    //
    // the main stream manager
    //
    public function start(array $_keywords)
    {
        while(1)
        {
            $fp = fsockopen("ssl://stream.twitter.com", 443, $errno, $errstr, 30);
            if (!$fp)
            {
                echo "ERROR: Twitter Stream Error: failed to open socket: $errstr ($errno)\n";
            } else
            {
                //
                // build the request
                //
                $request = "GET /1/statuses/filter.json?track=";
                $request .= urlencode(implode(',', $_keywords)) . " HTTP/1.1\r\n";
                $request .= "Host: stream.twitter.com\r\n";
                $request .= "Authorization: Basic ";
                $request .= base64_encode($this->m_username . ':' . $this->m_password);
                $request .= "\r\n\r\n";
                //
                // write the request
                //
                fwrite($fp, $request);
                //
                // set it to non-blocking
                //
                stream_set_blocking($fp, 0);
                while(!feof($fp))
                {
                    $read = array($fp);
                    $write = null;
                    $except = null;
                    //
                    // select, waiting up to 10 minutes for a tweet; if we don't get one,
                    // then reconnect, because it's possible something went wrong.
                    // stream_select() returns false on error and 0 on timeout.
                    //
                    $res = stream_select($read, $write, $except, 600, 0);
                    if ( ($res === false) || ($res === 0) )
                    {
                        break;
                    }
                    //
                    // read one line from the socket; header lines and other non-JSON
                    // lines fail to decode below and are skipped
                    //
                    $json = fgets($fp);
                    if ( ($json !== false) && (strlen($json) > 0) )
                    {
                        //
                        // decode the JSON to a PHP array
                        //
                        $data = json_decode($json, true);
                        if ($data)
                        {
                            //
                            // process it
                            //
                            $this->process_tweet($data);
                        }
                    }
                }
                //
                // close the socket only if it was opened successfully
                //
                fclose($fp);
            }
            //
            // wait before reconnecting
            //
            sleep(10);
        }
        return;
    }
}
The process_tweet() method is called for each matching tweet; just modify it to process the tweet however you want (load it into a database, print it to the screen, email it, etc.). The keyword matching isn't exact: if you track a string of several words, a tweet can match without containing the words in that exact order, but you can check for that yourself in process_tweet().
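As a rough sketch of that exact-phrase check, assuming the decoded tweet array carries the message in a 'text' key (the helper name tweet_contains_phrase() is made up for this example), something like this could be called from process_tweet():

```php
<?php
// Hypothetical helper, not part of the class above: returns true when the
// tweet's text contains the exact phrase, ignoring case.
// Assumption: the decoded tweet array stores the message under 'text'.
function tweet_contains_phrase(array $tweet, $phrase)
{
    // Non-tweet messages or an empty phrase never count as a match
    if ($phrase === '' || !isset($tweet['text']) || !is_string($tweet['text'])) {
        return false;
    }
    // stripos() is a case-insensitive substring search
    return stripos($tweet['text'], $phrase) !== false;
}
```

Inside process_tweet() you could then return early when tweet_contains_phrase($_data, 'some phrase') is false, and only store or print the tweets that pass.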
Then create a simple PHP application to run the collector:
<?php
require 'ctwitter_stream.php';
$t = new ctwitter_stream();
$t->login('your twitter username', 'your twitter password');
$t->start(array('facebook', 'fbook', 'fb'));
Just provide your Twitter account username and password, and then an array of keywords or strings to track.
Since this application runs continuously in the background, it isn't meant to be run via a web request; run it from the command line of your Unix or Windows box instead.
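If you want the script to refuse to start from a web request, one possible guard is to check the SAPI name at the top of the runner. This is only a sketch; the function name is_command_line() is made up for this example, and it treats nothing but the plain 'cli' SAPI as the command line:

```php
<?php
// Hypothetical guard: true only when the given SAPI name is the CLI.
// A name can be passed in explicitly for testing; by default the SAPI
// of the running PHP process is used.
function is_command_line($sapi = null)
{
    if ($sapi === null) {
        $sapi = php_sapi_name();
    }
    return $sapi === 'cli';
}

// Bail out early when invoked through a web server
if (!is_command_line()) {
    exit("This collector must be run from the command line.\n");
}
```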
According to the Twitter documentation, the default access level allows up to 400 keywords, so you can track all sorts of things at the same time. More details are available in the Twitter streaming API documentation.
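To stay under that limit, you could tidy the keyword list before handing it to start(). The sketch below is an illustration only: prepare_keywords() is a made-up helper, the 400 default comes from the limit quoted above, and treating keywords that differ only in case as duplicates is my assumption:

```php
<?php
// Hypothetical helper: trims keywords, drops empty entries and
// case-insensitive duplicates (keeping the first spelling seen),
// and throws if more than $max keywords remain.
function prepare_keywords(array $keywords, $max = 400)
{
    $clean = array();
    foreach ($keywords as $keyword) {
        $keyword = trim((string) $keyword);
        $key = strtolower($keyword);
        if ($keyword !== '' && !isset($clean[$key])) {
            $clean[$key] = $keyword;
        }
    }
    $clean = array_values($clean);
    if (count($clean) > $max) {
        throw new InvalidArgumentException(
            'Too many keywords: ' . count($clean) . ' (limit ' . $max . ')'
        );
    }
    return $clean;
}
```

The runner could then call $t->start(prepare_keywords($my_keywords)), where $my_keywords is whatever list you have collected.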
This class connects through PHP's ssl:// socket transport, so you'll need the OpenSSL extension enabled for it to work.
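A quick way to confirm this before starting the collector is to ask PHP whether the extension and the transport are present. This is a minimal sketch; ssl_transport_available() is a name I made up for the example:

```php
<?php
// Hypothetical pre-flight check: true when the OpenSSL extension is loaded
// and PHP lists "ssl" among its registered socket transports.
function ssl_transport_available()
{
    return extension_loaded('openssl')
        && in_array('ssl', stream_get_transports(), true);
}

if (!ssl_transport_available()) {
    echo "The OpenSSL extension is required for ssl:// sockets.\n";
}
```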

March 20th, 2012 - 19:03
Thanks for the great code, I’ve been using a modified version of this and creating a web page to display a filtered stream. Everything works great except when I try submitting keywords from another page through a form submit. I figured out that I needed to encode the string to create the array and then submit it to the start() function. I’m finding that once it adds the string to the URL it has an extra “+” in between the keywords, so I get a %2C+ instead of just %2C. If I hard code the keywords within the same page it works great and doesn’t have the ‘+’. That’s the only thing that I see different. Any suggestions on this would be greatly appreciated!
March 20th, 2012 - 22:49
Hey Jose,
Well, %2C is a comma (,), and + is usually put in for a space (that or %20); if you urldecode() the content, does it come out ok?
Paste in your code if you’re still having issues, and I can take a look.
Mike
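For anyone hitting the same %2C+ issue: one likely cause, assuming the form field holds a comma-separated string with a space after each comma, is that the space survives into the keyword and urlencode() turns it into "+". A minimal sketch of trimming each keyword first (keywords_from_form() is a made-up helper name):

```php
<?php
// Hypothetical helper: turn a comma-separated form field into a clean
// keyword array, trimming the whitespace that urlencode() would otherwise
// encode as "+".
function keywords_from_form($input)
{
    $keywords = array_map('trim', explode(',', (string) $input));
    // Drop empty entries left by stray commas and reindex the array
    return array_values(array_filter($keywords, function ($keyword) {
        return $keyword !== '';
    }));
}

$raw = 'facebook, fbook, fb';
echo urlencode($raw), "\n";                                   // facebook%2C+fbook%2C+fb
echo urlencode(implode(',', keywords_from_form($raw))), "\n"; // facebook%2Cfbook%2Cfb
```

The cleaned array can be passed straight to start(), which joins and encodes it the same way as the second echo line.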