How To Mine Twitter Streams from PHP in Real Time

UPDATE: I’ve written a new post with an example of how to connect to the v1.1 Twitter API using OAuth: http://mikepultz.com/2013/06/mining-twitter-api-v1-1-streams-from-php-with-oauth/

Need to mine Twitter for tweets related to certain keywords?

No problem.

Twitter provides a pretty simple streaming interface to the onslaught of tweets it receives: you specify the keywords you want to track, and matching tweets are delivered to you live, as they are posted.

To do this, I created a simple PHP class that can run in the background, collecting tweets for certain keywords:

ctwitter_stream.php

<?php

class ctwitter_stream
{
    private $m_username;
    private $m_password;

    public function __construct()
    {
        //
        // set a time limit to unlimited
        //
        set_time_limit(0);
    }

    //
    // set the login details
    //
    public function login($_username, $_password)
    {
        $this->m_username = $_username;
        $this->m_password = $_password;
    }

    //
    // process a tweet object from the stream
    //
    private function process_tweet(array $_data)
    {
        print_r($_data);

        return true;
    }

    //
    // the main stream manager
    //
    public function start(array $_keywords)
    {
        while(1)
        {
            $fp = fsockopen("ssl://stream.twitter.com", 443, $errno, $errstr, 30);
            if (!$fp)
            {
                echo "ERROR: Twitter Stream Error: failed to open socket: $errstr ($errno)\n";
            } else
            {
                //
                // build the request
                //
                $request  = "GET /1/statuses/filter.json?track=";
                $request .= urlencode(implode(',', $_keywords)) . " HTTP/1.1\r\n";
                $request .= "Host: stream.twitter.com\r\n";
                $request .= "Authorization: Basic ";
                $request .= base64_encode($this->m_username . ':' . $this->m_password);
                $request .= "\r\n\r\n";

                //
                // write the request
                //
                fwrite($fp, $request);

                //
                // set it to non-blocking
                //
                stream_set_blocking($fp, 0);

                while(!feof($fp))
                {
                    $read   = array($fp);
                    $write  = null;
                    $except = null;

                    //
                    // select, waiting up to 10 minutes for a tweet; if we don't get one,
                    // then reconnect, because it's possible something went wrong.
                    //
                    $res = stream_select($read, $write, $except, 600, 0);
                    if ( ($res == false) || ($res == 0) )
                    {
                        break;
                    }

                    //
                    // read the JSON object from the socket
                    //
                    $json = fgets($fp);
                    if ( ($json !== false) && (strlen($json) > 0) )
                    {
                        //
                        // decode the JSON to a PHP array; skip anything that is
                        // not a JSON object (e.g. HTTP header lines)
                        //
                        $data = json_decode($json, true);
                        if (is_array($data))
                        {
                            //
                            // process it
                            //
                            $this->process_tweet($data);
                        }
                    }
                }
            }

            //
            // only close the socket if it was actually opened
            //
            if ($fp)
            {
                fclose($fp);
            }
            sleep(10);
        }

        return;
    }
}

The process_tweet() method is called for each matching tweet; just modify that method to process the tweet however you want (load it into a database, print it to the screen, email it, etc.). The keyword matching isn’t perfect: if you search for a phrase of several words, Twitter won’t necessarily match the words in that exact order, but you can check for that yourself in process_tweet().
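As a rough sketch of that exact-order check, a helper along these lines could be called from process_tweet() to keep only the tweets that contain the phrase as written. The function name is made up for illustration, and it assumes the decoded tweet array carries its body under the 'text' key; the comparison is a plain case-insensitive substring match, so it only folds case for ASCII characters.

```php
<?php

//
// hypothetical helper (not part of the class above): returns true if the
// tweet text contains the exact phrase, ignoring case
//
function tweet_contains_phrase(array $tweet, string $phrase): bool
{
    //
    // an empty phrase, or a message without a 'text' field (assumed key
    // name), never matches
    //
    if ( ($phrase === '') || !isset($tweet['text']) || !is_string($tweet['text']) )
    {
        return false;
    }

    //
    // stripos() returns false when the phrase is not found
    //
    return stripos($tweet['text'], $phrase) !== false;
}
```

Inside process_tweet(), you could then return early when tweet_contains_phrase($_data, 'your phrase') is false, and only store or print the tweets that pass.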

Then create a simple PHP application to run the collector:

<?php

require 'ctwitter_stream.php';

$t = new ctwitter_stream();

$t->login('your twitter username', 'your twitter password');

$t->start(array('facebook', 'fbook', 'fb'));

Just provide your Twitter account username and password, and then an array of keywords/strings to search for.

Since this application runs continuously in the background, it’s obviously not meant to be run via a web request, but meant to be run from the command line of your Unix or Windows box.

According to the Twitter documentation, the default access level allows up to 400 keywords, so you can track all sorts of things at the same time. If you need more details, see the Twitter streaming API documentation.
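If the keyword list comes from user input or a config file, it may be worth cleaning it up and checking it against that limit before connecting. The sketch below is one way to do that; the function and constant names are hypothetical, and the 400 figure is simply the number quoted above, so adjust it if your access level differs.

```php
<?php

//
// assumed limit: the default access level figure quoted in the text
//
const MAX_TRACK_KEYWORDS = 400;

//
// hypothetical helper: cleans a keyword list and returns the URL-encoded
// value for the track= parameter
//
function build_track_parameter(array $keywords, int $limit = MAX_TRACK_KEYWORDS): string
{
    //
    // trim whitespace, then drop empty entries and duplicates
    //
    $clean = array_values(array_unique(array_filter(array_map('trim', $keywords), 'strlen')));

    if (count($clean) === 0)
    {
        throw new InvalidArgumentException('no keywords given');
    }
    if (count($clean) > $limit)
    {
        throw new InvalidArgumentException('too many keywords: ' . count($clean));
    }

    //
    // commas become %2C, and spaces inside a keyword become +
    //
    return urlencode(implode(',', $clean));
}
```

For example, build_track_parameter(array(' facebook ', 'fb', '', 'fb')) returns "facebook%2Cfb". Trimming each keyword first also avoids a stray "+" after the encoded comma when the list was typed in as "facebook, fb".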

This class connects through PHP’s ssl:// socket transport, so you’ll need the OpenSSL extension enabled for it to work.
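A small pre-flight check can catch both requirements (command-line execution and OpenSSL) before the collector starts. This is only a sketch: the function name is invented for illustration, and it takes the SAPI name and the OpenSSL flag as arguments so that it is easy to test.

```php
<?php

//
// hypothetical helper: returns a list of reasons why the collector should
// not be started in this environment (an empty list means all good)
//
function environment_problems(string $sapi, bool $has_openssl): array
{
    $problems = array();

    //
    // the collector loops forever, so it should run from the command line
    //
    if ($sapi !== 'cli')
    {
        $problems[] = 'run this script from the command line, not via a web request';
    }

    //
    // the ssl:// transport needs the OpenSSL extension
    //
    if (!$has_openssl)
    {
        $problems[] = 'the OpenSSL extension is not enabled';
    }

    return $problems;
}
```

Before calling $t->start(), you could call environment_problems(PHP_SAPI, extension_loaded('openssl')) and exit with the returned messages if the list is not empty.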

12 thoughts on “How To Mine Twitter Streams from PHP in Real Time”

  1. Jose Cardenas

    Thanks for the great code. I’ve been using a modified version of this and creating a web page to display a filtered stream. Everything works great except when I try submitting keywords from another page through a form submit. I figured out that I needed to encode the string to create the array and then submit it to the start() function. I’m finding that once it adds the string to the URL, it has an extra “+” in between the keywords, so I get a %2C+ instead of just %2C. If I hard-code the keywords within the same page, it works great and doesn’t have the “+”. That’s the only difference I can see. Any suggestions on this would be greatly appreciated!

  2. mike Post author

    Hey Jose,

    Well, %2C is a comma (,), and + is usually put in for a space (that or %20); if you urldecode() the content, does it come out OK?

    Paste in your code if you’re still having issues and I can take a look.

    Mike

  3. Pingback: Kip – Knowledge Is Power – How To Mine Twitter Streams from PHP in Real Time

  4. Adam Rochford

    I’m trying to run this from the command line as a background process on Ubuntu, however whenever I run it, it stops and never really executes. Any idea on how to do this?

  5. mike Post author

    Hey Adam,

    The only problem I’ve ever had running PHP in the background is the time_limit setting; just use “set_time_limit(0);” to disable it.

    Otherwise it’ll have a maximum execution time of 30 seconds (or something like that).

    Mike

  6. mathieu

    hey mike great work!! soon the twitter api 1.0 will expire, can you show me how to do this with 1.1 and oauth?
    thank you in advance:)

  7. Pingback: Mining Twitter API v1.1 Streams from PHP – with OAuth | mike pultz

  8. Don

    Hi thanks for the post,

    Just a couple of quick questions for you.

    1. Once this is set up and running smoothly, how would you cancel the stream / close the socket?

    2. Is it possible to have several of these running at one time, or if I wanted to add another keyword, would I need to cancel the original stream and start a new stream with the additional keyword included?

    Thanks for your help

  9. mike Post author

    Hey Don,

    I think I tried running more than one once, and Twitter rejected the connection; but don’t quote me on that. Things might be different too with the v1.1 API. It’s pretty easy to just start up two of these and see what happens.

    Also, you can add a lot of search keywords to that keyword array, so you probably shouldn’t have to run more than one.

    When you want to stop it, you just need to end the process. On my system, I just created a Linux RC script to start/stop the process.

    FYI, if you haven’t seen it, I wrote an updated version for the v1.1 API using OAuth.

    http://mikepultz.com/2013/06/mining-twitter-api-v1-1-streams-from-php-with-oauth/

    Mike

  10. Don

    Thanks Mike,

    I got the multiple keywords running now, and I’m using process_tweet() to route each tweet to the correct place, so all good without the multiple streams. Still struggling with making sure I close a stream before opening it again with a revised array of keywords, though.

    At present I start the process by running the stream script with an exec in the background (note that for now I set a time limit on it so that it times out after 3 minutes rather than lasting forever),
    i.e. exec("php myfile.php &> /dev/null &");

    So I’m just trying to figure out how to cancel this process before restarting it again.

    Any thoughts would be much appreciated. Also thanks for the updated version you wrote up using OAuth – have swapped it over and all working perfectly!

    cheers
    don

  11. Don

    Ah, not to worry – found a way to kill the process while keeping track of the PID.

    cheers
    don

  12. Hans

    Hi Mike,
    I have been using this code for a few days and it works fine. Now I want to add something to it, and it stops working.
    Instead of just using the track parameter, I’d like to combine it with the locations parameter. Whatever I try, I keep getting a 401 in return.
    In the start function I added this line:
    $data = 'track=' . rawurlencode(implode($_keywords, ',')); (existing line)
    $data = 'locations=' . rawurlencode('3,50,8,54') . '&' . $data; (added line)
    When I echo the request it looks fine to me, but not to the Twitter API.
    Any idea how to add the locations parameter the right way?

    I know it is possible, because when I make the request in cURL, I use the combination like this: --data "locations=3%2C50%2C8%2C54&track= " and it works perfectly.

    By the way: setting up multiple streams is possible, but not with the same application defined in Twitter. To start more streams simultaneously, you have to create an application for every stream. Otherwise one of your running streams will be canceled.
