Skip to content

Monitor Twitter stream for S&P 500 companies to identify & act on unexpected increases in tweet volume

Notifications You must be signed in to change notification settings

hortonworks-gallery/hdp22-twitter-demo

Repository files navigation

Streaming workshop demo

This demo was part of below technical webinar workshops

  • "Real Time Monitoring with Hadoop" - Slides and webinar recording are available here
  • "Search Workshop" - Slides and webinar recording are available here

Author: Ali Bajwa

With special thanks to:

  • Guilherme Braccialli for helping to maintain the code and adding sentiment analysis component
  • Tim Veil for developing the original banana dashboard

Purpose: Monitor Twitter stream for S&P 500 companies to identify & act on unexpected increases in tweet volume

  • Ingest: Listen for Twitter streams related to S&P 500 companies

  • Processing:

    • Monitor tweets for unexpected volume
    • Volume thresholds managed in HBASE
  • Persistence:

    • HDFS (for future batch processing)
    • Hive (for interactive query)
    • HBase (for realtime alerts)
    • Solr/Banana (for search and reports/dashboards)
    • Audits in Ranger/Solr/Banana
    • Authorization policies in Ranger
  • Refine:

    • Update threshold values based on historical analysis of tweet volumes
  • Demo setup:

    • Either download and start prebuilt VM
    • Start HDP 2.4 sandbox and run provided scripts to setup demo
  • Previous versions

    • For HDP 2.3.2 instructions see here
    • For HDP 2.3.0 instructions see here
    • For HDP 2.2 instructions see here
    • For HDP 2.1 version of this demo see here

Contents

  1. Option 1: Setup demo using prebuilt VM based on HDP 2.3 sandbox
  2. Option 2: Setup demo via scripts on vanilla HDP 2.4 sandbox
  3. Kafka basics - optional
  4. Setup Eclipse
  5. Run demo to monitor Tweets about S&P 500 securities in realtime
  6. Stop demo
  7. Troubleshooting
  8. Observe results in HDFS, Hive, Solr/Banana, HBase
  9. Use Zeppelin to create charts to analyze tweets - optional
  10. Import data into BI tools - optional
  11. Other things to try - optional
  12. Reset demo
  13. Run demo on cluster

Option 1: Setup demo using prebuilt VM based on HDP 2.3 sandbox

  • Download VM from here. Import it into VMWare Fusion and start it up.
  • Find the IP address of the VM and add an entry into your machines hosts file e.g.
192.168.191.241 sandbox.hortonworks.com sandbox    
  • Connect to the VM via SSH (password hadoop)
  • Start the demo by
cd /root/hdp22-twitter-demo
./start-demo.sh
#once storm topology is submitted, press control-C

#start kafka twitter producer
./kafkaproducer/runkafkaproducer.sh

  • Observe results in HDFS, Hive, Solr/Banana, HBase

  • Troubleshooting: check the Storm webUI for any errors and try resetting using below script:

./reset-demo.sh

Option 2: Setup demo via scripts on vanilla HDP 2.4 sandbox

These setup steps are only needed first time and may take upto 30min to execute (depending on your internet connection)

  • While waiting on any step, if you don't already have Twitter credentials, follow steps here to get them

  • Download HDP 2.4 sandbox VM image file (Hortonworks_sanbox_with_hdp_2_4_vmware.ova) from Hortonworks website

  • Import the ova into VMWare Fusion and allocate at least 4cpus and 8GB RAM (its preferable to increase to 10GB+ RAM) and start the VM

  • Find the IP address of the VM and add an entry into your machines hosts file e.g.

192.168.191.241 sandbox.hortonworks.com sandbox    
  • Download code as root user
cd
git clone https://github.com/hortonworks-gallery/hdp22-twitter-demo.git	
  • (Optional) To setup a development env on the sandbox to browse/modify the code, download Ambari service for VNC (details below). Not required if you just want to setup the demo
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/hortonworks-gallery/ambari-vnc-service.git   /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/VNCSERVER   
service ambari restart
  • Setup demo:Run below to setup demo (one time): it will start Ambari/HBase/Kafka/Storm and install maven, solr, banana.
cd /root/hdp22-twitter-demo
./setup-demo.sh
  • while it runs, proceed with installing VNC service per steps below

Optional - Setup VNC/Eclipse on your sandbox
  • Once the status of HDFS/YARN has changed from a yellow question mark to a green check mark...

  • Setup Eclipse on the sandbox VM and remote desktop into it using an Ambari service for VNC

    • In Ambari open, Admin > Stacks and Services tab.
    • Deploy the service by selecting:
      • VNC Server -> Add service -> Next -> Next -> Enter password (e.g. hadoop) -> Next -> Proceed Anyway -> Deploy
      • Make sure the password is at least 6 characters or install will fail
    • Connect to VNC from local laptop using a VNC viewer software (e.g. Tight VNC viewer or Chicken of the VNC or just your browser). Detailed steps here
    • Import code into Eclipse using "Getting started with Storm and Maven in Eclipse environment" steps here
    • Review Storm code in Eclipse under /root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm:
      • GNstorm.java: Main class, also where topology, KafkaSpout, HDFSBolts instatiated
      • TwitterScheme.java: defines structure of a Tweet
      • SolrBolt.java: writes to Solr
      • TwitterRuleBolt.java: defines business logic of when a tweet should results in an alert

Setup Twitter credentials
  • Twitter4J requires you to have a Twitter account and obtain developer keys by registering an "app". Create a Twitter account and app and get your consumer key/token and access keys/tokens: https://apps.twitter.com > sign in > create new app > fill anything > create access tokens
  • Then enter the 4 values into the file below in the sandbox
vi /root/hdp22-twitter-demo/kafkaproducer/twitter4j.properties
oauth.consumerKey=
oauth.consumerSecret=
oauth.accessToken=
oauth.accessTokenSecret=

Kafka basics
  • Once Kafka is started, run through the below commands to understand the basics
#check if kafka already started
ps -ef | grep kafka

#if not, start kafka
nohup /usr/hdp/current/kafka-broker/bin/kafka-server-start.sh /usr/hdp/current/kafka-broker/config/server.properties &

#create topic
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --zookeeper $(hostname -f):2181 --replication-factor 1 --partitions 1 --topic test

#list topic
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $(hostname -f):2181 --list | grep test

#start a producer and enter text on few lines
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list $(hostname -f):6667 --topic test

#start a consumer in a new terminal your text appears in the consumer
/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --zookeeper $(hostname -f):2181 --topic test --from-beginning

#hit Control-C on both terminals to quit the consumer/producer

#delete topic (only works if delete.topic.enable and setup auto.create.topics.enable is set to true in Ambari > Kafka > Config)
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --delete --zookeeper $(hostname -f):2181 --topic test

Run Twitter demo

Most of the below steps are optional as they were already executed by the setup script above but are useful to understand the components of the demo:

/root/hdp22-twitter-demo/fetchSecuritiesList/rungeneratecsv.sh
cat /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv
  • (Optional) for future runs: you can add other stocks/hashtags to monitor to the csv (make sure no trailing spaces/new lines at the end of the file). Find these at http://mobile.twitter.com/trends
sed -i '1i$HDP,Hortonworks,Technology,Technology,Santa Clara CA,0000000001,5' /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv
sed -i '1i#hadoopsummit,Hadoop Summit,Hadoop,Hadoop,Santa Clara CA,0000000001,5' /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv
  • Open connection to HBase via Phoenix and check you can list tables. Notice securities data was imported and alerts table is empty
/usr/hdp/current/phoenix-client/bin/sqlline.py  localhost:2181:/hbase-unsecure
!tables
select * from securities;
select * from alerts;
select * from dictionary;
!q
  • check Hive table schema where we will store the tweets for later analysis
hive -e 'desc tweets_text_partition'
  • Start Storm Twitter topology to generate alerts into an HBase table for stocks whose tweet volume is higher than threshold this will also read tweets into Hive/HDFS/local disk/Solr/Banana. The first time you run below, maven will take 15min to download dependent jars
cd /root/hdp22-twitter-demo
./start-demo.sh
#once storm topology is submitted, press control-C
  • (Optional) Other modes the topology could be started in future runs if you want to clean the setup or run locally (not on the storm running on the sandbox)
cd /root/hdp22-twitter-demo/stormtwitter-mvn
./runtopology.sh runOnCluster clean
./runtopology.sh runLocally skipclean
  • If you see errors like below, double check in Ambari that Storm is still up
Caused by: java.net.ConnectException: Connection refused
Could not find leader nimbus from seed hosts [sandbox.hortonworks.com]. Did you specify a valid list of nimbus hosts for config nimbus.seeds
/root/hdp22-twitter-demo/kafkaproducer/runkafkaproducer.sh

To stop collecting tweets:

  • To stop producing tweets, press Control-C in the terminal you ran runkafkaproducer.sh

  • kill the storm topology to stop processing tweets

storm kill Twittertopology

Troubleshooting

  • If Storm webUI shows topology errors...

  • Check the Storm webUI for any errors and try resetting using below script:

./reset-demo.sh

Observe results

Image

Image

  • Open Banana UI and view/search tweet summary and alerts: http://sandbox.hortonworks.com:8983/solr/banana/index.html Image

    • You can also access the UI via Ambari view by following steps here and replacing the url with http://sandbox.hortonworks.com:8983/solr/banana/index.html
    • For more details on the Banana dashboard panels are built, refer to the underlying json file that defines all the panels
    • In case you don't see any tweets, try changing to a different timeframe on timeline (e.g. by clicking 24 hours, 7 days etc). If there is a time mismatch between the VM and your machine, the tweets may appear at a different place on the timeline than expected.
  • Run a query in Solr to look at tweets/hashtags/alerts. Click on 'Query' and enter a query under 'q'. Examples are doctype_s:tweet and text_t:AAPL. You can choose an output format under 'wt' and click 'Execute Query'. http://sandbox.hortonworks.com:8983/solr/#/tweets

Image

/usr/hdp/current/phoenix-client/bin/sqlline.py  localhost:2181:/hbase-unsecure
select * from alerts

Image

  • Notice tweets written to sandbox filesystem via FileSystem bolt
vi /tmp/Tweets.xls
  • Open Hive view and query the tweets table: Image

Use Zeppelin to create charts to analyze tweets

  • Apache Zeppelin can also be installed on the cluster/sandbox to generate charts for analysis using:

    • Spark
    • SparkSQL
    • Hive
    • Flink
  • The Zeppelin Ambari service can be used to easily install/manage Zeppelin on HDP cluster

  • Sample queries and charts: Image


Import data to BI Tool via ODBC for analysis - optional

  • Create ORC table and copy the tweets over:
hive -f /root/hdp22-twitter-demo/stormtwitter-mvn/createORC.sql

Image

  • Grant select access to user hive to the ORC table
hive -e 'grant SELECT on table tweets_orc_partition_single to user hive'
  • On windows VM create an ODBC connector called sandbox with below settings:
	Host=<IP address of sandbox VM>
	port=10000 
	database=default 
	Hive Server type=Hive Server 2 
	Mechanism=User Name 
	UserName=hive 

Image

  • Import data from tweets_orc_partition_single table into Excel over ODBC Data > From other Datasources > From dataconnection wizard > ODBC DSN > sandbox > tweets_orc_partition_single > Finish > Yes > OK

Image

  • Create some visualizations using PowerCharts (e.g. plotting tweets with coordinates on map) Image

Other things to try: Analyze any kind of tweet - optional

  • Instead of filtering on tweets from certain stocks/hashtags, you can also consume all tweets returned by TwitterStream API and re-run runkafkaproducer.sh Note that in this mode a large volume of tweets is generated so you should stop the kafka producer after 20-30s to avoid overloading the system It also may take a few minutes after stopping the kafka producer before all the tweets show up in Banana/Hive
mv /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv.bak
  • To filter tweets based on geography open below file and uncomment this line and re-run runkafkaproducer.sh
vi /root/hdp22-twitter-demo/kafkaproducer/TestProducer.java
/root/hdp22-twitter-demo/kafkaproducer/runkafkaproducer.sh

Reset demo

  • This empties out the demo related HDFS folders, Hive table, Solr core, Banana webapp and stops the storm topoogy
/root/hdp22-twitter-demo/reset-demo.sh
  • If kafka keeps sending your topology old tweets, you can also clear kafka queue
zookeeper-client
rmr /group1

Run demo on cluster

  • To run on actual cluster instead of sandbox, there are a few things that need to be changed before compiling/starting the demo:
    • when running kafka/Hbase shell, change the zookeeper connect strings to use localhost instead of sandbox

    • /root/hdp22-twitter-demo/solrconfig.xml: change sandbox reference to HDFS location of solr user e.g. hdfs://summit-twitterdemo01.cloud.hortonworks.com:8020/user/solr

    • /root/hdp22-twitter-demo/default.json: change sandbox to Solr server (e.g. summit-twitterdemo01.cloud.hortonworks.com)

    • /root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm/GNstorm.java: change zookeeper host, NN url, Hive metastore

      • BrokerHosts hosts = new ZkHosts("localhost:2181”);
      • String fsUrl = "hdfs://summit-twitterdemo01.cloud.hortonworks.com:8020";
      • String sourceMetastoreUrl = "thrift://summit-twitterdemo02.cloud.hortonworks.com:9083”;
    • /root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm/SolrBolt.java: change Solr collection url reference and zookeeper host reference

      • server = new HttpSolrServer("http://summit-twitterdemo01.cloud.hortonworks.com:8983/solr/tweets");
      • conn = phoenixDriver.connect("jdbc:phoenix:localhost:2181:/hbase-unsecure",new Properties());
    • /root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm/TwitterRuleBolt.java: change Solr collection url reference and zookeeper host reference

      • conn = phoenixDriver.connect("jdbc:phoenix:localhost:2181:/hbase-unsecure",new Properties());
      • SolrServer server = new HttpSolrServer("http://summit-twitterdemo01.cloud.hortonworks.com:8983/solr/tweets");

About

Monitor Twitter stream for S&P 500 companies to identify & act on unexpected increases in tweet volume

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published