Disclaimer:  We have not verified how fake the ‘fake news’ really is, or how the true the ‘real news’ really is.  This process is repeatable with ANY data.

Having seen a fake news dataset on Kaggle, we wanted to see if we could differentiate between ‘fake news’ and ‘real news’.  Four of the SAP HANA capabilities enabled us to attempt this challenge.  Text Analytics, Predictive (External Machine Learning), Data Modeling and Data Virtualization.

Main steps covered here

  1. Data Acquisition & Data Preparation
  2. Model Creation
  3. Model Deployment
  4. Model Consumption
  5. Validation
  6. Lessons Learned

Data Acquisition & Data Preparation

For this challenge we combined 2 datasets

  1. Real News from Twitter (using reputable news agencies) 26,425 tweets
  2. Fake News from Kaggle 12,994 articles

We used the data virtualization capability, HANA Smart Data Integration (SDI) Twitter Adapter to acquire some ‘real news’ tweets from Twitter. We used 9 Twitter news feeds BBC, CNN, Bloomberg, Reuters, ABC News, Wall Street Journal, New York Times, The Guardian and Sky News. The Twitter API limits queries to 3200 tweets, so we captured latest 3200 Tweets from each.  if needed we could parameterize these queries to return more historical Tweets.

-- GetUserTimeline Function already built as per Twitter Adapter setup CREATE COLUMN TABLE NEWS_REAL AS (select ID, SCREENNAME, TWEET, CREATEDAT from GetUserTimeline('BBCNews',3500,null,null) where RETWEET = 0);

The Twitter feeds appeared to be pretty good for the chosen news outlets, so minimal data cleanup was needed for the ‘real news’.  We did remove all RTs (re-tweets).

The ‘fake news’ dataset is freely available from Kaggle, with some example pieces of code.
We found multiple different issues with the fake news dataset.  This is not unusual, some time was spent correcting and filtering the issues.  The issues found include datatype mismatches, columns offset and incorrect UTF-8 character encoding.  Python was then used to load the clean fake news.

# Python 3 Code df = pd.read_csv('../data/Fake.News/Fake/fake.csv', usecols =['uuid','author','title','text','language','site_url']) # Filter out the records with UTF-8 issues df = df[df.uuid !='97b1c7d89d0c2856afb7d729ad79038ea88b6943'] df = df[df.uuid !='e8077b77ccacf1493c94c9f85d9d47d949b340cb'] df = df[df.uuid !='effbcf6de55f9d987e4d8c5619843aff0219333d'] df = df[df.uuid !='fce977b83e53b59c297a8ef5b3523186bd6837fd'] df = df[df.uuid !='42ac7f1707d16e3c95317025e0cf5d5cd009e373'] df['title'].fillna(value="", inplace=True) df.dropna(axis=0, inplace=True, subset=['text']) # Trim Column Lengths df.title = df.title.str.slice(0, 253) df.text = df.text.str.slice(0, 4999) df.author = df.text.str.slice(0,127) print(df.shape) df.head()

Having filtered out the rubbish, we used the HANA sqlalchemy driver available from SAP’s Github to load into HANA. If you use the sqlalchemy library, advice is to use the GitHub source currently the repository one does not support Python 3, but the git source does.

from sqlalchemy import create_engine hanaeng = create_engine('hana+pyhdb://USER:PASS@myHANAServer.com:30015') df.to_sql('NEWS_FAKE', con=hanaeng, index=False, dtype={'uuid': sqlalchemy.types.NVARCHAR(length=100), 'author': sqlalchemy.types.NVARCHAR(length=128), 'title': sqlalchemy.types.NVARCHAR(length=256), 'text': sqlalchemy.types.NVARCHAR(length=5000), 'language': sqlalchemy.types.NVARCHAR(length=24), 'site_url': sqlalchemy.types.NVARCHAR(length=255)})

Now we have both datasets in HANA we can make a combined table with an indicator whether it is ‘fake’ or ‘real’.  We trimmed both strings to 140 characters, randomised the order and created a  sequential ID.


The combined table is 39,373 rows.  Checking the data distribution shows a 1/3rd vs 2/3 split.  With more real than fake, we didn’t balance the data but this could be done in HANA or python if necessary.

We tried to remove any obvious suspicious or bias texts in the data.  For example tweets often include URLs http/s, where as the fake news did not.  We also removed the news Twitter handles, again another potential giveaway in the data.

We created a full text index on the title, this takes the unstructured text and restructures it in to a ‘Text Analysis’ $TA_ table.  In doing so, we tokenise the raw text data.


Creating the tokenised data in HANA allows us to create a dictionary of words mapped to numbers (TOKEN_IDs).  These TOKEN_IDs are exactly what Deep Learning models such as TensorFlow require.


Model Creation

Some Data Science colleagues, Nidhi Sawhney and Stojan Maleschlijski  are the brains behind the Model Creation, as you may know, this is not in isolation they also needed to perform further data prep and then evaluate the outcomes and iterate multiple times.  We can easily adapt this process for any text classification scenario.

The TensorFlow models are built in python, and we used Jupyter notebooks for this.

# Connect to HANA and retrieve data connection = pyhdb.connect( host="", port=30015, user="MY_TOP_SECRET_USERNAME", password="MY_TOP_SECRET_PASSWORD" ) cursor = connection.cursor() cursor.execute("SELECT ID,D.TA_COUNTER, V.TA_TOKEN,V.TOKEN_ID FROM TBL_TOKEN_ID V INNER JOIN TBL_TA_TOKENS D ON V.TA_TOKEN = D.TA_TOKEN ORDER BY ID,D.TA_COUNTER") # Show First 5 Reords for Testing #cursor.fetchmany(5)

To build the actual text classifier we used a Keras (TensorFlow) Convolution Neural Network (CNN)  with a number of Layers, similar to those taught by fast.ai.  Below we have specified 5000 as our vocabulary size and 40 is the maximum number of words coming from our 140 character text.

model = Sequential([ Embedding(5000, 32, input_length=40), SpatialDropout1D(0.2), Dropout(0.25), Convolution1D(64, 5, padding='same', activation='relu'), Dropout(0.25), MaxPooling1D(), Flatten(), Dense(100, activation='relu'), Dropout(0.7), Dense(1, activation='sigmoid',name='prediction')]) 

The model summary is shown below.

_________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_2 (Embedding) (None, 40, 32) 160000 _________________________________________________________________ spatial_dropout1d_2 (Spatial (None, 40, 32) 0 _________________________________________________________________ dropout_4 (Dropout) (None, 40, 32) 0 _________________________________________________________________ conv1d_2 (Conv1D) (None, 40, 64) 10304 _________________________________________________________________ dropout_5 (Dropout) (None, 40, 64) 0 _________________________________________________________________ max_pooling1d_2 (MaxPooling1 (None, 20, 64) 0 _________________________________________________________________ flatten_2 (Flatten) (None, 1280) 0 _________________________________________________________________ dense_2 (Dense) (None, 100) 128100 _________________________________________________________________ dropout_6 (Dropout) (None, 100) 0 _________________________________________________________________ dense_3 (Dense) (None, 1) 101 ================================================================= Total params: 298,505 Trainable params: 298,505 Non-trainable params: 0 __________________________ Train on 31498 samples, validate on 7874 samples Epoch 1/14 31498/31498 [==============================] - 2s 60us/step - loss: 0.3160 - acc: 0.8539 - val_loss: 0.2802 - val_acc: 0.9271 Epoch 2/14 31498/31498 [==============================] - 1s 39us/step - loss: 0.0790 - acc: 0.9726 - val_loss: 0.3442 - val_acc: 0.9053 Epoch 3/14 31498/31498 [==============================] - 1s 39us/step - loss: 0.0457 - acc: 0.9846 - val_loss: 0.3334 - val_acc: 0.9347

We then trained this model with an 80/20 training/validation split over a number of Epochs.  Above you can see the accuracy of the model improves as the output back propagates.  The val_acc (Validation Accuracy) is above 90%, we can therefore be confident based on the data we provided that the model is able to identify the different types of news.

After further validation using F1 scores, Confusion Matrix’s and analysis of individual records, we began to trust the model (but not the fake news!). We then could save the trained (Keras) model.


Model Deployment

To deploy a TensorFlow model with HANA you need to create a Saved Model.
This is easier said than done! Unfortunately, the Keras model.save (as above) is not what TensorFlow Serving requires.  We therefore need to rebuild the Keras model as a pure TensorFlow model.

We had also performed some pre-processing steps during the model creation.  These pre-processing steps need to be captured and included with the saved model.  Below is the pre-processing function we required for TensorFlow Serving.

def preprocess (txt_input): sparse_tokenized_input = tf.string_split(txt_input,delimiter=' ') tokenized_input = tf.sparse_tensor_to_dense(sparse_tokenized_input, default_value='0') token_idxs = tf.string_to_number(tokenized_input, out_type=tf.float32) inputlength = tf.size(token_idxs) # Max Number of Words in Sentance 40 padding = 40 - inputlength token_idxs_padded = tf.pad(token_idxs, [[0,0],[padding,0]]) token_idxs_embedding = tf.slice(token_idxs_padded, [0,0], [1,40]) return token_idxs_embedding;

The save code that we used is below, this captured our pre-processing function.

from tensorflow.python.saved_model import builder as saved_model_builder from tensorflow.python.saved_model import utils from tensorflow.python.saved_model import tag_constants, signature_constants from tensorflow.python.saved_model.signature_def_utils_impl import build_signature_def, predict_signature_def from tensorflow.contrib.session_bundle import exporter export_path = 'FakeNews-Serving/17' builder = saved_model_builder.SavedModelBuilder(export_path) signature = predict_signature_def(inputs={'text': txt_input}, outputs={'labels': model.output}) with K.get_session() as sess: builder.add_meta_graph_and_variables(sess=sess, tags=[tag_constants.SERVING], signature_def_map={'predict': signature}) builder.save()

Saving the model requires a clean TF session, so be prepared to iterate here, if you are frequently exporting.  Often when saving models we experienced errors, for example.

AssertionError: Export directory already exists. Please specify a different export directory: FakeNews-Serving/16

If an error has been raised during export, it often breaks the TF session, we found that it is best to ‘Restart the Kernal’ from within jupyter.

Once exported, to validate the ‘saved_model.pb’ we can use the TF saved_model_cli.  Here you see a warning as we had a small version mis-match, but these warnings are generally ok.

Our signature below shows a single input with a string datatype, the shape of this is unknown because this relates to a TF placeholder value that is only known once the model is executed.  We will pass in a single string from HANA of numbers separated by spaces, our preprocess function expects this and translates it to a padded tensor. The output returned is a single tensor with one value this contains the fake 0/1 prediction stored as a float.

ubuntu@ip-10-0-31-145:~$ saved_model_cli show --dir FakeNews-Serving/17 --signature_def 'serving_default' --all /usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: compiletime version 3.6 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.5 return f(*args, **kwds) MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs: signature_def['predict']: The given SavedModel SignatureDef contains the following input(s): inputs['text'] tensor_info: dtype: DT_STRING shape: unknown_rank name: txt_input:0 The given SavedModel SignatureDef contains the following output(s): outputs['labels'] tensor_info: dtype: DT_FLOAT shape: (-1, 1) name: prediction_1/Sigmoid:0 Method name is: tensorflow/serving/predict ubuntu@ip-10-0-31-145:~$ 

When the model is running via TensorFlow Serving the output would look something like

ubuntu@ip-10-0-31-145:tensorflow_model_server --port=9000 --model_name=fakenews --model_base_path=/home/ubuntu/FakeNews-Serving 2017-12-19 16:05:18.560516: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /home/ubuntu/FakeNews-Serving/17 2017-12-19 16:05:18.560541: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:236] Loading SavedModel from: /home/ubuntu/FakeNews-Serving/17 2017-12-19 16:05:18.598911: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:155] Restoring SavedModel bundle. 2017-12-19 16:05:18.623654: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running LegacyInitOp on SavedModel bundle. 2017-12-19 16:05:18.625504: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:284] Loading SavedModel: success. Took 64901 microseconds. 2017-12-19 16:05:18.626184: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: fakenews version: 17} E1219 16:05:18.628689844 15683 ev_epoll1_linux.c:1051] grpc epoll fd: 3 2017-12-19 16:05:18.630423: I tensorflow_serving/model_servers/main.cc:288] Running ModelServer at ...

Model Consumption

Using the HANA EML we can now consume the model, for completeness I have pasted all the code we used here, but 95% of this came from the HANA Academy EML by Philip Mugglestone

-- register model INSERT INTO "_SYS_AFL"."EML_MODEL_CONFIGURATION" VALUES ('fakenews', 'RemoteSource', 'TensorFlowModelServer'); SELECT * FROM "_SYS_AFL"."EML_MODEL_CONFIGURATION"; -- create parameters table (used in subsequent calls) CREATE TABLE "PARAMETERS" ("Parameter" VARCHAR(100), "Value" VARCHAR(100)); -- apply registered models CALL "_SYS_AFL"."EML_CTL_PROC" ('UpdateModelConfiguration', "PARAMETERS", ?); -- verify model is up and running on remote source TRUNCATE TABLE "PARAMETERS"; INSERT INTO "PARAMETERS" VALUES ('Model', 'fakenews'); CALL "_SYS_AFL"."EML_CHECKDESTINATION_PROC" ("PARAMETERS", ?); CALL "SYS"."AFLLANG_WRAPPER_PROCEDURE_DROP" ('FAKENEWS', 'CLASSIFY_NEWS'); -- create table types CREATE TYPE "T_PARAMS" AS TABLE ("Parameter" VARCHAR(100), "Value" VARCHAR(100)); CREATE TYPE "T_DATA" AS TABLE ("text" VARCHAR(256)); CREATE TYPE "T_RESULTS" AS TABLE ("Score1" FLOAT); -- create signature table then generate stored procedure CREATE COLUMN TABLE "SIGNATURE" ("POSITION" INTEGER, "SCHEMA_NAME" NVARCHAR(256), "TYPE_NAME" NVARCHAR(256), "PARAMETER_TYPE" VARCHAR(7)); INSERT INTO "SIGNATURE" VALUES (1, 'FAKENEWS', 'T_PARAMS', 'IN'); INSERT INTO "SIGNATURE" VALUES (2, 'FAKENEWS', 'T_DATA', 'IN'); INSERT INTO "SIGNATURE" VALUES (3, 'FAKENEWS', 'T_RESULTS', 'OUT'); CALL "SYS"."AFLLANG_WRAPPER_PROCEDURE_CREATE" ('EML', 'PREDICT', 'FAKENEWS', 'CLASSIFY_NEWS', "SIGNATURE"); -- create tables CREATE TABLE "PARAMS" LIKE "T_PARAMS"; CREATE TABLE "RESULTS" LIKE "T_RESULTS"; -- run time -- data to be scored CREATE VIEW "NEWS_SENTANCE" AS (SELECT '445 4999 400 41 3016 334 1806 1160 4999' as "text" from dummy); CREATE VIEW "NEWS_SENTANCE_FAKE" AS (SELECT '161 45 145 30 4999 107 1112 947 25' as "text" from dummy); -- params TRUNCATE TABLE "PARAMS"; INSERT INTO "PARAMS" VALUES ('Model', 'fakenews'); --INSERT IGNORE INTO "PARAMS" VALUES ('Model', 'saved_model%predict'); -- mandatory: model name (optional: signature name) --INSERT IGNORE INTO "PARAMS" VALUES ('Deadline', '1000'); -- optional: max milliseconds to wait -- scoring : results inline CALL "CLASSIFY_NEWS" ("PARAMS", "NEWS_SENTANCE", ?);


To perform an end to end validation we should test both python and HANA are behaving correctly.
The validation dataset is all IDs above 31,498 (80% of 39,373), here I will check 31,505 and 31,506.

In the Base Table in HANA we can see the 2 original sentences

Using SQL initially developed by Nidhi, for preparing the input data for training, we can validate our steps.  We can see the key tokens in our sentence.  Any TOKEN_IDs above 5000 are called rare so all become 4999.

In python we can view our data assigned to ‘trn_data’, which was loaded as a Numpy array.
Note the Python IDs start at zero so they will be 1 less than the HANA IDs

print(trn_data[31504]) [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 445 4999 400 41 3016 334 1806 1160 4999]

In python we can check the predicted value, we can see it predicted 0 for the 31505 (real news) and almost 1 for 31506 to be classified as fake news.  This exactly matches the source data flags but we didn’t provide this information to our model!

model.predict(trn_data[31504:31506]) array([[ 0.03409993], [ 0.99994373]], dtype=float32)

We can use these 2 sentences encoded as numbers from the validation data. We can query our TF model from SAP HANA to confirm the results reflect the validation predictions.  Below we pass in 445 4999 400 41 3016 334 1806 1160 4999′ and the predicted value is almost 0, therefore it has been identified as real news.  We would typically round this value to 0.

Changing the input data (View) to the next sentence (161 45 145 30 4999 107 1112 947 25′) in our dataset shows it really can tell the difference! Again we would round 0.9999 to 1.


Lessons Learned & Issues Encountered

Yes, we did learn a lot, found some issues, learned some Python, Keras, TensorFlow, TensorFlow Serving, AWS and of course the HANA EML integration.

With the EML you are passing TENSORS, so your input table becomes a TF tensor and the Output table corresponds the the output tensor of your model.
The input table/column name DOES matter, the output does not seem to.
The output column datatypes and number of columns do matter

Some errors and resolutions below

-- scoring : results inline CALL "CLASSIFY_NEWS" ("PARAMS", "NEWS_SENTANCE", ?); 

java.sql.SQLWarning: wrong number or types of parameters in call: Column mismatch happened at ‘FAKENEWS.CLASSIFY_NEWS’: Declared type ‘VARCHAR(256)’ of attribute ‘text’ not same as assigned type ‘VARCHAR(11)’

Initially I thought this was causing us an issue, but actually it is just a warning, so can be safely ignored, as long as your data fits within the lower limits.


Could not execute 'CALL "CLASSIFY_NEWS" ("PARAMS", "NEWS_SENTANCE", ?)' in 137 ms 909 µs . SAP DBTech JDBC: [423]: AFL error: search table error: _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC "Prediction/Metamodel" failed (5) 

Could not execute CALL ‘CLASSIFY_NEWS’ (‘PARAMS’, ‘NEWS_SENTANCE’, ?)’ in 137 ms 909 µs .
SAP DBTech JDBC: [423]: AFL error: search table error: _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC ‘Prediction/Metamodel’ failed (5)

This error is more problematic, and after checking the indexserver and scriptserver diagnosis files I was stuck for a while.  This error means that the EML cannot find the matching signature on the TensorFlow Server.  We had to change the PARAMS table that defines which model and signature we are calling.  It could indicate a problem with EML metadata, removing the %predict also seemed to fixed this.



Could not execute CALL ‘CLASSIFY_NEWS’ (‘PARAMS’, ‘NEWS_SENTANCE’, ?)’ in 131 ms 198 µs .
SAP DBTech JDBC: [423]: AFL error: search table error: _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC ‘Prediction/Predict’ failed (14)

Upon checking the destination, we can see that the TensorFlow Serving appears not to running as expected.

Restarting the Server with a command similar to below will fix this for us.

nohup tensorflow_model_server --port=9000 --model_name=fakenews --model_base_path=/home/ubuntu/FakeNews 


Could not execute CALL ‘MULTI_CLASSIFY_NEWS’ (‘PARAMS’, ‘NEWS_SENTANCE’, ?)’ in 134 ms 560 µs .
SAP DBTech JDBC: [423]: AFL error: search table error: _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC ‘Prediction/Predict’ failed (3)

Looking at the Diagnosis Files, helps us here Shows that the input value is not in the expected range, we defined the vocab size as 5000, and have supplied 9999.  This should be handled in our pre-processing step.

[1256]{200896}[7/-1] 2017-12-19 14:52:31.262719 e LJIT cePopCustomLjit.cpp(00639) : _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC "Prediction/Predict" failed (3) [8590]{200896}[3/-1] 2017-12-19 14:54:31.781945 e AFL_EML impl.cpp(00272) : PredictClient::predict failed (3): indices[0,33] = 9999 is not in [0, 5000)%0A%09 [[Node: embedding_1_1/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _output_shapes=[[1,40,32]], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1_1/embeddings/read, embedding_1_1/Cast)]] [8590]{200896}[3/-1] 2017-12-19 14:54:31.782200 e LJIT cePopCustomLjit.cpp(00620) : Llang Runtime Error: Exception::SQLException423: RPC "Prediction/Predict" failed (3)