diff --git a/‎_content/articles/220118_bitcoin_tweet/df_sentiment.png‎
-1012 Bytes b/‎_content/articles/220118_bitcoin_tweet/df_sentiment.png‎
diff --git a/‎_content/articles/220118_bitcoin_tweet/df_sentiment_preprocessed.png‎
47.6 KB b/‎_content/articles/220118_bitcoin_tweet/df_sentiment_preprocessed.png‎
diff --git a/‎_content/articles/220118_bitcoin_tweet/generated.html‎
Lines changed: 120 additions & 39 deletions b/‎_content/articles/220118_bitcoin_tweet/generated.html‎
@@ -1,22 +1,14 @@
-<!-- wp:paragraph -->
 <p>In this post, I am going to download and analyze tweets about Bitcoin from the last two weeks and perform sentiment analysis to gather market intelligence. What are people's opinions about Bitcoin?</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:heading {"textColor":"secondary"} -->
 <h2 class="has-secondary-color has-text-color" id="what-is-sentiment-analysis">What is Sentiment Analysis?</h2>
 <!-- /wp:heading -->
 
-<!-- wp:paragraph -->
 <p>To do this, I will need to use Natural Language Processing as a way to gain insights into my data. One of the most common forms of analysis we can exploit using NLP is called sentiment analysis, and it consists of converting a text into a score that estimates its sentiment. There are several models we can use to perform sentiment analysis, but they all fulfill the same purpose.</p>
-<!-- /wp:paragraph -->
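<p>To make "converting a text into a score" concrete, here is a minimal sketch of a lexicon-based scorer. The word list and weights are illustrative assumptions, not the vocabulary of any real model; the library used later in this post is far more complete, but the idea of mapping text to a number between -1 and 1 is the same.</p>

```python
# Toy lexicon: these words and weights are illustrative assumptions,
# not the vocabulary of any real sentiment model.
TOY_LEXICON = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def toy_polarity(text: str) -> float:
    """Average the weights of the known words; 0.0 when no known word appears."""
    words = text.lower().split()
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("bitcoin is great"))          # positive score
print(toy_polarity("terrible day for bitcoin"))  # negative score
print(toy_polarity("bitcoin price update"))      # no known words, so neutral
```

<p>A score above zero reads as positive, below zero as negative, and exactly zero as neutral or unknown.</p>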
 
-<!-- wp:paragraph -->
 <p>The most common use case of sentiment analysis is to estimate market demand for a certain product, ideally spotting a trend just as it begins. In finance, this is one of the most sought-after ML applications.</p>
-<!-- /wp:paragraph -->
 
-<!-- wp:paragraph -->
 <p>The project will be following these steps:</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:list {"ordered":true} -->
 <ol><li>Download data from Twitter</li><li>Preprocess the data</li><li>Perform sentiment analysis</li><li>Analyze results</li></ol>
@@ -26,25 +18,17 @@ <h2 class="has-secondary-color has-text-color" id="what-is-sentiment-analysis">W
 <h2 class="has-secondary-color has-text-color" id="1-download-data-from-twitter">1. Download data from Twitter</h2>
 <!-- /wp:heading -->
 
-<!-- wp:paragraph -->
 <p>To download data from Twitter without using its metered API, and therefore without any limit on the volume of data I wish to scrape, I can choose among several libraries. One of the most common is <strong>twint</strong>; however, after the latest Twitter updates, it has not been working very well.</p>
-<!-- /wp:paragraph -->
 
-<!-- wp:paragraph -->
 <p>As a valid and also simpler alternative, I will be using <strong>snscrape</strong>. </p>
-<!-- /wp:paragraph -->
 
 <!-- wp:code {"backgroundColor":"primary"} -->
 <pre class="wp-block-code has-primary-background-color has-background"><code>!pip install snscrape</code></pre>
 <!-- /wp:code -->
 
-<!-- wp:paragraph -->
 <p>After installing the library with pip, I need to declare the search parameters. Because I may want to run more than one query (for example, I could search for the sentiment on the top 10 billionaires), I want a control panel that gives instructions to the program.</p>
-<!-- /wp:paragraph -->
 
-<!-- wp:paragraph -->
 <p>As such, I will use movie_dict as a variable to store all the instructions to perform multiple searches. For each search, a csv will be created with all the data I have been able to scrape from Twitter:</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:code {"backgroundColor":"primary"} -->
 <pre class="wp-block-code has-primary-background-color has-background"><code>import snscrape.modules.twitter as sntwitter
@@ -54,13 +38,11 @@ <h2 class="has-secondary-color has-text-color" id="1-download-data-from-twitter"
 from datetime import datetime
 import os
 
-{'bitcoin': ['bitcoin since:2022-01-01 until:2022-01-17', 1000]}
+movie_dict = {'bitcoin': ['bitcoin since:2022-01-01 until:2022-01-17', 1000]}
 </code></pre>
 <!-- /wp:code -->
 
-<!-- wp:paragraph -->
 <p>The following is the code that executes the scrape:</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:code {"backgroundColor":"primary"} -->
 <pre class="wp-block-code has-primary-background-color has-background"><code>today = datetime.today().strftime('%Y%m%d')&#91;2:]+'_'
@@ -81,17 +63,13 @@ <h2 class="has-secondary-color has-text-color" id="1-download-data-from-twitter"
     bar.finish()</code></pre>
 <!-- /wp:code -->
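<p>The way the control panel drives the scrape can be sketched without touching the network. The helper below, <code>plan_searches</code>, is a hypothetical name, and the csv naming scheme (date prefix plus dictionary key) is an assumption based on the <code>today</code> prefix in the code above; it only shows how each entry unpacks into a query, a tweet limit, and an output file.</p>

```python
from datetime import datetime

# Same shape as the control panel above: name -> [query string, max tweets]
movie_dict = {'bitcoin': ['bitcoin since:2022-01-01 until:2022-01-17', 1000]}

def plan_searches(control_panel, today=None):
    """Return one (csv_name, query, limit) tuple per entry in the control panel."""
    if today is None:
        today = datetime.today()
    # Two-digit year, month, day and an underscore, e.g. '220118_'
    prefix = today.strftime('%Y%m%d')[2:] + '_'
    return [(prefix + name + '.csv', query, limit)
            for name, (query, limit) in control_panel.items()]

# A fixed date keeps the example reproducible
for csv_name, query, limit in plan_searches(movie_dict, datetime(2022, 1, 18)):
    print(csv_name, '|', query, '|', limit)
```

<p>Adding a second key to the dictionary is enough to schedule a second search and a second csv.</p>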
 
-<!-- wp:paragraph -->
 <p>This code is an improved version of the <a href="https://medium.com/dataseries/how-to-scrape-millions-of-tweets-using-snscrape-195ee3594721">standard code used to run a query</a> to filter the tweets you wish to download from Twitter. You can use it to download not just one query, but a whole list of queries.</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:heading {"textColor":"secondary"} -->
 <h2 class="has-secondary-color has-text-color" id="2-preprocess-the-data">2. Preprocess the data</h2>
 <!-- /wp:heading -->
 
-<!-- wp:paragraph -->
 <p>Now that a csv file has been created for every query in my control panel, let us look at the raw data of a single query:</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:code {"backgroundColor":"primary"} -->
 <pre class="wp-block-code has-primary-background-color has-background"><code>import pandas as pd
@@ -101,9 +79,7 @@ <h2 class="has-secondary-color has-text-color" id="2-preprocess-the-data">2. Pre
 df</code></pre>
 <!-- /wp:code -->
 
-<!-- wp:paragraph -->
 <p>Because some of the rows may be null when importing the dataset, I am dropping them and resetting the index. I am also going to apply a small preprocessing snippet. Preprocessing is a step that you can customize depending on your needs. In this case, because I only want to get rid of links and non-ASCII characters, I am going to use the following two functions:</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:code {"backgroundColor":"primary"} -->
 <pre class="wp-block-code has-primary-background-color has-background"><code>#get rid of links and hashtags
@@ -114,25 +90,19 @@ <h2 class="has-secondary-color has-text-color" id="2-preprocess-the-data">2. Pre
 df</code></pre>
 <!-- /wp:code -->
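<p>For readers who want to try the cleaning step in isolation, here is a minimal standard-library sketch of the same idea: skip null rows, strip links, and drop non-ASCII characters. The function names and sample rows are assumptions for illustration, not the exact functions used above.</p>

```python
import re

def remove_links(text: str) -> str:
    """Drop anything that looks like an http(s) URL."""
    return re.sub(r'http\S+', '', text)

def remove_non_ascii(text: str) -> str:
    """Keep only ASCII characters (drops emoji, accented letters, etc.)."""
    return text.encode('ascii', errors='ignore').decode('ascii')

def clean_tweet(text: str) -> str:
    cleaned = remove_non_ascii(remove_links(text))
    return ' '.join(cleaned.split())  # collapse leftover whitespace

# Hypothetical raw rows: one with an emoji and a link, one null, one with an ellipsis
rows = ['Bitcoin to the moon \U0001F680 https://example.com/x', None, 'BTC is down\u2026']
cleaned_rows = [clean_tweet(t) for t in rows if t is not None]
print(cleaned_rows)
```

<p>Dropping non-ASCII characters is a blunt tool: it also removes emoji, which can carry sentiment of their own, so it is worth revisiting if the scores look off.</p>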
 
-<!-- wp:paragraph -->
 <p>This is a screenshot of the dataframe after preprocessing:</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:image {"align":"center","width":480,"height":616,"sizeSlug":"large","className":"is-style-default"} -->
-<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src="https://raw.githubusercontent.com/arditoibryan/pythonkai/main/_content/articles/220118_bitcoin_tweet/df.png" alt="" width="390" height="616"/><figcaption>df raw</figcaption></figure></div>
+<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src=@@@df alt="" width="390" height="616"/><figcaption>@@@df_caption</figcaption></figure></div>
 <!-- /wp:image -->
 
 <!-- wp:heading {"textColor":"secondary"} -->
 <h2 class="has-secondary-color has-text-color" id="3-perform-sentiment-analysis">3. Perform sentiment analysis</h2>
 <!-- /wp:heading -->
 
-<!-- wp:paragraph -->
 <p>I am now going to apply sentiment analysis to our cleaned data. There are many libraries you can use to perform this task, such as <strong>transformers</strong>, <strong>textblob</strong>, and <strong>spacy</strong>. For this tutorial I am going to use the latest version of spacy and its extension, <a href="https://spacy.io/universe/project/spacy-textblob" target="_blank" rel="noreferrer noopener">spacytextblob</a>.</p>
-<!-- /wp:paragraph -->
 
-<!-- wp:paragraph -->
 <p>To install it, I will need to run the following commands and restart the notebook:</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:code {"backgroundColor":"primary"} -->
 <pre class="wp-block-code has-primary-background-color has-background"><code>!pip install spacytextblob==3.0.1
@@ -141,9 +111,7 @@ <h2 class="has-secondary-color has-text-color" id="3-perform-sentiment-analysis"
 !python -m spacy download en_core_web_sm</code></pre>
 <!-- /wp:code -->
 
-<!-- wp:paragraph -->
 <p>Once the installation is complete, we can run the sentiment analysis and append the score to our dataframe:</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:code {"backgroundColor":"primary"} -->
 <pre class="wp-block-code has-primary-background-color has-background"><code>import spacy
@@ -157,14 +125,127 @@ <h2 class="has-secondary-color has-text-color" id="3-perform-sentiment-analysis"
 df_sentiment</code></pre>
 <!-- /wp:code -->
 
-<!-- wp:paragraph -->
 <p>As we can see, this is the final result:</p>
-<!-- /wp:paragraph -->
 
 <!-- wp:image {"align":"center","width":480,"height":616,"sizeSlug":"large","className":"is-style-default"} -->
-<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src="https://raw.githubusercontent.com/arditoibryan/pythonkai/main/_content/articles/220118_bitcoin_tweet/df_sentiment.png" alt="" width="480" height="616"/><figcaption>sentiment</figcaption></figure></div>
+<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src=@@@df_sentiment alt="" width="480" height="616"/><figcaption>_</figcaption></figure></div>
 <!-- /wp:image -->
 
-<!-- wp:paragraph -->
 <p>I decided to sort the values starting from the most negative, so that we can see some of the most shocking comments about Bitcoin.</p>
-<!-- /wp:paragraph -->
+
+<!-- wp:heading {"textColor":"secondary"} -->
+<h2 class="has-secondary-color has-text-color" id="4-analyze-results">4. Analyze results</h2>
+<!-- /wp:heading -->
+
+<p>Before analyzing the content of the tweets, we are going to preprocess our data further. There are several preprocessing strategies; in this post, we are going to:</p>
+
+<!-- wp:list -->
+<ul><li>Lemmatize each word</li><li>Delete extra characters</li><li>Remove stop words</li></ul>
+<!-- /wp:list -->
+
+<p>I am using my own function to perform this cleaning. Similar preprocessing functions are widely available, so if you prefer other code, perhaps something simpler or a function that performs only a single preprocessing step, you can easily find it online:</p>
+
+<!-- wp:code {"backgroundColor":"primary"} -->
+<pre class="wp-block-code has-primary-background-color has-background"><code>import re
+import nltk
+nltk.download('wordnet')
+nltk.download('stopwords')
+from nltk.tokenize import RegexpTokenizer
+from nltk.stem import WordNetLemmatizer,PorterStemmer
+from nltk.corpus import stopwords
+lemmatizer = WordNetLemmatizer()
+stemmer = PorterStemmer() 
+
+#adding a counter to check the progress of the algo while it runs
+global counter
+counter = 0
+def preprocess(sentence, stemming=False, lemmatizing=False):
+  global counter
+  counter += 1
+  if counter % 100 == 0:
+    pass
+    #print(counter)
+
+  #clean as much as possible, but not apply strong editing to the text, yet
+  sentence=str(sentence)
+  tokenizer = RegexpTokenizer(r'\w+')
+
+  sentence = sentence.lower()
+  sentence=sentence.replace('{html}',"") 
+  cleanr = re.compile('&lt;.*?&gt;')
+  cleantext = re.sub(cleanr, '', sentence)
+  rem_url=re.sub(r'http\S+', '',cleantext)
+  rem_num = re.sub('&#91;0-9]+', '', rem_url)
+  tokens = tokenizer.tokenize(rem_num)
+  
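+  #build the stop word set once per call instead of reloading the list for every token
+  stop_words = set(stopwords.words('english'))
+  filtered_words = &#91;w for w in tokens if len(w) &gt; 2 and w not in stop_words]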
+  
+  if stemming == True and lemmatizing == False:
+    stem_words=&#91;stemmer.stem(w) for w in filtered_words]
+    return " ".join(stem_words)
+
+  if stemming == False and lemmatizing == True:
+    lemma_words=&#91;lemmatizer.lemmatize(w) for w in filtered_words]
+    return " ".join(lemma_words)
+
+  if stemming == True and lemmatizing == True:
+    stem_words=&#91;stemmer.stem(w) for w in filtered_words]
+    lemma_words=&#91;lemmatizer.lemmatize(w) for w in stem_words]
+    return " ".join(lemma_words)
+  
+  #at the end of the algo we return filtered words
+  return " ".join(filtered_words)
+
+#preprocess the sentiment text
+df_sentiment&#91;'text'] = df_sentiment&#91;'text'].apply(lambda x: preprocess(x, stemming=False, lemmatizing=True))
+df_sentiment</code></pre>
+<!-- /wp:code -->
+
+<!-- wp:image {"align":"center","width":480,"height":616,"sizeSlug":"large","className":"is-style-default"} -->
+<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src=@@@df_sentiment_preprocessed alt="" width="480" height="616"/><figcaption>_</figcaption></figure></div>
+<!-- /wp:image -->
+
+<p>There are several ways we can analyze the results of the sentiment analysis. One common practice is to separate the samples with negative sentiment from the ones with positive sentiment and extract the most common words from each group.</p>
+
+<!-- wp:code {"backgroundColor":"primary"} -->
+<pre class="wp-block-code has-primary-background-color has-background"><code>df_neg = df_sentiment&#91;df_sentiment&#91;'sentiment'] &lt; 0]
+df_pos = df_sentiment&#91;df_sentiment&#91;'sentiment'] &gt; 0]
+</code></pre>
+<!-- /wp:code -->
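<p>One detail of this split is worth spelling out: tweets scored exactly 0 land in neither group. The sketch below uses hypothetical (text, sentiment) pairs in place of the real dataframe to show the three resulting buckets.</p>

```python
# Hypothetical (text, sentiment) pairs standing in for the rows of df_sentiment
scored = [('love it', 0.5), ('scam', -0.4), ('price update', 0.0), ('awful', -1.0)]

negative = [row for row in scored if row[1] < 0]
positive = [row for row in scored if row[1] > 0]
neutral = [row for row in scored if row[1] == 0]  # excluded from both groups above

print(len(negative), len(positive), len(neutral))
```

<p>If many tweets are neutral, the negative and positive counts will add up to noticeably less than the size of the dataset, which is expected rather than a bug.</p>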
+
+<p>First of all, let us see how many positive and negative tweets we have inferred from our data, to get a general idea of public opinion regarding @project_name:</p>
+
+<!-- wp:code {"backgroundColor":"primary","textColor":"background"} -->
+<pre class="wp-block-code has-background-color has-primary-background-color has-text-color has-background"><code>print(len(df_neg))
+print(len(df_pos))
+\
+@@@len_df_neg
+@@@len_df_pos
+</code></pre>
+<!-- /wp:code -->
+
+<p>Let us extract the most common words found in both the positive and the negative tweets:</p>
+
+<!-- wp:code {"backgroundColor":"primary"} -->
+<pre class="wp-block-code has-primary-background-color has-background"><code>from collections import Counter
+
+#keep the 100 most frequent words of each group
+positive_words = pd.DataFrame(&#91;dict(Counter(' '.join(df_pos&#91;'text'].values.tolist()).split(' ')))]).T.sort_values(0, ascending=False)&#91;0:100].index
+
+negative_words = pd.DataFrame(&#91;dict(Counter(' '.join(df_neg&#91;'text'].values.tolist()).split(' ')))]).T.sort_values(0, ascending=False)&#91;0:100].index</code></pre>
+<!-- /wp:code -->
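<p>The same word ranking can be written more directly with <code>Counter.most_common</code>. This is an alternative sketch rather than the code used for the results in this post; <code>top_words</code> and the sample texts are hypothetical.</p>

```python
from collections import Counter

def top_words(texts, n=100):
    """Return the n most frequent words across all texts, most frequent first."""
    # str.split() with no argument skips empty strings left by blank tweets
    counts = Counter(word for text in texts for word in text.split())
    return [word for word, _ in counts.most_common(n)]

# Hypothetical preprocessed tweets, including one that became empty after cleaning
sample = ['bitcoin crash fear', 'bitcoin fear', 'bitcoin', '']
print(top_words(sample, n=2))
```

<p>Splitting on whitespace in general, instead of on a single space, avoids counting the empty string as a word when a tweet is blank after preprocessing.</p>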
+
+<p>These are the most common words found in the positive tweets:</p>
+
+<!-- wp:code {"backgroundColor":"primary"} -->
+<pre class="wp-block-code has-primary-background-color has-background"><code>@@@positive_words</code></pre>
+<!-- /wp:code -->
+
+<p>These, instead, are the most common words found in the negative tweets:</p>
+
+<!-- wp:code {"backgroundColor":"primary"} -->
+<pre class="wp-block-code has-primary-background-color has-background"><code>@@@negative_words</code></pre>
+<!-- /wp:code -->
+
+<!-- wp:heading {"textColor":"secondary"} -->
+<h2 class="has-secondary-color has-text-color" id="5-conclusion">5. Conclusion</h2>
+<!-- /wp:heading -->
+
+<p>Given the insights we have inferred using NLP, we can see that there is a prevalent @sentiment_score regarding @@@project name.</p>