Commit 7cf846d

committed
new article
1 parent 2b7d1b7 commit 7cf846d

3 files changed

Lines changed: 120 additions & 39 deletions

@@ -1,22 +1,14 @@
1-
<!-- wp:paragraph -->
21
<p>In this post, I am going to download and analyze tweets about Bitcoin from the last two weeks and perform sentiment analysis to gather market intelligence. What are people's opinions about Bitcoin?</p>
3-
<!-- /wp:paragraph -->
42

53
<!-- wp:heading {"textColor":"secondary"} -->
64
<h2 class="has-secondary-color has-text-color" id="what-is-sentiment-analysis">What is Sentiment Analysis?</h2>
75
<!-- /wp:heading -->
86

9-
<!-- wp:paragraph -->
107
<p>To do this, I will need to use Natural Language Processing as a way to gain insights into my data. One of the most common forms of analysis we can exploit using NLP is called sentiment analysis, and it consists of converting a text into a score that estimates its sentiment. There are several models we can use to perform sentiment analysis, but they all fulfill the same purpose.</p>
11-
<!-- /wp:paragraph -->
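To make the idea of "converting a text into a score" concrete, here is a minimal toy sketch. It is not the model used later in this post: the word lists `POSITIVE` and `NEGATIVE` and the function `toy_sentiment` are hypothetical names made up for illustration, and a real library relies on a far larger lexicon or a trained model.

```python
# Toy lexicon-based scorer: NOT the model used in this post, only an illustration.
# The word lists below are hypothetical and far too small for real use.
POSITIVE = {"good", "great", "bullish", "love", "gain"}
NEGATIVE = {"bad", "scam", "bearish", "hate", "crash"}


def toy_sentiment(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    # A text with no known words is treated as neutral
    return 0.0 if total == 0 else (pos - neg) / total


print(toy_sentiment("bitcoin is great and i love it"))
print(toy_sentiment("bitcoin is a scam"))
```

Whatever the model, the output has the same shape: one number per text, negative for negative sentiment and positive for positive sentiment.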
128

13-
<!-- wp:paragraph -->
149
<p>The most common use case of sentiment analysis is to estimate market demand for a certain product, ideally spotting a trend just as it begins. In finance, this is one of the most sought-after ML applications.</p>
15-
<!-- /wp:paragraph -->
1610

17-
<!-- wp:paragraph -->
1811
<p>The project will be following these steps:</p>
19-
<!-- /wp:paragraph -->
2012

2113
<!-- wp:list {"ordered":true} -->
2214
<ol><li>Download data from Twitter</li><li>Preprocess the data</li><li>Perform sentiment analysis</li><li>Analyze results</li></ol>
@@ -26,25 +18,17 @@ <h2 class="has-secondary-color has-text-color" id="what-is-sentiment-analysis">W
2618
<h2 class="has-secondary-color has-text-color" id="1-download-data-from-twitter">1. Download data from Twitter</h2>
2719
<!-- /wp:heading -->
2820

29-
<!-- wp:paragraph -->
3021
<p>To download data from Twitter without using its metered API, and therefore without any limit on the volume of data I wish to scrape, I can use different libraries. One of the most common is <strong>twint</strong>; however, after the latest Twitter updates, it has not been working very well. </p>
31-
<!-- /wp:paragraph -->
3222

33-
<!-- wp:paragraph -->
3423
<p>As a valid and also simpler alternative, I will be using <strong>snscrape</strong>. </p>
35-
<!-- /wp:paragraph -->
3624

3725
<!-- wp:code {"backgroundColor":"primary"} -->
3826
<pre class="wp-block-code has-primary-background-color has-background"><code>!pip install snscrape</code></pre>
3927
<!-- /wp:code -->
4028

41-
<!-- wp:paragraph -->
4229
<p>After installing the library with pip, I need to declare the search parameters. Because I may want to run more than one query (for example, I could search for the sentiment on the top 10 billionaires), I want a control panel that gives instructions to the program. </p>
43-
<!-- /wp:paragraph -->
4430

45-
<!-- wp:paragraph -->
4631
<p>As such, I will use movie_dict as a variable to store all the instructions to perform multiple searches. For each search, a csv will be created with all the data I have been able to scrape from Twitter:</p>
47-
<!-- /wp:paragraph -->
4832

4933
<!-- wp:code {"backgroundColor":"primary"} -->
5034
<pre class="wp-block-code has-primary-background-color has-background"><code>import snscrape.modules.twitter as sntwitter
@@ -54,13 +38,11 @@ <h2 class="has-secondary-color has-text-color" id="1-download-data-from-twitter"
5438
from datetime import datetime
5539
import os
5640

57-
{'bitcoin': ['bitcoin since:2022-01-01 until:2022-01-17', 1000]}
41+
movie_dict = {'bitcoin': ['bitcoin since:2022-01-01 until:2022-01-17', 1000]}
5842
</code></pre>
5943
<!-- /wp:code -->
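As a rough sketch of how such a control panel can be read, the snippet below unpacks each entry into a query, a tweet limit, and an output file name. The helper `plan_searches` and the `prefix + name + '.csv'` naming scheme are assumptions made for illustration, not necessarily what the scraping code below does; no call to Twitter is made here.

```python
from datetime import datetime

# Same structure as the control panel in the post: name -> [query, max_tweets]
movie_dict = {'bitcoin': ['bitcoin since:2022-01-01 until:2022-01-17', 1000]}


def plan_searches(control_panel, prefix):
    """Return one (csv_name, query, limit) tuple per control-panel entry.

    The file naming scheme is a hypothetical choice for this sketch.
    """
    plans = []
    for name, (query, limit) in control_panel.items():
        plans.append((f"{prefix}{name}.csv", query, limit))
    return plans


# Date prefix in YYMMDD_ form, mirroring the `today` variable used in the post
today = datetime.today().strftime('%Y%m%d')[2:] + '_'
for csv_name, query, limit in plan_searches(movie_dict, today):
    print(csv_name, '|', query, '|', limit)
```

Adding a second search is then just a matter of adding another key to the dictionary.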
6044

61-
<!-- wp:paragraph -->
6245
<p>The following is the code that executes the scrape:</p>
63-
<!-- /wp:paragraph -->
6446

6547
<!-- wp:code {"backgroundColor":"primary"} -->
6648
<pre class="wp-block-code has-primary-background-color has-background"><code>today = datetime.today().strftime('%Y%m%d')&#91;2:]+'_'
@@ -81,17 +63,13 @@ <h2 class="has-secondary-color has-text-color" id="1-download-data-from-twitter"
8163
bar.finish()</code></pre>
8264
<!-- /wp:code -->
8365

84-
<!-- wp:paragraph -->
8566
<p>This code is an improved version of the <a href="https://medium.com/dataseries/how-to-scrape-millions-of-tweets-using-snscrape-195ee3594721">standard code used to run a query</a> to filter the tweets you wish to download from Twitter. You can use it to run not just one query, but a list of queries.</p>
86-
<!-- /wp:paragraph -->
8767

8868
<!-- wp:heading {"textColor":"secondary"} -->
8969
<h2 class="has-secondary-color has-text-color" id="2-preprocess-the-data">2. Preprocess the data</h2>
9070
<!-- /wp:heading -->
9171

92-
<!-- wp:paragraph -->
9372
<p>Now that a csv file has been created for every query in my control panel, let us look at the raw data of a single query:</p>
94-
<!-- /wp:paragraph -->
9573

9674
<!-- wp:code {"backgroundColor":"primary"} -->
9775
<pre class="wp-block-code has-primary-background-color has-background"><code>import pandas as pd
@@ -101,9 +79,7 @@ <h2 class="has-secondary-color has-text-color" id="2-preprocess-the-data">2. Pre
10179
df</code></pre>
10280
<!-- /wp:code -->
10381

104-
<!-- wp:paragraph -->
10582
<p>Because some of the rows may be null when the dataset is imported, I drop them and reset the index. I also apply a small preprocessing snippet. Preprocessing is a step that you can customize depending on your needs. In this case, because I only want to get rid of links and non-ASCII characters, I am going to use the following two functions:</p>
106-
<!-- /wp:paragraph -->
10783

10884
<!-- wp:code {"backgroundColor":"primary"} -->
10985
<pre class="wp-block-code has-primary-background-color has-background"><code>#get rid of links and hashtags
@@ -114,25 +90,19 @@ <h2 class="has-secondary-color has-text-color" id="2-preprocess-the-data">2. Pre
11490
df</code></pre>
11591
<!-- /wp:code -->
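For readers who want a standalone version of this cleaning step, here is a minimal sketch that uses only the standard library. The function names `remove_links`, `remove_non_ascii`, and `clean_tweet` are hypothetical and may differ from the functions used above.

```python
import re


def remove_links(text):
    """Drop anything that looks like an http(s) URL."""
    return re.sub(r'http\S+', '', text)


def remove_non_ascii(text):
    """Keep only characters in the ASCII range (drops emoji, accents, etc.)."""
    return text.encode('ascii', errors='ignore').decode('ascii')


def clean_tweet(text):
    # Collapse the whitespace left behind by the two removals
    return ' '.join(remove_non_ascii(remove_links(text)).split())


# \U0001F680 is the rocket emoji, written as an escape to keep the source ASCII
print(clean_tweet('Bitcoin to the moon \U0001F680 https://example.com/btc'))
```

Note that dropping non-ASCII characters also removes accented letters, so this choice suits English-only text better than multilingual data.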
11692

117-
<!-- wp:paragraph -->
11893
<p>This is a screenshot of the dataframe after preprocessing:</p>
119-
<!-- /wp:paragraph -->
12094

12195
<!-- wp:image {"align":"center","width":480,"height":616,"sizeSlug":"large","className":"is-style-default"} -->
122-
<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src="https://raw.githubusercontent.com/arditoibryan/pythonkai/main/_content/articles/220118_bitcoin_tweet/df.png" alt="" width="390" height="616"/><figcaption>df raw</figcaption></figure></div>
96+
<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src=@@@df alt="" width="390" height="616"/><figcaption>@@@df_caption</figcaption></figure></div>
12397
<!-- /wp:image -->
12498

12599
<!-- wp:heading {"textColor":"secondary"} -->
126100
<h2 class="has-secondary-color has-text-color" id="3-perform-sentiment-analysis">3. Perform sentiment analysis</h2>
127101
<!-- /wp:heading -->
128102

129-
<!-- wp:paragraph -->
130103
<p>I am now going to apply sentiment analysis to our cleaned data. There is a myriad of libraries you can use to perform this task, such as <strong>transformers</strong>, <strong>textblob</strong>, and <strong>spacy</strong>. For this tutorial I am going to use the latest version of spacy and its extension, <a href="https://spacy.io/universe/project/spacy-textblob" target="_blank" rel="noreferrer noopener">spacytextblob</a>.</p>
131-
<!-- /wp:paragraph -->
132104

133-
<!-- wp:paragraph -->
134105
<p>To install it, I will need to run the following commands and restart the notebook:</p>
135-
<!-- /wp:paragraph -->
136106

137107
<!-- wp:code {"backgroundColor":"primary"} -->
138108
<pre class="wp-block-code has-primary-background-color has-background"><code>!pip install spacytextblob==3.0.1
@@ -141,9 +111,7 @@ <h2 class="has-secondary-color has-text-color" id="3-perform-sentiment-analysis"
141111
!python -m spacy download en_core_web_sm</code></pre>
142112
<!-- /wp:code -->
143113

144-
<!-- wp:paragraph -->
145114
<p>Once the installation is complete, we can run the sentiment analysis and append the score to our dataframe:</p>
146-
<!-- /wp:paragraph -->
147115

148116
<!-- wp:code {"backgroundColor":"primary"} -->
149117
<pre class="wp-block-code has-primary-background-color has-background"><code>import spacy
@@ -157,14 +125,127 @@ <h2 class="has-secondary-color has-text-color" id="3-perform-sentiment-analysis"
157125
df_sentiment</code></pre>
158126
<!-- /wp:code -->
159127

160-
<!-- wp:paragraph -->
161128
<p>As we can see, this is the final result:</p>
162-
<!-- /wp:paragraph -->
163129

164130
<!-- wp:image {"align":"center","width":480,"height":616,"sizeSlug":"large","className":"is-style-default"} -->
165-
<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src="https://raw.githubusercontent.com/arditoibryan/pythonkai/main/_content/articles/220118_bitcoin_tweet/df_sentiment.png" alt="" width="480" height="616"/><figcaption>sentiment</figcaption></figure></div>
131+
<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src=@@@df_sentiment alt="" width="480" height="616"/><figcaption>_</figcaption></figure></div>
166132
<!-- /wp:image -->
167133

168-
<!-- wp:paragraph -->
169134
<p>I decided to sort the values starting from the most negative, so that we can see some of the most shocking comments regarding Bitcoin.</p>
170-
<!-- /wp:paragraph -->
135+
136+
<!-- wp:heading {"textColor":"secondary"} -->
137+
<h2 class="has-secondary-color has-text-color" id="4-analyze-results">4. Analyze results</h2>
138+
<!-- /wp:heading -->
139+
140+
<p>Before analyzing the content of the tweets, we are going to preprocess our data further. There are several preprocessing strategies; in this post, we are going to:</p>
141+
142+
<!-- wp:list -->
143+
<ul><li>Lemmatize each word</li><li>Delete extra characters</li><li>Remove stop words</li></ul>
144+
<!-- /wp:list -->
145+
146+
<p>I am using my own function to perform this cleaning. Similar preprocessing functions are widely available, so if you wish to try other code, perhaps something simpler or something that performs only a single preprocessing step, you can easily find it online:</p>
147+
148+
<!-- wp:code {"backgroundColor":"primary"} -->
149+
<pre class="wp-block-code has-primary-background-color has-background"><code>import re
150+
import nltk
151+
nltk.download('wordnet')
152+
nltk.download('stopwords')
153+
from nltk.tokenize import RegexpTokenizer
154+
from nltk.stem import WordNetLemmatizer,PorterStemmer
155+
from nltk.corpus import stopwords
156+
lemmatizer = WordNetLemmatizer()
157+
stemmer = PorterStemmer()
158+
159+
#adding a counter to check the progress of the algo while it runs
160+
global counter
161+
counter = 0
162+
def preprocess(sentence, stemming=False, lemmatizing=False):
163+
    global counter
164+
    counter += 1
165+
    if counter % 100 == 0:
166+
        pass
167+
        #print(counter)
168+
169+
    #clean as much as possible, but not apply strong editing to the text, yet
170+
    sentence=str(sentence)
171+
    tokenizer = RegexpTokenizer(r'\w+')
172+
173+
    sentence = sentence.lower()
174+
    sentence=sentence.replace('{html}',"")
175+
    cleanr = re.compile('&lt;.*?&gt;')
176+
    cleantext = re.sub(cleanr, '', sentence)
177+
    rem_url=re.sub(r'http\S+', '',cleantext)
178+
    rem_num = re.sub('&#91;0-9]+', '', rem_url)
179+
    tokens = tokenizer.tokenize(rem_num)
180+
181+
    filtered_words = &#91;w for w in tokens if len(w) &gt; 2 if not w in stopwords.words('english')]
182+
183+
    if stemming == True and lemmatizing == False:
184+
        stem_words=&#91;stemmer.stem(w) for w in filtered_words]
185+
        return " ".join(stem_words)
186+
187+
    if stemming == False and lemmatizing == True:
188+
        lemma_words=&#91;lemmatizer.lemmatize(w) for w in filtered_words]
189+
        return " ".join(lemma_words)
190+
191+
    if stemming == True and lemmatizing == True:
192+
        stem_words=&#91;stemmer.stem(w) for w in filtered_words]
193+
        lemma_words=&#91;lemmatizer.lemmatize(w) for w in stem_words]
194+
        return " ".join(lemma_words)
195+
196+
    #at the end of the algo we return filtered words
197+
    return " ".join(filtered_words)
198+
199+
#preprocess the sentiment text
200+
df_sentiment&#91;'text'] = df_sentiment&#91;'text'].apply(lambda x: preprocess(x, stemming=False, lemmatizing=True))
201+
df_sentiment</code></pre>
202+
<!-- /wp:code -->
203+
204+
<!-- wp:image {"align":"center","width":480,"height":616,"sizeSlug":"large","className":"is-style-default"} -->
205+
<div class="wp-block-image is-style-default"><figure class="aligncenter size-large is-resized"><img src=@@@df_sentiment_preprocessed alt="" width="480" height="616"/><figcaption>_</figcaption></figure></div>
206+
<!-- /wp:image -->
207+
208+
<p>There are several ways we can analyze the results of the sentiment analysis. One common practice is to separate the samples with negative sentiment from those with positive sentiment and extract the most common words in each group. </p>
209+
210+
<!-- wp:code {"backgroundColor":"primary"} -->
211+
<pre class="wp-block-code has-primary-background-color has-background"><code>df_neg = df_sentiment&#91;df_sentiment&#91;'sentiment'] &lt; 0]
212+
df_pos = df_sentiment&#91;df_sentiment&#91;'sentiment'] &gt; 0]
213+
</code></pre>
214+
<!-- /wp:code -->
215+
216+
<p>First of all, let us see how many positive and negative tweets we have inferred from our data, to get a general idea of public opinion regarding @project_name:</p>
217+
218+
<!-- wp:code {"backgroundColor":"primary","textColor":"background"} -->
219+
<pre class="wp-block-code has-background-color has-primary-background-color has-text-color has-background"><code>print(len(df_neg))
220+
print(len(df_pos))
221+
\
222+
@@@len_df_neg
223+
@@@len_df_pos
224+
</code></pre>
225+
<!-- /wp:code -->
226+
227+
<p>Let us extract the most common words found in both the positive and the negative tweets:</p>
228+
229+
<!-- wp:code {"backgroundColor":"primary"} -->
230+
<pre class="wp-block-code has-primary-background-color has-background"><code>positive_words = pd.DataFrame(&#91;dict(Counter(' '.join(df_pos&#91;'text'].values.tolist()).split(' ')))]).T.sort_values(0, ascending=False)&#91;0:100].index
231+
232+
negative_words = pd.DataFrame(&#91;dict(Counter(' '.join(df_neg&#91;'text'].values.tolist()).split(' ')))]).T.sort_values(0, ascending=False)&#91;0:100].index</code></pre>
233+
<!-- /wp:code -->
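The one-liners above are dense, so here is a simplified sketch of the same idea using only the standard library. The `rows` list is made-up sample data standing in for `df_sentiment`, and `top_words` is a hypothetical helper; the pandas expressions above remain the code used in this post.

```python
from collections import Counter

# Hypothetical (text, sentiment) rows standing in for df_sentiment
rows = [
    ("bitcoin great investment", 0.8),
    ("bitcoin crash scam", -0.6),
    ("love bitcoin great future", 0.5),
    ("bitcoin scam warning", -0.4),
]


def top_words(texts, n=100):
    """Return the n most frequent words across all texts, most common first."""
    counts = Counter(word for text in texts for word in text.split())
    return [word for word, _ in counts.most_common(n)]


# As in the post, tweets with a score of exactly 0 fall into neither group
positive_texts = [text for text, score in rows if score > 0]
negative_texts = [text for text, score in rows if score < 0]

print(top_words(positive_texts, n=3))
print(top_words(negative_texts, n=3))
```

Words that appear near the top of both lists (such as the search term itself) say little about sentiment, so it can help to compare the two lists and focus on the words unique to each.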
234+
235+
<p>These are the most common words found in the positive tweets:</p>
236+
237+
<!-- wp:code {"backgroundColor":"primary"} -->
238+
<pre class="wp-block-code has-primary-background-color has-background"><code>@@@positive_words</code></pre>
239+
<!-- /wp:code -->
240+
241+
<p>These, instead, are the most common words found in the negative tweets:</p>
242+
243+
<!-- wp:code {"backgroundColor":"primary"} -->
244+
<pre class="wp-block-code has-primary-background-color has-background"><code>@@@negative_words</code></pre>
245+
<!-- /wp:code -->
246+
247+
<!-- wp:heading {"textColor":"secondary"} -->
248+
<h2 class="has-secondary-color has-text-color" id="5-conclusion">5. Conclusion</h2>
249+
<!-- /wp:heading -->
250+
251+
<p>Given the insights we have inferred using NLP, we can see that there is a prevalent @sentiment_score regarding @@@project name.</p>
