examples/fix-speed/calculate.py
import sys

def run(X, Y):
    total = 0
    for x in range(X):
        for y in range(Y):
            width, height = read_config()
            total += x*width + y*height
    print(total)

def read_config():
    import csv
    with open('config.csv', newline='') as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            return int(row['width']), int(row['height'])

if __name__ == "__main__":
    if len(sys.argv) != 3:
        exit(f"Usage: {sys.argv[0]} X Y")
    X = int(sys.argv[1])
    Y = int(sys.argv[2])
    run(X, Y)
Can you suggest how to improve the speed of this code?
To try it, create a file called config.csv with this content:
width,height
23,19
and then run
time python3 calculate.py 1000 500
On my computer this takes 5 seconds to run.
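The slowdown is almost certainly the read_config() call inside the inner loop: it re-opens and re-parses config.csv X*Y times (500,000 times for the arguments above). A sketch of the fix, reading the configuration once and passing the loop-invariant values into run():

```python
import csv

def read_config():
    # Read width/height from the first data row of config.csv.
    with open('config.csv', newline='') as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            return int(row['width']), int(row['height'])

def run(X, Y, width, height):
    # width and height are now loop-invariant: no file I/O inside the loops.
    total = 0
    for x in range(X):
        for y in range(Y):
            total += x * width + y * height
    return total
```

The __main__ block then becomes `width, height = read_config()` followed by `print(run(X, Y, width, height))`: one file read instead of 500,000. The sum could also be computed in closed form, but hoisting the I/O out of the loop is the change that accounts for the 5 seconds.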
Nikita writes:
The microservice processed all requests between different clients and DDB. In addition, during the transfer period, both RDS and DDB were supported before the full switch to DDB. I can talk about the general approaches I used to build this microservice, how I worked with the legacy code, monitoring, and what the outcome was. I will also give a summary of all the pros and cons I faced and things you could do better from the beginning.
I'm a full-stack developer currently working at Check Point in Tel Aviv. My stack is Angular + Python (Flask, FastAPI). I'm also interested in web accessibility.
1 00:00:02.040 --> 00:00:31.269 Gabor Szabo: Hello, and welcome to the Code Maven channel and the Code Maven meetup group. My name is Gabor Szabo. I teach Python, I help companies with Python and Rust, and I also organize these events online because I think it's a very useful platform to share knowledge. And I'm really happy that Nikita agreed to give this presentation. Hello, Nikita.
2 00:00:31.270 --> 00:00:32.040 Nikita Barysheva: Hi!
3 00:00:32.250 --> 00:00:33.379 Nikita Barysheva: Nice to meet you all.
4 00:00:33.380 --> 00:00:35.440 Gabor Szabo: And sorry, and
5 00:00:36.170 --> 00:01:03.490 Gabor Szabo: To those people who are present, thank you for joining the meeting; you can freely ask questions in the chat. And if you're watching the video on YouTube, then please like the video, follow the channel, and join our meetup groups, so you will be notified when we have new meetings and new events. With that said, Nikita, it's your turn. Please introduce yourself and give your presentation.
6 00:01:04.170 --> 00:01:13.930 Nikita Barysheva: Yeah, sure, I'll start sharing my screen, and then probably I'll start presenting. One second... okay.
7 00:01:14.770 --> 00:01:20.399 Nikita Barysheva: Hi, everyone, once again. My name is Nikita, and I'm a software developer at the Check Point company right now.
8 00:01:20.590 --> 00:01:32.040 Nikita Barysheva: And today I want to talk about one of the things I had in my previous experience, when we decided to switch from RDS to DynamoDB
9 00:01:32.160 --> 00:01:45.890 Nikita Barysheva: for our users table: how we thought about it, what the overall architecture was, and how we built a microservice that helped us to make this switch.
10 00:01:46.826 --> 00:02:05.609 Nikita Barysheva: We'll cover different topics. We will talk about general differences between RDS and DynamoDB, and I will highlight some main things that might make you think about why to switch from one database to another, or that will help you understand our motivation behind it.
11 00:02:05.740 --> 00:02:17.010 Nikita Barysheva: And we will go over the architecture of the microservice that we built, and I'll give you some examples there, and we can talk about it in more detail if you want.
12 00:02:17.700 --> 00:02:23.740 Nikita Barysheva: So let's have a quick overview. I don't know if
13 00:02:23.990 --> 00:02:42.469 Nikita Barysheva: everyone is familiar with DynamoDB or RDS and what the differences are. But the main ones: DynamoDB is a key-value NoSQL database. It's fully managed by AWS, and it's very good for applications with low latency and flexible data models.
14 00:02:42.850 --> 00:02:48.559 Nikita Barysheva: RDS, on the opposite side, is a SQL database, and we have
15 00:02:48.830 --> 00:02:59.830 Nikita Barysheva: predefined schemas. It's also managed by AWS, but the difference from DynamoDB is that here you really have to invest more
16 00:03:00.110 --> 00:03:13.869 Nikita Barysheva: knowledge into setting up things. And for sure, if you have complex queries and joins and so on, this is better for your solution.
17 00:03:15.296 --> 00:03:26.219 Nikita Barysheva: When you decide which database to use, you will probably look at several things, like scalability, performance, availability.
18 00:03:26.410 --> 00:03:28.730 Nikita Barysheva: And here I present some
19 00:03:29.070 --> 00:03:36.149 Nikita Barysheva: basic stuff about the differences for each of them, DynamoDB and RDS. But overall, saying it again:
20 00:03:36.160 --> 00:04:03.969 Nikita Barysheva: for scalability, we know that DynamoDB automatically scales horizontally, and that really helps to manage a large amount of traffic without any intervention. At the same time, RDS scales vertically: it has to increase the instance size, and this increase also takes time, and there might be some gaps in performance because of that.
21 00:04:03.990 --> 00:04:05.570 Nikita Barysheva: And as
22 00:04:05.760 --> 00:04:30.199 Nikita Barysheva: for performance, it really depends on the type of instance you chose and the storage. As I said before, you really need to know what you are doing there and how you're setting it up, because if you don't do it properly, you might have some slowness, or the database won't be available, something will be down, and users won't be happy.
23 00:04:30.250 --> 00:04:37.455 Nikita Barysheva: And as for availability, we found out, basically for ourselves,
24 00:04:38.560 --> 00:04:46.200 Nikita Barysheva: a big difference for DynamoDB: there is a thing that you can activate called global tables.
25 00:04:46.390 --> 00:04:50.660 Nikita Barysheva: It's a multi-region, multi-master database solution.
26 00:04:50.790 --> 00:05:02.400 Nikita Barysheva: RDS, for its part, supports, let's call it, multi-AZ, multi availability zones: it replicates the data across different availability zones, but in the same region.
27 00:05:03.740 --> 00:05:09.610 Nikita Barysheva: And another thing that you will have to consider, that will be interesting for you, is
28 00:05:10.090 --> 00:05:11.680 Nikita Barysheva: cost considerations.
29 00:05:13.510 --> 00:05:28.390 Nikita Barysheva: DynamoDB has, let's say, dynamic pricing, while RDS cost will increase as you scale vertically (larger instances) or horizontally (read replicas). RDS also provides on-demand pricing,
30 00:05:28.490 --> 00:05:38.600 Nikita Barysheva: but still, if you chose an instance with a specific storage size (I don't remember the exact tiers there, but
31 00:05:38.830 --> 00:05:43.869 Nikita Barysheva: let's say 60 gigs), you will have to pay for 60 gigs even if you use only 20 of them.
32 00:05:44.190 --> 00:06:03.179 Nikita Barysheva: Next, efficient high-traffic handling: DynamoDB is really optimized for high-traffic scenarios, and it doesn't require read replicas. It can handle millions of requests per second with the architecture that AWS provides to us.
33 00:06:03.480 --> 00:06:12.200 Nikita Barysheva: We all want to have, let's say, predictable cost, and because of that,
34 00:06:12.560 --> 00:06:18.709 Nikita Barysheva: another benefit of DynamoDB is that it offers 2 capacity modes. With provisioned capacity,
35 00:06:18.910 --> 00:06:24.920 Nikita Barysheva: when you set up the database, you basically have to say how many reads and writes,
36 00:06:26.620 --> 00:06:34.859 Nikita Barysheva: what is the bar, let's say, for your database. Or you can use on-demand,
37 00:06:35.010 --> 00:06:53.600 Nikita Barysheva: which will automatically scale with your traffic and ensure you pay only for the usage. We once worked on an online store, and we had a situation that we didn't predict, because no one on the team was following the Super Bowl in the United States.
38 00:06:53.800 --> 00:07:00.989 Nikita Barysheva: We just missed it. And the traffic went up,
39 00:07:01.790 --> 00:07:08.899 Nikita Barysheva: and people tried to buy beer in the United States, to order it online, and we didn't expect that. But thanks to
40 00:07:09.060 --> 00:07:16.610 Nikita Barysheva: the DynamoDB architecture, it scaled up automatically, and we were on the pretty good side.
41 00:07:18.662 --> 00:07:22.460 Nikita Barysheva: These are very general cost overviews, okay?
42 00:07:22.930 --> 00:07:30.140 Nikita Barysheva: I just wanted to give you some examples. Don't take it as strict, that you have to calculate it like this. I just wanted to
43 00:07:30.470 --> 00:07:32.630 Nikita Barysheva: give you a
44 00:07:33.130 --> 00:07:35.150 Nikita Barysheva: basic understanding. Okay.
45 00:07:36.065 --> 00:07:41.960 Nikita Barysheva: For RDS, you pay for the instance cost and for the storage.
46 00:07:42.830 --> 00:07:50.800 Nikita Barysheva: Once again, the specification could be more complicated, but we're talking about basics.
47 00:07:50.930 --> 00:07:57.739 Nikita Barysheva: And for DynamoDB, you pay for write capacity units, read capacity units, and also for data storage.
48 00:07:58.080 --> 00:08:05.850 Nikita Barysheva: Regarding the storage, I wanted to give you some more detailed calculations here.
49 00:08:07.160 --> 00:08:13.440 Nikita Barysheva: If you, for example, want to store 5 gigs, 10 gigs, 20 gigs,
50 00:08:13.610 --> 00:08:21.249 Nikita Barysheva: you will pay the same price for the storage all this time, because, as I said before, you
51 00:08:21.470 --> 00:08:24.479 Nikita Barysheva: choose the storage type and you have to pay for it
52 00:08:24.650 --> 00:08:28.259 Nikita Barysheva: even if you use less.
53 00:08:28.470 --> 00:08:36.140 Nikita Barysheva: At the same time, you see that for DynamoDB this thing is dynamic,
54 00:08:37.179 --> 00:08:41.329 Nikita Barysheva: and it depends on the real storage that you use.
55 00:08:41.620 --> 00:08:48.309 Nikita Barysheva: There are more things than I mentioned here; I'm not sure if you want to be overwhelmed right now, let me know. But
56 00:08:48.660 --> 00:08:56.309 Nikita Barysheva: these are the very basic things that I wanted you to consider, just to understand RDS and DynamoDB.
57 00:08:56.950 --> 00:08:57.940 Nikita Barysheva: And
58 00:08:58.050 --> 00:09:11.530 Nikita Barysheva: yeah, so we talked about different database types, SQL and NoSQL: specifically RDS (actually, I didn't mention it, but it's PostgreSQL,
59 00:09:11.710 --> 00:09:15.120 Nikita Barysheva: if that's important) and DynamoDB.
60 00:09:15.630 --> 00:09:22.709 Nikita Barysheva: Now, I want to talk about the actual problem that we had and the solution
61 00:09:22.890 --> 00:09:24.870 Nikita Barysheva: that we found for ourselves.
62 00:09:28.790 --> 00:09:33.650 Nikita Barysheva: So the overall problem was that when
63 00:09:34.180 --> 00:09:38.829 Nikita Barysheva: the number of users, the number of requests to the database,
64 00:09:38.950 --> 00:09:42.869 Nikita Barysheva: went up, we had spikes.
65 00:09:43.020 --> 00:09:59.789 Nikita Barysheva: Our RDS didn't work well sometimes. So we decided that we needed something more stable, and we started to consider different databases. And because we had previous experience with DynamoDB on another project,
66 00:10:00.150 --> 00:10:09.900 Nikita Barysheva: we decided that we want to build an architecture where all our clients will go to DynamoDB
67 00:10:10.060 --> 00:10:15.369 Nikita Barysheva: through a users microservice. But
68 00:10:15.520 --> 00:10:22.409 Nikita Barysheva: another problem is that today all of our users are stored in PostgreSQL.
69 00:10:22.960 --> 00:10:26.039 Nikita Barysheva: So how to manage
70 00:10:26.230 --> 00:10:35.769 Nikita Barysheva: 2 different databases? And not just physically different databases, I mean different types of databases. That's kind of a challenge. Okay?
71 00:10:35.930 --> 00:10:41.610 Nikita Barysheva: So this is, again, a really
72 00:10:42.104 --> 00:10:49.590 Nikita Barysheva: general overview of the solution. On the left side you see the clients. Each of them is
73 00:10:49.840 --> 00:10:51.350 Nikita Barysheva: a client that
74 00:10:51.510 --> 00:10:57.339 Nikita Barysheva: could be a back-end client that wants to get data about a specific user,
75 00:10:57.500 --> 00:11:01.464 Nikita Barysheva: or to get all the users by some condition. And
76 00:11:02.330 --> 00:11:16.599 Nikita Barysheva: how do we do it? We decided to implement several feature flags that will tell us where we should read the data from, or where we now want to write the data to.
77 00:11:16.870 --> 00:11:24.160 Nikita Barysheva: And based on these feature flags, we were doing get requests, or we were doing
78 00:11:24.690 --> 00:11:33.529 Nikita Barysheva: put, post, delete. We do all the separation based on these feature flags. And this is the
79 00:11:33.770 --> 00:11:40.780 Nikita Barysheva: PostgreSQL architecture, nothing special here. And this is the user service. So we have the
80 00:11:41.200 --> 00:11:47.700 Nikita Barysheva: containers here, and we use Redis for caching, and DynamoDB.
81 00:11:48.030 --> 00:11:54.420 Nikita Barysheva: Without additional details here, I think it could be pretty clear what we are
82 00:11:54.760 --> 00:11:59.329 Nikita Barysheva: trying to do here. Let me know if you have any questions so far.
83 00:11:59.700 --> 00:12:06.740 Nikita Barysheva: If you have questions about this scheme, I will be happy to answer them, just
84 00:12:06.860 --> 00:12:10.370 Nikita Barysheva: to make it more clear for you later.
85 00:12:12.630 --> 00:12:13.830 Nikita Barysheva: And then,
86 00:12:15.170 --> 00:12:23.570 Nikita Barysheva: besides the fact that we want to transfer to DynamoDB, we need to have this transition period. So,
87 00:12:23.800 --> 00:12:27.390 Nikita Barysheva: as you saw on the previous scheme,
88 00:12:27.660 --> 00:12:36.229 Nikita Barysheva: we planned to implement the service, an API that will handle all CRUD operations related to our DynamoDB.
89 00:12:36.530 --> 00:12:42.040 Nikita Barysheva: And we also needed to transfer all users data from RDS to DynamoDB.
90 00:12:42.160 --> 00:12:49.420 Nikita Barysheva: This was done with different scripts we wrote: we basically grab the data from RDS,
91 00:12:49.730 --> 00:12:58.090 Nikita Barysheva: transform the data as we want, and basically transfer this data to DynamoDB.
92 00:12:58.640 --> 00:13:02.679 Nikita Barysheva: And we also decided, as I mentioned
93 00:13:02.860 --> 00:13:11.260 Nikita Barysheva: on a previous slide, that we want to have feature flags. The 1st feature flag is 'read user from DynamoDB'.
94 00:13:11.450 --> 00:13:12.520 Nikita Barysheva: If it's true,
95 00:13:12.780 --> 00:13:19.020 Nikita Barysheva: we go through the microservice to DynamoDB. If it's false, we go directly to PostgreSQL.
96 00:13:19.160 --> 00:13:24.129 Nikita Barysheva: And 'write user to RDS' and 'write user to DynamoDB':
97 00:13:24.330 --> 00:13:30.630 Nikita Barysheva: these are 2 flags that we basically need to support this period when we
98 00:13:31.270 --> 00:13:37.399 Nikita Barysheva: work with both databases. We tried to make this period as short as possible,
99 00:13:37.680 --> 00:13:41.950 Nikita Barysheva: to make some tests on QA, on staging, and then on production.
100 00:13:42.504 --> 00:13:47.880 Nikita Barysheva: We still had to watch production. But once we saw that everything works
101 00:13:48.230 --> 00:13:51.070 Nikita Barysheva: fine, when we don't have any
102 00:13:51.200 --> 00:14:11.700 Nikita Barysheva: requests from the customers and we don't have any bugs opened, we closed 'write user to RDS'. So, let's go back for a second if I can. Yeah, basically this path was closed, so we just continued working directly with our user service.
103 00:14:12.150 --> 00:14:23.140 Nikita Barysheva: 'Read from DynamoDB' was always true, and 'write to DynamoDB' also true; 'write to RDS' was false. So all this scheme
104 00:14:23.290 --> 00:14:25.420 Nikita Barysheva: started working only with this part,
105 00:14:25.690 --> 00:14:27.470 Nikita Barysheva: to avoid any possible risk.
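(Editor's note: the slides themselves are not reproduced in this transcript. The flag-based routing described above can be sketched in Python; the flag names and the backend callables below are hypothetical stand-ins, just to illustrate the dual-write transition period.)

```python
# Hypothetical feature flags for the transition period described in the talk.
FLAGS = {
    "read_user_from_dynamodb": True,   # read path: microservice + DynamoDB vs. PostgreSQL
    "write_user_to_rds": True,         # dual-write: keep RDS updated during migration
    "write_user_to_dynamodb": True,
}

def get_user(user_id, flags, read_dynamo, read_rds):
    # Route the read based on the flag; read_dynamo/read_rds stand in
    # for the real backend calls (microservice client vs. SQL query).
    if flags["read_user_from_dynamodb"]:
        return read_dynamo(user_id)
    return read_rds(user_id)

def save_user(user, flags, write_dynamo, write_rds):
    # During the transition both writes may run; at the end of the migration
    # "write_user_to_rds" is flipped to False and the RDS path is closed.
    if flags["write_user_to_dynamodb"]:
        write_dynamo(user)
    if flags["write_user_to_rds"]:
        write_rds(user)
```

Flipping a single flag then closes the RDS write path without a deploy, which is the "this path was closed" step in the talk.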
106 00:14:28.655 --> 00:14:29.020 Nikita Barysheva: Okay,
107 00:14:30.780 --> 00:14:42.270 Nikita Barysheva: I wanted to show the client-side architecture. The client, as I mentioned before, is every service that wants to get information about the user.
108 00:14:42.800 --> 00:14:48.340 Nikita Barysheva: And I just wanted to give some code examples and to explain what we
109 00:14:48.470 --> 00:14:50.530 Nikita Barysheva: generally try to achieve there.
110 00:14:51.320 --> 00:15:01.800 Nikita Barysheva: We wrote a model interface, and the purpose of such an interface is to be a handle for all calls
111 00:15:01.970 --> 00:15:05.099 Nikita Barysheva: to DynamoDB through the service:
112 00:15:05.370 --> 00:15:13.029 Nikita Barysheva: handle responses from the service, manage all the retries, manage all the caching, etcetera. So, to be
113 00:15:13.250 --> 00:15:19.509 Nikita Barysheva: the one that gets the data for the client from the service.
114 00:15:20.940 --> 00:15:23.820 Nikita Barysheva: It could look like this.
115 00:15:23.950 --> 00:15:31.610 Nikita Barysheva: One of the functions that we could use is get user by email:
116 00:15:32.030 --> 00:15:37.850 Nikita Barysheva: we initiate the user client with some parameters over here.
117 00:15:38.050 --> 00:15:50.039 Nikita Barysheva: One of the parameters that I really like to mention is the requester ID. I will explain later... actually, I can explain right now
118 00:15:50.260 --> 00:15:51.730 Nikita Barysheva: why we need it:
119 00:15:52.010 --> 00:16:00.319 Nikita Barysheva: basically for the logging and for the tracking when something falls down. I really like that we know which client made this request.
120 00:16:00.690 --> 00:16:16.299 Nikita Barysheva: And on the left side is the function itself, which uses the feature flag. The code could be optimized; don't look at it as a perfect one, I just wanted to make it as clear and readable on one slide as possible.
121 00:16:16.830 --> 00:16:22.635 Nikita Barysheva: So if we want to get a user by email, we look at this feature flag, and
122 00:16:23.180 --> 00:16:27.599 Nikita Barysheva: we basically want to make a request to the DynamoDB
123 00:16:27.790 --> 00:16:40.490 Nikita Barysheva: service. This is the client library, and we want to make the API call. Else, we're going as we did it before: we just go to PostgreSQL and get the data over there.
124 00:16:43.070 --> 00:16:48.000 Nikita Barysheva: And this is the function that gets the user from DynamoDB;
125 00:16:48.250 --> 00:16:51.309 Nikita Barysheva: let me make it a bit more specific over here.
126 00:16:51.770 --> 00:16:54.490 Nikita Barysheva: We're setting up all the parameters that we want to get,
127 00:16:54.650 --> 00:17:08.140 Nikita Barysheva: and we're making an API call to the route, and we are handling the response. You can handle it however you want; we at that moment in time decided that we want
128 00:17:08.359 --> 00:17:09.180 Nikita Barysheva: to return
129 00:17:09.440 --> 00:17:25.660 Nikita Barysheva: 2 values here. The 1st one represents the status, whether it's okay or not, and the second one is the response. So we can check if the request was okay or not. And this is the call_api function that actually makes the request
130 00:17:26.160 --> 00:17:44.499 Nikita Barysheva: to the service. It has some retries, you can set up whatever you want, and once again you can make it better if you want: logging, you can mock the request itself, and for sure error handling with logging also.
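(Editor's note: the slide's call_api code is not reproduced in the transcript. The following is a hypothetical sketch of the shape described, with retries and the (ok, response) pair; the function name, linear backoff, and logger name are assumptions, not the speaker's actual code.)

```python
import logging
import time

logger = logging.getLogger("user_client")

def call_api(request_fn, retries=3, backoff_seconds=0.5):
    # request_fn is a zero-argument callable that performs the HTTP request
    # (e.g. a functools.partial around a requests.get call).
    # Returns (True, response) on success, (False, last_error) after all retries fail.
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return True, request_fn()
        except Exception as exc:  # real code would catch only transport errors
            last_error = exc
            logger.warning("attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt < retries:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    return False, last_error
```

The caller then checks the first value, matching the "check if the request was okay or not" pattern described above.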
131 00:17:45.580 --> 00:17:50.759 Nikita Barysheva: If there are any questions so far, let me know.
132 00:17:54.020 --> 00:18:01.149 Nikita Barysheva: So this is one of the examples of how a client can make a get request
133 00:18:01.460 --> 00:18:05.160 Nikita Barysheva: to the microservice that will then
134 00:18:05.320 --> 00:18:08.100 Nikita Barysheva: get the data from DynamoDB.
135 00:18:08.310 --> 00:18:13.170 Nikita Barysheva: So let's have a look at one of the routes of the microservice itself.
136 00:18:16.340 --> 00:18:20.059 Nikita Barysheva: As we know, we decided to use DynamoDB.
137 00:18:20.620 --> 00:18:25.700 Nikita Barysheva: First of all, you have to create the table. I just wanted to give you a
138 00:18:25.910 --> 00:18:29.100 Nikita Barysheva: quick overview of what is included.
139 00:18:29.620 --> 00:18:36.320 Nikita Barysheva: You see here some params, including the key schema that defines the primary key.
140 00:18:36.680 --> 00:18:40.250 Nikita Barysheva: The primary key could also be a composite key
141 00:18:40.490 --> 00:18:45.620 Nikita Barysheva: of, let's say, 2 fields, and
142 00:18:45.940 --> 00:18:52.720 Nikita Barysheva: this is what actually helps us to get the data quicker.
143 00:18:53.170 --> 00:18:58.179 Nikita Barysheva: We have the attribute definitions that describe the primary key
144 00:18:58.901 --> 00:19:07.130 Nikita Barysheva: fields. And we can also set up different global secondary indexes. One of them here is the email index,
145 00:19:07.260 --> 00:19:14.640 Nikita Barysheva: which allows us to search by email also, not only by ID. But you may have different indexes, not only one.
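(Editor's note: as a concrete illustration of a key schema, attribute definitions, and an email GSI, here is a hypothetical parameter set in the shape boto3's create_table expects. The table and index names are made up; the slide's actual values are not reproduced here.)

```python
# Hypothetical DynamoDB table definition: "id" as the primary (hash) key,
# plus a global secondary index so users can also be looked up by email.
table_params = {
    "TableName": "users",
    "KeySchema": [
        {"AttributeName": "id", "KeyType": "HASH"},  # a RANGE key would make it composite
    ],
    "AttributeDefinitions": [
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "email", "AttributeType": "S"},
    ],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "email-index",
            "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",  # the on-demand capacity mode discussed earlier
}

# With boto3 this would be passed as:
#   boto3.client("dynamodb").create_table(**table_params)
```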
146 00:19:17.307 --> 00:19:19.250 Nikita Barysheva: About the routes:
147 00:19:19.872 --> 00:19:31.540 Nikita Barysheva: maybe it will be obvious for many of you, but actually, I was surprised when it wasn't for some other developers when I talked to them.
148 00:19:31.670 --> 00:19:41.340 Nikita Barysheva: The basic service handles all the get, put, post, delete requests easily; it should do it, okay. But
149 00:19:41.820 --> 00:19:44.709 Nikita Barysheva: the thing that people are really missing is:
150 00:19:45.650 --> 00:19:51.849 Nikita Barysheva: what if we want to update many users at once? What if we want to create many users at once?
151 00:19:55.600 --> 00:19:56.680 Nikita Barysheva: Someone called it.
152 00:19:57.040 --> 00:20:00.379 Nikita Barysheva: So when we talk about DynamoDB,
153 00:20:01.020 --> 00:20:09.059 Nikita Barysheva: and when we talk about the cost consideration, it's much better.
154 00:20:09.190 --> 00:20:12.480 Nikita Barysheva: Let's say you want to create 100 users.
155 00:20:12.720 --> 00:20:17.130 Nikita Barysheva: You have a reason for that. Let's say you don't go in a for loop,
156 00:20:17.290 --> 00:20:20.119 Nikita Barysheva: creating one after another.
157 00:20:20.290 --> 00:20:35.150 Nikita Barysheva: You're sending a batch of 100 users, and basically this batch will be divided into chunks of 25 records, and only
158 00:20:35.280 --> 00:20:41.920 Nikita Barysheva: 4 requests will be done to DynamoDB, so much, much less. Okay.
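(Editor's note: the 25-item chunk size is DynamoDB's BatchWriteItem limit, which is why 100 users become 4 requests instead of 100. A sketch of the chunking; in practice boto3's Table.batch_writer does this buffering for you, so the helper is mostly illustrative.)

```python
def chunked(items, size=25):
    # DynamoDB's BatchWriteItem accepts at most 25 put/delete requests per call,
    # so a bulk-create endpoint splits its payload into chunks of that size.
    for start in range(0, len(items), size):
        yield items[start:start + size]

# With boto3 the same effect is usually achieved with the batch_writer helper:
#   with table.batch_writer() as batch:
#       for user in users:
#           batch.put_item(Item=user)
```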
159 00:20:42.170 --> 00:20:43.479 Gabor Szabo: There is a question.
160 00:20:43.770 --> 00:20:44.190 Gabor Szabo: Oh.
161 00:20:44.190 --> 00:20:44.690 Nikita Barysheva: Yep.
162 00:20:44.690 --> 00:20:49.730 Gabor Szabo: How did you convert the data model from relational to NoSQL?
163 00:20:50.550 --> 00:20:57.619 Nikita Barysheva: Yeah, okay, that's a good one. The idea is actually: since we know that SQL
164 00:20:57.730 --> 00:21:00.940 Nikita Barysheva: has this strong structure,
165 00:21:01.050 --> 00:21:09.219 Nikita Barysheva: a fixed structure, we know what to expect, and we basically created another dictionary
166 00:21:09.330 --> 00:21:16.340 Nikita Barysheva: for the user and transferred it. Basically, we knew
167 00:21:17.000 --> 00:21:21.549 Nikita Barysheva: what the schema of the PostgreSQL is,
168 00:21:21.780 --> 00:21:29.169 Nikita Barysheva: we received the users, we transformed them into that basic dictionary, and,
169 00:21:29.620 --> 00:21:35.050 Nikita Barysheva: using the post method with the bulk, created them in DynamoDB.
170 00:21:35.610 --> 00:21:41.820 Nikita Barysheva: And that's it. No, no magic over there, actually.
171 00:21:43.180 --> 00:21:49.429 Nikita Barysheva: As for the datetime object, maybe this is what may be specifically interesting to you:
172 00:21:50.448 --> 00:21:56.219 Nikita Barysheva: we stored the date as a string, so we can convert it.
173 00:21:56.790 --> 00:22:08.040 Nikita Barysheva: You can also store it in milliseconds. What else did we have over there? We had Boolean fields, and
174 00:22:08.680 --> 00:22:13.980 Nikita Barysheva: nothing really special that can challenge you:
175 00:22:14.170 --> 00:22:18.829 Nikita Barysheva: you're just converting an object that you're getting from PostgreSQL
176 00:22:19.010 --> 00:22:25.190 Nikita Barysheva: to an object that will be suitable for DynamoDB. Yeah.
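(Editor's note: a minimal sketch of that conversion under the assumptions just described: dates become strings, None fields are dropped since DynamoDB simply omits empty attributes, and booleans pass through as-is. The field names in the test are hypothetical.)

```python
from datetime import datetime

def to_dynamo_item(row):
    # Convert a PostgreSQL row (as a dict) into a DynamoDB-friendly item.
    item = {}
    for key, value in row.items():
        if value is None:
            continue  # DynamoDB records simply omit empty attributes
        if isinstance(value, datetime):
            value = value.isoformat()  # store dates as strings (or epoch millis)
        item[key] = value
    return item
```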
177 00:22:26.010 --> 00:22:31.520 Nikita Barysheva: Does that answer your question, or should I be more specific? ... Excellent.
178 00:22:35.980 --> 00:22:41.230 Nikita Barysheva: I just don't see if it's a yes or no, I'm just sharing my screen.
179 00:22:41.620 --> 00:22:43.050 Gabor Szabo: Yes, he says.
180 00:22:43.800 --> 00:22:47.959 Nikita Barysheva: Okay, yeah, I see it now.
181 00:22:50.110 --> 00:22:56.679 Nikita Barysheva: Yep. So we talked about obvious and not-obvious things. And
182 00:22:59.590 --> 00:23:05.980 Nikita Barysheva: this is the service setup. I also tried to put many things on
183 00:23:06.270 --> 00:23:15.079 Nikita Barysheva: one screen. Can you see it? Because when I showed it live, people struggled to see it. I just want to check.
184 00:23:15.080 --> 00:23:20.760 Gabor Szabo: If you could enlarge a little bit the whole thing that might be nice. I don't know if it's if you can do that.
185 00:23:23.290 --> 00:23:30.279 Nikita Barysheva: One second... it doesn't look like it allows it in this mode.
186 00:23:32.750 --> 00:23:34.000 Nikita Barysheva: This doesn't help.
187 00:23:38.820 --> 00:23:48.070 Gabor Szabo: There's also another question in the meantime: why did you need Redis, if DynamoDB has great performance on reads?
188 00:23:49.990 --> 00:24:00.989 Nikita Barysheva: Because it's also good for saving money, actually. And the reasoning for Redis, I can explain in a second.
189 00:24:01.400 --> 00:24:06.300 Nikita Barysheva: It's a good one, because let's go over here.
190 00:24:14.750 --> 00:24:15.500 Nikita Barysheva: Okay.
191 00:24:15.800 --> 00:24:21.639 Nikita Barysheva: So we are now saying that we are working like this, okay:
192 00:24:21.800 --> 00:24:31.819 Nikita Barysheva: we are going from the client to DynamoDB, and I mentioned Redis here, but I also had, I think, to mention Redis on this layer.
193 00:24:32.390 --> 00:24:40.649 Nikita Barysheva: So what's happening right now is that our client makes another API call to another service,
194 00:24:41.000 --> 00:24:44.359 Nikita Barysheva: and every call, let's say, costs us something,
195 00:24:44.560 --> 00:24:46.810 Nikita Barysheva: and then we go to DynamoDB.
196 00:24:47.776 --> 00:24:56.910 Nikita Barysheva: So Redis is not only about the speed, it's also about the money, the cost reduction.
197 00:24:57.390 --> 00:25:02.470 Nikita Barysheva: For example, here at this layer,
198 00:25:02.810 --> 00:25:07.010 Nikita Barysheva: if the client was created
199 00:25:07.310 --> 00:25:13.530 Nikita Barysheva: not properly, and there are many requests and you don't cache the results,
200 00:25:13.700 --> 00:25:25.129 Nikita Barysheva: this client will make another request, another request, another request, and it can grow dramatically, and you will get a huge cost after all.
201 00:25:25.240 --> 00:25:28.289 Nikita Barysheva: So Redis is the solution for that also.
202 00:25:29.710 --> 00:25:38.919 Nikita Barysheva: Using Redis for caching basically allows you, 1st of all, to decrease the load on the
203 00:25:39.190 --> 00:25:45.409 Nikita Barysheva: service and, as a result, to decrease the cost.
204 00:25:45.960 --> 00:25:53.560 Nikita Barysheva: So one of the reasons, which not everyone thinks about in the beginning, is the cost reduction.
205 00:25:54.890 --> 00:25:58.660 Nikita Barysheva: Yep, is that okay?
206 00:26:03.800 --> 00:26:04.800 Nikita Barysheva: Trying to
207 00:26:10.780 --> 00:26:12.339 Nikita Barysheva: I hope that answers the question.
208 00:26:15.790 --> 00:26:23.640 Gabor Szabo: I think you can just go on. I'm just reading out the question: did you use Redis, or did you use AWS caching services?
209 00:26:24.360 --> 00:26:26.529 Nikita Barysheva: Redis, Redis. We used Redis.
210 00:26:26.530 --> 00:26:27.560 Gabor Szabo: Yeah, okay.
211 00:26:27.560 --> 00:26:34.379 Nikita Barysheva: Because we use it widely in all projects. And yeah, we are used to Redis.
212 00:26:41.870 --> 00:26:44.940 Gabor Szabo: I think you can go back to the code example.
213 00:26:50.400 --> 00:26:58.774 Nikita Barysheva: Okay, so basically the 1st things that you will need for the service.
214 00:26:59.670 --> 00:27:01.669 Nikita Barysheva: The setup is
215 00:27:01.820 --> 00:27:05.519 Nikita Barysheva: pretty easy. First, where is it... Pydantic again.
216 00:27:06.150 --> 00:27:10.750 Nikita Barysheva: Even though we said that with DynamoDB
217 00:27:11.340 --> 00:27:14.620 Nikita Barysheva: we don't have to follow a strict schema,
218 00:27:15.040 --> 00:27:17.980 Nikita Barysheva: it's a very good practice to have one. I mean,
219 00:27:18.130 --> 00:27:28.449 Nikita Barysheva: in DynamoDB, when you create a record with a field that is None, it won't be added, so we won't find it; if you go to the UI, you won't have it. But when you're
220 00:27:28.820 --> 00:27:33.899 Nikita Barysheva: getting the data, when you return the data, it's very good to have
221 00:27:34.400 --> 00:27:41.840 Nikita Barysheva: a model that you can use to serialize what you receive from the database. This will help
222 00:27:42.300 --> 00:27:47.980 Nikita Barysheva: the client to know what to expect and
223 00:27:49.110 --> 00:27:54.539 Nikita Barysheva: avoid some unpredictable scenarios. So
224 00:27:54.670 --> 00:27:58.719 Nikita Barysheva: a model is always good, even if you work with
225 00:27:58.860 --> 00:28:02.199 Nikita Barysheva: DynamoDB, a key-value database.
226 00:28:02.560 --> 00:28:08.259 Nikita Barysheva: You see here one of the examples of how you can model data.
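(Editor's note: a minimal sketch of such a Pydantic model; the field names and defaults are hypothetical, not the slide's. Parsing what comes back from DynamoDB through it gives the client a predictable shape even when attributes were omitted.)

```python
from typing import Optional
from pydantic import BaseModel

class User(BaseModel):
    # Fields that DynamoDB may omit get defaults, so a partial record
    # still deserializes into a predictable object for the client.
    id: str
    email: str
    active: bool = True
    nickname: Optional[str] = None
```

Usage: `User(**item_from_dynamodb)` validates the types and fills in the missing fields with the defaults.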
227 00:28:08.590 --> 00:28:16.620 Nikita Barysheva: And if there is a function that said user cache that sets a specific key value
228 00:28:16.900 --> 00:28:21.400 Nikita Barysheva: into the into the readies with the like expiration time.
229 00:28:21.560 --> 00:28:35.010 Nikita Barysheva: And so you can find the and set exception log error, that every time that something falls down. You will see it later. We just throw a properly log. Sorry I need to move the bar.
230 00:28:36.690 --> 00:28:41.280 Nikita Barysheva: Yeah, that will give you some details.
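The model-plus-cache-writer pattern described here can be sketched as follows. This is an illustrative sketch, not the talk's code: a dataclass stands in for the Pydantic model, and the cache client is injected (redis-py exposes the same `setex(key, ttl, value)` call), so the snippet is self-contained. All names (`User`, `set_user_cache`) are invented for the example.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class User:
    # Stand-in for the Pydantic model; attributes absent in DynamoDB default to None.
    user_id: str
    email: str
    first_name: Optional[str] = None
    last_name: Optional[str] = None

def set_user_cache(cache, user: User, ttl_seconds: int = 300) -> bool:
    """Store the serialized user in the cache with an expiration time.

    `cache` is any client with a redis-style setex(key, ttl, value) method.
    Failures are logged instead of raised, so a cache outage never breaks a request.
    """
    try:
        cache.setex(f"user:{user.user_id}", ttl_seconds, json.dumps(asdict(user)))
        return True
    except Exception:
        logger.exception("failed to cache user %s", user.user_id)
        return False
```

With a real redis-py client this would be called as `set_user_cache(redis.Redis(), user)`; logging the exception instead of re-raising matches the point in the talk that a cache failure should only show up in the logs.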
231 00:28:42.850 --> 00:28:49.040 Nikita Barysheva: I don't know why, but I also saw a lot of examples of people trying to avoid it in logs,
232 00:28:49.220 --> 00:29:04.850 Nikita Barysheva: or forgetting any logs at all. I think it's a must: you have to have it written somewhere, what was the error. And what I mentioned before, the request ID.
233 00:29:04.950 --> 00:29:19.059 Nikita Barysheva: It's very important if you have many services or clients that work with the users database, and everything goes through the service as a main point. For all these CRUD operations you need to know who
234 00:29:19.410 --> 00:29:28.760 Nikita Barysheva: made the request and when. It's very good for analytics, you can use it for graphs later on, you can use it to debug things.
235 00:29:29.780 --> 00:29:34.309 Nikita Barysheva: Only positive things from logs, that's what I see.
236 00:29:34.890 --> 00:29:40.377 Nikita Barysheva: And also, for sure, it can increase the costs to some degree. But
237 00:29:41.080 --> 00:29:43.910 Nikita Barysheva: you still have to find a balance.
238 00:29:46.030 --> 00:29:48.930 Nikita Barysheva: Second, yep.
239 00:29:49.290 --> 00:30:12.569 Nikita Barysheva: This is one of the simple examples of how you can get the user: you can get it by ID, by email, you can specify which fields to project. It's like SELECT in Postgres: when you do SELECT email, first_name, last_name, it will return you only these fields. The same you can do in DynamoDB,
240 00:30:13.270 --> 00:30:17.819 Nikita Barysheva: and it's basically called projection attributes.
241 00:30:18.010 --> 00:30:26.010 Nikita Barysheva: Also, we're checking if there is an ID or email provided,
242 00:30:26.480 --> 00:30:28.410 Nikita Barysheva: because it's a very logical thing:
243 00:30:28.580 --> 00:30:34.219 Nikita Barysheva: if nothing of this is provided, you don't look for the user.
244 00:30:34.380 --> 00:30:41.199 Nikita Barysheva: Unless maybe you have a route that can do a conditional,
245 00:30:41.747 --> 00:30:49.370 Nikita Barysheva: a conditional search. But here I decided to show the basic one. By conditional I mean, for example,
246 00:30:49.570 --> 00:30:55.500 Nikita Barysheva: in SQL, when you look for the user
247 00:30:55.660 --> 00:30:59.459 Nikita Barysheva: with the name Nikita, all the users with the name Nikita,
248 00:30:59.660 --> 00:31:01.950 Nikita Barysheva: so you don't have an ID or email.
249 00:31:02.150 --> 00:31:07.829 Nikita Barysheva: But anyway, you can do it, the same thing over here. I just didn't include it over here.
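The lookup being described could be sketched like this, with invented names: fetch by ID or email, optionally projecting fields via DynamoDB's ProjectionExpression (the SELECT-column-list analogue). The table client is injected so the sketch is self-contained; with boto3 it would be a `dynamodb.Table` resource, and a real lookup by email would normally go through a secondary index rather than `get_item`.

```python
from typing import Optional, Sequence

def get_user(table, user_id: Optional[str] = None,
             email: Optional[str] = None,
             fields: Optional[Sequence[str]] = None) -> dict:
    """Fetch one user by ID or email; `fields` limits which attributes come back."""
    if not user_id and not email:
        # Same guard as in the talk: with neither key there is nothing to look up.
        raise ValueError("either user_id or email must be provided")
    # Note: in real DynamoDB, get_item only works on the table's primary key;
    # a lookup by email would typically be a Query against a secondary index.
    key = {"user_id": user_id} if user_id else {"email": email}
    kwargs = {"Key": key}
    if fields:
        kwargs["ProjectionExpression"] = ", ".join(fields)
    response = table.get_item(**kwargs)
    return response.get("Item", {})  # empty dict when no user matched
```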
250 00:31:12.668 --> 00:31:18.590 Nikita Barysheva: This is the actual process. So before going to DynamoDB,
251 00:31:18.790 --> 00:31:20.669 Nikita Barysheva: we are checking the Redis cache.
252 00:31:20.790 --> 00:31:27.560 Nikita Barysheva: So we're checking the cache, and if there is nothing in the cache, we proceed to the actual
253 00:31:28.228 --> 00:31:30.920 Nikita Barysheva: lookup in the table.
254 00:31:31.400 --> 00:31:33.129 Nikita Barysheva: So here,
255 00:31:33.270 --> 00:31:42.380 Nikita Barysheva: users_table is, let's say, an object initialized before, that provides the function get_item,
256 00:31:42.770 --> 00:31:45.170 Nikita Barysheva: get item by key,
257 00:31:45.310 --> 00:31:53.150 Nikita Barysheva: and if we provide the projection attributes, we say: please return us some specific fields.
258 00:31:53.320 --> 00:31:59.249 Nikita Barysheva: Or we search by email. Once again, you can optimize this code as you wish. But
259 00:32:00.060 --> 00:32:08.269 Nikita Barysheva: again, this is just an example. And then, at the end of the day, after we found the user, if we found the user, we cache it.
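The read path just described is the classic cache-aside pattern. A self-contained sketch, with injected stand-ins for the Redis client and the DynamoDB table (names are illustrative, not the talk's code):

```python
import json

def get_user_cached(cache, table, user_id: str, ttl_seconds: int = 300) -> dict:
    """Check the cache first; only on a miss go to the table, then cache the find."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: the database is never touched
    item = table.get_item(Key={"user_id": user_id}).get("Item", {})
    if item:                               # only cache real finds, not misses
        cache.setex(key, ttl_seconds, json.dumps(item))
    return item
```

With real clients, `cache` would be `redis.Redis()` and `table` a boto3 `dynamodb.Table`; the point of the pattern is that every repeated read within the TTL costs a Redis lookup instead of a DynamoDB read unit.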
260 00:32:12.675 --> 00:32:16.210 Nikita Barysheva: In this specific example I wanted
261 00:32:17.120 --> 00:32:23.600 Nikita Barysheva: to mention that we potentially may have 3 types of successful response over here.
262 00:32:25.440 --> 00:32:39.880 Nikita Barysheva: We may end up in a situation when we didn't find any users according to the condition that was provided, the ID or email, and we return an empty object here.
263 00:32:40.270 --> 00:32:58.629 Nikita Barysheva: Or, if we provided a projection attribute, we return the data as we received it from DynamoDB. Or, if we did find the user and didn't provide any projection attributes, here we want to serialize it. As I said, we have a model,
264 00:32:58.750 --> 00:33:01.120 Nikita Barysheva: and we want to serialize the user.
265 00:33:01.310 --> 00:33:11.326 Nikita Barysheva: All the fields that we have inside the user will be passed as they are, like a key and its value, and those that are not present will get,
266 00:33:12.490 --> 00:33:18.410 Nikita Barysheva: will get default values. Normally you put them as None, because
267 00:33:18.750 --> 00:33:24.750 Nikita Barysheva: there is nothing. There is no reason to put something not relevant there.
268 00:33:25.500 --> 00:33:28.119 Nikita Barysheva: It depends on the business logic. But yeah.
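The three successful response shapes can be sketched like this (illustrative names; a dataclass stands in for the Pydantic model, and fields missing from the DynamoDB item fall back to None defaults):

```python
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class UserModel:
    user_id: str
    email: Optional[str] = None
    first_name: Optional[str] = None   # absent in DynamoDB -> defaults to None

def shape_response(item: dict, projected: bool) -> dict:
    if not item:
        return {}                       # case 1: no user matched -> empty object
    if projected:
        return item                     # case 2: return exactly the projected fields
    return asdict(UserModel(**item))    # case 3: full item, predictable shape
```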
269 00:33:30.190 --> 00:33:34.359 Nikita Barysheva: And another thing that might be,
270 00:33:34.780 --> 00:33:38.750 Nikita Barysheva: that is also important, is error handling.
271 00:33:39.500 --> 00:33:47.180 Nikita Barysheva: We have different types of error handling. Please don't forget about it, please use it, and even though it may look overwhelming,
272 00:33:47.320 --> 00:33:53.660 Nikita Barysheva: I find it sometimes much better to have it rather than avoiding it, and then
273 00:33:53.850 --> 00:34:02.910 Nikita Barysheva: something is crashing, and everyone is trying to understand what was the situation. You can handle everything with a general exception, but
274 00:34:04.130 --> 00:34:09.680 Nikita Barysheva: if you are provided with the tools, why not use them? That's my idea.
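One way to sketch the idea of distinct error types instead of a single catch-all: each failure mode gets its own exception, mapped to a status the client can act on. The exception names here are invented for illustration; with boto3 you would typically branch on botocore's `ClientError` error codes.

```python
class UserServiceError(Exception):
    status = 500

class UserNotFound(UserServiceError):
    status = 404

class BadRequest(UserServiceError):
    status = 400

class UpstreamThrottled(UserServiceError):
    status = 429

def to_http(error: Exception) -> tuple[int, str]:
    """Translate a raised error into (status, message) for the response."""
    if isinstance(error, UserServiceError):
        return error.status, str(error) or error.__class__.__name__
    return 500, "internal error"   # the general catch-all still exists, but last
```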
275 00:34:13.100 --> 00:34:18.300 Nikita Barysheva: I just wanted to sum up what the service is
276 00:34:18.480 --> 00:34:28.170 Nikita Barysheva: intended for, and there are 2 things that you see here that I didn't mention before, but they're very important for services like that.
277 00:34:28.550 --> 00:34:40.309 Nikita Barysheva: So the service idea is handling DynamoDB requests, all the CRUD operations. It also should be able to cache things,
278 00:34:40.730 --> 00:34:43.080 Nikita Barysheva: to avoid, like, if
279 00:34:43.239 --> 00:34:50.190 Nikita Barysheva: the request already got to the service for some reason, but
280 00:34:50.460 --> 00:34:57.029 Nikita Barysheva: you have the cached values, then even if the service got the request, there is no reason
281 00:34:57.250 --> 00:34:59.759 Nikita Barysheva: to bother DynamoDB, because
282 00:34:59.940 --> 00:35:07.190 Nikita Barysheva: after all, it's another cent. It doesn't sound like much, it's not another dollar, let's call it so.
283 00:35:07.360 --> 00:35:18.809 Nikita Barysheva: But if you think about a very big scale, when you have millions, tens of millions of users, if something isn't covered, it could cost you a lot. So
284 00:35:19.310 --> 00:35:23.910 Nikita Barysheva: I would prefer using caching
285 00:35:24.760 --> 00:35:27.969 Nikita Barysheva: rather than avoiding it,
286 00:35:28.400 --> 00:35:30.560 Nikita Barysheva: to save some money over here.
287 00:35:31.805 --> 00:35:37.329 Nikita Barysheva: We need to have a proper error handler. That's why I mentioned 4 of them,
288 00:35:37.730 --> 00:35:42.989 Nikita Barysheva: and maybe someone won't like it, but I did it. And the 2 things that I didn't mention here:
289 00:35:43.410 --> 00:35:47.720 Nikita Barysheva: you need to have a throttling mechanism and a rate limiter, actually.
290 00:35:48.970 --> 00:35:53.220 Nikita Barysheva: It depends, I mean, it could be done on the,
291 00:35:53.900 --> 00:36:02.979 Nikita Barysheva: it should be done also on the service side, because the service is, let's say, a standalone thing. But you also may think about
292 00:36:03.150 --> 00:36:05.360 Nikita Barysheva: throttling on the clients. So I mean,
293 00:36:06.180 --> 00:36:16.179 Nikita Barysheva: throttling for sure, and a rate limiter, since we're talking about the service already. These 2 things are very important to have in services like that.
294 00:36:16.340 --> 00:36:18.929 Nikita Barysheva: So don't forget to cover it.
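A token bucket is one common way to implement the rate-limiting idea mentioned here. A stdlib-only sketch, purely illustrative: in a real multi-instance service the bucket state would live in Redis so all instances share it.

```python
import time

class TokenBucket:
    """Each client gets `capacity` tokens that refill at `rate` per second;
    a request is rejected when its bucket is empty."""

    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate              # tokens added back per second
        self.tokens = capacity
        self.clock = clock            # injectable for testing
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # caller should answer 429 Too Many Requests
```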
295 00:36:19.090 --> 00:36:25.790 Nikita Barysheva: And then, I think, that's actually it.
296 00:36:25.990 --> 00:36:26.790 Nikita Barysheva: Hey?
297 00:36:27.190 --> 00:36:29.089 Nikita Barysheva: Yeah, that's it.
298 00:36:29.270 --> 00:36:32.549 Nikita Barysheva: And I think you did it faster, you know
299 00:36:33.410 --> 00:36:39.910 Nikita Barysheva: If you have any questions, just let me know, I would be happy to answer them. And if not,
300 00:36:40.100 --> 00:36:44.509 Nikita Barysheva: thanks for listening. I hope it was interesting to you, and
301 00:36:44.830 --> 00:36:47.499 Nikita Barysheva: it will give you some ideas, maybe, or
302 00:36:47.940 --> 00:36:50.580 Nikita Barysheva: you will decide to do something similar to this.
303 00:36:52.160 --> 00:36:53.080 Nikita Barysheva: Just let me know.
304 00:36:54.640 --> 00:36:55.580 Gabor Szabo: Well.
305 00:36:55.890 --> 00:37:05.981 Gabor Szabo: Thank you for the presentation. If anyone has any more questions, now would be a good time to ask them. If not, then,
306 00:37:08.180 --> 00:37:16.060 Gabor Szabo: thank you very much for giving this presentation and for being here, and to those who were watching it live:
307 00:37:16.190 --> 00:37:23.169 Gabor Szabo: now you have the chance. I'm telling it also to the viewers on the Youtube channel: those people who are here
308 00:37:23.820 --> 00:37:33.870 Gabor Szabo: in the live meeting can stay on, and after we stop the recording we can open the mics, and then we can have a conversation,
309 00:37:33.990 --> 00:37:51.040 Gabor Szabo: asking all kinds of other questions that you might not have wanted to ask on the video. So anyway, thank you for being here, thank you for watching, and don't forget to like the video and follow the channel, and see you next time, and
310 00:37:51.220 --> 00:37:55.589 Gabor Szabo: join the Meetup group, if you're not there yet. And thank you.
311 00:37:56.960 --> 00:37:58.029 Nikita Barysheva: Thank you. Everyone.
Discover how to quickly turn your Python scripts into interactive web apps using Streamlit. This session will cover key features like visualisations, widgets, and deployment, empowering you to create user-friendly interfaces with minimal effort.

1 00:00:02.390 --> 00:00:05.820 Gabor Szabo: So hello and welcome to the Code Maven Channel.
2 00:00:05.960 --> 00:00:14.180 Gabor Szabo: My name is Gabor. I organize these events because I think it's very important for people to be able to share their knowledge.
3 00:00:14.410 --> 00:00:38.479 Gabor Szabo: and it's very useful for everyone else to learn from other people all around the world. I myself usually teach Python and Rust, and help companies introduce testing in these 2 languages, or introduce these languages. And that's it. Basically, this channel now mostly has these videos from these meetings,
4 00:00:38.860 --> 00:01:07.330 Gabor Szabo: and I am really happy that you agreed to give this presentation in our meeting, and thank you everyone for joining us here. If you are in the Zoom meeting, then feel free to ask questions; just remember that it's going to be on Youtube. If you're watching it on Youtube and you enjoy this video, then please like the video and follow the channel, and later on we'll have the links below the video,
5 00:01:07.380 --> 00:01:14.970 Gabor Szabo: where you can contact Leah as well, if you are interested. So now it's your turn. Go ahead.
6 00:01:15.950 --> 00:01:16.850 Gabor Szabo: Welcome now.
7 00:01:21.690 --> 00:01:30.810 Leah Levy: So hopefully you can see my screen. So my name is Leah. I'm currently living in the UK; I'm a data scientist in the UK.
8 00:01:30.810 --> 00:01:41.190 Gabor Szabo: Maybe it's only just me, but I can see the whole list of the people who have joined. Is it on your screen, or is it just mine? No, I think you're sharing that one.
9 00:01:43.930 --> 00:01:44.700 Leah Levy: Yeah.
10 00:01:49.860 --> 00:01:53.950 Gabor Szabo: Wait a second. Maybe it's my, it's mine. No view.
11 00:01:54.720 --> 00:01:56.819 Gabor Szabo: Yeah, no, it was mine. Sorry.
12 00:01:58.430 --> 00:02:00.800 Gabor Szabo: Sorry, confusing you. Okay.
13 00:02:02.020 --> 00:02:02.550 Leah Levy: It's okay.
14 00:02:02.550 --> 00:02:06.490 Gabor Szabo: Go ahead. No, no, it's okay. It was on my screen in the.
15 00:02:11.030 --> 00:02:22.899 Leah Levy: Yeah. So I'm a data scientist for the UK government. I'm currently living in England. I'm hoping to move to Israel soon, so it'll be nice to meet everybody.
16 00:02:23.611 --> 00:02:42.418 Leah Levy: I'm gonna talk today about Streamlit, which is a Python library, and how I use it to deploy machine learning models and just build web apps. I'll put my contact details in the chat, if you wanna connect with me on Linkedin or follow me on Github.
17 00:02:43.200 --> 00:02:45.630 Leah Levy: It'd be great to connect,
18 00:02:46.694 --> 00:02:55.610 Leah Levy: and please feel free to ask questions as we go along. I can see the chat, so if you want to put messages in the chat or come off mute, whatever you prefer.
19 00:02:56.760 --> 00:03:02.790 Leah Levy: So Streamlit is a Python library. It's open source.
20 00:03:02.790 --> 00:03:10.029 Gabor Szabo: Sorry, sorry, just one note: right now we can see both you and the slides,
21 00:03:10.300 --> 00:03:11.630 Gabor Szabo: and.
22 00:03:11.900 --> 00:03:12.880 Leah Levy: Oh, okay.
23 00:03:12.880 --> 00:03:22.350 Gabor Szabo: So maybe you want to turn off your camera, or just show the slides, because in the recording you will be seen anyway, probably at the top right corner.
24 00:03:23.000 --> 00:03:25.769 Gabor Szabo: that now you can. I can see myself.
25 00:03:26.950 --> 00:03:28.720 Leah Levy: I'll share again. Hold on.
26 00:03:29.060 --> 00:03:29.850 Gabor Szabo: Okay.
27 00:03:36.170 --> 00:03:40.759 Leah Levy: Oh, yeah, it was on a strange setting. I think I was messing around with the settings before.
28 00:03:40.960 --> 00:03:41.710 Leah Levy: Okay.
29 00:03:46.550 --> 00:03:47.979 Gabor Szabo: Oh, now it's good!
30 00:03:48.590 --> 00:03:49.296 Leah Levy: Yeah, okay.
31 00:03:50.100 --> 00:03:50.510 Gabor Szabo: Okay.
32 00:03:51.450 --> 00:03:56.100 Leah Levy: Thanks for letting me know. So you can see just like there's the slideshow.
33 00:03:57.130 --> 00:03:57.710 Gabor Szabo: Yeah.
34 00:03:58.130 --> 00:03:58.790 Leah Levy: Yeah,
35 00:04:02.210 --> 00:04:22.959 Leah Levy: So, how many of you have perhaps worked on a data science project? You've built a machine learning model, and you've wished you could deploy it quickly for others to use. Or perhaps you've built a web application, but front-end development isn't really your expertise; it's too complicated. So this is where Streamlit really comes into its own.
36 00:04:23.270 --> 00:04:41.560 Leah Levy: It makes it easy for Python developers and data scientists to create beautiful interactive web apps without needing any front-end development expertise. So it's lightweight, it's really easy to use, and it doesn't require hundreds of lines of code.
37 00:04:41.620 --> 00:04:56.920 Leah Levy: And there's a really strong community online, so there's people building add-ons constantly, and there's also a strong community of people happy to answer questions and help if you have any issues.
38 00:05:01.950 --> 00:05:10.450 Leah Levy: So Streamlit allows you to turn your Python scripts into interactive web applications in just a few lines of code. So you don't need to know any
39 00:05:10.620 --> 00:05:19.650 Leah Levy: traditional web frameworks like Flask or Django; you don't need any HTML, CSS, or JavaScript. It's all Python.
40 00:05:20.640 --> 00:05:32.522 Leah Levy: You can easily customize your web application using sliders, buttons, checkboxes, making it interactive, and you're able to capture user input too.
41 00:05:34.180 --> 00:05:53.920 Leah Levy: The app automatically updates: when you're coding in whatever IDE you prefer, like Visual Studio Code, as soon as you update the code and save it, it updates in the actual application. I'll do a demo of it a bit later, so you can see exactly what I mean.
42 00:05:55.020 --> 00:06:00.160 Leah Levy: And that just makes development much faster, so you can see your changes as you go along.
43 00:06:00.400 --> 00:06:05.189 Leah Levy: And it works really well with other popular Python libraries like
44 00:06:05.370 --> 00:06:11.689 Leah Levy: NumPy, pandas, Plotly, even data science ones like TensorFlow and scikit-learn.
45 00:06:11.900 --> 00:06:18.939 Leah Levy: So it enables you to visualize data: you can build dashboards, graphs, charts, and also
46 00:06:19.470 --> 00:06:23.439 Leah Levy: integrate machine learning models directly into your application.
47 00:06:26.840 --> 00:06:51.719 Leah Levy: So, a bit about deploying machine learning models. Often in data science you put a lot of work into creating a model: you've got your data, you've cleaned it, you've built a model, you've tested it, optimized it, you've evaluated the performance. But the real key is to surface that to your end users or your clients,
48 00:06:52.170 --> 00:07:01.079 Leah Levy: and using Streamlit makes that easy. It's quite a user-friendly interface, and it can handle resource-intensive tasks.
49 00:07:01.690 --> 00:07:03.910 Leah Levy: And it's easy to deploy as well.
50 00:07:04.050 --> 00:07:14.830 Leah Levy: A basic workflow could be something like loading a pre-trained model from a pickle file, or something from Hugging Face or TensorFlow,
51 00:07:16.480 --> 00:07:32.710 Leah Levy: then collecting input from users: they could enter some text, if it's like a chatbot, or they could use some sliders. Then it uses the machine learning model to make predictions and displays the results to users.
52 00:07:33.150 --> 00:07:39.510 Leah Levy: So I've created a couple of examples of
53 00:07:39.610 --> 00:07:46.359 Leah Levy: what it can do, just kind of basic: one's a dashboard and one uses a pre-trained machine learning model.
54 00:07:50.010 --> 00:07:59.530 Leah Levy: I've taken some screenshots, but I think it'd be better to just show it live. So I'm just gonna have a go at showing it. Can you see this?
55 00:08:00.660 --> 00:08:01.400 Gabor Szabo: Then like.
56 00:08:03.980 --> 00:08:06.170 Leah Levy: Because, yeah, the code.
57 00:08:06.620 --> 00:08:07.280 Gabor Szabo: Yes.
58 00:08:09.100 --> 00:08:14.939 Leah Levy: So I've just pre pre-built like this very basic dashboard.
59 00:08:15.070 --> 00:08:17.750 Leah Levy: What it does is
60 00:08:18.230 --> 00:08:24.410 Leah Levy: I've got some dummy data about British culture. I thought I'd make it relevant to me,
61 00:08:25.030 --> 00:08:27.209 Leah Levy: and I've just put it into a.
62 00:08:27.210 --> 00:08:29.650 Gabor Szabo: I'm saying, maybe you can enlarge the fonts a little bit.
63 00:08:32.220 --> 00:08:32.970 Leah Levy: Yeah, let me.
64 00:08:33.276 --> 00:08:33.889 Gabor Szabo: Yeah. Thanks.
65 00:08:35.520 --> 00:08:36.020 Gabor Szabo: Think so.
66 00:08:38.250 --> 00:08:38.909 Gabor Szabo: Noon.
67 00:08:42.010 --> 00:08:43.020 Gabor Szabo: Okay, well.
68 00:08:43.020 --> 00:08:43.420 Leah Levy: Oh, 2.
69 00:08:43.429 --> 00:08:47.150 Gabor Szabo: Yeah, yeah, no, it's good. I see.
70 00:08:48.430 --> 00:08:49.330 Leah Levy: Pardon.
71 00:08:50.920 --> 00:08:51.859 Gabor Szabo: I think it's fine now.
72 00:08:52.500 --> 00:08:53.440 Leah Levy: Okay?
73 00:08:54.357 --> 00:09:02.049 Leah Levy: So in the terminal I just use the command streamlit run. So I do streamlit
74 00:09:02.210 --> 00:09:06.640 Leah Levy: run, and then the name of your file.
75 00:09:06.830 --> 00:09:13.420 Leah Levy: In this case it's in the app folder, and it's called English chat, Hi.
76 00:09:16.530 --> 00:09:21.744 Leah Levy: and it takes a couple of seconds, and it should pop up in your browser.
77 00:09:23.780 --> 00:09:28.030 Leah Levy: So here you have your Streamlit app in your browser; it's popped up here,
78 00:09:29.430 --> 00:09:37.429 Leah Levy: and here's the very basic app that I built. In the top right-hand corner you see it running,
79 00:09:38.360 --> 00:09:42.490 Leah Levy: and then there's an option here to deploy, if you're ready to deploy it.
80 00:09:43.405 --> 00:09:45.339 Leah Levy: Oh, what's this?
81 00:09:48.310 --> 00:09:49.360 Leah Levy: Okay?
82 00:09:56.720 --> 00:10:03.289 Leah Levy: If this doesn't work, I will just show you the screenshot instead.
83 00:10:03.890 --> 00:10:05.980 Leah Levy: Okay, so I've saved it here.
84 00:10:06.750 --> 00:10:10.696 Leah Levy: And you'll see an example now, actually, of
85 00:10:12.450 --> 00:10:23.590 Leah Levy: of how it updates in real time. So I've updated the file, the source file. And you see in the top right hand corner. Now there's an option I'll just zoom in and make it a bit bigger.
86 00:10:25.070 --> 00:10:28.779 Leah Levy: but it says "Source file changed", and it gives you the option to rerun,
87 00:10:29.161 --> 00:10:33.799 Leah Levy: and you can click "Always rerun", so I don't have to click that every time. So if I try that,
88 00:10:34.150 --> 00:10:38.630 Leah Levy: and it's worked now. So this is just like a
89 00:10:39.100 --> 00:10:46.520 Leah Levy: basic application. There's a dropdown menu here, so you can select the category. If I wanted to just see landmarks, see that;
90 00:10:46.830 --> 00:10:50.740 Leah Levy: for some reason it's giving me an error on sports.
91 00:10:54.010 --> 00:11:03.280 Leah Levy: And the size of each bubble is the number of visitors per year, and you can hover over, and it gives you a little bit more information. And then if,
92 00:11:04.900 --> 00:11:12.589 Leah Levy: yeah, I think the map plot is a little bit broken at the bottom. So that's one example. The next
93 00:11:13.270 --> 00:11:22.030 Leah Levy: application, let me just cancel this, I'll just do Ctrl-C, let's run another,
94 00:11:23.102 --> 00:11:30.350 Leah Levy: another, this is more of like a machine learning one. So I just run streamlit run and
95 00:11:31.820 --> 00:11:32.980 Leah Levy: spell check.
96 00:11:50.550 --> 00:11:54.489 Leah Levy: Oh, I know why it's giving me an error: because I haven't installed the packages.
97 00:12:09.660 --> 00:12:13.009 Leah Levy: I'm actually just using the Poetry library, which,
98 00:12:13.200 --> 00:12:34.820 Leah Levy: I'm not sure how widely it's used, but it's a third-party tool, it's not inbuilt. Typically you might manage your dependencies using a requirements.txt file and then create a virtual environment. But I'm just,
99 00:12:35.420 --> 00:12:45.569 Leah Levy: I've got used to using Poetry, which is another dependency manager. And that's just
100 00:12:46.130 --> 00:12:48.370 Leah Levy: just to clarify exactly what it is.
101 00:12:51.940 --> 00:12:58.580 Leah Levy: Yeah, that's not working. So let me just show you on the on the slide show.
102 00:12:59.580 --> 00:13:00.750 Leah Levy: Sorry?
103 00:13:09.314 --> 00:13:18.655 Leah Levy: What this is: it imports TextBlob, which is a very lightweight natural language processing library,
104 00:13:20.020 --> 00:13:26.119 Leah Levy: and what happens is you put in your spelling, so you put in some text. In this case,
105 00:13:26.530 --> 00:13:35.059 Leah Levy: I'm so bad at spelling, spelled really wrong, and then it returns the correct spelling. And then in the top right you can see it's very kind of
106 00:13:35.320 --> 00:13:47.810 Leah Levy: simple. There's only like 16 lines of code; it's quite lightweight. And I've put a link here to more community projects you can see on the Streamlit website.
107 00:13:48.440 --> 00:13:50.030 Leah Levy: they've actually got
108 00:13:51.100 --> 00:13:59.750 Leah Levy: community projects, so you can kind of get an idea, a flavor, of exactly what's possible. So this one's quite cool; this is like a map
109 00:14:00.445 --> 00:14:06.500 Leah Levy: application that somebody's built, that's called Pretty Map, where you kind of visualize
110 00:14:07.361 --> 00:14:11.959 Leah Levy: maps in different cool ways.
111 00:14:13.051 --> 00:14:22.290 Leah Levy: But just so you can get an idea: it's quite personalizable. All the applications don't necessarily have to look the same.
112 00:14:38.920 --> 00:14:40.470 Leah Levy: Sorry gone too far.
113 00:14:45.890 --> 00:14:53.241 Leah Levy: Okay, so I wanted to talk about deployment. As I mentioned, there's different options to deploy.
114 00:14:54.210 --> 00:14:59.230 Leah Levy: Just gonna wait for the slides to kind of sync.
115 00:15:07.560 --> 00:15:08.619 Leah Levy: Not sure.
116 00:15:09.560 --> 00:15:10.799 Leah Levy: Okay, there we go.
117 00:15:13.880 --> 00:15:27.930 Leah Levy: There's a couple of different options. You could deploy locally, which is kind of what we've done just before with the streamlit run, but in most cases you want to deploy it to a cloud or servers.
118 00:15:28.370 --> 00:15:31.159 Leah Levy: So Streamlit has its own kind of
119 00:15:31.370 --> 00:15:39.799 Leah Levy: built-in, customized deployment option called the Streamlit Community Cloud, where you can deploy straight from GitHub.
120 00:15:40.551 --> 00:15:46.568 Leah Levy: But it also supports other deployment options like Docker, AWS,
121 00:15:48.475 --> 00:15:53.880 Leah Levy: and all these other options. Another benefit of the Community Cloud is
122 00:15:54.720 --> 00:16:12.700 Leah Levy: that it provides you with analytics data: how many people have clicked onto your dashboard, total viewers, most recent viewers, timestamps of people's last visit. So you can get an idea of when people have used your application.
123 00:16:14.520 --> 00:16:18.800 Leah Levy: So I want to talk about the testing framework in the app.
124 00:16:18.910 --> 00:16:21.500 Leah Levy: This is something.
125 00:16:22.090 --> 00:16:35.319 Leah Levy: Last time I gave this talk, at PyWeb in Tel Aviv, someone asked me about testing, and I thought, oh yeah, I've not really used the testing framework. So I thought I'd put a section in here to show you kind of how I've done it.
126 00:16:36.415 --> 00:16:58.584 Leah Levy: So you can use pytest and those usual kind of testing frameworks, and Streamlit has its own framework, which enables developers to build and run headless tests that execute the app code directly. It simulates user input and inspects the output for correctness.
127 00:16:59.090 --> 00:17:07.560 Leah Levy: For those who don't know, headless testing is a way to run automated browser tests without having the user interface.
128 00:17:08.027 --> 00:17:13.299 Leah Levy: So it's a more efficient way of testing the application, because it doesn't need to render the HTML.
129 00:17:13.569 --> 00:17:27.959 Leah Levy: It just sends requests to the server the same way you would do in a browser, and it's much faster because you don't need to wait for a page to load, and it integrates well into any CI/CD pipelines you might have as well.
130 00:17:29.670 --> 00:17:47.450 Leah Levy: So, an example of testing. On the left-hand side I've written what might be a more traditional way to write a test. So you would import streamlit and also import textblob, which is the library I mentioned before that we used for the spell checker.
131 00:17:47.660 --> 00:17:49.590 Leah Levy: You kind of set up,
132 00:17:50.100 --> 00:17:57.630 Leah Levy: set up the app just as it appears, to kind of mirror what you've written,
133 00:17:58.258 --> 00:18:07.070 Leah Levy: and have some simulated user input, and then load the TextBlob, and then run the,
134 00:18:07.520 --> 00:18:15.440 Leah Levy: run the TextBlob library to generate the correct spelling, and then have an assert to ensure that
135 00:18:15.740 --> 00:18:23.610 Leah Levy: the output is what you've expected: it should be the corrected spelling of what you've inputted.
136 00:18:24.489 --> 00:18:32.130 Leah Levy: But on the right, all you need to do is use the Streamlit testing framework
137 00:18:32.250 --> 00:18:45.980 Leah Levy: with AppTest. AppTest is what simulates the running of the app, and it provides different methods to set up, manipulate, and inspect the app via the API instead of doing it in the browser.
138 00:18:49.370 --> 00:18:57.074 Leah Levy: And then I've just written a function to test the spelling. So you've got AppTest, which runs the,
139 00:18:57.710 --> 00:19:03.239 Leah Levy: which runs the application as if I was running it in the terminal.
140 00:19:03.950 --> 00:19:09.750 Leah Levy: I simulate an input of the incorrect spelling and run that,
141 00:19:10.520 --> 00:19:16.360 Leah Levy: and then assert that the corrected text equals the correct spelling.
142 00:19:17.358 --> 00:19:25.871 Leah Levy: And then I've just written a couple of other tests. This next function just asserts that
143 00:19:27.180 --> 00:19:33.809 Leah Levy: the application is running and not producing any exception errors. And then this one tests that the title
144 00:19:33.990 --> 00:19:36.970 Leah Levy: displayed is the correct title, as we've expected.
145 00:19:39.550 --> 00:19:48.459 Leah Levy: So you'll see it's much quicker, it's fewer lines of code, and you can just run it in the terminal using pytest,
146 00:19:48.680 --> 00:19:51.339 Leah Levy: as you would any other tests.
147 00:19:54.660 --> 00:20:03.330 Leah Levy: You can add multiple pages to an app. So you create a new pages folder in the same folder where your application is running,
148 00:20:03.934 --> 00:20:15.910 Leah Levy: and then whatever you name the file is what appears on the sidebar, and you can amend the,
149 00:20:17.040 --> 00:20:23.254 Leah Levy: you can amend the content as you would in any other application. I've put a link in here
150 00:20:24.030 --> 00:20:25.680 Leah Levy: just so you can kind of
151 00:20:27.610 --> 00:20:30.949 Leah Levy: I was gonna show how to
152 00:20:32.609 --> 00:20:36.229 Leah Levy: it gives a good example, rather than me
153 00:20:36.680 --> 00:20:41.279 Leah Levy: setting up lots of different ones. But you can kind of see it. It's got a good
154 00:20:41.750 --> 00:20:44.358 Leah Levy: kind of demo page.
155 00:20:49.446 --> 00:20:53.703 Leah Levy: hey? It's got a hello page. It's got a plotting demo.
156 00:20:54.980 --> 00:20:58.089 Leah Levy: yeah, you can have a look in your own time if you like.
157 00:21:24.610 --> 00:21:27.299 Leah Levy: Sorry. My computer's running super slow.
158 00:21:30.410 --> 00:21:32.449 Gabor Szabo: So I just I was just saying.
159 00:21:33.350 --> 00:21:38.320 Leah Levy: It also supports chat inputs. So, oops.
160 00:21:38.920 --> 00:21:47.796 Leah Levy: So, everybody wants to build their own chatbots nowadays, and it provides support for that,
161 00:21:48.380 --> 00:21:55.700 Leah Levy: where it kind of mimics a user. And it's got like an assistant with these like different emojis
162 00:21:56.242 --> 00:22:02.159 Leah Levy: so as if you were speaking to a person. Similar to kind of.
163 00:22:02.720 --> 00:22:07.300 Leah Levy: you know, like ChatGPT's got an assistant kind of answer.
164 00:22:07.560 --> 00:22:30.921 Leah Levy: You can also stream the reply, you know how ChatGPT kind of streams it, or writes it word by word, instead of just giving you an answer right away, to make it look like somebody's typing. You can add a delay as well, of like a couple of seconds, to make it seem like it's thinking about a reply.
165 00:22:32.280 --> 00:22:52.160 Leah Levy: And different things like that. So this is just an echo bot, which just echoes whatever you type into it. Obviously not using any large language models, but you can use kind of any large language model that you want, and kind of just plug it into a Streamlit dashboard.
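A framework-free sketch of the echo bot and the word-by-word streaming effect described above. In Streamlit the real pieces would be `st.chat_input`, `st.chat_message`, and a generator passed to `st.write_stream`; the function names here are made up for illustration.

```python
import time
from typing import Iterator

def echo(message: str) -> str:
    # The echo bot simply returns whatever the user typed.
    return message

def stream_words(reply: str, delay: float = 0.0) -> Iterator[str]:
    # Yield the reply word by word, optionally pausing between words
    # to mimic the "thinking/typing" effect described in the talk.
    for word in reply.split():
        if delay:
            time.sleep(delay)
        yield word + " "

# Collect the streamed reply back into one string.
streamed = "".join(stream_words(echo("hello there"))).strip()
print(streamed)  # → hello there
```

In a real Streamlit app the generator would be handed to `st.write_stream`, which renders each yielded chunk as it arrives.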
166 00:23:01.040 --> 00:23:04.700 Leah Levy: So finally, just some additional features
167 00:23:05.710 --> 00:23:16.739 Leah Levy: which I've, oops, added some links to. So, as I mentioned before, it's got a whole wide range of different input widgets, and
168 00:23:17.180 --> 00:23:32.760 Leah Levy: I didn't kind of include them all on the dashboard, because I think that this page actually does it in a nicer way. You can see it's got different buttons, check boxes, feedback options, radio buttons.
169 00:23:33.550 --> 00:23:35.240 Leah Levy: sliders.
170 00:23:35.966 --> 00:23:39.269 Leah Levy: Numeric inputs. Yeah, I could just go on, but
171 00:23:40.150 --> 00:23:49.400 Leah Levy: pretty much you know anything you would need to build a nice looking app. It's got another
172 00:23:49.840 --> 00:23:56.568 Leah Levy: another thing is status elements of like progress bars loading
173 00:23:58.890 --> 00:24:03.326 Leah Levy: call out messages, but error boxes I've used before.
174 00:24:04.080 --> 00:24:08.824 Leah Levy: I can't say I've used the balloon ones, but that looks fun
175 00:24:12.470 --> 00:24:20.803 Leah Levy: And it also has integration for interactive maps, as we saw before, like the map application that I showed.
176 00:24:21.340 --> 00:24:27.209 Leah Levy: And you can also build interactive charts with, like, Plotly and other similar libraries.
177 00:24:27.640 --> 00:24:36.139 Leah Levy: You can cache large data sets. So particularly when you're working with machine learning models, you're often dealing with
178 00:24:36.250 --> 00:24:48.150 Leah Levy: really, really large data sets, which you can cache into memory. So rather than reloading a data set each time, it can just store it in memory.
179 00:24:50.161 --> 00:25:12.448 Leah Levy: From a safety point of view, I've just looked at the privacy policy and took this fourth bullet point straight from it, which is: Streamlit cannot see and does not store any information contained inside Streamlit apps, like text, charts, and images. But as general advice, I would say not to expose sensitive data,
180 00:25:13.020 --> 00:25:17.580 Leah Levy: unless, yeah,
181 00:25:18.310 --> 00:25:40.254 Leah Levy: unless it's locked down in a safe, secure environment and you've got full access controls. And ensure your app is also protected from malicious input, like SQL injections, because, you know, any application is susceptible to being hacked. So I guess just
182 00:25:41.480 --> 00:25:48.060 Leah Levy: be wary; this is probably no different with regard to malicious input like that.
183 00:25:52.590 --> 00:25:53.465 Leah Levy: But
184 00:25:54.630 --> 00:26:01.731 Leah Levy: yeah, that's all I prepared for now, but happy to answer questions and go into into more detail on different bits.
185 00:26:03.040 --> 00:26:07.319 Leah Levy: but thank you for your time, and happy to answer any questions.
186 00:26:12.910 --> 00:26:15.524 Gabor Szabo: So thank you for the presentation.
187 00:26:17.190 --> 00:26:25.759 Gabor Szabo: I heard it the second time. I really liked the testing part. I always think about testing, whatever I try to show.
188 00:26:25.890 --> 00:26:26.970 Gabor Szabo: And
189 00:26:27.810 --> 00:26:38.989 Gabor Szabo: if anyone has questions, then please ask. Now we can also, after the recording, after we stop the recording, we can stay around and have a conversation without the recording.
190 00:26:39.240 --> 00:26:45.520 Gabor Szabo: But anyway, it seems that there are no questions now.
191 00:26:46.440 --> 00:26:50.600 Gabor Szabo: So, Leah, thank you very much for this presentation.
192 00:26:50.780 --> 00:26:56.499 Gabor Szabo: If you'd like to add anything more, I mean, I'll have the links below the video.
193 00:26:59.180 --> 00:27:05.545 Gabor Szabo: So thank you for giving this presentation. And thanks, everyone who was attending,
194 00:27:06.420 --> 00:27:11.800 Gabor Szabo: and everyone who was watching. So please remember to like the video and follow the channel, and see you
195 00:27:11.980 --> 00:27:15.530 Gabor Szabo: at one of our upcoming events.
196 00:27:15.960 --> 00:27:16.850 Gabor Szabo: Bye, bye.
197 00:27:18.140 --> 00:27:19.260 Leah Levy: Thanks, bye.
Speaker: Ray Lutz

Daffodil (data frames for optimized data inspection and logical processing) can create data frame instances similar to Pandas, but using conventional Python data types.
This means no conversion to/from the Pandas world, which I have found from testing has a very high overhead. In fact, unless you plan to do at least 30 repetitive column-based operations (like sums, etc.), you should just stay in the Python world and avoid the conversion time, and you win. But for many, time is not of the essence, or they stay in the Pandas world and never need any Python. The syntax is easy to use, and I am extending it to use a SQL database to allow for large table sizes and use of its robust joins, etc. The SQL part is under development and not released yet.
1 00:00:02.370 --> 00:00:06.679 Gabor Szabo: Hello and welcome to the Code Maven meetup group
2 00:00:06.860 --> 00:00:12.580 Gabor Szabo: and Youtube Channel. If you are watching this on Youtube, thank you very much for everyone who joined us.
3 00:00:13.080 --> 00:00:17.649 Gabor Szabo: and especially thanks Ray, for giving this talk.
4 00:00:17.790 --> 00:00:26.829 Gabor Szabo: My name is Gabor Sabo. I usually teach python and rust and help companies introduce these languages or introduce testing in these languages.
5 00:00:27.030 --> 00:00:33.439 Gabor Szabo: And I also organize these meetings because I think it's very important to share knowledge and
6 00:00:33.640 --> 00:00:38.700 Gabor Szabo: the Zoom meetings and online events allow us to
7 00:00:39.040 --> 00:00:47.660 Gabor Szabo: learn from each other, even if we are halfway around the world. And so with that, let me
8 00:00:48.120 --> 00:00:52.799 Gabor Szabo: give the word to you, Ray, and please introduce yourself and just go ahead.
9 00:00:53.030 --> 00:01:15.399 Gabor Szabo: One thing, sorry, just one thing. Those who are here, feel free to ask questions, either in the chat or just speak up. Ray will tell you how it's going to work. Just remember, we're recording this, and it's going to be on YouTube. So if you don't want to be on YouTube, then just write.
10 00:01:15.570 --> 00:01:17.069 Gabor Szabo: So thank you, it's yours.
11 00:01:17.660 --> 00:01:25.920 Ray Lutz: Okay, thank you so much, Gabor. Yes, my name is Ray Lutz. Let me share my screen here so we can get started.
12 00:01:27.660 --> 00:01:36.300 Ray Lutz: I am actually not that long-term of a Python user, you know, only about maybe 5, 6 years.
13 00:01:36.925 --> 00:01:43.150 Ray Lutz: And then I had quite a wealth of experience before that with other languages, including.
14 00:01:43.320 --> 00:01:58.208 Ray Lutz: you know, assembly language, C, you know, Perl, you know, JavaScript, all these other kinds of languages in one form or another, even though I do really like Python. So I did kind of settle on that
15 00:01:59.190 --> 00:02:00.310 Ray Lutz: for now.
16 00:02:00.670 --> 00:02:08.970 Ray Lutz: And so essentially, today, we're going to talk about this package called Daffodil.
17 00:02:09.110 --> 00:02:19.119 Ray Lutz: And it is data frames for optimized data inspection and logical processing. I came up with that later, you know, after we chose the name. But
18 00:02:19.290 --> 00:02:26.149 Ray Lutz: the idea is that you see df a lot. If you use Pandas, you're talking about data frames, df, and
19 00:02:26.300 --> 00:02:35.600 Ray Lutz: so we wanted something kind of like that, and we use daf. So, you know, throughout the code, if you see daf, you know that it's a Daffodil data frame
20 00:02:35.710 --> 00:02:37.769 Ray Lutz: instead of a pandas.
21 00:02:39.390 --> 00:02:43.949 Ray Lutz: And I have a Master's degree, mostly electronics. I did do
22 00:02:44.810 --> 00:02:51.010 Ray Lutz: various medical devices and document processing in my career.
23 00:02:52.170 --> 00:03:05.290 Ray Lutz: Most recently I'm developing AuditEngine, which is a ballot image auditing platform for checking elections, underneath Citizens Oversight, which is a nonprofit organization.
24 00:03:05.940 --> 00:03:11.629 Ray Lutz: Now, why, Daffodil, we already have pandas. So why would we need something new?
25 00:03:11.760 --> 00:03:20.499 Ray Lutz: Well, I needed a two-dimensional data type, sort of a table structure. And so I started using Pandas
26 00:03:21.433 --> 00:03:26.579 Ray Lutz: for almost everything. You know, these 2-dimensional tables are really handy,
27 00:03:26.990 --> 00:03:31.890 Ray Lutz: but it turns out that pandas is mostly designed for numerics and
28 00:03:33.630 --> 00:03:35.880 Ray Lutz: it uses numpy under the hood.
29 00:03:37.400 --> 00:03:46.650 Ray Lutz: and so it's slow, really slow, for row-based operations, and some of them are now not even allowed. So you can't do an append
30 00:03:46.920 --> 00:03:51.099 Ray Lutz: of a Pandas row. Seems like a basic thing you might want to do;
31 00:03:51.290 --> 00:03:58.989 Ray Lutz: that's now not supported at all in Pandas, because they know it's such a disaster.
32 00:03:59.756 --> 00:04:04.090 Ray Lutz: So then you have to go over and use something else if you want to do that sort of thing.
33 00:04:05.070 --> 00:04:06.400 Ray Lutz: and
34 00:04:07.280 --> 00:04:16.359 Ray Lutz: And also apply. They say, don't use apply, and apply is kind of a handy thing, which means you go row by row, and you apply some function to it
35 00:04:16.760 --> 00:04:18.329 Ray Lutz: at each row.
36 00:04:18.519 --> 00:04:22.530 Ray Lutz: And so you can't do that either, they said. We're deprecating all these things.
37 00:04:22.740 --> 00:04:27.689 Ray Lutz: I think you can still do apply. But they say, you know, it's really not recommended at all.
38 00:04:28.470 --> 00:04:29.809 Ray Lutz: And then
39 00:04:31.010 --> 00:04:39.919 Ray Lutz: it turns out also, when we're using files that are in kind of weird formats, Pandas assumes a lot when it reads them in,
40 00:04:40.090 --> 00:04:46.950 Ray Lutz: and you have to jump through a lot of hoops to get it to just read them in as-is, without doing anything,
41 00:04:47.090 --> 00:04:49.320 Ray Lutz: and then convert things as you go.
42 00:04:50.075 --> 00:04:53.209 Ray Lutz: It has some other problems, too, and we'll get into that.
43 00:04:53.360 --> 00:05:14.250 Ray Lutz: So this is when I started looking for another data type, and I had various ones that I started using. And I ended up standardizing on this type of two-dimensional data frame, which is based on a list of lists. I call it a lol; it doesn't mean laughing out loud, it's a list-of-lists type.
44 00:05:14.800 --> 00:05:17.030 Ray Lutz: And so it's a
45 00:05:17.430 --> 00:05:27.109 Ray Lutz: it's a Python list, and in each of these lists you have an additional list, and it's rectangular in form.
46 00:05:27.360 --> 00:05:33.910 Ray Lutz: So every single row is the same length, so it's a rectangular
47 00:05:34.130 --> 00:05:39.620 Ray Lutz: array, but it's not the array type. It's a list of lists. So it's easy to add to.
48 00:05:39.780 --> 00:05:53.780 Ray Lutz: relatively easy to splice and insert rows or columns. You can do a lot of things fairly easily; inserting rows is easy, columns not quite so easy. But
49 00:05:55.260 --> 00:05:58.310 Ray Lutz: it's fairly malleable. And then
50 00:05:58.450 --> 00:06:07.459 Ray Lutz: also, you can put anything at all in any one of these cells, and Python will handle it just fine, so you could put a whole Pandas array in here if you want.
51 00:06:07.910 --> 00:06:13.879 Ray Lutz: you could put a whole numpy array of a million things in one cell if you want. Okay, so that's
52 00:06:14.000 --> 00:06:15.800 Ray Lutz: it's very versatile that way.
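The list-of-lists ("lol") structure Ray describes can be sketched in a few lines of plain Python (the values are made up for illustration):

```python
# A "lol" (list of lists): plain Python rows, rectangular in shape.
lol = [
    [1, 2, 3],
    [4, 5, 6],
    [7, "anything", [10, 20, 30]],  # a cell can hold any Python object
]

# Every row has the same length, so it behaves like a table.
assert all(len(row) == len(lol[0]) for row in lol)

# Rows are cheap to append or insert, unlike a NumPy-backed frame.
lol.append([8, 9, 10])
lol.insert(1, [0, 0, 0])
print(len(lol))  # → 5
```

Because each row is an ordinary Python list, appends and inserts are cheap list operations rather than whole-array copies.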
53 00:06:16.280 --> 00:06:27.199 Ray Lutz: So the basic thing is that you have this array, which is just numbered, and the numbers here don't stick to the columns and rows like they do in Pandas.
54 00:06:28.680 --> 00:06:36.939 Ray Lutz: They float, like they would in a regular spreadsheet. So if you move the rows around, the numbers of the rows
55 00:06:37.130 --> 00:06:42.429 Ray Lutz: are going to stay in the same order, even though you might have moved something up there, and so forth.
56 00:06:42.870 --> 00:06:47.700 Ray Lutz: But then you can also optionally have names for each column.
57 00:06:47.940 --> 00:06:56.250 Ray Lutz: data types for the columns, in a separate data types object that explains what those are,
58 00:06:56.420 --> 00:07:20.869 Ray Lutz: and then optional row keys. Okay, these are both dictionaries. So the header dictionary (hd) and the row keys dictionary are a special type of dictionary which gives you the number of the column, or the number of the row, in the dictionary. So I don't know what you call this exactly, but I ended up calling it a keyed list.
59 00:07:21.060 --> 00:07:27.140 Ray Lutz: In other words, this is the key.
60 00:07:27.430 --> 00:07:31.920 Ray Lutz: We'll go into the keyed list later, but essentially this is the key,
61 00:07:32.310 --> 00:07:36.460 Ray Lutz: and this is the number that refers to an item in a list.
62 00:07:36.860 --> 00:07:37.940 Ray Lutz: And
63 00:07:39.230 --> 00:07:47.170 Ray Lutz: so your dictionary would have a key, and the values are always 0, 1, 2, 3, 4, and so forth.
64 00:07:47.350 --> 00:08:04.169 Ray Lutz: And there isn't a standard function for this in Python. There's dict.fromkeys, where you can give it a single value, and it can have Nones all the way through, or zeros, whatever you want. But it doesn't build the sequential values automatically. But it's easy to make.
65 00:08:04.410 --> 00:08:06.249 Ray Lutz: So this is what it looks like.
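The "keyed list" Ray describes, a dictionary mapping each name to its position, is a one-line comprehension. This is a sketch of the idea, not necessarily Daffodil's actual internals; the column and row names are made up.

```python
# A "keyed list": each name maps to its index 0, 1, 2, ...
def keyed_list(names):
    return {name: i for i, name in enumerate(names)}

hd = keyed_list(["A", "B", "C"])    # header dictionary: column -> index
kd = keyed_list(["row1", "row2"])   # row keys dictionary: key -> index

lol = [[1, 2, 3],
       [4, 5, 6]]

# Looking up a cell by names is two steps: key -> index -> value.
value = lol[kd["row2"]][hd["B"]]
print(value)  # → 5
```

This is the missing "dict.fromkeys with sequential values" he mentions: `dict.fromkeys` fills every key with the same value, whereas here each key gets its own index.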
66 00:08:09.330 --> 00:08:20.650 Ray Lutz: Now, as I said, the Row keys and the Header Dictionary are dictionaries, but these are all optional. You could start with nothing, just an array of list of lists, and you still get all the functionality.
67 00:08:20.950 --> 00:08:23.839 Ray Lutz: But you would have to be using these indexes here.
68 00:08:24.280 --> 00:08:25.830 Ray Lutz: All right, let's go on to next.
69 00:08:26.100 --> 00:08:30.579 Ray Lutz: So essentially, my problem was this. If
70 00:08:31.090 --> 00:08:36.630 Ray Lutz: you want to use Pandas, you import Pandas here, and you say, I want to start a new data frame,
71 00:08:37.280 --> 00:08:43.970 Ray Lutz: and let's say you go through a bunch of URLs, and you harvest stuff from web pages, and you want to append to this array.
72 00:08:45.200 --> 00:08:46.570 Ray Lutz: If you say
73 00:08:46.690 --> 00:08:55.020 Ray Lutz: my_dataframe.append with the web page metadata: you just take a dictionary, and you want to add it to the bottom of the Pandas array.
74 00:08:55.250 --> 00:08:59.530 Ray Lutz: It's horrible! And in fact, this has been banned by
75 00:08:59.970 --> 00:09:02.950 Ray Lutz: the Pandas people. You can't append anymore.
76 00:09:03.190 --> 00:09:06.760 Ray Lutz: They just said this doesn't exist. That's how bad it is.
77 00:09:06.860 --> 00:09:09.240 Ray Lutz: Now, what were they doing? Why is it so bad?
78 00:09:09.420 --> 00:09:14.370 Ray Lutz: It's because what pandas is is, let me go back a second.
79 00:09:15.120 --> 00:09:18.980 Ray Lutz: I gotta. What is it? Shift to go back control?
80 00:09:20.630 --> 00:09:22.279 Ray Lutz: I gotta go with the keys.
81 00:09:23.050 --> 00:09:31.609 Ray Lutz: Okay, so what Pandas is, is essentially numpy arrays, vertically, right here in a dictionary,
82 00:09:32.010 --> 00:09:36.450 Ray Lutz: where you have the name in the dictionary, and the value
83 00:09:36.650 --> 00:09:41.830 Ray Lutz: is a numpy array, vertically. And you've got to think of it that way, and they're all the same length.
84 00:09:42.370 --> 00:09:45.710 Ray Lutz: So the numpy array has data in it.
85 00:09:48.260 --> 00:09:57.079 Ray Lutz: In numpy arrays, each value is rammed up against the next. There's nothing else, unlike Python, where
86 00:09:57.270 --> 00:10:06.190 Ray Lutz: even an integer, or whatever you have in here, takes quite a bit of overhead. Usually it'll be like, I think, 28 bytes just to represent an integer. There's a lot of overhead generally.
87 00:10:06.540 --> 00:10:12.820 Ray Lutz: And if you put a dictionary in each row, then you have the keys for each one. I'll get into that in a second.
88 00:10:13.010 --> 00:10:19.259 Ray Lutz: My point, though, is that in a Pandas array you have the name, and you have a
89 00:10:20.950 --> 00:10:28.160 Ray Lutz: numpy array, and if you want to add to the bottom, you have to create all new numpy arrays, or add to each one.
90 00:10:28.480 --> 00:10:32.420 Ray Lutz: They don't let you just add to each one. They create a whole new array every time,
91 00:10:32.770 --> 00:10:37.929 Ray Lutz: so they copy it over and add to the bottom, copy it over, add to the bottom, copy it over. That's how they do it.
92 00:10:38.240 --> 00:10:39.739 Ray Lutz: And so it takes a long time
93 00:10:41.630 --> 00:10:51.909 Ray Lutz: if you're appending. So they've basically disallowed this. So if you're not going to do that, then you can do this: you can say, I want to make a list of dictionaries. I call it a lod.
94 00:10:52.460 --> 00:10:56.850 Ray Lutz: Okay? And it's a list of dictionaries with string keys and anything inside.
95 00:10:57.230 --> 00:11:05.030 Ray Lutz: And then you read the web page and you put your metadata dict and you append to the list of dictionaries. This will work fine.
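The list-of-dictionaries append pattern just described, sketched with a made-up `read_webpage` helper and URLs (Python list appends are amortized O(1), so the loop stays fast no matter how many rows accumulate):

```python
# The list-of-dictionaries ("lod") pattern: appending a dict per page is
# cheap, unlike DataFrame.append. read_webpage() is a made-up stand-in
# for whatever harvests the page metadata.
def read_webpage(url: str) -> dict:
    return {"url": url, "title": f"Title of {url}", "length": len(url)}

lod: list = []
for url in ["a.example", "b.example", "c.example"]:
    lod.append(read_webpage(url))  # amortized O(1) per append

print(len(lod), lod[0]["url"])  # → 3 a.example
```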
96 00:11:06.100 --> 00:11:06.680 Gabor Szabo: And be fast.
97 00:11:06.840 --> 00:11:15.420 Gabor Szabo: Sorry, let me just say something related to this. It's interesting, because, I think, in both Go and in Rust
98 00:11:16.160 --> 00:11:21.789 Gabor Szabo: you can allocate more space for these arrays,
99 00:11:22.090 --> 00:11:47.780 Gabor Szabo: even if you don't use it. So you can say: okay, at the end I'm going to have a hundred or 1,000 items in these vectors or arrays; right now I have one item in there. The memory is already allocated, so you can append up to 1,000 without this overhead of recreating the whole array.
100 00:11:48.210 --> 00:11:58.869 Ray Lutz: They could have done a better job in Pandas, because they would not need to copy over the whole thing. I didn't even know they were doing that when I first started,
101 00:11:59.030 --> 00:12:07.589 Ray Lutz: and so I noticed, when the array started to get pretty big, that it just started to slow down to a snail's pace. And so, what is this? Well,
102 00:12:07.710 --> 00:12:19.159 Ray Lutz: in the documentation it says: don't do this. What you're going to have to do is create something else, a list of dictionaries, and then in one fell swoop take your list of dictionaries and convert it into a data frame,
103 00:12:19.460 --> 00:12:21.690 Ray Lutz: and then it'll be reasonably fast.
104 00:12:22.100 --> 00:12:24.400 Ray Lutz: But this turns out, is very slow.
105 00:12:26.672 --> 00:12:33.759 Ray Lutz: But it's way faster than the appending. Okay, so if you're going through and appending to the bottom of the array,
106 00:12:35.960 --> 00:12:42.792 Ray Lutz: this will be faster. But then this part right here is actually kind of slow. But if that's all you're gonna do, and you're just gonna write it out to a
107 00:12:43.140 --> 00:12:44.440 Ray Lutz: CSV file,
108 00:12:44.650 --> 00:12:50.110 Ray Lutz: then you've just wasted a lot of time, because you didn't need to go through this here;
109 00:12:50.700 --> 00:13:01.689 Ray Lutz: you could just write it straight out. But if you did do a couple of things with it before you did that, you know, maybe you summed everything one time, and you added everything up,
110 00:13:02.681 --> 00:13:08.649 Ray Lutz: maybe you did some other manipulation, you thought being in the Pandas world was a good idea,
111 00:13:09.272 --> 00:13:11.810 Ray Lutz: But then you had this overhead of doing this.
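Writing the list of dictionaries straight out as CSV, skipping Pandas entirely, is a few lines with the standard library. This sketch uses an in-memory buffer and made-up data in place of a real file:

```python
import csv
import io

# Made-up harvested metadata, as a list of dictionaries.
lod = [
    {"url": "a.example", "length": 9},
    {"url": "b.example", "length": 9},
]

# Write the list of dicts straight out as CSV, no DataFrame in between.
buf = io.StringIO()  # stands in for an open file
writer = csv.DictWriter(buf, fieldnames=list(lod[0].keys()))
writer.writeheader()
writer.writerows(lod)

print(buf.getvalue().splitlines()[0])  # → url,length
```

For a real file, replace the `StringIO` with `open("out.csv", "w", newline="")`.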
112 00:13:12.020 --> 00:13:16.770 Ray Lutz: So this works. But it turns out this is very slow, and when you time it,
113 00:13:17.350 --> 00:13:29.710 Ray Lutz: going from a list of dictionaries into Pandas: this is a 1-million-integer table, a thousand by a thousand. Okay, that's the size of table that we're using for our benchmark.
114 00:13:30.230 --> 00:13:40.079 Ray Lutz: Now, would Pandas normally have a thousand columns? No, right? Because most data tables have very few columns.
115 00:13:40.280 --> 00:13:43.350 Ray Lutz: Usually, yeah, 20 to 30 columns is a big one.
116 00:13:44.218 --> 00:13:49.599 Ray Lutz: The data tables I'm working with, they have a lot of columns. Okay, like,
117 00:13:49.740 --> 00:13:57.470 Ray Lutz: something with 5,000 columns is pretty big, but you'll see stuff under that, and a lot of it at 300 to 400 columns.
118 00:13:57.570 --> 00:14:00.429 Ray Lutz: So a thousand by 1,000, not unusual, that I see.
119 00:14:00.850 --> 00:14:12.250 Ray Lutz: and when you convert this in Daffodil, you take the list of dictionaries and make a list of lists, you know, formatted for Daffodil, it takes 139.
120 00:14:13.810 --> 00:14:15.009 Ray Lutz: What is it?
121 00:14:15.770 --> 00:14:22.660 Ray Lutz: Microseconds? Milliseconds, I believe. Pandas takes 5,600, more than 5 seconds,
122 00:14:23.640 --> 00:14:25.830 Ray Lutz: more than 5 seconds to convert
123 00:14:25.970 --> 00:14:31.070 Ray Lutz: it into pandas. So it is a ridiculous bottleneck.
124 00:14:31.830 --> 00:14:32.700 Ray Lutz: Okay.
125 00:14:33.350 --> 00:14:50.620 Ray Lutz: It takes 139. Like, look at the difference here. And if you multiply this out: even though Pandas is really, really fast to do certain things, like summing columns is ridiculously fast compared to Daffodil. I can sum columns here at 191 ms;
126 00:14:50.720 --> 00:14:52.359 Ray Lutz: Pandas takes only 4.
127 00:14:52.810 --> 00:14:56.289 Ray Lutz: So that's a big difference. So you do a big savings here.
128 00:14:56.430 --> 00:15:09.510 Ray Lutz: If you do a lot of these, then this might make up for this big difference here, but it takes a lot. It takes at least 30 of these all-column operations, sums, standard deviations; you've got to do 30 of those
129 00:15:10.120 --> 00:15:12.950 Ray Lutz: before you make up for converting it into Pandas.
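The break-even figure can be checked with the benchmark numbers quoted in the talk for the 1,000 x 1,000 integer table (a rough sketch; the exact timings depend on hardware):

```python
# Benchmark numbers quoted in the talk, in milliseconds.
convert_to_pandas_ms = 5600    # lod -> Pandas DataFrame conversion
convert_to_daffodil_ms = 139   # lod -> Daffodil lol conversion
pandas_sum_ms = 4              # sum all columns, in Pandas
daffodil_sum_ms = 191          # sum all columns, in Daffodil

# Pandas wins once its conversion cost plus n fast operations beats
# Daffodil's cheap conversion plus n slow operations:
#   5600 + 4n < 139 + 191n  =>  n > 5461 / 187
break_even = (convert_to_pandas_ms - convert_to_daffodil_ms) / (
    daffodil_sum_ms - pandas_sum_ms
)
print(round(break_even, 1))  # → 29.2
```

About 29 full-table operations, which matches the "at least 30" rule of thumb above.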
130 00:15:14.670 --> 00:15:30.780 Ray Lutz: So for just a few things, like summing columns, or just manipulating the data a little bit, you're just better off not getting into Pandas, because of this ridiculous conversion factor. Now, I tried to get around this problem here,
131 00:15:31.560 --> 00:15:34.549 Ray Lutz: and there's also another problem with Pandas.
132 00:15:34.760 --> 00:15:38.289 Ray Lutz: This is integers; as soon as you add a string,
133 00:15:39.620 --> 00:15:49.650 Ray Lutz: and the size here is 38 MB, I believe it's megabytes, for
134 00:15:49.750 --> 00:15:53.450 Ray Lutz: a million integers, and
135 00:15:55.600 --> 00:16:05.859 Ray Lutz: Pandas is only at 9.3, so Pandas is quite a bit more compact, right, if you have just integers or just floats.
136 00:16:06.190 --> 00:16:12.950 Ray Lutz: But if you get a string, then this goes up and becomes quite a bit larger, by 10 times,
137 00:16:13.730 --> 00:16:18.089 Ray Lutz: quite a bit larger than a Daffodil table, which really doesn't go up very much.
138 00:16:19.250 --> 00:16:26.170 Ray Lutz: Okay. So then, you know, numpy, we can convert things to numpy really quickly.
139 00:16:26.440 --> 00:16:33.029 Ray Lutz: 48 ms going to numpy, and from numpy doesn't take very long.
140 00:16:33.290 --> 00:16:36.139 Ray Lutz: and then you can manipulate in numpy
141 00:16:36.630 --> 00:16:47.929 Ray Lutz: one column at a time, or add columns together, or sum the columns. Whatever you want to do, you can then do it directly in numpy and skip over Pandas. Pandas is also a big beast;
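The numpy round trip Ray describes, sketched on a tiny made-up table:

```python
import numpy as np

# Convert a list of lists to NumPy for column math, then back,
# skipping Pandas entirely.
lol = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

arr = np.array(lol)         # one contiguous integer array
col_sums = arr.sum(axis=0)  # column-wise sums, done in NumPy
back = arr.tolist()         # and straight back to plain lists

print(col_sums.tolist())  # → [12, 15, 18]
```

The conversion is fast precisely because a rectangular all-numeric list of lists maps directly onto one contiguous array, with none of Pandas's per-column bookkeeping.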
142 00:16:48.170 --> 00:16:52.400 Ray Lutz: It takes a long time to load, so if you use daffodil.
143 00:16:52.890 --> 00:17:13.309 Ray Lutz: you import daffodil, and then you create a Daffodil array. (You can't click on this, or it goes to the next thing; I can't highlight, for that reason.) But you create a Daffodil array, my_daf, and then I go through the URLs, and I get this stuff, and I append the dictionary to the Daffodil array,
144 00:17:13.369 --> 00:17:25.439 Ray Lutz: done. And then I simply write it out directly, and I skip over this thing here. Now, I was in a bad habit of using these Pandas arrays for almost everything, because they're so handy.
145 00:17:25.760 --> 00:17:37.760 Ray Lutz: But little did I know that my code was getting to be really slow, because the conversion of the list of dictionaries over to Pandas was taking a long time, every single time, and then back.
146 00:17:40.047 --> 00:17:45.620 Ray Lutz: So this is when I came up with Daffodil, and, you know, what it provides is
147 00:17:46.060 --> 00:17:51.780 Ray Lutz: a way of also indexing into these. Now, if you just used a list of dictionaries.
148 00:17:52.210 --> 00:18:06.820 Ray Lutz: if you think about it, for every single row the keys are repeated, and then in the next row you repeat the keys, and you repeat the keys, repeat the keys. So every single row has a lot of overhead, because the keys are being repeated.
149 00:18:09.260 --> 00:18:22.100 Ray Lutz: So when you crunch that down, you know, if we look back at the data type here, you see in the row here that for each row you just have a list of values,
150 00:18:22.260 --> 00:18:30.510 Ray Lutz: and you don't have the keys. The keys are there one time only; you don't need them in every single row. So you crunch all the keys up into one row,
151 00:18:31.090 --> 00:18:40.619 Ray Lutz: and then the indexing goes 2 times. So first you get the index of the list, and then you index into the list to get the data item. So it's one more step to get to it.
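A rough way to see the per-row key overhead Ray describes is to compare a list of dicts with a list of lists using `sys.getsizeof` (a shallow measure that ignores the shared key strings and the values themselves, but it shows the direction):

```python
import sys

# A list of dicts repeats the key structure on every row; a list of
# lists stores the same values without per-row keys.
lod = [{"a": i, "b": i, "c": i} for i in range(1000)]
lol = [[i, i, i] for i in range(1000)]

dict_bytes = sum(sys.getsizeof(row) for row in lod)
list_bytes = sum(sys.getsizeof(row) for row in lol)

# Per-row dicts carry hash-table overhead that per-row lists do not.
print(list_bytes < dict_bytes)  # → True
```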
152 00:18:42.290 --> 00:18:48.650 Ray Lutz: But these are lists, and there's a lot of benefits to that
153 00:18:51.000 --> 00:18:59.080 Ray Lutz: First of all, we can use this type of indexing in Python, which they provide as part of their infrastructure, so that you can write code
154 00:18:59.240 --> 00:19:02.600 Ray Lutz: that uses row, column indexing,
155 00:19:03.040 --> 00:19:07.930 Ray Lutz: or you can use it for anything. But in this case the first index is row, and then column,
156 00:19:10.398 --> 00:19:16.771 Ray Lutz: so the row and column can be integers, and that can be either the array index
157 00:19:17.550 --> 00:19:18.500 Ray Lutz: or
158 00:19:20.240 --> 00:19:28.510 Ray Lutz: it can be a key, if you want it to be. But if it's an integer, it assumes it's going to be the array index, and not
159 00:19:28.790 --> 00:19:31.040 Ray Lutz: going through the dictionaries.
160 00:19:31.720 --> 00:19:37.109 Ray Lutz: If you want an integer to go through the dictionaries, you have to use a method. But
161 00:19:38.641 --> 00:19:43.559 Ray Lutz: if it's a string, then it assumes that it's a key into the dictionaries,
162 00:19:44.120 --> 00:19:57.309 Ray Lutz: and it can be a list of integers which, then, is the list of array indices that you want to choose. It can be a list of strings. It can be a list of string keys, so you can pull out individual rows, individual columns.
163 00:19:57.570 --> 00:20:03.520 Ray Lutz: Whatever you want, you can index, an individual position, and the array
164 00:20:03.790 --> 00:20:08.950 Ray Lutz: you can slice and dice it you can give it a
165 00:20:09.200 --> 00:20:14.760 Ray Lutz: a range of indexes like 5 to 10, which gives you 5, 6, 7, 8, 9,
166 00:20:15.510 --> 00:20:20.010 Ray Lutz: or you can do a range of keys of a closed
167 00:20:20.340 --> 00:20:28.939 Ray Lutz: kind, for which we use a tuple. So it's like from C to AB, so from, like, column C to column AB,
168 00:20:29.140 --> 00:20:36.129 Ray Lutz: because you don't know what's after AB; you can't say go to the next one and back up one. You have to give it a closed range,
169 00:20:36.580 --> 00:20:38.570 Ray Lutz: and so we do it like that.
170 00:20:39.070 --> 00:20:44.780 Ray Lutz: Now, you can leave out the column if you want to use all columns, kind of like star in a SQL expression.
171 00:20:45.670 --> 00:20:49.729 Ray Lutz: You can just leave that out, and then talk about the row,
172 00:20:50.160 --> 00:21:03.749 Ray Lutz: and you can index in. If you append, the things can be in a different order, and they will always go in correctly, to the right place. So here I have it scrambled, where C is first, and it ends up putting C in the right place.
173 00:21:03.910 --> 00:21:07.139 Ray Lutz: And there's all kinds of examples here of how you would.
174 00:21:07.728 --> 00:21:28.079 Ray Lutz: Take that array that we start with here: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. I don't know why I did that. But then you end up saying, I want to use rows 0 and 1, because that's a slice; you get those. You can get the columns the same way. You can use the names of the columns,
175 00:21:28.200 --> 00:21:36.639 Ray Lutz: take all rows and names of columns here, in a list, and so forth. We also offer
176 00:21:36.830 --> 00:21:42.830 Ray Lutz: a list of tuples. I'm sorry, a list of ranges, which is kind of handy sometimes,
177 00:21:43.690 --> 00:21:45.829 Ray Lutz: and then you can
178 00:21:46.100 --> 00:21:53.070 Ray Lutz: get. You can set a value. You can say, I want to set this value to the entire array. It sets the whole thing
179 00:21:53.270 --> 00:22:02.466 Ray Lutz: You can set the first few columns, you can slice it and do this. So this is a setting, so you can set,
180 00:22:02.940 --> 00:22:11.539 Ray Lutz: and you can also pop in a list like, if you have a list, you want to put that in the column, you put that in and put a list into the row.
181 00:22:12.385 --> 00:22:17.500 Ray Lutz: You can put another daffodil array in, and it will put that in,
182 00:22:18.330 --> 00:22:24.870 Ray Lutz: for, you know, whatever the rectangular region is, it's going to put that in. All those things work.
183 00:22:26.600 --> 00:22:33.980 Ray Lutz: Now, there's a return mode which is optional. But we're gonna end up putting this into like when you when you do this
184 00:22:35.236 --> 00:22:36.990 Ray Lutz: indexing here.
185 00:22:37.850 --> 00:22:50.590 Ray Lutz: you want to get the value out in this case, because you want to multiply 2 values together. If you set the return mode to val, then it'll give you the value directly. If you just did this, you would get a daffodil array
186 00:22:51.560 --> 00:22:53.760 Ray Lutz: of the cell 0 1 1.
187 00:22:54.090 --> 00:23:01.039 Ray Lutz: I'm sorry, 1 comma 0. So it would be row... row 0, row 1, and this would be 5, right?
188 00:23:01.350 --> 00:23:26.629 Ray Lutz: and you would get an array of 5, one thing in the middle of the array. Well, you don't want that, you just wanted the value. So if you say return the value, then you can just multiply it by this value over here, and the one at 2, 2 is... 0, 1... 0, 1, 2... 0, 1, 2 is 10. Multiply those together, 5 times 10, and put that in the cell 2 comma 1, and down here,
189 00:23:26.890 --> 00:23:29.180 Ray Lutz: 0, 1, 2, 1 is 50.
190 00:23:29.430 --> 00:23:33.750 Ray Lutz: So we multiplied those values together and put it in here. So it's all malleable.
191 00:23:33.890 --> 00:23:38.526 Ray Lutz: You can do it like that just like a spreadsheet, and then
192 00:23:39.810 --> 00:23:43.939 Ray Lutz: we can insert columns here. So we're going to put one in first. So
193 00:23:44.160 --> 00:23:50.149 Ray Lutz: if you add a column like house, car and boat, and we call that category,
194 00:23:52.740 --> 00:23:59.119 Ray Lutz: then we also say we want to set the key field to category. Now, what it's done is
195 00:23:59.310 --> 00:24:08.039 Ray Lutz: what it does is: one of the columns, you can say, that's going to be my key field, and then it puts it in that dictionary lookup
196 00:24:08.200 --> 00:24:09.600 Ray Lutz: called the...
197 00:24:11.840 --> 00:24:20.049 Ray Lutz: Dk? Let's see... it's called the kd, a key dictionary. So this is a dictionary lookup, so super fast
198 00:24:20.260 --> 00:24:21.729 Ray Lutz: if you have a long one.
199 00:24:22.600 --> 00:24:30.250 Ray Lutz: but it has to be... if you do this, you can't have repeated values in here. It's gonna hit the... the first one that it sees.
200 00:24:31.774 --> 00:24:37.559 Ray Lutz: And so here, what we did was we add additional records, and it's going to add them in there
201 00:24:37.760 --> 00:24:38.690 Ray Lutz: with
202 00:24:40.260 --> 00:24:45.879 Ray Lutz: Here, you see the category is in a different order, and it still puts it in
203 00:24:46.530 --> 00:24:54.150 Ray Lutz: and if we have a double in there, it's going to modify the one that's there.
204 00:24:54.560 --> 00:24:57.849 Ray Lutz: So if you have, if you index in and you say?
205 00:24:58.403 --> 00:25:02.540 Ray Lutz: house, car, boat, and then house, car, boat, mall, van, condo.
206 00:25:02.680 --> 00:25:04.559 Ray Lutz: I think I have it in the next one.
207 00:25:05.330 --> 00:25:15.550 Ray Lutz: where, if you say house and you give it new values. It's going to modify the one that's there. Okay, so it doesn't add another one called house.
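The key-field behavior described above, lookup by key plus update-in-place on duplicates, can be sketched in plain Python (this is a conceptual illustration with hypothetical names, not the actual daffodil implementation):

```python
# Sketch of the key-field idea: one column is designated the key, and a
# dict ("kd") maps each key value to its row index, so lookups are O(1).
# Keys must be unique; re-adding an existing key modifies the row that is
# already there instead of appending a duplicate.
rows = [["house", 10], ["car", 20], ["boat", 30]]
kd = {row[0]: i for i, row in enumerate(rows)}   # the key dictionary

def upsert(row):
    key = row[0]
    if key in kd:                  # duplicate key: modify in place
        rows[kd[key]] = row
    else:                          # new key: append and index it
        kd[key] = len(rows)
        rows.append(row)

upsert(["house", 99])    # updates the existing 'house' row
upsert(["van", 40])      # appends a new row
```

A lookup like `rows[kd["van"]]` then retrieves a record by key without scanning the table.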
208 00:25:16.650 --> 00:25:23.520 Ray Lutz: Then you can select by using a select where statement.
209 00:25:23.980 --> 00:25:34.650 Ray Lutz: This is where lambda statements are really useful, where you just say lambda row, and you say the rows where the C value is greater than 20: I want to select those rows.
210 00:25:35.270 --> 00:25:42.259 Ray Lutz: It makes a new daffodil table, but it doesn't make new rows.
211 00:25:42.640 --> 00:25:47.410 Ray Lutz: These are actually the rows from this table just referenced over here.
212 00:25:48.110 --> 00:25:53.340 Ray Lutz: So it uses them by reference, just like Python does all the time.
213 00:25:53.460 --> 00:26:04.570 Ray Lutz: So you're not actually creating a new whole table. These are not unique values. These are actually the same list values from over here, put in over here so that you've just selected them.
214 00:26:04.770 --> 00:26:10.030 Ray Lutz: And so this daffodil table only has a list of references to the same data.
215 00:26:10.310 --> 00:26:15.149 Ray Lutz: all right, so that this way these selections are very fast because it doesn't do any copying
216 00:26:15.270 --> 00:26:17.197 Ray Lutz: unless you wanted to.
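The by-reference selection being described can be shown with plain Python lists (a conceptual sketch, not the daffodil API itself):

```python
# Selecting rows with a predicate produces a new "table" whose rows are
# references to the same list objects; no row data is copied.
table = [
    ["house", 10, 25],
    ["car",   15, 30],
    ["boat",  20, 21],
]

# "select where": keep rows whose third column is greater than 21
selected = [row for row in table if row[2] > 21]

# The selected table holds the very same list objects, not copies:
same = selected[0] is table[0]        # identity, not just equality

# Mutating a selected row therefore mutates the source table too:
selected[0][1] = 99
```

This is why such selections are fast, and also why you need an explicit copy if you want the selection to be independent of the source.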
217 00:26:17.820 --> 00:26:18.740 Ray Lutz: Okay,
218 00:26:20.300 --> 00:26:30.059 Ray Lutz: you can select a record by the key. You can also just do it this way: put the key into the indexing, and then say you want it to be a dictionary.
219 00:26:30.380 --> 00:26:35.700 Ray Lutz: Now, what we're going to end up doing is putting a comma in here. Whoops, can't click.
220 00:26:35.850 --> 00:26:44.570 Ray Lutz: You put a comma in here and put rtype, return type, equals dict inside here, instead of having .to_dict,
221 00:26:44.730 --> 00:26:51.919 Ray Lutz: because it's handier to know, like in this mode, if you want it to be a list.
222 00:26:52.340 --> 00:26:57.040 Ray Lutz: that's what is already in the array: the list. So if you want the list out,
223 00:26:57.220 --> 00:26:59.189 Ray Lutz: You don't want to convert it to
224 00:27:00.940 --> 00:27:09.149 Ray Lutz: Say a dictionary or a whole array, because you're going to get a whole array out of this selection. One row. But it's going to be a daffodil array data type.
225 00:27:11.580 --> 00:27:21.490 Ray Lutz: So if you want to get a list out of it, it's nice to know ahead of time. And we'll show you that in a second, because there's another thing I want to show you, which is called a keyed list.
226 00:27:23.220 --> 00:27:30.630 Ray Lutz: So you can get different types out. If you have to_dict, to_list, to_value... you can just print, or there's other things: to numpy,
227 00:27:31.125 --> 00:27:34.909 Ray Lutz: to pandas. You know, there's other things you can convert to here.
228 00:27:36.590 --> 00:27:39.989 Ray Lutz: So a common usage pattern is to process things by row.
229 00:27:40.752 --> 00:27:46.529 Ray Lutz: Where you would have somehow you're transforming the original row into a new row.
230 00:27:47.320 --> 00:27:50.870 Ray Lutz: and then you append the new row to the new daffodil table.
231 00:27:51.000 --> 00:28:09.870 Ray Lutz: Now, depending upon what the transform does, it might give you the same data again, with just something modified. It might mutate that row, and you would get it back here. When you append this, it's the same row as the original with a mutation. Guess what? That's going to modify the old row, so you don't necessarily want to do that. If you're making a mutation
232 00:28:13.400 --> 00:28:19.369 Ray Lutz: and then you would append it to the new table.
233 00:28:19.590 --> 00:28:27.799 Ray Lutz: and then you can do, you can put it out. It turns out you don't have to flatten. We've discovered later, and I want to show you that in a second it automatically flattens.
234 00:28:29.580 --> 00:28:37.630 Ray Lutz: so you can just apply. So if you have a transform row function, you just say, apply the function and it applies it, row by row
235 00:28:37.780 --> 00:28:41.289 Ray Lutz: and then gives you a new daffodil table. So you just go. You can do it this way.
236 00:28:42.380 --> 00:28:54.709 Ray Lutz: And you can also then just apply the data types at the end. If you want like, you can read it in, apply the data types, apply the transform. You don't need to flatten it anymore. Because I'll show you why most of the time.
237 00:28:54.980 --> 00:28:57.570 Ray Lutz: And then you just say to_csv and write it out.
238 00:28:57.960 --> 00:29:01.169 Ray Lutz: So here's where you write it in. You transform, row by row.
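The row-by-row transform pattern described above, and the mutation hazard that comes with shared references, can be sketched in plain Python (hypothetical names; not the actual daffodil apply API):

```python
# Transform each source row into a new row and append it to a new table.
# If the transform mutated the row it received, the original table would
# change too, because rows are shared by reference; copying the row first
# keeps the source intact.
def transform_row(row):
    new_row = dict(row)               # shallow copy so the original survives
    new_row["total"] = new_row["a"] + new_row["b"]
    return new_row

source = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
result = [transform_row(row) for row in source]
```

An apply-style helper would just run this loop for you and hand back the new table.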
239 00:29:02.940 --> 00:29:14.319 Ray Lutz: If you're doing this, daffodil works really well, okay? And this same sort of transform can be applied to... I'll show you in a second, when we're expanding this to use a SQL
240 00:29:15.330 --> 00:29:16.210 Ray Lutz: backing.
241 00:29:18.610 --> 00:29:24.210 Ray Lutz: So we avoid copies. This is what makes it faster, way faster than pandas most of the time.
242 00:29:24.390 --> 00:29:26.660 Ray Lutz: Pandas is fast
243 00:29:26.990 --> 00:29:33.479 Ray Lutz: if you're doing those matrix manipulations, the array manipulations that are used in numpy.
244 00:29:33.810 --> 00:29:40.909 Ray Lutz: But if you do stupid things like add columns and add rows and append things and stuff, it gets really, really slow,
245 00:29:41.040 --> 00:29:48.260 Ray Lutz: And also when you're when you end up copying. So we're using references to existing data rather than recopying unless you want to
246 00:29:49.530 --> 00:29:53.550 Ray Lutz: So row selection reuses the existing header dictionary
247 00:29:53.870 --> 00:30:00.470 Ray Lutz: and the selected list values from the source daffodil array, and then
248 00:30:00.730 --> 00:30:10.269 Ray Lutz: Processing by columns is slower, but you can usually avoid that. What you want to do is, in one fell swoop, if you want to add columns and drop them,
249 00:30:11.090 --> 00:30:12.809 Ray Lutz: You do that all at one time.
250 00:30:13.310 --> 00:30:16.820 Ray Lutz: and, in fact, if you want to do that...
251 00:30:19.150 --> 00:30:20.690 Ray Lutz: if you want to flip the...
252 00:30:20.960 --> 00:30:25.246 Ray Lutz: flip the array on a diagonal, which is...
253 00:30:26.710 --> 00:30:29.919 Ray Lutz: why can't I think of it? It starts with "trans". I can't think of it.
254 00:30:30.895 --> 00:30:33.259 Ray Lutz: We'll get to that in a second, but the
255 00:30:34.496 --> 00:30:43.109 Ray Lutz: daffodil is pretty slow with doing when you you know head to head when you're doing manipulations of
256 00:30:43.460 --> 00:30:44.700 Ray Lutz: numerics.
257 00:30:45.190 --> 00:30:51.100 Ray Lutz: But when you're doing this type of row selections, it's much faster
258 00:30:51.440 --> 00:31:03.609 Ray Lutz: and column-based... oh, it's transposition. If you say flip is true, and you're adding rows or subtracting them, you can also flip it for free, because you have to make a whole new one anyway.
259 00:31:04.720 --> 00:31:12.470 Ray Lutz: so you can flip it for free, if you want to, when you're changing the columns, dropping them and adding them. But you want to do that all at one time.
260 00:31:13.170 --> 00:31:14.359 Ray Lutz: Add and drop.
261 00:31:14.460 --> 00:31:21.460 Ray Lutz: basically modify. The columns, end up with a new array that has the columns that you need, and then mutate it in place.
262 00:31:23.110 --> 00:31:25.880 Ray Lutz: In other words don't add columns one at a time.
263 00:31:26.470 --> 00:31:29.340 Ray Lutz: and because it's going to just be a lot of overhead.
264 00:31:31.280 --> 00:31:37.400 Ray Lutz: But if you have columns in there, then you can just don't use the ones that you don't want to use. Okay, next thing.
265 00:31:37.790 --> 00:31:44.809 Ray Lutz: So the keyed list is one of the core technologies that we developed inside this, once we got some more experience.
266 00:31:45.500 --> 00:31:51.710 Ray Lutz: So a keyed list is basically... if you
267 00:31:51.870 --> 00:31:58.130 Ray Lutz: take... if you do a zip of keys and values, this creates a conventional dictionary. So you have...
268 00:31:58.300 --> 00:32:04.090 Ray Lutz: if you want to... let's say you have a list, and you have keys you want to apply to the list values.
269 00:32:04.250 --> 00:32:07.959 Ray Lutz: You have to go through this transformation, and this takes time.
270 00:32:08.570 --> 00:32:12.000 Ray Lutz: It distributes the values to each item in the dictionary.
271 00:32:12.340 --> 00:32:16.860 Ray Lutz: You create a dictionary with these keys, and then you put a value on each one.
272 00:32:17.100 --> 00:32:25.010 Ray Lutz: and it's in memory. Now, all of a sudden, the values are distributed out in this dictionary. You don't know what order they're in, they're entered in some weird order now.
273 00:32:25.760 --> 00:32:32.019 Ray Lutz: The dictionary takes care of making them in the same order, but in the actual dictionary itself. I don't know what order they're in.
274 00:32:32.610 --> 00:32:34.240 Ray Lutz: They're not a list anymore.
275 00:32:34.360 --> 00:32:35.690 Ray Lutz: Let's put it that way
276 00:32:36.480 --> 00:32:42.069 Ray Lutz: So you can get the list out by saying dict.values(). You can get the list out,
277 00:32:42.560 --> 00:32:48.910 Ray Lutz: and you can get the keys out. It's not a list at this point; you'd have to convert it to a list. It's a keys...
278 00:32:49.290 --> 00:32:51.120 Ray Lutz: it's a dict_keys type. Oops.
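The zip-into-a-dict conversion being described is standard Python, and it shows the cost the speaker is pointing at: the values get redistributed into a new structure and are no longer the original list.

```python
# Zipping keys and values builds a conventional dict, distributing each
# value to its key.
keys = ["a", "b", "c"]
values = [34, 45, 56]

d = dict(zip(keys, values))

# Getting the list back out requires another conversion; d.values() is a
# dict_values view, not a list, and list() builds a brand-new list:
values_out = list(d.values())
is_same_list = values_out is values   # equal contents, different object
```

Both directions cost time and memory, which is the motivation for the keyed list that follows.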
279 00:32:51.740 --> 00:32:56.009 Ray Lutz: So we propose this concept of a keyed list,
280 00:32:56.680 --> 00:33:01.210 Ray Lutz: which contains a header dictionary that contains indexes for each key and
281 00:33:02.950 --> 00:33:09.609 Ray Lutz: this is one way to create the header dictionary. And it's this is an easy way to understand it, but it's not the most optimal way to do it.
282 00:33:09.760 --> 00:33:11.419 Ray Lutz: So you're going to have a column
283 00:33:11.540 --> 00:33:23.539 Ray Lutz: name and the index: the index for the column, from the enumeration of the keys. So this index is going to go 0, 1, 2, 3, 4, 5, and that's going to be the value, right? You go through all the keys, and you put them up here.
284 00:33:23.640 --> 00:33:28.090 Ray Lutz: So this stays the same for every
285 00:33:28.400 --> 00:33:36.420 Ray Lutz: keyed list of the same size and with the same columns. You don't need a new header dictionary. You can use the same one for different keyed lists.
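The header-dictionary construction just described (an easy way to understand it, if not the most optimized) is a one-line enumeration, and the resulting dict can be shared by every row of the same shape:

```python
# Map each column name to its index; this is the "header dictionary".
cols = ["a", "b", "c", "d", "e", "f"]
hd = {name: idx for idx, name in enumerate(cols)}

# Two different value lists share the single header dict; no per-row
# dictionary is ever built.
row1 = [10, 20, 30, 40, 50, 60]
row2 = [11, 21, 31, 41, 51, 61]

b_of_row1 = row1[hd["b"]]
b_of_row2 = row2[hd["b"]]
```

One `hd` for thousands of rows is what makes this cheaper than building a dict per row.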
286 00:33:36.850 --> 00:33:41.490 Ray Lutz: And the keyed list is
287 00:33:41.870 --> 00:33:50.640 Ray Lutz: a list. So, unlike a regular dictionary, which distributes the values amongst all the keys in the structure,
288 00:33:50.950 --> 00:33:53.400 Ray Lutz: the list here is still a list.
289 00:33:54.930 --> 00:34:02.499 Ray Lutz: It still looks like a dictionary. You have a key and a value, but it's structured, and you can still get the list out.
290 00:34:04.370 --> 00:34:09.670 Ray Lutz: It looks like a dictionary, but it's not designed like a dictionary.
291 00:34:10.050 --> 00:34:15.480 Ray Lutz: It has a header which has, excuse me
292 00:34:16.360 --> 00:34:19.470 Ray Lutz: like a: 0, b: 1, c: 2,
293 00:34:20.190 --> 00:34:28.119 Ray Lutz: and then a list associated with that and and that this way, it's faster to to do things. So if you have a keyed list.
294 00:34:28.610 --> 00:34:32.309 Ray Lutz: and like a is 34, b is 45, and c is 56,
295 00:34:33.159 --> 00:34:36.180 Ray Lutz: and you have values here. 1, 2, 3.
296 00:34:37.460 --> 00:34:42.620 Ray Lutz: You can say keyed_list.values = this values list: assign new values.
297 00:34:43.440 --> 00:34:46.719 Ray Lutz: And now you have a new keyed list. With those values in there
298 00:34:47.530 --> 00:34:52.000 Ray Lutz: you could do the same thing with the dictionary. It would put new values into the dictionary.
299 00:34:53.510 --> 00:34:57.329 Ray Lutz: If you say, what is the value of B.
300 00:34:58.010 --> 00:35:04.480 Ray Lutz: you know, you're saying, I want to assign 67 to the b.
301 00:35:04.930 --> 00:35:06.970 Ray Lutz: Now you have 67 here
302 00:35:07.450 --> 00:35:11.410 Ray Lutz: the values list that you originally used. Sorry I can't click
303 00:35:11.540 --> 00:35:17.100 Ray Lutz: the values list that you originally used also got changed because it's the same list.
304 00:35:17.460 --> 00:35:18.729 Ray Lutz: When you said.
305 00:35:19.160 --> 00:35:31.599 Ray Lutz: I want to assign this values list to keyed_list.values, it did not make a new list. It did not recopy anything. All it did is add a reference in here to this existing list.
306 00:35:33.840 --> 00:35:41.649 Ray Lutz: and then "values_list is keyed_list.values" outputs True.
307 00:35:41.780 --> 00:35:45.139 Ray Lutz: The "is" operator means it is exactly the same thing.
308 00:35:45.900 --> 00:35:49.550 Ray Lutz: It's the same thing in memory. There's no new version of it.
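A minimal sketch of the keyed-list idea (hypothetical, not the actual daffodil implementation): a shared header dict of key to index, plus a plain list of values stored by reference, so `is` identity holds and writes go through to the original list.

```python
class KeyedList:
    """Dict-like access over a shared header dict and a referenced list."""

    def __init__(self, hd, values):
        self.hd = hd              # e.g. {'a': 0, 'b': 1, 'c': 2}, shared
        self.values = values      # a reference to the caller's list, no copy

    def __getitem__(self, key):
        return self.values[self.hd[key]]

    def __setitem__(self, key, val):
        self.values[self.hd[key]] = val

hd = {"a": 0, "b": 1, "c": 2}
vals = [1, 2, 3]
kl = KeyedList(hd, vals)

kl["b"] = 67                     # writes through to the original list
same_list = kl.values is vals    # identity: same object in memory
```

Assigning a different list to `kl.values` would likewise just rebind a reference; nothing is recopied.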
309 00:35:50.370 --> 00:35:55.670 Ray Lutz: So a keyed list means that we can
310 00:35:57.003 --> 00:36:03.320 Ray Lutz: number one: if you have a list and you want to put it into your daffodil array,
311 00:36:04.780 --> 00:36:11.099 Ray Lutz: don't turn it into a dictionary. Like, if you have a list, it goes directly into that list in the array.
312 00:36:11.430 --> 00:36:15.050 Ray Lutz: Now, if you want to iterate through the daffodil array.
313 00:36:15.650 --> 00:36:20.339 Ray Lutz: it's convenient to iterate through with keyed lists, because, if you modify one.
314 00:36:20.450 --> 00:36:25.580 Ray Lutz: it actually modifies the array without having to recopy it, in just the way a...
315 00:36:25.730 --> 00:36:28.430 Ray Lutz: like, if you had a list of dictionaries,
316 00:36:28.710 --> 00:36:35.689 Ray Lutz: and you go through the list of dictionaries, and you have a dictionary in hand, and you change that item,
317 00:36:36.690 --> 00:36:44.360 Ray Lutz: it actually is the same dictionary as in the main list of dictionaries, and it'll change it in the list of dictionaries.
318 00:36:44.560 --> 00:36:50.450 Ray Lutz: Now, if you have a daffodil array and you pull a dictionary out.
319 00:36:50.830 --> 00:36:56.150 Ray Lutz: It's not the same data as what's in the array, and if you change it, it doesn't change what's in the array.
320 00:36:56.510 --> 00:36:58.570 Ray Lutz: But if you take a keyed list out.
321 00:36:58.900 --> 00:37:07.050 Ray Lutz: and you change that item in that list. That list is the same one that's in the array, and you've changed it without having to recopy it back in. So then, it works the same way as dictionaries do.
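The list-of-dictionaries behavior being described is plain Python reference semantics, and it's easy to demonstrate:

```python
# Pulling a dict out of a list of dicts gives you the same object, so
# mutating it mutates the list in place.
rows = [{"a": 1}, {"a": 2}]

row = rows[0]          # same dict object, not a copy
row["a"] = 99          # visible through rows[0] as well
changed_in_place = rows[0]["a"]

# By contrast, an explicit copy breaks the link; the original stays put:
copied = dict(rows[1])
copied["a"] = -1
```

The keyed list gives you this same write-through behavior against a daffodil array, where extracting a plain dict would not.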
322 00:37:07.830 --> 00:37:11.330 Ray Lutz: I don't know if I I probably can add another slide for that to explain it.
323 00:37:12.240 --> 00:37:14.490 Ray Lutz: Now we're going to go to a new topic.
324 00:37:14.940 --> 00:37:18.390 Ray Lutz: CSV reading: very, very fast if you
325 00:37:21.110 --> 00:37:25.429 Ray Lutz: stay in string type. So as long as you don't convert anything,
326 00:37:26.140 --> 00:37:30.528 Ray Lutz: the Python reader is really fast:
327 00:37:31.580 --> 00:37:39.080 Ray Lutz: for a million rows. According to this guy here. This reference he timed it. I'm not sure I trust this, but anyway, I used it because it was a reference.
328 00:37:39.360 --> 00:37:44.210 Ray Lutz: and if you do a pandas read_csv, it takes much more time.
329 00:37:46.960 --> 00:37:51.909 Ray Lutz: Pandas read_csv with a chunksize is, for some reason, worse.
330 00:37:52.080 --> 00:37:55.340 Ray Lutz: Dask is worse.
331 00:37:55.630 --> 00:38:04.049 Ray Lutz: Datatable, I guess, is another option; it's not as fast. This looks absurdly,
332 00:38:04.280 --> 00:38:15.620 Ray Lutz: way better than it really is, so I'll have to look into that. But it's still very, very fast, because it doesn't do any type conversion for you. Pandas does this automatically, to try to
333 00:38:15.780 --> 00:38:17.700 Ray Lutz: be easy to use.
334 00:38:17.830 --> 00:38:20.650 Ray Lutz: But if you don't want that, it doesn't happen,
335 00:38:21.290 --> 00:38:26.700 Ray Lutz: so later you can apply the dtypes and unflatten...
336 00:38:27.230 --> 00:38:36.669 Ray Lutz: unflatten them, which would bring them back up to become a Python data type. Such as, like, if you have a dictionary in a cell,
337 00:38:37.170 --> 00:38:42.340 Ray Lutz: and it gets turned into either JSON or what I call PYON,
338 00:38:42.490 --> 00:38:44.270 Ray Lutz: which we're going to get into in a second.
339 00:38:44.710 --> 00:38:50.050 Ray Lutz: then it will reform that into the dictionary within the cell.
340 00:38:52.220 --> 00:38:54.230 Ray Lutz: Now, the csv writer
341 00:38:54.690 --> 00:39:02.520 Ray Lutz: flattens automatically to PYON. I didn't know this. PYON is something that I dreamed up as a name:
342 00:39:03.140 --> 00:39:07.729 Ray Lutz: it means Python Object Notation, and it's similar to JSON.
343 00:39:08.630 --> 00:39:14.329 Ray Lutz: It's actually a superset of JSON. JavaScript Object Notation
344 00:39:14.520 --> 00:39:18.730 Ray Lutz: is JSON, and this is simply Python Object Notation,
345 00:39:18.920 --> 00:39:25.750 Ray Lutz: but it can express sets, tuples, dicts, lists, functions, etc. So we can do everything
346 00:39:26.633 --> 00:39:40.640 Ray Lutz: within Python, and it already does, for the most part, except for functions. It'll handle sets, tuples, dicts, lists automatically in the csv writer without you doing anything.
347 00:39:41.710 --> 00:39:43.569 Ray Lutz: I just stumbled across this.
348 00:39:44.205 --> 00:39:50.850 Ray Lutz: Now, PYON already exists. It's already defined. It's what you get if you repr something,
349 00:39:53.110 --> 00:39:54.560 Ray Lutz: Generally speaking.
350 00:39:54.840 --> 00:40:02.690 Ray Lutz: not always, because sometimes the reprs are broken in these things. But what they should do is define this as:
351 00:40:03.190 --> 00:40:08.180 Ray Lutz: the csv writer should use repr, and sometimes it uses the str function instead.
352 00:40:11.540 --> 00:40:21.019 Ray Lutz: it's better than Pickle, Json, Pickle, and other variants of Json for working with python types, in my opinion.
353 00:40:21.940 --> 00:40:25.030 Ray Lutz: So I generated this pyon tools
354 00:40:25.210 --> 00:40:30.269 Ray Lutz: Python module. It isn't quite published yet, but I'm using it myself,
355 00:40:30.630 --> 00:40:37.289 Ray Lutz: and it turns out, for CSV, it's very simple, because what you're doing is you're using the repr method for any object,
356 00:40:37.440 --> 00:40:40.980 Ray Lutz: and it basically does it already.
357 00:40:41.090 --> 00:40:48.960 Ray Lutz: But when you're using the csv writer, you don't have to change this. So the way I stumbled across this is: I had dictionaries
358 00:40:49.170 --> 00:40:54.449 Ray Lutz: in my daffodil array, I wrote it out to a file,
359 00:40:54.600 --> 00:41:01.139 Ray Lutz: and it automatically converted them and flattened them out into character strings, the normal ones that you see
360 00:41:02.866 --> 00:41:13.269 Ray Lutz: when you look at a dictionary, like we were just looking at, just like when you look at this dictionary right here:
361 00:41:13.420 --> 00:41:21.259 Ray Lutz: open brace, single quote, a, single quote, colon, 0, comma, all that sort of thing.
362 00:41:21.610 --> 00:41:31.939 Ray Lutz: This is the expression, a string expression that represents a dictionary. It isn't the dictionary itself. Dictionary itself is some other, you know, thing in memory, and
363 00:41:32.100 --> 00:41:42.130 Ray Lutz: of a fairly complex structure that python has suppressed. And what you understand as a dictionary, are these symbols right here?
364 00:41:42.430 --> 00:41:52.550 Ray Lutz: Those symbols are character strings that can be represented in a file. So this is what you get if you have a dictionary, which is this header dictionary with those things in it. That's exactly what you find
365 00:41:52.940 --> 00:41:54.910 Ray Lutz: in the CSV file,
366 00:41:55.290 --> 00:42:04.269 Ray Lutz: right here. Unfortunately, it doesn't use double quotes, so it's not exactly JSON. If they allowed you to say, use double quotes instead, this would be JSON,
367 00:42:04.920 --> 00:42:11.979 Ray Lutz: and then you could use it with other tools. A little bit of a wrinkle there with what they use in Python,
368 00:42:12.100 --> 00:42:14.969 Ray Lutz: and maybe we can get the csv writer to
369 00:42:15.140 --> 00:42:19.120 Ray Lutz: optionally use double quotes, so it would still be valid PYON
370 00:42:19.750 --> 00:42:23.499 Ray Lutz: but would also be valid JSON.
371 00:42:24.730 --> 00:42:27.319 Ray Lutz: I think the Python community should embrace
372 00:42:27.550 --> 00:42:30.940 Ray Lutz: the PYON that they've already defined, but they don't have a name for it,
373 00:42:31.270 --> 00:42:38.240 Ray Lutz: and provide options for the csv writer to use double quotes and stuff in there, because then it would produce JSON.
374 00:42:38.570 --> 00:42:43.040 Ray Lutz: But this is how we flatten things from daffodil: we almost do nothing.
375 00:42:43.160 --> 00:42:45.529 Ray Lutz: Python already does it for us.
376 00:42:46.340 --> 00:42:53.050 Ray Lutz: Now, when we import CSV, we do it in a very controlled, explicit manner, so it comes in as strings,
377 00:42:54.110 --> 00:43:08.590 Ray Lutz: and then we convert them. Unfortunately, pandas is optimized for tables with just numerics and a simple header, normal for CSV, and they don't... it's hard to work around this. You can do it, but it's just a pain in the ass to try to get it to do weird things.
378 00:43:10.230 --> 00:43:13.730 Ray Lutz: So what we do is the
379 00:43:14.000 --> 00:43:21.190 Ray Lutz: daffodil dtypes, which is something you can specify, and it doesn't do anything by itself; it just gets carried around in the frame.
380 00:43:21.510 --> 00:43:23.669 Ray Lutz: But if you
381 00:43:23.910 --> 00:43:33.459 Ray Lutz: are importing things, you can say apply it, and then it will apply dtypes to the columns that you want to apply to. If the columns don't exist, it's not going to hurt it,
382 00:43:33.620 --> 00:43:35.420 Ray Lutz: if you drop them.
383 00:43:37.130 --> 00:43:41.140 Ray Lutz: You don't want to apply d types to any columns you're not going to actually use.
384 00:43:41.410 --> 00:43:50.979 Ray Lutz: So if you bring in an array, and it's got 5,000 columns and you only need 3 of them: first, drop everything else that you don't need, or just work on the ones that you want to work on.
385 00:43:51.120 --> 00:44:01.939 Ray Lutz: In fact, you don't need to drop them if you brought it all the way in. Just work on the columns that you want to work with, and then just ignore the rest. As soon as you start converting the thing, then you're starting to add time.
386 00:44:04.110 --> 00:44:06.869 Ray Lutz: So we have a few other features. I want to mention.
387 00:44:07.450 --> 00:44:10.200 Ray Lutz: number one. We have an indirect functionality
388 00:44:10.490 --> 00:44:15.199 Ray Lutz: where a dict can specify the contents of specific columns. So inside of a cell
389 00:44:15.930 --> 00:44:23.250 Ray Lutz: you have a dictionary, and that dictionary actually specifies column names and values
390 00:44:23.510 --> 00:44:40.629 Ray Lutz: which are to be interpreted as part of the actual array. But it's got an indirection. So you first go into the cell, you find out what's specified there, and then that's to be interpreted as the rest of the array. And this is useful for sparse arrays. As I was saying, what I was working with
391 00:44:40.760 --> 00:44:44.310 Ray Lutz: was a very sparse array with like 5,600 columns,
392 00:44:44.740 --> 00:44:48.410 Ray Lutz: and only about 50 of them are used at any one time.
393 00:44:48.960 --> 00:44:51.849 Ray Lutz: So if you use, if you represent this as
394 00:44:51.990 --> 00:44:55.661 Ray Lutz: an actual CSV file or anything like that,
395 00:44:57.050 --> 00:45:02.380 Ray Lutz: it's very, very costly, because you have all these commas, right? Comma comma comma comma,
396 00:45:02.530 --> 00:45:12.309 Ray Lutz: and to represent all of the 5,600 columns when you're only going to use 50, and then you have them in there, and you got to try to figure out which ones they are. It's a mess. So
397 00:45:12.850 --> 00:45:23.699 Ray Lutz: in this case, even though you want the array to be logically 5,600 columns for any one row. You don't want to have to specify more than just the 50 columns that you're working with.
398 00:45:24.380 --> 00:45:37.019 Ray Lutz: And so in that one cell, what you have is a dictionary which specifies all of the columns that you're working with, and then it is logically considered part of the array. So if you sum the columns or the rows,
399 00:45:37.340 --> 00:45:41.520 Ray Lutz: it figures out where those it takes that indirection into account.
400 00:45:41.670 --> 00:45:45.160 Ray Lutz: expands them and works with summing them that way.
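The sparse-array indirection being described can be sketched in plain Python (a conceptual illustration with hypothetical column names, not the daffodil implementation): each row stores a small dict of only the columns it actually uses, and a column sum looks through the dicts instead of materializing 5,600 columns.

```python
# Each row is logically 5,600 columns wide, but only names a few of them.
rows = [
    {"col_0007": 5, "col_1234": 2},
    {"col_1234": 3},
    {"col_0007": 4, "col_5599": 1},
]

def sum_column(rows, name):
    # Missing keys count as 0: the column is logically present everywhere.
    return sum(row.get(name, 0) for row in rows)

total = sum_column(rows, "col_1234")
```

Summing expands the indirection on the fly, so the CSV never has to carry thousands of empty commas per row.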
401 00:45:46.600 --> 00:45:57.230 Ray Lutz: We have a from_pdf that will take a PDF file with, you know, a header and columns, and parse it,
402 00:45:57.590 --> 00:46:06.050 Ray Lutz: usually skipping a few things; you can use a few controls there to skip things. But to just convert from basic PDF files that you might find,
403 00:46:06.692 --> 00:46:11.399 Ray Lutz: it's a little shortcut. You can do it yourself, but this shortcuts some work.
404 00:46:11.700 --> 00:46:19.234 Ray Lutz: It offers the attrs (attributes) dictionary as part of the
405 00:46:21.550 --> 00:46:26.399 Ray Lutz: part of the class; every instance has this, and this is the same as in pandas:
406 00:46:26.620 --> 00:46:35.670 Ray Lutz: you can add any kind of attribute you want to a data frame, and
407 00:46:35.950 --> 00:46:44.030 Ray Lutz: what I find convenient there is, like, if I've already figured out that these are the metadata columns, they're all strings, and the rest of it is data.
408 00:46:45.090 --> 00:46:54.610 Ray Lutz: Once I parse that and I know where they are, I need to pass that along and say, these are the metadata columns. This is the number of columns that's the metadata
409 00:46:55.215 --> 00:47:00.590 Ray Lutz: and that's easy to do. You just put that into this attrs, and then
410 00:47:00.810 --> 00:47:14.399 Ray Lutz: your next function says, well, how many metadata columns are there? Oh, it's in attrs, I already know that. Now, you do have to know that it's in there, you know, but at least you don't have to pass another variable along, or, even worse, recalculate it.
411 00:47:14.590 --> 00:47:21.159 Ray Lutz: So that's something that turns out pandas had, and we're just using that the same way.
412 00:47:21.590 --> 00:47:27.490 Ray Lutz: And then we're now offering a join method that is efficient and mimics a join in. SQL.
413 00:47:29.430 --> 00:47:35.409 Ray Lutz: pandas doesn't have a real join, it has a merge. So when you do a merge in pandas,
414 00:47:36.910 --> 00:47:44.299 Ray Lutz: you essentially are doing what this does here. But when we're...
415 00:47:44.440 --> 00:47:46.030 Ray Lutz: and I'll get to this in a second.
416 00:47:46.590 --> 00:47:55.179 Ray Lutz: we're extending daffodil to use SQL in the background, and when the SQL machine does a join,
417 00:47:55.290 --> 00:47:59.290 Ray Lutz: it doesn't actually do anything, it just keeps track of the join.
418 00:47:59.720 --> 00:48:19.770 Ray Lutz: And if you say I want to take these columns in this table, and I want to join them with these columns in this table along this key, it doesn't do anything. It just remembers that. And if you say I also want to join this table in this table, in this table and this on these keys. You could do them one at a time, and you can do up to, I think, 64 times, or maybe it's 16. But there's a certain limit.
419 00:48:19.910 --> 00:48:22.240 Ray Lutz: and then, once you have all your joints done.
420 00:48:22.720 --> 00:48:31.590 Ray Lutz: and you say I want to select these column, these rows out of my join tables. Then it does it. Then it figures it out and it pulls all the data in and does it.
421 00:48:31.980 --> 00:48:32.810 Ray Lutz: Okay.
422 00:48:32.960 --> 00:48:51.309 Ray Lutz: so it's nice that way, that SQL, when it does a join, it creates a view. It doesn't actually do anything. Now, pandas always does something. It always does a merge. It takes data from one thing, it puts it with this one and merges it together. Essentially, that's what this join does. But
423 00:48:51.470 --> 00:48:57.639 Ray Lutz: this one is, we'll be using the SQL type of join when we get to that.
424 00:48:58.010 --> 00:49:00.029 Ray Lutz: So there's the plans for SQL.
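The lazy-join behavior described above can be sketched with `sqlite3` from the Python standard library. This is a minimal illustration of the general SQL mechanism, not daffodil code; the table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE people(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE scores(person_id INTEGER, score INTEGER);
    INSERT INTO people VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO scores VALUES (1, 90), (2, 75), (1, 88);
""")

# Creating the view records the join definition but moves no data yet;
# this is the "it just remembers that" step.
con.execute("""
    CREATE VIEW joined AS
    SELECT p.name, s.score
    FROM people p JOIN scores s ON s.person_id = p.id
""")

# Only this SELECT actually evaluates the join.
rows = con.execute("SELECT name, score FROM joined ORDER BY score").fetchall()
print(rows)   # [('bob', 75), ('ann', 88), ('ann', 90)]
```

Contrast this with `pandas.merge`, which materializes the combined table immediately; the view costs nothing until someone selects from it.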
425 00:49:01.270 --> 00:49:10.500 Ray Lutz: Now, right now, daffodil arrays must fit into memory at this time. So if it doesn't fit into memory, you're going to have to chunk it,
426 00:49:13.440 --> 00:49:29.729 Ray Lutz: which we do. So we have a thing where we chunk things, and there's a lot of infrastructure, I might add, in daffodil that chunks things. So basically we have a chunk of like a hundred things in one chunk, and we have thousands of those.
427 00:49:29.980 --> 00:49:37.310 Ray Lutz: We don't actually want to combine them necessarily upfront. We combine them all in one fell swoop and then make one big file,
428 00:49:37.440 --> 00:49:39.339 Ray Lutz: or just work with the chunks.
429 00:49:40.470 --> 00:49:50.440 Ray Lutz: What we'll do here with SQL is use the fact that the row-based daffodil arrays are similar to SQL data tables; they're also row based.
430 00:49:51.120 --> 00:49:53.339 Ray Lutz: But they have column operations. Of course.
431 00:49:53.860 --> 00:50:14.559 Ray Lutz: we'll add kwargs, we'll add additional keyword arguments in the indexing, to specify whether it will be an SQL table. Or another way to say it is, you just take the original daffodil, to SQL, and we'll give it a name, and then we'll get this daffodil table, main_sql_daf; we'll just call it that.
432 00:50:14.760 --> 00:50:18.009 Ray Lutz: You don't have to use this name, and that will be
433 00:50:18.800 --> 00:50:21.019 Ray Lutz: how we refer to it within python.
434 00:50:21.360 --> 00:50:25.990 Ray Lutz: And this actually will look like a daffodil table. But it's actually in SQL,
435 00:50:26.190 --> 00:50:31.890 Ray Lutz: so we don't actually have the table in daffodil. It's basically a proxy to the actual table.
436 00:50:32.610 --> 00:50:39.430 Ray Lutz: Then operations on the SQL daf will operate as if the table were in memory, but it actually is operating in the SQL engine.
437 00:50:39.890 --> 00:50:40.880 Ray Lutz: and
438 00:50:41.570 --> 00:50:55.309 Ray Lutz: the result is, we can allow much larger tables while still manipulating in the daffodil array paradigm, with selection and indexing done in a pythonic way. So essentially, we're still going to use those square brackets, you know. The 1st one is the row.
439 00:50:55.420 --> 00:51:03.800 Ray Lutz: Well, that's like select, you know. The second one is column. So select star, that would be kind of like the first. The second thing is the columns that you want,
440 00:51:03.980 --> 00:51:11.139 Ray Lutz: and then you say which things you want to select; in an SQL statement that would be the 1st
441 00:51:11.370 --> 00:51:15.381 Ray Lutz: parameter, and selecting it, and the like. So
442 00:51:16.730 --> 00:51:27.859 Ray Lutz: I won't go into some of the difficulties that we found in SQLite. But SQLite does not have a row-based, like a vector-based operation, so that we can have
443 00:51:31.860 --> 00:51:34.319 Ray Lutz: we can have python,
444 00:51:34.530 --> 00:51:47.540 Ray Lutz: an apply that would take, say, a row from the table, run it through python, and return an entire row, and then add it to a new table. SQLite doesn't provide that; that would be an extension we'd want to see.
445 00:51:47.998 --> 00:51:55.519 Ray Lutz: All they allow is scalar returns, and then it's a lot of work to do it, so it's better to bring a chunk out of the table
446 00:51:55.850 --> 00:52:01.129 Ray Lutz: as a daffodil array, apply it within python, and then move it back into
447 00:52:01.500 --> 00:52:05.320 Ray Lutz: SQL, right? That's the best way to do it right now. It's the fastest.
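The chunk-out, apply, write-back pattern just described can be sketched with plain `sqlite3`. This is an illustration of the pattern, not daffodil's actual implementation; the table names, column names, and the row-wise transform are all invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements(id INTEGER PRIMARY KEY, raw REAL)")
con.executemany("INSERT INTO measurements(raw) VALUES (?)",
                [(v,) for v in (1.0, 2.0, 3.0, 4.0)])
con.execute("CREATE TABLE results(id INTEGER, scaled REAL)")

CHUNK = 2
cur = con.execute("SELECT id, raw FROM measurements")
while True:
    chunk = cur.fetchmany(CHUNK)       # bring a chunk of rows into Python
    if not chunk:
        break
    # Row-wise "apply" done in ordinary Python, something SQLite's own
    # scalar user-defined functions can't return a whole row for.
    transformed = [(i, raw * 10) for i, raw in chunk]
    con.executemany("INSERT INTO results VALUES (?, ?)", transformed)

out = con.execute("SELECT scaled FROM results ORDER BY scaled").fetchall()
print(out)   # [(10.0,), (20.0,), (30.0,), (40.0,)]
```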
448 00:52:07.336 --> 00:52:10.609 Ray Lutz: So if you want to use,
449 00:52:11.560 --> 00:52:14.560 Ray Lutz: we would also support general SQL queries
450 00:52:14.810 --> 00:52:20.530 Ray Lutz: on the proxy. So if you say, I want to actually use this SQL query,
451 00:52:21.110 --> 00:52:30.490 Ray Lutz: then, I keep doing that click, I can't click when I'm in this. So if you want to have an SQL query and apply it to this proxy,
452 00:52:31.860 --> 00:52:34.610 Ray Lutz: We don't know the name of the table over there, necessarily.
453 00:52:34.800 --> 00:52:38.880 Ray Lutz: And one thing about python is you.
454 00:52:39.010 --> 00:52:47.389 Ray Lutz: When you create a daffodil table. You don't know what name you're going to apply to it, because it's not the way Python works. It doesn't know what name it has. In fact, it could have many names.
455 00:52:49.910 --> 00:52:53.390 Ray Lutz: In SQL. When you have a table it has a name.
456 00:52:53.530 --> 00:52:56.559 Ray Lutz: and you have to use that name to refer to it all the time.
457 00:52:58.870 --> 00:53:05.299 Ray Lutz: so we're going to have to name our table with some arbitrary name, if you don't give it one.
458 00:53:05.670 --> 00:53:06.899 Ray Lutz: And then
459 00:53:07.660 --> 00:53:13.269 Ray Lutz: or we may actually always name it with an arbitrary name and then map it over. But essentially,
460 00:53:14.780 --> 00:53:18.910 Ray Lutz: when you do an SQL statement.
461 00:53:19.450 --> 00:53:22.219 Ray Lutz: and you say, I want to
462 00:53:22.340 --> 00:53:25.069 Ray Lutz: like, select blah blah blah from
463 00:53:25.770 --> 00:53:27.700 Ray Lutz: you have to put in a table name.
464 00:53:27.960 --> 00:53:39.890 Ray Lutz: Okay? And you're not going to know what that name is. And so that's why we're going to have to have some substitution going on with that. And that won't be too hard for people to do if they want to use general purpose. Queries
465 00:53:42.040 --> 00:53:47.979 Ray Lutz: Pretty much everything within pandas is available within daffodil.
466 00:53:49.160 --> 00:53:55.049 Ray Lutz: But some things that are not available in pandas are available, like append, which has been deprecated in pandas.
467 00:53:58.270 --> 00:54:19.940 Ray Lutz: But most things run the same way, a little bit different, because pandas is normally columns of numpy arrays. So in pandas the 1st value in the square brackets is a column by default, whereas in our mode the 1st thing by default is the row. Just be aware of that.
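The column-first vs row-first distinction above shows up in the plain Python structures the two approaches are built around. This is an illustration with made-up data: a dict of lists is column-oriented (pandas-style, where the first subscript names a column), while a list of dicts is row-oriented (daffodil-style, where the first subscript is a row number).

```python
# Column-oriented: the first key you give selects a whole column.
col_oriented = {"width": [23, 23], "height": [19, 19]}

# Row-oriented: the first index you give selects a whole row.
row_oriented = [{"width": 23, "height": 19}, {"width": 23, "height": 19}]

print(col_oriented["width"])   # [23, 23]
print(row_oriented[0])         # {'width': 23, 'height': 19}
```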
468 00:54:23.060 --> 00:54:28.340 Ray Lutz: and so pretty much they're all there. And again the timing we went over briefly at the beginning.
469 00:54:30.730 --> 00:54:38.663 Ray Lutz: Daffodil is faster for array manipulation, like appending rows. But pandas is faster if you're going to do column-based
470 00:54:39.590 --> 00:54:41.020 Ray Lutz: manipulations.
471 00:54:41.300 --> 00:54:45.950 Ray Lutz: Basically. Here was my summary about use cases, and when you want to use each one.
472 00:54:46.160 --> 00:54:50.950 Ray Lutz: if you have existing data in well-defined, column-based format, then.
473 00:54:51.820 --> 00:54:54.970 Ray Lutz: and almost all data is numeric.
474 00:54:55.760 --> 00:55:01.139 Ray Lutz: and you don't want to do appending or modification other than maybe creating some additional columns,
475 00:55:02.210 --> 00:55:09.709 Ray Lutz: and then maybe produce plots after you analyze it, and and so forth. Then pandas might be the best choice for sure.
476 00:55:11.190 --> 00:55:13.289 Ray Lutz: as long as the data fits in memory.
477 00:55:13.980 --> 00:55:24.469 Ray Lutz: Once it gets out of memory, then maybe you're going to use Daffodil. SQL might be a good choice; we'll have to see. I haven't really tested that enough to know if that's going to be a better choice for you.
478 00:55:25.040 --> 00:55:30.090 Ray Lutz: If you're building data tables by analyzing or converting images or other data, appending to a table,
479 00:55:30.250 --> 00:55:44.499 Ray Lutz: that is not going to be pandas. If you want to have small utility tables used for tracking processes or parsing data, driving state machines, all these kind of little tables you might use all the time throughout your code.
480 00:55:44.970 --> 00:55:51.520 Ray Lutz: Use daffodil tables. Don't get involved with pandas. That's for data analysis. And those specific things.
481 00:55:53.091 --> 00:55:55.550 Ray Lutz: Once you build the table.
482 00:55:55.670 --> 00:55:58.259 Ray Lutz: Then you might want to use pandas or numpy.
483 00:55:58.860 --> 00:56:07.090 Ray Lutz: We can convert individual columns, and this is a pretty good way to do it: convert to numpy arrays so that the columns can be managed.
484 00:56:07.886 --> 00:56:13.330 Ray Lutz: For example, if you sum 2 columns, you get another whole column.
485 00:56:13.760 --> 00:56:33.910 Ray Lutz: This kind of operation here will work if it's a numpy array. Or you can multiply columns; you can do all of these functions, add them together, multiply by a scalar, or, say, divide one column by another one. It creates another whole column, and that expression is very fast.
486 00:56:34.480 --> 00:56:42.450 Ray Lutz: So you can do this by just having a dictionary of numpy arrays converted on the columns that you want to use.
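The dict-of-numpy-arrays idea just described can be sketched in a few lines: convert only the columns you need into numpy arrays, and whole-column arithmetic becomes a single vectorized expression. The column names and values here are invented for the example.

```python
import numpy as np

# Only the columns we care about, each converted to a numpy array.
cols = {
    "width":  np.array([23, 10, 7]),
    "height": np.array([19, 4, 2]),
}

cols["area"] = cols["width"] * cols["height"]    # column * column
cols["total"] = cols["width"] + cols["height"]   # column + column
cols["half_width"] = cols["width"] / 2           # column / scalar

print(cols["area"].tolist())    # [437, 40, 14]
print(cols["total"].tolist())   # [42, 14, 9]
```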
487 00:56:43.210 --> 00:56:55.069 Ray Lutz: If you want to do state updates like, like tracking the state of a user in a web-based application or something. Then you're going to want to use an SQL or no SQL. Or something kind of database, and not use any of these. Of course.
488 00:56:55.837 --> 00:57:03.530 Ray Lutz: Status is that, you know, I've used it quite a bit myself. It hasn't really been adopted very much. That's okay. I mean, we're still
489 00:57:03.690 --> 00:57:06.800 Ray Lutz: sort of researching the best ways for this to work.
490 00:57:07.398 --> 00:57:12.489 Ray Lutz: I've used it myself. I've converted almost everything over from pandas, and I love it.
491 00:57:15.100 --> 00:57:22.539 Ray Lutz: And in a couple of cases I still have to use SQL, because the tables got big. And so that's why I want to convert over to Daffodil SQL,
492 00:57:23.301 --> 00:57:25.869 Ray Lutz: and that's pretty much what I had there.
493 00:57:26.330 --> 00:57:30.270 Ray Lutz: Okay, so I'm done. Guess
494 00:57:30.410 --> 00:57:35.550 Ray Lutz: My contact is Ray, [email protected]. That's my email.
495 00:57:37.210 --> 00:57:42.549 Ray Lutz: Any questions? I guess I used up the whole hour and a little bit more, and didn't
496 00:57:43.430 --> 00:57:49.000 Ray Lutz: give room for many questions. I see the chat room is okay. I'll leave it.
497 00:57:49.000 --> 00:57:51.159 Gabor Szabo: I think apologies, no worries.
498 00:57:51.160 --> 00:57:51.800 Gabor Szabo: Don't do that.
499 00:57:52.590 --> 00:57:55.050 Gabor Szabo: If anyone has questions, then then please do ask.
500 00:57:55.200 --> 00:57:59.460 Gabor Szabo: I just wanted to say something. 1st of all. Thank you very much for the presentation.
501 00:57:59.750 --> 00:58:04.290 Gabor Szabo: But one thing that is sort of related.
502 00:58:04.970 --> 00:58:17.130 Gabor Szabo: that I see many people using pandas for... I just read in a CSV file and do some simple manipulation, and they always go to pandas, because that's what they learned,
503 00:58:17.470 --> 00:58:21.289 Gabor Szabo: and they don't use the standard csv library.
504 00:58:21.805 --> 00:58:22.320 Ray Lutz: Yeah.
505 00:58:22.600 --> 00:58:33.609 Gabor Szabo: And I had a feeling that pandas is just way too big for this. But now your numbers show that it's way slower than using the standard csv library.
506 00:58:34.620 --> 00:58:38.569 Ray Lutz: Way slower and also just getting in and out of it.
507 00:58:39.090 --> 00:58:43.039 Ray Lutz: like, if you're if you're just staying within the Pandas world.
508 00:58:43.280 --> 00:58:46.079 Ray Lutz: and you're doing stuff that are pandas related.
509 00:58:47.080 --> 00:58:48.270 Ray Lutz: It's great.
510 00:58:48.440 --> 00:58:52.260 Ray Lutz: And and I think, though, that what we're going to find
511 00:58:52.400 --> 00:59:02.439 Ray Lutz: is that using a daffodil array and converting it to a dictionary of numpy arrays, which is kind of what's inside pandas. But pandas has grown so big
512 00:59:02.780 --> 00:59:10.259 Ray Lutz: that I've watched it load. It takes several seconds, maybe like 5, 10 seconds for it just to be imported.
513 00:59:10.600 --> 00:59:15.820 Ray Lutz: So when you're running one of these interpreters and you're using a huge library like Pandas.
514 00:59:16.550 --> 00:59:19.070 Ray Lutz: I mean the main pandas class,
515 00:59:19.310 --> 00:59:23.379 Ray Lutz: just one class, is like 13,000 lines.
516 00:59:23.690 --> 00:59:28.290 Ray Lutz: It's all in one file. I mean, I'm really surprised they still write them this way,
517 00:59:28.400 --> 00:59:35.630 Ray Lutz: but it's it's a very highly functional thing. And here's the thing is that people are nowadays.
518 00:59:36.320 --> 00:59:45.059 Ray Lutz: They might be using an AI machine to assist them, and they say, You know, read this in, and and you know, do a few conversions and then put out, help me do this plot.
519 00:59:45.770 --> 00:59:52.860 Ray Lutz: The AI machines. They know perfectly well how to use pandas, and they'll do it now, for that
520 00:59:53.260 --> 00:59:59.179 Ray Lutz: efficiency is not important, really. You may wait a few extra seconds. But who cares?
521 01:00:00.353 --> 01:00:09.340 Ray Lutz: So Daffodil is a little bit different animal; it actually is all python,
522 01:00:10.120 --> 01:00:15.499 Ray Lutz: and not that pandas is not, you know pandas is numpy, and
523 01:00:15.750 --> 01:00:30.809 Ray Lutz: but it's restrictive in what it can put into its lists, and it's designed around numerics. So when you start adding strings or anything else, it just freaks out. Then you've got to go back to a dictionary of lists, or a list of dictionaries, I mean,
524 01:00:31.100 --> 01:00:36.260 Ray Lutz: and that's what I ended up doing. But then I didn't have the functionality of selecting rows and other things that you want.
525 01:00:38.170 --> 01:00:40.676 Ray Lutz: Now will this catch on? I don't know.
526 01:00:42.030 --> 01:00:46.110 Ray Lutz: I think that that it for most people
527 01:00:46.290 --> 01:00:49.019 Ray Lutz: I mean, I still like to use pandas for
528 01:00:49.170 --> 01:00:54.550 Ray Lutz: for certain things, just because I know that the AI machine knows exactly what to do with it.
529 01:00:55.293 --> 01:01:04.750 Ray Lutz: Once the AI machine understands Daffodil, it might do it, but I don't think it's really the use case. Daffodil is more for programmers than for
530 01:01:05.280 --> 01:01:07.370 Ray Lutz: people who are data analysts.
531 01:01:08.320 --> 01:01:11.200 Ray Lutz: It's more for somebody who wants to program in
532 01:01:12.350 --> 01:01:18.409 Ray Lutz: that use these, I mean, I use them all the time, because if you don't use one.
533 01:01:18.920 --> 01:01:27.129 Ray Lutz: and you know that it's not suitable to use pandas for this. So then you want to refer to a column of the array.
534 01:01:27.910 --> 01:01:36.059 Ray Lutz: Well, it's a list of dictionaries, so you don't have columns. You have to go through and write a comprehension that pulls out the column. You can do that,
535 01:01:36.600 --> 01:01:41.220 Ray Lutz: but it's easier just to have something that's all well tested, and everything that pulls that column out
536 01:01:42.930 --> 01:01:47.280 Ray Lutz: makes conversions and so forth. So it's it's a handy thing to have.
537 01:01:47.590 --> 01:01:49.899 Ray Lutz: and I think it's logical to have
538 01:01:50.070 --> 01:01:52.420 Ray Lutz: a next step up. So we have
539 01:01:52.550 --> 01:01:58.670 Ray Lutz: fairly high level data structures in python, such as lists, dictionaries
540 01:01:59.010 --> 01:02:02.419 Ray Lutz: very highly functional and really nice.
541 01:02:02.920 --> 01:02:09.959 Ray Lutz: But we need to move up a level and have a two-dimensional functional data frame within the python world
542 01:02:10.220 --> 01:02:11.510 Ray Lutz: and and not
543 01:02:12.090 --> 01:02:19.120 Ray Lutz: make it numpy. Not that I'm against numpy. It's just that it's very restrictive as to what you can put into those cells.
544 01:02:20.880 --> 01:02:25.430 Ray Lutz: you can put some strings in. I think it's up to 20 characters or something. So.
545 01:02:26.550 --> 01:02:31.590 Ray Lutz: okay, all right. Well, thank you so much. I guess
546 01:02:32.030 --> 01:02:35.199 Ray Lutz: you're right. It's it's most people
547 01:02:35.570 --> 01:02:43.179 Ray Lutz: I think are going to say, well, I'm just going to continue to use python pandas, because that's what I'm used to, and
548 01:02:43.680 --> 01:02:45.039 Ray Lutz: I don't really care about
549 01:02:45.180 --> 01:02:56.389 Ray Lutz: time so much as what you're saying here, and I'm not doing appending. But if you're doing the appending, if you're building these tables up, that's when Daffodil becomes a pretty handy little tool.
550 01:02:57.490 --> 01:02:59.970 Ray Lutz: Okay, thanks a lot, Gabor. I guess that's the end.
551 01:03:00.360 --> 01:03:09.079 Gabor Szabo: Yeah. So thank you. Thank you again for for giving this presentation. Thank you. Everyone who listened to the were present and listened to the presentation.
552 01:03:09.330 --> 01:03:17.269 Gabor Szabo: If you like the video, then please like it, and follow the channel and see you next time.
553 01:03:17.810 --> 01:03:19.889 Ray Lutz: Okay, thanks. A lot. Okay. Bye.
While profiling a slow process I stumbled upon a surprising way to reduce our memory consumption. This talk will present some useful profiling tools, and an important thing to know when using AbstractBaseClass extensively. In this session, we will dive into the realm of Python optimization, as we cover some essential profiling tools designed to identify and resolve performance bottlenecks in your code. We'll navigate through practical examples, showcasing how these tools can provide invaluable insights into your application's memory and CPU usage patterns. Furthermore, we'll delve into some nuances of AbstractBaseClass usage, and its implications on speed and memory management in Python applications. Whether you're a seasoned developer or just starting your journey with Python, this session offers some practical strategies to optimize Python programs effectively.

1 00:00:02.250 --> 00:00:29.170 Gabor Szabo: Hi, and welcome to the Code Maven events, meetings, and the Code Maven channel, if you're watching it on Youtube. My name is Gabor Szabo. I usually teach python and rust and help companies with these 2 languages mostly. And I also organize these events because I really like the idea of sharing knowledge, I mean receiving knowledge from other people, like this time, Tomer,
2 00:00:29.330 --> 00:00:49.120 Gabor Szabo: and from around the world. So that's a good idea, I think. And that's it. If you're watching on Youtube, then please like the video and follow the channel. And thanks everyone who arrived to this meeting, and especially Tomer, for giving us the presentation. Now it's your turn.
3 00:00:49.510 --> 00:00:52.949 Gabor Szabo: So welcome, please introduce yourself. And yeah.
4 00:00:53.120 --> 00:00:56.100 Tomer Brisker: Thank you. Let me share my screen.
5 00:00:58.800 --> 00:01:00.990 Tomer Brisker: Okay, can you see it?
6 00:01:01.920 --> 00:01:02.670 Gabor Szabo: Yes.
7 00:01:03.060 --> 00:01:03.920 Tomer Brisker: Excellent.
8 00:01:05.334 --> 00:01:09.469 Tomer Brisker: Okay, so 1st of all, I have to make a confession.
9 00:01:09.770 --> 00:01:14.040 Tomer Brisker: It wasn't 6 lines of code. It was actually 7 lines of code.
10 00:01:14.460 --> 00:01:28.110 Tomer Brisker: And I guess you're all pretty wondering, curious what these lines of code were. So here they are. Okay. Okay. It was 8 lines of code. If you count the space in between the functions.
11 00:01:28.660 --> 00:01:39.080 Tomer Brisker: and we'll dive into what exactly these lines of code mean a bit later, and why these allowed us to save so much memory.
12 00:01:39.220 --> 00:01:41.780 Tomer Brisker: But 1st of all, just so, you believe me.
13 00:01:41.930 --> 00:01:47.609 Tomer Brisker: this is our memory usage graph in production when we deployed this fix.
14 00:01:47.770 --> 00:02:01.359 Tomer Brisker: As you can see, the deployment was around 5:10 PM, which is a great time to deploy fixes. If I remember correctly, this was a Thursday, which is the end of the week in Israel, a perfect time for deploying to production.
15 00:02:01.870 --> 00:02:13.220 Tomer Brisker: But 1st of all, do we have anyone in the call who happens to be a US citizen or has to file US tax reports?
16 00:02:18.200 --> 00:02:21.170 Tomer Brisker: feel free to wave, or something.
17 00:02:21.860 --> 00:02:24.939 Tomer Brisker: If there are, I guess I guess not.
18 00:02:25.090 --> 00:02:31.839 Tomer Brisker: Well, if you were, I guess this would probably look pretty familiar to you.
19 00:02:31.950 --> 00:02:43.169 Tomer Brisker: So for those of you who don't know, practically all US citizens are required to file tax reports annually with the IRS
20 00:02:43.440 --> 00:02:47.260 Tomer Brisker: for their income taxes. This is a
21 00:02:47.360 --> 00:02:52.590 Tomer Brisker: pretty painful process. It requires filling out a lot of obscure forms.
22 00:02:53.093 --> 00:03:17.609 Tomer Brisker: And if you make some mistakes on it, you can find yourself in jail. So most people either pay one of the existing companies who provide services for filing tax reports, or pay an accountant to do the tax reports for them. So hi! My name is Tomer. I'm the tech lead at. We are fixing the issue of IRS tax report filing.
23 00:03:18.833 --> 00:03:27.250 Tomer Brisker: We are developing a simple to use application that allows users to file their taxes seamlessly with the irs.
24 00:03:27.793 --> 00:03:48.870 Tomer Brisker: It usually takes most users around half an hour to do their taxes, which is pretty awesome compared to what it normally takes, which is many hours. And we don't charge them nearly as much as an accountant or one of the existing providers charge for this.
25 00:03:49.694 --> 00:04:01.029 Tomer Brisker: If I look a bit tired in the recording, I'm going to let you guess which one of these is to blame today. Hint: she's on the left.
26 00:04:01.935 --> 00:04:04.964 Tomer Brisker: She's 6 months old.
27 00:04:05.810 --> 00:04:11.380 Tomer Brisker: I live with my partner and 2 kids in Givatayim, which is a suburb of Tel Aviv,
28 00:04:12.070 --> 00:04:18.920 Tomer Brisker: and our dog. But I guess you're not here to hear about me and my life.
29 00:04:19.149 --> 00:04:22.340 Tomer Brisker: You're here to hear about performance in Python.
30 00:04:23.170 --> 00:04:25.279 Tomer Brisker: So let's dive in.
31 00:04:25.420 --> 00:04:30.670 Tomer Brisker: Our story begins when we noticed there's some
32 00:04:30.830 --> 00:04:35.610 Tomer Brisker: certain action in our system that's taking quite a long time to complete
33 00:04:36.321 --> 00:04:41.589 Tomer Brisker: in fact, we even got some reports from users, complaining that they were hitting timeouts
34 00:04:41.740 --> 00:04:45.890 Tomer Brisker: when they were running this specific action within the system.
35 00:04:46.710 --> 00:05:01.249 Tomer Brisker: and I was assigned to this task, started digging in, and I managed to reproduce it locally. I created a nice little script that created the exact conditions of the users that were timing out.
36 00:05:01.800 --> 00:05:09.590 Tomer Brisker: and the 1st step was to see how long it actually takes. And yeah, it was actually pretty slow.
37 00:05:09.690 --> 00:05:20.060 Tomer Brisker: Python comes with a couple of built-in modules in the Standard Library that are pretty nice when you're timing things: those are time and timeit.
38 00:05:20.599 --> 00:05:25.270 Tomer Brisker: I will leave reading the exact documentation of them to the listener.
39 00:05:26.530 --> 00:05:27.429 Tomer Brisker: But
40 00:05:28.480 --> 00:05:54.030 Tomer Brisker: It's very useful if you know what you're looking for. For example, if there's a specific method that you know is slow, and you want to measure some change that you make to it and see the impact of it, this is very useful. We even have a little wrapper method that allows us to easily measure the timings for various functions that we call.
41 00:05:54.405 --> 00:06:02.299 Tomer Brisker: But what do we do if we're not sure where the slowness is coming from in this specific action in the system?
42 00:06:02.470 --> 00:06:16.779 Tomer Brisker: This was a pretty complex action. It involved calling several different services, a lot of methods, so it's pretty difficult if you don't know where the slowness is coming from.
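The two standard-library timing tools just mentioned can be used like this. This is a minimal sketch with a toy workload, not the speaker's wrapper method: `time.perf_counter()` brackets a region by hand, while `timeit` repeats a small snippet many times and reports the total.

```python
import time
import timeit

# Manual bracketing with perf_counter around an arbitrary toy computation.
start = time.perf_counter()
total = sum(i * i for i in range(100_000))
elapsed = time.perf_counter() - start
print(f"one run: {elapsed:.4f}s, total={total}")

# timeit handles the repetition and the clock for you.
per_1000 = timeit.timeit("sum(i * i for i in range(1000))", number=1000)
print(f"1000 runs of the small version: {per_1000:.4f}s")
```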
43 00:06:18.680 --> 00:06:23.669 Tomer Brisker: So this is what profilers were invented for
44 00:06:24.368 --> 00:06:31.150 Tomer Brisker: profiler. There's a TV series called the Profiler. We're not going to talk about it. I have no idea what it's about.
45 00:06:31.430 --> 00:06:35.990 Tomer Brisker: but profilers generally come into different varieties.
46 00:06:36.560 --> 00:07:03.049 Tomer Brisker: There are 2 varieties that we will mention as we go. The 1st variety is deterministic profilers. These are profilers that essentially, every time you make a method call, register that method call. They write down the start time, and once that method returns, they write down the end time. Python's standard library has a nice one called cProfile.
47 00:07:03.330 --> 00:07:24.280 Tomer Brisker: It gives you a context manager. Basically, you wrap the code that you want to measure with the context manager, then call whatever slow function you want to profile, and save the statistics to a file. And cProfile will actually take care of going over all of the method calls within that function,
48 00:07:24.410 --> 00:07:30.459 Tomer Brisker: measuring how long they take, how many times each method is called, etc.
49 00:07:30.560 --> 00:07:41.699 Tomer Brisker: And it saves all of these statistics into a file, which can be read using a module called pstats.
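The cProfile-plus-pstats workflow just described looks roughly like this. This is a self-contained sketch, not the speaker's script: the deliberately slow functions are invented for the example, and it reads the stats straight from the Profile object instead of going through a file.

```python
import cProfile
import io
import pstats

def inner():
    # A small function we expect to dominate the profile.
    return sum(range(1000))

def slow_action():
    # The "slow function" being profiled, calling inner() many times.
    return [inner() for _ in range(200)]

profiler = cProfile.Profile()
with profiler:            # context-manager form (Python 3.8+)
    slow_action()

# pstats reads the collected statistics; sort by cumulative time and
# print the top 5 rows of the table.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

To save to a file instead, `profiler.dump_stats("out.prof")` and later `pstats.Stats("out.prof")` give the same table.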
50 00:07:41.810 --> 00:08:05.169 Tomer Brisker: Pstats allows you to read these files. And there's a lot of information here. We'll dive into it a little bit to understand what this table is about. So 1st of all, on the right, we can see the file name, line number and function name, pretty basic, so you know what we're calling here.
51 00:08:05.260 --> 00:08:18.769 Tomer Brisker: On the left-hand side, we can see the number of calls to each function. As you can see, the 1st 3 here were called just once. This is actually the script that I was using to debug this issue.
52 00:08:19.561 --> 00:08:29.669 Tomer Brisker: The second column here is total time, which is the time that was spent within this specific method in total, over all of the times that it was called,
53 00:08:29.840 --> 00:08:39.579 Tomer Brisker: and there's cumulative time, which is basically the time that was spent within this method and any other method that was called from within that method.
54 00:08:40.289 --> 00:08:40.799 Tomer Brisker: and
55 00:08:41.130 --> 00:08:53.159 Tomer Brisker: something stood out pretty quickly to me: out of 12 seconds runtime in total, about 6 seconds, or half the runtime, was spent in one specific method.
56 00:08:53.955 --> 00:08:59.914 Tomer Brisker: And this method is ABC's subclass check, `__subclasscheck__`. Interesting. Okay,
57 00:09:00.620 --> 00:09:13.350 Tomer Brisker: let's see. And even more interesting is the number of times this was called. So in 12 seconds we actually called this method 175,000 times.
58 00:09:13.410 --> 00:09:40.590 Tomer Brisker: That's the number on the right. And if there's a slash and another number here, that means that this method was calling itself recursively. So in this case we were calling ABC's subclass check a bit over 3 million times. So for every single call that we were making to subclass check, it was actually making about 20 different calls on average.
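To see where `__subclasscheck__` gets invoked, here is a small standalone probe (not code from the talk, and assuming CPython's ABC caching behavior): every `isinstance()` or `issubclass()` against an abstract base class routes through `ABCMeta.__subclasscheck__`, which consults `__subclasshook__` and then caches the verdict per class, so repeat checks on the same class skip the hook.

```python
from abc import ABC

calls = {"n": 0}

class CountingABC(ABC):
    @classmethod
    def __subclasshook__(cls, other):
        # Invoked from ABCMeta.__subclasscheck__ when the answer for
        # `other` is not already cached.
        calls["n"] += 1
        return NotImplemented   # fall back to the normal subclass check

isinstance([], CountingABC)   # first check for list: the hook runs
isinstance([], CountingABC)   # cached: the hook does not run again
isinstance((), CountingABC)   # new class (tuple): the hook runs once more
print(calls["n"])             # 2
```

This caching is exactly why a hot path that keeps hitting uncached classes (or keeps invalidating the cache) can rack up millions of subclass-check calls.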
59 00:09:41.633 --> 00:09:53.739 Tomer Brisker: Okay, pretty interesting. But still, I'm not quite sure why we're calling this method so many times, or why is it taking so long when this method is being called.
60 00:09:53.920 --> 00:10:04.660 Tomer Brisker: And that's what the second type of profilers is really useful for identifying the second type of profilers is statistical profilers.
61 00:10:04.720 --> 00:10:21.380 Tomer Brisker: These are profilers that basically take a snapshot of your python call stack, or in any other language, the call stack. Every certain interval. Usually this is done. Every few milliseconds, the shorter the interval.
62 00:10:21.380 --> 00:10:37.109 Tomer Brisker: obviously, the higher the impact it has on performance. On the other hand, if you set too long of an interval, you might miss very quick method calls that return within the interval, and they won't actually be registered when running the profiler.
63 00:10:37.190 --> 00:10:44.800 Tomer Brisker: and a very common way of looking at statistical profiles is using a tool called flame charts.
64 00:10:45.400 --> 00:11:06.120 Tomer Brisker: The way that flame charts work: basically, you have 2 axes here. The x-axis is time, so the bigger the block is on the x-axis, the longer the time that was spent within that specific block; and the y-axis is the stack.
65 00:11:06.420 --> 00:11:26.929 Tomer Brisker: So you can see the actual call stack of every single method, and you can see why it was being called, where the call was coming from. You see the bigger ones, the smaller ones. That's very helpful when you need to debug and identify why a certain method is being called a lot of times.
66 00:11:27.900 --> 00:11:31.660 Tomer Brisker: So I used one of these statistical profilers
67 00:11:31.990 --> 00:11:54.509 Tomer Brisker: specifically, one called py-spy. There are multiple different profilers available for python, and each language has its own ecosystem of profilers. I'm just showing the ones that I used in this case, but there are various other tools that are useful, and they're all good in their own fields.
68 00:11:55.047 --> 00:12:02.359 Tomer Brisker: So I ran a statistical profile with py-spy on this reproducer that I created.
69 00:12:02.490 --> 00:12:10.869 Tomer Brisker: And hmm, yeah, okay, this is fine. This is fine. I can deal with that.
70 00:12:11.030 --> 00:12:20.069 Tomer Brisker: as you can see, a flame chart when you have a very complex operation, can be very, very, very difficult to read.
71 00:12:20.650 --> 00:12:27.250 Tomer Brisker: Sometimes there's something that stands out, you see, a very big block that's taking a very long time to call.
72 00:12:27.380 --> 00:12:52.389 Tomer Brisker: and you can identify the bottleneck pretty quickly from looking at this. But other cases everything is on fire, and you don't really know what's going on. Specifically, in this case we were seeing a certain method call being called 3 million times, which makes sense that it would be very difficult to identify all of these different calls within the flame chart.
73 00:12:52.640 --> 00:12:57.119 Tomer Brisker: and for that there was a nice tool called Sandwich.
74 00:12:57.330 --> 00:13:02.459 Tomer Brisker: Not that kind of sandwich. There's a tool called speedscope, and it has a
75 00:13:02.670 --> 00:13:05.350 Tomer Brisker: way of showing flame charts
76 00:13:05.460 --> 00:13:18.199 Tomer Brisker: in a different way, which they call sandwich. Basically, on the left-hand side, we can see all of the different method calls within our application, within the run that was profiled.
77 00:13:18.560 --> 00:13:27.659 Tomer Brisker: And we can sort this list by the total time and by the self time. This is the same, by the way, as we saw previously: total time and
78 00:13:27.790 --> 00:13:33.126 Tomer Brisker: the cumulative time within cProfile.
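The self-time versus cumulative-time distinction being described can be seen directly in cProfile's output. A minimal sketch (the function names `helper` and `outer` are invented for illustration): `outer` has almost no self time (tottime) but a large cumulative time (cumtime), because cumtime includes everything it calls.

```python
import cProfile
import io
import pstats

def helper():
    # Does the actual work; its time shows up as tottime ("self time").
    return sum(i * i for i in range(100_000))

def outer():
    # Spends almost no time itself, but its cumtime includes helper().
    return [helper() for _ in range(10)]

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

# tottime = time spent inside the function itself,
# cumtime = time including everything it called.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Sorting by `"cumulative"` surfaces the entry points; sorting by `"tottime"` surfaces where the work actually happens.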
79 00:13:33.860 --> 00:13:36.710 Tomer Brisker: And then once you click on one of these
80 00:13:37.000 --> 00:13:42.490 Tomer Brisker: you can see on the right-hand side those 2 parts, the callers and the callees.
81 00:13:42.500 --> 00:14:09.719 Tomer Brisker: The top half shows you where this method was being called from. So, for example, in this case we can see subclass check was mostly called from instance check and some other internal methods. Also, here we can see instance check was being called from various other methods. And here the x-axis actually shows the time that's cumulative for the specific
82 00:14:09.810 --> 00:14:29.109 Tomer Brisker: method. So this isn't a single call; this is the total of the times that it was called from here. And obviously I can't show the internals of our system, but you can see that there were several places where we were calling instance check pretty commonly, leading to most of
83 00:14:29.170 --> 00:14:41.379 Tomer Brisker: the load on this method. And on the left, you can see again, this was about 8 seconds in this test run. So quite a long time
84 00:14:41.480 --> 00:14:44.690 Tomer Brisker: from the overall. Time.
85 00:14:45.310 --> 00:14:50.109 Tomer Brisker: Okay, so instance check, subclass check.
86 00:14:50.290 --> 00:15:05.529 Tomer Brisker: This is like built-in Python stuff, right? And it's not something in our code base. What should I do about it? It's pretty odd, I don't know. Let's, I guess, ask Dr. Google.
87 00:15:06.302 --> 00:15:16.930 Tomer Brisker: And it turns out there's an open issue about ABC subclass check, which has very poor performance and, I think, a memory leak. Hmm!
88 00:15:17.610 --> 00:15:18.680 Tomer Brisker: Memory leak.
89 00:15:19.030 --> 00:15:20.260 Tomer Brisker: Interesting.
90 00:15:21.537 --> 00:15:25.922 Tomer Brisker: Memories. Memory is pretty expensive.
91 00:15:26.940 --> 00:15:35.520 Tomer Brisker: And it turns out I'm pretty bad at counting. So there are not actually 2 kinds of profilers; there are 3 kinds of profilers.
92 00:15:35.710 --> 00:15:41.029 Tomer Brisker: There are also memory profilers besides the runtime profilers.
93 00:15:41.170 --> 00:16:10.060 Tomer Brisker: Memory, obviously, is expensive if you need to use and allocate a lot of it, but it's also expensive in terms of performance, because if the Python runtime runs out of memory, it has to make system calls to allocate additional memory to the Python program. The garbage collector also has to go over all of the memory and clean up unused memory. So the more memory you allocate, the slower the garbage collection will be.
94 00:16:10.120 --> 00:16:38.349 Tomer Brisker: So these tools, the memory profilers, allow us to identify issues with our memory allocations. Sometimes our program can be very fast but allocate a very large amount of memory. Just recently we had a case, actually, where a certain process was crashing, and we were seeing pods being killed, so out-of-memory killed.
95 00:16:38.380 --> 00:16:45.739 Tomer Brisker: Basically, in Kubernetes, when you allocate a certain amount of memory to a process,
96 00:16:45.930 --> 00:17:15.490 Tomer Brisker: if the process runs over that memory, the Kubernetes controller will kill it, so it doesn't starve out other processes. And in this specific case memory was running out so quickly that it wasn't even sending telemetry data to Prometheus. And we actually used a memory profiler to identify where exactly this memory was being allocated so rapidly that it was killing our pods.
97 00:17:16.339 --> 00:17:34.970 Tomer Brisker: So let's talk about memory profilers a bit. There's a really nice one for Python called memray. It gives you some nice runtime statistics on your program. In this case this is a reproducer script that I was using.
98 00:17:34.970 --> 00:17:51.320 Tomer Brisker: We can see that we actually had about 11 million object allocations during the script. You can notice that the runtime here is a bit longer than the 12 seconds that it took when running with just cProfile, and that's because
99 00:17:51.330 --> 00:17:57.990 Tomer Brisker: every single memory allocation that the program does. There's some overhead to it when you're profiling it.
100 00:17:58.010 --> 00:18:04.920 Tomer Brisker: So the Runtime was a bit slower here, and we were allocating nearly 2 GB of memory.
101 00:18:07.340 --> 00:18:10.049 Tomer Brisker: When running this this process.
102 00:18:10.410 --> 00:18:23.900 Tomer Brisker: And it also gives you information like which Python memory allocator was being used, and the number of frames, that's the number of samples it was taking. This is also a statistical profiler.
103 00:18:25.123 --> 00:18:33.740 Tomer Brisker: And it also shows us a nice flame chart like we saw before. But, unlike the runtime profilers.
104 00:18:33.760 --> 00:18:59.500 Tomer Brisker: this flame chart's x-axis is actually the size of the memory allocated, so the wider a block is, the more memory was allocated within that block. And also we can see specific statistics for a certain method call. For example, in this case we were allocating 82 MB of memory, and two and a half
105 00:18:59.520 --> 00:19:05.030 Tomer Brisker: thousand objects were being allocated in a single call to subclass check.
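memray itself is run from the command line (for example `memray run script.py`), but the core idea of attributing allocation sizes and counts to code can be sketched with the standard library's tracemalloc. The allocating function below is invented for illustration:

```python
import tracemalloc

def allocate_a_lot():
    # Deliberately allocate many small objects.
    return [str(i) * 10 for i in range(50_000)]

tracemalloc.start()
data = allocate_a_lot()

# current/peak traced memory, in bytes.
current, peak = tracemalloc.get_traced_memory()
# The single source line responsible for the most allocated memory.
top = tracemalloc.take_snapshot().statistics("lineno")[0]
tracemalloc.stop()

print(f"kept {len(data)} objects")
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
print("biggest allocator:", top)
```

Unlike memray, tracemalloc traces every allocation (it is not statistical), so it adds more overhead, but it needs no external dependency.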
106 00:19:09.150 --> 00:19:11.982 Tomer Brisker: Okay, so let's go back to the bug.
107 00:19:13.050 --> 00:19:20.099 Tomer Brisker: ABC subclass check has very poor performance and, I think, a memory leak. It's been open since May 2022.
108 00:19:20.350 --> 00:19:24.630 Tomer Brisker: Anybody in the audience maybe knows who Samuel Colvin is
109 00:19:26.440 --> 00:19:32.050 Tomer Brisker: feel free to unmute. If you do anyone.
110 00:19:32.150 --> 00:19:48.209 Tomer Brisker: Samuel Colvin, that's the guy behind pydantic. Pydantic is a very popular data validation library for Python. He opened this issue almost 3 years ago, and it's still open. So
111 00:19:48.856 --> 00:19:53.720 Tomer Brisker: well, I guess case closed. Python is a slow language.
112 00:19:53.940 --> 00:20:09.769 Tomer Brisker: There's nothing to do about it. We have to rewrite our application completely using Go, Rust, Elixir, I don't know what the cool kids are using today. Gabor, I know you do Rust a lot, so I guess rewrite, right?
113 00:20:10.990 --> 00:20:16.260 Tomer Brisker: Now, I could just decide this is a case of
114 00:20:16.470 --> 00:20:26.119 Tomer Brisker: language limitations, we have to cope with it, and that's it. But I decided to dig in a little bit deeper and try to figure out if there's something we can do to resolve the issue.
115 00:20:26.310 --> 00:20:35.859 Tomer Brisker: And to dig in a little bit deeper, we need to discuss ABC a bit. Not the TV network: abstract base classes,
116 00:20:36.280 --> 00:20:41.090 Tomer Brisker: for those of you who are not familiar with abstract base classes,
117 00:20:41.677 --> 00:20:47.099 Tomer Brisker: which, as we mentioned, have fairly poor performance for the subclass check.
118 00:20:47.806 --> 00:20:54.949 Tomer Brisker: Abstract base classes are a mechanism in Python that allows us to define an abstract class
119 00:20:55.260 --> 00:21:14.820 Tomer Brisker: which defines certain methods that we require any class subclassing from it to implement. So if we try to subclass it and we don't implement these specific methods, the interpreter will yell at us, saying, Hey,
120 00:21:14.930 --> 00:21:19.040 Tomer Brisker: this class has to implement a certain method.
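A minimal example of what's being described (the class names `Shape`, `Square`, and `Broken` are invented for illustration): a subclass that skips an abstract method fails at instantiation time.

```python
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        """Every concrete Shape must implement this."""

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side

class Broken(Shape):
    pass  # forgot to implement area()

print(Square(3).area())  # works: 9
try:
    Broken()  # raises TypeError: can't instantiate abstract class
except TypeError as e:
    print("interpreter yells:", e)
```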
121 00:21:19.701 --> 00:21:34.009 Tomer Brisker: Usually we subclass ABCs, which makes sense, basically defining a specific interface we want to implement. But they have an interesting feature, which is registering virtual subclasses.
122 00:21:34.662 --> 00:21:38.999 Tomer Brisker: Which is, for example, let's say you have a
123 00:21:39.180 --> 00:21:50.129 Tomer Brisker: base class that you've defined, and you want one of the built-in types of python to be a subclass of that. Obviously, you can't
124 00:21:50.380 --> 00:21:54.520 Tomer Brisker: have int subclassing something else right?
125 00:21:55.065 --> 00:22:09.930 Tomer Brisker: And the ABC class, or rather the ABCMeta metaclass, allows us to register various other classes as virtual subclasses of the base class.
126 00:22:09.950 --> 00:22:33.360 Tomer Brisker: This is also useful if you have a class that implements multiple interfaces. Let's say you have a class that implements iterable, implements hashable, implements, I don't know, sortable, say, and a few others. Obviously, you don't want to have to declare all of these when you create the class.
127 00:22:33.860 --> 00:22:50.000 Tomer Brisker: You can just register this class as a virtual subclass. And that means that if you look at the method resolution order, the MRO, of a specific object of that class, you won't see these classes as its parents.
128 00:22:50.150 --> 00:23:15.209 Tomer Brisker: That's, by the way, the way subclass check usually works: it checks the method resolution order to see if the parent class is there. But since we allow registering subclasses virtually to classes that aren't the actual parents, there's a specific implementation within ABCMeta for subclass check and
129 00:23:15.270 --> 00:23:21.079 Tomer Brisker: instance check that allows support for this specific use case.
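A small sketch of a virtual subclass (the names `Sized` and `MyList` are invented): after `register`, `issubclass` reports True even though the ABC never appears in the MRO, which is exactly the case ABCMeta's custom subclass check exists to support.

```python
from abc import ABC

class Sized(ABC):
    pass

class MyList:
    pass

# Register MyList as a *virtual* subclass: no inheritance involved.
Sized.register(MyList)

print(issubclass(MyList, Sized))  # True, via ABCMeta.__subclasscheck__
print(Sized in MyList.__mro__)    # False: not an actual parent
```

This is also how you can make a built-in type like `int` count as a subclass of your own ABC, since you can't change what `int` inherits from.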
130 00:23:21.390 --> 00:23:46.080 Tomer Brisker: So just to better understand this use case, let's say we have this: here we have a base class which is an abstract base class. We have class A and class B inheriting from that base class, and we have a virtual subclass which isn't inheriting from the base class but is registered as a subclass of this base class, and so forth: we have virtual subclass A, etc., etc.
131 00:23:46.867 --> 00:23:54.869 Tomer Brisker: Let's say we have an object, and we want to check if that object is a subclass or an instance of base class.
132 00:23:55.358 --> 00:24:02.720 Tomer Brisker: In this case we would. Let's say, this object is of type, virtual, subclass, A, we would need to check
133 00:24:02.910 --> 00:24:04.010 Tomer Brisker: the whole
134 00:24:04.160 --> 00:24:25.339 Tomer Brisker: inheritance tree for base class to identify if this class was registered to any of the classes within that inheritance tree. So this calculation is pretty complex. There's also potentially an issue with a bad implementation of
135 00:24:25.420 --> 00:24:53.739 Tomer Brisker: caching within the implementation of abstract base class. But normally this isn't a big issue, because you wouldn't have that many classes inheriting from a single base class, maybe 2, 3, 10, 20. Usually it's not noticeable. But, as I mentioned, we're dealing with tax reports and tax filing for the US IRS.
136 00:24:53.890 --> 00:25:23.869 Tomer Brisker: There are thousands of different forms that the user needs to fill in. Think of the number of states: each state has its own forms, each form is composed of multiple different parts, and you can pretty quickly guess the rough number of classes that we have in our system to enable this fairly complex calculation, which has led us to this issue because of the
137 00:25:23.930 --> 00:25:32.029 Tomer Brisker: very large inheritance that we have from our base class that we use for the calculation.
138 00:25:32.380 --> 00:25:49.000 Tomer Brisker: And that's why, going back to the solution, this solution worked. Let's look at it a bit more in depth. Obviously, we're using type definitions, we are not barbarians; previously I was dropping them just to make it easier to read.
139 00:25:49.550 --> 00:26:11.719 Tomer Brisker: But this is very, very straightforward. These are is subclass and is instance, and what they do is they go to type (type is the base class for all classes, in case you're not familiar with it) and call subclass check or instance check on type directly. By default, when you call isinstance or issubclass,
140 00:26:11.990 --> 00:26:39.610 Tomer Brisker: the way it works is, it goes to the class that is the second parameter of the function call, basically, and it checks the metaclass for that class, looking for subclass check or instance check depending on which method you called, and then it goes up the method resolution order until it finds the implementation. In case of
141 00:26:39.700 --> 00:27:02.590 Tomer Brisker: an abstract base class, it would go to ABCMeta's subclass check. But here what we do is we basically bypass the ABC methods and go directly to type's subclass check, which is the default implementation used by Python for any types that aren't abstract base classes.
142 00:27:03.150 --> 00:27:15.699 Tomer Brisker: And then all we had to do in our code base, and it's a very simple change, is use our fast issubclass instead of using the default Python implementation,
143 00:27:15.800 --> 00:27:35.569 Tomer Brisker: checking if it's a subclass of the base model. And the reason this worked is because we didn't really care about the virtual subclass aspect of ABC. In our case we were just checking if a certain object is or isn't a subclass of our base model.
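The bypass being described can be sketched like this. The helper name `fast_issubclass` and the class names are my own for illustration; the key line is calling `type.__subclasscheck__` directly, which skips ABCMeta's expensive implementation and, as the speaker notes, also ignores virtual registrations:

```python
from abc import ABC

def fast_issubclass(cls, base):
    # Bypass ABCMeta.__subclasscheck__ and use type's default check,
    # which only walks the real MRO (no virtual-subclass lookup).
    return type.__subclasscheck__(base, cls)

class BaseModel(ABC):
    pass

class Form(BaseModel):  # a real subclass
    pass

class Virtual:  # only *registered*, not inheriting
    pass

BaseModel.register(Virtual)

print(fast_issubclass(Form, BaseModel))     # True
print(issubclass(Virtual, BaseModel))       # True  (ABC machinery)
print(fast_issubclass(Virtual, BaseModel))  # False (bypass skips register)
```

The trade-off is exactly the one stated in the talk: this is only safe when you never rely on virtually registered subclasses of the class you're checking against.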
144 00:27:35.670 --> 00:27:44.720 Tomer Brisker: This wouldn't work, obviously, if we were actually registering our objects into the base model instead of directly inheriting from it.
145 00:27:44.850 --> 00:27:47.290 Tomer Brisker: But in our case this was good enough.
146 00:27:47.520 --> 00:27:57.079 Tomer Brisker: and, as you can see, there's another very nice added benefit to profiling. It lets you add nice statistics to your pull requests.
147 00:27:57.812 --> 00:28:08.119 Tomer Brisker: For example, runtime went down from 33 seconds to 26 seconds, memory usage was improved by 50%, etc., etc.
148 00:28:08.120 --> 00:28:31.359 Tomer Brisker: Actually, I didn't even have to implement this in all of the places. I only had to switch to using the fast subclass in very specific places that I identified, using profiling, as being the most common places this was being called from, and this already gave me a very significant improvement.
149 00:28:31.360 --> 00:28:56.700 Tomer Brisker: And when we actually deployed this fix, it turns out that the impact was even higher than in the specific use case that I was profiling, because this had impact across the system. It significantly, as you can see here, reduced our memory load, it also improved the system runtime in general, and the system load time was drastically reduced,
150 00:28:56.960 --> 00:29:08.460 Tomer Brisker: making our deployments much faster and saving a lot of costs. Questions, anybody?
151 00:29:11.030 --> 00:29:16.439 Gabor Szabo: First of all, thanks, thanks for the presentation. Can you go back one slide?
152 00:29:16.880 --> 00:29:17.540 Tomer Brisker: Yes.
153 00:29:17.680 --> 00:29:20.002 Gabor Szabo: What is this bump there?
154 00:29:20.848 --> 00:29:37.821 Tomer Brisker: That's a good question. Actually, we're using rolling updates. So basically, we spun up a few pods, switched over to them, then spun up a few more pods, and these are...
155 00:29:39.060 --> 00:29:56.730 Tomer Brisker: This is the time when there were still some of the older pods running in parallel with the new pods that were using less memory. So the first drop is when we killed the first batch of the old pods, and the second drop is when we killed the second batch.
156 00:29:57.290 --> 00:29:57.870 Gabor Szabo: Hmm!
157 00:30:00.030 --> 00:30:05.289 Gabor Szabo: But why did this... I still don't understand why it went up, then went up again.
158 00:30:05.554 --> 00:30:10.840 Tomer Brisker: So we started a few pods, killed a few, and then started a bunch more and then killed the rest.
159 00:30:11.010 --> 00:30:11.490 Tomer Brisker: Oh.
160 00:30:11.490 --> 00:30:12.060 Gabor Szabo: Okay.
161 00:30:12.280 --> 00:30:18.859 Tomer Brisker: Oh, man, this is just the loading of the new pods while the old ones were still running.
162 00:30:19.650 --> 00:30:20.610 Gabor Szabo: Okay. Nice.
163 00:30:24.700 --> 00:30:26.580 Tomer Brisker: Any other questions. Anybody?
164 00:30:29.900 --> 00:30:31.789 Tomer Brisker: Okay, thank you very much.
165 00:30:32.380 --> 00:30:39.630 Gabor Szabo: No, it seems so. Thank you. Thank you for giving this presentation and everyone for being here listening.
166 00:30:40.250 --> 00:30:41.120 Gabor Szabo: And
167 00:30:42.190 --> 00:30:49.600 Gabor Szabo: I'm going to stop the video. But please remember to like the video and follow the channel, and see you next time.
168 00:30:50.180 --> 00:30:52.370 Tomer Brisker: Bye, bye, thanks for having me, Gabor.
169 00:30:52.370 --> 00:30:53.150 Gabor Szabo: Bye-bye.
for loop and a random number generator.
In this talk, we'll see how to use Monte Carlo simulations to solve various problems that might intimidate you due to lack of math skills.

1 00:00:02.020 --> 00:00:20.500 Gabor Szabo: So hi, and welcome to the Codemaven Meetup Group and to the Codemaven Youtube Channel, in case you are watching this on Youtube. My name is Gabor. I provide training services in Python and Rust and help companies get started using these languages.
2 00:00:20.740 --> 00:00:28.840 Gabor Szabo: And I also think that it's important to share knowledge among people. So that's why I'm organizing these events, these meetings.
3 00:00:29.820 --> 00:00:40.000 Gabor Szabo: So I would like to welcome everyone who joined us at this meeting, and especially Mickey, for agreeing to give this presentation, and that's it. The floor is yours, Mickey.
4 00:00:40.694 --> 00:00:44.060 Miki Tebeka: Hi, everyone. I am going to share my screen.
5 00:00:44.390 --> 00:00:48.980 Miki Tebeka: Then we will start sure.
6 00:00:50.920 --> 00:00:51.690 Miki Tebeka: Okay.
7 00:00:52.080 --> 00:01:03.720 Miki Tebeka: so we are going to talk about what I call simulations for the mathematically challenged. And this is about the tool called simulation. How you can solve various problems.
8 00:01:03.720 --> 00:01:06.970 Gabor Szabo: Sorry. Just just one thing. Can can you move the.
9 00:01:07.230 --> 00:01:07.910 Miki Tebeka: Oh, there!
10 00:01:07.910 --> 00:01:09.300 Gabor Szabo: This again, I'll do this.
11 00:01:10.010 --> 00:01:11.290 Gabor Szabo: Yeah, thanks.
12 00:01:11.290 --> 00:01:17.899 Miki Tebeka: Moved. Okay, sorry. Okay. So my name is Mickey. I've been a professional developer
13 00:01:18.210 --> 00:01:25.368 Miki Tebeka: for 37 years now, give or take. I work mostly with Python and Go.
14 00:01:26.200 --> 00:01:31.599 Miki Tebeka: I teach. I consult, I write books, I do videos. I
15 00:01:31.870 --> 00:01:46.060 Miki Tebeka: enjoy myself in a very geeky way. And this is a tool that I used on a couple of occasions. I think it's very simple, but not a lot of people are aware of it,
16 00:01:46.580 --> 00:01:51.379 Miki Tebeka: and it starts usually with the problem that you have usually a data related problem.
17 00:01:53.670 --> 00:01:58.850 Miki Tebeka: You have a cache. You want to know, what are the odds that, given that
18 00:01:59.080 --> 00:02:10.100 Miki Tebeka: amount of cache hits, what's the average latency? Other questions that usually involve statistics or probability.
19 00:02:10.680 --> 00:02:16.470 Miki Tebeka: And then you say, okay, you know, we're all geeks, right? So we go and hit the books.
20 00:02:17.910 --> 00:02:23.169 Miki Tebeka: But then you start seeing all these kinds of equations.
21 00:02:23.290 --> 00:02:28.480 Miki Tebeka: and usually around that time I say, you know what, maybe this problem is not that important,
22 00:02:28.670 --> 00:02:30.920 Miki Tebeka: and I'll move on to do something else.
23 00:02:31.680 --> 00:02:37.979 Miki Tebeka: And what I wanted to show you is basically that if you can write a for loop,
24 00:02:38.110 --> 00:02:39.529 Miki Tebeka: you can do statistics.
25 00:02:40.040 --> 00:02:46.170 Miki Tebeka: and that's it. You don't need more than that. You need a for loop, and you need random,
26 00:02:46.530 --> 00:02:55.529 Miki Tebeka: and these are the only 2 tools that you need in order to work. By the way, I'm going to show code, and if you have questions, feel free to ask
27 00:02:58.270 --> 00:03:12.519 Miki Tebeka: if you don't understand the code, if you want to learn about other things. So what we're going to do is talk about these 5 problems. We're going to talk about the game of Catan, and what are the best tiles; we're going to calculate pi
28 00:03:12.800 --> 00:03:17.480 Miki Tebeka: randomly, which sounds weird, but it's another
29 00:03:18.120 --> 00:03:23.610 Miki Tebeka: interesting use for simulations. We're going to solve the birthday problem:
30 00:03:23.830 --> 00:03:31.290 Miki Tebeka: given 23 people, I think, what are the odds that 2 people, say in this group, have
31 00:03:32.900 --> 00:03:55.700 Miki Tebeka: the same birthday? We're going to see what are the odds that a person is sick or not, given a test that says that he is sick. And we're going to talk about the Monty Hall problem, which is statistically interesting, but also a very philosophical question. I'm not a philosopher, so you can discuss it later and see what's going on.
32 00:03:56.150 --> 00:04:02.910 Miki Tebeka: So let's start with Catan. Right? So in Catan we have these tiles, and every tile has a number on it.
33 00:04:03.290 --> 00:04:06.159 Miki Tebeka: and then you throw a couple of dices.
34 00:04:06.370 --> 00:04:13.100 Miki Tebeka: and if the number of the dices matches the number of your tile, then you can do things in the game.
35 00:04:13.290 --> 00:04:20.110 Miki Tebeka: So at the beginning you can pick where you want to put your pieces, and it's up to you to decide
36 00:04:20.269 --> 00:04:26.870 Miki Tebeka: which tile you want. And you want to know, you know, which tiles are going to get the most hits. What is the probability
37 00:04:27.060 --> 00:04:28.349 Miki Tebeka: of that?
38 00:04:29.890 --> 00:04:30.850 Miki Tebeka: So
39 00:04:33.760 --> 00:04:39.109 Miki Tebeka: This is this is that. Yes, and I'm old. I'm using vim.
40 00:04:39.440 --> 00:04:54.299 Miki Tebeka: Sue me later. But I think the code is clear enough. So basically, what we're going to do is a dice roll, which is basically just a random number between one and 6. This is coming from there.
41 00:04:54.510 --> 00:05:03.489 Miki Tebeka: And then what we're going to do is run a simulation. So we're going to run a lot of dice rolls. So I'm going to do a million
42 00:05:05.690 --> 00:05:07.109 Miki Tebeka: dice rolls,
43 00:05:08.070 --> 00:05:21.110 Miki Tebeka: and every time I'm going to do 2 dice rolls, right? So I get a number, and I'm updating some kind of counter, right? We have Counter from collections. This is a special data structure that basically stores
44 00:05:21.520 --> 00:05:29.950 Miki Tebeka: how many we have of each value.
45 00:05:30.090 --> 00:05:35.210 Miki Tebeka: And then I'm going over all the numbers right. The minimal
46 00:05:35.680 --> 00:05:48.199 Miki Tebeka: number that you can get with rolling 2 dices is 2, two ones, and the maximum is 12, but the range is half open, so we're not going to get there. I'm going to show the fraction:
47 00:05:49.340 --> 00:05:53.260 Miki Tebeka: how many of
48 00:05:54.170 --> 00:05:57.799 Miki Tebeka: this number out of the total counts, and I'm going to print it out.
49 00:05:57.920 --> 00:06:15.039 Miki Tebeka: That's the code that I'm going to run. And then, if you run it with python, you're going to see, now we get probabilities, right? And you see that, unsurprisingly, 7
50 00:06:15.290 --> 00:06:34.940 Miki Tebeka: has the best percentage in a roll of 2 dices. And you can do it what I call the hard way, which is just, you know, doing all the combinations of all the dice rolls and then calculating how many there are. But for me as a programmer, this is much easier.
51 00:06:35.610 --> 00:06:39.139 Miki Tebeka: Why, they just write some code. This is like 20 lines of code.
52 00:06:39.370 --> 00:06:45.859 Miki Tebeka: pretty simple, and now I have it. So these are the basics of simulation: we basically
53 00:06:47.220 --> 00:06:54.540 Miki Tebeka: create scenarios using some kind of randomness in each scenario, and then we are going to
54 00:06:55.030 --> 00:07:04.519 Miki Tebeka: calculate some statistics about what happened in every scenario, and finally display the result. And this is known as a simulation, or Monte Carlo simulation
55 00:07:04.710 --> 00:07:05.669 Miki Tebeka: for what we
56 00:07:08.870 --> 00:07:10.259 Miki Tebeka: questions about this one
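The dice script being walked through looks roughly like this (a reconstruction of the approach described, not the speaker's actual code):

```python
"""Roll two dice a million times; print the probability of each sum."""
from collections import Counter
from random import randint

def roll():
    # One die: a random number between 1 and 6.
    return randint(1, 6)

n = 1_000_000
counts = Counter()
for _ in range(n):
    counts[roll() + roll()] += 1  # sum of two dice

# Possible sums are 2..12; range is half-open, so go up to 13.
for total in range(2, 13):
    print(f"{total:2d}: {counts[total] / n:.2%}")
```

Running it shows 7 as the most likely sum, at about 1/6, matching the combinatorial answer (6 of the 36 equally likely pairs sum to 7).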
57 00:07:16.070 --> 00:07:17.270 Miki Tebeka: no questions.
58 00:07:17.570 --> 00:07:21.660 Miki Tebeka: Alright. By the way, if you ask questions, just open the mic and ask questions, cause
59 00:07:22.140 --> 00:07:26.409 Miki Tebeka: it's hard for me to focus both on the code and on the Zoom screen.
60 00:07:26.960 --> 00:07:34.329 Miki Tebeka: Okay, the next thing we're going to do it's pretty interesting. We're going to calculate pi again randomly.
61 00:07:34.830 --> 00:07:35.890 Miki Tebeka: So
62 00:07:36.820 --> 00:07:42.999 Miki Tebeka: what the way we're going to do it is, we're going to say, let's take a circle which has a radius of one.
63 00:07:43.810 --> 00:07:48.237 Miki Tebeka: And now we're going to concentrate only on the top right
64 00:07:48.990 --> 00:07:57.920 Miki Tebeka: square, which is the bounding square for this circle, and we're going to start getting random dots:
65 00:07:58.290 --> 00:08:02.520 Miki Tebeka: if the dot falls in the circle, I'm going to paint them as
66 00:08:03.110 --> 00:08:08.290 Miki Tebeka: green, and if it falls outside of the circle, I'm going to paint them as red.
67 00:08:08.860 --> 00:08:16.689 Miki Tebeka: Okay? So once I've done it enough times I can calculate what is the ratio between the green dots and the red dots.
68 00:08:17.180 --> 00:08:21.580 Miki Tebeka: and this ratio is a quarter of pi,
69 00:08:24.010 --> 00:08:28.210 Miki Tebeka: right? Because the area of the
70 00:08:30.090 --> 00:08:32.335 Miki Tebeka: circle is
71 00:08:33.780 --> 00:08:37.330 Miki Tebeka: pi r squared. But r is one, so it's just pi,
72 00:08:37.840 --> 00:08:42.050 Miki Tebeka: so basically, the fraction of dots that falls inside the circle should give pi,
73 00:08:42.320 --> 00:08:50.420 Miki Tebeka: but we're doing it only on a quarter of a circle, so this is a quarter of pi, and we are going to get the number pi.
74 00:08:50.800 --> 00:08:56.720 Miki Tebeka: So this is pi.py.
75 00:08:57.840 --> 00:09:13.699 Miki Tebeka: So again, we're going to import, this time, uniform from random, and then sqrt. And this is going to run for a bit, so I'm going to display a progress bar with something called tqdm.
76 00:09:15.030 --> 00:09:22.700 Miki Tebeka: And then the radius is one, and we have n, which is the number of iterations, which is a hundred million,
77 00:09:23.370 --> 00:09:35.250 Miki Tebeka: and inner, is the number of points that are inside the circle, which I'm going to do with to start with 0, right? So I'm getting X and Y, which is uniform between 0 and one.
78 00:09:35.960 --> 00:09:42.390 Miki Tebeka: And then, if the point falls inside the circle, I'm just going to increment inner.
79 00:09:43.700 --> 00:09:46.760 Miki Tebeka: So this is how many points fell inside the circle.
80 00:09:48.670 --> 00:10:01.319 Miki Tebeka: Now the ratio is inner divided by N. And as we said, this is quarter of a pi, so we need to print out 4 times this ratio to get to the number of pi
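The pi estimation being described looks roughly like this (a sketch of the approach, not the talk's exact script; I use a smaller n than the hundred million in the talk and skip the tqdm progress bar so it runs quickly):

```python
"""Estimate pi by sampling random points in the unit square."""
from random import uniform

n = 1_000_000  # the talk uses 100_000_000; smaller here for speed
inner = 0
for _ in range(n):
    x = uniform(0, 1)
    y = uniform(0, 1)
    if x * x + y * y <= 1:  # inside the quarter circle of radius 1
        inner += 1

# inner/n approximates a quarter of pi, so multiply by 4.
pi_estimate = 4 * inner / n
print(pi_estimate)
```

Comparing squared distance to 1 avoids the sqrt call entirely, since sqrt(d) <= 1 exactly when d <= 1.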
81 00:10:07.660 --> 00:10:14.720 Miki Tebeka: And you know what, I'm going to also run the time command to show you how much time it took. So this is a hundred million
82 00:10:15.110 --> 00:10:19.840 Miki Tebeka: runs, so it's going to take a little bit of time.
83 00:10:20.300 --> 00:10:21.420 Miki Tebeka: And
84 00:10:23.110 --> 00:10:30.290 Miki Tebeka: and it's a good thing in the winter, because it also warms up your CPU, so you can warm yourself without using the A/C.
85 00:10:35.020 --> 00:10:43.789 Miki Tebeka: As I said, tqdm, which shows the progress bar, is really nice, especially if you have long-running processes: you know if your process is actually running or if it's stuck.
86 00:10:43.930 --> 00:10:45.236 Miki Tebeka: So we're
87 00:10:46.570 --> 00:10:56.980 Miki Tebeka: So it's done, and we see that we got 3.14,
88 00:10:57.380 --> 00:11:04.729 Miki Tebeka: yeah, which is close enough to pi. And it took us about 41 seconds to run.
89 00:11:06.320 --> 00:11:12.810 Miki Tebeka: Now, one thing that can help you with simulations is PyPy.
90 00:11:12.980 --> 00:11:24.209 Miki Tebeka: So if you're not familiar, the Python we are using is called CPython; it's Python written in C. There are other Pythons, such as Jython, which is Python written in Java,
91 00:11:24.530 --> 00:11:30.700 Miki Tebeka: and several others, and MicroPython for micro devices, and there is PyPy,
92 00:11:30.970 --> 00:11:39.090 Miki Tebeka: which is a Python written in Python, and it has several optimizations that are not in CPython,
93 00:11:39.330 --> 00:11:48.439 Miki Tebeka: especially a JIT compiler, though from 3.13 and up we have an experimental JIT compiler in CPython, which should bring it closer.
94 00:11:49.475 --> 00:11:54.449 Miki Tebeka: And if I'm going to run PyPy on that thing,
95 00:11:54.640 --> 00:11:56.969 Miki Tebeka: you're going to see the difference.
96 00:11:57.100 --> 00:12:05.319 Miki Tebeka: Right? Oh, I forgot the time command. Okay, you see, this is
97 00:12:05.520 --> 00:12:12.520 Miki Tebeka: 3.5 seconds, so more than 10 times faster on these calculations. So I'm not going to... I'm not saying,
98 00:12:13.420 --> 00:12:22.700 Miki Tebeka: use PyPy for everything. There are some compatibility issues, especially with external libraries and maybe other things, and it's not
99 00:12:22.890 --> 00:12:30.900 Miki Tebeka: on par with CPython. Currently, I think they're on
100 00:12:31.160 --> 00:12:36.990 Miki Tebeka: the equivalent of 3.10, and right now in Python we are on 3.13. So they take some time;
101 00:12:37.400 --> 00:12:40.189 Miki Tebeka: they're catching up, and then they close the gap.
102 00:12:40.690 --> 00:12:48.000 Miki Tebeka: But it's a nice tool to know and work with. Questions about this one?
103 00:12:56.360 --> 00:12:59.850 Gabor Szabo: It's not about this one, and probably it's not
104 00:13:00.490 --> 00:13:13.559 Gabor Szabo: relevant to this presentation, but maybe it's for another time: how come PyPy can be so much faster? I would really like to understand this.
105 00:13:13.560 --> 00:13:29.079 Miki Tebeka: The current CPython itself does not do any optimization. So if you look at a C compiler, it has tons of optimizations: loop unrolling, constant folding, a lot of things that the Python interpreter is not doing at all.
106 00:13:29.784 --> 00:13:40.270 Miki Tebeka: And the other thing is that there is a technology called jit, which is just in time compilation, which means that you run the code. Once in python.
107 00:13:40.430 --> 00:13:45.080 Miki Tebeka: you see what happens there and then you generate specific machine code.
108 00:13:45.230 --> 00:13:52.810 Miki Tebeka: And next time you call the function, it is actually not the Python function that's called, but the
109 00:13:53.160 --> 00:14:03.310 Miki Tebeka: optimized generated machine code for that. And this is something that Node.js and other dynamic languages are using, including Java,
110 00:14:03.590 --> 00:14:08.849 Miki Tebeka: to make things faster. And PyPy has a
111 00:14:09.040 --> 00:14:12.810 Miki Tebeka: very good JIT compiler that has been developed for many years.
112 00:14:13.449 --> 00:14:21.089 Miki Tebeka: And that's why it's faster. Basically, PyPy is written in Python, but eventually generates
113 00:14:21.520 --> 00:14:28.779 Miki Tebeka: an executable in machine code. So it is pretty fast in this case.
114 00:14:32.600 --> 00:14:34.444 Miki Tebeka: Okay, so
115 00:14:35.730 --> 00:14:43.390 Miki Tebeka: someone joked that, you know, this is every time they go on a Zoom Meeting. That's what comes to their mind right? This is very similar to
116 00:14:43.630 --> 00:14:55.150 Miki Tebeka: to Zoom, and the idea is, and this is known as the birthday problem. Given a group of people.
117 00:14:55.500 --> 00:15:00.259 Miki Tebeka: what are the odds that 2 people have the same birthday?
118 00:15:00.800 --> 00:15:01.740 Miki Tebeka: And
119 00:15:08.330 --> 00:15:10.020 Miki Tebeka: if a
120 00:15:11.310 --> 00:15:19.429 Miki Tebeka: what I picked as a group size is 23 people. So I would like you to just take a guess:
121 00:15:19.730 --> 00:15:24.659 Miki Tebeka: like we have a group of 23 people. What are the odds that 2 people have the same birthday?
122 00:15:24.930 --> 00:15:35.990 Miki Tebeka: Yeah. So a random birthday. So I'm going to basically say that I'm not looking at dates, I'm looking at day of year. So we have 365 days per year. So
123 00:15:36.480 --> 00:15:41.359 Miki Tebeka: a random date is basically a number between 1 and 365. That's
124 00:15:41.590 --> 00:15:50.350 Miki Tebeka: how many days we have in a year. And now what I'm saying is: given a group of a given size, are there any duplicates in the group?
125 00:15:50.770 --> 00:16:00.790 Miki Tebeka: Alright? So basically I'm creating a set and then going over the numbers, generating a random birthday. And then, if we've already seen this birthday,
126 00:16:01.290 --> 00:16:06.209 Miki Tebeka: we say there is a duplication, so at least 2 people have the same birthday in that group.
127 00:16:06.350 --> 00:16:16.929 Miki Tebeka: otherwise we add it to the set and continue. And finally, if we're out of the for loop, we say no, there are no duplicates in this group. So basically, we draw a group of
128 00:16:17.140 --> 00:16:18.390 Miki Tebeka: random numbers.
129 00:16:19.026 --> 00:16:24.570 Miki Tebeka: between 1 and 365, and say: is there an overlap here somewhere?
130 00:16:26.440 --> 00:16:38.710 Miki Tebeka: Alright? And now we start the simulation again. So the simulation is going to run a million times. The group size is 23. And again, the number of duplications is
131 00:16:38.900 --> 00:16:48.529 Miki Tebeka: 0 to begin with. And then we run the simulation. And if there is a duplication, we say, Okay, let's increment duplication.
132 00:16:48.640 --> 00:16:54.299 Miki Tebeka: Finally, we are printing what the fraction is from the total.
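The simulation just described can be sketched as follows. This is my own reconstruction of what's on the slides, not the speaker's exact code; the names and the smaller default iteration count are mine (the talk runs it a million times):

```python
import random

DAYS = 365  # treat a birthday as a day-of-year number, 1..365

def has_duplicates(group_size):
    """Draw `group_size` random birthdays; True if any two collide."""
    seen = set()
    for _ in range(group_size):
        day = random.randint(1, DAYS)
        if day in seen:
            return True
        seen.add(day)
    return False

def simulate(n=100_000, group_size=23):
    """Fraction of simulated groups containing a shared birthday."""
    duplicates = sum(has_duplicates(group_size) for _ in range(n))
    return duplicates / n

print(simulate())  # ≈ 0.5 for 23 people
```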
133 00:16:55.750 --> 00:16:58.950 Miki Tebeka: Okay, so remember the number you guessed.
134 00:17:09.520 --> 00:17:12.040 Miki Tebeka: Anyone guessed 50%?
135 00:17:15.579 --> 00:17:18.080 Miki Tebeka: Right? This seems pretty high, right?
136 00:17:18.319 --> 00:17:24.920 Miki Tebeka: And this is something that happens with statistics and probabilities a lot of the time. This is not intuitive.
137 00:17:25.069 --> 00:17:33.850 Miki Tebeka: A lot of times we think that we know the answer, and we say, you know, there are a lot of days, only 23 people, how come? But
138 00:17:34.070 --> 00:17:41.780 Miki Tebeka: if you do it, and you do the statistical computation, you'll get exactly the same thing. That's the idea. But
139 00:17:42.430 --> 00:17:48.739 Miki Tebeka: again, this is a more complicated
140 00:17:49.250 --> 00:17:53.420 Miki Tebeka: computation. And for me, as a developer, this is pretty easy:
141 00:17:53.760 --> 00:17:55.630 Miki Tebeka: you know, a for loop and random.
142 00:17:56.360 --> 00:17:57.329 Miki Tebeka: I'm done with it.
143 00:18:01.430 --> 00:18:10.880 Gabor Szabo: Yeah, it's nice. What would be interesting, I think, is to see if you take 2 people, 3 people and so on. Up to 365 people.
144 00:18:11.340 --> 00:18:15.830 Gabor Szabo: and for each number to see the probability, and then graph it.
145 00:18:16.185 --> 00:18:27.929 Miki Tebeka: Pretty sure there is. If you go on the web, this is a really well-known problem; people have done it already. You can probably see the graph for that.
146 00:18:29.194 --> 00:18:37.129 Miki Tebeka: But this is not exactly true, right? This is a joke, right?
147 00:18:37.280 --> 00:18:46.709 Miki Tebeka: And the chances of a piece of bread falling butter side down are directly proportional to the cost of the carpet. Like, it's not 50-50.
148 00:18:47.040 --> 00:18:49.360 Miki Tebeka: This is a corollary to Mr. Murphy.
149 00:18:49.580 --> 00:18:50.490 Miki Tebeka: Oh.
150 00:18:50.730 --> 00:19:02.460 Miki Tebeka: And the thing is that births do not have a uniform distribution over the days of the year, especially if you're born on February 29th,
151 00:19:03.080 --> 00:19:09.370 Miki Tebeka: right? This is reducing the odds that you're going to have someone with the same birthday.
152 00:19:12.510 --> 00:19:13.389 Miki Tebeka: So
153 00:19:17.110 --> 00:19:32.149 Miki Tebeka: We have a model, and there's a saying by George Box, which I really believe, that all models are wrong, but some are useful. Right? So the model needs to be interesting enough,
154 00:19:34.740 --> 00:19:35.405 Miki Tebeka: or
155 00:19:38.780 --> 00:19:45.018 Miki Tebeka: the answer should be interesting enough, to give you a useful answer. But you can have a better model. So I actually took some data
156 00:19:45.590 --> 00:19:48.589 Miki Tebeka: about birthdays in general.
157 00:19:49.030 --> 00:19:53.730 Miki Tebeka: Right? So now, this is
158 00:19:55.770 --> 00:20:06.170 Miki Tebeka: US births. Right? So the CSV file gives you the birthdays. Oh, this is Windows newlines. Yay,
159 00:20:07.250 --> 00:20:08.130 Miki Tebeka: so
160 00:20:09.380 --> 00:20:14.699 Miki Tebeka: We have a year, month, day of birth, day of the week, and how many births there were
161 00:20:15.070 --> 00:20:17.320 Miki Tebeka: per every one of them.
162 00:20:17.941 --> 00:20:24.589 Miki Tebeka: And then what I'm going to do is actually use weighted probabilities,
163 00:20:24.910 --> 00:20:28.047 Miki Tebeka: meaning it's not that every day has the
164 00:20:29.080 --> 00:20:41.680 Miki Tebeka: same probability, but I'm going to use these frequencies. And here I'm going to switch over to tools from the scientific Python side of things.
165 00:20:42.410 --> 00:20:45.390 Miki Tebeka: And these tools are pandas and NumPy.
166 00:20:46.170 --> 00:21:04.579 Miki Tebeka: And I'm using pandas. If you're not familiar with pandas (this may be a topic for a different talk, or we've probably already done it before), this is a really, really great library for working with data. I'm using it to load the CSV.
167 00:21:06.360 --> 00:21:15.900 Miki Tebeka: So basically, I'm loading the birthdays from this CSV file, and I'm
168 00:21:18.510 --> 00:21:20.829 Miki Tebeka: converting things to datetime,
169 00:21:20.980 --> 00:21:25.829 Miki Tebeka: and then I'm saying that the birthday is the day and the month,
170 00:21:27.370 --> 00:21:36.752 Miki Tebeka: and then I'm doing what is known as a group by to get all of the people that were born on the same day in the month
171 00:21:38.260 --> 00:21:45.690 Miki Tebeka: divided by the total number of births, and then return the index and the values.
172 00:21:46.010 --> 00:21:52.929 Miki Tebeka: So once I have that, I can do something else. I'm going to use NumPy now.
173 00:21:53.320 --> 00:21:59.770 Miki Tebeka: NumPy has a random choice. Basically, choice says: pick things from a group.
174 00:22:00.274 --> 00:22:10.019 Miki Tebeka: And if you don't say anything, it's going to give an equal probability to everything. But you can provide the size and the probabilities,
175 00:22:10.230 --> 00:22:18.226 Miki Tebeka: and then it is going to do a weighted probability, meaning there's a bigger chance: if more people are born on,
176 00:22:19.980 --> 00:22:24.335 Miki Tebeka: what is that, September 9th, let's say, then
177 00:22:24.950 --> 00:22:30.960 Miki Tebeka: there is a bigger chance it's going to pick September 9th versus February 29th,
178 00:22:31.430 --> 00:22:36.951 Miki Tebeka: let's say, or another day, April 26th. For some reason, I don't know why,
179 00:22:37.570 --> 00:22:42.380 Miki Tebeka: people don't like that birthday, I think. Or July 4th.
180 00:22:43.140 --> 00:22:46.209 Miki Tebeka: I don't know why they have less.
181 00:22:46.480 --> 00:22:53.181 Miki Tebeka: Okay, so I'm loading the birthdays from the CSV file. I'm doing
182 00:22:56.150 --> 00:23:04.390 Miki Tebeka: and again, now 100,000 simulations this time. The group size is 23, and
183 00:23:04.550 --> 00:23:10.139 Miki Tebeka: duplicates is 0. And again, I'm doing the same for loop. And again the same thing, the time
184 00:23:10.290 --> 00:23:11.440 Miki Tebeka: I'm doing here.
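The weighted version he describes can be sketched roughly as follows. This is my own reconstruction: the column names (`year`, `month`, `date_of_month`, `births`) are a guess at the file shown in the talk, so adjust them to your data:

```python
import numpy as np
import pandas as pd

def load_birthdays(csv_file):
    """Return (birthdays, probabilities) from a births CSV.

    Assumes columns named year, month, date_of_month, births --
    a guess at the file shown in the talk.
    """
    df = pd.read_csv(csv_file)
    # birthday = (month, day), ignoring the year
    births = df.groupby(["month", "date_of_month"])["births"].sum()
    probs = births / births.sum()  # frequencies -> probabilities
    return probs.index.to_numpy(), probs.to_numpy()

def has_duplicates(birthdays, probs, group_size):
    """Weighted draw of birthdays; True if any two collide."""
    picks = np.random.choice(len(birthdays), size=group_size, p=probs)
    return len(set(picks)) < group_size
```

The key difference from the uniform version is the `p=` argument to `np.random.choice`, which makes common birthdays more likely to be drawn.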
185 00:23:17.470 --> 00:23:22.520 Miki Tebeka: And if I'm running this one, we get the same number.
186 00:23:23.640 --> 00:23:33.690 Miki Tebeka: But this is rounded. Pretty sure if I'm going to show more digits after the decimal, you're going to see the difference. But the idea is that
187 00:23:34.725 --> 00:23:38.680 Miki Tebeka: we made our model more accurate,
188 00:23:38.790 --> 00:23:42.699 Miki Tebeka: but even the inaccurate model was good enough,
189 00:23:42.850 --> 00:23:48.714 Miki Tebeka: right? It was good enough, and that's what's meant by saying that all models are wrong, but some are useful.
190 00:23:49.760 --> 00:23:59.990 Miki Tebeka: You don't have to have the exact distribution or exact information about your data to gain correct insights from the data. And a lot of the time
191 00:24:00.420 --> 00:24:02.260 Miki Tebeka: you can do approximations.
192 00:24:02.910 --> 00:24:09.580 Miki Tebeka: And statistically, it's still good enough. Questions?
193 00:24:16.730 --> 00:24:17.690 Miki Tebeka: No question.
194 00:24:18.270 --> 00:24:25.577 Miki Tebeka: Okay, so this is about
195 00:24:26.270 --> 00:24:32.039 Miki Tebeka: a question that they actually gave to doctors. They said that there is a test for a disease
196 00:24:32.370 --> 00:24:42.200 Miki Tebeka: that has 5% false positives, meaning that for 5% of the people the test will tell you that you're sick, even though you're not sick.
197 00:24:42.530 --> 00:24:44.750 Miki Tebeka: This is what is known as a false positive.
198 00:24:45.320 --> 00:24:51.000 Miki Tebeka: And it says that it's known that the disease strikes about one person in a thousand in the population.
199 00:24:52.330 --> 00:24:55.103 Miki Tebeka: Okay, they said, okay, now we're taking
200 00:24:55.970 --> 00:25:08.399 Miki Tebeka: a random test. We take a random person from the street, we do the test, and the test says this person is sick. What is the actual probability that this patient is really sick?
201 00:25:09.550 --> 00:25:11.930 Miki Tebeka: Okay? And think about that.
202 00:25:12.388 --> 00:25:18.140 Miki Tebeka: With Covid, for example, right? They swab you for Covid and it says you have Covid. Now
203 00:25:18.430 --> 00:25:38.570 Miki Tebeka: you're home, you're not allowed to go out. Sometimes, you know, for other diseases, the doctor says: okay, you're sick, now you need a treatment, maybe a violent treatment, maybe something which costs a lot of money. So they asked these doctors to see if they're actually basing their decisions on something which makes sense or not.
204 00:25:38.690 --> 00:25:43.869 Miki Tebeka: Right. So talking about true positives, right? So
205 00:25:44.450 --> 00:25:50.620 Miki Tebeka: If I predicted that the person is sick and they're actually sick, this is what's known as a true positive.
206 00:25:51.430 --> 00:25:57.499 Miki Tebeka: We talked about false positives, which is a person who is said to be sick but is actually healthy.
207 00:25:59.040 --> 00:26:03.640 Miki Tebeka: And we also have a false negative, which is a
208 00:26:03.860 --> 00:26:06.710 Miki Tebeka: person who is sick, and we said that they're healthy.
209 00:26:06.920 --> 00:26:15.630 Miki Tebeka: And we have a true negative, which is a healthy person that the test says is healthy. Right? Remember: positive is sick, negative is healthy.
210 00:26:16.180 --> 00:26:19.649 Miki Tebeka: That's it. This thing is known as a confusion matrix.
211 00:26:19.910 --> 00:26:25.470 Miki Tebeka: And on the confusion matrix, you can do a lot of
212 00:26:25.910 --> 00:26:29.789 Miki Tebeka: calculations. When you measure your models.
213 00:26:30.060 --> 00:26:43.740 Miki Tebeka: especially prediction models, you start with the confusion matrix and then say: what is the percentage of true positives? There's precision, recall, and several other things that come to mind.
214 00:26:44.070 --> 00:26:51.570 Miki Tebeka: I think the name confusion matrix is also very good, because
215 00:26:51.760 --> 00:26:55.139 Miki Tebeka: I always get confused by that. I need to go back and think about
216 00:26:56.000 --> 00:26:58.449 Miki Tebeka: what every term is saying. But we do that.
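The metrics he names can be computed directly from the four confusion-matrix cells. A small sketch of my own, with hypothetical counts (a population of 10,000, roughly matching the talk's 1-in-1,000 disease rate and 5% false positives; the exact numbers are illustrative):

```python
def precision(tp, fp):
    """Of everyone flagged positive, how many really are positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everyone actually positive, how many did we flag?"""
    return tp / (tp + fn)

# Hypothetical counts: 10 sick people, all caught (no false negatives),
# plus roughly 5% of the 9,990 healthy people wrongly flagged.
tp, fp, fn, tn = 10, 490, 0, 9500

print(precision(tp, fp))  # 0.02
print(recall(tp, fn))     # 1.0
```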
217 00:26:58.600 --> 00:27:04.280 Miki Tebeka: So let's have a look at the simulation. Okay, so
218 00:27:07.510 --> 00:27:10.059 Miki Tebeka: I have a function here. So
219 00:27:10.560 --> 00:27:14.720 Miki Tebeka: I want to say that one in
220 00:27:15.310 --> 00:27:27.320 Miki Tebeka: a thousand. Right? They said one in a thousand is sick. So basically what they're saying is: I'm drawing a number between 1 and N, and checking if this number is 1. And I can pick any number between 1 and N. So 17,
221 00:27:27.780 --> 00:27:31.560 Miki Tebeka: 3, any number will work. I just need
222 00:27:31.730 --> 00:27:34.390 Miki Tebeka: that one in N will happen.
223 00:27:36.230 --> 00:27:43.460 Miki Tebeka: And now I'm going to say, like this: there is a person,
224 00:27:44.190 --> 00:27:56.340 Miki Tebeka: and if the person is sick, we are going to say that the person is sick. This is not specified in the question, but this is the assumption: that there are no false negatives.
225 00:27:56.680 --> 00:27:58.850 Miki Tebeka: There are only false positives.
226 00:27:59.020 --> 00:28:06.990 Miki Tebeka: So if we are doing that, then for the test:
227 00:28:07.220 --> 00:28:27.909 Miki Tebeka: we say we have a 5% false positive rate. So in 1 in 20 cases this test is also going to say true. So if the person is sick, the test is going to say for sure you're sick. If you're healthy, there's a 1 in 20 (5%) chance that the test will still say that you're sick, even though you're healthy.
228 00:28:30.070 --> 00:28:35.889 Miki Tebeka: So now, number of sick people and number of people who are diagnosed as sick are both starting at 0.
229 00:28:36.150 --> 00:28:40.569 Miki Tebeka: And now we are running a million simulations
230 00:28:40.720 --> 00:28:47.079 Miki Tebeka: For every one of them, we are picking a person at random. So the chances of the person being sick
231 00:28:47.240 --> 00:28:57.279 Miki Tebeka: are, like we said here, the disease strikes 1 of every 1,000 people in the population.
232 00:28:57.390 --> 00:29:01.800 Miki Tebeka: Right? So there's a 1 in a thousand chance that this person is sick.
233 00:29:02.450 --> 00:29:06.310 Miki Tebeka: and if the person is sick, then we increment the number of sick.
234 00:29:06.440 --> 00:29:10.619 Miki Tebeka: and then we do the diagnosis for the person, and if
235 00:29:11.990 --> 00:29:14.820 Miki Tebeka: we diagnose the person as sick, we increment the number of diagnosed.
236 00:29:14.970 --> 00:29:18.160 Miki Tebeka: But okay.
237 00:29:18.760 --> 00:29:26.069 Miki Tebeka: So now we have the number of people who are actually sick, and the number of people who were diagnosed as sick. And we are going to
238 00:29:27.110 --> 00:29:31.512 Miki Tebeka: print out this frequency that says,
239 00:29:33.000 --> 00:29:38.800 Miki Tebeka: what is the percentage of people who are actually sick out of the people who are diagnosed as sick?
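A sketch of the simulation just described (function names and the default iteration count are mine; the talk runs a million iterations):

```python
import random

def is_sick():
    """One person in a thousand is sick."""
    return random.randint(1, 1000) == 1

def diagnose(sick):
    """No false negatives; 5% (1 in 20) false positives on the healthy."""
    if sick:
        return True
    return random.randint(1, 20) == 1

def simulate(n=200_000):
    num_sick = num_diagnosed = 0
    for _ in range(n):
        sick = is_sick()
        if sick:
            num_sick += 1
        if diagnose(sick):
            num_diagnosed += 1
    # fraction of the people flagged as sick who really are sick
    return num_sick / num_diagnosed

print(simulate())  # ≈ 0.02
```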
240 00:29:41.160 --> 00:29:42.610 Miki Tebeka: Anyone care to guess
241 00:29:51.340 --> 00:29:54.650 Miki Tebeka: 2%. See, you are good.
242 00:29:57.160 --> 00:30:02.210 Miki Tebeka: Okay? So a lot of people are saying, you know, this is
243 00:30:02.340 --> 00:30:04.820 Miki Tebeka: what, right? The test is
244 00:30:05.020 --> 00:30:15.159 Miki Tebeka: only 5% false positive. How come? So 95% of the time it's okay. But still, only 2% of the people diagnosed are actually sick.
245 00:30:15.530 --> 00:30:17.620 Miki Tebeka: So and and think about that.
246 00:30:19.520 --> 00:30:24.550 Miki Tebeka: yeah, 0.1% divided by 5%, that's 2%.
247 00:30:25.490 --> 00:30:26.240 Miki Tebeka: So
248 00:30:29.630 --> 00:30:50.819 Miki Tebeka: if you come to think about that, that has a lot of implications. This is, again, this intuition that we have that is usually wrong when it comes to these things. And the third thing is that they gave this test to a lot of doctors. Most of them got it wrong. And it means that they're actually basing treatments and other things on something which is
249 00:30:51.220 --> 00:30:52.330 Miki Tebeka: not correct.
250 00:30:53.535 --> 00:30:57.430 Miki Tebeka: So maybe they should run a simulation
251 00:30:57.600 --> 00:31:00.239 Miki Tebeka: and get some understanding of what's going on.
252 00:31:01.920 --> 00:31:02.710 Miki Tebeka: Alright.
253 00:31:03.700 --> 00:31:11.480 Miki Tebeka: The last one is known as the Monty Hall problem. And we are
254 00:31:11.970 --> 00:31:14.529 Miki Tebeka: okay. Well, we have a lot of time.
255 00:31:14.860 --> 00:31:24.710 Miki Tebeka: I'll start speaking. So the Monty Hall problem says: you're in a game show, and you have 3 doors,
256 00:31:24.990 --> 00:31:30.829 Miki Tebeka: and the host says, you know, behind 2 doors there are goats,
257 00:31:31.030 --> 00:31:36.710 Miki Tebeka: and behind the 3rd door there is a car that you can win.
258 00:31:37.980 --> 00:31:43.140 Miki Tebeka: And he says: pick a door, 1, 2, or 3. And you pick a door; let's say I picked 1.
259 00:31:43.270 --> 00:31:50.730 Miki Tebeka: And now the host goes to another door, let's say this time door number 2, opens the door, and shows you a goat.
260 00:31:50.990 --> 00:31:53.920 Miki Tebeka: And now, he says, do you want to keep
261 00:31:54.530 --> 00:31:58.800 Miki Tebeka: your original door, or do you want to switch to the second one?
262 00:32:02.050 --> 00:32:14.090 Miki Tebeka: Right? So you have a strategy now: I picked door number 1 and I'm going to stay with door number 1, or, after they show me the door with the goat, I want to change my answer and actually go on to pick door number 3.
263 00:32:14.660 --> 00:32:19.359 Miki Tebeka: So what is the strategy? What is the good strategy in this case?
264 00:32:19.660 --> 00:32:24.539 Miki Tebeka: Too late, is it? On, on one or 2?
265 00:32:25.203 --> 00:32:35.650 Miki Tebeka: So again, we are going to simulate, right? So a random door is a random number now,
266 00:32:36.020 --> 00:32:45.589 Miki Tebeka: and here what we're doing is we say: does staying with the door win the game? Right? So we
267 00:32:46.133 --> 00:32:49.850 Miki Tebeka: pick one door, which is the door the car is
268 00:32:51.282 --> 00:32:57.270 Miki Tebeka: behind, and then we pick another door, which is the door that the player picked.
269 00:32:57.890 --> 00:33:03.619 Miki Tebeka: Now, if it is the same door, it means that the player who says "I'm going to stay"
270 00:33:03.720 --> 00:33:06.080 Miki Tebeka: is going to win the game.
271 00:33:07.580 --> 00:33:14.870 Miki Tebeka: Okay? So I'm just saying: if the car door is equal to the player door, then the stay strategy wins.
272 00:33:15.240 --> 00:33:17.310 Miki Tebeka: And now I'm going to
273 00:33:18.660 --> 00:33:26.040 Miki Tebeka: to do a million simulations. I'm going to say that this is the number of wins that the stay strategy has,
274 00:33:26.160 --> 00:33:31.619 Miki Tebeka: and this is the number of wins that the switch strategy had.
275 00:33:31.880 --> 00:33:38.980 Miki Tebeka: And I'm going to run the simulation, and then, if stay wins the game, I'm incrementing the stays. Otherwise
276 00:33:39.620 --> 00:33:43.610 Miki Tebeka: I'm going to increment the switch wins, and then I
277 00:33:44.710 --> 00:33:53.400 Miki Tebeka: divide them by N. And I'm printing out: what is the fraction of times we won by staying, and what is the fraction of
278 00:33:53.980 --> 00:33:58.640 Miki Tebeka: wins we got by switching.
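A sketch of the Monty Hall simulation he walks through (names are my own). As he says, staying wins exactly when the first pick is the car door, so the host's goat reveal doesn't need to be modelled explicitly:

```python
import random

def stay_wins():
    """One round: does the 'stay' strategy win the car?"""
    car_door = random.randint(1, 3)     # door hiding the car
    player_door = random.randint(1, 3)  # the player's first pick
    return car_door == player_door

def simulate(n=100_000):
    stays = sum(stay_wins() for _ in range(n))
    switches = n - stays  # switching wins whenever staying loses
    return stays / n, switches / n

print(simulate())  # ≈ (0.333, 0.667)
```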
279 00:34:04.620 --> 00:34:10.360 Miki Tebeka: any guesses which strategy is better.
280 00:34:20.120 --> 00:34:25.560 Miki Tebeka: So if you switch doors, you have twice the chance of winning
281 00:34:25.980 --> 00:34:30.609 Miki Tebeka: compared to staying.
282 00:34:30.900 --> 00:34:35.659 Miki Tebeka: And this is really counterintuitive. Because, why?
283 00:34:36.440 --> 00:34:45.299 Miki Tebeka: Right? I picked a door at random. There could be a car behind it. The fact that someone showed me another door that I didn't pick has a goat behind it shouldn't change that.
284 00:34:45.790 --> 00:34:47.720 Miki Tebeka: But it actually does.
285 00:34:47.920 --> 00:34:56.590 Miki Tebeka: And there's a lot of debate. If you Google the Monty Hall problem, with statistics, there's a lot of debate about
286 00:34:56.710 --> 00:34:59.999 Miki Tebeka: what it means, and
287 00:35:00.260 --> 00:35:10.629 Miki Tebeka: whether these calculations are okay or not. For me, I have a strategy now: if I see a goat, I pick the other door. And that's it.
288 00:35:14.370 --> 00:35:18.649 Miki Tebeka: Okay? So you can read more on Wikipedia on the Monty Hall problem.
289 00:35:19.582 --> 00:35:25.739 Miki Tebeka: So that's basically it; these are the 4 cases. And I hope I convinced you that
290 00:35:26.660 --> 00:35:30.299 Miki Tebeka: when you have these questions, don't shy away because you don't know the math,
291 00:35:30.470 --> 00:35:40.719 Miki Tebeka: because you don't know how to figure out the statistics. And a lot of the time it will also help you, because a lot of the time our intuition, when it comes to statistics and probabilities, is usually wrong.
292 00:35:41.280 --> 00:35:50.400 Miki Tebeka: We, as people, have good intuition about small numbers; when it comes to large numbers, we are really
293 00:35:51.569 --> 00:35:53.129 Miki Tebeka: very bad at that.
294 00:35:53.827 --> 00:36:04.760 Miki Tebeka: There's one statistician, Nassim Taleb, who says that every time he works with statistics he needs to turn off the part of the brain that says "I know what I'm doing", and just trust the numbers.
295 00:36:05.050 --> 00:36:17.420 Miki Tebeka: Right? So if you want to learn more, there is a great talk by Jake VanderPlas, an astrophysicist who's heavily involved in the scientific Python community,
296 00:36:17.850 --> 00:36:35.519 Miki Tebeka: and that's where I started. He shows some other simulations and how to do statistics. You can read more about Monte Carlo simulation on Wikipedia. By the way, I think even in Google Sheets and Excel you can run
297 00:36:36.570 --> 00:36:42.519 Miki Tebeka: Monte Carlo simulations, which is pretty awesome. And in Excel,
298 00:36:46.570 --> 00:36:50.369 Miki Tebeka: now we have Python in Excel, right? So this is
299 00:36:50.760 --> 00:36:54.949 Miki Tebeka: great. And there's a library called SimPy,
300 00:36:55.100 --> 00:37:10.809 Miki Tebeka: if you want to do what is known as discrete-event simulation. In SimPy, basically for every tick you tell your process to do something, and then you can simulate cars going, people crossing the road, cell phone towers, and a lot of other things.
301 00:37:14.760 --> 00:37:23.070 Miki Tebeka: Zia, I'm not sure what your name is, but he said the simulation will not help you if the problem is not well specified, and that is true.
302 00:37:23.360 --> 00:37:30.109 Miki Tebeka: Right. So you need a good definition of the problem before you start. If you have a vague definition of the problem, then
303 00:37:30.650 --> 00:37:31.560 Miki Tebeka: now
304 00:37:40.240 --> 00:37:48.350 Miki Tebeka: you can think about it any way you want. I'm not sure why you're saying that the Monty Hall problem is not fully defined, but we can talk about it later.
305 00:37:50.230 --> 00:37:52.989 Miki Tebeka: Because this is just about picking a strategy to win.
306 00:37:53.100 --> 00:37:55.940 Miki Tebeka: And I think the switch one is a winning strategy.
307 00:37:58.290 --> 00:38:03.219 Miki Tebeka: That's it. All of this code is in my GitHub
308 00:38:03.490 --> 00:38:11.880 Miki Tebeka: talks repo, so you can look at the code there, all the things that I have there.
309 00:38:13.050 --> 00:38:20.920 Miki Tebeka: I wrote a book on Python with quizzes, if you want to buy it. And if you have questions, that's a good time to ask them now.
310 00:38:28.840 --> 00:38:30.660 Miki Tebeka: no question. Gabor.
311 00:38:30.970 --> 00:38:36.500 Gabor Szabo: Well, I don't have any question either now. I already asked the ones I had.
312 00:38:36.620 --> 00:38:40.469 Gabor Szabo: I just want to thank you, and thank everyone who joined us.
313 00:38:41.320 --> 00:38:42.590 Gabor Szabo: And
314 00:38:44.080 --> 00:38:56.819 Gabor Szabo: if you are watching the video and you reached this point, then please remember to like the video and follow the channel. And under the video you will find the link
315 00:38:56.930 --> 00:39:05.280 Gabor Szabo: to this GitHub repo, and you will also be able to find Miki. I guess you also,
316 00:39:05.730 --> 00:39:06.530 Miki Tebeka: Yeah, yeah, sure.
317 00:39:06.530 --> 00:39:07.200 Gabor Szabo: Share your link.
318 00:39:07.200 --> 00:39:09.780 Miki Tebeka: If you have any questions I will answer.
319 00:39:11.540 --> 00:39:15.230 Gabor Szabo: Okay, so thank you very much.
320 00:39:15.510 --> 00:39:16.870 Miki Tebeka: Thanks, Gabor, for organizing this.
321 00:39:17.710 --> 00:39:21.050 Gabor Szabo: You're welcome, and I hope to see you in other presentations.
322 00:39:21.050 --> 00:39:22.110 Miki Tebeka: Awesome. Thank you.
323 00:39:22.110 --> 00:39:22.770 Gabor Szabo: Bye, bye.
]]>This lecture shows real-world use cases, know-how, and troubleshooting methods for using asyncio in Python.

1 00:00:01.740 --> 00:00:30.889 Gabor Szabo: So hello, and welcome to the Codemavens Meetup Group and Codemavens Channel, if you're watching it on YouTube. My name is Gabor Szabo. I usually teach Python and Rust at companies, and also introduce test automation and CI and that area. And I also think that sharing knowledge is extremely important among high-tech
2 00:00:31.470 --> 00:00:56.010 Gabor Szabo: programmers and people working in the high-tech industry. So that's why I am organizing these meetings, these presentations. As you can see, it's being recorded; it's going to be on YouTube. Please like the video and follow the channel. Below the video you will find a link to information about Eyal and about the content of this video.
3 00:00:56.130 --> 00:01:03.549 Gabor Szabo: And I would like to welcome everyone who joined us in the live meeting, and especially Eyal, who is giving us the presentation.
4 00:01:03.820 --> 00:01:11.349 Gabor Szabo: So now it's yours. Please introduce yourself and share the screen as you feel fit and welcome.
5 00:01:12.590 --> 00:01:30.479 Eyal Balla: Thank you. So I'll share the screen. So this is a presentation that's like a take-off on a presentation I did, I think, 2 years ago, or maybe 3 years ago.
6 00:01:33.070 --> 00:01:42.467 Eyal Balla: A bit about me. So I've been developing for more than 15 years, and working in Python for like
7 00:01:43.120 --> 00:01:49.530 Eyal Balla: 5 to 10 years. And currently I lead the data team at Scenario.
8 00:01:50.390 --> 00:01:51.629 Eyal Balla: See? That's me.
9 00:01:52.994 --> 00:02:04.040 Eyal Balla: So you can find me in the links, and if you're interested we're also hiring. So you're welcome to try and join. And we're looking for people that
10 00:02:04.520 --> 00:02:08.080 Eyal Balla: work in Python, and that is their passion.
11 00:02:08.650 --> 00:02:13.000 Eyal Balla: So today, what I'm gonna do is I'm gonna go through
12 00:02:13.340 --> 00:02:29.884 Eyal Balla: a bit about what asyncio is, and try to give a real-world example from things that we do, and then we'll talk about some advanced topics regarding asyncio. So that's what we're gonna do.
13 00:02:31.060 --> 00:02:36.420 Eyal Balla: I think it's important that you guys feel free to step in and ask questions if you need
14 00:02:38.620 --> 00:02:45.520 Eyal Balla: because there's gonna be a bit of code and some topics, and maybe,
15 00:02:45.730 --> 00:02:52.010 Eyal Balla: hopefully, it's gonna be all clear. But if somebody has any question, then feel free to jump in.
16 00:02:52.550 --> 00:03:00.030 Eyal Balla: So first of all, what is asyncio? So asyncio is a style of concurrent programming in Python.
17 00:03:00.340 --> 00:03:04.660 Eyal Balla: So why do we need it? So you can think of
18 00:03:04.960 --> 00:03:08.569 Eyal Balla: wanting to do multiple things in Python at the same time.
19 00:03:08.770 --> 00:03:09.506 Eyal Balla: So
20 00:03:11.140 --> 00:03:19.540 Eyal Balla: A simple way to do it is using a fork, right? So you can run multiple Python processes at the same time.
21 00:03:20.010 --> 00:03:43.789 Eyal Balla: So the OS handles the concurrency, and you can actually use multiple cores on your machine. The problem is that you get duplicated memory, because each process has its own memory space, right? And in order to communicate between the different Python processes you need OS-level communication. So pipes and
22 00:03:43.950 --> 00:03:45.463 Eyal Balla: files, and
23 00:03:47.161 --> 00:03:59.079 Eyal Balla: sorry other ways of multi-process communication. So you'd say, Okay, maybe we can do it some some other way. So there's also multi-threading in python.
24 00:03:59.390 --> 00:04:02.710 Eyal Balla: So this is nice. So you can create new thread. And
25 00:04:03.242 --> 00:04:11.339 Eyal Balla: it looks like you can run multiple things at the same time. But then there's Gil. So Gil is the global interpreter lock.
26 00:04:11.878 --> 00:04:17.962 Eyal Balla: I know there's an effort in Python to try and remove it. But for now it's there
27 00:04:19.910 --> 00:04:25.750 Eyal Balla: And the GIL prevents multiple Python instructions from running in the same process at the same time.
28 00:04:25.860 --> 00:04:34.925 Eyal Balla: So OS-level concurrency for threading only happens when you do things like
29 00:04:35.580 --> 00:04:41.672 Eyal Balla: accessing files or the network, or any time control passes from your
30 00:04:42.660 --> 00:04:46.390 Eyal Balla: Python commands into the OS.
31 00:04:47.221 --> 00:04:59.660 Eyal Balla: It's something you don't do explicitly. You can only do it implicitly, by accessing something or doing something that requires OS interaction.
32 00:05:00.720 --> 00:05:12.629 Eyal Balla: And there's asyncio. So what is asyncio? It's an I/O event manager. Okay? And it helps you manage state. So you can have multiple states of your system
33 00:05:12.750 --> 00:05:24.389 Eyal Balla: on the same thread. And you can actually explicitly manage the context switching. So you can say, I want to work on multiple items. These are the multiple items. And I want to work on them.
34 00:05:24.920 --> 00:05:33.670 Eyal Balla: So if we look, at a high level, at what the options are. Say we use multiprocessing, so we have multiple processes:
35 00:05:33.780 --> 00:05:47.120 Eyal Balla: you can have concurrency, and you can use all the CPUs. But you know you can't run many processes on the same machine, because each uses a full CPU. So maybe
36 00:05:47.740 --> 00:05:52.530 Eyal Balla: one to 10 CPUs and processes.
37 00:05:52.800 --> 00:06:01.410 Eyal Balla: And you can use, generally, the standard library blocking components and synchronization tools.
38 00:06:02.110 --> 00:06:16.011 Eyal Balla: And then, if you need something that's maybe a bit higher on the scalability, you can use threads. So you have a single process, and the GIL is protecting you from
39 00:06:16.910 --> 00:06:22.843 Eyal Balla: doing things between threads which touch memory
40 00:06:23.670 --> 00:06:34.039 Eyal Balla: in an intrusive way. But the problem is that you let the OS schedule your code, and you can't really
41 00:06:34.792 --> 00:06:36.720 Eyal Balla: control it manually.
42 00:06:37.070 --> 00:06:51.839 Eyal Balla: And then there's asyncio, where you can actually handle many thousands of scalable small components, called coroutines, and it's at the application level.
43 00:06:51.940 --> 00:06:58.970 Eyal Balla: So this is what we're gonna look at today. And we're gonna see how it works and how you can control it.
44 00:07:00.620 --> 00:07:03.188 Eyal Balla: So like every other
45 00:07:04.440 --> 00:07:09.849 Eyal Balla: program in the world, there's a hello world for asyncio. Right? So there's
46 00:07:12.010 --> 00:07:15.570 Eyal Balla: Let's see. Can you see my cursor? Then you can.
47 00:07:16.310 --> 00:07:23.630 Eyal Balla: So there's a regular hello world, and there's the one with asyncio and await. Okay, but
48 00:07:23.900 --> 00:07:29.589 Eyal Balla: this program is not really helpful, right? Because it doesn't show anything that's important for us.
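The two hello worlds compared on the slide are not in the captions; here is a minimal sketch of what that comparison might look like (the async version just hops through the event loop once):

```python
import asyncio

def hello_sync():
    # The regular hello world: a plain synchronous function.
    return "hello world"

async def hello_async():
    # The asyncio version: "async def" makes this a coroutine, and
    # awaiting asyncio.sleep(0) yields control to the event loop once.
    await asyncio.sleep(0)
    return "hello world"

if __name__ == "__main__":
    print(hello_sync())
    print(asyncio.run(hello_async()))  # asyncio.run drives the coroutine
```

As the talk says, this doesn't show anything interesting yet; both print the same thing.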
49 00:07:30.210 --> 00:07:50.489 Eyal Balla: So what do we want to do with asyncio in the real world? We want to use it for handling multiple heavy I/O processes. So, like, maybe database accesses, or multiple web requests, or file sharing, or accessing many I/O components at the same time.
50 00:07:51.330 --> 00:08:10.769 Eyal Balla: And you can always use asyncio with multiple processes. Maybe in a cloud application you can have multiple pods, right? But also you can run it on multiple processes and have the ability to use multiple CPUs if needed.
51 00:08:13.430 --> 00:08:14.330 Eyal Balla: But
52 00:08:14.440 --> 00:08:28.290 Eyal Balla: the downside of asyncio is that it's almost a different programming language. It looks like Python, and the constructs are very much like Python, you just use a few more keywords. But
53 00:08:28.810 --> 00:08:36.129 Eyal Balla: it's very different in concept, because each coroutine, the functions that you call with await,
54 00:08:37.038 --> 00:08:45.999 Eyal Balla: has to be short enough to allow multiple contexts to run together. So you mustn't run
55 00:08:46.528 --> 00:08:56.480 Eyal Balla: long computations. You can't block the event queue. Okay? Just like in a UI application, you don't want the main loop to be blocked.
56 00:08:56.890 --> 00:09:10.859 Eyal Balla: And also you can't use general purpose OS blocking commands like creating connections with socket, or select, or sleep. So you have things that are asyncio-specific
57 00:09:13.062 --> 00:09:20.630 Eyal Balla: so you even have different libraries that you can use in asyncio. So
58 00:09:21.888 --> 00:09:33.809 Eyal Balla: if you usually use requests, I suggest you try HTTPX, it has better behavior; FastAPI over Django and Flask;
59 00:09:34.671 --> 00:09:37.929 Eyal Balla: asyncpg instead of psycopg, etcetera.
60 00:09:38.130 --> 00:09:40.760 Eyal Balla: Okay, so
61 00:09:42.130 --> 00:09:55.740 Eyal Balla: if we look at asyncio's main building blocks, what we have is the main asyncio.run command. So what it does: it receives, in a
62 00:09:57.490 --> 00:10:05.362 Eyal Balla: sync context, a coroutine, something that
63 00:10:05.900 --> 00:10:09.830 Eyal Balla: is run on the asyncio loop. It creates a loop
64 00:10:09.990 --> 00:10:16.710 Eyal Balla: and runs the coroutine in it. And usually this is how you do the entry point into an asyncio context.
65 00:10:17.300 --> 00:10:20.779 Eyal Balla: Okay? And then you have the coroutines. These look like:
66 00:10:21.330 --> 00:10:26.570 Eyal Balla: you define async def and a function, and this creates a coroutine.
67 00:10:26.770 --> 00:10:31.479 Eyal Balla: Okay, and coroutines you can run either using the main loop
68 00:10:31.600 --> 00:10:38.720 Eyal Balla: or by creating a task using asyncio.create_task.
69 00:10:39.560 --> 00:10:40.470 Eyal Balla: Okay?
70 00:10:40.740 --> 00:10:49.319 Eyal Balla: And also, when you want to wait for something to happen, and you want to release the context, you run await:
71 00:10:49.480 --> 00:10:59.880 Eyal Balla: you call await in the coroutine, and then the context itself waits until the async context is finished
72 00:11:00.260 --> 00:11:04.519 Eyal Balla: and then returns control to the main loop that called it.
73 00:11:04.620 --> 00:11:05.430 Eyal Balla: Okay?
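A small sketch of those building blocks together: asyncio.run as the entry point, async def defining coroutines, create_task scheduling one on the loop, and await releasing control (the names and delays here are illustrative, not from the slides):

```python
import asyncio

async def fetch(name, delay):
    # A coroutine: await releases control to the event loop
    # until the (simulated) I/O finishes.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # create_task schedules the coroutine on the running loop right away;
    # awaiting the task later collects its result.
    task = asyncio.create_task(fetch("query", 0.01))
    direct = await fetch("file", 0.01)   # awaited directly, runs alongside the task
    scheduled = await task
    return [direct, scheduled]

results = asyncio.run(main())  # the single entry point into the asyncio context
```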
74 00:11:06.350 --> 00:11:07.370 Eyal Balla: So
75 00:11:08.022 --> 00:11:21.649 Eyal Balla: I'm gonna show you guys an example of a small program. So what this does: this is a synchronous program. It reads from S3, and then queries some database. Right?
76 00:11:22.040 --> 00:11:23.960 Eyal Balla: So we have, like a
77 00:11:24.090 --> 00:11:30.433 Eyal Balla: 2 contexts. This is the 1st one, this is the second one, and they're not
78 00:11:31.700 --> 00:11:48.750 Eyal Balla: dependent on each other. You can see it just gets a file, and then just runs a query. And we want to try and do these together, because we want the context returned with the content itself, but we don't have any kind of connection between the 2 contexts.
79 00:11:49.460 --> 00:11:56.740 Eyal Balla: So what you can do is you can move to asyncio, define this as a coroutine using the
80 00:11:57.150 --> 00:11:58.716 Eyal Balla: aioboto3
81 00:12:00.194 --> 00:12:10.339 Eyal Balla: async library, and using asyncpg create a coroutine from the query of the database.
82 00:12:10.660 --> 00:12:21.090 Eyal Balla: and then you can use gather to run the 2 coroutines together. Okay, independently of each other. So
83 00:12:21.826 --> 00:12:30.723 Eyal Balla: the execution of the query, the return of the items, and the read of the body of the file is done
84 00:12:31.300 --> 00:12:40.689 Eyal Balla: asynchronously, and while waiting for the I/O, the context continues to the other part.
85 00:12:41.160 --> 00:12:42.000 Eyal Balla: Okay.
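The S3-plus-database code itself isn't in the captions; here is a sketch of the same shape, with the two I/O calls simulated by asyncio.sleep (real code would use async libraries such as aioboto3 and asyncpg, as mentioned in the talk):

```python
import asyncio
import time

async def get_file():
    # Stand-in for an async S3 read.
    await asyncio.sleep(0.1)
    return "file-body"

async def run_query():
    # Stand-in for an async database query.
    await asyncio.sleep(0.1)
    return ["row1", "row2"]

async def main():
    # gather runs both coroutines concurrently: while one awaits its
    # I/O, the event loop switches to the other.
    return await asyncio.gather(get_file(), run_query())

start = time.monotonic()
body, rows = asyncio.run(main())
elapsed = time.monotonic() - start
# Concurrent: roughly one sleep (~0.1s), not the ~0.2s of a sequential run.
```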
86 00:12:43.430 --> 00:12:46.690 Eyal Balla: Questions? Great.
87 00:12:47.450 --> 00:13:16.690 Eyal Balla: So, more options. asyncio also supports context managers. This is very convenient. For instance, look at the bottom part here: I had to connect and then close using finally. But I can also create a context manager from this, open the connection with the context manager, and when it exits, close the connection. So I don't have to use
88 00:13:16.710 --> 00:13:19.260 Eyal Balla: explicit exception handling.
89 00:13:19.850 --> 00:13:21.320 Eyal Balla: And also
90 00:13:22.012 --> 00:13:33.339 Eyal Balla: asyncio supports iterators, so you can use a generator-like way of controlling small parts of the code
91 00:13:33.848 --> 00:13:43.049 Eyal Balla: one after the other, using async iterables. So these are patterns well known in Python, and you can also use them in an async context.
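A toy sketch of both ideas together: an async context manager that replaces the connect/close-in-finally pattern, and an async generator iterated with async for (the Connection class here is purely illustrative):

```python
import asyncio

class Connection:
    # A toy async context manager: __aenter__/__aexit__ replace
    # explicit close-in-finally handling.
    async def __aenter__(self):
        self.open = True
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.open = False

    async def rows(self):
        # A toy async generator: each yield is a point where the
        # event loop can switch to another coroutine.
        for i in range(3):
            await asyncio.sleep(0)
            yield i

async def main():
    result = []
    async with Connection() as conn:    # closed automatically on exit
        async for row in conn.rows():   # async iteration, one item at a time
            result.append(row)
    return result, conn.open

rows, still_open = asyncio.run(main())
```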
92 00:13:45.200 --> 00:13:50.409 Eyal Balla: So now I want to show you guys, maybe a problem from our day to day work.
93 00:13:50.870 --> 00:13:56.281 Eyal Balla: I'll present the problem first. So what we want to do is we want to do some
94 00:13:58.582 --> 00:14:21.089 Eyal Balla: integration which reads data from an external source and then enriches it. Okay, it gets information from maybe a database, adds to the context from the external source, and then writes the results into our database as entities. Okay? And I think the main issue here is that maybe
95 00:14:21.659 --> 00:14:28.550 Eyal Balla: we have multiple customers. Some are small, some are large, and customers have maybe
96 00:14:29.200 --> 00:14:44.130 Eyal Balla: tens of thousands of entities. So there's a lot of reading from the external source, and also maybe a lot of writing and reading in the database. So we have a lot of I/O, and this actually fits the asyncio concepts very well.
97 00:14:44.460 --> 00:14:48.700 Eyal Balla: Okay, so what do we want to do?
98 00:14:49.388 --> 00:15:01.639 Eyal Balla: We wanna call something once in a while, go over each of the customers, get the information, and then update our database with the enriched information. Okay? So
99 00:15:01.810 --> 00:15:06.059 Eyal Balla: even a naive implementation would be something like this.
100 00:15:07.620 --> 00:15:13.880 Eyal Balla: You define all the bootstrapping
101 00:15:14.549 --> 00:15:37.869 Eyal Balla: needed, and then you get the list of customers. And then, for each customer, you do the enrichment. So you get the settings, and per customer you get the information from the integration system, enrich it, and write it into your database. Right?
102 00:15:38.640 --> 00:15:40.020 Eyal Balla: So this is nice.
103 00:15:40.935 --> 00:15:41.740 Eyal Balla: But
104 00:15:42.558 --> 00:15:54.589 Eyal Balla: the problem when we look at this: this runs per customer. So that means that until one customer is done, the next customer doesn't start.
105 00:15:54.770 --> 00:15:59.140 Eyal Balla: Okay? So if we have small customers and large customers, then
106 00:16:00.190 --> 00:16:06.090 Eyal Balla: we have a problem that small customers are impacted by the size of large customers right?
107 00:16:06.910 --> 00:16:17.090 Eyal Balla: And also, once we have something that's bigger than the total cron interval, then
108 00:16:17.991 --> 00:16:30.509 Eyal Balla: the run doesn't finish within the time it's called at. So the system doesn't deliver the functionality according to the time constraints it's supposed to run within.
109 00:16:32.395 --> 00:16:36.110 Eyal Balla: So what can we do? I think the 1st thing we can do
110 00:16:36.310 --> 00:16:48.919 Eyal Balla: is to separate it per customer. So we can have some kind of injection of the customer id through a queue, and have the system run only per customer. So
111 00:16:49.160 --> 00:16:54.019 Eyal Balla: it reads the information from the queue, gets the customer id
112 00:16:54.380 --> 00:17:01.040 Eyal Balla: here, and then runs the same thing just for a specific customer.
113 00:17:01.160 --> 00:17:03.090 Eyal Balla: So how does this help? So
114 00:17:03.310 --> 00:17:10.749 Eyal Balla: what we can do now is we can scale out. So we can have multiple instances of this specific code run
115 00:17:12.778 --> 00:17:19.050 Eyal Balla: together, each on a different customer. And assuming that they're not dependent, then
116 00:17:19.699 --> 00:17:27.270 Eyal Balla: small customers are now not impacted by the size of large customers, and the time that you want
117 00:17:27.829 --> 00:17:29.460 Eyal Balla: to run this is,
118 00:17:30.045 --> 00:17:40.390 Eyal Balla: at most, the time of the biggest customer. Right? So you can scale out as much as you want, and the time that this whole process takes is the time of the biggest customer.
119 00:17:41.270 --> 00:17:42.110 Eyal Balla: Okay?
120 00:17:42.890 --> 00:17:49.890 Eyal Balla: So till now we did not touch anything that's asyncio, right? We just used simple
121 00:17:50.600 --> 00:17:56.219 Eyal Balla: design patterns that allow scaling out of loops.
122 00:17:57.300 --> 00:18:02.860 Eyal Balla: So now let's try and use asyncio to improve the performance of this whole loop.
123 00:18:03.030 --> 00:18:05.249 Eyal Balla: So what do we do?
124 00:18:05.887 --> 00:18:09.710 Eyal Balla: We create a coroutine and run it using asyncio.run.
125 00:18:11.583 --> 00:18:17.220 Eyal Balla: This coroutine is very similar to what we ran before.
126 00:18:19.340 --> 00:18:25.209 Eyal Balla: Except that now when we look at what happens inside the run for customer.
127 00:18:25.450 --> 00:18:27.550 Eyal Balla: this looks a bit different.
128 00:18:27.810 --> 00:18:31.698 Eyal Balla: So what do we do? First,
129 00:18:33.466 --> 00:18:44.300 Eyal Balla: we run through the pages. Okay? And when we want to enrich the items, we create batches of coroutines, and then we run them together.
130 00:18:44.420 --> 00:18:59.979 Eyal Balla: Okay, so here the coroutines are created according to the number of the integration items, and when you enrich and read the information, each batch of the
131 00:19:00.460 --> 00:19:04.559 Eyal Balla: coroutines runs together,
132 00:19:06.570 --> 00:19:11.189 Eyal Balla: so they happen together, and you wait only for the I/O for each of the items.
133 00:19:11.760 --> 00:19:12.580 Eyal Balla: Okay,
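The per-customer loop on the slide might be sketched like this: paging with an async iterator, and running each page's enrichment coroutines as one gather batch (all the names and data here are made up; the real integration and enrichment calls are I/O):

```python
import asyncio

async def enrich(item):
    # Stand-in for the per-item enrichment I/O call.
    await asyncio.sleep(0.01)
    return {"id": item, "enriched": True}

async def pages():
    # Stand-in for the paged read from the external integration source.
    for page in [[1, 2, 3], [4, 5]]:
        await asyncio.sleep(0)
        yield page

async def run_for_customer():
    results = []
    # For each page, build a batch of coroutines (one per item) and
    # run the whole batch concurrently with gather; the awaits overlap.
    async for page in pages():
        batch = [enrich(item) for item in page]
        results.extend(await asyncio.gather(*batch))
    return results

enriched = asyncio.run(run_for_customer())
```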
134 00:19:16.630 --> 00:19:20.870 Naty Harary: Yeah, I have a question. Should I just interrupt you mid-sentence?
135 00:19:20.870 --> 00:19:21.420 Eyal Balla: Yeah.
136 00:19:22.394 --> 00:19:29.490 Naty Harary: So, as far as I know, in asyncio it is enough to mark the function itself async;
137 00:19:29.700 --> 00:19:34.740 Naty Harary: you just await it. So I'm not really familiar with the syntax like here.
138 00:19:35.258 --> 00:19:40.680 Naty Harary: So why do we need to async this as well? I'm not really sure I understand.
139 00:19:42.430 --> 00:19:48.560 Eyal Balla: What we're doing here is we wait for each of the pages. So this is an async iterator, right?
140 00:19:48.940 --> 00:19:53.389 Eyal Balla: But this, this is the coroutine which returns an async iterator.
141 00:19:53.510 --> 00:19:56.173 Eyal Balla: And then,
142 00:19:57.270 --> 00:20:06.239 Eyal Balla: each of these pages contains items, so you want to enrich each of the items. So you create coroutines for each of the items to be enriched.
143 00:20:06.390 --> 00:20:16.100 Eyal Balla: And when you run them, you run them using gather. Because when you run await on something, okay, this makes the
144 00:20:16.570 --> 00:20:24.700 Eyal Balla: higher level function wait till it's done. Okay, this is a way to synchronize async contexts.
145 00:20:25.220 --> 00:20:29.610 Eyal Balla: Okay, so here you synchronize multiple async contexts using gather.
146 00:20:31.990 --> 00:20:41.489 Naty Harary: I see. So you just gather all the chunks that you have, and you create them with the async iterator, rather than just taking one big function and making that async, right? That's.
147 00:20:41.490 --> 00:21:00.950 Eyal Balla: Because you want to split your context into smaller processing units. Each of them may be I/O bound, so together the I/O runs in parallel on each of the items.
148 00:21:01.940 --> 00:21:03.429 Naty Harary: Got it. Thank you.
149 00:21:06.130 --> 00:21:07.540 Eyal Balla: Okay. So
150 00:21:08.060 --> 00:21:27.009 Eyal Balla: now, like I said before, enrichment happens in parallel. But still you can scale out, so you can have multiple services. And so the total performance here is not blocked, and also small customers are not impacted by the large customers.
151 00:21:30.120 --> 00:21:34.040 Eyal Balla: Okay. So some other things you should take into consideration.
152 00:21:34.300 --> 00:21:45.839 Eyal Balla: So I think the 1st thing is exception handling. So when you create an async context, you sometimes need to handle exceptions at the top level.
153 00:21:46.000 --> 00:22:04.660 Eyal Balla: So when you do that, you can register a manual exception handler: you get the main loop and set the exception handler, and you can handle the errors that are created from each of the tasks
154 00:22:06.276 --> 00:22:16.647 Eyal Balla: separately. Because if you don't do that, then the asyncio context would
155 00:22:18.610 --> 00:22:27.379 Eyal Balla: exit when one of the sub-coroutines throws an exception into the main context.
156 00:22:27.760 --> 00:22:28.620 Eyal Balla: Okay?
157 00:22:28.810 --> 00:22:34.353 Eyal Balla: So sometimes you want to wait, maybe for the last one, or
158 00:22:35.440 --> 00:22:41.499 Eyal Balla: perhaps some other behavior that is specific to your system. And you can do it this way
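A sketch of registering such a loop-level handler with loop.set_exception_handler; it fires for a task whose exception was never retrieved with await (the task and error names here are invented):

```python
import asyncio
import gc

errors = []

def on_error(loop, context):
    # Loop-level handler: receives errors from tasks whose exception
    # was never retrieved, instead of asyncio just logging them.
    errors.append(type(context["exception"]).__name__)

async def failing():
    raise RuntimeError("boom")

async def main():
    loop = asyncio.get_running_loop()
    loop.set_exception_handler(on_error)
    task = asyncio.create_task(failing())
    await asyncio.sleep(0)  # let the task run and fail
    await asyncio.sleep(0)
    task = None             # never awaited: dropping the last reference...
    gc.collect()            # ...triggers the handler for the lost exception

asyncio.run(main())
```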
159 00:22:42.607 --> 00:22:54.359 Eyal Balla: Specifically for gather, you have 2 ways to handle exceptions. You can do it inside each of the coroutines, like I did before, or you can
160 00:22:54.610 --> 00:23:16.869 Eyal Balla: ask gather to collect all the exceptions from each of the coroutines, and then you can handle the errors together. For instance, if you want to have some retry mechanism, then this is a good way to do it: you gather all the errors, and then you can retry all those that failed, or decide to do whatever you want with those that did not succeed.
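A sketch of that return_exceptions=True pattern: collect failures from a gather batch and retry only those (the "flaky" coroutine is a stand-in for transient I/O errors):

```python
import asyncio

attempts = {"flaky": 0}

async def fetch(name):
    # "flaky" fails on its first attempt and succeeds on the retry.
    if name == "flaky":
        attempts["flaky"] += 1
        if attempts["flaky"] == 1:
            raise ConnectionError("transient failure")
    await asyncio.sleep(0)
    return name

async def main():
    names = ["a", "flaky", "b"]
    # return_exceptions=True: gather returns exceptions as results
    # instead of propagating the first failure.
    results = await asyncio.gather(*(fetch(n) for n in names),
                                   return_exceptions=True)
    # Retry only the ones that failed.
    failed = [n for n, r in zip(names, results) if isinstance(r, Exception)]
    retried = await asyncio.gather(*(fetch(n) for n in failed))
    ok = [r for r in results if not isinstance(r, Exception)]
    return ok + retried

results = asyncio.run(main())
```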
161 00:23:19.144 --> 00:23:29.915 Eyal Balla: Regarding testing. So if you look at these coroutines, what I want to test here is, maybe,
162 00:23:32.830 --> 00:23:40.820 Eyal Balla: the functional response. Okay, something like a happy path, and maybe an exception to test the raise_for_status.
163 00:23:42.300 --> 00:23:50.960 Eyal Balla: So the important part is to mark your test as pytest.mark.asyncio. This allows you to run the test in an async context.
164 00:23:51.770 --> 00:23:57.170 Eyal Balla: There's httpx_mock for HTTPX, so you can use that.
165 00:23:57.300 --> 00:24:04.819 Eyal Balla: And then you can inject the response here, for instance, and test your happy flow.
166 00:24:05.150 --> 00:24:17.508 Eyal Balla: And also you can always use pytest.raises like you did before. And assuming you marked it as asyncio, you can test the async flow and the
167 00:24:18.130 --> 00:24:19.460 Eyal Balla: exception flow.
168 00:24:19.720 --> 00:24:20.520 Eyal Balla: Okay?
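The tests on the slide use pytest.mark.asyncio and the HTTPX mock; here is a dependency-free sketch of the same two checks, happy path and the raise_for_status exception flow, with a hand-rolled fake client standing in for the mock (all names are invented):

```python
import asyncio

class FakeResponse:
    # Stand-in for an HTTP response (real code would use httpx,
    # with pytest-httpx injecting the responses).
    def __init__(self, status, data):
        self.status = status
        self.data = data

    def raise_for_status(self):
        if self.status >= 400:
            raise RuntimeError(f"HTTP {self.status}")

async def fetch_items(client):
    # Coroutine under test: calls the client, checks the status,
    # returns the payload.
    response = await client.get("/items")
    response.raise_for_status()
    return response.data

async def happy_path():
    class Client:
        async def get(self, url):
            return FakeResponse(200, [1, 2, 3])
    return await fetch_items(Client())

async def error_path():
    class Client:
        async def get(self, url):
            return FakeResponse(500, None)
    try:
        await fetch_items(Client())
    except RuntimeError as exc:   # pytest.raises would catch this instead
        return str(exc)

happy = asyncio.run(happy_path())
error = asyncio.run(error_path())
```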
169 00:24:22.605 --> 00:24:23.380 Eyal Balla: Sorry.
170 00:24:24.240 --> 00:24:31.040 Eyal Balla: What you can also do: there's AsyncMock, like unittest's MagicMock, so
171 00:24:32.065 --> 00:24:40.340 Eyal Balla: you can mock coroutines. So here's an example of how you mock a coroutine and test it. So
172 00:24:40.510 --> 00:24:48.583 Eyal Balla: this is something that's nice to know, and I think it's very valuable when you're testing and mocking
173 00:24:50.250 --> 00:24:51.110 Eyal Balla: coroutines.
174 00:24:51.798 --> 00:25:02.409 Eyal Balla: I think that today the default patch returns either a MagicMock or an AsyncMock, according to the
175 00:25:02.938 --> 00:25:09.211 Eyal Balla: type of function that it gets. So if it's a coroutine, then it would
176 00:25:10.050 --> 00:25:17.320 Eyal Balla: create this as an AsyncMock, and if not, it'll be a MagicMock, according to what is needed.
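A sketch of AsyncMock from unittest.mock: the mocked attribute is awaitable, and it records calls like a regular mock (load_user and fetch_user are invented names, not from the slides):

```python
import asyncio
from unittest.mock import AsyncMock

async def load_user(db, user_id):
    # Coroutine under test: depends on an async database call.
    row = await db.fetch_user(user_id)
    return row["name"]

async def main():
    # AsyncMock makes fetch_user awaitable; return_value is what the
    # await produces. (mock.patch picks AsyncMock automatically when
    # the patched target is an async def function.)
    db = AsyncMock()
    db.fetch_user.return_value = {"name": "alice"}
    name = await load_user(db, 42)
    db.fetch_user.assert_awaited_once_with(42)  # the call was recorded
    return name

mocked_name = asyncio.run(main())
```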
177 00:25:18.980 --> 00:25:24.740 Eyal Balla: Something else that's very important for developers is the ability to debug.
178 00:25:25.240 --> 00:25:33.950 Eyal Balla: So asyncio gives a debug mode. When you run with the environment variable, you get,
179 00:25:34.473 --> 00:25:43.669 Eyal Balla: among other things, tracebacks on async functions when they're not awaited, so you can find out where this happens, and when.
180 00:25:44.410 --> 00:25:57.470 Eyal Balla: And also this monitors thread safety. So when something in your system behaves unsafely regarding the different coroutines and the memory they touch,
181 00:25:57.900 --> 00:26:08.020 Eyal Balla: you get errors in your logs. And also this helps debug slow coroutines, because
182 00:26:08.592 --> 00:26:10.857 Eyal Balla: asyncio is very
183 00:26:12.057 --> 00:26:24.449 Eyal Balla: sensitive to long coroutines blocking short coroutines. So this actually helps you understand the flow of your code better once you use asyncio.
184 00:26:26.425 --> 00:26:38.450 Eyal Balla: So this is how a slow log looks. If I do something very slow, you'd get a log saying this has taken too long. Okay? So you would know that
185 00:26:38.890 --> 00:26:40.749 Eyal Balla: you want to look at this function.
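A sketch of debug mode catching a blocking call; debug=True in asyncio.run is the programmatic equivalent of the PYTHONASYNCIODEBUG=1 environment variable mentioned in the talk, and the slow-callback warning goes to the "asyncio" logger:

```python
import asyncio
import logging
import time

records = []

class Capture(logging.Handler):
    # Collect asyncio's own log messages so we can inspect them.
    def emit(self, record):
        records.append(record.getMessage())

logging.getLogger("asyncio").addHandler(Capture())
logging.getLogger("asyncio").setLevel(logging.WARNING)

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.05  # warn on steps slower than 50 ms
    # time.sleep (unlike asyncio.sleep) blocks the whole event loop;
    # this is exactly what debug mode is meant to flag.
    time.sleep(0.1)
    await asyncio.sleep(0)

# debug=True enables asyncio's debug mode for this run.
asyncio.run(main(), debug=True)

slow_logs = [m for m in records if "took" in m]
```

The warning looks like "Executing &lt;Task ...&gt; took 0.100 seconds", pointing at the function to inspect.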
186 00:26:42.468 --> 00:26:46.551 Eyal Balla: Also, something you might want to consider is
187 00:26:47.599 --> 00:26:52.540 Eyal Balla: having something that's always running in your context, in your services.
188 00:26:53.396 --> 00:27:01.890 Eyal Balla: So aiodebug allows you to log slow callbacks inside your production pods.
189 00:27:02.010 --> 00:27:08.362 Eyal Balla: And with this you can enable specific
190 00:27:09.420 --> 00:27:12.209 Eyal Balla: callbacks when this happens. And this is
191 00:27:12.340 --> 00:27:14.906 Eyal Balla: really great, because it has
192 00:27:16.144 --> 00:27:20.340 Eyal Balla: almost no performance impact on the actual services.
193 00:27:20.490 --> 00:27:26.070 Eyal Balla: And it allows you to understand better how your code behaves in production.
194 00:27:27.820 --> 00:27:28.650 Eyal Balla: Okay?
195 00:27:29.270 --> 00:27:30.080 Eyal Balla: Great
196 00:27:31.981 --> 00:27:46.289 Eyal Balla: also, something you can do is you can monitor each of the different tasks. There's asyncio.all_tasks and
197 00:27:46.770 --> 00:27:50.107 Eyal Balla: asyncio.current_task. So you can run
198 00:27:51.380 --> 00:27:56.220 Eyal Balla: a coroutine once in a while to understand what is running,
199 00:27:56.340 --> 00:28:00.020 Eyal Balla: and get the stacks and understand the behavior.
200 00:28:00.690 --> 00:28:01.500 Eyal Balla: Okay.
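A sketch of that kind of monitor coroutine, using asyncio.all_tasks and asyncio.current_task to list what is on the loop (the worker tasks and their names are invented, just to make the output readable):

```python
import asyncio

async def worker(name, delay):
    await asyncio.sleep(delay)

async def monitor():
    # A coroutine you can schedule periodically to see what is running:
    # all_tasks() returns every unfinished task on the loop,
    # current_task() is the one executing right now.
    tasks = asyncio.all_tasks()
    me = asyncio.current_task()
    return sorted(t.get_name() for t in tasks if t is not me)

async def main():
    asyncio.create_task(worker("a", 0.05), name="worker-a")
    asyncio.create_task(worker("b", 0.05), name="worker-b")
    await asyncio.sleep(0)   # let the workers start
    return await monitor()

running = asyncio.run(main())
```

Each Task also has a get_stack() method, which is what lets such a monitor dump stacks and show where each coroutine is waiting.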
201 00:28:03.120 --> 00:28:11.940 Eyal Balla: So this is about it. I went over the asyncio concurrent programming framework.
202 00:28:12.702 --> 00:28:27.219 Eyal Balla: I think we saw a real world example and understood a bit how asyncio behaves, and why we'd want to use it. And also we looked at some debugging, testing, and exception handling tools.
203 00:28:27.810 --> 00:28:28.950 Eyal Balla: and that's it.
204 00:28:31.170 --> 00:28:32.000 Eyal Balla: Questions.
205 00:28:36.240 --> 00:28:36.940 lapid: Can I?
206 00:28:39.040 --> 00:28:40.060 lapid: Do you hear me?
207 00:28:40.820 --> 00:28:41.869 Gabor Szabo: Yes, yes, we can hear you.
208 00:28:41.870 --> 00:28:56.800 lapid: Oh, hi, yeah. So you touched a little bit on it. But when I'm doing something, like a project that develops into something a little bit bigger, I find myself, sometimes I just get lost.
209 00:28:57.150 --> 00:29:01.299 lapid: I can't verify myself that I actually
210 00:29:02.800 --> 00:29:06.060 lapid: control all the coroutines properly, because many times
211 00:29:06.430 --> 00:29:20.690 lapid: queues that feed one another, like I have some streaming, and then some queues and things like that. And so you touched a little bit on that, on how you monitor that. But can you expand a little bit, like how do you deal with that? Cause I just
212 00:29:21.140 --> 00:29:29.730 lapid: afterwards I go back, and I just print constantly, and I check the timing, and I waste a lot of time on that, and I feel like maybe someone more experienced has a better solution.
213 00:29:31.030 --> 00:29:34.775 Eyal Balla: So I think, when you,
214 00:29:35.460 --> 00:29:41.375 Eyal Balla: I think that the 1st thing is to build your software,
215 00:29:42.340 --> 00:29:44.319 Eyal Balla: even though it's async,
216 00:29:44.440 --> 00:29:59.850 Eyal Balla: with a top down architecture, understanding which parts are calling which other parts, and making sure that you synchronize correctly. Once you do that, things are easier, I think.
217 00:30:01.503 --> 00:30:12.799 Eyal Balla: So, like other considerations in software development, you need to have a solid design
218 00:30:13.250 --> 00:30:17.059 Eyal Balla: at the beginning, right? And then you can use
219 00:30:17.707 --> 00:30:21.539 Eyal Balla: something like the task monitor right here.
220 00:30:21.650 --> 00:30:26.649 Eyal Balla: So you can add this as something that you can call within your code.
221 00:30:27.310 --> 00:30:35.309 Eyal Balla: And this actually helps you understand the different coroutines that are running at the same time,
222 00:30:35.530 --> 00:30:57.230 Eyal Balla: and can help you, together with the slow coroutine logs, understand the impact of each of the different coroutines running. And I think that when you say you want to understand, you have some kind of a problem, right? You have, maybe, something that doesn't get the ability to run at all,
223 00:30:57.380 --> 00:31:07.317 Eyal Balla: and you don't know why. The reason for this is probably that something is blocking the main loop, right? It's too long. So you'd get
224 00:31:08.220 --> 00:31:17.930 Eyal Balla: messages on the slow callbacks. And then you would see this in the running tasks and understand the context of how it ran.
225 00:31:19.290 --> 00:31:21.089 Eyal Balla: So, does this make sense?
226 00:31:21.380 --> 00:31:46.119 lapid: Yeah, something in that area. It's more that, when the project gets big enough, you know, I have design patterns for code that I know and follow. That helps me, you know, every time I come back to code that I didn't touch for a while, I know, like, okay, this is what I do in order to actually add a new feature. But somehow, when I develop with asyncio,
227 00:31:46.430 --> 00:32:05.980 lapid: unless it's async from the start of the whole development, many times, if I want to change something in the future, I find myself having to go very deep into the code. Like, I don't mind this, or maybe I just don't know how to do a design pattern well, but from my experience, just the stuff that
228 00:32:06.150 --> 00:32:15.399 lapid: changes in behavior from something synchronous to asynchronous has forced me to change my code way deeper than I wanted.
229 00:32:15.650 --> 00:32:22.590 lapid: So this is what I'm actually curious about. This is the pain I experienced.
230 00:32:23.800 --> 00:32:28.240 lapid: did it? Didn't like next like, Are you? Are you?
231 00:32:28.430 --> 00:32:40.939 lapid: Something changes in the future, I want to add something. Let's say I'm scraping some information and retrieving it, and I want to do it in parallel.
232 00:32:41.100 --> 00:32:44.689 lapid: But I have an existing project that was not,
233 00:32:44.840 --> 00:32:54.990 lapid: so far, didn't assume anything has to be in parallel, cause I had a different data source that I used before, and it was way, way faster. So now.
234 00:32:54.990 --> 00:32:55.840 lapid: so.
235 00:32:56.690 --> 00:33:02.479 Eyal Balla: So I think what you would do is you would add, maybe, an asyncio context to
236 00:33:02.620 --> 00:33:06.920 Eyal Balla: part of the code, right? And then
237 00:33:07.310 --> 00:33:14.299 Eyal Balla: run it, maybe, with asyncio.run, and the rest would remain synchronous.
238 00:33:14.780 --> 00:33:20.479 Eyal Balla: So you can limit the extent of what you're touching.
239 00:33:21.090 --> 00:33:27.019 Eyal Balla: And also, as always, be sure to test the specific part
240 00:33:27.350 --> 00:33:32.620 Eyal Balla: as, like, a different library that you're calling,
241 00:33:33.050 --> 00:33:35.920 Eyal Balla: and treat it like one,
242 00:33:36.190 --> 00:33:44.869 Eyal Balla: like a different code component, a different part of the code, and put it somewhere that's self-contained,
243 00:33:46.018 --> 00:33:48.710 Eyal Balla: and maybe that can help.
244 00:33:49.990 --> 00:34:00.119 lapid: Yeah. So what you're describing is how I solved it. But actually, I was asking myself, if I had to go over the project again,
245 00:34:00.440 --> 00:34:06.849 lapid: saying, oh, maybe in the future I will have some asynchronous part,
246 00:34:07.720 --> 00:34:13.109 lapid: would I want to actually prepare my code for the possibility of something running async in the future?
247 00:34:13.980 --> 00:34:17.570 Eyal Balla: So I can tell you that we
248 00:34:18.586 --> 00:34:23.349 Eyal Balla: needed to move from synchronous code to async code in our company.
249 00:34:23.860 --> 00:34:25.130 Eyal Balla: And this is
250 00:34:25.620 --> 00:34:31.530 Eyal Balla: quite a big migration, because, as I described in the beginning of the talk,
251 00:34:31.690 --> 00:34:39.639 Eyal Balla: using asyncio is something very different from the design of a synchronous program.
252 00:34:40.050 --> 00:34:43.910 Eyal Balla: So I don't think I have
253 00:34:44.429 --> 00:34:56.759 Eyal Balla: anything that I can say, like, if you write a synchronous program and you want to prepare, do this and that. Because I think that you need to look at it in a very different way, writing async code versus sync code.
254 00:34:57.500 --> 00:35:01.580 lapid: Okay. So it sounds like you went through the same problems I had. So.
255 00:35:02.470 --> 00:35:02.940 Eyal Balla: Yeah.
256 00:35:02.940 --> 00:35:04.480 lapid: At least we suffer together.
257 00:35:06.480 --> 00:35:08.990 Eyal Balla: Suffering. Sharing is always good.
258 00:35:08.990 --> 00:35:11.800 lapid: Yeah, yeah, thank you.
259 00:35:12.560 --> 00:35:13.260 Eyal Balla: You're welcome.
260 00:35:15.480 --> 00:35:16.050 Eyal Balla: Anything else.
261 00:35:16.050 --> 00:35:21.090 Naty Harary: Yeah. Yeah, I have a question. I'm using a lot of 3rd-party
262 00:35:21.766 --> 00:35:39.670 Naty Harary: libraries. FastAPI, SQLAlchemy, things like that. And they sometimes hide the implementation of, I think, the I/O, and I always wondered, because I just believe it works well. Is there any way to query the event loop? So I know,
263 00:35:39.910 --> 00:35:43.440 Naty Harary: like, what's running right now.
264 00:35:43.810 --> 00:35:48.510 Naty Harary: Is it even possible? Is that something that python hides from us completely?
265 00:35:48.690 --> 00:35:56.280 Eyal Balla: So there is a way to query all the tasks that are running in asyncio.
266 00:35:56.680 --> 00:35:57.110 Naty Harary: All right.
267 00:35:58.330 --> 00:36:06.459 Eyal Balla: And also there's a library I did not talk about here, and it's called aiomonitor. So you can look into that, too.
268 00:36:07.030 --> 00:36:08.999 Eyal Balla: And it's very nice.
269 00:36:09.965 --> 00:36:11.860 Eyal Balla: So you can try that, too.
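The task introspection Eyal alludes to is `asyncio.all_tasks()`; `aiomonitor` is a separate third-party package and is not shown here. A minimal standard-library sketch (the worker coroutine and task names are invented for illustration):

```python
import asyncio

async def worker(name: str) -> None:
    # A placeholder coroutine standing in for hidden library I/O.
    await asyncio.sleep(0.05)

async def main() -> list:
    # Spawn a few named tasks, then ask the event loop what is running.
    tasks = [asyncio.create_task(worker(f"job-{i}"), name=f"job-{i}")
             for i in range(3)]
    running = sorted(t.get_name() for t in asyncio.all_tasks()
                     if t is not asyncio.current_task())
    await asyncio.gather(*tasks)
    return running

running_names = asyncio.run(main())
print(running_names)  # the three worker tasks, visible via asyncio.all_tasks()
```

`all_tasks()` returns every not-yet-finished task in the current loop, so it will also surface tasks created deep inside third-party libraries.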
270 00:36:12.550 --> 00:36:15.994 Naty Harary: Alright cool cause you talked about the timing, so I didn't know if
271 00:36:16.470 --> 00:36:19.789 Naty Harary: if there are other concerns, but aiomonitor, too. That's cool.
272 00:36:19.930 --> 00:36:20.890 Naty Harary: We'll check it.
273 00:36:22.530 --> 00:36:23.380 Naty Harary: Thank you.
274 00:36:27.600 --> 00:36:29.680 Eyal Balla: Okay, so I think we're done.
275 00:36:30.493 --> 00:36:46.420 Eyal Balla: Thank you guys for listening. And you can reach me this email or Linkedin. And also there's a Github project with this presentation and all the code samples together.
276 00:36:47.370 --> 00:36:51.650 Eyal Balla: I can. I think I sent it to Gabor last time, but I can send it again.
277 00:36:51.880 --> 00:36:53.749 Eyal Balla: and he can spread it out.
278 00:36:53.970 --> 00:36:58.322 Gabor Szabo: Yes, that would be a good idea. I think I'm going to include it in... there is this
279 00:36:58.740 --> 00:37:09.540 Gabor Szabo: web page about the presentation which will be linked from the video. And then on that page you'll see, I include these links as well.
280 00:37:09.810 --> 00:37:12.380 Gabor Szabo: Oh, so your your Linkedin, and your
281 00:37:12.870 --> 00:37:15.660 Gabor Szabo: and the link to your that Github page.
282 00:37:15.950 --> 00:37:16.730 Gabor Szabo: We'll get to.
283 00:37:16.730 --> 00:37:17.390 Eyal Balla: Okay. Great.
284 00:37:17.390 --> 00:37:18.530 Gabor Szabo: Repository.
285 00:37:18.820 --> 00:37:19.959 lapid: It's a nice, clear.
286 00:37:19.960 --> 00:37:22.739 Gabor Szabo: No more questions. Then. Thank you very much.
287 00:37:23.530 --> 00:37:27.019 Gabor Szabo: Yeah, thank you. Everyone for participating. And.
288 00:37:27.020 --> 00:37:28.569 Eyal Balla: I think there's 1 more question.
289 00:37:28.570 --> 00:37:31.890 lapid: Oh, I if I can ask chitchat questions.
290 00:37:32.120 --> 00:37:32.950 Gabor Szabo: You're good. Go ahead.
291 00:37:32.950 --> 00:37:37.289 lapid: So you said you have a company. Like, what does your company do, and
292 00:37:37.430 --> 00:37:39.700 lapid: can you elaborate a little bit more?
293 00:37:40.840 --> 00:37:43.977 Eyal Balla: Sure. So, what we do:
294 00:37:45.833 --> 00:37:55.216 Eyal Balla: we do security for healthcare. So we give hospitals tools to understand their security posture, and
295 00:37:56.927 --> 00:38:03.010 Eyal Balla: attack detection. So we detect malicious content and attacks on hospitals.
296 00:38:04.950 --> 00:38:06.262 Eyal Balla: And I think,
297 00:38:07.670 --> 00:38:16.929 Eyal Balla: because hospitals are very sensitive, we need to handle a very high scale. We do it with passive network inspection.
298 00:38:17.150 --> 00:38:19.275 Eyal Balla: And so we handle like,
299 00:38:20.770 --> 00:38:30.910 Eyal Balla: quite a lot of information in our cloud. So we need to use tools, and also using asyncio helps us
300 00:38:31.040 --> 00:38:34.540 Eyal Balla: scale out and handle things as we need.
301 00:38:35.900 --> 00:38:37.260 Eyal Balla: I hope that.
302 00:38:38.536 --> 00:38:39.730 lapid: Get answered.
303 00:38:40.010 --> 00:38:46.181 lapid: Yeah, no, I'm just curious. It's very far from my expertise. I'm a data scientist, and
304 00:38:46.960 --> 00:38:54.960 lapid: I came across async when some of my projects needed some boost.
305 00:38:55.770 --> 00:39:01.929 lapid: Are you also looking for data scientists? I'm not available now, but in about a month or two.
306 00:39:03.532 --> 00:39:14.997 Eyal Balla: So I think data scientist is not something that we're currently looking for. But you, you could actually look at the company career page. There are several positions, and
307 00:39:16.130 --> 00:39:19.450 Eyal Balla: we're expanding, and it's it's a good time, I think.
308 00:39:20.570 --> 00:39:22.509 Eyal Balla: Alright. Then brush.
309 00:39:24.170 --> 00:39:24.860 lapid: Alright. Thank you.
310 00:39:26.120 --> 00:39:32.120 Gabor Szabo: So thank you. Thank you very much. Thank you, Eyal, for the presentation, and for all the questions, people,
311 00:39:32.430 --> 00:39:38.840 Gabor Szabo: and everyone who was watching, please again, like the video, as I told you, and follow the Channel.
312 00:39:38.980 --> 00:39:46.519 Gabor Szabo: And if you would like to present at one of our meetings, then please get in touch with me.
313 00:39:46.790 --> 00:39:53.759 Gabor Szabo: I would be happy to to provide the the place for people to to share their their knowledge.
314 00:39:54.540 --> 00:39:55.679 Gabor Szabo: Thank you very much.
315 00:39:56.000 --> 00:39:56.850 Gabor Szabo: Goodbye.
316 00:39:56.850 --> 00:39:57.250 Eyal Balla: Bye-bye.
317 00:39:57.250 --> 00:39:57.590 Dmitry Morgovsky: Sure.
318 00:39:57.590 --> 00:39:58.919 lapid: Bye-bye. Thank you.
319 00:39:59.980 --> 00:40:01.209 Shalaka Deshan: Thank you, anyway.
Join us and time-travel across the evolution of Python monitoring mechanisms. We'll delve into history from dedicated tools like sys.monitoring to more advanced techniques such as ceval and import hooks. This session will provide a comprehensive overview of how monitoring practices have developed over the years, offering insights into the best practices for maintaining and debugging your Python code and the pros and cons of each approach. Whether you're a seasoned developer or new to Python, you'll gain valuable knowledge on how to keep your code running smoothly and efficiently without hurting performance or your dev velocity with tedious maintenance.

1 00:00:00.720 --> 00:00:02.690 Haki Benita: This meeting is being recorded.
2 00:00:03.400 --> 00:00:04.320 Gabor Szabo: Okay.
3 00:00:05.800 --> 00:00:12.250 Gabor Szabo: yeah. So hi, and welcome to the Python Maven, let's call it Python Maven. This is the Code Maven
4 00:00:12.500 --> 00:00:41.910 Gabor Szabo: Youtube channel. And we are organizing these meetings in the Codebay Events group, but sort of it has 3 separate sessions, and this is going to be the Python-specific one. My name is Gabor Szabo. I usually teach Python and Rust and help companies introduce testing, and I also like to organize these events and allow people to share their knowledge with each other.
5 00:00:42.270 --> 00:00:46.010 Gabor Szabo: You're welcome. I'm really happy that you're here
6 00:00:46.140 --> 00:01:04.909 Gabor Szabo: in this session, listening, as I mentioned earlier, you're welcome to to comment or use the chat and ask questions. And if you're just watching the video recorded on Youtube, then please remember to like the video and follow the channel.
7 00:01:05.080 --> 00:01:11.990 Gabor Szabo: and let's welcome Haki now, and let him introduce himself
8 00:01:12.700 --> 00:01:17.579 Gabor Szabo: and give the presentation. So thank you for accepting the invitation.
9 00:01:18.970 --> 00:01:31.149 Haki Benita: Thank you. Thank you, Gabor. First of all, I like the fact that we have this intimate group where we can talk freely. I actually encourage you to consider opening the mics.
10 00:01:31.210 --> 00:02:01.090 Haki Benita: Because I think we can actually have a conversation throughout the presentation. I like to give interactive presentations. Your call, you're the boss. And just a quick introduction about the subject and about myself. So we are going to talk about how to make your back end war. And I want to start by apologizing for the tacky headline. But unfortunately, these types of tacky headlines do work, believe it or not.
11 00:02:01.610 --> 00:02:09.010 Haki Benita: So. My name is Haki Benita. I'm a software developer and a technical lead. I'm currently leading a team
12 00:02:09.289 --> 00:02:18.949 Haki Benita: of developers working on a very large ticketing platform in Israel, serving about
13 00:02:19.580 --> 00:02:32.470 Haki Benita: 1.5 million unique paying users every month. And I also like to write and talk about Python performance and databases. And you can find my stuff on my website.
14 00:02:33.110 --> 00:02:47.839 Haki Benita: So today, we are going to talk about some lesser known features of indexes. And we're going to try and understand how they work and when we can and should use them
15 00:02:47.850 --> 00:03:14.629 Haki Benita: to do that, we are going to build a URL shortener together, and we're going to do it in Django. I would say that since this is a talk about Python, I'm going to use Django and the Django ORM. But the concepts that I'm going to describe are not specific to Django, and they're not specific to Postgres. Heck, they're not even specific to Python. But this is a good environment to explain the concepts with.
16 00:03:15.390 --> 00:03:19.889 Haki Benita: So what is a URL shortener? You probably know about
17 00:03:19.900 --> 00:03:39.330 Haki Benita: other types of URL shorteners. You have Bitly, you have the late goo.gl, buff.ly, and so on. Basically, a URL shortener is a system that provides a short URL that redirects to a longer URL. Now, why would you want to do that?
18 00:03:39.330 --> 00:04:02.240 Haki Benita: First, if you are operating in text-constrained environments like SMS messages or tweets, you might want to share a very large link. So you want to make it shorter, so it consumes less space. This is where short URLs can be handy. Another nice feature of URL shortening is that whenever someone clicks the short URL,
19 00:04:02.240 --> 00:04:16.500 Haki Benita: the URL shortener redirects to the long URL and keeps track of how many people click that link. So if you have something like a campaign that you want to launch, and you want to keep track of how many people clicked your link,
20 00:04:16.820 --> 00:04:20.149 Haki Benita: This is what you would use a URL shortener for
21 00:04:20.310 --> 00:04:48.240 Haki Benita: so to build our URL shortener in Django, we're going to start with this very, very simple model. We are calling the model short URL, we have an Id column which is the primary key. It's just an auto incrementing integer field. We have the key. That's a unique short piece of text that uniquely identifies our short URL. This is the short key at the end of the short URL.
22 00:04:48.500 --> 00:05:07.030 Haki Benita: We then have the URL, which is the long URL, we want to redirect to. We also want to keep track of when the URL was created. We do that using the created at column. And finally, we want to keep track of how many users click the link, and we do that with the hits column
23 00:05:07.180 --> 00:05:08.110 Haki Benita: at the bottom.
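The model described here might look like the sketch below. The Django field names are assumptions reconstructed from the talk, and a plain-SQL SQLite table stands in so the sketch is runnable without a Django project:

```python
import sqlite3

# Django sketch (assumed names; needs a Django project to actually run):
#
# class ShortURL(models.Model):
#     key = models.TextField(unique=True)           # short unique key
#     url = models.TextField()                      # long URL to redirect to
#     created_at = models.DateTimeField(auto_now_add=True)
#     hits = models.IntegerField(default=0)         # click counter
#
# The equivalent plain-SQL table, using SQLite for a runnable demo:
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE short_url (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- auto-incrementing PK
        key        TEXT NOT NULL UNIQUE,               -- short unique key
        url        TEXT NOT NULL,                      -- long URL to redirect to
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        hits       INTEGER NOT NULL DEFAULT 0          -- click counter
    )
""")
conn.execute("INSERT INTO short_url (key, url) VALUES (?, ?)",
             ("abc123", "https://example.com/long"))
row = conn.execute("SELECT url, hits FROM short_url WHERE key = ?",
                   ("abc123",)).fetchone()
print(row)
```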
24 00:05:08.960 --> 00:05:19.650 Haki Benita: So for our demonstration. So we actually have something to work with. I loaded 1 million short Urls into the table. Okay, now, this is not a lot. But we are going to see, some
25 00:05:20.700 --> 00:05:25.929 Haki Benita: performance gains with just 1 million rows. Okay.
26 00:05:26.810 --> 00:05:33.380 Haki Benita: so this talk is about python. But it's essentially about SQL, so
27 00:05:33.510 --> 00:05:54.859 Haki Benita: in Django, if you want to get the SQL generated by Django for a given queryset, you can do that by accessing queryset.query and printing it. In this case I'm doing ShortURL.objects.filter on a specific key, .query. And I can actually get Django to print
28 00:05:55.190 --> 00:05:59.549 Haki Benita: the SQL that it generated for this queryset, right?
29 00:06:00.040 --> 00:06:26.740 Haki Benita: So, after viewing the queryset, it's also very interesting to see how the database is planning to execute my query, right? I can do that by executing the function explain(). This translates into an EXPLAIN command in SQL, and what I get in return is not the results of the query but the execution plan, which is how the database is planning
30 00:06:26.930 --> 00:06:30.979 Haki Benita: to execute my query. Now, when we just use EXPLAIN,
31 00:06:31.200 --> 00:06:36.260 Haki Benita: the database doesn't actually execute the query. It just produces a plan
32 00:06:36.370 --> 00:06:53.839 Haki Benita: sometimes, especially when we're benchmarking and we're trying to improve performance. It can be useful to produce the execution plan, but also have the database, execute this query and return some useful execution data. For that we can use a slightly different variation of the explain command.
33 00:06:53.970 --> 00:07:13.319 Haki Benita: which is EXPLAIN ANALYZE. In Django, you can do that by using explain(analyze=True). In SQL, in Postgres specifically, you can do EXPLAIN (ANALYZE ON, TIMING ON) in parentheses, followed by the query, and then you get some additional information about the execution plan.
34 00:07:13.350 --> 00:07:27.339 Haki Benita: First, because the database actually executed the query, you can see at the bottom that we get how long it took the database to produce an execution plan; in this case that would be 0.140 ms,
35 00:07:27.710 --> 00:07:38.510 Haki Benita: and I also get how long it took the database to execute the query from start to end. In this case that would be 0.046 ms. Okay.
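As a runnable stand-in for this workflow: the talk uses Postgres and Django's queryset.explain(), while SQLite's closest equivalent is EXPLAIN QUERY PLAN, so this sketch uses that (table and key values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_url (key TEXT UNIQUE, url TEXT)")
conn.execute("INSERT INTO short_url VALUES ('abc', 'https://example.com')")

# Postgres: EXPLAIN (ANALYZE ON, TIMING ON) SELECT url FROM short_url WHERE key = 'abc';
# Django:   ShortURL.objects.filter(key='abc').explain(analyze=True)
# SQLite's closest equivalent is EXPLAIN QUERY PLAN:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url FROM short_url WHERE key = 'abc'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)  # the UNIQUE constraint's automatic index is used for the lookup
```

EXPLAIN QUERY PLAN, like plain EXPLAIN in Postgres, describes the plan without the actual-vs-estimated cost detail that EXPLAIN ANALYZE adds.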
36 00:07:39.430 --> 00:07:47.120 Haki Benita: Now, in addition to the timing. I'm also getting a very, very interesting piece of information inside the execution plan.
37 00:07:47.260 --> 00:07:53.699 Haki Benita: Okay, what I get is the estimated cost and the actual cost
38 00:07:53.820 --> 00:07:58.059 Haki Benita: that the database encountered while executing the query. So
39 00:07:59.010 --> 00:08:15.400 Haki Benita: discussing the cost-based optimizer is slightly outside the scope of this talk, I would just say that, comparing the expected cost to the actual cost is a very useful measure to try and identify bad execution plans.
40 00:08:16.100 --> 00:08:17.350 Haki Benita: Finally.
41 00:08:17.990 --> 00:08:28.419 Haki Benita: another way of viewing queries is to turn on the logger for the database backend in Django. This way, whenever Django executes a query,
42 00:08:29.040 --> 00:08:32.620 Haki Benita: it logs the SQL that was produced by the ORM.
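A settings.py sketch of the logger setup just described, assuming DEBUG=True (Django only records queries for the django.db.backends logger in debug mode):

```python
# settings.py fragment: log every SQL statement the ORM sends to the database.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        # The logger Django uses for database backend queries.
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}
```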
43 00:08:33.510 --> 00:08:34.475 Haki Benita: So
44 00:08:35.700 --> 00:09:05.329 Haki Benita: to actually start discussing some indexing techniques, we need to start implementing some, you know, business processes. So let's start with the most basic thing that a URL shortener actually does, and that's looking up the URL to redirect to by key. So a user uses one of our short URLs, we get the unique key, and we need to find the long URL to redirect to. Okay, this is like the bread and butter of this system.
45 00:09:05.440 --> 00:09:27.109 Haki Benita: So if we want to implement this very, very simple function, we can do something like that: def resolve, okay, that's the name of the function. We want to resolve a key to a URL. We accept a key, and then we execute this simple query to just get a short URL for this key. If we don't find anything we return None; otherwise we return the URL to redirect to.
46 00:09:27.110 --> 00:09:37.730 Haki Benita: Okay. Now we want to look at the SQL that Django generated for this function, right? So we execute this function on some random key
47 00:09:37.950 --> 00:09:57.950 Haki Benita: with SQL logging turned on, and we can see the query right here. Now, if you look at this query, it looks like Django basically fetches everything from the short URL table for the key that we asked for, right? SELECT * FROM short_url WHERE key = something.
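A sketch of the resolve function: the Django version is reconstructed from the talk as a comment (model and field names assumed), with a runnable SQLite equivalent below it:

```python
import sqlite3
from typing import Optional

# Django sketch from the talk (assumed names, needs a Django project):
#
# def resolve(key):
#     short_url = ShortURL.objects.filter(key=key).first()
#     return short_url.url if short_url else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_url (key TEXT UNIQUE, url TEXT)")
conn.execute("INSERT INTO short_url VALUES ('abc', 'https://example.com/long')")

def resolve(key: str) -> Optional[str]:
    # Look up the row by its unique key; return None when nothing matches.
    row = conn.execute("SELECT url FROM short_url WHERE key = ?",
                       (key,)).fetchone()
    return row[0] if row else None

print(resolve("abc"))
print(resolve("missing"))
```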
48 00:09:58.270 --> 00:10:05.050 Haki Benita: If we want to look at how postgres is actually executing this query.
49 00:10:05.210 --> 00:10:12.719 Haki Benita: we can use the explain command. And what we get is that Postgres is planning to use an index scan
50 00:10:13.535 --> 00:10:20.159 Haki Benita: on the index we have on the key column. Okay, now.
51 00:10:21.180 --> 00:10:28.839 Haki Benita: to understand what exactly an index scan means, let's take a second to talk about B-tree indexes.
52 00:10:29.040 --> 00:10:42.120 Haki Benita: So the B-tree index is like the king of all indexes. This is the default index in most database engines. If you're not sure what type of index you're using, you're probably using a B-tree index. Okay?
53 00:10:42.560 --> 00:11:11.160 Haki Benita: So to understand how a B-tree index works, let's start by building one. So imagine you have these values, one through 9, and you want to create a B-tree index on them. You start by sorting the values and storing them in leaf blocks. You can see the leaf blocks at the bottom. They are sorted from left to right. We have 1, 2, 3, all the way through 9. Now, every entry in the leaf blocks contains a list of TIDs. These are pointers to rows in the table
54 00:11:11.400 --> 00:11:15.460 Haki Benita: That store rows with these values. Okay.
55 00:11:16.290 --> 00:11:28.179 Haki Benita: now, above the leaves, we have branches and root block that acts as a directory to these leaf blocks. So let's see how this works. Let's imagine that we want to look.
56 00:11:28.180 --> 00:11:38.290 Gabor Szabo: Sorry, just someone says that they don't see the slides. So I just wanted to check, and I'm unsure if the other people do see the slides. So if
57 00:11:38.670 --> 00:11:53.529 Gabor Szabo: I asked in the chat, but no one answered. Okay, so some other people see it, so my recommendation to Eduardo is to maybe exit Zoom and enter Zoom again. Sorry for the.
58 00:11:53.530 --> 00:11:54.940 Haki Benita: Okay, no problem.
59 00:11:55.120 --> 00:11:56.160 Haki Benita: Yeah.
60 00:11:56.400 --> 00:11:59.700 Haki Benita: Okay, okay, so let's
61 00:12:01.690 --> 00:12:31.100 Haki Benita: okay. So let's try to search for the value 5 in the B-tree index that we just built. So we start with the root block and we start scanning from left to right. 5 is larger than 3, so we skip the first entry. 5 is between 3 and 7, so we follow this pointer to the middle leaf block. We then start scanning the leaf block from left to right. The first value is 4; it's not a match.
62 00:12:31.100 --> 00:12:36.150 Haki Benita: The next value is 5. That's a match, and now we can
63 00:12:36.150 --> 00:12:47.970 Haki Benita: scan. We can follow the pointers from this leaf block to the rows in the table. We can read the rows and do whatever we need to do with these rows. Okay, now.
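The root-to-leaf walkthrough above can be sketched as a toy two-level structure. This is a deliberate simplification: real B-trees have variable fan-out, rebalancing, and a list of TIDs per leaf entry, none of which are modeled here.

```python
# Leaf blocks hold the sorted values 1..9, as in the talk's example.
leaves = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Root block: separator keys. Values <= 3 go to leaf 0,
# values <= 7 to leaf 1, everything else to leaf 2.
root = [3, 7]

def btree_search(value):
    # Scan the root left to right to pick the child leaf block...
    for i, sep in enumerate(root):
        if value <= sep:
            leaf = leaves[i]
            break
    else:
        leaf = leaves[-1]
    # ...then scan that leaf left to right for a match.
    for entry in leaf:
        if entry == value:
            return leaf, entry  # in a real index: follow TIDs to table rows
    return leaf, None

leaf, found = btree_search(5)
print(leaf, found)  # searching for 5 lands in the middle leaf block
```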
64 00:12:48.310 --> 00:13:15.100 Haki Benita: let's go back to our query. Okay, one second, yeah. Let's go back to our query. Remember that we said that Django generated this query and this query is fetching everything right, basically select star from short URL. But, in fact, if you think about it, we don't actually care about all these fields right? We only care about the URL. I mean, we're not looking to resolve
65 00:13:16.290 --> 00:13:27.129 Haki Benita: a key to a URL for the purpose of redirecting. I don't care when it was created. I don't care about the Id. I already have the key right, and I don't care about the head counter at this point
66 00:13:27.610 --> 00:13:30.209 Haki Benita: right? So I don't care about all these fields. So
67 00:13:30.770 --> 00:13:55.089 Haki Benita: one thing that we can do is, instead of fetching all of these fields, how about we just fetch what we actually need, right? So in Django, we can do that by adding values_list('url'). Now the function is slightly different, but if we look at the SQL generated by this function, we can see that now, instead of fetching all the columns in the row, we just fetch the URL. So this is exactly what we need.
68 00:13:55.200 --> 00:14:10.249 Haki Benita: If we look at the execution plan once again for this query, we can see that again, Postgres is using an index scan on the unique index that we have on the key. Right? So now,
69 00:14:10.920 --> 00:14:30.719 Haki Benita: once we found a matching row, we can follow the pointer to the table. We can get the URL from the table. So if you imagine the amount of disk reads I need to do to satisfy this query: I'm starting by reading the root block, right? So that's one read. Then I need to follow the branch all the way to the leaf. Let's say that we have just
70 00:14:30.730 --> 00:14:41.789 Haki Benita: you know, root block, and then directly to the leaf. So reading the leaf is another read, and then we need to follow the link from the leaf block to read the row from the table. So this is a unique
71 00:14:41.970 --> 00:14:52.020 Haki Benita: column. So we have at most one row. So that's another read. So basically, we did 3 random reads to satisfy this query right now.
72 00:14:53.290 --> 00:15:03.019 Haki Benita: this query is executed a lot. This is basically what our system is doing right. It's getting keys and resolving them to Urls to redirect right
73 00:15:03.360 --> 00:15:17.979 Haki Benita: now. We already established that all we care about in this specific scenario is just the URL. I don't care about anything else. I care just about the URL. So what if? And stay with me? This is mind blowing.
74 00:15:17.980 --> 00:15:34.249 Haki Benita: What if, instead of going to the table to get the URL. What if I could include the URL in the leaf block in the index this way? When I found a matching entry in the leaf block, I would have the URL just sitting there.
75 00:15:34.310 --> 00:15:52.420 Haki Benita: Right? So this mind-blowing idea is called inclusive index. Okay, in other databases it's called covering index or inclusive indexes, and what it allows us to do, it allows us to store additional information in the leaf block.
76 00:15:52.500 --> 00:16:14.569 Haki Benita: So if we want to use an inclusive index in Django, we can add the include argument to the unique constraint. Now look, the key is indexed. The URL is not indexed. It's just included in the leaf block. Okay. Now, if we generate a migration, we apply it and we try the query again.
77 00:16:15.500 --> 00:16:21.569 Haki Benita: You can see that once again, Postgres is using our index, our unique index on the key. But there is
78 00:16:21.900 --> 00:16:33.889 Haki Benita: very, very subtle difference here. If you notice. Previously we had an index scan using our unique index. This time we have an index only scan.
79 00:16:34.020 --> 00:17:03.620 Haki Benita: This means that Postgres was able to satisfy the query without accessing the table. All the data that it needs was already in the leaf block. So if we once again imagine how many reads we need to do to satisfy this query, using the inclusive index, we read the root block. We follow the pointer all the way down to the leaf block, and now, instead of going to the table to read the URL. We have the URL right there in the leaf block. So we only need to read
80 00:17:03.670 --> 00:17:05.849 Haki Benita: 2 blocks from disk.
81 00:17:06.150 --> 00:17:17.110 Haki Benita: Okay, the way to identify. This is by the operator on the index only scan right? So we have an index scan, and we have an index. Only scan.
82 00:17:18.170 --> 00:17:39.170 Haki Benita: So quick recap about inclusive indexes, as I mentioned in other databases. They are sometimes called covering indexes, and they allow us to fulfill queries without accessing the table. However, you should use them with caution. Because if you think about it, we're basically duplicating data from the table to the index. Okay?
83 00:17:39.170 --> 00:17:49.959 Haki Benita: So if you have a very big piece of information, and a URL can be very, very big, so basically I'm now storing the URL
84 00:17:50.140 --> 00:18:09.440 Haki Benita: twice. So the index could get very, very big. I'm actually not a big fan of inclusive indexes. But I can think of 2 scenarios where it might be a good idea. First, st if you have very wide tables. Imagine, like data, warehouse type of tables, denormalized tables.
85 00:18:09.600 --> 00:18:11.520 Haki Benita: and you have a very
86 00:18:12.250 --> 00:18:22.290 Haki Benita: predefined set of queries that are executed very, very often on a very, very small subset of columns, you can consider doing using
87 00:18:23.440 --> 00:18:50.249 Haki Benita: an inclusive index. And also, I personally found that non unique composite indexes can be good candidates for inclusive indexes that is, indexes on multiple columns that are not used to enforce a unique constraint. Sometimes they can benefit from switching to just a composite index to an inclusive index. Okay, questions so far before we move on to the next use case.
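A runnable sketch of the index-only-scan effect described in this section. SQLite has no INCLUDE clause, so a composite index stands in for Postgres's inclusive index; the Postgres and Django forms appear only as comments and the exact Django syntax (UniqueConstraint's include argument) is an assumption about a Django 3.2+ feature:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_url (key TEXT, url TEXT)")
conn.execute("INSERT INTO short_url VALUES ('abc', 'https://example.com/long')")

# Postgres: CREATE UNIQUE INDEX ... ON short_url (key) INCLUDE (url);
# Django:   UniqueConstraint(fields=['key'], include=['url'], name='...')
# SQLite has no INCLUDE, so a composite index stands in for the demo:
conn.execute("CREATE INDEX short_url_key_url_ix ON short_url (key, url)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url FROM short_url WHERE key = 'abc'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)  # a "covering" scan: the URL is read from the index itself
```

Because every column the query touches lives in the index, the plan reports a covering (index-only) scan and the table is never read, which is exactly the two-reads-instead-of-three effect from the talk.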
88 00:18:55.710 --> 00:19:02.210 Haki Benita: Okay, if you have any questions, feel free, let's move on to the next to the next use case.
89 00:19:02.800 --> 00:19:04.080 Haki Benita: So now
90 00:19:04.230 --> 00:19:16.229 Haki Benita: we want to find unused keys, right? We have this business question: we want to know how many short URLs we have with no hits at all. Okay, we have 0 hits.
91 00:19:17.070 --> 00:19:23.050 Haki Benita: So we start by implementing this very, very simple function. We call it find_unused,
92 00:19:23.350 --> 00:19:26.190 Haki Benita: and it returns a query set where
93 00:19:26.790 --> 00:19:43.480 Haki Benita: with short URLs where hits equals 0. Once again, if we want to see what the query looks like, we can print the query. We can see that it's SELECT * FROM short_url WHERE hits = 0.
94 00:19:44.560 --> 00:19:58.929 Haki Benita: Once again, through the same process, we produce an execution plan. This time we can see that Postgres is doing a sequential scan on short_url. A sequential scan is basically a full table scan: Postgres is just
95 00:19:59.010 --> 00:20:18.369 Haki Benita: reading the table row by row, looking for rows where hits equals 0. We can see that the execution time at the bottom is 116 ms. Let's say, for the sake of discussion, that this is very, very slow, and we want to try and improve that.
96 00:20:18.450 --> 00:20:48.250 Haki Benita: So if you go to like 99% of developers and DBAs, they will tell you what's the problem, just slap a B-tree on it, right? So we add a B-tree index on the hits column. We do that in Django using db_index=True. We generate a migration, we apply the migration, we once again produce the execution plan with analyze, and lo and behold,
97 00:20:48.310 --> 00:20:56.180 Haki Benita: Postgres is using our index, short_url_hits_ix. And, as you can see, the execution time
98 00:20:56.810 --> 00:21:02.370 Haki Benita: is very, very fast compared to before, so we're done right.
99 00:21:03.230 --> 00:21:06.060 Haki Benita: We can call it the day we can go for lunch.
100 00:21:06.330 --> 00:21:08.609 Haki Benita: We're happy. It's fast. Now
101 00:21:09.310 --> 00:21:20.299 Haki Benita: stop, let's take a second to talk about performance and what it actually means. Okay? Because intuitively, when we talk about performance, we talk about
102 00:21:20.380 --> 00:21:37.639 Haki Benita: speed right? We want things to be very, very quick. But I think, or the way I view performance is that we need to balance different types of resources. And I want to illustrate this with an example. Okay, let's say that you have this batch processing job running at night.
103 00:21:37.640 --> 00:21:53.420 Haki Benita: Now, this batch processing job runs at the middle of the night, where you have very, very little users, and it runs very, very fast. It takes like this batch processing job like 10 seconds to complete. You're so happy, so fast. However.
104 00:21:53.720 --> 00:22:05.569 Haki Benita: however, this job consumes huge amounts of memory, huge amounts of CPU and huge amounts of disk space right. What if I told you that
105 00:22:06.440 --> 00:22:12.950 Haki Benita: if we are willing to compromise, and instead of completing in 10 seconds, it takes a minute
106 00:22:13.410 --> 00:22:38.970 Haki Benita: right? It consumes very little memory disk space and CPU, right? I'm guessing that if you pay a lot of money for memory, you are willing to make this compromise. Okay, I'll give you another example. Let's say that you have this background job running in the middle of the day. Right now, this background job consumes a lot of CPU so much CPU, in fact, that it starts to interfere with user traffic in the system.
107 00:22:39.030 --> 00:23:07.120 Haki Benita: In this case, instead of optimizing for time, you might be optimizing for CPU, right? You're willing to compromise a few seconds. But you don't want the background job to consume a lot of CPU. So when we talk about performance. We talk about more than just speed. We're talking about how we can balance different resources in the system, usually depending on some type of context time of day the type of resource that we have available at this time. Right?
108 00:23:07.670 --> 00:23:23.450 Haki Benita: So remember that we slapped a B-tree on it, right? And it was very, very fast, but I'm not sure that was the most optimal thing that we could have done. So let's go to the database and see
109 00:23:23.580 --> 00:23:33.769 Haki Benita: and check the size of the index we created to solve this teeny, tiny problem. Okay, so this index.
110 00:23:34.570 --> 00:23:41.979 Haki Benita: right, is 7 MB. Okay, so that's pretty big for this type of index.
111 00:23:42.120 --> 00:23:47.420 Haki Benita: So our 7 MB index includes
112 00:23:47.630 --> 00:23:57.789 Haki Benita: all the rows in the table, right? We just added db_index=True to create a B-tree index on the column, so it contains all the 1 million rows in the table. But
113 00:23:58.570 --> 00:24:05.790 Haki Benita: we actually don't care about all the rows in the table. Right? Nobody asked us how many
114 00:24:06.150 --> 00:24:25.690 Haki Benita: short Urls you have with less than 5 hits, or more than 266 hits, or exactly 1,000 hits. Nobody cares about that. We had a very specific question that we wanted to answer in regards to the hits. We wanted to find how many short Urls we have with exactly 0 hits.
115 00:24:26.100 --> 00:24:37.350 Haki Benita: So what if, instead of indexing the all the rows in the table, we could index just a portion of the rows, the part of the table that we actually care about.
116 00:24:37.810 --> 00:24:51.950 Haki Benita: Right? So this is a once again mind-blowing idea, and this is made possible with something called partial indexes. Partial indexes, allows us to index just a part of the table that we actually care about.
117 00:24:52.810 --> 00:25:08.019 Haki Benita: So going back to our Django model, right? First we start by removing db_index from the column definition (you should never use db_index, regardless of this), and then, instead of adding this default index on the column,
118 00:25:08.020 --> 00:25:28.989 Haki Benita: we add a proper index. Right? But we add a condition. Okay, so what this does, it creates an index on the Id column with a condition where hits equals 0. This would cause postgres to create an index just on the rows that satisfy this query. Just on rows
119 00:25:29.200 --> 00:25:54.569 Haki Benita: where hits equal 0. Right? So we generate the migration, we apply the migration, and we try the query again. We produce an execution plan, and we can see that Postgres is using our index. Right? We see an index scan using short_url_unused_part_ix. This is the index we just created. Okay, so Postgres is able to use the index we just created, the partial index,
120 00:25:55.000 --> 00:26:04.670 Haki Benita: to satisfy this very specific query. We can also see that the query is very, very fast, even compared to the full index. Right?
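The effect described here is easy to reproduce outside Postgres. Below is a small sketch using Python's bundled SQLite driver, which also supports partial indexes; the table name, index name, and row counts are invented for illustration, and SQLite's plan output is much terser than Postgres's EXPLAIN:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT, url TEXT, hits INT)")
con.executemany(
    "INSERT INTO short_url (key, url, hits) VALUES (?, ?, ?)",
    [(f"k{i}", f"https://example.com/{i}", i % 10) for i in range(1000)],
)

# Index only the rows the question is about: hits = 0 (100 of the 1000 rows).
con.execute("CREATE INDEX short_url_unused_part_ix ON short_url (id) WHERE hits = 0")

# SQLite's EXPLAIN QUERY PLAN is the (much simpler) cousin of Postgres's EXPLAIN:
# each output row's fourth column is a human-readable plan step.
plan = " ".join(
    row[3] for row in con.execute(
        "EXPLAIN QUERY PLAN SELECT count(*) FROM short_url WHERE hits = 0"
    )
)
```

Because the query's WHERE clause matches the index predicate exactly, the plan should report a scan over the small partial index rather than the full table.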
121 00:26:05.090 --> 00:26:13.180 Haki Benita: But that wasn't the motivation here, right? This is not what we look to optimize. If we go back
122 00:26:13.320 --> 00:26:28.990 Haki Benita: to the database, and we look at the size of this index. Look at that. The partial index is just 88 kB in size. Okay? Previously the full index was 7 MB. The partial index is 88 kB.
123 00:26:28.990 --> 00:26:48.659 Haki Benita: So I did the math. Seriously, I opened Excel. I did the math. That's 99% smaller. Okay, so that's a lot of space. Now, at this point you're probably saying, Come on, man, it's just 7 MB, who cares? But if you go back to your system, and you have huge tables with hundreds of millions or billions of rows, right?
124 00:26:48.840 --> 00:27:06.290 Haki Benita: Check the size of your B-tree indexes. They can become huge. I've seen situations where the B-tree index was larger than the table. Okay, and if you have a lot of indexes it can grow out of control very, very quickly.
125 00:27:07.020 --> 00:27:21.090 Haki Benita: So, as you may guess, I'm a very, very big fan of partial indexes. They produce smaller indexes, and I highly encourage you to use them whenever possible. One limitation of partial indexes is that
126 00:27:22.030 --> 00:27:26.349 Haki Benita: the database can only use partial indexes when
127 00:27:26.500 --> 00:27:52.249 Haki Benita: the query uses the exact same condition as the predicate in the index. Right? The database is not even smart enough to do something like WHERE hits = 1 - 1. Okay, it's limited to that level. So it's limited to queries that use the exact same condition. Usually it's fine, because, you know, why would you do hits equal one minus one?
128 00:27:52.380 --> 00:27:53.080 Haki Benita: I don't know.
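This limitation is easy to demonstrate with the stdlib SQLite driver, which matches partial-index predicates in a similarly literal way; the schema and index name here are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE short_url (id INTEGER PRIMARY KEY, hits INT)")
con.executemany("INSERT INTO short_url (hits) VALUES (?)", [(i % 5,) for i in range(500)])
con.execute("CREATE INDEX zero_hits_ix ON short_url (id) WHERE hits = 0")

def plan(sql):
    # Flatten EXPLAIN QUERY PLAN output into one string for easy inspection.
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

uses_index = plan("SELECT count(*) FROM short_url WHERE hits = 0")  # matches the predicate
no_index = plan("SELECT count(*) FROM short_url WHERE hits = 1")    # does not match it
```

The first plan can use the partial index; the second cannot, because `hits = 1` does not imply `hits = 0`, so the database falls back to scanning the table.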
129 00:27:53.520 --> 00:27:58.490 Haki Benita: I personally found that nullable columns are great candidates
130 00:27:58.780 --> 00:28:09.290 Haki Benita: for partial indexes, because in Postgres, for example, null values are indexed, and usually you don't want to use an index for IS NULL queries. So I found that
131 00:28:09.480 --> 00:28:34.749 Haki Benita: whenever I have a nullable column with an index on it, I can benefit from making it a partial index. In fact, I wrote an entire article on how we saved 20 GB of unused disk space simply by identifying nullable columns with indexes and switching them to use partial indexes. Okay, so questions about partial indexes before we move on to the next use case.
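The nullable-column pattern can be sketched the same way with the stdlib SQLite driver, whose partial indexes behave similarly here; the account table, column, and index names are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, referrer TEXT)")
# A mostly-NULL column: only every 50th row has a value.
con.executemany(
    "INSERT INTO account (referrer) VALUES (?)",
    [(f"ref{i}" if i % 50 == 0 else None,) for i in range(1000)],
)
# Index only the rows where the column is actually set.
con.execute(
    "CREATE INDEX account_referrer_part_ix ON account (referrer) "
    "WHERE referrer IS NOT NULL"
)

# An equality search implies referrer IS NOT NULL, so the partial index is usable.
plan = " ".join(r[3] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM account WHERE referrer = 'ref100'"
))
```

The index holds only the 20 non-NULL rows instead of all 1000, which is the disk-space win described above, while equality lookups still get to use it.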
132 00:28:36.780 --> 00:28:38.540 Haki Benita: Gabor, you have a question.
133 00:28:42.160 --> 00:28:42.730 Haki Benita: No.
134 00:28:42.730 --> 00:28:45.249 Gabor Szabo: Sorry, sorry, actually, there is this question.
135 00:28:46.340 --> 00:29:04.110 Haki Benita: Oh, is it a good idea to recalculate the hits and partial indexes? How frequently? Well, the nice thing about indexes, and B-trees in general, is that they are always in sync with the data in the table; it's actually part of the transaction. So when you, for example, increment,
136 00:29:05.180 --> 00:29:07.990 Haki Benita: when you increment the counter for the 1st time
137 00:29:08.290 --> 00:29:11.070 Haki Benita: the row would just disappear from the index.
138 00:29:11.250 --> 00:29:26.029 Haki Benita: Right? So I'm guessing that you're asking, because you have some experience with like materialized views and stuff like that. So you don't actually have to maintain it actively. It's just maintained by the database.
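That "maintained by the database" behavior can be seen directly with the stdlib SQLite driver; the three-row table and index name are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE short_url (id INTEGER PRIMARY KEY, hits INT)")
con.executemany("INSERT INTO short_url (hits) VALUES (?)", [(0,)] * 3)
con.execute("CREATE INDEX zero_hits_ix ON short_url (id) WHERE hits = 0")

def zero_hit_count():
    return con.execute("SELECT count(*) FROM short_url WHERE hits = 0").fetchone()[0]

before = zero_hit_count()  # all 3 rows start with zero hits
# The first "click": incrementing hits moves the row out of the partial index
# as part of the same statement; nothing needs to be refreshed by hand.
con.execute("UPDATE short_url SET hits = hits + 1 WHERE id = 1")
after = zero_hit_count()
```

Unlike a materialized view, there is no refresh step: the update itself removes the row from the partial index.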
139 00:29:26.460 --> 00:29:32.839 Haki Benita: It's truly an amazing feature. You should definitely use that. Any more questions before we move on to
140 00:29:33.140 --> 00:29:36.009 Haki Benita: a very exotic type of index in postgres.
141 00:29:36.750 --> 00:29:38.110 Haki Benita: Ow.
142 00:29:41.210 --> 00:29:46.360 Haki Benita: okay, great. So let's talk about another type. Another use case.
143 00:29:47.270 --> 00:30:00.790 Haki Benita: So, in the first use case, we wanted to resolve the key to a URL, right? This is the redirect action. This time we want to do a reverse lookup. We want to ask
144 00:30:01.000 --> 00:30:09.090 Haki Benita: how many keys we have pointing to this specific URL. So we want to search for keys by the URL.
145 00:30:09.530 --> 00:30:20.539 Haki Benita: So we implement this very simple function called reverse_lookup. It accepts a URL and returns a queryset of short URLs. Okay?
146 00:30:21.210 --> 00:30:49.150 Haki Benita: So if we want to see what the query looks like, we use .query, and we can see SELECT * FROM short_url WHERE url equals something. Okay, if we produce an execution plan, we can see that the database is doing a sequential scan on the short_url table; that is, scanning the entire table, sifting row by row, finding matches for our query.
147 00:30:49.430 --> 00:30:50.800 Haki Benita: Whoa!
148 00:30:51.590 --> 00:30:55.929 Haki Benita: And we can see that it's relatively
149 00:30:56.140 --> 00:31:00.379 Haki Benita: slow, right? It's like 105
150 00:31:00.500 --> 00:31:03.990 Haki Benita: milliseconds. So compared to the index
151 00:31:04.320 --> 00:31:08.840 Haki Benita: queries that we saw before, that's pretty slow. Right?
152 00:31:09.220 --> 00:31:23.659 Haki Benita: So, you know, once again, 99% of the people would just say, Come on, man, I'm hungry, let's order some food, just slap a B-tree on it. So this is what we do, right? We start by adding a B-tree on the URL,
153 00:31:23.860 --> 00:31:37.679 Haki Benita: right? We generate and apply the migration. Now we execute the exact same query again, and we can see that now Postgres is using the index that we just created. We can see an index scan using
154 00:31:38.030 --> 00:31:57.059 Haki Benita: the index on the URL column, and also it's very fast. Previously it was like 100 ms; now it's 0.1 ms. So that's a very, very big and significant improvement. We can all go to lunch and be very, very happy and satisfied with ourselves. But
155 00:31:57.770 --> 00:32:09.459 Haki Benita: are we done? Do you think that we are done? Is there anything that we can optimize? Now, if you are paying attention throughout this presentation. You know that we can definitely do better than that.
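The before-and-after from this use case can be reproduced with the stdlib SQLite driver; the names and data are invented, and the plan strings are SQLite's, not Postgres's:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT, url TEXT)")
con.executemany(
    "INSERT INTO short_url (key, url) VALUES (?, ?)",
    [(f"k{i}", f"https://example.com/page/{i}") for i in range(1000)],
)

def plan(sql):
    # Flatten EXPLAIN QUERY PLAN output into one string for easy inspection.
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM short_url WHERE url = 'https://example.com/page/7'"
before = plan(query)  # no index yet: a full sequential scan
con.execute("CREATE INDEX short_url_url_ix ON short_url (url)")
after = plan(query)   # the same query now searches the index
```

The first plan is a scan of the whole table; after creating the index, the same query turns into an index search.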
156 00:32:09.830 --> 00:32:16.550 Haki Benita: Let's go to the database and check the size of the index. Okay? So the size of the index.
157 00:32:16.740 --> 00:32:22.669 Haki Benita: Okay, stay with me. 47 MB. If you remember the previous
158 00:32:23.050 --> 00:32:28.779 Haki Benita: use case, we had an index on all the hits. It was 7 MB. I told you it was large.
159 00:32:28.950 --> 00:32:44.159 Haki Benita: This index, on the same amount of rows, is 47 MB. That's very, very big, and the reason that it's very, very big is that the URL is very, very big, right? The B-tree index
160 00:32:44.390 --> 00:32:49.879 Haki Benita: holds the actual values in the leaf blocks. So if we are indexing
161 00:32:50.020 --> 00:32:58.219 Haki Benita: a column with very large values, like URLs,
162 00:32:58.430 --> 00:33:03.490 Haki Benita: these values are also present in the index,
163 00:33:04.000 --> 00:33:14.130 Haki Benita: and the index can get very, very big. So previously, when we were indexing integers, it was 7 MB. Now we're indexing large pieces of text, URLs,
164 00:33:14.410 --> 00:33:18.940 Haki Benita: and that's 47 MB. Okay, so
165 00:33:19.430 --> 00:33:28.389 Haki Benita: let's pause for a second. Okay, I know that a B-tree is like magic for 90% of the use cases, but there are other types of indexes that we can use.
166 00:33:28.955 --> 00:33:32.949 Haki Benita: So let's pause for a second and ask ourselves, what do we know about.
167 00:33:33.210 --> 00:33:48.990 Haki Benita: what do we know about the URL? Okay? So 1st of all, we know that URL is not unique. Right? We can have multiple keys pointing to the same URL. We can have, for example, different campaigns with different short Urls
168 00:33:49.100 --> 00:33:55.800 Haki Benita: pointing to the same URL. There's no restriction in the system. You can have many keys pointing to the same URL. So it's not unique.
169 00:33:55.930 --> 00:33:57.940 Haki Benita: However, however.
170 00:33:59.780 --> 00:34:06.770 Haki Benita: if we actually look at the data, we see that we don't have a lot of duplicate long Urls right
171 00:34:06.970 --> 00:34:07.889 Haki Benita: like.
172 00:34:09.444 --> 00:34:18.389 Haki Benita: It's not likely that people will use the shortener a lot to point to the same URL; at the very least,
173 00:34:18.650 --> 00:34:22.639 Haki Benita: they would have different UTM parameters for the same URL.
174 00:34:22.780 --> 00:34:33.040 Haki Benita: So while it's not a restriction (you can have many keys pointing to the same URL), it's not likely, so we don't have a lot of duplicate values.
175 00:34:34.199 --> 00:34:36.219 Haki Benita: So now I want to introduce you
176 00:34:36.710 --> 00:35:00.369 Haki Benita: to what I call the Ugly Duckling of index types in Postgres: the hash index. Okay? And to understand how a hash index works and why it's different from a B-tree index, let's start by actually building a hash index ourselves. So imagine we have these values, A, B, C and D, and we want to index them using a hash index.
177 00:35:00.730 --> 00:35:20.800 Haki Benita: So we start by applying a hash function on each value. Postgres, in our example, has different hash functions for different types. So you can see that we have hashes for text, char, arrays, even JSON types, timestamps, and so on.
178 00:35:20.930 --> 00:35:34.680 Haki Benita: In our case we have just one character. So it uses hashchar. If we actually apply this function on the values we get the hash values. The next step is we want to divide these
179 00:35:34.870 --> 00:35:36.829 Haki Benita: values into buckets.
180 00:35:37.030 --> 00:35:43.100 Haki Benita: So we start by dividing them into 2 buckets. Basically, we apply modulo 2 on
181 00:35:44.050 --> 00:36:04.600 Haki Benita: the hash value, and then we assign each value to a bucket. So we can see that A goes to bucket 1 and B, C and D go to bucket 0. So this is our hash index. Okay, so we have 3 hash values in bucket 0; each hash value points to
182 00:36:04.860 --> 00:36:10.809 Haki Benita: somewhere in the table. Okay, just like we had the TIDs in the B-tree, we have
183 00:36:10.980 --> 00:36:32.230 Haki Benita: the TIDs right here in the hash index. Now, if we want to use this hash index to find some value, we do the exact same thing, but the other way around, right? So if you want to search for the value B, for example, we apply a hash function on it. We get the hash value. We apply modulo the number of buckets to get the
184 00:36:32.360 --> 00:36:54.430 Haki Benita: bucket, in this case 0, and then we go to bucket 0 and we start scanning the pointers to find a matching hash. Once we find a matching hash, we can take this TID, which is a pointer to a place in the table, and we can go to this row and check for matching rows. Okay, so this is how a hash index works in Postgres.
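A toy version of this mechanism in plain Python, with an illustrative hash function and a fixed two-bucket layout (Postgres's real hash functions, bucket management, and TIDs differ):

```python
import zlib

def h(value: str) -> int:
    # Stand-in for Postgres's per-type hash functions (hashchar, hashtext, ...).
    return zlib.crc32(value.encode())

N_BUCKETS = 2

# The "table": row position -> value; positions play the role of TIDs.
table = {0: "a", 1: "b", 2: "c", 3: "d"}

# Build the index: bucket number -> list of (hash value, TID) pairs.
buckets = {b: [] for b in range(N_BUCKETS)}
for tid, value in table.items():
    hv = h(value)
    buckets[hv % N_BUCKETS].append((hv, tid))

def lookup(value: str) -> list:
    # Hash the search value, take modulo to pick the bucket, scan the bucket
    # for a matching hash, then re-check the table row itself, since two
    # different values can share a hash.
    target = h(value)
    return [tid for hv, tid in buckets[target % N_BUCKETS]
            if hv == target and table[tid] == value]
```

`lookup("b")` walks exactly the path described above: hash, modulo, bucket scan, then a visit to the table row to confirm the match.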
185 00:36:55.190 --> 00:37:14.639 Haki Benita: Now, if we want to create a hash index in Django, we need to use the special HashIndex from django.contrib.postgres. Okay? The reason for that is that hash is not the default index type in Postgres, so we need to explicitly say we want a hash index. Okay,
186 00:37:15.260 --> 00:37:19.239 Haki Benita: so in this case we are creating a hash index on the URL field.
187 00:37:19.770 --> 00:37:46.360 Haki Benita: and the name of this index is going to be short_url_hix. I like to use a suffix that indicates the type of the index, so when I look at execution plans I can quickly identify the type of the index. So I usually use _ix for B-tree indexes, _part_ix for partial indexes, _hix for hash indexes, and so on. You can come up with whatever convention you want.
188 00:37:47.920 --> 00:37:48.900 Haki Benita: So
189 00:37:49.530 --> 00:38:00.809 Haki Benita: we generate the migration, we apply the migration and produce an execution plan. And we can see that Postgres is using our hash index. Okay? Now.
190 00:38:00.940 --> 00:38:01.990 Haki Benita: okay.
191 00:38:02.180 --> 00:38:18.460 Haki Benita: First observation: this is very, very fast. Okay, you can see that it's 0.07 ms. That's very, very fast. But that's not all. If we look at the size of our hash index compared
192 00:38:18.730 --> 00:38:34.859 Haki Benita: to the B-tree index, we can see that the hash index is 30% smaller. Okay, trust me, I took a calculator, an old Casio, and I calculated the difference. It's 30% smaller. Okay, that's very, very significant. Okay,
193 00:38:35.340 --> 00:38:37.929 Haki Benita: if we put all the data in a table.
194 00:38:38.180 --> 00:38:46.570 Haki Benita: you can see that the hash index, in this case, was both faster and smaller.
195 00:38:46.860 --> 00:38:47.990 Haki Benita: So that's
196 00:38:48.170 --> 00:39:06.030 Haki Benita: a win-win all around. Okay, faster and smaller than the default B-tree index. Now, I did a little experiment. Okay. So what I did was, I created a hash index and a B-tree index on the key and on the URL. Okay, you can see the chart right here.
197 00:39:06.490 --> 00:39:35.660 Haki Benita: I have a hash index on the key, I have a hash index on the URL, I have a B-tree on the key, and I have a B-tree on the URL. And what I did is I started adding rows to the table. Okay, you can see at the bottom the bottom axis; that's the number of rows. So I started adding rows into the table until I got to a million rows. Now, every time I added rows to the table I took a snapshot of the sizes of all the indexes, and then I put
198 00:39:35.740 --> 00:39:39.649 Haki Benita: all the data in this chart, and we can see some
199 00:39:39.740 --> 00:39:43.597 Haki Benita: interesting things. Okay. 1st of all.
200 00:39:44.510 --> 00:39:46.580 Haki Benita: 1st of all, if you look at the.
201 00:39:47.000 --> 00:39:49.219 Haki Benita: If you look at the red line.
202 00:39:49.470 --> 00:39:52.999 Haki Benita: which is the B-tree on the URL, a big piece of text,
203 00:39:53.690 --> 00:40:18.479 Haki Benita: and the green line, which is the B-tree on the key, the short piece of text. First of all, you can see that both of them grow basically linearly as I add more rows to the table, right? So we can see this linear line increasing, right? As I add more rows, the size of the index increases. We can also see that the red line, the B-tree on the URL, is always larger
204 00:40:18.850 --> 00:40:21.239 Haki Benita: than the B-tree on the key, right?
205 00:40:21.780 --> 00:40:30.559 Haki Benita: So the reason for that is that the URL is a big piece of text, and the key is a short piece of text. This tells us
206 00:40:30.890 --> 00:40:33.730 Haki Benita: that the size of the B-tree is
207 00:40:33.840 --> 00:40:36.900 Haki Benita: very much affected by the size
208 00:40:37.250 --> 00:40:40.240 Haki Benita: of the column that it indexes.
209 00:40:40.380 --> 00:40:49.959 Haki Benita: So a B-tree on URL will be bigger than a B-tree on key for the same amount of rows, because a URL is bigger than a key.
210 00:40:50.270 --> 00:40:56.780 Haki Benita: So that's about the B-tree indexes. However, if we look at the hash indexes, that's the blue
211 00:40:57.900 --> 00:40:59.700 Haki Benita: and the yellow lines.
212 00:41:00.190 --> 00:41:02.260 Haki Benita: 1st of all, we can see that
213 00:41:03.480 --> 00:41:10.410 Haki Benita: the size of the hash index, as I add more rows, is not affected by the size of the value,
214 00:41:10.540 --> 00:41:18.259 Haki Benita: because URL is big and key is small, but as I add more rows to the table, the size of the hash index is the same for both. Okay.
215 00:41:18.400 --> 00:41:27.409 Haki Benita: The second thing that I can see is that in this specific case the hash index was consistently smaller
216 00:41:27.690 --> 00:41:35.050 Haki Benita: than the B-tree index on the same column. Okay. So in this case the hash index was always smaller.
217 00:41:35.520 --> 00:41:40.680 Haki Benita: Another thing that we can see in this chart is that, unlike the B-tree index that grows linearly,
218 00:41:41.050 --> 00:41:48.299 Haki Benita: the hash index grows in like steps. Right? You can see the step, and then it's flat. Step flat.
219 00:41:48.700 --> 00:42:09.099 Haki Benita: So what's happening in a hash index is that we start adding rows to the hash index, and we have some bucket, and this bucket starts to fill up. Now, when a bucket fills up, Postgres needs to split this bucket. Now, when the bucket is split, Postgres pre-allocates
220 00:42:09.580 --> 00:42:12.570 Haki Benita: storage disk space for this bucket.
221 00:42:12.700 --> 00:42:16.419 Haki Benita: So the steps that you see are the bucket splits,
222 00:42:16.540 --> 00:42:21.430 Haki Benita: where Postgres allocates additional storage to split the bucket.
223 00:42:21.770 --> 00:42:22.630 Haki Benita: Right?
224 00:42:22.970 --> 00:42:25.229 Haki Benita: So this is why hash index
225 00:42:25.420 --> 00:42:28.239 Haki Benita: grows in steps.
226 00:42:29.060 --> 00:42:35.259 Haki Benita: So a hash index is ideal when we have very few duplicates
227 00:42:35.470 --> 00:42:59.300 Haki Benita: in the rows that we want to index, and the reason for that is, if we have lots of duplicates, the values would map to the same bucket, and we won't get the benefit of a hash index. The reason that a hash index made sense in our case is that URL is mostly unique. It's almost unique. Okay, it's not unique by definition. But there's not a lot of duplicates.
228 00:42:59.680 --> 00:43:18.200 Haki Benita: We also saw that, unlike a B-tree index, a hash index is not affected by the size of the values that it indexes, and the reason for that is that the hash index doesn't actually include the values; it includes hash values. Okay, this is why I can index very, very big values, big strings,
229 00:43:18.540 --> 00:43:40.110 Haki Benita: with a relatively small index. Okay, as we saw, a hash index, under some circumstances, can be both smaller and faster than a B-tree index. And the reason that a lot of people are unfamiliar with the hash index is that prior to Postgres 10, which is already pretty old because we're now at Postgres 17,
230 00:43:40.580 --> 00:44:04.829 Haki Benita: if you went to the documentation for hash indexes, there would be this huge warning saying: Beware, do not use hash indexes, they are not production ready. So a lot of developers became used to not using hash indexes. But starting in Postgres 10, you can definitely use hash indexes in production. They are production ready, and as we saw, they can be very, very good under some circumstances.
231 00:44:06.160 --> 00:44:12.890 Haki Benita: While we're talking about hash indexes, it is very important to also know the restrictions of hash indexes. First of all, a hash index
232 00:44:14.290 --> 00:44:32.920 Haki Benita: cannot be used to enforce uniqueness. You cannot create a unique hash index, and the reason is that a hash index does not contain the actual values, just hash values, and technically you can have multiple different values producing the exact same hash value.
233 00:44:33.090 --> 00:44:43.399 Haki Benita: So you cannot create a unique hash index. However (okay, and that's the comment at the bottom, we can talk about it later if you want), you can enforce uniqueness
234 00:44:43.680 --> 00:44:47.209 Haki Benita: with a hash index using an exclusion constraint.
235 00:44:47.440 --> 00:44:56.589 Haki Benita: Okay, next, we can't have a composite hash index. We can't have a hash index on multiple columns. Okay?
236 00:44:57.410 --> 00:45:02.989 Haki Benita: And we cannot use a hash index for sorting and range searches, because, once again,
237 00:45:03.280 --> 00:45:10.940 Haki Benita: a hash index does not contain the actual values, just the hash values, right? So I can't use a hash index for things like,
238 00:45:11.390 --> 00:45:17.379 Haki Benita: you know, BETWEEN, greater than, less than, and so on. Just equality.
239 00:45:18.540 --> 00:45:24.421 Haki Benita: So, quick recap. Just 4 more slides, I promise. Okay,
240 00:45:26.090 --> 00:45:34.610 Haki Benita: when to use indexes. So remember, indexes can make queries faster. We saw that in all of our examples.
241 00:45:34.650 --> 00:45:56.340 Haki Benita: using an index made the query faster. However, they're not free; they come at a cost. You need to maintain this index, and this index maintenance happens when you insert, when you update, and when you delete. So the more indexes you create, the faster your queries are, but the slower every other operation is.
242 00:45:56.500 --> 00:46:18.380 Haki Benita: Okay. Another thing to consider, and this is often overlooked: indexes can be very, very big. They consume a lot of disk space. When you go back to your databases after this talk, please run \di+ and look at the sizes of your indexes. I think that if you never looked at the size of your indexes,
243 00:46:18.620 --> 00:46:23.349 Haki Benita: You're going to be very much surprised at what you're going to find.
244 00:46:24.180 --> 00:46:41.909 Haki Benita: And finally, using an index is not always best. If you have a query that needs to access a large portion of the table, sometimes it doesn't make sense to use an index for that. Okay, there's no magic number, but, you know,
245 00:46:42.190 --> 00:46:43.480 Haki Benita: keep that in mind.
246 00:46:44.710 --> 00:46:55.220 Haki Benita: So we talked about index types and features. We talked about partial indexes, inclusive B-tree indexes, and we talked about hash indexes.
247 00:46:55.420 --> 00:47:07.439 Haki Benita: We talked a little bit about how to evaluate performance. I don't know if you noticed, but throughout this presentation we went through the same process over and over again. We start by
248 00:47:07.600 --> 00:47:25.639 Haki Benita: executing some query with EXPLAIN ANALYZE to get the timing with no indexes. This is basically establishing a baseline, right? And then we start experimenting with different types of indexes. So usually we start with a B-tree. We take a measure of the time using EXPLAIN ANALYZE,
249 00:47:25.640 --> 00:47:40.620 Haki Benita: and then we take the size of the index. We put it all in a nice table. We start experimenting. And once you have all the data organized like that. It's a lot easier to reach a decision on what is the best indexing approach
250 00:47:40.630 --> 00:47:42.499 Haki Benita: for your specific use case.
251 00:47:42.560 --> 00:47:53.119 Haki Benita: And also, hopefully, you remember that index performance is not just about speed. As we saw, we can get significant
252 00:47:53.660 --> 00:47:57.540 Haki Benita: disk space reductions with a very, very,
253 00:47:57.600 --> 00:48:09.329 Haki Benita: with a very small price in speed; sometimes it makes sense to make this compromise. We also, throughout this talk, saw how to use EXPLAIN,
254 00:48:09.360 --> 00:48:31.259 Haki Benita: how to use EXPLAIN ANALYZE, how to debug SQL in Django, and we also saw a lot of execution plans. I don't know if you noticed, but if you've never seen execution plans before, hopefully, when you go back to your system, you'll start doing EXPLAIN ANALYZE on some of the queries you run a lot. You get to actually understand what the database is doing. Now,
255 00:48:31.560 --> 00:48:45.659 Haki Benita: in this talk I talked only about inclusive indexes, partial indexes, and hash indexes, but, in fact, there are many, many different other types of indexes that are exotic and very, very cool. We have
256 00:48:46.330 --> 00:48:56.900 Haki Benita: BRIN indexes, we have function-based indexes, and we have a lot of different flavors of things that we can do. And you can check out this
257 00:48:57.300 --> 00:49:04.960 Haki Benita: class, 3 hours packed with SQL magic, for your benefit. And
258 00:49:05.810 --> 00:49:13.720 Haki Benita: finally check me out in all of these places, and I'm happy to take questions or discuss whatever you want.
259 00:49:19.490 --> 00:49:22.113 Gabor Szabo: Whoa, thank you.
260 00:49:23.750 --> 00:49:26.585 Gabor Szabo: Because, yeah.
261 00:49:27.400 --> 00:49:28.630 Haki Benita: Hectic.
262 00:49:30.335 --> 00:49:35.410 Gabor Szabo: Yeah, this is not a question. Haki's article on hash indexes is truly excellent.
263 00:49:35.520 --> 00:49:42.589 Gabor Szabo: I believe it remains one of the top search results for anyone looking for resources on hash indexes.
264 00:49:42.760 --> 00:49:47.639 Haki Benita: It's true, it's true. This is one of the top searches for hash index in postgres.
265 00:49:47.910 --> 00:49:48.340 Gabor Szabo: Yeah.
266 00:49:48.340 --> 00:49:53.060 Haki Benita: Yeah, I managed to catch this trend very, very early on.
267 00:49:54.515 --> 00:49:55.540 Gabor Szabo: Okay.
268 00:49:55.790 --> 00:49:56.270 Haki Benita: Mm.
269 00:49:56.270 --> 00:50:01.189 Gabor Szabo: Comments, questions before we close this session?
270 00:50:02.340 --> 00:50:05.000 Gabor Szabo: We know where to find you.
271 00:50:05.160 --> 00:50:07.829 Gabor Szabo: We'll have the link.
272 00:50:08.320 --> 00:50:16.320 Gabor Szabo: You can add the links to the post of the video as well, so people can find it easily,
273 00:50:17.100 --> 00:50:19.660 Gabor Szabo: and any comments.
274 00:50:19.660 --> 00:50:20.020 Haki Benita: Okay.
275 00:50:20.020 --> 00:50:21.859 Gabor Szabo: Questions, apparently not.
276 00:50:21.860 --> 00:50:24.780 Haki Benita: Yeah, I want to thank you, Gabor, for hosting this meeting.
277 00:50:24.780 --> 00:50:25.650 Gabor Szabo: It was excellent.
278 00:50:26.146 --> 00:50:27.139 Haki Benita: Meet up!
279 00:50:27.140 --> 00:50:32.660 Gabor Szabo: Yeah. Well, yeah. So thank you very much for this presentation.
280 00:50:32.770 --> 00:50:41.470 Gabor Szabo: If anyone has questions, then we'll see how to find Haki later on, on this slide, and then we'll put it under the video.
281 00:50:42.020 --> 00:50:52.750 Gabor Szabo: Thank you for supporting us. Thank you for being here. Thank you very much to you for giving the presentation. Please like the video and follow the channel. Yeah.
282 00:50:53.020 --> 00:51:10.139 Gabor Szabo: And if you would like to give a presentation, you're welcome to contact me as well, and we'll see how we can schedule a presentation, at what time, and so on. So thank you very much, and
283 00:51:10.430 --> 00:51:15.029 Gabor Szabo: see you at the next meeting next video, whatever.
284 00:51:15.400 --> 00:51:16.869 Gabor Szabo: Thank you. Bye, bye.
285 00:51:16.870 --> 00:51:18.830 Haki Benita: Thank you very much. Everyone. Good night.
Indexes are extremely powerful, and ORMs like Django and SQLAlchemy provide many ways of harnessing their powers to make queries faster and the database more efficient. In this talk I reveal the secrets of DBAs with some advanced indexing techniques such as partial, function-based and inclusive B-tree indexes, and who knows, maybe even some index types you've never heard of before!

1 00:00:00.720 --> 00:00:02.690 Haki Benita: This meeting is being recorded.
2 00:00:03.400 --> 00:00:04.320 Gabor Szabo: Okay.
3 00:00:05.800 --> 00:00:12.250 Gabor Szabo: yeah. So hi, and welcome to the Python Maven, let's call it Python Maven. This is the Code Maven
4 00:00:12.500 --> 00:00:41.910 Gabor Szabo: YouTube channel, and we are organizing these meetings in the Code Maven events group, but it has 3 separate sessions, and this is going to be the Python-specific one. My name is Gabor Szabo. I usually teach Python and Rust and help companies introduce testing, and I also like to organize these events and allow people to share their knowledge with each other.
5 00:00:42.270 --> 00:00:46.010 Gabor Szabo: You're welcome. I'm really happy that you're here
6 00:00:46.140 --> 00:01:04.909 Gabor Szabo: in this session, listening. As I mentioned earlier, you're welcome to comment or use the chat and ask questions. And if you're just watching the video recorded on YouTube, then please remember to like the video and follow the channel,
7 00:01:05.080 --> 00:01:11.990 Gabor Szabo: and let's welcome Haki now, and let him introduce himself
8 00:01:12.700 --> 00:01:17.579 Gabor Szabo: and give the presentation. So thank you for accepting the invitation.
9 00:01:18.970 --> 00:01:31.149 Haki Benita: Thank you. Thank you, Gabor. First of all, I like the fact that we have this intimate group that we can freely talk in. I actually encourage you to consider opening the mics,
10 00:01:31.210 --> 00:02:01.090 Haki Benita: because I think we can actually have a conversation throughout the presentation. I like to give interactive presentations. Your call, you're the boss. And just a quick introduction about the subject and about myself: we are going to talk about how to make your back end roar. And I want to start by apologizing for the tacky headline, but unfortunately these types of tacky headlines do work, believe it or not.
11 00:02:01.610 --> 00:02:09.010 Haki Benita: So, my name is Haki Benita. I'm a software developer and a technical lead. I'm currently leading a team
12 00:02:09.289 --> 00:02:18.949 Haki Benita: of developers working on a very large ticketing platform in Israel, serving about one and a half,
13 00:02:19.580 --> 00:02:32.470 Haki Benita: 1.5 million unique paying users every month. And I also like to write and talk about python performance and databases. And you can find my stuff on my website.
14 00:02:33.110 --> 00:02:47.839 Haki Benita: So today, we are going to talk about some lesser known features of indexes. And we're going to try and understand how they work and when we can and should use them
15 00:02:47.850 --> 00:03:14.629 Haki Benita: to do that, we are going to build a URL shortener together, and we're going to do it in Django. I would say that since this is a talk about python, I'm going to use Django and the Django Orm. But the concepts that I'm going to describe are not specific to Django, and they're not specific to Postgres. Heck. They're not even specific to python. But this is a good environment to explain the concepts with.
16 00:03:15.390 --> 00:03:19.889 Haki Benita: So what is a URL shortener? You probably know about
17 00:03:19.900 --> 00:03:39.330 Haki Benita: other types of URL shorteners: you have Bitly, you have the late goo.gl, buff.ly, and so on. Basically, a URL shortener is a system that provides a short URL that redirects to a longer URL. Now, why would you want to do that?
18 00:03:39.330 --> 00:04:02.240 Haki Benita: First, if you are operating in text-constrained environments, like SMS messages or tweets, you might want to share a very long link, so you want to make it shorter so it consumes less space. This is where short URLs can be handy. Another nice feature of URL shortening is that whenever someone clicks the short URL,
19 00:04:02.240 --> 00:04:16.500 Haki Benita: the URL shortener redirects to the long URL and keeps track of how many people clicked that link. So if you have something like a campaign that you want to launch, and you want to keep track of how many people clicked your link,
20 00:04:16.820 --> 00:04:20.149 Haki Benita: This is what you would use a URL shortener for
21 00:04:20.310 --> 00:04:48.240 Haki Benita: so to build our URL shortener in Django, we're going to start with this very, very simple model. We are calling the model ShortUrl. We have an id column, which is the primary key; it's just an auto-incrementing integer field. We have the key, a unique short piece of text that uniquely identifies our short URL. This is the short key at the end of the short URL.
22 00:04:48.500 --> 00:05:07.030 Haki Benita: We then have the url, which is the long URL we want to redirect to. We also want to keep track of when the URL was created; we do that using the created_at column. And finally, we want to keep track of how many users click the link, and we do that with the hits column
23 00:05:07.180 --> 00:05:08.110 Haki Benita: at the bottom.
24 00:05:08.960 --> 00:05:19.650 Haki Benita: So for our demonstration, so we actually have something to work with, I loaded 1 million short URLs into the table. Okay, now, this is not a lot, but we are going to see some
25 00:05:20.700 --> 00:05:25.929 Haki Benita: performance gains with just 1 million rows. Okay.
26 00:05:26.810 --> 00:05:33.380 Haki Benita: so this talk is about python. But it's essentially about SQL, so
27 00:05:33.510 --> 00:05:54.859 Haki Benita: in Django, if you want to get the SQL generated by Django for a given queryset, you can do that by accessing queryset.query and printing it. In this case I'm doing a ShortUrl filter on a specific key, .query, and I can actually get Django to print
28 00:05:55.190 --> 00:05:59.549 Haki Benita: the SQL that it generated for this queryset, right?
29 00:06:00.040 --> 00:06:26.740 Haki Benita: So, after viewing the queryset, it's also very interesting to see how the database is planning to execute my query. Right? I can do that by calling the explain() function. This translates into an EXPLAIN command in SQL, and what I get in return is not the result of the query, but the execution plan, which is how the database is planning
30 00:06:26.930 --> 00:06:30.979 Haki Benita: to execute my query. Now, when we just use EXPLAIN,
31 00:06:31.200 --> 00:06:36.260 Haki Benita: the database doesn't actually execute the query. It just produces a plan
32 00:06:36.370 --> 00:06:53.839 Haki Benita: sometimes, especially when we're benchmarking and we're trying to improve performance, it can be useful to produce the execution plan, but also have the database execute the query and return some useful execution data. For that we can use a slightly different variation of the EXPLAIN command,
33 00:06:53.970 --> 00:07:13.319 Haki Benita: which is EXPLAIN ANALYZE. In Django you can do that by using explain(analyze=True). In SQL, in Postgres specifically, you can do EXPLAIN (ANALYZE ON, TIMING ON) in parentheses, followed by the query, and then you get some additional information about the execution plan.
34 00:07:13.350 --> 00:07:27.339 Haki Benita: First, because the database actually executed the query, you can see at the bottom how long it took the database to produce an execution plan. In this case that would be 0.140 ms,
35 00:07:27.710 --> 00:07:38.510 Haki Benita: and I also get how long it took the database to execute the query from start to end. In this case that would be 0.046 ms. Okay.
36 00:07:39.430 --> 00:07:47.120 Haki Benita: Now, in addition to the timing. I'm also getting a very, very interesting piece of information inside the execution plan.
37 00:07:47.260 --> 00:07:53.699 Haki Benita: Okay, what I get is the estimated cost and the actual cost
38 00:07:53.820 --> 00:07:58.059 Haki Benita: that the database encountered while executing the query. So
39 00:07:59.010 --> 00:08:15.400 Haki Benita: discussing the cost-based optimizer is slightly outside the scope of this talk, I would just say that, comparing the expected cost to the actual cost is a very useful measure to try and identify bad execution plans.
40 00:08:16.100 --> 00:08:17.350 Haki Benita: Finally.
41 00:08:17.990 --> 00:08:28.419 Haki Benita: another way of viewing queries is to turn on the logger for the database backend in Django. This way, whenever Django executes a query,
42 00:08:29.040 --> 00:08:32.620 Haki Benita: it logs the SQL that was produced by the ORM.
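A minimal sketch of that logging setup, in the dictConfig style of Django's LOGGING setting. The logger name django.db.backends is Django's real SQL logger; note Django only records SQL on it when DEBUG is True:

```python
# Sketch of turning on SQL logging for Django's database backend.
# With this in settings.py, every statement Django executes is logged
# together with its duration. DEBUG must be True for Django to record
# the SQL on this logger.
import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        # Django's database backend logs every executed statement here.
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}

logging.config.dictConfig(LOGGING)
```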
43 00:08:33.510 --> 00:08:34.475 Haki Benita: So
44 00:08:35.700 --> 00:09:05.329 Haki Benita: to actually start discussing some indexing techniques, we need to start implementing some, you know, business processes. So let's start with the most basic thing that a URL shortener actually does, and that's looking up the URL to redirect to by a key. A user uses one of our short URLs, we get the unique key, and we need to find the long URL to redirect to. Okay, this is like the bread and butter of this system.
45 00:09:05.440 --> 00:09:27.109 Haki Benita: So if we want to implement this very, very simple function, we can do something like that: def resolve, that's the name of the function; we want to resolve a key to a URL. We accept a key, and then we execute this simple query to just get the ShortUrl for this key. If we don't find anything we return None, otherwise we return the URL to redirect to,
46 00:09:27.110 --> 00:09:37.730 Haki Benita: okay. Now we want to look at the SQL. That Django generated for this function. Right? So we execute this function on some random key
47 00:09:37.950 --> 00:09:57.950 Haki Benita: with SQL logging turned on, and we can see the query right here. Now, if you look at this query, it looks like Django basically fetched everything from the short_url table for the key that we asked for, right? SELECT * FROM short_url WHERE key = something.
48 00:09:58.270 --> 00:10:05.050 Haki Benita: If we want to look at how postgres is actually executing this query.
49 00:10:05.210 --> 00:10:12.719 Haki Benita: we can use the explain command. And what we get is that Postgres is planning to use an index scan
50 00:10:13.535 --> 00:10:20.159 Haki Benita: on the index we have on the key column. Okay, now.
51 00:10:21.180 --> 00:10:28.839 Haki Benita: to understand what an index scan actually means, let's take a second to talk about the B-tree index.
52 00:10:29.040 --> 00:10:42.120 Haki Benita: So the B-tree index is like the king of all indexes. This is the default index in most database engines. If you're not sure what type of index you're using, you're probably using a B-tree index. Okay?
53 00:10:42.560 --> 00:11:11.160 Haki Benita: So to understand how a B-tree index works, let's start by building one. So imagine you have these values, 1 through 9, and you want to create a B-tree index on them. You start by sorting the values and storing them in leaf blocks. You can see the leaf blocks at the bottom. They are sorted from left to right. We have 1, 2, 3, all the way through 9. Now every entry in the leaf blocks contains a list of TIDs. These are pointers to rows in the table
54 00:11:11.400 --> 00:11:15.460 Haki Benita: That store rows with these values. Okay.
55 00:11:16.290 --> 00:11:28.179 Haki Benita: now, above the leaves, we have branch blocks and a root block that act as a directory to these leaf blocks. So let's see how this works. Let's imagine that we want to look.
56 00:11:28.180 --> 00:11:38.290 Gabor Szabo: Sorry, someone says that they don't see the slides. So I just wanted to check, and I'm unsure if the other people do see the slides. So if
57 00:11:38.670 --> 00:11:53.529 Gabor Szabo: I asked it in the chat, but no one answered, so I hope that other people... Okay, so some other people see it. So my recommendation to Eduardo is to maybe exit Zoom and enter Zoom again. Sorry for that.
58 00:11:53.530 --> 00:11:54.940 Haki Benita: Okay, no problem.
59 00:11:55.120 --> 00:11:56.160 Haki Benita: Yeah.
60 00:11:56.400 --> 00:11:59.700 Haki Benita: Okay, okay, so let's
61 00:12:01.690 --> 00:12:31.100 Haki Benita: okay. So let's try to search for the value 5 in the B-tree index that we just built. So we start with the root block and we start scanning from left to right. 5 is larger than 3, so we skip the first entry. 5 is between 3 and 7, so we follow this pointer to the middle leaf block. We then start scanning the leaf block from left to right. The first value is 4; it's not a match.
62 00:12:31.100 --> 00:12:36.150 Haki Benita: The next value is 5. That's a match, and now we can
63 00:12:36.150 --> 00:12:47.970 Haki Benita: follow the pointers from this leaf block to the rows in the table. We can read the rows and do whatever we need to do with these rows. Okay.
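The lookup just walked through can be illustrated with a toy sketch in plain Python. This only models the idea (a root directory narrowing the search to one sorted leaf block, whose entries point at table rows); it is not how Postgres actually implements B-trees:

```python
# Toy illustration of a B-tree lookup: a root "directory" narrows the
# search to one sorted leaf block, and each leaf entry carries pointers
# (TIDs) to matching rows in the table.
import bisect

# Leaf blocks: sorted values, each mapped to a list of row pointers.
leaves = [
    [(1, ["tid-1"]), (2, ["tid-2"]), (3, ["tid-3"])],
    [(4, ["tid-4"]), (5, ["tid-5"]), (6, ["tid-6"])],
    [(7, ["tid-7"]), (8, ["tid-8"]), (9, ["tid-9"])],
]

# Root block: the smallest value in each leaf acts as a separator key.
root = [leaf[0][0] for leaf in leaves]  # [1, 4, 7]


def search(value):
    """Return the row pointers for `value`, or [] if it is not indexed."""
    # First read: consult the root to pick the leaf that may hold the value.
    leaf = leaves[bisect.bisect_right(root, value) - 1]
    # Second read: scan the leaf block for an exact match.
    for key, tids in leaf:
        if key == value:
            return tids  # follow these pointers into the table (third read)
    return []
```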
64 00:12:48.310 --> 00:13:15.100 Haki Benita: let's go back to our query. Okay, one second, yeah. Let's go back to our query. Remember that we said that Django generated this query and this query is fetching everything right, basically select star from short URL. But, in fact, if you think about it, we don't actually care about all these fields right? We only care about the URL. I mean, we're not looking to resolve
65 00:13:16.290 --> 00:13:27.129 Haki Benita: a key to a URL for the purpose of redirecting. I don't care when it was created. I don't care about the id. I already have the key, right? And I don't care about the hits counter at this point
66 00:13:27.610 --> 00:13:30.209 Haki Benita: right? So I don't care about all these fields. So
67 00:13:30.770 --> 00:13:55.089 Haki Benita: one thing that we can do, instead of fetching all of these fields, how about we just fetch what we actually need? In Django, we can do that by adding values_list("url"). Now the function is slightly different, but if we look at the SQL generated by this function, we can see that now, instead of fetching all the columns in the row, we just fetch the url. So this is exactly what we need.
68 00:13:55.200 --> 00:14:10.249 Haki Benita: If we look at the execution plan once again for this query, we can see that again Postgres is using an index scan on the unique index that we have on the key. Right? So now,
69 00:14:10.920 --> 00:14:30.719 Haki Benita: once we found a matching row, we can follow the pointer to the table and get the URL from the table. So if you imagine the amount of disk reads I need to do to satisfy this query: I'm starting by reading the root block, right? So that's 1 read. Then I need to follow the branch all the way to the leaf. Let's say that we have just
70 00:14:30.730 --> 00:14:41.789 Haki Benita: you know, root block, and then directly to the leaf. So reading the leaf is another read, and then we need to follow the link from the leaf block to read the row from the table. So this is a unique
71 00:14:41.970 --> 00:14:52.020 Haki Benita: column. So we have at most one row. So that's another read. So basically, we did 3 random reads to satisfy this query right now.
72 00:14:53.290 --> 00:15:03.019 Haki Benita: this query is executed a lot. This is basically what our system is doing right. It's getting keys and resolving them to Urls to redirect right
73 00:15:03.360 --> 00:15:17.979 Haki Benita: now. We already established that all we care about in this specific scenario is just the URL. I don't care about anything else. I care just about the URL. So what if? And stay with me? This is mind blowing.
74 00:15:17.980 --> 00:15:34.249 Haki Benita: What if, instead of going to the table to get the URL. What if I could include the URL in the leaf block in the index this way? When I found a matching entry in the leaf block, I would have the URL just sitting there.
75 00:15:34.310 --> 00:15:52.420 Haki Benita: Right? So this mind-blowing idea is called an inclusive index. Okay, in other databases these are called covering indexes, and what it allows us to do is store additional information in the leaf block.
76 00:15:52.500 --> 00:16:14.569 Haki Benita: So if we want to use an inclusive index in Django, we can add the include argument to the unique constraint. Now look: the key is indexed; the url is not indexed, it's just included in the leaf block. Okay. Now, if we generate a migration, we apply it and we try the query again.
77 00:16:15.500 --> 00:16:21.569 Haki Benita: You can see that once again, Postgres is using our index, our unique index on the key. But there is
78 00:16:21.900 --> 00:16:33.889 Haki Benita: a very, very subtle difference here, if you notice. Previously we had an index scan using our unique index. This time we have an index-only scan.
79 00:16:34.020 --> 00:17:03.620 Haki Benita: This means that Postgres was able to satisfy the query without accessing the table. All the data that it needs was already in the leaf block. So if we once again imagine how many reads we need to do to satisfy this query, using the inclusive index, we read the root block. We follow the pointer all the way down to the leaf block, and now, instead of going to the table to read the URL. We have the URL right there in the leaf block. So we only need to read
80 00:17:03.670 --> 00:17:05.849 Haki Benita: 2 blocks from disk.
81 00:17:06.150 --> 00:17:17.110 Haki Benita: Okay, the way to identify this is by the operation in the plan, right? So we have an index scan, and we have an index-only scan.
82 00:17:18.170 --> 00:17:39.170 Haki Benita: So quick recap about inclusive indexes, as I mentioned in other databases. They are sometimes called covering indexes, and they allow us to fulfill queries without accessing the table. However, you should use them with caution. Because if you think about it, we're basically duplicating data from the table to the index. Okay?
83 00:17:39.170 --> 00:17:49.959 Haki Benita: So if you have a very big piece of information, and a URL can be very, very big, then basically I'm now storing the URL
84 00:17:50.140 --> 00:18:09.440 Haki Benita: twice. So the index could get very, very big. I'm actually not a big fan of inclusive indexes, but I can think of 2 scenarios where it might be a good idea. First, if you have very wide tables, imagine data warehouse type of tables, denormalized tables,
85 00:18:09.600 --> 00:18:11.520 Haki Benita: and you have a very
86 00:18:12.250 --> 00:18:22.290 Haki Benita: predefined set of queries that are executed very, very often on a very, very small subset of columns, you can consider using
87 00:18:23.440 --> 00:18:50.249 Haki Benita: an inclusive index. And also, I personally found that non-unique composite indexes can be good candidates for inclusive indexes, that is, indexes on multiple columns that are not used to enforce a unique constraint. Sometimes they can benefit from switching from a plain composite index to an inclusive index. Okay, questions so far before we move on to the next use case?
88 00:18:55.710 --> 00:19:02.210 Haki Benita: Okay, if you have any questions, feel free. Let's move on to the next use case.
89 00:19:02.800 --> 00:19:04.080 Haki Benita: So now
90 00:19:04.230 --> 00:19:16.229 Haki Benita: we want to find unused keys, right? We have this business question: we want to know how many short URLs we have with no hits at all. Okay, we have 0 hits.
91 00:19:17.070 --> 00:19:23.050 Haki Benita: So we start by implementing this very, very simple function. We call it find_unused,
92 00:19:23.350 --> 00:19:26.190 Haki Benita: and it returns a queryset
93 00:19:26.790 --> 00:19:43.480 Haki Benita: with short URLs where hits equals 0. Once again, if we want to see what the query looks like, we can print the result of .query. We can see that it returns SELECT * FROM short_url WHERE hits = 0.
94 00:19:44.560 --> 00:19:58.929 Haki Benita: Once again, through the same process, we produce an execution plan. This time we can see that Postgres is doing a sequential scan on short_url. A sequential scan is basically a full table scan: Postgres is just
95 00:19:59.010 --> 00:20:18.369 Haki Benita: reading the table row by row, looking for rows where hits equals 0. We can see that the execution time at the bottom is 116 ms. Let's say, for the sake of discussion, that this is very, very slow, and we want to try and improve that.
96 00:20:18.450 --> 00:20:48.250 Haki Benita: So if you go to like 99% of developers and DBAs, they will tell you what's the problem: just slap a B-tree on it. Right. So we add a B-tree index on the hits column. We do that in Django using db_index=True. We generate a migration. We apply the migration. We once again produce the execution plan with ANALYZE, and lo and behold,
97 00:20:48.310 --> 00:20:56.180 Haki Benita: Postgres is using our index, short_url_hits_ix, and, as you can see, the execution time
98 00:20:56.810 --> 00:21:02.370 Haki Benita: is very, very fast compared to before, so we're done right.
99 00:21:03.230 --> 00:21:06.060 Haki Benita: We can call it a day, we can go for lunch.
100 00:21:06.330 --> 00:21:08.609 Haki Benita: We're happy. It's fast. Now
101 00:21:09.310 --> 00:21:20.299 Haki Benita: stop, let's take a second to talk about performance and what it actually means. Okay? Because intuitively, when we talk about performance, we talk about
102 00:21:20.380 --> 00:21:37.639 Haki Benita: speed right? We want things to be very, very quick. But I think, or the way I view performance is that we need to balance different types of resources. And I want to illustrate this with an example. Okay, let's say that you have this batch processing job running at night.
103 00:21:37.640 --> 00:21:53.420 Haki Benita: Now, this batch processing job runs in the middle of the night, when you have very, very few users, and it runs very, very fast. It takes this batch processing job like 10 seconds to complete. You're so happy, it's so fast. However,
104 00:21:53.720 --> 00:22:05.569 Haki Benita: however, this job consumes huge amounts of memory, huge amounts of CPU and huge amounts of disk space right. What if I told you that
105 00:22:06.440 --> 00:22:12.950 Haki Benita: if we are willing to compromise, and instead of completing in 10 seconds, it takes a minute
106 00:22:13.410 --> 00:22:38.970 Haki Benita: but it consumes very little memory, disk space, and CPU? I'm guessing that if you pay a lot of money for memory, you are willing to make this compromise. Okay, I'll give you another example. Let's say that you have this background job running in the middle of the day. Now, this background job consumes a lot of CPU, so much CPU, in fact, that it starts to interfere with user traffic in the system.
107 00:22:39.030 --> 00:23:07.120 Haki Benita: In this case, instead of optimizing for time, you might be optimizing for CPU, right? You're willing to compromise a few seconds. But you don't want the background job to consume a lot of CPU. So when we talk about performance. We talk about more than just speed. We're talking about how we can balance different resources in the system, usually depending on some type of context time of day the type of resource that we have available at this time. Right?
108 00:23:07.670 --> 00:23:23.450 Haki Benita: So remember that we slapped a B-tree on it, right? And it was very, very fast, but I'm not sure that was the most optimal thing that we could have done. So let's go to the database
109 00:23:23.580 --> 00:23:33.769 Haki Benita: and check the size of the index we created to solve this teeny, tiny problem. Okay, so this index.
110 00:23:34.570 --> 00:23:41.979 Haki Benita: right, is 7 MB. Okay, so that's pretty big for this type of index.
111 00:23:42.120 --> 00:23:47.420 Haki Benita: So our 7 MB index includes
112 00:23:47.630 --> 00:23:57.789 Haki Benita: all the rows in the table, right? We just added db_index=True to create a B-tree index on the column, so it contains all the 1 million rows in the table. But
113 00:23:58.570 --> 00:24:05.790 Haki Benita: we actually don't care about all the rows in the table. Right? Nobody asked us how many
114 00:24:06.150 --> 00:24:25.690 Haki Benita: short Urls you have with less than 5 hits, or more than 266 hits, or exactly 1,000 hits. Nobody cares about that. We had a very specific question that we wanted to answer in regards to the hits. We wanted to find how many short Urls we have with exactly 0 hits.
115 00:24:26.100 --> 00:24:37.350 Haki Benita: So what if, instead of indexing all the rows in the table, we could index just a portion of the rows, the part of the table that we actually care about?
116 00:24:37.810 --> 00:24:51.950 Haki Benita: Right? So this is, once again, a mind-blowing idea, and this is made possible with something called partial indexes. Partial indexes allow us to index just the part of the table that we actually care about.
117 00:24:52.810 --> 00:25:08.019 Haki Benita: So going back to our Django model: first we start by removing db_index from the column definition (you should never use db_index, regardless of this), and then, instead of adding this default index on the column,
118 00:25:08.020 --> 00:25:28.989 Haki Benita: we add a proper index, right? But we add a condition. Okay, so what this does is create an index on the id column with a condition: WHERE hits = 0. This causes Postgres to create an index just on the rows that satisfy this condition, just on rows
119 00:25:29.200 --> 00:25:54.569 Haki Benita: where hits equals 0. Right? So we generate the migration, we apply the migration, and we try the query again. We produce an execution plan, and we can see that Postgres is using our index. We see an index scan using short_url_unused_part_ix; this is the index we just created. Okay, so Postgres is able to use the partial index we just created
120 00:25:55.000 --> 00:26:04.670 Haki Benita: to satisfy this very specific query. We can also see that the query is very, very fast, even compared to the full index. Right?
121 00:26:05.090 --> 00:26:13.180 Haki Benita: But that wasn't the motivation here, right? This is not what we were looking to optimize. If we go back
122 00:26:13.320 --> 00:26:28.990 Haki Benita: to the database, and we look at the size of this index. Look at that. The partial index is just 88 kB in size. Okay? Previously the full index was 7 MB. The partial index is 88 kB.
123 00:26:28.990 --> 00:26:48.659 Haki Benita: So I did the math. Seriously, I opened Excel, I did the math. That's 99% smaller. Okay, so that's a lot of space. Now, at this point you're probably saying, come on, man, it's just 7 MB, who cares? But go back to your system, where you have huge tables with hundreds of millions and billions of rows, right?
124 00:26:48.840 --> 00:27:06.290 Haki Benita: Check the size of your B-tree indexes. They can become huge. I've seen situations where the B-tree index was larger than the table. Okay, and if you have a lot of indexes it can grow out of control very, very quickly.
125 00:27:07.020 --> 00:27:21.090 Haki Benita: So, as you may guess, I'm a very, very big fan of partial indexes. They produce smaller indexes, and I highly encourage you to use them whenever possible. One limitation of partial indexes is that
126 00:27:22.030 --> 00:27:26.349 Haki Benita: the database can only use partial indexes when
127 00:27:26.500 --> 00:27:52.249 Haki Benita: the query uses the exact same condition as the predicate in the index. Right? The database is not even smart enough to handle something like WHERE hits = 1 - 1. Okay, to that level. So it's limited to queries that use the exact same condition. Usually it's fine, because, you know, why would you write hits = 1 - 1?
128 00:27:52.380 --> 00:27:53.080 Haki Benita: I don't know.
129 00:27:53.520 --> 00:27:58.490 Haki Benita: I personally found that nullable columns are great candidates
130 00:27:58.780 --> 00:28:09.290 Haki Benita: for partial indexes, because in Postgres, for example, null values are indexed, and usually you don't want to use an index for IS NULL queries. So I found that
131 00:28:09.480 --> 00:28:34.749 Haki Benita: whenever I have a nullable column with an index on it, I can benefit from making it a partial index. In fact, I wrote an entire article on how we saved 20 GB of unused disk space simply by identifying nullable columns with indexes and switching them to use partial indexes. Okay, so questions about partial indexes before we move on to the next use case?
132 00:28:36.780 --> 00:28:38.540 Haki Benita: Gabor, you have a question.
133 00:28:42.160 --> 00:28:42.730 Haki Benita: No.
134 00:28:42.730 --> 00:28:45.249 Gabor Szabo: Sorry there is this sorry? Actually, there is this question.
135 00:28:46.340 --> 00:29:04.110 Haki Benita: Oh, is it a good idea to recalculate the hits in partial indexes? How frequently? Well, the nice thing about indexes, and B-trees in general, is that they are always in sync with the data in the table. It's actually part of the transaction. So when you, for example, increment,
136 00:29:05.180 --> 00:29:07.990 Haki Benita: when you increment the counter for the first time,
137 00:29:08.290 --> 00:29:11.070 Haki Benita: the row would just disappear from the index.
138 00:29:11.250 --> 00:29:26.029 Haki Benita: Right? So I'm guessing that you're asking, because you have some experience with like materialized views and stuff like that. So you don't actually have to maintain it actively. It's just maintained by the database.
139 00:29:26.460 --> 00:29:32.839 Haki Benita: It's truly an amazing feature; you should definitely use that. Any more questions before we move on to
140 00:29:33.140 --> 00:29:36.009 Haki Benita: a very exotic type of index in postgres.
141 00:29:36.750 --> 00:29:38.110 Haki Benita: Ow.
142 00:29:41.210 --> 00:29:46.360 Haki Benita: okay, great. So let's talk about another type. Another use case.
143 00:29:47.270 --> 00:30:00.790 Haki Benita: So, in the first use case, we wanted to resolve the key to a URL, right? This is the redirect action. This time we want to do a reverse lookup. We want to ask
144 00:30:01.000 --> 00:30:09.090 Haki Benita: how many keys we have pointing to a specific URL. So we want to search for keys by the URL.
145 00:30:09.530 --> 00:30:20.539 Haki Benita: So we implement this very simple function called reverse_lookup. It accepts a URL and returns a queryset of short URLs. Okay?
146 00:30:21.210 --> 00:30:49.150 Haki Benita: So if we want to see what the query looks like, we use .query, and we can see SELECT * FROM short_url WHERE url = something. Okay. If we produce an execution plan, we can see that the database is doing a sequential scan on the short_url table, that is, scanning the entire table, sifting row by row, finding matches for our query.
147 00:30:49.430 --> 00:30:50.800 Haki Benita: Whoa!
148 00:30:51.590 --> 00:30:55.929 Haki Benita: And we can see that it's relatively
149 00:30:56.140 --> 00:31:00.379 Haki Benita: slow, right? It's like 105
150 00:31:00.500 --> 00:31:03.990 Haki Benita: milliseconds so compared to the index
151 00:31:04.320 --> 00:31:08.840 Haki Benita: queries that we saw before. That's that's pretty slow. Right?
152 00:31:09.220 --> 00:31:23.659 Haki Benita: So, you know, once again, 99% of the people would just say, come on, man, I'm hungry, let's order some food, just slap a B-tree on it. So this is what we do, right? We start by adding a B-tree on the URL
153 00:31:23.860 --> 00:31:37.679 Haki Benita: right? We generate and apply the migration. Now we execute the exact same query again, and we can see that now Postgres is using the index that we just created. We can see an index scan using
154 00:31:38.030 --> 00:31:57.059 Haki Benita: the index on the url column, and also it's very fast. Previously it was like 100 ms; now it's 0.1 ms. So that's a very, very big and significant improvement. We can all go to lunch and be very, very happy and satisfied with ourselves. But
155 00:31:57.770 --> 00:32:09.459 Haki Benita: are we done? Do you think that we are done? Is there anything that we can optimize? Now, if you are paying attention throughout this presentation. You know that we can definitely do better than that.
156 00:32:09.830 --> 00:32:16.550 Haki Benita: Let's go to the database and check the size of the index. Okay? So the size of the index.
157 00:32:16.740 --> 00:32:22.669 Haki Benita: Okay, stay with me. 47 MB. If you remember the previous
158 00:32:23.050 --> 00:32:28.779 Haki Benita: use case, we had an index on all the hits. It was 7 MB. I told you it was large.
159 00:32:28.950 --> 00:32:44.159 Haki Benita: This index, on the same number of rows, is 47 MB. That's very, very big, and the reason it's very, very big is that the URL is very, very big, right? The B-tree index
160 00:32:44.390 --> 00:32:49.879 Haki Benita: holds the actual values in the leaf blocks. So if we are indexing
161 00:32:50.020 --> 00:32:58.219 Haki Benita: a column with very large values, like URLs, which can be very, very big,
162 00:32:58.430 --> 00:33:03.490 Haki Benita: these values are also present in the index,
163 00:33:04.000 --> 00:33:14.130 Haki Benita: and the index can get very, very big. So previously, when we were indexing integers, it was 7 MB. Now we're indexing large pieces of text, URLs,
164 00:33:14.410 --> 00:33:18.940 Haki Benita: and that's 47 MB. Okay, so
165 00:33:19.430 --> 00:33:28.389 Haki Benita: let's pause for a second. Okay, I know that the B-tree is like the magic answer for 90% of the use cases, but there are other types of indexes that we can use.
166 00:33:28.955 --> 00:33:32.949 Haki Benita: So let's pause for a second and ask ourselves, what do we know about.
167 00:33:33.210 --> 00:33:48.990 Haki Benita: what do we know about the URL? Okay? So first of all, we know that the URL is not unique, right? We can have multiple keys pointing to the same URL. We can have, for example, different campaigns with different short URLs
168 00:33:49.100 --> 00:33:55.800 Haki Benita: pointing to the same URL. There's no restriction in the system. You can have many keys pointing to the same URL. So it's not unique.
169 00:33:55.930 --> 00:33:57.940 Haki Benita: However, however.
170 00:33:59.780 --> 00:34:06.770 Haki Benita: if we actually look at the data, we see that we don't have a lot of duplicate long Urls right
171 00:34:06.970 --> 00:34:07.889 Haki Benita: like.
172 00:34:09.444 --> 00:34:18.389 Haki Benita: It's not likely that people will create a lot of short URLs pointing to the same URL; at the very least,
173 00:34:18.650 --> 00:34:22.639 Haki Benita: they would have different UTM parameters for the same URL.
174 00:34:22.780 --> 00:34:33.040 Haki Benita: So while it's not a restriction (you can have many keys pointing to the same URL), it's not likely, so we don't have a lot of duplicate values.
175 00:34:34.199 --> 00:34:36.219 Haki Benita: So now I want to introduce you
176 00:34:36.710 --> 00:35:00.369 Haki Benita: to what I call the ugly duckling of index types in Postgres: the hash index. Okay? And to understand how a hash index works and why it's different from a B-tree index, let's start by actually building a hash index ourselves. So imagine we have these values, A, B, C and D, and we want to index them using a hash index.
177 00:35:00.730 --> 00:35:20.800 Haki Benita: So we start by applying a hash function to each value. Postgres, in our example, has different hash functions for different types. You can see that we have hashes for text, char, arrays, even JSON types, timestamps, and so on.
178 00:35:20.930 --> 00:35:34.680 Haki Benita: In our case we have just one character, so it uses hashchar. If we actually apply this function to the values, we get the hash values. The next step is that we want to divide these
179 00:35:34.870 --> 00:35:36.829 Haki Benita: values into buckets.
180 00:35:37.030 --> 00:35:43.100 Haki Benita: So we start by dividing them into 2 buckets. Basically, we apply modulo 2
181 00:35:44.050 --> 00:36:04.600 Haki Benita: on the hash value, and then we assign each value to a bucket. So we can see that A goes to bucket 1, and B, C, and D go to bucket 0. So this is our hash index. Okay, so we have 3 hash values in bucket 0, and each hash value points to
182 00:36:04.860 --> 00:36:10.809 Haki Benita: somewhere in the table. Okay, just like we had the TIDs in the B-tree, we have
183 00:36:10.980 --> 00:36:32.230 Haki Benita: the TIDs right here in the hash index. Now, if we want to use this hash index to find some value, we do the exact same thing, but the other way around, right? So if we want to search for the value B, for example, we apply the hash function on it, we get the hash value, we apply modulo the number of buckets to get the
184 00:36:32.360 --> 00:36:54.430 Haki Benita: bucket, in this case 0, and then we go to bucket 0 and start scanning the pointers to find a matching hash. Once we find a matching hash, we can take this TID, which is a pointer to a place in the table, and go scan this row to look for matching rows. Okay, so this is how a hash index works in Postgres.
185 00:36:55.190 --> 00:37:14.639 Haki Benita: Now, if we want to create a hash index in Django, we need to use the special HashIndex from django.contrib.postgres. Okay? The reason for that is that hash is not the default index type in Postgres, so we need to explicitly say we want a hash index. Okay,
186 00:37:15.260 --> 00:37:19.239 Haki Benita: so in this case we are creating a hash index on the URL field.
187 00:37:19.770 --> 00:37:46.360 Haki Benita: and the name of this index is going to be short_url_hix. I like to use a suffix that indicates the type of the index, so when I look at execution plans, I can quickly identify the type of the index. I usually use ix for B-tree indexes, part_ix for partial indexes, hix for hash indexes, and so on. You can come up with whatever convention you want.
188 00:37:47.920 --> 00:37:48.900 Haki Benita: So
189 00:37:49.530 --> 00:38:00.809 Haki Benita: we generate the migration, we apply the migration and produce an execution plan. And we can see that Postgres is using our hash index. Okay? Now.
190 00:38:00.940 --> 00:38:01.990 Haki Benita: okay.
191 00:38:02.180 --> 00:38:18.460 Haki Benita: 1st observation: this is very, very fast. Okay, you can see that: 0.07 ms. That's very, very fast. But that's not all. If we look at the size of our hash index compared
192 00:38:18.730 --> 00:38:34.859 Haki Benita: to the B-tree index, we can see that the hash index is 30% smaller. Okay, trust me, I took a calculator, an old Casio, and I calculated the difference. It's 30% smaller. Okay, that's very, very significant. Okay.
193 00:38:35.340 --> 00:38:37.929 Haki Benita: if we put all the data in a table.
194 00:38:38.180 --> 00:38:46.570 Haki Benita: you can see that the hash index in this case is both faster and smaller.
195 00:38:46.860 --> 00:38:47.990 Haki Benita: So that's
196 00:38:48.170 --> 00:39:06.030 Haki Benita: a win-win all around. Okay, faster and smaller than the default B-tree index. Now, I did a little experiment. Okay. So what I did was, I created a hash index and a B-tree index on the key and on the URL. Okay, you can see the chart right here.
197 00:39:06.490 --> 00:39:35.660 Haki Benita: I have a hash index on the key. I have a hash index on the URL, I have a B tree on the key, and I have a B tree on the URL. And what I did is I started adding rows to the table. Okay, you can see at the bottom the bottom axis. That's the number of rows. So I started adding rows into the table until I get to a million rows. Now, every time I added rows to the table I took a snapshot of the sizes of the hash index of all the indexes, and then I put this
198 00:39:35.740 --> 00:39:39.649 Haki Benita: all the data in this chart, and we can see some
199 00:39:39.740 --> 00:39:43.597 Haki Benita: interesting things. Okay. 1st of all.
200 00:39:44.510 --> 00:39:46.580 Haki Benita: 1st of all, if you look at the.
201 00:39:47.000 --> 00:39:49.219 Haki Benita: If you look at the red line.
202 00:39:49.470 --> 00:39:52.999 Haki Benita: which is the B-tree on the URL, the big piece of text,
203 00:39:53.690 --> 00:40:18.479 Haki Benita: and the green line, which is the B-tree on the key, the short piece of text. 1st of all, you can see that both of them grow basically linearly as I add more rows to the table, right? So we can see this linear line increasing, right? As I add more rows, the size of the index increases. We can also see that the red line, the B-tree on the URL, is always larger
204 00:40:18.850 --> 00:40:21.239 Haki Benita: than the B-tree on the key, right?
205 00:40:21.780 --> 00:40:30.559 Haki Benita: So the reason for that is that the URL is a big piece of text, and the key is a short piece of text. This tells us
206 00:40:30.890 --> 00:40:33.730 Haki Benita: that the size of the B-tree is
207 00:40:33.840 --> 00:40:36.900 Haki Benita: very much affected by the size
208 00:40:37.250 --> 00:40:40.240 Haki Benita: of the column that it indexes.
209 00:40:40.380 --> 00:40:49.959 Haki Benita: So a B-tree on URL will be bigger than a B-tree on key for the same amount of rows, because a URL is bigger than a key.
210 00:40:50.270 --> 00:40:56.780 Haki Benita: So that's about the B-tree indexes. However, if we look at the hash indexes, that's the blue
211 00:40:57.900 --> 00:40:59.700 Haki Benita: and the yellow lines.
212 00:41:00.190 --> 00:41:02.260 Haki Benita: 1st of all, we can see that
213 00:41:03.480 --> 00:41:10.410 Haki Benita: the size of the hash index, as I add more rows, is not affected by the size of the value,
214 00:41:10.540 --> 00:41:18.259 Haki Benita: because the URL is big and the key is small, but as I add more rows to the table, the size of the hash index is the same. Okay.
215 00:41:18.400 --> 00:41:27.409 Haki Benita: The second thing that I can see is that in this specific case the hash index was consistently smaller
216 00:41:27.690 --> 00:41:35.050 Haki Benita: than the B-tree index on the same column. Okay. So in this case the hash index was always smaller.
217 00:41:35.520 --> 00:41:40.680 Haki Benita: Another thing that we can see in this chart is that, unlike the B-tree index, which grows linearly,
218 00:41:41.050 --> 00:41:48.299 Haki Benita: the hash index grows in like steps. Right? You can see the step, and then it's flat. Step flat.
219 00:41:48.700 --> 00:42:09.099 Haki Benita: So what's happening in a hash index is, we start adding rows to the hash index, and we have some bucket, and this bucket starts to fill up. Now, when a bucket fills up, Postgres needs to split this bucket. Now, when the bucket is split, Postgres pre-allocates
220 00:42:09.580 --> 00:42:12.570 Haki Benita: storage disk space for this bucket.
221 00:42:12.700 --> 00:42:16.419 Haki Benita: So the steps that you see is the bucket splits
222 00:42:16.540 --> 00:42:21.430 Haki Benita: where postgres allocates additional storage to split the bucket.
223 00:42:21.770 --> 00:42:22.630 Haki Benita: Right?
224 00:42:22.970 --> 00:42:25.229 Haki Benita: So this is why a hash index
225 00:42:25.420 --> 00:42:28.239 Haki Benita: grows in steps.
226 00:42:29.060 --> 00:42:35.259 Haki Benita: So hash index is ideal. When we have very few duplicates
227 00:42:35.470 --> 00:42:59.300 Haki Benita: in the rows that we want to index, and the reason for that is, if we have lots of duplicates, the values would map to the same bucket, and we won't get the benefit of a hash index. The reason that a hash index made sense in our case is that URL is mostly unique. It's almost unique. Okay, it's not unique by definition. But there's not a lot of duplicates.
228 00:42:59.680 --> 00:43:18.200 Haki Benita: We also saw that, unlike a B tree index, hash index is not affected by the size of the values that it indexes, and the reason for that is that the hash index doesn't actually include the values. It includes hash values. Okay, this is why I can index very, very big values, big strings
229 00:43:18.540 --> 00:43:40.110 Haki Benita: with a relatively small index. Okay, as we saw, a hash index under some circumstances can be both smaller and faster than a B-tree index, and the reason that a lot of people are unfamiliar with the hash index is that prior to Postgres 10, which is already pretty old, because we're now at Postgres 17,
230 00:43:40.580 --> 00:44:04.829 Haki Benita: If you went to the documentation for Hash Index, there would be like this huge warning, saying, Beware, do not use hash indexes. They are not production ready. So a lot of developers became used to not using hash indexes, but starting in postgres 10, you can definitely use hash indexes in production. They are production ready, and as we saw, they can be very, very good under some circumstances.
231 00:44:06.160 --> 00:44:12.890 Haki Benita: When we're talking about hash indexes, it is very important to also know the restrictions of hash indexes. 1st of all, a hash index
232 00:44:14.290 --> 00:44:32.920 Haki Benita: cannot be used to enforce uniqueness. You cannot create a unique hash index, and the reason is that a hash index does not contain the actual values, just hash values. And technically, you can have multiple different values producing the exact same hash value.
233 00:44:33.090 --> 00:44:43.399 Haki Benita: So you cannot create a unique hash index. However, okay, and that's the comment at the bottom, we can talk about it later if you want, you can enforce uniqueness
234 00:44:43.680 --> 00:44:47.209 Haki Benita: with a hash index using an exclusion constraint.
235 00:44:47.440 --> 00:44:56.589 Haki Benita: Okay, next, we can't have a composite hash index. We can't have a hash index on multiple columns. Okay?
236 00:44:57.410 --> 00:45:02.989 Haki Benita: And we can't use a hash index for sorting and range searches, because, once again,
237 00:45:03.280 --> 00:45:10.940 Haki Benita: hash index does not contain the actual values. Just the hash values right? So I can't use a hash index for things like.
238 00:45:11.390 --> 00:45:17.379 Haki Benita: you know, BETWEEN, greater than, less than, and so on. Just equality.
239 00:45:18.540 --> 00:45:24.421 Haki Benita: So, quick recap, just 4 more slides, I promise. Okay,
240 00:45:26.090 --> 00:45:34.610 Haki Benita: when to use indexes. So remember, indexes can make queries faster. We saw that in all of our examples.
241 00:45:34.650 --> 00:45:56.340 Haki Benita: using an index made the query faster. However, they're not free; they come at a cost. You need to maintain the index, and this index maintenance happens when you insert, when you update, and when you delete. So the more indexes you create, the faster your queries are, but the slower every other operation is,
242 00:45:56.500 --> 00:46:18.380 Haki Benita: okay. Another thing to consider, and this is often overlooked: indexes can be very, very big. They consume a lot of disk space. When you go back to your databases after this talk, please go do \di+ and look at the sizes of your indexes. I think that if you never looked at the size of your indexes,
243 00:46:18.620 --> 00:46:23.349 Haki Benita: You're going to be very much surprised at what you're going to find.
244 00:46:24.180 --> 00:46:41.909 Haki Benita: and finally using an index is not always best. If you have a query that needs to access a large portion of the table. Sometimes it doesn't make sense to use an index for that. Okay, there's no magic number, but, you know.
245 00:46:42.190 --> 00:46:43.480 Haki Benita: keep that in mind.
246 00:46:44.710 --> 00:46:55.220 Haki Benita: So we talked about index types and features. We talked about partial indexes, inclusive indexes, and we talked about the hash index.
247 00:46:55.420 --> 00:47:07.439 Haki Benita: We talked a little bit about how to evaluate performance. I don't know if you noticed, but throughout this presentation we went through the same process over and over again. We start by
248 00:47:07.600 --> 00:47:25.639 Haki Benita: executing some query with EXPLAIN ANALYZE to get the timing with no indexes. This is basically establishing a baseline, right? And then we start experimenting with different types of indexes. So usually we start with a B-tree. We take a measure of the time using EXPLAIN ANALYZE,
249 00:47:25.640 --> 00:47:40.620 Haki Benita: and then we take the size of the index. We put it all in a nice table. We keep experimenting. And once you have all the data organized like that, it's a lot easier to reach a decision on what is the best indexing approach
250 00:47:40.630 --> 00:47:42.499 Haki Benita: for your specific use case.
251 00:47:42.560 --> 00:47:53.119 Haki Benita: And also, hopefully you remember that index performance is not just about speed. As we saw, we can get significant
252 00:47:53.660 --> 00:47:57.540 Haki Benita: disk space reductions
253 00:47:57.600 --> 00:48:09.329 Haki Benita: with a very, very small price in speed. Sometimes it makes sense to make this compromise. We also, throughout this talk, saw how to use EXPLAIN,
254 00:48:09.360 --> 00:48:31.259 Haki Benita: how to use EXPLAIN ANALYZE, how to debug SQL in Django, and we also saw a lot of execution plans. I don't know if you noticed, but if you've never seen execution plans before, hopefully when you go back to your system you'll start doing EXPLAIN ANALYZE on some of the queries you run a lot, and you'll get to actually understand what the database is doing. Now,
255 00:48:31.560 --> 00:48:45.659 Haki Benita: in this talk I talked only about inclusive indexes, partial indexes, and hash index, but, in fact, there are many, many different other types of indexes that are exotic and very, very cool. We have
256 00:48:46.330 --> 00:48:56.900 Haki Benita: BRIN indexes. We have function-based indexes, and we have a lot of different flavors of things that we can do. And you can check out this
257 00:48:57.300 --> 00:49:04.960 Haki Benita: class, 3 hours packed with magic, for your benefit. And
258 00:49:05.810 --> 00:49:13.720 Haki Benita: finally check me out in all of these places, and I'm happy to take questions or discuss whatever you want.
259 00:49:19.490 --> 00:49:22.113 Gabor Szabo: Whoa, thank you.
260 00:49:23.750 --> 00:49:26.585 Gabor Szabo: Because, yeah.
261 00:49:27.400 --> 00:49:28.630 Haki Benita: Hectic.
262 00:49:30.335 --> 00:49:35.410 Gabor Szabo: Yeah, this is not a question. Haki's article on hash indexes is truly excellent.
263 00:49:35.520 --> 00:49:42.589 Gabor Szabo: I believe it remains one of the top search results for anyone looking for resources on hash indexes.
264 00:49:42.760 --> 00:49:47.639 Haki Benita: It's true, it's true. This is one of the top searches for hash index in postgres.
265 00:49:47.910 --> 00:49:48.340 Gabor Szabo: Yeah.
266 00:49:48.340 --> 00:49:53.060 Haki Benita: Yeah, I managed to catch this trend very, very early on.
267 00:49:54.515 --> 00:49:55.540 Gabor Szabo: Okay.
268 00:49:55.790 --> 00:49:56.270 Haki Benita: Mom.
269 00:49:56.270 --> 00:50:01.189 Gabor Szabo: Comments, questions before we. We close this session.
270 00:50:02.340 --> 00:50:05.000 Gabor Szabo: We know where where to find you.
271 00:50:05.160 --> 00:50:07.829 Gabor Szabo: We have the. We'll have the link.
272 00:50:08.320 --> 00:50:16.320 Gabor Szabo: You can add the links to the post of the video as well, so people can find it easily.
273 00:50:17.100 --> 00:50:19.660 Gabor Szabo: and any comments.
274 00:50:19.660 --> 00:50:20.020 Haki Benita: Okay.
275 00:50:20.020 --> 00:50:21.859 Gabor Szabo: Questions, apparently not.
276 00:50:21.860 --> 00:50:24.780 Haki Benita: Yeah, I want to thank you, Gabor, for hosting this meeting.
277 00:50:24.780 --> 00:50:25.650 Gabor Szabo: It was excellent.
278 00:50:26.146 --> 00:50:27.139 Haki Benita: Meet up!
279 00:50:27.140 --> 00:50:32.660 Gabor Szabo: Yeah. Well, yeah. So thank you very much for this presentation.
280 00:50:32.770 --> 00:50:41.470 Gabor Szabo: If anyone has questions, then we'll see how to find Haki later on on this slide, and then we'll put it under the video.
281 00:50:42.020 --> 00:50:52.750 Gabor Szabo: Thank you for supporting us. Thank you for being here. Thank you very much for giving the presentation. Please like the video and follow the channel. Yeah.
282 00:50:53.020 --> 00:51:10.139 Gabor Szabo: And if you would like to give any presentation, you're welcome to contact me as well, and we'll see how we can schedule a presentation, at what time, and so on. So thank you very much, and
283 00:51:10.430 --> 00:51:15.029 Gabor Szabo: see you at the next meeting next video, whatever.
284 00:51:15.400 --> 00:51:16.869 Gabor Szabo: Thank you. Bye, bye.
285 00:51:16.870 --> 00:51:18.830 Haki Benita: Thank you very much. Everyone. Good night.
The Reference Model for disease progression was initially a diabetes model. It used the approach of assembling models and validating them against different populations from clinical trials.
The model performs simulation at the individual level while modeling entire populations using the MIcro-Simulation Tool (MIST), employing High Performance Computing (HPC), and using machine learning techniques to combine models.
The Reference Model technology was transformed to model COVID-19 near the start of the epidemic. The model is now composed of multiple models from multiple contributors that represent different phenomena: It includes infectiousness models, transmission models, human response / behavior models, hospitalization models, mortality models, and observation models. Some of those models were calculated at different scales including cell scale, organ scale, individual scale, and population scale.
The Reference Model is therefore the first known multi-scale ensemble model for COVID-19. This project is ongoing and this presentation is constantly updated for each venue. To access the most recent publication please use this link
Jacob Barhak is an independent Computational Disease Modeler focusing on machine comprehension of clinical data. The Reference Model for disease progression is patented technology that was self-developed by Dr. Barhak. The Reference Model is the most validated diabetes model known worldwide and also the first COVID-19 multi-scale ensemble model. His efforts also include standardizing clinical data through ClinicalUnitMapping.com and he is the developer of the MIcro-Simulation Tool (MIST). Dr. Barhak has a diverse international background in engineering and computing science. He is active within the Python community and organizes the Austin Evening of Python Coding meetup. For additional information please visit

Lessons Learned from Modeling COVID-19: Steps to Take at the Start of the Next Pandemic[v1]
1 00:00:02.260 --> 00:00:20.330 Gabor Szabo: Hello, and welcome to the Code Maven Channel on Youtube and our meeting. Thank you to everyone who has arrived at this meeting, and especially to Jacob, who is going to give the presentation. If you are unfamiliar with the channel, then
2 00:00:20.500 --> 00:00:48.179 Gabor Szabo: we have these live presentations, meetings with live presentations, mostly about stuff related to Python and Rust programming, and also about git and version control. My name is Gabor. I'm the host of this. I teach Python and Rust at corporations. And I also help companies to get started with
3 00:00:48.960 --> 00:00:52.559 Gabor Szabo: test automation and continuous integration,
4 00:00:52.630 --> 00:00:56.399 Gabor Szabo: that sort of area.
5 00:00:57.730 --> 00:01:04.399 Gabor Szabo: And we have these meetings so people can share their knowledge with each other.
6 00:01:04.540 --> 00:01:11.199 Gabor Szabo: So thank you for arriving. And before I let Jacob start talking about
7 00:01:11.590 --> 00:01:20.060 Gabor Szabo: himself and introducing himself, please like the video and follow the channel. I always forget to say this, so now I remembered.
8 00:01:20.200 --> 00:01:23.540 Gabor Szabo: So thank you, and it's yours.
9 00:01:24.210 --> 00:01:54.160 Jacob Barhak: Okay, we're going to try to make it as much of a conversation as possible. My name is Jacob Barhak. I've been developing disease models since 2006, so it's about 19 years now, a bit less than 19 years. And I have technology for disease modeling. Now, disease modeling means computational disease modeling, and I use a lot of Python. This is actually when I was introduced to Python, in 2006, and all of this project was made in Python.
10 00:01:54.761 --> 00:02:10.820 Jacob Barhak: It requires a lot of computing power, and the idea is to be able to explain diseases. So I started with diabetes. I was actually hired by a university as a programmer to write disease modeling software.
11 00:02:11.030 --> 00:02:38.280 Jacob Barhak: I'm still using an offshoot of the same software, like 19 years later. Now it's called MIST, the MIcro-Simulation Tool, and it allows you to simulate many, many individuals going through a disease, and you define what the disease is. I started with diabetes, and I actually got to the point that I have one of the most sophisticated diabetes models worldwide,
12 00:02:38.280 --> 00:02:44.609 Jacob Barhak: and this is patented technology. And at the end, you will see I have a conflict of interest statement. Because
13 00:02:44.610 --> 00:03:10.790 Jacob Barhak: I believe this technology is worth a lot. So whatever I'm telling you, take it with a grain of salt. Everything that I promise you, double check. The nice thing about this technology is that it does the double checking for you, to some degree; we'll explain it later. It uses some AI ideas and technologies to actually implement what's happening here. But it's not the AI that you're familiar with. It's a mix.
14 00:03:11.050 --> 00:03:13.479 Jacob Barhak: And this is why it's patented. So
15 00:03:14.070 --> 00:03:24.709 Jacob Barhak: let me explain what happens with this technology. I started with diabetes in 2006, and in 2020
16 00:03:24.830 --> 00:03:29.980 Jacob Barhak: COVID arrived to the US. I was in the US, and I
17 00:03:30.140 --> 00:03:54.935 Jacob Barhak: started migrating my technology towards modeling COVID. And now I can explain COVID. But let me tell you what explain means. Oh, by the way, this presentation was given in many places; you can follow up on how it changed, because it did change. All of those things you can download from the links at the end. You'll have a QR code; actually, I'll show it to you now. But
18 00:03:55.550 --> 00:04:16.769 Jacob Barhak: you will have a QR code so you can download this, or actually view it. You'll need a strong machine to actually view it, because it's a huge file, like a quarter of a gig, to download and view in your browser. But it has everything, including results, and it is interactive. It's an HTML file. I'm using a Python technology to actually do this.
19 00:04:16.769 --> 00:04:29.960 Jacob Barhak: But it's less about all of this and more about disease models. Later on you can ask me about everything else, so you can download and see how things changed, even in my perspective. But now I believe things have been stable for the last
20 00:04:30.390 --> 00:04:38.560 Jacob Barhak: approximately 2 years, so I'm pretty sure I can explain, at least in the US, what happened with COVID. And explain means:
21 00:04:39.030 --> 00:04:51.659 Jacob Barhak: Let's take a step back before I show the model and explain it. Think about it: we might have more pandemics in the future. We probably will; we actually have them all the time, we have many diseases going on.
22 00:04:51.810 --> 00:05:10.500 Jacob Barhak: But can we really explain what's happening? I was amazed, when I started working with the medical people, at how little they know about some of the things going on, and the fact is, they are overwhelmed with data. The fact that they can remember and do something good is miraculous and
23 00:05:10.550 --> 00:05:36.520 Jacob Barhak: speaks a lot about their profession. They are doing the best they can, but they cannot even memorize the medical papers coming out; every 6 seconds a new medical paper comes out. There's no way one doctor will remember it all. And I'm not talking about all those medical databases, huge amounts of data that are being accumulated by bodies like the National Institutes of Health.
24 00:05:37.460 --> 00:05:59.149 Jacob Barhak: All of this means that we now need computers to help us crunch all this and give us good results and good explanations about what's going on, because the way we are dealing with medicine now will change with the data. It's already changing. And this model and many other tools related to it will change it. So
25 00:05:59.300 --> 00:06:05.260 Jacob Barhak: now let's go back to the model and explain how I can explain things. So
26 00:06:06.270 --> 00:06:11.290 Jacob Barhak: the reference model for disease. Progression is kind of a statistical model.
27 00:06:11.690 --> 00:06:31.800 Jacob Barhak: What it does, it says: each disease has states. It's a state transition model, where you can be either no COVID or COVID infected. You can recover or die from COVID. Notice that there is no arrow back from recovered to infected, because I'm modeling the beginning of the disease, April 2020.
28 00:06:32.520 --> 00:06:43.590 Jacob Barhak: Because the idea is that for the next disease we want to have a tool that will explain it to us in reasonable time, and I believe this is one of the tools that can help do that. So
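The state-transition idea can be sketched as a tiny micro-simulation. This is a minimal sketch only: the daily transition probabilities below are invented for illustration, not the Reference Model's fitted values, and the real model layers many competing sub-models on top of this skeleton.

```python
import random

# Hypothetical daily transition probabilities -- illustrative numbers,
# NOT fitted values from the Reference Model.
P_INFECT = 0.02   # no_covid -> infected
P_RECOVER = 0.10  # infected -> recovered
P_DIE = 0.005     # infected -> died

def step(state, rng):
    if state == "no_covid":
        return "infected" if rng.random() < P_INFECT else "no_covid"
    if state == "infected":
        r = rng.random()
        if r < P_DIE:
            return "died"
        if r < P_DIE + P_RECOVER:
            return "recovered"
        return "infected"
    return state  # recovered/died are absorbing: no arrow back to infected

def simulate(n_people, n_days, seed=0):
    rng = random.Random(seed)
    states = ["no_covid"] * n_people
    for _ in range(n_days):
        states = [step(s, rng) for s in states]
    return {s: states.count(s) for s in set(states)}

print(simulate(10_000, 60))
```

Each simulated person walks the arrows of the state diagram day by day; counting the final states gives the population-level numbers that get compared against the reported data.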
29 00:06:43.930 --> 00:06:44.650 Jacob Barhak: oh.
30 00:06:45.080 --> 00:07:07.769 Jacob Barhak: I'm trying to extract, from the beginning, data that people accumulated. I'm using the COVID Tracking Project data. They allow me to use it even for commercial purposes, and you can actually go and track and see, for each state in the US, how many people got infected, how many people died; they kept a pretty good record about it.
31 00:07:08.000 --> 00:07:24.509 Jacob Barhak: And later on there were other organizations that took over, but they were, I believe, the best at the start. So I'm using that data, and now I'm trying to get a model that explains all those numbers that they report,
32 00:07:25.000 --> 00:07:25.830 Jacob Barhak: so
33 00:07:26.390 --> 00:07:32.730 Jacob Barhak: To do this, I assume that there are those states, and this is the beginning of the pandemic, so there is no reinfection,
34 00:07:33.160 --> 00:07:41.250 Jacob Barhak: and I'm trying to match their numbers. How do I try to match them? Each arrow has several words above it. Each word...
35 00:07:41.250 --> 00:07:44.160 Gabor Szabo: Wait a second, Jacob. So- so
36 00:07:44.760 --> 00:07:54.509 Gabor Szabo: Jim is writing that the screen share is not showing, if that is wanted. I see on the screen these boxes of No COVID.
37 00:07:54.510 --> 00:08:08.830 Jacob Barhak: No COVID, COVID infected, possibly hospitalized. Jim, look, if you have the option of choosing which screen you see, sometimes you can choose which shared screen you see. I can stop the share and share it again, if it's okay with you guys.
38 00:08:09.000 --> 00:08:11.570 Jacob Barhak: Or, Jim, did you find the share?
39 00:08:11.980 --> 00:08:15.290 Jim Mccormack: I'll look, Jacob, don't let me take you out of flow. Go, please proceed. Thank you.
40 00:08:15.290 --> 00:08:22.830 Jacob Barhak: So look at the link I sent; you can actually bring up the presentation and follow along. I'm on the second tab, called Introduction.
41 00:08:23.460 --> 00:08:33.630 Jacob Barhak: It's interactive on your machine. You should be able to download it; if you have good Wi-Fi, just download it to your machine, and you can follow me there if you don't see the presentation.
42 00:08:33.730 --> 00:08:34.389 Jim Mccormack: Got it.
43 00:08:34.390 --> 00:08:35.260 Jim Mccormack: It's loading.
44 00:08:36.049 --> 00:08:38.569 Jacob Barhak: Yeah, I know it takes a minute to load.
45 00:08:39.282 --> 00:09:00.519 Jacob Barhak: It's huge, but it has everything encapsulated. Part of the reason is to keep it as reproducible as possible; at the end you'll see a reproducibility section. I don't give away all the code, but I do keep track of everything that I noted. Like, you see all those boxes here? Those are all the references where you can actually extract the data.
46 00:09:00.599 --> 00:09:14.679 Jacob Barhak: And some of the links actually became defunct; I found other links, so I can show you where I got the data from, and make sure that people can actually try to reconstruct this as much as possible, because we have a reproducibility crisis in science.
47 00:09:14.909 --> 00:09:23.569 Jacob Barhak: Anyway, back to the boxes. On top of those boxes you'll see words like infectiousness, transmission, response, recovery, and mortality.
48 00:09:24.019 --> 00:09:29.869 Jacob Barhak: Each one of those represents not one model, but many models.
49 00:09:30.449 --> 00:09:42.189 Jacob Barhak: The technology that I'm using is called an ensemble model. An ensemble is like a choir in music. Well, ensemble models are very similar: you don't have one model, you have many of them.
50 00:09:42.329 --> 00:09:52.189 Jacob Barhak: So I have many models for infectiousness, many models for transmission, many models for response, many models for recovery and mortality, and hospitalization, you'll see later.
51 00:09:52.369 --> 00:10:21.229 Jacob Barhak: But on top of it, I actually have an observer looking at all of this, saying, you know, your numbers are wrong, so you have to actually correct them. Actually, you have multiple observer models, each one seeing something different, telling you something different about those numbers, and you incorporate all of those. And now you have many, many models, and you have to run them all, and it takes quite a bit of computing power. Later on I'll show you; I have a computer here still crunching data, because I'm still working on this and making sure that my numbers are okay.
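The ensemble idea, stripped to its bare mechanics, is a weighted mixture of candidate models. This is a minimal sketch with made-up candidate models and made-up weights; the actual Reference Model learns such weights by validating the combinations against population data.

```python
# Three hypothetical candidate infectiousness models, each predicting the
# probability a person is infectious on a given day after infection.
def model_constant(day): return 0.5 if day <= 10 else 0.0
def model_linear(day):   return max(0.0, 1.0 - day / 14)
def model_short(day):    return 0.8 if day <= 5 else 0.0

models = [model_constant, model_linear, model_short]

def ensemble(day, weights):
    # Weighted mixture: weights express how much we believe each candidate
    # after comparing its predictions to observed data.
    return sum(w * m(day) for w, m in zip(weights, models))

print(round(ensemble(3, [0.2, 0.5, 0.3]), 3))  # → 0.733
```

Rather than betting on one candidate, the ensemble keeps all of them and lets the data decide how much each one contributes, which is what makes the "observer says your numbers are wrong, correct them" loop possible.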
52 00:10:21.569 --> 00:10:22.439 Jacob Barhak: So
53 00:10:24.309 --> 00:10:41.049 Jacob Barhak: All of this takes a lot of computing power, as you will see later on. This computation took about 3 years of computation on a single CPU. The ones I'm working on now will take half a year on a big 24-core machine with 32 threads. So
54 00:10:42.299 --> 00:10:54.739 Jacob Barhak: however, when you run it in the cloud it takes many, many processors, because this takes a lot of computing power, just like AI, because it uses AI technology; we're gonna talk about it in a second.
55 00:10:55.605 --> 00:11:18.749 Jacob Barhak: So I take all those models. I also take information from the cover tracking project about what happened in each State like numbers of and and you see some of those numbers later. I have information from us. Census about each State States are not the same. They different sizes, different population, density, different age age
56 00:11:19.109 --> 00:11:39.639 Jacob Barhak: curves population curves in school. Also, there's information about number of interactions and even the weather. I include the weather as part of the simulations, and later we'll ask some questions. But let's say, show you what the models look like and why they are. They are the way they are. So let's
57 00:11:39.979 --> 00:11:44.819 Jacob Barhak: explain one motivation why it is important to have a model like this.
58 00:11:45.769 --> 00:12:10.249 Jacob Barhak: The DHS, the Department of Homeland Security in the US, during COVID kept a document called the Master Question List about COVID-19, and in a very, very organized way they say: this is what we know, or we think we know, and this is what we want to know. So they had a master question list about COVID, about things they didn't know.
59 00:12:10.729 --> 00:12:24.479 Jacob Barhak: On the 26th of May 2020, in that version of that document, which evolved throughout time (by the way, you can actually download it; later in the presentation you can check it out),
60 00:12:25.089 --> 00:12:37.069 Jacob Barhak: the versions change, but it still exists somewhere; all those versions, you can still find them. The Department of Homeland Security kept very, very meticulous records, which is very good. I give them
61 00:12:37.339 --> 00:12:47.179 Jacob Barhak: a good grade, because this is one of the most important documents you can extract information from. They did a very good job. But
62 00:12:47.839 --> 00:12:55.989 Jacob Barhak: even with all this, on the 26th of May they were still asking the question: what is the average infectious period during which an individual can transmit the disease?
63 00:12:57.259 --> 00:13:07.479 Jacob Barhak: Why is it important? Think about it. You are now the government, and you have to decide: do you have a lockdown, or how much time do you keep people in curfew?
64 00:13:08.009 --> 00:13:12.569 Jacob Barhak: Or even if they're sick, how much time do you keep them from roaming around?
65 00:13:14.069 --> 00:13:16.979 Jacob Barhak: They didn't know; they kind of admitted it.
66 00:13:18.669 --> 00:13:19.434 Jacob Barhak: So
67 00:13:21.569 --> 00:13:29.999 Jacob Barhak: Since the Department of Homeland Security didn't know those things, and they asked,
68 00:13:30.599 --> 00:13:33.819 Jacob Barhak: at this point I started looking for answers.
69 00:13:34.069 --> 00:13:57.659 Jacob Barhak: And actually, this happened later on in the pandemic, because some of those models came in like a year later, but some of them existed even before. So you can extract some information, for example from this paper from Bai Lee. Let me explain what this curve means. This is the infectiousness curve; it's relative infectiousness. It tells you how much virus you shed, meaning how much virus your body
70 00:13:57.969 --> 00:14:02.709 Jacob Barhak: gives away compared to your max; your max is one.
71 00:14:03.449 --> 00:14:26.721 Jacob Barhak: So, how much virus your body generates, and this is the day. So at day 0, almost nothing: you just got infected, you don't generate the virus, or at least not enough to actually spread it around; the number is so low you don't see it. Then for the next 2 days you don't, and then it starts growing, and then it goes away. In this case, in this paper, they actually took the information from
72 00:14:27.899 --> 00:14:40.199 Jacob Barhak: I took this information and manipulated it a little bit, because it was not exactly like this. But some other models actually say: this is how we measured it. Notice how different the models are.
73 00:14:40.199 --> 00:14:59.819 Jacob Barhak: Here's another one, actually, in this paper. They had like 5 or 6 of those models, I don't remember, but each one looked different. So I took 2 from that publication. And think about it: each person also behaves differently. When I run the simulations, I run the simulation for each individual, so each individual may be different from another,
74 00:14:59.869 --> 00:15:06.769 Jacob Barhak: but in this situation I assume that all of them have the same infectiousness curve for the entire cohort, and we are looking at the average.
75 00:15:06.909 --> 00:15:11.699 Jacob Barhak: We can do the simulations differently, but here we're looking for the average curve. So
76 00:15:11.909 --> 00:15:36.589 Jacob Barhak: even if I say, oh, this is like that, or this person behaves like this or like that. Some of those papers came in later in the pandemic. But still, once you have this information, or even if you don't have this information but you have assumptions from other diseases (say, oh, you know, this disease looks like that one, so maybe take the infectiousness curve from that disease), you can plug it in.
77 00:15:37.329 --> 00:15:56.959 Jacob Barhak: And what does the model do? It uses AI techniques. Everyone is probably familiar with the optimization technique called gradient descent. Using gradient descent, after running all of those simulations, it will find the optimal one from all of the models that you plugged in.
78 00:15:57.279 --> 00:16:06.979 Jacob Barhak: Let me explain how it starts. At the beginning you don't know what curve is dominant, or what is actually correct. What you do is, you assume
79 00:16:07.409 --> 00:16:23.439 Jacob Barhak: all models are the same, meaning: if 5 people come and tell you a story, you believe them all the same way, without knowing anything better. So you just average whatever they're saying. This is the average of all of the models that you saw before.
80 00:16:24.549 --> 00:16:25.659 Jacob Barhak: Now.
81 00:16:26.499 --> 00:16:47.699 Jacob Barhak: During simulation, you actually run simulations and learn: this is better, this is worse. And little by little you start optimizing, and after many, many iterations (it takes a lot of computing power), this is basically how AI models train, using a very similar technique. So I'm using the same technique over some other medium, which is not a neural network as you know it.
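The re-weighting Jacob describes can be sketched in a few lines of Python. This is a hypothetical toy, not the actual Reference Model code: three made-up model curves start with equal weights, and gradient descent shifts the weight toward the mixture that best matches a made-up "observed" curve.

```python
import numpy as np

# Three hypothetical model predictions for the same 10 days (made-up numbers).
models = np.array([
    [0.0, 0.1, 0.4, 0.9, 1.0, 0.8, 0.5, 0.3, 0.1, 0.0],  # a longer curve
    [0.0, 0.5, 1.0, 0.6, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0],  # a shorter curve
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],  # a flat curve
])
# Made-up "observed" data that the ensemble should match.
observed = np.array([0.0, 0.2, 0.5, 0.8, 0.9, 0.7, 0.4, 0.2, 0.1, 0.0])

weights = np.full(3, 1 / 3)  # start by believing every model equally
lr = 0.05
for _ in range(2000):
    ensemble = weights @ models                # weighted average of the models
    grad = models @ (ensemble - observed)      # gradient of the squared error w.r.t. weights
    weights -= lr * grad                       # gradient descent step
    weights = np.clip(weights, 0, None)
    weights /= weights.sum()                   # keep the weights a valid mixture

print(weights.round(3))
```

After optimization the first (longer) curve carries most of the weight, mirroring how the poorly matching models in the talk "disappear" as their weights shrink.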
82 00:16:47.849 --> 00:16:58.359 Jacob Barhak: It's somewhat similar, but not exactly. It's a different thing, state transitions, and it actually runs a little bit differently. I can go into details if you're interested,
83 00:16:58.589 --> 00:17:15.439 Jacob Barhak: and then you end up with something that looks like this. This is the answer for the Department of Homeland Security. Hopefully in the future, you will see, they will use the technology and be able to extract an answer fairly quickly,
84 00:17:15.599 --> 00:17:29.799 Jacob Barhak: and then not go a long time into the pandemic without knowing. By the way, this may change with the numbers over time, and of course there's bias, variance, and stuff like this. But at least at the beginning you have a basic idea of what's happening.
85 00:17:29.799 --> 00:17:49.269 Jacob Barhak: And this is what was happening according to all the numbers and all the assumptions that you will see coming in later on. Remember, a model is not truth; it's an assumption. And what it does is take all this ensemble and kind of put it into place: this is the most reasonable set of assumptions, the one that matches the data best.
86 00:17:50.259 --> 00:17:52.609 Jacob Barhak: Does anyone have any questions, or can I proceed?
87 00:17:55.109 --> 00:17:56.389 Jacob Barhak: Okay, I'll proceed.
88 00:17:56.839 --> 00:18:12.969 Jacob Barhak: Here's about transmission. I'll make it interesting for you. It's one thing being infectious; but what if, in April 2020, I have COVID, and let's say I meet one of you for 15 minutes?
89 00:18:13.269 --> 00:18:25.959 Jacob Barhak: What's the chance of you getting COVID from me? Don't answer; I'll tell you. I've asked this question many times, to many people. Many people tell me 70% per encounter of, say, 15 minutes.
90 00:18:26.129 --> 00:18:34.129 Jacob Barhak: And then I tell them it's lower. Then they say 50%, and I tell them it's lower. Then they get to 10%, and I tell them it's still lower.
91 00:18:34.409 --> 00:18:39.339 Jacob Barhak: And then they end up amazed that it's only between 1 and 2%,
92 00:18:39.779 --> 00:19:00.269 Jacob Barhak: because what drives COVID crazy is not that transmission happens immediately; if the virus were that infectious, everyone would be infected in no time. What drives it is the fact that we have so many interactions amongst ourselves, and we have those every day, with many, many people.
93 00:19:00.399 --> 00:19:03.109 Jacob Barhak: So if at some point I
94 00:19:03.429 --> 00:19:09.859 Jacob Barhak: have an interaction with one person, the chance is very low. But since I meet many people for many days,
95 00:19:10.309 --> 00:19:21.119 Jacob Barhak: this is what drives it: the chance of me transmitting at 1% over 10 days with many people is much, much higher than with one person for 15 minutes.
96 00:19:21.120 --> 00:19:26.539 Gabor Szabo: Doesn't it change depending on the length of the time you spend with the person?
97 00:19:26.540 --> 00:19:42.980 Jacob Barhak: Yeah, you can argue that. Think about a 15-minute encounter as an average. If you spend twice as long with the person, then basically it's not twice the probability, but it's very close to twice the probability.
98 00:19:42.980 --> 00:19:50.709 Gabor Szabo: But what I'm saying is that, let's say, you meet a hundred people for 1 minute versus one person for a hundred minutes.
99 00:19:51.370 --> 00:19:58.529 Jacob Barhak: Yeah. So all of those things scale kind of differently. It's, you know, like a Bernoulli trial.
100 00:19:59.700 --> 00:20:18.979 Jacob Barhak: It looks like a Bernoulli trial that you run many, many times: you have a biased coin, and each period of time, let's say 15 minutes, you flip one of those coins. So according to this, you can actually calculate an approximation to the function.
101 00:20:21.220 --> 00:20:28.770 Jacob Barhak: So it's simple statistics that you learned at school, but now it's actually very useful in those cases.
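The coin-flipping argument reduces to textbook probability: if each encounter is an independent Bernoulli trial with success probability p, the chance of at least one transmission in n encounters is 1 - (1 - p)^n. A small sketch with a hypothetical per-encounter probability in the 1-2% range the talk mentions:

```python
def chance_of_at_least_one(p, n):
    """Probability that at least one of n independent Bernoulli trials succeeds."""
    return 1 - (1 - p) ** n

p = 0.015  # hypothetical transmission chance per 15-minute encounter

print(f"{chance_of_at_least_one(p, 1):.1%}")    # a single encounter
print(f"{chance_of_at_least_one(p, 2):.1%}")    # roughly twice, as noted above
print(f"{chance_of_at_least_one(p, 100):.1%}")  # many encounters over many days
```

A hundred such encounters push the overall chance above 75%, which is why a tiny per-encounter probability can still drive a pandemic.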
102 00:20:28.770 --> 00:20:52.959 Jacob Barhak: We tried in the past to do it in the Markov model. There are multiple ways to calculate it, and some people come up with different functions. It doesn't matter, really, because all of those are assumptions; all of them are incorrect to some degree. The question is which one is most plausible under all of the things that you know, and for this you need a lot of assumptions and a lot of computing power. And this is what I saw.
103 00:20:53.480 --> 00:21:01.439 Jacob Barhak: So even if it's not a Bernoulli trial, and someone else comes up with a different infectiousness function, I can plug it in and see what happens.
104 00:21:01.720 --> 00:21:03.770 Jacob Barhak: You understand what I mean.
105 00:21:04.130 --> 00:21:27.249 Jacob Barhak: But in this situation I took multiple functions that took into account the individual encounter, the population density of the state, some random constant, just plugged in to make some noise (sometimes it's helpful in some models, at least to rule out some things). And I also added something interesting: temperature. Here, let me ask you:
106 00:21:27.270 --> 00:21:41.540 Jacob Barhak: in colder states or warmer states, is the transmission higher? Think about it. If you're in New York or Michigan, versus Texas or Florida, where are the chances of you actually transmitting the disease higher?
107 00:21:42.560 --> 00:21:54.420 Jacob Barhak: Think about it. It matters, and we'll see the answer at the end. Think about the answer to yourselves; later on you'll get the answer from me when I show the results.
108 00:21:56.150 --> 00:22:13.710 Jacob Barhak: So I also took into account response models. With pandemics, it also matters how people behave. If you are afraid of the pandemic, and you stay at home and don't go anywhere, your chances of transmitting or getting the disease are much, much lower.
109 00:22:14.640 --> 00:22:15.860 Jacob Barhak: But then.
110 00:22:16.150 --> 00:22:40.989 Jacob Barhak: if you run around and ignore the disease, like many people did (which is model number 3 here, that says: oh, I don't care about COVID), then eventually you get COVID, and then you're at home and don't see anyone because you're at home. Or even worse, you are not at home, and you just ignore COVID and go around; it's worse. So different people behave differently; actually, different parts of society behave differently, just like with the infectiousness curve.
111 00:22:41.140 --> 00:22:42.859 Jacob Barhak: Each one behaves differently.
112 00:22:42.980 --> 00:22:54.039 Jacob Barhak: So now you have different models. I took 2 models from Apple mobility, with different variations on them. Apple mobility data says how many people looked at their phones
113 00:22:54.370 --> 00:23:03.850 Jacob Barhak: and pressed the map button. This indicates they want to go somewhere. It doesn't mean they went, but this is kind of an estimate of how much mobility they had.
114 00:23:03.970 --> 00:23:07.029 Jacob Barhak: So I used those as a base.
115 00:23:07.618 --> 00:23:13.241 Jacob Barhak: Also, later on came Eric Ferguson.
116 00:23:13.920 --> 00:23:20.750 Jacob Barhak: I hope I didn't butcher the name. He's from Montclair University. He did a study of US states
117 00:23:21.170 --> 00:23:28.679 Jacob Barhak: and what their shutdown orders were in the states. Each state
118 00:23:29.140 --> 00:23:43.180 Jacob Barhak: decided differently. So now you incorporate all of this into the model and say whether, you know, non-essential shops were closed, schools, stay-at-home orders, and so on and so forth,
119 00:23:43.360 --> 00:24:08.491 Jacob Barhak: and at different levels of compliance. I entered it as a formula into the model. It's a little bit more complex; I'm just giving you the idea of what's happening here. He later on published a good version; I used an older version, and I state exactly what the differences are from the newest version. But
120 00:24:09.370 --> 00:24:15.069 Jacob Barhak: this is how I now have different models of how the states behave,
121 00:24:15.180 --> 00:24:17.640 Jacob Barhak: and each State behaves differently, of course.
122 00:24:17.970 --> 00:24:28.839 Jacob Barhak: but I also have different models of those, and all of those are part of the mix of models that are playing around. Think about it: all of those models are roaming around and doing things.
123 00:24:29.090 --> 00:24:58.840 Jacob Barhak: Then came in, and this is fairly recent, a hospitalization model. I didn't have a hospitalization model. Meaning, people are in hospital: the number you count of people being infected is not a very good estimate, because, you know, you don't test everyone, and so on and so forth; there are errors there. But people who ended up in the hospital, you know they were hospitalized. So if you have a hospitalization model,
124 00:24:58.840 --> 00:25:05.610 Jacob Barhak: it kind of helps you out. Now, interestingly enough, not all states counted hospitalizations well,
125 00:25:05.690 --> 00:25:15.500 Jacob Barhak: but when they do, I get better information. So hospitalization models are something that Kyoti gave me. And then the question is:
126 00:25:15.720 --> 00:25:25.680 Jacob Barhak: at what frequency does a person get hospitalized if they get the disease? So he came up with 3 models: low probability, moderate probability, and high probability, and those depend on age,
127 00:25:25.990 --> 00:25:28.579 Jacob Barhak: as you can see here, and also
128 00:25:28.770 --> 00:25:37.303 Jacob Barhak: whether a person gets hospitalized early or later, meaning how much time it takes them to get hospitalized at each age.
129 00:25:37.880 --> 00:25:38.710 Jacob Barhak: when.
130 00:25:38.870 --> 00:25:56.340 Jacob Barhak: So again, you can run the simulation and find out, in summary, that if you take the average one, it's not the best one. Actually, the one with the lower probability won: hospitalization was not as high as we thought, at least according to the data and all of the other models that we found.
131 00:25:56.970 --> 00:25:59.740 Jacob Barhak: So all of this you take into account.
132 00:26:00.830 --> 00:26:07.310 Jacob Barhak: Finally, we have mortality models. People die of COVID. But then,
133 00:26:07.480 --> 00:26:19.582 Jacob Barhak: at what frequency? Again, what's the chance of dying, and when? So there's one type of modeling, saying we'll take the information published by the CDC.
134 00:26:20.410 --> 00:26:31.329 Jacob Barhak: This is more complicated; I'll just say that it's the mortality probability and the time, and it doesn't change much.
135 00:26:31.930 --> 00:26:32.720 Jacob Barhak: But
136 00:26:32.940 --> 00:26:52.660 Jacob Barhak: Filippo Castiglione actually did a model about how organs die from cells dying. This is a multi-scale model, because he was working at the level of cells, but later on tied it all the way up to the mortality of the person. So it's different levels of scale.
137 00:26:52.860 --> 00:27:19.839 Jacob Barhak: So this is why it's called a multi-scale model. And that model tells me, for each day after infection (infection is day 0), what's the chance of a person dying. So, for example, the chance of a 20-year-old or an infant dying on day 19, according to his model, is less than one per 1,000 if they get infected. But if you go to a 90-year-old,
138 00:27:20.460 --> 00:27:30.929 Jacob Barhak: on day 20 it's 1%; on day 19 it's a little bit less, also about 1%, but then it drops. This is according to his model.
139 00:27:31.150 --> 00:27:55.689 Jacob Barhak: So now, which one of those models is correct? You can mix them up and check it out, and this is what we do. But before we do this, we also have to account for the fact that the numbers we get are incorrect. How many people here, raise your hands, saw something saying that the numbers being shown are wrong, that someone is miscounting them? We all went through COVID.
140 00:27:55.810 --> 00:27:59.739 Jacob Barhak: Come on Upper. Did you ever.
141 00:28:00.060 --> 00:28:04.030 Jim Mccormack: I definitely did. Yeah, I can't raise my hand, but I could raise my voice.
142 00:28:04.270 --> 00:28:12.689 Jacob Barhak: Yeah. So we know that people claim the numbers are wrong. Some people think they're overestimated, some people think they're underestimated, correct?
143 00:28:12.960 --> 00:28:27.870 Jacob Barhak: Everyone had their own opinion, and we don't know what drives those opinions, but we can suspect. But it doesn't matter, really. For science, we have different numbers, and we don't trust them. How do we correct for that? We say, you know what,
144 00:28:28.190 --> 00:28:31.799 Jacob Barhak: we ask the question: what if it was a different number?
145 00:28:32.250 --> 00:28:39.130 Jacob Barhak: And what different number? For example, we know almost for sure that the number of infections that we have
146 00:28:41.430 --> 00:28:47.100 Jacob Barhak: is miscounted. It doesn't represent the proportion in the entire society,
147 00:28:47.711 --> 00:28:53.980 Jacob Barhak: not the probability, the proportion of people actually infected in society, because
148 00:28:54.300 --> 00:29:10.980 Jacob Barhak: the tests always have an error. It matters not only when you test the person, and the accuracy of the test, but also how you conduct the test, and
149 00:29:11.170 --> 00:29:27.340 Jacob Barhak: what your sample population is. All of those things matter. So we pretty much assume that the numbers, the number of infected people reported, are underreported, because there are those who never got tested; therefore their numbers didn't appear.
150 00:29:27.420 --> 00:29:42.970 Jacob Barhak: So now, by how much do we multiply them? Well, some people multiply by 5; this seems to be a running number that everyone in epidemiology multiplies by. I claimed: okay, let's try 20. If there's 5, let's also try 20. And
151 00:29:43.420 --> 00:29:49.230 Jacob Barhak: Lucas Böttcher, who gave me this model (he actually also gave me an infectiousness model),
152 00:29:49.770 --> 00:30:03.270 Jacob Barhak: he actually looked at it. He says you have to multiply by a factor of roughly 7 to 15. And then there's another model that's more complicated to explain; we'll leave it for now, it doesn't matter for now.
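The observer-model idea can be sketched as scoring candidate multipliers. The numbers below are made up for illustration only, and the function name is hypothetical: the simulation's "true" infection count is compared against the reported count scaled by each candidate underreporting multiplier, and the optimization keeps whichever fits best.

```python
def observer_error(model_infected, reported_infected, multiplier):
    """Discrepancy between the simulation's 'true' infection count and the
    reported count scaled by an assumed underreporting multiplier."""
    return abs(model_infected - multiplier * reported_infected)

# Toy numbers: the simulation claims 1,500 true infections; 200 were reported.
for m in (5, 7, 15, 20):
    print(m, observer_error(1500, 200, m))  # multiplier 7 fits best here
```

During optimization, these observer models compete alongside all the other models, so implausible correction factors lose weight just like implausible infectiousness curves.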
153 00:30:03.480 --> 00:30:04.600 Jacob Barhak: and then
154 00:30:06.180 --> 00:30:16.249 Jacob Barhak: The mortality is the same thing. Some people who died, you know, out of a car crash were listed as COVID; at least this is what was reported in some newspapers,
155 00:30:17.430 --> 00:30:27.399 Jacob Barhak: and then vice versa: some people who died of COVID were maybe written down as dying of something else, because of complications; we never know. So
156 00:30:27.770 --> 00:30:36.420 Jacob Barhak: you can actually model this, and this is what Lucas Böttcher did. He did it per state, and he gave me a bunch of numbers per state.
157 00:30:36.580 --> 00:30:49.569 Jacob Barhak: So now I'm running all of those models to make sure that whatever is being told is correct. So now we have the 2 numbers: the true number that the model knows, of how many people are infected, and the observed number, which
158 00:30:50.160 --> 00:30:58.770 Jacob Barhak: takes all of those and adjusts according to the real number. So the two numbers will be different.
159 00:30:59.020 --> 00:31:20.830 Jacob Barhak: So now let's look at the results. This takes a lot of computing time to actually do: it's about 3 years of computation on one CPU core. I use a 24-core machine, so it takes about 6 weeks to run the simulation that you see. I'm now running a much, much bigger simulation; I might show you the screen while it's happening later on.
160 00:31:22.470 --> 00:31:49.910 Jacob Barhak: Each state is represented by 10,000 individuals, and those 10,000 kind of interact with each other, and also with all of the equations that I showed you. And each equation comes in with a different weight. What I do is run all the simulations and then test how well or how badly the numbers match the reported numbers. So let's look at this.
161 00:31:50.140 --> 00:31:56.009 Jacob Barhak: It's huge; I cannot even show you the real results. This is a cut-down version,
162 00:31:56.860 --> 00:32:03.119 Jacob Barhak: because otherwise the file sizes become enormous at some point. But let me explain what you're seeing.
163 00:32:03.240 --> 00:32:09.579 Jacob Barhak: This is the population panel. You see here, for each state,
164 00:32:10.720 --> 00:32:27.540 Jacob Barhak: multiple circles; each circle represents one day in one simulation, and I start simulations again at different times. Because, remember, you have the timeline running,
165 00:32:28.190 --> 00:32:56.680 Jacob Barhak: and I start the simulations once on day 1 and then once on day 5, and then check whether day 10 in one, which means day 5 in the other, is the same. I include all of those together, because if I start all the simulations at the same time and some of the numbers are wrong, then I have a problem. So I have to start the simulation at different times in the pandemic, in different windows. Each time I run a window of 21 days,
166 00:32:56.780 --> 00:33:00.859 Jacob Barhak: and then after 21 days I check how well it matches the results.
167 00:33:01.300 --> 00:33:07.789 Jacob Barhak: I figured out that if I don't do those windows,
168 00:33:07.930 --> 00:33:14.660 Jacob Barhak: the results are not really good. So those windows really help stabilize the results, because everything becomes
169 00:33:15.080 --> 00:33:23.000 Jacob Barhak: much less sensitive to wrong numbers, to some of the errors that appear.
170 00:33:23.750 --> 00:33:37.199 Jacob Barhak: What happens here is that one circle, for example the selected circle: this is Kentucky, and the code 45 means that it starts 45 days after April 1st.
171 00:33:37.300 --> 00:34:03.680 Jacob Barhak: This is where the simulation starts, and then it runs for 21 days. So after 21 days you can look at the results here. It will give you the average age and stuff like this. But look at the numbers that say observed: for observed infected, you will see a number before the slash and a number after the slash; same thing for observed deaths, before the slash and after the slash. The number before is what the model tells you.
172 00:34:03.980 --> 00:34:14.260 Jacob Barhak: The number after is the actual number, as reported by the COVID Tracking Project, after, of course, being normalized to 10,000 people per state.
173 00:34:14.650 --> 00:34:20.170 Jacob Barhak: So this is out of 10,000 people. So you see the model is way off, like a factor of two here.
174 00:34:20.570 --> 00:34:24.529 Jacob Barhak: Now, the height of that circle
175 00:34:24.750 --> 00:34:47.079 Jacob Barhak: is the error. It's called the fitness score. It takes into account the differences between the infection model and the observed results, the death model and the observed results, and the hospitalization model and the observed results. They're all bundled together.
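The fitness score described here can be sketched as a combined discrepancy across the observed quantities. This is a deliberate simplification with made-up numbers; the real scoring in the talk bundles the terms in a more involved way.

```python
def fitness_score(sim, obs):
    """Combine the mismatch in infections, deaths, and hospitalizations into
    a single error; lower is better, 0 is a perfect match.
    Both dicts hold counts normalized to 10,000 people per state."""
    return sum(abs(sim[k] - obs[k]) for k in ("infected", "deaths", "hospitalized"))

# Made-up numbers per 10,000 people, like the before/after-slash pairs on screen.
simulated = {"infected": 180.0, "deaths": 4.2, "hospitalized": 11.0}
reported = {"infected": 90.0, "deaths": 3.8, "hospitalized": 9.5}

print(fitness_score(simulated, reported))  # the "height of the circle"
```

Gradient descent then drives this score down by re-weighting the models, which is what the shrinking circles in the panel show.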
176 00:34:47.300 --> 00:34:56.389 Jacob Barhak: And then this is being optimized using gradient descent, which is an AI technique that people are familiar with; this is the basis of all our neural networks.
177 00:34:56.739 --> 00:34:57.690 Jacob Barhak: So
178 00:34:58.200 --> 00:35:09.280 Jacob Barhak: Now I'm taking all of this, and I'm starting to optimize. Notice: the height of the circle is the error, and I'm trying to drop it down. Ideally, I want everything to be around 0.
179 00:35:09.810 --> 00:35:18.669 Jacob Barhak: And what am I actually optimizing? Let me explain what happens here. You see all those 5 blue ones: those are all infectiousness models,
180 00:35:18.890 --> 00:35:26.039 Jacob Barhak: those purple ones are transmission models. The green ones are the behavioral models.
181 00:35:26.060 --> 00:35:43.350 Jacob Barhak: the reddish or brownish ones and the yellow ones are the hospitalization models (you remember, we have time and probability), those are mortality models, and finally the purple ones are mortality observer models. All of those are being
182 00:35:43.350 --> 00:35:58.570 Jacob Barhak: optimized at the same time. So now I'm tweaking them and rerunning; there are variations. And then, little by little, you see, even after 3 iterations, one of the transmission models totally disappears. I'll tell you which one in a second.
183 00:36:00.610 --> 00:36:21.890 Jacob Barhak: And then it continues and continues, and you see some more dominant mortality models here; so some of the mortality models were not that good. And it continues and continues. Now you can build those curves I showed you at the beginning: you can actually figure out what the transmission was, how people behaved. You'll see the Apple mobility data disappeared completely.
184 00:36:22.656 --> 00:36:29.049 Jacob Barhak: Within the transmission models, one disappeared, and this one became dominant: those are the ones with temperature.
185 00:36:29.600 --> 00:36:35.140 Jacob Barhak: The one that says that colder states transmit more
186 00:36:35.590 --> 00:36:41.780 Jacob Barhak: is the one that is dominant, meaning: if you're in a hot state, you're better off than in a cold state,
187 00:36:42.110 --> 00:36:44.770 Jacob Barhak: because this almost disappeared completely.
188 00:36:46.390 --> 00:36:52.820 Jacob Barhak: Here, in the infectiousness ones, you see that the dominant models are the ones that were longer, not the ones that were shorter.
189 00:36:53.680 --> 00:37:02.510 Jacob Barhak: And finally, if you look at the mortality models, the one that Filippo Castiglione gave me is the dominant one,
190 00:37:02.750 --> 00:37:07.500 Jacob Barhak: meaning this model is actually better, if you look at it over time.
191 00:37:08.365 --> 00:37:30.270 Jacob Barhak: And the observer models tell you that some of the models don't make sense; like, don't multiply by 20, it's somewhere between 5 or 7 and 15, approximately. The people who said that 5 is the number by which you multiply the number of infections to get the correct number, those were generally correct.
192 00:37:30.440 --> 00:37:34.860 Jacob Barhak: Approximately; we can actually calculate the exact numbers, but this doesn't matter, because
193 00:37:35.070 --> 00:37:43.209 Jacob Barhak: it's enough to know what's wrong and what's not. And now we can answer questions about the pandemic.
194 00:37:43.530 --> 00:37:53.749 Jacob Barhak: And I've written all those things here. But let's talk a little bit about conclusions. I'll just conclude everything, in case you might have questions,
195 00:37:54.540 --> 00:37:55.809 Jacob Barhak: and I don't want to go
196 00:37:56.190 --> 00:38:00.710 Jacob Barhak: too much over time. I want to keep it short; otherwise it's just me talking, and I want to hear you.
197 00:38:01.224 --> 00:38:06.839 Jacob Barhak: So the idea is that this model can help the government in the next pandemic.
198 00:38:08.740 --> 00:38:09.600 Jacob Barhak: Now
199 00:38:09.710 --> 00:38:20.649 Jacob Barhak: I'm telling you this, but I am biased, because I developed it. This has been in development since about 2012; I invested all my time and effort and resources into this,
200 00:38:21.990 --> 00:38:30.209 Jacob Barhak: quite a bit. I've been doing this for many years on my own. I'm a sole proprietor now, meaning I'm a company of one person in the US.
201 00:38:31.730 --> 00:38:38.319 Jacob Barhak: It's a form of explainable artificial intelligence, because I can explain things to you, as you saw.
202 00:38:38.460 --> 00:38:43.289 Jacob Barhak: So it's AI, but the explainable type, as I'm showing you.
203 00:38:43.380 --> 00:39:08.519 Jacob Barhak: Now, the question is how good it is. I can tell you for sure: it is difficult to explain phenomena like COVID-19, because there are many, many parameters, and the question is how much time you need. For the next pandemic, if someone comes to me and asks me how good this tool will be, I tell them: well, you have to have at least 3 weeks of data after you have some sort of
204 00:39:08.560 --> 00:39:32.929 Jacob Barhak: infection going on. I started modeling in April 2020, so you need at least a few weeks of data. But this is actually not quite true, because 3 weeks of data will initially give you different results; you don't have enough numbers. You have to go forward for at least several months and then do the windows I'm talking about, and then the numbers start stabilizing.
205 00:39:33.200 --> 00:40:00.399 Jacob Barhak: I'm still running a big simulation here to make sure those numbers are correct, because I'm doing Monte Carlo simulations, where I flip all those coins, and I'm making sure that I throw enough computing power at it to actually make it useful, and that I didn't make any mistakes. Every once in a while I find something that was wrong that I need to correct; this is why there are multiple versions. But in the future you'll need at least a few months of data,
206 00:40:00.940 --> 00:40:11.729 Jacob Barhak: plus you need one of those windows. But maybe you'll get initial results after a month or 2, at least in some sense, and then later on you'll get more,
207 00:40:11.860 --> 00:40:21.439 Jacob Barhak: and you go with it, so the government will not be completely clueless like it was in this pandemic, because now we find out how clueless they were.
208 00:40:23.490 --> 00:40:34.629 Jacob Barhak: Now I can tell you that the peak average infectiousness is at about day 5 from infection. This is for COVID. And the transmission rate is about 2% per encounter, a little bit less.
209 00:40:34.750 --> 00:40:38.140 Jacob Barhak: Warm weather seems to reduce transmission.
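The coin-flipping Monte Carlo that Jacob describes, using the two numbers he quotes (infectiousness peaking around day 5, roughly 2% transmission per encounter), could be sketched like this. This is a toy illustration, not the Reference Model's actual code; the population size, encounter rate, and the triangular infectiousness profile are invented for the example:

```python
import random

# Parameters quoted in the talk; everything else below is made up.
PEAK_DAY = 5            # infectiousness peaks about day 5 after infection
BASE_TRANSMISSION = 0.02  # roughly 2% per encounter at the peak

def infectiousness(days_since_infection):
    """Toy triangular profile: rises to 1.0 at PEAK_DAY, zero after day 14."""
    if days_since_infection <= 0 or days_since_infection >= 14:
        return 0.0
    if days_since_infection <= PEAK_DAY:
        return days_since_infection / PEAK_DAY
    return (14 - days_since_infection) / (14 - PEAK_DAY)

def simulate(population=1000, initial_infected=10, days=21,
             encounters_per_day=8, seed=0):
    """Flip a coin for every encounter, Monte Carlo style."""
    rng = random.Random(seed)
    # day_infected[i] is the day person i was infected, or None if never.
    day_infected = [0 if i < initial_infected else None
                    for i in range(population)]
    for day in range(1, days + 1):
        infected = [i for i, d in enumerate(day_infected) if d is not None]
        for i in infected:
            p = BASE_TRANSMISSION * infectiousness(day - day_infected[i])
            for _ in range(encounters_per_day):
                j = rng.randrange(population)
                if day_infected[j] is None and rng.random() < p:
                    day_infected[j] = day
    return sum(d is not None for d in day_infected)

print(simulate())  # total ever infected after 21 simulated days
```

Because each run flips random coins, stable estimates come from averaging many repetitions with enough computing power, which is exactly the burden Jacob describes.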
210 00:40:38.520 --> 00:40:52.450 Jacob Barhak: Now, this is something important. Today we published the paper: lessons learned from COVID-19 for modeling COVID-19, and steps to take in the next pandemic. Here, I'll show you the paper; it will load. It's now in preprint.
211 00:40:53.000 --> 00:41:06.579 Jacob Barhak: We have multiple collaborators, and they actually came from different perspectives. I did the Reference Model, but they did other models, and figured out how to do other things better,
212 00:41:06.710 --> 00:41:24.230 Jacob Barhak: and we wrote the paper, which says how to do modeling better, how to spread information better and get it correct, how to validate information properly; and we also have some recommendations about infrastructure and education that will help.
213 00:41:24.400 --> 00:41:33.289 Jacob Barhak: So if you're interested, please go and read the paper. It's in preprint. You can get it by following this link number 53 here.
214 00:41:34.960 --> 00:41:59.159 Jacob Barhak: Everything I showed you here can be reproduced, to some degree. I'm showing exactly what data I used, so I can trace back, if someone ever asks me about the presentation, where I got the numbers from. The code that actually created the presentation you can find on GitHub; I actually released it, but not all of the data. Some things are proprietary. I have a conflict of interest statement here,
215 00:41:59.160 --> 00:42:05.399 Jacob Barhak: because I do intend to get money out of it. I have 2 US patents on this technology.
216 00:42:05.871 --> 00:42:12.049 Jacob Barhak: I'm now licensing them. If you're interested, or know someone who's interested, please do connect us.
217 00:42:12.280 --> 00:42:32.328 Jacob Barhak: And I have many, many people and many organizations to thank. People gave me all sorts of help: the presentation technology, ideas, help finding resources, and connections to people who helped.
218 00:42:32.930 --> 00:42:43.334 Jacob Barhak: People hosted my computer, published the video; you can actually see this, this is the video that's embedded, you can look at it later. It was published at SciPy.
219 00:42:44.108 --> 00:42:59.230 Jacob Barhak: People contributed models; I'm showing some of the other work they did here. People contributed money to actually run simulations in the cloud. I ran several simulations in the cloud which, instead of several weeks or months, ran in 2 days.
220 00:42:59.850 --> 00:43:21.759 Jacob Barhak: You don't always have money for that. So Rescale gave me credits for Azure and Amazon, and MIDAS gave me some money; I ran a simulation on Google Cloud with them, they gave it through some grant. And many, many people I have to thank for various ideas and things. So thank you all, and I'm open for questions.
221 00:43:22.170 --> 00:43:23.910 Jacob Barhak: underneath this.
222 00:43:28.300 --> 00:43:33.410 Jim Mccormack: So, Jacob, have you been able to use any of it on like bird flu, or any other pandemics that
223 00:43:34.160 --> 00:43:37.540 Jim Mccormack: are starting or rumored to be the next pandemic?
224 00:43:37.780 --> 00:43:50.169 Jacob Barhak: Currently I'm focusing on COVID, because, believe it or not, this technology is still in the development phase. You show me any technology that you've invested that many years in.
225 00:43:50.360 --> 00:43:53.360 Jacob Barhak: Is it good enough after that many years?
226 00:43:53.900 --> 00:43:58.020 Jacob Barhak: This is what it takes. It's about 20 years of development.
227 00:43:59.560 --> 00:44:07.305 Jacob Barhak: You tell me. So I'm still making sure, sorting out all the bits and pieces,
228 00:44:07.920 --> 00:44:20.389 Jacob Barhak: so for such technologies you need many more resources. I'm not talking about, you know, some game that blows up, that people use, or something that is fairly well tested.
229 00:44:20.980 --> 00:44:40.410 Jacob Barhak: Sometimes those are not really tested either; they just blow up virally. Here, to be sure that this technology works, you actually have to invest a lot of time. Now, the big problem with all of the data, which is a different project that I'm working on, is the fact that you cannot get medical data.
230 00:44:40.620 --> 00:44:57.669 Jacob Barhak: One of the advantages of this project is that it can actually merge data from multiple sources. This is practically not allowed in the medical world, because if you have population A and population B, in the medical world you are not allowed to merge the data between those 2.
231 00:44:58.220 --> 00:45:14.120 Jacob Barhak: No, no, because it's patient data. So the individual data is not allowed. But you're allowed to merge the models. And this is what I'm doing. This is why this technology is important, not only for the pandemics I'm doing. I'm doing it on the pandemics, because there I have good data.
232 00:45:14.220 --> 00:45:37.740 Jacob Barhak: The other model I have is the diabetes model. Today, as far as I know, I have the most validated diabetes model worldwide, because I tested it with more populations than anyone else. How did I do it? I connected to clinicaltrials.gov and got the information from there. But the thing is, even the data in clinicaltrials.gov is not that good. Here, I'll show you; that's another project that I'm working on.
233 00:45:38.060 --> 00:45:49.700 Jacob Barhak: Even getting the data out of those trials is nearly impossible, because the units of measure are messed up, even if you do everything correctly. I'm going to show you just one thing.
234 00:45:50.650 --> 00:45:57.069 Jacob Barhak: It will take a minute to load. This is a website; it's actually active, you can check it out. For example, here: HbA1c is a measure of diabetes.
235 00:45:57.900 --> 00:46:05.770 Jacob Barhak: So see how many ways people write HbA1c and its units of measure. A computer cannot understand it.
236 00:46:06.330 --> 00:46:18.110 Jacob Barhak: So you need AI to tell it what it's supposed to be. That's a different project I'm doing, which is a spin-off of this project; and actually, there are some claims in the patents that relate to this project as well. So,
237 00:46:18.290 --> 00:46:29.200 Jacob Barhak: eventually, to get the disease models right, you need correct data. Messing around with the data is the biggest problem.
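The HbA1c spelling-and-units problem Jacob shows can be illustrated with a small normalization sketch. The variant spellings and the canonical names below are hypothetical examples chosen for the illustration, not taken from clinicaltrials.gov or from his spin-off project:

```python
import re

# Hypothetical canonical unit table; a real registry would need far more entries.
CANONICAL_UNITS = {
    "%": "percent",
    "percent": "percent",
    "percentage": "percent",
    "mmol/mol": "mmol/mol",
    "mmol per mol": "mmol/mol",
}

def normalize_measure(name, unit):
    """Collapse spelling variants of 'HbA1c' and its units to one canonical form."""
    # Strip punctuation and spacing so 'Hba, one C', 'HbA1c', 'HBA1C' all match.
    key = re.sub(r"[^a-z0-9]", "", name.lower()).replace("one", "1")
    if key != "hba1c":
        raise ValueError(f"unrecognized measure: {name!r}")
    canonical_unit = CANONICAL_UNITS.get(unit.strip().lower())
    if canonical_unit is None:
        raise ValueError(f"unrecognized unit: {unit!r}")
    return "HbA1c", canonical_unit

print(normalize_measure("Hba, one C", "Percentage"))  # ('HbA1c', 'percent')
```

A lookup table like this only covers the variants someone has already seen, which is why the talk argues the open-ended cases need AI rather than hand-written rules.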
238 00:46:29.550 --> 00:46:34.450 Jacob Barhak: So you're saying bird flu, because everyone says bird flu. But do you have good data about bird flu?
239 00:46:35.380 --> 00:46:39.609 Jacob Barhak: Well, when we started with COVID, we didn't have good data either.
240 00:46:39.720 --> 00:46:54.809 Jacob Barhak: And this is why we had this project running, and this is where the recommendations come in: how to get good data in the future, and how to do it so that models can use it and actually get the results. It took me about 5 years to get to something stable, which I'm showing you now.
241 00:46:55.100 --> 00:47:16.629 Jacob Barhak: And I started at the beginning of the pandemic. In the next pandemic, if it takes one year it will be better, and then 3 months, or 2 months, or 2 weeks, and it will get better and better. But to get to that point we need an entire well-lubricated system that gives us all the things that we need: all the correct data and all the correct models, and so on and so forth.
242 00:47:16.630 --> 00:47:25.530 Jacob Barhak: This is still a lot of work. The big systems are not yet set up for it, and hopefully this paper will be helpful in this regard.
243 00:47:25.670 --> 00:47:30.479 Jacob Barhak: And think about how much money was spent on COVID, and how many models are out there.
244 00:47:30.890 --> 00:47:51.490 Jacob Barhak: Think about some big machine in the future that crunches all of those assumptions that people plug in, and tells them: oh, this assumption is probably incorrect, it doesn't match this data or that data. This is what my technology does. And we do have the computing power today, but we do need the software infrastructure, and all those many years of work that I have only started.
245 00:47:51.660 --> 00:47:53.450 Jacob Barhak: Did I answer your question, Jim?
246 00:47:53.710 --> 00:48:09.379 Jim Mccormack: Yeah, yeah, and very well. So, Jacob, if I restate it right: it may not be directly translatable to bird flu, but the lessons learned and the prep work will get us to those answers faster, using this example and this work.
247 00:48:09.530 --> 00:48:28.879 Jacob Barhak: I didn't try it on bird flu. If you give me data on bird flu, I can try it. But then I'll tell you: oh, this is missing, and this is missing, and I need all those assumptions. And then you'll start finding all those researchers, and even finding researchers to collaborate, to give you all those models, also takes time. All of this has to be centralized, in a way that
248 00:48:29.370 --> 00:48:37.410 Jacob Barhak: the system wasn't during COVID. Everything was kind of a mess. I wasn't part of that mess; I was trying to find things, and I couldn't find them.
249 00:48:37.900 --> 00:48:45.199 Jacob Barhak: I believe I was stressed, because I had been working on this, at that point, for like 15 years, and I still was
250 00:48:45.860 --> 00:49:01.209 Jacob Barhak: feeling that, you know, I have everything I need, all the bits and pieces. Now we're trying to formalize it and give recommendations on how to do it better in the future. The idea is that someone who is actually interested
251 00:49:01.210 --> 00:49:20.360 Jacob Barhak: will look at it, learn from it, and then start training groups that will do those things. I have a friend who actually has some good ideas; his name is John Rice, and he was actually the instigator of this paper, saying: you know, what did you learn from all this work that you did on COVID? So we organized it all, in a way that in the next pandemic
252 00:49:20.530 --> 00:49:29.019 Jacob Barhak: there won't be such a problem. And we're now propagating this paper. If you're interested in helping, take this paper and send it to some of your friends.
253 00:49:29.910 --> 00:49:32.570 Gabor Szabo: There's another question I see in the chat.
254 00:49:35.170 --> 00:49:36.230 Gabor Szabo: Can I read it out?
255 00:49:37.783 --> 00:49:38.549 Jacob Barhak: Yeah. Please.
256 00:49:38.730 --> 00:49:44.720 Gabor Szabo: For type 1 diabetes modeling: isn't that a genetic disease? Autoimmune?
257 00:49:45.690 --> 00:49:46.115 Jacob Barhak: Well.
258 00:49:47.190 --> 00:49:55.940 Jacob Barhak: Type 1 diabetes is different from type 2; they have different mechanisms. I'm not a medical doctor, so I won't go into those.
259 00:49:56.540 --> 00:50:05.610 Jacob Barhak: It was explained to me many times. While I was doing type 2 diabetes, I was working with a team of experts, worldwide experts in diabetes.
260 00:50:06.670 --> 00:50:12.190 Jacob Barhak: I'm less concerned about the type of disease or what it is.
261 00:50:12.350 --> 00:50:24.139 Jacob Barhak: Diseases for me are state transition models, where you jump from one state to another, and there's a probability of moving there, and the probability depends on all sorts of parameters.
262 00:50:24.380 --> 00:50:25.230 Jacob Barhak: So
263 00:50:25.480 --> 00:50:42.630 Jacob Barhak: about the source of the disease or the cures, I don't care much. I just want to be able to explain it. And how do I explain it? If I have a model that says A, and then a model that says B, I want to know which one of those contributes more to the numbers I see at the end.
264 00:50:42.810 --> 00:50:52.719 Jacob Barhak: The way I look at diseases, it's all numbers; and the people who understand all of the elements, they are the ones making the models.
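The state-transition view of disease that Jacob describes can be sketched as a tiny Markov-style model: discrete states, with per-step jump probabilities that depend on a parameter. The states and all the probabilities below are invented for illustration; they are not from his diabetes model:

```python
import random

STATES = ("healthy", "sick", "dead")

def transition_probs(state, age):
    """Per-step transition probabilities out of `state`, parameterized by age.
    All coefficients here are made-up toy values."""
    if state == "healthy":
        p_sick = 0.01 + 0.0005 * age
        return {"healthy": 1 - p_sick, "sick": p_sick, "dead": 0.0}
    if state == "sick":
        p_dead = 0.001 + 0.0002 * age
        return {"healthy": 0.05, "sick": 1 - 0.05 - p_dead, "dead": p_dead}
    return {"dead": 1.0}  # 'dead' is an absorbing state

def step(state, age, rng):
    """Draw the next state from the transition distribution."""
    r = rng.random()
    cumulative = 0.0
    for next_state, p in transition_probs(state, age).items():
        cumulative += p
        if r < cumulative:
            return next_state
    return state  # guard against floating-point rounding

# One simulated 50-year trajectory for a person starting healthy at age 40.
rng = random.Random(1)
state = "healthy"
for year in range(50):
    state = step(state, age=40 + year, rng=rng)
print(state)
```

Running many such trajectories and comparing the end-state counts against observed population numbers is the sense in which competing models A and B can be scored against each other.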
265 00:50:53.240 --> 00:50:56.230 Jacob Barhak: So, does that answer your question?
266 00:51:00.390 --> 00:51:07.479 Jacob Barhak: Okay, so this is why it is. And I'll try to send you the link for that.
267 00:51:08.800 --> 00:51:09.860 Jacob Barhak: Here we go.
268 00:51:10.650 --> 00:51:31.700 Jacob Barhak: This is the link for that paper, the lessons learned paper. So if you know people who are interested, please propagate it; this is important. Hopefully some governments or some organizations will adopt it, and the next pandemic will have less mess than we had in this pandemic. By the way, when the pandemic started, everyone was doing COVID.
269 00:51:31.950 --> 00:51:45.610 Jacob Barhak: Really. Like, every department, every institution. Financial institutions were running COVID models, computation institutions were doing COVID. Everyone
270 00:51:45.860 --> 00:51:53.350 Jacob Barhak: had a COVID model. Now, even finding people who model COVID is kind of hard,
271 00:51:53.520 --> 00:52:16.299 Jacob Barhak: because they all stopped doing it; it's less interesting. But for me it's still very interesting, because I really am dedicated to it, and this is my life's work. I've been working on this for almost 20 years, and I want this to continue, and to be done properly. So this is why I'm giving this talk. And Gabor, thank you for having me.
272 00:52:17.117 --> 00:52:26.700 Gabor Szabo: Thank you for giving this talk, this presentation. Anyone, any more questions before we
273 00:52:27.180 --> 00:52:29.009 Gabor Szabo: close the video?
274 00:52:31.430 --> 00:52:40.860 Jacob Barhak: If you have Python questions on how I did this or that with Python, I can answer those too; I'm running many, many things. Actually, maybe it's a good time to show you.
275 00:52:41.020 --> 00:52:42.370 Jacob Barhak: You see
276 00:52:43.590 --> 00:53:03.189 Jacob Barhak: this thing, hopefully I won't disconnect everything. This is a screen of a computer; behind it is a 24-core machine, a very good processor. It runs as fast as older supercomputers. 32 cores, 32 threads. And this simulation now is
277 00:53:04.120 --> 00:53:04.890 Jacob Barhak: team.
278 00:53:04.890 --> 00:53:07.520 Gabor Szabo: If you stop the screen sharing you'll see better.
279 00:53:08.430 --> 00:53:10.879 Jacob Barhak: Oh, oh, okay, I'll stop screen share.
280 00:53:11.850 --> 00:53:12.820 Jacob Barhak: Second.
281 00:53:16.910 --> 00:53:19.340 Jacob Barhak: how do I stop the screen? Share?
282 00:53:21.328 --> 00:53:25.809 Jacob Barhak: I think it shows me screen share. How do I stop it? It says only "share."
283 00:53:26.120 --> 00:53:27.730 Jacob Barhak: Did I stop the share?
284 00:53:28.380 --> 00:53:28.910 Jacob Barhak: No.
285 00:53:28.910 --> 00:53:29.850 Gabor Szabo: Oh, not yet.
286 00:53:30.770 --> 00:53:31.900 Jacob Barhak: Give me a second.
287 00:53:35.300 --> 00:53:41.260 Jacob Barhak: I'm not sure how to stop the share in this mode. Oh, oh, okay, thank you.
288 00:53:41.460 --> 00:54:08.409 Jacob Barhak: You see, this screen is actually a computer that runs the same simulation. You see, now it's around here; this is iteration 17. It should get to about 40 to get something approximately stable. This is my current baseline; I've seen the simulation stabilize around 40, but this one is much, much bigger. Here I start the simulation on each day, and I run 5 repetitions
289 00:54:08.700 --> 00:54:15.240 Jacob Barhak: for all of this. I'm actually running about 2 months of days, for which I have information, from April to June,
290 00:54:15.260 --> 00:54:43.210 Jacob Barhak: and each time I start on a different day and run for 21 days, and check the numbers for 5 simulations for each state, and then I continue doing this, state by state. This will take me about half a year. I started a few months ago, and I had some power problems and computer problems and so on. I actually burned out computers on this, I kid you not; I have multiple computers dead because of running all those simulations, all the time, for many, many years.
291 00:54:43.340 --> 00:54:52.760 Jacob Barhak: So I started with small clusters. I created clusters; I use Dask to create the clusters, and it still runs with Dask. You cannot
292 00:54:53.020 --> 00:54:54.450 Jacob Barhak: see the best. This is.
293 00:54:56.670 --> 00:54:59.559 Gabor Szabo: Yeah, I can't. We can't really see that. No? Well.
294 00:55:00.037 --> 00:55:04.800 Gabor Szabo: apologies, I cannot make it much, much closer, very bright.
295 00:55:05.370 --> 00:55:09.679 Jacob Barhak: I apologize. This is, this is the best I can do, anyway.
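The Dask setup Jacob describes above, one run per start day with 5 repetitions each, farmed out across a many-core machine, could be sketched roughly like this. The `run_simulation` function is a hypothetical placeholder, not his actual simulation code:

```python
def run_simulation(start_day, repetition, horizon=21):
    """Placeholder for one 21-day simulation run.
    Returns a dummy (run id, horizon) pair instead of real results."""
    return (start_day * 1000 + repetition, horizon)

if __name__ == "__main__":
    # `dask.distributed` is imported lazily so the function above is
    # usable even where Dask is not installed.
    from dask.distributed import Client, LocalCluster

    # On one machine this uses the local cores; the same client code can
    # point at a multi-node scheduler instead of a LocalCluster.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)

    futures = [
        client.submit(run_simulation, day, rep)
        for day in range(60)   # roughly 2 months of start days
        for rep in range(5)    # 5 repetitions each, as in the talk
    ]
    results = client.gather(futures)
    print(len(results))  # 300 completed runs

    client.close()
    cluster.close()
```

The appeal of this pattern is that the embarrassingly parallel outer loop (days times repetitions) maps directly onto `client.submit`, so the same script scales from a laptop to the cloud runs mentioned earlier in the talk.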
296 00:55:10.630 --> 00:55:16.720 Gabor Szabo: Oh, thank you very much again, and thank you everyone for being here and for watching.
297 00:55:16.830 --> 00:55:27.079 Gabor Szabo: I mean, you shared the links, so people can find you; they will be under the video. There will be a link for all these things.
298 00:55:28.670 --> 00:55:30.450 Gabor Szabo: Like the video,
299 00:55:30.740 --> 00:55:38.570 Gabor Szabo: follow the channel, share the video, and talk to Jacob if you are interested in discussing this topic.
300 00:55:38.870 --> 00:55:40.539 Jacob Barhak: Please. Thank you very much.
301 00:55:40.540 --> 00:55:41.120 Jim Mccormack: Thank you.
302 00:55:41.120 --> 00:55:41.760 Gabor Szabo: Right.
303 00:55:42.730 --> 00:55:43.610 Jacob Barhak: Bye, bye.
In my seven years as a software developer, I've primarily worked in teams composed solely of developers. However, my recent transition to a team of security researchers has opened my eyes to a crucial aspect that often goes unnoticed: log safety in applications.
My exposure to the application security ecosystem and to real-life security breach analysis has taught me to recognize code security issues, including the prevalence of sensitive information (tokens, passwords, and payment details) in plaintext logs. This can lead to severe data breaches, financial losses, and all kinds of catastrophes.
This talk will dive into the fatal mistakes developers often make that can result in the disclosure of sensitive information in logs. We'll explore the types of sensitive data in logs.
I'll share my personal experiences as a developer on a security research team and shed light on the often-overlooked consequences of insecure logging practices. We'll discuss practical patterns to safeguard sensitive information in Python applications, including identifying and redacting sensitive data before it reaches log files, and implementing secure logging practices.
By the end of this talk, developers will be equipped with the knowledge and tools to protect sensitive data from accidental disclosure and safeguard their applications from the perils of sensitive data exposure. Embrace the journey towards log safety and ensure your code remains secure and confidential.
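The redaction pattern the abstract mentions, scrubbing sensitive data before it reaches the log, can be sketched with a standard `logging.Filter`. The regex patterns below are illustrative only; a real application would need patterns tuned to its own token and card formats:

```python
import logging
import re

# Illustrative patterns; real deployments need their own secret formats here.
REDACT_PATTERNS = [
    (re.compile(r"(?i)(password|token|secret)=\S+"), r"\1=[REDACTED]"),
    (re.compile(r"\b\d{13,16}\b"), "[CARD-REDACTED]"),  # naive card-number match
]

class RedactingFilter(logging.Filter):
    """Scrub sensitive substrings from a record before it is emitted."""
    def filter(self, record):
        message = record.getMessage()
        for pattern, replacement in REDACT_PATTERNS:
            message = pattern.sub(replacement, message)
        record.msg = message
        record.args = ()  # message is already fully formatted
        return True

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("login ok, token=abc123 card 4111111111111111")
# emits: login ok, token=[REDACTED] card [CARD-REDACTED]
```

Attaching the filter to the handler means every record passing through that handler is scrubbed, regardless of which module logged it, which is what makes this safer than asking each call site to remember to redact.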
