examples/fix-speed/calculate.py
import sys

def run(X, Y):
    total = 0
    for x in range(X):
        for y in range(Y):
            width, height = read_config()
            total += x*width + y*height
    print(total)

def read_config():
    import csv
    with open('config.csv', newline='') as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            return int(row['width']), int(row['height'])

if __name__ == "__main__":
    if len(sys.argv) != 3:
        exit(f"Usage: {sys.argv[0]} X Y")
    X = int(sys.argv[1])
    Y = int(sys.argv[2])
    run(X, Y)
Can you suggest how to improve the speed of this code?
To try it, create a file called config.csv with this content:
width,height
23,19
and then run
time python3 calculate.py 1000 500
On my computer this takes 5 seconds to run.
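The slowdown is almost certainly the read_config() call inside the inner loop: it re-opens and re-parses config.csv X*Y times (500,000 times for the arguments above). A sketch of the fix, reading the configuration once and passing the loop-invariant values into run():

```python
import csv

def read_config():
    # Read width/height from the first data row of config.csv.
    with open('config.csv', newline='') as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            return int(row['width']), int(row['height'])

def run(X, Y, width, height):
    # width and height are now loop-invariant: no file I/O inside the loops.
    total = 0
    for x in range(X):
        for y in range(Y):
            total += x * width + y * height
    return total
```

The __main__ block then becomes `width, height = read_config()` followed by `print(run(X, Y, width, height))`: one file read instead of 500,000. The sum could also be computed in closed form, but hoisting the I/O out of the loop is the change that accounts for the 5 seconds.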
Nikita writes:
The microservice processed all requests between different clients and DDB. In addition, during the transfer period, both RDS and DDB were supported before the full switch to DDB. I can talk about the general approaches I used to build this microservice, how I worked with the legacy code, monitoring, and what the outcome was. I will also give a summary of all the pros and cons I faced and things you could do better from the beginning.
I'm a full-stack developer currently working at Check Point in Tel Aviv. My stack is Angular + Python (Flask, FastAPI). I'm also interested in web accessibility.
1 00:00:02.040 --> 00:00:31.269 Gabor Szabo: Hello, and welcome to the Code Maven channel and the Code Maven meetup group. My name is Gabor Szabo. I teach Python, I help companies with Python and Rust, and I also organize these events online because I think it's a very useful platform to share knowledge. And I'm really happy that Nikita agreed to give this presentation. Hello, Nikita.
2 00:00:31.270 --> 00:00:32.040 Nikita Barysheva: Hi!
3 00:00:32.250 --> 00:00:33.379 Nikita Barysheva: Nice to meet you all.
4 00:00:33.380 --> 00:00:35.440 Gabor Szabo: And sorry, and
5 00:00:36.170 --> 00:01:03.490 Gabor Szabo: To those people who are present, thank you for joining the meeting; you can freely ask questions in the chat. And if you're watching the video on YouTube, then please like the video, follow the channel, and join our meetup groups, so you will be notified when we have new meetings and new events. With that said, Nikita, it's your turn. Please introduce yourself and give your presentation.
6 00:01:04.170 --> 00:01:13.930 Nikita Barysheva: Yeah, sure, I'll start sharing my screen, and then probably I'll start presenting. One second... okay.
7 00:01:14.770 --> 00:01:20.399 Nikita Barysheva: Hi, everyone, once again. My name is Nikita, and I'm a software developer at the Check Point company right now.
8 00:01:20.590 --> 00:01:32.040 Nikita Barysheva: And today I want to talk about one of the things I had in my previous experience, when we decided to switch from RDS to DynamoDB
9 00:01:32.160 --> 00:01:45.890 Nikita Barysheva: for our users table: how we thought about it, what the overall architecture was, and how we built a microservice that helped us to make this switch.
10 00:01:46.826 --> 00:02:05.609 Nikita Barysheva: We'll cover different topics. We will talk about general differences between RDS and DynamoDB, and I will highlight some main things that might make you think about why to switch from one database to another, or that will help you understand our motivation behind it.
11 00:02:05.740 --> 00:02:17.010 Nikita Barysheva: And we will go over the architecture of the microservice that we built, and I'll give you some examples there, and we can talk about it in more detail if you want.
12 00:02:17.700 --> 00:02:23.740 Nikita Barysheva: So let's have a quick overview. I don't know if
13 00:02:23.990 --> 00:02:42.469 Nikita Barysheva: everyone is familiar with DynamoDB or RDS and what the differences are. But the main ones: DynamoDB is a key-value NoSQL database. It's fully managed by AWS, and it's very good for applications with low latency and flexible data models.
14 00:02:42.850 --> 00:02:48.559 Nikita Barysheva: RDS, on the opposite side, is a SQL database, and we have
15 00:02:48.830 --> 00:02:59.830 Nikita Barysheva: predefined schemas. It's also managed by AWS, but the difference from DynamoDB is that here you really have to invest more
16 00:03:00.110 --> 00:03:13.869 Nikita Barysheva: knowledge into setting up things. And for sure, if you have complex queries and joins and so on, this is better for your solution.
17 00:03:15.296 --> 00:03:26.219 Nikita Barysheva: When you decide which database to use, you will probably look at several things, like scalability, performance, availability.
18 00:03:26.410 --> 00:03:28.730 Nikita Barysheva: And here I present some
19 00:03:29.070 --> 00:03:36.149 Nikita Barysheva: basic stuff about the differences for each of them, DynamoDB and RDS. But overall, saying it again:
20 00:03:36.160 --> 00:04:03.969 Nikita Barysheva: for scalability, we know that DynamoDB automatically scales horizontally, and that really helps to manage a large amount of traffic without any intervention. At the same time, RDS scales vertically: it has to increase the instance size, and this increase also takes time, and there might be some gaps in performance because of that.
21 00:04:03.990 --> 00:04:05.570 Nikita Barysheva: And as
22 00:04:05.760 --> 00:04:30.199 Nikita Barysheva: for performance, it really depends on the type of instance you chose and the storage. As I said before, you really need to know what you are doing there and how you're setting it up, because if you don't do it properly, you might have some slowness, or the database won't be available, something will be down, and users won't be happy.
23 00:04:30.250 --> 00:04:37.455 Nikita Barysheva: And as for availability, we found out, basically for ourselves,
24 00:04:38.560 --> 00:04:46.200 Nikita Barysheva: a big difference for DynamoDB: there is a thing that you can activate called global tables.
25 00:04:46.390 --> 00:04:50.660 Nikita Barysheva: It's a multi-region, multi-master database solution.
26 00:04:50.790 --> 00:05:02.400 Nikita Barysheva: RDS, for its part, supports, let's call it, multi-AZ, multi availability zones: it replicates the data across different availability zones, but in the same region.
27 00:05:03.740 --> 00:05:09.610 Nikita Barysheva: And another thing that you will have to consider, that will be interesting for you, is
28 00:05:10.090 --> 00:05:11.680 Nikita Barysheva: cost considerations.
29 00:05:13.510 --> 00:05:28.390 Nikita Barysheva: DynamoDB has, let's say, dynamic pricing, while RDS cost will increase as you scale vertically (larger instances) or horizontally (read replicas). RDS also provides on-demand pricing,
30 00:05:28.490 --> 00:05:38.600 Nikita Barysheva: but still, if you chose an instance with a specific storage size (I don't remember the exact tiers there, but
31 00:05:38.830 --> 00:05:43.869 Nikita Barysheva: let's say 60 gigs), you will have to pay for 60 gigs even if you use only 20 of them.
32 00:05:44.190 --> 00:06:03.179 Nikita Barysheva: Next, efficient high-traffic handling: DynamoDB is really optimized for high-traffic scenarios, and it doesn't require read replicas. It can handle millions of requests per second with the architecture that AWS provides to us.
33 00:06:03.480 --> 00:06:12.200 Nikita Barysheva: We all want to have, let's say, predictable cost, and because of that,
34 00:06:12.560 --> 00:06:18.709 Nikita Barysheva: another benefit of DynamoDB is that it offers 2 capacity modes. With provisioned capacity,
35 00:06:18.910 --> 00:06:24.920 Nikita Barysheva: when you set up the database, you basically have to say how many reads and writes,
36 00:06:26.620 --> 00:06:34.859 Nikita Barysheva: what is the bar, let's say, for your database. Or you can use on-demand,
37 00:06:35.010 --> 00:06:53.600 Nikita Barysheva: which will automatically scale with your traffic and ensure you pay only for the usage. We once worked on an online store, and we had a situation that we didn't predict, because no one on the team was following the Super Bowl in the United States.
38 00:06:53.800 --> 00:07:00.989 Nikita Barysheva: We just missed it. And the traffic went up,
39 00:07:01.790 --> 00:07:08.899 Nikita Barysheva: and people tried to buy beer in the United States, to order it online, and we didn't expect that. But thanks to
40 00:07:09.060 --> 00:07:16.610 Nikita Barysheva: the DynamoDB architecture, it scaled up automatically, and we were on the pretty good side.
41 00:07:18.662 --> 00:07:22.460 Nikita Barysheva: These are very general cost overviews, okay?
42 00:07:22.930 --> 00:07:30.140 Nikita Barysheva: I just wanted to give you some examples. Don't take it as strict, that you have to calculate it like this. I just wanted to
43 00:07:30.470 --> 00:07:32.630 Nikita Barysheva: give you a
44 00:07:33.130 --> 00:07:35.150 Nikita Barysheva: basic understanding. Okay.
45 00:07:36.065 --> 00:07:41.960 Nikita Barysheva: For RDS, you pay for the instance cost and for the storage.
46 00:07:42.830 --> 00:07:50.800 Nikita Barysheva: Once again, the specification could be more complicated, but we're talking about basics.
47 00:07:50.930 --> 00:07:57.739 Nikita Barysheva: And for DynamoDB, you pay for write capacity units, read capacity units, and also for data storage.
48 00:07:58.080 --> 00:08:05.850 Nikita Barysheva: Regarding the storage, I wanted to give you some more detailed calculations here.
49 00:08:07.160 --> 00:08:13.440 Nikita Barysheva: If you, for example, want to store 5 gigs, 10 gigs, 20 gigs,
50 00:08:13.610 --> 00:08:21.249 Nikita Barysheva: you will pay the same price for the storage all this time, because, as I said before, you
51 00:08:21.470 --> 00:08:24.479 Nikita Barysheva: choose the storage type and you have to pay for it
52 00:08:24.650 --> 00:08:28.259 Nikita Barysheva: even if you use less.
53 00:08:28.470 --> 00:08:36.140 Nikita Barysheva: At the same time, you see that for DynamoDB this thing is dynamic,
54 00:08:37.179 --> 00:08:41.329 Nikita Barysheva: and it depends on the real storage that you use.
55 00:08:41.620 --> 00:08:48.309 Nikita Barysheva: There are more things than I mentioned here; I'm not sure if you want to be overwhelmed right now, let me know. But
56 00:08:48.660 --> 00:08:56.309 Nikita Barysheva: these are the very basic things that I wanted you to consider, just to understand RDS and DynamoDB.
57 00:08:56.950 --> 00:08:57.940 Nikita Barysheva: And
58 00:08:58.050 --> 00:09:11.530 Nikita Barysheva: yeah, so we talked about different database types, SQL and NoSQL: specifically RDS (actually, I didn't mention it, but it's PostgreSQL,
59 00:09:11.710 --> 00:09:15.120 Nikita Barysheva: if that's important) and DynamoDB.
60 00:09:15.630 --> 00:09:22.709 Nikita Barysheva: Now, I want to talk about the actual problem that we had and the solution
61 00:09:22.890 --> 00:09:24.870 Nikita Barysheva: that we found for ourselves.
62 00:09:28.790 --> 00:09:33.650 Nikita Barysheva: So the overall problem was that when
63 00:09:34.180 --> 00:09:38.829 Nikita Barysheva: the number of users, the number of requests to the database,
64 00:09:38.950 --> 00:09:42.869 Nikita Barysheva: went up, we had spikes.
65 00:09:43.020 --> 00:09:59.789 Nikita Barysheva: Our RDS didn't work well sometimes. So we decided that we needed something more stable, and we started to consider different databases. And because we had previous experience with DynamoDB on another project,
66 00:10:00.150 --> 00:10:09.900 Nikita Barysheva: we decided that we want to build an architecture where all our clients will go to DynamoDB
67 00:10:10.060 --> 00:10:15.369 Nikita Barysheva: through a users microservice. But
68 00:10:15.520 --> 00:10:22.409 Nikita Barysheva: another problem is that today all of our users are stored in PostgreSQL.
69 00:10:22.960 --> 00:10:26.039 Nikita Barysheva: So how to manage
70 00:10:26.230 --> 00:10:35.769 Nikita Barysheva: 2 different databases? And not just physically different databases, I mean different types of databases. That's kind of a challenge. Okay?
71 00:10:35.930 --> 00:10:41.610 Nikita Barysheva: So this is, again, a really
72 00:10:42.104 --> 00:10:49.590 Nikita Barysheva: general overview of the solution. On the left side you see the clients. Each of them is
73 00:10:49.840 --> 00:10:51.350 Nikita Barysheva: a client that
74 00:10:51.510 --> 00:10:57.339 Nikita Barysheva: could be a back-end client that wants to get data about a specific user,
75 00:10:57.500 --> 00:11:01.464 Nikita Barysheva: or to get all the users by some condition. And
76 00:11:02.330 --> 00:11:16.599 Nikita Barysheva: how do we do it? We decided to implement several feature flags that will tell us where we should read the data from, or where we now want to write the data to.
77 00:11:16.870 --> 00:11:24.160 Nikita Barysheva: And based on these feature flags, we were doing get requests, or we were doing
78 00:11:24.690 --> 00:11:33.529 Nikita Barysheva: put, post, delete. We do all the separation based on these feature flags. And this is the
79 00:11:33.770 --> 00:11:40.780 Nikita Barysheva: PostgreSQL architecture, nothing special here. And this is the user service. So we have the
80 00:11:41.200 --> 00:11:47.700 Nikita Barysheva: containers here, and we use Redis for caching, and DynamoDB.
81 00:11:48.030 --> 00:11:54.420 Nikita Barysheva: Without additional details here, I think it could be pretty clear what we are
82 00:11:54.760 --> 00:11:59.329 Nikita Barysheva: trying to do here. Let me know if you have any questions so far.
83 00:11:59.700 --> 00:12:06.740 Nikita Barysheva: If you have questions about this scheme, I will be happy to answer them, just
84 00:12:06.860 --> 00:12:10.370 Nikita Barysheva: to make it more clear for you later.
85 00:12:12.630 --> 00:12:13.830 Nikita Barysheva: And then,
86 00:12:15.170 --> 00:12:23.570 Nikita Barysheva: besides the fact that we want to transfer to DynamoDB, we need to have this transition period. So,
87 00:12:23.800 --> 00:12:27.390 Nikita Barysheva: as you saw on the previous scheme,
88 00:12:27.660 --> 00:12:36.229 Nikita Barysheva: we planned to implement the service, an API that will handle all CRUD operations related to our DynamoDB.
89 00:12:36.530 --> 00:12:42.040 Nikita Barysheva: And we also needed to transfer all users data from RDS to DynamoDB.
90 00:12:42.160 --> 00:12:49.420 Nikita Barysheva: This was done with different scripts we wrote: we basically grab the data from RDS,
91 00:12:49.730 --> 00:12:58.090 Nikita Barysheva: transform the data as we want, and basically transfer this data to DynamoDB.
92 00:12:58.640 --> 00:13:02.679 Nikita Barysheva: And we also decided, as I mentioned
93 00:13:02.860 --> 00:13:11.260 Nikita Barysheva: on a previous slide, that we want to have feature flags. The 1st feature flag is 'read user from DynamoDB'.
94 00:13:11.450 --> 00:13:12.520 Nikita Barysheva: If it's true,
95 00:13:12.780 --> 00:13:19.020 Nikita Barysheva: we go through the microservice to DynamoDB. If it's false, we go directly to PostgreSQL.
96 00:13:19.160 --> 00:13:24.129 Nikita Barysheva: And 'write user to RDS' and 'write user to DynamoDB':
97 00:13:24.330 --> 00:13:30.630 Nikita Barysheva: these are 2 flags that we basically need to support this period when we
98 00:13:31.270 --> 00:13:37.399 Nikita Barysheva: work with both databases. We tried to make this period as short as possible,
99 00:13:37.680 --> 00:13:41.950 Nikita Barysheva: to make some tests on QA, on staging, and then on production.
100 00:13:42.504 --> 00:13:47.880 Nikita Barysheva: We still had to watch production. But once we saw that everything works
101 00:13:48.230 --> 00:13:51.070 Nikita Barysheva: fine, when we don't have any
102 00:13:51.200 --> 00:14:11.700 Nikita Barysheva: requests from the customers and we don't have any bugs opened, we closed 'write user to RDS'. So, let's go back for a second if I can. Yeah, basically this path was closed, so we just continued working directly with our user service.
103 00:14:12.150 --> 00:14:23.140 Nikita Barysheva: 'Read from DynamoDB' was always true, and 'write to DynamoDB' also true; 'write to RDS' was false. So all this scheme
104 00:14:23.290 --> 00:14:25.420 Nikita Barysheva: started working only with this part,
105 00:14:25.690 --> 00:14:27.470 Nikita Barysheva: to avoid any possible risk.
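(Editor's note: the slides themselves are not reproduced in this transcript. The flag-based routing described above can be sketched in Python; the flag names and the backend callables below are hypothetical stand-ins, just to illustrate the dual-write transition period.)

```python
# Hypothetical feature flags for the transition period described in the talk.
FLAGS = {
    "read_user_from_dynamodb": True,   # read path: microservice + DynamoDB vs. PostgreSQL
    "write_user_to_rds": True,         # dual-write: keep RDS updated during migration
    "write_user_to_dynamodb": True,
}

def get_user(user_id, flags, read_dynamo, read_rds):
    # Route the read based on the flag; read_dynamo/read_rds stand in
    # for the real backend calls (microservice client vs. SQL query).
    if flags["read_user_from_dynamodb"]:
        return read_dynamo(user_id)
    return read_rds(user_id)

def save_user(user, flags, write_dynamo, write_rds):
    # During the transition both writes may run; at the end of the migration
    # "write_user_to_rds" is flipped to False and the RDS path is closed.
    if flags["write_user_to_dynamodb"]:
        write_dynamo(user)
    if flags["write_user_to_rds"]:
        write_rds(user)
```

Flipping a single flag then closes the RDS write path without a deploy, which is the "this path was closed" step in the talk.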
106 00:14:28.655 --> 00:14:29.020 Nikita Barysheva: Okay,
107 00:14:30.780 --> 00:14:42.270 Nikita Barysheva: I wanted to show the client-side architecture. The client, as I mentioned before, is every service that wants to get information about the user.
108 00:14:42.800 --> 00:14:48.340 Nikita Barysheva: And I just wanted to give some code examples and to explain what we
109 00:14:48.470 --> 00:14:50.530 Nikita Barysheva: generally try to achieve there.
110 00:14:51.320 --> 00:15:01.800 Nikita Barysheva: We wrote a model interface, and the purpose of such an interface is to be a handle for all calls
111 00:15:01.970 --> 00:15:05.099 Nikita Barysheva: to DynamoDB through the service:
112 00:15:05.370 --> 00:15:13.029 Nikita Barysheva: handle responses from the service, manage all the retries, manage all the caching, etcetera. So, to be
113 00:15:13.250 --> 00:15:19.509 Nikita Barysheva: the one that gets the data for the client from the service.
114 00:15:20.940 --> 00:15:23.820 Nikita Barysheva: It could look like this.
115 00:15:23.950 --> 00:15:31.610 Nikita Barysheva: One of the functions that we could use is get user by email:
116 00:15:32.030 --> 00:15:37.850 Nikita Barysheva: we initiate the user client with some parameters over here.
117 00:15:38.050 --> 00:15:50.039 Nikita Barysheva: One of the parameters that I really like to mention is the requester ID. I will explain later... actually, I can explain right now
118 00:15:50.260 --> 00:15:51.730 Nikita Barysheva: why we need it:
119 00:15:52.010 --> 00:16:00.319 Nikita Barysheva: basically for the logging and for the tracking when something falls down. I really like that we know which client made this request.
120 00:16:00.690 --> 00:16:16.299 Nikita Barysheva: And on the left side is the function itself, which uses the feature flag. The code could be optimized; don't look at it as a perfect one, I just wanted to make it as clear and readable on one slide as possible.
121 00:16:16.830 --> 00:16:22.635 Nikita Barysheva: So if we want to get a user by email, we look at this feature flag, and
122 00:16:23.180 --> 00:16:27.599 Nikita Barysheva: we basically want to make a request to the DynamoDB
123 00:16:27.790 --> 00:16:40.490 Nikita Barysheva: service. This is the client library, and we want to make the API call. Else, we're going as we did it before: we just go to PostgreSQL and get the data over there.
124 00:16:43.070 --> 00:16:48.000 Nikita Barysheva: And this is the function that gets the user from DynamoDB;
125 00:16:48.250 --> 00:16:51.309 Nikita Barysheva: let me make it a bit more specific over here.
126 00:16:51.770 --> 00:16:54.490 Nikita Barysheva: We're setting up all the parameters that we want to get,
127 00:16:54.650 --> 00:17:08.140 Nikita Barysheva: and we're making an API call to the route, and we are handling the response. You can handle it however you want; we at that moment in time decided that we want
128 00:17:08.359 --> 00:17:09.180 Nikita Barysheva: to return
129 00:17:09.440 --> 00:17:25.660 Nikita Barysheva: 2 values here. The 1st one represents the status, whether it's okay or not, and the second one is the response. So we can check if the request was okay or not. And this is the call_api function that actually makes the request
130 00:17:26.160 --> 00:17:44.499 Nikita Barysheva: to the service. It has some retries, you can set up whatever you want, and once again you can make it better if you want: logging, you can mock the request itself, and for sure error handling with logging also.
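(Editor's note: the slide's call_api code is not reproduced in the transcript. The following is a hypothetical sketch of the shape described, with retries and the (ok, response) pair; the function name, linear backoff, and logger name are assumptions, not the speaker's actual code.)

```python
import logging
import time

logger = logging.getLogger("user_client")

def call_api(request_fn, retries=3, backoff_seconds=0.5):
    # request_fn is a zero-argument callable that performs the HTTP request
    # (e.g. a functools.partial around a requests.get call).
    # Returns (True, response) on success, (False, last_error) after all retries fail.
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return True, request_fn()
        except Exception as exc:  # real code would catch only transport errors
            last_error = exc
            logger.warning("attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt < retries:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    return False, last_error
```

The caller then checks the first value, matching the "check if the request was okay or not" pattern described above.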
131 00:17:45.580 --> 00:17:50.759 Nikita Barysheva: If there are any questions so far, let me know.
132 00:17:54.020 --> 00:18:01.149 Nikita Barysheva: So this is one of the examples of how a client can make a get request
133 00:18:01.460 --> 00:18:05.160 Nikita Barysheva: to the microservice that will then
134 00:18:05.320 --> 00:18:08.100 Nikita Barysheva: get the data from DynamoDB.
135 00:18:08.310 --> 00:18:13.170 Nikita Barysheva: So let's have a look at one of the routes of the microservice itself.
136 00:18:16.340 --> 00:18:20.059 Nikita Barysheva: As we know, we decided to use DynamoDB.
137 00:18:20.620 --> 00:18:25.700 Nikita Barysheva: First of all, you have to create the table. I just wanted to give you a
138 00:18:25.910 --> 00:18:29.100 Nikita Barysheva: quick overview of what is included.
139 00:18:29.620 --> 00:18:36.320 Nikita Barysheva: You see here some params, including the key schema that defines the primary key.
140 00:18:36.680 --> 00:18:40.250 Nikita Barysheva: The primary key could also be a composite key
141 00:18:40.490 --> 00:18:45.620 Nikita Barysheva: of, let's say, 2 fields, and
142 00:18:45.940 --> 00:18:52.720 Nikita Barysheva: this is what actually helps us to get the data quicker.
143 00:18:53.170 --> 00:18:58.179 Nikita Barysheva: We have the attribute definitions that describe the primary key
144 00:18:58.901 --> 00:19:07.130 Nikita Barysheva: fields. And we can also set up different global secondary indexes. One of them here is the email index,
145 00:19:07.260 --> 00:19:14.640 Nikita Barysheva: which allows us to search by email also, not only by ID. But you may have different indexes, not only one.
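(Editor's note: as a concrete illustration of a key schema, attribute definitions, and an email GSI, here is a hypothetical parameter set in the shape boto3's create_table expects. The table and index names are made up; the slide's actual values are not reproduced here.)

```python
# Hypothetical DynamoDB table definition: "id" as the primary (hash) key,
# plus a global secondary index so users can also be looked up by email.
table_params = {
    "TableName": "users",
    "KeySchema": [
        {"AttributeName": "id", "KeyType": "HASH"},  # a RANGE key would make it composite
    ],
    "AttributeDefinitions": [
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "email", "AttributeType": "S"},
    ],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "email-index",
            "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",  # the on-demand capacity mode discussed earlier
}

# With boto3 this would be passed as:
#   boto3.client("dynamodb").create_table(**table_params)
```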
146 00:19:17.307 --> 00:19:19.250 Nikita Barysheva: About the routes:
147 00:19:19.872 --> 00:19:31.540 Nikita Barysheva: maybe it will be obvious for many of you, but actually, I was surprised when it wasn't for some other developers when I talked to them.
148 00:19:31.670 --> 00:19:41.340 Nikita Barysheva: The basic service handles all the get, put, post, delete requests easily; it should do it, okay. But
149 00:19:41.820 --> 00:19:44.709 Nikita Barysheva: the thing that people are really missing is:
150 00:19:45.650 --> 00:19:51.849 Nikita Barysheva: what if we want to update many users at once? What if we want to create many users at once?
151 00:19:55.600 --> 00:19:56.680 Nikita Barysheva: Someone called it.
152 00:19:57.040 --> 00:20:00.379 Nikita Barysheva: So when we talk about DynamoDB,
153 00:20:01.020 --> 00:20:09.059 Nikita Barysheva: and when we talk about the cost consideration, it's much better.
154 00:20:09.190 --> 00:20:12.480 Nikita Barysheva: Let's say you want to create 100 users.
155 00:20:12.720 --> 00:20:17.130 Nikita Barysheva: You have a reason for that. Let's say you don't go in a for loop,
156 00:20:17.290 --> 00:20:20.119 Nikita Barysheva: creating one after another.
157 00:20:20.290 --> 00:20:35.150 Nikita Barysheva: You're sending a batch of 100 users, and basically this batch will be divided into chunks of 25 records, and only
158 00:20:35.280 --> 00:20:41.920 Nikita Barysheva: 4 requests will be done to DynamoDB, so much, much less. Okay.
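(Editor's note: the 25-item chunk size is DynamoDB's BatchWriteItem limit, which is why 100 users become 4 requests instead of 100. A sketch of the chunking; in practice boto3's Table.batch_writer does this buffering for you, so the helper is mostly illustrative.)

```python
def chunked(items, size=25):
    # DynamoDB's BatchWriteItem accepts at most 25 put/delete requests per call,
    # so a bulk-create endpoint splits its payload into chunks of that size.
    for start in range(0, len(items), size):
        yield items[start:start + size]

# With boto3 the same effect is usually achieved with the batch_writer helper:
#   with table.batch_writer() as batch:
#       for user in users:
#           batch.put_item(Item=user)
```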
159 00:20:42.170 --> 00:20:43.479 Gabor Szabo: There is a question.
160 00:20:43.770 --> 00:20:44.190 Gabor Szabo: Oh.
161 00:20:44.190 --> 00:20:44.690 Nikita Barysheva: Yep.
162 00:20:44.690 --> 00:20:49.730 Gabor Szabo: How did you convert the data model from relational to NoSQL?
163 00:20:50.550 --> 00:20:57.619 Nikita Barysheva: Yeah, okay, that's a good one. The idea is actually: since we know that SQL
164 00:20:57.730 --> 00:21:00.940 Nikita Barysheva: has this strong structure,
165 00:21:01.050 --> 00:21:09.219 Nikita Barysheva: a fixed structure, we know what to expect, and we basically created another dictionary
166 00:21:09.330 --> 00:21:16.340 Nikita Barysheva: for the user and transferred it. Basically, we knew
167 00:21:17.000 --> 00:21:21.549 Nikita Barysheva: what the schema of the PostgreSQL is,
168 00:21:21.780 --> 00:21:29.169 Nikita Barysheva: we received the users, we transformed them into that basic dictionary, and,
169 00:21:29.620 --> 00:21:35.050 Nikita Barysheva: using the post method with the bulk, created them in DynamoDB.
170 00:21:35.610 --> 00:21:41.820 Nikita Barysheva: And that's it. No, no magic over there, actually.
171 00:21:43.180 --> 00:21:49.429 Nikita Barysheva: As for the datetime object, maybe this is what may be specifically interesting to you:
172 00:21:50.448 --> 00:21:56.219 Nikita Barysheva: we stored the date as a string, so we can convert it.
173 00:21:56.790 --> 00:22:08.040 Nikita Barysheva: You can also store it in milliseconds. What else did we have over there? We had Boolean fields, and
174 00:22:08.680 --> 00:22:13.980 Nikita Barysheva: nothing really special that can challenge you:
175 00:22:14.170 --> 00:22:18.829 Nikita Barysheva: you're just converting an object that you're getting from PostgreSQL
176 00:22:19.010 --> 00:22:25.190 Nikita Barysheva: to an object that will be suitable for DynamoDB. Yeah.
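(Editor's note: a minimal sketch of that conversion under the assumptions just described: dates become strings, None fields are dropped since DynamoDB simply omits empty attributes, and booleans pass through as-is. The field names in the test are hypothetical.)

```python
from datetime import datetime

def to_dynamo_item(row):
    # Convert a PostgreSQL row (as a dict) into a DynamoDB-friendly item.
    item = {}
    for key, value in row.items():
        if value is None:
            continue  # DynamoDB records simply omit empty attributes
        if isinstance(value, datetime):
            value = value.isoformat()  # store dates as strings (or epoch millis)
        item[key] = value
    return item
```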
177 00:22:26.010 --> 00:22:31.520 Nikita Barysheva: Does that answer your question, or should I be more specific? ... Excellent.
178 00:22:35.980 --> 00:22:41.230 Nikita Barysheva: I just don't see if it's a yes or no, I'm just sharing my screen.
179 00:22:41.620 --> 00:22:43.050 Gabor Szabo: Yes, he says.
180 00:22:43.800 --> 00:22:47.959 Nikita Barysheva: Okay, yeah, I see it now.
181 00:22:50.110 --> 00:22:56.679 Nikita Barysheva: Yep. So we talked about obvious and not-obvious things. And
182 00:22:59.590 --> 00:23:05.980 Nikita Barysheva: this is the service setup. I also tried to put many things on
183 00:23:06.270 --> 00:23:15.079 Nikita Barysheva: one screen. Can you see it? Because when I showed it live, people struggled to see it. I just want to check.
184 00:23:15.080 --> 00:23:20.760 Gabor Szabo: If you could enlarge a little bit the whole thing that might be nice. I don't know if it's if you can do that.
185 00:23:23.290 --> 00:23:30.279 Nikita Barysheva: One second... it doesn't look like it allows it in this mode.
186 00:23:32.750 --> 00:23:34.000 Nikita Barysheva: This doesn't help.
187 00:23:38.820 --> 00:23:48.070 Gabor Szabo: There's also another question in the meantime: why did you need Redis, if DynamoDB has great performance on reads?
188 00:23:49.990 --> 00:24:00.989 Nikita Barysheva: Because it's also good for saving money, actually. And the reasoning for Redis, I can explain in a second.
189 00:24:01.400 --> 00:24:06.300 Nikita Barysheva: It's a good one, because let's go over here.
190 00:24:14.750 --> 00:24:15.500 Nikita Barysheva: Okay.
191 00:24:15.800 --> 00:24:21.639 Nikita Barysheva: So we are now saying that we are working like this, okay:
192 00:24:21.800 --> 00:24:31.819 Nikita Barysheva: we are going from the client to DynamoDB, and I mentioned Redis here, but I also had, I think, to mention Redis on this layer.
193 00:24:32.390 --> 00:24:40.649 Nikita Barysheva: So what's happening right now is that our client makes another API call to another service,
194 00:24:41.000 --> 00:24:44.359 Nikita Barysheva: and every call, let's say, costs us something,
195 00:24:44.560 --> 00:24:46.810 Nikita Barysheva: and then we go to DynamoDB.
196 00:24:47.776 --> 00:24:56.910 Nikita Barysheva: So Redis is not only about the speed, it's also about the money, the cost reduction.
197 00:24:57.390 --> 00:25:02.470 Nikita Barysheva: For example, here at this layer,
198 00:25:02.810 --> 00:25:07.010 Nikita Barysheva: if the client was created
199 00:25:07.310 --> 00:25:13.530 Nikita Barysheva: not properly, and there are many requests and you don't cache the results,
200 00:25:13.700 --> 00:25:25.129 Nikita Barysheva: this client will make another request, another request, another request, and it can grow dramatically, and you will get a huge cost after all.
201 00:25:25.240 --> 00:25:28.289 Nikita Barysheva: So Redis is the solution for that also.
202 00:25:29.710 --> 00:25:38.919 Nikita Barysheva: Using Redis for caching basically allows you, 1st of all, to decrease the load on the
203 00:25:39.190 --> 00:25:45.409 Nikita Barysheva: service and, as a result, to decrease the cost.
204 00:25:45.960 --> 00:25:53.560 Nikita Barysheva: So one of the reasons, which not everyone thinks about in the beginning, is the cost reduction.
205 00:25:54.890 --> 00:25:58.660 Nikita Barysheva: Yep, is that okay?
206 00:26:03.800 --> 00:26:04.800 Nikita Barysheva: Trying to
207 00:26:10.780 --> 00:26:12.339 Nikita Barysheva: I hope that answers the question.
208 00:26:15.790 --> 00:26:23.640 Gabor Szabo: I think you can just go on. I'm just reading out the question: did you use Redis, or did you use AWS caching services?
209 00:26:24.360 --> 00:26:26.529 Nikita Barysheva: Redis, Redis. We used Redis.
210 00:26:26.530 --> 00:26:27.560 Gabor Szabo: Yeah, okay.
211 00:26:27.560 --> 00:26:34.379 Nikita Barysheva: Because we use it widely in all projects. And yeah, we are used to Redis.
212 00:26:41.870 --> 00:26:44.940 Gabor Szabo: I think you can go back to the code example.
213 00:26:50.400 --> 00:26:58.774 Nikita Barysheva: Okay, so basically the 1st things that you will need for the service.
214 00:26:59.670 --> 00:27:01.669 Nikita Barysheva: The setup is
215 00:27:01.820 --> 00:27:05.519 Nikita Barysheva: pretty easy. First, where is it... Pydantic again.
216 00:27:06.150 --> 00:27:10.750 Nikita Barysheva: Even though we said that with DynamoDB
217 00:27:11.340 --> 00:27:14.620 Nikita Barysheva: we don't have to follow a strict schema,
218 00:27:15.040 --> 00:27:17.980 Nikita Barysheva: it's a very good practice to have one. I mean,
219 00:27:18.130 --> 00:27:28.449 Nikita Barysheva: in DynamoDB, when you create a record with a field that is None, it won't be added, so we won't find it; if you go to the UI, you won't have it. But when you're
220 00:27:28.820 --> 00:27:33.899 Nikita Barysheva: getting the data, when you return the data, it's very good to have
221 00:27:34.400 --> 00:27:41.840 Nikita Barysheva: a model that you can use to serialize what you receive from the database. This will help
222 00:27:42.300 --> 00:27:47.980 Nikita Barysheva: the client to know what to expect and
223 00:27:49.110 --> 00:27:54.539 Nikita Barysheva: avoid some unpredictable scenarios. So
224 00:27:54.670 --> 00:27:58.719 Nikita Barysheva: a model is always good, even if you work with
225 00:27:58.860 --> 00:28:02.199 Nikita Barysheva: DynamoDB, a key-value database.
226 00:28:02.560 --> 00:28:08.259 Nikita Barysheva: You see here one of the examples of how you can model data.
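(Editor's note: a minimal sketch of such a Pydantic model; the field names and defaults are hypothetical, not the slide's. Parsing what comes back from DynamoDB through it gives the client a predictable shape even when attributes were omitted.)

```python
from typing import Optional
from pydantic import BaseModel

class User(BaseModel):
    # Fields that DynamoDB may omit get defaults, so a partial record
    # still deserializes into a predictable object for the client.
    id: str
    email: str
    active: bool = True
    nickname: Optional[str] = None
```

Usage: `User(**item_from_dynamodb)` validates the types and fills in the missing fields with the defaults.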
227 00:28:08.590 --> 00:28:16.620 Nikita Barysheva: And if there is a function that said user cache that sets a specific key value
228 00:28:16.900 --> 00:28:21.400 Nikita Barysheva: into the into the readies with the like expiration time.
229 00:28:21.560 --> 00:28:35.010 Nikita Barysheva: And so you can find the and set exception log error, that every time that something falls down. You will see it later. We just throw a properly log. Sorry I need to move the bar.
230 00:28:36.690 --> 00:28:41.280 Nikita Barysheva: Yeah, that will give you some details.
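The model-plus-cache-writer pattern described here can be sketched as follows. This is an illustrative sketch, not the talk's code: a dataclass stands in for the Pydantic model, and the cache client is injected (redis-py exposes the same `setex(key, ttl, value)` call), so the snippet is self-contained. All names (`User`, `set_user_cache`) are invented for the example.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class User:
    # Stand-in for the Pydantic model; attributes absent in DynamoDB default to None.
    user_id: str
    email: str
    first_name: Optional[str] = None
    last_name: Optional[str] = None

def set_user_cache(cache, user: User, ttl_seconds: int = 300) -> bool:
    """Store the serialized user in the cache with an expiration time.

    `cache` is any client with a redis-style setex(key, ttl, value) method.
    Failures are logged instead of raised, so a cache outage never breaks a request.
    """
    try:
        cache.setex(f"user:{user.user_id}", ttl_seconds, json.dumps(asdict(user)))
        return True
    except Exception:
        logger.exception("failed to cache user %s", user.user_id)
        return False
```

With a real redis-py client this would be called as `set_user_cache(redis.Redis(), user)`; logging the exception instead of re-raising matches the point in the talk that a cache failure should only show up in the logs.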
231 00:28:42.850 --> 00:28:49.040 Nikita Barysheva: I don't know why, but I also saw a lot of examples of people trying to avoid it in logs,
232 00:28:49.220 --> 00:29:04.850 Nikita Barysheva: or forgetting any logs at all. I think it's a must: you have to have it written somewhere, what was the error. And what I mentioned before, the request ID.
233 00:29:04.950 --> 00:29:19.059 Nikita Barysheva: It's very important if you have many services or clients that work with the users database, and everything goes through the service as a main point. For all these CRUD operations you need to know who
234 00:29:19.410 --> 00:29:28.760 Nikita Barysheva: made the request and when. It's very good for analytics, you can use it for graphs later on, you can use it to debug things.
235 00:29:29.780 --> 00:29:34.309 Nikita Barysheva: Only positive things from logs, that's what I see.
236 00:29:34.890 --> 00:29:40.377 Nikita Barysheva: And also, for sure, it can increase the costs to some degree. But
237 00:29:41.080 --> 00:29:43.910 Nikita Barysheva: you still have to find a balance.
238 00:29:46.030 --> 00:29:48.930 Nikita Barysheva: Second, yep.
239 00:29:49.290 --> 00:30:12.569 Nikita Barysheva: This is one of the simple examples of how you can get the user: you can get it by ID, by email, you can specify which fields to project. It's like SELECT in Postgres: when you do SELECT email, first_name, last_name, it will return you only these fields. The same you can do in DynamoDB,
240 00:30:13.270 --> 00:30:17.819 Nikita Barysheva: and it's basically called projection attributes.
241 00:30:18.010 --> 00:30:26.010 Nikita Barysheva: Also, we're checking if there is an ID or email provided,
242 00:30:26.480 --> 00:30:28.410 Nikita Barysheva: because it's a very logical thing:
243 00:30:28.580 --> 00:30:34.219 Nikita Barysheva: if nothing of this is provided, you don't look for the user.
244 00:30:34.380 --> 00:30:41.199 Nikita Barysheva: Unless maybe you have a route that can do a conditional,
245 00:30:41.747 --> 00:30:49.370 Nikita Barysheva: a conditional search. But here I decided to show the basic one. By conditional I mean, for example,
246 00:30:49.570 --> 00:30:55.500 Nikita Barysheva: in SQL, when you look for the user
247 00:30:55.660 --> 00:30:59.459 Nikita Barysheva: with the name Nikita, all the users with the name Nikita,
248 00:30:59.660 --> 00:31:01.950 Nikita Barysheva: so you don't have an ID or email.
249 00:31:02.150 --> 00:31:07.829 Nikita Barysheva: But anyway, you can do it, the same thing over here. I just didn't include it over here.
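The lookup being described could be sketched like this, with invented names: fetch by ID or email, optionally projecting fields via DynamoDB's ProjectionExpression (the SELECT-column-list analogue). The table client is injected so the sketch is self-contained; with boto3 it would be a `dynamodb.Table` resource, and a real lookup by email would normally go through a secondary index rather than `get_item`.

```python
from typing import Optional, Sequence

def get_user(table, user_id: Optional[str] = None,
             email: Optional[str] = None,
             fields: Optional[Sequence[str]] = None) -> dict:
    """Fetch one user by ID or email; `fields` limits which attributes come back."""
    if not user_id and not email:
        # Same guard as in the talk: with neither key there is nothing to look up.
        raise ValueError("either user_id or email must be provided")
    # Note: in real DynamoDB, get_item only works on the table's primary key;
    # a lookup by email would typically be a Query against a secondary index.
    key = {"user_id": user_id} if user_id else {"email": email}
    kwargs = {"Key": key}
    if fields:
        kwargs["ProjectionExpression"] = ", ".join(fields)
    response = table.get_item(**kwargs)
    return response.get("Item", {})  # empty dict when no user matched
```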
250 00:31:12.668 --> 00:31:18.590 Nikita Barysheva: This is the actual process. So before going to DynamoDB,
251 00:31:18.790 --> 00:31:20.669 Nikita Barysheva: we are checking the Redis cache.
252 00:31:20.790 --> 00:31:27.560 Nikita Barysheva: So we're checking the cache, and if there is nothing in the cache, we proceed to the actual
253 00:31:28.228 --> 00:31:30.920 Nikita Barysheva: lookup in the table.
254 00:31:31.400 --> 00:31:33.129 Nikita Barysheva: So here,
255 00:31:33.270 --> 00:31:42.380 Nikita Barysheva: users_table is, let's say, an object initialized before, that provides the function get_item,
256 00:31:42.770 --> 00:31:45.170 Nikita Barysheva: get item by key,
257 00:31:45.310 --> 00:31:53.150 Nikita Barysheva: and if we provide the projection attributes, we say: please return us some specific fields.
258 00:31:53.320 --> 00:31:59.249 Nikita Barysheva: Or we search by email. Once again, you can optimize this code as you wish. But
259 00:32:00.060 --> 00:32:08.269 Nikita Barysheva: again, this is just an example. And then, at the end of the day, after we found the user, if we found the user, we cache it.
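The read path just described is the classic cache-aside pattern. A self-contained sketch, with injected stand-ins for the Redis client and the DynamoDB table (names are illustrative, not the talk's code):

```python
import json

def get_user_cached(cache, table, user_id: str, ttl_seconds: int = 300) -> dict:
    """Check the cache first; only on a miss go to the table, then cache the find."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: the database is never touched
    item = table.get_item(Key={"user_id": user_id}).get("Item", {})
    if item:                               # only cache real finds, not misses
        cache.setex(key, ttl_seconds, json.dumps(item))
    return item
```

With real clients, `cache` would be `redis.Redis()` and `table` a boto3 `dynamodb.Table`; the point of the pattern is that every repeated read within the TTL costs a Redis lookup instead of a DynamoDB read unit.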
260 00:32:12.675 --> 00:32:16.210 Nikita Barysheva: In this specific example I wanted
261 00:32:17.120 --> 00:32:23.600 Nikita Barysheva: to mention that we potentially may have 3 types of successful response over here.
262 00:32:25.440 --> 00:32:39.880 Nikita Barysheva: We may end up in a situation when we didn't find any users according to the condition that was provided, the ID or email, and we return an empty object here.
263 00:32:40.270 --> 00:32:58.629 Nikita Barysheva: Or, if we provided a projection attribute, we return the data as we received it from DynamoDB. Or, if we did find the user and didn't provide any projection attributes, here we want to serialize it. As I said, we have a model,
264 00:32:58.750 --> 00:33:01.120 Nikita Barysheva: and we want to serialize the user.
265 00:33:01.310 --> 00:33:11.326 Nikita Barysheva: All the fields that we have inside the user will be passed as they are, like a key and its value, and those that are not present will get,
266 00:33:12.490 --> 00:33:18.410 Nikita Barysheva: will get default values. Normally you put them as None, because
267 00:33:18.750 --> 00:33:24.750 Nikita Barysheva: there is nothing. There is no reason to put something not relevant there.
268 00:33:25.500 --> 00:33:28.119 Nikita Barysheva: It depends on the business logic. But yeah.
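The three successful response shapes can be sketched like this (illustrative names; a dataclass stands in for the Pydantic model, and fields missing from the DynamoDB item fall back to None defaults):

```python
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class UserModel:
    user_id: str
    email: Optional[str] = None
    first_name: Optional[str] = None   # absent in DynamoDB -> defaults to None

def shape_response(item: dict, projected: bool) -> dict:
    if not item:
        return {}                       # case 1: no user matched -> empty object
    if projected:
        return item                     # case 2: return exactly the projected fields
    return asdict(UserModel(**item))    # case 3: full item, predictable shape
```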
269 00:33:30.190 --> 00:33:34.359 Nikita Barysheva: And another thing that might be,
270 00:33:34.780 --> 00:33:38.750 Nikita Barysheva: that is also important, is error handling.
271 00:33:39.500 --> 00:33:47.180 Nikita Barysheva: We have different types of error handling. Please don't forget about it, please use it, and even though it may look overwhelming,
272 00:33:47.320 --> 00:33:53.660 Nikita Barysheva: I find it sometimes much better to have it rather than avoiding it, and then
273 00:33:53.850 --> 00:34:02.910 Nikita Barysheva: something is crashing, and everyone is trying to understand what was the situation. You can handle everything with a general exception, but
274 00:34:04.130 --> 00:34:09.680 Nikita Barysheva: if you are provided with the tools, why not use them? That's my idea.
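One way to sketch the idea of distinct error types instead of a single catch-all: each failure mode gets its own exception, mapped to a status the client can act on. The exception names here are invented for illustration; with boto3 you would typically branch on botocore's `ClientError` error codes.

```python
class UserServiceError(Exception):
    status = 500

class UserNotFound(UserServiceError):
    status = 404

class BadRequest(UserServiceError):
    status = 400

class UpstreamThrottled(UserServiceError):
    status = 429

def to_http(error: Exception) -> tuple[int, str]:
    """Translate a raised error into (status, message) for the response."""
    if isinstance(error, UserServiceError):
        return error.status, str(error) or error.__class__.__name__
    return 500, "internal error"   # the general catch-all still exists, but last
```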
275 00:34:13.100 --> 00:34:18.300 Nikita Barysheva: I just wanted to sum up what the service is
276 00:34:18.480 --> 00:34:28.170 Nikita Barysheva: intended for, and there are 2 things that you see here that I didn't mention before, but they're very important for services like that.
277 00:34:28.550 --> 00:34:40.309 Nikita Barysheva: So the service idea is handling DynamoDB requests, all the CRUD operations. It also should be able to cache things,
278 00:34:40.730 --> 00:34:43.080 Nikita Barysheva: to avoid, like, if
279 00:34:43.239 --> 00:34:50.190 Nikita Barysheva: the request already got to the service for some reason, but
280 00:34:50.460 --> 00:34:57.029 Nikita Barysheva: you have the cached values, then even if the service got the request, there is no reason
281 00:34:57.250 --> 00:34:59.759 Nikita Barysheva: to bother DynamoDB, because
282 00:34:59.940 --> 00:35:07.190 Nikita Barysheva: after all, it's another cent. It doesn't sound like much, it's not another dollar, let's call it so.
283 00:35:07.360 --> 00:35:18.809 Nikita Barysheva: But if you think about a very big scale, when you have millions, tens of millions of users, if something isn't covered, it could cost you a lot. So
284 00:35:19.310 --> 00:35:23.910 Nikita Barysheva: I would prefer using caching
285 00:35:24.760 --> 00:35:27.969 Nikita Barysheva: rather than avoiding it,
286 00:35:28.400 --> 00:35:30.560 Nikita Barysheva: to save some money over here.
287 00:35:31.805 --> 00:35:37.329 Nikita Barysheva: We need to have a proper error handler. That's why I mentioned 4 of them,
288 00:35:37.730 --> 00:35:42.989 Nikita Barysheva: and maybe someone won't like it, but I did it. And the 2 things that I didn't mention here:
289 00:35:43.410 --> 00:35:47.720 Nikita Barysheva: you need to have a throttling mechanism and a rate limiter, actually.
290 00:35:48.970 --> 00:35:53.220 Nikita Barysheva: It depends, I mean, it could be done on the,
291 00:35:53.900 --> 00:36:02.979 Nikita Barysheva: it should be done also on the service side, because the service is, let's say, a standalone thing. But you also may think about
292 00:36:03.150 --> 00:36:05.360 Nikita Barysheva: throttling on the clients. So I mean,
293 00:36:06.180 --> 00:36:16.179 Nikita Barysheva: throttling for sure, and a rate limiter, since we're talking about the service already. These 2 things are very important to have in services like that.
294 00:36:16.340 --> 00:36:18.929 Nikita Barysheva: So don't forget to cover it.
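A token bucket is one common way to implement the rate-limiting idea mentioned here. A stdlib-only sketch, purely illustrative: in a real multi-instance service the bucket state would live in Redis so all instances share it.

```python
import time

class TokenBucket:
    """Each client gets `capacity` tokens that refill at `rate` per second;
    a request is rejected when its bucket is empty."""

    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate              # tokens added back per second
        self.tokens = capacity
        self.clock = clock            # injectable for testing
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # caller should answer 429 Too Many Requests
```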
295 00:36:19.090 --> 00:36:25.790 Nikita Barysheva: And then, I think, that's actually it.
296 00:36:25.990 --> 00:36:26.790 Nikita Barysheva: Hey?
297 00:36:27.190 --> 00:36:29.089 Nikita Barysheva: Yeah, that's it.
298 00:36:29.270 --> 00:36:32.549 Nikita Barysheva: And I think you did it faster, you know
299 00:36:33.410 --> 00:36:39.910 Nikita Barysheva: If you have any questions, just let me know, I would be happy to answer them. And if not,
300 00:36:40.100 --> 00:36:44.509 Nikita Barysheva: thanks for listening. I hope it was interesting to you, and
301 00:36:44.830 --> 00:36:47.499 Nikita Barysheva: it will give you some ideas, maybe, or
302 00:36:47.940 --> 00:36:50.580 Nikita Barysheva: you will decide to do something similar to this.
303 00:36:52.160 --> 00:36:53.080 Nikita Barysheva: Just let me know.
304 00:36:54.640 --> 00:36:55.580 Gabor Szabo: Well.
305 00:36:55.890 --> 00:37:05.981 Gabor Szabo: Thank you for the presentation. If anyone has any more questions, now would be a good time to ask them. If not, then,
306 00:37:08.180 --> 00:37:16.060 Gabor Szabo: thank you very much for giving this presentation and for being here, and to those who were watching it live:
307 00:37:16.190 --> 00:37:23.169 Gabor Szabo: now you have the chance. I'm telling it also to the viewers on the Youtube channel: those people who are here
308 00:37:23.820 --> 00:37:33.870 Gabor Szabo: in the live meeting can stay on, and after we stop the recording we can open the mics, and then we can have a conversation,
309 00:37:33.990 --> 00:37:51.040 Gabor Szabo: asking all kinds of other questions that you might not have wanted to ask on the video. So anyway, thank you for being here, thank you for watching, and don't forget to like the video and follow the channel, and see you next time, and
310 00:37:51.220 --> 00:37:55.589 Gabor Szabo: join the Meetup group, if you're not there yet. And thank you.
311 00:37:56.960 --> 00:37:58.029 Nikita Barysheva: Thank you. Everyone.
Discover how to quickly turn your Python scripts into interactive web apps using Streamlit. This session will cover key features like visualisations, widgets, and deployment, empowering you to create user-friendly interfaces with minimal effort.

1 00:00:02.390 --> 00:00:05.820 Gabor Szabo: So hello and welcome to the Code Maven Channel.
2 00:00:05.960 --> 00:00:14.180 Gabor Szabo: My name is Gabor. I organize these events because I think it's very important for people to be able to share their knowledge.
3 00:00:14.410 --> 00:00:38.479 Gabor Szabo: and it's very useful for everyone else to learn from other people all around the world. I myself usually teach Python and Rust, and help companies introduce testing in these 2 languages, or introduce these languages. And that's it. Basically, this channel now mostly has these videos from these meetings,
4 00:00:38.860 --> 00:01:07.330 Gabor Szabo: and I am really happy that you agreed to give this presentation in our meeting, and thank you everyone for joining us here. If you are in the Zoom meeting, then feel free to ask questions; just remember that it's going to be on Youtube. If you're watching it on Youtube and you enjoy this video, then please like the video and follow the channel, and later on we'll have the links below the video,
5 00:01:07.380 --> 00:01:14.970 Gabor Szabo: where you can contact Leah as well, if you are interested. So now it's your turn. Go ahead.
6 00:01:15.950 --> 00:01:16.850 Gabor Szabo: Welcome now.
7 00:01:21.690 --> 00:01:30.810 Leah Levy: So hopefully you can see my screen. So my name is Leah. I'm currently living in the UK; I'm a data scientist in the UK.
8 00:01:30.810 --> 00:01:41.190 Gabor Szabo: Maybe it's only just me, but I can see the whole list of the people who have joined. Is it on your screen, or is it just mine? No, I think you're sharing that one.
9 00:01:43.930 --> 00:01:44.700 Leah Levy: Yeah.
10 00:01:49.860 --> 00:01:53.950 Gabor Szabo: Wait a second. Maybe it's my, it's mine. No view.
11 00:01:54.720 --> 00:01:56.819 Gabor Szabo: Yeah, no, it was mine. Sorry.
12 00:01:58.430 --> 00:02:00.800 Gabor Szabo: Sorry, confusing you. Okay.
13 00:02:02.020 --> 00:02:02.550 Leah Levy: It's okay.
14 00:02:02.550 --> 00:02:06.490 Gabor Szabo: Go ahead. No, no, it's okay. It was on my screen in the.
15 00:02:11.030 --> 00:02:22.899 Leah Levy: Yeah. So I'm a data scientist for the UK government. I'm currently living in England. I'm hoping to move to Israel soon, so it'll be nice to meet everybody.
16 00:02:23.611 --> 00:02:42.418 Leah Levy: I'm gonna talk today about Streamlit, which is a Python library, and how I use it to deploy machine learning models and just build web apps. I'll put my contact details in the chat, if you wanna connect with me on Linkedin or follow me on Github.
17 00:02:43.200 --> 00:02:45.630 Leah Levy: It'd be great to connect,
18 00:02:46.694 --> 00:02:55.610 Leah Levy: and please feel free to ask questions as we go along. I can see the chat, so if you want to put messages in the chat or come off mute, whatever you prefer.
19 00:02:56.760 --> 00:03:02.790 Leah Levy: So Streamlit is a Python library. It's open source.
20 00:03:02.790 --> 00:03:10.029 Gabor Szabo: Sorry, sorry, just one note: right now we can see both you and the slides,
21 00:03:10.300 --> 00:03:11.630 Gabor Szabo: and.
22 00:03:11.900 --> 00:03:12.880 Leah Levy: Oh, okay.
23 00:03:12.880 --> 00:03:22.350 Gabor Szabo: So maybe you want to turn off your camera, or just show the slides, because in the recording you will be seen anyway, probably at the top right corner.
24 00:03:23.000 --> 00:03:25.769 Gabor Szabo: that now you can. I can see myself.
25 00:03:26.950 --> 00:03:28.720 Leah Levy: I'll share again. Hold on.
26 00:03:29.060 --> 00:03:29.850 Gabor Szabo: Okay.
27 00:03:36.170 --> 00:03:40.759 Leah Levy: Oh, yeah, it was on a strange setting. I think I was messing around with the settings before.
28 00:03:40.960 --> 00:03:41.710 Leah Levy: Okay.
29 00:03:46.550 --> 00:03:47.979 Gabor Szabo: Oh, now it's good!
30 00:03:48.590 --> 00:03:49.296 Leah Levy: Yeah, okay.
31 00:03:50.100 --> 00:03:50.510 Gabor Szabo: Okay.
32 00:03:51.450 --> 00:03:56.100 Leah Levy: Thanks for letting me know. So you can see just like there's the slideshow.
33 00:03:57.130 --> 00:03:57.710 Gabor Szabo: Yeah.
34 00:03:58.130 --> 00:03:58.790 Leah Levy: Yeah,
35 00:04:02.210 --> 00:04:22.959 Leah Levy: So, how many of you have perhaps worked on a data science project? You've built a machine learning model, and you've wished you could deploy it quickly for others to use. Or perhaps you've built a web application, but front-end development isn't really your expertise; it's too complicated. So this is where Streamlit really comes into its own.
36 00:04:23.270 --> 00:04:41.560 Leah Levy: It makes it easy for Python developers and data scientists to create beautiful interactive web apps without needing any front-end development expertise. So it's lightweight, it's really easy to use, and it doesn't require hundreds of lines of code.
37 00:04:41.620 --> 00:04:56.920 Leah Levy: And there's a really strong community online, so there's people building add-ons constantly, and there's also a strong community of people happy to answer questions and help if you have any issues.
38 00:05:01.950 --> 00:05:10.450 Leah Levy: So Streamlit allows you to turn your Python scripts into interactive web applications in just a few lines of code. So you don't need to know any
39 00:05:10.620 --> 00:05:19.650 Leah Levy: traditional web frameworks like Flask or Django; you don't need any HTML, CSS, or JavaScript. It's all Python.
40 00:05:20.640 --> 00:05:32.522 Leah Levy: You can easily customize your web application using sliders, buttons, checkboxes, making it interactive, and you're able to capture user input too.
41 00:05:34.180 --> 00:05:53.920 Leah Levy: The app automatically updates: when you're coding in whatever IDE you prefer, like Visual Studio Code, as soon as you update the code and save it, it updates in the actual application. I'll do a demo of it a bit later, so you can see exactly what I mean.
42 00:05:55.020 --> 00:06:00.160 Leah Levy: And that just makes development much faster, so you can see your changes as you go along.
43 00:06:00.400 --> 00:06:05.189 Leah Levy: And it works really well with other popular Python libraries like
44 00:06:05.370 --> 00:06:11.689 Leah Levy: NumPy, pandas, Plotly, even data science ones like TensorFlow and scikit-learn.
45 00:06:11.900 --> 00:06:18.939 Leah Levy: So it enables you to visualize data: you can build dashboards, graphs, charts, and also
46 00:06:19.470 --> 00:06:23.439 Leah Levy: integrate machine learning models directly into your application.
47 00:06:26.840 --> 00:06:51.719 Leah Levy: So, a bit about deploying machine learning models. Often in data science you put a lot of work into creating a model: you've got your data, you've cleaned it, you've built a model, you've tested it, optimized it, you've evaluated the performance. But the real key is to surface that to your end users or your clients,
48 00:06:52.170 --> 00:07:01.079 Leah Levy: and using Streamlit makes that easy. It's quite a user-friendly interface, and it can handle resource-intensive tasks.
49 00:07:01.690 --> 00:07:03.910 Leah Levy: And it's easy to deploy as well.
50 00:07:04.050 --> 00:07:14.830 Leah Levy: A basic workflow could be something like loading a pre-trained model from a pickle file, or something from Hugging Face or TensorFlow,
51 00:07:16.480 --> 00:07:32.710 Leah Levy: then collecting input from users: they could enter some text, if it's like a chatbot, or they could use some sliders. Then it uses the machine learning model to make predictions and displays the results to users.
52 00:07:33.150 --> 00:07:39.510 Leah Levy: So I've created a couple of examples of
53 00:07:39.610 --> 00:07:46.359 Leah Levy: what it can do, just kind of basic: one's a dashboard and one uses a pre-trained machine learning model.
54 00:07:50.010 --> 00:07:59.530 Leah Levy: I've taken some screenshots, but I think it'd be better to just show it live. So I'm just gonna have a go at showing it. Can you see this?
55 00:08:00.660 --> 00:08:01.400 Gabor Szabo: Then like.
56 00:08:03.980 --> 00:08:06.170 Leah Levy: Because, yeah, the code.
57 00:08:06.620 --> 00:08:07.280 Gabor Szabo: Yes.
58 00:08:09.100 --> 00:08:14.939 Leah Levy: So I've just pre pre-built like this very basic dashboard.
59 00:08:15.070 --> 00:08:17.750 Leah Levy: What it does is
60 00:08:18.230 --> 00:08:24.410 Leah Levy: I've got some dummy data about British culture. I thought I'd make it relevant to me,
61 00:08:25.030 --> 00:08:27.209 Leah Levy: and I've just put it into a.
62 00:08:27.210 --> 00:08:29.650 Gabor Szabo: I'm saying, maybe you can enlarge the fonts a little bit.
63 00:08:32.220 --> 00:08:32.970 Leah Levy: Yeah, let me.
64 00:08:33.276 --> 00:08:33.889 Gabor Szabo: Yeah. Thanks.
65 00:08:35.520 --> 00:08:36.020 Gabor Szabo: Think so.
66 00:08:38.250 --> 00:08:38.909 Gabor Szabo: Noon.
67 00:08:42.010 --> 00:08:43.020 Gabor Szabo: Okay, well.
68 00:08:43.020 --> 00:08:43.420 Leah Levy: Oh, 2.
69 00:08:43.429 --> 00:08:47.150 Gabor Szabo: Yeah, yeah, no, it's good. I see.
70 00:08:48.430 --> 00:08:49.330 Leah Levy: Pardon.
71 00:08:50.920 --> 00:08:51.859 Gabor Szabo: I think it's fine now.
72 00:08:52.500 --> 00:08:53.440 Leah Levy: Okay?
73 00:08:54.357 --> 00:09:02.049 Leah Levy: So in the terminal I just use the command streamlit run. So I do streamlit
74 00:09:02.210 --> 00:09:06.640 Leah Levy: run, and then the name of your file.
75 00:09:06.830 --> 00:09:13.420 Leah Levy: In this case it's in the app folder, and it's called English chat, Hi.
76 00:09:16.530 --> 00:09:21.744 Leah Levy: and it takes a couple of seconds, and it should pop up in your browser.
77 00:09:23.780 --> 00:09:28.030 Leah Levy: So here you have your Streamlit app in your browser; it's popped up here,
78 00:09:29.430 --> 00:09:37.429 Leah Levy: and here's the very basic app that I built. In the top right-hand corner you see it running,
79 00:09:38.360 --> 00:09:42.490 Leah Levy: and then there's an option here to deploy, if you're ready to deploy it.
80 00:09:43.405 --> 00:09:45.339 Leah Levy: Oh, what's this?
81 00:09:48.310 --> 00:09:49.360 Leah Levy: Okay?
82 00:09:56.720 --> 00:10:03.289 Leah Levy: If this doesn't work, I will just show you the screenshot instead.
83 00:10:03.890 --> 00:10:05.980 Leah Levy: Okay, so I've saved it here.
84 00:10:06.750 --> 00:10:10.696 Leah Levy: And you'll see an example now, actually, of
85 00:10:12.450 --> 00:10:23.590 Leah Levy: of how it updates in real time. So I've updated the file, the source file. And you see in the top right hand corner. Now there's an option I'll just zoom in and make it a bit bigger.
86 00:10:25.070 --> 00:10:28.779 Leah Levy: but it says "Source file changed", and it gives you the option to rerun,
87 00:10:29.161 --> 00:10:33.799 Leah Levy: and you can click "Always rerun", so I don't have to click that every time. So if I try that,
88 00:10:34.150 --> 00:10:38.630 Leah Levy: and it's worked now. So this is just like a
89 00:10:39.100 --> 00:10:46.520 Leah Levy: basic application. There's a dropdown menu here, so you can select the category. If I wanted to just see landmarks, see that;
90 00:10:46.830 --> 00:10:50.740 Leah Levy: for some reason it's giving me an error on sports.
91 00:10:54.010 --> 00:11:03.280 Leah Levy: And the size of each bubble is the number of visitors per year, and you can hover over, and it gives you a little bit more information. And then if,
92 00:11:04.900 --> 00:11:12.589 Leah Levy: yeah, I think the map plot is a little bit broken at the bottom. So that's one example. The next
93 00:11:13.270 --> 00:11:22.030 Leah Levy: application, let me just cancel this, I'll just do Ctrl-C, let's run another,
94 00:11:23.102 --> 00:11:30.350 Leah Levy: another, this is more of like a machine learning one. So I just run streamlit run and
95 00:11:31.820 --> 00:11:32.980 Leah Levy: spell check.
96 00:11:50.550 --> 00:11:54.489 Leah Levy: Oh, I know why it's giving me an error: because I haven't installed the packages.
97 00:12:09.660 --> 00:12:13.009 Leah Levy: I'm actually just using the Poetry library, which,
98 00:12:13.200 --> 00:12:34.820 Leah Levy: I'm not sure how widely it's used, but it's a third-party tool, it's not inbuilt. Typically you might manage your dependencies using a requirements.txt file and then create a virtual environment. But I'm just,
99 00:12:35.420 --> 00:12:45.569 Leah Levy: I've got used to using Poetry, which is another dependency manager. And that's just
100 00:12:46.130 --> 00:12:48.370 Leah Levy: just to clarify exactly what it is.
101 00:12:51.940 --> 00:12:58.580 Leah Levy: Yeah, that's not working. So let me just show you on the on the slide show.
102 00:12:59.580 --> 00:13:00.750 Leah Levy: Sorry?
103 00:13:09.314 --> 00:13:18.655 Leah Levy: What this is: it imports TextBlob, which is a very lightweight natural language processing library,
104 00:13:20.020 --> 00:13:26.119 Leah Levy: and what happens is you put in your spelling, so you put in some text. In this case,
105 00:13:26.530 --> 00:13:35.059 Leah Levy: I'm so bad at spelling, spelled really wrong, and then it returns the correct spelling. And then in the top right you can see it's very kind of
106 00:13:35.320 --> 00:13:47.810 Leah Levy: simple. There's only like 16 lines of code; it's quite lightweight. And I've put a link here to more community projects you can see on the Streamlit website.
107 00:13:48.440 --> 00:13:50.030 Leah Levy: they've actually got
108 00:13:51.100 --> 00:13:59.750 Leah Levy: community projects, so you can kind of get an idea, a flavor, of exactly what's possible. So this one's quite cool; this is like a map
109 00:14:00.445 --> 00:14:06.500 Leah Levy: application that somebody's built, that's called Pretty Map, where you kind of visualize
110 00:14:07.361 --> 00:14:11.959 Leah Levy: maps in different cool ways.
111 00:14:13.051 --> 00:14:22.290 Leah Levy: But just so you can get an idea: it's quite personalizable. All the applications don't necessarily have to look the same.
112 00:14:38.920 --> 00:14:40.470 Leah Levy: Sorry gone too far.
113 00:14:45.890 --> 00:14:53.241 Leah Levy: Okay, so I wanted to talk about deployment. As I mentioned, there's different options to deploy.
114 00:14:54.210 --> 00:14:59.230 Leah Levy: Just gonna wait for the slides to kind of sync.
115 00:15:07.560 --> 00:15:08.619 Leah Levy: Not sure.
116 00:15:09.560 --> 00:15:10.799 Leah Levy: Okay, there we go.
117 00:15:13.880 --> 00:15:27.930 Leah Levy: There's a couple of different options. You could deploy locally, which is kind of what we've done just before with the streamlit run, but in most cases you want to deploy it to a cloud or servers.
118 00:15:28.370 --> 00:15:31.159 Leah Levy: So Streamlit has its own kind of
119 00:15:31.370 --> 00:15:39.799 Leah Levy: built-in, customized deployment option called the Streamlit Community Cloud, where you can deploy straight from GitHub.
120 00:15:40.551 --> 00:15:46.568 Leah Levy: But it also supports other deployment options like Docker, AWS,
121 00:15:48.475 --> 00:15:53.880 Leah Levy: and all these other options. Another benefit of the Community Cloud is
122 00:15:54.720 --> 00:16:12.700 Leah Levy: that it provides you with analytics data: how many people have clicked onto your dashboard, total viewers, most recent viewers, timestamps of people's last visit. So you can get an idea of when people have used your application.
123 00:16:14.520 --> 00:16:18.800 Leah Levy: So I want to talk about the testing framework in the app.
124 00:16:18.910 --> 00:16:21.500 Leah Levy: This is something.
125 00:16:22.090 --> 00:16:35.319 Leah Levy: Last time I gave this talk, at PyWeb in Tel Aviv, someone asked me about testing, and I thought, oh yeah, I've not really used the testing framework. So I thought I'd put a section in here to show you kind of how I've done it.
126 00:16:36.415 --> 00:16:58.584 Leah Levy: So you can use pytest and those usual kind of testing frameworks, and Streamlit has its own framework, which enables developers to build and run headless tests that execute the app code directly. It simulates user input and inspects the output for correctness.
127 00:16:59.090 --> 00:17:07.560 Leah Levy: For those who don't know, headless testing is a way to run automated browser tests without having the user interface.
128 00:17:08.027 --> 00:17:13.299 Leah Levy: So it's a more efficient way of testing the application, because it doesn't need to render the HTML.
129 00:17:13.569 --> 00:17:27.959 Leah Levy: It just sends requests to the server the same way you would do in a browser, and it's much faster because you don't need to wait for a page to load, and it integrates well into any CI/CD pipelines you might have as well.
130 00:17:29.670 --> 00:17:47.450 Leah Levy: So, an example of testing. On the left-hand side I've written what might be a more traditional way to write a test. So you would import streamlit and also import textblob, which is the library I mentioned before that we used for the spell checker.
131 00:17:47.660 --> 00:17:49.590 Leah Levy: You kind of set up,
132 00:17:50.100 --> 00:17:57.630 Leah Levy: set up the app just as it appears, to kind of mirror what you've written,
133 00:17:58.258 --> 00:18:07.070 Leah Levy: and have some simulated user input, and then load the TextBlob, and then run the,
134 00:18:07.520 --> 00:18:15.440 Leah Levy: run the TextBlob library to generate the correct spelling, and then have an assert to ensure that
135 00:18:15.740 --> 00:18:23.610 Leah Levy: the output is what you've expected: it should be the corrected spelling of what you've inputted.
136 00:18:24.489 --> 00:18:32.130 Leah Levy: But on the right, all you need to do is use the Streamlit testing framework
137 00:18:32.250 --> 00:18:45.980 Leah Levy: with AppTest. AppTest is what simulates the running of the app, and it provides different methods to set up, manipulate, and inspect the app via the API instead of doing it in the browser.
138 00:18:49.370 --> 00:18:57.074 Leah Levy: And then I've just written a function to test the spelling. So you've got AppTest, which runs the,
139 00:18:57.710 --> 00:19:03.239 Leah Levy: which runs the application as if I was running it in the terminal.
140 00:19:03.950 --> 00:19:09.750 Leah Levy: I simulate an input of the incorrect spelling and run that,
141 00:19:10.520 --> 00:19:16.360 Leah Levy: and then assert that the corrected text equals the correct spelling.
142 00:19:17.358 --> 00:19:25.871 Leah Levy: And then I've just written a couple of other tests. This next function just asserts that
143 00:19:27.180 --> 00:19:33.809 Leah Levy: the application is running and not producing any exception errors. And then this one tests that the title
144 00:19:33.990 --> 00:19:36.970 Leah Levy: displayed is the correct title, as we've expected.
145 00:19:39.550 --> 00:19:48.459 Leah Levy: So you'll see it's much quicker, it's fewer lines of code, and you can just run it in the terminal using pytest,
146 00:19:48.680 --> 00:19:51.339 Leah Levy: as you would any other tests.
147 00:19:54.660 --> 00:20:03.330 Leah Levy: You can add multiple pages to an app. So you create a new pages folder in the same folder where your application is running,
148 00:20:03.934 --> 00:20:15.910 Leah Levy: and then whatever you name the file is what appears on the sidebar, and you can amend the,
149 00:20:17.040 --> 00:20:23.254 Leah Levy: you can amend the content as you would in any other application. I've put a link in here
150 00:20:24.030 --> 00:20:25.680 Leah Levy: just so you can kind of
151 00:20:27.610 --> 00:20:30.949 Leah Levy: I was gonna show how to
152 00:20:32.609 --> 00:20:36.229 Leah Levy: it gives a good example, rather than me
153 00:20:36.680 --> 00:20:41.279 Leah Levy: setting up lots of different ones. But you can kind of see it. It's got a good
154 00:20:41.750 --> 00:20:44.358 Leah Levy: kind of demo page.
155 00:20:49.446 --> 00:20:53.703 Leah Levy: hey? It's got a hello page. It's got a plotting demo.
156 00:20:54.980 --> 00:20:58.089 Leah Levy: yeah, you can have a look in your own time if you like.
157 00:21:24.610 --> 00:21:27.299 Leah Levy: Sorry. My computer's running super slow.
158 00:21:30.410 --> 00:21:32.449 Gabor Szabo: So I just I was just saying.
159 00:21:33.350 --> 00:21:38.320 Leah Levy: It also supports chat inputs. So, oops.
160 00:21:38.920 --> 00:21:47.796 Leah Levy: So, everybody wants to build their own chatbots nowadays, and it provides support for that,
161 00:21:48.380 --> 00:21:55.700 Leah Levy: where it kind of mimics a user. And it's got like an assistant with these like different emojis
162 00:21:56.242 --> 00:22:02.159 Leah Levy: so as if you were speaking to a person. Similar to kind of.
163 00:22:02.720 --> 00:22:07.300 Leah Levy: you know, like ChatGPT's got an assistant kind of answer.
164 00:22:07.560 --> 00:22:30.921 Leah Levy: You can also stream the reply, you know how ChatGPT kind of streams it, or writes it word by word, instead of just giving you an answer right away, to make it look like somebody's typing. You can add a delay as well, of like a couple of seconds, to make it seem like it's thinking about a reply.
165 00:22:32.280 --> 00:22:52.160 Leah Levy: And different things like that. So this is just an echo bot, which just echoes whatever you type into it. Obviously not using any large language models, but you can use kind of any large language model that you want, and kind of just plug it into a Streamlit dashboard.
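A framework-free sketch of the echo bot and the word-by-word streaming effect described above. In Streamlit the real pieces would be `st.chat_input`, `st.chat_message`, and a generator passed to `st.write_stream`; the function names here are made up for illustration.

```python
import time
from typing import Iterator

def echo(message: str) -> str:
    # The echo bot simply returns whatever the user typed.
    return message

def stream_words(reply: str, delay: float = 0.0) -> Iterator[str]:
    # Yield the reply word by word, optionally pausing between words
    # to mimic the "thinking/typing" effect described in the talk.
    for word in reply.split():
        if delay:
            time.sleep(delay)
        yield word + " "

# Collect the streamed reply back into one string.
streamed = "".join(stream_words(echo("hello there"))).strip()
print(streamed)  # → hello there
```

In a real Streamlit app the generator would be handed to `st.write_stream`, which renders each yielded chunk as it arrives.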
166 00:23:01.040 --> 00:23:04.700 Leah Levy: So finally, just some additional features
167 00:23:05.710 --> 00:23:16.739 Leah Levy: which I've, oops, added some links to. So, as I mentioned before, it's got a whole wide range of different input widgets, and
168 00:23:17.180 --> 00:23:32.760 Leah Levy: I didn't kind of include them all on the dashboard, because I think that this page actually does it in a nicer way. You can see it's got different buttons, check boxes, feedback options, radio buttons.
169 00:23:33.550 --> 00:23:35.240 Leah Levy: sliders.
170 00:23:35.966 --> 00:23:39.269 Leah Levy: Numeric inputs. Yeah, I could just go on, but
171 00:23:40.150 --> 00:23:49.400 Leah Levy: pretty much you know anything you would need to build a nice looking app. It's got another
172 00:23:49.840 --> 00:23:56.568 Leah Levy: another thing is status elements of like progress bars loading
173 00:23:58.890 --> 00:24:03.326 Leah Levy: call out messages, but error boxes I've used before.
174 00:24:04.080 --> 00:24:08.824 Leah Levy: I can't say I've used the balloon ones, but that looks fun
175 00:24:12.470 --> 00:24:20.803 Leah Levy: And it also has integration for interactive maps, as we saw before, like the map application that I showed.
176 00:24:21.340 --> 00:24:27.209 Leah Levy: And you can also build interactive charts with, like, Plotly and other similar libraries.
177 00:24:27.640 --> 00:24:36.139 Leah Levy: You can cache large data sets. So particularly when you're working with machine learning models, you're often dealing with
178 00:24:36.250 --> 00:24:48.150 Leah Levy: really, really large data sets, which you can cache into memory. So rather than reloading a data set each time, it can just store it in memory.
179 00:24:50.161 --> 00:25:12.448 Leah Levy: From a safety point of view, I've just looked at the privacy policy and took this fourth bullet point straight from it, which is: Streamlit cannot see and does not store any information contained inside Streamlit apps, like text, charts, and images. But as general advice, I would say not to expose sensitive data,
180 00:25:13.020 --> 00:25:17.580 Leah Levy: unless, yeah,
181 00:25:18.310 --> 00:25:40.254 Leah Levy: unless it's locked down in a safe, secure environment and you've got full access controls. And ensure your app is also protected from malicious input, like SQL injections, because, you know, any application is susceptible to being hacked. So I guess just
182 00:25:41.480 --> 00:25:48.060 Leah Levy: be wary; this is probably no different with regard to malicious input like that.
183 00:25:52.590 --> 00:25:53.465 Leah Levy: But
184 00:25:54.630 --> 00:26:01.731 Leah Levy: yeah, that's all I prepared for now, but happy to answer questions and go into into more detail on different bits.
185 00:26:03.040 --> 00:26:07.319 Leah Levy: but thank you for your time, and happy to answer any questions.
186 00:26:12.910 --> 00:26:15.524 Gabor Szabo: So thank you for the presentation.
187 00:26:17.190 --> 00:26:25.759 Gabor Szabo: I heard it the second time. I really liked the testing part. I always think about testing, whatever I try to show.
188 00:26:25.890 --> 00:26:26.970 Gabor Szabo: And
189 00:26:27.810 --> 00:26:38.989 Gabor Szabo: if anyone has questions, then please ask. Now we can also, after the recording, after we stop the recording, we can stay around and have a conversation without the recording.
190 00:26:39.240 --> 00:26:45.520 Gabor Szabo: But anyway, it seems that there are no questions now.
191 00:26:46.440 --> 00:26:50.600 Gabor Szabo: So, Leah, thank you very much for this presentation.
192 00:26:50.780 --> 00:26:56.499 Gabor Szabo: If you'd like to add anything more, I mean, I'll have the links below the video.
193 00:26:59.180 --> 00:27:05.545 Gabor Szabo: So thank you for giving this presentation. And thanks, everyone who was attending,
194 00:27:06.420 --> 00:27:11.800 Gabor Szabo: and everyone who was watching. So please remember to like the video and follow the channel, and see you
195 00:27:11.980 --> 00:27:15.530 Gabor Szabo: at one of our upcoming events.
196 00:27:15.960 --> 00:27:16.850 Gabor Szabo: Bye, bye.
197 00:27:18.140 --> 00:27:19.260 Leah Levy: Thanks, bye.
Speaker: Ray Lutz

Daffodil (data frames for optimized data inspection and logical processing) can create data frame instances similar to Pandas, but using conventional Python data types.
This means no conversion to/from the Pandas world, which I have found from testing has a very high overhead. In fact, unless you plan to do at least 30 repetitive column-based operations (like sums, etc.), you should just stay in the Python world and avoid the conversion time, and you win. But for many, time is not of the essence, or they stay in the Pandas world and never need any Python. The syntax is easy to use, and I am extending it to use a SQL database to allow for large table sizes and use of its robust joins, etc. The SQL part is under development and not released yet.
1 00:00:02.370 --> 00:00:06.679 Gabor Szabo: Hello and welcome to the Code Maven meetup group
2 00:00:06.860 --> 00:00:12.580 Gabor Szabo: and Youtube Channel. If you are watching this on Youtube, thank you very much for everyone who joined us.
3 00:00:13.080 --> 00:00:17.649 Gabor Szabo: and especially thanks Ray, for giving this talk.
4 00:00:17.790 --> 00:00:26.829 Gabor Szabo: My name is Gabor Sabo. I usually teach python and rust and help companies introduce these languages or introduce testing in these languages.
5 00:00:27.030 --> 00:00:33.439 Gabor Szabo: And I also organize these meetings because I think it's very important to share knowledge and
6 00:00:33.640 --> 00:00:38.700 Gabor Szabo: the Zoom meetings and online events allow us to
7 00:00:39.040 --> 00:00:47.660 Gabor Szabo: learn from each other, even if we are halfway around the world. And so with that, let me
8 00:00:48.120 --> 00:00:52.799 Gabor Szabo: give the word to you, Ray, and please introduce yourself and just go ahead.
9 00:00:53.030 --> 00:01:15.399 Gabor Szabo: One thing, sorry, just one thing. Those who are here, feel free to ask questions, either in the chat or just speak up. Ray will tell you how it's going to work. Just remember, we're recording this, and it's going to be on YouTube. So if you don't want to be on YouTube, then just write.
10 00:01:15.570 --> 00:01:17.069 Gabor Szabo: So thank you, it's yours.
11 00:01:17.660 --> 00:01:25.920 Ray Lutz: Okay, thank you so much, Gabor. Yes, my name is Ray Lutz. Let me share my screen here so we can get started.
12 00:01:27.660 --> 00:01:36.300 Ray Lutz: I am actually not that long-term of a Python user, you know, only about maybe 5, 6 years.
13 00:01:36.925 --> 00:01:43.150 Ray Lutz: And then I had quite a wealth of experience before that with other languages, including.
14 00:01:43.320 --> 00:01:58.208 Ray Lutz: you know, assembly language, C, you know, Perl, you know, JavaScript, all these other kinds of languages in one form or another, even though I do really like Python. So I did kind of settle on that
15 00:01:59.190 --> 00:02:00.310 Ray Lutz: for now.
16 00:02:00.670 --> 00:02:08.970 Ray Lutz: And so essentially, today, we're going to talk about this package called Daffodil.
17 00:02:09.110 --> 00:02:19.119 Ray Lutz: And it is data frames for optimized data inspection and logical processing. I came up with that later, you know, after we chose the name. But
18 00:02:19.290 --> 00:02:26.149 Ray Lutz: the idea is that you see df a lot. If you use Pandas, you're talking about data frames, df, and
19 00:02:26.300 --> 00:02:35.600 Ray Lutz: so we wanted something kind of like that, and we use daf. So, you know, throughout the code, if you see daf, you know that it's a Daffodil data frame
20 00:02:35.710 --> 00:02:37.769 Ray Lutz: instead of a pandas.
21 00:02:39.390 --> 00:02:43.949 Ray Lutz: And I have a Master's degree, mostly electronics. I did do
22 00:02:44.810 --> 00:02:51.010 Ray Lutz: various medical devices and document processing in my career.
23 00:02:52.170 --> 00:03:05.290 Ray Lutz: Most recently I'm developing AuditEngine, which is a ballot image auditing platform for checking elections, underneath Citizens Oversight, which is a nonprofit organization.
24 00:03:05.940 --> 00:03:11.629 Ray Lutz: Now, why, Daffodil, we already have pandas. So why would we need something new?
25 00:03:11.760 --> 00:03:20.499 Ray Lutz: Well, I needed a two-dimensional data type, sort of a table structure. And so I started using Pandas
26 00:03:21.433 --> 00:03:26.579 Ray Lutz: for almost everything. You know, these 2-dimensional tables are really handy,
27 00:03:26.990 --> 00:03:31.890 Ray Lutz: but it turns out that pandas is mostly designed for numerics and
28 00:03:33.630 --> 00:03:35.880 Ray Lutz: it uses numpy under the hood.
29 00:03:37.400 --> 00:03:46.650 Ray Lutz: and so it's slow, really slow, for row-based operations, and some of them are now not even allowed. So you can't do an append
30 00:03:46.920 --> 00:03:51.099 Ray Lutz: of a Pandas row. Seems like a basic thing you might want to do;
31 00:03:51.290 --> 00:03:58.989 Ray Lutz: that's now not supported at all in Pandas, because they know it's such a disaster.
32 00:03:59.756 --> 00:04:04.090 Ray Lutz: So then you have to go over and use something else if you want to do that sort of thing.
33 00:04:05.070 --> 00:04:06.400 Ray Lutz: and
34 00:04:07.280 --> 00:04:16.359 Ray Lutz: And also apply. They say, don't use apply, and apply is kind of a handy thing, which means you go row by row, and you apply some function to it
35 00:04:16.760 --> 00:04:18.329 Ray Lutz: at each row.
36 00:04:18.519 --> 00:04:22.530 Ray Lutz: And so you can't do that either, they said. We're deprecating all these things.
37 00:04:22.740 --> 00:04:27.689 Ray Lutz: I think you can still do apply. But they say, you know, it's really not recommended at all.
38 00:04:28.470 --> 00:04:29.809 Ray Lutz: And then
39 00:04:31.010 --> 00:04:39.919 Ray Lutz: it turns out also, when we're using files that are in kind of weird formats, Pandas assumes a lot when it reads them in,
40 00:04:40.090 --> 00:04:46.950 Ray Lutz: and you have to jump through a lot of hoops to get it to just read them in as-is, without doing anything,
41 00:04:47.090 --> 00:04:49.320 Ray Lutz: and then convert things as you go.
42 00:04:50.075 --> 00:04:53.209 Ray Lutz: It has some other problems, too, and we'll get into that.
43 00:04:53.360 --> 00:05:14.250 Ray Lutz: So this is when I started looking for another data type, and I had various ones that I started using. And I ended up standardizing on this type of two-dimensional data frame, which is based on a list of lists. I call it a lol; it doesn't mean laughing out loud, it's a list-of-lists type.
44 00:05:14.800 --> 00:05:17.030 Ray Lutz: And so it's a
45 00:05:17.430 --> 00:05:27.109 Ray Lutz: it's a Python list, and in each of these lists you have an additional list, and it's rectangular in form.
46 00:05:27.360 --> 00:05:33.910 Ray Lutz: So every single row is the same length, so it's a rectangular
47 00:05:34.130 --> 00:05:39.620 Ray Lutz: array, but it's not the array type. It's a list of lists. So it's easy to add to.
48 00:05:39.780 --> 00:05:53.780 Ray Lutz: relatively easy to splice and insert rows or columns. You can do a lot of things fairly easily; inserting rows is easy, columns not quite so easy. But
49 00:05:55.260 --> 00:05:58.310 Ray Lutz: it's fairly malleable. And then
50 00:05:58.450 --> 00:06:07.459 Ray Lutz: also, you can put anything at all in any one of these cells, and Python will handle it just fine, so you could put a whole Pandas array in here if you want.
51 00:06:07.910 --> 00:06:13.879 Ray Lutz: you could put a whole numpy array of a million things in one cell if you want. Okay, so that's
52 00:06:14.000 --> 00:06:15.800 Ray Lutz: it's very versatile that way.
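The list-of-lists ("lol") structure Ray describes can be sketched in a few lines of plain Python (the values are made up for illustration):

```python
# A "lol" (list of lists): plain Python rows, rectangular in shape.
lol = [
    [1, 2, 3],
    [4, 5, 6],
    [7, "anything", [10, 20, 30]],  # a cell can hold any Python object
]

# Every row has the same length, so it behaves like a table.
assert all(len(row) == len(lol[0]) for row in lol)

# Rows are cheap to append or insert, unlike a NumPy-backed frame.
lol.append([8, 9, 10])
lol.insert(1, [0, 0, 0])
print(len(lol))  # → 5
```

Because each row is an ordinary Python list, appends and inserts are cheap list operations rather than whole-array copies.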
53 00:06:16.280 --> 00:06:27.199 Ray Lutz: So the basic thing is that you have this array, which is just numbered, and the numbers here don't stick to the columns and rows like they do in Pandas.
54 00:06:28.680 --> 00:06:36.939 Ray Lutz: They float, like they would in a regular spreadsheet. So if you move the rows around, the numbers of the rows
55 00:06:37.130 --> 00:06:42.429 Ray Lutz: are going to stay in the same order, even though you might have moved something up there, and so forth.
56 00:06:42.870 --> 00:06:47.700 Ray Lutz: But then you can also optionally have names for each column.
57 00:06:47.940 --> 00:06:56.250 Ray Lutz: data types for the columns, in a separate data types object that explains what those are,
58 00:06:56.420 --> 00:07:20.869 Ray Lutz: and then optional row keys. Okay, these are both dictionaries. So the header dictionary (hd) and the row keys dictionary are a special type of dictionary which gives you the number of the column, or the number of the row, in the dictionary. So I don't know what you call this exactly, but I ended up calling it a keyed list.
59 00:07:21.060 --> 00:07:27.140 Ray Lutz: In other words, this is the key.
60 00:07:27.430 --> 00:07:31.920 Ray Lutz: We'll go into the keyed list later, but essentially this is the key,
61 00:07:32.310 --> 00:07:36.460 Ray Lutz: and this is the number that refers to an item in a list.
62 00:07:36.860 --> 00:07:37.940 Ray Lutz: And
63 00:07:39.230 --> 00:07:47.170 Ray Lutz: so your dictionary would have a key, and the values are always 0, 1, 2, 3, 4, and so forth.
64 00:07:47.350 --> 00:08:04.169 Ray Lutz: And there isn't a standard function for this in Python. There's dict.fromkeys, where you can give it a single value, and it can have Nones all the way through, or zeros, whatever you want. But it doesn't build the sequential values automatically. But it's easy to make.
65 00:08:04.410 --> 00:08:06.249 Ray Lutz: So this is what it looks like.
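The "keyed list" Ray describes, a dictionary mapping each name to its position, is a one-line comprehension. This is a sketch of the idea, not necessarily Daffodil's actual internals; the column and row names are made up.

```python
# A "keyed list": each name maps to its index 0, 1, 2, ...
def keyed_list(names):
    return {name: i for i, name in enumerate(names)}

hd = keyed_list(["A", "B", "C"])    # header dictionary: column -> index
kd = keyed_list(["row1", "row2"])   # row keys dictionary: key -> index

lol = [[1, 2, 3],
       [4, 5, 6]]

# Looking up a cell by names is two steps: key -> index -> value.
value = lol[kd["row2"]][hd["B"]]
print(value)  # → 5
```

This is the missing "dict.fromkeys with sequential values" he mentions: `dict.fromkeys` fills every key with the same value, whereas here each key gets its own index.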
66 00:08:09.330 --> 00:08:20.650 Ray Lutz: Now, as I said, the Row keys and the Header Dictionary are dictionaries, but these are all optional. You could start with nothing, just an array of list of lists, and you still get all the functionality.
67 00:08:20.950 --> 00:08:23.839 Ray Lutz: But you would have to be using these indexes here.
68 00:08:24.280 --> 00:08:25.830 Ray Lutz: All right, let's go on to next.
69 00:08:26.100 --> 00:08:30.579 Ray Lutz: So essentially, my problem was this. If
70 00:08:31.090 --> 00:08:36.630 Ray Lutz: you want to use Pandas, you import Pandas here, and you say, I want to start a new data frame,
71 00:08:37.280 --> 00:08:43.970 Ray Lutz: and let's say you go through a bunch of URLs, and you harvest stuff from web pages, and you want to append to this array.
72 00:08:45.200 --> 00:08:46.570 Ray Lutz: If you say
73 00:08:46.690 --> 00:08:55.020 Ray Lutz: my_dataframe.append with the web page metadata: you just take a dictionary, and you want to add it to the bottom of the Pandas array.
74 00:08:55.250 --> 00:08:59.530 Ray Lutz: It's horrible! And in fact, this has been banned by
75 00:08:59.970 --> 00:09:02.950 Ray Lutz: the Pandas people. You can't append anymore.
76 00:09:03.190 --> 00:09:06.760 Ray Lutz: They just said this doesn't exist. That's how bad it is.
77 00:09:06.860 --> 00:09:09.240 Ray Lutz: Now, what were they doing? Why is it so bad?
78 00:09:09.420 --> 00:09:14.370 Ray Lutz: It's because what pandas is is, let me go back a second.
79 00:09:15.120 --> 00:09:18.980 Ray Lutz: I gotta. What is it? Shift to go back control?
80 00:09:20.630 --> 00:09:22.279 Ray Lutz: I gotta go with the keys.
81 00:09:23.050 --> 00:09:31.609 Ray Lutz: Okay, so what Pandas is, is essentially numpy arrays, vertically, right here in a dictionary,
82 00:09:32.010 --> 00:09:36.450 Ray Lutz: where you have the name in the dictionary, and the value
83 00:09:36.650 --> 00:09:41.830 Ray Lutz: is a numpy array, vertically. And you've got to think of it that way, and they're all the same length.
84 00:09:42.370 --> 00:09:45.710 Ray Lutz: So the numpy array has data in it.
85 00:09:48.260 --> 00:09:57.079 Ray Lutz: In numpy arrays, each value is rammed up against the next. There's nothing else, unlike Python, where
86 00:09:57.270 --> 00:10:06.190 Ray Lutz: even an integer, or whatever you have in here, takes quite a bit of overhead. Usually it'll be like, I think, 28 bytes just to represent an integer. There's a lot of overhead generally.
87 00:10:06.540 --> 00:10:12.820 Ray Lutz: And if you put a dictionary in each row, then you have the keys for each one. I'll get into that in a second.
88 00:10:13.010 --> 00:10:19.259 Ray Lutz: My point, though, is that in a Pandas array you have the name, and you have a
89 00:10:20.950 --> 00:10:28.160 Ray Lutz: numpy array, and if you want to add to the bottom, you have to create all new numpy arrays, or add to each one.
90 00:10:28.480 --> 00:10:32.420 Ray Lutz: They don't let you just add to each one. They create a whole new array every time,
91 00:10:32.770 --> 00:10:37.929 Ray Lutz: so they copy it over and add to the bottom, copy it over, add to the bottom, copy it over. That's how they do it.
92 00:10:38.240 --> 00:10:39.739 Ray Lutz: And so it takes a long time
93 00:10:41.630 --> 00:10:51.909 Ray Lutz: if you're appending. So they've basically disallowed this. So if you're not going to do that, then you can do this: you can say, I want to make a list of dictionaries. I call it a lod.
94 00:10:52.460 --> 00:10:56.850 Ray Lutz: Okay? And it's a list of dictionaries with string keys and anything inside.
95 00:10:57.230 --> 00:11:05.030 Ray Lutz: And then you read the web page and you put your metadata dict and you append to the list of dictionaries. This will work fine.
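The list-of-dictionaries append pattern just described, sketched with a made-up `read_webpage` helper and URLs (Python list appends are amortized O(1), so the loop stays fast no matter how many rows accumulate):

```python
# The list-of-dictionaries ("lod") pattern: appending a dict per page is
# cheap, unlike DataFrame.append. read_webpage() is a made-up stand-in
# for whatever harvests the page metadata.
def read_webpage(url: str) -> dict:
    return {"url": url, "title": f"Title of {url}", "length": len(url)}

lod: list = []
for url in ["a.example", "b.example", "c.example"]:
    lod.append(read_webpage(url))  # amortized O(1) per append

print(len(lod), lod[0]["url"])  # → 3 a.example
```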
96 00:11:06.100 --> 00:11:06.680 Gabor Szabo: And be fast.
97 00:11:06.840 --> 00:11:15.420 Gabor Szabo: Sorry, let me just say something related to this. It's interesting, because, I think, in both Go and in Rust
98 00:11:16.160 --> 00:11:21.789 Gabor Szabo: you can allocate more space for these arrays,
99 00:11:22.090 --> 00:11:47.780 Gabor Szabo: even if you don't use it. So you can say: okay, at the end I'm going to have a hundred or 1,000 items in these vectors or arrays; right now I have one item in there. The memory is already allocated, so you can append up to 1,000 without this overhead of recreating the whole array.
100 00:11:48.210 --> 00:11:58.869 Ray Lutz: They could have done a better job in Pandas, because they would not need to copy over the whole thing. I didn't even know they were doing that when I first started,
101 00:11:59.030 --> 00:12:07.589 Ray Lutz: and so I noticed, when the array started to get pretty big, that it just started to slow down to a snail's pace. And so, what is this? Well,
102 00:12:07.710 --> 00:12:19.159 Ray Lutz: in the documentation it says: don't do this. What you're going to have to do is create something else, a list of dictionaries, and then in one fell swoop take your list of dictionaries and convert it into a data frame,
103 00:12:19.460 --> 00:12:21.690 Ray Lutz: and then it'll be reasonably fast.
104 00:12:22.100 --> 00:12:24.400 Ray Lutz: But this turns out, is very slow.
105 00:12:26.672 --> 00:12:33.759 Ray Lutz: But it's way faster than the appending. Okay, so if you're going through and appending to the bottom of the array,
106 00:12:35.960 --> 00:12:42.792 Ray Lutz: this will be faster. But then this part right here is actually kind of slow. But if that's all you're gonna do, and you're just gonna write it out to a
107 00:12:43.140 --> 00:12:44.440 Ray Lutz: CSV file,
108 00:12:44.650 --> 00:12:50.110 Ray Lutz: then you've just wasted a lot of time, because you didn't need to go through this here;
109 00:12:50.700 --> 00:13:01.689 Ray Lutz: you could just write it straight out. But if you did do a couple of things with it before you did that, you know, maybe you summed everything one time, and you added everything up,
110 00:13:02.681 --> 00:13:08.649 Ray Lutz: maybe you did some other manipulation, you thought being in the Pandas world was a good idea,
111 00:13:09.272 --> 00:13:11.810 Ray Lutz: But then you had this overhead of doing this.
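Writing the list of dictionaries straight out as CSV, skipping Pandas entirely, is a few lines with the standard library. This sketch uses an in-memory buffer and made-up data in place of a real file:

```python
import csv
import io

# Made-up harvested metadata, as a list of dictionaries.
lod = [
    {"url": "a.example", "length": 9},
    {"url": "b.example", "length": 9},
]

# Write the list of dicts straight out as CSV, no DataFrame in between.
buf = io.StringIO()  # stands in for an open file
writer = csv.DictWriter(buf, fieldnames=list(lod[0].keys()))
writer.writeheader()
writer.writerows(lod)

print(buf.getvalue().splitlines()[0])  # → url,length
```

For a real file, replace the `StringIO` with `open("out.csv", "w", newline="")`.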
112 00:13:12.020 --> 00:13:16.770 Ray Lutz: So this works. But it turns out this is very slow, and when you time it,
113 00:13:17.350 --> 00:13:29.710 Ray Lutz: going from a list of dictionaries into Pandas: this is a 1-million-integer table, a thousand by a thousand. Okay, that's the size of table that we're using for our benchmark.
114 00:13:30.230 --> 00:13:40.079 Ray Lutz: Now, would Pandas normally have a thousand columns? No, right? Because most data tables have very few columns.
115 00:13:40.280 --> 00:13:43.350 Ray Lutz: Usually, yeah, 20 to 30 columns is a big one.
116 00:13:44.218 --> 00:13:49.599 Ray Lutz: The data tables I'm working with, they have a lot of columns. Okay, like,
117 00:13:49.740 --> 00:13:57.470 Ray Lutz: something with 5,000 columns is pretty big, but you'll see stuff under that, and a lot of it at 300 to 400 columns.
118 00:13:57.570 --> 00:14:00.429 Ray Lutz: So a thousand by 1,000, not unusual, that I see.
119 00:14:00.850 --> 00:14:12.250 Ray Lutz: and when you convert this in Daffodil, you take the list of dictionaries and make a list of lists, you know, formatted for Daffodil, it takes 139.
120 00:14:13.810 --> 00:14:15.009 Ray Lutz: What is it?
121 00:14:15.770 --> 00:14:22.660 Ray Lutz: Microseconds? Milliseconds, I believe. Pandas takes 5,600, more than 5 seconds,
122 00:14:23.640 --> 00:14:25.830 Ray Lutz: more than 5 seconds to convert
123 00:14:25.970 --> 00:14:31.070 Ray Lutz: it into pandas. So it is a ridiculous bottleneck.
124 00:14:31.830 --> 00:14:32.700 Ray Lutz: Okay.
125 00:14:33.350 --> 00:14:50.620 Ray Lutz: It takes 139. Like, look at the difference here. And if you multiply this out: even though Pandas is really, really fast to do certain things, like summing columns is ridiculously fast compared to Daffodil. I can sum columns here at 191 ms;
126 00:14:50.720 --> 00:14:52.359 Ray Lutz: Pandas takes only 4.
127 00:14:52.810 --> 00:14:56.289 Ray Lutz: So that's a big difference. So you do a big savings here.
128 00:14:56.430 --> 00:15:09.510 Ray Lutz: If you do a lot of these, then this might make up for this big difference here, but it takes a lot. It takes at least 30 of these all-column operations, sums, standard deviations; you've got to do 30 of those
129 00:15:10.120 --> 00:15:12.950 Ray Lutz: before you make up for converting it into Pandas.
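The break-even figure can be checked with the benchmark numbers quoted in the talk for the 1,000 x 1,000 integer table (a rough sketch; the exact timings depend on hardware):

```python
# Benchmark numbers quoted in the talk, in milliseconds.
convert_to_pandas_ms = 5600    # lod -> Pandas DataFrame conversion
convert_to_daffodil_ms = 139   # lod -> Daffodil lol conversion
pandas_sum_ms = 4              # sum all columns, in Pandas
daffodil_sum_ms = 191          # sum all columns, in Daffodil

# Pandas wins once its conversion cost plus n fast operations beats
# Daffodil's cheap conversion plus n slow operations:
#   5600 + 4n < 139 + 191n  =>  n > 5461 / 187
break_even = (convert_to_pandas_ms - convert_to_daffodil_ms) / (
    daffodil_sum_ms - pandas_sum_ms
)
print(round(break_even, 1))  # → 29.2
```

About 29 full-table operations, which matches the "at least 30" rule of thumb above.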
130 00:15:14.670 --> 00:15:30.780 Ray Lutz: So for just a few things, like summing columns, or just manipulating the data a little bit, you're just better off not getting into Pandas, because of this ridiculous conversion factor. Now, I tried to get around this problem here,
131 00:15:31.560 --> 00:15:34.549 Ray Lutz: and there's also another problem with Pandas.
132 00:15:34.760 --> 00:15:38.289 Ray Lutz: This is integers; as soon as you add a string,
133 00:15:39.620 --> 00:15:49.650 Ray Lutz: and the size here is 38 MB, I believe it's megabytes, for
134 00:15:49.750 --> 00:15:53.450 Ray Lutz: a million integers, and
135 00:15:55.600 --> 00:16:05.859 Ray Lutz: Pandas is only at 9.3, so Pandas is quite a bit more compact, right, if you have just integers or just floats.
136 00:16:06.190 --> 00:16:12.950 Ray Lutz: But if you get a string, then this goes up and becomes quite a bit larger, by 10 times,
137 00:16:13.730 --> 00:16:18.089 Ray Lutz: quite a bit larger than a Daffodil table, which really doesn't go up very much.
138 00:16:19.250 --> 00:16:26.170 Ray Lutz: Okay. So then, you know, numpy, we can convert things to numpy really quickly.
139 00:16:26.440 --> 00:16:33.029 Ray Lutz: 48 ms going to numpy, and from numpy doesn't take very long.
140 00:16:33.290 --> 00:16:36.139 Ray Lutz: and then you can manipulate in numpy
141 00:16:36.630 --> 00:16:47.929 Ray Lutz: one column at a time, or add columns together, or sum the columns. Whatever you want to do, you can then do it directly in numpy and skip over Pandas. Pandas is also a big beast;
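The numpy round trip Ray describes, sketched on a tiny made-up table:

```python
import numpy as np

# Convert a list of lists to NumPy for column math, then back,
# skipping Pandas entirely.
lol = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

arr = np.array(lol)         # one contiguous integer array
col_sums = arr.sum(axis=0)  # column-wise sums, done in NumPy
back = arr.tolist()         # and straight back to plain lists

print(col_sums.tolist())  # → [12, 15, 18]
```

The conversion is fast precisely because a rectangular all-numeric list of lists maps directly onto one contiguous array, with none of Pandas's per-column bookkeeping.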
142 00:16:48.170 --> 00:16:52.400 Ray Lutz: It takes a long time to load, so if you use daffodil.
143 00:16:52.890 --> 00:17:13.309 Ray Lutz: you import daffodil, and then you create a Daffodil array. (You can't click on this, or it goes to the next thing; I can't highlight, for that reason.) But you create a Daffodil array, my_daf, and then I go through the URLs, and I get this stuff, and I append the dictionary to the Daffodil array,
144 00:17:13.369 --> 00:17:25.439 Ray Lutz: done. And then I simply write it out directly, and I skip over this thing here. Now, I was in a bad habit of using these Pandas arrays for almost everything, because they're so handy.
145 00:17:25.760 --> 00:17:37.760 Ray Lutz: But little did I know that my code was getting to be really slow, because the conversion of the list of dictionaries over to Pandas was taking a long time, every single time, and then back.
146 00:17:40.047 --> 00:17:45.620 Ray Lutz: So this is when I came up with Daffodil, and, you know, what it provides is
147 00:17:46.060 --> 00:17:51.780 Ray Lutz: a way of also indexing into these. Now, if you just used a list of dictionaries.
148 00:17:52.210 --> 00:18:06.820 Ray Lutz: if you think about it, for every single row the keys are repeated, and then in the next row you repeat the keys, and you repeat the keys, repeat the keys. So every single row has a lot of overhead, because the keys are being repeated.
149 00:18:09.260 --> 00:18:22.100 Ray Lutz: So when you crunch that down, you know, if we look back at the data type here, you see in the row here that for each row you just have a list of values,
150 00:18:22.260 --> 00:18:30.510 Ray Lutz: and you don't have the keys. The keys are there one time only; you don't need them in every single row. So you crunch all the keys up into one row,
151 00:18:31.090 --> 00:18:40.619 Ray Lutz: and then the indexing goes 2 times. So first you get the index of the list, and then you index into the list to get the data item. So it's one more step to get to it.
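A rough way to see the per-row key overhead Ray describes is to compare a list of dicts with a list of lists using `sys.getsizeof` (a shallow measure that ignores the shared key strings and the values themselves, but it shows the direction):

```python
import sys

# A list of dicts repeats the key structure on every row; a list of
# lists stores the same values without per-row keys.
lod = [{"a": i, "b": i, "c": i} for i in range(1000)]
lol = [[i, i, i] for i in range(1000)]

dict_bytes = sum(sys.getsizeof(row) for row in lod)
list_bytes = sum(sys.getsizeof(row) for row in lol)

# Per-row dicts carry hash-table overhead that per-row lists do not.
print(list_bytes < dict_bytes)  # → True
```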
152 00:18:42.290 --> 00:18:48.650 Ray Lutz: But these are lists, and there's a lot of benefits to that
153 00:18:51.000 --> 00:18:59.080 Ray Lutz: First of all, we can use this type of indexing in Python, which they provide as part of their infrastructure, so that you can write code
154 00:18:59.240 --> 00:19:02.600 Ray Lutz: that uses row, column indexing,
155 00:19:03.040 --> 00:19:07.930 Ray Lutz: or you can use it for anything. But in this case the first index is row, and then column,
156 00:19:10.398 --> 00:19:16.771 Ray Lutz: so the row and column can be integers, and that can be either the array index
157 00:19:17.550 --> 00:19:18.500 Ray Lutz: or
158 00:19:20.240 --> 00:19:28.510 Ray Lutz: it can be a key, if you want it to be. But if it's an integer, it assumes it's going to be the array index, and not
159 00:19:28.790 --> 00:19:31.040 Ray Lutz: going through the dictionaries.
160 00:19:31.720 --> 00:19:37.109 Ray Lutz: If you want an integer to go through the dictionaries, you have to use a method. But
161 00:19:38.641 --> 00:19:43.559 Ray Lutz: if it's a string, then it assumes that it's a key into the dictionaries,
162 00:19:44.120 --> 00:19:57.309 Ray Lutz: and it can be a list of integers which, then, is the list of array indices that you want to choose. It can be a list of strings. It can be a list of string keys, so you can pull out individual rows, individual columns.
163 00:19:57.570 --> 00:20:03.520 Ray Lutz: Whatever you want, you can index, an individual position, and the array
164 00:20:03.790 --> 00:20:08.950 Ray Lutz: you can slice and dice it you can give it a
165 00:20:09.200 --> 00:20:14.760 Ray Lutz: a range of indexes like 5 to 10, which gives you 5, 6, 7, 8, 9,
166 00:20:15.510 --> 00:20:20.010 Ray Lutz: or you can do a range of keys of a closed
167 00:20:20.340 --> 00:20:28.939 Ray Lutz: kind, for which we use a tuple. So it's like from C to AB, so from, like, column C to column AB,
168 00:20:29.140 --> 00:20:36.129 Ray Lutz: because you don't know what's after AB; you can't say go to the next one and back up one. You have to give it a closed range,
169 00:20:36.580 --> 00:20:38.570 Ray Lutz: and so we do it like that.
170 00:20:39.070 --> 00:20:44.780 Ray Lutz: Now, you can leave out the column if you want to use all columns, kind of like star in a SQL expression.
171 00:20:45.670 --> 00:20:49.729 Ray Lutz: You can just leave that out, and then talk about the row,
172 00:20:50.160 --> 00:21:03.749 Ray Lutz: and you can index in. If you append, the things can be in a different order, and they will always go in correctly, to the right place. So here I have it scrambled, where C is first, and it ends up putting C in the right place.
173 00:21:03.910 --> 00:21:07.139 Ray Lutz: And there's all kinds of examples here of how you would.
174 00:21:07.728 --> 00:21:28.079 Ray Lutz: Take that array that we start with here: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. I don't know why I did that. But then you end up saying, I want to use rows 0 and 1, because that's a slice; you get those. You can get the columns the same way. You can use the names of the columns,
175 00:21:28.200 --> 00:21:36.639 Ray Lutz: take all rows and names of columns here, in a list, and so forth. We also offer
176 00:21:36.830 --> 00:21:42.830 Ray Lutz: a list of tuples. I'm sorry, a list of ranges, which is kind of handy sometimes,
177 00:21:43.690 --> 00:21:45.829 Ray Lutz: and then you can
178 00:21:46.100 --> 00:21:53.070 Ray Lutz: get. You can set a value. You can say, I want to set this value to the entire array. It sets the whole thing
179 00:21:53.270 --> 00:22:02.466 Ray Lutz: You can set the first few columns, you can slice it and do this. So this is a setting, so you can set,
180 00:22:02.940 --> 00:22:11.539 Ray Lutz: and you can also pop in a list like, if you have a list, you want to put that in the column, you put that in and put a list into the row.
181 00:22:12.385 --> 00:22:17.500 Ray Lutz: You can put another daffodil array in, and it will put that in,
182 00:22:18.330 --> 00:22:24.870 Ray Lutz: for, you know, whatever the rectangular region is, it's going to put that in. All those things work.
183 00:22:26.600 --> 00:22:33.980 Ray Lutz: Now, there's a return mode which is optional. But we're gonna end up putting this into like when you when you do this
184 00:22:35.236 --> 00:22:36.990 Ray Lutz: indexing here.
185 00:22:37.850 --> 00:22:50.590 Ray Lutz: you want to get the value out in this case, because you want to multiply 2 values together. If you set the return mode to val, then it'll give you the value directly. If you just did this, you would get a daffodil array
186 00:22:51.560 --> 00:22:53.760 Ray Lutz: of the cell 0 1 1.
187 00:22:54.090 --> 00:23:01.039 Ray Lutz: I'm sorry, 1 comma 0. So it would be row... row 0, row 1, and this would be 5, right?
188 00:23:01.350 --> 00:23:26.629 Ray Lutz: and you would get an array of 5, one thing in the middle of the array. Well, you don't want that, you just wanted the value. So if you say return the value, then you can just multiply it by this value over here, and the one at 2, 2 is... 0, 1... 0, 1, 2... 0, 1, 2 is 10. Multiply those together, 5 times 10, and put that in the cell 2 comma 1, and down here,
189 00:23:26.890 --> 00:23:29.180 Ray Lutz: 0, 1, 2, 1 is 50.
190 00:23:29.430 --> 00:23:33.750 Ray Lutz: So we multiplied those values together and put it in here. So it's all malleable.
191 00:23:33.890 --> 00:23:38.526 Ray Lutz: You can do it like that just like a spreadsheet, and then
192 00:23:39.810 --> 00:23:43.939 Ray Lutz: we can insert columns here. So we're going to put one in first. So
193 00:23:44.160 --> 00:23:50.149 Ray Lutz: if you add a column like house, car and boat, and we call that category,
194 00:23:52.740 --> 00:23:59.119 Ray Lutz: then we also say we want to set the key field to category. Now, what it's done is
195 00:23:59.310 --> 00:24:08.039 Ray Lutz: what it does is: one of the columns, you can say, that's going to be my key field, and then it puts it in that dictionary lookup
196 00:24:08.200 --> 00:24:09.600 Ray Lutz: called the...
197 00:24:11.840 --> 00:24:20.049 Ray Lutz: Dk? Let's see... it's called the kd, a key dictionary. So this is a dictionary lookup, so super fast
198 00:24:20.260 --> 00:24:21.729 Ray Lutz: if you have a long one.
199 00:24:22.600 --> 00:24:30.250 Ray Lutz: but it has to be... if you do this, you can't have repeated values in here. It's gonna hit the... the first one that it sees.
200 00:24:31.774 --> 00:24:37.559 Ray Lutz: And so here, what we did was we add additional records, and it's going to add them in there
201 00:24:37.760 --> 00:24:38.690 Ray Lutz: with
202 00:24:40.260 --> 00:24:45.879 Ray Lutz: Here, you see the category is in a different order, and it still puts it in
203 00:24:46.530 --> 00:24:54.150 Ray Lutz: and if we have a double in there, it's going to modify the one that's there.
204 00:24:54.560 --> 00:24:57.849 Ray Lutz: So if you have, if you index in and you say?
205 00:24:58.403 --> 00:25:02.540 Ray Lutz: house, car, boat, and then house, car, boat, mall, van, condo.
206 00:25:02.680 --> 00:25:04.559 Ray Lutz: I think I have it in the next one.
207 00:25:05.330 --> 00:25:15.550 Ray Lutz: where, if you say house and you give it new values. It's going to modify the one that's there. Okay, so it doesn't add another one called house.
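The key-field behavior described above, lookup by key plus update-in-place on duplicates, can be sketched in plain Python (this is a conceptual illustration with hypothetical names, not the actual daffodil implementation):

```python
# Sketch of the key-field idea: one column is designated the key, and a
# dict ("kd") maps each key value to its row index, so lookups are O(1).
# Keys must be unique; re-adding an existing key modifies the row that is
# already there instead of appending a duplicate.
rows = [["house", 10], ["car", 20], ["boat", 30]]
kd = {row[0]: i for i, row in enumerate(rows)}   # the key dictionary

def upsert(row):
    key = row[0]
    if key in kd:                  # duplicate key: modify in place
        rows[kd[key]] = row
    else:                          # new key: append and index it
        kd[key] = len(rows)
        rows.append(row)

upsert(["house", 99])    # updates the existing 'house' row
upsert(["van", 40])      # appends a new row
```

A lookup like `rows[kd["van"]]` then retrieves a record by key without scanning the table.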
208 00:25:16.650 --> 00:25:23.520 Ray Lutz: Then you can select by using a select where statement.
209 00:25:23.980 --> 00:25:34.650 Ray Lutz: This is where lambda statements are really useful, where you just say lambda row, and you say the rows where the C value is greater than 20: I want to select those rows.
210 00:25:35.270 --> 00:25:42.259 Ray Lutz: It makes a new daffodil table, but it doesn't make new rows.
211 00:25:42.640 --> 00:25:47.410 Ray Lutz: These are actually the rows from this table just referenced over here.
212 00:25:48.110 --> 00:25:53.340 Ray Lutz: So it uses them by reference, just like Python does all the time.
213 00:25:53.460 --> 00:26:04.570 Ray Lutz: So you're not actually creating a new whole table. These are not unique values. These are actually the same list values from over here, put in over here so that you've just selected them.
214 00:26:04.770 --> 00:26:10.030 Ray Lutz: And so this daffodil table only has a list of references to the same data.
215 00:26:10.310 --> 00:26:15.149 Ray Lutz: all right, so that this way these selections are very fast because it doesn't do any copying
216 00:26:15.270 --> 00:26:17.197 Ray Lutz: unless you wanted to.
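The by-reference selection being described can be shown with plain Python lists (a conceptual sketch, not the daffodil API itself):

```python
# Selecting rows with a predicate produces a new "table" whose rows are
# references to the same list objects; no row data is copied.
table = [
    ["house", 10, 25],
    ["car",   15, 30],
    ["boat",  20, 21],
]

# "select where": keep rows whose third column is greater than 21
selected = [row for row in table if row[2] > 21]

# The selected table holds the very same list objects, not copies:
same = selected[0] is table[0]        # identity, not just equality

# Mutating a selected row therefore mutates the source table too:
selected[0][1] = 99
```

This is why such selections are fast, and also why you need an explicit copy if you want the selection to be independent of the source.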
217 00:26:17.820 --> 00:26:18.740 Ray Lutz: Okay,
218 00:26:20.300 --> 00:26:30.059 Ray Lutz: you can select a record by the key. You can also just do it this way: put the key into the indexing, and then say you want it to be a dictionary.
219 00:26:30.380 --> 00:26:35.700 Ray Lutz: Now, what we're going to end up doing is putting a comma in here. Whoops, can't click.
220 00:26:35.850 --> 00:26:44.570 Ray Lutz: You put a comma in here and put rtype, return type, equals dict inside here, instead of having .to_dict,
221 00:26:44.730 --> 00:26:51.919 Ray Lutz: because it's handier to know, like in this mode, if you want it to be a list.
222 00:26:52.340 --> 00:26:57.040 Ray Lutz: that's what is already in the array: the list. So if you want the list out,
223 00:26:57.220 --> 00:26:59.189 Ray Lutz: You don't want to convert it to
224 00:27:00.940 --> 00:27:09.149 Ray Lutz: Say a dictionary or a whole array, because you're going to get a whole array out of this selection. One row. But it's going to be a daffodil array data type.
225 00:27:11.580 --> 00:27:21.490 Ray Lutz: So if you want to get a list out of it, it's nice to know ahead of time. And we'll show you that in a second, because there's another thing I want to show you, which is called a keyed list.
226 00:27:23.220 --> 00:27:30.630 Ray Lutz: So you can get different types out. If you have to_dict, to_list, to_value... you can just print, or there's other things: to numpy,
227 00:27:31.125 --> 00:27:34.909 Ray Lutz: to pandas. You know, there's other things you can convert to here.
228 00:27:36.590 --> 00:27:39.989 Ray Lutz: So a common usage pattern is to process things by row.
229 00:27:40.752 --> 00:27:46.529 Ray Lutz: Where you would have somehow you're transforming the original row into a new row.
230 00:27:47.320 --> 00:27:50.870 Ray Lutz: and then you append the new row to the new daffodil table.
231 00:27:51.000 --> 00:28:09.870 Ray Lutz: Now, depending upon what the transform does, it might give you the same data again, with just something modified. It might mutate that row, and you would get it back here. When you append this, it's the same row as the original with a mutation. Guess what? That's going to modify the old row, so you don't necessarily want to do that. If you're making a mutation
232 00:28:13.400 --> 00:28:19.369 Ray Lutz: and then you would append it to the new table.
233 00:28:19.590 --> 00:28:27.799 Ray Lutz: and then you can do, you can put it out. It turns out you don't have to flatten. We've discovered later, and I want to show you that in a second it automatically flattens.
234 00:28:29.580 --> 00:28:37.630 Ray Lutz: so you can just apply. So if you have a transform row function, you just say, apply the function and it applies it, row by row
235 00:28:37.780 --> 00:28:41.289 Ray Lutz: and then gives you a new daffodil table. So you just go. You can do it this way.
236 00:28:42.380 --> 00:28:54.709 Ray Lutz: And you can also then just apply the data types at the end. If you want like, you can read it in, apply the data types, apply the transform. You don't need to flatten it anymore. Because I'll show you why most of the time.
237 00:28:54.980 --> 00:28:57.570 Ray Lutz: And then you just say to_csv and write it out.
238 00:28:57.960 --> 00:29:01.169 Ray Lutz: So here's where you write it in. You transform, row by row.
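The row-by-row transform pattern described above, and the mutation hazard that comes with shared references, can be sketched in plain Python (hypothetical names; not the actual daffodil apply API):

```python
# Transform each source row into a new row and append it to a new table.
# If the transform mutated the row it received, the original table would
# change too, because rows are shared by reference; copying the row first
# keeps the source intact.
def transform_row(row):
    new_row = dict(row)               # shallow copy so the original survives
    new_row["total"] = new_row["a"] + new_row["b"]
    return new_row

source = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
result = [transform_row(row) for row in source]
```

An apply-style helper would just run this loop for you and hand back the new table.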
239 00:29:02.940 --> 00:29:14.319 Ray Lutz: If you're doing this, daffodil works really well, okay? And this same sort of transform can be applied to... I'll show you in a second, when we're expanding this to use a SQL
240 00:29:15.330 --> 00:29:16.210 Ray Lutz: backing.
241 00:29:18.610 --> 00:29:24.210 Ray Lutz: So we avoid copies. This is what makes it faster, way faster than pandas most of the time.
242 00:29:24.390 --> 00:29:26.660 Ray Lutz: Pandas is fast
243 00:29:26.990 --> 00:29:33.479 Ray Lutz: if you're doing those matrix manipulations, the array manipulations that are used in numpy.
244 00:29:33.810 --> 00:29:40.909 Ray Lutz: But if you do stupid things like add columns and add rows and append things and stuff, it gets really, really slow,
245 00:29:41.040 --> 00:29:48.260 Ray Lutz: And also when you're when you end up copying. So we're using references to existing data rather than recopying unless you want to
246 00:29:49.530 --> 00:29:53.550 Ray Lutz: So row selection reuses the existing header dictionary
247 00:29:53.870 --> 00:30:00.470 Ray Lutz: and the selected list values from the source daffodil array, and then
248 00:30:00.730 --> 00:30:10.269 Ray Lutz: Processing by columns is slower, but you can usually avoid that. What you want to do is, in one fell swoop, if you want to add columns and drop them,
249 00:30:11.090 --> 00:30:12.809 Ray Lutz: You do that all at one time.
250 00:30:13.310 --> 00:30:16.820 Ray Lutz: and, in fact, if you want to do that...
251 00:30:19.150 --> 00:30:20.690 Ray Lutz: if you want to flip the...
252 00:30:20.960 --> 00:30:25.246 Ray Lutz: flip the array on a diagonal, which is...
253 00:30:26.710 --> 00:30:29.919 Ray Lutz: why can't I think of it? It starts with "trans". I can't think of it.
254 00:30:30.895 --> 00:30:33.259 Ray Lutz: We'll get to that in a second, but the
255 00:30:34.496 --> 00:30:43.109 Ray Lutz: daffodil is pretty slow with doing when you you know head to head when you're doing manipulations of
256 00:30:43.460 --> 00:30:44.700 Ray Lutz: numerics.
257 00:30:45.190 --> 00:30:51.100 Ray Lutz: But when you're doing this type of row selections, it's much faster
258 00:30:51.440 --> 00:31:03.609 Ray Lutz: and column-based... oh, it's transposition. If you say flip is true, and you're adding rows or subtracting them, you can also flip it for free, because you have to make a whole new one anyway.
259 00:31:04.720 --> 00:31:12.470 Ray Lutz: so you can flip it for free, if you want to, when you're changing the columns, dropping them and adding them. But you want to do that all at one time.
260 00:31:13.170 --> 00:31:14.359 Ray Lutz: Add and drop.
261 00:31:14.460 --> 00:31:21.460 Ray Lutz: basically modify. The columns, end up with a new array that has the columns that you need, and then mutate it in place.
262 00:31:23.110 --> 00:31:25.880 Ray Lutz: In other words don't add columns one at a time.
263 00:31:26.470 --> 00:31:29.340 Ray Lutz: and because it's going to just be a lot of overhead.
264 00:31:31.280 --> 00:31:37.400 Ray Lutz: But if you have columns in there, then you can just don't use the ones that you don't want to use. Okay, next thing.
265 00:31:37.790 --> 00:31:44.809 Ray Lutz: So the keyed list is one of the core technologies that we developed inside this, once we got some more experience.
266 00:31:45.500 --> 00:31:51.710 Ray Lutz: So a keyed list is basically... if you
267 00:31:51.870 --> 00:31:58.130 Ray Lutz: take... if you do a zip of keys and values, this creates a conventional dictionary. So you have...
268 00:31:58.300 --> 00:32:04.090 Ray Lutz: if you want to... let's say you have a list, and you have keys you want to apply to the list values.
269 00:32:04.250 --> 00:32:07.959 Ray Lutz: You have to go through this transformation, and this takes time.
270 00:32:08.570 --> 00:32:12.000 Ray Lutz: It distributes the values to each item in the dictionary.
271 00:32:12.340 --> 00:32:16.860 Ray Lutz: You create a dictionary with these keys, and then you put a value on each one.
272 00:32:17.100 --> 00:32:25.010 Ray Lutz: and it's in memory. Now, all of a sudden, the values are distributed out in this dictionary. You don't know what order they're in, they're entered in some weird order now.
273 00:32:25.760 --> 00:32:32.019 Ray Lutz: The dictionary takes care of making them in the same order, but in the actual dictionary itself. I don't know what order they're in.
274 00:32:32.610 --> 00:32:34.240 Ray Lutz: They're not a list anymore.
275 00:32:34.360 --> 00:32:35.690 Ray Lutz: Let's put it that way
276 00:32:36.480 --> 00:32:42.069 Ray Lutz: So you can get the list out by saying dict.values(). You can get the list out,
277 00:32:42.560 --> 00:32:48.910 Ray Lutz: and you can get the keys out. It's not a list at this point; you'd have to convert it to a list. It's a keys...
278 00:32:49.290 --> 00:32:51.120 Ray Lutz: it's a dict_keys type. Oops.
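The zip-into-a-dict conversion being described is standard Python, and it shows the cost the speaker is pointing at: the values get redistributed into a new structure and are no longer the original list.

```python
# Zipping keys and values builds a conventional dict, distributing each
# value to its key.
keys = ["a", "b", "c"]
values = [34, 45, 56]

d = dict(zip(keys, values))

# Getting the list back out requires another conversion; d.values() is a
# dict_values view, not a list, and list() builds a brand-new list:
values_out = list(d.values())
is_same_list = values_out is values   # equal contents, different object
```

Both directions cost time and memory, which is the motivation for the keyed list that follows.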
279 00:32:51.740 --> 00:32:56.009 Ray Lutz: So we propose this concept of a keyed list,
280 00:32:56.680 --> 00:33:01.210 Ray Lutz: which contains a header dictionary that contains indexes for each key and
281 00:33:02.950 --> 00:33:09.609 Ray Lutz: this is one way to create the header dictionary. And it's this is an easy way to understand it, but it's not the most optimal way to do it.
282 00:33:09.760 --> 00:33:11.419 Ray Lutz: So you're going to have a column
283 00:33:11.540 --> 00:33:23.539 Ray Lutz: name and the index: the index for the column, from the enumeration of the keys. So this index is going to go 0, 1, 2, 3, 4, 5, and that's going to be the value, right? You go through all the keys, and you put them up here.
284 00:33:23.640 --> 00:33:28.090 Ray Lutz: So this stays the same for every
285 00:33:28.400 --> 00:33:36.420 Ray Lutz: keyed list of the same size and with the same columns. You don't need a new header dictionary. You can use the same one for different keyed lists.
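The header-dictionary construction just described (an easy way to understand it, if not the most optimized) is a one-line enumeration, and the resulting dict can be shared by every row of the same shape:

```python
# Map each column name to its index; this is the "header dictionary".
cols = ["a", "b", "c", "d", "e", "f"]
hd = {name: idx for idx, name in enumerate(cols)}

# Two different value lists share the single header dict; no per-row
# dictionary is ever built.
row1 = [10, 20, 30, 40, 50, 60]
row2 = [11, 21, 31, 41, 51, 61]

b_of_row1 = row1[hd["b"]]
b_of_row2 = row2[hd["b"]]
```

One `hd` for thousands of rows is what makes this cheaper than building a dict per row.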
286 00:33:36.850 --> 00:33:41.490 Ray Lutz: And the keyed list is
287 00:33:41.870 --> 00:33:50.640 Ray Lutz: a list. So, unlike a regular dictionary, which distributes the values amongst all the keys in the structure,
288 00:33:50.950 --> 00:33:53.400 Ray Lutz: the list here is still a list.
289 00:33:54.930 --> 00:34:02.499 Ray Lutz: It still looks like a dictionary. You have a key and a value, but it's structured, and you can still get the list out.
290 00:34:04.370 --> 00:34:09.670 Ray Lutz: It looks like a dictionary, but it's not designed like a dictionary.
291 00:34:10.050 --> 00:34:15.480 Ray Lutz: It has a header which has, excuse me
292 00:34:16.360 --> 00:34:19.470 Ray Lutz: like a: 0, b: 1, c: 2,
293 00:34:20.190 --> 00:34:28.119 Ray Lutz: and then a list associated with that and and that this way, it's faster to to do things. So if you have a keyed list.
294 00:34:28.610 --> 00:34:32.309 Ray Lutz: and like a is 34, b is 45, and c is 56,
295 00:34:33.159 --> 00:34:36.180 Ray Lutz: and you have values here. 1, 2, 3.
296 00:34:37.460 --> 00:34:42.620 Ray Lutz: You can say keyed_list.values = this values list: assign new values.
297 00:34:43.440 --> 00:34:46.719 Ray Lutz: And now you have a new keyed list. With those values in there
298 00:34:47.530 --> 00:34:52.000 Ray Lutz: you could do the same thing with the dictionary. It would put new values into the dictionary.
299 00:34:53.510 --> 00:34:57.329 Ray Lutz: If you say, what is the value of B.
300 00:34:58.010 --> 00:35:04.480 Ray Lutz: you know, you're saying, I want to assign 67 to the b.
301 00:35:04.930 --> 00:35:06.970 Ray Lutz: Now you have 67 here
302 00:35:07.450 --> 00:35:11.410 Ray Lutz: the values list that you originally used. Sorry I can't click
303 00:35:11.540 --> 00:35:17.100 Ray Lutz: the values list that you originally used also got changed because it's the same list.
304 00:35:17.460 --> 00:35:18.729 Ray Lutz: When you said.
305 00:35:19.160 --> 00:35:31.599 Ray Lutz: I want to assign this values list to keyed_list.values, it did not make a new list. It did not recopy anything. All it did is add a reference in here to this existing list.
306 00:35:33.840 --> 00:35:41.649 Ray Lutz: and then "values_list is keyed_list.values" outputs True.
307 00:35:41.780 --> 00:35:45.139 Ray Lutz: The "is" operator means it is exactly the same thing.
308 00:35:45.900 --> 00:35:49.550 Ray Lutz: It's the same thing in memory. There's no new version of it.
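A minimal sketch of the keyed-list idea (hypothetical, not the actual daffodil implementation): a shared header dict of key to index, plus a plain list of values stored by reference, so `is` identity holds and writes go through to the original list.

```python
class KeyedList:
    """Dict-like access over a shared header dict and a referenced list."""

    def __init__(self, hd, values):
        self.hd = hd              # e.g. {'a': 0, 'b': 1, 'c': 2}, shared
        self.values = values      # a reference to the caller's list, no copy

    def __getitem__(self, key):
        return self.values[self.hd[key]]

    def __setitem__(self, key, val):
        self.values[self.hd[key]] = val

hd = {"a": 0, "b": 1, "c": 2}
vals = [1, 2, 3]
kl = KeyedList(hd, vals)

kl["b"] = 67                     # writes through to the original list
same_list = kl.values is vals    # identity: same object in memory
```

Assigning a different list to `kl.values` would likewise just rebind a reference; nothing is recopied.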
309 00:35:50.370 --> 00:35:55.670 Ray Lutz: So a keyed list means that we can
310 00:35:57.003 --> 00:36:03.320 Ray Lutz: number one: if you have a list and you want to put it into your daffodil array,
311 00:36:04.780 --> 00:36:11.099 Ray Lutz: don't turn it into a dictionary. Like, if you have a list, it goes directly into that list in the array.
312 00:36:11.430 --> 00:36:15.050 Ray Lutz: Now, if you want to iterate through the daffodil array.
313 00:36:15.650 --> 00:36:20.339 Ray Lutz: it's convenient to iterate through with keyed lists, because, if you modify one.
314 00:36:20.450 --> 00:36:25.580 Ray Lutz: it actually modifies the array without having to recopy it, in just the way a...
315 00:36:25.730 --> 00:36:28.430 Ray Lutz: like, if you had a list of dictionaries,
316 00:36:28.710 --> 00:36:35.689 Ray Lutz: and you go through the list of dictionaries, and you have a dictionary in hand, and you change that item,
317 00:36:36.690 --> 00:36:44.360 Ray Lutz: it actually is the same dictionary as in the main list of dictionaries, and it'll change it in the list of dictionaries.
318 00:36:44.560 --> 00:36:50.450 Ray Lutz: Now, if you have a daffodil array and you pull a dictionary out.
319 00:36:50.830 --> 00:36:56.150 Ray Lutz: It's not the same data as what's in the array, and if you change it, it doesn't change what's in the array.
320 00:36:56.510 --> 00:36:58.570 Ray Lutz: But if you take a keyed list out.
321 00:36:58.900 --> 00:37:07.050 Ray Lutz: and you change that item in that list. That list is the same one that's in the array, and you've changed it without having to recopy it back in. So then, it works the same way as dictionaries do.
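The list-of-dictionaries behavior being described is plain Python reference semantics, and it's easy to demonstrate:

```python
# Pulling a dict out of a list of dicts gives you the same object, so
# mutating it mutates the list in place.
rows = [{"a": 1}, {"a": 2}]

row = rows[0]          # same dict object, not a copy
row["a"] = 99          # visible through rows[0] as well
changed_in_place = rows[0]["a"]

# By contrast, an explicit copy breaks the link; the original stays put:
copied = dict(rows[1])
copied["a"] = -1
```

The keyed list gives you this same write-through behavior against a daffodil array, where extracting a plain dict would not.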
322 00:37:07.830 --> 00:37:11.330 Ray Lutz: I don't know if I I probably can add another slide for that to explain it.
323 00:37:12.240 --> 00:37:14.490 Ray Lutz: Now we're going to go to a new topic.
324 00:37:14.940 --> 00:37:18.390 Ray Lutz: CSV reading: very, very fast if you
325 00:37:21.110 --> 00:37:25.429 Ray Lutz: stay in string type. So as long as you don't convert anything,
326 00:37:26.140 --> 00:37:30.528 Ray Lutz: the Python reader is really fast:
327 00:37:31.580 --> 00:37:39.080 Ray Lutz: for a million rows. According to this guy here. This reference he timed it. I'm not sure I trust this, but anyway, I used it because it was a reference.
328 00:37:39.360 --> 00:37:44.210 Ray Lutz: and if you do a pandas read_csv, it takes much more time.
329 00:37:46.960 --> 00:37:51.909 Ray Lutz: Pandas read_csv with a chunksize is, for some reason, worse.
330 00:37:52.080 --> 00:37:55.340 Ray Lutz: Dask is worse.
331 00:37:55.630 --> 00:38:04.049 Ray Lutz: Datatable, I guess, is another option; it's not as fast. This looks absurdly,
332 00:38:04.280 --> 00:38:15.620 Ray Lutz: way better than it really is, so I'll have to look into that. But it's still very, very fast, because it doesn't do any type conversion for you. Pandas does this automatically, to try to
333 00:38:15.780 --> 00:38:17.700 Ray Lutz: be easy to use.
334 00:38:17.830 --> 00:38:20.650 Ray Lutz: But if you don't want that, it doesn't happen,
335 00:38:21.290 --> 00:38:26.700 Ray Lutz: so later you can apply the dtypes and unflatten...
336 00:38:27.230 --> 00:38:36.669 Ray Lutz: unflatten them, which would bring them back up to become a Python data type. Such as, like, if you have a dictionary in a cell,
337 00:38:37.170 --> 00:38:42.340 Ray Lutz: and it gets turned into either JSON or what I call PYON,
338 00:38:42.490 --> 00:38:44.270 Ray Lutz: which we're going to get into in a second.
339 00:38:44.710 --> 00:38:50.050 Ray Lutz: then it will reform that into the dictionary within the cell.
340 00:38:52.220 --> 00:38:54.230 Ray Lutz: Now, the csv writer
341 00:38:54.690 --> 00:39:02.520 Ray Lutz: flattens automatically to PYON. I didn't know this. PYON is something that I dreamed up as a name:
342 00:39:03.140 --> 00:39:07.729 Ray Lutz: it means Python Object Notation, and it's similar to JSON.
343 00:39:08.630 --> 00:39:14.329 Ray Lutz: It's actually a superset of JSON. JavaScript Object Notation
344 00:39:14.520 --> 00:39:18.730 Ray Lutz: is JSON, and this is simply Python Object Notation,
345 00:39:18.920 --> 00:39:25.750 Ray Lutz: but it can express sets, tuples, dicts, lists, functions, etc. So we can do everything
346 00:39:26.633 --> 00:39:40.640 Ray Lutz: within Python, and it already does, for the most part, except for functions. It'll handle sets, tuples, dicts, lists automatically in the csv writer without you doing anything.
347 00:39:41.710 --> 00:39:43.569 Ray Lutz: I just stumbled across this.
348 00:39:44.205 --> 00:39:50.850 Ray Lutz: Now, PYON already exists. It's already defined. It's what you get if you repr something,
349 00:39:53.110 --> 00:39:54.560 Ray Lutz: Generally speaking.
350 00:39:54.840 --> 00:40:02.690 Ray Lutz: not always, because sometimes the reprs are broken in these things. But what they should do is define this as:
351 00:40:03.190 --> 00:40:08.180 Ray Lutz: the csv writer should use repr, and sometimes it uses the str function instead.
352 00:40:11.540 --> 00:40:21.019 Ray Lutz: it's better than Pickle, Json, Pickle, and other variants of Json for working with python types, in my opinion.
353 00:40:21.940 --> 00:40:25.030 Ray Lutz: So I generated this pyon tools
354 00:40:25.210 --> 00:40:30.269 Ray Lutz: Python module. It isn't quite published yet, but I'm using it myself,
355 00:40:30.630 --> 00:40:37.289 Ray Lutz: and it turns out, for CSV, it's very simple, because what you're doing is you're using the repr method for any object,
356 00:40:37.440 --> 00:40:40.980 Ray Lutz: and it basically does it already.
357 00:40:41.090 --> 00:40:48.960 Ray Lutz: But when you're using the csv writer, you don't have to change this. So the way I stumbled across this is: I had dictionaries
358 00:40:49.170 --> 00:40:54.449 Ray Lutz: in my daffodil array, I wrote it out to a file,
359 00:40:54.600 --> 00:41:01.139 Ray Lutz: and it automatically converted them and flattened them out into character strings, the normal ones that you see
360 00:41:02.866 --> 00:41:13.269 Ray Lutz: when you look at a dictionary, like we were just looking at, just like when you look at this dictionary right here:
361 00:41:13.420 --> 00:41:21.259 Ray Lutz: open brace, single quote, a, single quote, colon, 0, comma, all that sort of thing.
362 00:41:21.610 --> 00:41:31.939 Ray Lutz: This is the expression, a string expression that represents a dictionary. It isn't the dictionary itself. Dictionary itself is some other, you know, thing in memory, and
363 00:41:32.100 --> 00:41:42.130 Ray Lutz: of a fairly complex structure that python has suppressed. And what you understand as a dictionary, are these symbols right here?
364 00:41:42.430 --> 00:41:52.550 Ray Lutz: Those symbols are character strings that can be represented in a file. So this is what you get if you have a dictionary, which is this header dictionary with those things in it. That's exactly what you find
365 00:41:52.940 --> 00:41:54.910 Ray Lutz: in the CSV file,
366 00:41:55.290 --> 00:42:04.269 Ray Lutz: right here. Unfortunately, it doesn't use double quotes, so it's not exactly JSON. If they allowed you to say, use double quotes instead, this would be JSON,
367 00:42:04.920 --> 00:42:11.979 Ray Lutz: and then you could use it with other tools. A little bit of a wrinkle there with what they use in Python,
368 00:42:12.100 --> 00:42:14.969 Ray Lutz: and maybe we can get the csv writer to
369 00:42:15.140 --> 00:42:19.120 Ray Lutz: optionally use double quotes, so it would still be valid PYON
370 00:42:19.750 --> 00:42:23.499 Ray Lutz: but would also be valid JSON.
371 00:42:24.730 --> 00:42:27.319 Ray Lutz: I think the Python community should embrace
372 00:42:27.550 --> 00:42:30.940 Ray Lutz: the PYON that they've already defined, but they don't have a name for it,
373 00:42:31.270 --> 00:42:38.240 Ray Lutz: and provide options for the csv writer to use double quotes and stuff in there, because then it would produce JSON.
374 00:42:38.570 --> 00:42:43.040 Ray Lutz: But this is how we flatten things from daffodil: we almost do nothing.
375 00:42:43.160 --> 00:42:45.529 Ray Lutz: Python already does it for us.
376 00:42:46.340 --> 00:42:53.050 Ray Lutz: Now, when we import CSV, we do it in a very controlled, explicit manner, so it comes in as strings,
377 00:42:54.110 --> 00:43:08.590 Ray Lutz: and then we convert them. Unfortunately, pandas is optimized for tables with just numerics and a simple header, normal for CSV, and they don't... it's hard to work around this. You can do it, but it's just a pain in the ass to try to get it to do weird things.
378 00:43:10.230 --> 00:43:13.730 Ray Lutz: So what we do is the
379 00:43:14.000 --> 00:43:21.190 Ray Lutz: daffodil dtypes, which is something you can specify, and it doesn't do anything by itself; it just gets carried around in the frame.
380 00:43:21.510 --> 00:43:23.669 Ray Lutz: But if you
381 00:43:23.910 --> 00:43:33.459 Ray Lutz: are importing things, you can say apply it, and then it will apply dtypes to the columns that you want to apply to. If the columns don't exist, it's not going to hurt it,
382 00:43:33.620 --> 00:43:35.420 Ray Lutz: if you drop them.
383 00:43:37.130 --> 00:43:41.140 Ray Lutz: You don't want to apply d types to any columns you're not going to actually use.
384 00:43:41.410 --> 00:43:50.979 Ray Lutz: So if you bring in an array, and it's got 5,000 columns and you only need 3 of them: first, drop everything else that you don't need, or just work on the ones that you want to work on.
385 00:43:51.120 --> 00:44:01.939 Ray Lutz: In fact, you don't need to drop them if you brought it all the way in. Just work on the columns that you want to work with, and then just ignore the rest. As soon as you start converting the thing, then you're starting to add time.
386 00:44:04.110 --> 00:44:06.869 Ray Lutz: So we have a few other features. I want to mention.
387 00:44:07.450 --> 00:44:10.200 Ray Lutz: number one. We have an indirect functionality
388 00:44:10.490 --> 00:44:15.199 Ray Lutz: where a dict can specify the contents of specific columns. So inside of a cell
389 00:44:15.930 --> 00:44:23.250 Ray Lutz: you have a dictionary, and that dictionary actually specifies column names and values
390 00:44:23.510 --> 00:44:40.629 Ray Lutz: which are to be interpreted as part of the actual array. But it's got an indirection. So you first go into the cell, you find out what's specified there, and then that's to be interpreted as the rest of the array. And this is useful for sparse arrays. As I was saying, what I was working with
391 00:44:40.760 --> 00:44:44.310 Ray Lutz: was a very sparse array with like 5,600 columns,
392 00:44:44.740 --> 00:44:48.410 Ray Lutz: and only about 50 of them are used at any one time.
393 00:44:48.960 --> 00:44:51.849 Ray Lutz: So if you use, if you represent this as
394 00:44:51.990 --> 00:44:55.661 Ray Lutz: an actual CSV file or anything like that,
395 00:44:57.050 --> 00:45:02.380 Ray Lutz: it's very, very costly, because you have all these commas, right? Comma comma comma comma,
396 00:45:02.530 --> 00:45:12.309 Ray Lutz: and to represent all of the 5,600 columns when you're only going to use 50, and then you have them in there, and you got to try to figure out which ones they are. It's a mess. So
397 00:45:12.850 --> 00:45:23.699 Ray Lutz: in this case, even though you want the array to be logically 5,600 columns for any one row. You don't want to have to specify more than just the 50 columns that you're working with.
398 00:45:24.380 --> 00:45:37.019 Ray Lutz: And so in that one cell, what you have is a dictionary which specifies all of the columns that you're working with, and then it is logically considered part of the array. So if you sum the columns or the rows,
399 00:45:37.340 --> 00:45:41.520 Ray Lutz: it figures out where those it takes that indirection into account.
400 00:45:41.670 --> 00:45:45.160 Ray Lutz: expands them and works with summing them that way.
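The sparse-array indirection being described can be sketched in plain Python (a conceptual illustration with hypothetical column names, not the daffodil implementation): each row stores a small dict of only the columns it actually uses, and a column sum looks through the dicts instead of materializing 5,600 columns.

```python
# Each row is logically 5,600 columns wide, but only names a few of them.
rows = [
    {"col_0007": 5, "col_1234": 2},
    {"col_1234": 3},
    {"col_0007": 4, "col_5599": 1},
]

def sum_column(rows, name):
    # Missing keys count as 0: the column is logically present everywhere.
    return sum(row.get(name, 0) for row in rows)

total = sum_column(rows, "col_1234")
```

Summing expands the indirection on the fly, so the CSV never has to carry thousands of empty commas per row.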
401 00:45:46.600 --> 00:45:57.230 Ray Lutz: We have a from_pdf that will take a PDF file with, you know, a header and columns, and parse it,
402 00:45:57.590 --> 00:46:06.050 Ray Lutz: usually skipping a few things; you can use a few controls there to skip things. But to just convert from basic PDF files that you might find,
403 00:46:06.692 --> 00:46:11.399 Ray Lutz: it's a little shortcut. You can do it yourself, but this shortcuts some work.
404 00:46:11.700 --> 00:46:19.234 Ray Lutz: It offers the attrs (attributes) dictionary as part of the
405 00:46:21.550 --> 00:46:26.399 Ray Lutz: part of the class; every instance has this, and this is the same as in pandas:
406 00:46:26.620 --> 00:46:35.670 Ray Lutz: you can add any kind of attribute you want to a data frame, and
407 00:46:35.950 --> 00:46:44.030 Ray Lutz: what I find convenient there is, like, if I've already figured out that these are the metadata columns, they're all strings, and the rest of it is data.
408 00:46:45.090 --> 00:46:54.610 Ray Lutz: Once I parse that and I know where they are, I need to pass that along and say, these are the metadata columns. This is the number of columns that's the metadata
409 00:46:55.215 --> 00:47:00.590 Ray Lutz: and that's easy to do. You just put that into this attrs, and then
410 00:47:00.810 --> 00:47:14.399 Ray Lutz: your next function says, well, how many metadata columns are there? Oh, it's in attrs, I already know that. Now, you do have to know that it's in there, you know, but at least you don't have to pass another variable along, or, even worse, recalculate it.
411 00:47:14.590 --> 00:47:21.159 Ray Lutz: So that's something that turns out pandas had, and we're just using that the same way.
412 00:47:21.590 --> 00:47:27.490 Ray Lutz: And then we're now offering a join method that is efficient and mimics a join in. SQL.
413 00:47:29.430 --> 00:47:35.409 Ray Lutz: pandas doesn't have a real join, it has a merge. So when you do a merge in pandas,
414 00:47:36.910 --> 00:47:44.299 Ray Lutz: you essentially are doing what this does here. But when we're...
415 00:47:44.440 --> 00:47:46.030 Ray Lutz: and I'll get to this in a second.
416 00:47:46.590 --> 00:47:55.179 Ray Lutz: we're extending daffodil to use SQL in the background, and when the SQL machine does a join,
417 00:47:55.290 --> 00:47:59.290 Ray Lutz: it doesn't actually do anything, it just keeps track of the join.
418 00:47:59.720 --> 00:48:19.770 Ray Lutz: And if you say I want to take these columns in this table, and I want to join them with these columns in this table along this key, it doesn't do anything. It just remembers that. And if you say I also want to join this table in this table, in this table and this on these keys. You could do them one at a time, and you can do up to, I think, 64 times, or maybe it's 16. But there's a certain limit.
419 00:48:19.910 --> 00:48:22.240 Ray Lutz: and then, once you have all your joints done.
420 00:48:22.720 --> 00:48:31.590 Ray Lutz: and you say I want to select these column, these rows out of my join tables. Then it does it. Then it figures it out and it pulls all the data in and does it.
421 00:48:31.980 --> 00:48:32.810 Ray Lutz: Okay.
422 00:48:32.960 --> 00:48:51.309 Ray Lutz: so it's nice that way, that SQL, when it does a join, it creates a view. It doesn't actually do anything. Now, pandas always does something. It always does a merge. It takes data from one thing, it puts it with this one and merges it together. Essentially, that's what this join does. But
423 00:48:51.470 --> 00:48:57.639 Ray Lutz: this one is, we'll be using the SQL type of join when we get to that.
424 00:48:58.010 --> 00:49:00.029 Ray Lutz: So there's the plans for SQL.
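The lazy-join behavior described above can be sketched with `sqlite3` from the Python standard library. This is a minimal illustration of the general SQL mechanism, not daffodil code; the table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE people(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE scores(person_id INTEGER, score INTEGER);
    INSERT INTO people VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO scores VALUES (1, 90), (2, 75), (1, 88);
""")

# Creating the view records the join definition but moves no data yet;
# this is the "it just remembers that" step.
con.execute("""
    CREATE VIEW joined AS
    SELECT p.name, s.score
    FROM people p JOIN scores s ON s.person_id = p.id
""")

# Only this SELECT actually evaluates the join.
rows = con.execute("SELECT name, score FROM joined ORDER BY score").fetchall()
print(rows)   # [('bob', 75), ('ann', 88), ('ann', 90)]
```

Contrast this with `pandas.merge`, which materializes the combined table immediately; the view costs nothing until someone selects from it.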
425 00:49:01.270 --> 00:49:10.500 Ray Lutz: Now, right now, daffodil arrays must fit into memory at this time. So if it doesn't fit into memory, you're going to have to chunk it,
426 00:49:13.440 --> 00:49:29.729 Ray Lutz: which we do. So we have a thing where we chunk things, and there's a lot of infrastructure, I might add, in daffodil that chunks things. So basically we have a chunk of like a hundred things in one chunk, and we have thousands of those.
427 00:49:29.980 --> 00:49:37.310 Ray Lutz: We don't actually want to combine them necessarily upfront. We combine them all in one fell swoop and then make one big file,
428 00:49:37.440 --> 00:49:39.339 Ray Lutz: or just work with the chunks.
429 00:49:40.470 --> 00:49:50.440 Ray Lutz: What we'll do here with SQL is use the fact that the row-based daffodil arrays are similar to SQL data tables; they're also row based.
430 00:49:51.120 --> 00:49:53.339 Ray Lutz: But they have column operations. Of course.
431 00:49:53.860 --> 00:50:14.559 Ray Lutz: we'll add kwargs, we'll add additional keyword arguments in the indexing, to specify whether it will be an SQL table. Or another way to say it is, you just take the original daffodil, to SQL, and we'll give it a name, and then we'll get this daffodil table, main_sql_daf; we'll just call it that.
432 00:50:14.760 --> 00:50:18.009 Ray Lutz: You don't have to use this name, and that will be
433 00:50:18.800 --> 00:50:21.019 Ray Lutz: how we refer to it within python.
434 00:50:21.360 --> 00:50:25.990 Ray Lutz: And this actually will look like a daffodil table. But it's actually in SQL,
435 00:50:26.190 --> 00:50:31.890 Ray Lutz: so we don't actually have the table in daffodil. It's basically a proxy to the actual table.
436 00:50:32.610 --> 00:50:39.430 Ray Lutz: Then operations on the SQL daf will operate as if the table were in memory, but it actually is operating in the SQL engine.
437 00:50:39.890 --> 00:50:40.880 Ray Lutz: and
438 00:50:41.570 --> 00:50:55.309 Ray Lutz: the result is, we can allow much larger tables while still manipulating in the daffodil array paradigm, with selection and indexing done in a pythonic way. So essentially, we're still going to use those square brackets, you know. The 1st one is the row.
439 00:50:55.420 --> 00:51:03.800 Ray Lutz: Well, that's like select, you know. The second one is column. So select star, that would be kind of like the first. The second thing is the columns that you want,
440 00:51:03.980 --> 00:51:11.139 Ray Lutz: and then you say which things you want to select; in an SQL statement that would be the 1st
441 00:51:11.370 --> 00:51:15.381 Ray Lutz: parameter, and selecting it, and the like. So
442 00:51:16.730 --> 00:51:27.859 Ray Lutz: I won't go into some of the difficulties that we found in SQLite. But SQLite does not have a row-based, like a vector-based operation, so that we can have
443 00:51:31.860 --> 00:51:34.319 Ray Lutz: we can have python,
444 00:51:34.530 --> 00:51:47.540 Ray Lutz: an apply that would take, say, a row from the table, run it through python, and return an entire row, and then add it to a new table. SQLite doesn't provide that; that would be an extension we'd want to see.
445 00:51:47.998 --> 00:51:55.519 Ray Lutz: All they allow is scalar returns, and then it's a lot of work to do it, so it's better to bring a chunk out of the table
446 00:51:55.850 --> 00:52:01.129 Ray Lutz: as a daffodil array, apply it within python, and then move it back into
447 00:52:01.500 --> 00:52:05.320 Ray Lutz: SQL, right? That's the best way to do it right now. It's the fastest.
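The chunk-out, apply, write-back pattern just described can be sketched with plain `sqlite3`. This is an illustration of the pattern, not daffodil's actual implementation; the table names, column names, and the row-wise transform are all invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements(id INTEGER PRIMARY KEY, raw REAL)")
con.executemany("INSERT INTO measurements(raw) VALUES (?)",
                [(v,) for v in (1.0, 2.0, 3.0, 4.0)])
con.execute("CREATE TABLE results(id INTEGER, scaled REAL)")

CHUNK = 2
cur = con.execute("SELECT id, raw FROM measurements")
while True:
    chunk = cur.fetchmany(CHUNK)       # bring a chunk of rows into Python
    if not chunk:
        break
    # Row-wise "apply" done in ordinary Python, something SQLite's own
    # scalar user-defined functions can't return a whole row for.
    transformed = [(i, raw * 10) for i, raw in chunk]
    con.executemany("INSERT INTO results VALUES (?, ?)", transformed)

out = con.execute("SELECT scaled FROM results ORDER BY scaled").fetchall()
print(out)   # [(10.0,), (20.0,), (30.0,), (40.0,)]
```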
448 00:52:07.336 --> 00:52:10.609 Ray Lutz: So if you want to use,
449 00:52:11.560 --> 00:52:14.560 Ray Lutz: we would also support general SQL queries
450 00:52:14.810 --> 00:52:20.530 Ray Lutz: on the proxy. So if you say, I want to actually use this SQL query,
451 00:52:21.110 --> 00:52:30.490 Ray Lutz: then, I keep doing that click, I can't click when I'm in this. So if you want to have an SQL query and apply it to this proxy,
452 00:52:31.860 --> 00:52:34.610 Ray Lutz: We don't know the name of the table over there, necessarily.
453 00:52:34.800 --> 00:52:38.880 Ray Lutz: And one thing about python is you.
454 00:52:39.010 --> 00:52:47.389 Ray Lutz: When you create a daffodil table. You don't know what name you're going to apply to it, because it's not the way Python works. It doesn't know what name it has. In fact, it could have many names.
455 00:52:49.910 --> 00:52:53.390 Ray Lutz: In SQL. When you have a table it has a name.
456 00:52:53.530 --> 00:52:56.559 Ray Lutz: and you have to use that name to refer to it all the time.
457 00:52:58.870 --> 00:53:05.299 Ray Lutz: so we're going to have to name our table with some arbitrary name, if you don't give it one.
458 00:53:05.670 --> 00:53:06.899 Ray Lutz: And then
459 00:53:07.660 --> 00:53:13.269 Ray Lutz: or we may actually always name it with an arbitrary name and then map it over. But essentially,
460 00:53:14.780 --> 00:53:18.910 Ray Lutz: when you do an SQL statement.
461 00:53:19.450 --> 00:53:22.219 Ray Lutz: and you say, I want to
462 00:53:22.340 --> 00:53:25.069 Ray Lutz: like, select blah blah blah from
463 00:53:25.770 --> 00:53:27.700 Ray Lutz: you have to put in a table name.
464 00:53:27.960 --> 00:53:39.890 Ray Lutz: Okay? And you're not going to know what that name is. And so that's why we're going to have to have some substitution going on with that. And that won't be too hard for people to do if they want to use general purpose. Queries
465 00:53:42.040 --> 00:53:47.979 Ray Lutz: Pretty much everything within pandas is available within daffodil.
466 00:53:49.160 --> 00:53:55.049 Ray Lutz: But some things that are not available in pandas are available, like append, which has been deprecated in pandas.
467 00:53:58.270 --> 00:54:19.940 Ray Lutz: But most things run the same way, a little bit different, because pandas is normally columns of numpy arrays. So in pandas the 1st value in the square brackets is a column by default, whereas in our mode the 1st thing by default is the row. Just be aware of that.
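The column-first vs row-first distinction above shows up in the plain Python structures the two approaches are built around. This is an illustration with made-up data: a dict of lists is column-oriented (pandas-style, where the first subscript names a column), while a list of dicts is row-oriented (daffodil-style, where the first subscript is a row number).

```python
# Column-oriented: the first key you give selects a whole column.
col_oriented = {"width": [23, 23], "height": [19, 19]}

# Row-oriented: the first index you give selects a whole row.
row_oriented = [{"width": 23, "height": 19}, {"width": 23, "height": 19}]

print(col_oriented["width"])   # [23, 23]
print(row_oriented[0])         # {'width': 23, 'height': 19}
```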
468 00:54:23.060 --> 00:54:28.340 Ray Lutz: and so pretty much they're all there. And again the timing we went over briefly at the beginning.
469 00:54:30.730 --> 00:54:38.663 Ray Lutz: Daffodil is faster for array manipulation, like appending rows. But pandas is faster if you're going to do column-based
470 00:54:39.590 --> 00:54:41.020 Ray Lutz: manipulations.
471 00:54:41.300 --> 00:54:45.950 Ray Lutz: Basically. Here was my summary about use cases, and when you want to use each one.
472 00:54:46.160 --> 00:54:50.950 Ray Lutz: if you have existing data in well-defined, column-based format, then.
473 00:54:51.820 --> 00:54:54.970 Ray Lutz: and almost all data is numeric.
474 00:54:55.760 --> 00:55:01.139 Ray Lutz: and you don't want to do appending or modification other than maybe creating some additional columns,
475 00:55:02.210 --> 00:55:09.709 Ray Lutz: and then maybe produce plots after you analyze it, and and so forth. Then pandas might be the best choice for sure.
476 00:55:11.190 --> 00:55:13.289 Ray Lutz: as long as the data fits in memory.
477 00:55:13.980 --> 00:55:24.469 Ray Lutz: Once it gets out of memory, then maybe you're going to use Daffodil. SQL might be a good choice; we'll have to see. I haven't really tested that enough to know if that's going to be a better choice for you.
478 00:55:25.040 --> 00:55:30.090 Ray Lutz: If you're building data tables by analyzing or converting images or other data, appending to a table,
479 00:55:30.250 --> 00:55:44.499 Ray Lutz: that is not going to be pandas. If you want to have small utility tables used for tracking processes or parsing data, driving state machines, all these kind of little tables you might use all the time throughout your code.
480 00:55:44.970 --> 00:55:51.520 Ray Lutz: Use daffodil tables. Don't get involved with pandas. That's for data analysis. And those specific things.
481 00:55:53.091 --> 00:55:55.550 Ray Lutz: Once you build the table.
482 00:55:55.670 --> 00:55:58.259 Ray Lutz: Then you might want to use pandas or numpy.
483 00:55:58.860 --> 00:56:07.090 Ray Lutz: We can convert individual columns, and this is a pretty good way to do it: convert to numpy arrays so that the columns can be managed.
484 00:56:07.886 --> 00:56:13.330 Ray Lutz: For example, if you sum 2 columns, you get another whole column.
485 00:56:13.760 --> 00:56:33.910 Ray Lutz: This kind of operation here will work if it's a numpy array. Or you can multiply columns; you can do all of these functions, add them together, multiply by a scalar, or, say, divide one column by another one. It creates another whole column, and that expression is very fast.
486 00:56:34.480 --> 00:56:42.450 Ray Lutz: So you can do this by just having a dictionary of numpy arrays converted on the columns that you want to use.
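The dict-of-numpy-arrays idea just described can be sketched in a few lines: convert only the columns you need into numpy arrays, and whole-column arithmetic becomes a single vectorized expression. The column names and values here are invented for the example.

```python
import numpy as np

# Only the columns we care about, each converted to a numpy array.
cols = {
    "width":  np.array([23, 10, 7]),
    "height": np.array([19, 4, 2]),
}

cols["area"] = cols["width"] * cols["height"]    # column * column
cols["total"] = cols["width"] + cols["height"]   # column + column
cols["half_width"] = cols["width"] / 2           # column / scalar

print(cols["area"].tolist())    # [437, 40, 14]
print(cols["total"].tolist())   # [42, 14, 9]
```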
487 00:56:43.210 --> 00:56:55.069 Ray Lutz: If you want to do state updates like, like tracking the state of a user in a web-based application or something. Then you're going to want to use an SQL or no SQL. Or something kind of database, and not use any of these. Of course.
488 00:56:55.837 --> 00:57:03.530 Ray Lutz: Status is that, you know, I've used it quite a bit myself. It hasn't really been adopted very much. That's okay. I mean, we're still
489 00:57:03.690 --> 00:57:06.800 Ray Lutz: sort of researching the best ways for this to work.
490 00:57:07.398 --> 00:57:12.489 Ray Lutz: I've used it myself. I've converted almost everything over from pandas, and I love it.
491 00:57:15.100 --> 00:57:22.539 Ray Lutz: And in a couple of cases I still have to use SQL, because the tables got big. And so that's why I want to convert over to Daffodil SQL,
492 00:57:23.301 --> 00:57:25.869 Ray Lutz: and that's pretty much what I had there.
493 00:57:26.330 --> 00:57:30.270 Ray Lutz: Okay, so I'm done. Guess
494 00:57:30.410 --> 00:57:35.550 Ray Lutz: My contact is Ray, [email protected]. That's my email.
495 00:57:37.210 --> 00:57:42.549 Ray Lutz: Any questions? I guess I used up the whole hour and a little bit more, and didn't
496 00:57:43.430 --> 00:57:49.000 Ray Lutz: give room for many questions. I see the chat room is okay. I'll leave it.
497 00:57:49.000 --> 00:57:51.159 Gabor Szabo: I think apologies, no worries.
498 00:57:51.160 --> 00:57:51.800 Gabor Szabo: Don't do that.
499 00:57:52.590 --> 00:57:55.050 Gabor Szabo: If anyone has questions, then then please do ask.
500 00:57:55.200 --> 00:57:59.460 Gabor Szabo: I just wanted to say something. 1st of all. Thank you very much for the presentation.
501 00:57:59.750 --> 00:58:04.290 Gabor Szabo: But one thing that is sort of related.
502 00:58:04.970 --> 00:58:17.130 Gabor Szabo: that I see many people using pandas for... I just read in a CSV file and do some simple manipulation, and they always go to pandas, because that's what they learned,
503 00:58:17.470 --> 00:58:21.289 Gabor Szabo: and they don't use the standard csv library.
504 00:58:21.805 --> 00:58:22.320 Ray Lutz: Yeah.
505 00:58:22.600 --> 00:58:33.609 Gabor Szabo: And I had a feeling that pandas is just way too big for this. But now your numbers show that it's way slower than using the standard csv library.
506 00:58:34.620 --> 00:58:38.569 Ray Lutz: Way slower and also just getting in and out of it.
507 00:58:39.090 --> 00:58:43.039 Ray Lutz: like, if you're if you're just staying within the Pandas world.
508 00:58:43.280 --> 00:58:46.079 Ray Lutz: and you're doing stuff that are pandas related.
509 00:58:47.080 --> 00:58:48.270 Ray Lutz: It's great.
510 00:58:48.440 --> 00:58:52.260 Ray Lutz: And and I think, though, that what we're going to find
511 00:58:52.400 --> 00:59:02.439 Ray Lutz: is that using a daffodil array and converting it to a dictionary of numpy arrays, which is kind of what's inside pandas. But pandas has grown so big
512 00:59:02.780 --> 00:59:10.259 Ray Lutz: that I've watched it load. It takes several seconds, maybe like 5, 10 seconds for it just to be imported.
513 00:59:10.600 --> 00:59:15.820 Ray Lutz: So when you're running one of these interpreters and you're using a huge library like Pandas.
514 00:59:16.550 --> 00:59:19.070 Ray Lutz: I mean the main pandas class,
515 00:59:19.310 --> 00:59:23.379 Ray Lutz: just one class, is like 13,000 lines.
516 00:59:23.690 --> 00:59:28.290 Ray Lutz: It's all in one file. I mean, I'm really surprised they still write them this way,
517 00:59:28.400 --> 00:59:35.630 Ray Lutz: but it's it's a very highly functional thing. And here's the thing is that people are nowadays.
518 00:59:36.320 --> 00:59:45.059 Ray Lutz: They might be using an AI machine to assist them, and they say, You know, read this in, and and you know, do a few conversions and then put out, help me do this plot.
519 00:59:45.770 --> 00:59:52.860 Ray Lutz: The AI machines. They know perfectly well how to use pandas, and they'll do it now, for that
520 00:59:53.260 --> 00:59:59.179 Ray Lutz: efficiency is not important, really. You may wait a few extra seconds. But who cares?
521 01:00:00.353 --> 01:00:09.340 Ray Lutz: So Daffodil is a little bit different animal; it actually is all python,
522 01:00:10.120 --> 01:00:15.499 Ray Lutz: and not that pandas is not, you know pandas is numpy, and
523 01:00:15.750 --> 01:00:30.809 Ray Lutz: but it's restrictive in what it can put into its lists, and it's designed around numerics. So when you start adding strings or anything else, it just freaks out. Then you've got to go back to a dictionary of lists, or a list of dictionaries, I mean,
524 01:00:31.100 --> 01:00:36.260 Ray Lutz: and that's what I ended up doing. But then I didn't have the functionality of selecting rows and other things that you want.
525 01:00:38.170 --> 01:00:40.676 Ray Lutz: Now will this catch on? I don't know.
526 01:00:42.030 --> 01:00:46.110 Ray Lutz: I think that that it for most people
527 01:00:46.290 --> 01:00:49.019 Ray Lutz: I mean, I still like to use pandas for
528 01:00:49.170 --> 01:00:54.550 Ray Lutz: for certain things, just because I know that the AI machine knows exactly what to do with it.
529 01:00:55.293 --> 01:01:04.750 Ray Lutz: Once the AI machine understands Daffodil, it might do it, but I don't think it's really the use case. Daffodil is more for programmers than for
530 01:01:05.280 --> 01:01:07.370 Ray Lutz: people who are data analysts.
531 01:01:08.320 --> 01:01:11.200 Ray Lutz: It's more for somebody who wants to program in
532 01:01:12.350 --> 01:01:18.409 Ray Lutz: that use these, I mean, I use them all the time, because if you don't use one.
533 01:01:18.920 --> 01:01:27.129 Ray Lutz: and you know that it's not suitable to use pandas for this. So then you want to refer to a column of the array.
534 01:01:27.910 --> 01:01:36.059 Ray Lutz: Well, it's a list of dictionaries, so you don't have columns. You have to go through and write a comprehension that pulls out the column. You can do that,
535 01:01:36.600 --> 01:01:41.220 Ray Lutz: but it's easier just to have something that's all well tested, and everything that pulls that column out
536 01:01:42.930 --> 01:01:47.280 Ray Lutz: makes conversions and so forth. So it's it's a handy thing to have.
537 01:01:47.590 --> 01:01:49.899 Ray Lutz: and I think it's logical to have
538 01:01:50.070 --> 01:01:52.420 Ray Lutz: a next step up. So we have
539 01:01:52.550 --> 01:01:58.670 Ray Lutz: fairly high level data structures in python, such as lists, dictionaries
540 01:01:59.010 --> 01:02:02.419 Ray Lutz: very highly functional and really nice.
541 01:02:02.920 --> 01:02:09.959 Ray Lutz: But we need to move up a level and have a two-dimensional functional data frame within the python world
542 01:02:10.220 --> 01:02:11.510 Ray Lutz: and and not
543 01:02:12.090 --> 01:02:19.120 Ray Lutz: make it numpy. Not that I'm against numpy. It's just that it's very restrictive as to what you can put into those cells.
544 01:02:20.880 --> 01:02:25.430 Ray Lutz: you can put some strings in. I think it's up to 20 characters or something. So.
545 01:02:26.550 --> 01:02:31.590 Ray Lutz: okay, all right. Well, thank you so much. I guess
546 01:02:32.030 --> 01:02:35.199 Ray Lutz: you're right. It's it's most people
547 01:02:35.570 --> 01:02:43.179 Ray Lutz: I think are going to say, well, I'm just going to continue to use python pandas, because that's what I'm used to, and
548 01:02:43.680 --> 01:02:45.039 Ray Lutz: I don't really care about
549 01:02:45.180 --> 01:02:56.389 Ray Lutz: time so much as what you're saying here, and I'm not doing appending. But if you're doing the appending, if you're building these tables up, that's when Daffodil becomes a pretty handy little tool.
550 01:02:57.490 --> 01:02:59.970 Ray Lutz: Okay, thanks a lot, Gabor. I guess that's the end.
551 01:03:00.360 --> 01:03:09.079 Gabor Szabo: Yeah. So thank you. Thank you again for for giving this presentation. Thank you. Everyone who listened to the were present and listened to the presentation.
552 01:03:09.330 --> 01:03:17.269 Gabor Szabo: If you like the video, then please like it, and follow the channel and see you next time.
553 01:03:17.810 --> 01:03:19.889 Ray Lutz: Okay, thanks. A lot. Okay. Bye.
While profiling a slow process I stumbled upon a surprising way to reduce our memory consumption. This talk will present some useful profiling tools, and an important thing to know when using AbstractBaseClass extensively. In this session, we will dive into the realm of Python optimization, as we cover some essential profiling tools designed to identify and resolve performance bottlenecks in your code. We'll navigate through practical examples, showcasing how these tools can provide invaluable insights into your application's memory and CPU usage patterns. Furthermore, we'll delve into some nuances of AbstractBaseClass usage, and its implications on speed and memory management in Python applications. Whether you're a seasoned developer or just starting your journey with Python, this session offers some practical strategies to optimize Python programs effectively.

1 00:00:02.250 --> 00:00:29.170 Gabor Szabo: Hi, and welcome to the Code Maven events, meetings, and the Code Maven channel, if you're watching it on Youtube. My name is Gabor Szabo. I usually teach python and rust and help companies with these 2 languages mostly. And I also organize these events because I really like the idea of sharing knowledge, I mean receiving knowledge from other people, like this time, Tomer,
2 00:00:29.330 --> 00:00:49.120 Gabor Szabo: and from around the world. So that's a good idea, I think. And that's it. If you're watching on Youtube, then please like the video and follow the channel. And thanks everyone who arrived to this meeting, and especially Tomer, for giving us the presentation. Now it's your turn.
3 00:00:49.510 --> 00:00:52.949 Gabor Szabo: So welcome, please introduce yourself. And yeah.
4 00:00:53.120 --> 00:00:56.100 Tomer Brisker: Thank you. Let me share my screen.
5 00:00:58.800 --> 00:01:00.990 Tomer Brisker: Okay, can you see it?
6 00:01:01.920 --> 00:01:02.670 Gabor Szabo: Yes.
7 00:01:03.060 --> 00:01:03.920 Tomer Brisker: Excellent.
8 00:01:05.334 --> 00:01:09.469 Tomer Brisker: Okay, so 1st of all, I have to make a confession.
9 00:01:09.770 --> 00:01:14.040 Tomer Brisker: It wasn't 6 lines of code. It was actually 7 lines of code.
10 00:01:14.460 --> 00:01:28.110 Tomer Brisker: And I guess you're all pretty wondering, curious what these lines of code were. So here they are. Okay. Okay. It was 8 lines of code. If you count the space in between the functions.
11 00:01:28.660 --> 00:01:39.080 Tomer Brisker: and we'll dive into what exactly these lines of code mean a bit later, and why these allowed us to save so much memory.
12 00:01:39.220 --> 00:01:41.780 Tomer Brisker: But 1st of all, just so, you believe me.
13 00:01:41.930 --> 00:01:47.609 Tomer Brisker: this is our memory usage graph in production when we deployed this fix.
14 00:01:47.770 --> 00:02:01.359 Tomer Brisker: As you can see, the deployment was around 5:10 PM, which is a great time to deploy fixes. If I remember correctly, this was a Thursday, which is the end of the week in Israel, a perfect time for deploying to production.
15 00:02:01.870 --> 00:02:13.220 Tomer Brisker: But 1st of all, do we have anyone in the call who happens to be a US citizen or has to file US tax reports?
16 00:02:18.200 --> 00:02:21.170 Tomer Brisker: feel free to wave, or something.
17 00:02:21.860 --> 00:02:24.939 Tomer Brisker: If there are, I guess I guess not.
18 00:02:25.090 --> 00:02:31.839 Tomer Brisker: Well, if you were, I guess this would probably look pretty familiar to you.
19 00:02:31.950 --> 00:02:43.169 Tomer Brisker: So for those of you who don't know, practically all US citizens are required to file tax reports annually with the IRS
20 00:02:43.440 --> 00:02:47.260 Tomer Brisker: for their income taxes. This is a
21 00:02:47.360 --> 00:02:52.590 Tomer Brisker: pretty painful process. It requires filling out a lot of obscure forms.
22 00:02:53.093 --> 00:03:17.609 Tomer Brisker: And if you make some mistakes on it, you can find yourself in jail. So most people either pay one of the existing companies who provide services for filing tax reports, or pay an accountant to do the tax reports for them. So hi! My name is Tomer. I'm the tech lead at. We are fixing the issue of IRS tax report filing.
23 00:03:18.833 --> 00:03:27.250 Tomer Brisker: We are developing a simple to use application that allows users to file their taxes seamlessly with the irs.
24 00:03:27.793 --> 00:03:48.870 Tomer Brisker: It usually takes most users around half an hour to do their taxes, which is pretty awesome compared to what it normally takes, which is many hours. And we don't charge them nearly as much as an accountant or one of the existing providers charge for this.
25 00:03:49.694 --> 00:04:01.029 Tomer Brisker: If I look a bit tired in the recording, I'm going to let you guess which one of these is to blame today. Hint: she's on the left.
26 00:04:01.935 --> 00:04:04.964 Tomer Brisker: She's 6 months old.
27 00:04:05.810 --> 00:04:11.380 Tomer Brisker: I live with my partner and 2 kids in Givatayim, which is a suburb of Tel Aviv,
28 00:04:12.070 --> 00:04:18.920 Tomer Brisker: and our dog. But I guess you're not here to hear about me and my life.
29 00:04:19.149 --> 00:04:22.340 Tomer Brisker: You're here to hear about performance in Python.
30 00:04:23.170 --> 00:04:25.279 Tomer Brisker: So let's dive in.
31 00:04:25.420 --> 00:04:30.670 Tomer Brisker: Our story begins when we noticed there's some
32 00:04:30.830 --> 00:04:35.610 Tomer Brisker: certain action in our system that's taking quite a long time to complete
33 00:04:36.321 --> 00:04:41.589 Tomer Brisker: in fact, we even got some reports from users, complaining that they were hitting timeouts
34 00:04:41.740 --> 00:04:45.890 Tomer Brisker: when they were running this specific action within the system.
35 00:04:46.710 --> 00:05:01.249 Tomer Brisker: and I was assigned to this task, started digging in, and I managed to reproduce it locally. I created a nice little script that created the exact conditions of the users that were timing out.
36 00:05:01.800 --> 00:05:09.590 Tomer Brisker: and the 1st step was to see how long it actually takes. And yeah, it was actually pretty slow.
37 00:05:09.690 --> 00:05:20.060 Tomer Brisker: Python comes with a couple of built-in modules in the Standard Library that are pretty nice when you're timing things: those are time and timeit.
38 00:05:20.599 --> 00:05:25.270 Tomer Brisker: I will leave reading the exact documentation of them to the listener.
39 00:05:26.530 --> 00:05:27.429 Tomer Brisker: But
40 00:05:28.480 --> 00:05:54.030 Tomer Brisker: It's very useful if you know what you're looking for. For example, if there's a specific method that you know is slow, and you want to measure some change that you make to it and see the impact of it, this is very useful. We even have a little wrapper method that allows us to easily measure the timings for various functions that we call.
41 00:05:54.405 --> 00:06:02.299 Tomer Brisker: But what do we do if we're not sure where the slowness is coming from in this specific action in the system?
42 00:06:02.470 --> 00:06:16.779 Tomer Brisker: This was a pretty complex action. It involved calling several different services, a lot of methods, so it's pretty difficult if you don't know where the slowness is coming from.
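The two standard-library timing tools just mentioned can be used like this. This is a minimal sketch with a toy workload, not the speaker's wrapper method: `time.perf_counter()` brackets a region by hand, while `timeit` repeats a small snippet many times and reports the total.

```python
import time
import timeit

# Manual bracketing with perf_counter around an arbitrary toy computation.
start = time.perf_counter()
total = sum(i * i for i in range(100_000))
elapsed = time.perf_counter() - start
print(f"one run: {elapsed:.4f}s, total={total}")

# timeit handles the repetition and the clock for you.
per_1000 = timeit.timeit("sum(i * i for i in range(1000))", number=1000)
print(f"1000 runs of the small version: {per_1000:.4f}s")
```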
43 00:06:18.680 --> 00:06:23.669 Tomer Brisker: So this is what profilers were invented for
44 00:06:24.368 --> 00:06:31.150 Tomer Brisker: profiler. There's a TV series called the Profiler. We're not going to talk about it. I have no idea what it's about.
45 00:06:31.430 --> 00:06:35.990 Tomer Brisker: but profilers generally come into different varieties.
46 00:06:36.560 --> 00:07:03.049 Tomer Brisker: There are 2 varieties that we will mention as we go. The 1st variety is deterministic profilers. These are profilers that essentially, every time you make a method call, register that method call. They write down the start time, and once that method returns, they write down the end time. Python's standard library has a nice one called cProfile.
47 00:07:03.330 --> 00:07:24.280 Tomer Brisker: It gives you a context manager. Basically, you wrap the code that you want to measure with the context manager, then call whatever slow function you want to profile, and save the statistics to a file. And cProfile will actually take care of going over all of the method calls within that function,
48 00:07:24.410 --> 00:07:30.459 Tomer Brisker: measuring how long they take, how many times each method is called, etc.
49 00:07:30.560 --> 00:07:41.699 Tomer Brisker: And it saves all of these statistics into a file, which can be read using a module called pstats.
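The cProfile-plus-pstats workflow just described looks roughly like this. This is a self-contained sketch, not the speaker's script: the deliberately slow functions are invented for the example, and it reads the stats straight from the Profile object instead of going through a file.

```python
import cProfile
import io
import pstats

def inner():
    # A small function we expect to dominate the profile.
    return sum(range(1000))

def slow_action():
    # The "slow function" being profiled, calling inner() many times.
    return [inner() for _ in range(200)]

profiler = cProfile.Profile()
with profiler:            # context-manager form (Python 3.8+)
    slow_action()

# pstats reads the collected statistics; sort by cumulative time and
# print the top 5 rows of the table.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

To save to a file instead, `profiler.dump_stats("out.prof")` and later `pstats.Stats("out.prof")` give the same table.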
50 00:07:41.810 --> 00:08:05.169 Tomer Brisker: Pstats allows you to read these files. And there's a lot of information here. We'll dive into it a little bit to understand what this table is about. So 1st of all, on the right, we can see the file name, line number and function name, pretty basic, so you know what we're calling here.
51 00:08:05.260 --> 00:08:18.769 Tomer Brisker: On the left-hand side, we can see the number of calls to each function. As you can see, the 1st 3 here were called just once. This is actually the script that I was using to debug this issue.
52 00:08:19.561 --> 00:08:29.669 Tomer Brisker: The second column here is total time, which is the time that was spent within this specific method in total, over all of the times that it was called,
53 00:08:29.840 --> 00:08:39.579 Tomer Brisker: and there's cumulative time, which is basically the time that was spent within this method and any other method that was called from within that method.
54 00:08:40.289 --> 00:08:40.799 Tomer Brisker: and
55 00:08:41.130 --> 00:08:53.159 Tomer Brisker: something stood out pretty quickly to me: out of 12 seconds runtime in total, about 6 seconds, or half the runtime, was spent in one specific method.
56 00:08:53.955 --> 00:08:59.914 Tomer Brisker: And this method is ABC's subclass check, `__subclasscheck__`. Interesting. Okay,
57 00:09:00.620 --> 00:09:13.350 Tomer Brisker: let's see. And even more interesting is the number of times this was called. So in 12 seconds we actually called this method 175,000 times.
58 00:09:13.410 --> 00:09:40.590 Tomer Brisker: That's the number on the right. And if there's a slash and another number here, that means that this method was calling itself recursively. So in this case we were calling ABC's subclass check a bit over 3 million times. So for every single call that we were making to subclass check, it was actually making about 20 different calls on average.
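To see where `__subclasscheck__` gets invoked, here is a small standalone probe (not code from the talk, and assuming CPython's ABC caching behavior): every `isinstance()` or `issubclass()` against an abstract base class routes through `ABCMeta.__subclasscheck__`, which consults `__subclasshook__` and then caches the verdict per class, so repeat checks on the same class skip the hook.

```python
from abc import ABC

calls = {"n": 0}

class CountingABC(ABC):
    @classmethod
    def __subclasshook__(cls, other):
        # Invoked from ABCMeta.__subclasscheck__ when the answer for
        # `other` is not already cached.
        calls["n"] += 1
        return NotImplemented   # fall back to the normal subclass check

isinstance([], CountingABC)   # first check for list: the hook runs
isinstance([], CountingABC)   # cached: the hook does not run again
isinstance((), CountingABC)   # new class (tuple): the hook runs once more
print(calls["n"])             # 2
```

This caching is exactly why a hot path that keeps hitting uncached classes (or keeps invalidating the cache) can rack up millions of subclass-check calls.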
59 00:09:41.633 --> 00:09:53.739 Tomer Brisker: Okay, pretty interesting. But still, I'm not quite sure why we're calling this method so many times, or why is it taking so long when this method is being called.
60 00:09:53.920 --> 00:10:04.660 Tomer Brisker: And that's what the second type of profilers is really useful for identifying the second type of profilers is statistical profilers.
61 00:10:04.720 --> 00:10:21.380 Tomer Brisker: These are profilers that basically take a snapshot of your python call stack, or in any other language, the call stack. Every certain interval. Usually this is done. Every few milliseconds, the shorter the interval.
62 00:10:21.380 --> 00:10:37.109 Tomer Brisker: obviously, the higher the impact it has on performance. On the other hand, if you set too long of an interval, you might miss very quick method calls that return within the interval, and they won't actually be registered when running the profiler.
63 00:10:37.190 --> 00:10:44.800 Tomer Brisker: and a very common way of looking at statistical profiles is using a tool called flame charts.
64 00:10:45.400 --> 00:11:06.120 Tomer Brisker: The way that flame charts work: basically, you have 2 axes here. The x-axis is time, so the bigger the block is on the x-axis, the longer the time that was spent within that specific block; and the y-axis is the stack.
65 00:11:06.420 --> 00:11:26.929 Tomer Brisker: So you can see the actual call stack of every single method, and you can see why it was being called, where the call was coming from. You see the bigger ones, the smaller ones. That's very helpful when you need to debug and identify why a certain method is being called a lot of times.
66 00:11:27.900 --> 00:11:31.660 Tomer Brisker: So I used one of these statistical profilers
67 00:11:31.990 --> 00:11:54.509 Tomer Brisker: specifically, one called py-spy. There are multiple different profilers available for python, and each language has its own ecosystem of profilers. I'm just showing the ones that I used in this case, but there are various other tools that are useful, and they're all good in their own fields.
68 00:11:55.047 --> 00:12:02.359 Tomer Brisker: So I ran a statistical profile with py-spy on this reproducer that I created.
69 00:12:02.490 --> 00:12:10.869 Tomer Brisker: And hmm, yeah, okay, this is fine. This is fine. I can deal with that.
70 00:12:11.030 --> 00:12:20.069 Tomer Brisker: as you can see, a flame chart when you have a very complex operation, can be very, very, very difficult to read.
71 00:12:20.650 --> 00:12:27.250 Tomer Brisker: Sometimes there's something that stands out, you see, a very big block that's taking a very long time to call.
72 00:12:27.380 --> 00:12:52.389 Tomer Brisker: and you can identify the bottleneck pretty quickly from looking at this. But other cases everything is on fire, and you don't really know what's going on. Specifically, in this case we were seeing a certain method call being called 3 million times, which makes sense that it would be very difficult to identify all of these different calls within the flame chart.
73 00:12:52.640 --> 00:12:57.119 Tomer Brisker: and for that there was a nice tool called Sandwich.
74 00:12:57.330 --> 00:13:02.459 Tomer Brisker: Not that kind of sandwich. There's a tool called speedscope, and it has a
75 00:13:02.670 --> 00:13:05.350 Tomer Brisker: way of showing flame charts
76 00:13:05.460 --> 00:13:18.199 Tomer Brisker: in a different way, which they call sandwich. Basically, on the left-hand side, we can see all of the different method calls within our application, within the run that was profiled.
77 00:13:18.560 --> 00:13:27.659 Tomer Brisker: And we can sort this list by the total time and by the self time. This is the same, by the way, as we saw previously: total time and
78 00:13:27.790 --> 00:13:33.126 Tomer Brisker: the cumulative time within cProfile.
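The self-time versus cumulative-time distinction being described can be seen directly in cProfile's output. A minimal sketch (the function names `helper` and `outer` are invented for illustration): `outer` has almost no self time (tottime) but a large cumulative time (cumtime), because cumtime includes everything it calls.

```python
import cProfile
import io
import pstats

def helper():
    # Does the actual work; its time shows up as tottime ("self time").
    return sum(i * i for i in range(100_000))

def outer():
    # Spends almost no time itself, but its cumtime includes helper().
    return [helper() for _ in range(10)]

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

# tottime = time spent inside the function itself,
# cumtime = time including everything it called.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Sorting by `"cumulative"` surfaces the entry points; sorting by `"tottime"` surfaces where the work actually happens.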
79 00:13:33.860 --> 00:13:36.710 Tomer Brisker: And then once you click on one of these
80 00:13:37.000 --> 00:13:42.490 Tomer Brisker: you can see on the right-hand side those 2 parts, the callers and the callees.
81 00:13:42.500 --> 00:14:09.719 Tomer Brisker: The top half shows you where this method was being called from. So, for example, in this case we can see subclass check was mostly called from instance check and some other internal methods. Also, here we can see instance check was being called from various other methods. And here the x-axis actually shows the time that's cumulative for the specific
82 00:14:09.810 --> 00:14:29.109 Tomer Brisker: method. So this isn't a single call; this is the total of the times that it was called from here. And obviously I can't show the internals of our system, but you can see that there were several places where we were calling instance check pretty commonly, leading to most of
83 00:14:29.170 --> 00:14:41.379 Tomer Brisker: the load on this method. And on the left, you can see again, this was about 8 seconds in this test run. So quite a long time
84 00:14:41.480 --> 00:14:44.690 Tomer Brisker: from the overall. Time.
85 00:14:45.310 --> 00:14:50.109 Tomer Brisker: Okay, so instance check, subclass check.
86 00:14:50.290 --> 00:15:05.529 Tomer Brisker: This is like built-in Python stuff, right? And it's not something in our code base. What should I do about it? It's pretty odd, I don't know. Let's, I guess, ask Dr. Google.
87 00:15:06.302 --> 00:15:16.930 Tomer Brisker: And it turns out there's an open issue about ABC subclass check, which has very poor performance and, I think, a memory leak. Hmm!
88 00:15:17.610 --> 00:15:18.680 Tomer Brisker: Memory leak.
89 00:15:19.030 --> 00:15:20.260 Tomer Brisker: Interesting.
90 00:15:21.537 --> 00:15:25.922 Tomer Brisker: Memories. Memory is pretty expensive.
91 00:15:26.940 --> 00:15:35.520 Tomer Brisker: And it turns out I'm pretty bad at counting. So there are not actually 2 kinds of profilers; there are 3 kinds of profilers.
92 00:15:35.710 --> 00:15:41.029 Tomer Brisker: There are also memory profilers besides the runtime profilers.
93 00:15:41.170 --> 00:16:10.060 Tomer Brisker: Memory, obviously, is expensive if you need to use and allocate a lot of it, but it's also expensive in terms of performance, because if the Python runtime runs out of memory, it has to make system calls to allocate additional memory to the Python program. The garbage collector also has to go over all of the memory and clean up unused memory. So the more memory you allocate, the slower the garbage collection will be.
94 00:16:10.120 --> 00:16:38.349 Tomer Brisker: So these tools, the memory profilers, allow us to identify issues with our memory allocations. Sometimes our program can be very fast but allocate a very large amount of memory. Just recently we had a case, actually, where a certain process was crashing, and we were seeing pods being killed, so out-of-memory killed.
95 00:16:38.380 --> 00:16:45.739 Tomer Brisker: Basically, in Kubernetes, when you allocate a certain amount of memory to a process,
96 00:16:45.930 --> 00:17:15.490 Tomer Brisker: if the process runs over that memory, the Kubernetes controller will kill it, so it doesn't starve out other processes. And in this specific case memory was running out so quickly that it wasn't even sending telemetry data to Prometheus. And we actually used a memory profiler to identify where exactly this memory was being allocated so rapidly that it was killing our pods.
97 00:17:16.339 --> 00:17:34.970 Tomer Brisker: So let's talk about memory profilers a bit. There's a really nice one for Python called memray. It gives you some nice runtime statistics on your program. In this case this is a reproducer script that I was using.
98 00:17:34.970 --> 00:17:51.320 Tomer Brisker: We can see that we actually had about 11 million object allocations during the script. You can notice that the runtime here is a bit longer than the 12 seconds that it took when running with just cProfile, and that's because
99 00:17:51.330 --> 00:17:57.990 Tomer Brisker: every single memory allocation that the program does. There's some overhead to it when you're profiling it.
100 00:17:58.010 --> 00:18:04.920 Tomer Brisker: So the Runtime was a bit slower here, and we were allocating nearly 2 GB of memory.
101 00:18:07.340 --> 00:18:10.049 Tomer Brisker: When running this this process.
102 00:18:10.410 --> 00:18:23.900 Tomer Brisker: And it also gives you information like which Python memory allocator was being used, and the number of frames, that's the number of samples it was taking. This is also a statistical profiler.
103 00:18:25.123 --> 00:18:33.740 Tomer Brisker: And it also shows us a nice flame chart like we saw before. But, unlike the runtime profilers.
104 00:18:33.760 --> 00:18:59.500 Tomer Brisker: this flame chart's x-axis is actually the size of the memory allocated, so the wider a block is, the more memory was allocated within that block. And also we can see specific statistics for a certain method call. For example, in this case we were allocating 82 MB of memory, and two and a half
105 00:18:59.520 --> 00:19:05.030 Tomer Brisker: thousand objects were being allocated in a single call to subclass check.
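memray itself is run from the command line (for example `memray run script.py`), but the core idea of attributing allocation sizes and counts to code can be sketched with the standard library's tracemalloc. The allocating function below is invented for illustration:

```python
import tracemalloc

def allocate_a_lot():
    # Deliberately allocate many small objects.
    return [str(i) * 10 for i in range(50_000)]

tracemalloc.start()
data = allocate_a_lot()

# current/peak traced memory, in bytes.
current, peak = tracemalloc.get_traced_memory()
# The single source line responsible for the most allocated memory.
top = tracemalloc.take_snapshot().statistics("lineno")[0]
tracemalloc.stop()

print(f"kept {len(data)} objects")
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
print("biggest allocator:", top)
```

Unlike memray, tracemalloc traces every allocation (it is not statistical), so it adds more overhead, but it needs no external dependency.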
106 00:19:09.150 --> 00:19:11.982 Tomer Brisker: Okay, so let's go back to the bug.
107 00:19:13.050 --> 00:19:20.099 Tomer Brisker: ABC subclass check has very poor performance and, I think, a memory leak. It's been open since May 2022.
108 00:19:20.350 --> 00:19:24.630 Tomer Brisker: Anybody in the audience maybe knows who Samuel Colvin is
109 00:19:26.440 --> 00:19:32.050 Tomer Brisker: feel free to unmute. If you do anyone.
110 00:19:32.150 --> 00:19:48.209 Tomer Brisker: Samuel Colvin, that's the guy behind pydantic. Pydantic is a very popular data validation library for Python. He opened this issue almost 3 years ago, and it's still open. So
111 00:19:48.856 --> 00:19:53.720 Tomer Brisker: well, I guess case closed. Python is a slow language.
112 00:19:53.940 --> 00:20:09.769 Tomer Brisker: There's nothing to do about it. We have to rewrite our application completely using Go, Rust, Elixir, I don't know what the cool kids are using today. Gabor, I know you do Rust a lot, so I guess rewrite, right?
113 00:20:10.990 --> 00:20:16.260 Tomer Brisker: Now, I could just decide this is a case of
114 00:20:16.470 --> 00:20:26.119 Tomer Brisker: language limitations, we have to cope with it, and that's it. But I decided to dig in a little bit deeper and try to figure out if there's something we can do to resolve the issue.
115 00:20:26.310 --> 00:20:35.859 Tomer Brisker: And to dig in a little bit deeper, we need to discuss ABC a bit. Not the TV network: abstract base classes,
116 00:20:36.280 --> 00:20:41.090 Tomer Brisker: for those of you who are not familiar with abstract base classes,
117 00:20:41.677 --> 00:20:47.099 Tomer Brisker: which, as we mentioned, have fairly poor performance for the subclass check.
118 00:20:47.806 --> 00:20:54.949 Tomer Brisker: Abstract base classes are a mechanism in Python that allows us to define an abstract class
119 00:20:55.260 --> 00:21:14.820 Tomer Brisker: which defines certain methods that we require any class subclassing from it to implement. So if we try to subclass it and we don't implement these specific methods, the interpreter will yell at us, saying, Hey,
120 00:21:14.930 --> 00:21:19.040 Tomer Brisker: this class has to implement a certain method.
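A minimal example of what's being described (the class names `Shape`, `Square`, and `Broken` are invented for illustration): a subclass that skips an abstract method fails at instantiation time.

```python
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        """Every concrete Shape must implement this."""

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side

class Broken(Shape):
    pass  # forgot to implement area()

print(Square(3).area())  # works: 9
try:
    Broken()  # raises TypeError: can't instantiate abstract class
except TypeError as e:
    print("interpreter yells:", e)
```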
121 00:21:19.701 --> 00:21:34.009 Tomer Brisker: Usually we subclass ABCs, which makes sense, basically defining a specific interface we want to implement. But they have an interesting feature, which is registering virtual subclasses.
122 00:21:34.662 --> 00:21:38.999 Tomer Brisker: Which is, for example, let's say you have a
123 00:21:39.180 --> 00:21:50.129 Tomer Brisker: base class that you've defined, and you want one of the built-in types of python to be a subclass of that. Obviously, you can't
124 00:21:50.380 --> 00:21:54.520 Tomer Brisker: have int subclassing something else right?
125 00:21:55.065 --> 00:22:09.930 Tomer Brisker: And the ABC class, or rather the ABCMeta metaclass, allows us to register various other classes as virtual subclasses of the base class.
126 00:22:09.950 --> 00:22:33.360 Tomer Brisker: This is also useful if you have a class that implements multiple interfaces. Let's say you have a class that implements iterable, implements hashable, implements, I don't know, sortable, say, and a few others. Obviously, you don't want to have to declare all of these when you create the class.
127 00:22:33.860 --> 00:22:50.000 Tomer Brisker: You can just register this class as a virtual subclass. And that means that if you look at the method resolution order, the MRO, of a specific object of that class, you won't see these classes as its parents.
128 00:22:50.150 --> 00:23:15.209 Tomer Brisker: That's, by the way, the way subclass check usually works: it checks the method resolution order to see if the parent class is there. But since we allow registering subclasses virtually to classes that aren't the actual parents, there's a specific implementation within ABCMeta for subclass check and
129 00:23:15.270 --> 00:23:21.079 Tomer Brisker: instance check that allows support for this specific use case.
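A small sketch of a virtual subclass (the names `Sized` and `MyList` are invented): after `register`, `issubclass` reports True even though the ABC never appears in the MRO, which is exactly the case ABCMeta's custom subclass check exists to support.

```python
from abc import ABC

class Sized(ABC):
    pass

class MyList:
    pass

# Register MyList as a *virtual* subclass: no inheritance involved.
Sized.register(MyList)

print(issubclass(MyList, Sized))  # True, via ABCMeta.__subclasscheck__
print(Sized in MyList.__mro__)    # False: not an actual parent
```

This is also how you can make a built-in type like `int` count as a subclass of your own ABC, since you can't change what `int` inherits from.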
130 00:23:21.390 --> 00:23:46.080 Tomer Brisker: So just to better understand this use case, let's say we have this: here we have a base class which is an abstract base class. We have class A and class B inheriting from that base class, and we have a virtual subclass which isn't inheriting from the base class but is registered as a subclass of this base class, and so forth: we have virtual subclass A, etc., etc.
131 00:23:46.867 --> 00:23:54.869 Tomer Brisker: Let's say we have an object, and we want to check if that object is a subclass or an instance of base class.
132 00:23:55.358 --> 00:24:02.720 Tomer Brisker: In this case we would. Let's say, this object is of type, virtual, subclass, A, we would need to check
133 00:24:02.910 --> 00:24:04.010 Tomer Brisker: the whole
134 00:24:04.160 --> 00:24:25.339 Tomer Brisker: inheritance tree for base class to identify if this class was registered to any of the classes within that inheritance tree. So this calculation is pretty complex. There's also potentially an issue with a bad implementation of
135 00:24:25.420 --> 00:24:53.739 Tomer Brisker: caching within the implementation of abstract base class. But normally this isn't a big issue, because you wouldn't have that many classes inheriting from a single base class, maybe 2, 3, 10, 20. Usually it's not noticeable. But, as I mentioned, we're dealing with tax reports and tax filing for the US IRS.
136 00:24:53.890 --> 00:25:23.869 Tomer Brisker: There are thousands of different forms that the user needs to fill in. Think of the number of states: each state has its own forms, each form is composed of multiple different parts, and you can pretty quickly guess the rough number of classes that we have in our system to enable this fairly complex calculation, which has led us to this issue because of the
137 00:25:23.930 --> 00:25:32.029 Tomer Brisker: very large inheritance that we have from our base class that we use for the calculation.
138 00:25:32.380 --> 00:25:49.000 Tomer Brisker: And that's why, going back to the solution, this solution worked. Let's look at it a bit more in depth. Obviously, we're using type definitions, we are not barbarians; previously I was dropping them just to make it easier to read.
139 00:25:49.550 --> 00:26:11.719 Tomer Brisker: But this is very, very straightforward. These are is subclass and is instance, and what they do is they go to type (type is the base class for all classes, in case you're not familiar with it) and call subclass check or instance check on type directly. By default, when you call isinstance or issubclass,
140 00:26:11.990 --> 00:26:39.610 Tomer Brisker: the way it works is, it goes to the class that is the second parameter of the function call, basically, and it checks the metaclass for that class, looking for subclass check or instance check depending on which method you called, and then it goes up the method resolution order until it finds the implementation. In case of
141 00:26:39.700 --> 00:27:02.590 Tomer Brisker: an abstract base class, it would go to ABCMeta's subclass check. But here what we do is we basically bypass the ABC methods and go directly to type's subclass check, which is the default implementation used by Python for any types that aren't abstract base classes.
142 00:27:03.150 --> 00:27:15.699 Tomer Brisker: And then all we had to do in our code base, and it's a very simple change, is use our fast issubclass instead of using the default Python implementation,
143 00:27:15.800 --> 00:27:35.569 Tomer Brisker: checking if it's a subclass of the base model. And the reason this worked is because we didn't really care about the virtual subclass aspect of ABC. In our case we were just checking if a certain object is or isn't a subclass of our base model.
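The bypass being described can be sketched like this. The helper name `fast_issubclass` and the class names are my own for illustration; the key line is calling `type.__subclasscheck__` directly, which skips ABCMeta's expensive implementation and, as the speaker notes, also ignores virtual registrations:

```python
from abc import ABC

def fast_issubclass(cls, base):
    # Bypass ABCMeta.__subclasscheck__ and use type's default check,
    # which only walks the real MRO (no virtual-subclass lookup).
    return type.__subclasscheck__(base, cls)

class BaseModel(ABC):
    pass

class Form(BaseModel):  # a real subclass
    pass

class Virtual:  # only *registered*, not inheriting
    pass

BaseModel.register(Virtual)

print(fast_issubclass(Form, BaseModel))     # True
print(issubclass(Virtual, BaseModel))       # True  (ABC machinery)
print(fast_issubclass(Virtual, BaseModel))  # False (bypass skips register)
```

The trade-off is exactly the one stated in the talk: this is only safe when you never rely on virtually registered subclasses of the class you're checking against.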
144 00:27:35.670 --> 00:27:44.720 Tomer Brisker: This wouldn't work, obviously, if we were actually registering our objects into the base model instead of directly inheriting from it.
145 00:27:44.850 --> 00:27:47.290 Tomer Brisker: But in our case this was good enough.
146 00:27:47.520 --> 00:27:57.079 Tomer Brisker: and, as you can see, there's another very nice added benefit to profiling. It lets you add nice statistics to your pull requests.
147 00:27:57.812 --> 00:28:08.119 Tomer Brisker: For example, runtime went down from 33 seconds to 26 seconds, memory usage was improved by 50%, etc., etc.
148 00:28:08.120 --> 00:28:31.359 Tomer Brisker: Actually, I didn't even have to implement this in all of the places. I only had to switch to using the fast subclass in very specific places that I identified, using profiling, as being the most common places this was being called from, and this already gave me a very significant improvement.
149 00:28:31.360 --> 00:28:56.700 Tomer Brisker: And when we actually deployed this fix, it turns out that the impact was even higher than in the specific use case that I was profiling, because this had impact across the system. It significantly, as you can see here, reduced our memory load, it also improved the system runtime in general, and the system load time was drastically reduced,
150 00:28:56.960 --> 00:29:08.460 Tomer Brisker: making our deployments much faster and saving a lot of costs. Questions, anybody?
151 00:29:11.030 --> 00:29:16.439 Gabor Szabo: First of all, thanks, thanks for the presentation. Can you go back one slide?
152 00:29:16.880 --> 00:29:17.540 Tomer Brisker: Yes.
153 00:29:17.680 --> 00:29:20.002 Gabor Szabo: What is this bump there?
154 00:29:20.848 --> 00:29:37.821 Tomer Brisker: That's a good question. Actually, we're using rolling updates. So basically, we spun up a few pods, switched over to them, then spun up a few more pods, and these are...
155 00:29:39.060 --> 00:29:56.730 Tomer Brisker: This is the time when there were still some of the older pods running in parallel with the new pods that were using less memory. So the first drop is when we killed the first batch of the old pods, and the second drop is when we killed the second batch.
156 00:29:57.290 --> 00:29:57.870 Gabor Szabo: Hmm!
157 00:30:00.030 --> 00:30:05.289 Gabor Szabo: But why did this... I still don't understand why it went up, then went up again.
158 00:30:05.554 --> 00:30:10.840 Tomer Brisker: So we started a few pods, killed a few, and then started a bunch more and then killed the rest.
159 00:30:11.010 --> 00:30:11.490 Tomer Brisker: Oh.
160 00:30:11.490 --> 00:30:12.060 Gabor Szabo: Okay.
161 00:30:12.280 --> 00:30:18.859 Tomer Brisker: Oh, man, this is just the loading of the new pods while the old ones were still running.
162 00:30:19.650 --> 00:30:20.610 Gabor Szabo: Okay. Nice.
163 00:30:24.700 --> 00:30:26.580 Tomer Brisker: Any other questions. Anybody?
164 00:30:29.900 --> 00:30:31.789 Tomer Brisker: Okay, thank you very much.
165 00:30:32.380 --> 00:30:39.630 Gabor Szabo: No, it seems so. Thank you. Thank you for giving this presentation and everyone for being here listening.
166 00:30:40.250 --> 00:30:41.120 Gabor Szabo: And
167 00:30:42.190 --> 00:30:49.600 Gabor Szabo: I'm going to stop the video. But please remember to like the video and follow the channel, and see you next time.
168 00:30:50.180 --> 00:30:52.370 Tomer Brisker: Bye, bye, thanks for having me, Gabor.
169 00:30:52.370 --> 00:30:53.150 Gabor Szabo: Bye-bye.
for loop and a random number generator.
In this talk, we'll see how to use Monte Carlo simulations to solve various problems that might intimidate you due to lack of math skills.

1 00:00:02.020 --> 00:00:20.500 Gabor Szabo: So hi, and welcome to the Codemaven Meetup Group and to the Codemaven Youtube Channel, in case you are watching this on Youtube. My name is Gabor. I provide training services in Python and Rust and help companies get started using these languages.
2 00:00:20.740 --> 00:00:28.840 Gabor Szabo: And I also think that it's important to share knowledge among people. So that's why I'm organizing these events, these meetings.
3 00:00:29.820 --> 00:00:40.000 Gabor Szabo: So I would like to welcome everyone who joined us at this meeting, and especially Mickey, for agreeing to give this presentation, and that's it. The floor is yours, Mickey.
4 00:00:40.694 --> 00:00:44.060 Miki Tebeka: Hi, everyone. I am going to share my screen.
5 00:00:44.390 --> 00:00:48.980 Miki Tebeka: Then we will start sure.
6 00:00:50.920 --> 00:00:51.690 Miki Tebeka: Okay.
7 00:00:52.080 --> 00:01:03.720 Miki Tebeka: so we are going to talk about what I call simulations for the mathematically challenged. And this is about the tool called simulation. How you can solve various problems.
8 00:01:03.720 --> 00:01:06.970 Gabor Szabo: Sorry. Just just one thing. Can can you move the.
9 00:01:07.230 --> 00:01:07.910 Miki Tebeka: Oh, there!
10 00:01:07.910 --> 00:01:09.300 Gabor Szabo: This again, I'll do this.
11 00:01:10.010 --> 00:01:11.290 Gabor Szabo: Yeah, thanks.
12 00:01:11.290 --> 00:01:17.899 Miki Tebeka: Moved. Okay, sorry. Okay. So my name is Mickey. I've been a professional developer
13 00:01:18.210 --> 00:01:25.368 Miki Tebeka: for 37 years now, give or take. I work mostly with Python and Go.
14 00:01:26.200 --> 00:01:31.599 Miki Tebeka: I teach. I consult, I write books, I do videos. I
15 00:01:31.870 --> 00:01:46.060 Miki Tebeka: enjoy myself in a very geeky way. And this is a tool that I used on a couple of occasions. I think it's very simple, but not a lot of people are aware of it,
16 00:01:46.580 --> 00:01:51.379 Miki Tebeka: and it starts usually with the problem that you have usually a data related problem.
17 00:01:53.670 --> 00:01:58.850 Miki Tebeka: You have a cache. You want to know, what are the odds that, given that
18 00:01:59.080 --> 00:02:10.100 Miki Tebeka: amount of cache hits, what's the average latency? Other questions that usually involve statistics or probability.
19 00:02:10.680 --> 00:02:16.470 Miki Tebeka: And then you say, okay, you know, we're all geeks, right? So we go and hit the books.
20 00:02:17.910 --> 00:02:23.169 Miki Tebeka: But then you start seeing all these kinds of equations.
21 00:02:23.290 --> 00:02:28.480 Miki Tebeka: and usually around that time I say, you know what, maybe this problem is not that important,
22 00:02:28.670 --> 00:02:30.920 Miki Tebeka: and I'll move on to do something else.
23 00:02:31.680 --> 00:02:37.979 Miki Tebeka: And what I wanted to show you is basically that if you can write a for loop,
24 00:02:38.110 --> 00:02:39.529 Miki Tebeka: you can do statistics.
25 00:02:40.040 --> 00:02:46.170 Miki Tebeka: and that's it. You don't need more than that. You need a for loop, and you need random,
26 00:02:46.530 --> 00:02:55.529 Miki Tebeka: and these are the only 2 tools that you need in order to work. By the way, I'm going to show code, and if you have questions, feel free to ask
27 00:02:58.270 --> 00:03:12.519 Miki Tebeka: if you don't understand the code, if you want to learn about other things. So what we're going to do is talk about these 5 problems. We're going to talk about the game of Catan, and what are the best tiles; we're going to calculate pi
28 00:03:12.800 --> 00:03:17.480 Miki Tebeka: randomly, which sounds weird, but it's another
29 00:03:18.120 --> 00:03:23.610 Miki Tebeka: interesting use for simulations. We're going to solve the birthday problem:
30 00:03:23.830 --> 00:03:31.290 Miki Tebeka: given 23 people, I think, what are the odds that 2 people, say in this group, have
31 00:03:32.900 --> 00:03:55.700 Miki Tebeka: the same birthday? We're going to see what are the odds that a person is sick or not, given a test that says that he is sick. And we're going to talk about the Monty Hall problem, which is statistically interesting, but also a very philosophical question. I'm not a philosopher, so you can discuss it later and see what's going on.
32 00:03:56.150 --> 00:04:02.910 Miki Tebeka: So let's start with Catan. Right? So in Catan we have these tiles, and every tile has a number on it.
33 00:04:03.290 --> 00:04:06.159 Miki Tebeka: and then you throw a couple of dices.
34 00:04:06.370 --> 00:04:13.100 Miki Tebeka: and if the number of the dices matches the number of your tile, then you can do things in the game.
35 00:04:13.290 --> 00:04:20.110 Miki Tebeka: So at the beginning you can pick where you want to put your pieces, and it's up to you to decide
36 00:04:20.269 --> 00:04:26.870 Miki Tebeka: which tile you want. And you want to know, you know, which tiles are going to get the most hits. What is the probability
37 00:04:27.060 --> 00:04:28.349 Miki Tebeka: of that?
38 00:04:29.890 --> 00:04:30.850 Miki Tebeka: So
39 00:04:33.760 --> 00:04:39.109 Miki Tebeka: This is this is that. Yes, and I'm old. I'm using vim.
40 00:04:39.440 --> 00:04:54.299 Miki Tebeka: Sue me later. But I think the code is clear enough. So basically, what we're going to do is a dice roll, which is basically just a random number between one and 6. This is coming from there.
41 00:04:54.510 --> 00:05:03.489 Miki Tebeka: And then what we're going to do is run a simulation. So we're going to run a lot of dice rolls. So I'm going to do a million
42 00:05:05.690 --> 00:05:07.109 Miki Tebeka: dice rolls,
43 00:05:08.070 --> 00:05:21.110 Miki Tebeka: and every time I'm going to do 2 dice rolls, right? So I get a number, and I'm updating some kind of counter, right? We have Counter from collections. This is a special data structure that basically stores
44 00:05:21.520 --> 00:05:29.950 Miki Tebeka: how many we have of each value.
45 00:05:30.090 --> 00:05:35.210 Miki Tebeka: And then I'm going over all the numbers right. The minimal
46 00:05:35.680 --> 00:05:48.199 Miki Tebeka: number that you can get with rolling 2 dices is 2, two ones, and the maximum is 12, but the range is half open, so we're not going to get there. I'm going to show the fraction:
47 00:05:49.340 --> 00:05:53.260 Miki Tebeka: how many of
48 00:05:54.170 --> 00:05:57.799 Miki Tebeka: this number out of the total counts, and I'm going to print it out.
49 00:05:57.920 --> 00:06:15.039 Miki Tebeka: That's the code that I'm going to run. And then, if you run it with python, you're going to see, now we get probabilities, right? And you see that, unsurprisingly, 7
50 00:06:15.290 --> 00:06:34.940 Miki Tebeka: has the best percentage in a roll of 2 dices. And you can do it what I call the hard way, which is just, you know, doing all the combinations of all the dice rolls and then calculating how many there are. But for me as a programmer, this is much easier.
51 00:06:35.610 --> 00:06:39.139 Miki Tebeka: Why, they just write some code. This is like 20 lines of code.
52 00:06:39.370 --> 00:06:45.859 Miki Tebeka: pretty simple, and now I have it. So these are the basics of simulation: we basically
53 00:06:47.220 --> 00:06:54.540 Miki Tebeka: create scenarios using some kind of randomness in each scenario, and then we are going to
54 00:06:55.030 --> 00:07:04.519 Miki Tebeka: calculate some statistics about what happened in every scenario, and finally display the result. And this is known as a simulation, or Monte Carlo simulation
55 00:07:04.710 --> 00:07:05.669 Miki Tebeka: for what we
56 00:07:08.870 --> 00:07:10.259 Miki Tebeka: questions about this one
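The dice script being walked through looks roughly like this (a reconstruction of the approach described, not the speaker's actual code):

```python
"""Roll two dice a million times; print the probability of each sum."""
from collections import Counter
from random import randint

def roll():
    # One die: a random number between 1 and 6.
    return randint(1, 6)

n = 1_000_000
counts = Counter()
for _ in range(n):
    counts[roll() + roll()] += 1  # sum of two dice

# Possible sums are 2..12; range is half-open, so go up to 13.
for total in range(2, 13):
    print(f"{total:2d}: {counts[total] / n:.2%}")
```

Running it shows 7 as the most likely sum, at about 1/6, matching the combinatorial answer (6 of the 36 equally likely pairs sum to 7).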
57 00:07:16.070 --> 00:07:17.270 Miki Tebeka: no questions.
58 00:07:17.570 --> 00:07:21.660 Miki Tebeka: Alright. By the way, if you ask questions, just open the mic and ask questions, cause
59 00:07:22.140 --> 00:07:26.409 Miki Tebeka: it's hard for me to focus both on the code and on the Zoom screen.
60 00:07:26.960 --> 00:07:34.329 Miki Tebeka: Okay, the next thing we're going to do it's pretty interesting. We're going to calculate pi again randomly.
61 00:07:34.830 --> 00:07:35.890 Miki Tebeka: So
62 00:07:36.820 --> 00:07:42.999 Miki Tebeka: what the way we're going to do it is, we're going to say, let's take a circle which has a radius of one.
63 00:07:43.810 --> 00:07:48.237 Miki Tebeka: And now we're going to concentrate only on the top right
64 00:07:48.990 --> 00:07:57.920 Miki Tebeka: square, which is the bounding square for this circle, and we're going to start getting random dots:
65 00:07:58.290 --> 00:08:02.520 Miki Tebeka: if the dot falls in the circle, I'm going to paint them as
66 00:08:03.110 --> 00:08:08.290 Miki Tebeka: green, and if it falls outside of the circle, I'm going to paint them as red.
67 00:08:08.860 --> 00:08:16.689 Miki Tebeka: Okay? So once I've done it enough times I can calculate what is the ratio between the green dots and the red dots.
68 00:08:17.180 --> 00:08:21.580 Miki Tebeka: and this ratio is a quarter of pi,
69 00:08:24.010 --> 00:08:28.210 Miki Tebeka: right? Because the area of the
70 00:08:30.090 --> 00:08:32.335 Miki Tebeka: circle is
71 00:08:33.780 --> 00:08:37.330 Miki Tebeka: pi r squared. But r is one, so it's just pi,
72 00:08:37.840 --> 00:08:42.050 Miki Tebeka: so basically, the fraction of dots that falls inside the circle should give pi,
73 00:08:42.320 --> 00:08:50.420 Miki Tebeka: but we're doing it only on a quarter of a circle, so this is a quarter of pi, and we are going to get the number pi.
74 00:08:50.800 --> 00:08:56.720 Miki Tebeka: So this is pi.py.
75 00:08:57.840 --> 00:09:13.699 Miki Tebeka: So again, we're going to import, this time, uniform from random, and then sqrt. And this is going to run for a bit, so I'm going to display a progress bar with something called tqdm.
76 00:09:15.030 --> 00:09:22.700 Miki Tebeka: And then the radius is one, and we have n, which is the number of iterations, which is a hundred million,
77 00:09:23.370 --> 00:09:35.250 Miki Tebeka: and inner, is the number of points that are inside the circle, which I'm going to do with to start with 0, right? So I'm getting X and Y, which is uniform between 0 and one.
78 00:09:35.960 --> 00:09:42.390 Miki Tebeka: And then, if the point falls inside the circle, I'm just going to increment inner.
79 00:09:43.700 --> 00:09:46.760 Miki Tebeka: So this is how many points fell inside the circle.
80 00:09:48.670 --> 00:10:01.319 Miki Tebeka: Now the ratio is inner divided by N. And as we said, this is quarter of a pi, so we need to print out 4 times this ratio to get to the number of pi
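The pi estimation being described looks roughly like this (a sketch of the approach, not the talk's exact script; I use a smaller n than the hundred million in the talk and skip the tqdm progress bar so it runs quickly):

```python
"""Estimate pi by sampling random points in the unit square."""
from random import uniform

n = 1_000_000  # the talk uses 100_000_000; smaller here for speed
inner = 0
for _ in range(n):
    x = uniform(0, 1)
    y = uniform(0, 1)
    if x * x + y * y <= 1:  # inside the quarter circle of radius 1
        inner += 1

# inner/n approximates a quarter of pi, so multiply by 4.
pi_estimate = 4 * inner / n
print(pi_estimate)
```

Comparing squared distance to 1 avoids the sqrt call entirely, since sqrt(d) <= 1 exactly when d <= 1.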
81 00:10:07.660 --> 00:10:14.720 Miki Tebeka: And you know what, I'm going to also run the time command to show you how much time it took. So this is a hundred million
82 00:10:15.110 --> 00:10:19.840 Miki Tebeka: runs, so it's going to take a little bit of time.
83 00:10:20.300 --> 00:10:21.420 Miki Tebeka: And
84 00:10:23.110 --> 00:10:30.290 Miki Tebeka: and it's a good thing in the winter, because it also warms up your CPU, so you can warm yourself without using the A/C.
85 00:10:35.020 --> 00:10:43.789 Miki Tebeka: As I said, tqdm, which shows the progress bar, is really nice, especially if you have long-running processes: you know if your process is actually running or if it's stuck.
86 00:10:43.930 --> 00:10:45.236 Miki Tebeka: So we're
87 00:10:46.570 --> 00:10:56.980 Miki Tebeka: So it's done, and we see that we got 3.14,
88 00:10:57.380 --> 00:11:04.729 Miki Tebeka: yeah, which is close enough to pi. And it took us about 41 seconds to run.
89 00:11:06.320 --> 00:11:12.810 Miki Tebeka: Now, one thing that can help you with simulations is PyPy.
90 00:11:12.980 --> 00:11:24.209 Miki Tebeka: So if you're not familiar, the Python we are using is called CPython; it's Python written in C. There are other Pythons, such as Jython, which is Python written in Java,
91 00:11:24.530 --> 00:11:30.700 Miki Tebeka: and several others, and MicroPython for micro devices, and there is PyPy,
92 00:11:30.970 --> 00:11:39.090 Miki Tebeka: which is a Python written in Python, and it has several optimizations that are not in CPython,
93 00:11:39.330 --> 00:11:48.439 Miki Tebeka: especially a JIT compiler, though from 3.13 and up we have an experimental JIT compiler in CPython, which should bring it closer.
94 00:11:49.475 --> 00:11:54.449 Miki Tebeka: And if I'm going to run PyPy on that thing,
95 00:11:54.640 --> 00:11:56.969 Miki Tebeka: you're going to see the difference.
96 00:11:57.100 --> 00:12:05.319 Miki Tebeka: Right? Oh, I forgot the time command. Okay, you see, this is
97 00:12:05.520 --> 00:12:12.520 Miki Tebeka: 3.5 seconds, so more than 10 times faster on these calculations. So I'm not going to... I'm not saying,
98 00:12:13.420 --> 00:12:22.700 Miki Tebeka: use PyPy for everything. There are some compatibility issues, especially with external libraries and maybe other things, and it's not
99 00:12:22.890 --> 00:12:30.900 Miki Tebeka: on par with CPython. Currently, I think they're on
100 00:12:31.160 --> 00:12:36.990 Miki Tebeka: the equivalent of 3.10, and right now in Python we are on 3.13. So they take some time;
101 00:12:37.400 --> 00:12:40.189 Miki Tebeka: they're catching up, and then they close the gap.
102 00:12:40.690 --> 00:12:48.000 Miki Tebeka: But it's a nice tool to know and work with. Questions about this one?
103 00:12:56.360 --> 00:12:59.850 Gabor Szabo: It's not about this one, and probably it's not
104 00:13:00.490 --> 00:13:13.559 Gabor Szabo: relevant to this presentation, but maybe it's for another time: how come PyPy can be so much faster? I would really like to understand this.
105 00:13:13.560 --> 00:13:29.079 Miki Tebeka: The current CPython itself does not do any optimization. So if you look at a C compiler, it has tons of optimizations: loop unrolling, constant folding, a lot of things that the Python interpreter is not doing at all.
106 00:13:29.784 --> 00:13:40.270 Miki Tebeka: And the other thing is that there is a technology called jit, which is just in time compilation, which means that you run the code. Once in python.
107 00:13:40.430 --> 00:13:45.080 Miki Tebeka: you see what happens there and then you generate specific machine code.
108 00:13:45.230 --> 00:13:52.810 Miki Tebeka: And next time you call the function, it is actually not the Python function that's called, but the
109 00:13:53.160 --> 00:14:03.310 Miki Tebeka: optimized generated machine code for that. And this is something that Node.js and other dynamic languages are using, including Java,
110 00:14:03.590 --> 00:14:08.849 Miki Tebeka: to make things faster. And PyPy has a
111 00:14:09.040 --> 00:14:12.810 Miki Tebeka: very good JIT compiler that has been developed for many years.
112 00:14:13.449 --> 00:14:21.089 Miki Tebeka: And that's why it's faster. Basically, PyPy is written in Python, but eventually generates
113 00:14:21.520 --> 00:14:28.779 Miki Tebeka: an executable in machine code. So it is pretty fast in this case.
114 00:14:32.600 --> 00:14:34.444 Miki Tebeka: Okay, so
115 00:14:35.730 --> 00:14:43.390 Miki Tebeka: someone joked that, you know, this is every time they go on a Zoom Meeting. That's what comes to their mind right? This is very similar to
116 00:14:43.630 --> 00:14:55.150 Miki Tebeka: to Zoom, and the idea is, and this is known as the birthday problem. Given a group of people.
117 00:14:55.500 --> 00:15:00.259 Miki Tebeka: what are the odds that 2 people have the same birthday?
118 00:15:00.800 --> 00:15:01.740 Miki Tebeka: And
119 00:15:08.330 --> 00:15:10.020 Miki Tebeka: if a
120 00:15:11.310 --> 00:15:19.429 Miki Tebeka: what I picked as a group size is 23 people. So I would like you to just take a guess:
121 00:15:19.730 --> 00:15:24.659 Miki Tebeka: like we have a group of 23 people. What are the odds that 2 people have the same birthday?
122 00:15:24.930 --> 00:15:35.990 Miki Tebeka: Yeah. So a random birthday. So I'm going to basically say that I'm not looking at dates, I'm looking at day of year. So we have 365 days per year. So
123 00:15:36.480 --> 00:15:41.359 Miki Tebeka: a random date is basically a number between 1 and 365. That's
124 00:15:41.590 --> 00:15:50.350 Miki Tebeka: how many days we have in a year. And now what I'm saying is: given a group of a given size, are there any duplicates in the group?
125 00:15:50.770 --> 00:16:00.790 Miki Tebeka: Alright? So basically I'm creating a set and then going over the numbers, generating a random birthday. And then, if we've already seen this birthday,
126 00:16:01.290 --> 00:16:06.209 Miki Tebeka: we say there is a duplication, so at least 2 people have the same birthday in that group.
127 00:16:06.350 --> 00:16:16.929 Miki Tebeka: otherwise we add it to the set and continue. And finally, if we're out of the for loop, we say no, there are no duplicates in this group. So basically, we draw a group of
128 00:16:17.140 --> 00:16:18.390 Miki Tebeka: random numbers.
129 00:16:19.026 --> 00:16:24.570 Miki Tebeka: between 1 and 365, and say: is there an overlap here somewhere?
130 00:16:26.440 --> 00:16:38.710 Miki Tebeka: Alright? And now we start the simulation again. So the simulation is going to run a million times. The group size is 23. And again, the number of duplications is
131 00:16:38.900 --> 00:16:48.529 Miki Tebeka: 0 to begin with. And then we run the simulation. And if there is a duplication, we say, Okay, let's increment duplication.
132 00:16:48.640 --> 00:16:54.299 Miki Tebeka: Finally, we are printing what the fraction is from the total.
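The simulation just described can be sketched as follows. This is my own reconstruction of what's on the slides, not the speaker's exact code; the names and the smaller default iteration count are mine (the talk runs it a million times):

```python
import random

DAYS = 365  # treat a birthday as a day-of-year number, 1..365

def has_duplicates(group_size):
    """Draw `group_size` random birthdays; True if any two collide."""
    seen = set()
    for _ in range(group_size):
        day = random.randint(1, DAYS)
        if day in seen:
            return True
        seen.add(day)
    return False

def simulate(n=100_000, group_size=23):
    """Fraction of simulated groups containing a shared birthday."""
    duplicates = sum(has_duplicates(group_size) for _ in range(n))
    return duplicates / n

print(simulate())  # ≈ 0.5 for 23 people
```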
133 00:16:55.750 --> 00:16:58.950 Miki Tebeka: Okay, so remember the number you guessed.
134 00:17:09.520 --> 00:17:12.040 Miki Tebeka: Anyone guessed 50%?
135 00:17:15.579 --> 00:17:18.080 Miki Tebeka: Right? This seems pretty high, right?
136 00:17:18.319 --> 00:17:24.920 Miki Tebeka: And this is something that happens with statistics and probabilities a lot of the time. This is not intuitive.
137 00:17:25.069 --> 00:17:33.850 Miki Tebeka: A lot of times we think that we know the answer, and we say, you know, there are a lot of days, only 23 people, how come? But
138 00:17:34.070 --> 00:17:41.780 Miki Tebeka: if you do it, and you do the statistical computation, you'll get exactly the same thing. That's the idea. But
139 00:17:42.430 --> 00:17:48.739 Miki Tebeka: again, this is a more complicated
140 00:17:49.250 --> 00:17:53.420 Miki Tebeka: computation. And for me, as a developer, this is pretty easy:
141 00:17:53.760 --> 00:17:55.630 Miki Tebeka: you know, a for loop and random.
142 00:17:56.360 --> 00:17:57.329 Miki Tebeka: I'm done with it.
143 00:18:01.430 --> 00:18:10.880 Gabor Szabo: Yeah, it's nice. What would be interesting, I think, is to see if you take 2 people, 3 people and so on. Up to 365 people.
144 00:18:11.340 --> 00:18:15.830 Gabor Szabo: and for each number to see the probability, and then graph it.
145 00:18:16.185 --> 00:18:27.929 Miki Tebeka: Pretty sure there is. If you go on the web, this is a really well-known problem; people have done it already. You can probably see the graph for that.
146 00:18:29.194 --> 00:18:37.129 Miki Tebeka: But this is not exactly true, right? This is a joke, right?
147 00:18:37.280 --> 00:18:46.709 Miki Tebeka: And the chances of a piece of bread falling butter side down are directly proportional to the cost of the carpet. Like, it's not 50-50.
148 00:18:47.040 --> 00:18:49.360 Miki Tebeka: This is a corollary to Mr. Murphy.
149 00:18:49.580 --> 00:18:50.490 Miki Tebeka: Oh.
150 00:18:50.730 --> 00:19:02.460 Miki Tebeka: And the thing is that births do not have a uniform distribution over the days of the year, especially if you're born on February 29th,
151 00:19:03.080 --> 00:19:09.370 Miki Tebeka: right? This is reducing the odds that you're going to have someone with the same birthday.
152 00:19:12.510 --> 00:19:13.389 Miki Tebeka: So
153 00:19:17.110 --> 00:19:32.149 Miki Tebeka: We have a model, and there's a saying by George Box, which I really believe, that all models are wrong, but some are useful. Right? So the model needs to be interesting enough,
154 00:19:34.740 --> 00:19:35.405 Miki Tebeka: or
155 00:19:38.780 --> 00:19:45.018 Miki Tebeka: the answer should be interesting enough, to give you a useful answer. But you can have a better model. So I actually took some data
156 00:19:45.590 --> 00:19:48.589 Miki Tebeka: about birthdays in general.
157 00:19:49.030 --> 00:19:53.730 Miki Tebeka: Right? So now, this is
158 00:19:55.770 --> 00:20:06.170 Miki Tebeka: US births. Right? So the CSV file gives you the birthdays. Oh, this is Windows newlines. Yay,
159 00:20:07.250 --> 00:20:08.130 Miki Tebeka: so
160 00:20:09.380 --> 00:20:14.699 Miki Tebeka: We have a year, month, day of birth, day of the week, and how many births there were
161 00:20:15.070 --> 00:20:17.320 Miki Tebeka: per every one of them.
162 00:20:17.941 --> 00:20:24.589 Miki Tebeka: And then what I'm going to do is actually use weighted probabilities,
163 00:20:24.910 --> 00:20:28.047 Miki Tebeka: meaning it's not that every day has the
164 00:20:29.080 --> 00:20:41.680 Miki Tebeka: same probability, but I'm going to use these frequencies. And here I'm going to switch over to tools from the scientific Python side of things.
165 00:20:42.410 --> 00:20:45.390 Miki Tebeka: And these tools are pandas and NumPy.
166 00:20:46.170 --> 00:21:04.579 Miki Tebeka: And I'm using pandas. If you're not familiar with pandas (this may be a topic for a different talk, or we've probably already done it before), this is a really, really great library for working with data. I'm using it to load the CSV.
167 00:21:06.360 --> 00:21:15.900 Miki Tebeka: So basically, I'm loading the birthdays from this CSV file, and I'm
168 00:21:18.510 --> 00:21:20.829 Miki Tebeka: converting things to datetime,
169 00:21:20.980 --> 00:21:25.829 Miki Tebeka: and then I'm saying that the birthday is the day and the month,
170 00:21:27.370 --> 00:21:36.752 Miki Tebeka: and then I'm doing what is known as a group by to get all of the people that were born on the same day in the month
171 00:21:38.260 --> 00:21:45.690 Miki Tebeka: divided by the total number of births, and then return the index and the values.
172 00:21:46.010 --> 00:21:52.929 Miki Tebeka: So once I have that, I can do something else. I'm going to use NumPy now.
173 00:21:53.320 --> 00:21:59.770 Miki Tebeka: NumPy has a random choice. Basically, choice says: pick things from a group.
174 00:22:00.274 --> 00:22:10.019 Miki Tebeka: And if you don't say anything, it's going to give an equal probability to everything. But you can provide the size and the probabilities,
175 00:22:10.230 --> 00:22:18.226 Miki Tebeka: and then it is going to do a weighted probability, meaning there's a bigger chance: if more people are born on,
176 00:22:19.980 --> 00:22:24.335 Miki Tebeka: what is that, September 9th, let's say, then
177 00:22:24.950 --> 00:22:30.960 Miki Tebeka: there is a bigger chance it's going to pick September 9th versus February 29th,
178 00:22:31.430 --> 00:22:36.951 Miki Tebeka: let's say, or another day, April 26th. For some reason, I don't know why,
179 00:22:37.570 --> 00:22:42.380 Miki Tebeka: people don't like that birthday, I think. Or July 4th.
180 00:22:43.140 --> 00:22:46.209 Miki Tebeka: I don't know why they have less.
181 00:22:46.480 --> 00:22:53.181 Miki Tebeka: Okay, so I'm loading the birthdays from the CSV file. I'm doing
182 00:22:56.150 --> 00:23:04.390 Miki Tebeka: and again, now 100,000 simulations this time. The group size is 23, and
183 00:23:04.550 --> 00:23:10.139 Miki Tebeka: duplicates is 0. And again, I'm doing the same for loop. And again the same thing, the time
184 00:23:10.290 --> 00:23:11.440 Miki Tebeka: I'm doing here.
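The weighted version he describes can be sketched roughly as follows. This is my own reconstruction: the column names (`year`, `month`, `date_of_month`, `births`) are a guess at the file shown in the talk, so adjust them to your data:

```python
import numpy as np
import pandas as pd

def load_birthdays(csv_file):
    """Return (birthdays, probabilities) from a births CSV.

    Assumes columns named year, month, date_of_month, births --
    a guess at the file shown in the talk.
    """
    df = pd.read_csv(csv_file)
    # birthday = (month, day), ignoring the year
    births = df.groupby(["month", "date_of_month"])["births"].sum()
    probs = births / births.sum()  # frequencies -> probabilities
    return probs.index.to_numpy(), probs.to_numpy()

def has_duplicates(birthdays, probs, group_size):
    """Weighted draw of birthdays; True if any two collide."""
    picks = np.random.choice(len(birthdays), size=group_size, p=probs)
    return len(set(picks)) < group_size
```

The key difference from the uniform version is the `p=` argument to `np.random.choice`, which makes common birthdays more likely to be drawn.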
185 00:23:17.470 --> 00:23:22.520 Miki Tebeka: And if I'm running this one, we get the same number.
186 00:23:23.640 --> 00:23:33.690 Miki Tebeka: But this is rounded. Pretty sure if I'm going to show more digits after the decimal, you're going to see the difference. But the idea is that
187 00:23:34.725 --> 00:23:38.680 Miki Tebeka: we made our model more accurate,
188 00:23:38.790 --> 00:23:42.699 Miki Tebeka: but even the inaccurate model was good enough,
189 00:23:42.850 --> 00:23:48.714 Miki Tebeka: right? It was good enough, and that's what's meant by saying that all models are wrong, but some are useful.
190 00:23:49.760 --> 00:23:59.990 Miki Tebeka: You don't have to have the exact distribution or exact information about your data to gain correct insights from the data. And a lot of the time
191 00:24:00.420 --> 00:24:02.260 Miki Tebeka: you can do approximations.
192 00:24:02.910 --> 00:24:09.580 Miki Tebeka: And statistically, it's still good enough. Questions?
193 00:24:16.730 --> 00:24:17.690 Miki Tebeka: No question.
194 00:24:18.270 --> 00:24:25.577 Miki Tebeka: Okay, so this is about
195 00:24:26.270 --> 00:24:32.039 Miki Tebeka: a question that they actually gave to doctors. They said that there is a test for a disease
196 00:24:32.370 --> 00:24:42.200 Miki Tebeka: that has 5% false positives, meaning that for 5% of the people the test will tell you that you're sick, even though you're not sick.
197 00:24:42.530 --> 00:24:44.750 Miki Tebeka: This is what is known as a false positive.
198 00:24:45.320 --> 00:24:51.000 Miki Tebeka: And it says that it's known that the disease strikes about one person in a thousand in the population.
199 00:24:52.330 --> 00:24:55.103 Miki Tebeka: Okay, they said, okay, now we're taking
200 00:24:55.970 --> 00:25:08.399 Miki Tebeka: a random test. We take a random person from the street, we do the test, and the test says this person is sick. What is the actual probability that this patient is really sick?
201 00:25:09.550 --> 00:25:11.930 Miki Tebeka: Okay? And think about that.
202 00:25:12.388 --> 00:25:18.140 Miki Tebeka: With Covid, for example, right? They swab you for Covid and it says you have Covid. Now
203 00:25:18.430 --> 00:25:38.570 Miki Tebeka: you're home, you're not allowed to go out. Sometimes, you know, for other diseases, the doctor says: okay, you're sick, now you need a treatment, maybe a violent treatment, maybe something which costs a lot of money. So they asked these doctors to see if they're actually basing their decisions on something which makes sense or not.
204 00:25:38.690 --> 00:25:43.869 Miki Tebeka: Right. So talking about true positives, right? So
205 00:25:44.450 --> 00:25:50.620 Miki Tebeka: If I predicted that the person is sick and they're actually sick, this is what's known as a true positive.
206 00:25:51.430 --> 00:25:57.499 Miki Tebeka: We talked about false positives, which is a person who is said to be sick but is actually healthy.
207 00:25:59.040 --> 00:26:03.640 Miki Tebeka: And we also have a false negative, which is a
208 00:26:03.860 --> 00:26:06.710 Miki Tebeka: person who is sick, and we said that they're healthy.
209 00:26:06.920 --> 00:26:15.630 Miki Tebeka: And we have a true negative, which is a healthy person that the test says is healthy. Right? Remember: positive is sick, negative is healthy.
210 00:26:16.180 --> 00:26:19.649 Miki Tebeka: That's it. This thing is known as a confusion matrix.
211 00:26:19.910 --> 00:26:25.470 Miki Tebeka: And on the confusion matrix, you can do a lot of
212 00:26:25.910 --> 00:26:29.789 Miki Tebeka: calculations. When you measure your models.
213 00:26:30.060 --> 00:26:43.740 Miki Tebeka: especially prediction models, you start with the confusion matrix and then say: what is the percentage of true positives? There's precision, recall, and several other things that come to mind.
214 00:26:44.070 --> 00:26:51.570 Miki Tebeka: I think the name confusion matrix is also very good, because
215 00:26:51.760 --> 00:26:55.139 Miki Tebeka: I always get confused by that. I need to go back and think about
216 00:26:56.000 --> 00:26:58.449 Miki Tebeka: what every term is saying. But we do that.
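The metrics he names can be computed directly from the four confusion-matrix cells. A small sketch of my own, with hypothetical counts (a population of 10,000, roughly matching the talk's 1-in-1,000 disease rate and 5% false positives; the exact numbers are illustrative):

```python
def precision(tp, fp):
    """Of everyone flagged positive, how many really are positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everyone actually positive, how many did we flag?"""
    return tp / (tp + fn)

# Hypothetical counts: 10 sick people, all caught (no false negatives),
# plus roughly 5% of the 9,990 healthy people wrongly flagged.
tp, fp, fn, tn = 10, 490, 0, 9500

print(precision(tp, fp))  # 0.02
print(recall(tp, fn))     # 1.0
```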
217 00:26:58.600 --> 00:27:04.280 Miki Tebeka: So let's have a look at the simulation. Okay, so
218 00:27:07.510 --> 00:27:10.059 Miki Tebeka: I have a function here. So
219 00:27:10.560 --> 00:27:14.720 Miki Tebeka: I want to say that one in
220 00:27:15.310 --> 00:27:27.320 Miki Tebeka: a thousand. Right? They said one in a thousand is sick. So basically what they're saying is: I'm drawing a number between 1 and N, and checking if this number is 1. And I can pick any number between 1 and N. So 17,
221 00:27:27.780 --> 00:27:31.560 Miki Tebeka: 3, any number will work. I just need
222 00:27:31.730 --> 00:27:34.390 Miki Tebeka: that one in N will happen.
223 00:27:36.230 --> 00:27:43.460 Miki Tebeka: And now I'm going to say, like this: there is a person,
224 00:27:44.190 --> 00:27:56.340 Miki Tebeka: and if the person is sick, we are going to say that the person is sick. This is not specified in the question, but this is the assumption: that there are no false negatives.
225 00:27:56.680 --> 00:27:58.850 Miki Tebeka: There are only false positives.
226 00:27:59.020 --> 00:28:06.990 Miki Tebeka: So if we are doing that, then for the test:
227 00:28:07.220 --> 00:28:27.909 Miki Tebeka: we say we have a 5% false positive rate. So in 1 in 20 cases this test is also going to say true. So if the person is sick, the test is going to say for sure you're sick. If you're healthy, there's a 1 in 20 (5%) chance that the test will still say that you're sick, even though you're healthy.
228 00:28:30.070 --> 00:28:35.889 Miki Tebeka: So now, number of sick people and number of people who are diagnosed as sick are both starting at 0.
229 00:28:36.150 --> 00:28:40.569 Miki Tebeka: And now we are running a million simulations
230 00:28:40.720 --> 00:28:47.079 Miki Tebeka: For every one of them, we are picking a person at random. So the chances of the person being sick
231 00:28:47.240 --> 00:28:57.279 Miki Tebeka: are, like we said here, the disease strikes 1 of every 1,000 people in the population.
232 00:28:57.390 --> 00:29:01.800 Miki Tebeka: Right? So there's a 1 in a thousand chance that this person is sick.
233 00:29:02.450 --> 00:29:06.310 Miki Tebeka: and if the person is sick, then we increment the number of sick.
234 00:29:06.440 --> 00:29:10.619 Miki Tebeka: and then we do the diagnosis for the person, and if
235 00:29:11.990 --> 00:29:14.820 Miki Tebeka: we diagnose the person as sick, we increment the number of diagnosed.
236 00:29:14.970 --> 00:29:18.160 Miki Tebeka: But okay.
237 00:29:18.760 --> 00:29:26.069 Miki Tebeka: So now we have the number of people who are actually sick, and the number of people who were diagnosed as sick. And we are going to
238 00:29:27.110 --> 00:29:31.512 Miki Tebeka: print out this frequency that says,
239 00:29:33.000 --> 00:29:38.800 Miki Tebeka: what is the percentage of people who are actually sick out of the people who are diagnosed as sick?
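A sketch of the simulation just described (function names and the default iteration count are mine; the talk runs a million iterations):

```python
import random

def is_sick():
    """One person in a thousand is sick."""
    return random.randint(1, 1000) == 1

def diagnose(sick):
    """No false negatives; 5% (1 in 20) false positives on the healthy."""
    if sick:
        return True
    return random.randint(1, 20) == 1

def simulate(n=200_000):
    num_sick = num_diagnosed = 0
    for _ in range(n):
        sick = is_sick()
        if sick:
            num_sick += 1
        if diagnose(sick):
            num_diagnosed += 1
    # fraction of the people flagged as sick who really are sick
    return num_sick / num_diagnosed

print(simulate())  # ≈ 0.02
```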
240 00:29:41.160 --> 00:29:42.610 Miki Tebeka: Anyone care to guess
241 00:29:51.340 --> 00:29:54.650 Miki Tebeka: 2%. See, you are good.
242 00:29:57.160 --> 00:30:02.210 Miki Tebeka: Okay? So a lot of people are saying, you know, this is
243 00:30:02.340 --> 00:30:04.820 Miki Tebeka: what, right? The test is
244 00:30:05.020 --> 00:30:15.159 Miki Tebeka: only 5% false positive. How come? So 95% of the time it's okay. But still, only 2% of the people diagnosed are actually sick.
245 00:30:15.530 --> 00:30:17.620 Miki Tebeka: So and and think about that.
246 00:30:19.520 --> 00:30:24.550 Miki Tebeka: yeah, 0.1% divided by 5%, that's 2%.
247 00:30:25.490 --> 00:30:26.240 Miki Tebeka: So
248 00:30:29.630 --> 00:30:50.819 Miki Tebeka: if you come to think about that, that has a lot of implications. This is, again, this intuition that we have that is usually wrong when it comes to these things. And the third thing is that they gave this test to a lot of doctors. Most of them got it wrong. And it means that they're actually basing treatments and other things on something which is
249 00:30:51.220 --> 00:30:52.330 Miki Tebeka: not correct.
250 00:30:53.535 --> 00:30:57.430 Miki Tebeka: So maybe they should run a simulation
251 00:30:57.600 --> 00:31:00.239 Miki Tebeka: and get some understanding of what's going on.
252 00:31:01.920 --> 00:31:02.710 Miki Tebeka: Alright.
253 00:31:03.700 --> 00:31:11.480 Miki Tebeka: The last one is known as the Monty Hall problem. And we are
254 00:31:11.970 --> 00:31:14.529 Miki Tebeka: okay. Well, we have a lot of time.
255 00:31:14.860 --> 00:31:24.710 Miki Tebeka: I'll start speaking. So the Monty Hall problem says: you're in a game show, and you have 3 doors,
256 00:31:24.990 --> 00:31:30.829 Miki Tebeka: and the host says, you know, behind 2 doors there are goats,
257 00:31:31.030 --> 00:31:36.710 Miki Tebeka: and behind the 3rd door there is a car that you can win.
258 00:31:37.980 --> 00:31:43.140 Miki Tebeka: And he says: pick a door, 1, 2, or 3. And you pick a door; let's say I picked 1.
259 00:31:43.270 --> 00:31:50.730 Miki Tebeka: And now the host goes to another door, let's say this time door number 2, opens the door, and shows you a goat.
260 00:31:50.990 --> 00:31:53.920 Miki Tebeka: And now, he says, do you want to keep
261 00:31:54.530 --> 00:31:58.800 Miki Tebeka: your original door, or do you want to switch to the second one?
262 00:32:02.050 --> 00:32:14.090 Miki Tebeka: Right? So you have a strategy now: I picked door number 1 and I'm going to stay with door number 1, or, after they show me the door with the goat, I want to change my answer and actually go on to pick door number 3.
263 00:32:14.660 --> 00:32:19.359 Miki Tebeka: So what is the strategy? What is the good strategy in this case?
264 00:32:19.660 --> 00:32:24.539 Miki Tebeka: Too late, is it? On, on one or 2?
265 00:32:25.203 --> 00:32:35.650 Miki Tebeka: So again, we are going to simulate, right? So a random door is a random number now,
266 00:32:36.020 --> 00:32:45.589 Miki Tebeka: and here what we're doing is we say: does staying with the door win the game? Right? So we
267 00:32:46.133 --> 00:32:49.850 Miki Tebeka: pick one door, which is the door the car is
268 00:32:51.282 --> 00:32:57.270 Miki Tebeka: behind, and then we pick another door, which is the door that the player picked.
269 00:32:57.890 --> 00:33:03.619 Miki Tebeka: Now, if it is the same door, it means that the player who says "I'm going to stay"
270 00:33:03.720 --> 00:33:06.080 Miki Tebeka: is going to win the game.
271 00:33:07.580 --> 00:33:14.870 Miki Tebeka: Okay? So I'm just saying: if the car door is equal to the player door, then the stay strategy wins.
272 00:33:15.240 --> 00:33:17.310 Miki Tebeka: And now I'm going to
273 00:33:18.660 --> 00:33:26.040 Miki Tebeka: to do a million simulations. I'm going to say that this is the number of wins that the stay strategy has,
274 00:33:26.160 --> 00:33:31.619 Miki Tebeka: and this is the number of wins that the switch strategy had.
275 00:33:31.880 --> 00:33:38.980 Miki Tebeka: And I'm going to run the simulation, and then, if stay wins the game, I'm incrementing the stays. Otherwise
276 00:33:39.620 --> 00:33:43.610 Miki Tebeka: I'm going to increment the switch wins, and then I
277 00:33:44.710 --> 00:33:53.400 Miki Tebeka: divide them by N. And I'm printing out: what is the fraction of times we won by staying, and what is the fraction of
278 00:33:53.980 --> 00:33:58.640 Miki Tebeka: wins we got by switching.
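A sketch of the Monty Hall simulation he walks through (names are my own). As he says, staying wins exactly when the first pick is the car door, so the host's goat reveal doesn't need to be modelled explicitly:

```python
import random

def stay_wins():
    """One round: does the 'stay' strategy win the car?"""
    car_door = random.randint(1, 3)     # door hiding the car
    player_door = random.randint(1, 3)  # the player's first pick
    return car_door == player_door

def simulate(n=100_000):
    stays = sum(stay_wins() for _ in range(n))
    switches = n - stays  # switching wins whenever staying loses
    return stays / n, switches / n

print(simulate())  # ≈ (0.333, 0.667)
```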
279 00:34:04.620 --> 00:34:10.360 Miki Tebeka: any guesses which strategy is better.
280 00:34:20.120 --> 00:34:25.560 Miki Tebeka: So if you switch doors, you have twice the chance of winning
281 00:34:25.980 --> 00:34:30.609 Miki Tebeka: compared to staying.
282 00:34:30.900 --> 00:34:35.659 Miki Tebeka: And this is really counterintuitive. Because, why?
283 00:34:36.440 --> 00:34:45.299 Miki Tebeka: Right? I picked a door at random. There could be a car behind it. The fact that someone showed me another door that I didn't pick has a goat behind it shouldn't change that.
284 00:34:45.790 --> 00:34:47.720 Miki Tebeka: But it actually does.
285 00:34:47.920 --> 00:34:56.590 Miki Tebeka: And there's a lot of debate. If you Google the Monty Hall problem, with statistics, there's a lot of debate about
286 00:34:56.710 --> 00:34:59.999 Miki Tebeka: what it means, and
287 00:35:00.260 --> 00:35:10.629 Miki Tebeka: whether these calculations are okay or not. For me, I have a strategy now: if I see a goat, I pick the other door. And that's it.
288 00:35:14.370 --> 00:35:18.649 Miki Tebeka: Okay? So you can read more on Wikipedia on the Monty Hall problem.
289 00:35:19.582 --> 00:35:25.739 Miki Tebeka: So that's basically it; these are the 4 cases. And I hope I convinced you that
290 00:35:26.660 --> 00:35:30.299 Miki Tebeka: when you have these questions, don't shy away because you don't know the math,
291 00:35:30.470 --> 00:35:40.719 Miki Tebeka: because you don't know how to figure out the statistics. And a lot of the time it will also help you, because a lot of the time our intuition, when it comes to statistics and probabilities, is usually wrong.
292 00:35:41.280 --> 00:35:50.400 Miki Tebeka: We, as people, have good intuition about small numbers; when it comes to large numbers, we are really
293 00:35:51.569 --> 00:35:53.129 Miki Tebeka: very bad at that.
294 00:35:53.827 --> 00:36:04.760 Miki Tebeka: There's one statistician, Nassim Taleb, who says that every time he works with statistics he needs to turn off the part of the brain that says "I know what I'm doing", and just trust the numbers.
295 00:36:05.050 --> 00:36:17.420 Miki Tebeka: Right? So if you want to learn more, there is a great talk by Jake VanderPlas, an astrophysicist who's heavily involved in the scientific Python community,
296 00:36:17.850 --> 00:36:35.519 Miki Tebeka: and that's where I started. He shows some other simulations and how to do statistics. You can read more about Monte Carlo simulation on Wikipedia. By the way, I think even in Google Sheets and Excel you can run
297 00:36:36.570 --> 00:36:42.519 Miki Tebeka: Monte Carlo simulations, which is pretty awesome. And in Excel,
298 00:36:46.570 --> 00:36:50.369 Miki Tebeka: now we have Python in Excel, right? So this is
299 00:36:50.760 --> 00:36:54.949 Miki Tebeka: great. And there's a library called SimPy,
300 00:36:55.100 --> 00:37:10.809 Miki Tebeka: if you want to do what is known as discrete-event simulation. In SimPy, basically for every tick you tell your process to do something, and then you can simulate cars going, people crossing the road, cell phone towers, and a lot of other things.
301 00:37:14.760 --> 00:37:23.070 Miki Tebeka: Zia, I'm not sure what your name is, but he said the simulation will not help you if the problem is not well specified, and that is true.
302 00:37:23.360 --> 00:37:30.109 Miki Tebeka: Right. So you need a good definition of the problem before you start. If you have a vague definition of the problem, then
303 00:37:30.650 --> 00:37:31.560 Miki Tebeka: now
304 00:37:40.240 --> 00:37:48.350 Miki Tebeka: you can think about it any way you want. I'm not sure why you're saying that the Monty Hall problem is not fully defined, but we can talk about it later.
305 00:37:50.230 --> 00:37:52.989 Miki Tebeka: Because this is just about picking a strategy to win.
306 00:37:53.100 --> 00:37:55.940 Miki Tebeka: And I think the switch one is a winning strategy.
307 00:37:58.290 --> 00:38:03.219 Miki Tebeka: That's it. All of this code is in my GitHub
308 00:38:03.490 --> 00:38:11.880 Miki Tebeka: talks repo, so you can look at the code there, all the things that I have there.
309 00:38:13.050 --> 00:38:20.920 Miki Tebeka: I wrote a book on Python with quizzes, if you want to buy it. And if you have questions, that's a good time to ask them now.
310 00:38:28.840 --> 00:38:30.660 Miki Tebeka: no question. Gabor.
311 00:38:30.970 --> 00:38:36.500 Gabor Szabo: Well, I don't have any question either now. I already asked the ones I had.
312 00:38:36.620 --> 00:38:40.469 Gabor Szabo: I just want to thank you, and thank everyone who joined us.
313 00:38:41.320 --> 00:38:42.590 Gabor Szabo: And
314 00:38:44.080 --> 00:38:56.819 Gabor Szabo: if you are watching the video and you reached this point, then please remember to like the video and follow the channel. And under the video you will find the link
315 00:38:56.930 --> 00:39:05.280 Gabor Szabo: to this GitHub repo, and you will also be able to find Miki. I guess you also,
316 00:39:05.730 --> 00:39:06.530 Miki Tebeka: Yeah, yeah, sure.
317 00:39:06.530 --> 00:39:07.200 Gabor Szabo: Share your link.
318 00:39:07.200 --> 00:39:09.780 Miki Tebeka: If you have any questions I will answer.
319 00:39:11.540 --> 00:39:15.230 Gabor Szabo: Okay, so thank you very much.
320 00:39:15.510 --> 00:39:16.870 Miki Tebeka: Thanks, Gabor, for organizing this.
321 00:39:17.710 --> 00:39:21.050 Gabor Szabo: You're welcome, and I hope to see you in other presentations.
322 00:39:21.050 --> 00:39:22.110 Miki Tebeka: Awesome. Thank you.
323 00:39:22.110 --> 00:39:22.770 Gabor Szabo: Bye, bye.
]]>This lecture shows real-world use cases, know-how, and troubleshooting methods for using asyncio in Python.

1 00:00:01.740 --> 00:00:30.889 Gabor Szabo: So hello, and welcome to the Codemavens Meetup Group and Codemavens Channel, if you're watching it on YouTube. My name is Gabor Szabo. I usually teach Python and Rust at companies, and also introduce test automation and CI and that area. And I also think that sharing knowledge is extremely important among high-tech
2 00:00:31.470 --> 00:00:56.010 Gabor Szabo: programmers and people working in the high-tech industry. So that's why I am organizing these meetings, these presentations. As you can see, it's being recorded; it's going to be on YouTube. Please like the video and follow the channel. Below the video you will find a link to information about Eyal and about the content of this video.
3 00:00:56.130 --> 00:01:03.549 Gabor Szabo: And I would like to welcome everyone who joined us in the live meeting, and especially Eyal, who is giving us the presentation.
4 00:01:03.820 --> 00:01:11.349 Gabor Szabo: So now it's yours. Please introduce yourself and share the screen as you feel fit and welcome.
5 00:01:12.590 --> 00:01:30.479 Eyal Balla: Thank you. So I'll share the screen. So this is a presentation that's like a take-off on a presentation I did, I think, 2 years ago, or maybe 3 years ago.
6 00:01:33.070 --> 00:01:42.467 Eyal Balla: A bit about me. So I've been developing for more than 15 years, and working in Python for like
7 00:01:43.120 --> 00:01:49.530 Eyal Balla: 5 to 10 years. And currently I lead the data team at Scenario.
8 00:01:50.390 --> 00:01:51.629 Eyal Balla: See? That's me.
9 00:01:52.994 --> 00:02:04.040 Eyal Balla: So you can find me in the links, and if you're interested we're also hiring. So you're welcome to try and join. And we're looking for people that
10 00:02:04.520 --> 00:02:08.080 Eyal Balla: work in Python, and that is their passion.
11 00:02:08.650 --> 00:02:13.000 Eyal Balla: So today, what I'm gonna do is I'm gonna go through
12 00:02:13.340 --> 00:02:29.884 Eyal Balla: a bit about what asyncio is, and try to give a real-world example from things that we do, and then we'll talk about some advanced topics regarding asyncio. So that's what we're gonna do.
13 00:02:31.060 --> 00:02:36.420 Eyal Balla: I think it's important that you guys feel free to step in and ask questions if you need
14 00:02:38.620 --> 00:02:45.520 Eyal Balla: because there's gonna be a bit of code and some topics, and maybe,
15 00:02:45.730 --> 00:02:52.010 Eyal Balla: hopefully, it's gonna be all clear. But if somebody has any question, then feel free to jump in.
16 00:02:52.550 --> 00:03:00.030 Eyal Balla: So first of all, what is asyncio? So asyncio is a style of concurrent programming in Python.
17 00:03:00.340 --> 00:03:04.660 Eyal Balla: So why do we need it? So you can think of
18 00:03:04.960 --> 00:03:08.569 Eyal Balla: wanting to do multiple things in Python at the same time.
19 00:03:08.770 --> 00:03:09.506 Eyal Balla: So
20 00:03:11.140 --> 00:03:19.540 Eyal Balla: A simple way to do it is using a fork, right? So you can run multiple Python processes at the same time.
21 00:03:20.010 --> 00:03:43.789 Eyal Balla: So the OS handles the concurrency, and you can actually use multiple cores on your machine. The problem is that you get duplicated memory, because each process has its own memory space, right? And in order to communicate between the different Python processes you need OS-level communication. So pipes and
22 00:03:43.950 --> 00:03:45.463 Eyal Balla: files, and
23 00:03:47.161 --> 00:03:59.079 Eyal Balla: sorry other ways of multi-process communication. So you'd say, Okay, maybe we can do it some some other way. So there's also multi-threading in python.
24 00:03:59.390 --> 00:04:02.710 Eyal Balla: So this is nice. So you can create new thread. And
25 00:04:03.242 --> 00:04:11.339 Eyal Balla: it looks like you can run multiple things at the same time. But then there's Gil. So Gil is the global interpreter lock.
26 00:04:11.878 --> 00:04:17.962 Eyal Balla: I know there's an effort in Python to try and remove it. But for now it's there
27 00:04:19.910 --> 00:04:25.750 Eyal Balla: And the GIL prevents multiple Python instructions from running in the same process at the same time.
28 00:04:25.860 --> 00:04:34.925 Eyal Balla: So OS-level concurrency for threading only happens when you do things like
29 00:04:35.580 --> 00:04:41.672 Eyal Balla: accessing files or the network, or any time control passes from your
30 00:04:42.660 --> 00:04:46.390 Eyal Balla: Python commands into the OS.
31 00:04:47.221 --> 00:04:59.660 Eyal Balla: It's something you don't do explicitly. You can only do it implicitly, by accessing something or doing something that requires OS interaction.
32 00:05:00.720 --> 00:05:12.629 Eyal Balla: And there's asyncio. So what is asyncio? It's an I/O event manager. Okay? And it helps you manage state. So you can have multiple states of your system
33 00:05:12.750 --> 00:05:24.389 Eyal Balla: on the same thread. And you can actually explicitly manage the context switching. So you can say, I want to work on multiple items. These are the multiple items. And I want to work on them.
34 00:05:24.920 --> 00:05:33.670 Eyal Balla: So if we look, at a high level, at what the options are. Say we use multiprocessing, so we have multiple processes:
35 00:05:33.780 --> 00:05:47.120 Eyal Balla: you can have concurrency, and you can use all the CPUs. But you know you can't run many processes on the same machine, because each uses a full CPU. So maybe
36 00:05:47.740 --> 00:05:52.530 Eyal Balla: one to 10 CPUs and processes.
37 00:05:52.800 --> 00:06:01.410 Eyal Balla: And you can use, generally, the standard library blocking components and synchronization tools.
38 00:06:02.110 --> 00:06:16.011 Eyal Balla: And then, if you need something that's maybe a bit higher on the scalability, you can use threads. So you have a single process, and the GIL is protecting you from
39 00:06:16.910 --> 00:06:22.843 Eyal Balla: doing things between threads which touch memory
40 00:06:23.670 --> 00:06:34.039 Eyal Balla: in an intrusive way. But the problem is that you let the OS schedule your code, and you can't really
41 00:06:34.792 --> 00:06:36.720 Eyal Balla: control it manually.
42 00:06:37.070 --> 00:06:51.839 Eyal Balla: And then there's asyncio, where you can actually handle many thousands of scalable small components, called coroutines, and it's at the application level.
43 00:06:51.940 --> 00:06:58.970 Eyal Balla: So this is what we're gonna look at today. And we're gonna see how it works and how you can control it.
44 00:07:00.620 --> 00:07:03.188 Eyal Balla: So like every other
45 00:07:04.440 --> 00:07:09.849 Eyal Balla: program in the world, there's a hello world for asyncio. Right? So there's
46 00:07:12.010 --> 00:07:15.570 Eyal Balla: Let's see. Can you see my cursor? Then you can.
47 00:07:16.310 --> 00:07:23.630 Eyal Balla: So there's a regular hello world, and there's the one with asyncio and await. Okay, but
48 00:07:23.900 --> 00:07:29.589 Eyal Balla: this program is not really helpful, right? Because it doesn't show anything that's important for us.
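The two hello worlds compared on the slide are not in the captions; here is a minimal sketch of what that comparison might look like (the async version just hops through the event loop once):

```python
import asyncio

def hello_sync():
    # The regular hello world: a plain synchronous function.
    return "hello world"

async def hello_async():
    # The asyncio version: "async def" makes this a coroutine, and
    # awaiting asyncio.sleep(0) yields control to the event loop once.
    await asyncio.sleep(0)
    return "hello world"

if __name__ == "__main__":
    print(hello_sync())
    print(asyncio.run(hello_async()))  # asyncio.run drives the coroutine
```

As the talk says, this doesn't show anything interesting yet; both print the same thing.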
49 00:07:30.210 --> 00:07:50.489 Eyal Balla: So what do we want to do with asyncio in the real world? We want to use it for handling multiple heavy I/O processes. So, like, maybe database accesses, or multiple web requests, or file sharing, or accessing many I/O components at the same time.
50 00:07:51.330 --> 00:08:10.769 Eyal Balla: And you can always use asyncio with multiple processes. Maybe in a cloud application you can have multiple pods, right? But also you can run it on multiple processes and have the ability to use multiple CPUs if needed.
51 00:08:13.430 --> 00:08:14.330 Eyal Balla: But
52 00:08:14.440 --> 00:08:28.290 Eyal Balla: the downside of asyncio is that it's almost a different programming language. It looks like Python, and the constructs are very much like Python, you just use a few more keywords. But
53 00:08:28.810 --> 00:08:36.129 Eyal Balla: it's very different in concept, because each coroutine, the functions that you call with await,
54 00:08:37.038 --> 00:08:45.999 Eyal Balla: has to be short enough to allow multiple contexts to run together. So you mustn't run
55 00:08:46.528 --> 00:08:56.480 Eyal Balla: long computations. You can't block the event queue. Okay? Just like in a UI application, you don't want the main loop to be blocked.
56 00:08:56.890 --> 00:09:10.859 Eyal Balla: And also you can't use general purpose OS blocking commands like creating connections with socket, or select, or sleep. So you have things that are asyncio-specific
57 00:09:13.062 --> 00:09:20.630 Eyal Balla: so you even have different libraries that you can use in asyncio. So
58 00:09:21.888 --> 00:09:33.809 Eyal Balla: if you usually use requests, I suggest you try HTTPX, it has better behavior; FastAPI over Django and Flask;
59 00:09:34.671 --> 00:09:37.929 Eyal Balla: asyncpg instead of psycopg, etcetera.
60 00:09:38.130 --> 00:09:40.760 Eyal Balla: Okay, so
61 00:09:42.130 --> 00:09:55.740 Eyal Balla: if we look at asyncio's main building blocks, what we have is the main asyncio.run command. So what it does: it receives, in a
62 00:09:57.490 --> 00:10:05.362 Eyal Balla: sync context, a coroutine, something that
63 00:10:05.900 --> 00:10:09.830 Eyal Balla: is run on the asyncio loop. It creates a loop
64 00:10:09.990 --> 00:10:16.710 Eyal Balla: and runs the coroutine in it. And usually this is how you do the entry point into an asyncio context.
65 00:10:17.300 --> 00:10:20.779 Eyal Balla: Okay? And then you have the coroutines. These look like:
66 00:10:21.330 --> 00:10:26.570 Eyal Balla: you define async def and a function, and this creates a coroutine.
67 00:10:26.770 --> 00:10:31.479 Eyal Balla: Okay, and coroutines you can run either using the main loop
68 00:10:31.600 --> 00:10:38.720 Eyal Balla: or by creating a task using asyncio.create_task.
69 00:10:39.560 --> 00:10:40.470 Eyal Balla: Okay?
70 00:10:40.740 --> 00:10:49.319 Eyal Balla: And also, when you want to wait for something to happen, and you want to release the context, you run await:
71 00:10:49.480 --> 00:10:59.880 Eyal Balla: you call await in the coroutine, and then the context itself waits until the async context is finished
72 00:11:00.260 --> 00:11:04.519 Eyal Balla: and then returns control to the main loop that called it.
73 00:11:04.620 --> 00:11:05.430 Eyal Balla: Okay?
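A small sketch of those building blocks together: asyncio.run as the entry point, async def defining coroutines, create_task scheduling one on the loop, and await releasing control (the names and delays here are illustrative, not from the slides):

```python
import asyncio

async def fetch(name, delay):
    # A coroutine: await releases control to the event loop
    # until the (simulated) I/O finishes.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # create_task schedules the coroutine on the running loop right away;
    # awaiting the task later collects its result.
    task = asyncio.create_task(fetch("query", 0.01))
    direct = await fetch("file", 0.01)   # awaited directly, runs alongside the task
    scheduled = await task
    return [direct, scheduled]

results = asyncio.run(main())  # the single entry point into the asyncio context
```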
74 00:11:06.350 --> 00:11:07.370 Eyal Balla: So
75 00:11:08.022 --> 00:11:21.649 Eyal Balla: I'm gonna show you guys an example of a small program. So what this does: this is a synchronous program. It reads from S3, and then queries some database. Right?
76 00:11:22.040 --> 00:11:23.960 Eyal Balla: So we have, like a
77 00:11:24.090 --> 00:11:30.433 Eyal Balla: 2 contexts. This is the 1st one, this is the second one, and they're not
78 00:11:31.700 --> 00:11:48.750 Eyal Balla: dependent on each other. You can see it just gets a file, and then just runs a query. And we want to try and do these together, because we want the context returned with the content itself, but we don't have any kind of connection between the 2 contexts.
79 00:11:49.460 --> 00:11:56.740 Eyal Balla: So what you can do is you can move to asyncio, define this as a coroutine using the
80 00:11:57.150 --> 00:11:58.716 Eyal Balla: aioboto3
81 00:12:00.194 --> 00:12:10.339 Eyal Balla: async library, and using asyncpg create a coroutine from the query of the database.
82 00:12:10.660 --> 00:12:21.090 Eyal Balla: and then you can use gather to run the 2 coroutines together. Okay, independently of each other. So
83 00:12:21.826 --> 00:12:30.723 Eyal Balla: the execution of the query, the return of the items, and the read of the body of the file is done
84 00:12:31.300 --> 00:12:40.689 Eyal Balla: asynchronously, and while waiting for the I/O, the context continues to the other part.
85 00:12:41.160 --> 00:12:42.000 Eyal Balla: Okay.
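The S3-plus-database code itself isn't in the captions; here is a sketch of the same shape, with the two I/O calls simulated by asyncio.sleep (real code would use async libraries such as aioboto3 and asyncpg, as mentioned in the talk):

```python
import asyncio
import time

async def get_file():
    # Stand-in for an async S3 read.
    await asyncio.sleep(0.1)
    return "file-body"

async def run_query():
    # Stand-in for an async database query.
    await asyncio.sleep(0.1)
    return ["row1", "row2"]

async def main():
    # gather runs both coroutines concurrently: while one awaits its
    # I/O, the event loop switches to the other.
    return await asyncio.gather(get_file(), run_query())

start = time.monotonic()
body, rows = asyncio.run(main())
elapsed = time.monotonic() - start
# Concurrent: roughly one sleep (~0.1s), not the ~0.2s of a sequential run.
```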
86 00:12:43.430 --> 00:12:46.690 Eyal Balla: Questions? Great.
87 00:12:47.450 --> 00:13:16.690 Eyal Balla: So, more options. asyncio also supports context managers. This is very convenient. For instance, look at the bottom part here: I had to connect and then close using finally. But I can also create a context manager from this, open the connection with the context manager, and when it exits, close the connection. So I don't have to use
88 00:13:16.710 --> 00:13:19.260 Eyal Balla: explicit exception handling.
89 00:13:19.850 --> 00:13:21.320 Eyal Balla: And also
90 00:13:22.012 --> 00:13:33.339 Eyal Balla: asyncio supports iterators, so you can use a generator-like way of controlling small parts of the code
91 00:13:33.848 --> 00:13:43.049 Eyal Balla: one after the other, using async iterables. So these are patterns well known in Python, and you can also use them in an async context.
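A toy sketch of both ideas together: an async context manager that replaces the connect/close-in-finally pattern, and an async generator iterated with async for (the Connection class here is purely illustrative):

```python
import asyncio

class Connection:
    # A toy async context manager: __aenter__/__aexit__ replace
    # explicit close-in-finally handling.
    async def __aenter__(self):
        self.open = True
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.open = False

    async def rows(self):
        # A toy async generator: each yield is a point where the
        # event loop can switch to another coroutine.
        for i in range(3):
            await asyncio.sleep(0)
            yield i

async def main():
    result = []
    async with Connection() as conn:    # closed automatically on exit
        async for row in conn.rows():   # async iteration, one item at a time
            result.append(row)
    return result, conn.open

rows, still_open = asyncio.run(main())
```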
92 00:13:45.200 --> 00:13:50.409 Eyal Balla: So now I want to show you guys, maybe a problem from our day to day work.
93 00:13:50.870 --> 00:13:56.281 Eyal Balla: I'll present the problem first. So what we want to do is we want to do some
94 00:13:58.582 --> 00:14:21.089 Eyal Balla: integration which reads data from an external source and then enriches it. Okay, it gets information from maybe a database, adds to the context from the external source, and then writes the results into our database as entities. Okay? And I think the main issue here is that maybe
95 00:14:21.659 --> 00:14:28.550 Eyal Balla: we have multiple customers. Some are small, some are large, and customers have maybe
96 00:14:29.200 --> 00:14:44.130 Eyal Balla: tens of thousands of entities. So there's a lot of reading from the external source, and also maybe a lot of writing and reading in the database. So we have a lot of I/O, and this actually fits the asyncio concepts very well.
97 00:14:44.460 --> 00:14:48.700 Eyal Balla: Okay, so what do we want to do?
98 00:14:49.388 --> 00:15:01.639 Eyal Balla: We wanna call something once in a while, go over each of the customers, get the information, and then update our database with the enriched information. Okay? So
99 00:15:01.810 --> 00:15:06.059 Eyal Balla: even a naive implementation would be something like this.
100 00:15:07.620 --> 00:15:13.880 Eyal Balla: You define all the bootstrapping
101 00:15:14.549 --> 00:15:37.869 Eyal Balla: needed, and then you get the list of customers. And then, for each customer, you do the enrichment. So you get the settings, and per customer you get the information from the integration system, enrich it, and write it into your database. Right?
102 00:15:38.640 --> 00:15:40.020 Eyal Balla: So this is nice.
103 00:15:40.935 --> 00:15:41.740 Eyal Balla: But
104 00:15:42.558 --> 00:15:54.589 Eyal Balla: the problem when we look at this: this runs per customer. So that means that until one customer is done, the next customer doesn't start.
105 00:15:54.770 --> 00:15:59.140 Eyal Balla: Okay? So if we have small customers and large customers, then
106 00:16:00.190 --> 00:16:06.090 Eyal Balla: we have a problem that small customers are impacted by the size of large customers right?
107 00:16:06.910 --> 00:16:17.090 Eyal Balla: And also, once we have something that's bigger than the total cron interval, then
108 00:16:17.991 --> 00:16:30.509 Eyal Balla: the run doesn't finish within the time it's called at. So the system doesn't deliver the functionality according to the time constraints it's supposed to run within.
109 00:16:32.395 --> 00:16:36.110 Eyal Balla: So what can we do? I think the 1st thing we can do
110 00:16:36.310 --> 00:16:48.919 Eyal Balla: is to separate it per customer. So we can have some kind of injection of the customer id through a queue, and have the system run only per customer. So
111 00:16:49.160 --> 00:16:54.019 Eyal Balla: it reads the information from the queue, gets the customer id
112 00:16:54.380 --> 00:17:01.040 Eyal Balla: here, and then runs the same thing just for a specific customer.
113 00:17:01.160 --> 00:17:03.090 Eyal Balla: So how does this help? So
114 00:17:03.310 --> 00:17:10.749 Eyal Balla: what we can do now is we can scale out. So we can have multiple instances of this specific code run
115 00:17:12.778 --> 00:17:19.050 Eyal Balla: together, each on a different customer. And assuming that they're not dependent, then
116 00:17:19.699 --> 00:17:27.270 Eyal Balla: small customers are now not impacted by the size of large customers, and the time that you want
117 00:17:27.829 --> 00:17:29.460 Eyal Balla: to run this is,
118 00:17:30.045 --> 00:17:40.390 Eyal Balla: at most, the time of the biggest customer. Right? So you can scale out as much as you want, and the time that this whole process takes is the time of the biggest customer.
119 00:17:41.270 --> 00:17:42.110 Eyal Balla: Okay?
120 00:17:42.890 --> 00:17:49.890 Eyal Balla: So till now we did not touch anything that's asyncio, right? We just used simple
121 00:17:50.600 --> 00:17:56.219 Eyal Balla: design patterns that allow scaling out of loops.
122 00:17:57.300 --> 00:18:02.860 Eyal Balla: So now let's try and use asyncio to improve the performance of this whole loop.
123 00:18:03.030 --> 00:18:05.249 Eyal Balla: So what do we do?
124 00:18:05.887 --> 00:18:09.710 Eyal Balla: We create a coroutine and run it using asyncio.run.
125 00:18:11.583 --> 00:18:17.220 Eyal Balla: This coroutine is very similar to what we ran before.
126 00:18:19.340 --> 00:18:25.209 Eyal Balla: Except that now when we look at what happens inside the run for customer.
127 00:18:25.450 --> 00:18:27.550 Eyal Balla: this looks a bit different.
128 00:18:27.810 --> 00:18:31.698 Eyal Balla: So what do we do? First,
129 00:18:33.466 --> 00:18:44.300 Eyal Balla: we run through the pages. Okay? And when we want to enrich the items, we create batches of coroutines, and then we run them together.
130 00:18:44.420 --> 00:18:59.979 Eyal Balla: Okay, so here the coroutines are created according to the number of the integration items, and when you enrich and read the information, each batch of the
131 00:19:00.460 --> 00:19:04.559 Eyal Balla: coroutines runs together,
132 00:19:06.570 --> 00:19:11.189 Eyal Balla: so they happen together, and you wait only for the I/O for each of the items.
133 00:19:11.760 --> 00:19:12.580 Eyal Balla: Okay,
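The per-customer loop on the slide might be sketched like this: paging with an async iterator, and running each page's enrichment coroutines as one gather batch (all the names and data here are made up; the real integration and enrichment calls are I/O):

```python
import asyncio

async def enrich(item):
    # Stand-in for the per-item enrichment I/O call.
    await asyncio.sleep(0.01)
    return {"id": item, "enriched": True}

async def pages():
    # Stand-in for the paged read from the external integration source.
    for page in [[1, 2, 3], [4, 5]]:
        await asyncio.sleep(0)
        yield page

async def run_for_customer():
    results = []
    # For each page, build a batch of coroutines (one per item) and
    # run the whole batch concurrently with gather; the awaits overlap.
    async for page in pages():
        batch = [enrich(item) for item in page]
        results.extend(await asyncio.gather(*batch))
    return results

enriched = asyncio.run(run_for_customer())
```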
134 00:19:16.630 --> 00:19:20.870 Naty Harary: Yeah, I have a question. Should I just interrupt you mid-sentence?
135 00:19:20.870 --> 00:19:21.420 Eyal Balla: Yeah.
136 00:19:22.394 --> 00:19:29.490 Naty Harary: So, as far as I know, in asyncio it is enough to mark the function itself async;
137 00:19:29.700 --> 00:19:34.740 Naty Harary: you just await it. So I'm not really familiar with the syntax like here.
138 00:19:35.258 --> 00:19:40.680 Naty Harary: So why do we need to async this as well? I'm not really sure I understand.
139 00:19:42.430 --> 00:19:48.560 Eyal Balla: What we're doing here is we wait for each of the pages. So this is an async iterator, right?
140 00:19:48.940 --> 00:19:53.389 Eyal Balla: But this, this is the coroutine which returns an async iterator.
141 00:19:53.510 --> 00:19:56.173 Eyal Balla: And then,
142 00:19:57.270 --> 00:20:06.239 Eyal Balla: each of these pages contains items, so you want to enrich each of the items. So you create coroutines for each of the items to be enriched.
143 00:20:06.390 --> 00:20:16.100 Eyal Balla: And when you run them, you run them using gather. Because when you run await on something, okay, this makes the
144 00:20:16.570 --> 00:20:24.700 Eyal Balla: higher level function wait till it's done. Okay, this is a way to synchronize async contexts.
145 00:20:25.220 --> 00:20:29.610 Eyal Balla: Okay, so here you synchronize multiple async contexts using gather.
146 00:20:31.990 --> 00:20:41.489 Naty Harary: I see. So you just gather all the chunks that you have, and you create them with the async iterator, rather than just taking one big function and making that async, right? That's.
147 00:20:41.490 --> 00:21:00.950 Eyal Balla: Because you want to split your context into smaller processing units. Each of them may be I/O bound, so together the I/O runs in parallel on each of the items.
148 00:21:01.940 --> 00:21:03.429 Naty Harary: Got it. Thank you.
149 00:21:06.130 --> 00:21:07.540 Eyal Balla: Okay. So
150 00:21:08.060 --> 00:21:27.009 Eyal Balla: now, like I said before, enrichment happens in parallel. But still you can scale out, so you can have multiple services. And so the total performance here is not blocked, and also small customers are not impacted by the large customers.
151 00:21:30.120 --> 00:21:34.040 Eyal Balla: Okay. So some other things you should take into consideration.
152 00:21:34.300 --> 00:21:45.839 Eyal Balla: So I think the 1st thing is exception handling. So when you create an async context, you sometimes need to handle exceptions at the top level.
153 00:21:46.000 --> 00:22:04.660 Eyal Balla: So when you do that, you can register a manual exception handler: you get the main loop and set the exception handler, and you can handle the errors that are created from each of the tasks
154 00:22:06.276 --> 00:22:16.647 Eyal Balla: separately. Because if you don't do that, then the asyncio context would
155 00:22:18.610 --> 00:22:27.379 Eyal Balla: exit when one of the sub-coroutines throws an exception into the main context.
156 00:22:27.760 --> 00:22:28.620 Eyal Balla: Okay?
157 00:22:28.810 --> 00:22:34.353 Eyal Balla: So sometimes you want to wait, maybe for the last one, or
158 00:22:35.440 --> 00:22:41.499 Eyal Balla: perhaps some other behavior that is specific to your system. And you can do it this way
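A sketch of registering such a loop-level handler with loop.set_exception_handler; it fires for a task whose exception was never retrieved with await (the task and error names here are invented):

```python
import asyncio
import gc

errors = []

def on_error(loop, context):
    # Loop-level handler: receives errors from tasks whose exception
    # was never retrieved, instead of asyncio just logging them.
    errors.append(type(context["exception"]).__name__)

async def failing():
    raise RuntimeError("boom")

async def main():
    loop = asyncio.get_running_loop()
    loop.set_exception_handler(on_error)
    task = asyncio.create_task(failing())
    await asyncio.sleep(0)  # let the task run and fail
    await asyncio.sleep(0)
    task = None             # never awaited: dropping the last reference...
    gc.collect()            # ...triggers the handler for the lost exception

asyncio.run(main())
```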
159 00:22:42.607 --> 00:22:54.359 Eyal Balla: Specifically for gather, you have 2 ways to handle exceptions. You can do it inside each of the coroutines, like I did before, or you can
160 00:22:54.610 --> 00:23:16.869 Eyal Balla: ask gather to collect all the exceptions from each of the coroutines, and then you can handle the errors together. For instance, if you want to have some retry mechanism, then this is a good way to do it: you gather all the errors, and then you can retry all those that failed, or decide to do whatever you want with those that did not succeed.
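A sketch of that return_exceptions=True pattern: collect failures from a gather batch and retry only those (the "flaky" coroutine is a stand-in for transient I/O errors):

```python
import asyncio

attempts = {"flaky": 0}

async def fetch(name):
    # "flaky" fails on its first attempt and succeeds on the retry.
    if name == "flaky":
        attempts["flaky"] += 1
        if attempts["flaky"] == 1:
            raise ConnectionError("transient failure")
    await asyncio.sleep(0)
    return name

async def main():
    names = ["a", "flaky", "b"]
    # return_exceptions=True: gather returns exceptions as results
    # instead of propagating the first failure.
    results = await asyncio.gather(*(fetch(n) for n in names),
                                   return_exceptions=True)
    # Retry only the ones that failed.
    failed = [n for n, r in zip(names, results) if isinstance(r, Exception)]
    retried = await asyncio.gather(*(fetch(n) for n in failed))
    ok = [r for r in results if not isinstance(r, Exception)]
    return ok + retried

results = asyncio.run(main())
```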
161 00:23:19.144 --> 00:23:29.915 Eyal Balla: Regarding testing. So if you look at these coroutines, what I want to test here is, maybe,
162 00:23:32.830 --> 00:23:40.820 Eyal Balla: the functional response. Okay, something like a happy path, and maybe an exception to test the raise_for_status.
163 00:23:42.300 --> 00:23:50.960 Eyal Balla: So the important part is to mark your test as pytest.mark.asyncio. This allows you to run the test in an async context.
164 00:23:51.770 --> 00:23:57.170 Eyal Balla: There's httpx_mock for HTTPX, so you can use that.
165 00:23:57.300 --> 00:24:04.819 Eyal Balla: And then you can inject the response here, for instance, and test your happy flow.
166 00:24:05.150 --> 00:24:17.508 Eyal Balla: And also you can always use pytest.raises like you did before. And assuming you marked it as asyncio, you can test the async flow and the
167 00:24:18.130 --> 00:24:19.460 Eyal Balla: exception flow.
168 00:24:19.720 --> 00:24:20.520 Eyal Balla: Okay?
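The tests on the slide use pytest.mark.asyncio and the HTTPX mock; here is a dependency-free sketch of the same two checks, happy path and the raise_for_status exception flow, with a hand-rolled fake client standing in for the mock (all names are invented):

```python
import asyncio

class FakeResponse:
    # Stand-in for an HTTP response (real code would use httpx,
    # with pytest-httpx injecting the responses).
    def __init__(self, status, data):
        self.status = status
        self.data = data

    def raise_for_status(self):
        if self.status >= 400:
            raise RuntimeError(f"HTTP {self.status}")

async def fetch_items(client):
    # Coroutine under test: calls the client, checks the status,
    # returns the payload.
    response = await client.get("/items")
    response.raise_for_status()
    return response.data

async def happy_path():
    class Client:
        async def get(self, url):
            return FakeResponse(200, [1, 2, 3])
    return await fetch_items(Client())

async def error_path():
    class Client:
        async def get(self, url):
            return FakeResponse(500, None)
    try:
        await fetch_items(Client())
    except RuntimeError as exc:   # pytest.raises would catch this instead
        return str(exc)

happy = asyncio.run(happy_path())
error = asyncio.run(error_path())
```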
169 00:24:22.605 --> 00:24:23.380 Eyal Balla: Sorry.
170 00:24:24.240 --> 00:24:31.040 Eyal Balla: What you can also do: there's AsyncMock, like unittest's MagicMock, so
171 00:24:32.065 --> 00:24:40.340 Eyal Balla: you can mock coroutines. So here's an example of how you mock a coroutine and test it. So
172 00:24:40.510 --> 00:24:48.583 Eyal Balla: this is something that's nice to know, and I think it's very valuable when you're testing and mocking
173 00:24:50.250 --> 00:24:51.110 Eyal Balla: coroutines.
174 00:24:51.798 --> 00:25:02.409 Eyal Balla: I think that today the default patch returns either a MagicMock or an AsyncMock, according to the
175 00:25:02.938 --> 00:25:09.211 Eyal Balla: type of function that it gets. So if it's a coroutine, then it would
176 00:25:10.050 --> 00:25:17.320 Eyal Balla: create this as an AsyncMock, and if not, it'll be a MagicMock, according to what is needed.
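A sketch of AsyncMock from unittest.mock: the mocked attribute is awaitable, and it records calls like a regular mock (load_user and fetch_user are invented names, not from the slides):

```python
import asyncio
from unittest.mock import AsyncMock

async def load_user(db, user_id):
    # Coroutine under test: depends on an async database call.
    row = await db.fetch_user(user_id)
    return row["name"]

async def main():
    # AsyncMock makes fetch_user awaitable; return_value is what the
    # await produces. (mock.patch picks AsyncMock automatically when
    # the patched target is an async def function.)
    db = AsyncMock()
    db.fetch_user.return_value = {"name": "alice"}
    name = await load_user(db, 42)
    db.fetch_user.assert_awaited_once_with(42)  # the call was recorded
    return name

mocked_name = asyncio.run(main())
```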
177 00:25:18.980 --> 00:25:24.740 Eyal Balla: Something else that's very important for developers is the ability to debug.
178 00:25:25.240 --> 00:25:33.950 Eyal Balla: So asyncio gives a debug mode. When you run with the environment variable, you get,
179 00:25:34.473 --> 00:25:43.669 Eyal Balla: among other things, tracebacks on async functions when they're not awaited, so you can find out where this happens, and when.
180 00:25:44.410 --> 00:25:57.470 Eyal Balla: And also this monitors thread safety. So when something in your system behaves unsafely regarding the different coroutines and the memory they touch,
181 00:25:57.900 --> 00:26:08.020 Eyal Balla: you get errors in your logs. And also this helps debug slow coroutines, because
182 00:26:08.592 --> 00:26:10.857 Eyal Balla: asyncio is very
183 00:26:12.057 --> 00:26:24.449 Eyal Balla: sensitive to long coroutines blocking short coroutines. So this actually helps you understand the flow of your code better once you use asyncio.
184 00:26:26.425 --> 00:26:38.450 Eyal Balla: So this is how a slow log looks. If I do something very slow, you'd get a log saying this has taken too long. Okay? So you would know that
185 00:26:38.890 --> 00:26:40.749 Eyal Balla: you want to look at this function.
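A sketch of debug mode catching a blocking call; debug=True in asyncio.run is the programmatic equivalent of the PYTHONASYNCIODEBUG=1 environment variable mentioned in the talk, and the slow-callback warning goes to the "asyncio" logger:

```python
import asyncio
import logging
import time

records = []

class Capture(logging.Handler):
    # Collect asyncio's own log messages so we can inspect them.
    def emit(self, record):
        records.append(record.getMessage())

logging.getLogger("asyncio").addHandler(Capture())
logging.getLogger("asyncio").setLevel(logging.WARNING)

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.05  # warn on steps slower than 50 ms
    # time.sleep (unlike asyncio.sleep) blocks the whole event loop;
    # this is exactly what debug mode is meant to flag.
    time.sleep(0.1)
    await asyncio.sleep(0)

# debug=True enables asyncio's debug mode for this run.
asyncio.run(main(), debug=True)

slow_logs = [m for m in records if "took" in m]
```

The warning looks like "Executing &lt;Task ...&gt; took 0.100 seconds", pointing at the function to inspect.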
186 00:26:42.468 --> 00:26:46.551 Eyal Balla: Also, something you might want to consider is
187 00:26:47.599 --> 00:26:52.540 Eyal Balla: having something that's always running in your context, in your services.
188 00:26:53.396 --> 00:27:01.890 Eyal Balla: So aiodebug allows you to log slow callbacks inside your production pods.
189 00:27:02.010 --> 00:27:08.362 Eyal Balla: And with this you can enable specific
190 00:27:09.420 --> 00:27:12.209 Eyal Balla: callbacks when this happens. And this is
191 00:27:12.340 --> 00:27:14.906 Eyal Balla: really great, because it has
192 00:27:16.144 --> 00:27:20.340 Eyal Balla: almost no performance impact on the actual services.
193 00:27:20.490 --> 00:27:26.070 Eyal Balla: And it allows you to understand better how your code behaves in production.
194 00:27:27.820 --> 00:27:28.650 Eyal Balla: Okay?
195 00:27:29.270 --> 00:27:30.080 Eyal Balla: Great
196 00:27:31.981 --> 00:27:46.289 Eyal Balla: also, something you can do is you can monitor each of the different tasks. There's asyncio.all_tasks and
197 00:27:46.770 --> 00:27:50.107 Eyal Balla: asyncio.current_task. So you can run
198 00:27:51.380 --> 00:27:56.220 Eyal Balla: a coroutine once in a while to understand what is running,
199 00:27:56.340 --> 00:28:00.020 Eyal Balla: and get the stacks and understand the behavior.
200 00:28:00.690 --> 00:28:01.500 Eyal Balla: Okay.
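A sketch of that kind of monitor coroutine, using asyncio.all_tasks and asyncio.current_task to list what is on the loop (the worker tasks and their names are invented, just to make the output readable):

```python
import asyncio

async def worker(name, delay):
    await asyncio.sleep(delay)

async def monitor():
    # A coroutine you can schedule periodically to see what is running:
    # all_tasks() returns every unfinished task on the loop,
    # current_task() is the one executing right now.
    tasks = asyncio.all_tasks()
    me = asyncio.current_task()
    return sorted(t.get_name() for t in tasks if t is not me)

async def main():
    asyncio.create_task(worker("a", 0.05), name="worker-a")
    asyncio.create_task(worker("b", 0.05), name="worker-b")
    await asyncio.sleep(0)   # let the workers start
    return await monitor()

running = asyncio.run(main())
```

Each Task also has a get_stack() method, which is what lets such a monitor dump stacks and show where each coroutine is waiting.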
201 00:28:03.120 --> 00:28:11.940 Eyal Balla: So this is about it. I went over the asyncio concurrent programming framework.
202 00:28:12.702 --> 00:28:27.219 Eyal Balla: I think we saw a real world example and understood a bit how asyncio behaves, and why we'd want to use it. And also we looked at some debugging, testing, and exception handling tools.
203 00:28:27.810 --> 00:28:28.950 Eyal Balla: and that's it.
204 00:28:31.170 --> 00:28:32.000 Eyal Balla: Questions.
205 00:28:36.240 --> 00:28:36.940 lapid: Can I?
206 00:28:39.040 --> 00:28:40.060 lapid: Do you hear me?
207 00:28:40.820 --> 00:28:41.869 Gabor Szabo: Yes, yes, we can hear you.
208 00:28:41.870 --> 00:28:56.800 lapid: Oh, hi, yeah. So you touched a little bit on it. But when I'm doing something, like a project that develops into something a little bit bigger, I find myself, sometimes I just get lost.
209 00:28:57.150 --> 00:29:01.299 lapid: I can't verify myself that I actually
210 00:29:02.800 --> 00:29:06.060 lapid: control all the coroutines properly, because many times
211 00:29:06.430 --> 00:29:20.690 lapid: queues that feed one another, like I have some streaming, and then some queues and things like that. And so you touched a little bit on that, on how you monitor that. But can you expand a little bit, like how do you deal with that? Cause I just
212 00:29:21.140 --> 00:29:29.730 lapid: afterwards I go back, and I just print constantly, and I check the timing, and I waste a lot of time on that, and I feel like maybe someone more experienced has a better solution.
213 00:29:31.030 --> 00:29:34.775 Eyal Balla: So I think, when you,
214 00:29:35.460 --> 00:29:41.375 Eyal Balla: I think that the 1st thing is to build your software,
215 00:29:42.340 --> 00:29:44.319 Eyal Balla: even though it's async,
216 00:29:44.440 --> 00:29:59.850 Eyal Balla: with a top down architecture, understanding which parts are calling which other parts, and making sure that you synchronize correctly. Once you do that, things are easier, I think.
217 00:30:01.503 --> 00:30:12.799 Eyal Balla: So, like other considerations in software development, you need to have a solid design
218 00:30:13.250 --> 00:30:17.059 Eyal Balla: at the beginning, right? And then you can use
219 00:30:17.707 --> 00:30:21.539 Eyal Balla: something like the task monitor right here.
220 00:30:21.650 --> 00:30:26.649 Eyal Balla: So you can add this as something that you can call within your code.
221 00:30:27.310 --> 00:30:35.309 Eyal Balla: And this actually helps you understand the different coroutines that are running at the same time,
222 00:30:35.530 --> 00:30:57.230 Eyal Balla: and can help you, together with the slow coroutine logs, understand the impact of each of the different coroutines running. And I think that when you say you want to understand, you have some kind of a problem, right? You have, maybe, something that doesn't get the ability to run at all,
223 00:30:57.380 --> 00:31:07.317 Eyal Balla: and you don't know why. The reason for this is probably that something is blocking the main loop, right? It's too long. So you'd get
224 00:31:08.220 --> 00:31:17.930 Eyal Balla: messages on the slow callbacks. And then you would see this in the running tasks and understand the context of how it ran.
225 00:31:19.290 --> 00:31:21.089 Eyal Balla: So, does this make sense?
226 00:31:21.380 --> 00:31:46.119 lapid: Yeah, something in that area. It's more that, when the project gets big enough, you know, I have design patterns for code that I know and follow. That helps me, you know, every time I come back to code that I didn't touch for a while, I know, like, okay, this is what I do in order to actually add a new feature. But somehow, when I develop with asyncio,
227 00:31:46.430 --> 00:32:05.980 lapid: unless it's async from the start of the whole development, many times, if I want to change something in the future, I find myself having to go very deep into the code. Like, I don't mind this, or maybe I just don't know how to do a design pattern well, but from my experience, just the stuff that
228 00:32:06.150 --> 00:32:15.399 lapid: changes in behavior from something synchronous to asynchronous has forced me to change my code way deeper than I wanted.
229 00:32:15.650 --> 00:32:22.590 lapid: So this is what I'm actually curious about. This is the pain I experienced.
230 00:32:23.800 --> 00:32:28.240 lapid: did it? Didn't like next like, Are you? Are you?
231 00:32:28.430 --> 00:32:40.939 lapid: Something changes in the future, I want to add something. Let's say I'm scraping some information and retrieving it, and I want to do it in parallel.
232 00:32:41.100 --> 00:32:44.689 lapid: But I have an existing project that was not,
233 00:32:44.840 --> 00:32:54.990 lapid: so far, didn't assume anything has to be in parallel, cause I had a different data source that I used before, and it was way, way faster. So now.
234 00:32:54.990 --> 00:32:55.840 lapid: so.
235 00:32:56.690 --> 00:33:02.479 Eyal Balla: So I think what you would do is you would add, maybe, an asyncio context to
236 00:33:02.620 --> 00:33:06.920 Eyal Balla: part of the code, right? And then
237 00:33:07.310 --> 00:33:14.299 Eyal Balla: run it, maybe, with asyncio.run, and the rest would remain synchronous.
238 00:33:14.780 --> 00:33:20.479 Eyal Balla: So you can limit the extent of what you're touching.
239 00:33:21.090 --> 00:33:27.019 Eyal Balla: And also, as always, be sure to test the specific part
240 00:33:27.350 --> 00:33:32.620 Eyal Balla: as, like, a different library that you're calling,
241 00:33:33.050 --> 00:33:35.920 Eyal Balla: and treat it like one,
242 00:33:36.190 --> 00:33:44.869 Eyal Balla: like a different code component, a different part of the code, and put it somewhere that's self-contained,
243 00:33:46.018 --> 00:33:48.710 Eyal Balla: and maybe that can help.
244 00:33:49.990 --> 00:34:00.119 lapid: Yeah. So what you're describing is how I solved it. But actually, I was asking myself, if I had to go over the project again,
245 00:34:00.440 --> 00:34:06.849 lapid: saying, oh, maybe in the future I will have some asynchronous part,
246 00:34:07.720 --> 00:34:13.109 lapid: would I want to actually prepare my code for the possibility of something running async in the future?
247 00:34:13.980 --> 00:34:17.570 Eyal Balla: So I can tell you that we
248 00:34:18.586 --> 00:34:23.349 Eyal Balla: needed to move from synchronous code to async code in our company.
249 00:34:23.860 --> 00:34:25.130 Eyal Balla: And this is
250 00:34:25.620 --> 00:34:31.530 Eyal Balla: quite a big migration, because, as I described in the beginning of the talk,
251 00:34:31.690 --> 00:34:39.639 Eyal Balla: using asyncio is something very different from the design of a synchronous program.
252 00:34:40.050 --> 00:34:43.910 Eyal Balla: So I don't think I have
253 00:34:44.429 --> 00:34:56.759 Eyal Balla: anything that I can say, like, if you write a synchronous program and you want to prepare, do this and that. Because I think that you need to look at it in a very different way, writing async code versus sync code.
254 00:34:57.500 --> 00:35:01.580 lapid: Okay. So it sounds like you went through the same problems I had. So.
255 00:35:02.470 --> 00:35:02.940 Eyal Balla: Yeah.
256 00:35:02.940 --> 00:35:04.480 lapid: At least we suffer together.
257 00:35:06.480 --> 00:35:08.990 Eyal Balla: Suffering. Sharing is always good.
258 00:35:08.990 --> 00:35:11.800 lapid: Yeah, yeah, thank you.
259 00:35:12.560 --> 00:35:13.260 Eyal Balla: You're welcome.
260 00:35:15.480 --> 00:35:16.050 Eyal Balla: Anything else.
261 00:35:16.050 --> 00:35:21.090 Naty Harary: Yeah. Yeah, I have a question. I'm using a lot of 3rd-party
262 00:35:21.766 --> 00:35:39.670 Naty Harary: libraries. FastAPI, SQLAlchemy, things like that. And they sometimes hide the implementation of, I think, the I/O, and I always wondered, because I just believe it works well. Is there any way to query the event loop? So I know,
263 00:35:39.910 --> 00:35:43.440 Naty Harary: like, what's running right now.
264 00:35:43.810 --> 00:35:48.510 Naty Harary: Is it even possible? Is that something that python hides from us completely?
265 00:35:48.690 --> 00:35:56.280 Eyal Balla: So there is a way to query all the tasks that are running in asyncio.
266 00:35:56.680 --> 00:35:57.110 Naty Harary: All right.
267 00:35:58.330 --> 00:36:06.459 Eyal Balla: And also there's a library I did not talk about here, and it's called aiomonitor. So you can look into that, too.
268 00:36:07.030 --> 00:36:08.999 Eyal Balla: And it's very nice.
269 00:36:09.965 --> 00:36:11.860 Eyal Balla: So you can try that, too.
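The task introspection Eyal alludes to is `asyncio.all_tasks()`; `aiomonitor` is a separate third-party package and is not shown here. A minimal standard-library sketch (the worker coroutine and task names are invented for illustration):

```python
import asyncio

async def worker(name: str) -> None:
    # A placeholder coroutine standing in for hidden library I/O.
    await asyncio.sleep(0.05)

async def main() -> list:
    # Spawn a few named tasks, then ask the event loop what is running.
    tasks = [asyncio.create_task(worker(f"job-{i}"), name=f"job-{i}")
             for i in range(3)]
    running = sorted(t.get_name() for t in asyncio.all_tasks()
                     if t is not asyncio.current_task())
    await asyncio.gather(*tasks)
    return running

running_names = asyncio.run(main())
print(running_names)  # the three worker tasks, visible via asyncio.all_tasks()
```

`all_tasks()` returns every not-yet-finished task in the current loop, so it will also surface tasks created deep inside third-party libraries.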
270 00:36:12.550 --> 00:36:15.994 Naty Harary: Alright cool cause you talked about the timing, so I didn't know if
271 00:36:16.470 --> 00:36:19.789 Naty Harary: if there are other concerns, but aiomonitor, too. That's cool.
272 00:36:19.930 --> 00:36:20.890 Naty Harary: We'll check it.
273 00:36:22.530 --> 00:36:23.380 Naty Harary: Thank you.
274 00:36:27.600 --> 00:36:29.680 Eyal Balla: Okay, so I think we're done.
275 00:36:30.493 --> 00:36:46.420 Eyal Balla: Thank you guys for listening. And you can reach me this email or Linkedin. And also there's a Github project with this presentation and all the code samples together.
276 00:36:47.370 --> 00:36:51.650 Eyal Balla: I can. I think I sent it to Gabor last time, but I can send it again.
277 00:36:51.880 --> 00:36:53.749 Eyal Balla: and he can spread it out.
278 00:36:53.970 --> 00:36:58.322 Gabor Szabo: Yes, that would be a good idea. I think I'm going to include it in... there is this
279 00:36:58.740 --> 00:37:09.540 Gabor Szabo: web page about the presentation which will be linked from the video. And then on that page you'll see, I include these links as well.
280 00:37:09.810 --> 00:37:12.380 Gabor Szabo: Oh, so your your Linkedin, and your
281 00:37:12.870 --> 00:37:15.660 Gabor Szabo: and the link to your that Github page.
282 00:37:15.950 --> 00:37:16.730 Gabor Szabo: We'll get to.
283 00:37:16.730 --> 00:37:17.390 Eyal Balla: Okay. Great.
284 00:37:17.390 --> 00:37:18.530 Gabor Szabo: Repository.
285 00:37:18.820 --> 00:37:19.959 lapid: It's a nice, clear.
286 00:37:19.960 --> 00:37:22.739 Gabor Szabo: No more questions. Then. Thank you very much.
287 00:37:23.530 --> 00:37:27.019 Gabor Szabo: Yeah, thank you. Everyone for participating. And.
288 00:37:27.020 --> 00:37:28.569 Eyal Balla: I think there's 1 more question.
289 00:37:28.570 --> 00:37:31.890 lapid: Oh, I if I can ask chitchat questions.
290 00:37:32.120 --> 00:37:32.950 Gabor Szabo: You're good. Go ahead.
291 00:37:32.950 --> 00:37:37.289 lapid: So you said you have a company. Like, what does your company do, and
292 00:37:37.430 --> 00:37:39.700 lapid: can you elaborate a little bit more?
293 00:37:40.840 --> 00:37:43.977 Eyal Balla: Sure. So, what we do:
294 00:37:45.833 --> 00:37:55.216 Eyal Balla: we do security for healthcare. So we give hospitals tools to understand their security posture, and
295 00:37:56.927 --> 00:38:03.010 Eyal Balla: attack detection. So we detect malicious content and attacks on hospitals.
296 00:38:04.950 --> 00:38:06.262 Eyal Balla: And I think,
297 00:38:07.670 --> 00:38:16.929 Eyal Balla: because hospitals are very sensitive, we need to handle a very high scale. We do it with passive network inspection.
298 00:38:17.150 --> 00:38:19.275 Eyal Balla: And so we handle like,
299 00:38:20.770 --> 00:38:30.910 Eyal Balla: quite a lot of information in our cloud. So we need to use tools, and also using asyncio helps us
300 00:38:31.040 --> 00:38:34.540 Eyal Balla: scale out and handle things as we need.
301 00:38:35.900 --> 00:38:37.260 Eyal Balla: I hope that.
302 00:38:38.536 --> 00:38:39.730 lapid: Get answered.
303 00:38:40.010 --> 00:38:46.181 lapid: Yeah, no, I'm just curious. It's very far from my expertise. I'm a data scientist, and
304 00:38:46.960 --> 00:38:54.960 lapid: I came across async when some of my projects needed some boost.
305 00:38:55.770 --> 00:39:01.929 lapid: Are you also looking for data scientists? I'm not available now, but in about a month or two.
306 00:39:03.532 --> 00:39:14.997 Eyal Balla: So I think data scientist is not something that we're currently looking for. But you, you could actually look at the company career page. There are several positions, and
307 00:39:16.130 --> 00:39:19.450 Eyal Balla: we're expanding, and it's it's a good time, I think.
308 00:39:20.570 --> 00:39:22.509 Eyal Balla: Alright. Then brush.
309 00:39:24.170 --> 00:39:24.860 lapid: Alright. Thank you.
310 00:39:26.120 --> 00:39:32.120 Gabor Szabo: So thank you. Thank you very much. Thank you, Eyal, for the presentation, and for all the questions, people,
311 00:39:32.430 --> 00:39:38.840 Gabor Szabo: and everyone who was watching, please again, like the video, as I told you, and follow the Channel.
312 00:39:38.980 --> 00:39:46.519 Gabor Szabo: And if you would like to present at one of our meetings, then please get in touch with me.
313 00:39:46.790 --> 00:39:53.759 Gabor Szabo: I would be happy to to provide the the place for people to to share their their knowledge.
314 00:39:54.540 --> 00:39:55.679 Gabor Szabo: Thank you very much.
315 00:39:56.000 --> 00:39:56.850 Gabor Szabo: Goodbye.
316 00:39:56.850 --> 00:39:57.250 Eyal Balla: Bye-bye.
317 00:39:57.250 --> 00:39:57.590 Dmitry Morgovsky: Sure.
318 00:39:57.590 --> 00:39:58.919 lapid: Bye-bye. Thank you.
319 00:39:59.980 --> 00:40:01.209 Shalaka Deshan: Thank you, anyway.
Join us and time-travel across the evolution of Python monitoring mechanisms. We'll delve into history from dedicated tools like sys.monitoring to more advanced techniques such as ceval and import hooks. This session will provide a comprehensive overview of how monitoring practices have developed over the years, offering insights into the best practices for maintaining and debugging your Python code and the pros and cons of each approach. Whether you're a seasoned developer or new to Python, you'll gain valuable knowledge on how to keep your code running smoothly and efficiently without hurting performance or your dev velocity with tedious maintenance.

1 00:00:00.720 --> 00:00:02.690 Haki Benita: This meeting is being recorded.
2 00:00:03.400 --> 00:00:04.320 Gabor Szabo: Okay.
3 00:00:05.800 --> 00:00:12.250 Gabor Szabo: yeah. So hi, and welcome to the Python Maven, let's call it Python Maven. This is the Code Maven
4 00:00:12.500 --> 00:00:41.910 Gabor Szabo: Youtube channel. And we are organizing these meetings in the Codebay Events group, but sort of it has 3 separate sessions, and this is going to be the Python-specific one. My name is Gabor Szabo. I usually teach Python and Rust and help companies introduce testing, and I also like to organize these events and allow people to share their knowledge with each other.
5 00:00:42.270 --> 00:00:46.010 Gabor Szabo: You're welcome. I'm really happy that you're here
6 00:00:46.140 --> 00:01:04.909 Gabor Szabo: in this session, listening, as I mentioned earlier, you're welcome to to comment or use the chat and ask questions. And if you're just watching the video recorded on Youtube, then please remember to like the video and follow the channel.
7 00:01:05.080 --> 00:01:11.990 Gabor Szabo: and let's welcome Haki now, and let him introduce himself
8 00:01:12.700 --> 00:01:17.579 Gabor Szabo: and give the presentation. So thank you for accepting the invitation.
9 00:01:18.970 --> 00:01:31.149 Haki Benita: Thank you. Thank you, Gabor. First of all, I like the fact that we have this intimate group where we can talk freely. I actually encourage you to consider opening the mics.
10 00:01:31.210 --> 00:02:01.090 Haki Benita: Because I think we can actually have a conversation throughout the presentation. I like to give interactive presentations. Your call, you're the boss. And just a quick introduction about the subject and about myself. So we are going to talk about how to make your back end war. And I want to start by apologizing for the tacky headline. But unfortunately, these types of tacky headlines do work, believe it or not.
11 00:02:01.610 --> 00:02:09.010 Haki Benita: So. My name is Haki Benita. I'm a software developer and a technical lead. I'm currently leading a team
12 00:02:09.289 --> 00:02:18.949 Haki Benita: of developers working on a very large ticketing platform in Israel, serving about
13 00:02:19.580 --> 00:02:32.470 Haki Benita: 1.5 million unique paying users every month. And I also like to write and talk about Python performance and databases. And you can find my stuff on my website.
14 00:02:33.110 --> 00:02:47.839 Haki Benita: So today, we are going to talk about some lesser known features of indexes. And we're going to try and understand how they work and when we can and should use them
15 00:02:47.850 --> 00:03:14.629 Haki Benita: to do that, we are going to build a URL shortener together, and we're going to do it in Django. I would say that since this is a talk about Python, I'm going to use Django and the Django ORM. But the concepts that I'm going to describe are not specific to Django, and they're not specific to Postgres. Heck, they're not even specific to Python. But this is a good environment to explain the concepts with.
16 00:03:15.390 --> 00:03:19.889 Haki Benita: So what is a URL shortener? You probably know about
17 00:03:19.900 --> 00:03:39.330 Haki Benita: other types of URL shorteners. You have Bitly, you have the late goo.gl, buff.ly, and so on. Basically, a URL shortener is a system that provides a short URL that redirects to a longer URL. Now, why would you want to do that?
18 00:03:39.330 --> 00:04:02.240 Haki Benita: First, if you are operating in text-constrained environments like SMS messages or tweets, you might want to share a very large link. So you want to make it shorter, so it consumes less space. This is where short URLs can be handy. Another nice feature of URL shortening is that whenever someone clicks the short URL,
19 00:04:02.240 --> 00:04:16.500 Haki Benita: the URL shortener redirects to the long URL and keeps track of how many people click that link. So if you have something like a campaign that you want to launch, and you want to keep track of how many people clicked your link,
20 00:04:16.820 --> 00:04:20.149 Haki Benita: This is what you would use a URL shortener for
21 00:04:20.310 --> 00:04:48.240 Haki Benita: so to build our URL shortener in Django, we're going to start with this very, very simple model. We are calling the model short URL, we have an Id column which is the primary key. It's just an auto incrementing integer field. We have the key. That's a unique short piece of text that uniquely identifies our short URL. This is the short key at the end of the short URL.
22 00:04:48.500 --> 00:05:07.030 Haki Benita: We then have the URL, which is the long URL, we want to redirect to. We also want to keep track of when the URL was created. We do that using the created at column. And finally, we want to keep track of how many users click the link, and we do that with the hits column
23 00:05:07.180 --> 00:05:08.110 Haki Benita: at the bottom.
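The model described here might look like the sketch below. The Django field names are assumptions reconstructed from the talk, and a plain-SQL SQLite table stands in so the sketch is runnable without a Django project:

```python
import sqlite3

# Django sketch (assumed names; needs a Django project to actually run):
#
# class ShortURL(models.Model):
#     key = models.TextField(unique=True)           # short unique key
#     url = models.TextField()                      # long URL to redirect to
#     created_at = models.DateTimeField(auto_now_add=True)
#     hits = models.IntegerField(default=0)         # click counter
#
# The equivalent plain-SQL table, using SQLite for a runnable demo:
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE short_url (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- auto-incrementing PK
        key        TEXT NOT NULL UNIQUE,               -- short unique key
        url        TEXT NOT NULL,                      -- long URL to redirect to
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        hits       INTEGER NOT NULL DEFAULT 0          -- click counter
    )
""")
conn.execute("INSERT INTO short_url (key, url) VALUES (?, ?)",
             ("abc123", "https://example.com/long"))
row = conn.execute("SELECT url, hits FROM short_url WHERE key = ?",
                   ("abc123",)).fetchone()
print(row)
```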
24 00:05:08.960 --> 00:05:19.650 Haki Benita: So for our demonstration. So we actually have something to work with. I loaded 1 million short Urls into the table. Okay, now, this is not a lot. But we are going to see, some
25 00:05:20.700 --> 00:05:25.929 Haki Benita: performance gains with just 1 million rows. Okay.
26 00:05:26.810 --> 00:05:33.380 Haki Benita: so this talk is about python. But it's essentially about SQL, so
27 00:05:33.510 --> 00:05:54.859 Haki Benita: in Django, if you want to get the SQL generated by Django for a given queryset, you can do that by accessing queryset.query and printing it. In this case I'm doing ShortURL.objects.filter on a specific key, .query. And I can actually get Django to print
28 00:05:55.190 --> 00:05:59.549 Haki Benita: the SQL that it generated for this queryset, right?
29 00:06:00.040 --> 00:06:26.740 Haki Benita: So, after viewing the queryset, it's also very interesting to see how the database is planning to execute my query, right? I can do that by executing the function explain(). This translates into an EXPLAIN command in SQL, and what I get in return is not the results of the query but the execution plan, which is how the database is planning
30 00:06:26.930 --> 00:06:30.979 Haki Benita: to execute my query. Now, when we just use EXPLAIN,
31 00:06:31.200 --> 00:06:36.260 Haki Benita: the database doesn't actually execute the query. It just produces a plan
32 00:06:36.370 --> 00:06:53.839 Haki Benita: sometimes, especially when we're benchmarking and we're trying to improve performance. It can be useful to produce the execution plan, but also have the database, execute this query and return some useful execution data. For that we can use a slightly different variation of the explain command.
33 00:06:53.970 --> 00:07:13.319 Haki Benita: which is EXPLAIN ANALYZE. In Django, you can do that by using explain(analyze=True). In SQL, in Postgres specifically, you can do EXPLAIN (ANALYZE ON, TIMING ON) in parentheses, followed by the query, and then you get some additional information about the execution plan.
34 00:07:13.350 --> 00:07:27.339 Haki Benita: First, because the database actually executed the query, you can see at the bottom that we get how long it took the database to produce an execution plan; in this case that would be 0.140 ms,
35 00:07:27.710 --> 00:07:38.510 Haki Benita: and I also get how long it took the database to execute the query from start to end. In this case that would be 0.046 ms. Okay.
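As a runnable stand-in for this workflow: the talk uses Postgres and Django's queryset.explain(), while SQLite's closest equivalent is EXPLAIN QUERY PLAN, so this sketch uses that (table and key values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_url (key TEXT UNIQUE, url TEXT)")
conn.execute("INSERT INTO short_url VALUES ('abc', 'https://example.com')")

# Postgres: EXPLAIN (ANALYZE ON, TIMING ON) SELECT url FROM short_url WHERE key = 'abc';
# Django:   ShortURL.objects.filter(key='abc').explain(analyze=True)
# SQLite's closest equivalent is EXPLAIN QUERY PLAN:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url FROM short_url WHERE key = 'abc'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)  # the UNIQUE constraint's automatic index is used for the lookup
```

EXPLAIN QUERY PLAN, like plain EXPLAIN in Postgres, describes the plan without the actual-vs-estimated cost detail that EXPLAIN ANALYZE adds.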
36 00:07:39.430 --> 00:07:47.120 Haki Benita: Now, in addition to the timing. I'm also getting a very, very interesting piece of information inside the execution plan.
37 00:07:47.260 --> 00:07:53.699 Haki Benita: Okay, what I get is the estimated cost and the actual cost
38 00:07:53.820 --> 00:07:58.059 Haki Benita: that the database encountered while executing the query. So
39 00:07:59.010 --> 00:08:15.400 Haki Benita: discussing the cost-based optimizer is slightly outside the scope of this talk, I would just say that, comparing the expected cost to the actual cost is a very useful measure to try and identify bad execution plans.
40 00:08:16.100 --> 00:08:17.350 Haki Benita: Finally.
41 00:08:17.990 --> 00:08:28.419 Haki Benita: another way of viewing queries is to turn on the logger for the database backend in Django. This way, whenever Django executes a query,
42 00:08:29.040 --> 00:08:32.620 Haki Benita: it logs the SQL that was produced by the ORM.
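A settings.py sketch of the logger setup just described, assuming DEBUG=True (Django only records queries for the django.db.backends logger in debug mode):

```python
# settings.py fragment: log every SQL statement the ORM sends to the database.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        # The logger Django uses for database backend queries.
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}
```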
43 00:08:33.510 --> 00:08:34.475 Haki Benita: So
44 00:08:35.700 --> 00:09:05.329 Haki Benita: to actually start discussing some indexing techniques, we need to start implementing some, you know, business processes. So let's start with the most basic thing that a URL shortener actually does, and that's looking up the URL to redirect to by key. So a user uses one of our short URLs, we get the unique key, and we need to find the long URL to redirect to. Okay, this is like the bread and butter of this system.
45 00:09:05.440 --> 00:09:27.109 Haki Benita: So if we want to implement this very, very simple function, we can do something like that: def resolve, okay, that's the name of the function. We want to resolve a key to a URL. We accept a key, and then we execute this simple query to just get a short URL for this key. If we don't find anything we return None; otherwise we return the URL to redirect to.
46 00:09:27.110 --> 00:09:37.730 Haki Benita: Okay. Now we want to look at the SQL that Django generated for this function, right? So we execute this function on some random key
47 00:09:37.950 --> 00:09:57.950 Haki Benita: with SQL logging turned on, and we can see the query right here. Now, if you look at this query, it looks like Django basically fetches everything from the short URL table for the key that we asked for, right? SELECT * FROM short_url WHERE key = something.
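A sketch of the resolve function: the Django version is reconstructed from the talk as a comment (model and field names assumed), with a runnable SQLite equivalent below it:

```python
import sqlite3
from typing import Optional

# Django sketch from the talk (assumed names, needs a Django project):
#
# def resolve(key):
#     short_url = ShortURL.objects.filter(key=key).first()
#     return short_url.url if short_url else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_url (key TEXT UNIQUE, url TEXT)")
conn.execute("INSERT INTO short_url VALUES ('abc', 'https://example.com/long')")

def resolve(key: str) -> Optional[str]:
    # Look up the row by its unique key; return None when nothing matches.
    row = conn.execute("SELECT url FROM short_url WHERE key = ?",
                       (key,)).fetchone()
    return row[0] if row else None

print(resolve("abc"))
print(resolve("missing"))
```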
48 00:09:58.270 --> 00:10:05.050 Haki Benita: If we want to look at how postgres is actually executing this query.
49 00:10:05.210 --> 00:10:12.719 Haki Benita: we can use the explain command. And what we get is that Postgres is planning to use an index scan
50 00:10:13.535 --> 00:10:20.159 Haki Benita: on the index we have on the key column. Okay, now.
51 00:10:21.180 --> 00:10:28.839 Haki Benita: to understand what exactly an index scan means, let's take a second to talk about B-tree indexes.
52 00:10:29.040 --> 00:10:42.120 Haki Benita: So the B-tree index is like the king of all indexes. This is the default index in most database engines. If you're not sure what type of index you're using, you're probably using a B-tree index. Okay?
53 00:10:42.560 --> 00:11:11.160 Haki Benita: So to understand how a B-tree index works, let's start by building one. So imagine you have these values, one through 9, and you want to create a B-tree index on them. You start by sorting the values and storing them in leaf blocks. You can see the leaf blocks at the bottom. They are sorted from left to right. We have 1, 2, 3, all the way through 9. Now, every entry in the leaf blocks contains a list of TIDs. These are pointers to rows in the table
54 00:11:11.400 --> 00:11:15.460 Haki Benita: That store rows with these values. Okay.
55 00:11:16.290 --> 00:11:28.179 Haki Benita: now, above the leaves, we have branches and root block that acts as a directory to these leaf blocks. So let's see how this works. Let's imagine that we want to look.
56 00:11:28.180 --> 00:11:38.290 Gabor Szabo: Sorry, just someone says that they don't see the slides. So I just wanted to check, and I'm unsure if the other people do see the slides. So if
57 00:11:38.670 --> 00:11:53.529 Gabor Szabo: I asked in the chat, but no one answered. Okay, so some other people see it, so my recommendation to Eduardo is to maybe exit Zoom and enter Zoom again. Sorry for the.
58 00:11:53.530 --> 00:11:54.940 Haki Benita: Okay, no problem.
59 00:11:55.120 --> 00:11:56.160 Haki Benita: Yeah.
60 00:11:56.400 --> 00:11:59.700 Haki Benita: Okay, okay, so let's
61 00:12:01.690 --> 00:12:31.100 Haki Benita: okay. So let's try to search for the value 5 in the B-tree index that we just built. So we start with the root block and we start scanning from left to right. 5 is larger than 3, so we skip the first entry. 5 is between 3 and 7, so we follow this pointer to the middle leaf block. We then start scanning the leaf block from left to right. The first value is 4; it's not a match.
62 00:12:31.100 --> 00:12:36.150 Haki Benita: The next value is 5. That's a match, and now we can
63 00:12:36.150 --> 00:12:47.970 Haki Benita: scan. We can follow the pointers from this leaf block to the rows in the table. We can read the rows and do whatever we need to do with these rows. Okay, now.
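The root-to-leaf walkthrough above can be sketched as a toy two-level structure. This is a deliberate simplification: real B-trees have variable fan-out, rebalancing, and a list of TIDs per leaf entry, none of which are modeled here.

```python
# Leaf blocks hold the sorted values 1..9, as in the talk's example.
leaves = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Root block: separator keys. Values <= 3 go to leaf 0,
# values <= 7 to leaf 1, everything else to leaf 2.
root = [3, 7]

def btree_search(value):
    # Scan the root left to right to pick the child leaf block...
    for i, sep in enumerate(root):
        if value <= sep:
            leaf = leaves[i]
            break
    else:
        leaf = leaves[-1]
    # ...then scan that leaf left to right for a match.
    for entry in leaf:
        if entry == value:
            return leaf, entry  # in a real index: follow TIDs to table rows
    return leaf, None

leaf, found = btree_search(5)
print(leaf, found)  # searching for 5 lands in the middle leaf block
```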
64 00:12:48.310 --> 00:13:15.100 Haki Benita: let's go back to our query. Okay, one second, yeah. Let's go back to our query. Remember that we said that Django generated this query and this query is fetching everything right, basically select star from short URL. But, in fact, if you think about it, we don't actually care about all these fields right? We only care about the URL. I mean, we're not looking to resolve
65 00:13:16.290 --> 00:13:27.129 Haki Benita: a key to a URL for the purpose of redirecting. I don't care when it was created. I don't care about the Id. I already have the key right, and I don't care about the head counter at this point
66 00:13:27.610 --> 00:13:30.209 Haki Benita: right? So I don't care about all these fields. So
67 00:13:30.770 --> 00:13:55.089 Haki Benita: one thing that we can do is, instead of fetching all of these fields, how about we just fetch what we actually need, right? So in Django, we can do that by adding values_list('url'). Now the function is slightly different, but if we look at the SQL generated by this function, we can see that now, instead of fetching all the columns in the row, we just fetch the URL. So this is exactly what we need.
68 00:13:55.200 --> 00:14:10.249 Haki Benita: If we look at the execution plan once again for this query, we can see that again, Postgres is using an index scan on the unique index that we have on the key. Right? So now,
69 00:14:10.920 --> 00:14:30.719 Haki Benita: once we found a matching row, we can follow the pointer to the table. We can get the URL from the table. So if you imagine the amount of disk reads I need to do to satisfy this query: I'm starting by reading the root block, right? So that's one read. Then I need to follow the branch all the way to the leaf. Let's say that we have just
70 00:14:30.730 --> 00:14:41.789 Haki Benita: you know, root block, and then directly to the leaf. So reading the leaf is another read, and then we need to follow the link from the leaf block to read the row from the table. So this is a unique
71 00:14:41.970 --> 00:14:52.020 Haki Benita: column. So we have at most one row. So that's another read. So basically, we did 3 random reads to satisfy this query right now.
72 00:14:53.290 --> 00:15:03.019 Haki Benita: this query is executed a lot. This is basically what our system is doing right. It's getting keys and resolving them to Urls to redirect right
73 00:15:03.360 --> 00:15:17.979 Haki Benita: now. We already established that all we care about in this specific scenario is just the URL. I don't care about anything else. I care just about the URL. So what if? And stay with me? This is mind blowing.
74 00:15:17.980 --> 00:15:34.249 Haki Benita: What if, instead of going to the table to get the URL. What if I could include the URL in the leaf block in the index this way? When I found a matching entry in the leaf block, I would have the URL just sitting there.
75 00:15:34.310 --> 00:15:52.420 Haki Benita: Right? So this mind-blowing idea is called inclusive index. Okay, in other databases it's called covering index or inclusive indexes, and what it allows us to do, it allows us to store additional information in the leaf block.
76 00:15:52.500 --> 00:16:14.569 Haki Benita: So if we want to use an inclusive index in Django, we can add the include argument to the unique constraint. Now look, the key is indexed. The URL is not indexed. It's just included in the leaf block. Okay. Now, if we generate a migration, we apply it and we try the query again.
77 00:16:15.500 --> 00:16:21.569 Haki Benita: You can see that once again, Postgres is using our index, our unique index on the key. But there is
78 00:16:21.900 --> 00:16:33.889 Haki Benita: very, very subtle difference here. If you notice. Previously we had an index scan using our unique index. This time we have an index only scan.
79 00:16:34.020 --> 00:17:03.620 Haki Benita: This means that Postgres was able to satisfy the query without accessing the table. All the data that it needs was already in the leaf block. So if we once again imagine how many reads we need to do to satisfy this query, using the inclusive index, we read the root block. We follow the pointer all the way down to the leaf block, and now, instead of going to the table to read the URL. We have the URL right there in the leaf block. So we only need to read
80 00:17:03.670 --> 00:17:05.849 Haki Benita: 2 blocks from disk.
81 00:17:06.150 --> 00:17:17.110 Haki Benita: Okay, the way to identify. This is by the operator on the index only scan right? So we have an index scan, and we have an index. Only scan.
82 00:17:18.170 --> 00:17:39.170 Haki Benita: So quick recap about inclusive indexes, as I mentioned in other databases. They are sometimes called covering indexes, and they allow us to fulfill queries without accessing the table. However, you should use them with caution. Because if you think about it, we're basically duplicating data from the table to the index. Okay?
83 00:17:39.170 --> 00:17:49.959 Haki Benita: So if you have a very big piece of information, and a URL can be very, very big, so basically I'm now storing the URL
84 00:17:50.140 --> 00:18:09.440 Haki Benita: twice. So the index could get very, very big. I'm actually not a big fan of inclusive indexes. But I can think of 2 scenarios where it might be a good idea. First, st if you have very wide tables. Imagine, like data, warehouse type of tables, denormalized tables.
85 00:18:09.600 --> 00:18:11.520 Haki Benita: and you have a very
86 00:18:12.250 --> 00:18:22.290 Haki Benita: predefined set of queries that are executed very, very often on a very, very small subset of columns, you can consider doing using
87 00:18:23.440 --> 00:18:50.249 Haki Benita: an inclusive index. And also, I personally found that non unique composite indexes can be good candidates for inclusive indexes that is, indexes on multiple columns that are not used to enforce a unique constraint. Sometimes they can benefit from switching to just a composite index to an inclusive index. Okay, questions so far before we move on to the next use case.
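A runnable sketch of the index-only-scan effect described in this section. SQLite has no INCLUDE clause, so a composite index stands in for Postgres's inclusive index; the Postgres and Django forms appear only as comments and the exact Django syntax (UniqueConstraint's include argument) is an assumption about a Django 3.2+ feature:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_url (key TEXT, url TEXT)")
conn.execute("INSERT INTO short_url VALUES ('abc', 'https://example.com/long')")

# Postgres: CREATE UNIQUE INDEX ... ON short_url (key) INCLUDE (url);
# Django:   UniqueConstraint(fields=['key'], include=['url'], name='...')
# SQLite has no INCLUDE, so a composite index stands in for the demo:
conn.execute("CREATE INDEX short_url_key_url_ix ON short_url (key, url)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url FROM short_url WHERE key = 'abc'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)  # a "covering" scan: the URL is read from the index itself
```

Because every column the query touches lives in the index, the plan reports a covering (index-only) scan and the table is never read, which is exactly the two-reads-instead-of-three effect from the talk.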
88 00:18:55.710 --> 00:19:02.210 Haki Benita: Okay, if you have any questions, feel free, let's move on to the next to the next use case.
89 00:19:02.800 --> 00:19:04.080 Haki Benita: So now
90 00:19:04.230 --> 00:19:16.229 Haki Benita: we want to find unused keys, right? We have this business question: we want to know how many short URLs we have with no hits at all. Okay, we have 0 hits.
91 00:19:17.070 --> 00:19:23.050 Haki Benita: So we start by implementing this very, very simple function. We call it find_unused,
92 00:19:23.350 --> 00:19:26.190 Haki Benita: and it returns a query set where
93 00:19:26.790 --> 00:19:43.480 Haki Benita: with short URLs where hits equals 0. Once again, if we want to see what the query looks like, we can print the query. We can see that it's SELECT * FROM short_url WHERE hits = 0.
94 00:19:44.560 --> 00:19:58.929 Haki Benita: Once again, through the same process, we produce an execution plan. This time we can see that Postgres is doing a sequential scan on short_url. A sequential scan is basically a full table scan: Postgres is just
95 00:19:59.010 --> 00:20:18.369 Haki Benita: reading the table row by row, looking for rows where hits equals 0. We can see that the execution time at the bottom is 116 ms. Let's say, for the sake of discussion, that this is very, very slow, and we want to try and improve that.
96 00:20:18.450 --> 00:20:48.250 Haki Benita: So if you go to like 99% of developers and DBAs, they will tell you what's the problem, just slap a B-tree on it, right? So we add a B-tree index on the hits column. We do that in Django using db_index=True. We generate a migration, we apply the migration, we once again produce the execution plan with analyze, and lo and behold,
97 00:20:48.310 --> 00:20:56.180 Haki Benita: Postgres is using our index, short_url_hits_ix. And, as you can see, the execution time
98 00:20:56.810 --> 00:21:02.370 Haki Benita: is very, very fast compared to before, so we're done right.
99 00:21:03.230 --> 00:21:06.060 Haki Benita: We can call it the day we can go for lunch.
100 00:21:06.330 --> 00:21:08.609 Haki Benita: We're happy. It's fast. Now
101 00:21:09.310 --> 00:21:20.299 Haki Benita: stop, let's take a second to talk about performance and what it actually means. Okay? Because intuitively, when we talk about performance, we talk about
102 00:21:20.380 --> 00:21:37.639 Haki Benita: speed right? We want things to be very, very quick. But I think, or the way I view performance is that we need to balance different types of resources. And I want to illustrate this with an example. Okay, let's say that you have this batch processing job running at night.
103 00:21:37.640 --> 00:21:53.420 Haki Benita: Now, this batch processing job runs at the middle of the night, where you have very, very little users, and it runs very, very fast. It takes like this batch processing job like 10 seconds to complete. You're so happy, so fast. However.
104 00:21:53.720 --> 00:22:05.569 Haki Benita: however, this job consumes huge amounts of memory, huge amounts of CPU and huge amounts of disk space right. What if I told you that
105 00:22:06.440 --> 00:22:12.950 Haki Benita: if we are willing to compromise, and instead of completing in 10 seconds, it takes a minute
106 00:22:13.410 --> 00:22:38.970 Haki Benita: right? It consumes very little memory disk space and CPU, right? I'm guessing that if you pay a lot of money for memory, you are willing to make this compromise. Okay, I'll give you another example. Let's say that you have this background job running in the middle of the day. Right now, this background job consumes a lot of CPU so much CPU, in fact, that it starts to interfere with user traffic in the system.
107 00:22:39.030 --> 00:23:07.120 Haki Benita: In this case, instead of optimizing for time, you might be optimizing for CPU, right? You're willing to compromise a few seconds. But you don't want the background job to consume a lot of CPU. So when we talk about performance. We talk about more than just speed. We're talking about how we can balance different resources in the system, usually depending on some type of context time of day the type of resource that we have available at this time. Right?
108 00:23:07.670 --> 00:23:23.450 Haki Benita: So remember that we slapped a B-tree on it, right? And it was very, very fast, but I'm not sure that was the most optimal thing that we could have done. So let's go to the database and see
109 00:23:23.580 --> 00:23:33.769 Haki Benita: and check the size of the index we created to solve this teeny, tiny problem. Okay, so this index.
110 00:23:34.570 --> 00:23:41.979 Haki Benita: right, is 7 MB. Okay, so that's pretty big for this type of index.
111 00:23:42.120 --> 00:23:47.420 Haki Benita: So our 7 MB index includes
112 00:23:47.630 --> 00:23:57.789 Haki Benita: all the rows in the table, right? We just added db_index=True to create a B-tree index on the column, so it contains all the 1 million rows in the table. But
113 00:23:58.570 --> 00:24:05.790 Haki Benita: we actually don't care about all the rows in the table. Right? Nobody asked us how many
114 00:24:06.150 --> 00:24:25.690 Haki Benita: short Urls you have with less than 5 hits, or more than 266 hits, or exactly 1,000 hits. Nobody cares about that. We had a very specific question that we wanted to answer in regards to the hits. We wanted to find how many short Urls we have with exactly 0 hits.
115 00:24:26.100 --> 00:24:37.350 Haki Benita: So what if, instead of indexing the all the rows in the table, we could index just a portion of the rows, the part of the table that we actually care about.
116 00:24:37.810 --> 00:24:51.950 Haki Benita: Right? So this is a once again mind-blowing idea, and this is made possible with something called partial indexes. Partial indexes, allows us to index just a part of the table that we actually care about.
117 00:24:52.810 --> 00:25:08.019 Haki Benita: So going back to our Django model, right? First we start by removing db_index from the column definition (you should never use db_index, regardless of this), and then, instead of adding this default index on the column,
118 00:25:08.020 --> 00:25:28.989 Haki Benita: we add a proper index. Right? But we add a condition. Okay, so what this does, it creates an index on the Id column with a condition where hits equals 0. This would cause postgres to create an index just on the rows that satisfy this query. Just on rows
119 00:25:29.200 --> 00:25:54.569 Haki Benita: where hits equal 0. Right? So we generate the migration, we apply the migration, and we try the query again. We produce an execution plan, and we can see that Postgres is using our index. Right? We see an index scan using short_url_unused_part_ix. This is the index we just created. Okay, so Postgres is able to use the index we just created, the partial index,
120 00:25:55.000 --> 00:26:04.670 Haki Benita: to satisfy this very specific query. We can also see that the query is very, very fast, even compared to the full index. Right?
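The effect described here is easy to reproduce outside Postgres. Below is a small sketch using Python's bundled SQLite driver, which also supports partial indexes; the table name, index name, and row counts are invented for illustration, and SQLite's plan output is much terser than Postgres's EXPLAIN:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT, url TEXT, hits INT)")
con.executemany(
    "INSERT INTO short_url (key, url, hits) VALUES (?, ?, ?)",
    [(f"k{i}", f"https://example.com/{i}", i % 10) for i in range(1000)],
)

# Index only the rows the question is about: hits = 0 (100 of the 1000 rows).
con.execute("CREATE INDEX short_url_unused_part_ix ON short_url (id) WHERE hits = 0")

# SQLite's EXPLAIN QUERY PLAN is the (much simpler) cousin of Postgres's EXPLAIN:
# each output row's fourth column is a human-readable plan step.
plan = " ".join(
    row[3] for row in con.execute(
        "EXPLAIN QUERY PLAN SELECT count(*) FROM short_url WHERE hits = 0"
    )
)
```

Because the query's WHERE clause matches the index predicate exactly, the plan should report a scan over the small partial index rather than the full table.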
121 00:26:05.090 --> 00:26:13.180 Haki Benita: But that wasn't the motivation here, right? This is not what we look to optimize. If we go back
122 00:26:13.320 --> 00:26:28.990 Haki Benita: to the database, and we look at the size of this index. Look at that. The partial index is just 88 kB in size. Okay? Previously the full index was 7 MB. The partial index is 88 kB.
123 00:26:28.990 --> 00:26:48.659 Haki Benita: So I did the math. Seriously, I opened Excel. I did the math. That's 99% smaller. Okay, so that's a lot of space. Now, at this point you're probably saying, Come on, man, it's just 7 MB, who cares? But if you go back to your system, and you have huge tables with hundreds of millions or billions of rows, right?
124 00:26:48.840 --> 00:27:06.290 Haki Benita: Check the size of your B-tree indexes. They can become huge. I've seen situations where the B-tree index was larger than the table. Okay, and if you have a lot of indexes it can grow out of control very, very quickly.
125 00:27:07.020 --> 00:27:21.090 Haki Benita: So, as you may guess, I'm a very, very big fan of partial indexes. They produce smaller indexes, and I highly encourage you to use them whenever possible. One limitation of partial indexes is that
126 00:27:22.030 --> 00:27:26.349 Haki Benita: the database can only use partial indexes when
127 00:27:26.500 --> 00:27:52.249 Haki Benita: the query uses the exact same condition as the predicate in the index. Right? The database is not even smart enough to do something like WHERE hits = 1 - 1. Okay, it's limited to that level. So it's limited to queries that use the exact same condition. Usually it's fine, because, you know, why would you do hits equal one minus one?
128 00:27:52.380 --> 00:27:53.080 Haki Benita: I don't know.
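This limitation is easy to demonstrate with the stdlib SQLite driver, which matches partial-index predicates in a similarly literal way; the schema and index name here are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE short_url (id INTEGER PRIMARY KEY, hits INT)")
con.executemany("INSERT INTO short_url (hits) VALUES (?)", [(i % 5,) for i in range(500)])
con.execute("CREATE INDEX zero_hits_ix ON short_url (id) WHERE hits = 0")

def plan(sql):
    # Flatten EXPLAIN QUERY PLAN output into one string for easy inspection.
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

uses_index = plan("SELECT count(*) FROM short_url WHERE hits = 0")  # matches the predicate
no_index = plan("SELECT count(*) FROM short_url WHERE hits = 1")    # does not match it
```

The first plan can use the partial index; the second cannot, because `hits = 1` does not imply `hits = 0`, so the database falls back to scanning the table.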
129 00:27:53.520 --> 00:27:58.490 Haki Benita: I personally found that nullable columns are great candidates
130 00:27:58.780 --> 00:28:09.290 Haki Benita: for partial indexes, because in Postgres, for example, null values are indexed, and usually you don't want to use an index for IS NULL queries. So I found that
131 00:28:09.480 --> 00:28:34.749 Haki Benita: whenever I have a nullable column with an index on it, I can benefit from making it a partial index. In fact, I wrote an entire article on how we saved 20 GB of unused disk space simply by identifying nullable columns with indexes and switching them to use partial indexes. Okay, so questions about partial indexes before we move on to the next use case.
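The nullable-column pattern can be sketched the same way with the stdlib SQLite driver, whose partial indexes behave similarly here; the account table, column, and index names are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, referrer TEXT)")
# A mostly-NULL column: only every 50th row has a value.
con.executemany(
    "INSERT INTO account (referrer) VALUES (?)",
    [(f"ref{i}" if i % 50 == 0 else None,) for i in range(1000)],
)
# Index only the rows where the column is actually set.
con.execute(
    "CREATE INDEX account_referrer_part_ix ON account (referrer) "
    "WHERE referrer IS NOT NULL"
)

# An equality search implies referrer IS NOT NULL, so the partial index is usable.
plan = " ".join(r[3] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM account WHERE referrer = 'ref100'"
))
```

The index holds only the 20 non-NULL rows instead of all 1000, which is the disk-space win described above, while equality lookups still get to use it.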
132 00:28:36.780 --> 00:28:38.540 Haki Benita: Gabor, you have a question.
133 00:28:42.160 --> 00:28:42.730 Haki Benita: No.
134 00:28:42.730 --> 00:28:45.249 Gabor Szabo: Sorry, sorry, actually, there is this question.
135 00:28:46.340 --> 00:29:04.110 Haki Benita: Oh, is it a good idea to recalculate the hits and partial indexes? How frequently? Well, the nice thing about indexes, and B-trees in general, is that they are always in sync with the data in the table; it's actually part of the transaction. So when you, for example, increment,
136 00:29:05.180 --> 00:29:07.990 Haki Benita: when you increment the counter for the 1st time
137 00:29:08.290 --> 00:29:11.070 Haki Benita: the row would just disappear from the index.
138 00:29:11.250 --> 00:29:26.029 Haki Benita: Right? So I'm guessing that you're asking, because you have some experience with like materialized views and stuff like that. So you don't actually have to maintain it actively. It's just maintained by the database.
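That "maintained by the database" behavior can be seen directly with the stdlib SQLite driver; the three-row table and index name are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE short_url (id INTEGER PRIMARY KEY, hits INT)")
con.executemany("INSERT INTO short_url (hits) VALUES (?)", [(0,)] * 3)
con.execute("CREATE INDEX zero_hits_ix ON short_url (id) WHERE hits = 0")

def zero_hit_count():
    return con.execute("SELECT count(*) FROM short_url WHERE hits = 0").fetchone()[0]

before = zero_hit_count()  # all 3 rows start with zero hits
# The first "click": incrementing hits moves the row out of the partial index
# as part of the same statement; nothing needs to be refreshed by hand.
con.execute("UPDATE short_url SET hits = hits + 1 WHERE id = 1")
after = zero_hit_count()
```

Unlike a materialized view, there is no refresh step: the update itself removes the row from the partial index.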
139 00:29:26.460 --> 00:29:32.839 Haki Benita: It's truly an amazing feature. You should definitely use that. Any more questions before we move on to
140 00:29:33.140 --> 00:29:36.009 Haki Benita: a very exotic type of index in postgres.
141 00:29:36.750 --> 00:29:38.110 Haki Benita: Ow.
142 00:29:41.210 --> 00:29:46.360 Haki Benita: okay, great. So let's talk about another type. Another use case.
143 00:29:47.270 --> 00:30:00.790 Haki Benita: So, in the first use case, we wanted to resolve the key to a URL, right? This is the redirect action. This time we want to do a reverse lookup. We want to ask
144 00:30:01.000 --> 00:30:09.090 Haki Benita: how many keys we have pointing to this specific URL. So we want to search for keys by the URL.
145 00:30:09.530 --> 00:30:20.539 Haki Benita: So we implement this very simple function called reverse_lookup. It accepts a URL and returns a queryset of short URLs. Okay?
146 00:30:21.210 --> 00:30:49.150 Haki Benita: So if we want to see what the query looks like, we use .query, and we can see SELECT * FROM short_url WHERE url equals something. Okay, if we produce an execution plan, we can see that the database is doing a sequential scan on the short_url table; that is, scanning the entire table, sifting row by row, finding matches for our query.
147 00:30:49.430 --> 00:30:50.800 Haki Benita: Whoa!
148 00:30:51.590 --> 00:30:55.929 Haki Benita: And we can see that it's relatively
149 00:30:56.140 --> 00:31:00.379 Haki Benita: slow, right? It's like 105
150 00:31:00.500 --> 00:31:03.990 Haki Benita: milliseconds. So compared to the index
151 00:31:04.320 --> 00:31:08.840 Haki Benita: queries that we saw before, that's pretty slow. Right?
152 00:31:09.220 --> 00:31:23.659 Haki Benita: So, you know, once again, 99% of the people would just say, Come on, man, I'm hungry, let's order some food, just slap a B-tree on it. So this is what we do, right? We start by adding a B-tree on the URL,
153 00:31:23.860 --> 00:31:37.679 Haki Benita: right? We generate and apply the migration. Now we execute the exact same query again, and we can see that now Postgres is using the index that we just created. We can see an index scan using
154 00:31:38.030 --> 00:31:57.059 Haki Benita: the index on the URL column, and also it's very fast. Previously it was like 100 ms; now it's 0.1 ms. So that's a very, very big and significant improvement. We can all go to lunch and be very, very happy and satisfied with ourselves. But
155 00:31:57.770 --> 00:32:09.459 Haki Benita: are we done? Do you think that we are done? Is there anything that we can optimize? Now, if you are paying attention throughout this presentation. You know that we can definitely do better than that.
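The before-and-after from this use case can be reproduced with the stdlib SQLite driver; the names and data are invented, and the plan strings are SQLite's, not Postgres's:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT, url TEXT)")
con.executemany(
    "INSERT INTO short_url (key, url) VALUES (?, ?)",
    [(f"k{i}", f"https://example.com/page/{i}") for i in range(1000)],
)

def plan(sql):
    # Flatten EXPLAIN QUERY PLAN output into one string for easy inspection.
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM short_url WHERE url = 'https://example.com/page/7'"
before = plan(query)  # no index yet: a full sequential scan
con.execute("CREATE INDEX short_url_url_ix ON short_url (url)")
after = plan(query)   # the same query now searches the index
```

The first plan is a scan of the whole table; after creating the index, the same query turns into an index search.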
156 00:32:09.830 --> 00:32:16.550 Haki Benita: Let's go to the database and check the size of the index. Okay? So the size of the index.
157 00:32:16.740 --> 00:32:22.669 Haki Benita: Okay, stay with me. 47 MB. If you remember the previous
158 00:32:23.050 --> 00:32:28.779 Haki Benita: use case, we had an index on all the hits. It was 7 MB. I told you it was large.
159 00:32:28.950 --> 00:32:44.159 Haki Benita: This index, on the same amount of rows, is 47 MB. That's very, very big, and the reason that it's very, very big is that the URL is very, very big, right? The B-tree index
160 00:32:44.390 --> 00:32:49.879 Haki Benita: holds the actual values in the leaf blocks. So if we are indexing
161 00:32:50.020 --> 00:32:58.219 Haki Benita: a column with very large values, like URLs,
162 00:32:58.430 --> 00:33:03.490 Haki Benita: these values are also present in the index,
163 00:33:04.000 --> 00:33:14.130 Haki Benita: and the index can get very, very big. So previously, when we were indexing integers, it was 7 MB. Now we're indexing large pieces of text, URLs,
164 00:33:14.410 --> 00:33:18.940 Haki Benita: and that's 47 MB. Okay, so
165 00:33:19.430 --> 00:33:28.389 Haki Benita: let's pause for a second. Okay, I know that a B-tree is like magic for 90% of the use cases, but there are other types of indexes that we can use.
166 00:33:28.955 --> 00:33:32.949 Haki Benita: So let's pause for a second and ask ourselves, what do we know about.
167 00:33:33.210 --> 00:33:48.990 Haki Benita: what do we know about the URL? Okay? So 1st of all, we know that URL is not unique. Right? We can have multiple keys pointing to the same URL. We can have, for example, different campaigns with different short Urls
168 00:33:49.100 --> 00:33:55.800 Haki Benita: pointing to the same URL. There's no restriction in the system. You can have many keys pointing to the same URL. So it's not unique.
169 00:33:55.930 --> 00:33:57.940 Haki Benita: However, however.
170 00:33:59.780 --> 00:34:06.770 Haki Benita: if we actually look at the data, we see that we don't have a lot of duplicate long Urls right
171 00:34:06.970 --> 00:34:07.889 Haki Benita: like.
172 00:34:09.444 --> 00:34:18.389 Haki Benita: It's not likely that people will use the shortener a lot to point to the same URL; at the very least,
173 00:34:18.650 --> 00:34:22.639 Haki Benita: they would have different UTM parameters for the same URL.
174 00:34:22.780 --> 00:34:33.040 Haki Benita: So while it's not a restriction (you can have many keys pointing to the same URL), it's not likely, so we don't have a lot of duplicate values.
175 00:34:34.199 --> 00:34:36.219 Haki Benita: So now I want to introduce you
176 00:34:36.710 --> 00:35:00.369 Haki Benita: to what I call the Ugly Duckling of index types in Postgres: the hash index. Okay? And to understand how a hash index works and why it's different from a B-tree index, let's start by actually building a hash index ourselves. So imagine we have these values, A, B, C and D, and we want to index them using a hash index.
177 00:35:00.730 --> 00:35:20.800 Haki Benita: So we start by applying a hash function on each value. Postgres, in our example, has different hash functions for different types. So you can see that we have hashes for text, char, arrays, even JSON types, timestamps, and so on.
178 00:35:20.930 --> 00:35:34.680 Haki Benita: In our case we have just one character. So it uses hashchar. If we actually apply this function on the values we get the hash values. The next step is we want to divide these
179 00:35:34.870 --> 00:35:36.829 Haki Benita: values into buckets.
180 00:35:37.030 --> 00:35:43.100 Haki Benita: So we start by dividing them into 2 buckets. Basically, we apply modulo 2 on
181 00:35:44.050 --> 00:36:04.600 Haki Benita: the hash value, and then we assign each value to a bucket. So we can see that A goes to bucket 1 and B, C and D go to bucket 0. So this is our hash index. Okay, so we have 3 hash values in bucket 0; each hash value points to
182 00:36:04.860 --> 00:36:10.809 Haki Benita: somewhere in the table. Okay, just like we had the TIDs in the B-tree, we have
183 00:36:10.980 --> 00:36:32.230 Haki Benita: the TIDs right here in the hash index. Now, if we want to use this hash index to find some value, we do the exact same thing, but the other way around, right? So if you want to search for the value B, for example, we apply a hash function on it. We get the hash value. We apply modulo the number of buckets to get the
184 00:36:32.360 --> 00:36:54.430 Haki Benita: bucket, in this case 0, and then we go to bucket 0 and we start scanning the pointers to find a matching hash. Once we find a matching hash, we can take this TID, which is a pointer to a place in the table, and we can go to this row and check for matching rows. Okay, so this is how a hash index works in Postgres.
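A toy version of this mechanism in plain Python, with an illustrative hash function and a fixed two-bucket layout (Postgres's real hash functions, bucket management, and TIDs differ):

```python
import zlib

def h(value: str) -> int:
    # Stand-in for Postgres's per-type hash functions (hashchar, hashtext, ...).
    return zlib.crc32(value.encode())

N_BUCKETS = 2

# The "table": row position -> value; positions play the role of TIDs.
table = {0: "a", 1: "b", 2: "c", 3: "d"}

# Build the index: bucket number -> list of (hash value, TID) pairs.
buckets = {b: [] for b in range(N_BUCKETS)}
for tid, value in table.items():
    hv = h(value)
    buckets[hv % N_BUCKETS].append((hv, tid))

def lookup(value: str) -> list:
    # Hash the search value, take modulo to pick the bucket, scan the bucket
    # for a matching hash, then re-check the table row itself, since two
    # different values can share a hash.
    target = h(value)
    return [tid for hv, tid in buckets[target % N_BUCKETS]
            if hv == target and table[tid] == value]
```

`lookup("b")` walks exactly the path described above: hash, modulo, bucket scan, then a visit to the table row to confirm the match.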
185 00:36:55.190 --> 00:37:14.639 Haki Benita: Now, if we want to create a hash index in Django, we need to use the special HashIndex from django.contrib.postgres. Okay? The reason for that is that hash is not the default index type in Postgres, so we need to explicitly say we want a hash index. Okay,
186 00:37:15.260 --> 00:37:19.239 Haki Benita: so in this case we are creating a hash index on the URL field.
187 00:37:19.770 --> 00:37:46.360 Haki Benita: and the name of this index is going to be short_url_hix. I like to use a suffix that indicates the type of the index, so when I look at execution plans I can quickly identify the type of the index. So I usually use _ix for B-tree indexes, _part_ix for partial indexes, _hix for hash indexes, and so on. You can come up with whatever convention you want.
188 00:37:47.920 --> 00:37:48.900 Haki Benita: So
189 00:37:49.530 --> 00:38:00.809 Haki Benita: we generate the migration, we apply the migration and produce an execution plan. And we can see that Postgres is using our hash index. Okay? Now.
190 00:38:00.940 --> 00:38:01.990 Haki Benita: okay.
191 00:38:02.180 --> 00:38:18.460 Haki Benita: First observation: this is very, very fast. Okay, you can see that it's 0.07 ms. That's very, very fast. But that's not all. If we look at the size of our hash index compared
192 00:38:18.730 --> 00:38:34.859 Haki Benita: to the B-tree index, we can see that the hash index is 30% smaller. Okay, trust me, I took a calculator, an old Casio, and I calculated the difference. It's 30% smaller. Okay, that's very, very significant. Okay,
193 00:38:35.340 --> 00:38:37.929 Haki Benita: if we put all the data in a table.
194 00:38:38.180 --> 00:38:46.570 Haki Benita: you can see that the hash index, in this case, was both faster and smaller.
195 00:38:46.860 --> 00:38:47.990 Haki Benita: So that's
196 00:38:48.170 --> 00:39:06.030 Haki Benita: a win-win all around. Okay, faster and smaller than the default B-tree index. Now, I did a little experiment. Okay. So what I did was, I created a hash index and a B-tree index on the key and on the URL. Okay, you can see the chart right here.
197 00:39:06.490 --> 00:39:35.660 Haki Benita: I have a hash index on the key, I have a hash index on the URL, I have a B-tree on the key, and I have a B-tree on the URL. And what I did is I started adding rows to the table. Okay, you can see at the bottom the bottom axis; that's the number of rows. So I started adding rows into the table until I got to a million rows. Now, every time I added rows to the table I took a snapshot of the sizes of all the indexes, and then I put
198 00:39:35.740 --> 00:39:39.649 Haki Benita: all the data in this chart, and we can see some
199 00:39:39.740 --> 00:39:43.597 Haki Benita: interesting things. Okay. 1st of all.
200 00:39:44.510 --> 00:39:46.580 Haki Benita: 1st of all, if you look at the.
201 00:39:47.000 --> 00:39:49.219 Haki Benita: If you look at the red line.
202 00:39:49.470 --> 00:39:52.999 Haki Benita: which is the B-tree on the URL, a big piece of text,
203 00:39:53.690 --> 00:40:18.479 Haki Benita: and the green line, which is the B-tree on the key, the short piece of text. First of all, you can see that both of them grow basically linearly as I add more rows to the table, right? So we can see this linear line increasing, right? As I add more rows, the size of the index increases. We can also see that the red line, the B-tree on the URL, is always larger
204 00:40:18.850 --> 00:40:21.239 Haki Benita: than the B-tree on the key, right?
205 00:40:21.780 --> 00:40:30.559 Haki Benita: So the reason for that is that the URL is a big piece of text, and the key is a short piece of text. This tells us
206 00:40:30.890 --> 00:40:33.730 Haki Benita: that the size of the B-tree is
207 00:40:33.840 --> 00:40:36.900 Haki Benita: very much affected by the size
208 00:40:37.250 --> 00:40:40.240 Haki Benita: of the column that it indexes.
209 00:40:40.380 --> 00:40:49.959 Haki Benita: So a B-tree on URL will be bigger than a B-tree on key for the same amount of rows, because a URL is bigger than a key.
210 00:40:50.270 --> 00:40:56.780 Haki Benita: So that's about the B-tree indexes. However, if we look at the hash indexes, that's the blue
211 00:40:57.900 --> 00:40:59.700 Haki Benita: and the yellow lines.
212 00:41:00.190 --> 00:41:02.260 Haki Benita: 1st of all, we can see that
213 00:41:03.480 --> 00:41:10.410 Haki Benita: the size of the hash index, as I add more rows, is not affected by the size of the value,
214 00:41:10.540 --> 00:41:18.259 Haki Benita: because URL is big and key is small, but as I add more rows to the table, the size of the hash index is the same for both. Okay.
215 00:41:18.400 --> 00:41:27.409 Haki Benita: The second thing that I can see is that in this specific case the hash index was consistently smaller
216 00:41:27.690 --> 00:41:35.050 Haki Benita: than the B-tree index on the same column. Okay. So in this case the hash index was always smaller.
217 00:41:35.520 --> 00:41:40.680 Haki Benita: Another thing that we can see in this chart is that, unlike the B-tree index that grows linearly,
218 00:41:41.050 --> 00:41:48.299 Haki Benita: the hash index grows in like steps. Right? You can see the step, and then it's flat. Step flat.
219 00:41:48.700 --> 00:42:09.099 Haki Benita: So what's happening in a hash index is that we start adding rows to the hash index, and we have some bucket, and this bucket starts to fill up. Now, when a bucket fills up, Postgres needs to split this bucket. Now, when the bucket is split, Postgres pre-allocates
220 00:42:09.580 --> 00:42:12.570 Haki Benita: storage disk space for this bucket.
221 00:42:12.700 --> 00:42:16.419 Haki Benita: So the steps that you see are the bucket splits,
222 00:42:16.540 --> 00:42:21.430 Haki Benita: where Postgres allocates additional storage to split the bucket.
223 00:42:21.770 --> 00:42:22.630 Haki Benita: Right?
224 00:42:22.970 --> 00:42:25.229 Haki Benita: So this is why hash index
225 00:42:25.420 --> 00:42:28.239 Haki Benita: grows in steps.
226 00:42:29.060 --> 00:42:35.259 Haki Benita: So a hash index is ideal when we have very few duplicates
227 00:42:35.470 --> 00:42:59.300 Haki Benita: in the rows that we want to index, and the reason for that is, if we have lots of duplicates, the values would map to the same bucket, and we won't get the benefit of a hash index. The reason that a hash index made sense in our case is that URL is mostly unique. It's almost unique. Okay, it's not unique by definition. But there's not a lot of duplicates.
228 00:42:59.680 --> 00:43:18.200 Haki Benita: We also saw that, unlike a B-tree index, a hash index is not affected by the size of the values that it indexes, and the reason for that is that the hash index doesn't actually include the values; it includes hash values. Okay, this is why I can index very, very big values, big strings,
229 00:43:18.540 --> 00:43:40.110 Haki Benita: with a relatively small index. Okay, as we saw, a hash index, under some circumstances, can be both smaller and faster than a B-tree index. And the reason that a lot of people are unfamiliar with the hash index is that prior to Postgres 10, which is already pretty old because we're now at Postgres 17,
230 00:43:40.580 --> 00:44:04.829 Haki Benita: if you went to the documentation for hash indexes, there would be this huge warning saying: Beware, do not use hash indexes, they are not production ready. So a lot of developers became used to not using hash indexes. But starting in Postgres 10, you can definitely use hash indexes in production. They are production ready, and as we saw, they can be very, very good under some circumstances.
231 00:44:06.160 --> 00:44:12.890 Haki Benita: While we're talking about hash indexes, it is very important to also know the restrictions of hash indexes. First of all, a hash index
232 00:44:14.290 --> 00:44:32.920 Haki Benita: cannot be used to enforce uniqueness. You cannot create a unique hash index, and the reason is that a hash index does not contain the actual values, just hash values, and technically you can have multiple different values producing the exact same hash value.
233 00:44:33.090 --> 00:44:43.399 Haki Benita: So you cannot create a unique hash index. However (okay, and that's the comment at the bottom, we can talk about it later if you want), you can enforce uniqueness
234 00:44:43.680 --> 00:44:47.209 Haki Benita: with a hash index using an exclusion constraint.
235 00:44:47.440 --> 00:44:56.589 Haki Benita: Okay, next, we can't have a composite hash index. We can't have a hash index on multiple columns. Okay?
236 00:44:57.410 --> 00:45:02.989 Haki Benita: And we cannot use a hash index for sorting and range searches, because, once again,
237 00:45:03.280 --> 00:45:10.940 Haki Benita: a hash index does not contain the actual values, just the hash values, right? So I can't use a hash index for things like,
238 00:45:11.390 --> 00:45:17.379 Haki Benita: you know, BETWEEN, greater than, less than, and so on. Just equality.
239 00:45:18.540 --> 00:45:24.421 Haki Benita: So, quick recap. Just 4 more slides, I promise. Okay,
240 00:45:26.090 --> 00:45:34.610 Haki Benita: when to use indexes. So remember, indexes can make queries faster. We saw that in all of our examples.
241 00:45:34.650 --> 00:45:56.340 Haki Benita: using an index made the query faster. However, they're not free; they come at a cost. You need to maintain this index, and this index maintenance happens when you insert, when you update, and when you delete. So the more indexes you create, the faster your queries are, but the slower every other operation is.
242 00:45:56.500 --> 00:46:18.380 Haki Benita: Okay. Another thing to consider, and this is often overlooked: indexes can be very, very big. They consume a lot of disk space. When you go back to your databases after this talk, please run \di+ and look at the sizes of your indexes. I think that if you never looked at the size of your indexes,
243 00:46:18.620 --> 00:46:23.349 Haki Benita: You're going to be very much surprised at what you're going to find.
244 00:46:24.180 --> 00:46:41.909 Haki Benita: And finally, using an index is not always best. If you have a query that needs to access a large portion of the table, sometimes it doesn't make sense to use an index for that. Okay, there's no magic number, but, you know,
245 00:46:42.190 --> 00:46:43.480 Haki Benita: keep that in mind.
246 00:46:44.710 --> 00:46:55.220 Haki Benita: So we talked about index types and features. We talked about partial indexes, inclusive B-tree indexes, and we talked about hash indexes.
247 00:46:55.420 --> 00:47:07.439 Haki Benita: We talked a little bit about how to evaluate performance. I don't know if you noticed, but throughout this presentation we went through the same process over and over again. We start by
248 00:47:07.600 --> 00:47:25.639 Haki Benita: executing some query with EXPLAIN ANALYZE to get the timing with no indexes. This is basically establishing a baseline, right? And then we start experimenting with different types of indexes. So usually we start with a B-tree. We take a measure of the time using EXPLAIN ANALYZE,
249 00:47:25.640 --> 00:47:40.620 Haki Benita: and then we take the size of the index. We put it all in a nice table. We start experimenting. And once you have all the data organized like that. It's a lot easier to reach a decision on what is the best indexing approach
250 00:47:40.630 --> 00:47:42.499 Haki Benita: for your specific use case.
251 00:47:42.560 --> 00:47:53.119 Haki Benita: And also, hopefully, you remember that index performance is not just about speed. As we saw, we can get significant
252 00:47:53.660 --> 00:47:57.540 Haki Benita: disk space reductions with a very, very,
253 00:47:57.600 --> 00:48:09.329 Haki Benita: with a very small price in speed; sometimes it makes sense to make this compromise. We also, throughout this talk, saw how to use EXPLAIN,
254 00:48:09.360 --> 00:48:31.259 Haki Benita: how to use EXPLAIN ANALYZE, how to debug SQL in Django, and we also saw a lot of execution plans. I don't know if you noticed, but if you've never seen execution plans before, hopefully, when you go back to your system, you'll start doing EXPLAIN ANALYZE on some of the queries you run a lot. You get to actually understand what the database is doing. Now,
255 00:48:31.560 --> 00:48:45.659 Haki Benita: in this talk I talked only about inclusive indexes, partial indexes, and hash indexes, but, in fact, there are many, many different other types of indexes that are exotic and very, very cool. We have
256 00:48:46.330 --> 00:48:56.900 Haki Benita: BRIN indexes, we have function-based indexes, and we have a lot of different flavors of things that we can do. And you can check out this
257 00:48:57.300 --> 00:49:04.960 Haki Benita: class, 3 hours packed with SQL magic, for your benefit. And
258 00:49:05.810 --> 00:49:13.720 Haki Benita: finally check me out in all of these places, and I'm happy to take questions or discuss whatever you want.
259 00:49:19.490 --> 00:49:22.113 Gabor Szabo: Whoa, thank you.
260 00:49:23.750 --> 00:49:26.585 Gabor Szabo: Because, yeah.
261 00:49:27.400 --> 00:49:28.630 Haki Benita: Hectic.
262 00:49:30.335 --> 00:49:35.410 Gabor Szabo: Yeah, this is not a question. Haki's article on hash indexes is truly excellent.
263 00:49:35.520 --> 00:49:42.589 Gabor Szabo: I believe it remains one of the top search results for anyone looking for resources on hash indexes.
264 00:49:42.760 --> 00:49:47.639 Haki Benita: It's true, it's true. This is one of the top searches for hash index in postgres.
265 00:49:47.910 --> 00:49:48.340 Gabor Szabo: Yeah.
266 00:49:48.340 --> 00:49:53.060 Haki Benita: Yeah, I managed to catch this trend very, very early on.
267 00:49:54.515 --> 00:49:55.540 Gabor Szabo: Okay.
268 00:49:55.790 --> 00:49:56.270 Haki Benita: Mm.
269 00:49:56.270 --> 00:50:01.189 Gabor Szabo: Comments, questions before we close this session?
270 00:50:02.340 --> 00:50:05.000 Gabor Szabo: We know where to find you.
271 00:50:05.160 --> 00:50:07.829 Gabor Szabo: We'll have the link.
272 00:50:08.320 --> 00:50:16.320 Gabor Szabo: You can add the links to the post of the video as well, so people can find it easily,
273 00:50:17.100 --> 00:50:19.660 Gabor Szabo: and any comments.
274 00:50:19.660 --> 00:50:20.020 Haki Benita: Okay.
275 00:50:20.020 --> 00:50:21.859 Gabor Szabo: Questions, apparently not.
276 00:50:21.860 --> 00:50:24.780 Haki Benita: Yeah, I want to thank you, Gabor, for hosting this meeting.
277 00:50:24.780 --> 00:50:25.650 Gabor Szabo: It was excellent.
278 00:50:26.146 --> 00:50:27.139 Haki Benita: Meet up!
279 00:50:27.140 --> 00:50:32.660 Gabor Szabo: Yeah. Well, yeah. So thank you very much for this presentation.
280 00:50:32.770 --> 00:50:41.470 Gabor Szabo: If anyone has questions, then we'll see how to find Haki later on, on this slide, and then we'll put it under the video.
281 00:50:42.020 --> 00:50:52.750 Gabor Szabo: Thank you for supporting us. Thank you for being here. Thank you very much to you for giving the presentation. Please like the video and follow the channel. Yeah.
282 00:50:53.020 --> 00:51:10.139 Gabor Szabo: And if you would like to give a presentation, you're welcome to contact me as well, and we'll see how we can schedule a presentation, at what time, and so on. So thank you very much, and
283 00:51:10.430 --> 00:51:15.029 Gabor Szabo: see you at the next meeting next video, whatever.
284 00:51:15.400 --> 00:51:16.869 Gabor Szabo: Thank you. Bye, bye.
285 00:51:16.870 --> 00:51:18.830 Haki Benita: Thank you very much. Everyone. Good night.
Indexes are extremely powerful, and ORMs like Django and SQLAlchemy provide many ways of harnessing their powers to make queries faster and the database more efficient. In this talk I reveal the secrets of DBAs with some advanced indexing techniques such as partial, function-based and inclusive B-tree indexes, and who knows, maybe even some index types you've never heard of before!

1 00:00:00.720 --> 00:00:02.690 Haki Benita: This meeting is being recorded.
2 00:00:03.400 --> 00:00:04.320 Gabor Szabo: Okay.
3 00:00:05.800 --> 00:00:12.250 Gabor Szabo: yeah. So hi, and welcome to the Python Maven, let's call it Python Maven. This is the Code Maven
4 00:00:12.500 --> 00:00:41.910 Gabor Szabo: YouTube channel, and we are organizing these meetings in the Code Maven events group, but it has 3 separate sessions, and this is going to be the Python-specific one. My name is Gabor Szabo. I usually teach Python and Rust and help companies introduce testing, and I also like to organize these events and allow people to share their knowledge with each other.
5 00:00:42.270 --> 00:00:46.010 Gabor Szabo: You're welcome. I'm really happy that you're here
6 00:00:46.140 --> 00:01:04.909 Gabor Szabo: in this session, listening. As I mentioned earlier, you're welcome to comment or use the chat and ask questions. And if you're just watching the video recorded on YouTube, then please remember to like the video and follow the channel,
7 00:01:05.080 --> 00:01:11.990 Gabor Szabo: and let's welcome Haki now, and let him introduce himself
8 00:01:12.700 --> 00:01:17.579 Gabor Szabo: and give the presentation. So thank you for accepting the invitation.
9 00:01:18.970 --> 00:01:31.149 Haki Benita: Thank you. Thank you, Gabor. First of all, I like the fact that we have this intimate group that we can freely talk in. I actually encourage you to consider opening the mics,
10 00:01:31.210 --> 00:02:01.090 Haki Benita: because I think we can actually have a conversation throughout the presentation. I like to give interactive presentations. Your call, you're the boss. And just a quick introduction about the subject and about myself: we are going to talk about how to make your back end roar. And I want to start by apologizing for the tacky headline, but unfortunately these types of tacky headlines do work, believe it or not.
11 00:02:01.610 --> 00:02:09.010 Haki Benita: So, my name is Haki Benita. I'm a software developer and a technical lead. I'm currently leading a team
12 00:02:09.289 --> 00:02:18.949 Haki Benita: of developers working on a very large ticketing platform in Israel, serving about one and a half,
13 00:02:19.580 --> 00:02:32.470 Haki Benita: 1.5 million unique paying users every month. And I also like to write and talk about python performance and databases. And you can find my stuff on my website.
14 00:02:33.110 --> 00:02:47.839 Haki Benita: So today, we are going to talk about some lesser known features of indexes. And we're going to try and understand how they work and when we can and should use them
15 00:02:47.850 --> 00:03:14.629 Haki Benita: to do that, we are going to build a URL shortener together, and we're going to do it in Django. I would say that since this is a talk about python, I'm going to use Django and the Django Orm. But the concepts that I'm going to describe are not specific to Django, and they're not specific to Postgres. Heck. They're not even specific to python. But this is a good environment to explain the concepts with.
16 00:03:15.390 --> 00:03:19.889 Haki Benita: So what is a URL shortener? You probably know about
17 00:03:19.900 --> 00:03:39.330 Haki Benita: other types of URL shorteners: you have Bitly, you have the late goo.gl, buff.ly, and so on. Basically, a URL shortener is a system that provides a short URL that redirects to a longer URL. Now, why would you want to do that?
18 00:03:39.330 --> 00:04:02.240 Haki Benita: First, if you are operating in text-constrained environments, like SMS messages or tweets, you might want to share a very long link, so you want to make it shorter so it consumes less space. This is where short URLs can be handy. Another nice feature of URL shortening is that whenever someone clicks the short URL,
19 00:04:02.240 --> 00:04:16.500 Haki Benita: the URL shortener redirects to the long URL and keeps track of how many people clicked that link. So if you have something like a campaign that you want to launch, and you want to keep track of how many people clicked your link,
20 00:04:16.820 --> 00:04:20.149 Haki Benita: This is what you would use a URL shortener for
21 00:04:20.310 --> 00:04:48.240 Haki Benita: so to build our URL shortener in Django, we're going to start with this very, very simple model. We are calling the model ShortUrl. We have an id column, which is the primary key; it's just an auto-incrementing integer field. We have the key, a unique short piece of text that uniquely identifies our short URL. This is the short key at the end of the short URL.
22 00:04:48.500 --> 00:05:07.030 Haki Benita: We then have the url, which is the long URL we want to redirect to. We also want to keep track of when the URL was created; we do that using the created_at column. And finally, we want to keep track of how many users click the link, and we do that with the hits column
23 00:05:07.180 --> 00:05:08.110 Haki Benita: at the bottom.
24 00:05:08.960 --> 00:05:19.650 Haki Benita: So for our demonstration, so we actually have something to work with, I loaded 1 million short URLs into the table. Okay, now, this is not a lot, but we are going to see some
25 00:05:20.700 --> 00:05:25.929 Haki Benita: performance gains with just 1 million rows. Okay.
26 00:05:26.810 --> 00:05:33.380 Haki Benita: so this talk is about python. But it's essentially about SQL, so
27 00:05:33.510 --> 00:05:54.859 Haki Benita: in Django, if you want to get the SQL generated by Django for a given queryset, you can do that by accessing queryset.query and printing it. In this case I'm doing a ShortUrl filter on a specific key, .query, and I can actually get Django to print
28 00:05:55.190 --> 00:05:59.549 Haki Benita: the SQL that it generated for this queryset, right?
29 00:06:00.040 --> 00:06:26.740 Haki Benita: So, after viewing the queryset, it's also very interesting to see how the database is planning to execute my query. Right? I can do that by calling the explain() function. This translates into an EXPLAIN command in SQL, and what I get in return is not the result of the query, but the execution plan, which is how the database is planning
30 00:06:26.930 --> 00:06:30.979 Haki Benita: to execute my query. Now, when we just use EXPLAIN,
31 00:06:31.200 --> 00:06:36.260 Haki Benita: the database doesn't actually execute the query. It just produces a plan
32 00:06:36.370 --> 00:06:53.839 Haki Benita: sometimes, especially when we're benchmarking and we're trying to improve performance, it can be useful to produce the execution plan, but also have the database execute the query and return some useful execution data. For that we can use a slightly different variation of the EXPLAIN command,
33 00:06:53.970 --> 00:07:13.319 Haki Benita: which is EXPLAIN ANALYZE. In Django you can do that by using explain(analyze=True). In SQL, in Postgres specifically, you can do EXPLAIN (ANALYZE ON, TIMING ON) in parentheses, followed by the query, and then you get some additional information about the execution plan.
34 00:07:13.350 --> 00:07:27.339 Haki Benita: First, because the database actually executed the query, you can see at the bottom how long it took the database to produce an execution plan. In this case that would be 0.140 ms,
35 00:07:27.710 --> 00:07:38.510 Haki Benita: and I also get how long it took the database to execute the query from start to end. In this case that would be 0.046 ms. Okay.
36 00:07:39.430 --> 00:07:47.120 Haki Benita: Now, in addition to the timing. I'm also getting a very, very interesting piece of information inside the execution plan.
37 00:07:47.260 --> 00:07:53.699 Haki Benita: Okay, what I get is the estimated cost and the actual cost
38 00:07:53.820 --> 00:07:58.059 Haki Benita: that the database encountered while executing the query. So
39 00:07:59.010 --> 00:08:15.400 Haki Benita: discussing the cost-based optimizer is slightly outside the scope of this talk, I would just say that, comparing the expected cost to the actual cost is a very useful measure to try and identify bad execution plans.
40 00:08:16.100 --> 00:08:17.350 Haki Benita: Finally.
41 00:08:17.990 --> 00:08:28.419 Haki Benita: another way of viewing queries is to turn on the logger for the database backend in Django. This way, whenever Django executes a query,
42 00:08:29.040 --> 00:08:32.620 Haki Benita: it logs the SQL that was produced by the ORM.
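A minimal sketch of that logging setup, in the dictConfig style of Django's LOGGING setting. The logger name django.db.backends is Django's real SQL logger; note Django only records SQL on it when DEBUG is True:

```python
# Sketch of turning on SQL logging for Django's database backend.
# With this in settings.py, every statement Django executes is logged
# together with its duration. DEBUG must be True for Django to record
# the SQL on this logger.
import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        # Django's database backend logs every executed statement here.
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}

logging.config.dictConfig(LOGGING)
```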
43 00:08:33.510 --> 00:08:34.475 Haki Benita: So
44 00:08:35.700 --> 00:09:05.329 Haki Benita: to actually start discussing some indexing techniques, we need to start implementing some, you know, business processes. So let's start with the most basic thing that a URL shortener actually does, and that's looking up the URL to redirect to by a key. A user uses one of our short URLs, we get the unique key, and we need to find the long URL to redirect to. Okay, this is like the bread and butter of this system.
45 00:09:05.440 --> 00:09:27.109 Haki Benita: So if we want to implement this very, very simple function, we can do something like that: def resolve, that's the name of the function; we want to resolve a key to a URL. We accept a key, and then we execute this simple query to just get the ShortUrl for this key. If we don't find anything we return None, otherwise we return the URL to redirect to,
46 00:09:27.110 --> 00:09:37.730 Haki Benita: okay. Now we want to look at the SQL. That Django generated for this function. Right? So we execute this function on some random key
47 00:09:37.950 --> 00:09:57.950 Haki Benita: with SQL logging turned on, and we can see the query right here. Now, if you look at this query, it looks like Django basically fetched everything from the short_url table for the key that we asked for, right? SELECT * FROM short_url WHERE key = something.
48 00:09:58.270 --> 00:10:05.050 Haki Benita: If we want to look at how postgres is actually executing this query.
49 00:10:05.210 --> 00:10:12.719 Haki Benita: we can use the explain command. And what we get is that Postgres is planning to use an index scan
50 00:10:13.535 --> 00:10:20.159 Haki Benita: on the index we have on the key column. Okay, now.
51 00:10:21.180 --> 00:10:28.839 Haki Benita: to understand what an index scan actually means, let's take a second to talk about the B-tree index.
52 00:10:29.040 --> 00:10:42.120 Haki Benita: So the B-tree index is like the king of all indexes. This is the default index in most database engines. If you're not sure what type of index you're using, you're probably using a B-tree index. Okay?
53 00:10:42.560 --> 00:11:11.160 Haki Benita: So to understand how a B-tree index works, let's start by building one. So imagine you have these values, 1 through 9, and you want to create a B-tree index on them. You start by sorting the values and storing them in leaf blocks. You can see the leaf blocks at the bottom. They are sorted from left to right. We have 1, 2, 3, all the way through 9. Now every entry in the leaf blocks contains a list of TIDs. These are pointers to rows in the table
54 00:11:11.400 --> 00:11:15.460 Haki Benita: That store rows with these values. Okay.
55 00:11:16.290 --> 00:11:28.179 Haki Benita: now, above the leaves, we have branch blocks and a root block that act as a directory to these leaf blocks. So let's see how this works. Let's imagine that we want to look.
56 00:11:28.180 --> 00:11:38.290 Gabor Szabo: Sorry, someone says that they don't see the slides. So I just wanted to check, and I'm unsure if the other people do see the slides. So if
57 00:11:38.670 --> 00:11:53.529 Gabor Szabo: I asked it in the chat, but no one answered, so I hope that other people... Okay, so some other people see it. So my recommendation to Eduardo is to maybe exit Zoom and enter Zoom again. Sorry for that.
58 00:11:53.530 --> 00:11:54.940 Haki Benita: Okay, no problem.
59 00:11:55.120 --> 00:11:56.160 Haki Benita: Yeah.
60 00:11:56.400 --> 00:11:59.700 Haki Benita: Okay, okay, so let's
61 00:12:01.690 --> 00:12:31.100 Haki Benita: okay. So let's try to search for the value 5 in the B-tree index that we just built. So we start with the root block and we start scanning from left to right. 5 is larger than 3, so we skip the first entry. 5 is between 3 and 7, so we follow this pointer to the middle leaf block. We then start scanning the leaf block from left to right. The first value is 4; it's not a match.
62 00:12:31.100 --> 00:12:36.150 Haki Benita: The next value is 5. That's a match, and now we can
63 00:12:36.150 --> 00:12:47.970 Haki Benita: follow the pointers from this leaf block to the rows in the table. We can read the rows and do whatever we need to do with these rows. Okay.
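The lookup just walked through can be illustrated with a toy sketch in plain Python. This only models the idea (a root directory narrowing the search to one sorted leaf block, whose entries point at table rows); it is not how Postgres actually implements B-trees:

```python
# Toy illustration of a B-tree lookup: a root "directory" narrows the
# search to one sorted leaf block, and each leaf entry carries pointers
# (TIDs) to matching rows in the table.
import bisect

# Leaf blocks: sorted values, each mapped to a list of row pointers.
leaves = [
    [(1, ["tid-1"]), (2, ["tid-2"]), (3, ["tid-3"])],
    [(4, ["tid-4"]), (5, ["tid-5"]), (6, ["tid-6"])],
    [(7, ["tid-7"]), (8, ["tid-8"]), (9, ["tid-9"])],
]

# Root block: the smallest value in each leaf acts as a separator key.
root = [leaf[0][0] for leaf in leaves]  # [1, 4, 7]


def search(value):
    """Return the row pointers for `value`, or [] if it is not indexed."""
    # First read: consult the root to pick the leaf that may hold the value.
    leaf = leaves[bisect.bisect_right(root, value) - 1]
    # Second read: scan the leaf block for an exact match.
    for key, tids in leaf:
        if key == value:
            return tids  # follow these pointers into the table (third read)
    return []
```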
64 00:12:48.310 --> 00:13:15.100 Haki Benita: let's go back to our query. Okay, one second, yeah. Let's go back to our query. Remember that we said that Django generated this query and this query is fetching everything right, basically select star from short URL. But, in fact, if you think about it, we don't actually care about all these fields right? We only care about the URL. I mean, we're not looking to resolve
65 00:13:16.290 --> 00:13:27.129 Haki Benita: a key to a URL for the purpose of redirecting. I don't care when it was created. I don't care about the id. I already have the key, right? And I don't care about the hits counter at this point
66 00:13:27.610 --> 00:13:30.209 Haki Benita: right? So I don't care about all these fields. So
67 00:13:30.770 --> 00:13:55.089 Haki Benita: one thing that we can do, instead of fetching all of these fields, how about we just fetch what we actually need? In Django, we can do that by adding values_list("url"). Now the function is slightly different, but if we look at the SQL generated by this function, we can see that now, instead of fetching all the columns in the row, we just fetch the url. So this is exactly what we need.
68 00:13:55.200 --> 00:14:10.249 Haki Benita: If we look at the execution plan once again for this query, we can see that again Postgres is using an index scan on the unique index that we have on the key. Right? So now,
69 00:14:10.920 --> 00:14:30.719 Haki Benita: once we found a matching row, we can follow the pointer to the table and get the URL from the table. So if you imagine the amount of disk reads I need to do to satisfy this query: I'm starting by reading the root block, right? So that's 1 read. Then I need to follow the branch all the way to the leaf. Let's say that we have just
70 00:14:30.730 --> 00:14:41.789 Haki Benita: you know, root block, and then directly to the leaf. So reading the leaf is another read, and then we need to follow the link from the leaf block to read the row from the table. So this is a unique
71 00:14:41.970 --> 00:14:52.020 Haki Benita: column. So we have at most one row. So that's another read. So basically, we did 3 random reads to satisfy this query right now.
72 00:14:53.290 --> 00:15:03.019 Haki Benita: this query is executed a lot. This is basically what our system is doing right. It's getting keys and resolving them to Urls to redirect right
73 00:15:03.360 --> 00:15:17.979 Haki Benita: now. We already established that all we care about in this specific scenario is just the URL. I don't care about anything else. I care just about the URL. So what if? And stay with me? This is mind blowing.
74 00:15:17.980 --> 00:15:34.249 Haki Benita: What if, instead of going to the table to get the URL. What if I could include the URL in the leaf block in the index this way? When I found a matching entry in the leaf block, I would have the URL just sitting there.
75 00:15:34.310 --> 00:15:52.420 Haki Benita: Right? So this mind-blowing idea is called an inclusive index. Okay, in other databases these are called covering indexes, and what it allows us to do is store additional information in the leaf block.
76 00:15:52.500 --> 00:16:14.569 Haki Benita: So if we want to use an inclusive index in Django, we can add the include argument to the unique constraint. Now look: the key is indexed; the url is not indexed, it's just included in the leaf block. Okay. Now, if we generate a migration, we apply it and we try the query again.
77 00:16:15.500 --> 00:16:21.569 Haki Benita: You can see that once again, Postgres is using our index, our unique index on the key. But there is
78 00:16:21.900 --> 00:16:33.889 Haki Benita: a very, very subtle difference here, if you notice. Previously we had an index scan using our unique index. This time we have an index-only scan.
79 00:16:34.020 --> 00:17:03.620 Haki Benita: This means that Postgres was able to satisfy the query without accessing the table. All the data that it needs was already in the leaf block. So if we once again imagine how many reads we need to do to satisfy this query, using the inclusive index, we read the root block. We follow the pointer all the way down to the leaf block, and now, instead of going to the table to read the URL. We have the URL right there in the leaf block. So we only need to read
80 00:17:03.670 --> 00:17:05.849 Haki Benita: 2 blocks from disk.
81 00:17:06.150 --> 00:17:17.110 Haki Benita: Okay, the way to identify this is by the operation in the plan, right? So we have an index scan, and we have an index-only scan.
82 00:17:18.170 --> 00:17:39.170 Haki Benita: So quick recap about inclusive indexes, as I mentioned in other databases. They are sometimes called covering indexes, and they allow us to fulfill queries without accessing the table. However, you should use them with caution. Because if you think about it, we're basically duplicating data from the table to the index. Okay?
83 00:17:39.170 --> 00:17:49.959 Haki Benita: So if you have a very big piece of information, and a URL can be very, very big, then basically I'm now storing the URL
84 00:17:50.140 --> 00:18:09.440 Haki Benita: twice. So the index could get very, very big. I'm actually not a big fan of inclusive indexes, but I can think of 2 scenarios where it might be a good idea. First, if you have very wide tables, imagine data warehouse type of tables, denormalized tables,
85 00:18:09.600 --> 00:18:11.520 Haki Benita: and you have a very
86 00:18:12.250 --> 00:18:22.290 Haki Benita: predefined set of queries that are executed very, very often on a very, very small subset of columns, you can consider using
87 00:18:23.440 --> 00:18:50.249 Haki Benita: an inclusive index. And also, I personally found that non-unique composite indexes can be good candidates for inclusive indexes, that is, indexes on multiple columns that are not used to enforce a unique constraint. Sometimes they can benefit from switching from a plain composite index to an inclusive index. Okay, questions so far before we move on to the next use case?
88 00:18:55.710 --> 00:19:02.210 Haki Benita: Okay, if you have any questions, feel free. Let's move on to the next use case.
89 00:19:02.800 --> 00:19:04.080 Haki Benita: So now
90 00:19:04.230 --> 00:19:16.229 Haki Benita: we want to find unused keys, right? We have this business question: we want to know how many short URLs we have with no hits at all. Okay, we have 0 hits.
91 00:19:17.070 --> 00:19:23.050 Haki Benita: So we start by implementing this very, very simple function. We call it find_unused,
92 00:19:23.350 --> 00:19:26.190 Haki Benita: and it returns a queryset
93 00:19:26.790 --> 00:19:43.480 Haki Benita: with short URLs where hits equals 0. Once again, if we want to see what the query looks like, we can print the result of .query. We can see that it returns SELECT * FROM short_url WHERE hits = 0.
94 00:19:44.560 --> 00:19:58.929 Haki Benita: Once again, through the same process, we produce an execution plan. This time we can see that Postgres is doing a sequential scan on short_url. A sequential scan is basically a full table scan: Postgres is just
95 00:19:59.010 --> 00:20:18.369 Haki Benita: reading the table row by row, looking for rows where hits equals 0. We can see that the execution time at the bottom is 116 ms. Let's say, for the sake of discussion, that this is very, very slow, and we want to try and improve that.
96 00:20:18.450 --> 00:20:48.250 Haki Benita: So if you go to like 99% of developers and DBAs, they will tell you what's the problem: just slap a B-tree on it. Right. So we add a B-tree index on the hits column. We do that in Django using db_index=True. We generate a migration. We apply the migration. We once again produce the execution plan with ANALYZE, and lo and behold,
97 00:20:48.310 --> 00:20:56.180 Haki Benita: Postgres is using our index, short_url_hits_ix, and, as you can see, the execution time
98 00:20:56.810 --> 00:21:02.370 Haki Benita: is very, very fast compared to before, so we're done right.
99 00:21:03.230 --> 00:21:06.060 Haki Benita: We can call it a day, we can go for lunch.
100 00:21:06.330 --> 00:21:08.609 Haki Benita: We're happy. It's fast. Now
101 00:21:09.310 --> 00:21:20.299 Haki Benita: stop, let's take a second to talk about performance and what it actually means. Okay? Because intuitively, when we talk about performance, we talk about
102 00:21:20.380 --> 00:21:37.639 Haki Benita: speed right? We want things to be very, very quick. But I think, or the way I view performance is that we need to balance different types of resources. And I want to illustrate this with an example. Okay, let's say that you have this batch processing job running at night.
103 00:21:37.640 --> 00:21:53.420 Haki Benita: Now, this batch processing job runs in the middle of the night, when you have very, very few users, and it runs very, very fast. It takes this batch processing job like 10 seconds to complete. You're so happy, it's so fast. However,
104 00:21:53.720 --> 00:22:05.569 Haki Benita: however, this job consumes huge amounts of memory, huge amounts of CPU and huge amounts of disk space right. What if I told you that
105 00:22:06.440 --> 00:22:12.950 Haki Benita: if we are willing to compromise, and instead of completing in 10 seconds, it takes a minute
106 00:22:13.410 --> 00:22:38.970 Haki Benita: but it consumes very little memory, disk space, and CPU? I'm guessing that if you pay a lot of money for memory, you are willing to make this compromise. Okay, I'll give you another example. Let's say that you have this background job running in the middle of the day. Now, this background job consumes a lot of CPU, so much CPU, in fact, that it starts to interfere with user traffic in the system.
107 00:22:39.030 --> 00:23:07.120 Haki Benita: In this case, instead of optimizing for time, you might be optimizing for CPU, right? You're willing to compromise a few seconds. But you don't want the background job to consume a lot of CPU. So when we talk about performance. We talk about more than just speed. We're talking about how we can balance different resources in the system, usually depending on some type of context time of day the type of resource that we have available at this time. Right?
108 00:23:07.670 --> 00:23:23.450 Haki Benita: So remember that we slapped a B-tree on it, right? And it was very, very fast, but I'm not sure that was the most optimal thing that we could have done. So let's go to the database
109 00:23:23.580 --> 00:23:33.769 Haki Benita: and check the size of the index we created to solve this teeny, tiny problem. Okay, so this index.
110 00:23:34.570 --> 00:23:41.979 Haki Benita: right, is 7 MB. Okay, so that's pretty big for this type of index.
111 00:23:42.120 --> 00:23:47.420 Haki Benita: So our 7 MB index includes
112 00:23:47.630 --> 00:23:57.789 Haki Benita: all the rows in the table, right? We just added db_index=True to create a B-tree index on the column, so it contains all the 1 million rows in the table. But
113 00:23:58.570 --> 00:24:05.790 Haki Benita: we actually don't care about all the rows in the table. Right? Nobody asked us how many
114 00:24:06.150 --> 00:24:25.690 Haki Benita: short Urls you have with less than 5 hits, or more than 266 hits, or exactly 1,000 hits. Nobody cares about that. We had a very specific question that we wanted to answer in regards to the hits. We wanted to find how many short Urls we have with exactly 0 hits.
115 00:24:26.100 --> 00:24:37.350 Haki Benita: So what if, instead of indexing all the rows in the table, we could index just a portion of the rows, the part of the table that we actually care about?
116 00:24:37.810 --> 00:24:51.950 Haki Benita: Right? So this is, once again, a mind-blowing idea, and this is made possible with something called partial indexes. Partial indexes allow us to index just the part of the table that we actually care about.
117 00:24:52.810 --> 00:25:08.019 Haki Benita: So going back to our Django model: first we start by removing db_index from the column definition (you should never use db_index, regardless of this), and then, instead of adding this default index on the column,
118 00:25:08.020 --> 00:25:28.989 Haki Benita: we add a proper index, right? But we add a condition. Okay, so what this does is create an index on the id column with a condition: WHERE hits = 0. This causes Postgres to create an index just on the rows that satisfy this condition, just on rows
119 00:25:29.200 --> 00:25:54.569 Haki Benita: where hits equals 0. Right? So we generate the migration, we apply the migration, and we try the query again. We produce an execution plan, and we can see that Postgres is using our index. We see an index scan using short_url_unused_part_ix; this is the index we just created. Okay, so Postgres is able to use the partial index we just created
120 00:25:55.000 --> 00:26:04.670 Haki Benita: to satisfy this very specific query. We can also see that the query is very, very fast, even compared to the full index. Right?
121 00:26:05.090 --> 00:26:13.180 Haki Benita: But that wasn't the motivation here, right? This is not what we were looking to optimize. If we go back
122 00:26:13.320 --> 00:26:28.990 Haki Benita: to the database, and we look at the size of this index. Look at that. The partial index is just 88 kB in size. Okay? Previously the full index was 7 MB. The partial index is 88 kB.
123 00:26:28.990 --> 00:26:48.659 Haki Benita: So I did the math. Seriously, I opened Excel, I did the math. That's 99% smaller. Okay, so that's a lot of space. Now, at this point you're probably saying, come on, man, it's just 7 MB, who cares? But go back to your system, where you have huge tables with hundreds of millions and billions of rows, right?
124 00:26:48.840 --> 00:27:06.290 Haki Benita: Check the size of your B-tree indexes. They can become huge. I've seen situations where the B-tree index was larger than the table. Okay, and if you have a lot of indexes it can grow out of control very, very quickly.
125 00:27:07.020 --> 00:27:21.090 Haki Benita: So, as you may guess, I'm a very, very big fan of partial indexes. They produce smaller indexes, and I highly encourage you to use them whenever possible. One limitation of partial indexes is that
126 00:27:22.030 --> 00:27:26.349 Haki Benita: the database can only use partial indexes when
127 00:27:26.500 --> 00:27:52.249 Haki Benita: the query uses the exact same condition as the predicate in the index. Right? The database is not even smart enough to handle something like WHERE hits = 1 - 1. Okay, to that level. So it's limited to queries that use the exact same condition. Usually it's fine, because, you know, why would you write hits = 1 - 1?
128 00:27:52.380 --> 00:27:53.080 Haki Benita: I don't know.
129 00:27:53.520 --> 00:27:58.490 Haki Benita: I personally found that nullable columns are great candidates
130 00:27:58.780 --> 00:28:09.290 Haki Benita: for partial indexes, because in Postgres, for example, null values are indexed, and usually you don't want to use an index for IS NULL queries. So I found that
131 00:28:09.480 --> 00:28:34.749 Haki Benita: whenever I have a nullable column with an index on it, I can benefit from making it a partial index. In fact, I wrote an entire article on how we saved 20 GB of unused disk space simply by identifying nullable columns with indexes and switching them to use partial indexes. Okay, so questions about partial indexes before we move on to the next use case?
132 00:28:36.780 --> 00:28:38.540 Haki Benita: Gabor, you have a question.
133 00:28:42.160 --> 00:28:42.730 Haki Benita: No.
134 00:28:42.730 --> 00:28:45.249 Gabor Szabo: Sorry there is this sorry? Actually, there is this question.
135 00:28:46.340 --> 00:29:04.110 Haki Benita: Oh, is it a good idea to recalculate the hits in partial indexes? How frequently? Well, the nice thing about indexes, and B-trees in general, is that they are always in sync with the data in the table. It's actually part of the transaction. So when you, for example, increment,
136 00:29:05.180 --> 00:29:07.990 Haki Benita: when you increment the counter for the first time,
137 00:29:08.290 --> 00:29:11.070 Haki Benita: the row would just disappear from the index.
138 00:29:11.250 --> 00:29:26.029 Haki Benita: Right? So I'm guessing that you're asking, because you have some experience with like materialized views and stuff like that. So you don't actually have to maintain it actively. It's just maintained by the database.
139 00:29:26.460 --> 00:29:32.839 Haki Benita: It's truly an amazing feature; you should definitely use that. Any more questions before we move on to
140 00:29:33.140 --> 00:29:36.009 Haki Benita: a very exotic type of index in postgres.
141 00:29:36.750 --> 00:29:38.110 Haki Benita: Ow.
142 00:29:41.210 --> 00:29:46.360 Haki Benita: okay, great. So let's talk about another type. Another use case.
143 00:29:47.270 --> 00:30:00.790 Haki Benita: So, in the first use case, we wanted to resolve the key to a URL, right? This is the redirect action. This time we want to do a reverse lookup. We want to ask
144 00:30:01.000 --> 00:30:09.090 Haki Benita: how many keys we have pointing to a specific URL. So we want to search for keys by the URL.
145 00:30:09.530 --> 00:30:20.539 Haki Benita: So we implement this very simple function called reverse_lookup. It accepts a URL and returns a queryset of short URLs. Okay?
146 00:30:21.210 --> 00:30:49.150 Haki Benita: So if we want to see what the query looks like, we use .query, and we can see SELECT * FROM short_url WHERE url = something. Okay. If we produce an execution plan, we can see that the database is doing a sequential scan on the short_url table, that is, scanning the entire table, sifting row by row, finding matches for our query.
147 00:30:49.430 --> 00:30:50.800 Haki Benita: Whoa!
148 00:30:51.590 --> 00:30:55.929 Haki Benita: And we can see that it's relatively
149 00:30:56.140 --> 00:31:00.379 Haki Benita: slow, right? It's like 105
150 00:31:00.500 --> 00:31:03.990 Haki Benita: milliseconds so compared to the index
151 00:31:04.320 --> 00:31:08.840 Haki Benita: queries that we saw before. That's that's pretty slow. Right?
152 00:31:09.220 --> 00:31:23.659 Haki Benita: So, you know, once again, 99% of the people would just say, come on, man, I'm hungry, let's order some food, just slap a B-tree on it. So this is what we do, right? We start by adding a B-tree on the URL
153 00:31:23.860 --> 00:31:37.679 Haki Benita: right? We generate and apply the migration. Now we execute the exact same query again, and we can see that now Postgres is using the index that we just created. We can see an index scan using
154 00:31:38.030 --> 00:31:57.059 Haki Benita: the index on the url column, and also it's very fast. Previously it was like 100 ms; now it's 0.1 ms. So that's a very, very big and significant improvement. We can all go to lunch and be very, very happy and satisfied with ourselves. But
155 00:31:57.770 --> 00:32:09.459 Haki Benita: are we done? Do you think that we are done? Is there anything that we can optimize? Now, if you are paying attention throughout this presentation. You know that we can definitely do better than that.
156 00:32:09.830 --> 00:32:16.550 Haki Benita: Let's go to the database and check the size of the index. Okay? So the size of the index.
157 00:32:16.740 --> 00:32:22.669 Haki Benita: Okay, stay with me. 47 MB. If you remember the previous
158 00:32:23.050 --> 00:32:28.779 Haki Benita: use case, we had an index on all the hits. It was 7 MB. I told you it was large.
159 00:32:28.950 --> 00:32:44.159 Haki Benita: This index, on the same number of rows, is 47 MB. That's very, very big, and the reason it's very, very big is that the URL is very, very big, right? The B-tree index
160 00:32:44.390 --> 00:32:49.879 Haki Benita: holds the actual values in the leaf blocks. So if we are indexing
161 00:32:50.020 --> 00:32:58.219 Haki Benita: a column with very large values, like URLs, which can be very, very big,
162 00:32:58.430 --> 00:33:03.490 Haki Benita: these values are also present in the index,
163 00:33:04.000 --> 00:33:14.130 Haki Benita: and the index can get very, very big. So previously, when we were indexing integers, it was 7 MB. Now we're indexing large pieces of text, URLs,
164 00:33:14.410 --> 00:33:18.940 Haki Benita: and that's 47 MB. Okay, so
165 00:33:19.430 --> 00:33:28.389 Haki Benita: let's pause for a second. Okay, I know that the B-tree is like the magic answer for 90% of the use cases, but there are other types of indexes that we can use.
166 00:33:28.955 --> 00:33:32.949 Haki Benita: So let's pause for a second and ask ourselves, what do we know about.
167 00:33:33.210 --> 00:33:48.990 Haki Benita: what do we know about the URL? Okay? So first of all, we know that the URL is not unique, right? We can have multiple keys pointing to the same URL. We can have, for example, different campaigns with different short URLs
168 00:33:49.100 --> 00:33:55.800 Haki Benita: pointing to the same URL. There's no restriction in the system. You can have many keys pointing to the same URL. So it's not unique.
169 00:33:55.930 --> 00:33:57.940 Haki Benita: However, however.
170 00:33:59.780 --> 00:34:06.770 Haki Benita: if we actually look at the data, we see that we don't have a lot of duplicate long Urls right
171 00:34:06.970 --> 00:34:07.889 Haki Benita: like.
172 00:34:09.444 --> 00:34:18.389 Haki Benita: It's not likely that people will create a lot of short URLs pointing to the same URL; at the very least,
173 00:34:18.650 --> 00:34:22.639 Haki Benita: they would have different UTM parameters for the same URL.
174 00:34:22.780 --> 00:34:33.040 Haki Benita: So while it's not a restriction (you can have many keys pointing to the same URL), it's not likely, so we don't have a lot of duplicate values.
175 00:34:34.199 --> 00:34:36.219 Haki Benita: So now I want to introduce you
176 00:34:36.710 --> 00:35:00.369 Haki Benita: to what I call the ugly duckling of index types in Postgres: the hash index. Okay? And to understand how a hash index works and why it's different from a B-tree index, let's start by actually building a hash index ourselves. So imagine we have these values, A, B, C and D, and we want to index them using a hash index.
177 00:35:00.730 --> 00:35:20.800 Haki Benita: So we start by applying a hash function to each value. Postgres, in our example, has different hash functions for different types. You can see that we have hashes for text, char, arrays, even JSON types, timestamps, and so on.
178 00:35:20.930 --> 00:35:34.680 Haki Benita: In our case we have just one character, so it uses hashchar. If we actually apply this function to the values, we get the hash values. The next step is that we want to divide these
179 00:35:34.870 --> 00:35:36.829 Haki Benita: values into buckets.
180 00:35:37.030 --> 00:35:43.100 Haki Benita: So we start by dividing them into 2 buckets. Basically, we apply modulo 2
181 00:35:44.050 --> 00:36:04.600 Haki Benita: on the hash value, and then we assign each value to a bucket. So we can see that A goes to bucket 1, and B, C, and D go to bucket 0. So this is our hash index. Okay, so we have 3 hash values in bucket 0, and each hash value points to
182 00:36:04.860 --> 00:36:10.809 Haki Benita: somewhere in the table. Okay, just like we had the TIDs in the B-tree, we have
183 00:36:10.980 --> 00:36:32.230 Haki Benita: the TIDs right here in the hash index. Now, if we want to use this hash index to find some value, we do the exact same thing, but the other way around, right? So if we want to search for the value B, for example, we apply the hash function on it, we get the hash value, we apply modulo the number of buckets to get the
184 00:36:32.360 --> 00:36:54.430 Haki Benita: bucket, in this case 0, and then we go to bucket 0 and start scanning the pointers to find a matching hash. Once we find a matching hash, we can take this TID, which is a pointer to a place in the table, and go scan this row to look for matching rows. Okay, so this is how a hash index works in Postgres.
185 00:36:55.190 --> 00:37:14.639 Haki Benita: Now, if we want to create a hash index in Django, we need to use the special HashIndex from django.contrib.postgres. Okay? The reason for that is that hash is not the default index type in Postgres, so we need to explicitly say we want a hash index. Okay,
186 00:37:15.260 --> 00:37:19.239 Haki Benita: so in this case we are creating a hash index on the URL field.
187 00:37:19.770 --> 00:37:46.360 Haki Benita: and the name of this index is going to be short_url_hix. I like to use a suffix that indicates the type of the index, so when I look at execution plans, I can quickly identify the type of the index. I usually use ix for B-tree indexes, part_ix for partial indexes, hix for hash indexes, and so on. You can come up with whatever convention you want.
188 00:37:47.920 --> 00:37:48.900 Haki Benita: So
189 00:37:49.530 --> 00:38:00.809 Haki Benita: we generate the migration, we apply the migration and produce an execution plan. And we can see that Postgres is using our hash index. Okay? Now.
190 00:38:00.940 --> 00:38:01.990 Haki Benita: okay.
191 00:38:02.180 --> 00:38:18.460 Haki Benita: 1st observation: this is very, very fast. Okay, you can see that: 0.07 ms. That's very, very fast. But that's not all. If we look at the size of our hash index compared
192 00:38:18.730 --> 00:38:34.859 Haki Benita: to the B-tree index, we can see that the hash index is 30% smaller. Okay, trust me, I took a calculator, an old Casio, and I calculated the difference. It's 30% smaller. Okay, that's very, very significant. Okay.
193 00:38:35.340 --> 00:38:37.929 Haki Benita: if we put all the data in a table.
194 00:38:38.180 --> 00:38:46.570 Haki Benita: you can see that the hash index in this case is both faster and smaller.
195 00:38:46.860 --> 00:38:47.990 Haki Benita: So that's
196 00:38:48.170 --> 00:39:06.030 Haki Benita: a win-win all around. Okay, faster and smaller than the default B-tree index. Now, I did a little experiment. Okay. So what I did was, I created a hash index and a B-tree index on the key and on the URL. Okay, you can see the chart right here.
197 00:39:06.490 --> 00:39:35.660 Haki Benita: I have a hash index on the key. I have a hash index on the URL, I have a B tree on the key, and I have a B tree on the URL. And what I did is I started adding rows to the table. Okay, you can see at the bottom the bottom axis. That's the number of rows. So I started adding rows into the table until I get to a million rows. Now, every time I added rows to the table I took a snapshot of the sizes of the hash index of all the indexes, and then I put this
198 00:39:35.740 --> 00:39:39.649 Haki Benita: all the data in this chart, and we can see some
199 00:39:39.740 --> 00:39:43.597 Haki Benita: interesting things. Okay. 1st of all.
200 00:39:44.510 --> 00:39:46.580 Haki Benita: 1st of all, if you look at the.
201 00:39:47.000 --> 00:39:49.219 Haki Benita: If you look at the red line.
202 00:39:49.470 --> 00:39:52.999 Haki Benita: which is the B-tree on the URL, the big piece of text,
203 00:39:53.690 --> 00:40:18.479 Haki Benita: and the green line, which is the B-tree on the key, the short piece of text. 1st of all, you can see that both of them grow basically linearly as I add more rows to the table, right? So we can see this linear line increasing, right? As I add more rows, the size of the index increases. We can also see that the red line, the B-tree on the URL, is always larger
204 00:40:18.850 --> 00:40:21.239 Haki Benita: than the B-tree on the key, right?
205 00:40:21.780 --> 00:40:30.559 Haki Benita: So the reason for that is that the URL is a big piece of text, and the key is a short piece of text. This tells us
206 00:40:30.890 --> 00:40:33.730 Haki Benita: that the size of the B-tree is
207 00:40:33.840 --> 00:40:36.900 Haki Benita: very much affected by the size
208 00:40:37.250 --> 00:40:40.240 Haki Benita: of the column that it indexes.
209 00:40:40.380 --> 00:40:49.959 Haki Benita: So a B-tree on URL will be bigger than a B-tree on key for the same amount of rows, because a URL is bigger than a key.
210 00:40:50.270 --> 00:40:56.780 Haki Benita: So that's about the B-tree indexes. However, if we look at the hash indexes, that's the blue
211 00:40:57.900 --> 00:40:59.700 Haki Benita: and the yellow lines.
212 00:41:00.190 --> 00:41:02.260 Haki Benita: 1st of all, we can see that
213 00:41:03.480 --> 00:41:10.410 Haki Benita: the size of the hash index, as I add more rows, is not affected by the size of the value,
214 00:41:10.540 --> 00:41:18.259 Haki Benita: because the URL is big and the key is small, but as I add more rows to the table, the size of the hash index is the same. Okay.
215 00:41:18.400 --> 00:41:27.409 Haki Benita: The second thing that I can see is that in this specific case the hash index was consistently smaller
216 00:41:27.690 --> 00:41:35.050 Haki Benita: than the B-tree index on the same column. Okay. So in this case the hash index was always smaller.
217 00:41:35.520 --> 00:41:40.680 Haki Benita: Another thing that we can see in this chart is that, unlike the B-tree index, which grows linearly,
218 00:41:41.050 --> 00:41:48.299 Haki Benita: the hash index grows in like steps. Right? You can see the step, and then it's flat. Step flat.
219 00:41:48.700 --> 00:42:09.099 Haki Benita: So what's happening in a hash index is, we start adding rows to the hash index, and we have some bucket, and this bucket starts to fill up. Now, when a bucket fills up, Postgres needs to split this bucket. Now, when the bucket is split, Postgres pre-allocates
220 00:42:09.580 --> 00:42:12.570 Haki Benita: storage disk space for this bucket.
221 00:42:12.700 --> 00:42:16.419 Haki Benita: So the steps that you see is the bucket splits
222 00:42:16.540 --> 00:42:21.430 Haki Benita: where postgres allocates additional storage to split the bucket.
223 00:42:21.770 --> 00:42:22.630 Haki Benita: Right?
224 00:42:22.970 --> 00:42:25.229 Haki Benita: So this is why a hash index
225 00:42:25.420 --> 00:42:28.239 Haki Benita: grows in steps.
226 00:42:29.060 --> 00:42:35.259 Haki Benita: So hash index is ideal. When we have very few duplicates
227 00:42:35.470 --> 00:42:59.300 Haki Benita: in the rows that we want to index, and the reason for that is, if we have lots of duplicates, the values would map to the same bucket, and we won't get the benefit of a hash index. The reason that a hash index made sense in our case is that URL is mostly unique. It's almost unique. Okay, it's not unique by definition. But there's not a lot of duplicates.
228 00:42:59.680 --> 00:43:18.200 Haki Benita: We also saw that, unlike a B tree index, hash index is not affected by the size of the values that it indexes, and the reason for that is that the hash index doesn't actually include the values. It includes hash values. Okay, this is why I can index very, very big values, big strings
229 00:43:18.540 --> 00:43:40.110 Haki Benita: with a relatively small index. Okay, as we saw, a hash index under some circumstances can be both smaller and faster than a B-tree index, and the reason that a lot of people are unfamiliar with the hash index is that prior to Postgres 10, which is already pretty old, because we're now at Postgres 17,
230 00:43:40.580 --> 00:44:04.829 Haki Benita: If you went to the documentation for Hash Index, there would be like this huge warning, saying, Beware, do not use hash indexes. They are not production ready. So a lot of developers became used to not using hash indexes, but starting in postgres 10, you can definitely use hash indexes in production. They are production ready, and as we saw, they can be very, very good under some circumstances.
231 00:44:06.160 --> 00:44:12.890 Haki Benita: When we're talking about hash indexes, it is very important to also know the restrictions of hash indexes. 1st of all, a hash index
232 00:44:14.290 --> 00:44:32.920 Haki Benita: cannot be used to enforce uniqueness. You cannot create a unique hash index, and the reason is that a hash index does not contain the actual values, just hash values. And technically, you can have multiple different values producing the exact same hash value.
233 00:44:33.090 --> 00:44:43.399 Haki Benita: So you cannot create a unique hash index. However, okay, and that's the comment at the bottom, we can talk about it later if you want, you can enforce uniqueness
234 00:44:43.680 --> 00:44:47.209 Haki Benita: with a hash index using an exclusion constraint.
235 00:44:47.440 --> 00:44:56.589 Haki Benita: Okay, next, we can't have a composite hash index. We can't have a hash index on multiple columns. Okay?
236 00:44:57.410 --> 00:45:02.989 Haki Benita: And we can't use a hash index for sorting and range searches, because, once again,
237 00:45:03.280 --> 00:45:10.940 Haki Benita: hash index does not contain the actual values. Just the hash values right? So I can't use a hash index for things like.
238 00:45:11.390 --> 00:45:17.379 Haki Benita: you know, BETWEEN, greater than, less than, and so on. Just equality.
239 00:45:18.540 --> 00:45:24.421 Haki Benita: So, quick recap, just 4 more slides, I promise. Okay,
240 00:45:26.090 --> 00:45:34.610 Haki Benita: when to use indexes. So remember, indexes can make queries faster. We saw that in all of our examples.
241 00:45:34.650 --> 00:45:56.340 Haki Benita: using an index made the query faster. However, they're not free; they come at a cost. You need to maintain the index, and this index maintenance happens when you insert, when you update, and when you delete. So the more indexes you create, the faster your queries are, but the slower every other operation is,
242 00:45:56.500 --> 00:46:18.380 Haki Benita: okay. Another thing to consider, and this is often overlooked: indexes can be very, very big. They consume a lot of disk space. When you go back to your databases after this talk, please go do \di+ and look at the sizes of your indexes. I think that if you never looked at the size of your indexes,
243 00:46:18.620 --> 00:46:23.349 Haki Benita: You're going to be very much surprised at what you're going to find.
244 00:46:24.180 --> 00:46:41.909 Haki Benita: and finally using an index is not always best. If you have a query that needs to access a large portion of the table. Sometimes it doesn't make sense to use an index for that. Okay, there's no magic number, but, you know.
245 00:46:42.190 --> 00:46:43.480 Haki Benita: keep that in mind.
246 00:46:44.710 --> 00:46:55.220 Haki Benita: So we talked about index types and features. We talked about partial indexes, inclusive indexes, and we talked about the hash index.
247 00:46:55.420 --> 00:47:07.439 Haki Benita: We talked a little bit about how to evaluate performance. I don't know if you noticed, but throughout this presentation we went through the same process over and over again. We start by
248 00:47:07.600 --> 00:47:25.639 Haki Benita: executing some query with EXPLAIN ANALYZE to get the timing with no indexes. This is basically establishing a baseline, right? And then we start experimenting with different types of indexes. So usually we start with a B-tree. We take a measure of the time using EXPLAIN ANALYZE,
249 00:47:25.640 --> 00:47:40.620 Haki Benita: and then we take the size of the index. We put it all in a nice table. We keep experimenting. And once you have all the data organized like that, it's a lot easier to reach a decision on what is the best indexing approach
250 00:47:40.630 --> 00:47:42.499 Haki Benita: for your specific use case.
251 00:47:42.560 --> 00:47:53.119 Haki Benita: And also, hopefully you remember that index performance is not just about speed. As we saw, we can get significant
252 00:47:53.660 --> 00:47:57.540 Haki Benita: disk space reductions
253 00:47:57.600 --> 00:48:09.329 Haki Benita: with a very, very small price in speed. Sometimes it makes sense to make this compromise. We also, throughout this talk, saw how to use EXPLAIN,
254 00:48:09.360 --> 00:48:31.259 Haki Benita: how to use EXPLAIN ANALYZE, how to debug SQL in Django, and we also saw a lot of execution plans. I don't know if you noticed, but if you've never seen execution plans before, hopefully when you go back to your system you'll start doing EXPLAIN ANALYZE on some of the queries you run a lot, and you'll get to actually understand what the database is doing. Now,
255 00:48:31.560 --> 00:48:45.659 Haki Benita: in this talk I talked only about inclusive indexes, partial indexes, and hash index, but, in fact, there are many, many different other types of indexes that are exotic and very, very cool. We have
256 00:48:46.330 --> 00:48:56.900 Haki Benita: BRIN indexes. We have function-based indexes, and we have a lot of different flavors of things that we can do. And you can check out this
257 00:48:57.300 --> 00:49:04.960 Haki Benita: class, 3 hours packed with magic, for your benefit. And
258 00:49:05.810 --> 00:49:13.720 Haki Benita: finally check me out in all of these places, and I'm happy to take questions or discuss whatever you want.
259 00:49:19.490 --> 00:49:22.113 Gabor Szabo: Whoa, thank you.
260 00:49:23.750 --> 00:49:26.585 Gabor Szabo: Because, yeah.
261 00:49:27.400 --> 00:49:28.630 Haki Benita: Hectic.
262 00:49:30.335 --> 00:49:35.410 Gabor Szabo: Yeah, this is not a question. Haki's article on hash indexes is truly excellent.
263 00:49:35.520 --> 00:49:42.589 Gabor Szabo: I believe it remains one of the top search results for anyone looking for resources on hash indexes.
264 00:49:42.760 --> 00:49:47.639 Haki Benita: It's true, it's true. This is one of the top searches for hash index in postgres.
265 00:49:47.910 --> 00:49:48.340 Gabor Szabo: Yeah.
266 00:49:48.340 --> 00:49:53.060 Haki Benita: Yeah, I managed to catch this trend very, very early on.
267 00:49:54.515 --> 00:49:55.540 Gabor Szabo: Okay.
268 00:49:55.790 --> 00:49:56.270 Haki Benita: Mom.
269 00:49:56.270 --> 00:50:01.189 Gabor Szabo: Comments, questions before we. We close this session.
270 00:50:02.340 --> 00:50:05.000 Gabor Szabo: We know where where to find you.
271 00:50:05.160 --> 00:50:07.829 Gabor Szabo: We have the. We'll have the link.
272 00:50:08.320 --> 00:50:16.320 Gabor Szabo: You can add the links to the post of the video as well, so people can find it easily.
273 00:50:17.100 --> 00:50:19.660 Gabor Szabo: and any comments.
274 00:50:19.660 --> 00:50:20.020 Haki Benita: Okay.
275 00:50:20.020 --> 00:50:21.859 Gabor Szabo: Questions, apparently not.
276 00:50:21.860 --> 00:50:24.780 Haki Benita: Yeah, I want to thank you, Gabor, for hosting this meeting.
277 00:50:24.780 --> 00:50:25.650 Gabor Szabo: It was excellent.
278 00:50:26.146 --> 00:50:27.139 Haki Benita: Meet up!
279 00:50:27.140 --> 00:50:32.660 Gabor Szabo: Yeah. Well, yeah. So thank you very much for this presentation.
280 00:50:32.770 --> 00:50:41.470 Gabor Szabo: If anyone has questions, then we'll see how to find Haki later on on this slide, and then we'll put it under the video.
281 00:50:42.020 --> 00:50:52.750 Gabor Szabo: Thank you for supporting us. Thank you for being here. Thank you very much for giving the presentation. Please like the video and follow the channel. Yeah.
282 00:50:53.020 --> 00:51:10.139 Gabor Szabo: And if you would like to give any presentation, you're welcome to contact me as well, and we'll see how we can schedule a presentation, at what time, and so on. So thank you very much, and
283 00:51:10.430 --> 00:51:15.029 Gabor Szabo: see you at the next meeting next video, whatever.
284 00:51:15.400 --> 00:51:16.869 Gabor Szabo: Thank you. Bye, bye.
285 00:51:16.870 --> 00:51:18.830 Haki Benita: Thank you very much. Everyone. Good night.
The Reference Model for disease progression was initially a diabetes model. It used the approach of assembling models and validating them against different populations from clinical trials.
The model performs simulation at the individual level while modeling entire populations using the MIcro-Simulation Tool (MIST), employing High Performance Computing (HPC), and using machine learning techniques to combine models.
The Reference Model technology was transformed to model COVID-19 near the start of the epidemic. The model is now composed of multiple models from multiple contributors that represent different phenomena: It includes infectiousness models, transmission models, human response / behavior models, hospitalization models, mortality models, and observation models. Some of those models were calculated at different scales including cell scale, organ scale, individual scale, and population scale.
The Reference Model is therefore the first known multi-scale ensemble model for COVID-19. This project is ongoing and this presentation is constantly updated for each venue. To access the most recent publication please use this link
Jacob Barhak is an independent Computational Disease Modeler focusing on machine comprehension of clinical data. The Reference Model for disease progression is patented technology that was self-developed by Dr. Barhak. The Reference Model is the most validated diabetes model known worldwide and also the first COVID-19 multi-scale ensemble model. His efforts also include standardizing clinical data through ClinicalUnitMapping.com and he is the developer of the MIcro-Simulation Tool (MIST). Dr. Barhak has a diverse international background in engineering and computing science. He is active within the Python community and organizes the Austin Evening of Python Coding meetup. For additional information please visit

Lessons Learned from Modeling COVID-19: Steps to Take at the Start of the Next Pandemic[v1]
1 00:00:02.260 --> 00:00:20.330 Gabor Szabo: Hello, and welcome to the Code Maven Channel on Youtube and our meeting. Thank you to everyone who has arrived at this meeting, and especially to Jacob, who is going to give the presentation. If you are unfamiliar with the channel, then
2 00:00:20.500 --> 00:00:48.179 Gabor Szabo: we have these live presentations, meetings with live presentations, mostly about stuff related to Python and Rust programming, and also about git and version control. My name is Gabor. I'm the host of this. I teach Python and Rust at corporations. And I also help companies to get started with
3 00:00:48.960 --> 00:00:52.559 Gabor Szabo: test automation and continuous integration,
4 00:00:52.630 --> 00:00:56.399 Gabor Szabo: that sort of area.
5 00:00:57.730 --> 00:01:04.399 Gabor Szabo: And we have these meetings so people can share their knowledge with each other.
6 00:01:04.540 --> 00:01:11.199 Gabor Szabo: So thank you for arriving. And before I let Jacob start talking about
7 00:01:11.590 --> 00:01:20.060 Gabor Szabo: himself and introducing himself, please like the video and follow the channel. I always forget to say this, so now I remembered.
8 00:01:20.200 --> 00:01:23.540 Gabor Szabo: So thank you, and it's yours.
9 00:01:24.210 --> 00:01:54.160 Jacob Barhak: Okay, we're going to try to make it as much of a conversation as possible. My name is Jacob Barhak. I've been developing disease models since 2006, so it's about 19 years now, a bit less than 19 years. And I have technology for disease modeling. Now, disease modeling means computational disease modeling, and I use a lot of Python. This is actually when I was introduced to Python, in 2006, and all of this project was made in Python.
10 00:01:54.761 --> 00:02:10.820 Jacob Barhak: It requires a lot of computing power, and the idea is to be able to explain diseases. So I started with diabetes. I was actually hired by a university as a programmer to write disease modeling software.
11 00:02:11.030 --> 00:02:38.280 Jacob Barhak: I'm still using an offshoot of the same software, like 19 years later. Now it's called MIST, the MIcro-Simulation Tool, and it allows you to simulate many, many individuals going through a disease, and you define what the disease is. I started with diabetes, and I actually got to the point that I have one of the most sophisticated diabetes models worldwide,
12 00:02:38.280 --> 00:02:44.609 Jacob Barhak: and this is patented technology. And at the end, you will see I have a conflict of interest statement. Because
13 00:02:44.610 --> 00:03:10.790 Jacob Barhak: I believe this technology is worth a lot. So whatever I'm telling you, take it with a grain of salt. Everything that I promise you, double check. The nice thing about this technology is that it does the double checking for you, to some degree; we'll explain it later. It uses some AI ideas and technologies to actually implement what's happening here. But it's not the AI that you're familiar with. It's a mix.
14 00:03:11.050 --> 00:03:13.479 Jacob Barhak: And this is why it's patented. So
15 00:03:14.070 --> 00:03:24.709 Jacob Barhak: let me explain what happens with this technology. I started with diabetes in 2006, and in 2020
16 00:03:24.830 --> 00:03:29.980 Jacob Barhak: COVID arrived to the US. I was in the US, and I
17 00:03:30.140 --> 00:03:54.935 Jacob Barhak: started migrating my technology towards modeling COVID. And now I can explain COVID. But let me tell you what explain means. Oh, by the way, this presentation was given in many places; you can follow up on how it changed, because it did change. All of those things you can download from the links at the end. You'll have a QR code; actually, I'll show it to you now. But
18 00:03:55.550 --> 00:04:16.769 Jacob Barhak: you will have a QR code so you can download this, or actually view it. You'll need a strong machine to actually view it, because it's a huge file, like a quarter of a gig, to download and view in your browser. But it has everything, including results, and it is interactive. It's an HTML file. I'm using a Python technology to actually do this.
19 00:04:16.769 --> 00:04:29.960 Jacob Barhak: But it's less about all of this and more about disease models. Later on you can ask me about everything else, so you can download and see how things changed, even in my perspective. But now I believe things have been stable for the last
20 00:04:30.390 --> 00:04:38.560 Jacob Barhak: approximately 2 years, so I'm pretty sure I can explain, at least in the US, what happened with COVID. And explain means:
21 00:04:39.030 --> 00:04:51.659 Jacob Barhak: Let's take a step back before I show the model and explain it. Think about it: we might have more pandemics in the future. We probably will; we actually have them all the time, we have many diseases going on.
22 00:04:51.810 --> 00:05:10.500 Jacob Barhak: But can we really explain what's happening? I was amazed, when I started working with the medical people, at how little they know about some of the things going on, and the fact is, they are overwhelmed with data. The fact that they can remember and do something good is miraculous and
23 00:05:10.550 --> 00:05:36.520 Jacob Barhak: speaks a lot about their profession. They are doing the best they can, but they cannot even memorize the medical papers coming out; every 6 seconds a new medical paper comes out. There's no way one doctor will remember it all. And I'm not talking about all those medical databases, huge amounts of data that are being accumulated by bodies like the National Institutes of Health.
24 00:05:37.460 --> 00:05:59.149 Jacob Barhak: All of this means that we now need computers to help us crunch all this and give us good results and good explanations about what's going on, because the way we are dealing with medicine now will change with the data. It's already changing. And this model and many other tools related to it will change it. So
25 00:05:59.300 --> 00:06:05.260 Jacob Barhak: now let's go back to the model and explain how I can explain things. So
26 00:06:06.270 --> 00:06:11.290 Jacob Barhak: the reference model for disease. Progression is kind of a statistical model.
27 00:06:11.690 --> 00:06:31.800 Jacob Barhak: What it does, it says: each disease has states. It's a state transition model, where you can be either no COVID or COVID infected. You can recover or die from COVID. Notice that there is no arrow back from recovered to infected, because I'm modeling the beginning of the disease, April 2020.
28 00:06:32.520 --> 00:06:43.590 Jacob Barhak: Because the idea is that for the next disease we want to have a tool that will explain it to us in reasonable time, and I believe this is one of the tools that can help do that. So
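The state-transition idea can be sketched as a tiny micro-simulation. This is a minimal sketch only: the daily transition probabilities below are invented for illustration, not the Reference Model's fitted values, and the real model layers many competing sub-models on top of this skeleton.

```python
import random

# Hypothetical daily transition probabilities -- illustrative numbers,
# NOT fitted values from the Reference Model.
P_INFECT = 0.02   # no_covid -> infected
P_RECOVER = 0.10  # infected -> recovered
P_DIE = 0.005     # infected -> died

def step(state, rng):
    if state == "no_covid":
        return "infected" if rng.random() < P_INFECT else "no_covid"
    if state == "infected":
        r = rng.random()
        if r < P_DIE:
            return "died"
        if r < P_DIE + P_RECOVER:
            return "recovered"
        return "infected"
    return state  # recovered/died are absorbing: no arrow back to infected

def simulate(n_people, n_days, seed=0):
    rng = random.Random(seed)
    states = ["no_covid"] * n_people
    for _ in range(n_days):
        states = [step(s, rng) for s in states]
    return {s: states.count(s) for s in set(states)}

print(simulate(10_000, 60))
```

Each simulated person walks the arrows of the state diagram day by day; counting the final states gives the population-level numbers that get compared against the reported data.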
29 00:06:43.930 --> 00:06:44.650 Jacob Barhak: oh.
30 00:06:45.080 --> 00:07:07.769 Jacob Barhak: I'm trying to extract, from the beginning, data that people accumulated. I'm using the COVID Tracking Project data. They allow me to use it even for commercial purposes, and you can actually go and track and see, for each state in the US, how many people got infected, how many people died; they kept a pretty good record about it.
31 00:07:08.000 --> 00:07:24.509 Jacob Barhak: And later on there were other organizations that took over, but they were, I believe, the best at the start. So I'm using that data, and now I'm trying to get a model that explains all those numbers that they report,
32 00:07:25.000 --> 00:07:25.830 Jacob Barhak: so
33 00:07:26.390 --> 00:07:32.730 Jacob Barhak: To do this, I assume that there are those states, and this is the beginning of the pandemic, so there is no reinfection,
34 00:07:33.160 --> 00:07:41.250 Jacob Barhak: and I'm trying to match their numbers. How do I try to match them? Each arrow has several words above it. Each word...
35 00:07:41.250 --> 00:07:44.160 Gabor Szabo: Wait a second, Jacob. So- so
36 00:07:44.760 --> 00:07:54.509 Gabor Szabo: Jim is writing that the screen share is not showing, if that is wanted. I see on the screen these boxes of No COVID.
37 00:07:54.510 --> 00:08:08.830 Jacob Barhak: No COVID, COVID infected, possibly hospitalized. Jim, look, if you have the option of choosing which screen you see, sometimes you can choose which shared screen you see. I can stop the share and share it again, if it's okay with you guys.
38 00:08:09.000 --> 00:08:11.570 Jacob Barhak: Or, Jim, did you find the share?
39 00:08:11.980 --> 00:08:15.290 Jim Mccormack: I'll look, Jacob, don't let me take you out of flow. Go, please proceed. Thank you.
40 00:08:15.290 --> 00:08:22.830 Jacob Barhak: So look at the link I sent; you can actually bring up the presentation and follow along. I'm on the second tab, called Introduction.
41 00:08:23.460 --> 00:08:33.630 Jacob Barhak: It's interactive on your machine. You should be able to download it; if you have good Wi-Fi, just download it to your machine, and you can follow me there if you don't see the presentation.
42 00:08:33.730 --> 00:08:34.389 Jim Mccormack: Got it.
43 00:08:34.390 --> 00:08:35.260 Jim Mccormack: It's loading.
44 00:08:36.049 --> 00:08:38.569 Jacob Barhak: Yeah, I know it takes a minute to load.
45 00:08:39.282 --> 00:09:00.519 Jacob Barhak: It's huge, but it has everything encapsulated. Part of the reason is to keep it as reproducible as possible; at the end you'll see a reproducibility section. I don't give away all the code, but I do keep track of everything that I noted. Like, you see all those boxes here? Those are all the references where you can actually extract the data.
46 00:09:00.599 --> 00:09:14.679 Jacob Barhak: And some of the links actually became defunct; I found other links, so I can show you where I got the data from, and make sure that people can actually try to reconstruct this as much as possible, because we have a reproducibility crisis in science.
47 00:09:14.909 --> 00:09:23.569 Jacob Barhak: Anyway, back to the boxes. On top of those boxes you'll see words like infectiousness, transmission, response, recovery, and mortality.
48 00:09:24.019 --> 00:09:29.869 Jacob Barhak: Each one of those represents not one model, but many models.
49 00:09:30.449 --> 00:09:42.189 Jacob Barhak: The technology that I'm using is called an ensemble model. An ensemble is like a choir in music. Well, ensemble models are very similar: you don't have one model, you have many of them.
50 00:09:42.329 --> 00:09:52.189 Jacob Barhak: So I have many models for infectiousness, many models for transmission, many models for response, many models for recovery and mortality, and hospitalization, you'll see later.
51 00:09:52.369 --> 00:10:21.229 Jacob Barhak: But on top of it, I actually have an observer looking at all of this, saying, you know, your numbers are wrong, so you have to actually correct them. Actually, you have multiple observer models, each one seeing something different, telling you something different about those numbers, and you incorporate all of those. And now you have many, many models, and you have to run them all, and it takes quite a bit of computing power. Later on I'll show you; I have a computer here still crunching data, because I'm still working on this and making sure that my numbers are okay.
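The ensemble idea, stripped to its bare mechanics, is a weighted mixture of candidate models. This is a minimal sketch with made-up candidate models and made-up weights; the actual Reference Model learns such weights by validating the combinations against population data.

```python
# Three hypothetical candidate infectiousness models, each predicting the
# probability a person is infectious on a given day after infection.
def model_constant(day): return 0.5 if day <= 10 else 0.0
def model_linear(day):   return max(0.0, 1.0 - day / 14)
def model_short(day):    return 0.8 if day <= 5 else 0.0

models = [model_constant, model_linear, model_short]

def ensemble(day, weights):
    # Weighted mixture: weights express how much we believe each candidate
    # after comparing its predictions to observed data.
    return sum(w * m(day) for w, m in zip(weights, models))

print(round(ensemble(3, [0.2, 0.5, 0.3]), 3))  # → 0.733
```

Rather than betting on one candidate, the ensemble keeps all of them and lets the data decide how much each one contributes, which is what makes the "observer says your numbers are wrong, correct them" loop possible.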
52 00:10:21.569 --> 00:10:22.439 Jacob Barhak: So
53 00:10:24.309 --> 00:10:41.049 Jacob Barhak: All of this takes a lot of computing power, as you will see later on. This computation took about 3 years of computation on a single CPU. The ones I'm working on now will take half a year on a big 24-core machine with 32 threads. So
54 00:10:42.299 --> 00:10:54.739 Jacob Barhak: however, when you run it in the cloud it takes many, many processors, because this takes a lot of computing power, just like AI, because it uses AI technology; we're gonna talk about it in a second.
55 00:10:55.605 --> 00:11:18.749 Jacob Barhak: So I take all those models. I also take information from the cover tracking project about what happened in each State like numbers of and and you see some of those numbers later. I have information from us. Census about each State States are not the same. They different sizes, different population, density, different age age
56 00:11:19.109 --> 00:11:39.639 Jacob Barhak: curves population curves in school. Also, there's information about number of interactions and even the weather. I include the weather as part of the simulations, and later we'll ask some questions. But let's say, show you what the models look like and why they are. They are the way they are. So let's
57 00:11:39.979 --> 00:11:44.819 Jacob Barhak: explain one motivation why it is important to have a model like this.
58 00:11:45.769 --> 00:12:10.249 Jacob Barhak: The DHS, the Department of Homeland Security in the US, during COVID kept a document called the Master Question List about COVID-19, and in a very, very organized way they say: this is what we know, or we think we know, and this is what we want to know. So they had a master question list about COVID, about things they didn't know.
59 00:12:10.729 --> 00:12:24.479 Jacob Barhak: On the 26th of May 2020, in that version of that document, which evolved throughout time (by the way, you can actually download it; later in the presentation you can check it out),
60 00:12:25.089 --> 00:12:37.069 Jacob Barhak: the versions change, but it still exists somewhere; all those versions, you can still find them. The Department of Homeland Security kept very, very meticulous records, which is very good. I give them
61 00:12:37.339 --> 00:12:47.179 Jacob Barhak: a good grade, because this is one of the most important documents you can extract information from. They did a very good job. But
62 00:12:47.839 --> 00:12:55.989 Jacob Barhak: even with all this, on the 26th of May they were still asking the question: what is the average infectious period during which an individual can transmit the disease?
63 00:12:57.259 --> 00:13:07.479 Jacob Barhak: Why is it important? Think about it. You are now the government, and you have to decide: do you have a lockdown, or how much time do you keep people in curfew?
64 00:13:08.009 --> 00:13:12.569 Jacob Barhak: Or even if they're sick, how much time do you keep them from roaming around?
65 00:13:14.069 --> 00:13:16.979 Jacob Barhak: They didn't know; they kind of admitted it.
66 00:13:18.669 --> 00:13:19.434 Jacob Barhak: So
67 00:13:21.569 --> 00:13:29.999 Jacob Barhak: Since the Department of Homeland Security didn't know those things, and they asked,
68 00:13:30.599 --> 00:13:33.819 Jacob Barhak: at this point I started looking for answers.
69 00:13:34.069 --> 00:13:57.659 Jacob Barhak: And actually, this happened later on in the pandemic, because some of those models came in like a year later, but some of them existed even before. So you can extract some information, for example from this paper from Bai Lee. Let me explain what this curve means. This is the infectiousness curve; it's relative infectiousness. It tells you how much virus you shed, meaning how much virus your body
70 00:13:57.969 --> 00:14:02.709 Jacob Barhak: gives away compared to your max; your max is one.
71 00:14:03.449 --> 00:14:26.721 Jacob Barhak: So, how much virus your body generates, and this is the day. So at day 0, almost nothing: you just got infected, you don't generate the virus, or at least not enough to actually spread it around; the number is so low you don't see it. Then for the next 2 days you don't, and then it starts growing, and then it goes away. In this case, in this paper, they actually took the information from
72 00:14:27.899 --> 00:14:40.199 Jacob Barhak: I took this information and manipulated it a little bit, because it was not exactly like this. But some other models actually say: this is how we measured it. Notice how different the models are.
73 00:14:40.199 --> 00:14:59.819 Jacob Barhak: Here's another one, actually, in this paper. They had like 5 or 6 of those models, I don't remember, but each one looked different. So I took 2 from that publication. And think about it: each person also behaves differently. When I run the simulations, I run the simulation for each individual, so each individual may be different from another,
74 00:14:59.869 --> 00:15:06.769 Jacob Barhak: but in this situation I assume that all of them have the same infectiousness curve for the entire cohort, and we are looking at the average.
75 00:15:06.909 --> 00:15:11.699 Jacob Barhak: We can do the simulations differently, but here we're looking for the average curve. So
76 00:15:11.909 --> 00:15:36.589 Jacob Barhak: even if I say, oh, this is like that, or this person behaves like this or like that. Some of those papers came in later in the pandemic. But still, once you have this information, or even if you don't have this information but you have assumptions from other diseases (say, oh, you know, this disease looks like that one, so maybe take the infectiousness curve from that disease), you can plug it in.
77 00:15:37.329 --> 00:15:56.959 Jacob Barhak: And what does the model do? It uses AI techniques. Everyone is probably familiar with the optimization technique called gradient descent. Using gradient descent, after running all of those simulations, it will find the optimal one from all of the models that you plugged in.
78 00:15:57.279 --> 00:16:06.979 Jacob Barhak: Let me explain how it starts. At the beginning you don't know what curve is dominant, or what is actually correct. What you do is, you assume
79 00:16:07.409 --> 00:16:23.439 Jacob Barhak: all models are the same, meaning: if 5 people come and tell you a story, you believe them all the same way, without knowing anything better. So you just average whatever they're saying. This is the average of all of the models that you saw before.
80 00:16:24.549 --> 00:16:25.659 Jacob Barhak: Now.
81 00:16:26.499 --> 00:16:47.699 Jacob Barhak: During simulation, you actually run simulations and learn: this is better, this is worse. And little by little you start optimizing, and after many, many iterations (it takes a lot of computing power), this is basically how AI models train, using a very similar technique. So I'm using the same technique over some other medium, which is not a neural network as you know it.
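The re-weighting Jacob describes can be sketched in a few lines of Python. This is a hypothetical toy, not the actual Reference Model code: three made-up model curves start with equal weights, and gradient descent shifts the weight toward the mixture that best matches a made-up "observed" curve.

```python
import numpy as np

# Three hypothetical model predictions for the same 10 days (made-up numbers).
models = np.array([
    [0.0, 0.1, 0.4, 0.9, 1.0, 0.8, 0.5, 0.3, 0.1, 0.0],  # a longer curve
    [0.0, 0.5, 1.0, 0.6, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0],  # a shorter curve
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],  # a flat curve
])
# Made-up "observed" data that the ensemble should match.
observed = np.array([0.0, 0.2, 0.5, 0.8, 0.9, 0.7, 0.4, 0.2, 0.1, 0.0])

weights = np.full(3, 1 / 3)  # start by believing every model equally
lr = 0.05
for _ in range(2000):
    ensemble = weights @ models                # weighted average of the models
    grad = models @ (ensemble - observed)      # gradient of the squared error w.r.t. weights
    weights -= lr * grad                       # gradient descent step
    weights = np.clip(weights, 0, None)
    weights /= weights.sum()                   # keep the weights a valid mixture

print(weights.round(3))
```

After optimization the first (longer) curve carries most of the weight, mirroring how the poorly matching models in the talk "disappear" as their weights shrink.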
82 00:16:47.849 --> 00:16:58.359 Jacob Barhak: It's somewhat similar, but not exactly. It's a different thing, state transitions, and it actually runs a little bit differently. I can go into details if you're interested,
83 00:16:58.589 --> 00:17:15.439 Jacob Barhak: and then you end up with something that looks like this. This is the answer for the Department of Homeland Security. Hopefully in the future, you will see, they will use the technology and be able to extract an answer fairly quickly,
84 00:17:15.599 --> 00:17:29.799 Jacob Barhak: and then not go a long time into the pandemic without knowing. By the way, this may change with the numbers over time, and of course there's bias, variance, and stuff like this. But at least at the beginning you have a basic idea of what's happening.
85 00:17:29.799 --> 00:17:49.269 Jacob Barhak: And this is what was happening according to all the numbers and all the assumptions that you will see coming in later on. Remember, a model is not truth; it's an assumption. And what it does is take all this ensemble and kind of put it into place: this is the most reasonable set of assumptions, the one that matches the data best.
86 00:17:50.259 --> 00:17:52.609 Jacob Barhak: Does anyone have any questions, or can I proceed?
87 00:17:55.109 --> 00:17:56.389 Jacob Barhak: Okay, I'll proceed.
88 00:17:56.839 --> 00:18:12.969 Jacob Barhak: Here's about transmission. I'll make it interesting for you. It's one thing being infectious; but what if, in April 2020, I have COVID, and let's say I meet one of you for 15 minutes?
89 00:18:13.269 --> 00:18:25.959 Jacob Barhak: What's the chance of you getting COVID from me? Don't answer; I'll tell you. I've asked this question many times, to many people. Many people tell me 70% per encounter of, say, 15 minutes.
90 00:18:26.129 --> 00:18:34.129 Jacob Barhak: And then I tell them it's lower. Then they say 50%, and I tell them it's lower. Then they get to 10%, and I tell them it's still lower.
91 00:18:34.409 --> 00:18:39.339 Jacob Barhak: And then they end up amazed that it's only between 1 and 2%,
92 00:18:39.779 --> 00:19:00.269 Jacob Barhak: because what drives COVID crazy is not that transmission happens immediately; if the virus were that infectious, everyone would be infected in no time. What drives it is the fact that we have so many interactions amongst ourselves, and we have those every day, with many, many people.
93 00:19:00.399 --> 00:19:03.109 Jacob Barhak: So if at some point I
94 00:19:03.429 --> 00:19:09.859 Jacob Barhak: have an interaction with one person, the chance is very low. But since I meet many people for many days,
95 00:19:10.309 --> 00:19:21.119 Jacob Barhak: this is what drives it: the chance of me transmitting at 1% over 10 days with many people is much, much higher than with one person for 15 minutes.
96 00:19:21.120 --> 00:19:26.539 Gabor Szabo: Doesn't it change depending on the length of the time you spend with the person?
97 00:19:26.540 --> 00:19:42.980 Jacob Barhak: Yeah, you can argue that. Think about a 15-minute encounter as an average. If you spend twice as long with the person, then basically it's not twice the probability, but it's very close to twice the probability.
98 00:19:42.980 --> 00:19:50.709 Gabor Szabo: But what I'm saying is that, let's say, you meet a hundred people for 1 minute versus one person for a hundred minutes.
99 00:19:51.370 --> 00:19:58.529 Jacob Barhak: Yeah. So all of those things scale kind of differently. It's, you know, like a Bernoulli trial.
100 00:19:59.700 --> 00:20:18.979 Jacob Barhak: It looks like a Bernoulli trial that you run many, many times: you have a biased coin, and each period of time, let's say 15 minutes, you flip one of those coins. So according to this, you can actually calculate an approximation to the function.
101 00:20:21.220 --> 00:20:28.770 Jacob Barhak: So it's simple statistics that you learned at school, but now it's actually very useful in those cases.
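The coin-flipping argument reduces to textbook probability: if each encounter is an independent Bernoulli trial with success probability p, the chance of at least one transmission in n encounters is 1 - (1 - p)^n. A small sketch with a hypothetical per-encounter probability in the 1-2% range the talk mentions:

```python
def chance_of_at_least_one(p, n):
    """Probability that at least one of n independent Bernoulli trials succeeds."""
    return 1 - (1 - p) ** n

p = 0.015  # hypothetical transmission chance per 15-minute encounter

print(f"{chance_of_at_least_one(p, 1):.1%}")    # a single encounter
print(f"{chance_of_at_least_one(p, 2):.1%}")    # roughly twice, as noted above
print(f"{chance_of_at_least_one(p, 100):.1%}")  # many encounters over many days
```

A hundred such encounters push the overall chance above 75%, which is why a tiny per-encounter probability can still drive a pandemic.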
102 00:20:28.770 --> 00:20:52.959 Jacob Barhak: We tried in the past to do it in the Markov model. There are multiple ways to calculate it, and some people come up with different functions. It doesn't matter, really, because all of those are assumptions; all of them are incorrect to some degree. The question is which one is most plausible under all of the things that you know, and for this you need a lot of assumptions and a lot of computing power. And this is what I saw.
103 00:20:53.480 --> 00:21:01.439 Jacob Barhak: So even if it's not a Bernoulli trial, and someone else comes up with a different infectiousness function, I can plug it in and see what happens.
104 00:21:01.720 --> 00:21:03.770 Jacob Barhak: You understand what I mean.
105 00:21:04.130 --> 00:21:27.249 Jacob Barhak: But in this situation I took multiple functions that took into account the individual encounter, the population density of the state, some random constant, just plugged in to make some noise (sometimes it's helpful in some models, at least to rule out some things). And I also added something interesting: temperature. Here, let me ask you:
106 00:21:27.270 --> 00:21:41.540 Jacob Barhak: in colder states or warmer states, is the transmission higher? Think about it. If you're in New York or Michigan, versus Texas or Florida, where are the chances of you actually transmitting the disease higher?
107 00:21:42.560 --> 00:21:54.420 Jacob Barhak: Think about it. It matters, and we'll see the answer at the end. Think about the answer to yourselves; later on you'll get the answer from me when I show the results.
108 00:21:56.150 --> 00:22:13.710 Jacob Barhak: So I also took into account response models. With pandemics, it also matters how people behave. If you are afraid of the pandemic, and you stay at home and don't go anywhere, your chances of transmitting or getting the disease are much, much lower.
109 00:22:14.640 --> 00:22:15.860 Jacob Barhak: But then.
110 00:22:16.150 --> 00:22:40.989 Jacob Barhak: if you run around and ignore the disease, like many people did (which is model number 3 here, that says: oh, I don't care about COVID), then eventually you get COVID, and then you're at home and don't see anyone because you're at home. Or even worse, you are not at home, and you just ignore COVID and go around; it's worse. So different people behave differently; actually, different parts of society behave differently, just like with the infectiousness curve.
111 00:22:41.140 --> 00:22:42.859 Jacob Barhak: Each one behaves differently.
112 00:22:42.980 --> 00:22:54.039 Jacob Barhak: So now you have different models. I took 2 models from Apple mobility, with different variations on them. Apple mobility data says how many people looked at their phones
113 00:22:54.370 --> 00:23:03.850 Jacob Barhak: and pressed the map button. This indicates they want to go somewhere. It doesn't mean they went, but this is kind of an estimate of how much mobility they had.
114 00:23:03.970 --> 00:23:07.029 Jacob Barhak: So I used those as a base.
115 00:23:07.618 --> 00:23:13.241 Jacob Barhak: Also, later on came Eric Ferguson.
116 00:23:13.920 --> 00:23:20.750 Jacob Barhak: I hope I didn't butcher the name. He's from Montclair University. He did a study of US states
117 00:23:21.170 --> 00:23:28.679 Jacob Barhak: and what their shutdown orders were in the states. Each state
118 00:23:29.140 --> 00:23:43.180 Jacob Barhak: decided differently. So now you incorporate all of this into the model and say whether, you know, non-essential shops were closed, schools, stay-at-home orders, and so on and so forth,
119 00:23:43.360 --> 00:24:08.491 Jacob Barhak: and at different levels of compliance. I entered it as a formula into the model. It's a little bit more complex; I'm just giving you the idea of what's happening here. He later on published a good version; I used an older version, and I state exactly what the differences are from the newest version. But
120 00:24:09.370 --> 00:24:15.069 Jacob Barhak: this is how I now have different models of how the states behave,
121 00:24:15.180 --> 00:24:17.640 Jacob Barhak: and each State behaves differently, of course.
122 00:24:17.970 --> 00:24:28.839 Jacob Barhak: but I also have different models of those, and all of those are part of the mix of models that are playing around. Think about it: all of those models are roaming around and doing things.
123 00:24:29.090 --> 00:24:58.840 Jacob Barhak: Then came in, and this is fairly recent, a hospitalization model. I didn't have a hospitalization model. Meaning, people are in hospital: the number you count of people being infected is not a very good estimate, because, you know, you don't test everyone, and so on and so forth; there are errors there. But people who ended up in the hospital, you know they were hospitalized. So if you have a hospitalization model,
124 00:24:58.840 --> 00:25:05.610 Jacob Barhak: it kind of helps you out. Now, interestingly enough, not all states counted hospitalizations well,
125 00:25:05.690 --> 00:25:15.500 Jacob Barhak: but when they do, I get better information. So hospitalization models are something that Kyoti gave me. And then the question is:
126 00:25:15.720 --> 00:25:25.680 Jacob Barhak: at what frequency does a person get hospitalized if they get the disease? So he came up with 3 models: low probability, moderate probability, and high probability, and those depend on age,
127 00:25:25.990 --> 00:25:28.579 Jacob Barhak: as you can see here, and also
128 00:25:28.770 --> 00:25:37.303 Jacob Barhak: whether a person gets hospitalized early or later, meaning how much time it takes them to get hospitalized at each age.
129 00:25:37.880 --> 00:25:38.710 Jacob Barhak: when.
130 00:25:38.870 --> 00:25:56.340 Jacob Barhak: So again, you can run the simulation and find out, in summary, that if you take the average one, it's not the best one. Actually, the one with the lower probability won: hospitalization was not as high as we thought, at least according to the data and all of the other models that we found.
131 00:25:56.970 --> 00:25:59.740 Jacob Barhak: So all of this you take into account.
132 00:26:00.830 --> 00:26:07.310 Jacob Barhak: Finally, we have mortality models. People die of COVID. But then,
133 00:26:07.480 --> 00:26:19.582 Jacob Barhak: at what frequency? Again, what's the chance of dying, and when? So there's one type of modeling, saying we'll take the information published by the CDC.
134 00:26:20.410 --> 00:26:31.329 Jacob Barhak: This is more complicated; I'll just say that it's the mortality probability and the time, and it doesn't change much.
135 00:26:31.930 --> 00:26:32.720 Jacob Barhak: But
136 00:26:32.940 --> 00:26:52.660 Jacob Barhak: Filippo Castiglione actually did a model about how organs die from cells dying. This is a multi-scale model, because he was working at the level of cells, but later on tied it all the way up to the mortality of the person. So it's different levels of scale.
137 00:26:52.860 --> 00:27:19.839 Jacob Barhak: So this is why it's called a multi-scale model. And that model tells me, for each day after infection (infection is day 0), what's the chance of a person dying. So, for example, the chance of a 20-year-old or an infant dying on day 19, according to his model, is less than one per 1,000 if they get infected. But if you go to a 90-year-old,
138 00:27:20.460 --> 00:27:30.929 Jacob Barhak: on day 20 it's 1%; on day 19 it's a little bit less, also about 1%, but then it drops. This is according to his model.
139 00:27:31.150 --> 00:27:55.689 Jacob Barhak: So now, which one of those models is correct? You can mix them up and check it out, and this is what we do. But before we do this, we also have to account for the fact that the numbers we get are incorrect. How many people here, raise your hands, saw something saying that the numbers being shown are wrong, that someone is miscounting them? We all went through COVID.
140 00:27:55.810 --> 00:27:59.739 Jacob Barhak: Come on Upper. Did you ever.
141 00:28:00.060 --> 00:28:04.030 Jim Mccormack: I definitely did. Yeah, I can't raise my hand, but I could raise my voice.
142 00:28:04.270 --> 00:28:12.689 Jacob Barhak: Yeah. So we know that people claim the numbers are wrong. Some people think they're overestimated, some people think they're underestimated, correct?
143 00:28:12.960 --> 00:28:27.870 Jacob Barhak: Everyone had their own opinion, and we don't know what drives those opinions, but we can suspect. But it doesn't matter, really. For science, we have different numbers, and we don't trust them. How do we correct for that? We say, you know what,
144 00:28:28.190 --> 00:28:31.799 Jacob Barhak: we ask the question: what if it was a different number?
145 00:28:32.250 --> 00:28:39.130 Jacob Barhak: And what different number? For example, we know almost for sure that the number of infections that we have
146 00:28:41.430 --> 00:28:47.100 Jacob Barhak: is miscounted. It doesn't represent the proportion in the entire society,
147 00:28:47.711 --> 00:28:53.980 Jacob Barhak: not the probability, the proportion of people actually infected in society, because
148 00:28:54.300 --> 00:29:10.980 Jacob Barhak: the tests always have an error. It matters not only when you test the person, and the accuracy of the test, but also how you conduct the test, and
149 00:29:11.170 --> 00:29:27.340 Jacob Barhak: what your sample population is. All of those things matter. So we pretty much assume that the numbers, the number of infected people reported, are underreported, because there are those who never got tested; therefore their numbers didn't appear.
150 00:29:27.420 --> 00:29:42.970 Jacob Barhak: So now, by how much do we multiply them? Well, some people multiply by 5; this seems to be a running number that everyone in epidemiology multiplies by. I claimed: okay, let's try 20. If there's 5, let's also try 20. And
151 00:29:43.420 --> 00:29:49.230 Jacob Barhak: Lucas Böttcher, who gave me this model (he actually also gave me an infectiousness model),
152 00:29:49.770 --> 00:30:03.270 Jacob Barhak: he actually looked at it. He says you have to multiply by a factor of roughly 7 to 15. And then there's another model that's more complicated to explain; we'll leave it for now, it doesn't matter for now.
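The observer-model idea can be sketched as scoring candidate multipliers. The numbers below are made up for illustration only, and the function name is hypothetical: the simulation's "true" infection count is compared against the reported count scaled by each candidate underreporting multiplier, and the optimization keeps whichever fits best.

```python
def observer_error(model_infected, reported_infected, multiplier):
    """Discrepancy between the simulation's 'true' infection count and the
    reported count scaled by an assumed underreporting multiplier."""
    return abs(model_infected - multiplier * reported_infected)

# Toy numbers: the simulation claims 1,500 true infections; 200 were reported.
for m in (5, 7, 15, 20):
    print(m, observer_error(1500, 200, m))  # multiplier 7 fits best here
```

During optimization, these observer models compete alongside all the other models, so implausible correction factors lose weight just like implausible infectiousness curves.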
153 00:30:03.480 --> 00:30:04.600 Jacob Barhak: and then
154 00:30:06.180 --> 00:30:16.249 Jacob Barhak: The mortality is the same thing. Some people who died, you know, out of a car crash were listed as COVID; at least this is what was reported in some newspapers,
155 00:30:17.430 --> 00:30:27.399 Jacob Barhak: and then vice versa: some people who died of COVID were maybe written down as dying of something else, because of complications; we never know. So
156 00:30:27.770 --> 00:30:36.420 Jacob Barhak: you can actually model this, and this is what Lucas Böttcher did. He did it per state, and he gave me a bunch of numbers per state.
157 00:30:36.580 --> 00:30:49.569 Jacob Barhak: So now I'm running all of those models to make sure that whatever is being told is correct. So now we have the 2 numbers: the true number that the model knows, of how many people are infected, and the observed number, which
158 00:30:50.160 --> 00:30:58.770 Jacob Barhak: takes all of those and adjusts according to the real number. So the two numbers will be different.
159 00:30:59.020 --> 00:31:20.830 Jacob Barhak: So now let's look at the results. This takes a lot of computing time to actually do: it's about 3 years of computation on one CPU core. I use a 24-core machine, so it takes about 6 weeks to run the simulation that you see. I'm now running a much, much bigger simulation; I might show you the screen while it's happening later on.
160 00:31:22.470 --> 00:31:49.910 Jacob Barhak: Each state is represented by 10,000 individuals, and those 10,000 kind of interact with each other, and also with all of the equations that I showed you. And each equation comes in with a different weight. What I do is run all the simulations and then test how well or how badly the numbers match the reported numbers. So let's look at this.
161 00:31:50.140 --> 00:31:56.009 Jacob Barhak: It's huge; I cannot even show you the real results. This is a cut-down version,
162 00:31:56.860 --> 00:32:03.119 Jacob Barhak: because otherwise the file sizes become enormous at some point. But let me explain what you're seeing.
163 00:32:03.240 --> 00:32:09.579 Jacob Barhak: This is the population panel. You see here, for each state,
164 00:32:10.720 --> 00:32:27.540 Jacob Barhak: multiple circles; each circle represents one day in one simulation, and I start simulations again at different times. Because, remember, you have the timeline running,
165 00:32:28.190 --> 00:32:56.680 Jacob Barhak: and I start the simulations once on day 1 and then once on day 5, and then check whether day 10 in one, which means day 5 in the other, is the same. I include all of those together, because if I start all the simulations at the same time and some of the numbers are wrong, then I have a problem. So I have to start the simulation at different times in the pandemic, in different windows. Each time I run a window of 21 days,
166 00:32:56.780 --> 00:33:00.859 Jacob Barhak: and then after 21 days I check how well it matches the results.
167 00:33:01.300 --> 00:33:07.789 Jacob Barhak: I figured out that if I don't do those windows,
168 00:33:07.930 --> 00:33:14.660 Jacob Barhak: the results are not really good. So those windows really help stabilize the results, because everything becomes
169 00:33:15.080 --> 00:33:23.000 Jacob Barhak: much less sensitive to wrong numbers, to some of the errors that appear.
170 00:33:23.750 --> 00:33:37.199 Jacob Barhak: What happens here is that one circle, for example the selected circle: this is Kentucky, and the code 45 means that it starts 45 days after April 1st.
171 00:33:37.300 --> 00:34:03.680 Jacob Barhak: This is where the simulation starts, and then it runs for 21 days. So after 21 days you can look at the results here. It will give you the average age and stuff like this. But look at the numbers that say observed: for observed infected, you will see a number before the slash and a number after the slash; same thing for observed deaths, before the slash and after the slash. The number before is what the model tells you.
172 00:34:03.980 --> 00:34:14.260 Jacob Barhak: The number after is the actual number, as reported by the COVID Tracking Project, after, of course, being normalized to 10,000 people per state.
173 00:34:14.650 --> 00:34:20.170 Jacob Barhak: So this is out of 10,000 people. So you see the model is way off, like a factor of two here.
174 00:34:20.570 --> 00:34:24.529 Jacob Barhak: Now, the height of that circle
175 00:34:24.750 --> 00:34:47.079 Jacob Barhak: is the error. It's called the fitness score. It takes into account the differences between the infection model and the observed results, the death model and the observed results, and the hospitalization model and the observed results. They're all bundled together.
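The fitness score described here can be sketched as a combined discrepancy across the observed quantities. This is a deliberate simplification with made-up numbers; the real scoring in the talk bundles the terms in a more involved way.

```python
def fitness_score(sim, obs):
    """Combine the mismatch in infections, deaths, and hospitalizations into
    a single error; lower is better, 0 is a perfect match.
    Both dicts hold counts normalized to 10,000 people per state."""
    return sum(abs(sim[k] - obs[k]) for k in ("infected", "deaths", "hospitalized"))

# Made-up numbers per 10,000 people, like the before/after-slash pairs on screen.
simulated = {"infected": 180.0, "deaths": 4.2, "hospitalized": 11.0}
reported = {"infected": 90.0, "deaths": 3.8, "hospitalized": 9.5}

print(fitness_score(simulated, reported))  # the "height of the circle"
```

Gradient descent then drives this score down by re-weighting the models, which is what the shrinking circles in the panel show.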
176 00:34:47.300 --> 00:34:56.389 Jacob Barhak: And then this is being optimized using gradient descent, which is an AI technique that people are familiar with; this is the basis of all our neural networks.
177 00:34:56.739 --> 00:34:57.690 Jacob Barhak: So
178 00:34:58.200 --> 00:35:09.280 Jacob Barhak: Now I'm taking all of this, and I'm starting to optimize. Notice: the height of the circle is the error, and I'm trying to drop it down. Ideally, I want everything to be around 0.
179 00:35:09.810 --> 00:35:18.669 Jacob Barhak: And what am I actually optimizing? Let me explain what happens here. You see all those 5 blue ones: those are all infectiousness models,
180 00:35:18.890 --> 00:35:26.039 Jacob Barhak: those purple ones are transmission models. The green ones are the behavioral models.
181 00:35:26.060 --> 00:35:43.350 Jacob Barhak: the reddish or brownish ones and the yellow ones are the hospitalization models (you remember, we have time and probability), those are mortality models, and finally the purple ones are mortality observer models. All of those are being
182 00:35:43.350 --> 00:35:58.570 Jacob Barhak: optimized at the same time. So now I'm tweaking them and rerunning; there are variations. And then, little by little, you see, even after 3 iterations, one of the transmission models totally disappears. I'll tell you which one in a second.
183 00:36:00.610 --> 00:36:21.890 Jacob Barhak: And then it continues and continues, and you see some more dominant mortality models here; so some of the mortality models were not that good. And it continues and continues. Now you can build those curves I showed you at the beginning: you can actually figure out what the transmission was, how people behaved. You'll see the Apple mobility data disappeared completely.
184 00:36:22.656 --> 00:36:29.049 Jacob Barhak: Within the transmission models, one disappeared, and this one became dominant: those are the ones with temperature.
185 00:36:29.600 --> 00:36:35.140 Jacob Barhak: The one that says that colder states transmit more
186 00:36:35.590 --> 00:36:41.780 Jacob Barhak: is the one that is dominant, meaning: if you're in a hot state, you're better off than in a cold state,
187 00:36:42.110 --> 00:36:44.770 Jacob Barhak: because this almost disappeared completely.
188 00:36:46.390 --> 00:36:52.820 Jacob Barhak: Here, in the infectiousness ones, you see that the dominant models are the ones that were longer, not the ones that were shorter.
189 00:36:53.680 --> 00:37:02.510 Jacob Barhak: And finally, if you look at the mortality models, the one that Filippo Castiglione gave me is the dominant one,
190 00:37:02.750 --> 00:37:07.500 Jacob Barhak: meaning this model is actually better, if you look at it over time.
191 00:37:08.365 --> 00:37:30.270 Jacob Barhak: And the observer models tell you that some of the models don't make sense; like, don't multiply by 20, it's somewhere between 5 or 7 and 15, approximately. The people who said that 5 is the number by which you multiply the number of infections to get the correct number, those were generally correct.
192 00:37:30.440 --> 00:37:34.860 Jacob Barhak: Approximately; we can actually calculate the exact numbers, but this doesn't matter, because
193 00:37:35.070 --> 00:37:43.209 Jacob Barhak: it's enough to know what's wrong and what's not. And now we can answer questions about the pandemic.
194 00:37:43.530 --> 00:37:53.749 Jacob Barhak: And I've written all those things here. But let's talk a little bit about conclusions. I'll just conclude everything, in case you might have questions,
195 00:37:54.540 --> 00:37:55.809 Jacob Barhak: and I don't want to go
196 00:37:56.190 --> 00:38:00.710 Jacob Barhak: too much over time. I want to keep it short; otherwise it's just me talking, and I want to hear you.
197 00:38:01.224 --> 00:38:06.839 Jacob Barhak: So the idea is that this model can help the government in the next pandemic.
198 00:38:08.740 --> 00:38:09.600 Jacob Barhak: Now
199 00:38:09.710 --> 00:38:20.649 Jacob Barhak: I'm telling you this, but I am biased, because I developed it. This has been in development since about 2012; I invested all my time and effort and resources into this,
200 00:38:21.990 --> 00:38:30.209 Jacob Barhak: quite a bit. I've been doing this for many years on my own. I'm a sole proprietor now, meaning I'm a company of one person in the US.
201 00:38:31.730 --> 00:38:38.319 Jacob Barhak: It's a form of explainable artificial intelligence, because I can explain things to you, as you saw.
202 00:38:38.460 --> 00:38:43.289 Jacob Barhak: So it's AI, but the explainable type, as I'm showing you.
203 00:38:43.380 --> 00:39:08.519 Jacob Barhak: Now, the question is how good it is. I can tell you for sure: it is difficult to explain phenomena like COVID-19, because there are many, many parameters, and the question is how much time you need. For the next pandemic, if someone comes to me and asks me how good this tool will be, I tell them: well, you have to have at least 3 weeks of data after you have some sort of
204 00:39:08.560 --> 00:39:32.929 Jacob Barhak: infection going on. I started modeling in April 2020, so you need at least a few weeks of data. But this is actually not quite true, because 3 weeks of data will initially give you different results; you don't have enough numbers. You have to go forward for at least several months and then do the windows I'm talking about, and then the numbers start stabilizing.
205 00:39:33.200 --> 00:40:00.399 Jacob Barhak: I'm still running a big simulation here to make sure those numbers are correct, because I'm doing Monte Carlo simulations, where I flip all those coins, and I'm making sure that I throw enough computing power at it to actually make it useful, and that I didn't make any mistakes. Every once in a while I find something that was wrong that I need to correct; this is why there are multiple versions. But in the future you'll need at least a few months of data,
206 00:40:00.940 --> 00:40:11.729 Jacob Barhak: plus you need one of those windows. But maybe you'll get initial results after a month or 2, at least in some sense, and then later on you'll get more,
207 00:40:11.860 --> 00:40:21.439 Jacob Barhak: and you go with it, so the government will not be completely clueless like it was in this pandemic, because now we find out how clueless they were.
208 00:40:23.490 --> 00:40:34.629 Jacob Barhak: Now I can tell you that the peak average infectiousness is at about day 5 from infection. This is for COVID. And the transmission rate is about 2% per encounter, a little bit less.
209 00:40:34.750 --> 00:40:38.140 Jacob Barhak: Warm weather seems to reduce transmission.
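The coin-flipping Monte Carlo that Jacob describes, using the two numbers he quotes (infectiousness peaking around day 5, roughly 2% transmission per encounter), could be sketched like this. This is a toy illustration, not the Reference Model's actual code; the population size, encounter rate, and the triangular infectiousness profile are invented for the example:

```python
import random

# Parameters quoted in the talk; everything else below is made up.
PEAK_DAY = 5            # infectiousness peaks about day 5 after infection
BASE_TRANSMISSION = 0.02  # roughly 2% per encounter at the peak

def infectiousness(days_since_infection):
    """Toy triangular profile: rises to 1.0 at PEAK_DAY, zero after day 14."""
    if days_since_infection <= 0 or days_since_infection >= 14:
        return 0.0
    if days_since_infection <= PEAK_DAY:
        return days_since_infection / PEAK_DAY
    return (14 - days_since_infection) / (14 - PEAK_DAY)

def simulate(population=1000, initial_infected=10, days=21,
             encounters_per_day=8, seed=0):
    """Flip a coin for every encounter, Monte Carlo style."""
    rng = random.Random(seed)
    # day_infected[i] is the day person i was infected, or None if never.
    day_infected = [0 if i < initial_infected else None
                    for i in range(population)]
    for day in range(1, days + 1):
        infected = [i for i, d in enumerate(day_infected) if d is not None]
        for i in infected:
            p = BASE_TRANSMISSION * infectiousness(day - day_infected[i])
            for _ in range(encounters_per_day):
                j = rng.randrange(population)
                if day_infected[j] is None and rng.random() < p:
                    day_infected[j] = day
    return sum(d is not None for d in day_infected)

print(simulate())  # total ever infected after 21 simulated days
```

Because each run flips random coins, stable estimates come from averaging many repetitions with enough computing power, which is exactly the burden Jacob describes.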
210 00:40:38.520 --> 00:40:52.450 Jacob Barhak: Now, this is something important. Today we published the paper: lessons learned from COVID-19 for modeling COVID-19, and steps to take in the next pandemic. Here, I'll show you the paper; it will load. It's now in preprint.
211 00:40:53.000 --> 00:41:06.579 Jacob Barhak: We have multiple collaborators, and they actually came from different perspectives. I did the Reference Model, but they did other models, and figured out how to do other things better,
212 00:41:06.710 --> 00:41:24.230 Jacob Barhak: and we wrote the paper, which says how to do modeling better, how to spread information better and get it correct, how to validate information properly; and we also have some recommendations about infrastructure and education that will help.
213 00:41:24.400 --> 00:41:33.289 Jacob Barhak: So if you're interested, please go and read the paper. It's in preprint. You can get it by following this link number 53 here.
214 00:41:34.960 --> 00:41:59.159 Jacob Barhak: Everything I showed you here can be reproduced, to some degree. I'm showing exactly what data I used, so I can trace back, if someone ever asks me about the presentation, where I got the numbers from. The code that actually created the presentation you can find on GitHub; I actually released it, but not all of the data. Some things are proprietary. I have a conflict of interest statement here,
215 00:41:59.160 --> 00:42:05.399 Jacob Barhak: because I do intend to get money out of it. I have 2 US patents on this technology.
216 00:42:05.871 --> 00:42:12.049 Jacob Barhak: I'm now licensing them. If you're interested, or know someone who's interested, please do connect us.
217 00:42:12.280 --> 00:42:32.328 Jacob Barhak: And I have many, many people and many organizations to thank. People gave me all sorts of help: the presentation technology, ideas, help finding resources, and connections to people who helped.
218 00:42:32.930 --> 00:42:43.334 Jacob Barhak: People hosted my computer, published the video; you can actually see this, this is the video that's embedded, you can look at it later. It was published at SciPy.
219 00:42:44.108 --> 00:42:59.230 Jacob Barhak: People contributed models; I'm showing some of the other work they did here. People contributed money to actually run simulations in the cloud. I ran several simulations in the cloud which, instead of several weeks or months, ran in 2 days.
220 00:42:59.850 --> 00:43:21.759 Jacob Barhak: You don't always have money for that. So Rescale gave me credits for Azure and Amazon, and MIDAS gave me some money; I ran a simulation on Google Cloud with them, they gave it through some grant. And many, many people I have to thank for various ideas and things. So thank you all, and I'm open for questions.
221 00:43:22.170 --> 00:43:23.910 Jacob Barhak: underneath this.
222 00:43:28.300 --> 00:43:33.410 Jim Mccormack: So, Jacob, have you been able to use any of it on like bird flu, or any other pandemics that
223 00:43:34.160 --> 00:43:37.540 Jim Mccormack: are starting or rumored to be the next pandemic?
224 00:43:37.780 --> 00:43:50.169 Jacob Barhak: Currently I'm focusing on COVID, because, believe it or not, this technology is still in the development phase. You show me any technology that you've invested that many years in.
225 00:43:50.360 --> 00:43:53.360 Jacob Barhak: Is it good enough after that many years?
226 00:43:53.900 --> 00:43:58.020 Jacob Barhak: This is what it takes. It's about 20 years of development.
227 00:43:59.560 --> 00:44:07.305 Jacob Barhak: You tell me. So I'm still making sure, sorting out all the bits and pieces,
228 00:44:07.920 --> 00:44:20.389 Jacob Barhak: so for such technologies you need many more resources. I'm not talking about, you know, some game that blows up, that people use, or something that is fairly well tested.
229 00:44:20.980 --> 00:44:40.410 Jacob Barhak: Sometimes those are not really tested either; they just blow up virally. Here, to be sure that this technology works, you actually have to invest a lot of time. Now, the big problem with all of the data, which is a different project that I'm working on, is the fact that you cannot get medical data.
230 00:44:40.620 --> 00:44:57.669 Jacob Barhak: One of the advantages of this project is that it can actually merge data from multiple sources. This is practically not allowed in the medical world, because if you have population A and population B, in the medical world you are not allowed to merge the data between those 2.
231 00:44:58.220 --> 00:45:14.120 Jacob Barhak: No, no, because it's patient data. So the individual data is not allowed. But you're allowed to merge the models. And this is what I'm doing. This is why this technology is important, not only for the pandemics I'm doing. I'm doing it on the pandemics, because there I have good data.
232 00:45:14.220 --> 00:45:37.740 Jacob Barhak: The other model I have is the diabetes model. Today, as far as I know, I have the most validated diabetes model worldwide, because I tested it with more populations than anyone else. How did I do it? I connected to clinicaltrials.gov and got the information from there. But the thing is, even the data in clinicaltrials.gov is not that good. Here, I'll show you; that's another project that I'm working on.
233 00:45:38.060 --> 00:45:49.700 Jacob Barhak: Even getting the data out of those trials is nearly impossible, because the units of measure are messed up, even if you do everything correctly. I'm going to show you just one thing.
234 00:45:50.650 --> 00:45:57.069 Jacob Barhak: It will take a minute to load. This is a website; it's actually active, you can check it out. For example, here: HbA1c is a measure of diabetes.
235 00:45:57.900 --> 00:46:05.770 Jacob Barhak: So see how many ways people write HbA1c and its units of measure. A computer cannot understand it.
236 00:46:06.330 --> 00:46:18.110 Jacob Barhak: So you need AI to tell it what it's supposed to be. That's a different project I'm doing, which is a spin-off of this project; and actually, there are some claims in the patents that relate to this project as well. So,
237 00:46:18.290 --> 00:46:29.200 Jacob Barhak: eventually, to get the disease models right, you need correct data. Messing around with the data is the biggest problem.
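The HbA1c spelling-and-units problem Jacob shows can be illustrated with a small normalization sketch. The variant spellings and the canonical names below are hypothetical examples chosen for the illustration, not taken from clinicaltrials.gov or from his spin-off project:

```python
import re

# Hypothetical canonical unit table; a real registry would need far more entries.
CANONICAL_UNITS = {
    "%": "percent",
    "percent": "percent",
    "percentage": "percent",
    "mmol/mol": "mmol/mol",
    "mmol per mol": "mmol/mol",
}

def normalize_measure(name, unit):
    """Collapse spelling variants of 'HbA1c' and its units to one canonical form."""
    # Strip punctuation and spacing so 'Hba, one C', 'HbA1c', 'HBA1C' all match.
    key = re.sub(r"[^a-z0-9]", "", name.lower()).replace("one", "1")
    if key != "hba1c":
        raise ValueError(f"unrecognized measure: {name!r}")
    canonical_unit = CANONICAL_UNITS.get(unit.strip().lower())
    if canonical_unit is None:
        raise ValueError(f"unrecognized unit: {unit!r}")
    return "HbA1c", canonical_unit

print(normalize_measure("Hba, one C", "Percentage"))  # ('HbA1c', 'percent')
```

A lookup table like this only covers the variants someone has already seen, which is why the talk argues the open-ended cases need AI rather than hand-written rules.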
238 00:46:29.550 --> 00:46:34.450 Jacob Barhak: So you're saying bird flu, because everyone says bird flu. But do you have good data about bird flu?
239 00:46:35.380 --> 00:46:39.609 Jacob Barhak: Well, when we started with COVID, we didn't have good data either.
240 00:46:39.720 --> 00:46:54.809 Jacob Barhak: And this is why we had this project running, and this is where the recommendations come in: how to get good data in the future, and how to do it so that models can use it and actually get the results. It took me about 5 years to get to something stable, which I'm showing you now.
241 00:46:55.100 --> 00:47:16.629 Jacob Barhak: And I started at the beginning of the pandemic. In the next pandemic, if it takes one year it will be better, and then 3 months, or 2 months, or 2 weeks, and it will get better and better. But to get to that point we need an entire well-lubricated system that gives us all the things that we need: all the correct data and all the correct models, and so on and so forth.
242 00:47:16.630 --> 00:47:25.530 Jacob Barhak: This is still a lot of work. The big systems are not yet set up for it, and hopefully this paper will be helpful in this regard.
243 00:47:25.670 --> 00:47:30.479 Jacob Barhak: And think about how much money was spent on COVID, and how many models are out there.
244 00:47:30.890 --> 00:47:51.490 Jacob Barhak: Think about some big machine in the future that crunches all of those assumptions that people plug in, and tells them: oh, this assumption is probably incorrect, it doesn't match this data or that data. This is what my technology does. And we do have the computing power today, but we do need the software infrastructure, and all those many years of work that I have only started.
245 00:47:51.660 --> 00:47:53.450 Jacob Barhak: Did I answer your question, Jim?
246 00:47:53.710 --> 00:48:09.379 Jim Mccormack: Yeah, yeah, and very well. So, Jacob, if I restate it right: it may not be directly translatable to bird flu, but the lessons learned and the prep work will get us to those answers faster, using this example and this work.
247 00:48:09.530 --> 00:48:28.879 Jacob Barhak: I didn't try it on bird flu. If you give me data on bird flu, I can try it. But then I'll tell you: oh, this is missing, and this is missing, and I need all those assumptions. And then you'll start finding all those researchers, and even finding researchers to collaborate, to give you all those models, also takes time. All of this has to be centralized, in a way that
248 00:48:29.370 --> 00:48:37.410 Jacob Barhak: the system wasn't during COVID. Everything was kind of a mess. I wasn't part of that mess; I was trying to find things, and I couldn't find them.
249 00:48:37.900 --> 00:48:45.199 Jacob Barhak: I believe I was stressed, because I had been working on this, at that point, for like 15 years, and I still was
250 00:48:45.860 --> 00:49:01.209 Jacob Barhak: feeling that, you know, I have everything I need, all the bits and pieces. Now we're trying to formalize it and give recommendations on how to do it better in the future. The idea is that someone who is actually interested
251 00:49:01.210 --> 00:49:20.360 Jacob Barhak: will look at it, learn from it, and then start training groups that will do those things. I have a friend who actually has some good ideas; his name is John Rice, and he was actually the instigator of this paper, saying: you know, what did you learn from all this work that you did on COVID? So we organized it all, in a way that in the next pandemic
252 00:49:20.530 --> 00:49:29.019 Jacob Barhak: there won't be such a problem. And we're now propagating this paper. If you're interested in helping, take this paper and send it to some of your friends.
253 00:49:29.910 --> 00:49:32.570 Gabor Szabo: There's another question I see in the chat.
254 00:49:35.170 --> 00:49:36.230 Gabor Szabo: Can I read it out?
255 00:49:37.783 --> 00:49:38.549 Jacob Barhak: Yeah. Please.
256 00:49:38.730 --> 00:49:44.720 Gabor Szabo: For type 1 diabetes modeling: isn't that a genetic disease? Autoimmune?
257 00:49:45.690 --> 00:49:46.115 Jacob Barhak: Well.
258 00:49:47.190 --> 00:49:55.940 Jacob Barhak: Type 1 diabetes is different from type 2; they have different mechanisms. I'm not a medical doctor, so I won't go into those.
259 00:49:56.540 --> 00:50:05.610 Jacob Barhak: It was explained to me many times. While I was doing type 2 diabetes, I was working with a team of experts, worldwide experts in diabetes.
260 00:50:06.670 --> 00:50:12.190 Jacob Barhak: I'm less concerned about the type of disease or what it is.
261 00:50:12.350 --> 00:50:24.139 Jacob Barhak: Diseases for me are state transition models, where you jump from one state to another, and there's a probability of moving there, and the probability depends on all sorts of parameters.
262 00:50:24.380 --> 00:50:25.230 Jacob Barhak: So
263 00:50:25.480 --> 00:50:42.630 Jacob Barhak: about the source of the disease or the cures, I don't care much. I just want to be able to explain it. And how do I explain it? If I have a model that says A, and then a model that says B, I want to know which one of those contributes more to the numbers I see at the end.
264 00:50:42.810 --> 00:50:52.719 Jacob Barhak: The way I look at diseases, it's all numbers; and the people who understand all of the elements, they are the ones making the models.
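The state-transition view of disease that Jacob describes can be sketched as a tiny Markov-style model: discrete states, with per-step jump probabilities that depend on a parameter. The states and all the probabilities below are invented for illustration; they are not from his diabetes model:

```python
import random

STATES = ("healthy", "sick", "dead")

def transition_probs(state, age):
    """Per-step transition probabilities out of `state`, parameterized by age.
    All coefficients here are made-up toy values."""
    if state == "healthy":
        p_sick = 0.01 + 0.0005 * age
        return {"healthy": 1 - p_sick, "sick": p_sick, "dead": 0.0}
    if state == "sick":
        p_dead = 0.001 + 0.0002 * age
        return {"healthy": 0.05, "sick": 1 - 0.05 - p_dead, "dead": p_dead}
    return {"dead": 1.0}  # 'dead' is an absorbing state

def step(state, age, rng):
    """Draw the next state from the transition distribution."""
    r = rng.random()
    cumulative = 0.0
    for next_state, p in transition_probs(state, age).items():
        cumulative += p
        if r < cumulative:
            return next_state
    return state  # guard against floating-point rounding

# One simulated 50-year trajectory for a person starting healthy at age 40.
rng = random.Random(1)
state = "healthy"
for year in range(50):
    state = step(state, age=40 + year, rng=rng)
print(state)
```

Running many such trajectories and comparing the end-state counts against observed population numbers is the sense in which competing models A and B can be scored against each other.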
265 00:50:53.240 --> 00:50:56.230 Jacob Barhak: So, does that answer your question?
266 00:51:00.390 --> 00:51:07.479 Jacob Barhak: Okay, so this is why it is. And I'll try to send you the link for that.
267 00:51:08.800 --> 00:51:09.860 Jacob Barhak: Here we go.
268 00:51:10.650 --> 00:51:31.700 Jacob Barhak: This is the link for that paper, the lessons learned paper. So if you know people who are interested, please propagate it; this is important. Hopefully some governments or some organizations will adopt it, and the next pandemic will have less mess than we had in this pandemic. By the way, when the pandemic started, everyone was doing COVID.
269 00:51:31.950 --> 00:51:45.610 Jacob Barhak: Really. Like, every department, every institution. Financial institutions were running COVID models, computation institutions were doing COVID. Everyone
270 00:51:45.860 --> 00:51:53.350 Jacob Barhak: had a COVID model. Now, even finding people who model COVID is kind of hard,
271 00:51:53.520 --> 00:52:16.299 Jacob Barhak: because they all stopped doing it; it's less interesting. But for me it's still very interesting, because I really am dedicated to it, and this is my life's work. I've been working on this for almost 20 years, and I want this to continue, and to be done properly. So this is why I'm giving this talk. And Gabor, thank you for having me.
272 00:52:17.117 --> 00:52:26.700 Gabor Szabo: Thank you for giving this talk, this presentation. Anyone, any more questions before we
273 00:52:27.180 --> 00:52:29.009 Gabor Szabo: close the video?
274 00:52:31.430 --> 00:52:40.860 Jacob Barhak: If you have Python questions on how I did this or that with Python, I can answer those too; I'm running many, many things. Actually, maybe it's a good time to show you.
275 00:52:41.020 --> 00:52:42.370 Jacob Barhak: You see
276 00:52:43.590 --> 00:53:03.189 Jacob Barhak: this thing, hopefully I won't disconnect everything. This is a screen of a computer; behind it is a 24-core machine, a very good processor. It runs as fast as older supercomputers. 32 cores, 32 threads. And this simulation now is
277 00:53:04.120 --> 00:53:04.890 Jacob Barhak: team.
278 00:53:04.890 --> 00:53:07.520 Gabor Szabo: If you stop the screen sharing you'll see better.
279 00:53:08.430 --> 00:53:10.879 Jacob Barhak: Oh, oh, okay, I'll stop screen share.
280 00:53:11.850 --> 00:53:12.820 Jacob Barhak: Second.
281 00:53:16.910 --> 00:53:19.340 Jacob Barhak: how do I stop the screen? Share?
282 00:53:21.328 --> 00:53:25.809 Jacob Barhak: I think it shows me screen share. How do I stop it? It says only "share."
283 00:53:26.120 --> 00:53:27.730 Jacob Barhak: Did I stop the share?
284 00:53:28.380 --> 00:53:28.910 Jacob Barhak: No.
285 00:53:28.910 --> 00:53:29.850 Gabor Szabo: Oh, not yet.
286 00:53:30.770 --> 00:53:31.900 Jacob Barhak: Give me a second.
287 00:53:35.300 --> 00:53:41.260 Jacob Barhak: I'm not sure how to stop the share in this mode. Oh, oh, okay, thank you.
288 00:53:41.460 --> 00:54:08.409 Jacob Barhak: You see, this screen is actually a computer that runs the same simulation. You see, now it's around here; this is iteration 17. It should get to about 40 to get something approximately stable. This is my current baseline; I've seen the simulation stabilize around 40, but this one is much, much bigger. Here I start the simulation on each day, and I run 5 repetitions
289 00:54:08.700 --> 00:54:15.240 Jacob Barhak: for all of this. I'm actually running about 2 months of days, for which I have information, from April to June,
290 00:54:15.260 --> 00:54:43.210 Jacob Barhak: and each time I start on a different day and run for 21 days, and check the numbers for 5 simulations for each state, and then I continue doing this, state by state. This will take me about half a year. I started a few months ago, and I had some power problems and computer problems and so on. I actually burned out computers on this, I kid you not; I have multiple computers dead because of running all those simulations, all the time, for many, many years.
291 00:54:43.340 --> 00:54:52.760 Jacob Barhak: So I started with small clusters. I created clusters; I use Dask to create the clusters, and it still runs with Dask. You cannot
292 00:54:53.020 --> 00:54:54.450 Jacob Barhak: see the best. This is.
293 00:54:56.670 --> 00:54:59.559 Gabor Szabo: Yeah, I can't. We can't really see that. No? Well.
294 00:55:00.037 --> 00:55:04.800 Gabor Szabo: apologies, I cannot make it much, much closer, very bright.
295 00:55:05.370 --> 00:55:09.679 Jacob Barhak: I apologize. This is, this is the best I can do, anyway.
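The Dask setup Jacob describes above, one run per start day with 5 repetitions each, farmed out across a many-core machine, could be sketched roughly like this. The `run_simulation` function is a hypothetical placeholder, not his actual simulation code:

```python
def run_simulation(start_day, repetition, horizon=21):
    """Placeholder for one 21-day simulation run.
    Returns a dummy (run id, horizon) pair instead of real results."""
    return (start_day * 1000 + repetition, horizon)

if __name__ == "__main__":
    # `dask.distributed` is imported lazily so the function above is
    # usable even where Dask is not installed.
    from dask.distributed import Client, LocalCluster

    # On one machine this uses the local cores; the same client code can
    # point at a multi-node scheduler instead of a LocalCluster.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)

    futures = [
        client.submit(run_simulation, day, rep)
        for day in range(60)   # roughly 2 months of start days
        for rep in range(5)    # 5 repetitions each, as in the talk
    ]
    results = client.gather(futures)
    print(len(results))  # 300 completed runs

    client.close()
    cluster.close()
```

The appeal of this pattern is that the embarrassingly parallel outer loop (days times repetitions) maps directly onto `client.submit`, so the same script scales from a laptop to the cloud runs mentioned earlier in the talk.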
296 00:55:10.630 --> 00:55:16.720 Gabor Szabo: Oh, thank you very much again, and thank you everyone for being here and for watching.
297 00:55:16.830 --> 00:55:27.079 Gabor Szabo: I mean, you shared the links, so people can find you; they will be under the video. There will be a link for all these things.
298 00:55:28.670 --> 00:55:30.450 Gabor Szabo: Like the video,
299 00:55:30.740 --> 00:55:38.570 Gabor Szabo: follow the channel, share the video, and talk to Jacob if you are interested in discussing this topic.
300 00:55:38.870 --> 00:55:40.539 Jacob Barhak: Please. Thank you very much.
301 00:55:40.540 --> 00:55:41.120 Jim Mccormack: Thank you.
302 00:55:41.120 --> 00:55:41.760 Gabor Szabo: Right.
303 00:55:42.730 --> 00:55:43.610 Jacob Barhak: Bye, bye.
In my seven years as a software developer, I've primarily worked in teams composed solely of developers. However, my recent transition to a team of security researchers has opened my eyes to a crucial aspect that often goes unnoticed: log safety in applications.
My exposure to the application security ecosystem and to real-life security breach analysis has taught me to recognize code security issues, including the prevalence of sensitive information (tokens, passwords, and payment details) in plaintext logs. This can lead to severe data breaches, financial losses, and all kinds of catastrophes.
This talk will dive into the fatal mistakes developers often make that can result in the disclosure of sensitive information in logs. We'll explore the types of sensitive data in logs.
I'll share my personal experiences as a developer on a security research team and shed light on the often-overlooked consequences of insecure logging practices. We'll discuss practical patterns to safeguard sensitive information in Python applications, including identifying and redacting sensitive data before it reaches log files, and implementing secure logging practices.
By the end of this talk, developers will be equipped with the knowledge and tools to protect sensitive data from accidental disclosure and safeguard their applications from the perils of sensitive data exposure. Embrace the journey towards log safety and ensure your code remains secure and confidential.
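The redaction pattern the abstract mentions, scrubbing sensitive data before it reaches the log, can be sketched with a standard `logging.Filter`. The regex patterns below are illustrative only; a real application would need patterns tuned to its own token and card formats:

```python
import logging
import re

# Illustrative patterns; real deployments need their own secret formats here.
REDACT_PATTERNS = [
    (re.compile(r"(?i)(password|token|secret)=\S+"), r"\1=[REDACTED]"),
    (re.compile(r"\b\d{13,16}\b"), "[CARD-REDACTED]"),  # naive card-number match
]

class RedactingFilter(logging.Filter):
    """Scrub sensitive substrings from a record before it is emitted."""
    def filter(self, record):
        message = record.getMessage()
        for pattern, replacement in REDACT_PATTERNS:
            message = pattern.sub(replacement, message)
        record.msg = message
        record.args = ()  # message is already fully formatted
        return True

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("login ok, token=abc123 card 4111111111111111")
# emits: login ok, token=[REDACTED] card [CARD-REDACTED]
```

Attaching the filter to the handler means every record passing through that handler is scrubbed, regardless of which module logged it, which is what makes this safer than asking each call site to remember to redact.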
