forked from Sydney-Informatics-Hub/ParallelPython
search.json
135 lines (135 loc) · 72.6 KB
[
{
"objectID": "index.html",
"href": "index.html",
"title": "Parallel Python",
"section": "",
"text": "This course is aimed at researchers, students, and industry professionals who want to learn intermediate python skills applied to scientific computing and data science."
},
{
"objectID": "index.html#trainers",
"href": "index.html#trainers",
"title": "Parallel Python",
"section": "Trainers",
"text": "Trainers\n\nKristian Maras (Kris) (MSc Mathematics / Ba Commerce)\nNathaniel (Nate) Butterworth (PhD Computational Geophysics), [email protected]\nDarya Vanichkina (PhD Bioinformatics, SFHEA)\nTim White (PhD Physics and Astronomy)"
},
{
"objectID": "index.html#course-pre-requisites-and-setup-requirements",
"href": "index.html#course-pre-requisites-and-setup-requirements",
"title": "Parallel Python",
"section": "Course pre-requisites and setup requirements",
"text": "Course pre-requisites and setup requirements\nIntroductory Python experience recommended."
},
{
"objectID": "index.html#code-of-conduct",
"href": "index.html#code-of-conduct",
"title": "Parallel Python",
"section": "Code of Conduct",
"text": "Code of Conduct\nWe expect all attendees of our training to follow our code of conduct, including bullying, harassment and discrimination prevention policies.\nIn order to foster a positive and professional learning environment we encourage the following kinds of behaviours at all our events and on our platforms:\n\nUse welcoming and inclusive language\nBe respectful of different viewpoints and experiences\nGracefully accept constructive criticism\nFocus on what is best for the community\nShow courtesy and respect towards other community members\n\nOur full CoC, with incident reporting guidelines, is available here."
},
{
"objectID": "index.html#general-session-timings",
"href": "index.html#general-session-timings",
"title": "Parallel Python",
"section": "General session timings",
"text": "General session timings\n\nA. Introduction and Revision of Python Data Manipulation and Pandas Data Structures\nB. Dask Fundamentals and Application to Solving Data-Bound and Compute-Bound Bottlenecks\nC. Scientific Computing Demonstration"
},
{
"objectID": "index.html#setup-instructions",
"href": "index.html#setup-instructions",
"title": "Parallel Python",
"section": "Setup Instructions",
"text": "Setup Instructions\nAll content in this course is designed to run on the existing Dask Binder:\nDask Binder\nHence, no installation steps are needed prior to the course."
},
{
"objectID": "setup.html",
"href": "setup.html",
"title": "Setup",
"section": "",
"text": "At home setup\nTo complete the exercises presented in the workshop, you may create a Python environment with the following packages:\nconda create -n quantum python=3.9 numpy scipy pandas scikit-learn seaborn dask jupyterlab -c conda-forge\nAt the time of this workshop, the major package versions were:\ndask=2022.10.0\njupyterlab=3.4.8\nmatplotlib-base=3.5.3\nnumpy=1.23.4\npandas=1.5.1\npython=3.9.13\nscikit-learn=1.1.2\nscipy=1.9.3\nseaborn=0.12.1\nOther combinations may also work."
},
{
"objectID": "notebooks/01a-fundamentals.html",
"href": "notebooks/01a-fundamentals.html",
"title": "Parallel Python with Dask",
"section": "",
"text": "What can Python do?\nHow do I do it?\n\n\n\n\n\nLearn the basic Python commands\n\n\nGenerally, cells like this are what to type into your Python shell/notebook/colab:\n2+4*10\n42\n\n\nThese are bits of code you want to perhaps use many times, or keep self-contained, or refer to at different points. They can take values as input and give values back (or not).\n#Declare the name of the function\ndef add_numbers(x,y):\n '''adds two numbers\n usage: myaddition=add_numbers(x,y)\n returns: z\n inputs: x,y\n x and y are two integers\n z is the summation of x and y\n '''\n \n z=x+y\n \n return(z)\nNote the indentation - Python forces your code to be nicely readable by using ‘whitespace’/indentation to signify which chunks of code are related. You will see this more later, but generally you should try to write readable code and follow style standards.\nMany functions have a header - formatted as a multiline comment with three '''. This hopefully will tell you about the function.\nAnyway, let’s run our function, now that we have defined it!\nadd_numbers(1,2)\n3\n\n\n\nWrite a function to convert map scale. For example, on a 1:25,000 map (good for hiking!) the distance between two points is 15 cm. How far apart are these in real life? (3750 m).\n[Reminder: 15 cm * 25000 = 375000 cm = 3750 m]\nYour function should take two numbers as input: the distance on the map (in cm) and the second number of the map scale, e.g. 
calculate_distance(15, 25000) should return 375000\n\n\nSolution\n\n#Declare the name of the function\ndef calculate_distance(distance_cm,scale):\n '''calculates distance based on map and scale\n returns: the product of distance_cm and scale\n inputs: distance_cm,scale\n distance_cm and scale are two integers\n ''' \n \n return(distance_cm * scale)\n\n\nLet’s quickly make a figure using some sample data.\n#First we have to load some modules to do the work for us.\n#Modules are packages people have written so we do not have to re-invent everything!\n\n#The first is NUMerical PYthon. A very popular matrix, math, array and data manipulation library.\nimport numpy as np\n\n#This is a library for making figures (originally based on MATLAB plotting routines)\n#We use the alias 'plt' because we don't want to type out the whole name every time we reference it!\nimport matplotlib.pyplot as plt \n\n# random code from matplotlib docs\n# Fixing random state for reproducibility\nnp.random.seed(19680801)\n\n\nplt.rcdefaults()\nfig, ax = plt.subplots()\n\n# Example data\npeople = ('Tom', 'Dick', 'Harry', 'Slim', 'Jim')\ny_pos = np.arange(len(people))\nperformance = 3 + 10 * np.random.rand(len(people))\nerror = np.random.rand(len(people))\n\nax.barh(y_pos, performance, xerr=error, align='center')\nax.set_yticks(y_pos, labels=people)\nax.invert_yaxis() # labels read top-to-bottom\nax.set_xlabel('Performance')\nax.set_title('How fast do you want to go today?')\n\nplt.show()\n\n\n\n\nYou can store things in Python in variables\nLists can be used to store objects of different types\nLoops with for can be used to iterate over each object in a list\nFunctions are used to write (and debug) repetitive code once\nPython uses 0-based indexing\nNumPy and matplotlib are two key Python libraries for numerical computations and data visualisation, respectively"
},
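    The key points above can be tied together in one short snippet of plain Python (an illustrative sketch added for this index, not part of the original course notebooks):

    ```python
    # Variables, lists, loops, functions and 0-based indexing in one place
    def double(x):
        '''Return twice the input (same docstring style as add_numbers).'''
        return 2 * x

    values = [1, 2, 3]             # a list stores several objects
    doubled = []
    for v in values:               # a for loop iterates over each object in the list
        doubled.append(double(v))  # a function wraps repetitive code

    print(doubled[0])              # 0-based indexing: this is the FIRST element
    ```

    Running it prints the first element of `doubled`, illustrating that index 0 is the first position.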
{
"objectID": "notebooks/01b-TransitionToDask.html",
"href": "notebooks/01b-TransitionToDask.html",
"title": "Parallel Python with Dask",
"section": "",
"text": "Dask\n10 minute intro\nAPI Reference\n\nDask DataFrames coordinate many pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These pandas objects may live on disk or on other machines.\nInternally, a Dask DataFrame is split into many partitions, where each partition is one Pandas DataFrame. When our index is sorted and we know the values of the divisions of our partitions, then we can be clever and efficient with expensive algorithms (e.g. groupby’s, joins, etc…).\nUse Cases:\nDask DataFrame is used in situations where pandas is commonly needed, usually when pandas fails due to data size or speed of computation. Common use cases are:\n\nManipulating large datasets, even when those datasets don’t fit in memory\nAccelerating long computations by using many cores\nDistributed computing on large datasets with standard pandas operations like groupby, join, and time series computations\n\nDask DataFrame may not be the best choice in the following situations:\n\nIf your dataset fits comfortably into RAM on your laptop, then you may be better off just using pandas. There may be simpler ways to improve performance than through parallelism\nIf your dataset doesn’t fit neatly into the pandas tabular model, then you might find more use in dask.bag or dask.array\nIf you need functions that are not implemented in Dask DataFrame, then you might want to look at dask.delayed which offers more flexibility\nIf you need a proper database with all of the features that databases offer you might prefer something like Postgres or SQLite\n\n#Import libraries and datasets\nimport pandas as pd\nimport numpy as np\nimport scipy as sp\nimport seaborn as sns\nimport dask.datasets\nimport dask.dataframe as dd\n\nts_data = dask.datasets.timeseries()\ndf = sns.load_dataset('diamonds')"
},
{
"objectID": "notebooks/01b-TransitionToDask.html#transitioning-to-dask-dataframes",
"href": "notebooks/01b-TransitionToDask.html#transitioning-to-dask-dataframes",
"title": "Parallel Python with Dask",
"section": "Transitioning to Dask DataFrames",
"text": "Transitioning to Dask DataFrames\n# load ddf from an existing df\nddf = dd.from_pandas(df, npartitions=2)\n# many loading options available\n\nddf # dask dataframe\n# by default it has lazy execution, where computations are triggered by compute() (or head)\nddf.compute() # convert the Dask DataFrame to a pd.DataFrame\nddf.head(2)\n\n# Attributes of a Dask DataFrame distinct from pd.DataFrame\nddf.npartitions # number of partitions\nddf.divisions # the minimum value of every partition’s index plus the maximum value of the last partition’s index\nddf.partitions[1] # access a particular partition\nddf.partitions[1].index # partitions have familiar pd.DataFrame attributes\n\n# Special consideration\n\n# By default, groupby methods return an object with only 1 partition.\n# This is to optimize performance, and assumes the groupby reduction returns an object small enough to fit into memory.\n# If your returned object is larger than this, you can increase the number of output partitions using the split_out argument.\nddf.groupby('cut').mean() # npartitions=1\nddf.groupby('cut').mean(split_out=2) # npartitions=2\n/Users/darya/opt/miniconda3/envs/quantum/lib/python3.9/site-packages/dask/dataframe/groupby.py:1351: FutureWarning: In the future, `sort` for groupby operations will default to `True` to match the behavior of pandas. However, `sort=True` does not work with `split_out>1`. To retain the current behavior for multiple output partitions, set `sort=False`.\n warnings.warn(SORT_SPLIT_OUT_WARNING, FutureWarning)\n\nDask DataFrame Structure:\n[npartitions=2; columns carat, depth, table, price, x, y, z - all float64]\nDask Name: truediv, 19 graph layers\n# Dask syntax intentionally mimics the most well-known pandas APIs\nddf.loc[15:20] # subset rows\nddf[[\"carat\",\"price\"]] # subset columns\nddf.dtypes # access attributes\nddf.head(3)\nddf.query('price > 50') # same as pd.DataFrame\n\nlazy_manipulations = (ddf.query('price > 50').\n groupby('clarity').\n price.mean())\nlazy_manipulations.compute() # trigger computation to pd.DataFrame\n\n# dask aggregate has more features than the pandas agg equivalent; it supports multiple reductions on the same group.\n\nddf_aggs = (ddf.groupby('cut')\n .aggregate({\"price\":\"mean\",\"carat\":\"sum\"}))\n\n# Can persist data into RAM if possible, making future operations on it faster\nddf_aggs = ddf_aggs.repartition(npartitions=1).persist()\n\ndf_merged = ddf.merge(ddf_aggs, left_on=\"cut\", right_index=True, suffixes=(\"_original\", \"_aggregated\"))\n\ndf_merged.head(2)\n[df_merged.head(2): rows 0 and 11, both cut Ideal, with price_aggregated 3457.54197 and carat_aggregated 15146.84]\nNote that not all APIs from pandas are available in Dask. For example, ddf.filter(['carat','price']) is not available. For more details and a list of available options, see here.\n\nChallenge\n\nWhat is the price per carat over the entire dataset?\nCreate a column called price_to_carat that calculates this for each row\nCreate a column called expensive that flags whether price is greater than the dataset-wide average price per carat\nHow many expensive diamonds are there?\n\n\n\nSolution\n\n\nAverage price per carat: $4928\n15003 expensive diamonds in the whole dataset\n\nprice_per_carat = (ddf.price.sum() / ddf.carat.sum()).compute()\n\nddf = ddf.assign(price_to_carat = ddf.price / ddf.carat)\n\ndef greater_than_avg(price):\n if price > price_per_carat:\n return True\n else:\n return False\n\nddf = ddf.assign(expensive = ddf.price.apply(greater_than_avg))\nddf.sort_values('expensive', ascending=False).compute()\nnumber_expensive = ddf.expensive.sum().compute()"
},
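    The challenge arithmetic itself is independent of Dask; the same logic can be checked on a tiny hand-made table in plain Python (the toy values below are hypothetical stand-ins, not the diamonds data):

    ```python
    # Toy stand-in rows for the diamonds dataset (hypothetical values)
    rows = [
        {"price": 326, "carat": 0.23},
        {"price": 340, "carat": 0.23},
        {"price": 5000, "carat": 1.00},
    ]

    # Dataset-wide average price per carat: sum(price) / sum(carat)
    price_per_carat = sum(r["price"] for r in rows) / sum(r["carat"] for r in rows)

    # Per-row ratio, plus a flag for rows priced above the dataset-wide average
    for r in rows:
        r["price_to_carat"] = r["price"] / r["carat"]
        r["expensive"] = r["price"] > price_per_carat

    number_expensive = sum(r["expensive"] for r in rows)
    ```

    The Dask version performs exactly this computation, just lazily and partition by partition.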
{
"objectID": "notebooks/01b-TransitionToDask.html#dask-best-practice-guide",
"href": "notebooks/01b-TransitionToDask.html#dask-best-practice-guide",
"title": "Parallel Python with Dask",
"section": "Dask Best Practice Guide",
"text": "Dask Best Practice Guide\n\nUse set_index() sparingly to speed up data naturally sorted on a single index\n\nUse ddf.set_index('column')\n\nPersist intelligently\n\n\nIf you have the available RAM for your dataset then you can persist data in memory. On distributed systems, it is a way of telling the cluster that it should start executing the computations that you have defined so far, and that it should try to keep those results in memory.\n\ndf = df.persist()\n\n\n\nRepartition to reduce overhead\n\nAs you reduce or increase the size of your pandas DataFrames by filtering or joining, it may be wise to reconsider how many partitions you need. Adjust partitions accordingly using repartition.\ndf = df.repartition(npartitions=df.npartitions // 100)\n\nConsider storing large data in Apache Parquet format (a binary, column-based format)\n\n# Time series data with observations every second from the year 2000\nts_data\n\n# Dask can use the datetime index to reduce data efficiently\nts_data[[\"x\", \"y\"]].resample(\"1h\").mean().head()\n\n# Build up lazy data manipulations and compute selectively to reduce data\n\nts_subset = ts_data.groupby('name').aggregate({\"x\": \"sum\", \"y\": \"max\"})\n\n# Repartition appropriately; a smaller dataset doesn't need many partitions\nts_subset = ts_subset.repartition(npartitions=1)\n\nts_subset.head(10)\n\n# Set the index selectively, as it's expensive\nts_subset = ts_subset.set_index(\"name\")\n\n# Persist in RAM if possible after expensive calculations, rather than continuing to build lazy operations.\nts_subset = ts_subset.persist()\n\n# Continue with pandas if memory is fine\nts_subset_df = ts_subset.compute()\nts_subset_df.sort_values(\"name\").head(3)\n/Users/darya/opt/miniconda3/envs/quantum/lib/python3.9/site-packages/dask/dataframe/core.py:4948: UserWarning: New index has same name as existing, this is a no-op.\n warnings.warn(\n[ts_subset_df.sort_values(\"name\").head(3): Alice x=322.982262 y=0.999975; Bob x=34.466002 y=0.999986; Charlie x=222.431201 y=0.999984]"
},
{
"objectID": "notebooks/01b-TransitionToDask.html#using-external-functions-in-dask",
"href": "notebooks/01b-TransitionToDask.html#using-external-functions-in-dask",
"title": "Parallel Python with Dask",
"section": "Using external functions in Dask",
"text": "Using external functions in Dask\nfrom sklearn.linear_model import LinearRegression\n\ndef train(partition):\n if not len(partition):\n return\n est = LinearRegression()\n est.fit(partition[[\"x\"]].values, partition.y.values)\n return est\n\n'''\nThe meta argument tells Dask how to create the DataFrame or Series that will hold the result of .apply().\nIn this case, train() returns a single value, so .apply() will create a Series.\nThis means we need to tell Dask what the type of that single column should be and optionally give it a name.\n'''\nresults = ts_subset.groupby(\"name\").apply(\n train, meta=(\"LinearRegression\", object)\n).compute()\n\nresults[\"Bob\"] # linear model for a particular group\nLinearRegression()"
},
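    What `train()` does for each group — fit a one-feature linear model — can be sketched without scikit-learn using the closed-form least-squares slope (a hypothetical helper written for this note, not the course's `train()` function):

    ```python
    def fit_slope(xs, ys):
        '''Least-squares slope for a no-intercept model y ~ m * x:
        m = sum(x*y) / sum(x*x).'''
        return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

    # A perfectly linear toy group: y = 2 * x, so the fitted slope is exactly 2
    m = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
    ```

    Dask's `groupby(...).apply(train, ...)` simply runs one such per-group fit on each partition's rows and collects the fitted objects into a Series.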
{
"objectID": "notebooks/01b-TransitionToDask.html#dataframes-reading-in-messy-data",
"href": "notebooks/01b-TransitionToDask.html#dataframes-reading-in-messy-data",
"title": "Parallel Python with Dask",
"section": "DataFrames: Reading in messy data",
"text": "DataFrames: Reading in messy data\nGo through the existing Binder - it demonstrates both Dask DataFrames and delayed functions."
},
{
"objectID": "notebooks/01b-TransitionToDask.html#dask-arrays",
"href": "notebooks/01b-TransitionToDask.html#dask-arrays",
"title": "Parallel Python with Dask",
"section": "Dask Arrays",
"text": "Dask Arrays\nimport dask.array as da\nx = da.random.random((10000, 10000), chunks=(1000, 1000))\nx\n[Array repr: 762.94 MiB total in 100 chunks of 7.63 MiB; shape (10000, 10000), chunk shape (1000, 1000), dtype float64]\n# numpy syntax as usual\ny = x + x.T\nz = y[::2, 5000:].mean(axis=1) # axis 0 is rows, axis 1 is columns\nz\n# Trigger compute and investigate Client\n[Array repr: 39.06 kiB total in 10 chunks of 3.91 kiB; shape (5000,), chunk shape (500,), 7 graph layers, dtype float64]\nFor more info on arrays, go through the tutorial at\nhttps://tutorial.dask.org/"
},
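    The chunked-reduction idea behind Dask arrays can be imitated in a few lines of plain Python: reduce each chunk independently, then combine the partial results (a conceptual sketch only - the real implementation builds a lazy task graph over the chunks):

    ```python
    # Sum a long sequence chunk by chunk, then aggregate the partial sums,
    # mirroring how a chunked-array reduction works conceptually
    data = list(range(10_000))
    chunk_size = 1_000

    partial_sums = [
        sum(data[i:i + chunk_size])           # each chunk is reduced independently
        for i in range(0, len(data), chunk_size)
    ]
    total = sum(partial_sums)                 # combine the per-chunk results
    ```

    Because each per-chunk reduction is independent, Dask can schedule them across threads, processes or machines before the final combine step.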
{
"objectID": "notebooks/01b-TransitionToDask.html#diagnostics---profile-resource-efficiency-in-real-time",
"href": "notebooks/01b-TransitionToDask.html#diagnostics---profile-resource-efficiency-in-real-time",
"title": "Parallel Python with Dask",
"section": "Diagnostics - Profile resource efficiency in real time",
"text": "Diagnostics - Profile resource efficiency in real time\nThe Dask Dashboard enables resource monitoring across RAM, CPU, workers, threads and tasks (functions).\nhttps://docs.dask.org/en/stable/dashboard.html\n\nA few key definitions:\n\nBytes Stored and Bytes per Worker: Cluster memory and memory per worker.\nTask Processing/CPU Utilization/Occupancy: Tasks being processed by each worker / CPU utilization per worker / expected runtime for all tasks currently on a worker.\nProgress: Progress of a set of tasks.\n\nThere are three different colours of workers in a task graph:\n\nBlue: Processing tasks.\nGreen: Saturated - it has enough work to stay busy.\nRed: Idle - it does not have enough work to stay busy.\nTask Stream: Individual tasks across threads.\n\nWhite colour represents dead time.\n\n\n# To load the diagnostic dashboard in a web browser locally\nfrom dask.distributed import Client\nclient = Client()\nclient # client.shutdown() after use\n[Client repr: LocalCluster with 5 workers, 10 total threads, 32.00 GiB total memory; dashboard at http://127.0.0.1:8787/status]\n# Example of efficient resource utilisation\nimport dask.array as da\nx = da.random.random(size=(10_000, 10_000, 10), chunks=(1000, 1000, 5))\ny = da.random.random(size=(10_000, 10_000, 10), chunks=(1000, 1000, 5))\nz = (da.arcsin(x) + da.arcsin(y)).sum(axis=(1, 2))\nz.compute()\narray([114139.43869439, 114133.41973571, 114035.048779, ..., 114502.06960059, 114077.42169854, 114266.38837534])\n# Inefficient resource utilisation - dask introduces too much overhead for small arrays that numpy handles well\nx = da.random.random(size=(10_000_000), chunks=(1000,))\nx.sum().compute()\n4999875.009376019\n\n\nKey points\n\nThe similarity-by-design of the Dask API with pandas makes the transition easy compared to alternatives, although not all functions are replicated.\nScaling up to distributed systems, or down to simply running on your laptop, makes code easily transferable between different resources.\nDask enables parallelism without low-level alterations to code."
},
{
"objectID": "notebooks/01d-dask_delayed.html",
"href": "notebooks/01d-dask_delayed.html",
"title": "Parallel Python with Dask",
"section": "",
"text": "Can Dask be used for embarrassingly parallel problems?\nHow do you apply it to real functions?\n\n\n\n\n\nLearn about dask delayed\nApply delayed to real problems\nLearn to profile code\n\n\nIn this example we will explore the Schrodinger equation, and how we can use dask for an embarrassingly parallel problem.\nSee here for similar problems: https://github.com/natsunoyuki/Computational_Physics_in_Python\n# Import the packages we need\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom scipy import sparse\nfrom scipy.sparse import linalg as sla\nimport time\nDefine a “computationally intensive” function. Here we are solving for the eigenvalues of \\({\\displaystyle i\\hbar {\\frac {d}{dt}}\\vert Ψ (t)\\rangle ={\\hat {H}}\\vert Ψ (t)\\rangle }\\)\ndef schrodinger1D(Vfun):\n \"\"\"\n Solves the 1 dimensional Schrodinger equation numerically\n \n ------ Inputs ------\n Vfun: function, potential energy function\n \n ------- Returns -------\n evl: np.array, eigenvalues\n evt: np.array, eigenvectors\n x: np.array, x axis values\n \n ------- Params to set -------\n xmin: minimum value of the x axis\n xmax: maximum value of the x axis\n Nx: number of finite elements in the x axis\n neigs: number of eigenvalues to find\n \"\"\"\n \n xmin = -10\n xmax = 10\n Nx = 250\n neigs = 5\n\n # for this code we are using Dirichlet Boundary Conditions\n x = np.linspace(xmin, xmax, Nx) # x axis grid\n dx = x[1] - x[0] # x axis step size\n # Obtain the potential function values:\n V = Vfun(x)\n # create the Hamiltonian Operator matrix:\n H = sparse.eye(Nx, Nx, format = \"lil\") * 2\n for i in range(Nx - 1):\n H[i, i + 1] = -1\n H[i + 1, i] = -1\n \n H = H / (dx ** 2)\n # Add the potential into the Hamiltonian\n for i in range(Nx):\n H[i, i] = H[i, i] + V[i]\n # convert to csc matrix format\n H = H.tocsc()\n \n # obtain neigs solutions from the sparse matrix\n [evl, evt] = sla.eigs(H, k = neigs, which = \"SM\")\n for i in range(neigs):\n # normalize the eigen vectors\n 
evt[:, i] = evt[:, i] / np.sqrt(\n np.trapz(np.conj(\n evt[:, i]) * evt[:, i], x))\n # eigen values MUST be real:\n evl = np.real(evl)\n \n return evl, evt, x\nDefine a function to plot H.\ndef plot_H(H,neigs=5):\n evl = H[0] # energy eigenvalues\n indices = np.argsort(evl)\n\n print(\"Energy eigenvalues:\")\n for i,j in enumerate(evl[indices]):\n print(\"{}: {:.2f}\".format(i + 1, j))\n\n evt = H[1] # eigenvectors \n x = H[2] # x dimensions \n i = 0\n\n plt.figure(figsize = (4, 2))\n while i < neigs:\n n = indices[i]\n y = np.real(np.conj(evt[:, n]) * evt[:, n]) \n plt.subplot(neigs, 1, i+1) \n plt.plot(x, y)\n plt.axis('off')\n i = i + 1 \n plt.show()\nDefine some potential energy functions we want to explore.\ndef Vfun1(x, params=[1]):\n '''\n Quantum harmonic oscillator potential energy function\n '''\n V = params[0] * x**2\n return V\n \n \ndef Vfun2(x, params = 1e10):\n '''\n Infinite well potential energy function\n '''\n V = x * 0\n V[:100]=params\n V[-100:]=params\n return V\n \n \ndef Vfun3(x, params = [-0.5, 0.01, 7]):\n '''\n Double well potential energy function\n '''\n A = params[0]\n B = params[1]\n C = params[2]\n V = A * x ** 2 + B * x ** 4 + C\n return V\n\n## Plot these with \n# x = np.linspace(-10, 10, 100)\n# plt.plot(Vfun1(x))\nLet’s get an idea for how long our schrodinger equation takes to solve.\n%%time\nH = schrodinger1D(Vfun1)\nplot_H(H)\nEnergy eigenvalues:\n1: 1.00\n2: 3.00\n3: 4.99\n4: 6.99\n5: 8.98\n\nLet’s profile this function. Is there any way we can speed it up? Or apply some of the techniques we have learned? We can use the IPython/Jupyter magic command %%prun which uses cProfile.\nTLDR: maybe not! 
Not all code can be “dasked” or parallelised easily.\n%%prun -s cumulative -q -l 10 -T profile.txt\n \nH = schrodinger1D(Vfun1)\n 51234 function calls (51201 primitive calls) in 243.228 seconds\n\n Ordered by: cumulative time\n List reduced from 214 to 10 due to restriction <10>\n\n ncalls tottime percall cumtime percall filename:lineno(function)\n 1 0.000 0.000 243.228 243.228 {built-in method builtins.exec}\n 1 0.000 0.000 243.228 243.228 <string>:1(<module>)\n 1 0.003 0.003 243.227 243.227 3456611332.py:13(schrodinger1D)\n 1 0.303 0.303 243.196 243.196 arpack.py:1096(eigs)\n 876 240.090 0.274 242.492 0.277 arpack.py:719(iterate)\n 875 0.121 0.000 2.402 0.003 _interface.py:201(matvec)\n 875 0.107 0.000 2.173 0.002 _interface.py:189(_matvec)\n 875 0.205 0.000 2.059 0.002 _interface.py:303(matmat)\n 875 0.008 0.000 1.852 0.002 _interface.py:730(_matmat)\n 875 0.104 0.000 1.844 0.002 _base.py:400(dot)\nOkay. There may not be anything we can improve greatly. The slowest part is a highly optimised scipy subroutine that is calling fortran under-the-hood! So what if we wanted to run this function 2 times, 3 times, a million times? Perhaps trying different configuration parameters, or specifically here, different potential energy functions.\n# The slow way: Loop through each of the PE definitions \n# and run the function one at a time.\nH = []\nfor f in [Vfun1,Vfun2,Vfun3]:\n tic = time.time()\n result = schrodinger1D(f)\n print(\"{:.4f}s for {}\".format(time.time()-tic, f))\n H.append(result)\n \n# plot_H(H[0])\n# plot_H(H[1])\n# plot_H(H[2])\n\n\nNow let’s try to solve the three variations in parallel. This is an embarrassingly parallel problem, as each operation is completely separate from the others.\nimport dask\n%%time\nlazy_H = []\nfor f in [Vfun1,Vfun2,Vfun3]:\n H_temp = dask.delayed(schrodinger1D)(f)\n lazy_H.append(H_temp)\nlazy_H\n%%time \nHH = dask.compute(*lazy_H)\nDone! That is it. 
You can now run schrodinger1D as many times as you like in parallel, and dask will take care of distributing the work to as many CPUs as it can get its threads on!\n\n\nCan you modify some of the parameters in the schrodinger1D function and see how the timing changes?\n\n\nSolution\n\nTry changing the xmin, xmax, and Nx parameters. These adjust the resolution of the model. You can quickly see why you may want to parallelise this code, as each numerical solution can take a long time at high resolutions.\nxmin = -100\nxmax = 100\nNx = 500\nThen re-run with\n%%time\nH = schrodinger1D(Vfun1)\n\n\n\n\nCan you re-write the schrodinger1D function to accept “params” as an argument, then run multiple parameter configurations with a single Potential Energy function?\n\n\nStep 1\n\nModify the schrodinger1D function to accept an additional argument, and pass that argument to the Vfun call.\n#Need to change line 1\ndef schrodinger1D(Vfun, params): \n ...\n # And change line 29\n V = Vfun(x, params = params)\n\n\n\nStep 2\n\nChoose the Vfun you want to explore, and make a list of parameters we want to sweep. I will be looking at Vfun3. 
A way to make a set of params is to use the product function from the itertools package.\nimport itertools\nparam_config = [[-1,0,1],[-1,0,1],[-1,0,1]]\nparams=list(map(list, itertools.product(*param_config)))\nprint(params)\n[-1, -1, -1]\n[-1, -1, 0]\n[-1, -1, 1]\n[-1, 0, -1]\n[-1, 0, 0]\n[-1, 0, 1]\n[-1, 1, -1]\n[-1, 1, 0]\n[-1, 1, 1]\n[0, -1, -1]\n[0, -1, 0]\n[0, -1, 1]\n[0, 0, -1]\n[0, 0, 0]\n[0, 0, 1]\n[0, 1, -1]\n[0, 1, 0]\n[0, 1, 1]\n[1, -1, -1]\n[1, -1, 0]\n[1, -1, 1]\n[1, 0, -1]\n[1, 0, 0]\n[1, 0, 1]\n[1, 1, -1]\n[1, 1, 0]\n[1, 1, 1]\n\n\n\nStep 3\n\nRe-write the dask delayed loop to include your new parameters.\n%%time\nlazy_H = []\nfor param in params:\n print(param)\n H_temp = dask.delayed(schrodinger1D)(Vfun3, param)\n lazy_H.append(H_temp)\n \nHH = dask.compute(*lazy_H)\n \n\n\n\n\nHow do you implement this same functionality in native Python Multiprocessing?\n\n\nSolution\n\nThe answer looks something like this:\nfrom multiprocessing import Pool\n\nwith Pool(processes=ncpus) as pool: \n y=pool.imap(schrodinger1D, [Vfun1,Vfun2,Vfun3])\n pool.close()\n pool.join()\n outputs = [result for result in y]\nSee the complete solution and description here: schrodinger1D.py\n\n\n\n\n\nDask can be used for embarrassingly parallel problems.\nFinding where to make your code faster, and understanding what kind of code/data you have, will determine which approaches you use."
},
{
"objectID": "notebooks/01b-dask_demo.html",
"href": "notebooks/01b-dask_demo.html",
"title": "Parallel Python with Dask",
"section": "",
"text": "Exploring Dask in Data bound and Input-Output bound computations\n\n\n\n\n\nTODO\n\n\n#More\nTODO - more stuff\nConcept\n# Code\n\n\nTO DO - This describes a challenge.\n\n\nSolution\n\n#Declare the name of the function\ndef solveme(x):\n '''Run code\n ''' \n \n pass\n\n\n\n\n\nTODO1\nTODO1"
},
{
"objectID": "notebooks/01c-scientific_computing.html",
"href": "notebooks/01c-scientific_computing.html",
"title": "Parallel Python with Dask",
"section": "",
"text": "Parallel computing is when many different tasks are carried out simultaneously. Python does this by creating independent processes that ship data, program files and libraries to an isolated ecosystem where computation is performed. There are three main models for parallel computing:\n\nEmbarrassingly parallel: the code does not need to synchronize/communicate with other instances, and you can run multiple instances of the code separately, and combine the results later. If you can do this, great! (array jobs, task queues)\nShared memory parallelism: Parallel threads need to communicate and do so via the same memory (variables, state, etc). (OpenMP)\nMessage passing: Different processes manage their own memory segments. They share data by communicating (passing messages) as needed. (Message Passing Interface (MPI)).\n\n\n\nTraditionally, making code parallel was implemented at a low level.\nHowever, open source software has evolved dramatically over the last few years, allowing more high-level implementations and concise ‘pythonic’ syntax that wraps around low-level tools. These modern tools also address why code takes a long time to run in the Big Data / Data Science world we live in, where slow code is increasingly the result of data-bound rather than function-bound bottlenecks.\nThe focus of this course is to use these modern high-level implementations to address both data-bound and function-bound bottlenecks.\n\n\n\n\nA process is a collection of resources including program files and memory that operates as an independent entity. Since each process has a separate memory space, it can operate independently from other processes. It cannot easily access shared data in other processes.\nA thread is the unit of execution within a process. A process can have anywhere from just one thread to many threads. Threads are considered lightweight because they use far fewer resources than processes. 
Threads also share the same memory space, so they are not independent.\n\n\n\n\n\n\n\n\n\nThe designers of the Python language made the choice that only one thread in a process can run actual Python code, by using the so-called global interpreter lock (GIL).\nExternal libraries (NumPy, SciPy, Pandas, etc), written in C or other languages, can release the lock and run multi-threaded. Code written in native Python has the GIL limitation.\nThe multiprocessing library can be used to sidestep the GIL by running native Python code in separate processes.\nimport dask\nimport dask.distributed\nimport dask.array as da\nimport dask.dataframe as dd\nimport pandas as pd\nimport random\nfrom multiprocessing import Pool ### The default pool makes one process per CPU\n\n\n\n# reference:\n# https://aaltoscicomp.github.io/python-for-scicomp/parallel/\n\ndef sample(n):\n n_inside_circle = 0\n for i in range(n):\n x = random.random()\n y = random.random()\n if x**2 + y**2 < 1.0:\n n_inside_circle += 1\n return n_inside_circle / n * 4\n# Using apply from pandas\nps = pd.Series([10**5,20**5])\nps.apply(sample)\n0 3.13884\n1 3.14272\ndtype: float64\n# Create a pool object with a with statement \nwith Pool() as p:\n result = p.map(sample,ps)\n # will engage p.close() automatically\nProcess SpawnPoolWorker-1:\nTraceback (most recent call last):\n ...\nAttributeError: Can't get attribute 'sample' on <module '__main__' (built-in)>\n[identical tracebacks from the other pool workers truncated]\nThis error appears because the default \"spawn\" start method on macOS and Windows cannot import a function defined in a notebook's __main__; running the same code from a script avoids it.\nMultiprocessing introduces an initial fixed cost in time (creating Pool objects). Knowing what hardware you are working on is needed to tailor the number of processes created to what is available. There is a risk of creating too many processes, which makes the initial fixed cost excessively large.\n\n\n\n# Create the dask equivalent input\nds = dd.from_pandas(ps,npartitions = 2)\n%%timeit #605 ms ± 10.9 ms per loop\nresult = ds.apply(sample,meta=('x', 'float64')).mean().compute()\n894 ms ± 9.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n%%timeit # 1.08 s ± 47.8 ms per loop\np = Pool()\nresult = p.map(sample,ps)\np.close()\n[worker tracebacks truncated, as above]\n\n\n\nDask uses multiprocessing by default to overcome the GIL. 
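A common fix for the AttributeError shown in the worker tracebacks above is to define the worker function at module level in a plain .py script, so spawned workers can import it, and to guard the Pool creation. A minimal sketch reusing the pi-estimation sample function (the process count here is arbitrary):

```python
import random
from multiprocessing import Pool

def sample(n):
    # Monte Carlo estimate of pi from n random points in the unit square.
    n_inside_circle = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle / n * 4

if __name__ == '__main__':
    # The guard stops spawned workers from re-running the Pool setup,
    # and sample() lives at module level so workers can unpickle it.
    with Pool(processes=2) as pool:
        estimates = pool.map(sample, [10**5, 10**5])
    print(estimates)
```

pool.map blocks until all workers finish and returns results in input order; the Dask examples below achieve the same effect with less ceremony.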
Hence comparing the run time of the multiprocessing library to Dask with a function-bound problem will yield similar results.\nYet Dask offers an ecosystem of resource management (Scheduler, diagnostics, data partitions and Task Graphs) that makes it a more attractive way to achieve the same thing in most cases. Resource management is handled automatically by the scheduler.\n# for reference, delaying the same function.\n@dask.delayed\ndef dd_sample(n):\n n_inside_circle = 0\n for i in range(n):\n x = random.random()\n y = random.random()\n if x**2 + y**2 < 1.0:\n n_inside_circle += 1\n return n_inside_circle / n * 4\n\n\nresult = ds.apply(dd_sample,meta=('x', 'float64')).mean().compute()\nresult\n3.1386374999999997\n%%timeit #595 ms ± 1.54 ms per loop\nresult = ds.apply(dd_sample,meta=('x', 'float64')).mean().compute()\n855 ms ± 7.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n\n\n\nMultiprocessing (or mpi4py, not covered here) is the traditional way to make functions run in parallel in Python\nUsing Dask and its ecosystem is the modern approach"
},
{
"objectID": "notebooks/01a-pandas_fundamentals.html",
"href": "notebooks/01a-pandas_fundamentals.html",
"title": "Parallel Python with Dask",
"section": "",
"text": "Learn the basic Pandas data structures and methods\nExplore a typical data manipulation and plotting process\n\n\n\n\nPandas API Reference\n# Import libraries and datasets\nimport pandas as pd\nimport numpy as np\nimport scipy as sp\nimport seaborn as sns\nimport dask\n\nts_data = dask.datasets.timeseries()\ndf = sns.load_dataset('diamonds')\ndf.head() #inspect DataFrame\n\n carat cut color clarity depth table price x y z\n0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43\n1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31\n2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31\n3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63\n4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75\n\n# DataFrame attributes can be accessed\n\ndf.index # name of index, in this case a range\ndf.columns # variables carat to z\ndf.values # values as a numpy.ndarray\ndf.dtypes # data types of variables\ndf.shape # rows to column structure\ndf.ndim # number of dimensions\n\n# functions attached to pd.Series can be engaged\ndf.cut # referenced column on its own\ndf.cut.value_counts() \ndf.cut.unique()\ndf.carat.mean()\n0.7979397478680014\n\n\n\nFunctions are available that are attached to the DataFrame class\nCommon methods are:\n\nfilter: Subset the dataframe rows or columns according to the specified index labels.\nassign: assign / mutate new columns in dataframe\nquery: query the columns of a DataFrame with a boolean expression\nsort_values : arrange rows of DataFrame\napply : Apply a function along an axis of the DataFrame\n\n# More complex subsetting of DataFrame by observations or variables\n\n# filter variables\ndf.filter(['cut']) # returns pandas.DataFrame\ndf['cut'] # as opposed to this which returns pandas.Series, or df.cut\ndf.filter([\"carat\",\"cut\"]) # filter more than one variable \ndf.filter(regex= 
\"^c\") # with regex - a whole other topic...\n\n# query observations\n# The quotes in query need to be single-outside, double-inside \ndf.query('color == \"E\"') # filter observations by criteria\ndf.query('cut == \"Ideal\" or cut == \"Premium\"') # filter observations with logical expression\ndf.query('cut == \"Ideal\" | cut == \"Premium\"') # same thing\ndf.query(\"cut.str.match('^G')\") # query doesn't have regex parameter but can be incorporated via str attribute\ndf.query(\"clarity.str.contains('^\\w+1')\")\ndf.query('price > 500') # querying numeric\n\n# other ways to filter variables or observations by string exist\nsubset = [col for col in df.columns.str.contains('c')] # list comprehension returning list of booleans\ndf.filter(df.columns[subset]) # which are tweaked to filter command\ndf[df.cut.str.startswith('Good')] # subsetting observations \n\n# most DataFrame functions return a DataFrame so one can combine different DataFrame operations\ndf.query('price < 500').head() \n\n# chaining manipulations into larger readable structure\n(df\n .filter(['carat', 'color'])\n .query('color == \"E\"')\n .head(3))\n\n# or using functions applied after a chain\n(df\n .query('price < 4000')\n .price.std())\n\n# While we've only so far looked at functions attached to pd.Dataframe,\n# one can use external functions provided what it expects is catered for.\n\nnp.linalg.norm(df.filter(['x','y','z']).values) # norm expects an array\n\n2094.3009834069335\n\n# arrange data by values\ndf.sort_values(by = ['carat','price'],ascending = False)\n\n# groupby: splits DataFrame into multiple compartments and returns a group-by object\n# which aggregations can be applied on each group\ndf.groupby('cut').price.agg('std')\ndf.groupby('cut').mean(numeric_only=True)\n\n# Using assign to create / mutate a variable\n\ndf.assign(size = 1) #fills same value\ndf = df.assign(size = np.sqrt(df.x ** 2 + df.y ** 2 + df.z ** 2)) #element wise vector addition\n\n# apply : apply a function to a 
DataFrame along an axis: axis = 0 applies the function to each column, axis = 1 to each row \ndf.assign(norm = df.filter(['x','y','z']).apply(np.linalg.norm,axis = 1)) # element / rowwise norm equivalent\n\ndf.assign(demeaned = lambda df : df.price - df.price.mean()) \n\n# if aggregation is based on grouping\ndf_cut = df.groupby('cut')\ndf.assign(demeaned = df.price - df_cut.price.transform('mean')) #transform\n\n# map : Map values of Series according to an input mapping or function.\n# very similar to apply but acts on pd.Series rather than pd.DataFrame\n\ndf.price.map(lambda r : r + 1) #returns a pd.Series\n\n# applymap: Apply a function to a DataFrame element-wise\ndf.filter(['x','y','z']).applymap(lambda x : x **2) \n\n# Reshaping data with melt\n# Melt converts data to long format. \n# pivot is the complementary operation that expands data wider\n''' some melt arguments are :\nid_vars ; Column(s) to use as identifier variables\nvalue_vars ; Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.\nvar_name ; Name to use for the ‘variable’ column\nvalue_name ; Name to use for the ‘value’ column\n'''\n\ndf_longer = (df.filter(['cut','carat','clarity','x','y','z','price'])\n .melt(id_vars=['cut','price','clarity','carat'], \n value_vars = ['x','y','z'],\n value_name = \"dim\"\n )\n)\n\n# Longer format is usually good for plotting.\n\nsns.relplot(x=\"dim\", y=\"carat\", hue=\"cut\", size=\"price\",\n sizes=(10, 200), alpha=.5, palette=\"muted\",\n height=6, data=df_longer.query('dim < 12'));\n\n\n\n\nMake a plot of the carat vs price, group the colors by the cut and the symbol size by the color of the diamond. 
Limit the dataset to just show the “I1” clarity.\n\n\nSolution\n\nThis can be done in a few ways, but Seaborn interfaces with pandas-like dataframes seamlessly to make these simple data-wrangling tasks easy.\nsns.relplot(x=\"carat\", y=\"price\", hue=\"cut\",size='color',\n sizes=(10, 200), alpha=.5, palette=\"muted\",\n height=6, data=df.query('clarity == \"I1\"'))\n\n\n\n\nTechniques for writing efficient Python:\n\nUnderstand the difference between mutable and immutable types\nUse np functions rather than functions written in pure Python\nConsider generators\nCache results\n\nA few useful links with ideas on how to do this:\n\nhttps://caam37830.github.io/book/index.html\nhttps://python-course.eu/\nhttps://realpython.com/fibonacci-sequence-python/\n\ndef FibonacciGenerator(n):\n \"\"\" \n note: n is limit of fibonacci value rather than count\n \"\"\"\n a = 0\n b = 1\n while a < n:\n yield a\n a, b = b, a + b\n \n# Recursive, slower case\ndef fibonacci_of(n):\n if n in {0, 1}: # Base case\n return n\n return fibonacci_of(n - 1) + fibonacci_of(n - 2) #return same function \n\ndef is_even(sequence):\n \"\"\" reduces a sequence to even numbers\n \"\"\"\n for n in sequence:\n if n % 2 == 0:\n yield n\n# Can consume generators by converting to list\nlist(is_even([1,2,3,4]))\nlist(FibonacciGenerator(500))\n[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]\n# Build up sequences of manipulations in memory with generators \n# and selectively trigger consumption for efficiency\nlist(is_even(FibonacciGenerator(500)))\n[0, 2, 8, 34, 144]\n%%timeit #758 nanos ± 0.393 ns per loop\nresults1 = list(FibonacciGenerator(500)) #generator brought into memory via list\n756 ns ± 5.39 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n%%timeit #192 microsec ± 1.92 ns per loop\nresults2 = [fibonacci_of(n) for n in range(15)]\n189 µs ± 352 ns per loop (mean ± std. dev. 
of 7 runs, 10,000 loops each)\n# Caching is another method to improve efficiency, new syntax as of Python 3.9\nfrom functools import cache\n\n@cache\ndef factorial(n):\n return n * factorial(n-1) if n else 1\nfactorial(10) # no previously cached result, makes 11 recursive calls\nfactorial(5) # just looks up the cached result\nfactorial(12) # makes two new recursive calls, the other 10 are cached\n479001600\n\n\n\n\npd.DataFrame and pd.Series are the most common data structures for tabular data.\nFunctions are “attached” to the objects. For example pd.Series.sum(), pd.Series.str.contains(), pd.Series.quantile(), etc.\nThe Pandas API reference is invaluable for remembering the notation."
}
]