Commit 6a99c72

[DOCS] Adds outlier detection example (elastic#524)
1 parent 7b0702e commit 6a99c72

7 files changed

Lines changed: 287 additions & 2 deletions

docs/en/stack/get-started-trial.asciidoc

Lines changed: 1 addition & 2 deletions
@@ -9,8 +9,7 @@ For more information about Elastic license levels, see
 https://www.elastic.co/subscriptions.

 You can start a 30-day trial to try out all of the platinum features, including
-{security-features} and {ml-features}. Click **Start trial** on the
-**License Management** page in {kib}.
+{ml-features}. Click **Start trial** on the **License Management** page in {kib}.

 IMPORTANT: If your cluster has already activated a trial license for the current
 major version, you cannot start a new trial. For example, if you have already
Lines changed: 268 additions & 0 deletions
@@ -0,0 +1,268 @@
1+
[role="xpack"]
2+
[testenv="platinum"]
3+
[[ecommerce-outliers]]
4+
=== Finding outliers in the eCommerce sample data
5+
6+
beta[]
7+
8+
The goal of <<dfa-outlier-detection,{oldetection}>> is to find the most unusual
9+
documents in an index. Let's try to detect unusual customer behavior in the
10+
{kibana-ref}/add-sample-data.html[eCommerce sample data set].
11+
12+
. Obtain a license that includes the {ml-features}.
13+
+
14+
--
15+
include::{docdir}/get-started-trial.asciidoc[]
16+
--
17+
18+
. If the {es} {security-features} are enabled, obtain a user ID with sufficient
19+
privileges to complete these steps.
20+
+
21+
--
22+
You need `manage_data_frame_transforms` cluster privileges to preview and create
23+
{transforms}. Members of the built-in `data_frame_transforms_admin`
24+
role have these privileges.
25+
26+
You must also be a member of the `machine_learning_admin` built-in role to
27+
create and manage {dfanalytics-jobs}.
28+
29+
You also need `read` and `view_index_metadata` index privileges on the source
30+
indices and `read`, `create_index`, and `index` privileges on the destination
31+
indices.
32+
33+
For more information, see <<security-privileges>> and <<built-in-roles>>.
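
As a minimal sketch, the cluster and index privileges listed above could be
collected in a single role with the create or update roles API. The role name
`ecommerce_outliers_example` is hypothetical and the index names are the ones
used later in this example. The built-in `machine_learning_admin` role still
has to be assigned to the user separately:

[source,console]
--------------------------------------------------
# The role name is a hypothetical example, not part of the sample data
PUT _security/role/ecommerce_outliers_example
{
  "cluster": [ "manage_data_frame_transforms" ],
  "indices": [
    {
      "names": [ "kibana_sample_data_ecommerce" ],
      "privileges": [ "read", "view_index_metadata" ]
    },
    {
      "names": [ "ecommerce-customer-sales", "ecommerce-outliers" ],
      "privileges": [ "read", "view_index_metadata", "create_index", "index" ]
    }
  ]
}
--------------------------------------------------
// TEST[skip:illustrative sketch]

The second entry grants both source and destination privileges on
`ecommerce-customer-sales` because that index is the destination of the
{transform} and the source of the {dfanalytics-job}.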
34+
--
35+
36+
. Create a {transform} that generates an entity-centric index with numeric or
37+
boolean data to analyze.
38+
+
39+
--
40+
In this example, we'll use the eCommerce orders sample data and pivot the data
41+
such that we get a new index that contains a sales summary for each customer.
42+
43+
In particular, create a {transform} that calculates the total product quantity
44+
(`products.quantity`) and the total price (`products.taxful_price`) in all of
45+
the orders, grouped by customer (`customer_full_name`). Also include a value
46+
count aggregation, so that we know how many orders (`order_id`) exist for each
47+
customer.
48+
49+
You can preview the {transform} before you create it in {kib}:
50+
51+
[role="screenshot"]
52+
image::images/ecommerce-transform-preview.jpg["Creating a {transform} in {kib}"]
53+
54+
Alternatively, you can preview and create the {transform} with the following
55+
APIs:
56+
57+
[source,console]
58+
--------------------------------------------------
59+
POST _data_frame/transforms/_preview
60+
{
61+
"source": {
62+
"index": [
63+
"kibana_sample_data_ecommerce"
64+
]
65+
},
66+
"pivot": {
67+
"group_by": {
68+
"customer_full_name.keyword": {
69+
"terms": {
70+
"field": "customer_full_name.keyword"
71+
}
72+
}
73+
},
74+
"aggregations": {
75+
"products.quantity.sum": {
76+
"sum": {
77+
"field": "products.quantity"
78+
}
79+
},
80+
"products.taxful_price.sum": {
81+
"sum": {
82+
"field": "products.taxful_price"
83+
}
84+
},
85+
"order_id.value_count": {
86+
"value_count": {
87+
"field": "order_id"
88+
}
89+
}
90+
}
91+
}
92+
}
93+
94+
PUT _data_frame/transforms/ecommerce-customer-sales
95+
{
96+
"source": {
97+
"index": [
98+
"kibana_sample_data_ecommerce"
99+
]
100+
},
101+
"pivot": {
102+
"group_by": {
103+
"customer_full_name.keyword": {
104+
"terms": {
105+
"field": "customer_full_name.keyword"
106+
}
107+
}
108+
},
109+
"aggregations": {
110+
"products.quantity.sum": {
111+
"sum": {
112+
"field": "products.quantity"
113+
}
114+
},
115+
"products.taxful_price.sum": {
116+
"sum": {
117+
"field": "products.taxful_price"
118+
}
119+
},
120+
"order_id.value_count": {
121+
"value_count": {
122+
"field": "order_id"
123+
}
124+
}
125+
}
126+
},
127+
"description": "E-commerce sales by customer",
128+
"dest": {
129+
"index": "ecommerce-customer-sales"
130+
}
131+
}
132+
--------------------------------------------------
133+
// TEST[skip:set up sample data]
134+
135+
For more details about creating {transforms}, see <<ecommerce-dataframes>>.
136+
--
137+
138+
. Start the {transform}.
139+
+
140+
--
141+
142+
TIP: Even though resource utilization is automatically adjusted based on the
143+
cluster load, a {transform} increases search and indexing load on your
144+
cluster while it runs. If the load becomes excessive, you can stop the
145+
{transform}.
146+
147+
You can start, stop, and manage {transforms} in {kib}. Alternatively, you can
148+
use the {ref}/start-data-frame-transform.html[start {transforms}] API. For
149+
example:
150+
151+
[source,console]
152+
--------------------------------------------------
153+
POST _data_frame/transforms/ecommerce-customer-sales/_start
154+
--------------------------------------------------
155+
// TEST[skip:setup kibana sample data]
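
To check how far the {transform} has progressed, or to stop it if the load is
too high, something like the following should work. It is a sketch that assumes
the `ecommerce-customer-sales` {transform} ID used in the previous step:

[source,console]
--------------------------------------------------
# Check the state and statistics of the transform
GET _data_frame/transforms/ecommerce-customer-sales/_stats

# Stop the transform if it puts too much load on the cluster
POST _data_frame/transforms/ecommerce-customer-sales/_stop
--------------------------------------------------
// TEST[skip:setup kibana sample data]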
156+
157+
--
158+
159+
. Create a {dfanalytics-job} to detect outliers in the new entity-centric index.
160+
+
161+
--
162+
There is a wizard for creating {dfanalytics-jobs} on the
163+
*Machine Learning* > *Analytics* page in {kib}:
164+
165+
[role="screenshot"]
166+
image::images/ecommerce-outlier-job.jpg["Create a {dfanalytics-job} in {kib}"]
167+
168+
Alternatively, you can use the
169+
{ref}/put-dfanalytics.html[create {dfanalytics-jobs} API]. For example:
170+
171+
[source,console]
172+
--------------------------------------------------
173+
PUT _ml/data_frame/analytics/ecommerce
174+
{
175+
"source": {
176+
"index": "ecommerce-customer-sales"
177+
},
178+
"dest": {
179+
"index": "ecommerce-outliers"
180+
},
181+
"analysis": {
182+
"outlier_detection": {
183+
}
184+
},
185+
"analyzed_fields" : {
186+
"includes" : ["products.quantity.sum","products.taxful_price.sum","order_id.value_count"]
187+
}
188+
}
189+
--------------------------------------------------
190+
// TEST[skip:setup kibana sample data]
191+
--
192+
193+
. Start the {dfanalytics-job}.
194+
+
195+
--
196+
You can start, stop, and manage {dfanalytics-jobs} on the
197+
*Machine Learning* > *Analytics* page in {kib}. Alternatively, you can use the
198+
{ref}/start-dfanalytics.html[start {dfanalytics-jobs}] and
199+
{ref}/stop-dfanalytics.html[stop {dfanalytics-jobs}] APIs. For
200+
example:
201+
202+
[source,console]
203+
--------------------------------------------------
204+
POST _ml/data_frame/analytics/ecommerce/_start
205+
--------------------------------------------------
206+
// TEST[skip:setup kibana sample data]
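
The analysis runs in the background, so it can be useful to check its state
before you look for results. As a sketch, assuming the `ecommerce` job ID
created above, you could use the get {dfanalytics-jobs} statistics API and, if
needed, the stop API:

[source,console]
--------------------------------------------------
# Check the state of the job
GET _ml/data_frame/analytics/ecommerce/_stats

# Stop the job before it finishes, if necessary
POST _ml/data_frame/analytics/ecommerce/_stop
--------------------------------------------------
// TEST[skip:setup kibana sample data]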
207+
--
208+
209+
. View the results of the {oldetection} analysis.
210+
+
211+
--
212+
The {dfanalytics-job} creates an index that contains the original data and
213+
{olscores} for each document. The {olscore} indicates how different each entity
214+
is from other entities.
215+
216+
In {kib}, you can view the results from the {dfanalytics-job} and sort them
217+
on the outlier score:
218+
219+
[role="screenshot"]
220+
image::images/outliers.jpg["View {oldetection} results in {kib}"]
221+
222+
The `ml.outlier_score` is a value between 0 and 1. The larger the value, the
223+
more likely the entity is to be an outlier.
224+
225+
In addition to an overall outlier score, each document is annotated with feature
226+
influence values for each field. These values add up to 1 and indicate which
227+
fields are the most important in deciding whether an entity is an outlier or
228+
inlier. For example, the dark shading on the `products.taxful_price.sum` field
229+
for Wagdi Shaw indicates that the sum of the product prices was the most
230+
influential feature in determining that Wagdi is an outlier.
231+
232+
If you want to see the exact feature influence values, you can retrieve them
233+
from the index that is associated with your {dfanalytics-job}. For example:
234+
235+
[source,console]
236+
--------------------------------------------------
237+
GET ecommerce-outliers/_search?q="Wagdi Shaw"
238+
--------------------------------------------------
239+
// TEST[skip:setup kibana sample data]
240+
241+
The search results include the following {oldetection} scores:
242+
243+
[source,js]
244+
--------------------------------------------------
245+
...
246+
"ml" : {
247+
"outlier_score" : 0.9653657078742981,
248+
"feature_influence.products.quantity.sum" : 0.00592468399554491,
249+
"feature_influence.order_id.value_count" : 0.01975759118795395,
250+
"feature_influence.products.taxful_price.sum" : 0.974317729473114
251+
}
252+
...
253+
--------------------------------------------------
254+
// NOTCONSOLE
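
To list the most unusual customers without opening {kib}, one option is to sort
the destination index on the outlier score. This is a minimal sketch that
assumes the `ecommerce-outliers` index created above. The `size` of 5 is an
arbitrary choice:

[source,console]
--------------------------------------------------
# Return the five documents with the highest outlier scores
GET ecommerce-outliers/_search
{
  "size": 5,
  "sort": [
    { "ml.outlier_score": { "order": "desc" } }
  ]
}
--------------------------------------------------
// TEST[skip:setup kibana sample data]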
255+
--
256+
257+
Now that you've found unusual behavior in the sample data set, consider how you
258+
might apply these steps to other data sets. If you have data that is already
259+
marked up with true outliers, you can determine how well the {oldetection}
260+
algorithms perform by using the evaluate {dfanalytics} API. See
261+
<<ml-dfanalytics-evaluate>>.
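
The eCommerce sample data has no such markup, so the following request is only
a sketch of what an evaluation might look like. The index name
`my_analytics_dest_index` and the `is_outlier` field that holds the true labels
are hypothetical, and the `binary_soft_classification` evaluation type is the
one used in this beta release, so it might differ in other versions:

[source,console]
--------------------------------------------------
# The index and the is_outlier field are hypothetical placeholders
POST _ml/data_frame/_evaluate
{
  "index": "my_analytics_dest_index",
  "evaluation": {
    "binary_soft_classification": {
      "actual_field": "is_outlier",
      "predicted_probability_field": "ml.outlier_score"
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative sketch]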
262+
263+
TIP: If you do not want to keep the {transform} and the {dfanalytics-job}, you
264+
can delete them in {kib} or use the
265+
{ref}/delete-data-frame-transform.html[delete {transform} API] and
266+
{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When
267+
you delete {transforms} and {dfanalytics-jobs}, the destination indices and
268+
{kib} index patterns remain.
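
For example, a cleanup along the following lines should remove everything that
this example created. It is a sketch that assumes the IDs and index names used
above. The last two requests delete the destination indices, which the delete
APIs leave in place:

[source,console]
--------------------------------------------------
# Delete the transform and the analytics job
DELETE _data_frame/transforms/ecommerce-customer-sales

DELETE _ml/data_frame/analytics/ecommerce

# Optionally delete the destination indices as well
DELETE ecommerce-customer-sales

DELETE ecommerce-outliers
--------------------------------------------------
// TEST[skip:setup kibana sample data]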
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+[role="xpack"]
+[testenv="platinum"]
+[[dfanalytics-examples]]
+== {dfanalytics-cap} examples
+++++
+<titleabbrev>Examples</titleabbrev>
+++++
+
+beta[]
+
+These examples demonstrate how to use {dfanalytics} to derive useful
+insights from your data.
+
+* <<ecommerce-outliers>>
+
+
+include::ecommerce-outliers.asciidoc[]

docs/en/stack/ml/df-analytics/index.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -30,4 +30,5 @@ include::dfa-outlierdetection.asciidoc[]
 include::dfa-regression.asciidoc[]
 include::evaluatedf-api.asciidoc[]
 include::api-quickref.asciidoc[]
+include::examples.asciidoc[]
 include::dfanalytics-limitations.asciidoc[]
