IBMGenerator

IBM Synthetic Data Generator for Itemsets and Sequences

Type make to build the executable file 'gen'.

Type ./gen -help for general help.

For itemsets, type ./gen lit -help

For sequences, type ./gen seq -help

Itemset Datasets

These datasets mimic the transactions in a retailing environment, where people tend to buy sets of items together: the so-called maximal potentially frequent itemsets. The sizes of these itemsets are clustered around a mean, with a few long itemsets. A transaction may contain one or more such itemsets. Transaction sizes are also clustered around a mean, but a few transactions may contain many items.

Let D denote the number of transactions, T the average transaction size, I the average size of a maximal potentially frequent itemset, L the number of maximal potentially frequent itemsets, and N the number of items. The data is generated using the following procedure. We first generate L maximal itemsets of average size I by choosing from the N items. We next generate D transactions of average size T by choosing from the L maximal itemsets.
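The two-step procedure above can be sketched in Python as follows. This is a simplified illustration, not the code in gen.cc: the function names, the Poisson size distribution, and the uniform choice of itemsets are assumptions, and the correlation and confidence parameters listed in the options are ignored.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's method, for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_itemsets(n_items, n_patterns, avg_pat_len,
                      n_trans, avg_trans_len, seed=0):
    """Return (patterns, transactions) as lists of sorted item-id lists."""
    rng = random.Random(seed)

    # Step 1: L maximal itemsets of average size I, drawn from the N items.
    patterns = []
    for _ in range(n_patterns):
        size = min(n_items, max(1, poisson(rng, avg_pat_len)))
        patterns.append(sorted(rng.sample(range(n_items), size)))

    # Step 2: D transactions of average size T, built from those itemsets.
    transactions = []
    for _ in range(n_trans):
        target = max(1, poisson(rng, avg_trans_len))
        items = set()
        # Bounded number of picks (assumption) so the loop always ends,
        # even if the itemsets cover fewer than `target` distinct items.
        for _ in range(10 * target):
            if len(items) >= target:
                break
            items.update(rng.choice(patterns))
        transactions.append(sorted(items))
    return patterns, transactions


if __name__ == "__main__":
    pats, trans = generate_itemsets(n_items=1000, n_patterns=100,
                                    avg_pat_len=4, n_trans=5,
                                    avg_trans_len=10)
    for t in trans:
        print(" ".join(map(str, t)))
```

Because whole itemsets are added at a time, a transaction may overshoot its target size slightly, so the sketch only approximates the requested averages.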

Type: ./gen lit -help

for all the parameters to generate itemset datasets:

Command Line Options:

-ncust number_of_customers (in 1000's) (default: 100)

-slen avg_trans_per_customer (default: 10)

-tlen avg_items_per_transaction (default: 2.5)

-nitems number_of_different_items (in '000s) (default: 10000)

-rept repetition-level (default: 0)

-seq.npats number_of_seq_patterns (default: 5000)

-seq.patlen avg_length_of_maximal_pattern (default: 4)

-seq.corr correlation_between_patterns (default: 0.25)

-seq.conf avg_confidence_in_a_rule (default: 0.75)

-lit.npats number_of_patterns (default: 25000)

-lit.patlen avg_length_of_maximal_pattern (default: 1.25)

-lit.corr correlation_between_patterns (default: 0.25)

-lit.conf avg_confidence_in_a_rule (default: 0.75)

-fname (write to filename.data and filename.pat)

-ascii (Write data in ASCII format; default: False)

-version (to print out version info)

An example run can be:

./gen lit -ntrans 100 -tlen 10 -nitems 1 -npats 1000 -patlen 4 -fname T10I4D100K -ascii

This will generate a data file named "T10I4D100K.data".
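The file name in the example appears to encode the parameters: T for -tlen, I for -patlen, and D for -ntrans. A small Python helper can build both the name and the argument list. The helper name and its defaults are hypothetical, and treating -ntrans as a count in thousands is an assumption inferred from the "D100K" suffix.

```python
def itemset_command(tlen, patlen, ntrans_k, nitems_k=1, npats=1000):
    """Build a dataset name and ./gen argument list for an itemset run.

    Assumption: -ntrans is given in thousands, so ntrans_k=100
    corresponds to the "D100K" part of the name.
    """
    def fmt(x):
        # Compact formatting: 10 -> "10", 2.5 -> "2.5"
        return f"{x:g}"

    name = f"T{fmt(tlen)}I{fmt(patlen)}D{fmt(ntrans_k)}K"
    argv = [
        "./gen", "lit",
        "-ntrans", fmt(ntrans_k),
        "-tlen", fmt(tlen),
        "-nitems", fmt(nitems_k),
        "-npats", fmt(npats),
        "-patlen", fmt(patlen),
        "-fname", name,
        "-ascii",
    ]
    return name, argv


if __name__ == "__main__":
    name, argv = itemset_command(tlen=10, patlen=4, ntrans_k=100)
    print(name)
    print(" ".join(argv))
```

Once gen has been built, the returned list can be passed to subprocess.run(argv, check=True) to run the generator.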

Sequence Datasets

The generator produces sequence datasets that mimic real-world transactions, where people buy a sequence of sets of items. Some customers may buy only some items from the sequences, or they may buy items from multiple sequences. The input-sequence size and event size are clustered around a mean, and a few of them may have many elements.

The datasets are generated using the following process. First, NI maximal events of average size I are generated by choosing from N items. Then, NS maximal sequences of average size S are created by assigning events from NI to each sequence. Next, a customer (or input-sequence) of average C transactions (or events) is created, and sequences in NS are assigned to different customer elements, respecting the average transaction size of T. The generation stops when D input-sequences have been generated. Default values are NS = 5000, NI = 25000 and N = 10000.
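The process above can be sketched in Python as follows. This is a simplified illustration rather than the generator's actual code: the function names, the Poisson sizes, the uniform choices, and the rule for fitting a pattern into a customer are all assumptions, and the correlation, confidence, and repetition parameters are ignored.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's method, for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_sequences(n_items, n_events, avg_event_len, n_seqs, avg_seq_len,
                       n_cust, avg_cust_len, avg_trans_len, seed=0):
    """Return (events, sequences, customers) for a toy sequence dataset."""
    rng = random.Random(seed)

    # NI maximal events of average size I, chosen from the N items.
    events = []
    for _ in range(n_events):
        size = min(n_items, max(1, poisson(rng, avg_event_len)))
        events.append(sorted(rng.sample(range(n_items), size)))

    # NS maximal sequences of average size S, each an ordered list of events.
    sequences = []
    for _ in range(n_seqs):
        length = max(1, poisson(rng, avg_seq_len))
        sequences.append([rng.choice(events) for _ in range(length)])

    # D customers of average C transactions, filled from the sequences.
    customers = []
    for _ in range(n_cust):
        n_trans = max(1, poisson(rng, avg_cust_len))
        trans = [set() for _ in range(n_trans)]
        budget = n_trans * avg_trans_len  # target total number of items
        for _ in range(10 * n_trans):  # bounded number of picks (assumption)
            if sum(len(t) for t in trans) >= budget:
                break
            seq = rng.choice(sequences)
            # Keep as many leading events as fit, preserving their order.
            k = min(len(seq), n_trans)
            slots = sorted(rng.sample(range(n_trans), k))
            for slot, event in zip(slots, seq):
                trans[slot].update(event)
        # Drop transactions that received no items.
        customers.append([sorted(t) for t in trans if t])
    return events, sequences, customers


if __name__ == "__main__":
    _, _, custs = generate_sequences(n_items=100, n_events=50,
                                     avg_event_len=1.25, n_seqs=20,
                                     avg_seq_len=4, n_cust=3,
                                     avg_cust_len=10, avg_trans_len=2.5)
    for cid, cust in enumerate(custs, start=1):
        print(cid, cust)
```

Placing the events of a pattern into increasing transaction slots keeps the pattern's order inside the customer, which is what makes it recoverable as a subsequence later.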

Type: ./gen seq -help

for all the parameters to generate sequence datasets:

Command Line Options:

-ncust number_of_customers (in 1000's) (default: 100)

-slen avg_trans_per_customer (default: 10)

-tlen avg_items_per_transaction (default: 2.5)

-nitems number_of_different_items (in '000s) (default: 10000)

-rept repetition-level (default: 0)

-seq.npats number_of_seq_patterns (default: 5000)

-seq.patlen avg_length_of_maximal_pattern (default: 4)

-seq.corr correlation_between_patterns (default: 0.25)

-seq.conf avg_confidence_in_a_rule (default: 0.75)

-lit.npats number_of_patterns (default: 25000)

-lit.patlen avg_length_of_maximal_pattern (default: 1.25)

-lit.corr correlation_between_patterns (default: 0.25)

-lit.conf avg_confidence_in_a_rule (default: 0.75)

-fname (write to filename.data and filename.pat)

-ascii (Write data in ASCII format; default: False)

-version (to print out version info)

An example run can be:

./gen seq -ncust 200 -fname C10T2.5S4I1.25D200K -ascii

This will generate four files:

[fname].data -- the actual data file

[fname].conf -- configuration info

[fname].pat -- the embedded patterns

[fname].ntpc -- info on number of trans per customer (ignore this file)
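The example file name appears to follow the symbols used in the description above (C, T, S, I, D). A short Python helper can recover the parameters from such a name and list the four output files. The helper name is hypothetical, and reading a trailing "K" as thousands is an assumption.

```python
import re


def describe_dataset(fname):
    """Parse names like C10T2.5S4I1.25D200K and list the output files."""
    params = {}
    # Each field is a capital letter, a number, and an optional "K" suffix.
    for key, number, kilo in re.findall(r"([A-Z])(\d+(?:\.\d+)?)(K?)", fname):
        params[key] = float(number) * (1000 if kilo else 1)
    files = [f"{fname}.{ext}" for ext in ("data", "conf", "pat", "ntpc")]
    return params, files


if __name__ == "__main__":
    params, files = describe_dataset("C10T2.5S4I1.25D200K")
    print(params)
    print(files)
```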
