wlin25/IBMGenerator

IBMGenerator

IBM Synthetic Data Generator for Itemsets and Sequences

Type make to build the executable file 'gen'.

Type ./gen -help for general help.

For itemsets, type ./gen lit -help. For sequences, type ./gen seq -help.

Itemset Datasets

These datasets mimic transactions in a retail environment, where people tend to buy sets of items together; each such set is a so-called maximal potentially frequent itemset. The sizes of these maximal itemsets are clustered around a mean, with a few long itemsets. A transaction may contain one or more of these itemsets. Transaction sizes are also clustered around a mean, but a few transactions may contain many items.

Let D denote the number of transactions, T the average transaction size, I the average size of a maximal potentially frequent itemset, L the number of maximal potentially frequent itemsets, and N the number of items. The data is generated using the following procedure. We first generate L maximal itemsets of average size I by choosing from the N items. We next generate D transactions of average size T by choosing from the L maximal itemsets.
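The two-step procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that 'gen' runs: the README says sizes are clustered around a mean, and the sketch assumes a Poisson distribution for them; patterns are picked uniformly at random; and the correlation and confidence options are ignored. The names poisson_size and generate_itemsets are hypothetical.

```python
import math
import random


def poisson_size(rng, mean):
    """Draw a size >= 1 clustered around `mean` (Knuth's Poisson sampler).

    Assumption: the README does not name the distribution.
    """
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return max(1, k)
        k += 1


def generate_itemsets(D, T, I, L, N, seed=0):
    """Return (patterns, transactions) following the two-step procedure."""
    rng = random.Random(seed)
    # Step 1: L maximal potentially frequent itemsets of average size I,
    # each chosen from the N items.
    patterns = [
        sorted(rng.sample(range(N), min(N, poisson_size(rng, I))))
        for _ in range(L)
    ]
    # Step 2: D transactions, each filled by choosing from the L patterns
    # until a target size drawn around T is reached. Whole patterns are
    # added, so a transaction can end up larger than its target.
    transactions = []
    for _ in range(D):
        target = poisson_size(rng, T)
        basket = set()
        for _ in range(20 * target):  # bounded, so the loop always ends
            if len(basket) >= target:
                break
            basket.update(rng.choice(patterns))
        transactions.append(sorted(basket))
    return patterns, transactions
```

For instance, generate_itemsets(D=1000, T=10, I=4, L=100, N=1000) returns 100 patterns and 1000 transactions over item ids 0 to 999. Fixing the seed keeps runs reproducible, which is convenient when comparing mining algorithms on the same data.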

Type: ./gen lit -help

for all the parameters to generate itemset datasets:

Command Line Options:

-ncust number_of_customers (in 1000's) (default: 100)

-slen avg_trans_per_customer (default: 10)

-tlen avg_items_per_transaction (default: 2.5)

-nitems number_of_different_items (in '000s) (default: 10000)

-rept repetition-level (default: 0)

-seq.npats number_of_seq_patterns (default: 5000)

-seq.patlen avg_length_of_maximal_pattern (default: 4)

-seq.corr correlation_between_patterns (default: 0.25)

-seq.conf avg_confidence_in_a_rule (default: 0.75)

-lit.npats number_of_patterns (default: 25000)

-lit.patlen avg_length_of_maximal_pattern (default: 1.25)

-lit.corr correlation_between_patterns (default: 0.25)

-lit.conf avg_confidence_in_a_rule (default: 0.75)

-fname (write to filename.data and filename.pat)

-ascii (Write data in ASCII format; default: False)

-version (to print out version info)

An example run can be:

./gen lit -ntrans 100 -tlen 10 -nitems 1 -npats 1000 -patlen 4 -fname T10I4D100K -ascii

This will generate a data file named "T10I4D100K.data"; per the -fname option above, the embedded patterns are written to "T10I4D100K.pat".
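The file name in this example appears to encode the run parameters: T10 matches -tlen 10, I4 matches -patlen 4, and D100K matches -ntrans 100, which suggests that -ntrans counts thousands of transactions. That reading is inferred from the example, not documented here. Under that assumption, a small hypothetical helper (parse_dataset_name is not part of the generator) can recover the parameters from such a name:

```python
import re


def parse_dataset_name(name):
    """Split a name such as 'T10I4D100K' into {letter: value} pairs.

    Assumption: each parameter is one capital letter followed by a number,
    and a trailing 'K' multiplies that number by 1000.
    """
    params = {}
    for letter, number, kilo in re.findall(r"([A-Z])(\d+(?:\.\d+)?)(K?)", name):
        value = float(number)
        if kilo:
            value *= 1000
        # Keep whole numbers as ints so counts read naturally.
        params[letter] = int(value) if value.is_integer() else value
    return params


print(parse_dataset_name("T10I4D100K"))
# {'T': 10, 'I': 4, 'D': 100000}
```

The same pattern also handles fractional values, so the sequence dataset name used later in this README, C10T2.5S4I1.25D200K, parses to C = 10, T = 2.5, S = 4, I = 1.25, and D = 200000.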

Sequence Datasets

The generator produces sequence datasets that mimic real-world transactions, where people buy a sequence of sets of items. Some customers may buy only some items from a sequence, or they may buy items from multiple sequences. Input-sequence sizes and event sizes are each clustered around a mean, with a few containing many elements.

The datasets are generated using the following process. First, NI maximal events of average size I are generated by choosing from the N items. Then NS maximal sequences of average size S are created by assigning events from the NI maximal events to each sequence. Next, a customer (or input-sequence) of average C transactions (or events) is created, and sequences from the NS maximal sequences are assigned to different customer elements, respecting the average transaction size T. Generation stops when D input-sequences have been generated. The default values are NS = 5000, NI = 25000, and N = 10000.
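The process above can be sketched in Python as follows. It is a simplified illustration, not the generator's actual code: Poisson-distributed sizes, uniform random choices, and the per-customer item budget of C-transactions times T are all assumptions of the sketch, and the correlation, confidence, and repetition options are left out. The names poisson_size and generate_sequences are hypothetical.

```python
import math
import random


def poisson_size(rng, mean):
    """Draw a size >= 1 clustered around `mean` (assumed Poisson)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return max(1, k)
        k += 1


def generate_sequences(D, C, T, S, I, NS, NI, N, seed=0):
    """Return D customers, each a list of transactions (sorted item lists)."""
    rng = random.Random(seed)
    # NI maximal events of average size I, chosen from the N items.
    events = [
        sorted(rng.sample(range(N), min(N, poisson_size(rng, I))))
        for _ in range(NI)
    ]
    # NS maximal sequences of average size S, built from those events.
    sequences = [
        [rng.choice(events) for _ in range(poisson_size(rng, S))]
        for _ in range(NS)
    ]
    customers = []
    for _ in range(D):
        n_trans = poisson_size(rng, C)
        transactions = [set() for _ in range(n_trans)]
        budget = n_trans * T  # keeps the mean transaction size near T
        placed = 0
        for _ in range(50 * n_trans):  # bounded, so the loop always ends
            if placed >= budget:
                break
            seq = rng.choice(sequences)
            # Spread the sequence over increasing transaction slots so the
            # order of its events is preserved; long sequences are cut short.
            k = min(len(seq), n_trans)
            slots = sorted(rng.sample(range(n_trans), k))
            for slot, event in zip(slots, seq):
                before = len(transactions[slot])
                transactions[slot].update(event)
                placed += len(transactions[slot]) - before
        # Drop transactions that received no items.
        customers.append([sorted(t) for t in transactions if t])
    return customers
```

A small call such as generate_sequences(D=100, C=10, T=2.5, S=4, I=1.25, NS=50, NI=250, N=100) mirrors the default ratios at a reduced scale and is enough to inspect the shape of the output.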

Type: ./gen seq -help

for all the parameters to generate sequence datasets:

Command Line Options:

-ncust number_of_customers (in 1000's) (default: 100)

-slen avg_trans_per_customer (default: 10)

-tlen avg_items_per_transaction (default: 2.5)

-nitems number_of_different_items (in '000s) (default: 10000)

-rept repetition-level (default: 0)

-seq.npats number_of_seq_patterns (default: 5000)

-seq.patlen avg_length_of_maximal_pattern (default: 4)

-seq.corr correlation_between_patterns (default: 0.25)

-seq.conf avg_confidence_in_a_rule (default: 0.75)

-lit.npats number_of_patterns (default: 25000)

-lit.patlen avg_length_of_maximal_pattern (default: 1.25)

-lit.corr correlation_between_patterns (default: 0.25)

-lit.conf avg_confidence_in_a_rule (default: 0.75)

-fname (write to filename.data and filename.pat)

-ascii (Write data in ASCII format; default: False)

-version (to print out version info)

An example run can be:

./gen seq -ncust 200 -fname C10T2.5S4I1.25D200K -ascii

This will generate four files:

[fname].data -- the actual data file

[fname].conf -- configuration info

[fname].pat -- the embedded patterns

[fname].ntpc -- info on number of trans per customer (ignore this file)
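When scripting around the generator, it can help to check that all four files were written. The sketch below assumes the files land in the directory given by the caller; the helper names expected_outputs and missing_outputs are hypothetical and not part of the generator.

```python
from pathlib import Path

# The four extensions listed above.
OUTPUT_EXTENSIONS = (".data", ".conf", ".pat", ".ntpc")


def expected_outputs(fname, directory="."):
    """Return the four paths a run with -fname `fname` should produce."""
    # Append the extension as a string: names such as C10T2.5S4I1.25D200K
    # contain dots, so Path.with_suffix would cut the name at the last dot.
    return [Path(directory) / (fname + ext) for ext in OUTPUT_EXTENSIONS]


def missing_outputs(fname, directory="."):
    """Return the names of expected output files that do not exist."""
    return [p.name for p in expected_outputs(fname, directory) if not p.exists()]
```

After the example run, missing_outputs("C10T2.5S4I1.25D200K") should return an empty list if every file was created in the current directory.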
