The mdsim datasets are synthetic, anonymized representations of data and domain knowledge found in a MEMS sensor development workflow for a fictional MEMS sensor product. They emulate the structure and heterogeneity of practical use-case data and include intentionally designed pitfalls that commonly arise when integrating information across sources. The datasets do not model physical MEMS behavior.
- `Datasets/manufacturing_*.csv` contains in-line manufacturing data for 2 different part types (PX1 & PX2). For each part type there are 1000 wafers simulated, each holding 25 parts
- `Datasets/manufacturing_EOL_*.csv` contains EOL testing data for the parts PX1 & PX2
- `Datasets/engineering.csv` contains in-line data for a new part type variant PX2V2. There are 10 wafers simulated, each holding 25 parts
- `Datasets/engineering_EOL.csv` contains EOL testing data for the part PX2V2
- `Datasets/assembly_map.csv` is a mapping table, tracking what parts are assembled into which sensor. There are 25000 units of sensor SY1 assembled.
- `Datasets/Final_testing.csv` contains the final test results for the sensor SY1 units
- `engineering_coordinates.png` and `manufacturing_coordinates.png` visualize the wafer coordinate grids used in the simulation for engineering and manufacturing, respectively
- `DAGs/[name]_dag.svg` visualizes the directed acyclic graph (DAG) used for simulation of the respective part/sensor
- `mdsim_ontology.svg` is a graph visualization of the mdsim ontology modeling the simulated data, a simplified version of the mdev ontology
- `mdsim_ontology.ttl`: RDF definition of the mdsim ontology
- `engineering_mapping_example.ttl`: Example RML mapping for `engineering.csv`
- `SHACL/gridalignment.ttl`: Example SHACL shapes for flagging wafers & parts with coordinates not matching the ontology standard
- `SHACL/units.ttl`: Example SHACL shapes for flagging non-qudt units
The following conflicts have been injected into the datasets:
- products are uniquely identified by `usn`, NOT by `prod_name`
- parts are uniquely identified by `x_pos`, `y_pos`, `wafer_id`; they have no ID
- `unit:W` is the only valid qudt unit in engineering and manufacturing. All others need preprocessing / conversion
- PXF_THK_ADJ uses µm in manufacturing & nm in engineering
- manufacturing adds a new measurement parameter Bond_ZOffset to the processing of part PX2
- manufacturing/engineering use `prod_name` for part types, assembly uses `sub_prod_name` (`prod_name` in assembly describes the sensor type), FT uses `prod_name` for the type but has no part type info
- engineering uses a different wafer coordinate grid than manufacturing (not center)
- engineering uses Unix (epoch) timestamps, manufacturing uses formatted datetime stamps
- EOL testing records start & duration, final testing records start & end
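A minimal sketch of how a few of these conflicts could be reconciled before integration. The function names, the unit labels `"um"` / `"nm"`, the datetime layout `%Y-%m-%d %H:%M:%S`, and the assumption that all times are UTC are illustrative choices, not taken from the datasets.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout of formatted datetime stamps (hypothetical, adjust to the data).
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(raw: str) -> datetime:
    """Return a timezone-aware UTC datetime from an epoch or datetime string."""
    try:
        # Epoch style: seconds since 1970-01-01 UTC.
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        # Formatted style: parsed with the assumed layout, treated as UTC.
        return datetime.strptime(raw, DATETIME_FORMAT).replace(tzinfo=timezone.utc)


def end_from_duration(start: datetime, duration_s: float) -> datetime:
    """Turn a (start, duration in seconds) pair into an end time."""
    return start + timedelta(seconds=duration_s)


def thickness_to_um(value: float, unit: str) -> float:
    """Normalize a thickness value given in 'um' or 'nm' to micrometres."""
    factors = {"um": 1.0, "nm": 1e-3}
    return value * factors[unit]


def part_key(wafer_id: str, x_pos: int, y_pos: int) -> tuple:
    """Build the composite key that identifies a part, since parts have no ID."""
    return (wafer_id, x_pos, y_pos)
```

Normalizing to one time representation and one unit before joining keeps the later mapping and validation steps free of source-specific branches.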
The simulation framework implements a DAG-based causal model to capture dependencies between process parameters. Each parameter node in the DAG can serve as either an exogenous input (controllable parameter) or an endogenous output (measurable parameter derived from upstream causes). The causal structure enables systematic propagation of parameter values through the manufacturing process chain. Timestamps and other metadata are tracked and propagated as part of the simulation process as well.
Parameters are categorized into two types:
- Controllable parameters: Design variables sampled from specified bounds using Latin Hypercube Sampling (LHS), which stratifies each parameter's range so that the samples cover the design space evenly
- Measurable parameters: Endogenous variables computed from upstream causal relationships
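As an illustration of the sampling idea, the following is a minimal standard-library sketch of Latin Hypercube Sampling. It is not the framework's implementation; the function name `latin_hypercube` and the parameter names used in the usage line are hypothetical.

```python
import random


def latin_hypercube(bounds, n_samples, seed=0):
    """Draw n_samples points with exactly one point per stratum in every dimension.

    bounds maps a parameter name to its (low, high) interval.
    """
    rng = random.Random(seed)
    samples = [{} for _ in range(n_samples)]
    for name, (low, high) in bounds.items():
        # One uniformly drawn point inside each of n_samples equal-width strata.
        points = [(k + rng.random()) / n_samples for k in range(n_samples)]
        # Shuffle so that strata are paired randomly across dimensions.
        rng.shuffle(points)
        for sample, u in zip(samples, points):
            sample[name] = low + u * (high - low)
    return samples


# Hypothetical controllable parameters with their bounds.
designs = latin_hypercube({"etch_time": (10.0, 20.0), "rf_power": (100.0, 300.0)}, 5)
```

Each parameter's range is split into `n_samples` equal strata and every stratum is hit exactly once, which is what distinguishes LHS from plain uniform sampling.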
The causal computation proceeds as follows:
- Initialization: Sample controllable parameters (LHS)
- Topological traversal: Compute measurable parameters in topological order
- Validation: Check parameter values against specification limits
- Postprocessing: Split and transform into data structures resembling real datasets. Introduce common SICs.
The DAG structure guarantees that all parameters can be computed in a valid topological order, where each node's value depends only on previously computed upstream nodes. This ensures deterministic, reproducible propagation without circular dependencies.
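The ordering and validation steps can be sketched with Python's standard `graphlib` module (Python 3.9+). The parameter names and the `within_limits` helper are hypothetical and only illustrate the idea; the framework's own code may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each measurable parameter maps to the set of its parents.
parents = {
    "etch_depth": {"etch_time", "rf_power"},
    "membrane_thickness": {"etch_depth"},
    "sensitivity": {"membrane_thickness", "rf_power"},
}

# static_order() yields every node after all of its predecessors and raises
# graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(parents).static_order())


def within_limits(values, limits):
    """Return, per parameter, whether its value lies inside (lower, upper)."""
    return {name: lo <= values[name] <= hi for name, (lo, hi) in limits.items()}
```

Evaluating nodes in `order` means every parent value is available by the time a child is computed, and a cyclic graph is rejected before any values are produced.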
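Each directed edge $P_i \rightarrow P_j$ encodes a causal dependency of child parameter $P_j$ on parent parameter $P_i$.

For a linear edge relationship, the contribution from parent $P_i$ to child $P_j$ is

$$w_{ij} \, v_i$$

where:

- $v_i$ is the value of parent parameter $P_i$
- $w_{ij}$ is the edge weight (default: 1.0)

For a child parameter $P_j$ with parents $\mathrm{pa}(j)$, the value is

$$v_j = b_j + \sum_{i \in \mathrm{pa}(j)} w_{ij} \, v_i + \epsilon_j$$

where:

- $b_j$ is the intercept (bias) term for parameter $j$ (default: 0.0)
- $w_{ij}$ are the edge weights from each parent
- $\epsilon_j \sim \mathcal{N}(0, \sigma_j^2)$ represents measurement noise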
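The linear rule built from these terms (intercept, weighted parent values, Gaussian noise) can be sketched as follows. The function `node_value` and its argument names are hypothetical; this is an illustration under the stated defaults, not the framework's code.

```python
import random


def node_value(parent_values, weights=None, intercept=0.0, sigma=0.0, rng=random):
    """Evaluate intercept + sum of weight * parent value + Gaussian noise.

    parent_values maps parent names to their values; weights maps parent
    names to edge weights. Missing weights fall back to the default of 1.0.
    """
    weights = weights or {}
    linear = sum(weights.get(name, 1.0) * v for name, v in parent_values.items())
    # sigma is the standard deviation of the noise; 0 disables it.
    noise = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
    return intercept + linear + noise


# Example with hypothetical parents: 1.0 + 0.5 * 2.0 + 1.0 * 3.0, no noise.
value = node_value({"a": 2.0, "b": 3.0}, weights={"a": 0.5}, intercept=1.0)
```

Passing a seeded `random.Random` instance as `rng` makes the noisy case reproducible.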