Requirements:
- Spark 3.5
- Pandas 2.0+
- Python 3
- spark-xml_2.12-0.18.0.jar (included)
- Data file 'data.xml' must be placed in 'sample' directory
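The file requirements above can be checked before running anything. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the two paths are taken from the requirements list and the directory layout in this README.

```python
from pathlib import Path

# Paths taken from this README; adjust them if your layout differs.
REQUIRED_FILES = (
    Path("jars") / "spark-xml_2.12-0.18.0.jar",
    Path("sample") / "data.xml",
)


def missing_files(root="."):
    """Return the required files (as POSIX-style relative paths) missing under `root`."""
    base = Path(root)
    return [rel.as_posix() for rel in REQUIRED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing required files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required files are in place.")
```

Run it from the working directory; an empty result means both the jar and the data file are where the scripts expect them.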
├── jars
│ ├── spark-xml_2.12-0.18.0.jar
├── sample
│ ├── data.xml
├── output
│ ├── ...
├── pd_output
│ ├── ...
├── main.py
├── pd_main.py
├── out.txt
├── pd_out.txt
├── readme.md
└── .gitignore
- `out.txt` file contains the output from `main.py` execution
- `pd_out.txt` file contains the output from `pd_main.py` execution
- `output` directory contains the tables created by the PySpark script
- `pd_output` directory contains the tables created by the Python/pandas script
Run the following commands from the working directory.
To run the PySpark script:
spark-submit --jars jars/spark-xml_2.12-0.18.0.jar main.py
To run the Python/pandas script:
python3 pd_main.py
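For context on why the jar is passed with `--jars`: the spark-xml package provides the `xml` data source that a PySpark script can use to load an XML file. The sketch below only illustrates that usage and is not the actual content of `main.py`. The function name `read_xml` and the row tag `"record"` are assumptions, since the structure of `data.xml` is not described here.

```python
def read_xml(spark, path, row_tag):
    """Load an XML file into a Spark DataFrame through the spark-xml data source.

    `spark` is an existing SparkSession started with the spark-xml jar available;
    `row_tag` is the name of the XML element that should become one row.
    """
    return (
        spark.read.format("xml")    # short name registered by spark-xml
        .option("rowTag", row_tag)  # element treated as a row
        .load(path)
    )


# Example usage (requires PySpark and the jar passed via --jars):
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("xml-demo").getOrCreate()
#   df = read_xml(spark, "sample/data.xml", "record")  # "record" is a placeholder
#   df.printSchema()
```

Taking the session as a parameter keeps the function free of any PySpark import, so it can be reused from a script submitted with `spark-submit` or from an interactive session.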