A Go application that reads CSV files, converts each row to an XML document, optionally extracts content from those documents with Apache Tika, and indexes the results into OpenSearch for full-text search.
## Features

- CSV Parsing: Reads CSV files with header rows and creates a map for each data row
- XML Generation: Converts each CSV row into a separate XML file
- Tika Integration: Uses Apache Tika server to parse and extract content from XML files
- OpenSearch Indexing: Indexes parsed documents to OpenSearch for full-text search
- Bulk Operations: Supports bulk indexing for better performance
- Docker Support: Includes Docker Compose setup for OpenSearch and Tika services
## Prerequisites

- Go 1.21 or later
- Docker and Docker Compose
- Make (optional, for using Makefile commands)
## Project Structure

```
search/
├── cmd/
│   └── main.go            # Main application entry point
├── internal/
│   ├── csvparser/         # CSV parsing module
│   │   └── parser.go
│   ├── xmlconverter/      # XML conversion module
│   │   └── converter.go
│   ├── tika/              # Tika client module
│   │   └── client.go
│   └── opensearch/        # OpenSearch indexer module
│       └── indexer.go
├── data/                  # Sample data directory
│   └── sample.csv
├── output/                # Generated XML files directory
├── docker-compose.yml     # Docker services configuration
├── go.mod                 # Go module definition
├── Makefile               # Build and run commands
└── README.md              # This file
```
## Quick Start

```bash
# Navigate to the project directory
cd search

# Download dependencies
go mod download
```

Start OpenSearch and Tika using Docker Compose:
```bash
# Using Make
make docker-up

# Or using Docker Compose directly
docker-compose up -d
```

Wait about 10-15 seconds for the services to be ready. You can verify:
```bash
# Check OpenSearch
curl http://localhost:9200

# Check Tika
curl http://localhost:9998/version
```

Process the sample CSV file:
```bash
# Using Make
make run

# Or using Go directly
go run cmd/main.go -csv data/sample.csv

# Or build and run
go build -o bin/csv-indexer cmd/main.go
./bin/csv-indexer -csv data/sample.csv
```

- OpenSearch Dashboards: http://localhost:5601
- OpenSearch API: http://localhost:9200
- Search your index:

  ```bash
  curl -X GET "http://localhost:9200/csv-documents/_search?pretty"
  ```
## Command-Line Options

```bash
go run cmd/main.go [options]
```

| Option | Description | Default |
|---|---|---|
| `-csv` | Path to the CSV file to process (required) | - |
| `-output` | Directory to store XML files | `output` |
| `-index` | OpenSearch index name | `csv-documents` |
| `-opensearch` | OpenSearch URL | `http://localhost:9200` |
| `-tika` | Tika server URL | `http://localhost:9998` |
| `-bulk` | Use bulk indexing | `true` |
| `-skip-tika` | Skip Tika parsing and index XML directly | `false` |
| `-help` | Show help message | - |
## Usage Examples

```bash
# Process a CSV file with default settings
go run cmd/main.go -csv data/sample.csv

# Use a custom index name
go run cmd/main.go -csv data/sample.csv -index my-custom-index

# Skip Tika parsing (faster but less content extraction)
go run cmd/main.go -csv data/sample.csv -skip-tika

# Process without bulk indexing (slower but shows individual progress)
go run cmd/main.go -csv data/sample.csv -bulk=false
```

The project includes a Makefile for common operations:
```bash
make help            # Show all available commands
make docker-up       # Start Docker services
make docker-down     # Stop Docker services
make docker-clean    # Stop services and remove volumes
make build           # Build the application
make run             # Run with sample data
make run-skip-tika   # Run without Tika parsing
make test-opensearch # Test OpenSearch connection
make test-tika       # Test Tika connection
make search          # Interactive search in indexed documents
make list-indices    # List all OpenSearch indices
make delete-index    # Delete the csv-documents index
make clean           # Clean generated files
make full-run        # Complete setup and run
```

## How It Works

1. CSV Parsing: The application reads your CSV file and creates a map for each row using the headers as keys.
2. XML Generation: Each row is converted to an XML document with sanitized field names:

   ```xml
   <?xml version="1.0" encoding="UTF-8"?>
   <record>
     <name>John Doe</name>
     <age>28</age>
     <city>New York</city>
     ...
   </record>
   ```

3. Tika Parsing (optional): XML files are sent to the Tika server for content extraction and metadata parsing.
4. OpenSearch Indexing: Documents are indexed with:
   - Full text content
   - Original CSV data as metadata
   - File paths
   - Timestamps
5. Search: Documents become searchable through OpenSearch's full-text search capabilities.
## CSV File Format

Your CSV file should:
- Have a header row as the first line
- Use comma separators
- Follow standard CSV escaping rules
Example:

```csv
name,age,city,occupation,email
John Doe,28,New York,Software Engineer,[email protected]
Jane Smith,32,San Francisco,Product Manager,[email protected]
```

## Services

### OpenSearch

- URL: http://localhost:9200
- Single-node setup with security disabled (for development)
- Data persisted in Docker volume
### OpenSearch Dashboards

- URL: http://localhost:5601
- Web interface for data visualization and management
### Tika

- URL: http://localhost:9998
- Apache Tika 2.9.1 for content extraction
## Troubleshooting

### Services won't start

```bash
# Check if ports are already in use
netstat -an | grep -E "9200|9998|5601"

# Check Docker logs
docker-compose logs
```

### OpenSearch runs out of memory

Increase the memory limit in docker-compose.yml:

```yaml
environment:
  - "OPENSEARCH_JAVA_OPTS=-Xms1g -Xmx1g"
```

### Tika parsing fails

- Ensure the Tika container is running: `docker-compose ps`
- Wait longer for service startup
- Use the `-skip-tika` flag to bypass Tika parsing

### Documents aren't indexed

- Check the CSV file format and path
- Verify OpenSearch is running
- Check application output for error messages
## Development

```bash
# Run tests
go test ./...

# Format and vet code
go fmt ./...
go vet ./...

# Add a dependency and tidy the module
go get <package>
go mod tidy
```

## API Endpoints

### OpenSearch

- Health check: `GET http://localhost:9200`
- Index stats: `GET http://localhost:9200/csv-documents/_stats`
- Search: `POST http://localhost:9200/csv-documents/_search`
- Delete index: `DELETE http://localhost:9200/csv-documents`

### Tika

- Version: `GET http://localhost:9998/version`
- Parse: `PUT http://localhost:9998/tika`
- Metadata: `PUT http://localhost:9998/meta`
## License

MIT
## Contributing

- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
## Support

For issues, questions, or suggestions, please open an issue on the project repository.