- Fixed bug: the output file that contains variables that were not greenlit was not created correctly (`dataflow.main.DataFlow._store_info_csv`)
- Added new function to parse position indices for specific variables. In the config files, the setting `parse_pos_indices` can be set for single variables where position indices are available.
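The changelog does not show the parsing itself; here is a minimal sketch, assuming position indices are encoded as a trailing integer triple in the variable name (the function name and naming scheme are illustrative, not dataflow's actual implementation):

```python
import re

def parse_pos_indices(varname: str):
    """Extract trailing position indices from a variable name.

    Assumes names like 'TA_1_2_1', where the last three integer fields
    encode position (e.g. horizontal/vertical/replicate). This scheme
    is an assumption for illustration only.
    """
    match = re.search(r'_(\d+)_(\d+)_(\d+)$', varname)
    if not match:
        return None  # no position indices available for this variable
    return tuple(int(ix) for ix in match.groups())

print(parse_pos_indices('TA_1_2_1'))  # (1, 2, 1)
print(parse_pos_indices('TA_AVG'))    # None
```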
- Fixed release
- This is a major update that refactors many parts of the code
- Removed: `dbc-influxdb` dependency, required functionality is now directly built into `dataflow`
- Adjusted how `.read_csv()` reads data files, to comply with current pandas requirements. For all filetypes, the timestamp is now built in a separate step after reading the file, never during reading. (`dataflow.filetypereader.filetypereader.FileTypeReader._add_timestamp`)
- Refactored the way `rawfunc` is handled. `rawfunc` variables are now created and added to the main dataframe before looping through the main dataframe. This means that `rawfunc` variables are now handled like the variables from the data files. All relevant tag entries are adjusted during the execution of the respective `rawfunc`.
- Uploading to the database no longer requires the Python dependency `dbc-influxdb`. `dataflow` uses its own uploading routine. This was necessary to guarantee faster execution and cleaner code.
- Added new function to apply gain between two dates (`dataflow.rawfuncs.common.apply_gain_between_dates`)
- Added new function to add offset between two dates (`dataflow.rawfuncs.common.add_offset_between_dates`)
- Added new rawfunc to correct O2 measurements using temperature, used at site `CH-CHA` (`dataflow.rawfuncs.ch_cha.correct_o2`)
- `gain` is now set to `1` as a float if not specifically given; before it was an integer
- Added new database tag `offset`
- Added new database tag `site`
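The gain-between-dates idea can be sketched with a label-based slice on a time-indexed series; the signature below is assumed, not copied from `dataflow.rawfuncs.common`:

```python
import pandas as pd

def apply_gain_between_dates(series: pd.Series, start: str, end: str,
                             gain: float) -> pd.Series:
    """Multiply values by gain only within [start, end] (inclusive).

    Sketch of the idea behind dataflow.rawfuncs.common.apply_gain_between_dates;
    the real signature and behavior may differ.
    """
    out = series.copy()
    out.loc[start:end] = out.loc[start:end] * float(gain)
    return out

idx = pd.date_range('2024-01-01', periods=4, freq='D')
s = pd.Series([1.0, 1.0, 1.0, 1.0], index=idx)
out = apply_gain_between_dates(s, '2024-01-02', '2024-01-03', 2)
print(out.tolist())  # [1.0, 2.0, 2.0, 1.0]
```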
- Run ID now also includes nanoseconds to better differentiate between (many) log files of runs that were started in parallel (`dataflow.common.times.make_run_id`)
- Added parameter to add an optional `suffix` to the Run ID
- Updated `dbc-influxdb` dependency to `v0.11.3`
- Variables are now strictly converted to `float`, because the automatic detection of datatypes confused the database, which led to values being skipped because `int` was expected but `float` was delivered (`dataflow.main.DataFlow._to_numeric`)
- Columns of `object` type are now explicitly excluded with `infer_objects=False` (`dataflow.main.DataFlow._convert_to_float_or_string`)
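The strict conversion described above can be sketched as follows; the helper name mirrors `_convert_to_float_or_string`, but the actual dataflow implementation may differ in detail:

```python
import pandas as pd

def convert_to_float_or_string(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce numeric columns strictly to float; leave object (string)
    columns untouched so the database always receives floats."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            continue  # string-like columns are excluded from conversion
        out[col] = pd.to_numeric(out[col], errors='coerce').astype(float)
    return out

df = pd.DataFrame({'counts': [1, 2, 3], 'flag': ['a', 'b', 'c']})
res = convert_to_float_or_string(df)
print(res['counts'].dtype, res['flag'].dtype)  # float64 object
```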
- In the `configs` it is now possible to define multiple IDs that identify good data rows. In `dataflow` this is now handled accordingly.
- In the `configs`, this is done by specifying e.g. `data_keep_good_rows: [ 0, [ 102, 103 ], [ 202, 203 ] ]`, which means that all data rows that start with either `102` or `103` are kept and use the variable info in `data_vars`, and rows starting with `202` or `203` use the variable info given in `data_vars2`.
- In case single integers are given instead of a list, all records that start with that integer are kept. For example, `data_keep_good_rows: [ 0, 102, 202 ]` means that all data rows that start with `102` are kept and use the variable info in `data_vars`, and all data rows that start with `202` use the variable info given in `data_vars2`.
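The multi-ID filtering described above can be sketched in plain Python (the leading column index of `data_keep_good_rows` is omitted here, and names are illustrative):

```python
def keep_good_rows(rows, id_groups):
    """Split raw data rows into groups by their leading ID.

    id_groups mirrors the data_keep_good_rows entries after the leading
    column index: each entry is a single ID or a list of IDs that share
    the same variable definitions (data_vars, data_vars2, ...).
    """
    result = []
    for entry in id_groups:
        ids = entry if isinstance(entry, list) else [entry]
        result.append([row for row in rows if row[0] in ids])
    return result

rows = [[102, 1.1], [103, 1.2], [202, 2.1], [999, 9.9]]  # 999 is dropped
vars1_rows, vars2_rows = keep_good_rows(rows, [[102, 103], [202, 203]])
print(vars1_rows)  # [[102, 1.1], [103, 1.2]]
print(vars2_rows)  # [[202, 2.1]]
```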
- Added new function to calculate soil water content `SWC` from `SDP` variables measured at the site `CH-CHA`. The function to do the calculation was taken from the previous MeteoScreening tool. Conversions for other sites follow later. (`dataflow.rawfuncs.ch_cha.calc_swc_from_sdp` and `dataflow.main.DataFlow._execute_rawfuncs`)
- After reading the data file, all rows that do not contain a timestamp are now removed. This is the case e.g. for the file `CH-CHA_iDL_BOX1_1min_20160930-1545.csv.gz`, which contains the string `ap>0.004216865` in the 3rd row of the timestamp column. (`dataflow.main.DataFlow._varscanner`)
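Dropping rows without a parseable timestamp can be sketched with pandas (column names here are illustrative):

```python
import pandas as pd

# Sketch: remove rows whose timestamp cannot be parsed, e.g. the stray
# string 'ap>0.004216865' mentioned above.
df = pd.DataFrame({
    'TIMESTAMP': ['2016-09-30 15:45', '2016-09-30 15:46', 'ap>0.004216865'],
    'TA': [12.1, 12.2, 12.3],
})
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], errors='coerce')  # bad -> NaT
df = df.dropna(subset=['TIMESTAMP'])
print(len(df))  # 2
```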
- Updated date offsets to be compliant with new versions of `pandas` (see here). (`dataflow.common.times.timedelta_to_string`)
- Adjusted check for missing IDs due to the new option in `data_keep_good_rows` as described above (`dataflow.main.DataFlow._check_special_format_alternating_missed_ids`)
- Updated detection of good rows for special format alternating; it can now handle multiple IDs that mark good rows (`dataflow.filetypereader.special_format_alternating.special_format_alternating`)
- Fixed `EmptyDataError` bug when reading compressed `gzip` files that have filesize zero when uncompressed. This error occurs when completely empty files are gzipped: the filesize of the compressed file is then > 0, but when the script tries to uncompress the file, the exception `pd.errors.EmptyDataError` is raised. There are now more checks implemented to avoid empty dataframes. (`dataflow.filetypereader.filetypereader.FileTypeReader._readfile`)
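One way to guard against this, sketched below; the helper name and signature are illustrative, not dataflow's actual code:

```python
import gzip
import io
import pandas as pd

def readfile_safe(filepath_or_buffer, **kwargs) -> pd.DataFrame:
    """Return an empty dataframe instead of crashing on empty input."""
    try:
        return pd.read_csv(filepath_or_buffer, **kwargs)
    except pd.errors.EmptyDataError:
        return pd.DataFrame()

# A gzipped empty file: compressed size > 0, uncompressed size == 0
buf = io.BytesIO(gzip.compress(b''))
df = readfile_safe(buf, compression='gzip')
print(df.empty)  # True
```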
- Updated `dbc-influxdb` to `v0.11.1`
- Added new method to harmonize time string representations when inferring the time resolution. Necessary because pandas outputs the time string for one-minute data as `min`, but dataflow prefers to use `T`. (`dataflow.common.times.DetectFrequency.harmonize_timestring`)
- Change in environment: now using a `conda` env with the specific Python version `3.9.18`. `poetry` is still used for dependency management but is now installed directly in the `conda` env. Before, `poetry` was installed at system level with the system-level Python `3.9.7`. This setup has the advantage that the script is now completely independent from the Python version installed at system level.
- Added `environment.yml` for creating a complete `conda` environment which includes the required Python version and all required packages.
- Updated packages to newest versions:

```
> poetry update
Updating dependencies
Resolving dependencies...

Package operations: 0 installs, 16 updates, 0 removals

• Updating typing-extensions (4.8.0 -> 4.9.0)
• Updating certifi (2023.7.22 -> 2024.2.2)
• Updating markupsafe (2.1.3 -> 2.1.5)
• Updating numpy (1.26.0 -> 1.26.4)
• Updating pytz (2023.3.post1 -> 2024.1)
• Updating setuptools (68.2.2 -> 69.1.0)
• Updating tzdata (2023.3 -> 2024.1)
• Updating urllib3 (2.0.4 -> 1.26.18)
• Updating blinker (1.6.2 -> 1.7.0)
• Updating importlib-metadata (6.8.0 -> 7.0.1)
• Updating influxdb-client (1.37.0 -> 1.40.0)
• Updating jinja2 (3.1.2 -> 3.1.3)
• Updating pandas (2.1.0 -> 2.2.0)
• Updating werkzeug (2.3.7 -> 3.0.1)
• Updating wcmatch (8.5 -> 8.5.1)
```
- Fixed bug in `rawfuncs` (`dataflow.main.DataFlow._execute_rawfuncs`)
FileScanner (FS) and VarScanner (VS) are no longer executed separately, but always sequentially. This way
more processes can be started in parallel.
FS searches for files and tries to assign a filetype to each found file, then VS uploads data to the database, file-by-file. The connection to the database is established before VS is started.
I did some tests on how to handle data that are stored in a great number of raw data files. I found that the best solution seems to be to handle data file-by-file. The approach to first read in all files of a specific filetype to one single dataframe (with all file data merged) and then upload data from that large dataframe caused memory issues. Some high-resolution raw data files (1SEC) simply have too many records over the course of a year and the test computer ran out of memory. Handling data uploads file-by-file avoids memory issues.
- Added `FileTypeReader` class which was originally implemented in `dbc-influxdb`. I think it makes more sense to include it in `dataflow`. (`dataflow.filetypereader.filetypereader.FileTypeReader`)
- Now using `skip_blank_lines=False` instead of `skip_blank_lines=True` when reading with `.read_csv()`.
- Added variable suffix `-PRF-QCL-` for special format `-ICOSSEQ-` if the data originated from the QCL profile measurements. The non-QCL profile variables still have the same suffix `-PRF-`. (`dataflow.filetypereader.special_format_icosseq.special_format_icosseq`)
- Added class `DetectFrequency` to detect the time resolution of time series automatically. This class is based on a similar implementation in the `diive` library. (`dataflow.common.times.DetectFrequency`)
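The harmonization mentioned above can be sketched with `pd.infer_freq`; the mapping shown is a minimal assumption, not the full `harmonize_timestring` implementation:

```python
import pandas as pd

def harmonize_timestring(freq: str) -> str:
    """Map pandas frequency aliases to the string dataflow prefers,
    e.g. 'min' -> 'T'. Minimal sketch of the mapping."""
    return {'min': 'T', '1min': 'T'}.get(freq, freq)

index = pd.date_range('2024-01-01 00:00', periods=10, freq='min')
inferred = pd.infer_freq(index)  # 'T' in older pandas, 'min' in pandas >= 2.2
print(harmonize_timestring(inferred))  # T
```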
- Update: `dbc-influxdb` version was updated to `v0.10.2`
- Fixed bug in imports
- Downgraded `urllib3` package to version `1.26.18` because versions >2 require OpenSSL v1.1.1+, but the target system has OpenSSL v1.0.2 installed, which cannot be updated: `yum install` on this Linux system only finds v1.0.2.
It is now possible to calculate variables from available data. This is sometimes necessary, e.g., when data were recorded with erroneous units due to a wrong calibration factor, or when not the final, required measurement was stored, such as SDP instead of SWC.
- New function to calculate soil water content `SWC` from `SDP` variables; at the time of this writing, this is possible for the site `CH-FRU`. The function to do the calculation was taken from the previous MeteoScreening tool. Conversions for other sites follow later. (`rawfuncs.ch_fru.calc_swc_from_sdp`)
- New function to calculate Boltzmann-corrected long-wave radiation (variables `LW_IN` and `LW_OUT` in `W m-2`) from the temperature of the radiation sensor in combination with raw LW_IN measurements from the sensor. (`rawfuncs.common.calc_lwin`)
- These calculations have to be defined directly in the `configs`.
- The general logic is that all variables required for a specific `rawfunc` are first collected in a dedicated dataframe, and then the new variables are calculated.
Here are the settings in the configs:

```yaml
Theta_11_AVG: { field: SDP_GF1_0.05_1, units: mV, gain: 1, rawfunc: [ calc_swc ], measurement: SDP }
```

The name of the SWC variable will accordingly be `SWC_GF1_0.05_1`.
Here are the settings in the configs:

```yaml
LWin_2_AVG: { field: LW_IN_RAW_T1_2_1, units: false, gain: 1, rawfunc: [ calc_lw, PT100_2_AVG, LWin_2_AVG, LW_IN_T1_2_1 ], measurement: _RAW }
PT100_2_AVG: { field: T_RAD_T1_2_1, units: degC, gain: 1, rawfunc: [ calc_lw, PT100_2_AVG, LWin_2_AVG, LW_IN_T1_2_1 ], measurement: _instrumentmetrics }
```

This means that the function `calc_lw` uses temperature variable `PT100_2_AVG` and recorded variable `LWin_2_AVG` to calculate the new variable `LW_IN_T1_2_1`. Note that both variables required for the `calc_lw` function (`LWin_2_AVG` and `PT100_2_AVG`) have the same `rawfunc:` setting.
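A Boltzmann correction for pyrgeometer-type sensors typically adds the black-body emission of the sensor body (sigma * T^4) to the raw thermopile signal. Whether `calc_lw` uses exactly this form is an assumption, so the sketch below is illustrative only:

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant (W m-2 K-4)

def calc_lw(lw_raw: float, t_sensor_degc: float) -> float:
    """Boltzmann-correct a raw long-wave radiation signal by adding the
    black-body emission of the sensor at its body temperature. Sketch
    only; the exact formula in rawfuncs.common.calc_lwin may differ."""
    t_kelvin = t_sensor_degc + 273.15
    return lw_raw + SIGMA * t_kelvin ** 4

# Raw thermopile signal of -60 W m-2 at a sensor temperature of 10 degC
# yields roughly 304 W m-2 corrected LW_IN.
print(round(calc_lw(-60.0, 10.0), 1))
```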
- Addition: `FileScanner` now raises a warning if not all required keys are available as columns in the filescanner dataframe. To resolve this warning, the required key must be initialized as a column when `filescanner_df` is first created in `filescanner.filescanner.FileScanner._init_df`. If the key is not in the dataframe, pandas raises a future warning due to upcasting, more details:
- Change: Removed arg `mangle_dupe_cols` when using pandas `.read_csv()` (deprecated in pandas)
- Update: `dbc-influxdb` version was updated to `v0.10.1`
- Update: Updated all packages to newest versions
- `dbc-influxdb` version was updated to `v0.8.1`
- `dbc-influxdb` version was updated to `v0.8.0`
- The x newest files are now detected based on file modification time instead of filedate (`filescanner.filescanner.FileScanner.run`).
- Updated all dependencies to their newest (possible) version
- `dbc-influxdb` version was updated to `v0.7.0`
- Added support for `-ALTERNATING-` filetypes (special format). For a description of this special format please see the CHANGELOG of `dbc-influxdb` here:
- filescanner: Changed the logic of how the filedate is parsed from the filename. Settings provided in the filetype setting `filetype_dateparser` are now first converted to a list, then the script loops through the provided settings and tries to parse the filedate from the filename.
    - If one element in `filetype_dateparser` is `get_from_filepath`, then the parent subfolders from the filepath of the respective file are checked, e.g.:
        - Assuming the filepath for file `Davos-Logger.dat` is `//someserver/CH-DAV_Davos/10_meteo/2013/08/Davos-Logger.dat`, then the first subfolder is checked whether it matches a month (between `01` and `12`), and the second subfolder is checked whether it matches a year (between `1900` and `2099`). If both are True, then the filedate is constructed as datetime, in this case `dt.datetime(year=2013, month=8, day=1, hour=0, minute=0)`.
    - If one element in `filetype_dateparser` is `false`, then the filedate is constructed from the modification time of the respective file. Note that the modification time sometimes has nothing to do with the contents of the file.
- Included `nrows` setting for specifying how many data rows of each file are uploaded to the database. This is useful to quickly test uploading data from many files, e.g., for checking if units or resolution changed. This setting was already available in `dbc-influxdb`, but now it can be passed directly from `dataflow`.
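The `get_from_filepath` logic described above can be sketched as follows (a simplified stand-in, not dataflow's actual parser):

```python
import datetime as dt
from pathlib import PurePosixPath

def filedate_from_filepath(filepath: str):
    """Check whether the first parent subfolder matches a month (01-12)
    and the second a year (1900-2099); return the filedate or None."""
    parts = PurePosixPath(filepath).parts
    if len(parts) < 3:
        return None
    month_dir, year_dir = parts[-2], parts[-3]
    if month_dir.isdigit() and year_dir.isdigit():
        month, year = int(month_dir), int(year_dir)
        if 1 <= month <= 12 and 1900 <= year <= 2099:
            return dt.datetime(year=year, month=month, day=1)
    return None

fd = filedate_from_filepath('//someserver/CH-DAV_Davos/10_meteo/2013/08/Davos-Logger.dat')
print(fd)  # 2013-08-01 00:00:00
```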
- Updated dependency for `dbc-influxdb` to `v0.5.0` (installed directly from GitLab)
- Scripts for running `dataflow` on a local machine (on demand) are now collected in folder `local_run`
- File data are now uploaded with timezone info `timezone='UTC+01:00'`, which corresponds to CET (Central European winter time, UTC+01:00). This way all data are stored as `UTC` in the database. `UTC` is the same as `GMT`.
- Created Python file `local_run.py`. This file allows uploading files manually from a local machine. This is necessary to upload the historic data (many files). The script uses `multiprocessing` to run in parallel. Parallelization currently works for FILEGROUPS.
- Implemented new arg `parse_var_pos_indices` for `dbc-influxdb.upload_filetype()`, which is now part of the `configs` for all filegroups: `parse_var_pos_indices=filetypeconf['data_vars_parse_pos_indices']`
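The timezone handling described above can be sketched with a fixed UTC+01:00 offset (no DST, matching CET winter time); this is an illustration of the principle, not dataflow's upload code:

```python
from datetime import timedelta, timezone
import pandas as pd

# Localize file timestamps to the fixed offset UTC+01:00 and convert
# to UTC for storage in the database.
cet = timezone(timedelta(hours=1))
index = pd.to_datetime(['2021-01-01 01:00', '2021-01-01 01:30'])
index_utc = index.tz_localize(cet).tz_convert('UTC')
print(index_utc[0])  # 2021-01-01 00:00:00+00:00
```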
- Added check: filesize must be > 0, otherwise file is skipped
- Added check for empty data before extending `filescanner_df`
- The `dbc` package is now included with its new name `dbc-influxdb`
- Moved `varscanner` to the `dbc` package
- Now using the `dbc` package (currently v0.1.0) to scan files for variables and to upload data to the database
    - `dbc` was installed directly from the release version on GitLab when `dataflow` is installed on the database server with `pipx`. During development, `dbc` is included as a dev-dependency from a local folder.
- Removed `filereader` module, it is now part of the `dbc` library
- Removed `freqfrom` from tags (in `dbc`, but mentioning this here)
- Refactored code
    - `dataflow` is now part of project `POET`
    - The `configs` folder (with filegroups etc.) is no longer part of `dataflow`, but a separate project
    - `dbconf.yaml`, the configuration for the database, is no longer in the `configs` subfolder, but in subfolder `configs_secret`
Updated in configs folder:
- Added filetype `DAV12-RAW-FF4-TOA5-DAT-TBL1-1MIN-202006241033`
- Added filetype `DAV12-RAW-FF5-TOA5-DAT-TBL1-1MIN-202006240958`
- Added filetype `DAV12-RAW-FF3-TOA5-DAT-TBL1-1MIN-201903041642`
- Target bucket is now determined from CLI args `site` and `datatype` instead of from the `filetype` configuration.
- Therefore, the `db_bucket` setting in the `filetype` configuration files has been removed.
- It is now possible to directly upload data to the `test` bucket by setting the `testupload` arg to `True`. (not yet available in CLI)
- Removed the tag `srcfile`: this tag will no longer be uploaded to the database. The reason is that this tag causes duplicates (multiple entries per timestamp for the same variable) in case of overlapping source files.
- Since `srcfile` is no longer stored as tag, it is now output to the log file.
- Added new option in filetype settings: `data_remove_bad_rows`, which has similar functionality as `data_keep_good_rows`, but it removes data rows based on e.g. a string instead of keeping them. This option was implemented because of inconsistencies in filetype `DAV17-RAW-NABEL-PRF-SSV-DAT-P2-5MIN-200001010000`.
- Started documentation of filetype settings in `configs:filegroups:README.md`
- Added a general filetype for EddyPro fluxnet files (Level-0 fluxes)
- Restructured the `configs:filegroups`: the `processing` subfolder now contains filetypes that are the same across sites, e.g., the `filetype` for EddyPro full_output files.
- Added additional restriction for `20_ec_fluxes:Level-0` files: the path to their source file must contain the string `Level-0`.
- Added option to ignore files that contain certain strings.
- Added filetypes for early DAV17 NABEL CO2 profiles and DAV13 profile data
- All configs except for the database settings are now part of the main code. The database settings in the file `dbconf.yaml` remain external (outside main code) for security reasons.
- The "filescanner-was-here" file is now generated in the folder as soon as `varscanner` starts working in the respective folder (before, the file was generated after `varscanner` finished). This allows parallel execution of the script because it prevents two parallel `varscanner` runs from interfering in the same folder.
- For each variable, a gain can now be specified in the filetype. If no gain is given, gain is set to 1. If gain is set, the raw data values are multiplied by gain before ingestion to the database.
- In the filetype configurations, the keys `filetype_id` and `filetype_dateparser` now accept lists where multiple values can be defined. This is useful if the file naming changed but the data format remained the same, e.g. `DAV11-RAW` files.
- Added wcmatch library for extended pattern matching (not used at the moment)
- In the filetype configurations, `filetype_dateparser` can now be given as part of the filename, e.g. in `DAV11-RAW`. The filename is now parsed for the filedate based on the length of the provided `filetype_dateparser` strings.
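Length-based filedate parsing can be sketched as follows; strptime-style patterns are an assumption for illustration, not necessarily dataflow's actual pattern syntax:

```python
import datetime as dt

def parse_filedate(filename: str, dateparsers: list):
    """Slice the filename to the rendered length of each provided
    pattern and try to parse a filedate from the end of the file stem."""
    stem = filename.rsplit('.', 1)[0]
    for pattern in dateparsers:
        # rendered length of the date portion, e.g. '%Y%m%d%H%M' -> 12 chars
        length = len(dt.datetime(2000, 1, 1).strftime(pattern))
        try:
            return dt.datetime.strptime(stem[-length:], pattern)
        except ValueError:
            continue  # try the next provided pattern
    return None

fd = parse_filedate('DAV12-Logger_202110221616.dat', ['%Y%m%d%H%M'])
print(fd)  # 2021-10-22 16:16:00
```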
- FIXED: Import error when using CLI
- ADDED: List of ignored extensions, currently part of `filescanner`
- ADDED: Support for EddyPro full output files
- ADDED: The variable name stored as `_field` is now also stored as tag `varname` to make it accessible via tag filters.
- CHANGED: Instead of the full filepath of the source file, the database now only stores the filename in tag `srcfile`. Main reason is that depending on from where the file is ingested, the full filepath can be different (e.g. if the raw data server is mounted differently on a machine), which then results in a different tag entry. In such a case the variable is uploaded again (because tags are different), even though it is already present in the db.
- CHANGED: Auxiliary variable info is now collected in separate `measurement` containers `_SD` (standard deviations), `_RAW` (uncorrected) and `IU` (instrument units). This change did not affect the `dataflow` source code, but was done via the `configs` (which is a folder separate from the source code).
- Added 'if testrun' option in main for testing the script locally
- ADDED: `filescanner` now outputs an additional file that lists all files for which no filetypes were defined
- ADDED: `varscanner` now outputs an additional file that lists all variables that were not greenlit (i.e. not defined in the `configs`) and therefore also not uploaded to the db
- ADDED: new filetype `CHA10-RAW-TOA5-DAT-TBL1-1MIN-201612071258` (in external folder `configs`)
- ADDED: new filetype `FRU10-RAW-TOA5-DAT-TBL1-1MIN-201711201643` (in external folder `configs`)
- ADDED: README now shows a list of currently implemented filetypes
- ADDED: new filetype `DAV12-RAW-FF6-TOA5-DAT-TBL1-1MIN-202110221616` (in external folder `configs`)
- ADDED: new filetype `DAV10-RAW-TOA5-DAT-TBL1-10S-201802281101.yaml` (in external folder `configs`)
- REMOVED: some `print` checks from code
- Changed: `filescanner` and `varscanner` can now be executed independently
    - `filescanner` scans the server for data files and outputs results to the `dataflow` output folder
    - `varscanner` scans the `dataflow` output folder for all `filescanner` results
- Changed the way required subpackages are imported: included a `try-except` clause that first tries to import subpackages with relative imports (needed for CLI execution of the script on the server after `pipx` installation of the script), then, if the relative imports failed, absolute imports are called (needed for script execution without `pipx` installation). In short, after the script was installed using `pipx` it needed relative imports, while absolute imports were needed when the script was directly executed e.g. from the environment.
- Removed: `html_pagebuilder` is no longer executed together with `filescanner` and `varscanner`. Instead, it will be in a separate script.
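A minimal, runnable sketch of this relative-first import pattern; the module names are stand-ins (not dataflow's actual modules), and `json` only serves as a demo target for the absolute fallback:

```python
try:
    # Package context (e.g. after pipx installation): relative import works
    from . import common  # type: ignore
except ImportError:
    # Script context: relative import fails, fall back to an absolute import
    import json as common  # stand-in module for the demo

print(common.__name__)  # json (when this file is executed directly)
```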
- First implementations of DataScanner, VarScanner and FileScanner
- First implementation of html_pagebuilder
- Check variable naming after all current files running
- data_version from settings file
- time resolution in CLI?
- meteoscreening files > todo?
- Overview of data uploaded to database
- store var dtype in info
- one hot encoding for strings in chambers?
- auto-detect bucket
- units for all files? prioritize units given in yaml
- timezone for db? timestamp END or START?
- average of winddir
- remove offset from radiation
- mode
- html log output
- tests