Support for parsing xTB outputs by Andrew-S-Rosen · Pull Request #1296 · cclib/cclib

Andrew-S-Rosen · 2023-10-26T07:04:28Z

This is a follow-up to the unmerged PR from #1129 and expands upon it substantially, using it as a starting point. Most of the content can be found in the xtbparser.py file.

Some improvements made here compared to the prior PR:

The logic behind parsing the various attributes works regardless of if the calculation is a relaxation or static calculation.
The logic behind parsing the various attributes is no longer based on very strict string searches on things like the number of white spaces and number of "---" characters, which would have been extremely prone to breaking.
The new parser addresses recent changes to the xTB output file format.
Significant refactoring was carried out to improve readability and ease of maintenance.
Several cclib-supported attributes that were not included in the prior PR are now included here.
Type hints have been added.
Allow for parsing of the gradients file
Add unit tests
Miscellaneous bugfixes.

The only aspects I have not carried over are the catchall attributes atomprop and bondprop proposed in #789. If they are of interest, I recommend porting them over from #1129 after this gets merged. I only wanted to deal with official cclib-supported attributes for now.

Closes #789
Closes #1129.

Andrew-S-Rosen · 2023-10-26T17:08:46Z

@berquist, @shivupa: Could you please let me know where (in the test suite) I should be adding the tests for the xTB logfile parsing? Thanks! The development instructions are a little vague about how precisely the logfiles are being tested and what assertions are being applied.

Also, looks like we need to add an XTB folder in the regression suite: https://github.com/cclib/cclib-data

shivupa · 2023-10-26T19:06:49Z

@berquist, @shivupa: Could you please let me know where (in the test suite) I should be adding the tests for the xTB logfile parsing? Thanks! The development instructions are a little vague about how precisely the logfiles are being tested and what assertions are being applied.

I'd focus on a single point test first. If you have a dvb_sp.out, then you can add that in data/xTB. Then add a line in this file

cclib/test/testdata

Line 318 in 91c8d60

    
           SP        QChem       GenericSPTest         basicQChem5.1               dvb_sp.out

That will let the SP test run (https://github.com/cclib/cclib/blob/master/test/data/testSP.py), and then decorate the functions that need skipped for now (or inherit the Generic_SP_test if needed).

Also, looks like we need to add an XTB folder in the regression suite: https://github.com/cclib/cclib-data

Yes, cclib-data is for regression tests only. Unit and integration tests live in this repo. I don't expect you need integration tests yet. We mainly add things there if there was a bug report that leads to a fix.

Andrew-S-Rosen · 2023-10-26T21:42:27Z

@awvwgk: I'm giving the preliminary xTB parser a full overhaul since it was a good start but not really in a place I'd feel confident for production.

In giving it a refresh, I realized that I remain unsure where (if anywhere) the spin is consistently reported in the xTB logfile. If the user supplied it as an input, e.g. via --uhf, then it will appear in the Calculation Setup block. If they don't specify it all, then there's no mention of spin in the output. It'd be tempting to say "if it's not in the log file, it's spin multiplicity 1," but that isn't true; if someone runs xTB using the detailed input format, those input parameters aren't echoed anywhere in the log file to the best of my knowledge. So, one could run with spin using detailed input, and it'd be nowhere in the logs.

Do you happen to have any suggestions? If my understanding is indeed correct, then it might just have to be an optional return.

Andrew-S-Rosen · 2023-10-27T20:08:30Z

@shivupa: I am pretty much done with the refactoring, updating it for the current version of xTB, bugfixing, and adding features. Mostly just have the automated tests to set up and run now. However, I have one remaining question in the meantime.

Namely, what is the ideal approach if the gradients (for the grads attribute) are not stored in the main logfile but are instead stored in a separate file called gradient. Is there a way to allow the user to pass in a directory or, perhaps, multiple file paths?

shivupa · 2023-10-27T21:08:20Z

We should be able to handle multiple files (for example, turbomole requires this). I can provide some details later today about this!

Andrew-S-Rosen · 2023-10-28T02:58:48Z

@shivupa: Thanks! A bit of insight would be great when you have a moment 😄 In the meantime, I'll sort out adding some of these tests. Because xTB is not a DFT method, there are a fair number of fields that will need to be skipped, but that's okay.

shivupa · 2023-10-29T00:19:06Z

Sorry traveling.

So this works for turbomole

#cclib/data/Turbomole/basicTurbomole7.5
from cclib.io import ccread
files = ["water_hf_solvent_cosmo/basis", "water_hf_solvent_cosmo/control", "water_hf_solvent_cosmo/coord", "water_hf_solvent_cosmo/energy", "water_hf_solvent_cosmo/mos", "water_hf_solvent_cosmo/statistics", "water_hf_solvent_cosmo/water.log"]
cclib_obj = ccread(files)
print(cclib_obj.getattributes())

I think this should just work straight away for the turbomole parser but you might need to write an xtb specific sort_input.

cclib/cclib/parser/turbomoleparser.py

Line 73 in 91c8d60

def sort_input(self, file_names: typing.List[str]) -> typing.List:

@oliver-s-lee does that sound right?

oliver-s-lee · 2023-10-30T09:25:34Z

Hi folks, the mechanism for parsing from multiple files is relatively simple, it just uses some magic to make all the files appear as one long file. From the parser's point of view, all you need is to define a sort_input() method if you need the files to appear in a given order (if it doesn't matter you can leave this out). sort_input() is called with a list of the available input files, and should return a list with the desired order.

If calling from the code, you can just specify a list of files as @shivupa demonstrated. From the command line, you need to give the --multi/-m option:
ccget --multi main.log gradient etc

Andrew-S-Rosen · 2023-10-30T14:42:21Z

Thanks!! That's very helpful! I'll give that a go.

Andrew-S-Rosen · 2023-11-03T03:56:33Z

Thanks for merging in the RTD fix, @berquist! Haven't forgotten about this PR btw. Had to take a short break after getting a little overwhelmed trying to sort out the logic behind the test suite. 😅 Will be back at it shortly.

Edit: Hoping to get back to this in the beginning of December. The main limiting factor is that I don't have a good handle on how the test suite is running. I know how to write tests of course, but things are somewhat optimized in the cclib test suite such that I don't have an immediate idea where to start. I need time to look into it.

Andrew-S-Rosen · 2023-11-26T05:10:34Z

Two questions

This is at the point where the only thing needed before merging is a set of tests. There are some additional features to add based on parsing multiple files, but that'll be saved for follow-up work.

With that said, I have two remaining questions outlined below.

Multiple File Parsing

@oliver-s-lee: It sounds like cclib operates by concatenating everything in memory to one big file. On the surface, this seems simple, but in practice it leads to some complexities. For instance, xTB outputs a charges file that (as the name suggests) contains the charges. It looks like so:

   -0.56476049
    0.28238024
    0.28238024

Note that there are no strings in the file that one can use to identify that this is from the charges file. The only way to know the meaning of the data is through the context of which file it came from.

With that said, what is the recommended approach? For simplicity, I will probably push this off to a follow-up PR, but it'd still be useful to know.

Tests

@shivupa, @berquist: I, unfortunately, remain quite confused about how to properly implement tests for the parser. Of course, I understand how to write a test suite on my own that would evaluate the various attributes; however, I don't really have a good grasp on how to do things "the cclib" way since this is not documented anywhere (unless I have missed it). This is especially true because xTB doesn't share many attributes that other codes have (e.g. no basis, for instance). As one point of initial confusion, I can specify SP XTB GenericSPTest in the testdata, but assertions are being made against B3LYP, which aren't relevant here. Do I create a custom XtbSPTest class that inherits from GenericSPTest and overwrite any default assertions so they are relevant for xTB? I could use some assistance on how to best do this without writing my own tests that don't conform to the cclib testdata stuff.

I'm assuming y'all would not appreciate if I wrote my own test_xtb_parser.py and just made a manual set of assertions for the various attributes? 😅

oliver-s-lee · 2023-11-26T10:59:42Z

Hi @Andrew-S-Rosen, great to see the xTB support is progressing so well! With regards to multi-file parsing, yes you are largely correct. A minor (but important) technical difference is that the files are not actually all loaded into memory at once, because this would require a large continuous block of memory that would be very wasteful. Rather, each line is read from each file one-by-one, but there's some magic as you say to make it appear like one long-file.

Regardless, the issue of detecting which file is which is an interesting one, and not something we've come up against before I don't think. You can access the name of the current file like this:
inputfile.filenames[inputfile.file_pointer]
Which should work fine for most cases. I personally like pathlib for dealing with filenames, where you could do something like:

if Path(inputfile.filenames[inputfile.file_pointer]).name == "charges":
  ...

There might be some fringe cases where this doesn't work, like if we're reading files from a URL where the filename might not be available, but I'd recommend not worrying about those for now :)

For testing, I'll let the other guys confirm as they have more experience with the testing suite, but yes creating a custom XTBSPTest class inheriting from GenericSPTest and modifying based on what's available is totally fine. For concepts which don't exist in xTB, you can either just skip the test entirely or change the value to whatever is the closest equivalent, whichever makes the most sense. You can of course add new tests for concepts which are unique to xTB. As a rough example:

class XTBSPTest(GenericSPTest):

    @skipForParser('xTB', 'xTB has no concept of basis sets')
    def testnbasis(self):
        pass

    def testscfenergy(self):
        """Is the SCF energy within the target?"""
        assert abs(self.data.scfenergies[-1] - 1.12345) < 40

berquist · 2023-11-26T19:00:38Z

Do I create a custom XtbSPTest class that inherits from GenericSPTest and overwrite any default assertions so they are relevant for xTB? I could use some assistance on how to best do this without writing my own tests that don't conform to the cclib testdata stuff.

Yes. I will make an example for you and commit it, then you can expand on that.

I'm assuming y'all would not appreciate if I wrote my own test_xtb_parser.py and just made a manual set of assertions for the various attributes?

Suffer through this for now, but the test suite is being incrementally rewritten to properly follow pytest expectations.

One important thing is that skipFor decorators go on methods in the parent class and those methods are not overridden in the child.

Andrew-S-Rosen · 2023-11-26T19:21:53Z

@berquist, @oliver-s-lee: Thank you both for the very helpful comments! I should have a clear path forward with this :)

berquist · 2023-11-26T21:08:04Z

Should've said this earlier, but I will comment again when I'm done committing to the branch.

berquist · 2023-11-27T14:50:05Z

Done; my last few commits show how to write customized test cases and skip those test methods whose implementation is missing or doesn't make sense for the program. The strategy is to incrementally remove each decorator as functionality is added or fixed.

I'd start with homos, I saw the attribute is the wrong length.

Andrew-S-Rosen · 2023-11-27T16:15:59Z

@berquist: Thanks a ton!! I really appreciate you setting the stage for this. I'll take it from here and patch things up as needed. I'll let you know when it's ready for review.

Andrew-S-Rosen · 2023-11-29T22:02:42Z

@berquist: How am I meant to run the tests locally? Doing pytest . within the cclib/test/data directory doesn't start any tests. Additionally, running pytest testSP.py results in all failures because none of the self attributes are defined.

berquist · 2023-11-30T04:04:43Z

The principal driver of tests is https://github.com/cclib/cclib/blob/master/.github/scripts/run_pytest.bash. The key is, from the root directory,

python -m pytest -v -s --terse test -k "not test_method"

where -s is so that the main parser tests actually display their output, and --terse is a cclib-specific flag to keep some of that main parser output tame. But because there is a wrapper over all these parser tests so that they only appear as a single pytest test, you can't use -k to isolate xTB.

An alternative that might work is

python -m test.test_data --terse XTB

which uses the old unittest functionality.

berquist · 2023-12-11T14:42:47Z

Since you aren't on Slack, I'll bug you here...I would like to start (and possibly finish) the 1.8.1 release process this coming weekend. Probably both of us would like to see this in but I don't want to merge until you're satisfied with tests. Do you need any additional pointers or help?

Andrew-S-Rosen · 2023-12-11T17:52:37Z

@berquist: Thanks for the ping. Admittedly, I have been a bit overwhelmed by the gnarly merge conflicts (see below) and the test suite. I just tried to resolve some of the conflicts, but I ended up making things worse so reverted it.

I'm going to put the merge conflicts aside for a moment and try to focus on increasing the test coverage today. If you have a spare moment, getting some help with the conflicts would be huge... some of the files have very large and obscure conflicts e.g. see test/regression.py where the whole file is marked as a conflict, making it hard to tell what to keep/change.

berquist · 2023-12-11T18:11:32Z

It slipped my mind that you would have to deal with the rebase/merge over the reformatted code. Just looking at the filenames, the reason for most (all?) of these conflicts is actually line endings: these files were CRLF, I converted them all to LF, and Git is incredibly stupid about attempting to resolve this for reasons I don't know.

I can't get to this during the week, so hack on tests for now, and ~Saturday I will rebase the whole thing (unless you want to...). Then I will review, since I don't think anyone's done that yet.

Andrew-S-Rosen · 2023-12-11T18:27:25Z

@berquist: Ah, thanks! Good to know about the line endings. With that in mind, I might be able to get the conflict resolved. I'll take a stab but won't spend too much time if it's still a mess.

Will work on more tests!

berquist · 2023-12-11T20:38:56Z

If you do want to try tackling it yourself (you probably don't...), you will want to edit each commit in the interactive rebase, do pre-commit run -a, and if there are changes, add and amend.

(For the pre-commit command, you can ignore run -a currently complaining about how some ReST markup looks like a merge conflict. I need to fix that...)

Andrew-S-Rosen mentioned this pull request Oct 27, 2023

Add xtb recipes Quantum-Accelerators/quacc#1113

Closed

Andrew-S-Rosen changed the title ~~[WIP] Support for parsing xTB outputs~~ Support for parsing xTB outputs Oct 28, 2023

berquist mentioned this pull request Nov 15, 2023

Initial support for parsing xTB output (WIP) #1129

Closed

berquist added parsers xTB labels Nov 15, 2023

berquist added this to the v1.8.1 milestone Nov 15, 2023

Andrew-S-Rosen and others added 28 commits December 16, 2023 16:05

Add type hint

923bbff

rename xtb data directory otherwise test discovery silently doesn't work

ad949a4

xTB: explicitly skip over missing lines in orbital occupation output

c430726

xTB: fix parsing multiplicity from input command

dbfc034

xTB: add dvb_sp data for 6.6.0

d9ce091

xTB: add not-yet-enabled tests and initial test skipping

14f1751

xTB: skip remaining SP tests but parse file

d466c33

xTB: begin customized SP tests

01368d7

Remove unused imports

f7f6067

Add support for grad and hessian parsing

209d83b

Fix ohess parsing

8b73588

Fix free energy parsing

f47dff1

Add support for temperature parsing

02f5a21

Clean up test files

6745802

Add IR test fiels

c631c5a

Fix SCF energy parsing

78fe0e6

Fix atomcoords

723beab

Fix tests

0000690

Fix comment

9c97ef0

Remove redundant skip

c08db0e

Typo fix

590fe3b

Delete data/XTB/basicXTB6.6.1/test.ipynb

d9871a9

Refactor methods

c875564

Use set_attribute (part 1)

e794e97

Remove unused import

2eeabee

fix

cb0d266

fix

3f8ce33

fix

03319ca

berquist merged commit 363c4f3 into cclib:master Dec 17, 2023

Andrew-S-Rosen deleted the rosen-xtb branch December 17, 2023 01:01

Conversation

Andrew-S-Rosen commented Oct 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Andrew-S-Rosen commented Oct 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shivupa commented Oct 26, 2023

Uh oh!

Andrew-S-Rosen commented Oct 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Andrew-S-Rosen commented Oct 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shivupa commented Oct 27, 2023

Uh oh!

Andrew-S-Rosen commented Oct 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shivupa commented Oct 29, 2023

Uh oh!

oliver-s-lee commented Oct 30, 2023

Uh oh!

Andrew-S-Rosen commented Oct 30, 2023

Uh oh!

Andrew-S-Rosen commented Nov 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Andrew-S-Rosen commented Nov 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Two questions

Multiple File Parsing

Tests

Uh oh!

oliver-s-lee commented Nov 26, 2023

Uh oh!

berquist commented Nov 26, 2023

Uh oh!

Andrew-S-Rosen commented Nov 26, 2023

Uh oh!

berquist commented Nov 26, 2023

Uh oh!

berquist commented Nov 27, 2023

Uh oh!

Andrew-S-Rosen commented Nov 27, 2023

Uh oh!

Andrew-S-Rosen commented Nov 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

berquist commented Nov 30, 2023

Uh oh!

berquist commented Dec 11, 2023

Uh oh!

Andrew-S-Rosen commented Dec 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

berquist commented Dec 11, 2023

Uh oh!

Andrew-S-Rosen commented Dec 11, 2023

Uh oh!

berquist commented Dec 11, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Andrew-S-Rosen commented Oct 26, 2023 •

edited

Loading

Andrew-S-Rosen commented Oct 26, 2023 •

edited

Loading

Andrew-S-Rosen commented Oct 26, 2023 •

edited

Loading

Andrew-S-Rosen commented Oct 27, 2023 •

edited

Loading

Andrew-S-Rosen commented Oct 28, 2023 •

edited

Loading

Andrew-S-Rosen commented Nov 3, 2023 •

edited

Loading

Andrew-S-Rosen commented Nov 26, 2023 •

edited

Loading

Andrew-S-Rosen commented Nov 29, 2023 •

edited

Loading

Andrew-S-Rosen commented Dec 11, 2023 •

edited

Loading