Skip to content

Support for parsing xTB outputs#1296

Merged
berquist merged 78 commits intocclib:masterfrom
Andrew-S-Rosen:rosen-xtb
Dec 17, 2023
Merged

Support for parsing xTB outputs#1296
berquist merged 78 commits intocclib:masterfrom
Andrew-S-Rosen:rosen-xtb

Conversation

@Andrew-S-Rosen
Copy link
Contributor

@Andrew-S-Rosen Andrew-S-Rosen commented Oct 26, 2023

This is a follow-up to the unmerged PR from #1129 and expands upon it substantially, using it as a starting point. Most of the content can be found in the xtbparser.py file.

Some improvements made here compared to the prior PR:

  • The logic behind parsing the various attributes works regardless of if the calculation is a relaxation or static calculation.
  • The logic behind parsing the various attributes is no longer based on very strict string searches on things like the number of white spaces and number of "---" characters, which would have been extremely prone to breaking.
  • The new parser addresses recent changes to the xTB output file format.
  • Significant refactoring was carried out to improve readability and ease of maintenance.
  • Several cclib-supported attributes that were not included in the prior PR are now included here.
  • Type hints have been added.
  • Allow for parsing of the gradients file
  • Add unit tests
  • Miscellaneous bugfixes.

The only aspects I have not carried over are the catchall attributes atomprop and bondprop proposed in #789. If they are of interest, I recommend porting them over from #1129 after this gets merged. I only wanted to deal with official cclib-supported attributes for now.

Closes #789
Closes #1129.

@Andrew-S-Rosen
Copy link
Contributor Author

Andrew-S-Rosen commented Oct 26, 2023

@berquist, @shivupa: Could you please let me know where (in the test suite) I should be adding the tests for the xTB logfile parsing? Thanks! The development instructions are a little vague about how precisely the logfiles are being tested and what assertions are being applied.

Also, looks like we need to add an XTB folder in the regression suite: https://github.com/cclib/cclib-data

@shivupa
Copy link
Member

shivupa commented Oct 26, 2023

@berquist, @shivupa: Could you please let me know where (in the test suite) I should be adding the tests for the xTB logfile parsing? Thanks! The development instructions are a little vague about how precisely the logfiles are being tested and what assertions are being applied.

I'd focus on a single point test first. If you have a dvb_sp.out, then you can add that in data/xTB. Then add a line in this file

SP QChem GenericSPTest basicQChem5.1 dvb_sp.out

That will let the SP test run (https://github.com/cclib/cclib/blob/master/test/data/testSP.py), and then decorate the functions that need skipped for now (or inherit the Generic_SP_test if needed).

Also, looks like we need to add an XTB folder in the regression suite: https://github.com/cclib/cclib-data

Yes, cclib-data is for regression tests only. Unit and integration tests live in this repo. I don't expect you need integration tests yet. We mainly add things there if there was a bug report that leads to a fix.

@Andrew-S-Rosen
Copy link
Contributor Author

Andrew-S-Rosen commented Oct 26, 2023

@awvwgk: I'm giving the preliminary xTB parser a full overhaul since it was a good start but not really in a place I'd feel confident for production.

In giving it a refresh, I realized that I remain unsure where (if anywhere) the spin is consistently reported in the xTB logfile. If the user supplied it as an input, e.g. via --uhf, then it will appear in the Calculation Setup block. If they don't specify it all, then there's no mention of spin in the output. It'd be tempting to say "if it's not in the log file, it's spin multiplicity 1," but that isn't true; if someone runs xTB using the detailed input format, those input parameters aren't echoed anywhere in the log file to the best of my knowledge. So, one could run with spin using detailed input, and it'd be nowhere in the logs.

Do you happen to have any suggestions? If my understanding is indeed correct, then it might just have to be an optional return.

@Andrew-S-Rosen
Copy link
Contributor Author

Andrew-S-Rosen commented Oct 27, 2023

@shivupa: I am pretty much done with the refactoring, updating it for the current version of xTB, bugfixing, and adding features. Mostly just have the automated tests to set up and run now. However, I have one remaining question in the meantime.

Namely, what is the ideal approach if the gradients (for the grads attribute) are not stored in the main logfile but are instead stored in a separate file called gradient. Is there a way to allow the user to pass in a directory or, perhaps, multiple file paths?

@shivupa
Copy link
Member

shivupa commented Oct 27, 2023

We should be able to handle multiple files (for example, turbomole requires this). I can provide some details later today about this!

@Andrew-S-Rosen
Copy link
Contributor Author

Andrew-S-Rosen commented Oct 28, 2023

@shivupa: Thanks! A bit of insight would be great when you have a moment 😄 In the meantime, I'll sort out adding some of these tests. Because xTB is not a DFT method, there are a fair number of fields that will need to be skipped, but that's okay.

@Andrew-S-Rosen Andrew-S-Rosen changed the title [WIP] Support for parsing xTB outputs Support for parsing xTB outputs Oct 28, 2023
@shivupa
Copy link
Member

shivupa commented Oct 29, 2023

Sorry traveling.

So this works for turbomole

#cclib/data/Turbomole/basicTurbomole7.5
from cclib.io import ccread
files = ["water_hf_solvent_cosmo/basis", "water_hf_solvent_cosmo/control", "water_hf_solvent_cosmo/coord", "water_hf_solvent_cosmo/energy", "water_hf_solvent_cosmo/mos", "water_hf_solvent_cosmo/statistics", "water_hf_solvent_cosmo/water.log"]
cclib_obj = ccread(files)
print(cclib_obj.getattributes())

I think this should just work straight away for the turbomole parser but you might need to write an xtb specific sort_input.

def sort_input(self, file_names: typing.List[str]) -> typing.List:

@oliver-s-lee does that sound right?

@oliver-s-lee
Copy link
Contributor

Hi folks, the mechanism for parsing from multiple files is relatively simple, it just uses some magic to make all the files appear as one long file. From the parser's point of view, all you need is to define a sort_input() method if you need the files to appear in a given order (if it doesn't matter you can leave this out). sort_input() is called with a list of the available input files, and should return a list with the desired order.

If calling from the code, you can just specify a list of files as @shivupa demonstrated. From the command line, you need to give the --multi/-m option:
ccget --multi main.log gradient etc

@Andrew-S-Rosen
Copy link
Contributor Author

Thanks!! That's very helpful! I'll give that a go.

@Andrew-S-Rosen
Copy link
Contributor Author

Andrew-S-Rosen commented Nov 3, 2023

Thanks for merging in the RTD fix, @berquist! Haven't forgotten about this PR btw. Had to take a short break after getting a little overwhelmed trying to sort out the logic behind the test suite. 😅 Will be back at it shortly.

Edit: Hoping to get back to this in the beginning of December. The main limiting factor is that I don't have a good handle on how the test suite is running. I know how to write tests of course, but things are somewhat optimized in the cclib test suite such that I don't have an immediate idea where to start. I need time to look into it.

@Andrew-S-Rosen
Copy link
Contributor Author

Andrew-S-Rosen commented Nov 26, 2023

Two questions

This is at the point where the only thing needed before merging is a set of tests. There are some additional features to add based on parsing multiple files, but that'll be saved for follow-up work.

With that said, I have two remaining questions outlined below.

Multiple File Parsing

@oliver-s-lee: It sounds like cclib operates by concatenating everything in memory to one big file. On the surface, this seems simple, but in practice it leads to some complexities. For instance, xTB outputs a charges file that (as the name suggests) contains the charges. It looks like so:

   -0.56476049
    0.28238024
    0.28238024

Note that there are no strings in the file that one can use to identify that this is from the charges file. The only way to know the meaning of the data is through the context of which file it came from.

With that said, what is the recommended approach? For simplicity, I will probably push this off to a follow-up PR, but it'd still be useful to know.

Tests

@shivupa, @berquist: I, unfortunately, remain quite confused about how to properly implement tests for the parser. Of course, I understand how to write a test suite on my own that would evaluate the various attributes; however, I don't really have a good grasp on how to do things "the cclib" way since this is not documented anywhere (unless I have missed it). This is especially true because xTB doesn't share many attributes that other codes have (e.g. no basis, for instance). As one point of initial confusion, I can specify SP XTB GenericSPTest in the testdata, but assertions are being made against B3LYP, which aren't relevant here. Do I create a custom XtbSPTest class that inherits from GenericSPTest and overwrite any default assertions so they are relevant for xTB? I could use some assistance on how to best do this without writing my own tests that don't conform to the cclib testdata stuff.

I'm assuming y'all would not appreciate if I wrote my own test_xtb_parser.py and just made a manual set of assertions for the various attributes? 😅

@oliver-s-lee
Copy link
Contributor

Hi @Andrew-S-Rosen, great to see the xTB support is progressing so well! With regards to multi-file parsing, yes you are largely correct. A minor (but important) technical difference is that the files are not actually all loaded into memory at once, because this would require a large continuous block of memory that would be very wasteful. Rather, each line is read from each file one-by-one, but there's some magic as you say to make it appear like one long-file.

Regardless, the issue of detecting which file is which is an interesting one, and not something we've come up against before I don't think. You can access the name of the current file like this:
inputfile.filenames[inputfile.file_pointer]
Which should work fine for most cases. I personally like pathlib for dealing with filenames, where you could do something like:

if Path(inputfile.filenames[inputfile.file_pointer]).name == "charges":
  ...

There might be some fringe cases where this doesn't work, like if we're reading files from a URL where the filename might not be available, but I'd recommend not worrying about those for now :)

For testing, I'll let the other guys confirm as they have more experience with the testing suite, but yes creating a custom XTBSPTest class inheriting from GenericSPTest and modifying based on what's available is totally fine. For concepts which don't exist in xTB, you can either just skip the test entirely or change the value to whatever is the closest equivalent, whichever makes the most sense. You can of course add new tests for concepts which are unique to xTB. As a rough example:

class XTBSPTest(GenericSPTest):

    @skipForParser('xTB', 'xTB has no concept of basis sets')
    def testnbasis(self):
        pass

    def testscfenergy(self):
        """Is the SCF energy within the target?"""
        assert abs(self.data.scfenergies[-1] - 1.12345) < 40

@berquist
Copy link
Member

Do I create a custom XtbSPTest class that inherits from GenericSPTest and overwrite any default assertions so they are relevant for xTB? I could use some assistance on how to best do this without writing my own tests that don't conform to the cclib testdata stuff.

Yes. I will make an example for you and commit it, then you can expand on that.

I'm assuming y'all would not appreciate if I wrote my own test_xtb_parser.py and just made a manual set of assertions for the various attributes?

Suffer through this for now, but the test suite is being incrementally rewritten to properly follow pytest expectations.


One important thing is that skipFor decorators go on methods in the parent class and those methods are not overridden in the child.

@Andrew-S-Rosen
Copy link
Contributor Author

@berquist, @oliver-s-lee: Thank you both for the very helpful comments! I should have a clear path forward with this :)

@berquist
Copy link
Member

Should've said this earlier, but I will comment again when I'm done committing to the branch.

@berquist
Copy link
Member

Done; my last few commits show how to write customized test cases and skip those test methods whose implementation is missing or doesn't make sense for the program. The strategy is to incrementally remove each decorator as functionality is added or fixed.

I'd start with homos, I saw the attribute is the wrong length.

@Andrew-S-Rosen
Copy link
Contributor Author

@berquist: Thanks a ton!! I really appreciate you setting the stage for this. I'll take it from here and patch things up as needed. I'll let you know when it's ready for review.

@Andrew-S-Rosen
Copy link
Contributor Author

Andrew-S-Rosen commented Nov 29, 2023

@berquist: How am I meant to run the tests locally? Doing pytest . within the cclib/test/data directory doesn't start any tests. Additionally, running pytest testSP.py results in all failures because none of the self attributes are defined.

@berquist
Copy link
Member

The principal driver of tests is https://github.com/cclib/cclib/blob/master/.github/scripts/run_pytest.bash. The key is, from the root directory,

python -m pytest -v -s --terse test -k "not test_method"

where -s is so that the main parser tests actually display their output, and --terse is a cclib-specific flag to keep some of that main parser output tame. But because there is a wrapper over all these parser tests so that they only appear as a single pytest test, you can't use -k to isolate xTB.

An alternative that might work is

python -m test.test_data --terse XTB

which uses the old unittest functionality.

@berquist
Copy link
Member

Since you aren't on Slack, I'll bug you here...I would like to start (and possibly finish) the 1.8.1 release process this coming weekend. Probably both of us would like to see this in but I don't want to merge until you're satisfied with tests. Do you need any additional pointers or help?

@Andrew-S-Rosen
Copy link
Contributor Author

Andrew-S-Rosen commented Dec 11, 2023

@berquist: Thanks for the ping. Admittedly, I have been a bit overwhelmed by the gnarly merge conflicts (see below) and the test suite. I just tried to resolve some of the conflicts, but I ended up making things worse so reverted it.

I'm going to put the merge conflicts aside for a moment and try to focus on increasing the test coverage today. If you have a spare moment, getting some help with the conflicts would be huge... some of the files have very large and obscure conflicts e.g. see test/regression.py where the whole file is marked as a conflict, making it hard to tell what to keep/change.

@berquist
Copy link
Member

It slipped my mind that you would have to deal with the rebase/merge over the reformatted code. Just looking at the filenames, the reason for most (all?) of these conflicts is actually line endings: these files were CRLF, I converted them all to LF, and Git is incredibly stupid about attempting to resolve this for reasons I don't know.

I can't get to this during the week, so hack on tests for now, and ~Saturday I will rebase the whole thing (unless you want to...). Then I will review, since I don't think anyone's done that yet.

@Andrew-S-Rosen
Copy link
Contributor Author

@berquist: Ah, thanks! Good to know about the line endings. With that in mind, I might be able to get the conflict resolved. I'll take a stab but won't spend too much time if it's still a mess.

Will work on more tests!

@berquist
Copy link
Member

If you do want to try tackling it yourself (you probably don't...), you will want to edit each commit in the interactive rebase, do pre-commit run -a, and if there are changes, add and amend.

(For the pre-commit command, you can ignore run -a currently complaining about how some ReST markup looks like a merge conflict. I need to fix that...)

@berquist berquist merged commit 363c4f3 into cclib:master Dec 17, 2023
@Andrew-S-Rosen Andrew-S-Rosen deleted the rosen-xtb branch December 17, 2023 01:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for xTB

5 participants