Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import from bean-extract / beancount.ingest? #23

Closed
wadabum opened this issue Jul 21, 2019 · 13 comments · Fixed by #62
Closed

Import from bean-extract / beancount.ingest? #23

wadabum opened this issue Jul 21, 2019 · 13 comments · Fixed by #62

Comments

@wadabum
Copy link

wadabum commented Jul 21, 2019

So, I´ve spend hours writing a bean-extract compatible parser using beancount.ingest.importer.ImporterProtocol

Can one somehow string those things together?


  • I do NOT mean parsing a written-our .beancount file,
    but "run bean-extract" and use that result as data_source - similar to Fava?

something along the lines of the usual:

# beancount.config
from importers import my_importer
configured_my_importer = my_importer.Importer(
  currency='USD',
)
CONFIG = [
  configured_my_importer,
]
# ....
# ....
# ....
    data_sources = [
        dict(
            module='beancount_import.source.beancount-ingest',
            class=configured_my_importer,
        )
# or
    data_sources = [
        dict(
            module='beancount_import.source.beancount-extract',
            config='path/to/beancount.config', # parse&read CONFIG from there
            dirs='documents/',
        )

or however that´s gonna map out... ;-)

(think that could proxy-close #18 as well...)

@jbms
Copy link
Owner

jbms commented Jul 23, 2019

There isn't any existing way to do this, but some sort of convergence with beancount.ingest is something I'm very interested in exploring. However, there are some challenges to overcome:

  1. beancount.ingest is based on generating partial transactions (that don't necessarily balance) with missing postings. beancount-import requires full transactions, with Expenses:FIXME postings to indicate unknown accounts (a single Expenses:FIXME posting is allowed to correspond to multiple unknown postings). This is simple to handle, though --- just generate Expenses:FIXME postings to balance the transaction.

  2. With just step 1, you could have a partial integration with beancount-import, but to get the automatic tracking of which transactions have already been imported, and automatic classification of unknown accounts, the beancount-import data source needs to provide more information than is available in the beancount.ingest model.

@dumbPy
Copy link
Contributor

dumbPy commented Jul 26, 2020

How about exporting this required info in meta?
I am currently exporting transaction ref/ cheque No. and jumbled description into meta.
ML models that can use this to predict postings and use your awesome UI for manual corrections would be great. currently exploring smart_importer but it doesn't have a way to fix interactively like this project.
I too want this to work.

@jbms
Copy link
Owner

jbms commented Jul 26, 2020

If you specified which metadata fields indicate a unique transaction id, and which metadata fields to use for automatic classification, then I think it would work.

@dumbPy
Copy link
Contributor

dumbPy commented Jul 26, 2020

Do you have any plans for defining API for the UI part?

I am currently using smart_importer and I think having an interactive UI makes a lot of difference.
Having a clearly defined API would mean making UI compatible with all the future ML models.
My statements have a lot of numbers and plain text and comments and I would surely love to try some simple NLP models like Spacy's classifier.

I personally loved the idea of being able to make some manual entries and later merge them effortlessly with the statement.

@jbms
Copy link
Owner

jbms commented Jul 26, 2020

I guess you want to use the beancount-import UI with the account prediction provided by smart_importer and ingesters defined using the benacount.ingest framework?

I don't have plans to try to make that work myself, but if you're interested in making that happen, it does sound like it could be very useful, and I'm happy to offer advice and would be interested in merging it into beancount-import, for example.

The datasource API provided by beancount-import is already sufficient, I think, for integrating with beancount.ingest. Currently there is no API for plugging in a different account prediction mechanism, but that could be added.

Or maybe it would make sense for beancount-import to merge into smart_importer entirely, but not sure if the smart_importer authors are interested.

@Zburatorul
Copy link
Collaborator

Zburatorul commented Jul 27, 2020

Correct me if I'm wrong, but it sounds like the stumbling block of this program (ingester->smart_importer->beancount-import UI) is that the prediction/ML part needs to be factored out in beancount-import?
Is there any other issue related to matching existing and currently imported transactions? It looks to me like smart_importer is capable of detecting duplication.

Is it possible then to get the smart_importer to do all the deduplication and account prediction, while simply turning off this functionality in beancount-import?

Your posts from two years ago seem relevant:
https://www.mail-archive.com/[email protected]/msg01628.html
https://www.mail-archive.com/[email protected]/msg01630.html

@jbms
Copy link
Owner

jbms commented Jul 27, 2020

Currently beancount-import relies on an account name starting with Expenses:FIXME to indicate an "unknown" account, which has 3 effects:

  1. when matching to existing transactions in the journal, such an account can match any account name
  2. automatic prediction may be done
  3. the user may manually enter the correct account name in the UI

Individual beancount-import datasources are already responsible for doing deduplication, so it is no problem for smart_importer to do that. To let smart_importer perform the automatic prediction, while still preserving effects (1) and (3), beancount-import could be modified to support an extra metadata key on postings that would indicate the predicted account while still allowing the account to be considered "unknown".

This would be a pretty simple change to make, e.g. in _get_unknown_account_predictions in reconcile.py.

@Zburatorul
Copy link
Collaborator

@dumbPy, are you interested in trying to get your smart_importer to work with beancount-import along the lines of #60 ?
You'd have to make it add two meta field to each predicted posting.

@addisonklinke
Copy link

@jbms Individual beancount-import datasources are already responsible for doing deduplication

Is this still accurate?

If I understand correctly, that means an ImporterProtocol subclass needs to confirm each transaction it generates isn't a duplicate of an existing entry in the transactions.beancount file?

@jbms
Copy link
Owner

jbms commented Jun 20, 2023

The generic importer handles deduplication based on (account, date, units, narration), so individual ImporterProtocol subclasses should not need to do that.

@addisonklinke
Copy link

addisonklinke commented Jun 20, 2023

Hmm is there an option (aside from subclassing the source) to avoid requiring the narration match? Usually I tidy these up in the import UI, so what's in the transactions file is more concise and/or better formatted than the raw importer output

@jbms
Copy link
Owner

jbms commented Jun 20, 2023

There isn't currently any such option, but I agree that matching on narration is problematic. Potentially you could match on just date, account, and amount alone. But it would be better to store the original narration as a metadata field, so that it can be used for training and for deduplication

@addisonklinke
Copy link

Yes currently I'm storing the original in the source_desc metadata field. I think it'd make more sense to use that as part of the deduplication key than the final narration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants