reader.field_terms raises OverflowError for DATETIME fields #24

Open
Quemoy opened this issue Mar 30, 2022 · 0 comments
Quemoy commented Mar 30, 2022

Calling the field_terms method of the whoosh.reading.IndexReader class on a DATETIME field raises an exception. The following index and documents reproduce the problem:

from whoosh import index
from whoosh.fields import Schema, TEXT, ID, DATETIME
from pathlib import Path
from datetime import datetime

schema = Schema(doc_ID=ID(unique=True, stored=True),
                m_date=DATETIME(stored=True),
                content=TEXT(stored=True)
                )

idx_fp = Path.home()/'test'
# Make sure the index directory exists before creating the index in it
idx_fp.mkdir(exist_ok=True)
ix = index.create_in(idx_fp, schema)

with ix.writer() as wr:
    wr.add_document(doc_ID='A1', m_date=datetime(2020,1,1), content='doc A1')
    wr.add_document(doc_ID='B2', m_date=datetime(1985,3,15), content='doc B2')
    wr.add_document(doc_ID='C3', m_date=datetime(2019,12,8), content='doc C3')
    wr.add_document(doc_ID='D4', m_date=datetime(1977,1,1), content='doc D4')

read = ix.reader()

When attempting to retrieve all terms for the m_date field via list(read.field_terms('m_date')), an OverflowError is raised with the following message:

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-2-da69936def15> in <module>
----> 1 list(read.field_terms('m_date'))

~/.conda/envs/whoosh/lib/python3.7/site-packages/whoosh/reading.py in field_terms(self, fieldname)
    260         from_bytes = self.schema[fieldname].from_bytes
    261         for btext in self.lexicon(fieldname):
--> 262             yield from_bytes(btext)
    263 
    264     def __iter__(self):

~/.conda/envs/whoosh/lib/python3.7/site-packages/whoosh/fields.py in from_bytes(self, bs)
    843     def from_bytes(self, bs):
    844         x = NUMERIC.from_bytes(self, bs)
--> 845         return long_to_datetime(x)
    846 
    847     def _parse_datestring(self, qstring):

~/.conda/envs/whoosh/lib/python3.7/site-packages/whoosh/util/times.py in long_to_datetime(x)
     87     x -= seconds * 1000000
     88 
---> 89     return datetime.min + timedelta(days=days, seconds=seconds, microseconds=x)
     90 
     91 

OverflowError: date value out of range

It seems that this is caused by a DATETIME field having extra terms added on by whoosh. For example, printing the result of read.all_terms() gives the following:

('content', b'a1')
('content', b'b2')
('content', b'c3')
('content', b'd4')
('content', b'doc')
('doc_ID', b'A1')
('doc_ID', b'B2')
('doc_ID', b'C3')
('doc_ID', b'D4')
('m_date', b'\x00\x80\xdd\x88\xed\x0fe\xa0\x00')
('m_date', b'\x00\x80\xdetF.\x1d\xc0\x00')
('m_date', b'\x00\x80\xe2Y$\xf4\xf5\x00\x00')
('m_date', b'\x00\x80\xe2[\x07\xc1&\x00\x00')
('m_date', b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0')
('m_date', b'\x08\x00\x80\xdetF.\x1d\xc0')
('m_date', b'\x08\x00\x80\xe2Y$\xf4\xf5\x00')
('m_date', b'\x08\x00\x80\xe2[\x07\xc1&\x00')
('m_date', b'\x10\x00\x00\x80\xdd\x88\xed\x0fe')
('m_date', b'\x10\x00\x00\x80\xdetF.\x1d')
('m_date', b'\x10\x00\x00\x80\xe2Y$\xf4\xf5')
('m_date', b'\x10\x00\x00\x80\xe2[\x07\xc1&')
('m_date', b'\x18\x00\x00\x00\x80\xdd\x88\xed\x0f')
('m_date', b'\x18\x00\x00\x00\x80\xdetF.')
('m_date', b'\x18\x00\x00\x00\x80\xe2Y$\xf4')
('m_date', b'\x18\x00\x00\x00\x80\xe2[\x07\xc1')
('m_date', b' \x00\x00\x00\x00\x80\xdd\x88\xed')
('m_date', b' \x00\x00\x00\x00\x80\xdetF')
('m_date', b' \x00\x00\x00\x00\x80\xe2Y$')
('m_date', b' \x00\x00\x00\x00\x80\xe2[\x07')
('m_date', b'(\x00\x00\x00\x00\x00\x80\xdd\x88')
('m_date', b'(\x00\x00\x00\x00\x00\x80\xdet')
('m_date', b'(\x00\x00\x00\x00\x00\x80\xe2Y')
('m_date', b'(\x00\x00\x00\x00\x00\x80\xe2[')
('m_date', b'0\x00\x00\x00\x00\x00\x00\x80\xdd')
('m_date', b'0\x00\x00\x00\x00\x00\x00\x80\xde')
('m_date', b'0\x00\x00\x00\x00\x00\x00\x80\xe2')
('m_date', b'8\x00\x00\x00\x00\x00\x00\x00\x80')

The terms of the content and doc_ID fields all make sense given the documents, but the m_date field has extra terms stored in the index. To see what they decode to, I converted each term of the m_date field with the following code:

dt = ix.schema['m_date']
for x in read.lexicon('m_date'):
    print(repr(dt.from_bytes(x)))

This prints four datetime values (the m_date values of the four documents)

datetime.datetime(1977, 1, 1, 0, 0)
datetime.datetime(1985, 3, 15, 0, 0)
datetime.datetime(2019, 12, 8, 0, 0)
datetime.datetime(2020, 1, 1, 0, 0)

...before running into an overflow exception on the fifth term, b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0'. Decoding that term by hand, using the same steps as the NUMERIC class:

In [4]: from whoosh.util.numeric import from_sortable
In [5]: dt._struct.unpack(b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0'[1:])[0]
Out[5]: 36272377181463968
In [6]: from_sortable(dt.numtype, dt.bits, dt.signed, 36272377181463968)
Out[6]: -9187099659673311840

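This raises an exception because the decoded value is negative: long_to_datetime adds it to datetime.min (year 1), and Python datetimes cannot represent anything earlier than that.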
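
The arithmetic from the session above can be reproduced with the standard library alone. This is only a minimal sketch. It assumes that dt._struct is a big-endian unsigned 64-bit format and that from_sortable, for a signed 64-bit field, amounts to subtracting 2**63. Both assumptions are inferred from Out[5] and Out[6], not taken from the whoosh source. The helper try_decode is hypothetical and only mimics the last line of long_to_datetime shown in the traceback.

```python
import struct
from datetime import datetime, timedelta

# The fifth m_date term from the all_terms() dump.
term = b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0'

# Assumption: everything after the first byte is a big-endian unsigned 64-bit int.
raw = struct.unpack(">Q", term[1:])[0]

# Assumption: undoing the "sortable" encoding means subtracting 2**63.
micros = raw - 2**63


def try_decode(microseconds):
    """Hypothetical helper: mimic `datetime.min + timedelta(...)` and
    return the OverflowError instead of raising it."""
    try:
        return datetime.min + timedelta(microseconds=microseconds)
    except OverflowError as exc:
        return exc


result = try_decode(micros)
print(raw)           # matches Out[5]
print(micros)        # matches Out[6]
print(repr(result))  # an OverflowError, as in the traceback
```

The negative offset is roughly 106 million days, which timedelta itself can hold. It is the addition to datetime.min that goes out of range.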

Do we know what the extra terms added to the index for the DATETIME field are?
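
One reading of the dump, offered as an inference from the bytes rather than something confirmed from the whoosh source: the first byte of each m_date term looks like a bit-shift marker (0x00, 0x08, 0x10, ... 0x38), and the remaining eight bytes look like the full-precision value shifted right by that many bits. If that holds, only the terms starting with b'\x00' are real document values, and the rest are coarser copies of them. The sketch below checks this on a few terms copied from the dump and decodes only the full-precision ones. The helpers split_term and decode_full_precision are hypothetical names, not whoosh APIs.

```python
import struct
from datetime import datetime, timedelta

# Terms copied from the all_terms() dump: the four terms that start with
# b'\x00', plus two coarser terms derived from the first one.
terms = [
    b'\x00\x80\xdd\x88\xed\x0fe\xa0\x00',
    b'\x00\x80\xdetF.\x1d\xc0\x00',
    b'\x00\x80\xe2Y$\xf4\xf5\x00\x00',
    b'\x00\x80\xe2[\x07\xc1&\x00\x00',
    b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0',
    b'\x10\x00\x00\x80\xdd\x88\xed\x0fe',
]


def split_term(term):
    """Return (shift, payload), assuming a 1-byte shift marker followed
    by a big-endian unsigned 64-bit payload."""
    return term[0], struct.unpack(">Q", term[1:])[0]


def decode_full_precision(term):
    """Decode a shift-0 term as microseconds since datetime.min,
    assuming the sortable encoding adds 2**63."""
    shift, payload = split_term(term)
    if shift != 0:
        raise ValueError("not a full-precision term")
    return datetime.min + timedelta(microseconds=payload - 2**63)


# Keep only terms whose leading byte is 0 and decode them.
decoded = [decode_full_precision(t) for t in terms if t[0] == 0]
for value in decoded:
    print(repr(value))

# Check the shift interpretation: each coarser payload should equal the
# full-precision payload shifted right by the marker value.
full_payload = split_term(terms[0])[1]
shift_matches = [
    split_term(t)[1] == full_payload >> split_term(t)[0] for t in terms[4:]
]
print(shift_matches)
```

Under the same assumption, filtering read.lexicon('m_date') down to terms whose first byte is zero before calling from_bytes would avoid the OverflowError, although whether field_terms is meant to do that filtering itself is a question for the maintainers.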

cclauss pushed a commit to cclauss/whoosh-1 that referenced this issue Jan 4, 2024
Removing Pypy3.10 from the tox.yml file for now until we can fix the github action and tests to work with it.