mbox takes any sentence starting with ^From as a new message
#20
Comments
Yes, this is a known bug (see db40e3a). This error comes from the current implementation of https://docs.python.org/3.5/library/mailbox.html#mbox In Any idea is welcome.
The same issue as in the 2.x series for mbox. I thought you had solved it here :-) I think the actual issue is not checking whether the next line is a header. Checking that would reduce the odds of having issues, although it may still fail in a discussion about email, if somebody pastes header + body as part of the text ;-)
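A minimal sketch of the "check the next line" idea, for illustration only. It is not Perceval's code nor the standard library's; `HEADER_RE` and `split_messages` are hypothetical names, and the header test is a simplification of real header syntax.

```python
import re

# Simplified check for a header line such as "Subject: ..." (assumption:
# a field name made of letters, digits and hyphens, followed by a colon).
HEADER_RE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:")


def split_messages(lines):
    """Group lines into messages.

    A new message starts only when a line begins with 'From ' AND the
    line right after it looks like a header.
    """
    messages = []
    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        starts_message = line.startswith("From ") and bool(HEADER_RE.match(next_line))
        if starts_message or not messages:
            messages.append([])
        messages[-1].append(line)
    return messages
```

With this check, a body line such as "From now on we should do X." followed by ordinary text stays in the current message. As noted above, it would still split wrongly if someone pastes a separator line plus headers into a body.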
What about inspecting messages right after being identified as such by the mailbox library, looking for those mandatory headers? If those headers are not found, the message could be safely added to the body of the "previous" message, since it was not really a message. I guess in the following code in perceval/backends/mbox.py

    for message in self.parse_mbox(tmp_path):
        if not self._validate_message(message):
            continue
        yield message

it would be a matter of, instead of continuing, adding to the body of the previous message. Granted, this would force us to defer the yield until the next message is analyzed, which could make the code a bit more complex... If you want, I can give it a try to see if I can produce a PR for this...
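The deferred yield could look roughly like the sketch below. It is a standalone illustration, not a patch for perceval/backends/mbox.py: the dict layout (`headers`, `body`, `raw`), the `merge_false_messages` and `has_mandatory_headers` names, and the choice of Message-ID and Date as mandatory headers are all assumptions made for the example.

```python
def has_mandatory_headers(message):
    # Assumption for this sketch: Message-ID and Date are the mandatory headers.
    headers = message["headers"]
    return "Message-ID" in headers and "Date" in headers


def merge_false_messages(messages, is_valid=has_mandatory_headers):
    """Yield valid messages, folding invalid ones into the previous body.

    Each message is held back until the next one has been inspected,
    because the next one may turn out to be a truncated tail.
    """
    previous = None
    for message in messages:
        if is_valid(message):
            if previous is not None:
                yield previous
            previous = dict(message)  # copy so the input is not modified
        elif previous is not None:
            # Not a real message: glue its raw text back onto the previous body.
            previous["body"] += message["raw"]
        # An invalid message with nothing before it is dropped.
    if previous is not None:
        yield previous
```

The extra state is a single `previous` variable, so the added complexity stays small. The concern about real messages that lack some mandatory header still applies: such a message would be merged into its predecessor.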
The problem is there are messages that do not include some mandatory headers, such as
That's my concern, but if you find a clean way to do it, go ahead. Take into account that this is a problem related to the
However, if the next line is not a header ( It will fail in other cases, but not in the ones that people start a sentence with And yes, the implementation of mbox in Python is naive, slow and bogus. It is unfortunate that there is no good mbox parser in Python, as there is in other languages.
I wonder how many mboxes out there do not have the format:
I took all my compressed files to compare false positives.

    $ find .mlstats/compressed -type f | wc -l
    16194
    $ find .mlstats/compressed -type f -exec zgrep '^From ' {} \; > from-sample.txt

For some reason I did not investigate, zgrep's output included some lines without

    $ egrep '^From ' from-sample.txt > from-sample-s.txt
    $ wc -l from-sample-s.txt
    2604383 from-sample-s.txt

From there, I explored which lines did not end with a year and did not have a time. I removed the ones with @, nobody, MAILER-DAEMON and ' at ', because those were repetitive. And I checked them manually. I did not find a single line that was not the beginning of an email.
So, I could say, that any line that does not match
Not that many, to be honest. But those would be the double, because it will truncate messages where these lines belong to. Well, I narrowed the list to check it manually and I put them in a file
That said:
You can check with your own datasets.
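For anyone repeating this on their own archives, a rough sketch of the same kind of check follows. The pattern below is not the one used in the comment above (that one is not preserved here); it is an assumed, looser rule that treats a "From " line as a plausible separator only if it contains a time and ends with a four-digit year. `SEPARATOR_RE`, `suspicious_from_lines` and `from_lines_in_gzip` are hypothetical names.

```python
import gzip
import re

# Assumed shape of a separator: "From <sender> ... HH:MM[:SS] ... YYYY"
SEPARATOR_RE = re.compile(
    r"^From \S+.*\b\d{1,2}:\d{2}(:\d{2})?\b.*\b\d{4}\s*$"
)


def suspicious_from_lines(lines):
    """Return the lines that start with 'From ' but do not look like separators."""
    return [
        line for line in lines
        if line.startswith("From ") and not SEPARATOR_RE.match(line)
    ]


def from_lines_in_gzip(path):
    """Collect every line starting with 'From ' from one gzip-compressed mbox."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in f if line.startswith("From ")]
```

Running `suspicious_from_lines(from_lines_in_gzip(path))` over each archive gives a short list to review by hand, which is roughly the manual step described above.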
Thanks @gpoo. I will try the same with some sets. In the meantime, do you know of any good mbox parsers in other languages, like C, Java, Perl, or whatever? I'm curious to know how they fix this issue because I think it's a problem more related to the mbox format itself than to the parser. If someone fixed this in other languages, we can apply the same techniques or even send a patch to the standard library to fix it.
This patch adds the to_unicode() function to avoid encoding problems related to ascii and unicode conversions. This function is required by uuid() function. Fixes chaoss#20
This patch sets the system encoding to UTF-8 using setdefaultencoding() from sys module. It is a little hacky but really useful while we moving the code to Python 3. Fixes chaoss#20
The mbox backend parses anything starting with From as a new message. Therefore, the following message from OpenStack will be taken as 2 messages:
The main issue is that this leaves the message truncated. In this example, the last paragraph and signature will be lost.
If the purpose is only to parse metadata, the approach is OK, although there would then be no reason to store the body of the message.
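The behaviour can be reproduced with Python's standard `mailbox` module alone. The sketch below uses a made-up sample message, not the OpenStack one; `SAMPLE` and `count_messages` are names invented for this example.

```python
import mailbox
import os
import tempfile

# One real message whose body has a paragraph starting with "From ".
SAMPLE = (
    "From alice@example.com Mon Jan  4 10:00:00 2016\n"
    "From: alice@example.com\n"
    "Subject: Test\n"
    "\n"
    "First paragraph.\n"
    "\n"
    "From now on we should do X.\n"
    "\n"
    "Regards\n"
)


def count_messages(mbox_text):
    """Write the text to a temporary mbox file and count what mailbox.mbox finds."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.mbox")
        with open(path, "w", newline="\n") as f:
            f.write(mbox_text)
        box = mailbox.mbox(path, create=False)
        try:
            return len(box)
        finally:
            box.close()
```

`count_messages(SAMPLE)` should report two messages for this single email, with the text from "From now on" onwards cut off from the first. If the body line is quoted as ">From now on ...", which is the usual mbox escaping convention, the count should go back to one.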