-
Notifications
You must be signed in to change notification settings - Fork 526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated fix for #1652: autodetect line breaks type in editor #49
base: master
Are you sure you want to change the base?
Updated fix for #1652: autodetect line breaks type in editor #49
Conversation
d0dd55a
to
56a39c2
Compare
56a39c2
to
4e0dba3
Compare
4e0dba3
to
b9c6e7e
Compare
b9c6e7e
to
501be0a
Compare
i would choose an entirely different approach: instead of converting upon loading/saving, deal with line endings "on the fly". this has the advantage that even the worst misdetection cannot corrupt the file, you can always make a best effort to display it, and you can switch the mode dynamically when you find it wrong. |
Feel free to choose an entirely different approach and provide patches. For 6 years the original issue (http://www.midnight-commander.org/ticket/1652) exists, this is the only patch which surfaced, on which 3+ people worked over last 2 years. If you think that these people didn't consider other approaches, then sorry, it's a bit naive. |
if you did consider different approaches, then document your conclusions for others to verify. |
Bother to read the linked ticket. |
yes, sorry. on the technical side, your "fear" of adding the necessary abstraction seems entirely overblown. there is already an abstraction for multi-byte chars (for the utf-8 support). adding the necessary abstraction for multi-byte line endings isn't exactly rocket science. and quite frankly, for somebody who volunteers for maintainership, you have a funny reluctance to getting to know the code. on the user side, the concerns about the potential detriments of the patch have led you to choosing an approach that is so conservative that many people won't even notice that the feature is there. this is entirely counterproductive. the right solution would be always on, and yet without risk of accidentally destroying a file (fwiw, at least for me, this isn't about editing actual binaries - it's about "funny" text files, text interspersed with binary, etc.). |
Alternative patches welcome.
Better maintainer welcome.
Note: not my concerns, but concerns of the overly conservative maintainer. It did not "led" me, I just proposed that as a compromise, to get some progress on this issue. Currently, we have funny situation that there're concerns of binary safety of text editor, while the text editor has problem working with real-world text files. And finally: seeing you were a contributor, I'm not interested in fingerpointing between you and me. As you're not bad maintainer, I'm not interested in having quarrels with you. And as you're not a maintainer, I'm not interested spending time on arguments which then are empty. |
i don't find this strange at all. seriously, if you want a real text editor, by all means, use one. i use mcedit only for quick edits of more or less random files - and i know i can trust it to behave consistently irrespective of file type and not mess up the file. i really wouldn't want to lose that, even if it goes at the cost of other features. |
And I use mc as a super-IDE for all my development work, thru the years, leaving behind vims, emacses, ideas, eclipses, and other "exciting-new" stuff popping up and down thru the time. So, as a casual mc user (feel free to skip arguing on this definition, you said what you said with your "quick edits") you will keep the behavior you're used to. And core users, like myself, will be able to work with real-world text files adequately. So, proposed compromise looks very viable in practice. |
i've been using mcedit like that myself a decade ago, and the slavaz team clearly did so as well judging by the enhancements they made. how is the EOL feature relevant for your use as an IDE? you're supposed to use the SCM system's EOL conversions. while you're right that i could disable the feature, at that point it just becomes dead code from my perspective. why would i be supposed to approve that? there is an additional factor: i could (very occasionally) actually make good use of the feature, so i don't want it to be so cumbersome to use. but i don't want to enable it by default because of the risk (or the inconsistent behavior when it needs to fall back to an entirely different mode when it encounters a file it cannot deal with). so it's not just that i dislike the kinda hacky proposed solution - i'd also really like to see the feature done properly, which is why i'm trying to convince you to give it another shot. |
I figure, visible tabs, visible trailing spaces are pretty advanced features, and one of the reasons why I think elaborating mcedit further makes good sense.
That's mostly offtopic, but the passage above gives the answer - there're too many such IDEs, and they're usually language-specific. I'm an open-source hacker, and need to quickly "scroll" thru many projects in short time, and I don't know what next language I make hack on next, and for how long, and don't want to spend extended time setting up an infrastructure for it. Instead, I want to have a universal IDE to work with any kind of source, and mc and mcedit serve that pretty well, modulo few genric last-mile imrovements like this (well, I my also imagine few more adhoc improvements, but I don't even want to think about them now too much, as I value mc's simplicity and generic nature).
Yes, and I use that. But as I mentioned, I'm an open-source hacker deal with steady stream of new projects regularly (say, at least dozen each month, usually more). Not all of them set up right, a lot have line-endings mix up. In the end, I figured that I spend too much effort just to report such issues upstream (which are hardly appreciated), or fork and clean up somebody's stuff, etc. Going Internet-style, i.e. being conservative regarding what I produce, but liberal regarding what can be dealt with without much trouble is what revived my interest in this patch, so I tried to clean it up based on collected feedback, and get into mainline after all, instead of maintaining in my fork.
mc open-source project, and I think it's important that it would cater itself for working well with other open-source projects (including quirks as described above).
mc itself is kinda hacky coded thing. mc is buggy, including hard bugs like segfaults, hangs, history of data corruption. We live in real world, and this patch provides real-world solution to a problem. As I mentioned, my study of the problem showed that alternative to this way is either doing myriad small patches throughout the editor codebase, which while small, not a few chars, but make control structure more complicated; or rewrite editor pretty much. Both of these choices would be irresponsible - both on my side to propose and on maintainers' to accept - that's what can easily lead to omission/overlook which may lead to segfaults/data corruption. Done like it is, the patch does solve the problem and is contained - you can look at it, and raise concerns, unlike some spaghetti patch, whose effect you cannot predict. Once raise, concerns can be addressed, even if it won't be a complete solution, but compromise. But that's what art of real-world software engineering is. |
how could one argue with that ... ok, let's assume a maintainer who wants to merge this pops up. the series obviously needs a rebase, and i think it should be squashed (slavaz' remaining parts are rather small, so i think a simple done-with: slavaz footer will do for attribution). i really don't like the option being there. why not make the thing so conservative that it cannot possibly fail - check the whole file? i don't really understand why you are using mc_read in the first place - the file is read in its entirety anyway, so why not insert the detection and adjustment between the regular read and the insertion into the edit buffer? where is the inverse conversion on file writeout? |
* On opening file, detect line-endings used by sampling some initial content. * If it happen to be CR or CRLF, skip fast load path, and in edit_insert_file() convert such line endings to '\n'. * Save detected line ending type for editor. Signed-off-by: Paul Sokolovsky <[email protected]> Signed-off-by: Slava Zanko <[email protected]>
501be0a
to
93c77de
Compare
Patchset rebased onto latest master. Will respond to other questions a bit later. |
Allows to enable or disable line break type detection in the editor. Default value is off, based on concerns expressed regarding binary-safety of the editor. With this setting conservatively set to off, editor behavior stays exactly the same as before, unless user explicitly enables this option. Signed-off-by: Ilia Maslakov <[email protected]> Signed-off-by: Paul Sokolovsky <[email protected]>
Signed-off-by: Slava Zanko <[email protected]>
Detection now requires consistent line breaks being present within sampling buffer. If breaks of different types are found, or there's a slightest suspicion that it may be binary content, it reverts to binary-safe "as-is" mode. Also, refactor detection routines to facilitate unit testing. Signed-off-by: Paul Sokolovsky <[email protected]>
Also, test only line break detection algorithm, not anything else (no file operations, etc.). Signed-off-by: Paul Sokolovsky <[email protected]>
93c77de
to
6d496be
Compare
I keep patches separate and with original authorship as an argument that this is not my patch - 2 official maintainers worked on it too. The patch can be squashed easily anytime, once it's confirmed as ready for merging. |
Which whole file - 1GB, 4GB, 1TB one? It's one of the known issues that mistakenly pressing F4 on a wrong large file is a good way to DoS current mc session and possibly whole system. So, scanning thru entire file is irresponsible, you need to make that operation be O(1) of file size, not O(n). Then you just select suitable window size and few Ks already offer 99% guarantee of detection. This implementation also nicely decouples from the rest of editor, to not make it more of spaghetti. |
That's handled by the existing "Save As..." code, which allows to save a file with different line-breaks. So, the detection process just sets another default value for Save/Save As to use (standard default is "don't change"). |
Again, that option was added as a compromise to get going with that feature. I was pointed out that this patch changes behavior of editor, and I made changes so it wasn't, unless user opts in. Yes, I myself think that this feature should be on by default, as apparently more users use text editor to edit text files than binary ones. But it's not me to decide. (For perspective, I also think that "Esc key mode: single press" should be default, as realistically nobody uses Esc to manually enter escape sequences nowadays, or use 300 or so baud rates. If I were a maintainer, I would raise a question to make it a default, perhaps with a command-line switch to force-disable - I'm sure we lose bunch of users because of the current default). |
my point is that the compromise should not be necessary:
this is entirely true. but this problem exists regardless of your patch. given that the line ending scan is at least an order of magnitude faster than the code to insert the file into the buffer (char by char!), injecting it between the existing mc_read and insertion loop would not noticeably worsen the situation. and when a fix is implemented (probably a more or less arbitrary file size limitation inducing a confirmation dialog), the detection code would be automatically covered by it.
i totally don't buy that. in the worst case you need an out parameter which the load function sets as a "side effect" (i haven't looked at the code in detail, but it isn't that bad). |
(sorry for delay with response)
Sorry, I don't understand what you hint at - if a design is compromised, then everyone should feel free to compromise it even further? Sorry, I don't think that's sustainable or responsible approach.
So, if you mean detecting line-break type in parallel with loading file contents itself, then in general that wouldn't be possible, because before loading each line, one already needs to know line break type of a file. And doing it in 2 passes over file has exactly DoS potential I mentioned above: on a system with low free physical memory, doing that will cause swapping. If system also has a lot of other I/O activity, that can lead to true system-wide I/O due to bugs in contemporary OSes: https://bugzilla.kernel.org/show_bug.cgi?id=12309 (these bugs happen because people do not think about issues we discuss here). Such issues may affect stability and performance of a modern desktop, but people (e.g. myself) run mc on small systems (like home routers/NASes/etc.) where 32MB of RAM or less is norm. It's huge difference whether unneeded DoS-like operation will scan something once and completes in 10mins or twice and completes in 20mins.
And yet that's generally accepted software engineering practice - consider that the most "efficient" way to write something is to write tightly-couples spaghetti code in assembler, nobody any longer does that. Vice-versa, there're design patterns how to decouple things to achieve other kinds of efficiencies than the one implied by spaghetti assembler code. You miss the fact that this is just a first version of the implementation of a fix for this issue (trac:1652). If got into mainline, but deemed as requiring more improvements, it can be further elaborated. You also miss the fact that many people contributed to even this solution, and if you see immediate need for further tweaks, please contribute them. Otherwise, the way this patch is done, it addresses the original issue, while being conservative enough to not affect users which want to be unaffected, and has other important traits, like being easy to maintain on top of the mainline sources. And that's how it is maintained so far. If official maintainers see some (non DoS imposing) changes as a pre-requisite to it being merged, I'm happy to address them. |
no, that's actually your argument:
my argument is that not duplicating work is good, and comes with the added bonus of getting fixed for free when the code path that exists anyway is fixed. but you're right that my idea won't work as-is, because the files are read in chunks, so a whole-sale detection and conversion can't happen before the file has been loaded in its entirety. that sounds like a somewhat bigger refactoring exercise. however, your argumentation still makes no sense. the existing code path is sufficient to DoS the system. adding a second full-file reader would double the problem, but at that point this isn't very relevant any more. so the rather obvious conclusion is not avoiding it, but adding an adequate protection mechanism to the new code (which would automatically cover the old code as well if it was always enabled). |
Wow! It work's! Nice. Thanks. |
Do you look at the vi editor from the openwatcom-v2? Looks like mcedit, mouse support, multiwindow, crlf autodetection.... |
Suggestion: always show LB type. Curretly LF type is not displayed |
Is this going to be merged? I would really appreciate it. Thanks! |
a48cc07
to
5583c3d
Compare
e9e72d9
to
de7d72c
Compare
This is rebased and updated branch https://github.com/MidnightCommander/mc/tree/1652_autodetect_lb , addressing bugreport http://www.midnight-commander.org/ticket/1652 .
It consists of rebasing to the latest codebase original patches (editor functionality by myself, editor options by Ilia Maslakov, line break detection unit tests by Slava Zanko), and additional changes to make linebreak type detection much more conservative and robust (plus some refactoring of unittests).
This patcheset adheres to following requirements: