
Category Archives: sed

I needed to inspect a relatively small portion of a large log file (~1 GB), big enough to choke even powerful text editors like vi(m) or emacs.

I proceeded in two steps:
1) found the match in the file and pulled its line number

—————————————————————————————-
awk '/ May 10 /{a=$0; b = NR;}END{print a," :: ",b}' log.txt
—————————————————————————————-
which yielded:
Thu May 10 02:17:05 ART 2012 :: 29199076

2) then I dumped the content from that line onwards and trimmed it with head
—————————————————————-
tail -n +29199076 log.txt | head -n 100
—————————————————————-
That works thanks to the “tail -n +N” trick, which prints the file from line N onwards.

As an alternative to the last step, as explained here, sed could’ve been used in the following manner:
—————————————————————-
sed -n -e '29199076,29199176p' -e '29199177q' log.txt
—————————————————————-
(the last expression, for efficiency, tells sed to quit at the limit line + 1)
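
The two steps can also be folded into one small function. This is only a sketch under a few assumptions: `show_from_match` is a made-up name, the pattern is treated as an awk regex, and, unlike the awk command above (which remembers the last matching line), it stops at the first match so the rest of a huge file is never read.

```shell
#!/bin/sh
# Hypothetical helper (not from the post): print COUNT lines of FILE,
# starting at the first line that matches the awk regex PATTERN.
show_from_match() {
    pattern=$1 file=$2 count=${3:-100}
    # Exit awk at the first match instead of scanning the whole file.
    start=$(awk -v pat="$pattern" '$0 ~ pat { print NR; exit }' "$file")
    [ -n "$start" ] || return 1      # no match: nothing to print
    end=$((start + count - 1))
    # Print lines start..end and quit right after the last wanted line.
    sed -n -e "${start},${end}p" -e "${end}q" "$file"
}
```

It would be called as `show_from_match ' May 10 ' log.txt 100`.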


(This is mainly a reminder post for myself)
For certain reasons I sometimes have to edit text pasted from an emacs buffer that I was editing with longlines-mode enabled. As this mode does, the paragraphs end up wrapped with line breaks wherever they extend beyond ‘fill-column’ in length.

Although “the soft newlines used for line wrapping will not show up when the text is yanked or saved to disk”, they will remain if, say, I had carelessly pasted it directly into a gmail form to save for later reuse there.

My way to remove those artificially inserted line breaks is to run this one-liner on the text region:

sed -ne '1h;1!H;${;g;s#\n\([^\n]\)# \1#g;p}' | sed -e 's#^[ \t]*\(.*\)$#\1#g'

(The first sed command collects the whole text in the hold space, the multiline search-and-replace method, and then replaces each line break that is followed by text with a space.
The second just gets rid of the leading white space at the beginning of each line.)
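
For comparison, here is a sketch of the same clean-up using awk's paragraph mode instead of the two sed passes. It is a different technique from the one-liner above, `unwrap` is a made-up name, and it assumes paragraphs are separated by blank lines.

```shell
# Hypothetical helper: join the wrapped lines of each paragraph read on stdin.
unwrap() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" }   # RS="" makes each paragraph one record
         {
             gsub(/[ \t]*\n[ \t]*/, " ")   # line breaks inside a paragraph become one space
             sub(/^[ \t]+/, "")            # drop leading white space of the paragraph
             print                         # ORS leaves a blank line after every paragraph
         }'
}

printf 'a paragraph that\nwas wrapped\n\nand a second\none\n' | unwrap
```

Every paragraph comes out on a single line, followed by a blank line (including the last one).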

Hand-editing web stuff (outside of wysiwyg editors, that is) often requires dealing with messy html markup. Even though tags are meant to be parsed by browsers, and there are even performance benefits to serving ugly obfuscated code without white space, a human needs tidy formatting to read it: the indentation of the tags is what lets us understand what goes on inside a page.

Emacs has a large set of commands to handle code indentation internally, but it never occurred to me until recently that they could help to simply re-align the clutter of html tags that you see in the output of a source view.

I started by defining something basic, like this little elisp function, to quickly help with the general re-alignment of code.

(defun my-tidy ()
  "Automatically re-indents code"
  (interactive)
  (execute-kbd-macro
   (read-kbd-macro "C-u -999 M-x indent-region RET C-x C-x M-x indent-region")))

(global-set-key (kbd "C-M-*") 'my-tidy)

Now that was fine but still insufficient. Say, what about the embedded css or javascript that normally goes along inside a web page?
I didn’t like the way emacs (in its default mode, of course) re-aligns css code, and I do have specific requirements that call for reformatting the style sheet declarations and rules when editing a web document. As I use firefox with its web-developer plug-in, I keep the narrowest possible window at my left in order to save screen real estate, so it is inconvenient to have css stuff deeply indented to the right (I wanted just one space from the left margin). Additionally, rules needed to be broken up nicely as well (avoiding long chains like #div a.rule1, #div b.rule2, #div c.rule3, for example).

This screen-shot might show better why the formatting of css code matters so much in my work setup.

[Screenshot: my css editing style]

So, playing a bit more with the idea, I went on to craft some good old sed and awk one-liners for reformatting everything found inside <style> tags. The formidable shell-command-on-region in emacs allows such things. You will note that the regexp isn’t pretty: it’s sort of long (I might make another post explaining how it breaks down), and it also has the dreaded leaning toothpick syndrome, because emacs lisp needs backslashes to be double escaped.

In short, everything I wanted is wrapped up below, exactly as it now goes inside my emacs init:

(defun select-css-code ()
  "Select region contained by <style></style> tags.
   Simply highlights what is between those tags for embedded css content"
  (interactive)
  (let (p1 p2)
    (goto-char (point-min))
    (search-forward "<style ")
    (backward-char 7)            ; back to the start of "<style "
    (setq p1 (point))
    (search-forward "</style>")
    (setq p2 (point))
    (goto-char p1)
    (push-mark p2)
    (setq mark-active t)))

(defun select-javascript-code ()
  "Select region contained by <script></script> tags.
   Simply highlights what is between those tags for embedded javascript content"
  (interactive)
  (let (p1 p2)
    (goto-char (point-min))
    (search-forward "<script ")
    (backward-char 8)            ; back to the start of "<script "
    (setq p1 (point))
    (search-forward "</script>")
    (setq p2 (point))
    (goto-char p1)
    (push-mark p2)
    (setq mark-active t)))

(defun re-indent-web-page-code ()
  "Re-indents html code including its embedded javascript and css.
The css code gets indented differently (through some awk and sed one-liners)
to ease the editing of styles with Firefox using a window of its Web-Developer plug-in."
  (interactive)
  (progn
    (mark-whole-buffer)
    (indent-rigidly (region-beginning)(region-end) -999)
    (indent-region (region-beginning) (region-end))
    (select-javascript-code)
    (javascript-mode)
    (indent-rigidly (region-beginning)(region-end) -999)
    (indent-region (region-beginning) (region-end))
    (html-mode)
    (select-css-code)
       (setq command  "awk  '/{/ {gsub(/,/,\",\\n\")} {print }' | sed -ne '1h;1!H;${;g;s#{\\([^\\n]\\)#{\\n\\1#g;p}' | sed -ne '1h;1!H;${;g;s#;}#;\\n}#g;p}' | sed -ne '1h;1!H;${;g;s#;\\([^\\n]\\)#;\\n\\1#g;p}' |  sed -e 's#^[ \\t]*\\(.*\\)$#\\1#g' |  awk  '!/{|}|^ *#/&&!/^\\// {$0 = \" \"$0} {print }'  | awk NF |  sed -e 's#\}$#}\\n#g' | sed -ne '1h;1!H;${;g;s#,\\n#,#g;p}' | awk  '/{/ {gsub(/,/,\",\\n\")} {print }' | sed -e 's#^[ \\t]*\\(.*\\),$#\\1,#g'" )
      (shell-command-on-region (mark)(point) command t t)))

UPDATE: note that instead of a macro call

    (execute-kbd-macro
     (read-kbd-macro "M-x indent-region"))

I’m using this straightforward lisp expression

    (indent-region (region-beginning) (region-end))

TODO:
The function is almost there, but I still need to address a couple of things:
What if we have many [style or javascript] sections intermingled in our html?
One way to address that would be to successively grab the content and send it somewhere else using one of the accumulating-text commands of emacs, like ‘append-to-buffer’ (http://www.gnu.org/software/emacs/manual/html_node/emacs/Accumulating-Text.html); then we can switch to that second buffer, treat the code there and simply bring it back to the original document.
I also noticed that the css code doesn’t get indented if the javascript tags aren’t found, so I’ll revise the logic to allow it regardless of whether javascript code exists or not.
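
To try the css idea outside emacs, here is a much smaller sketch written as a single awk program rather than the sed and awk chain above. It is not that pipeline: `format_css` is a made-up name, and it assumes every rule sits complete on one line, with no nested braces and no comments. Rules that do not fit are passed through untouched.

```shell
# Hypothetical helper: reformat one-line css rules read on stdin.
format_css() {
    awk '
    /[{].*[}]/ {
        open = index($0, "{")
        selectors = substr($0, 1, open - 1)
        body = substr($0, open + 1)
        sub(/[}][^}]*$/, "", body)                # drop the closing brace
        gsub(/^[ \t]+|[ \t]+$/, "", selectors)    # trim the selector list
        gsub(/,[ \t]*/, ",\n", selectors)         # one selector per line
        print selectors " {"
        n = split(body, decl, ";")
        for (i = 1; i <= n; i++) {
            gsub(/^[ \t]+|[ \t]+$/, "", decl[i])
            if (decl[i] != "") print " " decl[i] ";"   # one space from the left margin
        }
        print "}"
        print ""                                   # blank line between rules
        next
    }
    { print }                                      # anything else is left as it is
    '
}

echo '#div a.rule1, #div b.rule2 {color: red;margin: 0;}' | format_css
```

Only the selector list is split on commas, so a declaration such as a font-family list keeps its commas on one line.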

To remove empty lines in a region simply do:

M-x flush-lines RET ^$ RET 

or, if the blank lines contain some white space characters:

M-x flush-lines RET ^[[:blank:]]*$ RET 

Whereas with sed (which I use inside emacs via M-| shell-command-on-region ) it’s simply:

sed -e '/^[	 ]*$/d' 
(the bracket holds either a tab or a space; press the TAB key to type it,
since not every version of sed recognizes the \t character) 

How easy it is to forget this kind of thing!
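
A variant that avoids typing a literal tab is the POSIX character class `[[:space:]]`. The sketch below wraps it in a function; `strip_blank_lines` is just an illustrative name.

```shell
# Delete lines that are empty or hold only white space (spaces, tabs, ...).
strip_blank_lines() {
    sed -e '/^[[:space:]]*$/d'
}

# The second and third input lines are blank (the third holds a space and a tab).
printf 'first\n\n \t \nsecond\n' | strip_blank_lines
```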

On how to delete a chunk of text contained in multiple lines: use sed to catch the range (/a/,/b/)

There were some javascript google ads calls in a page I downloaded via wget, doing:

 wget -O toCleanItUp.html www.someSite.com/somePage.html

After highlighting the whole page with C-x h, I ran M-x shell-command-on-region with the following:

sed "/<script .*>/,/<\/script>/d" > nowItIsClean.html

If feeling lazy (or if I don’t have the file open already in an emacs buffer), there’s the straightforward way of using cat:

cat toCleanItUp.html | sed "/<script .*>/,/<\/script>/d" > nowItIsClean.html

Ah, the beauty of unix tools!
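
One caveat with the range: sed only starts looking for the closing address on the line after the opening one, so a script element that opens and closes on the same line makes the range run on to the next </script> further down. A sketch that handles that case first (`strip_scripts` is a made-up name, and it still deletes whole lines, so any other markup sharing a line with a script tag goes too):

```shell
# Hypothetical helper: drop <script> elements from html read on stdin.
strip_scripts() {
    # 1st expression: elements that open and close on one line.
    # 2nd expression: elements that span several lines.
    sed -e '/<script[^>]*>.*<\/script>/d' \
        -e '/<script/,/<\/script>/d'
}

strip_scripts <<'EOF'
<p>keep 1</p>
<script src="ads.js"></script>
<p>keep 2</p>
<script type="text/javascript">
show_ads();
</script>
<p>keep 3</p>
EOF
```

All three paragraphs survive here; with the single range expression, the second one would be swallowed.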

Found a well-written “sed by example” article with practical examples (especially in part 3).
There are plenty of good resources out there. Today I partially checked a thorough intro and tutorial written by Bruce Barnett.
And this collection of sed one-liners has useful stuff as well to get you covered.

Today I had to experiment a little with sed ranges.
The following obviously does not work, because /7/ is ambiguous: it matches every line containing a 7, as 7, 17, 27, 37 and 47 do, so a new range starts again at line 37:

yes 'nope this sed regex range does not work' | head -50 | cat -n | sed -n -e '/7/,/28/p'

These two variants could be used:

# (notice the space before 7 in this case)
yes 'this sed regex range filters correctly from seventh to twenty-eighth' | head -50 | cat -n | sed -n -e '/ 7/,/28/p'

or the following, which uses the POSIX character class definition for space (check here):


yes 'printing from seven to twenty-eighth using a POSIX character in the regex range ' | head -50 | cat -n | sed -n -e '/^[[:space:]]*7[[:space:]].*$/,/28/p'
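
When the numbers really are line numbers, the ambiguity can be sidestepped altogether. A small sketch for comparison: `numbered` is a made-up helper that stands in for the `yes ... | head -50 | cat -n` input used above, and the awk variant is an alternative to the regex ranges, not one of them.

```shell
# Hypothetical stand-in for: yes 'sample line' | head -50 | cat -n
numbered() {
    awk 'BEGIN { for (i = 1; i <= 50; i++) printf "%6d\tsample line\n", i }'
}

# Regex range: prints 7-28, then starts over at 37 and runs to the end (36 lines).
numbered | sed -n '/7/,/28/p' | wc -l

# Comparing the number field numerically cannot match 17, 27 or 37 (22 lines).
numbered | awk '$1 == 7, $1 == 28' | wc -l

# Plain line-number addresses need no cat -n at all (22 lines).
numbered | sed -n '7,28p' | wc -l
```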

I’m adding these other one-liners for my own reference too, even though not all of them use ranges.

To exclude last line:
yes 'the last gets swallowed ' | head -10 | cat -n | sed '$d'

To print last line:
yes 'version 1, print last line' | head -10 | cat -n | sed '$!d'

yes 'version 2, printing last line' | head -10 | cat -n | sed -n '$p'

To print only the first line (like head -1)
sed q

Print only the first five lines (like head -5)
yes 'version 1, prints up to fifth line' | head -10 | cat -n | sed -n '1,5p'

yes 'version 2, prints from first to fifth line' | head -10 | cat -n | sed '6,$d'

yes 'version 3, simplest ' | head -10 | cat -n | sed 5q

whereas this one does the opposite, printing from the sixth line to the last:
yes 'this filters up to the fifth line' | head -10 | cat -n | sed '6,$!d'
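
A quick way to convince myself of these is to compare each one-liner with its head or tail counterpart on ten numbered lines. The helper names `ten` and `check` below are made up for this sketch.

```shell
# Ten lines holding the numbers 1..10.
ten() { awk 'BEGIN { for (i = 1; i <= 10; i++) print i }'; }

# check LABEL OUTPUT_A OUTPUT_B: report whether the two outputs are identical.
check() {
    if [ "$2" = "$3" ]; then echo "ok:   $1"; else echo "FAIL: $1"; fi
}

check 'sed q        == head -n 1'  "$(ten | sed q)"          "$(ten | head -n 1)"
check 'sed 5q       == head -n 5'  "$(ten | sed 5q)"         "$(ten | head -n 5)"
check 'sed -n 1,5p  == head -n 5'  "$(ten | sed -n '1,5p')"  "$(ten | head -n 5)"
check 'sed 6,$d     == head -n 5'  "$(ten | sed '6,$d')"     "$(ten | head -n 5)"
check 'sed 6,$!d    == tail -n +6' "$(ten | sed '6,$!d')"    "$(ten | tail -n +6)"
check 'sed $!d      == tail -n 1'  "$(ten | sed '$!d')"      "$(ten | tail -n 1)"
check 'sed -n $p    == tail -n 1'  "$(ten | sed -n '$p')"    "$(ten | tail -n 1)"
```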

Remember to check http://sed.sourceforge.net/sed1line.txt for a fantastic compilation of sed one-liners