Table Partitioning – Greenplum and Netezza Differences

Greenplum Data Warehouse Partitioning

Table Partitioning in Greenplum Database

Greenplum Database divides tables into parts (also known as partitions) to enable massively parallel processing. Tables are partitioned during CREATE TABLE using the PARTITION BY (and optionally the SUBPARTITION BY) clause. When you partition a table in Greenplum Database, you create a top-level (or parent) table with one or more levels of sub-tables (or child tables). Internally, Greenplum Database creates an inheritance relationship between the top-level table and its underlying partitions, similar to the functionality of the INHERITS clause of PostgreSQL.

Greenplum uses the partition criteria defined during table creation to create each partition with a distinct CHECK constraint, which limits the data that table can contain. The query planner uses CHECK constraints to determine which table partitions to scan to satisfy a given query predicate. The Greenplum system catalog stores partition hierarchy information so that rows inserted into the top-level parent table propagate correctly to the child table partitions. To change the partition design or table structure, alter the parent table using ALTER TABLE with the PARTITION clause.

Creating Partitioned Tables
You partition tables when you create them with CREATE TABLE. This section provides
examples of SQL syntax for creating a table with various partition designs.
To partition a table:
1. Decide on the partition design: date range, numeric range, or list of values.
2. Choose the column(s) on which to partition the table.
3. Decide how many levels of partitions you want. For example, you can create a
date range partition table by month and then subpartition the monthly partitions by
sales region.
• Defining Date Range Table Partitions
• Defining Numeric Range Table Partitions
• Defining List Table Partitions
• Defining Multi-level Partitions
• Partitioning an Existing Table

Defining Date Range Table Partitions:
CREATE TABLE sales (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (date)
( START (date '2008-01-01') INCLUSIVE
  END (date '2009-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 day') );
You can also declare and name each partition individually. For example:

CREATE TABLE sales (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (date)
( PARTITION Jan08 START (date '2008-01-01') INCLUSIVE ,
  PARTITION Feb08 START (date '2008-02-01') INCLUSIVE ,
  PARTITION Mar08 START (date '2008-03-01') INCLUSIVE ,
  PARTITION Apr08 START (date '2008-04-01') INCLUSIVE ,
  PARTITION May08 START (date '2008-05-01') INCLUSIVE ,
  PARTITION Jun08 START (date '2008-06-01') INCLUSIVE ,
  PARTITION Jul08 START (date '2008-07-01') INCLUSIVE ,
  PARTITION Aug08 START (date '2008-08-01') INCLUSIVE ,
  PARTITION Sep08 START (date '2008-09-01') INCLUSIVE ,
  PARTITION Oct08 START (date '2008-10-01') INCLUSIVE ,
  PARTITION Nov08 START (date '2008-11-01') INCLUSIVE ,
  PARTITION Dec08 START (date '2008-12-01') INCLUSIVE
  END (date '2009-01-01') EXCLUSIVE );
You do not have to declare an END value for each partition, only the last one. In this
example, Jan08 ends where Feb08 starts.
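Partition maintenance is done through the parent table, as mentioned above. As a hedged sketch (the new partition name is illustrative, following the standard ALTER TABLE ... PARTITION syntax), you might roll the window forward like this:

ALTER TABLE sales ADD PARTITION jan09
  START (date '2009-01-01') INCLUSIVE
  END (date '2009-02-01') EXCLUSIVE;

ALTER TABLE sales DROP PARTITION FOR (RANK(1)); -- drops the oldest partition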
Defining Numeric Range Table Partitions
A numeric range partitioned table uses a single numeric data type column as the
partition key column. For example:
CREATE TABLE rank (id int, rank int, year int, gender char(1), count int)
DISTRIBUTED BY (id)
PARTITION BY RANGE (year)
( START (2001) END (2008) EVERY (1),
  DEFAULT PARTITION extra );
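
The list and multi-level designs promised in the bullets above follow the same pattern. The sketches below are modeled on the standard Greenplum syntax; the table and partition names are illustrative.

Defining List Table Partitions:
CREATE TABLE rank_list (id int, rank int, year int, gender char(1), count int)
DISTRIBUTED BY (id)
PARTITION BY LIST (gender)
( PARTITION girls VALUES ('F'),
  PARTITION boys VALUES ('M'),
  DEFAULT PARTITION other );

Defining Multi-level Partitions:
CREATE TABLE sales_ml (trans_id int, date date, amount decimal(9,2), region text)
DISTRIBUTED BY (trans_id)
PARTITION BY RANGE (date)
SUBPARTITION BY LIST (region)
SUBPARTITION TEMPLATE
( SUBPARTITION usa VALUES ('usa'),
  SUBPARTITION europe VALUES ('europe'),
  DEFAULT SUBPARTITION other_regions )
( START (date '2008-01-01') INCLUSIVE
  END (date '2009-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );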

Refer to the Greenplum Database Administrator Guide for further information.

Netezza’s Alternative to Partitioning

Netezza zone maps are essentially a combination of the best of both partitioning and indexing. If your data is organized well, you can minimize the amount of data read off disk. Beyond the DDL argument, though, traditional database indexes and partitions have significant disadvantages compared to the zone map approach.

Netezza does not have indexes that the DBA must create and maintain; instead, it automatically builds and maintains zone maps for all tables in the database. These zone maps contain the minimum and maximum value of every column in each extent of each table. So, before a query starts reading from disk, Netezza looks at the predicates in the query and compares them to the zone maps to determine which table extents can be skipped and which need to be read.

For example, if you want to calculate the total sales of Red Ford Mustangs in June 2011, Netezza can skip any table extent that does not contain data for June 2011. So, for a database with 7 years of sales history, it can skip any extent whose maximum date is earlier than, or whose minimum date is later than, June 2011. This can eliminate 98% or more of the required I/O.

Indexes illustrate the advantage zone maps offer even more clearly. Most indexes are useful only for single-record retrieval, and the problem is the cost of maintaining, storing, and reading them. While there are differences in implementation (B-tree, bitmap, etc.), they all boil down to mapping a single index value to a single record. If you have a billion rows in your fact table, you are going to have a billion entries in your (thin) index. Zone maps are always going to be orders of magnitude smaller, since they work at the extent level, so their maintenance cost is trivial.

Also, if you are processing a large percentage of the data, indexes are no help at all. If your data is not organized (sorted) on the indexed column, the values you are looking for are likely to be spread throughout the extents on disk. That means you have to read the entire table anyway, with the additional cost of having to read the index first!

Syntax:
CREATE [ TEMPORARY | TEMP ] TABLE table_name
(
column_name type [ [ constraint_name ] column_constraint [ constraint_characteristics ] ] [, ... ]
[ [ constraint_name ] table_constraint [ constraint_characteristics ] ] [, ... ]
)
[ DISTRIBUTE ON ( column [, ...] ) ]
[ ORGANIZE ON { ( <column> [, ...] ) | NONE } ]
ORGANIZE ON specifies which columns (from one to four) the table is to be organized on. It is not available for external tables. If columns are specified, they cannot be dropped, the table cannot have any materialized views, and all specified column data types must be zone-mappable. When NONE is specified, any organizing key definitions are removed from the host catalog. The table data reorganization takes effect when GROOM TABLE is run. For more information, see "Using Clustered Base Tables" in the IBM Netezza System Administrator's Guide.
Example:
CREATE TABLE test_dm (id int, date_dt timestamp, amt decimal(10,2))
DISTRIBUTE ON (id)
ORGANIZE ON (id,date_dt);

insert into test_dm values (100,'2014-05-01 00:00:00',10.1);
insert into test_dm values (200,'2014-05-05 00:00:00',20.1);
insert into test_dm values (300,'2014-05-06 00:00:00',340.1);
insert into test_dm values (400,now(),10.1);
explain verbose select * from test_dm where date_dt::date = now()::date;

QUERY VERBOSE PLAN:

Node 1.
[SPU Sequential Scan table "TEST_DM" {(TEST_DM."ID")}]
-- Estimated Rows = 1, Width = 20, Cost = 0.0 .. 0.0, Conf = 64.0
Restrictions:
((TEST_DM.DATE_DT < '2014-05-13 00:00:00'::"TIMESTAMP") AND (TEST_DM.DATE_DT >= '2014-05-12 00:00:00'::"TIMESTAMP"))
Projections:
1:TEST_DM."ID" 2:TEST_DM.DATE_DT 3:TEST_DM.AMT
[SPU Return]
[Host Return]

QUERY SQL:
explain verbose select * from test_dm where date_dt::date = now()::date

QUERY PLANTEXT:
NOTICE: QUERY PLAN:
Sequential Scan table "TEST_DM" (cost=0.0..0.0 rows=1 width=20 conf=64) {(TEST_DM."ID")}
(xpath_none, locus=spu subject=self)
(spu_send, locus=host subject=self)
(host_return, locus=host subject=self)

Notice how the planner rewrites the date_dt::date cast into a simple range restriction on DATE_DT, which is exactly the kind of predicate zone maps can use to skip extents.
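
As noted in the ORGANIZE ON description above, the physical reorganization around the organizing columns only takes effect once GROOM TABLE runs; a minimal sketch against the example table:

GROOM TABLE test_dm;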

 

 


Working with Regex

A cool website where you can play with regular expressions:
http://www.regexr.com/

Grabbing HTML Tags
<TAG\b[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML tag. Anything between the tags is captured into the first backreference. The question mark in the regex makes the star lazy, to make sure it stops before the first closing tag rather than before the last, like a greedy star would do. This regex will not properly match tags nested inside themselves, like in <TAG>one<TAG>two</TAG>one</TAG>.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> will match the opening and closing pair of any HTML tag. Be sure to turn off case sensitivity. The key in this solution is the use of the backreference \1 in the regex. Anything between the tags is captured into the second backreference. This solution will also not match tags nested in themselves.

Trimming Whitespace

You can easily trim unnecessary whitespace from the start and the end of a string or the lines in a text file by doing a regex search-and-replace. Search for ^[ \t]+ and replace with nothing to delete leading whitespace (spaces and tabs). Search for [ \t]+$ to trim trailing whitespace. Do both by combining the regular expressions into ^[ \t]+|[ \t]+$. Instead of [ \t] which matches a space or a tab, you can expand the character class into [ \t\r\n] if you also want to strip line breaks. Or you can use the shorthand \s instead.
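
The same search-and-replace is easy to script; a minimal sketch assuming GNU sed and an illustrative file name:

# strip leading and trailing spaces/tabs from every line
sed -E 's/^[ \t]+|[ \t]+$//g' input.txt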

IP Addresses

Matching an IP address is another good example of a trade-off between regex complexity and exactness. \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b will match any IP address just fine, but will also match 999.999.999.999 as if it were a valid IP address. Whether this is a problem depends on the files or data you intend to apply the regex to. To restrict all 4 numbers in the IP address to 0..255, you can use the following regex. It stores each of the 4 numbers of the IP address into a capturing group. You can use these groups to further process the IP number. Free-spacing mode allows this to fit the width of the page.

\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

If you don’t need access to the individual numbers, you can shorten the regex with a quantifier to:

\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

Similarly, you can shorten the quick regex to \b(?:\d{1,3}\.){3}\d{1,3}\b
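
As a usage sketch, the strict version drops straight into a PCRE-capable grep (the log file name is illustrative):

# extract all strictly validated IPv4 addresses from a log
grep -oP '\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b' access.log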

Deleting Duplicate Lines From a File

If you have a file in which all lines are sorted (alphabetically or otherwise), you can easily delete (consecutive) duplicate lines. Simply open the file in your favorite text editor, and do a search-and-replace searching for ^(.*)(\r?\n\1)+$ and replacing with \1. For this to work, the anchors need to match before and after line breaks (and not just at the start and the end of the file or string), and the dot must not match newlines.

Here is how this works. The caret will match only at the start of a line. So the regex engine will only attempt to match the remainder of the regex there. The dot and star combination simply matches an entire line, whatever its contents, if any. The parentheses store the matched line into the first backreference.

Next we will match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.

Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is the first backreference which holds the line we matched. The backreference will match that very same text.

If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now check if it is also followed by a line break or if it is at the end of the file using the dollar sign.

The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks in between them, are all deleted from the file. Since we want to keep the original line, but not the duplicates, we use \1 as the replacement text to put the original line back in.
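
Outside a text editor, the same regex works as a one-liner; a sketch with Perl (the file name is illustrative). The -0777 switch slurps the whole file so the line breaks are visible to the regex, /m gives the anchors their line-boundary behavior, and the dot still does not match newlines:

# collapse consecutive duplicate lines in place
perl -0777 -pi -e 's/^(.*)(\r?\n\1)+$/$1/mg' sorted.txt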

Removing Duplicate Items From a String

We can generalize the above example to afterseparator(item)(separator\1)+beforeseparator, where afterseparator and beforeseparator are zero-length assertions. So if you want to remove consecutive duplicates from a comma-delimited list, you could use (?<=,|^)([^,]*)(,\1)+(?=,|$).

The positive lookbehind (?<=,|^) forces the regex engine to start matching at the start of the string or after a comma. ([^,]*) captures the item. (,\1)+ matches consecutive duplicate items. Finally, the positive lookahead (?=,|$) checks if the duplicate items are complete items by checking for a comma or the end of the string.

How to Find or Validate an Email Address

The regular expression I receive the most feedback on, not to mention “bug” reports, is the one you’ll find right on this site’s home page: \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. This regular expression, I claim, matches any email address. Most of the feedback I get refutes that claim by showing one email address that this regex doesn’t match. Usually, the “bug” report also includes a suggestion to make the regex “perfect”.

As I explain below, my claim only holds true when one accepts my definition of what a valid email address really is, and what it’s not. If you want to use a different definition, you’ll have to adapt the regex. Matching a valid email address is a perfect example showing that (1) before writing a regex, you have to know exactly what you’re trying to match, and what not; and (2) there’s often a trade-off between what’s exact, and what’s practical.

The virtue of my regular expression above is that it matches 99% of the email addresses in use today. All the email addresses it matches can be handled by 99% of all email software out there. If you’re looking for a quick solution, you only need to read the next paragraph. If you want to know all the trade-offs and get plenty of alternatives to choose from, read on.

If you want to use the regular expression above, there are two things you need to understand. First, long regexes make it difficult to nicely format paragraphs. So I didn’t include a-z in any of the three character classes. This regex is intended to be used with your regex engine’s “case insensitive” option turned on. (You’d be surprised how many “bug” reports I get about that.) Second, the above regex is delimited with word boundaries, which makes it suitable for extracting email addresses from files or larger blocks of text. If you want to check whether the user typed in a valid email address, replace the word boundaries with start-of-string and end-of-string anchors, like this: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$.

The previous paragraph also applies to all following examples. You may need to change word boundaries into start/end-of-string anchors, or vice versa. And you will need to turn on the case insensitive matching option.
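
For instance, extracting addresses from a file with GNU grep (the file name is illustrative), where -i supplies the case-insensitive option and -o prints only the matched addresses:

grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' mailbox.txt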

Trade-Offs in Validating Email Addresses

Yes, there are a whole bunch of email addresses that my pet regex doesn’t match. The most frequently quoted example is addresses on the .museum top level domain, which is longer than the 4 letters my regex allows for the top level domain. I accept this trade-off because the number of people using .museum email addresses is extremely low. I’ve never had a complaint that the order forms or newsletter subscription forms on the JGsoft websites refused a .museum address (which they would, since they use the above regex to validate the email address).

To include .museum, you could use ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$. However, then there’s another trade-off. This regex will match john@mail.office. It’s far more likely that John forgot to type in the .com top level domain rather than having just created a new .office top level domain without ICANN’s permission.

This shows another trade-off: do you want the regex to check if the top level domain exists? My regex doesn’t. Any combination of two to four letters will do, which covers all existing and planned top level domains except .museum. But it will match addresses with invalid top-level domains like asdf@asdf.asdf. By not being overly strict about the top-level domain, I don’t have to update the regex each time a new top-level domain is created, whether it’s a country code or generic domain.

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|
biz|info|mobi|name|aero|asia|jobs|museum)$ could be used to allow any two-letter country code top level domain, and only specific generic top level domains. By the time you read this, the list might already be out of date. If you use this regular expression, I recommend you store it in a global constant in your application, so you only have to update it in one place. You could list all country codes in the same manner, even though there are almost 200 of them.

Email addresses can be on servers on a subdomain, e.g. john@server.department.company.com. All of the above regexes will match this email address, because I included a dot in the character class after the @ symbol. However, the above regexes will also match john@aol...com which is not valid due to the consecutive dots. You can exclude such matches by replacing [A-Z0-9.-]+\. with (?:[A-Z0-9-]+\.)+ in any of the above regexes. I removed the dot from the character class and instead repeated the character class and the following literal dot. E.g. \b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b will match john@server.department.company.com but not john@aol...com.

Another trade-off is that my regex only allows English letters, digits and a few special symbols. The main reason is that I don’t trust all my email software to be able to handle much else. Even though John.O'Hara@theoharas.com is a syntactically valid email address, there’s a risk that some software will misinterpret the apostrophe as a delimiting quote. E.g. blindly inserting this email address into a SQL statement will cause it to fail if strings are delimited with single quotes. And of course, it’s been many years already that domain names can include non-English characters. Most software and even domain name registrars, however, still stick to the 37 characters they’re used to.

The conclusion is that to decide which regular expression to use, whether you’re trying to match an email address or something else that’s vaguely defined, you need to start with considering all the trade-offs. How bad is it to match something that’s not valid? How bad is it not to match something that is valid? How complex can your regular expression be? How expensive would it be if you had to change the regular expression later? Different answers to these questions will require a different regular expression as the solution. My email regex does what I want, but it may not do what you want.

 


Monit – Cool way to monitor services

Monit is a helpful program that automatically monitors and manages server processes to ensure not only that they stay online consistently, but also that file size, checksum, and permissions remain correct. Additionally, Monit comes with a basic web interface through which all of the processes can be set up. This tutorial will cover the most basic setup and configuration.
Install Monit
Monit is easiest to install through apt-get:
sudo apt-get install monit

Once Monit is installed, you can add programs and processes to the configuration file:
sudo nano /etc/monit/monitrc

Monit can be started with a single command, which then keeps it running in the background:
monit

Typing monit status displays monit’s details:
The Monit daemon 5.3.2 uptime: 1h 25m

System ‘myhost.mydomain.tld’
status Running
monitoring status Monitored
load average [0.03] [0.14] [0.20]
cpu 3.5%us 5.9%sy 0.0%wa
memory usage 26100 kB [10.4%]
swap usage 0 kB [0.0%]
data collected Thu, 30 Aug 2012 18:35:00

Configure Monit
Monit is very easy to use nearly out of the box. By default, it is set up to check that services are running every 2 minutes and stores its log file in “/var/log/monit.log”.

These settings can be altered at the beginning of the configuration file in the set daemon and set logfile lines respectively.
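
For example, the relevant monitrc lines might look like this (the values shown are just the defaults described above):

set daemon 120    # check services at 2-minute intervals
set logfile /var/log/monit.log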
Web Service
Monit comes with its own web server running on port 2812. To configure the web interface, find and uncomment the section that begins with set httpd port 2812. Once the section is uncommented, write in your server’s IP or domain name as the address, allow anyone to connect, and then create a Monit user and password:
set httpd port 2812
use address 12.34.56.789 # replace with your server's IP or domain name
allow 0.0.0.0/0.0.0.0 # allow any host to connect to the server
allow admin:monit # require user 'admin' with password 'monit'

Once this is configured, reload Monit so that it rereads the configuration file, and the web interface will be available:
monit reload

You will then be able to access the Monit web interface by going to “example.com:2812”. Log in with your chosen username and password; the interface lists each monitored service and its current status.
Configuring Programs for Self-Monitoring
Once the web services are set up, you can begin to input the programs that you want monitored and protected into the “/etc/monit/monitrc” configuration file. To simply ensure that programs stay online, you can use the /etc/init.d commands to stop or start a program. Here are some example configurations:

Apache:
check process apache with pidfile /run/apache2.pid
start program = “/etc/init.d/apache2 start” with timeout 60 seconds
stop program = “/etc/init.d/apache2 stop”

MySQL
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
start program = “/etc/init.d/mysql start”
stop program = “/etc/init.d/mysql stop”

Nginx
check process nginx with pidfile /var/run/nginx.pid
start program = “/etc/init.d/nginx start”
stop program = “/etc/init.d/nginx stop”
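
Monit can also watch file attributes such as checksum and permissions, as mentioned in the introduction. A hedged sketch (the path and the chosen responses are illustrative):

check file apache2_binary with path /usr/sbin/apache2
if failed checksum then alert
if failed permission 755 then alert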
Finish Up
Once you have configured all of the programs that you want to run, they will be automatically tracked and restarted should they turn off.

You can control the programs through either the web interface or the command line.

Once you have set up the configuration, check the syntax:
monit -t

After resolving any syntax errors, you can start all of the monitored programs:
monit start all


Install packages from EPEL (Extra Packages for Enterprise Linux)

What is EPEL
EPEL (Extra Packages for Enterprise Linux) is an open source, free, community-based repository project from the Fedora team which provides high-quality add-on software packages for Linux distributions including RHEL (Red Hat Enterprise Linux), CentOS, and Scientific Linux. The EPEL project is not a part of RHEL/CentOS, but it is designed for those major Linux distributions and provides lots of open source packages for networking, system administration, programming, monitoring, and so on. Most EPEL packages are maintained by the Fedora project.
Why we use EPEL repository?
It provides lots of open source packages to install via yum.
The EPEL repo is 100% open source and free to use.
It does not duplicate any core packages, so there are no compatibility issues.
All EPEL packages are maintained by the Fedora project.
How To Enable EPEL Repository in RHEL/CentOS 6/5?
First, you need to download the EPEL release file using wget and then install it using rpm on your system to enable the EPEL repository. Use the links below based on your Linux OS version. (Make sure you are the root user.)
RHEL/CentOS 6 32-64 Bit
## RHEL/CentOS 6 32-Bit ##
# wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# rpm -ivh epel-release-6-8.noarch.rpm

## RHEL/CentOS 6 64-Bit ##
# wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
# rpm -ivh epel-release-6-8.noarch.rpm
RHEL/CentOS 5 32-64 Bit
## RHEL/CentOS 5 32-Bit ##
# wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
# rpm -ivh epel-release-5-4.noarch.rpm

## RHEL/CentOS 5 64-Bit ##
# wget http://download.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
# rpm -ivh epel-release-5-4.noarch.rpm
RHEL/CentOS 4 32-64 Bit
## RHEL/CentOS 4 32-Bit ##
# wget http://download.fedoraproject.org/pub/epel/4/i386/epel-release-4-10.noarch.rpm
# rpm -ivh epel-release-4-10.noarch.rpm

## RHEL/CentOS 4 64-Bit ##
# wget http://download.fedoraproject.org/pub/epel/4/x86_64/epel-release-4-10.noarch.rpm
# rpm -ivh epel-release-4-10.noarch.rpm
How Do I Verify EPEL Repo?
Run the following command to verify that the EPEL repository is enabled. Once you run the command, you should see the EPEL repository in the list.
# yum repolist
Sample Output
Loaded plugins: downloadonly, fastestmirror, priorities
Loading mirror speeds from cached hostfile
* base: centos.aol.in
* epel: ftp.cuhk.edu.hk
* extras: centos.aol.in
* rpmforge: be.mirror.eurid.eu
* updates: centos.aol.in
Reducing CentOS-5 Testing to included packages only
Finished
1469 packages excluded due to repository priority protections
repo id repo name status
base CentOS-5 – Base 2,718+7
epel Extra Packages for Enterprise Linux 5 – i386 4,320+1,408
extras CentOS-5 – Extras 229+53
rpmforge Red Hat Enterprise 5 – RPMforge.net – dag 11,251
repolist: 19,075
How Do I Use EPEL Repo?
You use the yum command to search for and install packages. For example, let's search for the Zabbix package using the EPEL repo to see whether it is available:
# yum --enablerepo=epel info zabbix
Sample Output
Available Packages
Name : zabbix
Arch : i386
Version : 1.4.7
Release : 1.el5
Size : 1.7 M
Repo : epel
Summary : Open-source monitoring solution for your IT infrastructure
URL : http://www.zabbix.com/
License : GPL
Description: ZABBIX is software that monitors numerous parameters of a network.
Let’s install the Zabbix package using the --enablerepo=epel switch:
# yum --enablerepo=epel install zabbix
Note: The epel configuration file is located under /etc/yum.repos.d/epel.repo.
This way you can install many high-quality open source packages from the EPEL repo.


Splunk LDAP Configuration

LDAP Configuration:
First, perform an ldapsearch to make sure LDAP search works, so that Splunk can do the same:
ldapsearch -h xxx.xx.xx.61 -p 389 -D "CN=SVCACC_CDH,OU=ServiceAccounts,DC=prod,DC=wudip,DC=com" -b "DC=prod,DC=dom,DC=com" -W | grep 'do_group'

Go to Settings -> Access Controls -> Authentication method -> Configure Splunk to use LDAP and map groups.
Click New, then enter the configuration parameters:
#####
Strategy Name: <any name>
Host: xxx.xx.xx.61
Port: 389
Connection Order: 1

Bind DN: CN=SVCACC_CDH,OU=ServiceAccounts,DC=prod,DC=dom,DC=com

Bind Password: *****
Confirm Password: *****

User Settings:
User base DN: DC=prod,DC=dom,DC=com

User base filter:

User name attribute: samaccountname

Real name attribute: displayname

Group mapping attribute: dn

Group Settings:
Group base DN: DC=prod,DC=dom,DC=com

Static group search filter:

Group name attribute: cn

Static member attribute: member

<Leave the rest at the default values.>
###
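
Behind the scenes, Splunk persists these UI settings in authentication.conf (typically under $SPLUNK_HOME/etc/system/local/). A rough sketch of the resulting stanzas using the values above; the strategy name is illustrative:

[authentication]
authType = LDAP
authSettings = my_ldap

[my_ldap]
host = xxx.xx.xx.61
port = 389
bindDN = CN=SVCACC_CDH,OU=ServiceAccounts,DC=prod,DC=dom,DC=com
bindDNpassword = <password>
userBaseDN = DC=prod,DC=dom,DC=com
userNameAttribute = samaccountname
realNameAttribute = displayname
groupMappingAttribute = dn
groupBaseDN = DC=prod,DC=dom,DC=com
groupNameAttribute = cn
staticMemberAttribute = member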
**** Make note ****
In Advanced settings, disable "Enable referrals with anonymous bind only".

Click the Save button, and you should be able to successfully start using LDAP.

If you hit an issue where it says the credentials are invalid, or some other error, check for detailed error information here:
http://www.linuxtopia.org/online_books/network_administration_guides/ldap_administration/appendix-common-errors_Common_causes_of_LDAP_errors.html

In case of any errors, enable detailed logging:

/opt/splunk/bin/splunk stop
mv /opt/splunk/var/log/splunk/splunkd.log /opt/splunk/var/log/splunk/splunkd.log.old
/opt/splunk/bin/splunk start --debug

Then analyze the logs for details.


Writing PostgreSQL Functions in C


Introduction

PostgreSQL is a powerhouse in its own right, but it is also extensible. You have a lot of options when it comes to extending PostgreSQL. You can use SQL, PL/pgSQL, PL/Tcl, PL/Perl, PL/Python, and more. Each option has its advantages. C has the advantage of speed, of course. Because PostgreSQL itself is written in C, writing C code also allows you to leverage some of PostgreSQL’s own internal code. Your C code can also readily leverage any number of other programming libraries. C functions are also a stepping stone toward defining your own new PostgreSQL base types. Ready to get started? Great! Because instead of trying to butter you up with a witty preamble, I’m going to take you straight to work.


Prerequisites

You’ll need a fairly recent version of PostgreSQL to follow along. I’m not going to discuss PostgreSQL’s older Version 0 calling convention. I’m also using the PGXS build system, which was introduced in PostgreSQL version 8.0. The packaged version of PostgreSQL included with Debian Sarge or Redhat AS4 is older than that. I have compiled and used the code below on Debian Etch’s packaged version, Fedora Core 6’s packaged version, and on my own compiled from source installation of PostgreSQL 8.2.4.

If you are using your distribution’s packaged version of the PostgreSQL server, rather than rolling your own, you need to make sure that you install the corresponding development package, for example:

Debian Etch:
  apt-get install postgresql-server-dev-8.1 postgresql-8.1
Fedora Core 6:
  yum install postgresql-server postgresql-devel

I’m going to assume you know the basics of how to configure and run PostgreSQL. If not, there are plenty of resources to help you get going. I also included an example build script at the end of this article.

Obviously you’ll also need a C compiler. Make also comes in handy. Unless, like Alan Cox, you are able to directly manipulate the bits inside of your computer with your mind, you will also appreciate the use of a keyboard and monitor. I think we’re ready now.


Hello, World!

This month I’ll present a simple example and follow up with some explanations. We’re going to write a function that takes a single text argument and appends it to the string ‘Hello, ‘.

prompt> select hello( 'World!' );
     hello
---------------
 Hello, World!

Using your favorite editor (**cough**emacs**cough**), create a file called example.c as follows:

#include "postgres.h"
#include "fmgr.h"
#include <string.h>

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

Datum hello( PG_FUNCTION_ARGS );

PG_FUNCTION_INFO_V1( hello );
Datum
hello( PG_FUNCTION_ARGS )
{
   // variable declarations
   char greet[] = "Hello, ";
   text *towhom;
   int greetlen;
   int towhomlen;
   text *greeting;

   // Get arguments.  If we declare our function as STRICT, then
   // this check is superfluous.
   if( PG_ARGISNULL(0) ) {
      PG_RETURN_NULL();
   }
   towhom = PG_GETARG_TEXT_P(0);

   // Calculate string sizes.
   greetlen = strlen(greet);
   towhomlen = VARSIZE(towhom) - VARHDRSZ;

   // Allocate memory (including the varlena header) and set
   // the data structure size.
   greeting = (text *)palloc( greetlen + towhomlen + VARHDRSZ );
   VARATT_SIZEP( greeting ) = greetlen + towhomlen + VARHDRSZ;

   // Construct greeting string.
   strncpy( VARDATA(greeting), greet, greetlen );
   strncpy( VARDATA(greeting) + greetlen,
            VARDATA(towhom),
            towhomlen );

   PG_RETURN_TEXT_P( greeting );
}

Now let’s build this program, install it, and use it. Section 33.9.6 of the official PostgreSQL documentation explains the details. Unless you’re doing something complicated, the PostgreSQL Extension Building Infrastructure, aka PGXS, will probably suffice. Create the following Makefile in the same directory as your code:

MODULES = example
PGXS := $(shell pg_config --pgxs)
include $(PGXS)

Run ‘make’ and then ‘make install’ as a user with permission to write the example.so file you just made to the PostgreSQL server’s lib directory.

With the module in place, all that’s left to do is to bind our ‘hello’ function inside of our example.so module to a SQL function that we can use inside our database. The following SQL will do the trick. Put the following text in a file called ‘example.sql’.

CREATE OR REPLACE FUNCTION
  hello( TEXT )
RETURNS
  TEXT
AS
  'example.so', 'hello'
LANGUAGE
  C
STRICT
IMMUTABLE;

Open a database with the PostgreSQL interactive command line editor ‘psql’ as a user with permission to create new functions (i.e. superuser). Create your new ‘hello’ function by loading the example.sql code above. For example:

prompt> \i example.sql
CREATE FUNCTION

=> CREATE TEMP TABLE test( name ) AS
-> VALUES ('Xavier'), ('Yari'), ('Zack');

=> SELECT hello( name ) FROM test;
    hello     
--------------
 Hello, Xavier
 Hello, Yari
 Hello, Zack
(3 rows)

What We Did

What could possibly be cooler than that?! Now let’s review what we did.

I’ll begin by noting that there are actually two PostgreSQL C language calling conventions: an older ‘Version 0’ convention and the newer ‘Version 1’ convention. I’m focusing exclusively on the Version 1 convention.

We begin by including postgres.h and fmgr.h. There are other PostgreSQL header files that you will need if you begin developing more complicated code, but you can pretty much expect that you’ll always need at least these two. The include statements are followed by a “magic block”. This block is required as of PostgreSQL version 8.2. This statement ensures that the database doesn’t load improperly versioned object files.

The magic block is followed by our C function declaration. We can have as many functions in a single object file as we like. In this example, we only declare and define one function, called ‘hello’. I like to put forward declarations of all my C functions at the beginning of my code to avoid annoying compiler warnings. A quiet compiler is a happy compiler.

The PG_FUNCTION_INFO_V1 macro is required for all dynamically loaded Version 1 functions. I actually have no idea what it does, but I know what happens when you leave it out: nothing good.

And now at the beginning of our actual function definition, we can see in the variable declarations that we’re using a PostgreSQL defined type: ‘text’. Of course there are other PostgreSQL types worth knowing about as well. Table 33-1 in the official PostgreSQL documentation enumerates the C types corresponding to built-in SQL types. You should use PG_GETARG_xxx() and PG_RETURN_xxx() functions to retrieve and return these values from your function. A few examples:

int32    PG_GETARG_INT32(0)     PG_RETURN_INT32(arg)
bytea *  PG_GETARG_BYTEA_P(0)   PG_RETURN_BYTEA_P(arg)
text *   PG_GETARG_TEXT_P(0)    PG_RETURN_TEXT_P(arg)

The argument to the GETARG functions indicates the position of the argument in the function call. The argument to the RETURN functions is a variable containing data to be returned. The _P suffix means ‘pointer’. Let’s get our input. Again, the GETARG arguments indicate position. So PG_ARGISNULL(0) refers to our first (and only) argument. If that argument is NULL, we return NULL, and we’re done. Later, when we look at the SQL we use to import our function, we’ll see that declaring our SQL function STRICT makes this NULL check superfluous, but I like to include it anyway. If our argument is not NULL, then we drop through and use the PG_GETARG_TEXT_P macro to assign the first argument (they are zero indexed) to our ‘towhom’ variable.

Variable length types such as bytea or text are always passed by reference. We then use macros to set and retrieve the data within our variable length data types. Let’s take a quick look at the ‘varlena’ data structure used internally by PostgreSQL to represent variable length data structures. This will help us understand how some of the other macros we use operate. (As an aside, the name of this structure is where A. Elein Mustain got the name for her online PostgreSQL newsletter.)

struct varlena
{
  int32 vl_len;
  char  vl_dat[1];
};

The vl_len member of this structure indicates how much data the structure is holding. The vl_dat member holds the data. Pretty simple, huh? However, there’s no reason to go poking at this structure directly. In fact, we should not assume that this structure even exists. Instead, we get and set variable length data with macros. The VARSIZE macro tells us the total size of our ‘towhom’ text. The VARHDRSZ macro is a constant, and represents the overhead consumed by the vl_len member of our varlena structure. So VARSIZE(towhom) - VARHDRSZ gives us the number of octets in our text string. We use the VARATT_SIZEP macro to set the vl_len member of our ‘greeting’ text equal to the size of the text string it’s going to store. Note that PostgreSQL text strings are not null terminated. The varlena data structure obviates the need for an end of string delimiter. The VARDATA macro returns a pointer to the data member of our ‘greeting’ string.

Again: while it’s educational to understand what’s happening under the hood, don’t go poking around at PostgreSQL internals when you don’t have to. Use the provided macros instead.

One last note about our C code. Use ‘palloc’, not ‘malloc’. PostgreSQL’s memory allocation function automatically takes care of certain cleanup operations that would otherwise be impossible. How would you free the memory pointed to by a pointer that’s returned from our function, for example? PostgreSQL takes care of this for us. That’s not to say we shouldn’t free memory when we can. Use the palloc function’s corresponding ‘pfree’ function to free memory when it’s no longer needed.
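
To make the palloc/VARDATA pattern concrete, here is a small hypothetical helper (not part of the article’s module, but using the same 8.2-era macros and the includes already in example.c) that copies a C string into a freshly allocated text value:

/* Hypothetical helper: build a PostgreSQL text value from a C string. */
static text *
make_text( const char *s )
{
   int   len = strlen( s );
   text *t   = (text *) palloc( len + VARHDRSZ );  /* header + payload */

   VARATT_SIZEP( t ) = len + VARHDRSZ;  /* total size, header included */
   memcpy( VARDATA( t ), s, len );      /* copy payload; no NUL needed */
   return t;                            /* caller may pfree(t) when done */
}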


Our SQL Function Definition

Our CREATE FUNCTION call maps our C function to a corresponding SQL function, which we will also call ‘hello’. Our SQL function is defined to take a single TEXT argument, and to return a TEXT result. We can include more than one function inside a single object file, but for today, our example.so file only includes our single ‘hello’ function.

Our function is declared to be STRICT, meaning that it will return NULL on NULL input. This saves processing overhead (and makes our functions’s NULL checking superfluous, as already mentioned). Our function is also declared IMMUTABLE, which means that our function does not modify the database, and that for any given input, the function will always return the same output.

See the PostgreSQL documentation for CREATE FUNCTION for more details about this statement.


Wrapping Up

In my next article, I’ll expand on this introduction to PostgreSQL C programming. I’ll show how to use PGXS when you want to link against external libraries, and I’ll explain how to process tuple arguments. I’ll also show how to use the ‘ereport’ function to provide diagnostic output while debugging C code that’s running inside of a database.

Until then, you might want to take a look in the PostgreSQL distribution’s ‘contrib’ directory for more examples. Even if you don’t understand all the code there, you’ll almost certainly find something you can use. There’s a lot of really excellent work in there.


Hasty PostgreSQL Installation HOWTO

Here’s the script I use to build PostgreSQL. You’ll need to make sure you have the development versions of required libraries installed, e.g. on Debian Sarge you’ll need libpam-dev, libperl-dev, etc.

NAME=postgresql
VER=8.2.4
APP=${NAME}-${VER}
SRCDIR=/usr/get/src/${NAME}/${APP}
INSTALLDIR=/local/apps/versioned/$APP
LOGFILE=/local/data/build/${APP}.log

[ -d $INSTALLDIR ] && rm -rf $INSTALLDIR
cd ${SRCDIR}
[ -f config.cache ] && rm -f config.cache
# make clean
./configure \
    --prefix=$INSTALLDIR \
    --with-pgport=5432 \
    --with-pam \
    --with-perl \
    --with-python \
    --with-openssl \
    --with-readline \
    --enable-thread-safety \
    > $LOGFILE 2>&1

RETVAL=$?
[ $RETVAL -ne 0 ] && exit $RETVAL

make >> $LOGFILE 2>&1

RETVAL=$?
[ $RETVAL -ne 0 ] && exit $RETVAL

make install >> $LOGFILE 2>&1

RETVAL=$?
exit $RETVAL

I like to symlink my versioned installation to /local/apps/postgresql (so I can quickly change between minor version numbers by changing a single symlink); then I symlink the binaries in /local/apps/postgresql to a bin directory on my search path, somewhere where ldconfig will find them (see /etc/ld.so.conf), and so on. Create a PostgreSQL superuser and group (Typically ‘postgres’ & ‘postgres’). Create a directory for log files and for the database which are owned by this user, and give that user write permission. Here’s a simple init script. Use the ‘init’ function to create an empty database.

#!/bin/sh

VER=8.2
PGACCOUNT=postgres
POSTBIN=/local/bin
PGCTL=${POSTBIN}/pg_ctl
INITDB=${POSTBIN}/initdb
DATADIR=/db/postgres/${VER}
LOGFILE=/var/log/postgres/pg_ctl.log
ENCODING="SQL_ASCII"

# Set postgresql options in postgresql.conf

[ -f ${PGCTL} ] || exit 0

init () {
    echo -n "Initializing PostgreSQL database"
    echo
    su - postgres -c "${INITDB} -D ${DATADIR} -E $ENCODING"
    echo
}

start () {
    echo -n "Starting PostgreSQL..."
    echo
    su - postgres -c "${PGCTL} start -D ${DATADIR} -l ${LOGFILE}"
    echo
}

stop () {
    echo -n "Stopping PostgreSQL..."
    echo
    su - postgres -c "${PGCTL} stop -D ${DATADIR} -m smart"
    echo
}

restart () {
    echo -n "Restarting PostgreSQL..."
    echo
    stop
    sleep 5
    start
}

reload () {
    echo -n "Reloading PostgreSQL configuration..."
    echo
    su - postgres -c "${PGCTL} reload -D ${DATADIR}"
    echo
}

status () {
    echo -n "Checking PostgreSQL status..."
    echo
    su - postgres -c "${PGCTL} status -D ${DATADIR}"
}

case "$1" in
  init)
    init
    ;;
  start)
    start
    ;;
  stop)
    stop
    ;;
  restart)
    restart
    ;;
  reload)
    reload
    ;;
  *)
    echo "Usage: $0 {init|start|stop|restart|reload|status}"
    exit 1
esac

exit 0

I also like to give myself and root, ident authorization to run as the ‘postgres’ superuser (i.e. I like to be able to run ‘psql -U postgres’ when I’m doing development work so I can get into the database quickly). Append something like the following to ${your_db_directory}/pg_ident.conf:

amap          myusername          iddb
amap          myusername          myusername
amap          myusername          postgres
amap          postgres            postgres
amap          root                postgres

This presumes you have enabled ident authentication in pg_hba.conf, like so:

local all all ident amap

That’s a pretty hasty PostgreSQL installation HOWTO. It’s a little off-topic, but I thought it might help jump start a potential proselyte or two.


Emacs Errata

I’m a big fan of SQL Mode for Emacs (see Resources below). I run split windows, and use keyboard shortcuts to run bits of the SQL I’m editing in my *SQL* buffer. However, I’ve always found that sql.el needs a little tweak. The problem is that when you run ‘M-x sql-postgres’, it won’t prompt you for a user name. Unless you always want to connect to the database as your shell account username, or unless you exclusively use ident authentication, you’ll want to fix this. Find sql.el on your system, and edit the PostgreSQL section as follows:

< (sql-get-login 'database 'server)
> (sql-get-login 'database 'user 'server)

> (if (not (string= "" sql-user))
>   (setq params (append (list "-U" sql-user) params)))

Run ‘byte-compile-file’ to create sql.elc. If you enter nothing at the ‘User:’ prompt, you will connect as the current user. If you want to use PostgreSQL’s ident authentication (assuming you have this configured in PostgreSQL’s pg_hba.conf and pg_ident.conf), leave the ‘Server:’ prompt blank. Enter a server name and a password if you are doing host based authentication.


Linux find command with examples

First, create the following sample empty files under your home directory to try some of the find command examples mentioned below.

# vim create_sample_files.sh
touch MybashProgram.sh
touch mycprogram.c
touch MyCProgram.c
touch Program.c

mkdir backup
cd backup

touch MybashProgram.sh
touch mycprogram.c
touch MyCProgram.c
touch Program.c

# chmod +x create_sample_files.sh

# ./create_sample_files.sh

# ls -R
.:
backup                  MybashProgram.sh  MyCProgram.c
create_sample_files.sh  mycprogram.c      Program.c

./backup:
MybashProgram.sh  mycprogram.c  MyCProgram.c  Program.c

1. Find Files Using Name

This is a basic usage of the find command. This example finds all files named MyCProgram.c in the current directory and all its sub-directories.

# find -name "MyCProgram.c"
./backup/MyCProgram.c
./MyCProgram.c

2. Find Files Using Name and Ignoring Case

This is a basic usage of the find command. This example finds all files named MyCProgram.c (ignoring the case) in the current directory and all its sub-directories.

# find -iname "MyCProgram.c"
./mycprogram.c
./backup/mycprogram.c
./backup/MyCProgram.c
./MyCProgram.c

3. Limit Search To Specific Directory Level Using mindepth and maxdepth

Find the passwd file under all sub-directories starting from the root directory.

# find / -name passwd
/usr/share/doc/nss_ldap-253/pam.d/passwd
/usr/bin/passwd
/etc/pam.d/passwd
/etc/passwd


Find the passwd file under root and one level down (i.e. root is level 1, and one sub-directory is level 2):

# find / -maxdepth 2 -name passwd
/etc/passwd


Find the passwd file under root and two levels down (i.e. root is level 1, and two sub-directories are levels 2 and 3):

# find / -maxdepth 3 -name passwd
/usr/bin/passwd
/etc/pam.d/passwd
/etc/passwd


Find the passwd file between directory depths 3 and 5:

# find / -mindepth 3 -maxdepth 5 -name passwd
/usr/bin/passwd
/etc/pam.d/passwd

4. Executing Commands on the Files Found by the Find Command.

In the example below, the find command calculates the md5sum of all the files with the name MyCProgram.c (ignoring case). {} is replaced by the current file name.

# find -iname "MyCProgram.c" -exec md5sum {} \;
d41d8cd98f00b204e9800998ecf8427e  ./mycprogram.c
d41d8cd98f00b204e9800998ecf8427e  ./backup/mycprogram.c
d41d8cd98f00b204e9800998ecf8427e  ./backup/MyCProgram.c
d41d8cd98f00b204e9800998ecf8427e  ./MyCProgram.c

5. Inverting the match.

Shows the files or directories whose name is not MyCProgram.c. Since maxdepth is 1, this will look only in the current directory.

# find -maxdepth 1 -not -iname "MyCProgram.c"
.
./MybashProgram.sh
./create_sample_files.sh
./backup
./Program.c

6. Finding Files by Inode Number.

Every file has a unique inode number which we can use to identify it. Create two files with similar names, i.e. one file with a space at the end:

# touch "test-file-name"

# touch "test-file-name "
[Note: There is a space at the end]

# ls -1 test*
test-file-name
test-file-name


From the ls output, you cannot identify which file has the space at the end. Using the -i option, you can view the inode number of each file, which will be different for these two files.

# ls -i1 test*
16187429 test-file-name
16187430 test-file-name


You can specify an inode number in a find command as shown below. In this example, the find command renames a file using its inode number.

# find -inum 16187430 -exec mv {} new-test-file-name \;

# ls -i1 *test*
16187430 new-test-file-name
16187429 test-file-name


You can use this technique when you want to operate on files with badly chosen names, as shown in the example below. For example, the file named file?.txt has a special character in it. If you were to execute “rm file?.txt”, all three of the following files would get removed. So, follow the steps below to delete only the “file?.txt” file.

# ls
file1.txt  file2.txt  file?.txt


Find the inode numbers of each file.

# ls -i1
804178 file1.txt
804179 file2.txt
804180 file?.txt


Use the inode number to remove the file that has the special character in its name, as shown below:

# find -inum 804180 -exec rm {} \;

# ls
file1.txt  file2.txt
[Note: The file with name "file?.txt" is now removed]

7. Find Files Based on File Permissions

The following operations are possible:

  • Find files that match exact permission
  • Check whether the given permission matches, irrespective of other permission bits
  • Search by giving octal / symbolic representation


For this example, let us assume that the directory contains the following files. Please note that the file-permissions on these files are different.

# ls -l
total 0
-rwxrwxrwx 1 root root 0 2009-02-19 20:31 all_for_all
-rw-r--r-- 1 root root 0 2009-02-19 20:30 everybody_read
---------- 1 root root 0 2009-02-19 20:31 no_for_all
-rw------- 1 root root 0 2009-02-19 20:29 ordinary_file
-rw-r----- 1 root root 0 2009-02-19 20:27 others_can_also_read
----r----- 1 root root 0 2009-02-19 20:27 others_can_only_read


Find files which have read permission for the group. Use the following command to find all files that are readable by the group in the current directory, irrespective of other permissions for those files:

# find . -perm -g=r -type f -exec ls -l {} \;
-rw-r--r-- 1 root root 0 2009-02-19 20:30 ./everybody_read
-rwxrwxrwx 1 root root 0 2009-02-19 20:31 ./all_for_all
----r----- 1 root root 0 2009-02-19 20:27 ./others_can_only_read
-rw-r----- 1 root root 0 2009-02-19 20:27 ./others_can_also_read


Find files which have read permission only for the group:

# find . -perm g=r -type f -exec ls -l {} \;
----r----- 1 root root 0 2009-02-19 20:27 ./others_can_only_read


Find files which have read permission only for the group [search by octal]:

# find . -perm 040 -type f -exec ls -l {} \;
----r----- 1 root root 0 2009-02-19 20:27 ./others_can_only_read

8. Find All Empty Files (Zero-byte Files) in Your Home Directory and Its Subdirectories

Most files in the following command's output will be lock files and placeholders created by other applications.

# find ~ -empty


List all the empty files in the current directory only:

# find . -maxdepth 1 -empty


List only the non-hidden empty files in the current directory:

# find . -maxdepth 1 -empty -not -name ".*"

9. Finding the Top 5 Big Files

The following command displays the top 5 largest files in the current directory and its subdirectories. This may take a while to execute, depending on the total number of files the command has to process.

# find . -type f -exec ls -s {} \; | sort -n -r | head -5

10. Finding the Top 5 Small Files

The technique is the same as for finding the biggest files; the only difference is that the sort order is ascending.

# find . -type f -exec ls -s {} \; | sort -n  | head -5


In the above command, you will most probably see only ZERO-byte (empty) files. So, you can use the following command to list the smallest files other than the zero-byte files:

# find . -not -empty -type f -exec ls -s {} \; | sort -n  | head -5

11. Find Files Based on File Type Using Option -type

Find only the socket files.

# find . -type s


Find all directories

# find . -type d


Find only the normal files

# find . -type f


Find all the hidden files

# find . -type f -name ".*"


Find all the hidden directories

# find -type d -name ".*"

12. Find Files by Comparing with the Modification Time of Another File

Show files which were modified after the specified file. The following find command displays all files created or modified after ordinary_file.

# ls -lrt
total 0
-rw-r----- 1 root root 0 2009-02-19 20:27 others_can_also_read
----r----- 1 root root 0 2009-02-19 20:27 others_can_only_read
-rw------- 1 root root 0 2009-02-19 20:29 ordinary_file
-rw-r--r-- 1 root root 0 2009-02-19 20:30 everybody_read
-rwxrwxrwx 1 root root 0 2009-02-19 20:31 all_for_all
---------- 1 root root 0 2009-02-19 20:31 no_for_all

# find -newer ordinary_file
.
./everybody_read
./all_for_all
./no_for_all

13. Find Files by Size

Using the -size option you can find files by size.

Find files bigger than the given size

# find ~ -size +100M


Find files smaller than the given size

# find ~ -size -100M


Find files that match the exact given size

# find ~ -size 100M


Note: - means less than the given size, + means more than the given size, and no symbol means exactly the given size.

14. Create Aliases for Frequent Find Operations

If you find something pretty useful, you can make it an alias and execute it whenever you want.


To frequently remove files named a.out:

# alias rmao="find . -iname a.out -exec rm {} \;"
# rmao


To remove the core files generated by C programs:

# alias rmc="find . -iname core -exec rm {} \;"
# rmc

15. Remove Big Archive Files Using the find Command

The following command removes *.zip files that are over 100M:

# find / -type f -name "*.zip" -size +100M -exec rm -i {} \;

Remove all *.tar files that are over 100M using the alias rm100m ("remove 100M"). Use similar concepts to create aliases like rm1g, rm2g, and rm5g to remove files greater than 1G, 2G, and 5G respectively:

# alias rm100m="find / -type f -name '*.tar' -size +100M -exec rm -i {} \;"
# alias rm1g="find / -type f -name '*.tar' -size +1G -exec rm -i {} \;"
# alias rm2g="find / -type f -name '*.tar' -size +2G -exec rm -i {} \;"
# alias rm5g="find / -type f -name '*.tar' -size +5G -exec rm -i {} \;"

# rm100m
# rm1g
# rm2g
# rm5g

Find Files Based on Access / Modification / Change Time

You can find files based on the following three file time attributes:

  1. Access time of the file. Access time gets updated when the file is accessed.
  2. Modification time of the file. Modification time gets updated when the file content is modified.
  3. Change time of the file. Change time gets updated when the inode data changes.

 
In the following examples, the difference between the min option and the time option is how the argument is interpreted:

  • min arguments are treated as minutes. For example, min 60 = 60 minutes (1 hour).
  • time arguments are treated as multiples of 24 hours. For example, time 2 = 2*24 hours (2 days).
  • While doing the 24-hour calculation, the fractional parts are ignored, so 25 hours is taken as 24 hours, 47 hours is also taken as 24 hours, and only 48 hours is taken as 48 hours. For more clarity, refer to the -atime section of the find command man page.

Example 1: Find files whose content was updated within the last hour

To find files based on content modification time, the options -mmin and -mtime are used. Following are the definitions of mmin and mtime from the man page:

  • -mmin n: File's data was last modified n minutes ago.
  • -mtime n: File's data was last modified n*24 hours ago.

 
The following example finds files in the current directory and sub-directories whose content was updated within the last hour (60 minutes):

# find . -mmin -60

 
In the same way, the following example finds all files (under the root file system /) whose content was updated within the last 24 hours (1 day):

# find / -mtime -1

Example 2: Find files which were accessed within the last hour

To find files based on access time, the options -amin and -atime are used. Following are the definitions of amin and atime from the find man page:

  • -amin n: File was last accessed n minutes ago.
  • -atime n: File was last accessed n*24 hours ago.

 
The following example finds files in the current directory and sub-directories which were accessed within the last hour (60 minutes):

# find -amin -60

 
In the same way, the following example finds all files (under the root file system /) that were accessed within the last 24 hours (1 day):

# find / -atime -1

Example 3: Find files whose status changed within the last hour

To find files based on inode change time, the options -cmin and -ctime are used. Following are the definitions of cmin and ctime from the find man page:

  • -cmin n: File's status was last changed n minutes ago.
  • -ctime n: File's status was last changed n*24 hours ago.

 
The following example finds files in the current directory and sub-directories whose status changed within the last hour (60 minutes):

# find . -cmin -60

 
In the same way, the following example finds all files (under the root file system /) whose status changed within the last 24 hours (1 day):

# find / -ctime -1

Example 4: Restricting the find output to files only (display only files in the find command results)

The above find commands will also show directories, because a directory gets accessed when a file inside it gets accessed. If you want only files to be displayed, add -type f to the find command.

The following find command displays files that were accessed within the last 30 minutes.

# find /etc/sysconfig -amin -30
.
./console
./network-scripts
./i18n
./rhn
./rhn/clientCaps.d
./networking
./networking/profiles
./networking/profiles/default
./networking/profiles/default/resolv.conf
./networking/profiles/default/hosts
./networking/devices
./apm-scripts
[Note: The above output contains both files and directories]

# find . -amin -30 -type f
./i18n
./networking/profiles/default/resolv.conf
./networking/profiles/default/hosts
[Note: The above output contains only files]
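
Conversely, give -type d to display only the directories:

# find . -amin -30 -type d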

Example 5: Restricting the search only to unhidden files. (Do not display hidden files in find output)

When we don't want hidden files to be listed in the find output, we can use the following regex.
The find command below displays the files which were modified in the last 15 minutes, and it lists only the unhidden files; i.e., hidden files that start with a . (period) are not displayed in the find output.

# find . -mmin -15 \( ! -regex ".*/\..*" \)
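
With GNU find, an arguably more readable equivalent uses -path instead of -regex:

# find . -mmin -15 ! -path "*/.*"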

Finding Files Comparatively Using Find Command

The human mind remembers things better by reference; for example, "I want to find the files which I edited after editing the file test." You can find files by referring to another file's modification time, as shown in the following examples.

Example 6: Find files which are modified after modification of a particular FILE

Syntax: find -newer FILE

 
The following example displays all the files which were modified after /etc/passwd was modified. This is helpful if you want to track all the activities you've done after adding a new user.

# find -newer /etc/passwd

Example 7: Find files which are accessed after modification of a specific FILE

Syntax: find -anewer FILE

 
The following example displays all the files which were accessed after /etc/hosts was modified. If you remember adding an entry to /etc/hosts and would like to see all the files that you've accessed since then, use the following command.

# find -anewer /etc/hosts

Example 8: Find files whose status got changed after the modification of a specific FILE.

Syntax: find -cnewer FILE

 
The following example displays all the files whose status changed after /etc/fstab was modified. If you remember adding a mount point in /etc/fstab and would like to know all the files whose status changed since then, use the following command.

# find -cnewer /etc/fstab
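
These tests can also be combined to find files modified within a window bounded by two reference files. A minimal sketch, assuming hypothetical marker files /tmp/start and /tmp/end:

# find . -newer /tmp/start ! -newer /tmp/end
[Note: lists files modified after /tmp/start but not after /tmp/end]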

Perform Any Operation on Files Found From Find Command

We have looked at many different ways of finding files using the find command, both in this article and in our previous article. If you are not familiar with finding files in different ways, I strongly recommend that you read part 1.

This section explains how to perform different operations on the files found by the find command, i.e. how to manipulate the files returned in the find command output.

We can specify any operation on the files found by the find command.

find <CONDITION to Find files> -exec <OPERATION> \;

 
The OPERATION can be anything, such as:

  • rm command to remove the files found by the find command.
  • mv command to rename the files found.
  • ls -l command to get the details of the files found.
  • md5sum command to compute checksums of the files found (see the sketch just after this list).
  • wc command to count the total number of words in the files found.
  • Execute any Unix shell command on the files found.
  • Or execute your own custom shell script / command on the files found.
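
As a quick illustration of the -exec syntax with the md5sum and wc operations mentioned above:

# find . -name "*.txt" -exec md5sum {} \;
# find . -name "*.txt" -exec wc -w {} \;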

Example 9: ls -l in find command output. Long-list the files which were edited within the last 1 hour.

# find -mmin -60
./cron
./secure

# find -mmin -60 -exec ls -l {} \;
-rw-------  1 root root 1028 Jun 21 15:01 ./cron
-rw-------  1 root root 831752 Jun 21 15:42 ./secure
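
With GNU and POSIX find, terminating -exec with + instead of \; passes many file names to a single invocation of the command, which is faster when there are a lot of matches:

# find -mmin -60 -exec ls -l {} +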

Example 10: Searching Only in the Current Filesystem

System administrators often want to search the root file system but not the other mounted partitions. When you have multiple partitions mounted under / (root), you can do the following.

The following command will search for *.log files starting from / (root); i.e., if you have multiple partitions mounted under / (root), it will search all of those mounted partitions.

# find / -name "*.log"

 
To search only in the current file system, use the -xdev option. Following is the -xdev definition from the find man page:

  • -xdev Don’t descend directories on other filesystems.

 
The following command will search for *.log files starting from / (root), but only in the current file system; i.e., if you have multiple partitions mounted under / (root), it will NOT search those other mounted partitions.

# find / -xdev -name "*.log"
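
GNU find also accepts -mount as an alternate name for -xdev, for compatibility with some other versions of find:

# find / -mount -name "*.log"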

Example 11: Using more than one { } in same command

The manual suggests that only one instance of {} is possible, but you can use more than one {} in the same command, as shown below.

# find -name "*.txt" -exec cp {} {}.bkup \;

 
Using {} multiple times in the same -exec command works, but using it inside another embedded command does not. Say you want to rename files as shown below; this will not give the expected result, because the backquoted `basename {} .htm` is expanded by the shell once, before find even runs, so every file gets the same target name.

find -name "*.htm" -exec mv {} `basename {} .htm`.html \;

Example 12: Using { } in more than one instance.

You can simulate it by writing a small shell script, mv.sh, as shown below (it renames a .htm file to .html):

#!/bin/sh
mv "$1" "`basename "$1" .htm`.html"

 
The double quotes are there to handle spaces in file names. Then call that shell script from the find command as shown below.

find -name "*.htm" -exec ./mv.sh '{}' \;

So, if for any reason you want the same file name to be used more than once, writing a simple shell script and passing the file name as an argument is the simplest way to do it.
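
If you prefer not to maintain a separate script file, a similar effect can be achieved inline by spawning a shell from -exec. A minimal sketch:

find -name "*.htm" -exec sh -c 'mv "$1" "${1%.htm}.html"' _ {} \;
[Note: _ becomes $0 of the spawned shell and {} becomes $1]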

Example 13: Redirecting errors to /dev/null

Redirecting errors is generally not a good practice; an experienced user understands the importance of seeing the error printed on the terminal and fixing it.

This applies to the find command too. But if you don't want to see the errors and would like to redirect them to the null device, do it as shown below.

find -name "*.txt" 2>/dev/null

 
Sometimes this may be helpful. For example, if you are trying to find all the *.conf files under / (root) from your account, you may get a lot of "Permission denied" error messages, as shown below.

$ find / -name "*.conf"
/sbin/generate-modprobe.conf
find: /tmp/orbit-root: Permission denied
find: /tmp/ssh-gccBMp5019: Permission denied
find: /tmp/keyring-5iqiGo: Permission denied
find: /var/log/httpd: Permission denied
find: /var/log/ppp: Permission denied
/boot/grub/grub.conf
find: /var/log/audit: Permission denied
find: /var/log/squid: Permission denied
find: /var/log/samba: Permission denied
find: /var/cache/alchemist/printconf.rpm/wm: Permission denied
[Note: There are two valid *.conf files buried in the "Permission denied" messages]

 
So, if you want to view only the real output of the find command and not the "Permission denied" error messages, you can redirect the errors to /dev/null as shown below.

$ find / -name "*.conf" 2>/dev/null
/sbin/generate-modprobe.conf
/boot/grub/grub.conf
[Note: None of the "Permission denied" messages are displayed]

Example 14: Substitute space with underscore in the file name.

Audio files downloaded from the internet often have spaces in their names, and spaces in file names are inconvenient on Linux-like systems. You can use the find and rename command combination shown below to rename such files, substituting each space with an underscore.

The following replaces the space in all the *.mp3 file names with _ (this uses the Perl rename utility; the rename syntax varies between distributions):

$ find . -type f -iname "*.mp3" -exec rename "s/ /_/g" {} \;

Example 15: Executing two find commands at the same time

As shown in the examples section of the find command manual page, the following syntax can be used to execute two commands in a single traversal.

The following find command example traverses the filesystem just once, listing setuid files and directories into /root/suid.txt and large files into /root/big.txt.

# find /    \( -perm -4000 -fprintf /root/suid.txt '%#m %u %p\n' \) , \
 \( -size +100M -fprintf /root/big.txt '%-10s %p\n' \)
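
The comma operator works with other actions too. For example, the following single traversal writes *.log file names into one list file and *.tmp file names into another (/tmp/log.list and /tmp/tmp.list are hypothetical output files):

# find . \( -name "*.log" -fprint /tmp/log.list \) , \( -name "*.tmp" -fprint /tmp/tmp.list \)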

 
