Removing Data Noise with Apatar
There used to be a time when people would say, “I can’t make a decision because I do not have enough information!” Now, you are more likely to hear something like, “I can’t make a decision because I have too much information!”
The Internet has allowed us to amass an extremely large amount of data, more than we have ever had access to before. This presents a unique problem: we have too much information, and we need ways to filter out the data “noise” so that we can get to the good stuff.
I have discovered a very useful tool called Apatar that allows you to do just that.
Apatar: An Introduction
Almost every company stores data in multiple locations. It can be very difficult to generate useful information when the data is scattered over a number of different locations or even computers. Add to this the complexity of the different ways of storing data: databases, text files, RSS feeds, webpages, XML, etc.
Apatar is designed to be an intermediary that lets you bring all of this different data together, filter it, and output the results in the format you want. From the Apatar website:
- Push and pull integration support: Apatar is capable to support data-, event- and service-oriented integration with the same interface. This unique capability allows enterprise customers, partners and developers to cover all integration needs: batch and near real-time integration.
- Productivity advantage and a short learning curve: the business-rules-driven approach is shared throughout Apatar, regardless of the data, event or service orientation of each integration mechanism.
- Shared, reusable metadata: with a single metadata repository, the consistency of the integration processes is guaranteed. The repository also promotes the reusability of business rules for data transformation and data validation across processes.
- Open source architecture. Unlike all proprietary data integration software and many open source solutions, Apatar is 100% open source with no proprietary source code.
Filter and Convert Data: An Example

Let’s look at how we can use Apatar to perform a simple data manipulation task. What we are going to do is take the Daily Cup of Tech RSS feed, filter the feed so that only the posts written by contributing authors are left, and then transform the data so that we produce a CSV file with the name of the author, the title of the post, and the URL to the post.
If you want to follow along with me, you can download and install Apatar and then get the DataMap and CSV files. Extract the DataMap and CSV file to the root of your C:\ drive to allow you the quickest and easiest display of the map.
While this may seem like a bit of an onerous task to begin with, it is something that can actually be done in just a few steps with Apatar.
Step 1: Select Your Blocks
To start the process, we are going to first select the connectors (which represent data storage types) and operations (which modify data in different ways). To do this, simply drag the connector or operation that you want from the left pane to the right pane. Then, rearrange the blocks however you want.
For our example, we are using two connectors (RSS and TextFile) and two operations (Filter and Transform). Drag these four blocks to the right-hand pane.

Step 2: Connect the Dots
We now need to define the flow of data from the RSS feed to the CSV file. You will notice on each of the blocks two dots, one red and one green. These are data connectors. The green dots represent data input from another block and the red dots represent data output. By connecting the green dots to the red dots, we define which blocks feed data to each other.
Note: Not all blocks have a green and a red dot. Some will only have one. Others will have multiples of each. It all depends on the purpose of each block.
To connect the blocks together, simply click on a red dot and drag it to the desired green dot. This will create a line with an arrow indicating the direction of data flow.
In our example, we are connecting as follows:
RSS → Filter → Transform → TextFile
Your DataMap should now look like the final results that we mentioned earlier.
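For readers who think more easily in code, the same RSS → Filter → Transform → TextFile flow can be sketched as four small Python functions chained together. This is only an illustration of the data flow and not how Apatar works internally. The sample feed, the author names, the `dc:creator` author field and every function name here are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real run would read the feed from its URL.
SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <dc:creator>Site Owner</dc:creator>
    </item>
    <item>
      <title>Guest post</title>
      <link>http://example.com/guest</link>
      <dc:creator>Guest Author</dc:creator>
    </item>
  </channel>
</rss>"""

# Namespace prefix for the Dublin Core <dc:creator> element (assumed author field).
DC = "{http://purl.org/dc/elements/1.1/}"


def read_rss(xml_text):
    """RSS connector: yield one dict per <item> in the feed."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield {
            "author": item.findtext(DC + "creator", default=""),
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }


def filter_posts(posts, excluded_author):
    """Filter: keep only the posts not written by excluded_author."""
    return (p for p in posts if p["author"] != excluded_author)


def transform(posts):
    """Transform: turn each post into an [author, title, url] row."""
    return ([p["author"], p["title"], p["link"]] for p in posts)


def write_csv(rows, out):
    """TextFile connector: write a header line and the rows as CSV."""
    writer = csv.writer(out)
    writer.writerow(["author", "title", "url"])
    writer.writerows(rows)


def run_pipeline(xml_text, excluded_author):
    """Chain the four stages in the same order as the DataMap."""
    out = io.StringIO()
    write_csv(transform(filter_posts(read_rss(xml_text), excluded_author)), out)
    return out.getvalue()


if __name__ == "__main__":
    print(run_pipeline(SAMPLE_RSS, "Site Owner"))
```

Each function plays the role of one block, and the nested call in `run_pipeline` plays the role of the lines you drew between the dots.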
Step 3: Configure the Blocks
We now have the basic framework for our DataMap. Next, we need to configure each block so that it reflects the data we are working with and the changes we want to make to that data.
To configure a block, simply right-click on the block and select Configure. Each block’s configuration screen will differ depending on what the block can do.
Rather than go through a long (read “boring”) step-by-step on how to configure each block, I suggest you download the DataMap and CSV files so that you can see each of the configurations for yourself.
Step 4: Troubleshoot
It is rare that I get a DataMap to work on the first try. I generally need to go through and troubleshoot some problem along the way.
One of the most useful abilities that Apatar has is to show you what the output of an individual block looks like. By simply right-clicking on a block and clicking on Preview, you can see what type of information you are getting out of that block. Simply work your way from start to finish and you can pretty quickly determine which is the offending block.
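The same stage-by-stage habit carries over if you ever build a similar pipeline by hand. The sketch below is a rough analogy to the Preview idea and is not Apatar code: a hypothetical `preview` helper prints the first few records coming out of each stage, so the stage where the data goes wrong is easy to spot. The records and stage names are made up for the example.

```python
from itertools import islice


def preview(name, rows, limit=3):
    """Print the first few records from a stage and return them all as a list."""
    rows = list(rows)
    print(f"{name}: {len(rows)} record(s)")
    for row in islice(rows, limit):
        print("   ", row)
    return rows


# Hypothetical records standing in for the output of a source block.
records = [
    {"author": "Site Owner", "title": "First post"},
    {"author": "Guest Author", "title": "Guest post"},
]

# Check each stage in order, from the start of the flow to the end.
source = preview("source", records)
filtered = preview("filter", [r for r in source if r["author"] != "Site Owner"])
transformed = preview("transform", [(r["author"], r["title"]) for r in filtered])
```

If the record count drops to zero at one stage, or a field comes out empty, that stage is the one to look at first.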
Step 5: Run, Apatar, Run
Once you have determined that everything in your DataMap is working properly, you can run the project and generate your output. The length of time this takes depends on the complexity of your DataMap and the amount of data that you need to process.
To run your project, select File → Run.
Conclusion
You should now have a nice clean new set of data that has only what you want in it. I would encourage everyone to spend some time playing around with Apatar so that you can get a good understanding of what it can do and then let your imagination run wild!
5 Responses to “Removing Data Noise with Apatar”
-
Tom Gleeson Says:
July 30th, 2007 at 9:59 am
You should also check out Talend (http://www.talend.com); it’s more enterprisey and works as a code generator (either Java or Perl). The advantage of this is that once you generate the transformation you can ‘set it in stone’ and use it on any other machine, with the only requirement being the presence of either Perl (most *nix) or Java (most desktops). You could also use the Perl PAM module to generate a “Perl EXE”, removing the need for any preinstalled software.
Tom
-
Todd Says:
July 30th, 2007 at 10:29 am
This seems to be like a personal version of Yahoo Tubes. Am I understanding both of the products accurately?
Todd
-
Tim Fehlman Says:
July 30th, 2007 at 10:36 am
This would be a fair assessment, but it can do a lot more than Yahoo Tubes.
Tim
-
Renat Khasanshyn Says:
August 8th, 2007 at 1:47 am
Tim,
Thanks from the entire Apatar community.
Re:
Tom, I appreciate your notes on Perl. Yes, Apatar does not generate Perl, nor does it require Perl to be installed on a user’s machine. Apatar generates XML, which can be parsed, changed and custom-configured to the user’s liking, even without the Apatar graphical job designer. Once generated, this XML can be shared and reused on any machine. As for enterprise capabilities: although Apatar’s major capabilities have yet to be released, its primary audience is business users and DBAs rather than developers, with ease of use and ease of installation being the first priorities. Apatar also brings an on-demand metadata repository, RSS aggregation and a number of enterprise and mashup connectors yet to be developed by other open source projects.
I think this is the beauty of open source: communities can share, customize and reuse each other’s work without license restrictions, for the better.
Renat

