At Shopzilla we’ve been building a new system that captures data as it flows through our existing systems and collects it into a new Oracle database.  From there the business folks wanted it copied into Salesforce.com, where it’s easier for them to use.

For the portion of the system that copies from Oracle to Salesforce, we found a number of vendors with off-the-shelf products that could do the job.  After an evaluation, we selected Informatica Cloud Services and purchased the appropriate license.  We’ve had it in production for about 3 months now, and it’s working out well for us.  The web UI used to configure tasks is straightforward and easy for new users to learn.  Mapping of source rows to target fields can be done with drag and drop, and the mapping can be validated from the same page.  There is a ‘Make a copy’ button that makes it simple to duplicate an existing task and then edit as needed, instead of having to start from scratch.

Informatica has a built-in mechanism for migrating from Dev to QA to Production.  When migrating you start by logging into the target organization, then from there log into the source organization and pull the desired tasks and connections into the target organization.  You just need accounts for both organizations.  (This may not seem that impressive if you’re used to migrating .war files, but when compared to using the Force.com IDE with Salesforce it’s much easier.)  There is one downside: an existing task can’t be replaced during migration. This means that you will typically need to first delete an existing task before you migrate in the new version.

Issues We Ran Into

Rounding Errors in Large Numbers

We found this because one of the Oracle tables had primary key values created with an algorithm that produces 35-digit numbers.  This was no problem in Oracle; the column datatype is NUMBER(38).  However after uploading to Salesforce we found the values didn’t quite match.  An example:

 12795812359292270001909990099008659    (Oracle)
 12795812359292270319152466166284288.00 (Salesforce)

After opening a case with Informatica support, we learned that all numbers are converted to floating point with 15 digits of precision, even if they’re integers.  There is no way to avoid this.  Further simple tests showed that 16-digit integers will probably transfer without issue, but 17-digit integers won’t.

Now if you’re familiar with Salesforce, you’re probably wondering about how we thought we were going to store a 35-digit number in Salesforce anyway, given that Salesforce only stores 18 digits.  The answer is that the field in Salesforce is actually a Text(38) field, which would hold the entire value as a string and have a couple of spaces left over just in case.  We converted the Oracle number into a string using the to_char() function built into Informatica.  This is where the problem came in, because Informatica converts the number to a 15 digit floating point value before to_char() runs.  There is no way for to_char() to help, because it’s already too late.

Fortunately there is a workaround: call to_char() in Oracle instead of Informatica.  Once we switched, we had no more value mismatches.

Trailing Zeros in Smaller Numbers

We found this one a little later on, because the people who saw it first weren’t familiar with Salesforce and thought was just something that Salesforce did.  Primary key values from another table picked up a series of trailing zeros during the upload.  An example:

 4332891                 (Oracle)
 4332891.000000000000000 (Salesforce)

This turned out to be another side effect of the conversion to 15 digit floating point.  We were once again using the to_char() function in Informatica to convert an integer to a string (because we hope someday those values will exceed 18 digits).  The value gets converted to floating point, then to_char() converts it to a string, nicely formatting the floating point value with a decimal point and trailing zeros.

Fortunately for us, the same workaround applies: just call to_char() in Oracle instead of Informatica.  Once we switched, we had no more trailing zeros.

Oracle Views vs. Informatica Transformations

After some experimentation, we found it better to create Oracle views to do any needed joins and data transformations.  Informatica can do simple joins on the fly, but it can’t do more complicated joins.  We have one Oracle view where the underlying query now contains 3 left outer joins and a CASE statement to collect all the data needed for one Salesforce object.  (Yes, that query is slow.)  There is another view that concatenates multiple values in order to create a unique external identifier for Salesforce.  The concatenation started out in Informatica, but when we needed to use the same value in two different places in the mapping it moved to the Oracle view.  And as discussed above, we had to move to_char() calls from Informatica to Oracle.

We now have a rule of thumb that we create a view for every type of data going into Salesforce.  One reason is that it’s easier to keep track of if all mappings query views rather than having some that do and some that don’t.  The other is that things that start simple don’t always stay that way (as seen with the to_char() calls), and it’s easier if you don’t switch for a table to a view and have to redo most of the Informatica task.

Informatica Log Files Rollover

Informatica keeps a separate set of log files for each task.  The file names end with 0, then 1, 2, etc. up to 9, after which it overwrites the 0 file.  There isn’t a way to increase the number of log files written before the count rolls over.  For tasks that run once every 24 hours this isn’t a big deal, but we have one that runs every 30 minutes.  At that pace the log files are overwritten after 5 1/2 hours, which means that if something happens in the middle of the night it’s gone before the developers realize there was a problem.  To work around this we set up automated archiving of log files before they get rolled over.  We’re using Rsync-Incr, which was written by Colas Nahaboo and is available at http://colas.nahaboo.net/Software/RsyncIncr.

We use Splunk for most of our log files, and we really would like to use it for Informatica logs as well.  The problem is that when Splunk ingests a line in an error log, it determines the timestamp of the error by looking through the line for a timestamp it can recognize.  Informatica logs don’t have an error timestamp, but they do contain the data from the upsert that failed.  This means that Splunk will find a timestamp, but it’s a business data timestamp, such as the date that a user originally created their account.  As a result the errors show up in Splunk with no correlation at all as to when they occurred.  If you know a way to have Splunk log errors based on when the file was written, *please* post a comment!

Current status

We’ve now had Informatica running in production for about 3 months, and have successfully moved tens of millions of records through it.  Building the system would have been easier if I’d known of the issues in advance, hence this blog post, but I would choose it again.

Comments (4)

Comments (4)

Comment RSS  |  Trackback URI

by John Mark in February 23rd, 2011 at 11:20 am    

To solve your timestamp problem, take a look at the answer given here:

http://answers.splunk.com/questions/3428/timestamps-problem

-John Mark
Splunk Community Director

by kbains in February 24th, 2011 at 1:04 am    

try:

DATETIME_CONFIG = NONE in props.conf for that source.

by ssorkin in February 24th, 2011 at 10:15 am    

You can configure splunk to never look for timestamps, but rather fall back on the modtime of the file using “DATETIME_CONFIG = NONE” in props.conf for the given sourcetype. Similarly you can configure “DATETIME_CONFIG=CURRENT” will use the time when the data is indexed. This technique is buried in http://www.splunk.com/base/Documentation/latest/Admin/Configuretimestamprecognition.

by krishna in March 21st, 2011 at 6:13 pm    

I’m not sure about the Cloud services, but there are 2 log formats with Informatica session logs. The old one used to be like “MAPPING>”, whereas the newer one contains a proper timestamp (not sub-second thou) which Splunk should be able to parse correctly (and not confuse with date values appearing in other places)…

Related Posts (5)

Leave a Comment