Jekyll2022-11-28T13:09:08-05:00https://refinepro.com/feed.xmlRefineProTurn data into competitive advantagesBackend Python and JavaScript ETL developer2021-10-07T00:00:00-04:002021-10-07T00:00:00-04:00https://refinepro.com/blog/backend-python<div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/Hiring_202101.jpg" width="100%" /></div> <div><br /></div> <h3 id="introduction">Introduction</h3> <p>We are a small data-driven company looking for a senior or mid-level ETL developer. We offer an organized remote work environment with colleagues across the world. We don’t ask you to know everything, but we ask that you always stay ready to learn. In exchange we promise to provide you with a dynamic work environment that promotes sharing, learning, and a fondness for the power of data. Learn more below! We can’t wait to hear from you.</p> <h3 id="about-the-job">About the Job</h3> <ul> <li>Job Type: Full time and permanent</li> <li>Job Location: Remote with the option to find co-workers and offices in Toronto or Montreal.</li> <li>Timezone: America</li> <li>Experience: Senior and mid-level (5+ years of relevant experience)</li> <li>Role: Python and JavaScript ETL Developer: Integration Team.</li> <li>Compensation: Based on experience</li> <li>Industry: Consulting Data services</li> <li>Company Size: 5 to 15 employees</li> </ul> <h3 id="who-are-we">Who are we?</h3> <p>At RefinePro, we believe that access to data is key to understand today’s challenges.</p> <p>That’s why, we have been designing, developing and monitoring hundreds of data pipelines with a focus on data engineering and operational excellence since 2014. Our goal? Make sure data is within everyone’s reach. Our clients rely on our data feed to gather new insights and build products or services. We help our customers transform the insurance, retail, compliance, and manufacturing industries. We provide them with data strategy, system architecture, implementation and outsourcing services.</p> <p>And to achieve our mission, we:</p> <ul> <li>Always put our customer’s needs at the center of our operation,</li> <li>Keep data quality in mind at all time, and</li> <li>Invest early on DevOps and DataOps best practices to automate our processes</li> </ul> <p>We are a distributed asynchronous team, operating with a remote-first mindset. We have offices in Canada (Toronto, Montreal).</p> <p>And we have nothing to hide! So go check out our <a href="/blog/joel-test/">Joel test</a> results to know everything there is to know about our code writing skills.</p> <h3 id="who-are-you">Who are you?</h3> <p>You are an ETL developer or data engineer with significant experience in scripting languages, developing and maintaining data pipelines. You previously worked with large datasets and messy data with at least a few millions records.</p> <p>You believe coding is only part of your job, and you pay attention to code quality, documentation, unit testing, and data quality. 
You are comfortable speaking directly with the client to present your approach and collect feedback.</p> <p>You know when to ask for help, and are not afraid to learn new technologies.</p> <p>And like us, you believe data is the future.</p> <h3 id="the-mission">The mission</h3> <ul> <li>Work with the product owners and our customers to develop a technical vision for the project, including ETL specifications, workflows and data model definition.</li> <li>Design and build ETL flows using Python, Bash, or SQL</li> <li>Work with the DevOps team to deploy your creations to production</li> <li>Document operating procedures to execute, monitor, and maintain data pipelines</li> <li>Monitor and test integrations for multiple customers.</li> <li>Support ongoing projects</li> <li>Carry out research & development on new data acquisition and transformation technologies</li> </ul> <p>Remember: You must be comfortable working directly with the customer in a consulting model.</p> <h3 id="skills-and-requirements">Skills and Requirements</h3> <h4 id="skills">Skills</h4> <ul> <li>JavaScript and Python and their common ETL libraries (pandas, numpy, click)</li> <li>Git</li> <li>Bash, Batch</li> <li>Comfortable with the command line</li> <li>Knowledge of the different data formats (CSV, JSON, YML, XML)</li> <li>Excellent understanding of the SQL language</li> <li>Understanding of API design and usage</li> <li>Good knowledge of web development (though we will not ask you to develop web apps)</li> <li>Familiar with development concepts and best practices (Kanban, DRY, Design patterns)</li> <li>Good written and spoken English</li> <li>Willing to learn new tools and technology.</li> </ul> <h4 id="bonus-points">Bonus points</h4> <ul> <li>You have experience working with an all-remote or distributed team.</li> <li>You know other languages such as Java, Node, or PHP.</li> <li>You’re excited by ETL frameworks like Talend, Pentaho, or Alteryx.</li> <li>You have previous experience with web scraping.</li> <li>You know Linux or AWS</li> <li>You have experience with other database engines</li> <li>You have experience with Elasticsearch and Kibana</li> </ul> <h3 id="what-we-offer">What we offer</h3> <ul> <li>Direct access to our senior developers and founder. We all work together, and we want our people to thrive.</li> <li>An efficient and well-organized remote environment, with REAL processes and experience. (We were “in” before remote work was even “in.”)</li> <li>Flexible schedule.</li> <li>Work-life balance.</li> <li>Documented projects and specifications.</li> </ul> <p>If you haven’t looked at our <a href="/blog/joel-test/">Joel test</a> results yet, do it now (and let us know)!</p> <h3 id="the-recruitment-process">The recruitment process</h3> <ol> <li>Submit your resume and cover letter to [email protected]</li> <li>We will schedule an introduction meeting with one of our founders.</li> <li>Positive meeting? Great! We’ll ask you to take a technical test to review your skills. But because a test can limit how you express your talent, we will follow up with a technical interview. At this stage, you will also meet one of your future colleagues to get a different perspective on the company. Ultimately, we also want you to want us.</li> <li>Are you still interested in working with us? Awesome! If we feel we have a fit, we will go through the paperwork and get you started on a project.</li> </ol> <p>If you have any questions, please reach out to us at [email protected]. 
And we hope to see you soon!</p> <p>The RefinePro team.</p>florianWebinar Data Operations for CRM and Marketing2020-09-20T00:00:00-04:002020-09-20T00:00:00-04:00https://refinepro.com/blog/marketing-data-operation-webinar<p>In case you missed our webinar with Macro on September 10, 2020, you can find the recording and the slides here. </p> <div style="text-align: center; clear:both; min-height:250px"> <a href="https://us02web.zoom.us/webinar/register/6215985208832/WN_hgSFM8CFT4WY_XINgFhz4A"><img class="alignleft" src="/images/blog/macrowebinar.jpg" width="100%" /></a><br /> <a href="https://us02web.zoom.us/webinar/register/6215985208832/WN_hgSFM8CFT4WY_XINgFhz4A" class="button special mb-2">Watch the webinar recording</a> </div> <p>Data management is a core concern facing many companies today. Good data can have a tremendous positive impact on organizational efficiency, productivity, and revenue. Messy data, on the other hand, can have the opposite effect. It can lead to financial losses and disorganization.</p> <p><strong>What Is the Business Impact of Data Operations?</strong></p> <ul> <li>Better Data-Driven Decisions</li> <li>Improved Lead Management & Qualification</li> <li>Less Frustration</li> </ul> <p>In this webinar, Dan and Martin discuss the business impact of good data and touch upon how companies can better manage their data and enhance their data operations strategy.</p> <p>Martin also presents a demo of <a href="https://openrefine.org">OpenRefine</a>, a free, open-source data clean-up tool.</p> <div style="text-align: center"> <a href="https://us02web.zoom.us/webinar/register/6215985208832/WN_hgSFM8CFT4WY_XINgFhz4A" class="button special mb-2">Watch the webinar recording</a> </div> <div style="text-align: center;"><iframe allowfullscreen="" frameborder="0" height="485" marginheight="0" marginwidth="0" scrolling="no" src="//www.slideshare.net/slideshow/embed_code/key/InuGD7LevVavNy" style="border-width: 1px; border: 1px solid #CCC; margin-bottom: 5px; max-width: 100%;" width="595"> </iframe></div> <div style="margin-bottom: 5px;"> </div> <p><br /></p> <p>Dan is the president and founder of <a href="https://macromator.com/">Macro</a>. His professional background includes B2B demand generation consulting for Microsoft Dynamics CRM and various international marketing roles in Europe and North America.</p> <p>Martin is the founder and CEO of RefinePro, a Canadian company focused on data processing and normalization. He created RefinePro to make data within reach of small and medium-sized businesses or departments.</p>martinIn case you missed our webinar with Macro on September 10, 2020, you can find the recording and the slides here.Download The Data Innovation Canvas2020-07-09T00:00:00-04:002020-07-09T00:00:00-04:00https://refinepro.com/blog/download-data-innovation-canvas<p>Learn more about the Data Innovation Canvas from Communitech. </p> <p> Watch Chris Willsher, Director of Data Platforms at Communitech, present the canvas during the May 8th Communitech® Data Hub Sessions. 
The video starts at 41:25</p> <div style="text-align: center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/3welnQnWDSw?start=2485" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe> <br /><br /> <a href="/images/blog/Data-Innovation-Canvas.pdf" class="button special mb-2">Download the data innovation canvas</a></div> <p> <div style="text-align: center; clear:both; min-height:250px"> <a href="/images/blog/Data-Innovation-Canvas.pdf"><img class="alignleft" src="/images/blog/data_innovation_canevas.png" width="100%" /></a><br /> </div></p>martinLearn more about the Data Innovation Canvas from Communitech. Watch Chris Willsher, Director of Data Platforms at Communitech, presents the canvas during the May 8th Communitech® Data Hub Sessions. The video starts at 41:25The secret for long term-growth - or why data is the new oil2020-07-09T00:00:00-04:002020-07-09T00:00:00-04:00https://refinepro.com/blog/why-data-is-new-oil<div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/744407434.jpg" width="100%" /></div> <div><br /></div> <p>Data is the new oil. It stands at the center of an organization’s value proposition, at the core of their product or service creation process. Understanding and managing data is a core competency and not a by-product.</p> <blockquote> <p>COMPANIES THAT ARE LEADERS IN THE USE OF DATA ARE THREE TIMES MORE LIKELY TO BE FINANCIALLY SUCCESSFUL. source: <a href="http://www.eiu.com/default.aspx">Economic Intelligence Unit</a></p> </blockquote> <p>It’s <a href="https://www.emc.com/leadership/digital-universe/2012iview/big-data-2020.htm">estimated</a> that 40,000 more exabytes of data is either created, replicated, or consumed annually in 2020 compared to 1,200 in 2010. And this increase is happening in almost every industry. Organizations must then learn to exploit and refine data if they want to grow. And to do so, they need to understand how their data strategy evolves at every step of their customer journey and product life cycle. They need their employees to understand how to use, read, understand, and interpret data. And they need to know how to build products and services that are driven by data.</p> <p>With data, your organization can create products or services that directly help customers. Your offer can take the form of tools (API, data feeds, recommendation engine, etc.) or knowledge and insights (thanks to advanced analytics). But ultimately, data still needs to be at the center of your organization’s business strategy. And the same way the Business Model Canvas is supposed to help organizations develop their business model, Communitech created the <a href="https://startupheretoronto.com/partners/communitech/communitech-communitech-data-hub-sessions-introducing-the-data-innovation-canvas-2/">Data Innovation Canvas</a> to help organizations develop a data model.</p> <div style="text-align: center; clear:both;"> <a href="/blog/download-data-innovation-canvas/" class="button special mb-2">Download the Data Innovation Canvas</a> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/data_innovation_canevas.png" width="100%" /></div> <p>So you can now start fuelling your organization with data for better growth.</p> <h3 id="invest-in-your-future">INVEST IN YOUR FUTURE</h3> <p>We all start from an empty (or nearly empty) data store. 
Building a data strategy is the key to slowly assembling all the data you need, in a structured, efficient, ethical, legal, and reliable way. As you fill up your data warehouse, you can slowly define how your product or your service will incorporate data, and how you can leverage data over time to continually improve your offer.</p> <p><strong>It works like compound interest.</strong> The longer you invest in your data, i.e. collect, aggregate, enrich, and analyze it, the more value you gain. Over time, you build historical information on specific elements of your service, product, and customer base that will improve your analysis and allow you to gain new insights. Later down the road, your historical information will help you create better data initiatives. Essentially, your past work keeps paying you back. You’re earning interest on your interest.</p> <p><strong>Start now</strong>. Because this history is built on your own doing, your innovations, and your products and services, it’s entirely exclusive to you and impossible to reproduce. Yes, it’ll take time. You might not be able to use data to address current challenges. The goal is more in the long term: How can you leverage data that improves <em>over time</em>? The only way to do so is by thinking two or three steps ahead to set the right goals.</p> <p>Keep in mind that the more data you collect, the more accurate you’ll be. But the more accurate you’ll be, the more changes you’ll need to make to your products, services, and workflows to incorporate data in your operations.</p> <p>Data is the new oil, but it’s not sufficient to fill the tank of your machine once and leave for a year. During your data journey, you’ll rethink how to use that fuel efficiently. You will also see new opportunities and new destinations to reach with a bus filled with happy and empowered customers.</p> <h3 id="find-the-perfect-productservice-market-fit">FIND THE PERFECT PRODUCT/SERVICE MARKET FIT</h3> <p><img class="alignleft" src="/images/blog/351983435.jpg" width="45%" />Once you’ve filled your Data Innovation Canvas, you’ll be ready to build a data-driven business model. In other words, your business model will define what role data will play in the growth of your organization and how it will adapt to data changes.</p> <p>Adaptability is the keyword here. Our world is moving fast, and that’s already an understatement. An organization that chooses to harness the power of data must also be ready to adapt to its many changes and variations. You should also take into consideration your organization’s learning curve when working with data. Mainly:</p> <ul> <li> <p>Your business plan should <strong>consider all possible ways of presenting data</strong> and their potential impact on your revenue model and targeted segments. One dataset, for example, could be used simultaneously to feed different initiatives.</p> </li> <li> <p><strong>Data changes all the time</strong>. It’s constantly being updated, transformed, erased, combined, added, etc. You need to consider possible changes in your data sources over time, and how they’ll affect your organization and its capacity to deliver its products and services. If you use external data, you need to make sure your architecture is strong and flexible enough to handle unpredictable changes.</p> </li> <li> <p><strong>Data is not confined to one department</strong>. It can, and should, be used by different teams for different applications. 
The more you use your data, the more reliable it becomes, and the more you find ways to use it!</p> </li> <li> <p><strong>Data shouldn’t be complex</strong>. You don’t need to create the next artificial intelligence unicorn. Start with a narrow scope and explore the real use case for it. If you make it easy to explain, you’ll help increase the transparency of your activities for your users and stakeholders, and thus gain their trust. And this will prove to be a lot more efficient than diving headfirst into complex cross-analysis, especially if it’s not needed.</p> </li> <li> <p><strong>Your customers change all the time too</strong>. And their expectations of your organization change too. Keep an eye on where the industry is going.</p> </li> </ul> <p>With a business model based on data, and most importantly based on the volatile nature of data, you multiply your chances of finding the perfect product/service market fit. Instead of relying on time-sensitive market analysis, your growth will be rooted in a deep understanding of your data and different user experience iterations. You’ll drive that bus full of customers not only with the right fuel but, most importantly, in the right direction.</p> <div style="text-align: center; clear:both;"> <a href="/blog/download-data-innovation-canvas/" class="button special mb-2">Download the Data Innovation Canvas</a> </div> <h3 id="take-away">TAKE AWAY</h3> <p>Now you know: collecting data and learning how to use it is your next big move. But it’s never that easy, is it? There is such a thing as bad data. Before you throw yourself at juicy datasets, ask yourself the <a href="/blog/10-questions-to-ask-before-using-new-data/">10 essential questions</a> to make sure you’re feeding your organization with relevant data from reliable sources. It’ll also help you make sure the data you’re collecting is usable for your organization, or that you have the required abilities to clean it. And once you’ve clearly defined your sources, don’t forget to plan ahead: how will you use these different sources? Can you cross-analyze them to derive new insights?</p> <p>Collecting, cleaning, and analyzing data <a href="/blog/schedule-maintain-web-scraper/">are not easy</a> (or cheap) tasks. You need to set the right expectations, or you risk drowning in tasks you hadn’t planned for, driving a bus that’s missing critical pieces of mechanics. It might be easier to start off with data that’s naturally closer to your needs so as to limit the cleansing and preparation needed. And then, slowly, as your organization grows, you can improve your data granularity (the level of detail) or linkage (how it relates to other information).</p> <p>The secret is investing in your infrastructure. Build data foundations from the start, from engineering, based on a solid Data Canvas. Build foundations that will last while you iterate on the front end and insight delivery. Give yourself enough space to package data differently, according to your users.</p> <p>You’ll find all sorts of “ready to buy” datasets out there. But the datasets that matter are the ones you build. So, <strong>start collecting data early to build historical information. Because one thing you’ll never be able to buy back is time.</strong></p> <div> <section class="special"> <p> <section id="contact-cta" class="rp-cta wrapper style1"> <div class="inner"> <div class="rp-cta--text">Got a project or idea in mind? <br />We have the experts to make it happen. 
</div> <a href="/contact/" class="rp-cta--button button special">Tell Us About It</a> </div> </section> </p> </section> </div>martinDownload PDF Form 10 Questions Before Using New Data2020-06-15T00:00:00-04:002020-06-15T00:00:00-04:00https://refinepro.com/blog/Download-Questions-For-New-Data<div style="text-align: center;"> <img class="alignnone wp-image-1426 size-medium" src="/images/blog/10QuestionsBeforeUsingData.png" alt="PDF preview" width="225" height="300" srcset="/images/blog/10QuestionsBeforeUsingData.png 225w, /images/blog/10QuestionsBeforeUsingData.png 720w" sizes="(max-width: 225px) 100vw, 225px" /> </div> <iframe width="540" height="1150" src="https://03dbbe7d.sibforms.com/serve/MUIEAObSzwSc1Sy_AiTDGdXIh0HGVF24Xtb2yCD3AumINppThNf8w0e1LI9C1Vp1TgZ5PiKgvO_3OcuS5DKz_n_TK_Dy_vQ6VF4hBqtlG5yP0EZmgYhCv66PsAtseNlluubAxsa0geZjeHKFavcrsicgs-HAGivGAy3w8rSDrc-obtRB8SWpjJd-H-gdEy3fKfL8CtI7YuWfH4Pq" frameborder="0" scrolling="auto" allowfullscreen="" style="display: block;margin-left: auto;margin-right: auto;max-width: 100%;"></iframe>martinDownload PDF Form Web Scraping Software Comparison Table2020-06-15T00:00:00-04:002020-06-15T00:00:00-04:00https://refinepro.com/blog/Download-Web-Scraper-Comparison<div style="text-align: center;"> <img class="alignnone wp-image-1426 size-medium" src="/images/blog/WebScrapingComparison.png" alt="PDF preview" height="300" srcset="/images/blog/WebScrapingComparison.png 225w, /images/blog/WebScrapingComparison.png 720w" sizes="(max-width: 225px) 100vw, 225px" /> </div> <iframe width="540" height="1150" src="https://03dbbe7d.sibforms.com/serve/MUIEADKE_F-eqNc4hzlGqCD77FQCKuCTAMkxRPwwZT_802mvaL1rS57ktZRgZEW-n3lPE7i_u-Yo1QvG_H5WYVIyJKXxTvMC6XLunB5i376_MvOIB84DWVFBvTLcI9-6ALXljnadzfIrv-3_C5zdDRgO9pTOQbKxuGO_hX_kVIbpeKVR7DUiUR02XEv3fxZC_wp_TfEMRYlGiZ9n" frameborder="0" scrolling="auto" allowfullscreen="" style="display: block;margin-left: auto;margin-right: auto;max-width: 100%;"></iframe>martin10 questions to ask before using new data2020-05-25T00:00:00-04:002020-05-25T00:00:00-04:00https://refinepro.com/blog/10-questions-to-ask-before-using-new-data.md<div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/1496369480.jpg" width="100%" /></div> <p>Data extraction projects are complex and often require quite a lot of time and effort. To make sure your organization is creating value and that your money and your time are well spent, the first logical step is to choose your sources carefully. To help you achieve just that, we create <strong>a list of 10 questions you need to ask before you set your sights on a dataset</strong>. The goal here is to collect and analyze all the data existing information in order to clarify its ownership, publication, structure, content, quality, relationship, etc. Only by going through this process can you guarantee the suitability of your sources and identify potential problems and particularities.</p> <p>This checklist will help you assess all the elements you need to know in order to proceed with your data project. 
Most of all, once you have all the answers, you will have everything you need to define what will be your game plan to transform and manipulate the datasets you chose.</p> <p>So, without further ado, here are <strong>ten questions to ask before using new data.</strong></p> <ul> <li>Question 1: Who owns the data?</li> <li>Question 2: Who publishes the data?</li> <li>Question 3: Is the dataset documented?</li> <li>Question 4: How is the data collected?</li> <li>Question 5: How is the data maintained and updated?</li> <li>Question 6: What are the format and granularity?</li> <li>Question 7: Does the data follow standards?</li> <li>Question 8: Can you link your data to another dataset?</li> <li>Question 9: Under what licenses the data is released?</li> <li>Question 10: Are there data privacy issues?</li> </ul> <div style="text-align:center;"> <a href="/blog/download-10-questions-for-new-data/" class="button special">Download our editable form for your personal use</a></div> <p><br /></p> <h3 id="question-1-who-owns-the-data">Question 1. Who owns the data?</h3> <p>And by “own,” we don’t necessarily mean “publish” (see question 2). You need to know where the data originally comes from, and whom to contact if you ever have any questions or issues that need solving. Also, if you ever need to attribute ownership (see question 9) when reusing the data, this owner will be the one you will refer too. Basically, you need to put a human face and a name to the data you wish to extract and use.</p> <h3 id="question-2-who-publishes-the-data">Question 2. Who publishes the data?</h3> <p>There are a lot of platforms out there that offer huge datasets, like <a href="https://www.quandl.com/">Quandl</a>, and <a href="https://data.world/">data.world</a>. But it doesn’t mean they own the data they share with you. It is, therefore, imperative that you know who owns and publishes and shares the data. This distinction will help you better answer the following questions.</p> <h3 id="question-3-is-the-dataset-documented">Question 3. Is the dataset documented?</h3> <p>You need to gather all the information on how the data was collected (see questions 4 and 5) and how it should be interpreted (see question 7). This includes the schema of the data, with the data type and validation rules. The goal here is to make sure you can answer most questions without having to go back to the data owner.</p> <h3 id="question-4-how-is-the-data-collected">Question 4. How is the data collected?</h3> <p>The answer to this question will help you identify any potential biases in the way data is collected. It will also give you some extremely important information concerning the data itself: is the data complete or partial? Has it been pre-processed before its publication? You want to know what the original state of the data was and how much it has changed (or not) before reaching you.</p> <h3 id="question-5-how-is-the-data-maintained-and-updated">Question 5. How is the data maintained and updated?</h3> <p>Now that you know what the data looked like originally, you want to know what processes it goes through. For example, you want to ensure your data will still be reliable in the long-term. You also want to know how often it’s updated and if the set contains all the records or only updated ones. 
You also need to know if there’s a change in the collection methodology, or if the dataset stops being available.</p> <p>Without these answers, you might end up building a script for something that won’t be available in two days’ time, or not in the format you expected.</p> <h3 id="question-6-what-are-the-format-and-granularity">Question 6. What are the format and granularity?</h3> <p>You must identify the formats in which your data is made available. Formats can usually be categorized as follows:</p> <ul> <li>Non-friendly formats (PDF, web page, Word document, image)</li> <li>Flat file (CSV, XLS)</li> <li>Structured file (JSON, XML)</li> <li>API and web service, provided by the source or by a third party (Quandl, data.world)</li> <li>Maps (KML, Shapefile, GeoJSON)</li> </ul> <p>Once you know what format you’ll need to deal with, you can better choose the tools and solutions you’ll need to extract and transform your data. It will also help you identify the <strong>data granularity, or the lowest data point available</strong>. If, for example, your information concerns time, you want to know if the smallest possible data point is second, minute, hour, day, month, or year. The same goes for maps: address, postal code, city, or state?</p> <p>Knowing what your data looks like and what shape it takes will guarantee that you have all the information you need to extract it using the right tools and solutions.</p> <h3 id="question-7-were-specific-standards-applied-to-the-dataset">Question 7. Were specific standards applied to the dataset?</h3> <p>When data is collected and published according to certain standards, it helps remove ambiguity on the collection, aggregation, and preparation methods. Standardization also allows us to compare and combine data across jurisdictions or time periods. Data regarding elections, <a href="http://open311.org/">311 calls</a>, census or <a href="https://developers.google.com/transit/gtfs">transit</a> information, for example, are all standardized.</p> <h3 id="question-8-can-you-link-the-data-to-another-dataset">Question 8. Can you link the data to another dataset?</h3> <p>When profiling your data, you need to make sure you understand all its relationships with other datasets. Is it isolated, or could it be combined with other internal or external data? What new insight can you build from it? How could you merge them? Do they share a common key? Basically, the goal here is to profile your data as it relates to other data.</p> <h3 id="question-9-was-the-data-published-under-a-license-and-if-so-which-one">Question 9. Was the data published under a license? And if so, which one?</h3> <p>Some organizations will choose to publish their data under a license, which then defines how the data should be collected, shared, and used. You need to be aware of these licenses and understand how they work. The most common are:</p> <h4 id="odc-public-domain-dedication-and-licence-pddl">ODC Public Domain Dedication and Licence (PDDL)</h4> <div style="clear:both;"> <img class="alignleft" src="/images/blog/share.svg" width="7%" /><img class="alignleft" src="/images/blog/remix.svg" width="7%" /><img class="alignleft" src="/images/blog/pd.svg" width="7%" /></div> <p>With <a href="https://www.opendatacommons.org/licenses/pddl/1-0/index.html">PDDL</a>, users can share, create, and adapt the document. 
There’s no restriction and the dataset is public domain.</p> <h4 id="open-data-commons-attribution-license-odc-by">Open Data Commons Attribution License (ODC-By)</h4> <div style="clear:both;"> <img class="alignleft" src="/images/blog/share.svg" width="7%" /><img class="alignleft" src="/images/blog/remix.svg" width="7%" /><img class="alignleft" src="/images/blog/by.svg" width="7%" /></div> <p>With <a href="https://opendatacommons.org/licenses/by/index.html">ODC-By</a>, users can share, create, and adapt the document. The only restriction is attribution, which means that users need to cite the source.</p> <h4 id="open-data-commons-open-database-license-odc-odbl">Open Data Commons Open Database License (ODC-ODbL)</h4> <div style="clear:both;"> <img class="alignleft" src="/images/blog/share.svg" width="7%" /><img class="alignleft" src="/images/blog/remix.svg" width="7%" /><img class="alignleft" src="/images/blog/by.svg" width="7%" /><img class="alignleft" src="/images/blog/sa.svg" width="7%" /></div> <p>With <a href="https://opendatacommons.org/licenses/odbl/index.html">ODC-ODbL</a>, users can share, create, and adapt the document, but they need to cite the sources and share under the same license.</p> <h4 id="custom-licenses">Custom licenses</h4> <div style="clear:both;"> <img class="alignleft" src="/images/blog/custom-license.png" width="7%" /></div> <p>Unfortunately, custom licenses are extremely popular. They require you to read them, understand them, and make sure you respect their specific requirements. This could impact how you can collect and transform your data, as well as how you can use it.</p> <h3 id="question-10-are-there-data-privacy-issues">Question 10. Are there data privacy issues?</h3> <p>Privacy is protected differently from one jurisdiction to the next. Important differences even exist between Canada, the United States, and Europe. You need to know if your datasets contain Personally Identifiable Information (PII) or if it would be possible to <a href="https://georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017/">re-identify individuals based on anonymized data</a>. This is especially true with <a href="https://www.ncbi.nlm.nih.gov/books/NBK208613/">healthcare data</a>.</p> <div style="text-align:center;"> <a href="/blog/download-10-questions-for-new-data/" class="button special">Download our form to make your own analysis</a></div> <p><br /></p> <h3 id="and-with-that">AND WITH THAT</h3> <p>Knowing what data you need is not enough to start a data extraction project. More than “what,” you need to know “who” your data is. Knowing your data is the only way to know for sure you’re using the right tool, on the right schedule, with the right script, to get the right data and transform it correctly.</p> <p>This list of ten questions should be your first step in defining if a dataset is worth all the effort you’re ready to put into it. Data projects are complex projects on their own and they require that you plan them well.</p> <p>Choosing your sources is only the first step of a long story. 
Depending on your sources and needs (are you dealing with unfriendly <a href="/expertise/pdf-extraction/">formats like PDFs?</a>), you’ll need to define the best tools for <a href="/expertise/web-scraping/">web scraping</a>, the best way to <a href="/blog/how-to-maintain-data-quality/">maintain data quality</a> throughout the whole process, the best way to <a href="/blog/14-rules-for-successful-ETL/">build a solid ETL process</a>, and the <a href="/expertise/design-architecture/">best architecture for data extraction processes.</a></p> <div> <section class="special"> <p> <section id="contact-cta" class="rp-cta wrapper style1"> <div class="inner"> <div class="rp-cta--text">Got a project or idea in mind? <br />We have the experts to make it happen. </div> <a href="/contact/" class="rp-cta--button button special">Tell Us About It</a> </div> </section> </p> </section> </div>martinData extraction projects are complex and often require quite a lot of time and effort. To make sure your organization is creating value and that your money and your time are well spent, the first logical step is to choose your sources carefully. To help you achieve just that, we create a list of 10 questions you need to ask before you set your sights on a dataset. The goal here is to collect and analyze all the data existing information in order to clarify its ownership, publication, structure, content, quality, relationship, etc. Only by going through this process can you guarantee the suitability of your sources and identify potential problems and particularities.PDF extraction - Everything you need to know2020-05-18T00:00:00-04:002020-05-18T00:00:00-04:00https://refinepro.com/blog/PDF-extraction-Everything-you-need-to-know<div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/1505621234.jpg" width="100%" /></div> <p>Our team here at RefinePro has deep experience doing research and development in data processing and automation. And PDF extraction is one of the many services we offer.</p> <p>But before we go into too much detail … <strong>what exactly do people need PDF extraction for?</strong></p> <p>Portable Document Format, or PDF, is a standardized file format. It allows users to distribute read-only documents that will present the same text and images independently of the hardware, software, or operating system used to open them (Mac, Windows, Linux, iPhone, Android, and others). PDF documents may contain a wide variety of information other than text and graphics, such as interactive elements (annotations and editable fields), structural elements, media, and various other content formats.</p> <p>In today’s work environment, PDF is often the go-to solution for exchanging business data. Suppliers, for example, mostly prefer PDF to create their price lists and catalogues and to exchange invoices, purchase orders, reports, etc. So, whether you’re trying to gather a large volume of <strong>data on a specific subject</strong> in your field of research or just trying to extract a list of items and prices for your <strong>eCommerce website</strong>, you need to find a way to convert information contained in PDF documents into usable structured data.</p> <p>And let’s be honest, nobody wants to (or can!) go through dozens or even hundreds of documents manually.</p> <p>PDF documents are easy to read for humans, but they rarely contain any machine-readable data. Their format varies considerably from one file to another, depending on how it was generated. 
If you’re lucky, the document you’re extracting your data from is in text format, with numbers organized neatly in tables. But if you’re not lucky, the information is embedded in an image. In that case, you’ll need to use Optical Character Recognition (OCR) to help you get the data.</p> <p>Accessing a massive amount of information stored in PDFs and converting it can then be a burdensome task. Luckily, PDF data extraction offers solutions to automate this task and automatically convert messy information into structured and usable data. And PDF extraction projects are no news for us. We invested in some of the proven technologies, and we are always testing out new software to make sure we help you build the data extraction project you need to meet your goals.</p> <h3 id="1-pdf-extraction-how">1. PDF EXTRACTION: HOW?</h3> <h4 id="11-the-right-tool-for-your-project">1.1 THE RIGHT TOOL FOR YOUR PROJECT</h4> <p>There are a lot of different systems out there to help you set a solid PDF extraction project. For business analysts, it’s often easier to go with “What You See Is What You Get” interfaces (<strong>WYSIWYG</strong>) like <a href="https://docparser.com/?ref=roqts">DocParser</a>. These systems tend to be more expensive, but they are easy to use and set, and they work well with high volume of easy cases. For entry-level programmers, some solutions offer more flexibility and low code complexity, which makes it easier to support exceptions for complex files. However, they still require programming knowledge and expertise on data extraction project as a whole. They usually run on <strong>JAVA</strong> or <strong>Python</strong>.</p> <p>The advantage of working with a partner like RefinePro is that thanks to our years of experience, we can help you <strong>select the technology that will best answer your requirements.</strong> We listed the four categories you should keep in mind.</p> <h4 id="12-assessing-your-needs">1.2 ASSESSING YOUR NEEDS</h4> <p><strong>Your business and legal requirements:</strong> You should ask yourself:</p> <ul> <li>Are you working with sensitive data? What privacy laws do you need to comply with?</li> <li>Do you want to use non-open source technology?</li> <li>What level of dependency do you want or can have on a service or technology provider?</li> </ul> <p><strong>The connectivity to your systems:</strong> This includes the method used to send and receive the PDFs with your systems (e.g. via an API, a database connection, or other) and if you want to process files in batch or on-demand as they are collected?</p> <p><strong>The volume of data:</strong> including how many are your processing per day? How many different layouts? What are the data validation rules (schema, business rules, etc.); and what happens when the validation job rejects data (the review process).</p> <p><strong>Your Resources:</strong> Who will monitor your PDF extraction project? What type of skills (and training) do they need? What kind of medium- and long-term support do you need?</p> <h3 id="2-refinepros-pdf-extraction-subsystems">2. REFINEPRO’S PDF EXTRACTION SUBSYSTEMS</h3> <p>Over the years, we have developed an extraction architecture that relies on a set of best practices and proven engineered patterns. We recommend <a href="/blog/divide-and-conquer-your-data-project/">decoupling your steps</a> to make troubleshooting easier. 
PDF extraction should follow four steps: data collection, data normalization, data validation, and delivery.</p> <p>These steps are part of an architecture in which ingestion and normalization of each PDF document are divided into three subsystems.</p> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/pdfprocess.png" width="100%" /></div> <h4 id="21-subsystem-one-collection-and-normalization">2.1 Subsystem One. Collection and Normalization.</h4> <p>In the first part, we bring together collection and normalization. All the different formats of data collected are morphed into a standard schema, which is the set of validation rules you implemented to define what “good” data is. To do so, the developer writes one PDF extraction and one normalization script per PDF layout. In other words, different scripts are used depending on the outlines, style, and logical component content of the PDF. This way, one script will extract data from documents matching the same layout—the same logical structure—to then transform it into a usable format for your team.</p> <p>In this script, the developer will add all the exceptions related to a specific PDF layout so that each file format can be processed independently. This way, if one script returns an error, it only affects one layout and not the entire project, making troubleshooting easier.</p> <p>For the normalization step, more specifically, <a href="/toolbox/openrefine/">OpenRefine</a> is a great tool if you want to build a fully WYSIWYG solution (something we can help you with). On the other hand, <a href="/toolbox/talend/">Talend Open Studio</a> is perfect if you want to outsource the work to entry-level programmers. We can also <a href="/offering/training/">train your team</a> and help you launch your first project!</p> <h4 id="22-subsystem-two-validation-and-delivery-or-the-delivery-of-quality-data">2.2 Subsystem Two. Validation and delivery (or the delivery of quality data)</h4> <p>During the second part, which includes validation and delivery, we leverage a unified schema. We only need one validation and one delivery script for all PDF layouts. The data is validated using the schema to ensure compliance with your business rules before it is delivered into your system.</p> <p>During validation, we define and document the schema, namely the elements that make data “good.” As such, a validation error occurs when extracted data doesn’t pass the validation rules established for the project. This corruption can come from a bug in the workflow, or changes in the data sources.</p> <p>This step is particularly important. When we develop a PDF extraction project script, one of the priorities is to create a validation script to ensure we do not over-engineer data quality. We need to ensure that the validation steps fail as early as possible to avoid corrupting downstream systems.</p> <h4 id="23-subsystem-three-scheduling-monitoring-and-maintaining">2.3 Subsystem Three. Scheduling, Monitoring, and Maintaining</h4> <p>The third part is the use of an infrastructure or platform to execute, schedule, configure, and monitor the scripts themselves to ensure they keep delivering reliable data. Most importantly, data quality will need to be monitored thoroughly.</p> <h3 id="what-about-ai">WHAT ABOUT AI?</h3> <p>Artificial Intelligence is the new kid on the block. Everyone knows it, everybody wants to use it, many people claim to have mastered it, but few people actually offer it. 
In PDF extraction, more specifically, we have seen a lot of promising development, but we’re not there yet. AI can be used for very narrow use cases. Instead of trying to find the next shiny object, we recommend sticking to well-proven and tested solutions that will help you get the results you’re looking for. Be sure, however, that our team is keeping a close eye on all the new technologies out there. Don’t hesitate to contact us [email protected] if you’d like an independent assessment on a specific software.</p> <h3 id="and-with-that">AND WITH THAT</h3> <p>Here at RefinePro, we provide data strategy, system architecture, implementation, and outsourcing services to help organizations scale and automate data acquisition and transformation workflows. Whether you decide to work with us, with another service provider, or even on your own, you’ll need to make sure to select the right tools (and not just the PDF extracting tool: database, servers, data processing framework, etc.) and set up your processes to meet your data quality requirements while minimizing the maintenance efforts.</p> <p>For years, we have helped clients define what system and process to put in place to ensure their needs are answered in the most time- and cost-efficient manner. So, before you throw yourself on Google or your in-house expertise to develop a complex data extraction project, contact us!</p> <div> <section class="special"> <p> <section id="contact-cta" class="rp-cta wrapper style1"> <div class="inner"> <div class="rp-cta--text">Got a project or idea in mind? <br />We have the experts to make it happen. </div> <a href="/contact/" class="rp-cta--button button special">Tell Us About It</a> </div> </section> </p> </section> </div>martinOur team here at RefinePro has a deep experience doing research and development in data processing and automation. And PDF extraction is one of the many services we offer.How to divide and conquer your data project for success2020-05-17T00:00:00-04:002020-05-17T00:00:00-04:00https://refinepro.com/blog/divide-and-conquer-your-data-project<div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/1257325042.jpg" width="100%" /></div> <p>Data extraction is now one of the most efficient ways for companies to stay up to date with current events and trends, but also to position themselves in their field. But for a lot of small entrepreneurs and even larger companies, the implementation of data extraction projects presents new challenges: How should these processes be implemented, and by whom?</p> <p>Web Scraping is known as the process by which data is extracted from different sources and then transformed into usable information. As such, a huge part of any web scraping project relies on a strong Extraction, Transformation, and Loading process, known as ETL. But building a solid ETL architecture for your web scraping requires a lot of technical know-how, combined with the knowledge necessary to adapt these “easy-to-use” tools to your specific needs. Most importantly, your project will also rely on many other crucial processes, including data quality management and administrative procedures.</p> <p>In this article, we explain <strong>why all your different data extraction processes should be decoupled for a more seamless workflow.</strong> That might sound counterintuitive… But the idea is as old as the world: divide and conquer (even algorithms understand). 
Divide your script, divide your tasks, find solutions to sub-problems instead of facing a major crash, and get reliable end results, without draining your economic and human resources.</p> <h3 id="1-what-is-etl">1. WHAT IS ETL?</h3> <p>During ETL, data is copied from pre-defined sources before reaching you in a format that makes it usable. An ETL developer can help you build an architecture that will support the ETL process of your project.</p> <p>Why an “ETL” developer? A developer creating a robust data transformation process works at the crossroads of different fields and executes functions as diverse as database analysis, system integration, and data transformation development. He or she must ensure that every aspect of the data life cycle has been addressed to ensure its operability and maintainability.</p> <h3 id="2-the-extraction-transformation-and-loading-behind-etl">2. THE EXTRACTION, TRANSFORMATION, AND LOADING BEHIND ETL</h3> <p>The three main steps of ETL will come as no surprise: extraction, transformation, and loading. But as one might suspect, each of these steps hides a lot of sub-steps that need to be considered. Data going through an ETL process will undergo different stages in its journey. We will now examine how these stages integrate within the three main steps of ETL. And keep in mind our <em>“divide to conquer”</em> motto: every step of your ETL has its own logic and should be considered separately.</p> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/dataflow.jpg" width="100%" /></div> <h3 id="21-discovery-and-curation">2.1 Discovery and curation</h3> <p>Before you sit down with your developer and start coding, you must define precisely what sources you want to extract your data from and how you’re going to document all the relevant information (including the owner, the availability, the updates, etc.). At the end,</p> <ul> <li>You know who published your data, when, and how;</li> <li>You know what your data looks like (its size, format, relations to other data), and;</li> <li>You have a map of your data, from the source to your database.</li> </ul> <p>Once discovery and curation are done, you can now jump to step two, data collection.</p> <h3 id="22-data-collection-extract">2.2 Data collection (Extract)</h3> <p>At this substep, your focus should be on getting the data out of its original format. You will need to select the right data extraction technology, whether it is to get data from <a href="/expertise/migration-integration/">another database</a>, XLS files, a <a href="/expertise/web-scraping/">website</a> or a <a href="/expertise/pdf-extraction/">PDF document</a>. Once it has been extracted, data can be moved to a landing database in your dedicated data transformation environment. It’s from this new environment, entirely under your control, that you can start reviewing and troubleshooting the data.</p> <h3 id="23-normalization-and-validation-transform">2.3 Normalization and validation (Transform)</h3> <p>Now that you’ve extracted your data, you need to transform it. During normalization and validation, the messy data you obtained is prepared to match the format of your target system, whether it’s a data warehouse or your eCommerce website. Before you start developing a transformation process, though, make sure you know what the <a href="/blog/14-rules-for-successful-ETL">best practices</a> are and which ones you need to implement (and ignore).</p> <p>But remember! 
Divide to conquer: data extraction (step 1) should be decoupled from the transformation (step 2). Why?</p> <ul> <li>It makes debugging and restartability easier by segmenting more precisely the journey of your data (see our <a href="/blog/how-to-maintain-data-quality/">article on data quality</a>).</li> <li>It gives your (ETL) developer the possibility to select the best tool for each job. This way, your web scraper will do its job of scraping (which is already a <a href="/blog/schedule-maintain-web-scraper/">complex job in itself</a>) and your data cleansing tool will do its job of cleaning.</li> </ul> <h3 id="24-enrichment-and-processing-transform">2.4 Enrichment and processing (Transform)</h3> <p>Enrichment and processing steps add value to your data by connecting it with other datasets, such as your own proprietary data (like your customer or product information), for example, or other collected datasets. At this stage, you add your business logic, your secret sauce, so that your team can read and make sense of the data. By adding your business logic to messy data, you give yourself the possibility to develop real business intelligence. And that’s when data extraction really becomes interesting.</p> <p>Again, decoupling is important here. You should have a single processing script for all your data sources. It is a good time to create historical values by comparing records, a process often referred to as <a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension">slowly changing dimensions</a> (mainly the ability to keep track of unpredictable changes). Building historical information from extracted data gives you an edge (if not an unfair advantage) in understanding your market and industry. You could, for example, track the price of an item over time to predict when it will be on sale or out of stock.</p> <h3 id="25-delivery-and-consumption-load">2.5 Delivery and consumption (Load)</h3> <p>Your data is now ready to be used. But in order to do so, you need to move it from your external database into a data warehouse available to your team. During this last substep, we read data from the staging or landing table and insert it into your system. We can push it into a warehouse, upload it directly into your operational system (like an eCommerce website), or make it available to your organization via a custom API.</p> <h3 id="26-administration-making-it-all-work-together">2.6 Administration, making it all work together</h3> <p>It’s the long-forgotten process, but still a crucial one. The administration will orchestrate the many processes we previously covered. A well-administered project will manage dependencies between steps, and make sure everything happens in the right order. Your main tools, including your web scraper, will need to be well <a href="/blog/schedule-maintain-web-scraper/">scheduled, maintained, and monitored</a>. It will also be crucial that you implement practices such as <a href="/blog/14-rules-for-successful-ETL">logging, code management, configuration, and project management</a>.</p> <p>Data discovery and curation, data collection, normalization and validation, enrichment and processing, delivery and consumption, and administration: these are the multiple layers you find when you start scratching the varnish of a data extraction project. They all contribute to the overall stability, but their uniqueness is also what makes the whole structure stronger. 
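</p> <p>As a rough illustration of the decoupling described above, here is a minimal Python sketch. It assumes pandas is available, and the file names, source URL, and column names are invented for the example; a real pipeline would use its own landing tables and schema. Each stage reads from and writes to its own landing file, so it can be developed, tested, and re-run in isolation:</p> <pre><code class="language-python">
"""Illustrative sketch only: one function per ETL stage, each one
re-runnable on its own. Paths, the source URL, and columns are made up."""

import pandas as pd


def collect(source_url: str) -> None:
    """Extract: pull the raw data and park it, untouched, in a landing file."""
    raw = pd.read_csv(source_url)
    raw.to_csv("landing_raw.csv", index=False)


def normalize() -> None:
    """Transform: clean the landed data without ever touching the source again."""
    raw = pd.read_csv("landing_raw.csv")
    clean = raw.rename(columns=str.lower).dropna(subset=["price"])
    clean.to_csv("staging_clean.csv", index=False)


def load() -> None:
    """Load: push the validated data to the target system (a flat file here)."""
    clean = pd.read_csv("staging_clean.csv")
    clean.to_csv("warehouse_products.csv", index=False)


if __name__ == "__main__":
    # Each step can be scheduled and retried independently: a failure in
    # normalize() does not force a new extraction from the source.
    collect("https://example.com/products.csv")  # hypothetical source
    normalize()
    load()
</code></pre> <p>The point is not the specific tools but the boundaries: because every stage lands its output, a failure in one layer can be fixed and replayed without re-running the others. 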
With a well-built architecture, the whole system is not affected if one layer is posing a problem or needs to be upgraded for any technical or business reason.</p> <h3 id="3-and-with-that">3. AND WITH THAT</h3> <p>Obviously, a complex process such as this can’t be built in a day. It takes trial and error to select the best technology and processes, and also to confirm that you can build value from the data. Every data project has its own specificities. Your organization may have a short-term, extremely precise need for data extraction, but most companies want data that support their business strategy, one they can rely on in the long term to make important decisions. This is why we recommend following an <a href="/blog/agile-data-process/">agile data transformation process</a>. We suggest developing complex data products by starting small with a couple of sources before scaling it to a robust business-wide data factory.</p> <p>Over the years, we’ve built a strong experience in developing and managing these different processes. Our clients rely on us to manage every aspect, layer, and step of their data collection project so they can focus on building lasting insight and product. By doing so, they know they’re paying the right price for their data, and that they’re not overwhelming their development team.</p> <p>Depending on your team know-how level, your needs will also change. We can deliver custom training for your team before they throw themselves into web scraping, help you build your project from scratch, or even assist you after its implementation. We offer <a href="/offering/training/">training and mentoring</a> as well as <a href="/offering/team-augmentation/">team augmentation</a>, and <a href="/offering/team-platform/">data-first application development</a>.</p> <div> <section class="special"> <p> <section id="contact-cta" class="rp-cta wrapper style1"> <div class="inner"> <div class="rp-cta--text">Got a project or idea in mind? <br />We have the experts to make it happen. </div> <a href="/contact/" class="rp-cta--button button special">Tell Us About It</a> </div> </section> </p> </section> </div>martinData extraction is now one of the most efficient ways for companies to stay up to date with current events and trends, but also to position themselves in their field. But for a lot of small entrepreneurs and even larger companies, the implementation of data extraction projects presents new challenges: How should these processes be implemented, and by whom?14 rules to succeed with your ETL project2020-05-15T00:00:00-04:002020-05-15T00:00:00-04:00https://refinepro.com/blog/14-rules-to-succeed-with-your-ETL-project<div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/1238032837.jpg" width="100%" /></div> <p>Extracting, transforming, and loading (ETL) data is a complex process at the center of most organizations’ data extraction projects. As we saw in our article on <a href="/blog/divide-and-conquer-your-data-project">web scraping and ETL</a>, the implementation of an ETL workflow is a process that requires a lot of in-depth knowledge in several subfields of statistics and programming.</p> <p>ETL developers thus work at the crossroad of different fields. They must ensure that every aspect of the data life cycle has been addressed to ensure its operability and maintainability. 
To help your developer navigate the deep and dark waters of ETL, we’ve drawn on our years of experience to create a list of ETL principles and best practices.</p> <p>What you see here is not meant to be a grocery list; these guidelines need to be considered, and then implemented or rejected. Your developer will draw on their understanding of the project and their experience to decide which principles are needed, when, and to what extent.</p> <h3 id="ten-best-practices-for-etl-workflow-implementation">TEN BEST PRACTICES FOR ETL WORKFLOW IMPLEMENTATION</h3> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/16.jpg" width="10%" /> <h4> 1. Modularity</h4> <p>Modularity is the practice of writing reusable code structures to help you keep your jobs consistent in size and functionality. With modularity, your project structure is easier to understand, making troubleshooting easier too. The ultimate goal is to improve job readability and maintainability by avoiding the need to write the same code over and over again.</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/14.jpg" width="10%" /> <h4> 2. Atomicity</h4> <p>Atomicity is used to break down complex jobs into independent and more understandable parts. The workflow is divided into distinct units of work: small, individually executable processes. Each of these parts can be run separately, which makes testing and troubleshooting easier since the developer doesn’t need to run a long-running process to debug a single operation.</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/19.jpg" width="10%" /> <h4> 3. Change Detection and Increment</h4> <p>A change detection strategy detects differences and allows incremental data loading. This means that only records changed since the last update are brought into the ETL process, avoiding unnecessary transformations.</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/17.jpg" width="10%" /> <h4> 4. Scalability</h4> <p>Your ETL process should be implemented in a way that ensures its scalability, so your project can adapt to a growing volume of data. This way, you don’t have to redesign the project at every new stage of your growth, saving you time and money.</p> </div> <p>Another important aspect of any ETL workflow implementation is data quality and error management. We have an article explaining in detail <a href="/blog/how-to-maintain-data-quality/">how to guarantee the quality of the data</a> loaded in your databases, and how to deal with validation errors. So, we won’t go into too much detail in this article, but here’s a list of principles to consider during implementation:</p> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/12.jpg" width="10%" /> <h4> 5. Error Detection and Data Validation</h4> <p>During validation, the data being extracted is checked against a predefined profile. This profile represents what “good” data looks like for your project. The goal is to check data as early as possible to limit computing time and avoid processing data that will be rejected later on. It also makes recovery easier, as errors are detected early in the process.</p> </div>
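<p>As an illustration of this early validation step, here is a minimal sketch in Python. The profile and the field names (<code>sku</code>, <code>price</code>, <code>currency</code>) are hypothetical examples; the point is simply to reject bad records before they reach the rest of the ETL process:</p> <pre><code class="language-python"># A minimal sketch of early validation: each incoming record is checked
# against a predefined profile before it enters the rest of the ETL process.
# The profile and field names are hypothetical examples.

PROFILE = {
    "sku": str,
    "price": float,
    "currency": str,
}

def validate(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, expected_type in PROFILE.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems

def split_valid_invalid(records):
    valid, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))  # kept for error reporting
        else:
            valid.append(record)
    return valid, rejected

# Hypothetical batch of freshly extracted records.
valid, rejected = split_valid_invalid([
    {"sku": "A-100", "price": 19.99, "currency": "CAD"},
    {"sku": "A-101", "price": "n/a", "currency": "CAD"},
])
print(len(valid), "valid record(s),", len(rejected), "rejected")
</code></pre> <p>Keeping the rejected records together with the reasons they failed makes error reporting, and later recovery, much easier.</p>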
<div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/3.jpg" width="10%" /> <h4> 6. Recovery and Restartability</h4> <p>Recovery and restartability address the capability of the workflow to resume after an error. They include the process by which the data stays in a stable state following an error, which requires database backups as well as commit and rollback features. When we commit, we make a set of changes to the data permanent. Rollback, on the other hand, is the capability to return the database to the stable state it was in before the failed changes. Both are used to manage the workflow and its errors, in combination with another practice known as idempotence.</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/20.jpg" width="10%" /> <h4> 7. Idempotence</h4> <p>An operation is idempotent when it gives the same result whether it is called once or multiple times. In real life, the best example would be the elevator button: you get the same result whether you push it once or fifteen times. How can idempotence be relevant in data transformation, knowing that your data is always changing? Because sometimes, it isn’t. If a data source suddenly stops being updated, you still get the same results in your tables. The same applies if your own transformation deployments were to stop. With idempotent transformations, a failed ETL run can simply be replayed without duplicating or corrupting your data.</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/15.jpg" width="10%" /> <h4> 8. Data Lineage</h4> <p>Data lineage helps identify which ETL steps a specific data point went through, where it originated from, when it was loaded, and how it was transformed. It eases debugging and increases trust in the data by making the process transparent, thus validating the integrity of the end results. Thanks to lineage, we can guarantee the integrity of the data and of the process that extracted and loaded it into your database.</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/5.jpg" width="10%" /> <h4> 9. Auditing</h4> <p>Checking your logs for potential mistakes is not enough to ensure that your load was a success. Your system should be designed to check for errors and to support auditing of your primary metrics (like the number of rows processed).</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/19.jpg" width="10%" /> <h4> 10. Script Configuration</h4> <p>Configuration variables modify a workflow’s behaviour at execution time. They are stored separately from the job so you can modify them without editing and redeploying the scripts. Identifying the right variables is an integral part of any professional ETL job. For example, configuration variables contain parameters such as the server name and credentials, which should never be hardcoded in a job.</p> </div>
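<p>Here is a minimal sketch of what rule 10 can look like in practice, in Python. The file name (<code>etl_config.json</code>), its keys, and the <code>ETL_DB_PASSWORD</code> environment variable are hypothetical examples; the idea is simply that nothing environment-specific lives inside the script itself:</p> <pre><code class="language-python"># A minimal sketch of script configuration: connection details live outside
# the job, in a config file and in environment variables, so the script can
# move between environments without being edited. Names are hypothetical.
import json
import os

def load_config(path="etl_config.json"):
    """Read non-secret settings from a configuration file shipped with the job."""
    with open(path) as handle:
        config = json.load(handle)
    # Secrets such as passwords come from the environment, never from the
    # configuration file and never hardcoded in the script.
    config["db_password"] = os.environ["ETL_DB_PASSWORD"]
    return config

if __name__ == "__main__":
    config = load_config()
    print("connecting to", config["db_host"], "as", config["db_user"])
</code></pre> <p>Promoting the same job from a test environment to production then only means swapping the configuration file and the environment variables, never touching the code.</p>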
<h3 id="four-best-practices-for-etl-workflow-to-schedule-monitor-and-maintain">FOUR BEST PRACTICES TO SCHEDULE, MONITOR, AND MAINTAIN YOUR ETL WORKFLOW</h3> <p>Every data project includes an administrative component. We already covered <a href="/blog/schedule-maintain-web-scraper/">the scheduling, monitoring, and maintenance of web scrapers</a>. However, these three important tasks also need to be executed at the ETL process level. Here’s a list of principles that will help your developer manage the workflows more efficiently. Most ETL software comes with a server edition that provides these four features. If you are looking for a technology-agnostic (or cheaper) solution, <a href="/contact/">contact us</a> for more details.</p> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/11.jpg" width="10%" /> <h4> 11. Orchestration</h4> <p>Your developer will ensure that all the moving parts of your workflow come together to deal with the different nature, frequency, and cadence of your source data. This includes executing the different ETL modules and their dependencies in the right order, along with logging, scheduling, alert monitoring, and managing code and data storage. Orchestration demands a high level of know-how, but also access to the right resources. You should never hesitate to ask for the services of an expert like us to help you implement your project.</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/8.jpg" width="10%" /> <h4> 12. Metadata management</h4> <p>Metadata is basically data about your data. It holds all kinds of information describing your ETL workflow: where the data comes from, how many data points it contains, the data extraction strategy, and so on. Most importantly, a well-designed metadata system keeps a record of every execution, including its status, the extraction and transformation methods used, the changes in source systems, etc. Thanks to this metadata, your developer can keep track of all these changes over several months or even years, and your team will have everything it needs to analyze the system more efficiently.</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/18.jpg" width="10%" /> <h4> 13. Logging</h4> <p>Every step of an ETL project must be logged using a central logging component. Relevant events are then recorded whether they happen before, during, or after extraction, transformation, or loading. (A short example follows rule 14 below.)</p> </div> <div style="clear:both; min-height:250px"> <img class="alignleft" src="/images/blog/14rules/1.jpg" width="10%" /> <h4> 14. Code Management and Storage</h4> <p>You should always keep track of your code and store it somewhere safe. There are three main reasons for this. First, keeping track of your code versions lets you come back and restore your scripts after a bug or an error. Second, code management and storage enable collaboration between members of your team and other external collaborators. And third, your code needs to be kept separate from the execution environment.</p> </div>
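<p>To illustrate the central logging component from rule 13, here is a minimal sketch in Python where every module of a hypothetical workflow logs through one shared configuration, tagged with a run identifier:</p> <pre><code class="language-python"># A minimal sketch of central logging: one shared configuration, and every
# step of the workflow logs through it. Step names, row counts, and the run
# identifier are hypothetical examples.
import logging

def configure_logging(run_id):
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s " + run_id + " %(name)s %(levelname)s %(message)s",
    )

def extract():
    log = logging.getLogger("extract")
    log.info("starting extraction")
    log.info("extracted %s rows", 1250)

def load():
    log = logging.getLogger("load")
    log.info("loaded %s rows into the staging table", 1250)

if __name__ == "__main__":
    configure_logging(run_id="2020-05-15-001")
    extract()
    load()
</code></pre> <p>Because every step logs through the same component and carries the same run identifier, the events of a single execution can be traced from extraction all the way to the load.</p>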
<h3 id="and-with-that">And with that</h3> <p>With those fourteen best practices, you have everything you need to make sure your ETL workflow fits your organization’s needs. Whether your plan is to grow, to discover new markets, to learn more about your competitors, to develop a business plan, or to introduce a new product to the market, data should always be the cornerstone of your analysis. Selecting the best ETL tool is only the first step toward ensuring your end data is reliable and relevant to your goals. Implementing and maintaining these tools, and the process that binds them, is paramount to your project’s success.</p> <p>This article only scratches the surface of ETL design principles and best practices. Your developer will need to know which ones to apply, when to implement them, and to what extent, balancing the robustness of the data pipeline against its development cost. These principles and guidelines, implemented at the right moment and with the right goal in mind, will not only guarantee the quality of your data but also help you manage an already complex process with more ease and fewer headaches.</p> <p>At RefinePro, we have been helping clients implement ETL projects for years and, in doing so, have developed a deep understanding of their many internal mechanisms. We rely on these best practices to guarantee that our ETL workflows answer all of our clients’ needs.</p> <div> <section class="special"> <p> <section id="contact-cta" class="rp-cta wrapper style1"> <div class="inner"> <div class="rp-cta--text">Got a project or idea in mind? <br />We have the experts to make it happen. </div> <a href="/contact/" class="rp-cta--button button special">Tell Us About It</a> </div> </section> </p> </section> </div>martinExtracting, transforming, and loading (ETL) data is a complex process at the center of most organizations’ data extraction projects. As we saw in our article on web scraping and ETL, the implementation of an ETL workflow is a process that requires a lot of in-depth knowledge in several subfields of statistics and programming.