I hope this message finds you well. I am currently studying your paper and have come across a couple of points that I would greatly appreciate some clarification on. Your insights would be invaluable to my understanding of the work. Here are my questions:
- In the paper, you observe four distinct patterns in the per-token loss trajectories during training: H->H, L->L, H->L, and L->H, and note that all categories except L->L exhibit a higher average loss. To address this, the paper introduces SLM, which uses a reference model to filter out tokens with higher loss. However, I am curious how this approach specifically targets and removes the noisy tokens depicted in Figure 2. Intuitively, one might expect noisy tokens to have a higher loss and thus be more likely to be selected by the reference model. Could you elaborate on the mechanism by which SLM effectively eliminates these noisy tokens?
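To make my current reading concrete, here is a minimal sketch in plain Python of how I understand the selection rule (the function name `slm_select` and the `keep_ratio` parameter are my own illustration, not from the paper): score each token by its excess loss, i.e., the training model's loss minus the reference model's loss, and keep only the top-scoring fraction for the training objective.

```python
def slm_select(loss_current, loss_ref, keep_ratio=0.6):
    """Sketch of my reading of SLM token selection.

    loss_current: per-token losses under the model being trained.
    loss_ref:     per-token losses under the reference model.
    Returns a boolean mask over tokens; True = token kept for training.
    """
    # Score each token by excess loss: current loss minus reference loss.
    excess = [c - r for c, r in zip(loss_current, loss_ref)]
    # Keep the top keep_ratio fraction of tokens by excess loss.
    k = max(1, int(keep_ratio * len(excess)))
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(excess))]


# Hypothetical per-token losses: token 0 is "noisy" (hard for both models,
# so high absolute loss but small excess loss); token 2 is one the reference
# model finds easy but the current model does not (large excess loss).
mask = slm_select([3.0, 0.5, 2.0, 1.0], [2.8, 0.4, 0.5, 0.9], keep_ratio=0.5)
```

Under this reading, a noisy token that is hard for both models would have a high absolute loss but a small excess loss, and would therefore be dropped; is that the intended mechanism, or does the filtering rely on the reference loss alone?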
- For continual pre-training, I am wondering whether it is feasible to use the pretrained model itself as the reference model. What would be the potential implications or limitations of such an approach?